1 Introduction

In the foundational paper [9], the authors introduced Riemannian optimization frameworks for Stiefel and Grassmann manifolds, where the Riemannian gradient and Hessian are given globally using a projection operator and a Christoffel function, a bilinear operator on the matrix space containing the Stiefel manifolds. The introduction of these global operators allows efficient computation when the manifolds are defined by constraints, or as quotients of constrained manifolds, as opposed to computation using local charts of an abstract manifold. These frameworks helped popularize the field of Riemannian optimization. Since then, similar frameworks have been introduced for many manifolds, with several software packages [8, 16, 26] implementing aspects of these frameworks. The Riemannian Hessian calculation is in general difficult. It could be computed using the calculus of variations, but "doing so is tedious" [9], so the detailed calculations for Stiefel manifolds were not included in that paper; a generalization and full derivation recently appeared in [11].

In this article, we attempt to address the following problem: given a manifold described by equality constraints in a vector space, or a quotient of such a manifold, together with a metric defined by an analytic formula, compute the Riemannian gradient and Hessian of a function on the manifold. Recall that the Riemannian Hessian is computed using the Levi-Civita connection. By computing, we mean a procedural, not necessarily closed-form approach: we are looking for a sequence of equations, operators, and expressions to solve and evaluate, rather than starting from a distance-minimizing problem. We believe that the approach we take, using a classical formula for projections together with an adaptation of the Christoffel symbol calculation to the ambient space, addresses the problem effectively for many manifolds encountered in applications. This method provides a very explicit and transparent procedure that we hope will be helpful to researchers in the field. Its main feature is that it can handle manifolds with non-constant embedded metrics, such as Stiefel manifolds with the canonical metric, or the manifold of positive-definite matrices with the affine-invariant metric. The method allows us to compute Riemannian gradients and Hessians for several new families of metrics on manifolds often encountered in applications, including optimization and machine learning. While the effect of changing metrics on first-order optimization methods has been considered previously, we hope this will lead to future work on adapting metrics to second-order methods. The approach is also suitable when the gradient formula is not in closed form. We also provide several useful identities known in special cases.

As an application of the method developed here, we give a short derivation of the gradient and Hessian for a family of metrics studied recently in [11], extending both the canonical and embedded metrics on the Stiefel manifolds with a closed-form geodesic formula. (We were unaware of [11], which was completed before we started this project in 2020.) We derive the Riemannian framework for the induced metrics on certain quotients of Stiefel manifolds, including flag manifolds. We also give complete metrics on the fixed-rank positive-semidefinite matrix manifolds, with efficiently computable gradient, Hessian, and geodesics.

When the metric on a submanifold of a vector space (called the ambient space) is induced from the Euclidean metric on the vector space, the Riemannian gradient and Hessian correspond to the projected gradient and the projected Hessian in the literature of constrained optimization [9, section 4.9], [10]. When the metric is given by an analytic expression (which is a metric when restricted to the constrained set, but not necessarily on the whole vector space), using [22, Lemma 4.3], we can compute the Riemannian Hessian by projecting the Riemannian Hessian of the ambient space, and [22, Lemma 7.45] gives us a similar computation for a quotient manifold. Representing the metric as an operator defined on the quotient/constrained manifold, we directly derive formulas for the Levi-Civita connection and the Riemannian Hessian that depend only on the operator values on the manifold, giving us a procedural and simple approach, as explained. This approach has also been advocated in [1]. However, its usage depends on an extension with a computable Levi-Civita connection in the ambient space. This last step seems to restrict its use; a tricky extension is sometimes needed, for example, in the case of the Grassmann manifold in [1]. We address this issue by working directly with the metric expression, giving an explicit formula for the connection in Eq. (3.10). This formula generalizes the connection formula using the Weingarten map in [2] for constant metrics.

Our framework works for both real and complex cases. The paper [15] gave the original treatment of the Hessian for the unitary/complex case. The case of Stiefel manifolds was studied in [9, 11]; we reprove the results using our framework and provide complete formulas for the Hessian. Optimization on flag manifolds was studied in [21, 28], but a second-order method had not been given in an efficient form: an expression for the Riemannian Hessian vector product was not given or implemented. We provide second-order methods for the flag manifolds with the full family of metrics of [11], including efficient formulas for the Riemannian Hessian.

The affine-invariant metric on positive-definite matrices has also been widely studied, for example in [12, 23,24,25]. There are numerous metrics on the fixed-rank positive-semidefinite (PSD) manifolds, among which we mention [7], which motivated our approach. Although working with the same product of Stiefel and positive-definite manifolds with the affine-invariant metric, that paper did not use the Riemannian submersion metric on the quotient and focused on first-order methods. We compute the Levi-Civita connection for second-order methods. In [13, 27], two different families of metrics on PSD manifolds are studied. They both require solving Lyapunov equations but have different behaviors on the positive-definite part. Articles [15, 17] discuss the effect of adapting metrics to optimization problems ([17] adapts ambient metrics to the objective function using first-order methods).

In the next section, we provide some background and summarize notations. In Sect. 3, we formulate and prove the main theoretical results of the paper. We then identify the adjoints of common operators on matrix spaces and apply the theory developed here to the manifolds discussed above. We then discuss numerical results and implementation. We conclude with a discussion of future directions.

2 Preliminaries

A first-order approximation of a function f on \({\mathbb {R}}^n\) relies on the computation of the gradient, and a second-order approximation relies on the Hessian matrix or a Hessian-vector product. When a function f is defined on a Riemannian manifold \({\mathcal {M}}\), the relevant quantities are the Riemannian gradient, which provides a first-order approximation for f, and the Riemannian Hessian, which provides the second-order term.

When a manifold \({\mathcal {M}}\) is embedded in an inner product space \({\mathcal {E}}\) with the inner product denoted by \(\langle \cdots \rangle _{{\mathcal {E}}}\), if we have a function \({{\textsf{g}}}\) from \({\mathcal {M}}\) with values in the set of positive-definite operators operating on \({\mathcal {E}}\), we can define an inner product of two vectors \(\omega _1, \omega _2\in {\mathcal {E}}\) by \(\langle \omega _1, {{\textsf{g}}}(Y)\omega _2\rangle _{{\mathcal {E}}}\) for \(Y\in {\mathcal {M}}\). This induces an inner product on each tangent space \(T_Y{\mathcal {M}}\) and hence a Riemannian metric on \({\mathcal {M}}\), assuming sufficient smoothness. In this setup, if f is extended to a function \(\hat{f}\) on an open neighborhood of \({\mathcal {M}}\subset {\mathcal {E}}\), then the Riemannian gradient relates to the gradient of \(\hat{f}\) through a projection to \(T_Y{\mathcal {M}}\). In the theory of generalized least squares (GLS) [4], it is well known that the projection to the nullspace of a full-rank matrix J in an inner product space equipped with a metric g (also represented by a matrix) is given by the formula \(I_{{\mathcal {E}}}-g^{-1}J^{{{\textsf{T}}}}(Jg^{-1}J^{{{\textsf{T}}}})^{-1}J\) (\(I_{{\mathcal {E}}}\) is the identity matrix/operator of \({\mathcal {E}}\)). If the tangent space is the nullspace of an operator \({{\,\textrm{J}\,}}\) (the Jacobian of a full-rank equality constraint), and an operator \({{\textsf{g}}}\) is used to describe the metric instead of the matrices J and g, we have a similar formula where the transposed matrix \(J^{{{\textsf{T}}}}\) is replaced by the adjoint operator \({{\,\textrm{J}\,}}^{{\mathfrak {t}}}\). This projection formula is not often used in the literature; the projection is usually derived directly by minimizing the distance to the tangent space. It turns out that when \({{\,\textrm{J}\,}}\) is given by a matrix equation, \({{\,\textrm{J}\,}}^{{\mathfrak {t}}}\) is simple to compute. For manifolds common in applications, \({{\,\textrm{J}\,}}{{\textsf{g}}}^{-1}{{\,\textrm{J}\,}}^{{\mathfrak {t}}}\) can often be inverted efficiently. Thus, this will be our main approach to computing the Riemannian gradient.
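To make the GLS projection formula concrete, the following is a minimal numerical sketch (our own toy data, not part of the paper's framework): for a random full-rank J and positive-definite g, we form \(I_{{\mathcal {E}}}-g^{-1}J^{{{\textsf{T}}}}(Jg^{-1}J^{{{\textsf{T}}}})^{-1}J\) and check its defining properties.

```python
import numpy as np

# Sketch of the GLS projection (toy data, n = 6, k = 2):
# Pi = I - g^{-1} J^T (J g^{-1} J^T)^{-1} J projects onto Null(J),
# orthogonally with respect to the inner product <u, v>_g = u^T g v.
rng = np.random.default_rng(0)
n, k = 6, 2
J = rng.standard_normal((k, n))               # full-rank constraint Jacobian
A = rng.standard_normal((n, n))
g = A @ A.T + n * np.eye(n)                   # positive-definite metric

g_inv_Jt = np.linalg.solve(g, J.T)            # g^{-1} J^T without forming g^{-1}
Pi = np.eye(n) - g_inv_Jt @ np.linalg.solve(J @ g_inv_Jt, J)

w = rng.standard_normal(n)
print(np.allclose(J @ (Pi @ w), 0))           # Pi w lies in Null(J)
print(np.allclose(Pi @ Pi, Pi))               # Pi is idempotent
print(np.allclose(g @ Pi, (g @ Pi).T))        # g Pi is self-adjoint
```

The last check anticipates the statement, proved below in Proposition 3.1, that \({{\textsf{g}}}{\varPi }_{{{\textsf{g}}}}\) is self-adjoint.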

The Levi-Civita connection of the manifold, which allows us to take covariant derivatives of the gradient, is used to compute the Riemannian Hessian. A vector field \(\xi \) in our context could be considered as an \({\mathcal {E}}\)-valued function on \({\mathcal {M}}\) such that \(\xi (Y)\in T_Y{\mathcal {M}}\) for all \(Y\in {\mathcal {M}}\). For two vector fields \(\xi , \eta \) on \({\mathcal {M}}\), the directional derivative \({\textrm{D}}_{\xi }\eta \) is an \({\mathcal {E}}\)-valued function but generally not a vector field (i.e. \(({\textrm{D}}_{\xi }\eta )(Y)\in T_Y{\mathcal {M}}\) may not hold). A covariant derivative [22] (or connection) associates a vector field \(\nabla _{\xi }\eta \) to two vector fields \(\xi , \eta \) on \({\mathcal {M}}\). The association is function-linear in \(\xi \), \({\mathbb {R}}\)-linear in \(\eta \), and satisfies the product rule

$$\begin{aligned} \nabla _{\xi }(f\eta ) = f\nabla _{\xi }\eta + ({\textrm{D}}_{\xi }f)\eta \end{aligned}$$

for a function f on \({\mathcal {M}}\), where \({\textrm{D}}_{\xi }f\) denotes the Lie derivative of f (the directional derivative of f along the direction \(\xi _x\) at each \(x\in {\mathcal {M}}\)). For a Riemannian metric \(\langle ,\rangle _R\) on \({\mathcal {M}}\), the Levi-Civita connection is the unique connection that is compatible with the metric, \({\textrm{D}}_{\xi }\langle \eta , \phi \rangle _R = \langle \nabla _{\xi } \eta ,\phi \rangle _R +\langle \eta , \nabla _{\xi }\phi \rangle _R\) (\(\phi \) is another vector field), and torsion-free, \(\nabla _{\xi }\eta - \nabla _{\eta }\xi = [\xi , \eta ]\). If a coordinate chart of \({\mathcal {M}}\) is identified with an open subset of \({\mathbb {R}}^n\) and \(\langle ,\rangle _R\) is given by a positive-definite operator \({{\textsf{g}}}_R\) (i.e. \(\langle \xi , \eta \rangle _R = \langle \xi ,{{\textsf{g}}}_R \eta \rangle _{{\mathbb {R}}^n}\)), then

$$\begin{aligned} \nabla _{\xi }\eta = {\textrm{D}}_{\xi }\eta + \frac{1}{2}{{\textsf{g}}}_R^{-1}(({\textrm{D}}_{\xi }{{\textsf{g}}}_R)\eta + ({\textrm{D}}_{\eta }{{\textsf{g}}}_R)\xi -{\mathcal {X}}(\xi , \eta )), \end{aligned}$$

where \({\mathcal {X}}(\xi , \eta )\in {\mathbb {R}}^n\) (uniquely defined) satisfies \(\langle ({\textrm{D}}_{\phi }{{\textsf{g}}}_R)\xi , \eta \rangle _{{\mathbb {R}}^n} = \langle \phi , {\mathcal {X}}(\xi , \eta )\rangle _{{\mathbb {R}}^n}\) for all vector fields \(\phi \). The formula is valid in each coordinate chart, and it is often given in terms of Christoffel symbols in index notation ([22], Proposition 3.13). We will generalize this operator formula. The gradient and \({\mathcal {X}}\) are examples of index-raising, translating a (multi)linear scalar function h into a (multi)linear vector-valued function of one less variable that evaluates back to h under the inner product pairing.
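For illustration, take the conformal metric \({{\textsf{g}}}_R(x) = (1+|x|^2)I\) on \({\mathbb {R}}^n\) (our own toy choice, not a metric from the paper); then \(({\textrm{D}}_v{{\textsf{g}}}_R)w = 2(x^{{{\textsf{T}}}}v)w\) and the index-raised term is \({\mathcal {X}}(\xi ,\eta ) = 2(\xi ^{{{\textsf{T}}}}\eta )x\), so the chart formula for the connection can be evaluated directly:

```python
import numpy as np

# Sketch of the chart formula for the Levi-Civita connection (toy metric):
# g_R(x) = (1 + |x|^2) I, so (D_v g_R) w = 2 (x.v) w and X(xi, eta) = 2 (xi.eta) x.
def levi_civita(x, xi, eta_field, eps=1e-6):
    eta = eta_field(x)
    # directional derivative D_xi eta, approximated by central differences
    d_xi_eta = (eta_field(x + eps * xi) - eta_field(x - eps * xi)) / (2 * eps)
    # (1/2)((D_xi g) eta + (D_eta g) xi - X(xi, eta)) in closed form
    K = (x @ xi) * eta + (x @ eta) * xi - (xi @ eta) * x
    return d_xi_eta + K / (1.0 + x @ x)   # add g_R^{-1} K

x = np.array([0.3, -0.1, 0.5])
xi = np.array([1.0, 0.0, 0.0])
eta_field = lambda y: np.array([y[1], -y[0], 1.0])
print(levi_civita(x, xi, eta_field))
```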

The Riemannian Hessian could be provided in two formats: as a bilinear form \({\textsf{rhess}}_f^{02}(\xi , \eta )\), assigning a scalar function to every pair of vector fields \(\xi , \eta \) on the manifold, or as a (Riemannian) Hessian vector product \({\textsf{rhess}}^{11}_f\xi \), an operator returning a vector field given a vector field input \(\xi \). In optimization, as we need to invert the Hessian in second-order methods, the Riemannian Hessian vector product form \({\textsf{rhess}}_f^{11}\) is more practical. However, \({\textsf{rhess}}^{02}_f\) is directly related to the Levi-Civita connection (see Eq. (3.12) below) and can be read from the geodesic equation: in [9], the authors showed the geodesic equation (for a Stiefel manifold) is given by \(\ddot{Y} + {\varGamma }(\dot{Y}, \dot{Y})=0\), where the Christoffel function \({\varGamma }\) (defined below) maps two vector fields to an ambient function, and the bilinear form \({\textsf{rhess}}^{02}_f\) is \(\hat{f}_{YY}(\xi , \eta ) - \langle {\varGamma }(\xi , \eta ), \hat{f}_{Y}\rangle _{{\mathcal {E}}}\). Here, \(\hat{f}_{Y}\) and \(\hat{f}_{YY}\) are the ambient gradient and Hessian; see Sect. 3.
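As a standard illustration of the geodesic equation (the unit sphere, stated here for concreteness rather than taken from the paper), the Christoffel function for the Euclidean metric is \({\varGamma }(\xi , \eta ) = (\xi ^{{{\textsf{T}}}}\eta )Y\), and integrating \(\ddot{Y} + {\varGamma }(\dot{Y}, \dot{Y})=0\) numerically reproduces the great circle \(Y(t)=\cos (t)Y_0+\sin (t)\dot{Y}_0\) for a unit-speed start:

```python
import numpy as np

# Integrate Y'' + Gamma(Y', Y') = 0 on the unit sphere, where
# Gamma(xi, eta) = (xi . eta) Y; compare with the exact great circle.
def geodesic(Y, Ydot, t_end=1.0, steps=20000):
    h = t_end / steps
    for _ in range(steps):                 # semi-implicit Euler steps
        Yddot = -(Ydot @ Ydot) * Y
        Ydot = Ydot + h * Yddot
        Y = Y + h * Ydot
    return Y

Y0 = np.array([1.0, 0.0, 0.0])
V0 = np.array([0.0, 1.0, 0.0])             # unit-speed tangent at Y0
Y1 = geodesic(Y0, V0)
exact = np.cos(1.0) * Y0 + np.sin(1.0) * V0
print(np.linalg.norm(Y1 - exact))          # small discretization error
```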

In applications, we also work with quotient manifolds, which are examples of Riemannian submersions. Recall ([22], Definition 7.44) that a Riemannian submersion \(\pi :{\mathcal {M}}\rightarrow {\mathcal {B}}\) between two manifolds \({\mathcal {M}}\) and \({\mathcal {B}}\) is a smooth, onto mapping such that the differential \(d\pi \) is onto at every point \(Y\in {\mathcal {M}}\), the fiber \(\pi ^{-1}(b), b\in {\mathcal {B}}\) is a Riemannian submanifold of \({\mathcal {M}}\), and \(d\pi \) preserves scalar products of vectors normal to fibers. An important example is the quotient space by a free and proper action of a group of isometries. At each point \(Y\in {\mathcal {M}}\), the tangent space of \(\pi ^{-1}(\pi Y)\) is called the vertical space, and its orthogonal complement with respect to the Riemannian metric is called the horizontal space. The collection of horizontal spaces \({\mathcal {H}}_Y\) (\(Y\in {\mathcal {M}}\)) of a submersion is a subbundle \({\mathcal {H}}\). The horizontal lift, identifying a tangent vector \(\xi \) at \(b=\pi (Y)\in {\mathcal {B}}\) with a horizontal tangent vector \(\xi _{{\mathcal {H}}}\) at Y, is a linear isometry between the tangent space \(T_b{\mathcal {B}}\) and \({\mathcal {H}}_Y\), the horizontal space at Y. The Riemannian framework for a quotient of embedded manifolds is studied through horizontal lifts; the focus is on the horizontal bundle \({\mathcal {H}}\) instead of the tangent bundle \(T{\mathcal {M}}\).

The reader can consult [1, 9] for details of Riemannian optimization, including the basic algorithms once the Euclidean and Riemannian gradient and Hessian are computed. In essence, it has been recognized that many popular equation-solving and optimization algorithms on Euclidean spaces can be extended to a manifold framework ([9, 10]). Steepest descent on real vector spaces could be extended to manifolds using the Riemannian gradient defined above together with a retraction. Here, a retraction R is a sufficiently smooth map sending \(X\in {\mathcal {M}}, \eta \in T_X{\mathcal {M}}\) to \(R(X, \eta )\in {\mathcal {M}}\) for sufficiently small \(\eta \). Also, using the Riemannian Hessian, second-order optimization methods, for example Trust-Region ([1]), could be extended to the manifold context. At the i-th iteration step, an optimization algorithm produces a tangent vector \(\eta _{i}\) at the manifold point \(Y_i\), which produces the next iterate \(Y_{i+1}=R(Y_i, \eta _i)\) ([3], chapter 4 of [1]). For manifolds considered in this article, computationally efficient retractions are available.
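A minimal sketch of this iteration (our own toy problem: minimizing \(f(Y)=Y^{{{\textsf{T}}}}AY\) on the unit sphere with the normalization retraction; the step size and iteration count are illustrative choices):

```python
import numpy as np

# Riemannian steepest descent with a retraction (toy problem): minimize
# f(Y) = Y^T A Y on the unit sphere; R(Y, eta) = (Y + eta)/|Y + eta|.
rng = np.random.default_rng(1)
A = rng.standard_normal((5, 5)); A = A + A.T
Y = rng.standard_normal(5); Y /= np.linalg.norm(Y)

for _ in range(500):
    G = 2 * A @ Y                         # ambient (Euclidean) gradient
    rgrad = G - (Y @ G) * Y               # project onto the tangent space at Y
    Y = Y - 0.05 * rgrad                  # step, then retract
    Y /= np.linalg.norm(Y)

print(Y @ A @ Y, np.linalg.eigvalsh(A)[0])  # approaches the smallest eigenvalue
```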

2.1 Notations

We will attempt to state and prove statements for both the real and complex cases at the same time when there are parallel results, as discussed in Sect. 4.1. The base field \({\mathbb {K}}\) will be \({\mathbb {R}}\) or \({\mathbb {C}}\). We use the notation \({\mathbb {K}}^{n\times m}\) to denote the space of matrices of size \(n\times m\) over \({\mathbb {K}}\). We consider both real and complex vector spaces as real vector spaces, and by \({{\,\mathrm{Tr_{{\mathbb {R}}}}\,}}\) we denote the trace of a matrix in the real case or the real part of the trace in the complex case. A real matrix space is a real inner product space with the Frobenius inner product \({{\,\mathrm{Tr_{{\mathbb {R}}}}\,}}ab^{{{\textsf{T}}}}\), while a complex matrix space becomes a real inner product space with inner product \({{\,\mathrm{Tr_{{\mathbb {R}}}}\,}}ab^{{{\textsf{H}}}}\) (see Sect. 4.1). We will use the notation \({\mathfrak {t}}\) to denote the real adjoint \({{\textsf{T}}}\) for a real vector space, and the Hermitian adjoint \({{\textsf{H}}}\) for a complex vector space, both for matrices and operators. We denote \({\textrm{sym}}_{{\mathfrak {t}}}X = \frac{1}{2}(X + X^{{\mathfrak {t}}})\), \({\textrm{skew}}_{{\mathfrak {t}}}X = \frac{1}{2}(X - X^{{\mathfrak {t}}})\). We denote by \({\textrm{Sym}}_{{\mathfrak {t}}, {\mathbb {K}}, n}\) the space of \({\mathfrak {t}}\)-symmetric matrices \(X\in {\mathbb {K}}^{n\times n}\) with \(X^{{\mathfrak {t}}} = X\). The \({\mathfrak {t}}\)-antisymmetric space \({\textrm{Skew}}_{{\mathfrak {t}}, {\mathbb {K}}, n}\) is defined similarly. The symbols \(\xi , \eta \) are often used to denote tangent vectors or vector fields, while \(\omega \) is used to denote a vector in the ambient space. The directional derivative in direction v is denoted by \({\textrm{D}}_{v}\); it applies to scalar, vector, or operator-valued functions. If \(\texttt{X}\) is a vector field and f is a function, the Lie derivative will be written as \({\textrm{D}}_{\texttt{X}}f\). We also apply Lie derivatives to scalar or operator-valued functions when \(\texttt{X}\) is a vector field, writing, for example, \({\textrm{D}}_{\texttt{X}}{{\textsf{g}}}\), where \({{\textsf{g}}}\) is a metric operator. Because the vector field \(\texttt{X}\) may be a matrix, we prefer the notation \({\textrm{D}}_{\texttt{X}}{{\textsf{g}}}\) to the usual Lie derivative notation \(\texttt{X}{{\textsf{g}}}\), which may be ambiguous. By \({\textrm{U}}_{{\mathbb {K}}, d}\) we denote the group of \({\mathbb {K}}^{d\times d}\) matrices U satisfying \(U^{{\mathfrak {t}}}U = I_d\) (called \({\mathfrak {t}}\)-orthogonal); thus, \({\textrm{U}}_{{\mathbb {K}}, d}\) is the real orthogonal group \({\textrm{O}}(d)\) when \({\mathbb {K}}={\mathbb {R}}\) and the unitary group \({\textrm{U}}(d)\) when \({\mathbb {K}}={\mathbb {C}}\).
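As a quick sanity check of the notation (toy matrices of our own choosing): \({{\,\mathrm{Tr_{{\mathbb {R}}}}\,}}ab^{{{\textsf{H}}}}\) is a real inner product under which the \({\textrm{sym}}_{{\mathfrak {t}}}\)/\({\textrm{skew}}_{{\mathfrak {t}}}\) decomposition is orthogonal:

```python
import numpy as np

# The real inner product Tr_R(a b^H) on complex matrices, and the orthogonal
# decomposition a = sym_t(a) + skew_t(a), with t the Hermitian adjoint.
rng = np.random.default_rng(5)
a = rng.standard_normal((3, 3)) + 1j * rng.standard_normal((3, 3))
sym = 0.5 * (a + a.conj().T)
skew = 0.5 * (a - a.conj().T)
ip = lambda u, v: np.trace(u @ v.conj().T).real
print(np.allclose(a, sym + skew))          # the decomposition is exact
print(np.isclose(ip(sym, skew), 0.0))      # and orthogonal under Tr_R(. .^H)
```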

In our approach, a subspace \({\mathcal {H}}_Y\) of the tangent space at a point Y on a manifold \({\mathcal {M}}\) is defined as either the nullspace of an operator \({{\,\textrm{J}\,}}(Y)\) or the range of an operator \({\textrm{N}}(Y)\), both being operator-valued functions on \({\mathcal {M}}\). Since we most often work with one manifold point Y at a time, we sometimes drop the symbol Y to make the expressions less lengthy. Other operator-valued functions defined in this paper include the ambient metric \({{\textsf{g}}}\), the projection \({\varPi }_{{\mathcal {H}},{{\textsf{g}}}}\) to \({\mathcal {H}}\), the Christoffel metric term \({\textrm{K}}\), and their directional derivatives. We also use the symbols \(\hat{f}_{Y}\) and \(\hat{f}_{YY}\) to denote the ambient gradient and Hessian (Y is the manifold variable). We summarize below the symbols and concepts related to the main ideas of the paper, with the Stiefel case as an example (details are explained in the sections below).

\({\mathcal {E}}\): Ambient space; \({\mathcal {M}}\) is embedded in \({\mathcal {E}}\) (e.g. \({\mathbb {K}}^{n\times p}\)).

\({\mathcal {E}}_{{{\,\textrm{J}\,}}}, {\mathcal {E}}_{{\textrm{N}}}\): Inner product spaces; the range of \({{\,\textrm{J}\,}}(Y)\) and the domain of \({\textrm{N}}(Y)\) below.

\({\mathcal {H}}\): A subbundle of \(T{\mathcal {M}}\); either \(T{\mathcal {M}}\) itself or a horizontal bundle in practice.

\({{\,\textrm{J}\,}}(Y)\): Operator from \({\mathcal {E}}\) onto \({\mathcal {E}}_{{{\,\textrm{J}\,}}}\) with \({{\,\textrm{Null}\,}}({{\,\textrm{J}\,}}(Y))={\mathcal {H}}_Y\subset T_Y{\mathcal {M}}\) (e.g. \(\omega \mapsto Y^{{\mathfrak {t}}}\omega + \omega ^{{\mathfrak {t}}}Y\)).

\({\textrm{N}}(Y)\): Injective operator from \({\mathcal {E}}_{{\textrm{N}}}\) to \({\mathcal {E}}\) with range \({\mathcal {H}}_Y\subset T_Y{\mathcal {M}}\) (e.g. \({\textrm{N}}(A, B) = Y A + Y_{\perp } B\)).

\({{\,\textrm{xtrace}\,}}\): Index-raising operator for the trace (Frobenius) inner product.

\({{\textsf{g}}}(Y)\): Metric given as a self-adjoint operator on \({\mathcal {E}}\) (e.g. \(\eta \mapsto (\alpha _1YY^{{\mathfrak {t}}} + \alpha _0 Y_{\perp }Y_{\perp }^{{\mathfrak {t}}})\eta \)).

\({\varPi }_{{{\textsf{g}}}}, {\varPi }_{{\mathcal {H}},{{\textsf{g}}}}\): Projection to \({\mathcal {H}}\subset T{\mathcal {M}}\) in Proposition 3.2.

\({\textrm{K}}(\xi , \eta )\): Christoffel metric term \(\frac{1}{2}(({\textrm{D}}_{\xi }{{\textsf{g}}})\eta + ({\textrm{D}}_{\eta }{{\textsf{g}}})\xi -{{\,\textrm{xtrace}\,}}(\langle ({\textrm{D}}_\phi {{\textsf{g}}})\xi , \eta \rangle _{{\mathcal {E}}}, \phi ))\).

\({\varGamma }_{{\mathcal {H}}}(\xi , \eta )\): Christoffel function \({\varPi }_{{\mathcal {H}},{{\textsf{g}}}}{{\textsf{g}}}^{-1}{\textrm{K}}(\xi , \eta )-({\textrm{D}}_{\xi }{\varPi }_{{\mathcal {H}}, {{\textsf{g}}}})\eta \).
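A small numerical illustration of the Stiefel entries above (toy sizes of our own choosing; \(Y_{\perp }\) denotes an orthonormal complement of Y, as used in later sections):

```python
import numpy as np

# The Stiefel example operators from the summary (real case, toy sizes):
# J(Y) w = Y^T w + w^T Y, whose nullspace is the tangent space at Y,
# and N(A, B) = Y A + Y_perp B with A antisymmetric parametrizes it.
rng = np.random.default_rng(6)
n, p = 6, 3
Y = np.linalg.qr(rng.standard_normal((n, p)))[0]      # a point with Y^T Y = I_p
J = lambda w: Y.T @ w + w.T @ Y

A = rng.standard_normal((p, p)); A = A - A.T          # antisymmetric block
Yp = np.linalg.svd(Y, full_matrices=True)[0][:, p:]   # orthonormal complement
B = rng.standard_normal((n - p, p))
xi = Y @ A + Yp @ B                                   # N(A, B)
print(np.allclose(J(xi), 0))                          # xi is tangent at Y
```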

3 Ambient Space and Optimization on Riemannian Manifolds

If \(\hat{f}\) is a scalar function on an open subset \({\mathcal {U}}\) of a Euclidean space \({\mathcal {E}}\), its gradient \({\textsf{grad}}\hat{f}\) satisfies \(\langle {\hat{\eta }}, {\textsf{grad}}\hat{f}\rangle _{{\mathcal {E}}} = {\textrm{D}}_{{\hat{\eta }}} \hat{f}\) for all vector fields \({\hat{\eta }}\) on \({\mathcal {U}}\), where \({\textrm{D}}_{{\hat{\eta }}} \hat{f}\) is the Lie derivative of \(\hat{f}\) with respect to \({\hat{\eta }}\). As is well known [1, 9], the Riemannian gradient and Hessian product of a function f on a submanifold \({\mathcal {M}}\subset {\mathcal {E}}\) could be computed from the Euclidean gradient and Hessian, which are evaluated by extending f to a function \(\hat{f}\) on a region of \({\mathcal {E}}\) near \({\mathcal {M}}\). The result is independent of the extension \(\hat{f}\).

Definition 3.1

We call an inner product (Euclidean) space \(({\mathcal {E}}, \langle ,\rangle _{{\mathcal {E}}})\) an embedded ambient space of a Riemannian manifold \({\mathcal {M}}\) if there is a differentiable (not necessarily Riemannian) embedding \({\mathcal {M}}\subset {\mathcal {E}}\).

Let f be a function on \({\mathcal {M}}\) and \(\hat{f}\) be an extension of f to an open subset of \({\mathcal {E}}\) containing \({\mathcal {M}}\). We call \({\textsf{grad}}\hat{f}\) an ambient gradient of f. It is a vector-valued function from \({\mathcal {M}}\) to \({\mathcal {E}}\) such that for all vector fields \(\eta \) on \({\mathcal {M}}\)

$$\begin{aligned} \langle \eta (Y), {\textsf{grad}}\hat{f}(Y)\rangle _{{\mathcal {E}}} = ({\textrm{D}}_{\eta (Y)} f)(Y)\text { for all }Y\in {\mathcal {M}}\end{aligned}$$
(3.1)

or equivalently \(\langle \eta , {\textsf{grad}}\hat{f}\rangle _{{\mathcal {E}}} = {\textrm{D}}_\eta f\). Given an ambient gradient \({\textsf{grad}}\hat{f}\) with continuous derivatives, we define the ambient Hessian to be the map \({\textsf{hess}}\hat{f}\) associating to a vector field \(\xi \) on \({\mathcal {M}}\) the derivative \({\textrm{D}}_{\xi }{\textsf{grad}}\hat{f}\). We define the ambient Hessian bilinear form \({\textsf{hess}}\hat{f}^{02}(\xi , \eta )\) to be \(\langle ({\textrm{D}}_{\xi }{\textsf{grad}}\hat{f}),\eta \rangle _{{\mathcal {E}}}\). If \(Y\in {\mathcal {M}}\) is considered as a variable, we also use the notation \(\hat{f}_{Y}\) for \({\textsf{grad}}\hat{f}\) and \(\hat{f}_{YY}\) for \({\textsf{hess}}\hat{f}\).

By the Whitney embedding theorem, any manifold has an ambient space. Coordinate charts could be considered as a collection of compatible local ambient spaces.

From the embedding \({\mathcal {M}}\subset {\mathcal {E}}\), the tangent space of \({\mathcal {M}}\) at each point \(Y\in {\mathcal {M}}\) is considered as a subspace of \({\mathcal {E}}\). Thus, a vector field on \({\mathcal {M}}\) could be considered as an \({\mathcal {E}}\)-valued function on \({\mathcal {M}}\) and we can take its directional derivatives. This derivative is dependent on the embedding and hence not intrinsic. For a function f and two vector fields \(\xi , \eta \) on \({\mathcal {M}}\subset {\mathcal {E}}\) we have:

$$\begin{aligned} \hat{f}_{YY}^{02}(\xi , \eta )= {\textsf{hess}}\hat{f}^{02}(\xi , \eta ) = {\textrm{D}}_{\xi }({\textrm{D}}_{\eta } f)- \langle {\textrm{D}}_{\xi }\eta , \hat{f}_{Y}\rangle _{{\mathcal {E}}}. \end{aligned}$$
(3.2)

This follows from \({\textrm{D}}_{\xi }\langle \eta , \hat{f}_{Y}\rangle _{{\mathcal {E}}} = \langle {\textrm{D}}_{\xi }\eta , \hat{f}_{Y}\rangle _{{\mathcal {E}}} + \langle \eta , {\textrm{D}}_{\xi }(\hat{f}_{Y})\rangle _{{\mathcal {E}}} = {\textrm{D}}_{\xi }({\textrm{D}}_{\eta } f)\), obtained by taking directional derivatives of Eq. (3.1); solving for \(\langle \eta , {\textrm{D}}_{\xi }(\hat{f}_{Y})\rangle _{{\mathcal {E}}}\) gives Eq. (3.2).

We begin with a standard result on inner product spaces. Recall that the adjoint of a linear map A between two inner product spaces V and W is the map \(A^{{\mathfrak {t}}}\) such that \(\langle Av, w\rangle _W = \langle v, A^{{\mathfrak {t}}}w\rangle _V\), where \(\langle ,\rangle _V, \langle ,\rangle _W\) denote the inner products on V and W, respectively. If A is represented by a matrix also called A in two orthonormal bases of V and W, respectively, then \(A^{{\mathfrak {t}}}\) is represented by its transpose \(A^{{{\textsf{T}}}}\). A projection from an inner product space V to a subspace W is a linear operator \({\varPi }_W\) on V such that \({\varPi }_W v\in W\) and \(\langle v, w\rangle _V = \langle {\varPi }_W v, w\rangle _V\) for all \(w\in W\), \(v\in V\). It is well known that a projection always exists and is unique, and that \({\varPi }_W v\) minimizes the distance from v to W.

Proposition 3.1

Let \({\mathcal {E}}\) be a vector space with an inner product \(\langle ,\rangle _{{\mathcal {E}}}\). Let \({{\textsf{g}}}\) be a self-adjoint positive-definite operator on \({\mathcal {E}}\), thus \(\langle {{\textsf{g}}}e_1, e_2\rangle _{{\mathcal {E}}}= \langle e_1, {{\textsf{g}}}e_2\rangle _{{\mathcal {E}}}\). The operator \({{\textsf{g}}}\) defines a new inner product on \({\mathcal {E}}\) by \(\langle e_1, e_2\rangle _{{\mathcal {E}}, {{\textsf{g}}}}:= \langle e_1, {{\textsf{g}}}e_2\rangle _{{\mathcal {E}}}\). If \(W = {{\,\textrm{Null}\,}}({{\,\textrm{J}\,}})\) for a map \({{\,\textrm{J}\,}}\) from \({\mathcal {E}}\) onto an inner product space \({\mathcal {E}}_{{{\,\textrm{J}\,}}}\), the projection \({\varPi }_{{{\textsf{g}}}}={\varPi }_{{{\textsf{g}}}, W}\) from \({\mathcal {E}}\) to W under the inner product \(\langle , \rangle _{{\mathcal {E}}, {{\textsf{g}}}}\) is given by \({\varPi }_{{{\textsf{g}}}}e= e- {{\textsf{g}}}^{-1}{{\,\textrm{J}\,}}^{{\mathfrak {t}}}({{\,\textrm{J}\,}}{{\textsf{g}}}^{-1}{{\,\textrm{J}\,}}^{{\mathfrak {t}}})^{-1}{{\,\textrm{J}\,}}e\) for all \(e\in {\mathcal {E}}\), where \({{\,\textrm{J}\,}}^{{\mathfrak {t}}}\) is the adjoint map of \({{\,\textrm{J}\,}}\).

Alternatively, if \({\textrm{N}}\) is a one-to-one map from an inner product space \({\mathcal {E}}_{{\textrm{N}}}\) to \({\mathcal {E}}\) such that \(W={\textrm{N}}({\mathcal {E}}_{{\textrm{N}}})\), then the projection to W could be given by \({\varPi }_{{{\textsf{g}}}}e= {\textrm{N}}({\textrm{N}}^{{\mathfrak {t}}}{{\textsf{g}}}{\textrm{N}})^{-1}{\textrm{N}}^{{\mathfrak {t}}}{{\textsf{g}}}e\).

The operators \({{\textsf{g}}}{\varPi }_{{{\textsf{g}}}}\) and \({\varPi }_{{{\textsf{g}}}}{{\textsf{g}}}^{-1}\) are self-adjoint under \(\langle ,\rangle _{{\mathcal {E}}}\).

Proof

The assumption that \({{\,\textrm{J}\,}}\) is onto \({\mathcal {E}}_{{{\,\textrm{J}\,}}}\) shows \({{\,\textrm{J}\,}}^{{\mathfrak {t}}}\) is injective (as \({{\,\textrm{J}\,}}^{{\mathfrak {t}}}a = 0\) implies \(\langle a, {{\,\textrm{J}\,}}\omega \rangle _{{\mathcal {E}}_{{{\,\textrm{J}\,}}}} = 0\) for all \(\omega \in {\mathcal {E}}\), and since \({{\,\textrm{J}\,}}\) is onto, this implies \(a = 0\)). This in turn implies \({{\,\textrm{J}\,}}{{\textsf{g}}}^{-1}{{\,\textrm{J}\,}}^{{\mathfrak {t}}}\) is invertible: if \({{\,\textrm{J}\,}}{{\textsf{g}}}^{-1}{{\,\textrm{J}\,}}^{{\mathfrak {t}}} a = 0\), then \(\langle {{\,\textrm{J}\,}}{{\textsf{g}}}^{-1}{{\,\textrm{J}\,}}^{{\mathfrak {t}}}a, a\rangle _{{\mathcal {E}}_{{{\,\textrm{J}\,}}}} = 0\), so \(\langle {{\textsf{g}}}^{-1}{{\,\textrm{J}\,}}^{{\mathfrak {t}}}a, {{\,\textrm{J}\,}}^{{\mathfrak {t}}}a\rangle _{{\mathcal {E}}} = 0\), and hence \({{\,\textrm{J}\,}}^{{\mathfrak {t}}}a = 0\) as \({{\textsf{g}}}\) is positive-definite. We can show \({\textrm{N}}^{{\mathfrak {t}}}{{\textsf{g}}}{\textrm{N}}\) is invertible similarly.

For the first case, if \(e_W \in W={{\,\textrm{Null}\,}}({{\,\textrm{J}\,}})\) and \(e\in {\mathcal {E}}\),

$$\begin{aligned} \langle {{\textsf{g}}}^{-1}{{\,\textrm{J}\,}}^{{\mathfrak {t}}}({{\,\textrm{J}\,}}{{\textsf{g}}}^{-1}{{\,\textrm{J}\,}}^{{\mathfrak {t}}})^{-1}{{\,\textrm{J}\,}}e, {{\textsf{g}}}e_W\rangle _{{\mathcal {E}}} = \langle {{\,\textrm{J}\,}}^{{\mathfrak {t}}}({{\,\textrm{J}\,}}{{\textsf{g}}}^{-1}{{\,\textrm{J}\,}}^{{\mathfrak {t}}})^{-1}{{\,\textrm{J}\,}}e, e_W\rangle _{{\mathcal {E}}}=\langle ({{\,\textrm{J}\,}}{{\textsf{g}}}^{-1}{{\,\textrm{J}\,}}^{{\mathfrak {t}}})^{-1}{{\,\textrm{J}\,}}e, {{\,\textrm{J}\,}}e_W\rangle _{{\mathcal {E}}_{{{\,\textrm{J}\,}}}}, \end{aligned}$$

where the last term is zero because \(e_W\in {{\,\textrm{Null}\,}}({{\,\textrm{J}\,}})\), so \(\langle {\varPi }_{{{\textsf{g}}}} e, {{\textsf{g}}}e_W\rangle _{{\mathcal {E}}} = \langle e, {{\textsf{g}}}e_W\rangle _{{\mathcal {E}}}\). For the second case, assuming \(e_W ={\textrm{N}}(e_{{\textrm{N}}})\) for some \(e_{{\textrm{N}}}\in {\mathcal {E}}_{{\textrm{N}}}\), then (using \((AB)^{{\mathfrak {t}}} = B^{{\mathfrak {t}}}A^{{\mathfrak {t}}}\)):

$$\begin{aligned} \langle {\textrm{N}}({\textrm{N}}^{{\mathfrak {t}}}{{\textsf{g}}}{\textrm{N}})^{-1}{\textrm{N}}^{{\mathfrak {t}}}{{\textsf{g}}}e, {{\textsf{g}}}{\textrm{N}}(e_{{\textrm{N}}})\rangle _{{\mathcal {E}}} = \langle {{\textsf{g}}}e, {\textrm{N}}({\textrm{N}}^{{\mathfrak {t}}}{{\textsf{g}}}{\textrm{N}})^{-1}{\textrm{N}}^{{\mathfrak {t}}} {{\textsf{g}}}{\textrm{N}}(e_{{\textrm{N}}})\rangle _{{\mathcal {E}}}=\langle {{\textsf{g}}}e, {\textrm{N}}(e_{{\textrm{N}}})\rangle _{{\mathcal {E}}}. \end{aligned}$$

The last statement follows from the defining equations of \({\varPi }_{{{\textsf{g}}}}\). \(\square \)
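A numerical check of the two formulas in Proposition 3.1 (toy data of our own choosing; N is built from a basis of \({{\,\textrm{Null}\,}}({{\,\textrm{J}\,}})\) so that the two descriptions match):

```python
import numpy as np

# Check that the Null(J) form and the Range(N) form of the projection agree,
# and that g Pi_g is self-adjoint, as stated in Proposition 3.1.
rng = np.random.default_rng(2)
n, k = 7, 3
J = rng.standard_normal((k, n))                      # onto E_J = R^k
A = rng.standard_normal((n, n)); g = A @ A.T + n * np.eye(n)
N = np.linalg.svd(J)[2][k:].T                        # columns span Null(J)

gJt = np.linalg.solve(g, J.T)
P_J = np.eye(n) - gJt @ np.linalg.solve(J @ gJt, J)  # e - g^{-1}J^t(...)^{-1}J e
P_N = N @ np.linalg.solve(N.T @ g @ N, N.T @ g)      # N (N^t g N)^{-1} N^t g e
print(np.allclose(P_J, P_N))
print(np.allclose(g @ P_J, (g @ P_J).T))             # g Pi_g is self-adjoint
```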

Recall the Riemannian gradient of a function f on a manifold \({\mathcal {M}}\) with Riemannian metric \(\langle ,\rangle _R\) is the vector field \({\textsf{rgrad}}_f\) such that \(\langle {\textsf{rgrad}}_f(Y), \xi (Y)\rangle _R = ({\textrm{D}}_{\xi } f)(Y)\) for any point \(Y\in {\mathcal {M}}\) and any vector field \(\xi \). Let \({\mathcal {H}}\) be a subbundle of the tangent bundle \(T{\mathcal {M}}\); we recall this means \({\mathcal {H}}\) is a collection of subspaces (fibers) \({\mathcal {H}}_Y\subset T_Y{\mathcal {M}}\) for \(Y\in {\mathcal {M}}\) such that \({\mathcal {H}}\) is itself a vector bundle on \({\mathcal {M}}\), i.e. \({\mathcal {H}}\) is locally a product of a vector space and an open subset of \({\mathcal {M}}\), together with a linear coordinate change condition (see [22], Definition 7.24 for details). We can define the \({\mathcal {H}}\)-Riemannian gradient \({\textsf{rgrad}}_{{\mathcal {H}}, f}\) of f as the unique \({\mathcal {H}}\)-valued vector field such that \(\langle {\textsf{rgrad}}_{{\mathcal {H}}, f}, \xi _{{\mathcal {H}}}\rangle _R = {\textrm{D}}_{\xi _{{\mathcal {H}}}} f\) for any \({\mathcal {H}}\)-valued vector field \(\xi _{{\mathcal {H}}}\). Uniqueness follows from the nondegeneracy of the inner product restricted to \({\mathcal {H}}\). Clearly, when \({\mathcal {H}}=T{\mathcal {M}}\), \({\textsf{rgrad}}_{{\mathcal {H}}, f}={\textsf{rgrad}}_{T{\mathcal {M}}, f}\) is the usual Riemannian gradient. We have:

Proposition 3.2

Let \(({\mathcal {E}}, \langle ,\rangle _{{\mathcal {E}}})\) be an embedded ambient space of a manifold \({\mathcal {M}}\) as in Definition 3.1. Let \({{\textsf{g}}}\) be a smooth operator-valued function associating each \(Y\in {\mathcal {M}}\) a self-adjoint positive-definite operator \({{\textsf{g}}}(Y)\) on \({\mathcal {E}}\). Thus, each \({{\textsf{g}}}(Y)\) defines an inner product on \({\mathcal {E}}\), which induces an inner product on \(T_Y{\mathcal {M}}\) and hence \({{\textsf{g}}}\) induces a Riemannian metric on \({\mathcal {M}}\). Let \({\mathcal {H}}\) be a subbundle of \(T{\mathcal {M}}\). Define \({\varPi }_{{\mathcal {H}},{{\textsf{g}}}}\) to be the operator-valued function such that \({\varPi }_{{\mathcal {H}},{{\textsf{g}}}}(Y)\) is the projection associated with \({{\textsf{g}}}(Y)\) from \({\mathcal {E}}\) to the fiber \({\mathcal {H}}_Y\), and for the case \({\mathcal {H}}=T{\mathcal {M}}\), define \({\varPi }_{{\mathcal {M}},{{\textsf{g}}}} = {\varPi }_{T{\mathcal {M}},{{\textsf{g}}}}\). For an ambient gradient \({\textsf{grad}}\hat{f}\) of f, the \({\mathcal {H}}\)-Riemannian gradient of f can be evaluated as:

$$\begin{aligned} {\textsf{rgrad}}_{{\mathcal {H}}, f}= {\varPi }_{{\mathcal {H}},{{\textsf{g}}}}{{\textsf{g}}}^{-1} {\textsf{grad}}\hat{f}. \end{aligned}$$
(3.3)

If there is an inner product space \({\mathcal {E}}_{{{\,\textrm{J}\,}}}\) and a map \({{\,\textrm{J}\,}}\) from \({\mathcal {M}}\) to the space \({\mathfrak {L}}({\mathcal {E}}, {\mathcal {E}}_{{{\,\textrm{J}\,}}})\) of linear maps from \({\mathcal {E}}\) to \({\mathcal {E}}_{{{\,\textrm{J}\,}}}\), such that for each \(Y\in {\mathcal {M}}\) the range of \({{\,\textrm{J}\,}}(Y)\) is precisely \({\mathcal {E}}_{{{\,\textrm{J}\,}}}\) and its nullspace is \({\mathcal {H}}_Y\), then \({\varPi }_{{\mathcal {H}},{{\textsf{g}}}}(Y)e\) for \(e\in {\mathcal {E}}\) is given by:

$$\begin{aligned} {\varPi }_{{\mathcal {H}},{{\textsf{g}}}}(Y)e= e- {{\textsf{g}}}^{-1}{{\,\textrm{J}\,}}^{{\mathfrak {t}}}({{\,\textrm{J}\,}}{{\textsf{g}}}^{-1}{{\,\textrm{J}\,}}^{{\mathfrak {t}}})^{-1}{{\,\textrm{J}\,}}e; \ \text {all are evaluated at }Y. \end{aligned}$$
(3.4)

If there is an inner product space \({\mathcal {E}}_{{\textrm{N}}}\) and a map \({\textrm{N}}\) from \({\mathcal {M}}\) to the space \({\mathfrak {L}}({\mathcal {E}}_{{\textrm{N}}}, {\mathcal {E}})\) of linear maps from \({\mathcal {E}}_{{\textrm{N}}}\) to \({\mathcal {E}}\), such that for each \(Y\in {\mathcal {M}}\), \({\textrm{N}}(Y)\) is one-to-one with range precisely \({\mathcal {H}}_Y\), then:

$$\begin{aligned} {\varPi }_{{\mathcal {H}},{{\textsf{g}}}}e= {\textrm{N}}({\textrm{N}}^{{\mathfrak {t}}}{{\textsf{g}}}{\textrm{N}})^{-1}{\textrm{N}}^{{\mathfrak {t}}}{{\textsf{g}}}e; \text { all are evaluated at }Y. \end{aligned}$$
(3.5)

Proof

For any \({\mathcal {H}}\)-valued vector field \(\xi _{{\mathcal {H}}}\), we have:

$$\begin{aligned} \langle {\varPi }_{{\mathcal {H}}, {{\textsf{g}}}}{{\textsf{g}}}^{-1}{\textsf{grad}}\hat{f},{{\textsf{g}}}\xi _{{\mathcal {H}}} \rangle _{{\mathcal {E}}} = \langle {\textsf{grad}}\hat{f},{\varPi }_{{\mathcal {H}}, {{\textsf{g}}}}{{\textsf{g}}}^{-1}{{\textsf{g}}}\xi _{{\mathcal {H}}} \rangle _{{\mathcal {E}}} = \langle \hat{f}_{Y}, \xi _{{\mathcal {H}}} \rangle _{{\mathcal {E}}} ={\textrm{D}}_{\xi _{{\mathcal {H}}}} f \end{aligned}$$

because \({\varPi }_{{\mathcal {H}}, {{\textsf{g}}}}{{\textsf{g}}}^{-1}\) is self-adjoint and the projection fixes \({\mathcal {H}}\)-valued fields (\({\varPi }_{{\mathcal {H}}, {{\textsf{g}}}}\xi _{{\mathcal {H}}}=\xi _{{\mathcal {H}}}\)). The remaining statements are just a parametrized version of Proposition 3.1. \(\square \)
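A small numerical instance of Eqs. (3.3) and (3.4) (our own toy setup: the unit sphere with a fixed self-adjoint positive-definite ambient operator g and \(f(Y)=c^{{{\textsf{T}}}}Y\)):

```python
import numpy as np

# rgrad = Pi_{H,g} g^{-1} grad f^ on the unit sphere, with J(Y) w = Y^T w
# and a fixed SPD ambient metric g; verify <rgrad, xi>_g = D_xi f.
rng = np.random.default_rng(7)
n = 6
G = rng.standard_normal((n, n)); g = G @ G.T + n * np.eye(n)
c = rng.standard_normal(n)                       # f(Y) = c^T Y, so grad f^ = c
Y = rng.standard_normal(n); Y /= np.linalg.norm(Y)

J = Y[None, :]                                   # tangent space = Null(J)
g_inv = np.linalg.inv(g)
Pi = np.eye(n) - g_inv @ J.T @ np.linalg.solve(J @ g_inv @ J.T, J)
rgrad = Pi @ g_inv @ c                           # Eq. (3.3) using Eq. (3.4)

xi = (np.eye(n) - np.outer(Y, Y)) @ rng.standard_normal(n)  # a tangent vector
print(np.isclose(rgrad @ g @ xi, c @ xi))        # metric pairing = derivative
```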

Note that we are not making any smoothness assumption on \({{\,\textrm{J}\,}}\) or \({\textrm{N}}\) yet, although \({\varPi }_{{\mathcal {H}},{{\textsf{g}}}}\) is assumed to be sufficiently smooth. In fact, \({\textrm{N}}\) is often not smooth, while \({{\,\textrm{J}\,}}\) is usually smooth, as it is constructed from a smooth constraint on \({\mathcal {M}}\) or from the horizontal requirements of a vector field.

Definition 3.2

A triple \(({\mathcal {M}}, {{\textsf{g}}}, {\mathcal {E}})\), with \({\mathcal {E}}\) an inner product space, \({\mathcal {M}}\subset {\mathcal {E}}\) an embedded differentiable manifold, and \({{\textsf{g}}}\) a positive-definite operator-valued function from \({\mathcal {M}}\) to \({\mathfrak {L}}({\mathcal {E}}, {\mathcal {E}})\), is called an embedded ambient structure of \({\mathcal {M}}\). \({\mathcal {M}}\) is a Riemannian manifold with the metric induced by \({{\textsf{g}}}\).

From the definition of Lie brackets, for an embedded ambient space \({\mathcal {E}}\) of \({\mathcal {M}}\) we have

$$\begin{aligned} {\textrm{D}}_{\xi }\eta -{\textrm{D}}_{\eta }\xi = [\xi , \eta ]\text { for all vector fields } \xi , \eta \text { on } {\mathcal {M}}. \end{aligned}$$
(3.6)

Recall that if \(({\mathcal {M}}, \langle ,\rangle _R)\) is a Riemannian manifold with the Levi-Civita connection \(\nabla \), the Riemannian Hessian (vector product) of a function f is the operator sending a tangent vector \(\xi \) to the tangent vector \({\textsf{rhess}}_f^{11}\xi = \nabla _{\xi }{\textsf{rgrad}}_f\). The Riemannian Hessian bilinear form is the map evaluating two vector fields \(\xi , \eta \) to \(\langle \nabla _{\xi }{\textsf{rgrad}}_f, \eta \rangle _R\). For a subbundle \({\mathcal {H}}\) of \(T{\mathcal {M}}\) and an \({\mathcal {H}}\)-valued vector field \(\xi _{{\mathcal {H}}}\), we define the \({\mathcal {H}}\)-Riemannian Hessian similarly as \({\textsf{rhess}}_{{\mathcal {H}}, f}^{11}\xi _{{\mathcal {H}}} = {\varPi }_{{\mathcal {H}}, {{\textsf{g}}}}\nabla _{\xi _{{\mathcal {H}}}}{\textsf{rgrad}}_{{\mathcal {H}}, f}\), and we call \({\textsf{rhess}}_{{\mathcal {H}}, f}^{02}(\xi _{{\mathcal {H}}}, \eta _{{\mathcal {H}}}) = \langle {\varPi }_{{\mathcal {H}}, {{\textsf{g}}}}\nabla _{\xi _{{\mathcal {H}}}}{\textsf{rgrad}}_{{\mathcal {H}}, f}, \eta _{{\mathcal {H}}}\rangle _R = \langle \nabla _{\xi _{{\mathcal {H}}}}{\textsf{rgrad}}_{{\mathcal {H}}, f}, \eta _{{\mathcal {H}}}\rangle _R\) the \({\mathcal {H}}\)-Riemannian Hessian bilinear form. The next theorem shows how to compute the Riemannian connection and the associated Riemannian Hessian.

Theorem 3.1

Let \(({\mathcal {M}}, {{\textsf{g}}}, {\mathcal {E}})\) be an embedded ambient structure of a Riemannian manifold \({\mathcal {M}}\). There exists an \({\mathcal {E}}\)-valued bilinear form \({\mathcal {X}}\) sending a pair of vector fields \((\xi , \eta )\) to \({\mathcal {X}}(\xi , \eta )\in {\mathcal {E}}\) such that for any vector field \(\xi _0\):

$$\begin{aligned} \langle {\mathcal {X}}(\xi , \eta ), \xi _0\rangle _{{\mathcal {E}}} = \langle \xi , ({\textrm{D}}_{\xi _0}{{\textsf{g}}}) \eta \rangle _{{\mathcal {E}}}. \end{aligned}$$
(3.7)

Let \({\varPi }_{{\mathcal {M}}, {{\textsf{g}}}}\) be the projection from \({\mathcal {E}}\) to the tangent bundle of \({\mathcal {M}}\). Then \({\varPi }_{{\mathcal {M}}, {{\textsf{g}}}}{{\textsf{g}}}^{-1}{\mathcal {X}}(\xi , \eta )\) is uniquely defined given \(\xi , \eta \), and \({\mathcal {X}}(\xi , \eta )\) is itself unique if we require \({\mathcal {X}}(\xi (Y), \eta (Y))\) to be in \(T_Y{\mathcal {M}}\) for all \(Y\in {\mathcal {M}}\). For two vector fields \(\xi , \eta \) on \({\mathcal {M}}\), define

$$\begin{aligned} \begin{aligned}&{\textrm{K}}(\xi , \eta ):= \frac{1}{2}(({\textrm{D}}_{\xi }{{\textsf{g}}})\eta +({\textrm{D}}_{\eta }{{\textsf{g}}})\xi -{\mathcal {X}}(\xi , \eta ))\in {\mathcal {E}},\\&{\hat{\nabla }}_{\xi }\eta := {\textrm{D}}_{\xi } \eta + {{\textsf{g}}}^{-1}{\textrm{K}}(\xi , \eta ),\\&\nabla _{\xi }\eta := {\varPi }_{{\mathcal {M}},{{\textsf{g}}}}{\hat{\nabla }}_{\xi }\eta = {\varPi }_{{\mathcal {M}},{{\textsf{g}}}}({\textrm{D}}_{\xi } \eta + {{\textsf{g}}}^{-1}{\textrm{K}}(\xi , \eta )). \end{aligned} \end{aligned}$$
(3.8)

Then \(\nabla _{\xi }\eta \) is the covariant derivative associated with the Levi-Civita connection. It could be written using the Christoffel function \({\varGamma }\):

$$\begin{aligned} \begin{aligned} {\varGamma }(\xi , \eta )&:= -({\textrm{D}}_{\xi }{\varPi }_{{\mathcal {M}},{{\textsf{g}}}})\eta + {\varPi }_{{\mathcal {M}},{{\textsf{g}}}}{{\textsf{g}}}^{-1}{\textrm{K}}(\xi , \eta ),\\ \nabla _{\xi }\eta&= {\textrm{D}}_{\xi }\eta +{\varGamma }(\xi , \eta ). \end{aligned} \end{aligned}$$
(3.9)

If \({\mathcal {H}}\) is a subbundle of \(T{\mathcal {M}}\), and \(\xi _{{\mathcal {H}}}, \eta _{{\mathcal {H}}}\) are two \({\mathcal {H}}\)-valued vector fields, we have:

$$\begin{aligned} \begin{aligned}&{\varPi }_{{\mathcal {H}}, {{\textsf{g}}}}\nabla _{\xi _{{\mathcal {H}}}}\eta _{{\mathcal {H}}}= {\textrm{D}}_{\xi _{{\mathcal {H}}}}\eta _{{\mathcal {H}}}+{\varGamma }_{{\mathcal {H}}}(\xi _{{\mathcal {H}}}, \eta _{{\mathcal {H}}}) \text { with }\\&{\varGamma }_{{\mathcal {H}}}(\xi _{{\mathcal {H}}}, \eta _{{\mathcal {H}}}):= -({\textrm{D}}_{\xi _{{\mathcal {H}}}}{\varPi }_{{\mathcal {H}},{{\textsf{g}}}})\eta _{{\mathcal {H}}}+ {\varPi }_{{\mathcal {H}},{{\textsf{g}}}}{{\textsf{g}}}^{-1}{\textrm{K}}(\xi _{{\mathcal {H}}}, \eta _{{\mathcal {H}}}). \end{aligned} \end{aligned}$$
(3.10)

If f is a function on \({\mathcal {M}}\), \(\hat{f}_{Y}\) is an ambient gradient of f and \(\hat{f}_{YY}\) is the ambient Hessian operator, then \({\textsf{rhess}}_{{\mathcal {H}}, f}^{11}\xi _{{\mathcal {H}}}:= {\varPi }_{{\mathcal {H}},{{\textsf{g}}}}\nabla _{\xi _{{\mathcal {H}}}}{\textsf{rgrad}}_{{\mathcal {H}}, f}\) and \({\textsf{rhess}}_{{\mathcal {H}}, f}^{02}\) are given by:

$$\begin{aligned}&{\textsf{rhess}}_{{\mathcal {H}}, f}^{11}\xi _{{\mathcal {H}}}= {\varPi }_{{\mathcal {H}},{{\textsf{g}}}}{{\textsf{g}}}^{-1}(\hat{f}_{YY}\xi _{{\mathcal {H}}}+ {{\textsf{g}}}({\textrm{D}}_{\xi _{{\mathcal {H}}}}({\varPi }_{{\mathcal {H}},{{\textsf{g}}}}{{\textsf{g}}}^{-1}))\hat{f}_{Y}+{\textrm{K}}(\xi _{{\mathcal {H}}}, {\varPi }_{{\mathcal {H}},{{\textsf{g}}}}{{\textsf{g}}}^{-1}\hat{f}_{Y})) \nonumber \\&\quad ={\varPi }_{{\mathcal {H}},{{\textsf{g}}}}{{\textsf{g}}}^{-1}(\hat{f}_{YY}\xi _{{\mathcal {H}}}+ {{\textsf{g}}}({\textrm{D}}_{\xi _{{\mathcal {H}}}}{\varPi }_{{\mathcal {H}}, {{\textsf{g}}}}){{\textsf{g}}}^{-1}\hat{f}_{Y}-({\textrm{D}}_{\xi _{{\mathcal {H}}}}{{\textsf{g}}}){{\textsf{g}}}^{-1}\hat{f}_{Y}+{\textrm{K}}(\xi _{{\mathcal {H}}}, {\varPi }_{{\mathcal {H}}, {{\textsf{g}}}}{{\textsf{g}}}^{-1}\hat{f}_{Y})), \end{aligned}$$
(3.11)
$$\begin{aligned}&{\textsf{rhess}}_{{\mathcal {H}}, f}^{02}(\eta _{{\mathcal {H}}}, \xi _{{\mathcal {H}}}) = \langle \nabla _{\xi _{{\mathcal {H}}}}{\varPi }_{{\mathcal {H}},{{\textsf{g}}}} {{\textsf{g}}}^{-1}\hat{f}_{Y}, {{\textsf{g}}}\eta _{{\mathcal {H}}}\rangle _{{\mathcal {E}}}\nonumber \\&= \hat{f}_{YY}(\xi _{{\mathcal {H}}}, \eta _{{\mathcal {H}}}) -\langle {\varGamma }_{{\mathcal {H}}}(\xi _{{\mathcal {H}}}, \eta _{{\mathcal {H}}}),\hat{f}_{Y}\rangle _{{\mathcal {E}}}. \end{aligned}$$
(3.12)

The form \({\varGamma }(\xi , \eta )\) appeared in [9], where it was computed for the case of a Stiefel manifold and called a Christoffel function. It includes the Christoffel metric term \({\textrm{K}}\) and the derivative of \({\varPi }_{{\mathcal {M}}, {{\textsf{g}}}}\). Evaluated at \(Y\in {\mathcal {M}}\), it depends only on the tangent vectors \(\eta (Y)\) and \(\xi (Y)\), not on the whole vector fields. Equation (2.57) in that reference is the expression of \({\textsf{rhess}}^{02}_f\) in terms of \({\varGamma }\) above. In [9], \({\varGamma }_{{\mathcal {H}}}\) was computed for a Grassmann manifold. The formulation for subbundles allows us to extend the result to Riemannian submersions and quotient manifolds. Equation (3.11) generalizes the Weingarten map formula in [2, equations 7, 10] when \({{\textsf{g}}}=I_{{\mathcal {E}}}\), since by the product rule

$$\begin{aligned} {\varPi }_{{\mathcal {H}},{{\textsf{g}}}}({\textrm{D}}_{\xi _{{\mathcal {H}}}}{\varPi }_{{\mathcal {H}}, {{\textsf{g}}}})\hat{f}_{Y}= ({\textrm{D}}_{\xi _{{\mathcal {H}}}}({\varPi }_{{\mathcal {H}}, {{\textsf{g}}}}^2))\hat{f}_{Y}- ({\textrm{D}}_{\xi _{{\mathcal {H}}}}{\varPi }_{{\mathcal {H}}, {{\textsf{g}}}}){\varPi }_{{\mathcal {H}},{{\textsf{g}}}}\hat{f}_{Y}, \end{aligned}$$

and \({\textsf{rhess}}_{{\mathcal {H}}, f}^{11}\xi _{{\mathcal {H}}}\) becomes

$$\begin{aligned} {\varPi }_{{\mathcal {H}},{{\textsf{g}}}}(\hat{f}_{YY}\xi _{{\mathcal {H}}}) + {\varPi }_{{\mathcal {H}},{{\textsf{g}}}}({\textrm{D}}_{\xi _{{\mathcal {H}}}}{\varPi }_{{\mathcal {H}}, {{\textsf{g}}}})\hat{f}_{Y}={\varPi }_{{\mathcal {H}},{{\textsf{g}}}}(\hat{f}_{YY}\xi _{{\mathcal {H}}}) + ({\textrm{D}}_{\xi _{{\mathcal {H}}}}{\varPi }_{{\mathcal {H}}, {{\textsf{g}}}})((I_{{\mathcal {E}}} -{\varPi }_{{\mathcal {H}},{{\textsf{g}}}})\hat{f}_{Y}). \end{aligned}$$

Here, \(V:= (I_{{\mathcal {E}}} -{\varPi }_{{\mathcal {H}},{{\textsf{g}}}})\hat{f}_{Y}\) is vertical (\({\varPi }_{{\mathcal {H}},{{\textsf{g}}}} V = 0\)) and \(({\textrm{D}}_{\xi _{{\mathcal {H}}}}{\varPi }_{{\mathcal {H}}, {{\textsf{g}}}})V\) is horizontal.

Proof

\({\mathcal {X}}\) is the familiar index-raising term: for \(Y\in {\mathcal {M}}\) and \(v_0, v_1, v_2\in T_Y{\mathcal {M}}\), as \(\langle v_1, ({\textrm{D}}_{v_0}{{\textsf{g}}}) v_2\rangle _{{\mathcal {E}}}\) is a trilinear function on \(T_Y{\mathcal {M}}\) and the Riemannian inner product on \(T_Y{\mathcal {M}}\) is nondegenerate, the index-raising bilinear form \({\tilde{{\mathcal {X}}}}\) with values in \(T_Y{\mathcal {M}}\) is uniquely defined, so \({\mathcal {X}}(\xi (Y), \eta (Y))={\tilde{{\mathcal {X}}}}(\xi (Y), \eta (Y))\) satisfies Eq. (3.7), where we consider \(T_Y{\mathcal {M}}\) as a subspace of \({\mathcal {E}}\). Thus, we have proved the existence of \({\mathcal {X}}\). If we take another \({\mathcal {E}}\)-valued function \({\mathcal {X}}_1\) satisfying the same condition but not necessarily with values in the tangent space, the expression \({\varPi }_{{\mathcal {M}},{{\textsf{g}}}}{{\textsf{g}}}^{-1}{\mathcal {X}}_1\), and hence \({\varPi }_{{\mathcal {M}},{{\textsf{g}}}}{{\textsf{g}}}^{-1}{\textrm{K}}\), is independent of the choice of \({\mathcal {X}}_1\), as for three vector fields \(\xi _0, \xi , \eta \)

$$\begin{aligned} \langle {{\textsf{g}}}\xi _0, {\varPi }_{{\mathcal {M}},{{\textsf{g}}}}{{\textsf{g}}}^{-1}{\mathcal {X}}_1(\xi , \eta )\rangle _{{\mathcal {E}}} = \langle {{\textsf{g}}}\xi _0, {{\textsf{g}}}^{-1}{\mathcal {X}}_1(\xi , \eta )\rangle _{{\mathcal {E}}}=\langle \xi , ({\textrm{D}}_{\xi _0}{{\textsf{g}}})\eta \rangle _{{\mathcal {E}}}. \end{aligned}$$

We can verify directly that \(\nabla _{\xi }\eta \) satisfies the conditions of a covariant derivative: it is linear in \(\xi \) and satisfies the product rule with respect to \(\eta \). Similar to the calculation with coordinate charts, we can show \(\nabla \) is compatible with the metric: for two vector fields \(\eta , \xi \), \(2\langle \nabla _{\xi }\eta , {{\textsf{g}}}\eta \rangle _{{\mathcal {E}}}= 2\langle {\varPi }_{{\mathcal {M}},{{\textsf{g}}}}{\hat{\nabla }}_{\xi } \eta , {{\textsf{g}}}\eta \rangle _{{\mathcal {E}}}\), which is \(2\langle {\textrm{D}}_{\xi } \eta +{{\textsf{g}}}^{-1}{\textrm{K}}(\xi , \eta ), {{\textsf{g}}}\eta \rangle _{{\mathcal {E}}}\) by definition and by the property of the projection. Expanding the last expression and using \(\langle {\mathcal {X}}(\xi , \eta ), \eta \rangle _{{\mathcal {E}}}=\langle ({\textrm{D}}_{\eta }{{\textsf{g}}})\xi ,\eta \rangle _{{\mathcal {E}}}\)

$$\begin{aligned} \begin{aligned}&2\langle {\textrm{D}}_{\xi } \eta , {{\textsf{g}}}\eta \rangle _{{\mathcal {E}}}+2\langle {{\textsf{g}}}^{-1}{\textrm{K}}(\xi , \eta ), {{\textsf{g}}}\eta \rangle _{{\mathcal {E}}}\\&= 2\langle {\textrm{D}}_{\xi } \eta , {{\textsf{g}}}\eta \rangle _{{\mathcal {E}}} + \langle ({\textrm{D}}_{\xi }{{\textsf{g}}})\eta + ({\textrm{D}}_{\eta }{{\textsf{g}}})\xi -{\mathcal {X}}(\xi , \eta ), \eta \rangle _{{\mathcal {E}}}\\&\quad =2\langle {\textrm{D}}_{\xi }\eta , {{\textsf{g}}}\eta \rangle _{{\mathcal {E}}} + \langle ({\textrm{D}}_{\xi }{{\textsf{g}}})\eta , \eta \rangle _{{\mathcal {E}}} = {\textrm{D}}_{\xi }\langle \eta , {{\textsf{g}}}\eta \rangle _{{\mathcal {E}}}. \end{aligned} \end{aligned}$$

The torsion-free property follows from the fact that \({\textrm{K}}\) is symmetric and from Eq. (3.6):

$$\begin{aligned} \nabla _{\xi }\eta - \nabla _{\eta }\xi = {\varPi }_{{\mathcal {M}},{{\textsf{g}}}}({\textrm{D}}_{\xi }\eta - {\textrm{D}}_{\eta }\xi ) = [\xi , \eta ]. \end{aligned}$$

For Eq. (3.9), we note \({\varPi }_{{\mathcal {M}},{{\textsf{g}}}}{\textrm{D}}_{\xi }\eta = {\textrm{D}}_{\xi }({\varPi }_{{\mathcal {M}},{{\textsf{g}}}}\eta ) - ({\textrm{D}}_{\xi }{\varPi }_{{\mathcal {M}},{{\textsf{g}}}})\eta \) so

$$\begin{aligned} \nabla _{\xi }\eta = {\textrm{D}}_{\xi }(\eta ) - ({\textrm{D}}_{\xi }{\varPi }_{{\mathcal {M}},{{\textsf{g}}}})\eta + {\varPi }_{{\mathcal {M}}, {{\textsf{g}}}}{{\textsf{g}}}^{-1}{\textrm{K}}(\xi , \eta ). \end{aligned}$$

For Eq. (3.10), note \({\varPi }_{{\mathcal {H}}, {{\textsf{g}}}}{\varPi }_{{\mathcal {M}}, {{\textsf{g}}}}= {\varPi }_{{\mathcal {H}}, {{\textsf{g}}}}\), since \({\mathcal {H}}_Y\subset T_Y{\mathcal {M}}\) for \(Y\in {\mathcal {M}}\). Hence, Eq. (3.8) implies

$$\begin{aligned} {\varPi }_{{\mathcal {H}}, {{\textsf{g}}}}\nabla _{\xi _{{\mathcal {H}}}}\eta _{{\mathcal {H}}}= {\varPi }_{{\mathcal {H}},{{\textsf{g}}}}({\textrm{D}}_{\xi _{{\mathcal {H}}}} \eta _{{\mathcal {H}}}+ {{\textsf{g}}}^{-1}{\textrm{K}}(\xi _{{\mathcal {H}}}, \eta _{{\mathcal {H}}})), \end{aligned}$$

and as before, we use \({\varPi }_{{\mathcal {H}},{{\textsf{g}}}}{\textrm{D}}_{\xi _{{\mathcal {H}}}}\eta _{{\mathcal {H}}}= {\textrm{D}}_{\xi _{{\mathcal {H}}}}({\varPi }_{{\mathcal {H}},{{\textsf{g}}}}\eta _{{\mathcal {H}}}) - ({\textrm{D}}_{\xi _{{\mathcal {H}}}}{\varPi }_{{\mathcal {H}},{{\textsf{g}}}})\eta _{{\mathcal {H}}}\) and \({\varPi }_{{\mathcal {H}},{{\textsf{g}}}}\eta _{{\mathcal {H}}}=\eta _{{\mathcal {H}}}\). The first line of Eq. (3.11) follows by definition and from \({\textrm{D}}_{\xi _{{\mathcal {H}}}}({\varPi }_{{\mathcal {H}},{{\textsf{g}}}}{{\textsf{g}}}^{-1}\hat{f}_{Y}) = ({\textrm{D}}_{\xi _{{\mathcal {H}}}}({\varPi }_{{\mathcal {H}},{{\textsf{g}}}}{{\textsf{g}}}^{-1}))\hat{f}_{Y}+{\varPi }_{{\mathcal {H}},{{\textsf{g}}}}{{\textsf{g}}}^{-1}\hat{f}_{YY}\). Expanding, and noting \({\varPi }_{{\mathcal {H}},{{\textsf{g}}}}{{\textsf{g}}}^{-1}({{\textsf{g}}}{\varPi }_{{\mathcal {H}},{{\textsf{g}}}}({\textrm{D}}_{\xi _{{\mathcal {H}}}}{{\textsf{g}}}^{-1}))\hat{f}_{Y}= -{\varPi }_{{\mathcal {H}},{{\textsf{g}}}}{{\textsf{g}}}^{-1}({\textrm{D}}_{\xi _{{\mathcal {H}}}}{{\textsf{g}}}){{\textsf{g}}}^{-1}\hat{f}_{Y}\) (as \({\varPi }_{{\mathcal {H}},{{\textsf{g}}}}\) is idempotent), we obtain the second line. For Eq. (3.12):

$$\begin{aligned} \begin{aligned}&\langle \nabla _{\xi _{{\mathcal {H}}}} {\varPi }_{{\mathcal {H}},{{\textsf{g}}}} ({{\textsf{g}}}^{-1}\hat{f}_{Y}), {{\textsf{g}}}\eta _{{\mathcal {H}}}\rangle _{{\mathcal {E}}}= {\textrm{D}}_{\xi _{{\mathcal {H}}}}\langle {\varPi }_{{\mathcal {H}},{{\textsf{g}}}} ({{\textsf{g}}}^{-1}\hat{f}_{Y}), {{\textsf{g}}}\eta _{{\mathcal {H}}}\rangle _{{\mathcal {E}}}- \langle {\varPi }_{{\mathcal {H}},{{\textsf{g}}}} ({{\textsf{g}}}^{-1}\hat{f}_{Y}), {{\textsf{g}}}\nabla _{\xi }\eta _{{\mathcal {H}}}\rangle _{{\mathcal {E}}}\\&\quad ={\textrm{D}}_{\xi _{{\mathcal {H}}}}\langle \hat{f}_{Y}, \eta _{{\mathcal {H}}}\rangle _{{\mathcal {E}}} -\langle \hat{f}_{Y}, {\varPi }_{{\mathcal {H}},{{\textsf{g}}}}{{\textsf{g}}}^{-1}{{\textsf{g}}}{\varPi }_{{\mathcal {H}},{{\textsf{g}}}}{\hat{\nabla }}_{\xi _{{\mathcal {H}}}}\eta _{{\mathcal {H}}}\rangle _{{\mathcal {E}}}\\&\quad ={\textrm{D}}_{\xi _{{\mathcal {H}}}}\langle \hat{f}_{Y}, \eta _{{\mathcal {H}}}\rangle _{{\mathcal {E}}} -\langle \hat{f}_{Y}, {\varPi }_{{\mathcal {H}},{{\textsf{g}}}}{\hat{\nabla }}_{\xi _{{\mathcal {H}}}}\eta _{{\mathcal {H}}}\rangle _{{\mathcal {E}}}\\&\quad ={\textrm{D}}_{\xi _{{\mathcal {H}}}}\langle \hat{f}_{Y}, \eta _{{\mathcal {H}}}\rangle _{{\mathcal {E}}} -\langle \hat{f}_{Y}, {\textrm{D}}_{\xi _{{\mathcal {H}}}}\eta _{{\mathcal {H}}}\rangle _{{\mathcal {E}}} - \langle \hat{f}_{Y}, {\varGamma }_{{\mathcal {H}}}(\xi _{{\mathcal {H}}}, \eta _{{\mathcal {H}}})\rangle _{{\mathcal {E}}} \\&\quad =\hat{f}_{YY}(\xi _{{\mathcal {H}}}, \eta _{{\mathcal {H}}}) - \langle \hat{f}_{Y}, {\varGamma }_{{\mathcal {H}}}(\xi _{{\mathcal {H}}}, \eta _{{\mathcal {H}}})\rangle _{{\mathcal {E}}}, \end{aligned} \end{aligned}$$

from compatibility with metric, idempotency of \({\varPi }_{{\mathcal {H}},{{\textsf{g}}}}\), Eqs. (3.10) and (3.2). \(\square \)

When the projection is given in terms of \({{\,\textrm{J}\,}}\), and \({{\,\textrm{J}\,}}\) is sufficiently smooth, we have:

Proposition 3.3

If \({{\,\textrm{J}\,}}\), as in Proposition 3.2, is of class \(C^2\), then:

$$\begin{aligned} \begin{aligned} {\varGamma }_{{\mathcal {H}}}(\xi _{{\mathcal {H}}}, \eta _{{\mathcal {H}}}) = {{\textsf{g}}}^{-1}{{\,\textrm{J}\,}}^{{\mathfrak {t}}}({{\,\textrm{J}\,}}{{\textsf{g}}}^{-1}{{\,\textrm{J}\,}}^{{\mathfrak {t}}})^{-1}({\textrm{D}}_{\xi _{{\mathcal {H}}}}{{\,\textrm{J}\,}})\eta _{{\mathcal {H}}}+ {\varPi }_{{\mathcal {H}}, {{\textsf{g}}}}{{\textsf{g}}}^{-1} {\textrm{K}}(\xi _{{\mathcal {H}}}, \eta _{{\mathcal {H}}}) \end{aligned} \end{aligned}$$
(3.13)

for two \({\mathcal {H}}\)-valued tangent vectors \(\xi _{{\mathcal {H}}},\eta _{{\mathcal {H}}}\) at \(Y\in {\mathcal {M}}\). We can evaluate \({\textsf{rhess}}_{{\mathcal {H}}, f}^{11}\xi _{{\mathcal {H}}}\) by setting \(\omega =\hat{f}_{Y}\) in the following formula, which is valid for all \({\mathcal {E}}\)-valued functions \(\omega \):

$$\begin{aligned} \begin{aligned}&\nabla _{\xi _{{\mathcal {H}}}}{\varPi }_{{\mathcal {H}},{{\textsf{g}}}}{{\textsf{g}}}^{-1} \omega = {\varPi }_{{\mathcal {H}},{{\textsf{g}}}} {{\textsf{g}}}^{-1} {\textrm{D}}_{\xi _{{\mathcal {H}}}}\omega - {\varPi }_{{\mathcal {H}},{{\textsf{g}}}}{{\textsf{g}}}^{-1}({\textrm{D}}_{\xi _{{\mathcal {H}}}} {{\textsf{g}}}){{\textsf{g}}}^{-1}\omega \\&\quad -{\varPi }_{{\mathcal {H}},{{\textsf{g}}}} ({\textrm{D}}_{\xi _{{\mathcal {H}}}}({{\textsf{g}}}^{-1}{{\,\textrm{J}\,}}^{{\mathfrak {t}}})) ({{\,\textrm{J}\,}}{{\textsf{g}}}^{-1}{{\,\textrm{J}\,}}^{{\mathfrak {t}}})^{-1}{{\,\textrm{J}\,}}{{\textsf{g}}}^{-1}\omega + {\varPi }_{{\mathcal {H}},{{\textsf{g}}}}{{\textsf{g}}}^{-1}{\textrm{K}}(\xi _{{\mathcal {H}}}, {\varPi }_{{\mathcal {H}},{{\textsf{g}}}}({{\textsf{g}}}^{-1} \omega )). \end{aligned} \end{aligned}$$
(3.14)

Proof

The first expression follows by expanding \({\textrm{D}}_{\xi _{{\mathcal {H}}}}{\varPi }_{{\mathcal {H}},{{\textsf{g}}}}\) in terms of \({{\,\textrm{J}\,}}\), noting \({{\,\textrm{J}\,}}\eta _{{\mathcal {H}}}= 0\). For the second, expand \({\varPi }_{{\mathcal {H}}, {{\textsf{g}}}}{\hat{\nabla }}_{\xi _{{\mathcal {H}}}}({\varPi }_{{\mathcal {H}},{{\textsf{g}}}}{{\textsf{g}}}^{-1}\omega )={\varPi }_{{\mathcal {H}}, {{\textsf{g}}}}{\textrm{D}}_{\xi _{{\mathcal {H}}}} ({\varPi }_{{\mathcal {H}},{{\textsf{g}}}}{{\textsf{g}}}^{-1}\omega )+ {\varPi }_{{\mathcal {H}},{{\textsf{g}}}}{{\textsf{g}}}^{-1}{\textrm{K}}(\xi _{{\mathcal {H}}}, {\varPi }_{{\mathcal {H}}, {{\textsf{g}}}}{{\textsf{g}}}^{-1}\omega )\), then expand the first term and use \({\varPi }_{{\mathcal {H}}, {{\textsf{g}}}}{{\textsf{g}}}^{-1}{{\,\textrm{J}\,}}^{{\mathfrak {t}}} = 0\). \(\square \)

The following proposition allows us to apply the results so far in familiar contexts:

Proposition 3.4

Let \(({\mathcal {M}}, {{\textsf{g}}}, {\mathcal {E}})\) be an embedded ambient structure.

  1.

    Fix an orthonormal basis \(e_i\) of \({\mathcal {E}}\) and let f be a function on \({\mathcal {M}}\) which is a restriction of a function \(\hat{f}\) on \({\mathcal {E}}\). Define \(\hat{f}_{Y}\) to be the function from \({\mathcal {M}}\) to \({\mathcal {E}}\) whose i-th component is the directional derivative \({\textrm{D}}_{e_i}\hat{f}\); then \(\hat{f}_{Y}\) is an ambient gradient. If \({\mathcal {M}}\) is defined by the equation \(C(Y) = 0\) (\(Y\in {\mathcal {M}}\)) with a full-rank Jacobian, then the nullspace of the Jacobian \({{\,\textrm{J}\,}}_{C}(Y)\) is the tangent space of \({\mathcal {M}}\) at Y; hence \({{\,\textrm{J}\,}}_{C}(Y)\) could be used as the operator \({{\,\textrm{J}\,}}(Y)\).

  2.

    (Riemannian submersion) Let \(({\mathcal {M}}, {{\textsf{g}}}, {\mathcal {E}})\) be an embedded ambient structure. Let \(\pi :{\mathcal {M}}\rightarrow {\mathcal {B}}\) be a Riemannian submersion, with \({\mathcal {H}}\) the corresponding horizontal subbundle of \(T{\mathcal {M}}\). If \(\xi , \eta \) are two vector fields on \({\mathcal {B}}\) with \(\xi _{{\mathcal {H}}}, \eta _{{\mathcal {H}}}\) their horizontal lifts, then the Levi-Civita connection \(\nabla ^{{\mathcal {B}}}_{\xi }\eta \) on \({\mathcal {B}}\) lifts to \({\varPi }_{{\mathcal {H}},{{\textsf{g}}}}\nabla _{\xi _{{\mathcal {H}}}}\eta _{{\mathcal {H}}}\), hence Eq. (3.10) applies. Also, Riemannian gradients and Hessians on \({\mathcal {B}}\) lift to \({\mathcal {H}}\)-Riemannian gradients and Hessians on \({\mathcal {M}}\).

Proof

The construction of \(\hat{f}_{Y}\) ensures \(\langle \hat{f}_{Y}, e_i\rangle _{{\mathcal {E}}} = {\textrm{D}}_{e_i}f\). The statement about the Jacobian is simply the implicit function theorem. The isometry of the horizontal lift and [22, Lemma 7.45, item 3] give us statement 2. \(\square \)

In practice, \(\hat{f}_{Y}\) is computed by index-raising the directional derivative. For clarity, we have so far used the subscript \({\mathcal {H}}\) to indicate the relation to a subbundle \({\mathcal {H}}\). For the rest of the paper, we will drop the subscript \({\mathcal {H}}\) on vector fields (referring to \(\xi \) instead of \(\xi _{{\mathcal {H}}}\)), as it will be clear from the context whether we are discussing a vector field in \({\mathcal {H}}\) or just a regular vector field.

Remark 3.1

The results of this section offer two theoretical insights for deciding among potential metric candidates in optimization problems:

  1.

    Non-constant ambient metrics may have the same big-O time complexity as constant ones. This is the case for the examples in this paper, where the constraints and the metrics are given by matrix polynomials or inversions. If the ambient Hessian can be computed efficiently, then in many cases the (possibly tedious) Riemannian Hessian expressions can be evaluated by operator composition with the same order-of-magnitude time complexity as the Riemannian gradient. This suggests non-constant metrics may be competitive if the improvement in convergence rate is significant. For certain problems involving positive-definite matrices, a non-constant metric is a better option ([25]).

  2.

    There is a theoretical bound for the cost of computing the gradient, assuming that the metric \({{\textsf{g}}}\) is easy to invert. If the complexity of computing \({{\textsf{g}}}\) and \({{\,\textrm{J}\,}}\) is known, it remains to estimate the cost of inverting \({{\,\textrm{J}\,}}{{\textsf{g}}}^{-1}{{\,\textrm{J}\,}}^{{\mathfrak {t}}}\) (or \({\textrm{N}}^{{\mathfrak {t}}}{{\textsf{g}}}{\textrm{N}}\)). In our examples these operators reduce to simple ones that can be inverted efficiently; otherwise, the system for \({{\,\textrm{J}\,}}{{\textsf{g}}}^{-1}{{\,\textrm{J}\,}}^{{\mathfrak {t}}}\) could be solved by a conjugate gradient (CG) method (one example is in [14]; a minimal matrix-free sketch follows this list). In that case, the time cost is proportional to the rank of \({{\,\textrm{J}\,}}\) (or \({\textrm{N}}\)) times the cost of each CG step, which can be estimated depending on the problem.
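Since \({{\,\textrm{J}\,}}{{\textsf{g}}}^{-1}{{\,\textrm{J}\,}}^{{\mathfrak {t}}}\) is symmetric positive-definite, a CG method applies without forming the operator as a matrix. The following is a minimal sketch, not code from [18]; the callables J, Jt, ginv and the flattened-vector convention are our illustrative assumptions.

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, cg

def solve_JginvJt(J, Jt, ginv, b):
    """Solve (J g^{-1} J^t) a = b matrix-free with CG.

    J, Jt, ginv are callables acting on flattened (1-D) arrays; the
    composed operator is symmetric positive-definite, so CG applies.
    """
    k = b.shape[0]
    op = LinearOperator((k, k), matvec=lambda a: J(ginv(Jt(a))))
    a, info = cg(op, b)
    assert info == 0, "CG did not converge"
    return a
```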

Example 3.1

Let \({\mathcal {M}}\) be a submanifold of \({\mathcal {E}}\), defined by a system of equations \(C(x)=0\), where \(C\) is a map from \({\mathcal {E}}\) to \({\mathbb {R}}^k\) (\(x\in {\mathcal {M}}\)). In this case, \({{\,\textrm{J}\,}}_{C}=C_{x}\) is the Jacobian of \(C\), assumed to be of full rank. The projection of \(\omega \in {\mathcal {E}}\) to the tangent space \(T_x{\mathcal {M}}\) is given by

$$\begin{aligned} {\varPi }_{{{\textsf{g}}}}\omega = \omega -C_{x}^{{{\textsf{T}}}}(C_{x}C_{x}^{{{\textsf{T}}}})^{-1}C_{x}\omega \end{aligned}$$
(3.15)

and the covariant derivative is given by \(\nabla _{\xi }\eta = {\textrm{D}}_{\xi }\eta + C_{x}^{{{\textsf{T}}}}(C_{x}C_{x}^{{{\textsf{T}}}})^{-1}({\textrm{D}}_\xi C_{x})\eta \) for two vector fields \(\xi , \eta \). With \({\varGamma }(\xi , \eta ) = C_{x}^{{{\textsf{T}}}}(C_{x}C_{x}^{{{\textsf{T}}}})^{-1}({\textrm{D}}_\xi C_{x})\eta \), the Riemannian Hessian bilinear form is computed from Eq. (3.12), and the Riemannian Hessian operator is:

$$\begin{aligned} {\varPi }_{{{\textsf{g}}}}(\hat{f}_{xx}\xi -({\textrm{D}}_\xi C_{x})^{{{\textsf{T}}}}(C_{x}C_{x}^{{{\textsf{T}}}})^{-1}C_{x}\hat{f}_x ). \end{aligned}$$

The expression \((C_{x}C_{x}^{{{\textsf{T}}}})^{-1}C_{x}\hat{f}_x\) is often used as an estimate for the Lagrange multiplier; this result was discussed in section 4.9 of [9]. When \(C(x) = x^{{{\textsf{T}}}}x- 1\) (the unit sphere), we can take \({{\,\textrm{J}\,}}_{C}\omega = x^{{{\textsf{T}}}}\omega \) (dropping the factor 2, which cancels in Eq. (3.15)), and the Riemannian connection is thus \(\nabla _{\xi }\eta = {\textrm{D}}_{\xi }\eta +x\xi ^{{{\textsf{T}}}}\eta \), a well-known result.
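To make the sphere case concrete, here is a small numerical sanity check of Eq. (3.15) and the connection term (a NumPy sketch of ours; all variable names are illustrative):

```python
import numpy as np

# Check Eq. (3.15) and Gamma(xi, eta) = x (xi^T eta) on the unit sphere
# C(x) = x^T x - 1, using the full Jacobian C_x = 2 x^T.
rng = np.random.default_rng(0)
n = 5
x = rng.standard_normal(n); x /= np.linalg.norm(x)
omega = rng.standard_normal(n)

Cx = 2 * x[None, :]                                           # Jacobian, shape (1, n)
proj = omega - Cx.T @ np.linalg.solve(Cx @ Cx.T, Cx @ omega)  # Eq. (3.15)
assert abs(x @ proj) < 1e-12                                  # projection is tangent

xi = omega - x * (x @ omega)                                  # two tangent vectors
eta = rng.standard_normal(n); eta -= x * (x @ eta)
gamma = Cx.T @ np.linalg.solve(Cx @ Cx.T, (2 * xi[None, :]) @ eta)
assert np.allclose(gamma, x * (xi @ eta))                     # Gamma = x xi^T eta
```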

Our main interest is the study of matrix manifolds. As seen, we need to compute \({\textrm{N}}^{{\mathfrak {t}}}\) or \({{\,\textrm{J}\,}}^{{\mathfrak {t}}}\), so we now review adjoint operators for basic matrix operations.

4 Matrix Manifolds: Inner Products and Adjoint Operators

4.1 Matrices and Adjoints

We will use the trace (Frobenius) inner product on the matrix vector spaces considered here. Again, the base field \({\mathbb {K}}\) is either \({\mathbb {R}}\) or \({\mathbb {C}}\). We use the letters m, n, p to denote the dimensions of vector spaces. We will prove results for the real and complex cases together, as often a complex result using the Hermitian transpose corresponds to a real result using the real transpose. The reason is that when \({\mathbb {C}}^{n\times m}\), as a real vector space, is equipped with the real inner product \({{\,\textrm{Re}\,}}{{\,\textrm{Tr}\,}}(ab^{{{\textsf{H}}}})\) (for \(a, b\in {\mathbb {C}}^{n\times m}\), where \({{\textsf{H}}}\) is the Hermitian transpose), the adjoint of scalar multiplication by a complex number c is multiplication by the conjugate \({\bar{c}}\). To fix some notations, we use the symbol \({{\,\mathrm{Tr_{{\mathbb {R}}}}\,}}\) to denote the real part of the trace, so for a matrix \(a\in {\mathbb {K}}^{n\times n}\), \({{\,\mathrm{Tr_{{\mathbb {R}}}}\,}}a = {{\,\textrm{Tr}\,}}a\) if \({\mathbb {K}}={\mathbb {R}}\) and \({{\,\mathrm{Tr_{{\mathbb {R}}}}\,}}a = {{\,\textrm{Re}\,}}({{\,\textrm{Tr}\,}}a)\) if \({\mathbb {K}}={\mathbb {C}}\). The symbol \({\mathfrak {t}}\) will be used on either an operator, where it specifies the adjoint with respect to these inner products, or a matrix, where it specifies the corresponding adjoint matrix. When \({\mathbb {K}}={\mathbb {R}}\), we take \({\mathfrak {t}}\) to be the real transpose, and when \({\mathbb {K}}={\mathbb {C}}\) we take \({\mathfrak {t}}\) to be the Hermitian transpose. The inner product of two matrices a, b is \({{\,\mathrm{Tr_{{\mathbb {R}}}}\,}}(ab^{{\mathfrak {t}}})\). Recall that we denote by \({\textrm{Sym}}_{{\mathfrak {t}}, {\mathbb {K}}, n}\) the space of all \({\mathfrak {t}}\)-symmetric matrices (\(A^{{\mathfrak {t}}} = A\)), and by \({\textrm{Skew}}_{{\mathfrak {t}}, {\mathbb {K}}, n}\) the space of all \({\mathfrak {t}}\)-antisymmetric matrices (\(A^{{\mathfrak {t}}} = -A\)). We consider both \({\textrm{Sym}}_{{\mathfrak {t}}, {\mathbb {K}}, n}\) and \({\textrm{Skew}}_{{\mathfrak {t}}, {\mathbb {K}}, n}\) inner product spaces under \({{\,\mathrm{Tr_{{\mathbb {R}}}}\,}}\). We defined the symmetrizer \({\textrm{sym}}_{{\mathfrak {t}}}\) and antisymmetrizer \({\textrm{skew}}_{{\mathfrak {t}}}\) in Sect. 2.1, with the usual meaning.

Proposition 4.1

With the above notations, let \(A_i, B_i, C_i, D_i, X\) be matrices such that the functional \(L(X)= \sum _{i=1}^k ({{\,\mathrm{Tr_{{\mathbb {R}}}}\,}}(A_i X B_i) + {{\,\mathrm{Tr_{{\mathbb {R}}}}\,}}(C_i X^{{\mathfrak {t}}} D_i))\) is well-formed. We have:

1. The matrix \({{\,\textrm{xtrace}\,}}(L, X) = \sum _{i=1}^k A_i^{{\mathfrak {t}}}B_i^{{\mathfrak {t}}} + D_i C_i\) is the unique matrix \(L_1\) such that \({{\,\mathrm{Tr_{{\mathbb {R}}}}\,}}L_1 X^{{\mathfrak {t}}}=L(X)\) for all \(X\in {\mathbb {K}}^{n\times m}\) (this is the gradient of L).

2. The matrix \({{\,\textrm{xtrace}\,}}^{{\textrm{sym}}}(L, X) = {\textrm{sym}}_{{\mathfrak {t}}}(\sum _{i=1}^k A_i^{{\mathfrak {t}}}B_i^{{\mathfrak {t}}} + D_i C_i)\) is the unique matrix \(L_2\in {\textrm{Sym}}_{{\mathfrak {t}}, {\mathbb {K}}, n}\) satisfying \({{\,\mathrm{Tr_{{\mathbb {R}}}}\,}}(L_2X^{{\mathfrak {t}}}) = L(X)\) for all \(X\in {\textrm{Sym}}_{{\mathfrak {t}}, {\mathbb {K}}, n}\).

3. The matrix \({{\,\textrm{xtrace}\,}}^{{\textrm{skew}}}(L, X) = {\textrm{skew}}_{{\mathfrak {t}}}(\sum _{i=1}^k A_i^{{\mathfrak {t}}}B_i^{{\mathfrak {t}}} + D_i C_i)\) is the unique matrix \(L_3\in {\textrm{Skew}}_{{\mathfrak {t}}, {\mathbb {K}}, n}\) satisfying \({{\,\mathrm{Tr_{{\mathbb {R}}}}\,}}(L_3X^{{\mathfrak {t}}}) = L(X)\) for all \(X\in {\textrm{Skew}}_{{\mathfrak {t}}, {\mathbb {K}}, n}\).

There is an abuse of notation, as \({{\,\textrm{xtrace}\,}}(L, X)\) is not a function of two variables; rather, X should be considered a (symbolic) variable and L a function in X. However, this notation is convenient in symbolic implementations.

Proof

We have \({{\,\mathrm{Tr_{{\mathbb {R}}}}\,}}({{\,\textrm{xtrace}\,}}(L)X^{{\mathfrak {t}}}) = L(X)\) from

$$\begin{aligned} {{\,\mathrm{Tr_{{\mathbb {R}}}}\,}}(A_i X B_i) = {{\,\mathrm{Tr_{{\mathbb {R}}}}\,}}(B_i^{{\mathfrak {t}}}X^{{\mathfrak {t}}}A^{{\mathfrak {t}}}_i) = {{\,\mathrm{Tr_{{\mathbb {R}}}}\,}}(A^{{\mathfrak {t}}}_iB_i^{{\mathfrak {t}}}X^{{\mathfrak {t}}}) \end{aligned}$$

and \({{\,\mathrm{Tr_{{\mathbb {R}}}}\,}}(C_i X^{{\mathfrak {t}}} D_i) = {{\,\mathrm{Tr_{{\mathbb {R}}}}\,}}( D_iC_i X^{{\mathfrak {t}}})\). Uniqueness follows from the fact that \({{\,\mathrm{Tr_{{\mathbb {R}}}}\,}}\) is a non-degenerate bilinear form. The last two statements follow from

  • \({{\,\mathrm{Tr_{{\mathbb {R}}}}\,}}({{\,\textrm{xtrace}\,}}(L)X^{{\mathfrak {t}}})={{\,\mathrm{Tr_{{\mathbb {R}}}}\,}}({{\,\textrm{xtrace}\,}}(L)^{{\mathfrak {t}}}X)\) if \(X^{{\mathfrak {t}}} = X\).

  • \({{\,\mathrm{Tr_{{\mathbb {R}}}}\,}}({{\,\textrm{xtrace}\,}}(L)X^{{\mathfrak {t}}})=-{{\,\mathrm{Tr_{{\mathbb {R}}}}\,}}({{\,\textrm{xtrace}\,}}(L)^{{\mathfrak {t}}}X)\) if \(X^{{\mathfrak {t}}} = -X\).

\(\square \)

Remark 4.1

The index-raising operation/gradient \({{\,\textrm{xtrace}\,}}\) could be implemented as a symbolic operation on matrix trace expressions, as it involves only linear operations, matrix transposes, and multiplications. It could be used to compute an ambient gradient, for example. For another application, let \({\mathcal {M}}\) be a manifold with ambient space \({\mathbb {K}}^{n\times m}\), and recall \({\textsf{rhess}}_f^{02}(\xi , \eta ) = \hat{f}_{YY}(\xi , \eta ) -\langle {\varGamma }(\xi , \eta ), \hat{f}_{Y}\rangle _{{\mathcal {E}}}\). Assume \(\langle {\varGamma }(\xi , \eta ),\hat{f}_{Y}\rangle _{{\mathcal {E}}} = \sum _i{{\,\mathrm{Tr_{{\mathbb {R}}}}\,}}(A_i \eta B_i) + {{\,\mathrm{Tr_{{\mathbb {R}}}}\,}}(C_i \eta ^{{\mathfrak {t}}} D_i)\), where \(A_i, B_i, C_i, D_i\) do not depend on \(\eta \), and identify tangent vectors with their images in \({\mathbb {K}}^{n\times m}\); we have:

$$\begin{aligned} {\textsf{rhess}}_f^{11}\xi = {\varPi }_{{{\textsf{g}}}}{{\textsf{g}}}^{-1}{{\,\textrm{xtrace}\,}}({\textsf{rhess}}_f^{02}(\xi , \eta ), \eta ), \end{aligned}$$

as the inner product of the right-hand side with \(\eta \) is \({\textsf{rhess}}_f^{02}(\xi , \eta )\), and the projection ensures it is in the tangent space. If the ambient space is identified with \({\textrm{Sym}}_{{\mathfrak {t}}, {\mathbb {K}}, n}\), \({\textrm{Skew}}_{{\mathfrak {t}}, {\mathbb {K}}, n}\) or a direct sum of matrix spaces, we also have similar statements.
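As a quick numerical illustration of Proposition 4.1 (real case, \(k=1\)), the following sketch of ours checks that \({{\,\textrm{xtrace}\,}}\) recovers the gradient of a trace functional; the dimensions are arbitrary.

```python
import numpy as np

# For L(X) = Tr(A X B) + Tr(C X^T D) with X in R^{n x m}, the matrix
# L1 = A^T B^T + D C satisfies Tr(L1 X^T) = L(X) for all X.
rng = np.random.default_rng(1)
n, m, k1, k2 = 4, 3, 2, 5
A = rng.standard_normal((k1, n)); B = rng.standard_normal((m, k1))
C = rng.standard_normal((k2, m)); D = rng.standard_normal((n, k2))

L = lambda X: np.trace(A @ X @ B) + np.trace(C @ X.T @ D)
L1 = A.T @ B.T + D @ C                        # xtrace(L, X)
X = rng.standard_normal((n, m))
assert np.isclose(np.trace(L1 @ X.T), L(X))   # Tr(L1 X^T) = L(X)
```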

Proposition 4.2

With the same notations as Proposition 4.1:

1. The adjoint of the left multiplication operator by a matrix \(A\in {\mathbb {K}}^{m\times n}\), sending \(X\in {\mathbb {K}}^{n\times p}\) to \(AX\in {\mathbb {K}}^{m\times p}\), is the left multiplication by \(A^{{\mathfrak {t}}}\), sending \(Y\in {\mathbb {K}}^{m\times p}\) to \(A^{{\mathfrak {t}}}Y \in {\mathbb {K}}^{n\times p}\).

2. The adjoint of the right multiplication operator by a matrix \(A\in {\mathbb {K}}^{m\times n}\) from \({\mathbb {K}}^{p\times m}\) to \({\mathbb {K}}^{p\times n}\) is the right multiplication by \(A^{{\mathfrak {t}}}\).

3. The adjoint of the operator sending \(X\mapsto X^{{\mathfrak {t}}}\) for \(X\in {\mathbb {K}}^{m\times m}\) is again the operator \(Y\mapsto Y^{{\mathfrak {t}}}\) for \(Y\in {\mathbb {K}}^{m\times m}\). Adjoint is additive, and \((F\circ G)^{{\mathfrak {t}}} = G^{{\mathfrak {t}}}\circ F^{{\mathfrak {t}}}\) for two linear operators F and G.

4. The adjoint of the left multiplication operator by \(A\in {\mathbb {K}}^{n\times p}\) sending \(X\in {\textrm{Sym}}_{{\mathfrak {t}}, {\mathbb {K}}, p}\) to \(AX\in {\mathbb {K}}^{n\times p}\) is the operator sending \(Y\mapsto \frac{1}{2}(A^{{\mathfrak {t}}}Y+Y^{{\mathfrak {t}}}A)\) for \(Y\in {\mathbb {K}}^{n\times p}\). Conversely, the adjoint of the operator \(Y\mapsto \frac{1}{2}(A^{{\mathfrak {t}}}Y+Y^{{\mathfrak {t}}}A)\in {\textrm{Sym}}_{{\mathfrak {t}}, {\mathbb {K}}, p}\) is the operator \(X\mapsto AX\).

5. The adjoint of the left multiplication operator by \(A\in {\mathbb {K}}^{n\times p}\) sending \(X\in {\textrm{Skew}}_{{\mathfrak {t}}, {\mathbb {K}}, p}\) to \(AX\in {\mathbb {K}}^{n\times p}\) is the operator sending \(Y\mapsto \frac{1}{2}(A^{{\mathfrak {t}}}Y-Y^{{\mathfrak {t}}}A)\) for \(Y\in {\mathbb {K}}^{n\times p}\). Conversely, the adjoint of the operator \(Y\mapsto \frac{1}{2}(A^{{\mathfrak {t}}}Y-Y^{{\mathfrak {t}}}A)\in {\textrm{Skew}}_{{\mathfrak {t}}, {\mathbb {K}}, p}\) is the operator \(X\mapsto AX\).

6. Adjoint is linear on the space of operators. If \(F_1\) and \(F_2\) are two linear operators from a space V to two spaces \(W_1\) and \(W_2\), then the adjoint of the direct sum operator (operator sending X to \(\begin{bmatrix}F_1 X&F_2 X\end{bmatrix}\)) is the map sending \(\begin{bmatrix}A\\ B\end{bmatrix}\) to \(F_1^{{\mathfrak {t}}}A + F_2^{{\mathfrak {t}}}B\). Adjoint of the map sending \(\begin{bmatrix}X_1 \\ X_2 \end{bmatrix}\) to \(FX_1\) is the map \(Y\mapsto \begin{bmatrix}F^{{\mathfrak {t}}}Y \\ 0 \end{bmatrix}\), and more generally a map sending a row block \(X_i\) of a matrix X to \(FX_i\) is the map sending Y to a matrix where the i-th block is \(F^{{\mathfrak {t}}}Y\), and zero outside of this block.

Most of the proof is just a simple application of trace calculus. For the first statement, the real case follows from \({{\,\textrm{Tr}\,}}(Aab^{{{\textsf{T}}}})={{\,\textrm{Tr}\,}}(a(A^{{{\textsf{T}}}}b)^{{{\textsf{T}}}})\), and \({{\,\mathrm{Tr_{{\mathbb {R}}}}\,}}(Aab^{{{\textsf{H}}}})={{\,\mathrm{Tr_{{\mathbb {R}}}}\,}}(a(A^{{{\textsf{H}}}}b)^{{{\textsf{H}}}})\) gives us the complex case. Statement 2 is proved similarly, and statement 3 is standard. Statements 4 and 5 are checked by direct substitution, and statement 6 is just the operator version of the corresponding matrix statement, observing for example:

$$\begin{aligned} {{\,\mathrm{Tr_{{\mathbb {R}}}}\,}}(F_1XA^{{\mathfrak {t}}} + F_2XB^{{\mathfrak {t}}}) = {{\,\mathrm{Tr_{{\mathbb {R}}}}\,}}((F_1^{{\mathfrak {t}}}A + F_2^{{\mathfrak {t}}}B)X^{{\mathfrak {t}}}). \end{aligned}$$

5 Application to Stiefel Manifold

The Stiefel manifold \({\textrm{St}}_{{\mathbb {K}}, d, n}\) in \({\mathcal {E}}={\mathbb {K}}^{n\times d}\) is defined by the equation \(Y^{{\mathfrak {t}}}Y= I_{d}\), and the tangent space at a point \(Y\) consists of matrices \(\eta \) satisfying \(\eta ^{{\mathfrak {t}}}Y+ Y^{{\mathfrak {t}}}\eta = 0\). When \({\mathbb {K}}={\mathbb {R}}\) we assume \(d< n\), so the manifold is connected. We apply the results of Sect. 3 to the full tangent bundle \({\mathcal {H}}=T{\textrm{St}}_{{\mathbb {K}}, d, n}\). We can consider an ambient metric:

$$\begin{aligned} {{\textsf{g}}}(Y) \omega = \alpha _0\omega + (\alpha _1-\alpha _0)YY^{{\mathfrak {t}}}\omega = \alpha _0(I_n - YY^{{\mathfrak {t}}})\omega + \alpha _1 YY^{{\mathfrak {t}}}\omega \end{aligned}$$
(5.1)

for \(\omega \in {\mathcal {E}}={\mathbb {K}}^{n\times d}\). It is easy to see \(\omega _0-YY^{{\mathfrak {t}}}\omega _0\) is an eigenvector of \({{\textsf{g}}}(Y)\) with eigenvalue \(\alpha _0\), and \(YY^{{\mathfrak {t}}}\omega _1\) is an eigenvector with eigenvalue \(\alpha _1\), for any \(\omega _0,\omega _1\in {\mathcal {E}}\), and these are the only eigenvalues and vectors. Hence, \({{\textsf{g}}}(Y)^{-1}\omega = \alpha _0^{-1}(I_n - YY^{{\mathfrak {t}}})\omega + \alpha ^{-1}_1YY^{{\mathfrak {t}}}\omega \) and \({{\textsf{g}}}\) is a Riemannian metric if \(\alpha _0, \alpha _1\) are positive. We can describe the tangent space as a nullspace of \({{\,\textrm{J}\,}}(Y)\) with \({{\,\textrm{J}\,}}(Y)\omega = \omega ^{{\mathfrak {t}}}Y+ Y^{{\mathfrak {t}}}\omega \in {\mathcal {E}}_{{{\,\textrm{J}\,}}}:={\textrm{Sym}}_{{\mathfrak {t}}, {\mathbb {K}}, d}\). We will evaluate everything at Y, so we will write \({{\,\textrm{J}\,}}\) and \({{\textsf{g}}}\) instead of \({{\,\textrm{J}\,}}(Y)\) and \({{\textsf{g}}}(Y)\), etc. By Proposition 4.2, \({{\,\textrm{J}\,}}^{{\mathfrak {t}}}a= (aY^{{\mathfrak {t}}})^{{\mathfrak {t}}} + Ya=2Ya\) for \(a\in {\mathcal {E}}_{{{\,\textrm{J}\,}}}\). We have \({{\textsf{g}}}^{-1}{{\,\textrm{J}\,}}^{{\mathfrak {t}}}a=\alpha _0^{-1}2Ya+(\alpha _1^{-1}-\alpha _0^{-1})2Ya= 2\alpha _1^{-1}Ya\). Thus \({{\,\textrm{J}\,}}{{\textsf{g}}}^{-1}{{\,\textrm{J}\,}}^{{\mathfrak {t}}}a= {{\,\textrm{J}\,}}(2\alpha _1^{-1}Ya) =4\alpha _1^{-1}a\) and by Proposition 3.1:

$$\begin{aligned} {\varPi }_{{{\textsf{g}}}}\nu = \nu - 2\alpha _1^{-1}\frac{\alpha _1}{4}(Y\nu ^{{\mathfrak {t}}}Y+ YY^{{\mathfrak {t}}}\nu )=\nu - \frac{1}{2}(Y\nu ^{{\mathfrak {t}}}Y+ YY^{{\mathfrak {t}}}\nu ). \end{aligned}$$
(5.2)

In this case, the ambient gradient \(\hat{f}_{Y}\) is the matrix of partial derivatives of an extension of f on the ambient space \({\mathbb {K}}^{n\times d}\). More conveniently, using the eigenspaces of \({{\textsf{g}}}\), \({\varPi }_{{{\textsf{g}}}}\nu = (I_n-YY^{{\mathfrak {t}}})\nu + Y{\textrm{skew}}_{{\mathfrak {t}}}(Y^{{\mathfrak {t}}}\nu )\) and \({{\textsf{g}}}^{-1}\hat{f}_{Y}= \alpha _0^{-1}(I_n-YY^{{\mathfrak {t}}})\hat{f}_{Y}+\alpha _1^{-1}YY^{{\mathfrak {t}}}\hat{f}_{Y}\), so the Riemannian gradient is

$$\begin{aligned} \begin{aligned}&{\varPi }_{{{\textsf{g}}}}{{\textsf{g}}}^{-1} \hat{f}_{Y}=\alpha _0^{-1}(I_n - YY^{{\mathfrak {t}}})\hat{f}_{Y}+\alpha _1^{-1}Y{\textrm{skew}}_{{\mathfrak {t}}}(Y^{{\mathfrak {t}}}\hat{f}_{Y})\\&\quad = \alpha _0^{-1}\hat{f}_{Y}+\frac{\alpha _1^{-1}-2\alpha _0^{-1}}{2}YY^{{\mathfrak {t}}}\hat{f}_{Y}-\frac{\alpha _1^{-1}}{2}Y\hat{f}_{Y}^{{\mathfrak {t}}}Y. \end{aligned} \end{aligned}$$

If \(\xi \) and \(\eta \) are vector fields, \(({\textrm{D}}_{\xi }{{\textsf{g}}}) \eta = (\alpha _1-\alpha _0)(\xi Y^{{\mathfrak {t}}}+Y\xi ^{{\mathfrak {t}}})\eta \). Using Proposition 4.1, we can take the cross term \({\mathcal {X}}(\xi , \eta ) = (\alpha _1-\alpha _0)(\xi \eta ^{{\mathfrak {t}}}+\eta \xi ^{{\mathfrak {t}}})Y\), thus:

$$\begin{aligned} {\textrm{K}}(\xi , \eta ) = \frac{\alpha _1 - \alpha _0}{2}((\xi Y^{{\mathfrak {t}}}\eta +\eta Y^{{\mathfrak {t}}}\xi )+Y(\xi ^{{\mathfrak {t}}}\eta +\eta ^{{\mathfrak {t}}}\xi ) -(\xi \eta ^{{\mathfrak {t}}}+\eta \xi ^{{\mathfrak {t}}})Y). \end{aligned}$$

By the tangent condition, \((\xi Y^{{\mathfrak {t}}}\eta +\eta Y^{{\mathfrak {t}}}\xi )= -(\xi \eta ^{{\mathfrak {t}}}+\eta \xi ^{{\mathfrak {t}}})Y\), hence \({\textrm{K}}(\xi , \eta ) = \frac{\alpha _1-\alpha _0}{2}F\) with \(F = Y(\xi ^{{\mathfrak {t}}}\eta + \eta ^{{\mathfrak {t}}}\xi ) - 2(\xi \eta ^{{\mathfrak {t}}} + \eta \xi ^{{\mathfrak {t}}})Y\). We see \(Y^{{\mathfrak {t}}}F\) is symmetric, so \({\textrm{skew}}_{{\mathfrak {t}}}(Y^{{\mathfrak {t}}}F) = 0\), therefore

$$\begin{aligned} {\varPi }_{{{\textsf{g}}}}{{\textsf{g}}}^{-1}F&= \alpha _0^{-1}(I_n-YY^{{\mathfrak {t}}})F = - 2\alpha _0^{-1}(I_n-YY^{{\mathfrak {t}}})(\xi \eta ^{{\mathfrak {t}}} + \eta \xi ^{{\mathfrak {t}}})Y, \\ {\varPi }_{{{\textsf{g}}}}{{\textsf{g}}}^{-1}{\textrm{K}}(\xi , \eta )&=\frac{\alpha _0-\alpha _1}{\alpha _0}(I_n-YY^{{\mathfrak {t}}})(\xi \eta ^{{\mathfrak {t}}}+\eta \xi ^{{\mathfrak {t}}})Y. \end{aligned}$$
(5.3)

Using \({{\textsf{g}}}^{-1}{{\,\textrm{J}\,}}^{{\mathfrak {t}}}({{\,\textrm{J}\,}}{{\textsf{g}}}^{-1}{{\,\textrm{J}\,}}^{{\mathfrak {t}}})^{-1}({\textrm{D}}_\xi {{\,\textrm{J}\,}})\eta = \frac{1}{2}Y(\xi ^{{\mathfrak {t}}}\eta +\eta ^{{\mathfrak {t}}}\xi )\) to evaluate Eq. (3.13), the connection for two vector fields \(\xi , \eta \) is:

$$\begin{aligned} \nabla _{\xi }\eta = {\textrm{D}}_{\xi }\eta +\frac{1}{2}Y(\xi ^{{\mathfrak {t}}}\eta +\eta ^{{\mathfrak {t}}}\xi ) +\frac{\alpha _0-\alpha _1}{\alpha _0}(I_n-YY^{{\mathfrak {t}}})(\xi \eta ^{{\mathfrak {t}}}+\eta \xi ^{{\mathfrak {t}}})Y. \end{aligned}$$
(5.4)

With \({\varPi }_0 = I_n-YY^{{\mathfrak {t}}}\) and \(\hat{f}_{YY}\) the ambient Hessian, Eq. (3.12) gives:

$$\begin{aligned} {\textsf{rhess}}^{02}_f(\xi , \eta )=\hat{f}_{YY}(\xi ,\eta ) -{{\,\mathrm{Tr_{{\mathbb {R}}}}\,}}\Big (\hat{f}_{Y}^{{\mathfrak {t}}}\Big \{\frac{1}{2}Y(\xi ^{{\mathfrak {t}}}\eta +\eta ^{{\mathfrak {t}}}\xi ) +\frac{\alpha _0-\alpha _1}{\alpha _0}{\varPi }_0(\xi \eta ^{{\mathfrak {t}}}+\eta \xi ^{{\mathfrak {t}}})Y\Big \}\Big ). \end{aligned}$$
(5.5)

By Remark 4.1, \({\textsf{rhess}}^{11}_f\xi \) is \({\varPi }_{{{\textsf{g}}}}{{\textsf{g}}}^{-1}{{\,\textrm{xtrace}\,}}({\textsf{rhess}}^{02}_f(\xi , \eta ), \eta )\), thus

$$\begin{aligned} {\textsf{rhess}}^{11}_f\xi ={\varPi }_{Y, {{\textsf{g}}}}{{\textsf{g}}}^{-1}\Big (\hat{f}_{YY}\xi -\frac{1}{2}\xi (\hat{f}_{Y}^{{\mathfrak {t}}}Y+Y^{{\mathfrak {t}}} \hat{f}_{Y}) -\frac{\alpha _0-\alpha _1}{\alpha _0}({\varPi }_0 \hat{f}_{Y}Y^{{\mathfrak {t}}}+Y\hat{f}_{Y}^{{\mathfrak {t}}}{\varPi }_0)\xi \Big ). \end{aligned}$$
(5.6)

We note the term inside \({\varPi }_{Y, {{\textsf{g}}}}{{\textsf{g}}}^{-1}\) can be modified by any expression sent to zero by \({\varPi }_{Y, {{\textsf{g}}}}{{\textsf{g}}}^{-1}\). The case \(\alpha _0 = 1, \alpha _1=\frac{1}{2}\) corresponds to the canonical metric on a Stiefel manifold, where the connection is given by formula 2.49 of [9] in a slightly different form; we can show they are the same by noting \(YY^{{\mathfrak {t}}}(\xi \eta ^{{\mathfrak {t}}}+\eta \xi ^{{\mathfrak {t}}})Y=Y(\xi ^{{\mathfrak {t}}}YY^{{\mathfrak {t}}}\eta + \eta ^{{\mathfrak {t}}}YY^{{\mathfrak {t}}}\xi )\) using the tangent constraint. The case \(\alpha _0 = \alpha _1 = 1\) corresponds to the constant trace metric, where we do not need to compute \({\textrm{K}}\). This family of metrics has been studied in [11], where a closed-form geodesic formula is provided. In [20], we also provide efficient closed-form geodesic formulas similar to those in [9].
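Before moving on, the following NumPy sketch of ours (with the toy cost \(f = {{\,\textrm{Tr}\,}}(Y^{{\mathfrak {t}}}AY)\)) checks the gradient formula above: the result is tangent, and it pairs with tangent vectors under \({{\textsf{g}}}\) exactly as the ambient gradient does.

```python
import numpy as np

# Verify the alpha-metric Stiefel gradient: grad = Pi_g g^{-1} fhat_Y
# is tangent and satisfies <xi, g(grad)> = <xi, fhat_Y> for tangent xi.
rng = np.random.default_rng(2)
n, d, a0, a1 = 7, 3, 1.0, 0.5
Y, _ = np.linalg.qr(rng.standard_normal((n, d)))
A = rng.standard_normal((n, n)); A = A + A.T
fhat_Y = 2 * A @ Y                                   # ambient gradient of Tr(Y^T A Y)

sym = lambda M: 0.5 * (M + M.T)
skew = lambda M: 0.5 * (M - M.T)
grad = (fhat_Y - Y @ (Y.T @ fhat_Y)) / a0 + Y @ skew(Y.T @ fhat_Y) / a1
assert np.allclose(sym(Y.T @ grad), 0)               # tangency: Y^T grad is skew

g = lambda w: a0 * w + (a1 - a0) * Y @ (Y.T @ w)     # metric operator, Eq. (5.1)
xi = rng.standard_normal((n, d)); xi -= Y @ sym(Y.T @ xi)   # a tangent vector
assert np.isclose(np.trace(xi.T @ g(grad)), np.trace(xi.T @ fhat_Y))
```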

6 Quotients of a Stiefel Manifold and Flag Manifolds

Continuing with the setup of the previous section, consider a Stiefel manifold \({\textrm{St}}_{{\mathbb {K}}, d, n}\) (we will assume \(0<d < n\)). The metric induced by the operator \({{\textsf{g}}}\) in Eq. (5.1), \(\alpha _0{{\,\mathrm{Tr_{{\mathbb {R}}}}\,}}\omega _1^{{\mathfrak {t}}}\omega _2 + (\alpha _1-\alpha _0){{\,\mathrm{Tr_{{\mathbb {R}}}}\,}}\omega _1^{{\mathfrak {t}}}YY^{{\mathfrak {t}}}\omega _2\) with \(Y\in {\textrm{St}}_{{\mathbb {K}}, d, n}\), \(\omega _1, \omega _2\in {\mathcal {E}}\), is preserved if we replace \(Y, \omega _1, \omega _2\) by \(YU, \omega _1U, \omega _2U\) for \(U^{{\mathfrak {t}}}U = I_d\); in other words, if we define the \({\mathfrak {t}}\)-orthogonal group by \({\textrm{U}}_{{\mathbb {K}}, d}:= \{U\in {\mathbb {K}}^{d\times d}\mid U^{{\mathfrak {t}}}U = I_d\}\), then this group acts by isometries of \({{\textsf{g}}}\). Therefore, any subgroup G of \({\textrm{U}}_{{\mathbb {K}}, d}\) acting on \({\textrm{St}}_{{\mathbb {K}}, d, n}\) by right multiplication also preserves the metric, and if G is compact, we can consider the quotient manifold \({\textrm{St}}_{{\mathbb {K}}, d, n}/G\), identifying \(Y\in {\textrm{St}}_{{\mathbb {K}}, d, n}\) with YU for \(U\in G\).

If the cost function f on the Stiefel manifold is invariant under transformations by a group G, it may be advantageous to consider optimization on a quotient manifold. The case of the Rayleigh quotient cost function \({{\,\mathrm{Tr_{{\mathbb {R}}}}\,}}Y^{{\mathfrak {t}}}A Y\) for \(Y\in {\textrm{St}}_{{\mathbb {K}}, d, n}\), where A is a positive-definite matrix, is well-known. As the cost function is invariant if we replace Y by YU for \(U\in G={\textrm{U}}_{{\mathbb {K}}, d}\), we can optimize over the Grassmann manifold \({\textrm{St}}_{{\mathbb {K}}, d, n}/{\textrm{U}}_{{\mathbb {K}}, d}\). When \(Y\in {\textrm{St}}_{{\mathbb {K}}, d, n}\) is divided into two column blocks \(Y=[Y_1|Y_2]\) of \(d_1 + d_2=d\) columns, the cost function \(f_1={{\,\mathrm{Tr_{{\mathbb {R}}}}\,}}Y_1^{{\mathfrak {t}}}A_1Y_1 + {{\,\mathrm{Tr_{{\mathbb {R}}}}\,}}Y_2^{{\mathfrak {t}}}A_2Y_2\) for two positive-definite matrices \(A_1, A_2\in {\mathbb {K}}^{n\times n}\) is invariant if we replace \(Y_1, Y_2\) by \(Y_1U_1, Y_2U_2\) for \(U_1\in {\textrm{U}}_{{\mathbb {K}}, d_1}, U_2\in {\textrm{U}}_{{\mathbb {K}}, d_2}\), while the cost function \(f_2 = {{\,\mathrm{Tr_{{\mathbb {R}}}}\,}}Y_1^{{\mathfrak {t}}}A_1Y_1 + {{\,\mathrm{Tr_{{\mathbb {R}}}}\,}}Y_2B_2Y_2^{{\mathfrak {t}}}\) for positive-definite matrices \(A_1\in {\mathbb {K}}^{n\times n}, B_2\in {\mathbb {K}}^{d_2\times d_2}\) is invariant only under replacing \(Y_1\) by \(Y_1U_1\), for a generic \(B_2\). In the first case, we can optimize over \({\textrm{St}}_{{\mathbb {K}}, d, n}/({\textrm{U}}_{{\mathbb {K}}, d_1}\times {\textrm{U}}_{{\mathbb {K}}, d_2})\), and in the second case we can optimize over \({\textrm{St}}_{{\mathbb {K}}, d, n}/{\textrm{U}}_{{\mathbb {K}}, d_1}\). We will define an optimization framework for a quotient of \({\textrm{St}}_{{\mathbb {K}}, d, n}\) by a group G, \(\{I_d\}\subset G\subset {\textrm{U}}_{{\mathbb {K}}, d}\), where G consists of \(q+1\) diagonal blocks, \(q\ge 0\), at most one of which (the last) is trivial. The case \(G =\{I_d\}\) corresponds to the Stiefel manifold, \(G={\textrm{U}}_{{\mathbb {K}}, d}\) corresponds to the Grassmann manifold, and the intermediate cases include flag manifolds. Background materials for optimization on flag manifolds are in [21, 28], but the examples just described and the review below should be sufficient to understand the setup and the results. We generalize the formula for \({\textsf{rhess}}^{02}\) in [28, Proposition 25] to the full family of metrics in [11] and provide a formula for \({\textsf{rhess}}^{11}\).

Let us describe the group G of block-diagonal matrices considered here. Assume there is a sequence of positive integers \(\varvec{\hat{d}}= \{d_1,\cdots , d_q\}\), \(d_i > 0\) for \(1\le i\le q\), such that \(\sum _{i=1}^q d_i \le d\). Set \(d_{q+1} = d - \sum _{i=1}^q d_i\), thus \(d_{q+1}\ge 0\). This sequence gives a partition of a matrix \(A\in {\mathbb {K}}^{d\times d}\) into \((q+1)\times (q+1)\) blocks \(A_{[ij]}\in {\mathbb {K}}^{d_i\times d_j}\), \(1\le i, j\le q+1\). The right-most and bottom blocks, corresponding to i or j equal to \(q+1\), are empty when \(d_{q+1} = 0\). Consider the subgroup \(G={\textrm{U}}_{{\mathbb {K}}, \varvec{\hat{d}}}= {\textrm{U}}_{{\mathbb {K}}, d_1}\times {\textrm{U}}_{{\mathbb {K}}, d_2}\cdots \times {\textrm{U}}_{{\mathbb {K}}, d_q}\times \{I_{d_{q+1}}\}\) of \({\textrm{U}}_{{\mathbb {K}}, d}\) consisting of block-diagonal matrices U with the i-th diagonal block \(U_{[ii]}\in {\textrm{U}}_{{\mathbb {K}}, d_i}\), \(1\le i\le q\), and \(U_{[q+1, q+1]} = I_{d_{q+1}}\). An element \(U\in {\textrm{U}}_{{\mathbb {K}}, \varvec{\hat{d}}}\) has the form

$$\begin{aligned} U ={{\,\textrm{diag}\,}}(U_{[11]}, \cdots , U_{[qq]}, I_{d_{q+1}})\text { for } U_{[ii]}\in {\textrm{U}}_{{\mathbb {K}}, d_i}, 1\le i\le q \end{aligned}$$

and transforms \(Y = [Y_1| \cdots |Y_q|Y_{q+1}]\) to \([Y_1U_{[11]}| \cdots |Y_qU_{[qq]}|Y_{q+1}]\). When \(q=0\), we define \(\varvec{\hat{d}}=\emptyset \) and \({\textrm{U}}_{{\mathbb {K}}, \varvec{\hat{d}}}= \{I_{d}\}\). We will consider the manifold \({\textrm{St}}_{{\mathbb {K}}, d, n}/G\) for \(G={\textrm{U}}_{{\mathbb {K}}, \varvec{\hat{d}}}\). Thus, when \(\varvec{\hat{d}}= \emptyset \), this quotient is the Stiefel manifold \({\textrm{St}}_{{\mathbb {K}}, d, n}\) itself, and when \(\varvec{\hat{d}}= \{d\}\), it is the Grassmann manifold. When \(d_{q+1} = 0\), i.e., \(\sum _{i=1}^q d_i = d\), the quotient is called a flag manifold, denoted by \({{\,\textrm{Flag}\,}}(d_1,\cdots , d_q; n, {\mathbb {K}})\). In the example above with \(q=2\), if \(d_1 + d_2 = d\), then \({\textrm{St}}_{{\mathbb {K}}, d, n}/({\textrm{U}}_{{\mathbb {K}}, d_1}\times {\textrm{U}}_{{\mathbb {K}}, d_2})\) is a flag manifold (we do not have a special name if \(d_1 + d_2 < d\)). Therefore, these quotients could be considered as intermediate objects between a Stiefel and a Grassmann manifold, as we will soon see more clearly.

Define the operator \({\textrm{symf}}\) acting on \({\mathbb {K}}^{d\times d}\), sending \(A\in {\mathbb {K}}^{d\times d}\) to \(A_{{\textrm{symf}}}\) such that \((A_{{\textrm{symf}}})_{[ij]} = \frac{1}{2}(A_{[ij]} +A_{[ji]}^{{\mathfrak {t}}})\) if \(1\le i\ne j\le q+1\) or \(i=j=q+1\), and \((A_{{\textrm{symf}}})_{[ii]} = A_{[ii]}\) if \(1 \le i\le q\). Thus, \({\textrm{symf}}\) preserves the diagonal blocks for \(1\le i \le q\), but symmetrizes the off-diagonal blocks and the \(q+1\)-th diagonal block. The following illustrates the operation when \(q=2\) for \(A = (A_{[ij]})\in {\mathbb {K}}^{d\times d}\).

$$\begin{aligned} A_{{\textrm{symf}}} = \frac{1}{2}\begin{bmatrix}2A_{[11]} & A_{[12]} + A_{[21]}^{{\mathfrak {t}}} & A_{[13]} + A_{[31]}^{{\mathfrak {t}}} \\ A_{[21]} + A_{[12]}^{{\mathfrak {t}}} & 2A_{[22]} & A_{[23]} + A_{[32]}^{{\mathfrak {t}}}\\ A_{[31]} + A_{[13]}^{{\mathfrak {t}}} & A_{[32]} + A_{[23]}^{{\mathfrak {t}}} & A_{[33]} + A_{[33]}^{{\mathfrak {t}}} \end{bmatrix}. \end{aligned}$$

For the case \(\varvec{\hat{d}}=\emptyset \) of the full Stiefel manifold, \({\textrm{symf}}\) is just \({\textrm{sym}}_{{\mathfrak {t}}}\) and for the case \(\varvec{\hat{d}}=\{d\}\) of the Grassmann manifold, \({\textrm{symf}}\) is the identity map. We show these quotients share similar Riemannian optimization settings.
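A direct implementation of \({\textrm{symf}}\) is short. The following sketch of ours symmetrizes the whole matrix and then restores the first q diagonal blocks, which agrees with the blockwise definition above; dims is the list \([d_1,\cdots , d_q, d_{q+1}]\) (with \(d_{q+1}\) possibly 0).

```python
import numpy as np

def symf(A, dims):
    """Apply symf: keep diagonal blocks 1..q, t-symmetrize all other blocks."""
    q = len(dims) - 1
    idx = np.cumsum([0] + list(dims))
    S = 0.5 * (A + A.conj().T)          # symmetrize everything first ...
    for i in range(q):                  # ... then restore diagonal blocks 1..q
        s = slice(idx[i], idx[i + 1])
        S[s, s] = A[s, s]
    return S

# e.g. the q = 2 illustration above, with d = d_1 + d_2 + d_3:
# symf(A, [3, 2, 1])
```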

Theorem 6.1

With the metric in Eq. (5.1), the horizontal space \({\mathcal {H}}_Y\) at \(Y\in {\textrm{St}}_{{\mathbb {K}}, d, n}\) of the quotient \({\textrm{St}}_{{\mathbb {K}}, d, n}\rightarrow {\textrm{St}}_{{\mathbb {K}}, d, n}/{\textrm{U}}_{{\mathbb {K}}, \varvec{\hat{d}}}\) consists of matrices \(\omega \in {\mathcal {E}}:={\mathbb {K}}^{n\times d}\) such that

$$\begin{aligned} (Y^{{\mathfrak {t}}}\omega )_{{\textrm{symf}}} = 0, \end{aligned}$$
(6.1)

or equivalently, \(Y^{{\mathfrak {t}}}\omega \) is \({\mathfrak {t}}\)-antisymmetric with the first q diagonal blocks \(((Y^{{\mathfrak {t}}}\omega )_{{\textrm{symf}}})_{[ii]}\), \(1\le i\le q\), vanishing. For \(\omega \in {\mathcal {E}}={\mathbb {K}}^{n\times d}\), the projection \({\varPi }_{{\mathcal {H}}}\) from \({\mathcal {E}}\) to \({\mathcal {H}}_Y\) and the Riemannian gradient are given by

$$\begin{aligned} {\varPi }_{{\mathcal {H}}}\omega = \omega - Y(Y^{{\mathfrak {t}}}\omega )_{{\textrm{symf}}}, \end{aligned}$$
(6.2)
$$\begin{aligned} {\varPi }_{{\mathcal {H}}}{{\textsf{g}}}^{-1}\hat{f}_{Y}= \alpha _0^{-1}\hat{f}_{Y}+ (\alpha _1^{-1} - \alpha _0^{-1})YY^{{\mathfrak {t}}}\hat{f}_{Y}- \alpha _1^{-1}Y(Y^{{\mathfrak {t}}}\hat{f}_{Y})_{{\textrm{symf}}}. \end{aligned}$$
(6.3)

Let \({\varPi }_0 = I_n - YY^{{\mathfrak {t}}}\). For two vector fields \(\xi , \eta \), the horizontal lift of the Levi-Civita connection and Riemannian Hessians are given by

$$\begin{aligned} {\varPi }_{{\mathcal {H}}}\nabla _{\xi }\eta = {\textrm{D}}_{\xi }\eta +Y(\xi ^{{\mathfrak {t}}}\eta )_{{\textrm{symf}}} +\frac{\alpha _0-\alpha _1}{\alpha _0}{\varPi }_0(\xi \eta ^{{\mathfrak {t}}}+\eta \xi ^{{\mathfrak {t}}})Y, \end{aligned}$$
(6.4)
$$\begin{aligned} {\textsf{rhess}}^{02}_f(\xi , \eta )=\hat{f}_{YY}(\xi ,\eta ) -{{\,\mathrm{Tr_{{\mathbb {R}}}}\,}}\Big (\hat{f}_{Y}^{{\mathfrak {t}}}\Big \{Y(\xi ^{{\mathfrak {t}}}\eta )_{{\textrm{symf}}} +\frac{\alpha _0-\alpha _1}{\alpha _0}{\varPi }_0(\xi \eta ^{{\mathfrak {t}}}+\eta \xi ^{{\mathfrak {t}}})Y\Big \}\Big ), \end{aligned}$$
(6.5)
$$\begin{aligned} {\textsf{rhess}}^{11}_f\xi ={\varPi }_{{\mathcal {H}}}{{\textsf{g}}}^{-1}\Big (\hat{f}_{YY}\xi -\xi (Y^{{\mathfrak {t}}}\hat{f}_{Y})_{{\textrm{symf}}} -\frac{\alpha _0-\alpha _1}{\alpha _0}({\varPi }_0 \hat{f}_{Y}Y^{{\mathfrak {t}}}+Y\hat{f}_{Y}^{{\mathfrak {t}}}{\varPi }_0)\xi \Big ). \end{aligned}$$
(6.6)

Proof

First, we note that \({\textrm{symf}}\) is a self-adjoint operator, as both the identity operator on the first q diagonal blocks and the symmetrizing operator on the remaining blocks are self-adjoint. The orbit of \(Y\in {\textrm{St}}_{{\mathbb {K}}, d, n}\) under the action of \({\textrm{U}}_{{\mathbb {K}}, \varvec{\hat{d}}}\) is \(Y{\textrm{U}}_{{\mathbb {K}}, \varvec{\hat{d}}}\), thus the vertical space consists of matrices of the form YD with D block-diagonal, \({\mathfrak {t}}\)-antisymmetric, and \(D_{[(q+1),(q+1)]} = 0\). Since \({{\textsf{g}}}YD = \alpha _1 YD\), a horizontal vector \(\omega \) satisfies \({\textrm{sym}}_{{\mathfrak {t}}}(Y^{{\mathfrak {t}}}\omega ) = 0\) (the tangent condition) and \({{\,\mathrm{Tr_{{\mathbb {R}}}}\,}}(\omega ^{{\mathfrak {t}}}YD) = 0\). This shows the first q diagonal blocks of \(Y^{{\mathfrak {t}}}\omega \) are zero, hence \((Y^{{\mathfrak {t}}}\omega )_{{\textrm{symf}}} = 0\).

For the projection, we proceed as in the Stiefel case, with the map \({{\,\textrm{J}\,}}\omega = (Y^{{\mathfrak {t}}}\omega )_{{\textrm{symf}}}\), mapping \({\mathcal {E}}\) to \({\mathcal {E}}_{{{\,\textrm{J}\,}}}= \{ A\in {\mathbb {K}}^{d\times d}\mid A_{[ij]} =A_{[ji]}^{{\mathfrak {t}}}, 1\le i \ne j\le q+1 \text { or } i=j=q+1\}\). Since \({\textrm{symf}}\) is self-adjoint, \({{\,\textrm{J}\,}}^{{\mathfrak {t}}} A = YA_{{\textrm{symf}}} = YA\) for \(A\in {\mathcal {E}}_{{{\,\textrm{J}\,}}}\). From here we get \(({{\,\textrm{J}\,}}{{\textsf{g}}}^{-1}{{\,\textrm{J}\,}}^{{\mathfrak {t}}})A = \alpha _1^{-1}A\) and Eq. (6.2) follows. Equation (6.3) is a substitution of \({{\textsf{g}}}^{-1}\hat{f}_{Y}\) into Eq. (6.2), noting \((Y^{{\mathfrak {t}}}({{\textsf{g}}}^{-1}\hat{f}_{Y}))_{{\textrm{symf}}} = \alpha _1^{-1}(Y^{{\mathfrak {t}}}\hat{f}_{Y})_{{\textrm{symf}}}\), using the eigendecomposition of \({{\textsf{g}}}\).

For the Levi-Civita connection, we use Eq. (3.10). For two horizontal vector fields \(\xi , \eta \), \(({\textrm{D}}_{\xi }{\varPi }_{{\mathcal {H}}})\eta = -\xi (Y^{{\mathfrak {t}}}\eta )_{{\textrm{symf}}} - Y(\xi ^{{\mathfrak {t}}}\eta )_{{\textrm{symf}}}=-Y(\xi ^{{\mathfrak {t}}}\eta )_{{\textrm{symf}}}\) and \({\varPi }_{{\mathcal {H}}}={\varPi }_{{\mathcal {H}}}{\varPi }\), where \({\varPi }\) is the Stiefel projection Eq. (5.2), hence \({\varPi }_{{\mathcal {H}}}{{\textsf{g}}}^{-1}{\textrm{K}}(\xi , \eta ) = {\varPi }{{\textsf{g}}}^{-1}{\textrm{K}}(\xi , \eta ) - Y(Y^{{\mathfrak {t}}}{\varPi }{{\textsf{g}}}^{-1}{\textrm{K}}(\xi , \eta ))_{{\textrm{symf}}}\). The last term vanishes from Eq. (5.3), and we have Eq. (6.4).

Equation (6.5) follows from Eq. (3.12). We derive Eq. (6.6) from Remark 4.1 and the self-adjointness of \({\textrm{symf}}\), expanding \({{\,\mathrm{Tr_{{\mathbb {R}}}}\,}}(\hat{f}_{Y}^{{\mathfrak {t}}}Y(\xi ^{{\mathfrak {t}}}\eta )_{{\textrm{symf}}})\) to

$$\begin{aligned} {{\,\mathrm{Tr_{{\mathbb {R}}}}\,}}(Y^{{\mathfrak {t}}}\hat{f}_{Y})^{{\mathfrak {t}}}(\xi ^{{\mathfrak {t}}}\eta )_{{\textrm{symf}}} = {{\,\mathrm{Tr_{{\mathbb {R}}}}\,}}(Y^{{\mathfrak {t}}}\hat{f}_{Y})_{{\textrm{symf}}}(\xi ^{{\mathfrak {t}}}\eta )^{{\mathfrak {t}}}={{\,\mathrm{Tr_{{\mathbb {R}}}}\,}}\xi (Y^{{\mathfrak {t}}}\hat{f}_{Y})_{{\textrm{symf}}}\eta ^{{\mathfrak {t}}}. \end{aligned}$$

\(\square \)

7 Positive-Definite Matrices

Consider the manifold \({\textrm{S}}^{+}_{{\mathbb {K}}, n}\) of \({{\mathfrak {t}}}\)-symmetric positive-definite matrices in \({\mathbb {K}}^{n\times n}\). In our approach, we take \({\mathcal {E}}={\mathbb {K}}^{n\times n}\) with its Frobenius inner product. The metric \({{\textsf{g}}}\) is \(\langle \xi , {{\textsf{g}}}\eta \rangle _{{\mathcal {E}}} = {{\,\mathrm{Tr_{{\mathbb {R}}}}\,}}(\xi Y^{-1}\eta Y^{-1})\) for two vector fields \(\xi , \eta \), with the metric operator \({{\textsf{g}}}:\eta \mapsto Y^{-1}\eta Y^{-1}\). The full tangent bundle \({\mathcal {H}}=T{\textrm{S}}^{+}_{{\mathbb {K}}, n}\) is identified fiber-wise with the nullspace of the operator \({{\,\textrm{J}\,}}:\eta \mapsto {{\,\textrm{J}\,}}\eta = \eta - \eta ^{{\mathfrak {t}}}\), with \({\mathcal {E}}_{{{\,\textrm{J}\,}}} = {\textrm{Skew}}_{{\mathfrak {t}}, {\mathbb {K}}, n}\). By item 5 in Proposition 4.2, we have \({{\,\textrm{J}\,}}^{{\mathfrak {t}}} a= 2a\), where a is a \({\mathfrak {t}}\)-antisymmetric matrix. From here, \({{\,\textrm{J}\,}}{{\textsf{g}}}^{-1}{{\,\textrm{J}\,}}^{{\mathfrak {t}}}a= 4YaY\) and (writing \({\varPi }\) for \({\varPi }_{{{\textsf{g}}}}\)):

$$\begin{aligned} {\varPi }\eta = \eta - {{\textsf{g}}}^{-1}{{\,\textrm{J}\,}}^{{\mathfrak {t}}}({{\,\textrm{J}\,}}{{\textsf{g}}}^{-1}{{\,\textrm{J}\,}}^{{\mathfrak {t}}})^{-1}{{\,\textrm{J}\,}}\eta =\eta - 2Y\Big (\frac{1}{4}Y^{-1}(\eta - \eta ^{{\mathfrak {t}}})Y^{-1}\Big )Y= \frac{1}{2}(\eta +\eta ^{{\mathfrak {t}}}) ={\textrm{sym}}_{{\mathfrak {t}}}\eta . \end{aligned}$$

Thus, the Riemannian gradient is \({\varPi }{{\textsf{g}}}^{-1}\hat{f}_{Y}=\frac{1}{2}Y(\hat{f}_{Y}+\hat{f}_{Y}^{{\mathfrak {t}}})Y\). Next, we compute \(({\textrm{D}}_{\xi }{{\textsf{g}}})\eta = {\textrm{D}}_{\xi |\eta \text { constant}}(Y^{-1}\eta Y^{-1}) = -Y^{-1}\xi Y^{-1}\eta Y^{-1} - Y^{-1}\eta Y^{-1}\xi Y^{-1}\), where we keep \(\eta \) constant in the derivative, as we evaluate \({\textrm{D}}_{\xi }{{\textsf{g}}}\) as an operator-valued function. From here, \(({\textrm{D}}_{\eta }{{\textsf{g}}})\xi =({\textrm{D}}_{\xi }{{\textsf{g}}})\eta \). We note for three vector fields \(\xi , \eta , \xi _0\)

$$\begin{aligned} {{\,\mathrm{Tr_{{\mathbb {R}}}}\,}}\big ((Y^{-1}\xi _0Y^{-1}\eta Y^{-1} + Y^{-1}\eta Y^{-1}\xi _0Y^{-1})\xi \big ) = {{\,\mathrm{Tr_{{\mathbb {R}}}}\,}}\big (\xi _0(Y^{-1}\eta Y^{-1}\xi Y^{-1} + Y^{-1}\xi Y^{-1}\eta Y^{-1})\big ). \end{aligned}$$

Thus, we can take \({\mathcal {X}}(\xi , \eta )=-Y^{-1}\eta Y^{-1}\xi Y^{-1} - Y^{-1}\xi Y^{-1}\eta Y^{-1}\) and

$$\begin{aligned} {\varGamma }(\xi , \eta )&= -({\textrm{D}}_{\xi }{\varPi })\eta - \frac{1}{2}Y(Y^{-1}\eta Y^{-1}\xi Y^{-1} + Y^{-1}\xi Y^{-1}\eta Y^{-1})Y,\\ \nabla _{\xi }\eta&= {\textrm{D}}_{\xi }\eta -\frac{1}{2}(\xi Y^{-1}\eta + \eta Y^{-1}\xi ). \end{aligned}$$
(7.1)

Hence, the Riemannian Hessian bilinear form \({\textsf{rhess}}^{02}(\xi , \eta )\) is

$$\begin{aligned} \hat{f}_{YY}(\xi , \eta ) + \frac{1}{2}{{\,\mathrm{Tr_{{\mathbb {R}}}}\,}}((\xi Y^{-1}\eta + \eta Y^{-1}\xi ) \hat{f}_{Y}) = \hat{f}_{YY}(\xi , \eta )+\frac{1}{2}{{\,\mathrm{Tr_{{\mathbb {R}}}}\,}}((\hat{f}_{Y}\xi Y^{-1} + Y^{-1}\xi \hat{f}_{Y})\eta ). \end{aligned}$$
(7.2)

Using a symmetric version of Remark 4.1, \({\textsf{rhess}}^{11}_f\xi = {\varPi }_{{{\textsf{g}}}}{{\textsf{g}}}^{-1}{\textrm{sym}}_{{\mathfrak {t}}}(\hat{f}_{YY}\xi + \frac{1}{2}(\hat{f}_{Y}\xi Y^{-1} + Y^{-1}\xi \hat{f}_{Y}))\). We get the following formula, as in [8]:

$$\begin{aligned} {\textsf{rhess}}_f^{11}\xi = Y{\textrm{sym}}_{{\mathfrak {t}}}(\hat{f}_{YY}\xi )Y+ {\textrm{sym}}_{{\mathfrak {t}}}(\xi {\textrm{sym}}_{{\mathfrak {t}}}(\hat{f}_{Y}) Y). \end{aligned}$$
(7.3)
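As a sanity check of the gradient in this section, the following NumPy sketch of ours (with the toy cost \(f = {{\,\textrm{Tr}\,}}(AY)\)) verifies that \(Y{\textrm{sym}}_{{\mathfrak {t}}}(\hat{f}_{Y})Y\) pairs with tangent vectors under the affine-invariant metric exactly as the ambient gradient does.

```python
import numpy as np

# Check <grad, xi>_g = Tr(grad Y^{-1} xi Y^{-1}) = D_xi f for f = Tr(A Y).
rng = np.random.default_rng(3)
n = 4
Q = rng.standard_normal((n, n)); Y = Q @ Q.T + n * np.eye(n)  # a PD point
A = rng.standard_normal((n, n)); A = 0.5 * (A + A.T)
fhat_Y = A                                     # ambient gradient of f = Tr(A Y)

grad = Y @ A @ Y                               # Pi g^{-1} fhat_Y = Y sym(fhat_Y) Y
Yi = np.linalg.inv(Y)
xi = rng.standard_normal((n, n)); xi = 0.5 * (xi + xi.T)      # tangent: symmetric
assert np.isclose(np.trace(grad @ Yi @ xi @ Yi), np.trace(xi @ fhat_Y))
```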

8 A Family of Metrics for the Manifold of Positive-Semidefinite Matrices of Fixed Rank

In [7], the authors defined a family of metrics on the manifold \({\textrm{S}}^{+}_{{\mathbb {K}}, p, n}\) of positive-semidefinite matrices of size n and rank p for the case \({\mathbb {K}}= {\mathbb {R}}\). Each such matrix has the form \(YPY^{{\mathfrak {t}}}\) with \(Y\in {\textrm{St}}_{{\mathbb {K}}, p, n}\) (\(Y^{{\mathfrak {t}}}Y=I_p\)) and P positive-definite of size \(p\times p\), up to the equivalence relation \((Y, P)\sim (YU, U^{{\mathfrak {t}}}PU)\) for a matrix \(U\in {\textrm{U}}_{{\mathbb {K}}, p}\) (that means \(U\in {\mathbb {K}}^{p\times p}\) and \(UU^{{\mathfrak {t}}} = I_p\)). So the manifold \({\textrm{S}}^{+}_{{\mathbb {K}}, p, n}\) could be identified with the quotient space \(({\textrm{St}}_{{\mathbb {K}}, p, n}\times {\textrm{S}}^{+}_{{\mathbb {K}}, p})/{\textrm{U}}_{{\mathbb {K}}, p}\) of the product of the Stiefel manifold \({\textrm{St}}_{{\mathbb {K}}, p, n}\) and the manifold of positive-definite matrices \({\textrm{S}}^{+}_{{\mathbb {K}}, p}\) by the \({\mathfrak {t}}\)-orthogonal group \({\textrm{U}}_{{\mathbb {K}}, p}\). (The paper actually uses \(R= P^{\frac{1}{2}}\) to parametrize the space.) From our point of view, the ambient space is \({\mathcal {E}}= {\mathbb {K}}^{n\times p}\times {\mathbb {K}}^{p\times p}\), and the tangent space is identified with the image of the operator \({\textrm{N}}_1\) from \({\mathbb {K}}^{(n-p)\times p}\times {\textrm{Sym}}_{{\mathfrak {t}}, {\mathbb {K}}, p}\) to \({\mathcal {E}}\), \({\textrm{N}}_1: (B, D) \mapsto (Y_{\perp } B, D)\), where the matrix \((Y| Y_{\perp })\) is \({\mathfrak {t}}\)-orthogonal. On the tangent space, [7] uses the metric \({{\,\textrm{Tr}\,}}(BB^{{\mathfrak {t}}})+k{{\,\textrm{Tr}\,}}(DP^{-1}DP^{-1})\) for a positive number k. The action of the group gives us the vertical vectors \((Y{{\textsf{q}}}, P{{\textsf{q}}}- {{\textsf{q}}}P)\) for a \({\mathfrak {t}}\)-antisymmetric matrix \({{\textsf{q}}}\) (\({{\textsf{q}}}+{{\textsf{q}}}^{{\mathfrak {t}}} = 0\)). In that paper, the image of \({\textrm{N}}_1\) is transverse, but not orthogonal, to the vertical space, and no second-order method is provided. We modify this approach, using a Riemannian quotient metric to provide a second-order method. In the following, the projection to the horizontal space is denoted by \({\varPi }_{{\mathcal {H}}}\). The horizontal lift of the Levi-Civita connection (denoted by \(\nabla ^{{\mathcal {H}}}\)) is \({\varPi }_{{\mathcal {H}}}\nabla \).

Theorem 8.1

Let \({\mathcal {M}}= {\textrm{St}}_{{\mathbb {K}}, p, n}\times {\textrm{S}}^{+}_{{\mathbb {K}}, p}\) with embedded ambient space \({\mathcal {E}}:= {\mathbb {K}}^{n\times p}\times {\mathbb {K}}^{p\times p}\). Identify the manifold \({\textrm{S}}^{+}_{{\mathbb {K}}, p, n}\) of positive-semidefinite matrices with \({\mathcal {B}}= ({\textrm{St}}_{{\mathbb {K}}, p, n}\times {\textrm{S}}^{+}_{{\mathbb {K}}, p})/{\textrm{U}}_{{\mathbb {K}}, p}\), where the pair \(S =(Y, P)\) represents the matrix \(YPY^{{\mathfrak {t}}}\) with \(Y\in {\textrm{St}}_{{\mathbb {K}}, p, n}, P\in {\textrm{S}}^{+}_{{\mathbb {K}}, p}\), and the action of \(U\in {\textrm{U}}_{{\mathbb {K}}, p}\) sends (Y, P) to \((YU, U^{{\mathfrak {t}}}PU)\). The self-adjoint metric operator

$$\begin{aligned} {{\textsf{g}}}(S)(\omega _Y, \omega _P)={{\textsf{g}}}(\omega _Y, \omega _P)=(\alpha _0\omega _Y + (\alpha _1 - \alpha _0)YY^{{\mathfrak {t}}}\omega _Y, \beta P^{-1}\omega _P P^{-1}) \end{aligned}$$
(8.1)

for \(\omega = (\omega _Y, \omega _P)\in {\mathcal {E}}= {\mathbb {K}}^{n\times p}\times {\mathbb {K}}^{p\times p}\) defines the inner product \({{\,\mathrm{Tr_{{\mathbb {R}}}}\,}}(\alpha _0\omega _Y^{{\mathfrak {t}}}\omega _Y + (\alpha _1 - \alpha _0)\omega _Y^{{\mathfrak {t}}}YY^{{\mathfrak {t}}}\omega _Y +\beta \omega _PP^{-1}\omega _P P^{-1})\) on \({\mathcal {E}}\), which induces a metric on \({\textrm{St}}_{{\mathbb {K}}, p, n}\times {\textrm{S}}^{+}_{{\mathbb {K}}, p}\) that is invariant under the action of \({\textrm{U}}_{{\mathbb {K}}, p}\) and induces a quotient metric on \({\textrm{S}}^{+}_{{\mathbb {K}}, p, n}\). Its tangent bundle \(T{\textrm{S}}^{+}_{{\mathbb {K}}, p, n}\) lifts to the subbundle \({\mathcal {H}}\subset T({\textrm{St}}_{{\mathbb {K}}, p, n}\times {\textrm{S}}^{+}_{{\mathbb {K}}, p})\) horizontal to the group action, where a vector \(\eta =(\eta _Y, \eta _P)\in {\mathcal {E}}\) is a horizontal tangent vector at \(S = (Y, P)\) if and only if it satisfies:

$$\begin{aligned} \alpha _1Y^{{\mathfrak {t}}}\eta _Y +\beta \eta _PP^{-1} - \beta P^{-1}\eta _P=0. \end{aligned}$$
(8.2)

\({\mathcal {H}}_S = {\mathcal {H}}_{(Y, P)}\) could be identified as the range of the one-to-one operator \({\textrm{N}}(S)\) from \({\mathcal {E}}_{{\textrm{N}}} = {\mathbb {K}}^{(n-p)\times p}\times {\textrm{Sym}}_{{\mathfrak {t}}, {\mathbb {K}}, p}\) to \({\mathcal {E}}\), mapping \((B, D)\in {\mathcal {E}}_{{\textrm{N}}}\) to:

$$\begin{aligned} {\textrm{N}}(S)(B, D) = ({\textrm{N}}_Y(B, D), {\textrm{N}}_P(B, D)):= (\beta Y(P^{-1}D - D P^{-1}) + Y_{\perp }B, \alpha _1 D), \end{aligned}$$
(8.3)

where \(Y_{\perp }\) is an orthogonal complement matrix to Y, i.e., \((Y| Y_{\perp })\in {\textrm{U}}_{{\mathbb {K}}, n}\). The projection of \(\omega = (\omega _Y, \omega _P)\in {\mathcal {E}}={\mathbb {K}}^{n\times p}\times {\mathbb {K}}^{p\times p}\) to the horizontal space \({\mathcal {H}}_{(Y, P)}\) is given by

$$\begin{aligned} \begin{aligned}&{\varPi }_{{\mathcal {H}}}(S)(\omega _Y, \omega _P) = (\beta Y(P^{-1}{\mathcal {D}}- {\mathcal {D}}P^{-1}) +\omega _Y-YY^{{\mathfrak {t}}}\omega _Y,\alpha _1{\mathcal {D}})\\&\quad \text {with }{\mathcal {D}}= {\mathcal {D}}(P) \omega := L(P)^{-1}{\textrm{sym}}_{{\mathfrak {t}}}(\omega _P + Y^{{\mathfrak {t}}}\omega _YP - PY^{{\mathfrak {t}}}\omega _Y)\\&\quad \text {where }L(P)X:= (\alpha _1 - 2\beta ) X + \beta (P X P^{-1} + P^{-1} X P). \end{aligned} \end{aligned}$$
(8.4)

The operator L(P) could be inverted by Proposition 8.1. The Riemannian Hessian could be evaluated by Eq. (3.11), and the lift of the Levi-Civita connection is given by

$$\begin{aligned} \begin{aligned} \nabla ^{{\mathcal {H}}}_{\xi }\eta := {\varPi }_{{\mathcal {H}}}\nabla _{\xi }\eta ={\textrm{D}}_{\xi }\eta -({\textrm{D}}_{\xi }{\varPi }_{{\mathcal {H}}})\eta + {\varPi }_{{\mathcal {H}}}{{\textsf{g}}}^{-1}{\textrm{K}}(\xi , \eta ), \end{aligned} \end{aligned}$$
(8.5)

for horizontal lifts \(\xi =(\xi _Y, \xi _P), \eta = (\eta _Y, \eta _P)\) of tangent vectors of \({\textrm{S}}^{+}_{{\mathbb {K}}, p, n}\) with

$$\begin{aligned} {\textrm{K}}(\xi , \eta )_Y&= \frac{\alpha _1 - \alpha _0}{2}(Y (\eta _Y^{{\mathfrak {t}}} \xi _Y+\xi _Y^{{\mathfrak {t}}}\eta _Y) -2(\eta _Y\xi _Y^{{\mathfrak {t}}}+\xi _Y\eta _Y^{{\mathfrak {t}}})Y),\\ {\textrm{K}}(\xi , \eta )_P&= - \frac{\beta }{2}(P^{-1} \eta _{P} P^{-1} \xi _{P} P^{-1} + P^{-1} \xi _{P} P^{-1} \eta _{P} P^{-1}), \end{aligned}$$
(8.6)
$$\begin{aligned} ({\textrm{D}}_{\xi }{\varPi }_{{\mathcal {H}}})\omega&= ((({\textrm{D}}_{\xi }{\varPi }_{{\mathcal {H}}})\omega )_Y,\alpha _1\mathring{{\mathcal {D}}}),\\ (({\textrm{D}}_{\xi }{\varPi }_{{\mathcal {H}}})\omega )_Y&= \beta \xi _Y(P^{-1}{\mathcal {D}}- {\mathcal {D}}P^{-1}) + \beta Y(P^{-1}\mathring{{\mathcal {D}}}- \mathring{{\mathcal {D}}}P^{-1} +{\mathcal {D}}P^{-1}\xi _P P^{-1}-P^{-1}\xi _P P^{-1}{\mathcal {D}}) \\&\quad -(\xi _Y Y^{{\mathfrak {t}}}+ Y\xi _Y^{{\mathfrak {t}}})\omega _Y, \\ \text {with }\mathring{{\mathcal {D}}}&:= ({\textrm{D}}_{\xi }{\mathcal {D}})\omega = L(P)^{-1}\{ {\textrm{sym}}_{{\mathfrak {t}}}( \xi _Y^{{\mathfrak {t}}}\omega _YP - P\xi _Y^{{\mathfrak {t}}}\omega _Y +Y^{{\mathfrak {t}}}\omega _Y\xi _P-\xi _PY^{{\mathfrak {t}}}\omega _Y) \\&\quad - \beta (\xi _P {\mathcal {D}}P^{-1} + P^{-1}{\mathcal {D}}\xi _P - P {\mathcal {D}}P^{-1}\xi _PP^{-1} - P^{-1}\xi _PP^{-1}{\mathcal {D}}P)\}. \end{aligned}$$
(8.7)

Thus, the Hessian could be evaluated with \(O(np^2)\) complexity, by operator composition using Eq. (3.11). Note that \(({\textrm{D}}_{\xi }{\varPi }_{{\mathcal {H}}})\eta \) could be further simplified if \(\eta =(\eta _Y, \eta _P)\) is a horizontal tangent vector, as \({\mathcal {D}}= \frac{1}{\alpha _1}\eta _P\) in that case.

Proof

From the horizontal condition, we have \({{\,\mathrm{Tr_{{\mathbb {R}}}}\,}}(\alpha _1\eta _Y^{{\mathfrak {t}}}Y{{\textsf{q}}}+ \beta (\eta _P^{{\mathfrak {t}}}P^{-1}(P{{\textsf{q}}}- {{\textsf{q}}}P)P^{-1})) = 0\) for a tangent vector \((\eta _Y, \eta _P)\) and a \({\mathfrak {t}}\)-antisymmetric matrix \({{\textsf{q}}}\). Using Proposition 4.1, this means \({\textrm{skew}}_{{\mathfrak {t}}}(\alpha _1\eta _Y^{{\mathfrak {t}}}Y +\beta P^{-1}\eta ^{{\mathfrak {t}}}_P - \beta \eta _P^{{\mathfrak {t}}} P^{-1}) = 0\). Using the facts that \(\eta _P^{{\mathfrak {t}}}=\eta _P\) and that \(\eta _Y^{{\mathfrak {t}}}Y\) is \({\mathfrak {t}}\)-antisymmetric, we obtain Eq. (8.2).

It is clear that \({\textrm{N}}(B, D)\) satisfies this equation and is one-to-one: if \({\textrm{N}}(B, D) = 0\), then immediately \(D=0\), and \(B=0\) since \(Y_{\perp }^{{\mathfrak {t}}}Y_{\perp }=I\). It is onto the horizontal space by a dimension count. The adjoint \({\textrm{N}}^{{\mathfrak {t}}} =({\textrm{N}}^{{\mathfrak {t}}}_B, {\textrm{N}}^{{\mathfrak {t}}}_D)\) has two components corresponding to the B and D factors. By Proposition 4.2 we have

$$\begin{aligned} {\textrm{N}}^{{\mathfrak {t}}}(\omega _Y, \omega _P)_B&= Y_{\perp }^{{\mathfrak {t}}} \omega _{Y},\\ {\textrm{N}}^{{\mathfrak {t}}}(\omega _Y, \omega _P)_D&= {\textrm{sym}}_{{\mathfrak {t}}}(\alpha _{1}\omega _{P} + \beta P^{-1} Y^{{\mathfrak {t}}} \omega _{Y} - \beta Y^{{\mathfrak {t}}} \omega _{Y} P^{-1}),\\ {\textrm{N}}^{{\mathfrak {t}}}{{\textsf{g}}}(\omega _Y, \omega _P)_B&= \alpha _{0} Y_{\perp }^{{\mathfrak {t}}} \omega _{Y},\\ {\textrm{N}}^{{\mathfrak {t}}}{{\textsf{g}}}(\omega _Y, \omega _P)_D&= \alpha _{1} \beta {\textrm{sym}}_{{\mathfrak {t}}}(P^{-1} \omega _{P} P^{-1} + P^{-1} Y^{{\mathfrak {t}}}\omega _{Y} - Y^{{\mathfrak {t}}} \omega _{Y} P^{-1}),\\ ({\textrm{N}}^{{\mathfrak {t}}}{{\textsf{g}}}{\textrm{N}}(B, D))_B&= \alpha _{0} B,\\ ({\textrm{N}}^{{\mathfrak {t}}}{{\textsf{g}}}{\textrm{N}}(B, D))_D&= \alpha _{1} \beta {\textrm{sym}}_{{\mathfrak {t}}}(P^{-1}(\alpha _1D) P^{-1} \\&\quad + P^{-1} Y^{{\mathfrak {t}}} (\beta Y(P^{-1}D - D P^{-1}) + Y_{\perp }B) \\&\quad -Y^{{\mathfrak {t}}} (\beta Y(P^{-1}D - D P^{-1}) + Y_{\perp }B) P^{-1})\\&= \alpha _{1}\beta {\textrm{sym}}_{{\mathfrak {t}}}(\alpha _1P^{-1}D P^{-1}+ \beta P^{-2}D - \beta P^{-1}D P^{-1}\\&\quad - \beta P^{-1}D P^{-1} +\beta D P^{-2})\\&= \alpha _1\beta P^{-1}((\alpha _1 -2\beta ) D + \beta P D P^{-1} + \beta P^{-1} D P)P^{-1}\\&= \alpha _1\beta P^{-1}L(P)DP^{-1}. \end{aligned}$$

Hence

$$\begin{aligned} ({\textrm{N}}^{{\mathfrak {t}}}{{\textsf{g}}}{\textrm{N}})^{-1}(B, D)&= (\alpha _0^{-1}B, (\alpha _1\beta )^{-1}L(P)^{-1}(PDP)),\\ ({\textrm{N}}^{{\mathfrak {t}}}{{\textsf{g}}}{\textrm{N}})^{-1}{\textrm{N}}^{{\mathfrak {t}}}{{\textsf{g}}}(\omega _Y, \omega _P)&= (Y_{\perp }^{{\mathfrak {t}}}\omega _Y, {\mathcal {D}}). \end{aligned}$$

The projection formula Eq. (8.4) is just \({\textrm{N}}({\textrm{N}}^{{\mathfrak {t}}}{{\textsf{g}}}{\textrm{N}})^{-1}{\textrm{N}}^{{\mathfrak {t}}}{{\textsf{g}}}\). The formulas for \({\textrm{K}}\) follow from the corresponding Stiefel and positive-definite manifold formulas. For \({\textrm{D}}_{\xi } {\varPi }_{{\mathcal {H}}}\), we take the directional derivative of Eq. (8.4) using standard matrix calculus; the only difficulty is \(\mathring{{\mathcal {D}}}= ({\textrm{D}}_{\xi }{\mathcal {D}})\omega \), which we obtain by evaluating \({\textrm{D}}_{\xi }(L(P){\mathcal {D}})\omega \) as

$$\begin{aligned}&(\alpha _1 - 2\beta ) \mathring{{\mathcal {D}}}+ \beta (P \mathring{{\mathcal {D}}}P^{-1} + P^{-1} \mathring{{\mathcal {D}}}P) + \beta (\xi _P{\mathcal {D}}P^{-1} + P^{-1}{\mathcal {D}}\xi _P) -\beta (P {\mathcal {D}}P^{-1}\xi _PP^{-1} + P^{-1}\xi _PP^{-1}{\mathcal {D}}P) \\&\quad = {\textrm{sym}}_{{\mathfrak {t}}}( \xi _Y^{{\mathfrak {t}}}\omega _YP - P\xi _Y^{{\mathfrak {t}}}\omega _Y +Y^{{\mathfrak {t}}}\omega _Y\xi _P-\xi _PY^{{\mathfrak {t}}}\omega _Y) \end{aligned}$$

by differentiating the equation for \({\mathcal {D}}\). From here, we get the equation for \(\mathring{{\mathcal {D}}}\). \(\square \)
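A quick numerical check of ours that the operator \({\textrm{N}}(S)\) of Eq. (8.3) maps into the horizontal space of Eq. (8.2):

```python
import numpy as np

# N(S)(B, D) should satisfy alpha_1 Y^t eta_Y + beta(eta_P P^{-1} - P^{-1} eta_P) = 0.
rng = np.random.default_rng(4)
n, p, a1, beta = 6, 2, 0.7, 1.3
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))
Y, Y_perp = Q[:, :p], Q[:, p:]
W = rng.standard_normal((p, p)); P = W @ W.T + p * np.eye(p)

B = rng.standard_normal((n - p, p))
D = rng.standard_normal((p, p)); D = 0.5 * (D + D.T)
Pi = np.linalg.inv(P)
eta_Y = beta * Y @ (Pi @ D - D @ Pi) + Y_perp @ B    # Eq. (8.3)
eta_P = a1 * D
res = a1 * Y.T @ eta_Y + beta * (eta_P @ Pi - Pi @ eta_P)
assert np.allclose(res, 0)
```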

To solve for \({\mathcal {D}}\) and \(\mathring{{\mathcal {D}}}\), we need an extension of the symmetric Lyapunov equation:

Proposition 8.1

Let \(P\in {\textrm{Sym}}_{{\mathfrak {t}}, {\mathbb {K}}, p}\) be a \({\mathfrak {t}}\)-symmetric matrix with eigenvalue decomposition \(P = U{\varLambda } U^{{\mathfrak {t}}}\), eigenvalues \(({\varLambda }_i)_{i=1}^p\), and \(UU^{{\mathfrak {t}}}=I\). Let the coefficients \((c_{st})_{-k\le s, t \le k}\) be such that \(M_{ij}:= \sum _{s=-k, t=-k}^{s=k, t=k}c_{st}{\varLambda }^s_i {\varLambda }^t_j\) is nonzero for all pairs \(({\varLambda }_i, {\varLambda }_j)\) of eigenvalues of P; then the equation

$$\begin{aligned} \sum _{s=-k, t=-k}^{s=k, t=k} c_{st}P^{s} X P^{t} = B \end{aligned}$$
(8.8)

has the following unique solution, which could be computed with \(O(p^3)\) complexity:

$$\begin{aligned} X = U\{(U^{{\mathfrak {t}}}BU) / M\} U^{{\mathfrak {t}}} \end{aligned}$$
(8.9)

with \(M=(M_{ij})_{i=1,j=1}^{i=p, j=p}\), where / denotes entrywise division. In particular, for a positive-definite matrix P and positive scalars \(\alpha , \beta \), the equation

$$\begin{aligned} (\alpha - 2\beta )X + \beta (P^{-1}XP + PXP^{-1})= B \end{aligned}$$

has a unique solution with \(M_{ij} = \alpha +\beta ({\varLambda }_i^{-1}{\varLambda }_j + {\varLambda }_i{\varLambda }_j^{-1}-2)\).

Proof

We follow the idea of [5, 6] but use the eigenvalue decomposition in place of the Schur decomposition. Substituting \(P = U{\varLambda } U^{{\mathfrak {t}}}\) into Eq. (8.8) and multiplying by \(U^{{\mathfrak {t}}}\) on the left and U on the right, we get

$$\begin{aligned} \sum _{s=-k, t=-k}^{s=k, t=k} c_{st}{\varLambda }^{s}U^{{\mathfrak {t}}} X U {\varLambda }^{t} = U^{{\mathfrak {t}}}B U, \end{aligned}$$

which is equivalent to \((U^{{\mathfrak {t}}} X U)_{ij}M_{ij} = (U^{{\mathfrak {t}}}B U)_{ij}\), or \(U^{{\mathfrak {t}}} X U = (U^{{\mathfrak {t}}}B U)/M\), and we have Eq. (8.9). If \(\alpha \) and \(\beta \) are positive, then \(\alpha + \beta ({\varLambda }_i^{-1}{\varLambda }_j + {\varLambda }_i{\varLambda }_j^{-1}-2) >0\) by the AM-GM inequality. The eigenvalue decomposition has \(O(p^3)\) cost. \(\square \)
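The special case used for L(P) is short to implement; the following NumPy sketch of ours solves it by the eigendecomposition and the entrywise division of Eq. (8.9), with a self-check.

```python
import numpy as np

def solve_L(P, B, alpha, beta):
    """Solve (alpha - 2 beta) X + beta (P^{-1} X P + P X P^{-1}) = B."""
    lam, U = np.linalg.eigh(P)              # P = U diag(lam) U^t
    r = lam[:, None] / lam[None, :]         # Lambda_i / Lambda_j
    M = alpha + beta * (r + 1.0 / r - 2.0)  # M_ij > 0 by the AM-GM inequality
    return U @ ((U.T @ B @ U) / M) @ U.T    # Eq. (8.9)

rng = np.random.default_rng(5)
p, alpha, beta = 3, 0.9, 1.1
W = rng.standard_normal((p, p)); P = W @ W.T + p * np.eye(p)
X = rng.standard_normal((p, p)); X = 0.5 * (X + X.T)
Pi = np.linalg.inv(P)
B = (alpha - 2 * beta) * X + beta * (Pi @ X @ P + P @ X @ Pi)
assert np.allclose(solve_L(P, B, alpha, beta), X)
```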

Horizontal lifts of geodesics on \({\textrm{S}}^{+}_{{\mathbb {K}}, p, n}\) are geodesics on \({\textrm{St}}_{{\mathbb {K}}, p, n}\times {\textrm{S}}^{+}_{{\mathbb {K}}, p}\). Recall a complete manifold has geodesics that extend indefinitely in any direction. Under this quotient metric, \({\textrm{S}}^{+}_{{\mathbb {K}}, p, n}\) is a complete Riemannian manifold, as both factors above are. The metric in [27] is complete, while the metric in [13] is not. If \((\eta _Y, \eta _P)\) is a horizontal tangent vector at \(S=(Y, P)\), a horizontal geodesic \(\gamma \) with \(\gamma (0) = S, {\dot{\gamma }}(0) = (\eta _Y, \eta _P)\) is of the form \((Y(t), P^{1/2}\exp (t P^{-1/2}\eta _P P^{-1/2})P^{1/2})\), where Y(t) is the geodesic of the metric in Sect. 5, described in [11, 20].

9 Implementation and Data Availability

We developed a Python package [18], based on Pymanopt and Manopt [8, 26], implementing the manifolds with the metrics considered in this paper, for both the real and complex cases. To reuse the optimization code in [8, 26], our NullRangeManifold class implements templates of the methods egrad2rgrad and ehess2rhess based on the formulas in Propositions 3.2 and 3.3 and Theorem 3.1. For a new manifold, the user needs to implement the constraint operator \({{\,\textrm{J}\,}}\), its adjoint and derivative, the metric operator \({{\textsf{g}}}\), its inverse and derivative, and the third Christoffel term \({\mathcal {X}}\), as well as a method to solve \({{\,\textrm{J}\,}}{{\textsf{g}}}^{-1}{{\,\textrm{J}\,}}^{{\mathfrak {t}}}\) (by default, a conjugate-gradient solver is used). A retraction is also required. NullRangeManifold then automatically provides the projection, Riemannian gradient, and Hessian derived in this paper. Besides the numerical implementation, we also include notebooks showing symbolic derivations of the formulas, and numerical tests, including geodesics in most cases.

We also implement real and complex Stiefel manifolds and positive-semidefinite manifolds with metric parameters. For each manifold, we provide a manifold class to support optimization problems. For flag manifolds, [18] contains a larger class of metrics; the results in Sect. 6 are implemented separately in [19], and the numerical results are shown here. For testing, we numerically verify that the projection satisfies the nullspace condition, as well as metric compatibility and torsion-freeness of the connection. As the manifolds considered here are constructed from Stiefel or positive-definite matrix manifolds, we use their typical retractions. Since we focus on methodology in this paper, we will not discuss formal numerical experiments. However, we have tested each manifold on a quadratic cost problem, including matrices with one dimension equal to 1000, with a trust-region solver ([26], [1, section 7.2.2], summarized in Algorithm 1). Here, an approximate solution is sufficient in step 3 with \(\eta \in T_{x_k}{\mathcal {M}}\); this step is the inner iteration, versus the outer iteration in step 2; \(||\cdot ||_{x_k}\) and \(\langle \cdot ,\cdot \rangle _{x_k}\) denote the norm and inner product at \(x_k\), and a terminal condition is imposed.

Algorithm 1 Riemannian trust-region method (pseudocode figure not reproduced)
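For readability, a condensed sketch of this outer/inner structure follows. The manifold methods (M.norm, M.inner, M.retract) are simplified placeholders rather than the interface of [8, 26], and the Cauchy-point inner solve below is a stand-in for the truncated conjugate-gradient iteration used in practice.

```python
# Condensed sketch of the Riemannian trust-region iteration
# (cf. Algorithm 1; [26], [1, section 7.2.2]). All callables are placeholders.
def cauchy_step(M, x, g, H, Delta):
    # simplest admissible inner solve (step 3): a Cauchy-point step along -g;
    # truncated conjugate gradient is used in practice
    gnorm = M.norm(x, g)
    gHg = M.inner(x, g, H(g))
    t = gnorm**2 / gHg if gHg > 0 else Delta / gnorm
    t = min(t, Delta / gnorm)                    # stay inside the trust region
    return -t * g

def trust_region(M, cost, grad, hess, x, Delta=1.0, Delta_bar=100.0,
                 rho_prime=0.1, tol=1e-8, max_iter=200):
    for _ in range(max_iter):                    # outer iteration (step 2)
        g = grad(x)
        if M.norm(x, g) < tol:                   # terminal condition
            return x
        eta = cauchy_step(M, x, g, lambda v: hess(x, v), Delta)
        x_new = M.retract(x, eta)                # candidate point
        pred = -(M.inner(x, g, eta) + 0.5 * M.inner(x, eta, hess(x, eta)))
        rho = (cost(x) - cost(x_new)) / pred     # actual vs. predicted decrease
        if rho < 0.25:                           # poor model fit: shrink the radius
            Delta *= 0.25
        elif rho > 0.75 and M.norm(x, eta) >= 0.99 * Delta:
            Delta = min(2.0 * Delta, Delta_bar)  # good fit at the boundary: expand
        if rho > rho_prime:                      # accept the step
            x = x_new
    return x
```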

For flag manifolds, we optimize the function \(f(Y) = {{\,\mathrm{Tr_{{\mathbb {R}}}}\,}}((Y{\varLambda } Y^{{\mathfrak {t}}}A)^2)\) over matrices \(Y\in {\textrm{St}}_{{\mathbb {R}}, d, n}\). Here, A is a positive-definite matrix, \(d<n\) are two positive integers, \(\varvec{\hat{d}}=(d_1,\cdots , d_q)\) is a partition of d with \(\sum _{i=1}^q d_i = d\), and \({\varLambda } = {{\,\textrm{diag}\,}}(\lambda _1 I_{d_1},\cdots , \lambda _q I_{d_q})\) for positive numbers \(\lambda _1,\cdots ,\lambda _q\). This cost function is invariant under the action of \({\textrm{U}}_{{\mathbb {R}}, \varvec{\hat{d}}}\), and thus can be considered as a function on the flag manifold \({\textrm{St}}_{{\mathbb {R}}, d, n}/{\textrm{U}}_{{\mathbb {R}}, \varvec{\hat{d}}}\). The Euclidean gradient is \(4(AY{\varLambda })Y^{{\mathfrak {t}}}(AY{\varLambda })\), and the Euclidean Hessian follows by routine matrix calculus. For testing, we consider \(d=60, \varvec{\hat{d}}=(30, 20, 10), n=1000\) and use a trust-region solver. In this case, there is no noticeable variation in running time across different values of \(\alpha \); a run typically takes a few seconds on a free Colab engine (notebook colab/SimpleFlag.ipynb in [19]). Convergence is typically achieved after about 16 trust-region iterations, with small \(\alpha \) requiring more outer iterations. The convergence is superlinear, as seen in Fig. 1.
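Since the cost and its Euclidean gradient are the only problem-specific inputs here, they are easy to sanity-check. The following NumPy snippet verifies the gradient formula against finite differences; the sizes and data are hypothetical, and the flag-manifold class itself lives in [19].

```python
# Finite-difference check of f(Y) = Tr((Y Lam Y^t A)^2) and its Euclidean
# gradient 4 (A Y Lam) Y^t (A Y Lam); sizes and data are illustrative.
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 6
Lam = np.diag(np.repeat([3.0, 2.0, 1.0], [3, 2, 1]))      # partition d_hat = (3, 2, 1)
A0 = rng.standard_normal((n, n))
A = A0 @ A0.T + n * np.eye(n)                             # a positive-definite A

def f(Y):
    X = Y @ Lam @ Y.T @ A
    return np.trace(X @ X)

def egrad(Y):
    G = A @ Y @ Lam
    return 4.0 * G @ (Y.T @ G)

Y = np.linalg.qr(rng.standard_normal((n, d)))[0]          # a Stiefel point
xi = rng.standard_normal((n, d))
t = 1e-6
fd = (f(Y + t * xi) - f(Y - t * xi)) / (2.0 * t)
print(np.isclose(fd, np.sum(egrad(Y) * xi), rtol=1e-5))   # directional derivatives agree
```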

Fig. 1 Optimization on the flag manifold \({{\,\textrm{Flag}\,}}(30, 20, 10; 1000, {\mathbb {R}})\) using a trust-region solver. Left: total number of iterations versus \(\alpha =\alpha _1/\alpha _0\). Right: \(\log _{10}\) of the gradient norm for \(\alpha =1\)

For another test of the Riemannian optimization framework, we consider a nonlinear weighted PCA (principal component analysis) problem, which can be solved by optimizing over the positive-semidefinite matrix manifold. Given a symmetric matrix \(A\in {\textrm{Sym}}_{{{\textsf{T}}}, {\mathbb {R}}, n}\) and a weight vector \(W\in {\mathbb {R}}^n\), take the cost function to be

$$\begin{aligned} {{\,\textrm{Tr}\,}}(A-YPY^{{{\textsf{T}}}}){{\,\textrm{diag}\,}}(W)(A-YPY^{{{\textsf{T}}}}) = {{\,\textrm{Tr}\,}}W_d(A^2-AYPY^{{{\textsf{T}}}}-YPY^{{{\textsf{T}}}}A + YP^2Y^{{{\textsf{T}}}}) \end{aligned}$$

of a positive-semidefinite matrix \(S = YPY^{{{\textsf{T}}}}\in {\textrm{S}}^{+}_{{\mathbb {R}}, p, n}\), with \(Y\in {\textrm{St}}_{{\mathbb {R}}, p, n}, P\in {\textrm{S}}^{+}_{{\mathbb {R}}, p}\). Here, \(W_d\) denotes the diagonal matrix \({{\,\textrm{diag}\,}}(W)\). When W has identical weights \(\lambda \), so \(W_d = \lambda I_n\), expanding the cost function shows we need to minimize \({{\,\textrm{Tr}\,}}P^2 - 2{{\,\textrm{Tr}\,}}Y^{{{\textsf{T}}}} A Y P = \Vert P - Y^{{{\textsf{T}}}}AY\Vert _F^2 - \Vert Y^{{{\textsf{T}}}}AY\Vert _F^2\) in Y and P, which implies \(P = Y^{{{\textsf{T}}}} A Y\) at the optimum. Thus, the problem reduces to optimizing \( -{{\,\textrm{Tr}\,}}(Y^{{{\textsf{T}}}}AY)^2\) over the Stiefel manifold (in fact over the Grassmann manifold, as the function is invariant when Y is multiplied on the right by an orthogonal matrix), which can be considered a quadratic PCA problem. When W has nonidentical weights, we optimize over the positive-semidefinite manifold with a trust-region solver.
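Both the expansion of the cost function and the equal-weight minimizer are easy to check numerically; a quick NumPy sketch, with illustrative sizes and data:

```python
# Check the expansion Tr((A-S) W_d (A-S)) = Tr W_d(A^2 - AS - SA + Y P^2 Y^T)
# for S = Y P Y^T with Y^T Y = I, and the equal-weight minimizer P = Y^T A Y.
import numpy as np

rng = np.random.default_rng(2)
n, p = 30, 4
A = rng.standard_normal((n, n)); A = (A + A.T) / 2        # symmetric A
W_d = np.diag(rng.uniform(0.5, 2.0, n))                   # positive weights
Y = np.linalg.qr(rng.standard_normal((n, p)))[0]          # Y^T Y = I_p
P0 = rng.standard_normal((p, p)); P0 = P0 @ P0.T          # positive-definite P

S = Y @ P0 @ Y.T
lhs = np.trace((A - S) @ W_d @ (A - S))
rhs = np.trace(W_d @ (A @ A - A @ S - S @ A + Y @ P0 @ P0 @ Y.T))
print(np.isclose(lhs, rhs))                               # the two expressions agree

lam = 1.3                                                 # identical weights W = lam * 1
def cost_eq(P):                                           # lam * Tr((A - Y P Y^T)^2)
    R = A - Y @ P @ Y.T
    return lam * np.trace(R @ R)

P_star = Y.T @ A @ Y
Z = rng.standard_normal((p, p))
print(cost_eq(P_star) <= cost_eq(P_star + 0.1 * (Z + Z.T) / 2))  # minimizer in P
```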

Fig. 2 Weighted PCA via optimization on the positive-semidefinite manifold with a trust-region solver. Left: \(\log _{10}\) of the distance to the optimal value. Right: quadratic cost by iteration

The cost function from \({\mathcal {M}}= {\textrm{St}}_{{\mathbb {R}}, p, n}\times {\textrm{S}}^{+}_{{\mathbb {R}}, p}\) extends to \({\mathcal {E}}={\mathbb {R}}^{n\times p}\times {\mathbb {R}}^{p\times p}\), and is denoted by \(\hat{f}(Y, P)\). For a horizontal tangent vector \(\xi = (\xi _Y, \xi _P)\) at \((Y, P)\in {\mathcal {M}}\),

$$\begin{aligned} ({\textrm{D}}_{\xi }\hat{f})(Y, P) ={}&{{\,\textrm{Tr}\,}}W_d(-A\xi _YPY^{{{\textsf{T}}}} - AY\xi _PY^{{{\textsf{T}}}} -AYP\xi _Y^{{{\textsf{T}}}} -\xi _YPY^{{{\textsf{T}}}}A -Y\xi _PY^{{{\textsf{T}}}}A \\ &\quad -YP\xi _Y^{{{\textsf{T}}}}A + \xi _YP^2Y^{{{\textsf{T}}}} + Y(\xi _PP + P\xi _P)Y^{{{\textsf{T}}}} + YP^2\xi _Y^{{{\textsf{T}}}}). \end{aligned}$$

We have \({{\,\textrm{Tr}\,}}W_d(-A\xi _YPY^{{{\textsf{T}}}} -AYP\xi _Y^{{{\textsf{T}}}} -\xi _YPY^{{{\textsf{T}}}}A -YP\xi _Y^{{{\textsf{T}}}}A) =-2{{\,\textrm{Tr}\,}}(AW_d + W_dA)YP\xi _Y^{{{\textsf{T}}}}\) and \({{\,\textrm{Tr}\,}}W_d(\xi _YP^2Y^{{{\textsf{T}}}} + YP^2\xi _Y^{{{\textsf{T}}}}) =2{{\,\textrm{Tr}\,}}W_dYP^2\xi _Y^{{{\textsf{T}}}}\); these, together with similar equalities for the \(\xi _P\)-terms, give us

$$\begin{aligned} {\textsf{grad}}\hat{f}= (-4{\textrm{sym}}_{{{\textsf{T}}}}(A W_d)YP + 2W_dYP^2, -2{\textrm{sym}}_{{{\textsf{T}}}}(Y^{{{\textsf{T}}}}W_d(AY - Y P))). \end{aligned}$$

The ambient Hessian \({\textsf{hess}}\hat{f}(\xi )\) follows from a directional derivative calculation

$$\begin{aligned} {\textsf{hess}}\hat{f}(\xi ) ={}&\bigl (-4{\textrm{sym}}_{{{\textsf{T}}}}(AW_d)(\xi _YP + Y\xi _P) + 2W_d\xi _YP^2 + 2W_dY(\xi _PP + P\xi _P),\\ &\quad -2{\textrm{sym}}_{{{\textsf{T}}}}(\xi _Y^{{{\textsf{T}}}}W_d(AY - Y P)) - 2 {\textrm{sym}}_{{{\textsf{T}}}}(Y^{{{\textsf{T}}}}W_d(A\xi _Y - \xi _Y P - Y \xi _P))\bigr ). \end{aligned}$$

The Riemannian gradient is computed as \({\varPi }_{{\mathcal {H}}, {{\textsf{g}}}}{{\textsf{g}}}^{-1}{\textsf{grad}}\hat{f}\), where \({\varPi }_{{\mathcal {H}}, {{\textsf{g}}}}\) is given by Eq. (8.4); the Riemannian Hessian is computed from Eq. (3.11) and Theorem 8.1.
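The ambient formulas above can be checked numerically before projecting. The following NumPy sketch verifies that \({\textsf{hess}}\hat{f}\) is the directional derivative of \({\textsf{grad}}\hat{f}\); the sizes are illustrative, and sym implements \({\textrm{sym}}_{{{\textsf{T}}}}(X) = (X + X^{{{\textsf{T}}}})/2\).

```python
# Transcription of the ambient gradient and Hessian of f-hat, with a
# finite-difference check that hess f-hat = D(grad f-hat).
import numpy as np

rng = np.random.default_rng(1)
n, p = 40, 5
A = rng.standard_normal((n, n)); A = (A + A.T) / 2
W_d = np.diag(rng.uniform(0.5, 2.0, n))
sym = lambda X: (X + X.T) / 2

def grad_hat(Y, P):
    gY = -4 * sym(A @ W_d) @ Y @ P + 2 * W_d @ Y @ P @ P
    gP = -2 * sym(Y.T @ W_d @ (A @ Y - Y @ P))
    return gY, gP

def hess_hat(Y, P, xY, xP):
    hY = (-4 * sym(A @ W_d) @ (xY @ P + Y @ xP) + 2 * W_d @ xY @ P @ P
          + 2 * W_d @ Y @ (xP @ P + P @ xP))
    hP = (-2 * sym(xY.T @ W_d @ (A @ Y - Y @ P))
          - 2 * sym(Y.T @ W_d @ (A @ xY - xY @ P - Y @ xP)))
    return hY, hP

Y = rng.standard_normal((n, p))
P = rng.standard_normal((p, p)); P = P @ P.T
xY = rng.standard_normal((n, p))
xP = sym(rng.standard_normal((p, p)))
t = 1e-5
gY1, gP1 = grad_hat(Y + t * xY, P + t * xP)
gY0, gP0 = grad_hat(Y - t * xY, P - t * xP)
hY, hP = hess_hat(Y, P, xY, xP)
print(np.allclose((gY1 - gY0) / (2 * t), hY, rtol=1e-4, atol=1e-8),
      np.allclose((gP1 - gP0) / (2 * t), hP, rtol=1e-4, atol=1e-8))
```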

In our experiment (implemented in the notebook colab/WeightedPCA.ipynb in [18]), we take \(n= 1000, p = 50\), with A and W generated randomly. To find the optimum \(S=YPY^{{{\textsf{T}}}}\), we optimize with \(\alpha _0 = \alpha _1 = 1\), setting \(\beta \) to 0.1 for the first 20 iterations, 10 for the next 20, and 30 for the remaining iterations. This schedule comes from our limited experiments: we find that varying \(\beta \) has a strong effect on the speed of convergence, and that updating \(\beta \) in this way gives better convergence rates than a static \(\beta \). Philosophically, the small starting \(\beta \) can be thought of as focusing first on aligning the subspace. The convergence behavior is summarized in Fig. 2. We hope to revisit the topic with a more systematic study in future work.

10 Conclusion

In this paper, we have proposed a framework to compute the Riemannian gradient, the Levi-Civita connection, and the Riemannian Hessian effectively when the constraints, the symmetries of the problem, and the metrics are given analytically, and we have applied the framework to several manifolds important in applications. We plan to apply the results of this paper to problems in optimization, machine learning, and computational statistics, and we hope the research community will find the method useful in future work.