10.1 Linearized Deformations

A standard way to ensure the existence of a smooth solution of a matching problem is to add a penalty term in the matching functional. This term would complete (9.1) to form

$$\begin{aligned} E_{I, I'}(\varphi ) = \rho (\varphi ) + D(\varphi \cdot I, I'). \end{aligned}$$
(10.1)

A large variety of such methods have been designed, in approximation theory, statistics and signal processing for solving ill-posed problems. The simplest (and typical) form of penalty function is

$$ \rho (\varphi ) = \left\| \varphi -{\mathrm {id}} \right\| _H^2 $$

for some Hilbert (or Banach) space of functions. More complex functions of \(\varphi -{\mathrm {id}}\) may also be designed, related to energies of non-linear elasticity (see, among others, [13, 27, 28, 89, 123, 144, 237]). Such methods may be called “small deformation” methods because they work on the deviation \(u = \varphi - {\mathrm {id}}\), and controlling the size or smoothness of u alone is most of the time not enough to guarantee that \(\varphi \) is a diffeomorphism (unless u is small, as we have seen in Sect. 7.1). There is, in general, no way of proving the existence of a solution of the minimization problem within some group of diffeomorphisms G, unless some restrictive assumptions are made on the objects to be matched.

Our focus here is on diffeomorphic matching. Because of this, we shall not detail many of these methods. However, it is interesting to note that these functionals also have a Eulerian gradient within an RKHS of vector fields with a smooth enough kernel, and can therefore be minimized using (9.7). We illustrate this with the following example, in which we skip the proper justification of the existence of derivatives.

Consider the function \(\rho (\varphi ) = \int _{{\mathbb {R}}^d} \left| d\varphi (x)-{\mathrm {Id}} \right| ^2 dx\), where the matrix norm is

$$ |A|^2 = {\mathrm {trace}}({{A}^T}A) = \sum _{i, j} a_{ij}^2 $$

(Hilbert–Schmidt norm). Letting \(u = \varphi - {\mathrm {id}}\), we have

$$ \big ( d\rho (\varphi ) \,\big |\, h \big ) = 2 \int _{{\mathbb {R}}^d} {\mathrm {trace}}(du^Tdh)\, dx = - 2 \int _{{\mathbb {R}}^d} \varDelta u^T h\, dx, $$

where \(\varDelta u\) is the vector formed by the Laplacian of the coordinates of u (recall that we assume that \(u=0\) at infinity). This implies that (given that \(\varDelta u = \varDelta \varphi \))

$$ \big ( \bar{\partial }\rho (\varphi ) \,\big |\, h \big ) = - 2 \int _\varOmega \varDelta \varphi ^T\, h \circ \varphi \, dx $$

and

$$\begin{aligned} {\overline{\nabla }}^V_\varphi \rho (\cdot ) = - 2\int _\varOmega K(\cdot , \varphi (x)) \varDelta \varphi (x) dx. \end{aligned}$$
(10.2)

This provides a regularized greedy image-matching algorithm, which includes a regularization term (a similar algorithm may easily be written for point matching).

Algorithm 2

The following procedure is an Eulerian gradient descent, on V, for the energy

$$ E_{I,I'}(\varphi ) = \int _{{\mathbb {R}}^d} \left| d\varphi (x) - {\mathrm {Id}} \right| ^2 dx + \frac{1}{\sigma ^2} \int _{{\mathbb {R}}^d} \left| I\circ \varphi ^{-1}(x) - I'(x) \right| ^2 dx. $$

Start with an initial \(\varphi _0 = {\mathrm {id}}\) and solve the differential equation

$$\begin{aligned} \partial _t\varphi (t, y) ={}& - 2\int _\varOmega K(\varphi (t,y), \varphi (t,x))\, \varDelta \varphi (t, x)\, dx\end{aligned}$$
(10.3)
$$\begin{aligned} &+ \frac{2}{\sigma ^2}\int _\varOmega (J(t,x) - I'(x))\, K(\varphi (t, y), x)\,\nabla J(t, x)\, dx \end{aligned}$$
(10.4)

with \(J(t,\cdot ) = I\circ \varphi (t)^{-1}(\cdot )\).
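To make the procedure concrete, here is a minimal numerical sketch (not from the text) of one explicit Euler step of (10.3)–(10.4) on a discrete grid. It assumes a scalar Gaussian kernel and takes the current image \(J\), its gradient and the Laplacian of \(\varphi \) as precomputed arrays; the function names and parameters are illustrative, not prescribed by the text.

```python
# One explicit Euler step of (10.3)-(10.4), discretized on a grid of n points with
# cell volume `cell`; the kernel is assumed to be a scalar Gaussian, and J, grad J,
# and the Laplacian of phi are assumed precomputed.
import numpy as np

def gaussian_kernel(a, b, sigma_K=5.0):
    # a: (n, d), b: (m, d) -> (n, m) values of Gamma(a_i, b_j)
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma_K ** 2))

def greedy_step(phi, lap_phi, J, gradJ, Iprime, x, dt, sigma=1.0, cell=1.0):
    # phi, lap_phi: (n, d); J, Iprime: (n,); gradJ: (n, d); x: (n, d) grid points
    K1 = gaussian_kernel(phi, phi)              # K(phi(y), phi(x))
    K2 = gaussian_kernel(phi, x)                # K(phi(y), x)
    reg = -2.0 * cell * (K1 @ lap_phi)                                         # (10.3)
    data = (2.0 / sigma ** 2) * cell * ((K2 * (J - Iprime)[None, :]) @ gradJ)  # (10.4)
    return phi + dt * (reg + data)
```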

This algorithm, which, like the previous greedy procedures, has the fundamental feature of providing a smooth flow of diffeomorphisms to minimize the matching functional, suffers from the same limitations as its predecessors concerning its limit behavior. These limitations are essentially due to the fact that the variational problem itself is not well-posed: minimizers may not exist, and when they exist they are not necessarily diffeomorphisms. In order to ensure the existence of, at least, homeomorphic solutions, the energy must include terms that not only prevent \(d\varphi \) from being too large, but also prevent it from being too small (or its inverse from being too large). In [90], the following regularization is proved to ensure the existence of homeomorphic solutions:

$$\begin{aligned} \delta (\varphi ) = \int _\varOmega (a \Vert d\varphi \Vert ^p + b \Vert \mathrm {Adj}(d\varphi )\Vert ^q + c (\det d\varphi )^r + d (\det d\varphi )^{-s}) dx \end{aligned}$$
(10.5)

under some assumptions on p, q, r and s, namely \(p, q > 3\), \(r>1\) and \(s > 2q/(q-3)\).
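For illustration, the following sketch evaluates the regularizer (10.5) for a map sampled on a regular 3-D grid, using finite differences for \(d\varphi \); the default exponents are one admissible choice satisfying the constraints above, the names are illustrative, and \(\det d\varphi \) is assumed positive everywhere.

```python
# Numerical sketch of the regularizer (10.5); phi has shape (3, nx, ny, nz) and is
# sampled on a grid with spacing h.
import numpy as np

def regularizer(phi, h, a, b, c, d, p=4, q=4, r=2, s=9):
    # Jacobian: J[..., i, j] approximates d(phi_i)/d(x_j) by central differences
    J = np.stack([np.stack(np.gradient(phi[i], h), axis=0) for i in range(3)])
    J = np.moveaxis(J, (0, 1), (-2, -1))
    detJ = np.linalg.det(J)
    adjJ = detJ[..., None, None] * np.linalg.inv(J)          # Adj(A) = det(A) A^{-1}
    frob = lambda M: np.sqrt((M ** 2).sum(axis=(-2, -1)))    # Hilbert-Schmidt norm
    integrand = (a * frob(J) ** p + b * frob(adjJ) ** q
                 + c * detJ ** r + d * detJ ** (-s))
    return integrand.sum() * h ** 3                          # Riemann sum over the grid
```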

10.2 The Monge–Kantorovitch Problem

We briefly discuss in this section the mass transfer problem, which is, under some assumptions, a diffeomorphic method for matching probability densities, i.e., positive functions on \({\mathbb {R}}^d\) with integral equal to 1. Consider such a density, \(\zeta \), and a diffeomorphism \(\varphi \) on \({\mathbb {R}}^d\). If an object has density \(\zeta \), the mass included in an infinitesimal volume dx around x is \(\zeta (x)dx\). Now, if each point x in the object is transported to the location \(y = \varphi (x)\), the mass of a volume dy around y is the same as the mass of the volume \(\varphi ^{-1}(dy)\) around \(x = \varphi ^{-1}(y)\), which is \(\zeta \circ \varphi ^{-1}(y) |\det (d(\varphi ^{-1}))(y)| dy\) (this provides a physical interpretation of Proposition 9.5).

Given two densities \(\zeta \) and \( \zeta '\), the optimal mass transfer problem consists in finding a diffeomorphism \(\varphi \) with minimal cost such that \(\zeta ' = \zeta \circ \varphi ^{-1} |\det (d(\varphi ^{-1}))|\). The cost associated to \(\varphi \) in this context is related to the distance along which the transfer is made, measured by a function \(\rho (x, \varphi (x))\). The total cost comes after summing over the transferred mass, yielding

$$ E(\varphi ) = \int _\varOmega \rho (x, \varphi (x)) \zeta (x) dx. $$

The mass transfer problem now is to minimize E over all \(\varphi \)’s such that \(\zeta ' = \zeta \circ \varphi ^{-1} |\det (d(\varphi ^{-1}))|\). The problem is slightly different from the matching formulations that we discuss in the other sections of this chapter, because the minimization is associated to exact matching.

It is very interesting that this apparently very complex and highly nonlinear problem can be reduced to linear programming, albeit infinite-dimensional. Let us first consider a more general formulation. Instead of looking for a one-to-one correspondence \(x\mapsto \varphi (x)\), one can decide that the mass in a small neighborhood of x is distributed over all of \(\varOmega \) with weights \(y \mapsto q(x, y)\), where \(q(x, y) \ge 0\) and \(\int _\varOmega q(x, y) dy = 1\). We still have the constraint that the mass density arriving at y is \(\tilde{\zeta }(y)\) (writing \(\tilde{\zeta }\) for the target density \(\zeta '\) from now on), which gives

$$ \int _\varOmega \zeta (x) q(x, y) dx = \tilde{\zeta }(y). $$

The cost now has the simple expression (linear in q)

$$ E = \int _{\varOmega ^2} \rho (x, y) \zeta (x)q(x, y) dxdy. $$

The original formulation can be retrieved by letting \(q(x, y)dy \rightarrow \delta _{\varphi (x)}(y)\) (i.e., pass to the limit \(\sigma = 0\) with \(q(x, y) = \exp (-|y-\varphi (x)|^2/2\sigma ^2)/(2\pi \sigma ^2)^{d/2}\)).

If we write \(g(x,y) = \zeta (x) q(x, y)\), this relaxed problem is clearly equivalent to minimizing

$$ E(g) = \int _{\varOmega ^2} \rho (x, y) g(x, y) dxdy $$

subject to the constraints \(g(x, y) \ge 0\), \(\int g(x, y) dy = \zeta (x)\) and \(\int g(x, y) dx = \tilde{\zeta }(y)\). In fact, the natural formulation of this problem uses measures instead of densities: given two probability measures \(\mu \) and \({\tilde{\mu }}\) on \(\varOmega \), minimize

$$ E(\nu ) = \int _{\varOmega ^2} \rho (x, y) \nu (dx, dy) $$

subject to the constraints that the marginals of \(\nu \) are \(\mu \) and \({\tilde{\mu }}\). This provides the Wasserstein distance between \(\mu \) and \({\tilde{\mu }}\), associated to the transportation cost \(\rho \). Note that this formulation generalizes the computation of the Wasserstein distance (9.24) between discrete measures.
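When \(\mu \) and \({\tilde{\mu }}\) are finitely supported, this becomes a finite-dimensional linear program (as discussed next); a small sketch using SciPy's linear-programming solver is given below, with illustrative names.

```python
# Discrete Kantorovich problem solved as a linear program (a sketch).
# mu, nu: probability weight vectors of lengths n and m; cost[i, j] = rho(x_i, y_j).
import numpy as np
from scipy.optimize import linprog

def kantorovich_lp(mu, nu, cost):
    n, m = cost.shape
    c = cost.ravel()                            # minimize sum_ij cost_ij * g_ij
    A_eq = np.zeros((n + m, n * m))
    for i in range(n):                          # row marginals: sum_j g_ij = mu_i
        A_eq[i, i * m:(i + 1) * m] = 1.0
    for j in range(m):                          # column marginals: sum_i g_ij = nu_j
        A_eq[n + j, j::m] = 1.0
    b_eq = np.concatenate([mu, nu])
    res = linprog(c, A_eq=A_eq, b_eq=b_eq, bounds=(0, None))
    return res.x.reshape(n, m), res.fun         # optimal coupling and its cost
```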

This problem is much nicer than the original one, since it is a linear programming problem. The theory of convex optimization (that we only apply formally in this infinite-dimensional context; see [44] for rigorous proofs) implies that it has an equivalent dual formulation which is: maximize

$$ F(h) = \int _\varOmega h\, d\mu + \int _\varOmega \tilde{h}\, d{\tilde{\mu }}$$

subject to the constraint that, for all \(x, y\in \varOmega \), \(h(x) + \tilde{h}(y) \le \rho (x, y)\).

The duality equivalence means that the maximum of F coincides with the minimum of E. The solutions are, moreover, related by duality conditions (the KKT conditions) that imply that \(\nu \) must be supported by the set

$$\begin{aligned} A = \left\{ (x,y): h(x) + \tilde{h}(y) = \rho (x, y) \right\} . \end{aligned}$$
(10.6)

For the dual problem, one is obviously interested in making h and \(\tilde{h}\) as large as possible. Given h, one should therefore choose \(\tilde{h}\) as

$$ \tilde{h}(y) = \inf _{x} (\rho (x, y) - h(x)), $$

so that the set in (10.6) is exactly the set of \((y^*, y)\) where \(y^*\) is a point that achieves the minimum of \(\rho (x, y) - h(x)\).
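On discrete grids, this choice of \(\tilde{h}\) (a c-transform) is a one-line computation; the following one-dimensional sketch, with the quadratic cost used below, is purely illustrative.

```python
# Illustrative computation of h-tilde(y) = inf_x (rho(x, y) - h(x)) on 1-D grids.
import numpy as np

def c_transform(h, X, Y, rho=lambda x, y: 0.5 * (x - y) ** 2):
    # h: values of h on the grid X; returns the values of h-tilde on the grid Y
    return np.array([np.min(rho(X, y) - h) for y in Y])
```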

The situation is particularly interesting when \(\rho (x, y) = |x-y|^2/2\). In this situation,

$$ \tilde{h}(y) = \frac{|y|^2}{2} - \sup _{x} \left( x^Ty - \frac{|x|^2}{2} + h(x)\right) . $$

From this equation, it is natural to introduce the auxiliary functions \(s(x) = |x|^2/2 - h(x)\) and \({\tilde{s}}(y) = |y|^2/2 - \tilde{h}(y)\). Using these functions, the set A in (10.6) becomes

$$ A = \left\{ (x, y): s(x) + {\tilde{s}}(y) = x^T y \right\} , $$

with \({\tilde{s}}(y) = \sup _{x}(x^Ty - s(x))\). Because the latter is a supremum of linear functions, we obtain the fact that \({\tilde{s}}\) is convex, and so is s by symmetry; \({\tilde{s}}\) is in fact what is called the convex conjugate of s, denoted \({\tilde{s}}= s^*\). Convex functions are almost everywhere differentiable, and, in order that \((x, y)\in A\), x must maximize \(u \mapsto u^Ty - s(u)\), which implies that \(y = \nabla s(x)\). So, the conclusion is that, whenever s is the solution of the dual problem, the solution of the primal problem is provided by \(y = \nabla s(x)\). This shows that the relaxed mass transport problem has the same solution as the initial one, with \(\varphi = \nabla s\), s being a convex function. That \(\varphi \) is invertible is obvious by symmetry: \(\varphi ^{-1} = \nabla {\tilde{s}}\).

This result is fundamental, since it is the basis for the construction of a numerical procedure for the solution of the mass transport problem in this case. Introduce a time-dependent vector field \(v(t,\cdot )\) and the corresponding flow of diffeomorphisms \(\varphi _{0t}^v\). Let \(h(t, \cdot ) = \det (d\varphi _{t0}^v)\, \zeta \circ \varphi _{t0}^v \). Then

$$ \det (d\varphi _{0t}^v)\, h(t) \circ \varphi _{0t}^v = \zeta .$$

The time derivative of this equation yields

$$\begin{aligned} \partial _t h + \mathrm {div}(hv) = 0. \end{aligned}$$
(10.7)

We have the following theorem [34].

Theorem 10.1

Consider the following energy:

$$ G(v) = \int _0^1 \int _\varOmega h(t, x) |v(t, x)|^2 dx dt $$

and the variational problem: minimize G subject to the constraints \(h(0) = \zeta \), \(h(1) = \tilde{\zeta }\) and (10.7). If v is the solution of the above problem, then \(\varphi _{01}^v\) solves the optimal mass transport problem.

Proof

Indeed, in G, we can make the change of variables \(x = \varphi _{0t}(y)\), which yields

$$\begin{aligned} G(v)&= \int _0^1\int _\varOmega \zeta (y) \left| v(t, \varphi _{0t}^v(y))\right| ^2 dy\, dt\\&= \int _\varOmega \zeta (y) \int _0^1 \left| \partial _t \varphi _{0t}^v(y) \right| ^2 dt\, dy\\&\ge \int _\varOmega \zeta (y) \left| \varphi _{01}^v(y) - y \right| ^2 dy. \end{aligned}$$

So the minimum of G is always larger than the minimum of E. If \(\varphi \) solves the mass transport problem, then one can take v(t, x) such that \(\varphi _{0t}^v(x) = (1-t) x + t\varphi (x)\), which is a diffeomorphism [190] and achieves the minimum of G.    \(\square \)

We refer to [34] for a numerical algorithm that computes the optimal \(\varphi \). Note that \(\rho (x, y) = |x-y|^2\) is not the only transportation cost that can be used in this context, but that others (like \(|x-y|\), which is not strictly convex in the distance) may fail to provide diffeomorphic solutions. Important developments on this subject can be found in [49, 119, 296].

We now discuss methods that are both diffeomorphic and metric (i.e., they relate to a distance). They also rely on the representation of diffeomorphisms using flows of ordinary differential equations.

10.3 Optimizing Over Flows

We return in this section to the representation of diffeomorphisms with flows of ordinary differential equations (ODEs) and describe how this representation can be used for diffeomorphic registration. Instead of using a norm to evaluate the difference between \(\varphi \) and the identity mapping, we now consider, as a regularizing term, the distance \(d_V\) that was defined in Sect. 7.2.6. More precisely, we set

$$ \rho (\varphi ) = \frac{1}{2} d_V({\mathrm {id}}, \varphi )^2 $$

and henceforth restrict the matching to diffeomorphisms belonging to \(\mathrm {Diff}_V\).

In this context, we have the following important theorem:

Theorem 10.2

Let V be a Hilbert space embedded in \(C^{p+1}_0(\varOmega , {\mathbb {R}}^d)\) so that \(\mathrm {Diff}_V\subset \mathrm {Diff}^{p,\infty }_0\). Assume that the functional \(U: \mathrm {Diff}_0^{p,\infty } \mapsto {\mathbb {R}}\) is bounded from below and continuous for the \((p, \infty )\)-compact topology. Then, there exists a minimizer of

$$\begin{aligned} E(\varphi ) = \frac{1}{2} d_V({\mathrm {id}}, \varphi )^2 + U(\varphi ) \end{aligned}$$
(10.8)

over \(\mathrm {Diff}_V\).

(The \((p, \infty )\)-compact topology is defined just after Theorem 7.13.)

Proof

E has an infimum \(E_{min}\) over \(\mathrm {Diff}_V\), since it is bounded from below. We need to show that this infimum is also a minimum, i.e., that it is achieved at some \(\varphi \in \mathrm {Diff}_V\).

We first use the following lemma (recall that we have denoted by \({\mathcal X}^1_V\) (resp. \({\mathcal X}^2_V\)) the set of time-dependent vector fields on \(\varOmega \) with integrable (resp. square integrable) V-norm over [0, 1]):

Lemma 10.3

Minimizing \(E(\varphi ) = d_V({\mathrm {id}}, \varphi )^2/2 + U(\varphi )\) over \(\mathrm {Diff}_V\) is equivalent to minimizing the function

$$\begin{aligned} \tilde{E}(v) =\frac{1}{2} \int _0^1\left\| v(t) \right\| _V^2 dt + U(\varphi ^v_{01}) \end{aligned}$$
(10.9)

over \({\mathcal X}^2_V\).

Let us prove this lemma. For \(v\in {\mathcal X}^2_V\), we have, by definition of the distance

$$ d_V({\mathrm {id}}, \varphi _{01}^v)^2 \le \int _0^1\left\| v(t) \right\| _V^2 dt, $$

which implies \(E(\varphi _{01}^v) \le \tilde{E}(v)\). This obviously implies that \(\inf _{\mathrm {Diff}_V} E(\varphi ) \le \tilde{E}(v)\), and since this is true for all \(v\in {\mathcal X}^2_V\), we have \(\inf E \le \inf \tilde{E}\). Now, assume that \(\varphi \) is such that \(E(\varphi ) \le \inf E + \varepsilon /2\). Then, by definition of the distance, there exists a v such that \(\varphi = \varphi ^v_{01}\) and

$$ \int _0^1\left\| v(t) \right\| _V^2 dt \le d_V({\mathrm {id}}, \varphi )^2 + \varepsilon , $$

which implies that

$$ \tilde{E}(v) \le E(\varphi ) + \varepsilon /2 \le \inf E +\varepsilon , $$

so that \(\inf \tilde{E} \le \inf E\).

We therefore have \(\inf E = \inf \tilde{E}\). Moreover, if there exists a v such that \(\tilde{E}(v) = \min \tilde{E}= \inf E\), then, since we know that \(E(\varphi _{01}^v) \le \tilde{E}(v)\), we must have \(E(\varphi _{01}^v) = \min E\). Conversely, if \(E(\varphi ) = \min E\), then, by Theorem 7.22, there exists a v such that \(\varphi = \varphi ^v_{01}\) and \(\int _0^1 \Vert v(t)\Vert _V^2\, dt = d_V({\mathrm {id}}, \varphi )^2\), so that \(\tilde{E}(v) = E(\varphi )\) and this v must achieve the infimum of \(\tilde{E}\), which proves the lemma.

This lemma shows that it suffices to study the minimizers of \(\tilde{E}\). Now, as done in the proof of Theorem 7.22, one can find, by taking a subsequence of a minimizing sequence, a sequence \(v^n\) in \({\mathcal X}^2_V\) which converges weakly to some \(v\in {\mathcal X}^2_V\) and \(\tilde{E}(v^n)\) tends to \(E_{min}\). Because

$$ \liminf \int _0^1\left\| v^n(t) \right\| _V^2 dt \ge \int _0^1\left\| v(t) \right\| _V^2 dt $$

and because weak convergence in \({\mathcal X}^2_V\) implies convergence of the flow in the \((p, \infty )\)-compact topology (Theorem 7.13), we also have \(U(\varphi ^{v^n}_{01}) \rightarrow U(\varphi ^{v}_{01})\), so that \(\tilde{E}(v) = E_{min}\) and v is a minimizer.    \(\square \)

The general problem of minimizing functionals such as (10.9) has been called “large deformation diffeomorphic metric mapping”, or LDDMM. The first algorithms were introduced for this purpose in the case of landmark matching [159] and image matching [32] (these papers were preceded by theoretical developments in [93, 278, 283]). The following sections describe these algorithms, and others that were proposed more recently.
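As a concrete illustration of the objects entering (10.9), the following sketch evaluates a time-discretized version of the energy for landmark matching with a scalar Gaussian kernel: the control is a momentum attached to the moving points, the RKHS norm is computed through the kernel matrix, and the data term U is taken, as an assumption, to be a squared distance to target points; names and discretization are illustrative.

```python
# Time-discretized sketch of the energy (10.9) for landmark matching.
import numpy as np

def gaussian(qa, qb, sigma=1.0):
    d2 = ((qa[:, None, :] - qb[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

def lddmm_energy(p, q0, target, sigma=1.0):
    # p: (T, n, d) momenta attached to the moving points over T time steps
    T, dt, q, reg = p.shape[0], 1.0 / p.shape[0], q0.copy(), 0.0
    for t in range(T):
        G = gaussian(q, q, sigma)
        # ||v(t)||_V^2 = sum_ij Gamma(q_i, q_j) p_i . p_j for point momenta
        reg += 0.5 * dt * np.einsum('ij,ik,jk->', G, p[t], p[t])
        q = q + dt * (G @ p[t])              # Euler step of dq/dt = v(t, q)
    return reg + ((q - target) ** 2).sum()   # assumed data term U(phi_01) on the points
```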

10.4 Euler–Lagrange Equations and Gradient

10.4.1 Gradient: Direct Computation

We now detail the computation of the gradient for energies such as (10.8). As remarked in the proof of Theorem 10.2, the variational problem which has to be solved is conveniently expressed as a problem over \({\mathcal X}^2_V\). The function which is minimized over this space takes the form

$$ E(v) = \frac{1}{2} \int _0^1 \left\| v(t) \right\| ^2_V dt + U(\varphi ^v_{01}). $$

Assume that V is embedded in \(C^{p+1}_0(\varOmega , {\mathbb {R}}^d)\) and that U is differentiable on \(\mathrm {Diff}^{p, \infty }_0\). Then Theorem 7.12 and the chain rule imply that E is differentiable on \({\mathcal X}^2_V\) with

$$ \big ( dE(v) \,\big |\, h \big ) = {\big \langle {v}\, , \, {h}\big \rangle }_{{\mathcal X}^2_V} + \big ( dU(\varphi ^v_{01}) \,\big |\, \partial _v \varphi ^v_{01} \, h \big ), $$

where \(\partial _v \varphi ^v_{01} \, h\) is given in Theorem 7.12.

We now identify the gradient of E for the Hilbert structure of \({\mathcal X}^2_V\). This gradient is a function, denoted \(\nabla ^V E: v\mapsto \nabla ^V E(v) \in {\mathcal X}^2_V\), that satisfies

$$ \big ( dE(v) \,\big |\, h \big ) = {\big \langle {\nabla ^V E(v)}\, , \, {h}\big \rangle }_{{\mathcal X}^2_V} = \int _0^1 {\big \langle {\nabla ^VE(v)(t)}\, , \, {h(t)}\big \rangle }_V dt $$

for all vh in \({\mathcal X}^2_V\).

Since the space V is fixed in this section, we will drop the superscript from the notation and simply write \(\nabla E(v)\) for the gradient. Note that this is different from the Eulerian gradient we have dealt with before; \(\nabla E\) now represents the usual gradient of a function defined over a Hilbert space. One important thing to keep in mind is that the gradient we define here is an element of \({\mathcal X}^2_V\), hence a time-dependent vector field, whereas the Eulerian gradient was an element of V (a vector field on \(\varOmega \)). Theorem 10.5 relates the two (and allows us to reuse the computations that were made in Chap. 9). For this, we need to introduce the following action of diffeomorphisms on vector fields.

Definition 10.4

Let \(\varphi \) be a diffeomorphism of \(\varOmega \) and v a vector field on \(\varOmega \). We denote by \({\mathrm {Ad}}_\varphi v\) the vector field on \(\varOmega \) defined by

$$\begin{aligned} {\mathrm {Ad}}_\varphi v (x) = (d\varphi \ v)\circ \varphi ^{-1}(x). \end{aligned}$$
(10.10)

\({\mathrm {Ad}}_\varphi \) is called the adjoint representation of \(\varphi \).

If \(\varphi \in \mathrm {Diff}^{p+1, \infty }(\varOmega )\), then an application of Lemma 7.3 and the Leibniz formula implies that \({\mathrm {Ad}}_\varphi v \in C^{p}_0(\varOmega , {\mathbb {R}}^d)\) as soon as \(v\in C^{p}_0(\varOmega , {\mathbb {R}}^d)\), and more precisely that \({\mathrm {Ad}}_\varphi \) is a bounded linear operator from \(C^{p}_0(\varOmega , {\mathbb {R}}^d)\) to itself. We can therefore define its conjugate on \(C_0^{p}(\varOmega , {\mathbb {R}}^d)^*\), with \({\mathrm {Ad}}^*_\varphi \rho \) given by

$$\begin{aligned} \big ( {\mathrm {Ad}}_\varphi ^*\rho \,\big |\, v \big ) = \big ( \rho \,\big |\, {\mathrm {Ad}}_\varphi v \big ) \end{aligned}$$
(10.11)

for \(\rho \in C_0^{p}(\varOmega , {\mathbb {R}}^d)^*\), \(v\in C_0^{p}(\varOmega , {\mathbb {R}}^d)\). Note that \(\mathrm {Ad}_\varphi ^*\rho \) is, a fortiori, in \(V^*\), because V is continuously embedded in \(C^{p+1}_0(\varOmega , {\mathbb {R}}^d)\).

Let \(\mathbb L:V\rightarrow V^*\) denote the duality operator on V and \(V^{(r)}\) denote the set of vector fields \(v\in V\) such that \(\mathbb Lv\in C^r_0(\varOmega , {\mathbb {R}}^d)^*\) (for \(r\le p+1\)). Then, for \(v\in V^{(p)}\), we can define, with \({\mathbb {K}}= \mathbb L^{-1}\),

$$\begin{aligned} \mathrm {Ad}_\varphi ^T v = {\mathbb {K}}(\mathrm {Ad}_\varphi ^*\mathbb Lv). \end{aligned}$$
(10.12)

This is well-defined, because, by construction, \(\mathrm {Ad}_\varphi ^*\mathbb Lv\in C^{p}_0(\varOmega , {\mathbb {R}}^d)^* \subset V^*\). We have in particular, for \(v\in V^{(p)}\) and \(w\in V\),

$$ {\big \langle {\mathrm {Ad}_\varphi ^T v}\, , \, {w}\big \rangle }_V = \big ( \mathrm {Ad}_\varphi ^*\mathbb Lv \,\big |\, w \big ) = \big ( \mathbb Lv \,\big |\, \mathrm {Ad}_\varphi w \big ). $$

Recall that the Eulerian derivative of U is defined by

$$ \big ( \bar{\partial }U(\varphi ) \,\big |\, w \big ) = \big ( dU(\varphi ) \,\big |\, w\circ \varphi \big ). $$

Using Theorem 7.12, we have

$$ \partial _v \varphi ^v_{01} \, h = \int _0^1 (d\varphi ^v_{u1} h(u)) \circ \varphi ^{v}_{0u}\, du = \int _0^1 ({\mathrm {Ad}}_{\varphi ^v_{u1} }h(u))\circ \varphi ^v_{01}\, du $$

so that

$$\begin{aligned} \big ( dU(\varphi _{01}^v) \,\big |\, \partial _v \varphi ^v_{01} \, h \big )&= {\Big ( {\bar{\partial }U(\varphi _{01}^v)}\,\Big |\, { \int _0^1 ({\mathrm {Ad}}_{\varphi ^v_{u1}} h(u))\, du}\Big )} \\&= \int _0^1 \big ( \bar{\partial }U(\varphi _{01}^v) \,\big |\, {\mathrm {Ad}}_{\varphi ^v_{u1}} h(u) \big )\, du. \end{aligned}$$

With this notation, we have the following theorem.

Theorem 10.5

Assume that V is continuously embedded in \(C^{p+1}_0(\varOmega , {\mathbb {R}}^d)\) and that U is continuously differentiable on \(\mathrm {Diff}^{p, \infty }_0\). Then, the \({\mathcal X}^2_V\) gradient of \(\tilde{U}: v \mapsto U(\varphi _{01}^v)\) is given by the formula

$$\begin{aligned} \nabla \tilde{U}(v)(t) = {\mathbb {K}}{\mathrm {Ad}}_{\varphi _{t1}^v}^* \bar{\partial }U(\varphi _{01}^v) = {\mathrm {Ad}}_{\varphi _{t1}^v}^T {\overline{\nabla }}U(\varphi _{01}^v) . \end{aligned}$$
(10.13)

This important result has the following simple consequences.

Proposition 10.6

Let U satisfy the assumptions of Theorem 10.5. If \(v \in {\mathcal X}^2_V\) is a minimizer of

$$\begin{aligned} \tilde{E}(v) = \frac{1}{2} \int _0^1 \Vert v(t)\Vert _V^2 dt + U(\varphi _{01}^v), \end{aligned}$$
(10.14)

then, for all t

$$\begin{aligned} v(t) = {\mathrm {Ad}}_{\varphi _{t1}^v}^T v(1), \end{aligned}$$
(10.15)

with \(v(1) = - {\overline{\nabla }}^V U(\varphi _{01}^v)\). In particular, v is a continuous function of t and \(v(t) \in V^{(p)}\) for all t.

Corollary 10.7

Under the same conditions on U, if \(v \in {\mathcal X}^2_V\) is a minimizer of

$$ \tilde{E}(v) =\frac{1}{2} \int _0^1 \Vert v_t\Vert _V^2 dt + U(\varphi _{01}^v) $$

then, for all t,

$$\begin{aligned} v_t = {\mathrm {Ad}}_{\varphi _{t0}^v}^T v_0, \end{aligned}$$
(10.16)

with \(v_0\in V^{(p)}\).

Proposition 10.6 is a direct consequence of Theorem 10.5. For the corollary, we need to use the fact that \({\mathrm {Ad}}_\varphi {\mathrm {Ad}}_\psi = {\mathrm {Ad}}_{\varphi \circ \psi }\), which can be checked by direct computation, and write

$$ v_t = {\mathrm {Ad}}_{\varphi _{t1}^v}^T v_1 = {\mathrm {Ad}}_{\varphi _{t1}^v}^T {\mathrm {Ad}}_{\varphi _{10}^v}^T v_0 = ({\mathrm {Ad}}_{\varphi _{10}^v}{\mathrm {Ad}}_{\varphi _{t1}^v})^T v_0 = {\mathrm {Ad}}_{\varphi _{t0}^v}^Tv_0. $$

Equations \(v_t = {\mathrm {Ad}}_{\varphi _{t0}^v}^T v_0\) and \(v_1 = - {\overline{\nabla }}^V U(\varphi _{01}^v)\) together are equivalent to the Euler–Lagrange equations for \(\tilde{E}\) and will lead to interesting numerical procedures. Equation (10.16) is a cornerstone of the theory. It describes a general mechanical property called the conservation of momentum, to which we will return later.

10.4.2 Derivative Using Optimal Control

We can also apply the Pontryagin maximum principle (see Appendix D) to obtain an alternative expression of the optimality conditions and gradient. Indeed, we can repeat the construction made in Sect. 7.2.2 with a slightly different notation, letting \(f(\omega , v) = v\circ ({\mathrm {id}}+ \omega )\), defined over \(C^p_0(\varOmega , {\mathbb {R}}^d)\times V\). With \(g(\omega , v) = \Vert v\Vert _V^2\), we are in the framework described in Sect. D.3.1, leading to Theorem D.7, where \(\omega \) represents the state and v is the control. Introducing a co-state \(\mu \), define the Hamiltonian

$$ H_v(\mu , \omega ) = \big ( \mu \,\big |\, v\circ ({\mathrm {id}}+\omega ) \big ) - \Vert v\Vert _V^2/2. $$

Letting \(\xi _\varphi : v \mapsto v\circ \varphi \) from V to \(C^p_0(\varOmega , {\mathbb {R}}^d)\), we obtain the fact that an optimal solution must satisfy (with \(\varphi = {\mathrm {id}}+\omega \)), for some \(\mu :[0,1]\rightarrow C^p_0(\varOmega , {\mathbb {R}}^d)^*\)

$$\begin{aligned} \left\{ \begin{aligned}&\partial _t \varphi (t) = v(t)\circ \varphi (t)\\&\big ( \partial _t \mu (t) \,\big |\, h \big ) = -\big ( \mu (t) \,\big |\, dv(t)\circ \varphi (t) \, h \big ) , \quad \forall h\in C^p_0(\varOmega , {\mathbb {R}}^d) \\&\mathbb Lv(t) = \xi _{\varphi (t)}^* \mu (t) \end{aligned}\right. \end{aligned}$$
(10.17)

with \(\varphi (0) = {\mathrm {id}}\) and \(\mu (1) = - dU(\varphi (1))\). One can check that the second equation is equivalent to

$$ \big ( \mu (t) \,\big |\, h \big ) = \big ( \mu (0) \,\big |\, (d\varphi (t))^{-1} h \big ), $$

which is Corollary 10.7 expressed in terms of the co-state \(\mu \). Applying Eq. (D.12), we obtain

$$\begin{aligned} d\tilde{E}(v) (t) = - \xi _{\varphi (t)}^* \mu (t) + 2\mathbb Lv(t), \end{aligned}$$
(10.18)

where \(\varphi \) and \(\mu \) satisfy the first two equations of (10.17).

10.4.3 An Alternative Form Using the RKHS Structure

The conjugate of the adjoint can be put into a form explicitly involving the reproducing kernel of V. Before detailing this, we introduce a notation that will be used throughout this chapter. If \(\rho \) is a linear form on a function space, we have been denoting by \(\big ( \rho \,\big |\, v \big )\) the result of \(\rho \) applied to v. In the formulas that follow, we will need to emphasize the variable on which v depends, and we will use the alternative notation \(\big ( \rho \,\big |\, v(x) \big )_x\) to denote the same quantity. Thus,

$$ \rho (v) = \big ( \rho \,\big |\, v \big ) = \big ( \rho \,\big |\, v(x) \big )_x. $$

In particular, when v depends on two variables, the notation \(\big ( \rho \,\big |\, v(x, y) \big )_x\) will represent \(\rho \) applied to the function \(x\mapsto v(x, y)\) with y considered as constant.

We still assume that V is continuously embedded in \(C^{p+1}_0(\varOmega , {\mathbb {R}}^d)\). Then, the following theorem holds.

Theorem 10.8

Assume that \(\varphi \in C^{q+1}_0(\varOmega , {\mathbb {R}}^d)\) and \(\rho \in C^r_0(\varOmega , {\mathbb {R}}^d)^*\), with \(r = \min (p+1, q)\). Let \(v = {\mathbb {K}}\rho \) and \((e_1, \ldots , e_d)\) be an orthonormal basis of \({\mathbb {R}}^d\). Then, for \(y\in \varOmega \), we have

$$\begin{aligned} {\mathrm {Ad}}_\varphi ^T v(y) = \sum _{i=1}^d \big ( \rho \,\big |\, \mathrm {Ad}_\varphi (K(x, y)e_i) \big )_x\, e_i, \end{aligned}$$
(10.19)

where K is the reproducing kernel of V.

Proof

For \(b\in {\mathbb {R}}^d\), we have

$$\begin{aligned} b^T {\mathrm {Ad}}_\varphi ^T v(y)&= {\big \langle {\mathrm {Ad}_\varphi ^T v}\, , \, {K(\cdot ,y)b}\big \rangle }_V\\&= {\big \langle {v}\, , \, {\mathrm {Ad}_\varphi (K(\cdot ,y)b)}\big \rangle }_V\\&= \big ( \rho \,\big |\, \mathrm {Ad}_\varphi (K(x, y)b) \big )_x. \end{aligned}$$

Theorem 10.8 is now a consequence of the decomposition

$$ {\mathrm {Ad}}_\varphi ^T v(y) = \sum _{i=1}^d e_i^T {\mathrm {Ad}}_\varphi ^T v(y) e_i. $$

   \(\square \)

Recall that \(K(\cdot ,\cdot )\) is a matrix, so that \(K(\cdot , y)e_i\) is the ith column of \(K(\cdot , y)\), which we can denote by \(K^i\). Equation (10.19) states that the ith coordinate of \({\mathrm {Ad}}_\varphi ^T v\) is \(\big ( \rho \,\big |\, {\mathrm {Ad}}_\varphi K^i(x, y) \big )_x\).

Using Proposition 10.6 and Theorem 10.8, we obtain another expression of the V-gradient of E:

Corollary 10.9

Under the hypotheses of Proposition 10.6, the V-gradient of

$$ \tilde{E}(v) = \frac{1}{2} \int _0^1 \Vert v(t)\Vert ^2_V dt + U(\varphi _{01}^v) $$

is equal to

$$\begin{aligned} \nabla ^V\tilde{E}(v)(t, y) = v(t, y) + \sum _{i = 1}^d \big ( \rho (1) \,\big |\, d\varphi _{t1}(\varphi ^v_{1t}(x))\, K^i(\varphi _{1t}^v(x), y) \big )_x\, e_i \end{aligned}$$
(10.20)

with \(\rho (1) = \bar{\partial }U(\varphi _{01}^v)\).

10.5 Conservation of Momentum

10.5.1 Interpretation

Equation (10.16) can be interpreted as a momentum conservation equation. The justification of the term momentum comes from the analogy of \(E_{\mathrm {kin}} := (1/2) \Vert v(t)\Vert ^2_V\) with the total kinetic energy at time t of a dynamical system. In fluid mechanics, this energy is usually defined as (introducing a mass density, z)

$$ E_{\mathrm {kin}} = \frac{1}{2} \int z(t, y) |v(t, y)|^2 dy, $$

the momentum here being \( \rho (t) = z(t, y)v(t, y) dy\) with \(E_{\mathrm {kin}} = (1/2) \big ( \rho \,\big |\, v \big )\). In our case, taking \(\rho (t) = \mathbb Lv(t)\), we still have \(E_{\mathrm {kin}} = (1/2) \big ( \rho \,\big |\, v \big )\), so that \(\rho (t)\) is also interpreted as a momentum.

To interpret (10.16) as a conservation equation, we need to understand how a change of coordinate system affects the momentum. Indeed, interpret v(t, y) as the velocity of a particle located at coordinates y, so \(v = dy/dt\). Now assume that we want to use a new coordinate system, and replace y by \(x = \varphi (y)\). In the new coordinates, the same particle moves with velocity

$$ \partial _t x = d\varphi (y) \partial _t y = d\varphi (y)\, v(t, y) = (d\varphi \, v(t)) \circ \varphi ^{-1}(x) $$

so that the translation from the old to the new expression of the velocity is precisely given by the adjoint operator: \(v(y) \rightarrow \tilde{v}(x) = \mathrm {Ad}_\varphi v(x)\) if \(x = \varphi (y)\). To obtain the correct transformation of the momentum, it suffices to notice that the energy of the system must remain the same if we just change the coordinates, so that, if \(\rho \) and \(\tilde{\rho }\) are the momenta before and after the change of coordinates, we must have

$$ \big ( \tilde{\rho } \,\big |\, {\tilde{v}} \big ) = \big ( \rho \,\big |\, v \big ), $$

which yields \(\mathrm {Ad}_\varphi ^* \tilde{\rho }= \rho \) or \(\tilde{\rho } = \mathrm {Ad}_{\varphi ^{-1}}^* \rho \).

Now, we return to Eq. (10.16). Here, v(t, y) is the velocity at time t of the particle that was at \(x = \varphi ^v_{t0}(y)\) at time 0. So it is the expression of the velocity in a coordinate system that evolves with the flow, and \(\mathbb Lv(t)\) is the momentum in the same system. By the previous argument, the expression of the momentum in the fixed coordinate system, taken at time \(t=0\), is \({\mathrm {Ad}}^*_{\varphi ^v_{0t}} \mathbb Lv(t)\). Equation (10.16) simply states that this expression remains constant over time, i.e., the momentum is conserved when measured in a fixed coordinate system.

The conservation of momentum equation, described in Corollary 10.7, is a fundamental equation in Geometric Mechanics [149, 187], which appears in a wide variety of contexts. It has been described in abstract form by Arnold [18, 19] in his analysis of invariant Riemannian metrics on Lie groups. This equation also derives from an application of the Euler–Poincaré principle, as described in [149, 150, 188]. Combined with a volume-preservation constraint, this equation is equivalent to the Euler equation for incompressible fluids, in the case when \(\Vert v(t)\Vert _V = \Vert v(t)\Vert _2\), the \(L^2\) norm. Another type of norm on V (called the \(H^1_\alpha \) norm) relates to models of waves in shallow water, and provides the Camassa–Holm equation [50, 116, 149]. A discussion of (10.16) in the particular case of template matching is provided in [205], and a parallel with the solitons emerging from the Camassa–Holm equation is discussed in [151].

10.5.2 Properties of the Momentum Conservation Equation

Combining Eq. (10.19) and the fact that \(\partial _t \varphi _{0t}^v = v(t, \varphi _{0t}^v)\), we get, for the optimal v (letting \(v_0 = {\mathbb {K}}\rho _0\))

$$ \partial _t \varphi (t, y) = \sum _{i=1}^d \big ( \rho _0 \,\big |\, (d\varphi (t, x))^{-1} K^i(\varphi (t,x), \varphi (t, y)) \big )_x\, e_i. $$

Letting \(\varphi = {\mathrm {id}}+\omega \), we consider the equation

$$\begin{aligned} \partial _t \omega (t, y) = \sum _{i=1}^d \big ( \rho _0 \,\big |\, ({\mathrm {Id}}+ d\omega (t, x))^{-1} K^i(x + \omega (t,x), y + \omega (t, y)) \big )_x\, e_i. \end{aligned}$$
(10.21)

We now consider this equation as an ODE over \(C^{p}_0(\varOmega , {\mathbb {R}}^d)\) and discuss conditions on \(\rho _0\) ensuring the existence and uniqueness of solutions. We will make the following assumptions.

  1. (I)

    V is continuously embedded in \(C^{p+1}_0(\varOmega , {\mathbb {R}}^d)\) and its kernel, K, is such that all derivatives \(\partial _1^{k}\partial _2^{k} K(y, y)\) are bounded over \( \varOmega \) for \(k \le p+1\).

  2. (II)

    \(\rho _0 \in C^r(\varOmega , {\mathbb {R}}^d)^*\) for some \(r \le p-1\).

  3. (III)

    \(\rho _0\) is compactly supported: there exists a compact subset \(Q'\subset {\mathbb {R}}^d\) such that \(\big ( \rho _0 \,\big |\, f \big ) = 0\) for all \(f \in C^r_0(\varOmega , {\mathbb {R}}^d)\) such that \(f(x) = 0\) for all \(x\in Q'\).

Assumption (I) is true in particular when \(\varOmega ={\mathbb {R}}^d\) and K is translation-invariant.

Taking Q slightly larger than \(Q'\) in assumption (III), and choosing a \(C^\infty \) function \(\varepsilon \) such that \(\varepsilon =1\) on \(Q'\) and \(\varepsilon =0\) on \(Q^c\), we have \(\big ( \rho _0 \,\big |\, f \big ) = \big ( \rho _0 \,\big |\, \varepsilon f \big )\) for all \(f \in C^r_0(\varOmega , {\mathbb {R}}^d)\), from which we can deduce that, for some constant C,

$$ \big ( \rho _0 \,\big |\, f \big ) \le C \Vert f\Vert _{r, Q}, $$

where

$$ \Vert f\Vert _{r, Q} = \max _{x\in Q}\max _{|J|\le r} |\partial _J f(x)|. $$

The following lemma provides the required properties for the well-posedness of (10.21).

Let \(\mathcal O = \mathrm {Diff}^p_0 -\mathrm {id}\), an open subset of \(C^p_0(\varOmega , \mathbb R^d)\).

Lemma 10.10

Let

$$\begin{aligned} \mathscr {V}(\omega )(y) = \sum _{i=1}^d \big ( \rho _0 \,\big |\, ({\mathrm {Id}}+ d\omega (x))^{-1} K^i(x + \omega (x), y + \omega (y)) \big )_x\, e_i. \end{aligned}$$
(10.22)

Under assumptions (I), (II), (III) above, \(\mathscr {V}\) is a differentiable mapping from \(\mathcal O\) into \(C^p_0(\varOmega , \mathbb R^d)\) and, letting \(\varphi = {\mathrm {id}}+\omega \),

$$\begin{aligned} \Vert d\mathscr {V}(\omega )\Vert _{\mathrm {op}(C^p_0(\varOmega , {\mathbb {R}}^d))} \le C(\Vert d\varphi (\cdot )^{-1}\Vert _\infty , \Vert \varphi \Vert _{p,\infty }) \end{aligned}$$
(10.23)

for some continuous function C.

Proof

Step 1. We first check that the right-hand side of (10.21) is well defined. Since we assume that V is embedded in \(C^{p+1}_0(\varOmega , {\mathbb {R}}^d)\), we know that, for all \(0\le r, s\le p+1\), \(\partial _1^r \partial _2^s K^i\) is in \(C_0(\varOmega , {\mathbb {R}}^d)\) with respect to each of its variables. In particular, \(x\mapsto ({\mathrm {Id}}+ d\omega (t, x))^{-1} K^i(x + \omega (t,x), y + \omega (t, y))\) is in \(C^{p-1}_0(\varOmega , {\mathbb {R}}^d)\) as soon as \(\omega \in C^{p}_0(\varOmega , {\mathbb {R}}^d)\), so that \(\rho _0\) can be evaluated on it.

Step 2. We now prove that the right-hand side of (10.21) is in \(C_0^{p}(\varOmega , {\mathbb {R}}^d)\), which ensures that (10.21) forms an ODE in this space. Let

$$ v^\varphi (y) = \sum _{i=1}^d \big ( \rho _0 \,\big |\, d\varphi (x)^{-1} K^i(\varphi (x), y) \big )_x\, e_i $$

so that (10.21) can be written as \(\partial _t \omega = v^{{\mathrm {id}}+\omega }\circ ({\mathrm {id}}+\omega )\). We want to show that \(v^\varphi \in C^{p}_0(\varOmega , {\mathbb {R}}^d)\) when \(\varphi ={\mathrm {id}}+\omega \) and \(\omega \in C^p_0(\varOmega , {\mathbb {R}}^d)\). It is obviously sufficient to prove that each coordinate

$$ v_i^\varphi (y) = \big ( \rho _0 \,\big |\, d\varphi (x)^{-1} K^i(\varphi (x), y) \big )_x $$

belongs to \(C^{p}_0(\varOmega , {\mathbb {R}})\). We first justify the fact that \(v_i^\varphi \) is p-times differentiable, with

$$ d^r v_i^\varphi (y) = \big ( \rho _0 \,\big |\, d\varphi (x)^{-1} \partial _2^r K^i(\varphi (x), y) \big )_x $$

for \(r\le p\). Using a Taylor expansion, we can write (letting \(h^{(k)}\) denote the k-tuple \((h,\ldots , h)\))

$$\begin{aligned}&K^i(\varphi (x), y+h) = \sum _{k=0}^{p+1} \frac{1}{k!}\partial _2^k K^i(\varphi (x), y) h^{(k)} \\&\qquad + \frac{1}{p!} \int _0^1(\partial _2^{p+1} K^i(\varphi (x), y+th) - \partial _2^{p+1} K^i(\varphi (x), y))h^{(p+1)} (1-t)^p \, dt \end{aligned}$$

so that

$$\begin{aligned}&\quad v_i^\varphi (y+h) = \sum _{k=0}^{p+1} \frac{1}{k!}\big ( \rho _0 \,\big |\, d\varphi (x)^{-1}\partial _2^k K^i(\varphi (x), y) h^{(k)} \big )_x \\&+ \frac{1}{p!} {\Big ( {\rho _0}\,\Big |\, {d\varphi (x)^{-1}\int _0^1(\partial _2^{p+1} K^i(\varphi (x), y+th) - \partial _2^{p+1} K^i(\varphi (x), y))h^{(p+1)} (1-t)^p \, dt}\Big )}_x \end{aligned}$$

and it suffices to prove that the remainder is an \(o(|h|^{p+1})\). This will be true provided

$$ \lim _{y'\rightarrow y} \Vert \partial _2^{p+1} K^i(\cdot , y') - \partial _2^{p+1} K^i(\cdot , y)\Vert _{r, \infty } = 0. $$

For \(k_1\le r\), we have, using Eq. (8.8),

$$\begin{aligned}&\Vert \partial _1^{k_1} \partial _2^{p+1} K^i(\cdot , y') - \partial _1^{k_1}\partial _2^{p+1} K^i(\cdot , y)\Vert _{r, \infty } \\&\quad \le C \max _{h:|h|=1} \Vert \partial _2^{p+1} K^i(\cdot , y')(h^{(p+1)}) - \partial _2^{p+1} K^i(\cdot , y)(h^{(p+1)})\Vert _V \\&\quad = C \left| \partial _1^{p+1}\partial _2^{p+1} K^{ii}(y', y') + \partial _1^{p+1}\partial _2^{p+1} K^{ii}(y, y) - 2\partial _1^{p+1}\partial _2^{p+1} K^{ii}(y, y')\right| ^{1/2} \end{aligned}$$

for some constant C, where \(K^{ij}\) denotes the ij entry of K. This proves the desired result, since \(\partial _1^{p+1}\partial _2^{p+1}K\) is continuous. A similar argument can be made to prove the continuity of \(y\mapsto d^pv(y)\).

To prove that \(v^\varphi \in C^p_0(\varOmega , {\mathbb {R}}^d)\), it suffices to show that, for all \(k\le p+1\), \(\Vert \partial _2^k K^i(\cdot , y)\Vert _{r, Q}\) goes to 0 when y goes to infinity. (This is where we use the fact that \(\rho _0\) has compact support.)

To reach a contradiction, assume that there exist sequences \((x_n), (y_n)\) with \(x_n\in Q\) and \(y_n\) tending to infinity or to \(\partial \varOmega \) such that \(|\partial _1^{k_1}\partial _2^{k_2} K^i(x_n, y_n)| >\varepsilon \), for some fixed \(\varepsilon >0\) and \(k_1\le r\), \(k_2\le p+1\). Replacing \((x_n)\) by a subsequence if needed, we can assume that \(x_n\) converges to some \(x\in Q\). Note that \(\partial _1^{k_1}\partial _2^{k_2} K^{ij}(x, y_n) = \partial _1^{k_2}\partial _2^{k_1} K^{ji}(y_n, x)\). Since \(\partial _2^{k_1} K^{j}(\cdot , x) \in V\), we can conclude that \(\partial _1^{k_2}\partial _2^{k_1} K^{j}(y_n, x) \rightarrow 0\) for all j, implying that \(\partial _1^{k_1}\partial _2^{k_2} K^i(x, y_n) \rightarrow 0\) for all i, too.

Similarly, \(\partial _1^{k_1}\partial _2^{k_2} K^{ij}(x_n, y_n) - \partial _1^{k_1}\partial _2^{k_2} K^{ij}(x, y_n)\) is the ith entry of \(\partial _1^{k_2}\partial _2^{k_1} K^j(y_n, x_n) - \partial _2^{k_1}\partial _1^{k_2} K^j(y_n, x)\) and

$$\begin{aligned}&\sup _y |\partial _1^{k_2}\partial _2^{k_1} K^j(y, x_n) - \partial _1^{k_2}\partial _2^{k_1} K^j(y, x)| \\&\quad \le C \max _{h:|h|=1} \Vert \partial _2^{k_1} K^j(\cdot , x_n)(h^{(k_1)}) - \partial _2^{k_1} K^j(\cdot , x)(h^{(k_1)})\Vert _V\\&\quad \le C \left| \partial _1^{k_1}\partial _2^{k_1} K^{jj}(x_n, x_n) - 2\partial _1^{k_1}\partial _2^{k_1} K^{jj}(x_n, x) + \partial _1^{k_1}\partial _2^{k_1} K^{jj}(x, x)\right| ^{1/2}, \end{aligned}$$

which goes to 0. This is our contradiction.

Step 3: We now study the differentiability of the mapping \(\mathscr {V}: \omega \mapsto v^{{\mathrm {id}}+\omega }\circ ({\mathrm {id}}+\omega )\) from \(C^p_0(\varOmega , {\mathbb {R}}^d)\) into itself. The candidate for \(d\mathscr {V}(\omega )\eta \) is

$$\begin{aligned} (\mathscr {W}(\omega )\eta )(y) =&- \sum _{i=1}^d \big ( \rho _0 \,\big |\, d\varphi (x)^{-1}d\eta (x)d\varphi (x)^{-1} K^i(\varphi (x), \varphi (y)) \big )_x\, e_i\\&+ \sum _{i=1}^d \big ( \rho _0 \,\big |\, d\varphi (x)^{-1}\partial _1 K^i(\varphi (x), \varphi (y))\eta (x) \big )_x\, e_i\\&+ \sum _{i=1}^d \big ( \rho _0 \,\big |\, d\varphi (x)^{-1}\partial _2 K^i(\varphi (x), \varphi (y))\eta (y) \big )_x\, e_i, \end{aligned}$$

still with \(\varphi ={\mathrm {id}}+\omega \). We can decompose \(\mathscr {V}(\omega +\eta )(y) - \mathscr {V}(\omega )(y) - (\mathscr {W}(\omega )\eta )(y)\) as the sum of five terms

$$ \sum _{k=1}^5\sum _{i=1}^d \big ( \rho _0 \,\big |\, A_k^i(x, y) \big )_x\, e_i $$

(described below), which we will study separately. For each term, we need to prove that, for \(k_1\le r\), \(k_2\le p\), one has

$$ \sup _{x, y}|\partial _1^{k_1}\partial _2^{k_2} A_k^i(x, y)| = o(\Vert \eta \Vert _{p, \infty }). $$

The important point in the following discussion is that none of the estimates will require more than p derivatives in \(\varphi \) and \(\eta \), and no more than \(p+1\) in K.

(i) We let

$$\begin{aligned} A^i_1(x, y) = ((d\varphi (x)+d\eta (x))^{-1}- d\varphi (x)^{-1}&+ d\varphi (x)^{-1}d\eta (x)d\varphi (x)^{-1})\\&\, K^i(\varphi (x)+\eta (x), \varphi (y)+\eta (y)). \end{aligned}$$

We first note that \( Inv : M\mapsto M^{-1}\) is infinitely differentiable on \(\mathrm {GL}_d({\mathbb {R}})\) with

$$ d^q Inv (M)(H_1, \ldots , H_q) = (-1)^q \sum _{\sigma \in \mathfrak S_q} M^{-1}H_{\sigma (1)} M^{-1} \cdots M^{-1} H_{\sigma (q)} M^{-1}, $$

where \(\mathfrak S_q\) is the set of permutations of \(\{1,\ldots , q\}\). In particular, \(\Vert d^q Inv (M)\Vert = O(\Vert M^{-1}\Vert ^{q+1})\). Writing

$$\begin{aligned} (d\varphi (x)+d\eta (x))^{-1}&- d\varphi (x)^{-1} + d\varphi (x)^{-1}d\eta (x)d\varphi (x)^{-1} =\\&\quad \int _0^1 d^2 Inv (d\varphi (x) + td\eta (x))(d\eta (x), d\eta (x)) (1-t)\, dt, \end{aligned}$$

we see that

$$ \left\| d^{k_1}((d\varphi (x)+d\eta (x))^{-1}- d\varphi (x)^{-1} + d\varphi (x)^{-1}d\eta (x)d\varphi (x)^{-1}) \right\| _\infty $$

will be less than \(C(\varphi )\Vert d\varphi ^{-1}\Vert _\infty ^{k_1+3}\Vert \eta \Vert ^2_{k_1+1, \infty }\). Using the bound

$$ \Vert \partial _{2}^{k_2} K^i(\cdot , y)\Vert _{p+1, \infty }^2 \le C \partial _{1}^{k_2}\partial _{2}^{k_2} K^{ii}(y, y), $$

applying Lemma 7.3 and the product formula, we see that the desired conclusion holds for \(A^i_1\).

(ii) Let

$$\begin{aligned} A^i_2(x, y) = d\varphi (x)^{-1} ( K^i(\varphi (x)+\eta (x), \varphi (y)&+\eta (y)) - K^i(\varphi (x), \varphi (y)+\eta (y)) \\&\!\! - \partial _1 K^i(\varphi (x), \varphi (y)+\eta (y))\eta (x)). \end{aligned}$$

Writing the right-hand side in the form

$$ d\varphi (x)^{-1} \int _0^1 \partial _1^2 K^i(\varphi (x)+t\eta (x), \varphi (y)+\eta (y))(\eta (x), \eta (x)) (1-t) \, dt, $$

the same estimate on the derivative of K can be used, based on the fact that \(k_1+2\le p+1\).

(iii) The third term is

$$\begin{aligned} A^i_3(x, y) = d\varphi (x)^{-1} ( K^i(\varphi (x), \varphi (y)+\eta (y))&- K^i(\varphi (x), \varphi (y)) \\&\quad - \partial _2 K^i(\varphi (x), \varphi (y))\eta (y)). \end{aligned}$$

It can be handled similarly, requiring \(k_2+1\le p+1\) derivatives of \(K^i\) in the second variable.

(iv) These were the three main terms in the decomposition and the remaining two are just bridging gaps. The first one is

$$\begin{aligned} A^i_4(x, y)&= d\varphi (x)^{-1}d\eta (x)d\varphi (x)^{-1}\\&\qquad \qquad \quad (K^i(\varphi (x)+\eta (x), \varphi (y)+\eta (y)) - K^i(\varphi (x), \varphi (y))). \end{aligned}$$

Here, we note that, for some constants C and \(\tilde{C}\),

$$\begin{aligned}&\sup _{x} |\partial _1^{k_1}\partial _2^{k_2} K^i(x, y') - \partial _1^{k_1}\partial _2^{k_2} K^i(x, y)|^2\\&\le C \left| \partial _1^{k_2}\partial _2^{k_2}(K(y', y') - 2K(y',y) + K(y, y))\right| \\&\le \tilde{C} (\partial _1^{k_2+1}\partial _2^{k_2+1} K(y, y) + \partial _1^{k_2+1}\partial _2^{k_2+1} K(y', y')) |y-y'| \end{aligned}$$

(with a similar inequality when the roles of x and y are reversed) and these estimates can be used to check that

$$ \partial _1^{k_1}\partial _2^{k_2} (K^i(\varphi (x)+\eta (x), \varphi (y)+\eta (y)) - K^i(\varphi (x), \varphi (y))) $$

tends to 0 uniformly in x and y.

(v) The last term is

$$\begin{aligned} A^i_5(x, y) = d\varphi (x)^{-1} ( \partial _1 K^i(\varphi (x), \varphi (y)+\eta (y)) - \partial _1 K^i(\varphi (x), \varphi (y)))\eta (x) \end{aligned}$$

and can be handled similarly.

Step 4. It remains to check that \(\mathscr {W}(\omega )\) maps \(C^p_0(\varOmega , {\mathbb {R}}^d)\) to itself. This can be done in the same way we proved that \(\mathscr {V}(\omega ) \in C^p_0(\varOmega , {\mathbb {R}}^d)\), using Taylor expansions and the fact that \(d^k(\mathscr {W}(\omega )\eta )(y)\) will involve no more than k derivatives of \(\omega \) and \(\eta \), and \(k+1\) of K. This shows that \(\mathscr {W} = d\mathscr {V}\). The bound (10.23) can also be shown using the same techniques. We leave the final details to the reader.    \(\square \)

Lemma 10.10 implies that (10.21) has unique local solutions (unique solutions over small enough time intervals). If we can prove that \(\Vert (d\varphi )^{-1}\Vert _\infty \) and \(\Vert \varphi \Vert _{p, \infty }\) remain bounded over solutions of the equation, inequality (10.23) will be enough to ensure that solutions exist over arbitrary time intervals. This fact will be obtained at the end of the next section.

10.5.3 Time Variation of the Eulerian Momentum

Assume that \(\varphi \) satisfies \(\partial _t \varphi (t) = v(t)\circ \varphi (t)\) with \(v\in {\mathcal X}^{p+1,1}(\varOmega )\). If \(\rho _0\in C^{p-1}(\varOmega , {\mathbb {R}}^d)^*\), we can apply the chain rule to the equation

$$ \big ( \rho (t) \,\big |\, w \big ) = \big ( \rho _0 \,\big |\, {\mathrm {Ad}}_{\varphi (t)^{-1}} w \big ) = \big ( \rho _0 \,\big |\, d\varphi (t)^{-1}\, w\circ \varphi (t) \big ), $$

in which we assume that \(w\in C^p_0(\varOmega , {\mathbb {R}}^d)\). We have (with \(\partial _t d\varphi = dv\circ \varphi \, d\varphi \))

$$\begin{aligned} \partial _t {\mathrm {Ad}}_{\varphi (t)^{-1}} w&= - d\varphi (t)^{-1} dv(t)\circ \varphi (t)\, w\circ \varphi (t) + d\varphi (t)^{-1} dw\circ \varphi (t) \, v(t)\circ \varphi (t)\\&= - {\mathrm {Ad}}_{\varphi (t)^{-1}} (dv(t)\, w - dw\, v(t)). \end{aligned}$$

The term in the right-hand side involves the adjoint representation of v(t), as expressed in the following definition.

Definition 10.11

If v is a differentiable vector field on \(\varOmega \), we denote by \({\mathrm {ad}}_v\) the mapping that transforms a differentiable vector field w into

$$\begin{aligned} {\mathrm {ad}}_v w = dv\, w - dw\, v. \end{aligned}$$
(10.24)

Observe that \(dv\,w - dw\, v = -[v, w]\), where the latter is the Lie bracket between right-invariant vector fields over the group of diffeomorphisms. Note that \({\mathrm {ad}}_v\) continuously maps \(C^p_0(\varOmega , {\mathbb {R}}^d)\) to \(C^{p-1}_0(\varOmega , {\mathbb {R}}^d)\). With this notation, we therefore have, for \(w \in C^p_0(\varOmega , {\mathbb {R}}^d)\):

$$ \partial _t {\mathrm {Ad}}_{\varphi (t)^{-1}} w = - {\mathrm {Ad}}_{\varphi (t)^{-1}} {\mathrm {ad}}_{v(t)} w $$

so that

$$ \partial _t \big ( \rho (t) \,\big |\, w \big ) = - \big ( \rho (t) \,\big |\, {\mathrm {ad}}_{v(t)} w \big ). $$

This yields the equation, called EPDiff, in which we let \(\tilde{\rho }(t)\) denote the restriction of \(\rho (t)\) to \(C^p_0(\varOmega , {\mathbb {R}}^d)\),

$$\begin{aligned} \partial _t \tilde{\rho }(t) + {\mathrm {ad}}_{v(t)}^* \rho (t) = 0. \end{aligned}$$
(10.25)

Equation (10.25) can be used to prove the following proposition.

Proposition 10.12

Let \(\varphi (t) = {\mathrm {id}}+ \omega (t)\), where \(\omega \) is a solution of (10.21). Let \(v_0 = K\rho _0\) and \(v(t) = {\mathrm {Ad}}_{\varphi (t)^{-1}}^T v_0\). Then \(\Vert v(t)\Vert _V\) is independent of time.

Proof

Indeed, we have, for \(\varepsilon > 0\),

$$ \frac{1}{\varepsilon }(\Vert v(t+\varepsilon )\Vert _V^2 - \Vert v(t)\Vert _V^2) = \frac{2}{\varepsilon }\big ( \rho (t+\varepsilon ) - \rho (t) \,\big |\, v(t) \big ) + \frac{1}{\varepsilon } \Vert v(t+\varepsilon ) - v(t)\Vert _V^2. $$

Since \(v(t)\in V \subset C^p_0(\varOmega , {\mathbb {R}}^d)\), (10.25) implies that the first term on the right-hand side converges to

$$ - 2\big ( \rho (t) \,\big |\, {\mathrm {ad}}_{v(t)} v(t) \big ) = 0.$$

For the second term, we have

$$\begin{aligned} \Vert v(t+\varepsilon ) - v(t)\Vert _V&= \sup _{\Vert w\Vert _V \le 1} \big ( \rho (t+\varepsilon ) - \rho (t) \,\big |\, w \big ) \\&= \sup _{\Vert w\Vert _V \le 1} \int _0^\varepsilon \big ( \rho (t+s) \,\big |\, {\mathrm {ad}}_{v(t+s)}w \big )\, ds, \end{aligned}$$

which tends to 0 with \(\varepsilon \).    \(\square \)

We can now prove that (10.21) has a unique solution over arbitrary time intervals.

Theorem 10.13

Under the hypotheses of Lemma 10.10, Eq. (10.21) has solutions over all times, uniquely specified by their initial conditions.

Proof

As already mentioned, Lemma 10.10 implies that solutions exist over small time intervals. Inequality (10.23) implies that these solutions can be extended as long as \(\Vert d\varphi (t)^{-1}\Vert _\infty \) and \(\Vert \varphi (t)\Vert _{p, \infty }\) remain finite. However, both these quantities are controlled by \(\int _0^t \Vert v(s)\Vert _V \, ds\). For the latter, this is a consequence of (C.6). For \(d\varphi (t)^{-1}\), we can note that

$$ \partial _t (d\varphi (t)^{-1}) = - d\varphi (t)^{-1} dv(t) \circ \varphi (t) $$

and use Gronwall’s lemma to ensure that

$$ \Vert d\varphi (t)^{-1}\Vert _\infty \le \exp \left( C\int _0^t \Vert v(s)\Vert _V\, ds\right) $$

for some constant C.    \(\square \)

10.5.4 Explicit Expression

The assumption that \(\rho _0\in C^{p-1}_0(\varOmega , {\mathbb {R}}^d)^*\) “essentially” expresses the fact that the evaluation of \(\big ( \rho _0 \,\big |\, w \big )\) will involve no more than \(p-1\) derivatives of w. This implies that the evaluation of the right-hand side of (10.21) will involve derivatives up to order p in \(\varphi = {\mathrm {id}}+ \omega \). In numerical implementations, it is often preferable to track the evolution of these derivatives over time, rather than approximate them using, e.g., finite differences. It often happens, for example, that the evaluation of \(\rho _0\) only requires the evaluation of \(\varphi \) and its derivatives over a submanifold of lower dimension, and tracking their values on a dense grid becomes counter-productive.

The evolution of the derivatives of \(\varphi \) can easily be computed by differentiating (10.21) with respect to the y variable. This is summarized in the system

$$\begin{aligned} \left\{ \begin{aligned}&\partial _t \varphi (t, y) = \sum _{i=1}^d \big ( \rho _0 \,\big |\, (d\varphi (t, x))^{-1} K^i(\varphi (t,x), \varphi (t,y)) \big )_x\, e_i \\&\partial _t d \varphi (t, y)a = \sum _{i=1}^d \big ( \rho _0 \,\big |\, (d\varphi (t, x))^{-1} \partial _2 K^i(\varphi (t, x), \varphi (t, y)) (d\varphi (t, y)a) \big )_x\, e_i\\&\quad \quad \vdots \\&\partial _t d^{p}\varphi (t, y)(a_1, \ldots , a_{p}) = \sum _{i=1}^d \big ( \rho _0 \,\big |\, (d\varphi (t, x))^{-1} \partial ^p_2 K^i(\varphi (t,x), \varphi (t, y)) (a_1, \ldots , a_p) \big )_x\, e_i. \end{aligned} \right. \end{aligned}$$
(10.26)

It should be clear from this system that, if the computation of \(\big ( \rho _0 \,\big |\, w \big )\) only requires the evaluation of w and its derivatives on some subset of \({\mathbb {R}}^d\), then \(\varphi \) and its derivatives only need to be tracked for y belonging to the same subset.

10.5.5 The Hamiltonian Form of EPDiff

We now provide an alternative form of (10.26), using the optimal control formulation discussed in Sect. 10.4.2, in which we introduced the co-state

$$\begin{aligned} \big ( \mu (t) \,\big |\, w \big ) = \big ( \rho _0 \,\big |\, d\varphi (t)^{-1} w \big ) = \big ( \rho (t) \,\big |\, w\circ \varphi (t)^{-1} \big ). \end{aligned}$$
(10.27)

Let \(M(t)=(d\varphi (t))^{-1}\) so that \(\partial _t M = - M\,(\partial _t d\varphi )\, M\). The second equation of (10.26) then becomes

$$ \partial _t M(t, y) a = - \sum _{i=1}^d \big ( \rho _0 \,\big |\, M(t, x) \partial _2 K^i(\varphi (t,x), \varphi (t, y))a \big )_x\, M(t, y)e_i. $$

This implies that, for any \(w\in V\)

$$\begin{aligned} {\Big ( {\partial _t \mu (t)}\,\Big |\, {w}\Big )}&= {\Big ( {\rho _0}\,\Big |\, {\partial _t M \, w}\Big )}\\&= - \sum _{i=1}^d \Big ( \rho _0 \,\Big |\, \big ( \rho _0 \,\big |\, M(t, x) \partial _2 K^i(\varphi (t,x), \varphi (t,y))w(y) \big )_x M(t, y)e_i \Big )_y\\&= - \sum _{i=1}^d \Big ( \mu (t) \,\Big |\, \big ( \mu (t) \,\big |\, \partial _2 K^i(\varphi (t,x), \varphi (t, y))w(y) \big )_x e_i \Big )_y. \end{aligned}$$

We therefore have the system

$$\begin{aligned} \left\{ \begin{aligned} \partial _t \varphi (t, y)&= \sum _{i=1}^d \big ( \mu (t) \,\big |\, K^i(\varphi (t,x), \varphi (t, y)) \big )_x\, e_i\\ {\Big ( {\partial _t \mu (t)}\,\Big |\, {w}\Big )}&= - \sum _{i=1}^d \Big ( \mu (t) \,\Big |\, \big ( \mu (t) \,\big |\, \partial _2 K^i(\varphi (t,x), \varphi (t, y))w(y) \big )_x e_i \Big )_y. \end{aligned} \right. \end{aligned}$$
(10.28)

Note that this system is an alternative expression of the first two equations of (10.17). When \(\big ( \rho _0 \,\big |\, w \big )\) does not depend on the derivatives of w (more precisely, when \(\rho _0\in C^0_0(\varOmega , {\mathbb {R}}^d)^*\)), this provides an ordinary differential equation in the variables \((\varphi , \mu )\) (of the form \((d/dt)(\varphi ,\mu ) = F(\varphi , \mu )\)). The initial conditions are \(\varphi _0 = {\mathrm {id}}\) and \(\mu _0 = \rho _0\).

10.5.6 The Case of Measure Momenta

An interesting feature of (10.28) is that it can easily be reduced to a smaller number of dimensions when \(\rho _0\) takes specific forms. As a typical example, we perform the computation in the case

$$\begin{aligned} \rho _0 = \sum _{k=1}^N z_k(0,\cdot ) \gamma _k, \end{aligned}$$
(10.29)

where \(\gamma _k\) is an arbitrary measure on \(\varOmega \) and \(z_k(0)\) a vector field. (We recall the notation \({\left( {z \gamma }\, \left| {z \gamma }\, {w}\right. \right) } = \int z(x)^Tw(x)\, \gamma (dx)\).) Most of the Eulerian differentials that we have computed in Chap. 9 have been reduced to this form. From the definition of \(\mu (t)\), we have \(\mu (t) = \sum _{k=1}^N z_k(t, .) \gamma _k\) (where \(z_k(t, x) = d\varphi _{0t}(x)^{-T} z_k(0,x)\)). The first equation in (10.28) is

$$ \partial _t \varphi (t, y) = \sum _{i=1}^d \sum _{k=1}^N \int _\varOmega z_k(t,x)^T K^i(\varphi (t,x), \varphi (t, y))e_i d\gamma _k(x). $$

For a matrix A with ith column vector \(A^i\), and a vector z, \(z^TA^i\) is the ith coordinate of \(A^Tz\). Applying this to the previous equation yields

$$\begin{aligned} \partial _t \varphi (t, y) = \sum _{k=1}^N \int _\varOmega K(\varphi (t,y), \varphi (t,x))z_k(t, x) d\gamma _k(x), \end{aligned}$$
(10.30)

where we have used the fact that \(K(\varphi (t,x), \varphi (t,y))^T = K(\varphi (t,y), \varphi (t, x))\). The second equation in (10.28) becomes

$$\begin{aligned}&{\Big ( {\partial _t\mu (t)}\,\Big |\, {w}\Big )} \\&= - \sum _{i=1}^d {\left( {\mu (t)}\, \left| {\mu (t)}\, {{\left( {\mu (t)}\, \left| {\mu (t)}\, { \partial _2 K^i(\varphi (t,x), \varphi (t, y))w(y)}\right. \right) }_x e_i}\right. \right) }_y\\&= - \sum _{k, l=1}^N \int _\varOmega \int _\varOmega \sum _{i=1}^d z_l^T(t, x) \partial _2 K^i(\varphi (t,x), \varphi (t,y))w(y)\, z_k(t, y)^Te_i d\gamma _l(x) d\gamma _k(y)\\&= - \sum _{k=1}^N \int _\varOmega \left( \int _\varOmega \sum _{l=1}^N \sum _{i=1}^d z^i_k(t,y) z_l(t, x)^T \partial _2 K^i(\varphi (t,x), \varphi (t, y)) d\gamma _l(x)\right) w(y) d\gamma _k(y), \end{aligned}$$

where \(z_k^i\) is the ith coordinate of \(z_k\). From the expression of \(\mu (t)\), we also have

$$ \partial _t \mu = \sum _{k=1}^N (\partial _t z_k) \gamma _k. $$

Letting \(K^{ij}\) denote the entries of K, we can identify \(\partial _t z_k\) as

$$\begin{aligned} \nonumber&\partial _t z_k(t, y) =\\ \nonumber&- \int _\varOmega \sum _{l=1}^N \sum _{i, j=1}^d z^i_k(t,y) z^j_l(t, x) \nabla _2 K^{ij}(\varphi (t,x), \varphi (t, y)) d\gamma _l(x)\\ =&- \int _\varOmega \sum _{l=1}^N \sum _{i, j=1}^d z^j_k(t,y) z_l^i(t, x) \nabla _1 K^{ij}(\varphi (t,y), \varphi (t, x)) d\gamma _l(x). \end{aligned}$$
(10.31)

This equation is somewhat simpler when K is a scalar kernel, in which case \(K^{ij}(x,y) = \varGamma (x, y)\) if \(i=j\) and 0 otherwise, where \(\varGamma \) takes real values. We get, in this case

$$\begin{aligned} \partial _t z_k(t, y)= & {} - \sum _{l=1}^N \int _\varOmega \nabla _2 \varGamma (\varphi (t,x), \varphi (t,y)) z_k(t,y)^Tz_l(t, x) d\gamma _l(x)\\= & {} - \sum _{l=1}^N \int _\varOmega \nabla _1 \varGamma (\varphi (t,y), \varphi (t,x)) z_k(t,y)^Tz_l(t, x) d\gamma _l(x). \end{aligned}$$

In all cases, we see that the evolution of \(\mu \) can be completely described using the evolution of \(z_1, \ldots , z_N\). In the particular case when each \(z_k(t,\cdot )\) reduces to a single vector (which corresponds to most point-matching problems, in which the \(\gamma _k\) are Dirac measures), this provides a finite-dimensional system for the \(\mu \) part.

10.6 Optimization Strategies for Flow-Based Matching

We have formulated flow-based matching as an optimization problem over time-dependent vector fields. We discuss here other possible optimization strategies that take advantage of the different formulations that we obtained for the EPDiff equation. They will correspond to taking different control variables with respect to which the minimization is performed, and we will in each case provide the expression of the gradient of E with respect to a suitable metric. Optimization can then be performed by gradient descent, conjugate gradient or higher-order optimization algorithms when feasible (see Appendix D or [221]).

After discussing the general formulation of each of these strategies, we will provide the specific expression of the gradients for point-matching problems, in the following form: minimize

$$\begin{aligned} E(\varphi ) = \frac{1}{2} d_V({\mathrm {id}}, \varphi )^2 + F(\varphi (x_1), \ldots , \varphi (x_N)) \end{aligned}$$
(10.32)

with respect to \(\varphi \), where \(x_1, \ldots , x_N\) are fixed points in \(\varOmega \). These problems are important because, in addition to the labeled and unlabeled point matching problems we have discussed, other problems, such as curve and surface matching, end up being discretized in this form (we will discuss algorithms for image matching in the next section). The following discussion describes (and often extends) several algorithms that have been proposed in the literature, in [32, 159, 203, 204, 289, 309] among other references.

10.6.1 Gradient Descent in \({\mathcal X}^2_V\)

The original problem having been expressed in this form, Corollary 10.9 directly provides the expression of the gradient of E considered as a function defined over \({\mathcal X}_V^2\), with respect to the metric in this space. Using \(t\mapsto v(t,\cdot )\) as an optimization variable has some disadvantages, however. The most obvious is that it requires solving a very high-dimensional problem (the unknown being a function of a \((d+1)\)-dimensional variable) even if the original objects are, say, collections of N landmarks in \({\mathbb {R}}^d\).

When the matching functional U is only a function of the deformation of a fixed object, i.e.,

$$ U(\varphi ) = F(\varphi \cdot I), $$

then some simplifications can be made. To go further, we will need to compute derivatives in the object space, and henceforth assume that \(\mathcal I\) is an open subset of a Banach space \(\varvec{I}\). We assume that \(\mathrm {Diff}^{p+1}_{0}\) acts on \(\mathcal I\) and that the mapping \(\varPhi _I: \varphi \mapsto \varphi \cdot I\) is differentiable on \(\mathrm {Diff}^{p+1}_0\) for all \(I\in {\mathcal I}\), so that an infinitesimal action is defined by (see Sect. B.5.3)

$$ h \cdot I = d\varPhi _I({\mathrm {id}}) \, h\in \varvec{I} $$

for \(h\in C^{p+1}_0(\varOmega , {\mathbb {R}}^d)\). We assume as usual that V is continuously embedded in \(C^{p+1}_0(\varOmega , {\mathbb {R}}^d)\) so that \(v\cdot I\) is well defined for \(v\in V\) and \(d\varPhi _I({\mathrm {id}})\) restricted to V is also bounded with respect to \(\Vert \cdot \Vert _V\).

Let \(v\in {\mathcal X}^2_V\). If \(\partial _t \varphi = v \circ \varphi \), let \(J(t) = \varphi (t)\cdot I\) be the deforming object. Then \(\partial _t J(t) = v(t) \cdot J(t)\). With this in mind, we can write, when \(\tilde{E}\) is given by (10.14)

$$ \min _{v(t, \cdot )} \tilde{E}(v) = \min _{J(t, \cdot ), J(0) = I} \left( \min _{v:\, \partial _t J = v(t)\cdot J(t)} \tilde{E}(v)\right) . $$

The iterated minimization first minimizes with respect to v for fixed object trajectories, then optimizes over the object trajectories.

When \(J(t, \cdot )\) is given, the inner minimization is

$$\begin{aligned} \nonumber \min _{v:\, \partial _t J = v(t)\cdot J(t)} \tilde{E}(v)= & {} \min _{v:\, \partial _t J = v(t)\cdot J(t)} \left( \frac{1}{2}\int _0^1 \Vert v(t)\Vert ^2_V \, dt+ F(J(1))\right) \\= & {} \frac{1}{2} \int _0^1 \left( \inf _{w:\, \partial _t J = w \cdot J(t)} \Vert w\Vert ^2_V\right) \, dt + F(J(1)) \end{aligned}$$
(10.33)

since the constraints apply separately to each v(t). This expression only depends on the trajectory J(t). One can therefore try to compute its gradient with respect to this object trajectory and run a minimization algorithm accordingly. One difficulty with this approach is that, given an object trajectory J(t), there may exist no \(w\in V\) such that \(\partial _t J = w\cdot J(t)\) (in which case the infimum in the integral is infinite), so that the possibility of expressing the trajectory as evolving according to a flow is itself a constraint of the problem. This constraint may be intractable in the general case, but it is always satisfied for point-matching problems as long as the points remain distinct. We will discuss this in the next section.

However, what (10.33) tells us is that, if a time-dependent vector field \(\tilde{v}(t,\cdot )\) is given, one always reduces the value of \(\tilde{E}(\tilde{v})\) by replacing \(\tilde{v}(t, \cdot )\) by

$$\begin{aligned} v(t, \cdot ) = \mathop {\mathrm {argmin}}_{w:\, w\cdot J(t) = \tilde{v}\cdot J(t)} \Vert w\Vert _V^2 \end{aligned}$$
(10.34)

with \(J(t) = \varphi _{0t}^{\tilde{v}}\cdot I\). Introduce the space

$$ N_J = Null (d\varPhi _J({\mathrm {id}})) = \left\{ u\in V: u\cdot J = 0 \right\} $$

and its perpendicular \(V_J = N_J^\perp = \left\{ u\in V: {\big \langle {u}\, , \, {\tilde{u}}\big \rangle }_V = 0 \text { for all } \tilde{u} \in N_J \right\} \). Then we have the following lemma.

Lemma 10.14

Let \(I\in {\mathcal I}\) and \(\tilde{v}\in V\). Then, the minimizer of \(\Vert w\Vert ^2_V\) over all \(w\in V\) such that \(w\cdot J = \tilde{v}\cdot J\) is given by \(\pi _{V_{J}} (\tilde{v})\), the orthogonal projection of \(\tilde{v}\) on \(V_{J}\).

Proof

Let \(v = \pi _{V_{J}} (\tilde{v})\). We want to prove that v is a minimizer of \(\Vert \cdot \Vert ^2_V\) over the set of all \(w\in V\) such that \(w = \tilde{v} + u\) with \(u\in N_J\). For such a w, we have

$$ \pi _{V_J}(w) = v + \pi _{V_J}(u) = v $$

and \(\Vert w\Vert _V^2 \ge \Vert \pi _{V_J}(w)\Vert ^2_V\). Moreover, from the characteristic properties of an orthogonal projection, we have \(\tilde{v} - v \in V_J^\perp = N_J\), the equality \(V_J^\perp = N_J\) holding because \(N_J\) is closed (being the null space of a bounded linear map), so that v itself satisfies the constraint.    \(\square \)

The numerical computation of this orthogonal projection is not always easy, but when it is, the projected field generally has a more specific form than a generic time-dependent vector field, and this provides an improved gradient descent algorithm in \({\mathcal X}_V^2\), as follows. Assume that, at time \(\tau \) in the algorithm, the current vector field \(v^\tau \) in the minimization of E is such that \(v^\tau (t) \in V_{J^\tau (t)}\) at all times t. Then define a vector field at the next step \(\tau + \delta \tau \) by

$$ \tilde{v}^{\tau +\delta \tau }(t,y) = v^\tau (t, y) - \delta \tau \left( v^\tau (t, y) + \sum _{i = 1}^d {\left( {\rho (1)}\, \left| {\rho (1)}\, {d\varphi _{t1}(\varphi ^v_{1t}(x)) K^i(\varphi _{1t}^v(x), y)}\right. \right) }_x e_i\right) , $$

which corresponds to one step of gradient descent, as specified in (10.20), then compute \(J(t) = \varphi _{0t}^{\tilde{v}^{\tau +\delta \tau }}\cdot I\) and define

$$ v^{\tau +\delta \tau }(t) = \pi _{V_{J(t)}} (\tilde{v}^{\tau + \delta \tau }) $$

at all times t.

Application to Point Matching

Consider the point-matching energy. In this case, letting

$$\begin{aligned} U(\varphi ) = F(\varphi (x_1), \ldots , \varphi (x_N)), \end{aligned}$$

we have

$$ \rho (1) = \bar{\partial }U(\varphi ^v_{01}) = \sum _{k=1}^N \partial _k F(\varphi ^v_{01}(x)) \delta _{\varphi ^v_{01}(x_k)} . $$

We therefore have, by Corollary 10.9, with \(\tilde{U}(v) = U(\varphi _{01}^v)\),

$$\begin{aligned} \nabla ^V \tilde{U}(v)(t, y)= & {} \sum _{i = 1}^d {\left( {\rho (1)}\, \left| {\rho (1)}\, {d\varphi ^v_{t1}(\varphi ^v_{1t}(x)) K( \varphi ^v_{1t}(x), y)e_i}\right. \right) }_x e_i\\= & {} \sum _{i = 1}^d \sum _{q=1}^N \big (\partial _q F(\varphi ^v_{01}(x))^T d\varphi ^v_{t1}(\varphi ^v_{0t}(x_q)) K(\varphi ^v_{0t}(x_q), y)e_i\big ) e_i\\= & {} \sum _{q=1}^N K(y,\varphi ^v_{0t}(x_q)) d\varphi ^v_{t1}(\varphi ^v_{0t}(x_q))^T \partial _qF(\varphi ^v_{01}(x_q)), \end{aligned}$$

so that

$$\begin{aligned} \nabla ^V \tilde{E}(v) (t, y) = v(t, y) + \sum _{q=1}^N K(y,\varphi ^v_{0t}(x_q)) d\varphi ^v_{t1}(\varphi ^v_{0t}(x_q))^T \partial _qF(\varphi ^v_{01}(x_q)) . \end{aligned}$$
(10.35)

So, a basic gradient descent algorithm in \({\mathcal X}^2_V\) would implement the evolution (letting \(\tau \) denote the algorithm time)

$$\begin{aligned} \partial _\tau v^\tau (t, y) = -\gamma \left( v^\tau (t, y) + \sum _{q=1}^N K(y,\varphi ^{v^\tau }_{0t}(x_q)) d\varphi ^{v^\tau }_{t1}(\varphi ^{v^\tau }_{0t}(x_q))^T \partial _qF(\varphi ^{v^\tau }_{01}(x_q))\right) . \end{aligned}$$
(10.36)

The two-step algorithm defined in the previous section is especially efficient with point sets. When \(x = (x_1, \ldots , x_N)\) and \(v\cdot x = (v(x_1), \ldots , v(x_N))\), the projection on

$$ V_x = \left\{ v: v\cdot x = 0 \right\} ^\perp = \left\{ v: v(x_1) = \cdots = v(x_N) = 0 \right\} ^\perp $$

is given by spline interpolation with the kernel, as described in Theorem 8.8, i.e.,

$$\begin{aligned} V_x = \left\{ v = \sum _{k=1}^N K(., x_k) a_k, a_1, \ldots , a_N \in {\mathbb {R}}^d \right\} . \end{aligned}$$
(10.37)

More precisely, define \(x_i^v(t) = \varphi _{0t}^v(x_i)\). We assume that, at time \(\tau \), we have a time-dependent vector field \(v^\tau \) which takes the form

$$\begin{aligned} v^\tau (t, y) = \sum _{i=1}^N K(y, x_i^{v^\tau }(t)) \alpha ^\tau _i(t). \end{aligned}$$
(10.38)

Using (10.36), we define

$$ \tilde{v}(t, y) = v^\tau (t, y) - \delta \tau \big (v^\tau (t, y) + \sum _{q=1}^N K(y,\varphi ^{v^\tau }_{0t}(x_q)) d\varphi ^{v^\tau }_{t1}(\varphi ^{v^\tau }_{0t}(x_q))^T \partial _qF(\varphi ^{v^\tau }_{01}(x_q))\big ). $$

The values of \(\tilde{v}(t,\cdot )\) are in fact only needed at the points \(x^{\tilde{v}}_i(t) = \varphi ^{\tilde{v}}_{0t}(x_i)\). These points are obtained by solving the differential equation

$$\begin{aligned} \partial _t x = v^\tau (t, x) - \delta \tau \left( v^\tau (t, x) + \sum _{i=1}^N K(x_i^{v^\tau }(t), x) d\varphi ^{v^\tau }_{t1}(x_i^{v^\tau }(t))^T \partial _i F(x^{v^\tau }(1))\right) \end{aligned}$$
(10.39)

with \(x(0) = x_i\). Solving this equation provides both \(x^{\tilde{v}}_i(t) \) and \(\tilde{v}(x^{\tilde{v}}_i(t))\) for \(t\in [0,1]\).

Once this is done, define \(v^{\tau + \delta \tau }(t, \cdot )\) to be the solution of the approximation problem \(\inf _w \Vert w\Vert _V\) with \(w(x^{\tilde{v}}_i(t)) = \tilde{v}(x^{\tilde{v}}_i(t))\), which will therefore take the form

$$ v^{\tau +\delta \tau } (t, y) = \sum _{i=1}^N K(y, x_i^{v^{\tau +\delta \tau }}(t)) \alpha ^{\tau +\delta \tau }_i(t) . $$

Solving (10.39) requires evaluating the expression of \(v^\tau \), which can be done exactly using (10.38). It also requires computing the expression of \(d\varphi ^{v^\tau }_{t1}(x_i^{v^\tau }(t))\), which can be obtained from the expression

$$ \partial _t d\varphi ^{v}_{t1} \circ \varphi _{1t} = \partial _t (d\varphi _{1t})^{-1} = -(d\varphi _{1t})^{-1} (\partial _t (d\varphi _{1t})) (d\varphi _{1t})^{-1}, $$

which yields:

$$ \partial _t d\varphi ^{v}_{t1}(x_i^{v}(t)) = - d\varphi ^{v}_{t1}(x_i^{v}(t)) dv (t, x_i^v(t)). $$

Thus, \(d\varphi ^{v}_{t1}(x_i^{v}(t))\) is a solution of \(\partial _t M = -M\, dv(t, x_i^v(t))\) with the condition \(M = {\mathrm {Id}}\) at \(t=1\). The matrix \(dv(t, x_i^v(t))\) can be computed explicitly as a function of the point trajectories \(x_j^v(t), j=1, \ldots , N\), using the explicit expression (10.38). This algorithm was introduced in [31].

10.6.2 Gradient in the Hamiltonian Form

As we have seen, one can use the optimal control formalism with the Pontryagin principle to compute the gradient of \(\tilde{E}\) in v. Given \(v\in {\mathcal X}^2_V\), this gradient can be computed by solving (10.28) with boundary conditions \(\varphi (0) = {\mathrm {id}}\) and \(\mu (1) = - dU(\varphi (1))\) (which can be achieved by solving the first equation in (10.28) from \(t=0\) to \(t=1\), then the second one backward in time, from \(t=1\) to \(t=0\)) and, using (10.18), letting

$$ \nabla \tilde{E}(v)(t) = {\mathbb {K}}(d\tilde{E}(v)(t)) = -{\mathbb {K}}\xi ^*_{\varphi (t)} \mu (t) + v(t). $$

This equation (or the maximum principle) implies that the optimal v must be such that \(v(t) = {\mathbb {K}}\xi ^*_{\varphi (t)} \mu (t)\) for some \(\mu \in C^{p-1}_0(\varOmega , {\mathbb {R}}^d)^*\) and there is therefore no loss of generality in restricting the optimization problem to v’s taking this form. With this constraint, we have

$$ \Vert v(t)\Vert _V^2 = {\big \langle {{\mathbb {K}}\xi ^*_{\varphi (t)}\mu (t)}\, , \, {{\mathbb {K}}\xi ^*_{\varphi (t)}\mu (t)}\big \rangle }_V = {\left( {\mu (t)}\, \left| {\mu (t)}\, {\xi _{\varphi (t)} {\mathbb {K}}\xi ^*_{\varphi (t)}\mu (t)}\right. \right) }. $$

Let \({\mathbb {K}}_\varphi = \xi _{\varphi } {\mathbb {K}}\xi ^*_{\varphi }\) so that \(\Vert v(t)\Vert ^2_V = {\left( {\mu (t)}\, \left| {\mu (t)}\, {{\mathbb {K}}_{\varphi (t)} \mu (t)}\right. \right) }\). One has

$$\begin{aligned} a^T({\mathbb {K}}_{\varphi }\mu )(y)&= a^T({\mathbb {K}}\xi ^*_{\varphi }\mu )(\varphi (y)) \\&= {\big \langle {K(\cdot , \varphi (y))a}\, , \, {{\mathbb {K}}\xi ^*_{\varphi }\mu }\big \rangle }_V\\&= {\left( {\mu }\, \left| {\mu }\, {K(\varphi (x), \varphi (y))a}\right. \right) }_x \end{aligned}$$

so that

$$ ({\mathbb {K}}_{\varphi }\mu )(y) = \sum _{i=1}^d {\left( {\mu }\, \left| {\mu }\, {K(\varphi (x), \varphi (y))e_i}\right. \right) }_x e_i. $$

With this notation, the state equation \(\partial _t \varphi = v\circ \varphi \) becomes \(\partial _t\varphi = {\mathbb {K}}_\varphi \mu \) and the original optimal control problem is reformulated as minimizing

$$ E(\varphi , \mu ) = \frac{1}{2} \int _0^1 {\left( {\mu (t)}\, \left| {\mu (t)}\, {{\mathbb {K}}_{\varphi (t)}\mu (t)}\right. \right) } \, dt + U(\varphi _{01}) $$

subject to \(\partial _t \varphi = \mathbb K_{\varphi } \mu \).

Expressing the problem in this form slightly changes the expression of the differential. The computation of the gradient (and its justification) based on a co-state \(\alpha \) and the Hamiltonian

$$ H_\mu (\alpha ,\varphi ) = {\left( {\alpha }\, \left| {\alpha }\, {{\mathbb {K}}_\varphi \mu }\right. \right) } - \frac{1}{2} {\left( {\mu }\, \left| {\mu }\, {{\mathbb {K}}_\varphi \mu }\right. \right) } $$

are obtained using the same methods as in Sect. 10.4.2, so we skip the details. Let \(\varphi ^\mu \) be the solution of \(\partial _t \varphi = {\mathbb {K}}_{\varphi } \mu \) with \(\varphi (0) = {\mathrm {id}}\). Then, with \(\tilde{E}(\mu ) = E(\varphi ^\mu , \mu )\), we have

$$ d\tilde{E}(\mu ) = {\mathbb {K}}_\varphi \alpha - {\mathbb {K}}_\varphi \mu , $$

where \(\varphi \) and \(\alpha \) are given by the system

$$\begin{aligned} \left\{ \begin{aligned} \partial _t \varphi&= {\mathbb {K}}_{\varphi } \mu \\ {\Big ( {\partial _t \alpha (t)}\,\Big |\, {w}\Big )}&= - \sum _{i=1}^d {\left( {\alpha (t)}\, \left| {\alpha (t)}\, {{\left( {\mu (t)}\, \left| {\mu (t)}\, { \partial _2 K^i(\varphi (t,x), \varphi (t, y))w(y)}\right. \right) }_x e_i}\right. \right) }_y\\&\quad - \sum _{i=1}^d {\left( {\mu (t)}\, \left| {\mu (t)}\, {{\left( {\alpha (t)}\, \left| {\alpha (t)}\, { \partial _2 K^i(\varphi (t,x), \varphi (t, y))w(y)}\right. \right) }_x e_i}\right. \right) }_y\\&\quad + 2 \sum _{i=1}^d {\left( {\mu (t)}\, \left| {\mu (t)}\, {{\left( {\mu (t)}\, \left| {\mu (t)}\, { \partial _2 K^i(\varphi (t,x), \varphi (t, y))w(y)}\right. \right) }_x e_i}\right. \right) }_y \end{aligned} \right. \end{aligned}$$
(10.40)

with \(\varphi (0) = {\mathrm {id}}\) and \(\alpha (1) = - dU(\varphi (1))\). Unsurprisingly, this system boils down to (10.28) when \(d\tilde{E}(\mu ) = 0\), i.e., when \(\alpha = \mu \).

The gradient of \(\tilde{E}\) expressed with respect to the inner product

$$ {\big \langle {\mu }\, , \, {\tilde{\mu }}\big \rangle }_{\varphi } = {\left( {\mu }\, \left| {\mu }\, {{\mathbb {K}}_\varphi \tilde{\mu }}\right. \right) } $$

(this choice will be justified in Sect. 11.4 as the dual Riemannian metric on the diffeomorphism group) is

$$ \nabla \tilde{E}(\mu ) = \alpha - \mu , $$

a remarkably simple expression.

Consider now the case in which \(U(\varphi ) = F(\varphi \cdot I)\) where I is a fixed object. With the notation and assumptions made in Sect. 10.6.1, we found in Lemma 10.14 that there was no loss of generality in restricting the minimization to \(v(t) \in V_{J(t)}\) at all times. This often entails additional constraints on the momentum \(\rho (t) = \mathbb Lv(t)\), or on \(\mu (t) = \xi _{\varphi (t)^{-1}}^* \rho (t)\), that can be leveraged to reduce the dimension of the control variable. For example, we have seen that for point sets (in which we let \(J=x\)) \(V_x\) was given by (10.37), so that \(\rho (t)\) must take the form

$$ \rho (t) = \sum _{k=1}^N z_k(t) \delta _{x_k(t)} $$

for some \(z_1(t), \ldots , z_N(t) \in {\mathbb {R}}^d\), from which we can deduce (using \(x_k(t) = \varphi (t, x_k(0))\)) that \(\mu (t)\) must take the form

$$ \mu (t) = \sum _{k=1}^N z_k(t) \delta _{x_k(0)}. $$

One can then use \(z_1, \ldots , z_N\) as a new control, as described below.

Another interesting special case is when \(\mu (t)\) can be expressed as a vector measure, because, as discussed in Sect. 10.5.6, one can then assume that

$$ \mu (t) = \sum _{k=1}^N z_k(t, \cdot ) \gamma _k, $$

where \(\gamma _1, \ldots , \gamma _N\) are fixed measures. One can then use the vector fields \(z_1, \ldots , z_N\) to parametrize the problem. This leads to a simplification when the measures have a sparse support. They are, for example, Dirac measures for point matching. We now review this case in more detail.

Application to Point Matching

When \(U(\varphi ) = F(\varphi (x_1), \ldots , \varphi (x_N))\), we have

$$ {\left( {dU(\varphi )}\, \left| {dU(\varphi )}\, {h}\right. \right) } = \sum _{k=1}^N \partial _k F(\varphi (x_1), \ldots , \varphi (x_N))^T h(x_k) $$

so that

$$ \mu (1) = -\sum _{k=1}^N \partial _k F(\varphi (x_1), \ldots , \varphi (x_N)) \delta _{x_k} $$

is a vector measure. We can therefore look for a solution in the form

$$ \mu (t) = \sum _{k=1}^N z_k(t) \delta _{x_k} $$

at all times, for some coefficients \(z_1, \ldots , z_N\).

In order to obtain \(\alpha \) in (10.40) given a current \(\mu \), it suffices to solve the first equation only for the values of \(y_k(t) = \varphi (t, x_k)\), \(k=1, \ldots , N\), which requires us to solve the system

$$ \partial _t y_k = \sum _{l=1}^N K(y_k, y_l) z_l. $$

One then sets

$$ \alpha (1) = -\sum _{k=1}^N \partial _k F(y_1(1), \ldots , y_N(1)) \delta _{x_k} $$

and solves the second equation backward in time, knowing that the solution will take the form

$$ \alpha (t) = \sum _{k=1}^N \eta _k(t) \delta _{x_k} $$

with \(\eta _k(1) = -\partial _k F(y_1(1), \ldots , y_N(1))\) and

$$\begin{aligned} \partial _t \eta _k =&\sum _{l=1}^N \sum _{i, j=1}^d \eta _l^i z_k^j\nabla _1K^{ij}(y_k, y_l) + \sum _{l=1}^N \sum _{i, j=1}^d z_l^i\eta _k^j\nabla _1K^{ij}(y_k, y_l) \\&- 2 \sum _{l=1}^N \sum _{i, j=1}^d z_l^iz_k^j\nabla _1K^{ij}(y_k, y_l). \end{aligned}$$

Given this, we have

$$ \nabla \tilde{E}(\mu ) = \sum _{k=1}^N (\eta _k-z_k) \delta _{x_k}. $$

10.6.3 Gradient in the Initial Momentum

We now use the fact that Eq. (10.16) implies that the optimal v(t) is uniquely determined by its value at \(t=0\) to express the variations of the objective function in terms of this initial condition. We therefore optimize with respect to \(v_0\), or equivalently with respect to \(\mu _0 = \rho _0\). This requires finding \(\rho _0\) such that

$$ \frac{1}{2} \int _0^1 \Vert v(t)\Vert _V^2 dt + U(\varphi (1)) $$

is minimal under the constraints \(\partial _t\varphi (t) = v(t)\circ \varphi (t)\), with

$$v(t, y) = \sum _{i=1}^d {\left( {\rho _0}\, \left| {\rho _0}\, { d\varphi (t,x)^{-1} K^{i}(\varphi (t,x), y)}\right. \right) }_x e_i.$$

Proposition 10.12 helps us to simplify this expression, since it implies that \(\int _0^1 \Vert v(t)\Vert ^2 dt = {\left( {\rho _0}\, \left| {\rho _0}\, {{\mathbb {K}}\rho _0}\right. \right) }\) and the minimization problem therefore is to find \(\rho _0\) such that

$$ E(\rho _0) = \frac{1}{2} {\left( {\rho _0}\, \left| {\rho _0}\, {{\mathbb {K}}\rho _0}\right. \right) } + U(\varphi (1)) $$

is minimal, where \((\varphi , \mu )\) is a solution of system (10.28) with initial conditions \(\varphi (0) = {\mathrm {id}}\) and \(\mu (0) = \rho _0\). Writing (10.28) as

$$ \partial _t \begin{pmatrix}\varphi \\ \mu \end{pmatrix} = \begin{pmatrix}\mathscr {V}_1(\varphi , \mu )\\ \mathscr {V}_2(\varphi , \mu )\end{pmatrix} = \mathscr {V}(\varphi , \mu ) $$

and applying Proposition D.12, we have

$$ dE(\rho _0) = {\mathbb {K}}\rho _0 - p_\mu (0), $$

where the pair \(\left( {\begin{matrix} p_\varphi (t)\\ p_\mu (t)\end{matrix}}\right) \) satisfies \(p_\varphi (1) = - dU(\varphi (1))\), \(p_\mu (1) = 0\) and

$$\begin{aligned} \left\{ \begin{aligned} \partial _t p_\varphi&= -\partial _1 \mathscr {V}_1^*\, p_\varphi - \partial _1 \mathscr {V}_2^* \, p_\mu ,\\ \partial _t p_\mu&= -\partial _2 \mathscr {V}_1^*\, p_\varphi - \partial _2 \mathscr {V}_2^*\, p_\mu . \end{aligned} \right. \end{aligned}$$
(10.41)

The gradient of E with respect to the metric on \(V^*\) is then given by \(\nabla E(\rho _0) = \rho _0 - {\mathbb {K}}^{-1} p_\mu (0)\).

The practical application of these formulas requires us to make explicit the expressions of \(\partial _i\mathscr {V}_j^*\) for \(i, j=1,2\). Returning to (10.28), we have

$$\begin{aligned} {\left( {\partial _1 \mathscr {V}_1^*\, p_\varphi }\, \left| {\partial _1 \mathscr {V}_1^*\, p_\varphi }\, {h}\right. \right) } =&\sum _{i=1}^d {\left( {p_\varphi }\, \left| {p_\varphi }\, {{\left( {\mu }\, \left| {\mu }\, {\partial _1 K^i(\varphi (x), \varphi (y))h(x)}\right. \right) }_xe_i}\right. \right) }_y\\&+\sum _{i=1}^d {\left( {p_\varphi }\, \left| {p_\varphi }\, {{\left( {\mu }\, \left| {\mu }\, {\partial _2 K^i(\varphi (x), \varphi (y))h(y)}\right. \right) }_xe_i}\right. \right) }_y, \end{aligned}$$
$$\begin{aligned} {\left( {\partial _2 \mathscr {V}_1^*\, p_\varphi }\, \left| {\partial _2 \mathscr {V}_1^*\, p_\varphi }\, {\eta }\right. \right) } =&\sum _{i=1}^d {\left( {p_\varphi }\, \left| {p_\varphi }\, {{\left( {\eta }\, \left| {\eta }\, {K^i(\varphi (x), \varphi (y))}\right. \right) }_xe_i}\right. \right) }_y, \end{aligned}$$
$$\begin{aligned} {\left( {\partial _1 \mathscr {V}_2^*\, p_\mu }\, \left| {\partial _1 \mathscr {V}_2^*\, p_\mu }\, {h}\right. \right) } =&- \sum _{i=1}^d {\left( {\mu }\, \left| {\mu }\, {{\left( {\mu }\, \left| {\mu }\, {\partial _1 \partial _2 K^i(\varphi (x), \varphi (y))(h(x), p_\mu (y))}\right. \right) }_x e_i}\right. \right) }_y\\&- \sum _{i=1}^d {\left( {\mu }\, \left| {\mu }\, {{\left( {\mu }\, \left| {\mu }\, { \partial _2^2 K^i(\varphi (x), \varphi (y))(h(y), p_\mu (y))}\right. \right) }_x e_i}\right. \right) }_y, \text { and} \end{aligned}$$
$$\begin{aligned} {\left( {\partial _2 \mathscr {V}_2^*\, p_\mu }\, \left| {\partial _2 \mathscr {V}_2^*\, p_\mu }\, {\eta }\right. \right) } =&- \sum _{i=1}^d {\left( {\eta }\, \left| {\eta }\, {{\left( {\mu }\, \left| {\mu }\, { \partial _2 K^i(\varphi (x), \varphi (y))p_\mu (y)}\right. \right) }_x e_i}\right. \right) }_y\\&- \sum _{i=1}^d {\left( {\mu }\, \left| {\mu }\, {{\left( {\eta }\, \left| {\eta }\, { \partial _2 K^i(\varphi (x), \varphi (y))p_\mu (y)}\right. \right) }_x e_i}\right. \right) }_y. \end{aligned}$$

Forming explicit expressions of \(\partial _i\mathscr {V}_j^*\) requires isolating h or \(\eta \) from the right-hand sides. To do this, we will need to change the order in which linear forms are applied to the x and y coordinates. This issue is addressed in the following lemma.

Lemma 10.15

Assume that \(\mu \in C^{r}(\varOmega , {\mathbb {R}}^d)^*\) and \(\nu \in C^{r'}(\varOmega , {\mathbb {R}}^d)^*\). Let \(g:\varOmega \times \varOmega \rightarrow {\mathbb {R}}\) be a function such that \(\partial ^k_1\partial _2^{k'} g \in C_0(\varOmega \times \varOmega , {\mathbb {R}})\) for all \(k\le r\) and \(k'\le r'\). Then, for all \(a, b\in {\mathbb {R}}^d\), \({\left( {\mu }\, \left| {\mu }\, {g(x, \cdot )a}\right. \right) }_x \in C^{r'}(\varOmega , {\mathbb {R}})\) and \({\left( {\nu }\, \left| {\nu }\, {g(\cdot , y)b}\right. \right) }_y \in C^{r}(\varOmega , {\mathbb {R}})\), with

$$\begin{aligned} {\left( {\mu }\, \left| {\mu }\, {{\left( {\nu }\, \left| {\nu }\, {g(x, y)b}\right. \right) }_ya}\right. \right) }_x = {\left( {\nu }\, \left| {\nu }\, {{\left( {\mu }\, \left| {\mu }\, {g(x, y)a}\right. \right) }_xb}\right. \right) }_y. \end{aligned}$$
(10.42)

Proof

Let \(f(y) = {\left( {\mu }\, \left| {\mu }\, {g(x, y)a}\right. \right) }_x\). Using Taylor’s formula, we can write

$$\begin{aligned} g(x, y+h)&= \sum _{k=0}^{r'} \frac{1}{k!} \partial _2^kg(x, y) h^{(k)} + \\&\quad \frac{1}{(r'-1)!}\int _0^1 (\partial _2^{r'} g(x, y+th) - \partial _2^{r'} g(x, y))h^{(r')} (1-t)^{r'-1} \, dt \end{aligned}$$

so that

$$\begin{aligned} f(y&+h) = \sum _{k=0}^{r'} \frac{1}{k!} {\left( {\mu }\, \left| {\mu }\, {\partial _2^kg(x, y) h^{(k)}a}\right. \right) } \\&+\frac{1}{(r'-1)!}{\Big ( {\mu }\,\Big |\, {\int _0^1 (\partial _2^{r'} g(x, y+th) - \partial _2^{r'} g(x, y))h^{(r')} (1-t)^{r'-1}a \, dt}\Big )}. \end{aligned}$$

The last term (call it R) is such that

$$ |R| \le \frac{\Vert \mu \Vert _{r, \infty , *}|h|^{r'}|a|}{r'!} \max _{t\in [0,1]}\Vert \partial _2^{r'} g(\cdot , y+th) - \partial _2^{r'} g(\cdot , y)\Vert _{r, \infty }. $$

The uniform continuity of \(\partial _1^k\partial _2^{r'} g\) for \(k\le r\) implies that \(R = o(|h|^{r'})\) so that \(f\in C^{r'}(\varOmega , {\mathbb {R}})\). Similarly, letting \(f'(x) = {\left( {\nu }\, \left| {\nu }\, {g(x, y)b}\right. \right) }_y\), one has \(f'\in C^r(\varOmega , {\mathbb {R}})\).

The computation also shows that, for some constant C,

$$\begin{aligned} \max ({\left( {\mu }\, \left| {\mu }\, {{\left( {\nu }\, \left| {\nu }\, {g(x, y)b}\right. \right) }_ya}\right. \right) }_x,&{\left( {\nu }\, \left| {\nu }\, {{\left( {\mu }\, \left| {\mu }\, {g(x, y)a}\right. \right) }_xb}\right. \right) }_y)\\&\qquad \quad \qquad \le C \Vert \mu \Vert _{r, \infty , *} \Vert \nu \Vert _{r', \infty , *} \Vert g\Vert _{r, r',\infty } \end{aligned}$$

with

$$ \Vert g\Vert _{r, r',\infty } = \max _{k\le r, k'\le r'} \Vert \partial ^k_1\partial ^{k'}_2 g\Vert _\infty , $$

so that both sides of (10.42) are continuous in g with respect to this norm. To conclude, it suffices to notice that (10.42) is true when g takes the form

$$ g(x, y) = \sum _{k=1}^n c_k f_k(x)f_k'(y) $$

and that these functions form a dense set for \(\Vert g\Vert _{r, r',\infty }\), so that the identity extends by continuity.    \(\square \)

Let us use this lemma to identify the first term in \({\left( {\partial _1 \mathscr {V}_1^*\, p_\varphi }\, \left| {\partial _1 \mathscr {V}_1^*\, p_\varphi }\, {h}\right. \right) }\) as a linear form acting on h. Write, letting \(\partial _{i, k}\) denote the derivative with respect to the kth coordinate of the ith variable,

$$\begin{aligned}&\sum _{i=1}^d {\left( {p_\varphi }\, \left| {p_\varphi }\, {{\left( {\mu }\, \left| {\mu }\, {\partial _1 K^i(\varphi (x), \varphi (y))h(x)}\right. \right) }_xe_i}\right. \right) }_y\\ =&\sum _{i,j, k=1}^d {\left( {p_\varphi }\, \left| {p_\varphi }\, {{\left( {\mu }\, \left| {\mu }\, {\partial _{1,k} K^{ij}(\varphi (x), \varphi (y))h^k(x)e_j}\right. \right) }_xe_i}\right. \right) }_y\\ =&\sum _{i,j, k=1}^d {\left( {\mu }\, \left| {\mu }\, {{\left( {p_\varphi }\, \left| {p_\varphi }\, {\partial _{1,k} K^{ij}(\varphi (x), \varphi (y))h^k(x)e_i}\right. \right) }_ye_j}\right. \right) }_x\\ =&\sum _{i,j, k=1}^d {\left( {\mu }\, \left| {\mu }\, {{\left( {p_\varphi }\, \left| {p_\varphi }\, {\partial _{2,k} K^{ji}(\varphi (y), \varphi (x))e_i}\right. \right) }_y h^k(x)e_j}\right. \right) }_x\\ =&\sum _{j, k=1}^d {\left( {\mu }\, \left| {\mu }\, {{\left( {p_\varphi }\, \left| {p_\varphi }\, {\partial _{2,k} K^{j}(\varphi (y), \varphi (x))}\right. \right) }_y h^k(x)e_j}\right. \right) }_x \\ =&\, {\left( {\mu }\, \left| {\mu }\, {\mathscr {U}_{1} h}\right. \right) }, \end{aligned}$$

where \(\mathscr {U}_{1}(x)\) is the matrix with coefficients

$$ \mathscr {U}_{1}^{j, q}(x) = {\left( {p_\varphi }\, \left| {p_\varphi }\, {\partial _{2,q} K^{j}(\varphi (y), \varphi (x))}\right. \right) }_y. $$

Write \({\left( {\mathscr {U}_{1}^T \mu }\, \left| {\mathscr {U}_{1}^T \mu }\, {h}\right. \right) } = {\left( {\mu }\, \left| {\mu }\, {\mathscr {U}_{1} h}\right. \right) }\), a notation generalizing the one introduced for vector measures. After a similar computation for the second term of \({\left( {\partial _1 \mathscr {V}_1^*\, p_\varphi }\, \left| {\partial _1 \mathscr {V}_1^*\, p_\varphi }\, {h}\right. \right) }\) (which does not require Lemma 10.15), we get

$$ \partial _1 \mathscr {V}_1^*\, p_\varphi = \mathscr {U}_{1}^T \mu + \mathscr {U}_{2}^T p_\varphi $$

with

$$ \mathscr {U}_{2}^{j, q}(x) = {\left( {\mu }\, \left| {\mu }\, {\partial _{2,q} K^{j}(\varphi (y), \varphi (x))}\right. \right) }_y. $$

Consider now \(\partial _2 \mathscr {V}_1^*\, p_\varphi \), writing

$$\begin{aligned} {\left( {\partial _2 \mathscr {V}_1^*\, p_\varphi }\, \left| {\partial _2 \mathscr {V}_1^*\, p_\varphi }\, {\eta }\right. \right) } =&\sum _{i, j=1}^d {\left( {\eta }\, \left| {\eta }\, {{\left( {p_\varphi }\, \left| {p_\varphi }\, {K^{ij}(\varphi (x), \varphi (y))e_i}\right. \right) }_ye_j}\right. \right) }_x\\ =&\sum _{j=1}^d {\left( {\eta }\, \left| {\eta }\, {{\left( {p_\varphi }\, \left| {p_\varphi }\, {K^{j}(\varphi (y), \varphi (x))}\right. \right) }_ye_j}\right. \right) }_x, \end{aligned}$$

so that

$$ \partial _2 \mathscr {V}_1^*\, p_\varphi (x) = \sum _{j=1}^d {\left( {p_\varphi }\, \left| {p_\varphi }\, {K^{j}(\varphi (y), \varphi (x))}\right. \right) }_ye_j. $$

With similar computations for \(\mathscr {V}_2\), and skipping the details, we find

$$ \partial _1 \mathscr {V}_2^*\, p_\mu = - \mathscr {U}_3^T \mu - \mathscr {U}_4^T \mu $$

where

$$\begin{aligned} \mathscr {U}_3^{jq}(x) = {\left( {\mu }\, \left| {\mu }\, {\partial _{2,q}(\partial _1K^j(\varphi (y), \varphi (x))p_\mu (y))}\right. \right) }_y,&\text { and }\\ \mathscr {U}_4^{jq}(x) = {\left( {\mu }\, \left| {\mu }\, {\partial _{2,q}(\partial _2K^j(\varphi (y), \varphi (x))p_\mu (x))}\right. \right) }_y&. \end{aligned}$$

Finally,

$$\begin{aligned} \partial _2 \mathscr {V}_2^*\, p_\mu (x)&= - \sum _{i=1}^d {\left( {\mu }\, \left| {\mu }\, { \partial _2 K^i(\varphi (y), \varphi (x))p_\mu (x)}\right. \right) }_y e_i \\&\qquad \qquad \qquad \qquad - \sum _{i=1}^d {\left( {\mu }\, \left| {\mu }\, { \partial _1 K^i(\varphi (y), \varphi (x))p_\mu (y)}\right. \right) }_y e_i. \end{aligned}$$

Let us take the special case of vector measures, assuming that \(\mu (t) = \sum _{k=1}^Nz_k(t, \cdot ) \gamma _k\). We will look for \(p_\varphi \) in the form

$$ p_\varphi (t) = \sum _{k=1}^N\alpha _k(t, \cdot ) \gamma _k, $$

\(p_\mu \) being a function defined over the support of \(\mu \).

With these assumptions, we have

  • \(\displaystyle \partial _1 \mathscr {V}_1^*\, p_\varphi = \sum _{k=1}^N \zeta _k^{1,1} \gamma _k\) with

    $$\begin{aligned} \zeta _k^{1,1}(x) =&\sum _{l=1}^N \sum _{i, j=1}^d {\left( {\gamma _l}\, \left| {\gamma _l}\, {\alpha ^i_l(y) z_k^j(x)\nabla _1 K^{ij}(\varphi (x), \varphi (y)) }\right. \right) }_y \\&+ \sum _{l=1}^N \sum _{i, j=1}^d {\left( {\gamma _l}\, \left| {\gamma _l}\, {z^i_l(y) \alpha _k^j(x)\nabla _1 K^{ij}(\varphi (x), \varphi (y)) }\right. \right) }_y. \end{aligned}$$
  • \(\displaystyle \partial _2 \mathscr {V}_1^*\, p_\varphi (x) = \sum _{k=1}^N {\left( {\gamma _k}\, \left| {\gamma _k}\, {K(\varphi (x), \varphi (y)) \alpha _k(y)}\right. \right) }_y\).

  • \(\displaystyle \partial _1 \mathscr {V}_2^*\, p_\mu = \sum _{k=1}^N \zeta _k^{2,1} \gamma _k\) with

    $$\begin{aligned} \zeta _k^{2,1}(x) =&-\sum _{i, j=1}^d \sum _{l=1}^N {\left( {\gamma _l}\, \left| {\gamma _l}\, {z_l^i(y) z_k^j(x) \partial _1\partial _2 K^{ij}(\varphi (x), \varphi (y)) p_\mu (y)}\right. \right) }_y\\&- \sum _{i, j=1}^d \sum _{l=1}^N {\left( {\gamma _l}\, \left| {\gamma _l}\, {z_l^i(y) z_k^j(x) \partial _1^2 K^{ij}(\varphi (x), \varphi (y)) p_\mu (x)}\right. \right) }_y. \end{aligned}$$
  • \( \begin{aligned} \partial _2 \mathscr {V}_2^*\, p_\mu (x) =&- \sum _{i=1}^d \sum _{k=1}^N {\left( {\gamma _k}\, \left| {\gamma _k}\, { z^i_k(y) \partial _1 K^i(\varphi (x), \varphi (y))p_\mu (x)}\right. \right) }_y \\&- \sum _{i=1}^d \sum _{k=1}^N {\left( {\gamma _k}\, \left| {\gamma _k}\, {z^i_k(y) \partial _2 K^i(\varphi (x), \varphi (y))p_\mu (y)}\right. \right) }_y. \end{aligned} \)

System (10.41) can now be simplified as

$$\begin{aligned} \left\{ \begin{aligned} \partial _t\alpha _k =&\, \zeta _k^{1,1} + \zeta _k^{2,1}\\ \partial _t p_\mu =&\sum _{k=1}^N {\left( {\gamma _k}\, \left| {\gamma _k}\, {K(\varphi (x), \varphi (y)) \alpha _k(y)}\right. \right) }_y\\&-\sum _{i=1}^d \sum _{k=1}^N {\left( {\gamma _k}\, \left| {\gamma _k}\, { z^i_k(y) \partial _1 K^i(\varphi (x), \varphi (y))p_\mu (x)}\right. \right) }_y \\&- \sum _{i=1}^d \sum _{k=1}^N {\left( {\gamma _k}\, \left| {\gamma _k}\, {z^i_k(y) \partial _2 K^i(\varphi (x), \varphi (y))p_\mu (y)}\right. \right) }_y. \end{aligned} \right. \end{aligned}$$
(10.43)

Application to Point Matching

We now apply this approach to point-matching problems. Since \(\rho _0\) takes the form

$$ \rho _0 = \sum _{k=1}^N a_{0,k} \delta _{x_{0,k}} $$

we are in the vector measure case with \(\gamma _k = \delta _{x_{0,k}}\). The densities \(z_k\) and \(\alpha _k\) for \(\mu \) and \(p_\varphi \) can therefore be considered as vectors in \({\mathbb {R}}^d\), and \(p_\mu \), being defined on the support of \(\mu \), is also a collection of vectors \(p_{\mu , k} = p_\mu (x_k)\). We can therefore immediately rewrite

  • \(\displaystyle \partial _1 \mathscr {V}_1^*\, p_\varphi = \sum _{k=1}^N \zeta _k^{1,1} \delta _{x_{0,k}}\) with

    $$\begin{aligned} \zeta _k^{1,1} =\sum _{l=1}^N \sum _{i, j=1}^d \left( \alpha ^i_l \nabla _1 K^{ij}(x_k, x_l) z_k^j + z^i_l \nabla _1 K^{ij}(x_k, x_l) \alpha _k^j\right) . \end{aligned}$$
  • \(\displaystyle \partial _2 \mathscr {V}_1^*\, p_\varphi (x_{0,k}) = \sum _{l=1}^N K(x_k, x_l) \alpha _l\).

  • \(\displaystyle \partial _1 \mathscr {V}_2^*\, p_\mu = \sum _{k=1}^N \zeta _k^{2,1} \delta _{x_{0,k}}\) with

    $$\begin{aligned} \zeta _k^{2,1} = -\sum _{i, j=1}^d \sum _{l=1}^N z_l^i z_k^j (\partial _1\partial _2 K^{ij}(x_k, x_l)\, p_{\mu , l} + \partial _1^2 K^{ij}(x_k, x_l)\, p_{\mu , k}). \end{aligned}$$
  • \( \begin{aligned} \partial _2 \mathscr {V}_2^*\, p_\mu (x_k) =&- \sum _{i=1}^d \sum _{l=1}^N z^i_l (\partial _1 K^i(x_k, x_l)p_{\mu , k} + \partial _2 K^i(x_k, x_l)p_{\mu , l}). \end{aligned} \)

This algorithm is illustrated in Fig. 10.1. For comparison purposes, the same figure also shows the results obtained with spline interpolation, which computes \(\varphi (x) = x + v(x)\), where v is chosen (using Theorem 8.9) to minimize

$$ \left\| v \right\| _V^2 + C \sum _{i=1}^N \left| v(x_i) - (y_i-x_i) \right| ^2. $$

Although this is a widespread registration method [42], Fig. 10.1 shows that it is far from being diffeomorphic for large deformations.

Fig. 10.1

Metric point matching. The first two rows provide results obtained with gradient descent in the initial momentum for point matching, with the same input as in Fig. 9.1, using Gaussian kernels \(K(x, y) = \exp (-|x-y|^2/2\sigma ^2)\) with \(\sigma = 1, 2, 4\) in grid units. The impact of the diffeomorphic regularization on the quality of the result is particularly obvious in the last experiment. The last row provides the output of Gaussian spline registration with the same kernels, exhibiting singularities and ambiguities in the registration

10.6.4 Shooting

The optimality conditions for our problem are \(\mu (1) = -dU(\varphi (1))\) with \(\mu (t)\) given by (10.28). The shooting approach in optimal control consists in finding an initial momentum \(\rho _0 = \mu (0)\) such that these conditions are satisfied. Root-finding methods, such as Newton's algorithm, can be used for this purpose. At a given step of Newton's algorithm, one updates the current value of \(\rho _0\) to \(\rho _0 + \eta \), where, letting \(F(\rho _0) := \mu (1) + dU(\varphi (1))\), \(\eta \) is chosen such that

$$ F(\rho _0) + dF(\rho _0)\eta = 0. $$

One therefore needs to solve this linear equation in order to update the current \(\rho _0\). One has

$$ dF(\rho _0) = W_{\mu \mu }(1) + d^2U(\varphi (1))\, W_{\varphi \mu }(1), $$

where

$$ W = \begin{pmatrix} W_{\varphi \varphi }&{} W_{\varphi \mu }\\ W_{\mu \varphi } &{} W_{\mu \mu }\end{pmatrix} $$

is (using the notation of the previous section) the differential of the solution of the equation

$$ \partial _t \begin{pmatrix}\varphi \\ \mu \end{pmatrix} = \mathscr {V}(\varphi , \mu ), $$

with respect to its initial condition, i.e., the solution of

$$ \partial _t W = d\mathscr {V}(\varphi , \mu ) W $$

with initial condition \(W(0) = {\mathrm {Id}}\).

Because one needs to compute the solution of this differential equation at every step of the algorithm, and then solve a linear system, the shooting method is feasible only for problems that can be discretized into a relatively small number of dimensions. One can use it, for example, in point-matching problems with no more than a few hundred landmarks (see [290] for an application to labeled point matching), in which case the algorithm can be very efficient. Another issue is that root-finding algorithms are not guaranteed to converge. Usually, a good initial solution must be found, using, for example, a few preliminary steps of gradient descent.

10.6.5 Gradient in the Deformable Object

Finally, we consider the option of using the time derivative of the deformable object as a control variable, relying on the fact that, by (10.33), the objective function can be reduced to

$$ E(J) = \frac{1}{2}\int _0^1 L(\partial _t J(t), J(t))\, dt + F(J(1)) $$

with \(L(\eta , J) = \min _{w:\, \eta = w \cdot J} \Vert w\Vert ^2_V\). This formulation is limited in that \(L(\eta , J)\) is not defined for every pair \((\eta , J)\), resulting in constraints in the minimization that are not always easy to handle. Even when it is well defined, the computation of L may be numerically demanding. To illustrate this, consider the image-matching case, in which \(v\cdot J = -\nabla J^T v\). An obvious constraint is that, in order for

$$ \nabla J^T w = -\eta $$

to have at least one solution, the variation \(\eta \) must be supported by the set \(\{\nabla J \ne 0\}\). To compute this solution when it exists, one can write, for \(x\in \varOmega \),

$$ \nabla J(x)^T w(x) = {\big \langle {K(\cdot , x)\nabla J(x)}\, , \, {w}\big \rangle }_V, $$

and it is possible to look for a solution in the form

$$ w(y) = \int _\varOmega \lambda (x) K(y, x) \nabla J(x) dx, $$

where \(\lambda (x)\) can be interpreted as a continuous family of Lagrange multipliers. This results in a linear equation in \(\lambda \), namely

$$ \int _\varOmega \lambda (x) \nabla J(y)^T K(y, x) \nabla J(x) dx = -\eta (y), $$

which is numerically challenging.

For point sets, however, the approach is feasible [159] because L can be made explicit. Given a point-set trajectory \(x(t) = (x^{(1)}(t), \ldots , x^{(N)}(t))\), let S(x(t)) denote the block matrix with (ij) block given by \(K(x^{(i)}(t), x^{(j)}(t))\). The constraints are \(\partial _t x = S(x(t)) \xi (t)\), so that \(\xi (t) = S(x(t))^{-1} \dot{x}(t)\) and the minimization reduces to

$$ E(x) = \frac{1}{2} \int _0^1 {{\dot{x}(t)}^T} S(x(t))^{-1} \dot{x}(t) dt + U(x(1)). $$

Minimizing this function with respect to x by gradient descent is possible, and has been described in [158, 159] for labeled landmark matching. The basic computation is as follows: if \(s_{pq, r} = \partial _{r} s_{pq}\), we can write (using the fact that \(\partial _{r} (S^{-1}) = - S^{-1} (\partial _{r}S) S^{-1}\))

$$\begin{aligned} {\left( {dE(x)}\, \left| {dE(x)}\, {h}\right. \right) }= & {} \int _0^1 {{\dot{x}(t)}^T} S(x(t))^{-1} \dot{h}(t) dt\\&- \int _0^1 \sum _{p,q,r} \xi ^{(p)}(t) \xi ^{(q)}(t) s_{pq, r}(x(t)) h^{(r)}(t) dt + \nabla U(x(1))^T h(1). \end{aligned}$$

After an integration by parts in the first integral, we obtain

$$ dE(x) = - \partial _t \left( S(x(t))^{-1} \dot{x}\right) - z(t) + \left( S(x(1))^{-1} \dot{x}(1) + \nabla U(x(1)) \right) \delta _1(t), $$

where \(z_r(t) = \sum _{p,q} \xi _p(t) \xi _q(t) s_{pq, r}(x(t))\) and \(\delta _1\) is the Dirac measure at \(t=1\).

This singular part can be dealt with by computing the gradient in a Hilbert space in which the evaluation function \(x(\cdot ) \mapsto x(1)\) is continuous. This method has been suggested, in particular, in [129, 161]. Consider the set of trajectories \(x : t\mapsto x(t) = (x^{(1)}(t), \ldots , x^{(N)}(t))\), with fixed starting point x(0), free end-point x(1) and square-integrable time derivative. This is an affine space of the form \(x(0) + H\), where H is the Hilbert space of time-dependent functions \(t\mapsto h(t)\), considered as column vectors of size Nd, with \(h(0)=0\) and

$$ {\big \langle {h}\, , \, {\tilde{h}}\big \rangle }_H = \int _0^1 {{\dot{h}}^T}\dot{\tilde{h}} dt + {{h(1)}^T}\tilde{h}(1). $$

To compute the gradient for this inner product, we need to write \({\left( {dE(x)}\, \left| {dE(x)}\, {h}\right. \right) }\) in the form \({\big \langle {\nabla ^H E(x)}\, , \, {h}\big \rangle }_H\). We will make the assumption that

$$ \int _0^1 \left| S(x(t))^{-1} \dot{x}(t) \right| ^2 dt < \infty , $$

which implies that

$$ \int _0^1 {{\dot{x}(t)}^T} S(x(t))^{-1} \dot{h}(t) dt \le \sqrt{\int _0^1 \left| S(x(t))^{-1} \dot{x}(t) \right| ^2 dt \int _0^1 \left| \dot{h}(t) \right| ^2 dt} $$

is continuous in h. Similarly, the linear form \(h \mapsto \nabla U(x(1))^T h(1)\) is continuous since

$$ \nabla U(x(1))^Th(1) \le \left| \nabla U(x(1)) \right| \left| h(1) \right| . $$

Finally, \(h \mapsto \int _0^1 {{z(t)}^T} h(t)\, dt\) is continuous provided that we assume that

$$\eta (t) = \int _0^t z(s)\, ds$$

is square integrable over [0, 1], since this yields

$$ \int _0^1 {{z(t)}^T} h(t) dt = {{\eta (1)}^T} h(1) - \int _0^1 {{\eta (t)}^T} \dot{h}(t) dt, $$

which is continuous in h with respect to the H norm.

Thus, under these assumptions, \(h \mapsto {\left( {dE(x)}\, \left| {dE(x)}\, {h}\right. \right) }\) is continuous over H, and the Riesz representation theorem implies that \(\nabla ^H E(x)\) exists as an element of H. We now proceed to its computation. Letting

$$ \mu (t) = \int _0^t S(x(s))^{-1} \dot{x}(s) ds $$

and \(a = \nabla U(x(1))\), the problem is to find \(\zeta \in H\) such that, for all \(h\in H\),

$$ {\big \langle {\zeta }\, , \, {h}\big \rangle }_H = \int _0^1 {{\dot{\mu }}^T} \dot{h}dt + \int _0^1 {{z(t)}^T} h(t) dt + {{a}^T}h(1). $$

This expression can also be written

$$ \int _0^1 {{\left( \dot{\zeta }+ \zeta (1)\right) }^T} \dot{h} dt = \int _0^1 {{\left( \dot{\mu }+ \eta (1) -\eta (t) + a\right) }^T} \dot{h} dt. $$

This suggests selecting \(\zeta \) such that \(\zeta (0)=0\) and

$$\dot{\zeta }+ \zeta (1) = \dot{\mu }+\eta (1) - \eta (t) + a,$$

which implies

$$\zeta (t) + t\zeta (1) = \mu (t) -\int _0^t \eta (s) ds + t(\eta (1)+a).$$

At \(t=1\), this yields

$$2\zeta (1) = \mu (1) - \int _0^1 \eta (s) ds + \eta (1) + a$$

and we finally obtain

$$ \zeta (t) = \mu (t) -\int _0^t \eta (s) ds + \frac{t}{2} \left( \int _0^1 \eta (s) ds - \mu (1) + \eta (1) + a\right) . $$

We summarize this in an algorithm, in which \(\tau \) is again the computation time.

Algorithm 3

(Gradient descent algorithm for landmark matching) Start with initial landmark trajectories \(x(t, \tau ) = (x^{(1)}(t,\tau ), \ldots , x^{(N)}(t,\tau ))\).

Solve

$$\begin{aligned} \partial _\tau x(t,\tau )= & {} -\gamma \Big (\mu (t, \tau ) -\int _0^t \eta (s, \tau ) ds \\+ & {} \frac{t}{2} \Big (\int _0^1 \eta (s, \tau ) ds - \mu (1, \tau ) + \eta (1, \tau ) + a(\tau )\Big )\Big ) \end{aligned}$$

with \( a(\tau )= \nabla U(x(1,\tau ))\), \(\mu (t, \tau ) = \int _0^t \xi (s, \tau ) ds\), \(\eta (t, \tau ) = \int _0^t z(s, \tau ) ds\) and

$$\begin{aligned} \xi (t, \tau )= & {} S(x(t, \tau ))^{-1} \dot{x}(t, \tau ) \\ z^{(r)}(t, \tau )= & {} \sum _{p,q} \xi ^{(p)}(t, \tau ) \xi ^{(q)}(t, \tau ) s_{pq, r}(x(t, \tau )). \end{aligned}$$

10.6.6 Image Matching

We now take an infinite-dimensional example to illustrate some of the previously discussed methods and focus on the image-matching problem. We therefore consider

$$ U(\varphi ) = \frac{\lambda }{2} \int _\varOmega (I\circ \varphi ^{-1} - \tilde{I})^2 dx, $$

where \(I, \tilde{I}\) are functions \(\varOmega \rightarrow {\mathbb {R}}\), I being differentiable. The Eulerian differential of U is given by (9.21):

$$ \bar{\partial }U(\varphi ) = -\lambda (I\circ \varphi ^{-1} - \tilde{I}) \nabla (I\circ \varphi ^{-1}) dx. $$

So, according to (10.19), and letting \(\tilde{U}(v) = U(\varphi _{01}^v)\),

$$\begin{aligned} \nabla \tilde{U}(v)(t, y)= & {} \sum _{i = 1}^d {\left( {\bar{\partial }U(\varphi _{01}^v)}\, \left| {\bar{\partial }U(\varphi _{01}^v)}\, {d\varphi ^v_{t1}(\varphi ^v_{1t}(.)) K(y, \varphi ^v_{1t}(.))e_i}\right. \right) } e_i\\= & {} -\lambda \sum _{i=1}^d e_i \int _\varOmega (I\circ \varphi ^v_{10}(x) - \tilde{I}(x))\\&\quad \quad \nabla (I\circ \varphi ^v_{10})(x)^T d\varphi ^v_{t1}(\varphi ^v_{1t}(x)) K(y, \varphi ^v_{1t}(x)) e_i dx\\= & {} -\lambda \sum _{i=1}^d e_i \int _\varOmega (I\circ \varphi ^v_{10}(x) - \tilde{I}(x))\\&\quad \quad \nabla I\circ \varphi ^v_{10}(x)^T d\varphi ^v_{10}(x) d\varphi ^v_{t1}(\varphi ^v_{1t}(x)) K(y, \varphi ^v_{1t}(x)) e_i dx\\= & {} -\lambda \sum _{i=1}^d e_i \int _\varOmega (I\circ \varphi ^v_{10}(x) - \tilde{I}(x))\\&\quad \quad \nabla I\circ \varphi ^v_{10}(x)^T d\varphi ^v_{t0}(\varphi ^v_{1t}(x)) K(y, \varphi ^v_{1t}(x)) e_i dx\\= & {} -\lambda \sum _{i=1}^d e_i \int _\varOmega (I\circ \varphi ^v_{10}(x) - \tilde{I}(x)) \nabla (I\circ \varphi ^v_{t0})(\varphi ^v_{1t}(x))^T K(y, \varphi ^v_{1t}(x)) e_i dx\\= & {} -\lambda \sum _{i=1}^d e_i \int _\varOmega (I\circ \varphi ^v_{10}(x) - \tilde{I}(x)) e_i^T K(\varphi _{1t}^v(x), y) \nabla (I\circ \varphi ^v_{t0})(\varphi ^v_{1t}(x)) dx\\= & {} -\lambda \int _\varOmega (I\circ \varphi ^v_{10}(x) - \tilde{I}(x)) K(\varphi _{1t}^v(x), y) \nabla (I\circ \varphi ^v_{t0})(\varphi ^v_{1t}(x)) dx. \end{aligned}$$

This provides the expression of the V-gradient of \(\tilde{E}\) for image matching, namely

$$\begin{aligned}&(\nabla ^V E(v))(t, y) = v(t, y) \\&\qquad - \lambda \int _\varOmega (I\circ \varphi ^v_{10}(x) - \tilde{I}(x)) K(\varphi _{1t}^v(x), y) \nabla (I\circ \varphi ^v_{t0})(\varphi ^v_{1t}(x)) dx. \nonumber \end{aligned}$$
(10.44)

Using a change of variable in the integral, the gradient may also be written as

$$\begin{aligned}&\quad (\nabla ^VE)(t, y) = v(t, y) \\&- \lambda \int _\varOmega (I\circ \varphi ^v_{t0}(x) - \tilde{I}\circ \varphi ^v_{t1}(x)) K(x, y) \nabla (I\circ \varphi ^v_{t0})(x) \det (d\varphi ^v_{t1}(x)) dx, \nonumber \end{aligned}$$
(10.45)

the associated gradient descent algorithm having been proposed in [32].

Let us now consider an optimization with respect to the initial \(\rho _0\). First notice that, by (9.20), \(\mu (1) = \lambda \,\det (d\varphi (1))(I - I'\circ \varphi (1)) d\varphi (1)^{-T}\nabla I \, dx\) is a vector measure. Also, we have

$$\begin{aligned} {\left( {\rho _0}\, \left| {\rho _0}\, {w}\right. \right) }= & {} {\left( {\mu (1)}\, \left| {\mu (1)}\, {d\varphi (1) w}\right. \right) }\\= & {} \lambda {\left( {dx}\, \left| {dx}\, {\det (d\varphi (1))(I - I'\circ \varphi (1)) \nabla I^T w }\right. \right) }, \end{aligned}$$

which shows that one can assume that \(\rho _0 = z_0 dx\) for some vector-valued function \(z_0\) (with \(z_0 = \lambda \det (d\varphi (1))(I - I'\circ \varphi (1)) \nabla I\) for an optimal control).

We now make explicit the computation of the differential of the energy with respect to \(\rho _0\). We have \(\mu (t) = z(t, \cdot ) dx\), with \(z(0) = z_0\) and

$$\begin{aligned} \left\{ \begin{aligned} \partial _t \varphi (t,y)&= \int _{{\mathbb {R}}^d} K(\varphi (t,y), \varphi (t,x)) z(t,x) dx\\ \partial _t z(t,y)&= -\int _{{\mathbb {R}}^d} \sum _{i,j=1}^d z^i(t,y)z^j(t, x) \nabla _1 K^{ij}(\varphi (t,y), \varphi (t, x))dx. \end{aligned} \right. \end{aligned}$$
(10.46)

The differential \(dE(\rho _0) = K\rho _0 - p_\mu (0)\) is computed by solving, using \(\alpha (1) = \lambda \det (d\varphi (1))(I - I'\circ \varphi (1)) d\varphi (1)^{-T}\nabla I\) and \(p_\mu (1) = 0\),

$$\begin{aligned} \left\{ \begin{aligned} \partial _t\alpha =&\zeta ^{1,1} + \zeta ^{2,1}\\ \partial _t p_\mu =&\int _{{\mathbb {R}}^d} K(\varphi (x), \varphi (y)) \alpha (y) dy\\&-\sum _{i=1}^d \int _{{\mathbb {R}}^d} z^i(y) \partial _1 K^i(\varphi (x), \varphi (y))p_\mu (x)dy \\&- \sum _{i=1}^d \int _{{\mathbb {R}}^d} z^i(y) \partial _2 K^i(\varphi (x), \varphi (y))p_\mu (y)dy, \end{aligned} \right. \end{aligned}$$
(10.47)

in which

$$\begin{aligned} \zeta ^{1,1}(x) =&\sum _{i, j=1}^d \int _{{\mathbb {R}}^d} (\alpha ^i(y) z^j(x)+z^i(y) \alpha ^j(x))\nabla _1 K^{ij}(\varphi (x), \varphi (y)) dy \end{aligned}$$

and

$$\begin{aligned} \zeta ^{2,1}(x) =&-\sum _{i, j=1}^d \int _{{\mathbb {R}}^d} z^i(y) z^j(x) \partial _1\partial _2 K^{ij}(\varphi (x), \varphi (y)) p_\mu (y)dy\\&- \sum _{i, j=1}^d \int _{{\mathbb {R}}^d} z^i(y) z^j(x) \partial _1^2 K^{ij}(\varphi (x), \varphi (y)) p_\mu (x)dy. \end{aligned}$$

We summarize the computation of the gradient of the image-matching functional with respect to \(z_0\) such that \(\rho _0 = z_0 dx\):

Algorithm 4

  1. Solve (10.46) with initial conditions \(\varphi (0)= {\mathrm {id}}\) and \(z(0) = z_0\) and compute \(dU(\varphi (1)) = -\lambda (I- I'\circ \varphi (1))\det (d\varphi (1)) d\varphi (1)^{-T} \nabla I\) (a sketch of this computation is given after the algorithm).

  2. Solve, backwards in time until time \(t=0\), the system (10.47) with boundary conditions \(\alpha (1) = -dU(\varphi (1))\) and \(p_\mu (1) = 0\).

  3. Set \(\nabla E(z_0) = z_0 - {\mathbb {K}}^{-1}p_\mu (0)\).

The gradient is computed with the metric \({\big \langle {z}\, , \, {z'}\big \rangle } = \int _{{\mathbb {R}}^d} z(y)^T ({\mathbb {K}}z')(y) dy\). Results obtained with this algorithm are presented in Fig. 10.2.

Fig. 10.2

Metric image matching. Output of Algorithm 4 when estimating a deformation of the first image to match the second one (compare to Fig. 9.2). The third image is the obtained deformation of the first one and the last provides the deformation applied to a grid

One can also use the fact that \(z_0 = f_0 \nabla I\) for a scalar-valued \(f_0\). Since we have

$$ {\Big ( {dE(z_0)}\,\Big |\, {h_0}\Big )} = \int _\varOmega ({\mathbb {K}}z_0 - p_\mu (0))^T h_0 dy,$$

we can write, with \(\tilde{E}(f_0) = E(f_0 \nabla I)\):

$$ {\Big ( {d\tilde{E}(f_0)}\,\Big |\, {u_0}\Big )} = \int _\varOmega ({\mathbb {K}}(f_0\nabla I) - p_\mu (0))^T\nabla I u_0 dy,$$

which leads to replacing the last step in Algorithm 4 by

$$\nabla \tilde{E}(f_0) = \nabla I^T ({\mathbb {K}}(f_0\nabla I) - p_\mu (0)),$$

which corresponds to using the \(L^2\) metric in \(f_0\) for gradient descent. However, a more natural metric, in this case, is the one induced by the kernel, i.e.,

$$ {\big \langle {f}\, , \, {f'}\big \rangle }_I = \int _\varOmega ({\mathbb {K}}(f\nabla I))(y)^T f'(y) \nabla I(y)\, dy = \int _\varOmega \int _\varOmega K_I(x, y) f(x)f'(y)\, dx\, dy $$

with \(K_I(x,y) = \nabla I(x)^T K(x, y) \nabla I(y)\). With this metric, \(f_0\) is updated using

$$ \nabla \tilde{E}(f_0) = f_0 - {\mathbb {K}}_I^{-1} \nabla I^Tp_\mu (0).$$

Although this metric is more satisfactory from a theoretical viewpoint, the inversion of \({\mathbb {K}}_I\) may be numerically challenging.

10.6.7 Pros and Cons of the Optimization Strategies

In the previous sections we have reviewed several possible choices of control variables with respect to which the optimization of the matching energy can be performed. For all but the shooting method, this results in specific expressions of the gradient that can then be used in optimization procedures such as those discussed in Appendix D.

All these procedures have been implemented in the literature to solve a diffeomorphic-matching problem in at least one specific context, but no extensive study has ever been made to compare them. Even if the outcome of such a study is likely to be that the best method depends on the specific application, one can still provide a few general facts that can help a user decide which one to use.

When feasible (that is, when the linear system it involves at each step can be efficiently computed and solved), the shooting method is probably the most efficient. If the initialization is not too far from the solution, convergence can be achieved in a very small number of iterations. One cannot guarantee, however, that the method will converge starting from any initial point, and shooting needs to be combined with some gradient-based procedure in order to find a good starting position.

Since they optimize with respect to the same variable, the most natural procedure to combine with shooting is optimization with respect to the initial momentum. Even when shooting is not feasible (e.g., for large-scale problems), this specific choice of control variable is important, because it ensures that the final solution satisfies the EPDiff equation, thus guaranteeing the consistency of the momentum representation that will be discussed in Sect. 11.5.2. The limitation is that, with large and complex deformations, the sensitivity of the solution to small changes in the control variable can be large, which may result in an unstable optimization procedure.

The other methods, which optimize with respect to time-dependent quantities, are generally better able to compute very large deformations. Besides the obvious additional burden in computer memory that they require, one must be aware that the discrete solution can sometimes be far from satisfying the EPDiff equation unless the time discretization is fine enough (which may be impossible to achieve within a feasible implementation for large-scale problems). Therefore, these methods are not the best choice if obtaining a reliable momentum representation is important. Among the three time-dependent control variables that we have studied (velocity, momentum and deformable object), one may have a slight preference for the representation using the time-dependent momenta, even if the computation it involves is slightly more complex than the others. There are at least two reasons for this. First, the momenta are generally more parsimonious in the space variables, because they incorporate normality constraints with respect to transformations that leave the deformable objects invariant. Second, the forward and backward equations solved at each iteration directly provide a gradient with respect to the correct metric, so that the implementation does not need to solve the possibly large-dimensional linear system required by other representations.

10.7 Numerical Aspects

10.7.1 Discretization

The implementation of the diffeomorphic matching algorithms that were just discussed requires a proper discretization of the different variables that are involved. The discretization in time of optimal control problems is discussed in Sect. D.4. This discussion directly applies here and we refer the reader to the relevant pages in that appendix for more details. If the deformed objects are already discrete (e.g., point sets), this suffices to design a numerical implementation.

When the deformed objects are continuous, some discrete approximation must obviously be made. One interesting feature of the problems that we have discussed is that they all derive from the general formulation (10.8), but can be reduced, using Sect. 10.6.2, to a situation in which the state and controls are finite dimensional after discretization. Typically, starting from (10.8), the discretization implies that only the end-point cost function is modified, replacing \(U(\varphi ) = F(\varphi \cdot I_0)\) by an approximation taking the form \(U^{(n)}(\varphi ) = F^{(n)}(\varphi , I_0^{(n)})\). For example, when matching curves, one may replace the objective function \(F(\varphi \cdot I_0) = \Vert \mu _{\varphi \cdot I_0} - \mu _{I'}\Vert _{W^*}^2\) in (9.40) by the discrete approximation in (9.46), in which the curves \(I_0\) and \(I'\) are approximated by point sets. Similar approximations can be made for the other types of cost functions discussed for curves and surfaces. In such cases, the following proposition can be applied to compare solutions of the original problem with their discrete approximations.

Proposition 10.16

Assume that V is continuously embedded in \(C^{p+1}_0({\mathbb {R}}^d, {\mathbb {R}}^d)\). Consider a family of optimal control problems minimizing

$$\begin{aligned} E^{(n)}(v) = \frac{1}{2} \int _0^1 \Vert v\Vert _V^2 \, dt + U^{(n)}(\varphi ^v_{01}), \end{aligned}$$
(10.48)

with \(U^{(n)}\) continuous for the \((p, \infty )\)-compact topology. Let U be continuous with respect to the same topology and assume that, for some \(p>0\), the following uniform convergence is true: for all \(A>0\) and \(\varepsilon >0\), there exists an \(n_0\) such that, for all \(n\ge n_0\), for all \(\varphi \in \mathrm {Diff}^{p, \infty }_0\) such that \(\max (\Vert \varphi \Vert _{p, \infty }, \Vert \varphi ^{-1}\Vert _{p, \infty }) < A\), one has \(|U^{(n)}(\varphi ) - U(\varphi )|<\varepsilon \).

Then, given a sequence \(v^{(n)}\) of minimizers of (10.48), one can extract a subsequence \(v^{(n_k)}\) that weakly converges to v in \({\mathcal X}^2_V\), with v minimizing

$$\begin{aligned} E(w) = \frac{1}{2} \int _0^1 \Vert w\Vert _V^2 \, dt + U(\varphi ^w_{01}). \end{aligned}$$
(10.49)

Proof

Let w be a minimizer of (10.49). Since our assumptions imply that \(U^{(n)}(\varphi ^{w}_{01})\) converges to \(U(\varphi ^{w}_{01})\) (so that their difference is bounded), we have \(E^{(n)}(w) \le E(w) + C\) for some constant C, so that, letting \(v^{(n)}\) be a minimizer of \(E^{(n)}\), we have \(\Vert v^{(n)}\Vert ^2_{{\mathcal X}^2_V} \le 2E^{(n)}(v^{(n)}) \le 2E(w) + 2C\). From this we find that \(v^{(n)}\) is a bounded sequence in \({\mathcal X}^2_V\), so that, replacing it with a subsequence if needed, we can assume that it weakly converges to some \(v\in {\mathcal X}^2_V\). Applying Theorem 7.13, we find that \(\varphi ^{v^{(n)}}_{01}\) converges to \(\varphi ^v_{01}\) in the \((p, \infty )\)-compact topology. Moreover, Theorem 7.10 implies that the sequences \((\Vert \varphi _{01}^{v^{(n)}}\Vert _{p, \infty }, \Vert \varphi _{10}^{v^{(n)}}\Vert _{p, \infty })\) are bounded. Applying the uniform convergence of \(U^{(n)}\) to U on bounded sets and the continuity of U, we see that \(U^{(n)}(\varphi _{01}^{v^{(n)}})\) converges to \(U(\varphi ^v_{01})\) as n tends to infinity. Since, in addition,

$$ \Vert v\Vert _{{\mathcal X}^2_V}\le \liminf \Vert v^{(n)}\Vert _{{\mathcal X}^2_V} $$

we obtain the fact that \(E(v) \le \liminf E^{(n)}(v^{(n)})\). We also have

$$ E^{(n)}(v^{(n)}) \le E^{(n)}(w) = E(w) + U(\varphi _{01}^w) - U^{(n)}(\varphi _{01}^w) \rightarrow E(w), $$

so that \(E(v)= E(w)\) and v is also a minimizer of (10.49).   \(\square \)

Curves and Surfaces. We can apply this proposition to curve and surface matching according to the following discussion, in which we focus on surface matching using currents, but which can, with very little modification, be applied to curves, and to measure or varifold matching terms. Let \(\varSigma \) and \(\tilde{\varSigma }\) be regular surfaces and \(S^{(n)}\), \(\tilde{S}^{(n)}\) be sequences of triangulated surfaces that converge to them as defined before Theorem 4.3. Let (fixing an RKHS W with kernel \(\xi \))

$$ U(\varphi ) = \Vert \nu _{\varphi \cdot \varSigma } - \nu _{\tilde{\varSigma }}\Vert _{W^*}^2, $$

using the vector measures defined in Eq. (9.49), and

$$ U^{(n)}(\varphi ) = \Vert \nu _{\varphi \cdot S^{(n)}} - \nu _{\tilde{S}^{(n)}}\Vert _{W^*}^2, $$

using the discrete version as in (9.56). Then, Theorem 4.3, slightly modified to account for double integrals, can be used to check that the assumptions of Proposition 10.16 are satisfied.
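For concreteness, the following sketch (not the book's code) shows a common discretization of a currents-type surface matching term of the kind used for \(U^{(n)}\): each triangle is represented by a single Dirac placed at its center and weighted by its area-weighted normal, and W is assumed to have a scalar Gaussian kernel. All names and the kernel width are illustrative.

    import numpy as np

    def faces_to_diracs(vertices, faces):
        # One Dirac per triangle: center c_f and area-weighted normal N_f.
        v0, v1, v2 = (vertices[faces[:, i]] for i in range(3))
        centers = (v0 + v1 + v2) / 3.0
        normals = 0.5 * np.cross(v1 - v0, v2 - v0)   # |N_f| = triangle area
        return centers, normals

    def currents_sq_dist(c1, n1, c2, n2, sigma_W=0.5):
        # ||nu_S - nu_S'||_{W*}^2 for a scalar Gaussian kernel on W.
        def k(a, b):
            d2 = np.sum((a[:, None, :] - b[None, :, :]) ** 2, axis=-1)
            return np.exp(-d2 / (2 * sigma_W ** 2))
        return (np.einsum('ij,ik,jk->', k(c1, c1), n1, n1)
                + np.einsum('ij,ik,jk->', k(c2, c2), n2, n2)
                - 2 * np.einsum('ij,ik,jk->', k(c1, c2), n1, n2))

In this sketch, \(U^{(n)}(\varphi )\) would be obtained by applying faces_to_diracs to the deformed vertices of \(S^{(n)}\) and comparing the result with the Diracs of \(\tilde S^{(n)}\).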

Images. The image matching problem can be discretized using finite grids, assuming that the considered images have bounded support. Consider the cost function

$$ U(\varphi ) = \Vert I\circ \varphi ^{-1} - \tilde{I}\Vert _2^2, $$

in which we assume, to simplify, that I and \(\tilde{I}\) are compactly supported (say, on \(\mathcal K = [-M, M]^d\)) and bounded. We start with a discretization that can be applied to general \(L^2\) functions. Let \(\mathcal G_n = \{-M+ 2^{-n+1}kM, k=0, \ldots , 2^n\}^d\) provide a discrete grid on \(\mathcal K\) and associate to each point \(z\in \mathcal G_n\) its Voronoï cell, \(\varGamma _n(z)\), given by the set of points in \(\mathcal K\) that are closer to z than to any other point in the grid (i.e., \(\varGamma _n(z)\) is the intersection of \(\mathcal K\) and the cube of side \(2^{-n+1}M\) centered at z). Define

$$ I^{(n)}(x) = \sum _{z\in \mathcal G_n} \bar{I}^{(n)}(z) {\mathbf 1}_{\varGamma _n(z)}(x), $$

where

$$ \bar{I}^{(n)}(z) = \frac{1}{|\varGamma _n(z)|}\int _{\varGamma _n(z)} I(x) \, dx $$

is the average value of I over \(\varGamma _n(z)\).

Define \(\tilde{I}^{(n)}\) similarly and consider the approximation of U given by \(U^{(n)}(\varphi ) = \Vert I^{(n)}\circ \varphi ^{-1} - \tilde{I}^{(n)}\Vert _2^2\). Then \(U^{(n)}\) and U satisfy the hypotheses of Proposition 10.16.

Indeed, assume that \(\max (\Vert \varphi \Vert _{1, \infty }, \Vert \varphi ^{-1}\Vert _{1, \infty }) < A\). We have

$$\begin{aligned} |U^{(n)}(\varphi ) - U(\varphi )|&\le 2\Vert I\circ \varphi ^{-1} - I^{(n)}\circ \varphi ^{-1}\Vert _2^2 + 2\Vert \tilde{I} - \tilde{I}^{(n)}\Vert _2^2\\&\le 2C(A) \Vert I - I^{(n)}\Vert _2^2 + 2\Vert \tilde{I} - \tilde{I}^{(n)}\Vert _2^2, \end{aligned}$$

where the second inequality is obtained after a change of variable in the first \(L^2\) norm and C(A) is an upper bound for the Jacobian determinant of \(\varphi \) depending only on A. As a consequence, the hypotheses of Proposition 10.16 will be satisfied as soon as one shows that \(I^{(n)}\) and \(\tilde{I}^{(n)}\) converge in \(L^2\) to I and \(\tilde{I}\) respectively (and this will also hold for any sequence of approximations of I and \(\tilde{I}\) with this property). The \(L^2\) convergence is true in our case because \(I^{(n)}\) is the orthogonal projection of I onto the space \(W_n\) of \(L^2\) functions that are constant on each set \(\varGamma _n(z), z\in \mathcal G_n\). This implies that \(I^{(n)}\) converges in \(L^2\) to the projection of I onto \(W_\infty = \overline{\bigcup _{n\ge 1} W_n}\) (see Proposition A.11), but one has \(W_\infty =L^2\), because any function J orthogonal to this space would have its integral vanish on every dyadic cube, which is only possible for \(J=0\).

Note that, with this approximation, one can write

$$\begin{aligned} \Vert I^{(n)} \circ \varphi ^{-1}- \tilde{I}\Vert ^2_2&= \sum _{z\in \mathcal G_n} I(z)^2 |\varphi (\varGamma _n(z))| + \sum _{z\in \mathcal G_n} \tilde{I}(z)^2 |\varGamma _n(z)| \\&\qquad \,\,\qquad - 2\sum _{z, z'\in \mathcal G_n} I(z)\tilde{I}(z') |\varphi (\varGamma _n(z))\cap \varGamma _n(z')|, \end{aligned}$$

where |A| denotes the volume of \(A\subset {\mathbb {R}}^d\). To make this expression computable, one needs to approximate the sets \(\varphi (\varGamma _n(z))\), where the simplest approximation is to take the polyhedron formed by the image of the vertices of \(\varGamma _n(z)\) by \(\varphi \) (which will retain the same topology as the original cube if n is large enough). The verification that this approximation is valid (in the sense of Proposition 10.16) is left to the reader.

However, even with this approximation, the numerical problem remains computationally demanding, since it becomes a point-set problem over \(\mathcal G_n\), which is typically a huge set. Most current implementations use a simpler scheme, in which \(I^{(n)}\) is interpolated between the values \((I(z), z\in \mathcal G_n)\), which are therefore assumed to be well defined, and the cost function is simply approximated by

$$ U^{(n)}(\varphi ) = \sum _{z\in \mathcal G_n} (I(\varphi ^{-1}(z)) - \tilde{I}(z))^2 |\varGamma _n(z)|. $$

Here again, we leave it to the reader to check that this provides a valid approximation in the sense of Proposition 10.16 as soon as, say, I and \(\tilde{I}\) are continuous and one uses a linear interpolation scheme, as described below.
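As a small illustration (not from the text), once \(\varphi ^{-1}\) has been tabulated at the grid nodes, this discrete cost can be evaluated in a few lines; the sketch below uses scipy.ndimage.map_coordinates with order=1 as the linear interpolation scheme, and all names are illustrative.

    import numpy as np
    from scipy.ndimage import map_coordinates

    def discrete_image_cost(I, I_target, phi_inv, cell_volume):
        # U^(n)(phi) ~ sum_z (I(phi^{-1}(z)) - I~(z))^2 |Gamma_n(z)|.
        # I, I_target: arrays of image values on the grid G_n;
        # phi_inv: array of shape (d,) + I.shape giving phi^{-1}(z), for each
        #          grid node z, in grid (index) coordinates;
        # cell_volume: the volume |Gamma_n(z)| of one grid cell.
        I_warped = map_coordinates(I, phi_inv, order=1, mode='nearest')
        return cell_volume * np.sum((I_warped - I_target) ** 2)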

Using this approximation (for a fixed n that we will remove from the notation), we now work out the implementation in more detail, starting with the computation of the gradient in (10.45). Assume that time is discretized at \(t_k = kh\) for \(h=1/Q\) and that \(v_k(\cdot ) = v(t_k, \cdot )\) is discretized over a regular grid \({\mathcal G}\).

It will be convenient to introduce the momentum and express \(v_k\) in the form

$$\begin{aligned} v_k(y) = \sum _{z\in {\mathcal G}} K(y, z) \rho _k(z). \end{aligned}$$
(10.50)

We can consider \((\rho _k(z), z\in {\mathcal G})\) as new control variables, noting that (10.45) directly provides the gradient of the energy in \(V^*\), namely

$$ (\nabla ^{V^*}E)(t) = 2\rho (t) - 2 \det (d\varphi ^v_{t1}) (I\circ \varphi ^v_{t0} - \tilde{I}\circ \varphi ^v_{t1}) \nabla (I\circ \varphi ^v_{t0})\, dx. $$

From this expression, we see that we can interpret the family \((\rho _k(z), z\in {\mathcal G})\) as discretizing a measure, namely

$$ \rho _k = \sum _{z\in {\mathcal G}} \rho _k(z) \delta _{z}. $$

Given this, the gradient in \(V^*\) can be discretized as

$$ \xi _k(z) = 2\rho _k(z) - 2 \det (d\varphi ^v_{t_k1}(z)) (I\circ \varphi ^v_{t_k0}(z) - \tilde{I}\circ \varphi ^v_{t_k1}(z)) \nabla (I\circ \varphi ^v_{t_k0}(z)) \delta _z, $$

which can be used to update \(\rho _k(z)\).

The last requirement in order to obtain a fully discrete procedure is to select interpolation schemes for the computation of the diffeomorphisms \(\varphi ^v\) and for the compositions of I and \(I'\) with them. Interpolation algorithms (linear, or cubic, for example) are standard procedures that are included in many software packages [234]. In mathematical terms, they are linear operators that take a discrete signal f on a grid \({\mathcal G}\) (i.e., \(f\in {\mathbb {R}}^{\mathcal G}\)) and return a function, which we will denote by \({\mathcal R}f\), defined everywhere. By linearity, we must have

$$ ({\mathcal R}f)(x) = \sum _ {z\in {\mathcal G}} r_z(x) f(z) $$

for some “interpolants” \(r_z(\cdot ), z\in \mathcal G\). In the approximation of the data attachment term, one can then replace I by \(\mathcal R(I_{|_{\mathcal G}})\), the interpolation of the restriction of I to \(\mathcal G\).

Linear interpolation, for example, corresponds, in one dimension, to \(r_z(x) = 1-2^n|z - x|\) if \(|z-x| < 2^{-n}\) and 0 otherwise. In dimension d, one takes

$$ r_z(x) = \prod _{i=1}^d (1 - 2^n|z_i-x_i|) $$

if \(\max _i(|z_i-x_i|) < 2^{-n}\) and 0 otherwise (where \(z = (z_1, \ldots , z_d)\) and \(x = (x_1, \ldots , x_d)\)).
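A minimal sketch (not from the text) of these separable interpolants and of the associated operator \({\mathcal R}\), written for a grid of spacing h with the same one-dimensional nodes along each axis; all names are illustrative.

    import numpy as np

    def hat_weights(x, grid_1d, h):
        # r_z(x) = prod_i max(0, 1 - |z_i - x_i| / h): separable linear interpolants.
        # x: point of dimension d; grid_1d: the one-dimensional node coordinates.
        w = [np.maximum(0.0, 1.0 - np.abs(grid_1d - xi) / h) for xi in x]
        W = w[0]
        for wi in w[1:]:
            W = np.multiply.outer(W, wi)   # tensor product over the d dimensions
        return W                            # array indexed by the grid: r_z(x) at node z

    def interpolate(f_grid, x, grid_1d, h):
        # (R f)(x) = sum_z r_z(x) f(z), with f_grid the values of f on the grid.
        return np.sum(hat_weights(x, grid_1d, h) * f_grid)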

Given an interpolation operator \({\mathcal R}\), one can replace, say, \(I\circ \varphi _{t_k0}(z)\) in the expression of the gradient by

$$ ({\mathcal R}I)(\varphi _{t_k0}(z)) = \sum _ {z'\in {\mathcal G}} r_{z'}(\varphi _{t_k0}(z)) I(z'). $$

For computational purposes, it is also convenient to replace the definition of \(v_k\) in (10.50) by an interpolated form

$$\begin{aligned} v_k(x) = \sum _{z\in {\mathcal G}} r_z(x) \sum _{z'\in {\mathcal G}} K(z, z')\rho _k(z') \end{aligned}$$
(10.51)

because the inner sum can be computed very efficiently using Fourier transforms (see the next section).

To complete the discretization, introduce

$$ \psi _{lk} = ({\mathrm {id}}- h v_l) \circ \cdots \circ ({\mathrm {id}}- hv_{k-1}), $$

where an empty product of compositions is equal to the identity, so that \(\psi _{lk}\) is an approximation of \(\varphi _{t_kt_l}\). Define the cost function, which is explicitly computable as a function of \(\rho _0, \ldots , \rho _{Q-1}\):

$$ E(\rho ) = \sum _{k=0}^{Q-1} \sum _{z, z'\in {\mathcal G}} K(z, z') \rho _k(z)^T\rho _k(z') + \sum _{z\in {\mathcal G}} (({\mathcal R}I)(\psi _{0Q}(z)) - \tilde{I}(z))^2. $$
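To fix ideas, the following compact sketch (not the book's implementation) evaluates this discrete energy in a simplified setting: the velocities are obtained from the momenta by direct kernel sums rather than the interpolated form (10.51), \(\psi _{0Q}\) is computed by composing the Euler steps \({\mathrm {id}} - hv_k\), and the warped image is evaluated by linear interpolation. The Gaussian kernel, all names, and the assumption that the nodes of \({\mathcal G}\) are listed in the same order as the raveled image array are illustrative choices.

    import numpy as np
    from scipy.ndimage import map_coordinates

    def kernel(a, b, sigma=0.1):
        # Scalar Gaussian kernel evaluated between two point sets.
        d2 = np.sum((a[:, None, :] - b[None, :, :]) ** 2, axis=-1)
        return np.exp(-d2 / (2 * sigma ** 2))

    def discrete_energy(rho, grid, I_grid, I_target_grid, h_t, dx, sigma=0.1):
        # rho: (Q, N, d) momenta at the N grid nodes for each time step;
        # grid: (N, d) node coordinates (assumed to start at 0, with spacing dx,
        #       ordered like I_grid.ravel()); I_grid, I_target_grid: image arrays.
        Q, N, d = rho.shape
        K = kernel(grid, grid, sigma)                      # (N, N), reused at each step
        # Kinetic term: sum_k sum_{z,z'} K(z, z') rho_k(z)^T rho_k(z').
        kinetic = sum(np.sum((K @ rho[k]) * rho[k]) for k in range(Q))
        # psi_{0Q}(z): apply (id - h v_k) for k = Q-1 down to 0.
        y = grid.copy()
        for k in range(Q - 1, -1, -1):
            v_at_y = kernel(y, grid, sigma) @ rho[k]       # v_k(y) = sum_z K(y, z) rho_k(z)
            y = y - h_t * v_at_y
        # Data term: linear interpolation of I at psi_{0Q}(z), then squared differences.
        coords = (y / dx).T                                # (d, N) index coordinates
        I_warped = map_coordinates(I_grid, coords, order=1, mode='nearest')
        data = np.sum((I_warped - I_target_grid.ravel()) ** 2)
        return kinetic + data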

If we make a variation \(\rho \mapsto \rho + \varepsilon \delta \rho \), then \(v\mapsto v + \varepsilon \delta v\) with (using the interpolated expression of v)

$$ \delta v_k(y) = \sum _{z\in {\mathcal G}} r_z(y) \sum _{z'\in {\mathcal G}} K(z, z') \delta \rho _k(z') $$

and letting \(\delta \psi _{lk} = \partial _\varepsilon \psi _{lk}\), we have, by direct computation

$$ \delta \psi _{lk} = -h\sum _{q=l}^{k-1} d\psi _{lq}\circ \psi _{qk}\, \delta v_q \circ \psi _{q+1k}. $$

Using this, we can compute the variation of E, yielding

$$\begin{aligned} {\Big ( {dE(\rho )}\,\Big |\, {\delta \rho }\Big )}= & {} 2\sum _{k=0}^{Q-1} \sum _{z,z'\in {\mathcal G}} K(z, z') \, \rho _k(z')^T \delta \rho _k(z) \\- & {} 2h \sum _{k=0}^{Q-1}\sum _{z,z',y\in {\mathcal G}} K(z, z')\, r_z(\psi _{k+1\, Q}(y))\, (({\mathcal R}I)(\psi _{0 Q}(y)) - \tilde{I}(y))\\&\quad \quad \quad \quad \nabla ({\mathcal R}I)(\psi _{0 Q}(y))^T(d\psi _{0k} \circ \psi _{k\, Q}(y)\, \delta \rho _k(z')). \end{aligned}$$

This provides the expression of the gradient of the discretized E in \(V^*\), namely

$$\begin{aligned}&\quad (\nabla ^{V^*} E(\rho ))_k(z) = 2 \rho _k(z) \\&- 2h \sum _{z'\in {\mathcal G}} r_z(\psi _{k+1\, Q}(z')) (({\mathcal R}I)(\psi _{0 Q}(z')) - \tilde{I}(z')) \nabla ({\mathcal R}I\circ \psi _{0k})(\psi _{k Q}(z')). \end{aligned}$$

10.7.2 Kernel-Related Numerics

Most of the previously discussed methods involve repeated computations of linear combinations of kernel values. A basic such step is to compute, given points \(y_1, \ldots , y_M\), \(x_1, \ldots , x_N\) and vectors (or scalars) \(\alpha _1, \ldots , \alpha _N\), the sums

$$ \sum _{k=1}^N K(y_j, x_k) \alpha _k, \quad j=1, \dots , M. $$

Such sums are involved when deriving velocities from momenta, for example, or when evaluating dual RKHS norms in curve or surface matching.

Computing these sums explicitly requires NM evaluations of the kernel (and this, typically, several times per iteration of an optimization algorithm). When N or M are reasonably small (say, less than 1,000), such a direct evaluation is not a problem. But for large-scale methods, such as triangulated surface matching, where the surface may have tens of thousands of nodes, or image matching, where a three-dimensional grid typically has millions of nodes, this becomes infeasible (the feasibility limit has, however, been pushed further by recent efficient implementations on GPUs [59, 157, 247]).
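As a baseline, the direct \(O(NM)\) evaluation takes a few lines in vectorized form; the sketch below (illustrative, with an assumed Gaussian kernel) builds the full \(M\times N\) kernel matrix, which is precisely what becomes infeasible at large scale.

    import numpy as np

    def kernel_sums_direct(y, x, alpha, sigma=0.2):
        # Compute sum_k K(y_j, x_k) alpha_k for j = 1..M by brute force.
        # y: (M, d); x: (N, d); alpha: (N,) or (N, q). Cost and memory are O(M N).
        d2 = np.sum((y[:, None, :] - x[None, :, :]) ** 2, axis=-1)   # (M, N)
        K = np.exp(-d2 / (2 * sigma ** 2))
        return K @ alpha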

If the two families of points coincide (\(x=y\)) and are supported by a regular grid \(\mathcal G\), and K is translation invariant, i.e., \(K(x, y) = \varGamma (x-y)\), then, letting \(x_k = hk\), where k is a multi-index (\(k=(k_1, \ldots , k_d)\)) and h the discretization step, we see that

$$ \sum _{l\in \mathcal G} \varGamma (h(k-l)) \alpha _l $$

is a convolution that can be implemented with \(O(N\log N)\) operations, using fast Fourier transforms (with \(N = |\mathcal G|\)). The same conclusion holds if K takes the form \(K(x, y) = A(x)^T\varGamma (x-y) A(y)\) for some matrix A (which can be used to censor the kernel at the boundary of a domain), since the resulting operation is

$$ A(x_k)^T \left( \sum _{l\in \mathcal G} \varGamma (h(k-l)) (A(x_l)\alpha _l)\right) , $$

which can still be implemented in \(O(N\log N)\) operations.
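A minimal sketch (not from the text) of this FFT-based evaluation for scalar weights on a regular grid, with an assumed Gaussian \(\varGamma \); the circular convolution implicitly assumes periodic boundary conditions, so in practice the arrays would be zero-padded to avoid wrap-around effects.

    import numpy as np

    def kernel_convolution_fft(alpha, h, sigma=0.2):
        # Compute b_k = sum_l Gamma(h(k - l)) alpha_l on a regular grid by FFT.
        # alpha: array of scalar weights indexed by the grid; h: grid spacing.
        shape = alpha.shape
        # Sample Gamma at the signed grid offsets, in circulant (FFT) ordering.
        offsets = np.meshgrid(*[np.fft.fftfreq(n, d=1.0 / n) * h for n in shape],
                              indexing='ij')
        d2 = sum(g ** 2 for g in offsets)
        Gamma = np.exp(-d2 / (2 * sigma ** 2))
        return np.real(np.fft.ifftn(np.fft.fftn(Gamma) * np.fft.fftn(alpha)))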

The situation is less favorable when x and y are not regularly spaced. In such cases, feasibility must come with some approximation.

Still assuming a translation-invariant kernel \(K(x, y) = \varGamma (x-y)\), we can associate to a grid \(\mathcal G\) in \({\mathbb {R}}^d\) the interpolated kernel

$$ K_{\mathcal G}(x,y) = \sum _{z, z'\in \mathcal G} r_z(x) \varGamma (h(z-z')) r_{z'}(y), $$

where the \(r_z\)’s are interpolants adapted to the grid. This approximation provides a non-negative kernel, with null space equal to the space of functions with vanishing interpolation on \(\mathcal G\). With such a kernel, we have

$$ \sum _{k=1}^N K_{\mathcal G}(y_j, x_k) \alpha _k = \sum _{z\in \mathcal G} r_z(y_j) \sum _{z'\in \mathcal G} \varGamma (h(z-z')) \sum _{k=1}^Nr_{z'}(x_k) \alpha _k. $$

The computation of this expression therefore requires using the following sequence of operations:

  1. Compute, for all \(z'\in \mathcal G\), the quantity

    $$ a_{z'} = \sum _{k=1}^Nr_{z'}(x_k) \alpha _k. $$

    Because, for each \(x_k\), only a fixed number of the \(r_{z'}(x_k)\) are non-vanishing, this requires O(N) operations.

  2. Compute, for all \(z\in \mathcal G\),

    $$ b_z = \sum _{z'\in \mathcal G} \varGamma (h(z-z')) a_{z'}, $$

    which is a convolution requiring \(O(|\mathcal G|\log |\mathcal G|)\) operations.

  3. Compute, for all \(j=1, \ldots , M\), the interpolation

    $$ \sum _{z\in \mathcal G} r_z(y_j) b_z, $$

    which requires O(M) operations.

So the resulting cost is \(O(M+N+|\mathcal G|\log |\mathcal G|)\), which must be compared to the original O(MN), the comparison being favorable essentially when MN is larger than the number of nodes in the grid, \(|\mathcal G|\). This formulation (proposed in [156]) has the advantage that the resulting algorithm is quite simple, and that \(K_{\mathcal G}\) remains a non-negative kernel, which is important.
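The following one-dimensional sketch (purely illustrative) strings the three steps together: spread the weights onto the grid with the hat interpolants, convolve with \(\varGamma \) by FFT, and interpolate back at the evaluation points. The interpolation matrices are formed densely for readability, although only two weights per point are non-zero, and wrap-around effects are ignored.

    import numpy as np

    def gridded_kernel_sums(y, x, alpha, grid, sigma=0.2):
        # Approximate sum_k Gamma(y_j - x_k) alpha_k with the grid-projected kernel.
        # y: (M,), x: (N,) points inside the grid range; alpha: (N,) scalar weights;
        # grid: (G,) regularly spaced nodes.
        h = grid[1] - grid[0]

        def hat(pts):
            # Linear interpolation weights r_z(pts); shape (len(pts), G).
            return np.maximum(0.0, 1.0 - np.abs(pts[:, None] - grid[None, :]) / h)

        a = hat(x).T @ alpha                                  # step 1: spread, O(N)
        dist = grid - grid[0]
        dist = np.minimum(dist, dist[-1] + h - dist)          # circulant distances
        Gamma = np.exp(-dist ** 2 / (2 * sigma ** 2))
        b = np.real(np.fft.ifft(np.fft.fft(Gamma) * np.fft.fft(a)))   # step 2: convolve
        return hat(y) @ b                                     # step 3: interpolate, O(M)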

Another class of methods, called “fast multipole”, computes sums such as

$$ \sum _{k=1}^N K(y, x_k) \alpha _k $$

by taking advantage of the fact that K(y, x) varies slowly as x varies in a region that is far away from y. By grouping the \(x_k\)’s in clusters, assigning centers to these clusters and approximating the kernel using asymptotic expansions valid at a large enough distance from the clusters, fast multipole methods can organize the computation of the sums with a resulting cost of order \(M+N\) when M sums over N terms are computed. Even though the total number of operations is smaller than a constant times \((M+N)\), it increases (via the size of the constant) with the required accuracy. The interested reader may refer to [30, 140] for more details.

Another important operation involving the kernel is the inversion of the system of equations (say, with a scalar kernel)

$$\begin{aligned} \sum _{l=1}^N K(x_k, x_l) \alpha _l = u_k,\quad k=1, \ldots , N. \end{aligned}$$
(10.52)

This is the spline interpolation problem, but it is also part of several of the algorithms that we have discussed, including for example the projection steps that have been introduced to obtain a gradient in the correct metric.

Such a problem is governed by an uncertainty principle [258] between accuracy of the approximation, which is given by the distance between a smooth function \(x\mapsto u(x)\) and its interpolation

$$ x\mapsto \sum _{l=1}^N K(x, x_l) \alpha _l, $$

where \(\alpha _1, \ldots , \alpha _N\) are given by (10.52) with \(u_k = u(x_k)\), and the stability of the system (10.52) measured by the condition number (the ratio of the largest to the smallest eigenvalue) of the matrix \(S(x) = (K(x_i, x_j), i, j=1, \ldots , N)\), evaluated as a function of the smallest distance between two distinct \(x_k\)’s (S(x) is singular if two \(x_k\)’s coincide).

When \(K(x, y) = \varGamma (x-y)\), the trade-off is measured by how fast \(\xi \mapsto \hat{\varGamma }(\xi )\) (the Fourier transform of \(\varGamma \)) decreases at infinity. One extreme is given by the Gaussian kernel, for which \(\hat{\varGamma }\) decreases like \(e^{-c|\xi |^2}\), which is highly accurate and highly unstable. At the other end of the range are Laplacian kernels, whose Fourier transforms decrease polynomially. In this dilemma, one possible rule is to prefer accuracy for small values of N, therefore using a kernel like the Gaussian, and to favor stability for large-scale problems (using a Laplacian kernel of high enough degree).
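This trade-off is easy to observe numerically. The short, purely illustrative snippet below compares the condition numbers of the matrix S(x) for a Gaussian kernel and for an exponential kernel (a low-order representative of the Laplacian family) on the same irregularly spaced points; the widths and sizes are arbitrary.

    import numpy as np

    rng = np.random.default_rng(0)
    x = np.sort(rng.uniform(0.0, 1.0, size=200))      # irregular one-dimensional points
    D = np.abs(x[:, None] - x[None, :])

    S_gauss = np.exp(-D ** 2 / (2 * 0.1 ** 2))        # Gaussian kernel matrix
    S_exp = np.exp(-D / 0.1)                          # exponential (Laplacian-type) kernel

    # The Gaussian matrix is typically close to numerically singular, while the
    # exponential one remains much better conditioned on the same points.
    print(np.linalg.cond(S_gauss), np.linalg.cond(S_exp))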

For the numerical inversion of system (10.52), iterative methods, such as conjugate gradient, should be used (especially for large N). Methods using preconditioned conjugate gradient have been introduced, for example, in [105, 141] and the interested reader may refer to these references for more details.
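A minimal sketch (not from the text) of such an iterative solution of (10.52), using scipy's conjugate gradient; only products by the kernel matrix are needed, so the dense matrix built here could be replaced by one of the fast summation schemes discussed above. Names and the kernel are illustrative.

    import numpy as np
    from scipy.sparse.linalg import cg

    def solve_kernel_system(x, u, sigma=0.2):
        # Solve sum_l K(x_k, x_l) alpha_l = u_k, k = 1..N, by conjugate gradient.
        # x: (N, d) distinct points; u: (N,) right-hand side.
        d2 = np.sum((x[:, None, :] - x[None, :, :]) ** 2, axis=-1)
        S = np.exp(-d2 / (2 * sigma ** 2))             # kernel matrix S(x)
        # cg only uses matrix-vector products S @ a; a LinearOperator wrapping an
        # FFT-based or multipole product could be passed instead for large N.
        alpha, info = cg(S, u)
        return alpha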