Abstract
This chapter discusses variational registration methods, minimizing registration costs with a guarantee that the optimal solution is a diffeomorphism. The main focus will be on methods optimizing over flows associated with differential equations, using concepts introduced in the previous chapters.
10.1 Linearized Deformations
A standard way to ensure the existence of a smooth solution of a matching problem is to add a penalty term in the matching functional. This term would complete (9.1) to form
A large variety of such methods has been designed in approximation theory, statistics and signal processing for solving ill-posed problems. The simplest (and typical) form of penalty function is
for some Hilbert (or Banach) space of functions. Some more complex functions of \(\varphi -{\mathrm {id}}\) may also be designed, related to energies of non-linear elasticity (see, among others [13, 27, 28, 89, 123, 144, 237]). Such methods may be called “small deformation” methods because they operate on the deviation \(u = \varphi - {\mathrm {id}}\), and controlling the size or smoothness of u alone is usually not enough to guarantee that \(\varphi \) is a diffeomorphism (unless u is small, as we have seen in Sect. 7.1). There is, in general, no way of proving the existence of a solution of the minimization problem within some group of diffeomorphisms G, unless some restrictive assumptions are made on the objects to be matched.
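The last point is easy to check numerically: a smooth, uniformly bounded displacement u can still produce a non-injective \(\varphi = {\mathrm {id}}+ u\) as soon as du exceeds one in norm somewhere. A minimal one-dimensional sketch (the displacement u below is a hypothetical choice made for the illustration):

```python
import numpy as np

# phi = id + u with a smooth displacement bounded by |u| < 1.
# Where u' < -1, the derivative phi' = 1 + u' becomes negative,
# so phi is not injective, hence not a diffeomorphism.
def u(x):
    return -1.5 * x * np.exp(-x**2)

x = np.linspace(-3.0, 3.0, 2001)
phi = x + u(x)
dphi = np.gradient(phi, x)          # numerical derivative of phi

print("sup |u| =", np.max(np.abs(u(x))))   # < 1: u is "small" in sup-norm
print("min phi' =", dphi.min())            # negative: phi is not monotone
print("phi injective:", np.all(dphi > 0))
```

Here \(u'(0) = -1.5\), so \(\varphi '(0) = -0.5 < 0\) even though \(\sup |u| < 1\): bounding u alone does not control invertibility.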
Our focus here is on diffeomorphic matching. Because of this, we shall not detail many of these methods. However, it is interesting to note that these functionals also have a Eulerian gradient within an RKHS of vector fields with a smooth enough kernel, and can therefore be minimized using (9.7). We illustrate this with the following example, in which we skip the proper justification of the existence of derivatives.
Consider the function \(\rho (\varphi ) = \int _{{\mathbb {R}}^d} \left| d\varphi (x)-{\mathrm {Id}} \right| ^2 dx\), where the matrix norm is
(Hilbert–Schmidt norm). Letting \(u = \varphi - {\mathrm {id}}\), we have
where \(\varDelta u\) is the vector formed by the Laplacian of the coordinates of u (recall that we assume that \(u=0\) at infinity). This implies that (given that \(\varDelta u = \varDelta \varphi \))
and
This provides a regularized version of the greedy image-matching algorithm (a similar algorithm may easily be written for point matching).
Algorithm 2
The following procedure is an Eulerian gradient descent, on V, for the energy
Start with an initial \(\varphi _0 = {\mathrm {id}}\) and solve the differential equation
with \(J(t,\cdot ) = I\circ \varphi (t)^{-1}(\cdot )\).
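The structure of such an Eulerian gradient descent can be sketched in one dimension. The toy implementation below is only schematic: a Gaussian convolution stands in for the kernel of V, the data term is the usual sum of squared differences, the Laplacian regularization term specific to Algorithm 2 is omitted, and the step size and kernel width are illustrative choices, not values from the text:

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d

# Schematic 1-D greedy image matching: evolve J(t,.) = I o phi(t)^{-1}
# along a kernel-smoothed descent direction for the SSD mismatch.
x = np.linspace(0.0, 1.0, 256)
dx = x[1] - x[0]
I   = np.exp(-0.5 * (x - 0.40)**2 / 0.05**2)   # template (toy "image")
I_t = np.exp(-0.5 * (x - 0.55)**2 / 0.05**2)   # target

J = I.copy()
dt = 5e-4                                      # illustrative step size
err0 = np.sum((J - I_t)**2) * dx
for _ in range(200):
    gradJ = np.gradient(J, x)
    # smoothed Eulerian descent direction for (1/2)||J - I_t||^2;
    # gaussian_filter1d plays the role of the kernel of V:
    v = gaussian_filter1d((J - I_t) * gradJ, sigma=12.0)
    J = J - dt * v * gradJ                     # transport J along the flow of v
err1 = np.sum((J - I_t)**2) * dx
print(err0, "->", err1)                        # the mismatch decreases
```

Since the smoothing operator is (essentially) positive semi-definite, each small step decreases the mismatch, which is the defining property of these greedy schemes.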
This algorithm, like the previous greedy procedures, has the fundamental feature of providing a smooth flow of diffeomorphisms that minimizes the matching functional. It suffers, however, from the same limitations as its predecessors concerning its limit behavior, essentially because the variational problem itself is not well-posed: minimizers may not exist, and when they do exist they are not necessarily diffeomorphisms. In order to ensure the existence of, at least, homeomorphic solutions, the energy must include terms that not only prevent \(d\varphi \) from being too large, but also prevent it from being too small (or its inverse from being too large). In [90], the following regularization is proved to ensure the existence of homeomorphic solutions:
under some assumptions on p, q, r and s, namely \(p, q > 3\), \(r>1\) and \(s > 2q/(q-3)\).
10.2 The Monge–Kantorovitch Problem
We briefly discuss in this section the mass transfer problem, which is, under some assumptions, a diffeomorphic method for matching probability densities, i.e., positive functions on \({\mathbb {R}}^d\) with integral equal to 1. Consider such a density, \(\zeta \), and a diffeomorphism \(\varphi \) on \({\mathbb {R}}^d\). If an object has density \(\zeta \), the mass included in an infinitesimal volume dx around x is \(\zeta (x)dx\). Now, if each point x in the object is transported to the location \(y = \varphi (x)\), the mass of a volume dy around y is the same as the mass of the volume \(\varphi ^{-1}(dy)\) around \(x = \varphi ^{-1}(y)\), which is \(\zeta \circ \varphi ^{-1}(y) |\det (d(\varphi ^{-1}))(y)| dy\) (this provides a physical interpretation of Proposition 9.5).
Given two densities \(\zeta \) and \(\tilde{\zeta }\), the optimal mass transfer problem consists in finding a diffeomorphism \(\varphi \) with minimal cost such that \(\tilde{\zeta } = \zeta \circ \varphi ^{-1} |\det (d(\varphi ^{-1}))|\). The cost associated to \(\varphi \) in this context is related to the distance along which the transfer is made, measured by a function \(\rho (x, \varphi (x))\). The total cost comes after summing over the transferred mass, yielding
The mass transfer problem now is to minimize E over all \(\varphi \)’s such that \(\tilde{\zeta } = \zeta \circ \varphi ^{-1} |\det (d(\varphi ^{-1}))|\). The problem is slightly different from the matching formulations that we discuss in the other sections of this chapter, because the minimization is associated to exact matching.
It is very interesting that this apparently very complex and highly nonlinear problem can be reduced to linear programming, albeit infinite-dimensional. Let us first consider a more general formulation. Instead of looking for a one-to-one correspondence \(x\mapsto \varphi (x)\), one can decide that the mass in a small neighborhood of x is distributed over \(\varOmega \) with weights \(y \mapsto q(x, y)\), where \(q(x, y) \ge 0\) and \(\int _\varOmega q(x, y) dy = 1\). We still have the constraint that the mass density arriving at y is \(\tilde{\zeta }(y)\), which gives
The cost now has the simple expression (linear in q)
The original formulation can be retrieved by letting \(q(x, y)dy \rightarrow \delta _{\varphi (x)}(y)\) (i.e., by passing to the limit \(\sigma \rightarrow 0\) with \(q(x, y) = \exp (-|y-\varphi (x)|^2/2\sigma ^2)/(2\pi \sigma ^2)^{d/2}\)).
If we write \(g(x,y) = \zeta (x) q(x, y)\), this relaxed problem is clearly equivalent to minimizing
subject to the constraints \(g(x, y) \ge 0\), \(\int g(x, y) dy = \zeta (x)\) and \(\int g(x, y) dx = \tilde{\zeta }(y)\). In fact, the natural formulation of this problem uses measures instead of densities: given two probability measures \(\mu \) and \({\tilde{\mu }}\) on \(\varOmega \), minimize
subject to the constraints that the marginals of \(\nu \) are \(\mu \) and \({\tilde{\mu }}\). This provides the Wasserstein distance between \(\mu \) and \({\tilde{\mu }}\), associated to the transportation cost \(\rho \). Note that this formulation generalizes the computation of the Wasserstein distance (9.24) between discrete measures.
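When \(\mu \) and \({\tilde{\mu }}\) are finitely supported, this is an ordinary (finite) linear program over the coupling matrix. A small sketch using `scipy.optimize.linprog`; the point locations and weights are made-up values for the example:

```python
import numpy as np
from scipy.optimize import linprog

# Discrete Monge-Kantorovich: minimize sum_ij rho(x_i, y_j) g_ij
# over couplings g >= 0 whose marginals are mu and mu_tilde.
x = np.array([0.0, 1.0, 2.0])
y = np.array([0.5, 1.5])
mu   = np.array([0.3, 0.4, 0.3])     # source weights (sum to 1)
mu_t = np.array([0.6, 0.4])          # target weights (sum to 1)

cost = (x[:, None] - y[None, :])**2 / 2.0    # rho(x, y) = |x - y|^2 / 2
m, n = cost.shape

# Equality constraints on the flattened (row-major) coupling:
A_eq = np.zeros((m + n, m * n))
for i in range(m):
    A_eq[i, i*n:(i+1)*n] = 1.0       # row i sums to mu[i]
for j in range(n):
    A_eq[m + j, j::n] = 1.0          # column j sums to mu_t[j]
b_eq = np.concatenate([mu, mu_t])

res = linprog(cost.ravel(), A_eq=A_eq, b_eq=b_eq, bounds=(0, None))
g = res.x.reshape(m, n)
print("optimal coupling:\n", g)
print("transport cost:", res.fun)    # 0.125 here: the monotone coupling
```

In this one-dimensional example the optimizer recovers the monotone coupling (mass moves without crossings), consistent with the quadratic-cost theory discussed below.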
This problem is much nicer than the original one, since it is a linear programming problem. The theory of convex optimization (which we only apply formally in this infinite-dimensional context; see [44] for rigorous proofs) implies that it has an equivalent dual formulation, which is: maximize
subject to the constraint that, for all \(x, y\in \varOmega \), \(h(x) + \tilde{h}(y) \le \rho (x, y)\).
The duality equivalence means that the maximum of F coincides with the minimum of E. The solutions are, moreover, related by duality conditions (the KKT conditions) that imply that \(\nu \) must be supported by the set
For the dual problem, one is obviously interested in making h and \(\tilde{h}\) as large as possible. Given h, one should therefore choose \(\tilde{h}\) as
so that the set in (10.6) is exactly the set of \((x^*, y)\) where \(x^*\) is a point that achieves the minimum of \(x \mapsto \rho (x, y) - h(x)\).
The situation is particularly interesting when \(\rho (x, y) = |x-y|^2/2\). In this situation,
From this equation, it is natural to introduce the auxiliary functions \(s(x) = |x|^2/2 - h(x)\) and \({\tilde{s}}(y) = |y|^2/2 - \tilde{h}(y)\). Using these functions, the set A in (10.6) becomes
with \({\tilde{s}}(y) = \sup _{x}(x^Ty - s(x))\). Because the latter is a supremum of linear functions, we obtain the fact that \({\tilde{s}}\) is convex, and so is s by symmetry; \({\tilde{s}}\) is in fact what is called the convex conjugate of s, denoted \({\tilde{s}}= s^*\). Convex functions are almost everywhere differentiable, and, in order that \((x, y)\in A\), x must maximize \(u \mapsto u^Ty - s(u)\), which implies that \(y = \nabla s(x)\). So, the conclusion is that, whenever s is the solution of the dual problem, the solution of the primal problem is provided by \(y = \nabla s(x)\). This shows that the relaxed mass transport problem has the same solution as the initial one, with \(\varphi = \nabla s\), s being a convex function. That \(\varphi \) is invertible is obvious by symmetry: \(\varphi ^{-1} = \nabla {\tilde{s}}\).
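In one dimension, \(\nabla s\) nondecreasing (s convex) means that the optimal map for the quadratic cost is the monotone rearrangement: it pairs quantiles, i.e., sorted samples with sorted samples. A quick empirical check on equal-weight samples (the two sample distributions are arbitrary choices for the illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=200)                        # samples of mu
y = rng.normal(loc=2.0, scale=0.5, size=200)    # samples of mu-tilde

# For rho(x, y) = |x - y|^2/2 in one dimension, the optimal map is the
# monotone rearrangement -- a nondecreasing map, i.e. the derivative of
# a convex function of one variable -- which pairs the sorted samples.
xs, ys = np.sort(x), np.sort(y)
cost_opt = 0.5 * np.mean((xs - ys)**2)

# Any other pairing (a non-monotone assignment) costs at least as much:
perm = rng.permutation(len(ys))
cost_perm = 0.5 * np.mean((xs - ys[perm])**2)
print(cost_opt <= cost_perm)    # True, by the rearrangement inequality
```

The inequality `cost_opt <= cost_perm` holds for every permutation, which is the discrete shadow of the statement that the optimal \(\varphi \) is the gradient of a convex function.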
This result is fundamental, since it is the basis for the construction of a numerical procedure for the solution of the mass transport problem in this case. Introduce a time-dependent vector field \(v(t,\cdot )\) and the corresponding flow of diffeomorphisms \(\varphi _{0t}^v\). Let \(h(t, \cdot ) = \det (d\varphi _{t0}^v)\, \zeta \circ \varphi _{t0}^v \). Then
The time derivative of this equation yields
We have the following theorem [34].
Theorem 10.1
Consider the following energy:
and the variational problem: minimize G subject to the constraints \(h(0) = \zeta \), \(h(1) = \tilde{\zeta }\) and (10.7). If v is the solution of the above problem, then \(\varphi _{01}^v\) solves the optimal mass transport problem.
Proof
Indeed, in G, we can make the change of variables \(x = \varphi _{0t}^v(y)\), which yields
So the minimum of G is always larger than the minimum of E. If \(\varphi \) solves the mass transport problem, then one can take v(t, x) such that \(\varphi _{0t}^v(x) = (1-t) x + t\varphi (x)\), which is a diffeomorphism [190] and achieves the minimum of G. \(\square \)
We refer to [34] for a numerical algorithm that computes the optimal \(\varphi \). Note that \(\rho (x, y) = |x-y|^2\) is not the only transportation cost that can be used in this context, but that others (like \(|x-y|\), which is not strictly convex in the distance) may fail to provide diffeomorphic solutions. Important developments on this subject can be found in [49, 119, 296].
We now discuss methods that are both diffeomorphic and metric (i.e., they relate to a distance). They also rely on the representation of diffeomorphisms using flows of ordinary differential equations.
10.3 Optimizing Over Flows
We return in this section to the representation of diffeomorphisms with flows of ordinary differential equations (ODEs) and describe how this representation can be used for diffeomorphic registration. Instead of using a norm to evaluate the difference between \(\varphi \) and the identity mapping, we now consider, as a regularizing term, the distance \(d_V\) that was defined in Sect. 7.2.6. More precisely, we set
and henceforth restrict the matching to diffeomorphisms belonging to \(\mathrm {Diff}_V\).
In this context, we have the following important theorem:
Theorem 10.2
Let V be a Hilbert space embedded in \(C^{p+1}_0(\varOmega , {\mathbb {R}}^d)\) so that \(\mathrm {Diff}_V\subset \mathrm {Diff}^{p,\infty }_0\). Assume that the functional \(U: \mathrm {Diff}_0^{p,\infty } \rightarrow {\mathbb {R}}\) is bounded from below and continuous for the \((p, \infty )\)-compact topology. Then, there exists a minimizer of
over \(\mathrm {Diff}_V\).
(The \((p, \infty )\)-compact topology is defined just after Theorem 7.13.)
Proof
E has an infimum \(E_{min}\) over \(\mathrm {Diff}_V\), since it is bounded from below. We need to show that this infimum is also a minimum, i.e., that it is achieved at some \(\varphi \in \mathrm {Diff}_V\).
We first use the following lemma (recall that we have denoted by \({\mathcal X}^1_V\) (resp. \({\mathcal X}^2_V\)) the set of time-dependent vector fields on \(\varOmega \) with integrable (resp. square integrable) V-norm over [0, 1]):
Lemma 10.3
Minimizing \(E(\varphi ) = d({\mathrm {id}}, \varphi )^2/2 + U(\varphi )\) over \(\mathrm {Diff}_V\) is equivalent to minimizing the function
over \({\mathcal X}^2_V\).
Let us prove this lemma. For \(v\in {\mathcal X}^2_V\), we have, by definition of the distance
which implies \(E(\varphi _{01}^v) \le \tilde{E}(v)\). This obviously implies that \(\inf _{\mathrm {Diff}_V} E(\varphi ) \le \tilde{E}(v)\), and since this is true for all \(v\in {\mathcal X}^2_V\), we have \(\inf E \le \inf \tilde{E}\). Now, assume that \(\varphi \) is such that \(E(\varphi ) \le \inf E + \varepsilon /2\). Then, by definition of the distance, there exists a v such that \(\varphi = \varphi ^v_{01}\) and
which implies that
so that \(\inf \tilde{E} \le \inf E\).
We therefore have \(\inf E = \inf \tilde{E}\). Moreover, if there exists a v such that \(\tilde{E}(v) = \min \tilde{E}= \inf E\), then, since we know that \(E(\varphi _{01}^v) \le \tilde{E}(v)\), we must have \(E(\varphi _{01}^v) = \min E\). Conversely, if \(E(\varphi ) = \min E\), by Theorem 7.22, \(E(\varphi ) = E(\varphi _{01}^v)\) for some v and this v must achieve the infimum of \(\tilde{E}\), which proves the lemma.
This lemma shows that it suffices to study the minimizers of \(\tilde{E}\). Now, as done in the proof of Theorem 7.22, one can find, by taking a subsequence of a minimizing sequence, a sequence \(v^n\) in \({\mathcal X}^2_V\) which converges weakly to some \(v\in {\mathcal X}^2_V\) and \(\tilde{E}(v^n)\) tends to \(E_{min}\). Because
and because weak convergence in \({\mathcal X}^2_V\) implies convergence of the flow in the \((p, \infty )\)-compact topology (Theorem 7.13), we also have \(U(\varphi ^{v^n}_{01}) \rightarrow U(\varphi ^{v}_{01})\), so that \(\tilde{E}(v) = E_{min}\) and v is a minimizer. \(\square \)
The general problem of minimizing functionals such as (10.9) has been called “large deformation diffeomorphic metric mapping”, or LDDMM. The first algorithms were introduced for this purpose in the case of landmark matching [159] and image matching [32] (these papers were preceded by theoretical developments in [93, 278, 283]). The following sections describe these algorithms, together with others that have been proposed more recently.
10.4 Euler–Lagrange Equations and Gradient
10.4.1 Gradient: Direct Computation
We now detail the computation of the gradient for energies such as (10.8). As remarked in the proof of Theorem 10.2, the variational problem which has to be solved is conveniently expressed as a problem over \({\mathcal X}^2_V\). The function which is minimized over this space takes the form
Assume that V is embedded in \(C^{p+1}_0(\varOmega , {\mathbb {R}}^d)\) and that U is differentiable on \(\mathrm {Diff}^{p, \infty }_0\). Then Theorem 7.12 and the chain rule imply that E is differentiable on \({\mathcal X}^2_V\) with
where \(\partial _v \varphi ^v_{01} \, h\) is given in Theorem 7.12.
We now identify the gradient of E for the Hilbert structure of \({\mathcal X}^2_V\). This gradient is a function, denoted \(\nabla ^V E: v\mapsto \nabla ^V E(v) \in {\mathcal X}^2_V\), that satisfies
for all v, h in \({\mathcal X}^2_V\).
Since the space V is fixed in this section, we will drop the superscript from the notation and simply write \(\nabla E(v)\). Note that this is different from the Eulerian gradient we have dealt with before: \(\nabla E\) now represents the usual gradient of a function defined over a Hilbert space. One important thing to keep in mind is that the gradient defined here is an element of \({\mathcal X}^2_V\), hence a time-dependent vector field, whereas the Eulerian gradient was an element of V (a vector field on \(\varOmega \)). Theorem 10.5 relates the two (and allows us to reuse the computations made in Chap. 9). For this, we need to introduce the following action of diffeomorphisms on vector fields.
Definition 10.4
Let \(\varphi \) be a diffeomorphism of \(\varOmega \) and v a vector field on \(\varOmega \). We denote by \({\mathrm {Ad}}_\varphi v\) the vector field on \(\varOmega \) defined by
\({\mathrm {Ad}}_\varphi \) is called the adjoint representation of \(\varphi \).
If \(\varphi \in \mathrm {Diff}^{p+1, \infty }(\varOmega )\), then an application of Lemma 7.3 and the Leibniz formula implies that \({\mathrm {Ad}}_\varphi v \in C^{p}_0(\varOmega , {\mathbb {R}}^d)\) as soon as \(v\in C^{p}_0(\varOmega , {\mathbb {R}}^d)\), and more precisely that \({\mathrm {Ad}}_\varphi \) is a bounded linear operator from \(C^{p}_0(\varOmega , {\mathbb {R}}^d)\) to itself. We can therefore define its conjugate on \(C_0^{p}(\varOmega , {\mathbb {R}}^d)^*\), with \({\mathrm {Ad}}^*_\varphi \rho \) given by
for \(\rho \in C_0^{p}(\varOmega , {\mathbb {R}}^d)^*\), \(v\in C_0^{p}(\varOmega , {\mathbb {R}}^d)\). Note that \(\mathrm {Ad}_\varphi ^*\rho \) is, a fortiori, in \(V^*\), because V is continuously embedded in \(C^{p+1}_0(\varOmega , {\mathbb {R}}^d)\).
Let \(\mathbb L:V\rightarrow V^*\) denote the duality operator on V and \(V^{(r)}\) denote the set of vector fields \(v\in V\) such that \(\mathbb Lv\in C^r_0(\varOmega , {\mathbb {R}}^d)^*\) (for \(r\le p+1\)). Then, for \(v\in V^{(p)}\), we can define, with \({\mathbb {K}}= \mathbb L^{-1}\),
This is well-defined, because, by construction, \(\mathrm {Ad}_\varphi ^*\mathbb Lv\in C^{p}_0(\varOmega , {\mathbb {R}}^d)^* \subset V^*\). We have in particular, for \(v\in V^{(p)}\) and \(w\in V\),
Recall that the Eulerian derivative of U is defined by
Using Theorem 7.12, we have
so that
With this notation, we have the following theorem.
Theorem 10.5
Assume that V is continuously embedded in \(C^{p+1}_0(\varOmega , {\mathbb {R}}^d)\) and that U is continuously differentiable on \(\mathrm {Diff}^{p, \infty }_0\). Then, the \({\mathcal X}^2_V\) gradient of \(\tilde{U}: v \mapsto U(\varphi _{01}^v)\) is given by the formula
This important result has the following simple consequences.
Proposition 10.6
Let U satisfy the assumptions of Theorem 10.5. If \(v \in {\mathcal X}^2_V\) is a minimizer of
then, for all t
with \(v(1) = - {\overline{\nabla }}^V U(\varphi _{01}^v)\). In particular, v is a continuous function of t and \(v(t) \in V^{(p)}\) for all t.
Corollary 10.7
Under the same conditions on U, if \(v \in {\mathcal X}^2_V\) is a minimizer of
then, for all t,
with \(v_0\in V^{(p)}\).
Proposition 10.6 is a direct consequence of Theorem 10.5. For the corollary, we need to use the fact that \({\mathrm {Ad}}_\varphi {\mathrm {Ad}}_\psi = {\mathrm {Ad}}_{\varphi \circ \psi }\), which can be checked by direct computation, and write
Equations \(v(t) = {\mathrm {Ad}}_{\varphi _{t0}^v}^T v_0\) and \(v(1) = - {\overline{\nabla }}^V U(\varphi _{01}^v)\) together are equivalent to the Euler–Lagrange equations for \(\tilde{E}\) and will lead to interesting numerical procedures. Equation (10.16) is a cornerstone of the theory. It describes a general mechanical property called the conservation of momentum, to which we will return later.
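The composition rule \({\mathrm {Ad}}_\varphi {\mathrm {Ad}}_\psi = {\mathrm {Ad}}_{\varphi \circ \psi }\) invoked above is easy to verify numerically. The sketch below does so with the standard expression \({\mathrm {Ad}}_\varphi v = (d\varphi \, v)\circ \varphi ^{-1}\) of Definition 10.4, using affine maps of \({\mathbb {R}}\) (chosen because their inverses are explicit, even though they are not identity at infinity and so sit outside \(\mathrm {Diff}_V\)):

```python
import numpy as np

# For phi(x) = a x + b with a > 0, the adjoint action
#   (Ad_phi v)(x) = d_phi * v(phi^{-1}(x)) = a * v((x - b) / a).
def Ad(a, b, v):
    return lambda x: a * v((x - b) / a)

def compose(a1, b1, a2, b2):
    # (phi o psi)(x) = a1 * (a2 x + b2) + b1
    return a1 * a2, a1 * b2 + b1

v = lambda x: np.sin(x) + 0.3 * x**2        # an arbitrary test vector field
a1, b1, a2, b2 = 2.0, 1.0, 0.5, -3.0        # phi and psi (powers of 2 in a)

lhs = Ad(a1, b1, Ad(a2, b2, v))             # Ad_phi (Ad_psi v)
ac, bc = compose(a1, b1, a2, b2)
rhs = Ad(ac, bc, v)                         # Ad_{phi o psi} v

xs = np.linspace(-5.0, 5.0, 11)
print(np.max(np.abs(lhs(xs) - rhs(xs))))    # ~ 0: the two fields agree
```

The same direct computation, written symbolically, is exactly the proof suggested in the text.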
10.4.2 Derivative Using Optimal Control
We can also apply the Pontryagin maximum principle (see Appendix D) to obtain an alternative expression of the optimality conditions and gradient. Indeed, we can repeat the construction made in Sect. 7.2.2 with a slightly different notation, letting \(f(\omega , v) = v\circ ({\mathrm {id}}+ \omega )\), defined over \(C^p_0(\varOmega , {\mathbb {R}}^d)\times V\). With \(g(\omega , v) = \Vert v\Vert _V^2\), we are in the framework described in Sect. D.3.1, leading to Theorem D.7, where \(\omega \) represents the state and v is the control. Introducing a co-state \(\mu \), define the Hamiltonian
Letting \(\xi _\varphi : v \mapsto v\circ \varphi \) from V to \(C^p_0(\varOmega , {\mathbb {R}}^d)\), we obtain the fact that an optimal solution must satisfy (with \(\varphi = {\mathrm {id}}+\omega \)), for some \(\mu :[0,1]\rightarrow C^p_0(\varOmega , {\mathbb {R}}^d)^*\)
with \(\varphi (0) = {\mathrm {id}}\) and \(\mu (1) = - dU(\varphi (1))\). One can check that the second equation is equivalent to
which is Corollary 10.7 expressed in terms of the co-state \(\mu \). Applying Eq. (D.12), we obtain
where \(\varphi \) and \(\mu \) satisfy the first two equations of (10.17).
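The forward–backward structure of (10.17) (integrate the state forward, the co-state backward, then read off the gradient) can be illustrated on a toy scalar control problem discretized by an explicit Euler scheme. The dynamics, cost and sign conventions below are simplified stand-ins, not the diffeomorphism problem itself; the point is that the discrete adjoint recursion produces the exact gradient of the discretized energy:

```python
import numpy as np

# Toy analogue: state x with dx/dt = v(t) g(x),
# energy E = (h/2) sum_k v_k^2 + U(x_N), explicit Euler steps.
g  = np.cos
dg = lambda x: -np.sin(x)
U  = lambda x: (x - 1.0)**2
dU = lambda x: 2.0 * (x - 1.0)

N = 50
h = 1.0 / N

def energy(v, x0=0.2):
    x = x0
    for k in range(N):
        x = x + h * v[k] * g(x)
    return 0.5 * h * np.sum(v**2) + U(x)

def gradient(v, x0=0.2):
    xs = [x0]                                  # forward pass: state
    for k in range(N):
        xs.append(xs[-1] + h * v[k] * g(xs[-1]))
    mu = dU(xs[-1])                            # co-state at final time
    grad = np.zeros(N)
    for k in reversed(range(N)):               # backward pass: co-state
        grad[k] = h * v[k] + mu * h * g(xs[k])
        mu = mu * (1.0 + h * v[k] * dg(xs[k]))
    return grad

v = np.linspace(-1.0, 1.0, N)
grad = gradient(v)
eps = 1e-6                                     # finite-difference check
fd = np.array([(energy(v + eps * np.eye(N)[k])
                - energy(v - eps * np.eye(N)[k])) / (2 * eps)
               for k in range(N)])
print(np.max(np.abs(grad - fd)))               # tiny: adjoint gradient is exact
```

This is the discrete counterpart of computing \(\nabla \tilde E\) from \(\varphi \) and \(\mu \) via (D.12): one forward integration and one backward integration per gradient evaluation.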
10.4.3 An Alternative Form Using the RKHS Structure
The conjugate of the adjoint can be put into a form explicitly involving the reproducing kernel of V. Before detailing this, we introduce a notation that will be used throughout this chapter. If \(\rho \) is a linear form on function spaces, we have been denoting by \((\rho \,|\, v)\) the result of \(\rho \) applied to v. In the formulas that follow, we will need to emphasize the variable on which v depends, and we will use the alternative notation \((\rho \,|\, v(x))_x\) to denote the same quantity. Thus,
In particular, when v depends on two variables, the notation \((\rho \,|\, v(x, y))_x\) will represent \(\rho \) applied to the function \(x\mapsto v(x, y)\) with y considered as constant.
We still assume that V is continuously embedded in \(C^{p+1}_0(\varOmega , {\mathbb {R}}^d)\). Then, the following theorem holds.
Theorem 10.8
Assume that \(\varphi \in C^{q+1}_0(\varOmega , {\mathbb {R}}^d)\) and \(\rho \in C^r_0(\varOmega , {\mathbb {R}}^d)^*\), with \(r = \min (p+1, q)\). Let \(v = {\mathbb {K}}\rho \) and \((e_1, \ldots , e_d)\) be an orthonormal basis of \({\mathbb {R}}^d\). Then, for \(y\in \varOmega \), we have
where K is the reproducing kernel of V.
Proof
For \(b\in {\mathbb {R}}^d\), we have
Theorem 10.8 is now a consequence of the decomposition
\(\square \)
Recall that \(K(\cdot ,\cdot )\) is a matrix, so that \(K(\cdot , y)e_i\) is the ith column of \(K(\cdot , y)\), which we can denote by \(K^i\). Equation (10.19) states that the ith coordinate of \({\mathrm {Ad}}_\varphi ^T v\) is \((\rho \,|\, {\mathrm {Ad}}_\varphi K^i(x, y))_x\).
Using Proposition 10.6 and Theorem 10.8, we obtain another expression of the V-gradient of E:
Corollary 10.9
Under the hypotheses of Proposition 10.6, the V-gradient of
is equal to
with \(\rho (1) = \bar{\partial }U(\varphi _{01}^v)\).
10.5 Conservation of Momentum
10.5.1 Interpretation
Equation (10.16) can be interpreted as a momentum conservation equation. The justification of the term momentum comes from the analogy of \(E_{\mathrm {kin}} := (1/2) \Vert v(t)\Vert ^2_V\) with the total kinetic energy at time t of a dynamical system. In fluid mechanics, this energy is usually defined as (introducing a mass density, z)
the momentum here being \( \rho (t) = z(t, y)v(t, y)\, dy\) with \(E_{\mathrm {kin}} = (1/2) (\rho \,|\, v)\). In our case, taking \(\rho (t) = \mathbb Lv(t)\), we still have \(E_{\mathrm {kin}} = (1/2) (\rho \,|\, v)\), so that \(\rho (t)\) is also interpreted as a momentum.
To interpret (10.16) as a conservation equation, we need to understand how a change of coordinate system affects the momentum. Indeed, interpret v(t, y) as the velocity of a particle located at coordinates y, so \(v = dy/dt\). Now assume that we want to use a new coordinate system, and replace y by \(x = \varphi (y)\). In the new coordinates, the same particle moves with velocity
so that the transformation from the old to the new expression of the velocity is precisely given by the adjoint representation: \(v(y) \rightarrow \tilde{v}(x) = \mathrm {Ad}_\varphi v(x)\) if \(x = \varphi (y)\). To obtain the correct transformation of the momentum, it suffices to notice that the energy of the system must remain the same if we just change coordinates, so that, if \(\rho \) and \(\tilde{\rho }\) are the momenta before and after the change of coordinates, we must have
which yields \(\mathrm {Ad}_\varphi ^* \tilde{\rho }= \rho \) or \(\tilde{\rho } = \mathrm {Ad}_{\varphi ^{-1}}^* \rho \).
Now, we return to Eq. (10.16). Here, v(t, y) is the velocity at time t of the particle that was at \(x = \varphi ^v_{t0}(y)\) at time 0. So it is the expression of the velocity in a coordinate system that evolves with the flow, and \(\mathbb Lv(t)\) is the momentum in the same system. By the previous argument, the expression of the momentum in the fixed coordinate system, taken at time \(t=0\), is \({\mathrm {Ad}}^*_{\varphi ^v_{0t}} \mathbb Lv(t)\). Equation (10.16) simply states that this expression remains constant over time, i.e., the momentum is conserved when measured in a fixed coordinate system.
The conservation of momentum equation, described in Corollary 10.7, is a fundamental equation in Geometric Mechanics [149, 187], which appears in a wide variety of contexts. It has been described in abstract form by Arnold [18, 19] in his analysis of invariant Riemannian metrics on Lie groups. This equation also derives from an application of the Euler–Poincaré principle, as described in [149, 150, 188]. Combined with a volume-preservation constraint, this equation is equivalent to the Euler equation for incompressible fluids, in the case when \(\Vert v(t)\Vert _V = \Vert v(t)\Vert _2\), the \(L^2\) norm. Another type of norm on V (called the \(H^1_\alpha \) norm) relates to models of waves in shallow water, and provides the Camassa–Holm equation [50, 116, 149]. A discussion of (10.16) in the particular case of template matching is provided in [205], and a parallel with the solitons emerging from the Camassa–Holm equation is discussed in [151].
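When the momentum is supported on finitely many points (“landmarks”), \(\rho (t) = \sum _i a_i(t)\,\delta _{x_i(t)}\), the conservation of momentum reduces to the standard finite-dimensional Hamiltonian system \(\dot{x}_i = \sum _j K(x_i, x_j) a_j\), \(\dot{a}_i = -\nabla _{x_i} H\), along which \(\Vert v(t)\Vert _V^2 = \sum _{ij} a_i a_j K(x_i, x_j)\) is constant. A one-dimensional sketch with a Gaussian kernel (the initial positions and momenta are arbitrary choices):

```python
import numpy as np

sigma = 1.0
K  = lambda u: np.exp(-0.5 * u**2 / sigma**2)   # scalar Gaussian kernel
dK = lambda u: -u / sigma**2 * K(u)             # its derivative

def rhs(state, n):
    x, a = state[:n], state[n:]
    D = x[:, None] - x[None, :]
    xdot = K(D) @ a                 # v(x_i) = sum_j K(x_i - x_j) a_j
    adot = -a * (dK(D) @ a)         # a_i' = -a_i sum_j K'(x_i - x_j) a_j
    return np.concatenate([xdot, adot])

def hamiltonian(state, n):
    x, a = state[:n], state[n:]
    return 0.5 * a @ K(x[:, None] - x[None, :]) @ a

n = 3
state = np.concatenate([np.array([-1.0, 0.0, 1.5]),   # positions x_i(0)
                        np.array([0.5, -0.2, 0.3])])  # momenta  a_i(0)
H0 = hamiltonian(state, n)
dt = 0.01
for _ in range(100):                # RK4 integration from t = 0 to t = 1
    k1 = rhs(state, n)
    k2 = rhs(state + dt/2 * k1, n)
    k3 = rhs(state + dt/2 * k2, n)
    k4 = rhs(state + dt * k3, n)
    state = state + dt/6 * (k1 + 2*k2 + 2*k3 + k4)
H1 = hamiltonian(state, n)
print(H0, H1)   # equal up to integration error: ||v(t)||_V^2 is conserved
```

The near-exact conservation of H along the numerical flow is the discrete counterpart of the conservation of momentum described above.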
10.5.2 Properties of the Momentum Conservation Equation
Combining Eq. (10.19) and the fact that \(\partial _t \varphi _{0t}^v = v(t, \varphi _{0t}^v)\), we get, for the optimal v (letting \(v_0 = {\mathbb {K}}\rho _0\))
Letting \(\varphi = {\mathrm {id}}+\omega \), we consider the equation
We now consider this equation as an ODE over \(C^{p}_0(\varOmega , {\mathbb {R}}^d)\) and discuss conditions on \(\rho _0\) ensuring the existence and uniqueness of solutions. We will make the following assumptions.
- (I) V is continuously embedded in \(C^{p+1}_0(\varOmega , {\mathbb {R}}^d)\) and its kernel, K, is such that all derivatives \(\partial _1^{k}\partial _2^{k} K(y, y)\) are bounded over \( \varOmega \) for \(k \le p+1\).
- (II) \(\rho _0 \in C^r(\varOmega , {\mathbb {R}}^d)^*\) for some \(r \le p-1\).
- (III) \(\rho _0\) is compactly supported: there exists a compact subset \(Q'\subset {\mathbb {R}}^d\) such that \((\rho _0 \,|\, f) = 0\) for all \(f \in C^r_0(\varOmega , {\mathbb {R}}^d)\) such that \(f(x) = 0\) for all \(x\in Q'\).
Assumption (I) is true in particular when \(\varOmega ={\mathbb {R}}^d\) and K is translation-invariant.
Taking Q slightly larger than \(Q'\) in assumption (III), and choosing a \(C^\infty \) function \(\varepsilon \) such that \(\varepsilon =1\) on \(Q'\) and \(\varepsilon =0\) on \(Q^c\), we have \((\rho _0 \,|\, f) = (\rho _0 \,|\, \varepsilon f)\) for all \(f \in C^r_0(\varOmega , {\mathbb {R}}^d)\), from which we can deduce that, for some constant C
where
The following lemma provides the required properties for the well-posedness of (10.21).
Let \(\mathcal O = \mathrm {Diff}^p_0 -\mathrm {id}\), an open subset of \(C^p_0(\varOmega , \mathbb R^d)\).
Lemma 10.10
Let
Under assumptions (I), (II), (III) above, \(\mathscr {V}\) is a differentiable mapping from \(\mathcal O\) into \(C^p_0(\varOmega , \mathbb R^d)\) and, letting \(\varphi = {\mathrm {id}}+\omega \),
for some continuous function C.
Proof
Step 1. We first check that the right-hand side of (10.21) is well defined. Since we assume that V is embedded in \(C^{p+1}_0(\varOmega , {\mathbb {R}}^d)\), we know that, for all \(0\le r, s\le p+1\), \(\partial _1^r \partial _2^s K^i\) is in \(C_0(\varOmega , {\mathbb {R}}^d)\) with respect to each of its variables. In particular, \(x\mapsto ({\mathrm {Id}}+ d\omega (t, x))^{-1} K^i(x + \omega (t,x), y + \omega (t, y))\) is in \(C^{p-1}_0(\varOmega , {\mathbb {R}}^d)\) as soon as \(\omega \in C^{p}_0(\varOmega , {\mathbb {R}}^d)\), so that \(\rho _0\) can be evaluated on it.
Step 2. We now prove that the right-hand side of (10.21) is in \(C_0^{p}(\varOmega , {\mathbb {R}}^d)\), which ensures that (10.21) forms an ODE in this space. Let
so that (10.21) can be written as \(\partial _t \omega = v^{{\mathrm {id}}+\omega }\circ ({\mathrm {id}}+\omega )\). We want to show that \(v^\varphi \in C^{p}_0(\varOmega , {\mathbb {R}}^d)\) when \(\varphi ={\mathrm {id}}+\omega \) and \(\omega \in C^p_0(\varOmega , {\mathbb {R}}^d)\). It is obviously sufficient to prove that each coordinate
belongs to \(C^{p}_0(\varOmega , {\mathbb {R}})\). We first justify the fact that \(v_i^\varphi \) is p-times differentiable, with
for \(r\le p\). Using a Taylor expansion, we can write (letting \(h^{(k)}\) denote the k-tuple \((h,\ldots , h)\))
so that
and it suffices to prove that the remainder is an \(o(|h|^{p+1})\). This will be true provided
For \(k_1\le r\), we have, using Eq. (8.8),
for some constant C, where \(K^{ij}\) denotes the i, j entry of K. This proves the desired result, since \(\partial _1^{p+1}\partial _2^{p+1}K\) is continuous. A similar argument can be made to prove the continuity of \(y\mapsto d^pv(y)\).
To prove that \(v^\varphi \in C^p_0(\varOmega , {\mathbb {R}}^d)\), it suffices to show that, for all \(k\le p+1\), \(\Vert \partial _2^k K^i(\cdot , y)\Vert _{r, Q}\) goes to 0 when y goes to infinity. (This is where we use the fact that \(\rho _0\) has compact support.)
To reach a contradiction, assume that there exist sequences \((x_n), (y_n)\) with \(x_n\in Q\) and \(y_n\) tending to infinity or to \(\partial \varOmega \) such that \(|\partial _1^{k_1}\partial _2^{k_2} K^i(x_n, y_n)| >\varepsilon \), for some fixed \(\varepsilon >0\) and \(k_1\le r\), \(k_2\le p+1\). Replacing \((x_n)\) by a subsequence if needed, we can assume that \(x_n\) converges to some \(x\in Q\). Note that \(\partial _1^{k_1}\partial _2^{k_2} K^{ij}(x, y_n) = \partial _1^{k_2}\partial _2^{k_1} K^{ji}(y_n, x)\). Since \(\partial _2^{k_1} K^{j}(\cdot , x) \in V\), we can conclude that \(\partial _1^{k_2}\partial _2^{k_1} K^{j}(y_n, x) \rightarrow 0\) for all j, implying that \(\partial _1^{k_1}\partial _2^{k_2} K^i(x, y_n) \rightarrow 0\) for all i, too.
Similarly, \(\partial _1^{k_1}\partial _2^{k_2} K^{ij}(x_n, y_n) - \partial _1^{k_1}\partial _2^{k_2} K^{ij}(x, y_n)\) is the ith entry of \(\partial _1^{k_2}\partial _2^{k_1} K^j(y_n, x_n) - \partial _2^{k_1}\partial _1^{k_2} K^j(y_n, x)\) and
which goes to 0. This is our contradiction.
Step 3. We now study the differentiability of the mapping \(\mathscr {V}: \omega \mapsto v^{{\mathrm {id}}+\omega }\circ ({\mathrm {id}}+\omega )\) from \(C^p_0(\varOmega , {\mathbb {R}}^d)\) into itself. The candidate for \(d\mathscr {V}(\omega )\eta \) is
still with \(\varphi ={\mathrm {id}}+\omega \). We can decompose \(\mathscr {V}(\omega +\eta )(y) - \mathscr {V}(\omega )(y) - (\mathscr {W}(\omega )\eta )(y)\) as the sum of five terms
(described below), which we will study separately. For each term, we need to prove that, for \(k_1\le r\), \(k_2\le p\), one has
The important point in the following discussion is that none of the estimates will require more than p derivatives in \(\varphi \) and \(\eta \), and no more than \(p+1\) in K.
(i) We let
We first note that \({\mathrm {Inv}}: M\mapsto M^{-1}\) is infinitely differentiable on \(\mathrm {GL}_d({\mathbb {R}})\) with
where \(\mathfrak S_q\) is the set of permutations of \(\{1,\ldots , q\}\). In particular, \(\Vert d^q {\mathrm {Inv}}(M)\Vert = O(\Vert M^{-1}\Vert ^{q+1})\). Writing
we see that
will be less than \(C(\varphi )\Vert d\varphi ^{-1}\Vert _\infty ^{k_1+3}\Vert \eta \Vert ^2_{k_1+1, \infty }\). Using the bound
applying Lemma 7.3 and the product formula, we see that the desired conclusion holds for \(A^i_1\).
(ii) Let
Writing the right-hand side in the form
the same estimate on the derivative of K can be used, based on the fact that \(k_1+2\le p+1\).
(iii) The third term is
It can be handled similarly, requiring \(k_2+1\le p+1\) derivatives of \(K^i\) in the second variable.
(iv) The previous three were the main terms in the decomposition; the remaining two just bridge gaps. The first of them is
Here, we note that, for some constants C and \(\tilde{C}\),
(with a similar inequality when the roles of x and y are reversed) and these estimates can be used to check that
tends to 0 uniformly in x and y.
(v) The last term is
and can be handled similarly.
Step 4. It remains to check that \(\mathscr {W}(\omega )\) maps \(C^p_0(\varOmega , {\mathbb {R}}^d)\) to itself. This can be done in the same way we proved that \(\mathscr {V}(\omega ) \in C^p_0(\varOmega , {\mathbb {R}}^d)\), using Taylor expansions and the fact that \(d^k(\mathscr {W}(\omega )\eta )(y)\) will involve no more than k derivatives of \(\omega \) and \(\eta \), and \(k+1\) of K. This shows that \(\mathscr {W} = d\mathscr {V}\). The bound (10.23) can also be shown using the same techniques. We leave the final details to the reader. \(\square \)
Lemma 10.10 implies that (10.21) has unique local solutions (unique solutions over small enough time intervals). If we can prove that \(\Vert (d\varphi )^{-1}\Vert _\infty \) and \(\Vert \varphi \Vert _{p, \infty }\) remain bounded along solutions of the equation, inequality (10.23) will be enough to ensure that solutions exist over arbitrary time intervals. This fact will be obtained at the end of the next section.
10.5.3 Time Variation of the Eulerian Momentum
Assume that \(\varphi \) satisfies \(\partial _t \varphi (t) = v(t)\circ \varphi (t)\) with \(v\in {\mathcal X}^{p+1,1}(\varOmega )\). If \(\rho _0\in C^{p-1}(\varOmega , {\mathbb {R}}^d)^*\), we can apply the chain rule to the equation
in which we assume that \(w\in C^p_0(\varOmega , {\mathbb {R}}^d)\). We have (with \(\partial _t d\varphi = dv\circ \varphi \, d\varphi \))
The term on the right-hand side involves the adjoint representation of v(t), as expressed in the following definition.
Definition 10.11
If v is a differentiable vector field on \(\varOmega \), we denote by \({\mathrm {ad}}_v\) the mapping that transforms a differentiable vector field w into \({\mathrm {ad}}_v\, w = dv\, w - dw\, v\).
Observe that \(dv\,w - dw\, v = -[v, w]\), where the latter is the Lie bracket between right-invariant vector fields over the group of diffeomorphisms. Note that \({\mathrm {ad}}_v\) continuously maps \(C^p_0(\varOmega , {\mathbb {R}}^d)\) to \(C^{p-1}_0(\varOmega , {\mathbb {R}}^d)\). With this notation, we therefore have, for \(w \in C^p_0(\varOmega , {\mathbb {R}}^d)\):
so that
This yields the equation, called EPDiff, in which we let \(\tilde{\rho }(t)\) denote the restriction of \(\rho (t)\) to \(C^p_0(\varOmega , {\mathbb {R}}^d)\),
Equation (10.25) can be used to prove the following proposition.
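In this weak form, (10.25) reads, for all \(w \in C^p_0(\varOmega , {\mathbb {R}}^d)\),

$$ \frac{d}{dt} {\left( \tilde{\rho }(t) \,\middle|\, w\right) } = - {\left( \tilde{\rho }(t) \,\middle|\, {\mathrm {ad}}_{v(t)}\, w\right) }, \qquad \text {i.e.,}\qquad \partial _t \tilde{\rho }(t) + {\mathrm {ad}}_{v(t)}^*\, \tilde{\rho }(t) = 0, $$

where \({\mathrm {ad}}_{v}^*\) denotes the adjoint of \({\mathrm {ad}}_{v}\) in Definition 10.11.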
Proposition 10.12
Let \(\varphi (t) = {\mathrm {id}}+ \omega (t)\), where \(\omega \) is a solution of (10.21). Let \(v_0 = K\rho _0\) and \(v(t) = {\mathrm {Ad}}_{\varphi (t)^{-1}}^T v_0\). Then \(\Vert v(t)\Vert _V\) is independent of time.
Proof
Indeed, we have, for \(\varepsilon > 0\),
Since \(v(t)\in V \subset C^p_0(\varOmega , {\mathbb {R}}^d)\), (10.25) implies that the first term on the right-hand side converges to
For the second term, we have
which tends to 0 with \(\varepsilon \). \(\square \)
We can now prove that (10.21) has a unique solution over arbitrary time intervals.
Theorem 10.13
Under the hypotheses of Lemma 10.10, Eq. (10.21) has solutions over all times, uniquely specified by their initial conditions.
Proof
As already mentioned, Lemma 10.10 implies that solutions exist over small time intervals. Inequality (10.23) implies that these solutions can be extended as long as \(\Vert d\varphi (t)^{-1}\Vert _\infty \) and \(\Vert \varphi (t)\Vert _{p, \infty }\) remain finite. However, both these quantities are controlled by \(\int _0^t \Vert v(s)\Vert _V \, ds\). For the latter, this is a consequence of (C.6). For \(d\varphi (t)^{-1}\), we can note that
and use Gronwall’s lemma to ensure that
for some constant C. \(\square \)
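The two displayed steps can be sketched as follows (a reconstruction from \(\partial _t\, d\varphi = (dv\circ \varphi )\, d\varphi \) and the embedding of V in \(C^{p+1}_0(\varOmega , {\mathbb {R}}^d)\)): differentiating \(d\varphi (t)^{-1}\, d\varphi (t) = {\mathrm {Id}}\) in time gives

$$ \partial _t \left( d\varphi (t)^{-1}\right) = -\, d\varphi (t)^{-1} \left( dv(t)\circ \varphi (t)\right) , $$

so that \(\Vert d\varphi (t)^{-1}\Vert _\infty \le 1 + C\int _0^t \Vert v(s)\Vert _V\, \Vert d\varphi (s)^{-1}\Vert _\infty \, ds\), and Gronwall’s lemma yields \(\Vert d\varphi (t)^{-1}\Vert _\infty \le \exp \left( C\int _0^t \Vert v(s)\Vert _V\, ds\right) \).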
10.5.4 Explicit Expression
The assumption that \(\rho _0\in C^{p-1}_0(\varOmega , {\mathbb {R}}^d)^*\) “essentially” expresses the fact that the evaluation of \({\left( \rho _0 \,\middle|\, w\right) }\) will involve no more than \(p-1\) derivatives of w. This implies that the evaluation of the right-hand side of (10.21) will involve derivatives up to order p in \(\varphi = {\mathrm {id}}+ \omega \). In numerical implementations, it is often preferable to track the evolution of these derivatives over time, rather than approximate them using, e.g., finite differences. It often happens, for example, that the evaluation of \(\rho _0\) only requires the evaluation of \(\varphi \) and its derivatives over a submanifold of lower dimension, and tracking their values on a dense grid becomes counter-productive.
The evolution of the derivatives of \(\varphi \) can easily be computed by differentiating (10.21) with respect to the y variable. This is summarized in the system
It should be clear from this system that, if the computation of \({\left( \rho _0 \,\middle|\, w\right) }\) only requires the evaluation of w and its derivatives on some subset of \({\mathbb {R}}^d\), then \(\varphi \) and its derivatives only need to be tracked for y belonging to the same subset.
10.5.5 The Hamiltonian Form of EPDiff
We now provide an alternative form of (10.26), using the optimal control formulation discussed in Sect. 10.4.2, in which we introduced the co-state
Let \(M(t)=(d\varphi (t))^{-1}\) so that \(\partial _t M = - M\,(\partial _t d\varphi )\, M\). The second equation of (10.26) then becomes
This implies that, for any \(w\in V\),
We therefore have the system
Note that this system is an alternative expression of the first two equations of (10.17). When \({\left( \rho _0 \,\middle|\, w\right) }\) does not depend on the derivatives of w (more precisely, when \(\rho _0\in C^0_0(\varOmega , {\mathbb {R}}^d)^*\)), this provides an ordinary differential equation in the variables \((\varphi , \mu )\) (of the form \((d/dt)(\varphi ,\mu ) = F(\varphi , \mu )\)). The initial conditions are \(\varphi _0 = {\mathrm {id}}\) and \(\mu _0 = \rho _0\).
10.5.6 The Case of Measure Momenta
An interesting feature of (10.28) is that it can easily be reduced to a smaller number of dimensions when \(\rho _0\) takes specific forms. As a typical example, we perform the computation in the case
where \(\gamma _k\) is an arbitrary measure on \(\varOmega \) and \(z_k(0)\) a vector field. (We recall the notation \({\left( z \gamma \,\middle|\, w\right) } = \int z(x)^Tw(x)\, \gamma (dx)\).) Most of the Eulerian differentials that we have computed in Chap. 9 have been reduced to this form. From the definition of \(\mu (t)\), we have \(\mu (t) = \sum _{k=1}^N z_k(t, \cdot ) \gamma _k\) (where \(z_k(t, x) = d\varphi _{0t}(x)^{-T} z_k(0,x)\)). The first equation in (10.28) is
For a matrix A with ith column vector \(A^i\), and a vector z, \(z^TA^i\) is the ith coordinate of \(A^Tz\). Applying this to the previous equation yields
where we have used the fact that \(K(\varphi (t,x), \varphi (t,y))^T = K(\varphi (t,y), \varphi (t, x))\). The second equation in (10.28) becomes
where \(z_k^i\) is the ith coordinate of \(z_k\). From the expression of \(\mu (t)\), we also have
Letting \(K^{ij}\) denote the entries of K, we can identify \(\partial _t z_k\) as
This equation is somewhat simpler when K is a scalar kernel, in which case \(K^{ij}(x,y) = \varGamma (x, y)\) if \(i=j\) and 0 otherwise, where \(\varGamma \) takes real values. We get, in this case
In all cases, we see that the evolution of \(\mu \) can be completely described using the evolution of \(z_1, \ldots , z_N\). In the particular case when the \(z_k\)’s are constant vectors (which corresponds to most of the point-matching problems), this provides a finite-dimensional system on the \(\mu \) part.
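To make the reduction concrete, the following minimal numerical sketch (the Gaussian kernel, the point momenta and the step count are illustrative assumptions, not prescribed by the text) integrates the scalar-kernel landmark system \(\partial _t x_k = \sum _l \varGamma (x_k, x_l) z_l\), \(\partial _t z_k = -\sum _l (z_k^Tz_l)\, \nabla _1\varGamma (x_k, x_l)\), and uses the conservation of \(\Vert v(t)\Vert _V^2 = \sum _{k,l} \varGamma (x_k, x_l)\, z_k^Tz_l\) (Proposition 10.12) as a consistency check:

```python
import numpy as np

SIGMA = 1.0  # kernel width (illustrative choice)

def gamma(x, y):
    # scalar Gaussian kernel: Γ(x, y) = exp(-|x - y|^2 / (2 σ^2))
    return np.exp(-np.sum((x - y) ** 2) / (2 * SIGMA ** 2))

def rhs(x, z):
    # landmark system for a scalar kernel:
    #   dx_k/dt =  Σ_l Γ(x_k, x_l) z_l
    #   dz_k/dt = -Σ_l (z_k · z_l) ∇_1 Γ(x_k, x_l)
    dx, dz = np.zeros_like(x), np.zeros_like(z)
    for k in range(len(x)):
        for l in range(len(x)):
            g = gamma(x[k], x[l])
            dx[k] += g * z[l]
            # ∇_1 Γ(x, y) = -(x - y) Γ(x, y) / σ^2 for the Gaussian kernel
            dz[k] -= np.dot(z[k], z[l]) * (-(x[k] - x[l]) / SIGMA ** 2) * g
    return dx, dz

def hamiltonian(x, z):
    # (1/2) Σ_{k,l} Γ(x_k, x_l) z_k · z_l = (1/2) ||v||_V^2, conserved in time
    return 0.5 * sum(gamma(x[k], x[l]) * np.dot(z[k], z[l])
                     for k in range(len(x)) for l in range(len(x)))

def integrate(x0, z0, steps=200):
    # fixed-step RK4 integration of the landmark system on [0, 1]
    h = 1.0 / steps
    x, z = x0.copy(), z0.copy()
    for _ in range(steps):
        k1 = rhs(x, z)
        k2 = rhs(x + h / 2 * k1[0], z + h / 2 * k1[1])
        k3 = rhs(x + h / 2 * k2[0], z + h / 2 * k2[1])
        k4 = rhs(x + h * k3[0], z + h * k3[1])
        x = x + h / 6 * (k1[0] + 2 * k2[0] + 2 * k3[0] + k4[0])
        z = z + h / 6 * (k1[1] + 2 * k2[1] + 2 * k3[1] + k4[1])
    return x, z
```

With a fourth-order integrator, the Hamiltonian stays constant to discretization accuracy, which is a convenient test when implementing these equations.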
10.6 Optimization Strategies for Flow-Based Matching
We have formulated flow-based matching as an optimization problem over time-dependent vector fields. We discuss here other possible optimization strategies that take advantage of the different formulations that we obtained for the EPDiff equation. They will correspond to taking different control variables with respect to which the minimization is performed, and we will in each case provide the expression of the gradient of E with respect to a suitable metric. Optimization can then be performed by gradient descent, conjugate gradient or higher-order optimization algorithms when feasible (see Appendix D or [221]).
After discussing the general formulation of each of these strategies, we will provide the specific expression of the gradients for point-matching problems, in the following form: minimize
with respect to \(\varphi \), where \(x_1, \ldots , x_N\) are fixed points in \(\varOmega \). These problems are important because, in addition to the labeled and unlabeled point matching problems we have discussed, other problems, such as curve and surface matching, end up being discretized in this form (we will discuss algorithms for image matching in the next section). The following discussion describes (and often extends) several algorithms that have been proposed in the literature, in [32, 159, 203, 204, 289, 309] among other references.
10.6.1 Gradient Descent in \({\mathcal X}^2_V\)
The original problem having been expressed in this form, Corollary 10.9 directly provides the expression of the gradient of E considered as a function defined over \({\mathcal X}_V^2\), with respect to the metric in this space. Using \(t\mapsto v(t,\cdot )\) as an optimization variable has some disadvantages, however. The most obvious is that it leads to a very high-dimensional problem (the unknown is a function of \(d+1\) variables), even if the original objects are, say, collections of N landmarks in \({\mathbb {R}}^d\).
When the matching functional U is only a function of the deformation of a fixed object, i.e.,
then some simplifications can be made. To go further, we will need to compute derivatives in the object space, and henceforth assume that \(\mathcal I\) is an open subset of a Banach space \(\varvec{I}\). We assume that \(\mathrm {Diff}^{p+1}_{0}\) acts on \(\mathcal I\) and that the mapping \(\varPhi _I: \varphi \mapsto \varphi \cdot I\) is differentiable on \(\mathrm {Diff}^{p+1}_0\) for all \(I\in {\mathcal I}\), so that an infinitesimal action is defined by (see Sect. B.5.3)
for \(h\in C^{p+1}_0(\varOmega , {\mathbb {R}}^d)\). We assume as usual that V is continuously embedded in \(C^{p+1}_0(\varOmega , {\mathbb {R}}^d)\) so that \(v\cdot I\) is well defined for \(v\in V\) and \(d\varPhi _I({\mathrm {id}})\) restricted to V is also bounded with respect to \(\Vert \cdot \Vert _V\).
Let \(v\in {\mathcal X}^2_V\). If \(\partial _t \varphi = v \circ \varphi \), let \(J(t) = \varphi (t)\cdot I\) be the deforming object. Then \(\partial _t J(t) = v(t) \cdot J(t)\). With this in mind, we can write, when \(\tilde{E}\) is given by (10.14)
The iterated minimization first minimizes with respect to v for fixed object trajectories, then optimizes over the object trajectories.
When \(J(t, \cdot )\) is given, the inner minimization is
since the constraints apply separately to each v(t). This expression only depends on the trajectory J(t). One can therefore try to compute its gradient with respect to this object trajectory and run a minimization algorithm accordingly. One difficulty with this approach is that, given an object trajectory J(t), there may exist no \(w\in V\) such that \(\partial _t J = w\cdot J(t)\) (in which case the minimum in the integral is infinite), so that the possibility of expressing the trajectory as evolving according to a flow is a constraint of the problem. This may be intractable in the general case, but it is always satisfied for point-matching problems as long as the points remain distinct. We will discuss this in the next section.
However, what (10.33) tells us is that, if a time-dependent vector field \(\tilde{v}(t,\cdot )\) is given, one always reduces the value of \(\tilde{E}(\tilde{v})\) by replacing \(\tilde{v}(t, \cdot )\) by
with \(J(t) = \varphi _{0t}^{\tilde{v}}\cdot I\). Introduce the space
and its perpendicular \(V_J = N_J^\perp = \left\{ u\in V: {\big \langle {u}\, , \, {\tilde{u}}\big \rangle }_V = 0 \text { for all } \tilde{u} \in N_J \right\} \). Then we have the following lemma.
Lemma 10.14
Let \(I\in {\mathcal I}\) and \(\tilde{v}\in V\). Then, the minimizer of \(\Vert w\Vert ^2_V\) over all \(w\in V\) such that \(w\cdot J = \tilde{v}\cdot J\) is given by \(\pi _{V_{J}} (\tilde{v})\), the orthogonal projection of \(\tilde{v}\) on \(V_{J}\).
Proof
Let \(v = \pi _{V_{J}} (\tilde{v})\). We want to prove that v is a minimizer of \(\Vert \cdot \Vert ^2_V\) over the set of all \(w\in V\) such that \(w = \tilde{v} + u\) with \(u\in N_J\). For such a w, we have
and \(\Vert w\Vert _V^2 \ge \Vert \pi _{V_J}(w)\Vert ^2_V\). Moreover, from the characteristic properties of an orthogonal projection, we have \(\tilde{v} - v \in V_J^\perp = N_J\), the last identity holding because \(N_J\) is closed (being the null space of a bounded linear map). \(\square \)
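In symbols, the orthogonality argument in this proof can be sketched as follows (with \(\pi _{N_J}\) the orthogonal projection on \(N_J\)): for \(w = \tilde{v} + u\) with \(u \in N_J\),

$$ \pi _{V_J}(w) = \pi _{V_J}(\tilde{v}) + \pi _{V_J}(u) = v, \qquad \Vert w\Vert _V^2 = \Vert \pi _{V_J}(w)\Vert _V^2 + \Vert \pi _{N_J}(w)\Vert _V^2 \ge \Vert v\Vert _V^2 , $$

since \(\pi _{V_J}(u) = 0\) for \(u\in N_J = V_J^\perp \).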
The numerical computation of this orthogonal projection is not always easy, but when it is, it generally has a form which is more specific than a generic time-dependent vector field, and provides an improved gradient descent algorithm in \({\mathcal X}_V^2\) as follows. Assume that, at time \(\tau \) in the algorithm, the current vector field \(v^\tau \) in the minimization of E is such that \(v^\tau (t) \in V_{J^\tau (t)}\) at all times t. Then define a vector field at the next step \(\tau + \delta \tau \) by
which corresponds to one step of gradient descent, as specified in (10.20), then compute \(J(t) = \varphi _{0t}^{\tilde{v}^{\tau +\delta \tau }}\cdot I\) and define
at all times t.
Application to Point Matching
Consider the point-matching energy. In this case, letting
we have
We therefore have, by Corollary 10.9, with \(\tilde{U}(v) = U(\varphi _{01}^v)\),
so that
So, a basic gradient descent algorithm in \({\mathcal X}^2_V\) would implement the evolution (letting \(\tau \) denote the algorithm time)
The two-step algorithm defined in the previous section is especially efficient with point sets. When \(x = (x_1, \ldots , x_N)\) and \(v\cdot x = (v(x_1), \ldots , v(x_N))\), the projection on
is given by spline interpolation with the kernel, as described in Theorem 8.8, i.e.,
More precisely, define \(x_i^v(t) = \varphi _{0t}^v(x_i)\). We assume that, at time \(\tau \), we have a time-dependent vector field \(v^\tau \) which takes the form
Using (10.36), we define
The values of \(\tilde{v}(t,\cdot )\) are in fact only needed at the points \(x^{\tilde{v}}_i(t) = \varphi ^{\tilde{v}}_{0t}(x_i)\). These points are obtained by solving the differential equation
with \(x(0) = x_i\). Solving this equation provides both \(x^{\tilde{v}}_i(t) \) and \(\tilde{v}(x^{\tilde{v}}_i(t))\) for \(t\in [0,1]\).
Once this is done, define \(v^{\tau + \delta \tau }(t, \cdot )\) to be the solution of the approximation problem \(\inf _w \Vert w\Vert _V\) with \(w(x^{\tilde{v}}_i(t)) = \tilde{v}(t, x^{\tilde{v}}_i(t))\), which will therefore take the form
Solving (10.39) requires evaluating the expression of \(v^\tau \), which can be done exactly using (10.38). It also requires computing the expression of \(d\varphi ^{v^\tau }_{t1}(x_i^{v^\tau }(t))\), which can be obtained from the expression
which yields:
Thus, \(d\varphi ^{v}_{t1}(x_i^{v}(t))\) is a solution of \(\partial _t M = -M dv(x_i^v(t))\) with initial condition \(M = {\mathrm {Id}}\). The matrix \(dv(t, x_i^v(t))\) can be computed explicitly as a function of the point trajectories \(x_j^v(t), j=1, \ldots , N\), using the explicit expression (10.38). This algorithm was introduced in [31].
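The projection step (10.36)–(10.38) reduces, at each time t, to a kernel interpolation problem. A minimal sketch (assuming a scalar Gaussian kernel, so that all d coordinates share the same \(N\times N\) linear system; the kernel and its width are illustrative choices):

```python
import numpy as np

def kernel_matrix(pts, sigma=1.0):
    # S(x) with entries Γ(x_i, x_j) for a scalar Gaussian kernel
    diff = pts[:, None, :] - pts[None, :, :]
    return np.exp(-np.sum(diff ** 2, axis=-1) / (2 * sigma ** 2))

def project_field(pts, values, sigma=1.0):
    # minimal-V-norm interpolation (cf. Theorem 8.8): find coefficients a
    # such that w(y) = Σ_j Γ(y, x_j) a_j satisfies w(x_i) = values_i
    S = kernel_matrix(pts, sigma)
    a = np.linalg.solve(S, values)   # one N×N solve shared by all coordinates
    def w(y):
        g = np.exp(-np.sum((pts - y) ** 2, axis=1) / (2 * sigma ** 2))
        return g @ a
    return w, a
```

In the two-step algorithm above, `values` would be the evaluations \(\tilde{v}(t, x_i^{\tilde{v}}(t))\) and `pts` the current point positions; the projected field then replaces \(\tilde{v}(t,\cdot )\).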
10.6.2 Gradient in the Hamiltonian Form
As we have seen, one can use the optimal control formalism with the Pontryagin principle to compute the gradient of \(\tilde{E}\) in v. Given \(v\in {\mathcal X}^2_V\), this gradient can be computed by solving (10.28) with boundary conditions \(\varphi (0) = {\mathrm {id}}\) and \(\mu (1) = - dU(\varphi (1))\) (which can be achieved by solving the first equation in (10.28) from \(t=0\) to \(t=1\), then the second one backward in time, from \(t=1\) to \(t=0\)) and, using (10.18), letting
This equation (or the maximum principle) implies that the optimal v must be such that \(v(t) = {\mathbb {K}}\xi ^*_{\varphi (t)} \mu (t)\) for some \(\mu \in C^{p-1}_0(\varOmega , {\mathbb {R}}^d)^*\) and there is therefore no loss of generality in restricting the optimization problem to v’s taking this form. With this constraint, we have
Let \({\mathbb {K}}_\varphi = \xi _{\varphi } {\mathbb {K}}\xi ^*_{\varphi }\) so that \(\Vert v(t)\Vert ^2_V = {\left( \mu (t) \,\middle|\, {\mathbb {K}}_{\varphi (t)} \mu (t)\right) }\). One has
so that
With this notation, the state equation \(\partial _t \varphi = v\circ \varphi \) becomes \(\partial _t\varphi = {\mathbb {K}}_\varphi \mu \) and the original optimal control problem is reformulated as minimizing
subject to \(\partial _t \varphi = \mathbb K_{\varphi } \mu \).
Expressing the problem in this form slightly changes the expression of the differential. The computation of the gradient (and its justification) based on a co-state \(\alpha \) and the Hamiltonian
are obtained using the same methods as in Sect. 10.4.2, so we skip the details. Let \(\varphi ^\mu \) be the solution of \(\partial _t \varphi = {\mathbb {K}}_{\varphi } \mu \) with \(\varphi (0) = {\mathrm {id}}\). Then, with \(\tilde{E}(\mu ) = E(\varphi ^\mu , \mu )\), we have
where \(\varphi \) and \(\alpha \) are given by the system
with \(\varphi (0) = {\mathrm {id}}\) and \(\alpha (1) = - dU(\varphi (1))\). Unsurprisingly, this system boils down to (10.28) when \(d\tilde{E}(\mu ) = 0\), i.e., when \(\alpha = \mu \).
The gradient of \(\tilde{E}\) expressed with respect to the inner product
(this choice will be justified in Sect. 11.4 as the dual Riemannian metric on the diffeomorphism group) is
namely \(\nabla \tilde{E}(\mu ) = \mu - \alpha \): a remarkably simple expression.
Consider now the case in which \(U(\varphi ) = F(\varphi \cdot I)\) where I is a fixed object. With the notation and assumptions made in Sect. 10.6.1, we found in Lemma 10.14 that there was no loss of generality in restricting the minimization to \(v(t) \in V_{J(t)}\) at all times. This often entails additional constraints on the momentum \(\rho (t) = \mathbb Lv(t)\), or on \(\mu (t) = \xi _{\varphi (t)^{-1}}^* \rho (t)\), that can be leveraged to reduce the dimension of the control variable. For example, we have seen that for point sets (in which we let \(J=x\)) \(V_x\) was given by (10.37), so that \(\rho (t)\) must take the form
for some \(z_1(t), \ldots , z_N(t) \in {\mathbb {R}}^d\), from which we can deduce (using \(x_k(t) = \varphi (t, x_k(0))\)) that \(\mu (t)\) must take the form
One can then use \(z_1, \ldots , z_N\) as a new control, as described below.
Another interesting special case is when \(\mu (t)\) can be expressed as a vector measure, because, as discussed in Sect. 10.5.6, one can then assume that
where \(\gamma _1, \ldots , \gamma _N\) are fixed measures. One can then use the vector fields \(z_1, \ldots , z_N\) to parametrize the problem. This leads to a simplification when the measures have a sparse support. They are, for example, Dirac measures for point matching. We now review this case in more detail.
Application to Point Matching
When \(U(\varphi ) = F(\varphi (x_1), \ldots , \varphi (x_N))\), we have
so that
is a vector measure. We can therefore look for a solution in the form
at all times, for some coefficients \(z_1, \ldots , z_N\).
In order to obtain \(\alpha \) in (10.40) given a current \(\mu \), it suffices to solve the first equation only for the values of \(y_k(t) = \varphi (t, x_k)\), \(k=1, \ldots , N\), which requires us to solve the system
One then sets
and solves the second equation backward in time, knowing that the solution will take the form
with \(\eta _k(1) = -\partial _k F(y_1(1), \ldots , y_N(1))\) and
Given this, we have
10.6.3 Gradient in the Initial Momentum
We now use the fact that Eq. (10.16) implies that the optimal v(t) is uniquely determined by its value at \(t=0\) to formulate the variations of the objective function in terms of these initial conditions. We therefore optimize with respect to \(v_0\), or equivalently with respect to \(\mu _0 = \rho _0\). This requires finding \(\rho _0\) such that
is minimal under the constraints \(\partial _t\varphi (t) = v(t)\circ \varphi (t)\), with
Proposition 10.12 helps us to simplify this expression, since it implies that \(\int _0^1 \Vert v(t)\Vert ^2 dt = {\left( \rho _0 \,\middle|\, {\mathbb {K}}\rho _0\right) }\), and the minimization problem is therefore to find \(\rho _0\) such that
is minimal, where \((\varphi , \mu )\) is a solution of system (10.28) with initial conditions \(\varphi (0) = {\mathrm {id}}\) and \(\mu (0) = \rho _0\). Writing (10.28) as
and applying Proposition D.12, we have
where the pair \(\left( {\begin{matrix} p_\varphi (t)\\ p_\mu (t)\end{matrix}}\right) \) satisfies \(p_\varphi (1) = - dU(\varphi (1))\), \(p_\mu (0) = 0\) and
The gradient of E with respect to the metric on \(V^*\) is then given by \(\nabla E(\rho _0) = \rho _0 - {\mathbb {K}}^{-1} p_\mu \).
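As an illustration of this strategy, here is a minimal sketch of gradient descent in the initial momentum for landmark matching, with finite-difference gradients standing in for the adjoint system (10.41) (the Gaussian kernel, the forward-Euler discretization, the data-term weight and the targets are all illustrative assumptions):

```python
import numpy as np

SIGMA = 1.0

def gamma(x, y):
    # scalar Gaussian kernel
    return np.exp(-np.sum((x - y) ** 2) / (2 * SIGMA ** 2))

def flow(x0, z0, steps=25):
    # forward-Euler discretization of the landmark (φ, μ) system; returns x(1)
    h = 1.0 / steps
    x, z = x0.copy(), z0.copy()
    for _ in range(steps):
        dx, dz = np.zeros_like(x), np.zeros_like(z)
        for k in range(len(x)):
            for l in range(len(x)):
                g = gamma(x[k], x[l])
                dx[k] += g * z[l]
                dz[k] -= np.dot(z[k], z[l]) * (-(x[k] - x[l]) / SIGMA ** 2) * g
        x, z = x + h * dx, z + h * dz
    return x

def energy(x0, z0, target, lam=10.0):
    # E(ρ0) = (ρ0 | K ρ0) + λ U(φ(1)), with U an endpoint mismatch (illustrative)
    pen = sum(gamma(x0[k], x0[l]) * np.dot(z0[k], z0[l])
              for k in range(len(x0)) for l in range(len(x0)))
    return pen + lam * np.sum((flow(x0, z0) - target) ** 2)

def descend(x0, target, iters=15, eps=1e-6):
    # finite-difference gradient descent with backtracking on the step size
    z0 = np.zeros_like(x0)
    for _ in range(iters):
        E = energy(x0, z0, target)
        g = np.zeros(z0.size)
        for j in range(z0.size):
            dz = np.zeros(z0.size)
            dz[j] = eps
            g[j] = (energy(x0, z0 + dz.reshape(z0.shape), target) - E) / eps
        step = 1.0
        while step > 1e-8 and \
                energy(x0, z0 - step * g.reshape(z0.shape), target) > E:
            step /= 2
        cand = z0 - step * g.reshape(z0.shape)
        if energy(x0, cand, target) <= E:
            z0 = cand                 # accept only non-increasing steps
    return z0
```

In a real implementation, the finite-difference loop would be replaced by the backward integration of \((p_\varphi , p_\mu )\), which computes the same gradient at the cost of a single extra pass.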
The practical application of these formulas requires us to make explicit the expressions of \(\partial _i\mathscr {V}_j^*\) for \(i, j=1,2\). Returning to (10.28), we have
Forming explicit expressions of \(\partial _i\mathscr {V}_j^*\) requires isolating h or \(\eta \) from the right-hand sides. To do this, we will need to change the order in which linear forms are applied to the x and y coordinates. This issue is addressed in the following lemma.
Lemma 10.15
Assume that \(\mu \in C^{r}(\varOmega , {\mathbb {R}}^d)^*\) and \(\nu \in C^{r'}(\varOmega , {\mathbb {R}}^d)^*\). Let \(g:\varOmega \times \varOmega \rightarrow {\mathbb {R}}\) be a function such that \(\partial ^k_1\partial _2^{k'} g \in C_0(\varOmega \times \varOmega , {\mathbb {R}})\) for all \(k\le r\) and \(k'\le r'\). Then, for all \(a, b\in {\mathbb {R}}^d\), \({\left( \mu \,\middle|\, g(x, \cdot )a\right) }_x \in C^{r'}(\varOmega , {\mathbb {R}})\) and \({\left( \nu \,\middle|\, g(\cdot , y)b\right) }_y \in C^{r}(\varOmega , {\mathbb {R}})\), with
Proof
Let \(f(y) = {\left( \mu \,\middle|\, g(x, y)a\right) }_x\). Using Taylor’s formula, we can write
so that
The last term (call it R) is such that
The uniform continuity of \(\partial _1^k\partial _2^{r'} g\) for \(k\le r\) implies that \(R = o(|h|^{r'})\) so that \(f\in C^{r'}(\varOmega , {\mathbb {R}})\). Similarly, letting \(f'(x) = {\left( {\mu }\, \left| {\mu }\, {g(x, y)b}\right. \right) }_y\), one has \(f'\in C^r(\varOmega , {\mathbb {R}})\).
The computation also shows that, for some constant C,
with
so that both sides of (10.42) are continuous in g with respect to this norm. To conclude, it suffices to notice that (10.42) is true when g takes the form
and that these functions form a dense set for \(\Vert g\Vert _{r, r',\infty }\), so that the identity extends by continuity. \(\square \)
Let us use this lemma to identify the first term in \({\left( \partial _1 \mathscr {V}_1^*\, p_\varphi \,\middle|\, h\right) }\) as a linear form acting on h. Write, letting \(\partial _{i, k}\) denote the derivative with respect to the kth coordinate of the ith variable,
where \(\mathscr {U}_{1}(x)\) is the matrix with coefficients
Write \({\left( \mathscr {U}_{1}^T \mu \,\middle|\, h\right) } = {\left( \mu \,\middle|\, \mathscr {U}_{1} h\right) }\), a notation generalizing the one introduced for vector measures. After a similar computation for the second term of \({\left( \partial _1 \mathscr {V}_1^*\, p_\varphi \,\middle|\, h\right) }\) (which does not require Lemma 10.15), we get
with
Consider now \(\partial _2 \mathscr {V}_1^*\, p_\varphi \), writing
so that
With similar computations for \(\mathscr {V}_2\), and skipping the details, we find
where
Finally,
Let us take the special case of vector measures, assuming that \(\mu (t) = \sum _{k=1}^Nz_k(t, \cdot ) \gamma _k\). We will look for \(p_\varphi \) in the form
\(p_\mu \) being a function defined over the support of \(\mu \).
With these assumptions, we have:

- \(\displaystyle \partial _1 \mathscr {V}_1^*\, p_\varphi = \sum _{k=1}^N \zeta _k^{1,1} \gamma _k\) with
  $$\begin{aligned} \zeta _k^{1,1}(x) =&\sum _{l=1}^N \sum _{i, j=1}^d {\left( \gamma _l \,\middle|\, \alpha ^i_l(y) z_k^j(x)\nabla _1 K^{ij}(\varphi (x), \varphi (y)) \right) }_y \\&+ \sum _{l=1}^N \sum _{i, j=1}^d {\left( \gamma _l \,\middle|\, z^i_l(y) \alpha _k^j(x)\nabla _1 K^{ij}(\varphi (x), \varphi (y)) \right) }_y. \end{aligned}$$
- \(\displaystyle \partial _2 \mathscr {V}_1^*\, p_\varphi (x) = \sum _{k=1}^N {\left( \gamma _k \,\middle|\, K(\varphi (x), \varphi (y)) \alpha _k(y)\right) }_y\).
- \(\displaystyle \partial _1 \mathscr {V}_2^*\, p_\mu = \sum _{k=1}^N \zeta _k^{2,1} \gamma _k\) with
  $$\begin{aligned} \zeta _k^{2,1}(x) =&-\sum _{i, j=1}^d \sum _{l=1}^N {\left( \gamma _l \,\middle|\, z_l^i(y) z_k^j(x) \partial _1\partial _2 K^{ij}(\varphi (x), \varphi (y)) p_\mu (y)\right) }_y\\&- \sum _{i, j=1}^d \sum _{l=1}^N {\left( \gamma _l \,\middle|\, z_l^i(y) z_k^j(x) \partial _1^2 K^{ij}(\varphi (x), \varphi (y)) p_\mu (x)\right) }_y. \end{aligned}$$
- \( \begin{aligned} \partial _2 \mathscr {V}_2^*\, p_\mu (x) =&- \sum _{i=1}^d \sum _{k=1}^N {\left( \gamma _k \,\middle|\, z^i_k(y) \partial _1 K^i(\varphi (x), \varphi (y))p_\mu (x)\right) }_y \\&- \sum _{i=1}^d \sum _{k=1}^N {\left( \gamma _k \,\middle|\, z^i_k(y) \partial _2 K^i(\varphi (x), \varphi (y))p_\mu (y)\right) }_y. \end{aligned} \)
System (10.41) can now be simplified as
Application to Point Matching
We now apply this approach to point-matching problems. Since \(\rho _0\) takes the form
we are in the vector measure case with \(\gamma _k = \delta _{x_{0,k}}\). The densities \(z_k\) and \(\alpha _k\) for \(\mu \) and \(p_\varphi \) can therefore be considered as vectors in \({\mathbb {R}}^d\), and \(p_\mu \), being defined on the support of \(\mu \), is also a collection of vectors \(p_{\mu , k} = p_\mu (x_k)\). Given this, we can immediately rewrite
- \(\displaystyle \partial _1 \mathscr {V}_1^*\, p_\varphi = \sum _{k=1}^N \zeta _k^{1,1} \delta _{x_{0,k}}\) with
  $$\begin{aligned} \zeta _k^{1,1} =\sum _{l=1}^N \sum _{i, j=1}^d \left( \alpha ^i_l \nabla _1 K^{ij}(x_k, x_l) z_k^j + z^i_l \nabla _1 K^{ij}(x_k, x_l) \alpha _k^j\right) . \end{aligned}$$
- \(\displaystyle \partial _2 \mathscr {V}_1^*\, p_\varphi (x_{0,k}) = \sum _{l=1}^N K(x_k, x_l) \alpha _l\).
- \(\displaystyle \partial _1 \mathscr {V}_2^*\, p_\mu = \sum _{k=1}^N \zeta _k^{2,1} \delta _{x_{0,k}}\) with
  $$\begin{aligned} \zeta _k^{2,1} = -\sum _{i, j=1}^d \sum _{l=1}^N z_l^i z_k^j (\partial _1\partial _2 K^{ij}(x_k, x_l)\, p_{\mu , l} + \partial _1^2 K^{ij}(x_k, x_l)\, p_{\mu , k}). \end{aligned}$$
- \( \begin{aligned} \partial _2 \mathscr {V}_2^*\, p_\mu (x_k) =&- \sum _{i=1}^d \sum _{l=1}^N z^i_l (\partial _1 K^i(x_k, x_l)p_{\mu , k} + \partial _2 K^i(x_k, x_l)p_{\mu , l}). \end{aligned} \)
This algorithm is illustrated in Fig. 10.1. In the same figure, we also provide (for comparison purposes) the results provided by spline interpolation, which computes \(\varphi (x) = x + v(x)\), where v is computed (using Theorem 8.9) in order to minimize
Although this is a widespread registration method [42], Fig. 10.1 shows that it is far from being diffeomorphic for large deformations.
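The failure can be reproduced in one dimension with a minimal sketch (assuming a Gaussian kernel and exact interpolation, i.e., no smoothing term; the landmark data are illustrative): two landmarks that exchange positions yield a non-monotone, hence non-injective, map \(\varphi = {\mathrm {id}} + v\).

```python
import numpy as np

def spline_map(ctrl, disp, sigma=1.0):
    # φ(x) = x + v(x), with v the Gaussian-kernel interpolant of the
    # displacements `disp` prescribed at the control points `ctrl` (1D)
    S = np.exp(-(ctrl[:, None] - ctrl[None, :]) ** 2 / (2 * sigma ** 2))
    a = np.linalg.solve(S, disp)
    def phi(x):
        g = np.exp(-(ctrl - x) ** 2 / (2 * sigma ** 2))
        return x + g @ a
    return phi

# two landmarks exchanging positions: 0 -> 1 and 1 -> 0
phi = spline_map(np.array([0.0, 1.0]), np.array([1.0, -1.0]))
grid = np.linspace(-1.0, 2.0, 301)
vals = np.array([phi(x) for x in grid])
# vals decreases near the midpoint, so φ is not monotone and not injective
```

A diffeomorphic method would instead make the landmarks flow around each other along time-dependent trajectories.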
10.6.4 Shooting
The optimality conditions for our problem are \(\mu (1) = -dU(\varphi (1))\) with \(\mu (t)\) given by (10.28). The shooting approach in optimal control consists in finding an initial momentum \(\rho _0 = \mu (0)\) such that these conditions are satisfied. Root-finding methods, such as Newton’s algorithm, can be used for this purpose. At a given step of Newton’s algorithm, one updates the current value of \(\rho _0\) to \(\rho _0 + \eta \), where \(\eta \) is chosen such that, letting \(F(\rho _0) := \mu (1) + dU(\varphi (1))\), one has
One therefore needs to solve this linear equation in order to update the current \(\rho _0\). One has
where
is (using the notation of the previous section) the differential of the solution of the equation
with respect to its initial condition, i.e., the solution of
with initial condition \(W(0) = {\mathrm {Id}}\).
Because one needs to compute the solution of this differential equation at every step of the algorithm, and then solve a linear system, the shooting method is feasible only for problems that can be discretized into a relatively small number of dimensions. One can use it, for example, in point-matching problems with no more than a few hundred landmarks (see [290] for an application to labeled point matching), in which case the algorithm can be very efficient. Another issue is that root-finding algorithms are not guaranteed to converge. Usually, a good initial solution must be found, using, for example, a few preliminary steps of gradient descent.
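A minimal sketch of the shooting loop for landmarks follows, with simplifying assumptions relative to the text: the endpoint condition is taken to be exact matching \(x(1) = \) target rather than \(\mu (1) = -dU(\varphi (1))\), and the Jacobian W is approximated by finite differences instead of integrating the variational equation (kernel, data and iteration counts are illustrative):

```python
import numpy as np

SIGMA = 1.0

def gamma(x, y):
    # scalar Gaussian kernel
    return np.exp(-np.sum((x - y) ** 2) / (2 * SIGMA ** 2))

def endpoint(x0, z0, steps=20):
    # RK4 integration of the landmark Hamiltonian system; returns x(1)
    def rhs(x, z):
        dx, dz = np.zeros_like(x), np.zeros_like(z)
        for k in range(len(x)):
            for l in range(len(x)):
                g = gamma(x[k], x[l])
                dx[k] += g * z[l]
                dz[k] -= np.dot(z[k], z[l]) * (-(x[k] - x[l]) / SIGMA ** 2) * g
        return dx, dz
    h = 1.0 / steps
    x, z = x0.copy(), z0.copy()
    for _ in range(steps):
        k1 = rhs(x, z)
        k2 = rhs(x + h / 2 * k1[0], z + h / 2 * k1[1])
        k3 = rhs(x + h / 2 * k2[0], z + h / 2 * k2[1])
        k4 = rhs(x + h * k3[0], z + h * k3[1])
        x = x + h / 6 * (k1[0] + 2 * k2[0] + 2 * k3[0] + k4[0])
        z = z + h / 6 * (k1[1] + 2 * k2[1] + 2 * k3[1] + k4[1])
    return x

def shoot(x0, target, iters=10, eps=1e-6):
    # Newton iterations on F(z0) = x(1; z0) - target, with a
    # finite-difference Jacobian in place of the variational equation
    z0 = (target - x0).copy()            # crude initial guess
    n = z0.size
    for _ in range(iters):
        F = (endpoint(x0, z0) - target).ravel()
        J = np.zeros((n, n))
        for j in range(n):
            dz = np.zeros(n)
            dz[j] = eps
            Fp = (endpoint(x0, z0 + dz.reshape(z0.shape)) - target).ravel()
            J[:, j] = (Fp - F) / eps
        z0 = z0 - np.linalg.solve(J, F).reshape(z0.shape)
    return z0
```

For moderate displacements and a good initial guess, a few Newton iterations suffice, consistent with the remarks above on convergence.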
10.6.5 Gradient in the Deformable Object
Finally, we consider the option of using the time derivative of the deformable object as a control variable, using the fact that, by (10.33), the objective function can be reduced to
with \(L(\eta , J) = \min _{w:\, \eta = w \cdot J} \Vert w\Vert ^2_V\). This formulation is limited, in that \(L(\eta , J)\) is not always defined for all \((\eta , J)\), resulting in constraints in the minimization that are not always easy to handle. Even if well defined, the computation of L may be numerically demanding. To illustrate this, consider the image-matching case, in which \(v\cdot J = -\nabla J^T v\). An obvious constraint is that, in order for
to have at least one solution, the variation \(\eta \) must be supported by the set \(\{\nabla J \ne 0\}\). To compute this solution when it exists, one can write, for \(x\in \varOmega \),
and it is possible to look for a solution in the form
where \(\lambda (x)\) can be interpreted as a continuous family of Lagrange multipliers. This results in a linear equation in \(\lambda \), namely
which is numerically challenging.
For point sets, however, the approach is feasible [159] because L can be made explicit. Given a point-set trajectory \(x(t) = (x^{(1)}(t), \ldots , x^{(N)}(t))\), let S(x(t)) denote the block matrix with (i, j) block given by \(K(x^{(i)}(t), x^{(j)}(t))\). The constraints are \(\dot{x}_t = S(x(t))\, \xi (t)\), so that \(\xi (t) = S(x(t))^{-1} \dot{x}_t\) and the minimization reduces to
Minimizing this function with respect to x by gradient descent is possible, and has been described in [158, 159] for labeled landmark matching. The basic computation is as follows: if \(s_{pq}\) denotes an entry of S and \(s_{pq, r} = \partial _{r} s_{pq}\), we can write (using the fact that \(\partial _{r} (S^{-1}) = - S^{-1} (\partial _{r}S) S^{-1}\))
After an integration by parts in the first integral, we obtain
where \(z_r(t) = \sum _{p,q} \xi _p(t) \xi _q(t) s_{pq, r}(x(t))\) and \(\delta _1\) is the Dirac measure at \(t=1\).
This singular part can be dealt with by computing the gradient in a Hilbert space in which the evaluation function \(x(\cdot ) \mapsto x(1)\) is continuous. This method has been suggested, in particular, in [129, 161]. Let H be the space of all trajectories \(x : t\mapsto x(t) = (x^{(1)}(t), \ldots , x^{(N)}(t))\), with fixed starting point x(0), free end-point x(1) and square integrable time derivative. This is a space of the form \(x(0) + H\) where H is the Hilbert space of time-dependent functions \(t\mapsto h(t)\), considered as column vectors of size Nk, with \(h(0)=0\) and
To compute the gradient for this inner product, we need to write \(\left( dE(x)\, \left| \, h\right. \right) \) in the form \({\big \langle {\nabla ^H E(x)}\, , \, {h}\big \rangle }_H\). We will make the assumption that
which implies that
is continuous in h. Similarly, the linear form \(h \mapsto \nabla U(x(1))^T h(1)\) is continuous since
Finally, \(h \mapsto \int _0^1 {{z(t)}^T}\dot{h} dt\) is continuous provided that we assume that
is square integrable over [0, 1], since this yields
which is continuous in h with respect to the H norm.
Thus, under these assumptions, \(h \mapsto \left( dE(x)\, \left| \, h\right. \right) \) is continuous over H, and the Riesz representation theorem implies that \(\nabla ^H E(x)\) exists as an element of H. We now proceed to its computation. Letting
and \(a = \nabla U(x(1))\), the problem is to find \(\zeta \in H\) such that, for all \(h\in H\),
This expression can also be written
This suggests selecting \(\zeta \) such that \(\zeta (0)=0\) and
which implies
At \(t=1\), this yields
and we finally obtain
We summarize this in an algorithm, in which \(\tau \) is again the computation time.
Algorithm 3
(Gradient descent algorithm for landmark matching) Start with initial landmark trajectories \(x(t, \tau ) = (x^{(1)}(t,\tau ), \ldots , x^{(N)}(t,\tau ))\).
Solve
with \( a(\tau )= \nabla U(x(1,\tau ))\), \(\mu (t, \tau ) = \int _0^t \xi (s, \tau ) ds\), \(\eta (t, \tau ) = \int _0^t z(s, \tau ) ds\) and
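The updates in Algorithm 3 rely on the H-gradient and on equations elided above. As a rough, self-contained illustration of the same idea (gradient descent over landmark trajectories with x(0) held fixed), the following sketch minimizes a discretized kinetic-plus-endpoint energy, estimating the gradient by central finite differences instead of the H-gradient; the Gaussian kernel, the quadratic endpoint cost and the step size are assumptions made for this example.

```python
import numpy as np

def matching_energy(traj, targets, dt, lam=10.0):
    # Discretized energy: kinetic term int xdot^T S(x)^{-1} xdot dt, with S
    # the Gaussian Gram matrix, plus an (assumed) quadratic endpoint cost
    # U(x(1)) = lam * sum_i |x^(i)(1) - target_i|^2.
    E = 0.0
    for k in range(traj.shape[0] - 1):
        xdot = (traj[k + 1] - traj[k]) / dt
        d2 = np.sum((traj[k][:, None, :] - traj[k][None, :, :]) ** 2, axis=-1)
        S = np.exp(-d2 / 2.0)
        E += dt * np.sum(xdot * np.linalg.solve(S, xdot))
    return E + lam * np.sum((traj[-1] - targets) ** 2)

def descend(traj, targets, dt, step=0.02, iters=100, eps=1e-5):
    # Plain gradient descent; the gradient over the free variables
    # (all time steps except x(0)) is estimated by central differences.
    traj = traj.copy()
    for _ in range(iters):
        g = np.zeros_like(traj)
        it = np.nditer(traj, flags=["multi_index"])
        for _val in it:
            idx = it.multi_index
            if idx[0] == 0:  # the starting point x(0) is not a variable
                continue
            tp = traj.copy(); tp[idx] += eps
            tm = traj.copy(); tm[idx] -= eps
            g[idx] = (matching_energy(tp, targets, dt)
                      - matching_energy(tm, targets, dt)) / (2 * eps)
        traj -= step * g
    return traj
```

With a single landmark the Gram matrix is the scalar 1, so the energy is quadratic in the trajectory and descent with a small enough step decreases it monotonically.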
10.6.6 Image Matching
We now take an infinite-dimensional example to illustrate some of the previously discussed methods and focus on the image-matching problem. We therefore consider
where \(I, \tilde{I}\) are functions \(\varOmega \rightarrow {\mathbb {R}}\), I being differentiable. The Eulerian differential of U is given by (9.21):
So, according to (10.19), and letting \(\tilde{U}(v) = U(\varphi _{01}^v)\),
This provides the expression of the V-gradient of \(\tilde{E}\) for image matching, namely
Using a change of variable in the integral, the gradient may also be written as
the associated gradient descent algorithm having been proposed in [32].
Let us now consider an optimization with respect to the initial \(\rho _0\). First notice that, by (9.20), \(\mu (1) = \lambda \,\det (d\varphi (1))(I - I'\circ \varphi (1)) d\varphi (1)^{-T}\nabla I \, dx\) is a vector measure. Also, we have
which shows that one can assume that \(\rho _0 = z_0 dx\) for some vector-valued function \(z_0\) (with \(z_0 = \det (d\varphi (1))(I - I'\circ \varphi (1)) \nabla I\) for an optimal control).
We now make explicit the computation of the differential of the energy with respect to \(\rho _0\). We have \(\mu (t) = z(t, \cdot ) dx\), with \(z(0) = z_0\) and
The differential \(dE(\rho _0) = K\rho _0 - p_\mu (0)\) is computed by solving, using \(\alpha (1) = \lambda \det (d\varphi (1))(I - I'\circ \varphi (1)) d\varphi (1)^{-T}\nabla I\) and \(p_\mu (1) = 0\),
in which
and
We summarize the computation of the gradient of the image-matching functional with respect to \(z_0\) such that \(\rho _0 = z_0 dx\):
Algorithm 4
1. Solve (10.46) with initial conditions \(\varphi (0)= {\mathrm {id}}\) and \(z(0) = z_0\), and compute \(dU(\varphi (1)) = -\lambda (I- I'\circ \varphi (1))\det (d\varphi (1)) d\varphi (1)^{-T} \nabla I\).
2. Solve, backwards in time until \(t=0\), the system (10.47) with boundary conditions \(\alpha (1) = -dU(\varphi (1))\) and \(p_\mu (1) = 0\).
3. Set \(\nabla E(z_0) = 2z_0 - {\mathbb {K}}^{-1}p_\mu (0)\).
The gradient is computed with the metric \({\big \langle {z}\, , \, {z'}\big \rangle } = \int _{{\mathbb {R}}^d} z(y)^T {\mathbb {K}}z(y) dy\). Results obtained with this algorithm are presented in Fig. 10.2.
One can also use the fact that \(z_0 = f_0 \nabla I\) for a scalar-valued \(f_0\). Since we have
we can write, with \(\tilde{E}(f_0) = E(f_0 \nabla I)\):
which leads to replacing the last step in Algorithm 4 by
which corresponds to using the \(L^2\) metric in \(f_0\) for gradient descent. However, a more natural metric, in this case, is the one induced by the kernel, i.e.,
with \(K_I(x,y) = \nabla I(x)^T K(x, y) \nabla I(y)\). With this metric, \(z_0\) is updated with
Although this metric is more satisfactory from a theoretical viewpoint, the inversion of \({\mathbb {K}}_I\) might be difficult, numerically.
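When the matrix kernel is a scalar translation-invariant kernel times the identity, \(K_I\) reduces to \(\varGamma (x-y)\, \nabla I(x)^T\nabla I(y)\), an entrywise (Schur) product of two non-negative kernels, hence again a non-negative kernel. A minimal sketch of assembling \({\mathbb {K}}_I\) on a finite point set (the Gaussian \(\varGamma \) and the sampled gradients are assumptions for illustration):

```python
import numpy as np

def induced_kernel_matrix(points, grads, sigma=1.0):
    """Assemble (K_I(x_i, x_j))_{ij} with K(x, y) = Gamma(x - y) Id and
    K_I(x, y) = grad I(x)^T K(x, y) grad I(y)."""
    d2 = np.sum((points[:, None, :] - points[None, :, :]) ** 2, axis=-1)
    gamma = np.exp(-d2 / (2 * sigma ** 2))
    # entrywise product of two positive semi-definite matrices,
    # so the result is positive semi-definite as well
    return gamma * (grads @ grads.T)
```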
10.6.7 Pros and Cons of the Optimization Strategies
In the previous sections we have reviewed several possible choices of control variables with respect to which the optimization of the matching energy can be performed. For all but the shooting method, this results in specific expressions of the gradient that can then be used in optimization procedures such as those discussed in Appendix D.
All these procedures have been implemented in the literature to solve a diffeomorphic-matching problem in at least one specific context, but no extensive study has ever been made to compare them. Even if the outcome of such a study is likely to be that the best method depends on the specific application, one can still provide a few general facts that can help a user decide which one to use.
When feasible (that is, when the linear system it involves at each step can be efficiently computed and solved), the shooting method is probably the most efficient. If the initialization is not too far from the solution, convergence can be achieved in a very small number of iterations. One cannot guarantee, however, that the method will converge starting from any initial point, and shooting needs to be combined with some gradient-based procedure in order to find a good starting position.
Since they optimize with respect to the same variable, the most natural procedure to combine with shooting is optimization with respect to the initial momentum. Even when shooting is not feasible (e.g., for large-scale problems), this specific choice of control variable is important, because it ensures that the final solution satisfies the EPDiff equation, guaranteeing the consistency of the momentum representation discussed in Sect. 11.5.2. The limitation is that, for large and complex deformations, the solution can be highly sensitive to small changes in the control variable, which may result in an unstable optimization procedure.
The other methods, which optimize with respect to time-dependent quantities, are generally better able to compute very large deformations. Besides the obvious additional burden in computer memory that they require, one must be aware that the discrete solution can sometimes be far from satisfying the EPDiff equation unless the time discretization is fine enough (which may be impossible to achieve within a feasible implementation for large-scale problems). Therefore, these methods are not the best choice if obtaining a reliable momentum representation is important. Among the three time-dependent control variables that we have studied (velocity, momentum and deformable object), one may have a slight preference for the representation using the time-dependent momenta, even if the computation it involves is slightly more complex than the others. There are at least two reasons for this. First, the momenta are generally more parsimonious in the space variables, because they incorporate normality constraints with respect to transformations that leave the deformable objects invariant. Second, the forward and backward equations solved at each iteration immediately provide a gradient with respect to the correct metric, so that the implementation does not have to include the solution of the possibly large-dimensional linear system required by other representations.
10.7 Numerical Aspects
10.7.1 Discretization
The implementation of the diffeomorphic matching algorithms that were just discussed requires a proper discretization of the different variables that are involved. The discretization in time of optimal control problems is discussed in Sect. D.4. This discussion directly applies here and we refer the reader to the relevant pages in that chapter for more details. If the deformed objects are already discrete (e.g., point sets), this suffices to design a numerical implementation.
When the deformed objects are continuous, some discrete approximation must obviously be made. One interesting feature of the problems that we have discussed is that they all derive from the general formulation (10.8), but can be reduced, using Sect. 10.6.2, to a situation in which the state and controls are finite dimensional after discretization. Typically, starting from (10.8), the discretization implies that only the end-point cost function is modified, replacing \(U(\varphi ) = F(\varphi \cdot I_0)\) by an approximation taking the form \(U^{(n)}(\varphi ) = F^{(n)}(\varphi , I_0^{(n)})\). For example, when matching curves, one may replace the objective function \(F(\varphi \cdot I_0) = \Vert \mu _{\varphi \cdot I_0} - \mu _{I'}\Vert _{W^*}^2\) in (9.40) by the discrete approximation in (9.46), in which the curves \(I_0\) and \(I'\) are approximated by point sets. Similar approximations can be made for the other types of cost functions discussed for curves and surfaces. In such cases, the following proposition can be applied to compare solutions of the original problem with their discrete approximations.
Proposition 10.16
Assume that V is continuously embedded in \(C^{p+1}_0({\mathbb {R}}^d, {\mathbb {R}}^d)\). Consider a family of optimal control problems minimizing
with \(U^{(n)}\) continuous for the \((p, \infty )\)-compact topology. Let U be continuous with respect to the same topology and assume that, for some \(p>0\), the following uniform convergence is true: for all \(A>0\) and \(\varepsilon >0\), there exists an \(n_0\) such that, for all \(n\ge n_0\), for all \(\varphi \in \mathrm {Diff}^{p, \infty }_0\) such that \(\max (\Vert \varphi \Vert _{p, \infty }, \Vert \varphi ^{-1}\Vert _{p, \infty }) < A\), one has \(|U^{(n)}(\varphi ) - U(\varphi )|<\varepsilon \).
Then, given a sequence \(v^{(n)}\) of minimizers of (10.48), one can extract a subsequence \(v^{(n_k)}\) that weakly converges to v in \({\mathcal X}^2_V\), with v minimizing
Proof
Let w be a minimizer of (10.49). Since our assumptions imply that \(U^{(n)}(\varphi ^{w}_{01})\) converges to \(U(\varphi ^{w}_{01})\) (so that their difference is bounded), we see that \(E^{(n)}(w) \le E(w) + C\) for some constant C, so that, letting \(v^{(n)}\) be a minimizer of \(E^{(n)}\), we have \(\Vert v^{(n)}\Vert ^2_{{\mathcal X}^2_V} \le 2E^{(n)}(v^{(n)}) \le 2E^{(n)}(w) \le 2E(w) + 2C\). From this we find that \(v^{(n)}\) is a bounded sequence in \({\mathcal X}^2_V\), so that, replacing it with a subsequence if needed, we can assume that it weakly converges to some \(v\in {\mathcal X}^2_V\). Applying Theorem 7.13, we find that \(\varphi ^{v^{(n)}}_{01}\) converges to \(\varphi ^v_{01}\) in the \((p, \infty )\)-compact topology. Moreover, Theorem 7.10 implies that the sequences \((\Vert \varphi _{01}^{v^{(n)}}\Vert _{p, \infty }, \Vert \varphi _{10}^{v^{(n)}}\Vert _{p, \infty })\) are bounded. Applying the uniform convergence of \(U^{(n)}\) to U on bounded sets and the continuity of U, we see that \(U^{(n)}(\varphi _{01}^{v^{(n)}})\) converges to \(U(\varphi ^v_{01})\) as n tends to infinity. Since, in addition
we obtain the fact that \(E(v) \le \liminf E^{(n)}(v^{(n)})\). We also have
so that \(E(v)= E(w)\) and v is also a minimizer of (10.49). \(\square \)
Curves and Surfaces. We can apply this theorem to curve and surface matching according to the following discussion, in which we focus on surface matching using currents, but which can, with very little modification, be applied to curves, and to measure or varifold matching terms. Let \(\varSigma \) and \(\tilde{\varSigma }\) be regular surfaces and \(S^{(n)}\), \(\tilde{S}^{(n)}\) be sequences of triangulated surfaces that converge to them as defined before Theorem 4.3. Let (fixing an RKHS W with kernel \(\xi \))
using the vector measures defined in Eq. (9.49), and
using the discrete version as in (9.56). Then, Theorem 4.3, slightly modified to account for double integrals, can be used to check that the assumptions of Proposition 10.16 are satisfied.
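To make the discrete current construction concrete in the curve case: each edge of a polygonal line contributes a Dirac vector measure at its midpoint weighted by the edge vector, and the \(W^*\) norm of a difference of currents expands into double kernel sums. A sketch with an assumed scalar Gaussian kernel \(\xi \) (acting as a multiple of the identity):

```python
import numpy as np

def current_inner(c1, t1, c2, t2, sigma=0.5):
    # <mu, mu'>_{W*} for vector measures sum_i t1_i delta_{c1_i} and
    # sum_j t2_j delta_{c2_j}, with xi(x, y) = exp(-|x-y|^2 / (2 sigma^2)) Id
    d2 = np.sum((c1[:, None, :] - c2[None, :, :]) ** 2, axis=-1)
    return np.sum(np.exp(-d2 / (2 * sigma ** 2)) * (t1 @ t2.T))

def curve_current_distance(P, Q, sigma=0.5):
    """||mu_P - mu_Q||^2_{W*} for polygonal curves given as vertex arrays;
    each edge contributes its midpoint (center) and its edge vector."""
    cP, tP = (P[:-1] + P[1:]) / 2, P[1:] - P[:-1]
    cQ, tQ = (Q[:-1] + Q[1:]) / 2, Q[1:] - Q[:-1]
    return (current_inner(cP, tP, cP, tP, sigma)
            - 2 * current_inner(cP, tP, cQ, tQ, sigma)
            + current_inner(cQ, tQ, cQ, tQ, sigma))
```

The distance of a curve to itself vanishes, while a rigidly translated copy is at positive distance, as expected of a norm on currents.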
Images. The image matching problem can be discretized using finite grids, assuming that the considered images are supported by the interval \([0,1]^d\). Consider the cost function
in which we assume, to simplify, that I and \(\tilde{I}\) are compactly supported (say, on \(\mathcal K = [-M, M]^d\)) and bounded. We start with a discretization that can be applied to general \(L^2\) functions. Let \(\mathcal G_n = \{-M+ 2^{-n+1}kM, k=0, \ldots , 2^n\}^d\) provide a discrete grid on \(\mathcal K\) and associate to each point \(z\in \mathcal G_n\) its Voronoï cell, \(\varGamma _n(z)\), given by the set of points in \(\mathcal K\) that are closer to z than to any other point in the grid (i.e., \(\varGamma _n(z)\) is the intersection of \(\mathcal K\) and the cube of size \(2^{-n}\) centered at z). Define
where
is the average value of I over \(\varGamma _n(z)\).
Define \(\tilde{I}^{(n)}\) similarly and consider the approximation of U given by \(U^{(n)}(\varphi ) = \Vert I^{(n)}\circ \varphi ^{-1} - \tilde{I}^{(n)}\Vert _2^2\). Then \(U^{(n)}\) and U satisfy the hypotheses of Proposition 10.16.
Indeed, assume that \(\max (\Vert \varphi \Vert _{1, \infty }, \Vert \varphi ^{-1}\Vert _{1, \infty }) < A\). We have
where the second inequality is obtained after a change of variable in the first \(L^2\) norm and C(A) is an upper bound for the Jacobian determinant of \(\varphi \) depending only on A. As a consequence, Proposition 10.12 will be true as soon as one shows that \(I^{(n)}\) and \(\tilde{I}^{(n)}\) converge in \(L^2\) to I and \(\tilde{I}\) respectively (and will also be true for any sequence of approximations of I and \(\tilde{I}\) that satisfies this property). The \(L^2\) convergence is true in our case because \(I^{(n)}\) is the orthogonal projection of I on the space \(W_n\) of \(L^2\) functions that are constant on each set \(\varGamma _n(z), z\in \mathcal G_n\). This implies that \(I^{(n)}\) converges in \(L^2\) to the projection of I on \(W_\infty = \overline{\bigcup _{n\ge 1} W_n}\) (see Proposition A.11), but one has \(W_\infty =L^2\), because any function J orthogonal to this space would have its integral vanish on any dyadic cube, which is only possible for \(J=0\).
Note that, with this approximation, one can write
where |A| denotes the volume of \(A\subset {\mathbb {R}}^d\). To make this expression computable, one needs to approximate the sets \(\varphi (\varGamma _n(z))\), where the simplest approximation is to take the polyhedron formed by the image of the vertices of \(\varGamma _n(z)\) by \(\varphi \) (which will retain the same topology as the original cube if n is large enough). The verification that this approximation is valid (in the sense of Proposition 10.16) is left to the reader.
However, even with this approximation, the numerical problem remains computationally demanding, since it becomes a point-set problem over \(\mathcal G_n\), which is typically a huge set. Most current implementations use a simpler scheme, in which \(I^{(n)}\) is interpolated between the values \((I(z), z\in \mathcal G_n)\), which are therefore assumed to be well defined, and the cost function is simply approximated by
Here again, we leave it to the reader to check that this provides a valid approximation in the sense of Proposition 10.16 as soon as, say, I and \(\tilde{I}\) are continuous and one uses a linear interpolation scheme, as described below.
Using this approximation (for a fixed n that we will drop from the notation), we now work out the implementation in more detail, starting with the computation of the gradient in (10.45). Assume that time is discretized at \(t_k = kh\) for \(h=1/Q\) and that \(v_k(\cdot ) = v(t_k, \cdot )\) is discretized over a regular grid \({\mathcal G}\).
It will be convenient to introduce the momentum and express \(v_k\) in the form
We can consider \((\rho _k(z), z\in {\mathcal G})\) as new control variables, noting that (10.45) directly provides the gradient of the energy in \(V^*\), namely
From this expression, we see that we can interpret the family \((\rho _k(z), z\in {\mathcal G})\) as discretizing a measure, namely
Given this, the gradient in \(V^*\) can be discretized as
which can be used to update \(\rho _k(z)\).
The last requirement for a fully discrete procedure is to select interpolation schemes for the computation of the diffeomorphisms \(\varphi ^v\) and for the compositions of I and \(I'\) with them. Interpolation algorithms (linear or cubic, for example) are standard procedures included in many software packages [234]. Mathematically, they are linear operators that take a discrete signal f on a grid \({\mathcal G}\) (i.e., \(f\in {\mathbb {R}}^{\mathcal G}\)) and return a function, denoted \({\mathcal R}f\), defined everywhere. By linearity, we must have
for some “interpolants” \(r_z(\cdot ), z\in \mathcal G\). In the approximation of the data attachment term, one can then replace I by \(\mathcal R(I_{|_{\mathcal G}})\), the interpolation of the restriction of I to \(\mathcal G\).
Linear interpolation, for example, corresponds, in one dimension, to \(r_z(x) = 1-2^n|z - x|\) if \(|z-x| < 2^{-n}\) and 0 otherwise. In dimension d, one takes
if \(\max _i(|z_i-x_i|) < 2^{-n}\) and 0 otherwise (where \(z = (z_1, \ldots , z_d)\) and \(x = (x_1, \ldots , x_d)\)).
Given an interpolation operator \({\mathcal R}\), one can replace, say, \(I\circ \varphi _{t_k0}(z)\) in the expression of the gradient by
For computational purposes, it is also convenient to replace the definition of \(v_k\) in (10.50) by an interpolated form
because the inner sum can be computed very efficiently using Fourier transforms (see the next section).
To complete the discretization, introduce
where an empty product of compositions is equal to the identity, so that \(\psi _{lk}\) is an approximation of \(\varphi _{t_kt_l}\). Define the cost function, which is explicitly computable as a function of \(\rho _0, \ldots , \rho _{Q-1}\):
If we make a variation \(\rho \mapsto \rho + \varepsilon \delta \rho \), then \(v\mapsto v + \varepsilon \delta v\) with (using the interpolated expression of v)
and letting \(\delta \psi _{lk} = \partial _\varepsilon \psi _{lk}\), we have, by direct computation
Using this, we can compute the variation of E, yielding
This provides the expression of the gradient of the discretized E in \(V^*\), namely
10.7.2 Kernel-Related Numerics
Most of the previously discussed methods involve repeated computations of linear combinations of the kernel. A basic such step is to compute, given points \(y_1, \ldots , y_M\), \(x_1, \ldots , x_N\) and vectors (or scalars) \(\alpha _1, \ldots , \alpha _N\), the sums
Such sums are involved when deriving velocities from momenta, for example, or when evaluating dual RKHS norms in curve or surface matching.
Computing these sums explicitly requires NM evaluations of the kernel (and this typically several times per iteration of an optimization algorithm). When N or M is reasonably small (say, less than 1,000), such a direct evaluation is not a problem. But for large-scale methods, such as triangulated surface matching, where the surface may have tens of thousands of nodes, or image matching, where a three-dimensional grid typically has millions of nodes, this becomes infeasible (the feasibility limit has, however, been pushed further by recent efficient implementations on GPUs [59, 157, 247]).
If \(x=y\) is supported by a regular grid \(\mathcal G\), and K is translation invariant, i.e., \(K(x, y) = \varGamma (x-y)\), then, letting \(x_k = hk\) where k is a multi-index (\(k=(k_1, \ldots , k_d)\)) and h the discretization step, we see that
is a convolution that can be implemented with \(O(N\log N)\) operations, using fast Fourier transforms (with \(N = |\mathcal G|\)). The same conclusion holds if K takes the form \(K(x, y) = A(x)^T\varGamma (x-y) A(y)\) for some matrix A (which can be used to censor the kernel at the boundary of a domain), since the resulting operation is
which can still be implemented in \(O(N\log N)\) operations.
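In one dimension, this convolution structure can be sketched as follows, evaluating \(v_k = \sum _l \varGamma (h(k-l))\,\alpha _l\) with zero-padded FFTs and checking against the direct double sum (the Gaussian \(\varGamma \) is an assumption):

```python
import numpy as np

def kernel_sum_fft(alpha, h, gamma):
    """Evaluate v[k] = sum_l gamma(h (k - l)) alpha[l] on a regular 1-D grid
    in O(N log N) via zero-padded FFTs."""
    N = len(alpha)
    shifts = h * np.arange(-(N - 1), N)  # all grid differences h(k - l)
    g = gamma(shifts)                    # kernel samples, length 2N - 1
    L = 2 * N - 1
    full = np.fft.irfft(np.fft.rfft(g, L) * np.fft.rfft(alpha, L), L)
    # g[m] stores gamma(h (m - (N-1))), so output index k sits at full[k + N - 1]
    return full[N - 1: 2 * N - 1]

def kernel_sum_direct(alpha, h, gamma):
    # O(N^2) reference computation
    N = len(alpha)
    k = np.arange(N)
    return np.array([np.sum(gamma(h * (kk - k)) * alpha) for kk in k])
```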
The situation is less favorable when x and y are not regularly spaced. In such cases, feasibility must come with some approximation.
Still assuming a translation-invariant kernel \(K(x, y) = \varGamma (x-y)\), we can associate to a grid \(\mathcal G\) in \({\mathbb {R}}^d\) the interpolated kernel
where the \(r_z\)’s are interpolants adapted to the grid. This approximation provides a non-negative kernel, with null space equal to the space of functions with vanishing interpolation on \(\mathcal G\). With such a kernel, we have
The computation of this expression therefore requires using the following sequence of operations:
1. Compute, for all \(z'\in \mathcal G\), the quantity
$$ a_{z'} = \sum _{k=1}^N r_{z'}(x_k) \alpha _k. $$
Because, for each \(x_k\), only a fixed number of the \(r_{z'}(x_k)\) are non-vanishing, this requires O(N) operations.
2. Compute, for all \(z\in \mathcal G\),
$$ b_z = \sum _{z'\in \mathcal G} \varGamma (h(z-z')) a_{z'}, $$
which is a convolution requiring \(O(|\mathcal G|\log |\mathcal G|)\) operations.
3. Compute, for all \(j=1, \ldots , M\), the interpolation
$$ \sum _{z\in \mathcal G} r_z(y_j) b_z, $$
which requires O(M) operations.
So the resulting cost is \(O(M+N+|\mathcal G|\log |\mathcal G|)\), which must be compared to the original O(MN), the comparison being favorable essentially when MN is larger than the number of nodes in the grid, \(|\mathcal G|\). This formulation (which has been proposed in [156]) has the advantage that the resulting algorithm is quite simple, and that the resulting \(K_{\mathcal G}\) remains a non-negative kernel, which is important.
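The three steps above can be sketched in one dimension with linear interpolants (the Gaussian \(\varGamma \), the grid and the point configuration are assumptions); step 2 is written with np.convolve for clarity but would use FFTs at scale:

```python
import numpy as np

def tent_weights(pts, grid):
    # linear interpolants r_z(x) on a regular 1-D grid
    h = grid[1] - grid[0]
    return np.clip(1.0 - np.abs(pts[:, None] - grid[None, :]) / h, 0.0, None)

def grid_kernel_sum(y, x, alpha, grid, sigma=1.0):
    """Approximate sum_k Gamma(y_j - x_k) alpha_k using the grid kernel K_G."""
    h = grid[1] - grid[0]
    a = tent_weights(x, grid).T @ alpha                  # step 1: spread, O(N)
    m = len(grid)
    g = np.exp(-(h * np.arange(-(m - 1), m)) ** 2 / (2 * sigma ** 2))
    b = np.convolve(g, a)[m - 1: 2 * m - 1]              # step 2: grid convolution
    return tent_weights(y, grid) @ b                     # step 3: interpolate, O(M)
```

On a grid that is fine relative to the kernel width, the approximation agrees with the exact sums up to the (second-order) interpolation error.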
Another class of methods, called “fast multipole”, computes sums such as
by taking advantage of the fact that K(y, x) varies slowly as x varies in a region which is far away from y. By grouping the \(x_k\)’s in clusters, assigning centers to these clusters and approximating the kernel using asymptotic expansions valid at a large enough distance from the clusters, fast multipole methods can organize the computation of the sums with a resulting cost of order \(M+N\) when M sums over N terms are computed. Although the total number of operations remains bounded by a constant times \(M+N\), it increases (through the size of the constant) with the required accuracy. The interested reader may refer to [30, 140] for more details.
Another important operation involving the kernel is the inversion of the system of equations (say, with a scalar kernel)
This is the spline interpolation problem, but it is also part of several of the algorithms that we have discussed, including for example the projection steps that have been introduced to obtain a gradient in the correct metric.
Such a problem is governed by an uncertainty principle [258] that trades the accuracy of the approximation against the stability of the system (10.52). Accuracy is measured by the distance between a smooth function \(x\mapsto u(x)\) and its interpolation
where \(\alpha _1, \ldots , \alpha _N\) are given by (10.52) with \(u_k = u(x_k)\). Stability is measured by the condition number (the ratio of the largest to the smallest eigenvalue) of the matrix \(S(x) = (K(x_i, x_j), i, j=1, \ldots , N)\), evaluated as a function of the smallest distance between two distinct \(x_k\)’s (note that S(x) is singular if two \(x_k\)’s coincide).
When \(K(x, y) = \varGamma (x-y)\), the trade-off is measured by how fast \(\xi \mapsto \hat{\varGamma }(\xi )\) (the Fourier transform of \(\varGamma \)) decreases at infinity. One extreme is given by the Gaussian kernel, for which \(\hat{\varGamma }\) decreases like \(e^{-c|\xi |^2}\), making it highly accurate but highly unstable. On the other side of the range are Laplacian kernels, which decrease polynomially in the Fourier domain. Faced with this dilemma, one possible rule is to prefer accuracy for small values of N, therefore using a kernel like the Gaussian, and to opt for stability for large-scale problems (using a Laplacian kernel with high enough degree).
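A quick numerical illustration of the stability side of this trade-off (the bandwidths and the point configuration are assumptions): on the same interpolation points, the Gaussian Gram matrix is orders of magnitude worse conditioned than the Laplacian one.

```python
import numpy as np

# 30 regularly spaced interpolation points on [0, 1]
pts = np.linspace(0.0, 1.0, 30)
r = np.abs(pts[:, None] - pts[None, :])

# condition numbers of the two kernel matrices S(x) = (K(x_i, x_j))
cond_gauss = np.linalg.cond(np.exp(-r ** 2 / (2 * 0.2 ** 2)))  # Fourier decay e^{-c|xi|^2}
cond_laplace = np.linalg.cond(np.exp(-r / 0.2))                # polynomial Fourier decay
```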
For the numerical inversion of system (10.52), iterative methods, such as conjugate gradient, should be used (especially for large N). Methods using preconditioned conjugate gradient have been introduced, for example, in [105, 141] and the interested reader may refer to these references for more details.
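A minimal conjugate-gradient sketch for the symmetric positive-definite system \(S(x)\alpha = u\) of (10.52) (the Laplacian kernel used to build S in the usage below is an assumption, chosen to keep the system well conditioned):

```python
import numpy as np

def conjugate_gradient(S, u, tol=1e-12, max_iter=500):
    # textbook conjugate gradient for a symmetric positive-definite system
    alpha = np.zeros_like(u)
    r = u - S @ alpha        # residual
    p = r.copy()             # search direction
    rs = r @ r
    for _ in range(max_iter):
        Sp = S @ p
        t = rs / (p @ Sp)    # exact line search along p
        alpha += t * p
        r -= t * Sp
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return alpha
```

In exact arithmetic the iteration terminates in at most N steps; in practice the convergence rate is governed by the condition number of S, which is why preconditioning [105, 141] matters for ill-conditioned kernels.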
© 2019 Springer-Verlag GmbH Germany, part of Springer Nature
Younes, L. (2019). Diffeomorphic Matching. In: Shapes and Diffeomorphisms. Applied Mathematical Sciences, vol 171. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-662-58496-5_10