1 Introduction

Image registration is a common problem that arises in many applications of medical image analysis. The problem is to find a non-rigid transformation which aligns two images. In this paper, we focus on the variational formulation of this problem, which has found broad acceptance in the literature [4, 5, 33]. The sought transformation corresponds to the minimum of a functional, which trades off the image similarity of the transformed target and the reference image \(I_T,I_R:\Omega \rightarrow \mathrm{I\!R}\), defined on some domain \(\Omega \), against the smoothness of the image transformation \(\varvec{\varphi }:\Omega \rightarrow \Omega \). Formally, the problem can be written as

$$\begin{aligned} \mathcal {J}[\varvec{\varphi }] := \mathcal {D}[I_R,I_T,\varvec{\varphi }]+\mathcal {S}[\varvec{\varphi }], \end{aligned}$$
(1)

where \(\mathcal {D}\) is an image distance measure and \(\mathcal {S}\) a regularizer. Using the calculus of variations, a minimizer of (1) can be found by solving a system of partial differential equations.

An elegant framework to minimize Eq. (1) was proposed by Beuthien et al. [4], where a minimum is reached using a convolution-based approach. Different regularization properties can be achieved by choosing different convolution kernels. However, their approach only addresses stationary kernels and only works efficiently if the kernel is separable.

In this paper, we present a generalization of this framework in which the regularization kernels may be nonseparable or even nonstationary. By employing low-rank tensor decomposition [18], we approximate nonseparable convolution kernels by separable 1D kernels in order to perform the convolution separately in each space dimension by successive 1D convolutions. Compared to nonseparable convolution, this dramatically reduces the computational complexity while accurately retaining the regularization properties. For instance, in 3D the separable Gaussian kernel requires three 1D convolutions, while a rank-4 nonseparable kernel needs \(4\times 3=12\) 1D convolutions.

Furthermore, we extend the framework to nonstationary regularizers, i.e. regularizers which depend on the spatial location. The corresponding spatially varying local filter kernels require efficient handling, which we address with a caching scheme. In addition, we use an extension of the framework to ensure that the resulting transformations are diffeomorphic. As in [28, 37, 38], in each iteration, we compute an efficient approximation of the exponential mapping that keeps the transformations diffeomorphic.

While this generalization is useful in a wide variety of registration tasks, our original motivation was to integrate landmarks into the regularization. In Fig. 1, we illustrate the practical importance of hybrid landmark and image registration. In this example, the landmarks help to greatly improve the registration accuracy of the patellar surface of a human femur. Using our nonstationary filtering approach, we can efficiently compute a solution to the hybrid registration problem as formulated by Lüthi et al. [21], which is conceptually appealing, but has so far been computationally infeasible. The idea is to integrate landmarks directly into the regularization, which in our terms means to minimize a slightly different functional

$$\begin{aligned} \hat{\mathcal {J}}[\varvec{\varphi }] := \mathcal {D}[I_R,I_T,\varvec{\varphi }]+\hat{\mathcal {S}}[\varvec{\varphi },X], \end{aligned}$$
(2)

where \(X\) is a list of landmark displacements. Since the regularizer \(\hat{\mathcal {S}}\) considers the given displacements at the landmark positions, the resulting kernel is nonstationary. Therefore, simple convolution approaches are not applicable, since the filter kernel varies for each image location. However, in our framework, we are able to efficiently handle the regularizer’s local dependency. Hence, performing hybrid image registration in terms of Eq. (2) becomes computationally feasible even for 3D images.

Fig. 1

The first two figures from the left show the reference and target femur surfaces, with the landmark points depicted in red and green, respectively. The patellar surface is marked in dark gray. In the right two figures, the reference is transformed to the target with two different methods: first using the standard diffeomorphic Demons [39], which does not incorporate the landmarks and clearly misaligns the patellar surface; in the last figure, the registration was performed with our hybrid approach, which considers the landmarks and results in an accurate patellar surface alignment (Color figure online)

Non-rigid image registration has been extensively studied in the literature, and several attempts have been made to reach a general framework for different regularizers [5, 33]. For an overview of image registration methods, we refer to the survey paper of Sotiras et al. [30], and more specifically to the book of Modersitzki [23] for a deeper discussion of variational methods for image registration. For diffeomorphic regularization we refer to [13, 28, 37, 38]. Nonstationary filtering methods have been used e.g. in [6, 31], where the regularization is locally adjusted depending on local image features such as curvature, or on local transformation properties like stiffness. Different hybrid methods, which combine landmarks and image features, have been proposed in the literature. For example, in [14, 16, 26] the landmarks are treated as additional constraints. These methods require a perfect interpolation of the landmarks, which causes numerical problems during optimization. Other methods do not enforce the landmark constraints strictly, but add the landmark differences as another cost term to the functional in Eq. (1) [2, 17, 20, 25, 29]. In our work, we were inspired by the Bayesian approach of Lüthi et al. [21], where these two kinds of methods are combined by integrating the landmarks into the regularization. This makes the uncertainty on the landmarks independent of the data term and reduces the search space to transformations that agree with the landmark displacements. Contrary to the approach of Lüthi et al. [21], however, our nonstationary filtering technique is able to efficiently handle large 3D images.

Our paper is structured as follows: in the background section, we briefly introduce the variational image registration framework and present the ideas of the hybrid image registration framework of Lüthi et al. [21] adapted to our method. In the subsequent method section, we present our separable and nonstationary filtering approach. In our experiments, we show registration results using different positive definite kernel functions such as Gaussian and exponential kernels. Using low-rank approximations of nonseparable kernels, we show an accuracy gain as well as improved convergence properties during the registration process. Applying a nonstationary kernel, we demonstrate the memory savings achieved by our caching scheme based on tensor decomposition. Furthermore, we discuss in more detail the introductory patellar surface example, where the incorporation of the landmarks leads to better registration performance. Finally, we discuss the advantages and challenges related to the landmark based transformation.

2 Background

2.1 Variational Image Registration Framework

Considering dense image features, for instance gray scale values of a CT image, formulating the registration problem as a variational optimization problem has proven very useful in the literature [4, 5, 28, 33, 37, 38]. A mapping that registers the two images \(I_R\) and \(I_T\) is sought as the minimum of the joint functional (1). Using methods from the calculus of variations, the functional is differentiated with respect to the mapping \({\varvec{\varphi }}\). If we denote the space of all admissible mappings as \(\Phi \), the functional derivatives of the two terms are \(d\mathcal {D}=:\mathbf {f}:\Phi \rightarrow \mathrm{I\!R}^d\) and \(d\mathcal {S}=:\mathcal {A}:\Phi \rightarrow \Phi \). \(\mathcal {A}\) is typically a differential operator, and a minimum of the functional has to satisfy the system of partial differential equations

$$\begin{aligned} \mathcal {A}[\varvec{\varphi }]=\mathbf {f}(\varvec{\varphi }). \end{aligned}$$
(3)

Many different strategies to solve this PDE and the associated minimization problem have been put forward, including finite difference methods, finite element methods and B-spline based methods.

The fastest method to solve (3), on which our work is based, is a convolution approach [28, 33]. This approach is possible if the fundamental solution or "Green's function" of the operator \(\mathcal {A}\) is known. The Green's function is then a positive definite kernel function \(k:\Omega \times \Omega \rightarrow \mathrm{I\!R}\), and (3) can be solved by the iteration scheme

$$\begin{aligned} \varvec{\varphi }^{i+1}=k*(\varvec{\varphi }^i \circ \mathbf {f}(\varvec{\varphi }^i)), \end{aligned}$$
(4)

where \(*\) denotes the convolution operation.

The classical example is the Demons algorithm, where \(\mathcal {A}=-\Delta \) and \(k\) is the Gaussian kernel. In principle, it is also possible to choose the kernel directly, without actually specifying the corresponding operator. Long et al. [19] proposed to use the exponential kernel \(k(x,y)=\frac{1}{C_d}\cdot e^{-\frac{\Vert x-y\Vert }{\alpha }}\), where \(C_d\) is a normalization constant for the number of dimensions \(d\): \(C_1=2\alpha \), \(C_2=2\pi \alpha ^2\), \(C_3=8\pi \alpha ^3\). They showed that, compared to the Gaussian kernel, this kernel better approximates the linear elasticity regularization, which leads to better transformation properties (see also Steinke and Schölkopf [32]). Since the kernel is not separable, applying it directly in this framework is computationally prohibitive. In our experiments later in this paper, we also show registration results using this regularizer.
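For illustration, the following NumPy sketch discretizes this kernel on an \(m^3\) voxel neighborhood, as needed for the filtering discussed in Sect. 3. The function name is ours and the default width \(m=23\) is merely the value used in our experiments in Sect. 4.1; our actual implementation extends ITK in C++ (Sect. 4).

```python
import numpy as np

def exponential_kernel(m=23, alpha=1.0, spacing=1.0):
    """Discrete unit impulse response of the exponential kernel on an m^3
    neighborhood: k(0, x) = exp(-||x|| / alpha) / C_3 with C_3 = 8*pi*alpha^3."""
    assert m % 2 == 1, "an odd width keeps the kernel centered"
    ax = spacing * (np.arange(m) - m // 2)
    X, Y, Z = np.meshgrid(ax, ax, ax, indexing="ij")
    dist = np.sqrt(X**2 + Y**2 + Z**2)
    return np.exp(-dist / alpha) / (8.0 * np.pi * alpha**3)

H0 = exponential_kernel()  # the discrete tensor H_0 used in Sect. 3.1
```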

2.2 Neighborhood Preservation

Common simple regularizers, such as the ones discussed in the previous section, penalize non-smooth transformations. They can neither avoid foldings nor guarantee invertible transformations. In medical applications, however, invertible transformations are preferred. Various authors [1, 8, 28, 35, 37, 38] have therefore proposed an additional restriction of the mappings \(\varvec{\varphi }^i\) by modeling \(\varvec{\varphi }\) using geodesic flows of diffeomorphisms [3]. Following the diffeomorphic Demons approach [37], the restriction of an optimal displacement field to be diffeomorphic, \(\varvec{\varphi }\in Diff (\Omega )\), can be achieved by mapping the current transformation \(\varvec{\varphi }^i\) back onto the Lie group of diffeomorphisms [1, 28, 37, 38]. This ensures that the transformation is invertible and neighborhood preserving. A possible way to do so, which only marginally changes the optimization scheme in Eq. (4), is to apply the group exponential map \(\exp \) of the Lie group of diffeomorphisms to the field update

$$\begin{aligned} \varvec{\varphi }^{i+1}=k*(\varvec{\varphi }^i \circ \exp (\mathbf {f}(\varvec{\varphi }^i))). \end{aligned}$$
(5)

The exponential mapping can be efficiently approximated by a scaling and squaring algorithm [1, 37].
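A minimal sketch of this approximation, assuming the velocity/update field is given as a displacement field sampled on the voxel grid and using linear interpolation for the composition (SciPy-based; function names are ours):

```python
import numpy as np
from scipy.ndimage import map_coordinates

def compose(u, v):
    """Displacement-field composition: (Id+u) o (Id+v) = Id + v + u(x + v(x)).
    Fields have shape (3, X, Y, Z) in voxel units; linear interpolation."""
    grid = np.indices(u.shape[1:], dtype=float)
    warped = grid + v
    u_at = np.stack([map_coordinates(u[d], warped, order=1, mode="nearest")
                     for d in range(3)])
    return v + u_at

def field_exp(f, n=6):
    """Scaling and squaring: exp(f) ~ (Id + f / 2^n) self-composed n times."""
    u = f / 2.0**n
    for _ in range(n):
        u = compose(u, u)
    return u
```

The initial scaling makes \(\mathrm{Id}+\mathbf {f}/2^n\) a good approximation of \(\exp (\mathbf {f}/2^n)\); the \(n\) self-compositions then recover \(\exp (\mathbf {f})\).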

2.3 Hybrid Image Registration

In hybrid image registration, both dense image features and landmarks are available. In addition to the images \(I_R\) and \(I_T\), we are given lists \(X_R=\{x_i^R\}_{i=1}^n\) and \(X_T=\{x_i^T\}_{i=1}^n\) of \(n\) corresponding landmark points for each image. The displacements induced by these landmarks are given by

$$\begin{aligned} X&= \{(x_1^R,x_1^T-x_1^R),\ldots ,(x_n^R,x_n^T-x_n^R)\}\\&= :\{(x_1,y_1),\ldots ,(x_n,y_n)\}. \end{aligned}$$

Further, let \(u:\Omega \rightarrow \mathrm{I\!R}^d\) be an unknown displacement field, such that the warp is \(\varvec{\varphi }(x)=x+u(x)\) for \(x\in \Omega \), and the target image is warped as \(I_T(x+u(x))\). In [21], Lüthi et al. modeled the prior knowledge about \(u\) using a Gaussian process \(\mathcal {GP}(\mu ,k)\), which is defined by a mean function \(\mu :\Omega \rightarrow \mathrm{I\!R}^d\) and a covariance function \(k:\Omega \times \Omega \rightarrow \mathrm{I\!R}\). In our terms, this means to minimize

$$\begin{aligned} \mathcal {J}_{GP}[\varvec{\varphi }] := \mathcal {D}_{GP}[I_R,I_T,\mu ,u]+\mathcal {S}_{GP}[k,u], \end{aligned}$$

where the target image is warped by \(I_T(x+\mu (x)+u(x))\). Assuming a zero mean, this functional is equivalent to the functional (1) introduced at the beginning of this paper. However, the strength of this interpretation is that we can now formulate the hybrid registration problem by conditioning the Gaussian process on the \(n\) given landmark displacements. The resulting posterior process \(\mathcal {GP}_X(\mu _{X},k_{X})\) is given in closed form by

$$\begin{aligned} \mu _{X}(x)&= \mu (x) + K_X(x)^T(K+\sigma ^2Id)^{-1}Y \\ k_{X}(x,y)&= k(x,y) - K_X(x)^T(K+\sigma ^2Id)^{-1}K_X(y),\nonumber \end{aligned}$$
(6)

where \(K_X(x)=(k(x,x_i))_{i=1}^n\in \mathrm{I\!R}^n\), \(K\in \mathrm{I\!R}^{n\times n}\) is the kernel matrix with entries \(K_{i,j}=k(x_i,x_j)\), \(Y=(y_1-\mu (x_1),\ldots ,y_n-\mu (x_n))^T\in \mathrm{I\!R}^n\) are the mean-free landmark displacements, and \(\sigma ^2\) models the uncertainty about the matching accuracy of the landmarks (see also e.g. Rasmussen [27, Chap. 2.2]). Hence, the functional we minimize becomes

$$\begin{aligned} \mathcal {J}_{GP_X}[\varvec{\varphi }] := \mathcal {D}_{GP_X}[I_R,I_T,\mu _{X},u]+\mathcal {S}_{GP_X}[k_{X},u], \end{aligned}$$
(7)

using \(\mu _X\) as landmark based mean transformation and the kernel function \(k_X\) for regularization.
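The conditioning in Eq. (6) is standard Gaussian process regression and is straightforward to express in code. The following NumPy sketch (scalar-valued kernel, zero prior mean as assumed above; the function name and vectorization are ours) returns the posterior mean and kernel as callables:

```python
import numpy as np

def condition_gp(k, X_lm, Y, sigma2):
    """Condition a zero-mean GP with scalar kernel k on the n landmark
    displacements Y (array of shape (n, d)) at positions X_lm (Eq. 6)."""
    n = len(X_lm)
    K = np.array([[k(xi, xj) for xj in X_lm] for xi in X_lm])
    W = np.linalg.inv(K + sigma2 * np.eye(n))         # (K + sigma^2 Id)^{-1}
    K_X = lambda x: np.array([k(x, xi) for xi in X_lm])

    mu_X = lambda x: K_X(x) @ W @ Y                   # posterior mean, in R^d
    k_X = lambda x, y: k(x, y) - K_X(x) @ W @ K_X(y)  # posterior kernel
    return mu_X, k_X
```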

Since \(k_X\) depends on the landmark displacements, it is not stationary and cannot directly be handled by the optimization scheme (4) introduced above. In the following, we present the necessary adjustments to still benefit from the advantages of the framework.

3 Methods

In this paper, we generalize the variational image registration framework to kernel functions which may be nonseparable and nonstationary. While the optimization scheme in (4) is conceptually not restricted to separable convolution kernels, nonseparable filtering is computationally prohibitive. In Sect. 3.1, we present a separable 1D filter approximation for nonseparable filters, based on low-rank tensor approximation, which enables us to perform the convolution efficiently.

For the generalization to nonstationary kernels, which vary with the spatial location, Eq. (4) has to be rewritten. We write the convolution integral explicitly, but with a kernel which is not stationary (cf. the work of McOwen in [22, Chap. 2.3] and Evans [11, Chap. 2.3]). This yields the integral equation

$$\begin{aligned} \varvec{\varphi }^{i+1}=\int \limits _\Omega k_{X}(\cdot ,s)(\varvec{\varphi }^i \circ \mathbf {f}(\varvec{\varphi }^i))(s)\,\mathrm{{d}}s. \end{aligned}$$
(8)

Similarly, with \(\varvec{\varphi }\) restricted to be diffeomorphic, Eq. (5) becomes

$$\begin{aligned} \varvec{\varphi }^{i+1}=\int \limits _\Omega k_{X}(\cdot ,s)(\varvec{\varphi }^i \circ \exp (\mathbf {f}(\varvec{\varphi }^i)))(s)\,\mathrm{{d}}s. \end{aligned}$$
(9)

In Sect. 3.2, we further introduce a nonstationary filtering approach, which makes the approximation of the integrals (8) and (9) computationally feasible.

3.1 Separable Filter Decomposition

Applying the proposed optimization scheme to image registration requires the discretization of the formulation in Eq. (9). To make this concrete, we start by writing the spatially discretized version of Eq. (5), where the kernel \(k\) becomes a 3D tensor \(H_0\):

$$\begin{aligned} \varvec{\varphi }^{i+1}(x)= H_0*(\varvec{\varphi }^i \circ \exp (\mathbf {f}(\varvec{\varphi }^i)))_x. \end{aligned}$$
(10)

\(H_0\) is the discrete unit impulse response of \(k\) with elements \(H_{0_{ijq}}=k(0,(i,j,q)^T)\) and \(i,j,q\) cover the neighborhood around \(0\), while the subscript \(x\) of the second term indicates the equally large discrete neighborhood around the point \(x\).

If the kernel is separable, the iteration scheme of Eq. (10) can be accelerated greatly by performing the convolution separately in each space dimension by successive 1D convolutions. The Gaussian kernel has this property of separability, so the convolution with it can be performed separately without further effort. To still benefit from this performance gain for nonseparable kernels, like the exponential kernel [19], their separability has to be approximated. In 2D, this can be achieved by standard singular value decomposition. In 3D, however, this leads to mathematical challenges beyond standard linear algebra, since a filter kernel in 3D is a third-order tensor. In contrast to 2D matrices, determining the rank of a specific given higher-order tensor is NP-hard (see Kolda and Bader [18]). Hence, the rank \(R\) becomes a parameter which has to be estimated. Nevertheless, we are able to compute the approximation using a CANDECOMP/PARAFAC (CP) decomposition model [18]. This gives us separable 1D approximations of the discrete unit impulse response \(H_0\). Fig. 2 illustrates the decomposition model. Such a decomposition can be formulated as a minimization problem

$$\begin{aligned} \min _{\tilde{H}_0}\Vert H_0 - \tilde{H}_0\Vert \quad \mathrm{{ with}}\quad \tilde{H}_0=\sum _{r=1}^R a_r\otimes b_r\otimes c_r, \end{aligned}$$
(11)

where the operation \(\otimes \) denotes the three-way outer product \(\tilde{H}_{0_{ijk}}=\sum _r^R a_{ir}b_{jr}c_{kr}\). Standard algorithms to optimize (11) are based on the alternating least squares (ALS) method [15], which is explained in more detail in the following section. The parameter \(R\) is estimated by testing the approximation performance for different ranks (see Sect. 4.1).

Fig. 2

CP tensor decomposition model of the third-order tensor \(H\).

Once the decomposition is performed, the distributivity (12) and the associativity (13) of the convolution operation can be exploited to perform the convolution separately in each dimension with \(a_r,b_r\) and \(c_r\)

$$\begin{aligned} H_0*\mathbf {f}\approx \tilde{H}_0*\mathbf {f}&= \left[ \sum _r^R a_r\otimes b_r\otimes c_r\right] *\mathbf {f}\nonumber \\&= \sum _r^R\left[ a_r\otimes b_r\otimes c_r*\mathbf {f}\right] \end{aligned}$$
(12)
$$\begin{aligned}&= \sum _r^R a_r*\left( b_r*\left( c_r*\mathbf {f}\right) \right) . \end{aligned}$$
(13)

For cubic filter kernels with filter width \(m\), the computational cost per output pixel thus reduces significantly from \(m^3\) to \(3Rm\).
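Concretely, Eq. (13) amounts to \(R\) passes of three successive 1D convolutions, summed up. A NumPy/SciPy sketch, where the factor matrices \(A,B,C\) hold the vectors \(a_r,b_r,c_r\) as columns (the boundary handling is our assumption):

```python
import numpy as np
from scipy.ndimage import convolve1d

def separable_convolve(f, A, B, C):
    """Rank-R separable convolution, Eq. (13): sum_r a_r*(b_r*(c_r*f)).
    f is a 3D array; A, B, C are (m, R) factor matrices whose columns
    are the 1D kernels a_r, b_r, c_r."""
    out = np.zeros_like(f, dtype=float)
    for r in range(A.shape[1]):
        tmp = convolve1d(f, C[:, r], axis=2, mode="nearest")
        tmp = convolve1d(tmp, B[:, r], axis=1, mode="nearest")
        out += convolve1d(tmp, A[:, r], axis=0, mode="nearest")
    return out
```

In the registration loop, this replaces the full 3D convolution in Eq. (10), applied to each component of the displacement field.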

3.1.1 Alternating Least Squares Method

The low-rank approximation of \(H_0\) is calculated by solving the optimization problem (11). The minimizers \(a_r,b_r\) and \(c_r\) are obtained using the alternating least squares (ALS) method [15]. For this purpose, we introduce a notation to represent a tensor in matrix form.

Let \(H\in \mathrm{I\!R}^{P\times Q \times R}\) be a third-order tensor. By fixing one index the tensor is sliced into two-dimensional sections which have horizontal (mode-1), lateral (mode-2) and frontal (mode-3) orientation for the indices \(\{1,2,3\}\) respectively. The mode-\(n\) unfolding denoted as \(H_{(n)}\) concatenates the mode-\(n\) slices horizontally to a matrix.

Following Kolda and Bader [18], the CANDECOMP/PARAFAC model can be expressed as

$$\begin{aligned} H = [\![A,B,C]\!] \equiv \sum _{r=1}^R a_r\otimes b_r\otimes c_r, \end{aligned}$$

while \(H_{(1)} = A(C\odot B)^T\), with \(A = (a_1, a_2, \ldots , a_R)\) and likewise \(B\) and \(C\). Here, \(\odot \) denotes the Khatri-Rao product, i.e. the column-wise Kronecker product.

The matrices \(A,B\) and \(C\) which minimize (11) can be calculated by alternately fixing all but one matrix, e.g. \(A\), and minimizing

$$\begin{aligned} \min _A \Vert H_{(1)} - A(C\odot B)^T\Vert _F, \end{aligned}$$

which has the optimum at

$$\begin{aligned} A = H_{(1)}[(C\odot B)^T]^{\dagger }. \end{aligned}$$

Using the special property that

$$\begin{aligned} (A\odot B)^{\dagger }=(A^{T}A \star B^{T}B)^{\dagger }(A\odot B)^T, \end{aligned}$$

where \(\star \) is the Hadamard (element-wise) product and \(A^{\dagger }\) the Moore-Penrose pseudo-inverse, the equations can be iteratively solved for \(A, B\) and \(C\)

$$\begin{aligned} A&= H_{(1)} (C\odot B) (B^{T}B \star C^{T}C)^{\dagger }\\ B&= H_{(2)} (C\odot A) (A^{T}A \star C^{T}C)^{\dagger }\\ C&= H_{(3)} (B\odot A) (A^{T}A \star B^{T}B)^{\dagger } \end{aligned}$$

until the values of \(A,B\) and \(C\) converge. The convergence speed depends on the initialization of the fixed matrices. A common choice for the initialization is to use the Higher-order SVD [10] discussed in Sect. 3.2.2.
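The following NumPy sketch implements these updates (mode-\(n\) unfolding in the column ordering of Kolda and Bader [18]; random initialization is used for brevity where an HOSVD initialization would be the common choice):

```python
import numpy as np

def unfold(H, n):
    """Mode-n unfolding H_(n) in the column ordering of Kolda and Bader."""
    return np.reshape(np.moveaxis(H, n, 0), (H.shape[n], -1), order="F")

def khatri_rao(A, B):
    """Khatri-Rao product: column-wise Kronecker, (I,R),(J,R) -> (I*J,R)."""
    return np.einsum("ir,jr->ijr", A, B).reshape(-1, A.shape[1])

def cp_als(H, R, n_iter=100, seed=0):
    """ALS iterations for the rank-R CP model H ~ [[A, B, C]] (Sect. 3.1.1)."""
    rng = np.random.default_rng(seed)
    A, B, C = (rng.standard_normal((H.shape[i], R)) for i in range(3))
    for _ in range(n_iter):
        A = unfold(H, 0) @ khatri_rao(C, B) @ np.linalg.pinv((B.T @ B) * (C.T @ C))
        B = unfold(H, 1) @ khatri_rao(C, A) @ np.linalg.pinv((A.T @ A) * (C.T @ C))
        C = unfold(H, 2) @ khatri_rao(B, A) @ np.linalg.pinv((A.T @ A) * (B.T @ B))
    return A, B, C  # columns are the a_r, b_r, c_r of Eq. (11)
```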

Since \(H_0\) is now decomposed, the convolution in (10) can be performed separately.

3.2 Efficient Nonstationary Filtering

While stationary kernels \(k(x-y)\) only depend on the difference of \(x\) and \(y\), nonstationary kernels \(k(x,y)\) depend on both arguments. For such kernels, separable filtering is therefore not possible, since the associativity no longer holds. If we spatially discretize the integral equation (9)

$$\begin{aligned} \varvec{\varphi }^{i+1}(x)= H_x*(\varvec{\varphi }^i \circ \exp (\mathbf {f}(\varvec{\varphi }^i)))_x, \end{aligned}$$
(14)

where \(H_x\) is the discrete impulse response of \(k\) at location \(x\), i.e. \(H_{x_{ijq}}=k(x,(i,j,q)^T)\), we see that \(H\) now depends on \(x\), which makes the problem nonstationary. In general, the calculation of all the local impulse responses makes the problem computationally intractable. However, in the particular case where we minimize the hybrid functional (7), we can exploit the following properties of the landmark kernel \(k_X\) to reach an algorithm which is computationally feasible.

3.2.1 Landmark Kernel Properties

The landmark kernel \(k_X\) consists of the kernel \(k\) minus a landmark-dependent term. The difference between \(k\) and the full landmark kernel \(k_X\) becomes negligible if

$$\begin{aligned} \forall x_i\in X\quad k(x,x_i)<\xi \end{aligned}$$
(15)

i.e. if \(x\) is not in the neighborhood of any landmark. We exploit this property to approximate the integral in Eq. (9) by only considering \(k\), the first part of the landmark kernel \(k_X\), wherever the value of its second part vanishes. We perform the approximation in two steps:

1. At first, the whole image is filtered separately using the stationary part \(k\).

2. Subsequently, the nonseparable and nonstationary filtering with the full kernel \(k_{X}\) is performed, but only for pixels where (15) is not fulfilled (see the sketch below).
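Selecting the pixels for the second step amounts to masking the voxels that violate condition (15). A NumPy sketch, assuming the stationary kernel \(k\) is vectorized over its first argument (the example Gaussian and all names are ours):

```python
import numpy as np

def landmark_support_mask(shape, landmarks, k, xi):
    """Boolean mask of voxels where condition (15) fails, i.e. where
    k(x, x_i) >= xi for at least one landmark x_i. Only these voxels
    need the full nonstationary filter k_X."""
    d = len(shape)
    coords = np.stack(np.indices(shape), axis=-1).reshape(-1, d).astype(float)
    mask = np.zeros(coords.shape[0], dtype=bool)
    for x_i in landmarks:
        mask |= k(coords, x_i) >= xi
    return mask.reshape(shape)

# example stationary part: a Gaussian, vectorized over the first argument
gauss = lambda X, c, s=5.0: np.exp(-np.sum((X - c)**2, axis=-1) / (2 * s**2))
```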

The second step is the most expensive one, because the discrete local impulse response \(H_x\) has to be calculated for each point in the vicinity of the landmarks. This requires a number of kernel evaluations that grows cubically with the filter width, covering the neighborhood of all points having landmark support, in each iteration of Eq. (9). To reduce the computational demands, we propose the following caching scheme.

3.2.2 Local Filter Caching

Since the landmark kernel is nonstationary but still time-invariant, it is reasonable to keep the computed filter kernels in memory to save computational time in the following iterations. This, however, only shifts the problem: the amount of memory needed to cache all the filter kernels grows rapidly with the filter width and the number of landmarks. Therefore, we compress these local filter kernels, again taking advantage of tensor decomposition, before we cache them in memory.

As we saw in Sect. 3.1.1, the CP decomposition is obtained by the ALS method, which is quite costly due to its iterative nature. Because \(H_0\) has to be decomposed only once, ALS is still well suited for approximating the separability of the stationary filter. However, it is too slow to decompose all the local impulse responses \(H_x\).

Compared to the CP decomposition, the Tucker decomposition [34] (see Fig. 3) is significantly faster. It is an alternative model to decompose a tensor: as in the CP model, the tensor is decomposed into triplets of vectors, but these are weighted by a dense, so-called “core” tensor.

$$\begin{aligned} \tilde{H}_x = [\![G;A,B,C]\!] \equiv \sum _{p=1}^P\sum _{q=1}^Q\sum _{r=1}^Rg_{pqr}a_p\otimes b_q\otimes c_r,\nonumber \\ \end{aligned}$$
(16)

where \(g_{pqr}\) are the elements of the core tensor \(G\) and \(P,Q,R\) are the ranks for each space dimension. In this model the unfolded tensor \(H\) is represented as

$$\begin{aligned} H_{(1)} = AG_{(1)}(C\bullet B)^T, \end{aligned}$$

where \(\bullet \) denotes the Kronecker product. Using the Higher-order SVD algorithm of De Lathauwer et al. [10],

$$\begin{aligned} \min \Vert H_{(1)} - AG_{(1)}(C\bullet B)^T\Vert \end{aligned}$$

can be very efficiently minimized by setting \(A,B\) and \(C\) to the leading left singular vectors of the corresponding mode-\(n\) unfolding \(H_{(n)}\)

$$\begin{aligned} A = U_P^{(1)},\, B = U_Q^{(2)},\, C = U_R^{(3)}, \end{aligned}$$

where \(U_l^{(n)}\) is the matrix consisting of the leading \(l\) left singular vectors of \(H_{(n)}\), and the core is given by \(G_{(1)}=A^T H_{(1)}(C\bullet B)\).
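A compact sketch of this truncated HOSVD in NumPy (the einsum computes the core \(G = H \times _1 A^T \times _2 B^T \times _3 C^T\); the function name is ours):

```python
import numpy as np

def hosvd(H, ranks):
    """Truncated higher-order SVD: factor n holds the leading left singular
    vectors of the mode-n unfolding H_(n); G is the core tensor of Eq. (16)."""
    U = []
    for n in range(3):
        Hn = np.reshape(np.moveaxis(H, n, 0), (H.shape[n], -1), order="F")
        u, _, _ = np.linalg.svd(Hn, full_matrices=False)
        U.append(u[:, :ranks[n]])
    A, B, C = U
    G = np.einsum("ijk,ip,jq,kr->pqr", H, A, B, C)  # G = H x1 A^T x2 B^T x3 C^T
    return G, A, B, C
```

In contrast to the iterative ALS, this requires only three SVDs per local kernel, which is what makes caching all \(H_x\) practical.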

Fig. 3

Tucker tensor decomposition model

Compared to the CP model, the Tucker decomposition is a less restricted model, where the core \(G\) can be dense, while in the CP model the core is a superdiagonal tensor with ones on the diagonal. Although it cannot be used for separable filter approximation due to the weighting with the dense \(G\), the memory savings are similar to the CP model. Setting \(P=Q=R\) and having a filter length \(m\), the memory consumption reduces from \(m^3\) to \(R^3+3Rm\) per voxel in the support of the landmarks. In this paper, we chose \(P,Q\) and \(R\) by testing the resulting approximation performance.
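As a concrete illustration with values that appear in our experiments (filter width \(m=23\) as in Sect. 4.1 and ranks \(P=Q=R=5\) as in Sect. 4.2), the per-voxel memory for a cached kernel drops from

$$\begin{aligned} m^3 = 23^3 = 12167 \qquad \text {to}\qquad R^3+3Rm = 125+345 = 470 \end{aligned}$$

values, a reduction by roughly a factor of \(26\).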

Listing 1 (the full hybrid registration algorithm)

3.3 Multi-Resolution Versus Multi-Scale

We presented a method to minimize the registration functional (7). It is mainly based on the local iterative minimization scheme (4). As such, it relies on a reasonable initialization and is prone to getting “stuck” in local minima. To deal with that, we adopt a multi-resolution strategy [40]. The support in voxels of the kernel function \(k\) implicitly increases towards the lower resolution levels. Therefore, in combination with the posterior mean function \(\mu _X\), we use a multiscale kernel \(\tilde{k}\) (cf. Opfer [24]), which combines kernels with different support, to compute the landmark based mean transformation \(\mu _X\) (6):

$$\begin{aligned} \tilde{k}(x,y) = \sum _{l=0}^L{\lambda _l k^l(x,y)}, \end{aligned}$$

where \(\lambda _l\) are positive weights, \(k^l\) corresponds to \(k\) with kernel parameters adjusted to scale level \(l\), and \(L\) is the number of scale levels. For the Gaussian kernel, for example, the parameter becomes \(\sigma _g^l=\sigma _g\cdot 2^{L-l}\). We have set the weights \(\lambda _l\) to \(10^{-l}\).

3.4 The Algorithm

By joining all the previously described building blocks, we have obtained a non-rigid image registration framework in which different regularizers can be implemented by simply exchanging the regularization kernel, even if it is nonseparable or nonstationary. In particular, the landmark kernel is supported by our framework. The diffeomorphic regularization is also approximated, as shown in Eq. (14). Moreover, we presented a multiscale approach that brings the landmark mean together with the image-based optimization on different resolution levels. The full algorithm, which minimizes Eq. (7) by joining all the presented concepts, is provided in Listing 1.

In the following, the performance of our filter approximation techniques is evaluated in detail, while we also provide a qualitative hybrid registration example.

4 Results

We presented a method which enables an efficient approximation of the optimization scheme in Eq. (9). In this section, we perform registration experiments to validate our method. First, we provide a detailed study of the separable filter approximation and discuss its performance in terms of accuracy and computational aspects. Second, we analyze the local filter compression with respect to memory consumption, computational demands and approximation accuracy. We compare our method with Elastix [17], where the landmarks are incorporated as an additional cost term in the functional (1). This is followed by a qualitative result for the introductory patellar surface example, which we likewise compare with Elastix. In an additional section, we discuss the landmark based mean transformation \(\mu _X\) in more detail.

As quality measures, we use the target registration error (TRE), the Dice coefficient (DICE), and the number of singularities of the displacement fields, i.e. the number of voxels where the determinant of the Jacobian is smaller than zero. To compare two displacement fields \(A\) and \(B\), we consider, for each vector pair, the difference in magnitude and the directional discrepancy. Based on this, we define the accuracy loss:

$$\begin{aligned} \tau (A,B)&= \int \limits _\Omega \left( \Vert A(x)\Vert -\Vert B(x) \Vert \right) ^2 \\&+\left\| \frac{A(x)}{\Vert A(x)\Vert }\times \frac{B(x)}{\Vert B(x)\Vert }\right\| ^2 \mathrm{{d}}x,\nonumber \end{aligned}$$
(17)

which satisfies \(\tau (A,A)=0\) and is greater than zero for dissimilar displacement fields.
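On discrete displacement fields, Eq. (17) becomes a sum over all voxels. A NumPy sketch (the small \(\epsilon \) guarding against division by zero-length vectors is our addition):

```python
import numpy as np

def accuracy_loss(A, B, eps=1e-12):
    """Discrete accuracy loss of Eq. (17) for fields of shape (X, Y, Z, 3):
    squared magnitude difference plus the squared cross product of the
    normalized vectors, summed over all voxels."""
    nA = np.linalg.norm(A, axis=-1)
    nB = np.linalg.norm(B, axis=-1)
    cross = np.cross(A / (nA[..., None] + eps), B / (nB[..., None] + eps))
    return float(np.sum((nA - nB)**2 + np.sum(cross**2, axis=-1)))
```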

Since we only compare different regularizers and their approximations, we use the sum of squared differences similarity measure for all experiments:

$$\begin{aligned} \mathcal {D}[I_R,I_T,\varvec{\varphi }] := \int \limits _\Omega \left( I_T\circ \varvec{\varphi }(x)-I_R(x)\right) ^2\mathrm{{d}}x. \end{aligned}$$

Following Thirion [33], we perform second order gradient descent on \(\mathcal {D}\) and obtain the forces

$$\begin{aligned} \mathbf {f}(\varvec{\varphi }(x))= -\frac{(I_T\circ \varvec{\varphi }(x)-I_R(x))\nabla I_T\circ \varvec{\varphi }(x)}{\kappa ^2\left( I_T\circ \varvec{\varphi }(x)-I_R(x)\right) ^2+\Vert \nabla I_T\circ \varvec{\varphi }(x)\Vert ^2}, \end{aligned}$$

with \(\kappa ^2\) the reciprocal of the mean squared image spacing.
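For reference, these forces are cheap to evaluate once the warped target \(I_T\circ \varvec{\varphi }\) is available. A NumPy sketch (gradients by finite differences; the \(\epsilon \) in the denominator is our addition):

```python
import numpy as np

def demon_forces(I_R, I_T_warped, kappa2, eps=1e-12):
    """Second-order gradient descent forces on the SSD measure; I_T_warped
    is I_T o phi, kappa2 the reciprocal mean squared image spacing."""
    diff = I_T_warped - I_R
    grad = np.stack(np.gradient(I_T_warped), axis=-1)   # grad(I_T o phi)
    denom = kappa2 * diff**2 + np.sum(grad**2, axis=-1) + eps
    return -(diff[..., None] * grad) / denom[..., None]
```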

We always set the prior mean function \(\mu \) to a rigid pre-alignment of the images.

We implemented our algorithm by extending the finite difference solver framework of the Insight Toolkit [39] and performed the experiments on an Intel Xeon CPU @ 3 GHz with 12 cores.

Elastix Configuration For the registration with Elastix [17] we used the B-spline transformation model combined with the mean squares metric and an LBFGS optimizer. For the landmark examples we combined the mean squares metric with the “Corresponding Points Euclidean Distance Metric” which is equivalent to the target registration error.

4.1 POPI Breathing Thorax Model

In this first experiment, we show quantitative results for different approximation ranks of the nonseparable exponential kernel \((\alpha =1)\) without considering landmarks. The filter \(H_0\) has been discretized on a \(23^3\) voxel neighborhood. We compare the results to the exact method, which is obtained with the same kernel, but without separable filtering. We used the POPI dataset provided by the Léon Bérard Cancer Center & CREATIS lab, Lyon, France [36], which contains \(10\) CT images of a breathing lung. The images have a resolution of \(482\times 360\times 141\) voxels and a spacing of \(0.98\times 0.98\times 2\) mm. For our experiment, the images have been resampled to \(235\times 175\times 141\) voxels and scaled to an isotropic spacing of 2 mm. We chose image number 0 as the reference image and ran the experiments on a single scale level. We repeated the experiment, increasing the rank \(R\) of the separable filter approximation from one to four. \(R=1\) corresponds to the rank-one approximation used in Beuthien et al. [4], which serves as our baseline. The exact method corresponds to the algorithm of Long et al. [19] extended to 3D.

Fig. 4 shows, for each experiment, the image error averaged over the nine registrations as it evolves during the optimization. In the first three experiments, the convergence rate decreases with increasing rank \(R\), while the resulting image error becomes smaller. One can also observe that for \(R\ge 3\) the image error stays nearly the same and is close to the exact method. Moreover, the variance of the image error narrows with higher \(R\). It can be assumed that for \(R>4\) no significantly improved approximation can be achieved. For a better comparison, all mean curves are shown together in the last plot. Furthermore, in Table 1, the results of the experiments are summarized in numerical terms. Evaluating the accuracy loss (17) between the approximations and the exact method, higher rank approximations reach greater accuracy. The CPU time is considerably higher for the exact method. With a third of the effort, we achieve a good approximation, accepting only a very small loss of accuracy. For a more detailed comparison, we repeated the whole experiment on different scale levels. The results are listed in Table 2 and the upper part of Table 3. Note that all quantities are averaged over the nine experiments.

Fig. 4

This figure shows the image error averaged over all nine experiments for approximation ranks one to four as well as for the exact method. For each experiment, the mean error is plotted together with \(\pm \) one standard deviation in solid style and the max/min as dashed curves. The last subfigure shows the averages of all variants in one plot

Table 1 This table provides information about convergence properties with different approximation ranks
Table 2 This table provides information about the computational time needed for the different experiments
Table 3 In this table we show the performance evaluation of the POPI experiment with and without landmarks

The results show that for nonseparable kernels, a rank-one approximation is not accurate enough to capture the filter's regularization properties. The calculation time grows linearly with the rank \(R\). Since the resulting image error as well as the convergence properties for \(R=4\) do not significantly differ from the exact method, we conclude that \(4\) ranks are sufficient to approximate the exponential kernel separably.

For a meaningful comparison with Elastix, we performed three experiments. First, the smoothness parameter \(\sigma _\mathrm{{B-spline}}\) of the B-spline transform was tuned for a small TRE (\(\sigma _\mathrm{{B-spline}}=4\)). Second, \(\sigma _\mathrm{{B-spline}}\) was tuned such that no singularities are present in the result, while keeping the TRE as small as possible (\(\sigma _\mathrm{{B-spline}}=16\)). Finally, the parameter was chosen such that the resulting transformations are approximately as smooth as those obtained by our method (\(\sigma _\mathrm{{B-spline}}=64\)). To quantify the smoothness of a displacement field \(A\), we integrate over the local displacement changes:

$$\begin{aligned} \upsilon (A) = \int \limits _{\Omega }\int \limits _{\mathfrak {B}_x}\Vert A(s)-A(x)\Vert \mathrm{{ d}}s\mathrm{{d}}x, \end{aligned}$$

where \(\mathfrak {B}_x\) is the neighborhood around \(x\) with radius \(1\). The results in Table 3 show that one cannot have it both ways: in Elastix, there is a trade-off between the TRE and the smoothness of the transformation. \(\sigma _\mathrm{{B-spline}}\) can be tuned for a small TRE at the cost of a less smooth transformation, or it can be chosen such that the resulting transformation is smooth, but with a higher TRE. Our method, in contrast, reaches significantly smoother transformations than Elastix at a similar TRE. Since we regularize for diffeomorphic transformations, it was expected that, unlike Elastix with a small smoothness parameter, our results contain no singularities. As soon as \(\sigma _\mathrm{{B-spline}}\) is increased such that the transformations are as smooth as in our method, the TRE and DICE performance of Elastix drops dramatically.

To quantify the efficiency of our filter caching approach, we performed the experiments once more, but included the \(21\) landmarks provided with the POPI dataset. The landmark uncertainty was set to \(\sigma =0.02\). For comparison, the exact method, which combines the separable filtering and the landmarks, performs no compression. In Table 4, the average resources needed for each experiment are listed. As expected, the Tucker decomposition needs slightly more CPU time. However, this is negligible compared to the memory savings reached with this compression. Furthermore, the approximation of the local filter kernels is nearly perfect, resulting in a very small loss of accuracy. The most CPU-intensive part of each experiment is the first iteration, because initially all local filter responses have to be calculated. Without the caching scheme, the overall CPU time would therefore explode to CPU weeks.

Table 4 The table provides information about resources needed to perform the experiments

To compare our hybrid results with Elastix, we performed the hybrid B-spline registration twice, using a small and a large weight \(w\) for the landmark cost term, respectively (see Table 3). A large weight results in a smaller TRE, while the overall smoothness decreases. Several singularities are present in the Elastix results, while the singularities in our method are negligible. The major advantage of our method becomes apparent in the overall smoothness: despite the landmark consideration, it is much higher than in the Elastix experiment.

The TRE could be decreased further, but given the small uncertainty on the landmarks, we would have expected a smaller landmark error. This discrepancy originates from the discretization of the mean transformation \(\mu _X\), which in this experiment leads to a TRE drift of \(0.264 \pm 0.121\):

$$\begin{aligned} \mathrm{{TRE}}_\mathrm{{drift}} = \mathrm {TRE}({\mu _X}_\mathrm{{discretized}}) - \mathrm{{TRE}}({\mu _X}_\mathrm{{exact}}) \end{aligned}$$

In the experiment in Sect. 4.2, where the resolution is about twice as high, the discretization error results in a TRE drift of only \(0.056 \pm 0.002\), which is negligible. Therefore, the mean transformation should be discretized on a finer grid.

A note on parallelization Since our method is based on image filtering, it is well suited to performing the filtering for each voxel in parallel. Hence, the standard parallelization framework of ITK could directly be used to speed up the calculations. We performed the experiments with \(24\) processes and reached an average speedup between \(15\) and \(18\), as listed in Table 5. Because the landmarks are not evenly distributed over the image domain, the work load is not evenly distributed over the processes either; therefore, we reached a lower speedup in the hybrid registration experiment. The actual time needed to perform the experiments is the CPU time listed in Tables 1 and 2 divided by the speedups listed in Table 5. For example, calculating the 9 registrations on level 2 with the exact method took us \(2.5\) days instead of \(8.5\) weeks.

Table 5 The speedups were calculated by \(\dfrac{\hbox {Time (1 process)} - \hbox {Time (24 processes)}}{\hbox {Time (24 processes)}}\)

4.2 Patellar Surface Registration

We further performed a 3D experiment registering two femur shapes. The challenge with this kind of data is that the border of the patellar surface is potentially hard to recognize and its variation can be quite large, such that an accurate registration of the patellar surfaces using fully automatic algorithms is difficult. We obtained the patellar surfaces of the target and reference bone from an expert. By incorporating well-chosen landmarks, we can force our algorithm to register even the patellar surface correctly. The shapes were represented as signed distance images of \(353\times 327\times 491\) voxels (isotropic spacing 0.57 mm) and registered on 5 scale levels. For \(k\), we used the Gaussian kernel with \(\sigma _g=1\) and a landmark uncertainty of \(\sigma =0.3\times 10^{-3}\). We approximated the landmark kernel \(k_X\) with \(P=Q=R=5\). For illustration, we performed the experiment once without and once with the landmarks.

In Fig. 5, the warp fields are shown as arrows, resampled on the bone surface. Especially at the upper border of the patellar surface, one can see the strong impact of the landmarks. In Fig. 6, we plot the warped reference shape including the part marked in dark gray. Without the landmarks, the border of the patellar surface is clearly misaligned, while it is correctly registered when the landmarks are incorporated.

Fig. 5

The figures show the warp field resampled on the reference's surface, depicted as arrows. First row: registration without landmarks. Second row: registration including the landmarks (Color figure online)

Fig. 6

The figures show the warped reference shape including the colored patellar surface. First row: registration without landmarks. Second row: registration including the landmarks. Third row: ground-truth target shape

We performed the same experiment with Elastix and summarize the results in Table 6. The parameters of the hybrid B-spline registration were tuned with respect to the TRE and DICE performance measures. While Elastix brings the TRE down to nearly zero, a very large number of singularities is present in the resulting transformation and the Dice coefficient is rather low. Our method reaches a small TRE as well; furthermore, the singularity count is very low, the DICE quite high, and the displacement field smooth.

Table 6 Quantitative measures of the femur example

4.3 Smooth Mean Displacement

Since we can force multiple reference landmarks to match one single target landmark by setting \(\sigma \) equal to zero, \(\mu _X\) is not guaranteed to be invertible. In Fig. 7, an artificial example is shown where a grid is transformed by the mean displacement using different values of \(\sigma \). Setting \(\sigma \) to zero, or too small, results in unfavorable folds and barely makes sense in real-world medical problems. Therefore, in our patellar surface experiment, we have chosen the parameters such that folds in \(\mu _X\) hardly ever occur. The mean-transformed reference shape is shown in Fig. 8, where no holes can be identified on the surface.

Fig. 7

Transformed grid (\(200^2\) px, isotropic spacing of 0.1 mm) with mean displacement using the Gaussian kernel \((\sigma _g=6)\). There are three landmarks defined as reference and target points (red reference, green target; the yellow ones are equal for both). The uncertainty on the landmarks increases from left to right \((\sigma =0, 0.5\times 10^{-3}, 0.75\times 10^{-3}, 0.1\times 10^{-2}, 0.25\times 10^{-2})\). The arrows illustrate to which location a point is transformed by the displacement field (Color figure online)

Fig. 8

The figures show the reference shape warped by the mean transformation using the multiscale kernel \(\tilde{k}\) with five scale levels. Overall, the shape looks the same as the reference except in the regions of the landmarks, where there is a smooth transformation towards the target

Nevertheless, an inverse transformation can be obtained using the fixed-point approach of Chen et al. [7], where the inverse is iteratively approximated. An entirely different approach would be to perform diffeomorphic point matching [2, 9, 12, 13] to obtain an invertible mean displacement. This will be addressed in future work.

5 Conclusion

In this paper, we implemented an efficient variational image registration framework in which a large variety of positive definite kernels can be used for regularization. Compared to standard approaches, we are able to accurately approximate nonseparable regularizers by separable filters in order to relax the computational demands. With less than a third of the computational effort, we approximate the true regularization with a very small loss of accuracy, while the rank-1 approximation is three times faster but results in an accuracy loss which is one order of magnitude larger. Furthermore, using an efficient nonstationary filtering scheme, we allow for location-dependent regularization. This enables us to perform hybrid landmark and image registration by utilizing the landmark kernel, which incorporates landmark displacements as prior knowledge. For this purpose, at the cost of slightly more computational time, we reduce the memory usage by at least one order of magnitude. Additionally, we added the diffeomorphic constraint on the resulting transformation; its approximation does not significantly change the optimization scheme. The comparison with the hybrid B-spline registration shows that our method results in smoother displacement fields even when landmark displacements are incorporated. We also discussed challenges associated with the invertibility of the landmark based transformation. An additional prior on this transformation which ensures invertibility, similar to [2, 12], would further improve the registration. This will be addressed in future work.