1 Introduction

Deformable image registration is an important tool in a variety of medical image analysis tasks, such as multi-modality image alignment [12, 18, 25], statistical analysis for population image studies [26, 32, 35], atlas-guided image segmentation or classification [27, 30, 33], and object tracking with anomaly detection [11, 24]. In many clinical applications, it is desirable that the estimated transformations be diffeomorphisms (i.e., bijective, smooth mappings with smooth inverses) because they produce anatomically plausible images [7]. Despite recent achievements in treating diffeomorphic image registration as a fast learning task, current approaches often assume that the topology of objects present in the images is intact [6, 10, 17, 31]. Existing algorithms fail badly when appearance changes occur (e.g., missing data caused by pathology such as tumors, myocardial scars, or multiple sclerosis lesions) because they have little to no control over these unknown variables.

To address this issue, a few image metamorphosis algorithms have been developed that incorporate the modeling of appearance changes into registration functions [8, 14, 16, 21, 23]. Existing metamorphic image registration methods mainly fall into two categories: (i) excluding appearance changes via manually delineated segmentations of abnormal regions [21, 23], and (ii) treating the appearance changes as unknown variables to be estimated from the images [8, 14]. These approaches either depend heavily on manually segmented labels of 3D volumetric data, which are time- and labor-consuming to produce, or struggle to balance the effects of appearance vs. geometric changes. A recent work [8] developed a metamorphic autoencoder that estimates the deformation and appearance variations by decoupling the geometric and appearance representations in latent spaces. However, such a model is highly sensitive to parameter tuning because it has difficulty differentiating changes caused by geometric transformations from those caused by appearance.

In this paper, we develop a novel learning-based model of metamorphic image registration, named MetaMorph, that provides more robust and accurate registration results for images with appearance changes. In contrast to previous approaches [8, 14, 21, 23], we incorporate a new appearance-aware regularization into the network loss function that enforces a piecewise constraint on geometric transformation fields. This constraint is learned simultaneously from a jointly optimized segmentation task. In addition, we effectively augment the segmentation labels by utilizing the learned transformations during training. This not only substantially improves the segmentation performance, but also reduces the need for massive numbers of ground truth segmentation labels. The main contributions of our proposed MetaMorph are threefold:

  • To the best of our knowledge, MetaMorph is the first predictive registration algorithm that utilizes jointly learned segmentation maps to model appearance changes.

  • MetaMorph learns a new appearance-aware regularization that piecewisely constrains the variations of image intensities caused by geometric transformations separately from appearance changes.

  • The joint learning scheme of MetaMorph maximizes the mutual benefits of metamorphic image registration and segmentation.

To demonstrate the effectiveness of our model, we validate MetaMorph on real 3D human brain tumor MRIs. Experimental results show that MetaMorph outperforms the state-of-the-art learning-based registration models [6, 8] with substantially increased accuracy. The developed MetaMorph has great potential in various image-guided clinical interventions, e.g., real-time image-guided navigation systems for tumor removal surgery.

2 Background: Diffeomorphic Image Registration

In this section, we briefly review the concept of diffeomorphic image registration in the setting of large deformation diffeomorphic metric mapping (LDDMM) with a geodesic shooting algorithm [7, 20, 29].

Let S be the source image and T be the target image defined on a d-dimensional torus domain \(\varGamma = \mathbb {R}^d / \mathbb {Z}^d\) (\(S(x), T(x) : \varGamma \rightarrow \mathbb {R}\)). The problem of diffeomorphic image registration is to find the geodesic (a.k.a. shortest path) to generate time-varying diffeomorphisms \(\{\psi _t(x)\}: t \in [0,1] \) such that \(S \circ \psi _1\) is similar to T, where \(\circ \) is an interpolation operation that deforms S by the smooth deformation field \(\psi _1\). This is typically formulated as an optimization problem by minimizing an explicit energy function over the transformation fields \(\psi _t\) as

$$\begin{aligned} \mathrm{E}(v_t) = \mathrm{Dist}[S \circ \psi_1(v_t), T] + \mathrm{Reg}[\psi_t(v_t)], \end{aligned}$$
(1)

where the distance function \(\textrm{Dist}(\cdot , \cdot )\) measures the image dissimilarity between the source and the deformed image. Commonly used distance functions include a sum-of-squared difference of image intensities [7], normalized cross correlation [4], and mutual information [34, 36]. The regularization term \(\textrm{Reg}(\cdot )\) is a constraint that enforces the spatial smoothness of transformations, arising from a distance metric on the tangent space V of diffeomorphisms, i.e., an integral over the norm of time-dependent velocity fields \(\{v_t(x)\} \in V\),

$$\begin{aligned} \mathrm{Reg}(\psi_t) = \int_0^1 (L v_t, v_t) \, dt, \quad \text{with} \quad \frac{d\psi_t}{dt} = - D\psi_t \cdot v_t, \end{aligned}$$
(2)

where \(L: V\rightarrow V^{*}\) is a symmetric, positive-definite differential operator that maps a tangent vector \( v_t\in V\) into its dual space as a momentum vector \(m_t \in V^*\). We typically write \(m_t = L v_t\), or \(v_t = K m_t\), with K being an inverse operator of L. The notation \((\cdot , \cdot )\) denotes the pairing of a momentum vector with a tangent vector, which is similar to an inner product. Here, the operator D denotes a Jacobian matrix and \(\cdot \) represents element-wise matrix multiplication.
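To make these operators concrete, below is a minimal PyTorch sketch under the common assumption that \(L = (-\alpha \Delta + \mathrm{Id})^c\); the exponent c and the Fourier-domain implementation are our assumptions, not fixed by the text above. Because the images live on a torus domain \(\varGamma\), the FFT diagonalizes L, so both L and K reduce to pointwise multiplications in the frequency domain.

```python
import torch

def smooth_K(m: torch.Tensor, alpha: float = 3.0, c: int = 3) -> torch.Tensor:
    """Apply K = L^{-1} to a momentum field m of shape (2, H, W), assuming
    L = (-alpha * Laplacian + Id)^c on the periodic (torus) domain of Sect. 2."""
    H, W = m.shape[-2:]
    fx = torch.fft.fftfreq(H).view(H, 1)
    fy = torch.fft.fftfreq(W).view(1, W)
    # Eigenvalues of -alpha * (discrete Laplacian) on the periodic grid.
    lap = 2.0 * alpha * ((1.0 - torch.cos(2.0 * torch.pi * fx)) +
                         (1.0 - torch.cos(2.0 * torch.pi * fy)))
    v_hat = torch.fft.fft2(m) * (lap + 1.0) ** (-c)   # spectrum of K = L^{-1}
    return torch.fft.ifft2(v_hat).real

def apply_L(v: torch.Tensor, alpha: float = 3.0, c: int = 3) -> torch.Tensor:
    """Apply L (the inverse of K) to a velocity field: m = L v."""
    H, W = v.shape[-2:]
    fx = torch.fft.fftfreq(H).view(H, 1)
    fy = torch.fft.fftfreq(W).view(1, W)
    lap = 2.0 * alpha * ((1.0 - torch.cos(2.0 * torch.pi * fx)) +
                         (1.0 - torch.cos(2.0 * torch.pi * fy)))
    return torch.fft.ifft2(torch.fft.fft2(v) * (lap + 1.0) ** c).real
```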

A geodesic curve with a fixed endpoint is characterized by an extremum of the energy function (2) that satisfies the Euler-Poincaré differential (EPDiff) equation [2, 20],

$$\begin{aligned} \frac{\partial v_t}{\partial t} = - K \left[ (D v_t)^T \cdot m_t + D m_t \cdot v_t + m_t \cdot \mathrm{div}\, v_t \right], \end{aligned}$$
(3)

where \({\text {div}}\) is the divergence. This process in Eq. (3) is known as geodesic shooting, stating that the geodesic path \(\{\psi _t\}\) can be uniquely determined by integrating a given initial velocity \(v_0\) forward in time by using the rule (3).
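As an illustration of geodesic shooting, here is a hedged 2D sketch of Eq. (3), reusing apply_L and smooth_K from the sketch above; the central-difference Jacobian and the forward Euler integrator are simple discretization choices of ours.

```python
import torch

def jacobian(f: torch.Tensor) -> torch.Tensor:
    """Central-difference Jacobian of a field f of shape (2, H, W) on a periodic
    grid; returns D of shape (2, 2, H, W) with D[i, j] = d f_i / d x_j."""
    dx = (torch.roll(f, -1, dims=1) - torch.roll(f, 1, dims=1)) / 2.0
    dy = (torch.roll(f, -1, dims=2) - torch.roll(f, 1, dims=2)) / 2.0
    return torch.stack([dx, dy], dim=1)

def epdiff_rhs(v: torch.Tensor, alpha: float = 3.0, c: int = 3) -> torch.Tensor:
    """Right-hand side of EPDiff (Eq. 3): -K[(Dv)^T m + (Dm) v + m div v]."""
    m = apply_L(v, alpha, c)                          # momentum m = L v
    Dv, Dm = jacobian(v), jacobian(m)
    div_v = Dv[0, 0] + Dv[1, 1]
    rhs = (torch.einsum('ijhw,ihw->jhw', Dv, m)       # (Dv)^T m
           + torch.einsum('ijhw,jhw->ihw', Dm, v)     # (Dm) v
           + m * div_v)                               # m div v
    return -smooth_K(rhs, alpha, c)

def shoot(v0: torch.Tensor, steps: int = 10) -> list:
    """Forward Euler integration of v_t from t = 0 to 1 (geodesic shooting)."""
    vs, dt = [v0], 1.0 / steps
    for _ in range(steps):
        vs.append(vs[-1] + dt * epdiff_rhs(vs[-1]))
    return vs
```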

Therefore, we rewrite the optimization of Eq. (1) equivalently as

$$\begin{aligned} \mathrm{E}(v_0) = \mathrm{Dist}[S \circ \psi_1(v_0), T] + (L v_0, v_0), \quad \text{s.t.} \;\; \text{Eq.}~(2) \ \& \ \text{Eq.}~(3). \end{aligned}$$
(4)
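Putting the pieces together, the sketch below evaluates the shooting-based energy of Eq. (4) in 2D, using the sum-of-squared-differences distance mentioned above as one possible choice of Dist and reusing shoot and apply_L from the earlier sketches. The semi-Lagrangian update \(\psi_{t+dt}(x) \approx \psi_t(x - dt \, v_t(x))\) is a standard discretization of the flow in Eq. (2), not the paper's stated scheme.

```python
import torch
import torch.nn.functional as F

def warp(img: torch.Tensor, coords: torch.Tensor) -> torch.Tensor:
    """Bilinearly sample img (1, 1, H, W) at coords (2, H, W) in pixel units,
    where coords[0] holds row (y) and coords[1] column (x) positions."""
    H, W = img.shape[-2:]
    gx = 2.0 * coords[1] / (W - 1) - 1.0   # grid_sample expects (x, y) in [-1, 1]
    gy = 2.0 * coords[0] / (H - 1) - 1.0
    grid = torch.stack([gx, gy], dim=-1).unsqueeze(0)
    return F.grid_sample(img, grid, align_corners=True)

def integrate_deform(S: torch.Tensor, vs: list) -> torch.Tensor:
    """Deform S (1, 1, H, W) by the flow of the velocity sequence vs (Eq. 2)."""
    H, W = S.shape[-2:]
    ys, xs = torch.meshgrid(torch.arange(H, dtype=torch.float32),
                            torch.arange(W, dtype=torch.float32), indexing="ij")
    ident = torch.stack([ys, xs])          # identity map in pixel coordinates
    phi, dt = ident.clone(), 1.0 / (len(vs) - 1)
    for v in vs[:-1]:
        # psi_{t+dt}(x) ~= psi_t(x - dt * v_t(x)): pull phi back along the flow.
        back = ident - dt * v
        phi = torch.cat([warp(phi[0][None, None], back),
                         warp(phi[1][None, None], back)], dim=1)[0]
    return warp(S, phi)

def registration_energy(S, T, v0, alpha=3.0, c=3, steps=10):
    """E(v0) = SSD(S o psi_1, T) + (L v0, v0), integrating Eqs. (2) and (3)."""
    S_def = integrate_deform(S, shoot(v0, steps))
    ssd = ((S_def - T) ** 2).sum()         # one common choice of Dist (see above)
    return ssd + (apply_L(v0, alpha, c) * v0).sum()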

3 Our Model: MetaMorph

The objective function of diffeomorphic image registration in Eq. (4) works well under the condition that images are of good quality with preserved topology. This assumption breaks when corruptions such as appearance changes or occlusions occur. In this section, we first define an objective function for metamorphic image registration that explicitly models appearance changes. An appearance-aware regularization is developed to effectively suppress the negative influence of appearance changes on typical diffeomorphic image registration algorithms. We then develop a joint learning framework that includes i) a segmentation network for appearance-change detection, and ii) a metamorphic registration network incorporating the newly formulated objective function as part of the network loss.

Appearance-Aware Regularization. The purpose of metamorphic image registration is to find an optimal transformation \(\psi (v_0, \delta )\) that is composed of two variables: the optimal initial velocity field \(v_0\) and the appearance change \(\delta \). A recent work proposed to learn these variables via disentangled latent representations in an encoder-decoder neural network [8]. However, it is extremely challenging for this algorithm to differentiate the variations of image intensities caused by geometric transformations from appearance changes, since the two unavoidably compensate for each other. The ambiguity introduced by optimizing two compensating variables without any guidance prevents the search for accurate registration solutions. It also makes the algorithm highly sensitive to network parameters, with an increased risk of poor convergence. To alleviate this issue, we introduce an appearance-aware regularization into the registration framework, guided by learned segmentations of the appearance-changing areas.

Assume U is a union of the learned segmentations of appearance-changing areas from the source image S and the target image T. Analogous to Eq. (4), we define the appearance-aware regularization \({\textbf {Reg}}^*(\cdot )\) in the space of initial velocity fields. To suppress the effects of appearance variations, we piecewisely constrain the initial velocity fields through a segmentation indicator, i.e.,

$$\begin{aligned} \textbf{Reg}^*(v_0) = \left( L (v_0 \odot (1-U)), \, v_0 \odot (1-U) \right), \quad \text{s.t.} \;\; \text{Eq.}~(3), \end{aligned}$$
(5)

where \(\odot \) represents an element-wise multiplication between a vector field and a scalar field. For notational simplicity, we define \(\hat{v}_0 \overset{\varDelta}{=} v_0 \odot (1-U)\) in the following sections.
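A minimal sketch of this regularization, reusing apply_L from the Sect. 2 sketch and assuming U is a scalar field of shape (H, W) with values in [0, 1]:

```python
import torch

def appearance_aware_reg(v0: torch.Tensor, U: torch.Tensor,
                         alpha: float = 3.0, c: int = 3) -> torch.Tensor:
    """Reg*(v0) = (L v_hat, v_hat) with v_hat = v0 * (1 - U)  (Eq. 5)."""
    v_hat = v0 * (1.0 - U).unsqueeze(0)   # mask each channel of the velocity field
    m_hat = apply_L(v_hat, alpha, c)      # momentum of the masked velocity
    return (m_hat * v_hat).sum()          # the pairing (m, v), cf. Sect. 2
```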

With the newly defined regularization in Eq. (5), we arrive at the objective function of metamorphic image registration as

$$\begin{aligned} \textbf{E}^*[\hat{\psi}(\hat{v}_0)] = \textbf{Dist}^*[\hat{S} \circ \hat{\psi}_1(\hat{v}_0), \hat{T}] + \textbf{Reg}^*(\hat{v}_0), \end{aligned}$$
(6)

where \(\hat{S}\) and \(\hat{T}\) denote the source and target images with appearance changes masked out, i.e., \(\hat{S} = S \odot (1-U)\) and \(\hat{T} = T \odot (1-U)\). Here, \({\textbf {Dist}}^*[\cdot , \cdot ]\) is the image dissimilarity term that measures the dissimilarity between the deformed image and the target over the regions unaffected by appearance changes.

3.1 Predictive Metamorphic Image Registration

We develop a deep learning framework to jointly learn the segmentation for appearance change and the masked-out velocity field \(\hat{v}_0\). An overview of our proposed MetaMorph architecture is shown in Fig. 1.

Appearance changes can be masked out using fixed foreground segmentations obtained by pre-running image segmentation algorithms [21, 23]. However, performing manual annotation of segmentation labels is time- and labor-consuming. In this work, instead of using a fixed mask, we treat the appearance change as a variable predicted by the segmentation network and jointly optimize it with the registration solution. We utilize an encoder-decoder neural network to learn the segmentation masks and then apply them to the associated image pairs to mask out the appearance change. Although we adopt a UNet-based architecture for segmentation in this work [28], other networks such as recurrent residual neural networks [1] or transformer-based networks [9, 15] can also be easily plugged into the proposed method.
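The sketch below illustrates this masking step; seg_net stands for any of the backbones above, and the sigmoid-plus-threshold conversion of logits to binary masks is our assumption. In practice, keeping U soft (un-thresholded) lets gradients flow back into the segmentation network.

```python
import torch

def mask_appearance(S: torch.Tensor, T: torch.Tensor, seg_net, thr: float = 0.5):
    """Predict appearance-change masks for both images and mask them out.
    S, T: (1, 1, H, W) images; seg_net: a segmentation backbone returning logits."""
    mask_S = (torch.sigmoid(seg_net(S)) > thr).float()
    mask_T = (torch.sigmoid(seg_net(T)) > thr).float()
    U = torch.clamp(mask_S + mask_T, max=1.0)    # union of the two masks
    return S * (1.0 - U), T * (1.0 - U), U       # S_hat, T_hat, U
```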

With the developed segmentation network, now we are ready to formulate the loss function of MetaMorph,

$$\begin{aligned} \ell = \textbf{Dist}^*[\hat{S} \circ \hat{\psi}_1(\hat{v}_0), \hat{T}] + \textbf{Reg}^*(\hat{v}_0) + \gamma \cdot \ell_{seg}, \quad \text{s.t.} \;\; \text{Eq.}~(5). \end{aligned}$$
(7)

Here, \(\gamma \) is a weighting parameter that balances the segmentation and registration losses, and \(\ell _{seg}\) is a segmentation loss that maximizes the Sørensen-Dice coefficient [13] between the ground truth y and the prediction \(\hat{y}\),

$$\begin{aligned} \ell_{seg} = 1 - \text{Dice}(y, \hat{y}), \end{aligned}$$
(8)

where \(\text{Dice}(y, \hat{y}) = 2|y \cap \hat{y}| \, / \, (|y| + |\hat{y}|)\).
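In practice a soft, differentiable variant of this loss is typically used for training; one such sketch is below, where the smoothing constant is our assumption.

```python
import torch

def dice_loss(y: torch.Tensor, y_hat: torch.Tensor,
              smooth: float = 1e-6) -> torch.Tensor:
    """l_seg = 1 - Dice(y, y_hat), with a soft Dice coefficient (Eq. 8)."""
    inter = (y * y_hat).sum()
    return 1.0 - (2.0 * inter + smooth) / (y.sum() + y_hat.sum() + smooth)
```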

We adopt an approximated region-based mutual information (RMI) [36], a widely used distance metric for images from different domains. For simplicity, we let \(\hat{S}_{\psi }\) denote the deformed image. Let \(f(\hat{S}_{\psi })\) and \(f(\hat{T})\) denote the probability density functions of the deformed image and the target respectively, and let \(f(\hat{S}_{\psi }, \hat{T})\) be their joint probability density function. The image dissimilarity with RMI can be formulated as

$$\begin{aligned} \textbf{Dist}^*[\hat{S}_{\psi}, \hat{T}] = \text{RMI}(\hat{S}_{\psi}, \hat{T}) &= \int_{\hat{S}_{\psi}} \int_{\hat{T}} f(\hat{S}_{\psi}, \hat{T}) \log \frac{f(\hat{S}_{\psi}, \hat{T})}{f(\hat{S}_{\psi}) f(\hat{T})} \nonumber \\ &\approx \ell_{ce}(\hat{S}_{\psi}, \hat{T}) - \frac{1}{B} \sum_{b=1}^{B} I_b(\hat{T}; \hat{S}_{\psi}), \end{aligned}$$
(9)

where \(\ell_{ce}(\cdot , \cdot )\) is a cross-entropy loss between two images, and \(I_b (\cdot \, ; \cdot )\) is a batch-wise lower bound given by \(I_b (\hat{T}; \hat{S}_{\psi }) = \frac{1}{2}\log [\det (\varSigma _{\hat{T}|\hat{S}_{\psi }})]\), where \(\varSigma _{\hat{T}|\hat{S}_{\psi }}\) is the posterior covariance matrix of \(\hat{T}\) given \(\hat{S}_{\psi }\) (a symmetric positive semi-definite matrix). Here, B denotes the number of images in a mini-batch, indexed by b. Please refer to [36] for more derivation details.
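A hedged sketch of the batch-wise lower bound follows; it assumes the images have already been converted into per-image matrices of D-dimensional local-region features at N locations (the region extraction is detailed in [36]), and the small diagonal jitter for numerical stability is our addition.

```python
import torch

def rmi_lower_bound(T_hat: torch.Tensor, S_psi: torch.Tensor,
                    eps: float = 1e-6) -> torch.Tensor:
    """(1/B) sum_b I_b(T_hat; S_psi), with I_b = 1/2 log det(posterior cov) (Eq. 9).
    T_hat, S_psi: (B, D, N) local-region features of target and deformed image."""
    B, D, _ = T_hat.shape
    total = T_hat.new_zeros(())
    for b in range(B):
        y = T_hat[b] - T_hat[b].mean(dim=1, keepdim=True)
        x = S_psi[b] - S_psi[b].mean(dim=1, keepdim=True)
        cov_yy = y @ y.t()
        cov_xx = x @ x.t() + eps * torch.eye(D)
        cov_xy = x @ y.t()
        # Posterior covariance: Sigma_{Y|X} = Sigma_YY - Sigma_YX Sigma_XX^{-1} Sigma_XY.
        post = cov_yy - cov_xy.t() @ torch.linalg.solve(cov_xx, cov_xy)
        total = total + 0.5 * torch.logdet(post + eps * torch.eye(D))
    return total / B
```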

Fig. 1. An illustration of the network architecture for MetaMorph. Top, left to right: a pair of images is fed into the segmentation network, and the predicted labels are applied to the images to mask out the appearance change. Bottom, right to left: the masked image pair is fed into the registration network, which predicts a piecewise velocity field, integrates the geodesic constraints, and produces a deformed image together with a transformation-propagated segmentation. The deformed images and labels are circulated back into the segmentation network as augmented data.

We develop an alternating optimization scheme [22] to minimize the network loss defined in Eq. (7). All network parameters are optimized jointly by alternating between the training of the segmentation and registration networks. A summary of the joint learning in MetaMorph is given in Algorithm 1.
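A high-level sketch of one alternating iteration is shown below, reusing the earlier sketches; reg_net (the registration network) and rmi_dist (the Eq. (9) dissimilarity assembled from rmi_lower_bound) are hypothetical stand-ins, and the exact alternation schedule is our assumption.

```python
import torch

def train_step(S, T, y_S, y_T, seg_net, reg_net, opt_seg, opt_reg, gamma=0.5):
    """One alternating iteration minimizing the loss of Eq. (7)."""
    # 1) Segmentation step: update seg_net with the weighted Dice loss (Eq. 8).
    opt_seg.zero_grad()
    loss_seg = gamma * (dice_loss(y_S, torch.sigmoid(seg_net(S)))
                        + dice_loss(y_T, torch.sigmoid(seg_net(T))))
    loss_seg.backward()
    opt_seg.step()

    # 2) Registration step: freeze the masks, then predict, shoot, and deform.
    opt_reg.zero_grad()
    with torch.no_grad():
        S_hat, T_hat, U = mask_appearance(S, T, seg_net)
    v0 = reg_net(torch.cat([S_hat, T_hat], dim=1))[0]   # (2, H, W) initial velocity
    u = U[0, 0]                                         # (H, W) union mask
    S_def = integrate_deform(S_hat, shoot(v0 * (1.0 - u)))
    loss_reg = rmi_dist(S_def, T_hat) + appearance_aware_reg(v0, u)  # hypothetical rmi_dist
    loss_reg.backward()
    opt_reg.step()
    return loss_seg.item(), loss_reg.item()
```

The deformed image and its propagated label can then be fed back to seg_net as the augmented training data described above.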

4 Experimental Evaluation

To demonstrate the effectiveness of the proposed model, we compare both its segmentation and registration performance with state-of-the-art methods.

Data. We include 100 public T1-weighted brain MRI scans of different subjects, with tumor segmentation labels, from the 2021 Brain Tumor Segmentation (BraTS) challenge [5, 19]. We also include 28 landmarks (16 for the brain ventricles and 12 for the corpus callosum) annotated by clinicians to better evaluate the image registration performance. All MRIs are \(155 \times 240 \times 240\) with \(1.25\,\text {mm}^{3}\) isotropic voxels. As preprocessing steps, we run affine registration, intensity normalization, and bias field correction on all images.

Experiments. We compare our metamorphic image registration method with two registration baselines: an unsupervised predictive diffeomorphic registration method (VoxelMorph, denoted VM) [6], and a metamorphic autoencoder (MAE) [8] that learns disentangled appearance and shape representations. To better visualize the deformations, we show the predicted transformation grids and deformed images with transformation-propagated landmarks for all methods. Quantitatively, we compute the \(L_2\) distance between the propagated and target landmarks as the registration error over 60 pairs.
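For reference, the reported landmark error can be computed as below, assuming the propagated and target landmarks are stored as (N, 3) coordinate tensors.

```python
import torch

def landmark_error(propagated: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """Mean L2 distance between propagated and target landmarks, each (N, 3)."""
    return torch.norm(propagated - target, dim=1).mean()
```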

We evaluate brain tumor segmentation by computing the Dice score [13], comparing MetaMorph with three segmentation backbones: the U-Net architecture [28], a U-Net based on a recurrent residual convolutional neural network (R2-Unet) [1], and a transformer-based U-Net (UnetR) [15]. We also report the performance of MetaMorph with its segmentation module replaced by each of these backbones (named MetaMorph:Unet, MetaMorph:R2-Unet, and MetaMorph:UnetR). We visualize the predicted segmentations overlaid on testing images across all methods.

Algorithm 1. Joint learning of MetaMorph.

Parameter Settings. We set the parameter \(\alpha = 3\) for the operator L, and the number of time steps for Euler integration of EPDiff (Eq. (3)) to 10. We set the weighting parameter \(\gamma = 0.5\) and the batch size to 4. We use an adaptive cosine annealing learning rate scheduler that starts from an initial value of \(\eta = 5 \times 10^{-4}\) for network training. We run all models for 100 epochs with the Adam optimizer and save the networks with the best validation performance. The training and prediction procedures of all learning-based methods are performed on two Nvidia GTX 2070Ti GPUs. We run five-fold cross-validation, splitting the images into \(70\%\) for training, \(20\%\) for validation, and \(10\%\) for testing.
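A sketch of the optimizer and learning-rate setup described above, using standard PyTorch components; a plain cosine schedule is one concrete realization of the scheduler, and params and max_epochs are placeholders.

```python
import torch

def make_optimizer(params, max_epochs: int = 100):
    """Adam with a cosine annealing schedule starting at lr = 5e-4."""
    opt = torch.optim.Adam(params, lr=5e-4)
    sched = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=max_epochs)
    return opt, sched
```

Here sched.step() would be called once per epoch to follow the annealing schedule.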

Fig. 2. Image registration performance comparison for all methods. From left to right: source, target, and deformed images produced by VoxelMorph (VM), the metamorphic autoencoder (MAE), and our method. All images are overlaid with annotated landmarks (red circles for the ventricles and blue crosses for the corpus callosum). (Color figure online)

Results. Figure 2 visualizes the predicted registrations of two 3D brain MRI studies across all methods. MetaMorph significantly outperforms both VM and MAE. General diffeomorphic registration models without an appearance-control mechanism (e.g., VM) may fail, producing less satisfactory deformed images with insufficient deformation. MAE produces reasonably accurate deformations but introduces artifacts. By excluding the appearance change, MetaMorph deforms all regions (e.g., the ventricles and corpus callosum) more accurately. Our propagated landmarks also align best with the target.

Figure 3 shows two examples comparing the image segmentation performance of all methods. MetaMorph-based models predict better segmentation labels (closer to the ground truth) than the original backbones, with slightly better delineation of the brain tumor boundary. This is because we use the deformed images and labels produced by the joint registration framework as augmented data for each subject; the model thus learns a broader spectrum of appearance variation in the data and makes more accurate predictions on new testing data.

Fig. 3. Image segmentation visualization for all methods. Left to right: overlaid segmentation maps comparing the predicted label (red) and the ground truth (blue) for Unet, MetaMorph:Unet, R2-Unet, MetaMorph:R2-Unet, UnetR, and MetaMorph:UnetR. (Color figure online)

Figure 4 (left panel) reports the Dice coefficient statistics. MetaMorph consistently achieves higher segmentation accuracy than the backbones, and the transformer-based variants (UnetR-based) produce the highest Dice among all methods. Figure 4 (right panel) reports the landmark-based registration error between the target image and the deformed image. MetaMorph outperforms the other methods with the lowest error, indicating that it performs metamorphic image registration with higher accuracy.

Fig. 4. Left: Dice comparison on brain tumor segmentation across all methods. The mean Dice scores of each baseline vs. our method are 0.815/0.834 (Unet), 0.835/0.856 (R2-Unet), and 0.861/0.874 (UnetR). Right: registration error (computed as \(L_2\) distance) of two anatomical landmarks over 60 brain pairs. The mean errors for VM vs. MAE vs. our method are 15.02/10.53/4.64 and 16.48/13.59/4.10.

5 Conclusion

We present MetaMorph, a predictive metamorphic image registration model built on deep neural networks. Different from existing models that have limited control over appearance changes, we develop a joint learning framework that adopts a segmentation module to guide the registration network in learning diffeomorphic transformation fields. The segmentation module maximally excludes the disadvantageous effects of appearance change from the learned deformations, thus enabling more precise correspondence alignment between deformed and target frames. Experimental results on 3D brain MRIs with real tumors show that our proposed framework yields both a better registration and a better segmentation model. While our algorithm is presented in the setting of LDDMM with geodesic shooting, the theoretical development is generic to other deformation models, e.g., stationary velocity fields [3]. Our model has great clinical potential for solving one of the most challenging registration problems, e.g., real-time brain shift estimation between preoperative and intraoperative MRI scans with missing data values. Interesting future directions for MetaMorph include i) building a probabilistic model to quantify registration uncertainty along the boundary of tumor areas, and ii) extending the proposed method to more advanced clinical scenarios in which appearance changes are difficult to detect, e.g., real-time automated registration of ultrasound images.