1 Introduction

Atlas-to-image registration provides the spatial mapping between anatomical locations in an atlas and in a patient image. This mapping is crucial for atlas-based segmentation, which is used for lesion detection and treatment planning in traumatic brain injury, tumor, and stroke cases [7]. However, large brain pathologies often produce appearance changes that can result in large misregistrations if the appearance mismatch is falsely compensated for by image deformation. This is especially acute for deformable registration methods, which are needed to capture subtle deformations and, for example, the mass effect of tumors.

Several approaches have been proposed for atlas-to-image registration with large pathologies. The most straightforward is cost function masking, where the lesion area is excluded from the image similarity computation [1]. However, this method can be problematic if the lesion area still contains important brain structure information. Other methods include joint segmentation and registration, which mitigates missing correspondences [3], explicit tumor growth modeling [5], geometric metamorphosis, which separates the deformation of healthy brain areas from lesion changes [14], and registration methods accounting for both deformation and intensity changes [21].

While effective, these methods require either explicit lesion segmentation, knowledge of the lesion location, or a model of tumor growth. Two alternatives exist: (1) using a robust cost function [17] or a mutual saliency map [15] to mitigate the effect of outliers, or (2) learning the desired mappings between image types from large-scale image databases. We follow the second approach. A learned mapping then allows synthesizing one image type from another. Image synthesis has been extensively explored to synthesize MR imaging sequences [8], to facilitate multi-modality registration [2, 19] and to segment lesions [18]. Our goal is to synthesize quasi-normal images from images with lesions to simplify atlas-to-lesion-image registration. Using image synthesis rather than a robust cost function or a mutual saliency map allows reconstructing structural information to guide registration even in highly pathological areas.

Liu et al. [10] proposed a low-rank-plus-sparse (LRS) technique to synthesize quasi-normal brain images from pathological images and to simultaneously estimate a quasi-normal atlas. This approach decomposes images into normal (low-rank) and lesion (sparse) parts. The low-rank part then constitutes the synthesized quasi-normal images, effectively removing lesion effects. By learning from data, no prior lesion information is required. However, the LRS decomposition itself requires good image alignment, hence decomposition and registration have to be interleaved to obtain good results.

Contributions. Our contributions to improve atlas-to-image registration can be summarized as follows: First, similar to [10], we propose a method to directly map a pathology image to a synthesized quasi-normal image to simplify the registration problem. No registration is needed in this process. Second, we use a deep variational encoder-decoder network to learn this mapping and train it using stochastic gradient variational Bayes [9]. Third, since the normal appearance of pathological tissue is unknown per se, we propose loss-function masking and pathology-like “structured noise” to train our model. These strategies ignore mappings between image regions without known correspondence, and artificially create areas with known correspondence which can be used for training, respectively. Fourth, based on the variational formulation, we estimate the reconstruction uncertainty of the predicted quasi-normal image and use it to adjust/improve the image similarity measure so that it focuses more on matching areas of low uncertainty. Fifth, we validate our approach on synthetic tumor images and data from the BRATS 2015 challenge. Our framework requires no prior knowledge of lesion location (at test time; lesion segmentations are required during training only) and provides comparable or, in many cases, better registration accuracy than the LRS method and cost function masking.

Organization. Section 2 discusses variational Bayes for autoencoders, as well as its denoising criterion. Section 3 introduces our methods to remove brain lesions from images and to compute uncertainty estimates for the prediction of quasi-normal images. Section 4 presents experimental results (for 2D synthetic and real data), discusses extensions to 3D, and possible improvements.

2 Denoising Variational Autoencoding

The problem of mapping a pathology image to a quasi-normal image is similar to the objective of a denoising autoencoder, which aims to transform a noisy image into a noise-free image. Next, we introduce variational inference for autoencoders, followed by an explanation of inference for a denoising autoencoder.

Given a clean brain image \(\varvec{x}\) and the latent variable \(\varvec{z}\), we want to find the posterior distribution \(p(\varvec{z}|\varvec{x})\). Since \(p(\varvec{z}|\varvec{x})\) is intractable, we approximate it with a tractable distribution \(q_{\phi }(\varvec{z}|\varvec{x})\), where \(\phi \) are the parameters of the variational approximation. For a variational autoencoder, the posterior distribution is \(p_{\theta }(\varvec{z}|\varvec{x}) \propto p_{\theta }(\varvec{x}|\varvec{z})p(\varvec{z})\), where the prior \(p(\varvec{z})\) is usually an isotropic Gaussian and \(\theta \) are the parameters of the observation model \(p_{\theta }(\varvec{x}|\varvec{z})\). When mapping this formulation to an autoencoder, \(\varvec{z}\) corresponds to the hidden layer, \(q_{\phi }(\varvec{z}|\varvec{x})\) to the encoding operation and \(p_{\theta }(\varvec{x}|\varvec{z})\) to the decoding operation. Thus, \(\phi \) and \(\theta \) correspond to the weights of the encoder and decoder, respectively.

To approximate the true posterior with the variational posterior, we minimize the Kullback-Leibler (KL) divergence between these two distributions.

$$\begin{aligned} \begin{aligned} D_{\text {KL}}(q_{\phi }(\varvec{z}|\varvec{x})\,||\,p_{\theta }(\varvec{z}|\varvec{x}))&= \mathbb {E}_{q_{\phi }(\varvec{z}|\varvec{x})}\left[ \log \frac{q_{\phi }(\varvec{z}|\varvec{x})}{p_{\theta }(\varvec{z}|\varvec{x})}\right] \\&= \log p_{\theta }(\varvec{x}) - \mathbb {E}_{q_{\phi }(\varvec{z}|\varvec{x})}\left[ \log \frac{p_{\theta }(\varvec{z}, \varvec{x})}{q_{\phi }(\varvec{z}|\varvec{x})}\right] . \end{aligned} \end{aligned}$$
(1)

Since \(\log p_{\theta }(\varvec{x})\) in Eq. (1) does not depend on the variational parameters \(\phi \), it is constant with respect to the variational approximation. Thus, minimizing the KL-divergence is equivalent to maximizing \(\mathbb {E}_{q_{\phi }(\varvec{z}|\varvec{x})}[\log p_{\theta }(\varvec{z}, \varvec{x}) - \log q_{\phi }(\varvec{z}|\varvec{x})]\). Since the KL-divergence is non-negative, we have \(\mathbb {E}_{q_{\phi }(\varvec{z}|\varvec{x})}[\log p_{\theta }(\varvec{z}, \varvec{x}) - \log q_{\phi }(\varvec{z}|\varvec{x})] \le \log p_{\theta }(\varvec{x})\); this term is therefore called the variational lower bound of the data likelihood, \(\mathcal {L}_{\text {VAE}}\), i.e.,

$$\begin{aligned} \begin{aligned} \mathcal {L}_{\text {VAE}}&= \mathbb {E}_{q_{\phi }(\varvec{z}|\varvec{x})}\left[ \log \frac{p_{\theta }(\varvec{z}, \varvec{x})}{q_{\phi }(\varvec{z}|\varvec{x})}\right] \\&= -D_{\text {KL}}(q_{\phi }(\varvec{z}|\varvec{x})\,||\,p(\varvec{z})) +\mathbb {E}_{q_{\phi }(\varvec{z}|\varvec{x})}[\log p_{\theta }(\varvec{x}|\varvec{z})], \end{aligned} \end{aligned}$$
(2)

where the first term can be regarded as the regularizer, matching the variational posterior to the prior of the latent variable, and the second term is the expected network output likelihood w.r.t. the variational posterior \(q_{\phi }(\varvec{z}|\varvec{x})\). During training, the optimization algorithm maximizes this variational lower bound.
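For concreteness, both terms of Eq. (2) are straightforward to evaluate for a diagonal-Gaussian posterior and a standard-normal prior. The following minimal Python/PyTorch sketch (our implementation uses Torch; the helper names here are hypothetical) estimates the negative lower bound with a single posterior sample, assuming an L1-type reconstruction log-likelihood as later used in Eq. (4):

```python
import torch

def gaussian_kl(mu, logvar):
    # Closed-form KL(q_phi(z|x) || p(z)) for a diagonal Gaussian
    # q = N(mu, diag(exp(logvar))) and standard-normal prior p(z) = N(0, I):
    #   KL = -0.5 * sum(1 + log sigma^2 - mu^2 - sigma^2)
    return -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp(), dim=1)

def neg_elbo(x, x_output, mu, logvar):
    # Single-sample Monte-Carlo estimate of -L_VAE in Eq. (2): reconstruction
    # term (L1 log-likelihood up to a constant) plus the KL regularizer.
    recon = (x_output - x).abs().flatten(1).sum(dim=1)
    return recon + gaussian_kl(mu, logvar)
```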

Our goal is a denoising autoencoder for pathology removal; in other words, we regard lesions as a special form of structured noise, so that removing lesion appearance is equivalent to removing noise in the denoising autoencoder framework. To this end, we introduce a corruption distribution \(p(\varvec{\widetilde{x}}|\varvec{x})\) that adds (lesion-like) noise to the input. The variational posterior distribution then becomes \(\widetilde{q}_{\phi }(\varvec{z}|\varvec{x})=\int q_{\phi }(\varvec{z}|\varvec{\widetilde{x}}) p(\varvec{\widetilde{x}}|\varvec{x})d\varvec{\widetilde{x}}\). If the original variational posterior is a Gaussian, this new posterior can be regarded as a (continuous) mixture of Gaussians, which has greater representational power. As shown in [6], the variational lower bound for a denoising autoencoder is

$$\begin{aligned} \mathcal {L}_{\text {DVAE}} = \mathbb {E}_{\widetilde{q}_{\phi }(\varvec{z}|\varvec{x})}\left[ \log \frac{p_{\theta }(\varvec{z}, \varvec{x})}{q_{\phi }(\varvec{z}|\widetilde{\varvec{x}})}\right] \ge \mathcal {L}_{\text {VAE}} = \mathbb {E}_{\widetilde{q}_{\phi }(\varvec{z}|\varvec{x})}\left[ \log \frac{p_{\theta }(\varvec{z}, \varvec{x})}{\widetilde{q}_{\phi }(\varvec{z}|\varvec{x})}\right] . \end{aligned}$$
(3)

Thus, the denoising variational lower bound is tighter (higher) than the original one, which corresponds to a smaller KL-divergence between the true and the approximated posterior. In the following section, we discuss our implementation of the encoder-decoder network and how we maximize the denoising variational lower bound.

3 Network Model and Registration with Uncertainty

Figure 1 shows the structure of our denoising variational encoder-decoder network. The input is a brain image \(\varvec{x}\) with intensities normalized to [0, 1]. The encoder consists of convolution followed by max-pooling layers (ConvPool), and the decoder of max-unpooling layers followed by convolution (UnpoolConv). We choose max-unpooling rather than plain upsampling because upsampling ignores the pooling locations within each pooling patch, which results in severe image degradation. Encoder and decoder are connected by fully connected layers (FC) and the re-parameterization layer (Reparam) [9]. This layer takes the parameters of the variational posterior as input, in our case the mean \(\mu \) and standard deviation \(\varSigma \) of a Gaussian, and generates a sample from the variational posterior. This allows computing the gradient of the regularizer \(-D_\text {KL}(q_{\phi }(\varvec{z}|\varvec{x})||p(\varvec{z}))\) with respect to \(\phi \) through the variational parameters, whereas the sampling operation itself is not differentiable with respect to \(\phi \). Below we discuss specific techniques implemented for our task.
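To make the architecture concrete, the following is a minimal PyTorch sketch of such an encoder-decoder with max-unpooling and a reparameterization step (our implementation uses Torch; the single-stage layout and all layer sizes are hypothetical simplifications of Fig. 1):

```python
import torch
import torch.nn as nn

class DenoisingVAE(nn.Module):
    """Sketch of the encoder-decoder: one ConvPool stage, FC bottleneck with
    reparameterization, one UnpoolConv stage (sizes are hypothetical)."""
    def __init__(self, latent_dim=256):
        super().__init__()
        self.enc_conv = nn.Conv2d(1, 16, kernel_size=3, padding=1)
        self.pool = nn.MaxPool2d(2, return_indices=True)  # keep indices for unpooling
        flat = 16 * 98 * 116                              # 196x232 input, pooled once
        self.fc_mu = nn.Linear(flat, latent_dim)
        self.fc_logvar = nn.Linear(flat, latent_dim)
        self.fc_dec = nn.Linear(latent_dim, flat)
        self.unpool = nn.MaxUnpool2d(2)                   # restores pooling locations
        self.dec_conv = nn.Conv2d(16, 1, kernel_size=3, padding=1)

    def forward(self, x):
        h = torch.relu(self.enc_conv(x))
        h, idx = self.pool(h)                             # h: (B, 16, 98, 116)
        mu = self.fc_mu(h.flatten(1))
        logvar = self.fc_logvar(h.flatten(1))
        # Reparameterization: z = mu + sigma * eps, eps ~ N(0, I), so gradients
        # reach mu/logvar although sampling itself is not differentiable.
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
        d = self.fc_dec(z).view(-1, 16, 98, 116)
        d = torch.relu(self.unpool(d, idx))
        return torch.sigmoid(self.dec_conv(d)), mu, logvar  # output in [0, 1]
```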

Training Normal Brain Appearance Using Pathology Images. A model of normal brain appearance would ideally be learned from a large number of healthy brain images acquired with a consistent imaging protocol. Our goal, instead, is to learn a mapping from a pathological image to a quasi-normal image, i.e., to train a denoising autoencoder for the lesion ‘noise’ and maximize the denoising variational lower bound. This poses two challenges: first, we generally do not know what the normal appearance in a pathological area should be; second, pathological images may exhibit spatial deformations not seen in a normal subject population (such as the mass effect of brain tumors). To mitigate these problems, we learn brain appearance from the normal areas of the pathological brain images only. This is accomplished by (1) introducing lesion-like structured noise (for brain tumor cases, circles filled with the mean intensity of the normal brain area) via the QuasiLesion layer in Fig. 1, and (2) loss-function masking, i.e., ignoring lesion areas during learning. We assume lesion segmentations are available for the training data. For loss-function masking, we first replace the lesion area in the structured-noise input \(\varvec{\widetilde{x}}\) to obtain \(\varvec{\widetilde{x}}_{\text {normal}}\), pixel-wise: if \(\varvec{\widetilde{x}} \in \text {Normal}\), then \(\varvec{\widetilde{x}}_{\text {normal}}= \varvec{\widetilde{x}}\); otherwise (i.e., \(\varvec{\widetilde{x}}\in \text {Lesion}\)), \(\varvec{\widetilde{x}}_{\text {normal}}= a + \mathcal {N}(0, \sigma )\). This prevents the network from exploiting tumor appearance. Experiments show only small differences for different settings of a and \(\sigma \); however, performance suffers when \(\sigma \) is too high, and setting \(a=0\) increases the mean intensity error over the whole image. In our model, we set a to the mean intensity of the normal area and \(\sigma = 0.03\). Second, we define the network output likelihood for \(\varvec{x}_\text {output}\) as

$$\begin{aligned} \log p_{\theta }(\varvec{x}_{\text {output}}|\varvec{z})_{\text {normal}} = {\left\{ \begin{array}{ll} -|\varvec{x}_{\text {output}} - \varvec{x}|, &{}\quad \varvec{x}_{\text {output}}\in \text {Normal}\\ 0, &{}\quad \varvec{x}_{\text {output}}\in \text {Lesion}. \end{array}\right. } \end{aligned}$$
(4)

Hence, we disregard any errors in the lesion area during backpropagation. We refer to this two-step strategy as loss-function masking.
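A sketch of these two masking steps, under the same assumptions and with hypothetical helper names (a circular quasi-lesion and a pixel-wise boolean lesion mask), could look as follows:

```python
import torch

def add_quasi_lesion(x, mean_normal, radius=20):
    # Hypothetical stand-in for the QuasiLesion layer: paste a circle filled
    # with the mean normal-tissue intensity at a random location.
    x = x.clone()
    h, w = x.shape[-2], x.shape[-1]
    cy = torch.randint(radius, h - radius, (1,))
    cx = torch.randint(radius, w - radius, (1,))
    yy, xx = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    circle = (yy - cy) ** 2 + (xx - cx) ** 2 <= radius ** 2
    x[..., circle] = mean_normal
    return x

def mask_input(x_tilde, lesion_mask, a, sigma=0.03):
    # Input-side rule: lesion pixels become a + N(0, sigma), so the network
    # never sees true tumor appearance (a = mean normal intensity).
    return torch.where(lesion_mask, a + sigma * torch.randn_like(x_tilde), x_tilde)

def masked_l1(x_output, x, lesion_mask):
    # Output-side rule, Eq. (4): L1 error on normal tissue only; lesion pixels
    # contribute no loss and hence no gradient during backpropagation.
    normal = ~lesion_mask
    return ((x_output - x).abs() * normal).sum() / normal.sum()
```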

Fig. 1. Network structure (numbers indicate the data size).

The overall training procedure for our network is: (1) sample a corrupted input \(\widetilde{\varvec{x}}\) from \(p(\widetilde{\varvec{x}}|\varvec{x})\), (2) mask out the lesion area to obtain \(\varvec{\widetilde{x}}_{\text {normal}}\), (3) sample \(\varvec{z}\) from \(q_{\phi }(\varvec{z}|\varvec{\widetilde{x}}_{\text {normal}})\) and obtain the reconstructed image \(\varvec{x}_\text {output}\) from the network, (4) compute the denoising lower bound \(\mathcal {L}_{\text {DVAE}}\) using the modified likelihood of Eq. (4), and (5) update the network via stochastic gradient descent backpropagation.
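Assuming the hypothetical helpers sketched above, one training iteration could look like this (the actual implementation uses Torch with rmsprop, cf. Sect. 4):

```python
import torch

def train_step(model, optimizer, x, lesion_mask, mean_normal):
    x_tilde = add_quasi_lesion(x, mean_normal)                # (1) corrupt the input
    x_tilde = mask_input(x_tilde, lesion_mask, mean_normal)   # (2) hide lesion appearance
    x_output, mu, logvar = model(x_tilde)                     # (3) sample z, reconstruct
    loss = masked_l1(x_output, x, lesion_mask) \
           + gaussian_kl(mu, logvar).mean()                   # (4) estimate of -L_DVAE
    optimizer.zero_grad()
    loss.backward()                                           # (5) backpropagate and update
    optimizer.step()
    return loss.item()

# e.g., optimizer = torch.optim.RMSprop(model.parameters(), lr=1e-4)
```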

Reconstruction Uncertainty for Atlas Registration. At test time, due to the small amount of available data and the possibly large appearance differences among training cases, it is useful to exploit the uncertainty of the reconstructed image to guide registration. We sample \(\varvec{z}\) from the approximate posterior \(q_{\phi }(\varvec{z}|\varvec{x})\) to generate multiple reconstructions \(\varvec{x}_{\text {output}}\) for different \(\varvec{z}\). We then use the mean of the sampled images, \(\mu _{\varvec{x}_{\text {output}}}\), as the reconstruction result and the (local) standard deviation, \(\sigma _{\varvec{x}_{\text {output}}}\), as an uncertainty measure. We define areas of high uncertainty as areas of large variance and, for registration, down-weight their contribution to the image similarity measure. Specifically, we use \(w(\varvec{x}_{\text {output}}) = \exp (-2000\,\sigma _{\varvec{x}_{\text {output}}}^2)\) as a local weight for the image similarity measure in our experiments; this drives the weight to near 0 for a large standard deviation. Note that this differs from cost function/pathology masking: cost function masking uses a simple binary mask, which is equivalent to setting the weight of the lesion area to zero, whereas our uncertainty-based weighting down-weights areas that are ambiguous in the reconstruction process and may therefore be less reliable for registration. Since our uncertainty weight lies in [0, 1], structural information is rarely discarded completely as in cost-function masking. Our experimental results in Sect. 4 show that this is indeed desirable.
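A minimal sketch of this sampling procedure (hypothetical names, reusing the model sketched above, whose forward pass draws a fresh \(\varvec{z}\) on every call):

```python
import torch

@torch.no_grad()
def reconstruct_with_uncertainty(model, x, n_samples=100):
    samples = torch.stack([model(x)[0] for _ in range(n_samples)])
    mean = samples.mean(dim=0)               # quasi-normal reconstruction
    std = samples.std(dim=0)                 # local reconstruction uncertainty
    weight = torch.exp(-2000.0 * std ** 2)   # similarity weight in [0, 1]
    return mean, std, weight
```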

4 Experiments and Discussion

We evaluate our model in two experiments: one using 2D synthetic images and one using real BRATS tumor images. The image intensity range is [0, 1]. We implement the network in Torch and use the rmsprop [20] optimization algorithm with a learning rate of 0.0001, a momentum decay of 0.1 and an update decay of 0.01. We use a batch size of 16; for a training dataset with 500 images of size \(196 \times 232\), training for 1000 epochs takes about 10 h on a 2012 Nvidia Titan GPU. For data augmentation, we randomly shift each training image by up to 10 pixels in both directions and add zero-mean Gaussian noise with a standard deviation of 0.01. During testing, we sample 100 images per test case and calculate their mean and standard deviation. All training and testing images are extracted from the same slice of their original 3D images, which are pre-aligned to a 3D ICBM T1 atlas [4] using affine registration and judged to exhibit only in-plane deformations. We use NiftyReg [13] (with standard settings) together with normalized cross correlation (NCC) to register the 2D ICBM atlas slice to the reconstructed result; we modified NiftyReg to integrate the image uncertainty into the cost function (a sketch of one plausible weighted similarity follows below). We use a large number of B-spline control points (\(19\times 23\) for a \(196\times 232\) image), which ensures that displacements large enough to capture the mass effect observed in the BRATS data can be expressed. B-spline registration approaches similar to NiftyReg have been used successfully for registration tasks of varying difficulty [16]; given sufficient degrees of freedom, poor registration performance is likely due to an unsuitable similarity measure, which should be investigated in future work. To capture even larger deformations, NiftyReg could easily be replaced by a fluid-based registration approach. Our focus here is to synthesize quasi-normal images and to exploit them, together with their reconstruction uncertainty, for registration. For our images, 1 pixel corresponds to \(1\,\text {mm} \times 1\,\text {mm}\).
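As a sketch, a weighted NCC with per-pixel weights w could take the following form; this is one plausible variant, and the actual modification inside NiftyReg may differ in detail (e.g., it may weight local NCC windows):

```python
import torch

def weighted_ncc(a, b, w, eps=1e-12):
    # Uncertainty-weighted normalized cross correlation between images a and b.
    w = w / w.sum()
    ma, mb = (w * a).sum(), (w * b).sum()    # weighted means
    cov = (w * (a - ma) * (b - mb)).sum()    # weighted covariance
    va = (w * (a - ma) ** 2).sum()
    vb = (w * (b - mb) ** 2).sum()
    return cov / torch.sqrt(va * vb + eps)
```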

Fig. 2. Mean deformation error of all synthetic tumor test cases for various models. Our model is highlighted in red. Masking tumor area = MT. Add structured noise = ASN. Use uncertainty for registration = UR. (A): affine registration; (B): register to tumor image; (C): low-rank-sparse (LRS) with registration; (D): LRS w/o registration; (E): MT, no ASN, no UR; (F): MT, ASN, no UR; (G): MT, ASN, UR; (H): network trained with clean images, ASN; (I): use uncertainty on tumor image directly; (J): cost function masking. (Color figure online)

For comparison, we use the LRS method, an alternative image synthesis approach for tumor images. We select its parameters by maximizing \(2\times NCC_{\text {tumor}}+NCC_{\text {normal}}\) on the training data. Due to the high computational cost of current LRS approaches [10], we use 50 training images for each case. Furthermore, to demonstrate that using synthesized images indeed improves registration accuracy, we also compare our method against using the reconstruction uncertainty map in combination with the original tumor image for registration.

Fig. 3. Exemplary synthetic tumor test case reconstruction and checkerboard comparison with ground truth registration. Best viewed zoomed-in.

Synthetic Tumor Experiment. We use 436 brain images from the OASIS [11] cross-sectional dataset as base images; this dataset is a mix of 43 % Alzheimer’s and 57 % control subjects. We create a synthetic tumor dataset by registering random OASIS images to random BRATS 2015 T1c images (to account for the mass effect of tumors) with tumor-area masking, followed by pasting the BRATS tumor regions into the OASIS images. We generate 500 training and 50 testing images from separate OASIS and BRATS images. Figure 2 shows boxplots of the mean deformation error of different areas per test case, with respect to the ground-truth deformation obtained by registering the atlas to the normal image (i.e., without the added tumor). The highlighted boxplot is the network model trained with tumor images and added quasi-tumor (i.e., structured noise), using uncertainty weighting for the registration. We evaluate the deformation error in three areas: (1) the tumor area, (2) normal areas within 10 mm of the tumor boundary (near tumor) and (3) normal areas more than 10 mm from the boundary (far from tumor). Evaluating all three areas assesses how well the mass effect is captured; this is only meaningful for our synthetic experiment, whereas landmarks (outside the tumor area) are more suitable for real data. For the tumor area, our method (MT+ASN+UR) outperforms most other methods. For the normal areas, the registration difference between our method and directly registering to the original tumor image is very small, especially compared with the LRS method, which tends to remove fine details. Compared to using the tumor image directly for registration, our model decreases the \(99.7\,\%\) upper limit of the mean tumor-area deformation error from 14.43 mm to 7.83 mm, the mean error from 5.62 mm to 3.60 mm, and the standard deviation from 3.49 mm to 2.16 mm. This significantly decreased deformation error comes at only a small increase in the mean deformation error for the normal area, from 1.36 mm to 1.49 mm. The only method performing better than our model on this synthetic test is cost function masking, which requires a tumor segmentation. Figure 3 shows one example test case; notice that the LRS method erroneously reconstructs the upper lateral ventricle, resulting in an incorrect deformation.

BRATS Experiment. We also evaluate our network on the BRATS 2015 training dataset [12], which contains 274 images. This is a very challenging dataset due to its moderate sample size and the high variation in image appearance and acquisition. We use cross-validation and partition the dataset into 4 folds of 244 training and 30 testing images, resulting in a total of 120 test cases. For preprocessing, we standardize image appearance using adaptive histogram equalization. For evaluation, we manually label, on average, 10 landmarks per test case around the tumor area and at major anatomical structures. We report the target registration error for the landmarks in Table 1. Our method again outperforms most methods, including LRS without registration. Although the difference between our model and LRS+registration is not statistically significant, these results in combination with our synthetic experiments suggest that our method is overall preferable. Note also that LRS requires an image registration in each decomposition iteration and blurs the brain’s normal areas (see Fig. 4), whereas our method does not suffer from these problems. Moreover, it is interesting that cost function masking performs worse than our method. This may be explained by the observation that, for very large tumors, cost function masking hides too much of the brain structure, making registration inaccurate. Figure 4 shows one exemplary BRATS test case: because the tumor covers the majority of the white matter in the left hemisphere, cost function masking removes too much information from the registration, and, as a result, the left lateral ventricle is misregistered. Combining our network-reconstructed image and the uncertainty information, our registration result is much better.

Table 1. Statistics for landmark errors over the BRATS test cases. The best results in each category are marked in bold.

Modeling Quasi-tumor Appearance. One interesting problem is the choice of quasi-tumor appearance. In our work we use the mean normal brain area intensity as the appearance, while other choices, such as using simulated tumor appearance or random noise, are also sensible. To show the effect of the quasi-tumor appearance choice on the registration result, we conduct additional experiments using 4 textures to create quasi-tumors: (1) real tumors of the BRATS dataset, (2) mean intensity (our approach), (3) random constant intensities and (4) random noise. Registration performance for all 4 methods is similar, with (2) having lower registration error in normal areas (e.g. a median of 1.07/2.78 mm compared to 1.28/2.84 mm using (1) for synthetic/BRATS data). A possible reason why using tumor appearance is not superior is the limited training data available (\(\sim \)200 images). For a larger dataset with more tumor appearance examples to learn from, using tumor appearance could potentially be a better choice.

Discussion. One interesting finding is that while a high-quality lesion area reconstruction is desirable, it is not necessary for improving atlas registration. Lesion reconstruction may be affected by many factors (limited data, large image appearance variance, etc.), but the atlas registration result depends both on the quasi-normal reconstruction of the lesion and on the faithful reconstruction of the normal tissue. For example, in some cases the LRS method achieves visually pleasing results in the lesion area, but at the same time it smooths out the normal area, losing important details for image registration. Our method, on the other hand, preserves details in the normal areas more consistently and hence results in overall better registration accuracy. Moreover, for tightly controlled data (e.g., a synthetic dataset) our method generates better reconstructions of the lesion area; future experiments using more controlled data (e.g., the BRATS 2012 synthetic images) would therefore be interesting. Finally, synthesizing a quasi-normal image generates useful structural information to guide the registration, and the reconstruction uncertainty can be used to focus the registration on regions of high confidence.

Another interesting question is how to extend our approach to 3D images. In initial experiments, we implemented a 2.5D network which reconstructs 14 slices at once. Training this network on 500 2.5D training cases takes 3 days, which, while not fast, is feasible. One possible alternative is to learn mappings for 3D patches using the patch location as an additional feature, which would enable training on a much larger dataset (of patches) at reasonable computational cost.

Fig. 4. Exemplary BRATS test case with landmarks for the test image (top row) and the warped atlas (bottom row).

Finally, designing a more “lesion-like” noise model and studying the impact of the training data size on the predictions are interesting future directions.

Support. This research is supported by NIH R42 NS081792-03A1, NIH R41 NS086295-01 and NSF ECCS-1148870.