1 Introduction

Image registration is a fundamental step in many medical image analysis tasks. Traditionally, image registration is performed by exploiting intensity information between pairs of fixed and moving images. Recently, deep learning approaches have been used to aid image registration. Wu et al. [11] used a convolutional stacked auto-encoder (CAE) to extract features from fixed and moving images that are subsequently used in conventional deformable image registration algorithms. However, the CAE is decoupled from the image registration task and hence does not necessarily extract the features most descriptive for image registration. The training of the CAE was unsupervised, but the registration task was not learned end-to-end. Miao et al. [8] and Liao et al. [6] used deep learning to learn rigid registration from predefined registration examples. Miao et al. [8] used a convolutional neural network (ConvNet) to predict a transformation matrix for rigid registration of synthetic 2D to 3D images. Liao et al. [6] used a ConvNet for intra-patient rigid registration of CT to cone-beam CT applied to either cardiac or abdominal images. This ConvNet learned to predict iterative updates of registration using reinforcement learning. Both methods are end-to-end but use supervised techniques, i.e. registration examples, which are often task-specific and highly challenging to obtain, are necessary for training.

Jaderberg et al. [3] introduced the spatial transformer network (STN), a building block that aligns input images within a larger network that performs a particular task. By training the entire network end-to-end, the embedded STN deduces the optimal alignment for solving that specific task. However, alignment is not guaranteed, and it is only performed when required for the task of the entire network. STNs have been used for affine transformations, as well as for deformable transformations using thin-plate splines. However, STNs need many labeled training examples and, to the best of our knowledge, have not yet been used in medical imaging.

Fig. 1. Schematics of the DIRNet with two input images from the MNIST data. The DIRNet takes one or more pairs of moving and fixed images as its inputs. The fully convolutional ConvNet regressor analyzes spatially corresponding image patches from the moving and fixed images and generates a grid of control points for a B-spline transformer. The B-spline transformer generates a full displacement vector field to warp a moving image to a fixed image. Training of the DIRNet is unsupervised and end-to-end by backpropagating an image similarity metric as a loss.

In this work, we present the deformable image registration network (DIRNet). The DIRNet takes pairs of fixed and moving images as inputs, and it outputs moving images warped to the fixed images. Training of the DIRNet is unsupervised. Unlike previous methods, the DIRNet is not trained with known registration examples, but learns to register images by directly optimizing a similarity metric between the fixed and the moving image. Hence, similar to conventional intensity-based image registration, it directly learns the registration task end-to-end and it is truly unsupervised. In addition, a trained DIRNet is able to perform deformable image registration non-iteratively on unseen data. To the best of our knowledge, this is the first deep learning method for end-to-end unsupervised deformable image registration.

2 Method

The proposed DIRNet consists of a ConvNet regressor, a spatial transformer, and a resampler (Fig. 1). The ConvNet regressor analyzes spatially corresponding patches from a pair of fixed and moving input images and outputs local deformation parameters for the spatial transformer. The spatial transformer generates a dense displacement vector field (DVF) that enables the resampler to warp the moving image to the fixed image. The DIRNet learns the registration task end-to-end by unsupervised training with an image similarity metric. Since the training phase involves simultaneous optimization of registration of many image pairs, the ConvNet implicitly learns a representation of the features in images that are important for predictions of local displacement. Unlike regular image registration methods that typically perform iterative optimization for each image pair at hand, a trained DIRNet registers images in one pass.
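To make this one-pass data flow concrete, the following minimal Python sketch (ours, not the authors' Theano implementation) illustrates the resampling step: warping a moving image with a dense DVF using linear interpolation. Function and variable names are illustrative assumptions; a real implementation must be differentiable so that the loss can be backpropagated through the warp.

```python
# Illustrative sketch (not the authors' code): warp a 2D moving image with a
# dense displacement vector field (DVF) via linear interpolation. A DIRNet
# implementation needs a differentiable version of this operation.
import numpy as np
from scipy.ndimage import map_coordinates

def warp(moving, dvf):
    """moving: (H, W) image; dvf: (2, H, W) displacements in pixels."""
    ys, xs = np.mgrid[0:moving.shape[0], 0:moving.shape[1]].astype(float)
    coords = np.stack([ys + dvf[0], xs + dvf[1]])    # sampling locations
    return map_coordinates(moving, coords, order=1)  # linear interpolation
```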

The ConvNet regressor expects concatenated pairs of moving and fixed images as its input, and applies four alternating layers of \(3\times 3\) convolutions with zero-padding and \(2\times 2\) downsampling layers. Downsampling reduces the number of ConvNet parameters, but it also induces translational invariance, which is undesirable in image registration. We postulate that this effect should be minimal in a ConvNet used for image registration; thus we use average pooling, which should retain the most information during downsampling. Subsequently, \(3\times 3\) convolutional layers are added to increase the receptive field of the ConvNet to coincide with the capture range of the control points of the spatial transformer. Finally, three \(1\times 1\) convolutional layers are applied. Batch normalization [2] and exponential linear units [1] are used throughout the network, except in the final layer. The number of kernels per layer can be of arbitrary size, but the number of kernels of the final layer is determined by the dimensionality of the input images (e.g. 2 kernels for 2D images that require 2D displacement). The fully convolutional design in combination with downsampling allows fast analysis of separate spatially corresponding patch pairs from the fixed and moving images. The input image sizes and the number of downsampling layers jointly define the number of output parameters, i.e. the size and spacing of the control point grid. This way, a similar grid spacing is ensured for images of different sizes. Like in [3], a thin-plate spline could be used as a spatial transformer, but due to its global support it is deemed less suitable for a patch-based approach. Therefore, we implemented a cubic B-spline [10] transformer, which has local support. Using the control point displacements, the B-spline transformer generates a DVF used to warp the moving image to the fixed image. Thereafter, a resampler generates the warped moving image with linear interpolation.
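As an illustration of this architecture, the sketch below implements the regressor in PyTorch. The authors used Theano and Lasagne, so the framework, class names, and the exact number of extra \(3\times 3\) layers are our assumptions; for a \(256\times 256\) input with four downsampling layers it outputs a \(16\times 16\) grid of 2D control point displacements.

```python
# Hypothetical PyTorch sketch of the ConvNet regressor described above; the
# original was implemented in Theano/Lasagne, and the number of extra 3x3
# convolutional layers is an assumption.
import torch
import torch.nn as nn

class Regressor(nn.Module):
    def __init__(self, n_kernels=16, n_downsamplings=4, n_extra_convs=2, ndim=2):
        super().__init__()
        layers, in_ch = [], 2  # input: concatenated fixed and moving image
        # Alternating zero-padded 3x3 convolutions and 2x2 average pooling.
        for _ in range(n_downsamplings):
            layers += [nn.Conv2d(in_ch, n_kernels, 3, padding=1),
                       nn.BatchNorm2d(n_kernels), nn.ELU(), nn.AvgPool2d(2)]
            in_ch = n_kernels
        # Extra 3x3 convolutions enlarge the receptive field per control point.
        for _ in range(n_extra_convs):
            layers += [nn.Conv2d(in_ch, n_kernels, 3, padding=1),
                       nn.BatchNorm2d(n_kernels), nn.ELU()]
        # Three 1x1 convolutions; the final layer outputs `ndim` displacement
        # components per control point, without normalization or activation.
        for _ in range(2):
            layers += [nn.Conv2d(n_kernels, n_kernels, 1),
                       nn.BatchNorm2d(n_kernels), nn.ELU()]
        layers += [nn.Conv2d(n_kernels, ndim, 1)]
        self.net = nn.Sequential(*layers)

    def forward(self, fixed, moving):
        # (batch, 1, H, W) inputs -> (batch, ndim, H / 2**d, W / 2**d) grid.
        return self.net(torch.cat([fixed, moving], dim=1))
```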

The DIRNet is trained by optimizing an image similarity metric (i.e. by backpropagating dissimilarity) between pairs of moving and fixed images from a training set using mini-batch stochastic gradient descent (Adam [4]). Any similarity metric used in conventional image registration could be used. In this work normalized cross correlation is employed.
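A minimal sketch of this loss is given below (naming is ours, not the authors'); it computes the negative normalized cross correlation per image pair and averages over the mini-batch.

```python
# Minimal sketch (not the authors' code) of the negative normalized cross
# correlation loss that is backpropagated during training.
import torch

def ncc_loss(fixed, warped, eps=1e-8):
    """Negative NCC for image batches of shape (batch, 1, H, W)."""
    f = fixed - fixed.mean(dim=(2, 3), keepdim=True)
    w = warped - warped.mean(dim=(2, 3), keepdim=True)
    ncc = (f * w).sum(dim=(2, 3)) / (
        ((f ** 2).sum(dim=(2, 3)) * (w ** 2).sum(dim=(2, 3))).sqrt() + eps)
    return -ncc.mean()  # minimizing this maximizes image similarity
```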

After training, the DIRNet can be applied for registration of unseen image pairs from a separate dataset.

3 Data

The DIRNet was evaluated with handwritten digits from the MNIST database [5] and clinical MRI scans from the Sunnybrook Cardiac Data (SCD) [9].

The MNIST database contains \(28\times 28\) pixel grayscale images of handwritten digits that were centered by computing the center of mass of the pixels. The test images (10,000 digits) were kept separate from the training images (60,000 digits). One sixth of the training data was used for validation to monitor overfitting during training.

The SCD contains 45 cardiac cine MRI scans that were acquired on a single MRI scanner. The scans consist of short-axis cardiac image slices, each containing 20 timepoints that encompass the entire cardiac cycle. Slice thickness and spacing are 8 mm, and slice dimensions are \(256\times 256\) pixels with a pixel size of \(1.25\times 1.25\) mm. The SCD is equally divided into 15 training scans (183 slices), 15 validation scans (168 slices), and 15 test scans (176 slices). An expert annotated the left ventricle myocardium at end-diastolic (ED) and end-systolic (ES) timepoints following the annotation protocol of the SCD. Annotations were made in the test scans and were only used for final quantitative evaluation. In total, 129 image slices were annotated, i.e. 258 annotated timepoints.

4 Experiments and Results

The DIRNet was implemented with Theano and Lasagne, and conventional registration was performed with SimpleElastix [7].

4.1 Registration of Handwritten Digits

To demonstrate feasibility of the method, we first applied it to registration of handwritten digits from the MNIST data. Separate DIRNet instances were trained for image registration of a specific class: one for each digit. The DIRNets were designed with 16 kernels per convolution layer, and the third and fourth downsampling layers were removed. This resulted in a control point grid of \(7\times 7\) (grid spacing of 4 pixels). Each DIRNet was trained with random combinations of digits from its class, in mini-batches of 32 random fixed and moving image pairs, for 5,000 iterations (i.e. backpropagations). See Fig. 2 (left) for the learning curves.
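The grid size follows directly from the image size and the number of remaining \(2\times 2\) downsampling layers, as the following sketch of the arithmetic shows:

```python
# Grid-size arithmetic for 28x28 MNIST digits with two remaining
# downsampling layers.
image_size, n_downsamplings = 28, 2
grid_points = image_size // 2 ** n_downsamplings  # 7 control points per axis
grid_spacing = image_size // grid_points          # 4 pixels
```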

Registration performance of the trained DIRNets was qualitatively assessed on the test data. For each digit, one sample was randomly chosen to be the fixed image. Thereafter, all remaining digits (approximately 1,000 per class) were registered to the corresponding fixed image. Figure 2 (right) shows the registration results.

Fig. 2. Left: Learning curves showing the negative normalized cross correlation loss (\(L_\mathrm{NCC}\)) on the validation set of DIRNets trained for 5,000 iterations for registration of MNIST digits. Right: Registration results of the trained DIRNets on a separate test set. The top row shows an average of all moving images per class (about 1,000 digits), the middle row shows one randomly chosen fixed image per class, and the bottom row shows an average of the results of independent registrations of the moving images to the chosen fixed image. Averages of the moving images are much sharper after registration than before, indicating a good registration result. The blurry areas in the resulting images indicate where registration is challenging.

4.2 Registration of Cardiac MRI

Next, to demonstrate feasibility of the method on real-world medical data, we register cine cardiac MR images from the SCD. The DIRNet was trained by randomly selecting pairs of fixed and moving image slices from cardiac cine MRI scans (4D data). The pairs of fixed and moving images were anatomically corresponding slices from the same 4D scan of a single patient but acquired at different time points in the cardiac cycle. This resulted in 69,540 image pairs for training, and 63,840 pairs for validation.
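These counts are consistent with taking all ordered combinations of the 20 timepoints per slice, as sketched below (our reconstruction of the arithmetic):

```python
# Each slice covers 20 timepoints, so 20 * 19 = 380 ordered (fixed, moving)
# pairs per slice; multiplying by the slice counts reproduces the totals.
pairs_per_slice = 20 * 19        # 380
print(183 * pairs_per_slice)     # 69540 training pairs
print(168 * pairs_per_slice)     # 63840 validation pairs
```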

Fig. 3. Learning curves showing the negative normalized cross correlation loss (\(L_\mathrm{NCC}\)) on the validation set over 10,000 iterations for the baseline DIRNet, DIRNets with different downsampling techniques (A1, A2), DIRNets with different spatial transformers (B1, B2), and DIRNets with different receptive fields (C1, C2).

Table 1. Registration performance in cardiac MRI, quantified by registration of image slices and subsequent transformation of the corresponding left ventricle annotations. Means and standard deviations of the Dice coefficients between the reference and warped segmentations were computed, as well as 95th percentiles of the surface distance (95thSD) and mean absolute surface distances (MAD). The rows list results before registration, results of conventional iterative image registration using SimpleElastix, and results obtained with the DIRNet. The rightmost column shows the runtime at inference for the conventional methods and the best performing DIRNet.

A baseline DIRNet, as described in Sect. 2, was designed with 16 kernels per convolution layer. This resulted in a grid of \(16\times 16\) control points, i.e. a grid spacing of 16 pixels (20 mm). To evaluate the effect of various DIRNet design choices, additional experiments were performed. First, to evaluate the effect of the downsampling method, DIRNet-A1 was designed with max-pooling layers, and DIRNet-A2 with \(2\times 2\) strided convolutions. Second, to evaluate the effect of the spatial transformer, DIRNet-B1 was designed with a quadratic B-spline transformer, and DIRNet-B2 with a thin-plate spline transformer. Finally, to show the effect of the size of the receptive field (i.e. patch size), DIRNet-C1 was designed to analyze neighbouring (i.e. non-overlapping) patches by leaving out the last two \(3\times 3\) convolutional layers. In addition, DIRNet-C2 analyzed full image slices for each control point by replacing the \(1\times 1\) convolution layers with a \(3\times 3\) convolution layer, followed by a downsampling layer, two fully connected layers of 1,024 nodes, and a final output layer of \(16\times 16\) 2D control points.

Each DIRNet was trained until convergence in mini-batches of 32 image pairs for at least 10,000 iterations. The training loss closely followed the validation loss in each experiment, and no signs of overfitting were apparent. Figure 3 shows the validation loss during the first 10,000 iterations of training for all experiments. The DIRNets converged quickly in each experiment, except DIRNet-B2, which reached convergence only after approximately 30,000 iterations. The final loss was lowest for the baseline DIRNet.

Quantitative evaluation was performed on the test set by registering image slices at ED to ES, and vice versa, resulting in 258 independent registration experiments. The obtained transformation parameters were used to warp the left ventricle annotations of the moving image to the fixed image. The transformed annotations were compared with the reference annotations in the fixed images. The results are listed in Table 1. For comparison, the table also lists results of conventional iterative intensity-based registration (SimpleElastix), with parameters specifically tuned for this task. A grid spacing similar to that of the DIRNet was used, but in a multi-resolution approach: first at a downsampling factor of 2, and thereafter at the original resolution. Two conventional registration experiments were performed, one tuned for speed (2 times 100 iterations) and one tuned for registration accuracy (2 times 2,000 iterations), the latter at the cost of longer computation time. Experiments with the DIRNets were performed on an NVIDIA Titan X Maxwell GPU, and experiments with SimpleElastix were performed on an Intel Xeon 1620-v3 3.5 GHz CPU using 8 threads. Figure 4 shows registration results for a randomly chosen image pair.
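For reference, a minimal sketch of the Dice overlap reported in Table 1 is given below (names are ours; the surface distance metrics additionally require extracting contour points from the masks):

```python
# Minimal sketch (not the authors' code) of the Dice coefficient between a
# warped and a reference left ventricle segmentation; assumes non-empty
# binary masks of equal shape.
import numpy as np

def dice(reference, warped):
    reference, warped = reference.astype(bool), warped.astype(bool)
    intersection = np.logical_and(reference, warped).sum()
    return 2.0 * intersection / (reference.sum() + warped.sum())
```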

Fig. 4. Top, from left to right: the fixed (ED), the moving (ES), the DIRNet warped, and the SimpleElastix warped images. Bottom: heatmaps showing absolute difference images between the fixed image and (from left to right) the original, the DIRNet warped, and the SimpleElastix warped moving images.

5 Discussion and Conclusion

A deep learning method for unsupervised end-to-end learning of deformable image registration has been presented. The method has been evaluated with registration of images with handwritten digits and image slices from cine cardiac MRI scans. The presented DIRNet achieves a performance that is as accurate as a conventional deformable image registration method, with substantially shorter execution times. The method does not require known registration examples for training, which are often difficult to obtain for medical images. To the best of our knowledge, this is the first deep learning method that uses unsupervised end-to-end training for deformable image registration.

Even though registration of images with handwritten digits is an easy task, the performed experiments demonstrate that a single DIRNet architecture can be used to perform registration in different image domains given domain specific training. It would be interesting to further investigate whether a single DIRNet instance can be trained for registration across different image domains.

Registration of slices from cardiac cine MRI scans was quantitatively evaluated between the ES and ED timepoints, i.e. at maximum cardiac contraction and maximum dilation. The conventional registration method (SimpleElastix) was specifically tuned for this task, whereas the DIRNet was not, because it was trained for registration of slices from any timepoint of the cardiac cycle. Nevertheless, the results of the DIRNet were comparable to those of the conventional approach.

The data used in this work did not require pre-alignment of the images. However, to extend the applicability of the proposed method, performing affine registration will be investigated in future work. Furthermore, the proposed method is designed for registration of 2D images; in future work it will be extended to registration of 3D images. Moreover, experiments were performed using only normalized cross correlation as a similarity metric, but any differentiable metric could be used.

To conclude, the DIRNet is able to learn image registration tasks in an unsupervised end-to-end fashion, using an image similarity metric for optimization. Image registration is performed in one pass, thus non-iteratively. The results demonstrate that the network achieves a performance that is as accurate as that of a conventional deformable image registration method, with shorter execution times.