Introduction

In image-guided therapies (IGTs), e.g., preoperative planning, intervention and diagnosis, deformable image registration is key to integrating complementary information acquired at different time points or from different image modalities. Therefore, developing fast and accurate deformable image registration methods benefits the performance of IGTs.

Traditional registration methods such as symmetric normalization (SyN) [1] align a pair of images by iteratively minimizing the appearance dissimilarity under regularization constraints. Furthermore, Deeds [12] utilizes discrete optimization and shows promising results in abdominal registration [28]. However, solving a pairwise optimization is computationally intensive, resulting in slow speed in practice. Recently, owing to the substantial improvement in computational efficiency over traditional iterative registration, learning-based image registration approaches have become more prominent in task-specific and time-intensive applications [7]. Most learning-based registration approaches use fully supervised [3, 4, 20] or semi-supervised [5, 15] learning strategies and heavily rely on ground-truth voxel correspondences and/or organ segmentation labels. Although these approaches struggle with imperfect ground-truth labels, they have made a significant impact on the field of deformable image registration. With the development of the spatial transformer network (STN) [16], registration approaches based on unsupervised learning have also been introduced. For example, VoxelMorph [2] is a seminal unsupervised registration framework that focuses on registering brain images of the same modality (unimodal registration). By modifying VoxelMorph, researchers have further proposed additional unsupervised unimodal registration approaches [6, 14, 18, 25].

Most existing learning-based registration approaches use the so-called mono-stream high-to-low, low-to-high network structure with augmented modules, e.g., skip-connections [2, 8], multi-resolution fusion [14] and intermediate supervision [19]. This structure can significantly increase the size of the receptive field, which is highly desirable for recognizing object information in images, but it then needs to recover high-resolution information from the low-resolution representations. With increased receptive field sizes, these approaches prioritize overall registration accuracy, which is governed by the majority of easy-to-align regions, and overlook severely deformed local regions. For example, livers with tumors usually exhibit large local deformation due to progressed disease, whereas the deformations of the surrounding kidney and spleen are less significant. In a CT-to-MRI abdominal image registration, the aforementioned approaches are likely to estimate a deformation field that accurately registers the kidney and spleen, yet performs poorly at aligning local liver lobes.

Besides, most image registration networks utilize 3D convolutional neural networks (3D CNNs) to exploit the semantic information in each CT/MRI slice and the spatial relationships across consecutive slices. However, training a 3D CNN is computationally expensive and may result in insufficient training due to the small size of clinical datasets.

To address the above problems, we propose a novel unsupervised full-resolution residual registration network (F3RNet), which is shown in Fig. 1(a). Distinct from the conventional mono-stream network structure, F3RNet consists of two parallel streams, namely the “full-resolution stream” and the “multi-scale residual stream.” Inspired by the success of using a high-resolution stream in human pose estimation and image inpainting tasks [9, 23, 26], the “full-resolution stream” takes advantage of the detailed image information and facilitates accurate voxel-level registration, while the “multi-scale residual stream” learns deep multi-scale residual representations to robustly recognize corresponding organs in both images and guarantee high overall registration accuracy. Using the multi-scale residual block (MRB) modules, the network progressively fuses information from the two parallel streams in a residual learning fashion [10] to further boost performance. In addition, we factorize each 3D convolution into two correlated 2D and 1D convolutions, thus effectively avoiding over-parameterization [24].

To the best of our knowledge, we are the first to incorporate full-resolution representations with multi-scale high-level representations in a residual learning fashion to boost deformable image registration performance. The main contributions of our work can be summarized as follows:

  • Our approach can unite the strong capability of capturing deep multi-scale representations with precise full-resolution spatial localization of the anatomical structures by interactively combining two parallel streams via the proposed MRB module and the residual learning mechanism. By taking into account such full-resolution information, the registration network is more sensitive to the hard-to-align regions and can provide better alignments for severely deformed local regions.

  • The factorization of 3D convolution can markedly reduce the training parameters and enhance the network efficiency.

  • We validate the proposed F3RNet on a clinically acquired intra-patient abdominal CT-MRI dataset and a public inspiratory and expiratory thoracic CT dataset. The experimental results on both multimodal and unimodal registration show that our method achieves superior performance over the existing state-of-the-art traditional and learning-based methods.

The outline of the paper is as follows: the “Methods” section describes the details of our F3RNet, the “Experiments” section presents the experimental details and registration results on both multimodal and unimodal datasets, and the “Conclusions” section draws the conclusions of the paper.

Methods

Representing the moving image as \(I_{m}\) and the fixed image as \(I_{f}\), medical image registration aims to estimate an optimal deformation field \(\phi \) with three channels (x, y, z displacements) that can align \(I_{m}\) to \(I_{f}\). In this section, we first present our full-resolution residual registration network (shown in Fig. 1). Then, we describe the detailed structures of the designed residual block (RB) and multi-scale residual block (MRB), respectively. The factorization of 3D convolution is presented in “Factorized 3D convolution (F3D)” section, and the loss function of our network is described in “Loss function” section.

Fig. 1
figure 1

Illustration of the full-resolution residual registration network (F3RNet). a shows the overview of our F3RNet; b shows the residual block (RB); c shows the multi-scale residual block (MRB). The network learns parameters for a dense deformation field \(\phi \) that aligns the moving image \(I_{m}\) to the fixed image \(I_{f}\). N denotes that the minimum volume is \((1/2^{N})\) the size of the input images

Overview of the network

Distinct from the regular high-to-low, low-to-high one-pass network architecture, full-resolution residual registration network (F3RNet) unifies two parallel streams:

  • Full-resolution Stream. Maintaining high-resolution features has demonstrated superior performance for dense prediction [9, 22, 23, 26]. The black line in Fig. 1a indicates the data flow of the full-resolution stream. This stream first concatenates \( I_{m}\) and \( I_{f}\), followed by a 3D convolution and a series of residual blocks (RB, described in “Residual block (RB)” section). Then, the low-level features on this stream are successively updated by adding the residuals from the other parallel stream. After that, the full-resolution stream reduces the number of channels step-by-step via consecutive RBs and 3D convolutions and estimates the 3-channel deformation field \(\phi \). The spatial transformer network (STN) [16] is applied to warp the moving image \( I_{m}\) with \(\phi \), so that the similarity between the warped image \( I_{w}\) and the fixed image \( I_{f}\) can be evaluated (a simplified warping sketch is given at the end of this overview). This stream does not employ any downsampling operation, resulting in good boundary localization but limited deep semantic recognition. As such, information about hard-to-align regions is propagated throughout the stream. Specifically, all convolutions in the full-resolution stream have 16 channels in our experiments, except for the final 3-channel convolution used to generate the deformation field.

  • Multi-scale Residual Stream. The data flow of the multi-scale residual stream is depicted as the orange line in Fig. 1a. In contrast to the full-resolution stream, this stream is good at capturing high-level features that improve organ recognition. Specifically, successive pooling and convolution operations are leveraged to increase the receptive fields and enhance the robustness against small noise in the images. We also inherit the skip-connection design of the regular high-to-low, low-to-high architecture, in which feature maps of the same resolution are connected by an addition operation. Besides, with the help of our proposed multi-scale residual blocks (MRBs), which operate on both streams simultaneously, the high-level features can directly interact with the low-level features. The interior architecture of the MRB is shown in Fig. 1c and elaborated in “Multi-scale residual block (MRB)” section. In our experiments, we set N to 4, the same as VoxelMorph [2], denoting that the lowest resolution is 1/16 of the original image. Specifically, at the 1/2 and 1/4 scales, the number of feature channels is set to 16; at the 1/8 and 1/16 scales, it becomes 32.

The information of the two distinct streams is automatically fused via residual learning [10]. By repeatedly fusing features between the two streams through successive multi-scale residuals, the full-resolution representations become richer for dense deformation field prediction. At the same time, the richer low-level full-resolution information can in turn enhance the high-level multi-scale information.
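To make the role of the warping step concrete, below is a deliberately simplified TensorFlow sketch that resamples the moving image at the displaced coordinates using nearest-neighbour rounding. It is only an illustration: the actual STN [16] uses trilinear interpolation so that the warp remains differentiable for end-to-end training, and the function name and tensor shapes are our assumptions.

```python
import tensorflow as tf

def warp_nearest(moving, phi):
    """Simplified warp of `moving` (1, W, H, D, 1) by the displacement field
    `phi` (1, W, H, D, 3): sample the moving image at x + phi(x) using
    nearest-neighbour rounding (the real STN uses trilinear interpolation)."""
    shape = tf.shape(moving)[1:4]                               # (W, H, D)
    grid = tf.stack(tf.meshgrid(tf.range(shape[0]),
                                tf.range(shape[1]),
                                tf.range(shape[2]),
                                indexing='ij'), axis=-1)        # voxel grid, (W, H, D, 3)
    coords = tf.cast(grid, tf.float32) + phi[0]                 # displaced sample points
    coords = tf.round(coords)
    coords = tf.maximum(coords, 0.0)                            # clamp to the volume
    coords = tf.minimum(coords, tf.cast(shape - 1, tf.float32))
    return tf.gather_nd(moving[0], tf.cast(coords, tf.int32))[tf.newaxis]
```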

Residual block (RB)

ResNets, proposed in [10], have demonstrated that residual learning can improve the training characteristics over traditional one-pass feed-forward learning. The interior architecture of the residual block (RB) is depicted in Fig. 1b. The output \(z_{n}\) of the RB can be formulated as:

$$\begin{aligned} z_{n}=z_{n-1}+{\mathcal {R}}\left( z_{n-1}\right) , \end{aligned}$$
(1)

where \({\mathcal {R}}\) represents the residual branch consisting of two 3D convolutions with a kernel size of \( 3 \times 3 \times 3\), each followed by a LeakyReLU activation. Instead of computing \(z_{n}\) directly as in a traditional feed-forward network, the convolutional branch only needs to compute the residual \({\mathcal {R}}(z_{n-1})\) in this architecture.
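A minimal sketch of the RB in Keras with the TensorFlow backend (the framework used in our implementation) is given below; the 'same' padding and the LeakyReLU slope of 0.2 are assumptions, as they are not fixed above.

```python
import tensorflow as tf
from tensorflow.keras import layers

def residual_block(z, channels=16):
    """Residual block (RB), Eq. (1): z_n = z_{n-1} + R(z_{n-1}), where R is
    two 3x3x3 convolutions, each followed by a LeakyReLU activation."""
    r = layers.Conv3D(channels, 3, padding='same')(z)
    r = layers.LeakyReLU(0.2)(r)
    r = layers.Conv3D(channels, 3, padding='same')(r)
    r = layers.LeakyReLU(0.2)(r)
    return layers.Add()([z, r])
```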

Multi-scale residual block (MRB)

The multi-scale residual block (MRB) follows the basic idea of residual block (RB) but elegantly achieves interaction between the full-resolution stream and multi-scale residual stream. An MRB consists of a series of pooling, 3D convolution and upsampling layers, as shown in Fig. 1c. Each MRB has two inputs, \(l_{n-1}\) as full-resolution low-level features and \(h_{n-1}\) as multi-resolution high-level features, and two corresponding outputs \(l_{n}\) and \(h_{n}\). Intuitively, denoting the entire MRB operation as \({\mathcal {M}}\), the output \(l_{n}\) can be computed as:

$$\begin{aligned} l_{n}=l_{n-1}+{\mathcal {M}}\left( l_{n-1}, h_{n-1}\right) . \end{aligned}$$
(2)

Specifically, the resolution of \(l_{n-1}\) is first reduced to that of \(h_{n-1}\) by a pooling operation, followed by a feature map concatenation. Then, the concatenated feature map undergoes a 3D convolution with a kernel size of \(3 \times 3 \times 3\), followed by a residual block (RB) with the same number of channels, and the resulting output \(h_{n}\) is passed on to the next stage of the multi-scale residual stream. Meanwhile, at the other end, the output of the \(3 \times 3 \times 3\) convolution is passed through a \(1 \times 1 \times 1\) convolutional bottleneck layer and an upsampling layer so that its number of channels and resolution match those of \(l_{n-1}\). In this way, we can readily use an addition operation to integrate the residual learned in the MRB into the full-resolution stream, thus forming a highly interactive dual-stream residual module.
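Below is a minimal Keras sketch of the MRB under the same assumptions as the RB sketch above (whose residual_block helper it reuses); the pooling/upsampling factor and channel counts are illustrative parameters rather than the exact configuration.

```python
from tensorflow.keras import layers

def multi_scale_residual_block(l_prev, h_prev, l_channels=16, h_channels=32, scale=8):
    """Multi-scale residual block (MRB): fuses the full-resolution features
    l_{n-1} with the multi-scale features h_{n-1} and returns (l_n, h_n)."""
    # Pool l_{n-1} down to the (coarser) resolution of h_{n-1}, then concatenate.
    pooled = layers.MaxPooling3D(pool_size=scale)(l_prev)
    fused = layers.Concatenate()([pooled, h_prev])
    # 3x3x3 convolution followed by an RB with the same channel count -> h_n.
    conv = layers.Conv3D(h_channels, 3, padding='same')(fused)
    conv = layers.LeakyReLU(0.2)(conv)
    h_next = residual_block(conv, h_channels)
    # 1x1x1 bottleneck + upsampling to match the channels and resolution of
    # l_{n-1}; the result is added as a residual -> l_n (Eq. 2).
    res = layers.Conv3D(l_channels, 1, padding='same')(conv)
    res = layers.UpSampling3D(size=scale)(res)
    l_next = layers.Add()([l_prev, res])
    return l_next, h_next
```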

Fig. 2
figure 2

Illustration of a 3D medical image scans, b regular 3D convolution with kernel size of \(3 \times 3 \times 3\), c F3D convolution block

Factorized 3D convolution (F3D)

Most medical images, as shown in Fig. 2a, consist of 3D image stacks with the size of \(W\times H\times D\), where W, H and D represent the width, the height and the number of sequential slices, respectively. Inspired by Inception [24], where a large 2D convolution is factorized into two smaller ones, we factorize the 3D convolution block for learning the volumetric representation. Specifically, a 3D convolution with a kernel size of \(3 \times 3 \times 3\) (Fig. 2b) can be factorized into a \(3 \times 3 \times 1\) convolution and a \(1 \times 1 \times 3\) convolution in a cascaded fashion (Fig. 2c), capturing dense 2D features within the \(W \times H\) slices together with 1D weights that model sparse sequential relationships across adjacent slices. As such, the number of trainable parameters per kernel is reduced from \(3^3=27\) to \(3 \times 3+3=12\), i.e., by more than half.
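A minimal Keras sketch of the factorized block is given below; placing a LeakyReLU between the two factored convolutions is an assumption rather than the exact configuration used in our experiments.

```python
from tensorflow.keras import layers

def f3d_conv(x, channels):
    """Factorized 3D (F3D) convolution: a 3x3x1 in-plane convolution followed
    by a 1x1x3 convolution across slices, replacing a single 3x3x3 convolution."""
    x = layers.Conv3D(channels, (3, 3, 1), padding='same')(x)
    x = layers.LeakyReLU(0.2)(x)
    x = layers.Conv3D(channels, (1, 1, 3), padding='same')(x)
    x = layers.LeakyReLU(0.2)(x)
    return x
```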

However, it is noteworthy that the factorization is not totally equivalent to regular 3D convolution, and a further ablation study over factorized 3D convolution is presented in “Ablation study of F3D convolution” section.

Loss function

The loss function of our network consists of two components, as shown in Eq. (3). The similarity loss \({\mathcal {L}}_{\mathrm{sim}}\) penalizes the dissimilarity between the fixed image \(I_{f}\) and the warped image \(I_{w}=I_{m} \circ \phi \). The deformation regularization \({\mathcal {L}}_{\mathrm{reg}}\) adopts an L2-norm of the gradients of the final deformation field \(\phi \), weighted by a trade-off weight \(\lambda \). We write the total loss as:

$$\begin{aligned} {\mathcal {L}}(I_{m}, I_{f}, \phi )={\mathcal {L}}_{\mathrm{sim}}(I_{f}, I_{m} \circ \phi )+\lambda {\mathcal {L}}_\mathrm{reg}(\phi ). \end{aligned}$$
(3)
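A minimal TensorFlow sketch of the regularizer is given below; approximating the gradients of \(\phi \) with forward finite differences is an assumption, as the exact discretization is an implementation detail not fixed above.

```python
import tensorflow as tf

def gradient_loss(phi):
    """L2-norm regularizer L_reg: mean squared forward differences of the
    deformation field phi of shape (batch, W, H, D, 3)."""
    dx = phi[:, 1:, :, :, :] - phi[:, :-1, :, :, :]
    dy = phi[:, :, 1:, :, :] - phi[:, :, :-1, :, :]
    dz = phi[:, :, :, 1:, :] - phi[:, :, :, :-1, :]
    return (tf.reduce_mean(dx ** 2) +
            tf.reduce_mean(dy ** 2) +
            tf.reduce_mean(dz ** 2))
```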

Specifically, modality independent neighborhood descriptor (MIND) [11] can be used to measure the similarity of both multimodal and unimodal images. MIND is a modality-invariant structural representation, and we can minimize the difference in the MIND features between the warped image \(I_{w}(I_{m} \circ \phi )\) and the fixed image \(I_{f}\) to effectively train the registration network. We define:

$$\begin{aligned} {\mathcal {L}}_{\mathrm{sim}}\left( I_{f}, I_{m} \circ \phi \right) = \frac{1}{N|R|} \sum _{x}\left\| \mathrm {MIND}\left( I_{m} \circ \phi \right) - \mathrm {MIND}\left( I_{f}\right) \right\| _{1}, \end{aligned}$$
(4)

where N denotes the number of voxels in the input images \(I_{w}(I_{m} \circ \phi )\) and \(I_{f}\), and R is a non-local region around voxel x.
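For illustration, a sketch of the similarity and total losses is given below; `mind_features` is a hypothetical helper that extracts the MIND descriptor [11] over the neighbourhood R, and the sketch reuses the `gradient_loss` helper from above.

```python
import tensorflow as tf

def mind_loss(fixed, warped, mind_features):
    """Eq. (4): mean absolute difference between the MIND descriptors of the
    warped and fixed images; `mind_features` is assumed to return a tensor of
    shape (batch, W, H, D, |R|), so the mean realizes the 1/(N|R|) factor."""
    return tf.reduce_mean(tf.abs(mind_features(warped) - mind_features(fixed)))

def total_loss(fixed, warped, phi, mind_features, lam=1.5):
    """Eq. (3): MIND-based similarity plus lambda-weighted smoothness term."""
    return mind_loss(fixed, warped, mind_features) + lam * gradient_loss(phi)
```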

Experiments

Dataset and implementation

In this work, we focus on the application of abdominal CT-MRI multimodal registration to improve the accuracy of percutaneous nephrolithotomy (PCNL). To further validate the effectiveness of our method, we also evaluate the proposed method on a public lung CT unimodal dataset [13].

  • Abdominal CT-MRI dataset: Under an IRB-approved study, we obtained a proprietary intra-patient CT-MRI dataset containing paired CT and MR images from 50 patients. The liver, kidney and spleen in both CT and MRI were manually segmented for quantitative evaluation. Standard preprocessing steps, including affine spatial normalization, resampling and intensity normalization, were performed. The images were cropped into \(144\times 144\times 128\) subvolumes with 1 mm isotropic voxels and divided into two groups for training (40 cases) and testing (10 cases).

  • Learn2Reg 2020 Lung CT dataset [13]: This dataset contains paired inspiratory and expiratory thorax CT images from 30 subjects (20 cases for training and 10 cases for testing). For all scans, lung segmentation masks are provided for evaluation. Standard preprocessing steps, including affine spatial normalization and resampling, had been performed by the challenge organizers. We further carried out intensity normalization and cropped the images into \(128\times 128\times 160\) subvolumes (a minimal sketch of the normalization and cropping steps follows this list).
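As referenced above, the following is a minimal NumPy sketch of the intensity normalization and center-cropping steps; the min-max normalization and the helper name are illustrative assumptions, and the affine normalization and resampling steps are omitted.

```python
import numpy as np

def normalize_and_crop(volume, target_shape=(144, 144, 128)):
    """Hypothetical preprocessing helper: min-max intensity normalization
    followed by a center crop to the target subvolume size."""
    v = volume.astype(np.float32)
    v = (v - v.min()) / (v.max() - v.min() + 1e-8)
    starts = [(s - t) // 2 for s, t in zip(v.shape, target_shape)]
    crop = tuple(slice(s, s + t) for s, t in zip(starts, target_shape))
    return v[crop]
```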

The proposed method is implemented using Keras with the TensorFlow backend. We train the network on an NVIDIA Titan X (Pascal) GPU using the Adam optimizer [17] with a learning rate of 1e-5 and a batch size of 1. As for the optimal trade-off weight \(\lambda \), we conduct an exhaustive grid search and select the value that achieves the highest average Dice score of the ROIs on the hold-out test set.
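For concreteness, a hypothetical training step under the stated settings is sketched below; `f3rnet` is assumed to be a Keras model that returns the warped image and the deformation field, and `mind_features` and `total_loss` are the helpers sketched in the “Loss function” section.

```python
import tensorflow as tf

optimizer = tf.keras.optimizers.Adam(learning_rate=1e-5)

@tf.function
def train_step(moving, fixed, lam=1.5):
    """One optimization step with batch size 1: forward pass, composite loss
    of Eq. (3), and gradient update of the network weights."""
    with tf.GradientTape() as tape:
        warped, phi = f3rnet([moving, fixed], training=True)
        loss = total_loss(fixed, warped, phi, mind_features, lam=lam)
    grads = tape.gradient(loss, f3rnet.trainable_variables)
    optimizer.apply_gradients(zip(grads, f3rnet.trainable_variables))
    return loss
```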

Measurement

We evaluate the registration performance of each method using a series of metrics, mainly the average surface distance (ASD, lower is better) and the average Dice score (higher is better) between the segmentation masks of the warped and fixed images. Besides, the average number of voxels with a non-positive Jacobian determinant (\(|J_{\phi }| \le 0\)) in the deformation fields is counted to evaluate the diffeomorphism of the local deformation (lower is better). The standard deviation of the Jacobian determinant (\(\sigma (|J_{\phi }|)\)) is also calculated to evaluate the smoothness of the transformations (lower is better).
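A minimal NumPy sketch of how these Jacobian-based measures can be computed from a dense displacement field is given below; the finite-difference approximation of the gradients is an assumption, as the exact evaluation code is not specified above.

```python
import numpy as np

def jacobian_determinant(phi):
    """Jacobian determinant of the mapping x + phi(x) for a displacement
    field phi of shape (W, H, D, 3); voxels with det <= 0 indicate folding."""
    grads = np.gradient(phi, axis=(0, 1, 2))    # d(phi_i)/d(x_j), three arrays
    J = np.stack(grads, axis=-1) + np.eye(3)    # (W, H, D, 3, 3), plus identity
    return np.linalg.det(J)

# Example usage on a random field (a real phi comes from the network output):
phi = np.random.randn(144, 144, 128, 3).astype(np.float32) * 0.1
det = jacobian_determinant(phi)
num_folding = int(np.sum(det <= 0))   # voxels with |J_phi| <= 0
smoothness = float(np.std(det))       # sigma(|J_phi|)
```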

Experimental results

Ablation study of F3D convolution

As mentioned in “Factorized 3D convolution (F3D)” section, although convolution factorization can dramatically reduce the number of training parameters, it may not be exactly equivalent to regular 3D convolution in practice. Therefore, we investigate different combinations of F3D convolution in our F3RNet. In our experiments, all \(3 \times 3 \times 3\) convolutions can be replaced, except for the final 3-channel 3D convolution used to generate the deformation field. The variants of F3RNet are presented in Table 1. In particular, the number of parameters of F3RNet-w/ F3D is only 56.8% of that of the original F3RNet. “More MRBs” indicates that two extra MRBs are added on the lowest-resolution path, meaning that the saved parameters can be used to add more MRBs to enhance the network’s learning capability.

Table 1 Different combinations of F3D convolution (\(\checkmark \)) in proposed F3RNet

Figure 3 presents the average Dice scores of ROIs on the hold-out test set for varying values of the smoothing trade-off weight \(\lambda \). The best Dice scores occur when \(\lambda = 1.5\) for F3RNet-w/o F3D, F3RNet-w/ F3D, F3RNet-Dec, F3RNet-FR and F3RNet-MRB, and \(\lambda = 2\) for F3RNet-Enc and F3RNet-MS. In particular, F3RNet-w/o F3D and F3RNet-MRB achieve better Dice scores than all other variants. Moreover, after achieving the best Dice scores at \(\lambda = 1.5\), the results vary slowly over larger \(\lambda \) for F3RNet-w/o F3D and F3RNet-MRB, showing that the two models are more robust to the choice of \(\lambda \).

Fig. 3
figure 3

Results of varying the trade-off weight \(\lambda \) on average Dice score of ROIs

Fig. 4
figure 4

Visual results of an example CT-to-MRI registration. Outside the grey box is an example fixed MR image and a zoom-in region with the segmentation masks of the liver (green), kidney (red) and spleen (blue). The corresponding warped CT images and zoom-in regions for the baselines and the ablation study are presented in the grey box. A good registration will cause structures in the warped images to lie close to the corresponding fixed segmentation masks. The red arrows indicate the registration of interest at the organ boundary

Figure 4 shows visual results of the warped images for the ablation analysis. We can first see that the original F3RNet (F3RNet-w/o F3D) can effectively register the multimodal images. If we replace all 3D convolutions with F3D (F3RNet-w/ F3D) or only replace the convolutions in the encoder or decoder (F3RNet-Enc and F3RNet-Dec), our methods can still effectively register the CT image but show slight performance degradation. Interestingly, if we replace the regular convolutions on the entire multi-scale residual stream alone or on the full-resolution stream alone, the information of the two streams can no longer interact effectively and noise is introduced, resulting in unstable performance and significant registration degradation. Therefore, if we use F3D to reduce the model parameters, the 3D convolutions on both streams should be replaced at the same time. Further, we can use the saved parameters to add more MRBs (F3RNet-MRB). From the visual results, it can be seen that the registration performance is either maintained or slightly improved.

Table 2 Average Dice scores and average ASD evaluations (mean ± std) for CT-to-MRI registration of all baseline methods and F3RNet with different combinations of F3D

Table 2 also provides comprehensive quantitative results for all baseline methods and the variants of our F3RNet with different combinations of F3D. As for the ablation analysis, we can see that F3RNet-w/o F3D and F3RNet-MRB achieve the best performance. Specifically, with only 80.2% of the parameters of F3RNet-w/o F3D, F3RNet-MRB achieves better ASD results in liver and kidney registration than F3RNet-w/o F3D, while also achieving better Dice scores in liver and spleen registration with reasonable diffeomorphism and smoothness of the deformation fields. Meanwhile, consistent with the visual assessment, we can also see that F3RNet-FR and F3RNet-MS both yield significant degradation in ASD and Dice score as they cause the features of the two streams to be disjointed.

Fig. 5
figure 5

Visual results of an example MRI-to-CT registration. Outside the grey box is an example fixed CT image and a zoom-in region with the segmentation masks of the liver (green), kidney (red) and spleen (blue). The corresponding warped MR images and zoom-in regions for all methods are presented in the grey box. The red arrows indicate the registration of interest at the organ boundary

Table 3 Average Dice scores and average ASD evaluations (mean ± std) for MRI-to-CT registration

Comparison with baselines on abdominal CT-to-MRI registration

To evaluate our proposed method, five open-source state-of-the-art baseline approaches are compared, including two traditional methods, SyN [1] with the mutual information (MI) metric [27] and Deeds [12] with five levels of discrete optimization, and three unsupervised learning-based methods, marked as VoxelMorph-1 (VM-1) [2], VoxelMorph-2 (VM-2) [2] and FAIM [18]. The three learning-based methods were originally proposed for unimodal registration, and we extend them to both multimodal and unimodal registration by using the MIND-based similarity metric. We use the same test set to search for the best regularization weights and then set the weights to 1.5 for VM-1, VM-2 and FAIM. Other parameters, such as the learning rate and batch size, remain the same as for our method.

Figure 4 also illustrates the warped CT images produced by the other baseline methods. As mentioned above, liver registration is much more challenging in the abdominal image registration task. From the results, we can see that the traditional method SyN fails to align the liver with large local deformation, while Deeds performs much better. As for the other deep learning methods, VM-1, VM-2 and FAIM achieve relatively satisfactory performance but still show considerable misalignments. Except for F3RNet-FR and F3RNet-MS, our methods produce the most visually appealing boundary alignment, which demonstrates that our F3RNet can better register the hard-to-align regions.

The quantitative results for the baseline methods are also presented in Table 2. Consistent with the visual results, the ASD and Dice score evaluations of our proposed methods, except for F3RNet-FR and F3RNet-MS, are better than those of the traditional methods and the other state-of-the-art unsupervised registration methods, with reasonable quality of the deformation fields. Among the baseline methods, Deeds provides competitive results compared with SyN and the other learning-based methods. Furthermore, the traditional methods take much more time (97 s for SyN and 37 s for Deeds) to register an image pair. In contrast, all deep learning methods can complete a registration task within 3 seconds on a GPU, making them appealing for image-guided therapies with strict time demands.

Experiments on abdominal MRI-to-CT registration

Among all the proposed networks for CT-to-MRI registration, F3RNet-w/o F3D and F3RNet-MRB provide superior results. To further validate the effectiveness of these two proposed methods, we also perform MRI-to-CT registration in turn. The division of the dataset and the other training settings of the networks, e.g., the regularization trade-off weights, are consistent with the CT-to-MRI registration task.

The visualization of the registration results in Fig. 5 shows that our methods, F3RNet-w/o F3D and F3RNet-MRB, achieve more accurate organ alignment than other traditional and deep learning approaches, especially for the liver.

The quantitative evaluation of MRI-to-CT registration is summarized in Table 3. Our proposed methods achieve better ASD and Dice scores than the traditional methods and the other state-of-the-art unsupervised learning-based registration methods. In particular, F3RNet-MRB achieves the best registration accuracy among all the methods with reasonably low \(|J_{\phi }| \le 0\) and \(\sigma (|J_{\phi }|)\).

Experiments on expiration-to-inspiration lung CT registration

Apart from the large local deformation between expiratory and inspiratory lung CT images, another challenge of the Learn2Reg 2020 Lung CT dataset [13] is that the lungs are not fully visible in several expiratory scans, as shown by \(I_{m}\) in Fig. 6. In our experiment, the MIND-based similarity metric [11] is still used to guide the network training. Empirically, the regularization weights are all set to 1.5 for VM-1, VM-2, FAIM and F3RNet. Other parameters, such as the learning rate and batch size, remain the same as in the aforementioned experiments.

Fig. 6
figure 6

Visual results of an example for expiration-to-inspiration lung CT registration from both axial and coronal views. The red contours represent the lung segmentation of the fixed inspiratory CT image

Table 4 Average Dice scores and average ASD evaluations (mean ± std) for lung CT registration

We visualize an example of the registration results from both axial and coronal views in Fig. 6. Evidently, the proposed methods, F3RNet-w/o F3D and F3RNet-MRB, achieve more accurate lung alignment than the other traditional and deep learning approaches, especially in the coronal view.

The quantitative evaluation of expiration-to-inspiration lung CT registration is summarized in Table 4. Our proposed methods achieve better ASD and Dice scores than the traditional methods and the other state-of-the-art unsupervised learning-based registration networks, with a reasonable trade-off in the diffeomorphism and smoothness of the deformation fields. In particular, F3RNet-MRB achieves the best performance among all the methods.

Conclusions

In this work, we propose a novel unsupervised registration network, namely the full-resolution residual registration network (F3RNet), which takes advantage of full-resolution information, multi-scale fusion, a deep residual learning framework and 3D convolution factorization to improve deformable registration performance. The experimental results on both multimodal and unimodal tasks indicate that our network can better register the hard-to-align regions, yielding superior registration accuracy. In our experiments, we found the current input size to be a compromise between image resolution and GPU memory limitations. Recently, the Laplacian pyramid image registration network (LapIRN) [21], which includes three pyramid branches to register the image pairs at different resolutions with a coarse-to-fine optimization scheme, has been proposed, offering promising insights into improving multi-scale fusion-based registration. Future work will focus on lighter and more elegant ways to leverage high-resolution information and multi-scale fusion to cope with large local deformation under limited GPU memory.