1 Introduction

Image registration is fundamental to many medical image analysis tasks. Within medical image registration, deformable registration aims to construct a dense, nonlinear transformation from a source image to a target image (referred to as the moving image and the fixed image, respectively) that captures variations in anatomical shape caused by factors such as patient motion, organ motion, and disease development. For example, deformable registration enables researchers to compare how a patient's organ anatomy evolves over time (longitudinally), or how organs differ between individuals with a disease and healthy individuals (cross-sectionally), which is critical for understanding how a disease alters organ anatomy [1,2,3].

Deep learning has recently been applied to a wide range of medical image analysis tasks with remarkable success. For registration in particular, unsupervised deep learning-based methods [4,5,6,7,8] have been proposed that perform deformable medical image registration without ground-truth deformation information. These methods generally use a convolutional neural network (CNN) to estimate a deformation field from a pair of images, and then use a spatial transformer network (STN) [9] to warp one image toward the other by interpolating it with the deformation field. Measured by the average similarity score on anatomical segmentations, they achieve higher registration accuracy than conventional methods [10]. Most learning-based methods use UNet-like architectures and concatenate the image pair into a single 2-channel input, which fuses the features before the independent tissue information contained in each resolution level has been extracted [4, 5, 8, 11, 12]. As a result, these methods ignore the independent information of each image in the pair. Recent dual-stream (also called two-stream) research [13] reports that modeling each image of the input pair individually enhances the deep representation of the model.

Furthermore, because medical images encode anatomical information, the deformation field should be physically realistic. Put differently, an image warped into another by a deformation field should retain the topology of its anatomical structures, which means that the deformation field should be smooth and contain little folding. Most unsupervised learning-based registration methods [4, 5, 8, 11] impose a global regularization on the gradient of the output displacement field to encourage smoothness. However, such regularization presumes that all deformations obey the same smoothness assumption, which over- or underconstrains the deformation when establishing anatomical correspondence. Learning-based diffeomorphic methods built on stationary velocity fields [6, 12] produce diffeomorphic transformations that preserve topology, i.e., they restrict folding and can even reduce the number of folding regions to zero. Nevertheless, studying the dynamic motion of organs requires discontinuous transformations [14], which conflicts with the continuity of diffeomorphisms. Thus, diffeomorphic methods perform poorly when registering intrapatient images of organs (e.g., the heart or lungs) acquired at different phases of their motion cycle. Displacement-based approaches can generate discontinuous transformations, but they are prone to folding.

To reduce folding in the deformation field and to ensure as much registration accuracy as possible, we divide the image registration task into two subtasks. The first is to compute highly accurate displacement fields using a global and coarse regularization function; the second is to use a model to find the folding and correct it to be smooth (i.e., reduce the number of folding regions). Here, we propose an unsupervised learning-based image registration method consisting of twinning networks, including a separate encoding network and a folding correction block. Our main contributions are as follows:

  • We propose a novel separate encoding network for unsupervised deformable image registration, which models the independent information of each image in an image pair separately to enhance the deep representation of the registration model.

  • To further reduce folding in a deformation field instead of relying on global regularization, we propose a novel folding correction block, a general module for 2D images and 3D volumes, which learns folding features, recognizes folding in displacement fields, and smooths the displacement fields.

  • Quantitative and qualitative results demonstrate that the proposed twinning method outperforms the state-of-the-art displacement-based and velocity-based methods both in the registration accuracy and the number of folding regions in the transformation.

  • We use the folding correction block to revise the output displacement fields of state-of-the-art methods. The results indicate that the folding correction block is more broadly applicable and more effective than global smoothness regularization.

2 Related work

2.1 Conventional deformable approaches

Conventional deformable image registration methods usually employ a similarity function such as NCC [15, 16], MSE [17, 18], or NMI [19] and iteratively optimize the registration model to maximize the similarity between an input pair of images. These methods, including elastic-type models [15, 20], discrete methods [21, 22], and DRAMMS [23], establish the spatial correspondence between two images and keep the displacement field smooth by using a regularization function or a smoothing filter. In addition, several conventional studies use diffeomorphic models to guarantee that the produced deformation field is differentiable, topology-preserving, and invertible [24, 25]. Diffeomorphic models such as LDDMM [26, 27] and SyN [10] are widely used and recognized. However, these iterative methods are time-consuming and require substantial computational resources to register an image pair.

2.2 Learning-based approaches

Many supervised learning-based methods have been proposed for deformable registration. These methods usually use a CNN model to learn a dense correspondence between an input pair of images, and most of them [28,29,30,31] require a ground-truth deformation field or anatomical segmentation to supervise the learning process. Supervised methods have demonstrated strong performance in image registration. However, the ground-truth information either requires complex annotation by experts or must be produced by conventional methods. Put differently, the supervision for these approaches is difficult to obtain or is not appropriate as ground truth in practice.

In recent years, to avoid the difficulty of collecting supervised information, most learning-based approaches [4,5,6,7,8, 11, 12] have focused on unsupervised training. Unsupervised methods first compute a displacement field, then use an STN to warp the moving image toward the fixed image, and finally use a differentiable similarity function to learn the dense spatial transformation between the image pair. For example, [4] proposed an unsupervised UNet-like 3D volume registration method that employs an NCC similarity and an L2 regularization to constrain the smoothness of the displacement field. However, computing smooth and topology-preserving transformations is still a challenge. To further avoid folding and obtain a topology-preserving warped image, diffeomorphic approaches use stationary velocity fields. Dalca et al. [6] proposed a probabilistic diffeomorphic registration method that offers uncertainty estimation within CNN and diffeomorphic integration models. Mok and Chung [12] proposed a diffeomorphic model to estimate both forward and inverse velocity fields simultaneously. Al Safadi and Song [32] proposed a meta-regularization method to learn regularization filters to generate a smoother displacement vector field. Kang et al. [13] employed the two-stream architecture, separately modeling the moving and fixed images down to the bottom of the encoder, then restoring them to the upper resolutions and fusing their feature maps at each level of the decoder.

In contrast to the abovementioned unsupervised learning-based approaches, we divide image registration into two subproblems and solve them step by step. Unlike the architectures of most of these methods, the proposed deformable image registration model fully considers the independent information of each image in the input pair and provides displacement fields with coarse regularization. Compared to the recent two-stream method [13], we introduce a separate encoding network that combines independent and concatenated (hybrid) encoding of the input image pair. The correction model then learns to distinguish folding features in the displacement field and regularizes them to be smooth.

3 Methods

Let X and Y be two images defined in the spatial domain \({\Omega } \subset \mathbb {R}^{i}\) (i = 2, 3). Figure 1 illustrates the overall architecture of the proposed twinning deformable image registration neural network. Our proposed method is divided into two stages: one for image registration and another for folding correction. In the image registration stage, the separate encoding network (SEN) is the proposed convolutional neural network for deformable image registration, which computes the displacement fields between X and Y, and \({{\mathscr{L}}}_{reg}\) is the registration loss function.

Fig. 1

Overview of the proposed twinning method for deformable image registration. ∘ denotes the spatial transform network. Δ is the residual factor. \(\bar {\phi }\) represents the corrected deformation field

Many displacement-field-based approaches [4, 5, 7, 11] employ global regularization, which over- or underconstrains the deformation and affects registration performance. Unlike these methods, we use a separate model to correct the output displacement fields and reduce folding. In the folding correction stage, the folding correction block (FCB) is the proposed convolutional neural network for correcting the displacement field. After SEN has been trained with coarse regularization, we freeze its parameters and feed the predicted displacement fields to the FCB. \({{\mathscr{L}}}_{fc}\) is the folding correction loss.

3.1 Separate encoding neural network

We take a deep unsupervised approach to learn the function that generates displacements, denoted as \(g_{\theta }\left (X,Y\right )=(\phi _{XY}^{(1)},\phi _{YX}^{(1)})\), where \(g_{\theta }\) represents the separate encoding neural network (SEN) with parameters \(\theta \). \(\phi _{XY}^{(1)}\) and \(\phi _{YX}^{(1)}\) are the two output displacement fields, i.e., the direct and inverse displacement fields. The motivation for estimating the bidirectional deformation is to guarantee the existence of the inverse transformation [12, 33]. To guarantee invertibility, let \(x\in X\) and \(y\in Y\); we compute the two directions as \(\phi _{XY}^{(1)}=\phi _{YX}^{(0)}(-\phi _{XY}^{(0)}(x))\) and \(\phi _{YX}^{(1)}=\phi _{XY}^{(0)}(-\phi _{YX}^{(0)}(y))\), where \(\phi _{XY}^{(0)}\) and \(\phi _{YX}^{(0)}\) are the two displacement fields predicted directly by the network. Therefore, \(g_{\theta }\) can be rewritten as \(g_{\theta }\left (X,Y\right )=(\phi _{YX}^{(0)}(-\phi _{XY}^{(0)}(x)),\phi _{XY}^{(0)}(-\phi _{YX}^{(0)}(y)))\).

Architecture of SEN

As shown in Fig. 2a, the proposed convolutional neural network consists of a 5-level hierarchical encoder-decoder with skip connections, similar to UNet [34]. Unlike conventional U-shaped networks [4,5,6,7, 12], which concatenate the X (fixed) and Y (moving) image volumes into a single 2-channel input, the encoder of the proposed SEN is divided into three branches. At each encoder level, the first branch extracts feature maps for X, the second branch extracts feature maps for Y, and the third branch extracts feature maps for the concatenated pair XY. The SEN concatenates the feature maps of the three branches at each level, processes the concatenated feature maps further, and downsamples them to serve as the input to the third branch of the encoder block at the next resolution level. The encoder blocks consist of convolutional layers with a 3 × 3 × 3 kernel and a stride of 1, each followed by a rectified linear unit (ReLU) activation, which keep the feature-map size constant within each resolution level. To downsample, we apply convolutional layers with a 3 × 3 × 3 kernel and a stride of 2 followed by a ReLU activation, halving the feature-map size until the lowest resolution level is reached. At each resolution level of the decoder, we apply convolutional layers with a 3 × 3 × 3 kernel and a stride of 1 followed by a ReLU activation, and a 2 × 2 × 2 deconvolutional layer that upsamples the feature maps to twice their size; the result is concatenated with the corresponding encoder feature maps through skip connections. To ensure that the inverse transformation exists, we use two 3 × 3 × 3 convolutional layers with a stride of 1 followed by a softsign activation (i.e., \(softsign(x)=\frac {x}{1+|x|}\)), which normalizes the feature maps to the range [−1, 1] and yields the direct \(\phi _{XY}^{(0)}\) and inverse \(\phi _{YX}^{(0)}\). Each of them is then multiplied by a constant c so that the displacement fields lie within the range [−c, c].
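The final bounding step of the SEN output can be illustrated with a short sketch. This is only an illustration in plain Python and not the authors' implementation: the function name `bounded_displacement` and the bound c = 7.0 are assumptions made for the example, and a practical implementation would apply the same operation to whole tensors.

```python
def softsign(x: float) -> float:
    """Return x / (1 + |x|), which maps any real number into (-1, 1)."""
    return x / (1.0 + abs(x))


def bounded_displacement(raw: float, c: float) -> float:
    """Scale a softsign-normalized value so that it stays within [-c, c].

    raw: one unbounded output value of the last convolutional layer.
    c:   the constant bound on the displacement magnitude.
    """
    return c * softsign(raw)


# Example with an assumed bound c = 7.0 (not a value taken from the paper).
samples = [bounded_displacement(v, 7.0) for v in (-50.0, -1.0, 0.0, 1.0, 50.0)]
```

Because softsign saturates smoothly, even very large activations produce displacements whose magnitude stays below c.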

Fig. 2

Illustration of our two subnetworks. (a) The proposed SEN architecture, a fully convolutional network that predicts the bidirectional deformation fields. The gray and orange blocks indicate the 3D feature maps from the encoder and decoder, respectively. (b) The proposed FCB architecture, which is used to reduce the folding regions in the SEN predictions

3.2 Folding correction block

When training of the proposed SEN is finished, we freeze its parameters and pass the training set through the trained SEN again to obtain the displacement fields. We take an unsupervised approach to learn the function \(f_{\theta ^{\prime }}(\phi ^{(1)})={\Delta }\) for correcting displacements, where \(f_{\theta ^{\prime }}\) is the proposed folding correction block (FCB) with parameters \(\theta ^{\prime }\) and input \(\phi ^{(1)}\). Δ is a residual factor that is used to reduce the folding in the input displacement fields through \(\bar {\phi }=c\times \ \phi ^{(1)}-{\Delta }\), where \(\bar {\phi }\) is the corrected displacement field. This formula indicates that Δ encodes both the locations of folding and the magnitude of the displacement that needs to be corrected. Figure 3 shows some folding regions, which appear as crossing lines in the grid visualization. We can observe that the FCB recognizes folding and corrects it to smooth the transformation, i.e., the grid lines no longer cross.

Fig. 3

An example of the FCB correcting the folding regions in a displacement field. The grid is a visualization of the displacement field. The red frames mark some of the folding regions in the deformation field; zooming into the marked regions reveals crossing grid lines. The arrows indicate the corrected local displacement field

Architecture of FCB

As shown in Fig. 2b, the proposed FCB consists of four convolutional layers with a 3×3×3 kernel and a stride of 1, each followed by a ReLU activation except for the last layer. Convolutional layers with kernel sizes of 3×3×3 and 5×5×5 and different strides, followed by ReLU activation, downsample the input displacement fields to the same size. The deconvolutional layers upsample the 1/2-resolution feature maps to the shape of the original input displacement fields. Then, the last layer outputs the residual factor Δ.

3.3 Loss functions

3.3.1 Registration loss function

We employ the registration loss function \({{\mathscr{L}}}_{reg}={{\mathscr{L}}}_{sim}({\cdot })+{{\mathscr{L}}}_{smooth}({\cdot })\) to penalize the displacement field; it is divided into a similarity loss and a regularization loss. Both terms are pairwise, i.e., each consists of bidirectional losses. We use normalized cross-correlation (NCC) and mean square error (MSE) as the similarity loss functions to measure the similarity between the warped image and the fixed image. To measure the similarity between warped X and Y and between warped Y and X, the similarity loss function is formulated as

$$ {\mathcal{L}}_{sim}(X,Y)={\mathcal{L}}_{sim}(X\circ(\phi_{XY}^{(1)}),Y)+{\mathcal{L}}_{sim}(Y\circ(\phi_{YX}^{(1)}),X), $$
(1)

where \(X\circ (\phi _{XY}^{(1)})\) and \(Y\circ (\phi _{YX}^{(1)})\) represent image X warped toward Y via the displacement field \(\phi _{XY}^{(1)}\) and image Y warped toward X via the displacement field \(\phi _{YX}^{(1)}\), respectively. \({{\mathscr{L}}}_{sim}\) is NCC for 3D volumes (\({\Omega } \subset \mathbb {R}^{3}\)) and MSE for 2D images (\({\Omega } \subset \mathbb {R}^{2}\)). A higher NCC value or a smaller MSE value indicates a better alignment.
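The two similarity measures and the bidirectional sum in (1) can be sketched as follows. This is a minimal illustration on flattened intensity lists, under stated assumptions: the NCC here is computed globally over all positions (windowed, local variants are common in registration and may be what is used in practice), the `eps` term is added only for numerical safety, and the helper name `bidirectional_similarity` is hypothetical. When NCC is used as a loss, its negative would be minimized.

```python
import math


def mse(a, b):
    """Mean square error between two equally sized intensity lists."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def ncc(a, b, eps=1e-8):
    """Global normalized cross-correlation in [-1, 1] (assumed global variant)."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b + eps)


def bidirectional_similarity(sim, warped_x, y, warped_y, x):
    """Sum the measure over both directions: sim(X∘φ_XY, Y) + sim(Y∘φ_YX, X)."""
    return sim(warped_x, y) + sim(warped_y, x)
```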

We enforce coarse smoothness of the deformation field using an L2 regularization loss on grad(·), which denotes the spatial gradient approximated by differences between neighboring positions. Thus, the smoothness term \({{\mathscr{L}}}_{smooth}\) (abbreviated as \({{\mathscr{L}}}_{smo}\) in (2) and (3)) is defined as follows:

$$ {\mathcal{L}}_{{smo}}({\phi_{XY}^{(1)}, \phi_{YX}^{(1)}}) = \sum\limits_{x \in {\Omega}}{({{\| {grad(\phi_{XY}^{(1)})} \|}^{2}} + {{\| {grad(\phi_{YX}^{(1)})}\|}^{2}})}. $$
(2)

Therefore, the registration loss function of our first network can be written as follows:

$$ {\mathcal{L}}_{reg}(X,Y)={\mathcal{L}}_{sim}(X,Y)+{\lambda_{1} {\mathcal{L}}}_{smo}(\phi_{XY}^{(1)},\phi_{YX}^{(1)}), $$
(3)

where λ1 is a hyperparameter that balances the accuracy of the network predictions and the coarse smoothness of the output displacement fields.
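Equations (2) and (3) can be illustrated with a small 2D sketch. It assumes that each displacement field is stored as a nested list in which `field[i][j]` is a `(u, v)` tuple and that the spatial gradient is approximated by forward differences; both choices, as well as the function names, are assumptions made for illustration rather than details taken from the paper.

```python
def grad_sq_norm(field):
    """Sum of squared forward differences of a 2D displacement field."""
    h, w = len(field), len(field[0])
    total = 0.0
    for i in range(h):
        for j in range(w):
            # Compare each position with its lower and right neighbor.
            for di, dj in ((1, 0), (0, 1)):
                ni, nj = i + di, j + dj
                if ni < h and nj < w:
                    total += sum((a - b) ** 2
                                 for a, b in zip(field[ni][nj], field[i][j]))
    return total


def smoothness_loss(phi_xy, phi_yx):
    """Bidirectional L2 smoothness term, as in (2)."""
    return grad_sq_norm(phi_xy) + grad_sq_norm(phi_yx)


def registration_loss(sim_value, phi_xy, phi_yx, lam1):
    """Similarity term plus weighted smoothness term, as in (3)."""
    return sim_value + lam1 * smoothness_loss(phi_xy, phi_yx)
```

A constant field contributes nothing to the smoothness term, so only spatial variation of the displacement is penalized.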

3.3.2 Folding correction loss function

We propose a folding correction loss function \({{\mathscr{L}}}_{fc}={{\mathscr{L}}}_{sim_{2}}(\cdot )+{{\mathscr{L}}}_{Jdet}(\cdot )+{{\mathscr{L}}}_{enc}(\cdot )\) consisting of three terms: the deformation field similarity loss \({{\mathscr{L}}}_{sim_{2}}\), the Jacobian determinant regularization loss \({{\mathscr{L}}}_{Jdet}\), and the regularization \({{\mathscr{L}}}_{enc}\). Here, \({{\mathscr{L}}}_{sim_{2}}\) is an MSE similarity function that measures the similarity between \(\bar {\phi }\) and \(\phi ^{(1)}\). It can be formulated as follows:

$$ {\mathcal{L}}_{sim_{2}}({\phi^{(1)}},\bar{\phi}) = MSE(\phi^{(1)},\bar{\phi}), $$
(4)

where \(\phi ^{(1)}\) is the input displacement field and \(\bar {\phi }\) is the corrected displacement field.

We use the Jacobian determinant in the second term of the proposed folding correction loss function because the determinant is positive wherever the deformation preserves local orientation and nonpositive where folding occurs. Put differently, the Jacobian determinant is sensitive to folding. The Jacobian matrix is defined as follows:

$$ J_{\bar{\phi}}(p)= \left(\begin{array}{lll} \frac{\partial{\bar{\phi}}_{x}(p)}{\partial x}&\frac{\partial{\bar{\phi}}_{x}(p)}{\partial y}&\frac{\partial{\bar{\phi}}_{x}(p)}{\partial z}\\ \frac{\partial{\bar{\phi}}_{y}(p)}{\partial x}&\frac{\partial{\bar{\phi}}_{y}(p)}{\partial y}&\frac{\partial{\bar{\phi}}_{y}(p)}{\partial z}\\ \frac{\partial{\bar{\phi}}_{z}(p)}{\partial x}&\frac{\partial{\bar{\phi}}_{z}(p)}{\partial y}&\frac{\partial{\bar{\phi}}_{z}(p)}{\partial z} \end{array}\right). $$
(5)

\(J_{\bar {\phi }}(p)\) denotes the Jacobian matrix of the deformation field \(\bar {\phi }\) at position p, and \(\lvert J_{\bar {\phi }}(p) \rvert \) denotes its determinant.

To measure the degree of a folding region in the deformation field, we utilize Jacobian determinant regularization in [12] and give a smooth formulation. The Jacobian determinant regularization is written as follows:

$$ {\mathcal{L}}_{Jdet}=\ln{\left( \frac{1}{N}\sum\limits_{p\in{\Omega}}ReLU(-\lvert J_{\bar{\phi}}(p) \rvert)\right)}, $$
(6)

where N is the total number of elements in \(\lvert J_{\bar {\phi }} \rvert \). ReLU is the rectified linear unit; applied to \(-\lvert J_{\bar {\phi }}(p) \rvert \), it retains the magnitude of the determinant where \(\lvert J_{\bar {\phi }}(p) \rvert \le 0\) and returns zero where \(\lvert J_{\bar {\phi }}(p) \rvert >0\).
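A minimal sketch of (6), given a flat list of precomputed determinant values, is shown below. The small constant `eps` is an assumption added here so that the logarithm stays finite when no determinant is negative; it does not appear in the formula above, and the function name `jdet_loss` is hypothetical.

```python
import math


def jdet_loss(dets, eps=1e-8):
    """Log of the mean ReLU(-det) over all positions, following (6).

    dets: flat list of Jacobian determinant values.
    eps:  assumed stabilizer that avoids log(0) when there is no folding.
    """
    penalty = sum(max(0.0, -d) for d in dets) / len(dets)
    return math.log(penalty + eps)
```

Only positions with negative determinants contribute, so a field with more or stronger folding yields a larger loss value.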

To balance the contributions of \({{\mathscr{L}}}_{sim_{2}}\) and \({{\mathscr{L}}}_{Jdet}\), we use a variant of L2 regularization on the spatial gradient of Δ to encourage Δ to vary. Thus, \({{\mathscr{L}}}_{enc}\) is defined as follows:

$$ {\mathcal{L}}_{enc}=\ln{\sum\limits_{x \in {\Omega}}(\|grad({\Delta})\|^{2})} $$
(7)

The two functions (4) and (6) in \({{\mathscr{L}}}_{fc}\) enforce the adjustment of the local regions with negative Jacobian determinants in the deformation field \(\phi ^{(1)}\), whereas the local regions with positive Jacobian determinants are left uncorrected. This adjustment is made under the premise that the adjusted deformation field \(\bar {\phi }\) remains similar to the original deformation field \(\phi ^{(1)}\). Put differently, the adjusted local deformation stays consistent with the constraints, including the magnitude constraints, of its neighborhood. We balance the contributions of these two terms with the weight λ2 on \({{\mathscr{L}}}_{Jdet}\), and weight the regularization \({{\mathscr{L}}}_{enc}\) with λ3. Therefore, \({{\mathscr{L}}}_{fc}\) can be written as \({{\mathscr{L}}}_{fc}=\ {{\mathscr{L}}}_{sim_{2}}+\lambda _{2}{{\mathscr{L}}}_{Jdet}+\lambda _{3}{{\mathscr{L}}}_{enc}\).

4 Experiments

4.1 Datasets

The first dataset is the EchoNet-Dynamic dataset [35]. This dataset is composed of echocardiogram videos and human expert annotations of the left cardiac ventricle of each subject. We select 1276 image pairs, each consisting of the end-systole and end-diastole frames of a video, which are annotated by human experts. We use the end-systole phase image as Y and the end-diastole phase image as X. The selected image pairs are randomly divided into 920 for training, 100 for validation, and 256 for the evaluation of each method.

The second dataset is OASIS [36] as preprocessed in [4], which consists of a cross-sectional collection of T1-weighted MRI scans from 416 subjects aged 18 to 96. The raw MRI scans, with a shape of 256 × 256 × 256 and a resolution of 1 mm × 1 mm × 1 mm, are preprocessed using FreeSurfer [37], resulting in volumes of shape 160 × 224 × 192. We resample these scans to 96 × 112 × 96. We randomly select 270 MRI scans from the dataset and divide them into 200, 36, and 34 scans for training, validation, and testing, respectively. We randomly select 4 and 6 MRI volumes from the validation and testing sets, respectively, as fixed images, and the remaining volumes serve as moving images. We perform registration by aligning the moving image volumes to each fixed image. To compare with the other methods, we use X as the fixed image and Y as the moving image volume in our method, and we register a total of 120 fixed/moving image volume pairs for each method.

4.2 Measurement

Because the ground-truth nonlinear deformation field is challenging to obtain, we evaluate registration performance with the Dice similarity coefficient and the Jacobian determinant (\(\lvert J_{\phi } \rvert \)). For example, we first register each moving brain MRI volume to each atlas to obtain the deformation field. Then, using the predicted deformation fields, we warp the anatomical segmentation map of each moving image to align it with the anatomical segmentation map of the corresponding fixed image. We evaluate the overlap of the segmentation maps using the Dice metric expressed as a percentage (higher is better). We then compute \(\lvert J_{\phi } \rvert \) on each displacement field and count the number of pixels or voxels with nonpositive Jacobian determinants (i.e., \(\lvert J_{\phi } \rvert \leq 0\); lower is better).
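The segmentation-warping step of this evaluation can be sketched for the 2D case. The sketch assumes a pull-back convention (the output at position p takes the label found at p plus the displacement at p), nearest-neighbor sampling so that labels stay integer-valued, and clamping at the image border; these conventions and the function name `warp_labels_nearest` are assumptions made for illustration, not details given in the text.

```python
def warp_labels_nearest(labels, disp):
    """Warp a 2D label map with a displacement field (nearest neighbor).

    labels: nested list of integer labels, labels[i][j].
    disp:   nested list of (d_row, d_col) displacements, disp[i][j].
    """
    h, w = len(labels), len(labels[0])
    out = [[0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            # Sampling location in the moving label map.
            si = int(round(i + disp[i][j][0]))
            sj = int(round(j + disp[i][j][1]))
            # Clamp to the image border (assumed boundary handling).
            si = min(max(si, 0), h - 1)
            sj = min(max(sj, 0), w - 1)
            out[i][j] = labels[si][sj]
    return out
```

The warped label map can then be compared with the fixed image's label map using an overlap measure.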

4.2.1 Dice

Dice is a metric for measuring the overlap of anatomical segmentation maps between the warped moving image and the fixed image. In our experiments, 36 anatomical structures are used for the analysis of brain MRI, and only the left ventricle annotation is used for the analysis of cardiac ultrasound images. Dice values lie in the range [0, 1], and a higher Dice value indicates higher registration accuracy.
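For reference, the Dice coefficient of one label is twice the size of the overlap divided by the sum of the two region sizes. A minimal sketch on flattened label lists follows; returning 1.0 when a label is absent from both maps is an assumed convention, and the function names are hypothetical.

```python
def dice(seg_a, seg_b, label):
    """Dice overlap of one label between two flattened segmentation maps."""
    in_a = [v == label for v in seg_a]
    in_b = [v == label for v in seg_b]
    intersection = sum(1 for a, b in zip(in_a, in_b) if a and b)
    size = sum(in_a) + sum(in_b)
    # Assumed convention: a label missing from both maps counts as perfect overlap.
    return 2.0 * intersection / size if size else 1.0


def mean_dice(seg_a, seg_b, labels):
    """Average Dice over a list of anatomical labels."""
    return sum(dice(seg_a, seg_b, lab) for lab in labels) / len(labels)
```

With a single structure, as for the left ventricle, `mean_dice` reduces to `dice`; with several structures, as for the 36 brain labels, it averages the per-structure scores.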

4.2.2 Jacobian determinant

The Jacobian matrix is defined in (5). In our experiments, we compute the Jacobian determinant of each displacement field and count the number of pixels or voxels with nonpositive Jacobian determinants (i.e., \(\lvert J_{\phi } \rvert \leq 0\)).
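The counting procedure can be sketched for a 2D displacement field as follows. The sketch makes two assumptions that the text does not spell out: the deformation is treated as the identity plus the displacement, which is a common convention, and derivatives are approximated with forward differences on interior positions. The function names are hypothetical.

```python
def jacobian_determinants(disp):
    """Determinants of the Jacobian of p -> p + disp(p) on a 2D grid.

    disp[i][j] is a (u, v) displacement along the row and column axes.
    Returns an (h-1) x (w-1) nested list (forward differences).
    """
    h, w = len(disp), len(disp[0])
    dets = []
    for i in range(h - 1):
        row = []
        for j in range(w - 1):
            # Forward differences of both components along rows and columns.
            du_di = disp[i + 1][j][0] - disp[i][j][0]
            dv_di = disp[i + 1][j][1] - disp[i][j][1]
            du_dj = disp[i][j + 1][0] - disp[i][j][0]
            dv_dj = disp[i][j + 1][1] - disp[i][j][1]
            # Identity is added on the diagonal (assumed convention).
            row.append((1.0 + du_di) * (1.0 + dv_dj) - du_dj * dv_di)
        dets.append(row)
    return dets


def count_nonpositive(dets):
    """Number of positions whose Jacobian determinant is <= 0."""
    return sum(1 for row in dets for d in row if d <= 0)
```

A zero displacement gives a determinant of 1 everywhere, while a displacement that reverses the order of neighboring grid points yields negative determinants, which is what the count detects.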

4.2.3 Baseline methods and implementation

We compare our proposed method to three unsupervised deep learning-based deformable registration methods. The first and second baseline methods are VoxelMorph (VM) [4] and ViT-V-Net (VVN) [11], both of which predict a displacement field and use global regularization to keep the displacement fields smooth. VM employs a UNet and outputs the displacement field directly. VVN is a transformer-based method that introduces the transformer into image registration. The third baseline method is SYMNet (SN) [12], which predicts diffeomorphic transformations. For these methods, we use their official online implementations and train VM, VVN, and SN with the parameter settings recommended in [4, 11, 12]. The proposed method is implemented in PyTorch. We adopt the Adam [38] optimizer with a learning rate of 0.0001 for both SEN and FCB. We train our method and the baseline methods on an RTX 3080 GPU. Note that we first train the proposed SEN and then freeze its parameters to compute the displacement fields for each image pair. Finally, the FCB uses these displacement fields to learn its parameters. The FCBs mentioned below are trained on the SEN predictions. The hyperparameters λ1, λ2, and λ3 are set differently for each dataset; the specific settings are given with the experimental results in Section 4.3.

4.3 Experimental results

4.3.1 Validation on the cardiac dataset

We first evaluate displacement-based methods (VM, VVN) with global regularization λ1 = (0.04,0.05) and our proposed method on the cardiac dataset. We utilize MSE as the loss function and train these methods for 160,000 iterations. We tune the hyperparameter λ1 = 0.05 for the coarse regularization in our method. We set λ2 and λ3 to (40, -1).

Analysis and discussion

The first part of Table 1 shows the registration results on 256 cardiac ultrasound image pairs. We can observe that our proposed single method SEN outperforms the other two baseline methods on the average Dice metric. Our proposed twinning method, denoted as SEN+FCB, outperforms the others both on the Dice metric and the number of nonpositive \(\lvert J_{\phi } \rvert \). Figure 4 shows a registration result, including displacement field computed by the SEN, the displacement field corrected by the FCB, and the final warped image.

Table 1 Comparison of cardiac ultrasound image results
Fig. 4

A sample result of registering a pair of cardiac ultrasound images. (a) Fixed image. (b) Moving image. (c) Moving image warped by the displacement field. (d) Displacement field. (e) Corrected displacement field. (f) Moving image warped by the corrected displacement field. (g) Difference between the warped image and the corrected warped image. The colorbar in (g) represents the normalized difference

To illustrate that the FCB correction outperforms global regularization, we use the FCB to correct the output displacement fields of VM and VVN trained with the global regularization hyperparameter λ1 = 0.04. We then compare the corrected results with the output displacement fields of VM and VVN trained with λ1 = 0.05. Compared to VM, SN, and VVN, our proposed SEN and SEN+FCB achieve the best Dice metrics. On the average number of nonpositive \(\lvert J_{\phi } \rvert \), the results of VM are slightly higher than those of our proposed SEN+FCB. The standard deviation of the number of nonpositive \(\lvert J_{\phi } \rvert \) is lowest for our proposed SEN+FCB among all approaches, which indicates that our method is robust in predicting deformation fields. The second part of Table 1 shows the global regularization and correction results. It is worth noting that the Dice values improve while the number of nonpositive \(\lvert J_{\phi } \rvert \) is significantly reduced when the FCB is applied to VM, VVN, and SEN. This suggests that correction with the FCB is more effective than global regularization. Figure 7 shows the deformation fields of the cardiac images predicted by each method and the warped image with the overlaid segmentation map.

4.3.2 Validation on brain dataset

We evaluate VM, VVN, SN, and our proposed twinning method on 3D brain MRI volumes. For VM, SN, and the proposed SEN, we use NCC as the loss function. We find that the Dice metric is the best when VM is trained with smooth hyperparameter λ1 = 3 and VVN is trained with λ1 = 0.02 on this 3D brain dataset. We use the recommended global regularization hyperparameters in [12] for SN. SN employs the explicit Jacobian loss term with the hyperparameter λo to achieve folding reduction. We use λo = (0,1000), which is the recommended setting in [12] to restrict the folding, and then compare it to our proposed twinning method results. We tune the hyperparameter λ1 = 2 for the coarse regularization in our method. We set λ2 and λ3 to (50000, -0.01).

Analysis and discussion

The first part of Table 2 shows the experimental results on brain MRI volumes. For SN, we find that the Dice metric changes considerably while the folding reduction remains insufficient when the hyperparameter λo of the explicit Jacobian loss term is changed (i.e., \(0 \rightarrow 1000\)). This result motivates us to design the folding correction block with a smooth Jacobian loss term. SEN alone achieves the best Dice results, leading the other methods by approximately 1-2%. The proposed twinning method outperforms the others on both the Dice and nonpositive \(\lvert J_{\phi } \rvert \) metrics, except for VVN with the smoothness hyperparameter λ1 = 0.02. When we adjust λ1 to 0.05 so that the number of nonpositive \(\lvert J_{\phi } \rvert \) in VVN’s results is similar to ours, VVN does not perform well on the Dice metric. This is consistent with the observation in [6, 12] that model performance degrades as the smoothness hyperparameter increases. Figure 5 shows a registration result, including the displacement field computed by the SEN, the displacement field corrected by the FCB, and the final transformed brain image. The boxplot in Fig. 6 shows the comparison results for each anatomical structure.

Table 2 Comparison of brain MRI scans results
Fig. 5

A sample slice from the result of registering a pair of brain MRI volumes. (a) Fixed image. (b) Moving image. (c) Moving image warped by the displacement field. (d) Displacement field. (e) Corrected displacement field. (f) Moving image warped by the corrected displacement field. (g) Difference between the warped image and the corrected warped image. The colorbar in (g) represents the normalized difference

Fig. 6

A boxplot illustrating the Dice value of each anatomical structure segmentation for VM, VVN, SN, and our proposed twinning method. We averaged the Dice values of the left and right brain hemispheres and combined them into one structure for visualization

The second part of Table 2 shows the FCB correcting the folding of the other three methods. The results demonstrate that the FCB effectively reduces the number of nonpositive \(\lvert J_{\phi } \rvert \) while sacrificing some registration accuracy. The FCB removes approximately 85-90% of the nonpositive \(\lvert J_{\phi } \rvert \) for VM, VVN, and our proposed SEN, and 65% for SN. We attribute this gap to the way the deformation field is generated: one is based on the displacement field, and the other is based on the velocity field. The results of SN + FCB indicate that an additional convolutional block with the Jacobian loss term outperforms a single network with the explicit Jacobian loss term. In contrast to the results on the cardiac dataset, the Dice metrics decrease after correction by the FCB. We attribute this to the different number of anatomical labels used for evaluation: one label for each cardiac ultrasound image and 36 labels for each brain MRI volume. Overall, compared to SN, the FCB corrects the displacement field more effectively and maintains the registration accuracy well. Figure 7 shows each method’s output deformation fields for the brain images and the warped images.

Fig. 7

The view of the fixed/moving image slices and each baseline method’s deformation fields and the warped images with the overlaid segmentation maps

4.3.3 Runtime analysis

We register each pair of images for the nonlinear deformable registration task using an NVIDIA RTX 3080 GPU. We measure the execution time for VM, VVN, SN, SEN, and SEN+FCB. Figure 8 shows the average runtime of our proposed methods and the baseline methods. The results show that our method is faster than the baseline methods at registering a pair of images. Furthermore, it is worth noting that using the FCB to correct folding in a deformation field does not significantly increase the runtime of the registration method.

Fig. 8

The bar chart of runtime for each method to register a pair of images. The orange bars are the average runtime. The blue bars are the standard deviations of runtime

4.3.4 Ablation study

To demonstrate the effectiveness of our proposed SEN, we remove the separate encoding of each image, leaving only the concatenated encoding branch. We then double the number of channels in the concatenated branch to keep the number of channels at each level unchanged. As shown in Fig. 2a, with the separate encoding branches removed, the network degenerates to an ordinary U-shaped architecture, which we denote as SEN-1. We apply SEN-1 and SEN to the cardiac and brain datasets. We train SEN and SEN-1 with λ1 = (0.01,0.05,0.1) on the cardiac dataset and λ1 = (2,4) on the brain dataset. We evaluate these two methods on the testing sets of the cardiac and brain datasets. The two-direction output displacement fields are then used to warp X and Y. The average Dice metric on the two warped anatomical segmentation maps indicates the registration performance. Table 3 shows that SEN consistently outperforms SEN-1 under all hyperparameter settings. This demonstrates that separate encoding of each image enhances the registration accuracy.

Table 3 Ablation comparison between SEN and SEN-1

5 Conclusion

In this work, we introduce a twinning network for learning-based deformable image registration, which consists of two subnetworks. We use the proposed SEN to compute high-accuracy symmetric displacement fields and the proposed FCB to correct folding in the displacement fields output by SEN. We validate our proposed twinning method on 2D cardiac ultrasound images and 3D brain MRI scans. Compared with three other unsupervised learning-based methods, the experimental results demonstrate that our twinning method achieves high registration accuracy on Dice metrics and reduces the number of nonpositive Jacobian determinants in the predicted displacement fields. Furthermore, the experimental results on correcting the displacement fields of the baseline methods with the FCB demonstrate that the FCB outperforms global regularization on folding reduction. The ablation study shows that separate encoding improves the registration performance.