1 Introduction

Spatial transformations between sets of images play an important role in medical image analysis and are commonly used to bring distinct subjects into anatomical correspondence. This has many uses, such as aligning a population into a common coordinate system to compare functional/structural properties of specific anatomy, aligning a new subject to an atlas, and studying anatomical shapes, where the transformations among and between images describe the morphology. In all of these applications, there is an assumption, either explicit or implicit, that the ideal transformation should bring the images into an anatomical correspondence such that key parts of the anatomy are collocated in the transformed image(s). Some methods identify specific anatomical features and find transformations that ensure their alignment [1]. Others find transformations that align unidentified image intensities/features, but regularize the problem with a smoothness penalty on the class of transformations [2, 3]. This approach has the advantage of potential generality, but it ignores known anatomical variability and correspondence. Thus, the metrics, regularizations, or representations used to find these transformations do not incorporate any knowledge of the transformations, or class of transformations, that best align members of a given population.

An existing body of literature suggests that anatomical correspondences can be better learned (even in the absence of semantic/functional knowledge) in the context of populations of images or shapes [4,5,6]. There is evidence that correct correspondence produces a population of transformations that is relatively easy to encode. This paper complements and extends these works by integrating population statistics (using nonlinear models) into a deep neural network architecture for image registration, which we show is important for accurate characterization of anatomical correspondence.

Very recently, convolutional neural networks (CNNs) have been utilized to regress coordinate transformations over the space of input images [7, 8], in an unsupervised manner, by penalizing a metric of alignment between the input image pairs. These works are justified on the basis of computational speed or efficiency, as the feed-forward computation avoids the non-linear, iterative optimization required by conventional image registration methods. However, CNNs for image registration offer other advantages that are so far unexploited. In particular, CNNs do not rely on analytical representations of the coordinate transformation, the space of allowable transformations, or the optimization. This raises the possibility of incorporating empirical knowledge of the transformations, derived from a population of images, into the registration problem.

In this paper, we propose using population-based learning of regularizations or metrics for controlling the class of transformations that the CNN learns. To achieve this, we introduce a novel neural network architecture that includes two subnetworks, namely primary and secondary networks, that work cooperatively. The primary network learns the transformations between pairs of images. The secondary network is a bottleneck autoencoder that learns a low-dimensional description of the population of transformations, and cooperates with the primary network to enforce that the transformations adhere to a latent low-dimensional manifold.

2 Related Work

Deformable image registration has been explored extensively; however, challenges in generality, robustness, and efficiency remain. For brevity, we focus below only on the most closely related research.

Deformable registration is generally an ill-posed problem, and hence regularization is required to achieve plausible transformations, avoid non-smooth transformations, and provide anatomically consistent results. Deformation fields are a classical way to represent transformations, typically regularized through a smoothness penalty, usually in the form of a Dirichlet/elastic penalty on the deformation [9]. For relatively low-dimensional representations, such as b-splines [10], the basis introduces a degree of smoothness, although some methods apply penalties on the b-spline coefficients. Diffeomorphic registration uses static or dynamic (with time-dependent velocity), smooth flow fields to represent the deformation while guaranteeing invertibility, and has been applied to image alignment and shape analysis [2]. The smoothness in the diffeomorphic setting is typically introduced as part of the metric on the flow field.

Recently, CNNs have been used for image registration to boost computational efficiency by avoiding the non-linear, iterative optimization routines of conventional methods. Supervised methods for CNN training have shown promising results [11], but they require large amounts of labeled training data (i.e., registration examples solved with other techniques). More recent work performs CNN-based registration in an unsupervised fashion [7, 8]. The work of Balakrishnan et al. [8] shows promising results on learning 3D brain registration displacement fields, improving the computational cost (after training) over state-of-the-art traditional registration methods, such as ANTs [12], while maintaining registration accuracy. Like most registration methods, this approach also uses smoothness on the deformation fields as a regularizer.

Early work by [4] considered anatomical landmarks on a set of anatomical shapes and suggested that anatomical variability is relatively low-dimensional. Later work used information-theoretic criteria to parameterize correspondences on populations of shapes [5]. Deformable transformations between images have also been confined to a low-dimensional representation that captures population characteristics [13]. Statistical deformation models [13, 14] learn the probability distribution (subspace or manifold) of the deformation fields for a given population to reduce the dimensionality of the solution space and constrain the registration process. Low-rank representations and spatially varying metrics have also been proposed for diffeomorphic registration [6, 15]. All of these methods use linear models (e.g., PCA or low-rank correlations) to feed population statistics back into the registration process. In this paper, we introduce nonlinear models of the population and integrate them into a network architecture for registration.

This paper proposes a neural network architecture in which one network influences another. A few proposed systems of interacting neural networks include generative adversarial networks (GANs) [16] and their variants, and domain adaptation (DA) [17]. In those works, the primary network competes with the secondary network as an adversary, and the steady states of these systems (in training) are saddle points of the competing energies. In the proposed work, the primary network minimizes both its own loss and the reconstruction loss of the secondary network, in an unsupervised setting; thus, we call these architectures cooperative networks.

3 Methods

The proposed cooperative network architecture is depicted in Fig. 1. It consists of two interacting subnetworks: the primary network aims to solve the registration task, and the secondary network regularizes the solution space of the primary task. The architecture of the primary network is based on the U-Net architecture (Fig. 2), in line with other registration approaches [8]. Given a source (\(I_S\)) and a target (\(I_T\)) image pair (2D/3D), the network produces a displacement field \(\phi \), corresponding to the warp that should ideally match \(I_S\) to \(I_T\). This displacement field, together with the source image, is passed through a spatial transform unit [18] to produce a registered image (\(I_R\)). The primary network uses an image matching term between \(I_R\) and \(I_T\) as its loss function (e.g., \(\mathbb {L}_2\) norm or normalized cross-correlation). To reiterate, ground-truth displacement fields \(\phi \) are not required for training; hence, this is an unsupervised image registration architecture.
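
For concreteness, the spatial transform unit can be sketched as follows. This is a minimal PyTorch illustration (not the authors' implementation), assuming 2D images and a dense displacement field \(\phi \) given in pixel units:

```python
import torch
import torch.nn.functional as F

def spatial_transform(source, phi):
    """Warp `source` (N, C, H, W) with displacement field `phi` (N, 2, H, W).

    `phi` holds per-pixel (x, y) displacements in pixel units; they are
    converted to the normalized [-1, 1] coordinates that F.grid_sample expects.
    """
    n, _, h, w = source.shape
    # Identity sampling grid in normalized coordinates.
    ys, xs = torch.meshgrid(
        torch.linspace(-1, 1, h, device=phi.device),
        torch.linspace(-1, 1, w, device=phi.device), indexing="ij")
    grid = torch.stack((xs, ys), dim=-1).unsqueeze(0).expand(n, -1, -1, -1)
    # Convert pixel displacements to normalized offsets and shift the grid.
    offset = torch.stack((phi[:, 0] * 2 / (w - 1),
                          phi[:, 1] * 2 / (h - 1)), dim=-1)
    return F.grid_sample(source, grid + offset, align_corners=True)
```

Because the sampling in `grid_sample` is differentiable with respect to the grid, the image matching loss can backpropagate through the warp into the primary network.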

The secondary network is a bottleneck autoencoder, which we call a cooperative autoencoder (CAE), whose task is to reconstruct the displacement field; its output is denoted \(\hat{\phi }\). The CAE is a CNN (Fig. 2) whose h-degrees-of-freedom bottleneck layer (i.e., the latent space) represents the low-dimensional nonlinear manifold on which the displacement fields should (approximately) lie. We add the CAE’s reconstruction loss (the \(\mathbb {L}_2\) loss \(||\phi - \hat{\phi }||^2\)) to the primary registration loss. The CAE acts as a regularizer, pushing the objective function to prefer, among many possible solutions, displacement fields that are accurately represented by the CAE.
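
To make the structure concrete, a cooperative autoencoder of this general form is sketched below in PyTorch; the layer sizes and the 64x64 field resolution are illustrative assumptions (Fig. 2 specifies the actual architecture), and only the h-dimensional bottleneck is essential:

```python
import torch.nn as nn

class CooperativeAutoencoder(nn.Module):
    """Bottleneck conv autoencoder over 2D displacement fields (assumed
    64x64 here); `h` is the bottleneck dimensionality."""
    def __init__(self, h=8):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(2, 16, 3, stride=2, padding=1), nn.ReLU(),   # 64 -> 32
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),  # 32 -> 16
            nn.Flatten(),
            nn.Linear(32 * 16 * 16, h))                            # bottleneck
        self.decoder = nn.Sequential(
            nn.Linear(h, 32 * 16 * 16), nn.ReLU(),
            nn.Unflatten(1, (32, 16, 16)),
            nn.ConvTranspose2d(32, 16, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(16, 2, 4, stride=2, padding=1))     # 16 -> 64

    def forward(self, phi):
        return self.decoder(self.encoder(phi))
```

Note that, unlike the U-Net primary network, the autoencoder has no skip connections, so every reconstruction must pass through the h-dimensional bottleneck.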

Fig. 1. Cooperative network architecture, with the primary unsupervised registration network depicted in the blue box, and the secondary autoencoder-based regularizer network in the red box. (Color figure online)

The final objective function comprises three terms (Eq. 1): the first is the registration loss, the second (weighted by \(\alpha \ge 0\)) is a smoothness term [8], and the third (weighted by \(\beta \ge 0\)) is the CAE-based regularization term.

$$\begin{aligned} \mathcal {Q} = Loss(I_T, I_R) + \alpha ||\nabla \phi ||^2 + \beta ||\phi - \hat{\phi }||^2 \end{aligned}$$
(1)
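
As a sketch, Eq. 1 can be computed as below; the mean reductions and the forward-difference gradient are our assumptions and match the form, though not necessarily the exact normalization, of the terms above:

```python
def total_loss(i_t, i_r, phi, phi_hat, alpha, beta):
    """Eq. 1: registration loss + smoothness penalty + CAE reconstruction."""
    registration = ((i_t - i_r) ** 2).mean()           # L2 image-matching loss
    # ||grad phi||^2 via forward differences over the spatial axes.
    dx = phi[:, :, :, 1:] - phi[:, :, :, :-1]
    dy = phi[:, :, 1:, :] - phi[:, :, :-1, :]
    smoothness = (dx ** 2).mean() + (dy ** 2).mean()
    cooperative = ((phi - phi_hat) ** 2).mean()        # ||phi - phi_hat||^2
    return registration + alpha * smoothness + beta * cooperative
```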

CAE training requires an initial set of transformations to form a preliminary representation; hence, we start training with \(\beta = 0\) (no CAE input) and a small smoothness weight \(\alpha \). We found that the length of this initialization phase does not significantly affect the results of the system, and we always set it to 5% of the total iterations. After the initialization phase, we turn on the CAE, setting \(\beta \) to a non-zero value and \(\alpha = 0\) (no smoothness), and train the primary and secondary networks jointly (cooperatively).
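
A minimal training loop implementing this two-phase schedule might look as follows; the networks, optimizer, data iterator, and the particular \(\alpha \) and \(\beta \) values are placeholders, and `spatial_transform` and `total_loss` are the sketches given above:

```python
total_iters = 20000
warmup_iters = int(0.05 * total_iters)   # initialization phase: 5% of iterations

for it in range(total_iters):
    # Phase 1: smoothness only (beta = 0); Phase 2: CAE only (alpha = 0).
    alpha, beta = (0.1, 0.0) if it < warmup_iters else (0.0, 1.0)
    i_s, i_t = next(pair_loader)          # assumed iterator over (source, target)
    phi = primary_net(torch.cat((i_s, i_t), dim=1))
    i_r = spatial_transform(i_s, phi)
    phi_hat = cae(phi)
    loss = total_loss(i_t, i_r, phi, phi_hat, alpha, beta)
    # `optimizer` is assumed to cover the parameters of both networks,
    # realizing the joint (cooperative) training.
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```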

Fig. 2. Left: primary network architecture (input: pair of images; output: displacement field between the images), which is then fed into the spatial transform (Fig. 1). Right: architecture of the cooperative autoencoder.

4 Results

In this paper, we use the proposed method to register shapes represented as binary images and/or distance transforms; the same method applies directly to medical images. For each dataset, we train each network on all pairs of images from the data, with a random 25% of the pairs set aside for testing. To clarify, this testing set consists of completely held-out pairs of images, and the remaining 75% of pairs is split into training and validation sets (a pair-construction sketch is given below). Training on all pairs ensures that the CAE captures the inherent low-dimensional structure of the displacement fields while avoiding bias. However, the concept of cooperative networks is applicable to other training strategies (e.g., training with a given atlas image) or representations (e.g., momentum fields).
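
A pair-construction scheme consistent with this protocol is sketched below; whether the pairs are ordered, and the validation fraction, are our assumptions:

```python
import itertools
import random

# All ordered (source, target) pairs over n_images; 25% held out for testing,
# the remaining 75% split further into training and validation pairs.
n_images = 324                        # e.g., the corpus callosum dataset below
pairs = list(itertools.permutations(range(n_images), 2))
random.shuffle(pairs)
n_test = len(pairs) // 4
test_pairs, rest = pairs[:n_test], pairs[n_test:]
n_val = len(rest) // 10               # validation fraction is a placeholder
val_pairs, train_pairs = rest[:n_val], rest[n_val:]
```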

Linear and Rotating Box-Bump

Our first didactic dataset is a set of 2D box-bump images (as in [19]), where a protrusion on the surface of a rectangular shape is parameterized by its position along the side. We also use another synthetic dataset representative of rotational (non-linear) shape variations: a protrusion is set atop a circular base, parameterized by its angular position between [−50, +50] degrees from the center. These linear and rotating box-bump datasets respectively represent a single linear and a single rotating (non-linear) mode of variation. We apply the proposed method to these datasets with the secondary network as a cooperative autoencoder (CAE) with a bottleneck of dimension 1, and compare the resulting displacement fields with the unsupervised deformable registration (UnDR) approach proposed in [8], which uses a smoothness penalty on the displacement fields and encodes no population-level information. We use the \(\mathbb {L}_2\) difference as the primary loss, i.e., \(Loss(I_R, I_T) = ||I_T - I_R||^2\). The results are shown in Fig. 3, along with displacement fields and corresponding Dice coefficients, for a test pair of images. We see that the registration accuracy measured by the Dice coefficient is comparable for UnDR and the proposed method (UnDR-CAE), but the two produce vastly different displacement fields. Cooperative networks capture a single transverse/rotating component for the linear/rotating box-bump, respectively, each derived from population statistics. In comparison, UnDR (for both datasets) compresses the protrusion for the source and expands it for the target, which correctly aligns the source and target shapes but does not discover the shape variation of the population. This is an important distinction: unlike UnDR, the CAE leverages information about the population statistics of the data.

Fig. 3. Linear and rotating box-bump results with different methods; the left figure shows the source with the field produced by the network, and the right shows the false-color difference image between the target and the registration output (white: correct overlap; green and magenta: mismatched pixels). (Color figure online)

The core idea of cooperative networks is to restrict displacement fields to a low-dimensional manifold. For comparison, we also study some alternative strategies exploiting the same principle. The first option is to reduce the latent space of the primary network architecture (UnDR) to a single-dimension bottleneck, which we call “UnDR-BN”; this represents a conventional alternative to the CAE. The results for this approach are shown in Fig. 3 (UnDR-BN). They show that UnDR-BN behaves similarly to UnDR, which can be explained, in part, by the skip connections (Fig. 2) in the U-Net architecture used in UnDR. Another alternative to the UnDR-BN architecture is to introduce an \(\mathbb {L}_1\) penalty on this layer to encourage sparsity. In our experiments, this led to results similar to UnDR-BN, and for brevity, we do not present those results in this paper. We also provide additional results (in the supplementary material) for UnDR-BN with the skip connections of the U-Net architecture removed.

Table 1. Results obtained with cooperative autoencoder networks (CAE, with bottleneck size and \(\beta \) coefficient) compared with unsupervised deformable registration (UnDR) [8]. Landmark errors for the box-bump datasets are reported as a percentage of the bump width. The AE error for UnDR refers to a separate autoencoder with the same bottleneck size as the CAE (trained after UnDR). \(^{\dagger }\) The AE error is 63.3% for bottleneck size 1, 54.1% for 2, 49.4% for 4, 38.8% for 8, and 33.5% for 16. We also report the average test runtime to compute the displacement fields.

We hypothesize that cooperative networks can discover meaningful correspondences of shape. To validate this, we define landmarks (analytically) on the family of box-bump shapes (in correspondence with the bump movement) and evaluate how well each method aligns these ground-truth correspondences (landmark error in Table 1), along with Dice coefficients measuring registration accuracy. The computational cost of computing displacement fields for a given image pair (testing step) is similar for both UnDR and the proposed method, i.e., the CAE does not lose any speed relative to UnDR (speed being the main advantage of UnDR [8]). UnDR-CAE registers with accuracy similar to UnDR (measured by the Dice coefficient), but consistently achieves lower landmark errors, owing to the secondary network that learns population statistics. It is also interesting to examine the latent-space variations discovered by the single dimension of the CAE; additional results for this are provided in the supplementary material.

For the CAE, we report the reconstruction error (\(\frac{||\phi - \hat{\phi }||_{\mathbb {L}_2}}{||\phi ||_{\mathbb {L}_2}}\)) in Table 1. For comparison, we train a separate autoencoder on the displacement fields produced by UnDR (Table 1). These results are in agreement with the key idea that the CAE helps the primary network produce results closer to a low-dimensional manifold, as reflected in the ability of the bottleneck AE to accurately reconstruct its output.
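
This quantity is the relative \(\mathbb {L}_2\) error of the reconstruction, e.g.:

```python
def relative_recon_error(phi, phi_hat):
    # ||phi - phi_hat||_2 / ||phi||_2, as reported in Table 1.
    return (torch.norm(phi - phi_hat) / torch.norm(phi)).item()
```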

Fig. 4. Two corpus callosum source-target pairs; as before, one image shows the fields and the other the false-color difference between the target and the registered output. Top row: UnDR; bottom row: CAE.

Corpus Callosum (CC)

In this example, we use a dataset of 324 mid-sagittal 2D slices of the corpus callosum (CC) from the OASIS Brains dataset [20]. Unlike the synthetic experiments discussed above, we do not know, a priori, the intrinsic dimensionality of the CC shapes. Therefore, we train the proposed architecture across a range of CAE bottleneck dimensions (2, 4, 8, and 16) and compare the resulting Dice coefficients, autoencoder reconstructions, and landmark errors, as in Table 1. Networks are again trained using the \(\mathbb {L}_2\) difference as the primary loss. Landmarks were identified using features from the literature [21]: we had multiple raters identify the posterior and anterior points of the CC, the inferior tip of the splenium, the posterior tip of the genu, the posterior angle of the genu, and the interior notch of the splenium. The inter-rater RMS error is 1.4 mm, and the pixel/voxel size is 1 mm for these images. We see that the optimal bottleneck size for cooperative networks is 8: increasing the bottleneck to 16 improves the Dice coefficient and AE error but leads to worse landmark error, which suggests the CAE starts to overfit. The UnDR approach yields comparable Dice scores, but worse autoencoder and landmark errors (Table 1). As in the synthetic experiments, to report the AE error for UnDR, we trained the autoencoder separately after UnDR training. The CAE helps the primary network produce displacement fields that are close to a low-dimensional manifold, a result that is not achieved with the conventional smoothness penalty.

Fig. 5. Results of the 3D LAA registration produced by cooperative networks and UnDR.

Left Atrium Appendage (LAA)

We apply the cooperative network to a 3D dataset of left atrium appendages (LAA). These images are represented as signed distance transforms, and hence we use the normalized cross-correlation loss as in [8] instead of an \(\mathbb {L}_2\) image loss. The Dice scores, AE reconstruction accuracy, and compute times are reported in Table 1. We also show the registration of a pair of LAA images in Fig. 5, and the landmark reconstruction errors (for manually obtained, clinically validated ostia landmarks on the LAA) in Table 1.
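
For reference, a simplified global normalized cross-correlation loss is sketched below; note that [8] uses a local windowed variant, so this is a stand-in for the form of the loss rather than the exact implementation:

```python
def ncc_loss(i_t, i_r, eps=1e-5):
    # Global normalized cross-correlation between target and registered image;
    # minimizing 1 - NCC maximizes the correlation.
    t = i_t - i_t.mean()
    r = i_r - i_r.mean()
    ncc = (t * r).sum() / (torch.sqrt((t ** 2).sum() * (r ** 2).sum()) + eps)
    return 1 - ncc
```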

5 Conclusions

This paper proposes a novel architecture for CNN-based unsupervised image registration that uses a cooperative autoencoder (CAE) to enforce that the displacement fields lie in the vicinity of a low-dimensional manifold. The CAE reconstruction loss acts as a regularization term for unsupervised registration. Cooperative networks have registration run times comparable to UnDR (Table 1), and are much faster than conventional state-of-the-art registration methods (as analyzed in [8]). Cooperative networks produce a more meaningful correspondence representation between shapes than the other methods tested (as evidenced by the landmark reconstruction errors in Table 1), while maintaining registration accuracy, making them a viable tool for obtaining fast alignment with anatomically feasible correspondence.