1 Introduction

Ultrasound imaging is prominent in prenatal screening since it is noninvasive, real-time, safe, and low-cost compared to other imaging modalities [10]. However, processing ultrasound data is challenging due to low image quality, the high variability in position and orientation of the embryo, and the presence of the umbilical cord, placenta, and uterine wall. We propose a method to spatially align and segment the embryonic brain using atlas-based image registration within a single unsupervised deep learning framework.

Learning-based spatial alignment and segmentation in prenatal ultrasound has been addressed before. Namburete et al. [11] presented a supervised multi-task approach, which employed prior knowledge of the orientation of the head in the volume, annotated slices, and manual segmentations of the head and eye. Spatial alignment and segmentation were achieved on fetal ultrasound scans acquired at 22 to 30 weeks gestational age. Kuklisova-Murgasova et al. [9] proposed atlas-based registration, using an MRI atlas and block matching to register ultrasound images of fetuses of 23 to 28 weeks gestational age. Finally, Schmidt et al. [13] proposed a CNN combined with deformable shape models to segment the abdomen in 3D fetal ultrasound. All these works focus on ultrasound data acquired during the second trimester or later and rely on manual annotations. Ground truth segmentations for our application were not available and are laborious to obtain, which motivated our unsupervised approach.

Developing methods for processing ultrasound data acquired during the first trimester is of great clinical relevance, since the periconception period (14 weeks before until 10 weeks after conception) is of crucial importance for future health [15]. Therefore, our method is developed for first trimester ultrasound.

Recently, unsupervised deep learning approaches for image registration have received considerable attention, since these methods circumvent the need for manual annotations. Several methods were developed to learn dense nonrigid deformations under the assumption that the data is affinely registered [2, 17]. Using multi-level or multi-stage methods, affine registration can also be included [6, 7, 16]. The framework presented here is based on the method of [2] and follows the idea of [6, 7, 16] to dedicate part of the network to learning the affine transformation.

To the best of our knowledge, this is the first work that addresses alignment and segmentation of the embryonic brain, captured by ultrasound during the first trimester, using unsupervised deep learning methods for atlas-based registration. Segmentation and alignment are important preprocessing steps for any image analysis task; hence this method contributes to our ultimate goal: to further improve precision medicine of human brain disorders from the earliest moment in life.

2 Method

Let I and A be two images defined on the n-D spatial domains \(\varOmega _{I}, \varOmega _{A} \subset \mathbb {R}^n\), with I the target image and A the atlas. Both images contain single-channel grayscale data. Assume that A is in standard orientation and its segmentation \(S_{A}\) is available. Our aim is to find two deformations \(\phi _a\) and \(\phi _d\) such that:

$$\begin{aligned} A(x)\approx I\left( \phi _a \circ \phi _d(x)\right) \quad \forall x \in \varOmega _{A}, \end{aligned}$$
(1)

where \(\phi _a\) is an affine transformation and \(\phi _d\) a voxelwise nonrigid deformation.

To obtain \(\phi _a\) and \(\phi _d\), a convolutional neural network (CNN) is used to model the function \(g_{\theta }\): \((\phi _a,\phi _d)=g_{\theta }(I,A)\), with \(\theta \) the network parameters. The affine transformation \(\phi _a:=Tx\) is learned as an m-dimensional vector containing the coefficients of the affine transformation matrix \(T\in \mathbb {R}^{(n+1) \times (n+1)}\). The voxelwise nonrigid deformation is defined as a displacement field u(x) with \(\phi _d:=x+u(x)\).
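As a minimal illustration (not the authors' implementation), the sketch below shows how, for n = 3, the predicted coefficient vector can be reshaped into the homogeneous matrix T and composed with a displacement field; the variable names and the choice of m = 12 coefficients (the linear part plus translation) are assumptions.

```python
import numpy as np

n = 3
theta_a = np.zeros(12)                       # hypothetical network output (m = 12)
theta_a[[0, 5, 10]] = 1.0                    # initialise the linear part at the identity

T = np.eye(n + 1)                            # homogeneous (n+1) x (n+1) affine matrix
T[:n, :] = theta_a.reshape(n, n + 1)

u = np.zeros((64, 64, 64, 3))                # dense displacement field u(x)

def phi_a(x):
    """Affine transformation phi_a(x) = T x (homogeneous coordinates)."""
    return (T @ np.append(x, 1.0))[:n]

def phi_d(x):
    """Nonrigid deformation phi_d(x) = x + u(x), with u sampled at the voxel x."""
    return x + u[tuple(np.round(x).astype(int))]

# The target is resampled at phi_a(phi_d(x)) for every atlas voxel x (Eq. 1).
y = phi_a(phi_d(np.array([32.0, 32.0, 32.0])))
```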

Figure 1 provides an overview of our method. The input of the network is an image pair consisting of the atlas A and target image I. The first part of the network outputs \(\phi _a\) and the affinely registered image \(I(\phi _a(x))\). The input of the second part is the affinely registered image together with the atlas A. The final output of the network consists of \(\phi _a\) and \(\phi _d\), along with the registered and segmented target image \(I_{S_A}(\phi _a \circ \phi _d(x))=S_A(x)\cdot I(\phi _a \circ \phi _d(x))\) and the affinely registered image \(I(\phi _a(x))\).

Since this is an unsupervised method, no ground truth deformations are used for training. The parameters \(\theta \) are found by optimizing the loss function on the training set; the proposed loss function is described in the next section. After training, a new image I can be given to the network together with the atlas to obtain the registration.

Fig. 1. Architecture of our network. Light blue: convolutional layers with a stride of 2 (encoder). Green: convolutional layers with a stride of 1, skip-connection, up-sampling layer (decoder). Purple: fully connected layers with 500 neurons and ReLU activation. Dark blue: convolutional layers at full resolution. Orange: \(\phi _a\); red: \(\phi _d\). All convolutional layers have a kernel size of 3 and a LeakyReLU activation with parameter 0.2. (Color figure online)

2.1 Network Architecture

The target image I and atlas A are fed to the network as a two-channel image. The first part of the network consists of an encoder where the images are down-sampled, followed by a global average pooling layer. The global average pooling layer outputs one feature per feature map, which forces the network to encode position and orientation globally, and is followed by fully connected layers. The output layer consists of the entries of the affine transformation matrix T. The architecture of the second part of the network is the same as in Voxelmorph [2] and consists of an encoder, a decoder, and convolutional layers at full resolution. The output layer contains the dense displacement field u(x).

The method is implemented using Keras [3] with a TensorFlow backend [1]. The ADAM optimizer is used with a learning rate of \(10^{-4}\). Each training batch consists of one pair of volumes, and by default we train for 500 epochs.
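The Keras sketch below illustrates the overall structure described above. It is a simplified, assumed reconstruction: the number of encoder/decoder levels and the filter counts are guesses, and the differentiable resampling of \(I(\phi _a(x))\) between the two parts (e.g. a spatial transformer layer) is omitted.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

def conv(x, filters, strides=1):
    """3x3x3 convolution followed by LeakyReLU(0.2), as in Fig. 1."""
    x = layers.Conv3D(filters, 3, strides=strides, padding='same')(x)
    return layers.LeakyReLU(0.2)(x)

inp = layers.Input(shape=(64, 64, 64, 2))        # atlas A and target I as two channels

# Part 1: affine head -> 12 coefficients of the affine matrix T
x = inp
for f in (16, 32, 32, 32):                        # assumed encoder widths
    x = conv(x, f, strides=2)
x = layers.GlobalAveragePooling3D()(x)            # one feature per feature map
x = layers.Dense(500, activation='relu')(x)
x = layers.Dense(500, activation='relu')(x)
affine_params = layers.Dense(12, name='affine')(x)

# Part 2: VoxelMorph-like encoder/decoder predicting the displacement field u(x).
# In the full model its input is [A, I(phi_a(x))], which requires a differentiable
# resampling of I with affine_params (omitted here).
y = inp
skips = []
for f in (16, 32, 32, 32):
    y = conv(y, f, strides=2)
    skips.append(y)
for f, skip in zip((32, 32, 32), reversed(skips[:-1])):
    y = layers.UpSampling3D()(y)
    y = layers.Concatenate()([y, skip])
    y = conv(y, f)
y = layers.UpSampling3D()(y)                      # back to full resolution
y = conv(y, 16)
u = layers.Conv3D(3, 3, padding='same', name='displacement')(y)

model = Model(inp, [affine_params, u])
# Training would attach the custom loss of Eq. (2) (e.g. via model.add_loss or a
# custom training loop) and the Adam optimizer with learning rate 1e-4.
```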

2.2 The Loss Function

The loss function is defined as follows:

$$\begin{aligned} \begin{aligned} \mathcal {L}(A,I,\phi _d,\phi _a)\,=\,&\mathcal {L}_{\text {sim}}\left[ A,I\left( \phi _a \circ \phi _d (x)\right) \right] +\lambda _{\text {diffusion}} \mathcal {L}_{\text {diffusion}}\left[ \phi _d \right] \\ {}&+ \lambda _{\text {scaling}} \mathcal {L}_{\text {scaling}}\left[ \phi _a\right] . \end{aligned}\end{aligned}$$
(2)

The first term promotes intensity-based similarity between the atlas and the deformed image; the second and third terms regularize \(\phi _d\) and \(\phi _a\), respectively. Each term is discussed in detail below.

Since in 3D first trimester ultrasound there are other objects in the volume besides the brain, the similarity term is only calculated within the region of interest defined by the segmentation of the atlas \(S_A\). \(\mathcal {L}_{\text {sim}}\) is chosen as either the mean squared error (MSE) or the cross-correlation (CC), defined as follows:

$$\begin{aligned}&\text {MSE}(A,Y)= \frac{1}{M} \sum _{p \in \varOmega } W(p)\cdot \left( A(p)-Y(p) \right) ^2\end{aligned}$$
(3)
$$\begin{aligned}&\text {CC}(A,Y)= \nonumber \\&\frac{1}{M} \sum _{p \in \varOmega } W(p)\cdot \frac{\left( \sum _{p_i} [A(p_i)-\bar{A}(p)][Y_{S_A}(p_i)-\bar{Y}_{S_A}(p)]\right) ^2}{\left( \sum _{p_i} [A(p_i)-\bar{A}(p)]^2\right) \left( \sum _{p_i} [Y_{S_A}(p_i)-\bar{Y}_{S_A}(p)]^2\right) } ,\end{aligned}$$
(4)

where M is the number of nonzero elements in W and, unless stated otherwise, \(W=S_A\). The subscript \(S_{A}\) indicates the segmented image, and \(\bar{A}\) and \(\bar{Y}\) denote images with the local mean subtracted: \(\bar{A}(p)=A(p)-\frac{1}{j^3} \sum _{p_i} A(p_i)\), where \(p_i\) iterates over a \(j^3\) volume around \(p\in \varOmega \), with \(j=9\) as in [2].
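A minimal sketch of the masked similarity term is given below for the MSE variant (the windowed CC follows the VoxelMorph implementation and is omitted); tensor shapes and the function name are assumptions.

```python
import tensorflow as tf

def masked_mse(A, Y, W):
    """Masked MSE of Eq. (3).

    A, Y: (batch, D, H, W, 1) atlas and warped target image.
    W:    weight map, nonzero only inside the atlas segmentation S_A
          (by default W = S_A).
    """
    M = tf.reduce_sum(tf.cast(W > 0, tf.float32))        # number of nonzero weights
    return tf.reduce_sum(W * tf.square(A - Y)) / M
```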

Image registration is an ill-posed problem; therefore regularization is needed. \(\phi _d\) is regularized by:

$$\begin{aligned} \mathcal {L}_{\text {diffusion}}(u)= \frac{1}{M} \sum _{p \in \varOmega } \Vert \nabla u(p)\Vert ^2,\end{aligned}$$
(5)

which penalizes local spatial variations in \(\phi _d\) to promote smooth local deformations [4].
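A sketch of this regularizer using forward differences, as in VoxelMorph; the tensor layout is assumed and the normalization is written as a mean rather than the 1/M sum of Eq. (5).

```python
import tensorflow as tf

def diffusion_loss(u):
    """Diffusion regularizer of Eq. (5) for u of shape (batch, D, H, W, 3)."""
    dz = u[:, 1:, :, :, :] - u[:, :-1, :, :, :]          # forward differences
    dy = u[:, :, 1:, :, :] - u[:, :, :-1, :, :]
    dx = u[:, :, :, 1:, :] - u[:, :, :, :-1, :]
    return (tf.reduce_mean(tf.square(dz))
            + tf.reduce_mean(tf.square(dy))
            + tf.reduce_mean(tf.square(dx)))
```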

Initial experiments revealed that, when objects are present in the background of the target image, the affine transformation degenerates towards extreme compression or expansion. To prevent this, extreme zooming is penalized as regularization for \(\phi _a\). The zooming factors must be extracted from T. This is done using the singular value decomposition (SVD) [5], which states that any square matrix \(T \in \mathbb {R}^{n\times n}\) can be decomposed as:

$$\begin{aligned} T=U\varSigma V^*,\end{aligned}$$
(6)

where the diagonal matrix \(\varSigma \) contains non-negative real singular values representing the zooming factors. The scaling loss is defined as:

$$\begin{aligned} \mathcal {L}_{\text {scaling}}= \Vert \text {Diag}(\varSigma ) - S \Vert _1. \end{aligned}$$
(7)

where S is an n-dimensional vector of ones.
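A sketch of the scaling loss; it assumes the linear 3x3 part of T is available as a batched tensor and relies on TensorFlow's differentiable SVD.

```python
import tensorflow as tf

def scaling_loss(T_linear):
    """Scaling regularizer of Eq. (7) for T_linear of shape (batch, 3, 3)."""
    sigma = tf.linalg.svd(T_linear, compute_uv=False)    # singular values = zoom factors
    return tf.reduce_mean(tf.reduce_sum(tf.abs(sigma - 1.0), axis=-1))
```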

For \(\lambda _{\text {diffusion}}\) and \(\lambda _{\text {scaling}}\), suitable values must be chosen; this is addressed in the experiments.

3 Data

The following three datasets were used in the experiments.

3.1 Synthetic 2D Dataset 1

To develop and validate our method against a ground truth, we created two synthetic 2D datasets by affinely transforming and nonrigidly deforming a synthetic atlas. As synthetic atlas, a nonrigidly deformed version of the Shepp-Logan phantom [14] is used. The first dataset was created by first applying a random affine transformation \(\bar{\phi }_a^{-1}\) to the atlas, followed by a nonrigid deformation \(\bar{\phi }_d^{-1}\).

The coefficients for the affine transformation matrix \(\bar{\phi }_a^{-1}(x):=T_{gt}^{-1}x\) were drawn as follows: translation coefficients \(t_x, t_y \in [0,40]\) pixels, rotation angle \(\theta \in [0,360]\) degrees, anisotropic zooming factors \(z_x, z_y \in [0.5,1.5]\), and shear in the x direction \(\theta _{s} \in [0,30]\) degrees. The nonrigid deformation \(\bar{\phi }_d^{-1}(x):=x+\alpha u_{gt}^{-1}(x)\) was generated using a normalized random displacement field \(u_{gt}^{-1}(x)\), where \(\alpha \) defines the magnitude of the displacement. The smoothness of \(u_{gt}^{-1}(x)\) is controlled by \(\sigma \), the standard deviation of the Gaussian kernel convolved with the random field. We used \(\alpha =40\) and \(\sigma \in [3, 7]\).
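The sketch below shows one way such a sample could be generated with scipy/scikit-image; the exact sampling and normalization of the displacement field in the paper may differ, so the normalization step is an assumption.

```python
import numpy as np
from scipy.ndimage import gaussian_filter, map_coordinates
from skimage.data import shepp_logan_phantom
from skimage.transform import AffineTransform, warp

rng = np.random.default_rng(0)
atlas = shepp_logan_phantom()                          # 400 x 400 synthetic atlas

# Random affine transformation with coefficients in the stated ranges
aff = AffineTransform(
    translation=rng.uniform(0, 40, size=2),
    rotation=np.deg2rad(rng.uniform(0, 360)),
    scale=rng.uniform(0.5, 1.5, size=2),
    shear=np.deg2rad(rng.uniform(0, 30)))
image = warp(atlas, aff.inverse)                       # affinely transformed atlas

# Random nonrigid deformation: Gaussian-smoothed random field of magnitude alpha
alpha, sigma = 40, rng.uniform(3, 7)
u = gaussian_filter(rng.standard_normal((2,) + image.shape), sigma=(0, sigma, sigma))
u = alpha * u / np.abs(u).max()                        # assumed normalisation
grid = np.meshgrid(*[np.arange(s) for s in image.shape], indexing='ij')
coords = [g + d for g, d in zip(grid, u)]
target = map_coordinates(image, coords, order=1)       # final synthetic target image
```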

3.2 Synthetic 2D Dataset 2

The second synthetic dataset was created in the same manner as the first, but with an additional background consisting of ellipses of random size and orientation. The ellipses are placed around, behind, and adjacent to the synthetic atlas to mimic the presence of the uterine wall around the embryo and the body of the embryo attached to the head. Both datasets contain 3000 training, 100 validation, and 100 test images.
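A possible construction of the background ellipses; the number of ellipses, their sizes, and their intensities are illustrative assumptions.

```python
import numpy as np
from skimage.draw import ellipse

def add_background(image, rng, n_ellipses=5):
    """Add ellipses of random size and orientation around/behind the atlas."""
    out = image.copy()
    h, w = image.shape
    for _ in range(n_ellipses):
        rr, cc = ellipse(rng.integers(0, h), rng.integers(0, w),
                         rng.integers(10, 60), rng.integers(10, 60),
                         rotation=rng.uniform(0, np.pi), shape=image.shape)
        out[rr, cc] = np.maximum(out[rr, cc], rng.uniform(0.2, 0.8))
    return out
```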

3.3 3D Ultrasound Data: Rotterdam Periconceptional Cohort

The Rotterdam Periconceptional Cohort (Predict study) is a large hospital-based cohort study embedded in the tertiary patient care of the Department of Obstetrics and Gynaecology at the Erasmus MC, University Medical Center Rotterdam, the Netherlands. This prospective cohort focuses on the relationships between periconceptional maternal and paternal health, fetal growth and development, and the underlying (epi)genetics [15].

Scans collected at 9 weeks gestational age were used as a proof of concept for our method. The image chosen as atlas was put in standard orientation and had sufficient quality to segment the embryo and brain semi-automatically using Virtual Reality [12]. There were 170 3D ultrasound scans available with sufficient quality, of which 140 were used for training and 30 for testing. All scans were padded with zeros and rescaled to \(64 \times 64 \times 64\) voxels to speed up training.

Since 140 scans are not sufficient for training, data augmentation was applied. When considering a 2D slice, the embryo is visible in either the coronal, sagittal, or axial view. To preserve this property during augmentation, an axis was first selected at random and a rotation of either 90, 180 or 270 degrees was applied. Subsequently, a random rotation about this axis between 0 and 30 degrees was applied, followed by a translation \(t_x, t_y, t_z \in [-15,15]\) voxels and anisotropic zooming \(z_x, z_y, z_z \in [0.9,1.3]\). Each volume was augmented 30 times, resulting in 4340 images for training.
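A scipy-based sketch of this augmentation; the interpolation order and the border handling are assumptions, and cropping/padding back to \(64^3\) after zooming is omitted.

```python
import numpy as np
from scipy.ndimage import rotate, shift, zoom

def augment(volume, rng):
    """One random augmentation of a 64^3 ultrasound volume."""
    axes = [(0, 1), (0, 2), (1, 2)]
    ax = axes[rng.integers(3)]                               # random rotation plane
    vol = np.rot90(volume, k=rng.integers(1, 4), axes=ax)    # 90/180/270 degrees
    vol = rotate(vol, angle=rng.uniform(0, 30), axes=ax,
                 reshape=False, order=1)                      # extra 0-30 degrees
    vol = shift(vol, shift=rng.uniform(-15, 15, size=3), order=1)
    vol = zoom(vol, rng.uniform(0.9, 1.3, size=3), order=1)  # anisotropic zoom
    return vol                                                # note: zoom changes the shape

rng = np.random.default_rng(0)
augmented = [augment(np.zeros((64, 64, 64)), rng) for _ in range(30)]
```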

4 Experiments

To validate our method, three experiments were performed.

1. Comparison with Voxelmorph [2] on synthetic dataset 1 with \(\mathcal {L}_{\text {sim}}= \text {MSE}\). Goal: evaluate the influence of adding a dedicated part of the network for affine registration on images where the object of interest has a wide variation in position and orientation.

2. Evaluation of the hyperparameters in the loss function of Eq. (2) on synthetic dataset 2 with \(\mathcal {L}_{\text {sim}}=\text {MSE}\). Goal: set \(\lambda _{\text {diffusion}}\) and \(\lambda _{\text {scaling}}\) in the presence of objects in the background.

3. Testing the method on 3D ultrasound data acquired at 9 weeks gestational age with \(\mathcal {L}_{\text {sim}}=\text {CC}\) and different types of atlases as input to the network. CC is used since it is well known to be more robust to intensity variations and noise.

The main difference between the synthetic data and the ultrasound data is that for the synthetic data the atlas is the only object with a clear structure, while the ultrasound data is noisy and contains other structures similar to the embryonic brain, for example the body of the embryo, which is also a prominent rounded structure. To address this, the third experiment evaluates the influence of using an atlas containing the whole embryo versus only the brain. Using the atlas containing the whole embryo as input gives more information for alignment. However, we aim at registering only the brain, since this is our region of interest and registering the whole embryo introduces new challenges due to movement and the wide variation in position of the limbs. To focus on registration of the brain, W(x) in Eq. (4) is adjusted by assigning twice as much weight to the loss calculated in voxels that are part of the brain.
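A minimal sketch of this adjusted weight map, assuming binary atlas masks S_embryo and S_brain (names are hypothetical):

```python
import numpy as np

def make_weight_map(S_embryo, S_brain):
    """Weight 1 inside the embryo, weight 2 inside the brain (used as W in Eq. 4)."""
    W = S_embryo.astype(np.float32)
    W[S_brain > 0] = 2.0
    return W
```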

4.1 Evaluation

In the synthetic case, the target registration error (TRE) was calculated, defined as the mean Euclidean distance between the evaluation points \(x_i \in \mathbb {R}^2\) and their positions after applying the learned transformation followed by the ground truth inverse transformation:

$$\begin{aligned} TRE\left[ \bar{\phi }_a^{-1}, \bar{\phi }_d^{-1}, \phi _a, \phi _d \right] = \frac{1}{n}\sum _{i=1}^n \Vert \bar{\phi }_a^{-1} \circ \bar{\phi }_{d}^{-1} \circ \phi _a \circ \phi _d(x_i)-x_i \Vert , \end{aligned}$$
(8)

where the evaluation points mark the boundary of the shape and important internal structures. The TRE is given in pixels.
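A small sketch of this evaluation; phi_learned and phi_gt_inverse are assumed callables composing the learned transformations \(\phi _a \circ \phi _d\) and the ground truth inverses \(\bar{\phi }_a^{-1} \circ \bar{\phi }_d^{-1}\), respectively.

```python
import numpy as np

def tre(points, phi_learned, phi_gt_inverse):
    """Mean Euclidean distance of Eq. (8) for points of shape (n, 2)."""
    mapped = np.array([phi_gt_inverse(phi_learned(p)) for p in points])
    return np.mean(np.linalg.norm(mapped - points, axis=1))
```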

Table 1. Performance on the first synthetic dataset using Voxelmorph [2] and our method for different values of \(\lambda _{\text {diffusion}}\). TRE is expressed in pixels; standard deviation in parentheses.

In the case of real ultrasound data, we visually assess the quality of alignment in the 30 test images using the following scoring: 0: fail, 1: correct orthogonal directions, 2: brain and atlas overlap, 3: alignment. Score 1 indicates the network was able to detect the correct plane, score 2 indicates the network was able to map the brain onto the atlas, and score 3 indicates successful alignment.

5 Results

In the first experiment we compared our method with Voxelmorph [2] on the first synthetic dataset. The experiment was done for different values of \(\lambda _{\text {diffusion}}\) with \(\lambda _{\text {scaling}}=0\). Table 1 shows that with the architecture of Voxelmorph it was not possible to capture the global transformation needed; this is also illustrated by row one of Fig. 2. Using our method, a small TRE was achieved for both the training and validation sets; see row two of Fig. 2 for an example. Setting \(\lambda _{\text {diffusion}}=0.8\) gave a TRE of \(2.71 \pm 1.67\) pixels on the test set, which is comparable to the results on the training and validation sets.

In the second experiment we evaluated how to deal with objects in the background by penalizing extreme zooming. Table 2 shows the results for \(\lambda _{\text {diffusion}}=0.2\) and \(\lambda _{\text {diffusion}}=0.8\) and for different values of \(\lambda _{\text {scaling}}\). Setting \(\lambda _{\text {scaling}}\) too high restricts the network too much, while setting it too low causes extreme scaling. The best result on the validation set was found for \(\lambda _{\text {diffusion}}=0.8\) and \(\lambda _{\text {scaling}}=0.004\); using this model to register the test set gave a TRE of \(2.90 \pm 1.97\) pixels, which is again comparable to the results for the training and validation sets. An example can be found in row three of Fig. 2.

Table 2. Target registration error for different hyperparameter settings of the loss function. TRE is expressed in pixels; standard deviation in parentheses.
Fig. 2. Visual results for experiments 1 and 2; \(Y=I(\phi (x))\) for the Voxelmorph architecture, \(Y=I(\phi _a \circ \phi _d(x))\) for our method, and A the atlas.

In the third experiment we evaluated our method on real ultrasound data, for different combinations of atlases as input to the two parts of the network. The results are shown in Table 3. Using the atlas of the whole embryo gives the best results, since the network has more information for alignment. Figure 3 gives an impression of the resulting registrations. Note that images marked as aligned are not perfectly registered; this is because most images are still only roughly aligned, and therefore voxelwise alignment is not learned.

Table 3. Performance on ultrasound data for different type of atlas. Scoring: 0: fail, 1: correct orthogonal directions, 2: brain and atlas overlap, 3: alignment.
Fig. 3. The same slice for: a) ultrasound atlas, b) example of an image after alignment with score 1, c) example of an image after alignment with score 2, d, e) examples of successfully affinely aligned images with score 3. The red line indicates the correct boundaries of the brain after alignment. (Color figure online)

6 Conclusion

In this work we extended existing deep learning methods for image registration to develop an atlas-based registration method that aligns and segments the embryonic brain. The main extensions are the dedicated part of the network for affine registration and the loss function (2). For validation, synthetic 2D datasets with a ground truth were used. These experiments showed that our method can deal with the wide variation in position and orientation and with simple objects in the background.

The final experiment, using real 3D ultrasound data acquired during the first trimester, showed that our method is not yet robust enough to align and segment the embryonic brain. The importance of the atlas was evaluated: using an atlas of the whole embryo improves results slightly, since it gives more information. This information is needed since the images are noisy, have artefacts, and the embryonic brain is small (on average only \(1\%\) of the volume). Another drawback is that the ultrasound images were rescaled to one-fourth of their original size to speed up training, and during registration the image is resampled twice, which makes the deformed image blurry and influences the calculated loss function.

Another way to speed up training is to train in two stages. The second part of the network, which learns the voxelwise registration, can only learn useful features when the images are already roughly aligned. Training the affine part of the network first is therefore more efficient, since the second part can then learn useful features for voxelwise alignment from the start. This will be explored in the future.

Finally, we aim to extend our method to be applicable to the entire first trimester, to enable spatio-temporal modeling of the embryonic brain. This extension can be made by training different networks for each period. Another natural extension is multi-atlas image segmentation [8], either for networks trained within a certain period to obtain more robust results, or with a set of atlases covering the whole first trimester.