1 Introduction

Pulmonary hypertension (PH) is a cardiorespiratory syndrome characterised by increased blood pressure in the pulmonary arteries, and it typically follows a rapidly progressive course. As such, early identification of PH patients at elevated risk of a deteriorating course is of paramount importance. For this, accurate segmentation of the different functional regions of the heart in cardiac magnetic resonance (CMR) images is critical.

Numerous methods for automatic and semi-automatic CMR image segmentation have been proposed, including deformable models [1], atlas-based image registration models [2] and statistical shape and appearance models [3]. More recently, deep learning-based methods have achieved state-of-the-art performance in the CMR domain [4]. However, these approaches have several drawbacks. First, they tend to focus on the left ventricle (LV) [1], even though the right ventricle (RV) has prognostic importance in a broad range of cardiovascular diseases and exploiting the coupled biventricular motion of the heart enables more accurate cardiac assessment. Second, existing approaches rely on manual initialisation of the image segmentation or on the definition of key anatomical landmarks [1, 2, 3], which becomes infeasible in population-level applications involving hundreds or thousands of CMR images. Third, existing techniques have mainly been developed and validated on normal (healthy) hearts [1, 2, 4]; few studies have focused on the abnormal hearts of PH patients.

To address the aforementioned limitations of current approaches, in this paper we propose a deep nested level set (DNLS) method for automated biventricular segmentation of CMR images. More specifically, we make three distinct contributions to the area of CMR segmentation, particularly for PH patients. First, we introduce a deep fully convolutional network that effectively combines two loss functions, namely a softmax cross-entropy and a class-balanced sigmoid cross-entropy, so that the network can simultaneously extract robust region and edge features from CMR images. Second, we introduce a novel implicit representation of PH hearts that utilises multiple nested level lines of a single continuous level set function. This nested level set representation can be effectively combined with the deep features learned by the proposed network, and an initialisation of the level set function can be readily derived from the learned features; DNLS therefore requires no user intervention (manual initialisation or landmark placement) and is fully automated. Finally, we apply the proposed DNLS method to clinical data acquired from 430 PH patients (approx. 12,000 images) and compare its performance with state-of-the-art approaches.

Fig. 1. Short-axis images of a healthy subject (left) and a PH subject (right), including the anatomical explanation of both LV and RV. The desired epicardial contours (red) and endocardial contours (yellow) from both ventricles are plotted.

2 Modelling Biventricular Anatomy in Patients with PH

To illustrate cardiac morphology in patients with PH, Fig. 1 shows the difference between CMR images of a representative healthy subject and a PH subject. In health, the RV is crescentic in short-axis views and triangular in long-axis views, wrapping around the thicker-walled LV. In PH, the initial hypertrophic response of the RV increases contractility, but it is invariably followed by progressive dilatation and failure, heralding clinical deterioration and ultimately death. During this deterioration, the dilated RV pushes onto the LV, which deforms and loses its roundness. Moreover, in PH the myocardium around the RV becomes much thicker than in a healthy heart, which allows PH cardiac morphology to be modelled by a nested level set. Next, we incorporate the biventricular anatomy of PH hearts into our model for automated segmentation of the LV and RV cavities and myocardium.

3 Methodology

Nested Level Set Approach: We view image segmentation in PH as a multi-region image segmentation problem. Let \(I : \varOmega \rightarrow {\mathbb {R}^d}\) denote an input image defined on the domain \(\varOmega \subset \mathbb {R}^2\). We segment the image into a set of n pairwise disjoint regions \(\varOmega _i\), with \(\varOmega = \cup _{i = 1}^n{\varOmega _i}\) and \({\varOmega _i} \cap {\varOmega _j} = \emptyset \) \(\forall i \ne j\). The segmentation task amounts to computing a labelling function \(l(x):\varOmega \rightarrow \{1, \ldots , n\}\) that indicates which of the n regions each pixel belongs to: \({\varOmega _i} = \left\{ {x\left| {l\left( x \right) = i} \right. } \right\} \). The problem is then formulated as an energy minimisation problem consisting of a data term and a regularisation term

$$\begin{aligned} \mathop {\min }\limits _{{\varOmega _1}, \ldots , {\varOmega _n}} \left\{ {\sum \limits _{i = 1}^n {\int _{{\varOmega _i}} {{f_i}\left( x \right) dx} } + \lambda \sum \limits _{i = 1}^n {\mathrm{{Pe}}{\mathrm{{r}}_g}\left( {{\varOmega _i},\varOmega } \right) } } \right\} . \end{aligned}$$
(1)
Fig. 2. An example of partitioning the domain \(\varOmega \) into 4 disjoint regions (right), using 3 nested level lines \(\{x|\phi (x)= c_i, i = 1,2,3\} \) of the same function \(\phi \) (left). The intersections between the 3D smooth surface \(\phi \) and the 2D planes correspond to the three nested curves on the right.

The data term \(f_i:\varOmega \rightarrow \mathbb {R}\) is associated with region \(\varOmega _i\) and takes smaller values at pixels that respond strongly to that region. In a Bayesian MAP inference framework, \({f_i}\left( x \right) = - \log P_i \left( {I\left( x \right) |{\varOmega _i}} \right) \) corresponds to the negative logarithm of the conditional probability of a specific pixel colour at the given location x within region \(\varOmega _i\). Here we refer to \(f_i\) as the region feature. The second term, \({\mathrm{{Pe}}{\mathrm{{r}}_g}\left( {{\varOmega _i},\varOmega } \right) }\), is the perimeter of the segmentation region \(\varOmega _i\), weighted by the non-negative function g. This energy term alone is known as the geodesic distance, whose minimisation can be interpreted as finding a geodesic curve in a Riemannian space. A common choice of g is an edge detection function that favours boundaries located on strong gradients of the input image I. Here we refer to g as the edge feature.
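
As an illustration of the classical (hand-crafted) choice of g mentioned above, a short NumPy/SciPy sketch is given below; the Gaussian-gradient form is an assumed example, not prescribed by this paper, and both g and the region features \(f_i\) are instead learned from data later in this section.

```python
import numpy as np
from scipy import ndimage

def edge_feature(image, sigma=1.0):
    # Classical edge-detection choice g(x) = 1 / (1 + |grad(G_sigma * I)(x)|^2):
    # g is small on strong image gradients, so region boundaries are drawn to them.
    # (In the Bayesian view the region feature is f_i(x) = -log P_i(I(x) | Omega_i).)
    gx = ndimage.gaussian_filter(image, sigma, order=(0, 1))
    gy = ndimage.gaussian_filter(image, sigma, order=(1, 0))
    return 1.0 / (1.0 + gx ** 2 + gy ** 2)
```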

We apply the variational level set method [5, 6] to (1) in this study, because a PH heart can be implicitly represented by two nested level lines of a continuous level set function (\(\{x|\phi (x)=c_i, i=1,2\}\) in Fig. 2). Note that the nested level set idea presented here is inspired by previous work [1, 7]; however, our approach uses features learned from many images, whereas previous work considers only a single image. With this idea, we can approximate the multi-region segmentation energy (1) using only one continuous function, which keeps the computational cost small. Now assume that the contours in the image I can be represented by level lines of the same Lipschitz continuous level set function \(\phi :\varOmega \rightarrow \mathbb {R}\). With \(n-1\) distinct levels \(\{c_1< c_2< \cdots <c_{n-1}\}\), the implicit function \(\phi \) partitions the domain \(\varOmega \) into n disjoint regions, together with their boundaries (see Fig. 2, right). We can then define the characteristic function \(\chi _i\phi \) for each region \(\varOmega _i\) as

$$\begin{aligned} {\chi _i}\phi (x) = \left\{ \begin{array}{lc} H\left( {{c_i} - \phi (x) } \right) \;\;\; &{}i=1\\ H\left( {\phi (x) - {c_{i - 1}}} \right) H\left( {{c_i} - \phi (x) } \right) \;\;\; &{} 2 \le i \le n-1\\ H\left( {\phi (x) - {c_{i - 1}}} \right) \;\;\; &{}i=n \end{array} \right. , \end{aligned}$$
(2)

where H is the one-dimensional Heaviside function, which takes the value 0 or 1 over the whole domain \(\varOmega \). Due to the non-differentiable nature of H, it is usually approximated by a smooth version \(H_\epsilon \) for numerical computation [7]. Note that with (2), \(\sum \nolimits _{i = 1}^n {{\chi _i}\phi = 1}\) is automatically satisfied, meaning that the resulting segmentation produces neither vacuum nor overlap. That is, by using (2), \(\varOmega = \cup _{i = 1}^n{\varOmega _i}\) and \({\varOmega _i} \cap {\varOmega _j} = \emptyset \) hold at all times. With the definition of \({\chi _i}\phi \), we can readily reformulate (1) as the following new energy minimisation problem

$$\begin{aligned} \mathop {\min }\limits _{\phi (x)} \left\{ {\sum \limits _{i = 1}^n {\int _\varOmega {{f_i}\left( x \right) {\chi _i}\phi \left( x \right) dx} } + \lambda \sum \limits _{i = 1}^{n - 1} {\int _\varOmega {g\left( x \right) \left| {\nabla H\left( {\phi \left( x \right) - {c_{i }}} \right) } \right| dx} } } \right\} . \end{aligned}$$
(3)

Note that (3) differs from (1) in several ways due to the use of the smooth function \(\phi \) and the characteristic functions (2). First, the variables to be minimised are the n regions \({\varOmega _1}, \ldots , {\varOmega _n}\) in (1), but the single smooth function \(\phi \) in (3). Second, the minimisation domain changes from \(\varOmega _i\) in (1) to \(\varOmega \) in (3). Third, (1) uses the abstract perimeter \({\mathrm{{Pe}}{\mathrm{{r}}_g}\left( {{\varOmega _i},\varOmega } \right) }\) for the weighted length of the boundary between two adjacent regions, while (3) represents this weighted length with the co-area formula, i.e. \({\int _\varOmega {g\left| {\nabla H\left( {\phi - {c_i}} \right) } \right| dx} }\). Finally, the upper limit of the summation in the regularisation term is n in (1) but \(n-1\) in (3). So far, the region features \(f_i\) and the edge feature g have not been defined; we address this next.

Learning Deep Features Using a Fully Convolutional Network: We propose a deep neural network that can effectively learn region and edge features from many labelled PH CMR images. The learned features are then incorporated into (3). We formulate the learning problem as follows: we denote the training set by \(S=\{(U_p, R_p, E_p), p=1, \ldots , N\}\), where \(U_p=\{u^p_j, j=1, \ldots , |U_p|\}\) is a raw input image, \(R_p=\{r^p_j, j=1, \ldots , |R_p|\}\), \(r^p_j \in \{1, \ldots , n\}\) is the ground truth region label map (n regions) for image \(U_p\), and \(E_p=\{e^p_j,j=1, \ldots , |E_p|\}\), \(e^p_j \in \{0,1\}\) is the ground truth binary edge map for \(U_p\). We denote all network layer parameters by \(\mathbf W \) and propose to minimise the following objective function via stochastic gradient descent with back-propagation

$$\begin{aligned} \mathbf W ^* = \mathop {\mathrm{{argmin}}}\limits _{\mathbf W } (L_R(\mathbf W ) + \alpha L_E(\mathbf W ) ), \end{aligned}$$
(4)

where \(L_R(\mathbf W )\) is the region-associated cross-entropy loss that enables the network to learn region features, while \(L_E(\mathbf W )\) is the edge-associated cross-entropy loss for learning edge features. The weight \(\alpha \) balances the two losses. By minimising (4), the network learns to output joint region and edge probability maps simultaneously. In our image-to-image training, the loss function is computed over all pixels of a training image \(U=\{u_j,j=1, \ldots , |U|\}\), its region map \(R=\{r_j, j=1, \ldots , |R|\}\), \(r_j \in \{1, \ldots , n\}\) and its edge map \(E=\{e_j, j=1, \ldots , |E|\}\), \(e_j \in \{0,1\}\). The two losses are defined as follows.

$$\begin{aligned} L_R(\mathbf W ) = -\sum \limits _{j}\mathrm{{log}}P_{so}(r_j|U,\mathbf W ), \end{aligned}$$
(5)

where j denotes the pixel index, and \(P_{so}(r_j|U,\mathbf W )\) is the channel-wise softmax probability provided by the network at pixel j for image U. The edge loss is

$$\begin{aligned} L_E(\mathbf W ) = -\beta \sum \limits _{j \in Y_+}\mathrm{{log}} P_{si}(e_j=1|U,\mathbf W ) - (1-\beta )\sum \limits _{j \in Y_-}\mathrm{{log}}P_{si}(e_j=0|U,\mathbf W ). \end{aligned}$$
(6)

For a typical CMR image, the distribution of edge and non-edge pixels is heavily biased. We therefore adopt the strategy of [8] to automatically balance the edge and non-edge classes, using a class-balancing weight \(\beta \). Here, \(\beta = |Y_ -|/|Y|\) and \(1-\beta =|Y_ +|/|Y|\), where \(|Y_-|\) and \(|Y_+|\) denote the numbers of non-edge and edge pixels in the ground truth label map, respectively. \(P_{si}(e_j=1|U,\mathbf W )\) is the pixel-wise sigmoid probability provided by the network at pixel j for image U.
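
To make the two terms concrete, the following NumPy sketch computes (5) and (6) from per-pixel logits for a single image; it assumes zero-based region labels and is an illustration under those assumptions, not the authors' implementation.

```python
import numpy as np

def region_loss(region_logits, r):
    # Softmax cross-entropy (5). region_logits: (n, H, W); r: (H, W) labels in {0,...,n-1}.
    z = region_logits - region_logits.max(axis=0, keepdims=True)
    log_p = z - np.log(np.exp(z).sum(axis=0, keepdims=True))
    return -np.take_along_axis(log_p, r[None], axis=0).sum()

def balanced_edge_loss(edge_logit, e, eps=1e-8):
    # Class-balanced sigmoid cross-entropy (6). edge_logit: (H, W); e: (H, W) in {0, 1}.
    p = 1.0 / (1.0 + np.exp(-edge_logit))
    beta = (e == 0).mean()                                          # beta = |Y-| / |Y|
    loss_pos = -beta * np.log(p[e == 1] + eps).sum()                # edge pixels
    loss_neg = -(1.0 - beta) * np.log(1.0 - p[e == 0] + eps).sum()  # non-edge pixels
    return loss_pos + loss_neg

def total_loss(region_logits, edge_logit, r, e, alpha=1.0):
    # The combined objective (4).
    return region_loss(region_logits, r) + alpha * balanced_edge_loss(edge_logit, e)
```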

Fig. 3. The architecture of a fully convolutional network with 17 convolutional layers. The network takes the PH CMR image as input, applies a branch of convolutions, learns image features from fine to coarse levels, concatenates (‘+’ sign in the red layer) multi-scale features and finally predicts the region (1–3) and edge (4) probability maps simultaneously.

In Fig. 3, we show the network architecture used for automatic feature extraction, which is a fully convolutional network (FCN) adapted from the U-net architecture [9]. Batch normalisation (BN) is applied after each convolutional layer and before the rectified linear unit (ReLU) activation; the last layer is instead followed by softmax and sigmoid functions. In the FCN, input images have pixel dimensions of \(160 \times 160\). Every layer whose label is prefixed with ‘C’ performs the operation: convolution \(\rightarrow \) BN \(\rightarrow \) ReLU, except C17. The (filter size/stride) is (3 \(\times \) 3/1) for layers C1 to C16, except layers C3, C5, C8 and C11, which are (3 \(\times \) 3/2). The arrows represent (3 \(\times \) 3/1) convolutional layers (C14a−e), each followed by a transpose convolutional (up) layer with the factor necessary to produce feature map volumes of size 160 \(\times \) 160 \(\times \) 32, all of which are concatenated into the red feature map volume. Finally, C17 applies a (1 \(\times \) 1/1) convolution with a softmax activation and a sigmoid activation, producing the blue feature map volume of depth \(n+1\), corresponding to the n (here 3) region features and one edge feature of the image.
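
The PyTorch sketch below approximates the spirit of this architecture (multi-scale features brought back to full resolution and concatenated, with a softmax region head and a sigmoid edge head). The framework, layer counts and channel widths are abbreviated assumptions for illustration, not the authors' code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def conv_bn_relu(cin, cout, stride=1):
    # convolution -> batch normalisation -> ReLU, as used by the 'C' layers.
    return nn.Sequential(
        nn.Conv2d(cin, cout, 3, stride=stride, padding=1),
        nn.BatchNorm2d(cout),
        nn.ReLU(inplace=True),
    )

class RegionEdgeFCN(nn.Module):
    def __init__(self, n_regions=3):
        super().__init__()
        # Encoder: each later stage halves the resolution (cf. the stride-2 layers).
        self.stages = nn.ModuleList([
            nn.Sequential(conv_bn_relu(1, 32), conv_bn_relu(32, 32)),
            nn.Sequential(conv_bn_relu(32, 64, stride=2), conv_bn_relu(64, 64)),
            nn.Sequential(conv_bn_relu(64, 128, stride=2), conv_bn_relu(128, 128)),
            nn.Sequential(conv_bn_relu(128, 256, stride=2), conv_bn_relu(256, 256)),
        ])
        # Per-scale 3x3 convolutions reducing every scale to 32 channels (cf. C14a-e).
        self.lateral = nn.ModuleList([conv_bn_relu(c, 32) for c in (32, 64, 128, 256)])
        # Final 1x1 convolution producing n region logits plus 1 edge logit (cf. C17).
        self.head = nn.Conv2d(4 * 32, n_regions + 1, kernel_size=1)
        self.n_regions = n_regions

    def forward(self, x):
        feats, h = [], x
        for stage, lateral in zip(self.stages, self.lateral):
            h = stage(h)
            # Bilinear upsampling stands in for the transpose-convolution 'up' layers.
            feats.append(F.interpolate(lateral(h), size=x.shape[-2:],
                                       mode='bilinear', align_corners=False))
        logits = self.head(torch.cat(feats, dim=1))      # concatenated (red) volume
        p_region = torch.softmax(logits[:, :self.n_regions], dim=1)
        p_edge = torch.sigmoid(logits[:, self.n_regions:])
        return p_region, p_edge

# Example: p_region, p_edge = RegionEdgeFCN()(torch.randn(1, 1, 160, 160))
```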

After the network is trained, we deploy it on the given image I in the validation set and obtain the joint region and edge probability maps from the last convolutional layer

$$\begin{aligned} (P_R, P_E) = \mathbf CNN (I,\mathbf W ^*), \end{aligned}$$
(7)

where CNN(\(\cdot \)) denotes the trained network. \(P_R\) is a vector region probability map including n (number of regions) channels, while \(P_E\) is a scalar edge probability map. These probability maps are then fed to the energy (3), in which \(f_i = -\mathrm{{log}}P_{Ri}, i=\{1, \ldots , n\}\) and \(g=P_E\). With all necessary elements at hand, we are ready to minimise (3) next.
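
A minimal NumPy sketch of this substitution, assuming \(P_R\) and \(P_E\) have already been extracted as arrays from the network output:

```python
import numpy as np

def features_from_network(p_region, p_edge, eps=1e-8):
    # p_region: (n, H, W) region probability maps; p_edge: (H, W) edge probability map.
    f = -np.log(np.clip(p_region, eps, 1.0))   # region features f_i = -log P_Ri
    g = p_edge                                 # edge feature g = P_E
    return f, g
```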

Optimisation: The minimisation process of (3) entails the calculus of variations, by which we obtain the resulting Euler-Lagrange (EL) equation with respect to the variable \(\phi \). A solution (\(\phi ^*\)) to the EL equation is then iteratively sought by the following gradient descent method

$$\begin{aligned} \frac{{\partial \phi }}{{\partial t}} = - \sum \limits _{i = 1}^n {{f_i}\frac{{\partial {\chi _i}\phi }}{{\partial \phi }}} + \lambda {\kappa _g}\sum \limits _{i = 1}^{n - 1} {\delta _\epsilon \left( {\phi - {c_i}} \right) }, \end{aligned}$$
(8)

where \(\kappa _g\) is the weighted curvature, which can be numerically implemented by the finite difference method on a half-point grid [10], and \(\delta _\epsilon \) is the derivative of \(H_\epsilon \), as defined in [7].

At the steady state of (8), a local or global minimiser of (3) is found. Note that the energy (3) is nonconvex, so it may have multiple (local) minimisers. To obtain a desirable segmentation result, we therefore need a close initialisation of the level set function (\(\phi ^0\)) so that the algorithm converges to the solution we want. We tackle this problem by thresholding the region probability map \(P_{R3}\) and then computing the signed distance function (SDF) from the resulting binary image using the fast sweeping algorithm. The resulting SDF is used as \(\phi ^0\) in (8). In this way, the whole optimisation process is fully automated.
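
A condensed NumPy/SciPy sketch of this automated optimisation is given below. It uses SciPy's Euclidean distance transform as a stand-in for the fast sweeping algorithm, the standard geodesic form \(\mathrm{div}(g\nabla \phi /|\nabla \phi |)\) for the weighted curvature with plain central differences instead of the half-point-grid scheme of [10], and assumes f and g have been obtained from the network as above; it is an illustration under these assumptions, not the authors' implementation.

```python
import numpy as np
from scipy import ndimage

def signed_distance(mask):
    # Signed distance function: positive inside the mask, negative outside
    # (stand-in for the fast sweeping algorithm).
    return ndimage.distance_transform_edt(mask) - ndimage.distance_transform_edt(1 - mask)

def delta_eps(z, eps=1.5):
    # Derivative of the arctan-based smoothed Heaviside H_eps.
    return (eps / np.pi) / (eps ** 2 + z ** 2)

def weighted_curvature(phi, g):
    # kappa_g = div(g * grad(phi) / |grad(phi)|), via central differences.
    gy, gx = np.gradient(phi)
    norm = np.sqrt(gx ** 2 + gy ** 2) + 1e-8
    return np.gradient(g * gx / norm)[1] + np.gradient(g * gy / norm)[0]

def d_chi_d_phi(phi, c, eps=1.5):
    # Derivative of each characteristic function (2) with respect to phi.
    H = lambda z: 0.5 * (1.0 + (2.0 / np.pi) * np.arctan(z / eps))
    n = len(c) + 1
    out = [-delta_eps(c[0] - phi, eps)]
    for i in range(1, n - 1):
        out.append(delta_eps(phi - c[i - 1], eps) * H(c[i] - phi)
                   - H(phi - c[i - 1]) * delta_eps(c[i] - phi, eps))
    out.append(delta_eps(phi - c[-1], eps))
    return np.stack(out)

def segment(f, g, p_init, c=(0.0, 8.0), lam=1.0, dt=0.1, iters=200, eps=1.5):
    # f: (n, H, W) region features; g: (H, W) edge feature; p_init: probability
    # map (e.g. P_R3) thresholded at 0.5 to initialise phi as an SDF.
    phi = signed_distance((p_init > 0.5).astype(np.uint8)).astype(np.float64)
    for _ in range(iters):
        data = (f * d_chi_d_phi(phi, c, eps)).sum(axis=0)
        reg = weighted_curvature(phi, g) * sum(delta_eps(phi - ci, eps) for ci in c)
        phi += dt * (-data + lam * reg)          # gradient descent step (8)
    return phi
```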

4 Experimental Results

Data: Experiments were performed using short-axis CMR images from 430 PH patients. For each patient, 10 to 16 short-axis slices were acquired, roughly covering the whole heart. Each short-axis image has a resolution of \(1.5 \times 1.5 \times 8.0\;\mathrm {mm}^3\). Due to the large slice thickness of the short-axis slices and the inter-slice shift caused by respiratory motion, we train the FCN in a 2D fashion and apply the DNLS method to segment each slice separately. The ground truth region labels were generated using a semi-automatic process that included a manual correction step by an experienced clinical expert. Region labels for each subject contain the left and right ventricular blood pools and myocardial walls for all 430 subjects at the end-diastolic (ED) and end-systolic (ES) frames. The ground truth edge labels were derived from the region label maps by identifying pixels with label transitions. The dataset was randomly split into a training set (400 subjects) and a validation set (30 subjects). For pre-processing, all training images were reshaped to a common size of \(160 \times 160\) by zero-padding, and image intensities were normalised to the range [0, 1] before training.
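
As a small illustration of this pre-processing, a minimal NumPy sketch (assuming slices no larger than the target size, so only zero-padding is needed):

```python
import numpy as np

def preprocess(image, target=160):
    # Zero-pad a 2D slice to target x target and rescale intensities to [0, 1].
    h, w = image.shape
    out = np.zeros((target, target), dtype=np.float32)
    top, left = (target - h) // 2, (target - w) // 2
    out[top:top + h, left:left + w] = image
    lo, hi = out.min(), out.max()
    return (out - lo) / (hi - lo + 1e-8)
```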

Parameters: The following parameters were used in our experiments. First, there are six parameters associated with finding a desirable solution to (3): the weighting parameter \(\lambda \) (1), the regularisation parameter \(\epsilon \) (1.5), the two levels \(c_1\) (0) and \(c_2\) (8), the time step t (0.1), and the number of iterations (200). Second, for training the network, we use Adam SGD with a learning rate of 0.001 and a batch size of two subjects for each of 50000 iterations. The weight \(\alpha \) in (4) is set to 1. We perform data augmentation on-the-fly, which includes random translation, rotation, scaling and intensity rescaling of the input images and labels at each iteration. In this way, the network sees millions of different inputs by the end of training and becomes robust to new images; data augmentation is crucial for obtaining good results. Training took approx. 10 h (50000 iterations) on an Nvidia Titan XP GPU, while testing took about 5 s to segment all the images of one subject at ED and ES.
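
The on-the-fly augmentation can be sketched as follows (SciPy); the transformation ranges are illustrative assumptions, as the exact ranges are not specified here.

```python
import numpy as np
from scipy import ndimage

def augment(image, labels, rng=np.random):
    # Random translation, rotation, scaling and intensity rescaling (assumed ranges).
    angle = np.deg2rad(rng.uniform(-30, 30))
    scale = rng.uniform(0.9, 1.1)
    shift = rng.uniform(-10, 10, size=2)
    gain = rng.uniform(0.8, 1.2)

    centre = np.array(image.shape) / 2.0
    rot = np.array([[np.cos(angle), -np.sin(angle)],
                    [np.sin(angle),  np.cos(angle)]])
    mat = rot.T / scale                        # inverse (output -> input) mapping
    offset = centre - mat @ (centre + shift)

    def warp(arr, order):
        return ndimage.affine_transform(arr, mat, offset=offset, order=order)

    img = np.clip(warp(image, order=1) * gain, 0.0, 1.0)
    lab = warp(labels, order=0)                # nearest neighbour keeps labels integer
    return img, lab
```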

Fig. 4. Visual comparison of segmentation results from the vanilla CNN, CRF-CNN and the proposed method. LV & RV cavities and myocardium are delineated using yellow and red contours. GT stands for ground truth.

Table 1. Quantitative comparison of segmentation results from the vanilla CNN, CRF-CNN and the proposed method, in terms of the Dice metric (mean ± standard deviation) and computation time at the testing stage.

Comparison: Segmentation performance was evaluated by computing the Dice overlap metric between the automated and ground truth segmentations of the LV & RV cavities and myocardium. We compared our method with the vanilla CNN proposed in [4], whose code is publicly available. DNLS was also compared with the vanilla CNN refined by a conditional random field (CRF) [11] (CRF-CNN). In Fig. 4, visual comparison suggests that DNLS provides significant segmentation improvements over CNN and CRF-CNN. For example, at the base of the right ventricle both CNN and CRF-CNN fail to retain the correct anatomical relationship between endocardium and epicardium, portraying the endocardial border outside the epicardium; DNLS, by contrast, retains the endocardial border within the epicardium, as in the ground truth. In Table 1, we report the Dice metrics at the ED and ES time frames on the validation dataset, showing that our DNLS method outperforms the other two methods for all anatomical structures, especially the myocardium. CNN is the fastest method as it was deployed on a GPU, while DNLS is the most computationally expensive due to its more complex optimisation process.
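
For reference, the Dice metric used above can be computed as in the short sketch below, assuming a binary mask per anatomical structure:

```python
import numpy as np

def dice(auto_mask, gt_mask, eps=1e-8):
    # Dice overlap between an automated and a ground truth binary segmentation.
    a, b = auto_mask.astype(bool), gt_mask.astype(bool)
    return 2.0 * np.logical_and(a, b).sum() / (a.sum() + b.sum() + eps)
```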

5 Conclusion

In this paper, we proposed the deep nested level set (DNLS) method for segmentation of CMR images in patients with pulmonary hypertension. Our main contribution is to combine the classical level set method with a fully convolutional network to address pathological image segmentation, a major challenge in medical image segmentation. DNLS inherits the advantages of both the level set method and the neural network, the former being able to model the complex geometries of cardiac morphology and the latter providing robust learned features. We derived DNLS in detail and demonstrated that it outperforms two state-of-the-art methods.