1 Introduction

Cardiovascular diseases are a leading cause of death globally. Segmenting the whole heart in cardiovascular images is a prerequisite for morphological and pathological analysis, disease diagnosis, and surgical planning [6]. However, automatic and accurate cardiovascular image segmentation remains highly challenging due to large anatomical variations across subjects, missing/ambiguous borders, and inhomogeneous appearance and image quality (e.g., see Fig. 1(a, b)).

Fig. 1.

Examples of cardiovascular images from (a) the MM-WHS CT dataset [14] in the axial plane and (b) the HVSMR dataset [6] in the sagittal plane. (c) Myocardium boundaries in the axial plane are easier to recognize.

Recent studies have shown that deep learning based methods [2,3,4, 11, 12] can learn robust contextual and semantic features and achieve state-of-the-art segmentation performance. 3D fully convolutional networks (FCNs) are a mainstream approach for cardiac segmentation due to their ability to integrate both inter- and intra-slice information in 3D images. However, two key factors have not been well explored: (1) the imaging quality in different anatomical planes is not the same, and thus the degrees of segmentation difficulty in different views are unequal; (2) subtle structures (e.g., myocardium, pulmonary artery) have different orientations in different anatomical planes. Symmetrical convolution and pooling operations may cause subtle structures or boundaries to diminish quickly, incurring segmentation errors. As shown in Fig. 1(c), myocardium boundaries in the axial plane are easier to recognize; with asymmetrical pooling along the longitudinal axis, more complementary inter-slice information can be kept, which in turn benefits segmentation in the axial plane.

Many recent studies have tried to tackle the anisotropy issue of 3D biomedical images. Still, they could not segment the myocardium or pulmonary artery well. Known methods that explored anisotropic 3D kernels in FCNs can be categorized into two types. (1) The methods in [2, 8] focused on designing repeatable cell structures, which we call short-range asymmetrical cells, and systematically replaced all 3D convolutions with them. However, symmetrical pooling was used and deep features were fused periodically, so distinctive features vanished quickly. (2) The methods in [3, 5] dealt with the anisotropy problem in 3D images by using 2D FCNs to extract intra-slice features and 3D FCNs to aggregate inter-slice features. But they did not exploit the fact that complementary information in the other planes (the xz- and yz-planes) can also benefit the xy-plane, especially in less anisotropic 3D data (e.g., when the spacing resolution in the z-axis is only 3–5\(\times \) larger than that in the x- and y-axes).

To address the above two key factors, we propose a new heterogeneous feature aggregation network (HFA-Net), which is able to fully exploit complementary information in multiple views of 3D cardiac images and aggregate heterogeneous features to boost segmentation performance. To handle the issue in [2, 8], we utilize long-range asymmetrical branches to maintain distinguishable features associated with a specific view. Besides asymmetrical convolution operations, we also apply asymmetrical pooling operations to maintain spatial resolution in the other planes. To address the issue in [3, 5], we utilize parallel encoding paths to extract heterogeneous features from multiple geometric views of the 3D data (i.e., the axial, coronal, and sagittal planes). There is a good chance that an object can be distinguished in at least one of the geometric views; thus, we encourage richer contextual and semantic features to be extracted. Further, to improve parameter-performance efficiency and reduce GPU memory usage, we devise a sparsified densely-connected convolutional block for our model, and the resulting HFA-Net can be trained end-to-end.

Experiments on three public challenge datasets [6, 10, 14] show that our new method achieves segmentation results competitive with state-of-the-art methods.

Fig. 2.

An overview of our new HFA-Net framework.

2 Method

Our HFA-Net has three main components (see Fig. 2): (1) long-range asymmetrical branches (LRABs) that preserve subtle structures via asymmetrical convolutions and poolings; (2) a content-aware fusion module (CAFM) that combines the multiple asymmetrical branches, utilizing both raw images and feature maps from the LRABs; (3) a new 3D sparse aggregation block (SAB) that reduces GPU memory usage and enables end-to-end training of the entire network.

2.1 Long-Range Asymmetric Branch (LRAB)

A straightforward way to exploit multiple geometric views of 3D images is to replace conventional 3D convolutional (Conv) layers with short-range asymmetrical cells (SRACs) [2, 8]. As shown in Fig. 3(a), a 3D Conv kernel is decomposed into m parallel streams, each having n pseudo-2D kernels and a corresponding orthogonal pseudo-1D kernel. But the typical decompositions they exploited are \(\{m=1,2; n=1,2\}\), which may not make the best use of all the geometric properties of 3D data. Further, such an SRAC only governs the specific layer-wise computation and neglects the outer branch/network level, which controls spatial resolution changes. Most importantly, feature maps are added together periodically after each SRAC, which makes feature maps in deeper layers homogeneous so that the parallel streams no longer contribute to richer feature extraction. To address these issues, our method aims to fully exploit all three orthogonal views and encourage the extraction of heterogeneous features at different scales. For this goal, we need to carefully design both the layer-level and branch-level operations.

Notation. We denote a 3D Conv layer as Conv\((\mathcal {K}_{k_1,k_2,k_3}/\mathcal {S}_{s_1,s_2,s_3})\), where \(k_i\) and \(s_i\) are the kernel size and stride in each direction. Conventionally, \(k_1=k_2=k_3\) and \(s_1=s_2=s_3\). A 3D kernel \(\mathcal {K}_{3,3,3}\) can be decomposed into an SRAC (with \(m=1\) and \(n=1\)) by \(\mathcal {K}_{3,3,1}\otimes \mathcal {K}_{1,1,3}\), \(\mathcal {K}_{3,1,3}\otimes \mathcal {K}_{1,3,1}\), or \(\mathcal {K}_{1,3,3}\otimes \mathcal {K}_{3,1,1}\), where \(\otimes \) denotes convolution. Similarly, we denote a 3D deconvolutional (DeConv) layer as DeConv\((\mathcal {K}_{k_1,k_2,k_3}\times \mathcal {S}_{s_1,s_2,s_3})\) and a pooling layer as \(\mathcal {P}_{s_1,s_2,s_3}\).
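To make this notation concrete, the following minimal TensorFlow/Keras sketch (our illustration under assumed filter counts, not the authors' released code) implements the SRAC decomposition \(\mathcal {K}_{3,3,1}\otimes \mathcal {K}_{1,1,3}\):

```python
import tensorflow as tf

def srac(x, filters):
    """SRAC with m=1, n=1: a pseudo-2D kernel K_{3,3,1} in the xy-plane
    followed by an orthogonal pseudo-1D kernel K_{1,1,3} along the z-axis,
    together approximating a full K_{3,3,3} convolution."""
    x = tf.keras.layers.Conv3D(filters, kernel_size=(3, 3, 1),
                               padding='same', activation='relu')(x)
    x = tf.keras.layers.Conv3D(filters, kernel_size=(1, 1, 3),
                               padding='same', activation='relu')(x)
    return x

# Input: a 3D patch of shape (x, y, z, channels).
inp = tf.keras.Input(shape=(64, 64, 64, 1))
out = srac(inp, filters=16)   # spatial shape preserved: (None, 64, 64, 64, 16)
```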

Fig. 3.

(a) Short-Range Asymmetric Cell; (b) Long-Range Asymmetric Branch; (c) Content-Aware Fusion Module. I: raw image; \(F_{i}^{s_j}\): feature maps (see Sect. 2.2).

Figure 3(b) shows the concept of our long-range asymmetrical branch (LRAB). We utilize three LRABs (\(m=3\)) that operate on the three orthogonal geometric views separately, thus increasing the independence among the m parallel encoding paths. The original symmetrical Conv\((\mathcal {K}_{k_a,k_a,k_a}/\mathcal {S}_{s_a,s_a,s_a})\) is replaced by an asymmetrical counterpart in each branch (i.e., \((\mathcal {K}_{k_a,k_a,1}/\mathcal {S}_{s_a,s_a,1})\), \((\mathcal {K}_{k_a,1,k_a}/\mathcal {S}_{s_a,1,s_a})\), or \((\mathcal {K}_{1,k_a,k_a}/\mathcal {S}_{1,s_a,s_a})\)). Likewise, the consecutive 3D Conv kernel \((\mathcal {K}_{k_b,k_b,k_b}/\mathcal {S}_{s_b,s_b,s_b})\) is decomposed in the same orientation in each branch. Moreover, since the Conv kernels in each LRAB are along the same orientation, conventional symmetrical pooling is no longer suitable (otherwise, inter-slice features may vanish quickly before being extracted). In our problem, cardiovascular segmentation is highly challenging especially due to the missing/ambiguous boundaries between the regions of interest and the background or among various sub-structures. Thus, asymmetrical pooling (i.e., \(\mathcal {P}_{s,s,1}\), \(\mathcal {P}_{s,1,s}\), or \(\mathcal {P}_{1,s,s}\)) is utilized to maintain spatial resolution in the orthogonal direction, so that there is a better chance that subtle distinguishable features are kept in at least one of the geometric views.

For example, a \(T\times T\times T\) tensor becomes a \(\frac{T}{8}\times \frac{T}{8}\times \frac{T}{8}\) tensor after three \(\mathcal {P}_{2,2,2}\) poolings, but becomes \(\frac{T}{8} \times \frac{T}{8} \times T\) after three \(\mathcal {P}_{2,2,1}\) poolings. Hence, additional information on subtle structures along the z-axis is kept and will be utilized by subsequent processing. Observe that the designs in [3, 5] can be viewed as special cases of our LRAB, since these methods only used (pre-trained) 2D FCNs to extract deep feature maps from 3D data slice by slice independently (i.e., with \(m=1\)). Thus, our method is more comprehensive in heterogeneous feature aggregation. Specifically, as shown in Fig. 2, our first LRAB is composed of stacked layers of Conv\((\mathcal {K}_{3,3,1}/\mathcal {S}_{2,2,1})\), SAB\((\mathcal {K}_{3,3,1}/\mathcal {S}_{1,1,1})\), \(\mathcal {P}_{2,2,1}\), and SAB\((\mathcal {K}_{3,3,1}/\mathcal {S}_{1,1,1})\), where SAB\((\mathcal {K}_{3,3,1}/\mathcal {S}_{1,1,1})\) refers to a sparse aggregation block (SAB) composed of stacked Conv\((\mathcal {K}_{3,3,1}/\mathcal {S}_{1,1,1})\) layers; we will present SAB in Sect. 2.3. In the \(i^{th}\) LRAB, feature maps at different scales (\(s_j, j=1,2,\ldots ,k\)) are recovered by asymmetrical DeConv layers accordingly, denoted by \(F_i^{s_j}\). We will discuss how to aggregate useful information from these heterogeneous feature maps in Sect. 2.2.
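The sketch below (ours; the filter count and the plain-Conv stand-in for SAB are simplifying assumptions) shows how the first (axial) LRAB keeps the z-axis at full resolution:

```python
import tensorflow as tf
L = tf.keras.layers

def conv(x, filters, k, s=(1, 1, 1)):
    return L.Conv3D(filters, kernel_size=k, strides=s, padding='same',
                    activation='relu')(x)

def lrab_axial(x, filters=32):
    """First LRAB: every kernel and pooling leaves the z-axis untouched."""
    x = conv(x, filters, (3, 3, 1), (2, 2, 1))  # Conv(K_{3,3,1} / S_{2,2,1})
    x = conv(x, filters, (3, 3, 1))             # stand-in for SAB(K_{3,3,1})
    x = L.MaxPool3D(pool_size=(2, 2, 1))(x)     # asymmetric pooling P_{2,2,1}
    x = conv(x, filters, (3, 3, 1))             # stand-in for SAB(K_{3,3,1})
    return x

inp = tf.keras.Input(shape=(64, 64, 64, 1))
feat = lrab_axial(inp)   # (None, 16, 16, 64, 32): z stays at full resolution
```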

2.2 Content-Aware Fusion Module (CAFM)

To maximally exploit the extracted heterogeneous feature maps \(F_i^{S_j}\) from the parallel LRABs, we need to selectively leverage correct information and suppress incorrect information. It is quite possible that each voxel is correctly classified in at least one geometric view; thus, a key challenge is how to deal with agreement and disagreement among different views. For this, we present a content-aware fusion module (CAFM, see Fig. 3(c)) to generate aggregated deep features.

The input of CAFM includes two parts: a raw image I and heterogeneous feature maps \(F_i^{S_j}\) of the same shape, where i indexes the \(i^{th}\) LRAB and \(S_j\) the selected scales in the LRABs. HFA-Net has \(m=3\) LRABs; thus \(i\in \{1,2,3\}\). There are three scales in each LRAB and we choose the last two; thus \(j\in \{2,3\}\). To recover the asymmetrical feature maps to the original resolution of the input image I, we use asymmetrical DeConv layers accordingly (e.g., we use stacked {DeConv\((\mathcal {K}_{4,4,1}\times \mathcal {S}_{2,2,1})\), DeConv\((\mathcal {K}_{4,4,1}\times \mathcal {S}_{2,2,1})\)} to obtain \(F_1^{S_3}\) for the \(1^{st}\) LRAB). Then we average the feature maps from the same scale but different branches to obtain hierarchical features \(F^{S_j} = \frac{1}{m} \sum _{i=1}^m{F_i^{S_j}}\). This averaging provides a compact representation of all \(F_i^{S_j}\)'s while still showing the image areas where the heterogeneous features agree or disagree. Next, each \(F^{S_j}\) is concatenated with the raw image I and fed to an encoder SAB, and all the intermediate feature maps are integrated in the middle of CAFM for extracting better representations. The raw image I provides a reference for helping to further locate detailed features and guide the feature aggregation process.
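A minimal fusion sketch for one scale might look as follows (our illustration; the DeConv stack matches the \(F_1^{S_3}\) example above, while filter counts and function names are assumptions):

```python
import tensorflow as tf
L = tf.keras.layers

def recover_axial(f, filters, n=2):
    """Recover an axial-branch feature map to input resolution with stacked
    DeConv(K_{4,4,1} x S_{2,2,1}) layers (n=2 for the deepest scale S_3)."""
    for _ in range(n):
        f = L.Conv3DTranspose(filters, kernel_size=(4, 4, 1),
                              strides=(2, 2, 1), padding='same')(f)
    return f

def cafm_fuse_scale(image, branch_feats):
    """Average the m recovered branch maps for one scale, then concatenate
    the raw image I as a reference before the encoder SAB."""
    avg = L.Average()(branch_feats)              # F^{S_j} = (1/m) sum_i F_i^{S_j}
    return L.Concatenate(axis=-1)([image, avg])  # fed into an encoder SAB
```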

The loss function is computed as \(\mathcal {\ell }(X,Y;\theta )=\mathcal {\ell }_{ce}(\widetilde{P},Y) + \lambda _1 \mathcal {\ell }_{ce}(\widetilde{P}_{aux},Y) + \sum _i\sum _j \lambda _{ij} \mathcal {\ell }_{ce}(S(F_i^{S_j}), Y) \), where Y is the corresponding ground truth of each training sample X, \(\widetilde{P}\) and \(\widetilde{P}_{aux}\) are the main and auxiliary predictions, \(\mathcal {\ell }_{ce}\) is the multi-class cross-entropy loss, and \(S(\cdot )\) is the softmax function. See the supplementary material for more details on HFA-Net.
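In code, the deep-supervision structure of this loss can be sketched as below (the \(\lambda \) values are illustrative placeholders, not values stated in this section):

```python
import tensorflow as tf

ce = tf.keras.losses.CategoricalCrossentropy()

def hfa_loss(y, p_main, p_aux, side_logits, lam_aux=0.5, lam_side=0.25):
    """ell(X,Y) = ce(P, Y) + lam_1 * ce(P_aux, Y)
                 + sum_{i,j} lam_ij * ce(softmax(F_i^{S_j}), Y).
    p_main/p_aux are softmax outputs; side_logits holds one raw feature
    map per (branch i, scale j). All lambda weights here are assumptions."""
    loss = ce(y, p_main) + lam_aux * ce(y, p_aux)
    for logits in side_logits:
        loss += lam_side * ce(y, tf.nn.softmax(logits))
    return loss
```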

2.3 Sparse Aggregation Block (SAB)

DenseVoxNet [11] is a state-of-the-art model for cardiovascular image segmentation, built on DenseBlocks with dense connections. It aggregates all the previously computed features at each subsequent layer, computed as \( x_{\ell } = H_{\ell }([x_0,x_1,\ldots ,x_{\ell -1}]) \), where \(x_0\) is the input, \(x_{\ell }\) is the output of layer \(\ell \), \([\cdot ]\) is the concatenation operation, and \(H_{\ell }(\cdot )\) is a composite of operations such as Conv, pooling, BN, and ReLU. The dense connections help transfer useful features from shallower to deeper layers and, in turn, allow each shallow layer to receive a direct supervision signal, thus alleviating the vanishing-gradient problem in training deep ConvNets and achieving better parameter-performance efficiency.

However, for a DenseBlock of depth N, the number of skip connections and parameters grows quadratically asymptotically (i.e., \(O(N^2)\)). This means that each layer generates only a few new outputs to an ever-widening concatenation of previously seen feature representations. Thus, it is hard for the model to make full use of all the parameters and dense skip connections [13].

To further ease the training of our HFA-Net, we devise a new sparsified densely-connected convolutional block, called the sparse aggregation block (SAB), to improve parameter-performance efficiency. The output \(x_{\ell }\) of layer \(\ell \) is computed as \(x_{\ell } = H_{\ell }([ x_{\ell -c^0}, x_{\ell -c^1}, x_{\ell -c^2}, x_{\ell -c^3}, \ldots , x_{\ell -c^k} ])\), where \(c>1\) is an integer and \(k\ge 0\) is the largest integer such that \(c^k \le {\ell }\). For an SAB of total depth N, this sparse aggregation introduces only \(O(\log _{c}N)\) incoming links per layer, for a total of \(O(N\log N)\) connections and parameters. We use \(c=2\) and \(N=12\) in all experiments. See the supplementary material for more details.
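The indexing rule is easy to state in a few lines of Python (a sketch of the connection pattern only, not the authors' implementation):

```python
def sab_inputs(ell, c=2):
    """Earlier layers aggregated at layer ell of an SAB: indices ell - c^t
    for every integer t >= 0 with c^t <= ell."""
    idx, p = [], 1            # p runs over c^0, c^1, c^2, ...
    while p <= ell:
        idx.append(ell - p)
        p *= c
    return idx

# With c=2, layer 12 sees only layers 11, 10, 8, and 4:
print(sab_inputs(12))         # -> [11, 10, 8, 4]
# Dense aggregation would concatenate all of layers 0..11 instead.
```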

Table 1. Datasets and training details. “GT = ✗”: the ground truth of the data is kept by the organizers for fair comparison. The initial learning rate \(L_r\) = \(5\times 10^{-4}\).

3 Experiments and Results

Three 3D Datasets. (1) The 2016 HVSMR dataset [6] targets segmenting the myocardium and great vessels (blood pool) in cardiovascular MRIs. The results are evaluated using three criteria: Dice coefficient, average distance of boundaries (ADB), and symmetric Hausdorff distance. A score \(S=\sum _{class}(\frac{1}{2}{\textit{Dice}} - \frac{1}{4}{\textit{ADB}} -\frac{1}{30}{\textit{Hausdorff}})\) is used to measure the overall accuracy of the results and for ranking (see the worked example below). (2) The 2017 MM-WHS CT dataset [14] targets seven cardiac structures: the left/right ventricle blood cavity (LV/RV), left/right atrium blood cavity (LA/RA), myocardium of the left ventricle (LV-myo), ascending aorta (AO), and pulmonary artery (PA). Following the setting in [1], we randomly split the dataset into training (16 subjects) and testing (4 subjects) sets, which are fixed throughout all experiments. (3) The 2017 AAPM CT dataset [10] targets five thoracic structures (esophagus, spinal cord, left/right lung, and heart); the esophagus and spinal cord are particularly difficult cases.
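As a worked example, the HVSMR overall score in item (1) is computed per class and summed; the metric values below are hypothetical, not results from the paper:

```python
def hvsmr_score(per_class_metrics):
    """Overall score S = sum over classes of Dice/2 - ADB/4 - Hausdorff/30.
    Higher Dice raises the score; larger distances (in mm) lower it."""
    return sum(dice / 2 - adb / 4 - hd / 30
               for (dice, adb, hd) in per_class_metrics)

# Hypothetical (Dice, ADB, Hausdorff) for myocardium and blood pool:
print(round(hvsmr_score([(0.85, 0.80, 2.5), (0.93, 0.90, 5.0)]), 3))  # 0.215
```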

Implementation Details. Our proposed method is implemented in Python with the TensorFlow framework and trained on an NVIDIA Tesla V100 GPU with 32 GB of memory. All the models are initialized using a Gaussian distribution and trained with the "poly" learning rate policy. We perform data augmentation to reduce overfitting. More details can be found in Table 1.
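For reference, the "poly" policy decays the learning rate polynomially over training; a minimal sketch follows, where the power 0.9 and the step budget are common-choice assumptions (only the initial rate \(5\times 10^{-4}\) comes from Table 1):

```python
def poly_lr(step, max_steps, base_lr=5e-4, power=0.9):
    """'Poly' policy: lr = base_lr * (1 - step / max_steps) ** power."""
    return base_lr * (1.0 - step / max_steps) ** power

print(poly_lr(0, 30000))       # 5.0e-4 at the first step
print(poly_lr(15000, 30000))   # ~2.68e-4 halfway through training
```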

Table 2. Segmentation results on the 2016 HVSMR dataset (top), 2017 MM-WHS CT dataset (middle), and 2017 AAPM CT dataset (bottom).

Quantitative Results. Table 2 (top) shows a quantitative comparison of HFA-Net against other methods from the 2016 HVSMR Challenge leaderboard, including a conventional atlas-based method [9] and 3D FCN based methods [4, 11]. First, our re-implementation of DVN achieves state-of-the-art performance, and our S-DVN with SAB achieves competitive results while reducing the number of parameters by \(\sim \!60\%\) (4.3M vs. 1.6M). Second, recall the two types of known anisotropic 3D methods (see Sect. 1). We choose at least one typical method from each type for comparison. The method in [2] is based on the short-range asymmetrical cell design, which utilizes 3D kernel decomposition on the orthogonal planes to predict a class label for each voxel. The method in [3] extracts features from the xy-plane by a 2D FCN and applies a 3D FCN to fuse inter-slice information. Our HFA-Net outperforms these methods across nearly all the metrics with a very high overall score of 0.239. The results for the 2017 MM-WHS CT dataset are given in Table 2 (middle). First, our baselines (DVN and S-DVN) already achieve better results than the known state-of-the-art methods [1, 7]. Second, our HFA-Net further improves the accuracy on most of the categories across nearly all the metrics, especially for subtle structures such as LV-myo and AO. To further show that our method is robust and effective in delineating subtle structures, we experiment with HFA-Net on the 2017 AAPM CT dataset. Quantitative results in Table 2 (bottom) show a promising performance gain, especially for the esophagus and spinal cord (\(2\%\) gain in Dice coefficient).

Fig. 4.

Visual qualitative results: the 2016 HVSMR dataset (a), 2017 MM-WHS CT dataset (b), and 2017 AAPM CT dataset (c) (some errors marked by magenta arrows).

Qualitative Results. As shown in Fig. 4, our HFA-Net attains better results and shows a strong capability of delineating missing/ambiguous boundaries. More qualitative results can be found in supplementary material.

4 Conclusions

In this paper, we presented a new deep heterogeneous feature aggregation network (HFA-Net) for cardiovascular segmentation in 3D CT/MR images. Our proposed HFA-Net extracts rich heterogeneous features using long-range asymmetrical branches and aggregates diverse contextual and semantic deep features using a content-aware fusion module. A sparse aggregation block is utilized to give HFA-Net better parameter-performance efficiency. Comprehensive experiments on three open challenge datasets demonstrated the efficacy of our new method.