1 Introduction

Whole heart segmentation is essential for the diagnosis of heart disease. However, its efficiency is limited by the cost of expert annotation and the subjectivity of doctors' judgments. Moreover, because segmentation labels can only be produced by doctors and experts, far fewer annotated medical images are available for research than in other image domains. In recent years, deep learning has achieved great success in computer vision and artificial intelligence, enabling automatic segmentation of the substructures of the heart from Computed Tomography (CT) [3]. U-net [9] and the Fully Convolutional Network [7] have greatly improved medical image segmentation in terms of accuracy and execution speed, but they suffer from vanishing and exploding gradients as the network depth increases. To tackle this problem, Lee et al. [10] added a deep supervision mechanism to the network, effectively alleviating these gradient problems. Yang et al. [1] applied deep supervision to whole heart segmentation; by integrating Dice loss and cross-entropy loss into the network, they obtained excellent segmentation results. Building on this work, Ye et al. [5] replaced the weighted cross-entropy loss with the Focal loss, which makes the model focus on hard-to-distinguish boundaries and improves the Dice accuracy.

Medical images contain far more background voxels than foreground voxels, so segmentation suffers from a high misclassification rate. To overcome this limitation, several segmentation frameworks [6, 8] have been put forward in recent years. These frameworks, known as cascade networks, proceed in two steps: (1) locate the target to simplify the task; (2) segment it. Among them, Payer et al. [8] applied this strategy to whole heart images and won first place in the MICCAI2017 Whole Heart Segmentation Challenge. However, such frameworks use parameters excessively or redundantly, for example by repeatedly extracting low-level features. Oktay et al. [11] proposed the plug-and-play Attention Gates (AGs) model, which lets the network learn to focus on relevant areas during training, overcoming this shortcoming of CNNs to some extent. Wu et al. [4] proposed a weakly supervised learning (WSL) based method for brain lesion segmentation; through weak supervision, the network automatically selects the relevant regions and suppresses irrelevant image information.

In this paper, we propose a novel 3D CNN combined with weakly supervised learning for cardiac segmentation. We first use deformable image registration (DIR) [2] to generate pseudo masks for all the CT images, producing weakly supervised information. We then use this weakly supervised information to guide the learning of a novel 3D U-net. Furthermore, we develop a loss function based on the weighted cross-entropy that forces the CNN to pay more attention to the hard voxels near the boundaries of the cardiac substructures during training.

The main contributions of this paper are as follows:

(1) We applied traditional medical image registration to generate weakly supervised information as prior knowledge for guiding deep network learning, which not only helps distinguish background from foreground organs but also serves as a form of data augmentation that mitigates overfitting.

(2) We developed an improved weighted cross-entropy loss that forces the deep network to pay attention to mis-segmented voxels and alleviates the class imbalance problem.

2 Method

2.1 Pseudo Masks

The inputs of the network consist of two parts: the original CT image and the pseudo masks, which are encoded as one-hot maps after the background is removed. The generated pseudo masks allow relevant image regions to be selected automatically. Although pseudo masks are not accurate segmentations, they provide the approximate locations of background and foreground and effectively separate the heart substructures from the background. We use DIR (deformable image registration) [2] to generate pseudo masks for the medical images. Let \(\{T_i\}_{i=1}^{N}\) denote the N training samples and \(\{V_j\}_{j=1}^{M}\) the M test samples. There are two training schemes, called Model_N-1 and Model_1, as shown in Fig. 2. For a given training sample \(T_i\), each of the other N-1 training samples is used as an atlas to generate a pseudo mask for \(T_i\). In Model_N-1, we concatenate \(T_i\) with each of its N-1 pseudo masks in turn and feed the pairs into the deep network for training. In Model_1, the N-1 pseudo masks of \(T_i\) are first fused by majority voting into a single pseudo mask, which is then concatenated with \(T_i\) and fed into the deep network for training. Similarly, there are two ways to generate test results, called IND and MV, as shown in Fig. 2. In the IND model, each training sample is used as an atlas to generate a pseudo mask for the test sample \(V_j\); at testing time, we concatenate each of the N pseudo masks with \(V_j\) and pass it through the deep network, obtaining N segmentation results for \(V_j\), which are then fused by majority voting into the final segmentation. In the MV model, the N pseudo masks of \(V_j\) are first fused by majority voting into a single pseudo mask, which is then concatenated with \(V_j\) and passed through the deep network to produce the segmentation result.
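The paper does not provide code for the fusion step; the following minimal NumPy sketch illustrates per-voxel majority voting over N pseudo masks (the function name, array layout, and label convention are our assumptions):

```python
import numpy as np

def majority_vote(pseudo_masks):
    """Fuse N pseudo masks into one label map by per-voxel majority voting.

    pseudo_masks: list of N integer arrays of shape (D, H, W), where each
    voxel holds a class index (0 = background, 1..7 = heart substructures).
    """
    stacked = np.stack(pseudo_masks, axis=0)            # (N, D, H, W)
    num_classes = int(stacked.max()) + 1
    # Count, for every voxel, how many atlases voted for each class.
    votes = np.zeros((num_classes,) + stacked.shape[1:], dtype=np.int32)
    for c in range(num_classes):
        votes[c] = (stacked == c).sum(axis=0)
    return votes.argmax(axis=0)                         # fused (D, H, W) mask
```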

Fig. 1. The framework of the proposed Deep U-net network. In the input layer, we concatenate the generated pseudo masks with the cropped patches and feed them into the network for training. The details of pseudo mask generation and patch cropping are introduced in Sect. 2.1 and Sect. 3.2.

Fig. 2. Two training methods (left) and two test methods (right) of pseudo masks.

2.2 Deep U-Net Network

To better train the deep network, we adopt deep supervision, which adds output paths at different network layers and shortens the backpropagation path of the gradient flow. In this paper, three deeply supervised branches are introduced in the decoding stage. The output of each branch has the same form as that of the main branch. In Fig. 1, out1, out2, and out3 are the three deeply supervised branches, and the final total loss is the sum of the losses of each branch and the main branch.
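As an illustration of such a branch (the paper gives no reference implementation; the module name, layer sizes, and upsampling scheme are our assumptions), each auxiliary head can be a 1×1×1 convolution whose logits are upsampled to the resolution of the main output:

```python
import torch.nn as nn
import torch.nn.functional as F

class DeepSupervisionHead(nn.Module):
    """Hypothetical auxiliary branch: 1x1x1 conv + trilinear upsampling."""

    def __init__(self, in_channels, num_classes, scale):
        super().__init__()
        self.conv = nn.Conv3d(in_channels, num_classes, kernel_size=1)
        self.scale = scale  # upsampling factor back to full resolution

    def forward(self, x):
        logits = self.conv(x)  # per-class logits at this decoder stage
        return F.interpolate(logits, scale_factor=self.scale,
                             mode='trilinear', align_corners=False)

# out1, out2, out3 come from three decoder stages; together with the main
# output they are supervised by Eq. (3) and summed as in Eq. (4).
```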

2.3 Improved Weighted Cross-Entropy Loss

The commonly used weighted cross-entropy loss does not perform well on voxels that are difficult to segment. In this paper, we add losses for the predicted false negative (FN) and true positive (TP) voxels to the weighted cross-entropy to formulate the total loss, as shown in Eq. (1).

$${L}_{mwCross}\left(x,y,z\right)=-\sum_{c=1}^{C}\sum_{i=1}^{N}{w}_{c}\left[\left({G}_{c}^{i}+{G}_{cFN}^{i}\right)\log{P}_{c}^{i}+{P}_{cTP}^{i}\log\left(1-{P}_{c}^{i}\right)\right]$$
(1)

where \({G}_{cFN}^{i}\) is 0 or 1, with 1 indicating that the current voxel belongs to class c but is predicted to be of another class; \({P}_{cTP}^{i}\) is 0 or 1, with 1 indicating that the current voxel is predicted to be class c but actually belongs to another class; \({P}_{c}^{i}\) (0.005 \(<{P}_{c}^{i}<\) 0.995) is the probability that the current voxel is of class c, clipped to this range to prevent excessively large loss values that would hinder network convergence; and \({w}_{c}\) is the weight coefficient of class c, which can be used to alleviate class imbalance.
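A hedged PyTorch sketch of Eq. (1), implemented directly from the definitions above (the tensor layout and the use of hard argmax predictions for the FN/TP indicators are our assumptions, not the authors' reference code):

```python
import torch

def iwce_loss(probs, target_onehot, class_weights):
    """Sketch of the improved weighted cross-entropy in Eq. (1).

    probs:         (B, C, D, H, W) softmax probabilities P_c^i
    target_onehot: (B, C, D, H, W) one-hot ground truth G_c^i
    class_weights: (C,) per-class weights w_c
    """
    probs = probs.clamp(0.005, 0.995)             # 0.005 < P_c^i < 0.995
    pred = probs.argmax(dim=1, keepdim=True)      # hard predicted class
    pred_onehot = torch.zeros_like(probs).scatter_(1, pred, 1.0)

    g_fn = target_onehot * (1 - pred_onehot)  # class c, predicted otherwise
    p_tp = pred_onehot * (1 - target_onehot)  # predicted c, actually other

    w = class_weights.view(1, -1, 1, 1, 1)
    return -(w * ((target_onehot + g_fn) * probs.log()
                  + p_tp * (1 - probs).log())).sum()
```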

We also adopt the MDSC (Multi-Class Dice Similarity Coefficient) based loss function to balance the training across multiple classes [1]. This loss is defined as:

$${L}_{mDSC}=-\sum_{c=1}^{C}\frac{\frac{2}{N}\sum_{i=1}^{N}{G}_{c}^{i}{P}_{c}^{i}}{\sum_{i=1}^{N}{G}_{c}^{i}{G}_{c}^{i}+\sum_{i=1}^{N}{P}_{c}^{i}{P}_{c}^{i}}$$
(2)

where N is the number of voxels; \({G}_{c}^{i}\) is a binary value, with 1 indicating that the voxel belongs to class c and 0 to another category; and \({P}_{c}^{i}\) (0 \(<\) \({P}_{c}^{i}\) \(<\) 1) denotes the probability that the current voxel belongs to class c.
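Eq. (2) translates directly into the same tensor layout as the IWCE sketch above (again a sketch; the small epsilon is our addition for numerical stability):

```python
def mdsc_loss(probs, target_onehot, eps=1e-8):
    """Sketch of the multi-class Dice loss in Eq. (2)."""
    dims = (0, 2, 3, 4)                            # sum over all voxels i
    n = target_onehot[:, 0].numel()                # N voxels per class
    inter = (target_onehot * probs).sum(dim=dims)
    denom = (target_onehot ** 2).sum(dim=dims) + (probs ** 2).sum(dim=dims)
    return -((2.0 / n) * inter / (denom + eps)).sum()  # sum over classes c
```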

After \(L_{mwCross}\) (Eq. (1)) and \(L_{mDSC}\) (Eq. (2)) are added into the network, the new loss function can be defined as follows:

$${L}_{out\_x}\left(d,w\right)=100d{L}_{mDSC}+w{L}_{mwCross}$$
(3)

where d and w are the weights of the different branches and x indexes the output of the deeply supervised branch. The final loss function, called the Improved Weighted Cross-Entropy (IWCE), is:

$${L}_{total}={L}_{out\_1}\left(0.2, 0.3\right)+{L}_{out\_2}\left(0.4, 0.6\right)+{L}_{out\_3}\left(0.8, 0.9\right)+{L}_{out\_4}\left(1.0, 1.0\right)$$
(4)
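Putting Eqs. (3) and (4) together with the hypothetical helpers above (out4 is assumed to be the main branch, with all branch outputs at full resolution):

```python
# (d, w) pairs per branch, as fixed in Eq. (4)
BRANCH_WEIGHTS = [(0.2, 0.3), (0.4, 0.6), (0.8, 0.9), (1.0, 1.0)]

def total_loss(branch_probs, target_onehot, class_weights):
    """Sketch of Eq. (4): sum of Eq. (3) over out1..out3 and the main output."""
    loss = 0.0
    for probs, (d, w) in zip(branch_probs, BRANCH_WEIGHTS):
        loss = loss + 100.0 * d * mdsc_loss(probs, target_onehot) \
                    + w * iwce_loss(probs, target_onehot, class_weights)
    return loss
```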

3 Experiments and Results

3.1 Datasets and Implementation Details

We evaluated our approach on the MICCAI2017 whole-heart CT dataset, which contains 20 publicly available CT volumes [1]. We randomly selected 10 samples for training and used the rest as the test set. These data were collected in real clinical environments, are highly variable, and include some images of poor quality, so they also serve to test the robustness of the proposed algorithm. Each sample is a stack of multiple 2D slices of size 512 \(\times\) 512. All training data were normalized to zero mean and unit variance. Adam was used to optimize the network parameters; the number of training iterations was 35,000 [5], the batch size was 2, and the initial learning rate was 0.001.
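For reference, the normalization step is a one-liner (a sketch; the paper does not state whether statistics are computed per volume, which we assume here):

```python
def normalize(volume):
    """Normalize a CT volume to zero mean and unit variance (per volume)."""
    return (volume - volume.mean()) / (volume.std() + 1e-8)
```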

3.2 Patch Selection

Due to the nature of cardiac medical images, the voxels of the 7 substructures account for only a small fraction of a whole heart CT volume. With a random crop size of 96, the background occupies more than half of the training data, which is not conducive to the network learning the foreground well. To tackle this problem, we adopt an effective cropping method that uses the ground truth to select patches with less background, as sketched below. For each randomly cropped patch, we calculate the proportion \(p\) of background voxels in the whole patch. If \(p\) is less than \(a\) (\(a\) = 0.5), the patch is deemed available and sent to the network for training; otherwise, the patch is re-cropped.
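A minimal sketch of this rejection-sampling crop (the retry limit and fallback behavior are our assumptions; the paper only specifies the threshold a = 0.5):

```python
import numpy as np

def sample_patch(image, label, patch_size=96, a=0.5, max_tries=50):
    """Randomly crop (patch_size)^3 patches until background fraction p < a.

    image, label: (D, H, W) arrays with label 0 as background; each
    dimension is assumed to be at least patch_size.
    """
    for _ in range(max_tries):
        z, y, x = (np.random.randint(0, s - patch_size + 1)
                   for s in image.shape)
        sl = (slice(z, z + patch_size), slice(y, y + patch_size),
              slice(x, x + patch_size))
        p = (label[sl] == 0).mean()   # background proportion of the crop
        if p < a:                     # "available patch"
            break
    return image[sl], label[sl]       # falls back to the last crop tried
```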

3.3 Experimental Results

We took the deeply-supervised U-net [1] as the baseline network; Multi-Depth Fusion [5] is an improvement of the baseline network. We used the Dice score as the performance measure. To verify the effectiveness of the proposed method, we conducted a series of ablation experiments.

Table 1 shows the experimental results for the cardiac substructures: pulmonary artery (PUA), ascending aorta (ASA), right ventricular blood chamber (RVBC), right atrial blood chamber (RABC), left ventricular blood chamber (LVBC), left atrial blood chamber (LABC), and myocardium of the left ventricle (MLV). Except for the PUA (Dice score about 82%–86%), all the methods achieved relatively accurate segmentation of the whole-heart substructures. The reason could be that the shape and appearance of the PUA vary considerably across subjects.

Compared with the baseline method, the proposed four methods with pseudo masks produce better segmentation results on almost all substructures of the whole heart, and all four have performance comparable to the advanced Multi-Depth Fusion method. Although the MLV (which comprises the epicardial and endocardial surfaces of the left ventricle) and the RABC show much larger variation in shape and heterogeneous intensity between the myocardium and the blood, all the proposed methods outperform the two compared methods on these two substructures. In particular, “MV + Model_1” achieves the best results on MLV, RVBC, ASA, and PUA.

Table 1. Segmentation accuracy (%) of the state-of-the-art segmentation methods and the proposed four methods. “IND + Model_N-1” uses Model_N-1 at the training stage and the IND model at the testing stage; “IND + Model_1” uses Model_1 at training and IND at testing; “MV + Model_N-1” uses Model_N-1 at training and MV at testing; “MV + Model_1” uses Model_1 at training and MV at testing. Bold font among the proposed four methods indicates outperforming both the Baseline and Multi-Depth Fusion methods; underlined values are the best results among the six methods.

3.4 Ablation Experiments

We verified the effectiveness of the proposed IWCE loss, patch selection, and pseudo mask modules in the proposed model. We used the best model, “MV + Model_1”, as the basic model (“Model”), and then ablated or replaced each proposed module in turn. All other experimental conditions are the same as in Table 1.

Table 2 shows the experimental results. The segmentation results on six substructures become worse when the model does not use the Patch Selection module, which shows that Patch Selection picks meaningful image patches that help the network learn the foreground better. The third row is the best model trained with the traditional cross-entropy loss instead of the proposed IWCE loss; its segmentation results on almost all substructures are slightly worse than those of the best model, which shows that the proposed loss function accounts for the class imbalance problem and performs well on hard-to-segment regions such as the PUA. The fourth row is the model trained without pseudo mask information; it achieves comparable performance on five substructures but degrades on ASA (by ~1%) and PUA (by ~3%). One reason is that the pseudo masks generated by simple DIR are of low quality and thus provide very limited guidance for substructures that are easy to segment; another is that the pseudo masks do provide useful information, such as location cues, for PUA segmentation.

Table 2. Ablation experiment on the effect of the modules in the proposed MV + Model_1 model. “PS” refers to the Patch Selection module; “IWCE” refers to the proposed mixed loss; “pseudo mask” refers to the proposed pseudo mask label module. “↓” or “↑” denote the decrease or increase of the Dice score (%) compared with the values of the “MV + Model_1” method.
Table 3. Generality of the proposed modules. “Baseline” is the deeply-supervised U-net [1]; “Baseline PS” combines the baseline with the Patch Selection module; “Baseline IWCE” is the baseline with its loss function replaced by the IWCE loss; “Baseline Pseudo mask” is the baseline with pseudo mask information integrated during the training stage. “↓” or “↑” denote the decrease or increase of the Dice score compared with the values of the “Baseline” method.

3.5 Generality Experiments

To analyze the generality of the proposed modules, including Patch Selection, the IWCE loss, and pseudo masks, we used the deeply-supervised U-net [1] as the baseline segmentation network and combined it with each proposed module in turn. Table 3 shows the experimental results. The baseline combined with each proposed module has a positive effect on most substructures except LABC. In particular, the baseline with pseudo masks shows a significant improvement on PUA, which further demonstrates that pseudo masks provide prior information that is useful for hard-to-segment structures.

4 Conclusion

In this paper, a weakly supervised CNN-based segmentation method is proposed for whole-heart segmentation. We first generate pseudo masks using traditional deformable image registration and then use them with the whole-heart data for training; the information provided by the pseudo masks helps distinguish foreground from background. To obtain better results, we improved the weighted cross-entropy loss function and mined hard training samples to address the problems of fuzzy boundaries and class imbalance. We validated the method on the MICCAI2017 whole-heart CT dataset, and the results demonstrate that it effectively improves the accuracy of heart segmentation.