1 Introduction

Medical imaging plays an important role in disease diagnosis, treatment planning, and clinical monitoring [4, 24]. One of the major challenges in medical image analysis is imbalanced training data, where pixels of the desired class (lesion or body organ) are often far fewer than non-lesion pixels. A model learned from class-imbalanced training data is biased towards the majority class. The predictions of such networks have low sensitivity, the measure of the ability to correctly predict non-healthy classes. In medical applications, the cost of misclassifying the minority class can exceed the cost of misclassifying the majority class. For example, the risk of not detecting a tumor can be much higher than that of referring a healthy subject to a doctor.

The class imbalance problem has recently been addressed in disease classification, tumor localization, and tumor segmentation. Two types of approaches have been proposed in the literature: data-level approaches and algorithm-level approaches.

At the data level, the objective is to balance the class distribution by re-sampling the data space [35, 52], for example by over-sampling the positive class with SMOTE (Synthetic Minority Over-sampling Technique) [10] or by under-sampling the negative class [23]. However, these approaches often remove important samples from the training set or add redundant ones.

Algorithm-level solutions address the class imbalance problem by modifying the learning algorithm to alleviate the bias towards the majority class. Examples are cascade training [8, 11] and training with a cost-sensitive function [47], such as the Dice coefficient loss [11, 13, 41] and the asymmetric similarity loss [18], which modify the training data distribution with regard to the misclassification cost.

In this paper, we mitigate imbalanced training samples at both levels. At the data level, we explore the advantage of training the network with inverse class frequency segmentation masks, named complementary segmentation masks, in addition to the ground truth segmentation masks (ordinary masks), which can then be used to improve the overall quality of the predicted segmentation. Assume Y is the true segmentation label annotated by an expert and \(\bar {Y}\) is the synthesized complementary label for the corresponding image. In the complementary masks \(\bar {Y}\), the values of majority and minority pixels are changed to skew the bias away from the majority pixels: the majority class receives the negative label and the remaining c − 1 classes receive a positive label. Our network is then trained with both the ordinary segmentation masks Y and the complementary segmentation masks \(\bar {Y}\) at the same time, but with multiple losses. The final segmentation masks are refined by considering both the ordinary and the complementary mask predictions.
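One reading of the complementary mask described above can be sketched as follows. This is an illustrative assumption, not the authors' implementation: the function names `majority_class` and `complementary_mask` are hypothetical, the majority class is taken to be the most frequent label in the map, and the negative and positive encodings are exposed as parameters because the text does not fix their values.

```python
from collections import Counter

def majority_class(label_map):
    """Return the most frequent label in a 2D label map (list of rows)."""
    counts = Counter(v for row in label_map for v in row)
    return counts.most_common(1)[0][0]

def complementary_mask(label_map, neg=0, pos=1):
    """Assign `neg` to majority-class pixels and `pos` to all other classes.

    Assumption: this follows the description 'negative label for the major
    class and a positive label for the c - 1 class'.
    """
    major = majority_class(label_map)
    return [[neg if v == major else pos for v in row] for row in label_map]

# Toy 3x3 label map: 0 = background (majority), 1-3 = foreground classes.
toy_labels = [[0, 0, 0],
              [0, 1, 2],
              [0, 0, 3]]
toy_complement = complementary_mask(toy_labels)
```

In this toy map the background covers six of nine pixels, so it is treated as the majority class and the three foreground pixels share one positive label.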

At the algorithm level, we study the advantage of mixing the adversarial loss with a categorical accuracy loss compared to traditional losses such as the \(\ell_{1}\) loss. Image segmentation is an important task in medical imaging that attempts to identify the exact boundaries of objects such as organs or abnormal regions (e.g. tumors). Automating medical image segmentation is challenging due to the high diversity in the appearance of tissues among different patients and, in many cases, the similarity between healthy and non-healthy tissues. Numerous automatic approaches have been developed to speed up medical image segmentation [32]. We can roughly divide the current automated algorithms into two categories: those based on generative models and those based on discriminative models.

Generative probabilistic approaches build the model from prior domain knowledge about the appearance and spatial distribution of the different tissue types. Traditionally, generative probabilistic models have been popular, with simple conditionally independent Gaussian models [14] or Bayesian learning [33] used for tissue appearance. In contrast, discriminative probabilistic models directly learn the relationship between the local features of images [3] and the segmentation labels without any domain knowledge. Traditional discriminative approaches such as SVMs [2, 9], random forests [27], and guided random walks [12] have been used in medical image segmentation. Deep neural networks (DNNs) are one of the most popular discriminative approaches, where the machine learns a hierarchical representation of features without any handcrafted features [26, 51]. In the field of medical image segmentation, Ronneberger et al. [38] presented a fully convolutional neural network, named UNet, for segmenting neuronal structures in electron microscopic stacks.

Recently, GANs [15] have gained a lot of momentum in the research community. Mirza et al. [28] extended the GAN framework to the conditional setting by making both the generator and the discriminator network class-conditional. Conditional GANs (cGANs) have the advantage of providing better representations for multi-modal data generation, since there is control over the modes of the data being generated. This makes cGANs suitable for the semantic image segmentation task, where we condition on an observed image and generate a corresponding output image.

Unlike previous works on cGANs [22, 29, 48], we investigate mapping a sequence of 2D medical images to a sequence of 2D semantic segmentations. In our method, 3D bio-medical images are represented as a sequence of 2D slices (i.e. as z-stacks). We use bidirectional LSTM units [16], an extension of classical LSTMs that can improve model performance on sequence processing by enhancing temporal consistency. We use time distribution between the convolutional layers and the bidirectional LSTM units at the bottleneck of the generator and the discriminator to obtain inter- and intra-slice feature representations.

Summarizing, the main contributions of this paper are:

  • We introduce RNN-GAN, a new adversarial framework that improves semantic segmentation accuracy. The proposed architecture shows promising results for the segmentation of small lesions as well as anatomical regions.

  • Our proposed method mitigates imbalanced training data with biased complementary masks in the task of semantic segmentation.

  • We study the effect of different losses and architectural choices that improve semantic segmentation.

The rest of the paper is organized as follows: in the next section, we review recent methods for handling imbalanced training data and semantic segmentation tasks. Section 3 explains the proposed approach for semantic segmentation, while the detailed experimental results are presented in Section 4. We conclude the paper and give an outlook on future research in Section 5.

2 Related work

This section briefly reviews previous studies, mostly from recent years, on learning from imbalanced datasets, generative adversarial networks, and medical image semantic segmentation.

Handling imbalanced training datasets.

Cascade architectures [8] and ensemble approaches [43] provided the best performance on highly imbalanced medical datasets such as LiTS-2017 for the segmentation of very small lesions. Some works have focused on balancing recall and precision with an asymmetric loss [18]; others used an accuracy loss [41] or weighted the imbalanced class according to its frequency in the dataset [8, 36]. Similar to some recent work [39, 41], we mitigate the negative impact of class imbalance by mixing the adversarial loss with a categorical accuracy loss and by training the deep model with complementary masks.

Learning with complementary labels.

Recently, complementary labels have been used in machine learning [21] by assuming that the transition probabilities are identical and modifying the traditional one-versus-all and pairwise-comparison losses for multi-class classification. Ishida et al. [21] theoretically prove that an unbiased estimator of the classification risk can be obtained from complementary labels. Yu et al. [50] show that learning from both complementary labels and ordinary labels is useful for multi-class classification. Inspired by these recent successes [21, 50], we train the proposed RNN-GAN with both complementary and ordinary labels for semantic segmentation, to skew the bias away from the majority pixels.

Generative Adversarial Network.

Previous works [22, 54] show the success of conditional GANs as a general setting for image-to-image translation. Some recent works applied GANs unconditionally for image-to-image translation by forcing the generator to predict the desired output under \(\ell_{1}\) [48] or \(\ell_{2}\) [31, 53] regression. Here, we study mixing the adversarial loss in the conditional setting with a traditional loss and an accuracy loss, motivated by the need to attenuate the effect of imbalanced training data. Our method also differs from prior works [22, 25, 29, 55] in the architecture of the generator and the discriminator: we use bidirectional LSTM units on top of the generator and discriminator architectures to capture temporal consistency between 2D slices.

Medical image semantic segmentation.

UNet has achieved promising results in medical image segmentation [38] because it concatenates low-level features with high-level features, which provides a better learned representation. Later, UNet combined with residual networks [6] or used in a cascade of 2D and 3D networks [20] was applied to cardiac image segmentation and heterogeneous liver segmentation [8]. The generator network in RNN-GAN is a modified UNet in which high-resolution features are concatenated with up-sampled global low-resolution features to help the network learn both local and global information.

3 Method

In this section we present the recurrent generative adversarial network for medical image semantic segmentation. To account for the misclassification cost and mitigate imbalanced pixel labels, we mix the adversarial loss with a categorical accuracy loss (Section 3.1). Moreover, we explain our intuition for skewing the bias away from the majority pixels with the proposed complementary labels (Section 3.2).

3.1 Recurrent generative adversarial network

In a conventional generative adversarial network, the generative model G tries to learn a mapping from a random noise vector z to an output image y; \(G : z \rightarrow y\). Meanwhile, a discriminative model D estimates the probability that a sample comes from the training data \(x_{real}\) rather than from the generator \(x_{fake}\). The GAN objective function is a two-player mini-max game with value function V(D, G):

$$ \underset{G}{\min}\, \underset{D}{\max}\, V(D, G) = E_{y} [\log D(y)] + E_{z} [\log (1-D(G(z)))] $$
(1)

In a conditional GAN, the generative model learns the mapping from the observed image x and a random vector z to the output image y; \(G : \{x, z\} \rightarrow y\). The discriminator D, on the other hand, attempts to discriminate between the generator's output images and the training set images. According to (2), in the cGAN training procedure both G and D are conditioned on the observed image x.

$$ \underset{G}{\min}\, \underset{D}{\max}\, V(D, G) = E_{x,y} [\log D(x,y)] + E_{x,z} [\log (1-D(x, G(x,z)))] $$
(2)

More specifically, in our proposed RNN-GAN network, the generative model learns the mapping from a given sequence of 2D medical images \(x_{i}\) to the corresponding semantic segmentation labels \(y_{i_{seg}}\); \(G : \{x_{i},z\} \rightarrow \{y_{i_{seg}}\}\) (where i is the 2D slice index, between 1 and 20, from a total of 20 slices acquired from ACDC-2017). The training procedure for the semantic segmentation task is similar to the two-player mini-max game (3). While the generator predicts the segmentation at the pixel level, the discriminator takes the ground truth and the generator's output and determines whether the predicted label is real or fake.

$$ \mathcal{L}_{adv} \leftarrow \underset{G}{\min}\, \underset{D}{\max}\, V(D, G) = E_{x,y_{seg}} [\log D(x,y_{seg})] + E_{x,z} [\log (1-D(x, G(x,z)))] $$
(3)
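As a small numerical illustration of the value function in (3), the sketch below evaluates its empirical form on hand-picked discriminator outputs. The function name `adversarial_value` and the toy probabilities are assumptions made for this example; the expectations are replaced by sample means.

```python
import math

def adversarial_value(d_real, d_fake):
    """Empirical V(D, G): mean log D(x, y_seg) plus mean log(1 - D(x, G(x, z))).

    d_real: discriminator probabilities on ground-truth pairs.
    d_fake: discriminator probabilities on generated pairs.
    """
    real_term = sum(math.log(p) for p in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return real_term + fake_term

# A discriminator that separates real from generated masks well ...
confident = adversarial_value([0.9, 0.8], [0.1, 0.2])
# ... and one that the generator has fooled completely (all outputs 0.5).
fooled = adversarial_value([0.5, 0.5], [0.5, 0.5])
```

The discriminator tries to push this value up, while the generator tries to push it down, towards the fully fooled case where every output is 0.5.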

We mixed the adversarial loss with the \(\ell_{1}\) distance (4) to minimize the absolute difference between the predicted segmentation and the ground truth. The \(\ell_{1}\) objective thus takes into account CNN features and the differences between the predicted segmentation and the ground truth, resulting in less noise and smoother boundaries.

$$ \mathcal{L}_{L1}(G) = E_{x,z} \lVert y_{seg} - G(x,z) \rVert_{1} $$
(4)
$$ \mathcal{L}_{\ell_{acc}}(G) = \frac{1}{c} \sum\limits_{j = 1}^{c} \sum\limits_{i = 1} \frac{\lvert y_{ij_{seg}} \cap G(x_{ij},z) \rvert} {\lvert y_{ij_{seg}} \cup G(x_{ij},z) \rvert} $$
(5)

where j indexes the c semantic classes and i indexes the 2D slices of each patient.
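The two terms in (4) and (5) can be illustrated on toy data with the sketch below. It is not the authors' code: the function names are hypothetical, masks are flattened binary lists, the generator output is assumed to be already binarized for the overlap term, and an empty union is scored as 1.0 by convention. Note that (5) as written is an overlap score; how it enters the loss with the right sign is not spelled out in the text, so the sketch only computes the score.

```python
def l1_distance(y_seg, g_out):
    """Sum of absolute differences between ground truth and prediction, as in (4)."""
    return sum(abs(a - b) for a, b in zip(y_seg, g_out))

def iou(y_mask, g_mask):
    """Intersection over union of two flattened binary masks."""
    inter = sum(1 for a, b in zip(y_mask, g_mask) if a and b)
    union = sum(1 for a, b in zip(y_mask, g_mask) if a or b)
    return inter / union if union else 1.0  # assumed convention for empty masks

def categorical_overlap(y_seg, g_seg):
    """Overlap term of (5): sum over classes j and slices i, divided by c.

    y_seg[j][i] and g_seg[j][i] hold the flattened mask of class j on slice i.
    """
    c = len(y_seg)
    total = sum(iou(y, g)
                for y_class, g_class in zip(y_seg, g_seg)
                for y, g in zip(y_class, g_class))
    return total / c

# Two classes, one slice of four pixels each.
truth = [[[1, 1, 0, 0]], [[0, 0, 1, 1]]]
pred = [[[1, 0, 0, 0]], [[0, 0, 1, 1]]]
overlap = categorical_overlap(truth, pred)
```

Because each class contributes its own ratio, a small structure that is half missed lowers the score as much as a large one would, which is the class-balancing effect the text attributes to this term.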

Moreover, we mixed in the categorical accuracy loss \(\ell_{acc}\) (5) in order to mitigate imbalanced training data by assigning a higher cost to the less represented set of pixels, boosting its importance during the learning process. The categorical accuracy loss checks whether the maximal true value is equal to the maximal predicted value for each category of the segmentation.

Then, the final adversarial loss for semantic segmentation task by RNN-GAN is calculated through (6).

$$ \mathcal{L}_{RNN-GAN} (D, G) = \mathcal{L}_{adv} (D, G) + \mathcal{L}_{L1}(G) + \mathcal{L}_{\ell_{acc}}(G) $$
(6)

In this work, similar to Isola et al. [22], we feed Gaussian noise z to the generator alongside the input data x. As discussed by Isola et al. [22], when training a conditional generative model of the conditional distribution P(y|x), it is preferable for the trained model to produce more than one sample y for each input x. When the generator G takes a random vector z in addition to the input image x, G(x, z) can generate as many different values for each x as there are values of z. For medical image segmentation in particular, the diversity of image acquisition methods (e.g., MRI, fMRI, CT, ultrasound), their settings (e.g., echo time, repetition time), geometry (2D vs. 3D), and differences in hardware (e.g., field strength, gradient performance) can result in variations in the appearance of body organs and tumour shapes [19]. Learning the random vector z together with the input image x therefore makes the network more robust against noise and improves its output samples. This has been confirmed by our experimental results on datasets with a large range of variation.

3.2 Complementary label

In order to mitigate the impact of imbalanced pixel labels in medical images, the proposed RNN-GAN, described in Fig. 1, is trained with a complementary mask (Fig. 2, third column) in addition to the ordinary masks (Fig. 2, columns 4–6). Similar to Yu et al. [50], we assume that the transition probabilities are identical, so that the adversarial loss (i.e. categorical cross-entropy loss) provides an unbiased estimator for minimizing the risk. Since we make the same assumption, we skip the theoretical proof and instead show experimentally that complementary labels, used in addition to ordinary labels, provide more accurate results for semantic segmentation.

Fig. 1

The architecture of RNN-GAN consists of two deep networks: a generative network G and a discriminative network D. G takes a sequence of 2D images as a condition and generates a sequence of 2D semantic segmentation outputs; D determines whether those outputs are real or fake. RNN-GAN captures inter- and intra-slice feature representations with bidirectional LSTM units at the bottleneck of both the G and D networks. Here, G is a modified UNet architecture and D is a fully convolutional encoder

Fig. 2

Chest MR images from ACDC-2017 after pre-processing. The first column shows the semantic segmentation masks corresponding to the MR images in the second column. Columns 3-6 present the complementary label mask, right ventricle, myocardium, and left ventricle; we map the 2D images in the second column to the four segmentation masks presented in columns 3-6

3.3 Network architecture

The proposed architecture is shown in Fig. 1, with the generator network G on the left and the discriminator network D on the right. We place bidirectional LSTM units at the bottleneck of both G and D to capture the non-linear relationship between the previous, current, and next 2D slices, which is key to processing sequential data.

3.3.1 Recurrent generator

The recurrent generator takes a random vector z plus a sequence of 2D medical images. Similar to the UNet architecture, we add skip connections between each layer r and the corresponding layer t − 1 − r, where t is the total number of layers. Each skip connection simply concatenates all channels at layer r with those at layer t − 1 − r. Feature maps from the convolutional part of the down-sampling path are fed into the up-convolutional part of the up-sampling path. The generator is trained on a sequence of input images from the same patient and the same acquisition plane. We use convolutional layers with kernel size 5 × 5 and stride 2 for down-sampling, and perform up-sampling with an image resize layer with a factor of 2 followed by a convolutional layer with kernel size 3 × 3 and stride 1.
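The skip-connection pairing and the effect of the stride-2 convolutions on spatial size can be worked through with the short sketch below. The helper names are hypothetical, the layer count in the example is arbitrary, and the padding mode is an assumption, since the text does not state which padding the convolutions use.

```python
def skip_pairs(t):
    """Pairs (r, t - 1 - r) connecting each encoder layer r to its decoder partner.

    For odd t the middle layer is the bottleneck and has no partner.
    """
    return [(r, t - 1 - r) for r in range(t // 2)]

def down_size(n, kernel=5, stride=2, padding="same"):
    """Output length along one axis after a strided convolution.

    padding="same" (assumed here) gives ceil(n / stride);
    padding="valid" gives floor((n - kernel) / stride) + 1.
    """
    if padding == "same":
        return -(-n // stride)  # ceiling division
    return (n - kernel) // stride + 1

pairs = skip_pairs(8)          # a hypothetical 8-layer generator
sizes = [256]
for _ in range(4):             # four down-sampling steps on a 256-pixel axis
    sizes.append(down_size(sizes[-1]))
```

With "same" padding each stride-2 step halves the resolution, so a resize by a factor of 2 in the decoder restores exactly the size needed to concatenate with the partner layer.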

3.3.2 Recurrent discriminator

The discriminator network is a classifier with a structure similar to the encoder of the generator network. Hierarchical features are extracted by the fully convolutional encoder of the discriminator and used to classify between the generator's segmentation output and the ground truth. More specifically, the discriminator is trained to minimize the average negative cross-entropy between the predicted and the true labels.

The two models are then trained through back-propagation, corresponding to a two-player mini-max game (see (3)). We use categorical cross-entropy [30] as the adversarial loss. In this work, the recurrent architecture selected for both the discriminator and the generator is a bidirectional LSTM [16].

4 Experiments

We validated the performance of RNN-GAN on three recent public medical imaging challenges: real patient data from the MICCAI 2017 automated cardiac MRI segmentation challenge (ACDC-2017) [5], the CT liver tumour segmentation challenge (LiTS-2017), and the 2016 whole-heart and great vessel segmentation challenge (HVSMR).

4.1 Datasets and pre-processing

Our experiments are based on three independent datasets, two cardiac MR datasets and one abdominal CT dataset, all segmented manually by radiologists at the pixel level.

ACDC.

The ACDC dataset (footnote 1) comprises 150 patients with 3D cine-MR images acquired in clinical routine. The training database is composed of 100 patients, for all of which manual references were provided by a clinical expert. The testing database consists of 50 patients without manual references. Figure 3 shows cardiac MR images from the ACDC dataset.

HVSMR.

Thirty training cine MRI scans from 10 patients were provided by the organizers of the HVSMR challenge (footnote 2). Three images were provided for each patient: a complete axial cine MRI, the same image cropped around the heart and the thoracic aorta, and a cropped short-axis reconstruction.

LiTS.

In the third experiment, we used the LiTS-2017 benchmark (footnote 3), which comprises 130 CT training and 70 test subjects. The examined patients were suffering from different liver cancers. The challenging part is the segmentation of very small lesion targets in a highly unbalanced dataset. Here, pre-processing is carried out in a slice-wise fashion. We windowed the Hounsfield unit (HU) values to the range [100, 400] to exclude irrelevant organs and objects, as shown in Fig. 4. Furthermore, we applied histogram equalization to increase the contrast for better differentiation of abnormal liver tissue.
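The slice-wise CT pre-processing described above can be sketched as two small steps: clipping HU values to a window and then spreading the remaining intensities with histogram equalization. The function names are hypothetical, the window bounds are passed in as parameters (the example uses the range quoted in the text), and the equalization shown is a plain global variant; the paper does not say which variant was used.

```python
def window_hu(slice_hu, lo, hi):
    """Clip every HU value of a 2D slice (list of rows) to the window [lo, hi]."""
    return [[min(max(v, lo), hi) for v in row] for row in slice_hu]

def equalize(slice_vals, levels=256):
    """Global histogram equalization to integer grey levels 0 .. levels - 1."""
    flat = [v for row in slice_vals for v in row]
    lo, hi = min(flat), max(flat)
    if hi == lo:  # constant slice: nothing to spread
        return [[0 for _ in row] for row in slice_vals]
    scale = (levels - 1) / (hi - lo)
    bins = [[min(levels - 1, int((v - lo) * scale)) for v in row]
            for row in slice_vals]
    hist = [0] * levels
    for row in bins:
        for b in row:
            hist[b] += 1
    cdf, running = [], 0
    for count in hist:
        running += count
        cdf.append(running)
    cdf_min = next(c for c in cdf if c > 0)
    n = len(flat)
    return [[round((cdf[b] - cdf_min) / (n - cdf_min) * (levels - 1))
             for b in row] for row in bins]

toy_slice = [[-500, 150, 900], [120, 300, 400]]
windowed = window_hu(toy_slice, 100, 400)
enhanced = equalize(windowed)
```

Windowing first matters: without it, a few extreme values such as air or bone would dominate the histogram and leave little contrast for soft tissue.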

Pre-processing of MR images.

The gray-scale distribution of MR images depends on the acquisition protocol and the hardware. This makes learning difficult, since we expect the same data distribution from one subject to another. Pre-processing is therefore an important step toward bringing all subjects under similar distributions. We applied bias field correction to the MR images from the HVSMR and ACDC datasets using N4ITK [42] to correct intensity non-uniformity. Lastly, we applied histogram matching normalization to all 2D slices from the sagittal, coronal, and axial planes.

Fig. 3

Cardiac MR images from ACDC 2017 after pre-processing. The left side shows an end-systolic sample and the right side an end-diastolic sample. We extracted the complementary masks, shown in the second and seventh columns, as the inverse of the ground truth annotated by a medical expert. The other binary masks, extracted from the ground truth and shown in columns 3-5 and 8-10, are the right ventricle, myocardium, and left ventricle, respectively; they are used by the discriminator. The first and sixth columns show example inputs of the generator

Fig. 4

Abdominal CT images from LiTS-2017. The first and second columns show slices before and after pre-processing. Our generator takes the pre-processed slices (second column) and learns to map them to the third and fourth columns using feedback from the discriminator

4.2 Implementation and configuration

The RNN-GAN architecture is implemented with the Keras [7] and TensorFlow [1] libraries. The code is available on the author's GitHub (footnote 4). All training was conducted on a workstation equipped with an NVIDIA TITAN X GPU.

The model was trained for up to 120 epochs with batch size 10, 450 iterations, and initial learning rate 0.001 on the ACDC dataset. Similarly, on HVSMR we used an initial learning rate of 0.001, batch size 10, 2750 iterations, and 100 epochs, using all 2D slices from the coronal, sagittal, and axial planes at size 256 × 256. All layers of the generator and discriminator use the tanh activation function except the output layer, which uses softmax. We use categorical cross-entropy as the adversarial loss, mixed with categorical accuracy and \(\ell_{1}\). The RMSprop optimizer was used for both the generator and the discriminator. RMSprop divides the learning rate by the square root of an exponentially decaying average of squared gradients.
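For readers unfamiliar with RMSprop, a single scalar update step can be sketched as follows. The function name is hypothetical; the learning rate 0.001 is the value reported above, while the decay factor `rho` and the stabilizing constant `eps` are commonly used defaults assumed here, since the text does not report them.

```python
def rmsprop_step(w, grad, avg_sq, lr=0.001, rho=0.9, eps=1e-7):
    """One RMSprop update for a single scalar weight.

    avg_sq is the running, exponentially decaying average of squared gradients.
    rho and eps are assumed values, not taken from the paper.
    """
    avg_sq = rho * avg_sq + (1.0 - rho) * grad * grad
    w = w - lr * grad / (avg_sq ** 0.5 + eps)
    return w, avg_sq

# Start from weight 1.0 with an empty gradient history and a gradient of 2.0.
w1, s1 = rmsprop_step(1.0, 2.0, 0.0)
# A second step with the same gradient takes a smaller step, because the
# running average of squared gradients has grown.
w2, s2 = rmsprop_step(w1, 2.0, s1)
```

The division by the root of the running average means that weights with consistently large gradients receive smaller effective steps, which keeps generator and discriminator updates on a comparable scale.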

Training took eight hours on ACDC for a total of 120 epochs on parallel NVIDIA TITAN X GPUs; with the same configuration, it took 12 hours on the HVSMR dataset. With this implementation, we can produce a cardiac segmentation mask in 500-700 ms per patient for a single cardiac phase from the ACDC dataset on the axial plane.

The proposed approach is trained on 75% of the training data released by the HVSMR-2016 and LiTS-2017 benchmarks. We used all provided images from the sagittal, coronal, and axial axes for training, validation, and testing. For the ACDC dataset, we trained our system on 75 exams from the axial, coronal, and sagittal planes and validated it on the remaining 25 exams.
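A patient-level 75/25 split of the kind described above can be sketched as below. The function name, the fixed random seed, and the use of shuffling are assumptions for illustration; the text does not say how the exams were assigned to the two sets.

```python
import random

def split_patients(patient_ids, train_fraction=0.75, seed=0):
    """Shuffle patient ids reproducibly and split them into train and validation.

    Splitting by patient (not by slice) keeps all slices of one exam in one set.
    """
    ids = sorted(patient_ids)
    random.Random(seed).shuffle(ids)
    cut = int(round(len(ids) * train_fraction))
    return ids[:cut], ids[cut:]

# 100 ACDC training exams, numbered 1 to 100 for this example.
train_ids, val_ids = split_patients(range(1, 101))
```

Splitting at the patient level avoids leaking neighbouring slices of the same exam into both sets, which would inflate validation scores.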

In both the training and testing phases, each mini-batch consists of 2D images from the same patient, the same acquisition plane, and the same cardiac phase. We first normalize the inputs, where the mean and variance are computed for a specific patient over all available images from the same acquisition plane and the same cardiac phase (ED, ES). This normalization helps to restrict the effect of outliers. With batch normalization, we then normalize the inputs to each layer (the activations coming from the previous layer) using the mean and variance of the activations over the entire mini-batch.
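The per-patient input normalization can be sketched as a zero-mean, unit-variance rescaling over all slices of one patient, plane, and cardiac phase. The function name and the small constant `eps` (added to avoid division by zero on constant inputs) are assumptions for this example.

```python
import math

def normalize_patient(slices, eps=1e-8):
    """Standardize a stack of 2D slices with statistics pooled over the whole stack.

    slices: list of 2D slices (each a list of rows) from one patient,
    one acquisition plane, and one cardiac phase.
    """
    flat = [v for s in slices for row in s for v in row]
    mean = sum(flat) / len(flat)
    var = sum((v - mean) ** 2 for v in flat) / len(flat)
    std = math.sqrt(var)
    return [[[(v - mean) / (std + eps) for v in row] for row in s]
            for s in slices]

# Two tiny 1x2 slices standing in for one patient's stack.
stack = [[[0.0, 2.0]], [[4.0, 6.0]]]
normalized = normalize_patient(stack)
```

Pooling the statistics over the stack, instead of normalizing each slice separately, preserves intensity differences between slices of the same patient.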

Note that Wolterink’s method (using an ensemble of six trained CNNs) took 4 seconds to compute the prediction masks per patient on a system equipped with an NVIDIA TITAN X GPU in the ACDC benchmark, as reported in [46], while RNN-GAN took 500 ms on average per patient on a system equipped with a single NVIDIA TITAN X GPU.

4.3 Evaluation criteria

The evaluation and comparison were performed using the quality metrics introduced by each challenge organizer. Semantic segmentation masks were evaluated in a five-fold cross-validation. For each patient, corresponding images for the end-diastolic (ED) instant and the end-systolic (ES) instant were provided. As described by ACDC-2017, the cardiac regions are labeled 1 for the right ventricle, and 2 and 3 for the myocardium and left ventricle, respectively. In order to optimize the computation of the different error measures, the Python scripts for the Dice coefficient (7) and Hausdorff distance (8) were obtained from ACDC, which provides them to all participants.

For HVSMR-2016, the average distance boundary (ADB) was considered in addition to Dice and Hausdorff for evaluating the blood pool and myocardium, and similarly for validating liver lesion segmentation on LiTS-2017. Besides these metrics, we calculated sensitivity and specificity, since they are good indicators of the misclassification rate (false positives and false negatives) (see Tables 5 and 6).

$$ Dice(P,T)\leftarrow \frac{\lvert P \wedge T \rvert} { (\lvert P \rvert + \lvert T \rvert ) / 2} $$
(7)
$$ Haus(P, T) \leftarrow \max \left\{ \sup_{p \in P} \inf_{t \in T} d(p,t) , \sup_{t \in T} \inf_{p \in P} d(t,p) \right\} $$
(8)

where P and T denote the output predicted by our proposed method and the ground truth annotated by a medical expert, respectively, and d is the distance between two points.
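The metrics in (7) and (8), together with sensitivity and specificity, can be computed on small point sets as in the sketch below. This is not the challenge's evaluation script: the function names are hypothetical, masks are represented as sets of pixel coordinates, d is taken to be the Euclidean distance, and the Hausdorff helper assumes both sets are non-empty.

```python
import math

def dice(pred, truth):
    """Dice overlap as in (7): |P and T| divided by the mean of |P| and |T|."""
    denom = (len(pred) + len(truth)) / 2
    return len(pred & truth) / denom if denom else 1.0

def hausdorff(pred, truth):
    """Symmetric Hausdorff distance as in (8); both sets must be non-empty."""
    def directed(a, b):
        return max(min(math.dist(p, q) for q in b) for p in a)
    return max(directed(pred, truth), directed(truth, pred))

def sensitivity_specificity(pred, truth, all_pixels):
    """True positive rate and true negative rate over a fixed pixel grid."""
    tp = len(pred & truth)
    fn = len(truth - pred)
    fp = len(pred - truth)
    tn = len(all_pixels - pred - truth)
    return tp / (tp + fn), tn / (tn + fp)

# A 1x4 strip of pixels: the prediction is shifted by one pixel.
grid = {(0, k) for k in range(4)}
P = {(0, 0), (0, 1)}
T = {(0, 1), (0, 2)}
scores = (dice(P, T), hausdorff(P, T), sensitivity_specificity(P, T, grid))
```

In this example the one-pixel shift halves the Dice score while the Hausdorff distance is a single pixel, which shows why the two metrics are reported together: one measures overlap, the other the worst boundary error.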

4.4 Comparison with related methods and discussion

As shown in Table 1, our method outperforms other top-ranked approaches from the ACDC benchmark. In terms of the Dice coefficient (Table 1), our method performs slightly better than Wolterink et al. [46] on the ACDC challenge in left ventricle and myocardium segmentation. However, Rohe et al. [37] achieved outstanding performance for right ventricle segmentation, since they applied multi-atlas registration and segmentation at the same time. Poudel et al. [34] achieved competitive results on left ventricle segmentation, with an overall Dice of 0.93, based on recurrent fully convolutional networks.

Table 1 Comparison of the achieved accuracy in terms of the Dice metric on the ACDC benchmark with related approaches and top-ranked methods; the best performance for each cardiac phase and region of interest is shown in bold

Based on Tables 1 and 2, the right ventricle is a difficult organ for all participants, mainly because of its complicated shape, the partial volume effect close to the free wall, and intensity inhomogeneity. In terms of Hausdorff distance, our result is on average 1.2 ± 0.2 mm lower than those of other participants. This is a strong indicator of boundary precision and suggests that the RNN-GAN architecture equipped with bidirectional LSTM units is a suitable solution for capturing the temporal consistency between slices. Compared to cGAN (Tables 1 and 2), RNN-GAN provides better results when the network is trained with complementary segmentation masks, including in sensitivity and precision.

Table 2 Comparison of the achieved accuracy in terms of Hausdorff distance on the ACDC benchmark with top-ranked participant approaches and related work; the best performance for each cardiac phase and region of interest is shown in bold

Compared to the expert annotations on the original ED phase instants, individual Dice scores of 0.968 for the left ventricle (LV), 0.933 for the myocardium (MYO), and 0.940 for the right ventricle (RV) (see Table 1) were achieved at test time on 25 patients. Qualitatively, the RNN-GAN segmentation results are promising (see Figs. 5 and 7), with robust and smooth boundaries for all substructures.

Fig. 5

Cardiac segmentation results at test time by RNN-GAN from the ACDC 2017 benchmark on Patient084. The red, green, and blue contours show the right ventricle, myocardium, and left ventricle regions, respectively. The top two rows show the diastolic phase for different slices from t = 0 to t = 9. The third and fourth rows show the systolic cardiac phase from t = 0 to t = 9

We report the effect of different losses for RNN-GAN in Table 3. As expected, the best performance was obtained when the network was trained with a mix of categorical cross-entropy (as the adversarial loss), \(\ell_{1}\), and categorical accuracy. Using an \(\ell_{1}\) loss encourages the output to respect the input, since the \(\ell_{1}\) loss penalizes the distance between the ground truth outputs, which match the input, and the synthesized outputs. Using categorical accuracy forces the network to assign a higher cost to the less represented set of objects, boosting its importance during the learning process.

Table 3 Dice-scores for different losses, evaluated on ACDC benchmark for segmentation of cardiac MR images

As depicted in Fig. 5 and Table 1, the right ventricle is a complex organ to segment, and most failures happened in the systolic phase. Based on Fig. 5 and the accuracy achieved at test time on the ACDC benchmark, the average results in the diastolic phase (first and second rows) are better than those in the systolic phase (third and fourth rows). We evaluated the results quantitatively using the Hausdorff distance and Dice, as shown in Fig. 6. As expected, the Hausdorff distance on the left ventricle (median of 6.82/8.02 for the ED/ES frames) tends to be lower than for the two other regions of interest, with the myocardium at 8.08/8.69 and the right ventricle at 8.95/12.07 for ED/ES.

Fig. 6

The ACDC 2017 challenge results using the RNN-GAN and cGAN architectures. The left sub-figure shows the Dice coefficient in the two cardiac phases and the right sub-figure (b) the Hausdorff distance. In the left sub-figure, the y-axis shows the Dice metric and the x-axis the segmentation performance of cGAN and RNN-GAN in the ED and ES cardiac phases. In sub-figure (b), the y-axis shows the Hausdorff distance in mm and the x-axis the segmentation performance of cGAN and RNN-GAN in the ED and ES cardiac phases. In each sub-figure, the mean is shown in red

Based on Tables 4 and 5 and Fig. 7, the results agree well with the ground truth for the blood pool, with an average Dice index of around 0.94. The main source of error is the inability of the method to completely segment all the great vessels, where the average Dice score is 0.86. Comparing the first and second rows of Tables 4 and 5, the accuracy is better when the conditional GAN is equipped with bidirectional LSTM units. This architecture provides a better representation of features by capturing spatio-temporal information through forward and backward dependencies. In this context, Poudel et al. [34] designed unidirectional LSTMs on top of the UNet architecture to capture inter- and intra-slice features and achieved competitive results for left ventricle segmentation.

Fig. 7

Cardiac segmentation results at test time by RNN-GAN from the HVSMR 2016 benchmark. The top row shows the output predicted by RNN-GAN and the second row the corresponding ground truth annotated by a medical expert. The cyan contour marks the blood pool and the dark blue contour the myocardium region

Table 4 Comparison of segmentation results on the HVSMR dataset in terms of the Dice metric and average distance boundaries with other participants; the best performance for each metric is shown in bold
Table 5 Comparison of segmentation errors on the HVSMR dataset in terms of Hausdorff distance, sensitivity, and specificity with other participant approaches; the best performance for each metric is shown in bold

The qualitative results of liver tumour segmentation are presented in Fig. 8. Based on Fig. 8 and Table 6, RNN-GAN is able to detect the complex and heterogeneous structure of all lesions. The RNN-GAN architecture trained with complementary masks yielded better results and a better trade-off between Dice and sensitivity. The Dice score is a good measure under class imbalance, since it reflects the true positive rate while accounting for false negative and false positive pixels. The effect of class balancing can be seen by comparing the first and second rows of Table 6. As expected, RNN-GAN trained with complementary segmentation labels and binary segmentation masks produced more accurate results, with average improvements of 3% and 6% in Dice and sensitivity, respectively.

Fig. 8

LiTS-2017 test results for liver tumour segmentation using RNN-GAN. The predicted liver tumour region is overlaid on the CT images in blue. Compared to the green contour from the ground truth annotated by a medical expert, we achieved a Dice score of 0.83 and a sensitivity of 0.74

Table 6 Quantitative segmentation results of the liver lesions segmentation on the LiTS-2017 dataset

We compared the results predicted by RNN-GAN at test time with other top-ranked and related approaches on LiTS-2017 in terms of volume overlap error (VOE), relative volume difference (RVD), average symmetric surface distance (ASD), and maximum surface distance or Hausdorff distance (HD), as introduced by the challenge organizer. As shown in Table 6, cascade UNet [8] and ensemble network [6, 17] architectures achieved better performance than a single fully convolutional neural network (FCN) [44]. In contrast to prior work such as [6, 8, 17], our proposed method can be generalized to segment very small lesions as well as multiple organs in medical data from different modalities.
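Two of the volume-based measures named above can be sketched on toy voxel sets as follows. The definitions used here are the commonly cited ones (VOE as one minus the Jaccard index, RVD as the signed size difference relative to the reference) and are an assumption, since the text refers to the challenge organizer for the exact formulas; the function names are hypothetical and values are returned as fractions, not percentages.

```python
def voe(pred, truth):
    """Volume overlap error: 1 - |P and T| / |P or T| (0 means perfect overlap)."""
    pred, truth = set(pred), set(truth)
    union = pred | truth
    return 1.0 - len(pred & truth) / len(union) if union else 0.0

def rvd(pred, truth):
    """Relative volume difference: (|P| - |T|) / |T|; requires a non-empty T.

    Positive values indicate over-segmentation, negative under-segmentation.
    """
    return (len(set(pred)) - len(set(truth))) / len(set(truth))

# Voxels identified by integer ids for brevity.
shifted = voe({1, 2, 3}, {2, 3, 4})      # same size, partly overlapping
oversized = rvd({1, 2, 3, 4}, {1, 2})    # prediction twice the reference size
```

The first case shows why RVD alone is not sufficient: two equally sized but shifted regions have an RVD of zero even though their overlap error is substantial.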

5 Conclusion

In this paper, we introduced a new deep architecture to mitigate the issue of imbalanced pixel labels in medical image segmentation. To this end, we developed a recurrent generative adversarial architecture named RNN-GAN, which consists of two networks: a recurrent generator and a recurrent discriminator. To mitigate imbalanced pixel labels, we mixed the adversarial loss with a categorical accuracy loss and trained RNN-GAN with ordinary and complementary masks. Moreover, we analyzed the effects of different losses and architectural choices that help to improve semantic segmentation results. Our proposed method shows outstanding results for the segmentation of anatomical regions (i.e. cardiac image semantic segmentation). Based on the segmentation results on two cardiac benchmarks, RNN-GAN is robust against slice misalignment and different CMRI protocols. Experimental results reveal that our method produces an average Dice score of 0.95. Given its high accuracy and fast processing speed, we think it has the potential to be used for routine clinical tasks. We also validated RNN-GAN on tumor segmentation in abdominal CT images and achieved competitive results on the LiTS benchmark.

The impact of learning from complementary labels under different imbalance ratios may also be useful to study in the context of semantic segmentation; we will investigate this issue in the future. In terms of applications, we plan to investigate the potential of the RNN-GAN network for learning multiple clinical tasks, such as disease classification and semantic segmentation.