Introduction

Motivation and challenges

Cancer is a major public health problem [1], mainly because malignant tumors are often discovered late. Therefore, early detection, early diagnosis, and early treatment of tumors are critical to improving the survival rate and quality of life of patients with all kinds of cancers. With the development of medical imaging equipment, medical imaging has played a critical part in disease diagnosis in modern medicine, with modalities such as magnetic resonance imaging (MRI), computed tomography (CT), positron emission tomography (PET), ultrasound, X-ray, angiography, and their variants (e.g., color ultrasound) and pairwise combinations (e.g., PET-CT). Among these imaging techniques, computed tomography remains a widely used routine technique for detecting and diagnosing lesions and tumors in different parts of the body in clinical practice. Due to its low application cost, CT has been adopted in almost every hospital and medical research institution, including some remote rural hospitals. In particular, CT imaging analysis has become an important diagnostic tool for radiologists and physicians making decisions on lung, kidney, liver, and other lesions and tumors.

While CT provides high-resolution images and improves the performance of computer-aided diagnosis for tumors, the dim, low-contrast appearance of CT slices and the visual similarity between tumors and their surroundings still pose huge challenges for accurate tumor detection and segmentation. In addition, it is time-consuming for radiologists to manually detect all diseased regions across all imaging slices of every patient; in some large hospitals, patients have to wait more than 6 h for their CT results because of the large number of patients who need a diagnosis every day. This situation is common in most countries because of the shortage of skilled radiologists relative to the vast number of CT images. Considering all the above problems, an automatic and fast computed tomography tumor localization and segmentation technique, which can also be generalized to different lesions and tumors, is urgently needed to assist intelligent medicine and healthcare. Meanwhile, we should also note that computer-assisted diagnosis is only a tool to assist radiologists, and its result does not necessarily accurately reflect significant changes in the progression of a disease [55]. Furthermore, such a technique also needs to be robust enough to cope with all kinds of challenges in processing CT images, including dim and low contrast resolution, complex backgrounds, severe class imbalance between tumor and non-tumor pixels, small target detection, imaging noise, image size, wide anatomical differences among CT scans, variations in contrast and texture from different imaging devices and parameter settings, and even label noise arising from poor expert annotation quality. These challenges in lung tumor localization and segmentation can be observed in the first two columns of Fig. 1, between the input images and the corresponding ground truths.

Fig. 1
figure 1

Obstacles from the CT dataset and deficiencies of the current popular approaches for biomedical image segmentation. Here, we demonstrate these problems with the lung dataset for the tumor segmentation task. Note that one of the main challenges of this dataset is pixel-level label noise caused by poor expert annotation quality, and each segmentation algorithm is trained and tested using the same training and test datasets

Common deficiencies

Currently, among all the proposed automated and intelligent biomedical image analysis methods, deep learning has been extensively used for all kinds of biomedical image processing tasks, including lesion detection, classification, and biomedical image segmentation [2, 56]. Deep neural networks have shown superb abilities in feature extraction and representation by constructing multiple processing layers with multiple levels of abstraction, a typical data-driven learning process for a specific task [3, 57]. Moreover, most deep learning-based methods have achieved state-of-the-art performance compared with conventional methods that rely on hand-engineered features. A newly proposed deep model can usually work well on its targeted dataset and task; however, there is only limited potential to apply it directly to a different dataset or problem [4]. Consequently, we have to adjust the network structure and numerous hyper-parameters, and even redesign each network component, and in the end we can hardly achieve the expected performance because of the many parameters and candidate combinations to be tried. Meanwhile, such a scattershot approach naturally leads to a large number of models being proposed in recent years with similar network structures or minor changes for a specific dataset or task, most of which are based on the current popular deep learning techniques.

After studying the literature on current deep learning-based methods for biomedical image tumor detection and segmentation, we have observed several main problems and deficiencies, discussed as follows. Figure 1 also illustrates some of these problems for the lung tumor segmentation task:

  • Pure implantation: most network models for biomedical image segmentation are inspired by, and even directly adapted from, image classification tasks, but there is often a lack of in-depth understanding of the differences between these two kinds of computer vision tasks.

  • Lack of flexibility: because transfer learning usually requires loading a pre-trained network model with a large number of well-trained weight parameters, most methods become difficult and inflexible to modify, even for changing the size of a layer to accommodate a different input image size or for altering internal structures for a new dataset and task.

  • Unfriendly to small objects: small target detection and segmentation is still a huge challenge in computer vision and is critical for the early detection of lesions and tumors; however, most popular network models mainly focus on large-organ segmentation, and they (e.g., ResNet, U-Net) usually adopt consecutive down-sampling operations with pooling layers, which result in the loss of spatial information about small objects.

  • Severe class imbalance: category balance is still a default assumption when training a deep network model; however, in tumor detection and segmentation tasks, we face a severe class imbalance between tumor and non-tumor regions.

  • Blurry boundaries: tumor segmentation results delivered by deep neural networks are usually ambiguous, especially around boundary positions. To improve the quality of such results, a post-processing technique such as conditional random fields (CRFs) is needed.

  • Confusion over building a new network model and training strategies: building a stable and accurate network model for a specific task is really time-consuming work, requiring experiments with many different combinations of network structures and layers, loss functions, learning rates, optimization methods, other hyper-parameters, and choices carried over from different tasks.

Contributions

To overcome the critical challenges and deficiencies mentioned above, we propose a unified framework based on generative adversarial learning for localizing and segmenting all kinds of tumors on computed tomography-scanned image slices. Our framework consists of three main modules:

  • the tumor information refinement, which combines high-level semantic information and low-level but high-resolution features for precise localization, based on elaborately constructed Generator networks, especially for detecting small tumors without information loss;

  • the tumor shape context information rectification, which is fused into our framework to capture the correlation between output components and, at the same time, better highlight tumor regions so as to successfully address the severe class imbalance problem; and

  • online learning of the tumor region distribution for different imaging datasets, which pushes the Generator networks towards learning the same tumor region distribution as the ground truth by applying the Discriminator networks in an adversarial learning manner, in particular solving the fuzzy boundary problem of tumor segmentation predictions.

The characteristics and contributions of this work can be summarized as follows:

  1. In order to accurately localize and segment tumors on computed tomography scans, we propose a novel unified framework, called CTumorGAN, which can be effectively generalized to any kind of tumor dataset with superior performance.

  2. The automatic CTumorGAN framework is trained via adversarial learning, as a min-max game, by utilizing the Generator networks and the Discriminator networks in an alternating fashion. The Generator is tasked with producing tumor segmentation results that look like gold standards, using a fully convolutional encoder-decoder network with multi-level information fusion, while the Discriminator tries to distinguish generated samples from real ground truths to further push the Generator towards better performance. Moreover, in order to tackle several well-known obstacles in the tumor segmentation task, two other key modules, namely tumor information refinement and tumor shape context rectification, are proposed to further guide the CTumorGAN training process by utilizing multi-level supervision more effectively.

  3. We demonstrate that our end-to-end framework successfully overcomes the key challenges arising from CT datasets, including small tumor localization, label noise, severe class imbalance, blurry segmentation boundaries, the small-sample problem without using pre-trained parameters, and other properties and difficulties of the different computed tomography datasets themselves. In addition, we highlight several main deficiencies of current deep learning-based segmentation methods for biomedical images and, to solve these problems, propose the novel and automatic CTumorGAN segmentation framework for computed tomography tumors.

  4. Comprehensive experimental results on three new and publicly available CT tumor datasets (lung tumors [5, 6], kidney tumors [7, 8], and liver tumors [9, 10]) demonstrate the superiority of the proposed CTumorGAN for tumor localization and segmentation tasks compared with the state-of-the-art approaches. In addition, we compare and analyze the performance of the current popular objective functions for the task of CT tumor segmentation.

Material and methods

Datasets

In our study, we use three new, real-world clinical CT tumor datasets to test the robustness of CTumorGAN, drawn respectively from lung tumor [6], kidney tumor [8], and liver tumor [10] databases with real CT-scanned slices. Moreover, these three datasets are also used in three public tumor segmentation challenges. Detailed information about each dataset is introduced in the following subsections.

Lung tumor dataset

We use the non-small cell lung cancer (NSCLC)-Radiomics data from TCIA [6], which is also the dataset used in the 2018 IEEE VIP-CUP Challenge [5]. There are in total 422 patient records with 51,513 slices, all from NSCLC patients, and the total image size is about 25 GB. For all these pretreatment CT scans, a radiation oncologist manually delineated the 3D gross tumor volume. The data for each patient are stored in DICOM format, and the RTSTRUCT file in each record contains the manual delineation of the gross tumor volume. This lung data is publicly available at The Cancer Imaging Archive https://wiki.cancerimagingarchive.net/display/Public/NSCLC-Radiomics.

However, due to the poor manual annotation by only one radiation oncologist and incorrect storage operations in this dataset, only 5284 slice pairs of scanned slices and their corresponding annotations could ultimately be extracted, and these suffer from a severe label noise problem. Therefore, for this dataset, it is a huge challenge to accurately localize and segment the lung tumors in the presence of mistakes in the ground truth labels, and this segmentation task is more difficult than that on a standard biomedical image segmentation dataset. Nevertheless, from another perspective, this situation is common in practice when obtaining and using real-world biomedical data from a hospital.

Kidney tumor dataset

This dataset is provided by the KiTS2019 challenge for kidney and kidney tumor segmentation [8]. It contains 300 imaging cases from the University of Minnesota, collected from 300 patients who underwent partial or radical nephrectomy for kidney tumors between 2010 and 2018, together with the corresponding ground truth annotations. The collection involves multi-phase CT imaging and comprehensive clinical outcomes. The data collection procedure for each patient includes four steps, namely chart review, CT collection, CT annotation, and quality assurance, carried out by the team (including students and their supervisor) and an experienced, well-trained urologic oncologist specialized in kidney tumors; the detailed procedure can be found in the published paper [8]. Note that our study focuses only on tumor segmentation rather than kidney segmentation, because tumor segmentation is more difficult than normal organ segmentation. However, since only 210 patient records with ground truth were released as a training set, we could extract 5696 slices containing tumor regions in total. The imaging data and annotation data for each patient are saved as two separate .nii.gz files, and the 210 training cases have been released at https://github.com/neheller/kits19.

Liver tumor dataset

In order to enable the development of robust algorithms for automatically segmenting liver tumors, contrast-enhanced abdominal CT scans and segmentation labels from several clinical sites around the world are provided [10]. In this experiment, 131 CT scans are provided, from which we can extract 7190 slices with tumor masks. The image data were collected by Ludwig Maximilian University of Munich, Radboud University Medical Center of Nijmegen, Polytechnique & CHUM Research Center Montréal, Tel Aviv University, Sheba Medical Center, IRCAD Institute Strasbourg, and Hebrew University of Jerusalem. The studied subjects present various types of liver tumor disease, for example, primary tumor disease (e.g., hepatocellular carcinoma) as well as secondary liver tumors and metastases (from colorectal, lung, and breast cancer). The collected images are a mix of pre- and post-therapy abdominal CT scans acquired with different CT scanners and acquisition protocols, which results in large differences in resolution and image quality. The axial (in-plane) resolution ranges from 0.56 to 1.0 mm and the z-direction spacing from 0.45 to 6.0 mm, so the number of slices along z varies between 42 and 1026. Note that imaging artifacts are present in some images, which is normal for real-life clinical data.

Regarding tumor information, across all 131 training CT volumes, the number of tumors per volume ranges from 0 to 75, with a median of 3 and a mean of 6.93. In this study, we focus only on slices that contain at least one tumor and segment the tumors in them. Tumor size varies between 38 mm³ and 349 cm³, with a median of 2.49 cm³ and a mean of 17.56 cm³. Tumor intensity, defined as the average absolute difference between liver voxel HU values and tumor voxel values [10], varies from 0 to 98, with a median of 29.61 and a mean of 31.94. For tumor annotation, the tumor ground truth on each CT slice was annotated by trained radiologists at every clinical site, and these results were further verified by another three experienced radiologists using an institution-blinded verification test. In our study, we only consider tumor and non-tumor labels and do not further classify the tumors into different categories. Finally, all data and information are saved into separate .nii files for the different CT volumes.

Methods

Problem formulation

Assume that there are N pairs of training data \( \{(c_i, x_i^{+})\}_{i=1}^N \) from a real database, \( (c_i, x_i^{+}) \sim p_{\mathrm{data}}(c, x^{+}) \), where \( c_i \) is the ith input image (e.g., one CT-scanned slice) and \( x_i^{+} \) is the corresponding segmentation ground truth of the ith input image (for simplicity, a pixel-wise annotation in which value 1 denotes a tumor region and value 0 a non-tumor region). In contrast to traditional supervised segmentation methods, which usually learn a function from the input image c to the ground truth \( x^{+} \), a conditional generative adversarial network (cGAN) aims to learn a mapping from a condition c and a random noise vector \( z \sim p_z(z) \) to the ground truth \( x^{+} \). The condition is the observed input image c, which is used to control the generated outputs \( \hat{x} \sim p_{\mathrm{gene}}(\hat{x}) \) so that they remain related to the input image c; this is also the main improvement of cGAN over the original GAN. cGAN consists of two components, the Generator G and the Discriminator D. G is trained to generate output images \( \hat{x} \) that confuse D, while D is trained to distinguish real samples (in our case, the ground truths) \( x^{+} \) from the generated "fake" images \( \hat{x} \). It is a minimax game between two players, which pushes G's outputs \( \hat{x} \) towards the distribution of the ground truth \( x^{+} \). In this way, we can finally obtain the biomedical image segmentation result \( \hat{x} \) from the input conditional image c. Goodfellow et al. [11] have already proven that this minimax game achieves a global optimal solution when the model is well trained so that \( p_{\mathrm{gene}} = p_{\mathrm{data}} \), i.e., \( p_{\mathrm{gene}} \) converges to the unknown distribution \( p_{\mathrm{data}} \). The main notations used in this work are explained in Table 1.

Table 1 Notations for our CTumorGAN

In our CTumorGAN for the computed tomography tumor segmentation task, we further use the following notations. The Generator networks of CTumorGAN are defined as G : T × Z → T, and the Discriminator networks as D : T × T → {0, 1}, where Z is the dimension of the random noise vector z, and T is the dimension of the input conditional image c, the ground truth \( x^{+} \), and the generated prediction \( \hat{x} \). Besides, we denote by \( \{z_i\}_{i=1}^N \) the noise inputs drawn from a probability distribution, and by \( \{\hat{x}_i\}_{i=1}^N \) the outputs generated by G for the input pairs \( \{(c_i, x_i^{+})\}_{i=1}^N \); in other words, \( \hat{x} = G(c, z) \). Here, we call \( \{(c_i, x_i^{+})\}_{i=1}^N \) positive examples, and \( \{(c_i, \hat{x}_i)\}_{i=1}^N \) and \( \{(c_i, x_j^{-})\}_{i \ne j} \) negative examples. Negative examples are important for tumor segmentation, and the following sections give a detailed explanation. Ideally, a well-trained D assigns value 1 to any positive example and value 0 to any negative example.
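
To make the roles of the positive and negative examples concrete, the following minimal PyTorch-style sketch shows how the three kinds of (condition, mask) pairs could be assembled from one mini-batch. The function name, tensor shapes, and the in-batch permutation used to obtain the mismatched ground truths \( x^{-} \) are illustrative assumptions, not the exact sampling scheme used in our implementation.

```python
import torch

def build_discriminator_pairs(c, x_pos, generator, z):
    """Assemble the three kinds of (condition, mask) pairs used to train D.

    c      : batch of input CT slices, shape (B, 1, H, W)
    x_pos  : matching ground-truth masks, shape (B, 1, H, W)
    z      : noise vectors fed to the Generator
    """
    # positive example: real slice with its own ground truth, (c_i, x_i^+)
    pos = (c, x_pos)

    # first negative type: real slice paired with the Generator's prediction
    x_gen = generator(c, z).detach()   # detach: the D update should not touch G
    neg_gen = (c, x_gen)

    # second negative type: real slice with a *wrong* ground truth, (c_i, x_j^-), i != j
    # (a simple in-batch permutation is used here for illustration; indices that
    #  happen to map to themselves would be re-drawn in practice)
    perm = torch.randperm(c.size(0))
    neg_mismatch = (c, x_pos[perm])

    return pos, neg_gen, neg_mismatch
```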

The CTumorGAN framework

Based on the notations introduced in the previous section, the objective function of the adversarial training process from our CTumorGAN can be expressed as follows:

$$ \begin{array}{c}\min_{\theta_G}\max_{\theta_D} L_{\mathrm{adv}}(\theta_G,\theta_D)=\mathbb{E}\left[\log D\left(c,x^{+}\right)\right]+\mathbb{E}\left[\log \left(1-D\left(c,\hat{x}\right)\right)\right]+\mathbb{E}\left[\log \left(1-D\left(c,x^{-}\right)\right)\right]\\ =\mathbb{E}\left[\log D\left(c,x^{+}\right)\right]+\mathbb{E}\left[\log \left(1-D\left(c,G\left(c,z\right)\right)\right)\right]+\mathbb{E}\left[\log \left(1-D\left(c,x^{-}\right)\right)\right]\end{array} $$
(1)

where the Generator networks try to minimize this objective function while the Discriminator networks attempt to maximize it. Ideally, the Generator G of our trained CTumorGAN, with \( G^{\ast} = \arg\min_{\theta_G}\max_{\theta_D} L_{\mathrm{adv}}(\theta_G, \theta_D) \), can then be used as our Segmentor on new input images; at this point, D can no longer reliably distinguish whether an input image pair contains the ground truth or the generated segmentation result, assigning a probability of 50% to both.

In this way, adversarial training is performed to alternately train Generator networks and Discriminator networks, and a balance between two networks can be achieved finally. More specifically, with the training dataset, the following objective function is defined for Generator networks:

$$ L_G(\theta_G,\theta_D)=-\frac{1}{N}\sum_{i=1}^N \log\left(D\left(c_i,\hat{x}_i\right)\right)=-\frac{1}{N}\sum_{i=1}^N \log\left(D\left(c_i,G\left(c_i,z_i\right)\right)\right) $$
(2)

By minimizing G's loss function, segmentation images \( \hat{x} \) that more closely resemble the ground truth \( x^{+} \) are generated. Meanwhile, the loss function for the Discriminator networks is maximized with the following function:

$$ L_D(\theta_G,\theta_D)=\frac{1}{N}\sum_{i=1}^N \log\left(D\left(c_i,x_i^{+}\right)\right)+\frac{1}{N}\sum_{i=1}^N \log\left(1-D\left(c_i,G\left(c_i,z_i\right)\right)\right)+\frac{1}{N}\sum_{i=1}^N \log\left(1-D\left(c_i,x_i^{-}\right)\right) $$
(3)
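
As a hedged sketch of Eqs. (2) and (3), the two adversarial losses can be written in PyTorch as follows, assuming that D returns probabilities in (0, 1); for a PatchGAN-style D that returns a per-patch probability map, the mean is simply taken over patches as well. The small epsilon for numerical stability is our addition.

```python
import torch

def generator_adv_loss(D, c, x_gen, eps=1e-8):
    # Eq. (2): L_G = -(1/N) * sum_i log D(c_i, G(c_i, z_i)); minimized by G
    return -torch.log(D(c, x_gen) + eps).mean()

def discriminator_loss(D, c, x_pos, x_gen, x_neg, eps=1e-8):
    # Eq. (3): D tries to maximize
    #   log D(c, x+) + log(1 - D(c, G(c, z))) + log(1 - D(c, x-))
    l_real     = torch.log(D(c, x_pos) + eps).mean()                 # true pairs
    l_fake     = torch.log(1.0 - D(c, x_gen.detach()) + eps).mean()  # generated masks
    l_mismatch = torch.log(1.0 - D(c, x_neg) + eps).mean()           # wrong ground truths
    return l_real + l_fake + l_mismatch
```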

In order to further improve the segmentation performance, the final objective from our CTumorGAN is updated to optimize:

$$ G^{\ast}=\arg\min_{\theta_G}\max_{\theta_D}\ \alpha L_{\mathrm{adv}}(\theta_G,\theta_D)+\beta L_{\mathrm{refine}}(\theta_G)+\gamma L_{\mathrm{shape}}(\theta_G) $$
(4)

where \( L_{\mathrm{refine}}(\theta_G) \) is used to push the generated output close to its corresponding ground truth, and \( L_{\mathrm{shape}}(\theta_G) \) is proposed to capture the shape context information among the output pixel-level predictions as a kind of rectification. As shown in Fig. 2, more details about these three modules are given in the next subsection.

Fig. 2
figure 2

The general structure of the proposed CTumorGAN algorithm. The main backbone of CTumorGAN consists of Generator networks and Discriminator networks. In particular, there are three main modules in the whole process, including tumor distribution adversarial learning, tumor information refinement, and tumor shape context rectification

Main modules

Tumor distribution adversarial learning

In contrast to general image-to-image translation tasks, which usually aim to generate plausible images that look like real data, the tumor segmentation task requires not only generating authentic segmentation images resembling the ground truth but also guiding the networks to verify that the generated tumor predictions really correspond to tumor areas in the input CT-scanned images. It is easy to generate a truth-like segmentation map, but ensuring that the framework accurately locates tumor regions, rather than healthy ones, is much harder to design. Therefore, in our framework, in addition to the generated predictions paired with their conditional inputs, we use mismatched ground truths as a second type of negative sample to train the Discriminator networks, so that the Discriminator learns that tumor predictions placed on healthy tissues are wrong. As can be seen in Eq. (1), we add the third term \( (c, x^{-}) \), which means "real input image, wrong corresponding ground truth." In this way, the Discriminator pushes the Generator to produce more realistic segmentation results at the correct tumor positions. The effective improvement brought by this kind of negative sample for tumor localization and segmentation can be observed in Fig. 3. Moreover, the adversarial learning process further pushes the generated segmentation results towards the distribution of the ground truth labels, which yields much clearer tumor boundaries and contours, as shown in Figs. 1, 8, and 9. This is a significant improvement in performance over the other state-of-the-art methods in the field of medical image segmentation.

Fig. 3
figure 3

Comparison between training with and without the negative samples \( (c, x^{-}) \) for tumor localization and segmentation on computed tomography

Tumor information refinement

Unlike U-Net, our Generator network G is designed to reduce the information loss caused by pooling layers, because pooling operations are particularly harmful for small tumor regions and often discard the early signs visible in patients' scans. In our networks, as shown in Fig. 4, we do not use pooling operations; instead, we exploit only strided convolution operations to reduce the dimension of the feature maps and extract higher-level semantic and context information for accurate tumor segmentation on computed tomography slices. Meanwhile, in order to preserve more spatial information, we follow the skip connections of U-Net to fuse high-level abstract features with low-level spatial and positional information, boosting the performance of CT tumor localization and segmentation.
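
A minimal sketch of the two down-sampling styles discussed here; the channel counts are arbitrary, and the 4 × 4 kernel with stride 2 matches the setting described later in the network architecture subsection.

```python
import torch.nn as nn

# Conventional down-sampling: convolution followed by max pooling, where the
# pooling step can discard the few pixels that a small tumor occupies.
pooling_block = nn.Sequential(
    nn.Conv2d(64, 128, kernel_size=3, padding=1),
    nn.LeakyReLU(0.2),
    nn.MaxPool2d(kernel_size=2),
)

# Down-sampling used in our Generator: a strided convolution halves the
# resolution while learned weights decide which information to keep.
strided_block = nn.Sequential(
    nn.Conv2d(64, 128, kernel_size=4, stride=2, padding=1),
    nn.LeakyReLU(0.2),
)
```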

Fig. 4
figure 4

The network architecture of our proposed CTumorGAN. The architecture contains two subnetworks, the Generator networks (G) and the Discriminator networks (D). G is a novel variant of an autoencoder that includes an encoder part with a noise vector z and a decoder part with multi-scale skip connections. D is similar to PatchGAN [12], which is suitable for any size of input image, has fewer parameters, and still produces high-quality discrimination for each local input patch

Another significant finding from our comprehensive experiments is that different loss functions can severely affect the accuracy of distinguishing tumor regions from non-tumor regions. We also provide a detailed comparison and analysis of numerous objective functions proposed for medical image segmentation in our experimental section. In our work, we have found that the mean squared error (MSE) is more suitable for computed tomography tumor detection and segmentation. Furthermore, adding the MSE distance to our CTumorGAN further boosts the performance of the tumor segmentation task without changing the task of our Discriminator networks. This term pushes the Generator's prediction closer to its corresponding ground truth, while the generated output can also better fool the Discriminator at the same time. Therefore, to refine the tumor candidate positions, we use the following equation to accurately discriminate the pixel-level categories as tumor or non-tumor:

$$ L_{\mathrm{refine}}(\theta_G)=\mathbb{E}\left\Vert x^{+}-\hat{x}\right\Vert_2=\mathbb{E}\left\Vert x^{+}-G(c,z)\right\Vert_2=\frac{1}{N}\sum_{i=1}^N\left\Vert x_i^{+}-G\left(c_i,z_i\right)\right\Vert_2 $$
(5)

Tumor shape context rectification

Most of the existing fully convolutional network (FCN)-based medical image segmentation methods deliver the predicted results directly, without further analysis across the classified pixels, which is also one of the disadvantages of neural networks: neurons in the same feature map cannot influence each other even though they are highly correlated. However, the relation between the different output components is critical for tumor segmentation. Therefore, in order to capture the local and global relationships between the output pixel-wise predictions of our CTumorGAN, a kind of context information based on the Dice coefficient is exploited to rectify the tumor shape information, referred to as tumor shape context rectification. This module further assists CTumorGAN in obtaining more accurate tumor contours that stay close to the ground truth labels. Furthermore, this kind of constraint also highlights tumor regions and thereby successfully addresses the severe class imbalance problem in tumor localization and segmentation tasks, preventing the losses of the large number of non-tumor pixels from covering up the loss information of the scarce tumor pixels. Therefore, the following objective function is added to boost the performance of the outputs of the Generator networks:

$$ L_{\mathrm{shape}}(\theta_G)=\mathbb{E}\frac{2\, x^{+}\cdot \hat{x}}{\left\Vert x^{+}\right\Vert_2+\left\Vert \hat{x}\right\Vert_2}=\mathbb{E}\frac{2\, x^{+}\cdot G(c,z)}{\left\Vert x^{+}\right\Vert_2+\left\Vert G(c,z)\right\Vert_2}=\frac{1}{N}\sum_{i=1}^N\frac{2\, x_i^{+}\cdot G\left(c_i,z_i\right)}{\left\Vert x_i^{+}\right\Vert_2+\left\Vert G\left(c_i,z_i\right)\right\Vert_2} $$
(6)
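
The two supervision terms can be transcribed almost literally, as in the following sketch: `refine_loss` follows the L2 distance of Eq. (5) and `shape_loss` the Dice-style overlap of Eq. (6). Note that since the overlap should be large for good predictions, a minimization-based implementation would typically use 1 − Dice (or negate the term); the epsilon guarding against empty masks is our addition.

```python
import torch

def refine_loss(x_pos, x_gen):
    # Eq. (5): L2 distance pulling each prediction towards its ground truth
    return torch.norm((x_pos - x_gen).flatten(1), p=2, dim=1).mean()

def shape_loss(x_pos, x_gen, eps=1e-8):
    # Eq. (6): Dice-style overlap highlighting the scarce tumor pixels,
    # which counteracts the severe tumor / non-tumor class imbalance
    inter = (x_pos * x_gen).flatten(1).sum(dim=1)
    norms = (torch.norm(x_pos.flatten(1), p=2, dim=1)
             + torch.norm(x_gen.flatten(1), p=2, dim=1))
    return (2.0 * inter / (norms + eps)).mean()
```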

Network architecture

The network architecture of our proposed CTumorGAN consists of Generator networks (G) and Discriminator networks (D). Throughout the architecture, we use strided convolution operations in place of all pooling operations to reduce tumor information loss and interference from random imaging noise. Here, the stride is set to 2 × 2, and the kernel size for all convolution and deconvolution operations is 4 × 4 with "same" padding. In addition, leaky ReLU activation functions are used, except for a tanh activation in the last layer of G. Detailed network information can be observed in Fig. 4.

For the tumor Generator networks, G is a novel variant of an autoencoder that includes an encoder module with a noise vector z and a decoder module with multi-scale skip connections. The tumor Generator networks are trained to generate segmentation results from random noise vectors z conditioned on the input images, which in our work are CT slices. G aims not only to generate outputs that look real but also to enforce that they match the input CT slices, which helps to successfully fool D. The input image c and the 100-d noise vector z are forwarded through convolutional layers and deconvolutional layers, respectively, and their resulting feature maps are concatenated along the channel dimension. Through several further convolution operations up to conv8, the encoder module extracts high-level semantic information. After this stage, the output of the encoder is forwarded to the decoder module as the input of deconv8. Through successive deconvolution operations that fuse the low-level, high-resolution features from the encoder, the location information of tumor and non-tumor pixels is recovered until the segmentation output \( \hat{x} \) is obtained.
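
The following heavily reduced PyTorch sketch illustrates the encoder-decoder idea with strided convolutions, noise injection, and skip connections; only three of the eight levels are shown, the channel widths are arbitrary, and for brevity the noise feature map is fused at the bottleneck rather than through the separate deconvolution path used in the full architecture of Fig. 4.

```python
import torch
import torch.nn as nn

class TinyGenerator(nn.Module):
    """Reduced sketch of G: strided-conv encoder, noise injection, and a
    decoder with skip connections (the real G uses conv1..conv8 / deconv8..deconv1)."""

    def __init__(self, z_dim=100):
        super().__init__()
        act = nn.LeakyReLU(0.2, inplace=True)
        # encoder: each block halves the resolution (kernel 4, stride 2)
        self.enc1 = nn.Sequential(nn.Conv2d(1,    64, 4, 2, 1), act)
        self.enc2 = nn.Sequential(nn.Conv2d(64,  128, 4, 2, 1), act)
        self.enc3 = nn.Sequential(nn.Conv2d(128, 256, 4, 2, 1), act)
        # noise vector mapped to a feature map and concatenated at the bottleneck
        self.z_proj = nn.Linear(z_dim, 256)
        # decoder: deconvolutions recover resolution, skip connections restore detail
        self.dec3 = nn.Sequential(nn.ConvTranspose2d(512, 128, 4, 2, 1), act)
        self.dec2 = nn.Sequential(nn.ConvTranspose2d(256,  64, 4, 2, 1), act)
        self.dec1 = nn.Sequential(nn.ConvTranspose2d(128,   1, 4, 2, 1), nn.Tanh())

    def forward(self, c, z):
        e1 = self.enc1(c)                        # low-level, high-resolution features
        e2 = self.enc2(e1)
        e3 = self.enc3(e2)                       # high-level semantic features
        zf = self.z_proj(z).unsqueeze(-1).unsqueeze(-1).expand_as(e3)
        d3 = self.dec3(torch.cat([e3, zf], 1))   # fuse noise and semantic features
        d2 = self.dec2(torch.cat([d3, e2], 1))   # skip connection from enc2
        x_hat = self.dec1(torch.cat([d2, e1], 1))  # skip connection from enc1
        return x_hat
```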

For the tumor Discriminator networks, D is similar to PatchGAN [12], which works on any size of input image and has fewer parameters while still producing high-quality discrimination for each local input patch. In this way, the local discrimination is better matched to the actual tumor size, in contrast to the discrimination of generally generated images. Meanwhile, D also pushes G to generate more high-frequency structure on top of the low-frequency information produced by our pixel-wise MSE constraint. More importantly, the proposed Discriminator should correctly distinguish segmentation outputs that follow their corresponding conditional input image from other generated realistic results. To embed the supervision information, a positive image pair \( \{(c, x^{+})\} \) and two negative image pairs \( \{(c, \hat{x}), (c, x^{-})\} \) are fed into D to push it towards the correct judgment. This learning process of D is shown in Fig. 4.
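
A corresponding sketch of a PatchGAN-style conditional Discriminator: the CT slice and a candidate mask are concatenated channel-wise (assuming single-channel slices and masks), and a small fully convolutional stack outputs one "real" probability per local patch. The depth and channel widths here are illustrative, not the exact configuration of Fig. 4.

```python
import torch
import torch.nn as nn

class PatchDiscriminator(nn.Module):
    """Sketch of a PatchGAN-style D over (condition, mask) pairs."""

    def __init__(self):
        super().__init__()
        act = nn.LeakyReLU(0.2, inplace=True)
        self.net = nn.Sequential(
            nn.Conv2d(2,    64, 4, 2, 1), act,   # input: slice + mask (2 channels)
            nn.Conv2d(64,  128, 4, 2, 1), act,
            nn.Conv2d(128, 256, 4, 2, 1), act,
            nn.Conv2d(256,   1, 4, 1, 1),        # one logit per local patch
            nn.Sigmoid(),                        # per-patch "real" probability
        )

    def forward(self, c, x):
        return self.net(torch.cat([c, x], dim=1))
```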

Adversarial learning: optimization

As mentioned before, the process of learning the optimal tumor segmentation results is conducted by jointly minimizing the Generator losses and maximizing the Discriminator losses of our proposed CTumorGAN as a minimax game. These losses, as obtained in Eqs. (2), (3), and (4), can be further classified into two opposite loss categories from two adversarial learning sub-processes:

$$ \theta_G^{\ast}=\arg\min_{\theta_G}\ L_G(\theta_G,\theta_D^{\ast})+\alpha L_{\mathrm{refine}}(\theta_G)+\beta L_{\mathrm{shape}}(\theta_G) $$
(7)
$$ \theta_D^{\ast}=\arg\max_{\theta_D}\ L_D(\theta_G^{\ast},\theta_D) $$
(8)

The whole adversarial training process can be implemented with a stochastic gradient descent optimization algorithm. In practice, the gradient ascent on the Discriminator parameters can also be converted into gradient descent by multiplying the corresponding loss by −1. The two training processes are performed in an alternating fashion, as shown in Algorithm 1.

figure a
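
Putting the pieces together, a single alternating update corresponding to Eqs. (7) and (8) might look like the sketch below, reusing the loss helpers defined earlier. The weighting of the three Generator terms follows Eq. (4), with the 1 − Dice form used so that every term is minimized; the weights themselves are placeholders rather than prescribed values.

```python
def train_step(G, D, opt_G, opt_D, c, x_pos, x_neg, z, alpha, beta, gamma):
    # ---- Discriminator step (Eq. 8): ascend L_D, i.e., descend -L_D ----
    x_gen = G(c, z).detach()
    loss_D = -discriminator_loss(D, c, x_pos, x_gen, x_neg)
    opt_D.zero_grad()
    loss_D.backward()
    opt_D.step()

    # ---- Generator step (Eq. 7): adversarial + refinement + shape terms ----
    x_gen = G(c, z)
    loss_G = (alpha * generator_adv_loss(D, c, x_gen)
              + beta * refine_loss(x_pos, x_gen)
              + gamma * (1.0 - shape_loss(x_pos, x_gen)))  # 1 - Dice overlap
    opt_G.zero_grad()
    loss_G.backward()
    opt_G.step()
    return loss_D.item(), loss_G.item()
```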

Experimental settings

Comparative methods

For comparison with the state of the art in biomedical image segmentation, we choose several representative methods proposed in recent years, including popular and classical models (i.e., U-Net [13], FCNs), transferred models (i.e., CE-Net [14], SegNet [15], FCN+FC_ResNet [16]), adversarial learning models (i.e., Pix2Pix [12], cGAN [17]), the newest methods (i.e., CE-Net [14], Joint-Liver-Seg [18], H-DenseUNet [19]), and the top 3 winners of each challenge.

Loss functions and parameter setting

Many diverse objective functions have been proposed for biomedical image segmentation. Therefore, we also compare the performance of different loss functions for the tumor segmentation task on computed tomography scans, for instance, MAE (L1 loss) [20], MSE (L2 loss) [21], cross entropy loss [22], Dice loss [23], and MSE + Dice loss [24]. For the parameter setting, each tumor dataset is randomly split into a training set and a test set with a ratio of 4:1, and each compared method is trained on the same training and test sets following its original setting on a GeForce GTX 1080 graphics card. After repeated trials, the learning rate is set to 2e-4 for the first 200 epochs with α = 0, β = 1, γ = 0, and then to 5e-5 for the next 100 epochs with α = 1, β = 0, γ = 100. The batch size of each mini-batch is set to 1, and the momentum term of the Adam optimizer is 0.5. Note that all these hyper-parameters of our CTumorGAN are kept at the same fixed values across the different tumor datasets, as befits a unified framework. This unified, fixed setting demonstrates the robustness and adaptability of the proposed CTumorGAN, with high accuracy for various medical tumor localization and segmentation tasks.
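
Interpreted in code, the schedule above could be configured as follows; `loader`, the networks `G` and `D`, and `train_step` are assumed from the earlier sketches, and reading "the momentum term of Adam is 0.5" as β₁ = 0.5 is our assumption.

```python
import torch

# Two-phase schedule: (epochs, learning rate, (alpha, beta, gamma)) for Eq. (4)
phases = [
    (200, 2e-4, (0.0, 1.0, 0.0)),    # first: pure MSE refinement supervision
    (100, 5e-5, (1.0, 0.0, 100.0)),  # then: adversarial + shape context terms
]

for num_epochs, lr, (alpha, beta, gamma) in phases:
    opt_G = torch.optim.Adam(G.parameters(), lr=lr, betas=(0.5, 0.999))
    opt_D = torch.optim.Adam(D.parameters(), lr=lr, betas=(0.5, 0.999))
    for epoch in range(num_epochs):
        for c, x_pos, x_neg, z in loader:   # batch size 1 in our setting
            train_step(G, D, opt_G, opt_D, c, x_pos, x_neg, z, alpha, beta, gamma)
```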

Evaluation metrics

Moreover, several widely used segmentation criteria [12,13,14,15,16,17,18,19, 58], namely the Dice coefficient, Jaccard index, Precision, Recall (sensitivity), Specificity, and F1, are evaluated on each tumor dataset for computed tomography tumor segmentation with all compared approaches. Together, these evaluation metrics give a comprehensive picture of the different segmentation approaches for the task of CT tumor segmentation. Here, given any pair of a segmentation result \( \hat{x} \) and its manual ground truth \( x^{+} \), we list all these evaluation criteria as follows:

$$ \mathrm{Accuracy}_{\mathrm{Dice}}=\frac{2\sum \left(\lceil x^{+}\rceil \,\&\, \lceil \hat{x}\rceil\right)}{\sum \lceil x^{+}\rceil +\sum \lceil \hat{x}\rceil} $$
(9)
$$ \mathrm{Accuracy}_{\mathrm{Jaccard}}=\frac{\sum \left(\lceil x^{+}\rceil \,\&\, \lceil \hat{x}\rceil\right)}{\sum \left(\lceil x^{+}\rceil \mid \lceil \hat{x}\rceil\right)} $$
(10)
$$ \mathrm{Accuracy}_{\mathrm{Precision}}=\frac{\sum \left(\lceil x^{+}\rceil \,\&\, \lceil \hat{x}\rceil\right)}{\sum \lceil \hat{x}\rceil} $$
(11)
$$ \mathrm{Accuracy}_{\mathrm{Recall}}=\frac{\sum \left(\lceil x^{+}\rceil \,\&\, \lceil \hat{x}\rceil\right)}{\sum \lceil x^{+}\rceil} $$
(12)
$$ \mathrm{Accuracy}_{\mathrm{Specificity}}=\frac{\sum \left(1-\lceil x^{+}\rceil\right)\,\&\,\left(1-\lceil \hat{x}\rceil\right)}{\sum \left(1-\lceil x^{+}\rceil\right)} $$
(13)
$$ \mathrm{Accuracy}_{F1}=\frac{2\,\mathrm{Accuracy}_{\mathrm{Precision}}\cdot \mathrm{Accuracy}_{\mathrm{Recall}}}{\mathrm{Accuracy}_{\mathrm{Precision}}+\mathrm{Accuracy}_{\mathrm{Recall}}} $$
(14)

where ⌈∙⌉ denotes the binarized segmentation result: a pixel is assigned 1 (tumor region) if its prediction value is larger than 0.1 and 0 (non-tumor region) otherwise. Furthermore, we compare the performance of all the tumor segmentation approaches in terms of both qualitative and quantitative analysis on each dataset.
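
A straightforward NumPy transcription of Eqs. (9)-(14), with the 0.1 binarization threshold stated above, is given below; the ground truth is assumed to be already binary, and the small epsilon protecting against empty masks is our addition.

```python
import numpy as np

def segmentation_metrics(x_hat, x_pos, thr=0.1):
    """Binarize the prediction at 0.1 and compute the reported metrics (Eqs. 9-14)."""
    pred = (x_hat > thr).astype(np.float64)
    gt   = (x_pos > thr).astype(np.float64)   # ground truth assumed binary already
    tp = np.sum(pred * gt)
    fp = np.sum(pred * (1 - gt))
    fn = np.sum((1 - pred) * gt)
    tn = np.sum((1 - pred) * (1 - gt))
    eps = 1e-8
    dice        = 2 * tp / (np.sum(pred) + np.sum(gt) + eps)
    jaccard     = tp / (tp + fp + fn + eps)
    precision   = tp / (np.sum(pred) + eps)
    recall      = tp / (np.sum(gt) + eps)
    specificity = tn / (tn + fp + eps)
    f1          = 2 * precision * recall / (precision + recall + eps)
    return dict(Dice=dice, Jaccard=jaccard, Precision=precision,
                Recall=recall, Specificity=specificity, F1=f1)
```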

Results

In this section, we report the results of extensive experiments conducted on three real-world medical tumor datasets. In order to demonstrate the performance of the proposed CTumorGAN, our algorithm is compared with the state-of-the-art segmentation approaches using the widely used evaluation metrics mentioned above. Like most of the proposed DNN-based methods, we perform a single run of our method and do not select the best result over multiple runs. However, unlike other papers, we also show the robustness and adaptability of the proposed CTumorGAN framework on the three medical tumor datasets with the same fixed hyper-parameter settings. In addition, we compare the segmentation results obtained with different objective functions.

On lung tumor dataset

As shown in Figs. 1 and 5, because of the severe label noise in the lung tumor dataset and its own network design, the original U-Net did not produce any tumor region prediction, and all its segmentation maps were classified as healthy tissue. SegNet, an outstanding semantic segmentation algorithm that stores pooling indices to preserve location information before and after pooling operations, gave accurate predictions for moderate-sized tumors but sometimes failed on large and obvious tumors in CT lung slices (Fig. 1), although its quantitative results look good, with 67.81% in Dice and 71.21% in Precision. As classical image-to-image translation methods, Pix2Pix (16.17% in Dice; 21.88% in Precision) and cGAN (20.57% in Dice; 28.42% in Precision) could not produce stable and acceptable tumor prediction maps for the corresponding conditional input images, which suggests that they do not handle this kind of discrete label map generation as well as continuous image content. The newest medical image segmentation algorithm from TMI2019, CE-Net, achieves competitive overall performance on the CT tumor segmentation task, with 72.97% in Dice and 64.95% in Precision in Fig. 5; however, it usually failed in some small tumor cases, as shown in Fig. 1. Moreover, the Precision of CE-Net was more than 12 percentage points lower than that of our proposed CTumorGAN, which reaches 77.34% (Fig. 5). In addition, the Dice results of the winners of this challenge (59.40% from the Markovian team; 52.10% from the NTU_MiRA team) were also less competitive than ours (71.08%), which further demonstrates the difficulty of this dataset and the robustness of our CTumorGAN.

Fig. 5
figure 5

Quantitative analysis and comparison among recent and popular methods with commonly used metrics on the lung tumor dataset. Here we also show the winners' results (Markovian and NTU_MiRA) from the IEEE Video and Image Processing Cup 2018 [5]

Currently, the Dice coefficient loss is very popular and widely used for biomedical image segmentation tasks. However, as illustrated in Table 2, our experimental results show that the Dice loss is not the most suitable choice for CT tumor segmentation: MSE yields much better results than Dice on multiple evaluation criteria except Precision (70.80% vs 71.62%). Furthermore, we considered the various loss functions that have been used in recent years to train neural network models for biomedical image segmentation, including the mean absolute error (MAE, also called L1 loss) [20], MSE (also called L2 loss) [21], cross entropy loss (weighted in this work) [22], Dice loss [23], and the combination of MSE and Dice [24]. In this work, we comprehensively compare the performance of these loss functions for segmenting lung tumors on CT slices. To this end, we fix the network architecture to our Generator networks, a kind of FCN with skip connections, and show the differences on the lung tumor dataset in Table 2. A significant observation from Table 2 is that MSE achieves superior performance under most evaluation metrics compared with the other popular objective functions for CT tumor segmentation. Moreover, our CTumorGAN achieves a relatively more robust performance across the different objective functions; in particular, its Precision of tumor segmentation is more than 6% higher than MSE (77.34% vs 70.80%) and nearly 6% higher than Dice (71.62%). In addition, MAE could not cope with this label-noisy dataset, and cross entropy obtained poor results on most indicators except Recall, where it achieved the highest value of 97.96%. On the whole, MSE + Dice did not produce a whole greater than the sum of its parts in all respects (for example, 67.08% in Dice; 66.14% in Precision; 74.03% in Recall; 69.86% in F1).

Table 2 Quantitative analysis and comparison among different loss functions by implementing them on our proposed Generator networks with commonly used metrics (%) for CT lung tumor segmentation. Note that italicized results mean the best performance under different metrics

On kidney tumor dataset

Note that on the following two datasets, we compare the methods that performed well on lung tumor segmentation, including ours, CE-Net, and FCN (based on our Generator networks with MSE), together with some winners or new algorithms from each corresponding challenge.

In the kidney tumor segmentation task, as shown in Fig. 6, our proposed CTumorGAN achieved the best performance on most key segmentation metrics. More specifically, the Precision of our method was 86.62%, significantly higher than CE-Net (74.89%) and FCN (82.40%). In addition, our CTumorGAN also obtained better Dice (83.88%), Jaccard index (77.16%), Specificity (99.93%), and F1 (85.59%) results than CE-Net (80.45%, 73.00%, 99.81%, and 82.60%) and FCN (83.82%, 77.07%, 99.89%, and 85.53%), whereas its Recall (84.59%) is a little lower than both (92.07% for CE-Net and 88.91% for FCN). In other words, our method can better locate the position of tumors in kidney slices, which will help physicians detect the early signs of cancer and provide early treatment. Moreover, our Dice result was also very competitive in comparison with the top 3 winners (85.09% from Fabian Isensee et al., 84.54% from Xiaoshuai Hou et al., 83.21% from Guangrui Mu et al.) among the more than 100 teams that participated in the KiTS2019 challenge [7].

Fig. 6
figure 6

Accuracy curves of different methods on the kidney tumor dataset by using multiple evaluation criteria. Here, we also compared the performance of the top 3 winners among more than 100 teams for the KiTS2019 challenge where only Dice results are available [7]

Furthermore, a qualitative boundary delineation comparison with the corresponding ground truths is shown in Fig. 7. For the large kidney tumors in the first row, our CTumorGAN (green lines) achieved more accurate tumor localization and segmentation than the others, whereas FCN (yellow lines) gave a wrong tumor indication on the third slice. For small tumors, the last two rows show that CE-Net (blue lines) usually misclassified healthy tissues as tumors compared with each corresponding ground truth (red lines). Similarly, FCN failed to detect the small tumor in the penultimate image, producing no output at all, and CE-Net failed in the last image. Because of the dim, low contrast of CT images and the lack of obvious visual difference between tumor and non-tumor areas in the kidney dataset, it is difficult to distinguish them and to clearly delineate the boundary of each tumor region, especially for tumors growing inside the kidney. Nevertheless, our proposed CTumorGAN can accurately localize the tumor areas and deliver more competitive and stable performance.

Fig. 7
figure 7

Qualitative boundary delineation comparison of kidney tumor segmentation results between CTumorGAN, CE-Net, FCN, and ground truth

On liver tumor dataset

On these contrast-enhanced abdominal CT scans of liver tumors, we further compare the performance of our CTumorGAN with other algorithms, including some newly proposed methods designed specifically for liver tumor segmentation and the top 3 winners of the online LiTS challenge.

Table 3 shows the overall performance of all the compared methods for the liver tumor segmentation task under different criteria. On the whole, our CTumorGAN predicts tumor localization and segmentation more accurately than the others. For example, its Precision (83.81%) was more than 9% higher than CE-Net (74.55%) and 7% higher than FCN (76.22%). For the key Dice indicator, FCN (78.75%), which is based on our Generator networks, was also superior to the challenge winners (the best of whom reached 75.70%), which shows that the proposed network architecture is well suited to localizing and segmenting tumors on computed tomography scans. Moreover, CTumorGAN and CE-Net both stood out with over 80% Dice accuracy, better even than some of the newly proposed liver segmentation algorithms (72.20% from H-DenseUNet, 71.10% from FCN + FC_ResNet, and 66.10% from Joint-Liver_Seg) that mostly employ the traditional two-stage segmentation idea or network architectures taken from general image classification. For the other evaluation metrics, the relative performance of these methods was similar to that observed on the kidney tumor dataset.

Table 3 Comparison of the tumor segmentation performance on the liver dataset with several robust methods and models, including the winners from the current LiTS Competition [9]. Note that italicized results mean best performance under different metrics (%)

Furthermore, in Figs. 8 and 9, a qualitative comparison among these outstanding methods against the ground truths is given for liver tumor segmentation from two perspectives: large and multiple tumors, and small tumors. For the former, as shown in Fig. 8, there is an obvious difference in whether the segmentation results have blurry boundaries or not. Our method not only gave more accurate tumor locations and segmentations but also clearly delineated the tumor contours without any blurry edges, whereas CE-Net and FCN both failed in this respect. In addition, although liver tumors look very similar to the surrounding healthy tissues, our method still outputs accurate tumor regions, while CE-Net and FCN usually gave redundant and wrong predictions for liver tumors. In Fig. 9, for small tumors, the misclassification by CE-Net and FCN was more serious, and FCN even failed completely in some small tumor cases, producing no prediction at all. The above analysis also explains why our method achieves a much higher Precision than the others in the liver tumor segmentation task.

Fig. 8
figure 8

The segmentation performance of the state-of-the-art methods for large and multiple tumors in different CT slices scanned from different patients. Note that the odd rows are the original input images, and the even rows are the corresponding tumor amplification results

Fig. 9
figure 9

Experimental results from different approaches for small target (small tumor) localization and segmentation compared with the ground truth. This task is also important for the detection of early lesions or tumors

Discussion

Main findings

Loss function choice

A major finding of this study is that the MSE loss is more suitable for computed tomography tumor segmentation. We have further discussed the effect of different kinds of loss functions on training a good CT tumor segmentation model. Currently, like MAE, MSE, and cross entropy, the Dice loss has also become a popular and widely used objective function for biomedical image segmentation. However, there is still a lack of in-depth analysis of which objective function is more suitable for a specific task, for instance, computed tomography tumor segmentation in this study. To this end, we compare and test these popular loss functions, including MAE (also called L1 loss) [20], MSE (also called L2 loss) [21], cross entropy loss (weighted in this work) [22], Dice loss [23], and the combination of MSE and Dice [24].

In the test on the lung tumor dataset, which suffers from a serious label noise problem, we found that MAE failed completely when training a set of parameters for the CT tumor segmentation task. In essence, MAE is not conducive to training neural networks as a main objective function because its gradient has the same magnitude throughout the whole learning process. In other words, the gradient of MAE remains large even for small losses, which means it may miss the optimal solution and cause the model to fail to converge. As a result, as shown in Table 2, MAE could not recognize tumors on computed tomography scans. By contrast, the gradient of the MSE loss grows as the loss increases and shrinks as the loss approaches 0. Cross entropy loss is very popular in image classification tasks, and many works have also tried to transfer it to image segmentation tasks. However, even when we use a category-balanced cross entropy loss for CT tumor segmentation, the segmentation results in Table 2 are unsatisfactory and even poor, although its Recall is the highest among all the compared loss functions. In other words, cross entropy can give predictions of tumor information on CT lung slices, but its false positive ratio is also very high, which is why its Precision is only 26.15%. Furthermore, the Dice loss has recently become very popular for training deep learning models for biomedical image processing. This loss function is derived from the Dice coefficient evaluation criterion; because the Dice coefficient is an important and widely used indicator of segmentation performance, some works use it directly as a loss function to guide the training of neural networks. The results in Table 2 confirm the competitive performance of the Dice loss for lung tumor segmentation, although its Dice coefficient result is slightly worse than that of MSE (70.14% vs 71.39%). Recently, some works have tried to fuse the Dice and MSE losses and achieved the desired performance on some tasks. However, in the CT tumor task, we observed that fusing the two loss functions resulted in worse performance than either one alone. In short, this interesting finding can provide good guidance to readers facing a choice among so many loss functions for this kind of task.
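
This gradient behavior is easy to verify numerically; the following toy snippet differentiates a single-pixel MAE and MSE loss at shrinking residuals, showing the constant MAE gradient versus the shrinking MSE gradient.

```python
import torch

# Gradient of each loss w.r.t. a single prediction p, for a target of 0.0:
# |p - t| has a constant-magnitude gradient, while (p - t)^2 has a gradient
# that shrinks as the prediction approaches the target.
t = torch.tensor(0.0)
for p_val in [1.0, 0.1, 0.01]:
    p = torch.tensor(p_val, requires_grad=True)
    mae = torch.abs(p - t)
    mae.backward()
    g_mae = p.grad.item()

    p = torch.tensor(p_val, requires_grad=True)
    mse = (p - t) ** 2
    mse.backward()
    g_mse = p.grad.item()
    print(f"residual {p_val:5.2f}: dMAE/dp = {g_mae:+.2f}, dMSE/dp = {g_mse:+.3f}")
# residual  1.00: dMAE/dp = +1.00, dMSE/dp = +2.000
# residual  0.10: dMAE/dp = +1.00, dMSE/dp = +0.200
# residual  0.01: dMAE/dp = +1.00, dMSE/dp = +0.020
```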

Adversarial learning for segmentation

A second finding is that this study provides evidence that adversarial learning can successfully train a deep neural network for biomedical image segmentation. In the field of image-to-image translation, (conditional) generative adversarial networks are effective tools for learning a mapping function from input images to output images, and many interesting applications generating different kinds of images from one domain to another can be found in the work of Pix2Pix [12]. Moreover, as a popular application in biomedical image analysis, GANs have been widely used for medical data augmentation to alleviate data scarcity when training deep neural networks [25]. Differently, in this work, we formulate the biomedical image segmentation task as an image-to-image translation (image generation) problem: given an input image, output the corresponding segmentation map. The original GAN and conditional GAN by themselves cannot deliver the expected segmentation performance, and even purely supervised FCNs with an L1 loss usually obtain better scores than models using the cGAN method [12] for generating semantic segmentation labels. In contrast, our proposed CTumorGAN makes full use of the characteristics of cGAN for image generation, while we also construct two additional specific modules to introduce strong supervised information for capturing tumor information and overcoming several obstacles of CT tumor segmentation, as shown in Fig. 2. In this way, CTumorGAN achieves more accurate tumor localization and segmentation than the state-of-the-art biomedical image segmentation algorithms, as illustrated in Figs. 5 and 6 and Table 3. On the three CT tumor datasets, we specifically compare the performance with cGAN (i.e., CTumorGAN) and without cGAN (i.e., FCN) under different evaluation criteria, where the two models share the same basic network architecture for a fair comparison. For example, for Precision, the results are 77.34% vs 70.80% on the lung tumor dataset in Table 2, 86.62% vs 82.40% on the kidney tumor dataset in Fig. 6, and 83.81% vs 76.22% on the liver tumor dataset in Table 3. In other words, this study also demonstrates the feasibility of adversarial learning techniques (such as ours, which is based on cGAN) for CT tumor segmentation, achieving better performance under various evaluation metrics.

Framework design and quality of segmentation results

A third finding of this study is that fusing prior knowledge about the specific problems into deep neural networks can greatly improve tumor segmentation performance and yield clear segmentation boundaries on computed tomography scans, which can provide new inspiration for biomedical image processing. In our framework, for the tumor information refinement module, we find that shallow layers of a convolutional neural network (CNN) provide more detailed, high-resolution information while deep layers provide more abstract semantic information, and combining the two helps to generate accurate pixel-level classification results. We therefore add multi-level skip connections to our Generator, which differs from traditional FCNs. Meanwhile, in order to precisely detect small tumors, we analyze the weaknesses of traditional CNNs and find that their successive pooling operations can discard the information of small tumors. Therefore, to reduce the size of the feature maps, we use strided convolution operations instead of all pooling operations. The excellent segmentation performance of our method for small tumors can be observed in Figs. 1, 8, and 9. Next, for the tumor shape context rectification module, a main problem in the training of most CNNs for segmentation is that they usually fail to capture the correlation between output components, although this step is critical for improving pixel-level segmentation accuracy. Furthermore, this module also helps to solve the severe class imbalance between tumor and non-tumor pixels in a CT slice. To this end, a kind of context information based on the Dice coefficient is utilized in the proposed module. For example, in terms of segmentation Precision, compared with the newest segmentation method, CE-Net from TMI2019, our CTumorGAN achieves 77.34% vs 64.95% on the lung tumor dataset (Fig. 5), 86.62% vs 74.89% on the kidney tumor dataset (Fig. 6), and 83.81% vs 74.55% on the liver tumor dataset (Table 3). As we can see, embedding useful prior knowledge into the design of a new network model for a specific task can significantly improve performance and, at the same time, help address some of the main difficulties of tumor segmentation. Moreover, from the visualization results in Figs. 1, 8, and 9, only our CTumorGAN obtains such clear, high-quality segmentation boundaries, close to the ground truths.

Main differences

In this section, we further contrast some related work with ours from different viewpoints, including the general image semantic segmentation approach, the hand-crafted feature-based approach, the early deep feature-based approach, the FCN-based approach, and the adversarial learning approach with cGAN.

General image semantic segmentation approach

Image semantic segmentation remains a popular task in computer vision, whose objective is to partition each input image into regions of homogeneous pixels or regions that share the same semantic meaning [26]. In contrast to the general image semantic segmentation task, biomedical image segmentation focuses on the regions of interest in the input slices, such as the lungs, kidneys, livers, blood vessels, spleen, ventricles, etc. To show the difference, we compare our CTumorGAN (71.08% Dice; 77.34% Precision) with SegNet (67.81% Dice; 71.21% Precision) in Fig. 5. As shown in Fig. 1, SegNet sometimes fails on small tumors and, in some cases, predicts only part of a large tumor, which shows that general image semantic segmentation methods cannot work well on biomedical image datasets for tumor segmentation. Moreover, biomedical image segmentation is a critical step in biomedical image analysis, especially for biomedical imaging departments and computer-aided diagnosis (CAD) systems. Although current state-of-the-art methods already achieve superior segmentation performance with over 90% accuracy for organs, lesion and tumor segmentation remains one of the most active and challenging research areas. This is because lesion and tumor regions usually resemble the surrounding healthy tissues and organs in appearance and other properties, which makes them difficult to distinguish and makes this task harder than segmenting different organs from a scanned body slice. As the tumor samples in Figs. 1, 7, 8, and 9 show, most tumors are difficult to distinguish even by eye, so the Dice scores of our CTumorGAN reach only 71.08% on the lung tumor dataset, 83.88% on the kidney tumor dataset, and 80.19% on the liver tumor dataset.
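For clarity, the Dice and Precision values quoted throughout this discussion can be computed from binary masks as in the short Python sketch below; the small epsilon used to avoid division by zero is an illustrative assumption.

import numpy as np

def dice_and_precision(pred_mask, gt_mask):
    """Compute Dice and Precision for binary tumor masks (1 = tumor pixel)."""
    pred = pred_mask.astype(bool)
    gt = gt_mask.astype(bool)
    tp = np.logical_and(pred, gt).sum()           # true positive pixels
    dice = 2.0 * tp / (pred.sum() + gt.sum() + 1e-8)
    precision = tp / (pred.sum() + 1e-8)          # TP / (TP + FP)
    return dice, precision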

Hand-crafted feature-based approach

In the early days, biomedical image segmentation methods were mainly based on traditional unsupervised theoretical frameworks or on learning-based techniques using hand-crafted features. For example, there are edge detection-based methods [27, 28], template matching-based methods [29,30,31], graph cut-based methods [32,33,34], Markov Random Field-based methods [35,36,37], etc. These classical methods usually exploit only image density, object shape, or other appearance information to segment the medically interesting regions. Subsequently, some learning-based approaches were proposed by manually designing specific features for different objects of interest; they can be roughly classified into unsupervised learning methods [38,39,40] and supervised learning methods [41,42,43,44,45]. As we can conclude, such single hand-crafted features can hardly compete with the multi-level deep features driven by large amounts of clinical data. Because of the poor performance of hand-crafted features, the corresponding comparisons are not assessed in this study; conclusions on this aspect can be found in the liver tumor segmentation benchmark [10]. This also highlights the large difference between the features used by our CTumorGAN and those of the single hand-crafted feature-based approaches for biomedical image segmentation.

Early deep feature-based approach

In order to extract more robust and multi-level features, many new biomedical image segmentation algorithms based on deep neural networks have been developed for specific modalities in recent years. Deep learning methods can be regarded as an automatic, hierarchical feature learning process based on a task-driven model trained with massive volumes of data. Due to their remarkable performance, deep learning-based methods have become the mainstream solution for numerous computer vision problems. Unfortunately, although a robust deep model requires large volumes of training data, in the context of medical imaging it is expensive and complicated to obtain large volumes of medical data and accurate annotations, because of the privacy of medical data and the time-consuming pixel-by-pixel labeling work that requires expert knowledge. To tackle this problem, many methods based on transfer learning have been proposed, transferring large numbers of well-trained parameters from general image classification tasks to medical image segmentation tasks. For example, the compared CE-Net uses the well-trained parameters of ResNet to initialize its network; however, whether such a direct transplantation from general image classification to biomedical tumor segmentation is appropriate deserves further probing, and the experimental results of this study on three public CT tumor datasets also reflect this difference. Meanwhile, earlier deep segmentation models were based on image patches and a sliding-window strategy to classify each image pixel, which is categorized as a local method [46,47,48,49]. However, due to the redundant computations and low efficiency of this strategy, newer deep learning-based methods, called global methods, train FCNs directly on whole images or large image patches. Our CTumorGAN also stems from extending and improving this global category, with higher computational efficiency and accuracy: we only need to feed an input CT slice once to obtain the segmentation prediction, unlike local deep learning methods that require many forward passes, patch by patch.
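The difference between the local (sliding-window) and global (fully convolutional) strategies mentioned above can be sketched as follows in Python; the patch size, stride, and the tiny model stubs are illustrative assumptions, not the networks evaluated in this study.

import torch
import torch.nn as nn

patch_classifier = nn.Sequential(nn.Flatten(), nn.Linear(32 * 32, 2))  # labels one center pixel per patch
fcn = nn.Conv2d(1, 1, kernel_size=3, padding=1)                        # labels every pixel in one pass

def local_inference(ct_slice, patch=32):
    """Sliding-window strategy: one forward pass per neighborhood (redundant and slow)."""
    h, w = ct_slice.shape[-2:]
    out = torch.zeros(h, w)
    for i in range(0, h - patch, patch // 2):       # coarse stride, just for illustration
        for j in range(0, w - patch, patch // 2):
            logits = patch_classifier(ct_slice[..., i:i + patch, j:j + patch])
            out[i + patch // 2, j + patch // 2] = logits.argmax(dim=-1).item()
    return out

def global_inference(ct_slice):
    """Fully convolutional strategy: the whole slice is processed in a single forward pass."""
    return torch.sigmoid(fcn(ct_slice))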

FCN-based approach

Among FCN models for medical image segmentation, U-Net is a widely used convolutional neural network framework [13], and many U-Net based approaches have been proposed recently, e.g., nnU-Net [4], UNet++ [50], SS-UNet [51], MultiResUNet [52], and Attention UNet [53]. Besides its special autoencoder architecture with multiple skip connections for medical image segmentation, another highlight is that this model can achieve promising performance on small datasets [54]. In essence, most previous deep neural network methods regard the image segmentation task as a pixel-wise classification task. However, the pixel-wise loss between predicted results and ground truths usually cannot capture global context information among output pixels, so these methods often rely on CRFs to refine the predicted maps. In addition, the consecutive pooling operations in FCNs such as U-Net usually lead to information loss, especially for small objects. Another point that should be noted is that there are large differences between designing deep network models for pixel-level classification and for general image classification, so it is still arguable whether directly transferring learned weights from general image classification tasks to medical image segmentation tasks is appropriate. Furthermore, post-processing refinement techniques such as CRF and graph cut methods are limited to pairwise models with hand-crafted hyper-parameters and are time-consuming [22]; in particular, they may cause serious leakage at lesion or tumor boundary regions with low contrast on computed tomography scans. To address these limitations of current biomedical image segmentation methods, we propose a novel network model with different modules that deal with these deficiencies. We then compare our CTumorGAN with U-Net, FCN + FC_ResNet, and H-DenseUNet in this study, and ours still achieves superior performance on the different datasets under most evaluation metrics.
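For readers unfamiliar with the U-Net family discussed here, the following minimal Python sketch shows the core encoder-decoder structure with one skip connection; the channel counts are illustrative assumptions, and the cited architectures are considerably deeper.

import torch
import torch.nn as nn

class TinyUNet(nn.Module):
    """Minimal U-Net-style encoder-decoder with a single skip connection."""
    def __init__(self):
        super().__init__()
        self.enc = nn.Sequential(nn.Conv2d(1, 16, 3, padding=1), nn.ReLU())
        self.down = nn.MaxPool2d(2)                        # pooling, as in the original U-Net
        self.bottleneck = nn.Sequential(nn.Conv2d(16, 32, 3, padding=1), nn.ReLU())
        self.up = nn.ConvTranspose2d(32, 16, 2, stride=2)
        self.dec = nn.Sequential(nn.Conv2d(32, 16, 3, padding=1), nn.ReLU(),
                                 nn.Conv2d(16, 1, 1))

    def forward(self, x):
        e = self.enc(x)                                    # high-resolution features
        b = self.bottleneck(self.down(e))                  # abstract, low-resolution features
        u = self.up(b)
        return self.dec(torch.cat([e, u], dim=1))          # skip connection restores spatial detail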

Adversarial learning approach with cGAN

In comparison with the previously proposed methods, applying cGAN to medical image segmentation has received comparatively little attention [20, 22]. Image generation with GANs has become one of the currently active research directions; its goal is to force the generated samples to follow the same distribution as the real samples. Nonetheless, GANs are usually used to generate images with continuous-valued variations rather than segmentation outputs with discrete labels [12]. Furthermore, while it is difficult to tune GAN parameters to generate the desired images, GANs may still have good potential for medical image segmentation, as can be observed from [20, 22]. To this end, we use the Pix2pix and original cGAN approaches to train tumor segmentation models that generate CT tumor segmentation maps. Although both methods can produce images that look like segmentation results, as shown in Fig. 1, the predicted tumor locations do not correspond to the real tumor regions of the input CT slices. In order to deal with this problem, we introduce another kind of prior knowledge by adding secondary negative samples \( \left({c}_i,{x}_i^{-}\right) \) to guide the network training process (Algorithm 1) to distinguish tumor and non-tumor regions, which is clearly illustrated in Fig. 3. As a result, our CTumorGAN can precisely localize the tumor positions of input images.
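To make the role of the secondary negative samples more concrete, the Python sketch below illustrates one way a conditional discriminator can be trained on real pairs, generated pairs, and mismatched (negative) pairs; it is a simplified illustration under our own assumptions rather than a verbatim transcription of Algorithm 1.

import torch
import torch.nn as nn

bce = nn.BCEWithLogitsLoss()

def discriminator_loss(D, G, ct_slice, gt_mask, neg_mask):
    """Train D to accept (condition, ground truth) pairs and reject generated and mismatched pairs.

    neg_mask is a secondary negative sample: a mask that does not match ct_slice,
    which pushes D (and hence G) to respect the true tumor location.
    """
    real_pair = torch.cat([ct_slice, gt_mask], dim=1)
    fake_pair = torch.cat([ct_slice, G(ct_slice).detach()], dim=1)
    neg_pair = torch.cat([ct_slice, neg_mask], dim=1)

    out_real = D(real_pair)
    out_fake = D(fake_pair)
    out_neg = D(neg_pair)
    return (bce(out_real, torch.ones_like(out_real))
            + bce(out_fake, torch.zeros_like(out_fake))
            + bce(out_neg, torch.zeros_like(out_neg)))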

Conclusion

In order to overcome the deficiencies of current biomedical image segmentation algorithms and to alleviate the challenges arising from computed tomography scans, we have proposed a novel end-to-end generative adversarial learning framework, called CTumorGAN, for CT tumor localization and segmentation. In our CTumorGAN framework, the Generator aims to generate authentic segmentation results to fool the Discriminator, while the Discriminator attempts to distinguish between generated samples and real tumor ground truths based on the same conditional inputs. Furthermore, in order to tackle several well-known obstacles in the tumor segmentation task, we propose two other key modules, i.e., tumor information refinement and tumor shape context rectification, to better guide the CTumorGAN training process with multi-level supervision. The proposed CTumorGAN is tested on lung tumor, kidney tumor, and liver tumor datasets under the same settings, and its performance is superior to that of other state-of-the-art medical image segmentation approaches. Moreover, the proposed framework can be directly extended to localize and segment different types of tumors on computed tomography scans. We believe the novel CT tumor segmentation algorithm will be helpful for medical experts in choosing therapeutic strategies for tumor diseases and in treatment monitoring. In the future, we plan to conduct further experimental analysis to generalize our CTumorGAN to other clinical imaging modalities. In addition, we also plan to promote our technique by embedding it into websites as a remote computer-aided tumor diagnosis expert system, providing more online services for physicians and patients.