
1 Introduction

1.1 Motivation

The liver is the largest abdominal organ in the human body and is frequently threatened by disease and drug-induced damage. According to the Global Hepatitis Report 2017 released by the World Health Organization, about 325 million people worldwide are chronically infected with hepatitis B or hepatitis C virus, infections that can persist for life and eventually lead to progressive liver damage. CT imaging is a convenient way to examine abdominal organs, and radiologists often inspect the shape and texture of a patient's liver in CT images to find visible lesions. However, manual analysis of abdominal CT images is tedious and inefficient, so automatic liver segmentation in CT images is an important research problem. As shown in Fig. 1, the task faces many challenges, such as large variations in size and shape, unclear boundaries, and differing degrees of lesions, which is why automatic segmentation has not yet reached clinical use.

Fig. 1. Example of liver CT images displaying large variations

1.2 Related Work

Traditional manual segmentation requires extensive radiological experience and is very time consuming. To help radiologists improve their efficiency, researchers have turned their attention to automatic liver segmentation, and many automatic methods have been proposed in recent years. Pham et al. [16] proposed a method based on grayscale and texture that combines global thresholding, region growing, voxel classification, and edge detection, but it tends to leak across blurred boundaries. Chan et al. [1] proposed a variational semi-automatic liver segmentation method that takes prior knowledge and morphological features into account. Zheng et al. [22] and Zhang et al. [21] proposed point-based statistical shape models (SSMs), which can achieve higher accuracy when only a small amount of data is available.

2 Background

2.1 Image Segmentation Network

The fully convolutional network (FCN) [13] is the basic architecture for many semantic segmentation tasks; it consists of a cascaded downsampling path and an upsampling path. The U-net [18] extends the FCN by introducing skip connections between the downsampling path and the corresponding upsampling path, which improve performance by facilitating the flow of information. The DeepLab architecture [2] introduced atrous convolution and atrous pooling into the CNN architecture, and building on it, Chen et al. proposed DeepLabV3 [3], whose upsampling path consists of only a few convolution layers, unlike the upsampling paths used in the FCN and the U-net. Jegou et al. [10] introduced dense connections [8] into the FCN for segmentation tasks.

In recent years, fully convolutional networks have developed rapidly in the field of semantic segmentation and have achieved remarkable results in segmentation competitions, and their exploration for medical image segmentation has begun. Christ et al. [4] and Vorontsov et al. [19] proposed cascaded fully convolutional networks that combine two FCNs: the first network segments the liver, and the second segments liver lesions within the result of the first. The segmentation results are then refined by a 3D conditional random field. The authors of [5] proposed a deeply supervised network for liver segmentation; its input is part of a 3D bounding box, which has to be slid over a target scan at test time. To alleviate the vanishing-gradient problem, the network uses additional deconvolution layers to generate feature maps from two intermediate layers, and the gradients of the loss are calculated from several branches. A three-dimensional convolutional neural network is proposed in [7]; the method has two steps. First, a deep 3D CNN is trained to learn a prior map of the liver. Then, to refine the segmentation, both global and local appearance information from the prior segmentation is adaptively incorporated into a segmentation model. Rafiei et al. [17] proposed a 3D-2D fully convolutional network that makes full use of the spatial information in the CT volume to segment the liver while keeping computation and memory consumption moderate.

2.2 GAN

GAN [6] has achieved great success in the field of image generation. Inspired by the theory of zero-sum games, it casts the generation problem as conflict and cooperation between two players, a generator and a discriminator: the generator first produces images from noise, and then, guided by the discriminator, the quality of the images improves step by step. CGAN [14] has achieved impressive results on various image-to-image translation problems, such as image super-resolution [11], image inpainting [15], and style transfer [12]. CGAN treats the original image as a constraint, feeding it into the network along with the noise input and adding the conditioning variable to the loss functions of both the generator and the discriminator; under these circumstances, the generated image is of higher quality and closer to the real image. Zhang et al. [20] first proposed a stacked GAN model called StackGAN. The model is divided into two stages: the first stage generates a coarse image containing mainly primitive shapes and colors from a given text description, and the second stage takes the coarse image and the text description as input to gradually generate a high-resolution image. Huang et al. [9] designed a stacked GAN that is trained to invert the hierarchical representations of a bottom-up discriminative network; each GAN in the stack learns to generate lower-level representations conditioned on higher-level representations in order to produce higher-quality images.

3 Method

In this paper, we propose a cascade model with adversarial learning for liver segmentation. The algorithm can segment the liver accurately and efficiently from CT slices. It consists of two stages: in the first stage, the 2D slice images are fed into the first-stage U-net segmentation network to obtain initial segmentation results; in the second stage, the CT slices are concatenated with the results of the first U-net and fed into the second-stage U-net segmentation network to obtain more accurate segmentation results. The training of both stages uses an adversarial loss. The model is illustrated in Fig. 2.

Fig. 2. The proposed cascade model
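
To make the data flow concrete, the following is a minimal, shape-level sketch of the two-stage cascade in Python/NumPy. The two network functions are hypothetical stand-ins for the stage-wise U-nets (the real architectures appear in Sect. 3.1); only the concatenation of the CT slice with the first-stage mask is taken from the text.

```python
# Minimal sketch of the two-stage cascade data flow; `stage1_net` and
# `stage2_net` are hypothetical stand-ins for the two U-net generators.
import numpy as np

def stage1_net(ct_slice):
    # placeholder: a real U-net would map a CT slice to a coarse liver mask
    return np.random.rand(*ct_slice.shape[:2], 1)

def stage2_net(slice_and_mask):
    # placeholder: refines the mask given the slice concatenated with it
    return np.random.rand(*slice_and_mask.shape[:2], 1)

ct_slice = np.random.rand(256, 256, 1)                           # one 2D CT slice
coarse_mask = stage1_net(ct_slice)                               # stage 1 output
stage2_input = np.concatenate([ct_slice, coarse_mask], axis=-1)  # concat channels
refined_mask = stage2_net(stage2_input)                          # stage 2 output
print(refined_mask.shape)                                        # (256, 256, 1)
```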

3.1 Model

Generative Adversarial Networks (GANs) consist of two components: a generator G and a discriminator D. The two components compete in a zero-sum game, in which the generator G aims to produce a realistic image from an input z drawn from a certain distribution, while the discriminator D is forced to distinguish whether a given image was generated by G or is indeed a real one from the dataset. This adversarial competition pushes both components toward better performance, until it becomes hard for D to differentiate the output of G from real data. Conditional Generative Adversarial Networks (CGANs) extend GANs by introducing additional observed information, namely a conditioning variable, to both the generator G and the discriminator D.

The architecture of the proposed model is shown in Fig. 3. It consists of two stacked CGANs, CGAN1 (G1, D1) and CGAN2 (G2, D2), with the second stacked on top of the first. A CGAN is an extension of a GAN that introduces an additional conditioning variable for both the generator and the discriminator; in the proposed model, the CT slice is the conditioning variable. The architectures of the generator G and the discriminator D (omitting BN and ReLU layers) are given in Tables 1 and 2, respectively.

Fig. 3. The architecture of the cascade CGAN model for liver segmentation

Table 1. Architecture of the generator G.
Table 2. Architecture of the discriminator D.

Both the generator G1 and the discriminator D1 of the first CGAN are conditioned on the CT image x. G1 is trained to produce the mask G1(x, z) corresponding to the CT image, and y denotes the ground truth corresponding to x.

The objective function of the CGAN1 is:

$$ \mathcal{L}_{adversarial_1}(G_1, D_1) = E_{x,y \sim p_{data}(x,y)}[\log D_1(x,y)] + E_{x \sim p_{data}(x),\, z \sim p_z(z)}[\log(1 - D_1(x, G_1(x,z)))] $$
(1)

In order to obtain deterministic results from G1, we eliminate random noise and simplify the formula as follows:

$$ \mathcal{L}_{adversarial_1}(G_1, D_1) = E_{x,y \sim p_{data}(x,y)}[\log D_1(x,y)] + E_{x \sim p_{data}(x)}[\log(1 - D_1(x, G_1(x)))] $$
(2)

In addition to adversarial loss, L1 loss is also applied to obtain more accurate pixel-level classification results:

$$ \mathcal{L}_{data_1}(G_1) = E_{x,y \sim p_{data}(x,y)}\left[\lVert y - G_1(x) \rVert_1\right] $$
(3)

So, the final objective function of CGAN1 is:

$$ \mathcal{L}_{CGAN_1} = \mathcal{L}_{adversarial_1} + \lambda \mathcal{L}_{data_1} $$
(4)
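
As an illustration, the following is a minimal NumPy sketch of how Eqs. (2)–(4) combine on a mini-batch. It assumes the discriminator outputs probabilities in (0, 1) and replaces the expectations with batch means; the function names and dummy data are ours, not from the paper.

```python
import numpy as np

def cgan1_objective(d_real, d_fake, y, g_out, lam=100.0):
    """Eq. (4): adversarial loss (Eq. 2) plus lambda-weighted L1 loss (Eq. 3)."""
    eps = 1e-8                                    # numerical stability for log
    adv = np.mean(np.log(d_real + eps)) + np.mean(np.log(1.0 - d_fake + eps))
    l1 = np.mean(np.abs(y - g_out))               # Eq. (3): pixel-wise L1
    return adv + lam * l1

d_real = np.random.uniform(0.6, 0.9, size=8)      # D1(x, y) on real pairs
d_fake = np.random.uniform(0.1, 0.4, size=8)      # D1(x, G1(x)) on generated pairs
y = (np.random.rand(8, 256, 256, 1) > 0.5).astype(np.float32)  # ground-truth masks
g_out = np.random.rand(8, 256, 256, 1).astype(np.float32)      # generator outputs
print(cgan1_objective(d_real, d_fake, y, g_out))
```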

λ is the weighting coefficient of the L1 loss. CGAN2 consists of G2 and D2, and we use a similar objective function. The adversarial loss of CGAN2 is:

$$ \mathcal{L}_{adversarial_2}(G_2, D_2 \mid G_1) = E_{x,y,r \sim p_{data}(x,y,r)}[\log D_2(x,y,r)] + E_{x \sim p_{data}(x)}[\log(1 - D_2(x, G_1(x), G_2(x, G_1(x))))] $$
(5)

The difference is that CGAN2 takes the CT slices concatenated with the output of CGAN1 as input. Finally, the objective function of the whole model is:

$$ \mathcal{L}_{total} = \min_{G_1,G_2} \max_{D_1,D_2} \; \lambda \mathcal{L}_{data_1}(G_1) + \lambda \mathcal{L}_{data_2}(G_2 \mid G_1) + \mathcal{L}_{adversarial_1}(G_1, D_1) + \mathcal{L}_{adversarial_2}(G_2, D_2 \mid G_1) $$
(6)

The output of CGAN1 is fed into CGAN2 as prior knowledge to help it obtain more accurate segmentation results.
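
Under the same conventions as the sketch above, the total objective of Eq. (6) is just the sum of the four terms; a one-function sketch (the signature is ours):

```python
def total_objective(adv1, adv2, data1, data2, lam=100.0):
    # Eq. (6): G1, G2 minimize and D1, D2 maximize this combined quantity
    return lam * data1 + lam * data2 + adv1 + adv2
```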

3.2 Training

Training is divided into two phases. The first phase employs an alternating training scheme: each time a mini-batch of CT slices is fed into the model, G1 and D1 are updated first with G2 and D2 fixed, and then G2 and D2 are updated with G1 and D1 fixed. After 10 epochs of training, the second phase begins, in which the entire model is trained end-to-end for several epochs, updating CGAN1 and CGAN2 simultaneously.
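
The schedule can be summarized by the following skeleton, in which the update functions are hypothetical placeholders for one optimizer step on the corresponding sub-networks; the epoch counts (10 + 5 = 15) follow this section and Sect. 4.2.

```python
def update_cgan1(batch): pass   # one step on G1, D1 (G2, D2 fixed)
def update_cgan2(batch): pass   # one step on G2, D2 (G1, D1 fixed)
def update_all(batch): pass     # one joint step on both CGANs

batches = [None] * 4            # dummy stand-in for the mini-batch loader

for epoch in range(10):         # phase 1: alternating updates
    for batch in batches:
        update_cgan1(batch)
        update_cgan2(batch)

for epoch in range(5):          # phase 2: end-to-end fine-tuning
    for batch in batches:
        update_all(batch)
```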

4 Experiment

4.1 Dataset

We evaluate the proposed method on the dataset provided by the Liver Tumor Segmentation Challenge (LiTS), organized in conjunction with ISBI and MICCAI. This open dataset includes 131 CT volumes with corresponding expert-annotated labels, collected from six different clinical sites using different scanners and scanning protocols, with varying slice resolutions and slice spacings. We randomly divided the dataset into 10 folds of about 13 volumes each; one fold is used for validation and the others for training, i.e., 10-fold cross-validation. For data preprocessing, we clip the image intensity values of all volumes to the range [−200, 250] HU to reduce irrelevant information.
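
A minimal sketch of this intensity preprocessing follows; the clipping window [−200, 250] HU is from the text, while the rescaling to [0, 1] afterwards is our assumption for illustration.

```python
import numpy as np

def preprocess(volume_hu: np.ndarray) -> np.ndarray:
    clipped = np.clip(volume_hu, -200.0, 250.0)   # HU window stated in the text
    return (clipped + 200.0) / 450.0              # assumed [0, 1] normalization

volume = np.random.randint(-1000, 1500, size=(64, 512, 512)).astype(np.float32)
out = preprocess(volume)
print(out.min(), out.max())                       # values now lie in [0, 1]
```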

4.2 Implementation

The proposed model was implemented in TensorFlow 1.12.0. All experiments were performed on a workstation equipped with two Intel Xeon E5-2609 processors, 64 GB of RAM, and one NVIDIA TITAN Xp GPU.

We use the Adam solver to train CGAN1 and CGAN2 with a mini-batch size of 8. The model was trained for 15 epochs with an initial learning rate of 0.0002; in the second phase of training, the learning rate is divided by 10. In our experiments, λ is set to 100.
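
For reference, the stated hyperparameters collected in one place; the dictionary keys are our own naming, not from the paper.

```python
train_config = {
    "optimizer": "Adam",
    "batch_size": 8,
    "epochs_total": 15,
    "learning_rate_phase1": 2e-4,   # initial learning rate
    "learning_rate_phase2": 2e-5,   # divided by 10 in the second phase
    "lambda_l1": 100.0,             # weight of the L1 term in Eq. (4)
}
```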

5 Comparison with Other Methods

The proposed model is compared with two baseline methods: U-net, which is widely used in biomedical image segmentation tasks, and the 3D-2D model proposed in [17], which makes full use of the volume's 3D spatial information.

The Dice score is employed to evaluate our model. It is defined as:

$$ Dice = \frac{{2\left| {X \cap Y} \right|}}{\left| X \right| + \left| Y \right|} $$
(7)

where X and Y are the predicted segmentation and the ground truth, respectively, and |·| denotes the number of pixels in a region; the Dice score lies in the interval [0, 1], and a perfect segmentation yields a score of 1. Dice is usually reported as Dice per case and Dice global: the Dice per case score is the average Dice score per volume, while the Dice global score is computed over all volumes combined into one. In this paper, we employ the Dice per case score to evaluate liver segmentation performance.
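
A short sketch of Eq. (7) and of the Dice per case average over volumes; the inputs are assumed to be binary masks.

```python
import numpy as np

def dice(pred: np.ndarray, gt: np.ndarray) -> float:
    # Eq. (7): 2|X ∩ Y| / (|X| + |Y|) for binary masks
    inter = np.logical_and(pred, gt).sum()
    denom = pred.sum() + gt.sum()
    return 2.0 * inter / denom if denom > 0 else 1.0  # two empty masks agree

def dice_per_case(preds, gts):
    # average Dice over volumes; "Dice global" would instead pool all
    # volumes into one before applying Eq. (7)
    return float(np.mean([dice(p, g) for p, g in zip(preds, gts)]))

preds = [np.random.rand(64, 128, 128) > 0.5 for _ in range(3)]
gts = [np.random.rand(64, 128, 128) > 0.5 for _ in range(3)]
print(dice_per_case(preds, gts))
```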

Quantitative results of the compared methods, obtained through cross-validation, are shown in Table 3. Our model achieves more accurate segmentation results. This is because we introduce an adversarial loss, which can be seen as a high-level loss that provides more global optimization information to the model. In addition, the cascade structure further improves performance: using the output of the first stage as prior knowledge for the second stage provides more reliable information to the second-stage CGAN and helps it achieve more accurate segmentation. Some results are shown in Fig. 4.

Table 3. Quantitative comparison between the proposed method and other methods
Fig. 4. Results of the proposed method

6 Conclusion

In this paper, we propose a cascade model with adversarial learning for liver segmentation. The model can segment the liver accurately and efficiently from CT slice images. It consists of two stages: in the first stage, the 2D CT slice image is fed into the first-stage CGAN segmentation network to obtain an initial segmentation result; in the second stage, the CT slice image is concatenated with the corresponding first-stage segmentation result and fed into the second-stage CGAN to obtain more accurate segmentation results. The main contributions of this work are as follows. First, the generative adversarial method is introduced into the liver segmentation task in CT images; compared with the traditional L1 and L2 losses, the adversarial loss yields sharper predictions and reduces ambiguity in the segmentation results. Second, owing to the cascade structure, the segmentation result of the first stage provides prior knowledge for the second stage, so more accurate segmentation results can be obtained. Further research can explore more efficient CGAN cascading methods.