Keywords

1 Introduction

Delineating anatomical structures is an important task within medical imaging, e.g., to generate biomarkers, quantify or track disease progression, or to plan radiation therapy. Manual delineation is prohibitively expensive, which has led to a considerable body of work on automatic segmentation. However, a perennial problem is that models trained on available image/mask pairs, e.g., publicly available data, do not always reflect clinical conditions upon deployment, e.g., the present pathologies, patient characteristics, scanners, and imaging protocols. This leads to potentially drastic performance gaps. When multi-modal or multi-phase imagery is present these challenges are further compounded, as datasets may differ in their composition of available modalities or consist of heterogeneous combinations of modalities. The challenges then are in both managing new patient/disease variations and in handling heterogeneous multi-modal data. Ideally segmentation models can be deployed without first annotating large swathes of additional data matching deployment scenarios. This is our goal, where we introduce co-heterogenous and adaptive segmentation (CHASe), which can adapt models trained on single-modal and public data to produce state-of-the-art results on multi-phase and multi-source clinical data with no extra annotation cost.

Fig. 1.
figure 1

Ground truth and predictions are rendered in green and red, respectively. Despite performing excellently on labeled source data, state-of-the-art fully-supervised models  [25] can struggle on cohorts of multi-phase data with novel conditions, e.g., the patient shown here with splenomegaly and a TACE-treated tumor. CHASe can adapt such models to perform on new data. (Color figure online)

Our motivation stems from challenges in handling the wildly variable data found in picture archiving and communication systems (PACSs), which closely follows deployment scenarios. In this study, we focus on segmenting pathological livers and lesions from dynamic computed tomography (CT). Liver disease and cancer are major morbidities, driving efforts toward better detection and characterization methods. The dynamic contrast CT protocol images a patient under multiple time-points after a contrast agent is injected, which is critical for characterizing liver lesions  [30]. Because accurate segmentation produces important volumetric biomarkers  [3, 12], there is a rich body of work on automatic segmentation  [10, 20, 23, 25, 28, 36, 41, 46, 48], particularly for CT. Despite this, all publicly available data  [3, 7, 11, 18, 38] is limited to venous-phase (single-channel) CTs. Moreover, when lesions are present, they are typically limited to hepatocellular carcinoma (HCC) or metastasized tumors, lacking representation of intrahepatic cholangiocellular carcinoma (ICC) or the large bevy of benign lesion types. Additionally, public data may not represent other important scenarios, e.g.,the transarterial chemoembolization (TACE) of lesions or splenomegaly, which produce highly distinct imaging patterns. As Fig. 1 illustrates, even impressive leading entries  [25] within the public LiTS challenge  [3], can struggle on clinical PACS data, particularly when applied to non-venous contrast phases.

To meet this challenge, we integrate together powerful, but complementary, strategies: hetero-modality learning, appearance-based consistency constraints, mask-based adversarial domain adaptation (ADA), and pseudo-labeling. The result is a semi-supervised model trained on smaller-scale supervised public venous-phase data  [3, 7, 11, 18, 38] and large-scale unsupervised multi-phase data. Crucially, we articulate non-obvious innovations that avoid serious problems from a naive integration. A key component is co-training  [4], but unlike recent deep approaches  [44, 49], we do not need artificial views, instead treating each phase as a view. We show how co-training can be adopted with a minimal increase of parameters. Second, since CT studies from clinical datasets may exhibit any combination of phases, ideally liver segmentation should also be able to accept whatever combination is available, with performance topping out as more phases are available. To accomplish this, we fuse hetero-modality learning  [15] together with co-training, calling this co-heterogeneous training. Apart from creating a natural hetero-phase model, this has the added advantage of combinatorially increasing the number of views for co-training from 4 to 15, boosting even single-phase performance. To complement these appearance-based semi-supervision strategies, we also apply pixel-wise ADA  [40], guiding the network to predict masks that follow a proper shape distribution. Importantly, we show how ADA can be applied to co-heterogeneous training with no extra computational cost over adapting a single phase. Finally, we address edge cases using a principled pseudo-labelling technique specific to pathological organ segmentation. These innovations combine to produce a powerful approach we call CHASe.

We apply CHASe to a challenging unlabelled dataset of 1147 dynamic-contrast CT studies of patients, all with liver lesions. The dataset, extracted directly from a hospital PACS, exhibits many features not seen in public single-phase data and consists of a heterogeneous mixture of non-contrast (NC), arterial (A), venous (V), and delay (D) phase CTs. With a test set of 100 studies, this is the largest, and arguably the most challenging, evaluation of multi-phase pathological liver segmentation to date. Compared to strong fully-supervised baselines  [14, 25] only trained on public data, CHASe can dramatically boost segmentation performance on non-venous phases, e.g., moving the mean Dice-Sørensen coefficient (DSC) from 84.5 to 94.0 on NC CTs. Importantly, performance is also boosted on V phases, i.e., from 90.7 mean DSC to 94.9. Inputting all available phases to CHASe maximizes performance, matching desired behavior. Importantly, CHASe also significantly improves robustness, operating with much greater reliability and without deleterious outliers. Since CHASe is general-purpose, it can be applied to other datasets or even other organs with ease.

2 Related Work

Liver and Liver Lesion Segmentation. In the past decade, several works addressed liver and lesion segmentation using traditional texture and statistical features  [10, 23, 41]. With the advent of deep learning, fully convolutional networks (FCNs) have quickly become dominant. These include 2D  [2, 36], 2.5D  [13, 43], 3D  [20, 28, 45, 48], and hybrid  [25, 46] FCN-like architectures. Some reported results show that 3D models can improve over 2D ones, but these improvements are sometimes marginal  [20].

Like related works, we also use FCNs. However, all prior works are trained and evaluated on venous-phase CTs in a fully-supervised manner. In contrast, we aim to robustly segment a large cohort of multi-phase CT PACS data in a semi-supervised manner. As such, our work is orthogonal to much of the state-of-the-art, and can, in principle, incorporate any future fully-supervised solution as a starting point.

Semi-supervised Learning. Annotating medical volumetric data is time consuming, spurring research on semi-supervised solutions  [39]. In co-training, predictions of different “views” of the same unlabelled sample are enforced to be consistent  [4]. Recent works integrate co-training to deep-learning  [31, 44, 49]. While CHASe is related to these works, it uses different contrast phases as views and therefore has no need for artificial view creation  [31, 44, 49]. More importantly, CHASe effects a stronger appearance-based semi-supervision by fusing co-training with hetero-modality learning (co-heterogenous learning). In addition, CHASe complements this appearance-based strategy via prediction-based ADA, resulting in significantly increased performance.

Adversarial domain adaptation (ADA) for semantic segmentation has also received recent attention in medical imaging  [39]. The main idea is to align the distribution of predictions or features between the source and target domains, with many successful approaches  [6, 9, 40, 42]. Like Tsai et al., we align the distribution of mask shapes of source and target predictions  [40]. Unlike Tsai et al., we use ADA in conjunction with appearance-based semi-supervision. In doing so, we show how ADA can effectively adapt 15 different hetero-phase predictions at the same computational cost as a single-view variant. Moreover, we demonstrate the need for complementary semi-supervised strategies in order to create a robust and practical medical segmentation system.

In self-learning, a network is first trained with labelled data. The trained model then produces pseudo labels for the unlabelled data, which is then added to the labelled dataset via some scheme  [24]. This approach has seen success within medical imaging  [1, 29, 47], but it is important to guard against “confirmation bias”  [26], which can compound initial misclassifications  [39]. While we thus avoid this approach, we do show later that co-training can also be cast as self-learning with consensus-based pseudo-labels. Finally, like more standard self-learning, we also generate pseudo-labels to finetune our model. But, these are designed to deduce and correct likely mistakes, so they do not follow the regular “confirmation”-based self-learning framework.

Hetero-Modality Learning. In medical image acquisition, multiple phases or modalities are common, e.g., dynamic CT. It is also common to encounter hetero-modality data, i.e., data with possible missing modalities  [15]. Ideally a segmentation model can use whatever phases/modalities are present in the study, with performance improving the more phases are available. CHASe uses hetero-modal fusion for fully-supervised FCNs  [15]; however it fuses it with co-training, thereby using multi-modal learning to perform appearance-based learning from unlabelled data. Additionally, this co-heterogeneous training combinatorially increases the number of views for co-training, significantly boosting even single-phase performance by augmenting the training data. To the best of our knowledge, we are the first to propose co-heterogenous training.

3 Method

Although CHASe is not specific to any organ, we will assume the liver for this work. We assume we are given a curated and labelled dataset of CTs and masks, e.g., from public data sources. We denote this dataset \(\mathcal {D}_{\ell } = \{\mathcal {X}_{i}, Y_{i}\}_{i=1}^{N_{\ell }}\), with \(\mathcal {X}_{i}\) denoting the set of available phases and \(Y_{i}(k) \in \{0,1,2\}\) indicating background, liver, and lesion for all pixel/voxel indices k. Here, without loss of generality, we assume the CTs are all V-phase, i.e., \(\mathcal {X}_{i}=V_{i} \,\,\forall \mathcal {X}_{i} \in \mathcal {D}_{\ell }\). We also assume we are given a large cohort of unlabelled multi-phase CTs from a challenging and uncurated clinical source, e.g., a PACS. We denote this dataset \(\mathcal {D}_{u} = \{\mathcal {X}_{i}\}_{i=1}^{N_{u}}\), where \(\mathcal {X}_{i}=\{ NC _{i}, A_{i}, V_{i}, D_{i}\}\) for instances with all contrast phases. When appropriate, we drop the i for simplicity. Our goal is to learn a segmentation model which can accept any combination of phases from the target domain and robustly delineate liver or liver lesions, despite possible differences in morbidities, patient characteristics, and contrast phases between the two datasets.

Fig. 2.
figure 2

Overview of CHASe. As shown at the top, labelled data, V-phase in our experiments, are trained using standard segmentation losses. Multi-phase unlabelled data are inputted into an efficient co-heterogenous pipeline, that minimizes divergence between mask predictions across all 15 phase combinations. ADA and specialized pseudo-labelling are also applied. Not shown are the deeply-supervised intermediate outputs of the PHNN backbone.

Figure 2 CHASe, which integrates supervised learning, co-heterogenous training, ADA, and specialized pseudo-labelling. We first start by training a standard fully-supervised segmentation model using the labelled data. Then under CHASe, we finetune the model using consistency and ADA losses:

$$\begin{aligned} \mathcal {L} = \mathcal {L}_{seg} + \mathcal {L}_{cons} + \lambda _{adv}\mathcal {L}_{adv} \mathrm {,} \end{aligned}$$
(1)

where \(\mathcal {L}\), \(\mathcal {L}_{seg}\), \(\mathcal {L}_{cons}\), \(\mathcal {L}_{adv}\) are the overall, supervised, co-heterogenous, and ADA losses, respectively. For adversarial optimization, a discriminator loss, \(\mathcal {L}_{d}\), is also deployed in competition with (1). We elaborate on each of the above losses below. Throughout, we minimize hyper-parameters by using standard architectures and little loss-balancing.

3.1 Backbone

CHASe relies on an FCN backbone, f(.), for its functionality and can accommodate any leading choice, including encoder/decoder backbones  [28, 32]. Instead, we adopt the deeply-supervised progressive holistically nested network (PHNN) framework  [14], which has demonstrated leading segmentation performance for many anatomical structures  [5, 14, 21, 22, 34], sometimes even outperforming U-Net  [21, 22]. Importantly, PHNN has roughly half the parameters and activation maps of an equivalent encoder/decoder. Since we aim to include additional components for semi-supervised learning, this lightweightedness is important.

In brief, PHNN relies on deep supervision in lieu of a decoder and assumes the FCN can be broken into stages based on pooling layers. Here, we assume there are five FCN stages, which matches popular FCN configurations. PHNN produces a sequence of logits, \(a^{(m)}\), using \(1\times 1\) convolutions and upsamplings operating on the terminal backbone activations of each stage. Sharing similarities to residual connections  [16], predictions are generated for each stage using a progressive scheme that adds to the previous stage’s activations:

$$\begin{aligned} \hat{Y}^{(1)}&= \sigma (a^{(1)}) \mathrm {,} \end{aligned}$$
(2)
$$\begin{aligned} \hat{Y}^{(m)}&= \sigma (a^{(m)}+a^{(m-1)}) \,\,\forall m>1 \mathrm {,} \end{aligned}$$
(3)

where \(\sigma (.)\) denotes the softmax and \(\hat{Y}^{(.)}\) represents the predictions, with the final stage’s predictions acting as the actual segmentation output, \(\hat{Y}\). Being deeply supervised, PHNN optimizes a loss at each stage, with higher weights for later stages:

$$\begin{aligned} \ell _{seg}(f(V), Y) = \sum _{j=1}^{5}\dfrac{m}{5}\,\ell _{ce}\left( \hat{Y}^{m}, Y\right) \mathrm {,} \end{aligned}$$
(4)

where we use pixel-wise cross-entropy loss (weighted by prevalence), \(\ell _{ce}(.,.)\). Prior to any semi-supervised learning, this backbone is pre-trained using \(\mathcal {D}_{\ell }\):

$$\begin{aligned} \mathcal {L}_{seg} = \dfrac{1}{N_{\ell }}\sum _{V,Y\in \mathcal {D}_{\ell }} \ell _{seg}\left( f(V), Y\right) \mathrm {.} \end{aligned}$$
(5)

Our supplementary visually depicts the PHNN structure.

3.2 Co-training

With a pretrained fully-supervised backbone, the task is now to leverage the unlabelled cohort of dynamic CT data, \(\mathcal {D}_{u}\). We employ the ubiquitous strategy of enforcing consistency. Because dynamic CT consists of the four NC, A, V, and D phases, each of which is matched to same mask, Y, they can be regarded as different views of the same data. This provides for a natural co-training setup  [4] to penalize inconsistencies of the mask predictions across different phases.

To do this, we must create predictions for each phase. As Fig. 2 illustrates, we accomplish this by using phase-specific FCN stages, i.e., the first two low-level stages, and then use a shared set of weights for the later semantic stages. Because convolutional weights are more numerous at later stages, this allows for an efficient multi-phase setup. All layer weights are initialized using the fully-supervised V-phase weights from Sect. 3.1, including the phase-specific layers. Note that activations across phases remain distinct. Despite the distinct activations, for convenience we abuse notation and use \(\mathcal {\hat{Y}}=f(\mathcal {X})\) to denote the generation of all phase predictions for one data instance. When all four phases are available in \(\mathcal {X}\), then \(\mathcal {\hat{Y}}\) corresponds to \(\{\hat{Y}^{ NC }\), \(\hat{Y}^{A}\), \(\hat{Y}^{V}\), \(\hat{Y}^{D}\}\).

Like Qiao et al.  [31] we use the Jensen-Shannon divergence (JSD)  [27] to penalize inconsistencies. However, because it will be useful later we devise the JSD by first deriving a consensus prediction:

$$\begin{aligned} M = \dfrac{1}{|\mathcal {\hat{Y}}|}\sum _{\hat{Y}\in \mathcal {\hat{Y}}}\hat{Y} \mathrm {.} \end{aligned}$$
(6)

The JSD is then the divergence between the consensus and each phase prediction:

$$\begin{aligned} \ell _{cons}(f(\mathcal {X}))&= \dfrac{1}{|\mathcal {\hat{Y}}|}\sum _{\hat{Y}\in \mathcal {\hat{Y}}} \sum _{k \in \varOmega } KL (\hat{Y}(k) \parallel M(k)) \mathrm {,} \end{aligned}$$
(7)
$$\begin{aligned} \mathcal {L}_{cons}&= \dfrac{1}{N_{u}}\sum _{\mathcal {X}\in \mathcal {D}^{u}} \ell _{cons}(f(\mathcal {X})) \mathrm {,} \end{aligned}$$
(8)

where \(\varOmega \) denotes the spatial domain and \( KL (.\parallel .)\) is the Kullback-Leibler divergence across label classes. (8) thus casts co-training as a form of self-learning, where pseudo-labels correspond to the consensus prediction in (6). For the deeply-supervised PHNN, we only calculate the JSD across the final prediction.

3.3 Co-heterogeneous Training

While the loss in (8) effectively incorporates multiple phases of the unlabelled data, it is not completely satisfactory. Namely, each phase must still be inputted separately into the network, and there is no guarantee of a consistent output. Despite only having single-phase labelled data, ideally, the network should be adapted for multi-phase operation on \(\mathcal {D}_{u}\), meaning it should be able to consume whatever contrast phases are available and output a unified prediction that is stronger as more phases are available.

To do this, we use hetero-modality image segmentation (HeMIS)-style feature fusion  [15], which can predict masks given any arbitrary combination of contrast phases. To do this, a set of phase-specific layers produce a set of phase-specific activations, \(\mathcal {A}\), whose cardinality depends on the number of inputs. The activations are then fused together using first- and second-order statistics, which are flexible enough to handle any number of inputs:

$$\begin{aligned} \mathbf {a}_{fuse} = \mathrm {concat}(\mu (\mathcal {A}), \mathrm {var}(\mathcal {A})) \mathrm {,} \end{aligned}$$
(9)

where \(\mathbf {a}_{fuse}\) denotes the fused feature, and the mean and variance are taken across the available phases. When only one phase is available, the variance features are set to 0. To fuse intermediate predictions, an additional necessity for deeply-supervised networks, we simply take their mean.

For choosing a fusion point, the co-training setup of Sect. 3.2, with its phase-specific layers, already offers a natural fusion location. We can then readily combine hetero-phase learning with co-training, re-defining a “view” to mean any possible combination of the four contrast phases. This has the added benefit of combinatorially increasing the number of co-training views. More formally, we use \(\mathcal {X}^{*}=\mathcal {P}(\mathcal {X}) \setminus \{\varnothing \}\) to denote all possible contrast-phase combinations, where \(\mathcal {P}(.)\) is the powerset operator. The corresponding predictions we denote as \(\mathcal {\hat{Y}}^{*}\). When a data instance has all four phases, then the cardinality of \(\mathcal {X}^{*}\) and \(\mathcal {\hat{Y}}^{*}\) is 15, which is a drastic increase in views. With hetero-modality fusion in place, the consensus prediction and co-training loss of (6) and (7), respectively, can be supplanted by ones that use \(\mathcal {\hat{Y}}^{*}\):

$$\begin{aligned} M&= \dfrac{1}{|\mathcal {\hat{Y}}^{*}|}\sum _{\hat{Y}\in \mathcal {\hat{Y}}^{*}}\hat{Y} \mathrm {,} \end{aligned}$$
(10)
$$\begin{aligned} \ell _{cons}(f(\mathcal {X}))&= \dfrac{1}{|\mathcal {\hat{Y}^{*}}|}\sum _{\hat{Y}\in \mathcal {\hat{Y}}^{*}} \sum _{k \in \varOmega } KL (\hat{Y}(k) \parallel M(k)) \mathrm {.} \end{aligned}$$
(11)

When only single-phase combinations are used, (10) and (11) reduce to standard co-training. To the best of our knowledge we are the first to combine co-training with hetero-modal learning. This combined workflow is graphically depicted in Fig. 2.

3.4 Adversarial Domain Adaptation

The co-heterogeneous training of Sect. 3.3 is highly effective. Yet, it relies on accurate consensus predictions, which may struggle to handle significant appearance variations in \(\mathcal {D}_{u}\) that are not represented in \(\mathcal {D}_{\ell }\). Mask-based ADA offers an a complementary approach that trains a network to output masks that follow a prediction-based distribution learned from labelled data  [40]. Since liver shapes between \(\mathcal {D}_{u}\) and \(\mathcal {D}_{\ell }\) should follow similar distributions, this provides an effective learning strategy that is not as confounded by differences in appearance. Following Tsai et al.  [40], we can train a discriminator to classify whether a softmax output originates from a labelled- or unlabelled-dataset prediction. However, because we have a combinatorial number (15) of possible input phase combinations, i.e., \(\mathcal {\hat{X}}^{*}\), naively domain-adapting all corresponding predictions is prohibitively expensive. Fortunately, the formulations of (7) and (11) offer an effective and efficient solution. Namely, we can train the discriminator on the consensus prediction, M. This adapts the combinatorial number of possible predictions at the same computational cost as performing ADA on only a single prediction.

More formally, let d(.) be defined as an FCN discriminator, then the discriminator loss can be expressed as

$$\begin{aligned} \mathcal {L}_{d} = \dfrac{1}{N_{\ell }}\sum _{\mathcal {D}_{\ell }}\ell _{ce}(d(\hat{Y}^{V}), \mathbf {1}) + \dfrac{1}{N_{u}}\sum _{\mathcal {D}_{u}}\ell _{ce}(d(M, \mathbf {0})) \mathrm {,} \end{aligned}$$
(12)

where \(\ell _{ce}\) represents a pixel-wise cross-entropy loss. The opposing labels pushes the discriminator to differentiate semi-supervised consensus predictions from fully-supervised variants. Unlike natural image ADA  [40], we do not wish to naively train the discriminator on all output classes, as it not reasonable to expect similar distributions of liver lesion shapes across datasets. Instead we train the discriminator on the liver region, i.e., the union of healthy liver and lesion tissue predictions. Finally, when minimizing (12), we only optimize the discriminator weights. The segmentation network can now be tasked with fooling the discriminator, through the addition of an adversarial loss:

$$\begin{aligned} \mathcal {L}_{adv} = \dfrac{1}{N_{u}}\sum _{\mathcal {D}_{u}}\ell _{ce}(d(M, \mathbf {1})) \mathrm {,} \end{aligned}$$
(13)

where the ground-truth labels for \(\ell _{ce}\) have been flipped from (12). Note that here we use single-level ADA as we found the multi-level variant  [40] failed to offer significant enough improvements to offset the added complexity. When minimizing (13), or (1) for that matter, the discriminator weights are frozen. We empirically set \(\lambda _{adv}\) to 0.001.

3.5 Pseudo-Labelling

By integrating co-heterogeneous training and ADA, CHASe can robustly segment challenging multi-phase unlabelled data. However some scenarios still present challenging edge cases, e.g., lesions treated with TACE. See the supplementary for some visualizations. To manage these cases, we use a simple, but effective, domain-specific pseudo-labelling.

First, after convergence of (1), we produce predictions on \(\mathcal {D}_{u}\) using all available phases and extract any resulting 3D holes in the liver region (healthy tissue plus lesion) greater than 100 voxels. Since there should never be 3D holes, these are mistakes. Under the assumption that healthy tissue in both datasets should equally represented, we treat these holes as missing “lesion” predictions. We can then create a pseudo-label, \(Y_{h}\), that indicates lesion at the hole, with all others regions being ignored. This produces a new “holes” dataset, \(\mathcal {D}_{h}=\{\mathcal {X}, Y_{h}\}_{i=1}^{N_{h}}\), using image sets extracted from \(\mathcal {D}_{u}\). We then finetune the model using (1), but replace the segmentation loss of (5) by

$$\begin{aligned} \mathcal {L}_{seg}\, =\,&\dfrac{1}{N_{\ell }}\sum _{V,Y\in \mathcal {D}_{\ell }} \ell _{seg}(f(V), Y) \nonumber \\&+ \dfrac{\lambda _{h}}{N_{h}}\sum _{\mathcal {X},Y_{h}\in \mathcal {D}_{h}}\sum _{X\in \mathcal {X}^{*}} \ell _{seg}(f(X), Y_{h}) \mathrm {,} \end{aligned}$$
(14)

where we empirically set \(\lambda _{h}\) to 0.01 for all experiments. We found results were not sensitive to this value. While the hole-based pseudo-labels do not capture all errors, they only have to capture enough of missing appearances to guide CHASe’s training to better handle recalcitrant edge cases.

4 Results

Datasets. To execute CHASe, we require datasets of single-phase labelled and multi-phase unlabelled studies, \(\mathcal {D}_{u}\) and \(\mathcal {D}_{\ell }\), respectively. The goal is to robustly segment patient studies from \(\mathcal {D}_{u}\) while only having training mask labels from the less representative \(\mathcal {D}_{\ell }\) dataset. 1) For \(\mathcal {D}_{u}\), we collected 1147 multi-phase dynamic CT studies (4577 volumes in total) directly from the PACS of Chang Gung Memorial Hospital (CGMH). The only selection criteria were patients with biopsied or resected liver lesions, with dynamic contrast CTs taken within one month before the procedure. Patients may have ICC, HCC, benign or metastasized tumors, along with co-occuring maladies, such as liver fibrosis, splenomegaly, or TACE-treated tumors. Thus, \(\mathcal {D}_{u}\) directly reflects the variability found in clinical scenarios. We used the DEEDS algorithm  [19] to correct any misalignments. 2) For \(\mathcal {D}_{\ell }\), we collected 235 V-phase CT studies collected from as many public sources as we could locate  [3, 7, 11, 18]. This is a superset of the LiTS training data  [3], and consists of a mixture of healthy and pathological livers, with only HCC and metastasis represented.

Evaluation Protocols. 1) To evaluate performance on \(\mathcal {D}_{u}\), 47 and 100 studies were randomly selected for validation and testing, respectively, with 90 test studies having all four phases. Given the extreme care required for lesion annotation, e.g., the four readers used in the LiTS dataset  [3], only the liver region, i.e., union of healthy liver and lesion tissue, of the \(\mathcal {D}_{u}\) evaluation sets was annotated by a clinician. For each patient study, this was performed independently for each phase, with a final study-wise mask generated via majority voting. 2) We also evaluate whether the CHASe strategy of learning from unlabelled data can also improve performance on \(\mathcal {D}_{\ell }\). To do this, we split \(\mathcal {D}_{\ell }\), with \(70\%\)/\(10\%\)/\(20\%\) for training, validation, and testing, respectively, resulting in 47 test volumes. To measure CHASe’s impact on lesion segmentation, we use the \(\mathcal {D}_{\ell }\) test set.

Backbone Network. We used a 2D segmentation backbone, an effective choice for many organs  [14, 25, 33, 35], due to its simplicity and efficiency. We opt for the popular ResNet-50  [17]-based DeepLabv2  [8] network with PHNN-style deep supervision. We also tried VGG-16  [37], which also performed well and its results can be found in the supplementary. To create 3D masks we simply stack 2D predictions. CHASe training. We randomly sample multi-phase slices and, from them, randomly sample four out of the 15 phase combinations from \(\mathcal {X}^{*}\) to stochastically minimize (11) and (13). For standard co-training baselines, we sample all available phases to minimize (7). Discriminator Network. We use an atrous spatial pyramid pooling (ASPP) layer, employing dilation rates of 1,2,3,4 with a kernel size of 3 and a leaky ReLU with negative slope 0.2 as our activation function. After a \(1\times 1\) convolution, a sigmoid layer classifies whether a pixel belongs to the labelled or unlabelled dataset. Specific details on data pre-processing, learning rates and schedules can be found in the supplementary material.

Fig. 3.
figure 3

Box and whisker plot. Shown is the distribution of DSCs of pathological liver segmentation on the CGMH PACS when using all available phases for inference.

4.1 Pathological Liver Segmentation

We first measure the performance of CHASe on segmenting pathological livers from the unlabeled CGMH PACS dataset, i.e., \(\mathcal {D}_{u}\). We use PHNN trained only on \(\mathcal {D}_{\ell }\) as a baseline, testing against different unlabeled learning baselines, i.e., co-training  [4], co-heterogeneous training, ADA  [40], and hole-based pseudo-labelling. We measure the mean DSC and average symmetric surface distance (ASSD). For non hetero-modality variants, we use majority voting across each single-phase prediction to produce a multi-phase output. We also test against the publicly available hybrid H-DenseUNet model  [25], one of the best published models. It uses a cascade of 2D and 3D networks.

Table 1. Pathological liver segmentation. Mean DSC and ASSD scores on the CGMH PACS dataset are tabulated across different contrast phase inputs. “All” means all available phases are used as input. Number of samples are in parentheses.

As Table 1 indicates, despite being only a single 2D network, our PHNN baseline is strong, comparing similarly to the cascaded 2D/3D H-DenseUNet on our datasetFootnote 1. However, both H-DenseUNet and our PHNN baseline still struggle to perform well on the CGMH dataset, particularly on non V-phases, indicating that training on public data alone is not sufficient. In contrast, through its principled semi-supervised approach, CHASe is able to dramatically increase performance, producing boosts of \(9.4\%\), \(3.9\%\), \(4.2\%\), \(7.4\%\), and \(4.0\%\) in mean DSCs for inputs of NC, A, V, D, and all phases, respectively. As can also be seen, all components contribute to these improvements, indicating the importance of each to the final result. Compared to established baselines of co-training and ADA, CHASe garners marked improvements. In addition, CHASe performs more strongly as more phases are available, something the baseline models are not always able to do. Results across all 15 possible combinations, found in our supplementary material, also demonstrate this trend.

More compelling results can be found in Fig. 3’s box and whisker plots. As can be seen, each component is not only able to reduce variability, but more importantly significantly improves worst-case results. These same trends are seen across all possible phase combinations. Compared to improvements in mean DSCs, these worst-case reductions, with commensurate boosts in reliability, can often be more impactful for clinical applications. Unlike CHASe, most prior work on pathological liver segmentation is fully-supervised. Wang et al. report \(96.4\%\) DSC on 26 LiTS volumes and Yang et al.  [45] report \(95\%\) DSC on 50 test volumes with unclear healthy vs pathological status. We achieve comparable, or better, DSCs on 100 pathological multi-phase test studies. As such, we articulate a versatile strategy to use and learn from the vast amounts of uncurated multi-phase clinical data housed within hospitals.

These quantitative results are supported by qualitative examples in Fig. 4. As the first two rows demonstrate, H-DenseUNet  [25] and our baseline can perform inconsistently across contrast phases, with both being confused by the splenomegaly (overly large spleen) of the patient. The CHASe components are able to correct these issues. The third row in Fig. 4 depicts an example of a TACE-treated lesion, not seen in the public dataset and demonstrates how CHASe’s components can progressively correct the under-segmentation. Finally, the last row depicts the worst-case performance of CHASe. Despite this unfavorable selection, CHASe is still able to predict better masks than the alternatives. Of note, CHASe is able to provide tangible improvements in consistency and reliability, robustly predicting even when presented with image features not seen in \(\mathcal {D}_{\ell }\). More qualitative examples can be found in our supplementary material.

Fig. 4.
figure 4

Qualitative results. The first two rows are captured from the same patient across different contrast phases. The third and fourth row shows the performance when all available phases are included. Green and red curves depict the ground truth and segmentation predictions, respectively. (Color figure online)

4.2 Liver and Lesion Segmentation

We also investigate whether using CHASe on unlabelled data can boost performance on the labelled data, i.e., the \(\mathcal {D}_{\ell }\) test set of 47 single-phase V volumes. Note, we include results from the public H-DenseUNet  [25] implementation, even though it was only trained on LiTS and included some of our test instances originating from LiTS in its training set.

Table 2. Ablation study on public data. Presented are test set DSC scores with their standard deviation of healthy liver, lesion, and liver region.

As Table 2 indicates, each CHASe component progressively boosts performance, with lesion scores being the most dramatic. The one exception is that the holes-based pseudo-labelling produces a small decrease in mean lesion scores. Yet, box and whisker plots, included in our supplementary material, indicate that holes-based pseudo-labelling boosts median values while reducing variability. Direct comparisons against other works, all typically using the LiTS challenge, are not possible, given the differences in evaluation data. nnUNet  [20], the winner of the Medical Decathlon, reported \(61\%\) and \(74\%\) DSCs for their own validation and challenge test set, respectively. However, \(57\%\) of the patients in our test set are healthy, compared to the \(3\%\) in LiTS. More healthy cases will tend to make it a harder lesion evaluation set, as any amount of false positives will produce DSC scores of zero. For unhealthy cases, CHASe’s lesion mean DSC is \(61.9\%\) compared to \(53.2\%\) for PHNN. CHASe allows a standard backbone, with no bells or whistles, to achieve dramatic boosts in lesion segmentation performance. As such, these results broaden the applicability of CHASe, suggesting it can even improve the source-domain performance of fully-supervised models.

5 Conclusion

We presented CHASe, a powerful semi-supervised approach to organ segmentation. Clinical datasets often comprise multi-phase data and image features not represented in single-phase public datasets. Designed to manage this challenging domain shift, CHASe can adapt publicly trained models to robustly segment multi-phase clinical datasets with no extra annotation. To do this, we integrate co-training and hetero-modality into a co-heterogeneous training framework. Additionally, we propose a highly computationally efficient ADA for multi-view setups and a principled holes-based pseudo-labeling. To validate our approach, we apply CHASe to a highly challenging dataset of 1147 multi-phase dynamic contrast CT volumes of patients, all with liver lesions. Compared to strong fully-supervised baselines, CHASe dramatically boosts mean performance (>9% in NC DSCs), while also drastically improving worse-case scores. Future work should investigate 2.5D/3D backbones and apply this approach to other medical organs. Even so, these results indicate that CHASe provides a powerful means to adapt publicly-trained models to challenging clinical datasets found “in-the-wild”.