1 Introduction

Accurate delineation of lesions or anatomical structures is a vital step for clinical diagnosis, intervention, and treatment planning [24]. Although recent deep learning methods excel at segmenting these structures, deep learning-based segmentors do not generalize well across heterogeneous domains, e.g., different clinical centers, scanner vendors, or imaging modalities [4, 14, 16, 20]. To alleviate this issue, unsupervised domain adaptation (UDA) has been actively developed, which adapts a model trained with supervision from a labeled source domain to an unlabeled target domain [5, 15, 18, 19]. Because target domains are diverse, however, the performance of UDA is still far from satisfactory [9, 17, 31]. Instead, labeling a small set of target domain data is usually feasible [25]. As such, semi-supervised domain adaptation (SSDA) has shown great potential as a solution to domain shifts, as it can utilize both labeled source and labeled target data, in addition to unlabeled target data. To date, several SSDA classification methods based on discriminative class boundaries have been proposed [8, 13, 23, 29], but they cannot be directly applied to segmentation, since segmentation involves complex and dense pixel-wise predictions.

Recently, a few works [6, 10, 26] have extended SSDA to segmentation of natural images, but, to our knowledge, SSDA for medical image segmentation has not yet been explored. For example, depth estimation is used as an auxiliary task for natural images in [10], but that approach cannot be applied to medical imaging data, e.g., MRI, which do not have perspective depth maps. Wang et al. [26] simply added supervision from labeled target samples to conventional adversarial UDA. Chen et al. [6] averaged labeled source and target domain images at both region and sample levels to mitigate the domain gap. However, source domain supervision can easily dominate the training when the labeled source data are directly combined with the target data [23]. In other words, the small amount of extra labeled target data is not effectively utilized, because the volume of labeled source data is much larger than that of labeled target data, and there is significant divergence across domains [23].

To mitigate the aforementioned limitations, we propose a practical asymmetric co-training (ACT) framework that exploits each subset of data in SSDA in a unified and balanced manner. To prevent a segmentor that is jointly trained on both domains from being dominated by the source data, we adopt a divide-and-conquer strategy to decouple the label supervisions for two asymmetric segmentors, which share the same objective of achieving decent segmentation performance on the unlabeled data. By “asymmetric,” we mean that the two segmentors are assigned different roles to utilize the labeled data in either the source or the target domain, thereby providing complementary views of the unlabeled data. That is, the first segmentor learns on the labeled source domain data and unlabeled target domain data as a conventional UDA task, while the other segmentor learns on the labeled and unlabeled target domain data as a semi-supervised learning (SSL) task. To integrate these two asymmetric branches, we extend the idea of co-training [1, 3, 22], which is one of the most established multi-view learning methods. Instead of modeling two views of the same set of data with different feature extractors or adversarial sample generation, as in conventional co-training [1, 3, 22], our two cross-domain views are explicitly provided by the segmentors with the correlated and complementary UDA and SSL tasks. Specifically, we construct the pseudo label of each unlabeled target sample based on the pixel-wise confident predictions of the other segmentor. Then, the segmentors are trained on the pseudo labeled data iteratively with an exponential MixUp decay (EMD) scheme for smooth propagation. Finally, the target segmentor carries out the target domain segmentation.

The contributions of this work can be summarized as follows:

  • We present a novel SSDA segmentation framework to exploit the different supervisions with the correlated and complementary asymmetric UDA and SSL sub-tasks, following a divide-and-conquer strategy. The knowledge is then integrated with confidence-aware pseudo-label based co-training.

  • An EMD scheme is further proposed to mitigate the effect of noisy pseudo labels in the early epochs of training for smooth propagation.

  • To our knowledge, this is the first attempt at investigating SSDA for medical image segmentation. Comprehensive evaluations on cross-modality brain tumor (i.e., T2-weighted MRI to T1-weighted/T1ce/FLAIR MRI) segmentation tasks using the BraTS18 database demonstrate superior performance over conventional source-relaxed/source-based UDA methods.

Fig. 1. Illustration of our proposed ACT framework for SSDA cross-modality (e.g., T2-weighted to T1-weighted MRI) image segmentation. Note that only the target domain specific segmentor \(\theta \) will be used in testing.

2 Methodology

In our SSDA setting for segmentation, we are given a labeled source set \(\mathcal {D}^s = \{(x^s_i,y^{s}_i)\}_{i=1}^{N^s}\), a labeled target set \(\mathcal {D}^{lt} = \{(x^{lt}_i,y^{lt}_i)\}_{i=1}^{N^{lt}}\), and an unlabeled target set \(\mathcal {D}^{ut} = \{(x^{ut}_i)\}_{i=1}^{N^{ut}}\), where \({N^s}\), \({N^{lt}}\), and \({N^{ut}}\) are the numbers of samples in each set, respectively. Note that the slices \(x^s_i\), \(x^{lt}_i\), and \(x^{ut}_i\), and the segmentation mask labels \(y_i^{s}\) and \(y_i^{lt}\), have the same spatial size of \(H\times W\). In addition, for each pixel \(y^{s}_{i:n}\) or \(y^{lt}_{i:n}\) indexed by \(n\in \{1,\cdots ,H\times W\}\), the label takes one of C classes, i.e., \(y^{s}_{i:n}, y^{lt}_{i:n}\in \{1,\cdots ,C\}\). There is a distribution divergence between the source domain samples, \(\mathcal {D}^s\), and the target domain samples, \(\mathcal {D}^{lt}\) and \(\mathcal {D}^{ut}\). Usually, \({N^{lt}}\) is much smaller than \({N^s}\). The learning objective is to perform well in the target domain.

2.1 Asymmetric Co-training for SSDA Segmentation

To decouple SSDA via a divide-and-conquer strategy, we integrate \(\mathcal {D}^{ut}\) with either \(\mathcal {D}^s\) or \(\mathcal {D}^{lt}\) to form the correlated and complementary sub-tasks of UDA and SSL. We configure a cross-domain UDA segmentor \(\phi \) and a target domain SSL segmentor \(\theta \), which share the same objective of achieving a decent segmentation performance in \(\mathcal {D}^{ut}\). The knowledge learned from the two segmentors is then integrated with ACT. The overall framework of this work is shown in Fig. 1.

Conventional co-training has focused on two independent views of the source and target data, or on artificial multi-views generated with adversarial examples, learning a classifier for each of the views and letting the classifiers teach each other on the unlabeled data [3, 22]. By contrast, in SSDA, without multiple views of the data, we propose to leverage the distinct yet correlated supervision, based on the inherent discrepancy between the labeled source and target data. We note that the sub-tasks and datasets adopted are different for the UDA and SSL branches. Therefore, all of the data subsets can be exploited, following well-established UDA and SSL solutions without interfering with each other.

To achieve co-training, we adopt a simple deep pseudo labeling method [27], which assigns the pixel-wise pseudo label \(\hat{y}_{i:n}\) for \(x^{ut}_{i:n}\). Though UDA and SSL can be achieved by different advanced algorithms, deep pseudo labeling can be applied to either UDA [32] or SSL [27]. Therefore, we can apply the same algorithm to the two sub-tasks, thereby greatly simplifying our overall framework. We note that, while a few methods [28] can, like pseudo labeling, be applied to either SSL or UDA, they have not been jointly adopted in the context of SSDA.

Specifically, we assign the pseudo label for each pixel \(x^{ut}_{i:n}\) in \(\mathcal {D}^{ut}\) with the prediction of either \(\phi \) or \(\theta \), thereby constructing the pseudo labeled sets \(U^{\phi }\) and \(U^{\theta }\) for the training of the other segmentor, \(\theta \) and \(\phi \), respectively:

$$\begin{aligned} U^{\phi } = \{(x^{ut}_{i:n},\hat{y}^{\phi }_{i:n}= \mathop {\mathrm {arg\,max}}\limits _c p(c|x^{ut}_{i:n};\phi )); \text { if } \max _c p(c|x^{ut}_{i:n};\phi ) > \epsilon \},\end{aligned}$$
(1)
$$\begin{aligned} U^{\theta } = \{(x^{ut}_{i:n},\hat{y}^{\theta }_{i:n}= \mathop {\mathrm {arg\,max}}\limits _c p(c|x^{ut}_{i:n};\theta )); \text { if } \max _c p(c|x^{ut}_{i:n};\theta ) > \epsilon \}, \end{aligned}$$
(2)

where \(p(c|x^{ut}_{i:n};\theta )\) and \(p(c|x^{ut}_{i:n};\phi )\) are the predicted probabilities of class \(c\in \{1,\cdots ,C\}\) w.r.t. \(x^{ut}_{i:n}\) using \(\theta \) and \(\phi \), respectively, and \(\epsilon \) is a confidence threshold. Note that a low softmax prediction probability indicates low confidence for training [18, 32]. Then, the pixels in the selected pseudo label sets are merged with the labeled data to construct \(\{\mathcal {D}^{s},U^{\theta }\}\) and \(\{\mathcal {D}^{lt},U^{\phi }\}\) for the training of \(\phi \) and \(\theta \), respectively, with a conventional supervised segmentation loss. Therefore, the two segmentors with asymmetric tasks act as teacher and student of each other to distill the knowledge with highly confident predictions.
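The confidence-based selection in Eqs. (1) and (2) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function name `confident_pseudo_labels`, the 0-based class indices, and the `ignore_index` convention for unselected pixels are assumptions made for illustration only.

```python
import numpy as np

def confident_pseudo_labels(probs, eps=0.5, ignore_index=-1):
    """Pixel-wise pseudo labels in the spirit of Eqs. (1)-(2).

    probs: softmax output of one segmentor for one slice, shape (C, H, W).
    Returns an (H, W) integer map that holds the argmax class wherever the
    maximum probability exceeds eps, and ignore_index elsewhere.
    Classes are 0-based here (the text uses 1..C); ignore_index is a
    hypothetical marker for pixels that are not selected.
    """
    probs = np.asarray(probs, dtype=float)
    confidence = probs.max(axis=0)   # max_c p(c | x)
    labels = probs.argmax(axis=0)    # argmax_c p(c | x)
    return np.where(confidence > eps, labels, ignore_index)

# Toy example: C = 3 classes on a 1 x 2 slice.
probs = np.array([[[0.7, 0.4]],
                  [[0.2, 0.3]],
                  [[0.1, 0.3]]])
pseudo = confident_pseudo_labels(probs, eps=0.5)
print(pseudo)  # [[ 0 -1]]
```

In this toy case the first pixel is kept because its maximum probability (0.7) exceeds the threshold, while the second (0.4) is left out of the pseudo labeled set. Applying the same function to the outputs of \(\phi \) and \(\theta \) would give the two sets used for cross-teaching.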

2.2 Pseudo-label with Exponential MixUp Decay

The pseudo labels generated by the two segmentors are typically noisy, especially in the initial epochs, which can lead to a deviated solution with propagated errors. Numerous conventional co-training methods rely on the simple assumptions that there is no domain shift and that the predictions of the teacher model are reliable enough to be used directly as ground truth. Due to the domain shift, however, the prediction of \(\phi \) in the target domain could be noisy and lead to an aleatoric uncertainty [7, 11, 12]. In addition, insufficient labeled target domain data can lead to an epistemic uncertainty related to the model parameters [7, 11, 12].

To smoothly exploit the pseudo labels, we propose to adjust the contribution of the supervision signals from both labels and pseudo labels as the training progresses. Previously, vanilla MixUp [30] was developed for efficient data augmentation, by combining both samples and their labels to generate new data for training. We note that the MixUp used in SSL [2, 6] adopted constant sampling and did not use a decay scheme for gradual co-training. Thus, we propose to gradually exploit the pseudo labels by mixing up \(\mathcal {D}^s\) or \(\mathcal {D}^{lt}\) with the pseudo labeled \(\mathcal {D}^{ut}\), and to adjust their ratio with the EMD scheme. For the selected \({U}^{\phi }\) and \({U}^{\theta }\) with the number of slices \(|{U}^{\phi }|\) and \(|{U}^{\theta }|\), we mix up each pseudo labeled image with all images from \(\mathcal {D}^s\) or \(\mathcal {D}^{lt}\) to form the mixed pseudo labeled sets \(\tilde{U}^{\theta }\) and \(\tilde{U}^{\phi }\). Specifically, our EMD can be formulated as:

$$\begin{aligned} \tilde{U}^{\phi } =\{(\tilde{x}^{lt}_{i:n} = \lambda x^{lt}_{i:n} + (1-\lambda ) x^{ut}_{i:n}, \tilde{y}^{lt}_{i:n} =\lambda y^{lt}_{i:n} + (1-\lambda ) \hat{y}^{\phi }_{i:n})\}_i^{|{U}^{\phi }|\times N},\end{aligned}$$
(3)
$$\begin{aligned} \tilde{U}^{\theta } =\{(\tilde{x}^{s}_{i:n} = \lambda x^{s}_{i:n} + (1-\lambda ) x^{ut}_{i:n}, \tilde{y}^{s}_{i:n} =\lambda y^{s}_{i:n} + (1-\lambda ) \hat{y}^{\theta }_{i:n})\}_i^{|{U}^{\theta }|\times N}, \end{aligned}$$
(4)

where \(\lambda =\lambda ^0\text {exp}(-I)\) is the MixUp parameter with exponential decay w.r.t. the iteration I, and \(\lambda ^0\) is the initial weight of the ground truth samples and labels, which is empirically set to 1. As I increases, \(\lambda \) becomes smaller, so the ground truth labels contribute most at the start of training, while the pseudo labels are increasingly utilized in the later training epochs. Therefore, \(\tilde{U}^{\phi }\) and \(\tilde{U}^{\theta }\) gradually approach the pseudo label sets \({U}^{\phi }\) and \({U}^{\theta }\). We note that the MixUp operates on the image level, which is indicated by i. The number of generated mixed samples depends on the scale of \({U}^{\phi }\) and \({U}^{\theta }\) in each iteration and on the batch size N. With the labeled \(\mathcal {D}^s\) and \(\mathcal {D}^{lt}\), as well as the pseudo labeled sets with EMD, \(\tilde{U}^{\phi }\) and \(\tilde{U}^{\theta }\), we update the parameters of the segmentors \(\phi \) and \(\theta \), i.e., \(\omega _{\phi }\) and \(\omega _{\theta }\), with SGD as:

$$\begin{aligned} \omega _{\phi }\leftarrow \omega _{\phi }-\eta \nabla (\mathcal {L}(\omega _{\phi },\mathcal {D}^s)+\mathcal {L}(\omega _{\phi },\tilde{U}^{\theta })),\end{aligned}$$
(5)
$$\begin{aligned} \omega _{\theta }\leftarrow \omega _{\theta }-\eta \nabla (\mathcal {L}(\omega _{\theta },\mathcal {D}^{lt})+\mathcal {L}(\omega _{\theta },\tilde{U}^{\phi })), \end{aligned}$$
(6)

where \(\eta \) indicates the learning rate, and \(\mathcal {L}(\omega _{\phi },\mathcal {D}^s)\) denotes the learning loss on \(\mathcal {D}^s\) with the current segmentor \(\phi \) parameterized by \(\omega _{\phi }\). The training procedure is detailed in Algorithm 1. After training, only the target domain specific SSL segmentor \(\theta \) is used for testing.
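As a rough illustration of the EMD scheme, the sketch below mixes one labeled slice with one pseudo-labeled target slice using \(\lambda =\lambda ^0\text {exp}(-I)\), and evaluates a cross-entropy loss against the resulting soft labels. It is a minimal NumPy sketch under stated assumptions, not the authors' code: the helper names (`emd_lambda`, `one_hot`, `emd_mixup`, `soft_cross_entropy`) are hypothetical, the scaling of I is not specified in the text and raw values are used purely for illustration, and every pixel of the target slice is assumed to carry a pseudo label.

```python
import numpy as np

def emd_lambda(iteration, lambda0=1.0):
    """MixUp weight lambda = lambda0 * exp(-I)."""
    return lambda0 * np.exp(-iteration)

def one_hot(labels, num_classes):
    """(H, W) integer labels -> (C, H, W) one-hot array."""
    return np.eye(num_classes)[labels].transpose(2, 0, 1)

def emd_mixup(x_labeled, y_labeled, x_unlabeled, y_pseudo, iteration, lambda0=1.0):
    """Mix a labeled slice with a pseudo-labeled slice.

    Images have shape (H, W); labels are one-hot arrays of shape (C, H, W).
    """
    lam = emd_lambda(iteration, lambda0)
    x_mix = lam * x_labeled + (1.0 - lam) * x_unlabeled
    y_mix = lam * y_labeled + (1.0 - lam) * y_pseudo
    return x_mix, y_mix

def soft_cross_entropy(probs, y_soft, tiny=1e-12):
    """Mean pixel-wise cross-entropy against soft (mixed) labels."""
    return float(-(y_soft * np.log(probs + tiny)).sum(axis=0).mean())

# Toy data: 2 classes, 2 x 2 slices.
x_l = np.ones((2, 2))
x_u = np.zeros((2, 2))
y_l = one_hot(np.zeros((2, 2), dtype=int), 2)  # ground truth: class 0
y_p = one_hot(np.ones((2, 2), dtype=int), 2)   # pseudo label: class 1

x_mix0, y_mix0 = emd_mixup(x_l, y_l, x_u, y_p, iteration=0)  # lambda = 1
x_mix5, y_mix5 = emd_mixup(x_l, y_l, x_u, y_p, iteration=5)  # lambda ~ 0.0067

uniform = np.full((2, 2, 2), 0.5)              # an uninformative prediction
loss5 = soft_cross_entropy(uniform, y_mix5)
```

At iteration 0 the mixed sample equals the labeled one, and by iteration 5 it is almost entirely the pseudo-labeled one. Because the mixed labels are soft, the loss has to accept probability vectors as targets rather than hard class indices.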

Fig. 2. Comparisons with other UDA/SSDA methods and ablation studies for the cross-modality tumor segmentation. We show target test slices of T1, T1ce, and FLAIR MRI from three subjects.

3 Experiments and Results

To demonstrate the effectiveness of our proposed SSDA method, we evaluated it on T2-weighted MRI to T1-weighted/T1ce/FLAIR MRI brain tumor segmentation using the BraTS2018 database [21]. We denote our proposed method as ACT, and use ACT-EMD to denote ACT without the EMD scheme, which serves as an ablation study of the EMD-based pseudo label exploration.

Of note, the BraTS2018 database contains MRI scans of a total of 285 patients [21], including T1-weighted (T1), T1-contrast enhanced (T1ce), T2-weighted (T2), and T2 Fluid Attenuated Inversion Recovery (FLAIR) MRI. For the segmentation labels, each pixel belongs to one of four classes, i.e., enhancing tumor (EnhT), peritumoral edema (ED), necrotic and non-enhancing tumor core (CoreT), and background. In addition, the whole tumor covers CoreT, EnhT, and ED. We follow the conventional cross-modality UDA (i.e., T2-weighted to T1-weighted/T1ce/FLAIR) evaluation protocols [9, 17, 31] with an 8/2 split for training/testing, and extend them to our SSDA task by accessing the labels of 1–5 target domain subjects at the adaptation training stage. All of the data were used in a subject-independent and unpaired manner. We use SSDA:1 or SSDA:5 to denote that one or five target domain subjects are labeled in training.

For a fair comparison, we used the same segmentor backbone as in DSA [9] and SSCA [17], which is based on Deeplab-ResNet50. Without loss of generality, we simply adopted the cross-entropy loss as \(\mathcal {L}\), and set the learning rate \(\eta =1\textrm{e}{-3}\) and confidence threshold \(\epsilon =0.5\). Both \(\phi \) and \(\theta \) have the same network structure. For the evaluation metrics, we adopted the widely used DSC (the higher, the better) and Hausdorff distance (HD: the lower, the better) as in [9, 17]. The standard deviation was reported over five runs.
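For reference, the DSC of a binary mask and the whole-tumor mask described above can be computed as in the following minimal NumPy sketch. The function names, the smoothing constant, and the assumption that the background class is encoded as 0 are illustrative choices and are not taken from the evaluation code of [9, 17].

```python
import numpy as np

def whole_tumor_mask(label_map, background=0):
    """Whole tumor = union of CoreT, EnhT, and ED, i.e., every non-background pixel.

    Assumes the background class is encoded as `background` (hypothetical).
    """
    return np.asarray(label_map) != background

def dice_score(pred, target, smooth=1e-7):
    """Dice similarity coefficient between two binary masks (higher is better)."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    intersection = np.logical_and(pred, target).sum()
    return (2.0 * intersection + smooth) / (pred.sum() + target.sum() + smooth)

# Toy example: predicted and reference label maps on a 2 x 2 slice.
pred_labels = np.array([[1, 2],
                        [0, 0]])
true_labels = np.array([[3, 0],
                        [0, 0]])
dsc = dice_score(whole_tumor_mask(pred_labels), whole_tumor_mask(true_labels))
print(round(dsc, 4))  # 0.6667
```

Here two pixels are predicted as tumor and one of them overlaps the single reference tumor pixel, giving 2·1/(2+1) ≈ 0.667. The smoothing term only keeps the score defined (equal to 1) when both masks are empty.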

Table 1. Whole tumor segmentation performance of the cross-modality UDA and SSDA. The supervised joint training can be regarded as an “upper bound”.
Table 2. Detailed comparison of CoreT/EnhT/ED segmentation. Results are averaged over three tasks including T2-weighted to T1-weighted, T1ce, and FLAIR MRI with the backbone as in [9, 17].

The quantitative evaluation results of the whole tumor segmentation are provided in Table 1. We can see that SSDA largely improved the performance over the compared UDA methods [9, 17]. For the T2-weighted to T1-weighted MRI transfer task, we were able to achieve more than 10% improvement over [9, 17] with only one labeled target sample. Recent SSDA methods for natural image segmentation [6, 26] did not take the balance between the two labeled supervisions into consideration, which easily results in a source domain-biased solution when labeled target domain data are limited, and thus did not perform well on target domain data [23]. In addition, the depth estimation in [10] cannot be applied to the MRI data. Thus, we reimplemented the applicable methods [6, 26] with the same backbone for comparison, which is also their first application to medical image segmentation. Our ACT outperformed [6, 26] by a DSC of 3.3% w.r.t. the averaged whole tumor segmentation in the SSDA:1 task. The better performance of ACT over ACT-EMD demonstrated the effectiveness of our EMD scheme for smooth adaptation with pseudo labels. We note that we did not manage to outperform the supervised joint training, which accesses all of the target domain labels and can be considered an “upper bound” of UDA and SSDA. It is therefore encouraging that our ACT can approach joint training with five labeled target subjects. In addition, the performance was stable for settings of \(\lambda ^0\) from 1 to 10.

In Table 2, we provide detailed comparisons for the more fine-grained segmentation w.r.t. CoreT, EnhT, and ED. The improvements were consistent with those of the whole tumor segmentation. The qualitative results of three target modalities in Fig. 2 show the superior performance of our framework over the compared methods.

In Fig. 3(a), we analyzed how the proportions of testing pixels that receive a pseudo label from both, only one, or none of the two segmentors, i.e., for which the maximum confidence is larger than \(\epsilon \) as in Eq. (1), change during training. We can see that the consensus of the two segmentors keeps increasing, as they teach each other in the co-training scheme for knowledge integration. The low “both” rates at the beginning indicate that \(\phi \) and \(\theta \) may provide different views based on their asymmetric tasks, which can be complementary to each other. The sensitivity studies of using a different number of labeled target domain subjects are shown in Fig. 3(b). Our ACT was able to effectively use \(\mathcal {D}^{lt}\). In Fig. 3(c), we show that using more EMD pairs improves the performance consistently.
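The three proportions tracked in this analysis can be computed from the softmax outputs of the two segmentors as in the minimal NumPy sketch below. The function name `confidence_agreement` and the dictionary keys are hypothetical; the threshold follows the \(\epsilon =0.5\) setting stated above, and the sketch makes no claim about how the authors produced the figure.

```python
import numpy as np

def confidence_agreement(probs_phi, probs_theta, eps=0.5):
    """Fractions of pixels on which both, only one, or none of the two
    segmentors are confident, i.e., have a maximum probability above eps.

    probs_phi, probs_theta: softmax outputs of shape (C, H, W).
    """
    conf_phi = np.asarray(probs_phi).max(axis=0) > eps
    conf_theta = np.asarray(probs_theta).max(axis=0) > eps
    return {
        "both": float(np.mean(conf_phi & conf_theta)),
        "only_one": float(np.mean(conf_phi ^ conf_theta)),
        "none": float(np.mean(~conf_phi & ~conf_theta)),
    }

# Toy example: 2 classes on a 1 x 4 slice.
probs_phi = np.array([[[0.9, 0.9, 0.5, 0.5]],
                      [[0.1, 0.1, 0.5, 0.5]]])
probs_theta = np.array([[[0.8, 0.5, 0.2, 0.5]],
                        [[0.2, 0.5, 0.8, 0.5]]])
rates = confidence_agreement(probs_phi, probs_theta, eps=0.5)
print(rates)  # {'both': 0.25, 'only_one': 0.5, 'none': 0.25}
```

The three fractions always sum to one, so recording them at regular intervals during training gives curves of the kind described for the consensus analysis.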

Fig. 3. Analysis of our ACT-based SSDA on the whole tumor segmentation task. (a) The proportion of testing pixels on which both, only one, or none of the segmentors have high confidence, (b) the performance improvements with a different number of labeled target domain training subjects, and (c) a sensitivity study of changing the proportion of EMD pairs of \(|\tilde{U}^{\phi }|\times N\) and \(|\tilde{U}^{\theta }|\times N\).

4 Conclusion

This work proposed a novel and practical SSDA framework for the segmentation task, which has great potential to improve target domain generalization with a manageable labeling effort in clinical practice. To achieve our goal, we resorted to a divide-and-conquer strategy with two asymmetric sub-tasks to balance the supervisions from the labeled source and target domain samples. An EMD scheme was further developed to exploit the pseudo labels smoothly in SSDA. Our experimental results on the cross-modality SSDA task using the BraTS18 database demonstrated that the proposed method surpassed the state-of-the-art UDA and SSDA methods.