1 Introduction

The inception of deep neural networks has revolutionized the landscape of medical image segmentation [14, 24]. This tremendous success, however, is conditioned on the assumption that the training and testing data are drawn from the same distribution. Unfortunately, in real-world clinical scenarios, domain shift is widespread between training (i.e., source domain) and testing (i.e., target domain) datasets due to different acquisition protocols or imaging modalities [15]. This distribution gap usually degrades model performance on the target domain. To achieve reliable performance across domains, a straightforward remedy is to manually label some target data and fine-tune the pre-trained model on them [13]. However, obtaining expert-level annotations in the medical imaging domain incurs significant time and expense [22]. Recently, unsupervised domain adaptation (UDA) has been widely investigated to reduce the domain gap by transferring knowledge learned from a richly labeled source domain to an unlabeled target domain [4, 7, 17, 19]. Existing UDA methods typically require access to source data during adaptation and enforce distribution alignment to diminish the discrepancy between source and target domains. This requirement limits their applicability when the source data are inaccessible. Hence, some very recent works have started to explore a more practical setting, source-free domain adaptation (SFDA), which adapts a pre-trained source model to an unlabeled target domain without accessing any source data [1, 5, 6, 12, 20, 21].

Fig. 1. (a\(\rightarrow \)b) t-SNE visualization of target feature distributions in the embedding space before and after Prototype-anchored Feature Alignment (PFA). (c) Category-wise probabilities of the unreliable pixel marked in (b).

Among these methods, [5] and [20] focus on generating reliable pseudo labels for target domain data by developing various denoising strategies. Unavoidably, these self-training methods depend heavily on the initial probability maps produced by the source model, which are considerably unreliable when the domain discrepancy is large (e.g., between CT and MRI). To relieve the issues caused by noisy pseudo labels, Bateson et al. [1] proposed a prior-aware entropy minimization method that minimizes a label-free entropy loss on target predictions. Furthermore, unlike the above self-adaptation methods, Yang et al. [21] utilized the statistics stored in the batch normalization layers of the source model, together with a mutual Fourier transform, to synthesize source-like images. However, the quality of the generated images is still affected by the domain discrepancy.

In this work, we propose a novel SFDA framework for cross-modality medical image segmentation. Our framework contains two sequentially conducted stages, i.e., a Prototype-anchored Feature Alignment (PFA) stage and a Contrastive Learning (CL) stage. As noted in previous work [12], the weights of the pre-trained classifier (i.e., projection head) can be employed as the source prototypes during domain adaptation. That means we can characterize the features of each class with a source prototype and align the target features with these prototypes instead of the inaccessible source features. To that end, during the PFA stage, we first introduce a target-to-prototype transport to pull the target features toward their corresponding prototypes. Then, to rule out the trivial solution in which all target features are assigned to the dominant class prototype (e.g., background), we add a reverse prototype-to-target transport to encourage diversity. However, although most target features are assigned to the correct class prototype after PFA, some hard samples with high prediction uncertainty still lie near the decision boundary (see Fig. 1(a\(\rightarrow \)b)). Moreover, we observe that such unreliable predictions are usually confused among only a few classes rather than all classes [18]. Taking the unreliable pixel in Fig. 1(b, c) as an example: although it has similarly high probabilities for the spleen and the left kidney, the model is quite sure that the pixel belongs to neither the liver nor the right kidney. Inspired by this, we use confusing pixels as negative samples for those unlikely classes and introduce the CL stage to pursue a more compact target feature distribution. Finally, we conduct experiments on a cross-modality abdominal multi-organ segmentation task. With only a source model and unlabeled target data, our method outperforms state-of-the-art SFDA approaches and even achieves results comparable to some classical UDA approaches.

2 Methods

We are given a segmentation model \(\mathcal {M}^{s}\) trained on \(N_s\) labeled samples \(\left\{ (x_{n}^{s}, y_{n}^{s})\right\} _{n=1}^{N_{s}}\) from the source domain \(\mathcal {D}^{s}\), and an unlabeled dataset with \(N_t\) samples \(\left\{ x_{m}^{t}\right\} _{m=1}^{N_{t}}\) from the target domain \(\mathcal {D}^{t}\), where \(x^s, x^t \in \mathbb {R}^{H \times W \times D}\), \(y_{n}^{s} \in \mathbb {R}^{H \times W}\), and H and W are the height and width of the samples. The goal of SFDA is to adapt the source model \(\mathcal {M}^{s}\), using only the unlabeled \(x^t\), to predict pixel-wise labels \(y^t\) for the target domain data. In general, the segmentation model consists of two parts: 1) a feature extractor \(F_{\theta }:x_i\rightarrow \boldsymbol{f}_i\in \mathbb {R}^{D_f}\), parameterized by \(\theta \), mapping each pixel \(i \in \{1,\cdots ,H \times W\}\) in image x to a feature \(\boldsymbol{f}_i\) in the embedding space; 2) a one-layer pixel-wise classifier \(\phi :\boldsymbol{f}_i\rightarrow \boldsymbol{p}_i \in \mathbb {R}^{C}\), which projects each pixel feature into the semantic label space with C classes.
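For concreteness, the following is a minimal PyTorch sketch of this two-part model (PyTorch is our implementation framework, see Sect. 3.1); the wrapper class, its names, and the use of a \(1\times 1\) convolution as the pixel-wise classifier are illustrative assumptions rather than our exact U-Net configuration.

```python
import torch.nn as nn

class SegModel(nn.Module):
    """Sketch: feature extractor F_theta + one-layer pixel-wise classifier phi."""
    def __init__(self, feature_extractor: nn.Module, d_f: int, num_classes: int):
        super().__init__()
        self.F = feature_extractor  # maps (B, 1, H, W) -> (B, D_f, H, W)
        # phi as a 1x1 conv = per-pixel linear projection; its weight matrix
        # holds C vectors of dimension D_f, later reused as prototypes mu_c.
        self.phi = nn.Conv2d(d_f, num_classes, kernel_size=1, bias=False)

    def forward(self, x):
        f = self.F(x)      # pixel features f_i in R^{D_f}
        p = self.phi(f)    # per-pixel class logits, softmaxed downstream
        return f, p
```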

In the SFDA task, the source classifier \(\phi ^{s}\) encounters a domain shift problem when classifying target domain features. To tackle this challenge, we propose a novel SFDA framework consisting of two main stages, as shown in Fig. 2. We elaborate on the details below.

Fig. 2. An overview of the proposed two-stage SFDA framework. (a) The first PFA stage: we freeze the classifier \(\phi ^s\) and use its weights for prototype-anchored feature alignment. (b) The subsequent CL stage: given a target image, we first use \(\mathcal {M}^{t_0}\) to make a prediction, and separate the pixels into query and negative ones for each class based on their reliability (entropy). Then, when minimizing \(\mathcal {L}_{\textrm{CL}}\), the features of query pixels come from \(F^{t}_{\theta }\) (query samples), while the features of negative pixels come from \(F^{t_0}_{\theta }\) (negative samples).

2.1 Prototype-Anchored Feature Alignment

Since the source data are unavailable, explicit feature alignment that directly minimizes the domain gap between source and target data, as in many UDA methods [4, 8], is infeasible. As shown in previous work [12], the weights \([\boldsymbol{\mu }_1,\boldsymbol{\mu }_2,\cdots ,\boldsymbol{\mu }_C] \in \mathbb {R}^{D_f \times C}\) of the source domain classifier \(\phi ^{s}\) can be interpreted as source prototypes that characterize the features of each class. Thus, we introduce a bi-directional transport cost to align the target features with these prototypes instead of the inaccessible source features.

Following [23], given a mini-batch \(\left\{ x_{m}^{t}\right\} _{m=1}^{M}\) with M images, we first adopt the cosine distance \(d(\boldsymbol{\mu }_{c},\boldsymbol{f}_{m, i}^{t})= 1-\langle \boldsymbol{\mu }_{c},\boldsymbol{f}_{m, i}^{t}\rangle \) to define a point-to-point transport cost between \(\boldsymbol{f}_{m, i}^{t}\) and \(\boldsymbol{\mu }_{c}\), where \(\langle \cdot ,\cdot \rangle \) is the cosine similarity. Then, a conditional distribution \(\pi _{\theta }\left( \boldsymbol{\mu }_{c} \mid \boldsymbol{f}_{m, i}^{t}\right) \) specifying the probability of transporting from \(\boldsymbol{f}_{m, i}^{t}\) to \(\boldsymbol{\mu }_{c}\) can be constructed as,

$$\begin{aligned} \pi _{\theta }\left( \boldsymbol{\mu }_{c} \mid \boldsymbol{f}_{m, i}^{t}\right) =\frac{\hat{p}\left( \boldsymbol{\mu }_{c}\right) \exp \left( \boldsymbol{\mu }_{c}^{T} \boldsymbol{f}_{m, i}^{t} / \tau \right) }{\sum _{c^{\prime }=1}^{C} \hat{p}\left( \boldsymbol{\mu }_{c^{\prime }}\right) \exp \left( \boldsymbol{\mu }_{c^{\prime }}^{T} \boldsymbol{f}_{m, i}^{t} / \tau \right) } \end{aligned}$$
(1)

where \(\tau \) is the temperature parameter, and \(\hat{p}\left( \boldsymbol{\mu }_{c}\right) \) is the prior distribution (i.e., the class proportions) over the C classes in the target domain. As the true class distribution is unavailable in the target domain, we use the EM algorithm to infer \(\hat{p}\left( \boldsymbol{\mu }_{c}\right) \) instead of assuming a uniform prior (see [16] for details). Note that, according to Eq. 1, a target point is more likely to be transported to class prototypes that are closer to it or that have a higher class proportion.
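As a minimal sketch, Eq. 1 can be computed as a prior-biased softmax over prototype similarities. Here we assume flattened, L2-normalized features and prototypes (so the inner product equals the cosine similarity) and a `class_prior` tensor holding the EM-estimated \(\hat{p}\left( \boldsymbol{\mu }_{c}\right) \); variable names are ours.

```python
import torch.nn.functional as F

def transport_probs(feats, prototypes, class_prior, tau=0.1):
    """Conditional transport probability pi_theta(mu_c | f_i) of Eq. (1).

    feats:       (P, D_f) flattened target pixel features (P = M*H*W)
    prototypes:  (C, D_f) source classifier weights used as prototypes
    class_prior: (C,)     EM-estimated class proportions p_hat(mu_c)
    """
    feats = F.normalize(feats, dim=1)        # assume cosine-normalized features
    prototypes = F.normalize(prototypes, dim=1)
    sim = feats @ prototypes.t() / tau       # (P, C) scaled similarities
    # p_hat * exp(sim) / sum_c' p_hat * exp(sim)  ==  softmax(sim + log p_hat)
    return (sim + class_prior.clamp_min(1e-8).log()).softmax(dim=1)
```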

With the conditional distribution and the point-to-point transport cost, we can derive the target-to-prototype (T2P) expected cost of moving the target features in this mini-batch to the source prototypes,

$$\begin{aligned} \mathcal {L}_{\textrm{T2P}} = \frac{1}{M\times H \times W} \sum _{m=1}^{M} \sum _{i=1}^{H \times W} \sum _{c=1}^{C} d(\boldsymbol{\mu }_{c},\boldsymbol{f}_{m, i}^{t}) \pi _{\theta }\left( \boldsymbol{\mu }_{c} \mid \boldsymbol{f}_{m, i}^{t}\right) \end{aligned}$$
(2)

In this target-to-prototype direction, each target pixel is assigned to the prototypes according to their similarities and the class distribution. However, like many entropy minimization methods [1, 2], optimizing the target-to-prototype cost alone may result in degenerate trivial solutions that bias the prediction towards a single dominant class [16]. To avoid mapping most of the target features to only a few prototypes, we add a prototype-to-target (P2T) transport cost in the opposite direction, which ensures that each prototype is assigned some target features. Similarly, we have:

$$\begin{aligned} \mathcal {L}_{\textrm{P2T}} = \sum _{c=1}^{C} \hat{p}\left( \boldsymbol{\mu }_{c}\right) \sum _{m=1}^{M} \sum _{i=1}^{H \times W}d(\boldsymbol{\mu }_{c},\boldsymbol{f}_{m, i}^{t}) \frac{\exp \left( \boldsymbol{\mu }_{c}^{T} \boldsymbol{f}_{m, i}^{t} / \tau \right) }{\sum _{m^{\prime }=1}^{M} \sum _{i^{\prime }=1}^{H \times W} \exp \left( \boldsymbol{\mu }_{c}^{T} \boldsymbol{f}_{m^{\prime }, i^{\prime }}^{t} / \tau \right) } \end{aligned}$$
(3)

Then, combining the conditional transport cost in these two directions, we define the total prototype-anchored feature alignment (PFA) loss:

$$\begin{aligned} \mathcal {L}_{\textrm{PFA}} = \mathcal {L}_{\textrm{T2P}}+\mathcal {L}_{\textrm{P2T}} \end{aligned}$$
(4)
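Under the same assumptions as above (flattened, L2-normalized pixel features and an estimated class prior), a sketch of the combined loss in Eqs. 2-4 might look as follows; tensor shapes and names are illustrative, not an official implementation.

```python
import torch.nn.functional as F

def pfa_loss(feats, prototypes, class_prior, tau=0.1):
    """Bi-directional transport cost L_PFA = L_T2P + L_P2T (Eqs. (2)-(4))."""
    feats = F.normalize(feats, dim=1)            # (P, D_f), P = M*H*W pixels
    prototypes = F.normalize(prototypes, dim=1)  # (C, D_f)
    sim = feats @ prototypes.t()                 # cosine similarities (P, C)
    cost = 1.0 - sim                             # d(mu_c, f_i) = 1 - <mu_c, f_i>

    # T2P (Eq. 2): each pixel spreads its mass over prototypes.
    pi_t2p = (sim / tau + class_prior.clamp_min(1e-8).log()).softmax(dim=1)
    loss_t2p = (cost * pi_t2p).sum(dim=1).mean()

    # P2T (Eq. 3): each prototype spreads its mass over pixels, weighted by
    # the class prior, so even minority prototypes attract some features.
    pi_p2t = (sim / tau).softmax(dim=0)          # columns sum to 1 over pixels
    loss_p2t = (class_prior * (cost * pi_p2t).sum(dim=0)).sum()

    return loss_t2p + loss_p2t                   # Eq. (4)
```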

Similar to [6], we initialize the adaptation model \(\mathcal {M}^{t_0}\) with the pre-trained source model \(\mathcal {M}^{s}\) and fix the weights of the classifier during adaptation.

2.2 Contrastive Learning Using Unreliable Predictions

After the PFA stage, the clusters of target features are shifted towards their corresponding source prototypes, which brings remarkable improvements over the initial noisy predictions (see Fig. 3(b)). To further improve the compactness of the target feature distribution, previous self-training methods mainly focus on strengthening the reliability of pseudo labels by developing denoising strategies [5, 20], while discarding low-confidence predictions. However, dismissing unreliable predictions in this way may result in information loss. For example, in Fig. 1(c), the prediction for the unreliable pixel hovers between the spleen and the left kidney, yet it is confident enough to indicate which categories the pixel does not belong to.

With this intuition, we denote by \(\boldsymbol{p}_{m,i}^{t}\) the softmax probabilities generated by the model \(\mathcal {M}^{t_0}\) for the target pixel \(x_{m,i}^{t}\). Then, for each class c, we construct three components, namely query samples, positive prototypes, and negative samples, to exploit those unreliable predictions, following [18].

Query Samples. During training, we employ the per-pixel entropy as the uncertainty metric [18] and sample the pixels with low entropy (i.e., reliable pixels) in the current mini-batch as query candidates. We denote the set of features of all query pixels for class c as \(\mathcal {P}_{c}\),

$$\begin{aligned} \mathcal {P}_{c} = \{ \boldsymbol{f}_{m, i}^{t} \ \vert \ \mathcal {H}({\boldsymbol{p}_{m,i}^{t}}) \le \gamma _c, \ \arg \max \limits _{c^{\prime }}{\boldsymbol{p}_{m,i}^{t}}=c \} \end{aligned}$$
(5)

where \(\mathcal {H}(\cdot )\) denotes the entropy of the input probabilities and \(\gamma _c\) is the entropy threshold for class c. Here we set \(\gamma _c\) to the \(\alpha _c\)-th percentile of the entropy values of all pixels assigned pseudo label c.
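A sketch of this selection rule (Eq. 5), assuming flattened per-pixel features and softmax probabilities; `torch.quantile` with \(q=\alpha _c/100\) plays the role of the \(\alpha _c\)-th percentile threshold \(\gamma _c\).

```python
import torch

def select_queries(feats, probs, alpha=0.80):
    """Per-class sets P_c of reliable (low-entropy) query features, Eq. (5)."""
    entropy = -(probs * probs.clamp_min(1e-8).log()).sum(dim=1)  # H(p_i)
    pseudo = probs.argmax(dim=1)                                  # pseudo labels
    queries = {}
    for c in range(probs.shape[1]):
        mask = pseudo == c
        if not mask.any():
            continue
        gamma_c = torch.quantile(entropy[mask], alpha)  # alpha_c-th percentile
        queries[c] = feats[mask & (entropy <= gamma_c)]
    return queries
```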

Positive Prototypes. The positive prototype is the same for all query pixels from the same class. Instead of using the center of the query samples as in [18], we set it to the corresponding source prototype, denoted \(\boldsymbol{z}_c^{+}=\boldsymbol{\mu }_c\).

Negative Samples. For a query sample from class c, qualified negative samples should satisfy two criteria: 1) being unreliable; 2) being highly unlikely to belong to class c. To formalize the second criterion, we introduce the pixel-level category order \(\mathcal {O}_{m, i}^{t} = \textrm{argsort}({\boldsymbol{p}_{m,i}^{t}})\); for example, \(\mathcal {O}_{m, i}^{t}(\arg \max {\boldsymbol{p}_{m,i}^{t}})=1\) and \(\mathcal {O}_{m, i}^{t}(\arg \min {\boldsymbol{p}_{m,i}^{t}})=C\). We can then use \(\mathcal {O}_{m, i}^{t}(c)\) to define the set of all negative samples:

$$\begin{aligned} \mathcal {N}_{c} = \{ \boldsymbol{f}_{m, i}^{t} \ \vert \ \mathcal {H}({\boldsymbol{p}_{m,i}^{t}})>\gamma _c, \ \mathcal {O}_{m, i}^{t}(c) \ge r_l \} \end{aligned}$$
(6)

where \(r_l\) is the low-rank threshold, which is set to 3 in our task.
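A corresponding sketch of Eq. 6, under the same assumptions as the query-selection sketch; the double `argsort` converts probabilities into the rank order \(\mathcal {O}_{m, i}^{t}\) (rank 1 = most probable class).

```python
def select_negatives(feats, probs, gamma, c, r_l=3):
    """Negative sample set N_c for class c, Eq. (6).

    gamma is the per-class entropy threshold gamma_c from Eq. (5).
    """
    entropy = -(probs * probs.clamp_min(1e-8).log()).sum(dim=1)
    # Double argsort turns probabilities into ranks O(c): 1 = most probable.
    ranks = probs.argsort(dim=1, descending=True).argsort(dim=1) + 1
    mask = (entropy > gamma) & (ranks[:, c] >= r_l)  # unreliable AND unlikely c
    return feats[mask]
```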

With the above definitions, the pixel-level contrastive loss is:

$$\begin{aligned} \mathcal {L}_{\textrm{CL}}= -\frac{1}{C \times K} \sum _{c=1}^C \sum _{k=1}^K \log \left[ \frac{e^{\left\langle \boldsymbol{z}_{c,k}, \boldsymbol{z}_{c}^{+}\right\rangle / \tau }}{e^{\left\langle \boldsymbol{z}_{c,k}, \boldsymbol{z}_{c}^{+}\right\rangle / \tau }+\sum _{j=1}^N e^{\left\langle \boldsymbol{z}_{c,k}, \boldsymbol{z}_{c,k,j}^{-}\right\rangle / \tau }}\right] \end{aligned}$$
(7)

where K is the number of query samples, and \(\boldsymbol{z}_{c,k} \in \mathcal {P}_{c}\) denotes the k-th query sample from class c. Each query sample is paired with a positive prototype \(\boldsymbol{z}_{c}^{+}\) and N negative samples \(\boldsymbol{z}_{c,k,j}^{-} \in \mathcal {N}_{c}\).
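A sketch of Eq. 7, assuming the K query features, per-class positive prototypes, and N negatives per query have already been gathered (e.g., with the selection rules above) and L2-normalized so that \(\left\langle \cdot ,\cdot \right\rangle \) is the cosine similarity; with the positive logit placed at index 0, the loss reduces to a standard cross-entropy.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(queries, pos_protos, negatives, tau=0.1):
    """Pixel-level contrastive loss L_CL of Eq. (7).

    queries:    (C, K, D_f)    K query features per class (from P_c)
    pos_protos: (C, D_f)       positive prototype z_c^+ = mu_c per class
    negatives:  (C, K, N, D_f) N negative features per query (from N_c)
    """
    pos = (queries * pos_protos.unsqueeze(1)).sum(-1, keepdim=True) / tau  # (C, K, 1)
    neg = (queries.unsqueeze(2) * negatives).sum(-1) / tau                 # (C, K, N)
    logits = torch.cat([pos, neg], dim=-1).flatten(0, 1)                   # (C*K, 1+N)
    # Cross-entropy with target index 0 is exactly -log softmax at the positive.
    target = torch.zeros(logits.size(0), dtype=torch.long, device=logits.device)
    return F.cross_entropy(logits, target)
```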

3 Experiments and Results

3.1 Experimental Setup

Datasets and Evaluation Metrics. We evaluate our SFDA approach on a cross-modality abdominal multi-organ segmentation task. For the abdominal datasets, we obtain 20 MRI volumes from the 2019 CHAOS Challenge [10] and 30 CT volumes from MICCAI 2015 [11]. Both datasets are under the Creative Commons Attribution 4.0 International license and provide segmentation masks for the following abdominal organs: liver, right kidney, left kidney, and spleen. We conduct adaptation experiments in both the "MRI to CT" and "CT to MRI" directions. For the "MRI to CT" direction, we use the MRI modality to train the source model, and vice versa. Both modalities are randomly divided into 80% for domain adaptation training and 20% for evaluation. For both datasets, we discard the axial slices that do not contain foreground and crop out the non-body region [3]. The value range of the CT volumes is first clipped to \([-125,275]\). Min-max normalization is then performed on both datasets to normalize intensity values to [0, 1]. After that, all MRI and CT volumes are uniformly resized to \(256\times 256\) in the axial plane. Due to the large variance in slice thickness between the CT and MRI modalities, we split each volume into 2D slices for model training.
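For reference, a minimal sketch of the intensity preprocessing described above; the function name and per-volume min-max convention are assumptions, and the resizing/interpolation settings are left out.

```python
import numpy as np

def preprocess(volume, modality):
    """Intensity preprocessing: clip CT to [-125, 275] HU, min-max to [0, 1]."""
    if modality == "CT":
        volume = np.clip(volume, -125.0, 275.0)
    vmin, vmax = float(volume.min()), float(volume.max())
    volume = (volume - vmin) / (vmax - vmin + 1e-8)
    # Volumes are then resized to 256x256 in the axial plane and split into
    # 2D slices for training.
    return volume
```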

Fig. 3. (a) Qualitative segmentation results of different methods for abdominal images. (b) Visualized evolution of the model uncertainty and predictions in different stages.

For the evaluation, two main metrics, the dice similarity coefficient (Dice) and the average symmetric surface distance (ASSD), are used to quantitatively evaluate the segmentation results [4, 15].

Implementation Details. We adopt the classic U-Net structure for the segmentation model, following previous work [1]. The source segmentation model is trained in a fully supervised manner for 10k iterations. During adaptation, we use the Adam optimizer with a learning rate of \(1 \times 10^{-4}\) and a weight decay of \(5 \times 10^{-4}\). The temperature \(\tau \) and batch size are set to 0.1 and 16, respectively. In the PFA stage, we freeze the classifier and optimize \(F_\theta ^{t_0}\) for 200 iterations. In the CL stage, we empirically set the hyper-parameters \(\alpha _c=80\), \(K=64\), and \(N=256\) for all classes. All experiments are conducted with PyTorch on a single NVIDIA RTX 3090 GPU with 24 GB of memory. Data augmentations such as random cropping, rotation, and brightness adjustment are adopted for source domain training and target domain adaptation.
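The stage-wise optimization can be summarized in a short sketch; `model` refers to the illustrative wrapper from Sect. 2, and only the numeric settings come from the paragraph above.

```python
import torch

# PFA stage: the classifier phi stays frozen; only F_theta is optimized.
model.phi.requires_grad_(False)          # `model` as in the Sect. 2 sketch
optimizer = torch.optim.Adam(model.F.parameters(),
                             lr=1e-4, weight_decay=5e-4)
tau, batch_size, pfa_iters = 0.1, 16, 200
alpha_c, K, N = 80, 64, 256              # CL-stage hyper-parameters
```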

3.2 Results of Source-Free Domain Adaptation

Comparison with Other Methods. In our experiments, the "no adaptation" lower bound denotes training a model on the source domain and directly testing it on the target domain without adaptation, while the "supervised" upper bound means training and testing on the same target domain. We compare our method with recent SFDA methods designed for medical image segmentation, including a denoised pseudo-labeling approach (DPL) [5], a prior-aware entropy minimization approach (AdaMI) [1], a Fourier style mining approach (FSM) [21], and a feature-map-statistics-guided approach [9]. We also consider top-performing UDA methods (i.e., SIFA [4], DAG-Net [19]). For a fair comparison, we use the same backbone for the methods of [1, 4, 5, 21] and reimplement them from their official code. Note that we report the results of [9, 19] from their papers, since their official code has not been released.

Table 1. Comparison with other methods on the abdominal multi-organ datasets.
Fig. 4. (a) Ablation analysis of the two proposed SFDA stages. "w/o CL" denotes that only the PFA stage is performed; "w/o PFA" denotes directly optimizing the contrastive loss based on the source model predictions. (b) Effect of different uncertainty percentiles \(\alpha _c\) on the adaptation performance.

The quantitative evaluation results are presented in Table 1. Compared to the upper and lower bounds in both directions, a huge performance gap can be observed due to the severe domain shift between the MRI and CT modalities. In the "MRI to CT" direction, our method remarkably outperforms all other SFDA approaches on the right kidney and spleen, achieving the highest average Dice of 86.1% and the lowest average ASSD of 1.4. Moreover, compared with recent UDA methods, our method obtains competitive results on average Dice and ASSD, which may be attributed to the use of unreliable predictions. As for the "CT to MRI" direction, our method similarly shows great superiority on most organs, achieving the best performance among all SFDA methods in terms of both average Dice (89.2%) and ASSD (1.3). Figure 3(a) shows the segmentation results obtained by existing methods and ours in both modalities. As observed, DPL tends to amplify the initial noisy regions, since it directly discards unreliable pixels during self-training. In comparison, our method substantially rectifies the uncertain regions of the initial prediction, as detailed in Fig. 3(b).

Ablation Study. In Fig. 4(a), we verify the effectiveness of the two proposed SFDA stages by removing each stage while keeping the other. The consecutive two-stage adaptation leads to the best performance, and the drop in Dice is more significant when the PFA stage is removed. This result is not surprising: without PFA, the source model predictions are too noisy to sample qualified query and negative pixels for contrastive learning. We also study the impact of the uncertainty percentile \(\alpha _c\) in Fig. 4(b). This parameter has a noticeable impact on performance, and we find that \(\alpha _c = 80\%\) achieves the best results for most organs. A large \(\alpha _c\) may introduce low-confidence query samples for supervision, while a small \(\alpha _c\) drops some informative negative samples.

4 Conclusion

In this paper, we propose a novel two-stage framework to address source-free domain adaptation in medical image segmentation. In the prototype-anchored feature alignment stage, we introduce a bi-directional transport cost to encourage alignment between target features and source class prototypes. A contrastive learning stage using unreliable predictions is then devised to learn a more compact target feature distribution. Extensive experiments on a cross-modality abdominal multi-organ segmentation task validate the effectiveness and superiority of our method over other strong SFDA baselines and even some classical UDA approaches.