1 Introduction

Deep learning has achieved great success in mainstream computer vision in recent years, and this experience provides a reference for the field of medical image analysis [1,2,3,4,5]. In real clinical scenarios, medical images are often acquired with different modalities, scanners, and protocols, at different sites and from populations with different characteristics; the resulting data suffer from severe domain shift (inconsistent data distributions between the source and target domains), which can degrade the performance of pre-designed methods [6, 7]. At the same time, dense expert annotation by experienced physicians is time-consuming and labor-intensive, and real patient data are subject to privacy protections and security regulations that limit access. This lack of annotation makes training a new model for each target domain not only costly but often impractical. Together, domain shift and the scarcity of annotations prevent conventional vision pipelines from working directly in medical image analysis. How to adapt a trained source model with a small amount of target domain data in real clinical scenarios with drastic domain changes is therefore an urgent problem.

Fig. 1

Explanation of the difference between domain adaptation and source-free domain adaptation. Domain adaptation methods utilize both source domain data and target domain data to design feature alignment and other adaptation methods. In practical clinical scenarios, accessing source domain data may be infeasible due to issues such as privacy and security concerns. Source-free domain adaptation, on the other hand, is a solution that addresses this limitation. It involves tuning pre-trained source model parameters using alternative methods without accessing the source domain data, aiming to improve the model’s performance on the target domain

Unsupervised Domain Adaptation (UDA) maps the source and target domains to a common feature space and then aligns the distributions of the two sets of features, so that the model adapts to target domain samples and, through similar representations, achieves performance close to or even consistent with that on the source domain [6,7,8]. These approaches require labeled source datasets and well-trained models to learn source knowledge during domain adaptation training. As shown in Fig. 1, in contrast to UDA, where source and target domain samples are aligned directly, Source-Free Domain Adaptation (SFDA) adjusts the parameters of a pre-trained source model using only target domain samples, with the aim of reducing the feature discrepancy between the source and target domains [9, 10].

Existing SFDA methods can be roughly categorized into two types. The first type achieves domain alignment through virtual domain generation. Yang et al. [11] extract source domain knowledge from the source model to build a generative model of the source data, use it to perform style transformation on the target domain data, and then apply methods such as contrastive learning and noisy-label filtering for target domain adaptation. Tian et al. [12] generate virtual domain samples in feature space from a pre-trained source model using an approximate Gaussian mixture model, allowing the virtual domain to maintain a distribution similar to the source domain without accessing the source data. Qiu et al. [13] train a prototype generator by exploring the classification boundary information of the source model through contrastive learning. However, these methods not only incur additional computational cost but also require dedicated generation schemes for different domains, which can be cumbersome. The second type focuses on inter-domain feature alignment. These methods leverage techniques such as knowledge distillation and statistical consistency to obtain model parameters that are robust to domain shift [10]. Bateson et al. [14] guide adaptation by minimizing an unlabeled entropy loss defined on the target domain data, together with weak labels of the target samples and a domain-invariant prior on the segmented regions. Kim et al. [15] select reliable samples with a self-entropy criterion and define them as class prototypes; self-supervised learning is then performed by assigning a pseudo-label to each target sample based on its similarity scores to the class prototypes. These methods are free from the constraints of domain generation and achieve simple, generalizable adaptation of the domain distribution by guiding the feature distributions into alignment through prior knowledge.

The above inter-domain feature alignment-based methods have made progress without accessing the source data. However, on the one hand, prolonged adaptation on a small number of target domain samples biases the model toward these few samples, making it difficult to retain its original strength on the source domain during adaptation. On the other hand, because explicit supervision is lacking, the feature signals obtained by these methods inevitably contain redundant, erroneous, and harmful noise. Indiscriminately using such signals to update the source model parameters may push the model in a poor direction or even cause it to collapse completely. With these limitations in mind, this letter investigates persistent adaptation on the target domain under effective supervision.

We propose a two-stage SFDA framework for additive source-free domain adaptation, aiming to achieve more robust and simpler target domain adaptation for medical image segmentation. Inspired by consistency learning, in the first stage we freeze the decoder of the model and generalize its encoding capability by aligning style and content consistency between rotated and cropped images. In the second stage, we freeze the encoder of the model fine-tuned in the first stage and guide the decoder with uncertainty-map-weighted knowledge distillation between the target domain samples and their augmented views. The main contributions of this letter can be summarized as follows:

  • We investigate a more realistic and challenging task: achieving continuous learning on both the target and source domains without accessing the source data. The proposed method effectively addresses the problems of cross-distribution domain shift and of source domain data being unavailable due to privacy and security protection in medical image segmentation.

  • We propose a two-stage approach that adapts the encoder and decoder of the model separately. In the encoder adaptation stage, the framework uses multi-view feature style and content consistency to generalize the feature extraction capability of the segmentation model's encoder; in the decoder adaptation stage, it reduces errors in the decoder's reconstruction of the segmentation results by locating and eliminating potentially erroneous feature elements through uncertainty estimation.

  • We conduct fair comparison experiments with current state-of-the-art methods in cross-device polyp segmentation and cross-modal brain tumor segmentation application scenarios, respectively. We validate the effectiveness of the proposed method on the target domain, showing that it adapts well to different target domain shifts. We further report the post-adaptation performance on a source domain test set to show that the proposed method retains knowledge rather than replacing it.

2 Method

2.1 Overview

We propose a two-stage adaptation framework to enable additive source-free domain adaptation by adjusting the model to learn domain-invariant features. It consists of an encoder adaptation stage that learns joint style and content invariance and a decoder adaptation stage that reduces feature uncertainty.

Let us define the source data \(D_{s}=\left\{ x_{s}, y_{s} \right\} \) and the target data \(D_{t}=\left\{ x_{t} \right\} \), where \(x_{s}\) is a source image, \(y_{s}\) is the corresponding source label, and \(x_{t}\) is a target image. \(x^{crop}\) and \(x^{rot}\) are the cropped and rotated augmented images, respectively, and \(z^{crop}\), \(z^{rot}\), \(z_{t}\) are the intermediate vectors produced by the encoder. \(p\) is the predicted probability map, and \(M_{s}\) is the pre-trained source model. Our goal is to improve the performance of the adapted model \(M_{t}\) on the target domain by adjusting \(M_{s}\) using only the target data \(D_{t}\), without accessing the source data \(D_{s}\).

Fig. 2

The proposed encoder adaptation stage. The decoder is frozen first to improve the robustness and anti-interference capability of the encoder by learning style and content consistency for features from different viewpoints

2.2 Encoder adaptation stage

Medical images reveal different anatomical structures of organs and tissues, and they exhibit significant differences in style information such as texture, contrast, saturation, and other visual attributes across devices and modalities. These style variations exacerbate the domain shift along such cross-domain imaging pathways, leading to performance degradation of models in cross-scenario settings [11, 16]. Taking these inherent properties into consideration, in the encoder adaptation stage (fixed decoder, adjusted encoder) we decompose the high-level semantic representation space learned by the encoder into a content representation and a style representation. We then use style matching to enforce style consistency of the overall representation distribution between the base branch and the two branches obtained by cropping and rotation. The specific process is shown in Fig. 2.

Specifically, for each augmented sample feature \(z^{aug}\), the base feature produced by the model is transformed with the same operation used to build the augmented view (same-range cropping or same-angle rotation), so that it covers the same field of view as the augmented feature; style and content consistency are then constrained between the two.
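To make the field-of-view alignment concrete, the following minimal PyTorch sketch applies to the base-branch feature map the same geometric transform that was used to build the augmented view; the function name, the crop-box convention, and the number of 90-degree rotation steps are illustrative assumptions rather than details taken from the paper.

```python
import torch

def align_base_feature(z_base: torch.Tensor, view: str,
                       crop_box=None, k: int = 1) -> torch.Tensor:
    """Transform the base-branch feature so it covers the same field of
    view as the augmented branch before the consistency losses are computed.

    z_base: feature map of shape (n, c, h, w).
    view: "crop" or "rot"; crop_box = (top, left, height, width) in
    feature-map coordinates and k = number of 90-degree rotations are
    hypothetical parameters, not taken from the paper.
    """
    if view == "crop":
        top, left, h, w = crop_box
        return z_base[:, :, top:top + h, left:left + w]
    if view == "rot":
        return torch.rot90(z_base, k=k, dims=(2, 3))
    raise ValueError(f"unknown view: {view}")
```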

The process of the style and content consistency constraints is shown in the following equations:

$$\begin{aligned} \begin{aligned} loss_{style}=&(\frac{\sum _{i=1}^{n} z_{i}^{base}}{n} -\frac{\sum _{i=1}^{n}z_{i}^{aug}}{n})^2 \\&+\frac{\sum _{i=1}^{n}(z_{i}^{base}-z_{mean}^{base})^2 }{n} \\&+\frac{\sum _{i=1}^{n}(z_{i}^{aug}-z_{mean}^{aug})^2 }{n} \end{aligned} \end{aligned}$$
(1)
$$\begin{aligned} loss_{content}=\frac{\sum _{i=1}^{n} (z^{base}_{i}-z^{aug}_{i})^2}{n} \end{aligned}$$
(2)

where n is the number of samples in a batch, \(z^{aug}\in \left\{ z^{crop}, z^{rot} \right\} \), and \(z_{mean}^{aug}\) is the batch mean of the features of the corresponding augmented view. In the implementation, the alignment of each augmented view \(z^{aug}\) with the base branch \(z^{base}\) is computed separately.
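As a minimal sketch of Eqs. (1)–(2), assuming the branch features have been flattened to shape (n, d) and that the per-dimension statistics are finally averaged to a scalar (a reduction the paper does not specify), the two losses can be written in PyTorch as follows; the function name is ours.

```python
import torch

def style_content_losses(z_base: torch.Tensor, z_aug: torch.Tensor):
    """Style (Eq. 1) and content (Eq. 2) consistency losses.

    z_base, z_aug: features of the base and augmented branches,
    flattened to shape (n, d), where n is the batch size.
    """
    mean_base, mean_aug = z_base.mean(dim=0), z_aug.mean(dim=0)
    var_base = ((z_base - mean_base) ** 2).mean(dim=0)
    var_aug = ((z_aug - mean_aug) ** 2).mean(dim=0)

    # Eq. (1): squared difference of the batch means plus the two batch variances.
    loss_style = ((mean_base - mean_aug) ** 2 + var_base + var_aug).mean()

    # Eq. (2): element-wise agreement between the two branches.
    loss_content = ((z_base - z_aug) ** 2).mean()
    return loss_style, loss_content
```

In practice the function would be called twice per batch, once for the cropped view and once for the rotated view, matching the separate alignment described above.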

Fig. 3

The proposed decoder adaptation stage. After encoder adaptation, the encoder is frozen and the decoder is fine-tuned. The uncertainty of the samples is computed by Monte-Carlo uncertainty estimation, and the features are then filtered by thresholding and weighted to reduce the uncertain elements and ensure that the decoder is optimized in the right direction

2.3 Decoder adaptation stage

The high-level semantic representations produced by the encoder may include sub-elements with low confidence or incorrect categorization. These errors are gradually amplified by the upsampling decoder, resulting in regional false-positive segmentation results [17, 18]. We can capture these low-confidence sub-elements by estimating the model's uncertainty on the samples, and then jointly constrain the decoder's reconstruction through two mechanisms: filtering out low-confidence elements and enforcing consistency across different augmented views of the same sample. The specific process is shown in Fig. 3.

Specifically, in the decoder adaptation stage (fixed encoder, adjusted decoder), we perform Monte-Carlo uncertainty estimation [19] on the predictions for the base view and the view augmented by ColorJitter, obtaining the uncertainty maps \(U^{base}\) and \(U^{aug}\) together with the averaged probability prediction maps \(p^{base}\) and \(p^{aug}\). Pixels with low confidence in \(U^{base}\) and \(U^{aug}\) are filtered by a threshold \(\gamma \), and the filtered uncertainty maps are used to weight the averaged prediction maps. Finally, knowledge distillation is used to align the weighted probability prediction maps.

The uncertainty map is calculated as follows:

$$\begin{aligned} \Delta _{U} = \frac{\sum _{i = 1}^{n}\sum _{j = 1}^{t}(p_{i,j}-p^{mean}_{i})^2 }{n} \end{aligned}$$
(3)
$$\begin{aligned} U=\left\{ \begin{array}{ll} \Delta _{U}&{} \Delta _{U} \ge \gamma \\ 0 &{} \text{ otherwise } \end{array}\right. \end{aligned}$$
(4)

where \(\gamma \) is a hyperparameter used to adjust the filtering ratio, t is the number of Monte-Carlo uncertainty estimation iterations, and \(p^{mean}_{i}\) represents the average over the t iterations for element i.
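A hedged sketch of Eqs. (3)–(4) under the common Monte-Carlo dropout setup, in which only the dropout layers are kept stochastic during the t forward passes. Returning per-pixel maps (which are needed for the weighting in Eq. (5)) rather than a batch-summed scalar, the default values of t and \(\gamma \), and the assumption that the model returns raw logits are ours.

```python
import torch

@torch.no_grad()
def mc_uncertainty(model, x: torch.Tensor, t: int = 8, gamma: float = 0.05):
    """Monte-Carlo uncertainty estimation with threshold filtering (Eqs. 3-4)."""
    model.eval()
    for m in model.modules():                      # keep only dropout stochastic
        if isinstance(m, (torch.nn.Dropout, torch.nn.Dropout2d)):
            m.train()

    probs = torch.stack([torch.softmax(model(x), dim=1) for _ in range(t)])
    p_mean = probs.mean(dim=0)                     # averaged prediction map
    delta_u = ((probs - p_mean) ** 2).sum(dim=0)   # dispersion over the t passes, Eq. (3)
    u = torch.where(delta_u >= gamma, delta_u,
                    torch.zeros_like(delta_u))     # threshold filtering, Eq. (4)
    return p_mean, u
```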

The calculation process of consistency distillation is as follows:

$$\begin{aligned} loss_{con} = \frac{\sum _{i=1}^{n}(p^{base}_{i}*U^{base}_{i}-p^{aug}_{i}*U^{aug}_{i})^2 }{n} \end{aligned}$$
(5)
$$\begin{aligned} loss_{entropy} = -\sum _{i=1}^{n}p_{i}\log p_{i} \end{aligned}$$
(6)
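The two decoder-stage losses can then be sketched as below; applying the entropy term to the base-view prediction map and reducing both terms with a mean over pixels are our assumptions.

```python
import torch

def decoder_losses(p_base, u_base, p_aug, u_aug, eps: float = 1e-8):
    """Uncertainty-weighted consistency (Eq. 5) and entropy (Eq. 6) losses."""
    # Eq. (5): align the uncertainty-weighted prediction maps of the two views.
    loss_con = ((p_base * u_base - p_aug * u_aug) ** 2).mean()
    # Eq. (6): entropy of the prediction map; eps avoids log(0).
    loss_entropy = -(p_base * torch.log(p_base + eps)).mean()
    return loss_con, loss_entropy
```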

2.4 Optimization

The proposed method first performs encoder adaptation and then, based on its result, performs decoder adaptation; the two stages are optimized according to \(loss_{stage1}\) and \(loss_{stage2}\), respectively. During the encoder adaptation stage the decoder is frozen and only the encoder is updated; similarly, during the decoder adaptation stage the encoder is frozen and only the decoder is updated.

$$\begin{aligned} loss_{stage1} = \alpha \times loss_{style}+(1-\alpha ) \times loss_{content} \end{aligned}$$
(7)
$$\begin{aligned} loss_{stage2} = \beta \times loss_{con} + (1-\beta ) \times loss_{entropy} \end{aligned}$$
(8)

where \(\alpha \) and \(\beta \) are corresponding weight hyperparameters.
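Putting the pieces together, a minimal two-stage driver might look as follows. It assumes the model exposes encoder and decoder sub-modules, reuses the sketches above (style_content_losses, mc_uncertainty, decoder_losses), uses a single augmented view per batch for brevity, and takes a fresh gradient-carrying forward pass for the stage-2 prediction maps because the Monte-Carlo passes run without gradients; the learning rate and momentum are assumptions.

```python
import torch

def adapt(model, target_loader, alpha=0.7, beta=0.9, lr=1e-3,
          enc_epochs=100, dec_epochs=50):
    """Two-stage adaptation driver, Eqs. (7)-(8)."""
    # Stage 1: freeze the decoder, adapt the encoder.
    for p in model.decoder.parameters():
        p.requires_grad_(False)
    opt = torch.optim.SGD(model.encoder.parameters(), lr=lr, momentum=0.9)
    for _ in range(enc_epochs):
        for x_base, x_aug in target_loader:
            z_base = model.encoder(x_base).flatten(1)
            z_aug = model.encoder(x_aug).flatten(1)
            loss_style, loss_content = style_content_losses(z_base, z_aug)
            loss = alpha * loss_style + (1 - alpha) * loss_content    # Eq. (7)
            opt.zero_grad(); loss.backward(); opt.step()

    # Stage 2: freeze the encoder, adapt the decoder.
    for p in model.encoder.parameters():
        p.requires_grad_(False)
    for p in model.decoder.parameters():
        p.requires_grad_(True)
    opt = torch.optim.SGD(model.decoder.parameters(), lr=lr, momentum=0.9)
    for _ in range(dec_epochs):
        for x_base, x_aug in target_loader:
            _, u_base = mc_uncertainty(model, x_base)     # uncertainty maps, no gradients
            _, u_aug = mc_uncertainty(model, x_aug)
            p_base = torch.softmax(model(x_base), dim=1)  # gradient-carrying pass
            p_aug = torch.softmax(model(x_aug), dim=1)
            loss_con, loss_ent = decoder_losses(p_base, u_base, p_aug, u_aug)
            loss = beta * loss_con + (1 - beta) * loss_ent            # Eq. (8)
            opt.zero_grad(); loss.backward(); opt.step()
    return model
```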

3 Experiment

3.1 Experiment setting

We utilize DeeplabV3+ [20] as the base segmentation model for the source-free domain adaptation experiments and implement the entire framework in PyTorch. Following the workflow of previous works [11, 14], we divide the data into source domain data and target domain data: the source domain data are used to pretrain the source model, and the corresponding domain adaptation methods are then employed to optimize the model on the target domain. We use the SGD optimizer and apply basic data augmentation with ColorJitter. For methods with open-source code, we run the provided code; for methods without open-source implementations, we strictly follow the descriptions in their papers to construct the corresponding pipelines.
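For reference, a hedged sketch of this setup: torchvision's DeepLabV3 stands in for the DeepLabV3+ model used in the paper, and the learning rate, momentum, and ColorJitter strengths are assumed values; only the choice of SGD and of ColorJitter as the basic augmentation comes from the text.

```python
import torch
from torchvision import transforms
from torchvision.models.segmentation import deeplabv3_resnet101

# torchvision ships DeepLabV3 (not the V3+ variant used in the paper),
# so this model is only a stand-in; two classes for binary segmentation.
model = deeplabv3_resnet101(weights=None, num_classes=2)

# SGD optimizer as stated in the text; lr and momentum are assumed values.
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)

# ColorJitter as the basic augmentation; the jitter strengths are assumed values.
augment = transforms.Compose([
    transforms.ColorJitter(brightness=0.4, contrast=0.4, saturation=0.4),
    transforms.ToTensor(),
])
```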

3.2 Dataset

We perform fair comparison experiments with current state-of-the-art methods on two publicly available medical image segmentation datasets. We follow previous methods in splitting the datasets [11], and for each task we perform 3-fold cross-validation.

Cross-device polyp segmentation: The publicly available colonoscopy datasets EndoScene [21] and ETIS-Larib [22] are used for cross-device adaptation. The EndoScene dataset includes 912 images from 36 patients, collected with Olympus Q160AL and Q165L endoscopes and additional II video processors. We set the EndoScene dataset [21] as the source domain and follow the standard polyp segmentation setting described in [21], with a 3:1:1 ratio for the training, validation, and test sets. We set the ETIS-Larib dataset [22], which consists of 196 frames collected with Pentax 90i series endoscopes and an EPKi 7000 video processor, as the target domain and randomly split it into training and test sets at a 4:1 ratio. To facilitate training and testing, we resize all images to 256 \(\times \) 256.

Cross-modal brain tumor segmentation: The Multimodal Brain Tumor Segmentation Challenge 2018 dataset [23] provides multimodal 3D brain MRI with ground-truth segmentation for each case, covering four MRI modalities (T1, T1c, T2, and FLAIR). Following the pipeline of previous work [24], we partition the data of the 285 HGG patients into source and target domains at a 1:1 ratio, and then randomly partition each domain into training and test sets at a 4:1 ratio. We perform experiments on both the FLAIR and T2 modalities of MRI imaging, where each axial slice is resized to 192 \(\times \) 168.

Table 1 Comparison with state-of-the-art methods on the target domain test set. Source Only indicates the performance of the source model trained on the source domain and tested on the target domain (without any domain adaptation); Target Only represents the performance of a model trained on the target domain and tested on the target domain

3.3 Comparison with the state-of-the-art methods

To verify the effectiveness of the proposed method, we conduct fair comparison experiments with some of the most popular methods in the same environment. We compare the experimental results with some state-of-the-art methods on the cross-device polyp segmentation task and the cross-modal brain tumor segmentation task, respectively. These methods are described below:

  • AdaEnt [25]: This method combines domain-invariant prior with entropy loss minimization to guide segmentation. It learns an analogical prior through an auxiliary network and integrates it in the overall loss function in the form of Kullback–Leibler (KL) divergence.

  • AUGCO [26]: This method utilizes pixel-level prediction consistency of the model, automatically generates views of each target image, and utilizes model confidence to identify reliable pixel predictions. It selectively self-trains these images.

  • SFDA [15]: This method employs a pre-trained model from the source domain and progressively updates the target model in a self-learning manner. It assigns pseudo-labels to each target sample using reliable samples selected based on the self-entropy criterion.

  • AdaMI [14]: This method minimizes an unlabeled entropy loss defined on the target domain data, further guided by weak labels of the target samples and a domain-invariant prior on the segmented regions.

  • SMG [11]: In the generation stage, this method synthesizes source-like class images using a pre-trained source model and mutually Fourier-transformed statistical information. In the adaptation stage, it transfers relational knowledge through a domain distillation loss and reduces domain discrepancy through a domain contrastive loss in a self-supervised paradigm.

Table 1 reports the performance of these methods on the target domain test set, and Table 2 reports their performance on the source domain test set. The Source Only entry in Table 1 is obtained by single-step transfer learning on the source domain data: we use ResNet101 pre-trained on ImageNet to initialize the backbone and then fine-tune the whole model on the source domain data. Similarly, Target Only uses the ImageNet pre-trained ResNet101 as the backbone initialization and fine-tunes the whole model on the target domain data. As shown in Table 1, the performance of Source Only on the target domain test set can be treated as a pass mark (lower bound) for source-free domain adaptation methods on this task, while Target Only, which is trained directly on the target domain, can be regarded as the upper bound. In the cross-device polyp segmentation task, applying domain adaptation methods generally improves performance compared with applying none. For example, methods such as AdaEnt, SFDA, and AdaMI raise the model's Dice score on the target domain to around 70.12, and they also improve performance in the cross-modal brain tumor segmentation task (Dice scores of 68.31 and 67.64). What limits further improvement of these methods may be that they only perform simple adaptation of the output-layer features of the model, without considering the consistency of different views of the same representation or the bias caused by the uncertainty of the model's predictions. Methods such as AUGCO and SMG adapt more intermediate-layer features, but they are either very sensitive to changes in viewpoint or require complex hyperparameter tuning. In contrast, the proposed method reaches Dice scores of 73.7 and 70.61 on polyp segmentation and brain tumor segmentation, respectively. This is mainly attributed to the two-stage adaptive approach, which learns domain-invariant representations and uncertainty-reduced feature factors, respectively. These operations keep the model learning discriminative and robust features in the presence of complex changes, while selecting features with higher confidence for the final decision making and segmentation.

At the same time, to evaluate the persistence of the proposed method on the source domain more objectively, we also report the performance of the adapted models on the source domain test set, to show that the knowledge learned by the proposed method is additive retention rather than forgetting by replacement. As Table 2 shows, after completing the adaptation process the models exhibit varying degrees of performance degradation on the source domain test set; for example, AdaMI and AdaEnt achieve Dice scores of only 83.41 and 83.17 after adaptation (87.04 for the source model). Although these methods improve performance on the target domain after adaptation, they show significant performance degradation on the source domain, which means that they forget the rich experience learned previously once new knowledge is acquired, with some beneficial weight parameters overwritten by the new knowledge. This phenomenon can also be interpreted as overfitting to the target domain samples. In contrast, with the proposed method the model almost maintains its original advantage on the source domain (Dice score 86.16) while achieving the best performance on the target domain (Dice scores 73.7 and 70.61). This shows that the features learned by the proposed method are more generalized and robust than those of other methods.

Table 2 Comparison with state-of-the-art methods on the source domain test set. Source Model represents the performance of the source model on the source domain test set
Table 3 Ablation experiments on the contribution of each module. Baseline is the performance of the basic model, without any adaptation, on the target domain test set
Table 4 The ablation of Monte-Carlo uncertainty maps

3.4 Ablation experiments

We perform a series of ablation experiments to verify some other details in the overall framework.

Loss function curves: We plot the loss curves of the proposed method in the encoder and decoder adaptation stages. As shown in Fig. 4, because the entire network is initialized with the parameters of the source model and therefore already has a certain amount of discriminative ability, the loss starts to decrease from a relatively low value at the beginning of the encoder adaptation stage. After about 100 epochs of training, the model converges in the encoder adaptation stage. At the beginning of the decoder adaptation stage, the loss starts from around 0.31 and converges to around 0.26 after 50 epochs of training. These results indicate that combining consistency and uncertainty estimation over the two augmented-view features can effectively improve the performance of the model.

Fig. 4

Loss curves. The encoder adaptation stage and the decoder adaptation stage are run for 100 epochs and 50 epochs, respectively; the whole training is conducted in these two stages

Backbone: To explore the robustness of the proposed method and the influence of different architectures, we conduct ablation experiments with several popular backbone networks [27,28,29,30,31,32]. The experiments use the same configuration, replacing only the backbone portion of the model. As shown in Fig. 5, backbones such as ShuffleNet [29] and InceptionV3 [31] perform the worst, which may be because the complexity of their designs hurts performance in such specialized scenarios compared with simpler architectures such as EfficientNet [30] and ResNet [27]. After comprehensive consideration, we use ResNet as the basic feature extractor.

Fig. 5

Comparison of different backbones. All experiments are performed with the same configuration, changing only the type of backbone

Ablation of Monte-Carlo uncertainty maps: We perform ablation experiments in the decoder adaptation stage to demonstrate the necessity of the Monte-Carlo uncertainty maps. The proposed decoder adaptation stage computes a Monte-Carlo uncertainty map for the original and augmented views separately, and then enforces consistency among the high-confidence pixels of the different views in the uncertainty maps through distillation learning. From Table 4 it can be seen that using the Monte-Carlo uncertainty map (Monte-Carlo) has a significant advantage over not using it (w/o Monte-Carlo) in both Dice score and mIoU. These results validate the effectiveness of the Monte-Carlo uncertainty map in this source-free domain adaptation scenario for medical imaging.

Loss-function weighting: We design experiments to explore the weighting between the different loss functions. An excessively large \(\alpha \) causes the model to pay too much attention to the style of the samples and ignore content modeling, while a too small \(\alpha \) causes the model to focus too much on the content and neglect generalizing over features of different styles. Similarly, \(\beta \) regulates the balance between consistency learning and entropy learning. The results are shown in Fig. 6; after comprehensive consideration, we set \(\alpha \) and \(\beta \) to 0.7 and 0.9, respectively.

Fig. 6

Weighted ratio of different losses

Contribution of each module: Table 3 reports the results of decoupling the proposed method, verifying that the proposed components work together and benefit the overall framework. The table shows that applying the proposed two-stage domain adaptation fine-tuning sequentially on top of the baseline significantly improves performance (Dice score from 68.03 to 73.7 and from 62.75 to 70.61).

4 Conclusion

This letter summarizes the limitations of domain feature alignment methods in adaptation learning and proposes a new two-stage additive SFDA framework to address them. The proposed method is extensively evaluated on two medical image segmentation tasks, cross-device polyp segmentation adaptation and cross-modal brain tumor segmentation adaptation, achieving significant results that validate the effectiveness and potential applications of the framework. Overall, this work provides a valuable exploration of additive learning on the target and source domains in the absence of source data and offers new ideas and methods for adaptation research in medical image segmentation.