1 Introduction

In recent years, deep learning (DL) methods have demonstrated remarkable performance in detecting and localizing tumors on ultrasound images [2, 27]. Compared with conventional image processing methods, DL methods provide an accurate feature extraction capability on ultrasound images, despite their low resolution and noise disturbance, leading to superior segmentation accuracy [2, 5, 14]. However, there are some limitations in developing a DL model in a source domain and deploying it in an unseen target domain. The primary limitation is that DL models require a large number of training samples to achieve accurate predictions [8, 24]. Yet, acquiring large training datasets and their corresponding labels, especially from a cohort of patients, can be costly or even infeasible, which poses a significant challenge in developing a DL model with high performance [7]. Second, even when large-scale datasets are available through collaborative research from multiple sites, DL models trained on such datasets may yield sub-optimal solutions due to domain gaps caused by differences in images acquired from different sites [20]. Third, due to the small number of datasets from each domain, the images for each individual domain may not capture representative features, limiting the ability of DL models to generalize across domains [3].

Domain adaptation (DA) has been extensively studied to alleviate the aforementioned limitations; its goal is to reduce the domain gap caused by the diversity of datasets from different domains [12, 20, 26, 29, 33]. Example solutions include transfer learning- and style transfer-based methods. Nonetheless, unlike for natural images, generating labels for medical images can be challenging, which makes general DA methods difficult to apply; thus, the ability of DA methods to bridge domain gaps remains limited [26, 33]. This difficulty stems from sensitive privacy issues in patients’ data, particularly in collaborative research, which restrict access to labels from different domains. As a result, conventional DA methods cannot be easily applied [10]. More recently, unsupervised domain adaptation (UDA) has been introduced to address this issue [16, 33]; it first generates preliminary predictions (pseudo-labels) in target domains and then produces accurate predictions using the pseudo-labels. One critical limitation of pseudo-label-based UDA is the possibility of error accumulation due to mispredicted pseudo-labels. This can significantly degrade the performance of DL models, as errors can compound and become more pronounced over time [17, 25].

To alleviate the problem of pseudo-label-based UDA, in this work, we propose an advanced UDA framework based on self-supervised DA with a test-time fine-tuning network. Test-time adaptation methods have been developed [4, 11, 13, 23] to improve the learning of knowledge in target domains. The distinctive feature of our test-time self-supervised DA is that it enables the DL network (i) to learn knowledge about the features of target domains by fine-tuning the network itself during the test-time phase, rather than generating pseudo-labels and then (ii) to provide precise predictions on images in target domains, by using the fine-tuned network. Specifically, we adopt self-supervised learning and verify the model via thorough mathematical analysis. Our framework was tested on the task of breast cancer segmentation in ultrasound images, but it could also be applied to other lesion segmentation tasks.

To summarize, our contributions are three-fold:

  • We design a self-supervised DA framework that includes a parameter search method and provide a mathematical justification for it. With our framework, we are able to identify the best-performing parameters that result in improved performance in DA tasks.

  • Our framework is effective at preserving privacy, since it carries out DA using only pre-trained network parameters, without transferring any patient data.

  • We applied our framework to the task of segmenting breast cancer from ultrasound imaging data, demonstrating its superior performance over competing UDA methods.

Our results indicate that our framework is effective in improving the accuracy of breast cancer segmentation from ultrasound images, which could have potential implications for improving the diagnosis and treatment of breast cancer.

2 Methodology

Fig. 1. Architecture of our TTFT network (Left) and its pipeline (Right).

2.1 Test-Time Fine-Tuning (TTFT) Network and Its Pipeline

Network Architecture. Our proposed TTFT network is based on self-supervised DA [31], which is a form of UDA and can be seen as multi-task learning involving both a main task and a pretext task, as shown in Fig. 1. The main task is segmentation and involves an encoder (E), a decoder for segmentation (\(D_\text {seg}\)), and a segmentation header (H), i.e., \((H \circ D_\text {seg} \circ E)(x)\). When predicting segmentation labels in the target domain (\(\mathcal {T}\)), a fine-tuned decoder, \(D_\text {FT}\), is also involved in the main task, and the final prediction after fine-tuning is given by \(\big (H \circ (D_\text {seg} \oplus D_\text {FT}) \circ E \big )(x)\), where \(\oplus \) is the concatenation operation. The pretext task involves E, a decoder for a generator, \(D_\text {gen}\), and a discriminator, C, and aims to generate synthetic images, \((D_\text {gen} \circ E)(t)\). Note that \(D_\text {gen}\) and \(D_\text {seg}\) share the same parameters to enable knowledge transfer. However, since the headers for image reconstruction and for segmentation mask generation produce different outputs, a new header that takes the outputs of both \(D_\text {FT}\) and \(D_\text {seg}\) is devised. In addition, \(D_\text {gen}\) is fine-tuned during the fine-tuning step and then used as \(D_\text {FT}\), so that \(D_\text {FT}\) learns the knowledge of the input domain via image reconstruction. The two distinct sources of information from \(D_\text {seg}\) and \(D_\text {FT}\) enable the network to utilize target domain knowledge and provide precise predictions.
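To make the composition \(\big (H \circ (D_\text {seg} \oplus D_\text {FT}) \circ E \big )(x)\) concrete, the following is a minimal sketch in plain Python. All functions are hypothetical toy stand-ins operating on lists of floats (they are not the actual convolutional modules of the network), and the averaging header is an assumption made only so that the example runs.

```python
# Toy stand-ins for the modules in Fig. 1; every function here is hypothetical.
def encoder(x):
    # E: maps an input "image" (list of floats) to a feature vector.
    return [2.0 * v for v in x]

def decoder_seg(f):
    # D_seg: source-trained decoder branch.
    return [v + 1.0 for v in f]

def decoder_ft(f):
    # D_FT: fine-tuned decoder branch (same role, different weights).
    return [v - 1.0 for v in f]

def concat(a, b):
    # The concatenation operation applied to the two decoder outputs.
    return a + b

def header(f):
    # H: maps the concatenated features to one score per input position
    # (assumed here to average the two branches).
    half = len(f) // 2
    return [0.5 * (f[i] + f[i + half]) for i in range(half)]

def predict_target(x):
    # (H o (D_seg (+) D_FT) o E)(x)
    feats = encoder(x)
    return header(concat(decoder_seg(feats), decoder_ft(feats)))

out = predict_target([0.0, 1.0, 2.0])
```

The point of the sketch is only the data flow: both decoder branches read the same encoder features, and the header sees their concatenation.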

Pre-training in Source Domain. The model M is first trained in \(\mathcal {S}\) in a supervised manner with \((s, \bar{s}) \sim \mathcal {S}\) in both main and pretext tasks as below:

$$\begin{aligned} \varTheta ^{m}_{\mathcal {S}}, \varTheta ^{p}_{\mathcal {S}} = \underset{\theta ^{m}_{\mathcal {S}}, \theta ^{p}_{\mathcal {S}}}{\textrm{argmin}}\sum _s\Big \{\mathcal {L}_\text {BCE}\big ((H \circ D_\text {seg} \circ E)(s), \bar{s}\big ) + \mathcal {L}_{\text {GAN}}\big ((D_\text {gen} \circ E)(s), s\big )\Big \}, \end{aligned}$$
(1)

where \(\mathcal {L}_\text {BCE}\) and \(\mathcal {L}_\text {GAN}\) represent the loss functions for binary cross-entropy and the generative adversarial network [6], respectively. \(\varTheta ^{m}_{\mathcal {S}}\) includes \(E^\mathcal {S}\), \(D^\mathcal {S}_\text {seg}\), and \(H^\mathcal {S}\), while \(\varTheta ^{p}_{\mathcal {S}}\) includes \(E^\mathcal {S}\), \(D^\mathcal {S}_\text {gen}\), and \(C^\mathcal {S}\). Additionally, \(D^\mathcal {S}_\text {seg} = D^\mathcal {S}_\text {gen}\).
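As an illustration of the per-sample term in Eq. (1), the sketch below computes the binary cross-entropy part in plain Python and adds a pretext-task loss value to it. The function names are hypothetical, and the GAN loss is not implemented: `gan_loss_value` is a placeholder number standing in for \(\mathcal {L}_\text {GAN}\).

```python
import math

def bce(pred, target, eps=1e-7):
    # Mean binary cross-entropy between predicted probabilities and 0/1 labels.
    total = 0.0
    for p, y in zip(pred, target):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(pred)

def pretrain_objective(seg_pred, seg_label, gan_loss_value):
    # Per-sample term of Eq. (1): segmentation loss plus pretext-task loss.
    # gan_loss_value is a placeholder for L_GAN, which is not implemented here.
    return bce(seg_pred, seg_label) + gan_loss_value

loss = pretrain_objective([0.9, 0.1], [1, 0], gan_loss_value=0.5)
```

In an actual training loop, both terms would be differentiated with respect to the shared encoder and decoder parameters; the sketch only shows how the two losses are summed.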

Fine-Tuning in Target Domain. Since the pre-trained model is likely to produce imprecise predictions in \(\mathcal {T}\), the model should learn domain knowledge about \(\mathcal {T}\). To this end, in the pretext task, for self-supervised learning, the model is fine-tuned in \(\mathcal {T}\) to generate synthetic images identical to the input images as below:

$$\begin{aligned} \varTheta ^{p}_{\mathcal {T}} = \underset{\theta ^{p}_{\mathcal {T}}}{\textrm{argmin}}\sum _t\mathcal {L}_{\text {GAN}}\big ((D^\mathcal {S}_\text {gen} \circ E^\mathcal {S})(t), t\big ) \;\;\Rightarrow \;\; \varTheta ^{p}_{\mathcal {T}} \supseteq E^\mathcal {S}\cup D^{\mathcal {S}\rightarrow \mathcal {T}}_\text {gen}, \end{aligned}$$
(2)

where only \(D_\text {gen}\) is fine-tuned, to save memory and reduce the fine-tuning time, so that \(D^\mathcal {S}_\text {gen}\) becomes \(D^{\mathcal {S}\rightarrow \mathcal {T}}_\text {gen}\). Then, \(D^{\mathcal {S}\rightarrow \mathcal {T}}_\text {gen}\) is transferred to \(D_\text {FT}\), which realizes knowledge distillation via self-supervised learning. Hence, precise predictions in \(\mathcal {T}\) can be provided by \(\big (H \circ (D^{\mathcal {S}}_\text {seg} \oplus D^{\mathcal {T}}_\text {FT}) \circ E \big )(x)\).
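The idea of updating only the generator decoder while keeping the encoder frozen can be sketched with a toy scalar model. This is an illustration under strong assumptions, not the actual procedure: the encoder and decoder are single hypothetical weights, and a mean squared reconstruction error is swapped in for \(\mathcal {L}_\text {GAN}\) so that the gradient can be written by hand.

```python
def fine_tune_decoder(e_weight, d_weight, targets, lr=0.05, steps=200):
    # Only the decoder weight is updated; the encoder weight stays frozen.
    d = d_weight
    for _ in range(steps):
        grad = 0.0
        for t in targets:
            recon = d * e_weight * t  # toy version of (D_gen o E)(t)
            # Gradient of the squared reconstruction error w.r.t. d.
            grad += 2.0 * (recon - t) * e_weight * t
        d -= lr * grad / len(targets)
    return d

e_src, d_src = 2.0, 0.3          # hypothetical source-trained weights
target_images = [0.5, 1.0, 1.5]  # unlabeled target samples (toy scalars)
d_ft = fine_tune_decoder(e_src, d_src, target_images)
```

No target labels appear anywhere in the loop: the supervision signal is the input itself, which is what makes the pretext task self-supervised.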

Benefits of Our Dual-Pipeline. Mutual information is symmetric and can be written in terms of information entropy (\(\mathbb {H}\)) as \(I(X; Y) = \mathbb {H}(X) + \mathbb {H}(Y) - \mathbb {H}(X, Y)\). Since \(I(X; Y) \ge 0\), the joint entropy satisfies \(\mathbb {H}(X, Y) \le \mathbb {H}(X) + \mathbb {H}(Y)\). As a result, the predictions made by the fine-tuned network in the target domain (\(\mathcal {T}\)) lead to reduced entropy, bounded as shown below:

$$\begin{aligned} \mathbb {H}\big ((H \circ (D^{\mathcal {S}}_\text {seg} \oplus D^{\mathcal {T}}_\text {FT}) \circ E)(t), \bar{t}\big ) \le \; \mathbb {H}\big ((H \circ D^{\mathcal {S}}_\text {seg} \circ E)(t), \bar{t}\big ) + \mathbb {H}\big ((H \circ D^{\mathcal {T}}_\text {FT} \circ E)(t), \bar{t}\big ). \end{aligned}$$
(3)

Since \(D^\mathcal {S}_\text {seg}\) is fully optimized for \(\mathcal {S}\) in a supervised manner, it guarantees a baseline segmentation performance. Furthermore, since \(D^{\mathcal {T}}_\text {FT}\) is fine-tuned in \(\mathcal {T}\) using knowledge distillation, it can provide domain-specific information for \(\mathcal {T}\). As a result, the predictions made by the fine-tuned model in \(\mathcal {T}\) are jointly constrained by the expectations of \(D^\mathcal {S}_\text {seg}\) and \(D^{\mathcal {T}}_\text {FT}\). This enables the final model to provide precise predictions in \(\mathcal {T}\) by taking into account both the source domain and target domain information.
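The generic inequality behind Eq. (3), namely that joint entropy never exceeds the sum of the marginal entropies because mutual information is non-negative, can be checked numerically. The joint distribution below is a hypothetical example over two binary variables and is unrelated to the actual network outputs.

```python
import math

def entropy(probs):
    # Shannon entropy in bits; zero-probability entries contribute nothing.
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Hypothetical joint distribution p(x, y) over two binary variables.
joint = [[0.4, 0.1],
         [0.1, 0.4]]

p_x = [sum(row) for row in joint]        # marginal of X
p_y = [sum(col) for col in zip(*joint)]  # marginal of Y
h_xy = entropy([p for row in joint for p in row])
h_x, h_y = entropy(p_x), entropy(p_y)
mutual_info = h_x + h_y - h_xy           # I(X; Y)
```

The more strongly the two variables depend on each other, the larger `mutual_info` becomes and the further `h_xy` falls below `h_x + h_y`.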

2.2 Parameter Fluctuation: Parameter Randomization Method

Since the loss function and its values can vary based on the distribution of inputs, and different domains can have different distributions, the local minimum identified in the source domain (\(\mathcal {S}\)) cannot be considered as the same local minimum in \(\mathcal {T}\), as illustrated in Fig. 2. The y-axis of Fig. 2 indicates \(\frac{1}{|\mathcal {X}|}\sum _x\mathcal {L}(M(x;\theta ), \bar{x})\), and the local minimum is different in \(\mathcal {S}\) and \(\mathcal {T}\) as \(\varTheta _\mathcal {S}\) in Fig. 2a and \(\varTheta _\mathcal {T}\) in Fig. 2c, respectively. A longer fine-tuning time is required to re-position \(\varTheta _\mathcal {S}\) to \(\varTheta _\mathcal {T}\) as in Fig. 2c than to re-position \(\theta _\mathcal {T}\) to \(\varTheta _\mathcal {T}\). Therefore, efficient fine-tuning is necessary to re-position the local minimum in Fig. 2b and this process is known as parameter fluctuation. Note that the parameter fluctuation is followed by the fine-tuning step.

Fig. 2. Illustration of the local minimum of the source (a) and target (b) domains and parameter fluctuation (c).

Let \(C_i\) be the \(i^\text {th}\) convolution operator in \(D_\text {seg}\) with weight \(w_i\), so that \(C_i(x) = w_i \cdot x\). Since \(D^\mathcal {S}_\text {seg}\) provides the baseline segmentation performance, \(D^\mathcal {T}_\text {FT}\) should provide similar feature maps to achieve the baseline performance. To this end, the intermediate feature maps should be similar, i.e., \(\forall _i C_i(F_i) \approx C'_i(F'_i)\), where \(C'_i\) represents the \(i^\text {th}\) convolution in \(D^\mathcal {T}_\text {FT}\) with weight \(w'_i\), \(F_i\) and \(F'_i\) represent the \(i^\text {th}\) feature maps of the two decoders, and \(F_0 = E(x)\). Suppose \(\forall _i |C_i(F_i) - C'_i(F'_i)| < \epsilon _i \ll 1\); then \(\forall _i F_i \approx F'_i\) by mathematical induction. Therefore, the sum of errors (\(\sum |C_i(F_i) - C'_i(F'_i)|\)) is approximated by \(\sum |w_iF_0 - w'_iF_0|\) iff \(\forall _i F_i \approx F'_i\), which can be expressed as:

$$\begin{aligned} \sum |w_iF_0 - w'_iF_0| < \epsilon \ll 1 \Leftarrow \sum |w_iF_0 - w'_iF_0| \approx 0 \Leftrightarrow \sum |w_i - w'_i| = 0. \end{aligned}$$
(4)

Here, we denote \(w_i - w'_i = f_i\) as the fluctuation vector in the vector space. The condition \(\sum f_i = 0\) indicates that the fluctuation vectors should sum to zero, subject to \(|f_i| < r \ll 1\). Hence, we obtain the conditions for the parameter fluctuation: the centers of the parameters of \(\varTheta _\mathcal {S}\) and \(\theta _\mathcal {T}\) should coincide in the vector space, and the length of each fluctuation vector should be less than a small threshold r (\(0 < r \ll 1\)). To summarize, the parameter fluctuation adds random vectors, whose lengths are less than r and whose sum is zero, to the parameters of \(\varTheta _\mathcal {S}\) as follows:

$$\begin{aligned} \theta _\mathcal {T}= \{w_i + f_i |\;\; w_i \in \varTheta _\mathcal {S},\;\; \sum f_i = 0,\;\; 0 < |f_i| < r \ll 1 \}. \end{aligned}$$
(5)
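One possible way to sample fluctuation vectors that satisfy the constraints of Eq. (5) is sketched below in plain Python. The sampling scheme (Gaussian noise, mean subtraction, then rescaling), the function name `fluctuate`, and the factor 0.9 are assumptions made for illustration; the text only specifies the constraints \(\sum f_i = 0\) and \(0 < |f_i| < r\), not how the vectors are drawn.

```python
import math
import random

def fluctuate(weights, r=0.01, seed=0):
    # weights: list of weight vectors w_i (lists of floats) taken from Theta_S.
    # Requires at least two vectors; with one, the only zero-sum noise is zero.
    rng = random.Random(seed)
    dim = len(weights[0])
    noise = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in weights]
    # Subtract the mean vector so that the fluctuation vectors sum to zero.
    mean = [sum(f[k] for f in noise) / len(noise) for k in range(dim)]
    noise = [[f[k] - mean[k] for k in range(dim)] for f in noise]
    # Rescale so every |f_i| is strictly below r; scaling keeps the sum at zero.
    longest = max(math.sqrt(sum(v * v for v in f)) for f in noise)
    scale = 0.9 * r / longest
    noise = [[scale * v for v in f] for f in noise]
    perturbed = [[w + v for w, v in zip(wi, fi)]
                 for wi, fi in zip(weights, noise)]
    return perturbed, noise

theta_s = [[0.5, -0.2], [0.1, 0.3], [-0.4, 0.8]]  # toy source parameters
theta_t, f = fluctuate(theta_s, r=0.01)
```

Because the fluctuation vectors sum to zero, the mean of the perturbed parameters equals the mean of the original ones, which matches the requirement that the parameter centers coincide.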

3 Experiments

3.1 Experimental Set-Ups

To evaluate the segmentation performance of our TTFT framework, we used three different ultrasound databases: BUS [32], BUSI [1], and BUV [18], which are considered to be different domains. All three databases contain ultrasound imaging data and segmentation masks for breast cancer, with the masks labeled as 0 (background) and 1 (lesion) using a one-hot encoding. The BUS database consists of 163 images along with corresponding labels. The BUSI database contains 780 images, with 133 images belonging to the NORMAL class and having labels containing only 0 values. The BUV database originally consists of ultrasound videos, providing a total of 21,702 frames. While the database also provides labels for the detection task, we processed these labels as segmentation masks using a region growing method [15].

We employed different deep-learning models for evaluation. Specifically, U-Net [22] and FusionNet [21] were employed as our baseline models, since U-Net is a widely used basic model for segmentation, and FusionNet contains more advanced residual modules than U-Net. Ours I and Ours II used U-Net and FusionNet as the baseline network, respectively. Additionally, MIB-Net [28], which is a state-of-the-art model for breast cancer segmentation using ultrasound images, was employed for comparison. Furthermore, CBST [33] and CT-Net [16] were employed as the comparison models for UDA methods. As the evaluation metrics, the Dice coefficient (D. Coef), PRAUC, which is the area under a precision-recall curve, and Cohen's kappa (\(\kappa \)) were employed [30]. Our experimental set-ups were as follows: (i) individual databases were used to assess the baseline segmentation performance (Appendix); (ii) the domain adaptive segmentation performance was assessed using the three databases, where two databases were regarded as the source domain, and the remaining database was regarded as the target domain; and (iii) an ablation study was carried out to evaluate the proposed network architecture along with the randomized re-initialization (parameter fluctuation) method.
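For reference, two of the metrics above can be computed for flattened binary masks as in the following sketch. The function names and the conventions for degenerate cases (returning 1.0 when both masks are empty, or when chance agreement equals 1) are assumptions for illustration; such cases can arise for images whose labels contain only 0 values.

```python
def dice(pred, truth):
    # Dice coefficient for flat binary masks (lists of 0/1).
    inter = sum(p * t for p, t in zip(pred, truth))
    denom = sum(pred) + sum(truth)
    # Assumed convention: two empty masks count as a perfect match.
    return 2.0 * inter / denom if denom else 1.0

def cohen_kappa(pred, truth):
    # Cohen's kappa: observed agreement corrected for chance agreement.
    n = len(pred)
    p_o = sum(1 for p, t in zip(pred, truth) if p == t) / n
    pos_pred, pos_truth = sum(pred) / n, sum(truth) / n
    p_e = pos_pred * pos_truth + (1 - pos_pred) * (1 - pos_truth)
    # Assumed convention: return 1.0 when chance agreement is already 1.
    return (p_o - p_e) / (1 - p_e) if p_e != 1 else 1.0

pred = [1, 1, 0, 0]
truth = [1, 0, 0, 0]
d = dice(pred, truth)
k = cohen_kappa(pred, truth)
```

For this toy pair of masks, the overlap is one pixel out of three foreground pixels in total, and the observed agreement is three pixels out of four.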

Fig. 3. Comparison analysis of our framework and comparison models: performance comparison table (Left) and Box-and-Whisker plot (Right).

3.2 Comparison Analysis

Since all compared DL models showed similar D. Coef on the individual databases, this experiment focuses on comparing their UDA performance. Two databases were used for training, and the remaining database was used for testing. For instance, BUS in Fig. 3 indicates that the BUS database was used for testing, and the other two databases, BUSI and BUV, were used for training. Figs. 3 and 4 show quantitative results, and Fig. 5 shows sample segmentation results. Unlike in the experiment using the individual databases, U-Net, FusionNet, and MIB-Net showed significantly inferior scores due to domain gaps. In contrast, the UDA methods, CBST and CT-Net, showed superior scores compared with these models, and their scores were not strongly reduced compared with the single-database experiment. Notably, our TTFT framework achieved the best performance among all compared DL models. Additionally, Ours II, based on FusionNet, showed the best scores, potentially due to its advanced residual connection module. Furthermore, as illustrated in Fig. 4, our framework provides superior precision scores over a wide recall range of (0, 0.7), indicating that our framework avoided unnecessary mispredictions while providing precise predictions on cancer.
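The train/test protocol described above amounts to a leave-one-domain-out scheme, which can be enumerated as follows. The helper name `leave_one_domain_out` is hypothetical; only the three database names come from the text.

```python
DATABASES = ["BUS", "BUSI", "BUV"]

def leave_one_domain_out(domains):
    # Each domain serves once as the target; the others form the source.
    splits = []
    for target in domains:
        source = [d for d in domains if d != target]
        splits.append({"source": source, "target": target})
    return splits

splits = leave_one_domain_out(DATABASES)
```

Each entry of `splits` corresponds to one column group in the comparison, labeled by its target database.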

Fig. 4. Precision-Recall curves by ours and comparison models on each database. Area under the precision-recall curve (PR-AUC) values were reported.

Fig. 5. Segmentation results by ours and comparison models on each database.

3.3 Ablation Study

To assess the effectiveness of each of the proposed modules, including the parameter fluctuation (PF) and fine-tuning methods, an ablation study was carried out. Since our framework contains three types of decoders, \(D^\mathcal {S}_\text {seg}\), \(D^\text {fl}_\text {seg}\), and \(D^{\mathcal {S}\rightarrow \mathcal {T}}_\text {seg}\) for the fine-tuning, we mainly targeted those decoders in our ablation study. Table 1 reports the quantitative results for the different types of decoders. The higher D. Coef value (+3.4%) of Pre-train + PF compared with Pre-train + Random Init and Pre-train + Offset confirms the effectiveness of the parameter fluctuation for UDA performance. Additionally, the higher score (+11%) of Fine-tuning compared with Pre-train shows the strong UDA performance of the fine-tuning pipeline. Furthermore, the simultaneous utilization of the dual pipeline with \(D^\mathcal {S}_\text {seg}\) and \(D^{\mathcal {S}\rightarrow \mathcal {T}}_\text {seg}\) is justified by the scores of Pre-train + Fine-tuning. Using the dual pipeline together with parameter fluctuation yielded the best performance. However, using ensemble pipelines of multiple fine-tuning modules was inefficient, since only negligible performance improvements (+0.002) were observed despite the heavy memory utilization.

Fig. 6. Illustration of feature maps: style loss comparison (Left) and a t-SNE plot of generated images by different decoders (Right).

Table 1. Dice coefficients by different versions of our TTFT framework. Random Init indicates that \(D_\text {FT}\) is randomly initialized, and Offset indicates that \(D_\text {FT}\) is initialized with the values of \(D_\text {seg}\) plus an offset value.

Furthermore, Fig. 6 shows the effectiveness of the parameter fluctuation and fine-tuning methods. We first compared the feature maps produced by the decoders \(D^\mathcal {S}_\text {seg}\), \(D^\text {fl}_\text {seg}\), and \(D^{\mathcal {S}\rightarrow \mathcal {T}}_\text {seg}\) with those of \(D^\mathcal {S}_\text {seg}\) and \(D^\mathcal {T}_\text {seg}\), the latter being a decoder fully optimized in \(\mathcal {T}\). Here, a style loss [9] was employed to measure the similarity of feature maps. Our framework was fine-tuned as \(D^\mathcal {S}_\text {seg} \rightarrow D^\text {fl}_\text {seg} \rightarrow D^{\mathcal {S}\rightarrow \mathcal {T}}_\text {seg}\), and the similarity to \(D^\mathcal {T}_\text {seg}\) increased along this sequence; the feature maps of \(D^{\mathcal {S}\rightarrow \mathcal {T}}_\text {seg}\) were more similar to those of \(D^\mathcal {T}_\text {seg}\) than those of \(D^\mathcal {S}_\text {seg}\) were, indicating that UDA was successfully performed. Additionally, the images generated by the decoders \(D^\mathcal {S}_\text {seg}\), \(D^\text {fl}_\text {seg}\), and \(D^{\mathcal {S}\rightarrow \mathcal {T}}_\text {seg}\) in \(\mathcal {S}\) and \(\mathcal {T}\) are plotted with t-SNE, where a short distance represents similar features [19]. The generated images became more similar to \(\mathcal {T}\) in the order of \(D^\mathcal {S}_\text {seg}\), \(D^\text {fl}_\text {seg}\), and \(D^{\mathcal {S}\rightarrow \mathcal {T}}_\text {seg}\), which confirmed the effectiveness of the fine-tuning method in terms of knowledge distillation. Finally, the parameters were successfully re-positioned away from the local minimum in \(\mathcal {S}\) by parameter fluctuation, as confirmed by the distances from \(\mathcal {S}\) to \(D^\mathcal {S}_\text {gen}\) and \(D^\text {fl}_\text {gen}\).

4 Discussion and Conclusion

In this work, we proposed a DL-based segmentation framework for multi-domain breast cancer segmentation on ultrasound images. Due to the low resolution of ultrasound images, manual segmentation of breast cancer is challenging even for expert clinicians, resulting in a scarcity of labeled data. To address this issue, we introduced a novel self-supervised DA network for breast cancer segmentation in ultrasound images. In particular, we proposed a test-time fine-tuning network to learn domain-specific knowledge via knowledge distillation by self-supervised learning. Since UDA is susceptible to error accumulation due to imprecise pseudo-labels, which can lead to degraded performance, we employed a self-supervised learning-based pretext task. Specifically, we utilized an auto-encoder-based network architecture to generate synthetic images that matched the input images. Moreover, we introduced a randomized re-initialization module that injects randomness into network parameters to reposition the network from the local minimum in the source domain to a local minimum that is better suited for the target domain. This approach enabled our framework to efficiently fine-tune the network in the target domain and achieve better segmentation performance. Experimental results, carried out with three ultrasound databases from different domains, demonstrated the superior segmentation performance of our framework over other competing methods. Additionally, our framework is well-suited to scenarios in which access to source domain data is limited due to data privacy protocols. It is worth noting that we used vanilla U-Net [22] and FusionNet [21] as baseline models to evaluate the basic performance of our TTFT framework. However, the use of more advanced baseline models could lead to even better segmentation performance, which is a subject for our future work. Moreover, our proposed framework is not limited to breast cancer segmentation on ultrasound images acquired from different domains. It can also be applied to other disease groups or imaging modalities such as MRI or CT.