
1 Introduction

Many medical image datasets have been created over the years, and recent breakthroughs achieved by supervised training have accelerated progress in medical image segmentation. Despite great promise, many prior works have limited clinical value because they are trained separately on datasets that are small in scale, diversity, and heterogeneity of annotations. As a result, such single-site methods [10, 14, 21, 22, 29, 31, 32, 35, 36, 37, 38, 39, 40, 41] are vulnerable to unknown target domains and expand their parameter count linearly, since they assume a new model is trained in isolation whenever a dataset is added. This jeopardizes their trustworthiness and practical deployment in real-world clinical environments.

In this paper, we carry out a first-of-its-kind comprehensive exploration of how to build a multi-site model that achieves strong performance on the training domains and also serves as a strong starting point for better generalization to new domains in clinical scenarios. Multi-site training [1, 3, 7, 8, 11, 24, 25] has been proposed to consolidate generalization across multi-site datasets, but it has the following limitations: (1) it still exhibits a certain vulnerability to different domains (i.e., different imaging protocols), which yields sub-optimal performance [1, 13, 34]; (2) due to various constraints (e.g., imaging time, privacy, and copyright status), requiring all training data to be available at a given time can be challenging or even infeasible. For example, when a new site's data become available after training, the model must be retrained, which largely prohibits practical deployment; and (3) given the relatively small size of a single medical imaging dataset, simply training a dense network from scratch usually leads to sub-optimal segmentation quality because the model may over-fit to those datasets.

Our key idea is to combine the benefits of incremental learning (IL) and transfer learning by sequentially training a multi-dataset expert: we continually train a model from the corresponding pretrained weights as new site data are incrementally added, which we call Incremental-Transfer Learning (ITL). This setting is appealing because: (1) the common IL setting [4, 5, 15, 17, 23, 27, 28, 42] trains the base-learner as different site datasets gradually arrive; thus the effectiveness of this approach heavily depends on the optimality of the base-learner. Since each single medical image dataset is usually of relatively small size, it is undesirable to build a strong base-learner from scratch; (2) transfer learning [26, 30, 33, 43, 44] typically leads to better performance and faster convergence in medical image analysis. Inspired by these findings, we develop a novel training strategy that extends these high-quality learning abilities to our multi-site incremental setting, considering both the model level and the site level. Specifically, our system is built upon a site-agnostic encoder with weights pretrained on natural image datasets such as ImageNet, and at most two segmentation decoder heads, of which only one is trainable while the other is fixed and associated with specific sites - a parameter-efficient design. Our intuition is that the shared site-agnostic encoder with pretrained weights encodes regularities across different medical image datasets, while the target and source segmentation decoder heads model the sub-distributions through our proposed site-level incremental loss, resulting in an accurate and robust model that transfers better to new domains without sacrificing performance. We conduct a comprehensive evaluation of ITL on five prostate MRI datasets. Our approach consistently achieves competitive performance and faster convergence compared to the upper-bound baselines (i.e., isolated-site and mixed-site training), and has a clear advantage in overall segmentation performance over the lower-bound baseline (i.e., multi-site training). We also find that our simple approach effectively addresses the forgetting issue. Our experiments demonstrate the benefits of modeling both multi-site regularities and site-specific attributes, and thereby serve as a strong starting point for this important practical setting.

Fig. 1. Overview of (a) our proposed Incremental Transfer Learning framework, and (b) the multi-site expert model. Note that in this study, we only use one multi-site expert model and one source decoder network, which does not introduce additional parameters.

2 Method

2.1 Problem Setup

In ITL, a model incrementally learns from a sequential site stream wherein new datasets (namely, medical image segmentation tasks from new sites) are gradually added during training, as illustrated in Fig. 1. More formally, we denote the sequence of multi-site datasets to be trained as a multi-domain data sequence \(\mathcal {D}\!=\!\{D_{1},D_{2},\cdots ,D_{N}\}\) of N sites, where the i-th site \(D_{i}\) contains the training images \(X\!=\!\{x_j\}_{j=1}^{M}\) and segmentation labels \(Y\!=\!\{y_j\}_{j=1}^{M}\), with \({x}_{j} \in \mathbb {R}^{H \times W \times 3}\) the augmented image input and \({y}_{j} \in \{0, 1\}^{{H \times W}}\) the ground-truth label. The augmented input setting is appealing because the axial context naturally provided by a 3D volume yields more robust semantic representations for the downstream tasks. We assume access to a multi-site expert model \(F_{i}\!=\!\{E_{i},G_{i}\}\) for the i-th (site) phase, comprising a pretrained model used as a site-agnostic encoder network \(E_{i}\) with weights \(\theta _{i}\) and a target decoder network \(G_{i}^{t}\) with weights \(\theta ^{t}_{i}\). During training, we additionally attach a source decoder network \(G_{i}^{s}\) (i.e., \(G_{i-1}^{s}\) from the previous phase) with weights \(\theta ^{s}_{i}\). In the i-th incremental (site) phase, the multi-site expert model has access to two types of domain knowledge: the site-specific knowledge from the current dataset \(D_{i}\) and the old exemplars \(P_{i}\). The latter is a set of exemplars from all previous training datasets \(D_{1:i-1}\) stored in the memory protocol \(\mathcal {M}\). This is crucial for preventing the challenging “catastrophic forgetting” problem [20], in which training on the current dataset i degrades performance on previous sites, a serious concern in clinical practice. Note that, in this study, we only use one multi-site expert model and one source decoder network, which does not introduce additional parameters. Based on the setting above, we define the ITL problem below.
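To make the bookkeeping above concrete, the following minimal sketch mirrors the notation of the i-th phase. The class names and fields are illustrative rather than taken from the released implementation; the source decoder field is only populated after the first phase.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

import torch
import torch.nn as nn


@dataclass
class SitePhase:
    """What the expert model can see in the i-th incremental (site) phase."""
    images: torch.Tensor                        # X = {x_j}: M x 3 x H x W augmented axial slices
    labels: torch.Tensor                        # Y = {y_j}: M x H x W binary ground-truth masks
    exemplars: List[Tuple[torch.Tensor, torch.Tensor]] = field(default_factory=list)
    # P_i: (image, label) pairs drawn from D_{1:i-1} and kept in the memory protocol M


@dataclass
class ExpertModel:
    """F_i: site-agnostic encoder E_i plus target decoder G_i^t; G_i^s is attached only during training."""
    encoder: nn.Module                          # pretrained, shared across all sites
    target_decoder: nn.Module                   # G_i^t, trainable
    source_decoder: Optional[nn.Module] = None  # G_i^s, frozen, carried over from the previous phase
```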

Problem of ITL. At the current site i, our goal is to continuously learn a multi-site expert model based on the knowledge from both \((D_{i},P_{i})\) and the pretrained weights, such that the model (1) generalizes well on the unseen data at site i, and (2) achieves competitive performance on the previous sites.

2.2 Preliminary

Our goal is to build a strong multi-site model by learning a site-agnostic encoder with pretrained weights, together with a segmentation decoder, over multi-site datasets. This naturally raises several interesting questions: How well will ITL-based methods perform on multi-site medical image datasets? Will transfer learning make the base-learner stronger on an unseen site? If so, can it perform stably well? To answer these questions, a prerequisite is to define the upper and lower bounds. Here we introduce three common paradigms for multi-site medical image segmentation: (1) isolated-site training, (2) mixed-site training, and (3) multi-site training. It is well known that the isolated-site and mixed-site training approaches achieve state-of-the-art performance when evaluated on the same dataset, while their performance drops catastrophically when evaluated on new datasets. On the other hand, the multi-site training approach often yields inconsistent performance across multiple sites. For all training paradigms, we minimize the Dice loss between the predicted outputs and the ground-truth labels.
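For reference, the sketch below shows one common soft Dice loss formulation for binary masks, consistent with the problem setup in Sect. 2.1. It is a generic implementation under these assumptions, not necessarily the exact loss code used in the paper.

```python
import torch


def dice_loss(logits: torch.Tensor, target: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Soft Dice loss for binary segmentation.

    logits: B x 1 x H x W raw decoder outputs; target: B x 1 x H x W ground-truth masks in {0, 1}.
    """
    prob = torch.sigmoid(logits)                                   # map logits to probabilities
    intersection = (prob * target).sum(dim=(1, 2, 3))
    denom = prob.sum(dim=(1, 2, 3)) + target.sum(dim=(1, 2, 3))
    dice = (2.0 * intersection + eps) / (denom + eps)              # per-sample Dice coefficient
    return 1.0 - dice.mean()                                       # loss to minimize
```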

Table 1. Information about five different sites from three benchmark datasets.

Upper Bound. We consider two training paradigms (i.e., isolated-site and mixed-site training) as our upper-bound baselines. For isolated-site training, given each site \(D_{i}\), we train a separate isolated-site model. The architecture of the isolated-site model consists of a pretrained encoder \(E_{i}\) and a segmentation decoder network with the same architecture as \(G_{i}\). At inference, we apply the different isolated-site models to predict results on the corresponding site-specific data. However, this approach dramatically increases memory and computational overhead, making it practically challenging at scale. For mixed-site training, we train one full model on the full mixed-site data D and then use the well-trained model for inference. However, this requires the simultaneous presence of all data during training and inference.

Lower Bound. For multi-site training, we sequentially train a single model coupled with the pretrained weights on all sites. This avoids a large parameter count, making it appealing in practice. However, due to catastrophic forgetting, it inevitably suffers from severe performance degradation. This naturally raises the question: can we improve performance on multi-site medical image segmentation with a minimal additional memory footprint? In the following, we give an affirmative answer.

2.3 Proposed Incremental Transfer Learning Multi-site Method

To address the aforementioned problems, we develop the incremental transfer learning framework to perform well on the training distributions and generalize well to new site datasets with minimal additional memory. To the best of our knowledge, this is the first work to apply incremental transfer learning to limited clinical data regimes. To keep parameters efficient, we decompose the model into a shared site-agnostic encoder \(E_{i}\) and two segmentation decoder heads (i.e., a source decoder \(G_{i}^{s}\) and a target decoder \(G_{i}^{t}\)). In this way, we keep the number of network parameters the same when adding a new site. Specifically, \(G_{i}^{s}\) is designed to transfer the knowledge of previously learned sites, and \(G_{i}^{t}\) is designed to train comprehensively on the new site and previous datasets. During training, we only update \(G_{i}^{t}\) while \(G_{i}^{s}\) is frozen. It is worth mentioning that our proposed framework is independent of the encoder architecture and can be easily plugged into other pretrained vision models.
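The sketch below illustrates this decomposition with an ImageNet-pretrained ResNet-18 encoder and two structurally identical decoder heads, of which only the target head receives gradients. The decoder here is a lightweight stand-in for the actual decoder design of [18], and all module names are illustrative; a recent torchvision version is assumed for the pretrained weights.

```python
import torch
import torch.nn as nn
import torchvision


def make_decoder(num_classes: int = 1) -> nn.Module:
    # Lightweight stand-in for the decoder of [18]: project 512-channel ResNet features
    # back to a full-resolution single-channel mask.
    return nn.Sequential(
        nn.Conv2d(512, 64, kernel_size=3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(64, num_classes, kernel_size=1),
        nn.Upsample(scale_factor=32, mode="bilinear", align_corners=False),
    )


class ITLSegmenter(nn.Module):
    """Shared site-agnostic encoder E_i with a trainable target head G_i^t and a frozen source head G_i^s."""

    def __init__(self):
        super().__init__()
        backbone = torchvision.models.resnet18(weights="IMAGENET1K_V1")
        self.encoder = nn.Sequential(*list(backbone.children())[:-2])    # drop avgpool and fc
        self.target_decoder = make_decoder()                             # G_i^t: updated during training
        self.source_decoder = make_decoder()                             # G_i^s: kept frozen
        for p in self.source_decoder.parameters():
            p.requires_grad_(False)

    def forward(self, x: torch.Tensor):
        feats = self.encoder(x)                                          # B x 512 x H/32 x W/32
        return self.target_decoder(feats), self.source_decoder(feats)    # two B x 1 x H x W logit maps
```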

The full ITL algorithm is summarized in Algorithm 1 and described as follows. We first randomly initialize \(G_{i}^{t}\) and \(G_{i}^{s}\), and then iteratively train our full model (i.e., a pretrained encoder \(E_{i}\) and two decoders \(G_{i}^{t}\), \(G_{i}^{s}\)) on the N-site training samples. Bounded by the computational requirements, it is challenging or even infeasible to retain all data for training. Inspired by recent work [23], to maintain the knowledge of previous sites, we “store” exemplars from all old site data in the memory protocol \(\mathcal {M}_{i}\). In the i-th incremental (site) phase, we first load \(P_{i}\), and then use both \(P_{i}\) and \(D_i\) to train \(F_{i}\) initialized by \(\theta _{i}^{s}\). This setting is appealing because (1) it substantially alleviates the imbalance between the old and new site knowledge, and (2) it is efficient to train on. Note that we do not use the source decoder when training on the first-site dataset. We formulate ITL as model-level and site-level optimization.

Model-Level Optimization. To perform better on all training distributions, we propose improving the generic representations by distilling knowledge from previous data. In each incremental phase, we jointly optimize two groups of learnable parameters in our ITL learning by minimizing the model-level incremental loss (i.e., \(\mathcal {L}_{\text {model}}\!=\!\mathcal {L}_{\text {target}}+\mathcal {L}_{\text {source}}\)) on all training samples (i.e., \(D_{i}\cup P_{i}\)): (1) the shared site-agnostic encoder \(E_{i}\) and the target decoder \(G_{i}^{t}\); (2) the shared site-agnostic encoder \(E_{i}\) and the source decoder \(G_{i}^{s}\). This helps ITL avoid catastrophic forgetting of prior site-specific knowledge.

Algorithm 1. The full ITL training algorithm.
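For concreteness, the outer loop of Algorithm 1 can be sketched as follows, reusing the ITLSegmenter and dice_loss sketches above and the itl_loss helper sketched after Eq. (1) below. The helper names, the per-sample batching, the random exemplar sampling, and the carry-over of the previous target head as the frozen source head are our reading of the text, not the authors' released code; each site dataset is assumed to be a list of (image 3 x H x W, mask 1 x H x W) tensor pairs.

```python
import copy
import random

import torch


def sample_exemplars(dataset, portion):
    """Keep a small random portion (e.g., 1%, 3%, or 5%) of a finished site as exemplars."""
    k = max(1, int(len(dataset) * portion))
    return random.sample(list(dataset), k)


def run_itl(site_datasets, portion=0.05, epochs=100, lr=1e-3):
    memory = []                                         # memory protocol M: exemplars of past sites
    model = ITLSegmenter()                              # pretrained encoder + randomly initialized heads
    for i, dataset in enumerate(site_datasets):         # sequential site stream D_1, ..., D_N
        if i > 0:
            # Carry the previously trained target head over as the frozen source head G_i^s.
            model.source_decoder = copy.deepcopy(model.target_decoder)
            for p in model.source_decoder.parameters():
                p.requires_grad_(False)
        current = [(x, y, True) for x, y in dataset]    # new-site samples from D_i
        old = [(x, y, False) for x, y in memory]        # exemplars P_i from previous sites
        train_data = current + old                      # D_i ∪ P_i
        params = [p for p in model.parameters() if p.requires_grad]
        opt = torch.optim.Adam(params, lr=lr, betas=(0.9, 0.999))
        for _ in range(epochs):
            random.shuffle(train_data)
            for x, y, is_new in train_data:             # batch size 1 for brevity
                pred_t, pred_s = model(x.unsqueeze(0))
                loss = itl_loss(pred_t, pred_s, y.unsqueeze(0),
                                torch.tensor([is_new]), use_source=(i > 0))  # Eq. (1)
                opt.zero_grad()
                loss.backward()
                opt.step()
        memory += sample_exemplars(dataset, portion)    # store exemplars for future phases
    return model
```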

Site-Level Optimization. While the model-level optimization above maintains previously learned knowledge, this step is designed to train the multi-site model to learn site-specific knowledge on the newly added site. Specifically, we minimize the site-level incremental loss \(\mathcal {L}_{\text {site}}\) between the probability distribution predicted by \(F_{i}\) and the ground truth. This essentially learns the site-specific knowledge for the downstream medical image segmentation tasks. Note that \(\mathcal {L}_{\text {source}}\), \(\mathcal {L}_{\text {target}}\), and \(\mathcal {L}_{\text {site}}\) all use the Dice loss. The overall loss combines the model-level loss and the site-level loss as follows:

$$\begin{aligned} \mathcal {L}_{\text {all}} = \mathcal {L}_{\text {model}}+\mathcal {L}_{\text {site}}. \end{aligned}$$
(1)
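The sketch below is one plausible instantiation of Eq. (1), assuming \(\mathcal {L}_{\text {target}}\) and \(\mathcal {L}_{\text {source}}\) are Dice losses of the two decoder heads over the mixed samples \(D_{i}\cup P_{i}\) and \(\mathcal {L}_{\text {site}}\) is the Dice loss of the target head restricted to the newly added site. This split is our interpretation of the text, not the authors' released implementation, and it reuses the dice_loss sketch from Sect. 2.2.

```python
import torch


def itl_loss(pred_target, pred_source, target, is_new_site, use_source=True):
    """pred_*: B x 1 x H x W logits; target: B x 1 x H x W masks; is_new_site: length-B bool mask."""
    l_target = dice_loss(pred_target, target)                         # model-level term for G^t
    l_source = dice_loss(pred_source, target) if use_source else 0.0  # model-level term for frozen G^s
    l_model = l_target + l_source                                     # L_model = L_target + L_source
    if is_new_site.any():                                             # site-level term on D_i samples only
        l_site = dice_loss(pred_target[is_new_site], target[is_new_site])
    else:
        l_site = torch.zeros((), device=target.device)
    return l_model + l_site                                           # L_all = L_model + L_site, Eq. (1)
```

Because \(G_{i}^{s}\) is frozen, gradients from the source term update only the shared encoder, matching the second parameter group in the model-level optimization above.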

3 Experiments

Datasets and Settings. We evaluate our proposed incremental transfer learning method on three prostate T2-weighted MRI datasets with different sub-distributions: NCI-ISBI13 [2], I2CVB [12], and PROMISE12 [16]. Due to their diverse data source distributions, they can be split into five multi-site datasets, similar to [19]. Table 1 provides the dataset statistics. For pre-processing, we follow the setting in [18] to normalize the intensity, and resample all 2D slices and the corresponding segmentation maps to \(384\times 384\) in the axial plane. For each of the five sites, we randomly split the original site dataset into training and testing sets with a ratio of 4:1. For each site's training, we sample a small subset (i.e., 1%, 3%, or 5%) of the data from the previous sites and combine it with the current site data for training.
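As a small illustration of this split protocol (a 4:1 train/test split per site plus a 1%, 3%, or 5% exemplar subset of a finished site), the snippet below uses synthetic tensors as stand-ins for the real MRI slices.

```python
import torch
from torch.utils.data import TensorDataset, random_split

# Synthetic stand-in for one site: 20 axial slices at 384 x 384 with binary masks.
site = TensorDataset(torch.randn(20, 3, 384, 384),
                     torch.randint(0, 2, (20, 1, 384, 384)).float())

# 4:1 split into training and testing.
n_train = int(0.8 * len(site))
train_set, test_set = random_split(site, [n_train, len(site) - n_train])

# Exemplars kept from a finished site: a small portion (here 5%) of its training split.
portion = 0.05
n_keep = max(1, int(portion * len(train_set)))
exemplars, _ = random_split(train_set, [n_keep, len(train_set) - n_keep])
```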

Table 2. Comparison of segmentation performance (DSC[%]/95HD[mm]) across datasets. Note that a larger DSC (\(\uparrow \)) and a smaller 95HD (\(\downarrow \)) indicate a better-performing ITL model. We use four models pretrained on ImageNet: ResNet-18, ResNet-34, ResNet-50, and ViT, under different portions (i.e., 1%, 3%, 5%) of exemplars from previous data for every incremental phase. We consider multi-site training as the lower bound, and isolated-site and mixed-site training as the upper bound.

Training and Evaluation. In this study, we implement all models in PyTorch. We set H and W to 384, \(\alpha ,\delta \) to 0.5, and the batch size to 5. To mitigate overfitting, we augment the data with random horizontal flipping, random rotation, and random shift. We adopt the ResNet family [9] (i.e., ResNet-18, ResNet-34, ResNet-50) and ViT [6] (i.e., the R50+ViT-B/16 hybrid model) as our pretrained encoders. We evaluate model performance with the Dice coefficient (DSC) and 95% Hausdorff Distance (95HD). For a fair comparison, we adopt the same decoder architecture design as in [18] (shown in Appendix Table 4), and do not use any post-processing techniques. All of our experiments are conducted on two NVIDIA Titan X GPUs. All models are trained with the Adam optimizer with \(\beta _1=0.9\), \(\beta _2 = 0.999\). We train for 100 epochs with a multi-step learning rate schedule: the learning rate is initialized to 0.001 and decayed by a factor of 0.95 at epochs 60 and 80.
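The optimizer and learning-rate schedule described above map directly onto standard PyTorch APIs; the placeholder module below merely stands in for the actual encoder/decoder networks.

```python
import torch
import torch.nn as nn

model = nn.Conv2d(3, 1, kernel_size=3, padding=1)   # placeholder for the actual segmentation network
optimizer = torch.optim.Adam(model.parameters(), lr=0.001, betas=(0.9, 0.999))
# Multi-step schedule: start at 0.001 and decay by a factor of 0.95 at epochs 60 and 80.
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[60, 80], gamma=0.95)

for epoch in range(100):
    # ... one training epoch over the current site data and exemplars ...
    scheduler.step()
```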

Main Results. We conduct extensive experiments on the five benchmark datasets. We adopt four models: ResNet-18, ResNet-34, ResNet-50, and ViT. We select three portions (i.e., 1%, 3%, 5%) of exemplars from previous data for every incremental phase. Our results are presented in Table 2 and Appendix Fig. 2. First and foremost, we can see that ITL-based methods generalize across all datasets under two exemplar portions (i.e., 3% and 5%), yielding segmentation quality comparable to the upper-bound baselines (i.e., isolated-site and mixed-site training) and much higher than the lower-bound counterparts. The 1% exemplar portion is slightly more challenging for ITL, but its advantage over the lower-bound counterparts remains solid. A possible explanation for this finding is that the 3% and 5% exemplar portions retain enough information from previous sites, which mitigates catastrophic forgetting, while ITL trained with the 1% exemplar portion is not powerful enough to inherit prior knowledge and generalize well on newly added sites. Second, we consistently observe that ITL with larger models (i.e., ResNet-50 and ViT) generalizes substantially better than with smaller models (i.e., ResNet-18 and ResNet-34), demonstrating competitive performance across all datasets. These results suggest that ITL with a large pretrained encoder yields substantial gains in the setting of very limited data.

4 Analysis and Discussion

We address several research questions pertaining to our ITL approach. We use a ResNet-18 model as the encoder in these experiments. For fair comparisons, all models are trained for the same number of epochs, and all results are the average of three independent runs. To study the effectiveness of our proposed ITL framework, we perform experiments with a \(5\%\) exemplar ratio.

Table 3. Comparison of segmentation performance in different phases.

Does Transfer Learning Lead to Better ITL? We offer two perspectives that may intuitively explain the effectiveness of transfer learning in our proposed ITL framework. As a first test of whether transfer learning makes the base-learner stronger, we plot the training and validation loss (i.e., \(\mathcal {L}_{\text {all}}\)) against iterations to demonstrate the convergence improvements in Appendix Fig. 3. We can see that training from pretrained weights converges faster than training from scratch. Another (perhaps unsurprising) observation from Appendix Fig. 3 is that using pretrained weights usually yields a slightly smaller loss than training from scratch. We then ask whether transfer learning improves performance on multi-site datasets. Since each single medical image dataset is usually of relatively small size, training the model from scratch tends to overfit to a particular dataset. To evaluate the impact of transfer learning, we compare training with and without pretraining. As shown in Appendix Table 7, training from scratch does not bring benefits to the ITL framework. Instead, we find that simply incorporating transfer learning significantly boosts the performance of ITL while achieving faster convergence, suggesting that transfer learning provides additional regularization against overfitting.
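In practice, the "with vs. without pretraining" comparison above amounts to swapping the encoder's initial weights; a minimal sketch for a ResNet-18 backbone (the backbone choice here is only illustrative) is:

```python
import torchvision

# ITL with transfer learning: encoder initialized from ImageNet-pretrained weights.
pretrained_encoder = torchvision.models.resnet18(weights="IMAGENET1K_V1")

# ITL without transfer learning: the same architecture initialized randomly (trained from scratch).
scratch_encoder = torchvision.models.resnet18(weights=None)
```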

Does ITL Generalize Well on Multi-site Datasets? We investigate whether the ITL framework generalizes well on multi-site datasets. We report the segmentation results of different phases in Table 3, from which we observe that ITL achieves good performance in every phase. This reveals that our approach is greatly helpful in reducing forgetting. We also evaluate the proposed ITL method with two random orderings (i.e., (1) {HK\(\rightarrow \)UCL\(\rightarrow \)ISBI\(\rightarrow \)ISBI1.5\(\rightarrow \)I2CVB}, and (2) {ISBI\(\rightarrow \)ISBI1.5\(\rightarrow \)I2CVB\(\rightarrow \)HK\(\rightarrow \)UCL}). The results are shown in Appendix Table 5. We perform experiments using both ordering strategies and observe comparable performance.

Efficiency of ITL. We report the network size and memory costs in Appendix Table 6. We observe that ITL achieves competitive performance with fewer network parameters than isolated-site training (upper bound), which requires a new model whenever new site data are added. We also examine the required memory footprint at each incremental phase. We observe that ITL is significantly more memory-efficient than mixed-site training (upper bound), even though the latter keeps the same network size when a new training phase is added. These results further demonstrate the efficiency of our proposed ITL framework.

5 Conclusion

In this paper, we present a novel incremental transfer learning framework for incrementally tackling multi-site medical image segmentation tasks. We propose model-level and site-level incremental training strategies for better segmentation, generalization, and transfer performance, especially in limited clinical resource settings. Extensive experimental results with four different pretrained encoder architectures demonstrate the effectiveness of our approach, offering a strong starting point to encourage future work in these important practical clinical scenarios.