
1 Introduction

Computer-aided approaches utilizing deep learning models have become prominent in the domain of medical image processing [18]. The amount and diversity of the training data used to develop these models are critical for model success and generalizability [25,26,27]. Currently, the scarcity of medical data sources and labeled data has become a bottleneck and leads to poor performance of deep-learning-based solutions [30]. To overcome these issues, several initiatives form diverse datasets for training reliable and robust models with good generalization ability and clinical usability. The EndoCV challenges incorporate diverse endoscopy video frames from several institutions worldwide, covering different modalities and organs, to apply deep learning methods to artifact and disease detection [1, 2, 21, 22]. The BraTS challenges bring together multi-institutional multi-parametric magnetic resonance imaging (mpMRI) scans for the analysis of brain tumors, and the dataset has been growing continuously [6]. Although these initiatives are essential for reliable, clinic-ready models, they are difficult to scale because they require tremendous effort. First, it is difficult to represent the whole data distribution (e.g., minority and under-represented groups), as doing so requires close collaboration with many institutions and immense annotation work. Second, data properties such as image modalities and resolutions change constantly, leading to distribution shift over time; therefore, collecting and processing all the data once is not sufficient either. Moreover, due to data privacy regulations, collecting sensitive patient data from different institutions and hospitals is not always possible. The federated learning (FL) concept offers a solution in such situations, where data privacy and ownership are a concern, by enabling collaborators to train a common global model without disclosing their local data [15, 19]. Several studies employing FL approaches in the medical domain have reported successful results [8, 25, 26]. These studies have drawn researchers' attention to FL for medical imaging and made it a popular research field recently.

In this study, we propose various FL approaches for the Federated Tumor Segmentation (FeTS) Challenge [20]. For Task 1 of the challenge, the participants are provided with an FL environment based on the OpenFL [24] framework and are asked to develop strategies that extract as much knowledge as possible from the collaborators. In this task, the participants are allowed to modify four functions: 1) custom performance metrics, 2) collaborator selection, 3) hyperparameter selection, and 4) custom aggregation. Our proposed methods took 3rd place in the competition.

2 Related Work

Recently, the use of FL has been increasing in the medical field. In [11], Huang et al. proposed Loss-based Adaptive Boosting Federated Averaging (LoAdaBoost FedAvg) on the critical-care database MIMIC-III [13]. In this method, collaborators whose losses are higher than the median loss of the previous round are retrained before their models are sent to the server for aggregation. In [16], Li et al. proposed a federated learning system for brain tumor segmentation on the BraTS 2018 dataset [6] and showed the trade-off between privacy protection costs and model performance. Similarly, in [26], Sheller et al. compared federated learning with other privacy-preserving collaborative learning approaches, such as institutional incremental learning and cyclic institutional incremental learning, on the brain tumor segmentation task. This study showed that FL can overcome institutional biases and form a global model with better generalization where data amount and diversity are inadequate. In [8], Dou et al. used an FL architecture to detect chest CT abnormalities in COVID-19 patients and showed that the federated global model generalizes to external datasets better than both the individual models and their ensemble.

3 Data

Fig. 1. The data distribution in the training dataset splits.

The Federated Tumor Segmentation (FeTS) Challenge 2021 is the first challenge in the federated medical imaging area. The challenge dataset is composed of multi-institutional magnetic resonance images from the International Brain Tumor Segmentation (BraTS) challenge and other independent institutions in the FeTS initiative [3,4,5, 20, 24]. The training set contains 341 images, whose institution-based split is given in Fig. 1. The validation and test sets contain 111 and 166 images, respectively. The segmentation annotations of the challenge dataset were performed by annotators with varying levels of clinical and academic experience. These annotations were then approved by two board-certified neuroradiologists, each with more than 12 years of experience [20].

4 Methods

4.1 Aggregator

In a real-life FL setting, the data distribution across collaborators is non-independent and identically distributed (non-IID) because collaborators may differ both in their data distributions and in their numbers of observations. Differences in device capabilities, user demographics, or geographic location can be major sources of this non-IIDness [14, 19].

When collaborators have access to differing amounts of data and use the same number of epochs E in their local training, they perform different numbers of local updates \(\tau \). If a collaborator has \(n_i\) samples, the number of local gradient descent (GD) iterations is \(\tau _i = En_i/B\), where B is the mini-batch size. In [29], Wang et al. have shown that, when vanilla weighted averaging is used, the heterogeneity in collaborators' local progress causes convergence to a stationary point of a mismatched objective function, different from the true objective. Instead, they propose FedNova, a normalized averaging method that prevents bias toward clients performing more local updates. The shared global model is updated as in Eq. 1.

$$\begin{aligned} \boldsymbol{x}^{t+1}=\boldsymbol{x}^t-\tau _\mathrm {eff} \sum \limits _{i=1}^m p_i \frac{\varDelta _i^t}{\tau _i^t} \end{aligned}$$
(1)

where \(p_i\) denotes the relative sample size of collaborator i (i.e., \(p_i=n_i/n\), where n is the total number of samples), \(\tau _\mathrm {eff}=\sum _{i=1}^m p_i \tau _i\), \(\varDelta _i^t = \boldsymbol{x}^t-\boldsymbol{x}_i^{t+1}\), and m is the total number of collaborators. Since the number of samples \(n_i\) for collaborator i is directly proportional both to the number of local iterations \(\tau _i\) and to the relative sample size \(p_i\), this formula can be rewritten as in Eq. 2.

$$\begin{aligned} \boldsymbol{x}^{t+1}=\boldsymbol{x}^t-\gamma \sum \limits _{i=1}^m \varDelta _i^t \end{aligned}$$
(2)

where \(\gamma \) refers to the aggregator learning rate, which can be increased or decreased according to the needs of FL training. As given in Eq. 2, FedNova corresponds to uniform averaging with an adjustable step size (or learning rate) on the aggregator (a minimal sketch is given after Fig. 2). FedNova aims to prevent the exacerbation of client drift caused by the relative sample sizes \(p_i\). When there is a significant difference between the numbers of samples held by the collaborators, as in the FeTS Challenge dataset, FedAvg creates a bias toward the collaborators having more samples (Fig. 2). Although the validation-set results (named Val-1 in Sect. 5) reported during training may seem good, since this set's data distribution comes directly from the training set, out-of-distribution performance may not be satisfactory. Wang et al. [29] have shown that FedNova generally achieves 6–9% higher accuracy than FedAvg on a non-IID version of the CIFAR-10 dataset.

Fig. 2. Naive weighted averaging (FedAvg) creates a bias toward collaborators with larger numbers of samples, which may adversely affect out-of-distribution performance. FedNova, on the other hand, gives equal weight to all collaborators, acting as a regularizer.
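To make Eq. 2 concrete, the following minimal Python sketch applies the FedNova-style normalized update to per-layer weight tensors. The function name, the dict-of-arrays weight representation, and the default \(\gamma \) are illustrative assumptions, not the challenge's OpenFL interface.

```python
def fednova_aggregate(global_weights, local_weights_list, gamma=1.0):
    """FedNova-style server update (Eq. 2):
    x^{t+1} = x^t - gamma * sum_i Delta_i, with Delta_i = x^t - x_i^{t+1}.
    Weights are dicts mapping layer names to NumPy arrays; gamma absorbs
    the tau_eff normalization, and gamma = 1/m recovers plain uniform
    averaging of the local updates."""
    new_weights = {}
    for name, w_global in global_weights.items():
        # Sum of per-collaborator update directions for this layer.
        delta_sum = sum(w_global - local[name] for local in local_weights_list)
        new_weights[name] = w_global - gamma * delta_sum
    return new_weights

# Example with a single 1-D "layer" (values are illustrative):
# import numpy as np
# g = {"w": np.array([1.0, 2.0])}
# locals_ = [{"w": np.array([0.5, 1.5])}, {"w": np.array([1.5, 2.5])}]
# fednova_aggregate(g, locals_, gamma=0.5)  # -> {"w": array([1., 2.])}
```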

Another approach to dealing with convergence issues when collaborators' data distributions are non-IID is federated averaging with server momentum (FedAvgM). Momentum on top of stochastic gradient descent (SGD) has proven to accelerate training and dampen oscillations significantly [9]. In [10], Hsu et al. have shown that, as the level of non-IIDness increases, the performance of FedAvgM stays relatively constant while that of federated averaging falls rapidly. Moreover, [23] has shown that momentum-based adaptive optimizers such as Adam and RMSProp also improve federated averaging.

In FedAvgM, the weighted average of the collaborators' updates is added to the accumulated update, which is scaled by a parameter \(\beta \) that adjusts the effect of the momentum, as shown in Eq. 3. This accumulated update is then used to update the weights of the current communication round as in Eq. 4. Here, an aggregator learning rate \(\gamma \) can be used to adjust the step size on the server (in our experiments, \(\beta \) is set to 0.9 and \(\gamma \) to 1).

$$\begin{aligned} \varDelta \boldsymbol{w}^{t+1} = \sum \limits _{i=1}^m p_i \varDelta \boldsymbol{w_i}^{t+1} \end{aligned}$$
$$\begin{aligned} \boldsymbol{v}^{t+1}=\beta \boldsymbol{v}^t + \varDelta \boldsymbol{w}^{t+1} \end{aligned}$$
(3)
$$\begin{aligned} \boldsymbol{w}^{t+1}=\boldsymbol{w}^t- \gamma \boldsymbol{v}^{t+1} \end{aligned}$$
(4)

where \(p_i\) denotes the relative sample size of collaborator i (i.e., \(p_i=n_i/n\), where n is the total number of samples), and m is the total number of collaborators.
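The following sketch combines Eqs. 3 and 4 into a single server-side step; as above, the weight representation and helper names are our own assumptions, and the momentum buffer must be initialized to zero arrays before the first round.

```python
def fedavgm_aggregate(global_weights, local_weights_list, sample_sizes,
                      velocity, beta=0.9, gamma=1.0):
    """FedAvgM server update (Eqs. 3 and 4). `velocity` holds the
    accumulated update v^t per layer; beta = 0.9 and gamma = 1.0 match
    the values used in our experiments."""
    n_total = sum(sample_sizes)
    new_weights, new_velocity = {}, {}
    for name, w_global in global_weights.items():
        # Weighted average of client updates: sum_i p_i (w^t - w_i^{t+1}).
        delta = sum((n_i / n_total) * (w_global - local[name])
                    for n_i, local in zip(sample_sizes, local_weights_list))
        # Momentum accumulation (Eq. 3) followed by the server step (Eq. 4).
        new_velocity[name] = beta * velocity[name] + delta
        new_weights[name] = w_global - gamma * new_velocity[name]
    return new_weights, new_velocity
```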

Along with FedNova and FedAvgM, other aggregator functions (Table 1) have been implemented and evaluated in the FeTS Challenge; however, in this article, only the results for FedNova and FedAvgM are presented. Please visit https://github.com/eceisik/FeTS_Challenge_METU_FL_Team for all methods implemented by the METU FL Team.

Table 1. The list of other aggregator methods implemented.

4.2 Collaborator Selection

How to choose the collaborators that take part in each round is another important dimension of the FeTS Challenge. We used "all_collaborators_train" as the collaborator choice function, so all collaborators participated in every FL round.

We implemented two alternative collaborator choice functions, given in Table 2. If the focus is on the convergence-time metric, the method called "choose random nodes with faster ones" may be preferable, as sketched below. This method does not introduce any extra communication delay because, once a random collaborator is selected, only the collaborators that are faster than the selected one participate in training for that FL round (i.e., the selected collaborator creates an upper bound on round time for the other participants). Although the number of collaborators participating in each round varies, this mechanism tends to favor the fastest collaborators. Being fast, in this case, depends on two factors: the available computation/communication resources and the number of samples held by a collaborator. On the other hand, institutions with fewer patient images may be over-represented, which is a disadvantage of this method.
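A minimal sketch of this rule follows, assuming per-collaborator round-time estimates (e.g., measured in earlier rounds) are available as a dictionary; the simplified signature and the timing source are assumptions rather than the challenge's exact hooks.

```python
import random

def choose_random_nodes_with_faster_ones(collaborators, round_times):
    """Select one collaborator uniformly at random, then keep every
    collaborator whose round time does not exceed the selected one's.
    `round_times` maps collaborator IDs to a timing estimate, e.g. the
    duration of each collaborator's previous round (an assumed source)."""
    pivot = random.choice(collaborators)
    bound = round_times[pivot]
    return [c for c in collaborators if round_times[c] <= bound]

# Example: with times {'A': 90, 'B': 200, 'C': 120}, picking 'C' as the
# pivot selects ['A', 'C']; picking 'A' selects only ['A'].
```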

4.3 Hyperparameter Selection

For hyperparameter selection, an adaptation of AdaComm [28] with a learning rate scheduling scheme is used. AdaComm [28] is an adaptive communication strategy that reduces communication delay and enables fast convergence by averaging less frequently in the early training rounds and increasing the communication frequency later. In [28], the experimental convergence analysis was performed with respect to wall-clock time instead of communication rounds. It was shown that using more local updates in the early rounds of training results in a faster decrease in loss but also a higher final error. For this reason, AdaComm starts with a large number of local updates per round and gradually decreases it as the model starts to converge.

Table 2. The list of other collaborator choice methods implemented.

In its original version, AdaComm adapts the number of local updates in an IID setting. In the challenge, however, the data distribution is extremely uneven: while Institute-1 holds 37.83% of the training data, Institute-14 holds only 0.88% (Fig. 1). Using the same number of local updates for each collaborator could therefore over-represent some small data-provider institutions. Considering the non-IID nature of the data distribution, our aggregation mechanism, and the fact that the number of local updates is directly proportional to the number of epochs, we adapted this method to decay the number of epochs instead (AdaptiveEpoch). The number of epochs at each FL round decays according to the relative difference between the initial loss and the current round loss, as stated in Eq. 5.

$$\begin{aligned} E_t = \Bigg \lceil \sqrt{\frac{F(x_{T=t})}{F(x_{T=0})}}E_0 \Bigg \rceil \end{aligned}$$
(5)

where T indexes the FL rounds, t denotes the current round number, \(E_t\) denotes the number of epochs at round t, and F(x) is the objective function with respect to the model parameters denoted by x.
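Eq. 5 reduces to a one-line computation; the sketch below assumes the aggregated training loss is available at round 0 and at the current round.

```python
import math

def adaptive_epochs(initial_loss, current_loss, initial_epochs=8):
    """Number of local epochs for the current round (Eq. 5):
    E_t = ceil(sqrt(F(x_t) / F(x_0)) * E_0); E_0 = 8 matches our setup."""
    return math.ceil(math.sqrt(current_loss / initial_loss) * initial_epochs)

# When the aggregated loss drops to a quarter of its initial value,
# the epoch count halves: adaptive_epochs(1.0, 0.25) == 4.
```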

Learning rate scheduling is a commonly used technique for training deep neural networks in a centralized manner [9]. Studies show that learning rate scheduling is also necessary for FedAvg to converge to an optimum of the loss function [17]. However, there are many scheduling strategies and no benchmark of their performance. In this study, we adopted a decay-on-plateau approach (a minimal sketch is given after Table 3). This strategy introduces two new parameters, patience and decay factor. In our implementation, the learning rate scheduler tracks the target performance metric, the mean Dice score over the ET, TC, and WT labels; if the metric does not improve for a patience number of rounds, the learning rate is scaled by the decay factor. Experiments show that learning rate scheduling provides faster convergence, more relaxed learning rate selection, a higher convergence score, and reduced oscillations as training converges [9]. The list of hyperparameter selection methods is given in Table 3. For AdaptiveEpoch, the initial epoch count \(E_0\) is set to 8; for LR scheduling, the initial LR is set to 0.0002 and the patience to 15. For the constant hyperparameters, the default values were used (LR = 0.00005, one epoch per round).

Table 3. The list of other hyperparameter selection methods.
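The following minimal scheduler sketch implements the decay-on-plateau logic described above; the decay factor of 0.5 is an assumed value, since only the initial LR (0.0002) and patience (15) are fixed by our experiments.

```python
class DecayOnPlateau:
    """Decay-on-plateau LR scheduler tracking the mean Dice score.
    If the metric fails to improve for `patience` consecutive rounds,
    the learning rate is multiplied by `decay_factor`. The initial LR
    (0.0002) and patience (15) match our experiments; decay_factor = 0.5
    is an assumed value."""

    def __init__(self, lr=2e-4, patience=15, decay_factor=0.5):
        self.lr = lr
        self.patience = patience
        self.decay_factor = decay_factor
        self.best = float("-inf")
        self.stale_rounds = 0

    def step(self, mean_dice):
        # Higher mean Dice is better; reset the counter on improvement.
        if mean_dice > self.best:
            self.best = mean_dice
            self.stale_rounds = 0
        else:
            self.stale_rounds += 1
            if self.stale_rounds >= self.patience:
                self.lr *= self.decay_factor
                self.stale_rounds = 0
        return self.lr
```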

5 Experimental Results and Discussion

Before the FL training, the training dataset is split into train and validation sets of 80% and 20%, respectively. The performance results of the aggregated and individual models on the validation sets are logged at each FL round (this logging is integrated with the FeTS Challenge source code). Unless otherwise stated, all reported performance metrics and loss graphs belong to this validation split of partitioning_2.csv. The mean Dice score refers to the average of the Dice scores for the ET, TC, and WT labels.

Fig. 3. The performance comparison of FedAvg, FedNova, and FedAvgM.

Figure 3 shows the performance of FedAvg, FedNova, and FedAvgM on the aggregator mean Dice score, aggregator loss, and aggregator sensitivity metrics. Since medical datasets may contain institutional biases [26] and FedAvg has the undesirable effect of favoring these biases, FedNova is expected to perform better on non-training sets. However, since the institutional distributions of the training and validation sets are similar, we observe nearly identical performance for FedAvg and FedNova. Still, models built with FedNova are expected to perform better on out-of-distribution data [29] and, as such, to be more suitable for real-life use cases. FedAvgM, on the other hand, outperforms both FedAvg and FedNova on all metrics. Therefore, we preferred FedAvgM as the aggregation method in the FeTS Challenge.

Fig. 4. The impacts of LR scheduling on FedAvg, FedNova, and FedAvgM.

Figure 4 shows the effect of the LR scheduling approach on each aggregation method. For both FedAvg and FedNova, LR scheduling has an evident effect on both the loss and the performance metrics; in particular, a sharp increase in the performance metrics occurs when the LR is decayed. LR scheduling yields no improvement for FedAvgM, however. One possible reason is that FedAvgM converges much faster than FedAvg and FedNova and may already have reached the optimal region, where no scheduling is needed. It should be noted, though, that we used fixed values for the initial learning rate, decay rate, and patience; more experiments with different sets of values are needed before drawing conclusions about the effect of LR scheduling on FedAvgM.

As seen in Fig. 5, AdaptiveEpoch helps training converge in fewer rounds and with higher performance than the constant hyperparameters, owing to the larger number of local epochs. The AdaptiveEpoch method improves the performance of FedAvg, FedNova, and FedAvgM on the aggregator loss, aggregator mean Dice score, and aggregator sensitivity metrics. The improvement AdaptiveEpoch brings to the aggregation methods is much more significant than that of LR scheduling, and the performance increase can be observed in both the loss and the performance metrics.

Fig. 5. The impacts of adaptive epoch on FedAvg, FedNova, and FedAvgM.

Figure 6 shows the performance comparison of the different hyperparameter strategies on FedAvgM. LR scheduling, AdaptiveEpoch, and AdaptiveEpoch+LR scheduling all improve on the baseline model performance. AdaptiveEpoch and AdaptiveEpoch+LR scheduling provide faster convergence than LR scheduling alone; however, there is no significant difference between AdaptiveEpoch and AdaptiveEpoch+LR scheduling. Due to time and resource constraints, the number of FL rounds was set to 70 for all experiments, which in turn limited the effect of LR scheduling and AdaptiveEpoch+LR scheduling because the LR did not decay completely.

Fig. 6. The impacts of hyperparameter setting strategies on FedAvgM.

Table 4 shows the mean Dice score and convergence score obtained on the validation set. These experiments were performed using partitioning-2 as the data split. The convergence score is computed as the area under the validation learning curve, where the horizontal axis is the runtime and the vertical axis is the performance; a minimal sketch of this computation is given after Table 4. FedAvgM achieves the best mean Dice score and convergence score for all hyperparameter choice strategies except LR scheduling, which is expected and in line with the results presented in Fig. 4. Nevertheless, the convergence score is based on the validation set reported during FL training; the comparison of convergence scores on an out-of-distribution set therefore remains an open question.

Table 4. The mean Dice score and convergence scores on the validation set.
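A minimal sketch of the convergence-score computation under these definitions follows; normalizing the area by the total runtime is our assumption about the exact challenge formula.

```python
import numpy as np

def convergence_score(runtimes, dice_scores):
    """Area under the validation learning curve (runtime on the x-axis,
    mean Dice on the y-axis), estimated by trapezoidal integration and
    normalized by the total runtime (the normalization is an assumption
    about the exact challenge formula)."""
    runtimes = np.asarray(runtimes, dtype=float)
    dice_scores = np.asarray(dice_scores, dtype=float)
    area = np.trapz(dice_scores, x=runtimes)
    return area / (runtimes[-1] - runtimes[0])
```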

Table 5 presents the results of our challenge submission on the challenge test set, with a convergence score of 0.770. The results were provided by the FeTS initiative.

Table 5. The scores obtained on Leaderboard 2 of Task 1 of the FeTS Challenge, in which our team (METU FL) took \(3^{rd}\) place.

6 Conclusion

In this study, we performed comprehensive experiments to compare different hyperparameter selection strategies and aggregation methods. The experiments reveal that FedAvgM performs better than FedAvg and FedNova. Moreover, the AdaptiveEpoch approach is shown to provide a performance increase and faster convergence. However, LR scheduling is not effective with FedAvgM or AdaptiveEpoch. Methods that work well individually may therefore not work well when combined, or one may reduce the effectiveness of the other. For instance, while AdaptiveEpoch yields better validation mean Dice scores and convergence scores than the constant hyperparameter strategy, combining it with LR scheduling worsens all mean Dice and convergence scores for all aggregation methods (see Table 4).

During the experiments, all collaborators participated in the local training process in every round. Instead, collaborator selection methods, such as clustering collaborators by update similarity or increasing the selection likelihood of collaborators that previously improved performance, could be utilized to improve performance further.

Moreover, in the medical imaging domain, there is generally high inter-observer variability in annotations, which can be considered label noise. For example, if an institution's label quality is low, the model coming from that institution will adversely affect the global model; therefore, the weights coming from that institution should be handled carefully. Defense mechanisms such as KRUM [7], BARFED [12], or trimmed mean [31] can mitigate attacks in federated learning to some extent; these defense strategies may also be used to cope with label noise.

7 GPU Training Times

Computation time and cost, as well as energy consumption, are important factors determining the direction of future research and the adoption of the technology in real life. Table 6 shows the detailed GPU training times of the experiments, which were run on a single NVIDIA A100-80GB GPU. LR scheduling has no significant effect on the training times. On the other hand, although the AdaptiveEpoch strategy improves the performance metrics, it nearly doubles the total training time due to longer rounds.

Table 6. The detailed GPU training times (hour).