
1 Introduction

Medical institutions often seek to leverage their data to build deep learning models for predicting different diseases [10, 23]. However, limited access to data can impede the learning process [27] or introduce bias towards the local data [18]. Hence, collaborative learning is vital to expand data availability and maximize model performance. Federated Learning (FL) [20] provides a paradigm in which multiple institutions/devices use their data to train local models and regularly exchange their model parameters with a central server. The server aggregates the local model parameters to update a global model, which is shared back to the clients. This allows all parties to keep their data local, yielding better results while preserving data privacy. Recent works have underscored the importance of FL in medical imaging [9, 30].

Typically, FL algorithms (e.g., FedAvg [24]) operate under the assumption that clients participate in the protocol honestly and that their data distributions are identical. However, the competitive nature of the participants makes the protocol vulnerable to poisoning attacks. Real-world FL is susceptible to a variety of targeted/untargeted attacks [12, 34] and Byzantine faults [19]. Techniques proposed to tackle non-iid data [20] cannot handle malicious parties that attempt to degrade the global model's performance. Though many robust aggregation rules have been proposed for Byzantine-tolerant FL [5, 14, 33], such methods often require a careful choice of hyperparameters (e.g., prior knowledge of the proportion of malicious clients) and are vulnerable to more sophisticated attacks [11].

Fig. 1. Federated Learning (FL) in the presence of a malicious client. The Distance-based Outlier Suppression (DOS) aggregation rule (right) can be applied to robustly aggregate local parameter updates.

The core contribution of this work is a novel robust aggregation rule called Distance-based Outlier Suppression (DOS) for Byzantine-tolerant FL. The proposed framework (see Fig. 1) enables the central server to suppress local parameter updates from malicious clients during aggregation. DOS computes distances (Euclidean and cosine) between the local parameter updates of different clients and detects outliers in this distance space. A state-of-the-art, parameter-free outlier detection algorithm called COPOD [21] is applied to compute an outlier score for each client based on the above distances. Finally, the outlier scores are mapped to client-specific weights, and a weighted average of the local parameter updates is used to update the global model. The proposed DOS aggregation rule demonstrates high robustness against diverse poisoning attacks under both iid (CheXpert) and non-iid (HAM10000) settings.

2 Related Work

Most FL algorithms aim to minimize the following loss function:

$$\begin{aligned} \min _{\theta \in \mathbb {R}^d} \left\{ F(\theta ) := \sum _{i=1}^{n} \alpha _i F_i(\theta )\right\} \qquad \mathrm{(1)} \end{aligned}$$

where \(n\) is the number of clients, \(F_i\) is the loss function of the \(i^{th}\) client, and \(\alpha _i \in [0,1]\) weights the contribution of the \(i^{th}\) client. Popular aggregation rules such as FedAvg [24] either assign equal weights to all clients (\(\alpha _i=1/n, \forall ~i=1,2,\ldots ,n\)) or weight clients by the relative size of their training sets. Such schemes have proven effective under the assumption of honest clients and iid data distributions. FedProx [20] was introduced to tackle heterogeneity and non-iid data across clients. Progressive Fourier Aggregation (PFA) was introduced in [6] to improve stability by preventing an abrupt drop in accuracy. While these methods promise convergence, they do not account for noisy parameters [32] or malicious parties that attempt to hinder learning or cause it to converge to poor local minima.
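
As a concrete illustration, the following minimal sketch implements the weighted aggregation of Eq. (1) for flattened parameter vectors; the function name and the NumPy representation are illustrative, not taken from any of the cited implementations.

```python
import numpy as np

def fedavg_aggregate(client_params, client_sizes=None):
    """Weighted aggregation of Eq. (1): theta = sum_i alpha_i * theta_i.

    client_params: list of n flattened local parameter vectors.
    client_sizes:  optional local training-set sizes; if omitted,
                   every client gets equal weight alpha_i = 1/n.
    """
    params = np.stack(client_params)                    # shape (n, d)
    if client_sizes is None:
        alphas = np.full(len(client_params), 1.0 / len(client_params))
    else:
        alphas = np.asarray(client_sizes, dtype=float)
        alphas = alphas / alphas.sum()                  # weights sum to 1
    return alphas @ params                              # weighted average
```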

Since FL requires communication between the server and clients, it is susceptible both to random network errors that deliver abnormal parameters to the server and to malicious parties that attempt to corrupt the learning process. Such failures are termed Byzantine failures [19]. In this setting, there are two types of attacks: (i) targeted attacks that cause the global model to misclassify a selected class (e.g., backdoor attacks [2], model poisoning attacks [3, 4], and data poisoning attacks [25]) and (ii) untargeted attacks that degrade the model's performance across all classes [20]. Moreover, some techniques are designed to make an attack stealthier [34]. Other potential attacks include label-flipping, where the model is trained on incorrect labels [12], and crafted model attacks [11], which formulate the attack as an optimization problem. In this paper, we focus mainly on untargeted attacks since they are more damaging to the learning process.

Several robust aggregation rules have been proposed to defend against Byzantine failures. Krum [5] selects the single local update whose summed distance to its closest subset of other updates is smallest. Coordinate-Wise-Median [33] aggregates with the per-coordinate median, which is less sensitive to outliers than the mean. Trimmed Mean [33] sorts each coordinate's values and discards a fraction of the smallest and largest values before averaging. Bulyan was proposed in [14] to further improve the robustness. However, most of these methods require hyperparameter selection (e.g., the proportion of outliers to discard) and are susceptible to crafted model attacks [11]. More recently, Robust Federated Aggregation (RFA) [28], which aggregates using the geometric median, and SparseFed [26], which uses global top-k update sparsification and device-level gradient clipping, have been proposed to mitigate poisoning attacks.
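
For intuition, here is a minimal sketch of two of these baselines, assuming each client update is a flat NumPy vector; note that Trimmed Mean needs the trimming fraction chosen in advance, which illustrates the hyperparameter burden mentioned above.

```python
import numpy as np

def coordinate_wise_median(client_params):
    """Coordinate-Wise-Median [33]: per-coordinate median across clients."""
    return np.median(np.stack(client_params), axis=0)

def trimmed_mean(client_params, beta=0.1):
    """Trimmed Mean [33]: per coordinate, drop the beta fraction of the
    smallest and largest values, then average the remainder. The fraction
    beta (assumed proportion of outliers) must be chosen in advance."""
    sorted_params = np.sort(np.stack(client_params), axis=0)  # per-coordinate sort
    k = int(beta * sorted_params.shape[0])
    if k > 0:
        sorted_params = sorted_params[k:-k]
    return sorted_params.mean(axis=0)
```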

Algorithm 1. Distance-based Outlier Suppression (DOS) aggregation rule.

3 Proposed Method

We propose Distance-based Outlier Suppression (DOS), a robust FL aggregation rule (see Algorithm 1) that can defend against different untargeted poisoning attacks on FL as long as the proportion of clients experiencing Byzantine failures is less than 50%. Suppose there are \(n\) clients in the FL setup whose goal is to learn the global model parameters \(\theta \) that minimize the objective function in Eq. (1). Note that a proportion \(p\) of these clients (\(p < 50\%\)) can be malicious and may not share the same objective as the honest clients. As in FedAvg, the global model parameters are initialized to \(\theta ^0\). In each of the \(T\) communication rounds, the current global model parameters \(\theta ^t\) are broadcast to all clients. Each client \(i \in [n]\) learns the local model parameters \(\theta _i^{t+1}\) on its local data using a suitable optimizer and sends \(\hat{\theta }_i^{t+1}\) back to the server. While \(\hat{\theta }_i^{t+1}\) is expected to be a faithful version of \(\theta _i^{t+1}\) (except for known transformations like compression) for honest clients, this may not be the case for malicious clients. The server computes the updated global model parameters using the DOS aggregation rule, which consists of three key steps:

Distance Computation: The server starts by computing the Euclidean and cosine distances between the local parameter updates sent by the clients:

$$\begin{aligned} d^E_{ij} = ||\hat{\theta }_i - \hat{\theta }_j||_{2}, \qquad d^C_{ij} = 1 - \tfrac{\hat{\theta }_i^{T} \hat{\theta }_j}{||\hat{\theta }_i||_{2} ||\hat{\theta }_j||_{2}}, \end{aligned}$$

where \(i,j = 1,2,\ldots ,n\); the time index \(t\) is omitted for brevity. The \((n \times n)\) distance matrices \(M_E = [d^E_{ij}]\) and \(M_C = [d^C_{ij}]\) are then computed and used for outlier detection. Unlike existing methods such as [5, 14, 33], which perform outlier detection directly in the model parameter space, the proposed DOS method performs outlier detection in the distance space.
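
A minimal sketch of this step, assuming the local updates arrive as flattened NumPy vectors (the function name and the Gram-matrix shortcut are our own choices):

```python
import numpy as np

def distance_matrices(client_params):
    """Pairwise Euclidean (M_E) and cosine (M_C) distance matrices over
    the flattened local updates; row i corresponds to client i."""
    P = np.stack(client_params)                         # shape (n, d)
    gram = P @ P.T                                      # pairwise inner products
    sq_norms = np.diag(gram)
    # Euclidean distances via the Gram matrix (avoids an (n, n, d) tensor)
    M_E = np.sqrt(np.maximum(sq_norms[:, None] + sq_norms[None, :] - 2.0 * gram, 0.0))
    norms = np.sqrt(sq_norms)                           # assumes non-zero updates
    M_C = 1.0 - gram / np.outer(norms, norms)           # cosine distances
    return M_E, M_C
```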

Outlier Score Computation: After exploring various anomaly detection methods such as Random Forest [29], Local Outlier Factor (LOF) [7], and K-means [22], we selected the Copula Based Outlier Detection (COPOD) method [21] for several reasons. Firstly, the other methods require choosing the percentage of abnormal data points (in our case, clients) in advance, whereas COPOD is parameter-free, making it more robust. Moreover, COPOD is computationally efficient even in high-dimensional settings because it operates on the marginals of the joint distribution, allowing greater flexibility and individual modeling of each dimension. Finally, COPOD has desirable interpretability properties since it is based on modeling the tail probabilities along each dimension. In general, COPOD takes an \((n \times d)\) matrix as input, where each row represents a sample and each column an attribute, and produces an \(n\)-dimensional vector of outlier scores \(\textbf{r}=[r_1, r_2, \ldots , r_n]\), where \(r_i \in (0,+\infty )\) represents the relative likelihood that sample \(i\) is an outlier (true outliers are expected to have larger values of \(r_i\)). In DOS, we compute \(\textbf{r}_E = COPOD(M_E)\) and \(\textbf{r}_C = COPOD(M_C)\) and average the two vectors to obtain the final outlier score \(\textbf{r} = (\textbf{r}_E+\textbf{r}_C)/2\).
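
This step could be sketched as follows using the COPOD implementation from the pyod library; the paper does not state which implementation was used, so this choice is an assumption.

```python
from pyod.models.copod import COPOD   # pip install pyod

def copod_scores(M):
    """Outlier score per client: each row of the (n x n) distance matrix M
    is one sample with n attributes; larger scores indicate likelier outliers."""
    detector = COPOD()
    detector.fit(M)
    return detector.decision_scores_

# Final DOS score: average of the scores from the two distance matrices.
# r = (copod_scores(M_E) + copod_scores(M_C)) / 2
```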

Fig. 2. Left: Distribution of the multi-class multi-label CheXpert dataset among clients, where each stacked bar represents the number of positive cases for classes 0 to 12. Right: Distribution of the multi-class HAM10000 dataset among clients in the non-iid case, where each stacked bar represents the number of samples for classes 0 to 6.

Weighted Average Aggregation: Ideally, the local parameter updates of clients with higher outlier scores should be suppressed, while those of clients with lower outlier scores should be amplified. To achieve this, we apply a softmax function to the negated outlier score vector (equivalently, a softmax with temperature \(-1\)) and use the resulting output as the normalized weight for each client, i.e.,

$$\begin{aligned} w_i = \tfrac{\exp (-r_i)}{\sum _{j=1}^n \exp (-r_j)},\quad i = 1,2,\ldots ,n. \end{aligned}$$

The local parameters of the clients are aggregated using the following rule:

$$\begin{aligned} \theta ^{t+1} = \textstyle {\sum }_{i=1}^n w_i \hat{\theta }_i^{t+1}. \end{aligned}$$
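
Putting the three steps together, a minimal end-to-end sketch of the DOS rule, reusing the illustrative `distance_matrices` and `copod_scores` helpers sketched above:

```python
import numpy as np

def dos_aggregate(client_params):
    """End-to-end DOS rule: distances -> COPOD scores -> softmax weights
    -> weighted average of the local parameter updates."""
    M_E, M_C = distance_matrices(client_params)
    r = (copod_scores(M_E) + copod_scores(M_C)) / 2.0   # outlier scores
    w = np.exp(-(r - r.min()))       # softmax of -r; the shift is only for
    w = w / w.sum()                  # numerical stability (softmax is shift-invariant)
    return w @ np.stack(client_params), w               # new global params, weights
```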

4 Experiments

4.1 Datasets

CheXpert: CheXpert is a large chest X-ray dataset with uncertainty labels and expert comparison [17]. We use ‘CheXpert-small’, a multi-class multi-label dataset that contains 191,456 chest X-ray images for training. The dataset is imbalanced and has 13 pathology categories. In addition, a fraction of the labels are uncertain; [17] suggests either ignoring the uncertain labels during training or mapping them all to zero or one. In our experiments, all uncertain labels were mapped to zero. The TorchXRayVision library [8] was used to preprocess the data, resizing every image to \(224 \times 224\) resolution. We divide the training images equally among the clients as shown in Fig. 2 (Left), where the last stacked bar represents the test set used to measure the global model's performance after each round. We use the CheXpert dataset to evaluate our method in the iid setting.

Table 1. Area under the receiver operating characteristic curve (AUC) under different types of poisoning attack scenarios on the CheXpert dataset. T-M stands for Trimmed Mean [33]. The last column reports the average AUC over the five presented scenarios.

HAM10000: HAM10000 is a multi-class dataset consisting of 10,015 dermoscopic images from different populations [31]. It covers 7 categories of pigmented lesions, and every image was resized to \(128 \times 128\). The train-test split is 8,910 and 1,105 images, respectively. We use this dataset to test our method in the non-iid case, as shown in Fig. 2 (Right).

4.2 FL Setup

In all our experiments, we assume \(n=10\) clients. The well-known ResNet-18 [16] model, as well as a custom Convolutional Neural Network (CNN) with two convolutional layers, a ReLU activation function, and two fully connected layers, are used as the architectures for both the global and local models. Each client has access only to its local dataset, and the local models are trained using Stochastic Gradient Descent (SGD) [13] with a learning rate of 0.01. We implement all our experiments in the PyTorch framework [1]. For CheXpert, each client trains with a batch size of 16 for a total of 100 rounds, with 1 local step per round that goes through the entire local dataset. For the HAM10000 dataset, each client iterates through its whole local dataset with a batch size of 890, training for a total of 250 rounds with 5 local steps per round. This experiment was performed to show that the batch size does not impact the DOS aggregation rule. The evaluation metric is the Area Under the Receiver Operating Characteristic curve (AUC), calculated on the test set after each communication round. For the CheXpert dataset [17], the AUC was calculated for each class and the macro average was taken over all classes. For the HAM10000 dataset [31], the average AUC was computed over all possible pairwise class combinations to minimize the effect of class imbalance [15].
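
For reference, a client's local update between communication rounds might look like the following PyTorch sketch. The loss choice is an assumption: `CrossEntropyLoss` fits the multi-class HAM10000 setting, while the multi-label CheXpert setting would instead use `BCEWithLogitsLoss`; hyperparameters follow the setup above.

```python
import torch
from torch import nn

def local_update(model, loader, epochs=1, lr=0.01, device="cpu"):
    """One client's local training between communication rounds."""
    model = model.to(device)
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)
    criterion = nn.CrossEntropyLoss()   # assumption; see lead-in
    model.train()
    for _ in range(epochs):
        for images, labels in loader:
            images, labels = images.to(device), labels.to(device)
            optimizer.zero_grad()
            loss = criterion(model(images), labels)
            loss.backward()
            optimizer.step()
    return model.state_dict()           # sent back to the server for aggregation
```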

Fig. 3. Performance on the CheXpert dataset using a ResNet-18 model: the first two rows show the AUC on a test set after each round for six scenarios in the following order (left to right): No Attack, Label Flip 10%, Mix Attack 40%, Noise 40%, Noise and Scaled 40%, and Crafted Attack 40%. The bottom two rows show the normalized weights of each client after each round.

4.3 Poisoning Attacks and Baseline Aggregation Rules

We assume that up to 40% of the clients could be malicious, i.e., at most 4 out of the 10 clients. We consider 5 different attacks on the CheXpert dataset: (i) Label Flip 10% - label-flipping by one client, (ii) Mix Attack 40% - transmission of Gaussian noise by two clients and label-flipping by two clients, (iii) Noise 40% - transmission of Gaussian noise by four clients, (iv) Noise and Scaled 40% - transmission of a mix of Gaussian noise and scaled parameters by four clients, and (v) Crafted 40% - a crafted model attack by four clients based on [11] (with the aggregation rule unknown to the attacker). On the HAM10000 dataset, 2 attacks were considered: (i) Noise 30% - transmission of Gaussian noise by three clients, and (ii) Mix Attack 40% - transmission of Gaussian noise by two clients, parameters scaled by a factor of 100 by one client, and parameters scaled by a factor of \(-0.5\) (directionally opposite to the true parameters) by one client. The robustness of the proposed DOS method was benchmarked against the following aggregation rules: FedAvg [24], Coordinate-Wise-Median [33], Trimmed Mean [33], and Krum [5]. For both datasets, a No Attack scenario was also considered to evaluate the performance of DOS when all the clients are benign (honest).
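
The parameter-level attacks above can be simulated in a few lines, as in this sketch; the unit-variance noise is an assumption, since the paper does not specify the noise scale, and label-flipping acts on the training labels rather than on the transmitted update.

```python
import numpy as np

def poison_update(honest_update, attack):
    """Illustrative parameter-level attacks from the scenarios above."""
    if attack == "noise":            # Gaussian noise attack (scale assumed)
        return np.random.normal(0.0, 1.0, size=honest_update.shape)
    if attack == "scale_pos":        # parameters scaled by a factor of 100
        return 100.0 * honest_update
    if attack == "scale_neg":        # factor -0.5, directionally opposite
        return -0.5 * honest_update
    return honest_update             # benign client
```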

4.4 Results and Discussion

Figure 3 compares all the aggregation rules under six different scenarios on the CheXpert dataset. Table 1 summarizes the average AUC for all the aggregation rules; the DOS rule performs consistently well across all attack scenarios, whereas every other aggregation rule yields lower AUC values in one or more scenarios. Furthermore, Fig. 3 shows that the weights of the malicious clients (shown in red) are almost always close to zero, indicating that the DOS method effectively suppresses the parameter updates from malicious clients. The performance of DOS is also comparable to that of RFA [28], with DOS having a marginal edge when the proportion of malicious clients is higher (40%). Only in the case where 40% of the clients transmitted Gaussian noise did the RFA method attain a significantly lower AUC (0.625) than the DOS method (0.69).

Fig. 4. Performance on the HAM10000 dataset using a custom CNN model. The top-left plot shows the AUC on the test set for DOS aggregation under the three scenarios. The normalized weights of each client after each round are also depicted for the three scenarios.

Figure 4 shows the performance of the DOS rule under three different scenarios on the HAM10000 dataset. In all cases, DOS maintained a high AUC without any drop in overall performance. The main advantage of DOS lies in its ability to account for both Euclidean and cosine distances, thereby addressing both positive and negative scaling. It also successfully detects Gaussian noise and label-flip attacks, leveraging the strengths of the COPOD method. Since DOS combines three complementary components (Euclidean distances, cosine distances, and COPOD-based outlier scoring), it is hard to design model poisoning attacks that successfully circumvent it.

Proportion of Malicious Clients: Since our approach treats malicious weight updates as outliers, the protocol will fail when a majority of clients are malicious. We conducted experiments fixing the number of clients and increasing the proportion of malicious clients from 10% to 60% in steps of 10%. As expected, the DOS approach remained robust as long as the proportion of malicious clients was at most 50% and failed at 60% (e.g., for the HAM10000 dataset, the AUC values were 0.695, 0.697, 0.696, 0.711, 0.710, and 0.554 for 10%, 20%, 30%, 40%, 50%, and 60% corruption, respectively; in comparison, the AUC without any attack was 0.70). This is why we chose a 40% proportion of malicious clients for most of our experiments. Thus, the DOS method is appropriate only in the honest-majority scenario.

Different Number of Clients: We conducted experiments fixing the proportion of malicious clients at 40% and increasing the number of clients from 5 to 40. The AUC values were 0.725, 0.700, 0.692, and 0.674 for 5, 10, 20, and 40 clients, respectively. We observe a minor degradation in AUC as the number of clients increases, along with a need for more rounds to converge. Therefore, the DOS approach may be more suitable for cross-silo FL settings.

Non-IID Setting: We conducted experiments under five different non-iid settings, randomly partitioning the HAM10000 dataset among 10 clients. In all five experiments, the convergence trends and final AUC (between 0.69 and 0.71) were similar to the iid case, showing that the DOS method can work well in the non-iid setting.

4.5 Conclusions and Future Work

We presented Distance-based Outlier Suppression (DOS), a novel robust aggregation rule that performs outlier detection in the distance space to effectively defend against Byzantine failures in FL. We demonstrated the effectiveness and robustness of our method on two real-world medical imaging datasets. In future work, we aim to establish theoretical convergence guarantees and to extend this framework to ensure fairness among FL clients based on the contributions of their local datasets.