1 Introduction

We present a new variant of the Frank-Wolfe (FW) algorithm, FedFW, designed for the increasingly popular Federated Learning (FL) paradigm in machine learning. Consider the following constrained empirical risk minimization template:

$$\begin{aligned} \min _{\textbf{x}\in \mathcal {D}} ~~ F(\textbf{x}):= ~ \frac{1}{n} \sum _{i=1}^n f_i(\textbf{x}), \end{aligned}$$
(1)

where \(\mathcal {D}\subseteq \mathbb {R}^p\) is a convex and compact set. We define the diameter of \(\mathcal {D}\) as \(D:=\max _{\textbf{x}, \textbf{y} \in \mathcal {D}} \Vert \textbf{x}-\textbf{y}\Vert \). The function \(F : \mathbb {R}^p \rightarrow \mathbb {R}\) represents the objective function, and \(f_i : \mathbb {R}^p \rightarrow \mathbb {R}\) (for \(i = 1,\ldots ,n\)) represent the loss functions of the clients, where n is the number of clients. Throughout, we assume \(f_i\) is L-smooth, meaning that it has Lipschitz continuous gradients with parameter L.

FL holds great promise for solving optimization problems over a large network, where clients collaborate under the coordination of a server to find a good common model. Privacy is an explicit goal in FL; clients work together towards a common goal by utilizing their own data without sharing it. As a result, FL exhibits remarkable potential for data science applications involving privacy-sensitive information. Its applications range from learning tasks (such as training neural networks) on mobile devices without sharing personal data [1] to medical applications of machine learning, where hospitals collaborate without sharing sensitive patient information [2].

Most FL algorithms focus on unconstrained optimization problems, and extending these algorithms to handle constrained problems typically requires projection steps. However, in many machine learning applications, the projection cost can create a computational bottleneck, preventing us from solving these problems at a large scale. The FW algorithm [3] has emerged as a preferred method for addressing these problems in machine learning. The main workhorse of the FW algorithm is the linear minimization oracle (LMO),

$$\begin{aligned} \textrm{lmo}(\textbf{y}) := \underset{\textbf{x}\in \mathcal {D}}{\textrm{argmin}} ~ \langle \textbf{y},\textbf{x}\rangle . \end{aligned}$$
(2)

Evaluating linear minimization is generally less computationally expensive than performing the projection step. A famous example illustrating this is the nuclear-norm constraint: projecting onto a nuclear-norm ball often requires computing a full-spectrum singular value decomposition. In contrast, linear minimization involves finding the top singular vector, a task that can be efficiently approximated using methods such as the power method or Lanczos iterations.
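To make the cost contrast concrete, the following minimal NumPy sketch (our illustration; the function names and the `radius` argument are ours, not from the paper) implements both oracles. The full SVD in the nuclear-norm case is for clarity only; at scale it would be replaced by a power or Lanczos method.

```python
import numpy as np

def lmo_l1_ball(y, radius=1.0):
    # argmin_{||x||_1 <= radius} <y, x> is attained at a vertex of the ball:
    # a 1-sparse vector of magnitude `radius` at the largest |y_i|, sign flipped.
    s = np.zeros_like(y, dtype=float)
    i = np.argmax(np.abs(y))
    s[i] = -radius * np.sign(y[i])
    return s

def lmo_nuclear_ball(Y, radius=1.0):
    # Over the nuclear-norm ball, the minimizer is -radius * u_1 v_1^T, the top
    # singular pair of Y. The full SVD below is for clarity; only the leading
    # singular vectors are needed, which power/Lanczos iterations approximate.
    U, _, Vt = np.linalg.svd(Y, full_matrices=False)
    return -radius * np.outer(U[:, 0], Vt[0, :])
```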

To our knowledge, FW has not yet been explored in the context of FL. This paper aims to close this gap. Our primary contribution lies in adapting the FW method for FL with convergence guarantees.

The paper is organized as follows: Sect. 2 provides a brief review of the literature on FL and the FW method. In Sect. 3, we introduce FedFW. Unlike traditional FL methods, FedFW does not overwrite clients’ local models with the global model sent by the server. Instead, it penalizes clients’ loss functions by using the global model. We present the convergence guarantees of FedFW in Sect. 3.1. Specifically, our method provably finds an \(\varepsilon \)-suboptimal solution after \(\mathcal {O}(\varepsilon ^{-2})\) iterations for smooth and convex objective functions (refer to Theorem 1). In the case of non-convex objectives, the complexity increases to \(\mathcal {O}(\varepsilon ^{-3})\) (refer to Theorem 2). Section 4 introduces several design variations of FedFW, including a stochastic variant. Section 5 presents numerical experiments on various machine learning tasks with both convex and non-convex objective functions. Finally, Sect. 6 provides concluding remarks along with a discussion on the limitations of the proposed method. Detailed proofs and technical aspects are deferred to the supplementary material.

2 Related Work

Federated Learning. FL is a distributed learning paradigm that, unlike most traditional distributed settings, focuses on a scenario where only a subset of clients participate in each training round, data is often heterogeneous, and clients can perform different numbers of iterations in each round [4, 5]. FedAvg [4] has been a cornerstone in the FL literature, demonstrating practical capabilities in addressing key concerns such as privacy and security, data heterogeneity, and computational costs. Although fixed points of some FedAvg variants are known not to coincide with the minimizer of the objective function, even for least squares problems [6], and the method can even diverge [7], convergence guarantees for FedAvg have been established under various assumptions (see [8,9,10,11,12,13,14,15] and the references therein). However, all these works on the convergence guarantees of FedAvg are restricted to unconstrained problems.

Constrained or composite optimization problems are ubiquitous in machine learning, often used to impose structural priors such as sparsity or low-rankness. To our knowledge, FedDR [16] and FedDA [17] are the first FL algorithms with convergence guarantees for constrained problems. The former employs Douglas-Rachford splitting, while the latter is based on the dual averaging method [18], to solve composite optimization problems, including constrained problems via indicator functions, within the FL setting. [19] introduced a ‘fast’ variant of FedDA, achieving rapid convergence rates with linear speedup and reduced communication rounds for composite strongly convex problems. FedADMM [20] was proposed for federated composite optimization problems involving a non-convex smooth term and a convex non-smooth term in the objective. Moreover, [21] proposed an FL algorithm based on a proximal augmented Lagrangian approach to address problems with convex functional constraints. None of these works address our problem template, where the constraints are challenging to project onto but allow for an efficient solution to the linear minimization problem.

Frank-Wolfe Algorithm. The FW algorithm, also known as the conditional gradient method or CGM, was initially introduced in [3] to minimize a convex quadratic objective over a polytope, and was extended to general convex objectives and arbitrary convex and compact sets in [22]. Following the seminal works in [23, 24], the method gained popularity in machine learning.

The increasing interest in FW methods for data science applications has led to the development of new results and variants. For example, [25] established convergence guarantees for FW with non-convex objective functions. Additionally, online, stochastic, and variance-reduced variants of FW have been proposed; see [26,27,28,29,30,31] and the references therein. FW has also been combined with smoothing strategies for non-smooth and composite objectives [32,33,34,35], and with augmented Lagrangian methods for problems with affine equality constraints [36, 37]. Furthermore, various design variants of FW, such as the away-step and pairwise step strategies, can offer computational advantages. For a comprehensive overview of FW-type methods and their applications, we refer to [38, 39].

The most closely related methods to our work are the distributed FW variants. However, the variants in [40,41,42] are fundamentally different from FedFW as they require sharing gradient information of the clients with the server or with the neighboring nodes. In FedFW, clients do not share gradients, which is critical for data privacy [43, 44]. Other distributed FW variants are proposed in [45,46,47]. However, the method proposed by [46] is limited to the convex low-rank matrix optimization problem, and the methods in [45, 47] assume that the problem domain is block separable.

3 Federated Frank-Wolfe Algorithm

In essence, any first-order optimization algorithm can be adapted for a simplified federated setting by transmitting local gradients to the server at each iteration. These local gradients can be aggregated to compute the full gradient and distributed back to the clients. Although it is possible to implement the standard FW algorithm in FL this way, this baseline has two major problems. First, it relies on communication at each iteration, which raises scalability concerns, as extending this approach to multiple local steps is not feasible. Second, sharing raw gradients raises privacy concerns, as sensitive information and data points can be inferred with high precision from transmitted gradients [43]. Consequently, most FL algorithms are designed to exchange local models or step-directions rather than gradients. Unfortunately, a simple combination of the FW algorithm with a model aggregation step fails to find a solution to (1), as we demonstrate with a simple counterexample in the supplementary material. Therefore, developing FedFW requires a special algorithmic approach, which we elaborate on below.

We start by rewriting problem (1) in terms of the matrix decision variable \(\textbf{X} := [\textbf{x}_1,\textbf{x}_2,\ldots ,\textbf{x}_n]\), as follows:

$$\begin{aligned} \min _{\textbf{X} \in \mathcal {D}^n} ~ \frac{1}{n} \sum _{i=1}^n f_i(\textbf{X} \textbf{e}_i) + \delta _\mathcal {C}(\textbf{X}). \end{aligned}$$
(3)

Here, \(\textbf{e}_i\) denotes the ith standard unit vector, and \(\delta _\mathcal {C}\) is the indicator function for the consensus set:

$$\begin{aligned} \mathcal {C} := \{[\textbf{x}_1,\ldots ,\textbf{x}_n] \in \mathbb {R}^{p\times n}: \textbf{x}_1 = \textbf{x}_2 = \ldots = \textbf{x}_n \}. \end{aligned}$$
(4)

It is evident that problems (1) and (3) are equivalent. However, the latter formulation represents the local models of the clients as the columns of the matrix \(\textbf{X}\), offering a more explicit representation for FL.

The original FW algorithm is ill-suited for solving problem (3) because the indicator function renders the objective non-smooth. Drawing inspiration from techniques proposed in [33], we adopt a quadratic penalty strategy to address this challenge. The main idea is to perform FW updates on a surrogate objective that replaces the hard constraint \(\delta _{\mathcal {C}}\) with a smooth function penalizing the distance between \(\textbf{X}\) and the consensus set \(\mathcal {C}\):

$$\begin{aligned} \hat{F}_t(\textbf{X}) = \frac{1}{n} \sum _{i=1}^n f_i(\textbf{X} \textbf{e}_i) + \frac{\lambda _t}{2} \textrm{dist}^2(\textbf{X},\mathcal {C}), \end{aligned}$$
(5)

where \(\lambda _t \ge 0\) is the penalty parameter. Note that the surrogate function is parameterized by the iteration counter t, as it is crucial to amplify the impact of the penalty function by gradually increasing \(\lambda _t\) at a specific rate through the iterations. This adjustment will ensure that the generated sequence converges to a solution of the original problem in (3).

To perform an FW update with respect to the surrogate function, first, we need to compute the gradient of \(\hat{F}_t\), given by

$$\begin{aligned} \begin{aligned} \nabla \hat{F}_t(\textbf{X}) & = \frac{1}{n} \sum _{i=1}^n \nabla f_i(\textbf{X} \textbf{e}_i) \textbf{e}_i^\top + \lambda _t (\textbf{X} - \textrm{proj}_\mathcal {C}(\textbf{X})) \\ & = \frac{1}{n} \sum _{i=1}^n \nabla f_i(\textbf{x}_i) \textbf{e}_i^\top + \lambda _t \sum _{i=1}^n (\textbf{x}_i - \bar{\textbf{x}}) \textbf{e}_i^\top \end{aligned} \end{aligned}$$
(6)

where \(\bar{\textbf{x}}:= \frac{1}{n} \sum _{i=1}^n \textbf{x}_i\). Then, we call the linear minimization oracle:

$$\begin{aligned} \textbf{S}^t \in \underset{\textbf{X} \in \mathcal {D}^n}{\textrm{argmin}} ~ \langle \nabla \hat{F}_t(\textbf{X}^t),\textbf{X}\rangle . \end{aligned}$$
(7)

Since \(\mathcal {D}^n\) is separable across the columns of \(\textbf{X}\), we can evaluate (7) in parallel for \(\textbf{x}_1,\textbf{x}_2,\ldots ,\textbf{x}_n\). Define \(\textbf{s}_i^t\) as

$$\begin{aligned} \textbf{s}_i^t \in \underset{\textbf{x}\in \mathcal {D}}{\textrm{argmin}} ~ \langle \frac{1}{n}\nabla f_i(\textbf{x}_i^t) + \lambda _t (\textbf{x}_i^t - \bar{\textbf{x}}^t),\textbf{x}\rangle , \end{aligned}$$
(8)

where \( \bar{\textbf{x}}^t := \frac{1}{n} \sum _{i=1}^n \textbf{x}_i^t\). Then, \(\textbf{S}^t = \sum _{i=1}^n \textbf{s}_i^t \, \textbf{e}_i^\top .\)

Finally, we update the decision variable by \(\textbf{X}^{t+1} = (1-\eta _t) \textbf{X}^t + \eta _t \textbf{S}^t\), which can be computed column-wise in parallel:

$$\begin{aligned} \textbf{x}_i^{t+1} = (1-\eta _t) \textbf{x}_i^t + \eta _t \textbf{s}_i^t, \end{aligned}$$
(9)

where \(\eta _t \in [0,1]\) is the step-size.

This establishes the fundamental update rule for our proposed algorithm, FedFW. Note that communication is required only during the computation of \(\bar{\textbf{x}}^t\), which constitutes our aggregation step. All other computations can be performed locally by the clients. Algorithm 1 presents FedFW and several design variants, which are further detailed in Sect. 4.

Algorithm 1. FedFW: Federated Frank-Wolfe Algorithm (\(+\)variants)
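Since the full pseudocode is given in Algorithm 1, we only illustrate the basic update here. The following minimal NumPy sketch of one FedFW round assumes full participation and a single local step per round; all names are ours, and the per-client loop is sequential only for readability.

```python
import numpy as np

def fedfw_round(X, grads, lmo, t, lam0=1.0):
    """One synchronous FedFW round on the columns of X (Eqs. (8)-(9)).

    X     : p-by-n matrix; column i holds client i's local model x_i^t
    grads : length-n list with grads[i] = grad f_i(x_i^t)
    lmo   : callable evaluating argmin_{x in D} <y, x>
    t     : round counter starting at 1 (so eta_1 = 1)
    """
    _, n = X.shape
    eta = 2.0 / (t + 1)              # step-size schedule of Theorem 1
    lam = lam0 * np.sqrt(t + 1)      # penalty schedule of Theorem 1
    x_bar = X.mean(axis=1)           # server aggregation: the only communication
    X_next = np.empty_like(X)
    for i in range(n):               # executed in parallel by the clients
        g_i = grads[i] / n + lam * (X[:, i] - x_bar)         # penalized gradient, Eq. (8)
        X_next[:, i] = (1 - eta) * X[:, i] + eta * lmo(g_i)  # FW update, Eq. (9)
    return X_next
```

Iterating `X = fedfw_round(X, grads, lmo, t)` over \(t = 1, 2, \ldots\), with `grads` recomputed at the current columns, reproduces the update rule (9) under the schedules of Theorem 1.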

3.1 Convergence Guarantees

This section presents the convergence guarantees of FedFW. We begin with the guarantees for problems with a smooth and convex objective function.

Theorem 1

Consider problem (1) with L-smooth and convex loss functions \(f_i\). Then, the estimate \(\bar{\textbf{x}}^t\) generated by FedFW with step-size \(\eta _t = \frac{2}{t+1}\) and penalty parameter \(\lambda _t = \lambda _0 \sqrt{t+1}\) for any \(\lambda _0 > 0\) satisfies

$$\begin{aligned} F(\bar{\textbf{x}}^t)-F(\textbf{x}^*) \le \mathcal {O}(t^{-1/2}). \end{aligned}$$
(10)

Remark 1

Our proof is inspired by the analysis in [33]. However, a distinction lies in how the guarantees are expressed. In [33], the authors demonstrate the convergence of \(\textbf{x}_i^t\) towards a solution by proving that both the objective residual and the distance to the feasible set converge to zero. In contrast, we establish the convergence of \(\bar{\textbf{x}}^t\), which is a feasible point, focusing only on the objective residual. We present the detailed proof in the supplementary material.

It is worth noting that the convergence guarantees of FedFW are slower compared to those of existing unconstrained or projection-based FL algorithms. For instance, in the smooth convex setting with full gradients, FedAvg [4] achieves a rate of \(\mathcal {O}(t^{-1})\) in the objective residual. In a convex composite problem setting, FedDA [17] converges at a rate of \(\mathcal {O}(t^{-2/3})\). While FedFW guarantees a slower rate of \(\mathcal {O}(t^{-1/2})\), it is important to highlight that FedFW employs cheap linear minimization oracles.

Next, we present the convergence guarantees of FedFW for non-convex problems. For unconstrained non-convex problems, the gradient norm is commonly used as a metric to demonstrate convergence to a stationary point. However, this metric is not suitable for constrained problems, as the gradient may not approach zero if the solution resides on the boundary of the feasible set. To address this, we use the following gap function, standard in FW analysis [25]:

$$\begin{aligned} \textrm{gap}(\textbf{x}) := \max _{\textbf{u} \in \mathcal {D}} ~ \langle \nabla F(\textbf{x}), \textbf{x}-\textbf{u} \rangle . \end{aligned}$$
(11)

It is straightforward to show that \(\textrm{gap}(\textbf{x})\) is non-negative for all \(\textbf{x}\in \mathcal {D}\), and it attains zero if and only if \(\textbf{x}\) is a first-order stationary point of Problem (1).
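In practice, the gap is cheap to evaluate because the maximizer in (11) is itself an LMO output; below is a minimal sketch, assuming an `lmo` callable as in the earlier snippets (names are ours).

```python
def fw_gap(x, grad, lmo):
    # The maximizer u in Eq. (11) is exactly the LMO output for grad F(x),
    # so the gap reuses the same oracle the algorithm already calls:
    #   gap(x) = <grad, x - lmo(grad)>.
    # For the l1 ball of radius r this reduces to <grad, x> + r * ||grad||_inf.
    return float(grad @ (x - lmo(grad)))
```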

Theorem 2

Consider problem (1) with L-smooth loss functions \(f_i\). Suppose the sequence \(\bar{\textbf{x}}^t\) is generated by FedFW with the fixed step-size \(\eta _t=T^{-{2}/{3}}\), and penalty parameter \( \lambda _t=\lambda _0 T^{{1}/{3}}\) for an arbitrary \(\lambda _0 > 0\). Then,

$$\begin{aligned} \min _{1\le t \le T} ~ \textrm{gap}(\bar{\textbf{x}}^t) \le \mathcal {O}(T^{-1/3}). \end{aligned}$$
(12)

Remark 2

We present the proof in the supplementary material. Our analysis introduces a novel approach, as [33] does not explore non-convex problems. While our focus is primarily on problems (1) and (3), our methodology can be used to derive guarantees for the broader setting of minimizing a smooth non-convex function over a convex and compact set subject to affine constraints.

As with our previous results, the convergence rate in the non-convex setting is slower than that of FedAvg, which achieves an \(\mathcal {O}(t^{-1/2})\) rate in the gradient norm (note the distinction between the gradient norm and squared gradient norm metrics). For composite FL problems with a non-convex smooth loss and a convex non-smooth regularizer, FedDR [16] achieves an \(\mathcal {O}(t^{-1/2})\) rate in the norm of a proximal gradient mapping. In contrast, our guarantees are in terms of the Frank-Wolfe (FW) gap. To our knowledge, FedDA does not offer guarantees in the non-convex setting.

3.2 Privacy and Communication Benefits

FedFW offers low communication overhead since the communicated signals are the extreme points of \(\mathcal {D}\), which typically have low-dimensional representations. For example, if \(\mathcal {D}\) is the \(\ell _1\) (resp., nuclear) norm-ball, then the signals \(\textbf{s}_i\) are 1-sparse (resp., rank-one). Additionally, linear minimization is a nonlinear oracle whose inverse is highly ill-posed: retrieving the gradient from the linear minimization output is generally infeasible. For example, if \(\mathcal {D}\) is the \(\ell _1\) norm-ball, then \(\textbf{s}_i\) merely reveals the location and sign of the gradient entry with the largest magnitude. In the case of a box constraint, \(\textbf{s}_i\) only reveals the gradient signs. For the nuclear norm-ball, \(\textbf{s}_i\) unveils only the top singular vectors of the gradient. Furthermore, FW is robust against additive and multiplicative errors in the linear minimization step [24]; consequently, we can introduce noise to augment data privacy without compromising the convergence guarantees.

In a simple numerical experiment, we demonstrate the privacy benefits of communicating linear minimization outputs instead of gradients. This experiment is based on the Deep Leakage algorithm [43] using the CIFAR-100 dataset. Our experiment compares reconstructed images (i.e., leaked data points) obtained from shared gradients versus shared linear minimization outputs, under \(\ell _1\) and \(\ell _2\) norm constraints. Figure 1 displays the final reconstructed images alongside the Peak Signal-to-Noise Ratio (PSNR) across iterations. It is evident that reconstruction via linear minimization oracles, particularly under the \(\ell _1\) ball constraint, is significantly more challenging than from raw gradients.

Fig. 1. Privacy benefits of sharing linear minimization outputs vs gradients. The Deep Leakage Algorithm can recover CIFAR-100 data points from shared gradients. Sharing linear minimization outputs enhances privacy. (a) and (b) compare reconstructions from gradients and LMO outputs with \(\ell _2\) and \(\ell _1\)-norm ball constraints after \(10^5\) iterations for two different data points. (c) and (d) present the reconstruction PSNR as a function of iterations for the corresponding images.

4 Design Variants of FedFW

This section discusses several design variants and extensions of FedFW.

4.1 FedFW with Stochastic Gradients

Consider the following stochastic problem template:

$$\begin{aligned} \min _{\textbf{x}\in \mathcal {D}} ~ F(\textbf{x}):= \frac{1}{n} \sum _{i=1}^{n}\mathbb {E}_{\omega _i} \big [ f_i(\textbf{x}, \omega _i) \big ]. \end{aligned}$$
(13)

Here, \(\omega _i\) is a random variable with an unknown distribution \(\mathcal {P}_i\). The client loss function \(f_i(\textbf{x}) := \mathbb {E}_{\omega _i} \big [ f_i(\textbf{x}, \omega _i) \big ]\) is defined as the expectation over this unknown distribution; hence we cannot compute its gradient. We design FedFW-sto for solving this problem.

We assume that, at each iteration, every participating client can independently draw a sample \(\omega _i^t\) from its distribution \(\mathcal {P}_i\). The stochastic gradient \(\nabla f_i(\textbf{x},\omega _i^t)\) serves as an unbiased estimator of \(\nabla f_i(\textbf{x})\). Additionally, we adopt the standard assumption that the estimator has bounded variance.

Assumption 1

(Bounded variance). Let \(\nabla f_i(\textbf{x}, \omega _i)\) denote the stochastic gradient. We assume it satisfies the following condition for some \(\sigma < \infty \):

$$\begin{aligned} \mathbb {E}_{\omega _i} \Big [ \big \Vert \nabla f_i(\textbf{x}, \omega _i) - \nabla f_i(\textbf{x}) \big \Vert ^2 \Big ] \le \sigma ^2. \end{aligned}$$
(14)

Unfortunately, FW does not readily extend to the stochastic setting by simply replacing the gradient with an unbiased estimator of bounded variance. Instead, adapting FW for stochastic settings generally requires a variance reduction strategy. Inspired by [30, 34], we employ the following averaged gradient estimator to tackle this challenge. We start with \(\textbf{d}_i^0 = \boldsymbol{0}\) and iteratively update

$$\begin{aligned} \textbf{d}_i^{t+1} = (1- \rho _t)\textbf{d}_i^t +\, \rho _t \frac{1}{n} \nabla f_i (\textbf{x}_i^t, \omega _i^t), \end{aligned}$$
(15)

for some \(\rho _t \in (0,1]\). FedFW-sto uses \(\textbf{d}_i^{t+1}\) in place of the gradient in the linear minimization step; pseudocode is shown in Algorithm 1. Although \(\textbf{d}_i^{t+1}\) is not an unbiased estimator, it offers the advantage of reduced variance. The trade-off between bias and variance can be adjusted through \(\rho _t\), and the analysis relies on striking the right balance: reducing the variance sufficiently while keeping the bias within tolerable limits.
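In code, the estimator is a one-line recursion. The sketch below (our naming) plugs the \(\rho _t\) schedule of Theorem 3 into the update (15); note that \(\rho _1 = 1\), so the zero initialization is discarded after the first sample.

```python
def update_direction(d_i, stoch_grad_i, t, n):
    # Averaged gradient estimator of Eq. (15): d_i^{t+1} = (1 - rho_t) d_i^t
    # + rho_t * (1/n) grad f_i(x_i^t, omega_i^t), with the Theorem 3 schedule.
    rho = 4.0 / (t + 7) ** (2.0 / 3.0)   # rho_1 = 4 / 8^(2/3) = 1
    return (1 - rho) * d_i + rho * stoch_grad_i / n
```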

Theorem 3

Consider problem (13) with L-smooth and convex loss functions \(f_i\). Suppose Assumption 1 holds. Then, the sequence \(\bar{\textbf{x}}^t\) generated by FedFW-sto in Algorithm 1, with step-size \(\eta _t = \frac{9}{t+8}\), penalty parameter \(\lambda _t = \lambda _0 \sqrt{t+8}\) for an arbitrary \(\lambda _0 > 0\), and \(\rho _t =\frac{4}{(t+7)^{2/3}}\) satisfies

$$\begin{aligned} \mathbb {E} [ F(\bar{\textbf{x}}^t) ] - F(\textbf{x}^*) \le \mathcal {O}(t^{-1/3}). \end{aligned}$$
(16)

Remark 3

Our analysis in this setting is inspired by [34]; however, we establish the convergence of the feasible point \(\bar{\textbf{x}}^t\). This differs from the guarantees in [34], which demonstrate the convergence of \(\textbf{x}_i^t\) towards a solution by proving that both the expected objective residual and the expected distance to the feasible set converge to zero. We present the detailed proof in the supplementary material.

In the smooth convex stochastic setting, FedAvg achieves a convergence rate of \(\mathcal {O}(t^{-1/2})\). This rate also applies to FedDA when addressing composite convex problems. Additionally, under the assumption of strong convexity, Fast-FedDA [19] achieves an accelerated rate of \(\mathcal {O}(t^{-1})\). In comparison, FedFW-sto converges with \(\mathcal {O}(t^{-1/3})\) rate; however, it benefits from the use of inexpensive linear minimization oracles.

4.2 FedFW with Partial Client Participation

A key challenge in FL is handling random device participation schedules. Unlike a classical distributed optimization scheme, in most FL applications, clients have some autonomy and are not entirely controlled by the server. Due to various factors, such as network congestion or resource constraints, clients may participate in the training process only intermittently. This obstacle can be addressed in FedFW by employing a block-coordinate Frank-Wolfe approach [48]. Given that the domain of problem (3) is block-separable, we can extend our FedFW analysis to block-coordinate updates.

Suppose that in every round t, client i participates in the training procedure with a fixed probability \(\texttt {p}_i \in (0,1]\). For simplicity, we assume the participation rate is the same for all clients, i.e., \(\texttt {p}_1 = \ldots = \texttt {p}_n := \texttt {p}\), but non-uniform participation can be handled similarly. Consider the convex optimization problem described in Theorem 1, now with the random client participation scheme. At round t, the participating clients follow the same procedure as in Algorithm 1, while \(\textbf{x}_{i}^{t+1} = \textbf{x}_{i}^t\) for the non-participants. Then, the estimate \(\bar{\textbf{x}}^t\) generated with the step-size \(\eta _t = \frac{2}{\texttt {p}(t-1)+2}\) and penalty parameter \(\lambda _t = \lambda _0 \sqrt{ \texttt {p}(t-1)+2}\) converges to a solution at the rate

$$\begin{aligned} \mathbb {E}[F(\bar{\textbf{x}}^t) - F(\textbf{x}^*)] \le \mathcal {O}\big ((\texttt {p}\,t)^{-1/2}\big ). \end{aligned}$$
(17)

Similarly, if we consider the non-convex setting of Theorem 2 with randomized client participation, and use the block-coordinate FedFW with step-size \(\eta _t={(\texttt {p}T+1)}^{-\frac{2}{3}}\), and penalty parameter \(\lambda _t=\lambda _0 {(\texttt {p}T+1)}^{\frac{1}{3}}\), we get

$$\begin{aligned} \min _{1\le t \le T} ~ \mathbb {E} [ \textrm{gap}(\bar{\textbf{x}}^t) ] \le \mathcal {O}\big ((\texttt {p}\,T)^{-1/3}\big ). \end{aligned}$$
(18)

The proofs are provided in the supplementary material.
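For illustration, here is a hypothetical sketch of the convex-case scheme, reusing the round structure from Sect. 3 (names are ours; in a real deployment the participation draw is made by the clients, and the server only re-aggregates the models it receives).

```python
import numpy as np

def fedfw_round_partial(X, grads, lmo, t, p, lam0=1.0, rng=None):
    # Block-coordinate FedFW round with Bernoulli(p) participation:
    # participants take the usual step with the rescaled schedules of
    # Sect. 4.2; non-participants keep x_i^{t+1} = x_i^t.
    rng = rng or np.random.default_rng()
    _, n = X.shape
    eta = 2.0 / (p * (t - 1) + 2)
    lam = lam0 * np.sqrt(p * (t - 1) + 2)
    x_bar = X.mean(axis=1)
    X_next = X.copy()                  # non-participants are left unchanged
    for i in range(n):
        if rng.random() < p:           # client i joins this round
            g_i = grads[i] / n + lam * (X[:, i] - x_bar)
            X_next[:, i] = (1 - eta) * X[:, i] + eta * lmo(g_i)
    return X_next
```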

4.3 FedFW with Split Constraints for Stragglers

FL systems are frequently implemented across heterogeneous pools of client hardware, leading to the ‘straggler effect’: delays caused by clients with lower computation or communication speeds. In FedFW, we can mitigate this issue by assigning straggling clients tasks that are better matched to their computational capabilities. Theoretically, this adjustment can be achieved through certain special reformulations of the problem defined in (3). Specifically, the constraint \(\textbf{X}\in \mathcal {D}^n\) can be refined to \(\textbf{X}\in \mathcal {D}_1 \times \cdots \times \mathcal {D}_n\), where \(\bigcap _{i=1}^n \mathcal {D}_i = \mathcal {D}\). This modification does not affect the solution set, due to the consensus constraint.

In general, in FedFW, most of the computation occurs during the linear minimization step. Suppose that the resources of client i are limited, particularly for arithmetic computations. In this case, we can select \(\mathcal {D}_i\) as a superset of \(\mathcal {D}\) over which linear minimization is more straightforward. For instance, a Euclidean (or Frobenius) norm-ball encompassing \(\mathcal {D}\) could be an excellent choice. Then, \(\textbf{s}_i^t\) becomes proportional to the negative of \(\textbf{g}_i^t\), with appropriate normalization based on the radius of \(\mathcal {D}_i\), facilitating computation with minimal effort. On the other hand, if the primary bottleneck is communication, we might opt for a \(\mathcal {D}_i\) characterized by sparse extreme points, such as an \(\ell _1\)-norm ball containing \(\mathcal {D}\), or by low-rank extreme points, such as those of a nuclear-norm ball. This strategy results in sparse (or low-rank) \(\textbf{s}_i^t\), thereby streamlining communication.
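For instance, relaxing \(\mathcal {D}_i\) to a Euclidean ball of radius \(r\) containing \(\mathcal {D}\) yields a closed-form oracle, as in this sketch (our naming):

```python
import numpy as np

def lmo_l2_ball(g, radius):
    # Over a Euclidean ball of radius r containing D, the LMO has the closed
    # form argmin_{||x||_2 <= r} <g, x> = -r * g / ||g||_2, so a straggler
    # replaces an expensive LMO with a single gradient normalization.
    norm = np.linalg.norm(g)
    return -radius * g / norm if norm > 0 else np.zeros_like(g)
```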

4.4 FedFW with Augmented Lagrangian

FedFW employs a quadratic penalty strategy to handle the consensus constraint. We also propose an alternative variant, FedFW+, which is modeled after the augmented Lagrangian strategy in [37]. The pseudocode for FedFW+ is presented in Algorithm 1. We compare the empirical performance of FedFW and FedFW+ in Sect. 5. The theoretical analysis of FedFW+ is omitted here; for further details, we refer readers to [37].

5 Numerical Experiments

In this section, we evaluate and compare the empirical performance of our methods against FedDR, which serves as the baseline algorithm, on the convex multiclass logistic regression (MCLR) problem and the non-convex tasks of training convolutional neural networks (CNN) and deep neural networks (DNN). For each problem, we consider two different choices for the domain \(\mathcal {D}\): namely the \(\smash {\ell _1}\) and \(\smash {\ell _2}\) ball constraints, each with a radius of 10. We assess the models’ performance based on validation accuracy, validation loss, and the Frank-Wolfe gap (11). To evaluate the effect of data heterogeneity, we conducted experiments using both IID and non-IID data distributions across clients. The code for the numerical experiments can be accessed via https://github.com/sourasb05/Federated-Frank-Wolfe.git.

Datasets. We use several datasets in our experiments: MNIST [49], CIFAR-10 [50], EMNIST [51], and a synthetic dataset generated as described in [52]. Specifically, the synthetic data is drawn from a multivariate normal distribution, and the labels are computed using softmax functions. We create data points with 60 features and labels from 10 classes. For all datasets, we consider both IID and non-IID data distributions across the clients. In the non-IID scenario, each user has data from only 3 labels; we follow this rule for the synthetic data as well as MNIST, CIFAR-10, and EMNIST-10. For EMNIST-62, each user has data from 20 classes, with unequal distribution among users.

5.1 Comparison of Algorithms in the Convex Setting

We tested the performance of the algorithms on the strongly convex MCLR problem using the MNIST and CIFAR-10 datasets as well as the synthetic dataset. Table 1 presents the test accuracy results for the algorithms with IID and non-IID data distributions, and for two different choices of \(\mathcal {D}\). In these experiments, we simulated FL with 10 clients, all participating fully (\(\texttt {p}=1\)). We ran the algorithms for 100 communication rounds, with one local iteration per round.

5.2 Comparison of Algorithms in the Non-convex Setting

For the experiments in the non-convex setting, we trained CNNs using the MNIST dataset and a DNN with two hidden layers using the synthetic dataset. We considered an FL system with 10 clients and full participation \((\texttt {p}=1)\). Similar to the previous case, we evaluated IID and non-IID data distributions as well as different choices of \(\mathcal {D}\), and ran the methods for 100 communication rounds with a single local training step. Table 2 summarizes the resulting test accuracies.

5.3 Comparison of Algorithms in the Stochastic Setting

Finally, we compared the performance of FedFW-sto against FedDR in the stochastic setting, where only stochastic gradients are accessible. For this experiment, we consider an FL network of 100 clients with full participation (\(\texttt {p}=1\)). Over this network, we trained the MCLR model using EMNIST-10, EMNIST-62, CIFAR-10, and the synthetic dataset. We used a mini-batch size of 64, one local iteration per communication round, and ran the algorithms for 300 communication rounds. Table 3 summarizes the test accuracies obtained in this experiment. FedFW-sto outperformed FedDR in our experiments in the stochastic setting.

Table 1. Comparison of algorithms on the convex MCLR problem with different datasets and choices of \(\mathcal {D}\). We consider both IID and non-IID data distributions. The numbers represent test accuracy.
Table 2. Comparison of algorithms on the non-convex tasks. We train a CNN using MNIST, and a DNN with synthetic data. We consider IID and non-IID data distributions, and different choices of \(\mathcal {D}\). The numbers show test accuracy.
Table 3. Comparison of algorithms in the stochastic setting on the convex MCLR problem with different datasets and \(\ell _2\) ball constraint. We consider both IID and non-IID data distributions. The numbers represent test accuracy.
Fig. 2. Effect of participation \(\texttt {p}\) on FedFW. The experiment was conducted with MCLR using synthetic data, an \(\ell _1\) constraint, and two different choices of \(\lambda _0\).

Fig. 3. Effect of participation \(\texttt {p}\) on FedFW and FedFW+. We trained a DNN model using synthetic data, an \(\ell _2\) constraint, and a fixed \(\lambda _t = 10^{-3}\).

Fig. 4. Effect of the initial penalty (\(\lambda _0\)) on FedFW. (a) and (b) show the results for the convex setting; (c) and (d) demonstrate the non-convex setting.

5.4 Impact of Hyperparameters

We conclude our experiments with an ablation study to investigate how varying hyperparameters impact the performance of FedFW.

Impact of Partial Participation (\(\texttt {p}\)).  Figure 2 shows the validation accuracy and loss of the FedFW algorithm for the MCLR model on synthetic data. Figure 3 depicts the validation accuracy and loss of the FedFW and FedFW+ algorithms for the DNN model on synthetic data. Both the convex and non-convex experiments show faster convergence for higher participation probability. It is worth mentioning that variations in \(\lambda _0\) do not alter the influence of \(\texttt {p}\). These observations are in accordance with the theoretical guarantees presented in Sect. 4.2.

Impact of Initial Penalty Parameter (\(\lambda _0\)).  Figure 4 illustrates the effect of hyperparameters on the convergence of loss, Frank-Wolfe gap, and validation accuracy of the algorithms. A higher \(\lambda _0\) leads to a larger gap in the initial iterations of the algorithm due to its regularization effect. In other words, increasing \(\lambda _0\) enforces the update direction towards the consensus set, which in turn increases the gap value in the first iteration. The exact expressions for the constants in the convergence guarantees, which are detailed in the supplementary material, can guide the optimal choice of \(\lambda _0\).

6 Conclusions

We introduced an FW-type method for FL and established its theoretical convergence rates. The proposed method, FedFW, guarantees an \(\smash {\mathcal {O}(t^{-1/2})}\) convergence rate when the objective function is smooth and convex. If we remove the convexity assumption, the rate reduces to \(\smash {\mathcal {O}(t^{-1/3})}\). With access to only stochastic gradients, FedFW achieves an \(\smash {\mathcal {O}(t^{-1/3})}\) convergence rate in the convex setting. Additionally, we proposed an empirically faster version of FedFW by incorporating an augmented Lagrangian dual update.

We conclude with a brief discussion on the limitations of our work. The primary limitation of FedFW is its slower convergence rates compared to state-of-the-art FL methods. Developing a tighter bound for FedFW, with multiple local steps, is an area for future research. Additionally, the analysis of FedFW+ is left to future work. Another important piece of future work is the convergence analysis of FedFW-sto for non-convex objectives. Finally, the development and analysis of an extension for asynchronous updates also remain as future work.