
1 Introduction

A large portion of the manufacturing industry follows the Original Equipment Manufacturer (OEM) model. In this model, companies (or aggregators) that design a product typically procure the required components from an available set of OEMs. Foundries like TSMC, UMC, and GlobalFoundries handle the production of components used in a wide range of smart electronic offerings [1]. We also observe a similar trend in the automotive industry [2].

However, aggregators are required to maintain minimum quality assurance for their products while maximizing their revenue. Hence, they must judiciously procure components with the desired quality and cost from the OEMs. For this, aggregators need to learn the quality of the components provided by each OEM. OEM businesses often have numerous agents engaged in procuring the same or similar components. In such a setting, one can employ online learning, where multiple aggregators, henceforth referred to as agents, cooperate to learn the qualities [8, 24]. Further, decentralized (or federated) learning is gaining traction for large-scale applications [20, 33].

In general, an agent needs to procure and utilize components from different OEMs (referred to as producers) to learn their qualities. This learning problem is an instance of the exploration vs. exploitation trade-off, popularly studied as the Multi-armed Bandit (MAB) problem [13, 15]. It requires sequential interactions between the set of producers and the learning agent. Further, we associate qualities, costs, and capacities with the producers for each agent. We model this as a combinatorial multi-armed bandit (CMAB) [5] problem with assured qualities [15]. Our model allows the agents to maximize their revenues by communicating their procurement histories to obtain better estimates of the qualities. Since the agents can benefit from sharing their past quality realizations, we consider them engaged in a federated learning process. Federated MAB often improves performance in terms of regret incurred per agent [16, 25].

Such a federated exploration/exploitation paradigm is not limited to selecting OEMs. It is useful in many other domains, such as stocking warehouses/distribution centres, flow optimization, and product recommendations on e-commerce websites [21, 27]. However, agents are competitive; thus, engaging in federated learning is not straightforward. Agents may be unwilling to share their private experiences, since doing so could work against them. For example, sharing the exact procurement quantities of components specific to certain products can reveal market/sales projections. Thus, maintaining privacy when engaging in federated learning is desirable, and often even necessary. This paper aims to design a privacy-preserving algorithm for federated CMAB with quality assurances.

Our Approach and Contributions. Privacy concerns for sensitive information pose a significant barrier to adopting federated learning. To preserve the privacy of such information, we employ the strong notion of differential privacy (DP) [9]. Note that naive approaches (e.g., the Laplace or Gaussian Noise mechanisms [10]) to achieving DP for CMAB may come at a high privacy cost or outright perform worse than non-federated solutions. Consequently, the primary challenge is to carefully design methods that achieve DP with meaningful privacy guarantees while still performing significantly better than the non-federated counterpart.

To this end, we introduce P-FCB, a Privacy-preserving Federated Combinatorial Bandit algorithm. P-FCB comprises a novel communication protocol through which the agents cooperate in learning the qualities of the producers. Crucially, in P-FCB an agent communicates only within a specific time frame – since it is not beneficial to communicate in (i) earlier rounds (estimates have high error probability) or (ii) later rounds (the value added by communicating is minimal). While communicating in every round reduces per-agent regret, it results in a high privacy loss. P-FCB strikes an effective balance between learning and privacy loss by limiting the number of rounds in which agents communicate. Moreover, to ensure the privacy of the shared information, the agents add calibrated noise to sanitize the information a priori. P-FCB also uses the error bounds generated for UCB exploration [3] to determine whether shared information is worth learning from. Through extensive simulations, we show that P-FCB allows the agents to minimize their regret while ensuring strong privacy guarantees.

In recent times, research has focused on the intersection of MAB and DP [19, 32]. Unlike P-FCB, these works are limited to single-arm selection. To the best of our knowledge, this paper is the first to simultaneously study federated CMAB with assured qualities and privacy constraints. In addition, as opposed to other DP and MAB approaches [8, 12], we consider the sensitivity of attributes specific to a producer-agent pair rather than the sensitivity of general observations. In summary, our contributions in this work are as follows:

  1. We provide a theoretical analysis of the improvement, in terms of regret, in a non-private homogeneous federated CMAB setting (Theorem 1, Sect. 4).

  2. We show that employing privacy techniques naively is not helpful and raises information leak concerns (Claim 1, Sect. 5.2).

  3. We introduce P-FCB to employ privacy techniques practically (Algorithm 1). P-FCB includes selecting the information that needs to be perturbed and defining communication rounds to provide strong privacy guarantees. The communicated information is learned selectively by using error bounds around current estimates. Selective communication helps minimize regret.

  4. We empirically validate, through extensive simulations, P-FCB's improvement in per-agent regret over individual learning, even in a private setting (Sect. 6).

2 Related Work

Multi-armed bandits (MAB) and their variants are a well-studied class of problems [3, 6, 15, 17, 22, 23] that tackle the exploration vs. exploitation trade-off in online learning settings. While the classical MAB problem [3, 28] assumes a single arm pull with stochastic reward generation, our work deals with combinatorial bandits (CMAB) [5, 11, 26, 31], whereby the learning agent pulls a subset of arms. We remark that our single-agent (non-federated) MAB formulation is closely related to the MAB setting considered in [7], but the authors there do not consider federated learning.

Federated MAB. Many existing studies address the MAB problem in a federated setting but restrict themselves to single-arm pulls. The authors in [24, 25] consider a federated extension of the stochastic single-player MAB problem, while Huang et al. [14] consider the linear contextual bandit in a federated setting. Kim et al. [16] specifically consider the federated CMAB setting. However, none of these works address privacy.

Privacy-Preserving MAB. The authors in [19, 32] consider a differentially private MAB setting for a single learning agent, while the works in [4, 18] consider a differentially private federated MAB setting. However, these works focus only on the classical MAB setting, emphasising the communication bottlenecks. There also exist works that deal with the private and federated setting for the contextual bandit problem [8, 12]; however, they do not consider pulling subsets of arms. Further, Hannun et al. [12] consider privacy over the context, while Dubey and Pentland [8] consider privacy over the context and rewards. In contrast, this paper considers privacy over the procurement strategy used.

To the best of our knowledge, we are the first to propose a solution for combinatorial bandits (CMAB) in a federated setting with the associated privacy concerns.

3 Preliminaries

In this section, we formally describe the combinatorial multi-armed bandit setting and its federated extension. We also define differential privacy in our context.

3.1 Federated Combinatorial Multi-Armed Bandits

We consider a combinatorial MAB (CMAB) setting with m producers and n agents. Each producer \(i \in [m]\) has a cost \(c_{ij}\) and capacity \(k_{ij}\) for every agent \(j \in [n]\). At any round \(t \in \{1,2,\ldots ,T\}\), agents procure some quantity of goods from a subset of producers under given constraint(s). We denote the procurement of an agent j by \(\textbf{s}_{j} = (l_{1j}, l_{2j},\ldots ,l_{mj})\), where \(l_{ij} \in [0,k_{ij}]\) is the quantity procured from producer i.

Qualities. Each agent observes a quality realisation for each unit it procures from a producer. Since the quality of a single unit of good may not be easily identifiable, we characterize it as a Bernoulli random variable. The expected realisation of a unit procured from producer i is referred to as its quality, \(q_{i}\). In other words, \(q_{i}\) denotes the probability with which a procured unit of good from producer i has a quality realisation of one. While a producer's cost and capacity vary across agents, its quality is the same for all agents.

Regret. We use \(r_{ij}\) to denote the expected utility gained by agent j from procuring a single unit from producer i, where \(r_{ij} = \rho q_{i} - c_{ij}\) for a proportionality constant \(\rho > 0\). Further, the expected revenue for a procurement vector \(\textbf{s}_{j}\) is given by \(r_{\textbf{s}_{j}} = \sum _{i\in [m]} l_{ij}r_{ij}\).

The goal of each agent is to maximise its revenue under the given constraints. In our setting, we consider the constraint of maintaining a minimum expected quality threshold \(\alpha \) (quality constraint). To measure the performance of a given algorithm A, we use the notion of regret, which signifies the deviation of the algorithm from the procurement set chosen by an Oracle that knows the mean qualities. For any round \(t \in \{1,2,\ldots ,T\}\), the regret for agent j under algorithm A is,

$$\begin{aligned} \mathcal {R}_{Aj}^{t} = {\left\{ \begin{array}{ll} r_{\mathbf {s^{*}_{j}}} - r_{\textbf{s}_{Aj}^{t}}, &{} \text {if } \textbf{s}_{Aj}^{t} \text { satisfies the quality constraint} \\ L, &{} \text {otherwise} \end{array}\right. } \end{aligned}$$

where \(\mathbf {s^{*}_{j}}\) denotes the procurement set chosen by an Oracle with the mean qualities known, and \(\textbf{s}_{Aj}^{t}\) is the set chosen by algorithm A for agent j in round t. \(L = \max _{\textbf{s}}(r_{\mathbf {s^{*}_{j}}} - r_{\textbf{s}})\) is a constant that represents the maximum regret one can incur. The overall regret for algorithm A is given by \(\mathcal {R}_{A} = \sum _{j \in [n]} \sum _{t \in [T]} \mathcal {R}_{Aj}^{t}\).

Federated Regret Ratio (FRR). We introduce FRR to help quantify the reduction in regret brought on by engaging in federated learning. FRR is the ratio of the regret incurred by an agent under a federated learning algorithm A to that incurred when learning individually via a non-federated algorithm NF, i.e., \(FRR = \frac{\mathcal {R}_{A}}{\mathcal {R}_{NF}}\). We believe FRR is a more comprehensive indicator of the utility gained from federated learning than raw regret, since it is a normalised value that enables performance comparisons across different data sets and algorithms.

Observe that \(FRR \approx 1\) indicates that there is not much change in terms of regret from engaging in federated learning. If \(FRR > 1\), it is detrimental to engage in federated learning, whereas \(FRR < 1\) indicates a reduction in regret. When \(FRR \approx 0\), federated learning achieves an almost complete reduction of regret.

In our setting, we consider that agents communicate with each other to improve their regret. In general, however, agents often operate in a competitive setting, and revealing true procurement values can negatively impact them. For instance, learning that a company has been procuring less than it historically does can reveal its strategic plans, devalue its market capital, hinder negotiations, etc. We formalise the notion of privacy used in our setting in the next subsection.

Fig. 1. Overview of the communication model for P-FCB: Agents interact with producers as part of the exploration and exploitation process. Agents also communicate among themselves to learn the qualities of producers. However, they share noisy data to maintain the privacy of their sensitive information.

3.2 Differential Privacy (DP)

As opposed to typical federated models, we assume that the agents in our setting may be competing. Thus, agents will prefer that their sensitive information be preserved. Specifically, the history of procurement quantities \(\textbf{H}_{ij} = (l_{ij}^{t})_{t \in [T]}\) for any producer \(i \in [m]\) is private to agent j. To preserve the privacy of \(\textbf{H}_{ij}\) while still obtaining meaningful utilitarian gains, we use the concept of Differential Privacy (DP). We tweak the standard DP definition in [9, 10] for our setting. For this, let \(\textbf{S}_{j} = (\textbf{s}_{j}^{t})_{t \in [T]}\) be the complete history of procurement vectors for agent j.

Definition 1 (Differential Privacy)

In a federated setting with \(n \ge 2\) agents, a combinatorial MAB algorithm \(A = (A_{j})_{j=1}^{n}\) is said to be \((\epsilon ,\delta ,n)-\)differentially private if for any \(u,v \in [n], s.t., u \ne v\), any \(t_{o}\), any set of adjacent histories \(\textbf{H}_{iu} = (l_{iu}^{t})_{t \in [T]}, \textbf{H}_{iu}^{'} = (l_{iu}^{t})_{t \in [T] \setminus \{t_{o}\}} \ \cup \ \bar{l}_{iu}^{t_{o}}\) for producer i and any complete history of procurement vector \(\textbf{S}_{v}\),

$$\begin{aligned} \Pr (A_{v}(\textbf{H}_{iu}) \in \textbf{S}_{v}) \le e^{\epsilon }\Pr (A_{v}(\textbf{H}_{iu}^{'}) \in \textbf{S}_{v}) + \delta \end{aligned}$$

Our concept of DP in federated CMAB formalizes the idea that an agent's selection of procurement vectors should be insensitive to any single element \(l_{ij}^{t}\) of another agent's procurement history. Note that an agent's selections are not insensitive to its own history.

Typically, the “\(\epsilon \)” parameter is referred to as the privacy budget. The privacy loss variable \(\mathcal {L}\) is often useful for the analysis of DP. More formally, given a randomised mechanism \(\mathcal {M}(\cdot )\) and for any output o, the privacy loss variable is defined as,

$$\begin{aligned} \mathcal {L}^o_{\mathcal {M}(\textbf{H})||\mathcal {M}(\textbf{H}')} = \ln \left( \frac{\Pr [\mathcal {M}(\textbf{H})=o]}{\Pr [\mathcal {M}(\textbf{H}')=o]}\right) . \end{aligned}$$
(1)

Gaussian Noise Mechanism [10]. To ensure DP, a standard technique is to add calibrated noise to the values being communicated; the Gaussian Noise mechanism is a popular choice. Formally, a randomised mechanism \(\mathcal {M}(x)\) satisfies \((\epsilon ,\delta )\)-DP if the agent communicates \(\mathcal {M}(x) \triangleq x + \mathcal {N}\left( 0,\frac{2 \varDelta (x)^{2} \ln (1.25 / \delta )}{\epsilon ^{2}}\right) \). Here, x is the private value to be communicated with sensitivity \(\varDelta (x)\), and \(\mathcal {N}(0,\sigma ^2)\) denotes the Gaussian distribution with mean zero and variance \(\sigma ^2\).
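To make this concrete, the following is a minimal sketch of the mechanism in Python; the function and variable names are ours, purely for illustration.

```python
import numpy as np

def gaussian_mechanism(x, sensitivity, epsilon, delta, rng=None):
    """Return x perturbed with Gaussian noise calibrated for (epsilon, delta)-DP.

    The noise is drawn from N(0, sigma^2) with
    sigma^2 = 2 * sensitivity^2 * ln(1.25 / delta) / epsilon^2.
    """
    rng = rng or np.random.default_rng()
    sigma = np.sqrt(2.0 * sensitivity**2 * np.log(1.25 / delta)) / epsilon
    return x + rng.normal(0.0, sigma)

# Example: perturb a procurement quantity whose sensitivity equals the
# producer-agent capacity (here 50, matching the capacities in Sect. 6).
noisy_w = gaussian_mechanism(x=37.0, sensitivity=50.0, epsilon=1.0, delta=0.01)
```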

In summary, Fig. 1 provides an overview of the model considered. Recall that we aim to design a differentially private algorithm for federated CMAB with assured qualities. Before this, we first highlight the improvement in regret using the federated learning paradigm. Next, we discuss our private algorithm, P-FCB, in Sect. 5.

Fig. 2. Comparing FRR values for Homogeneous and Heterogeneous Federated CMAB (\(n = 10\), \(m = 30\))

4 Non-private Federated Combinatorial Multi-armed Bandits

We now demonstrate the advantage of federated learning in CMAB by highlighting the reduction in regret compared to agents learning individually. We first categorize federated CMAB into the following two settings: (i) homogeneous, where the capacities and costs of the producers are the same across agents; and (ii) heterogeneous, where a producer's capacity and cost vary across agents.

Homogeneous Setting. The core idea for single-agent learning in CMAB involves standard UCB exploration [3]; we consider an Oracle that uses the UCB estimates to return an optimal selection subset. We propose that agents communicate their observations with each other in every round to accelerate learning and obtain tighter error bounds on their quality estimates. In a homogeneous setting, this allows all agents to train a shared model locally without a central planner, since the Oracle algorithm is deterministic. We present the formal algorithm in the extended version [29]. Note that in such a setting, each agent has the same procurement history and the same expected regret.
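As an illustration, the following sketch shows one round of this shared update for a single agent in the homogeneous setting (names and structure are ours; the Oracle is abstracted away, and the exploration bonus follows the confidence radius used in Sect. 5.4):

```python
import math

def shared_round_update(W, Y, reports, t):
    """Fold all agents' round-t reports into the shared statistics.

    W[i], Y[i]: total units procured from, and quality realisations of,
    producer i across all agents so far. reports[i] is a list of
    (procured, realised) pairs, one per agent. Returns UCB estimates.
    """
    ucb = {}
    for i, agent_reports in reports.items():
        for procured, realised in agent_reports:
            W[i] += procured
            Y[i] += realised
        q_hat = Y[i] / W[i]                               # shared quality estimate
        bonus = math.sqrt(3 * math.log(t) / (2 * W[i]))   # UCB exploration bonus
        ucb[i] = q_hat + bonus
    return ucb
```

Since every agent folds in the same reports, all agents hold identical estimates, and a deterministic Oracle then yields identical procurement histories.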

Further, the quality-constraint guarantees for the federated case follow trivially from the single-agent case ([7, Theorem 2]). Additionally, in Theorem 1, we prove that the upper bound for the regret incurred by each agent is \(\mathcal {O}(\frac{\ln (nT)}{n})\); a significant improvement over the \(\mathcal {O}(\ln T)\) regret an agent incurs when learning individually. The formal proof is provided in the extended version [29].

Theorem 1

For Federated CMAB in a homogeneous setting with n agents, if the qualities of the producers satisfy \(\gamma \)-separatedness, then the individual regret incurred by each of the agents is bounded by \(\mathcal {O}(\frac{\ln (nT)}{n})\).

Heterogeneous Setting. In the real world, agents may not always have the same costs and capacities. For such a heterogeneous setting, the regret analysis is analytically challenging. For instance, we can no longer directly apply Hoeffding's inequality, which is needed to prove Theorem 1, since the procurement histories differ across agents. Still, the intuition for regret reduction from cooperative learning carries over.

Even in a heterogeneous setting, communicating observations allows the agents to converge their quality estimates to the mean faster and obtain tighter error bounds. However, even with shared quality estimates, the Oracle may return different procurement vectors for different agents based on their different capacities. Thus, a weighted update of the estimates is essential, and the procurement vector also needs to be communicated.

We empirically demonstrate that federated learning in the heterogeneous setting shows an FRR trend similar to that of the homogeneous setting over 100000 rounds, for two scenarios: (i) costs and qualities sampled from uniform distributions, i.e., \(c_{ij} \sim U[0,1]\), \(q_{i} \sim U[0,1]\); (ii) costs and qualities sampled from normal distributions around the quality threshold, i.e., \(c_{ij} \sim \mathcal {N}(\alpha ,0.1)\), \(q_{i} \sim \mathcal {N}(\alpha ,0.1)\).

Figure 2 depicts the results. We observe that the trends for the homogeneous and heterogeneous settings are quite similar. This shows that, as in the homogeneous setting, employing federated learning reduces regret in the heterogeneous setting as well.

5 P-FCB: Privacy-Preserving Federated Combinatorial Bandit

From Sect. 3.2, recall that we identify the procurement history of an agent-producer pair as the agent’s sensitive information. We believe that the notion of DP w.r.t. the agent-producer procurement history is reasonable. A differentially private solution ensures that the probability with which other agents can distinguish between an agent’s adjacent procurement histories is upper bounded by the privacy budget \(\epsilon \).

Section Outline: In this section, we first argue that naive approaches for DP are not suitable due to their lack of meaningful privacy guarantees. Second, we show that all attributes dependent on the sensitive attribute must be sanitised before sharing to preserve privacy. Third, we define a privacy budget allocation scheme. Fourth, we formally introduce P-FCB, including a selective learning procedure. Last, we provide the \((\epsilon ,\delta )\)-DP guarantees for P-FCB.

5.1 Privacy Budget and Regret Trade-Off

Additive noise mechanisms (e.g., the Gaussian Noise mechanism [10]) are a popular technique for ensuring \((\epsilon ,\delta )\)-DP. To protect the privacy of an agent's procurement history within the DP framework, we could build a naive algorithm for the heterogeneous federated CMAB setting by adding noise to the elements of the procurement vectors being communicated in each round.

However, such a naive approach does not suitably satisfy our privacy needs. By the Basic Composition theorem [10], which adds the \(\epsilon \)s and \(\delta \)s across queries, communicating in every round results in a high overall \(\epsilon \) value, which may not offer much privacy protection in practice [30]. For example, consider agents interacting with the producers for \(10^6\) rounds, with \(\epsilon = 10^{-2}\) for each round in which they communicate perturbed values. By Basic Composition, the overall privacy budget is only bounded by \(\epsilon = 10^4\), which is practically unacceptable. In the worst case, the privacy loss in terms of overall \(\epsilon \) grows linearly with the number of rounds.
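Explicitly, Basic Composition simply adds up the per-round budgets over the \(10^6\) rounds:

$$\begin{aligned} \epsilon _{\text {overall}} \le \sum _{t=1}^{10^{6}} 10^{-2} = 10^{6} \times 10^{-2} = 10^{4}. \end{aligned}$$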

It is also infeasible to solve this problem merely by adding more noise (reducing \(\epsilon \) per round), since communicated values that are too noisy can negatively affect the estimates. This would increase the overall regret to such a degree that it may be better not to learn cooperatively at all. To overcome this challenge, we propose to decrease the number of rounds in which agents communicate information.

Secondly, if the sample size for the local estimates is too small, noise addition can negatively affect the regret incurred. On the other hand, if the sample size is too large, the local estimate already has tight error bounds, and deviating too far from it based on communicated values can likewise increase regret.

When to Learn. Based on the above observations, we propose the following techniques to strike an effective trade-off between the privacy budget and regret.

  1. To limit the growth of \(\epsilon \) over rounds, we propose that communication happens only when the current round number equals a certain threshold (denoted by \(\tau \)), which doubles after each communication round. Thus, there are only \(\log (T)\) communication rounds, and the density of communication rounds decreases over time.

  2. We propose to communicate only within a specific interval of rounds, i.e., for each round \(t \in [\underline{t},\bar{t}]\); no communication occurs outside these rounds. This ensures that agent communication happens only in rounds when it is useful and not detrimental (see the sketch following this list).
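The following sketch combines both techniques into a single communication test; the variable names (tau, t_min, t_max) are ours, with the default window values taken from the experimental setup in Sect. 6:

```python
def should_communicate(t, tau, t_min=200, t_max=40000):
    """Return (communicate?, updated threshold) for round t.

    Communication happens only inside the window [t_min, t_max] and only
    when t hits the current threshold tau, which then doubles. This caps
    the number of communication rounds at O(log T).
    """
    if t_min <= t <= t_max and t == tau:
        return True, 2 * tau
    return False, tau

# With tau initialised to t_min, agents communicate at rounds
# 200, 400, 800, ..., stopping once the doubling threshold exceeds t_max.
```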

5.2 Additional Information Leak with Actual Quality Estimates and Noisy Weights

It is also important to carefully evaluate how data is communicated in each round, since it may lead to privacy leaks. For example, suppose all agents communicate their local estimates of the producer qualities along with a perturbation of the total number of units procured from each producer to arrive at those estimates. We now formally analyse the additional information leak in this case. W.l.o.g., our analysis considers an arbitrarily picked producer \(i \in [m]\) and agent \(j \in [n]\); as such, we omit the subscripts "i" for the producer and "j" for the agent. We first set up the required notation as follows.

Notations: Let \(\hat{q}^t, W^t\) denote the true values of the empirical quality estimate and the total quantity procured up to round t (not including t). Next, let \(\tilde{W}^t\) denote the noisy value of \(W^t\) (with the noise added using any additive noise mechanism for DP [10]). Let \(w^t\) be the quantity procured in round t. Last, let \(\hat{q}^{obsv_t}\) denote the quality estimate based on round t alone. With this notation, \(\hat{q}^{t+1}\) for the successive round \(t+1\) is computed as: \(\hat{q}^{t+1} = \frac{W^t \times \hat{q}^t + w^t \times \hat{q}^{obsv_t}}{W^t + w^t}\).

Claim 1

Given \(\hat{q}^t, W^t, \tilde{W}^t, w^t\) and \(\hat{q}^{obsv_t}\), the privacy loss variable \(\mathcal {L}\) is not defined if \(\hat{q}^t\) is not also perturbed.

We present the formal proof in the extended version [29]. With Claim 1, we show that \(\epsilon \) may not be bounded even after sanitising the sensitive data, due to its dependence on other, non-private communicated data. This is because the local mean estimates are a function of the procurement vectors and the observation vectors. Thus, perturbing only some of the communicated values is insufficient; every communicated quantity that depends on the sensitive data must be sanitised.
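The intuition can be seen by rearranging the update rule for \(\hat{q}^{t+1}\) given above (this is only an illustration; the formal proof is in [29]):

$$\begin{aligned} W^{t} = \frac{w^{t}\left( \hat{q}^{obsv_t} - \hat{q}^{t+1}\right) }{\hat{q}^{t+1} - \hat{q}^{t}} \end{aligned}$$

That is, whenever \(\hat{q}^{t+1} \ne \hat{q}^{t}\), the exact estimates together with \(w^t\) and \(\hat{q}^{obsv_t}\) reveal \(W^t\) deterministically, so the noise added to \(\tilde{W}^{t}\) offers no protection and the ratio in Eq. (1) is unbounded.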

We propose that, whenever communication happens, only the procurement and observation values accumulated since the last communication are shared. Additionally, to communicate weighted quality estimates, we use the Gaussian Noise mechanism to add noise to both the procurement values and the realisation values. The sensitivity (\(\varDelta \)) for noise sampling equals the capacity of the producer-agent pair.

Algorithm 1. P-FCB

5.3 Privacy Budget Allocation

Since the estimates are more sensitive to noise addition when the sample size is smaller, we propose using a monotonically decreasing privacy budget for noise generation. Formally, let the total privacy budget be denoted by \(\epsilon \), with \((\epsilon ^{1},\epsilon ^{2},\ldots )\) the privacy budgets for communication rounds \((1,2,\ldots )\). Then, we have \(\epsilon ^{1}> \epsilon ^{2} > \ldots \). Specifically, we denote by \(\epsilon ^{z}\) the privacy budget in the \(z^{th}\) communication round, where \( \epsilon ^z \longleftarrow \frac{\epsilon }{2 \times \log (T)} + \frac{\epsilon }{2^{z+1}}\).
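This schedule is consistent with the total budget \(\epsilon \): summing over the at most \(\log (T)\) communication rounds,

$$\begin{aligned} \sum _{z=1}^{\log (T)} \epsilon ^{z} = \sum _{z=1}^{\log (T)} \left( \frac{\epsilon }{2\log (T)} + \frac{\epsilon }{2^{z+1}}\right) \le \frac{\epsilon }{2} + \frac{\epsilon }{2} = \epsilon . \end{aligned}$$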

5.4 P-FCB: Algorithm

Based on the analysis in the previous subsections, we now present P-FCB, a private federated CMAB algorithm for the heterogeneous setting. Algorithm 1 formally presents P-FCB; details follow.

Algorithm 1 Outline. The rounds are split into two phases. During the initial pure exploration phase (Lines 6–22), the agents explore all the producers by procuring evenly from each of them; the length of this phase is carried over from the non-private algorithm. In the second phase, explore-exploit (Lines 23–38), the agents calculate the UCBs for their quality estimates. The Oracle then provides a procurement vector based on the costs, capacities, and UCB values, as well as the quality constraint (\(\alpha \)). Additionally, the agents communicate their estimates as outlined in Sects. 5.1 and 5.2. At the end of each round, the agents update their quality estimates using the procurement and observation values, both local and communicated (Lines 19 and 36):

$$\begin{aligned} \begin{gathered} w_{i,j}^{t+1} \longleftarrow w_{i,j}^{t} + l_{i,j}^{t} ~;~ W_{i,j}^{t+1} \longleftarrow W_{i,j}^{t} + l_{i,j}^{t} \\ y_{i,j}^{t+1} \longleftarrow y_{i,j}^{t} + x_{i,j}^{t} ~;~ Y_{i,j}^{t+1} \longleftarrow Y_{i,j}^{t} + x_{i,j}^{t} \\ q_{i,j}^{t+1} \longleftarrow \frac{Y_{i,j}^{t+1}}{W_{i,j}^{t+1}} \end{gathered} \end{aligned}$$
(2)
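A minimal sketch of this update for one agent-producer pair follows; variable names mirror Eq. (2), and this is our illustration rather than the authors' reference implementation:

```python
def eq2_update(w, W, y, Y, procured, realised):
    """Apply the Eq. (2) update for one producer after round t.

    w, y: procurement / realisations accumulated since the last
    communication (these are what get perturbed and shared).
    W, Y: lifetime procurement / realisations (used for the estimate).
    """
    w += procured    # l_{i,j}^t
    W += procured
    y += realised    # x_{i,j}^t
    Y += realised
    q_hat = Y / W    # updated empirical quality estimate
    return w, W, y, Y, q_hat
```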

Noise Addition. As discussed in Sect. 5.2, we perturb both the uncommunicated procurement and realization values for each agent-producer pair using the Gaussian Noise mechanism. Formally, let \(w_{i,j}^{t}, y_{i,j}^{t}\) be the uncommunicated procurement and realization values. Then, the communicated values \(\tilde{w}_{i,j}, \tilde{y}_{i,j}\) are calculated using the following privatizer,

$$\begin{aligned} \tilde{w}_{i,j} = w_{i,j}^{t} + \mathcal {N}\left( 0,\frac{2 k_{i,j}^{2} \ln (1.25 / \delta )}{(\epsilon ^{z})^{2}}\right) \end{aligned}$$
(3)
$$\begin{aligned} \tilde{y}_{i,j} = y_{i,j}^{t} + \mathcal {N}\left( 0,\frac{2 k_{i,j}^{2} \ln (1.25 / \delta )}{(\epsilon ^{z})^{2}}\right) \end{aligned}$$
(4)

where \(\epsilon ^{z}\) is the privacy budget corresponding to the \(z^{th}\) communication round.

What to Learn. To minimise the regret incurred, we propose that the agents selectively choose which communications to learn from. Weighted confidence bounds around the local estimates are used to determine whether a communication round should be learned from. Let \(\xi _{i,j}^{t} = \sqrt{\frac{3 \ln (t)}{2\sum _{z \in \{1,2,\ldots ,t\}} l_{i,j}^{z} }}\) denote the confidence-interval radius agent j has w.r.t. the local quality estimate of producer i. Then, an agent only elects to learn from a communication if \(\hat{q}_{i,j}^{t} - \omega _{1} \xi _{i,j}^{t}< q_{(communicated)i,j} < \hat{q}_{i,j}^{t} + \omega _{1} \xi _{i,j}^{t}\), where \(\omega _{1}\) is a weight factor and \(q_{(communicated)i,j} = \frac{\tilde{y}_{i,j}}{\tilde{w}_{i,j}}\).
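This acceptance test can be sketched as follows; the names are ours, n_local stands for \(\sum _{z} l_{i,j}^{z}\), and the non-positivity guard is our addition (additive noise can make the communicated count non-positive):

```python
import math

def accept_communication(q_local, n_local, t, w_tilde, y_tilde, omega1):
    """Learn from a communicated (noisy) report only if its implied quality
    estimate lies within omega1 times the local confidence radius."""
    if w_tilde <= 0:      # guard (our addition): noise may yield a non-positive count
        return False
    xi = math.sqrt(3 * math.log(t) / (2 * n_local))  # local confidence radius
    q_comm = y_tilde / w_tilde                       # communicated quality estimate
    return abs(q_comm - q_local) < omega1 * xi
```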

For calculating the overall estimates, local observations are weighted more heavily than communicated observations. Specifically, \(\omega _{2} \in [0,1]\) is taken as the weighting factor for communicated observations.

5.5 P-FCB: \((\epsilon ,\delta )\)-DP Guarantees

In each communication round, we perturb the values being communicated by adding Gaussian noise satisfying \((\epsilon ',\delta ')\)-DP, a standard practice for providing DP guarantees for group sum queries. Let \(\mathcal {M}\) be a randomised mechanism which outputs the sum of values for a database input d using Gaussian noise addition. Since the Oracle is deterministic, each communication round can be considered a post-processing of \(\mathcal {M}\), whereby a subset of the procurement history is the database input. Thus, individual communication rounds satisfy \((\epsilon ',\delta ')\)-DP.

The distinct subsets of procurement history used in the different communication rounds can be considered independent DP mechanisms. Using the Basic Composition theorem, we can thus compute the overall \((\epsilon ,\delta )\)-DP guarantee. In P-FCB, we use a target privacy budget \(\epsilon \) to determine the noise parameter \(\sigma \) in each round based on Basic Composition; \(\epsilon \) can therefore be leveraged as a tuning parameter for the privacy/regret trade-off.

6 Experimental Results

In this section, we compare P-FCB with non-federated and non-private approaches for the combinatorial bandit (CMAB) setting with constraints. We first explain the experimental setup, then note our observations and analyze the results obtained.

Fig. 3. EXP1: Regret comparison across rounds (\(n=10\), \(m=30\))

Fig. 4. EXP2: FRR for P-FCB while varying privacy budget \(\epsilon \) (with \(n=10\), \(m=30\), \(t=100000\))

Fig. 5. EXP3: Average regret per agent with P-FCB by varying the number of learners n (with \(\epsilon =1\), \(t=100000\))

6.1 Setup

For our setting, we generate costs and qualities for the producers from: (a) uniform distributions, i.e., \(q_{i},c_{ij} \sim U[0,1]\); (b) normal distributions, i.e., \(q_{i},c_{ij} \sim \mathcal {N}(\alpha ,0.1)\). For both cases, the capacities are sampled from a uniform distribution, \(k_{ij} \sim U[1,50]\). We use the following tuning parameters in our experiments: \(\alpha = 0.4\), \(\delta = 0.01\) (i.e., \(\delta < 1/n\)), \(\underline{t} = 200\), \(\bar{t} = 40000\), \(\omega _{1} = 0.1\), \(\omega _{2} = 10\). For our Oracle, we deploy the Greedy SSA algorithm presented in Deva et al. [7]. Further, to compare P-FCB's performance, we construct the following two non-private baselines:

  1. Non-Federated. We use the single-agent algorithm for subset selection under constraints proposed in Deva et al. [7]. It follows UCB exploration similar to P-FCB but omits communication with other agents.

  2. FCB. This is the non-private variant of P-FCB. That is, instead of communicating \(\tilde{w}_{ij}\) and \(\tilde{y}_{ij}\), the true values \(w_{ij}^{t}\) and \(y_{ij}^{t}\) are communicated.

We perform the following experiments to measure P-FCB’s performance:

  • EXP1: For fixed \(n=10\), \(m=30\), we observe the regret growth over rounds (t) and compare it to non-federated and non-private federated settings.

  • EXP2: For fixed \(n=10\), \(m=30\), we observe FRR at \(t=100000\) while varying \(\epsilon \), to study the regret variance w.r.t. the privacy budget.

  • EXP3: For fixed \(\epsilon = 1\), \(m=30\), we observe the average regret at \(t=100000\) while varying n, to study the effect of the number of communicating agents.

For EXP1 and EXP2, we generate 5 instances by sampling costs and qualities from both the uniform and normal distributions. Each instance is simulated 20 times, and we report the corresponding average values across all instances. Likewise, for EXP3, we consider instances with the same producer quality values but with costs and capacities defined for different numbers of learners. For each instance, we average across 20 simulations.

6.2 Results

  • EXP1. P-FCB shows a significant improvement in terms of regret (Fig. 3) at the cost of a relatively low privacy budget. Compared to FCB, P-FCB (\(\epsilon = 1\)) and Non-federated incur \(136\%\) and \(233\%\) more regret, respectively, for uniform sampling, and \(235\%\) and \(394\%\) more regret, respectively, for normal sampling. This validates the efficacy of P-FCB.

  • EXP2. We study the performance of the algorithm with respect to the privacy budget (Fig. 4). As expected, the regret decreases as the privacy budget increases, since a larger budget means less noise in the communicated data. This decrease in regret is sub-linear in \(\epsilon \).

  • EXP3. We see (Fig. 5) an approximately linear decrease in per-agent regret as the number of learning agents increases. This reinforces that the reduction in regret from engaging in federated learning, suggested in Sect. 4, also holds in a heterogeneous private setting.

Discussion: Our experiments demonstrate that P-FCB, through selective learning in a federated setting, achieves a fair trade-off between regret and privacy. P-FCB achieves a reduction in regret (compared to the non-federated setting) even for low privacy budgets.

With regard to hyperparameters, note that a lower \(\omega _{2}\) suggests tighter bounds while selecting what to learn, implying higher confidence in the usefulness of the communicated data. Thus, larger values for \(\omega _{1}\) can be used if \(\omega _{2}\) is decreased. In general, our results indicate that it is optimal to maintain the value of \(\omega _{1}\cdot \omega _{2}\) used in our experiments. Also, the communication start time \(\underline{t}\) should be chosen such that the sampled noise is at least a magnitude smaller than the accumulated uncommunicated data (e.g., \(\underline{t}\approx 200\)); this ensures that the noisy data is not detrimental to the learning process.

The DP-ML literature suggests a privacy budget \(\epsilon <1\) [30]. From Fig. 4, we note that P-FCB performs well within this privacy budget. While our results achieve a fair regret/privacy trade-off, one can further fine-tune these hyperparameters in the future through additional experimentation and/or theoretical analysis.

7 Conclusion and Future Work

This paper focuses on learning agents which interact with the same set of producers ("arms") and engage in federated learning while maintaining privacy regarding their procurement strategies. We first looked at a non-private setting where the producers' costs and capacities were the same across all agents and provided theoretical guarantees on the improvement due to federated learning. We then showed that extending this to a heterogeneous private setting is non-trivial and could lead to potential information leaks. We proposed P-FCB, which uses UCB-based exploration while communicating estimates perturbed via the Gaussian Noise mechanism to ensure differential privacy. We defined a communication protocol and a selective learning process using error bounds, which together provide a meaningful balance between regret and privacy budget. We empirically showed a notable improvement in regret compared to individual learning, even for considerably small privacy budgets.

Looking at problems where agents do not share exact sets of producers but rather have overlapping subsets of available producers would be an interesting direction to explore. It is also possible to extend our work by providing theoretical upper bounds for regret in a differentially private setting. In general, we believe that the idea of when to learn and when not to learn from others in federated settings should lead to many interesting works.