Keywords

1 Introduction

With the rapid penetration of Internet and Smartphone through the crowded, large-scale collection of user data is already a necessary daily activity for companies. User data has become one of the most important asset, which can give support to data scientists to discover new patterns and provide training examples for machine learning models. However, this comes with huge risks–can these companies protect users’ sensitive data from malicious access? Disclosure may violate the users’ privacy and lead to scandal.

Anonymization techniques are one common method to protect user privacy by blurring the personalized or identifiable information, but it’s vulnerable to the de-anonymization attack as shown in the case of Netflix Price [15]. For the above scenario, Differential Privacy [6, 7] successfully achieved releasing sanitized datasets, but not at the client level.

Local Differential Privacy (LDP) [13] is a branch of DP, which gives the benefits of population-level statistics without the collection of raw private data. With LDP, service provider can get statistical information on all users by just aggregating users’ noised report. This feature makes LDP widely used in real-world scenarios. For example, Google’s use LDP scheme RAPPOR to constantly collect the home-page that all users like to set up; Apple announced its implementation of LDP in iOS 10 and MacOS in WWDC 2016; Microsoft also deploys a LDP-enabled data collection mechanism in Windows Insiders program to collect application usage statistics.

LDP’s security parameter (privacy regime \(\epsilon \)) represents the security level of its randomization process. The bigger (or smaller) the security parameter, the more (or less) availability of the noisy report that the user shares. However, most LDP schemes assume that each user has the same \(\epsilon \), hence each user uses the exact same randomized procedure to generate a noisy report from their own data. There have been complaints that deployed LDP schemes use higher values of \(\epsilon \) while users are not given any choice. So it is questionable that \(\epsilon \) is entirely determined by the service provider who wants more availability of the data.

Multiple and Personalized Privacy Regimes. In order to meet the privacy demands of different people, we argue that users should be allowed to set their overall privacy levels (e.g., low/moderate/high) independently. Here, we assume that this personalization of privacy regimes does not mean the user should set a new \(\epsilon \) every time he shares data. Instead, users will set an infrequently changing \(\epsilon \), which will be consumed a fixed percentage every time users share data with LDP. This assumption is derived from a study [17], which suggests that Apple’s deployment [2] for LDP has an overall privacy regime as high as 16 everyday and sets privacy consumption to 1 or 2 each time shares data while there is no transparency.

In this paper, we consider that simple LDP mining task only contains one data collection activity, and the common simple mining task includes frequency estimation, mean value estimation, heavy-hitters identification and so on. So even if users have the same overall privacy regime \(\epsilon \), multiple \(\epsilon \) may still appear in one mining task because the number of times that data is shared is different. For example, it could involve the real-time sharing of trajectory data and is not the focus of our analysis. We present the above personalization process in Fig. 1(a).

In addition, when LDP handles complex mining task like “Frequent Itemset Mining” [20] which contains several rounds of data collection activities, it is common to randomly assign all users to several groups, and then different groups finish each step of the mining task respectively with the same \(\epsilon \). However, some groups of users only need to pay a small amount of privacy to complete easy step (e.g., frequency estimation in small candidate space), the remaining can be assigned to challenging steps by segmentation as shown in Fig. 1(b). Combining the above two points, it’s urgent to deal with multiple privacy regimes under LDP.

Fig. 1.
figure 1

Application scenarios for multiple privacy regimes.

In this paper, we assume that users may have different privacy acceptances for personalization: Once a user sets his overall privacy regime \(\epsilon \), he does not change this value frequently and his participation in any tasks will automatically consume a certain proportion of this \(\epsilon \) until it’s used up; Since the collection behavior is usually long-term and the user’s privacy data may change (e.g., web pages visited), users’ specific choices of privacy regimes are not related to the value of the private data.

As far as we know, frequency estimation is the most basic LDP mining task, so it is meaningful to apply it under the mechanism of multiple privacy regimes. An obvious method of frequency estimation in this scenario, is to divide users with the same \(\epsilon \) into the same group, and estimate frequency for different groups separately. In the end, service provider aggregates each group’s estimated value by weighting. This is discussed in Sect. 3.

Contributions

  • We propose MLE method which applies the idea of parameter estimation to obtain an optimal estimate from user groups with different privacy regimes. Our theoretical analysis shows that the accuracy of MLE can be equivalent to tradition method which forces all users to choose one specific privacy regime, and this equivalence shows that MLE is practical.

  • We propose S-MLE method to handle the situation when all users’ privacy regimes are subject to continuous distribution in one mining task. It uses a predefined \(\beta \) parameter to segment the contiguous \(\epsilon \) on the basis of MLE. The experiments show that the \(\beta \) parameter of S-MLE may greatly influences the accuracy under certain conditions.

  • We propose Adapt-MLE method which encompasses different LDP schemes for multiple privacy regimes. And the performance of Adapt-MLE is better than that of MLE, especially when the discrete values of \(\epsilon \) of different users span a wide range.

2 Preliminaries and Notations

We assume that all users are willing to share their information to help service provider update its statistical information. For the sake of privacy, each user perturbs his own data by advanced technique (via LDP) with different demands for security. The service provider aims to find out the frequencies of values among the population. Such a process involves the following preliminaries.

2.1 Local Differential Privacy

Definition 1

(Local Differential Privacy). An algorithm A satisfies \(\epsilon \)-local differential privacy (\(\epsilon \)-LDP) where \(\epsilon > 0\), if and only if for any input \(v_1\) and \(v_2\), we have \(\forall y \in Range(A)\),

$$\begin{aligned} \frac{Pr(A(v_1)\in y)}{Pr(A(v_2) \in y)} \le e^\epsilon , \end{aligned}$$

where Range(A) denotes the set of all possible outputs of the algorithm A.

When privacy regime \(\epsilon \) is small, the adversary can’t identify the true value from the noise version reliably. The basic core of algorithm A is Randomized response (RR) [21], which is a statistical technique used for collecting social embarrassing questions.

2.2 Frequency Estimation in LDP

Let \(\varvec{f}=(f_1,...,f_k)\) be a probability distribution on a set containing k candidate values, \(f_1\) is the true frequency of \(v_1\) and the sum of \(\varvec{f}\) is 1. We can consider that \(\varvec{f}\) is the proportion of N users choosing different values. Using LDP allows the server to obtain an estimate \(\widehat{\varvec{f}} = (\widehat{f_1},...,\widehat{f_k}) \) without obtaining the user’s original data.

Each of N users holds a value \(v_j\) taken from the above k candidates and shares this \(v_j\) to the service provider in a LDP manner. In the beginning, user need to encode the \(v_j\) to a specific format, \(x_j = E(v_j)\), then select a parameter \(\epsilon \) t to obtain \(y_j\) by randomization. Finally, service provider aggregates N data records \(({y_1,...,y_N})\) and figures out \(\widehat{\varvec{f}}\).

2.3 General LDP Schemes

Randomization mechanisms that satisfy LDP have been widely studied in recent years, our work involves the following very common schemes.

kRR (k-ary Randomized Response) [12] is a generalization of binary randomized response (RR) which performs well in low privacy level.

Base RAPPOR is simplest configuration of RAPPOR [9] which has been widely accepted. It has a output alphabet \(\upsilon = \{0, 1\}^k\) of size \(2^k\). It first maps \(v_i (1\le i \le k)\) onto \(e_i \in {0,1}^k\), where \(x = e_i\) is the i-th standard basis vector. At last a length-k binary vector y is generated from x by \(y = RR(x)\), RR here can be seen with a pair of alterable probability p and q.

$$\begin{aligned} \begin{aligned} Pr(y[i] = 1 | x[i]=1) = p;~~Pr(y[i] = 1 | x[i]=0) = q \end{aligned} \end{aligned}$$
(1)

Base Rappor sets p to \(e^{\epsilon /2}/(e^{\epsilon /2} + 1)\) and q to \(1-p\). For the fact that Base RAPPOR is classical and is the basis of many other schemes, we mainly use it to analyze the scenario of multiple privacy regimes in Sects. 3 and 4.

Optimal Scheme [22] and OLH [18] are two similar schemes, and they are all obtained by optimizing the probability of RR in the Base RAPPOR (Change pq in Eq. 1), the difference is that the latter is evolved from SH [3] and reduces communication cost by hash method.

At Last, frequency estimation formulas of these schemes are all

$$\begin{aligned} \hat{f_i}=\frac{C(i) - q*N}{(p-q)*N} \end{aligned}$$

where C(i) is the count of reported vector which has the i’th bit being 1. From OLH, we get the following properties.

Lemma 1

For LDP scheme which uses RR with probability p and q, the frequency estimation \(\hat{f_i}\) is an unbiased estimate of \(f_i\), and its variance is

$$\begin{aligned} var(\hat{f_i} )=\frac{(f_i*p+(1-f_i )*q)*(1-f_i*p-(1-f_i )*q)}{N(p-q)^2} \end{aligned}$$

Employing Base rappor’s settings for p and q and taking \(e^{\epsilon /2}> 1>> f_i\) into account, Base rappor’s variance is as follows:

$$\begin{aligned} var(\hat{f_i} )= \frac{e^{\epsilon /2}}{N (e^{\epsilon /2} -1)^2} \end{aligned}$$
(2)
Table 1. Notations

3 Problem Formulation

In this paper, we consider the specific LDP problem that users specific choices of privacy level will lead to multiple \(\epsilon \) in one mining task. In this section, we first introduce a primary method called Raw-PCE (Personalized Count Estimation) which can be considered a simplified version of PCE [5] proposed for the handling spatial data aggregation. Compared with the original PCE scheme, only the customization of privacy regime is preserved in Raw-PCE while the customization of other factors are omitted. Then we analyze the estimate of Raw-PCE and get a new probability model (Table 1).

3.1 A Multiple Privacy Regime Scheme: Raw-PCE

Raw-PCE is an intuitive and primitive method handling this scenario. Suppose there are totally M different privacy regimes namely \(\epsilon _1\), \(\epsilon _2\), ...,\(\epsilon _M\).

Firstly, the service provider groups the users according to their personalized privacy regimes which results in totally M groups.

Then, each group with the same privacy regime independently generate its frequency estimate vector, the estimate vector generated by group m is denoted as \(\hat{\varvec{f{(m)}}}\) \(=(\hat{f{(m)}_1}, \hat{f{(m)}_2}, ...,\hat{f{(m)}_k})\). Totally M estimates are generated. Without loss of generality, here we consider only one candidate value v to simplify the problem. The M estimates for value v can be denoted as \({\hat{f_{(1)}}, \hat{f_{(2)}}, ...,\hat{f_{(M)}}}\).

Finally, if there is no other auxiliary information, the way in which the estimated value \(\hat{f}\) is calculated by Raw-PCE is as follows:

$$\begin{aligned} \hat{f}= \sum _{m=1}^{M}\hat{f_{(m)}}\alpha _{(m)} \end{aligned}$$
(3)

where \(\alpha (\cdot )\) represents the weights of each group’s estimation. Raw-PCE here ignores the fact that every estimate’s accuracy is different, and it just takes the size of each group as the weights where \(\alpha _m=n_m/N\). Combining Eqs. 2 and 3, we have:

Lemma 2

In Base RAPPOR scheme, estimation of Raw-PCE has the variance as follows,

$$\begin{aligned} var(\hat{f}) = \sum _{m=1}^{M} \frac{n_m^2{e^{\epsilon _{m}/2}}}{{(e^{\epsilon _{m}/2} -1)^2} * N^3} \end{aligned}$$

The above formula shows that the error is cumulative, whenever there is an estimation with large error among M estimations, final errors are greatly increased which is rather unacceptable.

Fig. 2.
figure 2

The true frequency of \({v_{i}}(i\in [k]\)) is 0, and there are three estimations of \(f_i\) which are drawn from three normal distribution.

Fig. 3.
figure 3

S-MLE: dividing continuous \(\epsilon \) into discrete values and the size of each shadow area is \(\beta \).

3.2 Probabilistic Model of Multiple Regimes Setting

Although users are divided into different groups with different \(\epsilon \), their choices of privacy regimes can be considered as irrelevant to the choices of their favourite items, which means user’s possibility of choosing item v is the same in all M groups.

After the randomization process, the reported number C (C(i) is introduced in Sect. 2.3 and here we omit i) is a random variable from a binomial distribution, namely . We use \(p'\) to denote \(pf + q(1-f)\). Furthermore, N is usually big enough to ensure normal approximation and we have is a normalization of \(C_{(m)}\) and follows a Gaussian distribution.

Then all M groups follow M different normal distributions, which share the same expectations but have different variances due to the different values of \(\epsilon \). Therefore, as Fig. 2 shows, the M estimates generated by M groups, respectively \({\hat{f_{(1)}}, \hat{f_{(2)}}, ...,\hat{f_{(M)}}}\), can be regarded as M random samplings of the actual user proportion \({{f_v}}\), each of which follows a unique normal distribution.

The problem of multiple privacy regimes can be regarded as equivalent to the problem of obtaining the best estimate of \(f_v\) from M group estimations \({\hat{f_{(1)}}, \hat{f_{(2)}}, ...,\hat{f_{(M)}}}\), which are random samples separately drawn from M normal distributions. Our target is simplified as to give the optimal estimate of the expectation with the help of parameter estimation methods.

4 Estimate Methods

In this section, we present two types of frequency estimate for multiple privacy regimes. In Sect. 4.1, we discuss a new MLE method and give theoretical proof. Then we prove MLE’s accuracy is much better than the basic Raw-PCE (given in Sect. 3.1). In Sect. 4.2, we discuss the situation when there exists large number of groups after grouping operation and propose S-MLE method on the basis of MLE.

4.1 MLE

Maximum likelihood is an effective method in parameter estimation, which is adopted here to generate unbiased expectation f. Once the service provider gets M estimations \(({\hat{f_{(1)}}, \hat{f_{(2)}}, ...,\hat{f_{(M)}}})\) and their variances as well, the Maximum likelihood estimation (MLE) \(\hat{f}\) is defined as follows:

Theorem 1

Given the M estimate \(({\hat{f_{(1)}}, \hat{f_{(2)}}, ...,\hat{f_{(M)}}})\), the MLE for the multiple privacy regimes scenario is

$$\begin{aligned} \hat{f}=(\sum _{m=1}^{M}\frac{\hat{f_{(m)}}}{var(\hat{f_{(m)}})}) / (\sum _{m=1}^{M}\frac{1}{var(\hat{f_{(m)}})}) \end{aligned}$$

The proof is in Appendix A.

It’s interesting to observe that the expectation we derived for MLE is in a weighted-sum manner. The weights become the reciprocal of the variances. Bring it to Eq.  3 for demonstration,

$$\begin{aligned} \alpha _{m} = \frac{1/var(\hat{f_{(m)}})}{\sum _{m=1}^{M} 1/var(\hat{f_{(m)}})} \end{aligned}$$

Estimation Accuracy Analysis. Due to MLE is unbiased and contains grouping operation, its accuracy can be somehow equivalent to a traditional LDP method with a special privacy regime \(\epsilon \). Assuming there are two service providers doing the same collection on a population. One allows user to choose different \(\epsilon \) and groups them, finally there are M groups whose size and privacy regime are \({(n_1, \epsilon _1), (n_2, \epsilon _2), ..., (n_M, \epsilon _M)}\); The other makes all users in the same privacy regime. So in which condition they achieve the same level of accuracy or their estimations have the same variance. The latter obliges all users to have the same \(\epsilon \), and here we call this reckless method traditional estimation (TE).

Theorem 2

The variance of the estimates obtained by the MLE is \(var(\hat{f}) = 1 /(\sum _{m=1}^{M}\frac{1}{var(\hat{f_{(m)}})})\).

The proof is in Appendix B. It can be inferred from the formula that if the data collector only uses the estimate with the least error, its effect is not as good as that of MLE which combines all the estimates.

When substituting the variance generated by Base RAPPOR into the Theorem 2, we obtain a new lemma.

Lemma 3

In Base RAPPOR, if any \(e^{\epsilon _m/2}>> 1\), there exists a privacy regime \(\epsilon '\approx 2*ln \frac{\sum n_m*exp(\epsilon _m)}{\sum n_m}\) such that makes the accuracy of directly using TE method with \(\epsilon '\) equals to MLE with multiple \(\epsilon _m(m \in [M])\).

The proof is in Appendix C.

Comparing Theorem 2 with Lemma 2, we find that the overall variance of MLE is much lower than that of Raw-PCE method. The explanation can be that Theorem 2 implies the final estimate will be accurate as long as at least one of the estimates has low variance, while Lemma 2 has high variance if just one group has high variance. Our experiments also show MLE is much more accurate than Raw-PCE.

4.2 S-MLE

When the user are allowed to choose any value as their overall privacy regimes from a considerate large set, there will be too many groups and some groups inevitably contain too few users. In this scenario, the above MLE method may be inapplicable because large errors are introduced into \(\hat{f}\). And one extreme situation is that \(\epsilon \) is continuous function over real number field as shown in Fig. 4. In this section, we start by analyzing why the minimum size of the group \((\beta N)\) should be set and explaining what factors will affect the value, then we provide a supplement method called S-MLE for this scenario.

\(\beta {{\varvec{N}}}\): Minimum Size of the Group. From the perspective of sampling theory, the essence of grouping process in MLE is that the users are randomly sampled into M groups, and the frequency estimation result of each group is equivalent to the result of sampling scheme without replacement. And what we’ll find is that the sample size of this M group is different and there may exist invalid sample group because sample survey with low sample size introduces lots of sampling error and would not represent the whole. Specifically, small group’s unbiased estimation \(\hat{f_{i(m)}}\) on value \(v_i\) differs greatly from the actual results of the population. Namely, \(|E(\hat{f_{i(m)}}) - f_i| < \sigma \) can’t hold where \(\sigma \) is tolerable sampling error.

So it makes sense to determine the minimum of the sample size which is also a basis for dividing the group. According to the sampling theory, the sample size is usually determined by the variation degree of the research object, the total number of samples and the demand for accuracy. In our MLE, sample size can be mainly determined by the size of candidate set (k), the total number of users (N), and the complexity of candidate set’s frequency (the distribution of \(\varvec{f}\)). So we get the proposition as follows:

Proposition 1

In MLE, for any group whose size does not exceed \(\beta N\), it’s necessary to ignore its estimation or make users in this group join other higher privacy level group.

Combining the Proposition 1 and MLE, we need to figure out an empirical value for \(\beta \) and segment privacy regimes and re-merge existing groups.

S-MLE. The S-MLE method can be further extended to the situation when the type of \(\epsilon \) value is unlimited (continuous distribution).

From Theorem 2 we can see that the entire accuracy is depend on all users’ distribution of \(\epsilon \). let users who have temp largest \(\epsilon \) value form a group (size = \(\beta * N\)) recursively is an efficient segmenting way (segmenting in Fig.  3). Since it’s difficult and complexity to figure out sample size \(\beta \), we give some empirical values here. In the experiment when fixing N to 100000, we find \(\beta \in [0.05,0.1]\) ([0.1, 0.15]) is reasonable for the situation when \(\varvec{f}\) is generated by zipf’s distribution (uniform distribution) and k is ranging from 20 to 200, \(\beta \) should be bigger as k increases.

5 Adapt MLE and Universality of Mining Scenarios

MLE and S-MLE have been able to achieve relatively high accuracy in frequency estimation by Base RAPPOR. Furthermore, they are also applicable for other mining scenarios like heavy-hitter identification, and replace Base RAPPOR with other scheme for better accuracy. In this section, we propose the Adapt MLE method which adaptively selects the most suitable LDP scheme for each group of users to share data, then we briefly show how to apply MLE to other LDP mining tasks.

Adapt MLE. The process of selecting schemes just fits in with some work [11, 18, 22] on how to select LDP schemes based on privacy regime \(\epsilon \), the size of k and communication cost.

Here we attach importance to accuracy and use variance as the evaluation criteria to select LDP scheme for frequency estimation. Because uniform distribution is the most difficult to estimate analyzed by minimax [22], we set each value in \(\varvec{f}\) to be the same and easily calculate the variance of each scheme through Lemma 1. So by comparing their variances, we can get the following directly:

$$\begin{aligned} Adapt~MLE(Group~m) = \left\{ \begin{array}{cc} kRR &{} if~ k < 3e^{\epsilon _m} + 2 \\ Optimal Schemes &{} otherwise \end{array}\right. \end{aligned}$$

In other scenarios like sampling step in frequent item mining [20], “\(3e^\epsilon + 2\)” might change slightly. So the accuracy of Adapt MLE is higher than that of MLE when the discrete values of \(\epsilon \) of different users span a wide range. However, accuracy is not the only factor that matters, communication cost and computational complexity are also worth considering in real world.

As far as we know, our multiple privacy setting still can be used for heavy-hitters identification. SH [3] consists of two important steps. First step uses hash function to separate the values into a lot of channels, with high probability each channel has at most one frequent value, then identify whether there is a frequent item in each channel. Referred to Chen [5], multiple privacy setting is fully applicable to this step. The second step employs a frequency oracle to estimate the frequency of those frequent values obtained from first step. And this is similar to the mining task of frequency estimation while obtaining variance from frequency oracle is in another form and the final estimates are slightly biased (still consistent with the goal of heavy-hitters).

6 Experiments

In this section, we evaluate and compare the performance of our proposed personalized approach through extensive experiments. Since there is no existing work for our settings, we mainly verify the correctness of our analysis.

Setup. All experiments are performed 10 times and we plot the Mean Absolute Percentage Error (MAPE) of all frequency estimation. The MAPE is \(\frac{1}{k}\sum _{i \ in [k]}{\frac{|\hat{f_i}-f_i|}{f_i+\sigma }}\), where \(f_i\) is the actual fraction of all users taking value i and \(\sigma \) is to prevent the denominator from 0.

In RAPPOR [9] with h = 1, their value for epsilon is actually set to 2ln3 and the number of users is a million level. And Apple also set epsilon to 1 or 2. So we assume that 100 thousand users participates the collection, and set M options in most experiments to \(\varvec{\epsilon }\) = [0.2, 1.0, 2.0, 3.0] and the proportion of users in each group is \(G = [0.2,0.3,0.3,0.2]\) where G denotes the corresponding proportion.

For better verification of correctness, we generate two synthetic data which are from Zipf’s distribution (parameter a = 2.5) and uniform distribution. The schemes used in each experiment contain KRR, Base Rappor and Optimal Scheme. We change the distribution of \(\varvec{f}\) by controlling the size of k.

Fig. 4.
figure 4

Comparing MLE and Raw-PCE, varing k: estimates generated by each group are marked as Sep

6.1 Accuracy of MLE Method

Whatever the distribution of \(\varvec{f}\) is in Fig. 4, MAPE curves generated by four groups shows that group with higher \(\epsilon \) generates better estimates; Raw-PCE’s accuracy has been greatly affected by the group with low variance, namely \(\epsilon = 0.2\); From Theorem 2 and this figure, we know the variance of MLE is always smaller than the variance estimated by each group.

In uniform distribution (Fig. 4(a)), MAPE increases as k increases, roughly the same multiple because the denominator of MAPE’s calculation is actually 1 / k. In fact, the size of the variance is independent of k. So in zipf’s distribution (Fig. 4(a)), the change in MAPE will not be obvious when k does not exceed 200. This also the reason why uniform distribution is difficult to estimate.

6.2 S-MLE Method on Continuous \(\epsilon \)

When there are too many options of M or all users’ \(\epsilon \) is continuously distributed, the value of \(\beta \) can help to divide all users into \(\lfloor 1 /\beta \rfloor \) groups as described in Sect. 4.2. Small \(\beta \) will increase sampling error, but it can also reduce the overall variance from Theorem 2. A balance between overall variance and sampling error is reasonable.

Fig. 5.
figure 5

the influence of \(\beta \) ’s value on MAPE, varying k.

In this part, we make the proportion of users choosing different \(\epsilon \) obey N(2.5, 0.8) and \(\epsilon \)’s contiguous interval be [1.0,4.0], namely G obeys N(2.5, 0.8) and \(\varvec{\epsilon }\) is continuous in [1.0,4.0]. It can be observed from Fig. 5(a) that when k is small, the MAPE with small \(\beta \) is acceptable, namely, the sampling error has little effect. On the contrary, when k becomes larger than 60 for \(\beta = 0.0625\), the sampling error even exceeds the error generated by estimator. But for Zipf’s distribution, the effect of k on sampling error is not obvious until \(k>150\). So for our settings above concerning the number of users and \(\epsilon \) ratio, \(\beta \in [0.05,0.1]\) for Zipf’s distribution and \(\beta \in [0.1,0.15]\) for uniform distribution are appropriate choices.

6.3 Multiple Schemes for Different Groups

In Sect. 5, we claim that users in different groups can use different LDP schemes to achieve better accuracy. From Sect. 5, kRR performs better when \(k < 3e^{\epsilon }+2\). We divide 100 thousand users into 4 groups with \(\epsilon =[1.0,2.0,3.0,3.2]\) and \(G=[0.3,0.35,0.3,0.05]\).

Learning from Fig. 6, “Adapt-MLE” performs better because users of these 4 groups use kRR if k are less than 10, 25 and 62 and 76 respectively and otherwise use Optimal Scheme.

Fig. 6.
figure 6

Adapt MLE with multiple schemes, “MLE(kRR)” represents only KRR and “MLE(Opt)” represents only Optimal scheme.

7 Related Work

The traditional differential privacy (DP) was developed for interactive query-response on a central database and provides theoretical privacy guarantee by mathematically randomizing the results of statistical queries. However, the major limitation of DP is that all users need to trust a central server. Despite attacks from aggregate queries, individual’s data may also suffer from privacy leakage before aggregation [8].

On the other hand, Local differential privacy (LDP) [13], a variant of DP, guarantees privacy of data without that server. Random Response (RR) [21], where the user responds either true or opposite answer depending on coin flipping, is the most basic technique in LDP schemes.

Widely accepted schemes for frequency estimation under LDP are Rappor by Erlingsson et al. [9] and succinct histogram (SH) by Bassily and Smith [3]. RAPPOR’s key idea is encoding values into Bloom filters and applying RR to each bit of Bloom filters. In order to conquer hash collision problems in Bloom filters, RAPPOR brings in cohorts. In this paper, we use RAPPOR’s no bloom filter version to analyze our multiple setting. SH’s has two important data structures—frequency oracle and succinct histogram, these two work together to estimate those values whose frequencies exceed \(\eta \), so some applications do heavy hitters [5, 16, 19] identification referred to SH. Due to the high complexity of SH, Bassily et al. [4] recently developed an efficient way to query the frequency estimation based on SH mechanism.

In addition, discrete distribution estimation under LDP considers all values’ frequency accuracy. Kairouz et al. [11] analyzed several key factors (privacy regimes, discrete distribution) which affect accuracy. Ye et al. [22] came up with Optimal Scheme for discrete distribution estimation. About [11, 22], their analysis tools all contained minimax with \(l_2\)-norm as loss function which is similar to variance. Wang et al. [18] introduced a framework that can generalize the most LDP schemes by recognizing RR’s features. These work all have a prerequisite that the size of k is limited.

In personalization privacy fields, Jorgensen et al. [10] incorporated personalized settings for DP (PDP), and Li et al. [14] proposed a k-partition strategy to improve it. Then Chen et al. [5] first introduced the concept of personalized privacy in LDP (PLDP), it allows users to have two optional privacy regime \(\epsilon \) and \(\tau \). The former has no change, while \(\tau \) represents a small piece of the candidates list. They assume some users set small size of candidates list and this part of users can greatly improve their performance on heavy-hitter mining task. Obviously, \(\tau \) makes this personalization process complex for users. Akter et al. [1] borrowed the definition of PLDP to estimate numeric data like average instead of heavy-hitters mining.

8 Conclusion

In this paper, we mainly study frequency estimation under Local Differential Privacy (LDP) in multiple regimes scenarios. We have formulated the problem of multiple privacy levels and proposed a MLE method to deal with this situation. Then, we propose S-MLE and Adapt-MLE to deal with the situation when users’ privacy levels are in some special cases.