
1 Introduction

With the continued expansion of searchable data, recommendation systems have emerged to provide information retrieval services tailored to users' interests based on their personal data stored in the cloud, which gives rise to privacy concerns. The literature shows that continuous observation of recommendations, supplemented by a certain amount of background information, makes it possible to infer an individual's transaction history, especially when neighborhood-based methods are used [1]; this is known as the k-nearest neighbor (KNN) attack. Suppose an attacker knows m of an intended victim's item ratings; the attacker can then compromise the victim's privacy by creating k fake users with the same ratings so that the victim's recommendation list can be replicated.

Differential privacy is a provable privacy framework developed in recent years. It has become an important standard for evaluating privacy levels due to its rigorous privacy guarantee. Normally, differential privacy protects users' privacy by adding random noise to the results of queries, ensuring that the output is insensitive to any individual record in the dataset. Several papers have recently applied differential privacy to solve privacy problems in recommendation systems. Guerraoui et al. [2] provide distance-based differential privacy in a KNN-based CF recommendation system. Shen and Jin [3] provide a novel relaxed admissible mechanism that guarantees differential privacy by injecting relaxed instance-based random noise. In subsequent work, Shen and Jin [4] also provide a practical privacy framework in an untrusted server setting, in which user data are perturbed and anonymized before leaving the users' private devices. All of these methods apply traditional differential privacy, preserving a uniform level of privacy for all users.

However, privacy is a personal concept, and users may have different expectations for the protection of their personal data. For example, users with particular interests may require a strong privacy guarantee. Therefore, we argue that a uniform level of privacy protection is insufficient for users with strong privacy requirements, while over-protecting those who do not have such requirements. In this paper, we consider personalized privacy in neighborhood-based collaborative filtering.

The main challenge in personalized privacy is engineering the privacy parameters in the method's design to provide different levels of privacy for individual users. To solve this problem, we consider the privacy requirements of the users who are directly related to the query. Specifically, we make the following contributions:

  1. We propose a personalized privacy preservation method based on the Laplace mechanism for neighborhood-based collaborative filtering. We add personalized Laplace noise to the query results according to each user's privacy requirements.

  2. We propose a randomized pre-processing method to process the sparse original dataset, which contributes to improved prediction quality. Since the neighbors selected from sparse datasets are often of poor quality, our randomized method helps to choose much closer neighbors, which is crucial for the final prediction.

Extensive testing shows that the proposed privacy method is not only personalized to users’ requirements, but also makes better predictions than when using traditional differential privacy.

The rest of this paper is organized as follows: We provide the preliminaries in Sect. 2 and define the personalized privacy in Sect. 3. In Sects. 4 and 5, we introduce the proposed personalized method and analyse the privacy levels and utility provided by our method respectively, followed by a detailed description of the experiments in Sect. 6. At the end of the paper we present our conclusions.

2 Preliminaries

2.1 Notation

We denote the dataset D as an \(n\times d\) matrix, where n is the number of users and d is the number of items in the dataset. Let \(U=[U_1,U_2,...U_n]\) be the row vectors of the matrix D, then, \(U_i=[r_{u_i1},r_{u_i2},...r_{u_id}]\) is the rating list of user \(u_i\). \(r_{u_it}\) is the rating that user \(u_i\) gives to item t. The similarity between user \(u_i\) and \(u_j\) is represented as \(s(u_i,u_j)\), while \(N_k(u_i)\) is the set of user \(u_i\)’s k-nearest neighbors. We define two datasets as neighbors if they differ by only one row.

2.2 Collaborative Filtering

Collaborative filtering, a popular method for analysing users' interests, has been widely applied to recommendation systems. The premise of collaborative filtering is that users will obtain good recommendations from people with similar tastes, and therefore these methods make their predictions based on the preferences of the users with the closest common interests. Collaborative filtering systems typically involve two steps: Step 1, select the k users with the highest similarity as the closest neighbors; Step 2, predict the rating that the active user would give to the target item. In this paper, cosine-based similarity (COS) and Pearson's correlation coefficient (PCC) [5] are used to estimate the similarity between two users, and the prediction is calculated as follows:

$$\begin{aligned} r_{u_it}=\frac{\sum _{u_j\in N_k(u_i)}{s(u_i,u_j)}\,r_{u_jt}}{\sum _{u_j\in N_k(u_i)}{|s(u_i,u_j)|}}. \end{aligned}$$
(1)
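As an illustration of Eq. 1, the following minimal Python sketch computes this weighted prediction; the array names are our own and the numbers are a toy example, not data from the paper.

```python
# A minimal sketch of the neighborhood-based prediction in Eq. 1, assuming
# `similarities[j]` holds s(u_i, u_j) and `neighbor_ratings[j]` holds r_{u_j t}
# for the k nearest neighbors u_j in N_k(u_i).
import numpy as np

def predict_rating(similarities: np.ndarray, neighbor_ratings: np.ndarray) -> float:
    """Weighted average of the neighbors' ratings, weighted by similarity."""
    return float(np.sum(similarities * neighbor_ratings) / np.sum(np.abs(similarities)))

# Example: three neighbors with similarities 0.9, 0.7, 0.4 rated the item 5, 4, 3.
print(predict_rating(np.array([0.9, 0.7, 0.4]), np.array([5.0, 4.0, 3.0])))
```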

2.3 Differential Privacy

The goal of differential privacy [6] is to mask the difference in the answer to query f between neighboring datasets. In \(\epsilon \)-differential privacy, the parameter \(\epsilon \) is defined as the privacy budget, which controls the privacy guarantee level of mechanism \(\mathcal {M}\). A smaller \(\epsilon \) represents a stronger privacy level. The formal definition of differential privacy is presented below:

Definition 1

( \(\epsilon \) -Differential Privacy). A randomized algorithm \(\mathcal {M}\) gives \(\epsilon \)-differential privacy if, for any pair of neighboring datasets D and \(D^{*}\) and for every set of outcomes \(\varOmega \), \(\mathcal {M}\) satisfies

$$\begin{aligned} Pr[\mathcal {M}(D) \in \varOmega ] \le \exp (\epsilon ) \cdot Pr[\mathcal {M}(D^{*}) \in \varOmega ]. \end{aligned}$$
(2)

Definition 2

(Laplace mechanism). Given a function \(f: D \rightarrow \mathbb {R}\) over a dataset D with sensitivity s, the mechanism \(\widehat{f}\) defined in Eq. 3 provides \(\epsilon \)-differential privacy.

$$\begin{aligned} \widehat{f}(D)=f(D)+Laplace(\frac{s}{\epsilon }). \end{aligned}$$
(3)

Laplace mechanism satisfies differential privacy by adding Laplace noise to the true answer.
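As a small illustration of Eq. 3, the sketch below perturbs the answer of a query with noise drawn from \(Laplace(s/\epsilon )\); the concrete query value and its sensitivity are illustrative assumptions, not values from the paper.

```python
# Illustration of the Laplace mechanism in Eq. 3: perturb the true answer of a
# query f with noise drawn from Laplace(s / epsilon), where s is the sensitivity
# of f. The concrete answer and sensitivity below are illustrative only.
import numpy as np

def laplace_mechanism(true_answer: float, sensitivity: float, epsilon: float,
                      rng: np.random.Generator) -> float:
    scale = sensitivity / epsilon          # b = s / epsilon
    return true_answer + rng.laplace(loc=0.0, scale=scale)

rng = np.random.default_rng(0)
# e.g. a count query with sensitivity 1 answered under epsilon = 0.5
print(laplace_mechanism(true_answer=42.0, sensitivity=1.0, epsilon=0.5, rng=rng))
```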

2.4 Johnson-Lindenstrauss Transform

The Johnson-Lindenstrauss transform is a random projection that maps points from a high-dimensional space to a lower-dimensional space while approximately preserving the Euclidean distance between any pair of points. This property is formalized in the Johnson-Lindenstrauss lemma shown below:

Lemma 1

(Johnson-Lindenstrauss lemma [7]). For any set S of n points in \(\mathbb {R}^d\) and any \(0< \delta < 1/2\), let \(m=\varOmega (log(n)/\delta ^2)\). Then there exist a matrix \( A\in \mathbb {R}^{d \times m}\) and a map \(f: \mathbb {R}^d \rightarrow \mathbb {R}^m\) such that, for all \(u,v \in S\),

$$\begin{aligned} (1-\delta )\Vert u-v\Vert ^2 \le \Vert f(u)-f(v)\Vert ^2 \le (1+\delta )\Vert u-v \Vert ^2. \end{aligned}$$
(4)

where \(f(u)=uA\) and \(f(v)=vA\).

This lemma states that the random matrix A can project the dataset from d dimensions down to m while approximately maintaining the relative distances. Several constructions for A have been proposed [8, 9]. We use a Gaussian distribution to generate the random matrix A.

3 Personalized Privacy Definition

In this section, we focus on the problem of preserving privacy in neighborhood-based collaborative filtering. In traditional differential privacy preservation methods, users are given a uniform privacy guarantee. However, in real life, users have different privacy requirements. A uniform privacy guarantee provides over-protection to users who are not concerned about their personal information while providing insufficient protection to users with strong privacy concerns. To solve this problem, we propose a personalized privacy preservation method that considers the differing privacy requirements of all users. Its formal definition is presented as follows:

Definition 3

(Personalized Differential Privacy [10]). A randomized algorithm \(\mathcal {M}\) gives \(\varPsi \)-differential privacy if, for any pair of neighboring datasets D and \(D^{*}\) and for every set of outcomes \(\varOmega \), \(\mathcal {M}\) satisfies:

$$\begin{aligned} Pr[\mathcal {M}(D) \in \varOmega ] \le e^{\varepsilon _\varphi (i)} \cdot Pr[\mathcal {M}(D^{*}) \in \varOmega ]. \end{aligned}$$
(5)

where \(\varepsilon _\varphi (i)\) is the user’s privacy requirement.

Definition 4

(Privacy Requirement). Model the dataset as a set of tuples, where each tuple associated with a user \(u_i\) has multiple attributes. There exists a map \(\varPsi : u_i\rightarrow \varepsilon _\varphi (i)\) from each user to a corresponding self-defined privacy parameter.

Each user can choose a privacy parameter according to their own privacy requirements. Practically, this may be presented to the user as a choice of labelled buttons, such as 'I don't care about privacy', 'I am a little concerned about privacy', or 'I am very concerned about privacy'. In our work, we define \(0.1\le \varepsilon _\varphi (i)\le 1\). If a user does not specify a privacy requirement, we assign a default value of \(\varepsilon _\varphi =1\).
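One possible realization of the map \(\varPsi \) from Definition 4 is sketched below; the numeric values attached to the two 'concerned' labels are illustrative choices within \([0.1, 1]\), not values prescribed by our method.

```python
# Hypothetical mapping from a user's self-reported privacy attitude to a privacy
# parameter in [0.1, 1]; unspecified users receive the default epsilon = 1.
from typing import Optional

PRIVACY_CHOICES = {
    "I don't care about privacy": 1.0,
    "I am a little concerned about privacy": 0.5,   # illustrative value
    "I am very concerned about privacy": 0.1,       # illustrative value
}

def privacy_requirement(choice: Optional[str]) -> float:
    """Return epsilon_phi(i) for a user's stated choice; default to 1.0."""
    if choice is None:
        return 1.0
    return PRIVACY_CHOICES.get(choice, 1.0)

print(privacy_requirement("I am very concerned about privacy"))  # 0.1
print(privacy_requirement(None))                                 # 1.0 (default)
```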

Suppose all users in the dataset have the same privacy preference, that is, for all \(u_i\in O\), where O is the set of all users in D, \(\varepsilon _\varphi (i)=max_i(\varepsilon _\varphi (i))\). In this case, there would be no objection to providing differential privacy with the same guarantee as traditional methods (i.e., \(e^\varepsilon \), where \(\varepsilon =max_i(\varepsilon _\varphi (i))\)). Consider another situation, in which half the users in D do not care about their personal data and are happy for it to be public without reservation. Even though none of the privacy budget has been assigned to their data, their privacy has not been disclosed, because the data are not considered private by those users. In this way, we can say that personalized differential privacy not only considers personalized privacy requirements, but also preserves the same level of privacy as traditional methods while providing much more utility.

4 Personalized Privacy Collaborative Filtering

In this section, we propose a Personalized Privacy Preserving Collaborative Filtering algorithm with a pre-processing method that improves the accuracy of prediction. An overview of our method follows.

Rating datasets are very sparse, which directly affects neighbor selection and results in inaccurate predictions. We therefore propose a pre-processing method based on the Johnson-Lindenstrauss transform to process the sparse original dataset. The basic idea is to multiply the dataset by a random matrix that satisfies the Johnson-Lindenstrauss lemma. Studies show that such a randomization method helps to greatly improve performance.

We then implement differentially private collaborative filtering on the sanitized dataset, adding Laplace noise to the similarities to ensure users' privacy. According to Eq. 3, the amount of Laplace noise added is determined by the sensitivity and the privacy parameter. We propose a new definition of sensitivity, associated only with the users related to the query, to reduce the amount of noise. In addition, the privacy parameter in our method varies from user to user: we assign different values of the privacy parameter to users according to their privacy preferences.

The details of the proposed method are presented in Algorithm 1.

[Algorithm 1: Personalized Privacy Preserving Collaborative Filtering]

First, we process the original dataset in Step 1; the details are presented in Sect. 4.1. In Step 2, we calculate the similarity matrix of the users in the sanitized dataset and perturb it using Laplace noise; the specific operations are presented in Sect. 4.2. The top-k users with the highest similarity are selected as the active user's neighbors in Step 3. Lastly, in Step 4, we predict the rating that the active user will give to a specific item according to Eq. 1. Steps 3 and 4 are standard collaborative filtering steps that do not require further explanation.
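Since Algorithm 1 is only summarized above, the following Python sketch outlines the four steps under our reading of Sects. 4.1 and 4.2; the helper names, the placeholder sensitivity value, and the toy data are our own and do not reproduce the authors' implementation.

```python
# Sketch of the PPCF pipeline (Steps 1-4) under stated assumptions; not a
# reference implementation.
import numpy as np

def ppcf_predict(R, eps, target_user, target_item, k=2, m=3, rng=None):
    """R: n x d rating matrix; eps: per-user privacy parameters epsilon_phi(i)."""
    rng = rng or np.random.default_rng(0)
    n, d = R.shape

    # Step 1: pre-process with a Johnson-Lindenstrauss style projection (Sect. 4.1).
    A = rng.normal(0.0, np.sqrt(1.0 / m), size=(d, m))
    P = R @ A

    # Step 2: perturbed cosine similarities to every other user (Sect. 4.2).
    sims = np.zeros(n)
    for j in range(n):
        if j == target_user:
            continue
        u, v = P[target_user], P[j]
        s = float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))
        sensitivity = 0.1                       # placeholder for IS (Def. 6)
        pp = min(eps[target_user], eps[j])      # personalized parameter, Eq. 8
        sims[j] = s + rng.laplace(0.0, sensitivity / pp)

    # Step 3: select the top-k most similar users as neighbors.
    neighbors = np.argsort(sims)[::-1][:k]

    # Step 4: weighted prediction, as in Eq. 1.
    w = sims[neighbors]
    return float(np.sum(w * R[neighbors, target_item]) / (np.sum(np.abs(w)) + 1e-12))

# Toy example: 4 users, 5 items, per-user privacy parameters.
R = np.array([[5, 3, 0, 1, 4],
              [4, 0, 0, 1, 5],
              [1, 1, 0, 5, 4],
              [5, 4, 0, 0, 4]], dtype=float)
eps = np.array([1.0, 0.5, 0.1, 1.0])
print(ppcf_predict(R, eps, target_user=0, target_item=2))
```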

4.1 Pre-processing

Rating datasets are usually high-dimensional and sparse. The quality of the neighbors is crucial to neighborhood-based collaborative filtering, but because neighbor selection is based on co-rated items, the sparsity of the original dataset means that the selected neighbors are often not the best choices. To solve this problem, we propose a randomized pre-processing method based on the Johnson-Lindenstrauss transform that helps to improve prediction. The details are provided in Algorithm 2.

[Algorithm 2: Pre-processing]

In Step 1, we construct a transition matrix A by sampling each entry from a Gaussian distribution N(0, 1/m). Then, in Step 2, we compute the new dataset by multiplying the original dataset by this transition matrix. The new dataset has only m dimensions, which significantly reduces the algorithm's complexity. It also makes the dataset more compact, which helps to select closer neighbors.
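A sketch of these two steps is given below; the value of m and the toy rating matrix are illustrative choices.

```python
# Sketch of the pre-processing in Algorithm 2: build a d x m transition matrix A
# with i.i.d. entries from N(0, 1/m) and project the n x d rating matrix onto m
# dimensions. The value of m below is an illustrative choice.
import numpy as np

def jl_preprocess(R: np.ndarray, m: int, seed: int = 0) -> np.ndarray:
    rng = np.random.default_rng(seed)
    d = R.shape[1]
    A = rng.normal(loc=0.0, scale=np.sqrt(1.0 / m), size=(d, m))  # Step 1
    return R @ A                                                   # Step 2

R = np.random.default_rng(1).integers(0, 6, size=(100, 2000)).astype(float)
print(jl_preprocess(R, m=50).shape)   # (100, 50): reduced from 2000 to 50 dimensions
```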

4.2 Perturb the Similarity

To prevent disclosure of the rating records, we perturb the similarity using a Laplace mechanism. The Laplace mechanism preserves differential privacy by adding Laplace noise; however, the naive Laplace mechanism cannot provide an accurate prediction. Recall that the noise added by the Laplace mechanism follows a Laplace distribution, \(Noise=Laplace(s/\varepsilon )\), and is therefore determined by the sensitivity and the privacy parameter.

Sensitivity. Sensitivity answers the question 'How much will the result of a query change when one record is removed from, or added to, the dataset?'. It determines how much perturbation the mechanism must generate under a given privacy parameter. Generally, it refers to global sensitivity, which measures the maximum difference between the results of the query over all neighboring datasets. The formal definition is as follows:

Definition 5

(Global sensitivity). Given a query f on the dataset D and a neighboring dataset \(D'\) of D, the global sensitivity of f is defined as

$$\begin{aligned} GS_f=\underset{D,D'}{max}\Vert f(D)-f(D')\Vert _1. \end{aligned}$$
(6)

The sensitivity is independent of the dataset and is determined by the query function [11]. However, for some queries, especially the similarity query, the global sensitivity is too high to provide an accurate answer. Global sensitivity always considers the worst case, applying the same standard to all users. In reality, however, the domain of the dataset is so large that the probability of the worst case occurring is very small; it is almost impossible for two users to even rate the same movies, let alone give identical scores. We therefore consider a much more practical notion in this paper, immediate sensitivity (IS), defined as follows:

Definition 6

(Immediate Sensitivity). Immediate sensitivity is a dynamic sensitivity, determined on the fly by the two users involved in the current query. It is calculated as

$$\begin{aligned} IS=|f_D(u_i, u_j)-f_{D'}(u_i,u_j)|. \end{aligned}$$
(7)

where \(u_j\in D\) and \(u_j \notin D'\).

IS is updated as the query changes and reflects the real situation, which avoids adding excessive noise. For example, given a dataset \(D=[u_1, u_2, u_3, u_4, u_5]\), when we query the similarity between \(u_2\) and \(u_3\), we consider how much the result of the query would change if record \(u_2\) or \(u_3\) were deleted from the dataset. We then add Laplace noise based on this sensitivity to hide the information that the query might disclose.

Privacy Parameter. In traditional differential privacy, a uniform privacy parameter is applied to all users. For users who are not concerned about the safety of their personal data, a uniform parameter provides more protection, and hence more noise, than necessary, which reduces utility. In the proposed method, we consider individual preferences when adding Laplace noise. We assume that users have chosen their own privacy parameters or selected their privacy level in a user-friendly way, using options such as 'low privacy concern', 'medium privacy concern' and 'high privacy concern'. When we calculate the amount of noise added to the similarity between two users, we compare their privacy concerns. In personalized privacy, the privacy parameter varies from user to user, so choosing the smaller of the two values as the privacy parameter guarantees both users' privacy. More formally,

$$\begin{aligned} Pp=min_{i,j}\{\varepsilon _\varphi (i),~ \varepsilon _\varphi (j)\}. \end{aligned}$$
(8)

where \(\varepsilon _\varphi (i)\) and \(\varepsilon _\varphi (j)\) are the privacy preferences of users \(u_i\) and \(u_j\), respectively.
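Putting the pieces together, perturbing a single similarity value might look as follows; the cosine similarity query, the concrete immediate-sensitivity value, and the variable names are our own illustration of Defs. 5-6 and Eq. 8.

```python
# Sketch: perturb the similarity between users u_i and u_j with Laplace noise
# scaled by the immediate sensitivity IS and the smaller of the two users'
# privacy parameters (Eq. 8). The IS value passed in is a stand-in for Def. 6.
import numpy as np

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def perturbed_similarity(u_i, u_j, eps_i, eps_j, immediate_sensitivity, rng):
    pp = min(eps_i, eps_j)                              # Eq. 8
    noise = rng.laplace(0.0, immediate_sensitivity / pp)
    return cosine(u_i, u_j) + noise

rng = np.random.default_rng(0)
u_i = np.array([5.0, 3.0, 0.0, 1.0])
u_j = np.array([4.0, 0.0, 0.0, 1.0])
print(perturbed_similarity(u_i, u_j, eps_i=0.3, eps_j=1.0,
                           immediate_sensitivity=0.05, rng=rng))
```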

As previously mentioned, three operations in the proposed method contribute to improved performance: randomized pre-processing helps to select high-quality neighbors, while both calibrated sensitivity and personalized privacy parameters reduce added noise. Theoretical analysis and testing have proven the effectiveness of the proposed method.

5 Privacy and Utility Analysis

5.1 Privacy Analysis

We preserve users’ rating records by perturbing the similarity between any pair of users. We take the personal privacy requirement into account and add Laplace noise using different privacy parameters. Theorem 1 shows that the proposed method satisfies personalized differential privacy.

Theorem 1

For a given dataset D in which each user \(u_i\) has an individual privacy preference \(\varepsilon _\varphi (i)\), the PPCF method provides personalized differential privacy to the users in D.

Proof

Assume the user's privacy preference is \(\varepsilon _\varphi (i)\), the sensitivity is \(IS_{i,j}\), f(D) is the query function, S is a possible output, and \(D'\) is the neighboring dataset. N(x) is the noise added to the similarity, which follows the Laplace distribution Laplace(0, b) with \(b=IS_{i,j}/\varepsilon _\varphi (i)\).

Then,

$$\begin{aligned} \nonumber \frac{Pr[f(D)=S]}{Pr[f(D')=S]}=\frac{N(S-f(D))}{N(S-f(D'))} \le exp(\frac{|f(D)-f(D')|}{b}) =exp(IS_{i,j}/b)=e^{\varepsilon _\varphi (i)}. \end{aligned}$$

Therefore, the proposed method not only satisfies differential privacy, but also provides corresponding privacy preservation for each user.

5.2 Utility Analysis

In this section, we apply a well-known utility definition suggested by Blum et al. [12] to measure the accuracy of the prediction, which in turn depends on the quality of the neighbors.

Definition 7

((\(\alpha , \beta \))-useful). A database access mechanism \(\mathcal {M}\) is (\(\alpha , \beta \))-useful with respect to the COS query if, for every database D, the output of the mechanism \(\mathcal {M}\) satisfies:

$$\begin{aligned} Pr[max|cos(u',v')-cos(u,v)|\le \alpha ]\ge 1-\beta . \end{aligned}$$
(9)

Theorem 2

For the COS query on any pair of users, the output error caused by the proposed method (PPCF) is less than \(\alpha \) with probability at least \(1-\beta \). The proposed method is (\(\alpha , \beta \))-useful when \(\alpha <\frac{s(u,v)\sqrt{d}(ln\frac{2}{\beta } -\frac{1-\delta }{\sqrt{d}})}{2s(u,v)(1-\delta )+\varepsilon _\varphi (i)\sqrt{d}}\).

Proof

The error caused by the proposed method is the result of the Johnson Lindenstrauss transform and Laplace noise. We use \(E_{JL}\) and \(E_{Lap}\) to represent the error introduced by the Johnson Lindenstrauss transform and Laplace noise, respectively. We have

$$\begin{aligned} Pr(max|cos(u',v')-cos(u,v)|> \alpha ) \le Pr(E_{JL}>\alpha ) + Pr(E_{Lap}> \alpha ). \end{aligned}$$

Referring to the Johnson Lindenstrauss lemma and according to the relationship between \(l_2\) Euclidean distance and cosine similarity, we have

$$\begin{aligned} cos(u,v) \ge \frac{(\Vert u \Vert +\Vert v \Vert )(1-\delta )-\Vert u'-v' \Vert }{2\Vert u \Vert \Vert v \Vert (1-\delta )}. \end{aligned}$$

Then, we have

$$\begin{aligned} E_{JL}&\le \frac{\Vert u' \Vert +\Vert v' \Vert - \Vert u'-v' \Vert }{2 \Vert u' \Vert \Vert v' \Vert } + \frac{\Vert u'-v' \Vert -(\Vert u \Vert +\Vert v \Vert )(1- \delta )}{2\Vert u \Vert \Vert v \Vert (1- \delta )} \\&\le \frac{\Vert u' \Vert +\Vert v' \Vert - ( \Vert u \Vert + \Vert v \Vert )(1- \delta )}{2 \Vert u \Vert \Vert v \Vert (1- \delta )} \\&\le \frac{\Vert u' \Vert +\Vert v' \Vert }{2(\Vert u \Vert +\Vert v \Vert )(1- \delta )}-\frac{1}{2} \le \frac{(\Vert u \Vert + \Vert v \Vert ) \Vert X \Vert }{2(\Vert u \Vert +\Vert v \Vert )(1- \delta )}-\frac{1}{2}=\frac{\Vert X \Vert }{2(1- \delta )}-\frac{1}{2}. \end{aligned}$$

Accordingly,

$$\begin{aligned} Pr(E_{JL}>\alpha )&= Pr[\frac{\Vert X \Vert }{2(1- \delta )}-\frac{1}{2}>\alpha ] \\&= Pr[\sqrt{dm}|x|>2(\alpha + \frac{1}{2})(1- \delta )] = Pr[|x|> \frac{2(\alpha + \frac{1}{2})(1- \delta )}{\sqrt{dm}}]. \end{aligned}$$

Because variable \(x\sim N(0,1/m)\), according to the property of Gaussian distribution, we have

$$\begin{aligned}&Pr(E_{JL} >\alpha ) = 2 e^{-\frac{2}{\sqrt{d}}(\alpha +\frac{1}{2})(1-\delta )}. \end{aligned}$$

Given the \(Noise \sim Laplace(\frac{s(u,v)}{\varepsilon _\varphi (i)})\), according to the property of Laplace distribution, we have

$$\begin{aligned}&Pr(E_{Lap} >\alpha ) = e^{-\frac{\varepsilon _\varphi (i)}{s(u,v)}\alpha }. \end{aligned}$$

Accordingly,

$$\begin{aligned} Pr(max|cos(u',v')-cos(u,v)| > \alpha ) \le 2 e^{-\frac{2}{\sqrt{d}}(\alpha +\frac{1}{2})(1-\delta )} + e^{-\frac{\varepsilon _\varphi (i)}{s(u,v)}\alpha }. \end{aligned}$$
(10)

Setting the right-hand side of Eq. 10 equal to \(\beta \), we have

$$\begin{aligned}&ln2 -\frac{2}{\sqrt{d}}(\alpha +\frac{1}{2})(1-\delta ) -\frac{\varepsilon _\varphi (i)}{s(u,v)}\alpha = ln\beta \nonumber \\&\Rightarrow \alpha = \frac{s(u,v)\sqrt{d}(ln\frac{2}{\beta } -\frac{1-\delta }{\sqrt{d}})}{2s(u,v)(1-\delta )+\varepsilon _\varphi (i)\sqrt{d}}. \end{aligned}$$
(11)

The result shows that the error is bounded by \(\alpha \), where \(\alpha <\frac{s(u,v)\sqrt{d}(ln\frac{2}{\beta } -\frac{1-\delta }{\sqrt{d}})}{2s(u,v)(1-\delta )+\varepsilon _\varphi (i)\sqrt{d}}\). When the privacy parameter \(\varepsilon _\varphi (i)\) grows larger and \(\delta \) becomes smaller, the error introduced by the proposed method becomes much smaller.
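As a sanity check on the last algebraic step, the snippet below solves the linear equation above for \(\alpha \) symbolically and compares the result with the closed form in Eq. 11; the use of sympy and the symbol names are our own, not part of the paper.

```python
# Symbolic check that solving
#   ln 2 - (2/sqrt(d))(alpha + 1/2)(1 - delta) - (eps/s) alpha = ln beta
# for alpha yields the bound in Eq. 11.
import sympy as sp

alpha, beta, delta, d, s, eps = sp.symbols('alpha beta delta d s epsilon', positive=True)

lhs = sp.log(2) - (2 / sp.sqrt(d)) * (alpha + sp.Rational(1, 2)) * (1 - delta) - (eps / s) * alpha
solution = sp.solve(sp.Eq(lhs, sp.log(beta)), alpha)[0]

# Eq. 11, writing ln(2/beta) as ln 2 - ln beta.
claimed = s * sp.sqrt(d) * ((sp.log(2) - sp.log(beta)) - (1 - delta) / sp.sqrt(d)) \
          / (2 * s * (1 - delta) + eps * sp.sqrt(d))

print(sp.simplify(solution - claimed))   # expected to print 0
```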

6 Experiment and Analysis

In this section, our main goal is to demonstrate that applying a Johnson Lindenstrauss transform and considering personal privacy preferences can both contribute to an accurate prediction.

6.1 Experiment Setting

Datasets. In this experiment, we use the Netflix and MovieLens datasets. The Netflix dataset was extracted from the official dataset used in the Netflix Prize competition. Each movie has been rated by at least 20, 250 users, who have each rated at least 20 movies. The MovieLens dataset is a benchmark dataset used for collaborative filtering research; it contains about 1 million ratings of approximately 3900 movies by 6040 users. We randomly chose one rating by each user as the test dataset and deleted it from the original dataset.

Privacy Configuration. According to studies by Acquisti and Grossklags and by Berendt et al. [13, 14], users fall into several groups according to their attitudes towards privacy. To simulate users' privacy preferences, we divided users randomly into three groups: low, medium and high privacy concern, with corresponding proportions \(p_l\), \(p_m\) and \(p_h\). The users' preferences are sampled uniformly from the ranges \([\varepsilon _l, \varepsilon _m]\) and \([\varepsilon _m, \varepsilon _h]\) respectively, where \(\varepsilon _l, \varepsilon _m, \varepsilon _h \in [0.1, 1]\). Smaller values represent a stronger privacy requirement. The users in the low-concern group have a fixed value \(\varepsilon _l=1\).
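The sketch below reproduces this simulated configuration; the group proportions and range endpoints are illustrative defaults (the actual values used are listed in Table 1), and the mapping of the medium and high groups to their ranges follows our reading of the description above.

```python
# Sketch of the simulated privacy configuration: split users into low / medium /
# high concern groups with proportions p_l, p_m, p_h, give low-concern users the
# fixed value 1, and sample the other groups uniformly from their ranges.
# The concrete proportions and range endpoints below are illustrative only.
import numpy as np

def simulate_preferences(n_users, p_l=0.4, p_m=0.4, p_h=0.2,
                         eps_m=0.5, eps_h=0.1, seed=0):
    rng = np.random.default_rng(seed)
    groups = rng.choice(["low", "medium", "high"], size=n_users, p=[p_l, p_m, p_h])
    eps = np.empty(n_users)
    eps[groups == "low"] = 1.0                                          # epsilon_l = 1
    eps[groups == "medium"] = rng.uniform(eps_m, 1.0, (groups == "medium").sum())
    eps[groups == "high"] = rng.uniform(eps_h, eps_m, (groups == "high").sum())
    return eps

print(simulate_preferences(10))
```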

Baseline Mechanisms. To prove the effectiveness of our method, we introduce two baseline mechanisms, both of which satisfy personalized privacy preservation. Baseline 1 is the proposed method without the Johnson-Lindenstrauss transform; it was designed to determine whether the linear transformation contributes an improvement in utility. Baseline 2 considers the strongest privacy requirement among users: we chose the smallest privacy parameter \(\varepsilon _{min}\) as the uniform standard to perturb the similarity. Baseline 2 also satisfies personalized differential privacy but does not take advantage of it; it is designed to confirm the effectiveness of considering personalized privacy preferences.

6.2 Evaluation

We performed a substantial number of experiments to evaluate the performance of our method. The objective is to predict each user's held-out test rating from the remaining ratings, and we use the root mean square error (RMSE) to measure the effectiveness of our method. RMSE is defined as follows:

$$\begin{aligned} RMSE=\sqrt{\frac{\sum _{u_i,t\in T} (r_{u_it}-\hat{r_{u_it}})^2}{|T|}}. \end{aligned}$$
(12)

where \(r_{u_it}\) is the true rating, and \(\hat{r_{u_it}}\) is the predicted rating. T is the test dataset and |T| is its size. A lower RMSE represents a higher accuracy.
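A direct transcription of Eq. 12 is given below; the array names and the five example values are our own.

```python
# Direct transcription of Eq. 12: root mean square error over the test set T.
import numpy as np

def rmse(true_ratings: np.ndarray, predicted_ratings: np.ndarray) -> float:
    return float(np.sqrt(np.mean((true_ratings - predicted_ratings) ** 2)))

# Example: five held-out ratings and their predictions.
print(rmse(np.array([4.0, 3.0, 5.0, 2.0, 4.0]),
           np.array([3.8, 3.4, 4.6, 2.5, 4.1])))
```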

Table 1 lists the parameters and default values used in our experiments.

Table 1. Values of various parameters

Performance Evaluation of the Johnson-Lindenstrauss Transform. The performance of the Johnson-Lindenstrauss transform was examined by comparing the proposed method with Baseline 1. We used the default values for all other parameters in this comparison and varied the number of neighbors from 5 to 50 in steps of 5 on both datasets, under both the COS and PCC similarity metrics.

Figure 1 shows the results of the comparison. We observe that, no matter which metric was used, increasing the number of neighbors improved performance for both mechanisms. This is because the performance of the proposed method is determined mainly by the quality of the neighbors: within a threshold, a higher number of neighbors gives a higher probability that the selected neighborhood contains the truly close neighbors. In addition, the proposed method outperforms the baseline without the Johnson-Lindenstrauss transform across all k values and in all cases. When \(k = 20\), the RMSE of the proposed method was 0.9645 and 0.9430 on the Netflix and MovieLens datasets, respectively, using the COS metric; Baseline 1's RMSE was 0.9832 and 0.9531, respectively. Similar results can be seen on both datasets in terms of the PCC metric. The Johnson-Lindenstrauss transform improves performance because it makes the rating dataset more compact, which increases the accuracy of the selected neighbors. In other words, the Johnson-Lindenstrauss transform helps to choose much closer neighbors with more similar tastes.

Fig. 1. Performance of the Johnson-Lindenstrauss transform

Performance of Personalized Consideration. To prove the effectiveness of the personalized privacy setting, we compared our method with Baseline 2, which assigns all users the same, strongest privacy parameter. While both mechanisms satisfy users' privacy requirements, Baseline 2 does not consider individual privacy preferences; rather, it simply provides the strongest privacy guarantee to everyone. We use the same settings as outlined in the previous section.

Fig. 2. Performance of the personalized privacy setting

Figure 2 shows the results of the experiments. We observe, from Fig. 2a, that both methods performed better with an increase in the number of neighbors, and that the proposed method always showed better performance than Baseline 2 in terms of RMSE, no matter how many neighbors were selected. Similar results can be seen on both datasets in terms of the PCC metric.

The proposed method considers each user’s individual privacy preferences. For users who are not sensitive about their personal data, we add less noise to hide their information. Therefore, more real information is preserved and more utility is maintained compared to methods that assign the strongest privacy parameter to all users.

Comparison with Other Related Methods. The performance of the proposed method was also compared with other state-of-the-art methods, specifically a traditional Laplace method and the Sample mechanism introduced by Jorgensen [10]. Jorgensen's method works by sampling records with a probability derived from each individual's privacy requirement. We gave all methods the same privacy settings and varied the number of neighbors from 5 to 50 in steps of 5 on both datasets.

Fig. 3. Comparison with other related methods

Figure 3 shows the comparison results. In Fig. 3a, both the Sample mechanism and our method significantly outperform the traditional Laplace method, and the performance of our method is roughly the same as that of the Sample mechanism. Similar results can be seen in the other sub-figures. Although our method performs comparably to the Sample mechanism, it has a lower computational complexity: the complexity of the Sample mechanism is \(O(n^3d)\), while that of our method is \(O(\frac{n^2log(n)}{\delta ^2}+\frac{dnlog(n)}{\delta ^2})\), because the Johnson-Lindenstrauss transform reduces the data dimension from d to m, whereas the Sample mechanism must perform the sampling operation n times for recommendation.

7 Related Work

The concept of personalized privacy was first proposed by Xiao and Tao [15], who present a generalization framework that takes into account personal anonymity requirements, choosing the minimum amount of generalization necessary for each record. Later, Poolsappasit and Ray [16] considered people's differing privacy attitudes towards location information in various situations. Yuan et al. [17] preserved a labelled social network by generalizing and modifying the graph. Wang [18] achieved personalized privacy preservation using a clustering technique. All of these methods are based on K-anonymity, which has proven to be weak and open to attack by newly developed attack models. Differential privacy has since become the preferred approach owing to its rigorous privacy guarantee.

Jorgensen [10] introduced two mechanisms for achieving personalized differential privacy. The first is a sampling mechanism that samples records with a probability associated with each user's privacy preference. The second is based on the Exponential mechanism and is not applicable to our situation. Alaggan et al. [19] proposed a new concept called heterogeneous differential privacy. First, the data are rescaled by a shrinkage matrix associated with users' privacy preferences; the new dataset is then queried, and Laplace noise is added to the result of the query. However, to satisfy heterogeneous differential privacy, the sensitivity of the query on the rescaled dataset must be equal to, or less than, the product of the privacy preference and the sensitivity of the query on the original dataset. This means that users cannot choose their privacy preferences freely, so the method does not strictly provide personalized privacy preservation.

Many works [20,21,22] have focused on privacy-preserving recommendation systems, but none of them considered a personalized privacy setting.

8 Conclusions

In this paper, we consider users’ personal privacy requirements in a recommendation system to provide a personalized privacy preservation method for neighborhood-based collaborative filtering. The method is based on a Laplace mechanism where calibrated noise is added to the query result according to each user’s individual privacy requirements. In addition, a Johnson Lindenstrauss transform-based pre-processing method is introduced to improve performance. We show that the proposed method satisfies personalized differential privacy, and maintains more utility, compared to traditional differential privacy methods that provide a uniform level of privacy to all users.