1 Introduction

During the last few decades we have witnessed the increasing use of recommender systems in various domains to solve the problem of information seeking in an extremely large volume of content. With the help of recommender systems, customers can quickly find things that are interesting or new by narrowing down the set of choices. Meanwhile, service providers using recommender systems can increase sales or click-through rate (CTR) by providing personalized service for customers. For example, McKinsey reported that “35% of what consumers purchase on Amazon and 75% of what they watch on Netflix come from product recommendations”.

The benefits brought by recommender systems are significant. However, their use introduces privacy threats and concerns. In order to provide high-quality recommendations, recommender systems need to collect customers’ private data, such as history data (e.g., the books bought last month or the movies watched last week) and rating data (e.g., the rating given to a book or a movie). However, recommender systems may not be trustworthy. It is common for customers to raise privacy concerns, as the collected data may be shared with, rented to, or sold to third parties. According to a survey by Pew Research, “86% of Internet users have taken steps online to remove or mask their digital footprints” and “68% of Internet users believe current laws are not good enough in protecting people’s privacy online”. It is thus crucial to develop technologies that keep users’ private data protected while enabling personalized recommendation, as a necessary and beneficial complement to non-technical efforts such as privacy policies and related laws.

1.1 Related Work

Cryptography is one of the most important technologies for realizing privacy-preserving recommender systems. Using well-known encryption algorithms, users can transform their private data from meaningful plaintext into meaningless ciphertext, thus achieving privacy preservation. To enable recommender systems to carry out computation over ciphertext directly, the encryption algorithms used have to be homomorphic, that is, the result of operations performed on ciphertext, when decrypted, matches the result of the corresponding operations performed on the plaintext. For example, the Paillier cryptosystem [14] was employed by Erkin et al. [6] and Ma et al. [10], and the ElGamal cryptosystem [5] was used by Zhan et al. [17] and Badsha et al. [1], to realize privacy-preserving recommender systems. However, homomorphic encryption is built on expensive public-key cryptography, whose prohibitive computation cost makes it hard to apply in practice. In addition, Nikolaenko et al. [13] and Liu et al. [9] built privacy-preserving recommender systems based on another renowned cryptographic tool, Yao’s garbled circuits [8, 16]. However, these approaches require the existence of a trusted third party, which also hinders their application in practice.

To overcome the weaknesses of cryptography-based techniques, Polat and Du [15] proposed a Randomized Perturbation (RP) technique which adds noise to users’ private data before the data are released to recommender systems. RP is much faster than cryptography-based techniques, but at the cost of recommendation accuracy and the degree of privacy protection. In particular, the smaller the random noise added, the more accurate the predicted ratings, but also the weaker the privacy guarantee. For example, the data reconstruction methods proposed by Zhang et al. [18] can recover more of the original data when smaller noise is injected. Therefore, a trade-off between accuracy and privacy has to be made when applying RP.

The above work aims at keeping users’ private data secret during recommendation computation. However, the output of recommender systems can also be exploited by malicious users to make inferences about other users’ private data [12]: based on the recommendation she receives, a malicious user can guess whether someone else has, for example, bought a certain book or seen a certain movie. To avoid this kind of information leakage, Differential Privacy (DP) [3, 4] has recently been introduced into recommender systems [7, 11, 12]. By adding noise, DP guarantees that the distribution of the recommendation is insensitive to any individual user’s data, thus preventing the inference of any single user’s data from the recommendation. Due to the injected noise, DP also needs to strike a balance between recommendation accuracy and the degree of privacy protection. In addition, DP does not protect users’ data from the recommender system itself, which has full access to users’ data in the clear.

1.2 Contributions

From the above discussion, a privacy-friendly recommender system should respect user privacy at two stages: (1) it does not ask users to submit their original data in the data collection stage; and (2) it prevents the inference of any single user’s data from the final recommendation result in the normal execution stage. To the best of our knowledge, however, these two aspects have not been considered simultaneously. In this paper, we aim at designing an approach that hides users’ private data and prevents privacy inference at the same time. At first glance, an intuitive solution is to integrate the techniques mentioned above. Nevertheless, there are some interesting issues worthy of investigation but largely overlooked by recent studies. For example, the amount of noise injected by DP is based on the sensitivity of a query function, which is sometimes not easy to estimate, especially when the underlying data have been disguised by RP or encryption. Moreover, since DP and RP both introduce noise into the original data, will the recommendation accuracy inevitably become worse? And what is the trade-off between accuracy and privacy in this new context?

As the initial step towards more privacy-friendly recommender systems, we propose in this paper a hybrid approach which combines RP and DP. Specifically, users mask their original data through RP and send the disguised values to the recommender system, which injects calibrated noise again to the perturbed data to achieve DP. Our contributions are summarized as follows:

  • We design a hybrid approach for privacy-preserving recommender systems by combining DP with RP. Compared with existing works, our approach provides stronger privacy guarantees, as users’ private data are kept secret and no one can infer any single user’s data from the recommendation result.

  • We theoretically show the noise added by RP has limited effect on recommendation accuracy and the noise added by DP can be well controlled based on the sensitivity analysis of functions on the perturbed data.

  • We conduct extensive experiments to evaluate the performance of our hybrid approach on three large-scale real-world datasets. The results show that the combination of DP and RP is feasible in practice. In general, it provides stronger privacy protection with acceptable accuracy loss, and, surprisingly, it sometimes achieves better privacy and accuracy at the same time.

The rest of the paper is organized as follows. Section 2 introduces a representative non-private recommendation algorithm and some background knowledge. Section 3 presents the detailed design of the hybrid approach. Section 4 discusses the experimental results and Sect. 5 concludes the paper.

2 Preliminaries

2.1 Recommendation Algorithm Without Privacy Guarantee

We first describe a recommendation algorithm [12] without privacy guarantees. Suppose there are n users and m items. Based on the data provided by the n users, the recommender system has two matrices in hand. One is a rating matrix \(R_{n\times m}\) that contains the ratings of the n users for the m items, where \(r_{ui}\) denotes the rating of user u for item i. The other is an auxiliary (binary) matrix \(E_{n\times m}\) indicating the presence of ratings, where \(e_{ui}=1\) means u has rated item i and \(e_{ui}=0\) means u has not. The two matrices are the input to the recommendation algorithm, while the output is the predicted ratings of items that users have not rated.

Some users tend to give higher ratings than others, and some items tend to receive higher ratings than others. These biases can degrade the recommendation result, so it is necessary to subtract user effects and item effects from the ratings. We first compute the global average of \(R_{n\times m}\):

$$\begin{aligned} GAvg = \frac{\sum _{R}{r_{ui}}}{\sum _{E}{e_{ui}}} \end{aligned}$$

Then, we center ratings by computing and subtracting average ratings for items and users:

$$\begin{aligned} r'_{ui} = r_{ui}-UAvg(u) \end{aligned}$$
$$\begin{aligned} UAvg(u) = \frac{\sum _{i}{(r_{ui}-IAvg_i)}+\beta _u \cdot GAvg}{\sum _{i}{e_{ui}}+\beta _u },~~ IAvg_i = \frac{\sum _{u}{r_{ui}}+\beta _m \cdot GAvg}{\sum _{u}{e_{ui}}+\beta _m } \end{aligned}$$

where IAvg and UAvg are dampened by \(\beta _m\) and \(\beta _u\) fictitious ratings of the global average, respectively. Here, \(\beta _m\) is the average number of ratings per item, and \(\beta _u\) is the average number of items rated per user.

Finally, we use the centered ratings to calculate the covariance matrix, which indicates the relationships between items:

$$\begin{aligned} Cov(ij)=\frac{\sum _{u}{w_u r'_{ui} r'_{uj}}}{\sum _{u}{w_u e_{ui} e_{uj}}} \end{aligned}$$

where \(w_u\) is a per-user weight equal to the reciprocal of \(||e_u||\). The final recommendation can be made by passing this covariance matrix to any of a number of learning and prediction algorithms, such as the k-nearest neighbor (kNN) method proposed by Bell and Koren [2].
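To make the data flow concrete, the following is a minimal NumPy sketch of this preprocessing (our own illustration, not the authors’ code). It assumes R holds zeros for unrated entries, E is the corresponding 0/1 indicator matrix, and beta_m, beta_u are the damping parameters described above.

```python
import numpy as np

def covariance_matrix(R, E, beta_m, beta_u):
    """Non-private preprocessing: damped averages, centering, item covariance."""
    g_avg = R.sum() / E.sum()                                   # global average GAvg

    # Per-item averages IAvg_i, damped by beta_m fictitious ratings at GAvg.
    i_avg = (R.sum(axis=0) + beta_m * g_avg) / (E.sum(axis=0) + beta_m)

    # Per-user averages UAvg(u) of (r_ui - IAvg_i), damped by beta_u ratings at GAvg.
    resid = (R - i_avg) * E
    u_avg = (resid.sum(axis=1) + beta_u * g_avg) / (E.sum(axis=1) + beta_u)

    # Centered ratings r'_ui = r_ui - UAvg(u), zeroed where unrated.
    R_c = (R - u_avg[:, None]) * E

    # Per-user weights w_u = 1/||e_u|| and the weighted item-item covariance.
    w = 1.0 / np.maximum(E.sum(axis=1), 1)
    num = np.einsum('u,ui,uj->ij', w, R_c, R_c)
    den = np.einsum('u,ui,uj->ij', w, E, E)
    return num / np.maximum(den, 1e-12)
```

The dense einsum products are kept for readability; a real implementation would exploit the sparsity of R and E.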

2.2 Differential Privacy (DP)

Intuitively, differential privacy requires that the output of a computation reveal almost nothing about whether any individual record is present in or absent from its input: an attacker observing the output cannot distinguish the two cases [3, 4]. The formal definition is as follows:

Definition 1

A randomized function f provides \(\epsilon \)-differential privacy if for any neighboring databases A and B (\(A \bigtriangleup B = 1\)), and any subset S of possible outcomes Range(f),

$$\begin{aligned} Pr[f(A)\in S] \le exp(\epsilon ) \times Pr[f(B) \in S] \end{aligned}$$

Two datasets A and B are adjacent if they differ in exactly one individual record (\(A \bigtriangleup B = 1\)). The parameter \(\epsilon \) is the privacy budget, which controls the level of privacy protection: the smaller the value of \(\epsilon \), the stronger the protection. DP guarantees the output is insensitive to any individual record, since the ratio between the probabilities of observing any given output with and without an individual record is bounded by exp(\(\epsilon \)). DP also satisfies a composability property: a sequence of computations \(f_i(A)\), each providing \(\epsilon _i\)-differential privacy, provides \((\sum _{i}{\epsilon _i})\)-differential privacy in total. Therefore, \(\epsilon \) can be considered an accumulative privacy cost that grows as more steps are executed, until it reaches the allotted privacy budget.
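As a simple illustration of composability (our own example, using the budget allocation adopted later in Sect. 4.1): if three differentially private measurements are released with budgets \(\epsilon _1\), \(\epsilon _2\) and \(\epsilon _3\), the sequence as a whole provides \((\epsilon _1+\epsilon _2+\epsilon _3)\)-differential privacy, e.g.,

$$\begin{aligned} \epsilon = \epsilon _1 + \epsilon _2 + \epsilon _3 = 0.02\,\epsilon + 0.19\,\epsilon + 0.79\,\epsilon \end{aligned}$$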

A common way to obtain differential privacy is by applying random noise to the measurement. The amount of noise added depends on the \(L_1\)-sensitivity of the evaluated function, which is the largest possible change in the measurement given a change in a single record in the dataset. In general, the \(L_k\)-sensitivity of a function f is given by:

$$\begin{aligned} S_k(f) = \max _{(A \bigtriangleup B=1)}||f(A)-f(B)||_k \end{aligned}$$

where \(||\cdot ||_k\) denotes the \(L_k\)-norm.

Given a function \(f: D \rightarrow \mathbb {R}^d\), the Laplace mechanism achieves \(\epsilon \)-differential privacy by adding noise sampled from the Laplace distribution with a calibrated scale \(b=S_1(f)/\epsilon \) to each of the d components. The following computation maintains \(\epsilon \)-differential privacy:

$$\begin{aligned} K(x)=f(x)+(Laplace(S_1(f)/\epsilon ))^d \end{aligned}$$
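A minimal sketch of the Laplace mechanism (our own illustration; the function name and arguments are not part of the paper):

```python
import numpy as np

def laplace_mechanism(value, sensitivity, epsilon, rng=None):
    """Return f(x) + Laplace(S_1(f)/epsilon) noise, applied component-wise."""
    rng = rng or np.random.default_rng()
    scale = sensitivity / epsilon                  # b = S_1(f) / epsilon
    return value + rng.laplace(0.0, scale, size=np.shape(value))

# e.g. a differentially private count (sensitivity 1, epsilon = 0.1):
noisy_count = laplace_mechanism(42, sensitivity=1.0, epsilon=0.1)
```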

2.3 Randomized Perturbation

The basic idea of randomized perturbation is to perturb the data in such a way that certain computations can still be carried out while preserving users’ privacy. Although the data from each individual user are scrambled, if the number of users is significantly large, the aggregate information of these users can be estimated with decent accuracy. This property is very useful for computations based on aggregate information, such as the scalar product and the sum.

Let \(r^a\) and \(r^b\) be the original vectors, where \(r^a = (r^a_1,\ldots ,r^a_n)\) and \(r^b = (r^b_1,\ldots ,r^b_n)\). \(r^a\) is disguised by \(v^a = (v^a_1,\ldots ,v^a_n)\), and \(r^b\) by \(v^b = (v^b_1,\ldots ,v^b_n)\), where the entries of \(v^a\) and \(v^b\) are uniformly distributed in the domain \([-\gamma ,\gamma ]\). Let \(r'^a = r^a + v^a\) and \(r'^b = r^b + v^b\) be the disguised data that are known. Because \( v^a\) and \( v^b\) are uniformly distributed, the scalar product of \(r^a\) and \(r^b\) can be estimated from \(r'^a\) and \(r'^b\), and the sum of the values of \(r^a\) can be estimated from \(r'^a\), as follows:

$$\begin{aligned} \sum _{i=1}^{n}(r_i+v_i) = \sum _{i=1}^{n}r_i +\sum _{i=1}^{n}v_i \approx \sum _{i=1}^{n}r_i \end{aligned}$$
(1)
$$\begin{aligned} r'^a \cdot r'^b = \sum _{i=1}^{n}(r^a_ir^b_i + r^b_i v^a_i + r^a_i v^b_i + v^a_i v^b_i) \approx \sum _{i=1}^{n}r^a_i r^b_i \end{aligned}$$
(2)
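The following small simulation (ours, not from the paper; parameter values are arbitrary) illustrates properties (1) and (2): with many users, sums and scalar products computed on the disguised vectors stay close to those computed on the original data.

```python
import numpy as np

rng = np.random.default_rng(0)
n, gamma = 100_000, 2.0

r_a = rng.integers(1, 6, size=n).astype(float)        # original ratings in 1..5
r_b = rng.integers(1, 6, size=n).astype(float)
ra_d = r_a + rng.uniform(-gamma, gamma, size=n)        # disguised r'^a
rb_d = r_b + rng.uniform(-gamma, gamma, size=n)        # disguised r'^b

print(ra_d.sum(), r_a.sum())        # Eq. (1): sums nearly equal
print(ra_d @ rb_d, r_a @ r_b)       # Eq. (2): scalar products nearly equal
```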

3 The Hybrid Approach

Figure 1 shows the whole life-cycle of a typical recommender system armed with our hybrid privacy-preserving approach. Three stages are involved in the process of recommendation: data collection, data publication and data prediction. In the first stage, users’ original data are disguised through randomized perturbation, resulting in a perturbed rating matrix R and auxiliary matrix E. Based on the two perturbed matrices, the recommender system computes the global average, item averages, user averages, and finally the covariance matrix for data publication. All these quantities are masked with a calibrated amount of noise to guarantee differential privacy. With the added noise, the covariance matrix is ready for publication and can be fed into an existing learning and prediction algorithm (e.g., the kNN method [2]) with no changes. As mentioned earlier, the challenge is how to ensure recommendation accuracy and realize differential privacy on the perturbed data effectively.

Fig. 1. Overview of the hybrid approach for privacy-preserving recommender systems

3.1 Methodology

In the data collection stage, the recommender system decides on a range \([-\gamma ,\gamma ]\) and announces it to the users. Then, each user u disguises her ratings \(r_{ui}\) by adding noise uniformly distributed in \([-\gamma ,\gamma ]\). The recommender system collects these disguised data \(r'_{ui}\) to form the two perturbed matrices.

In the data publication stage, the recommender system injects noise into three kinds of values: the global average, the per-item averages and the covariance matrix. Note that the amount of noise differs across these values and depends on the sensitivity of the underlying functions that compute them. For the global average, Laplace-distributed noise is added to guarantee its privacy:

$$\begin{aligned} GAvg = \frac{\sum _{R}{r'_{ui}}+Laplace(\varDelta r_1/\epsilon _1)}{\sum _{E}{e_{ui}}+Laplace(\varDelta r'_1/\epsilon _1)} \end{aligned}$$

where \(\varDelta r_1\) and \(\varDelta r'_1\) are the sensitivities of the functions \(\sum _{R}{r'_{ui}}\) and \(\sum _{E}{e_{ui}}\), respectively. Their exact values will be discussed in the next subsection. We then use the global average to produce a dampened per-item average rating, adding \(\beta _m\) fictitious ratings at value GAvg for each item:

$$\begin{aligned} IAvg_i = \frac{(\sum _{u}{r'_{ui}}+Laplace(\varDelta r_2/ \epsilon _2))+\beta _m \cdot GAvg}{(\sum _{u}{e_{ui}}+Laplace(\varDelta r'_2/ \epsilon _2))+\beta _m } \end{aligned}$$

where \(\varDelta r_2\) and \(\varDelta r'_2\) are the sensitivities of \(\sum _{u}{r'_{ui}}\) and \(\sum _{u}{e_{ui}}\), respectively.

Having published the average rating for each item, we center the ratings of each user as follows, with shrinking parameter \(\beta _u\) toward the global average, where \(c_u\) is the number of ratings by user u:

$$\begin{aligned} UAvg(u) = \frac{\sum _{i}{(r'_{ui}-IAvg_i)}+\beta _u \cdot GAvg}{c_u+\beta _u } \end{aligned}$$

We subtract the user average from the corresponding ratings and clamp the resulting centered ratings to the interval \([-B,B]\), which lowers the sensitivity of the measurements at the expense of the relatively few remaining large entries:

$$\begin{aligned} \hat{r}_{ui} = \left\{ \begin{array}{ll} -B &{} \text {if}\ r'_{ui}-UAvg(u)<-B\\ r'_{ui}-UAvg(u) &{} \text {if}\ -B \le r'_{ui}-UAvg(u)<B\\ B &{} \text {if}\ r'_{ui}-UAvg(u)\ge B \end{array} \right. \end{aligned}$$

The final measurement we make of the private data is the covariance of the perturbed and clamped user rating vectors. To retain the differences between users, we take non-uniform averages using per-user weights \(w_u\), each equal to the reciprocal of \(||e_u||\). The covariance is then published as:

$$\begin{aligned} Cov(ij) =\frac{\sum _{u}{w_u \hat{r}_{ui}\hat{r}_{uj}}+Laplace(\varDelta r_3/\epsilon _3)}{\sum _{u}{w_u e_{ui} e_{uj}}+Laplace(\varDelta r'_3/\epsilon _3)} \end{aligned}$$

where \(\varDelta r_3\) and \(\varDelta r'_3\) are the sensitivities of \(\sum _{u}{w_u \hat{r}_{ui}\hat{r}_{uj}}\) and \(\sum _{u}{w_u e_{ui} e_{uj}}\), respectively.
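Putting the publication stage together, the following NumPy sketch (our own illustration; the function name, argument list and dense-matrix layout are ours) injects the Laplace noise described above into the RP-disguised matrices, using the sensitivity values derived in Sect. 3.2.

```python
import numpy as np

def publish_hybrid(R_pert, E, gamma, tau, B, beta_m, beta_u,
                   eps1, eps2, eps3, rng=None):
    """Data publication on RP-disguised ratings R_pert (zeros where unrated)."""
    rng = rng or np.random.default_rng()
    lap = lambda sens, eps, size=None: rng.laplace(0.0, sens / eps, size)

    # Sensitivities from Sect. 3.2.
    d_r1 = d_r2 = tau + 2 * gamma
    d_r3 = 2 * B * (tau + 2 * gamma) + 3 * B ** 2
    d_e1, d_e2, d_e3 = 1.0, 1.0, 3.0

    # Noisy global average.
    g_avg = (R_pert.sum() + lap(d_r1, eps1)) / (E.sum() + lap(d_e1, eps1))

    # Noisy, damped per-item averages.
    m = R_pert.shape[1]
    i_avg = ((R_pert.sum(axis=0) + lap(d_r2, eps2, m)) + beta_m * g_avg) / \
            ((E.sum(axis=0) + lap(d_e2, eps2, m)) + beta_m)

    # Per-user averages (no extra noise, as in Sect. 3.1).
    c_u = E.sum(axis=1)
    u_avg = (((R_pert - i_avg) * E).sum(axis=1) + beta_u * g_avg) / (c_u + beta_u)

    # Center and clamp to [-B, B], keeping rated entries only.
    R_hat = np.clip(R_pert - u_avg[:, None], -B, B) * E

    # Noisy weighted covariance, published to the learning algorithm.
    w = 1.0 / np.maximum(c_u, 1)
    num = np.einsum('u,ui,uj->ij', w, R_hat, R_hat) + lap(d_r3, eps3, (m, m))
    den = np.einsum('u,ui,uj->ij', w, E, E) + lap(d_e3, eps3, (m, m))
    return num / den
```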

3.2 Theoretical Analysis

As mentioned earlier, different functions have different sensitivities, which determine the amount of noise needed for differential privacy. In this subsection, we analyze the sensitivities of the functions involved in data publication. The sensitivity values \(\varDelta r_1\) and \(\varDelta r_2\) are both \( \tau + 2\gamma \), where \( \tau \) is the maximum possible difference between raw ratings and \(\gamma \) is the parameter of RP. For example, if ratings range from 1 to 5, then \(\tau \) equals 4. By Theorem 2, the sensitivity value \(\varDelta r_3\) is \(2B(\tau + 2\gamma )+3B^2\). The values of \(\varDelta r'_1\) and \(\varDelta r'_2\) are both 1, because the maximum possible difference in the auxiliary matrix is 1 when \(e^a\) and \(e^b\) differ in only one entry. The value of \(\varDelta r'_3\) is 3, as shown in Theorem 3.

Theorem 1

Let \(r^a\) and \(r^b\) differ on one rating, and let \(\tau \) be the maximum possible difference in raw ratings. Considering the randomized perturbation applied before the data are collected, the maximum possible difference in the perturbed ratings is \(\tau + 2\gamma \). For the centered and clamped ratings \(\widehat{r}^a\) and \(\widehat{r}^b\), we have

$$\begin{aligned} ||\widehat{r}^a-\widehat{r}^b||_1 \le \tau + 2\gamma +B \end{aligned}$$

Proof:

Let \(r^a\) and \(r^b\) be two sets of ratings that differ in one rating, present in \(r^b\) at \(r^b_{ui}\); all other ratings are equal, so the difference only affects the ratings of user u. For the ratings in common between \(r^a\) and \(r^b\), the difference is at most the difference in the subtracted averages:

$$\begin{aligned} |UAvg(u)^b-UAvg(u)^a|=\frac{|r_{ui}-UAvg(u)^a|}{c^b_u + \beta _u} \le \frac{\tau + 2\gamma }{c^b_u + \beta _u} \end{aligned}$$

For the new rating \(r_{ui}\), its previous contribution of zero is replaced with the new centered and clamped rating, at most B in magnitude. Hence, we have

$$\begin{aligned} ||\widehat{r}^a-\widehat{r}^b||_1 \le c^a_u \times \frac{\tau + 2\gamma }{c^b_u + \beta _u} + B \end{aligned}$$

Note that \(c^b_u = c^a_u + 1\), so \(c^a_u/(c^b_u + \beta _u) < 1\). Therefore, the upper bound of \(||\widehat{r}^a-\widehat{r}^b||_1\) is \(\tau + 2\gamma +B\).    \(\square \)

Theorem 2

Let \(r^a\) and \(r^b\) differ on one rating, and let \(\widehat{r}^a\) and \(\widehat{r}^b\) be the corresponding centered and clamped ratings. Taking \(w_u = 1/||e_u||_1\), we have

$$\begin{aligned} ||w^a_u \hat{r}^a_{ui} \hat{r}^a_{uj}- w^b_u \hat{r}^b_{ui} \hat{r}^b_{uj}|| \le 2B(\tau + 2\gamma )+3B^2 \end{aligned}$$

Proof:

For the difference \(w^a_u \hat{r}^a_{ui} \hat{r}^a_{uj}- w^b_u \hat{r}^b_{ui} \hat{r}^b_{uj}\), we can rewrite it as \( w^a_u \hat{r}^a_{ui}( \hat{r}^a_{uj}-\hat{r}^b_{uj})+ w^b_u (\hat{r}^a_{ui}-\hat{r}^b_{ui}) \hat{r}^b_{uj}+(w^a_u-w^b_u)\hat{r}^a_{ui} \hat{r}^b_{uj}\). Since \(\big|\,||e^b_u|| -||e^a_u||\,\big|\le 1\), we have

$$\begin{aligned} w^a_u - w^b_u = \frac{1}{||e^a_u||} - \frac{1}{||e^b_u||} \le \frac{1}{||e^a_u|| ||e^b_u||} \end{aligned}$$

The overall difference is then bounded by

$$\begin{aligned} \left( \frac{||\hat{r}^a_i||}{||e^a_i||}+\frac{||\hat{r}^b_i||}{||e^b_i||}\right) ||\hat{r}^a_i-\hat{r}^b_i|| + \frac{||\hat{r}^a_i||\,||\hat{r}^b_i||}{||e^a_i||\,||e^b_i||} \end{aligned}$$

Using \(||\hat{r}^a_i|| \le B\,||e^a_i||\), \(||\hat{r}^b_i|| \le B\,||e^b_i||\) and the bound \(||\widehat{r}^a-\widehat{r}^b||_1 \le \tau + 2\gamma +B\) from Theorem 1, we obtain the upper bound \(2B(\tau + 2\gamma )+3B^2\).    \(\square \)

Theorem 3

Let \(e^a\) and \(e^b\) differ in the presence or absence of one rating. Taking \(w_u = 1/||e_u||_1\), we have

$$\begin{aligned} ||w^a_u e^a_{ui}e^a_{uj}- w^b_u e^b_{ui} e^b_{uj}|| \le 3 \end{aligned}$$

Proof:

Similarly, we can rewrite the difference as \(w^a_u e^a_{ui}( e^a_{uj}-e^b_{uj})+ w^b_u (e^a_{ui}-e^b_{ui}) e^b_{uj}+(w^a_u-w^b_u)e^a_{ui} e^b_{uj}\). Then, we have the following bound:

$$\begin{aligned} ||w^a_u e^a_{ui}e^a_{uj}- w^b_u e^b_{ui} e^b_{uj}|| \le \left( \frac{||e^a_i||}{||e^a_i||}+\frac{||e^b_i||}{||e^b_i||}\right) ||e^a_i-e^b_i|| + \frac{||e^a_i||\,||e^b_i||}{||e^a_i||\,||e^b_i||} \le 3 \end{aligned}$$

   \(\square \)

The above theorems show that the hybrid approach takes into account the effect of noise introduced by RP on the noise injected by DP. If we directly use DP without considering the noise of RP, we will obtain weaker privacy protection. This result is guaranteed by the following theorem.

Theorem 4

The hybrid approach can provide stronger privacy protection than DP when raw rating data are disguised by RP.

Proof:

First note that the sensitivities of \(\sum _{R}{r'_{ui}}\) and \(\sum _{R}{r_{ui}}\) are \(\tau + 2\gamma \) and \(\tau \), respectively. To provide \(\epsilon _1\)-differential privacy for the global average, the hybrid approach injects noise v based on \(Laplace(\frac{\tau + 2\gamma }{\epsilon _1})\). As DP does not consider the noise introduced by RP, it injects noise \(v'\) based on \(Laplace(\frac{\tau }{\epsilon _1})\). The noise \(v'\) on the disguised data can actually provide only \(\epsilon '_1\)-differential privacy, where \(\frac{\tau + 2\gamma }{\epsilon '_1}=\frac{\tau }{\epsilon _1}\). Thus we have \(\epsilon '_1=\frac{\tau +2\gamma }{\tau }\epsilon _1>\epsilon _1\). Likewise, we have \(\epsilon '_2>\epsilon _2\) and \(\epsilon '_3>\epsilon _3\) based on the sensitivity values given in Theorem 2, where \(\epsilon '_2\) and \(\epsilon '_3\) are the actual privacy budgets DP can provide for the item averages and the covariance, respectively. According to the composability property of differential privacy, we have \(\epsilon =\epsilon _1+\epsilon _2+\epsilon _3\) and \(\epsilon '=\epsilon _1'+\epsilon _2'+\epsilon _3'\). Clearly, \(\epsilon <\epsilon '\), which completes the proof.    \(\square \)
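As a concrete illustration (our own numbers): with 1-to-5 star ratings (\(\tau = 4\)) and RP range \(\gamma = 1\), ignoring the RP noise inflates the effective budget for the global average by a factor of \((\tau + 2\gamma )/\tau = 1.5\),

$$\begin{aligned} \epsilon '_1 = \frac{\tau +2\gamma }{\tau }\,\epsilon _1 = \frac{4+2\cdot 1}{4}\,\epsilon _1 = 1.5\,\epsilon _1 > \epsilon _1 \end{aligned}$$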

We now examine the effect of the noise added by RP in the hybrid approach on recommendation accuracy. As mentioned earlier, though the raw ratings from each user are perturbed, the aggregate information of these ratings can be estimated with decent accuracy if the number of users is significantly large. Suppose GAvg is the global average of the perturbed data collected in the hybrid approach and \(GAvg^*\) is the global average of the raw data collected in DP. Clearly, we have

$$\begin{aligned} GAvg = \frac{\sum _{R}{r'_{ui}}+Laplace(\varDelta r_1/\epsilon _1)}{\sum _{E}{e_{ui}}+Laplace(\varDelta r'_1/\epsilon _1)},\quad GAvg^* = \frac{\sum _{R}{r_{ui}}+Laplace(\varDelta r_1^*/\epsilon _1)}{\sum _{E}{e_{ui}}+Laplace(\varDelta r'_1/\epsilon _1)} \end{aligned}$$

where \(\varDelta r_1\), \(\varDelta r'_1\), and \(\varDelta r_1^*\) are the sensitivities of the functions \(\sum _{R}{r'_{ui}}\), \(\sum _{E}{e_{ui}}\), and \(\sum _{R}{r_{ui}}\), respectively. If R is sufficiently large, we have \(\sum _{R}{r'_{ui}}\approx \sum _{R}{r_{ui}}\), as the noise injected into \(r_{ui}\) is uniformly sampled from \([-\gamma ,\gamma ]\). Moreover, note that \(\sum _{R}{r'_{ui}}\gg Laplace(\varDelta r_1/\epsilon _1)\). Thus, we have \(GAvg \approx GAvg^*\). Likewise, we can conclude that \(Cov(ij) \approx Cov^*(ij)\), which indicates that the noise added by RP in the hybrid approach has limited effect on recommendation accuracy.
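This argument can also be checked numerically; below is a small simulation (ours, with arbitrary parameter values) comparing the noisy global average computed from RP-disguised ratings against the one computed from raw ratings.

```python
import numpy as np

rng = np.random.default_rng(1)
n, gamma, tau, eps1 = 1_000_000, 1.0, 4.0, 0.02

r = rng.integers(1, 6, size=n).astype(float)               # raw ratings
r_pert = r + rng.uniform(-gamma, gamma, size=n)             # RP-disguised ratings

# Hybrid: Laplace scale (tau + 2*gamma)/eps1; plain DP on raw data: tau/eps1.
g_avg = (r_pert.sum() + rng.laplace(0, (tau + 2 * gamma) / eps1)) / \
        (n + rng.laplace(0, 1 / eps1))
g_avg_star = (r.sum() + rng.laplace(0, tau / eps1)) / \
             (n + rng.laplace(0, 1 / eps1))

print(g_avg, g_avg_star)   # both close to the true mean of r (about 3.0)
```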

4 Experiments

4.1 Experimental Setting

In this section, we evaluate our hybrid approach for privacy-preserving recommender systems. As discussed earlier, both RP and DP introduce noise into the recommendation computation, so it is worth studying the prediction accuracy when the two techniques are combined. Therefore, we examined each of the three methods (i.e., RP, DP, and the hybrid one) in turn to see its effect on recommendation accuracy. All experiments were conducted on three real-world datasets: Netflix consists of roughly 100 M ratings of 17770 movies contributed by 480 K users; MovieLens consists of 100 K ratings of 1682 movies contributed by 943 users; Yahoo consists of 23 M ratings of 11915 movies contributed by 7742 users. Ratings in all three datasets range from 1 to 5.

By adjusting the parameters of the noise distributions we use (i.e., \(\gamma \) of RP and \(\epsilon \) of DP), our approach provides different randomized perturbation and differential privacy guarantees, and consequently the recommendation outputs have different accuracy values. In our experiments, the recommendation accuracy is measured by the root mean squared error (RMSE) on the test sets: \( RMSE = \sqrt{\frac{\sum _{X} (x - x')^2}{|X|}}\), where X consists of all ratings to be predicted in the test set, |X| is the size of X, \(x'\) is the predicted value and x is the original value in the test set. A smaller RMSE value indicates a more accurate recommendation result. Regarding the training and test sets, the MovieLens data is divided into two parts, 80% for training and 20% for testing. For the Netflix data, the test set is the Probe set. For the Yahoo dataset, the training set contains 7642 users and the test set 2309 users; the test set is gathered chronologically after the training set.
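For reference, a short RMSE implementation matching the formula above (our own snippet; the sample values are hypothetical):

```python
import numpy as np

def rmse(predicted, actual):
    """Root mean squared error over the test ratings."""
    predicted, actual = np.asarray(predicted, float), np.asarray(actual, float)
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))

print(rmse([3.8, 2.1, 4.5], [4, 2, 5]))   # ~0.32
```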

We applied the kNN method [2] to the covariance matrix for the final recommendation. The value of k is fixed at 20, and the clamping parameter B is set to 1. Following the work in [12], for any \(\epsilon \), we set the respective \(\epsilon _i\) as follows: \( \epsilon _1 = 0.02 \times \epsilon , \epsilon _2 = 0.19 \times \epsilon , \epsilon _3 = 0.79 \times \epsilon \). All experiments were conducted on a Dell PowerEdge R930 server equipped with a 2.2 GHz CPU and 2 TB of RAM. Each experiment was run 10 times and the average results are reported.

4.2 Experimental Results

4.2.1 RP’s Effect on Accuracy. 

Figure 2 shows the recommendation accuracy when \(\gamma \) increases from 0.5 to 3.5 in steps of 0.5. Clearly, the accuracy decreases on all three datasets. This is because the amount of noise added by RP is determined by the parameter \(\gamma \): the larger \(\gamma \) is, the wider the range from which the random noise is drawn. More randomness is thus added to the original data, resulting in less accurate recommendations.

Fig. 2. RP’s effect on recommendation accuracy

Fig. 3. DP’s effect on recommendation accuracy

4.2.2 DP’s Effect on Accuracy. 

Figure 3 shows the recommendation accuracy when \(\epsilon \) increases from 0.1 to 10. From the results, we can see that the accuracy degrades rapidly when DP provides a strong privacy guarantee (i.e., when \(\epsilon <1\)). When the privacy protection is weak (i.e., when \(\epsilon > 1\)), the accuracy approaches a constant. These observations imply that a large \(\epsilon \) contributes little to the accuracy but weakens the privacy protection.

Fig. 4. RP’s effect on DP

Fig. 5. RMSE ratio between the hybrid approach and DP

4.2.3 Effect of the Hybrid Approach on Accuracy. 

We first examine the hybrid approach by taking DP as the baseline. Figure 4 depicts how the recommendation accuracy of DP is affected by RP. No matter which \(\gamma \) is used in RP, the overall trend of DP remains the same: the accuracy decreases as \(\epsilon \) approaches 0, indicating a stronger privacy guarantee. Besides, when DP and RP work together, a larger \(\gamma \) often leads to lower accuracy. For example, DP plus RP with \(\gamma =0.5\) is more accurate than DP plus RP with \(\gamma =3.5\) on the MovieLens dataset. Finally, it is worth noting that in most cases the combination of DP and RP makes the recommendation less accurate, which coincides with our intuition, as both techniques introduce noise into the original data. However, their combination sometimes results in a win-win situation where both accuracy and privacy become better, as seen in Fig. 4(b). To make this clear, we plot in Fig. 5 the RMSE ratio between the hybrid approach and DP. For the Netflix dataset, the combination of DP and RP sometimes outperforms DP alone, especially when \(\gamma \) is small, for example, 0.5. Besides, the accuracy loss of DP plus RP is acceptable on the MovieLens and Netflix datasets, but not satisfactory on the Yahoo dataset. A possible reason is that the MovieLens and Netflix datasets have similar rating distributions, which differ from that of the Yahoo dataset.

We then examine the hybrid approach by taking RP as the baseline. Figure 6 depicts how the recommendation accuracy of RP is affected by DP. We can again see that the overall trend of RP remains the same (i.e., the accuracy decreases as \(\gamma \) increases) no matter which \(\epsilon \) is used in DP. Besides, a smaller \(\epsilon \) is more likely to make the recommendation less accurate, as a stronger privacy guarantee is provided in DP by injecting more noise into the data. We also notice that, for the MovieLens dataset, the combination of RP and DP makes the recommendation less accurate, but within an acceptable range. For the Netflix dataset, however, their combination is indeed a good choice, as we obtain an additional privacy guarantee without sacrificing recommendation accuracy. Note further that this holds for any combination of \(\gamma \) and \(\epsilon \) in our experiments, as shown in Fig. 7(b). The accuracy loss remains unsatisfactory on the Yahoo dataset, especially when DP provides a strong privacy guarantee, as depicted in Fig. 7(c).

Fig. 6. DP’s effect on RP

Fig. 7. RMSE ratio between the hybrid approach and RP

4.2.4 Efficiency of the Hybrid Approach. 

Figure 8 shows the running time of the hybrid approach. The running time increases as the rating matrix becomes larger, but the total computation cost is acceptable even on a moderate server. In particular, for the MovieLens dataset, whose rating matrix is about 1000 * 1700, the hybrid approach needs only 33 s. Even for the large Netflix dataset, whose rating matrix is 380 K * 500, the hybrid approach completes in less than 2.5 h. The computation cost of the hybrid approach mainly comes from DP, as RP requires only a few simple operations and is performed at the user side. Thus, the hybrid approach has the same computation complexity as DP, but provides a stronger privacy guarantee, as shown in Theorem 4.

4.2.5 The Hybrid Approach vs. DP. 

Figure 9 depicts the accuracy comparison of the hybrid approach and DP when the raw rating data are disguised by RP with different \(\gamma \). The privacy budget \(\epsilon \) is set to 1 in both methods. On all datasets, DP achieves better accuracy than the hybrid approach. This, however, precisely shows that DP cannot provide a sufficient privacy guarantee over the perturbed data, as it underestimates the sensitivity of the functions on the perturbed data. This result coincides with Theorem 4, which says the hybrid approach provides stronger privacy protection than DP over data disguised by RP. Further, the RMSE difference between the two methods is small, which means the hybrid approach incurs an acceptable accuracy loss while providing a stronger privacy guarantee.

Fig. 8. Efficiency of the hybrid approach

Fig. 9. RMSE of the hybrid approach and DP over perturbed data

4.2.6 Summary. 

From the above discussion, we can see that, by carefully injecting appropriate noise into the perturbed data based on the sensitivity analysis of the different functions involved in the recommendation computation, the hybrid approach provides stronger privacy protection with acceptable accuracy loss. More interestingly, the hybrid approach does not necessarily lead to lower recommendation accuracy, which at first contradicts our intuition but is validated by the experiments on the Netflix dataset, the largest dataset in our experiments. Besides, the integration of DP and RP does not change the original accuracy-privacy trend of either technique, which is appealing as we retain control of the balance between accuracy and privacy in the hybrid approach.

5 Conclusion

We have presented a hybrid approach for privacy-preserving recommender systems that combines randomized perturbation (RP) and differential privacy (DP). It is more privacy-friendly than existing works: users’ private data are protected by randomized perturbation, and no one can infer any single user’s data from the normal recommendation output thanks to differential privacy. We have theoretically shown that the noise added by RP has limited effect on recommendation accuracy and that the noise added by DP can be well controlled based on the sensitivity analysis of functions on the perturbed data. We have conducted extensive experiments on real datasets and conclude that the combination of DP and RP is feasible not only in theory but also in practice.