
1 Introduction

A recommender system is a program that uses algorithms to predict users’ purchase interests by profiling their shopping patterns. With the help of recommender systems, online merchants (also referred to as data owners) can better sell their products to returning customers. Most recommender systems are based on collaborative filtering (CF) techniques, e.g., item/user correlation based CF [9] and SVD (singular value decomposition) based latent factor CF [10]. Due to technical limitations, many online merchants buy services from professional third parties to help build their recommender systems. In this scenario, merchants need to share their commercial data with the third party, which creates the potential for leakage of private user information. Typical private information in transaction data includes, but is not limited to, the ratings a user left on particular items and which items this user has rated. Users would not like anyone (except the website where they purchased the products) to know this information. Privacy preserving collaborative filtering algorithms [4, 8] were therefore proposed to tackle this problem: clearly, the data should be perturbed by online merchants before they release it to the third party.

Besides the privacy issue, the data owner has to cope with fast-growing data as well. Once new data arrives, e.g., new items or new users’ transaction records, it should be appended to the existing data, and to protect privacy the data owner needs to perturb it. If he simply redoes the perturbation on the whole data set, he must resend the full perturbed data to the third party. This not only takes a large amount of time for perturbation but also requires the model to be rebuilt on the third party’s site, making fast, real-time recommendation infeasible.

In this chapter, we propose a privacy preserving data updating scheme for collaborative filtering. The scheme is based on truncated SVD updating algorithms [2, 7], which can provide privacy protection while incorporating new data into the original data in an efficient way. We start with the pre-computed SVD of the original data matrix; new rows/columns are then built into the existing factor matrices. We preserve users’ privacy by truncating the new matrix, together with randomization and post-processing. Two missing value imputation methods used during the updating process are studied as well. Experiments conducted on the MovieLens dataset [10] and the Jester dataset [6] show that our scheme can handle data growth efficiently while keeping privacy loss at a low level. The prediction quality remains good compared with most published results.

The remainder of this chapter is organized as follows. Section 2 outlines the related work. Section 3 defines the problem and related notations. Section 4 describes the main idea of the proposed scheme. Section 5 presents the experiments on two datasets and discusses the results. Some concluding remarks and future work are given in Sect. 6.

2 Related Work

Most CF models suffer from privacy leakage: users’ private information is fully accessible, without any disguise, to the provider of the recommender system. Canny [4] first proposed privacy preserving collaborative filtering (PPCF) to deal with privacy leakage in the CF process. In his distributed PPCF model, users control all of their own data: a community of users computes a public “aggregate” of their data that does not expose any individual’s data, and each user then uses local computation to get personalized recommendations. Nevertheless, the most popular collaborative filtering techniques are based on a central server, where users send their data to the server and do not participate in the CF process; only the server conducts the CF. Polat and Du [8] adopted randomized perturbation for both correlation-based CF and SVD-based CF to provide privacy protection: they use uniform and Gaussian distributions to generate random noise and apply the noise to the original data. Differing from Canny and from Polat and Du, we focus on the framework in which the data owner has all the original user data but must perturb it before releasing it to a third party [15].

In this framework, the data owner should be able to handle fast data growth without leaking private information. Among data perturbation methods, SVD is acknowledged as a simple and effective technique. Stewart [12] surveyed the perturbation theory of the singular value decomposition and its application in signal processing. A recent work by Tougas and Spiteri [13] demonstrated a partial SVD updating scheme that requires one QR factorization and one SVD (on small intermediate matrices, and thus inexpensive to compute) per update. Based on their work, Wang et al. [14] presented an improved SVD-based data value hiding method and tested it with clustering algorithms on both synthetic and real data sets. Their experimental results indicate that introducing incremental matrix decomposition yields a significant speedup for the SVD-based data value hiding model. Our scheme is similar to this model, but we have modified the SVD updating algorithm with missing value imputation and post-processing so that it can be incorporated smoothly into the collaborative filtering process.

3 Problem Description

Suppose the data owner has a sparse user-item rating matrix, denoted by \(R \in \mathbb R ^{m\times n}\), where there are \(m\) users and \(n\) items. When new users’ transactions become available, the new rows, denoted by \(T \in \mathbb R ^{p\times n}\), should be appended to the original matrix \(R\), as

$$\begin{aligned} \left[ \begin{array}{l} R\\ T\\ \end{array} \right] \rightarrow R^{\prime } \end{aligned}$$
(1)

Similarly, when new items are collected, the new columns, denoted by \(F \in \mathbb R ^{m\times q}\), should be appended to the original matrix \(R\), as

$$\begin{aligned} \left[ \begin{array}{ll} R & F \end{array} \right] \rightarrow R^{\prime \prime } \end{aligned}$$
(2)

In both cases, the data owner cannot simply release \(T\) or \(F\) because they contain real user ratings; nor can he directly release \(R^{\prime }\) or \(R^{\prime \prime }\), due to scalability and privacy issues. Ideally, supposing a perturbed (SVD-based) version of \(R\) with privacy protection has already been released, the data owner releases only the perturbed incremental data, say \(\tilde{T}\) and \(\tilde{F}\), which preserve users’ privacy without degrading the recommendation quality.

4 Privacy Preserving Data Updating Scheme

In this section, we present the data updating scheme that preserves privacy throughout the whole process. We protect users’ privacy in three aspects: missing value imputation, randomization-based perturbation, and SVD truncation. The imputation step conceals the private information “which items a user has rated”. A second-phase perturbation, done by randomization and truncated SVD, addresses the other question, “what are the actual ratings a user left on particular items”. On one hand, random noise alters the rating values slightly while leaving their distribution unchanged. On the other hand, truncated SVD is a natural choice for data perturbation: it captures the latent properties of a matrix and removes useless noise.

In (1), we see that \(T\) is added to \(R\) as a series of rows. The new matrix \(R^{\prime }\) has a dimension of \({(m+p)\times n}\). Assuming the truncated rank-\(k\) SVD of \(R\) has been computed previously,

$$\begin{aligned} R_{k}=U_{k} \Sigma _{k} V_{k}^T \end{aligned}$$
(3)

where \(U_{k}\in \mathbb R ^{m\times k}\) and \(V_{k}\in \mathbb R ^{n\times k}\) are two column-orthonormal matrices; \(\Sigma _{k}\in \mathbb R ^{k\times k}\) is a diagonal matrix with the \(k\) largest singular values on its diagonal.
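For concreteness, the truncation in (3) can be computed with a few lines of NumPy; this is an illustrative sketch (the helper name `truncated_svd` is ours, not from the chapter):

```python
import numpy as np

def truncated_svd(R, k):
    """Rank-k truncated SVD of a dense matrix R, as in Eq. (3)."""
    U, s, Vt = np.linalg.svd(R, full_matrices=False)
    # Keep the k largest singular values and the corresponding vectors.
    return U[:, :k], np.diag(s[:k]), Vt[:k, :].T   # U_k, Sigma_k, V_k
```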

Since SVD cannot work on an incomplete matrix, we must impute the missing values in advance. We use two different imputation methods: mean value imputation and WNMTF (weighted non-negative matrix tri-factorization) imputation.

In mean value imputation [10], we calculate each column’s mean and impute all the missing values in that column with the mean value.
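A minimal sketch of this step, assuming missing ratings are encoded as NaN (the helper name is ours):

```python
import numpy as np

def impute_column_means(R):
    """Replace missing entries (NaN) with the mean of their column."""
    A = R.astype(float).copy()
    col_means = np.nanmean(A, axis=0)    # per-item mean over observed ratings
    rows, cols = np.where(np.isnan(A))
    A[rows, cols] = col_means[cols]
    return A
```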

In WNMTF [5, 16] imputation, we use WNMTF to factorize an incomplete matrix \(A \in \mathbb R ^{m\times n}\) into three factor matrices, i.e., \(W \in \mathbb R ^{m\times l}\), \(G \in \mathbb R ^{l\times t}\), and \(H \in \mathbb R ^{n\times t}\), where \(l\) and \(t\) are the numbers of columns of \(W\) and \(H\). The objective function of WNMTF is

$$\begin{aligned} \min _{W\ge 0, G\ge 0, H\ge 0}f(A,Y,W,G,H)=\Vert Y \circ (A-WGH^T)\Vert _{F}^2 \end{aligned}$$
(4)

where \(Y\in \mathbb R ^{m\times n}\) is the weight matrix indicating value existence in \(A\) (1 for an observed entry and 0 for a missing one), and \(\circ \) denotes the element-wise (Hadamard) product.

The update formulas corresponding to (4) are

$$\begin{aligned}&W_{ij} = W_{ij} \frac{[(Y \circ A)HG^T]_{ij}}{\{[Y \circ (WGH^T)]HG^T\}_{ij}}\end{aligned}$$
(5)
$$\begin{aligned}&G_{ij} = G_{ij} \frac{[W^T (Y \circ A)H]_{ij}}{\{W^T [Y \circ (WGH^T)]H\}_{ij}}\end{aligned}$$
(6)
$$\begin{aligned}&H_{ij} = H_{ij} \frac{[(Y \circ A)^T WG]_{ij}}{\{[Y \circ (WGH^T)]^T WG\}_{ij}} \end{aligned}$$
(7)
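The following sketch implements the multiplicative updates (5)–(7) in NumPy; the small constant `eps`, the random initialization, and the function name are our assumptions, added for numerical stability and reproducibility:

```python
import numpy as np

def wnmtf(A, Y, l, t, n_iter=10, eps=1e-9, seed=0):
    """Weighted non-negative matrix tri-factorization A ~ W G H^T.

    Y is the 0/1 weight matrix of Eq. (4); updates follow Eqs. (5)-(7)."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    W, G, H = rng.random((m, l)), rng.random((l, t)), rng.random((n, t))
    YA = Y * np.nan_to_num(A)                 # Y zeroes out missing entries
    for _ in range(n_iter):
        WG = Y * (W @ G @ H.T)
        W *= (YA @ H @ G.T) / (WG @ H @ G.T + eps)    # Eq. (5)
        WG = Y * (W @ G @ H.T)
        G *= (W.T @ YA @ H) / (W.T @ WG @ H + eps)    # Eq. (6)
        WG = Y * (W @ G @ H.T)
        H *= (YA.T @ W @ G) / (WG.T @ W @ G + eps)    # Eq. (7)
    return W, G, H
```

The product `W @ G @ H.T` then serves as the imputed matrix, its entries at the missing positions providing the imputations.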

With either of the above two imputation methods, we obtain the imputed matrices: \(\hat{R}\) (with its factor matrices \(\hat{U}_{k}, \hat{\Sigma }_{k}\) and \(\hat{V}_{k}\)) and \(\hat{T}\). Now the problem space has been converted from (1) to (8):

$$\begin{aligned} \left[ \begin{array}{l} \hat{R}\\ \hat{T}\\ \end{array} \right] \rightarrow \hat{R}^{\prime } \end{aligned}$$
(8)

After imputation, random noise drawn from a Gaussian distribution is added to the new data \(\hat{T}\), yielding \(\dot{T}\). We then follow the procedure of Tougas and Spiteri [13] to update the matrix. First, a QR factorization is performed on \(\ddot{T}=(I_{n}-\hat{V}_{k} \hat{V}_{k}^T) \dot{T}^T\), where \(I_{n}\) is the \(n\times n\) identity matrix. This gives \(Q_{T} S_{T}=\ddot{T}\), where \(Q_{T} \in \mathbb R ^{n\times p}\) has orthonormal columns and \(S_{T} \in \mathbb R ^{p\times p}\) is upper triangular. Then

$$\begin{aligned} \hat{R}^{\prime }&= \left[ \begin{array}{l} \hat{R}\\ \hat{T} \end{array} \right] \approx \left[ \begin{array}{l} \hat{R}_{k}\\ \hat{T} \end{array} \right] \approx \left[ \begin{array}{l} \hat{R}_{k}\\ \dot{T} \end{array} \right] \nonumber \\&= \left[ \begin{array}{cc} \hat{U}_{k} & 0\\ 0 & I_{p} \end{array} \right] \left[ \begin{array}{ll} \hat{\Sigma }_{k} & 0\\ \dot{T}\hat{V}_{k} & S_{T}^T \end{array} \right] \left[ \begin{array}{ll} \hat{V}_{k} & Q_{T} \end{array} \right] ^T \end{aligned}$$
(9)

Next, we compute the rank-\(k\) SVD on the middle matrix, i.e.,

$$\begin{aligned} \left[ \begin{array}{ll} \hat{\Sigma }_{k} & 0\\ \dot{T}\hat{V}_{k} & S_{T}^T \end{array} \right] _{(k+p)\times (k+p)} \approx U_{k}^{\prime } \Sigma _{k}^{\prime } V_{k}^{\prime T} \end{aligned}$$
(10)

Since \((k+p)\) is typically small, this SVD computation is very fast. As in Wang et al. [14], we compute the truncated rank-\(k\) SVD of \(\hat{R}^{\prime }\) instead of a complete one,

$$\begin{aligned} \hat{R}_{k}^{\prime } = \left( \left[ \begin{array}{cc} \hat{U}_{k} & 0\\ 0 & I_{p} \end{array} \right] U_{k}^{\prime } \right) \Sigma _{k}^{\prime } \left( \left[ \begin{array}{ll} \hat{V}_{k} & Q_{T} \end{array} \right] V_{k}^{\prime } \right) ^T \end{aligned}$$
(11)

In the CF context, all entry values should lie in a valid range; for example, a valid value \(r\) in MovieLens satisfies \(0<r \leqslant 5\). Therefore, after obtaining the truncated new matrix \(\hat{R}_{k}^{\prime }\), a post-processing step replaces all invalid values with reasonable ones. The processed version of \(\hat{R}_{k}^{\prime }\) is denoted by \(\Delta \hat{R}_{k}^{\prime }\).

In our scheme, we assume the third party owns \(\hat{R}_{k}\), so we only send them \(\Delta {T}\) (\(\Delta {T}=\Delta \hat{R}_{k}^{\prime }(m+1:m+p,:) \in \mathbb R ^{p\times n}\), i.e., the last \(p\) rows of \(\Delta \hat{R}_{k}^{\prime }\)).

The following algorithm summarizes the SVD-based row/user updating.

[Algorithm 1 (rendered as an image in the original): SVD-based row/user updating]
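Since the listing is only available as an image, the following is a hedged reconstruction of the row updating steps from Eqs. (9)–(11); the function names, the clipping bounds `lo`/`hi` used for post-processing, and the noise handling are our assumptions:

```python
import numpy as np

def update_rows(U_k, S_k, V_k, T_hat, mu, sigma, lo, hi, seed=0):
    """SVD-based row update, a sketch reconstructed from Eqs. (9)-(11).

    U_k (m x k), S_k (k x k), V_k (n x k): truncated SVD of the imputed R.
    T_hat (p x n): imputed new rows.  Returns updated factors and the
    perturbed rows Delta_T to be released to the third party."""
    k = S_k.shape[0]
    p, n = T_hat.shape
    rng = np.random.default_rng(seed)
    # Step 1: randomize the imputed new rows with Gaussian noise.
    T_dot = T_hat + rng.normal(mu, sigma, T_hat.shape)
    # Step 2: QR-factorize the component of T_dot orthogonal to range(V_k).
    T_ddot = (np.eye(n) - V_k @ V_k.T) @ T_dot.T           # n x p
    Q_T, S_T = np.linalg.qr(T_ddot)                        # n x p, p x p
    # Step 3: rank-k SVD of the small (k+p) x (k+p) middle matrix in (10).
    mid = np.block([[S_k, np.zeros((k, p))],
                    [T_dot @ V_k, S_T.T]])
    Um, sm, Vmt = np.linalg.svd(mid)
    Uk_, Sk_, Vk_ = Um[:, :k], np.diag(sm[:k]), Vmt[:k, :].T
    # Step 4: assemble the updated rank-k factors as in (11).
    U_new = np.block([[U_k, np.zeros((U_k.shape[0], p))],
                      [np.zeros((p, k)), np.eye(p)]]) @ Uk_
    V_new = np.hstack([V_k, Q_T]) @ Vk_
    # Step 5: post-process -- clip released ratings to the valid range.
    Delta_T = np.clip(U_new[-p:] @ Sk_ @ V_new.T, lo, hi)
    return U_new, Sk_, V_new, Delta_T
```

Note that Step 5 only forms the last \(p\) rows of \(\hat{R}_{k}^{\prime }\), since only \(\Delta T\) is released.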

Like row updating, the column/item updating algorithm is presented as follows.

[Algorithm 2 (rendered as an image in the original): SVD-based column/item updating]
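Column updating mirrors row updating on the transpose: since \(R^T = V_{k} \Sigma _{k} U_{k}^T\), appending columns \(F\) to \(R\) amounts to appending rows \(F^T\) to \(R^T\), with the roles of \(U_{k}\) and \(V_{k}\) swapped. A sketch reusing the helper above (again, our naming):

```python
def update_columns(U_k, S_k, V_k, F_hat, mu, sigma, lo, hi, seed=0):
    """SVD-based column/item update via the transposed row update (a sketch)."""
    V_new, S_new, U_new, Delta_Ft = update_rows(
        V_k, S_k, U_k, F_hat.T, mu, sigma, lo, hi, seed)
    return U_new, S_new, V_new, Delta_Ft.T   # Delta_F = Delta_Ft.T (m x q)
```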

The data owner should keep the updated SVD of the new user-item rating matrix (\(\hat{U}_{k}^{\prime }, \Sigma _{k}^{\prime }\), and \(\hat{V}_{k}^{\prime }\) for row updating; \(\hat{U}_{k}^{\prime \prime }, \Sigma _{k}^{\prime \prime }\), and \(\hat{V}_{k}^{\prime \prime }\) for column updating) for future updates, and the perturbed new data matrix (\(\Delta {T}\) for row updating, \(\Delta {F}\) for column updating) as a reference.

5 Experimental Study

In this section, we discuss the test dataset, prediction model, evaluation strategy and experimental results.

5.1 Data Description

As in most research papers on collaborative filtering, we adopt both the 100K MovieLens [10] and Jester [6] datasets as test data. The 100K MovieLens dataset has 943 users and 1,682 items; its 100,000 ratings, ranging from 1 to 5, were divided into a training set (80,000 ratings) and a test set (20,000 ratings). The Jester dataset has 24,983 users and 100 jokes with 1,810,455 ratings ranging from \(-\)10 to \(+\)10. In our experiments, we pick 5,000 users together with their ratings and randomly select 80 % of the ratings as the training set, using the rest as the test set.

5.2 Prediction Model and Error Measurement

In our experiments, we build the prediction model with the SVD-based CF algorithm [10]. For a dense matrix \(A\), its rank-\(k\) SVD is \(A\approx \tilde{U}_{k} \tilde{\Sigma }_{k} \tilde{V}_{k}^T\). We compute the user factor matrix (\(UF \in \mathbb R ^{m\times k}\)) and the item factor matrix (\(IF \in \mathbb R ^{n\times k}\)):

$$\begin{aligned} UF=\tilde{U}_{k} \sqrt{\tilde{\Sigma }_{k}},\quad IF=\tilde{V}_{k} \sqrt{\tilde{\Sigma }_{k}} \end{aligned}$$
(12)

The predicted rating of user \(i\) on item \(j\) is computed as the inner product of the \(i\)th row of \(UF\) and the \(j\)th row of \(IF\):

$$\begin{aligned} p_{ij}=(\tilde{U}_{k} \sqrt{\tilde{\Sigma }_{k}})_{i} (\tilde{V}_{k} \sqrt{\tilde{\Sigma }_{k}})_{j}^T \end{aligned}$$
(13)

When testing prediction accuracy, we build the SVD model on the training set; then, for every rating in the test set, we compute the corresponding predicted value and measure the difference. We use the mean absolute error (MAE) [3, 11] as the error metric (the lower, the better).
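A minimal sketch of the prediction model (12)–(13) and the MAE computation, assuming the test set is a list of `(user, item, rating)` triples (helper names are ours):

```python
import numpy as np

def predict_and_mae(U_k, S_k, V_k, test_ratings):
    """Predict with Eq. (13) and return the MAE over the test triples."""
    UF = U_k @ np.sqrt(S_k)                   # user factor matrix, Eq. (12)
    IF = V_k @ np.sqrt(S_k)                   # item factor matrix, Eq. (12)
    errors = [abs(r - UF[i] @ IF[j]) for i, j, r in test_ratings]
    return sum(errors) / len(errors)          # mean absolute error
```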

5.3 Privacy Measurement

By measuring privacy, we mean quantifying to what extent the original data can be estimated given the perturbed data. In this chapter, we use the privacy measure first proposed by Agrawal and Aggarwal [1] and later applied by Polat and Du [8] to measure privacy loss in collaborative filtering.

In [1], \(\Pi (Y)=2^{h(Y)}\) is proposed as the privacy inherent in a random variable \(Y\), where \(h(Y)\) is its differential entropy. Given a perturbed version of \(Y\), denoted by \(X\), the average conditional privacy (also referred to as the privacy level) of \(Y\) given \(X\) is \(\Pi (Y|X)=2^{h(Y|X)}\).
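For reference, the differential entropies underlying these measures are (with logarithms base 2, so that \(\Pi \) has the scale of the data):

$$\begin{aligned} h(Y) = -\int f_{Y}(y)\,\log _{2} f_{Y}(y)\,dy, \qquad h(Y|X) = -\int f_{X,Y}(x,y)\,\log _{2} f_{Y|X}(y|x)\,dx\,dy \end{aligned}$$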

Similar to Polat and Du’s work, we take \(\Pi (Y|X)\) as the privacy measure in the experiments. Note that in this chapter \(Y\) corresponds to all the existing values in the training set, i.e., we do not consider the missing values (treated as zeros in this case). This is slightly different from [15].

5.4 Evaluation Strategy

The proposed scheme is tested in several respects: prediction accuracy in recommendation, privacy protection level, when to recompute the SVD, and the degree of randomization with its impact on perturbation.

To decide when to recompute the SVD, the training set, viewed as a rating matrix, is split into two parts at a particular ratio \(\rho \). The first \(\rho \) portion of the data is assumed to be held by the third party; the remaining data will be updated into it. For instance, when a row split is performed with \(\rho =40\,\%\), the first 40 % of the rows in the training set are treated as \(R\) in (1). An imputation process is done on this data without knowledge of the remaining 60 %, yielding \(\hat{R}\) in (8). Then a rank-\(k\) SVD is computed for this matrix; we call the rank-\(k\) approximation of \(\hat{R}\) the starting matrix. These data structures are used as the input of the row updating algorithm. Results are expected to differ with varying split ratios. If the result strays too far from a predefined threshold, or starts to degrade at some point, a re-computation should be performed.

However, we do not update the remaining 60 % of the rows into the training set in one round, since data in real-world applications grows in small amounts compared with the existing data. In our updating experiments, the 60 % of rows are added to the starting matrix over several rounds, with 1/9 of the rows in each round [15]. The evaluation loop is sketched below.
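Putting the pieces together, the evaluation for row updating can be sketched as follows; the helpers are those defined above, `**noise` passes the `mu, sigma, lo, hi` parameters, and the choice of imputing each incoming chunk with its own column means is our assumption:

```python
import numpy as np

def incremental_row_evaluation(train, rho, k, n_chunks=9, **noise):
    """Start from the first rho fraction of rows; update the rest in chunks."""
    m = train.shape[0]
    start = int(rho * m)
    R_hat = impute_column_means(train[:start])     # imputed starting data
    U_k, S_k, V_k = truncated_svd(R_hat, k)        # starting matrix factors
    for chunk in np.array_split(np.arange(start, m), n_chunks):
        T_hat = impute_column_means(train[chunk])  # impute the new rows
        U_k, S_k, V_k, Delta_T = update_rows(U_k, S_k, V_k, T_hat, **noise)
        # Delta_T is what the data owner would release for this round.
    return U_k, S_k, V_k
```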

We evaluate the algorithms on both the MovieLens and Jester datasets by investigating the time cost of updating, the prediction error, and the privacy measure on the final matrix. The machine we use is equipped with an Intel® Core™ i5-2405S processor and 8 GB RAM, running a UNIX operating system.

5.5 Results and Discussion

5.5.1 Split Ratio Study

Owing to the inherent properties of SVD updating algorithms, errors accumulate in each run. The data owner should know the right time to recompute the SVD of the whole data so that data quality can be maintained. We study this problem by experimenting with the split ratio \(\rho \). Note that we use 13 and 11 as the truncation ranks for the MovieLens and Jester datasets, respectively. For consistency, in WNMTF imputation both \(l\) and \(t\) are set to 13 for MovieLens and 11 for Jester. The maximum number of iterations in WNMTF is set to 10.

Fig. 1 Time cost variation with split ratio \(\rho \)

The time cost of updating new data with varying \(\rho \) is plotted in Fig. 1, where “RowM” and “ColumnM” denote row and column updates with mean value imputation, while “RowW” and “ColumnW” denote the updates with WNMTF imputation. As expected, updating fewer rows/columns takes less time.

Furthermore, the figure indicates the relation between the time costs of row and column updating: it depends on the dimensionality of rows and columns. For example, the MovieLens data has more columns (1,682 items) than rows (943 users), while the Jester data has far fewer columns (100 items) than rows (24,983 users). Observing each step of both updating algorithms, we found that when the number of columns exceeds the number of rows, steps 1 and 3 of the row updating algorithm need more time than the corresponding steps of the column updating algorithm due to the higher dimensionality, and vice versa. Compared with the time cost of imputing and computing the SVD on the complete raw training set (i.e., the non-incremental case), our scheme runs much faster on both datasets (see Table 1). As for the imputation methods in this scenario, WNMTF intuitively takes much less time than mean value imputation. Nevertheless, the difference is not as apparent in the total time cost of the row/column updating algorithms (i.e., the incremental case), because imputation time is a smaller part of the total than the time for computing the incremental SVD.

Table 1 Time cost of prediction on raw training data

Figure 2 shows the mean absolute error. For the MovieLens dataset, the column update with mean value imputation stays at almost the same error, while the row update has a descending trend. With WNMTF imputation, both updates produce smaller errors as the split ratio rises. Interestingly, beyond certain points WNMTF yields lower prediction errors than mean value imputation. Things are different for the Jester data, where the MAE stays stable in all updates and mean value imputation performs better. This difference arises because Jester has a denser rating matrix than MovieLens, so the former relies less on the new data than the latter.

Fig. 2 MAE variation with split ratio \(\rho \)

The privacy level with varying split ratio is displayed in Fig. 3. As mentioned before, we only consider the existing ratings in the training set and eliminate the missing values when calculating the privacy level, which makes the results different from [15]. The privacy level computed without missing values does not change much as the split ratio increases; the curves look fluctuating only because we reduced the interval of the Y-axis. In this experiment, we notice that WNMTF imputation produces a higher privacy level than mean value imputation. This can be attributed to WNMTF making more changes to the ratings: mean value imputation does not alter the existing values, whereas WNMTF recomputes the whole data, so the latter perturbs the data more deeply.

Fig. 3 Privacy level variation with split ratio \(\rho \)

Based on Figs. 2 and 3, we can now decide when to recompute the SVD of the whole data. For mean value imputation, the MAEs on both datasets drop more slowly once \(\rho \ge 50\,\%\) and there is no apparent change in the slope of the privacy measure curves, so the re-computation can be performed when \(\rho \) reaches 50 %. For WNMTF imputation, the MAE keeps decreasing throughout on the MovieLens dataset, so no re-computation is needed so far; the MAE for the column update on the Jester data increases once \(\rho \ge 70\,\%\), meaning that the pre-computed SVD matrices start to degrade and a re-computation is helpful.

5.5.2 Role of Randomization in Data Updating

So far, we have not applied the randomization technique to our data updating scheme. In this section, we study the role of randomization (Gaussian noise with parameters \(\mu \) and \(\sigma \) in both row and column updating algorithms) in both data quality and privacy preservation. In the following experiments, \(\rho \) is fixed at 40 % and we use WNMTF to impute the missing values. We probe \(\mu \in \{0, 1\}\) and \(\sigma \in \{0.1, 1\}\) for both datasets. Table 2 collects the statistics of the test.
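The probe amounts to a small grid over the noise parameters; a sketch reusing `update_rows` and `predict_and_mae` from above, with the factor matrices, `T_hat`, the clipping bounds `lo`/`hi`, and `test_ratings` assumed to be prepared as in Sect. 5.4:

```python
# Grid over the Gaussian noise parameters probed in this section.
results = {}
for mu in (0, 1):
    for sigma in (0.1, 1):
        U_n, S_n, V_n, _ = update_rows(U_k, S_k, V_k, T_hat, mu, sigma, lo, hi)
        results[(mu, sigma)] = predict_and_mae(U_n, S_n, V_n, test_ratings)
# MAE (and the privacy level) can then be compared across (mu, sigma) pairs.
```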

Table 2 Randomization in data updating on MovieLens dataset

In this table, the results of row and column updating with randomization are compared with the non-randomized version. As can be seen, after applying random noise to the new data before updating, the privacy level improves in all cases to a certain extent. Nevertheless, some data utility is lost, resulting in greater MAEs at the same time. Hence, the parameters should be chosen carefully to deal with the trade-off between data utility and data privacy. Moreover, the results indicate that the mean \(\mu \) affects the results more than the standard deviation \(\sigma \); we suggest the data owner determine \(\mu \) first and then tweak \(\sigma \).

In summary, randomization can be used as an auxiliary step in the SVD-based data updating scheme to provide better privacy protection. It brings in randomness that perturbs the data before the SVD update, so the data is perturbed twice (randomization \(+\) SVD) in addition to imputation during the updating process, achieving a higher privacy level. Meanwhile, with the latent factors captured by SVD, the most critical information is retained, which ensures the data quality for recommendation.

6 Conclusion and Future Work

In this chapter, we present a privacy preserving data updating scheme for collaborative filtering. It is an incremental SVD-based scheme with a randomization technique, which can update the user-item matrix and preserve privacy at the same time. We protect users’ privacy in three aspects: missing value imputation, randomization-based perturbation, and SVD truncation. Experimental results on the MovieLens and Jester datasets show that the proposed scheme can update new data into the existing (processed) data very fast, and can provide high quality data for accurate recommendation while preserving privacy.

Future work will take users’ latent factors into account to obtain a more reasonable imputation strategy when updating the data. Other matrix factorization techniques, e.g., non-negative matrix factorization, will also be considered as alternative ways of updating new data with privacy preservation, possibly yielding a better scheme.