
1 Introduction

Over the past decade, machine learning (ML) algorithms have found application in a vast and rapidly growing number of systems that analyze and classify data which is often privacy-sensitive.

In order to analyze and interpret such data, PCA [18] is one of the most commonly used unsupervised ML algorithms. PCA summarizes the information content of a database by reducing the dimensionality of the data while preserving as much variability as possible. The output of this statistical tool is a set of principal components whose number is usually much smaller than the total number of attributes of the underlying data.

The increasing popularity of ML algorithms, including PCA, opened the door for attackers, especially when ML techniques are deployed in critical applications. This work focuses on a particular type of attack named Membership Inference Attack (MIA) against PCA, where an adversary is assumed to intercept the principal components computed over some dataset and infer whether a data sample was part of this dataset or not. The membership prediction is made by comparing the reconstruction error, i.e., the distance between the original target sample and its PCA projection, against a threshold. In this paper, we study the effectiveness of MIA against PCA and show that it achieves high performance when the number of samples used by PCA is small.

Furthermore, to cope with such attacks that take advantage of the leakage of principal components, we propose to study the use of differentially private mechanisms and evaluate how the privacy budget affects the success rate of the attack as well as the utility of PCA under differential privacy (DP).

Our main contributions are summarized as follows.

  1. We study, for the first time, the impact of MIA against PCA whereby the adversary has access to the principal components.

  2. We propose the use of differentially-private PCA algorithms to cope with MIA and analyze the impact of the privacy budget on both utility and the success rate of MIA for both vector and scalar queries under the so-called naive and advanced composition approaches.

  3. Our experimental results compare the aforementioned approaches under the Gaussian and Laplace mechanisms for protecting PCA against MIA.

2 Background

2.1 Principal Component Analysis

Given a set \(D=\{x_{n} \in \mathbb {R}^{d}:n=1:N\}\) of N raw data samples of dimension d, each corresponding to one individual, we denote by \(X=[x_{1},\dots , x_{N}]\) the data matrix whose columns are the data samples. We assume that X has zero mean, which can be ensured by centering the data. The standard PCA problem is to find a k–dimensional subspace that approximates each sample \({x}_{n}\). This problem can be formulated as follows:

$$\begin{aligned} \min _{\varPi _{k}} \mathcal {L} = \frac{1}{N}\sum _{n=1}^{N}\mathcal {L}_{n}= \frac{1}{N}\sum _{n=1}^{N} \Arrowvert x_{n} - \varPi _{k}x_{n} \Arrowvert _{2}^{2} \end{aligned}$$
(1)

where \(\mathcal {L}\) denotes the average reconstruction error and \(\varPi _{k}\) is an orthogonal projector which is used for approximating each sample \(x_{n}\) by \(\hat{x}_{n}=\varPi _{k}x_{n}\). The solution to this problem can be achieved via singular value decomposition (SVD) of the sample covariance matrix, which is defined by \(A = \frac{1}{N}XX^{T} = \frac{1}{N}\sum _{n=1}^{N}x_{n}x_{n}^{T}\). A is a symmetric positive semi-definite matrix, hence its singular value decomposition is equivalent to its spectral decomposition. SVD of A yields \(A= \sum _{i=1}^{d}\lambda _{i}v_{i}v_{i}^{T}\), where \(\lambda _{1} \ge \lambda _{2} \ge \dots \ge \lambda _{d} \ge 0\) are the eigenvalues of A and \(v_{1}, v_{2}, \dots ,v_{d}\) their corresponding eigenvectors. Let us denote the matrix whose columns are the top k eigenvectors by \(V_{k}=[{v_{1}},\dots , {v_{k}}]\). The orthogonal projector \(\varPi _{k} = V_{k}V_{k}^{T} \) is a solution to the problem in (1). PCA uses \(V_{k}\) to project the samples into the low k–dimensional subspace \(Y=V_k^{T}X\).
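As a minimal illustration of this procedure (a NumPy sketch with our own function names, not any particular library's implementation), the top-k eigenvectors \(V_k\) and the reconstruction error of (1) can be computed as follows:

```python
# Minimal PCA sketch: X is a d x N matrix whose columns are zero-mean samples.
import numpy as np

def pca_top_k(X, k):
    """Return the top-k eigenvectors V_k of the sample covariance A = (1/N) X X^T."""
    N = X.shape[1]
    A = (X @ X.T) / N                      # sample covariance matrix (d x d)
    eigvals, eigvecs = np.linalg.eigh(A)   # eigh returns ascending eigenvalues for symmetric A
    order = np.argsort(eigvals)[::-1]      # reorder eigenvalues in descending order
    return eigvecs[:, order[:k]]           # d x k matrix of top-k eigenvectors

def reconstruction_error(x, V_k):
    """Reconstruction error ||x - V_k V_k^T x||_2^2 of a single sample x."""
    x_hat = V_k @ (V_k.T @ x)              # projection onto the k-dimensional subspace
    return np.sum((x - x_hat) ** 2)
```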

2.2 Membership Inference Attacks

The goal of an MIA is to infer whether or not a target sample is included in the training dataset. When an adversary learns whether or not a target sample was used to release some statistics or to train a machine learning model, this constitutes an information leakage. Such an attack could cause serious privacy problems if the training dataset contains privacy-sensitive information. An example that highlights the implications of such an attack is [7], which was able to identify individuals contributing their DNA to a health-related project.

3 Related Work

Since the introduction of MIA against deep neural network (DNN) models in [22], this attack has been extensively studied on DNNs and other ML models. The cited work formalized the attack as a binary classification problem and trained neural network (NN) classifiers to distinguish between training members and non-members. The authors demonstrate that the main factor contributing to the success of MIA on DNN models is overfitting. Subsequent works [13, 15, 21, 23, 27] further developed MIAs with different approaches against DNNs of different architectures. The work in [23] revealed that, by using suitable metrics, metric-based attacks achieve attack performance similar to that of NN-based attacks. Besides DNNs, MIAs have also been investigated against logistic regression models [20, 25], k-nearest neighbors [24, 25], and decision tree models [25, 27]. Our work extends the investigation of MIAs against machine learning models to PCA. As we shall elaborate later in Sect. 4.1, we propose, to this end, a new metric-based MIA against PCA. To the best of our knowledge, no previous work has attempted to perform MIA on PCA.

To mitigate MIAs, DP has been widely applied to various ML models [12, 13, 26, 28]. In [1], the authors show how to train DNNs with DP by adding noise to the gradients or parameters during model training. In [19], the authors empirically evaluate MIAs using the proposal of [1]. They find that DP can partially mitigate the attack with an acceptable level of privacy budget. In our study, we investigate the effectiveness of DP-PCA algorithms in mitigating our proposed attack.

4 Membership Inference Attacks Against PCA

The first part of this work focuses on the study of the impact of MIA targeting PCA. We aim to investigate how the sample size and the number of intercepted principal components affect the performance of such attacks. In Sect. 4.1, we define the threat model and the actual MIA targeting PCA. This is followed by the experimental setup and the corresponding experimental results in Sects. 4.2 and 4.3, respectively.

4.1 Threat Model and Attack Methodology

In our setting, the curator computes the principal components \(V_{k}\) using the training dataset D, and sends these to a trusted party. We assume the adversary \(\mathcal {A}\) intercepts some or all of those components by eavesdropping the communication channel. With them, the adversary aims to identify whether or not a certain sample z is included in D. In other words, the adversary’s goal is to discover members of the training dataset.

Such an attack can of course occur in a distributed setting [2] where several parties may compute the principal components of their individual (and usually smaller [10]) training datasets and send those to an aggregator, which ultimately may compute the global principal components. Analogously to the non-distributed case, here \(\mathcal {A}\) would compromise individual privacy by intercepting the principal components conveyed by each party.

To identify whether or not sample z was actually used for the computation of the principal components, \(\mathcal {A}\) computes the reconstruction error \(\mathcal {L}(z, V_{k})\) of the target sample z based on the intercepted \(V_k\), and then compares this error with some tunable decision threshold R. If the reconstruction error of the target sample is lower than the threshold, \(\mathcal {A}\) predicts that z is a member of the training dataset D. Otherwise, \(\mathcal {A}\) predicts that z is not a member of D. Our intuition is that samples from the training dataset are likely to incur a lower reconstruction error than non-member samples.
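As a sketch only (illustrative names, not the authors' code), the attack reduces to computing the reconstruction error of the target sample against the intercepted components and thresholding it:

```python
# z: target sample (d-vector), V_k: intercepted d x k principal components,
# R: tunable decision threshold chosen by the adversary.
import numpy as np

def reconstruction_error(z, V_k):
    """||z - Pi_k z||_2^2 with Pi_k = V_k V_k^T."""
    z_hat = V_k @ (V_k.T @ z)
    return np.sum((z - z_hat) ** 2)

def predict_membership(z, V_k, R):
    """Predict 'member' iff the reconstruction error falls below the threshold R."""
    return reconstruction_error(z, V_k) < R
```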

4.2 Experimental Setup

We proceed with a detailed description of the datasets used in our experiments.

Datasets: We assess the performance of the attack using two groups of datasets: (i) datasets including personal information, namely, UCI Adult [16] (for short, Adult), Census [4], and LFW [8]; and (ii) the image dataset MNIST [14], which is typically used in the literature on MIAs. As preprocessing, we standardize the datasets to unit variance before mounting our attack.

  • The UCI Adult dataset includes 48,842 records with 14 attributes. It contains both numerical (e.g. age, hours per week, etc.) and categorical (e.g. working class, education, etc.) attributes. We employ the standard one-hot encoding approach to construct the numerical representation of the categorical attributes [9].

  • Census: it contains 1080 records with 13 attributes of business statistics.

  • Labeled Faces in the Wild (LFW): It includes 13,233 images of 5749 human faces collected from the Web. 1680 of the 5749 people pictured have at least two distinct images in this dataset. The resolution of the images is 25 \(\times \) 18. In our evaluation, in order to balance the number of samples for each individual, we only take one picture of each individual in the dataset.

  • MNIST: it includes 10 classes of handwritten digits formatted as grayscale 28 \(\times \) 28 pixel images. The dataset is used to predict the class of the digit represented in the image. The total number of samples is 70,000.

Performance Metric: As an evaluation metric of the attack's success, we use the area under the receiver operating characteristic (ROC) curve (AUC), which captures the relationship between the true-positive and false-positive rates over the range of decision thresholds R that the adversary can use to mount the attack. In all experiments, we choose equal-sized sets of members and non-members at random and report the mean of the results over 10 trials.
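For concreteness, the following sketch (our own helper, relying on scikit-learn's roc_auc_score; the array conventions are an assumption) shows one way to compute the attack AUC over member and non-member samples:

```python
# members / non_members: arrays with one sample per row (n x d); V_k: d x k components.
import numpy as np
from sklearn.metrics import roc_auc_score

def attack_auc(members, non_members, V_k):
    """AUC of the reconstruction-error attack over all decision thresholds R."""
    def errors(Z):
        Z_hat = Z @ V_k @ V_k.T                    # PCA reconstruction of each row of Z
        return np.sum((Z - Z_hat) ** 2, axis=1)
    labels = np.concatenate([np.ones(len(members)), np.zeros(len(non_members))])
    # Lower reconstruction error should indicate membership, so negate it as a score.
    scores = -np.concatenate([errors(np.asarray(members)), errors(np.asarray(non_members))])
    return roc_auc_score(labels, scores)
```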

4.3 Experimental Results

We evaluate the success rate of the attack in terms of the number of principal components intercepted by the adversary, denoted by k. For this, we measure the attacker's performance through the AUC. Figure 1 shows the maximum AUC that the adversary can achieve by observing the top-k principal components. Recall that k may take values from 1 to d, where d is the number of attributes of the dataset. We report results for various numbers of samples N. The closer the AUC is to 0.5, the less successful the attack is, as the adversary cannot distinguish between members and non-members.

We observe that the AUC increases with increasing k. This is justified by the fact that the attacker has access to more information and is therefore more likely to succeed in identifying membership. We also observe that the AUC decreases with increasing N, perhaps indicating that the sample covariance matrix A converges to the true covariance matrix of the underlying distribution, which renders the reconstruction errors of member and non-member samples of D indistinguishable. The same behaviour is observed with NNs when the training dataset is large [22].

The results for the MNIST and LFW datasets indicate that the AUC is always greater than 0.5 and reaches 0.9 when \(N=1,000\). As for the Census and Adult datasets, the corresponding AUC values are much lower (compared to the other datasets). This is mainly justified by the small dimension d of these datasets. We note that MIA against machine learning models trained using the Adult dataset is usually unsuccessful [21, 22].

Fig. 1. Impact of the sample size N and the observed top-k components on the attack's performance. Shaded areas show 95% confidence intervals for the mean.

5 Differentially-Private PCA and MIA

In this section, we present the differentially-private PCA (DP-PCA) algorithms introduced in [6, 17], and study both the protection they offer against MIAs for various privacy budget values and their utility. Accordingly, we first recall several preliminaries on DP in Sect. 5.1. This is followed by the description of the DP-PCA approaches in Sect. 5.2 and by the experimental results in Sect. 5.3.

5.1 Preliminaries on Differential Privacy

Definition 1

(Neighboring datasets). Any two datasets that differ in one record are called neighbors. For two neighbor datasets \(\mathbf {x}\) and \(\mathbf {x'}\), the following equality holds:

$$d(\mathbf {x},\mathbf {x'}) = 1,$$

where d denotes the Hamming distance.

Definition 2

(\((\varepsilon , \delta )\)-Differential privacy [5]). A randomized mechanism \(\mathcal {M}\) on a query function f satisfies \((\varepsilon ,\delta )\)-DP with \(\varepsilon ,\delta \geqslant 0\) if, for all pairs of neighbor datasets \(\mathbf {x},\mathbf {x'}\) and for all \(\mathcal {O}\subseteq \) range(\(\mathcal {M}\)),

$${{\,\mathrm{P}\,}}\{\mathcal {M}(f(\mathbf {x}))\in \mathcal {O}\} \leqslant e^{\varepsilon } {{\,\mathrm{P}\,}}\{\mathcal {M}(f(\mathbf {x'}))\in \mathcal {O}\} + \delta .$$

We say that \(\mathcal {M}\) satisfies pure DP if \(\delta =0\), and approximate DP otherwise.

Definition 3

(\(L_p{\text{- }}\)global sensitivity [5]). Let \(\mathcal {D}\) be the class of possible data sets. The \(L_p\)-global sensitivity of a query function \(f:\mathcal {D} \rightarrow \mathbb {R}^d\) is defined as

$$\begin{aligned} \varDelta _p(f)=\max _{\begin{array}{c} \forall \mathbf {x},\mathbf {x'} \end{array}\in \mathcal {D}} \Vert f(\mathbf {x}) - f(\mathbf {x'})\Vert _p, \end{aligned}$$

where \(\mathbf {x},\mathbf {x'}\) are any two neighbor datasets.

Definition 4

(Laplace mechanism [5]). Given any function \(f:\mathcal {D} \rightarrow \mathbb {R}^d\), the Laplace mechanism is defined as follows:

$$\mathcal {M}_L(\mathbf {x},f(\cdot ),\varepsilon ) =f(\mathbf {x}) + (Y_1,\ldots ,Y_d),$$

where \(Y_i\) are i.i.d. random variables drawn from a Laplace distribution with zero mean and scale \(\varDelta _1(f)/\varepsilon \).

Definition 5

(Gaussian mechanism [5]). Given any function \(f:\mathcal {D} \rightarrow \mathbb {R}^d\), the Gaussian mechanism is defined as follows:

$$\mathcal {M}_G(\mathbf {x},f(\cdot ),\varepsilon ) =f(\mathbf {x}) + (Y_1,\ldots ,Y_d),$$

where \(Y_i\) are i.i.d. random variables drawn from a Gaussian distribution with zero mean and standard deviation \(\varDelta _2(f)\sqrt{2 \log (1.25/\delta )}/\varepsilon \).

Theorem 1

([5]). The Laplace mechanism satisfies \((\varepsilon ,0)\)-DP.

Theorem 2

([5]). For any \(\varepsilon , \delta \in (0,1)\), the Gaussian mechanism satisfies \((\varepsilon ,\delta )\)-DP.
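For concreteness, the following minimal NumPy sketch (function names are ours) instantiates the two mechanisms of Definitions 4 and 5; the sensitivity arguments correspond to \(\varDelta _1(f)\) and \(\varDelta _2(f)\), respectively:

```python
import numpy as np

rng = np.random.default_rng()

def laplace_mechanism(f_x, l1_sensitivity, eps):
    """Add i.i.d. Laplace noise with scale Delta_1(f)/eps to each coordinate of f(x)."""
    scale = l1_sensitivity / eps
    return f_x + rng.laplace(0.0, scale, size=np.shape(f_x))

def gaussian_mechanism(f_x, l2_sensitivity, eps, delta):
    """Add i.i.d. Gaussian noise with std Delta_2(f) * sqrt(2 ln(1.25/delta)) / eps."""
    sigma = l2_sensitivity * np.sqrt(2.0 * np.log(1.25 / delta)) / eps
    return f_x + rng.normal(0.0, sigma, size=np.shape(f_x))
```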

Theorem 3

([5]). If each mechanism \(\mathcal {M}_i\) in a k-fold adaptive composition \(\mathcal {M}_1,\ldots ,\mathcal {M}_k\) satisfies \((\varepsilon ',\delta ')\)-DP for \(\varepsilon ', \delta ' \geqslant 0\), then the entire k-fold adaptive composition satisfies \((\varepsilon ,k\delta '+\delta )\)-DP for \(\delta \geqslant 0\) and

$$\begin{aligned} \varepsilon =\sqrt{2k\ln (1/\delta )}\varepsilon ' + k\varepsilon '(e^{\varepsilon '}-1). \end{aligned}$$
(2)
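As an illustration of how Eq. (2) is used later in Sect. 5.2, the per-query budget \(\varepsilon '\) corresponding to a total budget \((\varepsilon ,\delta )\) can be obtained by numerically inverting Eq. (2); the sketch below (our own helper names, simple bisection) is one way to do so:

```python
import numpy as np

def advanced_composition_eps(eps_prime, k, delta):
    """Total epsilon given by Eq. (2) for a k-fold composition of eps_prime-DP mechanisms."""
    with np.errstate(over="ignore"):            # exp may overflow to inf for very large eps_prime
        return np.sqrt(2 * k * np.log(1 / delta)) * eps_prime + k * eps_prime * np.expm1(eps_prime)

def per_query_budget(eps_total, k, delta, tol=1e-12):
    """Largest eps_prime whose advanced-composition total does not exceed eps_total."""
    lo, hi = 0.0, eps_total
    while advanced_composition_eps(hi, k, delta) < eps_total:
        hi *= 2                                  # Eq. (2) is increasing in eps_prime
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if advanced_composition_eps(mid, k, delta) <= eps_total:
            lo = mid
        else:
            hi = mid
    return lo
```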

5.2 Differentially Private PCA Approaches

As in the previous scenario where no privacy protection was implemented, the first step for the data curator is to compute the principal components of the covariance matrix A, which are to be shared with a trusted entity. However, to protect individual privacy against an adversary who may intercept some or all of these components, the curator now decides to add Laplace noise directly to the coefficients \(q_{ij}\) of A. In the context of DP, this approach is called output perturbation.

To protect the \(\alpha \doteq d(d+1)/2\) distinct coefficients of A, we consider two strategies: (i) using a joint query function that simultaneously queries all such coefficients, and (ii) querying each coefficient separately. We shall refer to these procedures as vector and scalar queries, respectively.

For \(i=1,\ldots , d\), let attribute i take values in the interval \([l_i,u_i]\) after standardization, and denote by \(\varLambda _i\) the absolute difference \(|l_i - u_i|\). Recall [17] that \(\varDelta _1(q_{ij}) = \varLambda _i \varLambda _j/N\), from which we can easily derive an upper bound on \(\varDelta _1(A)\) just by adding up the sensitivities of all distinct coefficients. Accordingly, the scale of the Laplace noise injected into each coefficient is \(\varDelta _1(A)/\varepsilon \) in the vector case, and \(\varDelta _1(q_{ij})/\varepsilon _{ij}\) in the scalar case, where \(\varepsilon \) is the total privacy budget and \(\varepsilon _{ij}\) the fraction thereof assigned to the coefficient \(q_{ij}\).

Using the standard sequential composition property, we can compute the total privacy cost of the scalar strategy by adding up all \(\varepsilon _{ij}\) for \(i\geqslant j\). In our experiments, in order to compare the two approaches for the same total privacy budget, we shall assume \(\varepsilon _{ij}=\varepsilon /\alpha \). Note that, in this case, the noise scales of the two approaches will coincide only if \(\sum _{i\geqslant j} \varLambda _i \varLambda _j= \alpha \varLambda _i \varLambda _j\) for every pair \(i\geqslant j\), that is, only if all products \(\varLambda _i \varLambda _j\) are equal. A sketch of both strategies is given below.
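The following sketch (our own illustrative code, not the algorithm of [17]) shows the noise scales of the vector and scalar strategies under naive composition, assuming Lambda holds the per-attribute ranges \(\varLambda _i\) after standardization:

```python
import numpy as np

rng = np.random.default_rng()

def noisy_covariance_laplace(A, Lambda, N, eps, query="vector"):
    """Perturb the alpha = d(d+1)/2 distinct coefficients of A with Laplace noise."""
    d = A.shape[0]
    alpha = d * (d + 1) // 2
    sens = np.outer(Lambda, Lambda) / N          # Delta_1(q_ij) = Lambda_i * Lambda_j / N
    iu = np.triu_indices(d)                      # indices of the distinct coefficients
    if query == "vector":
        # one joint query: every coefficient gets scale Delta_1(A)/eps
        scale = np.full(alpha, sens[iu].sum() / eps)
    else:
        # scalar queries with naive composition: eps_ij = eps / alpha
        scale = sens[iu] / (eps / alpha)
    noisy = A.copy()
    noisy[iu] += rng.laplace(0.0, scale)         # perturb upper triangle (incl. diagonal)
    noisy.T[iu] = noisy[iu]                      # mirror to keep the matrix symmetric
    return noisy
```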

We shall also consider a variation of the scalar case that relies on the advanced (sequential) composition property. Notice that even though this property is stated in the context of approximate DP, Theorem 3 also applies if the mechanisms being composed satisfy pure \(\varepsilon \)-DP. With advanced composition, however, the total privacy cost can be estimated more tightly (compared to the standard property) when the number of coefficients is significantly large. In other words, for the same privacy budget \(\varepsilon \) (and small \(\delta \)) and for large \(\alpha \), the scale of the noise introduced with advanced composition can be notably smaller than that injected with standard sequential composition. In this case, the noise scale is \(\varLambda _i \varLambda _j/(N\varepsilon ')\), where \(\varepsilon '\) satisfies Eq. (2) for \(k=\alpha \) and a given total privacy budget \((\varepsilon , \delta )\).

Finally, the fourth protection approach we shall use in our experimental evaluation guarantees approximate DP through the Gaussian mechanism. More specifically, the algorithm in [6] queries all coefficients of A simultaneously and estimates \(\varDelta _2(A)\) to be 1/N; this sensitivity bound follows after normalizing D so that each row has at most unit \(l_2\) norm. Accordingly, the scale of the noise added to each coefficient is \(\sqrt{2 \log (1.25/\delta )}/(N\varepsilon )\). Table 1 summarizes the four protection mechanisms we shall evaluate in the next subsection.

Table 1. Overview of the DP mechanisms aimed to protect PCA against MIA. Here, \(\varepsilon \) denotes the total privacy budget and \(\varepsilon '\) the fraction thereof assigned to each coefficient of A.
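As a complement to Table 1, the sketch below illustrates Gaussian output perturbation of A with the noise scale stated above; it is a simplified illustration under the unit-row-norm assumption, not the exact algorithm of [6]:

```python
import numpy as np

rng = np.random.default_rng()

def noisy_covariance_gaussian(A, N, eps, delta):
    """Perturb A with a symmetric Gaussian noise matrix of std sqrt(2 ln(1.25/delta))/(N eps)."""
    d = A.shape[0]
    sigma = np.sqrt(2.0 * np.log(1.25 / delta)) / (N * eps)
    E = rng.normal(0.0, sigma, size=(d, d))
    E = np.triu(E) + np.triu(E, 1).T             # symmetrize: each distinct entry drawn once
    return A + E
```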

5.3 Experimental Results

We first study the protection that DP mechanisms offer against our attack. To this end, we implement the four aforementioned approaches and evaluate the AUC of the attack for various privacy budgets \(\varepsilon \), ranging from \(10^{-2}\) to \(10^8\). We would like to note that this is not the usual range of values used in the literature. For example, in privacy-preserving data publishing, values of \(\varepsilon \) above 3 progressively seem to lose any meaningful guarantees [3]. However, for us, the fact that we use such large values is irrelevant, since we empirically measure privacy leakage not through \(\varepsilon \) itself, but through the effectiveness of an MIA. Finally, at the end of this section, we study the utility of the protected data provided by such approaches.

DP Mechanisms and AUC. Figure 2 shows the performance of the attack with respect to the k observed principal components when the AG and Laplace vector query algorithms are used with various values of \(\varepsilon \). In the case of the AG algorithm, \(\varepsilon \) varies from 0.01 to 1 and \(\delta \) is set to \(\delta = \frac{1}{N}\), whereas for the Laplace vector query algorithm, we select larger values of \(\varepsilon \), from \(10^{-1}\) to \(10^{7}\). We also present the AUC of the attack in the non-private setting, where no DP mechanism is adopted. Under AG, we observe that for all values of \(\varepsilon \), the AUC of the attack is only marginally above 0.5 (the random-guess baseline). Hence, the AG algorithm mitigates the effectiveness of MIA. With larger \(\varepsilon \) values under the Laplace vector query approach, the AUC starts to increase and gets closer to the non-private case. We also observe that, for the Adult and Census datasets, the Laplace vector query approach with \(\varepsilon =10^2\) provides roughly the same level of protection as AG with \(\varepsilon =1\). For the LFW dataset, the Laplace vector query approach with \(\varepsilon =10^4\) provides the same protection as AG with \(\varepsilon =1\). Hence, even with a higher privacy budget \(\varepsilon \), the Laplace vector query approach limits the success of the attack.

Fig. 2. The AUC of the attack when the AG algorithm (a) and the Laplace vector query approach (b) are applied with various values of \(\varepsilon \). Shaded areas are the 95% confidence intervals for the mean.

Laplacian Approaches. Figure 3 compares the protection of the aforementioned Laplacian approaches for various levels of the total privacy budget, based on the maximum AUC of the attack. We observe that the advanced composition approach achieves better protection than the naive one in the low privacy regime (when \(\varepsilon \) is large). This observation can be explained through the noise scales injected by the two approaches. From Sect. 5.2, it is easy to verify that the algorithm based on naive composition will introduce less noise than the one relying on advanced composition when \(\varepsilon '<\varepsilon /\alpha \). In Fig. 4, we plot in the hashed area the set of points \((\varepsilon ,\alpha )\) where this inequality holds. From the figure, we can see that, for a fixed \(\alpha \), increasing \(\varepsilon \) will ultimately result in less noise for naive composition. Conversely, for a fixed \(\varepsilon \), increasing the number of coefficients will, at some point, make the advanced mechanism introduce less noise. We thus explain the observation above by noting that adding more noise leads to stronger protection against the MIA.

Specifically, for the Adult (\(d=14\), \(\alpha =105\)) and Census (\(d=13\), \(\alpha =91\)) datasets, where the dimension is relatively small, the AUCs corresponding to the two different approaches intersect at \(\varepsilon \approx 10^2\). As for LFW (\(d=450\), \(\alpha \approx 10^5\)) and MNIST (\(d=784\), \(\alpha \approx 3\times 10^5\)), where d is large, the intersection occurs at the very low privacy regime at \(\varepsilon \approx 10^5\). Furthermore, as depicted in the figure, the vector query and the scalar query with naive composition approaches achieve the same protection, because they consume the same total privacy budget \(\varepsilon \).

Fig. 3. Attack performance with the Laplacian approaches when the adversary intercepts all components \((k=d)\). The infinity point represents the non-private case.

Fig. 4. The hashed area shows where naive composition introduces less noise than advanced composition.

Trade-off Between Privacy and Utility. We use the total privacy budget \(\varepsilon \) as well as the AUC of an MIA to quantify privacy. As for utility, which refers to the accuracy of the principal components produced by the DP-PCA algorithms of Sect. 5.2, we adopt the metric introduced in [11]. In particular, we compute the percentage of energy captured by the principal components produced by those algorithms, \(\hat{V_k}\), with respect to the principal components of non-private PCA (SVD), \(V_k\). Accordingly, we measure utility as \(q=\frac{tr(\hat{V}_{k}^{T}A\hat{V}_{k})}{tr(V_{k}^{T}A V_{k})}\), where A is the sample covariance matrix. We note that, for all the datasets, we select the reduced dimension k such that \(V_{k}\) captures 90% of the energy.
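For reference, a minimal sketch of this captured-energy ratio (our own helper name; V_k_hat denotes the noisy components and V_k the non-private ones) is:

```python
import numpy as np

def captured_energy_ratio(V_k_hat, V_k, A):
    """q = tr(V_k_hat^T A V_k_hat) / tr(V_k^T A V_k), with A the sample covariance matrix."""
    num = np.trace(V_k_hat.T @ A @ V_k_hat)
    den = np.trace(V_k.T @ A @ V_k)
    return num / den
```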

Fig. 5. Trade-off posed by the four DP-PCA algorithms described in Sect. 5.2 between the total privacy budget \(\varepsilon \) and data utility. Utility is measured as the percentage of captured energy w.r.t. SVD.

Fig. 6. Trade-off posed by the four DP-PCA algorithms described in Sect. 5.2 between attack performance and data utility. We measure attack performance through the AUC, and utility through the percentage of captured energy w.r.t. SVD.

Figures 5 and 6 show the utility of the DP-PCA algorithms as a function of the privacy budget \(\varepsilon \) and of the AUC, respectively. We observe that the AG algorithm offers good utility for the Adult and Census datasets. However, AG has low utility for the other datasets. The Laplacian PCA solutions show lower utility compared with AG for \(\varepsilon \le 1\). The vector query and the scalar query with naive composition show almost the same utility, except for the MNIST and Census datasets, where the scalar query with naive composition achieves better utility than the vector query approach. Advanced composition provides better utility than naive composition when \(\varepsilon \) and \(\alpha \) lie in the blank area of Fig. 4. In summary, the utility of the DP-PCA algorithms is governed by the amount of noise added, as one would expect.

On the other hand, the vector query approach outperforms the scalar query approach if the sensitivity of the coefficients is skewed. In order to enjoy better utility, the scalar query approach with advanced composition should be used rather than with naive composition when the privacy budget \(\varepsilon \) and the number of queries \(\alpha \) are in the blank area of Fig. 4.

6 Conclusion

In this paper, we have implemented and evaluated the first membership inference attack against PCA, whereby an adversary has access to some or all of the principal components. Our attack sheds light on the privacy leakage of PCA. Specifically, we have demonstrated that an MIA can be deployed successfully, with high performance, when the number of samples used by PCA is small. We have also evaluated the protection of DP-PCA under different protection algorithms, privacy budgets, numbers of intercepted principal components, and numbers of covariance coefficients. Our work may be useful to assess the practical privacy provided by DP-PCA algorithms along with the desired utility. As future work, we plan to investigate whether there is a correlation between the samples that are vulnerable in PCA and those that are vulnerable in downstream tasks such as neural network classifiers.