
1 Introduction

The popularisation of smart devices and the development of big data analytics have led to tremendous growth in the generation, collection, and analysis of personal digital information. The useful information extracted from massive data can bring immeasurable value [1, 2]. Clustering is a classical unsupervised data analysis method, and k-means is one of the most popular clustering methods due to its efficiency and simplicity [3]. Although data analysis has great potential, it also carries a risk of leaking user privacy. Sensitive information such as medical, location, and financial data can directly expose users’ private information. Traditional anonymization methods, which merely remove identifiers, can resist neither differential attacks nor background-knowledge attacks: an attacker can still correlate records and identify users’ private information. Therefore, ensuring that users’ private information is not leaked while maintaining a high level of clustering utility is a problem that needs to be solved.

The Differential Privacy (DP) model is currently considered a reliable model with rigorous and provable privacy guarantees [4]. Compared with traditional protection models such as anonymity and random perturbation, differential privacy has significant advantages for privacy preservation in cluster analysis [5, 6]. A differential privacy-based model for cluster analysis, referred to as the Differential Privacy k-means Algorithm (DP k-means), has been widely applied for its efficiency and privacy preservation [7, 8]. DPLloyd-Impr improved on DPLloyd by introducing the concept of sphere packing [9]. DP-KCCM is a novel algorithm that improves clustering utility by adopting cluster merging and adaptive noise mechanisms [10]. These works improve DP k-means in data pre-processing, cluster delineation, etc., but they rely on trusted third-party servers, which collect real user data, perform clustering, and uniformly add noise. However, with the development of cloud computing and the diversification of data analysis demands, the assumption that all third-party servers are trustworthy no longer holds, as malicious servers may steal and exploit users’ private information.

Local Differential Privacy (LDP) [11] was proposed for settings where third-party servers cannot be trusted. LDP imposes more stringent privacy requirements than DP: each user perturbs their data at the local side before sending it to an untrusted server. LDP has also been applied in practice to build feasible solutions [12, 13]. A k-means algorithm based on LDP was proposed in [14] to protect location data through feature transformation and privacy budget allocation. Although LDP effectively addresses the problem of privacy leakage at third-party servers, it still faces the challenge of reduced clustering utility due to excessive noise [15]. Because user data is perturbed at the local side, the noise of LDP is larger than that of DP, and its influence is further amplified across the clustering iterations. Moreover, most LDP research implicitly assumes that the private information of all users is protected uniformly, whereas different users and data often have different privacy requirements. To address this issue, Personalized Local Differential Privacy (PLDP) was proposed in [16], which allows each user to set the privacy level of their data independently.

Based on the above discussion, the main issue to be addressed is how to balance the protection of users’ private information and the utility of clustering in k-means. A clustering framework based on the PLDP k-means algorithm is proposed in this paper. First, each user perturbs sensitive data at the local side with the PLDP k-means algorithm and sends it to the server, which performs high-quality k-means clustering on the perturbed data. Thus, the threat of malicious servers is eliminated while users’ personalized privacy demands are met. In addition, an iterative centroid perturbation algorithm is proposed, which prevents privacy leakage caused by inference attacks by perturbing the centroids in the iterative process. The proposed algorithm also reduces the impact of perturbation on clustering utility by designing a privacy budget allocation sequence. The main contributions of this paper are as follows.

  1) A clustering framework based on the PLDP k-means algorithm is proposed. In the framework, the server does not access users’ private information while ensuring quality clustering and users’ personalized privacy demands.

  2) Iterative centroid perturbation algorithms are proposed to address the potential leakage of private information during iteration. They help prevent inference attacks and further protect users’ private information.

  3) Theoretical analysis demonstrates the privacy protection capability of the proposed mechanism, and extensive experiments show that the proposed algorithm has better or similar performance compared with existing DP k-means algorithms. To the best of our knowledge, this paper is the first attempt at adopting PLDP in k-means clustering.

The rest of this paper is organized as follows. The basic concepts required for this framework and the related technical foundations are introduced in Sect. 2. The proposed approach is presented in Sect. 3. The experimental results and analysis are illustrated in Sect. 4. Finally, the paper is concluded in Sect. 5.

2 Preliminaries

In this paper, the concept of personalized local differential privacy is adopted. To make the paper more self-contained, some basics of LDP and PLDP are briefly introduced in this section.

Differential privacy is a privacy-preserving model widely used in data analysis, in which the real data of all users is protected by a trusted data collector. However, the prerequisite of a trusted data collector usually does not hold in real-world applications. LDP extends DP to the local setting: data sanitization is implemented locally by designing random perturbation algorithms that comply with differential privacy requirements. In this way, sensitive information is protected without relying on a trusted third-party collector. The formal definition of LDP follows.

Definition 1

(\(\varepsilon \)-LDP). A randomized mechanism \(F: D \rightarrow R\) satisfies \(\varepsilon \)-LDP iff, for any two records \(t, t^{\prime } \in D\) and any possible output \(t^{*} \in R\), Eq. 1 holds.

$$\begin{aligned} {Pr}\left[ F(t)=t^{*}\right] \le e^{\varepsilon } \times {Pr}\left[ F\left( t^{\prime }\right) =t^{*}\right] . \end{aligned}$$
(1)

The parameter \(\varepsilon \) is the privacy budget, which is public and usually set in [0, 2]. The value of \(\varepsilon \) bounds the ratio of the probabilities that the algorithm F outputs the same result \(t^{*}\) on any two input values t and \(t^{\prime }\). Thus, stronger (weaker) privacy guarantees are provided by smaller (larger) values of \(\varepsilon \).
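For illustration, the following is a minimal Python sketch of binary randomized response, a classic mechanism satisfying \(\varepsilon \)-LDP in the sense of Definition 1; it is not part of the proposed framework, and the function name is illustrative.

```python
import numpy as np

def randomized_response(bit, eps):
    """Binary randomized response: report the true bit with probability
    p = e^eps / (e^eps + 1), and the flipped bit otherwise. For any two
    inputs, the output probabilities differ by at most a factor of e^eps,
    so the mechanism satisfies eps-LDP."""
    p = np.exp(eps) / (np.exp(eps) + 1.0)
    return bit if np.random.rand() < p else 1 - bit
```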

LDP provides a way to protect private data on the user’s local side, but different users and data may have different privacy protection requirements. Therefore, PLDP is adopted in this paper to satisfy different privacy requirements. Each user in PLDP has a pair of user-chosen parameters \(\left( G_{i}, \varepsilon _{u}\right) \), where \(\varepsilon _{u}\) represents the desired strength of privacy protection for that user, i.e., the privacy budget, and \(G_{i}\) represents a user-specified security range containing the user’s real data, within which the user’s data is indistinguishable from other data.

Definition 2

(\(\left( G_{i}, \varepsilon _{u}\right) \)-PLDP). Given the privacy requirement \(\left( G_{i}, \varepsilon _{u}\right) \) of a user, a randomized mechanism \(F: D \rightarrow R\) satisfies \(\left( G_{i}, \varepsilon _{u}\right) \)-PLDP iff, for any two records \(t, t^{\prime } \in G_{i}\) and any possible output \(t^{*} \in R\), Eq. 2 holds.

$$\begin{aligned} {Pr}\left[ F(t)=t^{*}\right] \le e^{\varepsilon _{u}} \times {Pr}\left[ F\left( t^{\prime }\right) =t^{*}\right] . \end{aligned}$$
(2)

When \(G_{i}\) is set to the whole domain D and all users share the same \(\varepsilon \), PLDP is equivalent to LDP.

Differential privacy has two important composition properties, sequential and parallel composition, which are formally defined as follows.

Property 1

(sequential composition). Given a dataset D and privacy algorithms \(\boldsymbol{F}=\left\{ F_{1}, F_{2}, \ldots , F_{n}\right\} \), where each \(F_{i}\,(1 \le i \le n)\) satisfies \(\varepsilon _{i}\)-DP, the sequential combination of \(\left\{ F_{1}, F_{2}, \ldots , F_{n}\right\} \) on D satisfies \(\varepsilon \)-DP, where \(\varepsilon =\sum _{i=1}^{n} \varepsilon _{i}\).

Property 2

(parallel composition). Given a dataset D divided into n disjoint subsets \(\boldsymbol{D}=\left\{ D_{1}, \ldots , D_{n}\right\} \), let \(F_{i}\) be a privacy algorithm satisfying \(\varepsilon _{i}\)-DP on \(D_{i}\); then the combination of \(\left\{ F_{1}, \ldots , F_{n}\right\} \) satisfies \(\varepsilon _{max}\)-DP on D, where \(\varepsilon _{max}=\max _{i} \varepsilon _{i}\).
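As a lightweight illustration of how these properties drive the privacy accounting in Sect. 3.3, the two rules reduce to the following budget arithmetic (illustrative helpers, not part of the framework):

```python
def sequential_budget(eps_list):
    """Property 1: mechanisms run in sequence on the same data
    consume the sum of their budgets."""
    return sum(eps_list)

def parallel_budget(eps_list):
    """Property 2: mechanisms run on disjoint subsets consume
    only the largest individual budget."""
    return max(eps_list)

# e.g. three mechanisms with budgets 0.2, 0.3, 0.5:
# sequentially they consume 1.0; on disjoint data only 0.5.
```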

3 Proposed Approach

In this section, the PLDP k-means clustering algorithm is proposed, and its privacy is demonstrated. Existing privacy issues in clustering analysis are first analyzed. The overall flow of the proposed framework is then described, and the corresponding design of the perturbation mechanism based on PLDP theory is given. Finally, the privacy of the proposed overall system is proved theoretically.

3.1 Overview

The privacy issues faced by DP and LDP k-means clustering models, and their solutions, are analyzed in this subsection. A third-party data collector collects sensitive data (e.g., location, income, medical cases) from many users, processes it using the k-means algorithm, and shares or publishes the results with partners or on public platforms. When third-party collectors (e.g., service providers) ask users for their data, protecting user privacy becomes an issue that must be addressed. DP is considered an effective solution: the users’ data is perturbed on a third-party server so that neither an attacker nor subsequent releases can leak user privacy. However, the attacker may be external, or the data collector itself may be malicious and know all the users’ real data. LDP protects user data under the assumption that third-party servers are not trusted: data is perturbed by an LDP mechanism at the local side and then uploaded to the server, which removes the threat of malicious data collectors. The new problem is that, owing to the properties of LDP, the utility of the perturbed data is generally inferior to that of DP. At the same time, simply perturbing the data in clustering cannot completely avoid the risk of privacy leakage. The problem is therefore to design a model that achieves better clustering utility while avoiding the influence of malicious collectors.

A clustering framework based on the PLDP k-means algorithm is proposed in this paper to address the issues mentioned above. A randomized perturbation algorithm satisfying PLDP is used to perturb the user’s local data, eliminating the risk of malicious collectors while satisfying personalized privacy requirements and enhancing the utility of clustering. Meanwhile, an iterative clustering centroid perturbation algorithm perturbs the real clustering information locally to prevent privacy leakage due to inference attacks.

3.2 Proposed Framework

A framework based on PLDP k-means that solves the above problem is proposed; its overall structure is shown in Fig. 1. The clustering model has a user set \(U=\left\{ u_{0}, u_{1}, \ldots , u_{n-1}\right\} \) and an attribute set \(A=\left\{ a_{0}, a_{1}, \ldots , a_{d-1}\right\} \). Each user \(u_{i}\) \((0 \le i < n)\) has a d-dimensional data vector \(S_{i}=\left\{ s_{0}, s_{1}, \ldots , s_{d-1}\right\} \), where \(d=\left| S_{i}\right| \) is the number of attributes and \(s_{j}\) \((0 \le j < d)\) is the numerical value of attribute \(a_{j}\). The target of k-means is to classify the user data into k clusters \(C=\left\{ c_{0}, c_{1}, \ldots , c_{k-1}\right\} \).

Fig. 1. Cluster privacy-preserving framework based on PLDP k-means.

As shown in Fig. 1, the proposed framework consists of two parts: the local side and the collector server. The local side describes how user data is perturbed by the PLDP perturbation algorithm; the collector server describes how k-means is performed on the perturbed data. The user data \(S_{i}\) is perturbed into \(S_{i}^{*}\) by Algorithm 1 at the local side and then sent to the server. The server generates k initial centroids using the initial centroid selection algorithm and the attribute set A, then sends them to the local side. The clustering iteration then begins. The local side calculates the distance between the real user data \(S_{i}\) and each centroid received from the server to find the nearest centroid \(c_{i}\) and the corresponding cluster. The found centroid \(c_{i}\) is then perturbed into \(c_{i}^{*}\) by Algorithm 2 and sent to the server. The server updates the set of centroids based on the received \(C^{*}\) and the perturbed data \(S_{i}^{*}\) by Algorithm 3, then sends them to the local side. This process is repeated until the results converge.

Local Side Method. As shown in Fig. 1, the local side consists of two core components: user data perturbation and centroid perturbation.

User Data Perturbation. In contrast to the usual LDP k-means approach of converting the data into binary strings and perturbing each dimension before aggregation, this paper normalizes the user data vector \(S_{i}\) to [−1, 1] through data pre-processing before perturbation. The reason is that each bit of a binary string must be assigned an equal share of the privacy budget \(\varepsilon \), which can cause excessive noise when the budget is small or the string is long.

The Duchi solution [17] is a multidimensional data perturbation scheme based on LDP. Since \(S_{i}\) has already been pre-processed into \(S_{i}^{\prime }=\left\{ s_{0}^{\prime }, s_{1}^{\prime }, \ldots , s_{d-1}^{\prime }\right\} \), a Duchi-based PLDP mechanism can be used to perturb the user data. In the clustering model of this paper, each user selects a pair of privacy parameters \(\left( G_{i}, \varepsilon _{u}\right) \). \(\varepsilon _{u}\) is the user’s chosen privacy budget, i.e., the required strength of data protection; it is divided evenly over the d data dimensions to obtain \(\varepsilon _{d}=\frac{\varepsilon _{u}}{d}\). \(G_{i}=\left\{ g_{0}, g_{1}, \ldots , g_{d-1}\right\} \) is the user’s acceptable security range, where \(g_{j}\) \((0 \le j < d)\) is the security range of the j-th dimension. Consider, for example, age data distributed in [1, 100]. For a user aged 25, LDP perturbation makes the data indistinguishable over the whole range [1, 100]; this represents a wide privacy requirement that is generally unnecessary. In PLDP, the user instead chooses a security range \(g_{j}\), whose position and size are user-defined, subject to the constraint that the user’s real value lies inside it. For example, \(g_{j}\) = [10, 40] means that the age is indistinguishable within [10, 40], which satisfies the user’s privacy demands. Let \(w_{j}\) and \(m_{j}\) denote the size and midpoint of \(g_{j}\). Since a security range symmetric about 0 is needed, each user shifts the range, moving the data point within it to \(s_{j}^{\prime \prime }=s_{j}^{\prime }-m_{j}\); after this processing, \(S_{i}^{\prime \prime }=\left\{ s_{0}^{\prime \prime }, s_{1}^{\prime \prime }, \ldots , s_{d-1}^{\prime \prime }\right\} \), representing \(S_{i}\), is obtained. The perturbation mechanism is defined by

$$\begin{aligned} \begin{aligned}&{\text {Pr}}\left( s_{j}^{*}=x \mid s_{j}^{\prime \prime }\right) \\&=\left\{ \begin{array}{l} \frac{2 \cdot s_{j}^{\prime \prime } \cdot \left( e^{\varepsilon _{d}}-1\right) +w_{j} \cdot \left( e^{\varepsilon _{d}}+1\right) }{2 \cdot w_{j} \cdot \left( e^{\varepsilon _{d}}+1\right) }, \text{ if } x=\frac{w_{j}}{2} \cdot \frac{e^{\varepsilon _{d}}+1}{e^{\varepsilon _{d}}-1}+m_{j}, \\ \frac{-2 \cdot s_{j}^{\prime \prime } \cdot \left( e^{\varepsilon _{d}}-1\right) +w_{j} \cdot \left( e^{\varepsilon _{d}}+1\right) }{2 \cdot w_{j} \cdot \left( e^{\varepsilon _{d}}+1\right) }, \text{ if } x=-\frac{w_{j}}{2} \cdot \frac{e^{\varepsilon _{d}}+1}{e^{\varepsilon _{d}}-1}+m_{j}. \end{array}\right. \end{aligned} \end{aligned}$$
(3)

Since a range shift was performed on \(S_{i}^{\prime }\) before the perturbation, \(m_{j}\) is added back to the perturbation result x in Eq. 3 to restore the data. After completing the perturbation, \(S_{i}^{*}\) is sent to the server, which collects all the perturbed data and calculates the mean value of each dimension.

The overall process of user data perturbation is shown in Algorithm 1: \(S_{i}^{*}\) is obtained by perturbation according to Eq. 3 and then sent to the collector server. The privacy proof of Algorithm 1 is given in Sect. 3.3.

Algorithm 1. User data perturbation.
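As an aid to reading Algorithm 1, the following is a minimal Python sketch of the Duchi-based PLDP perturbation of Eq. 3, assuming the data is already normalized to [−1, 1] and each security range \(g_{j}\) contains the true value; the function and variable names are illustrative, not from the paper.

```python
import numpy as np

def pldp_perturb(s, g_ranges, eps_u):
    """Sketch of Algorithm 1: perturb one user's d-dimensional vector.

    s        -- user data S_i', pre-normalized to [-1, 1]
    g_ranges -- security ranges g_j as (lo, hi) pairs, each containing s[j]
    eps_u    -- the user's personalized privacy budget
    """
    d = len(s)
    eps_d = eps_u / d                       # budget split evenly over dimensions
    out = np.empty(d)
    for j, (lo, hi) in enumerate(g_ranges):
        w, m = hi - lo, (hi + lo) / 2.0     # size w_j and midpoint m_j of g_j
        s2 = s[j] - m                       # shift so g_j is symmetric about 0
        t = np.exp(eps_d)
        # probability of reporting the positive pole (first branch of Eq. 3)
        p = (2 * s2 * (t - 1) + w * (t + 1)) / (2 * w * (t + 1))
        pole = (w / 2.0) * (t + 1) / (t - 1)
        out[j] = (pole if np.random.rand() < p else -pole) + m  # add m_j back
    return out
```

One can check that the reported value is an unbiased estimate of \(s_{j}^{\prime }\), which is why the server can average the perturbed vectors directly.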

Centroid Perturbation. As shown in Fig. 1, the local side enters the iterative process after the user data perturbation is completed. The centroids are first received from the server, and the nearest centroid is determined from the real user data. Although the server cannot infer private information from the perturbed user data, the cluster to which a user belongs, i.e., the iteration centroid sent to the server in each round, may reveal user privacy. Because cluster membership is calculated from real data, over multiple iterations the server can infer the approximate distribution or even the exact value of the user data as memberships change and the iteration centroids are updated. For example, if the user data is two-dimensional location data, the user can be localized to a circular region in each iteration; overlapping these regions across iterations helps the server pin down the user’s exact location or a narrow location range.

To address the privacy leakage that iterative centroids may cause, an iterative centroid perturbation algorithm is proposed in this paper, which generates perturbed iteration centroids with a random perturbation mechanism. Moreover, the centroids change greatly in the first few iterations and only slightly in the last few. If the privacy budget were distributed equally, i.e., the same amount of noise added in every round, clustering utility would suffer or the algorithm might fail to converge. Therefore, a privacy budget allocation mechanism is proposed in which the per-round budget increases with the number of iterations: a smaller privacy budget (larger noise) is used in the first rounds, and as the iterations proceed the budget is incremented and the noise gradually reduced. The iterative centroid perturbation algorithm is described in Algorithm 2.

Algorithm 2. Iterative centroid perturbation.

As shown in Algorithm 2, the K-Randomized Response (K-RR) is used to perturb the user clustering information. Since K-RR applies directly to multivalued data, there is no need to encode the centroids. The privacy budget allocation is inspired by the Fibonacci sequence: the goal is an allocation that increases monotonically and sums to \(\varepsilon _{u}\). Assuming L iterations, the recursive formula of the sequence is \(P(n)=2 \cdot P(n-1)\) \(\left( 2 \le n < L, P(0)=P(1)=\frac{1}{2^{L-1}} \cdot \varepsilon _{u}\right) \). For example, with L = 5 the sequence is \(P=\left\{ \frac{1}{16} \cdot \varepsilon _{u}, \frac{1}{16} \cdot \varepsilon _{u}, \frac{1}{8} \cdot \varepsilon _{u}, \frac{1}{4} \cdot \varepsilon _{u}, \frac{1}{2} \cdot \varepsilon _{u}\right\} \); the privacy budget for the third iteration is \(\varepsilon _{3}=\frac{1}{8} \cdot \varepsilon _{u}\), and the sum is \(\varepsilon _{u}\). The iteration centroid perturbation is given by Eq. 4.

$$\begin{aligned} {\text {Pr}}\left[ F\left( c_{i}\right) =c_{i}^{*}\right] = {\left\{ \begin{array}{ll}p=\frac{e^{\varepsilon _{n}}}{e^{\varepsilon _{n}}+k-1} &{} \text{ if } c_{i}^{*}=c_{i}, \\ q=\frac{1}{e^{\varepsilon _{n}}+k-1} &{} \text{ if } c_{i}^{*} \ne c_{i},\end{array}\right. } \end{aligned}$$
(4)

where \(\varepsilon _{n}\) is the privacy budget of the current iteration round, L is the maximum number of rounds, k is the number of centroids, and \(c_{i}^{*}\) is the perturbed iteration centroid. The detailed procedure is given in Algorithm 2.
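As a complement to Algorithm 2, the following sketch shows one possible implementation of the budget sequence and the K-RR step of Eq. 4; the helper names are hypothetical.

```python
import numpy as np

def budget_sequence(eps_u, L):
    """Budget allocation of Sect. 3.2: P(0) = P(1) = eps_u / 2**(L-1),
    P(n) = 2 * P(n-1) for 2 <= n < L. The terms increase monotonically
    and sum to eps_u."""
    seq = [eps_u / 2 ** (L - 1)] * 2
    for _ in range(2, L):
        seq.append(2 * seq[-1])
    return seq

def krr_perturb(c_idx, k, eps_n):
    """K-RR perturbation of a cluster index (Eq. 4): keep the true index
    with probability p = e^eps_n / (e^eps_n + k - 1), otherwise report one
    of the other k - 1 indices uniformly at random."""
    p = np.exp(eps_n) / (np.exp(eps_n) + k - 1)
    if np.random.rand() < p:
        return c_idx
    others = [c for c in range(k) if c != c_idx]
    return int(np.random.choice(others))

print(budget_sequence(1.0, 5))  # [0.0625, 0.0625, 0.125, 0.25, 0.5]
```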

Collector Server Method

Initial Centroid Selection. The server randomly generates k d-dimensional initial centroids C based on \(S_{i}^{*}\) and sends them to the users.

Aggregation and Centroid Computation. The server groups the perturbed user data \(S_{i}^{*}\) according to the perturbed centroids \(C^{*}=\left\{ c_{0}^{*}, c_{1}^{*}, \ldots , c_{k-1}^{*}\right\} \) sent from the local side. Within each cluster, the mean of each dimension of \(S_{i}^{*}\) is calculated separately, and the centroid set C is updated as in Eq. 5:

$$\begin{aligned} c_{i}=\frac{1}{\left| c_{i}^{*}\right| } \cdot \left\{ \sum _{S_{i}^{*} \in c_{i}^{*}} s_{0}^{*}, \sum _{S_{i}^{*} \in c_{i}^{*}} s_{1}^{*}, \ldots , \sum _{S_{i}^{*} \in c_{i}^{*}} s_{d-1}^{*}\right\} \end{aligned}$$
(5)

where \(c_{i}\) is the newly computed centroid, \(\left| c_{i}^{*}\right| \) is the number of user data vectors belonging to \(c_{i}^{*}\), and \(\sum _{S_{i}^{*} \in c_{i}^{*}} s_{j}^{*}\) is the intra-cluster sum of the j-th dimension. The new centroids are sent to the local side after the calculation is completed, and the clustering iterations proceed as described above until clustering converges. The main steps of the centroid update are shown in Algorithm 3.

Algorithm 3. Aggregation and centroid update.
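A compact sketch of the server-side update of Eq. 5 follows, assuming the perturbed vectors and perturbed cluster indices are held as NumPy arrays (names are illustrative). A hypothetical end-to-end driver for the Fig. 1 protocol, reusing the earlier sketches, is also shown.

```python
import numpy as np

def update_centroids(S_star, labels_star, k):
    """Sketch of Algorithm 3: recompute each centroid as the per-dimension
    mean of the perturbed vectors assigned to it (Eq. 5).

    S_star      -- (n, d) array of perturbed user vectors S_i^*
    labels_star -- length-n array of perturbed cluster indices c_i^*
    """
    d = S_star.shape[1]
    centroids = np.zeros((k, d))
    for i in range(k):
        members = S_star[labels_star == i]       # users whose c_i^* is cluster i
        if len(members) > 0:                     # guard against empty clusters
            centroids[i] = members.mean(axis=0)  # intra-cluster mean, Eq. 5
    return centroids

def pldp_kmeans(users, k, L):
    """Hypothetical driver for the Fig. 1 protocol. Each user object is
    assumed to hold .data, .g_ranges and .eps_u (illustrative attributes)."""
    S_star = np.array([pldp_perturb(u.data, u.g_ranges, u.eps_u) for u in users])
    d = S_star.shape[1]
    centroids = np.random.uniform(-1, 1, size=(k, d))  # initial centroid selection
    for rnd in range(L):
        labels = np.array([
            krr_perturb(int(np.argmin(((u.data - centroids) ** 2).sum(axis=1))),
                        k, budget_sequence(u.eps_u, L)[rnd])
            for u in users
        ])
        centroids = update_centroids(S_star, labels, k)
    return centroids
```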

3.3 Privacy Analysis

This section proves that Algorithms 1 and 2 satisfy the definition of differential privacy and further proves that the overall framework satisfies the definition of differential privacy.

Theorem 1

Algorithm 1 provides \(\left( G_{i}, \varepsilon _{u}\right) \)-PLDP for each user \(u_{i}\) with parameters \(\left( G_{i}, \varepsilon _{u}\right) \).

Proof

For any two values \(s_{j 1}^{\prime }, s_{j 2}^{\prime } \in g_{j}\) and any output \(s_{j}^{*} \in \big \{\frac{w_{j}}{2} \cdot \frac{e^{\varepsilon _{d}}+1}{e^{\varepsilon _{d}}-1}+m_{j},-\frac{w_{j}}{2} \cdot \frac{e^{\varepsilon _{d}}+1}{e^{\varepsilon _{d}}-1}+m_{j}\big \}\), we have \(s_{j 1}^{\prime \prime }=s_{j 1}^{\prime }-m_{j}\) and \(s_{j 2}^{\prime \prime }=s_{j 2}^{\prime }-m_{j}\). Then

$$\begin{aligned} \begin{aligned} \frac{{\text {Pr}}\left( s_{j}^{*} \mid s_{j 1}^{\prime }\right) }{{\text {Pr}}\left( s_{j}^{*} \mid s_{j 2}^{\prime }\right) }&= \frac{{\text {Pr}}\left( s_{j}^{*} \mid s_{j 1}^{\prime \prime }\right) }{{\text {Pr}}\left( s_{j}^{*} \mid s_{j 2}^{\prime \prime }\right) } \\&=\frac{\frac{2 \cdot s_{j 1}^{\prime \prime } \cdot \left( e^{\varepsilon _{d}}-1\right) +w_{j} \cdot \left( e^{\varepsilon _{d}}+1\right) }{2 \cdot w_{j} \cdot \left( e^{\varepsilon _{d}}+1\right) }}{\frac{2 \cdot s_{j 2}^{\prime \prime } \cdot \left( e^{\varepsilon _{d}}-1\right) +w_{j} \cdot \left( e^{\varepsilon _{d}}+1\right) }{2 \cdot w_{j} \cdot \left( e^{\varepsilon _{d}}+1\right) }}. \end{aligned} \end{aligned}$$
(6)

or

$$\begin{aligned} \begin{aligned} \frac{{\text {Pr}}\left( s_{j}^{*} \mid s_{j 1}^{\prime }\right) }{{\text {Pr}}\left( s_{j}^{*} \mid s_{j 2}^{\prime }\right) } = \frac{\frac{-2 \cdot s_{j 1}^{\prime \prime } \cdot \left( e^{\varepsilon _{d}}-1\right) +w_{j} \cdot \left( e^{\varepsilon _{d}}+1\right) }{2 \cdot w_{j} \cdot \left( e^{\varepsilon _{d}}+1\right) }}{\frac{-2 \cdot s_{j 2}^{\prime \prime } \cdot \left( e^{\varepsilon _{d}}-1\right) +w_{j} \cdot \left( e^{\varepsilon _{d}}+1\right) }{2 \cdot w_{j} \cdot \left( e^{\varepsilon _{d}}+1\right) }}. \end{aligned} \end{aligned}$$
(7)

Using Eq. 6 as an example,

$$\begin{aligned} \begin{aligned} \frac{{\text {Pr}}\left( s_{j}^{*} \mid s_{j 1}^{\prime }\right) }{{\text {Pr}}\left( s_{j}^{*} \mid s_{j 2}^{\prime }\right) } =\frac{2 \cdot s_{j 1}^{\prime \prime } \cdot \left( e^{\varepsilon _{d}}-1\right) +w_{j} \cdot \left( e^{\varepsilon _{d}}+1\right) }{2 \cdot s_{j 2}^{\prime \prime } \cdot \left( e^{\varepsilon _{d}}-1\right) +w_{j} \cdot \left( e^{\varepsilon _{d}}+1\right) }. \end{aligned} \end{aligned}$$
(8)

It can be seen from Eq. 8 that when \(s_{j 1}^{\prime \prime }=\frac{w_{j}}{2}, s_{j 2}^{\prime \prime }=-\frac{w_{j}}{2}\) \(\left( \text{resp. } s_{j 1}^{\prime \prime }=-\frac{w_{j}}{2}, s_{j 2}^{\prime \prime }=\frac{w_{j}}{2}\right) \), Eq. 6 (resp. Eq. 7) attains its maximum value, so that

$$\begin{aligned} \frac{{\text {Pr}}\left( s_{j}^{*} \mid s_{j 1}^{\prime }\right) }{{\text {Pr}}\left( s_{j}^{*} \mid s_{j 2}^{\prime }\right) } \le e^{\varepsilon _{d}}. \end{aligned}$$
(9)

By Eq. 9, Algorithm 1 satisfies \(\left( g_{j}, \varepsilon _{d}\right) \)-PLDP for each dimension. Since \(S_{i}^{\prime }=\left\{ s_{0}^{\prime }, s_{1}^{\prime }, \ldots , s_{d-1}^{\prime }\right\} \), \(G_{i}=\left\{ g_{0}, g_{1}, \ldots , g_{d-1}\right\} \), and \(\varepsilon _{d}=\frac{\varepsilon _{u}}{d}\), we have \(\sum _{j=0}^{d-1} \varepsilon _{d}=\varepsilon _{u}\). By the sequential composition property (Property 1 in Sect. 2), Algorithm 1 therefore satisfies \(\left( G_{i}, \varepsilon _{u}\right) \)-PLDP for each user \(u_{i}\) with \(\left( G_{i}, \varepsilon _{u}\right) \).

Theorem 2

Algorithm 2 provides \(\varepsilon _{u}\)-LDP for each user \(u_{i}\) over the whole clustering process.

Proof

For any two values \(c_{i 1}, c_{i 2} \in C\) and any output \(c_{i}^{*} \in C\), we have

$$\begin{aligned} \begin{aligned} \frac{{\text {Pr}}\left[ F\left( c_{i 1}\right) =c_{i}^{*}\right] }{{\text {Pr}}\left[ F\left( c_{i 2}\right) =c_{i}^{*}\right] }&\le \frac{\frac{e^{\varepsilon _{n}}}{e^{\varepsilon _{n}}+k-1}}{\frac{1}{e^{\varepsilon _{n}}+k-1}} \\&=e^{\varepsilon _{n}}. \end{aligned} \end{aligned}$$
(10)

Hence each round of Algorithm 2 satisfies \(\varepsilon _{n}\)-LDP. Across all iterations, the sequential composition property (Property 1 in Sect. 2) applies; since \(\sum _{n=1}^{L} \varepsilon _{n}=\varepsilon _{u}\), Algorithm 2 satisfies \(\varepsilon _{u}\)-LDP over the whole clustering process.

4 Experimental Evaluation

In this section, experiments are designed to investigate the improvements of the proposed framework over existing DP k-means algorithms and how the relevant parameters influence its utility.

4.1 Experimental Environment and Datasets

The hardware platform for the experiments is an Intel Core i7-11700 CPU @ 2.50 GHz with 32.00 GB RAM, and the software platform is Python 3.7. Two datasets from the UCI repository were used. The Blood dataset records 748 individual blood donations from the Blood Transfusion Service Center in Hsinchu City; each record has five attributes. The Adult dataset was extracted from the 1994 census database and contains 48,842 records with 14 attributes each; in this paper, six numerical attributes are retained for each record.

4.2 Experimental Setup and Evaluation Metrics

The experiments evaluate the proposed framework in three respects.

  1) Compare the clustering utility with existing algorithms [9, 10] for a uniform k under different \(\varepsilon \). To the best of our knowledge, this paper is the first attempt at adopting PLDP in k-means clustering, so state-of-the-art DP k-means algorithms were selected for comparison: the DPLloyd-Impr algorithm [9] and the DP-KCCM algorithm [10]. DPLloyd-Impr completes the initial centroid selection with an initial centroid selection algorithm and then adds Laplace noise evenly to each round. DP-KCCM combines a privacy budget allocation algorithm with cluster merging to enhance clustering utility and injects noise through the Laplace mechanism. It is worth noting that both algorithms are based on differential privacy mechanisms and do not prevent attacks by malicious servers.

  2) Compare the effects of different settings on clustering utility. Two sets of experiments examine the impact of the key mechanisms: first, the effect of the privacy budget allocation method on clustering utility; second, the effect of the iterative centroid perturbation algorithm on the clustering model.

  3) Compare the effect of different parameter distributions on clustering utility. In PLDP, users set their privacy budget \(\varepsilon _{u}\) and the size of the security range \(w_{j}\) according to their privacy needs, and in practice these parameters will not be identical across users. To understand the effect of the key parameters, three sets of experiments investigate different parameter values and distributions.

In this paper, clustering utility is assessed using the Normalised Intra-Cluster Variance (NICV) [9]. The essential goal of k-means is to divide the data into k clusters by minimizing an error function with distance as the evaluation metric. NICV therefore directly reflects the utility of clustering, and its value also reasonably reflects the impact of the privacy-preserving mechanisms on that utility. A smaller NICV value means better clustering utility. NICV is defined as

$$\begin{aligned} N I C V=\frac{1}{N} \sum _{i=1}^{k} \sum _{S_{i}^{\prime } \in c_{i}}\left\| S_{i}^{\prime }-c_{i}\right\| ^{2} \end{aligned}$$
(11)

where N is the total number of users, k is the number of centroids, \(S_{i}^{\prime }\) is the user data \(S_{i}\) normalized to [−1, 1], and \(S_{i}^{\prime } \in c_{i}\) denotes that \(c_{i}\) is the closest centroid to \(S_{i}^{\prime }\).
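For reproducibility, a direct NumPy transcription of Eq. 11 might look as follows, assuming the normalized data and final centroids are available as arrays (names are illustrative):

```python
import numpy as np

def nicv(S_norm, centroids):
    """Normalised Intra-Cluster Variance (Eq. 11).

    S_norm    -- (N, d) user data normalized to [-1, 1]
    centroids -- (k, d) cluster centroids
    """
    # squared distance from every point to every centroid, shape (N, k)
    d2 = ((S_norm[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    # each point contributes its distance to the nearest centroid
    return d2.min(axis=1).mean()
```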

4.3 Experimental Analysis

The results of the experiments are shown below. The first experiment explores the performance of the two existing algorithms [9, 10] and the proposed PLDP k-means under different privacy budgets \(\varepsilon \). The data were normalized to [−1, 1], \(w_{j}\) was set to 0.1, and the maximum number of iterations was set to 12. For comparison purposes, this experiment unifies the privacy budget \(\varepsilon _{u}\) and \(w_{j}\) across users. As seen in Fig. 2, PLDP k-means performs better than DPLloyd-Impr [9] and similarly to DP-KCCM [10]. However, the algorithm proposed in this paper does not require a trusted third-party server, which means that PLDP k-means obtains similar or better clustering utility while eliminating the risk of malicious servers.

Fig. 2. Performance with respect to \(\varepsilon \).

Fig. 3. Performance with respect to privacy budget allocation methods.

The second experiment explores the effect of different privacy budget allocation methods on clustering utility. The privacy budget allocation sequence designed in this paper makes the per-iteration budgets increase monotonically. As shown in Fig. 3, the average allocation method and the proposed allocation method were compared, and the proposed method is significantly better. This demonstrates that the proposed privacy budget allocation algorithm further improves the utility of clustering.

The iterative centroid perturbation algorithm is proposed to prevent privacy leakage caused by inference attacks. To evaluate its impact on clustering utility, a comparison experiment was conducted between using the iterative centroid perturbation algorithm and using the real centroids directly. As shown in Fig. 4, using the true centroids performs better, illustrating that some utility is sacrificed to strengthen privacy protection.

Fig. 4. Performance with respect to the iterative centroid perturbation algorithm.

Fig. 5. Performance under different W and the same E.

Fig. 6. Performance under different E and the same W.

Fig. 7. Performance with respect to \(w_{j}\) and \(\varepsilon \) under \(W_{1}\), \(E_{1}\).

The effect of the parameters is explored next. Fixed ranges are specified, \(w_{j} \in \)[0.1, 0.5] and \(\varepsilon _{u} \in \)[0.1, 2], from which each user draws their parameters. The distributions of \(w_{j}\) and \(\varepsilon _{u}\) are taken as either uniform (\(W_{1}, E_{1}\)) or normal (\(W_{2}, E_{2}\)); \(W_{1}\) and \(W_{2}\) have equal means, as do \(E_{1}\) and \(E_{2}\), and \(E_{2}\) and \(W_{2}\) have standard deviations of 0.3 and 0.1, respectively. The PLDP k-means algorithm and its two variants discussed above (average budget allocation and real centroids) were tested. Figure 5 (Fig. 6) shows the results of the three algorithms under different W (E) and the same E (W) on the two datasets. Controlling the variables shows that the results for \(W_{2}\) and \(E_{2}\) are better than those for \(W_{1}\) and \(E_{1}\), respectively. Although the means of the two distributions are equal, normally distributed data concentrate around the mean, so small \(\varepsilon _{u}\) and large \(w_{j}\) occur with lower probability, which leads to better NICV values.

The effects of \(w_{j}\) and \(\varepsilon _{u}\) were explored further by varying them in the \(W_{1}\), \(E_{1}\) setting. As illustrated in Fig. 7, a larger \(w_{j}\) (\(\varepsilon _{u}\)) results in poorer (better) clustering utility.

Based on the experimental analysis above, the proposed algorithm improves clustering utility while ensuring the strength of privacy protection; the experiments show that the desired effect is achieved.

5 Conclusion

A clustering framework based on PLDP k-means and an iterative centroid perturbation algorithm is proposed in this paper. The framework does not require trusted third-party servers, and users can personalize their privacy requirements through the proposed PLDP k-means algorithm. The iterative centroid perturbation algorithm refines the privacy-preserving scheme by perturbing the centroids during the iterations. Experimental results show that the proposed algorithm performs better than or similarly to existing DP k-means algorithms. Besides, unlike the DP k-means algorithm, the PLDP k-means algorithm requires only one upload of the perturbed data, but the computational and communication costs during the iterations are still non-negligible. Future work will analyze and reduce the computational and communication costs of the PLDP k-means algorithm.