1 Introduction

k-means clustering [4, 25] is a simple and well-studied machine learning technique that aims to group data records into k clusters. Informally, the clustering algorithm partitions data records into clusters such that records in the same cluster are close to each other under some distance metric. Clustering has been applied in many domains, such as image segmentation and gene detection [9, 11, 45]. The traditional k-means clustering algorithm assumes that the data is centralized in a single location, so it cannot be directly applied to distributed settings where each party holds a part of the dataset.

As a typical distributed system consisting of thousands of connected nodes, peer-to-peer (P2P) networks have emerged as a good choice for many technologies [6, 33, 46], and performing clustering in P2P networks has attracted increasing attention. In particular, designing a distributed clustering algorithm for P2P networks, instead of collecting all data at a single point, is a practical option. Distributed clustering in P2P networks has many important applications. For example, in a P2P file-sharing network, a distributed clustering algorithm can mine the browsing histories of users and find users with common interests [18]. However, P2P networks impose several challenges for clustering over distributed data. First, since P2P networks are highly decentralized and often contain millions of nodes, it is impractical to centralize the whole dataset or to achieve global synchronization in the network. Second, P2P networks undergo frequent topology changes. Third, each peer experiences frequent data updates. Fourth, peers may be unwilling to reveal their local data due to privacy concerns.

Many schemes [2, 7, 27, 29] have been proposed to address the above challenges. Among these works, Datta et al. [7] proposed two algorithms for k-means clustering over distributed data in peer-to-peer networks, which only require each node to achieve local synchronization with its neighbors. Mashayekhi et al. [27] presented a fully decentralized method for clustering dynamic and distributed data in P2P networks. Their scheme constructs a summarized view of the distributed dataset and applies partition-based and density-based clustering methods to this view, achieving good clustering accuracy with low communication cost. Although these works achieve accuracy comparable to centralized methods, each peer is required to send local information to its neighbors for collaborative computation. Since the local data of a peer may contain sensitive information, directly sending it, as the existing distributed clustering schemes do, compromises the peer's privacy [21, 42,43,44]. Some recent works [3, 5] designed privacy-preserving feature selection schemes for P2P networks, but these schemes cannot be directly applied to privacy-preserving k-means clustering in P2P networks. In another line of work, several secure schemes [36, 40, 48] handle k-means clustering over horizontally or vertically partitioned data. However, these privacy-preserving clustering schemes either consider only the two-party setting or require global synchronization among all parties, which does not suit the topology of P2P networks. A practical privacy-preserving k-means clustering scheme for P2P networks is still lacking.

In this paper, we propose a novel privacy-preserving k-means clustering scheme over distributed data in P2P networks. Each peer in our scheme iteratively updates the clusters and achieves local synchronization at each iteration, i.e., it only needs to synchronize with its neighbors to learn the clusters. Specifically, each peer runs our secure aggregation protocol to obtain local centers and cluster counts from its neighbors and computes the new clusters using our secure division protocol. As a result, each peer learns the clustering result without revealing its local information. Our main contributions can be summarized as follows.

  • We propose a novel privacy-preserving k-means clustering scheme over distributed data in P2P networks, which simultaneously achieves local synchronization and protects the privacy of each peer.

  • To protect the privacy of local data in each peer, we design a secure aggregation protocol and a secure division protocol based on homomorphic encryption [31]. In addition, we design a novel message encoding mechanism to improve the performance of our aggregation protocol.

  • We formally prove that the proposed scheme is secure under the semi-honest model. We also theoretically analyze the performance of our proposed scheme.

The remainder of this paper is organized as follows. We introduce the system model and threat model in Section 2. We present some preliminaries used in our scheme in Section 3. We describe the proposed scheme in Section 4. Then, we analyze the security and computational complexity of our proposed scheme in Section 5. We review the related work in Section 6. Finally, we conclude the paper in Section 7.

2 Problem statement

2.1 System model

The dataset D, consisting of n data records {p1,p2,⋯ ,pn}, is distributed over different nodes in a peer-to-peer network. Each node represents a user and holds a part of the dataset; in other words, D is horizontally distributed over the nodes. The nodes collaboratively learn k clusters \(C^{\prime }=\{c^{\prime }_{1},c^{\prime }_{2},\cdots ,c^{\prime }_{k}\}\) over the distributed dataset using the k-means clustering algorithm. Nevertheless, the standard clustering algorithm only works on centralized datasets, and performing k-means clustering over distributed datasets in peer-to-peer networks remains a challenge. In this paper, we model the network as a connected, undirected graph where each peer is a vertex and an edge between two nodes denotes that they can communicate with each other. Our scheme is based on local synchronization: each node synchronizes only with its neighbors in each iteration. We therefore simply assume that each node can only communicate with its neighbors and that each node has a unique identity. Given a node Ni, we use the notation Γi to denote its set of neighbors and |Γi| to denote the number of neighbors. All notations used in this paper are summarized in Table 1.

Table 1: Notation table

2.2 Threat model

We consider the security of our proposed scheme under the semi-honest (honest-but-curious) model [10]. That is, each party correctly follows the protocol but tries to learn the others' inputs from what it legitimately receives during the protocol. A protocol π is secure under the semi-honest model if each party's view during the protocol can be simulated given only its input and output. The semi-honest model has been adopted by several existing works [16, 20, 22, 41]. A formal definition of security against semi-honest adversaries is given below.

Definition 1

Let F be a functionality and π an n-party protocol for computing F, where Fi denotes the computation of party i. The view of party i during the execution of π is denoted by Viewi and equals (xi,ri,m1,⋯ ,mt), where xi is the input, ri is the randomness, and mj is the j-th received message. We say that the protocol π is secure under the semi-honest model if there exists a probabilistic polynomial-time simulator Simi for each party i such that

$$ Sim_{i} (x_{i},F_{i}(x_{1},x_{2},\cdots,x_{n}))\overset{c}{\equiv} View_{i}(x_{i},F_{i}(x_{1},x_{2},\cdots,x_{n})), $$
(1)

where \(\overset {c}{\equiv }\) represents computational indistinguishability.

2.3 Design goal

In our scheme, the dataset D is horizontally distributed over different peers, so each peer knows the content of only its own data records. The nodes aim to learn k clusters over the whole dataset, so we assume that each node learns all k clusters during the computation. The main privacy issues we consider in this paper are listed below.

  • A node cannot learn the content of data records possessed by other nodes.

  • A node cannot know the closest clusters of data records possessed by other nodes.

  • A node cannot learn, for any cluster, the number of data records assigned to it by other nodes.

3 Preliminaries

3.1 k-means clustering

Given a dataset of n data records D = {p1,p2,⋯ ,pn}, k-means clustering partitions the dataset into k disjoint subsets called clusters, where k is a user-defined parameter and each pi is an element of \(\mathcal {R}^{d}\). The goal of k-means clustering is to find clusters that minimize the sum of distances between data records and their clusters. In this paper, we use the Euclidean distance as the distance metric, i.e., \(Dist(p,q)=\sqrt {{\sum }_{i=1}^{d}(p_{i}-q_{i})^{2}}\), and represent each cluster by its centroid. The k-means clustering algorithm [4, 25] proceeds as follows.

We first set l = 1 and randomly select k data records \(C^{(l)}=\left \{c^{(l)}_{1},c^{(l)}_{2},\cdots ,c^{(l)}_{k}\right \}\) from the dataset D as the initial clusters, then iteratively refine the k candidate clusters until a termination condition is reached. Specifically, in each iteration, we assign each data record pi to the cluster \(c^{(l)}_{j}\) closest to it and count the number of such data records, denoted \(m^{(l)}_{j}\). We then compute the new clusters \(C^{(l+1)}=\left \{c^{(l+1)}_{1},c^{(l+1)}_{2},\cdots ,c^{(l+1)}_{k}\right \}\) from the cluster assignment, i.e., each new cluster is the arithmetic mean of the points assigned to it. We check whether the difference between the new and the old clusters is within a predefined threshold. If so, we terminate the iteration and output C(l+1) as the final result; otherwise, we replace C(l) with C(l+1) and repeat the above process. The main steps are illustrated in Fig. 1.

Fig. 1: k-means clustering algorithm
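
For concreteness, the following minimal NumPy sketch mirrors the centralized loop above; the function and variable names are ours, not from the paper.

```python
import numpy as np

def kmeans(D, k, eps=1e-4, max_iter=100, rng=np.random.default_rng(0)):
    """Plain centralized k-means: D is an (n, d) array, k the cluster count."""
    C = D[rng.choice(len(D), k, replace=False)]          # random initial centroids
    for _ in range(max_iter):
        # assign each record to its closest centroid (Euclidean distance)
        dist = np.linalg.norm(D[:, None, :] - C[None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        # recompute each centroid as the arithmetic mean of its points
        C_new = np.array([D[labels == j].mean(axis=0) if (labels == j).any()
                          else C[j] for j in range(k)])
        if np.linalg.norm(C_new - C, axis=1).max() <= eps:   # termination test
            return C_new
        C = C_new
    return C
```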

The above algorithm only works in the centralized setting where the whole dataset D is stored in a single place. In peer-to-peer networks, however, the dataset D is distributed over different nodes, each node Ni holding a part of it. The specific topology of P2P networks makes it difficult to design a distributed k-means clustering algorithm. Moreover, the subset Di held by each node may contain sensitive information about the node, which it is unwilling to reveal to others. Designing a distributed k-means clustering algorithm that preserves the privacy of each node is still a challenge.

3.2 Homomorphic encryption

The Paillier cryptosystem [31] is an efficient public-key cryptosystem with semantic security (indistinguishability under chosen-plaintext attack, IND-CPA). The encryption scheme is additively homomorphic, i.e.,

$$ E_{pk}(x_{1}) \times E_{pk}(x_{2})=E_{pk}(x_{1}+x_{2}), $$
(2)
$$ E_{pk}(x)^{\alpha} = E_{pk}(\alpha x). $$
(3)

Here, E denotes the encryption function, x1, x2 and α are arbitrary messages in the plaintext space, pk is the public key, Epk(x1) is the ciphertext of x1, and D denotes the decryption function. The main steps of the Paillier cryptosystem are as follows.

Key Generation.:

Choose two large enough primes p and q. The secret key is sk = lcm(p − 1,q − 1), the least common multiple of p − 1 and q − 1. The public key is pk = (N,g), where N = pq and \(g\in \mathbb {Z}_{N^{2}}^{*}\) satisfies \(\gcd \big {(}L(g^{sk}\mod N^{2}), N\big {)}=1 \), i.e., the greatest common divisor of L(gsk mod N2) and N is 1. Here, L(x) = (x − 1)/N, and likewise below.

Encryption.:

Let x0 be a number in the plaintext space \(\mathbb {Z}_{N}\). Select a random \(r\in \mathbb {Z}_{N}^{*}\) as the blinding parameter; the ciphertext of x0 is then \(c_{0}=\mathrm {E}_{pk}(x_{0})=g^{x_{0}}r^{N}\ \text {mod} \ N^{2}\).

Decryption.:

Let \(c_{0}\in \mathbb {Z}_{N^{2}}\) be a ciphertext. The plaintext hidden in c0 is

$$ x_{0}=\text{Dec}_{sk}(c_{0})=\frac {L({c_{0}^{sk}}\mod N^{2})}{L(g^{sk}\mod N^{2})}\mod N. $$

Note that Paillier encryption only supports non-negative integers, so we transform real numbers into integers by multiplying them by a large integer δ (δ > 0), as in previous works [23, 24, 35].
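
To make these steps concrete, here is a toy, deliberately insecure Paillier sketch (tiny hard-coded primes, g = N + 1) that exercises the key generation, encryption, and decryption formulas above and checks the homomorphic properties (2) and (3); it is an illustration, not a production cryptosystem.

```python
import math, random

def keygen(p=1789, q=2003):                     # toy primes; real keys use ~1024-bit primes
    n = p * q
    g = n + 1                                   # a standard choice satisfying the gcd condition
    sk = math.lcm(p - 1, q - 1)                 # sk = lcm(p-1, q-1)
    return (n, g), sk

def L(u, n):
    return (u - 1) // n                         # L(x) = (x - 1) / N from the text

def encrypt(pk, m):
    n, g = pk
    while True:                                 # pick r in Z_N^* (gcd(r, N) = 1)
        r = random.randrange(1, n)
        if math.gcd(r, n) == 1:
            break
    return pow(g, m, n * n) * pow(r, n, n * n) % (n * n)

def decrypt(pk, sk, c):
    n, g = pk
    num = L(pow(c, sk, n * n), n)
    den = L(pow(g, sk, n * n), n)
    return num * pow(den, -1, n) % n            # division mod N via modular inverse

pk, sk = keygen()
n = pk[0]
c1, c2 = encrypt(pk, 15), encrypt(pk, 27)
assert decrypt(pk, sk, c1 * c2 % (n * n)) == 42   # eq. (2): E(x1) * E(x2) = E(x1 + x2)
assert decrypt(pk, sk, pow(c1, 3, n * n)) == 45   # eq. (3): E(x)^a = E(a * x)
```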

4 Privacy-preserving k-means clustering in P2P networks

4.1 Overview

Our proposed privacy-preserving k-means clustering scheme does not require all nodes in a peer-to-peer network to achieve global synchronization. It only requires each node to synchronize with its neighbors, i.e., local synchronization: each node moves on to the next iteration once it receives responses from all its neighbors. To protect the privacy of each node, we design a secure aggregation protocol and a secure division protocol between each node and its neighbors.

In the initial stage of our scheme, a single node randomly generates k initial clusters \(C^{(1)}=\left \{c^{(1)}_{1},c^{(1)}_{2},\cdots ,c^{(1)}_{k}\right \}\) and a termination threshold 𝜖 > 0. The node then sends (1,C(1),𝜖) to all its neighbors and starts iteration 1. When a node Ni receives the message (1,C(1),𝜖) from its neighbors for the first time, it randomly chooses a neighbor \(N_{s_{i}}\) as its assistant node. \(N_{s_{i}}\) generates its public and secret key pair (pki,ski) and sends the public key pki to Ni. The node Ni then sends (1,C(1),𝜖,pki) to the rest of its neighbors and starts iteration 1. Eventually, all nodes finish the initial stage and enter iteration 1 with the same initial clusters C(1) and threshold 𝜖.

In each iteration of our scheme, each node Ni securely aggregates clusters from its neighbors and computes new clusters. Specifically, Ni first runs a k-means clustering step on its local dataset Di with the local clusters \(C_{i}^{(l)}\). For each data record in Di, Ni calculates the distance to each cluster and assigns the record to the closest cluster \(c^{(l)}_{ij}\). Then Ni counts the number of records in its dataset assigned to the cluster \(c^{(l)}_{ij}\), denoted \(m_{ij}^{(l)}\), and computes k local centers \(w_{i}^{(l)}=\left (w_{i1}^{(l)},w_{i2}^{(l)},\cdots ,w_{ik}^{(l)}\right )\), where \(w_{ij}^{(l)}\) is a d-dimensional point and \(w_{ij}^{(l)}={\sum }_{p\in c^{(l)}_{ij}}p\). Ni stores \(\left \{w_{i}^{(l)},m_{i}^{(l)}=\left (m_{i1}^{(l)},m_{i2}^{(l)},\cdots ,m_{ik}^{(l)}\right )\right \}\) in its history table. It also sends a request (i,l) to its neighbor nodes Γi, which asks all its neighbors to run the secure aggregation protocol and securely return their local centers and cluster counts for iteration l. Then, Ni and its assistant node \(N_{s_{i}}\) learn secret shares of \({\sum }_{N_{a}\in {\Gamma }_{i}^{*}}w_{aj}^{(l)}\) and \({\sum }_{N_{a}\in {\Gamma }_{i}^{*}}m_{aj}^{(l)}\), i.e., Ni gets αj1,βj1 and \(N_{s_{i}}\) gets αj2,βj2, where \({\sum }_{N_{a}\in {\Gamma }_{i}^{*}}w_{aj}^{(l)}=\alpha _{j1}+\alpha _{j2}\mod N\) and \({\sum }_{N_{a}\in {\Gamma }_{i}^{*}}m_{aj}^{(l)}={\beta }_{j1}+{\beta }_{j2}\mod N\). The details of the secure aggregation protocol are described in the next part. Ni then runs the secure division protocol with the assistant node \(N_{s_{i}}\) to update its clusters. For each cluster \(c_{ij}^{(l+1)}\), Ni obtains

$$ c_{ij}^{(l+1)}=\frac{\sum\limits_{N_{a}\in {\Gamma}_{i}^{*}}w_{aj}^{(l)}}{\sum\limits_{N_{a}\in {\Gamma}_{i}^{*}}m_{aj}^{(l)}}. $$

Ni computes \(Dist\left (c_{ij}^{(l)},c_{ij}^{(l+1)}\right )\) for each cluster and finds the maximum distance among them. Ni then compares \(max\left \{Dist\left (c_{ij}^{(l)},c_{ij}^{(l+1)}\right )\right \}_{1\leq j\leq k}\) with 𝜖. If \(max\left \{Dist\left (c_{ij}^{(l)},c_{ij}^{(l+1)}\right )\right \}_{1\leq j\leq k}>\epsilon \), it continues to iteration l + 1; otherwise, it moves to the termination state and \(C_{i}^{(l+1)}=\left \{c^{(l+1)}_{i1},c^{(l+1)}_{i2},\cdots ,c^{(l+1)}_{ik}\right \}\) is the final set of clusters. The main steps of each iteration of our proposed scheme are illustrated in Fig. 2.

Fig. 2: An overview of our privacy-preserving k-means clustering algorithm
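
The following sketch illustrates the local computation each node performs before aggregation, producing the point sums \(w_{i}^{(l)}\) and counts \(m_{i}^{(l)}\) that feed the secure aggregation protocol; the function and variable names are illustrative, not from the paper.

```python
import numpy as np

def local_step(D_i, C):
    """One local pass on node N_i's records D_i (n_i x d) against the
    current clusters C (k x d). Returns the per-cluster point sums
    w_i (k x d) and counts m_i (k,) that are later aggregated securely."""
    k, d = C.shape
    dist = np.linalg.norm(D_i[:, None, :] - C[None, :, :], axis=2)
    labels = dist.argmin(axis=1)                 # closest cluster per record
    w = np.zeros((k, d))
    m = np.zeros(k, dtype=int)
    for j in range(k):
        w[j] = D_i[labels == j].sum(axis=0)
        m[j] = int((labels == j).sum())
    return w, m

# After aggregation, the new cluster j is the element-wise division
# c_j = (sum_a w_aj) / (sum_a m_aj); the scheme computes this over
# secret shares using the secure division protocol of Section 4.3.
```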

4.2 Secure aggregation protocol

In our secure aggregation protocol, Ni securely sums the centers and cluster counts of its neighbors Γi. In the protocol, each node Na ∈Γi holds a local dataset Da and stores its local centers and cluster counts \(\left \{\left (w_{a}^{(l)},m_{a}^{(l)}\right )\right \}\) in its history table. Upon receiving a request \((i,\hat {l})\) from Ni, Na first compares its current iteration l with \(\hat {l}\). If \(\hat {l}\leq l\), Na's history table contains the local centers and cluster counts for iteration \(\hat {l}\). Na retrieves \(\left \{\left (w_{a}^{(\hat {l})},m_{a}^{(\hat {l})}\right )\right \}\) from its history table and encrypts them with the public key of \(N_{s_{i}}\). Concretely, for \(w_{aj}^{(\hat {l})}=\left (w_{aj1}^{(\hat {l})}, w_{aj2}^{(\hat {l})},\cdots ,w_{ajd}^{(\hat {l})}\right )\) and \(m_{aj}^{(\hat {l})}\) (1 ≤ j ≤ k), Na encrypts them as \(E_{pk_{i}}\left (w_{aj}^{(\hat {l})}\right )=\left (E_{pk_{i}}\left (w_{aj1}^{(\hat {l})}\right ),E_{pk_{i}}\left (w_{aj2}^{(\hat {l})}\right ),\cdots ,E_{pk_{i}}\left (w_{ajd}^{(\hat {l})}\right )\right )\) and \(E_{pk_{i}}(m_{aj}^{(\hat {l})})\). Na then returns

$$ E_{pk_{i}}(w_{a}^{(\hat{l})})=\left\{E_{pk_{i}}\left( w_{a1}^{(\hat{l})}\right),E_{pk_{i}}\left( w_{a2}^{(\hat{l})}\right),\cdots,E_{pk_{i}}\left( w_{ak}^{(\hat{l})}\right)\right\}, $$
$$ E_{pk_{i}}\left( m_{a}^{(\hat{l})}\right)=\left\{E_{pk_{i}}\left( m_{a1}^{(\hat{l})}\right),E_{pk_{i}}\left( m_{a2}^{(\hat{l})}\right),\cdots,E_{pk_{i}}\left( m_{ak}^{(\hat{l})}\right)\right\} $$

to Ni. If \(\hat {l}> l\), Na puts \((i,\hat {l})\) into its wait table and checks the wait table at each iteration. Once its iteration counter l reaches \(\hat {l}\), Na computes the responses \(E_{pk_{i}}(w_{a}^{(\hat {l})})\) and \(E_{pk_{i}}(m_{a}^{(\hat {l})})\) following the above procedure and sends them to the node Ni.

After receiving all responses from its neighbors Γi, Ni aggregates all messages based on the additively homomorphic property of Paillier encryption. It computes

$$ \begin{array}{@{}rcl@{}} \prod\limits_{N_{a} \in{\Gamma}_{i}^{*}}E_{pk_{i}}\left( w_{a}^{(\hat{l})}\right)&=&\left\{\prod\limits_{N_{a} \in{\Gamma}_{i}^{*}}E_{pk_{i}}\left( w_{a1}^{(\hat{l})}\right),\prod\limits_{N_{a} \in{\Gamma}_{i}^{*}}E_{pk_{i}}\left( w_{a2}^{(\hat{l})}\right),\right.\\ &\cdots& \left. ,\prod\limits_{N_{a} \in{\Gamma}_{i}^{*}}E_{pk_{i}}\left( w_{ak}^{(\hat{l})}\right)\right\}, \end{array} $$
$$ \begin{array}{@{}rcl@{}} \prod\limits_{N_{a} \in{\Gamma}_{i}^{*}}E_{pk_{i}}\left( m_{a}^{(\hat{l})}\right)&=&\left\{\prod\limits_{N_{a} \in{\Gamma}_{i}^{*}}E_{pk_{i}}\left( m_{a1}^{(\hat{l})}\right),\prod\limits_{N_{a} \in{\Gamma}_{i}^{*}}E_{pk_{i}}\left( m_{a2}^{(\hat{l})}\right),\right.\\ &\cdots& \left.,\prod\limits_{N_{a} \in{\Gamma}_{i}^{*}}E_{pk_{i}}\left( m_{ak}^{(\hat{l})}\right)\right\}. \end{array} $$

Then Ni randomly selects d values \(\{r_{j1},r_{j2},\cdots ,r_{jd}\}\in \mathbb {Z}_{N}^{d}\) for each \({\prod }_{N_{a} \in {\Gamma }_{i}^{*}}E_{pk_{i}}\left (w_{aj}^{(\hat {l})}\right )=\left ({\prod }_{N_{a} \in {\Gamma }_{i}^{*}}E_{pk_{i}}\left (w_{aj1}^{(\hat {l})}\right ),\right .\) \(\left .{\prod }_{N_{a} \in {\Gamma }_{i}^{*}}E_{pk_{i}}\left (w_{aj2}^{(\hat {l})}\right ),\cdots ,{\prod }_{N_{a} \in {\Gamma }_{i}^{*}}E_{pk_{i}}\left (w_{ajd}^{(\hat {l})}\right )\right )\) and computes

$$ \alpha_{js}=\prod\limits_{N_{a} \in{\Gamma}_{i}^{*}}E_{pk_{i}}\left( w_{ajs}^{(\hat{l})}\right)*E_{pk_{i}}^{-1}(r_{js}), $$

where 1 ≤ s ≤ d. For each \({\prod }_{N_{a} \in {\Gamma }_{i}^{*}}E_{pk_{i}}\left (m_{aj}^{(\hat {l})}\right )\), Ni randomly selects a value \(R_{j}\in \mathbb {Z}_{N}\) and calculates

$$ {\beta}_{j}=\prod\limits_{N_{a} \in{\Gamma}_{i}^{*}}E_{pk_{i}}\left( m_{aj}^{(\hat{l})}\right)*E^{-1}_{pk_{i}}(R_{j}). $$

Ni sends {αj,βj}1≤j≤k to its assistant node \(N_{s_{i}}\), which decrypts {αj,βj}1≤j≤k with its secret key ski and obtains

$$ D_{sk_{i}}(\alpha_{js})=\sum\limits_{N_{a} \in{\Gamma}_{i}^{*}}w_{ajs}^{(\hat{l})}-r_{js}\mod N, $$
$$ D_{sk_{i}}({\beta}_{j})=\sum\limits_{N_{a} \in{\Gamma}_{i}^{*}}m_{aj}^{(\hat{l})}-R_{j}\mod N. $$

After executing the secure aggregation protocol, Ni holds rjs,Rj while \(N_{s_{i}}\) holds \(D_{sk_{i}}(\alpha _{js}),D_{sk_{i}}({\beta }_{j})\) satisfying \(r_{js}+D_{sk_{i}}(\alpha _{js})={\sum }_{N_{a}\in {\Gamma }_{i}^{*}}w_{ajs}^{(l)}\mod N\) and \(R_{j}+D_{sk_{i}}({\beta }_{j})={\sum }_{N_{a}\in {\Gamma }_{i}^{*}}m_{aj}^{(l)}\mod N\), i.e., Ni and \(N_{s_{i}}\) learn secret shares of \({\sum }_{N_{a}\in {\Gamma }_{i}^{*}}w_{aj}^{(l)}\) and \({\sum }_{N_{a}\in {\Gamma }_{i}^{*}}m_{aj}^{(l)}\) (1 ≤ j ≤ k) (Fig. 3).

Fig. 3: Secure aggregation protocol
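
The sketch below traces one cluster count through a round of the protocol. It uses the python-paillier (phe) package purely for illustration (the paper does not name a library) and shows the homomorphic aggregation, the blinding with Rj, and the resulting additive shares held by Ni and \(N_{s_{i}}\).

```python
import random
from phe import paillier

# key pair generated by the assistant node N_{s_i}; pk_i is shared with N_i
pk_i, sk_i = paillier.generate_paillier_keypair(n_length=2048)

counts = [7, 3, 5]                                   # toy m_aj values from three neighbors
ciphertexts = [pk_i.encrypt(m) for m in counts]      # each neighbor encrypts its count

# N_i multiplies the ciphertexts (addition of plaintexts, eq. (2)) and
# blinds the sum with a random R_j; R_j becomes N_i's additive share
R_j = random.randrange(0, 2 ** 64)
beta_j = sum(ciphertexts[1:], ciphertexts[0]) - R_j  # E(sum_a m_aj - R_j)

share = sk_i.decrypt(beta_j)                         # assistant node's share
assert share + R_j == sum(counts)                    # the shares reconstruct the sum
```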

4.2.1 Optimization

In this part, we propose a novel message encoding mechanism to improve the performance of our secure aggregation protocol. Our encoding method is based on Horner's rule [19], stated formally as follows.

Horner’s rule

Given an n-degree polynomial p = anRn + an− 1Rn− 1 + ⋯ + a1R + a0, we can represent the polynomial as p = (⋯(anR + an− 1)R + ⋯ )R + a0.

In our encoding method, we construct a polynomial p = anRn + an− 1Rn− 1 + ⋯ + a1R + a0 and ensure R > max{an,an− 1,⋯ ,a1,a0}. Given p and R, we can recover the n + 1 coefficients an,an− 1,⋯ ,a0 via Horner's rule at the cost of n + 1 division operations and n + 1 modulo operations. The details of our message encoding mechanism in the secure aggregation protocol are as follows.

We take \(m_{a}^{(\hat {l})}=\left \{m_{a1}^{(\hat {l})},m_{a2}^{(\hat {l})},\cdots ,m_{ak}^{(\hat {l})}\right \}\) as an example to illustrate the main steps of our encoding method. In the original secure aggregation protocol, Na encrypts \(m_{a}^{(\hat {l})}\) as \(E_{pk_{i}}(m_{a}^{(\hat {l})})=\left \{E_{pk_{i}}(m_{a1}^{(\hat {l})}),E_{pk_{i}}(m_{a2}^{(\hat {l})}),\cdots ,E_{pk_{i}}(m_{ak}^{(\hat {l})})\right \}\) and sends \(E_{pk_{i}}(m_{a}^{(\hat {l})})\) to Ni. Then Ni randomly selects k values \(\{R_{1},R_{2},\cdots ,R_{k}\}\in \mathbb {Z}_{N}\) and computes β = {β1,β2,⋯ ,βk} where \({\beta }_{j}={\prod }_{N_{a} \in {\Gamma }_{i}^{*}}E_{pk_{i}}(m_{aj}^{(\hat {l})})*E_{pk_{i}}^{-1}(R_{j})\). Ni sends β to \(N_{s_{i}}\), who decrypts β and obtains \(\{{\sum }_{N_{a} \in {\Gamma }_{i}^{*}}m_{a1}^{(\hat {l})}-R_{1}\mod N,\cdots ,{\sum }_{N_{a} \in {\Gamma }_{i}^{*}}m_{ak}^{(\hat {l})}-R_{k}\mod N\}\). The original protocol thus encrypts the k-dimensional vector \(m_{a}^{(\hat {l})}\) element-wise, which costs k encryption operations. Since the plaintext space N of Paillier encryption is usually much larger than each \(m_{aj}^{(\hat {l})}\), we can pack a k-dimensional vector into a single integer and encrypt that integer, reducing the encryption cost from k to 1. In our encoding method, we choose a value R and encode \(m_{a}^{(\hat {l})}=\left \{m_{a1}^{(\hat {l})},m_{a2}^{(\hat {l})},\cdots ,m_{ak}^{(\hat {l})}\right \}\) as a (k − 1)-degree polynomial in R:

$$ p(m_a^{(\hat{l})})=m_{a1}^{(\hat{l})}+m_{a2}^{(\hat{l})}R+\cdots+m_{ak}^{(\hat{l})}R^{k- 1}. $$

Na then encrypts \(p(m_{a}^{(\hat {l})})\) with the public key pki and sends \(E_{pk_{i}}(p(m_{a}^{(\hat {l})}))\) to Ni. After receiving \(E_{pk_{i}}(p(m_{a}^{(\hat {l})}))\) from all its neighbors, Ni randomly selects k values {r1,r2,⋯ ,rk} and encodes them as p(r) = r1 + r2R + ⋯ + rkRk− 1. Then Ni computes

$$ \beta=\prod\limits_{N_{a} \in{\Gamma}_{i}^{*}}E_{pk_{i}}(p(m_{a}^{(\hat{l})}))*E_{pk_{i}}(p(r)). $$

Ni sends β to its assistant node \(N_{s_{i}}\), which decrypts β and applies Horner's rule to obtain \(\{{\sum }_{N_{a} \in {\Gamma }_{i}^{*}}m_{a1}^{(\hat {l})}+r_{1}\mod N,\cdots ,{\sum }_{N_{a} \in {\Gamma }_{i}^{*}}m_{ak}^{(\hat {l})}+r_{k}\mod N\}\).

We require R > max{an,an− 1,⋯ ,a0} to ensure the correctness of Horner's rule. In our encoding method, the coefficient of Rj− 1 is \(a_{j-1}={\sum }_{N_{a} \in {\Gamma }_{i}^{*}}m_{aj}^{(\hat {l})}+r_{j}\). We select proper parameters R and rj with the following strategy: assume each \(m_{aj}^{(\hat {l})}\) is a σ-bit integer and the number of nodes Na ∈Γi is at most 𝜃; then we can set R to a (σ + 𝜃 + κ)-bit integer and select each rj from the range [0,2κ].
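
A plain-Python sketch of the packing and unpacking follows; only the plaintext layout changes, and the homomorphic operations are untouched. The parameter values are toy choices following the sizing rule above.

```python
def encode(values, R):
    """Pack v_1..v_k into one integer p = v_1 + v_2*R + ... + v_k*R^(k-1).
    Correct only while every (aggregated, blinded) coefficient stays below R."""
    p = 0
    for v in reversed(values):
        p = p * R + v
    return p

def decode(p, R, k):
    """Recover the k coefficients via Horner's rule: k modulo and k division ops."""
    out = []
    for _ in range(k):
        out.append(p % R)
        p //= R
    return out

# toy sizing: sigma-bit counts, at most theta neighbors, kappa-bit blinds;
# a (sigma + theta + kappa)-bit R is comfortably above every coefficient
sigma, theta, kappa = 16, 16, 40
R = 1 << (sigma + theta + kappa)
m_a = [7, 3, 5, 9]                       # k = 4 cluster counts
assert decode(encode(m_a, R), R, len(m_a)) == m_a
```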

4.3 Secure division protocol

In our secure division protocol, the values a and b are shared between two nodes Ni and Nj: Ni holds a1,b1 and Nj holds a2,b2 satisfying a = a1 + a2 mod N and b = b1 + b2 mod N. After executing the protocol, Ni learns the value \(\frac {a}{b}\), while Nj learns no useful information about a, b, or \(\frac {a}{b}\).

Ni first generates a public and secret key pair (pk,sk) for the Paillier cryptosystem and encrypts a1,b1 with the public key pk. Then Ni sends the encrypted values (Epk(a1),Epk(b1)) to Nj. After receiving these messages, Nj encrypts its own shares a2,b2, selects a non-zero random value \(\lambda \in \mathbb {Z}_{N}\), and computes

$$E_{pk}(\lambda a)=(E_{pk}(a_{1})*E_{pk}(a_{2}))^{\lambda},$$
$$E_{pk}(\lambda b)=(E_{pk}(b_{1})*E_{pk}(b_{2}))^{\lambda}.$$

Nj then sends Epk(λa) and Epk(λb) to Ni, who decrypts the received values with the secret key sk and obtains λa mod N and λb mod N. Finally, Ni obtains the quotient by computing \(\frac {a}{b}=\frac {\lambda a}{\lambda b}\) (Fig. 4).

Fig. 4: Secure division protocol
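
A sketch of one run of the division protocol follows, again using the phe package as an assumed stand-in for an unnamed Paillier library; Ni ends up with a/b while Nj only handles blinded ciphertexts. The shares and the λ range are toy values.

```python
import random
from phe import paillier

pk, sk = paillier.generate_paillier_keypair(n_length=2048)   # generated by N_i

a1, b1 = 120, 10              # N_i's shares
a2, b2 = 30, 5                # N_j's shares, so a = 150, b = 15

enc_a1, enc_b1 = pk.encrypt(a1), pk.encrypt(b1)              # sent to N_j

# N_j folds in its own shares and blinds both values with the same lambda
lam = random.randrange(1, 2 ** 32)                           # non-zero blind (toy range)
enc_la = (enc_a1 + a2) * lam                                 # E(lambda * a) via (2)-(3)
enc_lb = (enc_b1 + b2) * lam                                 # E(lambda * b)

# N_i decrypts both and takes the ratio; lambda cancels out
assert sk.decrypt(enc_la) / sk.decrypt(enc_lb) == 150 / 15
```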

5 Evaluation

5.1 Security analysis

We consider the security of our scheme under the semi-honest model [10]. The security of our scheme can be proved based on the following theorems.

Theorem 1

Our secure aggregation protocol is secure under the semi-honest model: no participant can learn any useful information about the others' inputs.

Proof

Our secure aggregation protocol involves three types of participants: the neighbor nodes Na ∈Γi, the node Ni, and the assistant node \(N_{s_{i}}\). We prove the theorem by considering, in turn, the corruption of each type of participant by an adversary, and show that a computationally indistinguishable simulator can be constructed for the corrupted party's view.

When the node Ni is corrupted, we construct a simulator Simi to simulate Ni's view as follows. In a real execution, Ni's view Viewi is

$$ \begin{array}{@{}rcl@{}} View_{i}&=&\left\{w_{i}^{(\hat{l})},m_{i}^{(\hat{l})},\left( E_{pk_{i}}\left( w_{a}^{(\hat{l})}\right),E_{pk_{i}}\left( m_{a}^{(\hat{l})}\right)\right)_{N_{a}\in{\Gamma}_{i}},(\alpha_{j},{\beta}_{j} )_{1\leq j\leq k},\right.\\ && \left. (r_{js})_{1\leq j\leq k,1\leq s\leq d}, (R_{j})_{1\leq j\leq k}\right\} \end{array} $$

In the above Viewi, \(w_{i}^{(\hat {l})},m_{i}^{(\hat {l})}\) are the inputs, \((E_{pk_{i}}(w_{a}^{(\hat {l})}),E_{pk_{i}}(m_{a}^{(\hat {l})}),\alpha _{j},{\beta }_{j})\) are Paillier ciphertexts, and rjs,Rj are random values selected from \(\mathbb {Z}_{N}\). To simulate Viewi, Simi randomly selects \(w_{a}^{*},m_{a}^{*}\) from \(\mathbb {Z}_{N}\) and encrypts them with the public key pki. It then randomly selects \((r^{*}_{js}, R^{*}_{j})_{1\leq j\leq k}\) from \(\mathbb {Z}_{N}\) and computes \(\alpha ^{*}_{js}={\prod }_{N_{a} \in {\Gamma }_{i}^{*}}E_{pk_{i}}(w_{ajs}^{*})*E^{-1}_{pk_{i}}(r^{*}_{js})\) and \(\beta ^{*}_{j}={\prod }_{N_{a} \in {\Gamma }_{i}^{*}}E_{pk_{i}}(m_{aj}^{*})*E_{pk_{i}}^{-1}(R^{*}_{j})\). The simulated view is \(Sim_{i}=\{w_{i}^{(\hat {l})},m_{i}^{(\hat {l})},(E_{pk_{i}}(w_{a}^{*}),E_{pk_{i}}(m_{a}^{*}))_{N_{a}\in {\Gamma }_{i}},(\alpha ^{*}_{j},\beta ^{*}_{j} )_{1\leq j\leq k},\) \((r^{*}_{js})_{{1\leq j\leq k,1\leq s\leq d}},(R^{*}_{j})_{1\leq j\leq k}\}\). In both Viewi and Simi, the inputs are identical. \((E_{pk_{i}}(w_{a}^{(\hat {l})}),E_{pk_{i}}(m_{a}^{(\hat {l})}),\alpha _{j},{\beta }_{j})\) and \((E_{pk_{i}}(w_{a}^{*}),E_{pk_{i}}(m_{a}^{*}),\alpha ^{*}_{j},\beta ^{*}_{j} )\) are Paillier ciphertexts; since Paillier encryption is semantically secure, they are computationally indistinguishable. Since rjs,Rj and \(r^{*}_{js},R^{*}_{j}\) are all random values in \(\mathbb {Z}_{N}\), they are likewise indistinguishable. Therefore, Simi is computationally indistinguishable from Viewi.

When the assistant node \(N_{s_{i}}\) is corrupted, we construct a simulator \(Sim_{s_{i}}\) as follows. \(N_{s_{i}}\)'s view is

$$ View_{s_{i}}=\{(D_{sk_{i}}(\alpha_{js}))_{1\leq j\leq k,1\leq s\leq d}, (D_{sk_{i}}({\beta}_{j}))_{1\leq j\leq k}\}. $$

To simulate \(View_{s_{i}}\), \(Sim_{s_{i}}\) randomly chooses \(\gamma _{js}^{*}\) and \(\eta _{j}^{*}\) from \(\mathbb {Z}_{N}\). The simulated view is \(Sim_{s_{i}}=\{(\gamma ^{*}_{js})_{1\leq j\leq k,1\leq s\leq d}, (\eta _{j}^{*})_{1\leq j\leq k}\}\). Since both \(D_{sk_{i}}(\alpha _{js})\) and \(D_{sk_{i}}({\beta }_{j})\) are masked with random values, \(\gamma _{js}^{*},\eta _{j}^{*}\) are indistinguishable from \(D_{sk_{i}}(\alpha _{js}),D_{sk_{i}}({\beta }_{j})\). Therefore, we conclude that \(Sim_{s_{i}}\) is computationally indistinguishable from \(View_{s_{i}}\).

Finally, consider the case where a neighbor Na ∈Γi is corrupted. The node Na in our protocol only receives \((i,\hat {l})\) from Ni. These messages are public parameters, so an adversary that corrupts a neighbor Na ∈Γi cannot learn any useful information.

Combining the above, we can conclude that the secure aggregation protocol is secure under the semi-honest model. □

Theorem 2

Our secure division protocol is secure under the semi-honest model.

Proof

Our secure division protocol involves two participants Ni and Nj. We construct a computationally indistinguishable simulator to simulate the corrupted party’s view as follows.

Nj's view during the protocol is Viewj = {Epk(a1),Epk(b1),λ,Epk(λa),Epk(λb)}. To simulate it, Simj randomly selects \(a_{1}^{*},b_{1}^{*}\) from \(\mathbb {Z}_{N}\) and encrypts them with the public key pk. Then Simj randomly selects a non-zero value λ∗ from \(\mathbb {Z}_{N}\) and computes \(E_{pk}(\lambda ^{*}a^{*})=(E_{pk}(a^{*}_{1})*E_{pk}(a_{2}))^{\lambda ^{*}}\) and \(E_{pk}(\lambda ^{*}b^{*})=(E_{pk}(b^{*}_{1})*E_{pk}(b_{2}))^{\lambda ^{*}}\). The simulated view is \(Sim_{j}=\{E_{pk}(a^{*}_{1}),E_{pk}(b^{*}_{1}),\lambda ^{*}, E_{pk}(\lambda ^{*} a^{*}),E_{pk}(\lambda ^{*} b^{*})\}\). Since λ and λ∗ are randomly selected values and the Paillier cryptosystem is semantically secure, Simj is computationally indistinguishable from Viewj.

Ni's view in this protocol is Viewi = {λa,λb}. To simulate it, Simi selects a non-zero value a∗ from \(\mathbb {Z}_{N}\) and computes \(\frac {a}{b}a^{*}\). The simulated view is \(Sim_{i}=\{a^{*}, \frac {a}{b}a^{*}\}\). Since λa and λb are masked with a random value, they are indistinguishable from \(a^{*}\) and \(\frac {a}{b}a^{*}\). Thus, Simi is computationally indistinguishable from Viewi.

Based on the above analysis, we can claim that our secure division protocol is secure under the semi-honest model. □

Theorem 3

Our privacy-preserving k-means clustering protocol is secure under the semi-honest model if the secure aggregation protocol and the secure division protocol are secure under the semi-honest model.

Proof

In our k-means clustering protocol, each node Ni first computes its local centers \(w^{(l)}_{i}\) and cluster counts \(m_{i}^{(l)}\). It then runs the secure aggregation protocol with its neighbors Γi to securely sum the local centers and cluster counts. By Theorem 1, the secure aggregation protocol is secure under the semi-honest model, so Ni cannot learn any useful information about other nodes' local data. Ni then runs the secure division protocol with \(N_{s_{i}}\) to compute the clusters of the next iteration. Theorem 2 proves that the secure division protocol is secure under the semi-honest model, so Ni cannot learn any useful information about the aggregation result except the new clusters. Based on the above analysis, we can claim that our k-means clustering protocol is secure under the semi-honest model. □

5.2 Complexity analysis

In this part, we analyze the computational complexity of our proposed scheme. For simplicity, we omit the cost of operations over plaintexts and focus on the time-consuming operations over ciphertexts, namely encryption operations E, exponentiation operations Exp, and decryption operations D. The computational complexity of our proposed protocols is summarized in Table 2, and the detailed analysis follows.

Table 2: The computational complexity of the proposed protocols

We first analyze the cost of our original secure aggregation protocol. Each node Na ∈Γi first encrypts \((w_{a}^{(\hat {l})},m_{a}^{(\hat {l})})\), which requires (dk + k) encryption operations. Ni computes \({\prod }_{N_{a} \in {\Gamma }_{i}^{*}}E_{pk_{i}}(w_{a}^{(\hat {l})})\), which costs \(|{\Gamma }_{i}^{*}|kd\) exponentiation operations, and \({\prod }_{N_{a} \in {\Gamma }_{i}^{*}}E_{pk_{i}}(m_{a}^{(\hat {l})})\), which takes \(|{\Gamma }_{i}^{*}|k\) exponentiation operations. Then Ni selects random values and computes {αj,βj}1≤j≤k, which requires (kd + k) encryption operations and (kd + k) exponentiation operations. \(N_{s_{i}}\) decrypts {αj,βj}1≤j≤k, which takes (kd + k) decryption operations. The overall computational cost of our original secure aggregation protocol is therefore \(O(|{\Gamma }_{i}^{*}|kd)\) encryption operations, O(kd) decryption operations, and \(O(|{\Gamma }_{i}^{*}|kd)\) exponentiation operations.

In our optimized secure aggregation protocol, we encode the vectors \(w_{aj}^{(\hat {l})}\) and \(m_{a}^{(\hat {l})}\) into two integers, respectively. Thus Na ∈Γi only requires (k + 1) encryption operations to encrypt \((w_{a}^{(\hat {l})},m_{a}^{(\hat {l})})\). Ni takes \(|{\Gamma }_{i}^{*}|k\) exponentiation operations to compute \({\prod }_{N_{a} \in {\Gamma }_{i}^{*}}E_{pk_{i}}(w_{a}^{(\hat {l})})\) and \(|{\Gamma }_{i}^{*}|\) exponentiation operations to compute \({\prod }_{N_{a} \in {\Gamma }_{i}^{*}}E_{pk_{i}}(m_{a}^{(\hat {l})})\). Then Ni encodes the random values and computes {(α1,α2,⋯ ,αk),β}, which requires (k + 1) encryption operations and (k + 1) exponentiation operations. \(N_{s_{i}}\) decrypts {(α1,α2,⋯ ,αk),β} to learn the result, which takes (k + 1) decryption operations. The overall computational cost of our optimized secure aggregation protocol is \(O(|{\Gamma }_{i}^{*}|k)\) encryption operations, O(k) decryption operations, and \(O(|{\Gamma }_{i}^{*}|k)\) exponentiation operations.

In our secure division protocol, Ni encrypts a1,b1, which requires 2 encryption operations. Nj encrypts a2,b2 and computes Epk(λa) and Epk(λb) based on the homomorphic properties of Paillier encryption, which needs 2 encryption operations and 2 exponentiation operations (each with exponent λ). Ni then receives the responses from Nj and decrypts them, which requires 2 decryption operations. The overall computational cost of our secure division protocol is O(1) encryption operations, O(1) decryption operations, and O(1) exponentiation operations.

In each iteration of our privacy-preserving k-means clustering protocol, the node Ni runs the secure aggregation protocol to gather the centers and cluster counts of all its neighbors and securely share this information with its assistant node \(N_{s_{i}}\). Ni then runs the secure division protocol to compute the local clusters; the update procedure invokes the secure division protocol kd times to calculate the new clusters \(C_{i}^{(l+1)}\). Therefore, the per-iteration computational cost of a single node in our privacy-preserving k-means clustering scheme with the original secure aggregation protocol, denoted PPkM_b, is \(O((d+|{\Gamma }_{i}^{*}|)kd)\) encryption operations, O(kd) decryption operations, and \(O((d+|{\Gamma }_{i}^{*}|)kd)\) exponentiation operations. Assuming the number of iterations is l, the overall computational cost of a single node in our PPkM_b scheme is \(O((d+|{\Gamma }_{i}^{*}|)kdl)\) encryption operations, O(kdl) decryption operations, and \(O((d+|{\Gamma }_{i}^{*}|)kdl)\) exponentiation operations. Similarly, the computational cost of a single node in our privacy-preserving k-means clustering scheme with the optimized secure aggregation protocol, denoted PPkM_e, is \(O((d+|{\Gamma }_{i}^{*}|)kl)\) encryption operations, O(kl) decryption operations, and \(O((d+|{\Gamma }_{i}^{*}|)kl)\) exponentiation operations.

5.3 Experiments

In this part, we evaluate the performance of our proposed scheme under various parameter settings. We use a real dataset from the UCI repository, which consists of 56,000 data records, each with 8 attributes. We assume that the number of data records held by each peer node is the same, fixed to 500 in our experiments. We implement Paillier encryption with an open-source Paillier library and conduct all experiments on a machine with a 4-core CPU and 16 GB of RAM. The key size of Paillier encryption is set to 2048 bits, and the number of clusters k is set to 8. We vary the number of neighbors |Γi| of each node and measure the computation time of a single node at each iteration. The results are shown in Table 3. The computation time of both PPkM_b and PPkM_e increases with |Γi|, and the cost of PPkM_e is consistently smaller than that of PPkM_b. For example, when |Γi| is 5, the computation time of PPkM_b is 0.72 seconds while that of PPkM_e is only 0.12 seconds; when |Γi| grows to 30, the computation time of PPkM_b rises to 4.63 seconds while that of PPkM_e is only 0.74 seconds. These results indicate that our encoding method based on Horner's rule effectively reduces the computation cost.

Table 3: Computation time of a single node per iteration (seconds)

6 Related work

6.1 Distributed machine learning in peer-to-peer networks

A large number of works have been proposed for machine learning over distributed data in peer-to-peer networks. Luo et al. [26] proposed an ensemble scheme for distributed classification in peer-to-peer networks; they construct local classifiers on each peer using the pasting-bites learning algorithm and propose a distributed plurality voting protocol to combine the decisions of the local classifiers. Wolff et al. [39] presented a generic algorithm to compute any ordinal function of the average data in a large peer-to-peer network and proposed a general framework for consequently updating any model of the distributed data. Datta et al. [7] proposed two algorithms for k-means clustering over distributed data in peer-to-peer networks; their scheme avoids large-scale synchronization and data centralization, requiring only that each node achieve local synchronization with its neighbors. Ang et al. [1] combined the cascade support vector machine (SVM) and reduced SVM methods to construct SVM classifiers in peer-to-peer networks, achieving classification accuracy comparable to centralized classifiers. Kan et al. [17] presented a collaborative classification method that builds SVM classifiers in scale-free peer-to-peer networks and improves local classification accuracy by propagating the most influential SVM models. Ormándi et al. [30] proposed a general approach named gossip learning that combines local classification models based on multiple models taking random walks and a virtual weighted voting mechanism. Papapetrou et al. [32] presented a collaborative approach for document classification in peer-to-peer networks that constructs local classification models on each node and combines the most discriminative models into a collaborative classification model. Mashayekhi et al. [27] presented a fully decentralized method for clustering dynamic and distributed data in peer-to-peer networks; their scheme first constructs a summarized view of the distributed dataset through decentralized gossip-based communication and then applies partition-based and density-based clustering methods to the summarized view. The above solutions mainly focus on designing machine learning algorithms over distributed data that maintain high accuracy and low computation and communication costs without centralizing the data at a single node. Although some of them achieve accuracy comparable to centralized methods, they fail to consider privacy when performing machine learning over distributed data: some nodes may not want to reveal their local data or models to others.

6.2 Privacy-preserving distributed machine learning

Privacy-preserving machine learning over distributed data has been investigated by several secure schemes, such as privacy-preserving naive Bayes classification [37, 38, 49], secure support vector machines [15, 47, 50], and privacy-preserving deep learning [28, 34]. For k-means clustering, Vaidya et al. [36] used a secure permutation protocol and homomorphic encryption to construct the first privacy-preserving k-means algorithm for vertically partitioned data, where each party holds a portion of the attributes of the data records. The work [8] presented a secure multi-party clustering scheme over vertically partitioned data based on additive secret sharing, which achieves better computational performance than existing works. The work [14] proposed secure protocols based on oblivious polynomial evaluation and homomorphic encryption to perform k-means clustering over horizontally partitioned data, where each party holds different data records of the dataset. Yu et al. [48] proposed a secure multi-party k-means clustering scheme that considers both horizontally and vertically partitioned datasets. Jagannathan et al. [13] proposed a secure scheme for k-means clustering over arbitrarily partitioned data based on random shares and Yao's garbled circuits [12]. Xing et al. [40] designed a mutually privacy-preserving k-means clustering scheme for social participatory sensing environments, which preserves both each party's private information and the global clusters.

Some works have also considered privacy-preserving machine learning in peer-to-peer networks. Das et al. [5] proposed a privacy-preserving feature selection scheme over distributed data in large peer-to-peer networks. Their scheme incorporates misclassification gain, Gini index, and entropy feature measures and combines the secure sum protocol with the Bayes optimal privacy model to aggregate features without compromising the privacy of each node. Bhuyan et al. [3] used fuzzy methodologies to design a privacy-preserving sub-feature selection scheme for distributed environments.

7 Conclusion

In this paper, we proposed a novel privacy-preserving k-means clustering scheme for peer-to-peer networks. We designed a secure aggregation protocol to learn the sum of the centers and cluster counts of a node's neighbors, and a secure division protocol to perform division over shared values. Moreover, we presented a novel message encoding method based on Horner's rule to improve the performance of our aggregation protocol. Compared with existing solutions, our scheme achieves both local synchronization and privacy protection for each peer. We also formally proved the security of the proposed scheme and analyzed its computational complexity.