1 Introduction

k-means clustering [4, 25] is a simple and well-studied machine learning technique that aims to group data records into k clusters. Informally, the clustering algorithm partitions data records into clusters such that records in the same cluster are close to each other under some distance metric. Clustering has been applied in many domains, such as image segmentation and gene detection [9, 11, 45]. The traditional k-means clustering algorithm assumes that the data is centralized in a single location, so it cannot be directly applied to distributed settings where each party holds a part of the dataset.

As a typical distributed system consisting of thousands of connected nodes, peer-to-peer (P2P) networks have emerged as a good choice for many technologies [6, 33, 46], and performing clustering in P2P networks has attracted increasing attention. In particular, designing a distributed clustering algorithm for P2P networks, instead of collecting all data at a single point, is a practical option. Distributed clustering in P2P networks has many important applications. For example, in a P2P file-sharing network, a distributed clustering algorithm can mine the browsing histories of users and find users with common interests [18]. However, P2P networks impose several challenges for clustering over distributed data. First, since P2P networks are highly decentralized and often contain millions of nodes, it is impractical to centralize the whole dataset or to achieve global synchronization in the network. Second, P2P networks undergo frequent topology changes. Third, each peer experiences frequent data updates. Fourth, peers may be unwilling to reveal their local data due to privacy concerns.

Many schemes [2, 7, 27, 29] have been proposed to address the above challenges. Among these works, Datta et al. [7] proposed two algorithms for k-means clustering over distributed data in peer-to-peer networks, which only require each node to achieve local synchronization with its neighbors. Mashayekhi et al. [27] presented a fully decentralized method for clustering dynamic and distributed data in P2P networks. Their scheme constructs a summarized view of the distributed dataset and applies partition-based and density-based clustering methods to this view, achieving good clustering accuracy with low communication cost. Although these works achieve accuracy comparable to centralized methods, each peer is required to send local information to its neighbors for collaborative computation. Since the local data of a peer may contain sensitive information, directly sending it, as the existing distributed clustering schemes do, compromises the peer's privacy [21, 42,43,44]. Some recent works [3, 5] designed privacy-preserving feature selection schemes for P2P networks, but these schemes cannot be directly applied to privacy-preserving k-means clustering in P2P networks. In another line of work, several secure schemes [36, 40, 48] handle k-means clustering over horizontally or vertically partitioned data. However, these privacy-preserving clustering schemes either consider only the two-party setting or require global synchronization among all parties, which does not suit the topology of P2P networks. A practical privacy-preserving k-means clustering scheme for P2P networks is still lacking.

In this paper, we propose a novel privacy-preserving k-means clustering scheme over distributed data in P2P networks. Each peer in our scheme iteratively updates the clusters and achieves local synchronization at each iteration, i.e., it only needs to synchronize with its neighbors to learn the clusters. Specifically, each peer runs our secure aggregation protocol to obtain local centers and cluster counts from its neighbors and computes the new clusters using our secure division protocol. As a result, each peer learns the clustering result without revealing its local information. Our main contributions can be summarized as follows.

  • We propose a novel privacy-preserving k-means clustering scheme over distributed data in P2P networks, which simultaneously achieves local synchronization and protects the privacy of each peer.

  • To protect the privacy of local data in each peer, we design a secure aggregation protocol and a secure division protocol based on homomorphic encryption [31]. In addition, we design a novel message encoding mechanism to improve the performance of our aggregation protocol.

  • We formally prove that the proposed scheme is secure under the semi-honest model. We also theoretically analyze the performance of our proposed scheme.

The remainder of this paper is organized as follows. We introduce the system model and threat model in Section 2. We present some preliminaries used in our scheme in Section 3. We describe the proposed scheme in Section 4. Then, we analyze the security and computational complexity of our proposed scheme in Section 5. We review the related work in Section 6. Finally, we conclude the paper in Section 7.

2 Problem statement

2.1 System model

The dataset D, consisting of n data records {p1,p2,⋯ ,pn}, is distributed over different nodes in a peer-to-peer network. Each node represents a user and holds a part of the dataset; in other words, D is horizontally distributed over the nodes. The nodes collaboratively learn k clusters \(C^{\prime }=\{c^{\prime }_{1},c^{\prime }_{2},\cdots ,c^{\prime }_{k}\}\) over the distributed dataset using the k-means clustering algorithm. Nevertheless, the standard clustering algorithm only works on centralized datasets, and performing k-means clustering over distributed datasets in peer-to-peer networks remains a challenge. In this paper, we model the network as a connected, undirected graph where each peer is a vertex and an edge between two nodes denotes that they can communicate with each other. Our scheme is based on local synchronization: each node synchronizes only with its neighbors in each iteration. We therefore simply assume that each node can only communicate with its neighbors and that each node has a unique identity. Given a node Ni, we use the notation Γi to denote its set of neighbors and |Γi| to denote the number of neighbors. All notations used in this paper are summarized in Table 1.

Table 1: Notation table

2.2 Threat model

We consider the security of our proposed scheme under the semi-honest (honest-but-curious) model [10]. That is, each party correctly follows the protocol but tries to learn the others' inputs from what it legitimately receives during the protocol. A protocol π is secure under the semi-honest model if each party's view during the protocol can be simulated given only its input and output. The semi-honest model has been adopted by several existing works [16, 20, 22, 41]. A formal definition of security against semi-honest adversaries is given below.

Definition 1

Let F be a functionality and π an n-party protocol for computing F, where Fi denotes the computation of party i. The view of party i during the execution of π is denoted by Viewi and equals (xi,ri,m1,⋯ ,mt), where xi is the input, ri is the randomness, and mj is the j-th received message. We say that the protocol π is secure under the semi-honest model if there exists a probabilistic polynomial-time simulator Simi for each party i such that

$$ Sim_{i} (x_{i},F_{i}(x_{1},x_{2},\cdots,x_{n}))\overset{c}{\equiv} View_{i}(x_{i},F_{i}(x_{1},x_{2},\cdots,x_{n})), $$
(1)

where \(\overset {c}{\equiv }\) represents computational indistinguishability.

2.3 Design goal

In our scheme, the dataset D is horizontally distributed over different peers, so each peer knows the content of only its own data records. The nodes aim to learn k clusters over the whole dataset, so we assume that each node learns all k clusters during the computation. The main privacy issues we consider in this paper are listed below.

  • A node cannot learn the content of data records possessed by other nodes.

  • A node cannot know the closest clusters of data records possessed by other nodes.

  • A node cannot learn, for any cluster, the number of data records assigned to it by other nodes.

3 Preliminaries

3.1 k-means clustering

Given a dataset of n data records D = {p1,p2,⋯ ,pn}, k-means clustering partitions the dataset into k disjoint subsets called clusters, where k is a user-defined parameter and each pi is an element of \(\mathcal {R}^{d}\). The goal of k-means clustering is to find clusters that minimize the sum of distances between data records and their clusters. In this paper, we use the Euclidean distance as the distance metric, i.e., \(Dist(p,q)=\sqrt {{\sum }_{i=1}^{d}(p_{i}-q_{i})^{2}}\), and represent each cluster by its centroid. The k-means clustering algorithm [4, 25] proceeds as follows.

We first set l = 1 and randomly select k data records \(C^{(l)}=\left \{c^{(l)}_{1},c^{(l)}_{2},\cdots ,c^{(l)}_{k}\right \}\) from the dataset D as the initial clusters, then iteratively refine the k candidate clusters until a termination condition is reached. Specifically, in each iteration, we assign each data record pi to the cluster \(c^{(l)}_{j}\) closest to it and count the number of such data records, denoted \(m^{(l)}_{j}\). We then compute the new clusters \(C^{(l+1)}=\left \{c^{(l+1)}_{1},c^{(l+1)}_{2},\cdots ,c^{(l+1)}_{k}\right \}\) from the cluster assignment, i.e., each new cluster is the arithmetic mean of the points assigned to it. We check whether the difference between the new and the old clusters is within a predefined threshold. If so, we terminate the iteration and output C(l+1) as the final result; otherwise, we replace C(l) with C(l+1) and repeat the above process. The main steps are illustrated in Fig. 1.

Fig. 1: k-means clustering algorithm
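
For concreteness, the following minimal NumPy sketch mirrors the centralized loop above; the function and variable names are ours, not from the paper.

```python
import numpy as np

def kmeans(D, k, eps=1e-4, max_iter=100, rng=np.random.default_rng(0)):
    """Plain centralized k-means: D is an (n, d) array, k the cluster count."""
    C = D[rng.choice(len(D), k, replace=False)]          # random initial centroids
    for _ in range(max_iter):
        # assign each record to its closest centroid (Euclidean distance)
        dist = np.linalg.norm(D[:, None, :] - C[None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        # recompute each centroid as the arithmetic mean of its points
        C_new = np.array([D[labels == j].mean(axis=0) if (labels == j).any()
                          else C[j] for j in range(k)])
        if np.linalg.norm(C_new - C, axis=1).max() <= eps:   # termination test
            return C_new
        C = C_new
    return C
```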

The above algorithm only works in the centralized setting where the whole dataset D is stored in a single place. In peer-to-peer networks, however, the dataset D is distributed over different nodes, each node Ni holding a part of it. The specific topology of P2P networks makes it difficult to design a distributed k-means clustering algorithm. Moreover, the subset Di held by each node may contain sensitive information about the node, which it is unwilling to reveal to others. Designing a distributed k-means clustering algorithm that preserves the privacy of each node is still a challenge.

3.2 Homomorphic encryption

The Paillier cryptosystem [31] is an efficient public-key cryptosystem with semantic security (indistinguishability under chosen-plaintext attack, IND-CPA). The encryption scheme is additively homomorphic, i.e.,

$$ E_{pk}(x_{1}) \times E_{pk}(x_{2})=E_{pk}(x_{1}+x_{2}), $$
(2)
$$ E_{pk}(x)^{\alpha} = E_{pk}(\alpha x). $$
(3)

Here, E denotes the encryption function, x1, x2 and α are arbitrary messages in the plaintext space, pk is the public key, Epk(x1) is the ciphertext of x1, and D denotes the decryption function. The main steps of the Paillier cryptosystem are as follows.

Key Generation.:

Choose two large enough primes p and q. The secret key is sk = lcm(p − 1,q − 1), the least common multiple of p − 1 and q − 1. The public key is pk = (N,g), where N = pq and \(g\in \mathbb {Z}_{N^{2}}^{*}\) satisfies \(\gcd \big {(}L(g^{sk}\mod N^{2}), N\big {)}=1 \), i.e., the greatest common divisor of L(gsk mod N2) and N is 1. Here, L(x) = (x − 1)/N, and likewise below.

Encryption.:

Let x0 be a number in the plaintext space \(\mathbb {Z}_{N}\). Select a random \(r\in \mathbb {Z}_{N}^{*}\) as the blinding parameter; the ciphertext of x0 is then \(c_{0}=\mathrm {E}_{pk}(x_{0})=g^{x_{0}}r^{N}\ \text {mod} \ N^{2}\).

Decryption.:

Let \(c_{0}\in \mathbb {Z}_{N^{2}}\) be a ciphertext. The plaintext hidden in c0 is

$$ x_{0}=\text{Dec}_{sk}(c_{0})=\frac {L({c_{0}^{sk}}\mod N^{2})}{L(g^{sk}\mod N^{2})}\mod N. $$

Note that Paillier encryption only supports non-negative integers, so we transform real numbers into integers by multiplying them by a large integer δ (δ > 0), as in previous works [23, 24, 35].
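
To make these steps concrete, here is a toy, deliberately insecure Paillier sketch (tiny hard-coded primes, g = N + 1) that exercises the key generation, encryption, and decryption formulas above and checks the homomorphic properties (2) and (3); it is an illustration, not a production cryptosystem.

```python
import math, random

def keygen(p=1789, q=2003):                     # toy primes; real keys use ~1024-bit primes
    n = p * q
    g = n + 1                                   # a standard choice satisfying the gcd condition
    sk = math.lcm(p - 1, q - 1)                 # sk = lcm(p-1, q-1)
    return (n, g), sk

def L(u, n):
    return (u - 1) // n                         # L(x) = (x - 1) / N from the text

def encrypt(pk, m):
    n, g = pk
    while True:                                 # pick r in Z_N^* (gcd(r, N) = 1)
        r = random.randrange(1, n)
        if math.gcd(r, n) == 1:
            break
    return pow(g, m, n * n) * pow(r, n, n * n) % (n * n)

def decrypt(pk, sk, c):
    n, g = pk
    num = L(pow(c, sk, n * n), n)
    den = L(pow(g, sk, n * n), n)
    return num * pow(den, -1, n) % n            # division mod N via modular inverse

pk, sk = keygen()
n = pk[0]
c1, c2 = encrypt(pk, 15), encrypt(pk, 27)
assert decrypt(pk, sk, c1 * c2 % (n * n)) == 42   # eq. (2): E(x1) * E(x2) = E(x1 + x2)
assert decrypt(pk, sk, pow(c1, 3, n * n)) == 45   # eq. (3): E(x)^a = E(a * x)
```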

4 Privacy-preserving k-means clustering in P2P networks

4.1 Overview

Our proposed privacy-preserving k-means clustering scheme does not require all nodes in a peer-to-peer network to achieve global synchronization. It only requires each node to synchronize with its neighbors, i.e., local synchronization: each node moves on to the next iteration once it receives responses from all its neighbors. To protect the privacy of each node, we design a secure aggregation protocol and a secure division protocol between each node and its neighbors.

In the initial stage of our scheme, a single node randomly generates k initial clusters \(C^{(1)}=\left \{c^{(1)}_{1},c^{(1)}_{2},\cdots ,c^{(1)}_{k}\right \}\) and a termination threshold 𝜖 > 0. The node then sends (1,C(1),𝜖) to all its neighbors and starts iteration 1. When a node Ni receives the message (1,C(1),𝜖) from its neighbors for the first time, it randomly chooses a neighbor \(N_{s_{i}}\) as its assistant node. \(N_{s_{i}}\) generates its public and secret key pair (pki,ski) and sends the public key pki to Ni. The node Ni then sends (1,C(1),𝜖,pki) to the rest of its neighbors and starts iteration 1. Eventually, all nodes finish the initial stage and enter iteration 1 with the same initial clusters C(1) and threshold 𝜖.

In each iteration of our scheme, each node Ni securely aggregates clusters from its neighbors and computes new clusters. Specifically, Ni first runs a k-means clustering step on its local dataset Di with the local clusters \(C_{i}^{(l)}\). For each data record in Di, Ni calculates the distance to each cluster and assigns the record to the closest cluster \(c^{(l)}_{ij}\). Then Ni counts the number of records in its dataset assigned to the cluster \(c^{(l)}_{ij}\), denoted \(m_{ij}^{(l)}\), and computes k local centers \(w_{i}^{(l)}=\left (w_{i1}^{(l)},w_{i2}^{(l)},\cdots ,w_{ik}^{(l)}\right )\), where \(w_{ij}^{(l)}\) is a d-dimensional point and \(w_{ij}^{(l)}={\sum }_{p\in c^{(l)}_{ij}}p\). Ni stores \(\left \{w_{i}^{(l)},m_{i}^{(l)}=\left (m_{i1}^{(l)},m_{i2}^{(l)},\cdots ,m_{ik}^{(l)}\right )\right \}\) in its history table. It also sends a request (i,l) to its neighbor nodes Γi, which asks all its neighbors to run the secure aggregation protocol and securely return their local centers and cluster counts for iteration l. Then, Ni and its assistant node \(N_{s_{i}}\) learn secret shares of \({\sum }_{N_{a}\in {\Gamma }_{i}^{*}}w_{aj}^{(l)}\) and \({\sum }_{N_{a}\in {\Gamma }_{i}^{*}}m_{aj}^{(l)}\), i.e., Ni gets αj1,βj1 and \(N_{s_{i}}\) gets αj2,βj2, where \({\sum }_{N_{a}\in {\Gamma }_{i}^{*}}w_{aj}^{(l)}=\alpha _{j1}+\alpha _{j2}\mod N\) and \({\sum }_{N_{a}\in {\Gamma }_{i}^{*}}m_{aj}^{(l)}={\beta }_{j1}+{\beta }_{j2}\mod N\). The details of the secure aggregation protocol are described in the next part. Ni then runs the secure division protocol with the assistant node \(N_{s_{i}}\) to update its clusters. For each cluster \(c_{ij}^{(l+1)}\), Ni obtains

$$ c_{ij}^{(l+1)}=\frac{\sum\limits_{N_{a}\in {\Gamma}_{i}^{*}}w_{aj}^{(l)}}{\sum\limits_{N_{a}\in {\Gamma}_{i}^{*}}m_{aj}^{(l)}}. $$

Ni computes \(Dist\left (c_{ij}^{(l)},c_{ij}^{(l+1)}\right )\) for each cluster and finds the maximum distance among them. Ni then compares \(max\left \{Dist\left (c_{ij}^{(l)},c_{ij}^{(l+1)}\right )\right \}_{1\leq j\leq k}\) with 𝜖. If \(max\left \{Dist\left (c_{ij}^{(l)},c_{ij}^{(l+1)}\right )\right \}_{1\leq j\leq k}>\epsilon \), it continues to iteration l + 1; otherwise, it moves to the termination state and \(C_{i}^{(l+1)}=\left \{c^{(l+1)}_{i1},c^{(l+1)}_{i2},\cdots ,c^{(l+1)}_{ik}\right \}\) is the final set of clusters. The main steps of each iteration of our proposed scheme are illustrated in Fig. 2.

Fig. 2: An overview of our privacy-preserving k-means clustering algorithm
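
The following sketch illustrates the local computation each node performs before aggregation, producing the point sums \(w_{i}^{(l)}\) and counts \(m_{i}^{(l)}\) that feed the secure aggregation protocol; the function and variable names are illustrative, not from the paper.

```python
import numpy as np

def local_step(D_i, C):
    """One local pass on node N_i's records D_i (n_i x d) against the
    current clusters C (k x d). Returns the per-cluster point sums
    w_i (k x d) and counts m_i (k,) that are later aggregated securely."""
    k, d = C.shape
    dist = np.linalg.norm(D_i[:, None, :] - C[None, :, :], axis=2)
    labels = dist.argmin(axis=1)                 # closest cluster per record
    w = np.zeros((k, d))
    m = np.zeros(k, dtype=int)
    for j in range(k):
        w[j] = D_i[labels == j].sum(axis=0)
        m[j] = int((labels == j).sum())
    return w, m

# After aggregation, the new cluster j is the element-wise division
# c_j = (sum_a w_aj) / (sum_a m_aj); the scheme computes this over
# secret shares using the secure division protocol of Section 4.3.
```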

4.2 Secure aggregation protocol

In our secure aggregation protocol, Ni securely sums the centers and cluster counts of its neighbors Γi. In the protocol, each node Na ∈Γi holds a local dataset Da and stores its local centers and cluster counts \(\left \{\left (w_{a}^{(l)},m_{a}^{(l)}\right )\right \}\) in its history table. Upon receiving a request \((i,\hat {l})\) from Ni, Na first compares its current iteration l with \(\hat {l}\). If \(\hat {l}\leq l\), Na's history table contains the local centers and cluster counts for iteration \(\hat {l}\). Na retrieves \(\left \{\left (w_{a}^{(\hat {l})},m_{a}^{(\hat {l})}\right )\right \}\) from its history table and encrypts them with the public key of \(N_{s_{i}}\). Concretely, for \(w_{aj}^{(\hat {l})}=\left (w_{aj1}^{(\hat {l})}, w_{aj2}^{(\hat {l})},\cdots ,w_{ajd}^{(\hat {l})}\right )\) and \(m_{aj}^{(\hat {l})}\) (1 ≤ j ≤ k), Na encrypts them as \(E_{pk_{i}}\left (w_{aj}^{(\hat {l})}\right )=\left (E_{pk_{i}}\left (w_{aj1}^{(\hat {l})}\right ),E_{pk_{i}}\left (w_{aj2}^{(\hat {l})}\right ),\cdots ,E_{pk_{i}}\left (w_{ajd}^{(\hat {l})}\right )\right )\) and \(E_{pk_{i}}(m_{aj}^{(\hat {l})})\). Na then returns

$$ E_{pk_{i}}(w_{a}^{(\hat{l})})=\left\{E_{pk_{i}}\left( w_{a1}^{(\hat{l})}\right),E_{pk_{i}}\left( w_{a2}^{(\hat{l})}\right),\cdots,E_{pk_{i}}\left( w_{ak}^{(\hat{l})}\right)\right\}, $$
$$ E_{pk_{i}}\left( m_{a}^{(\hat{l})}\right)=\left\{E_{pk_{i}}\left( m_{a1}^{(\hat{l})}\right),E_{pk_{i}}\left( m_{a2}^{(\hat{l})}\right),\cdots,E_{pk_{i}}\left( m_{ak}^{(\hat{l})}\right)\right\} $$

to Ni. If \(\hat {l}> l\), Na puts \((i,\hat {l})\) into its wait table and checks the wait table at each iteration. Once its iteration counter l reaches \(\hat {l}\), Na computes the responses \(E_{pk_{i}}(w_{a}^{(\hat {l})})\) and \(E_{pk_{i}}(m_{a}^{(\hat {l})})\) following the above procedure and sends them to the node Ni.

After receiving all responses from its neighbors Γi, Ni aggregates all messages based on the additively homomorphic property of Paillier encryption. It computes

$$ \begin{array}{@{}rcl@{}} \prod\limits_{N_{a} \in{\Gamma}_{i}^{*}}E_{pk_{i}}\left( w_{a}^{(\hat{l})}\right)&=&\left\{\prod\limits_{N_{a} \in{\Gamma}_{i}^{*}}E_{pk_{i}}\left( w_{a1}^{(\hat{l})}\right),\prod\limits_{N_{a} \in{\Gamma}_{i}^{*}}E_{pk_{i}}\left( w_{a2}^{(\hat{l})}\right),\right.\\ &\cdots& \left. ,\prod\limits_{N_{a} \in{\Gamma}_{i}^{*}}E_{pk_{i}}\left( w_{ak}^{(\hat{l})}\right)\right\}, \end{array} $$
$$ \begin{array}{@{}rcl@{}} \prod\limits_{N_{a} \in{\Gamma}_{i}^{*}}E_{pk_{i}}\left( m_{a}^{(\hat{l})}\right)&=&\left\{\prod\limits_{N_{a} \in{\Gamma}_{i}^{*}}E_{pk_{i}}\left( m_{a1}^{(\hat{l})}\right),\prod\limits_{N_{a} \in{\Gamma}_{i}^{*}}E_{pk_{i}}\left( m_{a2}^{(\hat{l})}\right),\right.\\ &\cdots& \left.,\prod\limits_{N_{a} \in{\Gamma}_{i}^{*}}E_{pk_{i}}\left( m_{ak}^{(\hat{l})}\right)\right\}. \end{array} $$

Then Ni randomly selects d values \(\{r_{j1},r_{j2},\cdots ,r_{jd}\}\in \mathbb {Z}_{N}^{d}\) for each \({\prod }_{N_{a} \in {\Gamma }_{i}^{*}}E_{pk_{i}}\left (w_{aj}^{(\hat {l})}\right )=\left ({\prod }_{N_{a} \in {\Gamma }_{i}^{*}}E_{pk_{i}}\left (w_{aj1}^{(\hat {l})}\right ),\right .\) \(\left .{\prod }_{N_{a} \in {\Gamma }_{i}^{*}}E_{pk_{i}}\left (w_{aj2}^{(\hat {l})}\right ),\cdots ,{\prod }_{N_{a} \in {\Gamma }_{i}^{*}}E_{pk_{i}}\left (w_{ajd}^{(\hat {l})}\right )\right )\) and computes

$$ \alpha_{js}=\prod\limits_{N_{a} \in{\Gamma}_{i}^{*}}E_{pk_{i}}\left( w_{ajs}^{(\hat{l})}\right)*E_{pk_{i}}^{-1}(r_{js}), $$

where 1 ≤ s ≤ d. For each \({\prod }_{N_{a} \in {\Gamma }_{i}^{*}}E_{pk_{i}}\left (m_{aj}^{(\hat {l})}\right )\), Ni randomly selects a value \(R_{j}\in \mathbb {Z}_{N}\) and calculates

$$ {\beta}_{j}=\prod\limits_{N_{a} \in{\Gamma}_{i}^{*}}E_{pk_{i}}\left( m_{aj}^{(\hat{l})}\right)*E^{-1}_{pk_{i}}(R_{j}). $$

Ni sends {αj,βj}1≤j≤k to its assistant node \(N_{s_{i}}\), which decrypts {αj,βj}1≤j≤k with its secret key ski and obtains

$$ D_{sk_{i}}(\alpha_{js})=\sum\limits_{N_{a} \in{\Gamma}_{i}^{*}}w_{ajs}^{(\hat{l})}-r_{js}\mod N, $$
$$ D_{sk_{i}}({\beta}_{j})=\sum\limits_{N_{a} \in{\Gamma}_{i}^{*}}m_{aj}^{(\hat{l})}-R_{j}\mod N. $$

After executing the secure aggregation protocol, Ni holds rjs,Rj while \(N_{s_{i}}\) holds \(D_{sk_{i}}(\alpha _{js}),D_{sk_{i}}({\beta }_{j})\) satisfying \(r_{js}+D_{sk_{i}}(\alpha _{js})={\sum }_{N_{a}\in {\Gamma }_{i}^{*}}w_{ajs}^{(l)}\mod N\) and \(R_{j}+D_{sk_{i}}({\beta }_{j})={\sum }_{N_{a}\in {\Gamma }_{i}^{*}}m_{aj}^{(l)}\mod N\), i.e., Ni and \(N_{s_{i}}\) learn secret shares of \({\sum }_{N_{a}\in {\Gamma }_{i}^{*}}w_{aj}^{(l)}\) and \({\sum }_{N_{a}\in {\Gamma }_{i}^{*}}m_{aj}^{(l)}\) (1 ≤ j ≤ k) (Fig. 3).

Fig. 3: Secure aggregation protocol
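
The sketch below traces one cluster count through a round of the protocol. It uses the python-paillier (phe) package purely for illustration (the paper does not name a library) and shows the homomorphic aggregation, the blinding with Rj, and the resulting additive shares held by Ni and \(N_{s_{i}}\).

```python
import random
from phe import paillier

# key pair generated by the assistant node N_{s_i}; pk_i is shared with N_i
pk_i, sk_i = paillier.generate_paillier_keypair(n_length=2048)

counts = [7, 3, 5]                                   # toy m_aj values from three neighbors
ciphertexts = [pk_i.encrypt(m) for m in counts]      # each neighbor encrypts its count

# N_i multiplies the ciphertexts (addition of plaintexts, eq. (2)) and
# blinds the sum with a random R_j; R_j becomes N_i's additive share
R_j = random.randrange(0, 2 ** 64)
beta_j = sum(ciphertexts[1:], ciphertexts[0]) - R_j  # E(sum_a m_aj - R_j)

share = sk_i.decrypt(beta_j)                         # assistant node's share
assert share + R_j == sum(counts)                    # the shares reconstruct the sum
```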

4.2.1 Optimization

In this part, we propose a novel message encoding mechanism to improve the performance of our secure aggregation protocol. Our encoding method is based on Horner's rule [19], stated formally as follows.

Horner’s rule

Given an n-degree polynomial p = anRn + an− 1Rn− 1 + ⋯ + a1R + a0, we can represent the polynomial as p = (⋯(anR + an− 1)R + ⋯ )R + a0.

In our encoding method, we construct a polynomial p = anRn + an− 1Rn− 1 + ⋯ + a1R + a0 and ensure R > max{an,an− 1,⋯ ,a1,a0}. Given p and R, we can recover the n + 1 coefficients an,an− 1,⋯ ,a0 via Horner's rule at the cost of n + 1 division operations and n + 1 modulo operations. The details of our message encoding mechanism in the secure aggregation protocol are as follows.

We take \(m_{a}^{(\hat {l})}=\left \{m_{a1}^{(\hat {l})},m_{a2}^{(\hat {l})},\cdots ,m_{ak}^{(\hat {l})}\right \}\) as an example to illustrate the main steps of our encoding method. In the original secure aggregation protocol, Na encrypts \(m_{a}^{(\hat {l})}\) as \(E_{pk_{i}}(m_{a}^{(\hat {l})})=\left \{E_{pk_{i}}(m_{a1}^{(\hat {l})}),E_{pk_{i}}(m_{a2}^{(\hat {l})}),\cdots ,E_{pk_{i}}(m_{ak}^{(\hat {l})})\right \}\) and sends \(E_{pk_{i}}(m_{a}^{(\hat {l})})\) to Ni. Then Ni randomly selects k values \(\{R_{1},R_{2},\cdots ,R_{k}\}\in \mathbb {Z}_{N}\) and computes β = {β1,β2,⋯ ,βk} where \({\beta }_{j}={\prod }_{N_{a} \in {\Gamma }_{i}^{*}}E_{pk_{i}}(m_{aj}^{(\hat {l})})*E_{pk_{i}}^{-1}(R_{j})\). Ni sends β to \(N_{s_{i}}\), who decrypts β and obtains \(\{{\sum }_{N_{a} \in {\Gamma }_{i}^{*}}m_{a1}^{(\hat {l})}-R_{1}\mod N,\cdots ,{\sum }_{N_{a} \in {\Gamma }_{i}^{*}}m_{ak}^{(\hat {l})}-R_{k}\mod N\}\). The original protocol thus encrypts the k-dimensional vector \(m_{a}^{(\hat {l})}\) element-wise, which costs k encryption operations. Since the plaintext space N of Paillier encryption is usually much larger than each \(m_{aj}^{(\hat {l})}\), we can pack a k-dimensional vector into a single integer and encrypt that integer, reducing the encryption cost from k to 1. In our encoding method, we choose a value R and encode \(m_{a}^{(\hat {l})}=\left \{m_{a1}^{(\hat {l})},m_{a2}^{(\hat {l})},\cdots ,m_{ak}^{(\hat {l})}\right \}\) as a (k − 1)-degree polynomial in R:

$$ p(m_a^{(\hat{l})})=m_{a1}^{(\hat{l})}+m_{a2}^{(\hat{l})}R+\cdots+m_{ak}^{(\hat{l})}R^{k- 1}. $$

Na then encrypts \(p(m_{a}^{(\hat {l})})\) with the public key pki and sends \(E_{pk_{i}}(p(m_{a}^{(\hat {l})}))\) to Ni. After receiving \(E_{pk_{i}}(p(m_{a}^{(\hat {l})}))\) from all its neighbors, Ni randomly selects k values {r1,r2,⋯ ,rk} and encodes them as p(r) = r1 + r2R + ⋯ + rkRk− 1. Then Ni computes

$$ \beta=\prod\limits_{N_{a} \in{\Gamma}_{i}^{*}}E_{pk_{i}}(p(m_{a}^{(\hat{l})}))*E_{pk_{i}}(p(r)). $$

Ni sends β to its assistant node \(N_{s_{i}}\), which decrypts β and applies Horner's rule to obtain \(\{{\sum }_{N_{a} \in {\Gamma }_{i}^{*}}m_{a1}^{(\hat {l})}+r_{1}\mod N,\cdots ,{\sum }_{N_{a} \in {\Gamma }_{i}^{*}}m_{ak}^{(\hat {l})}+r_{k}\mod N\}\).

We require R > max{an,an− 1,⋯ ,a0} to ensure the correctness of Horner's rule. In our encoding method, the coefficient of Rj− 1 is \(a_{j-1}={\sum }_{N_{a} \in {\Gamma }_{i}^{*}}m_{aj}^{(\hat {l})}+r_{j}\). We select proper parameters R and rj with the following strategy: assume each \(m_{aj}^{(\hat {l})}\) is a σ-bit integer and the number of nodes Na ∈Γi is at most 𝜃; then we can set R to a (σ + 𝜃 + κ)-bit integer and select each rj from the range [0,2κ].
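
A plain-Python sketch of the packing and unpacking follows; only the plaintext layout changes, and the homomorphic operations are untouched. The parameter values are toy choices following the sizing rule above.

```python
def encode(values, R):
    """Pack v_1..v_k into one integer p = v_1 + v_2*R + ... + v_k*R^(k-1).
    Correct only while every (aggregated, blinded) coefficient stays below R."""
    p = 0
    for v in reversed(values):
        p = p * R + v
    return p

def decode(p, R, k):
    """Recover the k coefficients via Horner's rule: k modulo and k division ops."""
    out = []
    for _ in range(k):
        out.append(p % R)
        p //= R
    return out

# toy sizing: sigma-bit counts, at most theta neighbors, kappa-bit blinds;
# a (sigma + theta + kappa)-bit R is comfortably above every coefficient
sigma, theta, kappa = 16, 16, 40
R = 1 << (sigma + theta + kappa)
m_a = [7, 3, 5, 9]                       # k = 4 cluster counts
assert decode(encode(m_a, R), R, len(m_a)) == m_a
```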

4.3 Secure division protocol

In our secure division protocol, the values a and b are shared between two nodes Ni and Nj: Ni holds a1,b1 and Nj holds a2,b2 satisfying a = a1 + a2 mod N and b = b1 + b2 mod N. After executing the protocol, Ni learns the value \(\frac {a}{b}\), while Nj learns no useful information about a, b, or \(\frac {a}{b}\).

Ni first generates a public and secret key pair (pk,sk) for the Paillier cryptosystem and encrypts a1,b1 with the public key pk. Then Ni sends the encrypted values (Epk(a1),Epk(b1)) to Nj. After receiving these messages, Nj encrypts its own shares a2,b2, selects a non-zero random value \(\lambda \in \mathbb {Z}_{N}\), and computes

$$E_{pk}(\lambda a)=(E_{pk}(a_{1})*E_{pk}(a_{2}))^{\lambda},$$
$$E_{pk}(\lambda b)=(E_{pk}(b_{1})*E_{pk}(b_{2}))^{\lambda}.$$

Nj then sends Epk(λa) and Epk(λb) to Ni, who decrypts the received values with the secret key sk and obtains λa mod N and λb mod N. Finally, Ni obtains the quotient by computing \(\frac {a}{b}=\frac {\lambda a}{\lambda b}\) (Fig. 4).

Fig. 4: Secure division protocol
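
A sketch of one run of the division protocol follows, again using the phe package as an assumed stand-in for an unnamed Paillier library; Ni ends up with a/b while Nj only handles blinded ciphertexts. The shares and the λ range are toy values.

```python
import random
from phe import paillier

pk, sk = paillier.generate_paillier_keypair(n_length=2048)   # generated by N_i

a1, b1 = 120, 10              # N_i's shares
a2, b2 = 30, 5                # N_j's shares, so a = 150, b = 15

enc_a1, enc_b1 = pk.encrypt(a1), pk.encrypt(b1)              # sent to N_j

# N_j folds in its own shares and blinds both values with the same lambda
lam = random.randrange(1, 2 ** 32)                           # non-zero blind (toy range)
enc_la = (enc_a1 + a2) * lam                                 # E(lambda * a) via (2)-(3)
enc_lb = (enc_b1 + b2) * lam                                 # E(lambda * b)

# N_i decrypts both and takes the ratio; lambda cancels out
assert sk.decrypt(enc_la) / sk.decrypt(enc_lb) == 150 / 15
```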

5 Evaluation

5.1 Security analysis

We consider the security of our scheme under the semi-honest model [10]. The security of our scheme can be proved based on the following theorems.

Theorem 1

Our secure aggregation protocol is secure under the semi-honest model: no participant can learn any useful information about the others' inputs.

Proof

Our secure aggregation protocol involves three types of participants: the neighbor nodes Na ∈Γi, the node Ni, and the assistant node \(N_{s_{i}}\). We prove the theorem by considering, in turn, the corruption of each type of participant by an adversary, and show that a computationally indistinguishable simulator can be constructed for the corrupted party's view.

When the node Ni is corrupted, we construct a simulator Simi to simulate Ni's view as follows. In a real execution, Ni's view Viewi is

$$ \begin{array}{@{}rcl@{}} View_{i}&=&\left\{w_{i}^{(\hat{l})},m_{i}^{(\hat{l})},\left( E_{pk_{i}}\left( w_{a}^{(\hat{l})}\right),E_{pk_{i}}\left( m_{a}^{(\hat{l})}\right)\right)_{N_{a}\in{\Gamma}_{i}},(\alpha_{j},{\beta}_{j} )_{1\leq j\leq k},\right.\\ && \left. (r_{js})_{1\leq j\leq k,1\leq s\leq d}, (R_{j})_{1\leq j\leq k}\right\} \end{array} $$

In the above Viewi, \(w_{i}^{(\hat {l})},m_{i}^{(\hat {l})}\) are the inputs, \((E_{pk_{i}}(w_{a}^{(\hat {l})}),E_{pk_{i}}(m_{a}^{(\hat {l})}),\alpha _{j},{\beta }_{j})\) are Paillier ciphertexts, and rjs,Rj are random values selected from \(\mathbb {Z}_{N}\). To simulate Viewi, Simi randomly selects \(w_{a}^{*},m_{a}^{*}\) from \(\mathbb {Z}_{N}\) and encrypts them with the public key pki. It then randomly selects \((r^{*}_{js}, R^{*}_{j})_{1\leq j\leq k}\) from \(\mathbb {Z}_{N}\) and computes \(\alpha ^{*}_{js}={\prod }_{N_{a} \in {\Gamma }_{i}^{*}}E_{pk_{i}}(w_{ajs}^{*})*E^{-1}_{pk_{i}}(r^{*}_{js})\) and \(\beta ^{*}_{j}={\prod }_{N_{a} \in {\Gamma }_{i}^{*}}E_{pk_{i}}(m_{aj}^{*})*E_{pk_{i}}^{-1}(R^{*}_{j})\). The simulated view is \(Sim_{i}=\{w_{i}^{(\hat {l})},m_{i}^{(\hat {l})},(E_{pk_{i}}(w_{a}^{*}),E_{pk_{i}}(m_{a}^{*}))_{N_{a}\in {\Gamma }_{i}},(\alpha ^{*}_{j},\beta ^{*}_{j} )_{1\leq j\leq k},\) \((r^{*}_{js})_{{1\leq j\leq k,1\leq s\leq d}},(R^{*}_{j})_{1\leq j\leq k}\}\). In both Viewi and Simi, the inputs are identical. \((E_{pk_{i}}(w_{a}^{(\hat {l})}),E_{pk_{i}}(m_{a}^{(\hat {l})}),\alpha _{j},{\beta }_{j})\) and \((E_{pk_{i}}(w_{a}^{*}),E_{pk_{i}}(m_{a}^{*}),\alpha ^{*}_{j},\beta ^{*}_{j} )\) are Paillier ciphertexts; since Paillier encryption is semantically secure, they are computationally indistinguishable. Since rjs,Rj and \(r^{*}_{js},R^{*}_{j}\) are all random values in \(\mathbb {Z}_{N}\), they are likewise indistinguishable. Therefore, Simi is computationally indistinguishable from Viewi.

When the assistant node \(N_{s_{i}}\) is corrupted, we construct a simulator \(Sim_{s_{i}}\) as follows. \(N_{s_{i}}\)'s view is

$$ View_{s_{i}}=\{(D_{sk_{i}}(\alpha_{js}))_{1\leq j\leq k,1\leq s\leq d}, (D_{sk_{i}}({\beta}_{j}))_{1\leq j\leq k}\}. $$

To simulate \(View_{s_{i}}\), \(Sim_{s_{i}}\) randomly chooses \(\gamma _{js}^{*}\) and \(\eta _{j}^{*}\) from \(\mathbb {Z}_{N}\). The simulated view is \(Sim_{s_{i}}=\{(\gamma ^{*}_{js})_{1\leq j\leq k,1\leq s\leq d}, (\eta _{j}^{*})_{1\leq j\leq k}\}\). Since both \(D_{sk_{i}}(\alpha _{js})\) and \(D_{sk_{i}}({\beta }_{j})\) are masked with random values, \(\gamma _{js}^{*},\eta _{j}^{*}\) are indistinguishable from \(D_{sk_{i}}(\alpha _{js}),D_{sk_{i}}({\beta }_{j})\). Therefore, we conclude that \(Sim_{s_{i}}\) is computationally indistinguishable from \(View_{s_{i}}\).

Finally, consider the case where a neighbor Na ∈Γi is corrupted. The node Na in our protocol only receives \((i,\hat {l})\) from Ni. These messages are public parameters, so an adversary that corrupts a neighbor Na ∈Γi cannot learn any useful information.

Combining the above, we can conclude that the secure aggregation protocol is secure under the semi-honest model. □

Theorem 2

Our secure division protocol is secure under the semi-honest model.

Proof

Our secure division protocol involves two participants Ni and Nj. We construct a computationally indistinguishable simulator to simulate the corrupted party’s view as follows.

Nj's view during the protocol is Viewj = {Epk(a1),Epk(b1),λ,Epk(λa),Epk(λb)}. To simulate it, Simj randomly selects \(a_{1}^{*},b_{1}^{*}\) from \(\mathbb {Z}_{N}\) and encrypts them with the public key pk. Then Simj randomly selects a non-zero value λ∗ from \(\mathbb {Z}_{N}\) and computes \(E_{pk}(\lambda ^{*}a^{*})=(E_{pk}(a^{*}_{1})*E_{pk}(a_{2}))^{\lambda ^{*}}\) and \(E_{pk}(\lambda ^{*}b^{*})=(E_{pk}(b^{*}_{1})*E_{pk}(b_{2}))^{\lambda ^{*}}\). The simulated view is \(Sim_{j}=\{E_{pk}(a^{*}_{1}),E_{pk}(b^{*}_{1}),\lambda ^{*}, E_{pk}(\lambda ^{*} a^{*}),E_{pk}(\lambda ^{*} b^{*})\}\). Since λ and λ∗ are randomly selected values and the Paillier cryptosystem is semantically secure, Simj is computationally indistinguishable from Viewj.

Ni's view in this protocol is Viewi = {λa,λb}. To simulate it, Simi selects a non-zero value a∗ from \(\mathbb {Z}_{N}\) and computes \(\frac {a}{b}a^{*}\). The simulated view is \(Sim_{i}=\{a^{*}, \frac {a}{b}a^{*}\}\). Since λa and λb are masked with a random value, they are indistinguishable from \(a^{*}\) and \(\frac {a}{b}a^{*}\). Thus, Simi is computationally indistinguishable from Viewi.

Based on the above analysis, we can claim that our secure division protocol is secure under the semi-honest model. □

Theorem 3

Our privacy-preserving k-means clustering protocol is secure under the semi-honest model if the secure aggregation protocol and the secure division protocol are secure under the semi-honest model.

Proof

In our k-means clustering protocol, each node Ni first computes its local centers \(w^{(l)}_{i}\) and cluster counts \(m_{i}^{(l)}\). It then runs the secure aggregation protocol with its neighbors Γi to securely sum the local centers and cluster counts. By Theorem 1, the secure aggregation protocol is secure under the semi-honest model, so Ni cannot learn any useful information about other nodes' local data. Ni then runs the secure division protocol with \(N_{s_{i}}\) to compute the clusters of the next iteration. Theorem 2 proves that the secure division protocol is secure under the semi-honest model, so Ni cannot learn any useful information about the aggregation result except the new clusters. Based on the above analysis, we can claim that our k-means clustering protocol is secure under the semi-honest model. □

5.2 Complexity analysis

In this part, we analyze the computational complexity of our proposed scheme. For simplicity, we omit the cost of operations over plaintexts and focus on the time-consuming operations over ciphertexts, namely encryption operations E, exponentiation operations Exp, and decryption operations D. The computational complexity of our proposed protocols is summarized in Table 2, and the detailed analysis follows.

Table 2: The computational complexity of the proposed protocols

We first analyze the cost of our original secure aggregation protocol. Each node Na ∈Γi first encrypts \((w_{a}^{(\hat {l})},m_{a}^{(\hat {l})})\), which requires (dk + k) encryption operations. Ni computes \({\prod }_{N_{a} \in {\Gamma }_{i}^{*}}E_{pk_{i}}(w_{a}^{(\hat {l})})\), which costs \(|{\Gamma }_{i}^{*}|kd\) exponentiation operations, and \({\prod }_{N_{a} \in {\Gamma }_{i}^{*}}E_{pk_{i}}(m_{a}^{(\hat {l})})\), which takes \(|{\Gamma }_{i}^{*}|k\) exponentiation operations. Then Ni selects random values and computes {αj,βj}1≤j≤k, which requires (kd + k) encryption operations and (kd + k) exponentiation operations. \(N_{s_{i}}\) decrypts {αj,βj}1≤j≤k, which takes (kd + k) decryption operations. The overall computational cost of our original secure aggregation protocol is therefore \(O(|{\Gamma }_{i}^{*}|kd)\) encryption operations, O(kd) decryption operations, and \(O(|{\Gamma }_{i}^{*}|kd)\) exponentiation operations.

In our optimized secure aggregation protocol, we encode the vectors \(w_{aj}^{(\hat {l})}\) and \(m_{a}^{(\hat {l})}\) into two integers, respectively. Thus Na ∈Γi only requires (k + 1) encryption operations to encrypt \((w_{a}^{(\hat {l})},m_{a}^{(\hat {l})})\). Ni takes \(|{\Gamma }_{i}^{*}|k\) exponentiation operations to compute \({\prod }_{N_{a} \in {\Gamma }_{i}^{*}}E_{pk_{i}}(w_{a}^{(\hat {l})})\) and \(|{\Gamma }_{i}^{*}|\) exponentiation operations to compute \({\prod }_{N_{a} \in {\Gamma }_{i}^{*}}E_{pk_{i}}(m_{a}^{(\hat {l})})\). Then Ni encodes the random values and computes {(α1,α2,⋯ ,αk),β}, which requires (k + 1) encryption operations and (k + 1) exponentiation operations. \(N_{s_{i}}\) decrypts {(α1,α2,⋯ ,αk),β} to learn the result, which takes (k + 1) decryption operations. The overall computational cost of our optimized secure aggregation protocol is \(O(|{\Gamma }_{i}^{*}|k)\) encryption operations, O(k) decryption operations, and \(O(|{\Gamma }_{i}^{*}|k)\) exponentiation operations.

In our secure division protocol, Ni encrypts a1,b1, which requires 2 encryption operations. Nj encrypts a2,b2 and computes Epk(λa) and Epk(λb) based on the homomorphic properties of Paillier encryption, which needs 2 encryption operations and 2 exponentiation operations (each with exponent λ). Ni then receives the responses from Nj and decrypts them, which requires 2 decryption operations. The overall computational cost of our secure division protocol is O(1) encryption operations, O(1) decryption operations, and O(1) exponentiation operations.

In each iteration of our privacy-preserving k-means clustering protocol, the node Ni runs the secure aggregation protocol to gather the centers and cluster counts of all its neighbors and securely share this information with its assistant node \(N_{s_{i}}\). Ni then runs the secure division protocol to compute the local clusters; the update procedure invokes the secure division protocol kd times to calculate the new clusters \(C_{i}^{(l+1)}\). Therefore, the per-iteration computational cost of a single node in our privacy-preserving k-means clustering scheme with the original secure aggregation protocol, denoted PPkM_b, is \(O((d+|{\Gamma }_{i}^{*}|)kd)\) encryption operations, O(kd) decryption operations, and \(O((d+|{\Gamma }_{i}^{*}|)kd)\) exponentiation operations. Assuming the number of iterations is l, the overall computational cost of a single node in our PPkM_b scheme is \(O((d+|{\Gamma }_{i}^{*}|)kdl)\) encryption operations, O(kdl) decryption operations, and \(O((d+|{\Gamma }_{i}^{*}|)kdl)\) exponentiation operations. Similarly, the computational cost of a single node in our privacy-preserving k-means clustering scheme with the optimized secure aggregation protocol, denoted PPkM_e, is \(O((d+|{\Gamma }_{i}^{*}|)kl)\) encryption operations, O(kl) decryption operations, and \(O((d+|{\Gamma }_{i}^{*}|)kl)\) exponentiation operations.

5.3 Experiments

In this part, we evaluate the performance of our proposed scheme under various parameter settings. We use a real dataset from the UCI repository, which consists of 56,000 data records, each with 8 attributes. We assume that the number of data records held by each peer node is the same, fixed to 500 in our experiments. We implement Paillier encryption with an open-source Paillier library and conduct all experiments on a machine with a 4-core CPU and 16 GB of RAM. The key size of Paillier encryption is set to 2048 bits, and the number of clusters k is set to 8. We vary the number of neighbors |Γi| of each node and measure the computation time of a single node at each iteration. The results are shown in Table 3. The computation time of both PPkM_b and PPkM_e increases with |Γi|, and the cost of PPkM_e is consistently smaller than that of PPkM_b. For example, when |Γi| is 5, the computation time of PPkM_b is 0.72 seconds while that of PPkM_e is only 0.12 seconds; when |Γi| grows to 30, the computation time of PPkM_b rises to 4.63 seconds while that of PPkM_e is only 0.74 seconds. These results indicate that our encoding method based on Horner's rule effectively reduces the computation cost.

Table 3: Computation time of a single node per iteration (seconds)

6 Related work

6.1 Distributed machine learning in peer-to-peer networks

A large number of works have been proposed for machine learning over distributed data in peer-to-peer networks. Luo et al. [26] proposed an ensemble scheme for distributed classification in peer-to-peer networks; they construct local classifiers on each peer using the pasting-bites learning algorithm and propose a distributed plurality voting protocol to combine the decisions of the local classifiers. Wolff et al. [39] presented a generic algorithm to compute any ordinal function of the average data in a large peer-to-peer network and proposed a general framework for consequently updating any model of the distributed data. Datta et al. [7] proposed two algorithms for k-means clustering over distributed data in peer-to-peer networks; their scheme avoids large-scale synchronization and data centralization, requiring only that each node achieve local synchronization with its neighbors. Ang et al. [1] combined the cascade support vector machine (SVM) and reduced SVM methods to construct SVM classifiers in peer-to-peer networks, achieving classification accuracy comparable to centralized classifiers. Kan et al. [17] presented a collaborative classification method that builds SVM classifiers in scale-free peer-to-peer networks and improves local classification accuracy by propagating the most influential SVM models. Ormándi et al. [30] proposed a general approach named gossip learning that combines local classification models based on multiple models taking random walks and a virtual weighted voting mechanism. Papapetrou et al. [32] presented a collaborative approach for document classification in peer-to-peer networks that constructs local classification models on each node and combines the most discriminative models into a collaborative classification model. Mashayekhi et al. [27] presented a fully decentralized method for clustering dynamic and distributed data in peer-to-peer networks; their scheme first constructs a summarized view of the distributed dataset through decentralized gossip-based communication and then applies partition-based and density-based clustering methods to the summarized view. The above solutions mainly focus on designing machine learning algorithms over distributed data that maintain high accuracy and low computation and communication costs without centralizing the data at a single node. Although some of them achieve accuracy comparable to centralized methods, they fail to consider privacy when performing machine learning over distributed data: some nodes may not want to reveal their local data or models to others.

6.2 Privacy-preserving distributed machine learning

Privacy-preserving machine learning over distributed data has been investigated by several secure schemes, such as privacy-preserving naive Bayes classification [37, 38, 49], secure support vector machines [15, 47, 50], and privacy-preserving deep learning [28, 34]. For k-means clustering, Vaidya et al. [36] used a secure permutation protocol and homomorphic encryption to construct the first privacy-preserving k-means algorithm for vertically partitioned data, where each party holds a portion of the attributes of the data records. The work [8] presented a secure multi-party clustering scheme over vertically partitioned data based on additive secret sharing, which achieves better computational performance than existing works. The work [14] proposed secure protocols based on oblivious polynomial evaluation and homomorphic encryption to perform k-means clustering over horizontally partitioned data, where each party holds different data records of the dataset. Yu et al. [48] proposed a secure multi-party k-means clustering scheme that considers both horizontally and vertically partitioned datasets. Jagannathan et al. [13] proposed a secure scheme for k-means clustering over arbitrarily partitioned data based on random shares and Yao's garbled circuits [12]. Xing et al. [40] designed a mutually privacy-preserving k-means clustering scheme for social participatory sensing environments, which preserves both each party's private information and the global clusters.

Some works have also considered privacy-preserving machine learning in peer-to-peer networks. Das et al. [5] proposed a privacy-preserving feature selection scheme over distributed data in large peer-to-peer networks. Their scheme incorporates misclassification gain, Gini index, and entropy feature measures and combines the secure sum protocol with the Bayes optimal privacy model to aggregate features without compromising the privacy of each node. Bhuyan et al. [3] used fuzzy methodologies to design a privacy-preserving sub-feature selection scheme for distributed environments.

7 Conclusion

In this paper, we proposed a novel privacy-preserving k-means clustering scheme for peer-to-peer networks. We designed a secure aggregation protocol to learn the sum of the centers and cluster counts of a node's neighbors, and a secure division protocol to perform division over shared values. Moreover, we presented a novel message encoding method based on Horner's rule to improve the performance of our aggregation protocol. Compared with existing solutions, our scheme achieves both local synchronization and privacy protection for each peer. We also formally proved the security of the proposed scheme and analyzed its computational complexity.