Keywords

1 Introduction

Explosive growth in data storage and data processing technologies has led to the creation of huge databases that contain valuable information. Data mining techniques are used to retrieve hidden patterns from these large databases. In distributed data mining, a group of clients shares its data with a trusted third party. For example, a health organization collects data about diseases in its nearby geographical area; sharing this data with other organizations improves the quality of the mined information and benefits the participating clients in conducting their business smoothly. The third party performs a knowledge-based technique on the data collected from the group of clients. In this scenario, clients fall into three categories: honest, semi-honest, and malicious. An honest client always obeys the protocol and does not alter the data or information. A semi-honest client follows the protocol but tries to acquire knowledge about other clients during execution. A malicious (unauthorized) client tries to access other clients' data and alter the information. To secure the data in this setting, this paper proposes a novel approach.

Many researchers address this problem using cryptography, perturbation, and reconstruction-based techniques. This paper proposes an approach to Horizontal Privacy Preserving Data Mining (HPPDM) that combines different transformations. A Trusted Party (TP) is chosen from among the group of clients. The TP communicates with the other clients using a symmetric cryptography algorithm and assigns a transformation technique to each client. In this work, transformation techniques such as the Walsh-Hadamard Transformation (WHT), Simple Additive Noise (SAN), Multiplicative Noise (MN), and First and Second order sum and Inner product Preservation (FISIP) are used. The proposed model is evaluated with data distortion and privacy measures such as Value Difference (VD) and position difference parameters CP, RP, CK, and RK.

This paper is organized as follows: Sect. 2 discusses related work. Section 3 focuses on transformation techniques. Section 4 explains the proposed model. Section 5 discusses experimental results, and Sect. 6 presents conclusions and future work.

2 Related Work

Recently there has been a lot of research addressing the issue of Privacy Preserving Data Mining (PPDM). PPDM techniques are mostly divided into two categories: data perturbation and cryptographic techniques. In data perturbation methods, data owners alter the data before sharing it with the data miner. Cryptographic techniques are used in distributed data mining scenarios to protect the privacy of individual customers.

Let X be a numerical attribute and Y be the perturbed value of X. Traub [1] proposed SAN as Y = X + e and MN as Y = X * e. Other perturbation methods, such as Micro Aggregation (MA), Univariate Micro Aggregation (UMA), and Multivariate Micro Aggregation (MMA), are proposed in [2-4]. Algorithm-based transformations are discussed in Sect. 3.

In PPDM, distributed data mining is divided into two categories: horizontal and vertical partitioning. Yao [5] introduced a two-way communication protocol using cryptography. Murat and Clifton proposed a secure K-NN classifier [6]. In [7], Yang et al. proposed a simple cryptographic approach in which many customers participate without loss of privacy or classifier accuracy. A framework was proposed in [8] that includes a general model as well as multi-round algorithms for HPPDM using a privacy preserving K-NN classifier. Kantarcioglu and Vaidya proposed a privacy preserving Naive Bayes classifier for horizontally partitioned data in [9]. Xu and Yi [10] discussed the classification of privacy preserving distributed data mining protocols.

3 Transformation Techniques

3.1 Walsh Hadamard Transformation

Definition

The Hadamard transform \( H_{n} \) is a \( 2^{n} \times 2^{n} \) matrix, the Hadamard matrix (scaled by a normalization factor), that transforms \( 2^{n} \) real numbers \( x_{n} \) into \( 2^{n} \) real numbers \( X_{k} \). The Walsh-Hadamard transform of a signal X of size \( N = 2^{n} \) is the matrix-vector product \( X \cdot H_{N} \), where

$$ H_{N} = \mathop{\bigotimes}\limits_{i = 1}^{n} H_{2} = \underbrace {{H_{2} \otimes H_{2} \otimes \cdots \otimes H_{2} }}_{n} $$

The matrix \( {\text{H}}_{ 2} = \left[ {\begin{array}{*{20}c} 1 & 1 \\ 1 & { - 1} \\ \end{array} } \right] \) and \( \otimes \) denotes the tensor or Kronecker product. The tensor product of two matrices is obtained by replacing each entry of the first matrix with that entry multiplied by the second matrix. For example

$$ H_{4} = \left[ {\begin{array}{*{20}c} 1 & 1 \\ 1 & { - 1} \\ \end{array} } \right] \otimes \left[ {\begin{array}{*{20}c} 1 & 1 \\ 1 & { - 1} \\ \end{array} } \right] = \left[ {\begin{array}{*{20}c} {\begin{array}{*{20}c} 1 & 1 \\ 1 & { - 1} \\ \end{array} } & {\begin{array}{*{20}c} 1 & 1 \\ 1 & { - 1} \\ \end{array} } \\ {\begin{array}{*{20}c} 1 & 1 \\ 1 & { - 1} \\ \end{array} } & {\begin{array}{*{20}c} { - 1} & { - 1} \\ { - 1} & 1 \\ \end{array} } \\ \end{array} } \right] $$

Since the Walsh-Hadamard transformation generates an orthogonal matrix, it preserves the Euclidean distance between data points. It is widely used in image processing and signal processing.
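As an illustration, the Kronecker-product construction of \( H_{N} \) and its distance-preserving property can be sketched in Python with NumPy (the helper name `walsh_hadamard` and the sample vectors are ours, not from the paper):

```python
import numpy as np

def walsh_hadamard(n):
    """Build the normalized 2^n x 2^n Walsh-Hadamard matrix via Kronecker products."""
    H2 = np.array([[1.0, 1.0], [1.0, -1.0]])
    H = H2
    for _ in range(n - 1):
        H = np.kron(H, H2)          # H_{2^i} (x) H_2
    return H / np.sqrt(2.0 ** n)    # normalization makes H orthogonal

H4 = walsh_hadamard(2)
x = np.array([5.1, 3.5, 1.4, 0.2])  # e.g. one 4-attribute record
y = np.array([4.9, 3.0, 1.4, 0.2])
# Orthogonality (H H^T = I) implies Euclidean distances are preserved.
assert np.allclose(np.linalg.norm(H4 @ x - H4 @ y), np.linalg.norm(x - y))
```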

3.2 First and Second Order Sum and Inner Product Preservation (FISIP)

FISIP is a distance and correlation preservation Transformation [11].

Definition

The matrix representation of a linear transformation can be written as A = [Ai] = [A1 A2 … Ak], where each Ai can be written as Ai = [Aim]. The transformation is called a FISIP transformation if A has the following properties:

  (a) \( \mathop \sum \limits_{m = 1}^{k} A_{im} = 1 \)

  (b) \( \mathop \sum \limits_{m = 1}^{k} A_{im}^{2} = 1 \)

  (c) \( \mathop \sum \limits_{m = 1}^{k} A_{im} A_{jm} = 0, \,\,for \,i \ne j \)

$$ \begin{aligned} A^{\left[ k \right]} & = \left[ {a_{ij} } \right],\;1 \le i,j \le k,\; a_{ii} = \frac{2 - k}{k},\;a_{ij} = \frac{2}{k}\;\left( {i \ne j} \right) \\ A^{\left[ 2 \right]} & = \left[ {\begin{array}{*{20}c} 0 & 1 \\ 1 & 0 \\ \end{array} } \right],A^{\left[ 3 \right]} = \left[ {\begin{array}{*{20}c} {\frac{ - 1}{3}} & {\frac{2}{3}} & {\frac{2}{3}} \\ {\frac{2}{3}} & {\frac{ - 1}{3}} & {\frac{2}{3}} \\ {\frac{2}{3}} & {\frac{2}{3}} & {\frac{ - 1}{3}} \\ \end{array} } \right] \\ \end{aligned} $$
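A small NumPy sketch (our own construction, directly following the definition of \( A^{\left[ k \right]} \) above) verifies the three FISIP properties:

```python
import numpy as np

def fisip(k):
    """Build A^[k]: diagonal entries (2-k)/k, off-diagonal entries 2/k."""
    A = np.full((k, k), 2.0 / k)
    np.fill_diagonal(A, (2.0 - k) / k)
    return A

A3 = fisip(3)
# Property (a): every row sums to 1, so first-order sums are preserved.
assert np.allclose(A3.sum(axis=1), 1.0)
# Properties (b) and (c): rows are orthonormal, so A is orthogonal and
# second-order sums and inner products are preserved.
assert np.allclose(A3 @ A3.T, np.eye(3))
```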

3.3 Discrete Cosine Transformation (DCT)

DCT works on real numbers and gives the following real coefficients:

$$ f_{i} = \left( {\frac{2}{n}} \right)^{{\frac{1}{2}}}\Lambda _{i} \mathop \sum \limits_{k = 0}^{n - 1} x_{k} { \cos }[\left( {2k + 1} \right)i\Pi /2n] $$

where \( \Lambda_{i} = 1/\sqrt{2} \) for \( i = 0 \) and \( \Lambda_{i} = 1 \) otherwise. These transforms are unitary, so the Euclidean distance between two sequences is preserved [12].
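A minimal NumPy implementation of this transform (building the orthonormal DCT matrix explicitly; the helper name `dct_matrix` is ours) confirms the distance-preservation claim:

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT matrix: C[i, k] = sqrt(2/n) * Lambda_i * cos((2k+1) i pi / 2n)."""
    i = np.arange(n)[:, None]   # output (frequency) index
    k = np.arange(n)[None, :]   # input (sample) index
    C = np.sqrt(2.0 / n) * np.cos((2 * k + 1) * i * np.pi / (2 * n))
    C[0, :] /= np.sqrt(2.0)     # Lambda_0 = 1/sqrt(2)
    return C

C = dct_matrix(4)
x = np.array([2.0, 4.0, 4.0, 4.0])
y = np.array([1.0, 3.0, 5.0, 7.0])
# The transform is unitary, so Euclidean distance is preserved.
assert np.allclose(np.linalg.norm(C @ x - C @ y), np.linalg.norm(x - y))
```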

3.4 Randomization

Randomization is one of the simplest approaches to PPDM and involves perturbing numerical data. Let X be a confidential attribute and Y the perturbed data [1]. SAN is defined as Y = X + e, where e is a random value drawn from a distribution with mean zero and variance one.

MN is defined as Y = X * e, where e is a random value.
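The two randomization schemes can be sketched as follows (the noise distribution for MN, mean 1 with small variance, is a common choice we assume here; the paper does not specify it):

```python
import numpy as np

rng = np.random.default_rng(7)
X = np.array([23.5, 41.0, 37.2, 29.8])   # a confidential numeric attribute

# SAN: Y = X + e, with e drawn from a zero-mean, unit-variance distribution.
Y_san = X + rng.normal(0.0, 1.0, size=X.shape)

# MN: Y = X * e; here e is drawn around 1 so values are scaled, not destroyed.
Y_mn = X * rng.normal(1.0, 0.1, size=X.shape)
```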

4 Proposed Model

This paper proposes a new approach to HPPDM, as shown in Fig. 1. In this approach, a group of clients selects a Trusted Party that has the capability to retrieve information from large data. A symmetric cryptography algorithm is used for communication between clients. Suppose client1 wants to collaborate with another client, client2. Client1 sends a request to client2 for approval of the collaboration. If client2 sends an acceptance response, both clients choose their own transformation/modification techniques to modify their data.

Fig. 1
Model for HPPDM

This work focuses on perturbing numeric data; numerical attributes are treated as confidential. Different transformation techniques, discussed in Sect. 3, are used to modify the original data. Both clients modify their original data using these techniques, and the modified data is then sent to the Trusted Party. The TP decrypts the modified data collected from the clients and performs a knowledge-based technique; here, K-Nearest Neighbor (K-NN) is used as the knowledge-based technique.

Theorem

Suppose that T: \( R^{n} \to R^{n} \) is a linear transformation with matrix A. Then T preserves scalar products, and therefore distances between points/vectors, if and only if the associated matrix A is orthogonal.

Here T(X) = T1(X1) + T2(X2), where T1 and T2 are linear transformations; consequently, T is also a linear transformation. X1 and X2 are the original data of the respective clients.
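The theorem is easy to check numerically; a sketch with a random orthogonal matrix (obtained via QR decomposition, our choice of construction) shows that scalar products, and hence distances, are invariant:

```python
import numpy as np

rng = np.random.default_rng(1)
# Any orthogonal matrix A works; QR of a random matrix yields one.
A, _ = np.linalg.qr(rng.normal(size=(4, 4)))

u = rng.normal(size=4)
v = rng.normal(size=4)
# Scalar products are preserved: <Au, Av> = <u, v> ...
assert np.isclose((A @ u) @ (A @ v), u @ v)
# ... and therefore so are Euclidean distances.
assert np.isclose(np.linalg.norm(A @ u - A @ v), np.linalg.norm(u - v))
```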

Any data, such as parameters, keys, and modified data, that needs to be shared between the clients and the TP must be secured using symmetric cryptography algorithms.

5 Experimental Work

Experimental work was conducted on two real datasets, Iris and WDBC, collected from [13]. Each dataset is taken in matrix format: a row represents an object and a column represents an attribute. The dataset is divided into numerical and categorical attributes; only the numerical attributes are considered and shared between the clients.

The data is distributed in two ways. A random value, generated using a random function, is interpreted as a percentage of the total records. In method 1, records from the first record up to that percentage are sent to client1, and the remaining records are sent to client2. In method 2, the first n records are set aside (n varies with the dataset size), and the random percentage then selects the next block of records for client1; client2 receives some records from the first n together with the remaining records of the dataset. If a client chooses WHT as its transformation technique, data pre-processing is required: if the number of attributes is not a power of two, columns are added up to the nearest \( 2^{n} \), n = 0, 1, 2, 3, … (Table 1).
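The two distribution methods and the WHT padding step can be sketched as follows (zero-padding the added columns is our assumption; the paper only states that columns are added up to the nearest power of two):

```python
import numpy as np

def split_method1(data, pct):
    """Method 1: the first pct% of records go to client1, the rest to client2."""
    cut = int(len(data) * pct / 100)
    return data[:cut], data[cut:]

def split_method2(data, pct, n_skip):
    """Method 2: skip the first n_skip records; client1 gets the next pct% of
    records, client2 gets the skipped records plus the leftover records."""
    cut = n_skip + int(len(data) * pct / 100)
    return data[n_skip:cut], np.vstack([data[:n_skip], data[cut:]])

def pad_for_wht(data):
    """Pad attribute columns with zeros up to the nearest power of two."""
    n_cols = data.shape[1]
    target = 1 << (n_cols - 1).bit_length()
    return np.hstack([data, np.zeros((data.shape[0], target - n_cols))])
```

For example, WDBC's 30 numeric attributes would be padded to 32 columns before applying WHT, while Iris's 4 attributes need no padding.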

Table 1 Data set description

K-NN from the WEKA tool [14] is used as the classification technique. In the experiments, K is set to 3, 5, and 7, and tenfold cross-validation is used when running the K-NN algorithm. Four combinations of linear transformations are considered: WHT-WHT, WHT-DCT, WHT-FISIP, and WHT-SAN. Following the two data-distribution methods, four random values (two below 50 and two between 50 and 100) are selected per method for each dataset. As per method 2 described above, 35 and 135 records are skipped in the IRIS and WDBC datasets, respectively. Classifier results for the IRIS dataset are shown in Tables 2, 3, 4 and 5, and classifier accuracy for WDBC in Tables 6, 7, 8 and 9. The original IRIS data gives 96.00 % accuracy using K-NN, and the modified IRIS data gives acceptable accuracy at k = 7 with all methods. The original WDBC dataset gives 97.18 %, and the modified WDBC data gives acceptable accuracy in all cases. Different privacy measures (VD, RP, CP, RK, and CK) from [15] are calculated.

Table 2 Accuracy on Iris data set using combination of WHT-WHT
Table 3 Accuracy on Iris data set using combination of WHT-DCT
Table 4 Accuracy on Iris data set using combination of WHT-FISIP
Table 5 Accuracy on Iris data set using combination of WHT-SAN
Table 6 Accuracy on WDBC data set using combination of WHT-WHT
Table 7 Accuracy on WDBC data set using combination of WHT-DCT
Table 8 Accuracy on WDBC data set using combination of WHT-FISIP
Table 9 Accuracy on WDBC data set using combination of WHT-SAN
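As a stand-in for WEKA's K-NN (the paper uses the WEKA tool; this minimal NumPy version is only illustrative), the classification step looks like:

```python
import numpy as np
from collections import Counter

def knn_predict(train_X, train_y, x, k=3):
    """Predict the majority class among the k nearest neighbors of x."""
    dist = np.linalg.norm(train_X - x, axis=1)   # Euclidean distances
    nearest = np.argsort(dist)[:k]               # indices of the k closest rows
    return Counter(train_y[nearest]).most_common(1)[0][0]

# Toy check on a tiny two-class training set.
X = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.1, 4.9]])
y = np.array([0, 0, 1, 1])
assert knn_predict(X, y, np.array([0.2, 0.1])) == 0
```

Because the transformations used are distance-preserving, K-NN produces the same neighbor rankings on the modified data as on the original data.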

The average values over all data distributions are shown in Tables 10 and 11. The higher the values of RP and CP, and the lower the values of RK and CK, the more privacy is preserved [15].

Table 10 Privacy measures on IRIS
Table 11 Privacy measures on WDBC
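The privacy measures follow [15]; a sketch of two of them (VD and a rank-based position measure like RP) under commonly used definitions — the exact normalization in [15] may differ — is:

```python
import numpy as np

def value_difference(X, Y):
    """VD: relative Frobenius-norm difference between original and perturbed data."""
    return np.linalg.norm(X - Y) / np.linalg.norm(X)

def rank_position(X, Y):
    """RP-style measure: average change in the rank (sort position) of each
    value within its attribute column after perturbation."""
    rank_x = np.argsort(np.argsort(X, axis=0), axis=0)
    rank_y = np.argsort(np.argsort(Y, axis=0), axis=0)
    return np.abs(rank_x - rank_y).mean()
```

Larger values of both indicate greater distortion of values and of their relative order, i.e. more privacy preserved.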

6 Conclusion and Future Work

This paper proposes a new approach to horizontal PPDM based on combinations of linear transformations. The experimental results indicate that it is a simple and efficient approach to protecting the privacy of individual customers. In future work, this approach will be extended to more than two clients and to different combinations of transformation techniques, and will also be applied to vertically partitioned data.