Abstract
Many business applications rely on data mining techniques. Small organizations often collaborate to develop applications that help them run their businesses smoothly in a competitive world. To develop such applications, the organizations need to share data among themselves, which raises privacy issues for individual customers, for example the disclosure of personal information. This paper proposes a method that combines the Walsh Hadamard Transformation (WHT) with existing data perturbation techniques to preserve privacy in business applications. The proposed technique transforms the original data into a new domain, which protects the privacy of an organization's individual customers. Experiments were conducted on two real data sets. The observations show that the proposed technique gives acceptable accuracy with the K-Nearest Neighbour (K-NN) classifier. Finally, data distortion measures were calculated.
1 Introduction
Explosive growth in data storage and data processing technologies has led to the creation of huge databases that contain valuable information. Data mining techniques are used to retrieve hidden patterns from these large databases. In distributed data mining, a group of clients share their data with a trusted third party. For example, a health organization collects data about the diseases of a particular geographical area near the organization. To improve the quality of its information, the organization collaborates with other organizations, which helps the participating clients conduct their business smoothly. The third party performs a knowledge based technique on the data collected from the group of clients. In this scenario, clients can be grouped into three categories: honest, semi-honest and malicious. An honest client always obeys the protocol and does not alter the data or information. A semi-honest client follows the protocol but tries to acquire knowledge about other clients while executing it. A malicious or unauthorized client always tries to access others' data and alter the information. To secure the data, this paper proposes a novel approach.
Many researchers address this problem using cryptography, perturbation and reconstruction based techniques. This paper proposes an approach for Horizontal Privacy Preserving Data Mining (HPPDM) that combines different transformations. Consider a Trusted Party (TP) among a group of clients. The TP communicates with the other clients using a symmetric cryptography algorithm and assigns transformation techniques to each client. In this work, transformation techniques such as WHT, Simple Additive Noise (SAN), Multiplicative Noise (MN) and FIrst and Second order sum and Inner product Preservation (FISIP) are used. The proposed model is evaluated with data distortion and privacy measures such as Value Difference (VD) and the Position Difference parameters CP, RP, CK and RK.
This paper is organized as follows: Sect. 2 discusses the related work. Section 3 focuses on transformation techniques. Section 4 explains the proposed model. Section 5 discusses the experimental results, and Sect. 6 presents the conclusion and future work.
2 Related Work
Recently there has been a lot of research addressing the issue of Privacy Preserving Data Mining (PPDM). PPDM techniques are mostly divided into two categories: data perturbation and cryptographic techniques. In data perturbation methods, data owners alter the data before sharing it with the data miner. Cryptographic techniques are used in the distributed data mining scenario to provide privacy to individual customers.
Let X be a numerical attribute and Y be the perturbed value of X. Traub et al. [1] proposed SAN as Y = X + e and MN as Y = X * e. Other perturbation methods such as Micro Aggregation (MA), Univariate Micro Aggregation (UMA) and Multivariate Micro Aggregation (MMA) are proposed in [2–4]. Algorithm based transformations are discussed in Sect. 3.
In PPDM, distributed data mining is divided into two categories, horizontal and vertical. Yao [5] introduced a two-way communication protocol using cryptography. Kantarcioglu and Clifton proposed a secure K-NN classifier [6]. In [7], Yang et al. proposed a simple cryptographic approach in which many customers participate without loss of privacy or of classifier accuracy. A framework was proposed in [8] that includes a general model as well as multi-round algorithms for HPPDM using a privacy preserving K-NN classifier. Kantarcioglu and Vaidya proposed a privacy preserving Naive Bayes classifier for horizontally partitioned data in [9]. Xu and Yi [10] discussed the classification of privacy preserving distributed data mining protocols.
3 Transformation Techniques
3.1 Walsh Hadamard Transformation
Definition
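The Hadamard transform \( H_n \) is a \( 2^n \times 2^n \) matrix, the Hadamard matrix (scaled by a normalization factor), that transforms \( 2^n \) real numbers \( X_n \) into \( 2^n \) real numbers \( X_k \). The Walsh-Hadamard transform of a signal X of size \( N = 2^n \) is the matrix-vector product \( X H_n \), where
\( H_n = \bigotimes_{i = 1}^{n} H_2 = \underbrace{H_2 \otimes H_2 \otimes \cdots \otimes H_2}_{n\ \text{factors}} \)
The matrix \( {\text{H}}_{ 2} = \left[ {\begin{array}{*{20}c} 1 & 1 \\ 1 & { - 1} \\ \end{array} } \right] \) and \( \otimes \) denotes the tensor or Kronecker product. The tensor product of two matrices is obtained by replacing each entry of the first matrix with that entry multiplied by the second matrix. For example,
\( H_2 \otimes H_2 = \left[ {\begin{array}{*{20}c} 1 & 1 & 1 & 1 \\ 1 & { - 1} & 1 & { - 1} \\ 1 & 1 & { - 1} & { - 1} \\ 1 & { - 1} & { - 1} & 1 \\ \end{array} } \right] \)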
The normalized Walsh-Hadamard matrix is orthogonal, so the transformation preserves the Euclidean distance between data points. The transformation is also used in image processing and signal processing.
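The construction and the distance-preservation property can be checked numerically. The following is a minimal sketch using NumPy, not the authors' implementation; the function names `hadamard` and `wht` and the random test data are assumptions made for illustration.

```python
import numpy as np

def hadamard(n: int) -> np.ndarray:
    """Normalized 2**n x 2**n Hadamard matrix built by repeated Kronecker products with H_2."""
    h2 = np.array([[1.0, 1.0], [1.0, -1.0]])
    h = np.array([[1.0]])
    for _ in range(n):
        h = np.kron(h, h2)
    # Dividing by sqrt(2**n) makes the rows orthonormal.
    return h / np.sqrt(2.0 ** n)

def wht(records: np.ndarray) -> np.ndarray:
    """Transform each row (record); the number of columns must be a power of two."""
    n_cols = records.shape[1]
    n = n_cols.bit_length() - 1
    if (1 << n) != n_cols:
        raise ValueError("number of columns must be a power of two")
    return records @ hadamard(n)

# Hypothetical data: 5 records with 4 numerical attributes.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 4))
Y = wht(X)

H = hadamard(2)
is_orthogonal = np.allclose(H @ H.T, np.eye(4))
dist_before = np.linalg.norm(X[0] - X[1])
dist_after = np.linalg.norm(Y[0] - Y[1])
print(is_orthogonal, dist_before, dist_after)
```

The two printed distances should agree up to floating-point rounding, which is the property a distance-based classifier such as K-NN relies on.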
3.2 First and Second Order Sum and Inner Product Preservation (FISIP)
FISIP is a distance and correlation preserving transformation [11].
Definition
The matrix representation of a linear transformation can be written as A = [Ai] = [A1 A2 … Ak]. Additionally, each Ai can be written as Ai = [Aim]. The transformation is called a FISIP transformation if A has the following properties:
a. \( \mathop \sum \limits_{m = 1}^{k} A_{im} = 1 \)
b. \( \mathop \sum \limits_{m = 1}^{k} A_{im}^{2} = 1 \)
c. \( \mathop \sum \limits_{m = 1}^{k} A_{im} A_{jm} = 0, \,\,for \,i \ne j \)
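The three properties can be tested directly on a candidate matrix. The sketch below is an illustration only: the helper names `is_fisip` and `householder_example` are hypothetical, and the Householder reflection is just one convenient way to obtain a non-trivial matrix that satisfies the properties, not the construction used in [11]. The check treats Ai as the i-th row of A; for a square matrix, reading Ai as a column gives the same verdict, because properties (b) and (c) make A orthogonal.

```python
import numpy as np

def is_fisip(A, tol: float = 1e-9) -> bool:
    """Check properties (a), (b) and (c) on the rows of A."""
    A = np.asarray(A, dtype=float)
    first_order = np.allclose(A.sum(axis=1), 1.0, atol=tol)          # (a) row sums equal 1
    second_order = np.allclose((A ** 2).sum(axis=1), 1.0, atol=tol)   # (b) sums of squares equal 1
    gram = A @ A.T
    off_diagonal = gram - np.diag(np.diag(gram))
    inner_product = np.allclose(off_diagonal, 0.0, atol=tol)          # (c) distinct rows are orthogonal
    return bool(first_order and second_order and inner_product)

def householder_example(v) -> np.ndarray:
    """Householder reflection I - 2 v v^T / (v^T v).

    It is orthogonal for any non-zero v; if the entries of v sum to zero,
    every row also sums to 1, so all three properties hold.
    """
    v = np.asarray(v, dtype=float)
    return np.eye(len(v)) - 2.0 * np.outer(v, v) / (v @ v)

A = householder_example([1.0, 1.0, -2.0])
print(A)
print(is_fisip(A))
```

A scaled identity such as `2 * np.eye(3)` fails properties (a) and (b), which gives a quick negative check for the helper.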
3.3 Discrete Cosine Transformation (DCT)
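DCT works on real numbers and gives the following real coefficients:
\( X_{k} = \sqrt {\frac{2}{N}} \,\Lambda_{k} \mathop \sum \limits_{i = 0}^{N - 1} x_{i} \cos \left[ {\frac{{\left( {2i + 1} \right)k\pi }}{2N}} \right],\quad k = 0, 1, \ldots , N - 1 \)
where \( \Lambda_{k} = 1/\sqrt 2 \) for k = 0 and 1 otherwise. These transforms are unitary, and the Euclidean distance between two sequences is preserved [12].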
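The distance-preservation claim can be verified with a small numerical sketch. It assumes the orthonormal DCT-II form, in which the k = 0 row is scaled by 1/sqrt(2); the function name `dct_matrix` and the random test data are illustrative assumptions.

```python
import numpy as np

def dct_matrix(n: int) -> np.ndarray:
    """Orthonormal DCT-II matrix: C[k, i] = s_k * sqrt(2/n) * cos((2i + 1) k pi / (2n)),
    with s_0 = 1/sqrt(2) and s_k = 1 for k > 0."""
    k = np.arange(n).reshape(-1, 1)   # output (frequency) index
    i = np.arange(n).reshape(1, -1)   # input (sample) index
    C = np.sqrt(2.0 / n) * np.cos((2 * i + 1) * k * np.pi / (2 * n))
    C[0, :] /= np.sqrt(2.0)
    return C

# Hypothetical data: 6 records with 5 numerical attributes.
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 5))
C = dct_matrix(5)
Y = X @ C.T                           # DCT of each row

is_unitary = np.allclose(C @ C.T, np.eye(5))
dist_before = np.linalg.norm(X[0] - X[1])
dist_after = np.linalg.norm(Y[0] - Y[1])
print(is_unitary, dist_before, dist_after)
```

Unlike WHT, this transform does not require the number of attributes to be a power of two.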
3.4 Randomization
Randomization is one of the simplest approaches to PPDM. It involves perturbing numerical data. Let X be a confidential attribute and Y be the perturbed data [1]. SAN is defined as Y = X + e, where e is a random value drawn from a distribution with mean zero and variance 1.
MN is defined as Y = X * e, where e is a random value.
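Both perturbations are one-liners in practice. The sketch below is an assumption-laden illustration: the text fixes only the mean and variance of the SAN noise, so a normal distribution is assumed here, and the MN noise distribution (normal with mean 1 and a small standard deviation) is entirely a hypothetical choice. The function names are also hypothetical.

```python
import numpy as np

def simple_additive_noise(x, rng, sigma: float = 1.0) -> np.ndarray:
    """SAN: Y = X + e, with e drawn from a zero-mean distribution (normal assumed here)."""
    x = np.asarray(x, dtype=float)
    e = rng.normal(loc=0.0, scale=sigma, size=x.shape)
    return x + e

def multiplicative_noise(x, rng, mean: float = 1.0, sigma: float = 0.1) -> np.ndarray:
    """MN: Y = X * e. The distribution of e is an assumption (normal around 1)."""
    x = np.asarray(x, dtype=float)
    e = rng.normal(loc=mean, scale=sigma, size=x.shape)
    return x * e

rng = np.random.default_rng(42)
X = np.array([[5.1, 3.5], [4.9, 3.0], [6.2, 2.9]])   # made-up attribute values
Y_san = simple_additive_noise(X, rng)
Y_mn = multiplicative_noise(X, rng)
print(Y_san)
print(Y_mn)
```

Neither perturbation is an orthogonal transformation, so pairwise distances are only approximately retained, to a degree that depends on the noise scale.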
4 Proposed Model
This paper proposes a new approach for HPPDM, as shown in Fig. 1. In this approach, a group of clients select a Trusted Party that has the capability to retrieve information from large data. A symmetric cryptography algorithm is used for communication between clients. Suppose client1 wants to collaborate with another client, client2. Client1 sends a request to client2 for approval of the collaboration. If client2 sends an acceptance response, both clients choose their own transformation/modification techniques to modify the data.
This work focuses on perturbing numeric data. Numerical attributes are considered confidential attributes. Different transformation techniques, discussed in Sect. 3, are used to modify the original data. Both clients modify their original data using transformation techniques, and the modified data is then sent to the Trusted Party. The TP decrypts the modified data collected from the clients and performs a knowledge based technique on it. K-Nearest Neighbor (K-NN) is used as the knowledge based technique.
Theorem
Suppose that \( T:\mathbb{R}^{n} \to \mathbb{R}^{n} \) is a linear transformation with matrix A. Then T preserves scalar products, and therefore distances between points/vectors, if and only if the associated matrix A is orthogonal.
Here T(X) = T1(X1) + T2(X2), where T1 and T2 are linear transformations and X1 and X2 are the original data of the respective clients; consequently, T is also a linear transformation.
Any data, such as parameters, keys and modified data, that needs to be shared between the clients and the TP must be secured using symmetric cryptography algorithms.
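The reasoning behind the theorem is short enough to spell out. The following is a standard argument, written with x and y for two arbitrary points and \( e_i \) for the i-th standard basis vector (notation introduced here for illustration):

```latex
% If A is orthogonal (A^T A = I), scalar products are preserved:
\langle Ax, Ay \rangle = x^{T} A^{T} A\, y = x^{T} y = \langle x, y \rangle .
% Distances follow by applying this to the difference x - y:
\lVert Ax - Ay \rVert^{2} = (x - y)^{T} A^{T} A\,(x - y) = \lVert x - y \rVert^{2} .
% Conversely, if scalar products are preserved for all x and y,
% choosing x = e_i and y = e_j gives
(A^{T} A)_{ij} = \langle A e_i, A e_j \rangle = \langle e_i, e_j \rangle = \delta_{ij},
% so A^T A = I and A is orthogonal.
```

Note that the argument covers a single orthogonal matrix applied to all records; it does not by itself say anything about distances between records that were transformed with different techniques.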
5 Experimental Work
Experiments were conducted on two real datasets, Iris and WDBC, collected from [13]. Each dataset is taken in matrix format, where a row indicates an object and a column indicates an attribute. The attributes of each dataset are divided into two parts, numerical and categorical. Only the numerical attributes are considered, and they are shared between the clients.
The data is distributed using two methods. A random value is generated using a random function and is interpreted as a percentage of the total records. In method 1, client1 receives the records from record 1 up to that percentage of the records, and the remaining records are sent to client2. In method 2, the first n records are skipped (the value of n changes with the dataset size), and the random value then determines how many of the following records are sent to client1. Client2 receives a few records from the first n records merged with the remaining left-out records of the whole dataset. If a client chooses WHT as its transformation technique, data pre-processing is required: if the number of attributes is less than \( 2^n \), n = 0, 1, 2, 3, …, columns are added to reach the nearest \( 2^n \) (Table 1).
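The two distribution methods and the WHT pre-processing step can be sketched as follows. Several details are assumptions, because the description leaves them open: the percentage is applied to the total record count in both methods, client2 in method 2 receives all skipped records plus the remainder, and the added columns are filled with zeros. All function names are hypothetical.

```python
import numpy as np

def split_method1(records: np.ndarray, pct: float):
    """Method 1: the first pct percent of the records go to client1, the rest to client2."""
    cut = int(round(len(records) * pct / 100.0))
    return records[:cut], records[cut:]

def split_method2(records: np.ndarray, pct: float, skip: int):
    """Method 2 (one reading): skip the first `skip` records, give the next
    pct percent of the total to client1, and give everything else to client2."""
    cut = int(round(len(records) * pct / 100.0))
    client1 = records[skip:skip + cut]
    client2 = np.concatenate([records[:skip], records[skip + cut:]])
    return client1, client2

def pad_to_power_of_two(records: np.ndarray, fill: float = 0.0) -> np.ndarray:
    """Append constant columns until the column count is a power of two (zero fill assumed)."""
    n_cols = records.shape[1]
    target = 1 << (n_cols - 1).bit_length()   # smallest power of two >= n_cols
    return np.pad(records, ((0, 0), (0, target - n_cols)), constant_values=fill)

# Made-up data with the shape of a 150-record, 4-attribute dataset.
data = np.arange(150 * 4, dtype=float).reshape(150, 4)
m1_c1, m1_c2 = split_method1(data, 40)
m2_c1, m2_c2 = split_method2(data, 40, skip=35)
padded = pad_to_power_of_two(np.ones((3, 30)))
print(m1_c1.shape, m1_c2.shape, m2_c1.shape, m2_c2.shape, padded.shape)
```

With four attributes no padding is needed, whereas a 30-attribute table is padded to 32 columns before WHT can be applied.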
K-NN from the WEKA tool [14] is used as the classification technique. In the experiments, K is set to 3, 5 and 7, and tenfold cross validation is used when running the K-NN algorithm. Four combinations of transformations are considered: WHT-WHT, WHT-DCT, WHT-FISIP and WHT-SAN. For each of the two data distribution methods, four random values are selected per dataset (two below 50 and two between 50 and 100). As per method 2 discussed above, 35 and 135 records are skipped in the IRIS and WDBC datasets, respectively. Classifier results for the IRIS dataset are shown in Tables 2, 3, 4 and 5. Classifier accuracy for WDBC is shown in Tables 6, 7, 8 and 9. The original IRIS data gives 96.00 % accuracy using K-NN. The modified IRIS data gives acceptable accuracy for k = 7 with all methods. The original WDBC dataset gives 97.18 %, and the modified WDBC data gives acceptable accuracy in all cases. The privacy measures VD, RP, CP, RK and CK from [15] were also calculated.
The average values over all data distributions are shown in Tables 10 and 11. Higher values of RP and CP and lower values of RK and CK indicate that more privacy is preserved [15].
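The evaluation protocol can be approximated without WEKA. The sketch below is not a replica of the WEKA classifier or of the reported experiments: it is a plain NumPy K-NN with unstratified tenfold cross validation, run on synthetic two-class data, and it applies a single normalized WHT to all records, which is simpler than the two-client combinations used in the paper. All names and data are assumptions.

```python
import numpy as np

def knn_predict(train_x, train_y, test_x, k: int) -> np.ndarray:
    """Majority vote among the k nearest training records (Euclidean distance)."""
    d = np.linalg.norm(test_x[:, None, :] - train_x[None, :, :], axis=2)
    nearest = np.argsort(d, axis=1, kind="stable")[:, :k]
    votes = train_y[nearest]
    return np.array([np.bincount(row).argmax() for row in votes])

def cross_val_accuracy(x, y, k: int, folds: int = 10, seed: int = 0) -> float:
    """Unstratified k-fold cross validation accuracy."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(x))
    correct = 0
    for part in np.array_split(order, folds):
        mask = np.ones(len(x), dtype=bool)
        mask[part] = False                      # hold out the current fold
        pred = knn_predict(x[mask], y[mask], x[part], k)
        correct += int((pred == y[part]).sum())
    return correct / len(x)

# Synthetic two-class data with 4 attributes (not the Iris or WDBC data).
rng = np.random.default_rng(1)
x = np.vstack([rng.normal(0.0, 1.0, (60, 4)), rng.normal(1.5, 1.0, (60, 4))])
y = np.array([0] * 60 + [1] * 60)

h2 = np.array([[1.0, 1.0], [1.0, -1.0]])
H = np.kron(h2, h2) / 2.0                       # normalized 4 x 4 Hadamard matrix

for k in (3, 5, 7):
    acc_orig = cross_val_accuracy(x, y, k)
    acc_wht = cross_val_accuracy(x @ H, y, k)
    print(k, acc_orig, acc_wht)
```

Because one orthogonal matrix is applied to every record, the neighbour sets are unchanged and the two accuracies should coincide, up to floating-point ties. When the clients use different techniques, distances across clients are no longer guaranteed to be preserved, which is why accuracy has to be measured empirically.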
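For readers who want to reproduce such measures, the sketch below computes them under the definitions commonly given for this family of metrics: VD as the relative Frobenius-norm difference, RP and RK as the average rank change and the fraction of unchanged ranks within each attribute, and CP and CK as the same two quantities for the ranks of the attribute averages. These definitions are an assumption here and should be checked against [15]; the function names are hypothetical.

```python
import numpy as np

def ranks(values: np.ndarray) -> np.ndarray:
    """Rank position (0 = smallest) of each element of a 1-D array."""
    order = np.argsort(values, kind="stable")
    r = np.empty(len(values), dtype=int)
    r[order] = np.arange(len(values))
    return r

def privacy_measures(original: np.ndarray, modified: np.ndarray) -> dict:
    """Assumed definitions of VD, RP, RK, CP and CK for two equally shaped matrices."""
    n_cols = original.shape[1]
    # VD: relative value difference in the Frobenius norm.
    vd = np.linalg.norm(original - modified) / np.linalg.norm(original)
    # Ranks of the elements within each attribute (column).
    r_orig = np.column_stack([ranks(original[:, j]) for j in range(n_cols)])
    r_mod = np.column_stack([ranks(modified[:, j]) for j in range(n_cols)])
    rp = np.abs(r_orig - r_mod).mean()          # average change of rank position
    rk = (r_orig == r_mod).mean()               # fraction of elements keeping their rank
    # Ranks of the attribute averages.
    a_orig = ranks(original.mean(axis=0))
    a_mod = ranks(modified.mean(axis=0))
    cp = np.abs(a_orig - a_mod).mean()          # average change of attribute rank
    ck = (a_orig == a_mod).mean()               # fraction of attributes keeping their rank
    return {"VD": float(vd), "RP": float(rp), "RK": float(rk),
            "CP": float(cp), "CK": float(ck)}

# Made-up data perturbed with additive noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
X_noisy = X + rng.normal(0.0, 1.0, size=X.shape)
print(privacy_measures(X, X_noisy))
```

Comparing a matrix with itself gives VD = RP = CP = 0 and RK = CK = 1, the no-privacy baseline against which perturbed data can be judged.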
6 Conclusion and Future Work
This paper proposes a new approach for horizontal PPDM based on a combination of linear transformations. The experimental results indicate that it is a simple and efficient approach to protecting the privacy of individual customers. In future, this approach will be extended to more than two clients and to different combinations of transformation techniques, and it will also be applied to vertically partitioned data.
References
J.F. Traub, Y. Yemini, and H. Wozniakowski, “The Statistical Security of a Statistical Database,” ACM Trans. Database Systems, vol. 9, no. 4, pp. 672–679, 1984.
C.C. Aggarwal and P.S. Yu, “A Condensation Approach to Privacy Preserving Data Mining,” Proc. Ninth Int’l Conf. Extending Database Technology, pp. 183–199, 2004.
D. Defays and P. Nanopoulos, “Panels of Enterprises and Confidentiality: The Small Aggregates Method,” Proc. Statistics Canada Symp. 92 Design and Analysis of Longitudinal Surveys, pp. 195–204, 1993.
J. Domingo-Ferrer and J.M. Mateo-Sanz, “Practical Data-Oriented Microaggregation for Statistical Disclosure Control,” IEEE Trans. Knowledge and Data Eng., vol. 14, no. 1, pp. 189–201, 2002.
C.C. Yao, “How to Generate and Exchange Secrets,” IEEE, 1986.
M. Kantarcioglu and C. Clifton, “Privately Computing a Distributed k-nn Classifier,” PKDD, LNCS, vol. 3202, pp. 279–290, 2004.
Z. Yang, S. Zhong, and R. Wright, “Privacy-preserving Classification of Customer Data without Loss of Accuracy,” Proc. Fifth SIAM International Conference on Data Mining, pp. 92–102, Newport Beach, CA, April 21–23, 2005.
L. Xiong, S. Chitti, and L. Liu, “k Nearest Neighbor Classification across Multiple Private Databases,” CIKM’06, pp. 840–841, Arlington, Virginia, USA, November 5–11, 2006.
M. Kantarcioglu and J. Vaidya, “Privacy Preserving Naïve Bayes Classifier for Horizontally Partitioned Data,” IEEE ICDM Workshop on Privacy Preserving Data Mining, Melbourne, FL, pp. 3–9, November 2003.
Zhuojia Xu and Xun Yi, “Classification of Privacy-preserving Distributed Data Mining Protocols,” IEEE, 2011.
Jen-Wei Huang, Jun-Wei Su, and Ming-Syan Chen, “FISIP: A Distance and Correlation Preserving Transformation for Privacy Preserving Data Mining,” IEEE, 2011.
Shibnath Mukherjee, Zhiyuan Chen, and Aryya Gangopadhyay, “A Privacy-preserving Technique for Euclidean Distance-based Mining Algorithms Using Fourier-related Transforms,” The VLDB Journal, pp. 293–315, 2006.
Shuting Xu, Jun Zhang, Dianwei Han, and Jie Wang, “Data Distortion for Privacy Protection in a Terrorist Analysis System,” in P. Kantor et al. (Eds.): ISI 2005, LNCS 3495, pp. 459–464, 2005.
© 2016 Springer India
Cite this paper
Jalla, H.R., Girija, P.N. (2016). A Novel Approach for Horizontal Privacy Preserving Data Mining. In: Satapathy, S.C., Mandal, J.K., Udgata, S.K., Bhateja, V. (eds) Information Systems Design and Intelligent Applications. Advances in Intelligent Systems and Computing, vol 434. Springer, New Delhi. https://doi.org/10.1007/978-81-322-2752-6_9
Publisher Name: Springer, New Delhi
Print ISBN: 978-81-322-2750-2
Online ISBN: 978-81-322-2752-6