A Novel Approach for Horizontal Privacy Preserving Data Mining

Jalla, Hanumantha Rao; Girija, P. N.

doi:10.1007/978-81-322-2752-6_9

Hanumantha Rao Jalla¹⁸ &
P. N. Girija¹⁹

Part of the book series: Advances in Intelligent Systems and Computing ((AISC,volume 434))

1629 Accesses

Abstract

Many business applications use data mining techniques. Small organizations collaborate with each other to develop few applications to run their business smoothly in competitive world. While developing an application the organization wants to share data among themselves. So, it leads to the privacy issues of the individual customers, like personal information. This paper proposes a method which combines Walsh Hadamard Transformation (WHT) and existing data perturbation techniques to ensure privacy preservation for business applications. The proposed technique transforms original data into a new domain that achieves privacy related issues of individual customers of an organization. Experiments were conducted on two real data sets. From the observations it is concluded that the proposed technique gives acceptable accuracy with K-Nearest Neighbour (K-NN) classifier. Finally, the calculation of data distortion measures were done.

Access provided by Autonomous University of Puebla. Download conference paper PDF

Data Mining: Min–Max Normalization Based Data Perturbation Technique for Privacy Preservation

A Hybrid Approach for Privacy-Preserving Data Mining

Privacy Preserving Datamining Techniques with Data Security in Data Transformation

Keywords

1 Introduction

Explosive growth in data storing and data processing technologies has led to creation of huge databases that contains fruitful information. Data mining techniques are used to retrieve hidden patterns from the large databases. In distributed data mining a group of clients share their data with trusted third party. For example a health organization collects data about diseases of a particular geographical area that is nearby to the organization. To improve the quality of information and collaborate with other organizations, which benefits the participated clients to conduct their business smoothly. The Third party performs a knowledge based technique on the collected data from group of clients. In this scenario, clients can be grouped under three categories like honest, semi-honest and malicious. First category honest client always obey the protocol and will not alter the data or information. Second category, semi-honest client follows the protocol but tries to acquire the knowledge about other clients while executing. Third category, malicious client or unauthorized client always tries to access other’s data and alter the information. To provide security of the data, this paper proposes a novel approach.

Many researchers address this problem by using cryptography, perturbation and reconstruction based techniques. This paper proposes an approach for Horizontal Privacy Preserving Data Mining (HPPDM), with combination of different transformations. Consider a Trusted Party (TP) among group of clients. Then TP communicate with other clients by using symmetric cryptography algorithm and assigns transformation techniques to each client. In this work, transformation techniques such as WHT, Simple Additive Noise (SAN), Multiplicative Noise (MN) and FIrst and Second order sum and Inner product Preservation (FISIP) are used. Proposed model is evaluated with data distortion and privacy measures such as Value Difference (VD) and Position Difference parameters like CP, RP, CK and RK.

This paper is organized as follows: Sect. 2 discusses about the Related Work. Section 3 focus on Transformation Techniques. Section 4 explains about Proposed Model. Section 5 discusses Experimental Results and in Sect. 6 Conclusion and Future Work discussed.

2 Related Work

Recently there has been lot of research addressing the issue of Privacy Preserving Data Mining (PPDM). PPDM techniques are mostly divided into two categories such as data perturbation and cryptographic techniques. In data perturbation methods, data owners can alter the data before sharing with data miner. Cryptography techniques are used in distributed data mining scenario to provide privacy to individual customers.

Let X be a numerical attribute and Y be the perturbed value of X. Traub [1] proposed SAN as Y = X + e and MN as Y = X * e. Other perturbation methods such as Micro Aggregation (MA), Univariate Micro Aggregation (UMA) and Multivariate Micro Aggregation (MMA) are proposed in [2–4]. Algorithm based Transformations are discussed in Sect. 3.

Distributed Data Mining is divided into two categories Horizontal and vertical in PPDM. Yao [5] introduced two-way Communication protocol using cryptography. Murat and Clifton proposed a secure K-NN classifier [6]. In [7] Yang et al. proposed a simple cryptographic approach i.e. many customers participate without loss of privacy and accuracy of a classifier. A frame work was proposed [8] which include general model as well as multi-round algorithms for HPPDM by using a privacy preserving K-NN classifier. Kantarcioglu and Vaidya proposed privacy preserving Naive Bayes classifier for Horizontally Partition data in [9]. Xu and Yi [10] discussed about classification of privacy preserving Distributed Data Mining protocols.

3 Transformation Techniques

3.1 Walsh Hadamard Transformation

Definition

The Hadmard transform H_n is a 2ⁿ × 2ⁿ matrix, the Hadamard matrix (scaled by normalization factor), that transforms 2ⁿ real numbers X_n into 2ⁿ real numbers X_k. The Walsh-Hadamard transform of a signal X of size N = 2ⁿ, is the matrix vector product X*H_n. Where

$$ H_{N} = {}_{i = 1}^{n} \otimes H_{2} = \underbrace {{H_{2} \otimes H_{2} \otimes \ldots \otimes H_{2} }}_{n} $$

The matrix $ {\text{H}}_{ 2} = \left[ {\begin{array}{*{20}c} 1 & 1 \\ 1 & { - 1} \\ \end{array} } \right] $ and $ \otimes $ denotes the tensor or kronecker product. The tensor product of two matrices is obtained by replacing each entry of first matrix by multiplying the first matrix with corresponding elements of second matrix. For example

$$ H_{4} = \left[ {\begin{array}{*{20}c} 1 & 1 \\ 1 & { - 1} \\ \end{array} } \right] \otimes \left[ {\begin{array}{*{20}c} 1 & 1 \\ 1 & { - 1} \\ \end{array} } \right] = \left[ {\begin{array}{*{20}c} {\begin{array}{*{20}c} 1 & 1 \\ 1 & { - 1} \\ \end{array} } & {\begin{array}{*{20}c} 1 & 1 \\ 1 & { - 1} \\ \end{array} } \\ {\begin{array}{*{20}c} 1 & 1 \\ 1 & { - 1} \\ \end{array} } & {\begin{array}{*{20}c} { - 1} & { - 1} \\ { - 1} & 1 \\ \end{array} } \\ \end{array} } \right] $$

The Walsh-Hadamard transformation generates an orthogonal matrix, it preserves Euclidean distance between the data points. This can be used in Image Processing and Signal Processing.

3.2 First and Second Order Sum and Inner Product Preservation (FISIP)

FISIP is a distance and correlation preservation Transformation [11].

Definition

The matrix representation of a linear transformation can be written as A = [A_i] = [A₁ A₂ … A_k], Additionally, A_i can be written as A_i = [A_im]. Then the transformation is called a FISIP transformation if A has following properties.

a.
$ \mathop \sum \limits_{m = 1}^{k} A_{im} = 1. $
b.
$ \mathop \sum \limits_{m = 1}^{k} A_{im}^{2} = 1. $
c.
$ \mathop \sum \limits_{m = 1}^{k} A_{im} A_{jm} = 0, \,\,for \,i \ne j $

$$ \begin{aligned} A^{\left[ k \right]} & = \left[ {a_{ij} } \right],1 \le j \le k, a_{ii} = \frac{2 - k}{k},a_{ij} = \frac{2}{k} \\ A^{2} & = \left[ {\begin{array}{*{20}c} 0 & 1 \\ 1 & 0 \\ \end{array} } \right],A^{3} = \left[ {\begin{array}{*{20}c} {\frac{ - 1}{3}} & {\frac{2}{3}} & {\frac{2}{3}} \\ {\frac{2}{3}} & {\frac{ - 1}{3}} & {\frac{2}{3}} \\ {\frac{2}{3}} & {\frac{2}{3}} & {\frac{ - 1}{3}} \\ \end{array} } \right] \\ \end{aligned} $$

3.3 Discrete Cosine Transformation (DCT)

DCT works on real numbers and gives following real coefficients:

$$ f_{i} = \left( {\frac{2}{n}} \right)^{{\frac{1}{2}}} \mathop \sum \limits_{k = 0}^{n - 1}\Lambda _{k} x_{k} { \cos }[\left( {2k + 1} \right)i\Pi /2n] $$

where for k = 0 and 1 otherwise. These transforms are unitary and Euclidean distance between two sequences is preserved [12].

3.4 Randomization

Randomization is one of the simple approaches to PPDM. Randomization involves perturbing numerical data. Let X is a confidential attribute Y be a perturbed data [1]. SAN is defined asY = X + e, where e is random value drawn from a distribution with mean value zero and variance 1.

MN is defined as Y = X * e, e is a random value.

4 Proposed Model

This paper proposes a new approach for HPPDM as shown in Fig. 1. In this approach a group of clients select a Trusted Party who has capability to retrieve information from large data. Symmetric cryptography algorithm is used for communication between clients. Suppose client1 wants to collaborate with other client i.e., client2. Client1 sends a request to client2 for approval of collaboration. If client2 sends acceptance response both the clients choose their own transformation/modification techniques to modify the data.

This work focuses on perturbs numeric data. Numerical attributes are considered as a confidential attribute. Different transformations techniques are used to modify original data, transformations techniques discussed in Sect. 3. Both clients modify their original data using transformation techniques then modified data will be sent to the Trusted Party. TP decrypts modified data which is collected from clients and performs knowledge based technique. K-Nearest Neighbor (K-NN) as knowledge based technique.

Theorem

Suppose that T: Rⁿ → Rⁿ is a linear transformation with matrix A, then the linear transformation T preserves scalar products and therefore distance between points/vectors if and only if the associated matrix A is orthogonal.

Here T (X) = T₁(X₁) + T₂(X₂), T₁ and T₂ are linear transformations subsequently; T is also a linear Transformation. Whereas X₁and X₂ are the original data of the respective clients.

Any data like parameters, keys and modified data that needs to be securely shared between the clients and TP must be ensure using Symmetric Cryptography Algorithms.

5 Experimental Work

Experimental work conducted on two real datasets Iris and WDBC collected from [13]. Assume that datasets taken as matrix format, row indicates an object and column indicates an attribute. Divide entire dataset into two parts numerical and categorical attributes. Among that only numerical attributes are considered and it is shared between the clients.

The distribution of the data is done in two methods. Generate a random value using random function which is considered as the percentage of the total records. In method 1: For client1 the data records are sent from record 1 to the percentage of random value. And the remaining records are sent to client2. In method 2: first n records are leaving (n value change as per the dataset size) and then consider the random value to select next number of records to send for client 1. For client2 merge few records from first n records and remaining from the left out records of the whole data set and send them. If any client choose WHT as transformation technique data pre-processing is required if Number of Attributes are less than 2ⁿ, n = 0, 1, 2, 3,… add number of columns to its nearest 2ⁿ (Table 1).

Table 1 Data set description

Full size table

K-NN is used as a classification technique from WEKA Tool [14]. While conducting experiments K value set to 3, 5, and 7. Tenfold cross validation is used when running the K-NN algorithm. In this paper, consider four combinations of linear transformations such as WHT-WHT, WHT-DCT, WHT-FISIP and WHT-SAN. Follow two methods in data distribution, select 4(2 values are below 50 and 2 values are above 50 to 100) random values in each method per a dataset. 35 and 135 records are skipped in IRIS and WDBC datasets respectively as per method 2 discussed above. Classifier results of IRIS data set shown in Tables 2, 3, 4 and 5. Classifier Accuracy of WDBC shown in Tables 6, 7, 8 and 9. IRIS original data gives 96.00 % using K-NN. Modified IRIS data gives acceptable accuracy on k = 7 using all methods. WDBC Original data set gives 97.18 and modified WDBC gives acceptable accuracy in all cases. Calculated different privacy measures VD, RP, CP, RK and CK from [15].

Table 2 Accuracy on Iris data set using combination of WHT-WHT

Full size table

Table 3 Accuracy on Iris data set using combination of WHT-DCT

Full size table

Table 4 Accuracy on Iris data set using combination of WHT-FISIP

Full size table

Table 5 Accuracy on Iris data set using combination of WHT-SAN

Full size table

Table 6 Accuracy on WDBC data set using combination of WHT-WHT

Full size table

Table 7 Accuracy on WDBC data set using combination of WHT-DCT

Full size table

Table 8 Accuracy on WDBC data set using combination of WHT-FISIP

Full size table

Table 9 Accuracy on WDBC data set using combination of WHT-SAN

Full size table

Calculated average values of all data distributions shown in Tables 10 and 11. The higher values of RP and CP and the lower value of RK and CK, and the analysis show more privacy is preserved [15].

Table 10 Privacy measures on IRIS

Full size table

Table 11 Privacy measures on WDBC

Full size table

6 Conclusion and Future Work

This paper proposes a new approach for Horizontal PPDM based on combination of linear transformations. It is a simple and efficient approach to protect privacy of individual customers by inference from the experimental results. This approach will be extended in future to more than two clients and different combinations of transformation techniques and will be applied to vertically partitioned data also.

References

J.F. Traub, Y. Yemini, and H. Wozniakowski, “The StatisticalSecurity of a Statistical Database,” ACM Trans. Database Systems, vol. 9, no. 4, pp. 672–679, 1984.
Google Scholar
C.C. Aggarwal and P.S. Yu, “A Condensation Approach to Privacy Preserving Data Mining,” Proc. Ninth Int’l Conf. Extending Database Technology, pp. 183–199, 2004.
Google Scholar
D. Defays and P. Nanopoulos, “Panels of Enterprises andConfidentiality: The Small Aggregates Method,” Proc. Statistics Canada Symp. 92 Design and Analysis of Longitudinal Surveys, pp. 195–204, 1993.
Google Scholar
J. Domingo-Ferrer and J.M. Mateo-Sanz, “Practical Data-Oriented Microaggregation for Statistical Disclosure Control,” IEEE Trans. Knowledge and Data Eng., vol. 14, no. 1, pp. 189–201, 2002.
Google Scholar
C.C. Yao, “How to generate and Exchange Secrets”, IEEE, 1986.
Google Scholar
M. Kantarcioglu and C. Clifton. “Privately computing a distributed k-nn classifier”. PKDD, v. 3202, LNCS, pp. 279–290, 2004.
Google Scholar
Z. Yang, S. Zhong, R. Wright, “Privacy-preserving Classification of Customer Data without Loss of Accuracy”, In: Proceedings of the Fifth SIAM International Conference on Data Mining, pp. 92–102, NewportBeach, CA, April 21–23, 2005.
Google Scholar
L. Xiong, S. Chitti and L. Liu. k Nearest Neighbor Classification across Multiple Private Databases. CIKM’06, pp. 840–841, Arlington, Virginia, USA, November 5–11, 2006.
Google Scholar
M. Kantarcioglu and J. Vaidya. Privacy preserving naïve Bayes classifierfor horizontally partitioned data. In IEEE ICDM Workshop on Privacy Preserving Data Mining, Melbourne, FL, pp. 3–9, November 2003.
Google Scholar
ZhuojiaXu, Xun Yi, “Classification of Privacy-preserving Distributed Data Mining Protocols”, IEEE, 2011.
Google Scholar
Jen-Wei Huang, Jun-Wei Su and Ming-Syan Chen, “FISIP: A Distance and Correlation Preserving Transformation for Privacy Preserving Data Mining” IEEE, 2011.
Google Scholar
Shibnath Mukharjee, Zhiyuan Chen, Aryya Gangopadhyay, “A Privacy-preserving technique for Euclidean distance-based mining algorithms using Fourier-related transforms”, the VLDB Journal, pp (293–315), 2006.
Google Scholar
http://kdd.ics.uci.edu/.
http://www.wekaito.ac.nz/ml/weka.
ShutingXu, Jun Zhang, Dianwei Han, and Jie Wang, (2005) “Data distortion for privacy protection in a terrorist Analysis system”, P. Kantor et al (Eds.): ISI 2005, LNCS 3495, pp. 459–464.
Google Scholar

Download references

Author information

Authors and Affiliations

Department of Information Technology, CBIT, Hyderabad, T.S, India
Hanumantha Rao Jalla
School of Computer and Information Sciences, UoH, Hyderabad, T.S, India
P. N. Girija

Authors

Hanumantha Rao Jalla
View author publications
You can also search for this author in PubMed Google Scholar
P. N. Girija
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Hanumantha Rao Jalla .

Editor information

Editors and Affiliations

Department of Computer Science and Engineering, Anil Neerukonda Institute of Technology and Sciences, Visakhapatnam, India
Suresh Chandra Satapathy
Kalyani University, Nadia, West Bengal, India
Jyotsna Kumar Mandal
University of Hyderabad, Hyderabad, India
Siba K. Udgata
Department of Electronics and Communication Engineering, Shri Ramswaroop Memorial Group of Professional Colleges, Lucknow, Uttar Pradesh, India
Vikrant Bhateja

Rights and permissions

Reprints and permissions

Copyright information

About this paper

Cite this paper

Jalla, H.R., Girija, P.N. (2016). A Novel Approach for Horizontal Privacy Preserving Data Mining. In: Satapathy, S.C., Mandal, J.K., Udgata, S.K., Bhateja, V. (eds) Information Systems Design and Intelligent Applications. Advances in Intelligent Systems and Computing, vol 434. Springer, New Delhi. https://doi.org/10.1007/978-81-322-2752-6_9

Download citation

DOI: https://doi.org/10.1007/978-81-322-2752-6_9
Published: 03 February 2016
Publisher Name: Springer, New Delhi
Print ISBN: 978-81-322-2750-2
Online ISBN: 978-81-322-2752-6
eBook Packages: EngineeringEngineering (R0)

Publish with us

Policies and ethics

A Novel Approach for Horizontal Privacy Preserving Data Mining

Abstract

Similar content being viewed by others

Data Mining: Min–Max Normalization Based Data Perturbation Technique for Privacy Preservation

A Hybrid Approach for Privacy-Preserving Data Mining

Privacy Preserving Datamining Techniques with Data Security in Data Transformation

Keywords

1 Introduction

2 Related Work