Abstract
Recent advancements in data mining have given rise to a new channel of research, coined as privacy-preserving data mining (PPDM). PPDM technology allows us to derive useful information from vast amounts of data while protecting privacy of individual records. This paper proposed a methodology based on the machine learning algorithm called singular value decomposition (SVD) and 3D rotation data perturbation (RDP) for preserving privacy of data. Decomposition and dimensionality reduction helps to eliminate sensitive information, and perturbed matrix is generated. The original and perturbed data are classified using different classifiers, and the performance is measured in terms of accuracy rate. Accuracy is the degree of correlation between the absolute observation and the actual observations. Experimental results revealed that the proposed scheme outperforms by achieving excellent accuracy for matrices of different sizes.
Similar content being viewed by others
Explore related subjects
Discover the latest articles, news and stories from top researchers in related subjects.Avoid common mistakes on your manuscript.
1 Introduction
Recent advancements in information science have led to the digitalized data collection and storage on a large scale. The data collection can be done using mobile phones, computers, satellite reports, etc. These data include highly sensitive personal data such as medical reports of an individual, credit card information, shopping preferences, and others. The collected sensitive data are uploaded in centralized locations for knowledge extraction. Since the 1990s, Knowledge Discovery in Databases (KDD) has got special attention for identifying hidden patterns, predicting the latest trends from large volumes of data. This pattern prediction is very helpful for the applications in areas such as trading, e-commerce, the medical field in which strategic decisiveness and business refinement is made through identified patterns. Despite its usefulness, knowledge extraction has been viewed as a new threat to an individual’s privacy and information security. Preserving privacy is a challenging problem if the collected data are published for mining without hiding sensitive information [1,2,3,4].
Privacy-preserving data mining (PPDM) is a novel research field that has become more popular because it ensures data hiding without sacrificing data originality before publishing. In this paper, we present an enhanced singular value decomposition-based data transformation method to protect privacy of individuals in a centralized environment. Classification is done using different classifiers on both original and perturbed data. The experimental results show that the method proposed is effective in managing the privacy and utility of data [5,6,7,8,9].
The rest of this article is structured as follows. The related work is presented in Sect. 2. Section 3 offers our proposed algorithm details. Section 4 displays the outcomes of our experiments. The conclusion is presented in Sect. 5.
2 Literature survey
Shuguo Han et al. introduced secure protocols based on SVD between two parties for both horizontally and vertically partitioned data. Secure matrix multiplication is a method proposed secure building block for performing computations based on QR algorithm. Privacy-preserving SVD is derived, and the experimental results show that this method achieves high accuracy for small and medium sized matrices. A new PPDM approach based on both SVD and ICA is proposed by Guang Li et al. SVD is used to identify and retain relevant data for data mining. Data distortion is achieved by removing irrelevant data from the database. The data analysis done by ICA is different from SVD. Both the methods analyze and identify various types of relevant data for mining from various perspectives. The proposed method is the combination of SVD and ICA that processes the data to identify and remove irrelevant data. The conducted experiments confirm that the method proposed in this article is better than other methods [10,11,12,13].
Guang Li et al. suggested a new privacy-preserving classification method by combining an improved SVD and WCNN sample selection algorithm. The data samples are classified as relevant and irrelevant by using WCNN algorithm. Then the relevant samples are perturbed as low-degree instances by using SVD. The conducted experiments indicate that the suggested method is more accurate in managing data utility and privacy of data. WCNN algorithm can be used only for classification in data mining. In the future research, the same scheme can be executed in other contexts of data mining.
Lakshmi and Naga et al. presented two hybrid methods to conceal sensitive information present in the database by applying SVD, data rotation perturbation and independent component analysis. SVD can effectively classify information that is not necessary for mining. ICA will reveal relevant information. The perturbation of rotation will maintain dataset’s statistical properties. The presented hybrid methods are tested successfully on real time data sets from UCI. The test results have shown that the presented methods satisfy both data security and reliability of the exploration of information than SVD data perturbation. Shuting Xu et al. proposed sparsified SVD method as a streamlined model to examine terrorist analysis system for data distortion. The proposed method is compared with four data distortion methods, and the experimental results show that SSVD is the best suited for preserving privacy of data. This method is also efficient in maintaining utility of data. SVD and SSVD are stronger than the traditional data distortion methods that introduce noise directly to the database. In data mining, SVD-based approaches can be used for data manipulation purpose in order to protect privacy of sensitive information stored and accessible from the original datasets. Further work may be conducted to discover the associations between the data manipulation measures suggested here and other privacy measures. Hasan et al. employed a hybrid approach with sparsified singular value decomposition (SSVD) and nonnegative matrix factorization (NMF). The privacy of the dataset is enhanced by destroying sensitive information through matrix decomposition and keeping the usefulness of relevant data. This approach guarantees better privacy and significant accuracy. The downside of this approach is that more time is taken for execution and only binary datasets are considered for evaluation. Privacy preserving using multi-class datasets by maintaining high utility of data is an interesting addition to this work. At any point of the data mining process, preservation of privacy is an essential consideration. A. Afrinet al. drafted a method with the help of NMF and SVD that is undermined by real-world datasets. With relation to the query accuracy, the value of the distorted datasets is evaluated. Accuracy of the query analysis is not effective because it gives an error of over 50%. The test accuracy informs that improvement is needed in overall process by adding more efficient methods for privacy preserving and data utility maintenance [14,15,16,17].
3 Proposed work
3.1 Singular value decomposition
SVD is a simple matrix approximation and dimensionality reduction technique in machine learning. This is one of the well-used dynamic and general purpose tool in linear algebra for data processing. Nowadays, it is used everywhere. Google is using SVD for performing ranking of pages; Facebook uses it for facio recognition to identify the human faces from the image. It is also used by the recommender systems like Amazon and Netflix to identify the correlation patterns like what kind of people likes what kind of products.
Let C be the m × n original matrix where m rows represent data objects and n columns correspond to attributes. In SVD, a matrix is divided into three parts as follows to perform large computations.
Data dimensionality reduction approach sparsified singular value decomposition (SSVD) is used to identify and remove trivial features in \(U_{{\text{k}}}\) and \(V_{{\text{k}}}^{{\text{T}}}\) to improve privacy protection. The obtained matrix can be used directly for mining without any modification. If the elements threshold value is lower that element values can be set to zero. Then the equation would then look as follows:
First eigenvalues are ordered from small to large to determine the number of key components. The diagram shows a cumulative sum from 1 to k against k for the singular values (Fig. 1).
This matrix decomposition helps us to articulate it as a linear mixture of low-rank matrices. The matrix rank is a unique data present in a matrix. If the rank is high, then more information will be stored [9].
U and V are two orthogonal matrices and ∑ is a rectangular diagonal matrix with n eigenvalues of C that are arranged in decreasing order. U is an m × m matrix, the columns of U are linearly independent eigenvectors of C, and it also depends on eigenvalues of ∑. Eigenvectors provide data distribution path or data variance. In the ∑ matrix, the diagonal values are defined as the singular values of the original matrix C. The U matrix columns are considered as the left-singular C vector, and the V matrix columns are considered as the right-singular C vectors [10]. The useful feature of SVD is that the original data can be re-created by multiplying two matrices with the vector.
3.2 Rotation data perturbation
The transformation using rotation data perturbation (RDP) is the rotation about a point present in the coordinate axis into different axes without altering the metrics. Here the noise terminology is an angle \(\left( \theta \right).\) The transformation matrix in Eq. (3) is used to rotate the point in two-dimensional discrete angle spaces [11]. The angle of rotation \(\theta\) is measured in clockwise, and the values of X and Y coordinates are affected by that transformation. The rotation matrix is represented as Rdxd. Geometric transformation of data C is usually represented as a function g(C) or C′. The perturbed data are achieved by multiplying the rotation matrix with the original data X and transpose matrix column vectors in C, g(X) = RXC. The geometrically perturbed data will be generated from the real datasets. Predicting original data from this perturbed data is difficult, and it can be published for further processing.
In our paper, the 3D rotational transformation is applied to rotate the data into different degrees. Three-dimensional rotation or higher affects two attributes at a time, while others remain unchanged. In 3D RDP, there are 3 orientations X, Y and Z and the axis pairs used for rotation are XY, YZ and ZX. Rotation operation is applied more than once until all the sensitive attributes are transformed to preserve privacy. For every set of attributes, the rotation angle will be selected and rotated [11, 12]. The rotation matrix R in 3D transformation is represented as follows:
3.3 Proposed method algorithm
-
1.
The original matrix C, the number of records in the dataset m and the number of attributes n will be considered for perturbation.
-
2.
Do SVD and calculate Ck′ = Uk∑kVkT, for each pair of k attributes in the dataset.
$$U_{{\text{k}}} = U\left[ {1:k} \right]$$$${\Sigma }_{{\text{k}}} = {\Sigma }\left[ {1:k, 1:k} \right]$$$$V_{{\text{k}}}^{{\text{T}}} = V^{{\text{T}}} \left[ {1:k} \right]$$This decomposition is very useful that presents a large amount of information about C, which includes data range, null values, rank, etc.
-
3.
The features which are having smaller values than the threshold will be set to 0.
-
4.
Axes pairs will be selected for rotation, and then rotation matrix will be calculated. After that, rotation will be performed in three-dimensional planes into different axes of rotation. The rotation angle \(\theta\) will be calculated. The resultant rotated matrix C′ is the perturbed matrix.
-
5.
The perturbed matrix will be given for classification.
4 Experimental results
The proposed algorithm is deployed using R programming by conducting investigations with two real-life datasets from UCI machine learning repository. The datasets adult and heart disease are considered for evaluation. The adult dataset consists of 14 attributes and 48,842 instances, and heart disease dataset consists of 14 attributes and 303 instances. The performance of the classifiers is visualized and measured by using quantitative method called error matrix. Each row of the matrix represents actual class instances, and each column represents the instances of predicted class. From error matrix, the number of false positives, false negatives, true positives and true negatives are taken and the accuracy of different classifiers is given. The sensitivity or true positive rate (TPR), specificity or true negative rate (TNR), false positive rate (FPR) and false negative rate (FNR) are some of the metrics can be measured using these values.
The following table shows the correspondence between the original data and the classification result. The result contains two matrices. The counts of records classified correctly and incorrectly are given by first matrix. The percentage of records classified correctly and incorrectly is given by second matrix. It is also very important to check whether the positive and negative cases are classified correctly (Figs. 2, 3, 4, 5, 6 and 7) (Tables 1 and 2).
5 Conclusion and future enhancement
The main aim of this paper is to maintain privacy and considerable data utility while preserving privacy of datasets. Initially, SVD is applied on original data for perturbation that extracts sensitive attributes by matrix decomposition and eliminates irrelevant information. Then 3D RDP is applied repetitively to ensure that all the sensitive features are skewed to maintain privacy along different axes. The perturbed data and the original data are classified using different classifiers, and performance is measured as percentage of accuracy. Accuracy is measured as a ratio of properly classified instances and complete number of instances present in the dataset. In comparison with the existing methods, the test results show that our proposed method balances the data privacy and utility more effectively. In the future, different perturbation techniques can be used to improve the classification accuracy further.
Change history
13 February 2024
This article has been retracted. Please see the Retraction Notice for more detail: https://doi.org/10.1007/s11227-024-05968-3
References
Agrawal R, Srikant R (2000) Privacy-preserving data mining. In: SIGMOD '00: Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, pp 439–450. https://doi.org/10.1145/342009.335438
Han S, Ng WK, Philip SY (2009) Privacy-preserving singular value decomposition. International conference on data engineering, IEEE
Li G, Wang Y (2011) A privacy-preserving data mining method based on singular value decomposition and independent component analysis. Data Sci J 9:124–132
Parthasarathy P, Vivekanandan S (2018) A numerical modelling of an amperometric-enzymatic based uric acid biosensor for GOUT arthritis diseases. Inform Med Unlocked 12:143–147
Li G, Wang Y (2012) A privacy-preserving classification method based on singular value decomposition. Int Arab J Inform Technol 9(6):529–534
Lakshmi MN, Rani KS (2013) SVD based data transformation methods for privacy preserving clustering. Int J Comput Appl 78(3)
Mathan K, Kumar PM, Panchatcharam P, Manogaran G, Varadharajan R (2018) A novel Gini index decision tree data mining method with neural network classifiers for prediction of heart disease. Des Autom Embed Syst 22(3):225–242
Xu S, Zhang J, Han D, Wang J (2006) Singular value decomposition based data distortion strategy for privacy protection. Knowl Inf Syst 10(3):383–397. https://doi.org/10.1007/s10115-006-0001-2
Parthasarathy P, Vivekanandan S (2020) A typical IoT architecture-based regular monitoring of arthritis disease using time wrapping algorithm. Int J Comput Appl 42(3):222–232
Hasan MM, Hossain S, Paul MK, Sattar AS (2019) A new hybrid approach for privacy preserving data mining using matrix decomposition technique. In: 2019 4th international conference on electrical information and communication technology (EICT). Khulna, Bangladesh
Parthasarathy P, Vivekanandan S (2020) Biocompatible TiO2-CeO2 nano-composite synthesis, characterization and analysis on electrochemical performance for uric acid determination. Ain Shams Eng J 11(3):777–785
Afrin A, Paul MK, Sattar AHMS (2019) Privacy preserving data mining using non-negative matrix factorization and singular value decomposition. In: 2019 4th international conference on electrical information and communication technology (EICT). Khulna, Bangladesh
Parthasarathy P, Vivekanandan S (2020) An extensive study on the COVID-19 pandemic, an emerging global crisis: risks, transmission, impacts and mitigation. J Infect Public Health
Oliveira SRM, Zaiane OR (2010) Privacy preserving clustering by data transformation. J Inf Data Manage 1(1):37–37
Panchatcharam P, Vivekanandan S (2019) Internet of things (IOT) in healthcare–smart health and surveillance, architectures, security analysis and data transfer: a review. Int J Softw Innov (IJSI) 7(2):21–40
Upadhyay S et al (2018) Privacy preserving data mining with 3-D rotation transformation. J King Saud Univ-Comput Inf Sci 30(4):524–530
Varadharajan R, Priyan MK, Panchatcharam P, Vivekanandan S, Gunasekaran M (2018) A new approach for prediction of lung carcinoma using back propogation neural network with decision tree classifiers. J Ambient Intell Humaniz Comput 1–12
Author information
Authors and Affiliations
Corresponding author
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
This article has been retracted. Please see the retraction notice for more detail: https://doi.org/10.1007/s11227-024-05968-3"
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Kousika, N., Premalatha, K. RETRACTED ARTICLE: An improved privacy-preserving data mining technique using singular value decomposition with three-dimensional rotation data perturbation. J Supercomput 77, 10003–10011 (2021). https://doi.org/10.1007/s11227-021-03643-5
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s11227-021-03643-5