Abstract
Finding the nearest k objects to a query object is a fundamental operation for many data mining algorithms. With the recent interest in privacy, it is not surprising that there is strong interest in k-NN queries to enable clustering, classification and outlier-detection tasks. However, previous approaches to privacy-preserving k-NN have been costly and can only be realistically applied to small data sets. In this paper, we provide efficient solutions for k-NN queries for vertically partitioned data. We provide the first solution for the L ∞ (or Chessboard) metric as well as detailed privacy-preserving computation of all other Minkowski metrics. We enable privacy-preserving L ∞ by providing a practical approach to the Yao’s millionaires problem with more than two parties. This is based on a pragmatic and implementable solution to Yao’s millionaires problem with shares. We also provide privacy-preserving algorithms for combinations of local metrics into a global metric that handles the large dimensionality and diversity of attributes common in vertically partitioned data. To manage very large data sets, we provide a privacy-preserving SASH (a very successful data structure for associative queries in high dimensions). Besides providing a theoretical analysis, we illustrate the efficiency of our approach with an empirical evaluation.
Article PDF
Similar content being viewed by others
Explore related subjects
Discover the latest articles, news and stories from top researchers in related subjects.Avoid common mistakes on your manuscript.
References
Amirbekyan A, Estivill-Castro V (2006) Privacy preserving DBSCAN for vertically partitioned data. In: IEEE international conference on intelligence and security informatics, ISI 2006. Lecture notes in computer science, vol 3975. Springer, San Diego, pp 141–153
Amirbekyan A, Estivill-Castro V (2007) The privacy of k-NN retrieval for horizontal partitioned data—new methods and applications. In: Bailey J, Fekete A (eds) Eighteenth Australasian database conference (ADC2007). Conferences in research and practice in information technology (CRPIT), CORE, Australasian Computer Society, Ballarat, Victoria, Australia, pp 33–42
Ankerst M, Kastenmueller G, Kriegel H, Seidl T (1999) Nearest neighbor classification in 3D protein databases. In: Proceedings of the seventh international conference on intelligent systems for molecular biology (ISMB-99), pp 34–43
Ankerst M, Kriegel H, Seidl T (1998) A multi-step approach for shape similarity in image databases. IEEE Trans Data Eng 10(6): 996–1004
Arya S, Mount DM, Netanyahu NS, Silverman R, Wu AY (1994) An optimal algorithm for approximate nearest neighbor searching. In: 5th annual ACM-SIAM symposium on discrete algorithms, pp 573–582
Avidan S, Butman M (2006) Blind vision. In: Leonardis S, Bischof H, Pinz A (eds) Computer vision—ECCV 2006, 9th European conference on computer vision, Part III’. Lecture notes in computer science, vol 3953. Springer, Graz, pp 1–13
Beimel A, Ishai Y (2005) On the power of nonlinear secret-sharing. SIAM J Discrete Math 19(1): 258–280
Bentley J (1975) Multidimensional binary search trees used for associative retrieval. Commun ACM 18(9): 509–517
Berchtold S, Keim DA, Kriegel H-P (1996) The X-tree: an index structure for higher dimensional data. In: Proceedings of twentieth VLDB conference, pp 28–39
Beyer K, Goldstein J, Ramakrishnan R, Shaft U (1999) When is nearest neighbor meaningful. In: International conference on database theory, Jerusalem, Israel, pp 217–225
Blundo C, Masucci B, Stinson D, Wei R (2002) Constructions and bounds for unconditionally secure non-interactive commitment schemes. Designs Codes Cryptography 26: 97–110
Breunig M, Kriegel H-P, Ng R, Sander J (2000) Lof: Identifying density-based local outliers. In: Chen W, Naughton JF, Bernstein P (eds) Proceedings of the 2000 ACM SIGMOD international conference on management of data. ACM Press, Dallas, pp 93–104
Cachin C (1999) Efficient private bidding and auctions with an oblivious third party. In: Proceedings of the sixth ACM conference on computer and communications security, SIGSAC, ACM Press, Singapore, pp 120–127
Cheng X, Dolin R, Neary M, Prabhakar S, Ravikanth K, Wu D, Agrawal D, El Abbadi E, Freeston M, Singh A, Smith T, Su J (1997) Scalable access within the context of digital libraries. In: Proceedings of the international conference on advances in digital libraries. ADL, Washington, DC, pp 70–81
Ciaccia P, Patella M (2000) PAC nearest neighbor queries: Approximate and controlled search in high-dimensional and metric spaces. In: Proceedings of the international conference on data engineering, San Jose, California, pp 244–255
Damgard I, Pedersen T, Pfitzmann B (1998) Statistical secrecy and multibit commitments. IEEE Trans Inf Theory 44(3): 1143–1151
Du W, Atallah M (2001) Privacy-preserving cooperative statistical analysis. In: Proceedings of the seventeenth annual computer security applications conference (ACSAC), ACM SIGSAC. IEEE Computer Society, New Orleans, pp 102–110
Du W, Han Y-S, Chen S (2004) Privacy-preserving multivariate statistical analysis: linear regression and classification. In: BM W, Dayal U, Kamath C, Skillicorn D (eds) 2004 SIAM international conference on data mining. Lake Buena Vista, Florida, pp 222–233
Du W, Zhan Z (2002) Building decision tree classifier on private data. In: Estivill-Castro V, Clifton C (eds) Privacy, security and data mining, IEEE ICDM workshop proceedings, Conferences in research and practice in information technology series, Vol 14. Australian Computer Society, Sydney, pp 1–8
Ester M, Kriegel H, Sander S, Xu X (1996) A density-based algorithm for discovering clusters in large spatial databases with noise. In: Simoudis E, Han J, Fayyad U (eds) Proceedings of the second international conference on knowledge discovery and data mining (KDD-96), AAAI. AAAI Press, Menlo Park, pp 226–231
Estivill-Castro V (2004) Private representative-based clustering for vertically partitioned data. In: Baeza-Yates R, Marroquin J, Chávez E (eds) Fifth Mexican international conference on computer science (ENC 04), SMCC. IEEE Computer Society Press, Colima, pp 160–167
Estivill-Castro V, Yang J (2002) Clustering Web visitors by fast, robust and convergent algorithms. Int J Found Comput Sci 13(4): 497–520
Fagin R (1996) Combining fuzzy information from multiple systems. In: Proceedings of the fifteenth ACM SIGACT-SIGMOD-SIGART symposium on principles of database systems. ACM Press, Montreal, pp 216–226
Faloutsos C, Ranganathan M, Manolopoulos Y (May 1994) Fast subsequence matching in time-series databases. In: Proceedings of the ACM SIGMOD international conference on management of data, Minneapolis, pp 419–429
Ferhatosmanoglu H, Tuncel E, Agrawal D, El Abbadi A (2002) Approximate nearest neighbor searching in multimedia databases. In: Seventh IEEE international conference on data engineering (ICDE), Germany, pp 503–511
Goethals B, Laur S, Lipmaa H, Mielikäinen (2005) On private scalar product computation for privacy-preserving data mining. In: Park C, Chee S (eds) Information security and cryptology—ICISC 2004. Lecture notes in computer science, vol 3506. Springer, Berlin, pp 104–120
Goldreich O (2004) The foundations of cryptography, vol 2, chapt General cryptographic protocols. Cambridge University Press, London
Goldreich O, Micali S, Wigderson A (1987) How to play any mental game (extended abstract). In: Aho A (eds) Proceedings of the nineteenth ACM annual symposium on theory of computing. ACM Press, New York, pp 218–229
Guttmann A (1984) R-trees: a dynamic index structure for spatial searching. In: Proceedings of ACM SIGMOD international conference on management of data, pp 47–57
Houle M (2003a) Navigating massive data sets via local clustering. In: Getoor L, Senator T, Domingos P, Faloutsos C (eds) Ninth ACM SIGKDD conference on knowledge discovery and data mining. ACM Press, Washington, pp 547–552
Houle M (2003b) SASH: a spatial approximation sample hierarchy for similarity, Technical Report RT-0517, IBM Tokyo Research Laboratory, p 16
Houle M, Sakuma J (2005) Fast approximate similarity search in extremely high-dimensional data sets. In: Twenty first international conference on data engineering ICDE. IEEE Computer Society, Tokyo, pp 619–630
Ioannidis I, Grama A (2003) An efficient protocol for Yao’s millionaires’ problem. In: Thirty sixth Hawaii international conference on system sciences HICSS. IEEE Computer Society, Big Island, p 205
Kahveci T, Singh AK (2001) An efficient index structure for string databases. In: VLDB conference, Rome, Italy, pp 351–360
Kantarcioğlu M, Clifton C (2004) Privately computing a distributed k-nn classifier. In: Boulicaut J-F (eds) Eighth European conference on principles and practice of knowledge discovery in databases PKDD. Lecture notes in computer science, vol 3202. Springer, Berlin, pp 279–290
Kargupta H, Datta S, Wang Q, Sivakumar K (2005) Random-data perturbation techniques and privacy-preserving data mining. Knowl Inf Syst 7(4): 387–414
Katayama N, Satoh S (1997) The SR-tree: an index structure for high-dimensional nearest neighbour queries. In: ACM SIGMOD conference on management of data, Tucson, USA, pp 369–380
Korn F, Sidropoulos N, Faloutsos C, Siegel E, Proropapas Z (1996) Fast nearest neighbor search in medical image databases. In: VLDB, Mumbai, India, pp 215–226
Lin X, Clifton C, Zhu M (2005) Privacy-preserving clustering with distributed EM mixture modeling. Knowl Inf Syst 8(1): 68–81
Malkhi D, Nisan N, Pinkas B, Sella Y (2004) Fairplay—a secure two-party computation system. In: Proceedings of the thirteenth conference on USENIX security symposium, vol 13, USENIX Association, p 20
Mena J (2003) Investigative data mining for security and criminal detection. Butterworth-Heinemann, London
Morimoto Y, Aono M, Houle M, McCurley K (2003) Extracting spatial knowledge from the Web. In: International symposium on applications and the internet (SAINT), Orlando, USA, pp 326–333
Nielsen J, Schwatzbach M (2007) A domain-specific programming language for secure multiparty computation. In: Conference on programming language design and implementation—Proceedings on the 2007 workshop on programming languages and analysis for security, SIGPLAN, ACM Press, New York, pp 21–30
Peng K, Boyd C, Dawson E, Lee B (2005) An efficient and verifiable solution to the millionaire problem. In: Park C, Chee S (eds) Information security and cryptology—ICISC 2004’. Lecture notes in computer science, vol 3506. Springer, Berlin, pp 51–66
Rivest R (1999) Unconditionally secure commitment and oblivious transfer schemes using concealing channels and a trusted initializer (manuscript)
Roussopoulos N, Kelly S, Vincent F (1995) Nearest neighbor queries. In: Proceedings of the ACM SIGMOD international conference on management of data. ACM Press, San Jose, pp 71–79
Shaneck M, Kim Y, Kumar V (2006) Privacy preserving nearest neighbor search. In: ICDMW ’06: Proceedings of the sixth IEEE international conference on data mining—Workshops. IEEE Computer Society, Washington, DC, pp 541–545
Shannon C (1949) Communication theory of secret systems. Bell Systems Technical Journal 28: 656–715
Stinson DR (1995) Cryptography: theory and practice. CRC, Boca Raton
Subrahmanian V (1999) Principles of multimedia database systems. Morgan Kaufmann, San Francisco
Sung S, Liu Y, Xiong H, Ng P (2006) Privacy preservation for data cubes. Knowl Inf Syst 9(1): 38–61
Vaidya J, Clifton CC (2002) Privacy preserving association rule mining in vertically partitioned data. In: The eighth ACM SIGKDD international conference on knowledge discovery and data mining, SIGKDD. ACM Press, Edmonton, pp 639–644
Vaidya J, Clifton CC (2003) Privacy-preserving k-means clustering over vertically partitioned data. In: Proceedings of the SIGKDD-ACM conference of data mining. ACM Press, Washington, DC, pp 206–215
Vaidya J, Clifton C (2004) Privacy-preserving outlier detection. In: Fourth IEEE international conference on data mining (ICDM 2004). IEEE Computer Society, Brighton, pp 233–240
Vaidya J, Clifton C (2005) Privacy-preserving top-k queries. In: Twenty first international conference on data engineering, ICDE 2005. IEEE Computer Society, Tokyo, pp 545–546
Vaidya J, Clifton C, Zhu M (2006) Privacy preserving data mining Advances in information security, vol 19. Springer, New York
Vaidya J, Yu H, Jiang X (2008) Privacy-preserving SVM classification. Knowl Inf Syst 14(2): 161–178
van der Putten P, van Someren M (2000) CoIL challenge 2000: the insurance company case. Technical Report 2000-09, Leiden Institute of Advanced Computer Science, Amsterdam
Wang K, Fung B, Yu P (2007) Handicapping attacker’s confidence: an alternative to k-anonymization. Knowl Inf Syst 11(3): 345–368
Witten I, Frank E (2000) Data mining—practical machine learning tools and technologies with JAVA implementations. Morgan Kaufmann, San Mateo
Wu X, Kumar V, Quinlan J, Ghosh J, Yang Q, Motoda H, McLachlan G, Ng A, Liu B, Yu P, Zhou Z-H, Steinbach M, Hand DJ, Steinberg D (2008) Top 10 algorithms in data mining. Knowl Inf Syst 14(1): 1–37
Xu S, Zhang J, Han D, Wang J (2006) Singular value decomposition based data distortion strategy for privacy protection. Knowl Inf Syst 10(3): 383–397
Yao A (1982) Protocols for secure computation. In: IEEE symposium of foundations of computer science. IEEE Computer Society, pp 160–164
Yu B, Ooi C, Tan K, Jagadish H (2001) Indexing the distance: an efficient method to KNN processing. In: Twenty seventh VLDB conference, Rome, Italy, pp 421–430
Author information
Authors and Affiliations
Corresponding author
Rights and permissions
About this article
Cite this article
Amirbekyan, A., Estivill-Castro, V. Practical protocol for Yao’s millionaires problem enables secure multi-party computation of metrics and efficient privacy-preserving k-NN for large data sets. Knowl Inf Syst 21, 327–363 (2009). https://doi.org/10.1007/s10115-009-0233-z
Received:
Revised:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s10115-009-0233-z