Abstract
Spectral clustering algorithms recently gained much interest in research community. This surge in interest is mainly due to their ease of use, their applicability to a variety of data types and domains as well as the fact that they very often outperform traditional clustering algorithms. These algorithms consider the pair-wise similarity between data objects and construct a similarity matrix to group data into natural subsets, so that the objects located in the same cluster share many common characteristics. Objects are then allocated into clusters by employing a proximity measure, which is used to compute the similarity or distance between the data objects in the matrix. As such, an early and fundamental step in spectral cluster analysis is the selection of a proximity measure. This choice also has the highest impact on the quality and usability of the end result. However, this crucial aspect is frequently overlooked. For instance, most prior studies use the Euclidean distance measure without explicitly stating the consequences of selecting such measure. To address this issue, we perform a comparative and explorative study on the performance of various existing proximity measures when applied to spectral clustering algorithm. Our results indicate that the commonly used Euclidean distance measure is not always suitable, specifically in domains where the data is highly imbalanced and the correct clustering of boundary objects are critical. Moreover, we also noticed that for numeric data type, the relative distance measures outperformed the absolute distance measures and therefore, may boost the performance of a clustering algorithm if used. As for the datasets with mixed variables, the selection of distance measure for numeric variable again has the highest impact on the end result.
Access provided by Autonomous University of Puebla. Download to read the full chapter text
Chapter PDF
Similar content being viewed by others
References
Luxburg, U.: A Tutorial on Spectral Clustering. Statistics and Computing 17(4), 395–416 (2007)
Shi, J., Malik, J.: Normalized Cuts and Image Segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence 22(8), 888–905 (2000)
Bach, F.R., Jordan, M.I.: Learning Spectral Clustering, with Application to Speech Separation. J. Mach. Learn. Res. 7, 1963–2001 (2006)
Paccanaro, A., Casbon, J.A., Saqi, M.A.: Spectral Clustering of Protein Sequences. Nucleic Acids Res. 34(5), 1571–1580 (2006)
Ng, A.Y., Jordan, M.I., Weiss, Y.: On Spectral Clustering: Analysis and an Algorithm. In: Dietterich, T.G., Ghahramani, S.B. (eds.) Advances in Neural Information Processing Systems, vol. 14, pp. 849–856 (2001)
Verma, D., Meila, M.: A Comparison of Spectral Clustering Algorithms (2001)
Jain, A.K., Murty, M.N., Flynn, P.J.: Data Clustering: a Review. ACM Computing Surveys 31(3), 264–323 (1999)
Bach, F.R., Jordan, M.I.: Learning Spectral Clustering. In: Advances in Neural Information Processing Systems 16: Proceedings of the 2003 Conference, pp. 305–312 (2003)
Everitt, B.S.: Cluster Analysis, 2nd edn. Edward Arnold and Halsted Press (1980)
Kaufman, L., Rousseeuw, P.: Finding Groups in Data: An Introduction to Cluster Analysis. Wiley-Interscience (2005)
Meila, M., Shi, J.: A Random Walks View of Spectral Segmentation. In: International Conference on Artificial Intelligence and Statistics (AISTAT), pp. 8–11 (2001)
Webb, A.R.: Statistical Pattern Recognition, 2nd edn. John Wiley & Sons (2002)
Larose, D.T.: Discovering Knowledge in Data: An Introduction to Data Mining. Wiley-Interscience (2004)
Han, J., Kamber, M.: Data Mining: Concepts and Techniques, 2nd edn. Morgan Kaufmann Publishers Inc., San Francisco (2006)
Costa, I.G., de Carvalho, F.A.T., de Souto, M.C.P.: Comparative Study on Proximity Indices for Cluster Analysis of Gene Expression Time Series. Journal of Intelligent and Fuzzy Systems: Applications in Engineering and Technology 13(2-4), 133–142 (2002)
Steinbach, M., Karypis, G., Kumar, V.: A Comparison of Document Clustering Techniques. In: KDD Workshop on Text Mining (2000)
Kubat, M., Holte, R.C., Matwin, S.: Machine Learning for the Detection of Oil Spills in Satellite Radar Images. Machine Learning 30(2-3), 195–215 (1998)
Witten, I.H., Frank, E.: Data Mining: Practical Machine Learning Tools and Techniques, 2nd edn. Morgan Kaufmann (2005)
Japkowicz, N., Shah, M.: Performance Evaluation for Classification A Machine Learning and Data Mining Perspective (in progress): Chapter 6: Statistical Significance Testing (2011)
Heinz, G., Peterson, L.J., Johnson, R.W., Kerk, C.J.: Exploring Relationships in Body Dimensions. Journal of Statistics Education 11(2) (2003)
Asuncion, A., Newman, D.: UCI Machine Learning Repository (2007)
Lee, S.-W., Verri, A. (eds.): SVM 2002. LNCS, vol. 2388. Springer, Heidelberg (2002)
Abou-Moustafa, K.T., Ferrie, F.P.: The Minimum Volume Ellipsoid Metric. In: Hamprecht, F.A., Schnörr, C., Jähne, B. (eds.) DAGM 2007. LNCS, vol. 4713, pp. 335–344. Springer, Heidelberg (2007)
Filzmoser, P., Garrett, R., Reimann, C.: Multivariate Outlier Detection in Exploration Geochemistry. Computers and Geosciences 31(5), 579–587 (2005)
Aiello, M., Andreozzi, F., Catanzariti, E., Isgro, F., Santoro, M.: Fast Convergence for Spectral Clustering. In: ICIAP 2007: Proceedings of the 14th International Conference on Image Analysis and Processing, pp. 641–646. IEEE Computer Society, Washington, DC (2007)
Fischer, I., Poland, J.: New Methods for Spectral Clustering. Technical Report IDSIA-12-04, IDSIA (2004)
Teknomo, K.: Similarity Measurement, http://people.revoledu.com/kardi/tutorial/Similarity/
Boslaugh, S., Watters, P.A.: Statistics in a Nutshell. O.Reilly & Associates, Inc., Sebastopol (2008)
Author information
Authors and Affiliations
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2013 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Azam, N.F., Viktor, H.L. (2013). Spectral Clustering: An Explorative Study of Proximity Measures. In: Fred, A., Dietz, J.L.G., Liu, K., Filipe, J. (eds) Knowledge Discovery, Knowledge Engineering and Knowledge Management. IC3K 2011. Communications in Computer and Information Science, vol 348. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-37186-8_4
Download citation
DOI: https://doi.org/10.1007/978-3-642-37186-8_4
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-37185-1
Online ISBN: 978-3-642-37186-8
eBook Packages: Computer ScienceComputer Science (R0)