Abstract
Feature selection for unsupervised learning is generally harder than for supervised learning because the former lacks the class information of the latter, and thus an obvious way to measure the quality of a feature subset. In this paper, we propose a new method based on representing data sets by their distance matrices and judging feature combinations by how well the distance matrix computed from only those features resembles the distance matrix of the full data set. Using artificial data for which the relevant features were known, we observed that the results depend on the data dimensionality, the fraction of relevant features, the overlap between clusters in the relevant feature subspaces, and on how the similarity of distance matrices is measured. Our method consistently achieved detection rates of relevant features above 80% for a wide variety of experimental configurations.
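The core idea described above can be sketched as follows. This is a minimal illustration, not the paper's exact implementation: it assumes Euclidean distances and uses Pearson correlation of the upper-triangular distance entries as one possible similarity measure between distance matrices (the paper compares several such measures).

```python
import numpy as np

def distance_matrix(X):
    # Pairwise Euclidean distances between all rows of X.
    diff = X[:, None, :] - X[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def subset_score(X, features):
    """Similarity between the distance matrix of a feature subset and
    that of the full data set, measured here by Pearson correlation of
    the upper-triangular entries (one of several possible choices)."""
    full = distance_matrix(X)
    sub = distance_matrix(X[:, features])
    iu = np.triu_indices(len(X), k=1)
    return np.corrcoef(full[iu], sub[iu])[0, 1]

# Toy data: feature 0 carries the two-cluster structure, feature 1 is noise.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.1, (20, 2)) + [0, 0],
               rng.normal(0, 0.1, (20, 2)) + [5, 0]])
print(subset_score(X, [0]))  # high: feature 0 preserves the distance structure
print(subset_score(X, [1]))  # low: the noise feature does not
```

A feature-selection wrapper would then search over subsets and keep those whose score is highest, with ties broken toward smaller subsets.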
Copyright information
© 2013 Springer-Verlag Berlin Heidelberg
Cite this paper
Dreiseitl, S. (2013). Feature Selection for Unsupervised Learning via Comparison of Distance Matrices. In: Moreno-Díaz, R., Pichler, F., Quesada-Arencibia, A. (eds) Computer Aided Systems Theory - EUROCAST 2013. EUROCAST 2013. Lecture Notes in Computer Science, vol 8111. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-53856-8_26
Print ISBN: 978-3-642-53855-1
Online ISBN: 978-3-642-53856-8