Abstract
Feature selection is an important technique for improving both the efficiency and the effectiveness of high-dimensional data clustering. However, most feature selection methods require prior knowledge, such as class-label information, to train the clustering module, whose performance then depends on the training data and the type of learning machine. This paper presents a feature selection algorithm that does not require supervised feature assessment. We analyze the relevance and redundancy among features, and their effectiveness for each target class, to build a correlation-based filter. Compared with feature sets selected by existing methods, the experimental results show that a feature set selected by the proposed method performs comparably on the RCV1v2 corpus and better on the Isolet data set. Moreover, our technique is simpler and faster, and it is independent of the type of learning machine.
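The general idea of a label-free relevance-and-redundancy filter can be illustrated with a minimal sketch. This is not the algorithm proposed in the paper, whose details the abstract does not give. It assumes variance as a stand-in relevance score, absolute Pearson correlation as the redundancy measure, and a hypothetical threshold of 0.9; the function names are illustrative only.

```python
from math import sqrt


def variance(x):
    """Population variance of a feature column."""
    m = sum(x) / len(x)
    return sum((a - m) ** 2 for a in x) / len(x)


def pearson(x, y):
    """Pearson correlation of two equal-length columns (0.0 if either is constant)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)


def select_features(columns, redundancy_threshold=0.9):
    """Greedy unsupervised filter (illustrative assumption, not the paper's method).

    columns: list of feature columns, one list of values per feature.
    Returns the indices of kept features, ordered by decreasing variance.
    """
    # Rank features by the assumed relevance proxy (variance); no labels are used.
    ranked = sorted(range(len(columns)),
                    key=lambda j: variance(columns[j]), reverse=True)
    kept = []
    for j in ranked:
        if variance(columns[j]) == 0:
            continue  # a constant feature carries no information
        # Keep the feature only if it is not strongly correlated with any kept one.
        if all(abs(pearson(columns[j], columns[k])) < redundancy_threshold
               for k in kept):
            kept.append(j)
    return kept


columns = [
    [1.0, 2.0, 3.0, 4.0],  # feature 0
    [2.0, 4.0, 6.0, 8.0],  # feature 1: exactly twice feature 0, so redundant with it
    [1.0, 0.0, 1.0, 0.0],  # feature 2: weakly correlated with the others
    [5.0, 5.0, 5.0, 5.0],  # feature 3: constant
]
selected = select_features(columns)
print(selected)  # [1, 2]
```

In this toy input, feature 0 is dropped because it is perfectly correlated with the higher-variance feature 1, and feature 3 is dropped because it is constant. A filter of this kind never trains a learning machine, which is the property the abstract emphasizes.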
Copyright information
© 2014 Springer Science+Business Media Singapore
About this paper
Cite this paper
Pramokchon, P., Piamsa-nga, P. (2014). An Unsupervised, Fast Correlation-Based Filter for Feature Selection for Data Clustering. In: Herawan, T., Deris, M., Abawajy, J. (eds) Proceedings of the First International Conference on Advanced Data and Information Engineering (DaEng-2013). Lecture Notes in Electrical Engineering, vol 285. Springer, Singapore. https://doi.org/10.1007/978-981-4585-18-7_10
DOI: https://doi.org/10.1007/978-981-4585-18-7_10
Publisher Name: Springer, Singapore
Print ISBN: 978-981-4585-17-0
Online ISBN: 978-981-4585-18-7
eBook Packages: Engineering (R0)