Abstract
Feature selection methods have been applied successfully to text categorization but seldom to text clustering, because class label information is unavailable. This paper proposes a new feature selection method for text clustering based on expectation maximization and cluster validity. The method applies a supervised feature selection method to the intermediate clustering results generated during iterative clustering; meanwhile, the Davies-Bouldin index is used to evaluate the intermediate feature subsets indirectly. Feature subsets are then selected according to the curve of the Davies-Bouldin index. Experiments are carried out on several popular datasets, and the results show the advantages of the proposed method.
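To make the role of the Davies-Bouldin index concrete, the following is a minimal sketch of how a candidate feature subset could be scored against a fixed intermediate clustering. It is not the paper's implementation: the abstract does not give the exact procedure, so the Euclidean distance, the mean centroid, the toy data, and the helper names (centroid, euclid, project, davies_bouldin) are all assumptions made for illustration. Lower index values indicate more compact, better separated clusters.

```python
import math

def centroid(points):
    # Mean vector of a non-empty list of equal-length points.
    n = len(points)
    return [sum(p[d] for p in points) / n for d in range(len(points[0]))]

def euclid(a, b):
    # Euclidean distance (an assumed choice of metric).
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def davies_bouldin(points, labels):
    # Average, over clusters, of the worst-case ratio
    # (scatter_i + scatter_j) / distance(centroid_i, centroid_j).
    # Assumes no two clusters share exactly the same centroid.
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    keys = sorted(clusters)
    if len(keys) < 2:
        raise ValueError("need at least two clusters")
    cents = {k: centroid(clusters[k]) for k in keys}
    scatter = {
        k: sum(euclid(p, cents[k]) for p in clusters[k]) / len(clusters[k])
        for k in keys
    }
    total = 0.0
    for i in keys:
        total += max(
            (scatter[i] + scatter[j]) / euclid(cents[i], cents[j])
            for j in keys if j != i
        )
    return total / len(keys)

def project(points, features):
    # Keep only the selected feature indices of every point.
    return [[p[f] for f in features] for p in points]

# Toy data (hypothetical): feature 0 separates the clusters, feature 2 is noise.
docs = [[0.0, 0.0, 5.0], [0.0, 2.0, 1.0], [10.0, 0.0, 1.0], [10.0, 2.0, 5.0]]
labels = [0, 0, 1, 1]  # stands in for an intermediate clustering result

db_all = davies_bouldin(docs, labels)                   # all three features
db_sub = davies_bouldin(project(docs, [0, 1]), labels)  # noisy feature dropped
```

On this toy data, dropping the noisy third feature lowers the index, which is the kind of signal a curve of index values over candidate subsets would expose. Index values computed in feature spaces of different dimensionality are not strictly comparable, so the comparison here is only illustrative.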
Additional information
Foundation item: Supported by the National Natural Science Foundation of China (60503020, 60373066), the Outstanding Young Scientist’s Fund (60425206), the Natural Science Foundation of Jiangsu Province (BK2005060), and the Opening Foundation of Jiangsu Key Laboratory of Computer Information Processing Technology in Soochow University.
Biography: XU Junling(1984–), male, Ph.D. candidate, research direction: statistical pattern recognition, machine learning and data mining.
About this article
Cite this article
Xu, J., Xu, B., Zhang, W. et al. A new feature selection method for text clustering. Wuhan Univ. J. of Nat. Sci. 12, 912–916 (2007). https://doi.org/10.1007/s11859-007-0040-x
DOI: https://doi.org/10.1007/s11859-007-0040-x