Abstract
The number of Internet users and the number of pages added to the World Wide Web increase dramatically every day. Web pages therefore need to be classified into web directories automatically and efficiently, which helps search engines return relevant results quickly. Because a web page is represented by thousands of features, feature selection helps web page classifiers cope with this high dimensionality. This paper proposes a new feature selection method based on Ward’s minimum variance measure. The measure is first used to identify clusters of redundant features in a web page. In each cluster, the best representative features are retained and the others are eliminated. Removing such redundant features reduces resource utilization during classification. The proposed method is compared with other common feature selection methods. Experiments on a benchmark data set, WebKB, show that the proposed method performs better than most of the other feature selection methods in terms of reducing the number of features and the classifier modeling time.
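The clustering step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes each feature is a column of term counts over pages, merges features greedily by Ward's merge cost (the increase in within-cluster sum of squares), stops at a hypothetical threshold `max_cost`, and keeps the member closest to each cluster centroid as the representative. The stopping rule, the representative criterion, and the toy data are all assumptions made for illustration.

```python
from itertools import combinations


def sq_dist(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(u, v))


def ward_cost(a, b):
    """Increase in within-cluster sum of squares if clusters a and b merge."""
    na, nb = len(a["members"]), len(b["members"])
    return na * nb / (na + nb) * sq_dist(a["centroid"], b["centroid"])


def merge(a, b):
    """Merge two clusters; the new centroid is the size-weighted mean."""
    na, nb = len(a["members"]), len(b["members"])
    centroid = [(na * x + nb * y) / (na + nb)
                for x, y in zip(a["centroid"], b["centroid"])]
    return {"members": a["members"] + b["members"], "centroid": centroid}


def select_features(columns, max_cost):
    """Cluster redundant features and keep one representative per cluster.

    columns:  dict mapping feature name -> list of values, one per page.
    max_cost: hypothetical threshold; merging stops once the cheapest
              merge would cost more than this (an assumption of this sketch).
    """
    clusters = [{"members": [name], "centroid": list(vec)}
                for name, vec in columns.items()]
    while len(clusters) > 1:
        # Find the pair of clusters that is cheapest to merge.
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda p: ward_cost(clusters[p[0]], clusters[p[1]]))
        if ward_cost(clusters[i], clusters[j]) > max_cost:
            break
        merged = merge(clusters[i], clusters[j])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    # Assumed criterion: keep the member nearest the cluster centroid
    # (ties keep the member listed first).
    selected = [min(c["members"],
                    key=lambda name: sq_dist(columns[name], c["centroid"]))
                for c in clusters]
    return sorted(selected)


# Toy term counts over four pages (made-up data for illustration only).
columns = {
    "course": [3, 0, 2, 0],
    "courses": [3, 0, 2, 1],
    "faculty": [0, 4, 0, 3],
    "professor": [0, 4, 1, 3],
}
selected = select_features(columns, max_cost=1.0)
print(selected)
```

With this toy input, near-duplicate features such as "course" and "courses" fall into the same cluster, so only one of them is passed on to the classifier. A real system would choose the threshold and the representative criterion empirically.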
Author information
J. Alamelu Mangai graduated from Annamalai University, India in 2005. She is a Ph.D. candidate at BITS Pilani, Dubai Campus, UAE, where she works as a senior lecturer in the Department of Computer Science.
Her research interests include data mining algorithms, text and web mining.
V. Santhosh Kumar received his Ph.D. degree from the Indian Institute of Science, Bangalore, India. He is currently working as an assistant professor at BITS Pilani, Dubai Campus, UAE.
His research interests include data mining and performance evaluation of computer systems.
S. Appavu alias Balamurugan received his Ph.D. degree from Anna University Chennai, Chennai, India. He is currently working as an assistant professor in the Department of Information Technology at Thiagarajar College of Engineering, Madurai, India.
His research interests include pattern recognition, data mining and informatics.
About this article
Cite this article
Alamelu Mangai, J., Santhosh Kumar, V. & Appavu alias Balamurugan, S. A novel feature selection framework for automatic web page classification. Int. J. Autom. Comput. 9, 442–448 (2012). https://doi.org/10.1007/s11633-012-0665-x
DOI: https://doi.org/10.1007/s11633-012-0665-x