A novel feature selection framework for automatic web page classification

Alamelu Mangai, J.; Santhosh Kumar, V.; Appavu alias Balamurugan, S.

doi:10.1007/s11633-012-0665-x

Regular Papers
Published: 09 August 2012

Volume 9, pages 442–448, (2012)

International Journal of Automation and Computing

J. Alamelu Mangai¹,
V. Santhosh Kumar¹ &
S. Appavu alias Balamurugan²

Abstract

The number of Internet users and the number of web pages added to the World Wide Web increase dramatically every day. Web pages therefore need to be classified into web directories automatically and efficiently, which helps search engines return relevant results quickly. Because web pages are represented by thousands of features, feature selection helps web page classifiers cope with this high dimensionality. This paper proposes a new feature selection method based on Ward’s minimum variance measure. The measure is first used to identify clusters of redundant features in a web page. In each cluster, the best representative features are retained and the others are eliminated. Removing such redundant features reduces resource utilization during classification. The proposed method is compared with other common feature selection methods. Experiments on a benchmark data set, WebKB, show that the proposed method performs better than most of the other feature selection methods in terms of reducing the number of features and the classifier modeling time.
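The idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how clusters are cut or how the "best representative" is chosen, so the `threshold` stopping rule, the closest-to-centroid selection, and all function names below are assumptions made for illustration. The only standard ingredient is Ward's merge cost, the increase in within-cluster sum of squares when clusters A and B are merged: n_A * n_B / (n_A + n_B) * ||c_A - c_B||^2.

```python
from itertools import combinations


def ward_cost(a, b):
    """Increase in within-cluster sum of squares if clusters a and b merge."""
    na, nb = a["n"], b["n"]
    d2 = sum((x - y) ** 2 for x, y in zip(a["centroid"], b["centroid"]))
    return na * nb / (na + nb) * d2


def cluster_features(columns, threshold):
    """Greedy agglomerative clustering of feature columns with Ward's criterion.

    columns: dict mapping feature name -> list of values (one per document).
    threshold: hypothetical stopping rule (an assumption); merging stops once
    the cheapest available merge would cost more than this amount.
    """
    clusters = [{"members": [name], "n": 1, "centroid": list(vals)}
                for name, vals in columns.items()]
    while len(clusters) > 1:
        # Find the pair of clusters whose merge is cheapest.
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda p: ward_cost(clusters[p[0]], clusters[p[1]]))
        a, b = clusters[i], clusters[j]
        if ward_cost(a, b) > threshold:
            break
        n = a["n"] + b["n"]
        merged = {
            "members": a["members"] + b["members"],
            "n": n,
            # Size-weighted mean of the two centroids.
            "centroid": [(a["n"] * x + b["n"] * y) / n
                         for x, y in zip(a["centroid"], b["centroid"])],
        }
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters


def select_representatives(columns, clusters):
    """Keep one feature per cluster: the member closest to the centroid.

    This selection rule is an assumption, not taken from the paper.
    """
    selected = []
    for c in clusters:
        best = min(
            c["members"],
            key=lambda name: sum((x - y) ** 2
                                 for x, y in zip(columns[name], c["centroid"])),
        )
        selected.append(best)
    return sorted(selected)


# Toy term-frequency columns (made-up values): four terms over four pages.
columns = {
    "course": [3, 0, 2, 0],
    "courses": [3, 0, 2, 1],
    "faculty": [0, 4, 0, 3],
    "professor": [0, 4, 1, 3],
}
clusters = cluster_features(columns, threshold=1.0)
selected = select_representatives(columns, clusters)
print(len(columns), "features reduced to", len(selected))
```

In this toy input, near-duplicate terms end up in the same cluster and only one term per cluster is kept. The pairwise search is quadratic per merge, so a real vocabulary of thousands of features would call for an optimized hierarchical clustering routine rather than this loop.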

Author information

Authors and Affiliations

Department of Computer Science, BITS Pilani, Dubai Campus, DIAC, Dubai, 345055, UAE
J. Alamelu Mangai & V. Santhosh Kumar
Department of Information Technology, Thiagarajar College of Engineering, Madurai, 625015, India
S. Appavu alias Balamurugan

Corresponding author

Correspondence to J. Alamelu Mangai.

Additional information

J. Alamelu Mangai graduated from Annamalai University, India, in 2005. She is a Ph.D. candidate at BITS Pilani, Dubai Campus, UAE, where she has been working as a senior lecturer in the Department of Computer Science.

Her research interests include data mining algorithms, text mining, and web mining.

V. Santhosh Kumar received his Ph.D. degree from the Indian Institute of Science, Bangalore, India. He is currently working as an assistant professor at BITS Pilani, Dubai Campus, UAE.

His research interests include data mining and performance evaluation of computer systems.

S. Appavu alias Balamurugan received his Ph.D. degree from Anna University Chennai, Chennai, India. He is currently working as an assistant professor in the Department of Information Technology at Thiagarajar College of Engineering, Madurai, India.

His research interests include pattern recognition, data mining and informatics.

About this article

Cite this article

Alamelu Mangai, J., Santhosh Kumar, V. & Appavu alias Balamurugan, S. A novel feature selection framework for automatic web page classification. Int. J. Autom. Comput. 9, 442–448 (2012). https://doi.org/10.1007/s11633-012-0665-x

Received: 02 January 2012
Revised: 08 February 2012
Published: 09 August 2012
Issue Date: August 2012
DOI: https://doi.org/10.1007/s11633-012-0665-x
