Abstract
In this paper, we propose TFIGF, a method that detects peculiar web pages for a given set of keywords by using the distribution of words across the WWW. TFIGF selects a set of index words that represent a web page by estimating each word's importance within the page and its rareness in the WWW. Experiments on both English and Japanese web pages show that our approach outperforms a traditional method that uses only a limited number of web pages in the estimation.
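The abstract does not give the TFIGF formula, so the following is only a minimal sketch of the general idea it describes: weight a word by its importance within a page and by its rareness across the web. It assumes a classical TF-IDF style product with a logarithmic rareness term, which may differ from the authors' actual definition. The function names, the `global_hits` mapping (word to number of web pages containing it, for example as reported by a search engine) and `total_pages` are all hypothetical.

```python
import math
from collections import Counter


def rareness_weighted_scores(page_words, global_hits, total_pages):
    """Score each word by in-page frequency times log-scaled global rareness.

    Assumption: rareness = log(total_pages / hits), borrowed from TF-IDF.
    This is NOT taken from the paper, which may define the terms differently.
    """
    counts = Counter(page_words)
    n_words = len(page_words)
    scores = {}
    for word, count in counts.items():
        # Importance within the page: relative term frequency.
        tf = count / n_words
        # Words missing from global_hits are treated as having one hit;
        # clamping to 1 also keeps the logarithm defined.
        hits = max(1, global_hits.get(word, 1))
        rareness = math.log(total_pages / hits)
        scores[word] = tf * rareness
    return scores


def top_index_words(scores, k):
    """Return the k highest-scoring words, breaking ties alphabetically."""
    return sorted(scores, key=lambda w: (-scores[w], w))[:k]


# Hypothetical data: a tiny page and made-up global page counts.
page = ["the", "zeppelin", "the", "engine", "zeppelin", "zeppelin"]
hits = {"the": 900_000, "engine": 50_000, "zeppelin": 200}
scores = rareness_weighted_scores(page, hits, total_pages=1_000_000)
index_words = top_index_words(scores, 2)
```

In this toy example the frequent but globally common word "the" receives a score near zero, while the globally rare word "zeppelin" ranks first. Under these assumptions, estimating rareness from web-wide counts rather than from a small local collection is what separates the two approaches compared in the abstract.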
Copyright information
© 2004 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Hirose, M., Suzuki, E. (2004). Using WWW-Distribution of Words in Detecting Peculiar Web Pages. In: Suzuki, E., Arikawa, S. (eds) Discovery Science. DS 2004. Lecture Notes in Computer Science, vol 3245. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-30214-8_31
DOI: https://doi.org/10.1007/978-3-540-30214-8_31
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-23357-2
Online ISBN: 978-3-540-30214-8
eBook Packages: Springer Book Archive