Abstract
Users of a Web site usually act on their interests by clicking through or visiting Web pages, and these actions are recorded in access log files. Clustering Web user access patterns can capture the interests that users of a Web site have in common and, in turn, build user profiles for advanced Web applications such as Web caching and prefetching. Conventional Web usage mining techniques for clustering Web user sessions can discover usage patterns directly, but cannot identify the latent factors or hidden relationships in users’ navigational behaviour. In this paper, we propose an approach based on a vector space model, called Random Indexing, to discover such intrinsic characteristics of Web users’ activities. The underlying factors are then utilised for clustering individual user navigational patterns and creating common user profiles. The clustering results are used to predict and prefetch Web requests for grouped users. We demonstrate the usability and superiority of the proposed Web user clustering approach through experiments on a real Web log file. The clustering and prefetching tasks are evaluated against previous studies, and the results show better clustering performance and higher prefetching accuracy.
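To make the general idea of Random Indexing over access logs concrete, the following is a minimal sketch and not the implementation evaluated in the paper. It assumes that each page receives a sparse random index vector of +1/-1 entries and that a user's context vector is the sum of the index vectors of the pages visited, weighted here by raw visit counts. The function names, the toy `sessions` data, the dimensionality (1000) and the number of non-zero entries (10) are all hypothetical choices made for illustration.

```python
import math
import random


def make_index_vector(dim, nonzeros, rng):
    """Sparse ternary index vector: a few random +1/-1 entries, 0 elsewhere."""
    vec = [0] * dim
    for pos in rng.sample(range(dim), nonzeros):
        vec[pos] = rng.choice((-1, 1))
    return vec


def build_context_vectors(sessions, dim=1000, nonzeros=10, seed=0):
    """Sum the index vectors of visited pages into one vector per user.

    `sessions` maps a user id to {page: visit_count}. Using the raw visit
    count as the weight is an assumption made only for this sketch.
    """
    rng = random.Random(seed)
    pages = sorted({p for visits in sessions.values() for p in visits})
    index = {p: make_index_vector(dim, nonzeros, rng) for p in pages}
    context = {}
    for user, visits in sessions.items():
        vec = [0] * dim
        for page, count in visits.items():
            for i, value in enumerate(index[page]):
                if value:
                    vec[i] += count * value
        context[user] = vec
    return context


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Hypothetical toy log: u1 and u2 share two pages, u3 shares none with them.
sessions = {
    "u1": {"/news": 3, "/sports": 1, "/weather": 2},
    "u2": {"/news": 2, "/sports": 2, "/finance": 1},
    "u3": {"/docs": 4, "/api": 1, "/download": 2},
}
context = build_context_vectors(sessions)
sim_u1_u2 = cosine(context["u1"], context["u2"])
sim_u1_u3 = cosine(context["u1"], context["u3"])
```

In this sketch, users who visit overlapping pages end up with similar context vectors, while users with disjoint page sets end up with nearly orthogonal ones. The resulting fixed-length vectors could then be passed to any standard clustering algorithm to group users.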
Cite this article
Wan, M., Jönsson, A., Wang, C. et al. Web user clustering and Web prefetching using Random Indexing with weight functions. Knowl Inf Syst 33, 89–115 (2012). https://doi.org/10.1007/s10115-011-0453-x
DOI: https://doi.org/10.1007/s10115-011-0453-x