Abstract
A database session is a sequence of requests presented to the database system by a user or an application to achieve a certain task. Session identification is an important step in discovering useful patterns from database trace logs. The discovered patterns can be used to improve the performance of database systems by prefetching predicted queries, rewriting the current query or conducting effective cache replacement.
In this paper, we apply a new session identification method based on statistical language modeling to database trace logs. The application reveals several open problems with the method: how to select values for the parameters of the language model, how to evaluate the accuracy of the session identification result, and how to learn a language model without well-labeled training data. Each of these issues must be addressed for the method to be applied successfully, and we propose solutions to all of them. In particular, we propose new methods for determining the entropy threshold and the order of the language model, and we present new performance measures that better evaluate the accuracy of the identified sessions. Furthermore, we introduce three types of learning methods, namely learning from labeled data, learning from semi-labeled data and learning from unlabeled data, to learn language models from different types of training data. Finally, we report experimental results that show the effectiveness of the language-model-based method for identifying sessions from the trace logs of an OLTP database application and the TPC-C benchmark.
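The abstract does not spell out the procedure, so the following is only a minimal sketch of how entropy-based session identification with an n-gram language model might look. It assumes a bigram model (order 2) with add-one smoothing, trained on labeled sessions, and it places a boundary whenever appending the next request raises the average per-request entropy of the current session by more than a threshold. All function names, the request names, the smoothing scheme and the use of an absolute entropy rise are illustrative assumptions, not the authors' implementation.

```python
import math
from collections import Counter

START = "<s>"  # hypothetical start-of-session marker


def train_bigram(sessions):
    """Count unigram and bigram frequencies over labeled sessions."""
    unigrams, bigrams, vocab = Counter(), Counter(), set()
    for session in sessions:
        seq = [START] + list(session)
        vocab.update(seq)
        for prev, cur in zip(seq, seq[1:]):
            unigrams[prev] += 1
            bigrams[(prev, cur)] += 1
    return unigrams, bigrams, vocab


def prob(model, prev, cur):
    """Add-one smoothed P(cur | prev); one extra slot covers unseen requests."""
    unigrams, bigrams, vocab = model
    return (bigrams[(prev, cur)] + 1) / (unigrams[prev] + len(vocab) + 1)


def entropy(model, seq):
    """Average negative log2 probability per request in seq."""
    padded = [START] + list(seq)
    total = sum(-math.log2(prob(model, p, c)) for p, c in zip(padded, padded[1:]))
    return total / len(seq)


def split_sessions(model, log, threshold):
    """Cut the log wherever the next request raises entropy by more than threshold."""
    sessions, current = [], []
    for request in log:
        if current:
            rise = entropy(model, current + [request]) - entropy(model, current)
            if rise > threshold:  # assumed boundary criterion
                sessions.append(current)
                current = []
        current.append(request)
    if current:
        sessions.append(current)
    return sessions


# Toy data: three identical labeled sessions, then an unsegmented log.
labeled = [["login", "select", "update", "logout"]] * 3
model = train_bigram(labeled)
log = ["login", "select", "update", "logout"] * 2
print(split_sessions(model, log, threshold=0.2))
```

In this toy setting, a request that follows `logout` was never observed in training, so its smoothed probability is low and the entropy of the running sequence jumps. Both the threshold and the model order are parameters of exactly the kind the paper sets out to choose systematically.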
Author information
Xiangji Huang joined York University as an Assistant Professor in July 2003 and became a tenured Associate Professor in May 2006. Previously, he was a Post Doctoral Fellow at the School of Computer Science, University of Waterloo, Canada. He did his Ph.D. in Information Science at City University in London, England, with Professor Stephen E. Robertson. Before entering his Ph.D. program, he worked as a lecturer for 4 years at Wuhan University. He also worked for three and a half years in the financial industry in Canada doing e-business, where he was awarded a CIO Achievement Award. He has published more than 50 refereed papers in journals, book chapters and conference proceedings. His Master's (M.Eng.) and Bachelor's (B.Eng.) degrees were in Computer Organization & Architecture and Computer Engineering, respectively. His research interests include information retrieval, data mining, natural language processing, bioinformatics and computational linguistics.
Qingsong Yao is a Ph.D. student in the Department of Computer Science and Engineering at York University, Toronto, Canada. His research interests include database management systems and query optimization, data mining, information retrieval, natural language processing and computational linguistics. He earned his Master's degree in Computer Science from the Institute of Software, Chinese Academy of Sciences in 1999 and his Bachelor's degree in Computer Science from Tsinghua University.
Aijun An is an associate professor in the Department of Computer Science and Engineering at York University, Toronto, Canada. She received her Bachelor's and Master's degrees in Computer Science from Xidian University in China. She received her PhD degree in Computer Science from the University of Regina in Canada in 1997. She worked at the University of Waterloo as a postdoctoral fellow from 1997 to 1999 and as a research assistant professor from 1999 to 2001. She joined York University in 2001. She has published more than 60 papers in refereed journals and conference proceedings. Her research interests include data mining, machine learning, and information retrieval.
Cite this article
Huang, X., Yao, Q. & An, A. Applying language modeling to session identification from database trace logs. Knowl Inf Syst 10, 473–504 (2006). https://doi.org/10.1007/s10115-006-0015-9