Applying language modeling to session identification from database trace logs

Huang, Xiangji; Yao, Qingsong; An, Aijun

doi:10.1007/s10115-006-0015-9

Regular Paper
Published: 24 March 2006

Volume 10, pages 473–504, (2006)

Knowledge and Information Systems

Xiangji Huang¹,
Qingsong Yao² &
Aijun An²

Abstract

A database session is a sequence of requests presented to the database system by a user or an application to achieve a certain task. Session identification is an important step in discovering useful patterns from database trace logs. The discovered patterns can be used to improve the performance of database systems by prefetching predicted queries, rewriting the current query or conducting effective cache replacement.

In this paper, we apply a new session identification method, based on statistical language modeling, to database trace logs. The application reveals several open problems with the language-modeling-based method: how to select values for the parameters of the language model, how to evaluate the accuracy of the session identification result, and how to learn a language model without well-labeled training data. All of these issues must be addressed for the method to be applied successfully, and we propose solutions to each. In particular, we propose new methods for determining an entropy threshold and the order of the language model, and we present new performance measures that better evaluate the accuracy of the identified sessions. Furthermore, we introduce three types of learning methods, namely learning from labeled data, learning from semi-labeled data and learning from unlabeled data, to learn language models from different types of training data. Finally, we report experimental results that show the effectiveness of the language-model-based method for identifying sessions from the trace logs of an OLTP database application and the TPC-C Benchmark.
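
As an illustration of the general idea only, the following Python sketch trains a bigram model on labeled sessions and starts a new session whenever the next request is too surprising given the previous one. It is a simplified stand-in, not the method of the paper: the add-one smoothing, the fixed threshold of 2.0 bits, the request names and the helper functions `train_bigram`, `surprisal` and `split_sessions` are all assumptions made for the example, and the paper's own procedures for choosing the entropy threshold and the model order are not reproduced here.

```python
import math
from collections import Counter

START = "<s>"  # hypothetical start-of-session marker


def train_bigram(sessions):
    """Count contexts and bigrams over labeled training sessions."""
    contexts, bigrams, vocab = Counter(), Counter(), set()
    for session in sessions:
        tokens = [START] + list(session)
        vocab.update(tokens)
        for prev, cur in zip(tokens, tokens[1:]):
            contexts[prev] += 1
            bigrams[(prev, cur)] += 1
    return contexts, bigrams, vocab


def surprisal(model, prev, cur):
    """Return -log2 P(cur | prev) under add-one smoothing (an assumption)."""
    contexts, bigrams, vocab = model
    prob = (bigrams[(prev, cur)] + 1) / (contexts[prev] + len(vocab))
    return -math.log2(prob)


def split_sessions(model, log, threshold):
    """Cut the request log wherever the next request's surprisal exceeds threshold."""
    sessions, current = [], []
    prev = START
    for request in log:
        if current and surprisal(model, prev, request) > threshold:
            sessions.append(current)
            current = []
        current.append(request)
        prev = request
    if current:
        sessions.append(current)
    return sessions


# Toy labeled training data: two made-up task types, five examples each.
training = (
    [["login", "select_cart", "update_cart", "commit"]] * 5
    + [["login", "select_order", "commit"]] * 5
)
model = train_bigram(training)

# An unsegmented trace log containing two tasks back to back.
log = ["login", "select_cart", "update_cart", "commit",
       "login", "select_order", "commit"]
sessions = split_sessions(model, log, threshold=2.0)
```

In this toy data the transition from `commit` to `login` never occurs inside a training session, so it is the only transition whose surprisal exceeds the threshold, and the log is cut there. With real trace logs the threshold would have to be tuned rather than fixed by hand.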

Author information

Authors and Affiliations

School of Information Technology, York University, 4700 Keele Street, Toronto, ON, Canada, M3J 1P3
Xiangji Huang
Department of Computer Science and Engineering, York University, Toronto, ON, Canada, M3J 1P3
Qingsong Yao & Aijun An

Corresponding author

Correspondence to Xiangji Huang.

Additional information

Xiangji Huang joined York University as an Assistant Professor in July 2003 and became a tenured Associate Professor in May 2006. Previously, he was a Postdoctoral Fellow at the School of Computer Science, University of Waterloo, Canada. He completed his Ph.D. in Information Science at City University in London, England, with Professor Stephen E. Robertson. Before beginning his Ph.D. program, he worked as a lecturer for 4 years at Wuhan University. He also worked for three and a half years in the financial industry in Canada doing E-business, where he received a CIO Achievement Award. He has published more than 50 refereed papers in journals, book chapters and conference proceedings. His Master's (M.Eng.) and Bachelor's (B.Eng.) degrees were in Computer Organization & Architecture and Computer Engineering, respectively. His research interests include information retrieval, data mining, natural language processing, bioinformatics and computational linguistics.

Qingsong Yao is a Ph.D. student in the Department of Computer Science and Engineering at York University, Toronto, Canada. His research interests include database management systems and query optimization, data mining, information retrieval, natural language processing and computational linguistics. He earned his Master's degree in Computer Science from the Institute of Software, Chinese Academy of Sciences, in 1999 and his Bachelor's degree in Computer Science from Tsinghua University.

Aijun An is an associate professor in the Department of Computer Science and Engineering at York University, Toronto, Canada. She received her Bachelor's and Master's degrees in Computer Science from Xidian University in China. She received her PhD degree in Computer Science from the University of Regina in Canada in 1997. She worked at the University of Waterloo as a postdoctoral fellow from 1997 to 1999 and as a research assistant professor from 1999 to 2001. She joined York University in 2001. She has published more than 60 papers in refereed journals and conference proceedings. Her research interests include data mining, machine learning, and information retrieval.

About this article

Cite this article

Huang, X., Yao, Q. & An, A. Applying language modeling to session identification from database trace logs. Knowl Inf Syst 10, 473–504 (2006). https://doi.org/10.1007/s10115-006-0015-9

Received: 16 February 2005
Revised: 08 January 2006
Accepted: 30 January 2006
Published: 24 March 2006
Issue Date: November 2006
DOI: https://doi.org/10.1007/s10115-006-0015-9
