Abstract
Given the vast reservoirs of sequence data stored worldwide, efficient mining of string databases arising from intrusion detection systems, player statistics, texts, proteins, etc. has emerged as a great challenge. Searching for unusual patterns within long strings of data is a requirement in diverse applications. Given a string, the problem is to identify the substrings that differ the most from the expected or normal behavior, i.e., the substrings that are statistically significant (less likely to occur due to chance alone). To this end, we use the chi-square measure and propose two heuristics for retrieving the top-k substrings with the largest chi-square measure. We show that the algorithms outperform other competing algorithms in runtime, while maintaining a high approximation ratio of more than 0.96.
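To make the problem statement concrete, the following is a minimal sketch of a naive baseline, not the two heuristics proposed in the paper. It assumes the standard Pearson chi-square statistic, where each character is drawn independently with a fixed null probability, and it enumerates all O(n²) substrings. The function names `chi_square` and `top_k_substrings` and the `probs` parameter are illustrative assumptions, not names taken from the paper.

```python
import heapq
from collections import Counter


def chi_square(substring, probs):
    """Pearson chi-square of a substring against a null model.

    probs maps each alphabet symbol to its assumed null probability
    (a hypothetical input format chosen for this sketch).
    """
    length = len(substring)
    counts = Counter(substring)
    return sum(
        (counts.get(symbol, 0) - length * p) ** 2 / (length * p)
        for symbol, p in probs.items()
    )


def top_k_substrings(text, probs, k):
    """Brute-force top-k substrings by chi-square value.

    Returns a list of (score, start, end) with text[start:end] being the
    substring, sorted by decreasing score. Counts are updated
    incrementally as the right end of the substring is extended.
    """
    symbols = sorted(probs)
    index = {symbol: i for i, symbol in enumerate(symbols)}
    heap = []  # min-heap holding the k best (score, start, end) seen so far
    n = len(text)
    for start in range(n):
        counts = [0] * len(symbols)
        for end in range(start, n):
            counts[index[text[end]]] += 1
            length = end - start + 1
            score = sum(
                (c - length * probs[s]) ** 2 / (length * probs[s])
                for c, s in zip(counts, symbols)
            )
            item = (score, start, end + 1)
            if len(heap) < k:
                heapq.heappush(heap, item)
            elif item > heap[0]:
                heapq.heapreplace(heap, item)
    return sorted(heap, reverse=True)


if __name__ == "__main__":
    null_model = {"a": 0.5, "b": 0.5}  # assumed uniform null model
    for score, start, end in top_k_substrings("abababaaaaab", null_model, 3):
        print(f"{score:.3f}  [{start}:{end}]")
```

An exhaustive scan of this kind is only practical for short strings, but it gives a reference answer against which the approximation ratio of a faster heuristic could be measured.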
Copyright information
© 2010 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Dutta, S., Bhattacharya, A. (2010). Most Significant Substring Mining Based on Chi-square Measure. In: Zaki, M.J., Yu, J.X., Ravindran, B., Pudi, V. (eds) Advances in Knowledge Discovery and Data Mining. PAKDD 2010. Lecture Notes in Computer Science, vol 6118. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-13657-3_35
DOI: https://doi.org/10.1007/978-3-642-13657-3_35
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-13656-6
Online ISBN: 978-3-642-13657-3
eBook Packages: Computer Science, Computer Science (R0)