Most Significant Substring Mining Based on Chi-square Measure

Dutta, Sourav; Bhattacharya, Arnab

doi:10.1007/978-3-642-13657-3_35


Part of the book series: Lecture Notes in Computer Science (LNAI, volume 6118)

Included in the following conference series:

Pacific-Asia Conference on Knowledge Discovery and Data Mining


Abstract

Given the vast reservoirs of sequence data stored worldwide, efficient mining of string databases such as intrusion detection systems, player statistics, texts, and proteins has emerged as a great challenge. Searching for an unusual pattern within long strings of data is a requirement for diverse applications. Given a string, the problem is to identify the substrings that differ the most from the expected or normal behavior, i.e., the substrings that are statistically significant (less likely to occur due to chance alone). To this end, we use the chi-square measure and propose two heuristics for retrieving the top-k substrings with the largest chi-square measure. We show that the algorithms outperform other competing algorithms in runtime, while maintaining a high approximation ratio of more than 0.96.
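To make the problem statement concrete, the following is a minimal sketch of an exhaustive baseline, not the two heuristics proposed in the paper, which the abstract does not describe. It assumes a null model in which characters are drawn independently with fixed probabilities, and scores a substring of length l with Pearson's chi-square statistic, the sum over characters of (observed - l*p)^2 / (l*p). The function names `chi_square` and `top_k_substrings` and the `probs` parameter are illustrative choices, not taken from the paper.

```python
import heapq
from collections import Counter


def chi_square(counts, length, probs):
    """Pearson chi-square of observed character counts against
    the expected counts length * p under the assumed null model."""
    return sum(
        (counts.get(ch, 0) - length * p) ** 2 / (length * p)
        for ch, p in probs.items()
    )


def top_k_substrings(s, probs, k):
    """Score every substring s[i:j] and keep the k largest.

    Naive O(n^2 * |alphabet|) scan, intended only as a reference
    baseline. Assumes every character of s has a probability in probs.
    Returns (score, start, end) triples, best first.
    """
    heap = []  # min-heap holding the k best triples seen so far
    for i in range(len(s)):
        counts = Counter()
        for j in range(i, len(s)):
            counts[s[j]] += 1  # extend the substring by one character
            item = (chi_square(counts, j - i + 1, probs), i, j + 1)
            if len(heap) < k:
                heapq.heappush(heap, item)
            elif item > heap[0]:
                heapq.heapreplace(heap, item)
    return sorted(heap, reverse=True)
```

For example, with the string "abaaaab" and probabilities of 0.5 each for "a" and "b", the call `top_k_substrings("abaaaab", {"a": 0.5, "b": 0.5}, 1)` selects the run "aaaa" (indices 2 to 6), whose score is 4.0. Approximation heuristics of the kind the abstract mentions would aim to reach similar answers without the quadratic scan.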



Author information

Authors and Affiliations

Department of Computer Science and Engineering, Indian Institute of Technology, Kanpur, India
Sourav Dutta & Arnab Bhattacharya


Editor information

Editors and Affiliations

Computer Science Department, Rensselaer Polytechnic Institute, USA
Mohammed J. Zaki
The Chinese University of Hong Kong, China
Jeffrey Xu Yu
IIT Madras, Chennai, India
B. Ravindran
IIIT, Hyderabad, India
Vikram Pudi


Cite this paper

Dutta, S., Bhattacharya, A. (2010). Most Significant Substring Mining Based on Chi-square Measure. In: Zaki, M.J., Yu, J.X., Ravindran, B., Pudi, V. (eds) Advances in Knowledge Discovery and Data Mining. PAKDD 2010. Lecture Notes in Computer Science, vol 6118. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-13657-3_35


DOI: https://doi.org/10.1007/978-3-642-13657-3_35
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-13656-6
Online ISBN: 978-3-642-13657-3
eBook Packages: Computer Science, Computer Science (R0)
