Abstract
We study the problem of extracting, from a given source x and error threshold k, substrings of x that occur unusually often in x within k substitutions or mismatches. Specifically, we assume that the input textstring x of n characters is produced by an i.i.d. source, and design efficient methods for computing the probability and expected number of occurrences for substrings of x with (either exactly or up to) k mismatches. Two related schemes are presented. In the first one, an O(nk) time preprocessing of x is developed that supports the following subsequent queries: for any substring w of x arbitrarily specified as input, the probability of occurrence of w in x within (either exactly or up to) k mismatches is reported in O(k^2) time. In the second scheme, a length or length range is arbitrarily specified, and the above probabilities are computed for all substrings of x having length in that range, in overall O(nk) time. Further, monotonicity conditions are introduced and studied for probabilities and expected occurrences of a substring under unit increases in its length, allowed number of errors, or both. Over intervals of constant frequency count, these monotonicities translate to some of the scores in use, thereby reducing the size of tables at the outset and enhancing the process of discovery. These latter derivations extend to patterns with mismatches an analysis previously devoted to exact patterns.
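As a rough illustration of the quantities the abstract refers to, the sketch below computes, for a single word w under an i.i.d. source, the probability that a random window of the same length matches w with exactly j mismatches (for j = 0..k), and from it the expected number of occurrences in a text of n characters. It uses a plain O(|w| k) dynamic program over the positions of w. It is not the O(nk) preprocessing or the O(k^2) query scheme of the paper, and the function names, the `char_prob` dictionary, and the `exact` flag are assumptions made for this example.

```python
def mismatch_probabilities(w, char_prob, k):
    """Return [P(exactly j mismatches) for j = 0..k] for word w.

    char_prob maps each character to its probability under the
    (assumed) i.i.d. source. Position i of a random window matches
    w[i] with probability char_prob[w[i]], independently of the rest.
    """
    dist = [1.0] + [0.0] * k  # distribution over mismatch counts so far
    for c in w:
        p = char_prob[c]
        # Update in place from high j to low j so that dist[j - 1]
        # still holds the value from before this position was added.
        for j in range(k, 0, -1):
            dist[j] = dist[j] * p + dist[j - 1] * (1.0 - p)
        dist[0] *= p
    return dist


def expected_occurrences(w, char_prob, k, n, exact=False):
    """Expected count of windows of a length-n text that match w
    with exactly k mismatches (exact=True) or up to k (exact=False)."""
    dist = mismatch_probabilities(w, char_prob, k)
    prob = dist[k] if exact else sum(dist)
    # By linearity of expectation, sum the probability over all
    # n - |w| + 1 starting positions.
    return (n - len(w) + 1) * prob


if __name__ == "__main__":
    uniform_dna = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
    print(mismatch_probabilities("ACGT", uniform_dna, 1))
    print(expected_occurrences("ACGT", uniform_dna, 1, n=1000))
```

Updating `dist` in place keeps memory at O(k) per word. Running this independently for every substring of x would cost far more than the O(nk) bounds stated above, which is the gap the two schemes in the paper are designed to close.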
Copyright information
© 2004 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Apostolico, A., Pizzi, C. (2004). Monotone Scoring of Patterns with Mismatches. In: Jonassen, I., Kim, J. (eds) Algorithms in Bioinformatics. WABI 2004. Lecture Notes in Computer Science, vol 3240. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-30219-3_8
DOI: https://doi.org/10.1007/978-3-540-30219-3_8
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-23018-2
Online ISBN: 978-3-540-30219-3