Abstract
We examine the problem of finding maximal-scoring sets of disjoint regions in a sequence of scores. The problem arises in DNA and protein segmentation, and in the post-processing of sequence alignments. Our key result is a simple recursive relationship between maximal-scoring segment sets. This relationship leads to an algorithm that finds a maximal-scoring set of k segments in a sequence of length n in O(nk) time. We describe linear-time algorithms for finding optimal segment sets using different criteria for choosing k, as well as an algorithm for finding an optimal set of k segments in O(n log n) time, independently of k. We apply our methods to the identification of non-coding RNA genes in thermophiles.
Work supported by NSERC grant 250391-02.
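To make the problem statement concrete, the following Python sketch finds a maximal-scoring set of at most k disjoint segments with a textbook dynamic program. It is not the recursive relationship or the algorithm of the paper, which the abstract does not spell out; it only matches the O(nk) time bound quoted above. The function name `max_scoring_segments`, the "at most k" reading of a k-set, and the half-open `(start, stop)` index convention are all assumptions made for illustration.

```python
from typing import List, Sequence, Tuple


def max_scoring_segments(
    scores: Sequence[float], k: int
) -> Tuple[float, List[Tuple[int, int]]]:
    """Return (total, segments) for at most k disjoint, non-empty segments.

    Each segment is a half-open pair (start, stop) covering scores[start:stop].
    Textbook O(n*k) dynamic program (illustrative, not the paper's method).
    """
    n = len(scores)
    neg_inf = float("-inf")
    # closes[j][i]: the best set of <= j segments in scores[:i] ends a segment at i-1.
    # extends[j][i]: the j-th segment ending at i-1 also covers position i-2.
    closes = [[False] * (n + 1) for _ in range(k + 1)]
    extends = [[False] * (n + 1) for _ in range(k + 1)]

    prev_best = [0.0] * (n + 1)  # best totals with at most j-1 segments
    for j in range(1, k + 1):
        cur_best = [0.0] * (n + 1)
        end_here = neg_inf  # best total whose j-th segment ends at i-1
        for i in range(1, n + 1):
            # Either extend the j-th segment or start it at position i-1.
            if end_here > prev_best[i - 1]:
                extends[j][i] = True
                end_here += scores[i - 1]
            else:
                end_here = prev_best[i - 1] + scores[i - 1]
            # Either leave position i-1 uncovered or end a segment there.
            if end_here > cur_best[i - 1]:
                closes[j][i] = True
                cur_best[i] = end_here
            else:
                cur_best[i] = cur_best[i - 1]
        prev_best = cur_best

    # Trace the recorded decisions back from the full prefix.
    segments: List[Tuple[int, int]] = []
    j, i = k, n
    while j > 0 and i > 0:
        if not closes[j][i]:
            i -= 1
            continue
        stop = i
        while extends[j][i]:
            i -= 1
        segments.append((i - 1, stop))
        i -= 1
        j -= 1
    segments.reverse()
    return prev_best[n], segments
```

For example, `max_scoring_segments([2, -3, 4, -1, 2, -5, 3], 2)` selects the segments covering `[4, -1, 2]` and `[3]`. The sketch stores two boolean tables of size O(nk) so that the segments can be recovered; if only the total score is needed, the two rolling rows of totals suffice.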
© 2004 Springer-Verlag Berlin Heidelberg
Csűrös, M. (2004). Algorithms for Finding Maximal-Scoring Segment Sets. In: Jonassen, I., Kim, J. (eds) Algorithms in Bioinformatics. WABI 2004. Lecture Notes in Computer Science(), vol 3240. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-30219-3_6
DOI: https://doi.org/10.1007/978-3-540-30219-3_6
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-23018-2
Online ISBN: 978-3-540-30219-3