Abstract
It is well known that the base composition along eukaryotic genomes is long-range correlated. Here, we investigate the effect of such long-range correlations on alignment score statistics. We model the correlated score-landscape by means of a Gaussian approximation. In this framework, we can calculate the corrections to the scale parameter λ of the extreme value distribution of alignment scores. To evaluate our approximate analytic results, we perform a detailed numerical study based on a simple algorithm to efficiently generate long-range correlated random sequences. We find that the mean and the exponential tail of the score distribution are in fact influenced by the correlations along the sequences. Therefore, the significance of measured alignment scores in biological sequences will change upon incorporation of the correlations in the null model.
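For orientation, the scale parameter λ mentioned above can be computed in closed form only in the simplest null model. The sketch below covers the standard uncorrelated, gapless case (Karlin-Altschul theory), not the correlation-corrected result of this paper. It assumes independent letters and a match/mismatch scoring scheme. The function names and the constant `k` are illustrative assumptions, and `k` has to be supplied by the user.

```python
import math


def gapless_lambda(match_prob, match_score=1.0, mismatch_score=-1.0, tol=1e-12):
    """Positive root of p*exp(lam*s_match) + (1-p)*exp(lam*s_mismatch) = 1.

    Assumes uncorrelated letters (the classical null model), i.e. exactly
    the assumption that long-range correlations would violate.
    """
    expected = match_prob * match_score + (1.0 - match_prob) * mismatch_score
    if expected >= 0.0 or match_score <= 0.0:
        raise ValueError("need negative expected score and a positive match score")

    def f(lam):
        return (match_prob * math.exp(lam * match_score)
                + (1.0 - match_prob) * math.exp(lam * mismatch_score) - 1.0)

    # f(0) = 0 and f'(0) = expected score < 0, so f is negative just above 0.
    lo, hi = 0.0, 1.0
    while f(hi) <= 0.0:  # grow the bracket until f changes sign
        hi *= 2.0
    while hi - lo > tol:  # plain bisection on the bracket [lo, hi]
        mid = 0.5 * (lo + hi)
        if f(mid) < 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def gumbel_pvalue(score, lam, k, m, n):
    """P(S >= score) = 1 - exp(-k*m*n*exp(-lam*score)) for sequence lengths m, n.

    k is the Karlin-Altschul prefactor; it is an input here, not computed.
    """
    return -math.expm1(-k * m * n * math.exp(-lam * score))


if __name__ == "__main__":
    # Uniform nucleotides: match probability 1/4, scores +1 / -1.
    lam = gapless_lambda(0.25)
    print(lam, math.log(3.0))
```

For uniform nucleotide frequencies with scores +1 and -1, the defining equation reduces to a quadratic in exp(λ) with the positive solution λ = ln 3, which the bisection reproduces. A smaller λ means a heavier exponential tail, so a fixed alignment score becomes less significant.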
Copyright information
© 2006 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Messer, P.W., Bundschuh, R., Vingron, M., Arndt, P.F. (2006). Alignment Statistics for Long-Range Correlated Genomic Sequences. In: Apostolico, A., Guerra, C., Istrail, S., Pevzner, P.A., Waterman, M. (eds) Research in Computational Molecular Biology. RECOMB 2006. Lecture Notes in Computer Science, vol 3909. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11732990_36
DOI: https://doi.org/10.1007/11732990_36
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-33295-4
Online ISBN: 978-3-540-33296-1
eBook Packages: Computer Science, Computer Science (R0)