Abstract
In searching for strong homologies among multiple nucleic acid or protein sequences, researchers commonly look for fixed-length segments common to the sequences. Such homologies form the foundation of segment-based algorithms for multiple alignment of protein sequences. The researcher calibrates these algorithms through settings for the "unusualness of multiple matches." In applications where a researcher has found a multiple matching word, statistical significance helps gauge the unusualness of the observed match. Previous approximations for the unusualness of multiple matches are based on large-sample theory and are sometimes quite inaccurate. Section 2 illustrates this inaccuracy and provides accurate approximations for the probability of a common word in R out of R sequences. Section 3 generalizes the approximation to multiple matching in R out of S sequences. Section 4 describes a more complex approximation that incorporates exact probabilities and yields excellent accuracy; this approximation is useful for checking the simpler approximations over a range of values.
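To make the quantity in question concrete, the probability that some word of length m is shared by at least R out of S random sequences can be estimated by brute-force simulation. The sketch below is not the approximation developed in the article. It is a Monte Carlo check under stated assumptions: letters are independent and equally likely, all sequences have the same length n, and a word may occur at any position in each sequence. The function names and parameters are hypothetical, chosen only for illustration.

```python
import random
from collections import Counter


def max_sequences_sharing_word(seqs, m):
    """Largest number of sequences that contain one and the same length-m word.

    Each sequence contributes at most once per word, at any position.
    Returns 0 when no sequence is long enough to hold a length-m word.
    """
    counts = Counter()
    for s in seqs:
        # Distinct words of this sequence, so repeats inside it count once.
        counts.update({s[i:i + m] for i in range(len(s) - m + 1)})
    return max(counts.values(), default=0)


def estimate_match_probability(R, S, n, m, alphabet="ACGT", trials=2000, seed=0):
    """Monte Carlo estimate of P(some length-m word occurs in >= R of S sequences).

    Assumption (not from the article): i.i.d. uniform letters over `alphabet`,
    with S sequences of common length n.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        seqs = ["".join(rng.choice(alphabet) for _ in range(n)) for _ in range(S)]
        if max_sequences_sharing_word(seqs, m) >= R:
            hits += 1
    return hits / trials
```

Setting R equal to S gives the "R out of R" case. A simulation of this kind is slow for very small probabilities, which is one reason closed-form approximations are of practical interest.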
Cite this article
Naus, J.I., Sheng, K.N. Matching among multiple random sequences. Bull. Math. Biol. 59, 483–496 (1997). https://doi.org/10.1007/BF02459461