Summary
Protein sequence alignments generally are constructed with the aid of a “substitution matrix” that specifies a score for aligning each pair of amino acids. Assuming a simple random protein model, it can be shown that any such matrix, when used for evaluating variable-length local alignments, is implicitly a “log-odds” matrix, with a specific probability distribution for amino acid pairs to which it is uniquely tailored. Given a model of protein evolution from which such distributions may be derived, a substitution matrix adapted to detecting relationships at any chosen evolutionary distance can be constructed. Because in a database search it generally is not known a priori what evolutionary distances will characterize the similarities found, it is necessary to employ an appropriate range of matrices in order not to overlook potential homologies. This paper formalizes this concept by defining a scoring system that is sensitive at all detectable evolutionary distances. The statistical behavior of this scoring system is analyzed, and it is shown that for a typical protein database search, estimating the originally unknown evolutionary distance appropriate to each alignment costs slightly over two bits of information, or somewhat less than a factor of five in statistical significance. A much greater cost may be incurred, however, if only a single substitution matrix, corresponding to the wrong evolutionary distance, is employed.
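The log-odds interpretation and the "bits versus significance" conversion mentioned in the summary can be illustrated with a small sketch. It is not taken from the paper: the function names, the two-letter alphabet, and the +1/-2 scores are hypothetical choices made only for illustration. The sketch assumes the standard result for local alignment scores that, for background frequencies p_i and scores s_ij with negative expected value and at least one positive score, there is a unique positive λ satisfying Σ p_i p_j exp(λ s_ij) = 1, and that the implied target frequencies are q_ij = p_i p_j exp(λ s_ij).

```python
import math


def log_odds_bits(q_ij, p_i, p_j):
    """Log-odds score in bits: target pair frequency over chance frequency."""
    return math.log2(q_ij / (p_i * p_j))


def solve_lambda(p, s, tol=1e-12):
    """Find the unique positive lambda with sum_ij p_i p_j exp(lambda * s_ij) = 1.

    Assumes the expected score is negative and some score is positive,
    so the function below is negative just above 0 and positive beyond the root.
    """
    n = len(p)

    def f(lam):
        total = sum(p[i] * p[j] * math.exp(lam * s[i][j])
                    for i in range(n) for j in range(n))
        return total - 1.0

    hi = 1.0
    while f(hi) <= 0.0:  # widen the bracket until it contains the root
        hi *= 2.0
    lo = 0.0
    while hi - lo > tol:  # plain bisection
        mid = (lo + hi) / 2.0
        if f(mid) > 0.0:
            hi = mid
        else:
            lo = mid
    return (lo + hi) / 2.0


def implied_targets(p, s):
    """Target frequencies q_ij to which the score matrix s is implicitly tailored."""
    lam = solve_lambda(p, s)
    n = len(p)
    return [[p[i] * p[j] * math.exp(lam * s[i][j]) for j in range(n)]
            for i in range(n)]


def significance_factor(bits):
    """Multiplicative change in statistical significance for a score change in bits."""
    return 2.0 ** bits


# Hypothetical toy model: two equally frequent letters, match +1, mismatch -2.
TOY_P = [0.5, 0.5]
TOY_S = [[1, -2], [-2, 1]]

if __name__ == "__main__":
    q = implied_targets(TOY_P, TOY_S)
    print("lambda:", solve_lambda(TOY_P, TOY_S))
    print("target frequencies:", q)
    print("match score in bits:", log_odds_bits(q[0][0], TOY_P[0], TOY_P[1]))
    print("factor for 2 bits:", significance_factor(2.0))
```

For this toy matrix λ works out to the natural logarithm of the golden ratio, and the implied target frequencies sum to one. The last function shows why a cost of slightly over two bits corresponds to somewhat less than a factor of five: two bits is exactly a factor of four, and a factor of five would require log2(5), about 2.32 bits.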
Cite this article
Altschul, S.F. A protein alignment scoring system sensitive at all evolutionary distances. J Mol Evol 36, 290–300 (1993). https://doi.org/10.1007/BF00160485
DOI: https://doi.org/10.1007/BF00160485