A protein alignment scoring system sensitive at all evolutionary distances

Altschul, Stephen F.

doi:10.1007/BF00160485

Published: March 1993

Volume 36, pages 290–300 (1993)
Journal of Molecular Evolution

Summary

Protein sequence alignments generally are constructed with the aid of a “substitution matrix” that specifies a score for aligning each pair of amino acids. Assuming a simple random protein model, it can be shown that any such matrix, when used for evaluating variable-length local alignments, is implicitly a “log-odds” matrix, with a specific probability distribution for amino acid pairs to which it is uniquely tailored. Given a model of protein evolution from which such distributions may be derived, a substitution matrix adapted to detecting relationships at any chosen evolutionary distance can be constructed. Because in a database search it generally is not known a priori what evolutionary distances will characterize the similarities found, it is necessary to employ an appropriate range of matrices in order not to overlook potential homologies. This paper formalizes this concept by defining a scoring system that is sensitive at all detectable evolutionary distances. The statistical behavior of this scoring system is analyzed, and it is shown that for a typical protein database search, estimating the originally unknown evolutionary distance appropriate to each alignment costs slightly over two bits of information, or somewhat less than a factor of five in statistical significance. A much greater cost may be incurred, however, if only a single substitution matrix, corresponding to the wrong evolutionary distance, is employed.
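The log-odds claim in the summary can be illustrated with a minimal sketch. It assumes a hypothetical two-letter alphabet with made-up scores and background frequencies (not data from the paper), and the function names are illustrative only. Under the standard condition of a negative expected score and at least one positive score, there is a unique positive scale λ for which q_ij = p_i p_j exp(λ s_ij) sums to 1; those q_ij are the pair frequencies to which the matrix is implicitly tailored. The last helper converts a cost in bits into a multiplicative factor in significance, using 2.3 bits as an illustrative value chosen here rather than a figure quoted in the text.

```python
import math

def implied_lambda(scores, p, iters=100):
    """Positive root lam of sum_ij p_i p_j exp(lam * s_ij) = 1.

    Assumes a negative expected score and at least one positive score.
    """
    def f(lam):
        return sum(p[a] * p[b] * math.exp(lam * scores[a][b])
                   for a in p for b in p) - 1.0

    lo, hi = 0.0, 1.0
    while f(hi) <= 0.0:        # grow the bracket until f turns positive
        hi *= 2.0
    for _ in range(iters):     # plain bisection; f < 0 between 0 and the root
        mid = 0.5 * (lo + hi)
        if f(mid) > 0.0:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)

def target_frequencies(scores, p, lam):
    """Pair frequencies q_ij = p_i p_j exp(lam * s_ij) implied by the scores."""
    return {a: {b: p[a] * p[b] * math.exp(lam * scores[a][b]) for b in p}
            for a in p}

def significance_factor(bits):
    """Multiplicative change in significance for a score cost given in bits."""
    return 2.0 ** bits

# Hypothetical toy inputs: two letters, uniform background, +1 match, -2 mismatch.
p = {"A": 0.5, "B": 0.5}
scores = {"A": {"A": 1, "B": -2}, "B": {"A": -2, "B": 1}}

lam = implied_lambda(scores, p)
q = target_frequencies(scores, p, lam)
total = sum(q[a][b] for a in p for b in p)

print(f"lambda = {lam:.4f}, sum of implied pair frequencies = {total:.4f}")
print(f"2.0 bits -> factor {significance_factor(2.0):.2f}")
print(f"2.3 bits -> factor {significance_factor(2.3):.2f}")
```

For this toy matrix the equation reduces to x + x⁻² = 2 with x = exp(λ), so λ is the natural logarithm of the golden ratio. A real protein matrix would use the 20 amino acid background frequencies in place of `p`.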

Author information

Authors and Affiliations

National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA
Stephen F. Altschul

About this article

Cite this article

Altschul, S.F. A protein alignment scoring system sensitive at all evolutionary distances. J Mol Evol 36, 290–300 (1993). https://doi.org/10.1007/BF00160485

Received: 22 June 1992
Accepted: 24 August 1992
Issue Date: March 1993
DOI: https://doi.org/10.1007/BF00160485
