Abstract
The accuracy of models for DNA substitution used in phylogenetic analyses is becoming more important with the increasing availability and analysis of molecular sequence data. It is natural to look for ways of improving these models, and to do this in a planned manner it is useful to be able to identify features of sequences that may not be described adequately. In this paper, I describe three statistics which may give useful diagnostic information on departures from models' predictions. The statistical distributions of these statistics are discussed and simple significance tests are derived. These tests are based on the (estimated) phylogeny of the sequences and so have the advantage of using the information contained in this tree. Examples are given of the application of the new tests to Markov chain models describing the evolution of primate pseudogene sequences and small-subunit RNA sequences.
Abbreviations
- b(N, p): binomial distribution of N trials, each with probability p of success
- m(N, p₁, p₂, ..., pᵣ): multinomial distribution of N trials, with r possible outcomes having probabilities p₁, p₂, ..., pᵣ, respectively
- N(μ, σ²): Normal distribution with mean μ and variance σ²
- p(λ): Poisson distribution with mean λ
- bp: base pairs
- cdf: cumulative distribution function
- i.i.d.: independent and identically distributed
Cite this article
Goldman, N. Simple diagnostic statistical tests of models for DNA substitution. J Mol Evol 37, 650–661 (1993). https://doi.org/10.1007/BF00182751
DOI: https://doi.org/10.1007/BF00182751