Abstract
The complete genomes of living organisms have provided much information on their phylogenetic relationships. Similarly, the complete genomes of chloroplasts have helped to resolve the evolution of this organelle in photosynthetic eukaryotes. In this paper we propose an alternative method of phylogenetic analysis using compositional statistics for all protein sequences from complete genomes. This new method is conceptually simpler than and computationally as fast as the one proposed by Qi et al. (2004b) and Chu et al. (2004). The same data sets used in Qi et al. (2004b) and Chu et al. (2004) are analyzed using the new method. Our distance-based phylogenetic tree of the 109 prokaryotes and eukaryotes agrees with the biologists' “tree of life” based on 16S rRNA comparison in a predominant majority of basic branchings and most lower taxa. Our phylogenetic analysis also shows that the chloroplast genomes are separated into two major clades corresponding to chlorophytes s.l. and rhodophytes s.l. The interrelationships among the chloroplasts are largely in agreement with the current understanding of chloroplast evolution.
Introduction
In our understanding of the classification of the living world as a whole, the most important advance was made by Chatton (1937), who divided organisms into two major groups, the prokaryotes (bacteria) and the eukaryotes (organisms with nucleated cells). Then the universal tree of life based on the 16S-like rRNA genes given by Woese and colleagues (Woese 1987; Woese et al. 1990) led to the proposal of three primary domains (Eukarya, Bacteria, and Archaea). Although the archaebacterial domain is accepted by biologists, its phylogenetic status is still a matter of controversy (Gupta 1998; Mayr 1998). Analysis of some genes, particularly those encoding metabolic enzymes, gives different phylogenies of the same organisms or even fails to support the three-domain classification of living organisms (Brown and Doolittle 1997; Doolittle 1998; Gupta 1998). In conventional comparative sequence analyses for phylogenetics, investigators all rely on a very small number of the same ribosomal genes. How accurate are the results? The only way to begin to answer this question is to conduct independent analyses, either with different genes or with different approaches. So for evaluating relationships of taxa, it is worthwhile to consider alternatives to conventional comparative sequence analysis.
It is generally accepted that genome sequences are excellent tools for studying evolution (Eisen and Fraser 2003). In building the tree of life, analysis of whole genomes has begun to supplement, and in some cases to improve upon, studies previously done with one or a few genes (Eisen and Fraser 2003). The availability of complete genomes allows the reconstruction of organismal phylogeny, taking into account the genome content, for example, based on the rearrangement of gene order (Sankoff et al. 1992), the presence or absence of protein-coding gene families (Fitz-Gibbon and House 1999), gene content and overall similarity (Tekaia et al. 1999), and occurrence of folds and orthologs (Lin and Gerstein 2000). All these approaches depend on alignment of homologous sequences, and it is apparent that much information (such as gene rearrangement and insertions/deletions) in these data sets is lost after sequence alignment, let alone the intrinsic problems of alignment algorithms (Li et al. 2001; Stuart et al. 2002a, b). There have been a number of recent attempts to develop methodologies that do not require sequence alignment for deriving species phylogeny based on overall similarities of the complete genomes (e.g., Li et al. 2001; Yu and Jiang 2001; Yu et al. 2003a, b, 2004; Edwards et al. 2002; Stuart et al. 2002a, b).
By overcoming the problem of noise and bias in the protein sequences through the use of better models, whole-genome trees have now largely converged to the rRNA-sequence tree (Charlebois et al. 2003). Qi et al. (2004b) have developed a simple correlation analysis of complete genome sequences based on compositional vectors without the need of sequence alignment. The compositional vectors calculated from the frequency of amino acid strings are converted to distance values for all taxa, and the phylogenetic relationships are inferred from the distance matrix using conventional tree-building methods. An analysis based on this method using 109 organisms (prokaryotes and eukaryotes) yields a tree separating the three domains of life—Archaea, Eubacteria, and Eukarya—with the relationships among the taxa correlating with those based on traditional analyses (Qi et al. 2004b). A correlation analysis based on a different transformation of compositional vectors was also reported by Stuart et al. (2002a), who demonstrated the applicability of the method in revealing phylogeny using vertebrate mitochondrial genomes.
Chloroplast DNA is a primary source of molecular variations for phylogenetic analysis of photosynthetic eukaryotes. During the past decade the availability of complete chloroplast genome sequences has provided a wealth of information to elucidate the phylogeny of photosynthetic eukaryotes at the deep levels of evolution. There have been many phylogenetic analyses based on comparison of sequences of multiple protein-coding genes in chloroplast genomes (e.g., Martin et al. 1998, 2002; Turmel et al. 1999, 2002; Adachi et al. 2000; Lemieux et al. 2000; De Las Rivas et al. 2002). The approach proposed by Qi et al. (2004b) has also been adopted to analyze the complete chloroplast genomes (Chu et al., 2004) and found to reveal a phylogeny of this organelle that is largely consistent with the phylogeny of the photosynthetic eukaryotes based on traditional analyses, thus demonstrating the value of this methodology in analyzing genomes of a smaller size.
In the approach proposed by Qi et al. (2004b), a key step is to subtract the noise background in the composition vector of the protein sequences from complete genomes through a Markov model. In this study, we propose an alternative method to model the noise background in the composition vector through the relationship between a word and its two subwords in the theory of symbolic dynamics. This approach is conceptually simpler than and computationally as fast as the one proposed by Qi et al. (2004b). We apply our new approach to the phylogenetic analyses of the same data sets used in Qi et al. (2004b) (109 organisms) and our previous paper (Chu et al. 2004) (chloroplast genomes). The results are as good as those previously reported by Qi et al. (2004b) and Chu et al. (2004).
Materials and Methods
Genome Data Sets
We used the complete genomes from the NCBI database (ftp://ncbi.nlm.nih.gov/genbank/genomes/).
Data Set 1 (Used in Qi et al. [2004b])
We selected 109 organisms for prokaryote phylogenetic analysis. These include 4 Archaea Crenarchaeota: Aeropyrum pernix (Aerpe), Sulfolobus solfataricus (Sulso), Sulfolobus tokodaii (Sulto), and Pyrobaculum aerophilum (Pyrae); 12 Archaea Euryarchaeota: Archaeoglobus fulgidus (Arcfu), Halobacterium sp. NRC-1 (Halsp), Methanosarcina acetivorans str. C2A (Metac), Methanococcus jannaschii (Metja), Methanopyrus kandleri AV19 (Metka), Methanosarcina mazei Goe1 (Metma), Methanobacterium thermoautotrophicum (Metth), Pyrococcus abyssi (Pyrab), Pyrococcus furiosus (Pyrfu), Pyrococcus horikoshii (Pyrho), Thermoplasma acidophilum (Theac), and Thermoplasma volcanium (Thevo); 2 hyperthermophilic bacteria: Aquifex aeolicus (Aquae) and Thermotoga maritima (Thema); 1 Deinococcus–Thermus: Deinococcus radiodurans R1 (Deira); 3 cyanobacteria: Cyanobacterium Nostoc sp. PCC7120 (Anasp), Cyanobacterium Synechocystis PCC6803 (Synpc), and Thermosynechococcus elongatus BP-1 (Theel); 1 green sulphur bacterium: Chlorobium tepidum TLS (Chlte); 9 proteobacteria, alpha subdivision: Agrobacterium tumefaciens C58 (Agrt5), Agrobacterium tumefaciens C58 UWash (Agrt5W), Brucella melitensis (Brume), Brucella suis 1330 (Brusu), Caulobacter crescentus (Caucr), Mesorhizobium loti (Rhilo), Sinorhizobium meliloti 1021 (Rhime), Rickettsia conorii (Riccn), and Rickettsia prowazekii (Ricpr); 3 proteobacteria, beta subdivision: Neisseria meningitidis MC58 (NeimeM), Neisseria meningitidis Z2491 (NeimeZ), and Ralstonia solanacearum (Ralso); 22 proteobacteria, gamma subdivision: Buchnera sp. APS (Bucai), Buchnera aphidicola Sg (Bucap), Escherichia coli CFT073 (Ecolic), Escherichia coli O157:H7 EDL933 (EcoliE), Escherichia coli K-12 (EcoliK), Escherichia coli O157:H7 (EcoliO), Haemophilus influenzae Rd (Haein), Pasteurella multocida PM70 (Pasmu), Pseudomonas aeruginosa PAO1 (Pseae), Pseudomonas putida KT2440 (Psepu), Salmonella typhi (Salti), Salmonella typhimurium LT2 (Salty), Shewanella oneidensis MR-1 (Sheon), Shigella flexneri 2a strain 301 (Shifl), Vibrio cholerae (Vibch), Vibrio vulnificus CMCP6 (Vibvu), Wigglesworthia brevipalpis (Wigbr), Xanthomonas axonopodis citri 306 (Xanax), Xanthomonas campestris ATCC 33913 (Xanca), Xylella fastidiosa (Xylfa), Yersinia pestis strain CO92 (YerpeC), and Yersinia pestis KIM (YerpeK); 3 proteobacteria, epsilon subdivision: Campylobacter jejuni (Camje), Helicobacter pylori J99 (Helpj), and Helicobacter pylori 26695 (Helpy); 27 firmicutes: Bacillus anthracis A2012 (Bacan), Bacillus halodurans (Bachd), Bacillus subtilis (Bacsu), Clostridium acetobutylicum ATCC824 (Cloab), Clostridium perfringens (Clope), Lactococcus lactis sp. IL 1403 (Lacla), Listeria monocytogenes EGD-e (Lisimo), Listeria innocua (Lisin), Mycoplasma genitalium (Mycge), Mycoplasma penetrans (Mycpe), Oceanobacillus iheyensis (Oceih), Mycoplasma pneumoniae (Mycpn), Mycoplasma pulmonis UAB CTIP (Mycpu), Staphylococcus aureus N315 (StaauN), Staphylococcus aureus Mu50 (StaauM), Staphylococcus epidermidis ATCC 12228 (Staep), Streptococcus agalactiae NEM316 (StragN), Streptococcus agalactiae 2603V/R (StragV), Streptococcus mutans UA159 (Strmu), Streptococcus pneumoniae R6 (StrpnR), Streptococcus pneumoniae TIGR4 (StrpnT), Streptococcus pyogenes MGAS8232 (Strpy8), Streptococcus pyogenes MGAS315 (StrpyG), Streptococcus pyogenes SF370 (StrpyS), Thermoanaerobacter tengcongensis (Thete), and Ureaplasma urealyticum (Uerpa); 7 Actinobacteria: Bifidobacterium longum NCC2705 (Biflo), Corynebacterium efficiens YS-314 (Coref), Corynebacterium glutamicum (Corgl), Mycobacterium leprae TN (Mycle), Mycobacterium tuberculosis CDC1551 (MyctuC), Mycobacterium tuberculosis H37Rv (MyctuH), and Streptomyces coelicolor A3(2) (Strco); 5 chlamydia: Chlamydia muridarum (Chlmu), Chlamydia pneumoniae AR39 (ChlpnA), Chlamydia pneumoniae CWL029 (ChlpnC), Chlamydia pneumoniae J138 (ChlpnJ), and Chlamydia trachomatis (Chltr); 3 spirochaetes: Borrelia burgdorferi (Borbu), Leptospira interrogans serovar lai strain 56601 (Lepin), and Treponema pallidum (Trepa); and 1 fusobacteria: Fusobacterium nucleatum ATCC 25586 (Fusnu). We also included in the analysis six eukaryotes: Saccharomyces cerevisiae (yeast), Caenorhabditis elegans (Worm), Arabidopsis thaliana (Atha), Encephalitozoon cuniculi (Enccu), Plasmodium falciparum (Plafa), and Schizosaccharomyces pombe (Schpo). For the NCBI accession numbers of these genomes, one can refer to Qi et al. (2004b). The words in parentheses are the abbreviations of the names of these organisms used in our phylogenetic tree (Fig. 1).
Data Set 2 (Used in Chu et al. [2004])
We selected the following genomes of chloroplasts, archaea, eubacteria, and eukaryotes for chloroplast phylogenetic analysis. These include 21 chloroplast genomes (Cyanophora paradoxa, Cyanidium caldarium, Porphyra purpurea, Guillardia theta, Odontella sinensis, Euglena gracilis, Chlorella vulgaris, Nephroselmis olivacea, Mesostigma viride, Chaetosphaeridium globosum, Marchantia polymorpha, Psilotum nudum, Pinus thunbergii, Oenothera elata, Lotus japonicus, Spinacia oleracea, Nicotiana tabacum, Arabidopsis thaliana, Oryza sativa, Triticum aestivum, and Zea mays), 2 archaea genomes (Archaeoglobus fulgidus and Sulfolobus solfataricus), 8 eubacteria genomes (Helicobacter pylori, Neisseria meningitidis, Rickettsia prowazekii, Borrelia burgdorferi, Chlamydophila pneumoniae, Mycobacterium leprae, Nostoc sp., and Synechocystis sp.), and 3 eukaryote genomes (Saccharomyces cerevisiae, Arabidopsis thaliana, and Caenorhabditis elegans).
Composition Vectors and Distance Matrix
We base our analysis on all protein sequences, including hypothetical reading frames, from each genome, regarding sequences of the 20 amino acids as symbolic sequences. For such sequences there are a total of \( N = 20^K \) possible types of strings of length K. We use a window of length K and slide it through the sequences by shifting one position at a time to determine the frequencies of each of the N kinds of strings in each genome. A protein sequence is excluded if its length is shorter than K. For a single sequence of length L, the observed frequency \( p(\alpha_1 \alpha_2 \ldots \alpha_K) \) of a K-string \( \alpha_1 \alpha_2 \ldots \alpha_K \) is defined as \( p(\alpha_1 \alpha_2 \ldots \alpha_K) = n(\alpha_1 \alpha_2 \ldots \alpha_K)/(L - K + 1) \), where \( n(\alpha_1 \alpha_2 \ldots \alpha_K) \) is the number of times that \( \alpha_1 \alpha_2 \ldots \alpha_K \) appears in this sequence. Denoting by m the number of protein sequences from each complete genome, the observed frequency of a K-string \( \alpha_1 \alpha_2 \ldots \alpha_K \) for the genome is defined as \( \left( \sum\nolimits_{j=1}^{m} n_j(\alpha_1 \alpha_2 \ldots \alpha_K) \right) / \left( \sum\nolimits_{j=1}^{m} (L_j - K + 1) \right) \); here \( n_j(\alpha_1 \alpha_2 \ldots \alpha_K) \) is the number of times that \( \alpha_1 \alpha_2 \ldots \alpha_K \) appears in the jth protein sequence and \( L_j \) is the length of the jth protein sequence in this complete genome.
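The sliding-window counting described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' program: the function name `kstring_frequencies` and the toy sequences are hypothetical, and proteins are assumed to be given as plain strings of one-letter amino acid codes.

```python
from collections import Counter


def kstring_frequencies(proteins, k):
    """Observed frequency p of each K-string over all proteins of one genome.

    Hypothetical helper: `proteins` is a list of amino acid strings.
    """
    counts = Counter()
    total_windows = 0
    for seq in proteins:
        if len(seq) < k:
            continue  # a protein shorter than K is excluded
        n_windows = len(seq) - k + 1  # L_j - K + 1 window positions
        for i in range(n_windows):
            counts[seq[i:i + k]] += 1
        total_windows += n_windows
    if total_windows == 0:
        return {}
    # Pooled frequency: sum of counts divided by sum of (L_j - K + 1)
    return {word: n / total_windows for word, n in counts.items()}


# Toy "genome" with three made-up proteins; the last one is too short for K = 3.
toy_genome = ["MKVLA", "MKV", "AL"]
p3 = kstring_frequencies(toy_genome, 3)
```

Only K-strings that actually occur are stored here; strings that never occur have frequency 0, which avoids allocating all 20^K entries for larger K.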
The phylogenetic signal in the protein sequences is often obscured by noise and bias (Charlebois et al. 2003). There is always some randomness in the composition of protein sequences, revealed by their statistical properties at the single amino acid or oligopeptide level (see Weiss et al. [2000] for a recent discussion on this point). In order to highlight the selective diversification of sequence composition, we subtract the random background (noise and bias) from the simple counting results. In the present study, we consider an idea from the theory of dynamical language that a K-string \( \alpha_1 \alpha_2 \ldots \alpha_K \) is possibly constructed by adding a letter \( \alpha_K \) to the end of a (K−1)-string \( \alpha_1 \alpha_2 \ldots \alpha_{K-1} \) or a letter \( \alpha_1 \) to the beginning of a (K−1)-string \( \alpha_2 \alpha_3 \ldots \alpha_K \). Suppose that we have performed direct counting for all strings of length (K−1) and the 20 kinds of letters; then the expected frequency of appearance of K-strings is predicted by
\[ q(\alpha_1 \alpha_2 \ldots \alpha_K) = \frac{p(\alpha_1 \alpha_2 \ldots \alpha_{K-1})\, p(\alpha_K) + p(\alpha_1)\, p(\alpha_2 \alpha_3 \ldots \alpha_K)}{2}, \]
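where q denotes the predicted frequency, and \( p(\alpha_1) \) and \( p(\alpha_K) \) are the frequencies of amino acids \( \alpha_1 \) and \( \alpha_K \) appearing in this genome. [In the previous papers by our group (Qi et al. 2004b; Chu et al. 2004), we used a Markov model to characterize the predictor, in which we need to know the information of the (K−1)-strings and (K−2)-strings.] We then subtract the above random background (noise and bias) before performing a cross-correlation analysis (similar to removing a time-varying mean in time series before computing the cross-correlation of two time series). We then calculate a new measure X of the shaping role of selective evolution as
\[ X(\alpha_1 \alpha_2 \ldots \alpha_K) = \begin{cases} \dfrac{p(\alpha_1 \alpha_2 \ldots \alpha_K)}{q(\alpha_1 \alpha_2 \ldots \alpha_K)} - 1, & \text{if } q(\alpha_1 \alpha_2 \ldots \alpha_K) \ne 0, \\ 0, & \text{if } q(\alpha_1 \alpha_2 \ldots \alpha_K) = 0. \end{cases} \]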
The transformation X = p/q−1 has the desired effect of subtraction of random background (noise and bias) from p and rendering it a stationary time series suitable for subsequent cross-correlation analysis.
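The background subtraction can be illustrated with a short Python sketch. It rests on two assumptions that should be checked against the published formulas: that the predictor q averages the two ways of building the word, q = [p(α1…αK−1)·p(αK) + p(α1)·p(α2…αK)]/2, and that X is set to 0 whenever q = 0. The function name and the toy frequency tables are hypothetical.

```python
def composition_component(word, p_k, p_km1, p_1):
    """X = p/q - 1 for one K-string (assumed form of the predictor q).

    p_k, p_km1, p_1 are dicts of observed frequencies for K-strings,
    (K-1)-strings and single letters; missing keys count as frequency 0.
    """
    prefix, suffix = word[:-1], word[1:]
    # Assumed predictor: average of "prefix + last letter" and "first letter + suffix"
    q = 0.5 * (p_km1.get(prefix, 0.0) * p_1.get(word[-1], 0.0)
               + p_1.get(word[0], 0.0) * p_km1.get(suffix, 0.0))
    if q == 0.0:
        return 0.0  # assumed convention when the predictor vanishes
    return p_k.get(word, 0.0) / q - 1.0


# Toy example with K = 2 on a two-letter alphabet, so (K-1)-strings are letters.
p_letters = {"A": 0.5, "B": 0.5}
p_pairs = {"AB": 0.5, "BA": 0.5}
x_ab = composition_component("AB", p_pairs, p_letters, p_letters)
x_aa = composition_component("AA", p_pairs, p_letters, p_letters)
```

In this toy case "AB" occurs twice as often as the background predicts, giving X = 1, while "AA" never occurs although the background predicts it, giving X = −1.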
For all possible K-strings \( \alpha_1 \alpha_2 \ldots \alpha_K \), we use \( X(\alpha_1 \alpha_2 \ldots \alpha_K) \) as components to form a composition vector for a genome. To further simplify the notation, we use \( X_i \) for the ith component corresponding to the string type i, i = 1,..., N (the N strings are arranged in a fixed order, namely alphabetical order). Hence we construct a composition vector \( X = (X_1, X_2, \ldots, X_N) \) for genome X, and likewise \( Y = (Y_1, Y_2, \ldots, Y_N) \) for genome Y.
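One way to realize the fixed alphabetical ordering is to read each K-string as a base-20 number. The sketch below assumes one-letter amino acid codes sorted alphabetically and returns a 0-based position (the text numbers components from 1); the names `AMINO_ACIDS` and `kstring_index` are hypothetical.

```python
# The 20 standard one-letter amino acid codes in alphabetical order (assumed alphabet).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
LETTER_RANK = {letter: rank for rank, letter in enumerate(AMINO_ACIDS)}


def kstring_index(word):
    """0-based position of a K-string in alphabetical order over all 20**K strings."""
    index = 0
    for letter in word:
        index = index * 20 + LETTER_RANK[letter]  # base-20 positional encoding
    return index


first = kstring_index("AAA")   # first 3-string
second = kstring_index("AAC")  # next 3-string in alphabetical order
last = kstring_index("YYY")    # last of the 20**3 = 8000 3-strings
```

With such an index, the component for a given string lands in the same slot for every genome, which is what the correlation between two composition vectors requires.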
If we view the N components in vectors X and Y as samples of two zero-mean random variables, respectively, the sample correlation C(X,Y) between any two genomes X and Y is defined in the usual way in probability theory as \( C(X,Y) = \dfrac{\sum_{i=1}^{N} X_i Y_i}{\left( \sum_{i=1}^{N} X_i^2 \times \sum_{i=1}^{N} Y_i^2 \right)^{1/2}} \). The distance D(X,Y) between the two genomes is then defined by the equation D(X,Y) = (1−C(X,Y))/2. A distance matrix for all the genomes under study is then generated for construction of phylogenetic trees.
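The correlation and distance defined above translate directly into code. The following is a minimal sketch with a hypothetical function name; it assumes both composition vectors are given as equal-length sequences in the same string order, and it raises an error for an all-zero vector, a case the text does not discuss.

```python
import math


def correlation_distance(x, y):
    """D(X,Y) = (1 - C(X,Y)) / 2, with C the uncentered sample correlation."""
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x) * sum(b * b for b in y))
    if norm == 0.0:
        # Not covered in the text: the correlation is undefined for a zero vector.
        raise ValueError("correlation undefined for an all-zero composition vector")
    return (1.0 - dot / norm) / 2.0


d_same = correlation_distance([1.0, 2.0], [1.0, 2.0])        # identical vectors
d_opposite = correlation_distance([1.0, 2.0], [-1.0, -2.0])  # anti-correlated vectors
d_orthogonal = correlation_distance([1.0, 0.0], [0.0, 1.0])  # uncorrelated vectors
```

Since C lies between −1 and 1, the distance lies between 0 (identical direction) and 1 (opposite direction), with 0.5 for uncorrelated vectors.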
The vector p that we described is identical to the peptide frequency vector used by Stuart et al. (2002a). We have pointed out in our previous paper (Chu et al. 2004) that their method of structure removal is entirely different. Starting from the vector p, these authors used Singular Value Decomposition (SVD) and then Dimension Reduction on their constructed matrix. The correlation distance is then used to construct the tree. In the method used in Qi et al. (2004b) and Chu et al. (2004), we subtract the random background through a Markov model for q and X. The SVD step is much more complicated than the method proposed by Qi et al. (2004b) in both theoretical and practical considerations. In the present study, we subtract the random background through a dynamic language formula. We only need the information for all strings of length (K − 1) and the 20 kinds of letters instead of that for all strings of length (K − 1) and (K – 2) which is needed in the Markov model. Our new method is conceptually simpler than the one used in Qi et al. (2004b) and Chu et al. (2004).
Tree Construction and Computational Time
Qi et al. (2004a) pointed out that the Fitch–Margoliash (1967) method is not feasible when the number of species is as large as 100 or more, and that an algorithm such as maximum likelihood is not based on the distance matrices alone. So we construct all trees using the neighbor-joining (NJ) method (Saitou and Nei 1987) in the PHYLIP package (Felsenstein 1993).
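To pass a distance matrix to a distance-based tree program, it first has to be written in that program's input layout. The sketch below assumes PHYLIP's square distance-matrix layout (a first line with the number of taxa, then one row per taxon whose name occupies the first 10 characters); the function name `to_phylip_square` and the toy distances are hypothetical and not taken from the paper.

```python
def to_phylip_square(names, matrix):
    """Format a full symmetric distance matrix as PHYLIP-style text (assumed layout)."""
    lines = [f"{len(names):5d}"]
    for name, row in zip(names, matrix):
        label = name[:10].ljust(10)  # taxon name truncated/padded to 10 characters
        lines.append(label + " ".join(f"{d:.6f}" for d in row))
    return "\n".join(lines) + "\n"


# Made-up distances for three of the abbreviations used in the text.
toy_names = ["Aerpe", "Sulso", "Sulto"]
toy_matrix = [
    [0.0, 0.30, 0.32],
    [0.30, 0.0, 0.10],
    [0.32, 0.10, 0.0],
]
phylip_text = to_phylip_square(toy_names, toy_matrix)
```

The resulting text could be saved to a file and supplied to a neighbor-joining program; the short abbreviations used in the text fit within the 10-character name field.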
We used a PC (Intel Pentium 4 CPU 2.80 GHz, 512 MB of RAM) to calculate the distance matrices of the two data sets for different values of K using both the present method and the one proposed by Qi et al. (2004b). The times to run the programs are listed in Table 1. From Table 1, we see that the present method is computationally faster than the one proposed by Qi et al. (2004b) for K = 3, 4, and 5. For data set 2, in the case K = 6, the method of Qi et al. (2004b) is a little bit faster than the present method. And for the case K=6 of data set 1, we cannot perform either the present method or the one of Qi et al. (2004b) on our PC since this is beyond its computing capacity. For the case K=6 of data set 1, we only perform our method on the supercomputers at Xiangtan University and at Queensland University of Technology (it is not meaningful to compare the speed because the supercomputers are used by many users all the time) and compare the result with Fig. 1 in Qi et al. (2004b) directly.
Results and Discussion
In both the present method and the one of Qi et al. (2004b), K must be at least 3. We can only calculate the distance matrices and construct the trees for K from 3 to 6 because of the limitation on the computing capability of our PC and supercomputers. We find that the topology of the trees converges with K increasing from 3 to 6 and it becomes stable for K ≥ 5. Here we present the results based on K = 6 (Figs. 1 and 2). The distance matrices generated from this analysis can be provided via email: z.yu@qut.edu.au.
Figure 1 shows the K=6 tree based on the NJ analysis for the selected 109 organisms. The selected Archaea group together as a domain (except Pyrobaculum aerophilum). The six eukaryotes also cluster together as a domain. And all Eubacteria fall into another domain. So the division of life into three main domains, Eubacteria, Archaebacteria, and Eukarya, is a clean and prominent feature. At the interspecific level, it is clear that Archaea is divided into two groups, Euryarchaeota and Crenarchaeota. Different prokaryotes in the same group (Firmicutes, Actinobacteria, Cyanobacteria, Chlamydia, Hyperthermophilic bacteria) all cluster together. Proteobacteria (except the epsilon division) cluster together. In proteobacteria, prokaryotes from the alpha and epsilon divisions group with those from the same division. It is clear that the branch of Firmicutes is divided into subbranches Bacillales, Lactobacillales, Clostridia, and Mollicutes. Our phylogenetic tree of organisms supports the 16S-like rRNA tree of life in its broad division into three domains and the grouping of the various prokaryotes. So after subtracting the noise and bias from the protein sequences as described in our method, the whole-genome tree converges to the rRNA-sequence tree as asserted in Charlebois et al. (2003).
In the phylogenetic analyses based on a few genes, the tendency of the two hyperthermophilic bacteria, Aquae and Thema, to be placed in Archaea has intensified the debate on whether there have been widespread lateral or horizontal gene transfers among species (Doolittle 1999; Ragan 2001; Martin and Herrmann 1998). It is a consensus now that one should not equate a tree inferred from a single or a few genes to the organismic tree of life (Qi et al. 2004b; Pennisi 1999). Analyses of complete genomes suggest that lateral gene transfer has been rare over the course of evolution and it has not distorted the structure of the tree (Eisen and Fraser 2003). In our tree (Fig. 1) the two hyperthermophilic bacteria group together and stay in the domain of eubacteria. This result is the same as in Qi et al. (2004b) and also supports the point of view of Eisen and Fraser (2003).
Now we give a short comparison of our trees and those obtained by the method of Qi et al. (2004b) for data set 1 in the cases K=5 and 6. Generally speaking, the trees obtained by these two methods are quite similar if we fix the value of K.
1. For both methods and both cases, K=5 and 6, the genera from the Proteobacteria beta division are mixed into the proteobacteria gamma group; Leptospira stands outside the other two Spirochetes.
2. In the trees obtained by Qi et al. for both cases, K=5 and K=6, the gamma division splits into two subgroups; the Firmicutes are divided into two branches and these two branches are separated; the Rickettsia from the alpha division joins the small gamma group in the K=6 tree but stays within the whole alpha group in the K=5 tree using the method of Qi et al. These placement problems are overcome by the present method. In the trees obtained by our method for both cases, K=5 and K=6, the gamma division and Firmicutes are both monophyletic. Their phylogenetic placement accords with current understanding. The Rickettsia stays within the whole alpha division in both the K=5 and the K=6 trees by our method.
3. For both methods and for K=5, the epsilon division is separated from other divisions of proteobacteria. In the K=6 tree of Qi et al., the epsilon division joins into the proteobacteria branch. But in the K=6 tree by our method, the epsilon division is still separated from other divisions of proteobacteria.
4. In both the K=5 and the K=6 trees by our method, the Archaea Pyrobaculum aerophilum does not stay in the right place. It stays in the right place in both the K=5 and the K=6 trees by the method of Qi et al.
Figure 2 shows the K=6 tree based on NJ analysis for the chloroplasts (data set 2). All the chloroplast genomes form a clade branched in the Eubacteria domain and share a most recent common ancestor with cyanobacteria, which is in accordance with the widely accepted endosymbiotic theory that chloroplasts arose from a cyanobacterium-like ancestor (Gray 1992, 1999; McFadden 2001b). Apparently, despite massive gene transfer from the endosymbiont to the nucleus of the host cell (Martin and Herrmann 1998; Martin et al. 1998, 2002), our analysis is able to identify cyanobacteria as the most closely related prokaryotes of chloroplasts. The chloroplasts are separated into two major clades, one of which corresponds to the green plants sensu lato, or chlorophytes s.l. (Palmer and Delwiche 1998), which include all taxa with a chlorophyte chloroplast, of both primary and secondary endosymbiotic origin, and the other comprising the glaucophyte Cyanophora and members of rhodophytes s.l., which refers to rhodophytes (or red algae, Cyanidium and Porphyra in the tree) and their secondary symbiotic derivatives (the heterokont Odontella and the cryptophyte Guillardia). The close relationship between Cyanophora and rhodophytes s.l. (Cyanophora is mixed into rhodophytes s.l.) agrees with some of the previous analyses (Stirewalt et al. 1995; De Las Rivas et al. 2002), although most recent studies suggest that the glaucophyte represents the earliest branch in chloroplast evolution with the green plants s.l. and rhodophytes s.l. as sister taxa (Martin et al. 1998, 2002; Adachi et al. 2000; Moreira et al. 2000).
In chlorophytes s.l., the green algae (i.e., Chlorella, Mesostigma, and Nephroselmis) and Euglena are basal in position and the seed plants cluster together as a derived group, although the relationships among the other taxa (i.e., Marchantia, Psilotum, and Chaetosphaeridium) are somewhat different from our traditional understanding, probably due to limited taxon sampling in these primitive green plants. To sum up, our simple correlation analysis on the complete chloroplast genomes has yielded a tree that is in good agreement with our current knowledge on the phylogenetic relationships of different groups of photosynthetic eukaryotes in general (see Palmer and Delwiche [1998] and McFadden [2001a,b] for reviews). The only difference between the trees obtained by the present method and the one in Chu et al. (2004) is the placement of Pinus in the clade of chlorophytes s.l. (for K=5 and 6).
Our approach circumvents the ambiguity in the selection of genes from complete genomes for phylogenetic reconstruction, and is also faster than the traditional approaches of phylogenetic analysis, particularly when dealing with a large number of genomes. Moreover, since multiple sequence alignment is not used, the intrinsic problems associated with this complex procedure can be avoided. In contrast to a recent similar analysis on mitochondrial genomes based on compositional vectors (Stuart et al. 2002a), our approach does not require prior information on gene families in the genome and is also simpler in the method used for subtraction of random background from the data (see Materials and Methods). In the present method, we only need the information for all strings of length (K−1) and the 20 kinds of letters instead of that for all strings of length (K−1) and (K−2), which is needed in the Markov model (Qi et al. 2004b; Chu et al. 2004). Our new method is conceptually simpler than and computationally as fast as the one proposed by Qi et al. (2004b) and Chu et al. (2004). We have shown that this approach is applicable for analyzing the prokaryotes as well as the much smaller genomes of chloroplasts.
References
J Adachi PJ Waddell W Martin M Hasegawa (2000) ArticleTitlePlastid genome phylogeny and a model of amino acid substitution for proteins encoded by chloroplast DNA J Mol Evol 50 348–358 Occurrence Handle1:CAS:528:DC%2BD3cXivFyjur8%3D Occurrence Handle10795826
JR Brown WF Doolittle (1997) ArticleTitleArchaea and the prokaryote-to-eukaryote transition Microbiol Mol Biol Rev 61 456–502 Occurrence Handle1:CAS:528:DyaK2sXotVymsbk%3D Occurrence Handle9409149
RL Charlebois RG Beiko MA Ragan (2003) ArticleTitleBranching out Nature 421 217–217 Occurrence Handle10.1038/421217a Occurrence Handle1:CAS:528:DC%2BD3sXjsF2jtA%3D%3D Occurrence Handle12529621
E Chatton (1937) Titres et travaux scientiflques Sette, Sottano Italy
KH Chu J Qi ZG Yu VV Anh (2004) ArticleTitleOrigin and Phylogeny of Chloroplasts revealed by a simple correlation analysis of complete genome Mol Biol Evol 21 200–206 Occurrence Handle10.1093/molbev/msh002 Occurrence Handle1:CAS:528:DC%2BD2cXhvVKqsL0%3D Occurrence Handle14595102
J Las Rivas ParticleDe JJ Lozano AR Ortiz (2002) ArticleTitleComparative analysis of chloroplast genomes: Functional annotation, genome-based phylogeny, and deduced evolutionary patterns Genome Res 12 567–583 Occurrence Handle10.1101/gr.209402 Occurrence Handle11932241
RF Doolittle (1998) ArticleTitleMicrobial genomes opened up Nature 392 339–342 Occurrence Handle10.1038/32789 Occurrence Handle1:CAS:528:DyaK1cXit12ru7w%3D Occurrence Handle9537318
RF Doolittle (1999) ArticleTitlePhylogenetic classification and the universal tree Science 284 2124–2128 Occurrence Handle10.1126/science.284.5423.2124 Occurrence Handle1:CAS:528:DyaK1MXkt1Kgsbs%3D Occurrence Handle10381871
SV Edwards B Fertil A Giron P Deschavanne (2002) ArticleTitleA genomic schism in birds revealed by phylogenetic analysis of DNA strings Syst Biol 51 599–613 Occurrence Handle10.1080/10635150290102285 Occurrence Handle12228002
JA Eisen CM Fraser (2003) ArticleTitlePhylogenomics: intersection of evolution and genomics Science 300 1706–1707 Occurrence Handle10.1126/science.1086292 Occurrence Handle1:CAS:528:DC%2BD3sXksVKisb8%3D Occurrence Handle12805538
Felsenstein J (1993) PHYLIP (phylogeny inference package) version 3.5c. Distributed by the author at http://evolution.genetics.washington.edu/phylip.html
FitchWM E Margoliash (1967) ArticleTitleConstruction of phylogenetic trees Science 155 279–284 Occurrence Handle1:CAS:528:DyaF2sXnt1Gnsw%3D%3D Occurrence Handle5334057
ST Fitz-Gibbon CH House (1999) ArticleTitleWhole genome-based phylogenetic analysis of free-living microorganisms Nucleic Acids Res 27 4218–4222 Occurrence Handle10.1093/nar/27.21.4218 Occurrence Handle1:CAS:528:DyaK1MXnt1Gkur8%3D Occurrence Handle10518613
MW Gray (1992) ArticleTitleThe endosymbiont hypothesis revisited Int. Rev Cytol 141 233–357 Occurrence Handle1:STN:280:ByyD1cvkslI%3D Occurrence Handle1452433
MW Gray (1999) ArticleTitleEvolution of organellar genomes Curr Opin Genet Dev 9 678–687 Occurrence Handle1:CAS:528:DC%2BD3cXhslOitw%3D%3D Occurrence Handle10607615
RS Gupta (1998) ArticleTitleProtein phylogenies and signature sequences: A reappraisal of evolutionary relationships among Archaebacteria, Eubacteria, and Eukaryotes Microbiol Mol Biol Rev 62 1435–1491 Occurrence Handle1:CAS:528:DyaK1MXhs1OntQ%3D%3D Occurrence Handle9841678
C Lemieux C Otis M Turmel (2000) ArticleTitleAncestral chloroplast genome in Mesostigma viride reveals an early branch of green plant evolution Nature 403 649–652 Occurrence Handle10.1038/35001059 Occurrence Handle1:CAS:528:DC%2BD3cXht1Oqt7g%3D Occurrence Handle10688199
M Li JH Badger X Chen S Kwong P Kearney H Zhang (2001) ArticleTitleAn information-based sequence distance and its application to whole mitochondrial genome phylogeny Bioinformatics 17 149–154 Occurrence Handle10.1093/bioinformatics/17.2.149 Occurrence Handle1:CAS:528:DC%2BD3MXisFymsbY%3D Occurrence Handle11238070
J Lin M Gerstein (2000) ArticleTitleWhole-genome trees based on the occurrence of folds and orthologs, implications for comparing genomes at different levels Genome Res 10 808–818 Occurrence Handle10.1101/gr.10.6.808 Occurrence Handle1:CAS:528:DC%2BD3cXkt1eisLk%3D Occurrence Handle10854412
W Martin RG Herrmann (1998) ArticleTitleGene transfer from organelles to the nucleus: How much, what happens, and why? Plant Physiol 118 9–17 Occurrence Handle10.1104/pp.118.1.9 Occurrence Handle1:CAS:528:DyaK1cXmtV2msLs%3D Occurrence Handle9733521
W Martin B Stoebe V Goremykin S Hansmann M Hasegawa KV Kowallik (1998) ArticleTitleGene transfer to the nucleus and the evolution of chloroplasts Nature 393 162–165 Occurrence Handle1:CAS:528:DyaK1cXjt1ahsL0%3D Occurrence Handle11560168
W Martin T Rujan E Richly A Hansen S Cornelsen T Lins D Leister B Stoebe M Hasegawa D Penny (2002) ArticleTitleEvolutionary analysis of Arabidopsis, cyanobacterial, and chloroplast genomes reveals plastid phylogeny and thousands of cyanobacterial genes in the nucleus Proc Natl Acad Sci USA 99 12246–12251 Occurrence Handle1:CAS:528:DC%2BD38XntlCks70%3D Occurrence Handle12218172
E Mayr (1998) Two empires or three? Proc Natl Acad Sci USA 95:9720–9723. https://doi.org/10.1073/pnas.95.17.9720
GI McFadden (2001a) Primary and secondary endosymbiosis and the origin of plastids. J Phycol 37:951–959. https://doi.org/10.1046/j.1529-8817.2001.01126.x
GI McFadden (2001b) Chloroplast origin and integration. Plant Physiol 125:50–53. https://doi.org/10.1104/pp.125.1.50
DH Moreira, H Le Guyader, H Philippe (2000) The origin of red algae and the evolution of chloroplasts. Nature 405:69–72
JD Palmer, CF Delwiche (1998) The origin and evolution of plastids and their genomes. In: DE Soltis, PS Soltis, JJ Doyle (eds) Molecular systematics of plants II: DNA sequencing. Kluwer, London, pp 345–409
E Pennisi (1999) Is it time to uproot the tree of life? Science 284:1305–1308. https://doi.org/10.1126/science.284.5418.1305
J Qi, H Luo, B Hao (2004a) CVTree: a phylogenetic tree reconstruction tool based on whole genomes. Nucleic Acids Res 32:W45–W47. https://doi.org/10.1093/nar/gnh180
J Qi, B Wang, B Hao (2004b) Whole proteome prokaryote phylogeny without sequence alignment: a K-string composition approach. J Mol Evol 58:1–11. https://doi.org/10.1007/s00239-003-2493-7
MA Ragan (2001) Detection of lateral gene transfer among microbial genomes. Curr Opin Genet Dev 11:620–626
N Saitou, M Nei (1987) The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol Biol Evol 4:406–425
D Sankoff, G Leduc, N Antoine, B Paquin, BF Lang, R Cedergren (1992) Gene order comparisons for phylogenetic inference: Evolution of the mitochondrial genome. Proc Natl Acad Sci USA 89:6575–6579
VL Stirewalt, CB Michalowski, W Loffelhardt, HJ Bohnert, DA Bryant (1995) Nucleotide sequence of the cyanelle genome from Cyanophora paradoxa. Plant Mol Biol Rep 13:327–332
GW Stuart, K Moffett, S Baker (2002a) Integrated gene and species phylogenies from unaligned whole genome protein sequences. Bioinformatics 18:100–108. https://doi.org/10.1093/bioinformatics/18.1.100
GW Stuart, K Moffett, JJ Leader (2002b) A comprehensive vertebrate phylogeny using vector representations of protein sequences from whole genomes. Mol Biol Evol 19:554–562
F Tekaia, A Lazcano, B Dujon (1999) The genomic tree as revealed from whole proteome comparisons. Genome Res 9:550–557
M Turmel, C Otis, C Lemieux (1999) The complete chloroplast DNA sequence of the green alga Nephroselmis olivacea: Insights into the architecture of ancestral chloroplast genomes. Proc Natl Acad Sci USA 96:10248–10253. https://doi.org/10.1073/pnas.96.18.10248
M Turmel, C Otis, C Lemieux (2002) The chloroplast and mitochondrial genome sequences of the charophyte Chaetosphaeridium globosum: Insights into the timing of the events that restructured organelle DNAs within the green algal lineage that led to land plants. Proc Natl Acad Sci USA 99:11275–11280. https://doi.org/10.1073/pnas.162203299
O Weiss, MA Jimenez, H Herzel (2000) Information content of protein sequences. J Theor Biol 206:379–386. https://doi.org/10.1006/jtbi.2000.2138
CR Woese (1987) Bacterial evolution. Microbiol Rev 51:221–271
CR Woese, O Kandler, ML Wheelis (1990) Towards a natural system of organisms: Proposal for the domains Archaea, Bacteria, and Eucarya. Proc Natl Acad Sci USA 87:4576–4579
ZG Yu, P Jiang (2001) Distance, correlation and mutual information among portraits of organisms based on complete genomes. Phys Lett A 286:34–46. https://doi.org/10.1016/S0375-9601(01)00336-X
ZG Yu, V Anh, KS Lau (2003a) Multifractal and correlation analysis of protein sequences from complete genome. Phys Rev E 68:021913. https://doi.org/10.1103/PhysRevE.68.021913
ZG Yu, V Anh, KS Lau, KH Chu (2003b) The genomic tree of living organisms based on a fractal model. Phys Lett A 317:293–302. https://doi.org/10.1016/j.physleta.2003.08.040
ZG Yu, V Anh, KS Lau (2004) Chaos game representation, and multifractal and correlation analysis of protein sequences from complete genome based on detailed HP model. J Theor Biol 226:341–348. https://doi.org/10.1016/j.jtbi.2003.09.009
Acknowledgments
One of the authors, Zu-Guo Yu, would like to express his thanks to Dr. Ji Qi, ITP, Chinese Academy of Sciences, for useful discussions and for sharing his data and source code. Financial support was provided by the Youth Foundation of the Chinese National Natural Science Foundation (Grant 10101022) and Postdoctoral Research Support Grant 9900658 from Queensland University of Technology (Z.-G. Yu), and by the AoE Fund of The Chinese University of Hong Kong (K.H. Chu).
Additional information
Reviewing Editor: Dr. John Oakeshott
Cite this article
Yu, Z., Zhou, L., Anh, V. et al. Phylogeny of Prokaryotes and Chloroplasts Revealed by a Simple Composition Approach on All Protein Sequences from Complete Genomes Without Sequence Alignment. J Mol Evol 60, 538–545 (2005). https://doi.org/10.1007/s00239-004-0255-9