Introduction

In our understanding of the classification of the living world as a whole, the most important advance was made by Chatton (1937), who proposed that there are two major groups of organisms, the prokaryotes (bacteria) and the eukaryotes (organisms with nucleated cells). The universal tree of life based on the 16S-like rRNA genes given by Woese and colleagues (Woese 1987; Woese et al. 1990) then led to the proposal of three primary domains (Eukarya, Bacteria, and Archaea). Although the archaebacterial domain is accepted by biologists, its phylogenetic status is still a matter of controversy (Gupta 1998; Mayr 1998). Analysis of some genes, particularly those encoding metabolic enzymes, gives different phylogenies of the same organisms or even fails to support the three-domain classification of living organisms (Brown and Doolittle 1997; Doolittle 1998; Gupta 1998). In conventional comparative sequence analyses for phylogenetics, investigators all rely on a very small number of the same ribosomal genes. How accurate are the results? The only way to begin to answer this question is to conduct independent analyses, either with different genes or with different approaches. So, for evaluating relationships among taxa, it is worthwhile to consider alternatives to conventional comparative sequence analysis.

It is generally accepted that genome sequences are excellent tools for studying evolution (Eisen and Fraser 2003). In building the tree of life, analysis of whole genomes has begun to supplement, and in some cases to improve upon, studies previously done with one or a few genes (Eisen and Fraser 2003). The availability of complete genomes allows the reconstruction of organismal phylogeny, taking into account the genome content, for example, based on the rearrangement of gene order (Sankoff et al. 1992), the presence or absence of protein-coding gene families (Fitz-Gibbon and House 1999), gene content and overall similarity (Tekaia et al. 1999), and occurrence of folds and orthologs (Lin and Gerstein 2000). All these approaches depend on alignment of homologous sequences, and it is apparent that much information (such as gene rearrangement and insertions/deletions) in these data sets is lost after sequence alignment, let alone the intrinsic problems of alignment algorithms (Li et al. 2001; Stuart et al. 2002a, b). There have been a number of recent attempts to develop methodologies that do not require sequence alignment for deriving species phylogeny based on overall similarities of the complete genomes (e.g., Li et al. 2001; Yu and Jiang 2001; Yu et al. 2003a, b, 2004; Edwards et al. 2002; Stuart et al. 2002a, b).

By overcoming the problem of noise and bias in the protein sequences through the use of better models, whole-genome trees have now largely converged to the rRNA-sequence tree (Charlebois et al. 2003). Qi et al. (2004b) have developed a simple correlation analysis of complete genome sequences based on compositional vectors without the need of sequence alignment. The compositional vectors calculated from the frequency of amino acid strings are converted to distance values for all taxa, and the phylogenetic relationships are inferred from the distance matrix using conventional tree-building methods. An analysis based on this method using 109 organisms (prokaryotes and eukaryotes) yields a tree separating the three domains of life—Archaea, Eubacteria, and Eukarya—with the relationships among the taxa correlating with those based on traditional analyses (Qi et al. 2004b). A correlation analysis based on a different transformation of compositional vectors was also reported by Stuart et al. (2002a), who demonstrated the applicability of the method in revealing phylogeny using vertebrate mitochondrial genomes.

Chloroplast DNA is a primary source of molecular variations for phylogenetic analysis of photosynthetic eukaryotes. During the past decade the availability of complete chloroplast genome sequences has provided a wealth of information to elucidate the phylogeny of photosynthetic eukaryotes at the deep levels of evolution. There have been many phylogenetic analyses based on comparison of sequences of multiple protein-coding genes in chloroplast genomes (e.g., Martin et al. 1998, 2002; Turmel et al. 1999, 2002; Adachi et al. 2000; Lemieux et al. 2000; De Las Rivas et al. 2002). The approach proposed by Qi et al. (2004b) has also been adopted to analyze the complete chloroplast genomes (Chu et al., 2004) and found to reveal a phylogeny of this organelle that is largely consistent with the phylogeny of the photosynthetic eukaryotes based on traditional analyses, thus demonstrating the value of this methodology in analyzing genomes of a smaller size.

In the approach proposed by Qi et al. (2004b), a key step is to subtract the noise background in the composition vector of the protein sequences from complete genomes through a Markov model. In this study, we propose an alternative method to model the noise background in the composition vector through the relationship between a word and its two subwords in the theory of symbolic dynamics. This approach is conceptually simpler than, and computationally as fast as, the one proposed by Qi et al. (2004b). We apply our new approach to the phylogenetic analyses of the same data sets used in Qi et al. (2004b) (109 organisms) and in our previous paper (Chu et al. 2004) (chloroplast genomes). The results are as good as those previously reported by Qi et al. (2004b) and Chu et al. (2004).

Materials and Methods

Genome Data Sets

We used the complete genomes from the NCBI database (ftp://ncbi.nlm.nih.gov/genbank/genomes/).

Data Set 1 (Used in Qi et al. [2004b])

We selected 109 organisms for prokaryote phylogenetic analysis. These include 4 Archaea Crenarchaeota: Aeropyrum pernix (Aerpe), Sulfolobus solfataricus (Sulso), Sulfolobus tokodaii (Sulto), and Pyrobaculum aerophilum (Pyrae); 12 Archaea Euryarchaeota: Archaeoglobus fulgidus (Arcfu), Halobacterium sp. NRC-1 (Halsp), Methanosarcina acetivorans str. C2A (Metac), Methanococcus jannaschii (Metja), Methanopyrus kandleri AV19 (Metka), Methanosarcina mazei Goel (Metma), Methanobacterium thermoautotrophicum (Metth), Pyrococcus abyssi (Pyrab), Pyrococcus furiosus (Pyrfu), Pyrococcus horikoshii (Pyrho), Thermoplasma acidophilum (Theac), and Thermoplasma volcanium (Thevo); 2 hyperthermophilic bacteria: Aquifex aeolicus (Aquae) and Thermotoga maritima (Thema); 1 Deinococcus–Thermus: Deinococcus radiodurans R1 (Deira); 3 cyanobacteria: Cyanobacterium Nostoc sp. PCC7120 (Anasp), Cyanobacterium Synechocystis PCC6803 (Synpc), and Thermosynechococcus elongatus BP-1 (Theel); 1 green sulphur bacterium: Chlorobium tepidum TLS (Chlte); 9 Proteobacteria, alpha subdivision: Agrobacterium tumefaciens C58 (Agrt5), Agrobacterium tumefaciens C58 UWash (Agrt5 W), Brucella melitensis (Brume), Brucella suis 1330 (Brusu), Caulobacter crescentus (Caucr), Mesorhizobium loti (Rhilo), Sinorhizobium meliloti 1021 (Rhime), Rickettsia conorii (Riccn), and Rickettsia prowazekii (Ricpr); 3 Proteobacteria, beta subdivision: Neisseria meningitidis MC58 (NeimeM), Neisseria meningitidis Z2491 (NeimeZ), and Ralstonia solanacearum (Ralso); 22 Proteobacteria, gamma subdivision: Buchnera sp. APS (Bucai), Buchnera aphidicola Sg (Bucap), Escherichia coli CFT073 (Ecolic), Escherichia coli O157:H7 EDL933 (EcoliE), Escherichia coli K-12 (EcoliK), Escherichia coli O157:H7 (EcoliO), Haemophilus influenzae Rd (Haein), Pasteurella multocida PM70 (Pasmu), Pseudomonas aeruginosa PA01 (Pseae), Pseudomonas putida KT2440 (Psepu), Salmonella typhi (Salti), Salmonella typhimurium LT2 (Salty), Shewanella oneidensis MR-1 (Sheon), Shigella flexneri 2a strain 301 (Shifl), Vibrio cholerae (Vibch), Vibrio vulnificus CMCP6 (Vibvu), Wigglesworthia brevipalpis (Wigbr), Xanthomonas axonopodis citri 306 (Xanax), Xanthomonas campestris ATCC 33913 (Xanca), Xylella fastidiosa (Xylfa), Yersinia pestis strain C092 (YerpeC), and Yersinia pestis KIM (YerpeK); 3 Proteobacteria, epsilon subdivision: Campylobacter jejuni (Camje), Helicobacter pylori J99 (Helpj), and Helicobacter pylori 26695 (Helpy); 27 Firmicutes: Bacillus anthracis A2012 (Bacan), Bacillus halodurans (Bachd), Bacillus subtilis (Bacsu), Clostridium acetobutylicum ATCC824 (Cloab), Clostridium perfringens (Clope), Lactococcus lactis sp. IL 1403 (Lacla), Listeria monocytogenes EGD-e (Lisimo), Listeria innocua (Lisin), Mycoplasma genitalium (Mycge), Mycoplasma penetrans (Mycpe), Oceanobacillus iheyensis (Oceih), Mycoplasma pneumoniae (Mycpn), Mycoplasma pulmonis UAB CTIP (Mycpu), Staphylococcus aureus N315 (StaauN), Staphylococcus aureus Mu50 (StaauM), Staphylococcus epidermidis ATCC 12228 (Staep), Streptococcus agalactiae NEM316 (StragN), Streptococcus agalactiae 2603V/R (StragV), Streptococcus mutans UA159 (Strmu), Streptococcus pneumoniae R6 (StrpnR), Streptococcus pneumoniae TIGR4 (StrpnT), Streptococcus pyogenes MGAS8232 (Strpy8), Streptococcus pyogenes MGAS315 (StrpyG), Streptococcus pyogenes SF370 (StrpyS), Thermoanaerobacter tengcongensis (Thete), and Ureaplasma urealyticum (Uerpa); 7 Actinobacteria: Bifidobacterium longum NCC2705 (Biflo), Corynebacterium efficiens YS-314 (Coref), Corynebacterium glutamicum (Corgl), Mycobacterium leprae TN (Mycle), Mycobacterium tuberculosis CDC1551 (MyctuC), Mycobacterium tuberculosis H37Rv (MyctuH), and Streptomyces coelicolor A3(2) (Strco); 5 chlamydia: Chlamydia muridarum (Chlmu), Chlamydia pneumoniae AR39 (ChlpnA), Chlamydia pneumoniae CWL029 (ChlpnC), Chlamydia pneumoniae J138 (ChlpnJ), and Chlamydia trachomatis (Chltr); 3 spirochaetes: Borrelia burgdorferi (Borbu), Leptospira interrogans serovar lai strain 56601 (Lepin), and Treponema pallidum (Trepa); and 1 fusobacteria: Fusobacterium nucleatum ATCC 25586 (Fusnu). We also included in the analysis six eukaryotes: Saccharomyces cerevisiae (yeast), Caenorhabditis elegans (Worm), Arabidopsis thaliana (Atha), Encephalitozoon cuniculi (Enccu), Plasmodium falciparum (Plafa), and Schizosaccharomyces pombe (Schpo). For the NCBI accession numbers of these genomes, one can refer to Qi et al. (2004b). The words in parentheses are the abbreviations of the names of these organisms used in our phylogenetic tree (Fig. 1).

Figure 1. Phylogeny of 109 organisms (prokaryotes and eukaryotes) based on the new compositional approach in the case K=6.

Data Set 2 (Used in Chu et al. [2004])

We selected the following genomes of chloroplasts, archaea, eubacteria, and eukaryotes for chloroplast phylogenetic analysis. These include 21 chloroplast genomes (Cyanophora paradoxa, Cyanidium caldarium, Porphyra purpurea, Guillardia theta, Odontella sinensis, Euglena gracilis, Chlorella vulgaris, Nephroselmis olivacea, Mesostigma viride, Chaetosphaeridium globosum, Marchantia polymorpha, Psilotum nudum, Pinus thunbergii, Oenothera elata, Lotus japonicus, Spinacia oleracea, Nicotiana tabacum, Arabidopsis thaliana, Oryza sativa, Triticum aestivum, and Zea mays), 2 archaea genomes (Archaeoglobus fulgidus and Sulfolobus solfataricus), 8 eubacteria genomes (Helicobacter pylori, Neisseria meningitidis, Rickettsia prowazekii, Borrelia burgdorferi, Chlamydophila pneumoniae, Mycobacterium leprae, Nostoc sp., and Synechocystis sp.), and 3 eukaryote genomes (Saccharomyces cerevisiae, Arabidopsis thaliana, and Caenorhabditis elegans).

Composition Vectors and Distance Matrix

We base our analysis on all protein sequences, including hypothetical reading frames, from each genome, regarding sequences of the 20 amino acids as symbolic sequences. For such sequences there are a total of \( N = 20^K \) possible types of strings of length K. We use a window of length K and slide it through the sequences, shifting one position at a time, to determine the frequencies of each of the N kinds of strings in each genome. A protein sequence is excluded if its length is shorter than K. For a single sequence of length L, the observed frequency \( p(\alpha_1 \alpha_2 \ldots \alpha_K) \) of a K-string \( \alpha_1 \alpha_2 \ldots \alpha_K \) is defined as \( p(\alpha_1 \alpha_2 \ldots \alpha_K) = n(\alpha_1 \alpha_2 \ldots \alpha_K)/(L - K + 1) \), where \( n(\alpha_1 \alpha_2 \ldots \alpha_K) \) is the number of times that \( \alpha_1 \alpha_2 \ldots \alpha_K \) appears in this sequence. Denoting by m the number of protein sequences from each complete genome, the observed frequency of a K-string \( \alpha_1 \alpha_2 \ldots \alpha_K \) in the genome is defined as \( \left( \sum\nolimits_{j=1}^m n_j(\alpha_1 \alpha_2 \ldots \alpha_K) \right) / \left( \sum\nolimits_{j=1}^m (L_j - K + 1) \right) \); here \( n_j(\alpha_1 \alpha_2 \ldots \alpha_K) \) is the number of times that \( \alpha_1 \alpha_2 \ldots \alpha_K \) appears in the jth protein sequence and \( L_j \) is the length of the jth protein sequence in this complete genome.
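The sliding-window count described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `kstring_frequencies` and the toy sequences are hypothetical, and the sketch does not filter non-standard amino acid letters, which the text does not discuss.

```python
from collections import Counter


def kstring_frequencies(proteins, k):
    """Pooled observed frequency of every K-string seen in a genome.

    `proteins` is a list of amino acid strings (hypothetical input format).
    Frequencies are total counts divided by the sum of (L_j - K + 1).
    """
    counts = Counter()
    total_windows = 0
    for seq in proteins:
        if len(seq) < k:
            continue  # sequences shorter than K are excluded
        n_windows = len(seq) - k + 1
        for i in range(n_windows):
            counts[seq[i:i + k]] += 1  # window shifted one position at a time
        total_windows += n_windows
    if total_windows == 0:
        return {}
    return {s: c / total_windows for s, c in counts.items()}


# Toy "genome" of three short sequences; the last one is shorter than K.
toy_genome = ["MKVLA", "KVLAAG", "MK"]
freqs = kstring_frequencies(toy_genome, 3)
```

Only strings that actually occur are stored; strings that never occur have frequency zero by definition.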

The phylogenetic signal in the protein sequences is often obscured by noise and bias (Charlebois et al. 2003). There is always some randomness in the composition of protein sequences, revealed by their statistical properties at the single amino acid or oligopeptide level (see Weiss et al. [2000] for a recent discussion on this point). In order to highlight the selective diversification of sequence composition, we subtract the random background (noise and bias) from the simple counting results. In the present study, we consider an idea from the theory of dynamical language: a K-string \( \alpha_1 \alpha_2 \ldots \alpha_K \) can be constructed either by adding a letter \( \alpha_K \) to the end of the (K−1)-string \( \alpha_1 \alpha_2 \ldots \alpha_{K-1} \) or by adding a letter \( \alpha_1 \) to the beginning of the (K−1)-string \( \alpha_2 \ldots \alpha_K \). Suppose that we have performed direct counting for all strings of length (K−1) and the 20 kinds of letters; then the expected frequency of appearance of K-strings is predicted by

$$ q(\alpha_1 \alpha_2 \ldots \alpha_K) = \frac{p(\alpha_1 \alpha_2 \ldots \alpha_{K-1})\, p(\alpha_K) + p(\alpha_1)\, p(\alpha_2 \alpha_3 \ldots \alpha_K)}{2} $$

where q denotes the predicted frequency, and \( p(\alpha_1) \) and \( p(\alpha_K) \) are the frequencies of amino acids \( \alpha_1 \) and \( \alpha_K \) appearing in this genome. [In the previous papers by our group (Qi et al. 2004b; Chu et al. 2004), we used a Markov model to characterize the predictor, in which we need to know the information of the (K−1)-strings and (K−2)-strings.] We then subtract the above random background (noise and bias) before performing a cross-correlation analysis (similar to removing a time-varying mean in time series before computing the cross-correlation of two time series). We then calculate a new measure X of the shaping role of selective evolution as

$$ X(\alpha_1 \alpha_2 \ldots \alpha_K) = \begin{cases} \dfrac{p(\alpha_1 \alpha_2 \ldots \alpha_K)}{q(\alpha_1 \alpha_2 \ldots \alpha_K)} - 1 & \text{if } q(\alpha_1 \alpha_2 \ldots \alpha_K) \ne 0 \\ 0 & \text{if } q(\alpha_1 \alpha_2 \ldots \alpha_K) = 0 \end{cases} $$

The transformation X = p/q−1 has the desired effect of subtraction of random background (noise and bias) from p and rendering it a stationary time series suitable for subsequent cross-correlation analysis.
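As a rough illustration of the background subtraction, the sketch below computes q as the average of the two ways of building a K-string from a (K−1)-string and a single letter, then forms X = p/q − 1 (or 0 when q = 0). The function name `composition_vector`, the dictionary inputs, and the two-letter toy alphabet are assumptions made for readability; they are not taken from the authors' program. The vector is enumerated densely over all strings, which is practical only for small alphabets or small K.

```python
from itertools import product


def composition_vector(p_k, p_k1, p_1, alphabet, k):
    """Return X for every K-string over `alphabet`, in alphabetical order.

    p_k, p_k1, p_1: dicts of observed frequencies of K-strings,
    (K-1)-strings, and single letters (missing keys mean frequency 0).
    """
    x = []
    for letters in product(sorted(alphabet), repeat=k):
        s = "".join(letters)
        # q = [p(a1..a(K-1)) * p(aK) + p(a1) * p(a2..aK)] / 2
        q = (p_k1.get(s[:-1], 0.0) * p_1.get(s[-1], 0.0)
             + p_1.get(s[0], 0.0) * p_k1.get(s[1:], 0.0)) / 2.0
        x.append(p_k.get(s, 0.0) / q - 1.0 if q != 0.0 else 0.0)
    return x


# Toy case with K = 2, so the (K-1)-strings are the single letters.
p_1 = {"A": 0.5, "B": 0.5}
p_2 = {"AA": 0.5, "AB": 0.25, "BA": 0.25}  # "BB" never occurs
x = composition_vector(p_2, p_1, p_1, "AB", 2)
print(x)  # [1.0, 0.0, 0.0, -1.0]
```

In the toy case every string has q = 0.25, so the over-represented string AA gets X = 1, the two strings occurring at the expected rate get X = 0, and the absent string BB gets X = −1.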

For all possible K-strings \( \alpha_1 \alpha_2 \ldots \alpha_K \), we use \( X(\alpha_1 \alpha_2 \ldots \alpha_K) \) as components to form a composition vector for a genome. To further simplify the notation, we use \( X_i \) for the ith component corresponding to the string type i, i = 1, ..., N (the N strings are arranged in a fixed order, namely alphabetical order). Hence we construct a composition vector \( X = (X_1, X_2, \ldots, X_N) \) for genome X, and likewise \( Y = (Y_1, Y_2, \ldots, Y_N) \) for genome Y.

If we view the N components in vectors X and Y as samples of two zero-mean random variables, respectively, the sample correlation C(X,Y) between any two genomes X and Y is defined in the usual way in probability theory as \( C(X,Y) = \left( \sum_{i=1}^N X_i Y_i \right) / \left( \sum_{i=1}^N X_i^2 \times \sum_{i=1}^N Y_i^2 \right)^{1/2} \). The distance D(X,Y) between the two genomes is then defined by the equation D(X,Y) = (1−C(X,Y))/2. A distance matrix for all the genomes under study is then generated for construction of phylogenetic trees.
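The correlation and distance definitions translate directly into code. The following is a minimal sketch, assuming the composition vectors are given as equal-length lists in the same string order; the function names are hypothetical, and returning a correlation of 0 for an all-zero vector is a choice made here, since the text does not cover that case.

```python
import math


def correlation(x, y):
    """C(X,Y) = sum(Xi*Yi) / sqrt(sum(Xi^2) * sum(Yi^2))."""
    num = sum(a * b for a, b in zip(x, y))
    den = math.sqrt(sum(a * a for a in x) * sum(b * b for b in y))
    return num / den if den != 0.0 else 0.0  # assumption: 0 for a zero vector


def distance(x, y):
    """D(X,Y) = (1 - C(X,Y)) / 2, which lies between 0 and 1."""
    return (1.0 - correlation(x, y)) / 2.0


def distance_matrix(vectors):
    """Square matrix of pairwise distances for a list of composition vectors."""
    return [[distance(u, v) for v in vectors] for u in vectors]


vecs = [[1.0, 0.0, 0.0, -1.0], [-1.0, 0.0, 0.0, 1.0], [0.0, 1.0, 1.0, 0.0]]
d = distance_matrix(vecs)
```

Identical vectors give distance 0, exactly opposite vectors give distance 1, and uncorrelated vectors give 0.5.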

The vector p that we described is identical to the peptide frequency vector used by Stuart et al. (2002a). We have pointed out in our previous paper (Chu et al. 2004) that their method of structure removal is entirely different. Starting from the vector p, these authors used Singular Value Decomposition (SVD) and then Dimension Reduction on their constructed matrix. The correlation distance is then used to construct the tree. In the method used in Qi et al. (2004b) and Chu et al. (2004), we subtract the random background through a Markov model for q and X. The SVD step is much more complicated than the method proposed by Qi et al. (2004b) in both theoretical and practical considerations. In the present study, we subtract the random background through a dynamic language formula. We only need the information for all strings of length (K − 1) and the 20 kinds of letters instead of that for all strings of length (K − 1) and (K – 2) which is needed in the Markov model. Our new method is conceptually simpler than the one used in Qi et al. (2004b) and Chu et al. (2004).

Tree Construction and Computational Time

Qi et al. (2004a) pointed out that the Fitch–Margoliash (1967) method is not feasible when the number of species is as large as 100 or more, and that an algorithm such as maximum likelihood is not based on the distance matrices alone. We therefore construct all trees using the neighbor-joining (NJ) method (Saitou and Nei 1987) in the PHYLIP package (Felsenstein 1993).
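To pass a distance matrix to a distance-based program in PHYLIP, it has to be written as a text file. The sketch below writes a square matrix in the layout commonly used for PHYLIP distance input (a first line with the number of taxa, then one row per taxon with the name padded to 10 characters). The function name `to_phylip` and the sample values are hypothetical, and the authors' actual file handling is not described in the text.

```python
def to_phylip(names, matrix):
    """Format a square distance matrix as PHYLIP-style text.

    Names are truncated or padded to exactly 10 characters, which is
    the fixed name width expected by classic PHYLIP programs.
    """
    lines = [f"{len(names):5d}"]
    for name, row in zip(names, matrix):
        label = name[:10].ljust(10)
        lines.append(label + " ".join(f"{d:.6f}" for d in row))
    return "\n".join(lines) + "\n"


# Illustrative values only; not distances from the paper.
names = ["Aerpe", "Sulso", "Pyrae"]
matrix = [[0.0, 0.3, 0.4],
          [0.3, 0.0, 0.35],
          [0.4, 0.35, 0.0]]
text = to_phylip(names, matrix)
print(text)
```

The short organism abbreviations used in this paper all fit within the 10-character name field.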

We used a PC (Intel Pentium 4 CPU 2.80 GHz, 512 MB of RAM) to calculate the distance matrices of the two data sets for different values of K using both the present method and the one proposed by Qi et al. (2004b). The times to run the programs are listed in Table 1. From Table 1, we see that the present method is computationally faster than the one proposed by Qi et al. (2004b) for K = 3, 4, and 5. For data set 2, in the case K = 6, the method of Qi et al. (2004b) is a little bit faster than the present method. And for the case K=6 of data set 1, we cannot perform either the present method or the one of Qi et al. (2004b) on our PC since this is beyond its computing capacity. For the case K=6 of data set 1, we only perform our method on the supercomputers at Xiangtan University and at Queensland University of Technology (it is not meaningful to compare the speed because the supercomputers are used by many users all the time) and compare the result with Fig. 1 in Qi et al. (2004b) directly.
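A back-of-envelope calculation suggests why K = 6 is demanding. The sketch below assumes, purely for illustration, that each composition vector is stored densely with one 8-byte value per possible K-string; the text does not say how the authors' program stores its vectors, and a sparse representation would need less memory.

```python
def dense_vector_megabytes(k, alphabet_size=20, bytes_per_value=8):
    """Memory for one dense vector of alphabet_size**k values, in MB (10^6 bytes)."""
    return alphabet_size ** k * bytes_per_value / 1e6


for k in range(3, 7):
    n_strings = 20 ** k
    print(f"K={k}: N={n_strings:>10,d} strings, "
          f"about {dense_vector_megabytes(k):.2f} MB per dense vector")
```

Under this assumption, N grows from 8,000 strings at K = 3 to 64,000,000 at K = 6, where a single dense vector would already take about 512 MB.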

Table 1 Speed comparison of the present method and the one of Qi et al. (2004b)

Results and Discussion

In both the present method and the one of Qi et al. (2004b), K must be at least 3. We can only calculate the distance matrices and construct the trees for K from 3 to 6 because of the limitation on the computing capability of our PC and supercomputers. We find that the topology of the trees converges as K increases from 3 to 6 and becomes stable for K ≥ 5. Here we present the results based on K = 6 (Figs. 1 and 2). The distance matrices generated from this analysis can be provided via email: z.yu@qut.edu.au.

Figure 2. Phylogeny of chloroplast genomes based on the new compositional approach in the case K=6.

Figure 1 shows the K=6 tree based on the NJ analysis for the selected 109 organisms. The selected Archaea group together as a domain (except Pyrobaculum aerophilum). The six eukaryotes also cluster together as a domain, and all Eubacteria fall into another domain. So the division of life into three main domains, Eubacteria, Archaebacteria, and Eukarya, is a clean and prominent feature. At the interspecific level, it is clear that Archaea is divided into two groups, Euryarchaeota and Crenarchaeota. Different prokaryotes in the same group (Firmicutes, Actinobacteria, Cyanobacteria, Chlamydia, hyperthermophilic bacteria) all cluster together. Proteobacteria (except the epsilon division) cluster together. Within Proteobacteria, prokaryotes from the alpha and epsilon divisions group with those from the same division. It is clear that the branch of Firmicutes is divided into the subbranches Bacillales, Lactobacillales, Clostridia, and Mollicutes. Our phylogenetic tree of organisms supports the 16S-like rRNA tree of life in its broad division into three domains and the grouping of the various prokaryotes. So after subtracting the noise and bias from the protein sequences as described in our method, the whole-genome tree converges to the rRNA-sequence tree, as asserted in Charlebois et al. (2003).

In the phylogenetic analyses based on a few genes, the tendency of the two hyperthermophilic bacteria, Aquae and Thema, to be placed in Archaea has intensified the debate on whether there have been widespread lateral or horizontal gene transfers among species (Doolittle 1999; Ragan 2001; Martin and Herrmann 1998). It is now a consensus that one should not equate a tree inferred from a single gene or a few genes with the organismic tree of life (Qi et al. 2004b; Pennisi 1999). Analyses of complete genomes suggest that lateral gene transfer has been rare over the course of evolution and has not distorted the structure of the tree (Eisen and Fraser 2003). In our tree (Fig. 1) the two hyperthermophilic bacteria group together and stay in the domain of eubacteria. This result is the same as in Qi et al. (2004b) and also supports the point of view of Eisen and Fraser (2003).

Now we give a short comparison of our trees and those obtained by the method of Qi et al. (2004b) for data set 1 in the cases K=5 and 6. Generally speaking, the trees obtained by these two methods are quite similar if we fix the value of K.

  1. For both methods and both cases, K=5 and 6, the genera from the Proteobacteria beta division are mixed into the Proteobacteria gamma group; Leptospira stands outside the other two Spirochetes.

  2. In the trees obtained by Qi et al. for both cases, K=5 and K=6, the gamma division splits into two subgroups, and the Firmicutes are divided into two branches that are separated from each other; the Rickettsia from the alpha division joins the small gamma group in the K=6 tree but stays within the whole alpha group in the K=5 tree using the method of Qi et al. These placement problems are overcome by the present method. In the trees obtained by our method for both cases, K=5 and K=6, the gamma division and Firmicutes are both monophyletic. Their phylogenetic placement accords with current understanding. The Rickettsia stays within the whole alpha division in both the K=5 and the K=6 trees by our method.

  3. For both methods and for K=5, the epsilon division is separated from other divisions of Proteobacteria. In the K=6 tree of Qi et al., the epsilon division joins the Proteobacteria branch. But in the K=6 tree by our method, the epsilon division is still separated from other divisions of Proteobacteria.

  4. In both the K=5 and the K=6 trees by our method, the Archaea Pyrobaculum aerophilum is not placed correctly. It is placed correctly in both the K=5 and the K=6 trees by the method of Qi et al.

Figure 2 shows the K=6 tree based on NJ analysis for the chloroplasts (data set 2). All the chloroplast genomes form a clade branched in the Eubacteria domain and share a most recent common ancestor with cyanobacteria, which is in accordance with the widely accepted endosymbiotic theory that chloroplasts arose from a cyanobacterium-like ancestor (Gray 1992, 1999; McFadden 2001b). Apparently, despite massive gene transfer from the endosymbiont to the nucleus of the host cell (Martin and Herrmann 1998; Martin et al. 1998, 2002), our analysis is able to identify cyanobacteria as the most closely related prokaryotes of chloroplasts. The chloroplasts are separated into two major clades. One corresponds to the green plants sensu lato, or chlorophytes s.l. (Palmer and Delwiche 1998), which include all taxa with a chlorophyte chloroplast, of both primary and secondary endosymbiotic origin. The other comprises the glaucophyte Cyanophora and members of rhodophytes s.l., which refers to rhodophytes (or red algae, Cyanidium and Porphyra in the tree) and their secondary symbiotic derivatives (the heterokont Odontella and the cryptophyte Guillardia). The close relationship between Cyanophora and rhodophytes s.l. (Cyanophora is mixed into rhodophytes s.l.) agrees with some of the previous analyses (Stirewalt et al. 1995; De Las Rivas et al. 2002), although most recent studies suggest that the glaucophyte represents the earliest branch in chloroplast evolution, with the green plants s.l. and rhodophytes s.l. as sister taxa (Martin et al. 1998, 2002; Adachi et al. 2000; Moreira et al. 2000). In chlorophytes s.l., the green algae (i.e., Chlorella, Mesostigma, and Nephroselmis) and Euglena are basal in position and the seed plants cluster together as a derived group, although the relationships among the other taxa (i.e., Marchantia, Psilotum, and Chaetosphaeridium) are somewhat different from our traditional understanding, probably due to limited taxon sampling in these primitive green plants. To sum up, our simple correlation analysis on the complete chloroplast genomes has yielded a tree that is in good agreement with our current knowledge of the phylogenetic relationships of different groups of photosynthetic eukaryotes in general (see Palmer and Delwiche [1998] and McFadden [2001a,b] for reviews). The only difference between the trees obtained by the present method and the one in Chu et al. (2004) is the placement of Pinus in the clade of chlorophytes s.l. (for K=5 and 6).

Our approach circumvents the ambiguity in the selection of genes from complete genomes for phylogenetic reconstruction, and is also faster than the traditional approaches of phylogenetic analysis, particularly when dealing with a large number of genomes. Moreover, since multiple sequence alignment is not used, the intrinsic problems associated with this complex procedure can be avoided. In contrast to a recent similar analysis on mitochondrial genomes based on compositional vectors (Stuart et al. 2002a), our approach does not require prior information on gene families in the genome and is also simpler in the method used for subtraction of random background from the data (see Materials and Methods). In the present method, we only need the information for all strings of length (K−1) and the 20 kinds of letters, instead of that for all strings of length (K−1) and (K−2), which is needed in the Markov model (Qi et al. 2004b; Chu et al. 2004). Our new method is conceptually simpler than, and computationally as fast as, the one proposed by Qi et al. (2004b) and Chu et al. (2004). We have shown that this approach is applicable for analyzing the prokaryotes as well as the much smaller genomes of chloroplasts.