Abstract
The character compatibility approach, which removes all homoplasic characters and involves finding the largest clique of compatible characters in a dataset, in principle, provides a powerful means for obtaining correct topology in difficult to resolve cases. However, the usefulness of this approach to generalized molecular sequence data for phylogeny determination has not been studied in the past. We have used this approach to determine the topology of 23 proteobacterial species (6 each of α-, β- and γ-, 3 δ-, and 2 ε-proteobacteria) using sequence data for 10 conserved proteins (Hsp60, Hsp70, EF-Tu, EF-G, alanyl-tRNA synthetase, RecA, GyrA, GyrB, RpoB and RpoC). All sites in the sequence alignments of these proteins where only two amino acids were found, with each amino acid present in at least two species, were selected. Mutual compatibility determination on these binary state sites was carried out by two means. In one case, all of these sites were combined into a large dataset (Set A; 957 characters) prior to compatibility analysis. In the second case, compatibility analysis was carried out on characters from individual proteins and all compatible sites were combined into a large dataset (Set B; 398 characters) for further studies. Upon compatibility analyses, the largest cliques that were obtained from Sets A and B consisted of 337 and 323 compatible characters, respectively. In these cliques, all proteobacterial subgroups were clearly distinguished and branching orders of most of the species were also resolved. The ε-proteobacteria exhibited the earliest branching, whereas the β- and γ-subgroups were found to have emerged last. The relative placement of the α- and δ-subgroups, however, was not resolved. The topology of these species was also determined based on 16S rRNA sequences and a concatenated dataset of sequences for all 10 proteins by means of neighbor-joining, maximum likelihood, and maximum parsimony methods. 
In the protein trees, all proteobacterial groups were reliably resolved and they branched in the following order: (ε(δ(α(β,γ)))). However, in the rRNA trees, the γ- and β-subgroups exhibited polyphyletic branching and many internal nodes were not resolved. These results indicate that the character compatibility analysis using generalized molecular sequence data provides a powerful means for evolutionary studies. Based on molecular sequences, it should be possible to obtain very large datasets of compatible characters that should prove very helpful in clarifying difficult to resolve phylogenetic relationships.
Introduction
Phylogenetic relationships among species can be inferred based on a variety of methods. These include distance matrix, parsimony, maximum likelihood (ML) and Bayesian methods, supertree construction involving matrix representation and conditioned reconstruction, as well as methods based on discrete characters (Felsenstein 1981b; Kishino and Hasegawa 1989; Fitch 1971; Saitou and Nei 1987; Huelsenbeck and Bollback 2001; Lake and Rivera 2004; Felsenstein 2004; Semple and Steel 2003; Wilkinson et al. 2005). The relative efficiencies of these methods in obtaining the correct tree topology are affected by a variety of factors and underlying assumptions (Hasegawa and Fujiwara 1993; Tateno et al. 1994; Felsenstein 1978, 2004; Penny 1976; Wilkinson et al. 2005; Semple and Steel 2003). An earlier study by Felsenstein (1981a) suggests that in cases where some characters in the dataset have evolved rapidly while others are evolving more slowly, the character compatibility approach may give better results. The character compatibility methods are intended to remove misleading or fast-evolving characters from the dataset, and in theory, they should reduce the chance of obtaining an incorrect topology (Felsenstein 2004; Le Quesne 1975; Sneath et al. 1975; Estabrook et al. 1976; Pisani 2004; Meacham and Estabrook 1985). The character compatibility approach can be used for a variety of applications including ranking or weighting characters, identifying and removing fast-evolving sites or problematic taxa from phylogenetic analysis (Pisani 2004), ordering multistate characters, and building supertrees based on matrix representation of trees (Wilkinson et al. 2005). (See Wilkinson [2001] for other references and applications of compatibility methods.) However, despite some of the advantages that they offer, the character compatibility methods have been little used in phylogenetic studies (Pisani 2004; Felsenstein 2004; Wilkinson 2001).
Compatibility methods were proposed by Le Quesne (1969, 1975) and Wilson (1965), who pointed out that if two characters exhibit certain patterns of occurrence in the organisms under study, then it is not possible to construct a phylogenetic tree that shows only a single change (or mutation) for each of the two characters. Instead, it is necessary to postulate that one or the other of the two has had a double mutation or a backward mutation (or several such) or has resulted from nonspecific means (such as lateral gene transfer). It is not possible to tell a priori which of the characters is “bad” (in the sense of being potentially misleading), but Le Quesne (1969) suggested that if one compared all pairs of characters, one could identify those characters that had numerous incompatibilities with others, and these would usually be bad in this sense. If these were deleted, one would be left with characters that had hopefully undergone only a single mutation and that, therefore, were good guides to phylogeny. By proceeding in this way, Le Quesne (1969, 1975) showed that a set of characters could be obtained where every character is compatible with every other. A tree based on such mutually compatible character sets is commonly referred to as a clique (Felsenstein 2004; Meacham and Estabrook 1985; Wilkinson 2001). These concepts have been further developed by several investigators (Estabrook et al. 1976; Estabrook and McMorris 1980; Buneman 1971; Sneath et al. 1975; Wilson 1965; Meacham and Estabrook 1985) to prove the Pairwise Compatibility Theorem. This theorem states that provided all characters are binary (i.e., have only two states, such as 0 or 1, or present or absent, or can be recoded into this form), if all pairs are compatible, the entire clique is compatible with a single tree. 
For characters with three or more states, such as those commonly found in nucleotide or protein sequences, pairwise compatibility does not ensure mutual compatibility (Fitch 1975; Meacham and Estabrook 1985; Felsenstein 2004); hence it is difficult to use them for compatibility or clique analysis. However, if one deletes or omits the multistate characters from protein (or nucleic acid) sequences, it should still be possible to obtain sufficient numbers of useful sites (particularly from combined datasets for several genes/proteins) that can be used for compatibility or clique analysis.
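The Pairwise Compatibility Theorem for binary characters can be illustrated with a short sketch. It assumes that each character is polarised against an outgroup taxon, so that the taxa carrying the non-outgroup state form a putative clade. Under that assumption, two binary characters are compatible exactly when their clades are nested or disjoint, and a set of such clades can be written directly as a tree. The function names (`derived_sets`, `is_nested_or_disjoint`, `newick`) and the toy taxa are hypothetical and are not part of any program used in this work.

```python
def derived_sets(taxa, characters, outgroup=0):
    """Polarise each binary character against the outgroup taxon and
    return the set of taxa carrying the non-outgroup state."""
    sets = set()
    for ch in characters:
        ancestral = ch[outgroup]
        sets.add(frozenset(t for t, s in zip(taxa, ch) if s != ancestral))
    return sets


def is_nested_or_disjoint(sets):
    """True if every pair of sets is nested or disjoint (a hierarchy)."""
    sets = list(sets)
    return all(a <= b or b <= a or not (a & b)
               for i, a in enumerate(sets) for b in sets[i + 1:])


def newick(taxa, sets):
    """Write the tree implied by a hierarchy of clades in Newick form."""
    if not is_nested_or_disjoint(sets):
        raise ValueError("characters are not mutually compatible")
    order = {t: i for i, t in enumerate(taxa)}

    def build(members, inner):
        # Pick the largest clades first; smaller clades nested inside an
        # already chosen child are handled by the recursive call.
        children, covered = [], set()
        for c in sorted(inner, key=lambda c: (-len(c), min(order[t] for t in c))):
            if not (c & covered):
                children.append(c)
                covered |= c
        leaves = sorted(members - covered, key=order.get)
        parts = leaves + [build(c, [d for d in inner if d < c]) for c in children]
        return "(" + ",".join(parts) + ")"

    all_taxa = frozenset(taxa)
    # Constant and single-taxon characters do not define internal branches.
    inner = [s for s in sets if 1 < len(s) < len(all_taxa)]
    return build(all_taxa, inner) + ";"


taxa = ["Out", "A", "B", "C", "D"]
print(newick(taxa, derived_sets(taxa, ["00011", "00111", "01111"])))
# prints (Out,(A,(B,(C,D))));
```

The three toy characters are pairwise compatible, and the single tree they imply is recovered without any search. This is the property that fails in general for characters with three or more states.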
Compatibility analysis with molecular sequences can also be carried out with rare genomic changes such as conserved insertions and deletions (i.e., indels) in protein sequences (Griffiths and Gupta 2004; Gupta 1998, 2003; Gupta and Griffiths 2002) or the order of various genes in different genomes (Kunisawa 2001, 2006). Based on the presence or absence of mutually compatible conserved indels, or arrangement of different genes, relationships among different groups can be deduced. One of us has made extensive use of conserved indels in protein sequences for inferring relationships among bacterial taxa (Griffiths and Gupta 2004; Gupta 1998, 2003; Gupta and Griffiths 2002). However, one limitation of the above kinds of genetic characteristics for compatibility analysis is that such characters are rare and they are also not found in all gene sequences. Hence, one cannot always use them to study relationships among the desired groups of species. In contrast, the compatibility analysis of two state characters in molecular sequences provides a more general approach that should be universally applicable. The application of this approach may prove helpful in clarifying the topology or branching patterns in cases that have proven difficult to resolve by phylogenetic treeing methods (Creevey et al. 2004; Daubin et al. 2002; Gophna et al. 2005; Delsuc et al. 2005; Gupta and Griffiths 2002; Gupta 1995).
Although the character compatibility approach, which in the present work refers specifically to clique analysis, offers a promising means for determining the relationships among different taxa, this approach has thus far not been used with generalized molecular sequence data to assess its usefulness and phylogenetic reliability. In this work, we have applied this approach to molecular sequence data for a number of conserved proteins to explore the topology of bacterial species belonging to different subdivisions within proteobacteria. Proteobacteria comprises a very large group among bacteria, accounting for nearly half of the cultured bacteria (Kersters et al. 2003; Gupta 2000; Stackebrandt et al. 1988; Olsen et al. 1994; De Ley 1992) and genomic sequences for a large number of proteobacterial species are now available (see http://www.ncbi.nlm.nih.gov/genomes/lproks.cgi). Thus, they provide a good test case to compare and assess the reliability of phylogeny as determined by the character compatibility method. In the present work, we have compared the results obtained using the character compatibility analyses with those based on phylogenetic trees for 16S rRNA and concatenated protein sequences to assess the reliability of this approach.
Materials and Methods
Compatibility Analysis
Multiple sequence alignments for various proteins for different species were created using the Clustal x program (Jeanmougin et al. 1998). Using a computer program, DUALSITE, that was developed for this work, those sites in the sequence alignments where only two amino acid states were found, with each state present in at least two species, were selected. For this purpose, all columns where a gap was present in any of the species were omitted. The DUALSITE program can also identify sites where one of the states is present in only a single species. Although such sites are not useful for compatibility analysis, they provide information regarding the lengths of the terminal branches. This information was separately computed and noted on the terminal branches in the tree. The DUALSITE program also converts all useful two-state sites into a binary file of “0, 1” that is suitable for compatibility analysis. The main program that we have used for compatibility analysis is CLIQUE from the PHYLIP (v. 3.5c) program package (Felsenstein 1993). Compatibility analyses on the large combined datasets from 10 proteins were carried out using this program, the largest cliques of compatible characters obtained were drawn, and the numbers of characters that distinguished different nodes were indicated. In addition to carrying out compatibility analysis on two-state characters from all 10 proteins, similar analyses were also carried out on individual protein sequences using the HARMONY program developed in this work. The largest clique identified by this program had the same number of characters as that obtained from the CLIQUE program. However, the HARMONY program also extracts the compatible characters for the largest clique into a file, which can then be combined with characters from other protein sequences to generate larger datasets; these were analyzed for overall compatibility by the CLIQUE program.
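The site-selection step described above can be sketched as follows. This is a minimal illustration of the stated criteria (no gaps in the column, exactly two amino acid states, each state present in at least two species), not the DUALSITE program itself. The function name `select_two_state_sites`, its parameters, and the convention of coding the state of the first sequence as 0 are assumptions made for the example.

```python
from collections import Counter


def select_two_state_sites(alignment, min_per_state=2, gap_chars="-."):
    """Return (column_index, binary_string) for each useful two-state column.

    alignment      list of aligned sequences of equal length
    min_per_state  minimum number of sequences that must carry each state
    gap_chars      characters treated as gaps (assumed to be '-' and '.')
    """
    if not alignment:
        return []
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")

    selected = []
    for col in range(length):
        column = [seq[col] for seq in alignment]
        if any(ch in gap_chars for ch in column):
            continue  # columns with a gap in any species are omitted
        counts = Counter(column)
        if len(counts) != 2:
            continue  # keep only sites with exactly two states
        if min(counts.values()) < min_per_state:
            continue  # rarer state must occur in enough species
        zero_state = column[0]  # arbitrary coding: first sequence's state is 0
        binary = "".join("0" if ch == zero_state else "1" for ch in column)
        selected.append((col, binary))
    return selected


toy_alignment = ["MKAVG", "MKAVG", "MRSVG", "MRSIG", "MRS-G"]
print(select_two_state_sites(toy_alignment))
# prints [(1, '00111'), (2, '00111')]
```

Calling the function with `min_per_state=1` also returns sites where one state occurs in a single species, which corresponds to the information used for terminal branch lengths.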
Phylogenetic Analysis
The 16S rRNA sequences for various species listed in Table 1 were downloaded from the Ribosomal Database Project-II site (Maidak et al. 2001) and aligned using the CLUSTAL x program (Jeanmougin et al. 1998). A neighbor-joining (NJ) tree was constructed based on distances calculated using Kimura’s (1980) two-parameter model. An ML tree for the rRNA sequences was computed under the HKY model using the TREE-PUZZLE 5.2 program (Schmidt et al. 2002). The amino acid sequences for the 10 proteins that we have used as the test set (viz., RNA polymerase β subunit [RpoB], RNA polymerase β′ subunit [RpoC], alanyl-tRNA synthetase [AlaRS], elongation factor-Tu [EF-Tu], elongation factor G [EF-G], RecA protein, DNA gyrase subunit A [GyrA], DNA gyrase subunit B [GyrB], Hsp60 or GroEL protein, and DnaK or Hsp70 protein) were downloaded and aligned using the CLUSTAL x program. For phylogenetic analysis, the sequences for all 10 proteins were concatenated into a single large alignment file containing 7977 sites. An NJ tree based on this sequence alignment (bootstrapped 1000 times) was constructed based on Kimura’s (1983) model using the TREECON programs (Van de Peer and De Wachter 1994). An ML tree based on this dataset was computed using the WAG+F model plus a gamma distribution with four categories (Whelan and Goldman 2001) using TREE-PUZZLE (Schmidt et al. 2002). Maximum parsimony (MP) trees were constructed using the Mega 3.1 program (Kumar et al. 2004). All of the trees were bootstrapped 100 times (Felsenstein 1985), unless otherwise indicated.
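For readers unfamiliar with the two-parameter correction used for the rRNA distances, the standard formula is d = -(1/2) ln(1 - 2P - Q) - (1/4) ln(1 - 2Q), where P and Q are the observed proportions of transitions and transversions between two aligned sequences. The sketch below computes it directly. The function name `k2p_distance` and the choice to skip gaps and ambiguity codes are assumptions for illustration and do not describe how the tree-building programs cited above handle these cases.

```python
import math

# Purine-purine and pyrimidine-pyrimidine changes are transitions.
TRANSITIONS = {frozenset("AG"), frozenset("CT")}


def k2p_distance(seq1, seq2):
    """Kimura two-parameter distance between two aligned nucleotide sequences."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    s1 = seq1.upper().replace("U", "T")
    s2 = seq2.upper().replace("U", "T")

    compared = transitions = transversions = 0
    for a, b in zip(s1, s2):
        if a not in "ACGT" or b not in "ACGT":
            continue  # assumption: skip gaps and ambiguity codes
        compared += 1
        if a == b:
            continue
        if frozenset((a, b)) in TRANSITIONS:
            transitions += 1
        else:
            transversions += 1
    if compared == 0:
        raise ValueError("no comparable sites")

    p = transitions / compared    # proportion of transitions
    q = transversions / compared  # proportion of transversions
    x, y = 1 - 2 * p - q, 1 - 2 * q
    if x <= 0 or y <= 0:
        raise ValueError("sequences too divergent for the correction")
    return -0.5 * math.log(x) - 0.25 * math.log(y)


# One transition in ten sites: P = 0.1, Q = 0, so d = -0.5 * ln(0.8)
print(round(k2p_distance("AAAAAAAAAA", "GAAAAAAAAA"), 4))
# prints 0.1116
```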
Results
As a test case to evaluate the utility of the compatibility method, a group of 25 bacterial species was chosen (Table 1). This included six species each from the α-, β-, and γ- subdivisions, plus two and three members, respectively, from the δ- and ε-subdivisions of proteobacteria. In addition, the set also included two Chlamydiae species, viz., Chlamydia trachomatis and Chlamydophila pneumoniae, to serve as outgroup for the proteobacterial species (Olsen et al. 1994). The genomes of all the chosen species have been sequenced. For the sequence dataset, 10 highly conserved proteins (RpoB, RpoC, AlaRS, EF-Tu, EF-G, RecA, GyrA, GyrB, Hsp60, and Hsp70) that are found in all of these species were chosen (Harris et al. 2003). The lengths of these proteins for E. coli (i.e. number of sites) are given in column 2 of Table 2, and homologues from the other species are of similar lengths.
Multiple sequence alignments for these proteins for the 25 species were created using the Clustal x program (Jeanmougin et al. 1998). Subsequently, using the DUALSITE program, those sites in the alignments where only two amino acids were present were selected and such sites were converted into a binary file of “0, 1” characters. Table 2 presents a summary of the two-state sites in different proteins. Of the total number of positions in the sequence alignments of these proteins, about 17.6% (range, 13%–24%) of the sites were found to contain only two amino acids. After excluding those positions where one of the amino acids was present in a single species, 12.3% of the sites were useful (each of the two character states present in at least two of the species) for compatibility analysis (Table 2).
The mutual compatibility of the useful characters in different proteins was determined employing two different approaches. In one, all useful characters from the 10 proteins were combined into a single large dataset of 957 characters (Set A in Table 2) prior to compatibility determination. The dataset of the useful two-state characters from all 10 proteins is provided as Supplemental Information. In the second approach, the useful two-state characters from individual protein sequences were initially analyzed for compatibility and then the compatible characters from all 10 proteins were combined to create a second large dataset (Set B; 398 characters), which was again subjected to compatibility analysis. The compatibility analysis involves a pairwise comparison of each character in the data matrix with every other character in the dataset. If all four of the combinations, 00, 01, 10, and 11, are found to occur among the organisms, then that pair of sites is regarded as incompatible (Le Quesne 1975; Sneath et al. 1975; Meacham and Estabrook 1985; Felsenstein 2004; Wilkinson 2001), and the one of the two sites that shows the higher degree of incompatibility with the other characters is removed from the dataset. This process is repeated until the largest datasets of compatible characters, i.e., cliques, are obtained. The mutual compatibility of each site in the matrix of pairs of sites was determined using two different programs. The main program that we have used for such analysis was the CLIQUE program from the PHYLIP (v. 3.5c) program package (Felsenstein 1993). It uses a branch-and-bound algorithm to find the largest dataset(s) of mutually compatible characters (i.e., cliques) (Felsenstein 1993; Bron and Kerbosch 1973). The compatibility analysis on characters from individual proteins was carried out using both the CLIQUE program and the HARMONY program. The largest cliques identified by both these programs had the same numbers of characters. 
However, unlike the CLIQUE program, which can identify multiple cliques of the same size, the output of the HARMONY program consists of a single clique and it extracts all compatible characters corresponding to it in a file. The compatible characters obtained from different proteins are then combined to create the second large dataset (Set B; Table 2).
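The pairwise test and the search for the largest cliques described above can be sketched in a few lines. Two binary characters are taken as incompatible when all four combinations 00, 01, 10, and 11 occur among the taxa, and the largest cliques are found with a basic Bron and Kerbosch style enumeration of maximal cliques in the compatibility graph. This is an illustrative sketch only: it is neither the PHYLIP CLIQUE implementation nor the HARMONY program, and the function names `compatible` and `largest_cliques` are hypothetical.

```python
from itertools import combinations


def compatible(a, b):
    """Two binary characters are compatible unless 00, 01, 10 and 11 all occur."""
    return len(set(zip(a, b))) < 4


def largest_cliques(characters):
    """Return the index lists of all largest sets of mutually compatible characters."""
    n = len(characters)
    # Compatibility graph: an edge joins every pair of compatible characters.
    adj = {i: set() for i in range(n)}
    for i, j in combinations(range(n), 2):
        if compatible(characters[i], characters[j]):
            adj[i].add(j)
            adj[j].add(i)

    maximal = []

    def expand(r, p, x):
        # r: current clique, p: candidates to add, x: already explored vertices
        if not p and not x:
            maximal.append(sorted(r))
            return
        for v in list(p):
            expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(range(n)), set())
    best = max((len(c) for c in maximal), default=0)
    return sorted(c for c in maximal if len(c) == best)


chars = ["00011", "00111", "01100", "00110"]
print(largest_cliques(chars))
# prints [[0, 1], [0, 2], [1, 3]]
```

The toy input yields three largest cliques of equal size, which mirrors the situation noted above in which a dataset has several largest cliques and a program that reports only one of them must make an arbitrary choice.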
The compatibility analysis on the Set A characters by the CLIQUE program gave rise to four largest cliques, each consisting of 337 characters. Two of these cliques are shown in Figs. 1A and B. These cliques were rooted using the sequences for Chlamydiae species, which branch deeper than proteobacteria in different phylogenetic trees (Olsen et al. 1994; Brown et al. 2001; Eisen 1995; Gupta 2001). The Chlamydiae species were distinguished from Proteobacteria by a large number (i.e., 125) of characters. In both cliques, all five main subgroups of proteobacteria, i.e., α, β, γ, δ, and ε, were clearly distinguished from each other based on a minimum of five characters. The ε-proteobacteria formed the deepest-branching group or clade in these cliques, and 17 characters supported their deep branching in relation to the other proteobacterial subgroups. Six characters also supported a specific relationship between the β- and the γ-subgroups. The branching order of most species within these proteobacterial subgroups was also resolved, generally based on two or more characters. The only difference between the cliques shown in Figs. 1A and B is with regard to the relative branching positions of the α- and δ-subgroups. In the clique shown in Fig. 1B, the δ-subgroup is indicated to branch prior to the α-subgroup, whereas in the other clique (Fig. 1A) it is found to branch after the α-subgroup of species. However, the relative positions of the α- and δ-subgroups in these cliques are based on a single character, indicating that this relationship is not resolved. The remaining two cliques from Set A were identical to these cliques except for the branching order within the α-subgroup of species. In the cliques shown in Figs. 1A and B, Bra. japonicum formed the outgroup of a clade consisting of Bru. melitensis and Meso. loti, but only a single character supported this relationship. In the other two cliques obtained from Set A (Fig. 1C), Sil. pomeroyi formed the outgroup of a clade comprising Bru. melitensis and Meso. loti, whereas Bra. japonicum was found to branch with Ca. crescentus. These results indicate that these cliques are unable to resolve the relative branching order of these α-proteobacterial species.
When compatibility analysis was carried out on Dataset B (Table 2), a very high proportion (81.5%) of the characters was found to be mutually compatible. This indicates that the compatibility analysis of individual proteins is retaining predominantly useful and stable characters, which maintain these characteristics upon combining. However, the total numbers of compatible characters that were recovered by this approach were smaller than those when the analysis was carried out on all of the characters at the same time (323 vs. 337; Table 2). This is due to the fact that compatibility analysis of individual proteins often results in multiple cliques of the same size. However, when these datasets are combined for further analysis, only one clique from each dataset is included. This prevents searching for the largest cliques that are compatible with all of the characters in the entire dataset. These results indicate that in order to identify the largest cliques of compatible characters, it is necessary to carry out analysis on the entire dataset at the same time. The compatibility analysis of the characters in Set B resulted in only two cliques of 323 characters (Fig. 2). The overall relationship indicated by these cliques was very similar to that observed in the Set A cliques. However, the numbers of characters that distinguished different nodes were slightly lower in a number of cases (cf. Fig. 1). All five subgroups of proteobacteria were again clearly distinguished and their branching order was indicated as (ε(δ(α(β,γ)))). However, branching of the δ-subgroup before the α-subgroup was based on a single character. The two cliques obtained differed only with regard to the branching position within the α-proteobacterial species, and the differences between them were the same as noted above for the Set A cliques (Fig. 1C). In the cliques shown in Fig. 2, E. coli and Y. pestis were found to branch at the same position. 
The relative branching of these species in the Set A cliques is also based on a single character, indicating that this relationship is not reliably resolved.
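The reason that analysis in parts can recover fewer compatible characters than analysis of the whole dataset can be shown with a toy example. In the sketch below, a hypothetical "protein 1" has two largest cliques of equal size, and only one of them is compatible with the single character of "protein 2". If the per-protein step happens to keep the other clique, the combined dataset yields a smaller clique than a search over all characters at once. The brute-force search is used only because the example is tiny, and all names and data are illustrative assumptions.

```python
from itertools import combinations


def compatible(a, b):
    """Two binary characters are compatible unless 00, 01, 10 and 11 all occur."""
    return len(set(zip(a, b))) < 4


def largest_cliques(chars):
    """Brute-force search, largest size first; suitable only for tiny inputs."""
    for size in range(len(chars), 0, -1):
        found = [list(sub) for sub in combinations(chars, size)
                 if all(compatible(a, b) for a, b in combinations(sub, 2))]
        if found:
            return found
    return [[]]


protein1 = ["00110", "00011"]  # mutually incompatible: two cliques of size 1
protein2 = ["01100"]           # compatible with "00011" but not with "00110"

# Analysis in parts: keep one largest clique per protein, then combine.
kept = largest_cliques(protein1)[0] + largest_cliques(protein2)[0]
in_parts = largest_cliques(kept)

# Analysis of the whole dataset at the same time.
as_whole = largest_cliques(protein1 + protein2)

print(len(in_parts[0]), len(as_whole[0]))
# prints 1 2
```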
We have also examined the effect of changing the outgroup species on the results of compatibility analysis. This was done by replacing the sequences for the two Chlamydiae species in the above dataset with those from two Firmicutes (low G+C Gram-positive) species (Listeria innocua and Staphylococcus aureus). The largest cliques that were found with this new dataset contained only 276 compatible characters (results not shown), instead of the 337 characters obtained with Set A. This difference was mainly due to the smaller number of characters that were uniquely shared by the two Firmicutes species in comparison to the two Chlamydiae (88 vs 125), which are a fast-evolving lineage (Griffiths et al. 2006). Similar to Set A, four largest cliques were obtained. The topology of various proteobacterial species in these cliques was very similar to that shown in Fig. 1, although the numbers of characters that distinguished different nodes were slightly different in the two cases.
Phylogenetic trees were also constructed for the above species based on 16S rRNA sequences and concatenated sequences for all 10 proteins by different methods (NJ, ML, and MP). These trees are shown in Fig. 3. In the 16S rRNA trees (Fig. 3A), a few of the internodes, particularly those leading to the branching positions of the α- and δ-subdivisions, were not resolved by different methods. A polyphyletic branching of the β- and γ-subgroups, with a clade consisting of Xanthomonadales (Xyl. fastidiosa and Xan. axonopodis) forming an outgroup of the β-subgroup of species, was also supported by NJ and MP analyses (Fig. 3A). Similar polyphyletic branching of β- and γ-proteobacteria in 16S rRNA trees has been observed in earlier studies (Kersters et al. 2003; Ludwig and Klenk 2001). The relationships within the β-proteobacteria and the branching position of Bra. japonicum were also not resolved in the rRNA trees by different methods.
In contrast to the 16S rRNA tree, in the trees based on concatenated protein sequences (7977 positions) the NJ and ML analyses produced identical tree topologies and all of the internodes were resolved with high bootstrap scores (Fig. 3B). However, the bootstrap scores by the NJ method tended to be slightly higher than those by the ML analysis. The topology for the MP tree was also very similar, except that Ca. crescentus exhibited deeper branching than Sil. pomeroyi. The branching order of various proteobacterial subgroups by different methods was found to be (ε(δ(α(β,γ)))) (Fig. 3B), which is identical to that deduced in earlier work based on conserved indels in a number of different protein sequences (Gupta 2000, 2001; Kersters et al. 2003). The branching orders or interrelationships among various species in the protein tree are virtually identical to those seen in the cliques based on compatible characters in these proteins (Figs. 1 and 2). These results provide evidence that the character compatibility analysis provides a powerful new tool, in addition to the traditional phylogenetic approaches such as NJ, MP, and ML analyses (Felsenstein 2004), for determining the topological relationships among species.
Discussion
This paper describes the first detailed application of the character compatibility approach or “clique analysis” to generalized molecular sequence data to assess its usefulness and reliability for phylogenetic studies. Although the basic concepts and mathematical foundation of this method to infer phylogenetic relationships were developed more than 30 years ago (Le Quesne 1969, 1975; Estabrook and McMorris 1980; Estabrook et al. 1976; Sneath et al. 1975; Wilson 1965; Meacham and Estabrook 1985), this approach has thus far only been used in a limited manner with morphological characters (O’Keefe and Wagner 2001; Meacham 1994; Sneath 2001), and its applicability to molecular sequence data has not been explored (Felsenstein 2004; Wilkinson 2001). One of the main limitations of this approach is that the compatibility algorithms mainly work with binary character states (Estabrook et al. 1976; Felsenstein 2004); hence they are not directly applicable to molecular sequence alignments, which contain either 4 (DNA or RNA) or 20 (protein sequences) character states (Fitch 1975; Kannan and Warnow 1995; Meacham and Estabrook 1985). However, by limiting analyses to those positions where only two character states are present in the sequence alignments, this approach can be used with molecular sequences (Felsenstein 2004). In view of the two-state limitation of this approach, a large number of sites in a given dataset are not useful for these analyses. Nevertheless, because molecular sequences contain huge numbers of characters, large numbers of sites that are potentially useful for such analysis can still be found in these sequences. Moreover, the sites that are retained are generally those that are slowly evolving, and it is hoped that the information contained in them is of the highest quality and will prove helpful in resolving the topology (Meacham and Estabrook 1985; Wilkinson 2001).
In the present work, the usefulness of the compatibility approach for phylogenetic studies was examined by comparing the results obtained using this approach with those from conventional methods such as phylogenetic trees based on 16S rRNA or protein sequences. As a test case, the evolutionary relationships among 25 species, mainly from different subdivisions of Proteobacteria, were investigated. The sequence dataset consisted of 10 highly conserved proteins that are ubiquitous in bacteria, some of which are also found in species from all three domains (Harris et al. 2003). About 18% of the total sites in these proteins were found to satisfy the two-state criterion, and of these nearly two-thirds were useful for compatibility analysis (i.e., each state was present in at least two of the species). These characters are scattered in different parts of the proteins and no apparent clustering of them was observed. To increase the total number of useful two-state sites, such sites from different proteins were combined into large datasets. In one instance, all of the useful two-state sites were combined into a large dataset (Set A) prior to examining their mutual compatibility. In another instance, compatibility analysis was carried out on two-state characters from each protein separately and then all compatible characters were combined into a larger dataset (Set B) that was subjected to compatibility analysis. Results of these studies show that when compatible characters from individual proteins are combined, a very high proportion (>80%) of them remain mutually compatible. This provides evidence that the compatibility analysis is predominantly retaining useful and stable characters, which maintain these characteristics upon combining into larger datasets. However, these studies also revealed that the total number of compatible characters that one obtains from a dataset after analyzing it as a whole is greater than that obtained by analysis of the same dataset in parts. 
In view of this, to maximize and include all compatible characters from a dataset, it is advisable to carry out compatibility analysis on the entire dataset at the same time.
The largest cliques that were obtained from these datasets contained sufficient numbers of compatible characters (337 and 323 from Sets A and B, respectively) to clearly distinguish all of the main groups within Proteobacteria, as well as resolve most of the internal nodes within these cliques. In all of these cliques, the ε-proteobacteria showed the earliest branching, whereas the β- and γ-subgroups were indicated as late-branching groups within proteobacteria. However, these cliques did not resolve a few of the relationships. One of these was the relative branching order or placement of the α- and δ-subgroups. Additionally, the topology within the α-subgroup, particularly the branching position of Bra. japonicum, was not resolved. In general, the relationships that are distinguished by a single compatible character in the cliques are found to be unreliable.
Results presented here also show that the size of the largest cliques that one obtains is affected by the choice of the species. In the present analysis, by replacing the fast-evolving Chlamydiae species with the Gram-positive species, the numbers of characters in the cliques were reduced from 337 to 276. However, this difference was mainly due to a reduction in numbers of characters that distinguished a few of the fast-evolving groups. The overall topology and the reliability of various internal nodes were not affected by this change of the outgroup species.
The results of the compatibility analysis were also compared with those obtained by traditional phylogenetic analyses based on the 16S rRNA tree or a tree based on combined sequences from all 10 proteins. The overall topology of most of the species in these different trees was very similar. However, in the 16S rRNA tree, a number of relationships were either not resolved or poorly resolved. The γ-proteobacteria were indicated as polyphyletic, with the β-subgroup of species branching in between them. Several other relationships, including the relative branching positions of the α- and δ-subgroups, the placement of Bra. japonicum within the α-subgroup, and the relative branching of Ralstonia, Burkholderia, and Bordetella, were also not resolved in the rRNA tree by different phylogenetic methods. The failure of these trees to resolve some of these relationships is very likely due to the long-branch attraction effect (Felsenstein 1978; Delsuc et al. 2005). In contrast to the rRNA tree, in the tree based on concatenated protein sequences, all internal nodes were resolved with a high degree of confidence (>90% bootstrap score). The branching order of various species in this tree was very similar to that seen in the compatibility trees (Figs. 1 and 2) and some of these inferences are also independently supported by conserved indels in a number of proteins (Gupta 2000, 2005; Gupta and Griffiths 2002), as well as the gene order patterns in bacterial genomes (Kunisawa 2001). In earlier work, a number of conserved indels in protein sequences have been identified that are uniquely shared either by all proteobacteria (viz., α, β, γ, δ, and ε), or by the α-, β-, and γ-subgroups of species, or by only the β- and γ-proteobacteria, or which are unique to particular groups of proteobacteria, supporting the observed branching order of the proteobacterial subgroups, i.e., (ε(δ(α(β,γ)))) (Gupta 2000, 2005, 2006; Gupta and Griffiths 2002).
These results provide evidence that character compatibility (clique) analysis retains stable and useful characters from molecular sequences, and that a reliable phylogeny for the group of species under consideration can be deduced from such characters. The resolving power of this approach increases with the total number of compatible characters present in a dataset. In addition to the results presented here, we have also carried out compatibility analysis on smaller numbers of proteins from the present dataset. Although the cliques obtained in these cases were similar to those shown here, with the smaller numbers of characters either many internal nodes were not resolved or their distinction was based on only single characters (results not shown). In principle, using this approach it should be possible to construct a tree of all compatible characters based on all commonly shared proteins for a given group of species. Such an approach using a large number of compatible characters should prove particularly useful in clarifying the topological relationships among groups of species that have proven difficult to resolve by traditional phylogenetic means (e.g., higher taxa within prokaryotes and eukaryotes, and evolutionary relationships among Metazoa and among Cyanobacteria) (Creevey et al. 2004; Daubin et al. 2002; Gophna et al. 2005; Delsuc et al. 2005; Gupta and Griffiths 2002; Wilmotte and Herdman 2001; Erwin and Davidson 2002; Nielsen 2003; Baldauf et al. 2000; Gupta 1995). Several other applications of character compatibility methods, and the programs to implement them, have been described by Wilkinson (2001).
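As an illustration of what a clique of compatible characters means in practice, the following minimal Python sketch builds a compatibility graph for a few binary characters and extracts the largest clique. It is not the DUALSITE or HARMONY code used in this work; the function names and the toy data are hypothetical. It assumes unordered two-state characters, for which two characters are incompatible only when all four state combinations occur among the species, and for which pairwise compatibility is sufficient for joint compatibility. The clique search uses the basic recursion of Bron and Kerbosch (1973) without pivoting, which is adequate only for small examples.

```python
from itertools import combinations


def compatible(a, b):
    """Two binary characters are compatible unless all four state pairs occur."""
    return len(set(zip(a, b))) < 4


def bron_kerbosch(r, p, x, adj, out):
    """Basic Bron-Kerbosch recursion (no pivoting); collects maximal cliques."""
    if not p and not x:
        out.append(r)
        return
    for v in list(p):
        bron_kerbosch(r | {v}, p & adj[v], x & adj[v], adj, out)
        p = p - {v}  # v has been fully explored
        x = x | {v}  # exclude v from later cliques at this level


def largest_clique(chars):
    """Return the indices of a largest set of mutually compatible characters."""
    n = len(chars)
    adj = {i: set() for i in range(n)}
    for i, j in combinations(range(n), 2):
        if compatible(chars[i], chars[j]):
            adj[i].add(j)
            adj[j].add(i)
    cliques = []
    bron_kerbosch(set(), set(range(n)), set(), adj, cliques)
    return max(cliques, key=len)


# Hypothetical toy data: one string per character, one position per species.
chars = [
    "110000",  # groups species 1-2
    "111000",  # groups species 1-3
    "000011",  # groups species 5-6
    "101010",  # homoplasic pattern, conflicts with the other three
]
print(sorted(largest_clique(chars)))
```

In this toy case the first three characters are mutually compatible and define nested or disjoint groups, whereas the fourth conflicts with each of them and is excluded from the clique.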
Although the compatibility approach has worked well in the present case, certain considerations in its use should be pointed out. Based on an evolutionary model for random occurrence of mutations in a clocklike manner (Kimura 1983), sequences with very high mutation rates will exhibit numerous mutations and back mutations, and characters from them will likely exhibit many incompatibilities. Thus, it is likely that relatively few compatible characters will be found in such sequences. Likewise, sequences that have very low mutation rates may contain only a few changes that are useful for these analyses. Thus, the number of useful characters present in a given sequence will vary with the number and the phylogenetic depth of the species being studied, as well as with the choice of the outgroup species, as shown here. Some sequences that contain enough useful characters for examining the evolutionary relationships among closely related species may not prove useful when a broad range of species is included in the dataset. The ability of the available characters to resolve the topological relationships will also depend on the lengths of the terminal internodes relative to the internal internodes. If the terminal internodes are very long compared to the inner ones, most single changes will contribute to estimating these terminal lengths, but they will not help resolve the branching pattern.
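The distinction between changes that help resolve the branching pattern and changes that fall only on terminal branches can be illustrated with a short Python sketch. It applies the selection criterion used in this study (sites with exactly two amino acids, each present in at least two species) to a toy alignment. The function name and the sequences are hypothetical, and the sketch assumes an ungapped alignment of equal-length sequences; a gap symbol would simply be counted as another state.

```python
from collections import Counter


def informative_binary_sites(alignment):
    """Return column indices with exactly two residues, each in >= 2 sequences."""
    selected = []
    for col in range(len(alignment[0])):
        counts = Counter(seq[col] for seq in alignment)
        if len(counts) == 2 and min(counts.values()) >= 2:
            selected.append(col)
    return selected


# Hypothetical toy alignment: one row per species.
alignment = [
    "MKAL",
    "MKAL",
    "MRAV",
    "MRSV",
    "MRAI",
]
print(informative_binary_sites(alignment))
```

Here only the second column qualifies. The first column is invariant, the third contains a change confined to a single species (informative only about a terminal branch), and the fourth has three states.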
The reliability of compatibility analysis has scarcely been explored and this is the first detailed study examining its application to generalized molecular sequence data. One cannot be certain that every character in a clique represents a single change on a tree. Although the possibility of having multiple mutations that exactly cancel out each other is expected to be low for amino acid sequences, the same may not be true for nucleic acid sequences. Therefore, it is likely that nucleic acid sequences will prove less useful than amino acid sequences for compatibility analysis, but this remains to be experimentally tested. Our observation that upon combining different compatible subsets one obtains new sets that are almost wholly compatible indicates that these subsets are reflecting the same evolutionary history and the number of bad characters in them is small. The inference that the characters that are selected by the compatibility approach are generally reliable is also supported by the fact that the same relationships as indicated by these characters are independently supported by rare genomic changes such as conserved inserts and deletions in different proteins or whole proteins that are specific for these groups and the gene order arrangement in different bacteria (Gupta 2000, 2005, 2006; Gupta and Griffiths 2002; Kunisawa 2001; Kainth and Gupta 2005). Because of the rarity and highly specific nature of these genetic changes, it is considered unlikely that they are results of independent genetic events. In future studies, it would be of much interest to carry out simulation studies using compatibility analysis to determine whether this approach is able to detect and exclude characters that are derived from sequences that have been laterally transferred or result from similar perturbing events (Gogarten et al. 2002; Ochman 2001; Beiko et al. 2005). These studies should also indicate whether the increased reliability of compatibility methods compared to other phylogenetic methods compensates for the loss of information that results from excluding many characters.
References
Baldauf SL, Roger AJ, Wenk-Siefert I, Doolittle WF (2000) A kingdom-level phylogeny of eukaryotes based on combined protein data. Science 290:972–977
Beiko RG, Harlow TJ, Ragan MA (2005) Highways of gene sharing in prokaryotes. Proc Natl Acad Sci USA 102:14332–14337
Bron C, Kerbosch J (1973) Algorithm 457: finding all cliques of an undirected graph. Commun Assoc Comput Mach 16:575–577
Brown JR, Douady CJ, Italia MJ, Marshall WE, Stanhope MJ (2001) Universal trees based on large combined protein sequence data sets. Nat Genet 28:281–285
Buneman P (1971) The recovery of trees from measures of dissimilarity. In: Hodson FR, Kendall DG, Tautu P (eds) Mathematics in the archaeological and historical sciences. Edinburgh University Press, Edinburgh, pp 387–395
Creevey CJ, Fitzpatrick DA, Philip GK, Kinsella RJ, O’Connell MJ, Pentony MM, Travers SA, Wilkinson M, McInerney JO (2004) Does a tree-like phylogeny only exist at the tips in the prokaryotes? Proc Biol Sci 271:2551–2558
Daubin V, Gouy M, Perriere G (2002) A phylogenomic approach to bacterial phylogeny: evidence of a core of genes sharing a common history. Genome Res 12:1080–1090
De Ley J (1992) The Proteobacteria: ribosomal RNA cistron similarities and bacterial taxonomy. In: Balows A, Trüper HG, Dworkin M, Harder W, Schleifer KH (eds) The prokaryotes. Springer-Verlag, New York, pp 2111–2140
Delsuc F, Brinkmann H, Philippe H (2005) Phylogenomics and the reconstruction of the tree of life. Nat Rev Genet 6:361–375
Eisen JA (1995) The RecA protein as a model molecule for molecular systematic studies of bacteria: comparison of trees of RecAs and 16S rRNAs from the same species. J Mol Evol 41:1105–1123
Erwin DH, Davidson EH (2002) The last common bilaterian ancestor. Development 129:3021–3032
Estabrook GF, McMorris FR (1980) When is one estimate of evolutionary relationship a refinement of another? J Math Biol 10:367–373
Estabrook GF, Johnson CS Jr, McMorris FR (1976) A mathematical foundation for the analysis of cladistic character compatibility. Math Biosci 29:181–187
Felsenstein J (1978) Cases in which parsimony and compatibility methods will be positively misleading. Syst Zool 27:401–410
Felsenstein J (1981a) A likelihood approach to character weighting and what it tells us about parsimony and compatibility. Biol J Linn Soc 16:183–196
Felsenstein J (1981b) Evolutionary trees from DNA sequences: a maximum likelihood approach. J Mol Evol 17:368–376
Felsenstein J (1985) Confidence limits in phylogenies: an approach using the bootstrap. Evolution 39:783–791
Felsenstein J (1993) PHYLIP, version 3.5c. University of Washington, Seattle
Felsenstein J (2004) Inferring phylogenies. Sinauer Associates, Sunderland, MA
Fitch WM (1971) Toward defining the course of evolution: minimum change for a specified tree topology. Syst Zool 20:406–416
Fitch WM (1975) Towards finding the tree of maximum parsimony. In: Estabrook GF (ed) Proceedings of the Eighth International Conference on Numerical Taxonomy. W. H. Freeman, San Francisco, pp 189–230
Gogarten JP, Doolittle WF, Lawrence JG (2002) Prokaryotic evolution in light of gene transfer. Mol Biol Evol 19:2226–2238
Gophna U, Doolittle WF, Charlebois RL (2005) Weighted genome trees: refinements and applications. J Bacteriol 187:1305–1316
Griffiths E, Gupta RS (2004) Signature sequences in diverse proteins provide evidence for the late divergence of the order Aquificales. Int Microbiol 7:41–52
Griffiths E, Ventresca MS, Gupta RS (2006) BLAST screening of chlamydial genomes to identify signature proteins that are unique for the Chlamydiales, Chlamydiaceae, Chlamydophila and Chlamydia groups of species. BMC Genomics 7:14
Gupta RS (1995) Phylogenetic analysis of the 90 kD heat shock family of protein sequences and an examination of the relationship among animals, plants, and fungi species. Mol Biol Evol 12:1063–1073
Gupta RS (1998) Protein phylogenies and signature sequences: a reappraisal of evolutionary relationships among archaebacteria, eubacteria, and eukaryotes. Microbiol Mol Biol Rev 62:1435–1491
Gupta RS (2000) The phylogeny of Proteobacteria: relationships to other eubacterial phyla and eukaryotes. FEMS Microbiol Rev 24:367–402
Gupta RS (2001) The branching order and phylogenetic placement of species from completed bacterial genomes, based on conserved indels found in various proteins. Int Microbiol 4:187–202
Gupta RS (2003) Evolutionary relationships among photosynthetic bacteria. Photosynth Res 76:173–183
Gupta RS (2005) Protein signatures distinctive of Alpha proteobacteria and its subgroups and a model for alpha proteobacterial evolution. Crit Rev Microbiol 31:135
Gupta RS (2006) Molecular signatures (unique proteins and conserved indels) that are specific for the epsilon proteobacteria (Campylobacterales). BMC Genomics 7:167
Gupta RS, Griffiths E (2002) Critical issues in bacterial phylogenies. Theor Popul Biol 61:423–434
Harris JK, Kelley ST, Spiegelman GB, Pace NR (2003) The genetic core of the universal ancestor. Genome Res 13:407–412
Hasegawa M, Fujiwara M (1993) Relative efficiencies of the maximum likelihood, maximum parsimony, and neighbor-joining methods for estimating protein phylogeny. Mol Phylogenet Evol 2:1–5
Huelsenbeck JP, Bollback JP (2001) Empirical and hierarchical Bayesian estimation of ancestral states. Syst Biol 50:351–366
Jeanmougin F, Thompson JD, Gouy M, Higgins DG, Gibson TJ (1998) Multiple sequence alignment with Clustal X. Trends Biochem Sci 23:403–405
Kainth P, Gupta RS (2005) Signature proteins that are distinctive of alpha proteobacteria. BMC Genomics 6:94
Kannan S, Warnow TJ (1995) Inferring evolutionary history from DNA sequences. SIAM J Comput 23:713–737
Kersters K, Devos P, Gillis M, Vandamme P, Stackebrandt E (2003) Introduction to the proteobacteria. In: Dworkin M (ed) The prokaryotes: an evolving electronic resource for the microbiological community. Springer-Verlag, New York
Kimura M (1980) A simple method for estimating evolutionary rates of base substitutions through comparative studies of nucleotide sequences. J Mol Evol 16:111–120
Kimura M (1983) The neutral theory of molecular evolution. Cambridge University Press, Cambridge
Kishino H, Hasegawa M (1989) Evaluation of the maximum likelihood estimate of the evolutionary tree topologies from DNA sequence data, and the branching order in hominoidea. J Mol Evol 29:170–179
Kumar S, Tamura K, Nei M (2004) MEGA3: integrated software for molecular evolutionary genetics analysis and sequence alignment. Brief Bioinform 5:150–163
Kunisawa T (2001) Gene arrangements and phylogeny in the class Proteobacteria. J Theor Biol 213:9–19
Kunisawa T (2006) Dichotomy of major bacterial phyla inferred from gene arrangement comparisons. J Theor Biol 239:367–375
Lake JA, Rivera MC (2004) Deriving the genomic tree of life in the presence of horizontal gene transfer: conditioned reconstruction. Mol Biol Evol 21:681–690
Le Quesne WJ (1969) A method of selection of characters in numerical taxonomy. Syst Zool 18:201–205
Le Quesne WJ (1975) The uniquely evolved character concept and its cladistic application. Syst Zool 23:513–517
Ludwig W, Klenk H-P (2001) Overview: a phylogenetic backbone and taxonomic framework for prokaryotic systematics. In: Boone DR, Castenholz RW (eds) Bergey’s manual of systematic bacteriology. Springer-Verlag, Berlin, pp 49–65
Maidak BL, Cole JR, Lilburn TG, Parker CT Jr, Saxman PR, Farris RJ, Garrity GM, Olsen GJ, Schmidt TM, Tiedje JM (2001) The RDP-II (Ribosomal Database Project). Nucleic Acids Res 29:173–174
Meacham CA (1994) Phylogenetic relationships at the basal radiation of angiosperms: further study by probability of character compatibility. Syst Bot 19:506–522
Meacham CA, Estabrook GF (1985) Compatibility methods in systematics. Annu Rev Ecol Syst 16:431–446
Nielsen C (2003) Defining phyla: morphological and molecular clues to metazoan evolution. Evol Dev 5:386–393
O’Keefe FR, Wagner PJ (2001) Inferring and testing hypotheses of cladistic character dependence by using character compatibility. Syst Biol 50:657–675
Ochman H (2001) Lateral and oblique gene transfer. Curr Opin Genet Dev 11:616–619
Olsen GJ, Woese CR, Overbeek R (1994) The winds of (evolutionary) change: breathing new life into microbiology. J Bacteriol 176:1–6
Penny D (1976) Criteria for optimising phylogenetic trees and the problem of determining the root of a tree. J Mol Evol 8:95–116
Pisani D (2004) Identifying and removing fast-evolving sites using compatibility analysis: an example from the Arthropoda. Syst Biol 53:978–989
Saitou N, Nei M (1987) The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol Biol Evol 4:406–425
Schmidt HA, Strimmer K, Vingron M, von Haeseler A (2002) TREE-PUZZLE: maximum likelihood phylogenetic analysis using quartets and parallel computing. Bioinformatics 18:502–504
Semple C, Steel M (2003) Phylogenetics. Oxford University Press, Oxford
Sneath PHA (2001) Numerical taxonomy. In: Boone DR, Castenholz RW (eds) Bergey’s manual of systematic bacteriology. Springer-Verlag, Berlin, pp 39–42
Sneath PHA, Sackin MJ, Ambler RP (1975) Detecting evolutionary incompatibilities from protein sequences. Syst Zool 24:311–332
Stackebrandt E, Murray RGE, Trüper HG (1988) Proteobacteria classis nov., a name for the phylogenetic taxon that includes the “purple bacteria and their relatives.” Int J Syst Bacteriol 38:321–325
Tamura K, Nei M (1993) Estimation of the number of nucleotide substitutions in the control region of mitochondrial DNA in humans and chimpanzees. Mol Biol Evol 10:512–526
Tateno Y, Takezaki N, Nei M (1994) Relative efficiencies of the maximum-likelihood, neighbor-joining, and maximum parsimony methods when substitution rate varies with site. Mol Biol Evol 12:261–277
Van de Peer Y, De Wachter R (1994) TREECON for Windows: a software package for the construction and drawing of evolutionary trees for the Microsoft Windows environment. Comput Appl Biosci 10:569–570
Whelan S, Goldman N (2001) A general empirical model of protein evolution derived from multiple protein families using a maximum-likelihood approach. Mol Biol Evol 18:691–699
Wilkinson M (2001) PICA 4.0: software and documentation. Department of Zoology, Natural History Museum, London
Wilkinson M, Cotton JA, Creevey C, Eulenstein O, Harris SR, Lapointe FJ, Levasseur C, McInerney JO, Pisani D, Thorley JL (2005) The shape of supertrees to come: tree shape related properties of fourteen supertree methods. Syst Biol 54:419–431
Wilmotte A, Herdman M (2001) Phylogenetic relationships among the cyanobacteria based on 16S rRNA sequences. In: Boone DR, Castenholz RW (eds) Bergey’s manual of systematic bacteriology. Springer, New York, pp 487–493
Wilson EO (1965) A consistency test for phylogenies based on contemporaneous species. Syst Zool 14:214–220
Acknowledgments
We thank Yan Li for writing the computer algorithms for the DUALSITE and HARMONY programs. The work from R.S.G.’s lab, including support for Yan Li, was supported by a grant from the Natural Sciences and Engineering Research Council of Canada.
Additional information
[Reviewing Editor: Dr. Yves Van de Peer]
Cite this article
Gupta, R.S., Sneath, P.H.A. Application of the Character Compatibility Approach to Generalized Molecular Sequence Data: Branching Order of the Proteobacterial Subdivisions. J Mol Evol 64, 90–100 (2007). https://doi.org/10.1007/s00239-006-0082-2
DOI: https://doi.org/10.1007/s00239-006-0082-2