Introduction

Phylogenetic relationships among species can be inferred based on a variety of methods. These include distance matrix, parsimony, maximum likelihood (ML) and Bayesian methods, supertree construction involving matrix representation and conditioned reconstruction, as well as methods based on discrete characters (Felsenstein 1981b; Kishino and Hasegawa 1989; Fitch 1971; Saitou and Nei 1987; Huelsenbeck and Bollback 2001; Lake and Rivera 2004; Felsenstein 2004; Semple and Steel 2003; Wilkinson et al. 2005). The relative efficiencies of these methods in obtaining correct tree topology are affected by a variety of factors and underlying assumptions (Hasegawa and Fujiwara 1993; Tateno et al. 1994; Felsenstein 1978, 2004; Penny 1976; Wilkinson et al. 2005; Semple and Steel 2003). An earlier study by Felsenstein (1981a) suggests that in cases where some characters in the dataset have evolved rapidly while others are evolving more slowly, character compatibility approach may give better results. The character compatibility methods are intended to remove misleading or fast-evolving characters from the dataset, and in theory, they should reduce the chance of obtaining the incorrect topology (Felsenstein 2004; Le Quesne 1975; Sneath et al. 1975; Estabrook et al. 1976; Pisani 2004; Meacham and Estabrook 1985). The character compatibility approach can be used for a variety of applications including ranking or weighting characters, identifying and removing fast-evolving sites or problematic taxa from phylogenetic analysis (Pisani 2004), ordering multistate characters, and building supertrees based on matrix representation of trees (Wilkinson et al. 2005). (See Wilkinson [2001] for other references and applications of compatibility methods.) However, despite some of the advantages that they offer, the character compatibility methods have been little used in phylogenetic studies (Pisani 2004; Felsenstein 2004; Wilkinson 2001).

Compatibility methods were proposed by Le Quesne (1969, 1975) and Wilson (1965), who pointed out that if two characters exhibit certain patterns of occurrence in the organisms under study, then it is not possible to construct a phylogenetic tree that shows only a single change (or mutation) for each of the two characters. Instead, it is necessary to postulate that one or the other of the two has had a double mutation or a backward mutation (or several such) or has resulted from nonspecific means (such as lateral gene transfer). It is not possible to tell a priori which of the characters is “bad” (in the sense of being potentially misleading), but Le Quesne (1969) suggested that if one compared all pairs of characters, one could identify those characters that had numerous incompatibilities with others, and these would usually be bad in this sense. If these were deleted, one would be left with characters which had hopefully had only a single mutation and which, therefore, were good guides to phylogeny. By proceeding in this way Le Quesne (1969, 1975) showed that a set of characters could be obtained where every character is compatible with every other. A tree based on such mutually compatible character sets is commonly referred to as a clique (Felsenstein 2004; Meacham and Estabrook 1985; Wilkinson 2001). These concepts have been further developed by several investigators (Estabrook et al. 1976; Estabrook and McMorris 1980; Buneman 1971; Sneath et al. 1975; Wilson 1965; Meacham and Estabrook 1985) to prove the Pairwise Compatibility Theorem. This theorem states that, provided all characters are binary (i.e., have only two states, such as 0 or 1, or present or absent, or can be recoded into this form), if all pairs of characters are compatible, then the entire clique is compatible with a single tree.
For characters with three or more states, such as those commonly found in nucleotide or protein sequences, pairwise compatibility does not ensure mutual compatibility (Fitch 1975; Meacham and Estabrook 1985; Felsenstein 2004); hence it is difficult to use them for compatibility or clique analysis. However, if one omits the multistate characters from protein (or nucleic acid) sequences, it should still be possible to obtain sufficient numbers of useful sites (particularly from combined datasets for several genes/proteins) that can be used for compatibility or clique analysis.

Compatibility analysis with molecular sequences can also be carried out with rare genomic changes such as conserved insertions and deletions (i.e., indels) in protein sequences (Griffiths and Gupta 2004; Gupta 1998, 2003; Gupta and Griffiths 2002) or the order of various genes in different genomes (Kunisawa 2001, 2006). Based on the presence or absence of mutually compatible conserved indels, or arrangement of different genes, relationships among different groups can be deduced. One of us has made extensive use of conserved indels in protein sequences for inferring relationships among bacterial taxa (Griffiths and Gupta 2004; Gupta 1998, 2003; Gupta and Griffiths 2002). However, one limitation of the above kinds of genetic characteristics for compatibility analysis is that such characters are rare and they are also not found in all gene sequences. Hence, one cannot always use them to study relationships among the desired groups of species. In contrast, the compatibility analysis of two state characters in molecular sequences provides a more general approach that should be universally applicable. The application of this approach may prove helpful in clarifying the topology or branching patterns in cases that have proven difficult to resolve by phylogenetic treeing methods (Creevey et al. 2004; Daubin et al. 2002; Gophna et al. 2005; Delsuc et al. 2005; Gupta and Griffiths 2002; Gupta 1995).

Although the character compatibility approach, which in the present work refers specifically to clique analysis, offers a promising means for determining the relationships among different taxa, this approach has thus far not been used with generalized molecular sequence data to assess its usefulness and phylogenetic reliability. In this work, we have applied this approach to molecular sequence data for a number of conserved proteins to explore the topology of bacterial species belonging to different subdivisions within proteobacteria. Proteobacteria comprises a very large group among bacteria, accounting for nearly half of the cultured bacteria (Kersters et al. 2003; Gupta 2000; Stackebrandt et al. 1988; Olsen et al. 1994; De Ley 1992) and genomic sequences for a large number of proteobacterial species are now available (see http://www.ncbi.nlm.nih.gov/genomes/lproks.cgi). Thus, they provide a good test case to compare and assess the reliability of phylogeny as determined by the character compatibility method. In the present work, we have compared the results obtained using the character compatibility analyses with those based on phylogenetic trees for 16S rRNA and concatenated protein sequences to assess the reliability of this approach.

Materials and Methods

Compatibility Analysis

Multiple sequence alignments for various proteins for different species were created using the Clustal x program (Jeanmougin et al. 1998). Using a computer program, DUALSITE, that was developed for this work, those sites in the sequence alignments where only two amino acid states were found, with each state present in at least two species, were selected. For this purpose, all columns where any gaps were present in any of the species were omitted. The DUALSITE program can also identify sites where one of the states is present in only single species. Although such sites are not useful for compatibility analysis, they provide information regarding the lengths of the terminal branches. This information was separately computed and noted on the terminal branches in the tree. The DUALSITE program also converts all useful two-state sites into a binary file of “0, 1” that is suitable for compatibility analysis. The main program that we have used for compatibility analysis is CLIQUE from the PHYLIP (v. 3.5c) program package (Felsenstein 1993). Compatibility analyses on the large combined datasets from 10 proteins were carried out using this program, the largest cliques of compatible characters obtained were drawn, and the numbers of characters that distinguished different nodes were indicated. In addition to carrying out compatibility analysis on two-state characters from all 10 proteins, similar analyses were also carried out on individual protein sequences using the HARMONY program developed in this work. The largest clique identified by this program had the same numbers of characters as those obtained from the CLIQUE program. However, the HARMONY program also extracts the compatible characters for the largest clique in a file, which can then be combined with characters from other protein sequences to generate larger datasets, which were analyzed for overall compatibility by the CLIQUE program.
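The site-selection step described above can be illustrated with a minimal Python sketch. This is not the DUALSITE program itself, whose source is not reproduced here; the function name dual_sites, the min_count parameter, and the convention of coding the state carried by the first sequence as 0 are assumptions made only for illustration.

```python
from collections import Counter

def dual_sites(alignment, min_count=2):
    """Pick two-state columns from a gap-aware alignment (hypothetical helper).

    alignment: list of equal-length aligned sequences, one per species,
               with '-' marking gaps.
    Returns (informative, singletons):
      informative - list of (column_index, binary_string) for columns where
                    both states occur in at least min_count species
      singletons  - Counter mapping species index to the number of two-state
                    columns in which that species alone carries the rare state
    """
    informative = []
    singletons = Counter()
    for col in range(len(alignment[0])):
        column = [seq[col] for seq in alignment]
        if "-" in column:
            continue  # columns with a gap in any species are omitted
        counts = Counter(column)
        if len(counts) != 2:
            continue  # keep only sites with exactly two amino acid states
        (_, _), (minor, n_minor) = counts.most_common(2)
        if n_minor >= min_count:
            ref = column[0]  # assumed coding: state of first sequence is '0'
            coded = "".join("0" if aa == ref else "1" for aa in column)
            informative.append((col, coded))
        elif n_minor == 1:
            # uninformative for compatibility, but counts toward the length
            # of the terminal branch of the species carrying the rare state
            singletons[column.index(minor)] += 1
    return informative, singletons
```

Because compatibility depends only on which species share a state, the choice of which state is written as 0 does not affect the outcome.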

Phylogenetic Analysis

The 16S rRNA sequences for various species listed in Table 1 were downloaded from the Ribosomal Database Project-II site (Maidak et al. 2001) and aligned using the CLUSTAL x program (Jeanmougin et al. 1998). A neighbor-joining (NJ) tree was constructed based on distances calculated using Kimura’s (1980) two-parameter model. An ML tree for the rRNA sequences was computed using the HKY model in the TREE-PUZZLE 5.2 program (Schmidt et al. 2002). The amino acid sequences for the 10 proteins that we have used as the test set (viz., RNA polymerase β subunit [RpoB], RNA polymerase β′ subunit [RpoC], alanyl-tRNA synthetase [AlaRS], elongation factor-Tu [EF-Tu], elongation factor G [EF-G], RecA protein, DNA gyrase subunit A [GyrA], DNA gyrase subunit B [GyrB], Hsp60 or GroEL protein, and DnaK or Hsp70 protein) were downloaded and aligned using the CLUSTAL x program. For phylogenetic analysis, the sequences for all 10 proteins were concatenated into a single large alignment file containing 7977 sites. An NJ tree based on this sequence alignment (bootstrapped 1000 times) was constructed based on Kimura’s (1983) model using the TREECON programs (Van de Peer and De Wachter 1994). An ML tree based on this dataset was computed using the WAG+F model plus a gamma distribution with four categories (Whelan and Goldman 2001) in the TREE-PUZZLE program (Schmidt et al. 2002). Maximum parsimony (MP) trees were constructed using the Mega 3.1 program (Kumar et al. 2004). All of the trees were bootstrapped 100 times (Felsenstein 1985), unless otherwise indicated.
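For readers unfamiliar with the two Kimura distances mentioned above, the following Python sketch shows their standard textbook forms: the two-parameter nucleotide distance d = -(1/2) ln[(1 - 2P - Q) sqrt(1 - 2Q)], where P and Q are the proportions of transitional and transversional differences, and the protein distance d = -ln(1 - p - 0.2p^2), where p is the proportion of differing sites. The function names are hypothetical, and it is an assumption that the programs used in this work apply exactly these formulas and this gap handling.

```python
import math

# Transitional differences: purine<->purine or pyrimidine<->pyrimidine
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}

def _ungapped_pairs(s1, s2):
    """Aligned site pairs with gap-containing positions dropped."""
    return [(a, b) for a, b in zip(s1, s2) if a != "-" and b != "-"]

def k2p_distance(s1, s2):
    """Kimura (1980) two-parameter distance between two aligned nucleotide sequences."""
    # RNA sequences are mapped onto the DNA alphabet (U -> T) for comparison
    pairs = _ungapped_pairs(s1.upper().replace("U", "T"),
                            s2.upper().replace("U", "T"))
    n = len(pairs)
    P = sum((a, b) in TRANSITIONS for a, b in pairs) / n
    Q = sum(a != b and (a, b) not in TRANSITIONS for a, b in pairs) / n
    return -0.5 * math.log((1 - 2 * P - Q) * math.sqrt(1 - 2 * Q))

def kimura_protein_distance(s1, s2):
    """Kimura (1983) empirical distance between two aligned protein sequences."""
    pairs = _ungapped_pairs(s1, s2)
    p = sum(a != b for a, b in pairs) / len(pairs)
    return -math.log(1 - p - 0.2 * p * p)
```

Both corrections grow faster than the observed proportion of differences, reflecting the multiple substitutions expected at the same site in divergent sequences.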

Table 1 Species information

Results

As a test case to evaluate the utility of the compatibility method, a group of 25 bacterial species was chosen (Table 1). This included six species each from the α-, β-, and γ- subdivisions, plus two and three members, respectively, from the δ- and ε-subdivisions of proteobacteria. In addition, the set also included two Chlamydiae species, viz., Chlamydia trachomatis and Chlamydophila pneumoniae, to serve as outgroup for the proteobacterial species (Olsen et al. 1994). The genomes of all the chosen species have been sequenced. For the sequence dataset, 10 highly conserved proteins (RpoB, RpoC, AlaRS, EF-Tu, EF-G, RecA, GyrA, GyrB, Hsp60, and Hsp70) that are found in all of these species were chosen (Harris et al. 2003). The lengths of these proteins for E. coli (i.e. number of sites) are given in column 2 of Table 2, and homologues from the other species are of similar lengths.

Table 2 Summary of compatible characters in protein sequences

Multiple sequence alignments for these proteins for the 25 species were created using the Clustal x program (Jeanmougin et al. 1998). Subsequently, using the DUALSITE program, those sites in the alignments where only two amino acids were present were selected and such sites were converted into a binary file of “0, 1” characters. Table 2 presents a summary of the two-state sites in different proteins. Of the total number of positions in the sequence alignments of these proteins, about 17.6% (range, 13%–24%) of the sites were found to contain only two amino acids. After excluding those positions where one of the amino acids was present in a single species, 12.3% of the sites were useful (each of the two character states present in at least two of the species) for compatibility analysis (Table 2).

The mutual compatibility of the useful characters in different proteins was determined employing two different approaches. In one, all useful characters from the 10 proteins were combined into a single large dataset of 957 characters (Set A in Table 2) prior to compatibility determination. The dataset of the useful two-state characters from all 10 proteins is provided as Supplemental Information. In the second approach, the useful two-state characters from individual protein sequences were initially analyzed for compatibility and then the compatible characters from all 10 proteins were combined to create a second large dataset (Set B; 398 characters), which was again subjected to compatibility analysis. The compatibility analysis involves a pairwise comparison of each character in the data matrix with every other character in the dataset. If all four of the combinations, 00, 01, 10, and 11, are found to occur among the organisms, then that pair of sites is regarded as incompatible (Le Quesne 1975; Sneath et al. 1975; Meacham and Estabrook 1985; Felsenstein 2004; Wilkinson 2001), and the one of the two sites that shows the higher degree of incompatibility with the other characters is removed from the dataset. This process is repeated until the largest datasets of compatible characters, i.e., cliques, are obtained. The mutual compatibility of each site in the matrix of pairs of sites was determined using two different programs. The main program that we have used for such analysis was the CLIQUE program from the PHYLIP (v. 3.5c) program package (Felsenstein 1993). It uses a branch-and-bound algorithm to find the largest dataset(s) of mutually compatible characters (i.e., cliques) (Felsenstein 1993; Bron and Kerbosch 1973). The compatibility analysis on characters from individual proteins was carried out using both the CLIQUE program and the HARMONY program. The largest cliques identified by both these programs had the same numbers of characters.
However, unlike the CLIQUE program, which can identify multiple cliques of the same size, the output of the HARMONY program consists of a single clique and it extracts all compatible characters corresponding to it in a file. The compatible characters obtained from different proteins are then combined to create the second large dataset (Set B; Table 2).
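The pairwise test and the search for the largest cliques described above can be sketched in Python as follows. This is an illustrative sketch only: it is neither the CLIQUE nor the HARMONY code, and the function names are hypothetical. It uses a Bron and Kerbosch style recursion with pivoting and a simple size bound, on the assumption that characters are given as equal-length strings of 0 and 1, one position per species.

```python
from itertools import combinations

def compatible(a, b):
    """Two binary characters are incompatible only if all four state
    combinations (00, 01, 10, 11) occur among the species."""
    return len(set(zip(a, b))) < 4

def largest_cliques(chars):
    """Return every largest set of mutually compatible characters,
    each as a sorted list of indices into chars."""
    n = len(chars)
    adj = {i: set() for i in range(n)}
    for i, j in combinations(range(n), 2):
        if compatible(chars[i], chars[j]):
            adj[i].add(j)
            adj[j].add(i)

    best = []  # all largest cliques found so far

    def expand(r, p, x):
        nonlocal best
        if not p and not x:  # r cannot be extended: it is a maximal clique
            if not best or len(r) > len(best[0]):
                best = [sorted(r)]
            elif len(r) == len(best[0]):
                best.append(sorted(r))
            return
        if best and len(r) + len(p) < len(best[0]):
            return  # bound: this branch cannot reach the current best size
        pivot = max(p | x, key=lambda v: len(adj[v] & p))
        for v in list(p - adj[pivot]):
            expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(range(n)), set())
    return sorted(best)
```

For example, largest_cliques(["00111", "00011", "01010", "00111"]) keeps characters 0, 1, and 3 and drops character 2, which shows all four combinations with each of the others. The recursion depth equals the clique size, so very large datasets may call for an iterative formulation.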

The compatibility analysis on the Set A characters by the CLIQUE program gave rise to four largest cliques, each consisting of 337 characters. Two of these cliques are shown in Figs. 1A and B. These cliques were rooted using the sequences for Chlamydiae species, which branch deeper than proteobacteria in different phylogenetic trees (Olsen et al. 1994; Brown et al. 2001; Eisen 1995; Gupta 2001). The Chlamydiae species were distinguished from Proteobacteria by a large number (i.e., 125) of characters. In both cliques, all five main subgroups of proteobacteria, i.e., α, β, γ, δ, and ε, were clearly distinguished from each other based on a minimum of five characters. The ε-proteobacteria formed the deepest-branching group or clade in these cliques, and 17 characters supported their deep branching in relation to the other proteobacterial subgroups. Six characters also supported a specific relationship between the β- and the γ-subgroups. The branching order of most species within these proteobacterial subgroups was also resolved, generally based on two or more characters. The only difference between the cliques shown in Figs. 1A and B is with regard to the relative branching positions of the α- and δ-subgroups. In the clique shown in Fig. 1B, the δ-subgroup is indicated to branch prior to the α-subgroup, whereas in the other clique (Fig. 1A) it is found to branch after the α-subgroup of species. However, the relative positions of the α- and δ-subgroups in these cliques are based on a single character, indicating that this is not resolved. The remaining two cliques from Set A were identical to these cliques except for the branching order within the α-subgroup of species. In the cliques shown in Figs. 1A and B, Bra. japonicum formed the outgroup of a clade consisting of Bru. melitensis and Meso. loti. But only a single character supported this relationship. In the other two cliques obtained from Set A (Fig. 1C), Sil. pomeroyi formed the outgroup of a clade comprising Bru. melitensis and Meso. loti, whereas Bra. japonicum was found to branch with Ca. crescentus. These results indicate that these cliques are unable to resolve the relative branching order of these α-proteobacterial species.
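The numbers of characters supporting each grouping in a rooted clique can be tallied as in the following Python sketch. The function name clade_support is hypothetical, and the sketch assumes that every character in the clique is polarised by a single outgroup taxon, so that the species differing from the outgroup at a site form the clade supported by that site.

```python
from collections import Counter

def clade_support(chars, taxa, outgroup):
    """Count how many compatible characters support each clade.

    chars:    binary strings, one position per taxon (same order as taxa)
    taxa:     list of taxon names
    outgroup: name of the taxon used to polarise every character
    Returns a Counter mapping frozenset(clade members) to character count.
    """
    og = taxa.index(outgroup)
    support = Counter()
    for c in chars:
        # taxa whose state differs from the outgroup carry the derived state
        derived = frozenset(t for t, s in zip(taxa, c) if s != c[og])
        support[derived] += 1
    return support
```

For mutually compatible binary characters polarised in this way, any two supported clades are either nested or disjoint, so the tallies can be written directly onto the internodes of a single tree. A clade supported by only one character corresponds to the poorly resolved relationships noted above.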

Fig. 1

The compatibility trees (or cliques) obtained from a dataset of 957 two-state characters from all 10 proteins (Set A). The four largest cliques obtained each consisted of 337 compatible characters, and the topologies of different species in two of them are shown in A and B. The other two cliques were identical to these except for the branching order within the α-proteobacterial subgroup, which is as shown in C. The numbers of characters that distinguished various internodes are indicated. In A, the numbers in parentheses after the species name indicate the numbers of unique amino acid changes that were found in these species. The trees were rooted using the sequences for Chlamydiae species.

When compatibility analysis was carried out on Dataset B (Table 2), a very high proportion (81.5%) of the characters was found to be mutually compatible. This indicates that the compatibility analysis of individual proteins is retaining predominantly useful and stable characters, which maintain these characteristics upon combining. However, the total numbers of compatible characters that were recovered by this approach were smaller than those when the analysis was carried out on all of the characters at the same time (323 vs. 337; Table 2). This is due to the fact that compatibility analysis of individual proteins often results in multiple cliques of the same size. However, when these datasets are combined for further analysis, only one clique from each dataset is included. This prevents searching for the largest cliques that are compatible with all of the characters in the entire dataset. These results indicate that in order to identify the largest cliques of compatible characters, it is necessary to carry out analysis on the entire dataset at the same time. The compatibility analysis of the characters in Set B resulted in only two cliques of 323 characters (Fig. 2). The overall relationship indicated by these cliques was very similar to that observed in the Set A cliques. However, the numbers of characters that distinguished different nodes were slightly lower in a number of cases (cf. Fig. 1). All five subgroups of proteobacteria were again clearly distinguished and their branching order was indicated as (ε(δ(α(β,γ)))). However, branching of the δ-subgroup before the α-subgroup was based on a single character. The two cliques obtained differed only with regard to the branching position within the α-proteobacterial species, and the differences between them were the same as noted above for the Set A cliques (Fig. 1C). In the cliques shown in Fig. 2, E. coli and Y. pestis were found to branch at the same position. 
The relative branching of these species in the Set A cliques is also based on a single character, indicating that this relationship is not reliably resolved.

Fig. 2

The compatibility cliques obtained from a combined dataset of 398 compatible characters from all 10 proteins (Set B). The compatibility analysis was initially carried out individually on the two-state characters from all 10 proteins using the HARMONY program. The compatible sites from the single large clique for each protein obtained using this program were then combined into a larger dataset (Set B). The two largest cliques obtained after CLIQUE analysis of Set B characters are shown. The numbers of characters that distinguished different internodes are indicated.

We have also examined the effect of changing outgroup species on the results of compatibility analysis. This was done by replacing the sequences for two Chlamydiae species from the above dataset with those from two Firmicutes (low G+C Gram-positive) species (Listeria innocua and Staphylococcus aureus). The largest cliques that were found with this new dataset contained only 276 compatible characters (results not shown), instead of the 337 characters obtained with dataset A. This difference was mainly due to the smaller number of characters that were uniquely shared by the two Firmicutes species in comparison to the two Chlamydiae (88 vs 125), which is a fast-evolving lineage (Griffiths et al. 2006). Similar to Set A, four largest cliques were obtained. The topology of various proteobacterial species in these cliques was very similar to that shown in Fig. 1, although the numbers of characters that distinguished different nodes were slightly different in the two cases.

Phylogenetic trees were also constructed for the above species based on 16S rRNA sequences and concatenated sequences for all 10 proteins by different methods (NJ, ML, and MP). These trees are shown in Fig. 3. In the 16S rRNA trees (Fig. 3A), a few of the internodes, particularly those leading to the branching positions of the α- and δ-subdivisions, were not resolved by different methods. A polyphyletic branching of the β- and γ-subgroups, with a clade consisting of Xanthomonadales (Xyl. fastidiosa and Xan. axonopodis) forming an outgroup of the β-subgroup of species, was also supported by NJ and MP analyses (Fig. 3A). Similar polyphyletic branching of β- and γ-proteobacteria in 16S rRNA trees has been observed in earlier studies (Kersters et al. 2003; Ludwig and Klenk 2001). The relationships within the β-proteobacteria and the branching position of Bra. japonicum were also not resolved in the rRNA trees by different methods.

Fig. 3

Phylogenetic trees for the proteobacterial species based on (A) 16S rRNA sequences and (B) concatenated sequences for all 10 proteins used in the compatibility studies. Phylogenetic analyses were carried out by means of neighbor joining (NJ), maximum likelihood (ML), and maximum parsimony (MP) methods on 100 bootstrap samples (except for the NJ protein tree, which was bootstrapped 1000 times) as described under Materials and Methods. The bootstrap scores for various internodes in the NJ/ML/MP trees that were >50% are indicated. The trees were rooted using the Chlamydiae species.

In contrast to the 16S rRNA tree, in the trees based on concatenated protein sequences (7977 positions) the NJ and ML analyses produced identical tree topologies and all of the internodes were resolved with high bootstrap scores (Fig. 3B). However, the bootstrap scores by the NJ method tended to be slightly higher than those by the ML analysis. The topology for the MP tree was also very similar, except that Ca. crescentus exhibited deeper branching than Sil. pomeroyi. The branching order of various proteobacterial subgroups by different methods was found to be (ε(δ(α(β,γ)))) (Fig. 3B), which is identical to that deduced in earlier work based on conserved indels in a number of different protein sequences (Gupta 2000, 2001; Kersters et al. 2003). The branching orders or interrelationships among various species in the protein tree are virtually identical to those seen in the cliques based on compatible characters in these proteins (Figs. 1 and 2). These results provide evidence that the character compatibility analysis provides a powerful new tool, in addition to the traditional phylogenetic approaches such as NJ, MP, and ML analyses (Felsenstein 2004), for determining the topological relationships among species.

Discussion

This paper describes the first detailed application of the character compatibility approach or “clique analysis” to generalized molecular sequence data to assess its usefulness and reliability for phylogenetic studies. Although the basic concepts and mathematical foundation of this method to infer phylogenetic relationships were developed more than 30 years ago (Le Quesne 1969, 1975; Estabrook and McMorris 1980; Estabrook et al. 1976; Sneath et al. 1975; Wilson 1965; Meacham and Estabrook 1985), this approach has thus far only been used in a limited manner with morphological characters (O’Keefe and Wagner 2001; Meacham 1994; Sneath 2001), and its applicability to molecular sequence data has not been explored (Felsenstein 2004; Wilkinson 2001). One of the main limitations of this approach is that the compatibility algorithms mainly work with binary character states (Estabrook et al. 1976; Felsenstein 2004); hence they are not directly applicable to molecular sequence alignments, which contain either 4 (DNA or RNA) or 20 (protein sequences) character states (Fitch 1975; Kannan and Warnow 1995; Meacham and Estabrook 1985). However, by limiting analyses to those positions where only two character states are present in the sequence alignments, this approach can be used with molecular sequences (Felsenstein 2004). In view of the two-character state limitation of this approach, a large number of sites in a given dataset are not useful for these analyses. Nevertheless, because molecular sequences contain a huge number of characters, large numbers of sites that are useful for such analysis can still be found in these sequences. Moreover, the sites that are retained are generally those that are slowly evolving, and it is hoped that the information contained in them is of the highest quality and will prove helpful in resolving the topology (Meacham and Estabrook 1985; Wilkinson 2001).

In the present work, the usefulness of the compatibility approach for phylogenetic studies was examined by comparing the results obtained using this approach with other conventional methods such as phylogenetic trees based on 16S rRNA or protein sequences. As a test case the evolutionary relationships among 25 species mainly from different subdivisions of Proteobacteria were investigated. The sequence dataset consisted of 10 highly conserved proteins ubiquitous to all bacteria and some to species from all three domains (Harris et al. 2003). About 18% of the total sites in these proteins were found to satisfy the two-state criterion, and of these nearly two-thirds of the sites were useful for compatibility analysis (i.e., where each state was present in at least two of the species). These characters are scattered in different parts of the proteins and no apparent clustering of them was observed. To increase the total number of useful two-state sites, such sites from different proteins have been combined into large datasets. In one instance, all of the useful two-state sites were combined into a large dataset prior to examining their mutual compatibility. In another instance, compatibility analysis was carried out on two-state characters from each protein separately and then all compatible characters were combined into a larger dataset (Set B) that was subjected to compatibility analysis. Results of these studies show that when compatible characters from individual proteins are combined, a very high proportion (>80%) of them are mutually compatible. This provides evidence that the compatibility analysis is predominantly retaining useful and stable characters, which maintain these characteristics upon combining in larger datasets. However, these studies also revealed that the total number of compatible characters that one obtains from a dataset after analyzing it as a whole is greater than that obtained by analysis of the same dataset in parts.
In view of this, to maximize and include all compatible characters from a dataset, it is advisable to carry out compatibility analysis on the entire dataset at the same time.

The largest cliques that were obtained from these datasets contained sufficient numbers of compatible characters (337 and 323 from Sets A and B, respectively) to clearly distinguish all of the main groups within Proteobacteria, as well as resolve most of the internal nodes within these cliques. In all of these cliques, the ε-proteobacteria showed the earliest branching, whereas the β- and γ-subgroups were indicated as late-branching groups within proteobacteria. However, these cliques did not resolve a few of the relationships. One of these was the relative branching order or placement of the α- and δ-subgroups. Additionally, the topology within the α-subgroup, particularly the branching position of Bra. japonicum, was not resolved. In general, the relationships that are distinguished by a single compatible character in the cliques are found to be unreliable.

Results presented here also show that the size of the largest cliques that one obtains is affected by the choice of the species. In the present analysis, by replacing the fast-evolving Chlamydiae species with the Gram-positive species, the numbers of characters in the cliques were reduced from 337 to 276. However, this difference was mainly due to a reduction in numbers of characters that distinguished a few of the fast-evolving groups. The overall topology and the reliability of various internal nodes were not affected by this change of the outgroup species.

The results of the compatibility analysis were also compared with those obtained by traditional phylogenetic analyses based on the 16S rRNA tree or a tree based on combined sequences from all 10 proteins. The overall topology of most of the species in these different trees was very similar. However, in the 16S rRNA tree, a number of relationships were either not resolved or poorly resolved. The γ-proteobacteria were indicated as polyphyletic, with the β-subgroup species branching in between them. Several other relationships, including the relative branching positions of the α- and δ-subgroups, the placement of Bra. japonicum within the α-subgroup, and the relative branching of Ralstonia, Burkholderia, and Bordetella, were also not resolved in the rRNA tree by different phylogenetic methods. The failure of these trees to resolve some of these relationships is very likely due to the long-branch attraction effect (Felsenstein 1978; Delsuc et al. 2005). In contrast to the rRNA tree, in the tree based on concatenated protein sequences, all internal nodes were resolved with a high degree of confidence (>90% bootstrap score). The branching order of various species in this tree was very similar to that seen in the compatibility trees (Figs. 1 and 2) and some of these inferences are also independently supported by conserved indels in a number of proteins (Gupta 2000, 2005; Gupta and Griffiths 2002), as well as the gene order patterns in bacterial genomes (Kunisawa 2001). In earlier work, a number of conserved indels in protein sequences have been identified that are uniquely shared either by all proteobacteria (viz., α, β, γ, δ, and ε), or by the α-, β-, and γ-subgroups of species, or by only the β- and γ-proteobacteria, or which are unique to particular groups of proteobacteria, supporting the observed branching order of the proteobacterial subgroups, i.e., (ε(δ(α(β,γ)))) (Gupta 2000, 2005, 2006; Gupta and Griffiths 2002).

These results provide evidence that the character compatibility or clique analysis is retaining stable and useful characters from molecular sequences, and based on such characters a reliable phylogeny for the group of species under consideration can be deduced. The resolving power of this approach increases with the total number of compatible characters that are present in a dataset. In addition to the results presented here, we have also carried out compatibility analysis on smaller numbers of proteins from the present dataset. Although the cliques obtained in these cases were similar to those shown here, with the smaller numbers of characters either many internal nodes were not resolved or their distinction was based on only single characters (results not shown). In principle, using this approach it should be possible to construct a tree of all compatible characters based on all commonly shared proteins for a given group of species. Such an approach using a large number of compatible characters should prove particularly useful in clarifying the topological relationships among groups of species which have proven difficult to resolve by traditional phylogenetic means (e.g., higher taxa within prokaryotes and eukaryotes, and evolutionary relationships among Metazoa and Cyanobacteria) (Creevey et al. 2004; Daubin et al. 2002; Gophna et al. 2005; Delsuc et al. 2005; Gupta and Griffiths 2002; Wilmotte and Herdman 2001; Erwin and Davidson 2002; Nielsen 2003; Baldauf et al. 2000; Gupta 1995). Several other applications of the character compatibility methods and the programs to implement them have been described by Wilkinson (2001).

Although the compatibility approach has worked well in the present case, there are certain considerations in the use of this approach that should be pointed out. Based on an evolutionary model for random occurrence of mutations in a clocklike manner (Kimura 1983), sequences with very high mutation rates will exhibit numerous mutations and back mutations, and characters from them will likely exhibit many incompatibilities. Thus, it is likely that relatively few compatible characters will be found in such sequences. Likewise, sequences that have very low mutation rates may contain only a few changes that are useful for these analyses. Thus, the number of useful characters that may be present in a given sequence will vary with the number and the phylogenetic depth of the species that are being studied, as well as the choice of the outgroup species, as shown here. Some sequences that may contain enough useful characters for examining the evolutionary relationships among closely related species may not prove useful when a broad range of species is included in the dataset. The ability of the available characters to resolve the topological relationships will also depend on the relative lengths of the terminal internodes compared to the internal internodes. If the terminal internodes are very long compared to the inner ones, most single changes will contribute to estimating these terminal lengths, but they will not help resolve the branching pattern.
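The point about terminal internodes can be made concrete with a small Python sketch. It assumes binary characters with no missing data, and the function name and example patterns are hypothetical. A character in which only one taxon differs from the rest corresponds to a change on a terminal branch: it is compatible with every other character, but it does not group any taxa together.

```python
def resolves_branching(character):
    """True if each of the two states occurs in at least two taxa."""
    counts = [character.count(state) for state in set(character)]
    return len(counts) == 2 and min(counts) >= 2

# Hypothetical characters scored across five taxa.
terminal_change = "00010"  # one taxon differs: a change on a terminal branch
internal_change = "11000"  # two taxa share a state: the character groups them

flags = [resolves_branching(c) for c in (terminal_change, internal_change)]
print(flags)
```

Only the second pattern carries information about the branching order, which is why a dataset dominated by terminal-branch changes can yield a large clique and still leave internal nodes unresolved.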

The reliability of compatibility analysis has scarcely been explored and this is the first detailed study examining its application to generalized molecular sequence data. One cannot be certain that every character in a clique represents a single change on a tree. Although the possibility of having multiple mutations that exactly cancel each other out is expected to be low for amino acid sequences, the same may not be true for nucleic acid sequences. Therefore, it is likely that nucleic acid sequences will prove less useful than amino acid sequences for compatibility analysis, but this remains to be experimentally tested. Our observation that upon combining different compatible subsets one obtains new sets that are almost wholly compatible indicates that these subsets reflect the same evolutionary history and that the number of bad characters in them is small. The inference that the characters selected by the compatibility approach are generally reliable is also supported by the fact that the relationships indicated by these characters are independently supported by rare genomic changes, such as conserved inserts and deletions in different proteins or whole proteins that are specific for these groups, and by the gene order arrangement in different bacteria (Gupta 2000, 2005, 2006; Gupta and Griffiths 2002; Kunisawa 2001; Kainth and Gupta 2005). Because of the rarity and highly specific nature of these genetic changes, it is considered unlikely that they are the results of independent genetic events. In future studies, it would be of much interest to carry out simulation studies using compatibility analysis to determine whether this approach is able to detect and exclude characters that are derived from sequences that have been laterally transferred or result from similar perturbing events (Gogarten et al. 2002; Ochman 2001; Beiko et al. 2005). These studies should also indicate whether the increased reliability of compatibility methods compared to other phylogenetic methods compensates for the loss of information that results from excluding many characters.