Introduction

The globin family of proteins has representatives in all kingdoms of living things (Preitas et al. 2004; Li 1997; Vinogradov et al. 1992). The phylogeny and gene structure of globins have been extensively studied (Doolittle 1987; Shikama et al. 2004; Vidal et al. 2004; Vinogradov et al. 1992). DNA and protein sequences, and aspects of globin gene structure, have each provided different types of molecular data useful in phylogenetic inference and evolutionary studies (Piro et al. 1996; Sherman et al. 1992; Vidal et al. 2004). The “globin fold” defines the common structural features of a broad family of globin-related proteins. The structures of many globins have been solved (Perutz 1983; Shikama et al. 2004; Berman et al. 2000).

More than 31,000 protein structures are currently known and the goal of structural biological initiatives is to solve the structure of every protein in every living organism (Burley et al. 2003). The evolution of protein structure, similar to the evolution of any structure, is an interesting topic for evolutionary biologists and therefore structures are attractive subjects if they can be converted to a form of data useful for evolutionary studies. Structures also provide insight into evolution over long time intervals because the same basic structure types or folds are observed in organisms that diverged billions of years ago (Qian et al. 2001). New folds rarely arise de novo. Instead, for the most part, existing proteins diversify without changing fold morphology (Caetano-Anolles et al. 2005). Protein structure has been shown to evolve more slowly than sequence (Levitt et al. 1998). Genes encoding proteins with no significant sequence homology may share a common fold and a common ancestor (Kinch et al. 2002; Chothia et al. 1986). For example, hemoglobin and phycocyanin have diverged in function but both maintain the globin fold (Pastore et al. 1990; Caetano-Anolles et al. 2005). On the other hand, proteins that have converged in biochemical function during evolution typically have unrelated protein structures (Osawa et al. 2005).

To study the evolution of structure, structural similarity must be measured, and several approaches to quantifying it have been used. One set of approaches is based on quantifying intramolecular features. The contact map, the two-dimensional record of which residues touch in the folded protein, has been found to be relatively stable, even in proteins that have diverged to an extent that their primary sequences cannot be recognized as similar (Caprara et al. 2004; Kozitsyn et al. 1975). The contact maps of specific folds are recognizable even if individual contacts vary. Residue-residue contacts provide the forces that stabilize a specific structure, so the interactions underlying contact maps are a basic target of evolution. Contact maps and related measures convert some aspects of three-dimensional structures to a two-dimensional representation (Caprara et al. 2004; Kienjung et al. 2004; Johannissen et al. 2003). Patterns of residue contacts correlate with other aspects of structure (Xu et al. 2003; Lesk et al. 1980). Sequence conservation is not required for contact site conservation. For a given contact site a variety of amino acid residues might be able to form a stabilizing contact (Parisi et al. 2004). Though it is understood that the environment of each amino acid is unique in the tertiary structure, in evolutionary models the selective environment of residues is often modeled as uniform over all sites (Jones et al. 1992). Techniques for using information about contacts within the protein to refine models for sequence evolution and phylogenetic inference have been proposed (Parisi et al. 2005; Rodrigue et al. 2005; Marsh et al. 2005).

A common approach to studying evolutionary relationships of protein structures involves intermolecular comparison of protein backbone atom positions. These methods measure structural similarities between proteins or domains of proteins. The most common approach to comparing structures is to calculate the root mean square deviation (RMSD) of equivalent Cα backbone carbons of structurally aligned proteins. Evolutionary distance measures for globins based on RMSD, or related measures of interprotein similarities in shape, have been described and used in phylogenetic inference (Aronson et al. 1994; Bostick et al. 2004; Johnson et al. 1990). The slow pace of protein structure evolution may give structure a unique role in phylogenetic comparison of distant proteins (Levitt et al. 1998). Structural fold representation in genomes has been used to study the early evolution of life (Deeds et al. 2005; Caetano-Anolles et al. 2005).

The relationships of protein structures have been studied. In the SCOP database initial structural groupings assigned by automated methods are hand-curated (Murzin et al. 2000). Some of the relationships in SCOP reflect phylogenetic relationships, whereas others are classifications without an assignment of evolutionary history. The globin fold in SCOP is comprised of structurally similar though functionally diverse proteins (Balaji et al. 2001). Purely automated alignments have also been used to organize protein families. The HSSP database contains alignments of protein structures produced using the program DALI, which aligns structures using intramolecular distance patterns (Holm et al. 1998). The CE-MC server represents an attempt to produce structural alignments of protein families that optimize all structures (Guda et al. 2004). These methods agree on most fold assignments but differ in assignments for pairs of proteins with marginal similarity (Sauder et al. 2000).

Here I show that discrete structural features of proteins, residue contacts, can be used to study the evolution of the globin fold. Contacts are local features that reflect interactions between secondary structure elements of the globin fold. A simple model for evolution of contacts is generally consistent with observed patterns of change. I show that evolution of contacts in globins is largely divergent. Significant convergence was not detected. Contacts changed rapidly, with high homoplasy. The rapid divergence of contacts supports the concept that the globin fold is inherently robust with redundant contacts. Many combinations of residue contacts can produce a stable globin protein. It is possible that incorporating an understanding of how residue contacts evolve within a structure may improve models for sequence evolution.

Methods

Protein Structures

Bacterial globin fold protein structures were retrieved from the Protein Data Bank (PDB) (Berman et al. 2000) (Table 1) using the SCOP structural database classification (Murzin et al. 2000). Proteins are referred to by the code assigned to them by the PDB since sequences derived from those files were used for analyses. The PDB structure file often contains more than one protein chain, usually multiple copies of a single polypeptide or distinct polypeptide subunits. A letter code is used to designate the chain within the file. A five-character PDB + chain code serves as a descriptor of both an amino acid sequence and structural information about that sequence. The hemoglobin family was best related to all of the other families. Therefore the E. coli flavohemoglobin (1GVH-A) was assigned as the index structure for alignment of the other structures. Structures were excluded if they were of mutant derivatives, had a resolution poorer than 3.0 Å, or contained large unresolved regions that would interfere with analysis. Only structures derived by X-ray crystallographic methods were included. If multiple chains representing a single gene product were present in a PDB file the ‘A’ chain was used for analysis. The bacterial flavohemoglobins and FeS cluster proteins contain multiple domains (Vinogradov et al. 1992). Only the globin domains were included in the analysis. Because bacterial and eukaryotic proteins may differ in contact density (Shapiro et al. 2004), eukaryotic structures were excluded from this study.

Table 1 Globin structures used in analysis

Alignment of Structures

Meaningful comparison of protein structures requires alignment of the three-dimensional positional data from each protein. Structures were aligned using the DaliLite program, a local version of the server program DALI (Holm et al. 2000). This procedure aligns structures based on superposition of maps that record internal distances between backbone atoms. The 15 structures aligned well to 1GVH-A, though some regions had to be excluded because of low similarity. Quality of a structural alignment in DALI is assessed using a Z-score (Holm et al. 2000). The Z-score is derived from comparison of an alignment value to the distribution of alignment values from random matches to a reference library of unrelated structures. Alignment Z-scores for the proteins in this study were significant (Z > 2.0), indicating statistically similar structures, and had an acceptable deviation as well within the globin domain regions (RMSD ≤ 4.0 Å). Alignments were confirmed manually. The majority of excluded (unalignable) residues lay in loop regions or unrelated domains. Though all structures aligned well to 1GVH-A and other hemoglobins, alignment between the other families was occasionally poorer.

For this work, a structurally based sequence alignment of the globin core residues was prepared. Amino acid residue sites were only included if 14 of the 15 structures could be aligned at that site. From the alignment to the index structure, a set of 60 amino acid residue sites was derived that met these criteria (Holm et al. 2000; Chakrabarti et al. 2004). Alignment did not pose a special problem for this particular data set since the globin portion of all structures used in this study formed compact domains with similar topology (Aronson et al. 1994).

Scoring Contact Characters

Amino acid residue interactions within each protein were determined for all structures. Contacting pairs were determined using the computer program Contact 0.7. Neighbors within five amino acid residues of each analyzed residue were excluded in order to score only intersecondary structure contacts and not intrasecondary structure contacts. Contact was defined as present if at least one atom of each residue lay within the van der Waals radius of each other. Because experimentally determined structures are not precise and protein structure can change slightly during crystallization, an additional gap of 0.7 Å was permitted in addition to the maximum allowed separation based on van der Waals radius alone. This gap factor corresponds generally with the uncertainty of atom position determined from the Luzzati plots of R factor and resolution for this set of structures.
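The contact criterion described above can be illustrated with a minimal sketch. It is not the Contact 0.7 program, whose internals are not described here. The van der Waals radii, the function names, and the data layout are assumptions made for illustration. The reading that two atoms touch when their separation is no greater than the sum of their radii plus the 0.7 Å gap is likewise an interpretation of the text.

```python
import math

# Illustrative van der Waals radii in angstroms (assumed values, not from the paper).
VDW_RADII = {"C": 1.70, "N": 1.55, "O": 1.52, "S": 1.80}
GAP = 0.7           # additional tolerance in angstroms, as stated in the text
MIN_SEPARATION = 5  # residue pairs within five positions in sequence are skipped


def atoms_touch(atom_a, atom_b, gap=GAP):
    """Each atom is (element, (x, y, z)). True if distance <= r_a + r_b + gap."""
    (elem_a, pos_a), (elem_b, pos_b) = atom_a, atom_b
    limit = VDW_RADII[elem_a] + VDW_RADII[elem_b] + gap
    return math.dist(pos_a, pos_b) <= limit


def residue_contacts(residues, min_separation=MIN_SEPARATION):
    """residues maps residue number -> list of atoms.

    Returns the set of residue pairs (i, j), i < j, that are more than
    min_separation apart in sequence and have at least one touching atom pair.
    """
    contacts = set()
    numbers = sorted(residues)
    for idx, i in enumerate(numbers):
        for j in numbers[idx + 1:]:
            if j - i <= min_separation:
                continue  # near neighbors: mostly intrasecondary structure contacts
            if any(atoms_touch(a, b) for a in residues[i] for b in residues[j]):
                contacts.add((i, j))
    return contacts
```

With the gap set to zero the same function scores contacts on van der Waals radii alone, which gives a quick check of how sensitive a contact list is to the tolerance.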

A contact character site was defined as a pair of amino acid residues that could contact (touch). Both residues defining a contact character site had to be members of the set of alignable characters described above (Kienjung et al. 2004). A character matrix for phylogenetic analysis recording presence or absence of contacts was generated. Contact characters were treated as morphological data with binary ascertainments in an arbitrary order. A contact within a protein was labeled equivalent to a contact in a homologous protein if the residues involved were equivalent in the alignment. Because of occasional spatial alignment ambiguities, contacts involving adjacent residues were collapsed into a single contact character to avoid the risk of creating invalid character divisions. Collapsing adjacent residue contacts also avoided overweighting insignificant changes in residue conformation. A set of 97 contact sites was observed. The Cα carbons of these pairs were separated on average by 12.69 ± 7.34 Å over the entire set. The maximum separation compatible with contacts is about 16 Å, indicating that pairs often lie in proximity even if they do not contact. Individual residues often interacted with more than one other residue. Each interaction was defined as a distinct character. Distribution of number of contacts per structure was analyzed by a chi-square goodness-of-fit test against a modally centered binomial curve. Pearson correlations of contacts over the set of proteins were calculated for all pairs of sites and squared correlations averaged to determine a mean squared correlation for contact sites.
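The mean squared correlation over contact sites can be sketched as follows. The sketch assumes a presence/absence matrix with one row per structure and one column per contact character. Skipping invariant columns, for which the Pearson correlation is undefined, is an assumption of this sketch. The function names are hypothetical.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences; None if either is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return None  # correlation is undefined for an invariant column
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    return sxy / sqrt(sxx * syy)


def mean_squared_correlation(matrix):
    """matrix: rows are structures, columns are contact characters coded 0/1."""
    columns = list(zip(*matrix))
    squared = []
    for col_a, col_b in combinations(columns, 2):
        r = pearson(col_a, col_b)
        if r is not None:
            squared.append(r * r)
    return sum(squared) / len(squared) if squared else float("nan")
```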

Phylogenetic Reconstruction with Sequences

The structural alignment method was used to align globin-fold sequences since sequence similarity in many cases was less than 20%. Sequences were aligned using DaliLite as above. Contacts between amino acid sites that could not be aligned were treated as missing data. This procedure eliminated from analysis contact sites lying in nonglobin domains that were present in some structures and many loop regions.

Bayesian phylogenetic analyses were conducted using MrBayes version 2.01 (Huelsenbeck et al. 2001). The JTT substitution model (Jones et al. 1992) was used in the analysis with rates of all amino acid sites set to be equal. The Markov chain Monte Carlo process was set to run four chains simultaneously for 500,000 generations, with a burn-in of 50,000 generations and tree sampling every 100 generations. Likelihood scores for sampled trees converged within 10,000 generations. Two independent runs converged on similar likelihood scores. The post-burn-in trees for each run were combined and used to generate 50% majority rule consensus trees, with the percentage of samples recovering any particular clade representing the posterior probability of that clade (Huelsenbeck et al. 2001). The topology of the consensus trees of the two replicate runs was identical, with similar posterior probability support for nodes.

Phylogenetic Reconstruction with Contact Characters

Phylogenetic trees were reconstructed with the neighbor-joining (NJ) method (Saitou et al. 1987) implemented in the PAUP* program, version 4.0b10. Contact character matrices encoded as 1 = contact and 0 = noncontact were analyzed with the datatype=standard option. This input constrained the analysis to contact distance in the d_cp (Eq. [2]) form. Contact characters were presented in a data matrix with equivalent contacts aligned. However, the order of the characters was arbitrary since characters represented a three-dimensional feature of the protein, not sequence. As described above, contact characters were only included in analysis if the positions of both amino acid residues participating in the contact were alignable in at least 14 of 15 structures. The remaining unalignable contact sites were treated as missing data. Data were subjected to bootstrap replications.

Distribution of Changes in Contact Characters During Evolution

The contact character data described above were used for parsimony analysis of contact character evolution. The Bayesian sequence consensus tree of globin proteins was used as an independent tree topology for the analysis of changes in contact characters. The contact character states of internal nodes were reconstructed by maximum parsimony (MP) analysis, using PAUP*. To understand if contacts evolved as a single class of character it was necessary to determine the frequency with which each character changed and the extent of homoplasy. Character change on branches involving either external or internal nodes of the unrooted tree was scored. Invariant contact sites were not removed in this analysis. The distribution of number of changes per contact site was fit to gamma and Poisson distributions. The homoplasy index (HI) (Kluge et al. 1969) was calculated excluding uninformative contact sites.
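As a small illustration of the homoplasy calculation, the sketch below uses the standard ensemble definition HI = 1 - CI, where CI is the summed minimum number of steps divided by the summed observed number of steps. It assumes the per-character change counts have already been obtained from the parsimony reconstruction. It also assumes every included binary character is variable, so that its minimum is one step. The function name is hypothetical.

```python
def homoplasy_index(observed_steps, min_steps_per_character=1):
    """Ensemble homoplasy index, HI = 1 - CI, for a set of characters.

    observed_steps: number of changes inferred on the tree for each included
    character. For a variable binary character the minimum is one step.
    """
    total_observed = sum(observed_steps)
    if total_observed == 0:
        raise ValueError("no character changes observed")
    total_minimum = min_steps_per_character * len(observed_steps)
    consistency_index = total_minimum / total_observed
    return 1.0 - consistency_index
```

For example, three characters that change one, two, and four times on the tree give CI = 3/7 and HI = 4/7.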

Testing for Structural Convergence

If an ideal globin structure exists and selection pulls globin fold proteins toward this ideal, then structural convergence might be detectable. Contact characters were especially suited to detect convergence since contact status of internal nodes could be reconstructed. The convergence test of Zhang and Kumar (Zhang et al. 1997) was adapted for binary contact data. The core of the Zhang and Kumar approach is to compare changes observed on two independent branches of a tree to the changes predicted by chance. In the original method, maximum likelihood (ML) was used to infer ancestral nodes (Zhang et al. 1997). Models are not sufficiently developed to permit use of ML alone with contact characters. Instead, internal node characters were reconstructed using parsimony as described above. Calculation of the probability of chance convergence was required. Since the frequencies of contacts and noncontacts in structures were not equal, a modification of Felsenstein’s (1981) model for substitution was used, with an assumption of different rates for formation and loss of contacts. The frequency of contact-to-noncontact changes (Pi-1,0) and the frequency of noncontact-to-contact changes (Pi-0,1) were separately determined for each branch. Frequencies were calculated without correction for multiple changes at one site. The frequencies for individual branches were used to determine the probability that correlated convergences on two branches i and j would occur by chance, (Pi-1,0)*(Pj-1,0) and (Pi-0,1)*(Pj-0,1). The binomial equation was used to determine the probability that an observed number of convergences had occurred by chance for a pair of branches. Only branches joining an external node to its closest internal node were analyzed. A confidence level, p < 0.01, was selected to optimize sensitivity rather than selectivity of the convergence test.
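The chance-convergence calculation can be sketched as below. Several choices are assumptions of this sketch rather than details given in the text. Branch frequencies are computed as the number of changes divided by the total number of characters. The two products are summed into a single per-site probability of a matching change. The p-value is taken as the upper binomial tail. All function names are hypothetical.

```python
from math import comb


def change_frequencies(ancestor, descendant):
    """Return (loss, gain) frequencies for one branch: 1->0 and 0->1 changes per site."""
    n = len(ancestor)
    loss = sum(1 for a, d in zip(ancestor, descendant) if a == 1 and d == 0)
    gain = sum(1 for a, d in zip(ancestor, descendant) if a == 0 and d == 1)
    return loss / n, gain / n


def count_convergences(anc_i, desc_i, anc_j, desc_j):
    """Sites where both branches show the same change (both 1->0 or both 0->1)."""
    return sum(
        1
        for ai, di, aj, dj in zip(anc_i, desc_i, anc_j, desc_j)
        if ai != di and aj != dj and di == dj
    )


def binomial_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, x) * p**x * (1 - p) ** (n - x) for x in range(k, n + 1))


def convergence_p_value(anc_i, desc_i, anc_j, desc_j):
    """Probability of at least the observed number of convergences arising by chance."""
    n = len(anc_i)
    loss_i, gain_i = change_frequencies(anc_i, desc_i)
    loss_j, gain_j = change_frequencies(anc_j, desc_j)
    p_site = loss_i * loss_j + gain_i * gain_j  # chance of a matching change at one site
    observed = count_convergences(anc_i, desc_i, anc_j, desc_j)
    return binomial_tail(observed, n, p_site)
```

A pair of branches would be flagged as convergent when this value falls below the chosen threshold (p < 0.01 in the text).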

To test the sensitivity of the convergence test, simulated convergences were introduced into trees. Random segments of the contact character data matrix corresponding to one structure were replaced with characters from another branch. Structures in different protein families were tested. Because of ambiguities in their relationship, hemoglobin/truncated hemoglobin pairs were not included in the presented comparison, though they gave similar results. Random pairs of branches were selected for each simulation. For each replacement a simulated tree was reconstructed with MP in PAUP* using the sequence tree topology (Fig. 3A). A total of 1100 trees with inferred ancestral nodes were produced. The trees were subjected to convergence tests to determine if the introduced convergence could be detected against the background of random convergence. The power of the assay was defined as the frequency that the test correctly led to rejection of the null hypothesis (random convergence only).
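One way to introduce a simulated convergence into the character matrix is sketched below. It assumes that a "random segment" is a contiguous run of characters at a random start position. Because character order is arbitrary, this amounts to replacing an arbitrary subset of contact sites. The function name and arguments are hypothetical, and the sketch does not reproduce the tree reconstruction step in PAUP*.

```python
import random


def introduce_convergence(matrix, target, source, segment_length, rng):
    """Copy a random contiguous segment of the source row into the target row.

    matrix: list of 0/1 rows, one per structure. Returns (new_matrix, start),
    leaving the input matrix unchanged.
    """
    n_chars = len(matrix[target])
    if not 0 < segment_length <= n_chars:
        raise ValueError("segment_length must be between 1 and the number of characters")
    start = rng.randrange(0, n_chars - segment_length + 1)
    new_matrix = [list(row) for row in matrix]
    end = start + segment_length
    new_matrix[target][start:end] = matrix[source][start:end]
    return new_matrix, start


# Usage: a seeded generator makes a simulation run repeatable.
rng = random.Random(0)
example = [[1, 0, 1, 0, 1, 0, 1, 0], [0, 1, 0, 1, 0, 1, 0, 1]]
simulated, segment_start = introduce_convergence(example, target=0, source=1,
                                                 segment_length=3, rng=rng)
```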

Results and Discussion

Residue Contacts in Globin Fold Proteins

A goal of this study was to understand how structure in bacterial globin fold proteins (Table 1) evolved in the face of conflicting pressures. On one hand, structural integrity must be maintained. On the other, mutational divergence pressures introduce changes. A common approach to structure definition uses a map of residue-residue contacts (contact map) to define structure. The notation for defining residue contact similarity is not dependent on sequence (different sequences can have the same pattern of contacts and hence the same contact map). Residue contacts are, in essence, molecular morphology characters. Contact maps record local features that are collectively important for stabilizing structure and therefore evolutionarily relevant.

Contacts were defined for this study in a manner that focused on the types of residue interactions that would be most significant in evolution. Contact maps were prepared using a procedure that eliminated contacts from neighboring residues within five amino acid sites in order to reduce characters representing contacts within secondary structure. This procedure enhanced the proportion of residues making contact between secondary structure elements which largely determine protein shape. Contacts defined in this way were scattered throughout the globin structure, with most clustering at the interfaces between α-helices. Amino acid sites involved in contacts were distributed throughout the globin sequences.

Alignment is a crucial step in comparing structures (Bourne et al. 2003). Structural alignment was used as a basis for sequence alignment since sequence similarity was generally low (Table 1). Nonalignable regions of the structure, mostly loop regions and nonglobin domains, were excluded. The DALI method of structural alignment used here is widely used for structure comparisons (Holm et al. 1998). Comparisons of DALI with another widely used method, CE, indicate that they agree on site assignments about 75% of the time (Sauder et al. 2000). I have made similar observations with this globin set of structures. If each method is correct half of the time for disputed residues, the assignment error rate for each method would be about 13%. Though the lack of complete correspondence suggests that improvements may be possible in structure alignment algorithms, the approach taken here minimizes the uncertainty in alignment. Only highly structurally conserved regions (see methods) were included in analysis. These regions have much higher concurrence between DALI and CE (about 95%). Most of the disagreements between CE and DALI involved register shifts of only a single residue. In structure alignments a residue from one structure often hovers spatially between two residues of the other structure, making assignment uncertain. I treated contacts that involved adjacent residue sites as a single class of contact. This approach eliminated the largest class of potential misalignments. I estimate that the extent of misalignment in this study is likely to be between 1% and 3%, with the lower number more likely. The value of 5% error in assignment of contacts (involving two sequence sites) has been used to be cautious. This level of accuracy was sufficient for the analyses performed. If less conserved regions of structures had been included, the rate of misalignment would have been higher.

The conserved globin core was studied in this work (sites scored as alignable by DALI in >90% of structures). The number of contacts per structure in this core domain ranged from 19 to 45, with a mean of 33.27 contacts. The average structure contained 34.3% of the 97 distinct contacts observed over the globin structures. The distribution of number of contacts for an even more stringently defined core region (for which all amino acid residue sites could be aligned) is shown in Fig. 1. Structures did not share a specific number of contacts. On the contrary, the variation in number of contacts fit a binomial distribution consistent with a fixed probability of contact for each residue (the binomial model was not rejected; p = 0.158). Regions such as loops removed from this analysis contain additional contacts, not studied here, that contribute to total protein stability. Because the structures in the study were selected as containing the globin fold, contacts by definition were consistent with the spatial topology of that fold.

Figure 1

Distribution of number of residue contacts in structurally conserved region of globin fold proteins. The number of contacting sites for each of 15 globin fold proteins was compared. Filled circles and solid line: observed distribution of number of contacts per structure (in alignable regions). Dashed line: model binomial distribution modally centered.

Analogous sites in the various globins frequently interacted with different residues. For example, in two bacterial hemoglobins, Tyr64 in helix 3 interacts with different residues. Both helix 4 and helix 6 pass close to Tyr64, and in both hemoglobins Tyr64 contacts both helices. In 1GVH-A it contacts mostly residues in helix 4 and in 1CQX-A it contacts mostly residues in helix 6. From a sequence perspective Tyr64 is conserved. Acceptable substitutions at site 64 might differ for the two hemoglobins because the environment of the Tyr residues differs (Parisi et al. 2005). Though contacts are treated here as occurring independently, correlation between contact sites occurred at an observable level (mean r² = 0.1020). However, only 1.8% of contact pairs had a squared correlation >0.5. Most correlations were weak, with positive and negative correlations similar in magnitude. The contact correlations observed might be due to local steric considerations or to larger-scale structural correlations.

Contact-Based Distances

Distance measures were defined to study the evolution of contacts. Because the proteins under consideration share a similar spatial organization, it is not unreasonable to view the presence or absence of a specific contact as a morphological trait. For the globins, only a limited group of residues were candidates to participate in the formation of contacts. A set of potential contacts was defined that included all conserved core contacts observed in at least one of the globin proteins studied.

Contacts are not equivalent to sequence data, and distance for them could be defined differently. When comparing two DNA sequences the total number of characters is equivalent to the number of nucleotides. Each DNA site must be represented by a base. On the other hand, when comparing contacts the character state was “contact” or “lack of contact” between a pair of residues. Potentially the space defining noncontacting sites could include every possible pair of residues that did not touch in a structure, but as described above, most of the potential contacts are not used by any globin and are inconsistent with the fold. The rates of change of noncontacting and contacting characters must differ, if their frequencies differ, in order to maintain a stable average number of contacts. A measure of evolutionary distance using contact differences must accommodate these distinctive features inherent in contact map data. To resolve these difficulties, a distance equation based only on the rate of change of contacting sites was developed. For this analysis it was assumed that the product of the noncontacting rate and noncontacting frequency equaled the product of the contacting rate and contacting frequency so that the mean number of contacts remained stationary over evolution. This assumption permits us to avoid the issue of the distribution of rates for noncontacting residues and express distance in terms of contacts only. Distance then reflects the fraction of contact sites in common. The measure is averaged with respect to each sequence in order to make the distance matrix symmetric.

$$ d_{\rm c} = 1.0 - 0.5\;(({\rm{m}}/{\rm{c}}_{1} ) + ({\rm{m}}/{\rm{c}}_{2} )) $$
(1)

where d_c is the contact distance between two structures, c1 and c2 are the total numbers of contacts in structures 1 and 2, respectively, and m is the number of contacts that match between the two structures. The distance metric d_c takes values between 0.0 and 1.0. It is difficult to adjust d_c to correct for multiple changes since the noncontacting sites are not defined. As an alternative, it may be desirable to estimate the size of the pool of potential contacts. For a data set of structures the number of unique contact sites observed over all structures represents a conservative estimate of the number of potential contacts possible within that set. With this estimate we can define a distance based on the proportion of potential contact sites that differ between two structures:

$$ d_{{\rm cp}} = 1 - ({\rm{m}}/{{n}}_{{\rm{a}}} ) $$
(2)

where n_a is the number of unique contacts observed over all the structures in a sample. An advantage of the proportional distance, d_cp, is that it takes a form familiar in the analysis of DNA and amino acid sequence data, allowing its use with established methods for phylogenetic analysis. A disadvantage of normalizing to a collection of structures is that distance values have meaning only relative to a specific data set and distance values may vary depending on the structures included in the analysis. The presence of correlation between contacts could increase the variance of distances by making the independence of coordinate changes uncertain. Contact distance (d_c) accurately reflects the unique features of contact characters as an evolutionary data type and was used in this study when distance values are presented. The proportional contact distance d_cp was used for tree-building.
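Eqs. (1) and (2) can be illustrated with a minimal sketch operating on 0/1 contact vectors. Here m is taken as the number of contacts present in both structures, following the definition given for Eq. (1). Whether shared absences also count toward m in Eq. (2) is not stated, so this reading is an assumption. The function names are hypothetical.

```python
def shared_contacts(a, b):
    """m: number of contact characters present (1) in both structures."""
    return sum(1 for x, y in zip(a, b) if x == 1 and y == 1)


def contact_distance(a, b):
    """Eq. (1): d_c = 1 - 0.5 * (m/c1 + m/c2)."""
    c1, c2 = sum(a), sum(b)
    if c1 == 0 or c2 == 0:
        raise ValueError("each structure must contain at least one contact")
    m = shared_contacts(a, b)
    return 1.0 - 0.5 * (m / c1 + m / c2)


def unique_contact_sites(matrix):
    """n_a: number of contact characters observed in at least one structure."""
    return sum(1 for column in zip(*matrix) if any(column))


def proportional_contact_distance(a, b, n_a):
    """Eq. (2): d_cp = 1 - m/n_a."""
    return 1.0 - shared_contacts(a, b) / n_a


# Two hypothetical structures scored over six contact characters.
s1 = [1, 1, 1, 0, 0, 1]
s2 = [1, 0, 1, 1, 0, 0]
n_a = unique_contact_sites([s1, s2])
d_c = contact_distance(s1, s2)
d_cp = proportional_contact_distance(s1, s2, n_a)
```

In this toy example c1 = 4, c2 = 3, and m = 2, so d_c = 1 - 0.5(2/4 + 2/3), and with n_a = 5 the proportional distance is 1 - 2/5.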

Distance is overestimated if sequences are poorly aligned. This is because the total number of unique contacts is overestimated due to failure to recognize pairs as equivalent (n_a is inflated) and because true correspondences between structures are missed (m is falsely reduced). The distance, d_cp′, with correction for a proportion, q, of misaligned residues is approximately

$$ d_{\rm cp}^{\prime} = 1 - ({\rm{m}}(1 + 2{\rm{q}}))/(n_{\rm a} - 2{\rm{mq}}) $$
(3)

Error due to misalignment leads to an apparent increase in distance approximately proportionate to the fraction of incorrect contact assignments. Distances were used without correction for potential alignment errors. Alignment error would also be reflected as lower bootstrap confidence levels in phylogenetic reconstruction.
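The size of the correction in Eq. (3) can be gauged with a short worked sketch. The value m = 30 is a hypothetical number of matching contacts chosen for illustration. The values n_a = 97 and q = 0.05 are taken from the text. The function name is hypothetical.

```python
def corrected_proportional_distance(m, n_a, q):
    """Eq. (3): d_cp' = 1 - m(1 + 2q) / (n_a - 2mq)."""
    return 1.0 - (m * (1 + 2 * q)) / (n_a - 2 * m * q)


m, n_a, q = 30, 97, 0.05  # m is hypothetical; n_a and q follow the text
uncorrected = 1.0 - m / n_a                               # Eq. (2): 1 - 30/97
corrected = corrected_proportional_distance(m, n_a, q)    # 1 - 33/94
```

With these inputs the corrected distance is lower than the uncorrected one, in keeping with the statement that misalignment inflates distance.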

A broad range of contact distances was observed with globin structures. Distances within families with related functions were lower than distances between families. For example, the phycocyanin β chains 1JBO-B and 1GH0-B diverged by only 0.09, whereas 1GH0-B and succinate dehydrogenase 1NEK-B had a distance of 0.80 (Table 2). The relationship between contact distance and sequence distance was determined to understand the relative rates of divergence. Contact distances, d_c, calculated using Eq. (1) were compared to Poisson-corrected sequence distances (Fig. 2). The two measures were weakly but significantly (p < 0.001) correlated. Despite their similar structures, most of the proteins in this sample have less than 20% sequence identity, making comparisons involving sequence difficult. A better estimate of the correlation of contact distance and sequence could be made with a metazoan hemoglobin set of structures with at least 40% identity. The R² value was 0.661 for these data. Regression analysis of Poisson-corrected distance and contact distance indicated a slope of 0.120, suggesting that contacts change more slowly than sequence. Backbone conformations in proteins also change significantly more slowly than sequence (Levitt et al. 1998).

Table 2 Contact distances of globin fold structures

Figure 2

Relationship between contact distance and Poisson-corrected (PC) sequence distance. Comparison of residue contact distance (d_c) and PC distance of bacterial globin fold proteins. Sequence identity was calculated from structural alignment by the DaliLite program and contact distance was calculated as described in the text. Regression line is indicated. R² = 0.387.

An alternative method of deriving an evolutionary distance from structure comparisons of globins involves determining the root mean square deviation (RMSD) of the backbone (Cα) atoms at aligned residues of two proteins (Johnson et al. 1990). RMSD, like contact distance, relies on a sequence alignment. The RMSD measure can be modified to improve its value for evolutionary purposes by, for example, weighting for the number of residues in the analysis (Levitt et al. 1998). An advantage of RMSD or its modifications for determining distance is that an easily applied value defines the similarity of two structures (Bostick et al. 2004; Johnson et al. 1990). In some ways contact distance and RMSD are similar measures. Both represent the sum of local differences in structure, and both compare differences at evolutionarily equivalent sites. They differ in that displacement of Cα atoms is represented by a continuous value, whereas contacts are binary features. Unlike sequence similarity, structural similarity has no unambiguous definition, so methods must be directed to specific purposes. RMSD-related methods have strengths in ease of use and robustness. A model for evolution is more evident for the contact-based method.

Phylogenetic Relationships of Globin Fold Sequences

In order to study the process of evolution of contact characters, a gene tree for the bacterial globin proteins (Table 1) was inferred. The bacterial proteins of the globin fold form four major divisions (termed superfamilies or families in the SCOP database of structures) that each share structural features with globins. Proteins are referred to by their five-character PDB code and SCOP nomenclature is used to designate type. The flavohemoglobins (1GVH-A, 1CQX-A; globin family) are clearly related to one another and are also closely related to the bacterial dimeric hemoglobin (1VHB-A). These proteins bind oxygen and also enzymatically detoxify NO. It is believed that the fused flavoprotein domain of 1GVH-A and 1CQX-A is a derived trait, but this is uncertain and 1VHB-A may interact with a flavoprotein. The truncated hemoglobins (1MWB-A, 1IDR-A, 1NGK-A; truncated hemoglobin family) are smaller than the other globin fold proteins. Most globins have a “3 over 3” structure, with a splayed group of three α helices over another similar grouping. The truncated hemoglobins display a “2 over 2” arrangement (Wu et al. 2003). If the simpler arrangement is derived, the observed data can be explained parsimoniously. The FeS cluster enzymes, succinate dehydrogenase (1NEK-B) and fumarate reductases (1KF6-A, 1QLA-B; ferredoxin superfamily), are some of the multiprotein enzymes involved in energy metabolism. The FeS cluster proteins share biochemical functions with the flavohemoglobins. Both types of globin fold proteins represent iron-binding, catalytic domains of flavin redox enzymes. Since both shape and general enzymatic activity are conserved, hemoglobins and FeS cluster enzymes likely share an ancestor. The phycocyanins (1I7Y-A and -B, 1JBO-A and -B, 1GH0-A and -B, 1B33-A and -B; phycocyanin-like family) have two related subunits, reminiscent of α and β globin, but bind a phycobilin ligand. Most serve as accessory pigments in photosynthesis. The α and β phycocyanin subunits are derived from an ancient duplication event (Eberlein et al. 1990). The phylogenetic relationship of phycocyanin and hemoglobin has been studied (Pastore et al. 1990).

A sequence tree was derived by Bayesian inference using the protein sequences of the 15 globin fold proteins (Fig. 3A). The relationships determined should be considered a preliminary attempt to develop approaches appropriate for dealing with protein families that likely share a common ancestor but whose observed members exhibit a high degree of divergence. The FeS cluster, hemoglobin, and α phycocyanin families formed clusters on the Bayesian tree. The truncated hemoglobin and β phycocyanin families were poorly resolved. The topology was most consistent with multiple independent origins of the truncated hemoglobins from full-length globin ancestors. Independent origins for other truncated hemoglobins have been suggested (Wu et al. 2003). No evidence was available for rooting the tree.

Figure 3

Globin fold protein trees. A Fifty percent majority rule consensus tree derived from Bayesian analysis of amino acid sequences of globin fold proteins. Numerals at nodes are Bayesian posterior probabilities, expressed as fractions. B Neighbor-joining tree using proportional contact distance (d_cp). Fifty percent majority rule consensus tree. Numerals at nodes are the percentages of bootstrap support from 1000 bootstrap replications. Nodes with bootstrap support lower than 50% are collapsed. Both trees are unrooted. Protein families: A, hemoglobin; B, truncated hemoglobin; C, FeS cluster; D, phycocyanin.

Phylogenetic Inference Using Contact Characters

Structural features were also used as a basis to infer the globin fold tree. Contact distances were used in NJ inference (Fig. 3B). The contact distance tree exhibited a topology with similarities to the sequence tree, though bootstrap support values were generally low. The FeS, hemoglobin, and phycocyanin families clustered. The truncated hemoglobin family was poorly resolved. The sequence and contact distance trees were mostly congruent when branches were collapsed to include only well-supported nodes. Because of the somewhat lower resolution of the contact-based tree, the sequence tree was used as the basis for subsequent analysis. RMSD-based phylogenetic studies of globins have also been described and the requirements for successful reconstruction using structure defined (Perutz 1983; Johnson et al. 1990; Lesk et al. 1980). Like the contact-distance trees, RMSD-based trees generally are similar to sequence trees. For the current work, analysis was limited to a single domain conserved in all of the structures. This approach reduced the complications of comparing regions of the structure that were not conserved for all pairs of proteins, at the cost of reducing the number of sites available for distance calculations. Currently used methods for structural phylogeny are dependent on the quality of a sequence alignment used to define equivalencies of Cα atoms in RMSD methods, or residues in my contact method (Bostick et al. 2004; Johnson et al. 1990). Alignment errors usually inflate distance in my method but may deflate distances with RMSD methods by overfitting structures. The alignment of structures becomes unreliable as the divergence between a pair becomes greater than the divergences for pairs compared here (Sauder et al. 2000). Structure comparison approaches that do not rely on sequence alignment can provide a distance value for more divergent structures (Bourne et al. 2003), but the ancestral relationships of such proteins are often dubious and the structural distance may not reflect evolutionary distance.
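The input to a distance-based method such as NJ is a pairwise distance matrix. The sketch below builds one from binary contact characters. It is an illustration only: the distance used here, the proportion of aligned contact characters whose state differs, is an assumed stand-in and is not claimed to be the Eq. (1) used in this study. The function names are hypothetical.

```python
def contact_distance(contacts_a, contacts_b):
    """Proportion of aligned contact characters whose state differs.

    ASSUMPTION: a simple mismatch proportion, used only for illustration.
    Each argument is a list of 0/1 states at equivalent contact sites.
    """
    if len(contacts_a) != len(contacts_b) or not contacts_a:
        raise ValueError("contact vectors must be non-empty and of equal length")
    differing = sum(1 for a, b in zip(contacts_a, contacts_b) if a != b)
    return differing / len(contacts_a)

def distance_matrix(contact_maps):
    """Return (names, matrix) for a dict mapping structure name -> contact vector."""
    names = sorted(contact_maps)
    matrix = [
        [contact_distance(contact_maps[i], contact_maps[j]) for j in names]
        for i in names
    ]
    return names, matrix
```

Because every site must be equivalent across all structures, a misaligned site adds mismatches, which is consistent with alignment errors inflating a contact-based distance.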

Test for Convergence of Globin Structures

Two models have been suggested for change in protein structure. The neutral model for structural evolution suggests that many structures have a similar fitness (Bastolla et al. 2003; Xia et al. 2002). A nearly neutral model for structure suggests that a selected structure exists for some folds (Govindarajan et al. 1996). A selected structure might represent a thermodynamically “ideal” form that was more stable or resistant to unfolding due to mutation. Over a tree, structures would tend both to converge and to diverge under both models, but under the nearly neutral model, convergence would dominate. On the face of it, the evolution of globins seems purely divergent. Amino acid sequences have diverged. However, amino acid sequence poorly reflects structure, and convergent shifts toward an ideal, more stable structure might occur without sequence convergence (Kitazoe et al. 2005).

The convergence test of Zhang and Kumar (Zhang et al. 1997) was adapted for the data under study. This test attempts to distinguish selected convergence from random convergence. Potential convergent changes in contact characters along branches representing two protein families were identified. The probability of these observed contact convergences occurring by chance was calculated (see Materials and Methods). Branches consisted of an external node and a reconstructed internal node on an independent sequence tree topology. Ancestral contacts were inferred using MP. Ancestral protein structures have previously been inferred using an approach, which has similarities to the method used here, involving a character-based representation of structure (Johannissen et al. 2003). Reconstruction of structure, like reconstruction of sequence, might provide a tool to study the evolution of protein function (Chang et al. 2002).
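The MP step for a single binary contact character can be illustrated with the bottom-up pass of the Fitch algorithm. This is a minimal sketch and not the software used in this study. It assumes a tree written as nested 2-tuples with an arbitrary root, which does not affect the parsimony count. The function name and leaf labels are hypothetical.

```python
def fitch_changes(tree, states):
    """Minimum number of state changes for one character on a binary tree.

    tree   : a leaf name (str) or a 2-tuple of subtrees
    states : dict mapping leaf name -> 0 or 1 (contact absent / present)
    Returns (candidate state set at this node, number of changes below it).
    """
    if isinstance(tree, str):
        return {states[tree]}, 0
    left, right = tree
    left_set, left_changes = fitch_changes(left, states)
    right_set, right_changes = fitch_changes(right, states)
    shared = left_set & right_set
    if shared:
        # Children agree on at least one state: no change needed at this node.
        return shared, left_changes + right_changes
    # Children disagree: one change is required on a descending branch.
    return left_set | right_set, left_changes + right_changes + 1

# Hypothetical four-taxon example.
example_tree = (("a", "b"), ("c", "d"))
root_set, n_changes = fitch_changes(example_tree, {"a": 1, "b": 1, "c": 0, "d": 0})
```

The candidate state sets at internal nodes are the starting point for assigning ancestral contact states, and a second, top-down pass would be needed to resolve ambiguous sets.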

I searched for evidence of convergence in the globins. In comparisons between protein families, no more than three convergence events were observed for any of the 81 branch pairs tested. The number of convergences never reached significance. Convergences were also not detected when the contact-based tree was substituted for the sequence tree in the test. Within families there was not enough variation to perform the convergence test. The test should have been sensitive to even weak convergence: given the selected significance threshold (p < 0.01) and the number of branch combinations tested (81), random variation alone might have been expected to produce a spurious convergence result, yet none was observed.
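The expectation of a spurious result can be made concrete with a short calculation. The sketch below assumes, for simplicity, that the 81 tests are independent, which is unlikely to hold exactly because branch pairs share branches. The function name is hypothetical.

```python
def false_positive_expectation(n_tests, alpha):
    """Expected number of false positives and probability of at least one.

    ASSUMPTION: tests are independent and each has false-positive rate alpha.
    """
    expected = n_tests * alpha
    prob_at_least_one = 1.0 - (1.0 - alpha) ** n_tests
    return expected, prob_at_least_one

expected, prob_any = false_positive_expectation(81, 0.01)
```

Under these assumptions the expected number of false positives is 0.81, and the chance of at least one is a little over one half.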

In order to test the ability of the convergence test to detect weak selection, simulated convergences were introduced into globin structures. Convergence was simulated by replacing a random segment of contact data from one structure with that of a structure from another family. The analysis series involved reconstruction of 1100 simulated trees and their analysis for convergence. Figure 4 shows the power to detect selection with varying numbers of convergent contacts. The analysis shows that selection involving even eight converging contacts can be reliably detected. To put these results in context, each of the six hemoglobin helices contains about 16 contact characters. Convergent positioning of half of the contacts of a single helix in a single pair of structures might be detected by this method. Convergence of only one or two contacts could not be distinguished from random variation. Occasionally in structural evolution a few residues are key to structural function, e.g., in the serpins (Roberts et al. 2004). In these cases convergence tests might underestimate significance. However, for analysis of fold shape the test should be sensitive. It was possible that sequences contained misaligned sites. In a convergence test, contacts involving these misaligned sites would appear to diverge even if they were truly converging. Based on the estimate that at most 5% of contact sites are incorrect in the conserved regions used in this analysis, alignment difficulties are unlikely to prevent detection of convergence under most conditions. For example, based on the curve in Fig. 4, in the presence of 5% incorrect contact sites, eight contact convergences would be required to achieve the same power achieved by seven observed contact convergences in the absence of misalignment (one convergence event lost to analysis due to sequence misalignment). The presence of correlation between contacts biased the test in favor of finding convergence rather than divergence, since the convergence test was based on the assumption that all possible combinations of contacts were equally likely.
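The segment replacement used to simulate convergence can be sketched as follows. This is an illustration under stated assumptions: contact data are represented as equal-length lists of 0/1 states at equivalent sites, the segment is contiguous, and its start position is drawn uniformly. The function name and parameters are hypothetical and the details of the actual simulation are given under Materials and Methods.

```python
import random

def simulate_convergence(recipient, donor, segment_length, rng=None):
    """Copy a random contiguous segment of donor contact states into recipient.

    recipient, donor : equal-length lists of 0/1 contact states
    segment_length   : number of consecutive contact characters to replace
    Returns (simulated contact list, start index of the replaced segment).
    """
    if rng is None:
        rng = random.Random()
    if len(recipient) != len(donor):
        raise ValueError("contact vectors must be of equal length")
    if not 0 <= segment_length <= len(recipient):
        raise ValueError("segment_length out of range")
    start = rng.randrange(len(recipient) - segment_length + 1)
    simulated = list(recipient)  # leave the original data untouched
    simulated[start:start + segment_length] = donor[start:start + segment_length]
    return simulated, start
```

The number of contacts that actually converge can be smaller than the segment length, because sites at which the two structures already agree are unchanged by the replacement.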

Figure 4

Power to detect selection. The probability that observed contact convergences would reach a convergence test critical value was determined by simulation. Power (sensitivity) for trials with fewer than 11 observed convergences is shown in the plot. Power was 100% when more than 10 convergences occurred. In principle, up to 97 contact convergences could be observed. Convergence tests were performed on 1100 simulated trees as described under Materials and Methods. Significance threshold: p < 0.01.

The convergence test suggested that selection was not a major force in the evolution of the structure. A limitation of this result is that it is based on only the contact map as a measure of structure. Other aspects of structure might evolve differently and not be sampled by my method. Some powerful tests for selection, such as those based on nonsynonymous/synonymous substitution ratios (Hughes et al. 1988), are not available for analysis of structure. The approach to detect selection used here involved reconstruction of structure. Unfortunately, backbone conformations measured by RMSD cannot be reconstructed using current methods. A method related to mine has promise for studying evolution of other aspects of protein structure through reconstruction (Johannissen et al. 2003). The neutral evolution of globin structure over an extended interval is similar to the more recent neutral divergence of hemoglobin DNA sequences that has been described (Aguileta et al. 2004). My analysis of the core of globin required a relatively high quality alignment of sites. Convergence of structures involving nonequivalent residues (similar structures, different ancestry) could not be studied by this approach. The question of whether selection for a stable structure might lead structurally unrelated proteins to converge on the same fold is interesting but difficult to approach.

Surveys of databases indicate that certain structural folds such as the globin fold are disproportionately common, and others very rare (Qian et al. 2001). Computer simulations of structural evolution similarly find that a subset of possible structures predominates (Shapiro et al. 2004). Structures that tolerate many different protein sequences are termed robust or designable (Taverna et al. 2000; Meyers et al. 2004). Evolution might be mostly divergent within the limits of a robust fold (Govindarajan et al. 1996; Meyers et al. 2004; Xia et al. 2002). It is possible that my approach might have detected convergence with structures representing a less robust fold, one that had fewer acceptable contact patterns. I did not take into consideration functional adaptations of the globins. Many protein folds, including globins, are capable of hosting different enzymatic, interaction, or ligand binding functions with only local changes in structure (Pastore et al. 1990).

Patterns of Change for Contact Characters

The rate distribution for contact characters was determined by MP inference, constrained to the independently determined topology of the sequence tree with ancestral node reconstruction (Yang et al. 1996). Tree length for the topology in Fig. 3A was 219. The expected number of transitional steps required to produce the states observed for each structural character was determined. The distribution of sites experiencing a specific number of changes is shown in Fig. 5. The distribution was consistent with either a gamma or a Poisson distribution. DNA and protein sequence site rate distributions often fit gamma or Poisson distributions, but there was no a priori reason to believe that these models would describe evolution of protein structure. The simplest model (Poisson) consistent with the data would be that all contact sites belong to a single rate class. The gamma model, which posits a range of different site rates, was fit well by selection of an α shape parameter of 2.5 and is more intuitively compelling, since contacts at different structural sites are believed to play different roles in proteins and to be differently selected. There was no evidence for a class of invariant contacts. Most characters changed at least once on the tree. A mean of 2.38 changes per contact site was observed. The homoplasy index excluding uninformative sites for contact characters given the MP tree topology was 0.641. The rate of back mutation for contacts suggested that if convergence were selected, the rate of mutation would be high enough to allow detection.
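The single-rate (Poisson) expectation can be sketched from the reported mean alone. The code below is an illustration that assumes the number of changes per site follows a Poisson distribution with mean 2.38. It does not reproduce the fit performed in this study and does not use the per-site counts, which are shown in Fig. 5. The function name is hypothetical.

```python
import math

def poisson_pmf(k, mean):
    """Probability of exactly k changes at a site under a Poisson model."""
    return math.exp(-mean) * mean ** k / math.factorial(k)

mean_changes = 2.38  # mean number of changes per contact site reported above
# Expected fraction of sites showing 0, 1, ..., 7 changes under a single rate class.
expected_fraction = [poisson_pmf(k, mean_changes) for k in range(8)]
```

Under this assumption roughly 9% of sites would be expected to show no change, so a modest number of unchanged sites does not by itself imply a separate class of invariant contacts.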

Figure 5

Distribution of number of changes per site. Contact sites were analyzed for changes. The minimum number of changes required for each site over an independent tree was determined by MP with internal node state reconstruction. The frequency of different rate classes is presented. The mean number of changes per site was 2.38.

Conclusion

This work was based on a model in which key contacts between globin helices could freely change during evolution. The model depended on the existence of strong structural similarity among the proteins analyzed. It was a requirement that, for the most part, contacts could form or be lost without change in the basic structure of the protein. For the proteins in the set studied here, about 10% of the variance at each contact site could be explained by correlation with another site, which might limit some types of analysis. Some combinations of contacts might have been incompatible with one another in the proteins studied, though that could not be determined because of the limited number of proteins studied. For structures less related than the set used here, independent change would be sterically less possible, so correlation between changes would probably be higher. The number of contacts per protein was distributed in a fashion consistent with a random probability for each contact forming. Contacts changed over the globin tree and exhibited extensive homoplasy. These findings are supportive of relatively unconstrained contact change. However, it may be unrealistic to treat analysis of globin contact change as equivalent to analysis of sequence change. The structural features of the globin fold may permit greater independence for contacts than would be observed in other folds. The core of the globin fold is composed of a set of overlapping helices that touch at their intersection points, perhaps allowing greater flexibility in contact patterns than another fold might exhibit.

The globin fold is a robust fold that has diversified in the course of evolution and it may be that high tolerance of the fold to substitutions influenced its evolutionary success. Far fewer possible globin structures, defined by contact patterns, exist than potential globin sequences. Thus it is sensible to ask whether globin structure evolution is convergent, even if its sequence evolution is divergent (Aguileta et al. 2004). The observations on contact patterns were most consistent with divergence of globin structure. The weak correlation that was observed for contact changes would tend to bias calculations in favor of convergence and against divergence.

Contacts are similar to other measures of structural distance such as RMSD in some respects. Contacts, however, represent a feature of structure that may be subject to selection. As with RMSD-based structural approaches, the quality of a sequence alignment is a key to the method. In this study the accuracy of contact assignment was improved by limiting the analysis to conserved globin core residues and by using an algorithm for defining contacts that treated interactions involving adjacent residues as a single character. It is not clear whether contact distance has a role in phylogenetic reconstruction, since it did not appear to have advantages over sequence for these globin proteins. Because contacts are binary characters, they could be used with only minor modifications in evolutionary analysis methods used with sequences. Evolutionary models established in studies of sequence can thus be adapted to structure using this approach.

Contact evolution findings may inform models for sequence evolution. The varying structural environment of sites over evolution is largely a consequence of contacts made and broken. Because of change in contacts, the environment of globin residues changes during evolution. This varying residue environment might be a source of heterotachy (Lopez et al. 2002). The distribution of types of amino acids involved in contacts in protein structures has been determined and used to study protein stability (Huang et al. 2000). It is possible that such data could be used to produce amino acid substitution models that incorporate the influence of contact patterns (Jones et al. 1992; Dayhoff et al. 1978). To determine the generality and significance of contact-based models, protein folds in addition to the globins should be studied.