Introduction

Darwin described evolution as “descent with modification in traits by natural selection” (Darwin 1859). Extending Darwin’s statement to the molecular level, genomic sequences have been evolving in different species since they first diverged from common ancestral sequences. The hundreds of complete genomic sequences now available strongly suggest that all life on Earth has a common origin and that its morphological evolution has been mediated by molecular changes. However, evolutionary rates vary significantly among proteins (Kimura and Ohta 1974; Wilson et al. 1977; Wall et al. 2005), and identification of factors affecting protein evolutionary rates has been the focus of numerous studies (Koonin 2005; Koonin and Wolf 2006; McInerney et al. 2006; Pal et al. 2006). Despite these efforts, an unambiguous and consistent mechanism-based explanation of evolutionary rate differences among genes has not yet emerged.

One of the most consistent empirical findings related to the evolutionary rates of sequences is that genic sequences generally evolve slower than nongenic sequences (Gilbert 1978; Li and Graur 1991; Li 1997). Nearly neutral theory explains that the faster rates of nongenic sequence evolution can be attributed to the relaxed functional constraints on these regions (Ohta 1992; Li 1997). This reasoning can be extended to explain the regional evolutionary rate differences within a gene. For example, coding regions evolve slower than the 5′UTR, 3′UTR, and introns; nonsynonymous sites evolve slower than synonymous sites; fourfold degenerate sites evolve more rapidly than less degenerate sites (Li 1997); and certain domains or functional motifs evolve slower than other coding regions of a gene. All of these observations are consistent with the theory of nearly neutral evolution; i.e., the strength of functional constraints determines the rates of evolution within a gene (Li 1997). Thus, the regional sequence conservation observed within a group of orthologous genes provides a useful marker for identifying functionally important domains or motifs (Zhang and He 2005). In addition, the severity of detrimental phenotypic effects caused by genetic perturbations, including site-directed mutagenesis or the deletion of genic regions is a reasonable predictor of the functional importance of that region (Hirsh and Fraser 2001; Castillo-Davis and Hartl 2003; Krylov et al. 2003; Fraser et al. 2003; Rocha and Danchin 2004; Liao et al. 2006; Dotsch et al. 2010). Therefore, it seems clear that the functional constraint is a major determinant of evolutionary rate differences between genic and nongenic sequences, between coding and noncoding regions at the genomic level, and between functional domains and other regions within a single gene.

Functional importance is not, however, a directly measurable quantity. Perhaps for this reason, several studies aiming to identify major determinants of average evolutionary rate differences between proteins have arrived at inconsistent conclusions (Jovelin and Phillips 2009; Wang and Zhang 2009; Zeng and Gu 2010; Razeto-Barry et al. 2011). A specific point of contention has been triggered by reports, such as that of Hurst and Smith (1999), suggesting that essential genes or proteins, as defined by the lethality caused by deleting these genes in model organisms such as yeast and mouse, do not necessarily evolve more slowly than nonessential genes. This is contrary to intuition considering the established relationship between functional importance and evolutionary rate described above. Following Hurst and Smith (1999), several other studies have investigated the relationship between essentiality and evolutionary rate of a protein (Hirsh and Fraser 2001; Jordan et al. 2002; Krylov et al. 2003; Rocha and Danchin 2004; Zhang and He 2005; Dotsch et al. 2010). Essentiality is one of many quantifiable variables explored in the search for determinants of genic evolutionary rate (see below for a discussion of the others). However, the interrelations among these variables and their independent contributions to evolutionary rates are not entirely clear. In this review, we try to synthesize many scattered studies that link gene evolutionary rates with various variables, and we suggest a cohesive way of viewing the relationships between evolutionary rates and their correlative variables. In the following sections, we focus on three determinants that are justified by explicit models, explaining how they may directly affect coding sequence substitution.

Variables and Their Correlations with Genic Evolutionary Rates

Table 1 (see Table 2 for glossary of terms) lists the most prominent variables tested so far with regard to overall evolutionary rates of proteins as well as those studied with regard to regional evolutionary rates within a protein. Evolutionary rates measured by dN (nonsynonymous substitution rate), dS (synonymous substitution rate), or dN/dS (considered as normalized dN) have also been correlated with several other parameters (Koonin 2005; Pal et al. 2006; Koonin and Wolf 2006; McInerney et al. 2006). Essentiality has been the most compelling variable investigated as a potential determinant of protein evolutionary rate because it is an intuitive measure of overall structural–functional constraints on a protein. However, essentiality has failed to emerge as the only or primary variable. In fact, some studies even reported a weak or no negative correlation between essentiality and the evolutionary rate (Hurst and Smith 1999; Yang et al. 2003) (Table 3), although negative correlations have been observed in most of the studies (Hirsh and Fraser 2001; Jordan et al. 2002; Krylov et al. 2003; Rocha and Danchin 2004; Zhang and He 2005; Chen and Xu 2005; Wall et al. 2005; Liao et al. 2006; Plotkin and Fraser 2007; Larracuente et al. 2008; Wang and Zhang 2009; Dotsch et al. 2010).
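As a rough illustration of how dN, dS, and dN/dS relate to one another, the sketch below applies the Jukes-Cantor correction to the proportions of differing nonsynonymous and synonymous sites, as is done in simple counting approaches. This is an assumption made for illustration only; the studies reviewed here use a variety of estimation methods. The function names and the site counts are hypothetical.

```python
import math

def jukes_cantor_distance(p):
    """Convert a proportion of differing sites into substitutions per site.

    Assumes the Jukes-Cantor model, which is only one of several possible
    corrections for multiple substitutions at the same site.
    """
    if not 0.0 <= p < 0.75:
        raise ValueError("p must lie in [0, 0.75) for the Jukes-Cantor correction")
    return -0.75 * math.log(1.0 - (4.0 / 3.0) * p)

def dn_ds(nonsyn_diffs, nonsyn_sites, syn_diffs, syn_sites):
    """Return (dN, dS, dN/dS) from hypothetical counts of sites and differences."""
    dn = jukes_cantor_distance(nonsyn_diffs / nonsyn_sites)
    ds = jukes_cantor_distance(syn_diffs / syn_sites)
    if ds == 0.0:
        raise ValueError("dS is zero, so dN/dS is undefined")
    return dn, ds, dn / ds

# Hypothetical pair of orthologous coding sequences:
# 6 differences at 300 nonsynonymous sites, 15 differences at 100 synonymous sites.
dn, ds, omega = dn_ds(6, 300, 15, 100)
print(f"dN = {dn:.4f}, dS = {ds:.4f}, dN/dS = {omega:.3f}")
```

A dN/dS value well below 1, as in this toy example, is the pattern expected for a coding sequence under purifying selection.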

Table 1 Variables correlated with protein evolutionary rate and the corresponding model
Table 2 Abbreviations
Table 3 Relationship of essentiality with the gene evolutionary rate

Besides essentiality, the variable that has been most consistently observed to correlate with coding sequence evolutionary rate is expression level (EL) of the gene (Krylov et al. 2003; Subramanian and Kumar 2004; Rocha and Danchin 2004; Drummond et al. 2005; Wall et al. 2005; Lemos et al. 2005; Lin et al. 2007) (Table 4). As shown in Table 4, especially in single-celled organisms such as E. coli, B. subtilis, and S. cerevisiae, highly expressed genes are consistently found to evolve slowly, although Hudson and Conant (2011) argued that EL is not a primary predictor of evolutionary rate in mammals. In addition, Liao et al. (2006) showed that mammalian genes follow a different evolutionary rule, i.e., compactness, expression breadth (EB), and essentiality are more important than EL in determining the evolutionary rate. In mammals, EB was found to be consistently negatively associated with the probability of amino acid replacement, even more strongly than EL (Duret and Mouchiroud 2000; Zhang and Li 2004; Subramanian and Kumar 2004; Yang et al. 2005; Liao et al. 2006; Liao and Zhang 2006; Zhu et al. 2008; Park and Choi 2010) (Supplementary Table 1). The number of protein–protein interactions (PPI) is another variable that shows a negative correlation with evolutionary rate, mostly demonstrated in yeast (Supplementary Table 1). It is assumed that proteins with a higher degree of interaction may evolve slowly because more sites should be evolutionarily constrained in a protein that is functionally interacting with several other proteins, relative to those interacting with fewer proteins. In fact, several studies have shown that highly connected proteins in a PPI network evolve more slowly (Fraser et al. 2002; Teichmann 2002; Jordan et al. 2003; Krylov et al. 2003; Hahn et al. 2004; Lemos et al. 2005), although contradictory observations have also been reported (Bloom and Adami 2003; Zhou et al. 2008; Jovelin and Phillips 2009; Podder et al. 2009). Propensity of gene loss (PGL) is also correlated with evolutionary rates (Krylov et al. 2003), i.e., genes with lower PGL are more likely to evolve slowly (Supplementary Table 1).

Table 4 Relationship of EL with the gene evolutionary rate

Variables related to gene length, such as intron, UTR, and CDS lengths (sometimes jointly referred to as gene compactness), are also known to correlate with evolutionary rate. Genes with longer introns, i.e., less compact genes, are more likely to evolve slowly (Marais et al. 2005; Liao et al. 2006; Vinogradov 2010), although a contradictory observation has also been reported (Lemos et al. 2005) (Table 5). The variables pertaining to gene length have been investigated with regard to their correlation with other variables such as recombination rates (Vinogradov 2001; Comeron and Kreitman 2002; Comeron and Guthrie 2005), EL (Vinogradov 2001; Castillo-Davis et al. 2002; Urrutia and Hurst 2003; Marais et al. 2005; Carmel and Koonin 2009; Woody et al. 2011), EB (Vinogradov 2004; Vinogradov 2006; Zhu et al. 2008; Shabalina et al. 2010; Rao et al. 2010), and codon usage (Vinogradov 2001; Comeron and Kreitman 2002; Comeron and Guthrie 2005) (Supplementary Table 2). The relationships of the length variables with other variables are subtle and apparently contradictory at times, and ultimately evade a unified and biologically interpretable theoretical model.

Table 5 Relationship of gene compactness with the gene evolutionary rate

Most of these studies have relied on correlation tests, and correlation does not necessarily imply causation. Some of the relationships between evolutionary rates and variables may be secondary or indirect, because of the variable’s inherent correlation with another, potentially unknown, causative variable. For instance, there is a positive correlation between EL and EB (Subramanian and Kumar 2004; Pal and Guda 2006; Park and Choi 2010), between EL and codon usage bias (Comeron et al. 1999; Duret and Mouchiroud 1999; Iida and Akashi 2000; Urrutia and Hurst 2003; Ingvarsson 2008; Zhou et al. 2009), between PGL and PPI (Krylov et al. 2003), and between pleiotropy and PPI (He and Zhang 2006). All of these variables are correlated with evolutionary rate. It is unclear which of these inter-variable correlations are due to direct causal links and which are indirect correlations caused by their independent correlations with evolutionary rate. One possible way to distinguish causal links from circumstantial correlation might be to look for experimental or theoretical evidence that supports a causal relationship. In the following, we describe three determinants and their corresponding models/hypotheses that specify how each directly and independently influences the evolutionary rate of genes.
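The problem of indirect correlation can be made concrete with a partial correlation, which asks whether two variables remain correlated once a third is held constant. The sketch below is a minimal illustration using Pearson correlations on invented toy data (the variable names and values are hypothetical and are not taken from any of the cited studies, many of which use rank-based statistics instead). The toy data are constructed so that a candidate variable correlates with evolutionary rate only because both track expression level.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def partial_correlation(x, y, z):
    """Correlation between x and y after removing the linear effect of z."""
    r_xy, r_xz, r_yz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))

# Hypothetical, centered toy values for eight genes.
expression = [-3, -1, 1, 3, -3, -1, 1, 3]
noise_a = [1, -1, -1, 1, 1, -1, -1, 1]   # unrelated to expression
noise_b = [1, 1, 1, 1, -1, -1, -1, -1]   # unrelated to expression and noise_a

# The candidate variable rises with expression; the rate falls with it.
candidate = [e + a for e, a in zip(expression, noise_a)]
rate = [-e + b for e, b in zip(expression, noise_b)]

raw = pearson(candidate, rate)
controlled = partial_correlation(candidate, rate, expression)
print(f"raw correlation: {raw:.3f}; controlling for expression: {controlled:.3f}")
```

In this constructed case the strong raw correlation vanishes once expression is controlled for. A partial correlation cannot by itself establish causation, but it helps flag relationships that are likely to be secondary.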

“Function-Centered” Variable and the “Function (Fitness) Density” Model

Several variables including solvent accessibility (Choi et al. 2006; Lin et al. 2007; Conant and Stadler 2009; Franzosa and Xia 2009; Ramsey et al. 2011; Toth-Petroczy and Tawfik 2011), types of interaction (e.g., hydrogen bond, disulfide, ionic, hydrophobic, etc.) and types of secondary structure (helix, strand, loop, turn, etc.) in a protein (Choi et al. 2006; Peralta et al. 2011), and types of alternative splicing (e.g., ASE (alternative splicing exon), and CSE (constitutive splicing exon)) (Ermakova et al. 2006; Plass and Eyras 2006; Chen et al. 2012; Wu and Chen 2012) have been considered to be important in determining the regional evolutionary rates within a protein (Table 1). For instance, the sites encoding internal amino acids of a protein (i.e., a lower solvent accessibility), the sites involved in disulfide or hydrophobic interactions, the sites with higher electrostatic charge, or the sites participating in helix or strand structure of a protein tertiary structure were reported to be under a stronger evolutionary constraint and to evolve slowly, although there are also some contradictory findings even on this regional evolutionary rate issue (Zhou et al. 2008). At the most basic level, it is the mutations at nucleotide sites that are favored or disfavored by selection depending on whether and how they affect organismal fitness or reproductive success. Therefore, according to the “function (fitness) density” model, the fraction, or density, of functionally important sites in a protein should ultimately determine the protein’s evolutionary rate (Zuckerkandl 1976; Rocha 2006; Lin et al. 2007; Wang and Zhang 2009). Thus, it seems intuitive that proteins with a greater fraction of functionally constrained sites evolve more slowly, implying that the region-specific variables are specific instances of a more general variable for quantifying overall functional constraints on a protein.
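The intuition behind the “function (fitness) density” model can be sketched with a deliberately simplified two-class calculation: if a fraction of sites is constrained and evolves at a reduced rate while the remaining sites evolve at a near-neutral rate, the protein-wide rate is the weighted average of the two. The two-class assumption, the function name, and the rate values below are ours and serve only as an illustration.

```python
def protein_rate(fraction_constrained, constrained_rate, neutral_rate):
    """Average per-site rate of a protein under a hypothetical two-class model.

    fraction_constrained: share of sites that are functionally constrained (0 to 1)
    constrained_rate: substitution rate at constrained sites (arbitrary units)
    neutral_rate: substitution rate at unconstrained sites (same units)
    """
    if not 0.0 <= fraction_constrained <= 1.0:
        raise ValueError("fraction_constrained must lie between 0 and 1")
    return (fraction_constrained * constrained_rate
            + (1.0 - fraction_constrained) * neutral_rate)

# Hypothetical rates: constrained sites evolve at one tenth of the neutral rate.
for fraction in (0.2, 0.5, 0.8):
    rate = protein_rate(fraction, constrained_rate=0.1, neutral_rate=1.0)
    print(f"fraction constrained = {fraction:.1f} -> relative rate = {rate:.2f}")
```

Under these assumptions the protein-wide rate falls steadily as the density of constrained sites rises, which is the qualitative prediction of the model.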

Essentiality (or dispensability) was the first variable to be tested for correlation with evolutionary rate of a protein, because it is arguably an appropriate measure of a protein’s overall functional importance. Classically, proteins that have a large fraction of functionally important residues are expected to be under stronger evolutionary constraints and are also expected to be essential and to evolve relatively slowly (Hirsh and Fraser 2001; Jordan et al. 2002; Krylov et al. 2003; Rocha and Danchin 2004; Koonin 2005; Pal et al. 2006). Essentiality has generally been estimated by phenotypic lethality or sometimes by a growth rate differential after a gene knock-out or knock-down in a model organism such as yeast or mouse (Gerdes et al. 2003; Fang et al. 2005; Kim and Copley 2007; Gong et al. 2008; Scholle and Gerdes 2008; Bergmiller et al. 2012). With some exceptions, as mentioned earlier, significant negative correlations have been consistently observed between gene essentiality and evolutionary rate (Table 3). The negative relationship between essentiality and evolutionary rate was observed in multicellular organisms, but was less obvious in unicellular organisms such as yeast (Hurst and Smith 1999; Yang et al. 2003; Chen and Xu 2005), possibly due to the inadequacy of essentiality as a measure of functional importance. Essentiality may not be a robust proxy for functional importance for several reasons: (1) essentiality has been measured under laboratory conditions which are nutritionally abundant compared to natural conditions, and therefore may not faithfully reveal essential genes in natural conditions (Pal et al. 2006), (2) as such, essentiality measures the effect of complete gene deletion, a relatively rare occurrence in nature, and therefore cannot capture the fraction of functionally constrained sites in the protein (Pal et al. 2006), (3) essential genes in one species may not be essential in other organisms used to estimate the evolutionary rate (Bergmiller et al. 2012), (4) essential genes for growth may not be essential from an evolutionary point of view (Fang et al. 2005), and (5) while essentiality implies functional importance, the converse need not be true; a gene affecting reproductive fitness is clearly functionally important and will be evolutionarily constrained but may not be essential, as measured by the experiments.

Functional importance of a protein can be generally defined as the effect size on fitness caused by any perturbation in the protein’s activity. Under this definition, one can unify several variables. Specifically, a protein expressed in early embryonic stage rather than in late adult stage, a protein expressed broadly rather than restricted to a tissue, or a protein that is highly connected to other proteins, is more likely to be functionally important and therefore more likely to be evolutionarily constrained. Thus, functional importance of a gene has multiple facets represented by presumably independent variables. These variables nevertheless exert a joint effect on functional importance, and their individual effects on overall functional importance and evolutionary rates are difficult to disambiguate. It is notable that several studies have investigated whether proteins expressed in a broad range of tissues (EB), more connected in a PPI network, affecting several processes or phenotypes (pleiotropy), and with a lower propensity of gene loss (PGL) during evolution, are more constrained and more likely to be essential (Tudor et al. 1999; Jeong et al. 2001; Wuchty 2002; Krylov et al. 2003; Yu et al. 2004; Hahn et al. 2004; Hahn and Kern 2005; Yang et al. 2005; Goh et al. 2007; Zotenko et al. 2008; Park and Kim 2009; Hao et al. 2010) (Supplementary Table 1). In fact, all four variables (EB, PPI, pleiotropy, and PGL) have been shown to be correlated with essentiality (Supplementary Table 2). Although the correlation is sometimes weak (Coulomb et al. 2005), the existence of this correlation indicates that these four variables may affect evolutionary rate only indirectly via their effect on essentiality. Furthermore, several studies have shown that there are inherent correlations among these variables; EB has a positive correlation with PPI (Alvarez-Ponce 2012; Rodgers-Melnick et al. 2012) and pleiotropy (Tuller et al. 2008), PPI has a positive correlation with pleiotropy (He and Zhang 2006), and PGL has a negative correlation with PPI (Krylov et al. 2003). Thus, these variables (essentiality, EB, PPI, pleiotropy, and PGL) may not be independent and can operationally be grouped into a single category called the “function (fitness)-centered” variable, representing the overall functional importance of a protein. The “function (fitness)-centered” variable and the corresponding “function (fitness) density” model together support the idea that functionally important proteins, and more specifically, the proteins with a greater fraction of functionally important residues, are more likely to evolve slowly. Thus, we argue that the “function (fitness)-centered” variable unifies several variables and, as such, represents a more complete variable in determining protein evolutionary rate, one that is likely to be more independent of other factors that could potentially influence evolutionary rate.

EL and Two Different Hypotheses: The “Translational Selection” Hypothesis and “Mistranslation-Induced Misfolding (MIM)” Hypothesis

The possibility that EL can constrain nucleotide sequence evolution was first recognized from the observation that highly expressed genes are biased in their synonymous codon usage (i.e., codon usage bias) in several prokaryotes and unicellular eukaryotes including E. coli, S. typhimurium, and S. cerevisiae (Pal et al. 2001; Krylov et al. 2003; Marais et al. 2004; Rocha and Danchin 2004; Wall et al. 2005), although this relationship was less obvious in multicellular organisms (Liao et al. 2006; Hudson and Conant 2011) (Supplementary Table 3). Several groups suggested the “translational selection” hypothesis as an explanation of how the requirement for a high gene expression level might directly affect synonymous changes in the coding region of a gene (Sharp et al. 1993; Akashi 1994; Akashi and Eyre-Walker 1998; Moriyama and Powell 1998; Akashi 2001; Drummond et al. 2006; Comeron 2006; Kotlar and Lavner 2006; Waldman et al. 2011; Gingold and Pilpel 2011; Plotkin and Kudla 2011). The “translational selection” hypothesis posits that highly expressed proteins require optimized codons for accurate and efficient translation, which produces a negative correlation between codon usage bias and dS, and between EL and dS (Precup and Parker 1987; Akashi 1994; Akashi and Eyre-Walker 1998; Comeron et al. 1999; Iida and Akashi 2000; Urrutia and Hurst 2001; Stoletzki and Eyre-Walker 2007; Ingvarsson 2008; Hiraoka et al. 2009). Interestingly, some studies have suggested that selection favors preferred codons at sites where misincorporations, or translational errors, are expected to be critical, implying that translational selection for increased accuracy might, in part, influence variation in codon usage bias and dN (Precup and Parker 1987; Akashi 1994; Stoletzki and Eyre-Walker 2007; Kramer and Farabaugh 2007; Drummond and Wilke 2009).

Several papers have shown that EL can constrain nonsynonymous codon evolution. The important role of EL in determining nonsynonymous evolutionary rate has been well described in bacteria, yeast, and Drosophila (Pal et al. 2001; Rocha and Danchin 2004; Zhang and He 2005; Drummond et al. 2005, 2006; Larracuente et al. 2008) (Table 4). A strong negative correlation was consistently observed between the dN and the EL of a gene, although this correlation is stronger in unicellular organisms than in multicellular organisms (Liao et al. 2006; Hudson and Conant 2011). However, the “translational selection” hypothesis is not sufficient to explain why EL shows an even stronger correlation with dN than it does with dS (Wilke and Drummond 2006; Gingold and Pilpel 2011). Drummond et al. (2006) have therefore suggested a novel hypothesis, “mistranslation-induced misfolding (MIM),” which posits that highly expressed genes evolve more slowly because they experience a stronger negative selection against the toxic effects of misfolded proteins induced by mistranslation, leading to a slower rate of coding sequence substitutions (Drummond et al. 2006; Drummond and Wilke 2008). Consistent with this hypothesis, when analyzing the relationship between codon bias and protein structural integrity, some groups found that translationally optimal codons were preferentially used at the sites at which mutations led to protein misfolding and aggregation (Zhou et al. 2009; Lee et al. 2010). In addition, Yang et al. (2012) showed through simulation that highly abundant proteins in yeast are more likely to use misfolding-minimizing amino acids and that these sites are evolutionarily more constrained than other sites of the same proteins. Thus, EL can influence the rates of both synonymous and nonsynonymous changes and represents a major independent determinant of evolutionary rate.

Gene Compactness and the “Hill-Robertson Interference” Hypothesis

Gene compactness, variously measured by intron length, UTR length, or CDS length, has additionally been considered as an independent variable determining coding sequence evolution (Comeron et al. 1999; Duret and Mouchiroud 1999; Subramanian and Kumar 2004; Liao et al. 2006; Kim and Yi 2007; Larracuente et al. 2008) (Table 5). For instance, correlative studies in several organisms including E. coli, D. melanogaster, and C. elegans revealed CDS length to be inversely correlated with codon bias (Kliman and Hey 1993; Comeron et al. 1999; Marais et al. 2001; Comeron and Kreitman 2002; Campos et al. 2012) (Supplementary Table 4). Interestingly, the relationship between intron length and codon bias is mixed: a positive correlation has been observed in unicellular organisms, whereas a negative correlation has been observed in multicellular organisms (Vinogradov 2001; Comeron and Kreitman 2002; Comeron and Guthrie 2005) (Supplementary Table 4). Recombination was considered as a mechanism underlying the relationship between gene or intron length and codon bias (Comeron et al. 1999; Marais et al. 2001; Duret 2001; Comeron and Kreitman 2002; Fedorova and Fedorov 2003; Pal et al. 2006). Several papers have reported that a lower probability of recombination in short genes can reduce the effectiveness of natural selection, thus reducing the codon bias in those genes (Kliman and Hey 1993; Hudson 1994; Betancourt et al. 2009; Charlesworth et al. 2009; Campos et al. 2012). More generally, the “Hill-Robertson interference” hypothesis posits that efficient natural selection at two genetic loci is curbed by low recombination and higher linkage between the two loci (Hill and Robertson 1966; Felsenstein 1974; Gordo and Charlesworth 2001). In other words, when two loci are genetically linked, both the fixation of beneficial mutations and the elimination of deleterious mutations at a site can be prevented due to interference caused by selection at the linked site (Marais et al. 2005; Larracuente et al. 2007). Recombination can enhance the effectiveness of selection by breaking the linkage during meiosis (Carvalho and Clark 1999; Comeron et al. 1999; Duret 2001; Comeron and Kreitman 2002). Thus, higher rates of recombination can affect evolution in both directions, either a decreased or an increased evolutionary rate, depending on the relative occurrence of advantageous and deleterious mutations (Pal et al. 2006). Hill-Robertson interference implies that a larger genomic distance between sites under selection (say, in less compact genes with longer introns separating exons) can facilitate natural selection such that advantageous mutations can be fixed and deleterious mutations can be eliminated more efficiently.
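To give a feel for why distance between selected sites matters, the sketch below converts physical distance into an expected recombination fraction using Haldane's map function. This is an illustrative assumption on our part rather than a method used in the studies cited above; it treats the per-base recombination rate as uniform, and both the rate (roughly 1 cM per Mb) and the intron lengths are hypothetical.

```python
import math

def recombination_fraction(distance_bp, rate_per_bp):
    """Expected recombination fraction between two sites (Haldane's map function).

    distance_bp: physical distance between the sites, in base pairs
    rate_per_bp: crossover rate per base pair per generation (assumed uniform)
    """
    map_distance = distance_bp * rate_per_bp  # genetic distance in Morgans
    return 0.5 * (1.0 - math.exp(-2.0 * map_distance))

RATE = 1e-8  # hypothetical rate, about 1 cM per Mb

# Two exons separated by a short versus a long intron (hypothetical lengths).
short_intron = recombination_fraction(100, RATE)
long_intron = recombination_fraction(10_000, RATE)
print(f"short intron: {short_intron:.2e}; long intron: {long_intron:.2e}")
```

Over the short distances found within genes the recombination fraction grows almost linearly with distance, so in this sketch a hundredfold longer intron gives sites in flanking exons roughly a hundredfold greater chance of being uncoupled in a given generation.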

There are two potential evolutionary mechanisms for relieving Hill-Robertson interference: lowering compactness via elongation of the introns, UTRs, or even CDS of a gene, or increasing recombination rates. Interestingly, in Drosophila, genes located in regions of lower recombination rates tend to have longer introns (Carvalho and Clark 1999; Comeron and Kreitman 2000), which might be interpreted as a mechanism for enhancing the probability of recombination, presumably to facilitate natural selection (Comeron and Kreitman 2000; Comeron et al. 2008). However, Prachumwat et al. (2004) reported the opposite finding in C. elegans, where longer introns are located in regions of higher recombination rates.

Clearly, intron (or exon) number is directly related to gene compactness and therefore indirectly influences gene evolutionary rate. However, intron number can also influence the evolutionary rate via alternative mechanisms. For instance, some groups showed that mammalian exonic splice site enhancers (ESEs) located at the exon–intron boundaries are strongly constrained, thus constraining the codons near the boundaries. Consequently, intron number is significantly negatively correlated with dN and dS (Parmley et al. 2007; Larracuente et al. 2008; Carmel and Koonin 2009).

The relationship between gene compactness and the rate of protein evolution is nuanced, and only a few studies have investigated this relationship directly (Table 5). Marais et al. (2005) found that among 630 Drosophila genes analyzed, the genes with introns have significantly lower dN, and there is a negative relationship between total intron length and dN. The authors concluded that the negative relationship is likely to be driven by a need for more efficient purifying selection against deleterious mutations in genes with longer introns due to relaxation of the Hill-Robertson interference. Liao et al. (2006) also showed that genes with longer introns or UTRs tend to evolve slowly, consistent with a more effective negative selection in less compact genes.

This relationship between intron length and evolutionary rate is further complicated when we consider the relationship between EL and intron length. Several studies have shown that highly expressed genes have shorter introns (Castillo-Davis et al. 2002; Urrutia and Hurst 2003; Subramanian and Kumar 2004; Warringer and Blomberg 2006) (Table 6). Taken together with the observation that genes with higher EL evolve more slowly, this would predict that genes with longer introns evolve faster than genes with shorter introns. However, as described above, genes with longer introns evolve more slowly, consistent with relaxation of the Hill-Robertson interference in the presence of purifying selection. In fact, inconsistent with the studies showing a negative relationship between intron length and EL, some studies have found that genes with longer introns or CDS are expressed at a higher level, particularly in plants such as rice and Arabidopsis (Vinogradov 2001; Marais et al. 2005; Stenoien 2007; Carmel and Koonin 2009; Woody et al. 2011). Similarly, the relationship between EL and codon bias has not been straightforward, and highly expressed genes do not necessarily use more biased codons, especially in mammals (Gonzalez et al. 1989; Fitch and Strausbaugh 1993; Hiraoka et al. 2009; Misawa and Kikuno 2011) (Supplementary Table 3). Furthermore, to appreciate the nuanced relationship between EL and intron length, and between intron length and evolutionary rate, one more issue should be considered: the two different models that explain how or why introns are maintained in genomes, “selection for economy” and “genome design.”

Table 6 Relationship of EL/EB and compactness of genes

The Relationship Between EL and Compactness: the “Genome Design” Versus the “Selection for Economy” Model

Several studies have attempted to show how one evolutionary-rate-correlative variable is correlated with other such variables (Rocha and Danchin 2004; Liao et al. 2006; Larracuente et al. 2008). For instance, EL has a positive correlation with EB but a negative correlation with intron length (Vinogradov 2001; Marais et al. 2005; Liao et al. 2006; Stenoien 2007; Park and Choi 2010) (Table 6). However, it is unclear whether these various correlative relationships are due to direct causation. For example, with respect to the observation that genes with a higher EL tend to have shorter introns, it is possible that these two variables are correlated because they both independently influence coding sequence evolution and are correlated with evolutionary rate. Similarly, with regards to the correlation between compactness and the various fitness-centered variables, it has not been shown whether intron length of a gene is directly related to functional importance of that gene independent of its relationship with evolutionary rate. Of the three broad categories of variables—functional importance, EL (or EB), and compactness, no mechanistic models have been proposed that link functional importance to either EL or gene compactness. However, two models have been proposed to explain the relationship between expression and compactness: the “selection for economy” model and “genome design” model (Castillo-Davis et al. 2002; Eisenberg and Levanon 2003; Urrutia and Hurst 2003; Wagner 2005; Vinogradov 2006).

The “selection for economy” model posits that due to high energetic costs associated with transcription and translation, natural selection would favor compactness (i.e., shorter size) of highly expressed genes (Castillo-Davis et al. 2002; Urrutia and Hurst 2003). As described above, while several studies have shown a negative correlation between EL and intron length, i.e., positive correlation between EL and compactness (Table 6), several others have found the opposite result, i.e., that highly expressed genes have longer introns (Table 6). A negative correlation between EB and intron length is also expected following the energetics argument. However, similar to EL, while some studies found a positive correlation between EB and intron length, others reported a negative correlation (Eisenberg and Levanon 2003; Rao et al. 2010; Vinogradov 2004; Zhu et al. 2008) (Table 6).

The studies that demonstrated negative correlations between EB and intron length generally adopted the “selection for economy” model (Moriyama and Powell 1998; Castillo-Davis et al. 2002; Eisenberg and Levanon 2003; Rao et al. 2010; Zhu et al. 2008). In contrast, to explain a positive correlation between EB and intron lengths, other studies invoke the “genome design” model, which posits that genes expressed in multiple contexts may require a more complex regulatory mechanism and thus may have longer introns to accommodate regulatory elements relative to genes with shorter introns (Vinogradov 2004). Consistent with this possibility, in Drosophila, longer introns, especially first introns, were found to be evolutionarily more conserved (Haddrill et al. 2005; Marais et al. 2005; Presgraves 2006). Intronic sequence conservation has long been considered as an indicator of transcriptional regulatory elements (Fedorova and Fedorov 2003; Marais et al. 2005; Parra et al. 2011). In addition, Vinogradov (2004) showed that genes with intermediate breadth of expression, likely to require a more complex regulatory mechanism, are more likely to have longer introns, relative to genes expressed within a specific context or expressed ubiquitously.

It seems that the “selection for economy” model could explain why some broadly or highly expressed genes have shorter introns. However, the “genome design” model does not explain why some highly expressed genes have longer introns even in unicellular eukaryotes, because it is unclear how the complexity of transcriptional regulation would affect the level of gene expression. Nor is the “selection for economy” model sufficient to explain the observation that some highly or broadly expressed genes have long introns. In those cases, longer introns might be favored because they enable recombination, thereby enhancing the efficiency of natural selection.

Concluding Remarks

In this review, we have suggested a framework based on known mechanistic models to interpret the correlations between evolutionary rate and the various variables, as well as the correlations among the variables themselves (Fig. 1). We first attempted to clarify the concept of the functional importance of a gene as it relates to evolutionary rate. While gene essentiality is a highly intuitive proxy for functional importance, we have argued that the five independent variables studied so far, including essentiality, ought to be considered jointly as a “function (fitness)-centered” variable. We then considered three variables (functional importance, EL, and gene compactness) as independent effectors of a gene’s evolutionary rate.

Fig. 1

Three major determinants of gene evolutionary rate and the corresponding models. The three major determinants of gene evolutionary rate are the gene’s functional importance, its expression level, and its compactness. Functional importance is represented by five interrelated variables: EB, PPI, PGL, pleiotropy, and essentiality. Genes with higher EB, PPI, or pleiotropy, and genes with lower PGL, are more likely to be essential and thus tend to evolve more slowly. All five variables are inherently correlated with one another and jointly represent the overall functional importance of a protein; their combined influence on evolutionary rate is consistent with the “function (fitness) density” model described in the main text. EL is another primary determinant, with a negative impact on evolutionary rate. Codon usage bias has a strong positive correlation with EL. The MIM hypothesis has been suggested to explain why highly expressed genes evolve more slowly than lowly expressed genes. Gene compactness, represented by intron, CDS, and UTR lengths, is a third independent determinant of evolutionary rate. In longer (i.e., less compact) genes, the higher probability of recombination can make natural selection more efficient, as posited by the “Hill-Robertson interference” hypothesis, thereby increasing or decreasing the rate of evolution depending on the relative occurrence of advantageous and deleterious mutations, respectively. This may explain why both positive and negative correlations between these lengths and evolutionary rates have been reported. Refer to Table 2 for abbreviations

We have also attempted to separate the issue of identifying determinants of evolutionary rate from the correlative (secondary) relationships among variables. For instance, some studies showed that highly expressed genes tend to have smaller introns (Castillo-Davis et al. 2002; Subramanian and Kumar 2004; Urrutia and Hurst 2003; Warringer and Blomberg 2006) and asked whether genes with smaller introns might evolve more slowly because they are highly expressed (Marais et al. 2005). In fact, the opposite was found to be true, i.e., genes with longer introns evolve more slowly. This conceptual inconsistency arises mainly from the complex relationships among different variables: (1) the relationship between intron size (a measure of gene compactness) and EL proposed by the “selection for economy” and “genome design” hypotheses, (2) the relationship between EL and evolutionary rate posited by the MIM hypothesis, and (3) the relationship between intron size and evolutionary rate posited by the “Hill-Robertson interference” hypothesis. Because these three relationships are controlled by different evolutionary forces and act independently, it may not be possible to predict the overall rate at which a gene with short introns will evolve. Even for (1), the two models offer opposing explanations for how intron length is influenced by EL. Furthermore, it remains debatable whether the correlations of evolutionary rate with the lengths of introns, CDS, or UTRs (i.e., compactness) are mainly controlled by MIM or by the degree of relief from Hill-Robertson interference. If the former mechanism is stronger, shorter genes should evolve more slowly, because shorter genes tend to be highly expressed, and highly expressed genes are more vulnerable to the toxicity caused by translational errors and are thus more likely to be under stronger selective constraint. If the latter mechanism is stronger, genes with shorter introns should evolve more rapidly, assuming that purifying selection is more prevalent than positive selection. Some groups demonstrated that intronless genes evolve faster in mammals and humans, which indicates that EL rather than intron length is the primary determinant of evolutionary rates in those systems (Agarwal 2005; Shabalina et al. 2010).

Another confounding issue is determining whether multiple variables exert independent influences on evolutionary rate. For instance, Chen and Dokholyan (2008) showed that essential proteins tend to have lower aggregation propensity than nonessential proteins, suggesting that EL might share its influence on evolutionary rate with functional importance. On the other hand, Wolf et al. (2008) demonstrated that structural–functional constraints and EL make comparable contributions to the rate of protein sequence evolution, suggesting independent roles for EL and functional importance in determining the evolutionary rate. Kim and Yi (2007) showed, through partial correlation and principal component analysis, that protein length and essentiality play independent roles in protein evolution. Larracuente et al. (2008) also showed, through a partial correlation study in Drosophila, that gene essentiality and recombination, along with tissue specificity of gene expression and intron number, contribute to evolutionary rates. Taken together, our synthesis of the current literature suggests that three main determinants (functional importance, EL, and compactness via recombination), each supported by a mechanistic model, act simultaneously and independently to determine the overall evolutionary rate of a gene.
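
The partial correlation analyses cited above ask whether two variables remain associated once a third is held fixed. As a minimal sketch, assuming the first-order Pearson form (the cited studies may have used rank-based or higher-order variants), the Python below computes a raw and a partial correlation on hypothetical simulated data in which EL drives both intron length and evolutionary rate, with no direct link between the latter two. The function names (pearson, partial_corr, simulate) and the simulated effect sizes are illustrative assumptions and are not taken from any of the cited studies.

```python
import math
import random


def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    var_a = sum((u - mean_a) ** 2 for u in a)
    var_b = sum((v - mean_b) ** 2 for v in b)
    return cov / math.sqrt(var_a * var_b)


def partial_corr(x, y, z):
    """First-order partial correlation of x and y, controlling for z."""
    r_xy, r_xz, r_yz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


def simulate(n=5000, seed=0):
    """Hypothetical genes: EL affects intron length and rate separately."""
    rng = random.Random(seed)
    el = [rng.gauss(0, 1) for _ in range(n)]
    # Assumption for illustration: higher EL -> shorter introns.
    intron_len = [-e + rng.gauss(0, 1) for e in el]
    # Assumption for illustration: higher EL -> slower evolution.
    rate = [-e + rng.gauss(0, 1) for e in el]
    return el, intron_len, rate


el, intron_len, rate = simulate()
# Raw association between intron length and rate, induced only by shared EL.
print("raw r:", round(pearson(intron_len, rate), 3))
# Association remaining after EL is controlled; expected to be close to zero here.
print("partial r given EL:", round(partial_corr(intron_len, rate, el), 3))
```

In this toy setting the raw correlation is purely secondary, so it largely vanishes once EL is controlled. A variable that acts as an independent determinant would instead retain a clearly nonzero partial correlation.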