
1 Introduction

Production of soluble and functionally active proteins in heterologous and homologous host organisms is the cornerstone of many modern biotechnology applications. In recent years, the demand for recombinant proteins used in research laboratories or in medical settings (e.g., for therapeutic applications) has increased dramatically. Specifically, the protein therapeutic market was valued in excess of $85 billion in 2010 and is predicted to double by the end of 2018, reaching up to $165 billion, as new products (especially therapeutic monoclonal antibodies) become available (http://www.researchandmarkets.com/reports/2729030/global_protein_therapeutics_market_outlook_2018). Despite the strong existing and potential significance of efficient recombinant protein production for both research applications and development of novel therapeutics, obtaining soluble, active recombinant proteins in sufficient amounts remains challenging in many cases.

A wide variety of recombinant protein expression systems are well established. These include, but are not limited to, various cellular systems, such as bacterial, yeast, insect and mammalian systems [1–7], and cell-free in vitro systems [8, 9]. The urgent need for robust and highly scalable protein manufacturing systems has further led to the development of in vivo plant- and animal-based systems [10–13]. All of these systems have their own advantages and disadvantages [14]. The choice of system to use for a particular application depends on the specific requirements for the final recombinant protein product (e.g., requirements for proper protein processing and/or co- and posttranslational protein folding and modifications) [14]. In most cases, use of a recombinant protein expression system that closely resembles the protein’s natural in vivo expression system/environment is highly desirable, but this is obviously not always achievable [14]. For example, toxicity of the final product may not allow enhanced expression of a protein in a homologous, or even heterologous, cellular system(s) [15, 16]. In such cases, cell-free protein synthesis systems on a larger scale, particularly with continuous action, may offer an alternative solution [8, 9, 15, 17, 18]. In addition, expression of unmodified natural genes in a homologous environment frequently does not support levels of protein expression sufficient for large-scale protein production. The key to solving this problem lies in development of gene redesign approaches that result in robust expression of functionally active proteins both inside and outside their natural (homologous) cellular environments.

One of the main approaches to gene redesign facilitating protein production in heterologous and homologous organisms [19–21] takes advantage of the degeneracy of the genetic code (meaning a given amino acid may be encoded by more than one “synonymous” codon). Synonymous codons are present at different frequencies in different organisms and are decoded at different rates [22–24]. Therefore, substitution of synonymous codons in a gene can dramatically affect the rate/efficiency of synthesis of the encoded protein without altering its amino acid sequence [19–21]. In a given organism, frequently used codons are typically translated more rapidly than infrequently used ones because tRNAs corresponding to the frequently used codons are relatively more abundant [25–31]. Many synonymous codons that are frequently used in eukaryotes (especially mammals) are utilized with low frequency in prokaryotes [22–24] such as the bacterium Escherichia coli, one of the most common hosts for heterologous protein production [14]. The impact of these differences on recombinant protein production is now well appreciated, and it has been clearly demonstrated that the level of protein expression in heterologous and homologous organisms can be increased through suitable selection of synonymous (frequent) codons along target mRNAs [19–21].

In addition to the effects of differential codon usage, the secondary structure of messenger RNA (mRNA) has been recognized as a factor that can have a negative impact on translation and reduce protein yields by slowing or blocking translation initiation and/or the movement of ribosomes along the mRNA [32–39].

Several other considerations important for recombinant protein production (e.g., choice of appropriate vector/promoter system(s), means of gene delivery, etc.) are outside the scope of this short review.

Approaches involving substitution of the majority of infrequently used codons with synonymous frequently used ones, often combined with elimination of extreme GC content that could contribute to formation of stable mRNA secondary structures, have been widely used by many biotechnology companies and research groups for optimization of heterologous gene/protein expression, but with mixed results ([19, 40] and references therein). Use of gene sequences optimized through the abovementioned approaches often yielded large amounts of recombinant proteins [19, 40]; however, in many cases, the products formed biologically inactive insoluble aggregates which had to be refolded (whenever it was possible) in order to regain similarity in structure and biological activity with native analogues [19, 28]. Moreover, even when proteins expressed in heterologous or homologous hosts remained soluble, they were not necessarily natively folded [41].

These and other experiments made the scientific community aware that synonymous codon usage affects not only the efficiency of translation but also other aspects of gene function, particularly protein folding. The significance of synonymous codon usage for protein folding was highlighted by findings showing that multiple and, more surprisingly, single synonymous substitutions/mutations can affect proteins’ activity [42–44], interactions with drugs and inhibitors [43], phosphorylation profiles [45], sensitivity to limited proteolysis [43, 45, 46], spectroscopic properties [47], and aggregation propensity [47–49] and ultimately change protein structure [50].

Many recent studies have shown that synonymous substitutions or naturally occurring synonymous mutations are not neutral and may affect gene function by multiple mechanisms [51, 52], including but not limited to those mentioned above, as well as mechanisms exerting effects on mRNA splicing and/or mRNA stability [53, 54]. Synonymous codon choice has also been suggested to affect efficient interaction of nascent polypeptides with the signal recognition particle [55]. Changes in codon context caused by synonymous mutations may also induce mistranslation leading to protein misfolding [56].

While in many instances complete understanding of the exact effects caused by synonymous substitutions and/or mutations is still lacking, it nevertheless seems possible to use existing knowledge to develop some common rules for gene design and redesign that should increase the chances of obtaining the desired levels and activity of the expressed recombinant proteins and reduce protein misfolding and aggregation.

This review discusses the most common approaches to gene redesign that involve synonymous codon substitutions and contains a set of recommendations for optimizing protein synthesis and folding through this approach. These recommendations take into account recent developments in the field highlighting the impact of synonymous codon usage on protein production and function.

2 Synonymous Gene Exploration in Protein Production and Folding

Designing an optimal gene for recombinant protein production requires choosing from an enormous number of possible DNA/RNA sequences. It is a combinatorial problem: because each amino acid is encoded by roughly three synonymous codons on average, a sequence with N codons has approximately 3^N possible variants. However, as discussed below, this number can be substantially reduced by taking into account a set of critical considerations.
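The size of this search space can be made concrete with a short calculation. The following is a minimal sketch, assuming the standard genetic code (NCBI translation table 1); the function names are illustrative and do not come from any of the tools discussed in this review. It multiplies the number of synonymous codons available at each position of a protein sequence.

```python
from collections import Counter
from itertools import product
from math import prod

# Standard genetic code (NCBI translation table 1), bases ordered T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# Number of synonymous codons per amino acid (stop codons excluded).
DEGENERACY = dict(Counter(aa for aa in CODON_TABLE.values() if aa != "*"))


def count_synonymous_variants(protein: str) -> int:
    """Return how many distinct coding sequences encode `protein`."""
    return prod(DEGENERACY[aa] for aa in protein)


def mean_degeneracy() -> float:
    """Average number of codons per amino acid over the 20 amino acids."""
    return sum(DEGENERACY.values()) / len(DEGENERACY)
```

For example, `count_synonymous_variants("MLSR")` evaluates to 216 (1 × 6 × 6 × 6), and the mean degeneracy of 61/20 is the origin of the "about three codons per amino acid" approximation.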

In general, two global gene design/redesign approaches predominate: (1) de novo gene design based on reverse translation from an amino acid sequence to DNA/RNA and (2) gene redesign based on recoding of a natural DNA/RNA sequence. Numerous online/web-based and stand-alone platforms are available for use in one or both of these approaches. These include, for example, Codon Optimization OnLine (COOL) [57], DNA Works [58], D-Tailor [59], EuGene [60], GeneDesign [61], Gene Designer 2.0 [62], JCat [63], mRNA Optimiser [64], OPTIMIZER [65], Synthetic Gene Designer [66], TmPrime [67], Visual Gene Developer [68], and others (for a review see [69]). The majority of available tools, however, start with a natural DNA/RNA sequence and employ either codon or RNA structure optimization algorithms (or both) to maximize gene expression; only TmPrime [67] is a “pure” de novo back-translation tool. GeneDesign [61] and OPTIMIZER [65] offer both possibilities – de novo back-translation from protein to DNA/RNA sequence and recoding of the natural DNA/RNA sequence.

Most of the abovementioned platforms customize codon usage by setting codon frequency percentage [70] and/or Codon Adaptation Index (CAI) [71] thresholds and then substituting rare synonymous codons with frequent ones along the entire open reading frame (ORF) of a gene to achieve the desired threshold level(s). Substitutions are selected based on known organism-specific codon biases [22–24, 68]. The COOL [57], D-Tailor [59], EuGene [60], OPTIMIZER [65], and Visual Gene Developer [68] tools also take into account the RNA structure and/or GC/AT content, aiming to reduce obstacles related to formation of stable RNA structures. mRNA Optimiser [64] and TmPrime [67] focus solely on mRNA secondary structure optimization, avoiding stable secondary structures by maximizing the minimum free energy (MFE) of the nucleotide sequence without changing the resulting amino acid sequence.
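To illustrate what a CAI threshold measures, the sketch below computes the index in its usual form: each codon is scored by its usage relative to the most used synonymous codon, and the CAI is the geometric mean of these scores over the ORF. This is a minimal illustration, not the implementation of any tool listed above. The usage table `TOY_USAGE` is hypothetical, and skipping Met, Trp, and stop codons is an assumption of this sketch.

```python
from itertools import product
from math import exp, log

# Standard genetic code (NCBI translation table 1), bases ordered T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# Hypothetical usage values (per 1,000 codons), for illustration only.
TOY_USAGE = {"AAA": 30.0, "AAG": 10.0, "GAA": 40.0, "GAG": 20.0}


def relative_adaptiveness(usage):
    """Map each codon to its usage divided by that of its most used synonym."""
    best = {}
    for codon, freq in usage.items():
        aa = CODON_TABLE[codon]
        best[aa] = max(best.get(aa, 0.0), freq)
    return {c: f / best[CODON_TABLE[c]] for c, f in usage.items()}


def codon_adaptation_index(orf, usage):
    """Geometric mean of relative adaptiveness over the codons of `orf`."""
    weights = relative_adaptiveness(usage)
    codons = [orf[i:i + 3] for i in range(0, len(orf) - len(orf) % 3, 3)]
    # Met, Trp and stop codons offer no synonymous choice and are skipped.
    scored = [c for c in codons if CODON_TABLE[c] not in "MW*"]
    if not scored:
        raise ValueError("ORF contains no codons with synonymous alternatives")
    return exp(sum(log(weights[c]) for c in scored) / len(scored))
```

With this toy table, an ORF built only from the most used codons (`"AAAGAA"`) scores 1.0, whereas `"AAGGAG"` scores the square root of 1/3 × 1/2, about 0.41.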

As mentioned above, all currently available algorithms (with the exception of TmPrime [67]) typically start from the original/natural coding sequence and then evolve the sequence through iterations of synonymous codon changes that increase/maximize the MFE, the codon usage frequency/CAI, or both to achieve the desired outcome. However, none of the abovementioned tools typically considers the impact of synonymous codon usage on protein folding (rather than simply on translation efficiency). They also fail to take into account some other important considerations that can affect mRNA translatability and stability and, therefore, preclude efficient expression of correctly folded and functional proteins. Below, I examine some of these considerations that may facilitate gene design and redesign toward optimized expression of active, correctly folded proteins.
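The global recoding strategy described above can be sketched in a few lines. The example below is a deliberately naive illustration, assuming a hypothetical usage table (`TOY_USAGE`) and the standard genetic code; it replaces every codon with the most used synonym and checks that the encoded protein is unchanged. It is not the algorithm of any specific tool, and it ignores exactly the folding-related considerations that the following sections argue should be taken into account.

```python
from itertools import product

# Standard genetic code (NCBI translation table 1), bases ordered T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# Hypothetical usage values (per 1,000 codons), for illustration only.
TOY_USAGE = {"AAA": 30.0, "AAG": 10.0, "GAA": 40.0, "GAG": 20.0}


def split_codons(orf):
    """Split an ORF into codons; the length must be a multiple of three."""
    if len(orf) % 3:
        raise ValueError("ORF length must be a multiple of 3")
    return [orf[i:i + 3] for i in range(0, len(orf), 3)]


def translate(orf):
    """Translate an ORF into a one-letter protein string ('*' marks a stop)."""
    return "".join(CODON_TABLE[c] for c in split_codons(orf))


def recode_to_most_frequent(orf, usage):
    """Replace each codon with the most used synonymous codon in `usage`."""
    best = {}
    for codon, freq in usage.items():
        aa = CODON_TABLE[codon]
        if aa not in best or freq > usage[best[aa]]:
            best[aa] = codon
    # Codons whose amino acid is absent from `usage` are left unchanged.
    recoded = "".join(best.get(CODON_TABLE[c], c) for c in split_codons(orf))
    assert translate(recoded) == translate(orf)  # recoding must stay synonymous
    return recoded
```

Here `recode_to_most_frequent("ATGAAGGAGAAATAA", TOY_USAGE)` returns `"ATGAAAGAAAAATAA"`; both sequences translate to `MKEK*`.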

2.1 Codon Usage at ORF (Open Reading Frame) 5′ Termini

The occurrence of synonymous codons in protein-coding open reading frames (ORFs) of genes is not random, thus revealing the existence of evolutionary pressure on codon choice [23, 24, 28, 72–74]. Clustering of synonymous codons has been observed at specific conserved locations in mRNAs indicating that there are forces that influence the selection of these codons at specific locations within mRNA sequences [28, 33, 37, 38, 55, 75, 76]. Strategic placement of specific synonymous codons, particularly those that are rare, in gene ORFs suggests a functional role conserved in evolution rather than random chance. Therefore, the randomized and/or global substitution of rare synonymous codons with frequent ones that is offered by the majority of tools aimed at simply increasing CAI/codon usage frequency and/or MFE (see above) might not be beneficial for the production of a functional protein.

An example of nonrandom synonymous codon usage within ORFs is the observed enrichment of rare codons at the 5′ termini of genes in E. coli and many other prokaryotes, as well as in genes of some eukaryotes such as the yeast Saccharomyces cerevisiae [75, 76]. The clustering of rare codons at 5′ gene termini (typically at codon positions 1 to ~20 [33, 37, 38, 76]) clearly indicates an influence of evolutionary pressure on their selection. This particular aspect of natural codon usage may be explained by the fact that rare codons in many bacteria are largely AT-rich [70]. Thus, their clustering at 5′ ORF termini leads to reduced secondary structure in that region of the mRNA and, consequently, enhanced protein expression (it is known that mRNA secondary structure at 5′ ORF termini negatively affects protein expression by limiting access of the ribosomes to the ribosome binding site (RBS) on the mRNA [33, 37, 38, 55, 75]).
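A simple way to inspect a candidate sequence for this feature is to compare the GC content of the first ~20 codons with that of the remainder of the ORF. The sketch below does only that; GC content is used here as a crude proxy for the propensity to form stable secondary structure, which is an assumption of this example (an actual MFE estimate would require RNA folding software). The function names and the default window of 20 codons are illustrative.

```python
def gc_fraction(seq):
    """Fraction of G and C nucleotides in a DNA or RNA sequence."""
    if not seq:
        raise ValueError("sequence is empty")
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def five_prime_gc_contrast(orf, n_codons=20):
    """Return (GC of the first n_codons codons, GC of the rest of the ORF)."""
    cut = 3 * n_codons
    if len(orf) <= cut:
        raise ValueError("ORF is too short for the requested 5' window")
    return gc_fraction(orf[:cut]), gc_fraction(orf[cut:])


# Artificial ORF: an AT-rich 5' terminus followed by a GC-rich body.
example_orf = "ATG" + "AAT" * 19 + "GCC" * 30
head_gc, body_gc = five_prime_gc_contrast(example_orf)
```

A head value well below the body value, as in this artificial example, is the pattern that would be expected for an ORF with an AT-rich 5′ terminus.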

It should be noted, however, that the enrichment of rare codons at 5′ ORF termini has been mostly found in bacteria with genomes with overall GC content of at least 50% [77]. Recent work showed that, in general, AT-rich codons as opposed to rare codons are preferentially located at 5′ ORF termini in prokaryotes [33, 34, 37, 38, 54]. This further implicates secondary structure as the driving force for specific codon selection at 5′ ORF termini in bacteria [33, 38, 54]. Interestingly, the higher the GC content of a genome, the more mRNA stability is reduced at the region near the start codon [78].

It should also be noted that despite differences in translation apparatus and the mechanism of protein synthesis between prokaryotes and eukaryotes, many eukaryotic ORFeomes are also characterized by reduced 5′-terminal mRNA secondary structure near the start codon [78]. This indicates that reduced 5′-terminal ORF mRNA secondary structure may have been evolutionarily selected in all organisms. In eukaryotes, this can be expected to facilitate start-codon recognition by the scanning ribosome [78].

Could there be additional reasons for preferential use of rare codons at the 5′ ORF termini of some natural genes, including those in E. coli? It has been suggested that clustering of rare codons at 5′ ORF termini may in certain cases allow slow co-translational formation of the N-terminal folding nucleus of the protein, thus facilitating overall correct protein folding in the cell [28].

Interestingly, strong enrichment of rare codons at 5′ gene termini has been preferentially observed (with very high statistical significance (P < 0.0001)) in genes/ORFs encoding secretory proteins [76]. It has been suggested that for genes encoding secretory proteins with N-terminal signal sequences, 5′ rare codon clusters could have a functional role related to secretion, by transiently slowing down translation prior to membrane localization of the nascent chain(s) [79]. It has been experimentally shown in yeast that local slowdown of translation caused by the presence of rare codons (located ~35–40 codons downstream of signal sequences or transmembrane segments) promotes nascent-chain recognition by the signal recognition particle (SRP), which assists in protein translocation across membranes [55]. Similarly, strategically located Shine-Dalgarno-like elements were identified in ORFeomes of E. coli secretory proteins; these elements serve to transiently slow down translation elongation in order to allow efficient integration of the transmembrane helix of many membrane proteins [80].

Therefore, based on the considerations described above, carefully planned placement of rare/non-optimal (or AT-rich) codons in the 5′ ORF termini of mRNAs, especially for those encoding secretory and transmembrane proteins, may represent an important strategy for successful gene design and redesign enhancing proper protein production, secretion, and folding.

2.2 Conserved Rare Codon Clusters Within Gene ORFs

It is widely believed that the major influence of codon usage is on global and local translation rate. As mentioned above, frequently used codons are translated more rapidly than infrequently used ones [25–31]. However, which codons are more rare or frequent varies by organism [22–25, 70]. Surprisingly, across all organisms, rare codons appear to occur in clusters, rather than being randomly scattered across genes [28, 75]. Although there is a general tendency for rare codons to cluster at the 5′ termini of ORFs (see above), such clustering is also observed within ORFs [28, 75, 81]. These clusters are not confined to the 5′ end of ORFs or to ORFs of genes/proteins that are expressed at a low level (as might be expected if rare codons are thought of as simply correlating with reduced translation rate). Rather, they are found to occur equally in genes for all types of proteins, including abundant/highly expressed proteins [75, 81].

Analyses of ORFeomes from prokaryotic and eukaryotic organisms revealed that rare codon clustering (1) is not limited to a particular set of genes or genotype, (2) does not depend on and is not related to the overall GC content of the organism’s genome, and (3) is significantly more abundant than would be expected based on random selection [75, 81]. Furthermore, for some protein families, the locations of rare codon-rich regions within mRNAs are highly conserved across homologs in different organisms; this is observed, for example, in families of cytochromes c, globins, gamma-B crystallins [28], ocular lacritins [82], and chloramphenicol acetyltransferases [28, 83].

Enrichment of rare codon clusters at specific locations in a broad range of genes and organisms suggests that evolutionary selection determines such clustering and that it must have some functional significance [28, 75, 81–83]. One hypothesis links the location of rare codon clusters to the process of protein folding in the cell [84, 85]. This proposes that sequential folding events that occur during co-translational folding of proteins might be separated by rare codon clusters, with such clusters serving to reduce the speed of translation at these positions and thereby facilitating proper folding through temporal separation of folding events on the ribosome [28, 74, 86–91]. This is consistent with the finding that there seems to be a certain hierarchy in the location of rare codon-rich regions along mRNAs. Frequently, but not always, the rarest codons seem to encode boundaries of relatively large structural units (e.g., protein domains), whereas less rare codons encode boundaries of smaller units (e.g., protein motifs and subdomains) [28]. This might reflect the need to provide a more substantial translational delay for independent co-translational folding of larger units in comparison with smaller ones [28].
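Locating rare codon clusters in a sequence can be sketched as a sliding-window scan. The example below is a minimal illustration rather than the method used in the cited studies: the rarity cutoff, window size, minimum number of rare codons per window, and the usage table `TOY_USAGE` are all hypothetical parameters chosen for demonstration. Overlapping qualifying windows are merged into a single cluster, reported as half-open codon index ranges.

```python
# Hypothetical usage values (per 1,000 codons), for illustration only.
TOY_USAGE = {"GAA": 40.0, "AGG": 2.0}


def rare_codon_clusters(orf, usage, rare_below=10.0, window=5, min_rare=3):
    """Return merged (start, end) codon ranges enriched in rare codons.

    A window of `window` codons qualifies when it holds at least `min_rare`
    codons whose usage is below `rare_below`; `end` is exclusive.
    """
    codons = [orf[i:i + 3] for i in range(0, len(orf) - len(orf) % 3, 3)]
    is_rare = [usage[c] < rare_below for c in codons]
    clusters = []
    for start in range(len(codons) - window + 1):
        if sum(is_rare[start:start + window]) < min_rare:
            continue
        end = start + window
        if clusters and start <= clusters[-1][1]:
            # Extend the previous cluster instead of opening a new one.
            clusters[-1] = (clusters[-1][0], end)
        else:
            clusters.append((start, end))
    return clusters


# Six common codons, three rare codons, six common codons.
example_orf = "GAA" * 6 + "AGG" * 3 + "GAA" * 6
clusters = rare_codon_clusters(example_orf, TOY_USAGE)
```

For this artificial ORF the scan reports one cluster, `[(4, 11)]`, which brackets the three rare codons at positions 6 to 8. In a redesign workflow, such ranges could be compared against the positions of conserved clusters in homologs or against predicted domain boundaries before deciding which rare codons to keep.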

In summary, while there is a substantial body of literature underlining the overall negative effects of rare codons on levels of protein production (see [19] for a review), it is becoming increasingly clear that strategic placement of conserved rare codon clusters can have positive effects on protein biogenesis (particularly proper folding) and function. Some biotech companies, such as DAPCEL, Inc., are already using this knowledge to enhance protein production and facilitate correct co-translational protein folding.

2.3 Codon Usage at ORF (Open Reading Frame) 3′ Termini

Enrichment of rare codons at the 3′ terminus of E. coli ORFs (and ORFs of 11 other prokaryotes) has also been observed [76]. While significant enrichment of rare codons at the 5′ termini of genes in E. coli can be explained as a mechanism that facilitates interaction between ribosomes and ribosome binding sites on mRNAs (see above; [33, 37, 38, 55, 75]), the observed (albeit less pronounced) increase in rare codon abundance at the 3′ termini of E. coli ORFs is not as easy to explain. It is possible that rare codon clusters at 3′ ORF termini could be required for more robust termination of translation and/or for reducing the rate of protein folding before release from the ribosome [76]. Queuing of ribosomes at the 3′ termini of ORFs due to the presence of rare codons may also protect mRNAs from degradation. An improved understanding of the impact of codon usage at 3′ ORF termini is required before this feature can be rationally exploited in gene design and redesign strategies and/or interpretation of in vivo folding pathways.

3 Synonymous Codons and mRNA Stability

mRNA turnover plays a critical role in regulating gene expression. mRNAs with longer half-lives generally produce more protein than those with shorter half-lives simply because they are available to be translated for a longer period of time. A link between codon usage and mRNA turnover rate has long been recognized in both prokaryotes and eukaryotes [92–94], but was not well understood until recently [53, 54]. Previously, it was generally believed that more thermodynamically stable mRNAs would also be more resistant to degradation. However, recent work showed that, at least in yeast, so-called codon optimality [53] rather than mRNA thermodynamic stability has a broad and powerful influence on in vivo mRNA degradation rates. Codon optimality is a scale that reflects the balance between the supply of specific charged tRNA molecules and the demand for their use by translating ribosomes, thus representing a measure of translation efficiency [53]. Optimal codons (typically, these are frequent codons) are decoded faster. In the yeast study, it was found that many stable/long-lived mRNAs harbor optimal codons within their ORFs, while many unstable/short-lived mRNAs harbor non-optimal codons [53]. Moreover, it was found that substitution of optimal codons with synonymous, non-optimal codons results in dramatic destabilization of the mRNA and vice versa [53]. Interestingly, very similar results were obtained in E. coli [54]. These findings suggest that transcript-specific translation elongation rate is an important determinant of mRNA stability and that more rapidly translated mRNAs (at least in yeast and E. coli) are likely to be more stable and, thus, produce more protein. This new information presents an opportunity to upscale protein production in yeast and E. coli via reassignment of codon optimality in an mRNA to increase its stability and, thus, its capacity to produce protein. Whether the same paradigm exists in higher eukaryotic organisms remains to be determined. However, this approach should be applied with caution since assignment of codons that are optimal for translation rate and mRNA stability could lead to incorrect protein folding.

4 Synonymous Codons and Mistranslation/Frameshifting

Another aspect of mRNA biology that can be impacted by synonymous codon usage is the accuracy with which mRNAs are translated. Clearly, mRNAs must be translated accurately in order for fully functional proteins to be produced. Estimates of missense error rates (referred to as miscoding or mistranslation) during protein synthesis from natural mRNAs vary from 10^−3 to 10^−4 per codon ([95–98] and references therein). Mistranslation is the incorporation of an amino acid that is different from the one encoded by a specific codon in the mRNA. Recent research has enhanced our understanding of the mechanisms of mistranslation and how it is controlled [95–98]. While it is generally believed that synonymous codon changes should be silent (not changing the amino acid that is incorporated), that is not always the case [95–98]. Moreover, certain codons are mistranslated more frequently than others [95, 98]. This is apparently due to the fact that translation speed and mistranslation rate are carefully balanced during protein synthesis, and situations maximizing translation speed place demands on the translational machinery that reduce accuracy [95–98]. In general, translation has multiple layers of proofreading; however, most errors occur during decoding, which takes place on the ribosome [96, 98]. The frequency of miscoding of different codons varies over a nearly 20-fold range ([95] and references therein). Mispairing at the wobble position and scarce availability of cognate competitor tRNAs appear to play major roles in mistranslation [95–98]. For example, the frequency of miscoding of the AAU (Asn) codon in E. coli leading to incorporation of Lys (encoded by AAG and AAA) instead of Asn is about fourfold higher than that for the AAC (Asn) codon [95]. It should be noted, however, that the AAU codon is used more frequently than the AAC codon (codon usage frequency per 1,000 codons is 29.32 for AAU vs. 20.26 for AAC [70]); thus, substitution of AAC with AAU with the intention of maximizing codon frequency/CAI could result in increased levels of miscoding, which in turn could lead to loss of protein activity due to misfolding [56] or absence of a functionally important amino acid.
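The AAU/AAC example can be expressed as a small comparison of two selection criteria. The sketch below uses only the figures quoted in the text (usage per 1,000 codons, and an approximately fourfold difference in Asn-to-Lys miscoding); representing the miscoding difference as relative units normalized to AAC is an assumption of this example, as are the function names.

```python
# Codon usage per 1,000 codons in E. coli, as quoted in the text.
USAGE_PER_THOUSAND = {"AAU": 29.32, "AAC": 20.26}

# Relative Asn -> Lys miscoding, normalized to AAC ("about fourfold" per the
# text); these are relative units, not measured error rates.
RELATIVE_MISCODING = {"AAU": 4.0, "AAC": 1.0}


def pick_most_frequent(synonyms):
    """Choose the synonym a frequency/CAI-maximizing tool would select."""
    return max(synonyms, key=USAGE_PER_THOUSAND.__getitem__)


def pick_most_accurate(synonyms):
    """Choose the synonym with the lowest relative miscoding."""
    return min(synonyms, key=RELATIVE_MISCODING.__getitem__)


def miscoding_fold_change(old, new):
    """Relative change in miscoding when `old` is replaced by `new`."""
    return RELATIVE_MISCODING[new] / RELATIVE_MISCODING[old]


asn_codons = ["AAC", "AAU"]
by_frequency = pick_most_frequent(asn_codons)
by_accuracy = pick_most_accurate(asn_codons)
```

The two criteria disagree: frequency selects AAU, accuracy selects AAC, and replacing AAC with AAU corresponds to a roughly fourfold increase in relative miscoding at that position.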

While, as described above, there is considerable evidence linking codon usage and missense errors, little is known about the relationship between codon usage and frameshifting errors. Programmed ribosomal frameshifting is utilized by many viruses and bacteria to increase the information content of their genomes; through frameshifting, multiple proteins can be produced from a single span of sequence [99, 100]. Signals in mRNAs have been identified that cause frameshifting by one base in the 5′ (−1) or 3′ (+1) direction [99, 100]. While beneficial in some cases for bacteria and viruses as mentioned above, unintended frameshifting during translation is clearly not desirable. Frameshifting errors can lead to premature termination of translation or generate abnormal proteins with toxic effects on the cell [56]. Attempts have been made to develop computational tools to assess whether codon usage can be optimized to minimize the frequency of frameshifting errors [101]. The results of this work indicate that natural synonymous codon usage is biased toward specific patterns correlated with avoidance of mistranslation and frameshifting-induced protein misfolding [101]. Overall, an understanding of the impact of codon usage on mistranslation and frameshifting errors may be helpful in minimizing the risk of producing subpopulations of proteins with different amino acid sequences when undertaking recombinant protein production from a redesigned gene.

5 The Impact of Single Synonymous Codon Substitutions

Gene redesign usually involves numerous substitutions of synonymous codons. However, recent studies have shown that some specific single synonymous mutations are deleterious for proper protein expression and, moreover, organism health (see [51, 52] for a review). The majority of identified deleterious single synonymous mutations exert effects on mRNA splicing (in eukaryotes), but there are also quite a few that may alter protein folding and, as a consequence, protein activity and/or resistance to degradation [51, 52]. These single synonymous mutations can produce disease in the expressing organism, and their inadvertent introduction into genes of therapeutic proteins may produce undesirable effects. It should be noted that the exact mechanisms underlying the effects of many synonymous mutations linked to disease are not yet well understood [51, 52]. One of the major challenges in the field is to understand why some disease-causing synonymous mutations are more deleterious than others and to predict the likely effects of a single mutation.

Evaluation of mRNA stability of fragments of genes of several proteins carrying neutral vs. disease-associated mutations and synonymous vs. non-synonymous mutations revealed that deleterious synonymous mutations tend to occur in mRNA regions with higher MFE levels and often lead to a reduction in MFE [102–105]. It is not yet clear how broadly applicable this situation originally identified for “disease-associated” mutations in the F8 and F9 genes encoding blood-coagulation factors VIII and IX, respectively, might be [102, 105]. Mutations in the F8 and F9 genes lead to blood clotting disorders known as hemophilia A and B [102, 105]. While further investigation into the deleterious effects of specific synonymous mutations is required, it is clear that known disease-associated mutations should be avoided in gene redesign efforts.

6 Concluding Remarks and Future Perspectives

Gene design and redesign approaches target protein-coding genes and aim to introduce predefined features of interest into the final protein product. These approaches frequently involve changes in synonymous codon usage intended to improve protein production in homologous and/or heterologous hosts without compromising the integrity of the encoded protein. Optimization of gene design and protein production is of strong significance due to the high, and continually increasing, demand for recombinant proteins for use in research and in therapeutic applications. Advances in DNA synthesis have enabled construction of numerous gene variants and facilitated our understanding of the impact of codon usage on gene function. Additional knowledge came from genome-wide studies aimed at uncovering the impact of synonymous mutations on gene function and phenotype and understanding their association with various diseases.

It has become clear that synonymous codon usage and synonymous mutations not only alter the speed of protein synthesis but also affect many critical aspects of mRNA and protein biogenesis (ranging from mRNA stability to protein mistranslation and folding), thus ultimately changing the phenotype associated with the protein. Importantly, it was revealed that even a single synonymous mutation may be deleterious to protein function. While complete understanding of the effects caused by multiple and single synonymous mutations remains lacking, it is possible, as done in this review, to use existing knowledge to develop some common rules for gene design and redesign that should increase the probability of achieving the desired quantity and activity of an expressed recombinant protein.

A combination of evolutionary, computational, and synthetic biology should ultimately enable (1) full genome-based understanding of the impact of individual synonymous mutations on gene function, mRNA biogenesis, protein production, and protein folding; (2) efficient manufacturing of safer, more effective, and even potentially individualized protein therapeutics; and (3) improved understanding of evolutionary processes.

7 Notes

  1. Carefully planned placement of rare/non-optimal (or AT-rich) codons in the 5′ termini of mRNA ORFs, especially those encoding secretory and transmembrane proteins, may represent an important strategy for successful gene design and redesign enhancing proper protein production, secretion, and folding.

  2. Enrichment of rare codon clusters at specific locations in a broad range of genes implies that they have functional significance. Therefore, strategic placement of evolutionarily conserved rare codon clusters within ORFs may facilitate correct protein folding.

  3. Use of optimal synonymous codons during gene design and redesign may lead to substantial stabilization of the mRNA and enhancement of protein production (at least in yeast and E. coli).

  4. Mistranslation as a result of synonymous codon changes may lead to incorrect protein folding; this should be taken into consideration when planning production of recombinant proteins.

  5. Although a variety of methods are available for gene redesign, approaches that take into account the effect(s) of synonymous codon substitutions on translation efficiency, protein folding, and protein activity will allow the most productive manufacturing of safer and more effective protein therapeutics.