Introduction

Evolution and protein folding are intertwined processes. Indeed, protein sequences, encoded by DNA, determine their three-dimensional structure (Anfinsen 1973), which in turn determines their function, while evolution can alter either one through mutations. Does the folding rate, which measures how quickly or slowly a protein folds from its unfolded forms to its native state, then restrain the mutation frequency? If this were the case, what would be its impact on evolution? Whatever the answer to these questions, protein folding cannot take cosmic times (~ \(10^{27}\) years), as predicted by an exhaustive sampling of all possible conformations of a 100-residue protein (Zwanzig et al. 1992), because the observed folding times in water for single-domain two-state proteins are shorter than ~ 10 s (Garbuzynskiy et al. 2013). As the reader may be aware, several possible solutions to this apparent contradiction, also known as Levinthal’s paradox (Levinthal 1968), exist in the literature (Zwanzig et al. 1992; Dill and Chan 1997; Karplus 1997; Rooman et al. 2002; Ben-Naim 2012; Finkelstein and Garbuzynskiy 2013; Martinez 2014; Ivankov and Finkelstein 2020; Finkelstein et al. 2022). However, the existence of numerous solutions to this paradox does not guarantee a clear answer to the following key question: why can proteins reach their native state in a biologically reasonable time? As a strategy to answer this question, we will show that a reasonable estimate of the height of the activation barrier (see Fig. 1), which separates the native state from the highest free-energy native-like conformation, beyond which the protein unfolds or becomes non-functional, will enable us to determine the slowest folding/unfolding time for two-state monomeric proteins. Before resuming the analysis, let us recall the last question. Should we focus on why, rather than on how, proteins reach their native state in a biologically reasonable time?
This dilemma does not have a simple solution because both are relevant queries. Indeed, the interrogative how is associated with determining the mechanism, e.g., the routes or pathways of folding/unfolding (Sali et al. 1994; Wolynes et al. 1995; Lazaridis and Karplus 1997; Jackson 1998; Lindorff-Larsen et al. 2011; Englander and Mayne 2014; Wolynes 2015; Li and Gong 2022), while the why is associated with identifying the main factors, independently of the mechanism, that govern the folding/unfolding process. The question of how proteins reach their native state in a biologically reasonable time has been analyzed recently (Ivankov and Finkelstein 2020). Therefore, we choose to focus on why two-state proteins reach their native state in a biologically reasonable time because, first, it questions our basic knowledge of the main factors determining changes in the protein folding rate and, hence, poses a preliminary problem to one of the most critical unanswered questions in structural biology: how a sequence encodes its folding. Second, it will help to understand the origins of the evolution of protein folding rates after amino-acid substitutions and/or post-translational modifications.
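The cosmic-time figure quoted above can be reproduced with a short back-of-the-envelope calculation. The sketch below is only an illustration: the three conformations per backbone bond and the sampling rate of 10^13 conformations per second are assumptions taken from the standard form of the exhaustive-search argument, not values stated in this review, and the function name is hypothetical.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600  # Julian year, in seconds

def exhaustive_search_years(n_bonds=100, states_per_bond=3,
                            samples_per_second=1e13):
    """Time (years) to visit every conformation once by blind sampling.

    All three defaults are assumptions of the classic estimate,
    not parameters given in the text.
    """
    n_conformations = states_per_bond ** n_bonds   # exact integer, 3**100
    seconds = n_conformations / samples_per_second
    return seconds / SECONDS_PER_YEAR

years = exhaustive_search_years()
print(f"Exhaustive search: about {years:.1e} years")
```

Under these assumptions the result is of the order of 10^27 years, which is the scale against which the observed folding times of seconds or less should be compared.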

Fig. 1

The Gibbs free-energy profile (G) for a two-state protein unfolding is sketched out in broad strokes. The native state and the highest point of the free-energy profile are highlighted as green- and red-filled dots, respectively. The Gibbs free-energy gap between these two states is indicated by ΔG

Overall, we start by determining the slowest folding rate for a two-state monomeric protein, i.e., by providing an answer to why proteins fold in a biologically reasonable time. Arguments, such as that life would not have emerged if it took the age of the universe for a protein to fold, or that proteins should fold fast enough in a cell—not to be degraded—could be, at first glance, plausible answers. None of these ideas, however, could adequately describe the nature of the key factors determining how protein folding/unfolding rates evolve in response to amino acid substitutions and/or post-translational modifications. For this reason, this phenomenon is examined here in terms of (i) protein-marginal stability (Dinner and Karplus 2001; Vila 2019; Martin and Vila 2020; Vila 2021) and (ii) arguments from the transition state theory (Ivankov and Finkelstein 2020). Unless otherwise stated, the terms “folding” and “unfolding” shall be used interchangeably from this point on.

Results and discussion

I.- Two-state protein folding time scales

Among the possible solutions to the time scales for protein folding, we distinguish three studies that have determined a plausible relation between protein length (N), with N being the number of residues, and the logarithm of the folding time (ln τ), namely, ln τ ~ \(N^{1/2}\) (Thirumalai 1995), ~ ln (N) (Gutin et al. 1996), and ~ \(N^{2/3}\) (Finkelstein and Badretdinov 1997; Wolynes 1997). Although an analysis of such a relationship is vital, given the strong anti-correlation observed between N and ln τ for three-state folding proteins (R ~  − 0.80) (Galzitskaya et al. 2003), it is equally important to highlight that such a relationship is nearly nonexistent for two-state folding proteins (R ~  − 0.07) (Galzitskaya et al. 2003). Therefore, we will focus on determining a plausible explanation for the latter. For this purpose, we will determine the slowest folding/unfolding time (τmax) for a monomeric two-state protein in terms of the marginal-stability upper bound of proteins, which was obtained via a statistical-mechanics analysis of the partition function in the thermodynamic limit, also known as “the infinite chain limit” (Vila 2019, 2021). Therefore, for two-state proteins of any sequence and length (N), the expected value for the slowest folding/unfolding time (τmax) will hold if the following conjectures and facts are plausible:

  1.

    The folding approach for monomeric two-state proteins is a reversible thermodynamic-driven process (Privalov 1979; Matouschek et al. 1989)

  2.

    The two-state protein unfolding model shown in Fig. 1 alludes to a process in which the thermodynamics and kinetic stability happen only between the native-state and unfolded states, which are separated by an energetic barrier higher than thermal fluctuation energy (Akmal and Muñoz 2004; Kuwajima 2020). In other words, folded and unfolded states are separated by an ensemble of a high-energy set of structures, i.e., the transition state ensemble (TSE), representing the energetic barrier for the process (Privalov 1979; Matouschek et al. 1989; Itzhaki et al. 1995; Englander 2000; Fersht and Daggett 2002; Akmal and Muñoz 2004; Shakhnovich 2006). In this simple unfolding model, there are no stable intermediate states necessary to complete the process

  3.

    We will focus our attention on the analysis of the unfolding rather than on the folding process because the former enables us to make a quick estimation of the Gibbs free-energy difference (ΔG) between the native state (representing a well-defined reference point) and the highest point of the TSE (see Fig. 1). This choice is feasible since the “detailed balance principle” demands that the TSE be the same for the unfolding and folding processes (Ivankov and Finkelstein 2020), e.g., as shown by the analysis of the rates and equilibria of folding of ~ 100 mutants strategically distributed throughout the protein chymotrypsin inhibitor 2 (Itzhaki et al. 1995). This conjecture is in line with the observed folding/unfolding data from 108 proteins (70 showing two-state kinetics), which demonstrate that the logarithms of the folding and unfolding rates are well correlated (R ~ 0.8) and that this correlation is better for two-state than for multi-state proteins (Glyakina and Galzitskaya 2020)

  4.

    The largest size of the Gibbs free-energy barrier (ΔG) between the native state and the highest point of the free-energy profile (see Fig. 1) is assumed to be given by the protein marginal-stability upper bound limit, i.e., ΔG ~ 7.4 kcal/mol, which (i) is a universal feature of proteins, i.e., was obtained regardless of their sequence or length (Vila 2019; Vila 2021); (ii) is a consequence of Anfinsen’s dogma validity (Vila 2019; 2021); and (iii) represents a threshold beyond which a conformation will unfold and become non-functional (Martin and Vila 2020; Vila 2021; 2022)

  5.

    The word “mutation” usually refers to an amino-acid substitution in the protein sequence as a result of a nucleotide pair replacement (Kimura 1968). This is a very well-known phenomenon in the protein folding/unfolding field because it could alter protein stability (Privalov and Tsalkova 1979; Tokuriki et al. 2008; Tokuriki and Tawfik 2009; Socha and Tokuriki 2013; Martin and Vila 2020), structure (Koehl and Levitt 2002), function (Tokuriki et al. 2008; Otwinowski 2018), and evolvability (Kimura 1968; Bloom et al. 2006; Kurahashi et al. 2018; Vila 2022) through a variety of mechanisms. As such, there has been considerable interest in understanding the structural and energetic consequences of such amino acid substitutions. Interestingly, post-translational modifications (PTMs), i.e., modifications of an amino acid side chain in some proteins after their biosynthesis, also have a significant impact on protein structure, stability, and function. In this regard, it is worth noting the existence of more than 400 types of PTMs, among which phosphorylation, acetylation, methylation, and glycosylation are the most common (Khoury et al. 2011). Notably, N-linked glycoproteins, which are the result of a reversible enzyme-directed reaction, are a particularly interesting case of PTM since more than 50% of all eukaryotic proteins are glycoproteins (Shental-Bechor and Levy 2008; Ellis et al. 2012), and hence, there is considerable interest in predicting the structural and functional consequences of such site-specific modifications (Chen et al. 2010; Garay et al. 2016; Ramazi and Zahiri 2021; Weaver et al. 2022). PTMs are particularly relevant to biology because they increase proteomic diversity by several orders of magnitude (Spoel 2018). All of this enables us to conjecture that each PTM could be thought of as a different amino acid from the 20 naturally occurring ones.
Then, unless otherwise noted, the word “mutation” will merely refer to a protein sequence modification, and, thus, its effects on the protein structure, stability, and foldability rate will be analyzed without making any distinction among these phenomena

  6.

    It is assumed that point mutations mainly affect the native-state stability (Zeldovich et al. 2007). This assumption is equivalent to assuming an average ϕ-value—a technique commonly used to examine the kinetic effects on the protein folding upon a point mutation (Matouschek et al. 1989; Itzhaki et al. 1995; Campos 2022)—closer to ~ 0 than to ~ 1. In line with this, the average ϕ-value—of more than 800 mutations for 24 two-state proteins—is < ϕ >  ~ 0.24 (Naganathan and Muñoz 2010)

  7.

    The change in the unfolding Gibbs free energy (ΔGU) between the wild-type (wt) and the mutant (m) protein can be computed directly as ΔΔGU = \(\Delta G_{U}^{m}-\Delta G_{U}^{wt}\) (Bigman and Levy 2018). This definition, together with assumption 6, enables us to propose (Vila 2022) a reasonable strategy to assess the change in the protein marginal stability upon point mutations (ΔΔG), namely, as ΔΔG ~ ΔΔGU

  8.

    The best candidates for simulations of all-atom molecular dynamics are proteins that fold at or close to the speed limit, simply because such simulations are computationally intensive. This has inspired experimentalists to look for proteins that fold rapidly as well as to develop other proteins that fold even more quickly. For this reason, the folding speed limit (τ0) of two-state proteins (the barrier-less limit) has been discussed at great length in the literature (Zana 1975; McCammon 1996; Hagen et al. 1996; Mayor et al. 2000; Krieger et al. 2003; Yang and Gruebele 2003; Akmal and Muñoz 2004; Muñoz et al. 2008; Ivankov and Finkelstein 2020; Glyakina and Galzitskaya 2020; Muñoz and Cerminara 2016; Chung and Eaton 2018; Eaton 2021), and there is a consensus that it should be within the following range of values

    $$\sim 10^{-8} \ [\mathrm{sec}] < \tau_{0} < \ \sim 10^{-5} \ [\mathrm{sec}]$$
    (1)

Let us quickly show how these constraints on the folding rate impact the ability of proteins to evolve. If a given 100-residue two-state protein cannot fold faster than τ0 ~ \(10^{-8}\) (or ~ \(10^{-5}\)) seconds, and if life began on earth around a billion (~ \(10^{9}\)) years ago, its protein space (Maynard Smith 1970) would contain at most ~ \(10^{24}\) (or ~ \(10^{21}\)) sequences. If this were the case, the average mutation rate per amino acid (ξ) should be ≤  ~ 1.74 (or ≤  ~ 1.62), since ξ must satisfy \(\xi^{100}\) ~ \(10^{24}\) (or ~ \(10^{21}\)). The fact that ξ < 2 is of paramount importance from an evolutionary point of view because it means that only a fraction of a given protein sequence is available for an amino acid substitution at any one time, in agreement with both previous estimations of the protein space size (Vila 2020) and existing evidence (Margoliash and Smith 1965; Sarkisyan et al. 2016). From an evolutionary perspective, an in-depth discussion of an accurate estimation of the protein space size, in light of the factors that govern it, is of utmost importance (Mandecki 1998; Dryden et al. 2008; Romero and Arnold 2009; Ivankov 2017), and it is also of practical interest for studies of directed evolution (Arnold 2009).
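The arithmetic behind these bounds can be checked with a few lines of code. The following is a minimal sketch, assuming one sequence is explored per folding event over the whole elapsed time; the function names and the use of a Julian year are illustrative choices, not part of the original analysis.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600  # Julian year, in seconds

def protein_space_bound(tau0_seconds, elapsed_years=1e9):
    """Upper bound on the number of sequences explored if each one
    requires at least tau0 seconds to fold (hypothetical helper)."""
    return elapsed_years * SECONDS_PER_YEAR / tau0_seconds

def per_residue_rate(space_size, n_residues=100):
    """Solve xi**n_residues = space_size for xi."""
    return space_size ** (1.0 / n_residues)

for tau0 in (1e-8, 1e-5):
    bound = protein_space_bound(tau0)
    print(f"tau0 = {tau0:.0e} s: at most ~{bound:.1e} sequences")

# The text rounds the bounds to the nearest order of magnitude.
for rounded in (1e24, 1e21):
    print(f"space size {rounded:.0e}: xi ~ {per_residue_rate(rounded):.2f}")
```

Using the order-of-magnitude values 10^24 and 10^21, the sketch gives ξ close to 1.74 and 1.62, respectively, in line with the values quoted in the text.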

  9.

    The time (τ) to overcome the free-energy barrier ΔG (shown in Fig. 1) may be computed by using an argument from the transition state theory (Ivankov and Finkelstein 2020) as

    $$\tau = \tau_{0} \, \exp(\beta \Delta G) \ [\mathrm{sec}]$$
    (2)

in which the lower and upper bounds of the pre-exponential factor (τ0) are given in Eq. (1), β = 1/RT, R is the gas constant, and T is the absolute temperature (298 K for all the calculations). If the free-energy barrier vanishes (ΔG ~ 0), a downhill, barrierless, or one-state unfolding (Garcia-Mira et al. 2002; Naganathan et al. 2005; Muñoz et al. 2008) occurs in times given by τ0.

  10.

    After assuming the validity of all of the above conjectures and facts, it is possible to determine the following range of τmax values from Eq. (2) (with ΔG ~ 7.4 kcal/mol and τ0 given by Eq. 1)

    $$\sim 10^{-3} \ [\mathrm{sec}] \le \tau_{\mathrm{max}} \le \ \sim 1 \ [\mathrm{sec}]$$
    (3)

The results of simulations on the protein folding (Sali et al. 1994; Karplus 1997; Lindorff-Larsen et al. 2011) and the observed folding rates for 65 two-state proteins that fold in an aqueous solution under biological conditions (Garbuzynskiy et al. 2013; Ivankov and Finkelstein 2020) attest that this time window for the slowest folding rate, τmax, is acceptable from a biological point of view. This result is a consequence of the fact that there is an upper bound on the marginal stability of proteins (~ 7.4 kcal/mol), which seems to be a universal property of biomolecules and macromolecular complexes (Martin and Vila 2020; Vila 2021) and arises from the validity of Anfinsen’s dogma (Vila 2019, 2021; Martin and Vila 2020).
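The window in Eq. (3) follows directly from Eq. (2) once ΔG ~ 7.4 kcal/mol and the bounds on τ0 from Eq. (1) are inserted. The sketch below is a hedged illustration of that substitution; the function name is hypothetical, and the value of the gas constant in kcal/(mol K) is a standard physical constant rather than a number given in the text.

```python
import math

R_KCAL = 1.987204e-3  # gas constant in kcal/(mol K)

def folding_time(delta_g_kcal, tau0_seconds, temperature=298.0):
    """Eq. (2): tau = tau0 * exp(beta * dG), with beta = 1/(R T)."""
    beta = 1.0 / (R_KCAL * temperature)
    return tau0_seconds * math.exp(beta * delta_g_kcal)

DELTA_G_MAX = 7.4  # marginal-stability upper bound from the text, kcal/mol

tau_max_lower = folding_time(DELTA_G_MAX, 1e-8)  # fastest pre-exponential factor
tau_max_upper = folding_time(DELTA_G_MAX, 1e-5)  # slowest pre-exponential factor
print(f"tau_max between {tau_max_lower:.1e} s and {tau_max_upper:.1e} s")
```

Both limits fall within the orders of magnitude stated in Eq. (3), and setting ΔG to zero returns τ0 itself, which corresponds to the barrierless case described above.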

Overall, the range of variation for τmax shown in Eq. (3) for a two-state protein (i) does not depend on the chain length, which is consistent with the observation that chain length has a nearly null correlation (R ~  − 0.07) with the logarithm of the folding time (Plaxco et al. 2000; Galzitskaya et al. 2003), (ii) provides the answer to the central question of Levinthal’s paradox, namely, how long it takes for a protein to reach its native state, and (iii) is a standard that will allow us to evaluate the impact of amino acid substitutions and/or post-translational modifications on the rates of protein folding, which we will examine in the next section.

II.- Evolution of protein folding rate in light of mutations

If the free-energy barrier height (ΔG) rules the unfolding (and folding) time τ of a two-state protein, then a single-point mutation could affect it by either increasing (stabilizing) or decreasing (destabilizing) the marginal stability. Let us start by examining the physics that rules the change in protein folding time upon mutation. The ratio between the folding time of the protein after a point mutation (τm) and that of the wild type (τwt) can be computed, after assuming that τ0 is insensitive to mutations (Socci et al. 1996; Muñoz and Eaton 1999), using Eq. (2) as (Chaudhary et al. 2015; Ivankov and Finkelstein 2020)

$$\Delta \tau_{m} = \frac{\tau_{m}}{\tau_{wt}} \sim \exp(\beta \Delta \Delta G_{m}) \Rightarrow RT \ln \Delta \tau_{m} \sim \Delta \Delta G_{m}$$
(4)

where ΔΔGm = (ΔGm − ΔGwt) ~ ΔΔGU is the difference, upon a single-point mutation, between the Gibbs free-energy gaps (ΔG) of the mutant and the wild type. The key takeaway from this analysis is that the protein marginal-stability change upon a mutation (ΔΔGm) provides the necessary and sufficient information to accurately estimate, via a Boltzmann factor, the evolution of the folding rates (Δτm). The physics underpinning this conclusion is as follows. Mutations affect, mainly, the stability of the native state (Zeldovich et al. 2007) and, to a lesser extent, the ensemble of high-energy native-like structures that coexist with it, i.e., the transition state ensemble (shown in Fig. 1). This hypothesis is supported by convincing theoretical simulations of the amide hydrogen exchange mechanism on proteins (Vendruscolo et al. 2003), as well as by the results of a high-resolution structure determination method indicating that high-energy native-like structures may be required for protein function (Stiller et al. 2022).
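For completeness, the intermediate step between Eq. (2) and Eq. (4) can be written out explicitly. This is only a restatement under the assumption, already made above, that the pre-exponential factor τ0 is the same for the mutant and the wild type, so that it cancels in the ratio:

$$\frac{\tau_{m}}{\tau_{wt}} = \frac{\tau_{0} \exp(\beta \Delta G_{m})}{\tau_{0} \exp(\beta \Delta G_{wt})} = \exp\left[\beta \left(\Delta G_{m} - \Delta G_{wt}\right)\right] = \exp(\beta \Delta \Delta G_{m})$$

Taking the natural logarithm of both sides and multiplying by RT = 1/β gives the second form of Eq. (4).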

Since ΔΔGm is a state function, Eq. (4) will be valid for any number j (≥ 2) of consecutive mutations and, hence, it can be generalized straightforwardly by replacing m → j because \(\left(\Delta {G}_{1}-{\Delta G}_{ wt}\right)+\sum_{k=2}^{j}(\Delta {G}_{k}-\Delta {G}_{k-1})=\left(\Delta {G}_{j}-{\Delta G}_{ wt}\right)=\Delta \Delta {G}_{j}\). This generalization is particularly relevant to determine the evolution of folding rates upon mutations because many forms of post-translational modifications may occur in tandem (Khoury et al. 2011). Additionally, this property of ΔΔGm should have a profound impact on evolutionary biology research. To illustrate this, imagine evolution as a walk across Protein Space, i.e., one where “…functional proteins must form a continuous network which can be traversed by unit mutational steps without passing through nonfunctional intermediates…” (Maynard Smith 1970). Then, if we assign a “fitness value” to each functional protein in that sequence space, which is a measurement of how effectively each protein may perform an expected function (Romero and Arnold 2009), it becomes clear that, starting from an arbitrary functional protein, nature can follow any mutational trajectory to achieve a specified “fitness target” (see Fig. 2), if there is no penalty for doing so (Weinreich et al. 2006). This simple illustration shows that, to determine the evolution of the folding rate, it is not necessary to predict mutational trajectories (Sailer and Harms 2017a, b) or account for epistasis effects (Breen et al. 2012; Starr and Thornton 2016; Miton and Tokuriki 2016; Sailer and Harms 2017a, b; Domingo et al. 2019; Vila 2022), a phenomenon that occurs when the total effect of two or more mutations differs from the sum of their individual effects.
However, if a particular mutational trajectory has a higher probability than all the others, epistasis effect considerations (Romero and Arnold 2009; Sailer and Harms 2017b) may be crucial to understanding the reason for such a preference. Overall, Eq. (4) enables us to calculate the evolution of the protein folding rate after j consecutive mutations, regardless of the paths that evolution takes in the protein space, as follows:

$${\tau }_{j} \sim {\tau }_{wt} \ e^{\beta \Delta \Delta G_{j}}$$
(5)

Fig. 2

Cartoon of the Protein Space (Maynard Smith 1970) as a model of evolution, where each circle represents a functional protein that differs from any of its nearest neighbors by one amino acid. The yellow- and red-filled circles, i.e., the wild-type (wt) and the target-sequence (ts), respectively, represent the starting and finishing functional proteins of any possible mutational trajectory. The green- and blue-filled circles illustrate two arbitrary mutational trajectories, each representing a walk in the protein space of 15 and 21 mutational steps (amino-acid substitutions), respectively. Then, for any mutational trajectory, the following relation for the protein marginal-stability evolution holds: ΔΔG = (ΔGts − ΔGwt). Consequently, \({\tau }_{ts }\sim {\tau }_{wt} \ { e}^{\beta \Delta \Delta G}\), with β = 1/RT (see text for details)
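The path independence expressed by Eq. (5) can be illustrated numerically. In the sketch below, the two trajectories and their ΔG values are invented placeholders chosen only so that both walks share the same starting and target barriers; they are not experimental data, and the function name is hypothetical.

```python
import math

R_KCAL = 1.987204e-3  # gas constant in kcal/(mol K)

def fold_change_along_path(barriers_kcal, temperature=298.0):
    """Sum the stepwise changes in dG along a mutational trajectory
    (wild type first, target last) and apply Eq. (5)."""
    steps = [b - a for a, b in zip(barriers_kcal, barriers_kcal[1:])]
    ddg_total = sum(steps)  # telescopes to (last - first)
    ratio = math.exp(ddg_total / (R_KCAL * temperature))  # tau_j / tau_wt
    return ddg_total, ratio

# Hypothetical dG values (kcal/mol); same endpoints, different routes.
path_a = [5.0, 5.8, 4.9, 6.1]
path_b = [5.0, 4.2, 4.6, 5.5, 6.1]

ddg_a, ratio_a = fold_change_along_path(path_a)
ddg_b, ratio_b = fold_change_along_path(path_b)
print(f"path A: ddG = {ddg_a:.2f} kcal/mol, tau_j/tau_wt = {ratio_a:.2f}")
print(f"path B: ddG = {ddg_b:.2f} kcal/mol, tau_j/tau_wt = {ratio_b:.2f}")
```

Because the intermediate terms cancel in the sum, both trajectories give the same total ΔΔG and therefore the same ratio of folding times, up to floating-point rounding.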

In light of all of this, directed evolution studies (Arnold, 2009; Romero and Arnold 2009; Socha and Tokuriki 2013) that look for protein sequences that carry out a desired function in a specific amount of time would undoubtedly benefit from knowing all of the parameters influencing protein evolvability, especially those governing changes in folding rates as a result of mutations.

In general, the above analysis confirms that evolution influences, through mutations, unfolding/folding time scales by altering the height and composition of the energetic barrier but not their rate-limiting step set by the physics of folding, namely, by the largest-possible change in the free energy barrier (|ΔΔG|< ~ 7.4 kcal/mol). From a thermodynamic standpoint, this barrier defines a threshold beyond which a two-state protein unfolds or becomes non-functional and, from a kinetic viewpoint, the time ceiling for the unfolding process.

In the following subsections, the magnitude of the τj changes upon mutations will be illustrated by using data from (a) post-translational modifications (PTMs) and (b) amino acid substitutions.

(a) Post-translational modifications

Among all possible PTM effects, we choose N-linked glycosylation, a covalent attachment of carbohydrate to certain residues of a protein, because it is, on the one hand, one of the most common PTMs in eukaryotes (Shental-Bechor and Levy 2008; Hanson et al. 2009; Ellis et al. 2012) and, on the other, one whose effects on two-state protein folding rates are well documented (Hanson et al. 2009). Thus, a study on the observed folding energetics of the mono-N-glycosylated adhesion domain of the human immune cell receptor cluster of differentiation 2 (hCD2ad), a protein with a β-sandwich topology, reveals that the first saccharide unit of the N-glycan is responsible for a stabilization free energy (ΔΔG1) of ~ 2.3 kcal/mol and a rate 50-fold slower than that of the nonglycosylated protein (Hanson et al. 2009). This observed rate change upon a PTM is fully consistent with Eq. (5), which, for such a change in the protein marginal stability, predicts an unfolding ~ 48-fold slower (τ1 ~ 48 τwt) than that of the nonglycosylated wild type (τwt) at room temperature (298 K). We focused on the analysis of the effect of the first N-linked glycan because it accounts for 65% of the total N-glycan contribution to the thermodynamics and kinetics of hCD2ad folding (Hanson et al. 2009).

It is worth noting that glycosylation does not always lead to a more stable protein structure. Indeed, there is evidence that the contrary occurs for O-glycosylation of the serum vitamin D binding protein for which each event destabilizes the protein by ~  − 1 kcal/mol (Spiriti et al. 2008). Then, the unfolding speed will be ~ fivefold faster than that of the nonglycosylated one (τwt), as indicated by Eq. (5).
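Both glycosylation examples reduce to evaluating the Boltzmann factor of Eq. (5) at 298 K. The following minimal sketch reproduces the two estimates from the ΔΔG values quoted above; the helper name is hypothetical and the gas constant is the standard value in kcal/(mol K).

```python
import math

R_KCAL = 1.987204e-3  # gas constant in kcal/(mol K)

def time_ratio(ddg_kcal, temperature=298.0):
    """tau_1 / tau_wt from Eq. (5) for a stability change ddg_kcal."""
    return math.exp(ddg_kcal / (R_KCAL * temperature))

# N-glycosylation of hCD2ad: stabilization of ~2.3 kcal/mol (from the text)
slowdown = time_ratio(2.3)
# O-glycosylation of vitamin D binding protein: ~ -1 kcal/mol per event
speedup = 1.0 / time_ratio(-1.0)

print(f"stabilizing PTM: unfolding ~{slowdown:.0f}-fold slower")
print(f"destabilizing PTM: unfolding ~{speedup:.1f}-fold faster")
```

The first ratio lands close to the ~ 48-fold slowdown and the second close to the ~ fivefold speed-up mentioned in the text; small differences arise only from rounding of RT.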

(b) Amino acid substitutions

An analysis to determine the magnitude of the τj changes upon amino acid substitution is, actually, unnecessary because the existence of large databases providing detailed information on the changes in protein stability upon single-point mutations makes their computation trivial. Indeed, ThermoMutDB (Xavier et al. 2021) is a manually curated database containing ~ 8,800 entries that collect experimental information on the effect of single-point mutations on protein stability (ΔΔG1), together with available experimental structural information. Then, the corresponding values for ln (Δτ1) or τ1 can be straightforwardly computed by using Eqs. (4) and (5), respectively. At this point, it is worth noting that, for nearly all (~ 98%) of the single-point mutated proteins in the ThermoMutDB database, the reported |ΔΔG1| values are ≤  ~ 7.4 kcal/mol, which supports the hypothesis that protein marginal stability cannot exceed this threshold (Vila 2019, 2021).
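As a hedged illustration of how such a tabulation might be processed, the sketch below converts a list of ΔΔG values into ln (Δτ) via Eq. (4) and reports the fraction of entries within the 7.4 kcal/mol bound. The input values are arbitrary placeholders, not ThermoMutDB records, and the function name and data layout are assumptions made for this example only.

```python
import math

R_KCAL = 1.987204e-3  # gas constant in kcal/(mol K)
DDG_LIMIT = 7.4       # marginal-stability upper bound from the text, kcal/mol

def summarize(ddg_values, temperature=298.0):
    """Return ln(tau_1/tau_wt) for each entry (Eq. 4) and the fraction
    of entries whose |ddG| does not exceed the stability bound."""
    rt = R_KCAL * temperature
    log_ratios = [ddg / rt for ddg in ddg_values]
    within = sum(1 for ddg in ddg_values if abs(ddg) <= DDG_LIMIT)
    return log_ratios, within / len(ddg_values)

# Placeholder ddG values (kcal/mol); NOT taken from any database.
example_ddg = [-2.1, 0.4, 1.3, -0.8, 8.2]
log_ratios, fraction = summarize(example_ddg)

for ddg, log_ratio in zip(example_ddg, log_ratios):
    print(f"ddG = {ddg:+.1f} kcal/mol -> ln(dtau) = {log_ratio:+.2f}")
print(f"fraction within bound: {fraction:.2f}")
```

With real database exports, the same function could be applied to the column of measured ΔΔG values after converting them to kcal/mol and checking the sign convention used by the source.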

Conclusions

The analysis has made it possible for us to find a straightforward answer to a key question that sits at the heart of Levinthal’s paradox: how long does it take for a protein to achieve its native state? As proved, it takes seconds—not years, as suggested by a naïve solution to the dilemma—for a two-state protein of any sequence or length to acquire its native state. Also, it helped us to comprehend why proteins reach their native state within a biologically acceptable timeframe, specifically because the largest-possible change in the two-state protein free-energy barrier (~ 7.4 kcal/mol) is a consequence of the validity of the thermodynamic hypothesis—or Anfinsen’s dogma—a limit set by the physics of folding. Furthermore, we have shown that the evolution of protein folding rates is primarily driven by changes in the marginal stability of proteins caused by amino acid substitutions and/or post-translational modifications. This dependence ensures that, given a starting and a target sequence, whatever the mutational paths in sequence space or epistasis effects are, they will not have an impact on the determination of the evolution of the protein folding rate. This is an important result since the evolutionary trajectories are unpredictable, and the estimation of the epistasis effects is a daunting task. Moreover, if folding/unfolding speed becomes a bottleneck in the search for new proteins and functions, the prediction of the folding rate becomes important, and all factors influencing it should be thoroughly investigated. The analysis offered from this point of view may well be a good place to start.

Overall, this review focuses on protein sequence changes caused by mutations, i.e., amino-acid substitutions and/or post-translational modifications, and their impact on protein folding rates, a phenomenon closely related to one of the most important unanswered questions in structural biology: how a sequence encodes its folding. In this regard, we have learned that some properties of two-state proteins, such as their slowest folding time, are sequence-independent. As already explained, this is a consequence of a universal feature of proteins, namely, the existence of a marginal-stability upper bound beyond which the protein unfolds or becomes non-functional. Then, all biologically relevant processes must take place under this stability threshold and, hence, are sequence-dependent, since the sequence determines the three-dimensional structure of a protein, which in turn regulates its function. Therefore, finding a solution to the abovementioned question becomes critical and highly relevant in this context, as state-of-the-art numerical methods have so far been unable to solve it. The current study, we firmly believe, will encourage researchers to continue looking for solutions to this and other unsolved structural biology problems.