
1 Introduction

Protein dynamics and folding are challenging phenomena whose study is essential for a molecular-level understanding of protein function. Molecular dynamics (MD) simulation is a valuable tool for gaining structural and functional insight into macromolecules. The data generated by an MD simulation can provide detailed information about macromolecular structure and behavior [1].

1.1 Importance of Molecular Dynamics

Proteins and nucleic acids are dynamic entities, and their dynamics play a significant role in their functions. Crystal structures deposited in the Protein Data Bank (PDB) provide only a partial, static view of three-dimensional (3D) structure. Proteins in particular undergo crucial conformational changes while performing a function [2, 3]. One such change is the structural rearrangement of the protein upon binding a substrate or inhibitor [4, 5], which can be verified by comparing apo and ligand-bound 3D structures. Conformational changes are also a common feature of enzymes’ catalytic mechanisms [6]. Typical instances are loop movements or domain rearrangements that alter the local chemical environment of the active site so that the function can be performed. Sometimes these alterations activate the catalytic process by bringing protein subunits together. Protein function can therefore be understood fully only when dynamic properties are considered [7,8,9].

There are several ways to characterize the conformations associated with a given macromolecular function. A conventional way is to gather experimentally determined structures that cover the conformational space, using X-ray crystallography, nuclear magnetic resonance (NMR) spectroscopy, or cryo-electron microscopy (cryo-EM). These methods can be used to study structures of macromolecules in different environments or bound to different substrates or ligands. However, such experimental studies are time-consuming and need specialized high-end instruments.

On the other hand, theoretical approaches are well suited to obtaining a picture of the dynamic properties of a protein. Protein folding can occur on a timescale of a few microseconds, allosteric transitions in microseconds to milliseconds, relative motions of protein domains in nanoseconds to seconds, and side-chain dynamics in picoseconds to nanoseconds (Fig. 1) [10]. Additionally, longer timescale motions can influence shorter timescale dynamics and vice versa. Hence, long timescale simulations have been a preferred option [11, 12]. Long simulations provide an opportunity to understand the flexibility of proteins and the associated ensemble of alternative structural states, which are crucial for understanding protein folding and dynamics [13, 14].

Fig. 1
A schematic representation of the protein motion with time for bond stretching, elastic vibrations of proteins, alpha-helix folding, beta-hairpin folding, and protein folding.

The figure shows protein motions along the time axis. MD simulations must use femtosecond time steps to capture bond stretching, so reaching the slower motions shown, which span much longer timescales, requires many more integration steps. Hence, more computational power is required.

Conformational changes play a vital role in protein function [15, 16]. Hence, it is not enough to study just one PDB conformer. Advances in simulation algorithms and computing have promoted the idea of “conformational ensembles” as an alternative to examining a single structure from the PDB. These ensembles of conformers can be examined to determine thermodynamic properties, entropy, free energy, conformational changes, or protein folding [16,17,18]. There are two significant difficulties in MD simulations of biomolecules: adequate conformational sampling and accurate physical force fields. Despite remarkable improvements in computing capacity, conventional MD (cMD) simulations are still essentially limited to timescales shorter than those of many biomolecular motions and functions [19,20,21,22]. Hence, a dedicated method is required to gather multiple conformations.

Furthermore, protein folding, the conversion of the primary sequence of a protein into its native 3D structure, remains one of the fundamental and least understood phenomena in biology. Small proteins with ~10–100 amino acid residues that fold on microsecond to sub-millisecond timescales are known as “fast-folding” proteins. They are excellent model systems for studying protein folding through long timescale cMD simulations in explicit water [23]. Describing the free energy landscape of folding appropriately requires extensive conformational sampling and considerable computational capacity. Longer simulations on faster hardware are not by themselves sufficient to expand conformational sampling: because of the complexity of the free energy landscape, most simulations explore only a small region around the energy minimum nearest to the initial conformation. With current high-performance computing (HPC) systems, an obvious approach is to run a series of parallel simulations from several energy-minimized initial conformations. Although this can be effective, it requires detailed prior knowledge of the system and cannot be applied as a general strategy.

Nevertheless, protein folding has been analyzed using cMD together with efficient sampling and analysis methods such as replica-exchange MD [24], Markov State Models (MSM), biased MD simulations such as bias-exchange metadynamics [25], and transition path sampling [26]. This chapter sheds light on how MSM helps tackle protein dynamics and folding problems.

1.2 Motivation Behind Using MSM Technique

At times, studying protein folding and dynamics requires long timescale simulations, or the system is highly complex or very large (as in membrane protein simulations). The first microsecond-length all-atom MD simulation of a small protein was carried out by Duan and Kollman [27]. Further advances in computing power have opened up MD simulations of systems with thousands of protein atoms and long timescale simulations of proteins. Biomacromolecules frequently perform their functions through dynamic transitions between conformational states. For instance, the AdeB efflux pump contributes to carbapenem resistance through conformational changes [28]. By modeling long timescale dynamics from many short MD simulations, MSM has emerged as a prominent method for bridging this timescale gap [2, 29].

Representing physical, chemical, or biological systems by stochastic processes is standard practice. The objective is to analyze the stochastic model and compute, at least approximately, the properties of interest. Direct sampling and building a coarse-grained model of the system are two ways of carrying out such an analysis. In direct sampling, one attempts to generate a statistically significant number of events representing the property in question. Collecting sufficient statistics for accurate estimates requires much computation, and estimation through direct numerical simulation can become impractical, especially if the state space is continuous and high-dimensional [30]. In the coarse-grained approach, the state space of the system is discretized, which is what an MSM does. The advantage is that the model lives on a finite discrete space, so even very large systems become finite discrete models whose properties can be computed numerically. Transition path theory (TPT) can then be used to analyze the discrete states; in short, TPT analyzes the ensemble of reactive trajectories, i.e., trajectories that originate from a specified set of states A and go to a set B. Hence, such a technique provides a more comprehensive analysis of biomolecular simulations.

2 Markov State Model

The Markov State Model (MSM) is a theoretical model frequently used to study the dynamics of biological systems. The basic idea of MSM is to construct a square matrix known as the transition probability matrix (TPM). In the case of protein dynamics, an MSM is built from MD simulation trajectories.

2.1 Building of MSM

To develop an MSM, an adaptive sampling algorithm is frequently used. Adaptive sampling is a statistical approach for sampling conformational transitions of proteins on long timescales (100 μs to ms). The algorithm is iterative and is repeated until the desired sampling criteria are reached [19]. Each iteration has three steps: (i) run MD simulations to obtain many short trajectories, (ii) build an MSM from the trajectories, and (iii) start new simulations based on the results of the MSM. Because an MSM is built on a matrix of transitions between states, it needs microstates, which can be defined in two ways: based on geometric similarity (a distance metric) or based on the free energy map (a kinetic metric). Defining states by free energy minima, i.e., the kinetic criterion, is preferred over the purely geometric one. The MSM workflow is illustrated in Fig. 2.
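Step (iii) of the adaptive sampling loop needs a rule for choosing where to restart simulations. The sketch below assumes one simple rule, restarting from the least-visited microstates that have been observed at least once; the function name `pick_restart_states` and the toy trajectories are hypothetical, and other selection criteria are equally possible.

```python
import numpy as np

def pick_restart_states(dtrajs, n_states, n_restarts):
    """Pick the least-visited (but visited) microstates as restart seeds.

    This counts-based rule is an assumption made for illustration only.
    """
    visits = np.zeros(n_states, dtype=np.int64)
    for dtraj in dtrajs:
        visits += np.bincount(dtraj, minlength=n_states)
    # Only states seen at least once have a conformation to restart from.
    visited = np.flatnonzero(visits > 0)
    # Stable sort so that ties are broken by the lower state index.
    order = visited[np.argsort(visits[visited], kind="stable")]
    return order[:n_restarts].tolist(), visits

# Toy discretized trajectories over 4 microstates (made-up data).
dtrajs = [np.array([0, 0, 1, 0, 1, 1]), np.array([1, 1, 2, 1, 0, 0])]
seeds, visits = pick_restart_states(dtrajs, n_states=4, n_restarts=2)
print(visits)  # visits per microstate
print(seeds)   # microstates from which the next round would be started
```

In a real workflow, each selected microstate would be mapped back to one of its MD frames, new short simulations would be started from those frames, and the MSM would be rebuilt with the extended data.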

Fig. 2
A process flow of MSM has the following steps: molecular dynamics, clustering, metastable states, MSM model, MSM model validation, and analysis.

The schematic pathway of the Markov State Model (MSM)

2.2 Microstates and Macrostates Generation

Microstates are required to construct an MSM. They are nonoverlapping discrete regions of configurational space. Transitions among microstates are assumed not to depend on previously visited states, a property known as memoryless transition. One therefore needs microstates within which conformations interconvert smoothly and rapidly. This requires grouping configurations, known as clustering. Since many clustering techniques are available, they must be chosen wisely. Geometric clustering requires choosing a distance metric; k-centers, k-medoids, and hybrid k-centers/k-medoids are some of the essential clustering algorithms. To determine states, one first runs the MD simulations and then groups conformations based on either the root mean square deviation (RMSD), with a suitably chosen cutoff of about 2 to 3 Å, or the energy barriers between them. It is usually assumed that the higher the structural similarity, the higher the kinetic similarity. Grouping kinetically related microstates into larger macrostates is known as kinetic clustering [31].
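As an illustration of geometric clustering, the following is a minimal sketch of greedy k-centers clustering. It assumes plain Euclidean distance on made-up 2D points as a stand-in for an RMSD metric between conformations; the function name `k_centers` and the data are hypothetical.

```python
import numpy as np

def k_centers(points, k, first=0):
    """Greedy k-centers: repeatedly promote the point farthest from all centers."""
    centers = [first]
    # Distance from every point to its nearest chosen center so far.
    nearest = np.linalg.norm(points - points[first], axis=1)
    labels = np.zeros(len(points), dtype=int)
    while len(centers) < k:
        new = int(np.argmax(nearest))
        centers.append(new)
        d_new = np.linalg.norm(points - points[new], axis=1)
        # Reassign points that are strictly closer to the new center.
        closer = d_new < nearest
        labels[closer] = len(centers) - 1
        nearest = np.minimum(nearest, d_new)
    # The largest remaining distance is the maximum cluster radius.
    return centers, labels, float(nearest.max())

# Toy 2D "conformations" (made-up coordinates standing in for RMSD space).
points = np.array([[0.0, 0.0], [0.2, 0.1], [5.0, 5.0], [5.1, 4.9], [10.0, 0.0]])
centers, labels, radius = k_centers(points, k=3)
print(centers, labels, radius)
```

The returned radius can be compared with the chosen cutoff: clusters would be added until the maximum radius falls below it.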

In Markovian microstate formation, transitions are observed at a fixed time interval known as the lag time or Markovian lag (τ). After a lag time τ, the state of the system is assumed not to depend on states before the current one. MSM building requires transition probabilities among the microstates, which depend on the number of microstates and the lag time. The lag time should be long enough for the dynamics to be memoryless, but not so long that the transitions of interest are no longer resolved. In practice, the Markovian lag is the stride at which the trajectories are compared to count transitions, and it must be chosen carefully.
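The memoryless assumption at lag time τ can be stated compactly. In the notation assumed here, X_t is the microstate occupied at time t and P_ij(τ) is the probability of being in state j a time τ after being in state i:

$$ P\left({X}_{t+\tau }=j\mid {X}_t=i,{X}_{t-\tau },{X}_{t-2\tau },\dots \right)=P\left({X}_{t+\tau }=j\mid {X}_t=i\right)={P}_{\mathrm{ij}}\left(\tau \right) $$

In words, once the current state is known, earlier states carry no additional information about the next state.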

Additionally, for models with tens of thousands of microstates or very large systems (such as membrane protein simulations), kinetic clustering can be used to group microstates into supersets named macrostates. Macrostates are obtained by coarse-graining the model: microstates that interconvert quickly are lumped together to form a macrostate. Available lumping procedures include Perron cluster cluster analysis (PCCA), its improved version (PCCA+), the Bayesian agglomerative clustering engine (BACE), and super level set hierarchical clustering (SHC).
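The idea behind spectral lumping can be sketched with a deliberately simplified rule: split the microstates into two macrostates by the sign of the second eigenvector of the transition matrix. This is only in the spirit of PCCA, not an implementation of PCCA+, BACE, or SHC; the 4-state matrix and the function name `lump_two_macrostates` are made up for illustration.

```python
import numpy as np

# Made-up 4-microstate transition matrix: states {0, 1} and {2, 3} interconvert
# quickly within each pair and only rarely between the pairs.
P = np.array([
    [0.89, 0.10, 0.01, 0.00],
    [0.10, 0.89, 0.00, 0.01],
    [0.01, 0.00, 0.89, 0.10],
    [0.00, 0.01, 0.10, 0.89],
])

def lump_two_macrostates(P):
    """Assign each microstate to one of two macrostates by eigenvector sign."""
    vals, vecs = np.linalg.eig(P)
    order = np.argsort(-vals.real)       # largest eigenvalue first
    second = vecs[:, order[1]].real      # slowest relaxation process
    labels = (second > 0).astype(int)
    # Eigenvector signs are arbitrary; fix the convention that state 0 is in macrostate 0.
    if labels[0] == 1:
        labels = 1 - labels
    return labels

labels = lump_two_macrostates(P)
print(labels)
```

Microstates that exchange rapidly share the same sign in the slowest eigenvector, which is why they end up in the same macrostate.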

2.3 MSM Model and Validation

After obtaining the microstates, the next step is to construct the transition count matrix (TCM), which records the number of observed transitions from each state to every other state. The TCM in general form is shown below:

$$ M=\left[\begin{array}{cccc}{a}_{11}& {a}_{12}& \cdots & {a}_{1\mathrm{n}}\\ {}{a}_{21}& {a}_{22}& \cdots & {a}_{2\mathrm{n}}\\ {}\vdots & \vdots & \ddots & \vdots \\ {}{a}_{\mathrm{n}1}& {a}_{\mathrm{n}2}& \cdots & {a}_{\mathrm{n}\mathrm{n}}\end{array}\right] $$

where a_ij denotes the number of transitions from the ith state to the jth state. For example, suppose two states, A and B, are chosen and the trajectory is:

$$ \mathrm{Trajectory}: AABBBABABAABB. $$

If transitions are counted at a lag of one step, the number of transitions from A to A is 2 (N_AA = 2), from A to B is 4 (N_AB = 4), from B to A is 3 (N_BA = 3), and from B to B is 3 (N_BB = 3). The TCM can then be written as shown in Table 1.

Table 1 Transition count matrix representing the transition between states A and B
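The counts for the example trajectory can be reproduced with a few lines of code. This is a minimal sketch; the helper name `count_matrix` and its `lag` parameter are assumptions made for illustration.

```python
import numpy as np

def count_matrix(traj, states, lag=1):
    """Count transitions i -> j observed `lag` steps apart (lag must be >= 1)."""
    index = {s: i for i, s in enumerate(states)}
    counts = np.zeros((len(states), len(states)), dtype=int)
    # Pair every frame with the frame `lag` steps later (sliding window).
    for a, b in zip(traj[:-lag], traj[lag:]):
        counts[index[a], index[b]] += 1
    return counts

M = count_matrix("AABBBABABAABB", states="AB")
print(M)  # rows and columns ordered A, B: [[2, 4], [3, 3]]
```

The 13-frame trajectory yields 12 one-step transitions, which is the sum of all entries.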

The transition count matrix is usually not symmetric, so it must be symmetrized. This uses the property that any square matrix can be written as the sum of a symmetric matrix and an antisymmetric matrix [32].

$$ M=\frac{\left[M+{M}^T\right]}{2}+\frac{\left[M-{M}^T\right]}{2} $$

where M^T is the transpose of M, [M + M^T] is symmetric, and [M − M^T] is antisymmetric.

The count matrix is symmetrized because, at equilibrium, a transition between two states should be observed as often in the forward direction as in the reverse direction. The transpose matrix holds the counts of the reverse transitions. The transpose of the TCM is shown below:

$$ {M}^T=\left[\begin{array}{cccc}{a}_{11}& {a}_{21}& \cdots & {a}_{\mathrm{n}1}\\ {}{a}_{12}& {a}_{22}& \cdots & {a}_{\mathrm{n}2}\\ {}\vdots & \vdots & \ddots & \vdots \\ {}{a}_{1\mathrm{n}}& {a}_{2\mathrm{n}}& \cdots & {a}_{\mathrm{n}\mathrm{n}}\end{array}\right] $$

In the transpose, each row (horizontal elements) becomes a column (vertical elements) and vice versa, as shown in Table 2.

Table 2 Transpose of the transition count matrix

Averaging the transition count matrix with its transpose gives a symmetric matrix:

$$ {M}^{\mathrm{symm}}=\frac{M+{M}^T}{2} $$

The symmetrized matrix is shown below:

$$ {M}^{\mathrm{symm}}=\frac{1}{2}\left[\begin{array}{cccc}{a}_{11}+{a}_{11}& {a}_{12}+{a}_{21}& \cdots & {a}_{1\mathrm{n}}+{a}_{\mathrm{n}1}\\ {}{a}_{21}+{a}_{12}& {a}_{22}+{a}_{22}& \cdots & {a}_{2\mathrm{n}}+{a}_{\mathrm{n}2}\\ {}\vdots & \vdots & \ddots & \vdots \\ {}{a}_{\mathrm{n}1}+{a}_{1\mathrm{n}}& {a}_{\mathrm{n}2}+{a}_{2\mathrm{n}}& \cdots & {a}_{\mathrm{n}\mathrm{n}}+{a}_{\mathrm{n}\mathrm{n}}\end{array}\right] $$

For the present example, the symmetric matrix is shown in Table 3.

Table 3 Symmetry matrix for present trajectory
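For the example trajectory, the counts N_AA = 2, N_AB = 4, N_BA = 3, and N_BB = 3 give the following matrices, with rows and columns ordered A, B. This is only the arithmetic of the preceding equations written out:

$$ M=\left[\begin{array}{cc}2& 4\\ {}3& 3\end{array}\right],\quad {M}^T=\left[\begin{array}{cc}2& 3\\ {}4& 3\end{array}\right],\quad {M}^{\mathrm{symm}}=\frac{M+{M}^T}{2}=\left[\begin{array}{cc}2& 3.5\\ {}3.5& 3\end{array}\right] $$

The off-diagonal entry 3.5 is the average of the four A to B and three B to A transitions.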

After this, the reversible TPM is calculated element by element. A TPM must satisfy two requirements. First, the probabilities in each row must sum to unity, and second, all elements must be nonnegative, because a probability can only take values between zero and one. Another essential point is that the transition probability depends only on the time difference, i.e., the process should be time-homogeneous [33].

$$ {P}_{\mathrm{ij}}=\frac{M_{\mathrm{ij}}^{\mathrm{symm}}}{\sum_{k=1}^n{M}_{\mathrm{ik}}^{\mathrm{symm}}} $$

The transition probability matrix is shown below:

$$ {M}_{\mathrm{prob}}=\left[\begin{array}{cccc}{P}_{11}& {P}_{12}& \cdots & {P}_{1\mathrm{n}}\\ {}{P}_{21}& {P}_{22}& \cdots & {P}_{2\mathrm{n}}\\ {}\vdots & \vdots & \ddots & \vdots \\ {}{P}_{\mathrm{n}1}& {P}_{\mathrm{n}2}& \cdots & {P}_{\mathrm{n}\mathrm{n}}\end{array}\right] $$

For the present example, the transition probability matrix is shown in Table 4.

Table 4 The transition probability matrix for a given trajectory

The eigenvalues of the TPM are obtained from the auxiliary (characteristic) equation:

$$ \mathrm{Auxiliary}\ \mathrm{equation}:\left|{M}_{\mathrm{prob}}-\lambda I\right|=0; $$

where I is the identity matrix and λ denotes an eigenvalue.
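For the two-state example, the row normalization and the eigenvalue problem can be sketched as follows. The symmetrized counts are taken from the worked trajectory; the variable names are chosen for illustration, and NumPy's general eigensolver is an assumption about tooling rather than the method used by any particular MSM package.

```python
import numpy as np

# Symmetrized counts for the example trajectory (rows and columns ordered A, B).
M_symm = np.array([[2.0, 3.5],
                   [3.5, 3.0]])

# Row-normalize: P_ij = M_ij / sum_k M_ik, so every row sums to one.
P = M_symm / M_symm.sum(axis=1, keepdims=True)

# Eigenvalues of P; eigenvectors of P transposed are the left eigenvectors of P.
vals, left_vecs = np.linalg.eig(P.T)
order = np.argsort(-vals.real)          # largest eigenvalue first
vals = vals.real[order]

# The left eigenvector for eigenvalue 1, normalized to sum to one,
# is the stationary distribution.
pi = left_vecs[:, order[0]].real
pi = pi / pi.sum()

print(P)     # [[4/11, 7/11], [7/13, 6/13]]
print(vals)  # 1 and -25/143
print(pi)    # [11/24, 13/24]
```

The stationary distribution is proportional to the row sums of the symmetrized count matrix (5.5 and 6.5), which is a direct consequence of the symmetrization. The second eigenvalue is negative here because the toy trajectory flips between A and B more often than it stays put; an implied timescale −τ/ln λ is defined only for positive eigenvalues.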

Solving the auxiliary equation for the TPM gives the eigenvalues and the corresponding eigenvectors. Because each row of the TPM sums to one, the largest eigenvalue is exactly 1, and its left eigenvector gives the equilibrium (stationary) distribution over the states. The remaining eigenvalues are no larger than 1 in magnitude and describe relaxation toward equilibrium: the closer an eigenvalue is to 1, the slower the corresponding process. There are several methods and tests to validate the models, such as the Chapman–Kolmogorov test, the correlation function test, Bayesian model selection, and the Swope–Pitera eigenvalue test.
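The Chapman–Kolmogorov test checks whether a model estimated at lag τ, propagated k times, agrees with a model estimated directly at lag kτ. The sketch below assumes a synthetic two-state chain with made-up transition probabilities; the helper names (`count_matrix`, `estimate_tpm`, `ck_error`) are hypothetical, and the maximum absolute difference is only one possible error measure.

```python
import numpy as np

def count_matrix(dtraj, n_states, lag):
    """Count transitions observed `lag` steps apart."""
    counts = np.zeros((n_states, n_states))
    for a, b in zip(dtraj[:-lag], dtraj[lag:]):
        counts[a, b] += 1
    return counts

def estimate_tpm(dtraj, n_states, lag):
    """Symmetrize the counts and row-normalize them into a TPM."""
    counts = count_matrix(dtraj, n_states, lag)
    counts = 0.5 * (counts + counts.T)
    return counts / counts.sum(axis=1, keepdims=True)

def ck_error(dtraj, n_states, lag, k):
    """Largest difference between P(lag)^k and the TPM estimated at k * lag."""
    predicted = np.linalg.matrix_power(estimate_tpm(dtraj, n_states, lag), k)
    observed = estimate_tpm(dtraj, n_states, k * lag)
    return float(np.abs(predicted - observed).max())

# Synthetic two-state Markov chain (made-up transition probabilities).
rng = np.random.default_rng(0)
true_P = np.array([[0.9, 0.1], [0.2, 0.8]])
state, dtraj = 0, [0]
for u in rng.random(20000):
    # Jump to state 1 with probability true_P[state, 1], otherwise go to state 0.
    state = 1 if u < true_P[state, 1] else 0
    dtraj.append(state)
dtraj = np.array(dtraj)

err = ck_error(dtraj, n_states=2, lag=1, k=3)
print(err)  # small for data that are Markovian at the chosen lag
```

For real MD data, a large discrepancy at a given lag suggests that the lag time is too short or the state decomposition too coarse for the Markov assumption to hold.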

3 MSM to Understand Protein Folding and Dynamics

MSM was first applied to peptide folding [34,35,36] and other small systems [37]. It was later applied to protein folding, protein–ligand binding, nucleic acids, and other biological problems (Fig. 3). It is used to analyze both short and long timescale simulations to gather relevant information. We now discuss how MSM is used to understand protein folding and dynamics, focusing on ensemble sampling and conformational fluctuations.

Fig. 3
A radial diagram of MSM lists the following: peptide, protein folding, intrinsically disordered proteins, protein-ligand, nucleic acid, and native state conformation changes.

Applications of MSM in protein folding and protein dynamics

3.1 Peptide Modeling

Researchers have tried to understand the mechanism of protein folding and the nature of folds, and MD simulations have regularly been used along with experimental studies. In 2004, Swope et al. developed an algorithm to study the kinetics of protein folding and applied it to a small peptide, the C-terminal β-hairpin motif from protein G. They used a Boltzmann-weighted ensemble to formulate the transition function from MD simulation [35] and characterized the pattern and number of hydrogen bonds in the peptide. A Markov model depends on a finite number of metastable states; thus, identifying them is a critical step. Hence, a clustering algorithm was applied to obtain kinetics-based states that are long-lived in the dynamics. This kinetics-based clustering was used by Noé et al., who tested it on ALA8 and ALA12 peptides [36]. Their study brought a new direction to defining metastable states, which considers dynamic behavior rather than geometric proximity. Following this method, an automated algorithm that detects kinetically metastable states was proposed and tested on three peptides [38]. After this, a master equation approach was developed by Buchete and Hummer for studying MD simulations of peptide folding at the atomistic level [39]; the ALA5 peptide, which is expected to form a small helix, was used for the study. In recent studies, this technique has been used to study peptides such as amyloid-β (Aβ), which is implicated in Alzheimer’s disease [40].

3.2 Protein Folding

Predicting protein folding in silico has been a challenge since the inception of protein simulation. A protein chain has an astronomically large number of possible conformations, as stated by the Levinthal paradox [41]. However, a protein in its natural environment can fold within a few microseconds and retains its native fold to function [42]. Beyond predicting the folded structure, understanding the different folding conformations and the folding rates also matters [43]. Several mechanisms have been proposed to explain the protein folding process, from a simple two-state model [44] to more complex models [45]. It has also been observed that some proteins do not fold and exist in an intrinsically disordered state [46].

Additionally, protein misfolding occurs and has been observed in neurodegenerative disorders [47]. Thus, gathering information on the folded and unfolded states is not enough; the intermediate, misfolded, and disordered states should also be analyzed. MSM uses MD simulation data to find the transition probabilities between a finite number of states. Initially, the model is constructed using geometric similarity between conformations [48, 49]. The obvious choice is to use the RMSD between conformations, limited to a small cut-off value [50]. However, the RMSD used as the distance metric is typically based on the protein backbone, so side-chain and dihedral-angle flips may not be captured, which can affect the results. The underlying assumption is that conformations with small structural deviations interconvert quickly, so kinetically relevant metastable states still need to be identified afterwards. Different clustering algorithms have been used [51], such as k-centers clustering, k-medoids clustering, and a hybrid of the two. The k-centers algorithm aims to find clusters with approximately the same radius and assigns each conformation to the nearest cluster center, so that its distance to the center is minimal. Li et al. and Voelz et al. used this clustering algorithm to generate microstates efficiently [29, 52]. In k-medoids clustering, the optimization is performed on the average distance between the center and the other points of the cluster. In protein folding, this algorithm creates many clusters in the folded region and very few in the unfolded region [53]. The hybrid of the two techniques is implemented in MSMBuilder2 [54].
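The difference between the two clustering objectives can be written out explicitly. In the notation assumed here, X is the set of conformations, C is a set of k cluster centers, and d is the chosen distance metric (for example, RMSD):

$$ k\text{-}\mathrm{centers}:\ \min_{C,\ \left|C\right|=k}\ \max_{x\in X}\ \min_{c\in C}\ d\left(x,c\right);\quad k\text{-}\mathrm{medoids}:\ \min_{C\subseteq X,\ \left|C\right|=k}\ \frac{1}{\left|X\right|}\sum_{x\in X}\min_{c\in C}\ d\left(x,c\right) $$

The first objective bounds the worst-case cluster radius, which tends to give clusters of similar size in conformational space. The second minimizes the average distance, which places more centers where the data are dense.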

3.3 Protein–Ligand Binding

Analyzing the interaction of a protein with its substrate or inhibitor can provide critical information about the protein’s function [5, 55]. MSM methods can be used to study the binding of small molecules to proteins and to detect new binding sites. Binding kinetics has been studied by constructing an MSM to find long-lived intermediates in the binding of trypsin inhibitors [56]. The induced fit model (the conformation changes as a result of ligand binding) and the conformational selection model (the ligand binds a pre-existing protein conformation without inducing a change) are used to describe protein–ligand recognition [57,58,59]. It was later observed that both mechanisms operate in real systems [60,61,62]. To quantify the contribution of each mechanism, an analytical model based on a three-pronged approach of MD simulation, flux analysis, and MSM was developed [63]. The choline-binding protein (ChoX) was used as a case study, and MD and MSM methods were used to obtain the parameters for flux analysis [61].

3.4 Analyzing Intrinsically Disordered Proteins

Intrinsically Disordered Proteins (IDPs) are proteins that do not have a stable 3D structure. They bind to nucleic acids or other proteins to carry out their functions. IDPs are dynamic ensembles that continuously change their internal conformation with high structural heterogeneity [64, 65]. Nevertheless, IDPs are responsible for several cellular functions and are involved in many diseases such as diabetes, cancer, neurodegenerative diseases, and cardiovascular diseases [66,67,68,69]. While interacting with partners, IDPs undergo coupled binding and folding reactions, which are essential for their function. As with ligand binding, induced fit, conformational selection, and a combination of both models are used to study IDPs. However, the kinetics of the binding-folding reaction, specifically how the conformations with and without a partner interconvert, requires detailed investigation [70, 71]. Here, MD simulation can provide a way to analyze IDP folding at the atomistic level. To achieve this, MD simulations of IDPs should be performed so that the whole binding-folding pathway can be analyzed. Such simulation trajectories are complex to study; however, MSM techniques can help to identify the metastable states along the pathway and the transition probabilities between them [72, 73].

3.5 Native State Conformation Changes

Generally, rational structure-based drug design does not take protein conformational changes into account. Approximately 15% of proteins have deep active sites related to their activity [74]. Hence, conformational heterogeneity is essential for understanding protein behavior. It can reveal novel active sites or transient sites that are allosteric or that can be used to block protein–protein interactions [75,76,77,78]. Since MD simulation captures the system’s dynamical behavior, coupling it with MSM can provide an ensemble of metastable states at equilibrium. The ability of MSM to capture kinetic and thermodynamic properties also makes it a viable option for identifying transient active sites. There are several examples where similar approaches have been used to find cryptic pockets and allosteric sites. In one such study of TEM-1 β-lactamase, several such allosteric sites were observed [79]. Such studies could also be performed on novel proteins to find active or allosteric sites.

4 Summary

Advancements in computing, such as parallel programming and GPUs, have made MD simulation more achievable. However, analyzing the simulation data remains challenging. MSM is based on a finite set of states and uses clustering methods to define them. Before MSM, purely geometric clustering was used, whereas MSM provides kinetically defined metastable states. An MSM is a coarse-graining of a system’s dynamics that reflects the underlying free energy landscape governing the system’s structure and dynamics. Identifying states in a kinetically relevant way and effectively using the state decomposition to construct a transition matrix are the two main issues in creating an MSM. To build the model, a traditional geometric clustering method is first used to define microstates. These microstates are then used to build a transition matrix, from which kinetically related microstates are identified, and this information is used to build the MSM. Adaptive sampling can be used to improve the model further. Validation can be done by Bayesian model selection, the Swope–Pitera eigenvalue test, and other such tests (Fig. 4).

Fig. 4
An illustrated flow diagram has the following steps. 1. peptide, protein, complex protein, intrinsically disordered protein, and membrane protein, 2. molecular dynamic, 3. microstates, 4. transition probability matrix, and 5. significant conformational states.

Various applications which could be studied using Markov State Model

Protein folding and the dynamics of the native 3D structure are critical biological phenomena [80]. MD simulation can provide a way to understand these processes through simulations reaching the millisecond timescale [81,82,83]; however, analyzing such data requires sophisticated protocols and methods [84, 85]. MSM provides a convenient and interpretable solution [86]. With current advances in computational power and algorithms, the use of MSM has increased and will continue to grow. This technique can also be used to analyze complicated systems such as membrane proteins, peptide folding, IDPs, and other biological systems; hence, it is emerging as a critical in silico approach.