Introduction

The human proteome consists of more than 20,000 proteins with diverse sizes, compositions, structures, and functions. Almost every cellular activity depends on the conformation and concentration of proteins. Normal cellular processes are tightly coupled to proteostasis, which involves the synthesis, folding, trafficking, and degradation of proteins. Internal or external perturbations that disturb proteostasis can lead to loss of protein function, to changes in protein turnover rate and protein concentration, and potentially to undesirable consequences such as deposition of protein aggregates in the affected tissues and organs. Protein aggregation, which can be fibrillar or amorphous, has been studied for several decades (Astbury et al. 1935; Green and Hughes 1955; Kyle and Bayrd 1975) from both physicochemical and pathological perspectives. The recent surge of interest in protein aggregation is attributed to two factors: the association of cross-β steric zipper–rich aggregates with human proteinopathies and the development of protein-based therapeutic molecules.

Protein and peptide therapeutics are promising classes of medicines with vast and growing clinical applications. Protein therapeutics include monoclonal antibodies (mAb), hormones, vaccines, enzymes, growth factors, fusion proteins, and so on (Leader et al. 2008; Kintzing et al. 2016; Lagassé et al. 2017; Usmani et al. 2017). The manufacturing of these biotherapeutics is tedious and often complicated by protein instability and aggregation. As a result, developability assessments and optimization to increase protein stability and solubility as well as decrease viscosity and aggregation have become a critical step in biotherapeutic drug discovery and development (Wang et al. 2009; Zurdo 2013; Li et al. 2016; Jain et al. 2017). The aggregation propensity of a biopharmaceutical can potentially affect its solubility and the viscosity of its liquid formulations. Low solubility and high viscosity often translate to difficulties in drug delivery, manufacturing, and storage (Roberts 2014). In order to assess and improve the developability, several in silico approaches have been developed for designing and optimizing therapeutic proteins and peptides (Nichols et al. 2015; Agrawal et al. 2016).

Neurodegenerative diseases such as Parkinson’s disease, Alzheimer’s disease, and prion diseases are characterized by progressive nervous system dysfunction (Iadanza et al. 2018). Despite diverse risk factors such as ageing, environmental factors, and genetic mutations, the accumulation of intracellular and extracellular proteinaceous deposits is considered the key factor. Amyloidosis refers to a group of diseases associated with the deposition of amyloid fibrils leading to pathogenesis (Benson et al. 2018; Ke et al. 2020). Although the word ‘amyloid’ was derived from the Latin word ‘amylum’ for starch, amyloid deposits are made up of long, 50- to 200-Å-wide, β-sheet–rich protein fibrils (Kyle and Bayrd 1975). The characteristics and location of these deposits and the symptoms associated with the disease vary depending on the protein involved in the amyloidosis (Gertz 2018). For example, amyloid light-chain (AL) amyloidosis, one of the common amyloidoses, involves deposition of immunoglobulin light chains in the kidney and heart (Dogan 2017).

Apart from their role in human pathology and ageing, protein aggregates have also been found to play functional roles in several organisms; for example, Curlin in E. coli mediates host interaction and Ure2p in Saccharomyces cerevisiae regulates nitrogen intake (Chiti and Dobson 2006). The ability of a protein to form amyloid fibrils is attributed to the aggregation-prone regions (APRs) in the protein sequence (Ventura et al. 2004; Esteras-Chopo et al. 2005). These APRs mediate intermolecular self-interactions leading to cross-β steric zipper formation, which forms the stable core of fibrillar macromolecular structures found in amyloid deposits (Nelson et al. 2005; Sawaya et al. 2007).

Databases for protein aggregation

The exponential increase in experimental data on protein aggregation in the last few years has made it necessary to store and curate this information. Currently, there are several databases available to assist the scientific community (Table 1). These specialized protein aggregation–related databases contain comprehensive, extended knowledge from literature. Fibril_one (Siepen and Westhead 2002) was the first amyloidogenic protein database containing 250 mutations and 50 experimental conditions associated with 22 proteins. López de la Paz and Serrano (2004) curated the amyloidogenic peptides by systematically mutating the residues of the amyloidogenic STVIIE peptide. The dataset was extended with the inclusion of peptides from insulin, β2-microglobulin, amylin, tau protein, etc. (Thompson et al. 2006). Goldschmidt et al. (2010) predicted the aggregation profile of 76 genomes and created the ZipperDB database. WALTZ-DB (Beerten et al. 2015) is a collection of experimentally known amyloid-forming hexapeptides, characterized using electron microscopy, dye binding, and Fourier transform infrared spectroscopy. WALTZ-DB was recently updated to WALTZ-DB 2.0 (Louros et al. 2020) by expanding the hexapeptide sequence dataset and adding new structural information. Angarica et al. (2014) developed the database PrionScan for predicted prion-like domains in complete proteomes. Around the same time, Shobana and Pandaranayaka (2014) constructed the integrated database ProADD for the diseases caused by protein aggregation along with the proteins involved in aggregation. The AmyLoad (Wozniak and Kotulska 2015) database compiled amyloidogenic and non-amyloidogenic sequence fragments from various sources (Conchillo-Solé et al. 2007; Fernandez-Escamilla et al. 2004; Goldschmidt et al. 2010) as well as from literature. AmyPro (Varadi et al. 2018) is a recently developed comprehensive database on precursor proteins and their aggregation-prone regions.

Table 1 Protein aggregation databases

Thangakani et al. (2016) developed the comprehensive database CPAD on experimentally verified aggregating proteins, aggregation-prone regions of different lengths, and aggregation kinetics. CPAD has been updated recently to CPAD 2.0 and includes a new category of aggregation-related structures (Rawat et al. 2020a). AmyloBase (Belli et al. 2011) was the first resource for aggregation kinetics experiments. AMYPdb (Pawlicki et al. 2008) curates the structural information of amyloidogenic proteins and currently has 1200 structures from 31 amyloidogenic protein families. The protein families are clustered based on amyloidogenic sequence patterns. PDB_Amyloid (Takács et al. 2019) is a recently developed database containing a list of amyloid structures and globular structure entries with an amyloid-like substructure. AL-Base (Bodi et al. 2009) is a curated database of antibody light-chain sequences derived from patients with light-chain (AL) amyloidosis.

In silico methods and tools for protein aggregation

Over the last few decades, several computational techniques and tools have been developed to address protein aggregation. These tools and techniques can be broadly classified into three classes: (a) APR and aggregation propensity prediction, (b) aggregation kinetics prediction, and (c) molecular simulation techniques.

Aggregation-prone region and aggregation propensity prediction

Amyloid fibrils are composed of a cross-β steric zipper motif (Fig. 1), which consists of stacked β-strands with interdigitating side chains and axially oriented backbone hydrogen bonds (Sunde and Blake 1997). Nucleation of the cross-β steric zipper motif (Sawaya et al. 2007) and its assembly into amyloid fibrils depend on several extrinsic factors such as temperature, pH, and protein and ionic concentration as well as on intrinsic ones such as amino acid composition and sequence patterning of short aggregation-prone regions (APRs), which are typically 5–15 residues long (Chiti et al. 2002a). Interestingly, in a recent study, Yagi-Utsumi et al. (2020) demonstrated the effect of gravity on amyloid fibril morphology and fibrillation kinetics of amyloid β through experiments under microgravity conditions. Further, mutations in APRs and their flanking residues alter protein aggregation propensity, aggregation kinetics, and also the morphologies of the resulting aggregates (Sipe and Cohen 2000; López de la Paz and Serrano 2004). Insertion of such aggregating peptides in globular proteins triggers aggregation (Ventura et al. 2004). For example, Fig. 2 shows the cryo-EM structure of an amyloid fibril formed by the lambda light chain (AL55) of the IGLV6-57 germline gene (Swuec et al. 2019). The fibrils were isolated from deposits of amyloid light-chain (AL) cardiac amyloidosis patients through autopsy. The APRs identified by several APR prediction tools cluster in the sequence region 17–38. The ability of the APRs to dictate the fates of even large proteins has attracted considerable research efforts.

Fig. 1 Structural model of an amyloid fibril: a) amyloid fibril of an 11-residue fragment (125–135) of transthyretin protein (PDB: 3ZPK, UniProt ID: P02767), b) a protofibril, c) intersheet steric zipper formation, and d) the intersheet hydrogen bonds along the fibril axis

Fig. 2 Structure and predicted APRs in lambda light chain (AL55): a structure of AL55 modelled using ABodyBuilder (Leem et al. 2016), b cryo-EM structure of AL55 amyloid protofibril (PDB: 6HUD), and c residue contacts in the fibril. The atoms constituting APRs (17–38) predicted using WALTZ, PASTA2, ANuPP, FishAmyloid, and MetAmyl (consensus) are highlighted as spheres. ChimeraX was used for visualization (Goddard et al. 2018)

Several computational methods have been developed to identify the aggregation-prone regions (APRs) in proteins and peptides and to predict protein aggregation propensity (Meric et al. 2017) (Fig. 3). These methods can be classified based on their application and the features used for prediction (Table 2). Broadly, these methods can be divided into sequence- and structure-based methods depending on the input data required for the prediction.

Fig. 3 Various APR and aggregation propensity prediction tools

Sequence-based approaches to predict protein aggregation

Sequence-based approaches to predict aggregation in proteins rely on features such as amino acid physicochemical properties, sequence patterns, statistically derived propensity values, knowledge-based scoring functions, secondary structure propensities, residue-residue contact potentials, and threading. Pattern matching is the simplest of all the sequence approaches. López de la Paz and Serrano (2004) carried out experiments on the peptide STVIIE through positional scanning mutagenesis and identified sequence patterns of hexapeptides that form amyloid-like fibrils in vitro. However, the patterns are specific to certain hexapeptides and do not cover several new amyloid-forming peptides that were reported over the years.

Amino acid properties

Amino acid properties such as hydrophobicity, size, surface area, charge, aromaticity, contact frequency, β-sheet propensity, and several other physicochemical properties are used for the identification of aggregation-prone regions. AGGRESCAN uses the amino acid aggregation-propensity scale derived from in vivo experiments on amyloidogenic proteins (Conchillo-Solé et al. 2007). These propensity values are used to identify the aggregation-prone regions through a sliding-window approach. Zyggregator uses amino acid scales for α-helix and β-sheet formation, charge, and hydrophobicity, as well as hydrophobic patterns and the presence of gatekeeper residues (Tartaglia and Vendruscolo 2008). WALTZ uses a hybrid approach, combining a position-specific scoring matrix derived from amyloidogenic peptides, amino acid physicochemical properties, and position-specific pseudoenergy values obtained from modelled structures (Maurer-Stroh et al. 2010). ANuPP is an ensemble classifier that consists of nine logistic regression models trained independently on groups of amyloidogenic peptides to address the diversity in aggregation nucleation, propagation, and fibrillation processes. ANuPP uses atom composition as features to represent sequence segments (Prabakaran et al. 2020).
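
The sliding-window idea behind scale-based predictors can be sketched as follows. This is a minimal illustration and not the AGGRESCAN implementation: the Kyte-Doolittle hydropathy scale stands in for an aggregation-propensity scale, and the window size (5), the threshold (2.0), the function names, and the example sequence are all assumptions made for the example.

```python
# Illustrative sliding-window scan over a per-residue scale.
# Assumption: the Kyte-Doolittle hydropathy scale is only a stand-in;
# real predictors use their own experimentally derived propensity scales.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def window_scores(sequence, scale=KYTE_DOOLITTLE, window=5):
    """Return the mean scale value of every window, indexed by window start."""
    values = [scale[res] for res in sequence]
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]


def candidate_regions(sequence, threshold=2.0, window=5, scale=KYTE_DOOLITTLE):
    """Merge windows scoring at or above the threshold into (start, end) regions.

    Positions are 0-based and the end index is exclusive.
    """
    regions = []
    for start, score in enumerate(window_scores(sequence, scale, window)):
        if score < threshold:
            continue
        end = start + window
        if regions and start <= regions[-1][1]:
            # Overlapping or adjacent window: extend the previous region.
            regions[-1] = (regions[-1][0], max(regions[-1][1], end))
        else:
            regions.append((start, end))
    return regions


if __name__ == "__main__":
    seq = "DKSTVIIEDKGS"  # hypothetical sequence containing the STVIIE motif
    for start, end in candidate_regions(seq):
        print(start, end, seq[start:end])
```

Published tools differ mainly in the scale, the window length, and the way neighbouring windows are smoothed and thresholded, so the numbers here should not be read as recommended settings.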

Secondary structure preference

The propensity to form β-sheet is one of the key features of amyloid-fibril-forming peptides and proteins, and this has been extensively used in developing prediction algorithms. TANGO uses various empirically and statistically derived potential functions to estimate the probability of a segment to form β-strand-mediated aggregates (Fernandez-Escamilla et al. 2004). In theory, TANGO compares the probability of a segment to be in various secondary structural states such as α-helix, β-sheet, coil, and turn. Conformational switch from other secondary states to β-sheet formation is the principle behind SecStr and NetCSSP (Hamodrakas et al. 2007; Kim et al. 2009).

Residue-pair occurrence and contact preference

The cross-β spine made of steric zippers is a common feature of all amyloid fibrils. These steric zippers consist of interdigitating side chains and axial hydrogen bonds and strengthen the supramolecular structure of amyloid fibrils. In addition, the stacking of aromatic residues and ladders of hydrogen bonds formed by Asn, Gln, Thr, and Ser residues further stabilizes the structure. These residue-residue interactions are seen as crucial for an APR and are thus used in various prediction methods. GAP uses position-specific residue-pair energy potential derived from amyloid and amorphous β-aggregating hexapeptide sequences to identify amyloidogenic peptides (Thangakani et al. 2014). PASTA2 and BETASCAN use residue-residue probabilities and scoring functions for β-sheet hydrogen bond formation and contact derived from protein structure databases. Apart from predicting the APR stretch on the protein sequence, these approaches can predict the β-strand orientation and pairing between residues.

Threading and forcefield

Thompson et al. (2006) used the crystal structure of the cross-β spine of peptide NNQQNY to identify APR segments in amyloidogenic proteins. Each hexapeptide from a given protein sequence was mapped onto an ensemble of steric zipper templates and scored subsequently. The assumption behind this 3D profile method is the conserved cross-β motif in amyloid fibrils of diverse proteins. Apart from identifying amyloidogenic peptides and regions in protein sequence, the approach is capable of predicting orientation between strands forming the zipper.

Structure-based approaches

Structure-based methods such as SAP, developability index, AGGRESCAN3D, and Aggscore require protein structure as input (Chennamsetty et al. 2009; Lauer et al. 2012; Zambrano et al. 2015; Sankar et al. 2018). These methods predominantly take account of the solvent accessibility of protein residues and atoms to estimate surface hydrophobicity. In addition to the static structure, short molecular dynamics (MD) simulations are performed to calculate ensemble statistics over time. Unlike sequence-based methods, structure-based methods account for the folding and native state of a protein. At the same time, the limited timescale of MD simulations biases the prediction towards a single protein structure in its native state. This approach may not hold for highly dynamic proteins with multiple metastable states and disordered regions.

Comparison of APR prediction tools

Table 2 lists the various APR and aggregation propensity prediction tools that have been published to date. We selected ten prediction tools from the literature that were accessible as a webserver or stand-alone application and easily applicable to large datasets for a comparative evaluation. Table 3 lists the performance of these APR prediction tools in distinguishing between APR and non-APR segments in a dataset of 37 amyloidogenic proteins. The dataset was extracted from AmyPro database (Varadi et al. 2018) at 40% sequence identity cut-off. Segment OVerlap (SOV) scores: SOVAPR, SOVnon-APR, SOVaverage, and SOVoverall are used for evaluation (Zemla et al. 1999). Similar to secondary structure assessments, SOV scores the prediction performance based on the overlap between the predicted and actual segments instead of residue-wise comparison. Overall, the consensus methods, Amylpred2 and MetAmyl, showed better performance over other methods. ANuPP and TANGO scored better than other methods with SOVoverall of 50.2 and 48.1, respectively. Though several tools showed a good overall score, they exhibited an imbalance between SOVAPR and SOVnon-APR. Similar assessments were performed based on a dataset of 142 amyloid-like fibril-forming hexapeptides from WALTZ-DB 2.0 (Louros et al. 2020). In spite of the imbalance in sensitivity and specificity, ANuPP and AGGRESCAN scored better than other methods (data not included). These results highlight the need for more robust methods to identify APRs accurately.
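
The segment-overlap idea can be illustrated with a simplified single-state sketch. The code below follows the commonly cited form of the SOV definition (overlap length plus a tolerance term, divided by the total extent of the two segments and weighted by the reference segment length). It is an illustration under those assumptions and not the evaluation script behind Table 3; the function names and the '1'/'0' string encoding of APR and non-APR residues are hypothetical.

```python
def segments(labels, state="1"):
    """Return (start, end) pairs (end exclusive) of maximal runs of `state`."""
    runs, start = [], None
    for i, label in enumerate(labels):
        if label == state and start is None:
            start = i
        elif label != state and start is not None:
            runs.append((start, i))
            start = None
    if start is not None:
        runs.append((start, len(labels)))
    return runs


def sov_single_state(observed, predicted, state="1"):
    """Simplified segment-overlap score (0-100) for a single state."""
    obs_segments = segments(observed, state)
    pred_segments = segments(predicted, state)
    total, norm = 0.0, 0
    for o_start, o_end in obs_segments:
        o_len = o_end - o_start
        overlapping = [
            (p_start, p_end)
            for p_start, p_end in pred_segments
            if p_start < o_end and o_start < p_end
        ]
        if not overlapping:
            # Observed segment missed entirely: it only enlarges the normalizer.
            norm += o_len
            continue
        for p_start, p_end in overlapping:
            p_len = p_end - p_start
            minov = min(o_end, p_end) - max(o_start, p_start)  # actual overlap
            maxov = max(o_end, p_end) - min(o_start, p_start)  # total extent
            # Tolerance term that forgives small boundary deviations.
            delta = min(maxov - minov, minov, o_len // 2, p_len // 2)
            total += (minov + delta) / maxov * o_len
            norm += o_len
    return 100.0 * total / norm if norm else 0.0


if __name__ == "__main__":
    observed = "0011110000"
    shifted = "0001111000"   # same segment length, shifted by one residue
    missed = "1100000000"    # no overlap with the observed segment
    print(sov_single_state(observed, shifted))
    print(sov_single_state(observed, missed))
```

In this sketch a predicted segment shifted by one residue still receives the full score because the tolerance term absorbs the boundary error, whereas a residue-wise comparison would penalize both ends of the segment.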

Table 2 Summary of available methods to predict APR and aggregation propensity of proteins
Table 3 Performance of APR identification algorithms and tools

Aggregation kinetics prediction tools

Aggregation kinetics measure how fast/slow the proteins will aggregate under given experimental conditions. The aggregation mechanisms, curve fitting, and experiments related to aggregation kinetics have been reviewed earlier in the literature (Morris et al. 2009; Hirota et al. 2019). In this section, we have focused on the role of biophysical features and experimental conditions in determining aggregation kinetics.

The detailed in vitro analysis of mutants of acylphosphatase (AcP) revealed the role of charge, hydrophobicity, and secondary structure propensity towards altering aggregation kinetics (Chiti et al. 2002a; Chiti et al. 2002b). Further, they derived the first empirical equation to predict the change in aggregation kinetics upon point mutation using the physicochemical features of proteins (Chiti et al. 2003). Subsequently, several studies on different amyloidogenic proteins analysed the role of physicochemical properties on protein aggregation such as hydrophobicity (Calamai et al. 2003; Fink 1998; Hilbich et al. 1992), β-strand propensity (Tartaglia et al. 2004; Família et al. 2015; Tjernberg et al. 2002; Fernandez-Escamilla et al. 2004), polarity (Tartaglia et al. 2004; Polanco et al. 2015), charge (Tartaglia et al. 2004; Calamai et al. 2003; Tjernberg et al. 2002), aromaticity (Tartaglia et al. 2004; Azriel and Gazit 2001; Gazit 2002), and stability (Fink 1998; Ramírez-Alvarado et al. 2000; Brito et al. 2003).

The aggregation kinetics assays have shown that the rate of aggregation is sensitive to even a small change in experimental conditions such as protein or buffer concentration, pH, temperature, ionic concentration, seeding, or agitation (Brudar and Hribar-Lee 2019; Hortschansky et al. 2005; Morel et al. 2010; Ow and Dunstan 2013). Currently, there are few in silico methods available to predict the absolute aggregation rate or change in aggregation rate upon point mutation (Table 4) as discussed below.

Table 4 Summary of available methods to predict aggregation kinetics

Methods to predict change in aggregation rate upon point mutation

Chiti et al. (2003) first proposed a mathematical equation (Eq. 1) to predict the change in aggregation rate using intrinsic protein sequence features, which include the change in the hydrophobicity of the polypeptide chain (ΔHydr.), the propensity to convert from α-helical to β-sheet structure (\( {\Delta \Delta G}_{\mathrm{coil}-\upalpha}+{\Delta \Delta G}_{\upbeta -\mathrm{coil}} \)), and the change in overall charge (ΔCharge).

$$ \ln \left(\frac{\upsilon_{\mathrm{mut}}}{\upsilon_{\mathrm{wt}}}\right)=A\Delta \mathrm{Hydr}.+B\left({\Delta \Delta G}_{\mathrm{coil}-\upalpha}+{\Delta \Delta G}_{\upbeta -\mathrm{coil}}\right)+C\Delta \mathrm{Charge} $$
(1)

where A, B, and C in the above equation are constants, which are estimated by fitting the equation to the experimentally observed change in the aggregation rate. The model achieved a correlation of 0.85 on a set of 27 mutations found in short peptides or natively unfolded proteins, including amylin, amyloid β-peptide, tau, and α-synuclein. This model has some limitations, including (i) a small dataset size, (ii) inability to predict aggregation kinetics for mutations involving proline residues due to undefined values for the change in β-sheet propensity (\( {\Delta \Delta G}_{\upbeta -\mathrm{coil}} \)), and (iii) inability to predict the aggregation kinetics for residues for which the α-helical propensity is predicted to be zero by the AGADIR server (Muñoz and Serrano 1994).
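
As a minimal sketch of how Eq. 1 is evaluated for a single mutation, the function below takes the three feature changes and the constants A, B, and C as arguments. The coefficient values used in the example are placeholders chosen for readability, not the fitted constants of Chiti et al. (2003), and the function names are hypothetical.

```python
import math


def ln_rate_change(d_hydr, ddg_coil_alpha, ddg_beta_coil, d_charge, a, b, c):
    """Evaluate Eq. 1, i.e. ln(v_mut / v_wt), for one point mutation."""
    return a * d_hydr + b * (ddg_coil_alpha + ddg_beta_coil) + c * d_charge


def fold_change(ln_ratio):
    """Convert ln(v_mut / v_wt) into the ratio v_mut / v_wt."""
    return math.exp(ln_ratio)


if __name__ == "__main__":
    # Placeholder coefficients (assumption); real values come from a fit
    # to experimentally measured aggregation rate changes.
    A, B, C = 1.0, 0.5, -1.0
    ln_ratio = ln_rate_change(
        d_hydr=1.0, ddg_coil_alpha=0.5, ddg_beta_coil=0.5, d_charge=1.0,
        a=A, b=B, c=C,
    )
    print(ln_ratio, fold_change(ln_ratio))
```

A positive value of the logarithm corresponds to a mutation predicted to accelerate aggregation and a negative value to one predicted to slow it down.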

Rawat et al. (2018) developed the method AggreRATE-Disc (Discrimination of Aggregation Rate change Upon Mutation), which uses sequence-based features and machine learning to classify mutations as aggregation rate enhancers or mitigators. It was developed using a support vector machine (SVM)–based classifier on 220 point mutations from 25 proteins. The model grouped the mutations based on the local secondary structure conformation at the mutation site (helix, strand, and coil) and achieved an average prediction accuracy of ~ 82% using leave-one-out cross-validation. AggreRATE-Disc identified a unique set of sequence-based features that influence the aggregation rate in each mutation site conformational class. For example, changes in protein stability and flexibility in the helical region influence the rate of aggregation. Similarly, the aggregation rate is mainly affected by charge, polarity, and β-strand propensity when the mutations fall in the β-strand regions. For other mutation sites falling under the coil category, such as bends, turns, and disordered regions, aggregation rates are affected by both helical tendency and aggregation propensity.

The AggreRATE-Pred model (Rawat et al. 2020b) was an improvement over AggreRATE-Disc, which included structure-based features to predict the quantitative change in aggregation rate. The statistical model was developed by combining four different regression equations, which were generated by classifying the data based on polypeptide length and local secondary structure conformation at the mutation site and fitting a regression equation to each class. The dataset of 183 point mutations in 23 amyloidogenic proteins was primarily divided into two groups: (i) short peptides (length < 40 residues) and (ii) long polypeptides and proteins (length ≥ 40 residues). The long polypeptide and protein dataset was further classified into helix, strand, and coil classes based on local secondary structure conformation, similar to the previous study (Rawat et al. 2018). The statistical model achieved an average correlation coefficient of ~ 0.82 and an average MAE of ~ 0.43 on the training dataset. The regression analysis showed the importance of local structural context, thermodynamic stability changes, and the effect of neighbouring residues at the mutation site.
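
The class-based design described above (one class for short peptides and three secondary structure classes for longer chains) can be sketched as a simple dispatcher. Only the 40-residue cut-off and the helix, strand, and coil classes are taken from the text; the function names, the feature name, and the coefficients are hypothetical placeholders and do not reflect AggreRATE-Pred internals.

```python
SHORT_PEPTIDE_CUTOFF = 40  # residues; shorter chains form their own class
STRUCTURE_CLASSES = {"helix", "strand", "coil"}


def regression_class(length, secondary_structure):
    """Return which of the four regression classes applies to a mutation."""
    if length < SHORT_PEPTIDE_CUTOFF:
        return "short_peptide"
    if secondary_structure not in STRUCTURE_CLASSES:
        raise ValueError(f"unknown secondary structure: {secondary_structure!r}")
    return secondary_structure


def predict_change(length, secondary_structure, features, models):
    """Apply the linear model of the matching class.

    `models` maps a class name to (intercept, {feature_name: coefficient}).
    """
    intercept, coefficients = models[regression_class(length, secondary_structure)]
    return intercept + sum(
        coefficients[name] * features[name] for name in coefficients
    )


if __name__ == "__main__":
    # Placeholder model for one class only (assumption, for illustration).
    models = {"strand": (0.5, {"d_charge": -1.0})}
    print(regression_class(25, "helix"))
    print(predict_change(120, "strand", {"d_charge": 1.0}, models))
```

Keeping the class selection separate from the regression makes it explicit that a mutation in a short peptide is never routed through the helix, strand, or coil equations.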

Methods to predict the absolute aggregation rate

DuBay et al. (2004) improved Eq. 1 to predict the absolute aggregation rate of polypeptides using intrinsic features, such as hydrophobicity (\( {I}^{\mathrm{hydr}} \)), alternating hydrophobic-hydrophilic residue pattern (\( {I}^{\mathrm{pat}} \)), and absolute value of net charge (\( {I}^{\mathrm{ch}} \)), and extrinsic features, namely pH (\( {E}^{\mathrm{pH}} \)), ionic strength (\( {E}^{\mathrm{ionic}} \)), and polypeptide concentration (\( {E}^{\mathrm{conc}} \)). The mathematical formula for the prediction of the absolute rate of aggregation is given in Eq. 2.

$$ \log (k)={\alpha}_0+{\alpha}_{\mathrm{hydr}}{I}^{\mathrm{hydr}}+{\alpha}_{\mathrm{pat}}{I}^{\mathrm{pat}}+{\alpha}_{\mathrm{ch}}{I}^{\mathrm{ch}}+{\alpha}_{\mathrm{pH}}{E}^{\mathrm{pH}}+{\alpha}_{\mathrm{ionic}}{E}^{\mathrm{ionic}}+{\alpha}_{\mathrm{conc}}{E}^{\mathrm{conc}} $$
(2)

where log(k) is the logarithm in base 10 of the aggregation rate (k) in units of \( {\mathrm{s}}^{-1} \). The α values are constants estimated by fitting the equation to 79 experimental data points. The model achieved a correlation of 0.92 on this training dataset of proteins and peptides. However, the model was prone to bias due to the limited availability of aggregation rates: 59 out of 79 data points were point mutation variants of the acylphosphatase (AcP) protein.
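
A minimal sketch of evaluating Eq. 2 is given below. The dictionary keys mirror the superscripts in the equation; the intercept, coefficients, and feature values are placeholders chosen for the example and are not the fitted parameters of DuBay et al. (2004).

```python
TERMS = ("hydr", "pat", "ch", "pH", "ionic", "conc")


def log10_rate(features, coefficients, intercept):
    """Evaluate Eq. 2: log10(k) as a linear combination of six factors."""
    return intercept + sum(coefficients[t] * features[t] for t in TERMS)


def rate(features, coefficients, intercept):
    """Return k in the same units as the fitted rates."""
    return 10 ** log10_rate(features, coefficients, intercept)


if __name__ == "__main__":
    # Placeholder parameters (assumption), for illustration only.
    coefficients = {"hydr": 0.5, "pat": 0.25, "ch": -0.5,
                    "pH": 0.0, "ionic": 0.25, "conc": 1.0}
    features = {"hydr": 2.0, "pat": 1.0, "ch": 1.0,
                "pH": 7.0, "ionic": 0.0, "conc": 1.0}
    print(log10_rate(features, coefficients, intercept=-4.0))
    print(rate(features, coefficients, intercept=-4.0))
```

Because the model is linear in log(k), a change in one factor multiplies the predicted rate by a constant factor regardless of the values of the other terms.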

Tartaglia et al. (2005) proposed a sequence-based algorithm to predict the aggregation rate and aggregation-prone regions in protein/polypeptide sequences. The aggregation propensity (πil) of the sequence is calculated using position-dependent factors (Φil) and composition-dependent factors (φil).

$$ {\pi}_{il}={\varPhi}_{il}{\varphi}_{il} $$
(3)

where l is the length of the segment starting at the position i in the sequence. The position-dependent factors include aromaticity (Ail), β-propensity (Bil), and charge (Cil).

$$ {\varPhi}_{il}={e}^{A_{il}+{B}_{il}+{C}_{il}} $$
(4)

The amino acid composition–dependent factors include the side-chain-accessible surface area of apolar (\( {S}_j^{\mathrm{a}} \)), polar (\( {S}_j^{\mathrm{p}} \)), and all residues (\( {S}_j^{\mathrm{t}} \)); solubility (\( {\sigma}_j \)); and parallel (\( {\theta}_{\uparrow \uparrow } \)) and antiparallel (\( {\theta}_{\uparrow \downarrow } \)) tendency to aggregate. The hatted values are averages over the 20 standard amino acids, and the product runs over the l residues of the segment starting at position i.

$$ {\varphi}_{il}={\left[\prod \limits_{j=i}^{i+l-1}\left(\frac{S_j^{\mathrm{a}}}{{\hat{S}}^{\mathrm{a}}}{\theta}_{\uparrow \uparrow }+\frac{S_j^{\mathrm{p}}}{{\hat{S}}^{\mathrm{p}}}{\theta}_{\uparrow \downarrow}\right)\frac{{\hat{S}}^{\mathrm{t}}}{S_j^{\mathrm{t}}}\frac{\hat{\sigma}}{\sigma_j}\right]}^{1/l} $$
(5)

The model predicts the aggregation rate from the aggregation propensity (Eq. 3) by including a function for experimental conditions (α(c, T), which takes account of protein concentration and temperature) as given in the following formula (Eq. 6):

$$ {v}_{il}=\alpha \left(c,T\right){\pi}_{il} $$
(6)

The model achieved a correlation of 0.95 with 90 data points. However, this model also suffers from the limited availability of data and from bias within the dataset.
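
The way Eqs. 3 to 6 combine can be sketched in a few lines. The code assumes that the product in the composition-dependent factor runs over the l residues of the segment, which is consistent with the 1/l exponent (a geometric mean). All input values, dictionary keys, and function names are hypothetical; the actual per-residue parameters of Tartaglia et al. (2005) are not reproduced here.

```python
import math


def position_factor(a_il, b_il, c_il):
    """Eq. 4: exponential of aromaticity, beta-propensity and charge terms."""
    return math.exp(a_il + b_il + c_il)


def composition_factor(residues, means, theta_parallel, theta_antiparallel):
    """Eq. 5: geometric mean of per-residue terms over one segment.

    `residues` is a list of dicts with keys 's_apolar', 's_polar',
    's_total' and 'solubility'; `means` holds the averages over the
    20 standard amino acids under the same keys.
    """
    product = 1.0
    for res in residues:
        orientation = (
            res["s_apolar"] / means["s_apolar"] * theta_parallel
            + res["s_polar"] / means["s_polar"] * theta_antiparallel
        )
        product *= (
            orientation
            * means["s_total"] / res["s_total"]
            * means["solubility"] / res["solubility"]
        )
    return product ** (1.0 / len(residues))


def aggregation_propensity(phi_position, phi_composition):
    """Eq. 3: product of the position- and composition-dependent factors."""
    return phi_position * phi_composition


def aggregation_rate(alpha, propensity):
    """Eq. 6: alpha(c, T) is supplied by the caller as a precomputed number."""
    return alpha * propensity


if __name__ == "__main__":
    # Hypothetical inputs: every residue equals the amino acid average.
    means = {"s_apolar": 1.0, "s_polar": 1.0, "s_total": 2.0, "solubility": 1.0}
    segment = [dict(means) for _ in range(6)]
    phi = composition_factor(segment, means, 0.5, 0.5)
    pi = aggregation_propensity(position_factor(0.0, 0.0, 0.0), phi)
    print(pi, aggregation_rate(2.0, pi))
```

Using a geometric mean keeps the composition-dependent factor comparable between segments of different lengths, since each residue contributes one multiplicative term.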

Yang et al. (2019) proposed a feedforward fully connected neural network (FCN)–based machine learning model for predicting the absolute aggregation rates. The model is trained on a dataset of 21 amyloidogenic proteins (140 data points) using 16 intrinsic sequence-based features and 4 extrinsic features. The model focuses on the inclusion of more experimental conditions and treats each condition as a separate data point in the prediction model. Although the model showed an average prediction accuracy of more than 90% on the training dataset, it appears to be overfitted, as it employs 16 intrinsic sequence-based features to essentially predict 21 sequence variants of amyloidogenic proteins.

AbsoluRATE (Rawat et al., manuscript under review) is a support vector machine (SVM)–based regression model to predict absolute rates of protein and peptide aggregation. The model, trained on 82 non-redundant proteins/peptides, achieved a correlation coefficient of 0.72 with an MAE of 0.91 (natural log of kapp, where kapp is in \( {\mathrm{h}}^{-1} \)) using leave-one-out cross-validation. The model accounts for sequence-based features (such as features derived from APR prediction servers, disorder, polarity, β-sheet propensity, etc.) and extrinsic features (such as temperature, pH, and ionic and protein concentration).

Comparison of aggregation kinetics prediction methods

The limited availability of experimental aggregation kinetics data is a major obstacle to the development of accurate computational models. Hence, to benchmark the above kinetic models, we randomly selected a test set from the AggreRATE-Pred dataset (Rawat et al. 2020b) in such a way that it (i) includes all structural classes (10% of the training dataset in the respective class) and (ii) has predictable aggregation rates for all the participating models (Table 5). For a fair comparison, we retrained AggreRATE-Pred after removing the test set data points and achieved a correlation of 0.81 on the training dataset (original correlation r = 0.82). The retrained model was then evaluated on the held-out test set, as shown in Table 5. The absolute aggregation rate prediction models were tested by taking the difference between the predicted aggregation rates of the mutant and wild-type protein sequences. The correlations obtained by the absolute aggregation rate prediction models were expectedly low, with the highest correlation of 0.41 obtained by Tartaglia’s model (Tartaglia et al. 2005). Chiti’s and DuBay’s models are almost two-decade-old models trained on the minimal datasets available at that time, which is also reflected in the prediction performance of these methods on the test dataset. AggreRATE-Pred, a structure-based method, showed the highest correlation among all models. AggreRATE-Disc is a sequence-based method that cannot predict the quantitative change in aggregation rate. However, it correctly predicted the direction of the effect on aggregation rate (increase/decrease) with 73.7% accuracy. Yang’s model (Yang et al. 2019) was not benchmarked due to the unavailability of a webserver or stand-alone program.
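
The evaluation steps described above can be sketched with three small helpers: converting absolute rate predictions into a predicted change, computing the Pearson correlation and mean absolute error against experiment, and scoring the predicted direction of change. This is an illustrative reconstruction of the procedure under the assumption that all rates are expressed on a logarithmic scale; the function names and the numbers in the example are hypothetical and are not the benchmark data of Table 5.

```python
import math


def predicted_change(mutant_log_rates, wild_type_log_rates):
    """Difference between predicted log rates of mutant and wild type."""
    return [m - w for m, w in zip(mutant_log_rates, wild_type_log_rates)]


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def mean_absolute_error(predicted, observed):
    """Average absolute deviation between prediction and experiment."""
    return sum(abs(p - o) for p, o in zip(predicted, observed)) / len(observed)


def direction_accuracy(predicted, observed):
    """Fraction of mutations whose sign of change is predicted correctly."""
    hits = sum((p > 0) == (o > 0) for p, o in zip(predicted, observed))
    return hits / len(observed)


if __name__ == "__main__":
    # Hypothetical numbers, for illustration only.
    mutant = [1.5, 0.2, -0.4]
    wild_type = [1.0, 1.0, 0.1]
    observed = [0.8, -0.5, -0.2]
    change = predicted_change(mutant, wild_type)
    print(pearson(change, observed))
    print(mean_absolute_error(change, observed))
    print(direction_accuracy(change, observed))
```

The direction score is the only one of these measures that applies to a classifier such as AggreRATE-Disc, which outputs an increase or decrease label instead of a quantitative change.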

Table 5 Different kinetics prediction methods benchmarked on the test dataset

Molecular dynamics approach

A better understanding of the physical phenomena and mechanistic details of self-association is often obtained through molecular simulation studies. Molecular dynamics and Monte Carlo simulations, using a variety of techniques and methodologies, have been widely used to study protein and peptide aggregation. Depending on the focus of the study, the simulation can vary from (i) coarse-grained to all-atom models, (ii) Monte Carlo to molecular dynamics simulations, (iii) implicit to explicit solvation models, and (iv) dimers to bulk simulations (Morriss-Andrews and Shea 2014, 2015; Carballo-Pacheco and Strodel 2016). APRs from the proteins are often studied as peptides instead of the whole protein to understand the aggregation mechanisms and residue-residue interactions at a reduced computational cost. We have provided a summary of the diverse simulations and their applications below.

Atomic simulation of peptide assembly

All-atom simulations are often limited to the simulation of peptide aggregation. For example, Ma and Nussinov (2002a) carried out molecular dynamics simulations of two peptides, AGAAAAGA observed in PrP protein and a polyalanine peptide AAAAAAAA, to identify the critical oligomer size. They showed that oligomers of 6–8 strands were stable and retained the fibril model conformation (10 Å inter-sheet distance and 5 Å inter-strand distance). Similar works have been carried out on several peptides such as poly-glutamine, poly-glycine, and peptide fragments from amyloidogenic proteins (Cecchini et al. 2006; Karandur et al. 2014; Marchut and Hall 2006). Ma and Nussinov (2002b) also studied three different segments of amyloid β (16–22, 16–35, and 10–35) using MD and compared the results with solid-state NMR structures. Gsponer et al. (2003) studied the heptapeptide GNNQQNY from Sup35 using 20-ns simulations of a 3-peptide system and observed 25 parallel β-strand formation events, consistent with experimental data. The study showed the influence of residues on orientation preference and stability of the strand. Zanuy et al. (2003) studied the oligomeric stability of two segments (NFGAIL 22–27 and NFGAILSS 22–29) from islet amyloid polypeptide using molecular dynamics. Their work highlighted the importance of the assembly of interacting sheets in amyloid fibril formation (Zanuy et al. 2003; Zanuy and Nussinov 2003). Similar studies have been carried out on the peptide STVIIE and its 5 variants, NFGAIL 22–27 of the human islet amyloid polypeptide, and the hIAPP 1–19 peptide (Wu et al. 2005; López De La Paz et al. 2005; Guo et al. 2015; Tran and Ha-Duong 2015). Priya and Gromiha (2019) revealed that the length of polyQ in the aggregation of huntingtin protein is important for β-sheet formation and for elucidating the pathological mechanism in Huntington disease.
Figure 4 shows the first and last snapshots from multi-copy MD simulations of the assembly of the ‘VLVIY’ peptide. Studies showed that introducing a lysine residue into the ‘VLVIY’ segment increased the solubility and reduced the viscosity of the monoclonal antibody stamulumab (Nichols et al. 2015; Kumar et al. 2018).

Fig. 4

All-atom simulation of peptide (VLVIY) assembly: (a) the initial setup of 105 peptides separated by 1 nm from each other and (b) the aggregated peptides after a simulation time of 50 ns. VMD was used for visualization (Humphrey et al. 1996)
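As a rough illustration of the kind of setup shown in Fig. 4a, and of one way the assembly in Fig. 4b could be quantified, the sketch below places 105 points (one per peptide centre) on a regular grid and groups them into clusters using a distance cutoff. The 7 × 5 × 3 grid shape, the reading of the 1 nm separation as a centre-to-centre spacing, the cutoff values, and the function names are all assumptions made for illustration; they are not taken from the simulation protocol behind the figure.

```python
from itertools import product
from math import dist

def grid_positions(nx, ny, nz, spacing_nm):
    """Return (x, y, z) points on a regular nx x ny x nz grid (hypothetical layout)."""
    return [(i * spacing_nm, j * spacing_nm, k * spacing_nm)
            for i, j, k in product(range(nx), range(ny), range(nz))]

def cluster_sizes(points, cutoff_nm):
    """Group points closer than cutoff_nm into clusters; return sizes, largest first."""
    parent = list(range(len(points)))

    def find(i):
        # Follow parent links to the cluster root, compressing the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Merge the clusters of every pair of points that lies within the cutoff.
    for a in range(len(points)):
        for b in range(a + 1, len(points)):
            if dist(points[a], points[b]) < cutoff_nm:
                parent[find(a)] = find(b)

    counts = {}
    for i in range(len(points)):
        root = find(i)
        counts[root] = counts.get(root, 0) + 1
    return sorted(counts.values(), reverse=True)

# Assumed layout: 7 * 5 * 3 = 105 peptide centres, 1 nm apart.
peptides = grid_positions(7, 5, 3, spacing_nm=1.0)
dispersed = cluster_sizes(peptides, cutoff_nm=0.6)   # cutoff below the grid spacing
print(len(peptides), max(dispersed))
```

Applied to peptide centres extracted from successive trajectory frames, growth of the largest cluster from 1 towards the total number of peptides would indicate assembly. A real analysis would typically use atom-level contacts and periodic boundary conditions, which this sketch omits.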

Extending the spatio-temporal limits using coarse-grained models

Coarse-grained models are simplified models of polypeptides and their associated interactions. Depending on the depth of abstraction and the level of resolution, coarse-grained models can be categorized as phenomenological models, lower-resolution representative models, and high-resolution coarse-grained models (Morriss-Andrews and Shea 2015). The loss in detail and accuracy of a model is compensated by increased computational efficiency and by the inferences that can be drawn from the extended spatio-temporal scale of the simulations. Coarse-grained models extend the timescale of simulation beyond what is currently possible for atomistic simulations, thereby assisting in studies of protein aggregation mechanisms, phase separation, and nanostructure formation. Such simulations can also help derive thermodynamic parameters of phase separation and fibril growth. All-atom and coarse-grained models of α-synuclein protein are shown in Fig. 5. Nguyen and Hall (2004a, b, 2005, 2006) exploited the ability of coarse-grained models to extend the spatio-temporal limits of MD. Nguyen and Hall (2004a) studied the phase diagram of a polyalanine peptide system using discrete molecular dynamics (DMD) simulations. The authors set up five 96-peptide simulations at various concentrations using the PRIME model of the 16-residue polyalanine peptide. They showed that, depending on concentration and temperature, the peptide exists in four distinct single-phase regions: α-helices, fibrils, nonfibrillar β-sheets, and random coils.

Fig. 5

Coarse-grained simulation of protein aggregation: (a) all-atom model, (b) Martini coarse-grained model (Marrink et al. 2007), and (c) aggregated structure obtained from coarse-grained simulation of α-synuclein protein. ChimeraX was used for visualization (Goddard et al. 2018)
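The step from an all-atom to a coarse-grained representation (Fig. 5a, b) can be sketched generically as replacing each group of atoms by a single bead at the group's centre of mass. The example below is a minimal sketch of that idea only: the grouping into "backbone" and "sidechain" beads, the toy masses and coordinates, and the function names are hypothetical, and the code does not reproduce the actual Martini or PRIME mapping rules.

```python
def center_of_mass(atoms):
    """atoms: list of (mass, (x, y, z)); return the mass-weighted mean position."""
    total_mass = sum(mass for mass, _ in atoms)
    return tuple(sum(mass * xyz[k] for mass, xyz in atoms) / total_mass
                 for k in range(3))

def coarse_grain(atoms_by_bead):
    """Map each named group of atoms to one bead placed at its centre of mass."""
    return {bead: center_of_mass(atoms) for bead, atoms in atoms_by_bead.items()}

# Hypothetical toy fragment (masses in Da, coordinates in Å); not a real residue.
toy_residue = {
    "backbone": [
        (14.0, (0.0, 0.0, 0.0)),   # N
        (12.0, (1.5, 0.0, 0.0)),   # CA
        (12.0, (2.0, 1.4, 0.0)),   # C
        (16.0, (3.2, 1.6, 0.0)),   # O
    ],
    "sidechain": [
        (12.0, (1.9, -1.2, 0.9)),  # CB
    ],
}

beads = coarse_grain(toy_residue)
for name, position in beads.items():
    print(name, tuple(round(c, 2) for c in position))
```

Five atoms are reduced to two interaction sites here; the same reduction applied across a whole protein is what lowers the number of particles and allows longer simulations, at the cost of atomic detail.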

Marchut and Hall (2006, 2007) studied the aggregation of polyglutamine and the role of side chains using an intermediate-resolution model, PRIME, which showed the spontaneous formation of long annular tube-like structures. Peng et al. (2004) studied the stacking of the entire amyloid β 1–40 peptide into β-sheets using discrete molecular dynamics (DMD) and coarse-grained modelling and showed that the peptide system formed stacked β-strands at higher temperatures and amorphous aggregates at lower temperatures. Bellesia and Shea (2007, 2009) performed off-lattice simulations of peptide aggregation using a coarse-grained model in which two beads represent the backbone and one bead represents Cβ, and analysed the kinetics, thermodynamics, and aggregate structure through simulations of different peptide sequences. Interestingly, their work highlighted the role of charged residues in stabilizing aggregates and in changing the preferred orientation of peptides during aggregation.

Singh et al. (2008) studied the effect of finite system size on peptide aggregation by simulating an all-atom model of the IAPP fragment (15–19) in the TIP3P water model using the AMBER force field and discussed the effect of concentration and system size on peptide aggregation. Magno et al. (2010) studied the effect of molecular crowding on the aggregation of an amphipathic peptide model through simulation of a 125-peptide system in boxes of varying size (150 to 290 Å). They reported that crowders play a crucial role in accelerating the nucleation of peptides with low aggregation propensity. Matthes et al. (2011, 2012) studied the spontaneous steric zipper oligomerization of segments 306–311 of tau protein, 12–17 of the insulin B chain, and 51–56 of α-synuclein using an all-atom model of a 10-peptide system. Kumar et al. (2019) used MD simulation to analyse the aggregation propensity of three peptides from human TDP-43: residues 24–33 (N-terminal domain), 126–136 (RNA recognition motif 1), and 247–254 (RNA recognition motif 2). Wang et al. (2019) studied the solubility of different oligomers and fibril models of amyloid-β (16–22) by measuring the equilibrium monomer concentration in the system using the PRIME20 model.
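The link between box size and concentration in such studies is simple arithmetic. The sketch below converts a peptide count and a box edge length into a molar concentration, using the 125-peptide system and the 150 to 290 Å range quoted above. It assumes a cubic box, which the text does not state explicitly, and the function name is illustrative.

```python
AVOGADRO = 6.02214076e23            # molecules per mole
LITRES_PER_CUBIC_ANGSTROM = 1e-27   # 1 Å^3 = 1e-30 m^3 = 1e-27 L

def molar_concentration(n_molecules, box_edge_angstrom):
    """Molar concentration of n_molecules in a cubic box (cubic shape is an assumption)."""
    volume_litres = box_edge_angstrom ** 3 * LITRES_PER_CUBIC_ANGSTROM
    return n_molecules / (AVOGADRO * volume_litres)

for edge in (150.0, 290.0):
    conc_mM = molar_concentration(125, edge) * 1e3
    print(f"{edge:.0f} Å box: {conc_mM:.1f} mM")
```

Under the cubic-box assumption, the smallest and largest boxes correspond to roughly 61.5 mM and 8.5 mM, respectively, which shows how strongly a modest change in box edge alters the effective peptide concentration.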

Molecular dynamics simulations have also been used to study the solubility and aggregation propensity of peptides (Karandur et al. 2014). Frederix et al. (2011, 2015) explored the self-assembly of the entire sequence space of dipeptides (400) and tripeptides (8000) through coarse-grained simulations using the MARTINI force field, measuring the aggregation propensity of each peptide and the nature of the resulting nanostructures. They showed that aggregation propensity depends on hydrophobicity.
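The sizes of the dipeptide and tripeptide sequence spaces follow directly from the 20 standard amino acids (20² = 400 and 20³ = 8000). The sketch below enumerates both spaces and shows one possible aggregation score: the ratio of solvent-accessible surface area (SASA) at the start and end of a simulation. The scoring function is an assumption used here for illustration rather than a statement of the exact protocol of the cited studies, and the SASA values in the example call are made up.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 standard residues, one-letter codes

def sequence_space(length):
    """Yield every peptide sequence of the given length."""
    return ("".join(residues) for residues in product(AMINO_ACIDS, repeat=length))

def aggregation_propensity(sasa_initial, sasa_final):
    """Assumed score: initial SASA / final SASA; values above 1 indicate burial on assembly."""
    return sasa_initial / sasa_final

dipeptides = list(sequence_space(2))
tripeptides = list(sequence_space(3))
print(len(dipeptides), len(tripeptides))  # 400 8000

# Hypothetical SASA values (nm^2) for a single simulated system.
print(aggregation_propensity(sasa_initial=200.0, sasa_final=100.0))  # 2.0
```

A screen of this kind would run one simulation per sequence and rank the sequences by the score; only the enumeration and the scoring arithmetic are shown here.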

Understanding protein-protein interaction and oligomerization

Simulations of protein-protein interactions and protein oligomerization provide valuable insights into the role of residue-residue interactions, key structural motifs, and transitions in dictating protein aggregation at an early stage. Brown and Bevan (2016) investigated the oligomerization of amyloid-β and its binding to membrane models through simulation of a united-atom model with tetramer and pentamer systems. They explored the structural changes during oligomer-membrane binding to understand Aβ oligomer toxicity. Similar MD studies have examined the formation and stability of oligomers of aggregation-prone proteins such as amyloid-β, TDP-43, and amylin and the role of residue-residue contacts in oligomerization (Kumar et al. 2019; Berhanu and Masunov 2014; Khatua and Bandyopadhyay 2017). However, sampling the entire landscape of protein-protein aggregation in an explicit solvent model is computationally intensive on both the spatial and temporal scales. As alternatives, coarse-grained models, implicit solvent models, and peptide simulations have been widely used in the literature (Morriss-Andrews and Shea 2014; Carballo-Pacheco and Strodel 2016). Molecular dynamics simulations have also been carried out on immunoglobulins to study the self-aggregation tendency of the molecules (Buck et al. 2013, 2015; Tiller et al. 2017).

Beyond coarse-grained models, continuum models have been developed to extend the spatio-temporal scale further and to study the mesoscale properties of fibrils (Knowles and Buehler 2011; Paparcone et al. 2011). In addition, several techniques such as replica exchange molecular dynamics, Hamiltonian replica-permutation molecular dynamics, umbrella sampling, and metadynamics have been applied to tune, accelerate, and study protein aggregation simulations (Barducci et al. 2006; Larini and Shea 2012; Itoh and Okumura 2013, 2016; Zheng et al. 2016; Morriss-Andrews and Shea 2014, 2015; Carballo-Pacheco and Strodel 2016). Markov state models (MSMs) and adaptive sampling techniques have also been used to study the transition states in protein oligomerization (Kelley et al. 2008; Jia et al. 2020).
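To make one of these enhanced-sampling techniques concrete, the sketch below shows the standard Metropolis criterion used in temperature replica exchange to decide whether two replicas swap configurations: the swap is accepted with probability min(1, exp[(β_i − β_j)(E_i − E_j)]), where β = 1/(k_B T). The energies and temperatures in the example are hypothetical, and the function names are illustrative rather than drawn from any specific simulation package.

```python
import math
import random

K_B = 0.0019872041  # Boltzmann constant in kcal/(mol K)

def swap_probability(energy_i, temp_i, energy_j, temp_j):
    """Metropolis acceptance probability for exchanging replicas i and j."""
    beta_i = 1.0 / (K_B * temp_i)
    beta_j = 1.0 / (K_B * temp_j)
    delta = (beta_i - beta_j) * (energy_i - energy_j)
    return 1.0 if delta >= 0.0 else math.exp(delta)

def attempt_swap(energy_i, temp_i, energy_j, temp_j, rng=random):
    """Return True if the exchange is accepted for this random draw."""
    return rng.random() < swap_probability(energy_i, temp_i, energy_j, temp_j)

# Hypothetical potential energies (kcal/mol) for neighbouring replicas at 300 K and 310 K.
p_favourable = swap_probability(-100.0, 300.0, -105.0, 310.0)    # colder replica has higher energy
p_unfavourable = swap_probability(-105.0, 300.0, -100.0, 310.0)  # colder replica already lower
print(p_favourable, round(p_unfavourable, 3))
```

Swaps that move the lower-energy configuration to the colder replica are always accepted, while the reverse is accepted only with a Boltzmann-weighted probability. This is what lets conformations trapped at low temperature escape through the high-temperature replicas.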

Expanding horizons

Phase separation of proteins can lead to the formation of liquid droplets, colloidal suspensions, gels, and solid aggregates. The main focus of the current review is on computational techniques associated with liquid-to-solid phase separation of proteins. Liquid-liquid phase separation (LLPS) driven by intermolecular interactions is an equally important phenomenon. LLPS underlies the formation of several biomolecular condensates, which are essential for cellular and nuclear functions. Understanding the mechanism of LLPS and identifying the proteins capable of it would help in understanding complex biological processes (Boeynaems et al. 2018). Choi et al. (2019) developed a lattice model–based simulation engine for exploring the phase separation of proteins. In this model, a protein molecule is modelled as ‘stickers separated by spacer regions’, where the stickers represent the regions that form inter-chain interactions. However, open questions remain: (i) how does a cell control phase separation? and (ii) what decides the nature of the separated phase? These challenges offer new avenues for future research. For example, a unified computational model to predict both solid and liquid phase separation of proteins and peptides would help us understand cellular regulation and the mechanisms of biocondensate formation.

In a different direction, Mishra et al. (2018) used predictions from the protein aggregation prediction tool AGGRESCAN to screen native structures of proteins and explored their application in protein tertiary structure prediction. Further, the design of peptide inhibitors that selectively bind amyloid fibrils and fibril-forming proteins is an active area of research (Lu et al. 2019; Seidler et al. 2019). These peptides bind to protofibrils and oligomers of amyloidogenic proteins to mitigate protein aggregation in neurodegenerative diseases. In silico tools to predict the self-assembly of peptides have a wide range of applications. Several studies have shown the bactericidal activity of self-assembling peptides through the disruption of biofilms and cell membranes (Khodaparast et al. 2018; Lombardi et al. 2019; Tucker et al. 2018). The development of in silico tools for designing and screening such antimicrobial peptides could accelerate and widen the field. Peptide self-assembly has also been widely studied for its structural properties in applications such as drug carriers, scaffolds in tissue engineering, and synthetic nanomaterials (Esteras-Chopo et al. 2005; Gallardo et al. 2016; Gupta et al. 2020; Hauser et al. 2014; Knowles and Mezzenga 2016). These diverging fields of protein aggregation provide scope and new horizons for the development of in silico tools.

Conclusions and future directions

Protein aggregation is a multidimensional phenomenon that involves diverse considerations such as stability of the native state, total aggregation propensity of the protein sequence, presence of aggregation-prone regions and gatekeeping residues, and environmental conditions such as concentration, pH, and ionic strength of protein solutions. Protein deposits such as tangles and plaques have been found in a diverse range of human pathologies and are undeniably associated with disease propagation. Predictions of the aggregation-prone regions, aggregation propensity, and aggregation rate of a protein sequence provide insights into its inherent tendency to drive intermolecular interactions and amyloid fibril formation. MD simulations have been instrumental in studying protein oligomer formation and stability as well as peptide aggregation.

Peptide assembly and nanostructure formation are concentration-dependent and kinetically controlled phenomena. Studies have shown that external conditions can alter the nature and structure of aggregates. The methods currently available to predict aggregation kinetics are still in their nascent stages. However, with the increase in experimental data, it may be possible to develop reliable next-generation kinetics models using large datasets. The inclusion of complex features such as pH, temperature, buffer, protein or ionic concentration, and agitation conditions in computational models could help predict the rates of aggregation with greater accuracy. Similar limitations also apply to APR prediction tools. Most APR prediction tools assume that the protein of interest exists predominantly in the unfolded state and does not interact with other biomolecules. In contrast, cellular environments are highly crowded and only a small fraction of proteins are unfolded. For example, studies have shown cross-seeding, where a fibril fragment of one protein chain initiates amyloid formation of another (Ren et al. 2019). The phenomenon of cross-seeding is highly specific and mostly unidirectional.

In addition to the nucleation of aggregates, it is also important to understand the propagation of the aggregation process. For example, what drives amyloid fibril polymorphism? Polymorphism in amyloid fibrils refers to the multiplicity of amyloid fibrillar structures formed by a given amyloidogenic peptide or protein. Researchers attribute polymorphism to the fibrillation kinetics and to external conditions that influence the aggregation process. Addressing polymorphism in zipper and protofibril structure predictions would pave the way for predicting and understanding complex nanostructure formation, which in turn could be useful for the design of novel biomaterials.

Understanding the molecular interactions that drive complex phenomena such as nucleation, polymorphism of amyloid fibrils, and cross-seeding is essential to fully understand protein aggregation. Currently available experimental techniques can play only a limited role in this regard. With the availability of increased computing power, multi-scale molecular dynamics simulations are proving invaluable in elucidating these interactions. Exploiting recent advancements in sampling techniques, coarse-grained models, polarizable force fields, and constant-pH simulations should enhance our understanding of the molecular events that determine the fate of protein aggregation.