Introduction

Proteins are mainly composed of a succession of 20 standard amino acid types. Their 3D structures, determined by their sequences, support crucial biological functions. However, post-translational modifications (PTMs) also influence the structure and regulate the function of proteins. It is speculated that nearly every protein undergoes some form of PTM (Lodish 2013); PTMs involve the attachment of chemical groups to the amino acid side chains and, in rare cases, to the backbone of proteins (Muller 2018). Although proteins can be modified pre-, co- or post-translationally, all protein modifications are generally referred to as PTMs because they are typically made post-translationally, after the protein is folded. They can be reversible or irreversible.

PTMs alter the physicochemical properties of proteins and thereby play a critical role in modulating various biological functions. However, different PTMs confer different physicochemical properties; thus, the same protein may exhibit different functions depending on its modifications (Jungblut et al. 2008; Mann and Jensen 2003). Owing to their high diversity and their reversibility, which reflects the dynamic nature of a cell, PTMs have also been reported to play essential roles in many cellular control mechanisms, folding, conformational change, stability, activity, localization, turnover, and molecular interactions with partners (Mann and Jensen 2003; Deribe et al. 2010; Walsh et al. 2005). PTMs influence protein function in both orthosteric and allosteric modes (Berezovsky et al. 2017).

Beyond normal cellular processes, dysregulation of PTMs and mutation of PTM sites are implicated in a number of human diseases (Vidal 2011) such as cancer (Bode and Dong 2004; Dai and Gu 2010; Radivojac et al. 2008), diabetes (Donnelly and Williams 2020; Sidney et al. 2018; Lernmark 2013), cardiovascular disorders (Van Eyk 2011; Aggarwal et al. 2020; Gao et al. 2020) and neurodegenerative disorders (Gong et al. 2005; Nekooki-Machida and Hagiwara 2020; Ajit et al. 2019). It was also shown that proteins modified by multiple types of PTMs are significantly more likely to participate in disease than proteins carrying no known PTM sites (Huang et al. 2014). The former were notably found in protein complexes with many partners, with a preference to act as hubs in protein–protein interaction (PPI) networks.

In the past, PTMs were primarily studied with low-throughput biological techniques. Current high-throughput MS-based approaches and proteomic studies allow many more novel sites to be identified and produce a wealth of new information regarding PTMs. For instance, the PRoteomics IDEntifications (PRIDE) database (Perez-Riverol et al. 2019) is the world’s largest data repository of mass spectrometry-based proteomics data and is one of the founding members of the global ProteomeXchange (PX) consortium (http://www.proteomexchange.org) (Deutsch et al. 2017); many of the deposited datasets are related to PTMs, mainly glycosylation, such as the glycoproteome associated with prostate cancer progression (Kawahara et al. 2021). With this increased amount of PTM data, researchers have faced the challenge of capturing all this information in a consistent and structured way, by standardizing the annotation of PTM features and adopting a controlled vocabulary for every described PTM (Farriol-Mathis et al. 2004), to facilitate retrieval and promote understanding by expert biologists as well as computer programs. Providing a precise number of the different types of PTMs is difficult, as some, such as phosphorylation and N- and O-linked glycosylation, are ubiquitous, while others are specific to a clade. In total, 682 types of PTMs, spanning all domains of life, have been reported in the UniProt database (UniProt 2019) using a controlled vocabulary (third 2021 release, dated 02-Jun-2021).

A decade after the launch of the Human Proteome Project (HPP) in 2010 (Legrain et al. 2011), 191,837 PTMs across the 20,379 proteins of the human proteome had already been detected by mass spectrometry in the different cell types that comprise the human body, as indicated on the neXtProt platform in its February 2021 release (Adhikari et al. 2020; Zahn-Zabal et al. 2020). All these PTMs, by altering protein properties, are in part responsible for the largely unmapped complexity and diversity of human proteoforms (Aebersold et al. 2018). For instance, the human histone H4 (UniProt accession: P62805) has 75 frequent proteoforms mapped over its length of only 103 residues.

Owing to the importance of PTMs, several databases have been developed and released in recent decades, but very few of them incorporate 3D structural data (i.e., with real 3D coordinates). Since PTMs influence the function of the protein and their aberrant states are frequently implicated in human diseases, providing structural insights into the influence and dynamics of PTMs is crucial for unraveling cellular processes. Many web-based protein structure databases give the scientific community access to a wide variety of structural information. The primary repository of 3D structural data on proteins (and other biological macromolecules) is the Protein Data Bank (PDB), which was founded in 1971 with only seven experimentally determined protein structures (Berman et al. 2000). In 2021, the PDB celebrated its 50th anniversary, with more than 178,000 entries as of June 2021. Although the PDB is a rich reservoir of structural information for biological macromolecules and has powerful querying interfaces, specialized databases derived from the PDB and cross-annotated with other types of data are often easier, faster, and more informative for specific research questions and goals. These databases also have the added value of being built, maintained, and updated by experts in the field of structural biology.

To illustrate some of the points discussed above, we selected the human Cyclin-Dependent Kinase 2 (CDK2) protein (UniProt ID: P24941). This kinase is involved in the control of the cell cycle and its modifications are essential to regulate its activity (see, for example, Gu et al. 1992; Welburn et al. 2007; Timofeev et al. 2010; Choudhary et al. 2009). Figure 1 shows an overall view of the superimposition of 4 human CDK2 protein structures downloaded from the Protein Data Bank and exhibiting different types of PTMs, highlighted in the 4 zoomed views located on the sides: PDB id 2CJM (Welburn et al. 2007), colored in green, highlighting two phosphorylations on residues Tyr15 and Thr160; PDB id 1H01 (Beattie et al. 2003) and PDB id 4RJ3 (Hanan et al. 2014), colored in gold and blue, displaying a carboxylation and an acetylation on the same Lys33 residue, respectively; and finally PDB id 1GZ8 (Gibson et al. 2002), colored in dark red, showing a sulfino-alanine at residue 177. The conformational changes induced by the phosphorylation of the two residues (Tyr15 and Thr160) are clearly visible; they mainly affect the N-terminal domain and a few loops of the CDK2 structure colored in green.

Fig. 1

Overall view of the superimposition of 4 human CDK2 protein structures. The CDK2 fold is rendered in ribbon representation and colored in green, gold, blue, and dark red for the PDB IDs 2CJM, 1H01, 4RJ3 and 1GZ8, respectively. The PTMs present in each of these structures are drawn in ball-and-stick mode with carbon atoms colored like the overall structure, while nitrogen, oxygen, phosphorus, and sulfur atoms are colored in dark blue, red, orange, and yellow, respectively. RMSD calculations between all pairs of CDK2 structures gave values ranging from 0.25 Å (for 1H01 with 4RJ3) to 4.85 Å (for 1GZ8 with 2CJM). The structure colored in green (2CJM) presented the largest local and global conformational changes, consisting of loop motions and a tilt of the smaller N-terminal lobe
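The pairwise RMSD values quoted in the caption follow the usual definition: the square root of the mean squared distance between equivalent atoms after superposition. A minimal sketch in Python is given below. It assumes that the two coordinate lists are already superimposed and contain the same atoms in the same order (for instance, matching Cα atoms); the function name and the toy coordinates are illustrative and are not taken from the structures in Fig. 1.

```python
import math

def rmsd(coords_a, coords_b):
    """RMSD (in the input unit, e.g. Å) between two pre-superimposed coordinate sets."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate sets must be non-empty and of equal length")
    # Sum of squared distances between equivalent atoms
    squared = sum(math.dist(a, b) ** 2 for a, b in zip(coords_a, coords_b))
    return math.sqrt(squared / len(coords_a))

# Toy example (hypothetical coordinates, three atoms each)
ref = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0)]
moved = [(0.0, 0.0, 0.0), (3.8, 1.0, 0.0), (7.6, 0.0, 2.0)]
print(round(rmsd(ref, moved), 3))
```

The optimal superposition itself (for example with the Kabsch algorithm) is deliberately left out of this sketch; structural biology toolkits usually provide it.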

The vast majority of existing PTM databases focus on protein sequence information and basic modification site metadata, while the 3D structural data related to PTMs have been largely overlooked. In this review, we focus on the current state and development of the limited number of PTM 3D structural databases, i.e., those with 3D coordinates. We then highlight their importance and application in research studies. Sample cases where these databases have been used to aid computational and modeling studies of PTM structures or to advance our knowledge about biological macromolecules are referenced briefly. Finally, we conclude with the many promising in silico research areas that still need to be explored for a better understanding of the impact of PTMs on protein structures and functions, and for the improvement of their prediction.

PTM structural databases

The focus of this review is to discuss the current status of structural databases providing three-dimensional data that are experimentally confirmed and/or predicted on PTM sites in proteins. These databases can be general, covering multiple types of PTMs in different organisms, or specific to one particular type of PTM, class of proteins or organism. The emphasis is on repositories offering open-access data through a web user interface. Some significant historical projects that are no longer maintained will also be mentioned. The information discussed below is summarized in Table 1. For each database, its name, the years of first and last publication, a brief description of its features, the data coverage and the citation references are provided.

Table 1 Summary of the cited PTM structural databases

The best-known database giving access to PTM sites in proteins is dbPTM (http://dbptm.mbc.nctu.edu.tw), first published in 2006 (Lee et al. 2006) and updated regularly (Lu et al. 2013; Huang et al. 2016, 2019; Lee et al. 2009). The first release of this database included all the experimentally validated PTM sites from three external biological databases related to protein PTM information [SwissProt (Wu et al. 2003), PhosphoELM (Diella et al. 2004) and O-GLYCBASE (Gupta et al. 1999)], and the authors developed computational tools to systematically identify and predict sites for three major types of protein PTM (phosphorylation, glycosylation and sulfation) in the SwissProt proteins. Protein structural properties and functional information, such as the solvent accessibility of residues, protein isoforms, non-synonymous single nucleotide polymorphisms (SNPs), protein tertiary structures and protein functional domains, are provided for researchers investigating protein PTM mechanisms by integrating the following external data sources: Ensembl (Hubbard et al. 2005), InterPro (Mulder et al. 2002) and PDB (Deshpande et al. 2005). When experimental 3D structures are not available, the solvent accessibility and secondary structure of residues are computationally predicted and mapped to the PTM sites. To help access the database content, a web query interface and graphical visualization were designed and implemented.

In the second version of dbPTM (Lee et al. 2009), the database was enhanced to comprise a variety of new features and collected literature related to PTM, protein conservations and the specificity of substrate site. Furthermore, a variety of prediction tools have been developed for more than ten PTM types (Zhou et al. 2006), such as phosphorylation, glycosylation, acetylation, methylation, sulfation and sumoylation. The interface was also redesigned and enhanced.

In 2013, the third version of dbPTM (Lu et al. 2013) was updated to integrate experimental PTMs obtained from public resources as well as manually curated MS/MS peptides associated with PTMs from research articles. The aim was to become an informative resource for investigating the substrate specificity of PTM sites and the functional association of PTMs between substrates and their interacting proteins. Additionally, information on the structural topology of transmembrane (TM) proteins was integrated into dbPTM to delineate the structural correlation between the reported PTM sites and TM topologies. To facilitate the investigation of PTMs on TM proteins, the PTM substrate sites and the structural topology are graphically represented. Literature information related to PTMs, orthologous conservation and substrate motifs of PTMs is also provided in the resource. Finally, this version features an improved web interface to facilitate convenient access to the resource.

In 2014, the authors developed topPTM (http://topptm.cse.yzu.edu.tw) (Su et al. 2014), a new dbPTM module that provides a public resource for identifying the functional PTM sites on transmembrane (TM) proteins with structural topology, given the crucial roles of TM proteins in various cellular processes and the importance of PTMs in their functioning. Experimentally verified TM topology data were integrated from TMPad (Lo et al. 2011), TOPDB (Tusnady et al. 2008), PDBTM (Kozma et al. 2013) and OPM (Lomize et al. 2012). In addition to the PTMs obtained from dbPTM, experimentally verified PTM sites were manually extracted from research articles by text mining. The PTM sites on the tertiary structures of TM proteins can be visualized using a Jmol plugin.

In its most recent publication, dbPTM integrates more than 30 different PTM databases leading to 908,917 experimentally verified PTM sites (571,032 experimentally verified phosphorylation sites, 137,442 acetylation and 118,495 ubiquitination, …) and 347,984 predicted putative sites. It described more than 130 PTM types (Huang et al. 2019).

The strength of dbPTM lies not only in the fact that the database has been well maintained for over 10 years and integrates many experimentally validated PTMs from available databases and from manual curation of the literature, but also in the PTM-disease associations it provides, based on non-synonymous single nucleotide polymorphisms (nsSNPs). Some 3D structures are shown together with secondary structure information; the latter is predicted when no 3D structure is available. A JSmol applet allows the visualization of the molecules (Huang et al. 2016).

PTMcode, constructed by Minguez et al., is another general database that integrates 3D structural data, co-evolution and literature curation (http://ptmcode.embl.de) (Minguez et al. 2013, 2015). The PTM residues were searched in the Protein Data Bank (Berman et al. 2000) and specific analyses were performed on the distance between two PTMs in the same protein; their conformation could be visualized using the Jmol plugin. However, the complete list of PTMs, within and between interacting proteins, can be downloaded under the “Data” tab as tab-separated flat files. Because the interface is based on Flash technology, which is now obsolete, browsing the database is difficult. The second version of PTMcode was released in 2014 and includes a new strategy to propagate PTMs from validated modified sites through orthologous proteins. This second release covers 19 eukaryotic species, from which more than 300,000 experimentally verified PTMs of 69 types were collected (Minguez et al. 2015).

Another publicly available 3D structure database for a wide range of PTMs, named PRISMOID [PRoteIn Structure MOdIfication Database (http://prismoid.erc.monash.edu)], was recently developed (Li et al. 2020). Its focus is the 3D structural context of PTM sites and of mutations with functional impact that occur on PTM sites and neighboring sites. PRISMOID provides users with a variety of interactive and customizable search options and data browsing functions to access the data for the target of interest via keywords or PDB/UniProt ID. For each entry in the database, a comprehensive page includes a detailed PTM annotation on the 3D structure and biological information in terms of mutations affecting PTMs, secondary structure assignment, solvent accessibility features of PTM sites and predicted disordered regions. In addition, visualization tools are employed to show the position of the PTM. However, it is not possible to highlight it in a specific and direct way; users must do so with their own tools. The DSSP assignment is not provided in the classical form of a 3- or 8-state string but as 3D coordinates of the Cα atoms and backbone angles, e.g., phi, psi and alpha angles.
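For readers who need the classical per-residue string rather than coordinates and angles, the 8-state DSSP alphabet is commonly reduced to 3 states. The sketch below uses one widespread convention (H, G and I to helix; E and B to strand; everything else to coil). Other reductions exist, and the names used here are illustrative and unrelated to PRISMOID itself.

```python
# One common 8-to-3 state reduction of DSSP codes (an assumed convention).
EIGHT_TO_THREE = {
    "H": "H", "G": "H", "I": "H",   # alpha-, 3-10- and pi-helix -> helix
    "E": "E", "B": "E",             # extended strand, isolated bridge -> strand
    "T": "C", "S": "C", "-": "C",   # turn, bend, other -> coil
}

def reduce_dssp(eight_state):
    """Map a DSSP 8-state string to 3 states; unknown symbols become coil."""
    return "".join(EIGHT_TO_THREE.get(code, "C") for code in eight_state)

print(reduce_dssp("-HHHHGGGT-EEEBS"))  # prints CHHHHHHHCCEEEEC
```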

A few years ago, we built, and continue to maintain, the Post Translational Modification Structural Database (PTM-SD, http://dsimb.inserm.fr/dsimb_tools/PTM-SD), a curated database that provides access to proteins for which PTMs are both experimentally annotated and structurally resolved (Craveur et al. 2014). It combines PTM information and annotation gathered from other databases; it cross-references information from the PDB, UniProt, PTMCuration and dbPTM.

While most databases and web servers concerning PTMs are dedicated to their compilation and prediction, PTM-SD is probably the only database that focuses on experimentally resolved amino acid modifications in the context of protein 3D structures retrieved from the PDB.

PTM-SD can be browsed using different criteria and users can compute statistics and conduct some analyses on the selected subset of data. PTM-SD gives valuable information on observed PTMs in protein 3D structures facilitating sequence–structure–function analyses in light of PTMs and could provide insights for comparative modeling and PTM prediction protocols.

Also worth noting is novPTMenzy (http://202.54.249.142/~shradha/PTM/master.html), a database cataloging information on the sequence, structure, active site and genomic neighborhood of experimentally characterized enzymes involved in five novel PTMs, namely AMPylation, Eliminylation, Sulfation, Hydroxylation and Deamidation (Khater and Mohanty 2015). An interesting feature of novPTMenzy, based on a comprehensive analysis of the sequence and structural features of these known PTM-catalyzing enzymes, is the availability of Hidden Markov Model profiles for the identification of similar PTM-catalyzing enzymatic domains in genomic sequences.

Other databases are specific to a particular PTM type. ADPriboDB (http://adpribodb.leunglab.org) is a database dealing with ADP-ribosylation; it was first developed in 2016 by Vivelo et al. to facilitate studies of the mechanisms and biological significance of ADP-ribosylation (Vivelo et al. 2017). This protein modification refers to the addition of one or more ADP-ribose units onto proteins and is involved in many biological processes such as DNA repair, RNA regulation, cell cycle and biomolecular condensate formation. Its dysregulation is implicated in cancer, inflammatory diseases, and neurological disorders. The database was updated in 2020 by Ayyappan et al. (2021). ADPriboDB 2.0 comprises 48,346 entries and 9097 unique ADP-ribosylated proteins, of which 35,946 and 6708, respectively, were newly identified since the original database release, showing an acceleration of ADP-ribosylation-related research. In addition, the authors created a new interactive tool to visualize the local context of ADP-ribosylation, such as structural and functional features as well as other post-translational modifications.

Another interesting specialized database is dbSNO 2.0 (http://dbSNO.mbc.nctu.edu.tw), first released in 2012 (Lee et al. 2012; Chen et al. 2015); it focuses only on S-nitrosylation (SNO). This reversible PTM involves the covalent attachment of nitric oxide (NO) to the thiol group of cysteine (Cys) residues, regulating protein activity, localization and stability. SNO is associated with a wide range of pathologies such as cancers (Bignon et al. 2018). The database includes 298 S-nitrosylation sites with 3D structural data, presented with a Java applet, in addition to secondary structure assignment and solvent accessibility calculated using DSSP (Kabsch and Sander 1983), modified residue positions, and links to PubMed and to experiments if available. The website does not support batch queries; entries can only be queried one at a time.

Many databases are dedicated to phosphorylation, one of the most abundant PTMs in proteins. Phospho.ELM (Dinkel et al. 2011) (http://phospho.elm.eu.org) is a long-established, manually curated database dedicated to eukaryotic phosphorylation sites in proteins. The data are extracted from the literature and from phospho-proteomics analyses. Its latest version comprises more than 42,500 non-redundant phosphorylation sites in more than 11,000 different protein sequences. The user can query the database by keyword or sequence identifier (UniProt or Ensembl) to get information about single proteins/substrates, or by kinase name to retrieve all the phosphorylated substrates of a particular kinase. It is also possible to restrict the query to different taxonomy groups. Figure 2 illustrates the usage of Phospho.ELM with the similarity search feature. First, a sequence is provided in the Phospho.ELM Blast search (see Fig. 2a); the search engine returns a list of hits (see Fig. 2b); selecting one of these hits (see Fig. 2c) provides a range of information, such as the potential interactions with other proteins. In this case, a PDB file is also available, and the user can follow the link to the PDBe (Gutmanas et al. 2014; Velankar et al. 2010) website (see Fig. 2d).

Fig. 2

Presentation of Phospho.ELM with the example of the BLAST search feature. a The sequence is provided; b a list of results is returned; c selecting one of them displays information such as SMART and MINT interactions, as well as sequences, PMIDs, disorder prediction and a link to the structure; d the linked PDBe page

Phospho3D (http://phospho3d.org), a database of three-dimensional structures of phosphorylation sites (P-sites), is derived from the Phospho.ELM database discussed above. It collects information on the P-site instance, its flanking sequence (10 residues) and the P-site 3D zone (the set of residues in a 12 Å radius surrounding the P-site in space). The database uses the latter to conduct large-scale structural comparisons to identify structurally similar sites in other proteins (Zanzoni et al. 2011, 2007). It was also enriched with structural annotation at the residue level, including secondary structure and solvent accessibility as defined by DSSP (Kabsch and Sander 1983) and residue conservation taken from the Consurf-HSSP database (Glaser et al. 2005). In the same field, we must note the defunct mtcPTM database (mitocheck.org/cgi-bin/mtcPTM/), which stored a large number of structural models of phosphorylatable structures (Jimenez et al. 2007).
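The 3D zone definition can be made concrete with a short sketch. Assuming atoms are available as (residue identifier, x, y, z) tuples, for example parsed from a PDB file, the zone is taken here as the set of residues with at least one atom within 12 Å of any atom of the P-site residue. Whether Phospho3D measures distances from every atom of the P-site or from a single reference atom is not stated here, so the any-atom criterion is an assumption, as are all names and toy coordinates.

```python
import math

def residues_in_zone(atoms, site_residue, radius=12.0):
    """Return residues having an atom within `radius` Å of any atom of `site_residue`.

    atoms: iterable of (residue_id, x, y, z) tuples.
    """
    atoms = list(atoms)
    site_atoms = [(x, y, z) for res, x, y, z in atoms if res == site_residue]
    if not site_atoms:
        raise ValueError(f"no atoms found for residue {site_residue!r}")
    zone = set()
    for res, x, y, z in atoms:
        if res == site_residue:
            continue  # the P-site itself is not part of its neighborhood
        if any(math.dist((x, y, z), s) <= radius for s in site_atoms):
            zone.add(res)
    return zone

# Toy coordinates (hypothetical); residue labels omit the chain for brevity.
toy_atoms = [
    ("THR160", 0.0, 0.0, 0.0),
    ("THR160", 1.5, 0.0, 0.0),
    ("ARG50", 5.0, 5.0, 5.0),    # inside the zone
    ("GLU12", 20.0, 0.0, 0.0),   # 18.5 Å from the nearest site atom, outside
    ("LYS33", 0.0, 11.0, 0.0),   # 11 Å from the first site atom, inside
]
print(sorted(residues_in_zone(toy_atoms, "THR160")))  # prints ['ARG50', 'LYS33']
```

For whole structures, a spatial index such as a k-d tree would avoid the all-against-all distance loop used in this sketch.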

PhosphoSitePlus (PSP) (http://phosphosite.org) (Hornbeck et al. 2015) is an open and continuously curated database for studying experimentally observed PTMs in the regulation of biological processes. It was reengineered from the PhosphoSite (Hornbeck et al. 2004) resource, which was solely dedicated to phosphorylation in proteins. PSP now covers other commonly studied PTMs including acetylation, methylation, ubiquitination, and O-glycosylation. The interface provides users with multiple features to browse the database. For each modification site, PSP provides structural and functional information, and many powerful tools for interpreting these data in different contexts: diseases, tissues, subcellular localization, protein domains, sequences, motifs, etc. When available, a list of PDB ids for the protein in question is provided, with the possibility of downloading a PyMOL or Chimera script to visualize the location of the different modified residues on the protein structure. It is important to note that not all the PDB structures have the PTM experimentally resolved.
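As an illustration of what such a visualization script can look like, the sketch below writes a small PyMOL command script that highlights chosen residues on a fetched PDB entry. It is not the script distributed by PSP; the function name and the selection name are hypothetical, and the example reuses the CDK2 entry 2CJM with its phosphorylated Tyr15 and Thr160 discussed in the Introduction.

```python
def pymol_highlight_script(pdb_id, residue_numbers, selection_name="ptm_sites"):
    """Build a PyMOL command script highlighting the given residue numbers."""
    if not residue_numbers:
        raise ValueError("at least one residue number is required")
    resi = "+".join(str(n) for n in residue_numbers)  # PyMOL list syntax, e.g. 15+160
    lines = [
        f"fetch {pdb_id}",                        # download the entry
        "hide everything",
        "show cartoon",
        f"select {selection_name}, resi {resi}",  # matches these numbers in every chain
        f"show sticks, {selection_name}",
        f"color red, {selection_name}",
        f"zoom {selection_name}",
    ]
    return "\n".join(lines) + "\n"

script = pymol_highlight_script("2CJM", [15, 160])
print(script)
```

The returned text can be saved to a `.pml` file and opened in PyMOL. Because the selection is by residue number only, a chain clause should be added for multi-chain entries when a single chain is of interest.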

Another database providing structural data on phosphosites is Scop3P (http://iomics.ugent.be/scop3p), developed by Ramasamy et al. (2020). Scop3P integrates sequences (UniProtKB/Swiss-Prot), structures (PDB), and uniformly reprocessed PRIDE (Perez-Riverol et al. 2019) phospho-proteomics data to annotate all known human phosphosites. Furthermore, these sites are put into biophysical context by annotating each phosphoprotein with per-residue structural propensity, solvent accessibility, disorder probability, and early folding information. The web interface provides a 3D plugin for visualization and analysis of phosphosites, and for the understanding of phosphosite structure–function relationships.

Finally, carbohydrates constitute a specific research area of PTMs by themselves, owing not only to the impressive diversity of saccharides and linkages and their presence in every clade, but also to their biotechnology applications and implications in multiple diseases. The number of databases is impressive. The large majority focus on the carbohydrates without their target macromolecules, e.g., the Carbohydrate Structure Database (http://csdb.glycoscience.ru) (Egorova et al. 2015), while others take into account both the proteins and their glycosylation. One particularly interesting web portal is Glycosciences.DB (http://glycosciences.de/database/) (Bohm et al. 2019), which provides databases and tools to support glycobiology and glycomics research. It focuses on 3D structures, including 3D structural models as well as references to PDB entries that feature carbohydrates. Another website is Glyco3D (http://glyco3d.cermav.cnrs.fr) (Pérez et al. 2015), which holds a large amount of information on free carbohydrates and some on linked ones, such as the recent GAG database (http://gagdb.glycopedia.eu) (Perez et al. 2020) that contains the 3D structures of glycosaminoglycan (GAG) binding proteins that have been crystallized with their ligands.

We will not delve into the structural databases on glycosylation in this manuscript as Scherbinina and Toukach have dedicated a recent review to approaches of chemo- and glyco-informatics towards 3D structural data generation, deposition and processing in regard to carbohydrates and their derivatives (Scherbinina and Toukach 2020). They focus on the important aspects of carbohydrate 3D structure availability to researchers including structural repositories, glycoinformatics tools and workflows, carbohydrate 3D structure presentation and visualization methods.

In silico applications

Many of the above-mentioned PTM structural databases have been employed in a variety of in silico applications and computational studies. In this section, we briefly report two types of applications. The examples were chosen because they are particularly noteworthy.

Predictions

One of the main applications is the computer-aided prediction of PTM sites which is essential for the functional annotation of uncharacterized proteins (Eisenhaber and Eisenhaber 2010). During the last decades, machine learning has become a valuable approach for understanding the large amount of biological data being generated and made accessible to the scientific community; bibliographic databases are witnessing an exponential growth of ML publications. Many methodologies based on machine learning and deep learning approaches have been developed to predict the modification sites for certain specific types of PTM and the PTM databases, highlighted above, constitute benchmark datasets for training the predictive tools and measuring their performance.

We can note the historical work of Wilkins et al., who developed FindMod (http://web.expasy.org/findmod/), a tool based on MS data, to predict 22 PTM types, including acetylation, phosphorylation, and less classical ones (Wilkins et al. 1999a, 1999b). As the main purpose of this manuscript is not to review PTM prediction methods, we only list a few recent works. Wang R. et al. employed Support Vector Machine (SVM) and Random Forest (RF) machine learning methods to identify lysine crotonylation sites in both plants and mammals (Wang et al. 2020c). Zhang et al. developed a succinylation site prediction tool based on protein sequences (Zhang et al. 2020); the training data were collected from dbPTM. Wang H. et al. applied an improved word-embedding scheme, based on a transfer learning strategy combined with a multilayer convolutional neural network (CNN), to identify protein ubiquitylation sites in plants (Wang et al. 2020b). Finally, Wang D. et al. developed MusiteDeep (http://musite.net) (Wang et al. 2020a); it combines deep learning approaches with evolutionary information to predict 13 different PTM types (including N6-acetyllysine, methylarginine, methyllysine and pyrrolidone-carboxylic-acid), with excellent results. It integrates an interesting feature, supported by the NGL viewer (Rose and Hildebrand 2015), to visualize the predicted PTM sites in the 3D context of homologous proteins that have known 3D structures. It is important to mention that no structural information is used in the prediction itself.

However, since these methods operate through a learning process with positive and negative observations, it is essential to construct clean datasets for training purposes. Ideally, the positive set should only consist of protein sites where experimental proof of their modification has been found. On the other hand, creating a negative dataset is a difficult task, because experimental negative results are rarely described in scientific papers and the protein to be included in the negative set must be located in the same cellular compartment as the modification enzyme, to make sure that the sequence motif is not recognized by the enzyme. Readers are advised to consider these various ML studies with caution and check whether all the good practices of ML (including data collection and splitting, features engineering and selection, model training and optimization with parameters and hyperparameters tuning, model performance and generalization on unseen data with the appropriate evaluation metrics) were applied properly. Other issues to consider are the comparison of the performance of a certain ML predictor with other similar tools and the reproducibility of the results. This type of comparison is likely to be biased because these models were not trained on the same datasets or using the same evaluation metrics. As for reproducibility, it is often impossible due to the unavailability of the source code and the used dataset, and the lack of details in the original publications.

In a study published in 2012, Schwartz discussed the metrics and procedures used to assess predictive tools and surveyed 11 online computational tools aimed at the prediction of the four most widely studied lysine post-translational modifications (acetylation, methylation, SUMOylation and ubiquitination) (Schwartz 2012). His findings suggested that nine of the 11 tools performed no better than random or have false-positive rates which make them unusable by the experimental biologist when assessed using unbiased testing datasets. Another similar study was recently published in which proline hydroxylation was considered as a case study to compare the performance of seven predictors on two newly constructed independent datasets (Piovesan et al. 2020). The self-reported performance is found to widely overestimate the real accuracy measured on independent datasets indicating overfitting and lack of generalization to detect new sites.
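The gap between self-reported and independent performance is easier to see with explicit metrics. The sketch below computes sensitivity, specificity, false-positive rate and the Matthews correlation coefficient (MCC) from a confusion matrix; an MCC near 0 corresponds to random-like predictions. The counts are invented for illustration and do not come from the cited studies.

```python
import math

def binary_metrics(tp, fp, tn, fn):
    """Return sensitivity, specificity, false-positive rate and MCC."""
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "fpr": 1.0 - specificity,
        "mcc": mcc,
    }

# Hypothetical imbalanced test set: 100 true sites, 10,000 non-sites.
m = binary_metrics(tp=80, fp=2000, tn=8000, fn=20)
print({k: round(v, 3) for k, v in m.items()})
```

With these invented counts, sensitivity and specificity are both 0.8, yet only 80 of the 2,080 predicted sites (under 4%) are true sites. This illustrates how a seemingly moderate false-positive rate can make a tool impractical for experimental follow-up when modified residues are rare.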

To counter these above-mentioned phenomena, recommendations for machine-learning-based analyses applied to biological studies have recently been proposed for non-experts in the field to help improve machine learning assessment and reproducibility focusing on four aspects related to data, optimization, model and evaluation (DOME) (Walsh et al. 2021). Finally, some PTM types are limited by the size of the training data. A close collaboration between data scientists and experimentalists could help generate appropriate experimental datasets for model training and the experimental validation of these ML methods.

Impact of PTMs on protein structure

Supplementing PTM repositories with 3D structural data opens the way for the computational modeling of PTM structures at atomic resolution. Such studies make it possible to examine the association of PTMs with the structural rearrangements of their target proteins and provide critical insights into the mechanics behind the dynamic regulation of protein function.

Recently, we investigated the local and global impacts of PTMs on the backbone conformation of modified proteins (Craveur et al. 2019). We considered two main PTM types (N-glycosylation and phosphorylation) in non-redundant datasets extracted from PTM-SD, and selected four example proteins to illustrate our findings and compare backbone flexibility in the presence and absence of PTMs. To analyze local protein conformations, we used a structural alphabet, the Protein Blocks, which provides a fine approximation of the local structural architecture (Etchebest et al. 2005). We observed that PTMs could either stabilize or destabilize the backbone structure, at both local and global scales; that the impact of multiple PTMs on protein structure flexibility is not additive; and that these effects depend on the PTM type. A similar study was conducted by Xin and Radivojac (Xin and Radivojac 2012). Their results provide evidence that PTMs induce conformational changes at both the local and global levels. However, the proportion of large changes is unexpectedly small.
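One measure commonly used with Protein Blocks (PBs) to quantify local backbone variability is the equivalent number of PBs, Neq = exp(-Σ p_x ln p_x), where p_x is the frequency of block x at a given position across a set of conformations. The sketch below is only an illustration of that formula: the function name and the toy PB strings are hypothetical, and it is an assumption that this measure matches the flexibility analysis of the study cited above.

```python
import math
from collections import Counter


def neq_per_position(pb_sequences):
    """Compute Neq at each position from PB strings of equal length.

    Each string encodes one conformation of the same protein, one PB
    letter (a to p) per position. Neq is 1.0 when a single block is
    always observed and grows as more blocks are sampled.
    """
    if not pb_sequences:
        raise ValueError("at least one PB sequence is required")
    length = len(pb_sequences[0])
    if any(len(seq) != length for seq in pb_sequences):
        raise ValueError("all PB sequences must have the same length")

    neq = []
    for column in zip(*pb_sequences):
        counts = Counter(column)
        total = len(column)
        # Shannon entropy (natural log) of the PB distribution at this position.
        entropy = -sum((n / total) * math.log(n / total) for n in counts.values())
        neq.append(math.exp(entropy))
    return neq


if __name__ == "__main__":
    # Toy PB strings for three conformations (not real data).
    unmodified = ["mmmd", "mmkd", "mmld"]
    modified = ["mmmd", "mmmd", "mmmd"]
    print(neq_per_position(unmodified))
    print(neq_per_position(modified))
```

Comparing such per-position profiles between modified and unmodified forms of a protein is one way to express a local rigidification or increased mobility numerically.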

It has also been widely discussed that many PTM sites are found in intrinsically disordered regions (IDRs) (Tompa et al. 2014; Bah and Forman-Kay 2016). Some studies have investigated the correlation between protein disorder and PTMs by integrating data from different databases (UniProt/Swiss-Prot and 3D structures solved by NMR from the Protein Data Bank) (Gao and Xu 2012). These studies revealed a significant preference of some PTMs for disordered regions (phosphorylation, hydroxylation, …) and of others for ordered regions (S-nitrosocysteine, most ADP-ribosylation, …), while acetyllysine does not show any significant preference. Further analysis of NMR structures suggested that some types of modification might induce disorder-to-order transitions. Intrinsically disordered proteins (IDPs) are catalogued in dedicated databases, the best known being DisProt (Hatos et al. 2020) and MobiDB (Piovesan et al. 2021). However, the number of resolved cases of IDRs with PTMs remains limited in the PDB. To work with IDPs, structural models are therefore often used; because they are theoretical rather than experimental, they must be handled cautiously and complemented with experimental data. This is one of the most complex parts of PTM research using 3D structures.
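A preference of a PTM type for disordered regions can be tested with a 2×2 contingency table of residues (modified or not, disordered or not). The sketch below computes the odds ratio and a one-sided Fisher's exact test using only the Python standard library (Python 3.8 or later for `math.comb`). The function names and counts are hypothetical, and this is not necessarily the statistical procedure used in the studies cited above.

```python
import math


def odds_ratio(a, b, c, d):
    """Odds ratio for the table [[a, b], [c, d]].

    a: modified and disordered     b: modified and ordered
    c: unmodified and disordered   d: unmodified and ordered
    """
    if b * c == 0:
        return math.inf
    return (a * d) / (b * c)


def fisher_exact_greater(a, b, c, d):
    """One-sided p-value: probability of at least `a` modified residues
    in disordered regions if modification were independent of disorder."""
    modified = a + b
    disordered = a + c
    ordered = b + d
    total = a + b + c + d
    denom = math.comb(total, modified)
    p_value = 0.0
    for k in range(a, min(modified, disordered) + 1):
        # Hypergeometric probability of exactly k modified, disordered residues.
        p_value += math.comb(disordered, k) * math.comb(ordered, modified - k) / denom
    return p_value


if __name__ == "__main__":
    # Hypothetical counts of candidate residues (not real data).
    a, b, c, d = 30, 10, 200, 400
    print("odds ratio:", odds_ratio(a, b, c, d))
    print("one-sided p-value:", fisher_exact_greater(a, b, c, d))
```

An odds ratio above 1 with a small p-value would indicate enrichment in disordered regions; the same test with the columns swapped addresses a preference for ordered regions.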

Lastly, molecular dynamics (MD) simulation is a computational method that many research groups have increasingly used in the last few years to investigate the impact of PTMs on the dynamics of modified proteins. To cite a couple of these studies: Yalinca et al. (2019) used MD simulations to study the effects of phosphorylation and acetylation, as well as the cross-talk between these modifications, on the energy landscape of the huntingtin N-terminus. Their findings provide insights into the structural basis underlying the effect of PTMs on the aggregation and cellular properties of the huntingtin protein and its implications in Huntington's disease. In a more recent study, Rao et al. investigated the effects of changes in glycan composition on glycoprotein dynamics, taking N-glycans on the insulin receptor as an example (Rao et al. 2021). However, one of the challenges in MD simulation is the selection of appropriate force field parameters to correctly simulate the dynamics of biological systems involving PTMs. Many tools have been developed to explore non-standard amino acids and protein modifications using MD simulations, such as Privateer (Bagdonas et al. 2020), CHARMM-GUI (Jo et al. 2008) and Vienna-PTM (Margreitter et al. 2013). For unconventional and rare PTMs, however, parameters are not present in the existing force fields and must therefore be calculated using quantum mechanics approaches, which can be cumbersome and computationally expensive.

Conclusion and prospects

In this review, we have discussed the existing PTM structural databases and highlighted their importance in providing the scientific community with the data needed to advance PTM-related research and, more specifically, to assist structure–function relationship studies.

We first noticed that PTM structure databases are few in number, an observation confirmed by the 2021 Nucleic Acids Research Database issue (Rigden and Fernandez 2021), which contains 189 papers with only one database on glycan structures (GlycoPOST). Secondly, these databases suffer from classical database issues such as availability and sustainability. A recent study that screened the availability of thousands of bioinformatics web services published from 2010 to 2020 has shed light on the factors affecting their lifetime (Kern et al. 2020). Some of the databases we tested and reviewed became unreachable during the writing of the manuscript.

The analysis and correct identification of modified residues encounter many difficulties that are intrinsic to the properties of the PTM in question. These modifications are dynamic and change over time. A single protein (like the human CDK2 example discussed in the introduction) can also have several PTM sites that can be modified in different combinations. Mass spectrometry analysis of peptides can also produce doubtful results, because it is not always possible to identify the exact modified residue and its location, especially when the peptide contains several possible modification sites. Some studies have revealed the extent of differences in PTM patterns for the same protein between species [e.g., myelin basic protein (MBP) between mammals and lower vertebrates (Zhang et al. 2012)], making the automatic annotation of PTMs inferred by similarity and prediction prone to errors. Finally, the major issue severely limiting structural studies of PTMs is the deficiency of structural data in general and the absence of PTMs in resolved structures, since the proteins are often over-expressed in systems quite different from their original organisms.

For all the above-mentioned reasons, care must be taken when analyzing and interpreting PTM data in public repositories. Further advances in this field will help build a better understanding of the implications of PTMs in biological processes. Finally, following the unprecedented performance of AlphaFold2 (Jumper et al. 2021) in CASP14, its developers partnered with EMBL-EBI to release the most complete database of predicted protein 3D structures, AlphaFold DB, which covers almost the entire human proteome (98.5% of human proteins) (Tunyasuvunakool et al. 2021). We expect this resource to be a great hub of information to assist scientists in studying and modeling PTMs in their three-dimensional context. It is possible to model some of the PTMs carefully annotated in UniProt on the AlphaFold models, as done recently for glycosylations (Bagdonas et al. 2021). However, it is always necessary to check the experimental data behind each annotation (distinguishing annotations supported by a publication from those derived from sequence analysis) and to analyze the proposed models precisely, as some models are erroneous.