1 Introduction

Proteases comprise a group of structurally and functionally diverse enzymes that share the ability to catalyze the hydrolysis of peptide bonds [1]. This ability is provided by active sites of varying composition, which give the different classes of proteolytic enzymes their respective names: serine, cysteine, threonine, aspartyl and glutamyl proteases and metalloproteases. Among these groups, serine and cysteine proteases are the most studied and act under the widest range of conditions. Most serine proteases employ a catalytic triad consisting of Ser, His and Asp residues as an active site. The three residues must be positioned in a specific way to facilitate Ser deprotonation so that it can perform a nucleophilic attack on a substrate (Fig. 1). Ser must be hydrogen bonded to His, which in turn forms a second hydrogen bond to Asp that positions His correctly and shifts its pKa. More generally, according to the function they perform, these three residues are called a nucleophile, a base and an acid (or an activator). Histidine is the most widespread base. The nucleophile and the activator, in turn, may differ from the most common Ser and Asp residues [2]; hence there exist classes of triad-harbouring cysteine proteases [3]. This evolutionarily successful arrangement of three residues is also exploited for other hydrolytic functions. Despite the pronounced need for specific hydrogen bonding between the triad residues, it can be achieved by more than one spatial arrangement. Comparisons of single representative enzymes showed that, for example, within the class of serine triad-harbouring proteases, chymotrypsin and subtilisin implement different triad architectures. On the other hand, the cysteine TEV protease is remarkably similar to the serine protease trypsin in terms of active site spatial arrangement. A systematic investigation of the space of all triad architectures has not been performed to date.

Animal, plant and, especially, microbial proteases represent the largest and most important segment of the industrial enzyme market, where they are used in detergents, food processing and the leather industry, as biocatalysts in organic synthesis, and as therapeutics. The list of potential practical applications of proteases can be greatly expanded, especially for therapeutic applications, once their catalytic activities can be engineered for specific uses [1]. This can be done with the help of state-of-the-art computational techniques. A combination of structural analysis, reaction modelling and rational design can be used to modify the specificity, stability or other properties of existing enzymes, including proteases [4,5,6,7]. One of the success stories in protease design is Kuma062, a kumamolisin variant repurposed to process gluten [8]. Modification of existing enzymes alone is, however, not sufficient to meet the demand for applied proteolytic functions. Computational methods for the de novo introduction of enzymatic function into previously non-catalytic protein folds are regarded as a major step forward in addressing the needs of industry and medicine [9]. These efforts are, however, limited to the approach implemented in the Rosetta3 enzyme design protocol [10]. This method was previously used successfully for the transfer of existing active sites into manually generated folds and for the computational design of previously non-existent enzymatic functions such as retro-aldolase or Diels-Alder reaction catalysts [11,12,13,14].

The underlying computational procedure starts from the definition of a theozyme, a set of atoms at their respective coordinates mimicking the crucial step of the enzymatic process, e.g. the transition state. Most commonly a theozyme is constructed from the substrate moiety and the sidechains of active site residues. After the theozyme is constructed, a suitable backbone scaffold to harbour its residues needs to be obtained, either by searching the space of known structures or by constructing it from scratch. One can focus only on backbone scaffolds because there is only a limited set of them, and they are highly degenerate in terms of underlying sequences; thus it is unnecessary to perform a placement search in all individual proteins with known structure. The searching algorithm implemented in the Rosetta Match application is rather slow, as it scans through rotameric libraries of the sidechains of the desired active site residues. The sampling takes even longer if the geometry constraints tolerate fluctuations in the theozyme structure. The subsequent design is based on the preservation and additional stabilisation of some interactions between the active site sidechains and the transition state of the reaction [10].

Such a technique does not require similarity between the active site backbone conformations in the newly constructed enzyme and in the initial source of the theozyme, and relies on idealized backbone-dependent rotamer libraries for sidechains. However, in some enzymes the backbone of the active site residues plays a key role in oxyanion hole formation. For example, in both serine and cysteine proteases the backbone N of the catalytic Ser/Cys forms a hydrogen bond with the carbonyl O of the substrate. This interaction is crucial for catalysis [15, 16]. Moreover, it was shown that the residues directly involved in the catalytic act are more often rotameric outliers [17, 18].

The conformations of protease catalytic residues are highly specific, being the result of evolutionary selection. Since rotamer distributions are inherently backbone conformation-dependent, and since the backbone itself is likely to participate in the reaction, we suggest that the relative backbone geometries of catalytic residues are themselves highly specific. What is more, we hypothesize that, knowing the relative geometries of the triad’s backbones, one can derive the triad’s full structure. This makes it possible to make the theozyme placement search completely sidechain-agnostic. In this work we provide justification for this idea. We propose a description of the catalytic site based on the backbone orientation of the involved residues. We demonstrate a distinguishable difference between the triads in active sites of available serine and cysteine proteases and other, non-catalytic combinations of the same residues. With the hypothesis supported, we show that a natural consequence is the possibility to drastically speed up scaffold searching. We present a computational protocol for theozyme placement based on scaffold backbone orientation analysis. When applied to the search for a trypsin catalytic triad placement, our backbone-based approach outperforms Rosetta Match in speed by at least 30 times while retaining accuracy. The low computational cost of the presented solution allows a placement search for one active site to be run over about 180 000 structures (the full PDB) in a matter of minutes when using supercomputer resources.

2 Materials and Methods

2.1 Backbone-Based Vectorization of Triads

We define a triad as a triplet of unique protein residues with known positions of their backbone atoms N, CA and C. For each residue we introduce a virtual point in space called BB, placed at the midpoint between its N and C atoms. For a pair of residues i and j, a number of terms are computed. Term αij is defined as the angle between atoms iC–iBB–jBB. Term θij is defined as the torsion angle constructed for atoms iC–iBB–jBB–jC. Term ηij is defined as the torsion angle constructed for atoms iCA–iC–iBB–jBB. The triad vector V for residues i, j, k is then constructed from these terms as follows:

$$ \boldsymbol{V}_{ijk} = \left\{ \alpha_{ij}, \alpha_{ji}, \theta_{ij}, \eta_{ij}, \alpha_{jk}, \alpha_{kj}, \theta_{jk}, \eta_{jk}, \alpha_{ki}, \alpha_{ik}, \theta_{ki}, \eta_{ki} \right\} \tag{1} $$

Throughout the paper all angular terms are expressed in degrees.
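The vectorization defined above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: residues are represented as hypothetical dictionaries mapping the atom names `"N"`, `"CA"`, `"C"` to coordinate tuples, and torsions use the common signed convention in the range [-180, 180]; the paper does not state which torsion range was used.

```python
import math

def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))

def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def angle(p1, p2, p3):
    """Angle p1-p2-p3 in degrees, with the vertex at p2."""
    u, v = _sub(p1, p2), _sub(p3, p2)
    cos_t = _dot(u, v) / math.sqrt(_dot(u, u) * _dot(v, v))
    return math.degrees(math.acos(max(-1.0, min(1.0, cos_t))))

def torsion(p1, p2, p3, p4):
    """Signed torsion p1-p2-p3-p4 in degrees, in the range [-180, 180]."""
    b1, b2, b3 = _sub(p2, p1), _sub(p3, p2), _sub(p4, p3)
    n1, n2 = _cross(b1, b2), _cross(b2, b3)
    y = math.sqrt(_dot(b2, b2)) * _dot(b1, n2)
    x = _dot(n1, n2)
    return math.degrees(math.atan2(y, x))

def bb_point(res):
    """Virtual BB point: midpoint between the backbone N and C atoms."""
    return tuple((n + c) / 2.0 for n, c in zip(res["N"], res["C"]))

def pair_terms(ri, rj):
    """Return (alpha_ij, alpha_ji, theta_ij, eta_ij) for residues i and j."""
    bi, bj = bb_point(ri), bb_point(rj)
    return (angle(ri["C"], bi, bj),
            angle(rj["C"], bj, bi),
            torsion(ri["C"], bi, bj, rj["C"]),
            torsion(ri["CA"], ri["C"], bi, bj))

def triad_vector(ri, rj, rk):
    """12-component vector of Eq. (1): pairs (i, j), (j, k), (k, i) in order."""
    return pair_terms(ri, rj) + pair_terms(rj, rk) + pair_terms(rk, ri)
```

Because every component is an angle or torsion built from interatomic geometry, the resulting vector does not change when the whole triad is translated or rotated.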

2.2 Protease Triads Dataset Construction

For this work, a collection of PDB IDs matching EC codes 3.4.21 (Serine endopeptidases) and 3.4.22 (Cysteine endopeptidases) with resolution under 3 Å and R-free under 0.4 was obtained. To ensure non-redundancy the dataset was culled at the 90% sequence similarity level using Pisces [19]. The resulting PDB IDs dataset comprised 811 entries.

We then searched for catalytic-like triads in the structures of these proteins. Search and analysis were performed with the help of ProDy [20]. First, all histidine residues were selected. For each histidine, we then analyzed the surroundings of both sidechain nitrogen atoms, Nδ and Nε. A catalytic-like triad was identified as a triplet of residues Nuc-His-Act, where Nuc (nucleophile) is either Ser or Cys and Act (activator) is Asp, Glu, Asn or Gln, if the Oγ or Sγ atom of Nuc was closer than 3.5 Å to one nitrogen atom of His and, simultaneously, one of the sidechain oxygens of Act was closer than 3.5 Å to the other nitrogen of His. If the analyzed structure comprised several copies of the same subunit, the triad from only one of them was retained for subsequent studies.

For each catalytic-like triad obtained this way a triad vector was computed as described above. The resulting triad dataset comprised 312 entries.
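The geometric criterion for catalytic-like triads can be sketched as follows. The study itself used ProDy; this sketch deliberately avoids any library and works on a hypothetical data layout (a list of dictionaries with a residue `name` and an `atoms` mapping from standard PDB atom names to coordinates), so the function and variable names are illustrative assumptions only.

```python
import math

CUTOFF = 3.5  # Angstrom, as stated in the text

# Sidechain atoms considered for each residue type (standard PDB atom names).
NUC_ATOMS = {"SER": ("OG",), "CYS": ("SG",)}
ACT_ATOMS = {"ASP": ("OD1", "OD2"), "GLU": ("OE1", "OE2"),
             "ASN": ("OD1",), "GLN": ("OE1",)}

def residues_near(point, residues, atom_table):
    """Indices of residues with a listed atom closer than CUTOFF to point."""
    hits = []
    for idx, res in enumerate(residues):
        for atom_name in atom_table.get(res["name"], ()):
            xyz = res["atoms"].get(atom_name)
            if xyz is not None and math.dist(point, xyz) < CUTOFF:
                hits.append(idx)
                break
    return hits

def find_triads(residues):
    """Return (nuc, his, act) index triples satisfying the distance criterion."""
    triads = []
    for h, res in enumerate(residues):
        if res["name"] != "HIS":
            continue
        nd1, ne2 = res["atoms"].get("ND1"), res["atoms"].get("NE2")
        if nd1 is None or ne2 is None:
            continue
        # The nucleophile may sit on either nitrogen; the activator must
        # then be found next to the other one.
        for n_nuc, n_act in ((nd1, ne2), (ne2, nd1)):
            for nuc in residues_near(n_nuc, residues, NUC_ATOMS):
                for act in residues_near(n_act, residues, ACT_ATOMS):
                    triads.append((nuc, h, act))
    return triads
```

Selecting one subunit copy per structure, as described above, would be a separate step applied to the returned list.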

2.3 Clusterization of Triad Vectors

We chose to compose our vector only from angular and torsional terms to avoid normalization problems since all the values are expressed in the same units and lie in the same range. However, half of vector values represent torsions which are naturally periodic. Because of this, straightforward implementation of distance-based clusterization is incorrect since commonly used distance metrics are not periodic. We thus precompute the distance matrix manually. For two triad vectors Vijk and Vabc, the distance D between them is the Euclidean norm of a vector ΔV:

$$ \begin{gathered} D = \left\| \Delta \boldsymbol{V} \right\|_{2} = \left\| \{ \Delta \alpha_{ij,ab}, \Delta \alpha_{ji,ba}, \Delta \theta_{ij,ab}, \Delta \eta_{ij,ab}, \Delta \alpha_{jk,bc}, \Delta \alpha_{kj,cb}, \right. \\ \left. \Delta \theta_{jk,bc}, \Delta \eta_{jk,bc}, \Delta \alpha_{ki,ca}, \Delta \alpha_{ik,ac}, \Delta \theta_{ki,ca}, \Delta \eta_{ki,ca} \} \right\|_{2} \end{gathered} \tag{2} $$

where

$$ \Delta \alpha_{ij,ab} = \alpha_{ij} - \alpha_{ab} \tag{3} $$
$$ \Delta \theta_{ij,ab} = \min\left( \left| \theta_{ij} - \theta_{ab} \right|, 360 - \left| \theta_{ij} - \theta_{ab} \right| \right) \tag{4} $$
$$ \Delta \eta_{ij,ab} = \min\left( \left| \eta_{ij} - \eta_{ab} \right|, 360 - \left| \eta_{ij} - \eta_{ab} \right| \right) \tag{5} $$

and similarly for all other instances of α, θ and η.
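A minimal sketch of this periodic distance is given below. It assumes the component ordering of Eq. (1), so that the torsional terms θ and η occupy positions 2, 3, 6, 7, 10 and 11 (zero-based). The extra modulo operation is an addition for robustness against inputs that differ by more than one full turn and is not part of the formulas above.

```python
import math

# Zero-based positions of the torsional terms (theta, eta) in the triad vector.
TORSION_SLOTS = {2, 3, 6, 7, 10, 11}

def periodic_delta(a, b):
    """Smallest absolute difference between two torsions, in degrees."""
    d = abs(a - b) % 360.0
    return min(d, 360.0 - d)

def triad_distance(v, w):
    """Euclidean norm of the component-wise difference of two triad vectors."""
    total = 0.0
    for idx, (a, b) in enumerate(zip(v, w)):
        d = periodic_delta(a, b) if idx in TORSION_SLOTS else a - b
        total += d * d
    return math.sqrt(total)
```

For example, torsions of 170° and -170° contribute a difference of 20° rather than 340°.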

The precomputed distance matrix was used to perform density-based clusterization with DBSCAN from the sklearn Python package [21]. The epsilon parameter, specifying the maximum distance between two samples for one to be considered as in the neighborhood of the other, was set to 50. The number of samples in a neighborhood for a point to be considered a core point was set to 10. Four clusters were identified, with 86 points not belonging to any of them.
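With scikit-learn, clustering on a precomputed distance matrix with the parameters quoted above might look like the sketch below. The wrapper function `cluster_triads` is a hypothetical name introduced for illustration; only `DBSCAN` with `metric="precomputed"` is the library's own interface.

```python
import numpy as np
from sklearn.cluster import DBSCAN

def cluster_triads(dist_matrix, eps=50.0, min_samples=10):
    """Label each triad with a cluster index; -1 marks unclustered points."""
    model = DBSCAN(eps=eps, min_samples=min_samples, metric="precomputed")
    return model.fit_predict(np.asarray(dist_matrix, dtype=float))
```

The input must be a square, symmetric matrix of pairwise triad distances; points labelled -1 correspond to triads that fall outside every cluster.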

2.4 Visualization of Clusterization Results

Informative visualization of clusterization results for data represented as 12-dimensional vectors naturally calls for dimensionality reduction. The UMAP technique, as implemented in the umap-learn Python package, was selected for the task [22]. All parameters were left at their defaults except for the metric, which was set to “precomputed” since we used an already built distance matrix as input.

Protein structures with triads from the same clusters were superposed with the help of the pair_fit functionality in PyMOL, which was also used for molecular visualization throughout the paper [23].

2.5 Scaffold Preprocessing and Placement Search Procedure

For a given protein scaffold query and a triad vector template, the placement search procedure is intended to produce a ranked list of triples of scaffold residues most closely matching the relative backbone orientation of the template. To enable reuse, a protein scaffold is first preprocessed. The protein structure is transformed into a graph with its residues represented as nodes. An edge between two nodes i and j is drawn if the distance between the CA atoms of the two respective residues (dCA) lies between 4 and 13 Å and the distance between their CB atoms (dCB) does not exceed dCA by more than 1 Å. The edge is assigned a data container composed of two vectors, {αij, αji, θij, ηij} and {αji, αij, θji, ηji}. The list of all triads is then obtained by performing a clique search and selecting all cliques of size 3. An additional filter is imposed on a triad ijk so that the area of the triangle with vertices CAi, CAj and CAk does not exceed 35 Å². The construction of a final triad vector relies on specifying the order of its constituent residues. For all selected cliques, all permutations of the vertices are constructed, and each is assigned a triad vector by combining the respective components of the data stored on the graph’s edges. The final list of all triad vectors and the respective residue indices in each sequential order is saved as a Python pickle for further use. All graph manipulations are performed with the help of the networkx Python package [24].
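The filtering logic of the preprocessing stage can be illustrated with a small sketch. The study used a networkx clique search; here a brute-force enumeration over residue triples is swapped in purely for self-containment, and it would be too slow for large scaffolds. The residue layout (dictionaries with `"CA"` and `"CB"` coordinates) and the inclusive treatment of the 4 and 13 Å bounds are assumptions.

```python
import math
from itertools import combinations, permutations

def has_edge(res_a, res_b):
    """Edge criterion: 4 <= dCA <= 13 A and dCB - dCA <= 1 A."""
    d_ca = math.dist(res_a["CA"], res_b["CA"])
    d_cb = math.dist(res_a["CB"], res_b["CB"])
    return 4.0 <= d_ca <= 13.0 and d_cb - d_ca <= 1.0

def triangle_area(p, q, r):
    """Area of the triangle pqr from the norm of the cross product."""
    u = [b - a for a, b in zip(p, q)]
    v = [b - a for a, b in zip(p, r)]
    cx = (u[1] * v[2] - u[2] * v[1],
          u[2] * v[0] - u[0] * v[2],
          u[0] * v[1] - u[1] * v[0])
    return 0.5 * math.sqrt(sum(c * c for c in cx))

def candidate_triads(residues, max_area=35.0):
    """All ordered residue triples that pass the edge and area filters."""
    n = len(residues)
    edges = {(i, j) for i, j in combinations(range(n), 2)
             if has_edge(residues[i], residues[j])}
    ordered = []
    for i, j, k in combinations(range(n), 3):
        # A size-3 clique needs all three pairwise edges.
        if (i, j) in edges and (j, k) in edges and (i, k) in edges:
            area = triangle_area(residues[i]["CA"], residues[j]["CA"],
                                 residues[k]["CA"])
            if area <= max_area:
                ordered.extend(permutations((i, j, k)))
    return ordered
```

Each surviving triple yields six ordered index tuples, one per permutation, for which triad vectors would then be assembled from the per-edge terms.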

In a placement search for a specified template and scaffold, the list of stored triad vectors is further reduced by noting that only half of the six permutations of triad indices are relevant for any single search task. For the input template triad ijk, it is determined whether the triple of vectors \(\overrightarrow{N_{i} C_{i}}\), \(\overrightarrow{N_{j} C_{j}}\), \(\overrightarrow{N_{k} C_{k}}\) is right-handed or left-handed, and only scaffold triples of matching handedness are retained for the search. Finally, the distance between each scaffold triad vector and the template vector is calculated as described earlier, and scaffold positions are reported if this distance is below the threshold.
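The handedness filter can be sketched as below. The text does not say how handedness is computed; this sketch assumes the sign of the scalar triple product of the three N→C vectors, which is one standard way to do it. Residues are hypothetical dictionaries with `"N"` and `"C"` coordinates.

```python
from itertools import permutations

def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def nc_vector(res):
    """Vector from the backbone N atom to the backbone C atom."""
    return tuple(c - n for n, c in zip(res["N"], res["C"]))

def is_right_handed(ri, rj, rk):
    """True if the scalar triple product of the three N->C vectors is positive."""
    a, b, c = nc_vector(ri), nc_vector(rj), nc_vector(rk)
    return _dot(a, _cross(b, c)) > 0.0

def matching_orders(template, scaffold_triad):
    """Index orders of a scaffold triad sharing the template's handedness."""
    target = is_right_handed(*template)
    return [order for order in permutations(range(3))
            if is_right_handed(*(scaffold_triad[i] for i in order)) == target]
```

Since swapping two residues flips the sign of the triple product, exactly three of the six orders survive for any non-degenerate triad, which is the halving described above.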

2.6 Scaffold Library Construction

The CATH non-redundant S40 collection of domains was obtained as PDB files, totaling 31879 scaffolds [25]. Since the positions of CB atoms are a prerequisite for one of the filter stages in the placement search, all positions in all scaffolds were mutated to alanine without altering backbone coordinates using the Rosetta3 fixbb protocol [26]. Each structure was then preprocessed as described earlier.

Preprocessing and scaffold searching were carried out using the equipment of the shared research facilities of HPC computing resources at the Lomonosov Moscow State University (“Lomonosov-2” supercomputer) [27]. The preprocessing stage took 30 min 22 s on 64 cores, with an average preprocessing time of 3.66 s per scaffold. The scaffold searching stage took 5 min 43 s on 64 cores, with an average search time of 0.69 s per scaffold.
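As a rough consistency check of these timings, assuming ideal scaling over the 64 cores (an assumption; scheduling and I/O overheads are ignored), the per-scaffold averages reproduce the reported wall-clock times:

$$ \frac{31879 \times 3.66\ \mathrm{s}}{64} \approx 1823\ \mathrm{s} \approx 30.4\ \mathrm{min}, \qquad \frac{31879 \times 0.69\ \mathrm{s}}{64} \approx 344\ \mathrm{s} \approx 5.7\ \mathrm{min} $$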

2.7 Rosetta Match Assessment

The structure of porcine pancreatic trypsin (PDB ID 4DOQ) was used to assess the computational time of the Rosetta Match application [10]. The theozyme included the catalytic triad Ser-His-Asp and a water molecule as a dummy substrate. The search was performed over the whole protein structure (221 residues). The -consolidate_matches flag was used to prevent massive and time-consuming output of nearly identical structures.

Rosetta Match was tested with -packing:ex1 and -packing:ex2 levels set to either default 1 or 3 for more precise rotamer sampling.

3 Results and Discussion

3.1 Backbone-Based Vectorization of Triads

We start by hypothesizing that for a scaffold searching task a theozyme for enzyme design may be in principle reduced to just the relative organization of backbones of crucial residues. Similar reduction was previously shown to be beneficial to the design of small-molecule binding sites [28]. Rationale for such reduction was given in the introduction section of the manuscript.

Another aspect that would benefit the scaffold searching problem is the ability to directly compare different backbone organizations using a distance metric defined for such objects. This requires a vectorization procedure as well. A trivial way to perform such vectorization is to express each triad as a vector of the coordinates of all its atoms. Once this is done, the root mean square deviation of atomic positions (RMSD) is a natural measure of similarity between two such objects. However, such a comparison requires an optimal superposition to be performed beforehand; what is more, such a description is redundant since it explicitly differentiates between translated and rotated copies of the same triad. It is possible to construct a more concise vectorization that is translation- and rotation-invariant and thus does not require prior superposition.

The backbone orientation of each residue may be represented as an oriented triangle with vertices N-CA-C (Fig. 1A). All measures of these triangles are fixed, since the lengths of the N-CA and CA-C bonds and the angle between them may be safely considered constant for all protein structures. The vectorization task is therefore reduced to the problem of encoding the relative orientations of three such triangles. Taking rotational and translational invariance into account, only 12 degrees of freedom are left. For a pair of residues, 6 values are sufficient to describe the relative orientation of their backbones: the 5 angles defined in Fig. 1B and the distance between any pair of their atoms. It is possible to construct an asymmetric triad vector by choosing a pivot residue and constructing two sets of 6 values each to explicitly encode the positions of the two remaining residues. However, we chose a different formalization in which each pair of residues forming a triad contributes 4 degrees of freedom, all expressed in angular or torsional form. Such a vectorization is thus symmetric and uniform in data ranges and units, which is useful for calculating the distance between two such vectors without the need for normalization (see Materials and Methods).
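The count of 12 can be made explicit. As a sketch, assuming each N-CA-C triangle is treated as a rigid body with 6 degrees of freedom (3 translational and 3 rotational), and that 6 global degrees of freedom are removed by translational and rotational invariance of the whole triad:

$$ 3 \times 6 - 6 = 12 = 3\ \text{pairs} \times 4\ \text{angular terms} $$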

Fig. 1. Catalytic triad typical organization and vectorization. A. Architecture of the trypsin’s catalytic triad. Backbone atom names are labeled and highlighted in gray. B. Vectorization introduced in current work.

3.2 Space of Architectures of Proteases’ Catalytic Triads

We set out to investigate whether our simple approach is useful for describing the space of active site architectures. In this work we focused on the catalytic triads of serine and cysteine endopeptidases. We found that clusterization based on our 12-dimensional vectorization is highly informative, clearly distinguishing between different classes of proteolytic triads and non-catalytic triads (Fig. 2). The following clusters were formed: subtilisin-like architectures (Fig. 3A), trypsin-like architectures (Fig. 3B), papain-like architectures (Fig. 3C) and caseinolytic protease-like (CLP-like) architectures (Fig. 3D). Backbone-based superposition onto the cluster centers revealed that a backbone-only representation is indeed sufficient to discriminate between various architectures that are more often described in terms of the relative orientations of their sidechains (Fig. 3). Our method was also sensitive enough to correctly assign a cluster label to PDB entries harbouring substitutions in their active sites and to those covalently or noncovalently inhibited, even if the sidechain geometry in these cases was distorted.

Fig. 2. Clusterization of catalytic triads architectures based on backbone vectorization.

Thus, the backbone-based approach was shown to be not only applicable to scaffold searching but also a powerful tool for studying the space of catalytic site architectures. Further generalization of the approach to other enzyme classes may produce new insight into the intricacies and evolutionary constraints of biocatalytic machineries.

Fig. 3. Catalytic triad architecture of representatives of all clusters. A. Subtilisin-like cluster. Numbering is based on PDB ID 3BX1. B. Trypsin-like cluster. Numbering is based on PDB ID 1AVW. C. Papain-like cluster. Numbering is based on PDB ID 5Z5O. D. CLP-like cluster. Numbering is based on PDB ID 6NAH. Carbon coloration is in accordance with Fig. 2.

3.3 Scaffold Searching

We used our study of proteases to devise a distance threshold for distinguishing between adequate and inadequate placements, as well as some filters to reduce the number of scaffold triads to search through. We found that the distributions of average distances to other cluster mates vary between triad architectures (Fig. 4), but always lie much lower than those of non-catalytic triads (minimal average distance of 224°). For a placement search for an exact architecture type it is thus preferable to use the relevant threshold, which we define as the 90th percentile of the distribution of mean distances within the cluster. However, due to the dramatic difference between catalytic and non-catalytic architectures, a milder threshold may be used, e.g. the maximum of the per-cluster thresholds (in our case 47°, for trypsin-like triads).
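The threshold definition can be sketched as follows, assuming a full pairwise distance matrix and a list of member indices for one cluster. The percentile is computed by linear interpolation between order statistics; this convention is an assumption, since the text does not state which percentile definition was used.

```python
import math

def mean_intra_distances(dist, members):
    """Mean distance from each cluster member to every other member."""
    means = []
    for i in members:
        others = [dist[i][j] for j in members if j != i]
        means.append(sum(others) / len(others))
    return means

def percentile(values, q):
    """q-th percentile (0-100) with linear interpolation between ranks."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100.0
    lo = math.floor(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)

def cluster_threshold(dist, members, q=90.0):
    """Distance threshold for one cluster: percentile of the mean distances."""
    return percentile(mean_intra_distances(dist, members), q)
```

The milder global threshold mentioned above would then be the maximum of `cluster_threshold` over all clusters.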

Fig. 4. Distributions of mean distances between each point in a cluster and every other point inside the same cluster. Upper-left: subtilisin-like triads, upper-right: trypsin-like triads, lower-left: papain-like triads, lower-right: CLP-like triads. Black dash represents the 90th percentile.

We further demonstrate our scaffold searching procedure on two examples: trypsin- and papain-templated searches against a TEV protease scaffold, and a trypsin-templated search against the whole CATH S40 non-redundant domains dataset.

Prior to performing scaffold search we preprocessed each structure by converting it into a set of vectorized triads. To reduce the number of triads we applied several filters derived from the distributions studied for natural catalytic triads in proteases (Fig. 5).

Fig. 5. Distributions of various auxiliary metrics useful for scaffold triads filtration prior to placement search. Upper-left: inter-CA-atomic distance, upper-right: inter-CB-atomic distance, lower-left: their difference, lower-right: area of the triangle built upon CA atoms of triad residues.

TEV protease is known to harbour a triad closely resembling that of trypsin despite being a cysteine protease [29]. On the other hand, it has little in common with papain-like architectures. The trypsin-templated search easily identified the correct placement of the TEV protease catalytic triad, with a distance of 41.83°, separated from all other placements by a significant margin (Table 1). In contrast, the papain-templated search did not find any promising placements, since even the best ones had a distance significantly higher than 47° from the reference vector (Table 2).

Table 1. Best 5 placements from trypsin-templated search against TEV protease scaffold.
Table 2. Best 5 placements from papain-templated search against TEV protease scaffold.

Both of these searches were completed in under 2 s on a single core. We decided to compare the computational efficiency of our approach with that of Rosetta Match on the trivial case of a trypsin-templated search against the trypsin scaffold. Naturally, both methods succeeded in correctly identifying an ideal placement. However, it took Rosetta Match 42 s to perform the task with the standard level of rotamer sampling and 1 min 10 s with sampling extended to 3σ. Extending the number of samples per constraint increases the computational time beyond 1 day. Our backbone-vectorization based approach took just 1.29 s. This comparison clearly shows the strength and practical applicability of our approach.

As an example of a near real-world application, we performed a search against the whole CATH S40 non-redundant domains dataset. It took on average 0.69 s to scan through all the possible placements inside a scaffold. In total, 16 placements with a distance below 47° were found (Table 3).

Table 3. Hits (distance <47°) from trypsin-templated search against CATH S40 domains dataset.

Unsurprisingly, the top of the table is occupied by other proteases. Starting from the 8th hit, 1AUK with a distance of 39.72°, there is a transition towards non-proteolytic folds. Whether they can in fact be successfully engineered into proteases using the recommendations from the scaffold search is a matter of further study. If so, the recommended threshold at the 90th percentile is indeed a reasonable choice. We note, however, that a protein designer may want to search for looser matches if means of computational backbone reengineering are at their disposal. The dataset used also only partially reflects real-world enzyme design studies, since more specific, potentially de novo modeled scaffolds may be more useful to scan. Specifying concrete use cases, fine-tuning the filters and adding new ones are certainly needed in order to turn the presented approach into a tool or a web service accessible to the global community.

4 Conclusions

In the presented study, a simple approach to the scaffold search problem of de novo enzyme design is proposed and validated. We show that by reducing the problem to the level of relative backbone orientations we can achieve a dramatic speed-up compared to existing approaches while producing meaningful results. Our solution makes it possible to routinely scan the whole structural proteome for promising placements of catalytic architectures on a workstation or a small cluster. What is more, the proposed vectorization makes it possible to uncover hidden patterns in the organization of enzymes that may lead to new fundamental discoveries in the field of structural enzymology.