1 Introduction

Thermolysin-like proteinases (TLPs), or peptidases of the M4 family [1], are a group of metalloendopeptidases found in dozens of gram-positive and gram-negative bacterial species. In addition, TLPs were identified in fungi and the archaeon Methanosarcina acetivorans [2]. TLPs contain a zinc ion in the catalytic site [3] and prefer to hydrolyze substrates with large hydrophobic amino acids (Leu or Phe) at the P1′ position [4–6] (using the numbering system of Schechter and Berger [7]).

Similar to many other proteolytic enzymes, TLPs are synthesized as precursors carrying structural elements that are absent in mature molecules. Such additional sequences can be localized in both N- and C-terminal regions of the precursor. Many amino-terminal extensions of the precursors contain a signal peptide, providing for extracellular secretion of the protein, and a prosequence. The prosequence acts as an intramolecular chaperone [8] that modulates protein folding [9–12], at least in some TLPs. In addition, the proregion of M4 peptidases can inhibit the corresponding mature proteins [12–15] and mediate their secretion [9, 16]. The C-terminal regions of TLP precursors seem to mediate their binding to insoluble substrates, as supported by the limited data available [17–21].

Our previous studies [22, 23] and ample data obtained by other groups indicate several structural types of TLP precursors. In this work, the primary structures of the full-length precursors of the M4 peptidase family were systematically analyzed.

2 Materials and Methods

Multiple sequence alignments were constructed by ClustalX 1.8 using the Gonnet series of protein weight matrices [24].

The set of sequences of full-length TLP precursors used in this work (Table 1) was compiled as follows. The sequences of M4 peptidases available in the MEROPS database (release 7.20) were complemented by four TLP sequences translated from GenBank: Ser_pro, Brb_lie, Gib_zea, and V_har-1. All of these 146 sequences were aligned, and the following sequences representing fragments of the full-length precursors were excluded: MER01042, MER01048, MER03790, MER03915, MER05572, MER12295, MER16511, MER19065, MER20843, MER27031, MER27032, MER27033, MER27034, MER40474, MER41301, MER48741 and MER50323. Among the remaining sequences, some that were closely related (identity exceeding 85%) were recognized: MER01025, MER01034, MER01353, and MER01927; MER01026 and MER01027; MER01035 and MER01038; MER01030, MER01354, MER03181, and MER21824; MER04711 and MER14408; MER05727, MER11853, MER29943, and MER30073; MER19561 and MER27496; MER20835, MER28888, and MER39814; MER20840, MER28890, and MER39811; MER21804, MER28887, and MER39810; MER25370, MER28622, MER29961, MER40142, and MER50804; MER26466 and MER45739; MER27846 and MER52672; MER28889 and MER39813; MER30706 and MER30781; MER39812 and MER43807; and MER48895 and MER50231. A single representative of each group (underlined) was included in the final set of 100 sequences so as to avoid overweighting closely related family members. The resulting set of sequences was realigned (see Footnote 1).
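The bookkeeping behind the final set size can be checked with a short sketch. The counts are taken from the paragraph above (146 aligned sequences, 17 excluded fragments, 17 groups of closely related sequences listed in order); the function name `final_set_size` is only an illustrative choice.

```python
# Sizes of the groups of closely related sequences (identity > 85%),
# in the order in which the groups are listed in the text.
INITIAL_TOTAL = 146
FRAGMENTS_EXCLUDED = 17
RELATED_GROUP_SIZES = [4, 2, 2, 4, 2, 4, 2, 3, 3, 3, 5, 2, 2, 2, 2, 2, 2]


def final_set_size(total, fragments, group_sizes):
    """Drop fragments, then keep a single representative of each group."""
    redundant = sum(size - 1 for size in group_sizes)
    return total - fragments - redundant


print(final_set_size(INITIAL_TOTAL, FRAGMENTS_EXCLUDED, RELATED_GROUP_SIZES))  # 100
```

The result agrees with the 100 sequences stated for the final set.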

Table 1 List of peptidases of the M4 family analyzed in this work

For the alignment of the amino-terminal regions (ATRs) of the precursors, all mature parts were discarded in two steps. In the first step, the cutoff point was set after the first unambiguously alignable cluster of the mature protein (after position 94 in thermolysin precursor; the numbering starts from the first amino acid in the mature protein). In the second step, the realigned sequences were cut off before the experimentally identified processing sites (after position −9 in thermolysin precursor). Note that protealysin-like short ATRs in the resulting sequence set still included 15–20 amino acids of the mature part. The sequences of Par_sp and Par_sp-1 deduced from the Parachlamydia sp. UWE25 genomic sequence demonstrate good similarity with other TLPs in the region of the Zn2+-binding motif (HEXXH_E) only. They could not be aligned with the whole set of sequences analyzed. Hence, Par_sp and Par_sp-1 sequences were aligned with Ser_pro and thermolysin sequences alone, and their C-terminal sequences after position −9 in thermolysin were excluded. The resulting N-terminal regions were added to the resulting set. This set of aligned sequences was used to construct the dendrogram shown in Fig. 1.

Fig. 1 Dendrogram of the N-terminal amino acid sequences in peptidase precursors of the M4 family

In the case of the carboxy-terminal regions (CTRs), long sequences corresponding to the pre- and propeptides, as well as mature parts, were excluded from the alignment of the whole set of the full-length precursors. The resulting sequence set included the regions following the last unambiguously alignable cluster of the mature part (following amino acid 316 in thermolysin). Then, the sequences shorter than 30 amino acids were excluded and the remaining sequences were realigned.

Signal peptides were identified in the amino acid sequences using the SignalP 3.0 server (http://www.cbs.dtu.dk/services/SignalP). The method incorporates a prediction of cleavage sites and a signal peptide/non-signal peptide prediction based on a combination of several artificial neural networks and hidden Markov models [25].

The sequence logos [26] were generated using the WebLogo tool (http://weblogo.berkeley.edu) [27]. The regions −144…−106, −90…−78, and −47…−16 of long ATRs and PPL motifs were extracted, and columns containing gaps in at least 95% of the sequences were eliminated.
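The gap-filtering step applied before logo generation can be sketched as follows. This is a minimal illustration, not the procedure actually used with WebLogo: the function name `drop_gappy_columns`, the `-` gap character, and the toy alignment are assumptions; only the 95% threshold comes from the text.

```python
def drop_gappy_columns(aligned, max_gap_fraction=0.95, gap="-"):
    """Remove columns whose gap fraction is at least max_gap_fraction.

    aligned: list of equal-length strings (rows of a multiple alignment).
    """
    if not aligned:
        return []
    n_seqs = len(aligned)
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("all aligned sequences must have the same length")
    # Keep a column only if its gap fraction stays below the threshold.
    keep = [
        col
        for col in range(length)
        if sum(seq[col] == gap for seq in aligned) / n_seqs < max_gap_fraction
    ]
    return ["".join(seq[col] for col in keep) for seq in aligned]


# Toy alignment: the middle column is all gaps and is removed.
print(drop_gappy_columns(["A-C", "A-G", "A-T"]))  # ['AC', 'AG', 'AT']
```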

Phylogenetic analysis was performed using PHYLIP (version 3.65) [28]. Protein distances were calculated using the Jones–Taylor–Thornton matrix. The unweighted pair group method with arithmetic mean (UPGMA) was applied to cluster the distance data.
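For readers unfamiliar with UPGMA, the clustering principle can be illustrated with a minimal sketch. This is not the PHYLIP implementation used in this work; the function name `upgma`, the nested-tuple tree representation, and the toy distance matrix are assumptions made for illustration only.

```python
def upgma(labels, dist):
    """Cluster items by UPGMA.

    labels: list of names; dist: symmetric matrix (list of lists).
    Returns (tree, heights): a nested-tuple tree and the node height
    (half of the inter-cluster distance) recorded at each merge.
    """
    n = len(labels)
    clusters = {i: (labels[i], 1) for i in range(n)}  # id -> (subtree, size)
    d = {(i, j): dist[i][j] for i in range(n) for j in range(i + 1, n)}
    heights = []
    next_id = n
    while len(clusters) > 1:
        # Pick the closest pair of active clusters.
        (a, b), d_ab = min(d.items(), key=lambda item: item[1])
        tree_a, size_a = clusters.pop(a)
        tree_b, size_b = clusters.pop(b)
        # Distance to every other cluster is the size-weighted mean.
        merged = {}
        for c in clusters:
            d_ac = d[(min(a, c), max(a, c))]
            d_bc = d[(min(b, c), max(b, c))]
            merged[c] = (size_a * d_ac + size_b * d_bc) / (size_a + size_b)
        d = {pair: v for pair, v in d.items() if a not in pair and b not in pair}
        for c, value in merged.items():
            d[(c, next_id)] = value  # c is always smaller than next_id
        clusters[next_id] = ((tree_a, tree_b), size_a + size_b)
        heights.append(d_ab / 2)
        next_id += 1
    tree, _ = next(iter(clusters.values()))
    return tree, heights


# Toy distances (hypothetical values, not from the analyzed data set).
names = ["A", "B", "C", "D"]
matrix = [
    [0, 2, 6, 10],
    [2, 0, 6, 10],
    [6, 6, 0, 10],
    [10, 10, 10, 0],
]
print(upgma(names, matrix))
```

Because UPGMA assumes a constant rate of change along all branches, the resulting dendrogram reflects overall sequence similarity rather than a rooted phylogeny in the strict sense.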

3 Results and Discussion

3.1 Structure and Function of Precursor Amino-Terminal Regions

Comparison of the amino acid sequences suggested the division of M4 family ATRs into two main groups (Fig. 1).

The first group includes 74 enzymes (from the resulting set of 100 sequences described in Sect. 2) with 183–287 amino acid long amino-terminal regions (long ATRs hereafter) removed during maturation in all studied cases. (ATR size is given for the enzymes with known processing sites.) All experimentally described proteins of this group are secreted extracellular enzymes with a typical leader peptide (presequence) at the N-terminus that can be reliably identified by prediction algorithms (e.g., SignalP 3.0 [25]). In the proteins with long ATRs, the signal peptide is followed by a region called prosequence, propeptide, or proregion. The propeptide length ranges from 162 to 264 amino acids.

As demonstrated for many TLPs with long ATRs, the proregion acts as an intramolecular chaperone essential for the production of the mature active enzyme [9–12, 29–31]. At the same time, the propeptide is not required to produce at least one catalytically active TLP [32]. The propeptides in TLPs with long ATRs function as inhibitors of the mature protein [12, 13, 15, 29–31]. Apparently, this function is significant in vivo to prevent premature release of the active enzyme, which can be harmful for the cell [33]. In addition, the propeptides can be essential for the secretion of the M4 proteinases with long ATRs [11, 16]. Note also that the propeptides of TLPs with long ATRs are cleaved autocatalytically [30, 31, 34, 35] and intramolecularly [34].

In complete agreement with the data obtained by other investigators, two primary conserved regions were identified in the propeptide of TLPs with long ATRs. These regions (Fig. 2a, b) correspond to amino acids −144…−106 and −47…−16 in the thermolysin sequence. The first one is similar to the fungalysin/thermolysin propeptide (FTP) motif (Protein families database of alignments and HMMs (PFAM) accession number PF07504) [36] and to the hydrophilic region (ProM) found in the middle of the P. aeruginosa elastase propeptide [37]. The second one is similar to the more hydrophobic (compared to ProM) region ProC in P. aeruginosa elastase propeptide proximal to the propeptide-processing site [37], and is a fragment of the peptidase propeptide and YPEB (PepSY) domain (PFAM accession number PF03413) [38]. Note that, although the FTP motif and PepSY domain are largely found in propeptides of the M4 peptidase family, they were also identified in non-M4 peptidases and many non-peptidase proteins [36]. A conserved Ala was found between these clusters (position −83) in most long ATRs (Fig. 2c). Note that the propeptides of M4 peptidases with long ATRs contain no amino acids conserved in all sequences analyzed. The data on the highly conserved amino acids identified in the prosequences (most of which have been identified by other investigators) are summarized in Table 2. These amino acids are most likely crucial for the functioning of propeptides of TLPs with long ATRs.

Fig. 2 LOGO presentation of the consensus sequences of the most conserved regions in long ATRs: −144…−106 (related to FTP motif) (a), −47…−16 (related to PepSY domain) (b), and region around Ala(−83) (c)

Table 2 Conserved amino acids in prosequences of TLPs with long ATRs

The second group of M4 peptidases recognized here from the structure of precursor ATRs includes 20 enzymes with N-terminal regions that are typically 50–60 amino acids long (short ATRs hereafter; Fig. 1). Most proteins of the group are putative, and only three enzymes with short ATRs have been isolated and described. These include the protease of Pectobacterium carotovorum (Erwinia carotovora subsp. carotovora) (Pec_car) [39], the minor protease of Serratia marcescens (Ser_mar) [40], and protealysin of Serratia proteamaculans (Ser_pro) [22]. No data on the functional role of the precursor ATRs are currently available for this group.

The previously studied Pec_car and Ser_mar were considered secretory proteins. When expressed in E. coli cells, Ser_mar was shown to accumulate in the extracellular space [40]. Analysis of the precursor ATRs in Pec_car and Ser_mar [39, 40] suggested that these regions contain a typical signal peptide. Indeed, some short ATRs in M4 peptidases contained a structure that, at first glance, resembled signal peptides. However, SignalP 3.0 algorithms failed to identify classical signal peptides in short ATRs. At the same time, a hydrophobic cluster in the region near the initiation methionine was found in the analyzed short ATRs (Fig. 3). This cluster, the PPL-motif, includes seven amino acids, three of which are invariant: two neighboring Pro residues and a Leu two residues downstream of them. An aromatic Tyr or His amino acid immediately follows the Pro–Pro site in most cases. The function of the PPL-motif remains unclear; it may represent a previously unknown cell sorting signal. Notably, this motif may be widespread in nature, since it is found in the N-terminal regions of some putative nonproteolytic proteins, including eukaryotic ones [22].

Fig. 3 LOGO presentation of the PPL-motif consensus. Protealysin numbering system; the position relative to the initiation methionine is given in parentheses

Of note is a unique feature of the ATR in Gib_zea, among the sequences analyzed. This ATR is the longest (about 100 amino acids) of the short ATRs and contains two PPL-motifs; one is close to the initiation methionine, and the other starts at position 50 from the initiation Met. Comparison of the left and right halves of the Gib_zea ATR demonstrates their 31% identity and 44% similarity, which is largely due to the region (of about 20 amino acids) around the PPL-motif. This can exemplify a duplication in the short ATR of TLP precursors.
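An identity value of this kind can be computed from a pairwise alignment as sketched below. The convention used here (identical residues divided by the number of columns in which neither sequence has a gap) is an assumption, since alignment programs differ in the denominator they report; the function name `percent_identity` and the toy strings are hypothetical and are not the Gib_zea sequences.

```python
def percent_identity(seq_a, seq_b, gap="-"):
    """Percent identity over columns where neither sequence has a gap."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = [(a, b) for a, b in zip(seq_a, seq_b) if a != gap and b != gap]
    if not compared:
        return 0.0
    matches = sum(a == b for a, b in compared)
    return 100.0 * matches / len(compared)


# Toy aligned fragments: 4 identical residues out of 5 compared columns.
print(percent_identity("PPYL-A", "PPHLKA"))  # 80.0
```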

Previously, we proposed that the enzymes with short ATRs constitute a separate group in the M4 family [22]. Here, we analyzed many sequences of TLP precursors to demonstrate that the M4 family indeed includes two groups of enzymes with a different ATR size and structure. At the same time, no intermediate variants indicating an evolutionary transition from one ATR type to the other have been identified. These ATR structures are not species-specific, and the same species can produce enzymes with both short and long ATRs.

Our analysis also demonstrates the differences in the N-terminal regions of mature M4 proteinases with long and short ATRs of the precursors. Figure 4 demonstrates different distances from the first amino acid to the first unambiguously alignable cluster of the N-terminal regions of mature protealysin relative to thermolysin and P. aeruginosa elastase, classical representatives of the M4 family. Thus, the N-terminal part of mature M4 peptidases whose precursors have short ATRs is about 30 amino acids shorter than that of the enzymes with long ATRs.

Fig. 4 Alignment of the N-terminal sequences in some TLPs, and the consensus sequence of the conserved peptidase_M4 domain (PF01447). Amino acids identical to PF01447 are shaded. Thermolysin numbering system

The analyzed set of proteins included six enzymes with ATRs which could not be unambiguously recognized as short or long ATRs (Fig. 1). Visual analysis suggests that the ATRs of Ren_sal and Col_psy precursors resemble long ATRs; Bra_jap and Neu_cra ATRs resemble short ATRs; while Par_sp and Par_sp-1 ATRs resemble neither long nor short ATRs.

Finally, we would like to emphasize an important structural feature of the ATR in M4 precursors. It is common knowledge that TLP prosequences are more tolerant to mutations compared to the mature parts. For instance, the proportion of identical amino acids in ATRs is lower compared to the mature regions (Fig. 5); the majority of modifications in the most conserved amino acids in TLP prosequences (such data are available for long ATRs only) do not completely abolish enzyme activity [37]; TLP propeptides can be replaced with heterologous ones [12]; and long propeptide regions can be replaced with heterologous ones [41] without disturbing protein folding and processing. At the same time, the analysis of mutations in the ATRs and mature regions of the proteins with both short and long ATRs (Fig. 5) demonstrates a concerted accumulation of mutations in the ATRs and mature regions, i.e., they evolved in parallel and no exchange of these domains between different enzymes took place (which is not the case for the precursor CTRs; see below).

Fig. 5 The proportion of identical amino acids in mature domains (♦) and ATRs (■) of M4 peptidases. (a) Enzymes with long ATRs as compared to thermolysin. Pearson’s product moment correlation coefficient between mature domains and ATRs was 0.85. (b) Enzymes with short ATRs as compared to protealysin. Point mismatch corresponding to Ser_mar ATR (encircled) is due to sequence errors in the mature domain of this enzyme [22]. Pearson’s product moment correlation coefficient between mature domains and ATRs was 0.59 (regardless of Ser_mar)
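The correlation coefficient quoted in the caption of Fig. 5 is the standard Pearson product moment coefficient. A minimal sketch of its calculation is given below; the function name `pearson_r` and the example identity values are hypothetical and do not reproduce the data plotted in Fig. 5.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson product moment correlation between two equal-length samples."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two samples of equal length (at least 2 points)")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / sqrt(var_x * var_y)


# Hypothetical identity fractions relative to a reference enzyme:
# one value per enzyme for the mature domain and for the ATR.
mature_identity = [0.90, 0.75, 0.60, 0.45, 0.30]
atr_identity = [0.70, 0.50, 0.42, 0.25, 0.15]
print(round(pearson_r(mature_identity, atr_identity), 2))
```

A coefficient close to 1 indicates that enzymes with more diverged mature domains also have more diverged ATRs, which is the pattern interpreted in the text as concerted accumulation of mutations.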

One can speculate that this system of variable prosequences serves as a specific evolution module for changing the functional activity of mature proteins in vivo. On the one hand, mutations in prosequences can have an effect on the enzyme accumulation and localization in the cell [37]. On the other hand, there are experimental data demonstrating the direct effect of the changes in propeptides on the catalytic properties of mature proteinases. To date, there are at least two such examples. First, a point mutation in subtilisin propeptide changes the secondary structure, thermostability, and substrate specificity of the mature protein [42, 43]. Second, cathepsin E with the propeptide from cathepsin D had a different catalytic efficiency and constant of inhibition by protein inhibitors [44].

3.2 Structure and Function of Precursor Carboxy-Terminal Regions

Our analysis of the deduced amino acid sequences of the full-length precursors of M4 peptidases demonstrates an additional C-terminal extension relative to the catalytic part in 37 enzymes from the resulting set of 100 sequences described in Sect. 2 (Table 1). Thirty-three of them are TLPs with long ATRs, three have short ATRs, and one (Col_psy) cannot be assigned to either group (see above).
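The proportions implied by these counts can be made explicit with a short calculation. The group sizes (74 long ATRs, 20 short ATRs, and 6 unassigned sequences) and the CTR counts are taken from the text; the dictionary and function names are illustrative only.

```python
GROUP_SIZES = {"long ATR": 74, "short ATR": 20, "unassigned": 6}
WITH_CTR = {"long ATR": 33, "short ATR": 3, "unassigned": 1}


def ctr_fractions(sizes, ctr_counts):
    """Fraction of precursors carrying a CTR, per ATR group and overall."""
    fractions = {group: ctr_counts[group] / sizes[group] for group in sizes}
    fractions["all"] = sum(ctr_counts.values()) / sum(sizes.values())
    return fractions


for group, fraction in ctr_fractions(GROUP_SIZES, WITH_CTR).items():
    print(f"{group}: {fraction:.0%}")
```

CTRs are thus present in about 45% of precursors with long ATRs, 15% of those with short ATRs, and 37% of the whole set.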

The length of the precursor CTRs ranges from 110 to 670 amino acids, and their primary structure is even more heterogeneous than in ATRs. The absence of CTRs has been experimentally confirmed for certain mature active proteins [23, 45–50], which allows us to call them C-terminal prosequences. Conversely, other M4 proteinases were isolated from natural hosts exclusively in the form containing CTRs, and were also catalytically active [18, 19, 51–53]. However, thorough investigation of metalloproteinases with CTRs usually demonstrates the presence of both protein forms, with and without CTRs [17–19, 54]. In vitro, recombinant metalloproteinases are also transformed from the CTR-containing to the short form [30, 31, 53, 55], apparently by an autocatalytic mechanism [21, 30, 31]. To summarize the published data, it appears that most M4 peptidases lose their CTR, although the lifetime of the CTR-containing forms varies considerably between different proteins.

The most significant feature of CTRs in TLP precursors is the presence of previously identified conserved domains (Table 3 and Fig. 6) also typical for many other proteins. Such domains have been found in many bacterial and archaeal proteins, as well as mammalian and vertebrate proteins. The wide distribution of the conserved domains found in CTRs of M4 peptidases suggests domain shuffling between TLPs and proteins of different groups.

Table 3 Conserved CTR domains in peptidase precursors of the M4 family

Fig. 6 Architecture of TLP C-terminal regions containing conserved domains. Pattern A is typical of V_cho, V_ang, V_flu, V_sp, V_pro, V_vul, Aer_cav, and Ther_sp; pattern B, V_par and V_har; pattern C, Pse_sp, Alt_sp, Ant_bac, and Myx_xan; pattern D, V_harv-1, She_ama, and She_bal; pattern E, Col_psy; pattern F, Alt_sp-1; pattern G, Str_ave-4, Str_coe-1, and Str_gri; pattern H, Met_ace; pattern I, Str_ave-1; pattern J, Str_ave-7; pattern K, B_cer-2; pattern L, Clo_ace; and pattern M, Str_exf. M4 peptidase designations are given in Table 1. Conserved domain designations are given in Table 3

Experimental data on the functions of CTRs in M4 peptidases are not abundant. However, the available data suggest that these regions provide for the enzyme binding to insoluble protein and/or polysaccharide substrates. In the case of Vibrio vulnificus proteinase (V_vul), the absence of the PPC domain-containing CTR had no effect on the efficiency of hydrolysis of soluble proteins. At the same time, the form without CTR demonstrated an increased rate of hydrolysis of short peptide substrates and, conversely, a considerably decreased capacity to bind and hydrolyze insoluble proteins such as collagen and elastin. In addition, the CTR removal decreased the hemorrhagic activity in vivo [20, 21]. Similar data were obtained for Vibrio fluvialis proteinase (V_flu), which has a PPC domain in its CTR, too. In contrast to the CTR-containing form, and similar to V_vul, the CTR-less proteinase demonstrated insignificant activity towards insoluble elastin and was unable to agglutinate rabbit erythrocytes [53]. The CTR of Myxococcus xanthus metalloproteinase, with two PPC domains, proved to be bound to the extracellular matrix of this bacterium in vivo [17]. At the same time, the extracellular matrix of M. xanthus is arranged as fibrils composed of a carbohydrate backbone with associated proteins [56]. The isolated CTR of thermolysin-like metalloproteinase, a component of the chitinolytic complex in marine bacterium Alteromonas sp. (Alt_sp-1), contains three PKD domains and was shown to bind cellulose, chitosan, as well as α- and β-chitins [18, 19].

The conclusion that the CTR in M4 peptidases is a substrate-binding module is confirmed by the data on the properties of conserved domains in other enzyme groups. For instance, two C-terminal PPC domains in class I collagenase from Clostridium histolyticum (Col G) can bind different collagen types [57, 58]. According to the PFAM database, most PKD domains are found in the extracellular parts of proteins interacting with other proteins or polysaccharides. Substrate binding is one of the proposed functions of P-domains of proprotein convertases [59–61], although these elements also mediate proper cellular localization of the enzymes [60–64], proprotein convertase stability, and, possibly, folding of the catalytic domain and processing [64–67].

The CTR in M4 peptidases can also mediate cellular localization of the enzymes. For instance, the CTR in B_cer-2 includes conserved domains typical of a variety of bacterial surface proteins (Fig. 6), suggesting that the CTRs are responsible for the B_cer-2 localization on the bacterial cell surface.

Unusually, the catalytic domain peptidase_M28 is found in the CTR of M4 peptidase from Streptomyces exfoliatus (Str_exf). The M28 peptidase family includes amino- and carboxypeptidases from Bacteria, Archaea, Protozoa, Fungi, plants, and animals. The functional role of the combination of two peptidases is not known.

Thus, the data on the functions of TLP CTRs and their conserved domains suggest that most CTRs are modules binding high molecular weight insoluble substrates. At the same time, CTRs are absent from many mature M4 peptidases. What can be the function of such removed substrate-binding modules? The above-mentioned data for V_vul and V_flu demonstrate that CTR-less proteins better hydrolyze low molecular weight soluble substrates, while CTR-containing enzymes better hydrolyze high molecular weight insoluble ones [21, 53]. In this context, the following scenario can be proposed: a CTR-containing enzyme binds an insoluble substrate and starts its hydrolysis. Later, the enzyme is released from the CTR anchor and efficiently hydrolyzes the resulting low molecular weight products.

3.3 TLPs Encoded within the Same Genome

Analysis of the available data on the genome structure indicates that some genomes include several peptidases of the M4 family (Table 1). In many cases, the genes in the same species code for TLPs with different precursor structures. Let us consider the most vivid example.

The highest number of genes of M4 proteinases is found in the genome of S. avermitilis MA-4680. One of eight TLPs in this species has a short ATR, while the other seven proteins have long ATRs and four of them have CTRs (Fig. 7). Analysis of their amino acid sequences demonstrated the general pattern observed for the whole M4 family. The precursor ATRs have a lower similarity compared to the mature parts, and the similarity levels of the ATRs and of the mature parts are correlated between different TLPs from S. avermitilis. The CTR structure of TLPs from S. avermitilis MA-4680 also supports the concept of domain shuffling.

Fig. 7 Architecture of peptidase precursors of the M4 family in Streptomyces avermitilis MA-4680. Str_ave, Str_ave-1, Str_ave-2, Str_ave-4, Str_ave-5, and Str_ave-7 have 61–79% identity between the catalytic parts. Str_ave-1, Str_ave-4, and Str_ave-7 have additional CTRs with seemingly low similarity. However, thorough analysis of these CTR sequences demonstrates that Str_ave-4 and Str_ave-1 P-domains have 53% identity, while Str_ave-1 and Str_ave-7 He_PIG domains have 63% identity. Str_ave-3 significantly differs from other TLPs in S. avermitilis MA-4680. The similarity of the catalytic part with other S. avermitilis TLPs with long ATRs is low (about 20% identity) and the CTR of Str_ave-3 contains no known conserved domains. S, signal peptide; FTP, fungalysin/thermolysin propeptide motif (PF07504); PEPSY, peptidase propeptide and YPEB domain (PF03413); sATR, short ATR of TLPs; peptidase M4, TLP catalytic region including Peptidase_M4 (PF01447) and Peptidase_M4_C (PF02868) domains; P, proprotein convertase P-domain (PF01483); and He_PIG, putative Ig-like domain (PF05345)

4 Conclusions

In summary, this analysis has revealed the following significant facts concerning the structure of peptidase precursors of the M4 family:

  • no precursor of M4 peptidases without amino-terminal regions (ATRs) in addition to the catalytic domain has been found;

  • there are two ATR types: short and long ATRs of about 50 and 200 amino acids in length, respectively;

  • long ATRs contain no amino acids conserved in all M4 peptidases;

  • no classical signal peptides have been identified in short ATRs, but short ATRs proved to contain a conserved PPL-motif near the initiation methionine;

  • the accumulation of mutations in ATRs of both types correlates with that in the catalytic domains;

  • about one-third of TLP precursors have C-terminal extensions (CTRs); they are found in about half of precursors with long ATRs, but occur more rarely in precursors with short ATRs; CTRs contain previously identified conserved domains typical of many other proteins, too, and likely underlie the interaction with high molecular weight substrates.