Introduction

Malaria remains one of the most important life-threatening diseases. It afflicts approximately 300–500 million people a year, killing 1–2 million, mostly in the developing countries in the tropical or subtropical regions. The causative agents of malaria are a group of protozoan parasites in the genus Plasmodium. The rapid spread of the parasite populations resistant to the available antimalarial drugs underscores the pressing need for new drugs.

Genomics-based searches for new antimalarial targets hold considerable promise (Carlton et al. 2002; Gardner et al. 2002; Carlton 2003), but have been limited by a practical difficulty: our inability to assign a functional identity to a large fraction of the recognized open reading frames (ORFs) in the parasite genome. In the case of Plasmodium falciparum, which causes the most severe form of malaria, about 60% of the 5,300 ORFs were annotated as “hypothetical” because they lack statistically significant sequence similarity to proteins of known function or structure (Gardner et al. 2002). An effective way to circumvent this problem is to develop new algorithms that can capture subtle similarities between the unknown proteins and the annotated proteins in protein databases.

We propose to improve protease prediction among those uncharacterized Plasmodium proteins with a computational prediction approach that applies support vector machines (SVMs) using extended profile kernels for remote homology detection. SVMs are a family of machine learning algorithms for classification and regression problems (Vapnik 1998). An SVM classifier is a linear function that separates the training data into two classes while maximizing the geometric margin between them in a feature space. Our binary classification problem is to classify an uncharacterized protein sequence as a member or a non-member of a given protein family with an SVM classifier learned from the training proteins. The SVM-based classification of protein sequences uses negative sequences (proteins outside the protein family) as well as positive sequences (members of the protein family) to learn the difference between the two classes. This discriminative nature of SVMs distinguishes them from alignment-based approaches that build models only with positive sequences (Karplus et al. 1998), and often results in better empirical classification performance. Another desirable property of SVMs is that learning an SVM classifier depends only on the pairwise similarity between the examples; therefore, we can use any symmetric and positive-definite similarity function, called a kernel, to achieve better classification performance and faster computation. Recently, it has been shown that SVM-based kernel approaches are especially effective in remote homology detection (Jaakkola et al. 2000; Liao and Noble 2003; Leslie et al. 2004; Kuang et al. 2005; Rangwala and Karypis 2005). Our previous work on profile kernels (Kuang et al. 2005) established state-of-the-art performance for remote homology detection. The profile kernel is a function that measures the similarity of two protein sequence profiles based on their representation in a high-dimensional vector space indexed by all k-mers (k-length subsequences of amino acids). We modify the original profile kernel, which is defined on a feature space indexed by subsequences of a fixed length, to include subsequences with lengths in a certain range as features. We found that the extended profile kernels achieve significant improvements in protein classification on the SCOP benchmark dataset (Murzin et al. 1995; results not shown).

In this proof-of-concept study, we attempt to combine powerful SVM classifiers and the traditional alignment-based PSI-Blast algorithm to predict the protease complements (degradomes) in Plasmodium. The proteases were chosen because:

  1. They have been thought of as attractive drug targets. Firstly, proteases, the digestive enzymes that hydrolyze peptides, are essential for the parasite life cycle: for example, aspartic proteases (plasmepsins) (Coombs et al. 2001; Goldberg 2005; Ersmark et al. 2006), cysteine proteases (falcipains) (Rosenthal et al. 2002, 2004) and a metalloprotease (falcilysin) (Eggleson et al. 1999; Murata and Goldberg 2003a, b) are actively involved in hemoglobin digestion for parasite nutrition; serine proteases (subtilisins) are important for red blood cell invasion (Withers-Martinez et al. 2004); and, recently, proteases have been implicated in cell cycle progression and cell signaling (Baker et al. 2006; O’Donnell et al. 2006; Le Chat et al. 2007; Meslin et al. 2007). Secondly, it is feasible to design specific inhibitors for proteases if the mechanism of protease action is known or can be predicted. Various types of inhibitors have been shown to effectively block parasite growth and/or invasion (Sharma 2007). Emerging techniques in combinatorial high-throughput screening and computational structure-based drug design (SBDD) have made promising contributions to the recent progress in searching out and designing malarial protease inhibitors: combinatorial libraries have been synthesized and screened for plasmepsins (Carroll et al. 1998; Haque et al. 1999; Kasam et al. 2007) and a group of inhibitors for falcipains has been identified as well (Li et al. 1996; Scheidt et al. 1998; Pandey et al. 2006). Thirdly, because of the remote evolutionary relatedness between the malaria parasite and the human host, the inhibitors designed based on malaria protease targets should have little or no adverse effect on the host.

  2. A large amount of relevant data is available for the protease family, which makes the application of kernel-based machine learning feasible. Substantial knowledge has been accumulated and a specialized expert-curated database, MEROPS, is available for proteases; it includes a catalog of characterized and predicted proteases in over 3,100 organisms (Rawlings et al. 2008).

Here we report a catalog of the proteases in five species of Plasmodium, including the two human malaria parasites P. falciparum and P. vivax, and the three parasites P. yoelii yoelii, P. berghei, and P. chabaudi, which serve as the rodent models. This catalog points to novel proteases and protease-regulated cellular processes as candidates for functional characterization.

Methods

Data preparation

The predicted ORFs of the five Plasmodium species were downloaded from the PlasmoDB database (http://www.plasmodb.org/, release 5.2). In this release, there are 5,411 ORFs in the P. falciparum genome, 5,352 in the P. vivax genome, 7,861 in the P. yoelii genome, 12,235 in the P. berghei genome, and 15,007 in the P. chabaudi genome. A total of 47,499 known peptidase units and peptidase inhibitor units in the MEROPS database (http://merops.sanger.ac.uk/, release 7.4) were used as the target sequences for the PSI-Blast search and SVM training.

In the PSI-Blast search of the unidentified ORFs against the MEROPS sequences, a single iteration and the default e-value threshold of 0.0001 were chosen to avoid retrieving too many false positives. The training data for SVM remote homology classification were constructed from the MEROPS database and the annotated proteins in the P. falciparum, P. vivax and P. yoelii genomes. In the MEROPS database, peptidase units and peptidase inhibitors are organized into a hierarchy with three levels: clans, superfamilies and families, from the root to the leaves. We randomly sampled 1,208 proteases from all the protease families, with the sample size from each family proportional to the total number of proteases in the family. We combined the 1,208 selected proteases from MEROPS with the 91 known P. falciparum proteases, the 72 known P. vivax proteases and the 98 known P. yoelii proteases to form the positive training set. We manually selected 1,087 annotated P. falciparum proteins, 553 annotated P. vivax proteins and 507 annotated P. yoelii proteins that are clearly not functionally related to any protease as the negative set, under the assumption that negative proteins from Plasmodium species will be more sensitive examples for detecting their remote homologs in the uncharacterized ORFs. The construction is designed to maximize detection performance through comprehensive representation of the data, while the careful selection of training examples keeps the data size tractable for learning.
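The family-proportional sampling described above can be sketched as follows. This is a minimal illustration, not the procedure actually used: the function name `proportional_sample`, the largest-remainder rounding rule and the fixed random seed are all assumptions introduced here, and the text does not say how fractional quotas were rounded.

```python
import random
from collections import defaultdict

def proportional_sample(family_of, total, seed=0):
    """Pick `total` sequence ids so that each family contributes a share
    proportional to its size (assumes total <= number of sequences).

    family_of: dict mapping sequence id -> family label (hypothetical input).
    Rounding uses the largest-remainder rule, which is an assumption.
    """
    by_family = defaultdict(list)
    for seq_id, family in family_of.items():
        by_family[family].append(seq_id)

    n = len(family_of)
    # Ideal fractional quota for each family.
    ideal = {fam: total * len(ids) / n for fam, ids in by_family.items()}
    quota = {fam: int(q) for fam, q in ideal.items()}

    # Give the slots lost to rounding to the largest fractional remainders.
    leftover = total - sum(quota.values())
    by_remainder = sorted(ideal, key=lambda f: ideal[f] - quota[f], reverse=True)
    for fam in by_remainder[:leftover]:
        quota[fam] += 1

    rng = random.Random(seed)
    sample = []
    for fam, ids in sorted(by_family.items()):
        sample.extend(rng.sample(sorted(ids), quota[fam]))
    return sample
```

With real data, `family_of` would map each MEROPS sequence identifier to its family and `total` would be 1,208.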

For all the protein sequences in the training set and the ORFs, we computed the sequence profiles by searching against a non-redundant protein database using PSI-Blast with five iterations and the default e-value threshold 0.0001. The positional frequencies of amino acids in the profiles were smoothed using background frequencies. We used the smoothed emission probabilities in computing the profile kernels for SVM training.
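The text does not state the smoothing formula, so the following is only a sketch of one common choice: linear interpolation of each profile column with the background distribution. The function name `smooth_profile` and the mixing weight `weight` are hypothetical and are not values from this study. The practical point is that every amino acid ends up with a non-zero emission probability, so its negative log-likelihood stays finite.

```python
def smooth_profile(profile, background, weight=0.1):
    """Mix every profile column with background amino acid frequencies.

    profile:    list of dicts, one per position, amino acid -> frequency.
    background: dict, amino acid -> background frequency (sums to 1).
    weight:     hypothetical mixing weight; NOT a parameter from the paper.
    """
    smoothed = []
    for column in profile:
        mixed = {
            aa: (1.0 - weight) * column.get(aa, 0.0) + weight * q
            for aa, q in background.items()
        }
        total = sum(mixed.values())
        # Renormalize so each column is a proper probability distribution.
        smoothed.append({aa: p / total for aa, p in mixed.items()})
    return smoothed
```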

Support vector machines

Support vector machines are a family of machine learning algorithms for classification and regression problems (Vapnik 1998; Cristianini and Shawe-Taylor 2000). The SVM learning algorithm finds a linear classifier \( f(x) = \langle w, x \rangle + b \) (\( w \in \mathbb{R}^{n} \), \( b \in \mathbb{R} \)) that discriminates between examples of the positive and the negative classes with a “large margin”. The learned linear classifier defines a decision boundary, the hyperplane \( \langle w, x \rangle + b = 0 \). A test example x is classified as positive if f(x) > 0, and as negative otherwise. Empirically, most real datasets are not linearly separable in the feature space, so such an SVM cannot be learned directly. For these harder cases, a soft margin SVM (Cristianini and Shawe-Taylor 2000), which incorporates a trade-off between maximizing the geometric margin and minimizing margin violations on the training set, can be learned to handle the exceptions. One important property of the SVM learning problem is that in its dual optimization form, we can replace the inner product \( \langle x, y \rangle \) between x and y by a kernel function K(x, y); here, the kernel implicitly maps (possibly nonlinearly) the original input vector space to a feature space (a Hilbert space) with some feature mapping Φ, i.e. the kernel K is defined with the mapping Φ and \( K(x, y) = \langle \Phi(x), \Phi(y) \rangle \). If Φ is a non-linear mapping from the original input space, the SVM can handle non-linear data by learning a linear classifier in the new feature space.
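To make the kernel substitution concrete, the sketch below checks numerically that a simple quadratic kernel equals an inner product under an explicit feature map Φ, and evaluates a classifier in its dual form \( f(x) = \sum_i \alpha_i y_i K(x_i, x) + b \). The quadratic kernel is chosen only for illustration (it is not the profile kernel), and the support vectors, coefficients and function names are made-up values, not learned ones.

```python
import itertools

def inner(u, v):
    """Ordinary inner product <u, v>."""
    return sum(a * b for a, b in zip(u, v))

def quadratic_kernel(x, y):
    """K(x, y) = <x, y>^2, computed without leaving the input space."""
    return inner(x, y) ** 2

def quadratic_feature_map(x):
    """Explicit Phi for the quadratic kernel: all ordered products x_i * x_j."""
    return [a * b for a, b in itertools.product(x, repeat=2)]

def decision(x, support, alphas, labels, b, kernel):
    """Dual-form classifier: f(x) = sum_i alpha_i * y_i * K(x_i, x) + b."""
    return sum(a * t * kernel(s, x) for s, a, t in zip(support, alphas, labels)) + b

# The kernel value and the feature-space inner product agree.
x, y = [1.0, 2.0], [3.0, -1.0]
lhs = quadratic_kernel(x, y)
rhs = inner(quadratic_feature_map(x), quadratic_feature_map(y))
assert abs(lhs - rhs) < 1e-9

# Toy classifier with hypothetical coefficients; positive if f(x) > 0.
f = decision([2.0, 0.0], [[1.0, 0.0], [0.0, 1.0]], [1.0, 1.0], [1, -1], 0.0,
             quadratic_kernel)
print("positive" if f > 0 else "negative")
```

Because training and prediction only ever call `kernel`, the feature map never has to be built explicitly, which is what makes very high-dimensional k-mer feature spaces usable.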

We used the publicly available SPIDER package (http://www.kyb.tuebingen.mpg.de/bs/people/spider/) to learn the binary classifiers in our experiments. Due to the computational cost of constructing the SVM classifiers, we applied the SVM classification only to the three species of most interest in this study: P. falciparum, P. vivax and P. yoelii.

Extended profile kernels

We chose to use profile kernels (Kuang et al. 2005) for SVM learning since they have been shown to be the state-of-the-art kernels for remote homology detection. Profile kernels are kernel functions for measuring the similarity between a pair of protein sequence profiles based on their representation in a high-dimensional feature space indexed by all k-mers (k-length subsequences of amino acids). For a sequence x and its sequence profile P(x) (e.g. PSI-Blast profile), the positional mutation neighborhood at position j with threshold δ is defined to be the set of k-mers \( \beta = b_{1} b_{2} \ldots b_{k} \) satisfying a likelihood inequality with respect to the corresponding block of the profile P(x), as follows:

$$ M_{(k,\,\delta )} (P(x[j + 1:j + k])) = \left\{ \beta = b_{1} b_{2} \ldots b_{k} : \sum\limits_{i = 1}^{k} -\log P_{j + i} (b_{i}) \le k\delta \right\} $$

Note that in the definition, \( P_{j+i}(b_{i}) \) denotes the emission probability of amino acid \( b_{i} \) at position j + i in the profile P(x). Let \( \Sigma \) be the alphabet of amino acids; the profile feature mapping of profile kernels can then be defined as \( \Phi_{k,\,\delta}(P(x)) = (\phi_{\beta}(P(x)))_{\beta \in \Sigma^{k}} \), where the coordinate \( \phi_{\beta}(P(x)) \) is the number of occurrences of β in the positional mutation neighborhoods \( M_{(k,\,\delta)} \) of P(x), i.e. the number of positions j whose neighborhood contains β.
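The definition can be illustrated with a brute-force sketch that enumerates every k-mer at every position. This is for exposition only: it is exponential in k, whereas the actual computation uses a trie, and the function names and the list-of-dicts profile representation are assumptions made here. Profile columns are assumed to be smoothed so that no emission probability is zero.

```python
import itertools
import math
from collections import Counter

def mutation_neighborhood(profile, j, k, delta, alphabet):
    """k-mers b_1..b_k with sum_{i=1..k} -log P_{j+i}(b_i) <= k * delta.

    `profile` is a 0-based list of dicts (amino acid -> probability), so
    profile[j + i - 1] holds the 1-based profile position j + i.
    """
    block = profile[j:j + k]
    hits = []
    for kmer in itertools.product(alphabet, repeat=k):
        cost = sum(-math.log(col[b]) for col, b in zip(block, kmer))
        if cost <= k * delta:
            hits.append("".join(kmer))
    return hits

def profile_features(profile, k, delta, alphabet):
    """phi_beta = number of positions whose neighborhood contains beta."""
    feats = Counter()
    for j in range(len(profile) - k + 1):
        feats.update(mutation_neighborhood(profile, j, k, delta, alphabet))
    return feats

def profile_kernel(p, q, k, delta, alphabet):
    """Inner product of the two profile feature vectors."""
    fp = profile_features(p, k, delta, alphabet)
    fq = profile_features(q, k, delta, alphabet)
    return sum(count * fq[beta] for beta, count in fp.items())
```

A larger δ admits more k-mers into each neighborhood, so the feature vectors become denser and more tolerant of substitutions; a small δ approaches exact k-mer matching against the most likely residues.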

We extended the original profile kernels by considering a new feature space indexed by all subsequences with lengths in a range \( [k_{\min}, k_{\max}] \), i.e. the feature space is indexed by all the k-mers with \( k_{\min} \le k \le k_{\max} \). The assumption behind this extension is that the lengths of most meaningful subsequences (motifs) fall within a certain range. By limiting the possible length of the subsequences, the new feature space can cover most of the motifs without mapping to a space of much higher dimension. If we use the same threshold δ for computing the positional mutation neighborhoods of k-mers with \( k_{\min} \le k \le k_{\max} \), the extended profile kernel is simply the sum of the profile kernels computed with k-mers of each length in \( [k_{\min}, k_{\max}] \). Since profile kernels can be computed efficiently with a trie data structure in time linear in the input sequence length, the time complexity of computing the combined profile kernel is also linear in sequence length.

In our experiments, \( k_{\min} = 4 \) and \( k_{\max} = 6 \) were chosen as the range of the k-mers by cross-validation on the SCOP benchmark dataset for remote homology detection (Kuang et al. 2005). The extended profile kernels are normalized, and the SVM parameters are set to the defaults used in the benchmark experiments described in Kuang et al. (2005).
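The summation over k-mer lengths and the normalization step can be sketched as below. Two simplifications are assumed and should not be read as the method used here: exact k-mer counts on raw sequences stand in for the profile mutation neighborhoods, and the normalization is taken to be the usual cosine form \( K(x, y) / \sqrt{K(x, x) K(y, y)} \), which the text does not spell out. All function names are illustrative.

```python
import math
from collections import Counter

def kmer_counts(seq, k):
    """Stand-in feature map: exact k-mer counts (NOT profile neighborhoods)."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def dot(a, b):
    """Inner product of two sparse count vectors."""
    return sum(count * b[feature] for feature, count in a.items())

def extended_kernel(x, y, k_min=4, k_max=6):
    """Sum of the fixed-length kernels over k_min <= k <= k_max."""
    return sum(dot(kmer_counts(x, k), kmer_counts(y, k))
               for k in range(k_min, k_max + 1))

def normalized_kernel(x, y, k_min=4, k_max=6):
    """Assumed cosine normalization, so that K(x, x) = 1 for non-trivial x."""
    denom = math.sqrt(extended_kernel(x, x, k_min, k_max) *
                      extended_kernel(y, y, k_min, k_max))
    return extended_kernel(x, y, k_min, k_max) / denom if denom else 0.0
```

Since a sum of positive semi-definite kernels is itself positive semi-definite, the combined function remains a valid kernel for SVM training.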

Multiple alignment and phylogenetic analysis

Multiple alignments were generated using the T-coffee program (http://tcoffee.vital-it.ch/cgi-bin/Tcoffee/tcoffee_cgi/index.cgi) (Notredame et al. 2000), followed by manual inspection and editing. Graphic representations of the alignment and consensus sequences were produced with the program BOXSHADE (http://www.ch.embnet.org/software/BOX_form.html). Phylogenetic trees were inferred by the neighbor-joining method using MEGA (http://www.megasoftware.net/) (Tamura et al. 2007). Unweighted Maximum Parsimony (as implemented in PAUP 4.0) and Maximum Likelihood (as implemented in PHYLIP) (Felsenstein 1981) were used to examine the robustness of the inferred phylogeny (Hall et al. 2005). Bootstrap resampling with 1,000 pseudo-replicates was carried out to assess support for individual branches. Branches with bootstrap values of <50% were collapsed and treated as polytomies.

Results and discussion

Protease prediction with PSI-Blast and PF-SVM

In our study, we applied both SVMs using profile kernels and PSI-Blast to identify the proteases in the three complete or nearly complete genomes of P. falciparum, P. vivax, and P. yoelii yoelii. For P. berghei and P. chabaudi, only PSI-Blast was used, for three empirical reasons: (1) the sequencing of these two genomes is not yet complete, and gene finding and annotation are still at an early stage; (2) very little is known about the proteolytic machinery in these genomes; (3) the numbers of predicted ORFs in these genomes (12,235 in P. berghei and 15,007 in P. chabaudi) are considerably larger than those in the other three species, owing to the fragmented nature of the sequence data and the incomplete annotation of these genomes (Hall et al. 2005), so a much longer time would be required for computing the extended profile kernels.

The positively classified ORFs by the PF-SVM and the ORFs with e-value < 1E-5 in the PSI-Blast search were subjected to further analysis. The domain organization of the predicted proteases was revealed by Pfam search (Finn et al. 2008). To annotate each predicted protease, we used the known protease sequence or protease domain with the highest similarity as a reference. The catalytic type and protease family were predicted in accordance with the MEROPS classification system, and the enzyme was named in accordance with the SWISS-PROT peptidase nomenclature (http://www.expasy.ch/cgi-bin/lists?peptidas.txt) and the literature. A gene ontology (GO) analysis was performed to predict the biological function, cellular process, and cellular location of the putative proteases (Ashburner et al. 2000). For P. falciparum, mining of the published microarray and mass spectrometry proteomics data revealed the expression of the putative proteases at the mRNA and protein levels, respectively (Florens et al. 2002; Lasonder et al. 2002; Bozdech et al. 2003a, b; Le Roch et al. 2003, 2004; Florens et al. 2004; Hall et al. 2005).

Among the candidates predicted by PSI-Blast and PF-SVM, we discovered 28 putative proteases in P. falciparum, 45 in P. vivax and 19 in P. yoelii yoelii, none of which had been reported as proteases in the MEROPS database (release 7.4). For the two less-studied genomes, our PSI-Blast search predicted 127 putative proteases in P. berghei, and 137 in P. chabaudi. In Table 1 we report the new proteases that were discovered by only one of PSI-Blast or PF-SVM, not by both. Overall, PSI-Blast identified more of the verified predictions because our verification relies heavily on analyzing sequence motifs. Many predictions made by the PF-SVM are unknown cases without reliable supporting evidence. PF-SVM also discovered several candidates that were not detectable by PSI-Blast. For example, we identified one putative PPPDE protease (PFI0940c and its orthologs in other Plasmodium species). This novel protease family has a circularly permuted papain-like fold and was postulated to play a role in the deubiquitination pathway and cell cycle control (Iyer et al. 2004). We also predicted a putative zinc protease, PF13_0260, which has a weak PROSITE motif and was missed by PSI-Blast. Another example is PF10_0317. It does not have a detectable peptidase domain, but it has a novel domain belonging to the Der1-like family (Pfam PF04511 with E = 3.4e-17). The Der1 protein is thought to play an indispensable role in the degradation process associated with the endoplasmic reticulum (ER) (Knop et al. 1996). Although there is no direct evidence of its proteolytic activity, this family may be distantly related to the rhomboid protease family, indicating a function in cellular signaling.

Table 1 Newly identified putative proteases by SVM or PSI-Blast but not by both

The PF-SVM performs reasonably well in keeping the homologous candidates at the top of the rank list, although profile kernels measure the overall similarity between two sequences instead of relying on estimating the statistical significance of a good alignment. In Fig. 1, we plot the number of detected true positives against the number of false positives admitted (up to 50). This kind of sensitivity versus specificity plot is commonly used to measure classification performance of remote homology detection in benchmark experiments (Jaakkola et al. 2000; Liao and Noble 2003; Leslie et al. 2004; Kuang et al. 2005; Rangwala and Karypis 2005). In the experiments with the P. falciparum and P. yoelii yoelii genomes, the PF-SVM is more sensitive in detecting true positives than PSI-Blast, while PSI-Blast performs better on the P. vivax genome. From the plots in Fig. 1, it is clear that when few false positives are present in the predictions, PF-SVM outperforms PSI-Blast by ranking more true positives at the top of the rank list. At a threshold of ten false positives, the PF-SVM detects more proteases than PSI-Blast in all three genomes: 12 vs. 8 in P. falciparum, 19 vs. 18 in P. vivax, and 9 vs. 3 in P. yoelii yoelii. Overall, relative to PSI-Blast, PF-SVM performs better on the P. falciparum genome than on the other two genomes. We postulate that this difference might be related to the validation criteria for evaluating the predictions. In our analysis, the false positives are putative, and many of them are unknown cases that cannot be resolved with the available supporting evidence. This lack of evidence is a more severe problem for evaluating the predictions of PF-SVM since, unlike PSI-Blast, PF-SVM does not provide any sequence alignment for the analysis, and many more predictions of PF-SVM are possibly unknown cases. Thus, the plots are just one empirical measure and they might not truly reflect the performance of PF-SVM relative to PSI-Blast. Furthermore, the P. falciparum genome has been relatively well studied. Presumably the predictions on this genome have more supporting evidence than those on the P. vivax and P. yoelii yoelii genomes.
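The quantity plotted in Fig. 1 can be computed from a ranked prediction list as sketched below. This is one reading of the measure, namely the number of verified proteases ranked above the point at which the false positive count would exceed the chosen limit; the function name and the boolean-list input are assumptions made for illustration.

```python
def true_positives_at(ranked_labels, max_false_positives):
    """Count true positives seen before false positives exceed the limit.

    ranked_labels: predictions ordered best first; True marks a verified
    protease, False marks a false positive (hypothetical input format).
    """
    true_pos = false_pos = 0
    for is_positive in ranked_labels:
        if is_positive:
            true_pos += 1
        else:
            false_pos += 1
            if false_pos > max_false_positives:
                break
    return true_pos

def tp_fp_curve(ranked_labels, limit=50):
    """One value per false positive budget 0..limit, as in the plots."""
    return [true_positives_at(ranked_labels, n) for n in range(limit + 1)]
```

Two methods are then compared by reading both curves at the same false positive budget, for example at ten false positives.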

Fig. 1
figure 1

Performance comparison of SVM and PSI-Blast. a Prediction performance on P. falciparum genome. b Prediction performance on P. vivax genome. c Prediction performance on P. yoelii yoelii genome

The PF-SVM missed 20 putative proteases with good alignment (with e-value < 1E-20). Thirteen of the missed candidates fall into four MEROPS families, C14 (caspase family), C50 (separase family), C54 (Aut2 peptidase family) and C65 (otubain-1 family). To test if this resulted from insufficient sampling of the MEROPS sequences—the training sequences sampled from these four families do not represent the sequence diversity in the family well—we constructed a larger training set by pulling in all the 436 sequences in the four families as additional positive training sequences. We found that several missed proteases were promoted to the top of the PF-SVM prediction lists. However, this change also introduced more false positives, and the overall ranking deteriorated.

Why might PF-SVM be better for remote homology detection?

There are two reasons why PF-SVM may outperform PSI-Blast. Firstly, PF-SVM is not misled by widely shared structural motifs. For example, we found that a disproportionate number of the false positive PSI-Blast predictions fell into the S9 and S33 protease families. This is largely due to the presence of an alpha/beta fold in their peptidase unit. This alpha/beta fold structure is commonly shared with a large number of hydrolytic enzymes including the S9 and S33 proteases and other non-protease hydrolases with broad substrate specificity. These enzymes are believed to derive from a common ancestor with the basic arrangement of the catalytic residues. The false positive hits from PSI-Blast searches included a number of lipases that have that typical alpha/beta fold. By contrast, these proteins did not appear at the top of the rank list in PF-SVM ranking, since even if there is a match of alpha/beta folds in S9 or S33 in the positive training sequences, they are also present in the negative training sequences such as lipases, and thus, features describing these domains are assigned relatively low importance in protease classification.

Secondly, PF-SVM does not suffer from the so-called “profile-drift” problem: the incorporation of the additional weakly matched sequences dilutes the signal in the original sequence. In applying PSI-Blast, we used both single iteration search and five iteration searches to generate predictions. Most of the verified predictions were not highly ranked due to a large number of false positives that were introduced by the iterative PSI-Blast search. Thus, we carefully analyzed only the predictions produced by the single iteration PSI-Blast. This is probably a specific case of the profile drifting problem in PSI-Blast. Instead of relying on estimating the statistical significance of a particular alignment, profile kernels measure the overall similarity between two sequence profiles, and thus are more robust in preserving the original sequence signal while evolutionary information is introduced in the profile for effective remote homology detection.

The degradome distributions in malaria parasites

The degradome complements of two human malaria parasites (P. falciparum and P. vivax) and three rodent parasites (P. yoelii yoelii, P. berghei, and P. chabaudi) have been revealed by SVM-based remote homology detection combined with conventional PSI-Blast homology search. The proteolytic repertoire of Plasmodium consists of about 115–137 predicted proteins of five catalytic classes (aspartic, cysteine, metallo, serine and threonine). They can be further classified into 37 families according to the MEROPS protease nomenclature, which is based on intrinsic evolutionary and structural relationships (Rawlings et al. 2008) (Tables 2, 3). The detailed predicted characteristics of the proteases are summarized in Supplementary Tables 1–5 (URL: http://compbio.cs.umn.edu/Protease_Class/). The fraction of proteases relative to the predicted proteome varies from 0.9 to 2.3% across the five Plasmodium species: the human parasites appear to have relatively more abundant proteases than their rodent kin. The overall protease fraction in Plasmodium is similar to that in the 363 organisms with completed genomes that have been sequenced and annotated (2.9%) (Southan 2001; Puente et al. 2005; Rawlings et al. 2008).

Table 2 Protease complements in Plasmodium species and other model organisms
Table 3 Protease homologs in five Plasmodium genomes

The core degradome

Our results indicate that malaria parasites possess a core degradome structure consisting of 29 families of proteases. This degradome may be common to all Apicomplexan parasites. The proteases in this set have been found to play diverse roles in metabolism, cell cycle regulation, invasion and infection (Table 2). These families fall into four of the most important catalytic classes of proteases, and we discuss them below.

Cysteine proteases

Cysteine proteases comprise about 30% of the degradome; the two most prominent families from this class are the papain family (C1) and the ubiquitin carboxyl-terminal hydrolase 2 family (UCH2, C19) (Table 2). The papain family (C1) includes well-characterized members of the falcipains and serine-repeat antigens (SERAs). The functions of falcipains range from hemoglobin digestion and erythrocyte rupture to erythrocyte invasion, as indicated by protease inhibition assays (Rosenthal 2002; Shenai et al. 2002), biochemical characterization (Shenai et al. 2000; Sijwali et al. 2001), RNA interference (Malhotra et al. 2002; Mohmmed et al. 2003) and gene disruption knockout experiments (Sijwali and Rosenthal 2004; Sijwali et al. 2006) (see Rosenthal 2004 for a review). SERAs are potential vaccine targets since their gene products are immunogenic, and at least one member of the SERA family, SERA-5 (PFB0340c) in P. falciparum, may have proteolytic activity (Hodder et al. 2003; McCoubrie et al. 2007). Recently, a P. berghei SERA (PB000649.01.0) was suggested to be a protease that functions in sporozoite egress from the oocyst (Aly and Matuschewski 2005; Arisue et al. 2007). The UCH2 (C19) family is another highly expanded gene family. This feature has likely arisen from large-scale gene duplication events, as evidenced by the preservation of multiple copies of threonine proteases (T1 family) in multiple proteasome α and β subunits, and of the ubiquitin C-terminal hydrolase family (C12). Such a massive retention of duplicates reflects the crucial role of the ATP-dependent ubiquitin-proteasome system, which has been implicated in cell-cycle control and stress responses in the parasite life cycle (Gantt et al. 1998). Another cysteine protease family that can be of critical importance for the parasite cell cycle is the metacaspase family (C14). We found that multiple copies (2–4) of metacaspases are present in Plasmodium, and they have the histidine and cysteine residues that are predicted to form the typical catalytic dyad (Wu et al. 2003). These paralogs may play complementary roles in parasite development and apoptosis in P. falciparum and P. berghei (Le Chat et al. 2007; Meslin et al. 2007).

Metallo and serine proteases

Although metallo and serine proteases are also abundant in Plasmodium, very little is known about their biological functions. Eleven metalloproteases are conserved in Plasmodium. For example, falcilysin, which belongs to the pitrilysin family (M16), is thought to be involved in hemoglobin degradation in the food vacuole (Eggleson et al. 1999; Goldberg 2005). Recently its potential role in the degradation of apicoplast targeting peptides has been explored (Ponpuak et al. 2007). Our analysis shows that at least one copy of a falcilysin ortholog is present in each of the five Plasmodium genomes; two copies are found in the two rodent parasites P. berghei and P. chabaudi, and at least five copies of the M16 paralogs are present. As with the metalloproteases, only one of the seven families of serine proteases that seem to be conserved in Plasmodium, the subtilisin family (S8), has been extensively studied as a potential new drug target due to its apparent role in parasite invasion and egress (Blackman et al. 1998; Barale et al. 1999; Hackett et al. 1999; Wu et al. 2003; Withers-Martinez et al. 2004; Yeoh et al. 2007). We confirmed the existence of multiple paralogs of subtilisins in the Plasmodium genomes. Moreover, the S8 family has experienced an expansion to four copies in P. vivax and five copies in P. berghei.

Aspartic proteases

Two families of aspartic proteases are conserved in Plasmodium. Plasmepsin, the pepsin family (A1) in P. falciparum, has long been thought to play important roles in hemoglobin digestion (Coombs et al. 2001; Goldberg 2005). We identified a large family of plasmepsins in the other Plasmodium species, which supports the speculation that it is an ancient family that has undergone domain shuffling and possibly rounds of gene duplication, gene loss, and gene gain by lateral gene transfer (Jean et al. 2001). We also identified a new family, the presenilin family (A22), in the aspartic clan. It may be involved in regulated intramembrane proteolysis.

Threonine protease

A single proteasome family (T1) forms the threonine protease clan in Plasmodium and plays a central role in degrading damaged or unused proteins by proteolysis. Although the detailed pathways and the identities of the substrates remain unclear, the core complex structure of protease subunits (seven α- and seven β-subunits) and regulatory subunits has been revealed by our previous comparative genomic analysis (Wu et al. 2003). Independent microarray expression assays have shown apparent co-expression patterns of the predicted threonine proteases (Bozdech et al. 2003a; Le Roch et al. 2003; Wang and Wu 2004). A schematic map can be found at Dr. Hagai Ginsburg’s Malaria Parasite Metabolic Pathway (http://sites.huji.ac.il/malaria/maps/proteaUbiqpath.html). In addition, we identified two new threonine proteases in P. falciparum: a proteasome catalytic subunit 3 homolog (PF10_0111) and an ATP-dependent heat shock protease hslV (PFL1465c). Both proteins possess a characteristic domain for threonine proteases (Pfam PF00227) with high statistical support (E = 5.1e-64 and E = 1.6e-13, respectively). Their potential importance will be discussed in the next section.

Potentially important under-characterized proteases

To date, the studies of malaria proteases as potential drug or vaccine targets have been mainly focused on a small number of proteases. Several newly discovered proteases could be worth functional characterization.

Threonine protease—proteasome catalytic subunit (PF10_0111): protein–protein interactions?

It is particularly interesting that PF10_0111 showed 15 possible protein–protein interactions in yeast two-hybrid assays (Suthram et al. 2005). Given the substantial evolutionary distance between the two species, their different life styles and the relatively high rate of false-positive predictions in such assays, caution must be used when using yeast to predict protein networks in P. falciparum. Nonetheless, there is a high likelihood that PF10_0111 is an active component in protein networks. The nature of the protein interaction network(s) awaits further experimentation since these 15 interacting proteins seem to span a variety of functional categories, including (1) a ubiquitin transferase that could be a component of the ubiquitin–proteasome conjugated proteolysis, (2) a translation elongation factor, (3) a ribosomal protein L15, (4) a ribosomal protein L4/L1, (5) a CCAAT-box DNA binding protein, (6) a nucleosome assembly protein, (7) a merozoite surface protein, (8) an erythrocyte membrane protein, and (9) seven hypothetical proteins.

Threonine protease hslV PFL1465c: prokaryotic origin

The proteasome inhibitor lactacystin has been shown to block cell growth and cell division in malaria parasites, suggesting that the proteasome can be targeted for drug development (Gantt et al. 1998). Which components of the proteasome should be targeted? Malaria parasites, which are a group of primordial eukaryotes, seem to have a mosaic proteasome structure: a catalytic core 20S complex that is typically found in eukaryotes and a structurally complex HslV that is typically found in eubacteria are simultaneously present. The core complex is less attractive from a drug development perspective since it is conserved in the eukaryote domain. For example, a number of α and β subunits of threonine proteases in Plasmodium show considerable homology to the human proteases, suggesting their inhibitors could have potential side effects. By contrast, the prokaryotic version of the proteasome is a more feasible target for inhibitors. We confirmed that a putative heat shock protein PFL1465c is a homolog of the HslV threonine protease. It has several desirable features: (1) it is expressed at the erythrocytic stage, especially at the schizont stage, as suggested by multiple microarray experiments (Bozdech et al. 2003a, b; Le Roch et al. 2003) and RT-PCR (Ramasamy et al. 2007); (2) it is likely catalytically active: the recombinant protein showed threonine, chymotrypsin and peptidyl glutamyl peptide hydrolase activity, and the active sites are conserved between P. falciparum and the template E. coli protein, as shown by homology modeling (Ramasamy et al. 2007); (3) it may be a soluble protein, as suggested by localization assays; (4) it is distantly related to the host, as shown by phylogenetic analysis (Fig. 2); (5) it is feasible to develop inhibitors specific to PFL1465c. In fact, a small-molecule inhibitor Nip–Leu–Leu–LeuVS-Me has been developed for general HslV proteases. It shows irreversible inhibition due to covalent modification of the catalytic threonine (Powers et al. 2002). It is possible that inhibitors of malaria HslV would have few or no side effects, as there is no human homolog.

Fig. 2

The phylogenetic tree of the HslV threonine proteases, inferred by the neighbor-joining method based on the amino acid sequences with Poisson corrected distance. The option of complete deletion of gaps was used for tree construction. 1000 bootstrap replicates were used to infer the reliability of branching points. The scale bar indicates the number of amino acid substitutions per site. The accession numbers are: NP_699055 (Brucella suis 1330), YP_001259885 (Brucella ovis ATCC 25840), YP_001369390 (Ochrobactrum anthropi ATCC 49188), NP_105747 (Mesorhizobium loti MAFF303099), ZP_02164574 (Hoeflea phototrophica DFL-43), NP_384162 (Sinorhizobium meliloti 1021), NP_353083 (Agrobacterium tumefaciens str. C58), YP_001976232 (Rhizobium etli CIAT 652), ZP_03287799 (Candidatus Liberibacter asiaticus str. psy62), YP_001608702 (Bartonella tribocorum CIP 105476), YP_033060 (Bartonella henselae str. Houston-1), YP_031904 (Bartonella quintana str. Toulouse), YP_760731 (Hyphomonas neptunium ATCC 15444), YP_001234539 (Acidiphilium cryptum JF-5), YP_746057 (Granulibacter bethesdensis CGDNIH1), YP_001603516 (Gluconacetobacter diazotrophicus PAl 5), YP_002299467 (Rhodospirillum centenum SW), ZP_02118779 (Methylobacterium nodulans ORS 2060), YP_001638129 (Methylobacterium extorquens PA1), YP_530197 (Rhodopseudomonas palustris BisB18), YP_779310 (Rhodopseudomonas palustris BisA53), ZP_01045070 (Nitrobacter sp. Nb-311A), NP_900071 (Chromobacterium violaceum ATCC 12472), YP_316299 (Thiobacillus denitrificans ATCC 25259), NP_886730 (Bordetella bronchiseptica RB50), YP_001002780 (Halorhodospira halophila SL1), ZP_03278234 (Thioalkalivibrio sp. HL-EbGR7), ZP_01102923 (gamma proteobacterium KT 71), YP_094676 (Legionella pneumophila subsp. pneumophila str. Philadelphia 1), YP_958102 (Marinobacter aquaeolei VT8), YP_001980929 (Cellvibrio japonicus Ueda107), YP_528169 (Saccharophagus degradans 2–40), YP_156839 (Idiomarina loihiensis L2TR), ZP_01042581 (Idiomarina baltica OS145), NP_232303 (Vibrio cholerae O1 biovar eltor str. N16961), NP_667636 (Yersinia pestis KIM), YP_001337873 (Klebsiella pneumoniae subsp. pneumoniae MGH 78578), NP_457960 (Salmonella enterica subsp. enterica serovar Typhi str. CT18), ZP_02780625 (Escherichia coli O157:H7 str. EC4401), NP_709736 (Shigella flexneri 2a str. 301), YP_198552 (Wolbachia endosymbiont strain TRS of Brugia malayi), CAL54733 (Ostreococcus tauri), XP_001418801 (Ostreococcus lucimarinus CCE9901), XP_001692687 (Chlamydomonas reinhardtii), XP_645845 (Dictyostelium discoideum AX4), ACI65383 (Phaeodactylum tricornutum CCAP 1055/1), PFL1465c (P. falciparum 3D7), Pv124160 (P. vivax SaI-1), PY03772 (P. yoelii yoelii str. 17XNL), PB000649.02.0 (P. berghei strain ANKA), PC000270.03.0 (P. chabaudi chabaudi), CAQ42254 (P. knowlesi strain H)
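
The Poisson-corrected distance named in the legend can be illustrated with a short sketch. For two aligned sequences that differ at a proportion p of the compared sites, the distance is d = -ln(1 - p); under complete deletion, every alignment column that contains a gap in any sequence is dropped before p is computed. The function names and the toy alignment below are hypothetical and serve only as an illustration; this is not the software used to build the tree.

```python
import math


def complete_deletion(alignment, gap="-"):
    """Drop every alignment column that contains a gap in any sequence."""
    names = list(alignment)
    kept = [col for col in zip(*(alignment[n] for n in names)) if gap not in col]
    return {n: "".join(col[i] for col in kept) for i, n in enumerate(names)}


def poisson_distance(seq_a, seq_b):
    """Poisson-corrected distance d = -ln(1 - p) between two gap-free sequences."""
    if len(seq_a) != len(seq_b) or not seq_a:
        raise ValueError("sequences must be aligned, non-empty, and equal length")
    # p is the proportion of sites at which the two sequences differ.
    p = sum(a != b for a, b in zip(seq_a, seq_b)) / len(seq_a)
    if p == 0.0:
        return 0.0
    if p >= 1.0:
        return math.inf  # the correction is undefined when every site differs
    return -math.log(1.0 - p)


# Hypothetical toy alignment (not real HslV sequences).
toy = {"seqA": "MKT-LV", "seqB": "MKTALV", "seqC": "MRTALI"}
clean = complete_deletion(toy)
d_ab = poisson_distance(clean["seqA"], clean["seqB"])
d_ac = poisson_distance(clean["seqA"], clean["seqC"])
```

In this toy case the gapped fourth column is removed from all three sequences, so seqA and seqB become identical, while seqA and seqC differ at two of five sites. A matrix of such pairwise distances is the input that a neighbor-joining procedure would take.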

Regulated intramembrane proteolysis (RIP)

The discovery of RIP overturned the traditional paradigm of cell signaling, in which receptors transmit signals across the membrane by binding specific molecules or ions (Brown et al. 2000). In the RIP pathways, proteases are the central players that cleave receptors and then release the fragments, which become messengers for the downstream signaling process. We identified two families of proteases in Plasmodium that may conduct RIP using different structural motifs and mechanisms.

Rhomboid proteases (S54)—potential roles in invasion?

Rhomboid is a serine protease that is involved in regulated intramembrane proteolysis. It is ubiquitously present in archaea, bacteria and eukaryotes (Urban et al. 2002). It has been shown to be important for animal development by activating epidermal growth factor receptor (EGFR) signaling in Drosophila melanogaster (Urban et al. 2001) and for mitochondrial morphology and remodeling in yeast and human (Herlan et al. 2003; McQuibban et al. 2003). The function of rhomboid protease in Apicomplexa, the phylum to which malaria parasites belong, was first revealed in Toxoplasma gondii: four rhomboids were shown to cleave surface MIC adhesins, which are essential for invasion (Brossier et al. 2005; Dowse et al. 2005). Dowse and Soldati (2005) proposed a uniform nomenclature for apicomplexan rhomboids, which we adopt here. These authors detected eight rhomboid-like proteins in P. falciparum, seven of which had homologs in P. berghei. More recently, reports showed that two of these malarial rhomboid proteases, PF11_0150 (PfROM1) and PFE0340c (PfROM4), could cleave multiple adhesins during invasion (Baker et al. 2006), and that PFE0340c (PfROM4) specifically mediated shedding of the erythrocyte-binding antigen (EBA-175) (O’Donnell et al. 2006).

Our analysis found that homologs of the rhomboids detected by Dowse and Soldati (2005) are also present in the three additional species we examined. Based on our phylogenetic analysis, between five and eight homologs of rhomboid proteases are present in each Plasmodium species. They can be divided into at least five clusters based on their sequence similarity, depending on the bootstrap values used to establish the groups: ROM1/2, ROM3, ROM4/5, and ROM6/7/9 appeared to be conserved in the Apicomplexa parasites, while ROM8/10 seemed to be Plasmodium-specific (Fig. 3a). Note that the homologs we uncovered in P. vivax, P. yoelii yoelii, and P. chabaudi were not uniformly distributed among the five clusters; there are two rhomboids from P. vivax in ROM8/10 and no P. chabaudi homolog in ROM4/5. We also uncovered a second P. berghei homolog in ROM6/7/9. It remains unknown why the rhomboid family has been greatly expanded in Plasmodium. One possible evolutionary driver for such a lineage-specific expansion is the need for parasite or parasite-host signaling: different rhomboids might modulate the proteolysis of substrates with diverse structures, such as adhesins and dynamins.

Fig. 3

a The phylogenetic tree of the rhomboid protease homologs in Apicomplexa, inferred by the neighbor-joining method based on the amino acid sequences with Poisson corrected distance. The option of complete deletion of gaps was used for tree construction. 1000 bootstrap replicates were used to infer the reliability of branching points. The scale bar indicates the number of amino acid substitutions per site. The rhomboids in T. gondii were used as the reference for the naming system ROM1-10 (Dowse and Soldati 2005). b The alignment of the rhomboid domain region of the eight P. falciparum homologs. The residues of the putative catalytic dyad, serine (S) and histidine (H), are highlighted in red

All the predicted Plasmodium rhomboids have a typical rhomboid domain (PF01694). As clearly shown in the alignment (Fig. 3b), seven of the eight rhomboids in P. falciparum possess a conserved dyad: a serine (S) and a histidine (H) in two separate transmembrane domains. This dyad is a characteristic of the active sites required for rhomboid catalytic function as revealed by the crystal structure of the GlpG protein, a rhomboid protease from E. coli (Wang et al. 2006). The S–H dyad is missing in PFF0900c (PfROM10), which appears to be quite divergent from the other rhomboids (Fig. 3a).

Signal peptide peptidase (SPP, presenilin family A22)

The second family of proteases that may govern RIP in malaria parasites is the SPP or presenilin family. The four human homologs of this family have been under extensive investigation because mutations in them are strongly associated with early-onset Alzheimer’s disease. SPP has also been implicated in a variety of developmental and physiological functions. We found only single copies in four Plasmodium species; the exception was P. berghei, in which two paralogous copies were found. The P. chabaudi SPP homolog is a 68-residue partial fragment. It is remarkable that the plasmodial SPPs have two invariant catalytic motifs that are believed to be active sites for this protease family: a Tyr–Asp (YD) motif in a transmembrane domain and a Gly–Leu–Gly–Asp (GLGD) motif in a downstream transmembrane domain (Fig. 4). Recently, Nyborg et al. (2006) showed that the P. falciparum SPP (PF14_0543), when cloned into a mammalian vector, was capable of cleaving an SPP substrate. Microarray experiments have shown that PF14_0543 is expressed during the erythrocyte stage; the mass-spectrometry proteomics assay also pinpointed its expression at the merozoite stage, which is critical for invasion. If the plasmodial SPPs are bona fide proteases, it would be intriguing to test whether the well-known adhesins are potential substrates of SPP. Moreover, because a series of inhibitors and compound libraries targeting animal SPPs have already been established, it should be relatively straightforward to design inhibitors of the plasmodial SPP, making it a good potential antimalarial target.
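
As a minimal sketch of how the two motifs could be located in a candidate sequence, the following Python snippet reports every YD site that lies upstream of a GLGD site. The function name and the toy peptide are hypothetical, and the snippet does not reproduce the analysis behind Fig. 4. Because a YD dipeptide occurs frequently by chance, a real screen would additionally require both motifs to fall within predicted transmembrane domains, which this sketch does not check.

```python
import re


def find_spp_motif_pairs(sequence):
    """Return 0-based (yd_start, glgd_start) pairs where YD lies upstream of GLGD."""
    yd_sites = [m.start() for m in re.finditer("YD", sequence)]
    glgd_sites = [m.start() for m in re.finditer("GLGD", sequence)]
    return [(y, g) for y in yd_sites for g in glgd_sites if y < g]


# Hypothetical toy peptide, not a real SPP sequence.
toy_peptide = "MAAALYDLLVVAGLGDIFK"
pairs = find_spp_motif_pairs(toy_peptide)
```

An empty result means that the sequence lacks at least one motif or that the motifs occur in the wrong order.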

Fig. 4

The alignment of the active site region of the signal peptide peptidases in representative species. The putative catalytic motifs Tyr–Asp (YD) and Gly–Leu–Gly–Asp (GLGD) are highlighted in red

Unclassified proteases

We identified four protease homologs that do not fall into any typical protease clan classification: (1) U48 (prenyl protease 2 family). Very little is known about this protease family, the majority of which are hypothetical proteins in diverse species from all the domains. The membrane-bound prenyl protease is a new member of the Plasmodium degradome, which may be involved in secretion and protein modification. (2) A new signal peptidase. We previously predicted two signal peptidases in P. falciparum, both belonging to the S26 family, which resemble the bacterial signal peptidase I and the eukaryotic mitochondrial 21-kDa signal peptidase (Wu et al. 2003). The new putative protease resembles the signal peptidase complex SPC22 subunit in yeast and mammals. Apparently, the signal peptide processing machinery in Plasmodium is a mosaic of prokaryotic and eukaryotic types. The plasmodial SPC22 may have an important function, as the yeast SPC22 is essential for processing newly synthesized secreted proteins. (3) The PPPDE protease. This novel protease family has a circularly permuted papain-like fold and may function in the deubiquitination pathway and cell cycle control (Iyer et al. 2004). (4) A putative zinc protease that has a weak PROSITE motif.

Comparison of the degradome in parasitic protozoa Plasmodium and the free-living ciliate Tetrahymena thermophila

We compared the Plasmodium degradomes with the degradome in the ciliate T. thermophila (Eisen et al. 2006), the fully sequenced free-living organism most closely related to the malaria parasites. Twenty-one protease families are present in both genomes. For example, the members in the ATP-dependent ubiquitin-proteasome system (proteases C12, C19, and T1) are well conserved. Proteases are more abundant in T. thermophila, which has 19 protease families that seem to be unique to it. Surprisingly, leishmanolysin (M8), which was originally identified in the kinetoplastid parasite Leishmania major (Gruszynski et al. 2003; LaCount et al. 2003), is not present in any Plasmodium species despite their close evolutionary relatedness. However, a huge number (48) of leishmanolysins are found in the free-living T. thermophila, including 15 members in a tandem array. It remains unclear why leishmanolysins are expanded in nonkinetoplastid eukaryotes. Similarly, the carboxypeptidase A (M14) family is expanded to 28 members in T. thermophila, while only one copy is present in Plasmodium; the carboxypeptidase Y (S10) family includes 25 members, while none is found in Plasmodium.

Seven protease families are unique to Plasmodium: the metacaspase family (C14), a prototype caspase that has been implicated in apoptosis-like signal transduction (Madeo et al. 2002); the rhomboid family (S54), which can be essential for regulated intramembrane proteolysis during invasion and parasite development; the otubain-1 family (C65) and the Poh1 peptidase family (M67), which include the isopeptidases that release ubiquitin from polyubiquitin for recycling; the thimet oligopeptidase family (M3), which regulates the intracellular degradation of oligopeptides such as cleaved signal peptides and degraded protein products; the S2P protease family (M50), which has been shown in mammals to be involved in transcriptional regulation by proteolysis of transcription regulators; and the ClpP endopeptidase family (S14), which is a component of the ClpXP and ClpAP complexes responsible for the degradation of nascent polypeptides whose synthesis is interrupted.
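
The family-level comparison described in this section amounts to set operations on family identifiers. The sketch below uses only the family codes named in the text, so it is an illustrative subset rather than the complete degradomes (which share 21 families and include 19 families found only in T. thermophila); the variable names are hypothetical.

```python
# Only the family codes named in the text; the full degradomes are larger.
plasmodium = {"C12", "C19", "T1", "M14",                       # also in T. thermophila
              "C14", "S54", "C65", "M67", "M3", "M50", "S14"}  # Plasmodium only
tetrahymena = {"C12", "C19", "T1", "M14",                      # also in Plasmodium
               "M8", "S10"}                                    # absent from Plasmodium

shared = plasmodium & tetrahymena            # families present in both genomes
plasmodium_only = plasmodium - tetrahymena   # families unique to Plasmodium
tetrahymena_only = tetrahymena - plasmodium  # families unique to T. thermophila
```

With this subset, the difference `plasmodium_only` recovers the seven families listed above, and M14 appears among the shared families even though its copy number differs greatly between the two organisms.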

Conclusion

We explored an approach combining PSI-Blast search and supervised SVM learning using profile kernels (PF-SVM) for improving the prediction of malaria degradomes. The PF-SVM proved able to identify new proteases that were not detectable by PSI-Blast. Furthermore, when we restricted the number of false positives to be small, the PF-SVM also achieved higher sensitivity and accuracy than PSI-Blast. Our approach captured a global picture of the degradome of the five malaria parasite genomes, and is readily extensible to the study of organisms with remote homology to known model systems. The addition of the degradomes from four other species of Plasmodium to the existing one for P. falciparum revealed the core degradome for this important group of parasites. Our study also extended the list of proteases in all the species examined, unveiling proteases that are known to play key roles in other organisms in regulation, protein processing and housekeeping.