Improved prediction of malaria degradomes by supervised learning with SVM and profile kernel

Kuang, Rui; Gu, Jianying; Cai, Hong; Wang, Yufeng

doi:10.1007/s10709-008-9336-9

Published: 06 December 2008

Genetica, volume 136, pages 189–209 (2009)

An Erratum to this article was published on 04 July 2009

Abstract

The spread of drug resistance through malaria parasite populations calls for the development of new therapeutic strategies. However, the seemingly promising genomics-driven target identification paradigm is hampered by weak annotation coverage. To identify potentially important yet uncharacterized proteins, we apply support vector machines using profile kernels, a supervised discriminative machine learning technique for remote homology detection, as a complement to the traditional alignment-based algorithms. In this study, we focus on the prediction of proteases, which have long been considered attractive drug targets because of their indispensable roles in parasite development and infection. Our analysis demonstrates that an abundant and complex repertoire of proteases is conserved in five Plasmodium parasite species. Several putative proteases may be important components in networks that mediate cellular processes, including hemoglobin digestion, invasion, trafficking, cell cycle fate, and signal transduction. This catalog of proteases provides a short list of targets for functional characterization and rational inhibitor design.

Introduction

Malaria remains one of the most important life-threatening diseases. It afflicts approximately 300–500 million people a year, killing 1–2 million, mostly in the developing countries in the tropical or subtropical regions. The causative agents of malaria are a group of protozoan parasites in the genus Plasmodium. The rapid spread of the parasite populations resistant to the available antimalarial drugs underscores the pressing need for new drugs.

Genomics-based searches for new antimalarial targets hold considerable promise (Carlton et al. 2002; Gardner et al. 2002; Carlton 2003), but have been limited by a practical difficulty: our inability to assign a functional identity to a large fraction of the recognized open reading frames (ORFs) in the parasite genome. In the case of Plasmodium falciparum which causes the most severe form of malaria, about 60% of the 5,300 ORFs were annotated as “hypothetical” due to the lack of statistically significant sequence similarity to proteins with known function/structure (Gardner et al. 2002). An effective solution to circumvent this problem lies in the development of new algorithms that can capture subtle similarities between the unknown proteins and the annotated proteins in protein databases.

We propose to improve protease prediction among those uncharacterized Plasmodium proteins with a computational prediction approach that applies support vector machines (SVMs) using extended profile kernels for remote homology detection. SVMs are a family of machine learning algorithms for classification and regression problems (Vapnik 1998). An SVM classifier is a linear function that separates the training data into two classes and also maximizes the geometric margin between them in a feature space. Our binary classification problem is the classification of an uncharacterized protein sequence as a member or a non-member of a given protein family with an SVM classifier learned from the training proteins. The SVM-based classification of protein sequences uses negative sequences (proteins outside the protein family) as well as positive sequences (members of the protein family) to learn the difference between the two classes. This discriminative nature of SVMs distinguishes them from alignment-based approaches that build models only with positive sequences (Karplus et al. 1998), and often results in better empirical classification performance. Another desirable property of SVMs is that learning an SVM classifier depends only on the pairwise similarity between the examples; therefore, we can use any symmetric and positive-definite similarity function, called a kernel, to achieve better classification performance and faster computation. Recently, it has been shown that SVM-based kernel approaches are especially effective in remote homology detection (Jaakkola et al. 2000; Liao and Noble 2003; Leslie et al. 2004; Kuang et al. 2005; Rangwala and Karypis 2005). Our previous work on profile kernels (Kuang et al. 2005) established state-of-the-art performance for remote homology detection.
The profile kernel is a function that measures the similarity of two protein sequence profiles based on their representation in a high-dimensional vector space indexed by all k-mers (k-length subsequences of amino acids). We modify the original profile kernel, which is defined on a feature space indexed by subsequences of a fixed length, to include subsequences with lengths in a certain range as features. We found that the extended profile kernels achieve significant improvements in protein classification on the SCOP benchmark dataset (Murzin et al. 1995) (results not shown).

In this proof-of-concept study, we attempt to combine powerful SVM classifiers and the traditional alignment-based PSI-Blast algorithm to predict the protease complements (degradomes) in Plasmodium. The proteases were chosen because:

1. They have been thought of as attractive drug targets. Firstly, proteases, the digestive enzymes that hydrolyze peptides, are essential for the parasite life cycle: for example, aspartic proteases (plasmepsins) (Coombs et al. 2001; Goldberg 2005; Ersmark et al. 2006), cysteine proteases (falcipains) (Rosenthal et al. 2002, 2004) and metalloprotease (falcilysin) (Eggleson et al. 1999; Murata and Goldberg 2003a, b) are actively involved in hemoglobin digestion for parasite nutrition; serine proteases (subtilisins) are important for red blood cell invasion (Withers-Martinez et al. 2004); and, recently, proteases have been implicated in cell cycle progression and cell signaling (Baker et al. 2006; O’Donnell et al. 2006; Le Chat et al. 2007; Meslin et al. 2007). Secondly, it is feasible to design specific inhibitors for proteases if the mechanism of protease action is known or can be predicted. Various types of inhibitors have been shown to effectively block parasite growth and/or invasion (Sharma 2007). Emerging techniques in combinatorial high-throughput screening and computational structure-based drug design (SBDD) have made promising contributions to the recent progress in searching out and designing malarial protease inhibitors: combinatorial libraries have been synthesized and screened for plasmepsins (Carroll et al. 1998; Haque et al. 1999; Kasam et al. 2007) and a group of inhibitors for falcipains has been identified as well (Li et al. 1996; Scheidt et al. 1998; Pandey et al. 2006). Thirdly, because of the remote evolutionary relatedness between the malaria parasite and the human host, inhibitors designed against malaria protease targets should have little or no adverse effect on the host.
2. A large amount of relevant data is available for the protease family, which makes the application of kernel-based machine learning feasible. Substantial knowledge has been accumulated and a specialized expert-curated database, MEROPS, is available for proteases; it includes a catalog of characterized and predicted proteases in over 3,100 organisms (Rawlings et al. 2008).

Here we report a catalog of the proteases in five species of Plasmodium, including the two human malaria parasites P. falciparum and P. vivax, and the three parasites P. yoelii yoelii, P. berghei, and P. chabaudi, which serve as the rodent models. This catalog opens up a new set of proteases and protease-regulated cellular processes for functional characterization.

Methods

Data preparation

The predicted ORFs of the five Plasmodium species were downloaded from the PlasmoDB database (http://www.plasmodb.org/, release 5.2). In this release, there are 5,411 ORFs in the P. falciparum genome, 5,352 in the P. vivax genome, 7,861 in the P. yoelii genome, 12,235 in the P. berghei genome, and 15,007 in the P. chabaudi genome. A total of 47,499 known peptidase units and peptidase inhibitor units in the MEROPS database (http://merops.sanger.ac.uk/, release 7.4) were used as the target sequences for the PSI-Blast search and SVM training.

In the PSI-Blast search of the unidentified ORFs against the MEROPS sequences, a single iteration and the default e-value threshold of 0.0001 were chosen to avoid retrieving too many false positives. The training data for SVM remote homology classification were constructed from the MEROPS database and the annotated proteins in the P. falciparum, P. vivax and P. yoelii genomes. In the MEROPS database, peptidase units and peptidase inhibitors are organized into a hierarchy with three levels: clans, superfamilies and families, from the root to the leaves. We randomly sampled 1,208 proteases from all the protease families, with the sample size from each family proportional to the total number of proteases in the family. We combined the 1,208 selected proteases from MEROPS with the 91 known P. falciparum proteases, the 72 known P. vivax proteases and the 98 known P. yoelii proteases to form the positive training set. We manually selected 1,087 annotated P. falciparum proteins, 553 annotated P. vivax proteins and 507 annotated P. yoelii proteins that are clearly not functionally related to any protease as the negative set, under the assumption that negative proteins from Plasmodium species will be more sensitive examples for detecting their remote homologs in the uncharacterized ORFs. The construction is designed to maximize the detection performance with a comprehensive representation of the data, while keeping the data size tractable for learning through the careful selection of training examples.
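The family-proportional sampling described above can be sketched as follows. This is only an illustration under stated assumptions: the family names and sizes are toy values, the function name `proportional_sample` is hypothetical, and the rule of drawing at least one sequence per family is an assumption rather than a detail given in the text.

```python
import random

def proportional_sample(families, total, seed=0):
    """Draw roughly `total` sequences, giving each family a quota
    proportional to its size. Taking at least one sequence per family
    is an assumption of this sketch."""
    rng = random.Random(seed)
    n_all = sum(len(members) for members in families.values())
    sample = {}
    for name, members in families.items():
        quota = max(1, round(total * len(members) / n_all))
        # Sample without replacement, never more than the family holds.
        sample[name] = rng.sample(members, min(quota, len(members)))
    return sample

# Toy families (hypothetical identifiers, not MEROPS data).
families = {
    "C1": [f"C1_{i}" for i in range(60)],
    "S8": [f"S8_{i}" for i in range(30)],
    "M16": [f"M16_{i}" for i in range(10)],
}
picked = proportional_sample(families, total=20)
print({name: len(seqs) for name, seqs in picked.items()})
# {'C1': 12, 'S8': 6, 'M16': 2}
```

Because the quotas are rounded independently, the total drawn can differ slightly from the requested size when family sizes do not divide evenly.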

For all the protein sequences in the training set and the ORFs, we computed the sequence profiles by searching against a non-redundant protein database using PSI-Blast with five iterations and the default e-value threshold 0.0001. The positional frequencies of amino acids in the profiles were smoothed using background frequencies. We used the smoothed emission probabilities in computing the profile kernels for SVM training.
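As a rough illustration of why smoothing matters, the sketch below mixes each profile column with a background distribution so that every amino acid receives a non-zero emission probability, which keeps the log-probabilities used by the profile kernel finite. The linear mixture and the weight of 0.2 are assumptions made for this example; the text does not specify the smoothing formula that was actually used.

```python
def smooth_column(freqs, background, weight=0.2):
    """Mix observed positional frequencies with background frequencies.
    The linear mixture and the default weight are assumptions."""
    return {
        aa: (1.0 - weight) * freqs.get(aa, 0.0) + weight * background[aa]
        for aa in background
    }

# Toy three-letter alphabet (real profiles use 20 amino acids).
background = {"A": 0.5, "C": 0.3, "D": 0.2}
column = {"A": 1.0}  # only "A" was observed at this position

smoothed = smooth_column(column, background)
print(smoothed)  # every residue now has probability > 0
```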

Support vector machines

Support vector machines are a family of machine learning algorithms for classification and regression problems (Vapnik 1998; Cristianini and Shawe-Taylor 2000). The SVM learning algorithm finds a linear classifier f(x) = <w, x> + b (w ∈ Rⁿ, b ∈ R) that discriminates between examples of the positive and the negative classes with a “large margin”. The learned linear classifier defines a decision boundary, the hyperplane <w, x> + b = 0. A test example x is classified as positive if f(x) > 0, and negative otherwise. Empirically, most real datasets are not linearly separable, so such an SVM cannot be learned directly. For these harder cases, a soft-margin SVM (Cristianini and Shawe-Taylor 2000), which incorporates a trade-off between maximizing the geometric margin and minimizing margin violations on the training set, can be learned to handle the exceptions. One important property of the SVM learning problem is that in its dual optimization form, we can replace the inner product between x and y, <x, y>, by a kernel function K(x, y); here, the kernel implicitly maps (possibly nonlinearly) the original input vector space to a feature space (a Hilbert space) with some feature mapping Φ, i.e. K(x, y) = <Φ(x), Φ(y)>. If Φ is a non-linear mapping from the original input space, the SVM can handle non-linear data by learning a linear classifier in the new feature space.
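The identity K(x, y) = <Φ(x), Φ(y)> and the kernelized decision function can be illustrated with a small example. The degree-2 polynomial kernel is used here only because its feature map is easy to write out; it is not the kernel used in this study. The support vectors, coefficients and bias below are hand-picked hypothetical values, not the result of training.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def poly2_kernel(x, y):
    """K(x, y) = <x, y>^2, computed without the feature map."""
    return dot(x, y) ** 2

def phi(x):
    """Explicit feature map for poly2_kernel on 2-D inputs."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)

def decision(x, support, alphas, labels, b, kernel):
    """Dual-form SVM decision value: sum_i alpha_i * y_i * K(x_i, x) + b."""
    return sum(a * y * kernel(s, x)
               for s, a, y in zip(support, alphas, labels)) + b

x, y = (1.0, 2.0), (3.0, -1.0)
print(poly2_kernel(x, y), dot(phi(x), phi(y)))  # both are 1 up to rounding

# Hypothetical, hand-set classifier (not trained).
support = [(1.0, 0.0), (0.0, 1.0)]
alphas, labels, b = [1.0, 1.0], [+1, -1], 0.0
print(decision((2.0, 0.0), support, alphas, labels, b, poly2_kernel))  # 4.0, positive
```

The decision function only ever calls the kernel, which is why any valid kernel, including one defined on sequence profiles, can be plugged in without computing Φ explicitly.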

We used the publicly available SPIDER package (http://www.kyb.tuebingen.mpg.de/bs/people/spider/) to learn the binary classifiers in our experiments. Due to the computational cost of constructing the SVM classifiers, we applied the SVM classification only to the three species P. falciparum, P. vivax and P. yoelii, which are of more interest in this study.

Extended profile kernels

We chose to use profile kernels (Kuang et al. 2005) for SVM learning since they have been shown to be the state-of-the-art kernels for remote homology detection. Profile kernels are kernel functions for measuring the similarity between a pair of protein sequence profiles based on their representation in a high-dimensional feature space indexed by all k-mers (k-length subsequences of amino acids). For a sequence x and its sequence profile P(x) (e.g. PSI-Blast profile), the positional mutation neighborhood at position j with threshold δ is defined to be the set of k-mers β = b₁b₂…b_k satisfying a likelihood inequality with respect to the corresponding block of the profile P(x), as follows:

$$ M_{(k,\,\delta )} (P(x[j + 1:j + k])) = \left\{ {\beta = b_{1} b_{2} \ldots \,b_{k} : \sum\limits_{i = 1}^{k} { - \log P_{j + i} (b_{i} )} \le k\delta } \right\} $$

Note that in the definition $P_{j+i}(b_i)$ denotes the emission probability of amino acid $b_i$ at position j + i in the profile P(x). Let Σ be the alphabet of amino acids; the profile feature mapping of profile kernels can then be defined as $ \Phi_{k,\,\delta } (P(x)) = (\phi_{\beta } (P(x)))_{\beta \in \Sigma^{k}} $, where the coordinate ϕ_β(P(x)) is the number of occurrences of β in the positional mutation neighborhoods M_(k,δ) of P(x).
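The definitions above can be made concrete with a brute-force sketch on a toy two-letter alphabet. It enumerates every k-mer for each block of the profile, keeps those whose summed negative log emission probability is at most kδ, and counts how often each k-mer is kept; the kernel value is then taken as the inner product of two such count vectors, following K(x, y) = <Φ(x), Φ(y)>. Exhaustive enumeration is used only for readability and is an assumption of this example (efficient implementations use a trie), and the profile values are invented.

```python
import itertools
import math
from collections import Counter

ALPHABET = "AC"  # toy alphabet; real profiles cover 20 amino acids

def mutation_neighborhood(block, delta):
    """k-mers whose summed -log emission probability over the block
    is at most k * delta. `block` is a list of k probability dicts."""
    k = len(block)
    hood = []
    for kmer in itertools.product(ALPHABET, repeat=k):
        cost = sum(-math.log(col[b]) for col, b in zip(block, kmer))
        if cost <= k * delta:
            hood.append("".join(kmer))
    return hood

def profile_features(profile, k, delta):
    """Count, for every k-mer, the positions whose neighborhood contains it."""
    feats = Counter()
    for j in range(len(profile) - k + 1):
        feats.update(mutation_neighborhood(profile[j:j + k], delta))
    return feats

def profile_kernel(p, q, k, delta):
    fp, fq = profile_features(p, k, delta), profile_features(q, k, delta)
    return sum(count * fq[kmer] for kmer, count in fp.items())

# Invented, already-smoothed profile with three positions.
profile = [{"A": 0.9, "C": 0.1}, {"A": 0.5, "C": 0.5}, {"A": 0.2, "C": 0.8}]
features = profile_features(profile, k=2, delta=0.7)
print(dict(features))  # {'AA': 1, 'AC': 2, 'CC': 1}
```

A larger δ admits more k-mers per position, so the feature vector becomes denser and more tolerant of substitutions.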

We extended the original profile kernels by considering a new feature space indexed by all subsequences of lengths in a range [k_min, k_max], i.e. the feature space is indexed by all the k-mers with k_min ≤ k ≤ k_max. The assumption behind this extension is that the lengths of most meaningful subsequences (motifs) are within a certain range. By limiting the possible length of the subsequences, the new feature space can cover most of the motifs without involving a mapping to a space of much higher dimension. If we use the same threshold δ for computing the positional mutation neighborhoods of k-mers with k_min ≤ k ≤ k_max, the extended profile kernel is simply the sum of the profile kernels computed with k-mers of each length in [k_min, k_max]. Since each profile kernel can be computed efficiently with a trie data structure in time linear in the input sequence length, the time complexity of computing the combined profile kernel is also linear in sequence length.

In our experiments, k_min = 4 and k_max = 6 were chosen as the range of the k-mers by cross-validation on the SCOP benchmark dataset for remote homology detection (Kuang et al. 2005). The extended profile kernels are normalized, and the SVM parameters are set to the defaults used in the benchmark experiments described in Kuang et al. (2005).
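The way a range of k-mer lengths combines into a single normalized kernel can be sketched as below. To keep the example short, a plain k-mer counting (spectrum) kernel on raw sequences stands in for the profile kernel, and cosine normalization K(x, y) / sqrt(K(x, x) K(y, y)) is assumed; the text states only that the kernels are normalized, not which normalization was used.

```python
import math
from collections import Counter

def spectrum_kernel(x, y, k):
    """Stand-in base kernel: inner product of k-mer count vectors."""
    cx = Counter(x[i:i + k] for i in range(len(x) - k + 1))
    cy = Counter(y[i:i + k] for i in range(len(y) - k + 1))
    return sum(count * cy[kmer] for kmer, count in cx.items())

def extended_kernel(x, y, k_min=4, k_max=6):
    """Sum of the base kernels over all lengths in [k_min, k_max]."""
    return sum(spectrum_kernel(x, y, k) for k in range(k_min, k_max + 1))

def normalized(kernel, x, y):
    """Cosine normalization (an assumption of this sketch).
    Requires both sequences to be at least k_min long."""
    return kernel(x, y) / math.sqrt(kernel(x, x) * kernel(y, y))

a, b = "ACDEFG", "ACDEFH"  # toy sequences
print(extended_kernel(a, b))              # 3
print(normalized(extended_kernel, a, b))  # 0.5
```

A sum of valid kernels is itself a valid kernel, so the combined function can be passed to an SVM in the same way as any single-length kernel.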

Multiple alignment and phylogenetic analysis

Multiple alignments were generated using the T-coffee program (http://tcoffee.vital-it.ch/cgi-bin/Tcoffee/tcoffee_cgi/index.cgi) (Notredame et al. 2000), followed by manual inspection and editing. Graphic representations of the alignment and consensus sequences were generated with the program BOXSHADE (http://www.ch.embnet.org/software/BOX_form.html). Phylogenetic trees were inferred by the neighbor-joining method using MEGA (http://www.megasoftware.net/) (Tamura et al. 2007). Unweighted Maximum Parsimony (as implemented in PAUP 4.0) and Maximum Likelihood (as implemented in PHYLIP) (Felsenstein 1981) were used to examine the robustness of the inferred phylogeny (Hall et al. 2005). Bootstrap resampling with 1,000 pseudo-replicates was carried out to assess support for individual branches. Branches with bootstrap values of <50% were collapsed and treated as polytomies.
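The resampling step of the bootstrap can be sketched as follows: each pseudo-replicate is built by drawing alignment columns with replacement until the replicate has as many columns as the original alignment. The sketch covers only this step; tree inference on each replicate and the tallying of branch support are done by the phylogenetics packages named above and are not shown. The toy alignment and the function name are hypothetical.

```python
import random

def bootstrap_alignment(alignment, n_replicates=1000, seed=0):
    """Resample alignment columns with replacement.
    `alignment` is a list of equal-length aligned sequences."""
    rng = random.Random(seed)
    n_cols = len(alignment[0])
    replicates = []
    for _ in range(n_replicates):
        cols = [rng.randrange(n_cols) for _ in range(n_cols)]
        replicates.append(["".join(row[c] for c in cols) for row in alignment])
    return replicates

# Toy alignment of three sequences (hypothetical data).
alignment = ["ACD-EF", "ACDGEF", "TCD-EY"]
replicates = bootstrap_alignment(alignment, n_replicates=100)
print(replicates[0])
```

The support for a branch is then the percentage of replicate trees in which that branch appears.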

Results and discussion

Protease prediction with PSI-Blast and PF-SVM

In our study, we applied both SVMs using profile kernels and PSI-Blast to identify the proteases in the three complete or nearly complete genomes of P. falciparum, P. vivax, and P. yoelii yoelii. For P. berghei and P. chabaudi, only PSI-Blast was used, for three empirical reasons: (1) the sequencing of these two genomes is not yet complete, and gene finding and annotation are still at an early stage; (2) very little is known about the proteolytic machinery in these genomes; (3) the numbers of predicted ORFs in these genomes (12,235 in P. berghei and 15,007 in P. chabaudi) are larger than those in the other three species due to the fragmented nature of the sequence data and incomplete annotation of these genomes (Hall et al. 2005), so a much longer time would be required for computing the extended profile kernels.

The ORFs classified as positive by the PF-SVM and the ORFs with e-value < 1E-5 in the PSI-Blast search were subjected to further analysis. The domain organization of the predicted proteases was revealed by Pfam search (Finn et al. 2008). To annotate each predicted protease, we used the known protease sequence or protease domain with the highest similarity as a reference. The catalytic type and protease family were predicted in accordance with the MEROPS classification system, and the enzyme was named in accordance with the SWISS-PROT peptidase nomenclature (http://www.expasy.ch/cgi-bin/lists?peptidas.txt) and the literature. A gene ontology (GO) analysis was performed to predict the biological function, cellular process, and cellular location of the putative proteases (Ashburner et al. 2000). For P. falciparum, mining of the published microarray and mass spectrometry proteomics data revealed the expression of the putative proteases at the mRNA and protein levels, respectively (Florens et al. 2002; Lasonder et al. 2002; Bozdech et al. 2003a, b; Le Roch et al. 2003, 2004; Florens et al. 2004; Hall et al. 2005).

Among the candidates predicted by PSI-Blast and PF-SVM, we discovered 28 putative proteases in P. falciparum, 45 in P. vivax and 19 in P. yoelii yoelii, none of which were reported as proteases in the MEROPS database (release 7.4). For the two less-studied genomes, our PSI-Blast search predicted 127 putative proteases in P. berghei, and 137 in P. chabaudi. In Table 1 we report the new proteases that were discovered by either PSI-Blast or PF-SVM but not by both. Overall, PSI-Blast identified more of the verified predictions because our verification relies heavily on analyzing sequence motifs. Many predictions made by the PF-SVM are unknown cases without reliable supporting evidence. PF-SVM also discovered several candidates that were not detectable by PSI-Blast. For example, we identified one putative PPPDE protease (PFI0940c and its orthologs in other Plasmodium species). This novel protease family has a circularly permuted papain-like fold and was postulated to play a role in the deubiquitination pathway and cell cycle control (Iyer et al. 2004). We also predicted a putative zinc protease, PF13_0260, which has a weak PROSITE motif that was missed by PSI-Blast detection. Another example is PF10_0317. It does not have a detectable peptidase domain, but it has a novel domain belonging to the Der1-like family (Pfam PF04511 with E = 3.4e-17). The Der1 protein is thought to play an indispensable role in the degradation process associated with the endoplasmic reticulum (ER) (Knop et al. 1996). Although there is no direct evidence of its proteolytic activity, this family may be distantly related to the rhomboid protease family, indicating a function in cellular signaling.
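The candidate selection and the "one method but not both" comparison described above amount to simple set operations, sketched here. The ORF identifiers, scores and e-values are invented for illustration, and treating an SVM score above zero as a positive classification follows the decision rule f(x) > 0 given in the Methods.

```python
def select_candidates(svm_scores, blast_evalues, evalue_cutoff=1e-5):
    """Return the ORFs called by each method.
    SVM: decision value > 0; PSI-Blast: e-value below the cutoff."""
    svm_hits = {orf for orf, score in svm_scores.items() if score > 0}
    blast_hits = {orf for orf, e in blast_evalues.items() if e < evalue_cutoff}
    return svm_hits, blast_hits

# Invented ORF identifiers and values.
svm_scores = {"orf1": 1.3, "orf2": -0.4, "orf3": 0.2, "orf4": -1.1}
blast_evalues = {"orf1": 1e-30, "orf2": 1e-8, "orf3": 0.5, "orf4": 2.0}

svm_hits, blast_hits = select_candidates(svm_scores, blast_evalues)
candidates = svm_hits | blast_hits   # passed on to further analysis
one_method = svm_hits ^ blast_hits   # found by exactly one method
print(sorted(candidates), sorted(one_method))
# ['orf1', 'orf2', 'orf3'] ['orf2', 'orf3']
```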

Table 1 Newly identified putative proteases by SVM or PSI-Blast but not by both

The PF-SVM performs reasonably well in keeping the homologous candidates at the top of the rank list, although profile kernels measure the overall similarity between two sequences instead of relying on estimating the statistical significance of a good alignment. In Fig. 1, we plot the number of detected true positives against the number of false positives (up to 50). This kind of sensitivity versus specificity plot is commonly used to measure classification performance of remote homology detection in benchmark experiments (Jaakkola et al. 2000; Liao and Noble 2003; Leslie et al. 2004; Kuang et al. 2005; Rangwala and Karypis 2005). In the experiments with the P. falciparum genome and the P. yoelii yoelii genome, the PF-SVM is more sensitive in detecting true positives than PSI-Blast, while PSI-Blast performs better on the P. vivax genome. From the plots in Fig. 1, it is clear that when few false positives are present in the predictions, PF-SVM significantly outperforms PSI-Blast by ranking more true positives at the top of the rank list. At a threshold of ten false positives, the PF-SVM detects four more proteases than PSI-Blast in the P. falciparum genome (12 vs. 8), one more in the P. vivax genome (19 vs. 18) and six more in the P. yoelii yoelii genome (9 vs. 3). Overall, relative to PSI-Blast, PF-SVM performs better on the P. falciparum genome than on the other two genomes. We postulate that this difference might be related to the validation criteria for evaluating the predictions. In our analysis, the false positives are putative, and many of them are unknown cases that cannot be fully resolved with the available supporting evidence. This lack of evidence is a more severe problem for evaluating the predictions of PF-SVM since, unlike PSI-Blast, PF-SVM does not provide any sequence alignment for the analysis, and many more predictions of PF-SVM are possibly unknown cases.
Thus, the plots are just one empirical measure and they might not truly reflect the performance of PF-SVM compared with PSI-Blast. Furthermore, the P. falciparum genome has been relatively well studied. Presumably the predictions on this genome have more supporting evidence than those on the P. vivax genome and the P. yoelii yoelii genome.
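The quantity plotted in this kind of comparison, the number of true positives recovered before a given number of false positives is exceeded, can be computed from a ranked prediction list as sketched below. The convention used here (count true positives until the false-positive budget is exceeded) is an assumption of this example, and the ranked list is invented.

```python
def true_positives_at(ranked_labels, max_false_positives):
    """Count true positives ranked above the point at which the number
    of false positives would exceed `max_false_positives`.
    `ranked_labels` is ordered from most to least confident."""
    tp = fp = 0
    for is_true in ranked_labels:
        if is_true:
            tp += 1
        else:
            fp += 1
            if fp > max_false_positives:
                break
    return tp

# Invented ranking: True = verified protease, False = false positive.
ranked = [True, True, False, True, False, False, True]
print([true_positives_at(ranked, n) for n in (0, 1, 2, 10)])
# [2, 3, 3, 4]
```

Evaluating the function for every budget from 0 to 50 yields one curve per method and genome.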

The PF-SVM missed 20 putative proteases with good alignment (with e-value < 1E-20). Thirteen of the missed candidates fall into four MEROPS families, C14 (caspase family), C50 (separase family), C54 (Aut2 peptidase family) and C65 (otubain-1 family). To test if this resulted from insufficient sampling of the MEROPS sequences—the training sequences sampled from these four families do not represent the sequence diversity in the family well—we constructed a larger training set by pulling in all the 436 sequences in the four families as additional positive training sequences. We found that several missed proteases were promoted to the top of the PF-SVM prediction lists. However, this change also introduced more false positives, and the overall ranking deteriorated.

Why might PF-SVM be better for remote homology detection?

There are two reasons why PF-SVM may outperform PSI-Blast. Firstly, PF-SVM is not misled by widely shared structural motifs. For example, we found that a disproportionate number of the false positive PSI-Blast predictions fell into the S9 and S33 protease families. This is largely due to the presence of an alpha/beta fold in their peptidase unit. This alpha/beta fold structure is commonly shared with a large number of hydrolytic enzymes including the S9 and S33 proteases and other non-protease hydrolases with broad substrate specificity. These enzymes are believed to derive from a common ancestor with the basic arrangement of the catalytic residues. The false positive hits from PSI-Blast searches included a number of lipases that have that typical alpha/beta fold. By contrast, these proteins did not appear at the top of the rank list in PF-SVM ranking, since even if there is a match of alpha/beta folds in S9 or S33 in the positive training sequences, they are also present in the negative training sequences such as lipases, and thus, features describing these domains are assigned relatively low importance in protease classification.

Secondly, PF-SVM does not suffer from the so-called “profile-drift” problem: the incorporation of the additional weakly matched sequences dilutes the signal in the original sequence. In applying PSI-Blast, we used both single iteration search and five iteration searches to generate predictions. Most of the verified predictions were not highly ranked due to a large number of false positives that were introduced by the iterative PSI-Blast search. Thus, we carefully analyzed only the predictions produced by the single iteration PSI-Blast. This is probably a specific case of the profile drifting problem in PSI-Blast. Instead of relying on estimating the statistical significance of a particular alignment, profile kernels measure the overall similarity between two sequence profiles, and thus are more robust in preserving the original sequence signal while evolutionary information is introduced in the profile for effective remote homology detection.

The degradome distributions in malaria parasites

The degradome complements of two human malaria parasites (P. falciparum and P. vivax) and three rodent parasites (P. yoelii yoelii, P. berghei, and P. chabaudi) have been revealed by SVM-based remote homology detection combined with conventional PSI-Blast homology search. The proteolytic repertoire of Plasmodium consists of about 115–137 predicted proteins of five catalytic classes (aspartic, cysteine, metallo, serine and threonine). They can be further classified into 37 families according to the MEROPS protease nomenclature, which is based on intrinsic evolutionary and structural relationships (Rawlings et al. 2008) (Tables 2, 3). The detailed predicted characteristics of the proteases are summarized in Supplementary Tables 1–5 (URL: http://compbio.cs.umn.edu/Protease_Class/). The fractions of proteases relative to predicted proteome complexity vary from 0.9 to 2.3% in the five Plasmodium species: the human parasites appear to have relatively more abundant proteases than their rodent kin. The overall protease fraction in Plasmodium is similar to that in the 363 organisms with completed genomes that have been sequenced and annotated (2.9%) (Southan 2001; Puente et al. 2005; Rawlings et al. 2008).

Table 2 Protease complements in Plasmodium species and other model organisms

Table 3 Protease homologs in five Plasmodium genomes

The core degradome

Our results indicate that malaria parasites possess a core degradome structure consisting of 29 families of proteases. This degradome may be common to all Apicomplexan parasites. The proteases in this set have been found to play diverse roles in metabolism, cell cycle regulation, invasion and infection (Table 2). These families fall into four of the most important catalytic classes of proteases, and we discuss them below.

Cysteine proteases

Cysteine proteases comprise about 30% of the degradome; the two most prominent families from this class are the papain family (C1) and the ubiquitin carboxyl-terminal hydrolase 2 family (UCH2, C19) (Table 2). The papain family (C1) includes well-characterized members of the falcipains and serine-repeat antigens (SERAs). The functions of falcipains range from hemoglobin digestion and erythrocyte rupture to erythrocyte invasion, as indicated by protease inhibition assays (Rosenthal 2002; Shenai et al. 2002), biochemical characterization (Shenai et al. 2000; Sijwali et al. 2001), RNA interference (Malhotra et al. 2002; Mohmmed et al. 2003) and gene disruption knockout experiments (Sijwali and Rosenthal 2004; Sijwali et al. 2006) (see Rosenthal 2004 for a review). SERAs are potential vaccine targets since their gene products are immunogenic, and at least one member of the SERA family, SERA-5 (PFB0340c) in P. falciparum, may have proteolytic activity (Hodder et al. 2003; McCoubrie et al. 2007). Recently, a P. berghei SERA (PB000649.01.0) was suggested to be a protease that functions in sporozoite egress from the oocyst (Aly and Matuschewski 2005; Arisue et al. 2007). The UCH2 (C19) family is another highly expanded gene family. This feature has likely arisen from large-scale gene duplication events, as evidenced by the preservation of multiple copies of threonine proteases (T1 family) in multiple proteasome α and β subunits, and of the ubiquitin C-terminal hydrolase family (C12). Such a massive retention of duplicates reflects the crucial role of the ATP-dependent ubiquitin-proteasome system, which has been implicated in cell-cycle control and stress responses in the parasite life cycle (Gantt et al. 1998). Another cysteine protease family that can be of critical importance for the parasite cell cycle is the metacaspase family (C14).
We found that multiple copies (2–4) of metacaspases are present in Plasmodium, and they have the histidine and cysteine residues that are predicted to form the typical catalytic dyad (Wu et al. 2003). These paralogs may play complementary functions in parasite development and apoptosis in P. falciparum and P. berghei (Le Chat et al. 2007; Meslin et al. 2007).

Metallo and serine proteases

Although metallo and serine proteases are also abundant in Plasmodium, very little is known about their biological functions. Eleven metalloproteases are conserved in Plasmodium. For example, falcilysin, which belongs to the pitrilysin family (M16), is thought to be involved in hemoglobin degradation in the food vacuole (Eggleson et al. 1999; Goldberg 2005). Recently its potential role in the degradation of apicoplast targeting peptides has been explored (Ponpuak et al. 2007). Our analysis shows that at least one copy of a falcilysin ortholog is present in each of the five Plasmodium genomes; two copies are found in the two rodent parasites P. berghei and P. chabaudi, and at least five copies of the M16 paralogs are present. As with the metalloproteases, only one of the seven families of serine proteases that seem to be conserved in Plasmodium, the subtilisin family (S8), has been extensively studied as a potential new drug target due to its apparent role in parasite invasion and egress (Blackman et al. 1998; Barale et al. 1999; Hackett et al. 1999; Wu et al. 2003; Withers-Martinez et al. 2004; Yeoh et al. 2007). We confirmed the existence of multiple paralogs of subtilisins in the Plasmodium genomes. Moreover, the S8 family has experienced an expansion to four copies in P. vivax and five copies in P. berghei.

Aspartic proteases

Two families of aspartic proteases are conserved in Plasmodium. Plasmepsin, the pepsin family (A1) in P. falciparum, has long been thought to play important roles in hemoglobin digestion (Coombs et al. 2001; Goldberg 2005). We identified a large family of plasmepsins in the other Plasmodium species, which supports the speculation that it is an ancient family that has undergone domain shuffling, possibly rounds of gene duplications, gene loss, and gene gain by lateral gene transfers (Jean et al. 2001). We identified a new family of presenilin in the aspartic clan (A22). It may be involved in regulated intramembrane proteolysis.

Threonine protease

A single proteasome family (T1) forms the threonine protease clan in Plasmodium and plays a central role in degrading damaged or unused proteins by proteolysis. Although the detailed pathways and the identities of the substrates remain unclear, the core complex structure of protease subunits (seven α- and seven β-subunits) and regulatory subunits has been revealed by our previous comparative genomic analysis (Wu et al. 2003). Independent microarray expression assays have shown apparent co-expression patterns of the predicted threonine proteases (Bozdech et al. 2003a; Le Roch et al. 2003; Wang and Wu 2004). A schematic map can be found at Dr. Hagai Ginsburg’s Malaria Parasite Metabolic Pathway (http://sites.huji.ac.il/malaria/maps/proteaUbiqpath.html). In addition, we identified two new threonine proteases in P. falciparum: a proteasome catalytic subunit 3 homolog (PF10_0111) and an ATP-dependent heat shock protease hslV (PFL1465c). Both proteins possess a characteristic domain for threonine protease (Pfam PF00227) with high statistical support (E = 5.1e-64 and E = 1.6e-13, respectively). Their potential importance will be discussed in the next section.

Potentially important under-characterized proteases

To date, the studies of malaria proteases as potential drug or vaccine targets have been mainly focused on a small number of proteases. Several newly discovered proteases could be worth functional characterization.

Threonine protease—proteasome catalytic subunit (PF10_0111): protein–protein interactions?

It is particularly interesting that PF10_0111 showed 15 possible protein–protein interactions in yeast two-hybrid assays (Suthram et al. 2005). Given the substantial evolutionary distance between the two species, their different life styles and the relatively high rate of false-positive predictions in such assays, caution must be used when using yeast to predict protein networks in P. falciparum. Nonetheless, there is a high likelihood that PF10_0111 is an active component in protein networks. The nature of the protein interaction network(s) awaits further experimentation, since these 15 interacting proteins seem to span a variety of functional categories, including (1) a ubiquitin transferase that could be a component of the ubiquitin–proteasome conjugated proteolysis, (2) a translation elongation factor, (3) a ribosome protein L15, (4) a ribosomal protein L4/L1, (5) a CCAAT-box DNA binding protein, (6) a nucleosome assembly protein, (7) a merozoite surface protein, (8) an erythrocyte membrane protein, and (9) seven hypothetical proteins.

Threonine protease hslV PFL1465c: prokaryotic origin

The proteasome inhibitor lactacystin has been shown to block cell growth and cell division in malaria parasites, suggesting that the proteasome can be targeted for drug development (Gantt et al. 1998). Which components of the proteasome should be targeted? Malaria parasites, which are a group of primordial eukaryotes, seem to have a mosaic proteasome structure: a catalytic core 20S complex that is typically found in eukaryotes and a structurally complex HslV that is typically found in eubacteria are simultaneously present. The core complex is less attractive from a drug development perspective since it is conserved in the eukaryote domain. For example, a number of α and β subunits of threonine proteases in Plasmodium show considerable homology to the human proteases, suggesting their inhibitors could have potential side effects. By contrast, inhibitors for the prokaryotic version of the proteasome are more feasible. We confirmed that a putative heat shock protein PFL1465c is a homolog of the HslV threonine protease. It has several desirable features: (1) it is expressed at the erythrocytic stage, especially at the schizont stage, as suggested by multiple microarray experiments (Bozdech et al. 2003a, b; Le Roch et al. 2003) and RT-PCR (Ramasamy et al. 2007); (2) it is likely catalytically active: the recombinant protein showed threonine, chymotrypsin and peptidyl glutamyl peptide hydrolase activity, and the active sites are conserved between P. falciparum and the template E. coli protein, as shown by homology modeling (Ramasamy et al. 2007); (3) it may be a soluble protein as shown by localization assays; (4) it is distantly related to the host, as shown by phylogenetic analysis (Fig. 2); (5) it is feasible to develop inhibitors specific to PFL1465c. In fact, a small-molecule inhibitor Nip–Leu–Leu–LeuVS-Me has been developed for general HslV proteases. It shows irreversible inhibition due to covalent modification of the catalytic threonine (Powers et al. 2002). It is possible that inhibitors of malaria HslV could have few or no side effects, as there is no human homolog.

Regulated intramembrane proteolysis (RIP)

The discovery of RIP overturned the traditional paradigm of cell signaling where receptors transmit signals across membrane via binding specific molecules or ions (Brown et al. 2000). In the RIP pathways, proteases are the central players that cleave receptors and then release the fragments, which become messengers for the downstream signaling process. We identified two families of proteases in Plasmodium that may conduct RIP using different structure motifs and mechanisms.

Rhomboid proteases (S54)—potential roles in invasion?

Rhomboid is a serine protease that is involved in regulated intramembrane proteolysis. It is ubiquitously present in archaea, bacteria and eukaryotes (Urban et al. 2002). It has been shown to be important for animal development by activating epidermal growth factor receptor (EGFR) signaling in Drosophila melanogaster (Urban et al. 2001) and for mitochondrial morphology and remodeling in yeast and human (Herlan et al. 2003; McQuibban et al. 2003). The function of rhomboid protease in Apicomplexa, the phylum to which malaria parasites belong, was first revealed in Toxoplasma gondii: four rhomboids were shown to cleave surface MIC adhesins, which are essential for invasion (Brossier et al. 2005; Dowse et al. 2005). Dowse and Soldati (2005) proposed a uniform nomenclature for Apicomplexan rhomboids, which we adopt here. These authors detected eight rhomboid-like proteins in P. falciparum, seven of which had homologs in P. berghei. More recently, reports showed that two of these malarial rhomboid proteases, PF11_0150 (PfROM1) and PFE0340c (PfROM4), could cleave multiple adhesins during invasion (Baker et al. 2006), and that PFE0340c (PfROM4) specifically mediated shedding of the erythrocyte-binding antigen (EBA-175) (O’Donnell et al. 2006).

Our analysis found that homologs of the rhomboids detected by Dowse and Soldati (2005) are also present in the three additional species we examined. Based on our phylogenetic analysis, between five and eight homologs of rhomboid proteases are present in each Plasmodium species. They can be divided into at least five clusters based on their sequence similarity, depending on the bootstrap values used to establish the groups: ROM1/2, ROM3, ROM4/5, and ROM6/7/9 appeared to be conserved in the Apicomplexa parasites, while ROM8/10 seemed to be Plasmodium-specific (Fig. 3a). Note that the homologs we uncovered in P. vivax, P. yoelii yoelii, and P. chabaudi were not uniformly distributed among the five clusters; there are two rhomboids from P. vivax in ROM8/10 and no P. chabaudi homolog in ROM4/5. We also uncovered a second P. berghei homolog in ROM6/7/9. It remains unknown why the rhomboid family has been greatly expanded in Plasmodium. One possible evolutionary driver for such a lineage-specific expansion is the need to support parasite or parasite-host signaling: different rhomboids might modulate the proteolysis of substrates with diverse structures, such as adhesins and dynamins.

All the predicted Plasmodium rhomboids have a typical rhomboid domain (PF01694). As clearly shown in the alignment (Fig. 3b), seven of the eight rhomboids in P. falciparum possess a conserved dyad: a serine (S) and a histidine (H) in two separate transmembrane domains. This dyad is a characteristic of the active sites required for rhomboid catalytic function as revealed by the crystal structure of the GlpG protein, a rhomboid protease from E. coli (Wang et al. 2006). The S–H dyad is missing in PFF0900c (PfROM10), which appears to be quite divergent from the other rhomboids (Fig. 3a).

Signal peptide peptidase (SPP, presenilin family A22)

The second family of proteases that may govern RIP in malaria parasites is the SPP or presenilin family. The four human homologs of this family have been under extensive investigation because their mutation is strongly associated with the early onset of Alzheimer’s disease. SPP has also been implicated in a variety of developmental and physiological functions. We found only single copies in four Plasmodium species; the exception was P. berghei, where two paralogous copies are found. The P. chabaudi SPP homolog is a 68-residue partial fragment. It is remarkable that the plasmodial SPPs have two invariant catalytic motifs that are believed to be active sites for this protease family: a Tyr–Asp (YD) motif in a transmembrane domain and a Gly–Leu–Gly–Asp (GLGD) motif in a downstream transmembrane domain (Fig. 4). Recently, Nyborg et al. (2006) showed that the P. falciparum SPP (PF14_0543), when cloned into a mammalian vector, was capable of cleaving an SPP substrate. Microarray experiments have shown that PF14_0543 is expressed during the erythrocyte stage; the mass-spectrometry proteomics assay also pinpointed its expression at the merozoite stage, which is critical for invasion. If the plasmodial SPPs are bona fide proteases, it would be intriguing to test whether the well-known adhesins are potential substrates of SPP. Moreover, because a line of inhibitors and compound libraries targeting animal SPPs have already been established, it should be relatively straightforward to design inhibitors of the plasmodial SPP, making it a good potential antimalarial target.
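The motif check described above can be sketched in a few lines of Python. This is only an illustration: the function name `find_spp_motifs` and both toy sequences are hypothetical, and the sketch tests the order of the YD and GLGD motifs only, not whether they fall inside predicted transmembrane domains as in Fig. 4.

```python
import re


def find_spp_motifs(seq):
    """Return (yd_start, glgd_start) for the first YD motif that has a
    GLGD motif somewhere downstream, or None if no such pair exists.

    Positions are 0-based indices into the sequence.
    """
    seq = seq.upper()
    for yd in re.finditer("YD", seq):
        # Search for GLGD only after the end of this YD occurrence.
        glgd_start = seq.find("GLGD", yd.end())
        if glgd_start != -1:
            return yd.start(), glgd_start
    return None


# Hypothetical toy sequences, not real Plasmodium proteins.
toy_candidate = "MKLLAYDIFVVVAAAGLGDILLP"  # YD upstream of GLGD
toy_wrong_order = "MKLLGLGDAAAYDPP"        # GLGD upstream of YD

print(find_spp_motifs(toy_candidate))
print(find_spp_motifs(toy_wrong_order))
```

A real screen would combine such a scan with transmembrane topology prediction, since the two motifs are expected in separate transmembrane segments.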

Unclassified proteases

We identified four protease homologs that do not fall into any typical protease clan classification: (1) U48 (prenyl protease 2 family). Very little is known about this protease family, the majority of which are hypothetical proteins in diverse species from all the domains. The membrane-bound prenyl protease is a new member of the Plasmodium degradome, which may be involved in secretion and protein modification. (2) A new signal peptidase. We previously predicted two signal peptidases in P. falciparum, both belonging to the S26 family, which resemble the bacterial signal peptidase I and the eukaryotic mitochondrial 21 kDa signal peptidase (Wu et al. 2003). The new putative protease resembles the signal peptidase complex SPC22 unit in yeast and mammals. Apparently, the signal peptide processing machinery in Plasmodium is a mosaic of prokaryotic and eukaryotic types. The plasmodial SPC22 may have an important function, as the yeast SPC22 is essential for processing newly synthesized secreted proteins. (3) The PPPDE protease. This novel protease family has a circularly permuted papain-like fold and may function in the deubiquitination pathway and cell cycle control (Iyer et al. 2004). (4) A putative zinc protease that has a weak PROSITE motif.

Comparison of the degradome in parasitic protozoa Plasmodium and the free-living ciliate Tetrahymena thermophila

We compared the Plasmodium degradomes with the degradome of the ciliate T. thermophila (Eisen et al. 2006), the fully sequenced free-living organism most closely related to the malaria parasites. Twenty-one protease families are present in both genomes. For example, the members of the ATP-dependent ubiquitin-proteasome system (proteases C12, C19, and T1) are well conserved. Proteases are more abundant in T. thermophila, which has 19 protease families that seem to be unique to it. Surprisingly, leishmanolysin (M8), which was originally identified in the kinetoplastid parasite Leishmania major (Gruszynski et al. 2003; LaCount et al. 2003), is not present in any Plasmodium species despite their close evolutionary relatedness. However, a huge number (48) of leishmanolysins are found in the free-living T. thermophila, including 15 members in a tandem array. It remains unclear why leishmanolysins are expanded in nonkinetoplastid eukaryotes. Similarly, the carboxypeptidase A (M14) family is expanded to 28 members in T. thermophila, while only one copy is present in Plasmodium; the carboxypeptidase Y (S10) family includes 25 members, while none is found in Plasmodium.

Seven protease families are unique to Plasmodium: the metacaspase family (C14), a prototype caspase that has been implicated in apoptosis-like signal transduction (Madeo et al. 2002); the rhomboid family (S54), which can be essential for regulated intramembrane proteolysis during invasion and parasite development; the otubain-1 family (C65) and the Poh1 peptidase family (M67), which include the isopeptidases that release ubiquitin from polyubiquitin for recycling; the thimet oligopeptidase family (M3), which regulates the intracellular degradation of oligopeptides such as cleaved signal peptides and degraded protein products; the S2P protease family (M50), which has been shown in mammals to be involved in transcriptional regulation by proteolysis of transcription regulators; and the ClpP endopeptidase family (S14), which is a component of the ClpXP and ClpAP complexes responsible for the degradation of nascent polypeptides whose synthesis is interrupted.
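The shared and lineage-specific families in this comparison amount to simple set operations on MEROPS family identifiers. The sketch below is illustrative only: the helper `compare_degradomes` is a hypothetical name, and the two sets contain only the families named in this section, not the complete degradomes (which include 21 shared families and 19 families unique to T. thermophila).

```python
def compare_degradomes(families_a, families_b):
    """Partition two sets of protease family IDs into shared and unique."""
    return {
        "shared": families_a & families_b,
        "only_a": families_a - families_b,
        "only_b": families_b - families_a,
    }


# Partial family lists taken from the text above (not exhaustive).
plasmodium = {"C12", "C19", "T1", "M14",
              "C14", "S54", "C65", "M67", "M3", "M50", "S14"}
tetrahymena = {"C12", "C19", "T1", "M14", "M8", "S10"}

result = compare_degradomes(plasmodium, tetrahymena)
for label, families in result.items():
    print(label, sorted(families))
```

Here `only_a` recovers the seven families described as unique to Plasmodium, and `only_b` contains the two T. thermophila-specific families mentioned by name (M8 and S10).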

Conclusion

We explored an approach combining PSI-Blast search and supervised SVM learning using profile kernels (PF-SVM) to improve the prediction of malaria degradomes. The PF-SVM proved able to identify new proteases that were not detectable by PSI-Blast. Furthermore, when we restricted the number of false positives to be small, the PF-SVM also achieved higher sensitivity and accuracy than PSI-Blast. Our approach captured a global picture of the degradome of the five malaria parasite genomes, and is readily extensible to the study of organisms with remote homology to known model systems. The addition of the degradomes from four other species of Plasmodium to the existing one for P. falciparum revealed the core degradome for this important group of parasites. Our study also extended the list of proteases in all the species examined, unveiling proteases that are known to play key roles in other organisms in regulation, protein processing and housekeeping.
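To make the comparison "at a small number of false positives" concrete, the following sketch computes sensitivity from a ranked list of predictions once a fixed false-positive budget is exhausted. It is a generic illustration under stated assumptions, not necessarily the evaluation procedure used in this study: the function name `sensitivity_at_fp`, the scores, and the labels are all hypothetical toy values.

```python
def sensitivity_at_fp(scores, labels, max_fp):
    """Fraction of true positives ranked above the (max_fp + 1)-th
    false positive when predictions are sorted by descending score.

    labels: 1 for a known protease, 0 for a non-protease.
    """
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    total_pos = sum(labels)
    tp = fp = 0
    for _, is_pos in ranked:
        if is_pos:
            tp += 1
        else:
            fp += 1
            if fp > max_fp:
                break  # false-positive budget exceeded; stop counting
    return tp / total_pos if total_pos else 0.0


# Hypothetical toy data: two methods scoring the same six sequences.
labels = [1, 1, 0, 1, 0, 1]
method_a = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4]
method_b = [0.9, 0.3, 0.8, 0.2, 0.7, 0.1]

print(sensitivity_at_fp(method_a, labels, max_fp=1))
print(sensitivity_at_fp(method_b, labels, max_fp=1))
```

With a budget of one false positive, the first toy method recovers three of the four positives and the second only one, which is the kind of difference the sensitivity comparison above refers to.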

Abbreviations

EGFR: Epidermal growth factor receptor
ER: Endoplasmic reticulum
PF-SVM: Support vector machine using profile kernels
ORF: Open reading frame
RIP: Regulated intramembrane proteolysis
SBDD: Structure-based drug design
SERA: Serine-repeat antigen
SPP: Signal peptide peptidase
SVM: Support vector machine

References

Aly AS, Matuschewski K (2005) A malarial cysteine protease is necessary for Plasmodium sporozoite egress from oocysts. J Exp Med 202:225–230. doi:10.1084/jem.20050545
Arisue N, Hirai M, Arai M, Matsuoka H, Horii T (2007) Phylogeny and evolution of the SERA multigene family in the genus Plasmodium. J Mol Evol 65:82–91. doi:10.1007/s00239-006-0253-1
Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig JT, Harris MA, Hill DP, Issel-Tarver L, Kasarskis A, Lewis S, Matese JC, Richardson JE, Ringwald M, Rubin GM, Sherlock G (2000) Gene ontology: tool for the unification of biology. The gene ontology consortium. Nat Genet 25:25–29. doi:10.1038/75556
Baker RP, Wijetilaka R, Urban S (2006) Two Plasmodium rhomboid proteases preferentially cleave different adhesins implicated in all invasive stages of malaria. PLoS Pathog 2:e113. doi:10.1371/journal.ppat.0020113
Barale JC, Blisnick T, Fujioka H, Alzari PM, Aikawa M, Braun-Breton C, Langsley G (1999) Plasmodium falciparum subtilisin-like protease 2, a merozoite candidate for the merozoite surface protein 1–42 maturase. Proc Natl Acad Sci USA 96:6445–6450. doi:10.1073/pnas.96.11.6445
Blackman MJ, Fujioka H, Stafford WH, Sajid M, Clough B, Fleck SL, Aikawa M, Grainger M, Hackett F (1998) A subtilisin-like protein in secretory organelles of Plasmodium falciparum merozoites. J Biol Chem 273:23398–23409. doi:10.1074/jbc.273.36.23398
Bozdech Z, Llinas M, Pulliam BL, Wong ED, Zhu J, DeRisi JL (2003a) The transcriptome of the intraerythrocytic developmental cycle of Plasmodium falciparum. PLoS Biol 1:E5. doi:10.1371/journal.pbio.0000005
Bozdech Z, Zhu J, Joachimiak MP, Cohen FE, Pulliam B, DeRisi JL (2003b) Expression profiling of the schizont and trophozoite stages of Plasmodium falciparum with a long-oligonucleotide microarray. Genome Biol 4:R9. doi:10.1186/gb-2003-4-2-r9
Brossier F, Jewett TJ, Sibley LD, Urban S (2005) A spatially localized rhomboid protease cleaves cell surface adhesins essential for invasion by Toxoplasma. Proc Natl Acad Sci USA 102:4146–4151. doi:10.1073/pnas.0407918102
Brown MS, Ye J, Rawson RB, Goldstein JL (2000) Regulated intramembrane proteolysis: a control mechanism conserved from bacteria to humans. Cell 100:391–398. doi:10.1016/S0092-8674(00)80675-3
Carlton J (2003) The Plasmodium vivax genome sequencing project. Trends Parasitol 19:227–231. doi:10.1016/S1471-4922(03)00066-7
Carlton JM, Angiuoli SV, Suh BB, Kooij TW, Pertea M, Silva JC, Ermolaeva MD, Allen JE, Selengut JD, Koo HL, Peterson JD, Pop M, Kosack DS, Shumway MF, Bidwell SL, Shallom SJ, van Aken SE, Riedmuller SB, Feldblyum TV, Cho JK, Quackenbush J, Sedegah M, Shoaibi A, Cummings LM, Florens L, Yates JR, Raine JD, Sinden RE, Harris MA, Cunningham DA, Preiser PR, Bergman LW, Vaidya AB, Van Lin LH, Janse CJ, Waters AP, Smith HO, White OR, Salzberg SL, Venter JC, Fraser CM, Hoffman SL, Gardner MJ, Carucci DJ (2002) Genome sequence and comparative analysis of the model rodent malaria parasite Plasmodium yoelii yoelii. Nature 419:512–519. doi:10.1038/nature01099
Carroll CD, Patel H, Johnson TO, Guo T, Orlowski M, He ZM, Cavallaro CL, Guo J, Oksman A, Gluzman IY, Connelly J, Chelsky D, Goldberg DE, Dolle RE (1998) Identification of potent inhibitors of Plasmodium falciparum plasmepsin II from an encoded statine combinatorial library. Bioorg Med Chem Lett 8:2315–2320. doi:10.1016/S0960-894X(98)00419-3
Coombs GH, Goldberg DE, Klemba M, Berry C, Kay J, Mottram JC (2001) Aspartic proteases of Plasmodium falciparum and other parasitic protozoa as drug targets. Trends Parasitol 17:532–537. doi:10.1016/S1471-4922(01)02037-2
Cristianini N, Shawe-Taylor J (2000) An introduction to support vector machines. Cambridge University Press, Cambridge
Dowse TJ, Pascall JC, Brown KD, Soldati D (2005) Apicomplexan rhomboids have a potential role in microneme protein cleavage during host cell invasion. Int J Parasitol 35:747–756. doi:10.1016/j.ijpara.2005.04.001
Eggleson KK, Duffin KL, Goldberg DE (1999) Identification and characterization of falcilysin, a metallopeptidase involved in hemoglobin catabolism within the malaria parasite Plasmodium falciparum. J Biol Chem 274:32411–32417. doi:10.1074/jbc.274.45.32411
Eisen JA, Coyne RS, Wu M, Wu D, Thiagarajan M, Wortman JR, Badger JH, Ren Q, Amedeo P, Jones KM, Tallon LJ, Delcher AL, Salzberg SL, Silva JC, Haas BJ, Majoros WH, Farzad M, Carlton JM, Smith RK Jr, Garg J, Pearlman RE, Karrer KM, Sun L, Manning G, Elde NC, Turkewitz AP, Asai DJ, Wilkes DE, Wang Y, Cai H, Collins K, Stewart BA, Lee SR, Wilamowska K, Weinberg Z, Ruzzo WL, Wloga D, Gaertig J, Frankel J, Tsao CC, Gorovsky MA, Keeling PJ, Waller RF, Patron NJ, Cherry JM, Stover NA, Krieger CJ, del Toro C, Ryder HF, Williamson SC, Barbeau RA, Hamilton EP, Orias E (2006) Macronuclear genome sequence of the ciliate Tetrahymena thermophila, a model eukaryote. PLoS Biol 4:e286. doi:10.1371/journal.pbio.0040286
Ersmark K, Samuelsson B, Hallberg A (2006) Plasmepsins as potential targets for new antimalarial therapy. Med Res Rev 26:626–666. doi:10.1002/med.20082
Felsenstein J (1981) Evolutionary trees from DNA sequences: a maximum likelihood approach. J Mol Evol 17:368–376. doi:10.1007/BF01734359
Finn RD, Tate J, Mistry J, Coggill PC, Sammut SJ, Hotz HR, Ceric G, Forslund K, Eddy SR, Sonnhammer EL, Bateman A (2008) The Pfam protein families database. Nucleic Acids Res 36:D281–D288. doi:10.1093/nar/gkm960
Florens L, Washburn MP, Raine JD, Anthony RM, Grainger M, Haynes JD, Moch JK, Muster N, Sacci JB, Tabb DL, Witney AA, Wolters D, Wu YM, Gardner MJ, Holder AA, Sinden RE, Yates JR, Carucci DJ (2002) A proteomic view of the Plasmodium falciparum life cycle. Nature 419:520–526. doi:10.1038/nature01107
Florens L, Liu X, Wang YF, Yang SG, Schwartz O, Peglar M, Carucci DJ, Yates JR, Wu YM (2004) Proteomics approach reveals novel proteins on the surface of malaria-infected erythrocytes. Mol Biochem Parasitol 135:1–11. doi:10.1016/j.molbiopara.2003.12.007
Gantt SM, Myung JM, Briones MRS, Li WD, Corey EJ, Omura S, Nussenzweig V, Sinnis P (1998) Proteasome inhibitors block development of Plasmodium spp. Antimicrob Agents Chemother 42:2731–2738
Gardner MJ, Hall N, Fung E, White O, Berriman M, Hyman RW, Carlton JM, Pain A, Nelson KE, Bowman S, Paulsen IT, James K, Eisen JA, Rutherford K, Salzberg SL, Craig A, Kyes S, Chan MS, Nene V, Shallom SJ, Suh B, Peterson J, Angiuoli S, Pertea M, Allen J, Selengut J, Haft D, Mather MW, Vaidya AB, Martin DMA, Fairlamb AH, Fraunholz MJ, Roos DS, Ralph SA, McFadden GI, Cummings LM, Subramanian GM, Mungall C, Venter JC, Carucci DJ, Hoffman SL, Newbold C, Davis RW, Fraser CM, Barrell B (2002) Genome sequence of the human malaria parasite Plasmodium falciparum. Nature 419:498–511. doi:10.1038/nature01097
Goldberg DE (2005) Hemoglobin degradation. Curr Top Microbiol Immunol 295:275–291. doi:10.1007/3-540-29088-5_11
Gruszynski AE, DeMaster A, Hooper NM, Bangs JD (2003) Surface coat remodeling during differentiation of Trypanosoma brucei. J Biol Chem 278:24665–24672. doi:10.1074/jbc.M301497200
Hackett F, Sajid M, Withers-Martinez C, Grainger M, Blackman MJ (1999) PfSUB-2: a second subtilisin-like protein in Plasmodium falciparum merozoites. Mol Biochem Parasitol 103:183–195. doi:10.1016/S0166-6851(99)00122-X
Hall N, Karras M, Raine JD, Carlton JM, Kooij TW, Berriman M, Florens L, Janssen CS, Pain A, Christophides GK, James K, Rutherford K, Harris B, Harris D, Churcher C, Quail MA, Ormond D, Doggett J, Trueman HE, Mendoza J, Bidwell SL, Rajandream MA, Carucci DJ, Yates JRIII, Kafatos FC, Janse CJ, Barrell B, Turner CM, Waters AP, Sinden RE (2005) A comprehensive survey of the Plasmodium life cycle by genomic, transcriptomic, and proteomic analyses. Science 307:82–86. doi:10.1126/science.1103717
Haque TS, Skillman AG, Lee CE, Habashita H, Gluzman IY, Ewing TJ, Goldberg DE, Kuntz ID, Ellman JA (1999) Potent, low-molecular-weight non-peptide inhibitors of malarial aspartyl protease plasmepsin II. J Med Chem 42:1428–1440. doi:10.1021/jm980641t
Herlan M, Vogel F, Bornhovd C, Neupert W, Reichert AS (2003) Processing of Mgm1 by the rhomboid-type protease Pcp1 is required for maintenance of mitochondrial morphology and of mitochondrial DNA. J Biol Chem 278:27781–27788. doi:10.1074/jbc.M211311200
Hodder AN, Drew DR, Epa VC, Delorenzi M, Bourgon R, Miller SK, Moritz RL, Frecklington DF, Simpson RJ, Speed TP, Pike RN, Crabb BS (2003) Enzymic, phylogenetic, and structural characterization of the unusual papain-like protease domain of Plasmodium falciparum SERA5. J Biol Chem 278:48169–48177. doi:10.1074/jbc.M306755200
Iyer LM, Koonin EV, Aravind L (2004) Novel predicted peptidases with a potential role in the ubiquitin signaling pathway. Cell Cycle 3:1440–1450
Jaakkola T, Diekhans M, Haussler D (2000) A discriminative framework for detecting remote protein homologies. J Comput Biol 7:95–114. doi:10.1089/10665270050081405
Jean L, Long M, Young J, Pery P, Tomley F (2001) Aspartyl proteinase genes from apicomplexan parasites: evidence for evolution of the gene structure. Trends Parasitol 17:491–498. doi:10.1016/S1471-4922(01)02030-X
Karplus K, Barrett C, Hughey R (1998) Hidden Markov models for detecting remote protein homologies. Bioinformatics 14:846–856. doi:10.1093/bioinformatics/14.10.846
Kasam V, Zimmermann M, Maass A, Schwichtenberg H, Wolf A, Jacq N, Breton V, Hofmann-Apitius M (2007) Design of new plasmepsin inhibitors: a virtual high throughput screening approach on the EGEE grid. J Chem Inf Model 47:1818–1828. doi:10.1021/ci600451t
Knop M, Finger A, Braun T, Hellmuth K, Wolf DH (1996) Der1, a novel protein specifically required for endoplasmic reticulum degradation in yeast. EMBO J 15:753–763
Kuang R, Ie E, Wang K, Wang K, Siddiqi M, Freund Y, Leslie C (2005) Profile-based string kernels for remote homology detection and motif extraction. J Bioinform Comput Biol 3:527–550. doi:10.1142/S021972000500120X
LaCount DJ, Gruszynski AE, Grandgenett PM, Bangs JD, Donelson JE (2003) Expression and function of the Trypanosoma brucei major surface protease (GP63) genes. J Biol Chem 278:24658–24664. doi:10.1074/jbc.M301451200
Lasonder E, Ishihama Y, Andersen JS, Vermunt AMW, Pain A, Sauerwein RW, Eling WMC, Hall N, Waters AP, Stunnenberg HG, Mann M (2002) Analysis of the Plasmodium falciparum proteome by high-accuracy mass spectrometry. Nature 419:537–542. doi:10.1038/nature01111
Le Chat L, Sinden RE, Dessens JT (2007) The role of metacaspase 1 in Plasmodium berghei development and apoptosis. Mol Biochem Parasitol 153:41–47. doi:10.1016/j.molbiopara.2007.01.016
Le Roch KG, Zhou Y, Blair PL, Grainger M, Moch JK, Haynes JD, De La Vega P, Holder AA, Batalov S, Carucci DJ, Winzeler EA (2003) Discovery of gene function by expression profiling of the malaria parasite life cycle. Science 301:1503–1508. doi:10.1126/science.1087025
Le Roch KG, Johnson JR, Florens L, Zhou Y, Santrosyan A, Grainger M, Yan SF, Williamson KC, Holder AA, Carucci DJ, Yates JRIII, Winzeler EA (2004) Global analysis of transcript and protein levels across the Plasmodium falciparum life cycle. Genome Res 14:2308–2318. doi:10.1101/gr.2523904
Leslie CS, Eskin E, Cohen A, Weston J, Noble WS (2004) Mismatch string kernels for discriminative protein classification. Bioinformatics 20:467–476. doi:10.1093/bioinformatics/btg431
Li R, Chen X, Gong B, Selzer PM, Li Z, Davidson E, Kurzban G, Miller RE, Nuzum EO, McKerrow JH, Fletterick RJ, Gillmor SA, Craik CS, Kuntz ID, Cohen FE, Kenyon GL (1996) Structure-based design of parasitic protease inhibitors. Bioorg Med Chem 4:1421–1427. doi:10.1016/0968-0896(96)00136-8
Liao L, Noble WS (2003) Combining pairwise sequence similarity and support vector machines for detecting remote protein evolutionary and structural relationships. J Comput Biol 10:857–868. doi:10.1089/106652703322756113
Madeo F, Herker E, Maldener C, Wissing S, Lachelt S, Herian M, Fehr M, Lauber K, Sigrist SJ, Wesselborg S, Frohlich KU (2002) A caspase-related protease regulates apoptosis in yeast. Mol Cell 9:911–917. doi:10.1016/S1097-2765(02)00501-4
Malhotra P, Dasaradhi PV, Kumar A, Mohmmed A, Agrawal N, Bhatnagar RK, Chauhan VS (2002) Double-stranded RNA-mediated gene silencing of cysteine proteases (falcipain-1 and -2) of Plasmodium falciparum. Mol Microbiol 45:1245–1254. doi:10.1046/j.1365-2958.2002.03105.x
McCoubrie JE, Miller SK, Sargeant T, Good RT, Hodder AN, Speed TP, de Koning-Ward TF, Crabb BS (2007) Evidence for a common role for the serine-type Plasmodium falciparum serine repeat antigen proteases: implications for vaccine and drug design. Infect Immun 75:5565–5574. doi:10.1128/IAI.00405-07
McQuibban GA, Saurya S, Freeman M (2003) Mitochondrial membrane remodelling regulated by a conserved rhomboid protease. Nature 423:537–541. doi:10.1038/nature01633
Meslin B, Barnadas C, Boni V, Latour C, De Monbrison F, Kaiser K, Picot S (2007) Features of apoptosis in Plasmodium falciparum erythrocytic stage through a putative role of PfMCA1 metacaspase-like protein. J Infect Dis 195:1852–1859. doi:10.1086/518253
Mohmmed A, Dasaradhi PV, Bhatnagar RK, Chauhan VS, Malhotra P (2003) In vivo gene silencing in Plasmodium berghei—a mouse malaria model. Biochem Biophys Res Commun 309:506–511. doi:10.1016/j.bbrc.2003.08.027
Murata CE, Goldberg DE (2003a) Plasmodium falciparum falcilysin: a metalloprotease with dual specificity. J Biol Chem 278:38022–38028. doi:10.1074/jbc.M306842200
Murata CE, Goldberg DE (2003b) Plasmodium falciparum falcilysin: an unprocessed food vacuole enzyme. Mol Biochem Parasitol 129:123–126. doi:10.1016/S0166-6851(03)00098-7
Murzin AG, Brenner SE, Hubbard T, Chothia C (1995) SCOP: a structural classification of proteins database for the investigation of sequences and structures. J Mol Biol 247:536–540
Notredame C, Higgins DG, Heringa J (2000) T-coffee: a novel method for fast and accurate multiple sequence alignment. J Mol Biol 302:205–217. doi:10.1006/jmbi.2000.4042
Nyborg AC, Ladd TB, Jansen K, Kukar T, Golde TE (2006) Intramembrane proteolytic cleavage by human signal peptide peptidase like 3 and malaria signal peptide peptidase. FASEB J 20:1671–1679. doi:10.1096/fj.06-5762com
O’Donnell RA, Hackett F, Howell SA, Treeck M, Struck N, Krnajski Z, Withers-Martinez C, Gilberger TW, Blackman MJ (2006) Intramembrane proteolysis mediates shedding of a key adhesin during erythrocyte invasion by the malaria parasite. J Cell Biol 174:1023–1033. doi:10.1083/jcb.200604136
Pandey KC, Singh N, Arastu-Kapur S, Bogyo M, Rosenthal PJ (2006) Falstatin, a cysteine protease inhibitor of Plasmodium falciparum, facilitates erythrocyte invasion. PLoS Pathog 2:e117. doi:10.1371/journal.ppat.0020117
Ponpuak M, Klemba M, Park M, Gluzman IY, Lamppa GK, Goldberg DE (2007) A role for falcilysin in transit peptide degradation in the Plasmodium falciparum apicoplast. Mol Microbiol 63:314–334. doi:10.1111/j.1365-2958.2006.05443.x
Powers JC, Asgian JL, Ekici OD, James KE (2002) Irreversible inhibitors of serine, cysteine, and threonine proteases. Chem Rev 102:4639–4750. doi:10.1021/cr010182v
Puente XS, Gutierrez-Fernandez A, Ordonez GR, Hillier LW, Lopez-Otin C (2005) Comparative genomic analysis of human and chimpanzee proteases. Genomics 86:638–647. doi:10.1016/j.ygeno.2005.07.009
Ramasamy G, Gupta D, Mohmmed A, Chauhan VS (2007) Characterization and localization of Plasmodium falciparum homolog of prokaryotic ClpQ/HslV protease. Mol Biochem Parasitol 152:139–148. doi:10.1016/j.molbiopara.2007.01.002
Rangwala H, Karypis G (2005) Profile-based direct kernels for remote homology detection and fold recognition. Bioinformatics 21:4239–4247. doi:10.1093/bioinformatics/bti687
Rawlings ND, Morton FR, Kok CY, Kong J, Barrett AJ (2008) MEROPS: the peptidase database. Nucleic Acids Res 36:D320–D325. doi:10.1093/nar/gkm954
Rosenthal PJ (2002) Hydrolysis of erythrocyte proteins by proteases of malaria parasites. Curr Opin Hematol 9:140–145. doi:10.1097/00062752-200203000-00010
Rosenthal PJ (2004) Cysteine proteases of malaria parasites. Int J Parasitol 34:1489–1499. doi:10.1016/j.ijpara.2004.10.003
Rosenthal PJ, Sijwali PS, Singh A, Shenai BR (2002) Cysteine proteases of malaria parasites: targets for chemotherapy. Curr Pharm Des 8:1659–1672. doi:10.2174/1381612023394197
Scheidt KA, Roush WR, McKerrow JH, Selzer PM, Hansell E, Rosenthal PJ (1998) Structure-based design, synthesis and evaluation of conformationally constrained cysteine protease inhibitors. Bioorg Med Chem 6:2477–2494. doi:10.1016/S0968-0896(98)80022-9
Sharma A (2007) Malarial protease inhibitors: potential new chemotherapeutic agents. Curr Opin Investig Drugs 8:642–652
Shenai BR, Sijwali PS, Singh A, Rosenthal PJ (2000) Characterization of native and recombinant falcipain-2, a principal trophozoite cysteine protease and essential hemoglobinase of Plasmodium falciparum. J Biol Chem 275:29000–29010. doi:10.1074/jbc.M004459200
Shenai BR, Semenov AV, Rosenthal PJ (2002) Stage-specific antimalarial activity of cysteine protease inhibitors. Biol Chem 383:843–847. doi:10.1515/BC.2002.089
Sijwali PS, Rosenthal PJ (2004) Gene disruption confirms a critical role for the cysteine protease falcipain-2 in hemoglobin hydrolysis by Plasmodium falciparum. Proc Natl Acad Sci USA 101:4384–4389. doi:10.1073/pnas.0307720101
Sijwali PS, Shenai BR, Gut J, Singh A, Rosenthal PJ (2001) Expression and characterization of the Plasmodium falciparum haemoglobinase falcipain-3. Biochem J 360:481–489. doi:10.1042/0264-6021:3600481
Sijwali PS, Koo J, Singh N, Rosenthal PJ (2006) Gene disruptions demonstrate independent roles for the four falcipain cysteine proteases of Plasmodium falciparum. Mol Biochem Parasitol 150:96–106. doi:10.1016/j.molbiopara.2006.06.013
Southan C (2001) A genomic perspective on human proteases. FEBS Lett 498:214–218. doi:10.1016/S0014-5793(01)02490-5
Suthram S, Sittler T, Ideker T (2005) The Plasmodium protein network diverges from those of other eukaryotes. Nature 438:108–112. doi:10.1038/nature04135
Tamura K, Dudley J, Nei M, Kumar S (2007) MEGA4: molecular evolutionary genetics analysis (MEGA) software version 4.0. Mol Biol Evol 24:1596–1599. doi:10.1093/molbev/msm092
Urban S, Lee JR, Freeman M (2001) Drosophila rhomboid-1 defines a family of putative intramembrane serine proteases. Cell 107:173–182. doi:10.1016/S0092-8674(01)00525-6
Urban S, Schlieper D, Freeman M (2002) Conservation of intramembrane proteolytic activity and substrate specificity in prokaryotic and eukaryotic rhomboids. Curr Biol 12:1507–1512. doi:10.1016/S0960-9822(02)01092-8
Vapnik VN (1998) Statistical learning theory. Adaptive and learning systems for signal processing, communications, and control. Wiley, New York
Wang Y, Wu Y (2004) Computer assisted searches for drug targets with emphasis on malarial proteases and their inhibitors. Curr Drug Targets Infect Disord 4:25–40. doi:10.2174/1568005043480952
Wang Y, Zhang Y, Ha Y (2006) Crystal structure of a rhomboid family intramembrane protease. Nature 444:179–180. doi:10.1038/nature05255
Withers-Martinez C, Jean L, Blackman MJ (2004) Subtilisin-like proteases of the malaria parasite. Mol Microbiol 53:55–63. doi:10.1111/j.1365-2958.2004.04144.x
Wu YM, Wang XY, Liu X, Wang YF (2003) Data-mining approaches reveal hidden families of proteases in the genome of malaria parasite. Genome Res 13:601–616. doi:10.1101/gr.913403
Yeoh S, O’Donnell RA, Koussis K, Dluzewski AR, Ansell KH, Osborne SA, Hackett F, Withers-Martinez C, Mitchell GH, Bannister LH, Bryans JS, Kettleborough CA, Blackman MJ (2007) Subcellular discharge of a serine protease mediates release of invasive malaria parasites from host erythrocytes. Cell 131:1072–1083. doi:10.1016/j.cell.2007.10.049

Acknowledgments

We thank the anonymous reviewers for their constructive comments. We thank PlasmoDB for providing an all-in-one portal for malaria genomic data. The project described is supported by grants 1SC1GM081068, 8SC1AI080579, and R21AI067543 from the National Institute of General Medical Sciences and National Institute of Allergy and Infectious Diseases to Y. Wang. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institute of General Medical Sciences, National Institute of Allergy and Infectious Diseases or the National Institutes of Health. YW is also supported by NIH grant G12RR013646, and San Antonio Area Foundation Biomedical Research Funds. RK is supported by Grant-in-Aid of Research, Artistry and Scholarship at University of Minnesota, and the Biomedical Informatics and Computational Biology Seed Grant for UM-Mayo-IBM Collaboration. JG is supported by PSC-CUNY 37 Research Award and Summer Research Award for faculty at College of Staten Island/CUNY.

Author information

Authors and Affiliations

Department of Computer Science and Engineering, University of Minnesota, Twin Cities, Minneapolis, MN, 55455, USA
Rui Kuang
Department of Biology, College of Staten Island/City University of New York, Staten Island, NY, 10314, USA
Jianying Gu
Department of Biology, University of Texas at San Antonio, San Antonio, TX, 78249, USA
Hong Cai & Yufeng Wang

Corresponding authors

Correspondence to Rui Kuang or Yufeng Wang.

Additional information

Rui Kuang and Jianying Gu have contributed equally to this work.

An erratum to this article can be found at http://dx.doi.org/10.1007/s10709-009-9383-x

Electronic supplementary material

Four supplementary files accompany the article: (XLS 46 kb), (XLS 87 kb), (XLS 88 kb), (XLS 91 kb).

About this article

Cite this article

Kuang, R., Gu, J., Cai, H. et al. Improved prediction of malaria degradomes by supervised learning with SVM and profile kernel. Genetica 136, 189–209 (2009). https://doi.org/10.1007/s10709-008-9336-9

Received: 22 July 2008
Accepted: 17 November 2008
Published: 06 December 2008
Issue Date: May 2009
DOI: https://doi.org/10.1007/s10709-008-9336-9
