Introduction

While several species of Plasmodium cause disease in humans (including P. vivax, P. malariae, P. ovale and P. knowlesi), Plasmodium falciparum (P. falciparum) is by far the deadliest. Malaria caused by P. falciparum remains the world’s most devastating tropical infectious disease, resulting in 300–500 million clinical cases and 1–2 million deaths annually, and its long-term control and eradication are still a long way off (Snow et al. 2005; Winzeler 2008). The potential for developing effective drugs and vaccines against this parasite is thus considerable. The parasite secretes an array of proteins within the host erythrocyte and beyond to facilitate its own survival within the host cell and for immunomodulation (Verma et al. 2008). These secreted proteins can serve as potential drug or vaccine targets. Therefore, the identification of secretory proteins of P. falciparum will be helpful for drug design and combination (Birkholtz et al. 2008).

Recently, an increasing number of studies have indicated that mathematical/computational approaches, such as molecular docking (Chou et al. 2003; Wang et al. 2008a), pharmacophore modeling (Chou et al. 2006; Sirois et al. 2004), protein subcellular location prediction (Chou and Shen 2007d, 2008a; Shen and Chou 2007c), protein structural class prediction (Chou 1995, 2000; Chou and Cai 2004; Chou and Zhang 1995; Xiao et al. 2008), identification of proteases and their types (Chou and Shen 2008b; Shen and Chou 2009), protein cleavage site prediction (Chou 1993, 1996; Shen and Chou 2008a), and signal peptide prediction (Chou and Shen 2007e; Shen and Chou 2007b), can provide timely and very useful information and insights for both basic research and drug design, and hence are widely welcomed by the scientific community. The present study is devoted to developing a novel computational approach for predicting secretory proteins that would be particularly effective in characterizing the properties of selected proteins. Several programs have been developed for predicting secretory proteins, such as SecretomeP-2.0 (Bendtsen et al. 2004), TargetP (Emanuelsson et al. 2000), SRTpred (Garg and Raghava 2008) and the work of Liu et al. (2007). When the P. falciparum genome sequence was published in 2002, it revealed that the nucleotide composition is unusually AT-rich (~80% AT on average) and that the proteins of P. falciparum are more complex than those of other species. Thus, the prediction of secretory proteins is more difficult for P. falciparum than for other species (Gardner et al. 2002). Recently, Verma et al. developed the first SVM models for predicting secretory proteins of the malaria parasite and achieved good prediction accuracy (Verma et al. 2008).

On the basis of the Shannon entropy definition, Laxton introduced the concept of the measure of diversity (Laxton 1978), which parallels Shannon entropy. The measure of diversity is a kind of information description on a discrete state space and a measure of the whole uncertainty of a system. In order to compare the distributions of two species, one defines the increment of diversity (ID) as the difference between the diversity measure of the mixed system and the total diversity measure of the two separate systems. The ID was developed and successfully employed for classification in biogeography. Li’s group first introduced the ID to protein prediction, including the recognition of protein structural class (Li and Lu 2001; Lin and Li 2007a), protein superfamily classification (Lin and Li 2007b), subcellular and subnuclear location (Chen and Li 2007; Li and Li 2008a, b), and beta-hairpin and gamma-turn prediction (Hu and Li 2008), and good prediction performance was obtained. These results suggest that the ID is a good index for distinguishing two different sources established by proteins. In this paper, based on the ID and the K-nearest neighbor method, the K-minimum increment of diversity (K-MID) is developed to predict secretory proteins of the malaria parasite. Using amino acid composition, the prediction accuracy and Matthews correlation coefficient (MCC) are 88.89% and 0.78 when K = 5, higher than those of the SVM models. In order to investigate how a particular class or property of amino acids affects prediction accuracy, and to examine the effect of amino acids with different biochemical properties, several different reduced amino acid alphabets are introduced in this study. The results indicate that the 20 amino acids can be clustered into about ten reduced amino acid groups. The best prediction performance is obtained by using the reduced amino acid alphabet derived from the Protein Blocks method.

Materials and methods

Datasets

A critical issue in developing a secretory protein prediction algorithm for the malaria parasite is the lack of suitable training and testing sets. In this study, we used the dataset of 252 secretory proteins and 252 non-secretory proteins constructed by Verma et al. (2008). From the literature, Verma et al. collected a total of 267 secretory proteins, consisting of: 208 secretory proteins (119 Rifins, 22 Stevors, 67 PfEMP1); 6 experimentally proven proteins (PF10_0159, PFE0040c, PFB0100c, PFB0095c, AAD31511, AAC47454); another set of 3 experimentally proven secretory proteins (PFD1175w, PFD1170c, PFB0100c); 7 further proteins (PFI1755c, PFE0055c, PFI1780w, PFE0360c, PF10_0321, PF14_0607, PFE0355c); 4 REX proteins (PFI1740c, PFI1755c, PFI1760w, PFI1735c); 2 PIESPs (PFC0435c, PFE0060w); clag9 (PFI1730w); Sbp1 (PFE0065w); and 35 Maurer’s cleft associated proteins. After removing redundant proteins with the program PROSET, they obtained 252 non-redundant secretory proteins. The 252 non-secretory proteins come from two sources: 197 were extracted from Swiss-Prot using SRS with the query “Plasmodium falciparum (organism) but not secreted (comment)”, and the remaining 55 were randomly picked from ~300 nuclear proteins in PlasmoDB.

The definition of increment of diversity

For a discrete state space X with d dimensions, \( X:\{ n_{1}, n_{2}, \ldots, n_{i}, \ldots, n_{d} \} \), where \( n_{i} \) denotes the number of occurrences of the ith state, the Shannon information entropy (Shannon 1948), a measure of uncertainty denoted by H(X), is defined as:

$$ H(X) = - \sum\limits_{i = 1}^{d} {P_{i} } \log_{b} P_{i} $$
(1)

where \( N = \sum_{i = 1}^{d} {n_{i} } \) and \( P_{i} = n_{i}/N \) is the probability of the ith state.

From the viewpoint of information theory, the quantity of diversity measured in such a state space is called the measure of diversity, denoted by D(X), and is defined as:

$$ D(X) = - \sum\limits_{i = 1}^{d} {n_{i} } \log_{b} P_{i} = - \sum\limits_{i = 1}^{d} {n_{i} } \log_{b} \frac{{n_{i} }}{N} = N\log_{b} N - \sum\limits_{i = 1}^{d} {n_{i} \log_{b} } n_{i} $$
(2)

According to the definition of information entropy in formula (1), we get

$$ H(X) = - \sum\limits_{i = 1}^{d} {P_{i} } \log_{b} P_{i} = - \sum\limits_{i = 1}^{d} {\frac{{n_{i} }}{N}} \log_{b} \frac{{n_{i} }}{N} = \frac{1}{N}D\left( X \right) $$
(3)

So we have

$$ D(X) = N \cdot H(X) $$
(4)

H(X) is the information entropy, a measure of the uncertainty associated with a random variable. The measure of diversity D(X) in formula (4) is a kind of information description on the state space and a measure of the whole uncertainty and total information of a system (Laxton 1978).
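The relation between Eqs. 1, 2 and 4 can be checked numerically. The following is a minimal Python sketch, not the authors' implementation; the function names, the choice of base b = 2 and the toy count vector are assumptions made only for illustration.

```python
import math

def shannon_entropy(counts, base=2.0):
    """H(X) of Eq. 1; states with a zero count contribute nothing."""
    total = sum(counts)
    return -sum((n / total) * math.log(n / total, base) for n in counts if n > 0)

def diversity(counts, base=2.0):
    """D(X) of Eq. 2, written as N log N - sum_i n_i log n_i."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return total * math.log(total, base) - sum(
        n * math.log(n, base) for n in counts if n > 0
    )

# Toy state counts (assumed): N = 8, H(X) = 1.5 bits, so D(X) = N * H(X) = 12.
counts = [4, 2, 2]
print(diversity(counts), sum(counts) * shannon_entropy(counts))
```

Both printed values agree up to floating-point rounding, which is the content of Eq. 4.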

In general, for two sources of diversity in the same parameter space of d dimensions, \( X:\{ n_{1}, n_{2}, \ldots, n_{i}, \ldots, n_{d} \} \) and \( Y:\{ m_{1}, m_{2}, \ldots, m_{i}, \ldots, m_{d} \} \), the increment of diversity (ID), denoted by ID(X, Y), is defined as:

$$ {\text{ID}}\left( {X,Y} \right) = D\left( {X + Y} \right) - D\left( X \right) - D\left( Y \right) $$
(5)

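Here, D(X+Y) is the measure of diversity of the sum of the two diversity sources, called the combination diversity source space, in which the count of the ith state is the sum of the counts of the ith state in X and in Y.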

It is easily proved that the increment of diversity ID(X, Y) is nonnegative and symmetric. Therefore, the ID is a quantitative measure of the similarity level of two diversity sources. The higher the similarity of two sources, the smaller the ID.
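Eq. 5 and the properties stated above can be illustrated with a short Python sketch. It is only an illustration under stated assumptions: the function names, the base-2 logarithm and the toy count vectors are not from the original work.

```python
import math

def diversity(counts):
    """Measure of diversity D(X) of Eq. 2 with base-2 logarithms (assumed base)."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return total * math.log2(total) - sum(n * math.log2(n) for n in counts if n > 0)

def increment_of_diversity(x, y):
    """ID(X, Y) of Eq. 5; X + Y is formed by adding the counts state by state."""
    if len(x) != len(y):
        raise ValueError("both sources must live in the same state space")
    combined = [a + b for a, b in zip(x, y)]
    return diversity(combined) - diversity(x) - diversity(y)

# Sources with identical proportions give an ID of (numerically) zero,
# while sources with no state in common give a large ID.
print(increment_of_diversity([4, 2, 2], [8, 4, 4]))
print(increment_of_diversity([4, 0], [0, 4]))
```

In this toy case the first pair has the same state proportions and the second pair shares no state, so the two values bracket the range a query sequence can produce against a reference of the same size.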

The K-minimum increment of diversity (K-MID) classifier

The K-nearest neighbor (K-NN) technique has become extremely popular for a variety of classification and estimation applications, such as protein subcellular localization (Chou and Shen 2006a, b, 2007a, b; Shen and Chou 2007b, c; Shen et al. 2007), subnuclear protein localization (Shen and Chou 2005a), protein structural classification (Shen et al. 2005; Zhang et al. 2008a, b), protein fold pattern (Shen and Chou 2006), membrane protein type (Shen and Chou 2005b; Shen et al. 2006; Chou and Shen 2007c), enzyme main and sub functional classification (Shen and Chou 2007a) as well as signal peptide (Chou and Shen 2007e). Much of this popularity may be attributed to the non-parametric, multivariate features of the technique, its intuitiveness, and its ease of use. In K-NN, the query protein is classified by a majority vote of its neighbors, being assigned to the class most common amongst its K nearest neighbors. K is a positive integer, typically small. If K = 1, the protein is simply assigned to the class of its nearest neighbor. Although different distance measures can be used, such as the Euclidean distance, Hamming distance (Mardia et al. 1979) and Mahalanobis distance (Chou 1995), the Euclidean distance is the most commonly used. In this paper, the ID is used as the similarity measure for predicting secretory proteins.

For an arbitrary protein sequence X to be predicted, the increment of diversity (ID) between X and each stored sequence of the diversity source Y established by secretory proteins (S) or non-secretory proteins (N) is computed. The K smallest IDs, \( {\text{ID}}(X, Y_{i}) \) with \( Y_{i} \) the stored sequence having the ith smallest ID to X, are selected and their average, denoted by K-MID(X, Y), is calculated as follows:

$$ K{\text{-MID}}\left( {X,Y} \right) = \frac{1}{K}\sum\limits_{i = 1}^{K} {{\text{ID}}\left( {X,Y_{i} } \right)} $$
(6)

Each ID is calculated by using Eq. 5. The protein X is then predicted as belonging to the category (secretory (S) or non-secretory (N)) for which the corresponding K-MID has the minimum value, which can be formulated as follows:

$$ K{\text{ - MID}}\left( {X,Y^{\xi } } \right) = {\text{Min}}\left\{ {K{\text{ - MID}}\left( {X,Y^{S} } \right),K{\text{ - MID}}\left( {X,Y^{N} } \right)} \right\}\quad \left( {\xi = S,N} \right) $$
(7)

where ξ denotes either the secretory (S) or the non-secretory (N) class and Min means taking the minimum value among those in the braces; the ξ in Eq. 7 then gives the class to which the query protein sequence X should belong.
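The decision rule of Eqs. 6 and 7 can be sketched in a few lines of Python. This is a hedged illustration rather than the authors' code: the function names, the base-2 logarithm, the tie-breaking in favor of S, the use of all references when fewer than K are available, and the toy count vectors are all assumptions.

```python
import math

def diversity(counts):
    """Measure of diversity D(X) of Eq. 2 (base-2 logarithm assumed)."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return total * math.log2(total) - sum(n * math.log2(n) for n in counts if n > 0)

def increment_of_diversity(x, y):
    """ID(X, Y) of Eq. 5 for two count vectors of equal length."""
    combined = [a + b for a, b in zip(x, y)]
    return diversity(combined) - diversity(x) - diversity(y)

def k_mid(query, references, k=5):
    """Eq. 6: mean of the K smallest IDs between the query and one class."""
    if not references:
        raise ValueError("each class needs at least one reference sequence")
    nearest = sorted(increment_of_diversity(query, ref) for ref in references)[:k]
    return sum(nearest) / len(nearest)  # uses fewer than K if the class is small

def classify(query, secretory, non_secretory, k=5):
    """Eq. 7: assign the class whose K-MID is smaller (ties go to S, an assumption)."""
    score_s = k_mid(query, secretory, k)
    score_n = k_mid(query, non_secretory, k)
    return "S" if score_s <= score_n else "N"

# Toy three-state count vectors (assumed data, not taken from the benchmark set).
secretory = [[8, 1, 1], [7, 2, 1], [9, 1, 0]]
non_secretory = [[1, 8, 1], [2, 7, 1], [0, 9, 1]]
print(classify([6, 1, 1], secretory, non_secretory, k=2))
```

Unlike a majority vote over the pooled neighbors, this rule compares the average ID to the K closest members of each class separately, which is how Eqs. 6 and 7 are stated.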

Reduced amino acids alphabets

It has been found that some residues are similar in their physicochemical features and can be clustered into groups because they play similar structural or functional roles in proteins (Regan and Degrado 1988; Kamtekar 1993; Henikoff and Henikoff 1992). Reduced amino acid alphabets not only simplify the complexity of the protein system, but also improve the ability to find structurally conserved regions and the structural similarity of entire proteins. In recent years, several alphabet reduction techniques have been applied to protein prediction, such as intrinsically disordered protein prediction (Weathers et al. 2004), recognition of protein structurally conserved regions (Li and Wang 2007), subcellular localization prediction (Oğul and Mumcuoğlu 2007) and peptide and protein classification (Nanni and Lumini 2008). To investigate how a particular class or property of amino acids affects prediction accuracy, several reduced amino acid alphabets obtained with different clustering approaches are discussed in this study. These clustering methods include the Miyazawa–Jernigan matrix (MJM) based on inter-residue contact energies (Rakshit and Ananthasuresh 2008), Markov models of evolution (Susko and Roger 2007), the BLOSUM62 matrix (Li et al. 2003), Protein Blocks based on local protein structures (Etchebest et al. 2007) and the BLOSUM50 similarity matrix (Henikoff and Henikoff 1992).

Protein sequence representation

The amino acid composition (AAC) representation of a given sequence is composed of the occurrences of the 20 different amino acids, which have a variety of shapes, sizes and chemical properties. The AAC representation has recently been widely utilized in protein function annotation. To avoid completely losing the sequence-order information, the pseudo-amino acid (PseAA) composition, or PseAAC, was proposed (Chou 2001; Chou 2005). The essence of Chou’s pseudo-amino acid composition is to use a discrete model to represent a protein sample without completely losing its sequence-order information. Ever since the concept of Chou’s pseudo-amino acid composition was introduced, various PseAAC approaches have been developed to deal with different problems in proteins and protein-related systems (Chen et al. 2009; Ding and Zhang 2008; Jiang et al. 2008; Li and Li 2008b; Lin 2008; Lin et al. 2008; Wang et al. 2008b; Zhang and Fang 2008; Zhang et al. 2008a, b; Zhou et al. 2007). Owing to its wide usage, a very flexible PseAA composition generator called “PseAAC” (Shen and Chou 2008b) was recently established at the website http://chou.med.harvard.edu/bioinf/PseAAC/, by which users can generate 63 different kinds of PseAA composition. In this study, the amino acid compositions (AAC) and dipeptide compositions (DPC) of the reduced amino acid alphabets are selected to test the K-MID algorithm.
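As an illustration of how AAC and DPC count vectors over a reduced alphabet might be built, the following Python sketch maps each residue to a group and counts single groups and adjacent group pairs. The four-group alphabet here is hypothetical and chosen only for brevity; it is not one of the schemes used in this study. The function names, the use of raw counts rather than frequencies, and the skipping of non-standard residues are likewise assumptions.

```python
from itertools import product

# Hypothetical 4-group alphabet for illustration only (NOT a scheme from this study).
GROUPS = ["AVLIMFWCGP", "STNQY", "KRH", "DE"]
RESIDUE_TO_GROUP = {aa: i for i, group in enumerate(GROUPS) for aa in group}
PAIR_INDEX = {pair: i for i, pair in enumerate(product(range(len(GROUPS)), repeat=2))}

def reduce_sequence(seq):
    """Map residues to group indices; non-standard letters are skipped (assumption)."""
    return [RESIDUE_TO_GROUP[aa] for aa in seq.upper() if aa in RESIDUE_TO_GROUP]

def aac_counts(seq):
    """Occurrence count of each reduced letter (vector size = number of groups)."""
    counts = [0] * len(GROUPS)
    for g in reduce_sequence(seq):
        counts[g] += 1
    return counts

def dpc_counts(seq):
    """Occurrence count of each adjacent pair of reduced letters (size = groups squared)."""
    reduced = reduce_sequence(seq)
    counts = [0] * len(PAIR_INDEX)
    for a, b in zip(reduced, reduced[1:]):
        counts[PAIR_INDEX[(a, b)]] += 1
    return counts

print(aac_counts("MKDE"))       # one count per group
print(len(dpc_counts("MKDE")))  # 4 groups give 4 * 4 = 16 dipeptide states
```

With g groups the AAC vector has g components and the DPC vector has g squared components, which is why a 13-letter alphabet yields 169 dipeptide compositions and a 10-letter alphabet yields 100.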

Test and assessment

In statistical prediction, the following three cross-validation methods are often used to examine the effectiveness of a predictor in practical application: the independent dataset test, the subsampling test, and the jackknife test (Chou and Zhang 1995). However, as elucidated by Chou and Shen (2008a, b) and demonstrated in Chou and Shen (2007d), among the three cross-validation methods the jackknife test is deemed the most objective, since it always yields a unique result for a given benchmark dataset, and hence it has been increasingly used by investigators to examine the accuracy of various predictors (Jiang et al. 2008; Li and Li 2008b; Lin 2008; Lin et al. 2008; Yang and Chou 2008; Zhang and Fang 2008; Zhang et al. 2008a, b; Zhou 1998; Zhou and Assa-Munt 2001; Zhou and Doctor 2003; Zhou et al. 2007). During the jackknife test, each protein is singled out in turn as the test sample, and the remaining proteins are used as the training set to calculate the test sample’s membership and predict its class. The prediction performance is evaluated by the sensitivity (Sn), specificity (Sp), positive predictive value (PPV), accuracy (Acc) and Matthews correlation coefficient (MCC), which are defined as follows:

$$ {\text{Sn}} = {\text{TP}}/\left( {{\text{TP}} + {\text{FN}}} \right) $$
(8)
$$ {\text{Sp}} = {\text{TN}}/\left( {{\text{TN}} + {\text{FP}}} \right) $$
(9)
$$ {\text{PPV}} = {\text{TP}}/\left( {{\text{TP}} + {\text{FP}}} \right) $$
(10)
$$ {\text{Acc}} = \left( {{\text{TP}} + {\text{TN}}} \right)/\left( {{\text{TP}} + {\text{FN}} + {\text{TN}} + {\text{FP}}} \right) $$
(11)
$$ {\text{MCC}} = \frac{{\left( {{\text{TP}} \times {\text{TN}}} \right) - \left( {{\text{FP}} \times {\text{FN}}} \right)}}{{\sqrt {\left( {{\text{TP}} + {\text{FN}}} \right) \times \left( {{\text{TN}} + {\text{FN}}} \right) \times \left( {{\text{TP}} + {\text{FP}}} \right) \times \left( {{\text{TN}} + {\text{FP}}} \right)} }} $$
(12)

where TP denotes the number of the correctly predicted secretory proteins, FN denotes the number of the secretory proteins predicted as non-secretory proteins, FP denotes the number of the non-secretory proteins predicted as secretory proteins, and TN denotes the number of correctly predicted non-secretory proteins.
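The jackknife procedure and Eqs. 8–12 can be sketched together in Python. This is a minimal sketch under stated assumptions: the function names and the `classify` callback are hypothetical, a zero denominator is mapped to 0.0 by convention, and the counts in the example (TP = 206, FN = 46, TN = 251, FP = 1) are inferred here from the 81.75% Sn and 99.60% Sp reported for the 252 + 252 benchmark set rather than quoted from the tables.

```python
import math

def jackknife_counts(samples, labels, classify):
    """Leave-one-out test over lists: each sample is predicted from all the others.

    `classify(query, train_samples, train_labels)` is a hypothetical callback
    that returns "S" or "N".
    """
    tp = fn = tn = fp = 0
    for i, (query, truth) in enumerate(zip(samples, labels)):
        train_samples = samples[:i] + samples[i + 1:]
        train_labels = labels[:i] + labels[i + 1:]
        predicted = classify(query, train_samples, train_labels)
        if truth == "S":
            tp, fn = (tp + 1, fn) if predicted == "S" else (tp, fn + 1)
        else:
            tn, fp = (tn + 1, fp) if predicted == "N" else (tn, fp + 1)
    return tp, fn, tn, fp

def ratio(numerator, denominator):
    """Return 0.0 instead of failing when a denominator is zero (a convention)."""
    return numerator / denominator if denominator else 0.0

def metrics(tp, fn, tn, fp):
    """Sn, Sp, PPV, Acc and MCC as defined in Eqs. 8-12."""
    mcc_denominator = math.sqrt((tp + fn) * (tn + fn) * (tp + fp) * (tn + fp))
    return {
        "Sn": ratio(tp, tp + fn),
        "Sp": ratio(tn, tn + fp),
        "PPV": ratio(tp, tp + fp),
        "Acc": ratio(tp + tn, tp + fn + tn + fp),
        "MCC": ratio(tp * tn - fp * fn, mcc_denominator),
    }

# Inferred counts (see the assumptions above); Acc is about 0.907 and MCC about 0.83.
print(metrics(206, 46, 251, 1))
```

Because every protein is held out exactly once, the resulting four counts are unique for a given dataset and classifier, which is the property that motivates the jackknife test.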

Results and discussion

To find the best K value for predicting secretory proteins, the K-MID classifier was tested with K ranging from 1 to 20. The prediction results based on the 20 amino acid composition (AAC), compared with those of the SVM models, are shown in Table 1. The prediction ability improves as K increases, peaks at K = 5, does not change significantly in accuracy (Acc) or MCC for K between 6 and 11, and decreases when K > 11. The prediction achieves 88.89% Acc with 0.78 MCC when K = 5, better than the best result of the SVM models, 85.66% Acc with 0.72 MCC at a threshold (Thr) of 0.4. Therefore, K = 5 is used as the operating parameter in the following calculations.

Table 1 Prediction result of K-MID compared with the SVM models based on amino acid composition

In order to investigate how a particular class or property of amino acids affects prediction accuracy, and to determine the minimal amount of information needed for prediction, three recent amino acid reduction methods, the Miyazawa–Jernigan matrix (MJM) (Rakshit and Ananthasuresh 2008), Markov models of evolution (MME) (Susko and Roger 2007) and the BLOSUM62 matrix (Li et al. 2003), are applied to predict the secretory proteins of P. falciparum. The prediction accuracy (Acc) and MCC for different alphabet sizes N are shown in Figs. 1 and 2, respectively. The results show that the Acc and MCC do not change significantly for N between 10 and 20: alphabets of about 10 letters perform similarly to those with N from 12 to 20. This pattern is similar to that found in other studies, such as disordered protein prediction (N = 10) and structurally conserved region prediction (N = 9). It indicates that amino acids with similar features can be clustered together without losing much of the information needed for prediction.

Fig. 1 Prediction accuracy of the K-MID method using different reduced amino acid methods

Fig. 2 Prediction Matthews correlation coefficient of the K-MID method using different reduced amino acid methods

Based on the above discussion, two recent reduced amino acid alphabets, based on the Protein Blocks (Etchebest et al. 2007) and BLOSUM50 substitution matrix (Weathers et al. 2004) methods, are used to further optimize the alphabet for secretory protein prediction (Table 2). The reduced amino acid alphabet obtained from the Protein Blocks method is derived from a structural alphabet composed of 16 average protein fragments, each five residues in length. This reduced alphabet can extract more useful information from secretory protein sequences, eliminate some useless information and reduce the dimension of the feature space. The Protein Blocks method has been successfully used to analyze longer protein fragments and to predict functional regions, and the results have proven its efficiency in both the description and the prediction of longer fragments (Etchebest et al. 2007; de Brevern 2005), such as local protein structure prediction (Benros et al. 2006), outer membrane protein analysis (Martin et al. 2008) and backbone structure prediction of proteins (de Brevern 2005). The prediction results of K-MID based on the amino acid composition of the reduced alphabets with K = 5 are shown in Table 3.

Table 2 Two schemes for reducing amino acid alphabet used in our study
Table 3 The prediction performance of K-MID method for different vector sizes of reduced amino acids alphabets with K = 5

As shown in Table 3, vector sizes of 13, 11 and 9 achieve 88.89, 87.70 and 87.50% accuracy (Acc) for the Protein Blocks method, and vector sizes of 15, 10 and 8 achieve 88.49, 87.70 and 84.92% accuracy for the BLOSUM50 substitution matrix method. The best result, 88.89% Acc and 0.78 MCC with a vector size of 13, is the same as the prediction performance obtained with the 20 amino acid compositions. When dipeptide composition is used as the input feature, the prediction performance improves further. The accuracy reaches 89.88% with 0.81 MCC based on the 100 dipeptide compositions (DPC) of the BLOSUM50 substitution matrix reduced alphabet. The best prediction accuracy, 90.67% with 0.83 MCC, is obtained by using the 169 DPC of the Protein Blocks reduced alphabet. In summary, a suitable reduced amino acid alphabet can improve the prediction accuracy by clustering similar amino acids, and it also reduces the dimension of the feature space.

In order to examine the performance of our method, it is compared with the SVM models, and the prediction results of the two methods based on different features are listed in Table 4. The results in Table 4 show that the prediction accuracy (Acc) obtained by our K-MID method based on the 20 AAC reaches 88.89% with 0.78 MCC, about 3.2 percentage points higher than the SVM models with 85.66% Acc and 0.72 MCC. For the prediction based on the 400 dipeptide compositions, the K-MID method achieves 34.92% sensitivity (Sn) with 100% specificity (Sp), whereas the SVM models achieve only 24.21% Sn with 99.60% Sp, so the Sn of K-MID is 10.71 percentage points higher. With the 100 dipeptide compositions of the reduced amino acid alphabet, K-MID achieves 79.76% Sn with 100% Sp. The best prediction performance of the SVM models is 91.07% Acc with 82.94% Sn, 99.21% Sp and 0.83 MCC, obtained by using the PSSM profiles generated by PSI-BLAST. With the 169 dipeptide compositions of the reduced amino acid alphabet as the only input vectors, the K-MID method achieves 90.67% Acc with 81.75% Sn, 99.60% Sp, 99.52% PPV and 0.83 MCC, which is similar to the best SVM result. This surprisingly good prediction performance indicates that the K-MID method is indeed a good predictor for secretory protein annotation.

Table 4 Comparisons of K-MID method with the SVM models for secretory protein prediction

Conclusion

For protein prediction and classification, most of the existing methods are based on a group of features that carry various kinds of discriminative information from the protein sequence. In this study, the K-MID method is developed for the first time to predict secretory proteins of the malaria parasite. The successful prediction performance indicates that amino acid composition and the ID combined with the K-nearest neighbor method are quite suitable for predicting secretory proteins. Reduced amino acid alphabets can reduce the dimension of the input vector and improve the prediction accuracy. The results obtained in our study also demonstrate that the amino acid alphabet obtained from the Protein Blocks method is able to abstract useful functional and conservation information and is suitable for secretory protein prediction. Compared with the work of Verma et al. (2008), the sensitivity of our method is lower, but the specificity is higher. Moreover, the overall accuracy of our method based on amino acid composition is higher than that of Verma et al. (2008). We hope this algorithm will assist the annotation of protein function and help drug and vaccine design against malaria caused by P. falciparum.