Introduction

Computational approaches, such as structural bioinformatics (Argos et al. 1982; Chou 2004a, b, c, d, 2005a), molecular docking (Chou et al. 2003; Gao et al. 2007; Li et al. 2007; Wang et al. 2008; Zhang et al. 2006a, b, c; Zheng et al. 2007), molecular packing (Chou et al. 1984, 1988), pharmacophore modeling (Chou et al. 2006; Sirois et al. 2004), Monte Carlo simulation (Chou 1992), diffusion-controlled reaction simulation (Chou and Jiang 1974; Chou and Zhou 1982; Li and Chou 1976), bio-macromolecular internal collective motion simulation (Chou 1988), QSAR (Dea-Ayuela et al. 2008; Du et al. 2005, 2008a, b; Gonzalez-Diaz et al. 2006, 2008; Prado-Prado et al. 2008), protein subcellular location prediction (Chou and Shen 2006a, b, 2007a, c, 2008a; Shi et al. 2008), identification of membrane proteins and their types (Chou and Shen 2007b), identification of enzymes and their functional classes (Shen and Chou 2007a), identification of GPCRs and their types (Chou 2005b; Chou and Elrod 2002), identification of proteases and their types (Chou and Shen 2008b), protein cleavage site prediction (Chou 1993, 1996; Shen and Chou 2008), and signal peptide prediction (Chou and Shen 2007d; Shen and Chou 2007b), can provide timely and useful information and insights for both basic research and drug design, and hence are widely welcomed by the scientific community. The present study attempts to develop a computational approach for predicting the subcellular localization of apoptosis proteins, in the hope of stimulating the development of the relevant areas (Emanuelsson et al. 2007; Fauchere et al. 1988; Janin 1979; Janin and Wodak 1978).

Apoptosis is a form of cell death that plays a central role in normal tissue homeostasis by regulating the balance between cell proliferation and death (Chou et al. 1997, 1999, 2000; Chou 2004a, b, c, d, 2005a, b, c). Cells undergoing apoptosis usually exhibit a characteristic morphology, including fragmentation of the cell into membrane-bound apoptotic bodies, nuclear and cytoplasmic condensation, and endonucleolytic cleavage of the DNA into small oligonucleosomal fragments (Kerr et al. 1972; Steller 1995). Unregulated, excessive apoptosis may cause various degenerative and autoimmune diseases. Conversely, an inappropriately low rate of apoptosis may promote the survival and accumulation of abnormal cells that can give rise to tumor formation and prolonged autoimmune stimulation, as in cancer and Graves' disease (Peter et al. 1997). The study of apoptosis proteins can help us to understand the mechanism of apoptosis and can provide many targets for therapeutic intervention (Cosic 1994; Du and Li 2006; Hong et al. 1999; Hopp and Woods 1981; Huang and Shi 2005; Chou 2000, 2004a, b, c, d, 2005a, b, c).

The function of a protein is closely correlated with its subcellular location (Cai and Chou 2003; Cai et al. 2003; Chou 2002; Chou and Cai 2002, 2004, 2005; Chou and Shen 2006a, b, c; Shen et al. 2007b; Shen et al. 2005; Shen and Chou 2007; Chou and Elrod 1999; Chou 2000, 2001; Feng 2002). Thus, knowledge of the subcellular locations of apoptosis proteins will help in understanding the apoptosis mechanism and the functions of these proteins (Schulz et al. 1999; Reed and Paternostro 1999), and such knowledge is very important for understanding the mechanism of programmed cell death. Malfunction of apoptosis can lead to formidable diseases such as cancer (Adams and Cory 1998; Evan and Littlewood 1998), autoimmune diseases, ischemic damage, and neurodegenerative diseases (Schulz et al. 1999). With the rapid increase in the number of protein sequences of unknown function in protein databanks, it is crucial to develop fast and powerful computational tools and algorithms to predict the subcellular locations of apoptosis proteins directly from their amino acid sequences.

Several prediction algorithms have been reported for the subcellular location of apoptosis proteins. Zhou and Doctor (2003) predicted four kinds of subcellular locations by using the amino acid composition (AAC) to represent protein samples and the covariant discriminant algorithm of Chou (1995) as the prediction engine, obtaining an overall accuracy of 72.5% by the jackknife test. Bulashevska and Eils (2006) achieved accuracies of 85.7 and 89.9% using a single Bayesian classifier and a hierarchical ensemble classifier, respectively. Zhang et al. (2006b) developed a new encoding method with grouped weight for protein sequences and, meanwhile, constructed a larger dataset with 225 apoptosis proteins belonging to four subcellular locations. A dual-layer support vector machine prediction algorithm has also been developed (Zhou et al. 2008). Chen and Li (2007a, b) developed two prediction approaches based on the increment of diversity (ID) and on the increment of diversity combined with a support vector machine (ID_SVM), which were validated on a new dataset covering six subcellular compartments and 317 apoptosis proteins.

Compared with the extensive research on protein subcellular location in general (Chou and Shen 2007b), studies on the subcellular location of apoptosis proteins are limited, mainly owing to the flexibility of the distribution of apoptosis proteins and the limited number of annotated apoptosis proteins. In this study, based on the above analysis, we propose a new approach for predicting the subcellular location of apoptosis proteins that combines an ensemble classifier with feature selection. A new kind of ensemble classifier is introduced as the prediction engine. Ensemble classifiers, which can reduce the variance caused by the peculiarities of a single training set and hence learn a more expressive concept in classification than a single classifier, have been applied to various protein attributes (Shen and Chou 2006a, b, 2007a; Shen et al. 2007a; Kedarisetti et al. 2006; Chou and Shen 2006a, b, 2007a). The basic classifier is the fuzzy K-nearest neighbor (FKNN) classifier (Keller et al. 1985), a simple and powerful classifier often used in identifying various protein attributes (Huang and Li 2004; Shen et al. 2006; Huang et al. 2006). The input to each basic classifier within the ensemble is the k-spaced amino acid pair composition after feature selection. The jackknife test results indicate that the proposed method may be a useful tool for predicting the subcellular location of apoptosis proteins, or at least can play a complementary role to the existing methods in the relevant areas.

Materials and methods

Datasets

Two datasets constructed by previous investigators are used to examine the power of the new method. The dataset CL317, constructed by Chen and Li (2007a), is the larger one, comprising 317 apoptosis proteins: 112 cytoplasmic, 55 membrane, 34 mitochondrial, 17 secreted, 52 nuclear, and 47 endoplasmic reticulum proteins. The dataset ZW225 (Zhang et al. 2006b) comprises 225 apoptosis proteins in four subcellular locations: 41 nuclear, 70 cytoplasmic, 25 mitochondrial, and 89 membrane proteins.

K-spaced amino acid pairs

As mentioned in prior works, the amino acid (AA) composition vector of a protein sequence is a simple sequence representation that is widely used in the prediction of various structural aspects. Given the 20 alphabetically ordered AAs (A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W, Y), denoted A_1, A_2, …, A_20, and the number of occurrences x_i of A_i in the sequence, the composition vector is defined as (x_1/L, x_2/L, …, x_20/L), where L is the length of the sequence (Chen et al. 2006b, 2007b; Kawashima et al. 1999; Nakashima and Nishikawa 1994; Park and Kanehisa 2003; Pincus 1991; Richman and Moorman 2000; Shi et al. 2007; Tanford 1962; Zimmerman et al. 1968). However, the composition vector is insufficient to represent a sequence, since it only counts the frequencies of individual AAs. Frequencies of AA pairs (dipeptides) provide more information, since they reflect interactions between local (with respect to the sequence) AA pairs. All dipeptides in the sequence can be counted; since there are 400 possible dipeptides (AA, AC, AD, …, YY), a feature vector of that size is used to represent the occurrence of these pairs in the sequence, with the occurrence rates given by (n_1/(L − 1), n_2/(L − 1), …, n_400/(L − 1)). Since short-range interactions between AAs, rather than only interactions between immediately adjacent AAs, have an impact on folding, the proposed representation also considers collocated pairs of AAs, i.e., AA pairs separated by p other AAs (pairs of the form A A_1 A_2 … A_p A′, where A_1 A_2 … A_p are the intervening AAs). In summary, these pairs can be understood as dipeptides with gaps. For each value of p there are 400 corresponding feature values, and the occurrence rates become (n_1/(L − p − 1), n_2/(L − p − 1), …, n_400/(L − p − 1)). Collocated pairs for p = 0, 1, …, 20 are considered, because the distance between AAs in the motif database PROSITE is up to 20 (Chen et al. 2007a; Falquet et al. 2002). As a result, we propose a representation that includes a total of 400 × (20 + 1) + 20 = 8,420 features, the additional 20 features being the AA composition itself.
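For concreteness, the following is a minimal Python sketch of the encoding described above; it is our own illustration rather than code from the original work, and the function names and the skipping of non-standard residues (such as 'X') are our choices.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"                           # the 20 ordered AAs
PAIRS = ["".join(t) for t in product(AMINO_ACIDS, repeat=2)]   # 400 pairs AA..YY

def pair_composition(seq, p):
    # Occurrence rates of AA pairs separated by p residues (p = 0: dipeptides),
    # normalized by the number of such pairs, L - p - 1.
    counts = dict.fromkeys(PAIRS, 0)
    n_pairs = len(seq) - p - 1
    if n_pairs <= 0:
        return [0.0] * len(PAIRS)
    for i in range(n_pairs):
        pair = seq[i] + seq[i + p + 1]
        if pair in counts:        # skip non-standard residues such as 'X'
            counts[pair] += 1
    return [counts[pair] / n_pairs for pair in PAIRS]

def feature_vector(seq, max_p=20):
    # AA composition (20 values) plus pair compositions for p = 0..max_p:
    # 20 + 400 * (max_p + 1) = 8,420 features for max_p = 20.
    features = [seq.count(a) / len(seq) for a in AMINO_ACIDS]
    for p in range(max_p + 1):
        features += pair_composition(seq, p)
    return features
```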

Binary particle swarm optimization

Particle swarm optimization (PSO) is a population-based stochastic optimization technique developed by Kennedy and Eberhart (1995). It is one of the nature-inspired evolutionary optimization methods, which include evolution strategies, evolutionary programming, genetic algorithms, and genetic programming. PSO is distinctly different from other evolutionary-type methods in that it does not use filtering operations (such as crossover and/or mutation), and the members of the entire population are maintained throughout the search procedure (Kennedy et al. 2001).

In the PSO algorithm, every solution is a bird of the flock and is referred to as a particle: in this framework the birds, besides having individual intelligence, also develop some social behavior and coordinate their movement towards a destination.

Initially, the process starts from a swarm of particles, each of which contains a randomly generated solution to the optimization problem, and then searches for the optimal solution by iteration. The ith particle is associated with a position in an M-dimensional space, where M is the number of variables involved in the problem; the values of the M variables that determine the position of the particle represent a possible solution of the optimization problem. Each particle i is determined by two vectors, its current position X_i and its velocity V_i, as follows:

$$ {\text{Current position }}X_{i} = (x_{i1} ,x_{i2} , \ldots ,x_{iM} ) $$
(1)
$$ {\text{Flight velocity }}V_{i} = (v_{i1} ,v_{i2} , \ldots ,v_{iM} ) $$
(2)

This algorithm simulates a flock of birds that communicate during flight. Each bird looks in a specific direction (toward its best position attained so far), and when the birds communicate among themselves, the bird in the best position is identified. Each bird then also moves towards the best bird, with a velocity that depends on its present velocity. Thus, each bird examines the search space from its current local position, and this process repeats until the bird possibly reaches the desired position. Note that this process involves individual intelligence as much as social interaction; the birds learn through their own experience (local search) and the experience of their peers (global search).

In each cycle, one identifies the particle which has the best instantaneous solution to the problem; the position of this particle subsequently enters into the computation of the new position for each of the particles in the flock. This calculation is carried out according to

$$ x_{id}^{k + 1} = x_{id}^{k} + v_{id}^{k + 1} $$
(3)
$$ v_{id}^{k + 1} = v_{id}^{k} + C_{1} \,rand()(pbest_{id}^{k} - x_{id}^{k} ) + C_{2} \,rand()(gbest_{d}^{k} - x_{id}^{k} ) $$
(4)

Here, rand() is a function that generates random numbers uniformly distributed between 0 and 1 (two independent random numbers enter Eq. 4); pbest_{id}^k is the best position reached by particle i up to the kth cycle, whereas gbest_d^k is the best position found by the global search. C_1 and C_2 are two positive constants, called learning factors or rates, which are usually set to 2.
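As an illustration, the following minimal Python sketch performs one PSO iteration for a single particle according to Eqs. 3 and 4; it is our own hypothetical helper, not code from the original study.

```python
import random

def pso_step(x, v, pbest, gbest, c1=2.0, c2=2.0):
    # One PSO iteration per Eqs. 3-4; x, v, pbest, gbest are equal-length lists.
    new_x, new_v = [], []
    for d in range(len(x)):
        vel = (v[d]
               + c1 * random.random() * (pbest[d] - x[d])   # cognitive (local) term
               + c2 * random.random() * (gbest[d] - x[d]))  # social (global) term
        new_v.append(vel)                                   # Eq. 4
        new_x.append(x[d] + vel)                            # Eq. 3
    return new_x, new_v
```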

PSO was originally introduced as an optimization technique for real-number spaces. However, many optimization problems are set in a space featuring discrete, qualitative distinctions between variables and between levels of variables. Kennedy and Eberhart therefore introduced binary PSO (BPSO), which can be applied to discrete binary variables. In a binary space, a particle may move to near corners of a hypercube by flipping various numbers of bits; thus, the overall particle velocity may be described by the number of bits changed per iteration (Kennedy and Eberhart 1997). In BPSO, each component of the particle position X_i is set to 1 or 0, but the flight velocity V_i is not limited. In our paper, BPSO is used as the feature selection algorithm: all of the aforementioned AA pair features of apoptosis proteins compose the particle space, and the ith feature is selected if X_i = 1 and discarded if X_i = 0. The fitness function of the feature selection algorithm is formulated as Eq. 5.

$$ fitness = Ac - k*nNewFeature/nAllFeature $$
(5)

In Eq. 5, Ac is the accuracy of the jackknife test (Chou and Zhang 1995) on the training dataset, nNewFeature is the number of newly selected features, nAllFeature is the number of all features, and k is a parameter controlling the penalty on the ratio of selected features. In our paper, k = 1.
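A minimal sketch of this fitness function is given below; the evaluation routine that supplies the jackknife accuracy of the classifier trained on the selected subset is assumed to be given externally.

```python
def fitness(mask, jackknife_accuracy, k=1.0):
    # Eq. 5: trade off training accuracy (Ac) against the fraction of
    # selected features. `mask` is the particle's 0/1 position vector;
    # `jackknife_accuracy` is assumed to come from an external jackknife
    # evaluation of the classifier trained on the selected features.
    n_selected = sum(mask)
    return jackknife_accuracy - k * n_selected / len(mask)
```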

Based on the velocity of each particle calculated by Eq. 4, the particle's new position x_{id}^{k+1} can be obtained as follows:

$$ x_{id}^{k + 1} = \left\{ \begin{array}{ll} R(0,1) > 0.7 & \quad {\text{if}}\; v_{id}^{k + 1} < 2a/3 \\ x_{id}^{k} & \quad {\text{if}}\; 2a/3 \le v_{id}^{k + 1} < a \\ pbest_{id} & \quad {\text{if}}\; a \le v_{id}^{k + 1} < (1 + a)/2 \\ gbest_{d} & \quad {\text{if}}\; (1 + a)/2 \le v_{id}^{k + 1} < 1 \\ \end{array} \right. $$
(6)

where R(0,1) denotes a uniform random number in [0, 1] (i.e., the bit is set to 1 if the random draw exceeds 0.7 and to 0 otherwise) and a is a parameter governing the updating of the particles; in our research, a = 1.
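The following sketch implements the position update of Eq. 6 under two assumptions of ours: the first case is read as setting the bit to 1 when a uniform random draw exceeds 0.7, and velocities are clipped to [0, 1) so that the four cases are exhaustive.

```python
import random

def bpso_position_update(x, v, pbest, gbest, a=1.0):
    # Eq. 6 for one particle. x, pbest, gbest: 0/1 lists; v: velocities.
    new_x = []
    for d in range(len(x)):
        vel = min(max(v[d], 0.0), 1.0 - 1e-9)       # assumed clipping to [0, 1)
        if vel < 2 * a / 3:
            bit = 1 if random.random() > 0.7 else 0  # assumed reading of R(0,1) > 0.7
        elif vel < a:
            bit = x[d]                               # keep the current bit
        elif vel < (1 + a) / 2:
            bit = pbest[d]                           # copy the personal best bit
        else:
            bit = gbest[d]                           # copy the global best bit
        new_x.append(bit)
    return new_x
```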

Ensemble classifier

The framework of the ensemble classifier is illustrated in Fig. 1. The basic classifier is the FKNN classifier, which is trained on the k-spaced amino acid pair composition after feature selection. Combining a set of basic classifiers, the ensemble classifier is formulated as

$$ C = C_{0} \{ BiAAC(p = 0)\} \oplus C_{1} \{ BiAAC(p = 1)\} \oplus \cdots \oplus C_{n} \{ BiAAC(p = n)\} $$
(7)

where C denotes the ensemble classifier; C_i{BiAAC(p = i)}, i = 0, 1, …, n, is the basic classifier trained on the feature-selected i-spaced amino acid pair composition; and the symbol ⊕ is the combination operator. Here, the basic classifier is the FKNN classifier (Keller et al. 1985), which combines fuzzy set theory with the KNN algorithm; a detailed description of the FKNN algorithm can be found in previous work (Huang and Li 2004; Shen et al. 2006; Zheng et al. 2007). The output of each basic classifier is a vector of fuzzy membership values of the apoptosis protein to the subcellular locations, so the outputs can be collected in the fuzzy membership matrix of Eq. 8

$$ \left[ {\begin{array}{*{20}c} {m_{1}^{1} (x)} & {m_{1}^{2} (x)} & \cdots & {m_{1}^{n + 1} (x)} \\ {m_{2}^{1} (x)} & {m_{2}^{2} (x)} & \cdots & {m_{2}^{n + 1} (x)} \\ \vdots & \vdots & \ddots & \vdots \\ {m_{c}^{1} (x)} & {m_{c}^{2} (x)} & \cdots & {m_{c}^{n + 1} (x)} \\ \end{array} } \right] $$
(8)

where c is the number of subcellular locations and n + 1 is the number of basic classifiers, one for each spacing value of the k-spaced amino acid pairs.

Fig. 1

The flowchart of prediction approach and framework of ensemble classifier

By fusing the outputs of the basic classifiers, the fuzzy membership values of the ensemble classifier are obtained:

$$ f_{i}^{\text{comb}} (u_{i} ) = \frac{1}{n + 1}\sum\limits_{j = 1}^{n + 1} {u_{i} (j)} $$
(9)

where u_i = (m_i^1(x), m_i^2(x), …, m_i^{n+1}(x)), i = 1, 2, …, c, and "comb" denotes the fusion rule. The final prediction is the location with the maximum f_i, as given by Eq. 10.

$$ predicted={\mathop{\arg\max(f_{i})}\limits_{i = 1,2,\ldots,c}}$$
(10)
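The fusion step of Eqs. 9 and 10 amounts to averaging the membership matrix over the classifiers and taking the arg max, as in the following minimal sketch (our own illustration, with an assumed NumPy representation of the matrix):

```python
import numpy as np

def ensemble_predict(membership):
    # Eqs. 9-10: `membership` has shape (c, n + 1); entry [i, j] is the fuzzy
    # membership of the query protein to location i output by classifier j.
    f = membership.mean(axis=1)   # Eq. 9: average over the basic classifiers
    return int(np.argmax(f))      # Eq. 10: location with the largest membership
```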

Fuzzy K-nearest neighbor classifier

Combining fuzzy set theory with the KNN algorithm, Keller et al. (1985) proposed the FKNN classifier. The fuzzy membership of a protein sample to the different subcellular locations is assigned according to the following formulation:

$$ u_{i} (p) = \frac{{\sum\nolimits_{j = 1}^{k} {u_{i} } \left( {p^{(j)} } \right)\left( {\left\| {p - p^{(j)} } \right\|^{{{{ - 2} \mathord{\left/ {\vphantom {{ - 2} {(m - 1)}}} \right. \kern-\nulldelimiterspace} {(m - 1)}}}} } \right)}}{{\sum\nolimits_{j = 1}^{k} {\left( {\left\| {p - p^{(j)} } \right\|^{{{{ - 2} \mathord{\left/ {\vphantom {{ - 2} {(m - 1)}}} \right. \kern-\nulldelimiterspace} {(m - 1)}}}} } \right)} }},\quad i = 1, \ldots ,c $$
(11)

where k is the number of nearest neighbors, u_i(p) is the membership value of protein sample p to subcellular location i, and m is the fuzzy parameter, which determines how heavily the distance to each neighbor weighs in the membership value. ||p − p^(j)|| is the distance between the test protein sample and its jth nearest neighbor; various distance functions can be chosen, and here we use the Euclidean distance. u_i(p^(j)) is the membership value of the jth nearest neighbor to the ith subcellular location; it is assigned in the crisp way shown below.

$$ u_{i} \left( {p^{(j)} } \right) = \left\{ {\begin{array}{*{20}c} {1\quad {\text{if}}\,p^{(j)} \in C_{i} } \\ {0\quad {\text{otherwise}}} \\ \end{array} } \right. $$
(12)

When the memberships to all subcellular locations have been calculated, the test protein sample is assigned to the class with the highest membership value. As our prior work shows, FKNN is a useful prediction engine (Zhang et al. 2006a, b, c, 2008). Because p = 0, 1, …, 20 in the feature selection stage of our research, 21 FKNN classifiers are selected as the basic classifiers of the ensemble.
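A minimal sketch of the FKNN membership computation of Eqs. 11 and 12 follows; it is our own illustration, with crisp neighbor memberships as in the text, and the values of k and m are placeholders rather than the settings of the original study.

```python
import numpy as np

def fknn_memberships(query, train_x, train_y, n_classes, k=5, m=2.0):
    # Eqs. 11-12 with crisp neighbor memberships (Eq. 12).
    dists = np.linalg.norm(train_x - query, axis=1)  # Euclidean distances
    dists = np.maximum(dists, 1e-12)                 # guard against zero distance
    nn = np.argsort(dists)[:k]                       # the k nearest neighbors
    w = dists[nn] ** (-2.0 / (m - 1.0))              # distance weights of Eq. 11
    u = np.zeros(n_classes)
    for idx, weight in zip(nn, w):
        u[train_y[idx]] += weight                    # crisp u_i(p^(j)) is 1 or 0
    return u / w.sum()                               # normalized memberships

# The predicted location is the class with the highest membership value.
```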

Performance measurement

To measure the quality of the prediction of apoptosis protein subcellular locations, it is convenient to introduce an accuracy matrix [M_ij] of size c × c (c is the number of compartments to be predicted). The element M_ij of the accuracy matrix is the number of proteins predicted to be in subcellular location j that are actually in subcellular location i.

Four indexes are applied to evaluate the prediction quality: the sensitivity (S_n), the specificity (S_p), the Matthews correlation coefficient (MCC), and the overall accuracy (A_c).

$$ S_{n} = \frac{{M_{ii} }}{{\sum\nolimits_{j = 1}^{c} {M_{ij} } }} $$
(13)
$$ S_{p} = \frac{{M_{ii} }}{{\sum\nolimits_{j = 1}^{c} {M_{ji} } }} $$
(14)
$$ {\text{MCC}} = \frac{{M_{ii} \left( {\sum\nolimits_{k \ne i}^{c} {\sum\nolimits_{j \ne i}^{c} {M_{jk} } } } \right) - \left( {\sum\nolimits_{j \ne i}^{c} {M_{ij} } } \right) \times \left( {\sum\nolimits_{j \ne i}^{c} {M_{ji} } } \right)}}{{\left[ {\left( {M_{ii} + \sum\nolimits_{j \ne i}^{c} {M_{ij} } } \right)\left( {M_{ii} + \sum\nolimits_{j \ne i}^{c} {M_{ji} } } \right)\left( {\sum\nolimits_{k \ne i}^{c} {\sum\nolimits_{j \ne i}^{c} {M_{jk} } } + \sum\nolimits_{j \ne i}^{c} {M_{ji} } } \right)\left( {\sum\nolimits_{k \ne i}^{c} {\sum\nolimits_{j \ne i}^{c} {M_{jk} } } + \sum\nolimits_{j \ne i}^{c} {M_{ij} } } \right)} \right]^{1/2} }} $$
(15)
$$ A_{c} = {{\left( {\sum\limits_{i = 1}^{c} {M_{ii} } } \right)} \mathord{\left/ {\vphantom {{\left( {\sum\limits_{i = 1}^{c} {M_{ii} } } \right)} {\left( {\sum\limits_{i = 1}^{c} {\sum\limits_{j = 1}^{c} {M_{ij} } } } \right)}}} \right. \kern-\nulldelimiterspace} {\left( {\sum\limits_{i = 1}^{c} {\sum\limits_{j = 1}^{c} {M_{ij} } } } \right)}} $$
(16)

S_n represents the accuracy and S_p the reliability of the prediction for each location; MCC is a single parameter characterizing the agreement between the observed and predicted subcellular locations; and A_c is the overall accuracy.
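In terms of the usual per-class counts, Eqs. 13-16 reduce to the familiar confusion-matrix formulas, as in this minimal sketch (our own illustration):

```python
import numpy as np

def location_metrics(M, i):
    # Eqs. 13-15 for location i, from the c x c accuracy matrix M
    # (M[i, j] = number of proteins of location i predicted as location j).
    tp = M[i, i]
    fn = M[i, :].sum() - tp   # row sum minus diagonal: missed proteins of i
    fp = M[:, i].sum() - tp   # column sum minus diagonal: false assignments to i
    tn = M.sum() - tp - fn - fp
    sn = tp / (tp + fn)                                   # Eq. 13
    sp = tp / (tp + fp)                                   # Eq. 14
    mcc = (tp * tn - fn * fp) / np.sqrt(
        (tp + fn) * (tp + fp) * (tn + fn) * (tn + fp))    # Eq. 15
    return sn, sp, mcc

def overall_accuracy(M):
    return np.trace(M) / M.sum()                          # Eq. 16
```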

Results and discussion

In statistical prediction, the following three cross-validation tests are often used to examine the power of a predictor: the independent dataset test, the sub-sampling (such as fivefold or tenfold) test, and the jackknife test (Chou and Zhang 1995; Cai et al. 2001; Zhou and Assa-Munt 2001; Zhou 1998). Of these three, the jackknife test is considered the most rigorous and objective, because it always yields a unique result for a given benchmark dataset, as elucidated in (Zhou and Cai 2006; Chou and Shen 2008a) and demonstrated by Eq. 50 of (Chou and Shen 2007c); hence it has been used by more and more investigators (e.g., Chen et al. 2006a, b; Gao et al. 2005a, b; Liu et al. 2005a, b; Chou and Shen 2006a, b, 2007a; Xiao et al. 2005, 2006a, b; Lin and Li 2007a, b; Zhang et al. 2006a, b; Zheng et al. 2007) in examining the power of various prediction methods.

First, the dataset CL317 (Chen and Li 2007a) is used to validate our approach. The dimensions of the protein features and the jackknife test results are shown in Table 1.

Table 1 The results of feature selection for different space in k-spaced amino acid pairs

From Table 1, we can see that the feature dimensionality is reduced for every spacing k, while the jackknife accuracy of each basic classifier increases appreciably after feature selection. This is because BPSO, used as the feature selection method, efficiently removes redundant features.

After the 21 FKNN classifiers are combined into the ensemble prediction engine, the jackknife results on the CL317 dataset are as listed in Table 2.

Table 2 The results by jackknife test on the dataset CL317

As shown in Table 2, the overall jackknife accuracy of the ensemble of 21 trained FKNN classifiers is 91.5%, which is 1-3% higher than that of any single FKNN classifier (Table 1). This is because an ensemble classifier can reduce the variance caused by the peculiarities of a single training set and hence learn a more expressive concept in classification than a single classifier, as demonstrated for various protein attributes (Shen and Chou 2006a, b, 2007a). Table 2 also shows that the accuracy of our approach is clearly higher than that of ID (Chen and Li 2007a) and ID_SVM (Chen and Li 2007b) on the same dataset, particularly for the cytoplasmic, nuclear, and endoplasmic reticulum locations; we attribute this to the protein features obtained after feature selection being more effective than those used by the two earlier methods.

To further validate the performance of the proposed approach, the dataset ZW225 (Zhang et al. 2006b) is adopted. The jackknife results are shown in Table 3.

Table 3 The results by jackknife on the dataset ZW225

As shown in Table 3, the overall prediction accuracy A_c of this study is the highest, both in total accuracy and in the success rate for each subcellular compartment. Moreover, Tables 2 and 3 show desirable values of S_n, S_p, and MCC, which further confirm the reliability of the jackknife results.

Conclusions

In this paper, binary particle swarm optimization (BPSO) is applied to select effective features, with AA pair compositions of different spacings used to construct the candidate feature sets. To increase the prediction accuracy, an ensemble classifier is applied as the prediction engine, each of whose 21 basic classifiers is a fuzzy K-nearest neighbor (FKNN) classifier trained with a different feature set. Two datasets, CL317 and ZW225, are used to validate the performance of the proposed approach; the overall jackknife success rates are 91.5 and 88.0%, respectively, both better than those of other methods. The results indicate that the proposed method can be a potentially useful tool for predicting the subcellular location of apoptosis proteins.