1 Introduction

Apoptosis is a type of programmed cell death that regulates growth, development and the immune response, and clears redundant or abnormal cells in organisms (Raff 1998; Steller 1995). It plays a key role in development and tissue homeostasis (Chou et al. 1998, 1999). Malfunctions of apoptosis can lead to a variety of formidable diseases: blocking apoptosis is associated with cancer (Adams and Cory 1998; Evan and Littlewood 1998) and autoimmune disease, whereas unwanted apoptosis can lead to ischemic damage (Reed and Paternostro 1999) or neurodegenerative disease (Schulz et al. 1999). Because the subcellular localization of a protein is closely associated with its function, the study of the subcellular localization of apoptosis proteins is very important for elucidating the functions of apoptosis proteins in various cellular processes (Schulz et al. 1999; Suzuki et al. 2000) and for drug development (Chou et al. 1997, 2000; Chou 2004).

Computational approaches, such as structural bioinformatics (Chou 2004), molecular docking (Chou et al. 2003; Li et al. 2007; Wang et al. 2008; Zheng et al. 2007), molecular packing (Chou et al. 1984, 1988), pharmacophore modeling (Sirois et al. 2004; Chou et al. 2006), Monte Carlo simulation (Chou 1992), diffusion-controlled reaction simulation (Chou and Zhou 1982), bio-macromolecular internal collective motion simulation (Chou 1988), QSAR (Du et al. 2008), protein subcellular location prediction (Chou and Shen 2007a, 2008a), identification of membrane proteins and their types (Chou and Shen 2007b), identification of enzymes and their functional classes (Shen and Chou 2007), identification of GPCRs and their types (Chou 2005), identification of proteases and their types (Chou and Shen 2008b), protein cleavage site prediction (Shen and Chou 2008b), and signal peptide prediction (Chou and Shen 2007c), can provide timely and useful information and insights for both basic research and drug design, and hence are widely welcomed by the scientific community. The present study attempts to develop a computational approach for predicting the subcellular localization of apoptosis proteins, in the hope of stimulating the development of the relevant areas.

In the past 5 years, several algorithms, such as the covariant discriminant function (Zhou and Doctor 2003), support vector machine (SVM) (Huang and Shi 2005; Zhang et al. 2006; Zhou et al. 2008; Shi et al. 2008), Bayesian classifier (Bulashevska and Eils 2006), increment of diversity (ID) (Chen and Li 2007a), increment of diversity combined with support vector machine (ID_SVM) (Chen and Li 2007b) and fuzzy K-nearest neighbor (FKNN) (Jiang et al. 2008; Ding and Zhang 2008), have been proposed to predict the subcellular localization of apoptosis proteins based on various forms of amino acid composition or pseudo amino acid composition. The pseudo amino acid composition (PseAAC) was first proposed by Chou to improve the prediction quality of protein subcellular localization (Chou 2001; Chou and Shen 2007a). PseAAC represents a protein sequence with a discrete model without completely losing its sequence order information.

In this paper, based on the concept of Chou’s PseAAC, SVM is applied to the latest dataset of 317 apoptosis proteins. Jackknife cross-validation is applied to examine the predictive ability of the method. Moreover, another dataset of 98 apoptosis proteins constructed by Zhou and Doctor (2003) is examined by the proposed method. The proposed method improves the predictive success rates and hence may play a complementary role to existing methods for predicting the subcellular localization of apoptosis proteins.

2 Materials and Methods

2.1 Data Sets

The 317 apoptosis proteins extracted from Swiss-Prot 49.0 can be classified into six subcellular locations: 112 cytoplasmic proteins, 55 membrane proteins, 34 mitochondrial proteins, 17 secreted proteins, 52 nuclear proteins and 47 endoplasmic reticulum proteins. In terms of pairwise sequence identity, 40.1% of the sequences share ≤40% identity, 15.5% share 41–80%, 18.9% share 81–90% and 25.6% share ≥91% (Chen and Li 2007a, b).

In addition, a dataset of 98 apoptosis proteins, containing 43 cytoplasmic proteins, 30 plasma membrane-bound proteins, 13 mitochondrial proteins and 12 other proteins (Zhou and Doctor 2003), is also used to estimate the effectiveness of the method.

2.2 Pseudo Amino Acid Composition

Choosing an appropriate sequence representation is one of the most important aspects of the prediction problem. The essence of PseAAC is that it captures not only the main features of the amino acid composition, but also the sequence order correlations (Chou 2001; Chou and Shen 2007a; Shen and Chou 2008a). Consider a protein X consisting of L amino acid residues:

$$ R_{1} R_{2} R_{3} \ldots R_{L} $$
(1)

The protein may then be denoted as a (20 + λ)-dimensional vector defined by 20 + λ discrete numbers, i.e.

$$ X = \left[ x_{1} \ldots x_{20}\; x_{20+1} \ldots x_{20+\lambda} \right]^{T} $$
(2)
$$ \text{where}\quad x_{u} = \begin{cases} \dfrac{f_{u}}{\sum\limits_{i=1}^{20} f_{i} + \omega \sum\limits_{j=1}^{\lambda} \theta_{j}}, & (1 \le u \le 20) \\[3ex] \dfrac{\omega\, \theta_{u-20}}{\sum\limits_{i=1}^{20} f_{i} + \omega \sum\limits_{j=1}^{\lambda} \theta_{j}}, & (21 \le u \le 20+\lambda) \end{cases} $$
(3)

In Eq. 3, f_i is the normalized occurrence frequency of the i-th of the 20 native amino acids in protein X, ω is the weight factor for the sequence order effect, and θ_j is the j-tier sequence correlation factor computed by the following formula:

$$ \theta_{j} = \frac{1}{L-j}\sum\limits_{i=1}^{L-j} \Theta (R_{i}, R_{i+j}), \quad (j < L) $$
(4)

where Θ(R_i, R_{i+j}) is the correlation function, given by

$$ \Theta (R_{i}, R_{i+j}) = \frac{1}{k}\sum\limits_{l=1}^{k} \left[ H_{l}\left( R_{i+j} \right) - H_{l}\left( R_{i} \right) \right]^{2} $$
(5)

In Eq. 5, k is the number of physico-chemical properties considered, and H_l(R_i) is the value of the l-th physico-chemical property of the amino acid R_i. These properties mainly include hydrophobicity, hydrophilicity, side-chain mass, pK of the α-COOH group, pK of the α-NH3+ group and pI at 25°C. Hydrophobicity, hydrophilicity and side-chain mass are used in the current study (so k = 3). The property values are converted to a standard form by the following equation:

$$ H_{l}(R_{i}) = \frac{H_{l}^{0}(i) - \frac{1}{20}\sum\limits_{i=1}^{20} H_{l}^{0}(i)}{\sqrt{\dfrac{1}{20}\sum\limits_{i=1}^{20}\left[ H_{l}^{0}(i) - \frac{1}{20}\sum\limits_{i=1}^{20} H_{l}^{0}(i) \right]^{2}}} $$
(6)

where \( H_{l}^{0}(i) \) is the original value of the l-th physico-chemical property of the i-th amino acid. We use the numerical indices 1, 2, 3, …, 20 to represent the 20 native amino acids according to the alphabetical order of their single-letter codes: A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W and Y. Values produced by this standard conversion have zero mean and remain unchanged if passed through the same conversion again.
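To make Eqs. 2–6 concrete, the following is a minimal Python sketch of the PseAAC computation. The property values in `H0`, the reduced five-letter alphabet, and all function names are illustrative assumptions for demonstration only, not the published scales; a real implementation would cover all 20 amino acids with the actual hydrophobicity, hydrophilicity and side-chain mass values.

```python
import math

# Hypothetical original property values H0 for a few amino acids
# (demonstration numbers only, not the published scales).
H0 = {
    "hydrophobicity": {"A": 0.62, "C": 0.29, "D": -0.90, "G": 0.48, "L": 1.06},
    "hydrophilicity": {"A": -0.5, "C": -1.0, "D": 3.0, "G": 0.0, "L": -1.8},
}

def standardize(scale):
    """Eq. 6: standard conversion of a property scale to zero mean."""
    vals = list(scale.values())
    mean = sum(vals) / len(vals)
    sd = math.sqrt(sum((v - mean) ** 2 for v in vals) / len(vals))
    return {aa: (v - mean) / sd for aa, v in scale.items()}

H = {name: standardize(scale) for name, scale in H0.items()}

def theta_corr(a, b):
    """Eq. 5: squared-difference correlation averaged over k properties."""
    return sum((H[p][b] - H[p][a]) ** 2 for p in H) / len(H)

def pseaac(seq, lam=3, w=0.1):
    """Eqs. 2-4: (composition + lam)-dimensional pseudo amino acid
    composition, over the reduced alphabet of this sketch."""
    alphabet = sorted(H0["hydrophobicity"])
    L = len(seq)
    f = [seq.count(aa) / L for aa in alphabet]   # normalized frequencies
    thetas = []
    for j in range(1, lam + 1):                  # Eq. 4: j-tier factors
        thetas.append(sum(theta_corr(seq[i], seq[i + j])
                          for i in range(L - j)) / (L - j))
    denom = sum(f) + w * sum(thetas)             # Eq. 3 denominator
    return [fu / denom for fu in f] + [w * t / denom for t in thetas]

x = pseaac("ACDGLLACDG", lam=3, w=0.1)
```

Note that, by construction of Eq. 3, the components of the resulting vector sum to one.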

2.3 Support Vector Machine

SVM is a machine learning method based on statistical learning theory (Vapnik 1998). As a supervised learning technique, it has been successfully applied in many fields of bioinformatics; it transforms the input vectors into a high-dimensional Hilbert space and seeks a separating hyperplane in that space. We briefly explain the basic idea of SVM. For a two-class classification problem, consider a set of training vectors \( \vec{X}_i \in R^d \) (i = 1, 2, …, N) with corresponding labels \( y_i \in \{ +1, -1\} \), where +1 and −1 indicate the two classes. SVM maps the input vectors into a high-dimensional feature space and constructs an optimal separating hyperplane with the largest distance between the two classes, measured along a line perpendicular to this hyperplane. The decision function implemented by SVM can be written as:

$$ f(\vec{X}) = \operatorname{sgn} \left( \sum\limits_{i=1}^{N} y_{i} \alpha_{i} \, K(\vec{X}, \vec{X}_{i}) + b \right) $$
(7)

where \( K(\vec{X}, \vec{X}_{i}) \) is a kernel function, which defines an inner product in a high-dimensional feature space. Three commonly used kernel functions are the polynomial function:

$$ K\left( \vec{X}_{i}, \vec{X}_{j} \right) = \left( \vec{X}_{i} \cdot \vec{X}_{j} + 1 \right)^{d} $$
(8)

Radial basis function (RBF):

$$ K\left( \vec{X}_{i}, \vec{X}_{j} \right) = \exp \left( -\gamma \| \vec{X}_{i} - \vec{X}_{j} \|^{2} \right) $$
(9)

Sigmoid function:

$$ K\left( \vec{X}_{i}, \vec{X}_{j} \right) = \tanh \left[ b \left( \vec{X}_{i} \cdot \vec{X}_{j} \right) + c \right]. $$
(10)
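The three kernels of Eqs. 8–10 translate directly into code. A small sketch in plain Python (the function names and default parameter values are our own, chosen for illustration):

```python
import math

def dot(u, v):
    """Inner product of two vectors given as lists."""
    return sum(a * b for a, b in zip(u, v))

def poly_kernel(u, v, d=2):
    """Eq. 8: polynomial kernel (u . v + 1)^d."""
    return (dot(u, v) + 1) ** d

def rbf_kernel(u, v, gamma=0.04):
    """Eq. 9: radial basis function exp(-gamma * ||u - v||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-gamma * sq_dist)

def sigmoid_kernel(u, v, b=1.0, c=0.0):
    """Eq. 10: sigmoid kernel tanh(b * (u . v) + c)."""
    return math.tanh(b * dot(u, v) + c)
```

Note that the RBF kernel of any vector with itself equals 1, since the squared distance vanishes.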

The coefficients α_i are obtained by solving the following convex quadratic programming (QP) problem: maximize

$$ \sum\limits_{i=1}^{N} \alpha_{i} - \frac{1}{2}\sum\limits_{i=1}^{N} \sum\limits_{j=1}^{N} \alpha_{i} \alpha_{j} \, y_{i} y_{j} \, K\left( \vec{X}_{i}, \vec{X}_{j} \right) \quad \text{subject to} \quad 0 \le \alpha_{i} \le C, \quad \sum\limits_{i=1}^{N} \alpha_{i} y_{i} = 0 $$
(11)

for i = 1, 2, …, N. The regularization parameter C controls the trade-off between the margin and the misclassification error. The vectors \( \vec{X}_{i} \) with α_i > 0 are called support vectors.

In general, One-Versus-Rest (OVR) and One-Versus-One (OVO) are the most commonly used approaches for solving multi-class problems, reducing a single multi-class problem to multiple binary problems. This paper uses the OVO strategy. The software used to implement SVM is LIBSVM 2.83, developed by Lin’s lab, which can be freely downloaded from http://www.csie.ntu.edu.tw/~cjlin/libsvm (Chang and Lin 2001). The RBF kernel is used for all our calculations; the regularization parameter C and the kernel parameter γ of the RBF must be determined in advance.
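As an illustration of the OVO strategy, the sketch below builds one binary classifier per pair of classes and combines them by majority voting. The binary learner here is a hypothetical nearest-centroid stand-in, not LIBSVM’s RBF-kernel SVM; only the pairing and voting logic is the point.

```python
from itertools import combinations

def train_binary(xs, ys):
    """Stand-in binary classifier (nearest class centroid); a real OVO
    SVM would train an RBF-kernel SVM on this pair of classes instead."""
    cents = {}
    for label in set(ys):
        pts = [x for x, y in zip(xs, ys) if y == label]
        cents[label] = [sum(col) / len(pts) for col in zip(*pts)]
    def predict(x):
        return min(cents, key=lambda c: sum((a - b) ** 2
                                            for a, b in zip(x, cents[c])))
    return predict

def ovo_classifier(xs, ys):
    """One-Versus-One: one binary model per class pair, majority vote."""
    classes = sorted(set(ys))
    models = []
    for c1, c2 in combinations(classes, 2):
        sub = [(x, y) for x, y in zip(xs, ys) if y in (c1, c2)]
        models.append(train_binary([x for x, _ in sub],
                                   [y for _, y in sub]))
    def predict(x):
        votes = {}
        for m in models:
            lab = m(x)
            votes[lab] = votes.get(lab, 0) + 1
        return max(votes, key=votes.get)
    return predict
```

For the six subcellular locations of the 317-protein dataset, OVO would train 6 × 5 / 2 = 15 binary classifiers.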

2.4 The Criteria Definitions

The predictive capability of the algorithm is estimated by three parameters: sensitivity (S_n), specificity (S_p) and correlation coefficient (CC), defined as follows (Chen and Li 2007a, b):

$$ S_{n} = TP / (TP + FN) $$
(12)
$$ S_{p} = TP / (TP + FP) $$
(13)
$$ CC = \frac{(TP \times TN) - (FP \times FN)}{\sqrt{(TP + FP)(TN + FN)(TP + FN)(TN + FP)}} $$
(14)

where TP denotes the number of correctly recognized positives, FN the number of positives recognized as negatives, FP the number of negatives recognized as positives, and TN the number of correctly recognized negatives.
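Eqs. 12–14 can be computed directly from the four confusion-matrix counts; a minimal sketch (function names are our own):

```python
import math

def sensitivity(tp, fn):
    """Eq. 12: Sn = TP / (TP + FN)."""
    return tp / (tp + fn)

def specificity(tp, fp):
    """Eq. 13: Sp = TP / (TP + FP), the precision-style definition
    used in this paper."""
    return tp / (tp + fp)

def correlation_coefficient(tp, tn, fp, fn):
    """Eq. 14: Matthews-style correlation coefficient."""
    num = tp * tn - fp * fn
    den = math.sqrt((tp + fp) * (tn + fn) * (tp + fn) * (tn + fp))
    return num / den
```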

3 Results and Discussion

In statistical prediction, the following three cross-validation tests are often used to examine the power of a predictor: the independent dataset test, the sub-sampling test (such as fivefold or tenfold sub-sampling), and the jackknife test. Of these three, the jackknife test is deemed the most objective and rigorous (Chou and Zhang 1995), as it always yields a unique outcome, as demonstrated by a penetrating analysis in a recent comprehensive review (Chou and Shen 2007a), and it has been widely and increasingly adopted by investigators to test the power of various prediction methods (Lin and Li 2007a, b; Lin 2008; Li and Li 2008a, b; Jia et al. 2008; Jin et al. 2008; Zhang and Fang 2008; Munteanu et al. 2008; Niu et al. 2008; Lin et al. 2008; Gao et al. 2008). In jackknife cross-validation, each protein in the dataset is in turn singled out as an independent test sample, and all the rule parameters are calculated from the remaining proteins, without including the one being identified. We therefore use jackknife cross-validation to examine the proposed method.
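The jackknife procedure described above amounts to a leave-one-out loop: the 1-nearest-neighbor learner below is a hypothetical stand-in for the actual SVM training step, and all names are our own.

```python
def jackknife_accuracy(xs, ys, train_fn):
    """Jackknife (leave-one-out) test: each sample is singled out once
    as the test sample; the model is rebuilt on the remaining N-1."""
    correct = 0
    for i in range(len(xs)):
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        predict = train_fn(train_x, train_y)
        if predict(xs[i]) == ys[i]:
            correct += 1
    return correct / len(xs)

def nearest_neighbor(train_x, train_y):
    """Stand-in 1-NN classifier; the paper trains an SVM at this step."""
    def predict(x):
        dists = [sum((a - b) ** 2 for a, b in zip(x, t)) for t in train_x]
        return train_y[dists.index(min(dists))]
    return predict
```

Because the model is retrained N times on N−1 samples, the jackknife test is the most expensive of the three, but its outcome is unique for a given dataset and method.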

The weight factor ω and the correlation factor λ in Chou’s PseAAC are two important parameters. Usually, the larger λ is, the more sequence order information the representation bears. However, if the PseAAC contains too many components, the cluster-tolerant capacity is reduced (Chou 1999), lowering the jackknife success rate. We examined a wide range of PseAAC parameters (ω and λ) and SVM parameters (C and γ) using jackknife cross-validation. For the current study, we found that the predictive success rate is highest when ω = 0.1, λ = 3, C = 1,000 and γ = 0.04. The results for the 317 apoptosis proteins are listed in Table 1. The sensitivity, specificity and CC for endoplasmic reticulum proteins are 95.7, 95.7 and 94.9%, respectively, higher than those of the other subcellular locations.

Table 1 The predictive results of jackknife cross-validation for 317 apoptosis proteins

The comparison with other methods is shown in Table 2. Table 2 shows that the sensitivities of SVM combined with PseAAC are higher than those of the other methods for cytoplasmic, membrane, mitochondrial and endoplasmic reticulum proteins, whereas for secreted and nuclear proteins the sensitivities of the proposed method are lower than those of ID and FKNN. The overall predictive success rate of the proposed method is the highest among the compared methods.

Table 2 The predictive results of different methods by the jackknife test for 317 apoptosis proteins

Table 3 shows the comparison with other methods for the 98 apoptosis proteins. Here, after extensive parameter examination, we selected ω = 0.3, λ = 3, C = 1,000 and γ = 0.08 for this prediction. The results show that the predictive success rate of the proposed method is 92.9%.

Table 3 The predictive results of different methods by the jackknife test for 98 apoptosis proteins

The achieved accuracies clearly indicate that SVM combined with PseAAC is a promising approach. We expect that novel descriptors or better-tuned parameters will further improve the performance of subcellular localization prediction for apoptosis proteins. Such high accuracy should also be helpful for further drug development.