Introduction

Apoptosis, or programmed cell death, is a fundamental process controlling normal tissue homeostasis by regulating a balance between cell proliferation and death (Chou et al. 1997, 1999, 2000; Chou 2004a, b, c, d, 2005a, b, c; Jacobson et al. 1997). When apoptosis malfunctions, a variety of formidable diseases can ensue: blocking apoptosis is associated with cancer (Adams and Cory 1998; Evan and Littlewood 1998) and autoimmune diseases, while unwanted apoptosis can possibly lead to ischemic damage (Reed and Paternostro 1999) or neurodegenerative disease (Schulz et al. 1999). Apoptosis proteins play a central role in the mechanism of programmed cell death (Raff 1998; Steller 1995). The function of a protein is closely correlated with its subcellular location (Chou 2001; Chou and Cai 2002; Chou and Elrod 1999). To understand the apoptosis mechanism and functions of various apoptosis proteins, it is helpful to know about the subcellular location of apoptosis proteins. Therefore, the study of subcellular location of apoptosis proteins is very important in biology.

During the last decade, much work has been done to predict the subcellular location of proteins, focusing mainly on how to represent a protein sequence effectively and obtain its feature space (Cedano et al. 1997; Chen and Li 2004; Chou 2001; Dubchak et al. 1995; Feng 2001; Garg et al. 2005; Gu et al. 2010; Huang and Li 2004; Nakashima and Nishikawa 1994; Zhou and Doctor 2003). Recently, some topics on the impact of the feature space were discussed (Assfalg et al. 2009, 2010). Most of this work derived the feature space of the protein sequence from the amino acid composition (AAC; Cedano et al. 1997; Feng 2001; Nakashima and Nishikawa 1994; Zhou and Doctor 2003) or the dipeptide composition (DPC; Chen and Li 2004; Huang and Li 2004). However, these approaches treated each amino acid or dipeptide separately, and the relationships among them were ignored. In fact, some amino acids have similar properties and can thus be substituted for each other without changing either the structure or the function of the protein. To partially incorporate this effect, some sequence feature space models based on classifications of amino acids were proposed. For example, based on the concept of coarse-grained description and grouping, Zhang et al. (2006) presented a new encoding method with grouped weight for protein sequences (encoding based on grouped weight, EBGW). Recently, Zhang et al. (2009) introduced a novel representation of protein sequences for the prediction of subcellular location on the basis of distance frequency, together with a novel way to calculate distance frequency. Chen and Li (2007a) also proposed a new algorithm using a distinctive set of information parameters derived from primary sequences. By trying different classifications of the 20 amino acids, they greatly improved the prediction accuracy.

Though the overall prediction accuracy for apoptosis proteins has been improved by existing methods, these methods still have some disadvantages. For example, amino acids of the same group still differ in some properties, but the above methods cannot distinguish them. In other words, the above methods fail to describe the differences between amino acids quantitatively. In addition, the sequence-order information of protein sequences is ignored.

Many studies have indicated that sequence-based prediction approaches, such as protein subcellular location prediction (Chou and Shen 2007a, 2010b), protein quaternary attribute prediction (Xiao et al. 2009), identification of proteases and their types (Chou and Shen 2008b), and signal peptide prediction (Chou and Shen 2007b; Hiss and Schneider 2009), can provide useful information and insights in a timely manner for both basic research and drug design, and hence are widely welcomed by the scientific community. The present study attempts to develop a novel sequence-based method for predicting apoptosis protein subcellular localization, in the hope that it may become a useful complementary tool to the existing methods in the relevant areas.

To avoid losing much of the important information hidden in protein sequences, the pseudo amino acid composition (PseAAC) was proposed (Chou 2001, 2005a) to replace the simple AAC for representing a protein sample. For a summary of its development and applications, such as how to use the concept of Chou’s PseAAC to develop 16 different forms of PseAAC, including those able to incorporate functional domain information, gene ontology (GO) information, cellular automaton image information, and sequential evolution information, among many others, see a recent comprehensive review (Chou 2009). In this paper, we propose a different model of PseAAC that represents protein samples by means of an amino acid substitution matrix and the auto covariance transformation, and we apply it to predict the subcellular location of apoptosis proteins in two datasets. Based on the amino acid substitution matrix, we first convert a given apoptosis protein sequence with L residues into a 20 × L matrix by representing each residue with a 20-D vector. Then the auto covariance transformation is used to transform this representation matrix into a fixed-length vector. Finally, we employ the SVM and the jackknife test to evaluate our method. Our results show that the overall prediction accuracy of apoptosis protein subcellular location for the two datasets ZW225 and CL317 is 87.1 and 90%, respectively.

Materials and methods

Datasets

In this study, we use the two datasets constructed by Zhang et al. (2006) and Chen and Li (2007b). The former dataset (denoted as ZW225) consists of 225 apoptosis proteins divided into four subcellular locations, with 41 nuclear proteins, 70 cytoplasmic proteins, 25 mitochondrial proteins and 89 membrane proteins, while the protein sequences in the second dataset (denoted as CL317) are classified into six subcellular locations, comprising 112 cytoplasmic proteins, 55 membrane proteins, 34 mitochondrial proteins, 17 secreted proteins, 52 nuclear proteins and 47 endoplasmic reticulum proteins. All the protein sequences in the two datasets were extracted from SWISS-PROT, and the accession numbers can be found in the literature (Zhou and Doctor 2003; Zhang et al. 2006).

As is well known, the sequence similarity within a dataset seriously affects the final evaluation results, and thus should be considered when a dataset is constructed. For example, Chou and Shen (2010a, b) constructed a benchmark dataset of eukaryotic proteins using a cutoff similarity threshold of 25%. In this study, however, we did not apply a stringent threshold to cut off the homologous sequences from the original datasets, because the two datasets, which have served as widely used benchmarks for evaluating newly proposed methods (Chen and Li 2007b; Zhang et al. 2009; Gu et al. 2010), contain too few samples to allow the sequence identity to be reduced further.

Substitution matrix

The 20 amino acids differ in their degree of similarity to one another. The mutations between them are scored by a 20 × 20 matrix called a substitution matrix (Henikoff and Henikoff 1992; Leslid et al. 2002; Malde 2008). In bioinformatics and evolutionary biology, the substitution matrix describes the rate at which one character in a sequence changes to other character states over time. Substitution matrices are usually used in the context of amino acid or DNA sequence alignment, where the similarity between sequences depends on their divergence time and the substitution rates as represented in the matrix. In this paper, we use different substitution matrices belonging to two well-known families, Blosum (Henikoff and Henikoff 1992) and Pam (Dayhoff et al. 1978), namely Blosum40, Blosum62, Blosum100, Pam40, Pam80, and Pam160.

Representation of protein sequence

We denote a given 20 × 20 substitution matrix as M, whose element \( M_{i,j} \) scores the mutation of amino acid i to amino acid j during the evolution process (i, j = 1, 2,…,20). The matrix M can be written as a row of 20 column vectors, that is, \( M = \left( V_{1}, V_{2}, \ldots, V_{20} \right) \), where \( V_{i} = \left( M_{1,i}, M_{2,i}, \ldots, M_{20,i} \right)^{T} \). For a given protein sequence \( S = s_{1} s_{2} \ldots s_{L} \), \( s_{i} \) represents the ith amino acid of the sequence and can be replaced by the corresponding vector \( V_{s_{i}} \) of the substitution matrix M. In this way, we obtain a 20 × L matrix D. For convenience, let us denote

$$ D = \left( {V_{{s_{1} }} ,V_{{s_{2} }} ,\ldots ,V_{{s_{L} }} } \right) $$

to describe the given protein sequence.
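The construction of D can be sketched in a few lines of Python. This is only an illustrative sketch: the residue ordering in `AMINO_ACIDS` and the helper names are assumptions, and `toy_substitution_matrix` is a hypothetical stand-in (a constant match score on the diagonal and a constant mismatch score elsewhere) for a real matrix such as Blosum100, whose values are not reproduced here.

```python
# The 20 standard amino acids in a fixed order (the ordering is an assumption).
AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"


def toy_substitution_matrix(match=4, mismatch=-1):
    """Hypothetical stand-in for a real 20 x 20 matrix such as Blosum100."""
    n = len(AMINO_ACIDS)
    return [[match if i == j else mismatch for j in range(n)] for i in range(n)]


def encode_sequence(seq, matrix):
    """Return the 20 x L matrix D whose j-th column is the column of M for residue s_j."""
    index = {aa: k for k, aa in enumerate(AMINO_ACIDS)}
    cols = [index[aa] for aa in seq]  # raises KeyError on non-standard residues
    # D[i][j] = M[i][s_j], i.e. column s_j of M placed at position j
    return [[matrix[i][c] for c in cols] for i in range(len(AMINO_ACIDS))]


M = toy_substitution_matrix()
D = encode_sequence("MKVLA", M)
print(len(D), len(D[0]))  # 20 5
```

With a real substitution matrix in place of the toy one, each column of D would carry the 20 substitution scores of the residue at that position.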

To apply the SVM classifier, the protein sequences must be converted into fixed-length vectors. AAC is a conventional feature construction method, which refers to the occurrence frequency of each of the 20 amino acids in a given protein sequence. Since the information in the primary sequence is greatly reduced when the AAC alone is considered, other informative features should be taken into account. Here, the auto covariance (AC) transformation is introduced to convert the above matrix D into a fixed-length vector. As a statistical tool for analyzing sequences of vectors developed by Wold et al. (1993), the AC transformation has been successfully used for protein family classification (Guo et al. 2006; Lapinsh et al. 2002), protein interaction prediction (Guo et al. 2008) and prediction of secondary structure content (Lin and Pan 2001; Zhang et al. 1998, 2001). Here, the AC variable measures the average correlation between two residues separated by a distance of lg along the sequence S, and is calculated by

$$ AC\left( {i,\lg } \right) = \sum\limits_{j = 1}^{L - \lg } {\left( {D_{i,j} - \bar{D}_{i} } \right)} \left( {D_{i,j + \lg } - \bar{D}_{i} } \right)/\left( {L - \lg } \right) $$

where i denotes the ith amino acid, L is the length of the protein sequence, \( D_{i,j} \) is the matrix score of amino acid i at position j, and \( \bar{D}_{i} \) is the average score for amino acid i along the whole sequence:

$$ \overline{{D_{i} }} = \sum\limits_{j = 1}^{L} {D_{i,j} /L} $$

In this way, the number of AC variables is 20 × LG, where LG is the maximum value of lg (lg = 1, 2,…,LG). Combining the 20 AAC variables and the 20 × LG AC variables, each protein sequence is characterized by a (20 + 20 × LG)-D feature vector.
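As a minimal sketch of the feature construction described above, the following Python code computes the AC variables row by row and concatenates them with the AAC. The function names and the residue ordering are assumptions, and the matrix D in the usage lines is filled with placeholder scores rather than real substitution scores; only the resulting dimension is meaningful.

```python
AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"  # residue ordering is an assumption


def aac(seq):
    """Occurrence frequency of each of the 20 amino acids in seq."""
    return [seq.count(a) / len(seq) for a in AMINO_ACIDS]


def auto_covariance(D, max_lag):
    """AC(i, lg) for every row i of D and lg = 1..max_lag, flattened row by row."""
    L = len(D[0])
    if max_lag >= L:
        raise ValueError("LG must be smaller than the sequence length L")
    features = []
    for row in D:
        mean = sum(row) / L  # average score of this row over the whole sequence
        centered = [x - mean for x in row]
        for lg in range(1, max_lag + 1):
            s = sum(centered[j] * centered[j + lg] for j in range(L - lg))
            features.append(s / (L - lg))
    return features


def feature_vector(seq, D, max_lag):
    """Concatenate the 20 AAC values with the 20 x LG AC values."""
    return aac(seq) + auto_covariance(D, max_lag)


seq = "ARNDCQEGHILKMFPSTWYV"
# Placeholder 20 x L scores, NOT a real substitution-matrix encoding.
D = [[(i * j) % 5 for j in range(len(seq))] for i in range(20)]
x = feature_vector(seq, D, max_lag=15)
print(len(x))  # 320
```

With LG = 15 the vector has 20 + 20 × 15 = 320 components. Note that the transformation requires LG < L, so sequences shorter than LG + 1 residues would need separate handling.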

Support vector machine

In recent years, SVM-based machine learning algorithms have been used to predict various protein attributes, such as membrane protein type (Cai et al. 2004), protein structural class (Cai et al. 2002a; Ding et al. 2007), specificity of GalNAc-transferase (Cai et al. 2002c), HIV protease cleavage sites in proteins (Cai et al. 2002b), and so on. The algorithm often achieves higher prediction accuracy than other classification approaches when invariant feature vectors are used (Cai et al. 2002a; Hua and Sun 2001; Huang and Shi 2005; Zhou et al. 2007). The basic idea of applying SVM to pattern classification can be stated briefly: first, map the input vectors into a feature space; then, within this feature space, construct a hyperplane that separates the two classes. SVM training seeks a globally optimal solution and is designed to limit over-fitting, so it is able to deal with a large number of features.

In our study, the LIBSVM package is used to implement the SVM classifier (Chang and Lin 2009). The radial basis function (RBF) is chosen as the kernel function, which is defined as \( K\left( {x,x^{\prime}} \right) = \exp \left( { - \gamma \left\| {x - x^{\prime}} \right\|^{2} } \right) \). Two parameters, the regularization parameter C and the kernel width parameter γ, are optimized on the training set using the grid search strategy provided with LIBSVM.
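To make the kernel and the grid search concrete, the sketch below evaluates the RBF kernel for two vectors and enumerates candidate (C, γ) pairs on a base-2 logarithmic grid. It does not call LIBSVM; the function names are hypothetical, and the exponent ranges are an assumption chosen to resemble ranges commonly used with LIBSVM's grid-search script, not values reported in this paper. In an actual search, each pair would be scored by cross-validation on the training set and the best-scoring pair retained.

```python
import math


def rbf_kernel(x, y, gamma):
    """K(x, x') = exp(-gamma * ||x - x'||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)


def parameter_grid(log2c=range(-5, 16, 2), log2g=range(-15, 4, 2)):
    """Candidate (C, gamma) pairs; the exponent ranges are assumptions."""
    return [(2.0 ** c, 2.0 ** g) for c in log2c for g in log2g]


grid = parameter_grid()
print(len(grid))  # 110
print(rbf_kernel([1.0, 2.0], [1.0, 2.0], gamma=8.0))  # 1.0
```

Under these assumed ranges, the grid contains the pair C = 128 (2^7) and γ = 8 (2^3).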

Evaluation methods

In statistical prediction, the following three cross-validation methods are often used to examine the effectiveness of a predictor in practical application: the independent dataset test, the subsampling test, and the jackknife test (Chou and Zhang 1995). However, as elucidated by Chou and Shen (2008a) and demonstrated by Eq. 50 of Chou and Shen (2007a), among the three cross-validation methods the jackknife test is deemed the most objective because it always yields a unique result for a given benchmark dataset, and hence it has been increasingly used by investigators to examine the accuracy of various predictors (Chen et al. 2009; Ding et al. 2009; Jiang et al. 2008; Li and Li 2008; Lin 2008; Lin et al. 2008; Zeng et al. 2009; Zhou 1998; Zhou et al. 2007). In this paper, the jackknife test is therefore employed to evaluate the prediction performance of our method. Each protein sequence in the dataset is singled out in turn as the test sample, and the remaining protein sequences are used as training samples. To evaluate the performance, the overall prediction accuracy \( A_{\text{c}} \), the individual sensitivity \( S_{\text{in}} \), the individual specificity \( S_{\text{ip}} \) and the Matthews correlation coefficient \( {\text{MCC}}_{i} \) are calculated as follows:

$$ S_{\text{in}} = {\frac{{{\text{TP}}_{i} }}{{{\text{TP}}_{i} + {\text{FN}}_{i} }}} $$
$$ S_{\text{ip}} = {\frac{{{\text{TN}}_{i} }}{{{\text{TN}}_{i} + {\text{FP}}_{i} }}} $$
$$ {\text{MCC}}_{i} = {\frac{{{\text{TP}}_{i} {\text{TN}}_{i} - {\text{FP}}_{i} {\text{FN}}_{i} }}{{\sqrt {\left( {{\text{TP}}_{i} + {\text{FP}}_{i} } \right)\left( {{\text{TP}}_{i} + {\text{FN}}_{i} } \right)\left( {{\text{TN}}_{i} + {\text{FP}}_{i} } \right)\left( {{\text{TN}}_{i} + {\text{FN}}_{i} } \right)} }}} $$
$$ A_{\text{c}} = {\frac{{\sum\nolimits_{i} {{\text{TP}}_{i} } }}{N}} $$

where \( {\text{TP}}_{i} \) denotes the number of proteins of the ith subcellular location that are correctly recognized, \( {\text{FN}}_{i} \) denotes the number of proteins of the ith subcellular location recognized as belonging to another location, \( {\text{FP}}_{i} \) denotes the number of proteins of other subcellular locations recognized as belonging to the ith location, \( {\text{TN}}_{i} \) denotes the number of proteins of other subcellular locations correctly recognized as not belonging to the ith location, and N is the total number of protein sequences.
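The jackknife splits and the four measures above can be sketched as follows. The function names, class labels and predictions are illustrative assumptions, not results from this study; the convention of returning 0 when a denominator vanishes is likewise an assumption.

```python
import math


def jackknife_splits(n):
    """Yield (training indices, test index) with each sample held out once."""
    for k in range(n):
        yield [i for i in range(n) if i != k], k


def per_class_metrics(y_true, y_pred, label):
    """Sensitivity, specificity and MCC for one subcellular location."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    tn = sum(t != label and p != label for t, p in pairs)
    sens = tp / (tp + fn) if tp + fn else 0.0
    spec = tn / (tn + fp) if tn + fp else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return sens, spec, mcc


def overall_accuracy(y_true, y_pred):
    """Sum of TP_i over all classes divided by N."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Made-up labels standing in for jackknife predictions.
y_true = ["cyto", "cyto", "nuc", "nuc", "mito", "mito"]
y_pred = ["cyto", "nuc", "nuc", "nuc", "mito", "cyto"]
sens, spec, mcc = per_class_metrics(y_true, y_pred, "nuc")
print(sens, spec)  # 1.0 0.75
```

In a full jackknife run, `y_pred[k]` would be the prediction for sample k from a classifier trained on the indices returned by `jackknife_splits` for that k.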

Results and discussion

First, the dataset CL317 is used to validate the proposed method. We transform each apoptosis protein sequence into a fixed-length vector through the substitution matrix and the auto covariance transformation. These feature vectors are then fed to the SVM classifier to perform the prediction. We select the radial basis kernel function to build the prediction model, and the two parameters are set at C = 128 and γ = 8. To optimize the prediction accuracy, we investigate the effect of the parameter LG and of the different substitution matrices (Blosum40, Blosum62, Blosum100, Pam40, Pam80, Pam160) on the quality of our method; the results are shown in Fig. 1. As can be seen from Fig. 1, the accuracy first increases to a maximum value and then declines slightly as LG increases further, with small fluctuations at the end. The best prediction accuracy for the dataset CL317 reaches 90% when LG is 15 and the substitution matrix is Blosum100. It is worth mentioning that the Blosum matrices, which are based on the replacement patterns found in more highly conserved regions of the sequences, have also been shown to perform better in many studies (Henikoff and Henikoff 1993; Johnson and Overington 1993).

Fig. 1 Effect of the LG values and substitution matrices on dataset CL317 in jackknife test

To validate the performance of the proposed approach further, the dataset ZW225 is adopted. We again employ the parameter LG = 15 and the substitution matrix Blosum100 on the dataset ZW225. The prediction results obtained for the datasets CL317 and ZW225 with the SVM classifier and the jackknife test are listed in Table 1.

Table 1 Prediction results on two datasets in jackknife test

From Table 1, we can see that the overall accuracies of our method for the ZW225 and CL317 datasets are 87.1 and 90%, respectively. Table 2 shows the prediction results of different methods by the jackknife test for the ZW225 dataset. The overall accuracy of our method is higher than those of EBGW_SVM (Zhang et al. 2006), DF_SVM (Zhang et al. 2009), and ID_SVM (Chen and Li 2007b). The sensitivity for each protein class is also listed. For example, the sensitivity for mitochondrial proteins reaches 85.7% in our method, while the others are 60, 64, 68 and 60%. For the nuclear proteins, the sensitivity of our method is 84.6%, which is also the highest. We also compared our method with other methods on the CL317 dataset (Table 3). The overall accuracy of our method reaches 90%, which is slightly lower than those of FKNN (Jiang et al. 2008), FKNN (Ding and Zhang 2008), PseAAC_SVM (Lin et al. 2009), and EN_FKNN (Gu et al. 2010), but higher than those of the other three methods (Chen and Li 2007a, b; Zhang et al. 2009). All the results indicate that the proposed method performs well in the prediction of subcellular locations. The successful performance of our method may be attributed to the following reasons: (1) compared with the conventional classification-based and composition-based methods, our approach, which makes use of the Blosum100 matrix, can quantitatively measure the various degrees of similarity between amino acids; (2) with the parameter value LG = 15 adopted in the auto covariance transformation, our method considers correlations not only between neighboring residues but also between residues far apart in the sequence, and thus captures more sequence-order information.

Table 2 Comparison of different methods by the jackknife test on ZW225 dataset
Table 3 Comparison of different methods by the jackknife test on CL317 dataset

Conclusions

Based on the amino acid substitution matrix and the auto covariance transformation, a new representation model for protein sequences was presented and applied to predict the subcellular location of apoptosis proteins. Two datasets, CL317 and ZW225, were selected to validate the performance of the proposed method. Compared with other feature extraction approaches, our model is shown to be effective in extracting information from protein sequences. The experimental results indicate that the proposed method is promising. As the size of the available datasets grows, we hope that our model will be a useful complementary tool to the existing methods for further study in the prediction of the subcellular location of apoptosis proteins.