Introduction

One of the fundamental goals in cell biology and proteomics is to identify the functions of proteins in the cellular environment. Determining protein subcellular localization purely by experimental approaches is both time-consuming and expensive, particularly as the number of new protein sequences yielded by high-throughput sequencing technologies in the post-genomic era has increased explosively. Facing such an avalanche of new protein sequences, it is both challenging and indispensable to develop automated methods for fast and accurate annotation of the subcellular attributes of uncharacterized proteins. The knowledge thus obtained can help in the timely utilization of these newly found protein sequences for both basic research and drug discovery (Chou 2004; Lubec et al. 2005).

During the last decade, many theoretical and computational methods were developed in an attempt to predict protein subcellular localization (Cedano et al. 1997; Chou and Elrod 1999; Nakai and Horton 1999; Chou 2000, 2001; Chou and Cai 2002, 2003a, b; Cai and Chou 2003, 2004; Cui et al. 2004; Gao et al. 2005; Pan et al. 2003; Zhou and Doctor 2003; Xiao et al. 2005, 2006; Wang et al. 2005; Shi et al. 2006, 2007; Gardy and Brinkman 2006; Chou and Shen 2006a, b, c, 2007a; Xiao and Chou 2007; Liu et al. 2007; Shen and Chou 2005a, b, 2007a; Shen et al. 2007). However, these prediction methods were established chiefly on the basis of a single classifier, or represented protein sequences using statistical approaches and amino acid physicochemical characteristics. The prediction quality could be further improved by introducing protein evolutionary information and by combining multiple classifiers. Protein subcellular localization tends to be evolutionarily conserved, and the evolutionary rates of proteins with different subcellular localizations differ (Julenius and Pedersen 2006). Thus, the evolutionary conservation information of proteins could be useful for distinguishing the different subcellular localizations.

Here, we introduce a novel method to calculate amino acid evolutionary conservation scores. Once the residue conservation scores are obtained, a protein sequence of letter symbols can be translated into a numerical signal sequence. We can then form a feature vector with the wavelet multi-scale energy (MSE) function (Shi et al. 2007) to represent the protein for subcellular localization prediction. Protein samples can also be represented by other feature vectors formed by the weighted auto-correlation function (Zhang et al. 2006a) and the moment descriptor (Shi et al. 2006). Based on this hybrid representation, a novel ensemble classifier was formed by fusing individual classifiers through a sum decision rule and a product decision rule (Kittler et al. 1998). The success rates obtained by hybridizing multi-source information of proteins and fusing multiple classifiers in predicting protein subcellular localization were significantly improved.

Methods

Residue conservation

The residue ranking function assigns a score to each residue, according to which residues can be sorted in order of the presumably decreasing evolutionary pressure they experience. Among the many methods proposed in the literature (Lichtarge et al. 1996; Soyer et al. 2004; Mihalek et al. 2004), the Lichtarge group's hybrid methods (the real-valued evolutionary trace method and the zoom method) are two robust approaches. These methods rank the evolutionary importance of residues in a protein family based on the column variation in multiple sequence alignments (MSA) and on evolutionary information extracted from the underlying phylogenetic trees (Mihalek et al. 2004). However, the hybrid methods treat the gaps in the multiple sequence alignment as a 21st amino acid. In this paper, we propose an improved algorithm to estimate residue evolutionary conservation. The calculation proceeds as follows:

First, an initial set of similar sequences was created by running three iterations of PSI-BLAST (Altschul et al. 1997) with an E-value cutoff of 0.001 against the UniProt protein database (http://www.expasy.org/). The PSI-BLAST result sets were then aligned with a standard alignment method such as ClustalW 1.83 (Thompson et al. 1994), yielding the MSA.

Second, an MSA is divided into sub-alignments (that is, g groups) that correspond to nodes in the tree (Mihalek et al. 2004). This subdivision of an MSA into smaller alignments reflects the tree topology, and therefore the evolutionary variation information within it. The von Neumann entropy (VNE) (Caffrey et al. 2004; Mintseris and Weng 2005) for a residue belonging to alignment column i in a group-g MSA is then given by the following equation:

$$ {\text{VNE}}^{g}_{i} = - {\text{Tr(}}\omega ^{g}_{i} {\text{log}}_{{20}} \,\omega ^{g}_{i} {\text{)}} $$
(1)

where \( \omega^{g}_{i} \) is a density matrix of group g with trace = 1. Apart from normalization by the trace, the density matrix is given by the product of the relative frequencies of the standard amino acids of group g at sub-alignment position i and an appropriate similarity matrix, that is, \( \omega^{g}_{i} = \operatorname{diag}\left[ f^{g}_{iA}, f^{g}_{iC}, \ldots, f^{g}_{i\alpha}, \ldots, f^{g}_{iY} \right] \times \text{Similarity matrix} \). Here \( f^{g}_{i\alpha} \) is the relative frequency of amino acid α (one of the 20 standard amino acids A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W, Y) of group g at alignment position i. The base of 20 ensures that all values are bounded between zero and one (assuming that we ignore entities such as “X”, “Z”, “B”, and “–”).

The calculation of Eq. (1) is facilitated by first calculating the eigenvalues \( \lambda^{g}_{i\alpha} \) of \( \omega^{g}_{i} \), and hence it follows that

$$ {\text{VNE}}^{g}_{i} = - {\sum\limits_{\alpha = 1}^{20} {\lambda ^{g}_{{i\alpha }} } }\log _{{20}} \,\lambda ^{g}_{{i\alpha }} $$
(2)

where \( \lambda^{g}_{i\alpha} \) are the eigenvalues of \( \omega^{g}_{i} \). Intuitively, the more variable the alignment column, the larger the entropy calculated with Eqs. (1) and (2); strongly conserved residues yield small values. The von Neumann entropy can therefore be converted into an information score (IS), which is larger for more conserved residues.

$$ {\text{IS}}^{g}_{i} = 1 - {\text{VNE}}^{g}_{i} = 1 + {\sum\limits_{\alpha = 1}^{20} {\lambda ^{g}_{{i\alpha }} {\text{log}}_{{20}} \,\lambda ^{g}_{{i\alpha }} } } $$
(3)

where \( {\text{IS}}^{g}_{i} \) is the IS of the residue belonging to alignment column i in a group-g MSA. Considering the effect of alignment gaps, Eq. (3) can be rewritten as

$$ {\text{IS}}^{g}_{i} = {\left( {1 + {\sum\limits_{\alpha = 1}^{20} {\lambda ^{g}_{{i\alpha }} {\text{log}}_{{20}} \,\lambda ^{g}_{{i\alpha }} } }} \right)} \times f^{g}_{{i,{\text{number}}}} $$
(4)

where \( f^{g}_{i,\text{number}} \) is the number of standard amino acids of group g at alignment position i, divided by the number of aligned sequences in group g (i.e., the fraction of non-gap symbols in the column).
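As a sketch (not the authors' code), the information score of Eqs. (1)–(4) can be computed for a single alignment column as follows. The identity matrix stands in for the similarity matrix, which is an assumption of this sketch; the paper leaves the choice of similarity matrix open.

```python
import numpy as np

AA = "ACDEFGHIKLMNPQRSTVWY"  # 20 standard amino acids, alphabetical one-letter codes

def information_score(column, similarity=None):
    """Gap-weighted information score of Eqs. (1)-(4) for one alignment
    column of a group, given as a string of residues and '-' gaps."""
    counts = np.array([column.count(a) for a in AA], dtype=float)
    n_standard = counts.sum()
    if n_standard == 0:
        return 0.0
    freqs = counts / n_standard                      # relative frequencies f_i_alpha
    if similarity is None:
        similarity = np.eye(20)                      # placeholder similarity matrix
    omega = np.diag(freqs) @ similarity
    omega /= np.trace(omega)                         # density matrix with trace = 1
    lam = np.linalg.eigvals(omega).real
    lam = lam[lam > 1e-12]                           # drop zero eigenvalues (0*log 0 = 0)
    vne = -np.sum(lam * np.log(lam) / np.log(20))    # Eq. (2), base-20 logarithm
    gap_factor = n_standard / len(column)            # f_i,number of Eq. (4)
    return (1.0 - vne) * gap_factor                  # Eq. (3) weighted by the gap factor
```

With the identity similarity matrix a fully conserved, gap-free column scores 1, and gaps scale the score down proportionally.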

The evolutionary score for a residue belonging to column i in an MSA is given by the following equations.

$$ R_{i} = 1 + {\sum\limits_{n = 1}^{N - 1} {w_{{{\text{node}}}} {\left( n \right)}{\sum\limits_{g = 1}^n {w_{{{\text{group}}}} {\left( g \right)}{\left[ {{\left( {1 + {\sum\limits_{\alpha = 1}^{20} {\lambda ^{g}_{{i\alpha }} {\text{log}}_{{20}} \,\lambda ^{g}_{{i\alpha }} } }} \right)} \times f^{g}_{{i,{\text{number}}}} } \right]}} }} } $$
(5)

where \( w_{\text{node}}(n) \) and \( w_{\text{group}}(g) \) are weights assigned to node n and group g, respectively.

$$ w_{\text{node}}(n) = \left\{ \begin{array}{ll} 1 & \text{if } n \text{ is on the path to the query protein} \\ 0 & \text{otherwise} \end{array} \right. $$
(6)
$$ w_{\text{group}}(g) = \left\{ \begin{array}{ll} 1 & \text{if } g \text{ is on the path to the query protein} \\ 0 & \text{otherwise} \end{array} \right. $$
(7)
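Because the weights in Eqs. (6)–(7) are 0/1 indicators, Eq. (5) reduces to 1 plus the sum, over the nodes on the query protein's path through the tree, of the information score of the one group containing the query at each node. A minimal sketch (the dictionary layout is our own, hypothetical):

```python
def residue_rank(is_by_node_group, query_group_at_node):
    """Eq. (5): R_i = 1 + sum_n w_node(n) sum_g w_group(g) IS_i^g.
    is_by_node_group: {node: {group: IS_i^g}} for column i;
    query_group_at_node: {node: group containing the query protein}.
    Only groups on the query's path contribute (w_node = w_group = 1)."""
    total = 1.0
    for node, groups in is_by_node_group.items():
        g = query_group_at_node.get(node)
        if g is not None and g in groups:
            total += groups[g]
    return total
```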

Multi-scale energy

With the residue conservation scores calculated, a protein sequence of letters can be translated into a conservation score sequence. This numerical sequence can be treated as a digital signal. By projecting the signal onto a set of wavelet basis functions at various scales, the fine-scale and large-scale conservation information of a protein can be investigated simultaneously. Here, the wavelet basis function used is the symlet wavelet (Pittner and Kamarthi 1999). Consequently, the protein can be characterized by the following MSE feature vector (Shi et al. 2007):

$$ {\text{MSE}} = {\left[ {d_{1} , \ldots ,d_{j} , \ldots ,d_{m} ,a_{m} } \right]} $$
(8)

Here, m is the coarsest scale of decomposition, \( d_j \) is the root mean square energy of the wavelet detail coefficients at the jth scale, and \( a_m \) is the root mean square energy of the wavelet approximation coefficients at scale m. The energy factors \( d_j \) and \( a_m \) are defined as

$$ d_{j} = {\sqrt {\tfrac{1} {{N_{j} }}{\sum\limits_{n = 0}^{N_{j} - 1} {{\left[ {u_{j} {\left( n \right)}} \right]}^{2} } }} }\quad j = 1,2, \ldots ,m $$
(9)
$$ a_{m} = {\sqrt {\tfrac{1} {{N_{m} }}{\sum\limits_{n = 0}^{N_{m} - 1} {{\left[ {v_{m} {\left( n \right)}} \right]}^{2} } }} } $$
(10)

Here, \( N_j \) is the number of wavelet detail coefficients at scale j, \( N_m \) is the number of wavelet approximation coefficients, \( u_j(n) \) is the nth detail coefficient at the jth scale, and \( v_m(n) \) is the nth approximation coefficient at scale m. In general, for a protein sequence of length L, m equals INT(log2 L). However, because protein sequences have diverse lengths, the value of m must be selected by optimization. Here, we select m = 12.
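The MSE vector of Eqs. (8)–(10) can be sketched with a hand-rolled Haar transform. The Haar wavelet is a stand-in chosen for self-containment; the paper uses a symlet, which in practice would come from a wavelet library such as PyWavelets.

```python
import numpy as np

def haar_dwt_step(x):
    """One level of the Haar wavelet transform: returns (approximation, detail)."""
    if len(x) % 2:
        x = np.append(x, x[-1])          # pad odd-length signals
    even, odd = x[0::2], x[1::2]
    return (even + odd) / np.sqrt(2), (even - odd) / np.sqrt(2)

def mse_features(scores, m=4):
    """MSE vector [d_1, ..., d_m, a_m] of Eqs. (8)-(10): RMS energies of
    the detail coefficients at scales 1..m and of the approximation at m."""
    x = np.asarray(scores, dtype=float)
    d = []
    for _ in range(m):
        x, detail = haar_dwt_step(x)
        d.append(np.sqrt(np.mean(detail ** 2)))   # Eq. (9)
    a_m = np.sqrt(np.mean(x ** 2))                # Eq. (10)
    return np.array(d + [a_m])
```

A constant signal has no detail energy at any scale, so only \( a_m \) is nonzero, which is a quick sanity check on the decomposition.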

In order to represent a protein sequence with a discrete model yet without completely losing its sequence order information, Chou (2001, 2005) introduced the concept of pseudo-amino acid (PseAA) composition. Ever since Chou's PseAA composition was introduced, various PseAA composition approaches have been developed for improving the prediction quality of protein attributes (see, e.g., Chen et al. 2006a, b; Chen and Li 2007; Diao et al. 2007a, b; Ding et al. 2007; Du and Li 2006; Fang et al. 2007; Gao et al. 2005a; Kurgan et al. 2007; Li and Li 2007; Lin and Li 2007a, b; Mondal et al. 2006; Mundra et al. 2007; Pan et al. 2003; Pu et al. 2007; Shi et al. 2007; Xiao and Chou 2007; Xiao et al. 2006; Zhang et al. 2006a; Zhang and Ding 2007; Zhou et al. 2007; Chou and Shen 2007b). Recently, a web server called PseAAC (Shen and Chou 2007b) was established at http://chou.med.harvard.edu/bioinf/PseAAC/. PseAAC is a flexible web server with which users can generate 63 different parallel correlation types of PseAA composition, 63 different series correlation types of PseAA composition, as well as the dipeptide PseAA composition. Here, we propose a different approach to formulating the PseAA composition by combining the MSE with the amino acid composition (AAC), which consists of the 20-D vector of amino acid frequencies. The protein can then be represented by the following (20 + m + 1)-D vector.

$$ X = {\left[ {f_{1} ,f_{2} , \ldots ,f_{\alpha } , \ldots ,f_{{20}} ,d_{1} ,d_{2} , \ldots ,d_{j} , \ldots ,d_{m} ,a_{m} } \right]}^{T} $$
(11)

Here \( f_{\alpha} \) (α = 1, 2, …, 20) are the occurrence frequencies of the 20 standard amino acids in the protein sequence concerned, arranged alphabetically according to their single-letter codes. For convenience, the feature set based on residue evolutionary conservation and the MSE approach is written as EMSE.
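The assembly of Eq. (11) is then a simple concatenation; `aac` below is our own illustrative helper for the 20-D composition, and `mse` stands for the already-computed MSE part:

```python
import numpy as np

AA = "ACDEFGHIKLMNPQRSTVWY"  # alphabetical single-letter codes

def aac(sequence):
    """20-D amino acid composition: occurrence frequencies f_1 .. f_20."""
    seq = sequence.upper()
    counts = np.array([seq.count(a) for a in AA], dtype=float)
    return counts / max(counts.sum(), 1.0)

def emse_vector(sequence, mse):
    """(20 + m + 1)-D EMSE feature vector of Eq. (11): AAC followed by MSE."""
    return np.concatenate([aac(sequence), np.asarray(mse, dtype=float)])
```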

Multi-classifiers fusion methods

Let us assume that we have R classifiers, each representing the given pattern by a distinct measurement vector. Denote the measurement vector used by the ith classifier by \( x_i \). Then \( p(x_1, x_2, \ldots, x_R \mid \omega_k) \) represents the joint probability distribution of the measurements extracted by the classifiers, conditioned on class \( \omega_k \). Let us also assume that the representations used are conditionally statistically independent; the use of different representations may give rise to such independence in special cases. We will investigate the consequences of this assumption and write

$$ p{\left( {x_{1} ,x_{2} , \ldots ,x_{R} |\omega _{k} } \right)} = {\prod\limits_{i = 1}^R {p{\left( {x_{i} |\omega _{k} } \right)}} } $$
(12)

where \( p(x_i \mid \omega_k) \) is the measurement process model of the ith representation. According to Bayes' theorem, the product decision rule can be written as (Kittler et al. 1998)

$$ \begin{aligned}{} & {\text{assign}}\quad {\text{protein}}\;{\text{X}} \to {\text{class}}\;\omega _{j} \quad {\text{if}} \\ & P{\left( {\omega _{j} } \right)}{\prod\limits_{i = 1}^R {P{\left( {\omega _{j} |x_{i} } \right)} = {\mathop {{\text{max}}}\limits_{k = 1}^m }} }P{\left( {\omega _{k} } \right)}{\prod\limits_{i = 1}^R {P{\left( {\omega _{k} |x_{i} } \right)}} } \\ \end{aligned} $$
(13)

where p k | x i ) is a posteriori probability yield by the ith classifier. The decision rule (13) quantifies the likelihood of a hypothesis by combining the posteriori probabilities generated by the individual classifier by means of a product rule. It is effectively a severe rule of fusing the classifier outputs as it is sufficient for a single recognition engine to inhibit a particular interpretation by outputting a close to zero probability for it.

In some applications it may be appropriate to assume further that the a posteriori probabilities computed by the respective classifiers do not deviate dramatically from the prior probabilities. This is a rather strong assumption, but it may be readily satisfied when the available observational discriminatory information is highly ambiguous due to high levels of noise. In such a situation, we obtain the sum decision rule (Kittler et al. 1998).

$$ \begin{aligned} & {\text{assign}}\quad {\text{protein}}\;X \to {\text{class}}\;\omega_{j} \quad {\text{if}} \\ & (1 - R)P(\omega_{j}) + \sum\limits_{i = 1}^{R} P(\omega_{j} \mid x_{i}) = \mathop{\max}\limits_{k = 1}^{m} \left[ (1 - R)P(\omega_{k}) + \sum\limits_{i = 1}^{R} P(\omega_{k} \mid x_{i}) \right] \\ \end{aligned} $$
(14)
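Both decision rules can be sketched directly from Eqs. (13)–(14). The posteriors below are placeholder numbers for illustration, not outputs of the paper's classifiers:

```python
import numpy as np

def fuse(posteriors, priors, rule="sum"):
    """Combine R classifiers over m classes.
    posteriors: (R, m) array of P(omega_k | x_i); priors: (m,) array of P(omega_k).
    Returns the index of the winning class under the product rule (Eq. 13)
    or the sum rule (Eq. 14)."""
    P = np.asarray(posteriors, dtype=float)
    pri = np.asarray(priors, dtype=float)
    R = P.shape[0]
    if rule == "product":
        scores = pri * np.prod(P, axis=0)            # Eq. (13)
    else:
        scores = (1 - R) * pri + np.sum(P, axis=0)   # Eq. (14)
    return int(np.argmax(scores))
```

The product rule is the "severe" one: a single near-zero posterior vetoes a class, whereas the sum rule merely lowers its score.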

Results and discussion

Results with different feature extraction methods

The training dataset and independent dataset taken from Chou's paper (Chou 2000) were used to validate the current method. The schematic illustration of the 12 subcellular localizations of proteins is shown in Fig. 1. Among the independent dataset test (INDE), sub-sampling (e.g., fivefold or tenfold cross-validation) test, and jackknife test (JACK), which are often used for examining the accuracy of a statistical prediction method, the jackknife test was deemed the most rigorous and objective (Chou and Zhang 1995), as demonstrated by a penetrating analysis in a recent comprehensive review (Chou and Shen 2007c), and it has been increasingly and widely utilized by investigators to test the power of various prediction methods (see, e.g., Cao et al. 2006; Chen et al. 2006a, b, 2007; Chen and Li 2007; Diao et al. 2007a, b; Ding et al. 2007; Du and Li 2006; Fang et al. 2007; Gao and Wang 2006; Gao et al. 2005a, b, 2006a, b; Huang and Li 2004; Jahandideh et al. 2007; Kedarisetti et al. 2006; Li and Li 2007; Lin and Li 2007a, b; Liu et al. 2007; Mondal et al. 2006; Niu et al. 2006; Pan et al. 2003; Shen and Chou 2007c; Shen et al. 2007; Shi et al. 2007; Sun and Huang 2006; Tan et al. 2007; Wang et al. 2005; Wen et al. 2006; Xiao and Chou 2007; Xiao et al. 2005, 2006; Zhang and Ding 2007; Zhang et al. 2003, 2006a, b; Zhou 1998; Zhou and Assa-Munt 2001; Zhou and Doctor 2003; Zhou et al. 2007). For convenience, the feature set based on the weighted auto-correlation function approach (Zhang et al. 2006a) is written as PARJ, which is composed of the AAC and the weighted auto-correlation functions of the amino acid residue index PARJ860101 (Parker et al. 1986); the feature set based on the moment descriptor approach (Shi et al. 2006) is written as MD. The results of the four feature extraction methods based on a support vector machine (SVM) with the "one-versus-one" classification policy (Zhang et al. 2006a) are shown in Table 1.
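The jackknife (leave-one-out) protocol can be sketched generically as below; `train_fn` and `predict_fn` are placeholders for any classifier, e.g. the SVM used in this study (the 1-NN classifier in the usage note is purely illustrative):

```python
import numpy as np

def jackknife_accuracy(X, y, train_fn, predict_fn):
    """Leave-one-out (jackknife) test: each sample is predicted by a model
    trained on all remaining samples; returns the overall accuracy."""
    correct = 0
    n = len(y)
    for i in range(n):
        mask = np.arange(n) != i                     # hold out sample i
        model = train_fn(X[mask], y[mask])
        correct += predict_fn(model, X[i:i + 1])[0] == y[i]
    return correct / n
```

For instance, plugging in a toy 1-nearest-neighbor classifier on two well-separated clusters yields perfect jackknife accuracy.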

Fig. 1

Schematic illustration to show the 12 subcellular localizations of proteins: chloroplast, cytoplasm, cytoskeleton, endoplasmic reticulum, extracell, Golgi apparatus, lysosome, mitochondria, nucleus, peroxisome, plasma membrane, and vacuole. Note that the vacuole and chloroplast proteins exist only in a plant. Reproduced from Fig. 2 of Chou (2001) with permission

Table 1 Results (in percentage) of four feature extraction methods with SVM and “one-versus-one” classification strategy

Table 1 shows that protein evolutionary conservation information can be used to predict subcellular localization. The overall accuracies of EMSE, PARJ and MD are almost equal, and all are higher than that of AAC in the jackknife and independent tests. For EMSE, the predictive accuracy depends critically on the input selection of sequences, namely, on the breadth and depth of the associated sequence similarity tree: how many initial similar sequences are selected, and how these sequences are pruned to form the multiple sequence alignment. With optimal choices of these two parameters, better results could be obtained. Considering the available computing power, the cutoff for the initial similar sequences was set to 250, and the initial similar sequences were not pruned. These results indicate that the performance of the predictive system can be improved by using different feature extraction methods, and that EMSE, PARJ and MD are effective representations of protein sequences and robust for predicting subcellular localization.

Hybrid results of multi-classifiers

Let us assume that the classifiers modeled on AAC, EMSE, PARJ and MD, respectively, are independent. The results of the ensemble classifier formed by fusing the four individual classifiers modeled on the AAC, EMSE, PARJ and MD feature vector sets, with the sum decision rule and the product decision rule respectively, are shown in Table 2. From Tables 1 and 2, we can see that the current ensemble hybrid classifier outperforms the individual classifiers by 1.3–4.7% in overall accuracy in the jackknife test. The performance of the sum decision rule is almost equal to that of the product decision rule, although the accuracy for some classes, such as chloroplast, cytoskeleton, Golgi apparatus and vacuole, differs by about 2–4% between the two rules. In addition, the multi-classifier fusion method assumes that the feature vectors used by the classifiers are independent, whereas the AAC, EMSE, PARJ and MD feature sets used in this paper are not strictly independent: the EMSE, PARJ and MD feature sets all include the AAC information. If the redundant information were removed from the EMSE, PARJ and MD feature sets before fusing the multiple classifiers, better fusion results might be obtained.

Table 2 Hybrid results (in percentage) of four individual classifier fusion with probability fusion system using sum decision rule and product decision rule

Comparison with other prediction methods

The performance of the hybrid method developed in this study was compared with existing methods, namely Pan's (Pan et al. 2003), Gao's (Gao et al. 2005a) and Xiao's (Xiao et al. 2005, 2006), which were developed on the same dataset. The comparison results of the different methods are listed in Table 3. They demonstrate that the overall prediction accuracies of our hybrid method are higher than those of the other four methods in both the jackknife and independent tests. For example, the overall accuracy of the hybrid method is 8.1% and 5.4% higher than that of Xiao's method (Xiao et al. 2005) in the jackknife and independent tests, respectively.

Table 3 Overall accuracy (in percentage) obtained by different methods

To further validate our methods, we used another dataset of Gram-negative bacterial proteins employed in previous work (Gardy et al. 2005). Gram-negative bacteria have five major protein subcellular localizations. The dataset consists of 1,444 sequences: 278 cytoplasmic, 309 inner membrane, 276 periplasmic, 391 outer membrane and 190 extracellular. The results of the different methods are shown in Table 4, from which we can see that our methods perform better than Gardy's method. The overall accuracy obtained by fusing EMSE, PARJ and MD is 6.7% higher than that of PSORTb v2.0. These results show that our methods are effective in predicting subcellular localization.

Table 4 Results (in percentage) of different methods based on the Gram-negative bacteria dataset

Conclusion

A new protein evolutionary feature extraction method and a hybrid approach for fusing multi-feature classifiers were proposed in this paper. The results show that representing a protein by residue evolutionary conservation and MSE better reflects protein evolutionary information and improves prediction of subcellular localization. The weighted auto-correlation function and moment descriptor methods effectively capture the sequence order effect. The novel hybrid approach of fusing multi-feature classifiers with the sum decision rule and the product decision rule is demonstrated to be a promising avenue.