Introduction

In eukaryotic cells, the nucleus is the largest and most prominent organelle. It organizes the genome and directs key life processes of the cell, such as cellular reproduction and cellular differentiation during the development of the organism. The compartmentalization of the cell nucleus is closely related to several nuclear processes and to the potential influence of cancer-related alterations on gene expression (Fraser and Bickmore 2007; Schneider and Grosschedl 2007). Mis-localized nuclear proteins can lead to human genetic disease and cancer (Sutherland et al. 2001; Zaidi et al. 2007). Thus, knowledge of protein subnuclear localization is essential for understanding cellular life processes and genomic regulation, and its prediction is an important topic in bioinformatics. Although subnuclear localization can be determined experimentally, such experiments are time-consuming and costly. The gap between the number of known protein sequences and the number of proteins with identified localization is rapidly increasing, so it is highly desirable to develop computational methods for rapidly identifying the subnuclear localization of proteins.

Various approaches for protein subcellular localization prediction have been developed (Cai et al. 2002; Cai and Chou 2003; Chou 2001; Chou and Cai 2002, 2004; Chou and Shen 2006a, b, 2007a, 2008; Gao et al. 2005a, b; Park and Kanehisa 2003; Shen and Chou 2006, 2007a, b, c, d; Xiao et al. 2005, 2006; Zhang et al. 2006b, c, d, 2007; Chen and Li 2007a, b; Zhou and Doctor 2003) [see Chou (2000), Feng (2002) and Chou and Shen (2007b) for comprehensive reviews of this area] since the algorithm proposed by Nakashima and Nishikawa (1994). Many recent prediction algorithms have been made freely available to scientists as web servers. However, prediction algorithms for protein subnuclear localization are far fewer than those for subcellular localization; only a few studies have been carried out. Shen and Chou (2005) developed the first algorithm for predicting localization among nine subnuclear compartments, in which a protein sequence is represented by a 34-D Pseudo Amino Acid (PseAA) composition and the Optimized Evidence-Theoretic KNN (OET-KNN) is used as the prediction engine. Lei and Dai (2005) employed a support vector machine (SVM) for prediction of six subnuclear localization classes. Huang et al. (2007) developed an algorithm named ProLoc, using an SVM with automatic selection of physicochemical composition features. Encouraged by the concept of the PseAA discrete model introduced by Chou (2001), three prediction approaches based on PseAA composition have been proposed (Mundra et al. 2007; Li and Li 2008; Shen and Chou 2007a, b, c, d, e). Meanwhile, many web servers for predicting the subcellular localization of proteins in various organisms have been established. Recently, two protocols with step-by-step guides were published (Chou and Shen 2008; Emanuelsson et al. 2007; Shen and Chou 2007d) to help experimental scientists use some important web servers to obtain the predictions they need. Three reasons are mainly responsible for the limited study in this field (Lei and Dai 2005; Mundra et al. 2007): (1) proteins within the cell nucleus face no apparent physical barrier such as a membrane; (2) the nucleus is far more compact and complicated than other compartments in a cell; and (3) protein complexes within the cell nucleus are not static.

Compared with the conventional amino acid composition (AAC), the PseAA composition as originally introduced by Chou can incorporate much more of the information in a protein sequence, and thereby remarkably enhances the power of a discrete model to predict various attributes of a protein. Based on the concept of PseAA composition, a series of follow-up studies have predicted protein subcellular localization and other protein attributes (Chen et al. 2006a, b; Chen and Li 2007a, b; Diao et al. 2008; Ding et al. 2007; Du and Li 2006; Fang et al. 2008; Gao et al. 2005b; Kurgan et al. 2007; Li and Li 2008; Lin and Li 2007a, b; Mondal et al. 2006; Mundra et al. 2007; Pu et al. 2007; Shi et al. 2007; Xiao and Chou 2007; Xiao et al. 2006; Zhang et al. 2006a, b, d, 2007, 2008; Zhang and Ding 2007; Zhou et al. 2007a, b). The promising results obtained from approaches based on PseAA composition indicate that the PseAA discrete model can effectively represent protein sequences in different subnuclear compartments. In this study, we propose a prediction system for subnuclear localization based on an ensemble classifier, in which a protein sample is represented by a PseAA composition characterized by approximate entropy (ApEn). ApEn is a non-negative number that quantifies the complexity of a time series (Pincus 1991; Richman and Moorman 2000). When the amino acids along a protein chain are replaced by a series of numbers, the protein sequence can be viewed as a short time series. Various studies based on ensemble classifiers have been carried out on the prediction of protein attributes (Chou and Shen 2006a, b, 2008; Shen and Chou 2006; Shen and Chou 2007a, b, c, d, e; Nanni and Lumini 2007; Kedarisetti et al. 2006). In Chou's PseAA discrete model, the weight factor is essential for the PseAA composition, and λ is an uncertain parameter. Shen and Chou (2006) used an ensemble approach that fuses PseAA components with different λ, and it has been used successfully to enhance prediction quality in a number of relevant areas (see, e.g., Chou and Shen 2007a; Chou and Shen 2007b). Our ensemble combines three AdaBoost classifiers; AdaBoost is a Boosting ensemble method that can generate a strong classifier from a weak one (Freund and Schapire 1997) and has been used for predicting protein structural classes (Niu et al. 2006).
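To make this view concrete, the following Python sketch encodes a protein sequence as a numerical series and computes its ApEn following Pincus (1991). The residue-to-number mapping (Kyte-Doolittle hydrophobicity) and the parameters m and r are illustrative assumptions only, not the settings used in this study:

```python
import numpy as np

# Assumed residue-to-number mapping (Kyte-Doolittle hydrophobicity); the paper
# does not specify the scale here, so this choice is purely illustrative.
HYDROPHOBICITY = {
    'A': 1.8, 'R': -4.5, 'N': -3.5, 'D': -3.5, 'C': 2.5,
    'Q': -3.5, 'E': -3.5, 'G': -0.4, 'H': -3.2, 'I': 4.5,
    'L': 3.8, 'K': -3.9, 'M': 1.9, 'F': 2.8, 'P': -1.6,
    'S': -0.8, 'T': -0.7, 'W': -0.9, 'Y': -1.3, 'V': 4.2,
}

def apen(u, m=2, r=0.2):
    """Approximate entropy of series u (Pincus 1991)."""
    u = np.asarray(u, dtype=float)
    n = len(u)

    def phi(m):
        # All overlapping windows of length m
        x = np.array([u[i:i + m] for i in range(n - m + 1)])
        # C_i: fraction of windows within Chebyshev distance r of window i
        c = [np.mean(np.max(np.abs(x - xi), axis=1) <= r) for xi in x]
        return np.mean(np.log(c))

    return phi(m) - phi(m + 1)

seq = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"   # toy sequence
series = [HYDROPHOBICITY[a] for a in seq]
print(apen(series, m=2, r=0.2 * np.std(series)))
```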

Materials and methods

Datasets

Two datasets often used in published work are adopted to validate the performance of the proposed approach. One is SNL9 (Shen and Chou 2005), which contains 370 proteins localized in 9 subnuclear compartments: 10 Cajal body, 59 chromatin, 31 heterochromatin, 65 nuclear diffuse, 25 nuclear pore, 15 nuclear speckle, 115 nucleolus, 10 PcG body, and 40 PML body. The other is SNL6 (Lei and Dai 2005), which contains 504 proteins localized in 6 subnuclear compartments: 38 PML body, 61 chromatin, 75 nuclear diffuse, 219 nucleolus, 56 nuclear speckle, and 55 nuclear lamina.

Representation of protein sequence

According to the concept of Chou's PseAA composition (Chou 2001), a protein sequence sample is represented as a point in (20 + λ)-D space:

$$ X = [x_{1}, x_{2}, \ldots, x_{20}, x_{21}, \ldots, x_{20+\lambda}]^{\text{T}} \in \Re^{(20+\lambda)} $$
(1)
$$ x_{i} = \begin{cases} \dfrac{f_{i}}{\sum_{j=1}^{20} f_{j} + w \sum_{j=1}^{\lambda} p_{j}} & (1 \le i \le 20) \\[2ex] \dfrac{w\,p_{i-20}}{\sum_{j=1}^{20} f_{j} + w \sum_{j=1}^{\lambda} p_{j}} & (21 \le i \le 20+\lambda) \end{cases} $$
(2)

where f_i (1 ≤ i ≤ 20) in Eq. (2) are the occurrence frequencies of the 20 amino acids in the sequence, i.e., the AAC that was often used to represent protein sequences in early studies (Chou 1995; Zhang et al. 1995; Nakashima and Nishikawa 1994; Shen et al. 2005); p_j (1 ≤ j ≤ λ) are additional factors that incorporate some sequence-order information; and w is the weight factor. Encouraged by our previous success in designing PseAA compositions with ApEn for the prediction of protein structural classes (Zhang et al. 2008) and subcellular localization (Zhang et al. 2006c), the ApEn values of the protein sequence are again adopted as the additional factors in the PseAA composition. The ApEn values of a protein sample can be easily computed [see Eqs. (3)–(8) in Zhang et al. (2008)].
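The following minimal sketch shows how a PseAA vector of Eqs. (1)-(2) can be assembled once the λ ApEn-based factors are in hand. The example ApEn values and the weight w = 0.1 are hypothetical; in this work the factors come from the ApEn computation of Zhang et al. (2008) and the parameters from the GA optimization described below:

```python
import numpy as np
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def pseaa(seq, apen_values, w=0.1):
    """Build the (20 + lambda)-D PseAA vector of Eqs. (1)-(2).

    seq         : protein sequence string
    apen_values : length-lambda list of ApEn-based factors p_1..p_lambda
    w           : weight factor (optimized by GA in this study)
    """
    counts = Counter(seq)
    f = np.array([counts[a] / len(seq) for a in AMINO_ACIDS])  # AAC part, f_i
    p = np.asarray(apen_values, dtype=float)
    denom = f.sum() + w * p.sum()          # shared normalizer of Eq. (2)
    return np.concatenate([f / denom, w * p / denom])

# Example: 20-D AAC part plus lambda = 3 hypothetical ApEn factors
x = pseaa("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ", [0.41, 0.37, 0.52], w=0.1)
print(x.shape, x.sum())  # (23,), sums to 1 by construction
```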

Ensemble classifier prediction systems

It is well known that in many situations combining the outputs of several classifiers leads to improved classification results (Opitz and Maclin 1999; Alexandre et al. 2001). The proposed prediction system consists of three AdaBoost classifiers combined into an ensemble. AdaBoost is an ensemble method that can generate a strong classifier from a weak one (Freund and Schapire 1997). The architecture of the ensemble is illustrated in Fig. 1; the weak classifiers of the three AdaBoost classifiers are decision stumps (Schapire and Singer 1999), the fuzzy K-nearest neighbor classifier (FKNN) (Keller et al. 1985; Huang and Li 2004; Shen et al. 2006), and radial basis function SVMs (Cristianini and Shawe-Taylor 2000), respectively. The AdaBoost algorithm is taken from the classification toolbox in Matlab (Duda et al. 2001) and is described in the next section. The AdaBoost method usually applies to two-class problems; for the present multi-class problems, the "one-vs-one" strategy is adopted, and a sketch of this reduction is given below.
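In the following compact sketch of the one-vs-one reduction, the function names and the binary-learner interface are our own illustrative conventions, not those of the Matlab toolbox:

```python
import numpy as np
from itertools import combinations

def one_vs_one_fit_predict(fit, X, y, X_test):
    """One-vs-one reduction of a multi-class problem to binary learners.

    fit : callable(X_bin, y_bin) -> predict(X), with y_bin labels in {-1, +1}
          (any binary learner; AdaBoost in this study).
    """
    X, y, X_test = np.asarray(X), np.asarray(y), np.asarray(X_test)
    classes = np.unique(y)
    votes = np.zeros((len(X_test), len(classes)), dtype=int)
    for a, b in combinations(range(len(classes)), 2):
        mask = (y == classes[a]) | (y == classes[b])
        y_bin = np.where(y[mask] == classes[a], 1, -1)
        predict = fit(X[mask], y_bin)          # train one pairwise classifier
        pred = predict(X_test)                 # +1 -> class a, -1 -> class b
        votes[pred == 1, a] += 1
        votes[pred == -1, b] += 1
    return classes[np.argmax(votes, axis=1)]   # class with most pairwise wins
```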

Fig. 1

The architecture of the ensemble of AdaBoost classifiers

AdaBoost algorithm

Given one or more classification methods, one of the most natural ways of obtaining more accurate classifiers is to use ensembles (Rodríguez and Maudes 2007). Boosting is one of the most successful ensemble methods; among its several variants, AdaBoost is the best known.

Given an input dataset \( S = \{(x_{1}, y_{1}), \ldots, (x_{n}, y_{n})\} \), where \( x_{i} \in \Re^{m} \) is the ith vector in m-D (dimensional) space and y_i ∈ {−1, +1} is the binary label of x_i, AdaBoost calls a weak learning algorithm repeatedly in a series of rounds t = 1, 2, …, T. In round t, a weight D_t(x_i) is associated with each training sample x_i, and the method generates a base classifier h_t taking the weight distribution into account. A real value α_t, the weight associated with h_t, is then determined from the training error of that classifier. The AdaBoost algorithm is illustrated in Fig. 2, and a compact sketch in code follows.
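The following Python sketch restates the discrete AdaBoost loop with decision stumps as the weak learner. It mirrors the algorithm of Freund and Schapire (1997) in spirit but is not the Matlab toolbox implementation used in this study:

```python
import numpy as np

def stump_fit(X, y, D):
    """Best decision stump under sample weights D (weak learner)."""
    best = None
    for j in range(X.shape[1]):                 # candidate feature
        for thr in np.unique(X[:, j]):          # candidate threshold
            for sign in (1, -1):                # candidate polarity
                pred = sign * np.where(X[:, j] <= thr, 1, -1)
                err = D[pred != y].sum()        # weighted training error
                if best is None or err < best[0]:
                    best = (err, j, thr, sign)
    err, j, thr, sign = best
    return err, lambda Z: sign * np.where(Z[:, j] <= thr, 1, -1)

def adaboost(X, y, T=50):
    """Discrete AdaBoost (Freund and Schapire 1997); y in {-1, +1}."""
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    n = len(y)
    D = np.full(n, 1.0 / n)                     # initial uniform weights D_1
    hs, alphas = [], []
    for _ in range(T):
        err, h = stump_fit(X, y, D)
        err = max(err, 1e-10)                   # guard against a perfect stump
        alpha = 0.5 * np.log((1 - err) / err)   # classifier weight alpha_t
        D *= np.exp(-alpha * y * h(X))          # up-weight misclassified samples
        D /= D.sum()                            # renormalize the distribution
        hs.append(h); alphas.append(alpha)
    return lambda Z: np.sign(sum(a * h(Z) for a, h in zip(alphas, hs)))
```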

Fig. 2

AdaBoost algorithm, reproduced from Rodríguez and Maudes (2007) with permission

Performance measurement

In statistical prediction, the following three cross-validation tests are often used to examine the power of a predictor: sub-sampling (e.g., fivefold, sevenfold, etc.), jackknife, and independent dataset tests. A sub-sampling test, such as the fivefold approach often used in the literature, cannot avoid arbitrariness and does not yield a unique outcome even for the same benchmark dataset, as illustrated by Eq. 50 of Chou and Shen (2007a, b). Of the three, the jackknife test is deemed the most rigorous and objective [see Chou and Zhang (1995) for a comprehensive review in this regard], and hence has been used by more and more investigators (see, e.g., Chen et al. 2007; Chou and Shen 2007b; Diao et al. 2007, 2008; Ding et al. 2007; Fang et al. 2008; Gao et al. 2005a, b; Guo et al. 2006; Li and Li 2008; Liu et al. 2007; Niu et al. 2006; Shen and Chou 2007a, b, c; Shi et al. 2007; Sun and Huang 2006; Tan et al. 2007; Wang et al. 2005; Wen et al. 2006; Xiao and Chou 2007; Xiao et al. 2005, 2006; Zhang et al. 2006a, b, c, d, 2008; Zhang and Ding 2007; Zhou et al. 2007a) to examine the power of various prediction methods.

In statistical prediction studies, it is convenient to introduce an accuracy matrix [M_ij] of size c × c, where c is the number of compartments to be predicted. The element M_ij of the accuracy matrix is the number of proteins predicted to be in subnuclear compartment j that are actually in compartment i.

Three per-compartment indices are applied to evaluate the prediction accuracy, i.e., sensitivity (S_n), specificity (S_p), and the Matthews correlation coefficient (CC), together with the overall accuracy (A_c):

$$ S_{n} = \frac{M_{ii}}{\sum_{j=1}^{c} M_{ij}} $$
(3)
$$ S_{p} = \frac{M_{ii}}{\sum_{j=1}^{c} M_{ji}} $$
(4)
$$ \text{CC} = \frac{M_{ii}\left(\sum_{k \ne i}^{c}\sum_{j \ne i}^{c} M_{jk}\right) - \left(\sum_{j \ne i}^{c} M_{ij}\right)\left(\sum_{j \ne i}^{c} M_{ji}\right)}{\left[\left(M_{ii} + \sum_{j \ne i}^{c} M_{ij}\right)\left(M_{ii} + \sum_{j \ne i}^{c} M_{ji}\right)\left(\sum_{k \ne i}^{c}\sum_{j \ne i}^{c} M_{jk} + \sum_{j \ne i}^{c} M_{ji}\right)\left(\sum_{k \ne i}^{c}\sum_{j \ne i}^{c} M_{jk} + \sum_{j \ne i}^{c} M_{ij}\right)\right]^{1/2}} $$
(5)
$$ A_{c} = \left(\sum_{i=1}^{c} M_{ii}\right) \bigg/ \left(\sum_{i=1}^{c}\sum_{j=1}^{c} M_{ij}\right) $$
(6)

S_n represents the prediction accuracy for a compartment, and S_p represents its reliability. The CC is a single parameter characterizing the extent of matching between the observed and predicted subnuclear compartments. These indices map directly onto the accuracy matrix, as the sketch below shows.
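As a worked illustration, the following sketch evaluates Eqs. (3)-(6) directly from an accuracy matrix M; the toy matrix at the bottom is invented for demonstration:

```python
import numpy as np

def per_class_metrics(M, i):
    """S_n, S_p and CC of Eqs. (3)-(5) for compartment i of accuracy matrix M."""
    M = np.asarray(M, dtype=float)
    tp = M[i, i]
    fn = M[i, :].sum() - tp          # true class i, predicted elsewhere
    fp = M[:, i].sum() - tp          # predicted i, actually elsewhere
    tn = M.sum() - tp - fn - fp      # the double sum over j, k != i
    sn = tp / (tp + fn)                          # Eq. (3)
    sp = tp / (tp + fp)                          # Eq. (4)
    cc = (tp * tn - fp * fn) / np.sqrt(          # Eq. (5)
        (tp + fn) * (tp + fp) * (tn + fp) * (tn + fn))
    return sn, sp, cc

def overall_accuracy(M):
    """A_c of Eq. (6): trace over grand total."""
    M = np.asarray(M, dtype=float)
    return np.trace(M) / M.sum()

# Toy 3-compartment accuracy matrix (rows: true, columns: predicted)
M = [[50, 3, 2], [4, 40, 6], [1, 5, 30]]
print(per_class_metrics(M, 0), overall_accuracy(M))
```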

Results and discussion

According to the concept of Chou's PseAA composition (Chou 2001), the weight factor and the dimension of the additional features are essential parameters. In this study, the genetic algorithm (GA) toolbox in Matlab is used to optimize the weight factor w and λ in Eq. (2). The parameter ranges are: weight factor w ∈ [0, 1] and ApEn dimension λ ∈ [1, 12]. The overall jackknife accuracy obtained with a given classifier is used as the GA fitness function, where the given classifier is each of the three AdaBoost classifiers in turn. The crossover rate is P_c = 0.9 and the mutation rate is P_m = 0.2. Once the weight factor (w) and the ApEn dimension (λ) are determined for each of the three AdaBoost classifiers, the input PseAA composition of each base classifier within the ensemble is also determined. The optimized parameters of the three base classifiers are listed in Table 1.
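Since the Matlab GA toolbox was used in this study, the following Python fragment is only a toy stand-in illustrating the search space (w ∈ [0, 1], λ ∈ [1, 12]) and the rates P_c = 0.9 and P_m = 0.2. The selection and crossover operators here are simplified assumptions, and the dummy fitness merely stands in for the jackknife accuracy of the chosen classifier:

```python
import random

def ga_optimize(fitness, pop_size=20, generations=30, pc=0.9, pm=0.2):
    """Toy GA over (w, lambda); fitness(w, lam) should return the jackknife
    overall accuracy of the chosen AdaBoost classifier."""
    pop = [(random.uniform(0, 1), random.randint(1, 12)) for _ in range(pop_size)]
    for _ in range(generations):
        scored = sorted(pop, key=lambda ind: fitness(*ind), reverse=True)
        elite = scored[:pop_size // 2]              # truncation selection
        children = []
        while len(children) < pop_size - len(elite):
            p1, p2 = random.sample(elite, 2)
            # Crossover with probability pc: mix genes of the two parents
            w, lam = (p1[0], p2[1]) if random.random() < pc else p1
            if random.random() < pm:                # mutate w, clamped to [0, 1]
                w = min(1.0, max(0.0, w + random.gauss(0, 0.1)))
            if random.random() < pm:                # mutate lambda in [1, 12]
                lam = random.randint(1, 12)
            children.append((w, lam))
        pop = elite + children
    return max(pop, key=lambda ind: fitness(*ind))

# Usage with a dummy fitness standing in for the jackknife accuracy:
best_w, best_lam = ga_optimize(lambda w, lam: -(w - 0.3) ** 2 - (lam - 8) ** 2)
print(best_w, best_lam)
```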

Table 1 Optimization results of input data of three basic classifiers in AdaBoost with the dataset SNL9

The three AdaBoost classifiers are trained with their respective optimized parameters, the ApEn dimension (λ) and the weight factor (w) in Eq. (2). The outputs of the three AdaBoost classifiers are fused by weighted voting, where the voting weight of each base classifier is its success rate; this fusion scheme has been used in several protein function prediction studies with ensemble classifiers (Shen and Chou 2006, 2007e), and a sketch is given below. The final results are listed in Table 2, where, to facilitate comparison, the results of other methods on the same dataset are also listed. The overall accuracy (A_c) is 83.2%, distinctly higher than that of the SVM (Cai et al. 2002), OET-KNN (Shen and Chou 2005), and PSSM (Mundra et al. 2007) methods. The accuracy for each subnuclear compartment is higher than or equal to that of the PSSM method (Mundra et al. 2007).
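A sketch of the weighted-voting fusion follows; the class names and success rates in the example are hypothetical:

```python
import numpy as np

def fuse_by_weighted_voting(predictions, success_rates, classes):
    """Fuse base-classifier outputs by weighted voting, each classifier's
    vote counting in proportion to its own success rate.

    predictions   : list of label sequences, one per base classifier
    success_rates : success rate of each base classifier (its voting weight)
    """
    votes = np.zeros((len(predictions[0]), len(classes)))
    for preds, rate in zip(predictions, success_rates):
        for c, label in enumerate(classes):
            votes[np.asarray(preds) == label, c] += rate
    return np.array(classes)[np.argmax(votes, axis=1)]

# Toy example: three hypothetical base classifiers on four test proteins
classes = ["nucleolus", "chromatin", "PML body"]
preds = [["nucleolus", "chromatin", "nucleolus", "PML body"],
         ["nucleolus", "nucleolus", "nucleolus", "PML body"],
         ["chromatin", "chromatin", "nucleolus", "nucleolus"]]
print(fuse_by_weighted_voting(preds, [0.82, 0.79, 0.75], classes))
```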

Table 2 Results of Jackknife test by different algorithms on SNL9

To further validate the performance of the approach, the dataset SNL6 constructed by Lei and Dai (2005) is also used; it is composed of 504 protein sequences classified into 6 subnuclear compartments. Following the same optimization process as for SNL9, the optimized parameters of the three base classifiers trained on SNL6 are listed in Table 3. Once λ and w are determined, the PseAA compositions used as input data are also determined. The results of the jackknife cross-validation test are given in Table 4; for comparison, the results of other prediction methods on the same dataset are also listed there. The overall accuracy (A_c) of the proposed approach is 73.2%, higher than that of the Lei-SVM (Lei and Dai 2005) and ESVM (Huang et al. 2007) methods. Comparing the per-compartment accuracies (S_n) of the proposed approach with those of the other methods, the accuracies for PML body, chromatin, nuclear speckle, and nuclear lamina, the four compartments with fewer proteins, are higher than those of the other methods. The results of the proposed approach are thus more balanced than those of the Lei-SVM (Lei and Dai 2005) and ESVM (Huang et al. 2007) methods.

Table 3 Optimization results of input data of three basic classifiers in AdaBoost with the dataset SNL6
Table 4 Results of Jackknife test by different algorithms on the dataset of SNL6

The results of the jackknife cross-validation tests on the two datasets indicate that the proposed approach is effective and practical, and that the PseAA composition based on ApEn indeed reflects core features of proteins in different subnuclear compartments. Comparing the results of the individual base classifiers with those of the ensemble, the ensemble performs better than any single base classifier. The ensemble of three AdaBoost classifiers might thus become a useful tool for the prediction of protein subnuclear localization.

Conclusions

A novel approach for protein subnuclear localization prediction is proposed. A protein sequence sample is represented by a PseAA composition based on ApEn, and an ensemble classifier is used as the prediction engine. The ensemble combines three AdaBoost classifiers whose base classification algorithms are decision stumps, FKNN, and radial basis function SVMs (RSVM), respectively. The input to each AdaBoost classifier in the ensemble is a PseAA composition whose dimension (λ) and weight factor (w) are optimized by a GA. Two datasets often used in this area are employed to validate the performance of the novel approach. The promising results obtained in jackknife cross-validation tests indicate that the proposed approach is effective and practical, and might become a useful tool for predicting protein subnuclear localization.