Introduction

Mitochondria are popularly known as the powerhouse of the cell as well as a central unit of eukaryotic cells. Mitochondria perform key roles in complex biochemical processes such as programmed cell death (Gottlieb 2000) and ionic homeostasis (Jassem et al. 2002). In addition, mitochondrial dysfunctions have been shown to be associated with apoptosis, aging, and a number of pathological conditions. We are particularly interested in predicting mitochondrial proteins since they are associated with over 100 known human diseases such as Alzheimer’s disease (Hutchin and Cortopassi 1995), Type II diabetes (Gerbitz et al. 1996), and Parkinson’s disease (Wooten et al. 1997).

The prediction of mitochondrial proteins has largely been performed using machine learning and statistical approaches. Some of the interesting approaches that use these techniques in conjunction with sequence information as well as biological information are TargetP (Emanuelsson et al. 2000), SignalP 3.0 (Bendtsen et al. 2004), WoLF PSORT (Horton et al. 2006), TargetLoc (Höglund et al. 2006), MitoProt II (Claros and Vincens 1996), MITOPRED (Guda et al. 2004), MitPred (Kumar et al. 2006) and PFMpred (Verma et al. 2009). In MitPred, support vector machine (SVM)-based methods were first developed using the amino acid and dipeptide (Dp) composition of proteins (Kumar et al. 2006), and then the split amino acid composition (SAAC) was used (Kumar et al. 2006). The prediction accuracy was further improved by combining BLAST search with the SVM method. Finally, a hybrid approach that combines hidden Markov model profiles and SVM was used for the prediction of mitochondrial proteins. In the case of PFMpred, on the other hand, a hybrid model combining the PSSM profile and SAAC was developed for mitochondria prediction and achieved a high accuracy of 92% (Kumar et al. 2006). Similarly, several approaches that employ machine learning or statistical methods using protein sequence information only have also been reported; typically, avoiding the need for biological information is paid for by a decrease in accuracy. A number of different computational approaches based on amino acid composition (AAC) or Dp composition have been developed, including the covariant discriminant algorithm based (Chou and Elrod 1999), discrete wavelet transform based (Jiang et al. 2006), SVM based (Hua and Sun 2001; Chou and Cai 2002; Kumar et al. 2006; Jiang et al. 2006; Tan et al. 2006) and fuzzy kNN based (Huang and Li 2004) approaches. Tan et al. (2006) have reported the highest accuracy of 85% using pure machine learning approaches by applying genetic algorithm-partial least square (GA-PLS) on Dp features in conjunction with SVM.

Recently, Hu and Fan (2009) have proposed a physicochemical encoding method that maps protein sequences into a feature vector composed of the locations and lengths of the amino acid groups (AAGs) with similar physicochemical properties. Their method yields an improvement of about 20% over the method based on simple AAC. An extended version of the pseudo-amino acid composition has also been employed for submitochondria localization, and a good prediction performance has been achieved (Du and Li 2006). Similarly, Nanni and Lumini (2008a) have achieved high prediction performance using an interesting approach based on genetic programming (GP) for creating Chou’s pseudo amino acid-based features for sub-mitochondria localization. To improve the mitochondria prediction ability, this study employs GP for generating an effective decision space from the decision spaces of the individual classifiers. It has been observed that the use of ensemble classifiers for predicting protein subcellular localization is increasing. Some protein sequences have multiple subcellular localizations, and interesting ensemble classifiers have been developed for them (Shen and Chou 2007; Chou and Shen 2007). Rotation forest, which is based on investigating the diversity-accuracy landscape of ensemble classifiers, has been proposed (Rodríguez et al. 2006). Similarly, RotBoost, a relatively new ensemble technique that combines rotation forest and Adaptive Boosting (AdaBoost), has yielded lower prediction error than either rotation forest or AdaBoost alone (Zhang and Zhang 2008). Recently, Nanni et al. (2010) have proposed an effective ensemble approach for protein subcellular localization using a high performance set of PseAAC and sequence-based descriptors.

To approximately incorporate the sequence-order effects, the idea of the pseudo amino acid composition (PseAAC) has been proposed (Chou 2001, 2005a, b). PseAAC has since been used in conjunction with various machine learning approaches to enhance the prediction quality (Chou and Cai 2006; Chou and Shen 2006a; Guo et al. 2006; Xiao et al. 2005, 2006a). However, the percent composition of the whole sequence does not give proper weight to the compositional bias, which is known to be present in mitochondrial protein termini. Therefore, the concept of SAAC was introduced, where the protein sequence is divided into three parts: N terminus, C terminus and a region between these two termini (Chou and Shen 2006a, b). SAAC has thus provided better accuracy for mitochondria prediction, as it gives greater weight to proteins that have a signal at either the N or the C terminus (Kumar et al. 2006).

In the present study, our aim is to develop a novel high performance prediction system that can employ both selection of an effective feature extraction strategy and construction of an ensemble approach for mitochondria classification using sequence information only. For this purpose, we have used two recent datasets and analyzed different feature extraction strategies such as AAC, Dp, PseAAC, and SAAC. A number of different classifiers are then trained on these extracted features, i.e. SVM, k-nearest neighbor (kNN), random forest (RF), multilayer perceptron (MLP), AdaBoost, and bagging. A GP-based ensemble classifier is subsequently developed for mitochondrial prediction, which is able to develop an efficient decision space from the decision spaces of the individual classifiers.

Materials and methods

Datasets

The first dataset used in this paper is the same as that of Jiang et al. (2006), which consists of 499 mitochondrial proteins called positive examples. We denote this dataset by Mitochondria dataset (Mito_D). This dataset was obtained from Swiss-Prot release 46.6 using the keyword mitochondrial. A total of 2,833 entries were obtained. All sequences annotated with ambiguous words, such as POTENTIAL, BY SIMILARITY, or PROBABLE, as well as fragments, were then excluded. Moreover, 681 proteins (so called negative examples) belonging to locations other than the mitochondrial site were selected by taking one out of every 250 entries in Swiss-Prot. Mitochondrial protein sequences or fragments were then deleted from the negative examples.

To validate the performance of our proposed method, we have used another dataset taken from Verma et al. (2009). We denote this dataset by Malaria Parasite Mitochondria dataset (MP_Mito_D). This dataset consists of a total of 175 instances, out of which 40 are mitochondrial proteins, called positive examples, and 135 belong to other locations, i.e. cytoplasm, extracellular, and apicoplast, and are called negative examples. The homologies between sequences are checked. To remove the homologous sequences from the benchmark dataset, a 25% cut-off threshold is imposed, and only those protein sequences are considered that have less than 25% sequence identity to any other protein sequence in the same subset (Chou and Shen 2006b, 2007, 2008).

Performance measures

We have assessed the performance of our method using the following performance measures.

(a) Sensitivity or coverage of positive examples. It is the percentage of mitochondrial proteins that are correctly predicted as mitochondrial.

$$ {\text{Sensitivity}} = {\frac{\text{TP}}{\text{TP} + \text{FN}}} \times 100 $$
(1)

where TP and TN are the numbers of correctly predicted mitochondrial and non-mitochondrial proteins, respectively, whereas FP and FN are the numbers of proteins wrongly predicted as mitochondrial and as non-mitochondrial, respectively.

(b) Specificity or coverage of negative examples. It is the percentage of non-mitochondrial proteins that are correctly predicted as non-mitochondrial.

$$ {\text{Specificity}} = {\frac{\text{TN}}{\text{FP} + \text{TN}}} \times 100 $$
(2)

(c) Accuracy. It is the percentage of correctly predicted proteins (mitochondrial and non-mitochondrial proteins).

$$ {\text{Accuracy}} = {\frac{\text{TP} + \text{TN}}{\text{TP} + \text{FP} + \text{FN} + \text{TN}}} \times 100 $$
(3)

(d) Matthews correlation coefficient (MCC). It is considered one of the most robust performance parameters. An MCC equal to one is regarded as a perfect prediction, while zero corresponds to a completely random prediction.

$$ {\text{MCC}} = {\frac{\text{TP} \times \text{TN} - \text{FP} \times \text{FN}}{{\sqrt {[\text{TP} + \text{FP}][\text{TP} + \text{FN}][\text{TN} + \text{FP}][\text{TN} + \text{FN}]} }}} $$
(4)

(e) Q-statistic. To measure the diversity of the classifiers, the Q-statistic is considered a promising performance parameter (Nanni and Lumini 2008b). The Q-statistic of any two base classifiers i and j is defined as:

$$ Q_{i,j} = {\frac{ad - bc}{ad + bc}} $$
(5)

where a and d are the numbers of samples on which both classifiers are correct and both are incorrect, respectively; b is the number of samples predicted correctly by the first classifier but incorrectly by the second, and c is the number of samples predicted correctly by the second classifier but incorrectly by the first. The value of Q varies between −1 and 1.
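As an illustration, Eqs. 1 to 5 can be computed directly from confusion-matrix counts and per-sample correctness flags. The following is a minimal sketch; the function names are ours, and returning 0 when a denominator vanishes is an assumption that the text does not specify.

```python
import math


def confusion_metrics(tp, fp, tn, fn):
    """Return sensitivity, specificity, accuracy (in %) and MCC (Eqs. 1-4)."""
    sensitivity = 100.0 * tp / (tp + fn)
    specificity = 100.0 * tn / (fp + tn)
    accuracy = 100.0 * (tp + tn) / (tp + fp + fn + tn)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Assumption: report MCC as 0 when the denominator is zero.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return sensitivity, specificity, accuracy, mcc


def q_statistic(correct_i, correct_j):
    """Q-statistic (Eq. 5) from two equal-length lists of booleans,
    where True means the classifier predicted that sample correctly."""
    pairs = list(zip(correct_i, correct_j))
    a = sum(1 for ci, cj in pairs if ci and cj)            # both correct
    d = sum(1 for ci, cj in pairs if not ci and not cj)    # both incorrect
    b = sum(1 for ci, cj in pairs if ci and not cj)        # only first correct
    c = sum(1 for ci, cj in pairs if not ci and cj)        # only second correct
    denom = a * d + b * c
    # Assumption: report Q as 0 when the denominator is zero.
    return (a * d - b * c) / denom if denom else 0.0
```

Two classifiers that err on exactly the same samples give Q = 1, whereas classifiers whose errors are statistically independent give Q close to 0.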

Feature extraction strategies

Amino acid composition and dipeptide composition

The aim of calculating the composition of proteins is to transform the variable length of a protein sequence into a fixed-length feature vector (Hayat and Khan 2010). This is a crucial step when classifying proteins using machine-learning techniques, because they require fixed-length patterns. The information of a protein can be encapsulated in a vector of 20 dimensions using the AAC of the protein. In addition to AAC, Dp has also been used for classification; it gives a fixed pattern length of 400. The advantage of Dp over AAC is that it encapsulates information about the fraction of amino acids as well as their local order. The AAC as well as the Dp-based features have been generated as described by Garg et al. (2005). Both compositions have been used as features to classify mitochondrial and non-mitochondrial proteins.
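A minimal sketch of the two composition vectors is given below. It is illustrative rather than the exact procedure of Garg et al. (2005): expressing the values as percentages, the alphabetical residue order, and discarding non-standard residues before counting are all assumptions.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def aac(seq):
    """20-dimensional amino acid composition, in percent."""
    # Assumption: residues outside the 20 standard letters are dropped.
    residues = [r for r in seq.upper() if r in AMINO_ACIDS]
    n = len(residues)
    return [100.0 * residues.count(a) / n if n else 0.0 for a in AMINO_ACIDS]


def dipeptide(seq):
    """400-dimensional dipeptide composition, in percent."""
    clean = "".join(r for r in seq.upper() if r in AMINO_ACIDS)
    counts = {}
    for i in range(len(clean) - 1):
        pair = clean[i:i + 2]          # overlapping adjacent residue pairs
        counts[pair] = counts.get(pair, 0) + 1
    total = max(len(clean) - 1, 0)
    return [100.0 * counts.get(a + b, 0) / total if total else 0.0
            for a, b in product(AMINO_ACIDS, repeat=2)]
```

Whatever the sequence length, `aac` returns 20 values and `dipeptide` returns 400, which is what allows variable-length proteins to be fed to the classifiers.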

Split amino acid composition

In the SAAC method (Chou and Shen 2006a, b), the protein sequence is divided into parts and the composition of each part is calculated separately. Recently, Verma et al. (2009) have developed a SAAC-based method to predict mitochondrial proteins of the malaria parasite and have achieved reasonable accuracy. In our SAAC model, each protein is divided into three parts: (i) 25 amino acids of the N-terminus, (ii) 25 amino acids of the C-terminus, and (iii) the region between these two termini. Some sequences in the datasets are shorter than 50 amino acids. Therefore, to accommodate these sequences, we have divided them into three parts with 10 amino acids each for the N and C termini and the region between these two termini.
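The splitting rule described above can be sketched as follows. The order in which the three 20-dimensional blocks are concatenated, the use of percentages, and the zero vector for an empty middle region are assumptions; sequences shorter than 20 residues, for which the two termini would overlap, are not treated specially here.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def composition(segment):
    """Percent composition of one segment (all zeros if the segment is empty)."""
    n = len(segment)
    return [100.0 * segment.count(a) / n if n else 0.0 for a in AMINO_ACIDS]


def saac(seq, terminal_len=25, short_terminal_len=10):
    """60-dimensional split amino acid composition.

    Sequences shorter than 2 * terminal_len (i.e. shorter than 50 residues
    by default) fall back to the shorter terminal length, as in the text.
    """
    seq = seq.upper()
    t = terminal_len if len(seq) >= 2 * terminal_len else short_terminal_len
    n_term = seq[:t]
    c_term = seq[-t:]
    middle = seq[t:-t]  # empty when the sequence is exactly 2 * t long or shorter
    # Assumption: blocks are concatenated as N-terminus, middle, C-terminus.
    return composition(n_term) + composition(middle) + composition(c_term)
```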

Pseudo amino acid composition

The amino acid composition model has been widely used in conjunction with quite a few statistical methods for predicting protein attributes. However, in the case of AAC, all the sequence-order information is lost. To compensate for this loss, the concept of PseAAC has been proposed, which incorporates the sequence-order effects (Chou 2000, 2001). Simple AAC contains the composition of the 20 amino acids, while the PseAA composition contains a set of more than 20 discrete factors. The first 20 of these represent the components of the basic AAC, and the additional factors carry some sequence-order information. For example

$$ PseAA = P_{1} , \, P_{2}, \ldots ,P_{20}, \, P_{20 + 1} ,\ldots, P_{\lambda } $$
(6)

where λ is the number of tiers used in PseAA. The optimal number of tiers and the selection of the best physicochemical properties can influence the classification performance. In our case, we have selected λ = 21 and analyzed the performance using different combinations of physicochemical properties. We have considered λ = 21 because it yields the best results. The first 20 elements, i.e. P_1, P_2, …, P_20, represent the occurrence frequencies of the 20 amino acids, whereas P_21 is the first correlation order factor, P_22 is the second correlation order factor, and so on. These elements are determined based on the physicochemical properties. In this study, we have used three physicochemical properties, i.e. hydrophobicity, electronic, and bulk properties. There are various models for representing these properties. We have used the FH, EIIP, and CPV models, respectively.
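For illustration, the sketch below follows the commonly used formulation of Chou (2001), in which the k-th correlation factor averages the squared differences of standardized property values between residues k positions apart. This formulation, the weight factor of 0.05, and the interpretation of `lam` as the number of correlation tiers (giving 20 + `lam` components) are assumptions rather than details confirmed by the text, and the FH, EIIP, and CPV values are not reproduced here: `scales` is a caller-supplied list of property tables.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def standardize(scale):
    """Rescale a property table to zero mean and unit standard deviation."""
    values = [scale[a] for a in AMINO_ACIDS]
    mean = sum(values) / len(values)
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return {a: (scale[a] - mean) / sd for a in AMINO_ACIDS}


def pse_aac(seq, scales, lam, weight=0.05):
    """Return 20 composition terms followed by `lam` sequence-order terms."""
    residues = [r for r in seq.upper() if r in AMINO_ACIDS]
    length = len(residues)
    if lam >= length:
        raise ValueError("lam must be smaller than the sequence length")
    props = [standardize(s) for s in scales]
    freqs = [residues.count(a) / length for a in AMINO_ACIDS]
    thetas = []
    for k in range(1, lam + 1):
        total = 0.0
        for i in range(length - k):
            # Mean squared property difference between residues i and i + k.
            total += sum((p[residues[i]] - p[residues[i + k]]) ** 2
                         for p in props) / len(props)
        thetas.append(total / (length - k))
    denom = sum(freqs) + weight * sum(thetas)
    return [f / denom for f in freqs] + [weight * t / denom for t in thetas]
```

With this normalization the components sum to one, so the weight factor controls how much of the vector is devoted to sequence-order information relative to plain composition.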

Classification approaches

Support vector machine

Support vector machine is a machine learning approach based on statistical learning theory (Vapnik 1998). A brief and clear description of how to use SVM for classification is given by Chou and Cai (2002) and Cai et al. (2003). It has also been reported that SVM in conjunction with feature selection provides quite interesting results (Huang et al. 2008). In this study, we have implemented SVM using the LIBSVM 2.88-1 package, which allows us to choose a number of parameters and kernels (e.g. linear, polynomial, radial basis function, and sigmoid). In this particular work, the mitochondrial proteins were defined as one class (labeled as +1) and the non-mitochondrial proteins were defined as another class (labeled as −1). SVM was implemented in MATLAB 7.7 and a third degree polynomial was chosen as the kernel function. A quadratic programming method was employed to solve the optimization problem. All the parameters were kept constant except C (the regularization parameter) and s (the kernel width parameter). In the training process, C and s were optimized by parameter optimization (Guo et al. 2006; Khan et al. 2008a).

k-nearest neighbor

The k-nearest neighbor algorithm is a method that classifies objects based on the k nearest training examples in the feature space. The Euclidean distance of the test sample to all other samples in the feature space is calculated, and the k samples with the minimum Euclidean distance are selected. The value of k is usually taken as odd. The Euclidean distance is calculated using Eq. 7.

$$ S\left( {X,X_{i} } \right) = 1 - {\frac{{X \cdot X_{i} }}{{\left\| X \right\|\left\| {X_{i} } \right\|}}} \, \left( {i = 1,2, \ldots,N} \right) $$
(7)

The sample under question X is then assigned to the category, which is found in majority among the k samples. The kNN is considered as a simple classifier, based on instance-based learning and has been commonly employed in protein prediction problems (Khan et al. 2008a).
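A minimal sketch of this decision rule is given below. It implements Eq. 7 as printed, which is one minus the cosine of the angle between the two feature vectors; treating a zero-length vector as maximally distant and the way voting ties are broken are assumptions.

```python
import math
from collections import Counter


def dissimilarity(x, xi):
    """S(X, X_i) = 1 - (X . X_i) / (||X|| ||X_i||), as in Eq. 7."""
    dot = sum(a * b for a, b in zip(x, xi))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_xi = math.sqrt(sum(b * b for b in xi))
    if norm_x == 0 or norm_xi == 0:
        return 1.0  # assumption: no direction, so treat as maximally distant
    return 1.0 - dot / (norm_x * norm_xi)


def knn_predict(x, train_x, train_y, k=3):
    """Assign x to the majority class among its k nearest training samples."""
    ranked = sorted(range(len(train_x)),
                    key=lambda i: dissimilarity(x, train_x[i]))
    votes = Counter(train_y[i] for i in ranked[:k])
    return votes.most_common(1)[0][0]
```

Choosing an odd k, as noted above, avoids tied votes in the two-class setting.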

Random forest

Random forest is designed to produce accurate predictions that do not overfit the data (Breiman 2001). RF employs the statistical technique bootstrap in which samples are drawn to construct multiple trees. Each tree is grown using some form of randomization. The leaf nodes of each tree are labeled by estimates of the posterior distribution. Each internal node contains a test that best divides the space of data to be classified. A protein sample is classified by sending it down every tree and aggregating the reached leaf distributions. Out-of-bag samples can be employed to compute an unbiased error rate and variable importance, eliminating the need for a test set or cross-validation. Because a large number of trees are grown, there is limited generalization error (that is, the true error of the population as opposed to the training error only), which means that overfitting is unlikely.

By growing each tree to maximum size without pruning and selecting only the best split among a random subset at each node, RF tries to maintain some prediction strength while inducing diversity among trees (Breiman 2001). Random predictor selection decreases the correlation among un-pruned trees and keeps the bias low. By taking an ensemble of un-pruned trees, variance is also reduced. Another advantage of RF is that the predicted output depends only on one user-selected parameter which is the number of predictors to be chosen randomly at each node. In this work, the parameters of RF are set with number-of-trees equal to 15 and iterations equal to 25.

AdaBoost

AdaBoost is one of the most popular and successful implementations of boosting. Its name is an acronym created from its description, i.e. Adaptive Boosting. We have used AdaBoost.M1 (Freund and Schapire 1996) as provided in Weka 3.6.2, where REPTree has been used as the weak learner with the number of iterations equal to 25 and the rest of the parameters set to their default values.

Bagging

Bagging is a simple ensemble method that generates multiple versions of a predictor and employs these to develop an aggregated predictor. The aggregation averages over the versions when predicting a numerical outcome and employs a plurality vote when predicting a class (Breiman 1996). Weka 3.6.2 is used for simulation of the bagging approach with REPTree as the weak learner. The number of iterations was set to 25 and the remaining parameters were set to their default values.
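The bootstrap-and-vote idea can be sketched in a few lines. This is not the Weka implementation: a single-threshold decision stump on one feature stands in for REPTree as a hypothetical weak learner, and the function names and the fixed random seed are ours.

```python
import random
from collections import Counter


def fit_stump(xs, ys):
    """Hypothetical weak learner: the best single threshold on a 1-D feature."""
    best = None
    for t in sorted(set(xs)):
        for sign in (1, -1):
            errors = sum((sign if x >= t else -sign) != y
                         for x, y in zip(xs, ys))
            if best is None or errors < best[0]:
                best = (errors, t, sign)
    _, threshold, sign = best
    return lambda x: sign if x >= threshold else -sign


def bagging_fit(xs, ys, n_iter=25, seed=0):
    """Train one weak learner per bootstrap replicate of the training set."""
    rng = random.Random(seed)
    n = len(xs)
    models = []
    for _ in range(n_iter):
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        models.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return models


def bagging_predict(models, x):
    """Aggregate the class predictions by plurality vote."""
    return Counter(m(x) for m in models).most_common(1)[0][0]
```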

Multilayer perceptron

The MLP consists of a system of simple interconnected neurons, which yields a model representing a nonlinear mapping between an input and an output vector (Khan et al. 2008b). The nodes are connected by weights, and the output signal of each node is a function of the sum of its inputs, modified by a simple nonlinear transfer, or activation, function. The superposition of many simple nonlinear transfer functions enables the MLP to approximate non-linear functions. An MLP has one or more hidden layers between the input and output layers. In this work, we have used the Matlab-based neural networks toolbox for the MLP implementation. An MLP having one input, two hidden, and one output layer has been used. The training algorithm was set to the trainl-based backpropagation approach. The rest of the parameters were kept at their default values.

Predicting mitochondria proteins

Developing individual mitochondria classifier

Several mitochondria prediction methods employing individual classifiers have been proposed in the literature so far. In our proposed method, we have trained different individual classifiers on several feature extraction strategies. First, mitochondrial protein sequences are converted into features using the feature extraction strategies. These features are then provided to the different individual classifiers for training, and the prediction performance of each individual classifier is determined as shown in Fig. 1. We have trained six different individual classifiers, namely SVM, kNN, MLP, RF, AdaBoost, and bagging, on four different feature extraction strategies: AAC, Dp, PseAAC and SAAC.

Fig. 1 Proposed mitochondria prediction method for individual classifier

Developing GP based ensemble classifier (Mito-GSAAC)

GP, introduced by Koza (1992), is an evolutionary algorithm designed for automatically constructing and evolving computer programs. It differs from the genetic algorithm in its ability to evolve variable length solutions. GP has emerged as a powerful tool not only for evolving a classifier but also for the optimum combination of classifiers (Khan et al. 2005). GP works by evolving a randomly created initial population using a fitness measure. It selects the fittest individuals to take part in the evolution and thus efficiently searches for the desired solution.

Several interesting combination strategies have recently been employed in the classification of protein sequences. Out of these combination strategies, majority voting-based strategies have been widely used (Khan et al. 2010). GP, on the other hand, has shown promising results when used for the combination of binary classifiers due to its inherent learning capability and its tree structure (Khan et al. 2005). Therefore, this work capitalizes on the learning capabilities of GP, whereby it is used as a stacking-based ensemble approach and is thus different from the majority voting technique. We have employed GP to develop a complex but efficient decision space when provided with the decision spaces of the individual trained classifiers, as shown in Fig. 2.

Fig. 2 GP-based ensemble classifier

Results and discussion

Mitochondria prediction using different amino acid-based features

This work aims at predicting eukaryotic mitochondrial proteins using SAAC and an ensemble classifier without using any biological information. However, in order to create the ensemble classifier, we have first tested a number of classifiers using different feature extraction strategies, namely AAC, Dp, PseAAC, and SAAC. The performance of the various classification algorithms is evaluated through various performance measures such as accuracy, sensitivity, specificity, MCC and Q-statistic. We have found that all the classifiers, i.e. SVM, kNN, MLP, RF, AdaBoost and bagging, have yielded better performance on SAAC than on the other feature extraction strategies (Table 1). This means that SAAC offers greater discrimination power in comparison with the rest of the feature extraction strategies, largely due to the composition difference at the N and C termini between mitochondrial and non-mitochondrial proteins. We have achieved a classification accuracy of 90.34% on SAAC using RF as a classifier with the number of iterations equal to 25 and the number of trees equal to 15. Finally, as detailed in Sect. 4.3, we have developed a GP-based ensemble classifier using the predictions of the individual classifiers SVM, kNN, RF and AdaBoost on SAAC features.

Table 1 Jackknife results on Mito_D dataset using different feature extraction strategies and individual classifiers

Using the first feature extraction strategy AAC, the classification accuracy of AdaBoost is better compared to the other classification algorithms. In case of Dp, SVM obtained the highest accuracy among the various classifiers. Similarly using PseAA, the classification accuracy of SVM is the highest among the various classifiers. In case of SAAC, the accuracies of the individual classifiers are: AdaBoost 88.64%, bagging 88.64%, kNN 84.58%, MLP 85.54%, RF 90.34%, and SVM 88.05%. Thus, the classification performance of the various classification algorithms that we used has improved in case of SAAC.

We have also used a hybrid feature-extraction strategy to analyze the performance of the various classification algorithms, whereby the different features are simply concatenated. With this hybrid feature-extraction strategy, the accuracy of AdaBoost, bagging, kNN, and MLP decreases, while a slight improvement in the accuracy of RF and SVM has been observed. Therefore, only a slight improvement in accuracy is obtained for some of the classifiers, at the cost of a much higher dimensionality of the feature vector space.

Is SAAC really better for mitochondria classification?

It is reported in the literature that mitochondrial proteins differ greatly in composition from non-mitochondrial proteins, mostly at the N- and C-termini (Kumar et al. 2006). The N-terminus, or amine terminus, is the initial stretch of amino acids in the protein sequence. Similarly, the C-terminus, or carboxyl terminus, is the final stretch of amino acids in the sequence. Therefore, splitting the protein sequence into three parts, i.e. the N-terminus, the C-terminus and the portion of the sequence between these two termini, and then calculating the AAC for all three parts provides a better discrimination of mitochondrial versus non-mitochondrial proteins, as can be observed from Table 1. Hence, the composition of the overall protein sequence masks the strong signals present in some parts of the sequence, which can better discriminate mitochondrial from non-mitochondrial proteins.

GP ensemble and SAAC-based mitochondria classification

We observed that for mitochondria prediction, SAAC performs better than the various feature extraction strategies that we have used. Therefore, for better exploitation of the SAAC features for mitochondria prediction, we have developed an ensemble classification approach using GP. Previously, GP has been used for the optimum combination of classifiers for gender classification problem and it has been observed that it provides better prediction performance (Khan et al. 2005; Khan and Mirza 2007). Thus, by employing GP-based ensemble approach, a high accuracy of 92.62% is obtained for the mitochondria prediction. We denote our proposed GP ensemble as Mito-GSAAC.

We have used conventional functions in the GP tree: a set of four binary arithmetic operators (+, −, * and a protected division), if less than (IFLT), if greater than (IFGT), and absolute. We have combined the predictions of kNN, SVM, RF, and AdaBoost using GP to develop an optimal decision space. The dataset is divided into two portions, i.e., training and testing. Two-thirds of the training data is given to GP for training, and the evolved solution is then validated on the remaining one-third of the data (Table 2). First, an initial population of 100 polynomials is generated. The fitness of each new individual is calculated using the area under the receiver operating characteristic curve (AUROC).
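To make the role of this function set concrete, the sketch below evaluates one candidate GP individual, written as a nested tuple, on the outputs of the four base classifiers. The arity and semantics of IFLT and IFGT (four arguments: if a < b then c else d), the protected division returning 1 for a zero denominator (Koza's convention), the decision threshold of zero, and the example tree itself are all assumptions made for illustration; they are not taken from Table 2.

```python
def protected_div(a, b):
    """Division that returns 1.0 instead of failing on a zero denominator."""
    return a / b if b != 0 else 1.0


FUNCTIONS = {
    "add": lambda a, b: a + b,
    "sub": lambda a, b: a - b,
    "mul": lambda a, b: a * b,
    "div": protected_div,
    "abs": lambda a: abs(a),
    "iflt": lambda a, b, c, d: c if a < b else d,  # assumed 4-argument form
    "ifgt": lambda a, b, c, d: c if a > b else d,  # assumed 4-argument form
}


def evaluate(tree, inputs):
    """Recursively evaluate a GP tree on a dict of base-classifier outputs."""
    if isinstance(tree, tuple):
        op, *args = tree
        return FUNCTIONS[op](*(evaluate(arg, inputs) for arg in args))
    if isinstance(tree, str):
        return inputs[tree]      # terminal: the named classifier's output
    return float(tree)           # terminal: a numeric constant


def ensemble_predict(tree, inputs, threshold=0.0):
    """Map the tree output to a class label (+1 mitochondrial, -1 otherwise)."""
    return 1 if evaluate(tree, inputs) > threshold else -1


# Hypothetical individual combining the four base classifiers.
example_tree = ("add",
                ("mul", "svm", "rf"),
                ("iflt", "knn", 0.0, ("abs", "ada"), ("div", "ada", 0.0)))
```

Because the individual is an arbitrary expression over the classifier outputs, it can represent decision surfaces that a fixed rule such as majority voting cannot.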

Table 2 GP parameter settings for evolving the ensemble classifiers

ROC and MCC as fitness criteria in GP

ROC and MCC have been used as fitness criteria in the GP simulation for developing the ensemble classifier. The ROC is a graph of the true positive rate (TPR) plotted against the false positive rate (FPR) for different threshold values. TPR represents the number of correctly predicted positive cases divided by the total number of positive cases, whereas FPR represents the number of negative cases predicted as positive divided by the total number of negative cases (Khan et al. 2005). The area under the ROC curve, i.e. AUROC, is then computed and is considered as the fitness in the GP simulation. The GP individual with the highest value of AUROC is chosen as the best individual in the population.
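As a sketch of how such a fitness value can be obtained, the functions below trace the ROC points by thresholding at each distinct score and compute AUROC through the equivalent pairwise-ranking form, i.e. the fraction of (positive, negative) pairs in which the positive sample receives the higher score, with ties counted as one half. The function names and the label convention (+1 for positive) are assumptions.

```python
def roc_points(scores, labels):
    """(FPR, TPR) pairs obtained by thresholding at each distinct score."""
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    points = [(0.0, 0.0)]
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y != 1)
        points.append((fp / n_neg, tp / n_pos))
    return points


def auroc(scores, labels):
    """Area under the ROC curve via pairwise comparison of scores."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y != 1]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

An individual whose outputs rank every mitochondrial protein above every non-mitochondrial one attains an AUROC of 1, while an uninformative ranking gives a value near 0.5.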

MCC is also considered a rigorous performance measure in classification applications. Therefore, we have also used MCC as the fitness criterion in some of the GP simulations. It is observed that both fitness criteria, i.e. AUROC and MCC, yield almost the same performance on the testing data (Table 3). The accuracy versus complexity graph and the best individual GP tree for the MCC-based fitness criterion are shown in Fig. 3.

Table 3 GP performance on Mito_D dataset for Jackknife test using both the ROC and MCC fitness criteria

Fig. 3 Accuracy versus complexity and the best individual GP tree for MCC as fitness criteria

Comparison with existing state of the art approaches

Performance comparisons on the Mito_D dataset

We have compared our proposed Mito-GSAAC with the existing prediction methods using the Mito_D dataset and the jackknife test. All prediction performances are listed in Table 4. The results show that Mito-GSAAC can distinguish mitochondrial proteins from other proteins with a relatively high accuracy of 92.62% and an MCC of 0.85. MITOPRED, which uses biological information, also shows high performance with an accuracy of 95.68%. However, among the methods that do not use any biological information, such as the one that achieved an accuracy of 85%, ours has achieved the highest accuracy. In practice, not all biological information can easily be obtained, and when such information is absent our method is largely unaffected. Like MITOPRED, MitoProt also has some limitations; it can only predict sequences starting with a methionine and mature proteins. The discrete wavelet transform method (Jiang et al. 2006), which is based on sequence-scale similarity measurement, does not rely on subcellular location information and can directly predict protein sequences of different lengths. Although its specificity is relatively high, its accuracy is poor. This is usually due to the specific properties of mitochondrial proteins, which make it difficult to discriminate them from other proteins by just one method, or simply because the number of proteins in the mitochondrion is very large (Cameron et al. 2005). As the number of experimentally verified mitochondrial proteins increases, the performance may also improve significantly.

Table 4 Performance comparison on Mito_D dataset using Jackknife test for the existing state of the art mitochondria predictors and our proposed Mito-GSAAC

Performance analysis of the proposed Mito-GSAAC on the MP_Mito_D dataset

We also evaluate the performance of the proposed approach Mito-GSAAC on the MP_Mito_D dataset using two important statistical tests: the jackknife test and the fivefold cross validation test.
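The two resampling schemes differ only in how the sample indices are partitioned, as the sketch below illustrates. The random shuffling with a fixed seed and the absence of class stratification in the fivefold split are assumptions; the text does not state how the folds were formed.

```python
import random


def jackknife_splits(n):
    """Leave-one-out: each sample is the test set exactly once."""
    for i in range(n):
        train = [j for j in range(n) if j != i]
        yield train, [i]


def kfold_splits(n, k=5, seed=0):
    """k-fold cross validation over n shuffled sample indices."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[f::k] for f in range(k)]   # k nearly equal-sized folds
    for f in range(k):
        train = [i for g in range(k) if g != f for i in folds[g]]
        yield train, folds[f]
```

For the 175 proteins of MP_Mito_D, the jackknife test therefore trains 175 models on 174 samples each, whereas fivefold cross validation trains five models on 140 samples each.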

In the case of the jackknife test, RF using AAC yielded the highest accuracy and MCC values of 85.14% and 0.53, respectively (Table 5). In the case of Dp composition, SVM achieved better accuracy than the rest of the classifiers. In the case of SAAC, on the other hand, both SVM and kNN yielded better accuracies than the rest of the classifiers. Using the hybrid features (when all the different features are simply concatenated) and the jackknife test, the accuracy of SVM is 89.21%, which is slightly higher than that obtained using SAAC. On the other hand, the performance of the rest of the classifiers decreased when used in conjunction with the hybrid features, mainly due to the curse of dimensionality.

Table 5 Prediction performance of classifiers using Jackknife test on MP_Mito_D dataset

In the case of fivefold cross validation, the performance of RF, SVM, and kNN using the individual feature extraction strategies is shown in Table 6. When the mitochondria prediction performance using the different feature-extraction strategies is analyzed, enhanced performance is observed using SAAC for all the classification algorithms.

Table 6 Prediction performance of classifiers using fivefold cross validation on MP_Mito_D dataset

In the case of the hybrid feature-extraction strategy, the classification accuracies obtained by AdaBoost, bagging, MLP, RF, SVM, and kNN are 85.22, 85.10, 83.14, 87.01, 92.23, and 90.71%, respectively. It has been observed again that, using the hybrid feature-extraction strategy, SVM has achieved a slight improvement in accuracy over that obtained using SAAC. The performance improvement for SVM is only 0.23%; however, the dimensionality of the feature space is increased greatly. On the other hand, the performance of the rest of the classifiers decreased when trained on the hybrid features.

It has thus been observed that a mitochondrial protein can be efficiently discriminated based on the differences in amino acid composition at its N- and C-termini, as captured by the SAAC feature extraction strategy. Therefore, the predicted results of AdaBoost, RF, SVM and kNN using the SAAC features are then combined through GP. The results of the GP ensemble are analyzed using both tests, jackknife and fivefold cross-validation, as shown in Table 7.

Table 7 Performance of Mito-GSAAC on MP_Mito_D dataset

In the case of the jackknife test, Mito-GSAAC has obtained an accuracy of 90.05%. In the GP simulation, the population size and the number of generations were kept equal to 200 and 100, respectively. The accuracy of Mito-GSAAC is 1.98% higher than the highest individual classifier’s result using all the feature extraction strategies.

Similarly, the predicted results for fivefold cross-validation test are also combined through GP. Mito-GSAAC obtained an accuracy of 93.21%. The accuracy of Mito-GSAAC using fivefold cross-validation test is 1.71% higher than the highest individual classifier’s result using all the feature extraction strategies.

Comparison with existing approaches on the MP_Mito_D dataset

We have also compared our proposed approach Mito-GSAAC with already published methods using the MP_Mito_D dataset (Table 8). For the MP_Mito_D dataset, Verma et al. (2009) have employed Dp and PSSM composition using SVM and have reported 92.57% accuracy and 0.78 MCC. Further, they have applied the combination of SAAC and PSSM using SVM as a classifier and have reported an accuracy of 92.00% and 0.81 MCC. However, the classification accuracy and MCC of our proposed approach Mito-GSAAC (accuracy 93.21%) are better than those obtained by Verma et al. (2009). Thus, the predicted results show that the performance of our proposed approach is not only better than that of RF, SVM, and kNN but also higher than that of the existing approaches.

Table 8 Comparison with existing approaches on MP_Mito_D dataset

Conclusions

In this paper, we have shown that a GP-based ensemble classifier can be developed for better exploitation of the advantages of the individual classifiers trained on SAAC-based features. We have investigated four kinds of protein representations for mitochondria prediction, namely AAC, Dp, PseAAC and SAAC, and used two datasets for performance analysis. First, different individual classifiers, i.e. SVM, kNN, MLP, RF, AdaBoost and bagging, are trained and their prediction performances are determined. In the case of the jackknife test, RF has given the highest accuracy among all individual classifiers, 90.34%, in conjunction with SAAC-based features on the Mito_D dataset, while SVM obtained the highest accuracy of 92.0% using SAAC and the fivefold cross validation test on the MP_Mito_D dataset. Consequently, SAAC has performed better than the rest of the feature extraction strategies. This better performance might be due to the strong signals in parts of the mitochondrial protein sequence. The proposed Mito-GSAAC has achieved a high accuracy of 92.62% on the Mito_D dataset and 93.21% on the MP_Mito_D dataset. Until now, most existing studies have employed only a single individual classifier to predict mitochondrial proteins. In the current work, however, we have first employed different individual classifiers and then developed a GP-based ensemble classifier for mitochondria prediction. This work thus proposes an effective mitochondria prediction method, Mito-GSAAC, that uses raw sequence data only and can thus be helpful in research related to cell biology and drug discovery.