1 Introduction

The mitochondrion is one of the key organelles in eukaryotic cells; it provides the energy a cell needs to move, divide and produce secretory products (Henze and Martin 2003). It contains four distinct compartments: the outer membrane, the intermembrane space, the inner membrane and the matrix. The outer membrane has large numbers of integral membrane proteins that form channels, allowing small molecules and proteins with signal peptides to diffuse into or out of the mitochondrion. The intermembrane space is the space between the outer membrane and the inner membrane. The inner membrane contains many different polypeptides that regulate metabolite transport and perform the redox reactions of oxidative phosphorylation. The matrix is the space enclosed by the inner membrane. Its main function is to produce ATP with the aid of the ATP synthase contained in the inner membrane.

The proteins located in these four submitochondria locations play distinct biological roles. To understand protein function in a timely manner, it is necessary to accurately identify the submitochondria location of mitochondria proteins. Unfortunately, confirming a protein’s location in the mitochondrion experimentally is costly. Phylogenetic tree analysis is a traditional method used by many experimental scholars to predict the sub-subcellular locations of proteins. Although this method is not particularly expensive, it is more time-consuming than machine learning approaches. Furthermore, for sequences that have no homologous sequences in the benchmark dataset, a phylogenetic tree provides ineffective, inexact or even wrong information. It is therefore worthwhile to develop machine learning methods to predict the submitochondria location of mitochondria proteins.

The prediction of protein sub-subcellular location is an active research area, as reviewed in the recent literature (Du et al. 2011). Because a large number of proteins reside in the nucleus, the prediction of protein subnuclear location has been widely studied. Shen and Chou (2005, 2007) developed a method based on pseudo amino acid composition (PseAAC) to predict the subnuclear location of proteins and achieved an overall accuracy of ~65 %. Subsequent works improved the accuracy to ~85 % by developing various methods (Lei and Dai 2005, 2006; Huang et al. 2007, 2008, 2009; Li and Li 2008; Jiang et al. 2008; Mei and Fei 2010). For subchloroplast location, Du et al. (2009) developed a server called SubChlo, which obtained an overall accuracy of 67.18 % in the jackknife test. To improve the predictive accuracy, another work (Shi et al. 2011) used the discrete wavelet transform to extract features and achieved an accuracy of 89.31 %. The sub-Golgi location of proteins has also received attention. van Dijk et al. (2008) focused on the prediction of the sub-Golgi locations of type II membrane proteins. More recently, cis-Golgi and trans-Golgi proteins were studied (Ding et al. 2011) by using a modified Mahalanobis discriminant.

Because of the special function of the mitochondrion, the submitochondria localization of proteins has attracted many bioinformatics scholars. Du and Li (2006) presented a PseAAC-based method to predict protein submitochondria locations. They constructed a dataset of 317 mitochondria proteins with sequence identity less than 40 % and achieved an overall accuracy of 85.2 % in jackknife cross-validation. Subsequently, Nanni and Lumini (2008) developed a genetic programming-based method that increased the jackknife test accuracy to 89 %. Zakeri et al. (2011) increased the overall accuracy to 94.7 % by using feature fusion. Shi et al. (2011) used the discrete wavelet transform to extract features and achieved an overall accuracy of 93.38 % in jackknife cross-validation. Mei (2012) used GO information to improve the accuracy of submitochondria location prediction. A second dataset, comprising 399 mitochondria proteins with sequence identity less than 40 %, was constructed by Zeng et al. (2009); by using augmented PseAAC as parameters, they achieved an overall accuracy of 89.7 % in jackknife cross-validation. Fan and Li (2012) constructed a third dataset containing 1,105 mitochondria proteins with sequence identity less than 40 %. By using the pseudo-average chemical shift and GO information, they obtained an overall accuracy of 93.57 %. Although these models have their respective merits, they share a common limitation: the success rate is very low when the query protein has less than 25 % sequence identity to proteins with known locations (Chou and Shen 2007, 2008).

In this study, we constructed a stringent benchmark dataset in which no protein has ≥25 % sequence identity with any other protein in the same submitochondria location. The binomial distribution was used to select an optimized set of tetrapeptide words, and a support vector machine (SVM) was used to perform the prediction. In the jackknife test, an overall accuracy of 91.1 % was achieved, with an average accuracy of 88.0 %. Our method was also examined on three other benchmark datasets and achieved accuracies of 94.0, 94.7 and 93.4 %, respectively. These results show that the proposed method can efficiently and accurately annotate submitochondria locations for new mitochondria proteins.

2 Materials and Methods

2.1 Dataset

The mitochondria proteins were extracted from the Universal Protein Resource (UniProt) (UniProt Consortium 2012). To construct a reliable, high-quality benchmark dataset, the following steps were applied. (1) Although proteins with multiple submitochondria locations have some special biological functions, we collected only proteins with a single submitochondria location, because the number of proteins with multiple submitochondria locations is too small to be statistically significant. (2) Proteins with ambiguous protein existence annotations, such as ‘uncertain’, ‘predicted’ and ‘inferred from homology’, were excluded because they lack confidence. (3) Only proteins with experimentally confirmed submitochondria locations were included, because they provide validated information. (4) Sequences that are fragments of other proteins were excluded because their information is redundant and incomplete. (5) Sequences containing nonstandard letters, such as ‘B’, ‘X’ or ‘Z’, were excluded because their meanings are ambiguous. (6) To avoid homology bias, any protein sequence with ≥25 % sequence identity to another protein in the same subset was excluded by using PISCES (Wang and Dunbrack 2005). Following these procedures, we obtained 495 mitochondria proteins, referred to as M495, comprising 254 inner membrane proteins, 132 matrix proteins and 109 outer membrane proteins. The data can be freely downloaded from http://lin.uestc.edu.cn/server/subMito/data. Dividing a benchmark dataset into a training set and an independent test set is a strict and objective way to evaluate the performance of a proposed method. In this study, however, we did not do so because the currently available data are insufficient: after such a split, the number of proteins in some subsets would be too small to be statistically significant.
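The simpler filtering steps above, (1), (4) and (5), can be sketched in a few lines. This is only an illustration: the record layout (keys `sequence`, `locations`, `is_fragment`) and the toy entries are hypothetical, and the annotation-based steps (2) and (3) as well as the PISCES identity filtering in step (6) are not reproduced.

```python
STANDARD_RESIDUES = set("ACDEFGHIKLMNPQRSTVWY")  # the 20 standard amino acids


def keep_record(record):
    """Apply steps (1), (4) and (5) to one record (hypothetical dict layout)."""
    if len(record["locations"]) != 1:                      # step (1): single location only
        return False
    if record["is_fragment"]:                              # step (4): drop fragments
        return False
    if not set(record["sequence"]) <= STANDARD_RESIDUES:   # step (5): drop B, X, Z, ...
        return False
    return True


# Toy records, invented for illustration (not taken from M495).
records = [
    {"id": "P1", "sequence": "MKLVAAG", "locations": ["inner membrane"], "is_fragment": False},
    {"id": "P2", "sequence": "MKXVAAG", "locations": ["matrix"], "is_fragment": False},
    {"id": "P3", "sequence": "MKLVAAG", "locations": ["matrix", "inner membrane"], "is_fragment": False},
    {"id": "P4", "sequence": "MKLV", "locations": ["outer membrane"], "is_fragment": True},
]
kept = [r["id"] for r in records if keep_record(r)]
print(kept)
```

Only the first toy record passes all three checks.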

To facilitate comparison with previous studies, three additional benchmark datasets were used. The first dataset, M317, constructed by Du and Li (2006), contains 131 inner membrane proteins, 145 matrix proteins and 41 outer membrane proteins. The second dataset, M399, constructed by Zeng et al. (2009), contains 171 inner membrane proteins, 166 matrix proteins and 62 outer membrane proteins. The third dataset, M1105, constructed by Fan and Li (2012), contains 589 inner membrane proteins, 280 matrix proteins and 236 outer membrane proteins. In each of the three benchmark datasets, the sequence identity is <40 %.

2.2 Tetrapeptide Words

Informative parameters play a key role in machine learning problems. Secondary structure information has been widely used in protein structure and function prediction. It has been shown that predicted secondary structure information can be used to predict the submitochondria location of proteins by calculating the average chemical shift (Fan and Li 2012). Rackovsky (1993) estimated that 60–70 % of tetrapeptides encode a specific structure. Feng and Luo (2008) used tetrapeptide signals to predict the secondary structure of proteins. Thus, in this study, tetrapeptide words were used to represent the mitochondria protein samples. The tetrapeptide words were defined as follows.

First, by sliding a window of four residues with a step of one residue along the mitochondria protein sequences, we calculated the occurrence frequency \( n_{ij} \) of the i-th tetrapeptide in the j-th submitochondria location, where i = 1, 2, …, 160,000 and j = 1, 2, 3 for inner membrane proteins, matrix proteins and outer membrane proteins, respectively.
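The counting step can be sketched as follows. The function names and the toy sequences are illustrative assumptions, not part of the original method description; only observed tetrapeptides are stored, so unobserved words implicitly have a count of zero.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 standard residues, so 20**4 = 160,000 words


def count_tetrapeptides(sequence, k=4):
    """Count overlapping k-residue words with a window that moves one residue at a time."""
    return Counter(sequence[s:s + k] for s in range(len(sequence) - k + 1))


def class_counts(sequences_by_location, k=4):
    """Return {location: Counter} holding n_ij for every observed word i in location j."""
    counts = {}
    for location, sequences in sequences_by_location.items():
        total = Counter()
        for seq in sequences:
            total.update(count_tetrapeptides(seq, k))
        counts[location] = total
    return counts


# Toy sequences, invented for illustration (not taken from M495).
toy = {
    "inner": ["MKLVAAMKLV"],
    "matrix": ["MKLVGG"],
    "outer": ["GGGGG"],
}
n = class_counts(toy)
print(n["inner"]["MKLV"], sum(n["matrix"].values()), n["outer"]["GGGG"])
```

A sequence of length L contributes L − 3 overlapping tetrapeptides, which is why the six-residue toy sequence yields three words.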

Second, observing the i-th tetrapeptide can be treated as a stochastic event with two possible outcomes: it occurs in the j-th submitochondria location, or it does not. Each outcome has a fixed probability, the same from trial to trial. Thus, the probability \( p_{j} \) of occurring in the j-th submitochondria location, also called the prior probability, is defined as:

$$ p_{j} = \frac{\sum\nolimits_{i = 1}^{160,000} n_{ij}}{\sum\nolimits_{j = 1}^{3} \sum\nolimits_{i = 1}^{160,000} n_{ij}} $$
(1)

where \( \sum\nolimits_{j = 1}^{3} {\sum\nolimits_{i = 1}^{160,000} {n_{ij} } } \) is the total number of occurrences of all tetrapeptides in the benchmark dataset and \( \sum\nolimits_{i = 1}^{160,000} {n_{ij} } \) is the number of occurrences of all tetrapeptides in the proteins of the j-th submitochondria location. According to this definition, the prior probability \( p_{j} \) depends on the sizes of the benchmark dataset and of the sub-dataset. Correspondingly, the probability of not occurring in the j-th submitochondria location is \( q_{j} = 1 - p_{j} \).
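As a minimal sketch of Eq. (1), the prior probabilities can be computed directly from per-location counts. The dictionary layout and the toy counts below are assumptions made for illustration.

```python
def prior_probabilities(counts):
    """counts: {location: {tetrapeptide: n_ij}} -> {location: p_j} following Eq. (1)."""
    class_totals = {loc: sum(c.values()) for loc, c in counts.items()}
    grand_total = sum(class_totals.values())
    return {loc: total / grand_total for loc, total in class_totals.items()}


# Toy counts, invented for illustration.
toy_counts = {
    "inner": {"MKLV": 2, "KLVA": 1, "LVAA": 1},   # 4 occurrences in total
    "matrix": {"MKLV": 1, "KLVG": 1, "LVGG": 1},  # 3 occurrences in total
    "outer": {"GGGG": 3},                         # 3 occurrences in total
}
p = prior_probabilities(toy_counts)
q = {loc: 1.0 - pj for loc, pj in p.items()}      # q_j = 1 - p_j
print(p, q)
```

With 4, 3 and 3 occurrences out of 10, the priors are 0.4, 0.3 and 0.3, which illustrates that each prior reflects the relative size of its sub-dataset.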

Third, we denote the number of occurrences of the i-th tetrapeptide in the benchmark dataset by \( N_{i} = \sum\nolimits_{j = 1}^{3} {n_{ij} } \). In other words, given the prior probability \( p_{j} \), one performs \( N_{i} \) trials (observations). The probability \( P_{ij} \) that the i-th tetrapeptide occurs \( n_{ij} \) or more times in the j-th submitochondria location then follows the binomial distribution and is given by:

$$ P_{ij} = 1 - CL_{ij} = \sum\limits_{{n = n_{ij} }}^{{N_{i} }} {\frac{{N_{i} !}}{{n!(N_{i} - n)!}}p_{j}^{n} (1 - p_{j} )^{{N_{i} - n}} } $$
(2)

where \( CL_{ij} \) is called the confidence level (CL) of the i-th tetrapeptide in the j-th submitochondria location.
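Eq. (2) is an upper binomial tail, so it can be evaluated with a direct sum. The sketch below uses only the Python standard library; the function names are illustrative, and for very large \( N_{i} \) a numerically more careful routine may be preferable.

```python
from math import comb


def binomial_tail(n_ij, N_i, p_j):
    """P_ij of Eq. (2): probability of n_ij or more occurrences in N_i trials."""
    return sum(
        comb(N_i, n) * p_j ** n * (1.0 - p_j) ** (N_i - n)
        for n in range(n_ij, N_i + 1)
    )


def confidence_level(n_ij, N_i, p_j):
    """CL_ij = 1 - P_ij."""
    return 1.0 - binomial_tail(n_ij, N_i, p_j)


# A word seen 5 times overall, all 5 in a class with prior 0.5:
# P = 0.5**5 = 0.03125, so CL = 0.96875.
print(confidence_level(5, 5, 0.5))
```

A word that never occurs in a class (n_ij = 0) has P_ij = 1 and therefore CL_ij = 0, while a word concentrated in one class receives a CL close to 1.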

Fourth, there are three submitochondria locations in the current study, namely j = 1, 2, 3. Therefore, according to Eqs. (1) and (2), an arbitrary tetrapeptide i has three confidence levels (\( CL_{i \, inner} \), \( CL_{i \, matrix} \) and \( CL_{i \, outer} \)), one for each of the three classes of submitochondria proteins. The confidence level of tetrapeptide i in the benchmark dataset is then defined as follows:

$$ CL_{i} = \hbox{max} \{ CL_{i \, inner} , \, CL_{i \, matrix} , \, CL_{i \, outer} \} $$
(3)

Finally, if there are m tetrapeptides whose \( CL_{i} \) values are larger than a given cutoff \( CL_{o} \), the frequencies of these m tetrapeptides are selected as the optimized features. Accordingly, a protein in the benchmark dataset can be represented by a discrete vector \( F_{m} \) given by

$$ F_{m} = \left[ {f_{1} ,f_{2} , \ldots ,f_{i} , \ldots ,f_{m} } \right]^{T} $$
(4)

where \( f_{i} \) (i = 1, 2, …, m) are the frequencies of the m tetrapeptides in a protein and T denotes transposition. If \( CL_{o} \) is set to zero, all 160,000 tetrapeptides are selected. If \( CL_{o} > 1 \), no tetrapeptides are selected. By applying a cutoff to the confidence level (Eqs. 2 and 3), the high-dimensional data can be projected into a low-dimensional space. The parameter m, or equivalently \( CL_{o} \), can be chosen by cross-validation.
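The selection rule of Eq. (3) and the feature vector of Eq. (4) can be sketched as below. Two points are assumptions rather than statements from the text: the frequency of a word is taken as its count divided by the number of windows in the protein, and the input layout (a dictionary of per-class confidence levels) and the toy values are hypothetical.

```python
def select_features(cl_by_class, cutoff):
    """cl_by_class: {tetrapeptide: {location: CL_ij}}.
    Keep the words whose maximum CL over the classes (Eq. 3) exceeds the cutoff."""
    return sorted(w for w, cls in cl_by_class.items() if max(cls.values()) > cutoff)


def feature_vector(sequence, selected, k=4):
    """Frequencies f_1..f_m of the selected words in one protein (Eq. 4).
    Assumption: frequency = count / number of sliding windows."""
    n_windows = max(len(sequence) - k + 1, 1)
    counts = {}
    for s in range(len(sequence) - k + 1):
        word = sequence[s:s + k]
        counts[word] = counts.get(word, 0) + 1
    return [counts.get(w, 0) / n_windows for w in selected]


# Toy confidence levels, invented for illustration.
toy_cl = {
    "MKLV": {"inner": 0.99, "matrix": 0.20, "outer": 0.10},
    "GGGG": {"inner": 0.10, "matrix": 0.30, "outer": 0.97},
    "AAAA": {"inner": 0.50, "matrix": 0.60, "outer": 0.40},
}
selected = select_features(toy_cl, cutoff=0.965)
vector = feature_vector("MKLVMKLV", selected)
print(selected, vector)
```

Raising the cutoff shrinks the selected set and therefore the dimension m of the vector, which is the trade-off that cross-validation is used to tune.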

2.3 Support Vector Machine

The support vector machine (SVM) is a popular machine learning method that has been widely applied in bioinformatics. In this study, we used the LIBSVM software to implement the SVM (Fan et al. 2005). For the multi-class problem, we adopted the one-versus-one (OVO) strategy. The radial basis function (RBF) was chosen as the kernel function. A grid search was applied to optimize the regularization parameter C and the kernel parameter γ by using five-fold cross-validation.
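The parameter search can be illustrated without any SVM library. The sketch below shows the RBF kernel and a grid search over (C, γ) on a base-2 logarithmic grid; the grid ranges are an assumption (the text does not state them), and `toy_cv_accuracy` is a stand-in for the five-fold cross-validation accuracy that a real run would obtain by training the SVM for each pair.

```python
import math
from itertools import product


def rbf_kernel(x, y, gamma):
    """RBF kernel K(x, y) = exp(-gamma * ||x - y||^2)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))


def parameter_grid(c_exponents, gamma_exponents):
    """All (C, gamma) pairs on a base-2 logarithmic grid."""
    return [(2.0 ** c, 2.0 ** g) for c, g in product(c_exponents, gamma_exponents)]


def grid_search(grid, cv_accuracy):
    """Return the (C, gamma) pair that maximizes a caller-supplied scoring function."""
    return max(grid, key=lambda pair: cv_accuracy(*pair))


def toy_cv_accuracy(C, gamma):
    """Stand-in scorer (hypothetical) that peaks at C = 2**3 and gamma = 2**-3."""
    return -((math.log2(C) - 3) ** 2 + (math.log2(gamma) + 3) ** 2)


# Assumed search ranges: C in 2**-5 .. 2**15, gamma in 2**-15 .. 2**3, step 2 in the exponent.
grid = parameter_grid(range(-5, 16, 2), range(-15, 4, 2))
best_C, best_gamma = grid_search(grid, toy_cv_accuracy)
print(best_C, best_gamma)
```

In practice the scoring function is the expensive part, since every grid point requires five training runs; this is one reason five-fold cross-validation, rather than the jackknife, is used for tuning.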

2.4 Performance Evaluation

Jackknife cross-validation was used to evaluate the performance of the proposed model (Chou and Shen 2007). Three measures, sensitivity (Sn), overall accuracy (OA) and the Matthews correlation coefficient (MCC), were calculated as follows:

$$ Sn = TP/(TP + FN) $$
(5)
$$ OA = (TP + TN)/(TP + TN + FP + FN) $$
(6)
$$ MCC = \frac{(TP \times TN) - (FP \times FN)}{{\sqrt {(TP + FP) \times (TN + FN) \times (TP + FN) \times (TN + FP)} }} $$
(7)

where TP, TN, FP and FN denote true positives, true negatives, false positives and false negatives, respectively.
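For a three-class problem, these counts can be obtained per class by treating one location as positive and the other two as negative. The sketch below implements Eqs. (5)–(7) literally under that one-versus-rest assumption; the labels are toy data invented for illustration.

```python
import math


def one_vs_rest_counts(true_labels, predicted_labels, positive):
    """Return (TP, TN, FP, FN), treating `positive` as the positive class."""
    tp = tn = fp = fn = 0
    for t, p in zip(true_labels, predicted_labels):
        if t == positive and p == positive:
            tp += 1
        elif t == positive:
            fn += 1
        elif p == positive:
            fp += 1
        else:
            tn += 1
    return tp, tn, fp, fn


def sensitivity(tp, tn, fp, fn):
    return tp / (tp + fn)                                  # Eq. (5)


def accuracy(tp, tn, fp, fn):
    return (tp + tn) / (tp + tn + fp + fn)                 # Eq. (6)


def mcc(tp, tn, fp, fn):
    denom = math.sqrt((tp + fp) * (tn + fn) * (tp + fn) * (tn + fp))
    return (tp * tn - fp * fn) / denom if denom else 0.0   # Eq. (7); 0 if undefined


# Toy labels, invented for illustration.
true = ["inner", "inner", "inner", "matrix", "matrix", "outer"]
pred = ["inner", "inner", "matrix", "matrix", "matrix", "inner"]
counts = one_vs_rest_counts(true, pred, positive="inner")
print(counts, sensitivity(*counts), accuracy(*counts), mcc(*counts))
```

Returning 0 when the MCC denominator vanishes is a common convention rather than something specified in the text.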

3 Results

For the benchmark dataset M495, the over-represented tetrapeptides can be obtained by using Eqs. (1)–(3). In our statistics, tetrapeptides with \( N_{i} < 3 \) in the dataset were eliminated, since these tetrapeptides do not preferentially occur in mitochondria proteins (p < 0.0001). Generally, tetrapeptides with a high confidence level give more reliable information for classification. However, the number of such tetrapeptides is too small to provide enough information, which leads to poor predictive accuracy. For example, using a confidence level of >99.9 %, we obtain only 13 tetrapeptides, and the overall accuracy is only 56.8 % in five-fold cross-validation (Table 1). In contrast, a low confidence cutoff yields a feature set with too many components, which reduces the cluster-tolerant capacity and lowers the cross-validation accuracy. For instance, the 20,781 tetrapeptides with a confidence level of >50.0 % produce an overall accuracy of only 60.4 % in five-fold cross-validation (Table 1).

Table 1 The accuracies of different confidence levels by using five-fold cross-validation

Therefore, using an appropriate set of tetrapeptides yields a prediction with higher accuracy. By changing the cutoff of the confidence level, we can obtain a series of tetrapeptide sets and examine the accuracy of each. To save time, five-fold cross-validation was used to optimize the regularization parameter C and the kernel parameter γ. By examining all feature subsets, we found that the overall accuracy in jackknife cross-validation reaches its maximum when the CL cutoff is set to 96.5 %. With the resulting 1,302 optimized tetrapeptides, 91.1 % of the mitochondria proteins are correctly predicted (Table 2). The Sn and MCC values are 98.8 % and 0.84 for inner membrane proteins, 86.4 % and 0.89 for matrix proteins, and 78.9 % and 0.85 for outer membrane proteins.

Table 2 Performance of different methods on different datasets

To verify the advantage of the proposed method, we repeated the process of feature selection and prediction on the three other benchmark datasets: M317, M399 and M1105. The overall accuracies reached their maxima when the CL cutoffs were set to 90.0, 93.9 and 93.2 %, respectively. The results are recorded in Table 2. They show that the proposed method achieves ~94 % overall accuracy on the three benchmark datasets. These accuracies are almost as high as the best obtained by other methods, indicating that our method can be used to predict the submitochondria location of proteins. In particular, our model achieves the highest accuracy for inner membrane proteins among the compared methods.

We noticed that the gene ontology (GO)-based model developed by Fan and Li (2012) obtained the best results on M317 and M1105. The GO database is organized around molecular function, biological process and cellular component. If the molecular function, biological process or cellular component of a protein has been experimentally determined, it is easy to infer the subcellular location of that protein. For example, Mei (2012) used this information to predict submitochondria protein location and achieved >99 % accuracy on M317. It is thus not surprising that Fan and Li’s (2012) model obtains high accuracies. However, using GO information for prediction has a serious limitation. The percentage (<50 %) of protein entries with subcellular annotations in the GO database is lower than that (>50 %) in the UniProt database (Chou and Shen 2008). If a query protein has not been annotated in the GO database and no homolog can be found there, their model cannot perform a prediction. Our model predicts the location of mitochondria proteins using only primary sequence information, which makes it simpler and less restricted in its applicability. Furthermore, no matter which dataset was used to evaluate it, our model always achieved >91 % accuracy, suggesting that the model is robust.

4 Discussion

Using tetrapeptides to encode protein sequences plays a key role in predicting the submitochondria location of proteins. Because the frequency of occurrence of any given tetrapeptide in a random sequence is very low (1/160,000), particular tetrapeptides tend to be present within a protein because of their contributions to the particular functional role of that protein and not as the result of some random choice (Stuart et al. 2002). Tripeptides or larger peptides could also be used in prediction. However, tripeptides appear about 20 times more frequently than tetrapeptides; hence, they would bring more noise into the prediction. For larger peptides, the feature dimension is so large that three problems would arise in computation: over-fitting, information redundancy and the curse of dimensionality. Moreover, some studies have shown that several mitochondrial intermembrane space proteins share tetrapeptide motifs (Verhagen et al. 2007; Polianskyte et al. 2009; Shi 2002). Furthermore, tetrapeptides have been shown to produce comprehensive gene and species phylogenies for mitochondrial genomes and to serve to identify correlated peptides as motifs (Stuart et al. 2002). According to these analyses, tetrapeptides are suitable for mitochondrial protein data.
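The figures quoted above follow from simple counting. Assuming the 20 standard amino acids and, as a simplification, uniformly random sequences, the number of distinct words of length k is \( 20^{k} \):

$$ 20^{3} = 8{,}000, \qquad 20^{4} = 160{,}000, \qquad 20^{5} = 3{,}200{,}000 $$

A given tripeptide is therefore expected \( 160{,}000 / 8{,}000 = 20 \) times as often as a given tetrapeptide, while moving from tetrapeptides to pentapeptides multiplies the feature dimension by a further factor of 20.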

In addition, it is widely accepted that a protein’s sequence determines its structure and its structure determines its function. Rackovsky (1993) used entropy to investigate the local coding properties of protein sequences and estimated that 60–70 % of tetrapeptides encode specific structures. Feng and Luo (2008) achieved an accuracy of 80 % in protein secondary structure prediction by using these tetrapeptides, which further demonstrates the relationship between sequence and structure. They suggested that these tetrapeptide signals can be regarded as the protein folding code in protein structure prediction. Furthermore, the results of Fan and Li (2012) showed that predicted secondary structure information can improve the accuracy of predicting the submitochondria location of proteins. Therefore, using tetrapeptides directly to predict the submitochondria location of proteins is a feasible approach. Our results also suggest that the optimized tetrapeptides are informative and reflect inherent properties of mitochondria proteins.

Tetrapeptide words occurring only once or twice in the benchmark dataset do not preferentially occur in mitochondria proteins, so we ignored them to guarantee the reliability of feature selection. In our statistics, we obtained 1,302 over-represented tetrapeptides, which is much larger than the size (495) of the dataset. However, the total of 188,940 tetrapeptides occurring in the dataset guarantees the statistical significance of the over-represented tetrapeptides. Furthermore, jackknife cross-validation (Chou and Zhang 1995), which always yields a unique outcome, was used to evaluate our method. We therefore consider our results credible.

For the convenience of experimental scientists, we constructed an online server called TetraMito, which is freely available at http://lin.uestc.edu.cn/server/TetraMito. The server may become a useful vehicle for in-depth study of mitochondria proteins, or at least a complementary tool to the existing methods in this area.

5 Conclusions

In this study, we developed a feature selection-based method to predict the submitochondria locations of mitochondria proteins using primary sequence information. The results demonstrate that the proposed method is capable of predicting and annotating the submitochondria locations of mitochondria proteins. Based on this model, we have constructed a free online server, TetraMito. We expect the current study to advance the prediction of submitochondria protein locations and to promote research in related areas.