1 Introduction

With the deluge of gene products in the post-genomic age, the gap between newly found protein sequences and knowledge of their cellular location is growing larger. To use these sequences for drug discovery, an effective method to bridge this gap is desirable. In practice, proteins may simultaneously exist at, or move between, two or more subcellular locations, which makes protein localization a very challenging problem in bioinformatics. Protein subcellular localization can be determined by biochemical experiments such as cell fractionation, electron microscopy and fluorescence microscopy. These experimental approaches are accurate but time consuming and expensive, which necessitates computational techniques for predicting protein subcellular localization; such predictions are also useful for protein function prediction.

A number of in silico subcellular localization methods have been proposed. Most of them can be classified into categories based on the recognition of protein N-terminal sorting signals, amino acid composition, functional domain, homology and fusion. Sorting signals are short sequence segments that direct proteins to intracellular or extracellular environments. They include signal peptides, membrane-spanning segments, lipid anchors, nuclear import signals and motifs that direct proteins to organelles such as mitochondria and lysosomes [1]. Nakai and Kanehisa [2] made a pioneering attempt with PSORT, a computational method based on sequence motifs and amino acid composition that exploits comprehensive knowledge of protein sorting. Reinhardt and Hubbard [3] used amino acid composition to predict protein subcellular location in a neural network based system. Chou and Elrod [4, 5] also used amino acid composition to predict subcellular location, applying the covariant discriminant algorithm.
They obtained better prediction accuracy when they used correlations of residue pairs together with amino acid composition. Emanuelsson and co-authors [6] proposed a signal-based approach in which individual sorting signals, e.g. signal peptides, mitochondrial targeting peptides and chloroplast transit peptides, are identified [14]. They then proposed an integrated prediction system using a neural network based on the predictions of the individual sorting signals. The reliability of the method depends on the quality of the assignment of the gene 5′-regions or protein N-terminal sequences. However, 5′-region assignments made by gene identification methods are usually not reliable, and inadequate signal information leads to low accuracy. Hua and Sun [7] built an SVM based prediction system with a radial basis function kernel using amino acid composition. Park and Kanehisa [8] used a voting scheme to predict 12 subcellular locations, training a set of SVMs on amino acid, amino acid pair and gapped amino acid pair compositions. MultiLoc [9] is an SVM based approach that integrates N-terminal targeting sequences, amino acid composition and protein sequence motifs; it predicts the localization of eukaryotic proteins very well. Horton et al. [10] proposed WOLF PSORT, an extension of the sorting signal and composition based method PSORT II, in which amino acid content, sequence length and sorting signals are used. The extended feature set increased the prediction accuracy over PSORT II with the same k-nearest neighbor classifier. Chou and Shen [11] proposed an ensemble classifier with kNN as the basic classifier, which uses the concept of pseudo amino acid (PseAA) composition. Mer and co-author [12] proposed a novel approach exploiting amino acid composition and different levels of amino acid exposure.
The concept is that differently exposed residues are under different evolutionary pressures to mutate towards amino acid types whose side chains have physicochemical properties suited to the subcellular location where the protein best performs its activity. iLoc-Euk [13] uses a multi-label classifier over 22 location sites to predict both singleplex and multiplex proteins. APSLAP [14] uses an adaptive boosting technique with physicochemical descriptors, amino acid composition and CTD. From the above methods, it can be observed that some predictors experiment with different feature sets for a particular classifier [27], while others adopt a voting scheme or an ensemble built from a set of classifiers [8, 11]. These observations motivate us to use multiple physicochemical properties weighted by AAC together with an ensemble of different classifiers.

2 Materials and Methods

In this work, we combine amino acid composition with physicochemical properties to predict five eukaryotic subcellular locations: cell wall, cytoplasm, mitochondrion, extracellular and nucleus. The experiment is conducted in two stages. In the first stage, four classifiers, namely PART, Multi-Layer Perceptron (MLP), Adaboost and RBF neural network, are applied and their prediction performance is observed. In the second stage, an ensemble classifier is constructed from the two best-performing classifiers (here, PART and Adaboost) to achieve better prediction accuracy.

2.1 The Feature Set

The Amino Acid Composition (AAC) of a protein specifies the occurrence (sometimes as a percentage) of each of the 20 amino acids. Using AAC for localization rests on the hypothesis that differences in AAC are associated with different locations [12]. The physicochemical properties of amino acids also influence where a protein is active. The properties considered here are hydropathy, charge, solubility, pKa value, LP value, hydrophilicity and isoelectric point. According to the theory of Lim (1974), hydrophilic patterns of amino acid residues tend to occur in the secondary structure of a protein sequence. The hydrophobicity of amino acid residues is the major driving force behind protein folding, and a protein is active only in a specific folding pattern. As proteins perform different functions in different cellular locations, hydropathy and hydrophilicity can be expected to have a strong influence on protein subcellular localization. Charge is also important: for example, most nuclear proteins contain many more positively charged residues [15]. LP values [16] of amino acids are mainly used for protein function prediction; because the function and location of a protein are highly correlated, the LP value can also serve as a feature for subcellular localization. Studies indicate that the solubility of a protein is highly related to its function [3], making it a major property that determines function and location within a cell. The isoelectric point and pKa value of amino acids change with the environment of their location, so proteins that reside in a particular location of a cell may have identical isoelectric point and pKa values.

In this work, every protein sequence is represented by a seven-element vector, P = [P1, P2, P3, P4, P5, P6, P7], in which each element is a particular physicochemical property weighted by AAC. The AAC of a protein refers to the occurrence of each residue a_i of the 20 amino acids and is calculated using Eq. 1. Finally, the vector is normalized to the range [0, 1].

$$ AAC_{i} = \frac{\text{occurrence of } a_{i}}{\text{length of protein sequence}}. $$
(1)

The feature indices for charge, hydrophilicity, LP value and hydropathy were taken from the AAindex dataset [17]. The physicochemical properties are weighted by AAC using Eqs. 2-8.

$$ P_{1} = \sum\nolimits_{i = 1}^{20} AAC_{i} \times hydropathy\,\left( a_{i} \right) $$
(2)
$$ P_{2} = \sum\nolimits_{i = 1}^{20} AAC_{i} \times charge\,\left( a_{i} \right) $$
(3)
$$ P_{3} = \sum\nolimits_{i = 1}^{20} AAC_{i} \times solubility\,\left( a_{i} \right) $$
(4)
$$ P_{4} = \sum\nolimits_{i = 1}^{20} AAC_{i} \times isoelectricpoint\,\left( a_{i} \right) $$
(5)
$$ P_{5} = \sum\nolimits_{i = 1}^{20} AAC_{i} \times pK\,\left( a_{i} \right) $$
(6)
$$ P_{6} = \sum\nolimits_{i = 1}^{20} AAC_{i} \times hydrophilicity\,\left( a_{i} \right) $$
(7)
$$ P_{7} = \sum\nolimits_{i = 1}^{20} AAC_{i} \times LP\,\left( a_{i} \right) $$
(8)
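
As a rough illustration of Eqs. 1-8, the following Python sketch computes the AAC of a sequence and two AAC-weighted properties. The Kyte-Doolittle hydropathy scale and the simplified side-chain charge table are assumed stand-ins for the AAindex entries, which the text does not identify. The function names, the skipping of non-standard residues and the min-max normalization over a set of proteins are likewise illustrative assumptions, not the authors' implementation.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Assumption: Kyte-Doolittle hydropathy scale as a stand-in for the AAindex entry.
HYDROPATHY = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

# Assumption: simplified side-chain charge at neutral pH.
CHARGE = {aa: 0.0 for aa in AMINO_ACIDS}
CHARGE.update({"K": 1.0, "R": 1.0, "D": -1.0, "E": -1.0})


def aac(sequence):
    """Eq. 1: fraction of each of the 20 amino acids in the sequence.

    Assumption: residues outside the 20 standard amino acids are skipped.
    """
    residues = [r for r in sequence.upper() if r in AMINO_ACIDS]
    if not residues:
        raise ValueError("sequence contains no standard amino acids")
    counts = Counter(residues)
    return {aa: counts[aa] / len(residues) for aa in AMINO_ACIDS}


def weighted_property(sequence, scale):
    """Eqs. 2-8: sum over the 20 amino acids of AAC_i * property(a_i)."""
    composition = aac(sequence)
    return sum(composition[aa] * scale[aa] for aa in AMINO_ACIDS)


def min_max_normalize(values):
    """Rescale one feature to [0, 1] across proteins (assumed normalization)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


if __name__ == "__main__":
    # Toy sequences for illustration only, not taken from the data set.
    proteins = ["MKKLLA", "DDEEGG", "AVILFW"]
    p1 = [weighted_property(s, HYDROPATHY) for s in proteins]
    p2 = [weighted_property(s, CHARGE) for s in proteins]
    print(min_max_normalize(p1))
    print(min_max_normalize(p2))
```

The remaining five elements of the vector would follow the same pattern with their own per-residue tables.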

2.2 Design of the Classifier

As mentioned above, four classifiers, namely PART, RBFNN, Adaboost and MLP, are applied and their individual performance is observed. The prediction decisions of the two best-performing classifiers are combined to construct an ensemble classifier, PLoc-Euk, to boost prediction accuracy. The ensemble accepts the prediction of whichever component classifier classifies a protein with higher confidence. PLoc-Euk is constructed from PART and Adaboost because they have better prediction accuracy than MLP and RBFNN.
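
The decision rule described above can be sketched in Python as follows. The text does not state how confidence is obtained from each component, so this sketch assumes that each classifier returns a class-probability vector, that confidence is the largest probability in that vector, and that ties go to PART. The function name `ensemble_predict` is hypothetical.

```python
LOCATIONS = ["cell wall", "cytoplasm", "extracellular", "mitochondrion", "nucleus"]


def ensemble_predict(probs_part, probs_adaboost, labels=LOCATIONS):
    """Return (label, confidence, source) from the more confident component.

    probs_part, probs_adaboost: class-probability lists for one protein,
    aligned with `labels`.
    Assumption: confidence is the largest class probability; ties go to PART.
    """
    if not (len(probs_part) == len(probs_adaboost) == len(labels)):
        raise ValueError("probability vectors and labels must have equal length")
    best_part = max(range(len(labels)), key=lambda k: probs_part[k])
    best_ada = max(range(len(labels)), key=lambda k: probs_adaboost[k])
    if probs_part[best_part] >= probs_adaboost[best_ada]:
        return labels[best_part], probs_part[best_part], "PART"
    return labels[best_ada], probs_adaboost[best_ada], "Adaboost"


if __name__ == "__main__":
    # Made-up probabilities for one protein, for illustration only.
    part = [0.10, 0.60, 0.10, 0.10, 0.10]
    ada = [0.05, 0.05, 0.05, 0.05, 0.80]
    print(ensemble_predict(part, ada))
```

Under this rule the ensemble never outputs a label that neither component proposed; it only arbitrates between the two.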

2.3 Experimentation and Results

Data Set. We took 1001 eukaryotic protein sequences from five subcellular locations, extracted from http://www.bioinfo.tsinghua.edu.cn/~guotao/data/, of which 750 serve as training data and the remaining 251 as test data. For training, 150 protein sequences are taken from each subcellular location; 50 or 51 protein sequences from each of the five locations are used for testing.
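
A per-location split of this kind might be produced as in the Python sketch below. How the 150 training sequences per location were actually chosen is not stated, so the seeded random shuffle, the function name `split_per_location` and the `(sequence_id, location)` record format are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def split_per_location(records, n_train=150, seed=0):
    """Split (sequence_id, location) records into train and test lists.

    Assumption: the first `n_train` records of each location after a seeded
    shuffle go to training; the rest of that location go to testing.
    """
    by_location = defaultdict(list)
    for seq_id, location in records:
        by_location[location].append((seq_id, location))
    rng = random.Random(seed)
    train, test = [], []
    for location in sorted(by_location):
        group = by_location[location]
        rng.shuffle(group)
        train.extend(group[:n_train])
        test.extend(group[n_train:])
    return train, test


if __name__ == "__main__":
    # Synthetic placeholder IDs: four locations with 200 records, one with 201.
    sizes = {"cell wall": 200, "cytoplasm": 200, "extracellular": 200,
             "mitochondrion": 200, "nucleus": 201}
    records = [(f"{loc}_{k}", loc) for loc, n in sizes.items() for k in range(n)]
    train, test = split_per_location(records)
    print(len(train), len(test))
```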

Performance Measure. The performance of the classifiers is evaluated using two measures, the Matthews Correlation Coefficient and Accuracy, which are described as follows:

Matthews Correlation Coefficient (MCC)

The MCC is used in machine learning as a measure of the quality of binary (two-class) classifications. It takes into account true and false positives and negatives and is generally regarded as a balanced measure that can be used even if the classes are of different sizes. The MCC is a correlation coefficient between the observed and predicted binary classifications and returns a value between −1 and +1. A coefficient of +1 represents a perfect prediction, 0 a prediction no better than random and −1 an inverse prediction. Here, when a particular location (e.g., cell wall) is taken as the positive class, all other locations are treated as the negative class. TP, FP, TN and FN are thus calculated for every location and used to compute the MCC.

$$ MCC = \frac{(TP \times TN - FP \times FN)}{{\sqrt {(\left( {TP + FN} \right)\left( {TP + FP} \right)\left( {TN + FP} \right)\left( {TN + FN} \right))} }}. $$
(9)

Accuracy

Accuracy measures the fraction of correct decisions made by a predictor and is defined by

$$ Accuracy = \frac{(TP + TN)}{(TP + TN + FP + FN)}. $$
(10)

where TP, TN, FP and FN denote the numbers of true positives, true negatives, false positives and false negatives, respectively.
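
To make the one-vs-rest evaluation concrete, the Python sketch below derives TP, FP, TN and FN for one location from lists of true and predicted labels and then applies Eqs. 9 and 10. The function names are hypothetical, and returning 0 when the MCC denominator is zero is a common convention assumed here, not something stated in the text.

```python
import math


def one_vs_rest_counts(y_true, y_pred, positive):
    """Count TP, FP, TN, FN with `positive` as the positive class."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive and truth == positive:
            tp += 1
        elif pred == positive:
            fp += 1
        elif truth == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def mcc(tp, fp, tn, fn):
    """Eq. 9. Assumption: return 0.0 when the denominator is zero."""
    denom = math.sqrt((tp + fn) * (tp + fp) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom


def accuracy(tp, fp, tn, fn):
    """Eq. 10: fraction of correct decisions for this one-vs-rest problem."""
    total = tp + tn + fp + fn
    if total == 0:
        raise ValueError("no samples to evaluate")
    return (tp + tn) / total


if __name__ == "__main__":
    # Made-up labels for illustration only.
    y_true = ["nucleus", "nucleus", "cytoplasm", "cytoplasm"]
    y_pred = ["nucleus", "cytoplasm", "cytoplasm", "cytoplasm"]
    counts = one_vs_rest_counts(y_true, y_pred, "nucleus")
    print(counts, mcc(*counts), accuracy(*counts))
```

Repeating the call with each of the five locations as `positive` gives the per-location scores.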

Performance Evaluation. The experiment is conducted in two stages. In the first stage, four classifiers, namely PART, RBFNN, MLP and Adaboost, are applied to predict the subcellular location (cell wall, extracellular, mitochondrion, nucleus or cytoplasm) of the test proteins. In the second stage, the two best classifiers are taken as components of the ensemble classifier PLoc-Euk, which takes the prediction of whichever component classifies the test sample with higher confidence. Tables 1, 2, 3 and 4 show the MCC scores and accuracy measures of the four classifiers. From these tables, it is evident that the average accuracies of the PART and Adaboost classifiers are better than those of MLP and RBFNN. Finally, Table 5 shows the performance of PLoc-Euk, which in most cases performs well compared to its component classifiers. The performance of PLoc-Euk and its component classifiers is compared graphically in Fig. 1.

Table 1 Performance measures of PART classifier
Table 2 Performance measures of MLP classifier
Table 3 Performance measures of RBFNN classifier
Table 4 Performance measures of Adaboost classifier
Table 5 Performance measures of PLoc-Euk classifier
Fig. 1 Performance comparison of ensemble classifier PLoc-Euk and other classifiers

Comparison of PLoc-Euk with existing Predictors. We chose Cello v2.5 [18] and WOLF-PSORT [10] as existing methods for comparison because they are freely available and, although not very recent, are based on machine learning. To compare them with PLoc-Euk, the 251 test proteins were also submitted to Cello v2.5 and WOLF-PSORT. Figure 2 shows that for cytoplasmic, mitochondrial and nuclear proteins PLoc-Euk performs better than WOLF-PSORT. For mitochondrial proteins it performs better than both predictors. For extracellular proteins, however, PLoc-Euk does not perform well.

Fig. 2 Comparison of PLoc-Euk, Cello v2.5 and WOLF-PSORT on test proteins of five locations

Conclusion. Subcellular localization information gives insight into a protein's function, and predicting it has become a very challenging task in bioinformatics. Previously, signal-based, amino acid composition based and structure-based approaches have been used for computational prediction. In this work, we combine physicochemical properties of amino acids with their composition to form the input vector. We take seven relevant physicochemical properties and weight them by amino acid composition, so that the weighted properties indicate their intensity and dominance over the protein and help the predictor determine the subcellular location. The performance of different classifiers with these features has been observed: the PART and Adaboost classifiers perform well, as does the PLoc-Euk classifier built upon them. We also compare our work with some existing prediction systems. Signal-based information can be added to the physicochemical properties to strengthen the prediction power of this classifier. Individual physicochemical properties also have their own influence on whether a protein resides in a particular location within the cell. A larger number of physicochemical properties can therefore be considered, with a feature optimization technique employed to reduce the dimension of the input feature vector. More cellular locations can also be included to increase the number of classes, which would make our system more reliable. Further analysis may also establish a relationship between subcellular location and protein-protein interaction [19, 20] and domain information [21], which may be a topic for further research in bioinformatics.