Abstract
The sequence parameters for halophilic adaptation are still not fully understood. To understand the molecular basis of protein hypersaline adaptation, a detailed analysis is carried out and the likely association of protein sequence attributes with halophilic adaptation is investigated. A two-stage strategy is implemented. In the first stage, a supervised machine learning classifier is built, giving an overall accuracy of 86 % on stratified tenfold cross validation and 90 % on a blind testing set, both better than previously reported results. The second stage consists of statistical analysis of sequence features and possible extraction of halophilic molecular signatures. The results of this study show that halophilic proteins are characterized by lower average charge, lower K content, and lower S content. A statistically significant preference/avoidance list of sequence parameters is also reported, giving insights into the molecular basis of halophilic adaptation. D, Q, E, H, P, T, V are significantly preferred while N, C, I, K, M, F, S are significantly avoided. Among amino acid physicochemical groups, small, polar, charged, acidic and hydrophilic groups are preferred over other groups. The halophilic proteins also show a preference for higher average flexibility and higher average polarity, and an avoidance of higher average positive charge, average bulkiness and average hydrophobicity. Some interesting trends observed in dipeptide counts are also reported. Further, a systematic statistical comparison is undertaken to gain insights into the sequence feature distribution in different residue structural states. The current analysis may make the mechanism of halophilic adaptation clearer, which can in turn be used for the rational design of halophilic proteins.
Introduction
Extremophilic organisms can survive and carry out metabolic processes at the limits of environmental physical parameters such as temperature, pressure and salinity (Pikuta et al. 2007), under conditions that are life threatening to most living organisms on Earth. Halophiles are the class of extremophiles that flourish optimally in extreme salt conditions (Lanyi 1974; Eisenberg 1995). One mechanism of halophilic adaptation is the accumulation of osmolytes to balance the osmotic pressure (Mevarech et al. 2000); the other is alteration of the protein sequence (Reed et al. 2013; Smole et al. 2011). Adaptation in protein sequence parameters is one of the principal ways to cope with environmental extremes (Tekaia et al. 2002; Tekaia and Yeramian 2006; Brocchieri 2004; Nath et al. 2012; Nath and Subbiah 2014). Understanding the sequence parameters responsible for the stability and functioning of halophilic proteins can have a great impact on protein engineering (Madigan and Marrs 1997). Halophilic enzymes also hold enormous potential in biotechnology (Delgado-García et al. 2012). Previous studies have assessed the importance of amino acid composition, dipeptide composition and amino acid physicochemical properties in discriminating halophilic proteins from their non-halophilic counterparts (Ebrahimie et al. 2011; Zhang and Ge 2013a, b; Zhang et al. 2013; Smole et al. 2011). However, these methods did not assess the statistical significance of the differences between the halophilic and non-halophilic feature distributions. Apart from accurate classification/prediction, understanding the sequence parameters that lead to correct classification/prediction is equally important.
Almost all of the supervised machine learning algorithms implemented for halophilic protein sequence classification are black boxes, i.e., they give no clue about the underlying biological phenomena that lead to correct classification.
The sequence properties, along with the amino acid physicochemical properties, should be explored in more detail to answer a few salient questions: what are the sequence parameters, and in what combination do they represent the halophilic niche? What are the generalized sequence parameters for protein hypersaline adaptation?
To answer these questions, the proposed pipeline consists of two stages: in the first stage, the best possible classification for discriminating halophilic proteins from non-halophilic proteins is carried out, and human interpretable rules are generated; the second stage consists of statistical analysis of significant sequence features and search for the halophilic molecular signatures. Further, a systematic statistical comparison is carried out using halophilic and non-halophilic protein structures for gaining insights into the distribution of the sequence parameters according to residue structural states.
Materials and methods
Dataset
In the current study, the dataset of Zhang and Ge (2013a), which consists of 139 pairs of halophilic/non-halophilic proteins having less than 25 % sequence identity, is used. The blind testing set consists of 15 pairs of halophilic proteins and their non-halophilic structural homologs collected from Siglioccolo et al. (2011), as in Zhang and Ge (2013a). Only the protein structures from “salt in” organisms are considered for further sequence and structural analysis.
Protein sequence representation
For making the protein sequences suitable for machine learning analysis, it is necessary to convert the sequences into fixed length feature vectors. The following features are calculated for the protein sequence representation:
Amino acid composition (AAC)
The amino acid composition is the simplest feature that can be calculated for any protein sequence. It is calculated using the following formula:
\[AA_{i} = \frac{{{\text{Tot}}(AA_{i} )}}{{{\text{Tot}}(Res_{i} )}} \times 100\]
where AA denotes one of the 20 amino acid residues; \(AA_{i}\) denotes the percentage frequency of amino acid type ‘AA’ in the ith sequence; \({\text{Tot}}(AA_{i} )\) denotes the total count of amino acid type ‘AA’ in the ith sequence; \({\text{Tot}}(Res_{i} )\) denotes the total count of all the residues in the ith sequence (i.e. the length of the sequence).
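The composition defined above can be sketched in a few lines of Python. This is only an illustrative sketch, not the code used in the study: the function name `amino_acid_composition` and the example peptide are hypothetical, and the sketch assumes that every character of the input counts towards the sequence length.

```python
from collections import Counter

# The 20 standard residues in one-letter code.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def amino_acid_composition(sequence):
    """Return the percentage frequency of each standard residue in `sequence`."""
    seq = sequence.upper()
    if not seq:
        raise ValueError("sequence must not be empty")
    counts = Counter(seq)          # Tot(AA_i) for every residue type
    length = len(seq)              # Tot(Res_i), i.e. the sequence length
    return {aa: 100.0 * counts[aa] / length for aa in AMINO_ACIDS}

# Made-up peptide, used only to show the call.
aac = amino_acid_composition("DDEEKVAT")
print(aac["D"], aac["K"])  # 25.0 12.5
```

The result is a fixed-length, 20-dimensional description of the sequence regardless of its length, which is what makes it usable as a machine learning feature.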
Property group composition (PGC)
Reduced amino acid alphabets have been found useful in previous classification and comparative analyses. The amino acid groupings used in the current study are taken from Nath et al. (2013) and Kumari et al. (2015). The property group composition is calculated as:
\[PG_{i} = \frac{{{\text{Tot}}(PG_{i} )}}{{{\text{Tot}}(Res_{i} )}} \times 100\]
where PG denotes one of the 11 different amino acid property groups; \(PG_{i}\) denotes the percentage frequency of amino acid property group ‘PG’ in the ith sequence; \({\text{Tot}}(PG_{i} )\) denotes the total count of residues belonging to property group ‘PG’ in the ith sequence; \({\text{Tot}}(Res_{i} )\) denotes the total count of all the residues in the ith sequence.
Dipeptide counts (DC)
Dipeptide counts can take into account the coupling and local order effects of amino acid residues that are present in a protein sequence. It is a 400-dimensional feature vector consisting of counts of all the 400 possible dipeptides.
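As a minimal sketch of this feature (the function name `dipeptide_counts` and the example peptide are hypothetical, and overlapping dipeptides are assumed to be counted), the 400-dimensional vector can be built as follows:

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# All 400 ordered pairs of the 20 standard residues, in a fixed order.
DIPEPTIDES = ["".join(pair) for pair in product(AMINO_ACIDS, repeat=2)]

def dipeptide_counts(sequence):
    """Return a 400-element list of overlapping dipeptide counts."""
    seq = sequence.upper()
    counts = dict.fromkeys(DIPEPTIDES, 0)
    for i in range(len(seq) - 1):
        pair = seq[i:i + 2]
        if pair in counts:         # pairs with non-standard symbols are skipped
            counts[pair] += 1
    return [counts[dp] for dp in DIPEPTIDES]

# Made-up peptide: contains DD twice, DA once and AK once.
counts_vector = dipeptide_counts("DDDAK")
print(counts_vector[DIPEPTIDES.index("DD")])  # 2
```

Because the order of `DIPEPTIDES` is fixed, position k of the vector always refers to the same dipeptide across all proteins.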
Property group-2grams (PNG)
The conservation of physicochemical properties along the protein sequence is important for its function and structure. The property groups of amino acids presented in Table 1 are retained for the calculation of this feature, which captures the conservation of multiple physicochemical groups along the sequence. In a sliding window of length 2, if the amino acids within the window share any one of the physicochemical groups, then the count for that physicochemical group is incremented. The formula for calculating the property group-2grams for the small property group is given below:
\[{\text{Small-2grams}} = \sum\limits_{i = 1}^{N - 1} {C(i,i + 1)}\]
where N denotes the length of the protein sequence; i denotes the position of the amino acid residue along the protein sequence; C(i, i + 1) is a binary valued quantity: if the condition \((aa_{i} \in S^{*} {\text{ AND }}aa_{i + 1} \in S^{*} )\) is true, then C(i, i + 1) is one, otherwise it is zero. Here S* = {Ala, Cys, Asp, Gly, Asn, Pro, Ser, Thr, Val} is the set of small amino acids.
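The sliding-window count described above, together with the property group composition of the same group, can be sketched as follows. This sketch only uses the small group S*, since it is the only group whose members are spelled out in the text; the function names and the example peptide are hypothetical, and the 2gram value is assumed to be a raw count rather than a normalized frequency.

```python
# Small amino acids as listed in the text:
# Ala, Cys, Asp, Gly, Asn, Pro, Ser, Thr, Val.
SMALL = frozenset("ACDGNPSTV")

def property_group_composition(sequence, group):
    """Percentage of residues in `sequence` that belong to `group`."""
    seq = sequence.upper()
    if not seq:
        raise ValueError("sequence must not be empty")
    return 100.0 * sum(1 for aa in seq if aa in group) / len(seq)

def property_group_2grams(sequence, group):
    """Count windows of length 2 in which both residues belong to `group`."""
    seq = sequence.upper()
    # zip(seq, seq[1:]) yields the N - 1 adjacent pairs (aa_i, aa_{i+1}).
    return sum(1 for a, b in zip(seq, seq[1:]) if a in group and b in group)

peptide = "ADKGS"  # made-up example
small_pgc = property_group_composition(peptide, SMALL)   # A, D, G, S are small
small_2grams = property_group_2grams(peptide, SMALL)     # AD and GS qualify
print(small_pgc, small_2grams)  # 80.0 2
```

The same two functions apply unchanged to any other property group once its member residues are supplied as a set.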
Estimated isoelectric point (IP)
This feature is motivated by a previous study that reported the importance of charged residues and proteome charge in halophilic adaptation (Smole et al. 2011).
Atomic composition (AC)
The counts of carbon, hydrogen, nitrogen, oxygen and sulfur atoms are taken as the fifth component of the feature vector.
Average physicochemical properties (APCP)
The average values of six physicochemical properties are taken as the sixth component of the feature vector. The individual physicochemical property values of the amino acid residues are taken from the AAindex database (Kawashima et al. 1999). The following physicochemical properties are selected for constructing the feature vector: (1) average volumes of residues (AAindex ID PONJ960101), (2) average flexibility indices (AAindex ID BHAR880101), (3) hydrophobicity scale from native protein structures (AAindex ID CASG920101), (4) bulkiness (AAindex ID ZIMJ680102), (5) polarity (AAindex ID GRAR740102), (6) net charge (AAindex ID KLEP840101).
Classification protocol
In the current study, 12 individual classifiers [averaged one-dependence estimators (A1DE), Bayesian logistic regression (BLR), Bayesian networks (BN), naive Bayes (NB), radial basis function classifier (RBF), support vector machines with sequential minimal optimization (SMO), adaboosting (AB), bagging, random subspace (RSS), real adaboosting (RAB), rotation forest (ROF), random forest (RF)], along with three meta classifiers with random forest as base learners, are compared for prediction accuracy. Further, two common decision fusion schemes, majority voting and StackingC, are used to combine the decisions from the best individual classifiers. In majority voting, each base classifier predicts a class label, and the class label that is predicted most often is selected as the final class label (Kuncheva 2002). Stacking consists of two stages: in the first stage, base classifiers are used to predict the class labels; in the second stage, the outputs of the base classifiers are used as input to a further classifier that predicts the final class label (Wolpert 1992; Seewald 2002). All the experiments are performed using WEKA (Hall et al. 2009), and the default parameters are used to train all the predictors.
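The majority voting scheme described above can be illustrated with a short, library-free sketch. It is not WEKA's implementation: the function name `majority_vote`, the three hypothetical classifiers and their labels ("H" for halophilic, "N" for non-halophilic) are made up for illustration, and ties are broken alphabetically here purely as a simplifying assumption.

```python
from collections import Counter

def majority_vote(predictions):
    """Fuse per-classifier label lists into one label per instance.

    `predictions` holds one list of labels per base classifier; all lists
    must cover the same instances in the same order.
    """
    n_instances = len(predictions[0])
    if any(len(p) != n_instances for p in predictions):
        raise ValueError("all classifiers must label the same instances")
    fused = []
    for labels in zip(*predictions):            # labels for one instance
        ranked = Counter(labels).most_common()
        top_votes = ranked[0][1]
        tied = sorted(lab for lab, votes in ranked if votes == top_votes)
        fused.append(tied[0])                   # assumption: alphabetical tie-break
    return fused

# Three hypothetical base classifiers labelling three proteins.
fused_labels = majority_vote([
    ["H", "N", "N"],
    ["H", "H", "N"],
    ["N", "N", "H"],
])
print(fused_labels)  # ['H', 'N', 'N']
```

With an even number of base classifiers, ties become possible, so the tie-breaking rule matters and should be checked against the tool actually used.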
Extraction of strong instances
The best performing classifiers are used to develop the final model. For further statistical analysis, only the strong instances from both the training and testing sets are extracted and kept to form the correctly predicted instance dataset (CPI dataset). The strong instances consist only of true positives (correctly predicted halophilic proteins) and true negatives (correctly predicted non-halophilic proteins). In effect, the first-stage machine learning classifiers act as filters that sieve out the noise. The strong instances are used for gaining useful insights into the sequence parameters and for the extraction of halophilic signatures.
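The filtering step amounts to keeping the instances whose predicted label agrees with the true label. A minimal sketch, with hypothetical protein identifiers and labels ("H" for halophilic, "N" for non-halophilic), might look like this:

```python
def extract_strong_instances(ids, true_labels, predicted_labels):
    """Keep only correctly predicted instances (true positives and true negatives)."""
    if not (len(ids) == len(true_labels) == len(predicted_labels)):
        raise ValueError("inputs must have the same length")
    return [
        (protein_id, truth)
        for protein_id, truth, predicted in zip(ids, true_labels, predicted_labels)
        if truth == predicted      # misclassified instances are dropped
    ]

# Made-up example: p2 and p4 are misclassified and therefore filtered out.
strong = extract_strong_instances(
    ["p1", "p2", "p3", "p4"],
    ["H", "H", "N", "N"],
    ["H", "N", "N", "H"],
)
print(strong)  # [('p1', 'H'), ('p3', 'N')]
```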
Statistical test for searching significant sequence feature differences between halophilic and non-halophilic proteins
An independent samples t test is used to determine the significant differences between the two groups (halophilic/non-halophilic) at the 5 % significance level (Nath et al. 2013; Metpally and Reddy 2009). The t values for the different sequence features are calculated using the following formula:
\[t = \frac{{M_{\text{halophilic}} - M_{\text{non-halophilic}} }}{{\sqrt {\frac{{Var_{\text{halophilic}} }}{{n_{\text{halophilic}} }} + \frac{{Var_{\text{non-halophilic}} }}{{n_{\text{non-halophilic}} }}} }}\]
where \(Var_{\text{halophilic}}\) and \(Var_{\text{non-halophilic}}\) are the variances of the sequence feature in halophilic and non-halophilic proteins, respectively; \(M_{\text{halophilic}}\) and \(M_{\text{non-halophilic}}\) are the mean frequencies of the sequence feature in halophilic and non-halophilic proteins, respectively; \(n_{\text{halophilic}}\) and \(n_{\text{non-halophilic}}\) are the total numbers of halophilic and non-halophilic proteins, respectively.
If the t value is positive and greater than the critical value at 5 % probability (1.968), then the mean value of the sequence feature in halophilic proteins (\(M_{\text{halophilic}}\)) is significantly greater than that in non-halophilic proteins (\(M_{\text{non-halophilic}}\)) at the 95 % or higher confidence level. If the t value is negative and less than −1.968, then the mean value of the sequence feature in halophilic proteins is significantly less than that in non-halophilic proteins at the 95 % or higher confidence level. An FDR (false discovery rate) adjusted p value (q value) (Storey and Tibshirani 2003; Collard and Charles 2007; Noble 2009; Zheng et al. 2010) is applied to reduce the number of false positives among the significant sequence features. A q value cutoff of <0.05 is used to determine the significant sequence features between the two groups.
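The two steps of this test can be sketched as follows. Several assumptions are made and should be read as such: the t value is taken in the unequal-variance form t = (M_h − M_nh) / sqrt(Var_h/n_h + Var_nh/n_nh), which matches the quantities defined above; sample variances are used; and the multiple-testing step is shown with the Benjamini-Hochberg adjustment, a simpler stand-in for the Storey and Tibshirani q value (the two coincide when the estimated proportion of true nulls is 1). All function names and numbers are illustrative.

```python
import math
from statistics import mean, variance

def t_value(halophilic, non_halophilic):
    """t statistic for one feature from two lists of per-protein values."""
    standard_error = math.sqrt(
        variance(halophilic) / len(halophilic)
        + variance(non_halophilic) / len(non_halophilic)
    )
    return (mean(halophilic) - mean(non_halophilic)) / standard_error

def benjamini_hochberg(p_values):
    """Return FDR-adjusted p values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):               # walk from the largest p value down
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted

# Made-up feature values for three halophilic and three non-halophilic proteins.
t = t_value([12.0, 14.0, 16.0], [8.0, 9.0, 10.0])
preferred = t > 1.968                          # critical value quoted in the text

# Made-up p values for four features.
q = benjamini_hochberg([0.01, 0.04, 0.03, 0.20])
significant = [value < 0.05 for value in q]
print(round(t, 4), preferred, significant)
```

In this toy example only the first feature survives the q < 0.05 cutoff, even though three of the four raw p values are below 0.05, which is the effect the correction is meant to have.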
Feature ranking
The significant features obtained from the statistical test are also ranked according to their discriminative ability. The ranking is performed using the ReliefF algorithm (Kira and Rendell 1992).
PART-based halophilic signatures
Rule-based classifiers are used to gain insights into feature combinations and their corresponding target classes. PART is a partial decision tree-based rule generation algorithm that uses separate-and-conquer rule learning (Frank and Witten 1998). In each iteration, a partial decision tree is built and the leaf that covers the highest number of instances is turned into a rule. The covered instances are removed, and the procedure is repeated on the remaining instances. PART has previously been used successfully for gaining useful insights into biological phenomena (Nath and Subbiah 2014). Before using PART, the protein sequence features are discretized to improve prediction accuracy and to enhance the comprehensibility and interpretability of the generated rules (Wakulicz-Deja et al. 1998). Discretization is the conversion of a continuous feature into a nominal counterpart made up of a finite set of intervals. Both supervised and unsupervised discretization methods are implemented in the current study. Supervised discretization uses the target class information for binning, while unsupervised methods are independent of target class information. Equal frequency unsupervised discretization divides the data into n groups, each containing approximately the same number of values. Equal width binning divides the data into n intervals of equal size. The flow diagram for the current study is shown in Fig. 1.
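The difference between the two unsupervised discretization schemes is easiest to see on a small skewed sample. The sketch below is illustrative only and is not WEKA's implementation; the function names and values are hypothetical, and tied values may fall into different equal-frequency bins in this simplified version.

```python
def equal_width_bins(values, n_bins):
    """Assign each value to one of `n_bins` intervals of equal size."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0] * len(values)
    width = (hi - lo) / n_bins
    # The maximum value would land in bin n_bins, so it is clamped to the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]

def equal_frequency_bins(values, n_bins):
    """Assign each value to one of `n_bins` groups of roughly equal size."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, idx in enumerate(order):
        bins[idx] = rank * n_bins // len(values)
    return bins

# Made-up, right-skewed feature values.
values = [1.0, 2.0, 3.0, 4.0, 10.0, 20.0]
width_bins = equal_width_bins(values, 2)
frequency_bins = equal_frequency_bins(values, 2)
print(width_bins)      # [0, 0, 0, 0, 0, 1]
print(frequency_bins)  # [0, 0, 0, 1, 1, 1]
```

Equal width binning puts almost every value in the first interval because one large value stretches the range, whereas equal frequency binning splits the sample in half.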
Residue structural properties
The accessible surface area was first defined by Lee and Richards (1971). It is the area traced out by the centre of a probe sphere, representing a solvent molecule, as it is rolled over the surface of the molecule of interest. For comparison purposes, solvent accessibility is usually expressed as the relative solvent accessibility (RSA), which for an amino acid residue in a protein structure is a real number representing the solvent exposed surface area of this residue in relative terms. RSA is calculated as follows:
\[RSA_{i} = \frac{{SA_{i} }}{{MSA_{i} }} \times 100\]
where \(SA_{i}\) is the solvent accessibility of the ith amino acid; \(MSA_{i}\) is the maximum solvent accessibility of the ith amino acid and \(RSA_{i}\) is the relative solvent accessibility of the ith amino acid. The maximum accessibility values are obtained from an extended tripeptide conformation in which the residue of interest is at the centre and is flanked by either glycine or alanine, i.e. Gly-X-Gly or Ala-X-Ala. The solvent accessibility values are calculated by the DSSP program (Kabsch and Sander 1983). The maximum values of solvent accessibility of amino acid residues are taken from Rost and Sander (1994). All residues having ≤10 % RSA are defined as buried (or core) residues, and residues having more than 10 % RSA are defined as exposed (or surface) residues. In the current work, eight pairs of halophilic and non-halophilic protein structures are taken from the testing set to compare the distribution of sequence properties among the different locations of the protein structures.
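The RSA definition and the 10 % cutoff can be sketched as below. The function names are hypothetical, and the accessibility numbers in the example are placeholders chosen for easy arithmetic; they are not the published maximum values, which would have to be taken from the cited source.

```python
def relative_solvent_accessibility(sa, max_sa):
    """RSA in percent: observed accessibility relative to the residue's maximum."""
    if max_sa <= 0:
        raise ValueError("maximum solvent accessibility must be positive")
    return 100.0 * sa / max_sa

def structural_state(rsa, cutoff=10.0):
    """Residues with RSA <= cutoff are buried (core); the rest are exposed (surface)."""
    return "buried" if rsa <= cutoff else "exposed"

# Placeholder values for illustration only (not the published maxima).
rsa_surface = relative_solvent_accessibility(sa=80.0, max_sa=160.0)
rsa_core = relative_solvent_accessibility(sa=10.0, max_sa=200.0)
print(rsa_surface, structural_state(rsa_surface))  # 50.0 exposed
print(rsa_core, structural_state(rsa_core))        # 5.0 buried
```

A residue with exactly 10 % RSA is classed as buried, following the ≤10 % rule stated in the text.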
Performance evaluation metrics
The performances of the machine learning methods are evaluated by using threshold-dependent and threshold-independent parameters. These parameters are calculated from the values of true positives (TP), false negatives (FN), true negatives (TN) and false positives (FP).
Sensitivity
It expresses the percentage of correctly predicted halophilic proteins.
\[{\text{Sensitivity}} = \frac{TP}{TP + FN} \times 100\]
Specificity
It expresses the percentage of correctly predicted non-halophilic proteins.
\[{\text{Specificity}} = \frac{TN}{TN + FP} \times 100\]
Accuracy
It expresses the percentage of correctly predicted halophilic and non-halophilic proteins.
\[{\text{Accuracy}} = \frac{TP + TN}{TP + FP + TN + FN} \times 100\]
Matthews correlation coefficient (MCC)
It takes both sensitivity and specificity into account and is a valuable measure for selecting the optimally performing model in binary classification problems. Its value ranges from −1 to +1, where +1 means perfect prediction, zero means random prediction and −1 means total disagreement.
\[{\text{MCC}} = \frac{TP \times TN - FP \times FN}{{\sqrt {(TP + FP)(TP + FN)(TN + FP)(TN + FN)} }}\]
Area under ROC (AUC)
ROC (receiver operating characteristic) curves are summarized by a single numerical quantity known as the AUC. It is a threshold-independent parameter that is invariant to a priori class probabilities. Its value ranges from 0 to 1 (Bradley 1997), with 0 for the worst case, 0.5 for random ranking and 1 for the best prediction.
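The metrics described in this section can be computed directly from the four confusion-matrix counts, and the AUC from classifier scores. The sketch below uses the standard definitions of these measures; the function names, counts and scores are made up for illustration, and the AUC is computed with the pairwise ranking interpretation (the fraction of halophilic/non-halophilic pairs in which the halophilic protein receives the higher score, with ties counted as one half).

```python
import math

def classification_metrics(tp, fn, tn, fp):
    """Sensitivity, specificity and accuracy in percent, plus MCC."""
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "sensitivity": 100.0 * tp / (tp + fn),
        "specificity": 100.0 * tn / (tn + fp),
        "accuracy": 100.0 * (tp + tn) / (tp + fn + tn + fp),
        # MCC is conventionally set to 0 when a marginal total is empty.
        "mcc": (tp * tn - fp * fn) / denominator if denominator else 0.0,
    }

def auc(positive_scores, negative_scores):
    """Probability that a positive instance is scored above a negative one."""
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positive_scores
        for n in negative_scores
    )
    return wins / (len(positive_scores) * len(negative_scores))

# Made-up counts and scores, for illustration only.
metrics = classification_metrics(tp=40, fn=10, tn=45, fp=5)
area = auc([0.9, 0.8, 0.4], [0.7, 0.3, 0.2])
print(metrics["sensitivity"], metrics["specificity"], metrics["accuracy"])  # 80.0 90.0 85.0
print(round(metrics["mcc"], 3), round(area, 3))
```

Because the AUC is computed from the ranking of scores rather than from thresholded labels, it does not depend on the decision threshold, unlike the other four measures.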
Results and discussion
Machine learning based analysis
Table 2 presents the performance evaluation metrics for the different machine learning algorithms on both the stratified tenfold cross validation and the blind testing set. RAB (RF) (real adaboosting with random forest as base learners) and BLR performed best on the holdout testing set. Given their better performance, BLR and RAB (RF) are combined using two common decision fusion techniques, majority voting and StackingC. Majority voting is found to combine the decisions of BLR and RAB (RF) more effectively than StackingC.
Higher sensitivity, specificity and accuracy values are obtained by decision fusion of RAB (RF) and BLR using majority voting. An overall accuracy of 90 % is obtained by decision fusion, which is higher than the previously reported 80 % accuracy (Zhang and Ge 2013a) on the same dataset. Although a direct comparison cannot be made with the method of Zhang and Ge (2013b) because of differences in the dataset, the present method can complement other methods in important ways and can also be adapted to other datasets.
It is interesting to note that using random forest as the base learner gave a considerable improvement in the performance of the meta learners (RAB, AB and RSS) compared with using decision stumps (the default in WEKA). In particular, random forest as base learner notably improved the sensitivity of the meta learners.
After filtering out the strong instances from the output of the first-stage classifiers (from both training and testing sets), there are 127 true positives (correctly predicted halophilic proteins) and 127 true negatives (correctly predicted non-halophilic proteins), which are used to form the CPI dataset. Table 3 presents the performance evaluation metrics of the PART rule generation algorithm with different discretization methods on the CPI dataset.
The ROCs for PART with different discretization methods are presented in Fig. 2 (PART_WD: without discretization, PART_EF: with equal frequency discretization, PART_EW: with equal width discretization, PART_SD: with supervised discretization). The best performing model is PART with supervised discretization. The rules are extracted from PART with supervised discretization, and only the high confidence rules (rules without any misclassification) are analysed. Rules without any misclassification capture attributes that are specific to either the halophilic or the non-halophilic class. Certain combinations of sequence parameters are informative for a particular class. The biologically interpretable halophilic signatures are presented in Table 4.
PART generates rules of the form “If antecedent 1 and antecedent 2 … and antecedent N then consequent”. These combination rules are human interpretable and give the interaction patterns among the sequence features that are responsible for halophilicity. Lower average charge, lower lysine content and lower serine content are the most prominent features in the extracted signatures of the halophilic protein sequences.
Significant sequence features
The strong instances from the output of the first-stage classifiers are further evaluated to find the significant sequence feature differences between the halophilic and non-halophilic groups. The statistically significant features that are preferred/avoided in halophilic protein sequences are tabulated in Table 5 (all features except the dipeptides) and in Table 6 (significant dipeptides). Out of 453 sequence features, 145 features are found to be statistically significant (the corresponding p values, q values and t values are given in supplementary1.xlsx).
Amino acid composition and property group composition
Among amino acid residues, aspartic acid and glutamic acid are strongly preferred; in fact, the preference for aspartic acid is stronger than the preference for glutamic acid (see supplementary1.xlsx for t values). Small hydrophobic amino acids like valine are also preferred. Glutamic acid has one more methylene group than aspartic acid, which makes it slightly bulkier and results in a destabilizing effect, as observed in the mutational stability studies of Tadeo et al. (2009); this may contribute to the strong preference for aspartic acid over glutamic acid. Larger hydrophobic amino acids like isoleucine and methionine are strongly avoided. The preference for valine over larger hydrophobic amino acids may be attributed to its contribution to flexibility (Paul et al. 2008; Madern et al. 2000). Among polar amino acids, asparagine and serine are avoided while glutamine and threonine are preferred, in accordance with the findings of a previous study (Paul et al. 2008). Among nonpolar amino acids, phenylalanine and cysteine are avoided, which might give halophilic proteins more flexibility (Paul et al. 2008). Among the positively charged (basic) amino acids, only histidine is preferred, while lysine is strongly avoided. Overall, acidic amino acid residues are preferred, along with both polar (threonine, aspartic acid, glutamic acid, histidine, glutamine) and nonpolar (proline and valine) amino acids. An immediate effect of the increase in negatively charged residues is an overall increase in the polarity of halophilic proteins.
Dipeptides
The majority of the statistically significant features belong to the dipeptide counts (110 out of 145 significant features). Interesting trends can be observed in the significant dipeptide counts (Table 6). The salient observations from Table 6 are as follows: all *D (where * belongs to A/R/Q/E/G/H/D/P/T/W/Y/V) and D* (where * belongs to A/R/D/Q/E/G/H/L/P/T/V), all *Q (where * belongs to A/D/T) and Q* (where * belongs to R/D/E/T), all *P (where * belongs to A/D/G/V) and P* (where * belongs to R/D/T), *H (where * belongs to A/D) and H* (where * belongs to D/E/T/V), and W* (where * belongs to D/V) are preferred.
Likewise all *K (where * belongs to A/R/G/I/L/M/F/S/T/V) and K* (where * belongs to A/N/C/G/I/L/K/F/S/T/Y/V), *I (where * belongs to N/K/M/F/Y/V/L) and I* (where * belongs to A/N/C/G/I/L/K/M/F/S/T), *C (where * belongs to I/K) and C* (where * belongs to R/I/Y), *N (where * belongs to R/I/K) and N* (where * belongs to R/I), *M (where * belongs to A/I/S) and M* (where * belongs to I/L/K), *S (where * belongs to G/I/K/F) and S* (where * belongs to R/G/K/M/F/S/Y) are avoided.
Similarly, *E (where * belongs to D/Q/H/T/Y/V) and E* (where * belongs to A/R/D/E/T/V) except EF are preferred, while *F (where * belongs to E) and F* (where * belongs to R/I/K/S/T) except FT are avoided. *T (where * belongs to R/D/Q/E/H/P) and T* (where * belongs to D/Q/E/F) are preferred, whereas *T (where * belongs to I/K/F) and T* (where * belongs to K) are avoided. An overrepresentation of DA, RA, AD, RR, AP, DD, PD, EA, DV and an underrepresentation of LK, IL, II, IA, KK, IS, KA, GK, RK have also been reported by Zhang et al. (2013). The dipeptides of D are more strongly preferred than the dipeptides of E. The avoidance of dipeptides containing large hydrophobic amino acids, together with the preference for dipeptides of valine, may be attributed to the possibility that large hydrophobic residues hinder the maintenance of optimal flexibility in halophilic proteins (Paul et al. 2008).
Further grouping of the significant dipeptides according to the amino acid property groups also revealed some interesting trends (Table 7). A cutoff of 70 % is used to find the occurrence counts of specific physicochemical dipeptides falling in the avoidance/preference class. Tiny* dipeptides (where * belongs to any other amino acid property group) are avoided, except tiny-small. Aliphatic, nonpolar and basic dipeptides are also avoided, except when they are accompanied by acidic residues. Acidic* and *acidic dipeptides (where * belongs to any other amino acid property group) are preferred. Most of the preferred dipeptides contain acidic residues in either the preceding or the succeeding position, while most of the avoided dipeptides contain aliphatic, nonpolar or basic residues in either the preceding or the succeeding position.
Property group-2grams
Among property group-2grams, charged, acidic and hydrophilic-2grams are preferred while basic-2grams are avoided. Overrepresentation of short stretches of acidic regions has also been reported previously in halophilic proteins (Siddiqui and Thomas 2008).
Atomic composition
Higher sulfur counts are avoided, and sulfur count is the only feature of the atomic composition feature type that is found to be statistically significant. The avoidance of the sulfur-containing amino acids methionine and cysteine is reflected in the avoidance of sulfur in halophilic proteins.
Average physicochemical properties
Except average volume, all other physicochemical properties are found to be statistically significant. Higher values of average flexibility are preferred along with higher values of average polarity in halophiles. Higher values of average charge, higher values of average bulkiness and higher values of average hydrophobicity are avoided in halophiles.
Feature ranking
Further, the ReliefF algorithm is used to rank the 145 significant features according to their discriminative ability (the rankings of all 145 significant features are given in supplementary2.xlsx).
It is evident from the box plots (Fig. 3) that the distribution of the top ten features is quite different between the two groups. Notably, the spread for methionine in halophiles is more pronounced than that of the other top ranked features.
To remain structurally and functionally stable, the proteins of halophiles undergo many changes at the sequence level, which are reflected in the amino acid composition. The avoidance of lysine and the preference for acidic residues play important roles in preventing the aggregation of proteins at high salt concentrations. The repulsive electrostatic interaction between the acidic residues prevents non-specific interactions in proteins and also contributes towards the maintenance of flexibility, which is essential for the catalytic activity of enzymes (Siddiqui and Thomas 2008; Lanyi 1974; Elcock and McCammon 1998; Mevarech et al. 2000). Acidic residues also assist in the binding of salt and water molecules (Siddiqui and Thomas 2008; Kuntz 1971). The role of the hydrated ion network in the maintenance of stability is stressed by Zaccai et al. (1989). The preference for proline and the avoidance of isoleucine in halophilic proteins contribute towards a reduction in overall bulkiness; reduced residue bulkiness is essential for the structural stability of halophilic proteins (Kimura and Kimura 1987). The preference for small hydrophobic amino acids like proline and valine and the avoidance of large hydrophobic amino acids like phenylalanine, isoleucine and methionine also contribute towards the preservation of optimal flexibility in halophilic protein sequences (Siddiqui and Thomas 2008). The underrepresentation of cysteine and the overrepresentation of threonine are also reported by Paul et al. (2008). Higher values of IP are avoided in halophilic proteins, which is a direct consequence of the higher percentage of acidic residues and the lower percentage of basic residues (Smole et al. 2011).
Distribution of sequence features in different residue structural states
The distributions of amino acid residues, amino acid property groups, atomic composition and physicochemical properties are compared according to residue location (dipeptide counts and property group-ngrams are not compared, as these features depend on the order of residues in the sequence). Three types of comparisons are carried out: (1) between halophilic surface and halophilic core residues, (2) between halophilic core and non-halophilic core residues, (3) between halophilic surface and non-halophilic surface residues.
In comparison to halophilic cores, halophilic surfaces prefer R, D, Q, E, K and avoid A, I, L and V. Among amino acid property groups, tiny, aliphatic, nonpolar and hydrophobic are avoided, and polar, charged, basic, acidic and hydrophilic are preferred in halophilic surfaces. Average flexibility and average polarity are also preferred, and average hydrophobicity and average bulkiness are avoided. Halophilic surfaces also tend to have lower average charge. However, no significant differences are observed between the halophilic core and non-halophilic core residues. In comparison to non-halophilic surfaces, halophilic surfaces most prominently prefer D, the acidic property group and lower average charge (more negative charge), and avoid K and the basic property group (see supplementary3.xlsx for the FDR-corrected t test analysis). The differences in sequence parameters are thus more pronounced on the halophilic surfaces than in the core. The large number of negatively charged residues on the halophilic surfaces increases solubility and prevents aggregation, while the avoidance of the basic property group, notably K, on the surfaces can be attributed to its long side chain, which has a destabilizing effect because it disrupts the hydration shell network that is important in preventing aggregation in halophilic environments (Tadeo et al. 2009; Britton et al. 1998; Kastritis et al. 2007).
Conclusion
In the present work, a detailed analysis of the possible association of protein sequence parameters with hypersaline adaptation is carried out using strong instances filtered from an accurate classifier. Certain trends in amino acid composition, dipeptide composition and physicochemical properties are observed, and their possible role in halophilicity is discussed. The salient feature of the current work is the statistically significant avoidance/preference lists of sequence features, along with their corresponding ranks in discriminating between the two groups. Out of 453 features, 145 features are found to be statistically significant. The important signatures of halophilic protein sequences are lower average hydrophobicity, higher average flexibility, lower average bulkiness, higher average polarity and lower average charge (more negative charge). Isoleucine, lysine and serine are strongly avoided, while aspartic acid and glutamic acid are more strongly preferred than any other amino acid residue. The acidic property group is strongly preferred in comparison to the other significant property groups. To our knowledge, this is also the first report of statistically significant avoidance/preference of sequence features of halophilic proteins. The current feature set gave “good enough” prediction accuracy: a 10 % increase in overall accuracy is obtained over the previously reported accuracy on the same dataset. Based on the current feature set, possible patterns for halophilicity are extracted. The current study also reports the preference/avoidance of dipeptides grouped according to the amino acid physicochemical grouping. Further, a systematic statistical comparison is carried out to determine the distribution of sequence parameters in the interior and on the surface of the halophilic proteins. The current analysis should facilitate the understanding of the possible association of sequence parameters with halophilicity and of the mechanism of halophilic adaptation.
References
Bradley AP (1997) The use of the area under the ROC curve in the evaluation of machine learning algorithms. Pattern Recogn 30(7):1145–1159. doi:10.1016/S0031-3203(96)00142-2
Britton KL, Stillman TJ, Yip KSP, Forterre P, Engel PC, Rice DW (1998) Insights into the molecular basis of salt tolerance from the study of glutamate dehydrogenase from Halobacterium salinarum. J Biol Chem 273(15):9023–9030. doi:10.1074/jbc.273.15.9023
Brocchieri L (2004) Environmental signatures in proteome properties. Proc Natl Acad Sci USA 101(22):8257–8258. doi:10.1073/pnas.0402797101
Collard MD, Charles D (2007) A razor may be sharper than an ax, but it cannot cut wood. Anesthesiology 106(3):420–422
Delgado-García M, Valdivia-Urdiales B, Aguilar-González CN, Contreras-Esquivel JC, Rodríguez-Herrera R (2012) Halophilic hydrolases as a new tool for the biotechnological industries. J Sci Food Agric 92(13):2575–2580. doi:10.1002/jsfa.5860
Ebrahimie E, Ebrahimi M, Sarvestani N, Ebrahimi M (2011) Protein attributes contribute to halo-stability, bioinformatics approach. Saline Syst 7(1):1
Eisenberg H (1995) Life in unusual environments: progress in understanding the structure and function of enzymes from extreme halophilic bacteria. Arch Biochem Biophys 318(1):1–5. doi:10.1006/abbi.1995.1196
Elcock AH, McCammon JA (1998) Electrostatic contributions to the stability of halophilic proteins. J Mol Biol 280(4):731–748. doi:10.1006/jmbi.1998.1904
Frank E, Witten IH (1998) Generating accurate rule sets without global optimization. Paper presented at the proceedings of the fifteenth international conference on machine learning
Hall M, Frank E, Holmes G, Pfahringer B, Reutemann P, Witten IH (2009) The WEKA data mining software: an update. SIGKDD Explor Newsl 11(1):10–18. doi:10.1145/1656274.1656278
Kabsch W, Sander C (1983) Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features. Biopolymers 22(12):2577–2637. doi:10.1002/bip.360221211
Kastritis PL, Papandreou NC, Hamodrakas SJ (2007) Haloadaptation: insights from comparative modeling studies of halophilic archaeal DHFRs. Int J Biol Macromol 41(4):447–453. doi:10.1016/j.ijbiomac.2007.06.005
Kawashima S, Ogata H, Kanehisa M (1999) AAindex: amino acid index database. Nucleic Acids Res 27(1):368–369. doi:10.1093/nar/27.1.368
Kimura J, Kimura M (1987) The primary structures of ribosomal proteins S14 and S16 from the archaebacterium Halobacterium marismortui. Comparison with eubacterial and eukaryotic ribosomal proteins. J Biol Chem 262(25):12150–12157
Kira K, Rendell LA (1992) A practical approach to feature selection. Paper presented at the proceedings of the ninth international workshop on machine learning, Aberdeen, Scotland, United Kingdom
Kumari P, Nath A, Chaube R (2015) Identification of human drug targets using machine-learning algorithms. Comp Biol Med 56:175–181. doi:10.1016/j.compbiomed.2014.11.008
Kuncheva LI (2002) A theoretical study on six classifier fusion strategies. IEEE Trans Pattern Anal Mach Intell 24(2):281–286. doi:10.1109/34.982906
Kuntz ID (1971) Hydration of macromolecules. III. Hydration of polypeptides. J Am Chem Soc 93(2):514–516. doi:10.1021/ja00731a036
Lanyi JK (1974) Salt-dependent properties of proteins from extremely halophilic bacteria. Bacteriol Rev 38(3):272–290
Lee B, Richards FM (1971) The interpretation of protein structures: estimation of static accessibility. J Mol Biol 55(3):379–400. doi:10.1016/0022-2836(71)90324-X
Madern D, Ebel C, Zaccai G (2000) Halophilic adaptation of enzymes. Extremophiles 4(2):91–98. doi:10.1007/s007920050142
Madigan MT, Marrs BL (1997) Extremophiles. Sci Am 276(4):82–87
Metpally R, Reddy B (2009) Comparative proteome analysis of psychrophilic versus mesophilic bacterial species: insights into the molecular basis of cold adaptation of proteins. BMC Genom 10(1):11
Mevarech M, Frolow F, Gloss LM (2000) Halophilic enzymes: proteins with a grain of salt. Biophys Chem 86(2–3):155–164. doi:10.1016/S0301-4622(00)00126-5
Nath A, Chaube R, Karthikeyan S (2012) Discrimination of psychrophilic and mesophilic proteins using random forest algorithm. In: Biomedical Engineering and Biotechnology (iCBEB), 2012 International Conference on 28–30 May 2012, pp 179–182. doi:10.1109/iCBEB.2012.151
Nath A, Subbiah K (2014) Inferring biological basis about psychrophilicity by interpreting the rules generated from the correctly classified input instances by a classifier. Comput Biol Chem Part B. doi:10.1016/j.compbiolchem.2014.10.002
Nath A, Chaube R, Subbiah K (2013) An insight into the molecular basis for convergent evolution in fish antifreeze proteins. Comput Biol Med 43(7):817–821. doi:10.1016/j.compbiomed.2013.04.013
Noble WS (2009) How does multiple testing correction work? Nat Biotech 27(12):1135–1137
Paul S, Bag S, Das S, Harvill E, Dutta C (2008) Molecular signature of hypersaline adaptation: insights from genome and proteome composition of halophilic prokaryotes. Genome Biol 9(4):1–19. doi:10.1186/gb-2008-9-4-r70
Pikuta EV, Hoover RB, Tang J (2007) Microbial extremophiles at the limits of life. Crit Rev Microbiol 33(3):183–209. doi:10.1080/10408410701451948
Reed CJ, Lewis H, Trejo E, Winston V, Evilia C (2013) Protein adaptations in archaeal extremophiles. Archaea 2013:14. doi:10.1155/2013/373275
Rost B, Sander C (1994) Conservation and prediction of solvent accessibility in protein families. Proteins Struct Funct Bioinf 20(3):216–226. doi:10.1002/prot.340200303
Seewald AK (2002) How to make stacking better and faster while also taking care of an unknown weakness. Paper presented at the proceedings of the nineteenth international conference on machine learning
Siddiqui KS, Thomas T (eds) (2008) Protein adaptation in extremophiles. Molecular anatomy and physiology of proteins, Uversky VN (series ed). Nova Biomedical Books, New York
Siglioccolo A, Paiardini A, Piscitelli M, Pascarella S (2011) Structural adaptation of extreme halophilic proteins through decrease of conserved hydrophobic contact surface. BMC Struct Biol 11(1):1–12. doi:10.1186/1472-6807-11-50
Smole Z, Nikolic N, Supek F, Smuc T, Sbalzarini I, Krisko A (2011) Proteome sequence features carry signatures of the environmental niche of prokaryotes. BMC Evol Biol 11(1):26
Storey JD, Tibshirani R (2003) Statistical significance for genomewide studies. Proc Natl Acad Sci 100(16):9440–9445. doi:10.1073/pnas.1530509100
Tadeo X, López-Méndez B, Trigueros T, Laín A, Castaño D, Millet O (2009) Structural basis for the aminoacid composition of proteins from halophilic archaea. PLoS Biol 7(12):e1000257. doi:10.1371/journal.pbio.1000257
Tekaia F, Yeramian E (2006) Evolution of proteomes: fundamental signatures and global trends in amino acid compositions. BMC Genom 7(1):307
Tekaia F, Yeramian E, Dujon B (2002) Amino acid composition of genomes, lifestyles of organisms, and evolutionary trends: a global picture with correspondence analysis. Gene 297(1–2):51–60. doi:10.1016/S0378-1119(02)00871-5
Wakulicz-Deja A, Boryczka M, Paszek P (1998) Discretization of continuous attributes on decision system in mitochondrial encephalomyopathies. In: Polkowski L, Skowron A (eds) Rough sets and current trends in computing. Lecture notes in computer science, vol 1424. Springer, Berlin, pp 483–490. doi:10.1007/3-540-69115-4_66
Wolpert DH (1992) Original contribution: stacked generalization. Neural Netw 5(2):241–259. doi:10.1016/s0893-6080(05)80023-1
Zaccai G, Cendrin F, Haik Y, Borochov N, Eisenberg H (1989) Stabilization of halophilic malate dehydrogenase. J Mol Biol 208(3):491–500. doi:10.1016/0022-2836(89)90512-3
Zhang G, Ge H (2013a) Protein hypersaline adaptation: insight from amino acids with machine learning algorithms. Protein J 32(4):239–245. doi:10.1007/s10930-013-9484-3
Zhang G, Ge H (2013b) Support vector machine with a Pearson VII function kernel for discriminating halophilic and non-halophilic proteins. Comput Biol Chem 46:16–22. doi:10.1016/j.compbiolchem.2013.05.001
Zhang G, Huihua G, Yi L (2013) Stability of halophilic proteins: from dipeptide attributes to discrimination classifier. Int J Biol Macromol 53:1–6. doi:10.1016/j.ijbiomac.2012.10.031
Zheng J, Khil PP, Camerini-Otero RD, Przytycka TM (2010) Detecting sequence polymorphisms associated with meiotic recombination hotspots in the human genome. Genome Biol 11(10):R103. doi:10.1186/gb-2010-11-10-r103
Ethics declarations
Conflict of interest
The author declares that there is no conflict of interest.
Additional information
Handling Editor: S. C. E. Tosatto.
Cite this article
Nath, A. Insights into the sequence parameters for halophilic adaptation. Amino Acids 48, 751–762 (2016). https://doi.org/10.1007/s00726-015-2123-x
DOI: https://doi.org/10.1007/s00726-015-2123-x