Introduction

Extremophilic organisms can survive and carry out metabolic processes at the limits of environmental physical parameters such as temperature, pressure and salinity (Pikuta et al. 2007), conditions that are life threatening to most living organisms on Earth. Halophiles are the class of extremophiles that flourish optimally in extreme salt conditions (Lanyi 1974; Eisenberg 1995). One mechanism of halophilic adaptation is the accumulation of osmolytes to balance the osmotic pressure (Mevarech et al. 2000); the other is alteration of the protein sequence (Reed et al. 2013; Smole et al. 2011). Adaptation in protein sequence parameters is one of the principal ways to cope with environmental extremes (Tekaia et al. 2002; Tekaia and Yeramian 2006; Brocchieri 2004; Nath et al. 2012; Nath and Subbiah 2014). Understanding the sequence parameters responsible for the stability and functioning of halophilic proteins can have a great impact on protein engineering (Madigan and Marrs 1997). Halophilic enzymes also hold enormous potential in biotechnology (Delgado-García et al. 2012). Previous studies have examined the importance of amino acid composition, dipeptide composition and amino acid physicochemical properties in discriminating halophilic proteins from their non-halophilic counterparts (Ebrahimie et al. 2011; Zhang and Ge 2013a, b; Zhang et al. 2013; Smole et al. 2011). However, these methods did not assess the statistical significance of the differences between the halophilic and non-halophilic feature distributions. Apart from accurate classification/prediction, understanding the sequence parameters that lead to correct classification/prediction is equally important. Almost all of the supervised machine learning algorithms implemented for halophilic protein sequence classification are black boxes, i.e., they give no clue about the underlying biological phenomena that result in correct classification.

The sequence properties, along with the amino acid physicochemical properties, should be explored in more detail to answer a few salient questions: which sequence parameters, and in what combination, represent the halophilic niche? What are the generalized sequence parameters for protein hypersaline adaptation?

To answer these questions, the proposed pipeline consists of two stages: in the first stage, the best possible classification for discriminating halophilic proteins from non-halophilic proteins is carried out, and human interpretable rules are generated; the second stage consists of statistical analysis of significant sequence features and search for the halophilic molecular signatures. Further, a systematic statistical comparison is carried out using halophilic and non-halophilic protein structures for gaining insights into the distribution of the sequence parameters according to residue structural states.

Materials and methods

Dataset

In the current study, the dataset of Zhang and Ge (2013a), which consists of 139 pairs of halophilic/non-halophilic proteins with less than 25 % sequence identity, is used. The blind testing set consists of 15 pairs of halophilic proteins and their non-halophilic structural homologs collected from Siglioccolo et al. (2011), as in Zhang and Ge (2013a). Only the protein structures from “salt in” organisms are considered for further sequence and structural analysis.

Protein sequence representation

For making the protein sequences suitable for machine learning analysis, it is necessary to convert the sequences into fixed length feature vectors. The following features are calculated for the protein sequence representation:

Amino acid composition (AAC)

The amino acid composition is the simplest feature that can be calculated for any protein sequence. It is calculated by using the following formula:

$${\text{AA}}_{i} = \frac{{{\text{Tot}}\left( {{\text{AA}}_{i} } \right)}}{{{\text{Tot}}\left( {{\text{Res}}_{i} } \right)}} \times 100$$
(1)

where AA denotes one of the 20 amino acid residues; \({\text{AA}}_{i}\) denotes the percentage frequency of amino acid type ‘AA’ in the ith sequence; \({\text{Tot}}({\text{AA}}_{i})\) denotes the total count of amino acids of type ‘AA’ in the ith sequence; \({\text{Tot}}({\text{Res}}_{i})\) denotes the total count of all the residues in the ith sequence (i.e. the length of the sequence).
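As an illustration of Eq. 1, the following minimal Python sketch computes the percentage composition of the 20 standard residues. The function name and the example sequence are hypothetical and are not taken from the original study.

```python
from collections import Counter

# One-letter codes of the 20 standard amino acid residues.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def amino_acid_composition(sequence):
    """Return the percentage frequency of each standard residue (Eq. 1)."""
    sequence = sequence.upper()
    if not sequence:
        raise ValueError("sequence must not be empty")
    counts = Counter(sequence)
    length = len(sequence)  # Tot(Res_i): total number of residues
    return {aa: 100.0 * counts[aa] / length for aa in AMINO_ACIDS}

# Hypothetical six-residue example sequence.
aac = amino_acid_composition("DDEEVK")
```

The result is a fixed-length vector of 20 values regardless of the sequence length, which is what makes it usable as machine learning input.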

Property group composition (PGC)

Reduced amino acid alphabets have been found useful in previous classification and comparative analyses. The amino acid groupings used in the current study are taken from Nath et al. (2013) and Kumari et al. (2015). The percentage frequency of each property group is calculated as:

$${\text{PG}}_{i} = \frac{{{\text{Tot}}\left( {{\text{PG}}_{i} } \right)}}{{{\text{Tot}}\left( {{\text{Res}}_{i} } \right)}} \times 100$$
(2)

where PG denotes one of the 11 different amino acid property groups; \({\text{PG}}_{i}\) denotes the percentage frequency of amino acid property group ‘PG’ in the ith sequence; \({\text{Tot}}({\text{PG}}_{i})\) denotes the total count of residues belonging to property group ‘PG’ in the ith sequence; \({\text{Tot}}({\text{Res}}_{i})\) denotes the total count of all the residues in the ith sequence.

Dipeptide counts (DC)

Dipeptide counts can take into account the coupling and local order effects of amino acid residues that are present in a protein sequence. It is a 400-dimensional feature vector consisting of counts of all the 400 possible dipeptides.
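A minimal sketch of the 400-dimensional dipeptide count vector is given below. The function name and the example sequence are hypothetical, and skipping pairs that contain non-standard symbols is an assumption, since the text does not say how such residues are handled.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def dipeptide_counts(sequence):
    """Count every overlapping dipeptide; returns all 400 keys, zeros included."""
    counts = {a + b: 0 for a, b in product(AMINO_ACIDS, repeat=2)}
    sequence = sequence.upper()
    for i in range(len(sequence) - 1):
        pair = sequence[i:i + 2]
        # Assumption: pairs with non-standard symbols (e.g. X) are ignored.
        if pair in counts:
            counts[pair] += 1
    return counts

# Hypothetical example: overlapping pairs are DD, DA, AD, DD.
dc = dipeptide_counts("DDADD")
```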

Property group-2grams (PNG)

The conservation of physicochemical properties along the protein sequence is important for its function and structure. The property groups of amino acids presented in Table 1 are retained for the calculation of this feature, which captures the conservation of multiple physicochemical groups along the sequence. In a sliding window of length 2, if both amino acids within the window share a physicochemical group, the count for that group is incremented. The formula for calculating the physicochemical-2grams for the small property group is given below:

$${\text{Physicochemical}} - 2\;{\text{grams}}:{\text{small}} = \sum\limits_{i = 1}^{{N-1}} {C(i,i + 1)}$$
(3)

where N denotes the length of the protein sequence; i denotes the position of the amino acid residue along the protein sequence; C(i, i + 1) is a binary-valued quantity: if the condition \((aa_{i} \in S^{*} {\text{ AND }}aa_{i + 1} \in S^{*} )\) is true, then C(i, i + 1) is one; otherwise it is zero. Here S* = {Ala, Cys, Asp, Gly, Asn, Pro, Ser, Thr, Val} is the set of small amino acids.
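Eq. 3 can be sketched in a few lines of Python. The set of small residues is S* as defined above, written in one-letter codes; the function name and the example sequence are hypothetical.

```python
# S* from the text: Ala, Cys, Asp, Gly, Asn, Pro, Ser, Thr, Val.
SMALL = frozenset("ACDGNPSTV")

def property_2gram_count(sequence, group=SMALL):
    """Sum C(i, i + 1) over all adjacent residue pairs (Eq. 3)."""
    sequence = sequence.upper()
    return sum(
        1
        for i in range(len(sequence) - 1)
        if sequence[i] in group and sequence[i + 1] in group
    )

# Hypothetical example: A-D and G-S are small-small pairs; D-K and K-G are not.
small_2grams = property_2gram_count("ADKGS")
```

Passing a different residue set as `group` would give the 2gram count for another property group under the same definition.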

Table 1 The property groups of amino acids that have been taken for the current study

Estimated isoelectric point (IP)

This feature is motivated by a previous study that reported the importance of charged residues and proteome charge in halophilic adaptation (Smole et al. 2011).

Atomic composition (AC)

The counts for the carbon, hydrogen, nitrogen, oxygen, and sulfur are taken as the fifth component of the feature vector.

Average physicochemical properties (APCP)

The average values of six physicochemical properties are taken as the sixth component of the feature vector. The individual physicochemical property values of the amino acid residues are taken from the AAindex database (Kawashima et al. 1999). The following physicochemical properties are selected for the purpose of constructing the feature vector: (1) average volumes of residues (aaindex id-PONJ960101), (2) average flexibility indices (aaindex id-BHAR880101), (3) hydrophobicity scale from native protein structures (aaindex id-CASG920101), (4) bulkiness (aaindex id -ZIMJ680102), (5) polarity (aaindex id-GRAR740102), (6) net charge (aaindex id-KLEP840101).

Classification protocol

In the current study, 12 individual classifiers [averaged one-dependence estimators (A1DE), Bayesian logistic regression (BLR), Bayesian networks (BN), naive Bayes (NB), radial basis function classifier (RBF), support vector machines with sequential minimal optimization (SMO), adaboosting (AB), bagging, random subspace (RSS), real adaboosting (RAB), rotation forest (ROF) and random forest (RF)], along with three meta classifiers with random forest as base learners, are compared for prediction accuracy. Two common decision fusion schemes, majority voting and StackingC, are also used to combine the decisions of the best individual classifiers. In majority voting, each base classifier predicts a class label, and the label predicted most often is selected as the final class label (Kuncheva 2002). Stacking consists of two stages: in the first stage, base classifiers predict the class labels; in the second stage, the outputs of the base classifiers are used as inputs to a second classifier that predicts the final class label (Wolpert 1992; Seewald 2002). All the experiments are performed using WEKA (Hall et al. 2009), and the default parameters are used to train all the predictors.
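The majority voting scheme described above can be illustrated with the following sketch. It is not the WEKA implementation used in the study: the function name is hypothetical, and ties are broken in favour of the label seen first, which is an assumption made only to keep the example deterministic.

```python
from collections import Counter

def majority_vote(predicted_labels):
    """Return the class label predicted by the most base classifiers."""
    if not predicted_labels:
        raise ValueError("at least one prediction is required")
    counts = Counter(predicted_labels)
    # Assumption: on a tie, max() keeps the label that was encountered first.
    return max(counts, key=counts.get)

# Hypothetical predictions from three base classifiers for one protein.
final_label = majority_vote(["halophilic", "non-halophilic", "halophilic"])
```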

Extraction of strong instances

The best performing classifiers are used to develop the final model. For further statistical analysis, only the strong instances from both the training and testing sets are extracted and kept to form the correctly predicted instance dataset (CPI dataset). The strong instances consist only of true positives (correctly predicted halophilic proteins) and true negatives (correctly predicted non-halophilic proteins). In effect, the first-stage machine learning classifiers act as filters that sieve out the noise. The strong instances are used for gaining useful insights into the sequence parameters and for the extraction of halophilic signatures.
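The filtering step can be sketched as follows, assuming each prediction is available as a tuple of identifier, true label and predicted label. The record layout, the function name and the protein identifiers are hypothetical.

```python
def extract_strong_instances(records):
    """Keep only instances whose predicted label equals the true label.

    records: iterable of (identifier, true_label, predicted_label) tuples.
    Returns a list of (identifier, true_label) for TP and TN instances.
    """
    return [(ident, true) for ident, true, pred in records if true == pred]

# Hypothetical first-stage output for four proteins.
predictions = [
    ("protein_1", "halophilic", "halophilic"),          # true positive, kept
    ("protein_2", "halophilic", "non-halophilic"),      # false negative, dropped
    ("protein_3", "non-halophilic", "non-halophilic"),  # true negative, kept
    ("protein_4", "non-halophilic", "halophilic"),      # false positive, dropped
]
cpi_dataset = extract_strong_instances(predictions)
```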

Statistical test for searching significant sequence feature differences between halophilic and non-halophilic proteins

An independent-samples t test is used to determine the significant differences between the two groups (halophilic/non-halophilic) at the 5 % significance level (Nath et al. 2013; Metpally and Reddy 2009). The t values for the different sequence features are calculated using the following formula:

$$t = \frac{{M_{{{\text{halophilic}}\left( {{\text{sequence}}\,{\text{features}}} \right)}} - M_{{{\text{non-halophilic}}\left( {{\text{sequence}}\,{\text{features}}} \right)}} }}{{\sqrt {\left( {{\text{Var}}_{\text{halophilic}} /n_{\text{halophilic}} } \right) + \left( {{\text{Var}}_{\text{non-halophilic}} /n_{\text{non-halophilic}} } \right)} }}$$
(4)

where \({\text{Var}}_{\text{halophilic}}\) and \({\text{Var}}_{\text{non-halophilic}}\) are the variances of the sequence features of halophilic and non-halophilic proteins, respectively; \(M_{\text{halophilic}}\) and \(M_{\text{non-halophilic}}\) are the mean frequencies of the sequence features of halophilic and non-halophilic proteins, respectively; \(n_{\text{halophilic}}\) and \(n_{\text{non-halophilic}}\) are the total numbers of halophilic and non-halophilic proteins, respectively.
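Eq. 4 can be evaluated with the Python standard library alone, as in the sketch below. The use of the sample variance (n − 1 denominator) is an assumption, since the text does not specify the estimator, and the feature values are made up for illustration.

```python
from math import sqrt
from statistics import mean, variance

def t_value(halophilic, non_halophilic):
    """t statistic of Eq. 4 for one sequence feature.

    Assumption: variance() is the sample variance (n - 1 denominator).
    """
    difference = mean(halophilic) - mean(non_halophilic)
    standard_error = sqrt(
        variance(halophilic) / len(halophilic)
        + variance(non_halophilic) / len(non_halophilic)
    )
    return difference / standard_error

# Made-up percentage values of one feature in three proteins per group.
t = t_value([12.0, 14.0, 16.0], [6.0, 8.0, 10.0])
```

A positive value indicates a higher mean in the halophilic group, and swapping the two arguments only flips the sign.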

If the t value is positive and greater than the critical value at the 5 % significance level (1.968), then the mean value of the sequence feature in halophilic proteins (\(M_{\text{halophilic}}\)) is significantly greater than that in non-halophilic proteins (\(M_{\text{non-halophilic}}\)) at the 95 % or higher confidence level. If the t value is negative and less than −1.968, then the mean value of the sequence feature in halophilic proteins is significantly less than that in non-halophilic proteins at the 95 % or higher confidence level. A false discovery rate (FDR) adjusted p value (q value) (Storey and Tibshirani 2003; Collard and Charles 2007; Noble 2009; Zheng et al. 2010) is applied to reduce the number of false positives among the significant sequence features. A q value cutoff of <0.05 is used to determine the significant sequence features between the two groups.
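To illustrate how an FDR adjustment changes the set of significant features, the sketch below applies the Benjamini-Hochberg step-up adjustment. This is a simpler, related procedure and not the Storey and Tibshirani q value estimator cited above; it is used here only as a stand-in. The function name and the p values are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, p_values[index] * m / rank)
        adjusted[index] = running_min
    return adjusted

# Made-up raw p values for four sequence features.
raw_p = [0.01, 0.04, 0.03, 0.20]
adjusted_p = benjamini_hochberg(raw_p)
significant = [value < 0.05 for value in adjusted_p]
```

In this made-up example three features pass an unadjusted 0.05 cutoff, but only one remains significant after adjustment.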

Feature ranking

The significant features obtained from the statistical test are also ranked according to their discriminative ability. The ranking is performed using the ReliefF algorithm (Kira and Rendell 1992).

PART-based halophilic signatures

Rule-based classifiers are used to gain insights into feature combinations and their corresponding target class. PART is a partial decision tree-based rule generation algorithm that uses separate-and-conquer rule learning (Frank and Witten 1998). In the first stage, a stable subtree is generated and the leaf that covers the highest number of instances is selected as a rule. The covered instances are removed, and a stable subtree is again created for the remaining instances. Previously, PART has been successfully used for gaining useful insights into biological phenomena (Nath and Subbiah 2014). Before using PART, the protein sequence features are discretized to improve prediction accuracy and to enhance the comprehensibility and interpretability of the generated rules (Wakulicz-Deja et al. 1998). Discretization is the conversion of a continuous feature into a nominal counterpart made up of a finite set of intervals. Both supervised and unsupervised discretization methods are implemented in the current study. Supervised discretization uses the target class information for binning, while unsupervised methods are independent of it. Equal frequency unsupervised discretization divides the data into n groups, each containing approximately the same number of values. Equal width binning divides the range of the data into n intervals of equal size. The flow diagram for the current study is shown in Fig. 1.
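The two unsupervised discretization schemes can be contrasted with a short sketch. It is a simplified illustration and not the WEKA filter used in the study: the function names and data are hypothetical, and tied values are split by rank in the equal frequency version, which is an assumption.

```python
def equal_width_bins(values, n_bins):
    """Assign each value to one of n_bins intervals of equal width."""
    low, high = min(values), max(values)
    if high == low:
        return [0] * len(values)
    width = (high - low) / n_bins
    # The maximum value is clamped into the last bin.
    return [min(int((v - low) / width), n_bins - 1) for v in values]

def equal_frequency_bins(values, n_bins):
    """Assign values to n_bins groups holding roughly equal numbers of values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, index in enumerate(order):
        bins[index] = rank * n_bins // len(values)
    return bins

# Made-up feature values with one outlier.
feature = [1.0, 2.0, 3.0, 4.0, 5.0, 20.0]
by_width = equal_width_bins(feature, 3)
by_frequency = equal_frequency_bins(feature, 3)
```

With an outlier present, equal width binning places almost all values in the first interval, whereas equal frequency binning keeps the groups balanced.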

Fig. 1 Schematic representation of the current methodology

Residue structural properties

The accessible surface area was first defined by Lee and Richards (1971). It can be defined as the area traced out by the centre of a probe sphere representing a solvent molecule as it is rolled over the surface of the molecule of interest. For comparison purposes, solvent accessibility is usually represented by the relative solvent accessibility (RSA) which for an amino acid residue in a protein structure is a real number that represents the solvent exposed surface area of this residue in relative terms. RSA is calculated as follows:

$${\text{RSA}}_{i} = {\text{SA}}_{i} /{\text{MSA}}_{i} \; \times 100$$
(5)

where \({\text{SA}}_{i}\) is the solvent accessibility of the ith amino acid, \({\text{MSA}}_{i}\) is the maximum solvent accessibility of the ith amino acid and \({\text{RSA}}_{i}\) is the relative solvent accessibility of the ith amino acid. The maximum accessibility values are defined on an extended tripeptide conformation in which the residue of interest is flanked by either glycine or alanine, i.e. Gly-X-Gly or Ala-X-Ala. The solvent accessibility values are calculated by the DSSP program (Kabsch and Sander 1983). The maximum values of solvent accessibility of amino acid residues are taken from Rost and Sander (1994). All residues having ≤10 % RSA are defined as buried (or core) residues, and residues having more than 10 % RSA are defined as exposed (or surface) residues. In the current work, eight pairs of halophilic and non-halophilic protein structures are taken from the testing set to compare the distribution of sequence properties among the different locations of the protein structures.
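Eq. 5 and the 10 % threshold can be expressed as the following sketch. The function names are hypothetical, and the accessibility values in the example are illustrative numbers rather than values taken from DSSP output or from the Rost and Sander table.

```python
def relative_solvent_accessibility(sa, max_sa):
    """RSA in percent (Eq. 5): observed over maximum solvent accessibility."""
    if max_sa <= 0:
        raise ValueError("maximum solvent accessibility must be positive")
    return 100.0 * sa / max_sa

def structural_state(rsa, threshold=10.0):
    """Residues with RSA <= threshold are buried; all others are exposed."""
    return "buried" if rsa <= threshold else "exposed"

# Illustrative numbers only: an observed area of 8.2 against a maximum of 164.
rsa = relative_solvent_accessibility(8.2, 164.0)
state = structural_state(rsa)
```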

Performance evaluation metrics

The performances of the machine learning methods are evaluated by using threshold-dependent and threshold-independent parameters. These parameters are calculated from the values of true positives (TP), false negatives (FN), true negatives (TN) and false positives (FP).

Sensitivity

It expresses the percentage of correctly predicted halophilic proteins.

$${\text{Sensitivity = }}\frac{\text{TP}}{{\left( {{\text{TP}} + {\text{FN}}} \right)}} \times 100$$
(6)

Specificity

It expresses the percentage of correctly predicted non-halophilic proteins.

$${\text{Specificity}} = \frac{\text{TN}}{{\left( {{\text{TN}} + {\text{FP}}} \right)}} \times 100$$
(7)

Accuracy

It expresses the percentage of correctly predicted halophilic and non-halophilic proteins.

$${\text{Accuracy}} = \frac{{{\text{TP}} + {\text{TN}}}}{{{\text{TP}} + {\text{FP}} + {\text{TN}} + {\text{FN}}}} \times 100$$
(8)

Matthews correlation coefficient (MCC)

It accounts for both sensitivity and specificity and is a valuable measure for selecting the best performing model in binary classification problems. Its value ranges from −1 to +1, where +1 means perfect prediction, zero means random prediction and −1 means total disagreement.

$${\text{MCC}} = \frac{{({\text{TP}} \times {\text{TN}}) - ({\text{FP}} \times {\text{FN}})}}{{\sqrt {\left( {{\text{TP}} + {\text{FN}}} \right)\left( {{\text{TP}} + {\text{FP}}} \right)\left( {{\text{TN}} + {\text{FP}}} \right)\left( {{\text{TN}} + {\text{FN}}} \right)} }}$$
(9)
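The four threshold-dependent metrics of Eqs. 6 to 9 can be computed together from the confusion matrix counts, as in the sketch below. The counts in the example are made up and are not results from this study; returning an MCC of zero when the denominator is zero is a common convention adopted here as an assumption.

```python
from math import sqrt

def classification_metrics(tp, fn, tn, fp):
    """Sensitivity, specificity, accuracy (in percent) and MCC (Eqs. 6 to 9)."""
    sensitivity = 100.0 * tp / (tp + fn)
    specificity = 100.0 * tn / (tn + fp)
    accuracy = 100.0 * (tp + tn) / (tp + fp + tn + fn)
    denominator = sqrt((tp + fn) * (tp + fp) * (tn + fp) * (tn + fn))
    # Assumption: MCC is reported as 0.0 when the denominator vanishes.
    mcc = (tp * tn - fp * fn) / denominator if denominator else 0.0
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": accuracy,
        "mcc": mcc,
    }

# Made-up counts: 50 halophilic and 50 non-halophilic proteins.
metrics = classification_metrics(tp=40, fn=10, tn=45, fp=5)
```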

Area under ROC (AUC)

ROC (receiver operating characteristic) curves are summarized by a single numerical quantity known as the AUC. It is a threshold-independent parameter that is invariant to a priori class probabilities, and its value ranges from 0 to 1 (Bradley 1997): 0 for the worst case, 0.5 for random ranking and 1 for the best prediction.
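One way to see why the AUC is threshold independent is its rank interpretation: it equals the fraction of halophilic/non-halophilic pairs in which the halophilic protein receives the higher score, with ties counted as one half. The sketch below computes it this way; it is not necessarily how WEKA computes the value, and the function name and scores are hypothetical.

```python
def auc_from_scores(positive_scores, negative_scores):
    """AUC as the fraction of positive/negative pairs ranked correctly."""
    if not positive_scores or not negative_scores:
        raise ValueError("both classes need at least one score")
    wins = 0.0
    for p in positive_scores:
        for n in negative_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # a tie counts as half a correct ranking
    return wins / (len(positive_scores) * len(negative_scores))

# Made-up classifier scores for three proteins per class.
auc = auc_from_scores([0.9, 0.8, 0.4], [0.7, 0.3, 0.2])
```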

Results and discussion

Machine learning based analysis

Table 2 presents the performance evaluation metrics for the different machine learning algorithms on both stratified tenfold cross validation and the blind testing set. RAB (RF) (real adaboosting with random forest as base learners) and BLR performed best on the holdout testing set. Motivated by this, BLR and RAB (RF) are combined using two common decision fusion techniques, majority voting and StackingC. Majority voting is found to combine the decisions of BLR and RAB (RF) more effectively than StackingC.

Table 2 Performance evaluation metrics for the different machine learning algorithms

Higher sensitivity, specificity and accuracy values are obtained by the decision fusion of RAB (RF) and BLR using majority voting. An overall accuracy of 90 % is obtained by decision fusion, which is higher than the previously reported 80 % accuracy (Zhang and Ge 2013a) on the same dataset. Although a direct comparison cannot be made with the method of Zhang and Ge (2013b) because of differences in the dataset, the present method can complement other methods in important ways and can also be adapted to other datasets.

Interestingly, using random forest as the base learner of the meta learners (RAB, AB and RSS) considerably improved their performance compared with using decision stumps (default in WEKA) as the base learners. In particular, random forest as the base learner notably improved the sensitivity of the meta learners.

After filtering out the strong instances from the output of the first-stage classifiers (from both training and testing sets), there are 127 true positives (correctly predicted halophilic proteins) and 127 true negatives (correctly predicted non-halophilic proteins), which are used to form the CPI dataset. Table 3 presents the performance evaluation metrics of the PART rule generation algorithm with different discretization methods on the CPI dataset.

Table 3 Performance evaluation metrics for the PART algorithm with different types of discretization

The ROCs for PART with different discretization methods are presented in Fig. 2 (PART_WD: without discretization, PART_EF: with equal frequency discretization, PART_EW: with equal width discretization, PART_SD: with supervised discretization). The best performing model is PART with supervised discretization. The rules are extracted from this model, and only high confidence rules (rules without any misclassification) are analyzed. Such rules capture attributes that are specific to either the halophilic or the non-halophilic class, since certain combinations of sequence parameters are informative for a particular class. The biologically interpretable halophilic signatures are presented in Table 4.

Fig. 2 ROCs for PART with and without discretization techniques

Table 4 PART based halophilic molecular signatures

PART generates rules of the form “If antecedent 1 and antecedent 2 … and antecedent N then consequent”. These combination rules are human interpretable and give the interaction patterns among sequence features that are responsible for halophilicity. Lower average charge, lower lysine content and lower serine content are the most prominent features in the extracted signatures of the halophilic protein sequences.

Significant sequence features

The strong instances from the output of the first-stage classifiers are further evaluated to find the significant sequence feature differences between the halophilic and non-halophilic groups. The statistically significant features that are preferred/avoided in halophilic protein sequences are tabulated in Table 5 (all except the dipeptides) and in Table 6 (significant dipeptides). Out of 453 sequence features, 145 features are found to be statistically significant (the corresponding p values, q values and t values are given in supplementary1.xlsx).

Table 5 Avoidance/Preference list for the different types of features in halophilic protein sequences (in bold are preferred and in italics are avoided)
Table 6 Avoidance/Preference list for the dipeptides in halophilic protein sequences (in bold are preferred and in italics are avoided)

Amino acid composition and property group composition

Among amino acid residues, aspartic acid and glutamic acid are strongly preferred; the preference for aspartic acid is stronger than that for glutamic acid (see supplementary1.xlsx for t values). Small hydrophobic amino acids like valine are also preferred. Glutamic acid has one more methylene group than aspartic acid, which makes it slightly bulkier and results in a destabilizing effect, as observed in the mutational stability studies of Tadeo et al. (2009); this may contribute to the strong preference for aspartic acid over glutamic acid. Larger hydrophobic amino acids like isoleucine and methionine are strongly avoided. The preference for valine over larger hydrophobic amino acids may be attributed to its contribution to flexibility (Paul et al. 2008; Madern et al. 2000). Among polar amino acids, asparagine and serine are avoided while glutamine and threonine are preferred, in accordance with the findings of a previous study (Paul et al. 2008). Among nonpolar amino acids, phenylalanine and cysteine are avoided, which might give halophilic proteins more flexibility (Paul et al. 2008). Among the positively charged (basic) amino acids, only histidine is preferred, while lysine is strongly avoided. Overall, acidic amino acid residues are preferred, along with both polar (threonine, aspartic acid, glutamic acid, histidine, glutamine) and nonpolar (proline and valine) amino acids. The increase in negatively charged residues directly increases the overall polarity of halophilic proteins.

Dipeptides

The majority of the statistically significant features are dipeptide counts (110 out of 145 significant features). Interesting trends can be observed in the significant dipeptide counts (Table 6). The salient observations from Table 6 are as follows: all *D (where * belongs to A/R/Q/E/G/H/D/P/T/W/Y/V) and D* (where * belongs to A/R/D/Q/E/G/H/L/P/T/V), all *Q (where * belongs to A/D/T) and Q* (where * belongs to R/D/E/T), all *P (where * belongs to A/D/G/V) and P* (where * belongs to R/D/T), *H (where * belongs to A/D) and H* (where * belongs to D/E/T/V), and W* (where * belongs to D/V) are preferred.

Likewise all *K (where * belongs to A/R/G/I/L/M/F/S/T/V) and K* (where * belongs to A/N/C/G/I/L/K/F/S/T/Y/V), *I (where * belongs to N/K/M/F/Y/V/L) and I* (where * belongs to A/N/C/G/I/L/K/M/F/S/T), *C (where * belongs to I/K) and C* (where * belongs to R/I/Y), *N (where * belongs to R/I/K) and N* (where * belongs to R/I), *M (where * belongs to A/I/S) and M* (where * belongs to I/L/K), *S (where * belongs to G/I/K/F) and S* (where * belongs to R/G/K/M/F/S/Y) are avoided.

Similarly, *E (where * belongs to D/Q/H/T/Y/V) and E* (where * belongs to A/R/D/E/T/V) except EF are preferred, while *F (where * belongs to E) and F* (where * belongs to R/I/K/S/T) except FT are avoided. *T (where * belongs to R/D/Q/E/H/P) and T* (where * belongs to D/Q/E/F) are preferred, whereas *T (where * belongs to I/K/F) and T* (where * belongs to K) are avoided. An overrepresentation of DA, RA, AD, RR, AP, DD, PD, EA, DV and an underrepresentation of LK, IL, II, IA, KK, IS, KA, GK, RK have also been reported by Zhang et al. (2013). The dipeptides of D are more strongly preferred than the dipeptides of E. The avoidance of dipeptides containing large hydrophobic amino acids and the preference for dipeptides of valine may be attributed to the fact that large hydrophobic residues possibly hinder the maintenance of optimal flexibility in halophilic proteins (Paul et al. 2008).

Further grouping of the significant dipeptides according to the amino acid property groups also revealed some interesting trends (Table 7). A cutoff of 70 % on the occurrence counts is used to decide whether a specific physicochemical dipeptide type falls in the avoidance or the preference class. Tiny* dipeptides (where * belongs to any other amino acid property group) are avoided, except tiny-small. Aliphatic, nonpolar and basic dipeptides are also avoided, except when they are accompanied by acidic residues. Acidic* and *acidic dipeptides (where * belongs to any other amino acid property group) are preferred. Most of the preferred dipeptides contain acidic residues in either the preceding or the succeeding position, while most of the avoided dipeptides contain aliphatic, nonpolar or basic residues in either the preceding or the succeeding position.

Table 7 Amino acid property group distribution of significant dipeptides and their preference/avoidance in halophilic protein sequences (in bold are preferred and in italics are avoided)

Property group-2grams

Among property group-2grams, charged, acidic and hydrophilic-2grams are preferred while basic-2grams are avoided. Overrepresentation of short stretches of acidic regions has also been previously reported in halophilic proteins (Siddiqui and Thomas 2008).

Atomic composition

Sulfur is the only feature of the atomic composition type found to be statistically significant, and higher sulfur counts are avoided. This reflects the avoidance of the sulfur-containing amino acids methionine and cysteine in halophilic proteins.

Average physicochemical properties

Except average volume, all other physicochemical properties are found to be statistically significant. Higher values of average flexibility are preferred along with higher values of average polarity in halophiles. Higher values of average charge, higher values of average bulkiness and higher values of average hydrophobicity are avoided in halophiles.

Feature ranking

Further, the ReliefF algorithm is used to rank the 145 significant features according to their discriminative ability (the rankings of all 145 significant features are given in supplementary2.xlsx).

It is evident from the box plots (Fig. 3) that the distribution of the top ten features is quite different between the two groups. Notably, the spread for methionine in halophiles is more pronounced than that of the other top ranked features.

Fig. 3 Boxplot for the top 10 statistically significant features as obtained from the ReliefF algorithm

To remain structurally and functionally stable, halophilic proteins undergo many changes at the sequence level, which are reflected in the amino acid composition. The avoidance of lysine and the preference for acidic residues play important roles in preventing aggregation of proteins at high salt concentrations. The repulsive electrostatic interaction between the acidic residues prevents non-specific interactions in proteins and also contributes towards the maintenance of flexibility, which is essential for the catalytic activity of enzymes (Siddiqui and Thomas 2008; Lanyi 1974; Elcock and McCammon 1998; Mevarech et al. 2000). Acidic residues also assist in the binding of salt and water molecules (Siddiqui and Thomas 2008; Kuntz 1971). The role of the hydrated ion network in the maintenance of stability is stressed by Zaccai et al. (1989). The preference for proline and avoidance of isoleucine in halophilic proteins contribute towards a reduction in overall bulkiness; reduced residue bulkiness is essential for the structural stability of halophilic proteins (Kimura and Kimura 1987). The preference for small hydrophobic amino acids like proline and valine and the avoidance of large hydrophobic amino acids like phenylalanine, isoleucine and methionine also contribute towards the preservation of optimal flexibility in halophilic protein sequences (Siddiqui and Thomas 2008). The underrepresentation of cysteines and overrepresentation of threonines are also reported by Paul et al. (2008). Higher values of IP are avoided in halophilic proteins, which is a direct consequence of the higher percentage of acidic residues and lower percentage of basic residues (Smole et al. 2011).

Distribution of sequence features in different residue structural states

The distributions of amino acid residues, amino acid property groups, atomic composition and physicochemical properties are compared according to residue location (dipeptide counts and property group-2grams are not compared, as these features depend on residue order along the sequence). Three types of comparisons are carried out: (1) between halophilic surface and halophilic core residues, (2) between halophilic core and non-halophilic core residues, (3) between halophilic surface and non-halophilic surface residues.

In comparison to halophilic cores, halophilic surfaces prefer R, D, Q, E and K and avoid A, I, L and V. Among amino acid property groups, tiny, aliphatic, nonpolar and hydrophobic are avoided and polar, charged, basic, acidic and hydrophilic are preferred on halophilic surfaces. Higher average flexibility and average polarity are also preferred, and higher average hydrophobicity and average bulkiness are avoided. Halophilic surfaces also prefer a lower average charge. However, no significant differences are observed between halophilic core and non-halophilic core residues. In comparison to non-halophilic surfaces, halophilic surfaces most prominently prefer D, the acidic property group and lower average charge (more negative charge) and avoid K and the basic property group (see supplementary3.xlsx for the FDR-corrected t test analysis). The differences in sequence parameters are thus more pronounced on the surfaces than in the cores. The large number of negatively charged residues on halophilic surfaces increases solubility and prevents aggregation, while the avoidance of the basic property group, notably K, on the surfaces can be attributed to its long side chain, which has a destabilizing effect because it disrupts the hydration shell network that is important in preventing aggregation in halophilic environments (Tadeo et al. 2009; Britton et al. 1998; Kastritis et al. 2007).

Conclusion

In the present work, a detailed analysis of the possible association of protein sequence parameters with hypersaline adaptation is carried out using strong instances filtered from an accurate classifier. Certain trends in amino acid composition, dipeptide composition and physicochemical properties are observed and their possible role in halophilicity is discussed. The salient feature of the current work is the statistically significant avoidance/preference lists of sequence features along with their corresponding ranks in discriminating between the two groups. Out of 463 features, 145 features are found to be statistically significant. The important signatures of halophilic protein sequences are lower average hydrophobicity, higher average flexibility, lower average bulkiness, higher average polarity and lower average charge (more negative charge). Isoleucine, lysine and serine are strongly avoided, while aspartic acid and glutamic acid are more strongly preferred than any other amino acid residue. The acidic property group is strongly preferred in comparison to the other significant property groups. This is also the first report of statistically significant avoidance/preference of sequence features of halophilic proteins. The current feature set gave “good enough” prediction accuracy: using the current sequence features, a 10 % increase in overall accuracy is obtained over the previously reported accuracy on the same dataset. Based on the current feature set, possible patterns for halophilicity are extracted. The current study also reports the preference/avoidance of dipeptides grouped according to the amino acid physicochemical grouping. Further, a systematic statistical comparison is carried out to find the distribution of sequence parameters in the interior and on the surface of halophilic proteins. The current analysis will facilitate the understanding of the possible association of sequence parameters with halophilicity and of the mechanism of halophilic adaptation.