Introduction

A biological membrane is an enclosing or separating barrier within or around a cell. It plays a central role in cellular processes ranging from basic molecule transport to sophisticated signaling pathways. Currently, more than half of all drugs on the market directly target membrane proteins (Klabunde and Hessler 2002). However, obtaining high-resolution three-dimensional (3D) structures of membrane proteins is complex and difficult: membrane proteins account for only a few percent of the structures available in the Protein Data Bank (Berman et al. 2000). A membrane protein contains one or more transmembrane (TM) helices, which define its orientation, or topology, with respect to the lipid bilayer. Alpha-helical proteins are a prime category of TM proteins and perform most of the important biological processes of a cell, such as cell signaling, cell-to-cell interaction, cell recognition, and adhesion. Information about TM helices thus provides useful clues for determining the function of membrane proteins. Since determining the structure of membrane proteins by X-ray crystallography or nuclear magnetic resonance (NMR) is extremely difficult, computational methods are valuable tools for correctly identifying the locations of TM helix segments and the topology of TM helix proteins.

In the past few decades, a series of efforts has been made to predict the topology of TM helix proteins. Early studies were mostly based on the physicochemical properties of amino acids, namely hydrophobicity (Argos et al. 1982; Cserzo et al. 1997; Eisenberg et al. 1982; Juretic et al. 2002; Kyte and Doolittle 1982; Nakai and Kanehisa 1992; Von Heijne 1992), charge (Claros and Von Heijne 1994; Hirokawa et al. 1998; Juretic et al. 2002), nonpolar phase helicity (Deber et al. 2001), and multiple sequence alignment (Persson and Argos 1996; Rost et al. 1995). DAS-TMfilter (Cserzo et al. 2004), TOP-Pred (Claros and Von Heijne 1994), and SOSUI (Hirokawa et al. 1998) are among the most reliable models giving descriptive information about TM helices. Although these methods identify TM helix segments efficiently, they did not achieve promising results in predicting the topology of TM helix proteins. Researchers have also used various statistical models, such as hidden Markov models (HMM), as well as neural networks and support vector machines (SVM), for predicting TM helices. In this regard, several user-friendly web predictors have been developed for the benefit of academics and researchers, including TopPred (Claros and Von Heijne 1994), MEMSAT (Jones 2007), PHD (Rost et al. 1996), HMMTOP (Tusnady and Simon 1998, 2001), TMHMM (Krogh et al. 2001; Sonnhammer et al. 1998), PRODIV_TMHMM (Viklund and Elofsson 2004), TMMOD (Kahsay et al. 2005), Phobius (Kall et al. 2007), ENSEMBLE (Martelli et al. 2003), PONGO (Amico et al. 2006), HMM-TM (Bagos et al. 2006), MemBrain (Shen and Chou 2008), MEMSAT-SVM (Nugent and Jones 2009a), MEMPACK (Nugent and Jones 2009b), and SVMtop (Lo et al. 2008). The main problem with HMM-based methods is that they are computationally expensive; in addition, they rely on multiple sequence alignments. HMM-based methods also fail when TM helix segments are shorter than 16 residues or longer than 35 residues (Shen and Chou 2008). Some researchers have used accuracy along with sensitivity and specificity to evaluate their proposed methods (Hosseini et al. 2008; Pylouster et al. 2010; Shen and Chou 2008; Zaki et al. 2011a), while several studies have concentrated only on the sensitivity and reliability of different methods rather than accuracy (Chen et al. 2002; Cuthbertson et al. 2005; Kall and Sonnhammer 2002; Melen et al. 2003; Moller et al. 2001).

In this study, we focus on developing a more effective and accurate TM helix segment prediction system, denoted WRF-TMH. The proposed approach is based on two kinds of information: the first is the compositional index, and the second comprises physicochemical properties of amino acids. The compositional index profile is generated by calculating the compositional index of each amino acid in TM helix and non-TM helix segments, after which the probability of each amino acid in both kinds of segments is computed. The physicochemical properties of amino acids, namely charge, polarity, aromaticity, size, and electronic character, are used to explore the behavior of amino acid sequences. Extraneous information is eliminated through singular value decomposition (SVD), which finds a reduced matrix that retains the least possible information while preserving strong patterns and trends. Highly discriminative features from both feature spaces are then combined to form a hybrid feature space. Weighted random forest (RF) is used as the classifier in our proposed system. Weighted RF is an ensemble classifier in which the prediction is made by majority voting; hence, the probability of error is reduced. Further, TM helix segments are fewer in number than non-TM helix segments, and with imbalanced data a classifier is usually biased towards the majority class. To control this bias, each class is assigned a different weight: the majority class receives a low weight and the minority class a high weight (Bush et al. 2008). Two standard datasets and tenfold cross validation are used to assess the performance of the proposed WRF-TMH model. An advantage of our approach is that it uses an overlap of 11 residues, whereas existing techniques mostly use an overlap of 9.
The remainder of the paper is organized as follows: first, a description of “Materials and methods”; next, an explanation of the proposed system; then, a presentation of the performance measures; followed by results and discussion; and finally, conclusions are drawn.

Materials and methods

Datasets

In order to develop a high-quality and reliable prediction model, one needs to construct or select a benchmark dataset appropriate to the problem. A standard dataset boosts the learning capability of the model, so that the predictions are generally in accordance with the desired output. For this purpose, the dataset must be free of homology bias and contain little redundancy. We therefore used two benchmark datasets. DT1 is a low-resolution membrane protein dataset developed by Moller et al. (2000) and annotated from SWISS-PROT release 49.0 (Bairoch and Apweiler 1997). Initially, it contained 145 protein sequences, but two sequences with no membrane protein annotation were later discarded. Finally, DT1 consists of 143 protein sequences, which include 687 TM helix segments.

DT2 is a high-resolution membrane protein dataset. In this dataset, 101 protein sequences with 3D helix structures are selected from the MPtopo database (Jayasinghe et al. 2001a), while 231 protein sequences are obtained from the TMPDB database (Ikeda et al. 2002). After merging the two sets, CD-HIT was applied at a 30 % sequence identity threshold to reduce redundancy. After this screening, DT2 contains 258 single- and multi-spanning TM protein sequences, comprising 1,232 TM helix segments.

Feature extraction techniques

In this study, we have considered two protein sequence representation methods for extracting pertinent and useful information from the TM protein sequences.

Physicochemical properties

A protein sequence is composed of amino acids, each with a distinct side chain. Amino acids are categorized into different groups according to their nature, and physicochemical properties play a vital role in recognizing their behavior. To extract informative features from protein sequences, we used five important physicochemical properties of amino acids: charge, polarity, aromaticity, size, and electronic character. Each property has sub-types that differentiate amino acids from each other, as shown in Table 1. Physicochemical properties play a significant role in the formation and folding of protein structure and depend largely on the propensity of the amino acid side chain. Each property has specific characteristics, typically defined by the type of side chain the amino acid possesses. For instance, polar and charged amino acids cover the surface of molecules and are in contact with solvents due to their ability to form hydrogen bonds. They often interact with each other: positively and negatively charged amino acids form salt bridges, whereas polar side chains form hydrogen bonds. These interactions frequently stabilize a protein’s 3D structure. Polar amino acids are hydrophilic, whereas non-polar amino acids are hydrophobic and help twist the protein into useful shapes (Hayat and Khan 2012). In this study, each TM protein sequence is replicated into five sequences, and in each copy every amino acid is replaced with its value for the corresponding property. For example, residue r_i at position i can be represented as

$$ r_{i} = \left( C_{i}, P_{i}, A_{i}, S_{i}, E_{i} \right) $$
(1)

where $C_{i}$, $P_{i}$, $A_{i}$, $S_{i}$, and $E_{i}$ represent charge, polarity, aromaticity, size, and electronic character, respectively. Each amino acid is replaced with its corresponding value; for example, in the case of charge, each amino acid is mapped to one of three values: positive, negative, or neutral. Thus, by applying a sliding window, three features are calculated at each position (moving one residue at a time) before the window advances to the next position. This process is repeated up to the last residue of the protein sequence, and the same procedure is applied for each property. Consequently, 16 features are extracted at each position. The feature vector can be expressed as

$$ R_{i} = \left[ C_{ij} \right]_{1 \times 16} $$
(2)

where $C_{ij}$ is the occurrence frequency of property sub-type j in window i. Finally, the obtained feature matrix is

$$ P = \left[ R_{1}^{T}\; R_{2}^{T}\; \ldots\; R_{L-l+1}^{T} \right]_{16 \times (L-l+1)} $$
(3)

where T denotes the transpose, L is the length of the protein sequence, and l is the window size.
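As a concrete illustration of Eqs. (1)–(3), the following Python sketch computes windowed sub-type frequencies for the charge property alone; the residue-to-sub-type mapping and the window length here are illustrative placeholders, not the exact groupings of Table 1, and the full encoding would repeat this for all five properties (16 sub-types in total).

```python
from collections import Counter

# Illustrative charge sub-types (placeholder mapping; the actual
# groupings used by WRF-TMH are those of Table 1).
POSITIVE = set("KRH")
NEGATIVE = set("DE")

def charge_class(aa):
    """Map a residue to one of the three charge sub-types."""
    if aa in POSITIVE:
        return "pos"
    if aa in NEGATIVE:
        return "neg"
    return "neu"

def window_features(seq, l=11):
    """Slide a window of length l over the sequence and, at each position,
    count the occurrence frequency of each charge sub-type (Eq. 2 collects
    such counts for all five properties into a 1 x 16 vector)."""
    feats = []
    for j in range(len(seq) - l + 1):
        counts = Counter(charge_class(aa) for aa in seq[j:j + l])
        feats.append([counts.get(c, 0) for c in ("pos", "neg", "neu")])
    return feats

feats = window_features("MKKLLDEARG", l=5)   # 10 residues, 6 windows
```

Stacking the per-window vectors column-wise then gives the 16 × (L − l + 1) matrix P of Eq. (3).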

Table 1 Physicochemical properties of amino acids

Compositional index

The compositional index reflects the occurrence frequency of amino acids in protein sequences; a high value indicates that an amino acid occurs more often. To compute the compositional indices, we first separated the TM and non-TM helix segments of each TM protein sequence; TM helix segments are denoted T1 and non-TM helix segments T2. Then, the occurrence frequency of each amino acid in T1 and T2 is computed. The compositional index p_i for each amino acid is calculated as

$$ p_{i} = - \ln \left( \frac{f_{i}^{\text{non-TM}} - f_{i}^{\text{TM}}}{f_{i}^{\text{TM}}} \right) $$
(4)

where $f_{i}^{\text{non-TM}}$ and $f_{i}^{\text{TM}}$ denote the occurrence frequencies of amino acid i in the T2 and T1 segments, respectively. Subsequently, each amino acid in the sequence is substituted by its corresponding index value. Recently, the compositional index was effectively exploited for TM helix prediction (Zaki et al. 2011a, b); Zaki et al. extended the DomCut method (Suyama and Ohara 2003) by incorporating amino acid composition knowledge. The compositional index for a TM protein sequence p with window size w can be computed as

$$ m_{j}^{w} = \begin{cases} \dfrac{\sum_{i=1}^{j+(w-1)/2} p_{i}}{j+(w-1)/2}, & 1 \le j \le (w-1)/2 \\[2ex] \dfrac{\sum_{i=j-(w-1)/2}^{j+(w-1)/2} p_{i}}{w}, & (w-1)/2 < j \le L-(w-1)/2 \\[2ex] \dfrac{\sum_{i=j-(w-1)/2}^{L} p_{i}}{L-j+1+(w-1)/2}, & L-(w-1)/2 < j \le L \end{cases} $$
(5)

We choose window sizes w = 7, 9, …, 25 (odd sizes only), giving ten values per position; the extracted feature vector is thus 10-dimensional.
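The index substitution of Eq. (4) and the end-aware window averaging of Eq. (5) can be sketched as below; the frequency values in the example are illustrative, and note that Eq. (4) is defined only when $f_{i}^{\text{non-TM}}$ exceeds $f_{i}^{\text{TM}}$.

```python
import math

def compositional_index(f_tm, f_non_tm):
    """Eq. (4): compositional index per amino acid from its occurrence
    frequencies in the TM (T1) and non-TM (T2) segments; assumes
    f_non_tm[aa] > f_tm[aa] so the logarithm is defined."""
    return {aa: -math.log((f_non_tm[aa] - f_tm[aa]) / f_tm[aa]) for aa in f_tm}

def windowed_index(p, w):
    """Eq. (5): average the substituted index values p over a sliding window
    of odd size w; near the sequence ends the window is clipped, which
    reproduces the three cases of Eq. (5) in one expression."""
    h = (w - 1) // 2                 # half-window, (w - 1) / 2
    L = len(p)
    out = []
    for j in range(1, L + 1):        # 1-based positions as in Eq. (5)
        lo, hi = max(1, j - h), min(L, j + h)
        out.append(sum(p[lo - 1:hi]) / (hi - lo + 1))
    return out

ci = compositional_index({"A": 0.1}, {"A": 0.3})        # -ln(2)
smoothed = windowed_index([1.0, 2.0, 3.0, 4.0, 5.0], w=3)
```

Clipping the window at both sequence ends is exactly what the first and third cases of Eq. (5) express, since their denominators equal the number of positions actually summed.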

Singular value decomposition

Singular value decomposition (SVD) is a dimensionality reduction technique that plays a vital role in many multivariate data analyses. Using SVD, one can find a reduced-dimensional matrix that preserves strong correlations while suppressing noise. SVD recreates the best possible approximation of a matrix that retains minimal information while emphasizing strong patterns and trends. SVD first transforms correlated variables into uncorrelated variables, exposing the relationships in the original data, and then identifies and orders the dimensions by the amount of variation they exhibit. Once the directions of highest variation are identified, the best approximation of the original data points can be found using fewer dimensions.

Usually, a feature space contains redundant, irrelevant, and mutually dependent information. Therefore, the feature vector needs to be transformed into an orthogonal space. SVD expresses the matrix, or linear transformation, in the minimum number of dimensions. For instance, if an N-dimensional feature space lies in a K-dimensional subspace, where K < N, then each N-dimensional vector has only K degrees of freedom and can be uniquely represented by K dimensions. SVD factorizes a matrix A of size M × N into three matrices U, W, and V as $A = UWV^{T}$. U is an M × M orthogonal matrix ($UU^{T} = I$) whose columns are the left singular vectors of A, V is an N × N orthogonal matrix ($VV^{T} = I$) whose columns are the right singular vectors of A, and W is an M × N diagonal matrix containing the singular values of A. If we assume that M < N, then the linear transformation can be represented by SVD as follows:

$$ \left( {\begin{array}{*{20}c} {a_{11} } & {a_{12} } & \ldots & {a_{1N} } \\ . & . & \ldots & . \\ . & . & \ldots & . \\ . & . & \ldots & . \\ {a_{M1} } & {a_{M2} } & \ldots & {a_{MN} } \\ \end{array} } \right) = \left( {\begin{array}{*{20}c} {u_{11} } & {u_{12} } & \ldots & {u_{1M} } \\ . & . & \ldots & . \\ . & . & \ldots & . \\ . & . & \ldots & . \\ {u_{M1} } & {u_{M2} } & \ldots & {u_{MM} } \\ \end{array} } \right)\left( {\begin{array}{*{20}c} {\sigma_{11} } & {\sigma_{12} } & \ldots & {\sigma_{1N} } \\ . & . & \ldots & . \\ . & . & \ldots & . \\ . & . & \ldots & . \\ {\sigma_{M1} } & {\sigma_{M2} } & \ldots & {\sigma_{MN} } \\ \end{array} } \right) \times \left( {\begin{array}{*{20}c} {v_{11} } & {v_{12} } & \ldots & {v_{1N} } \\ . & . & \ldots & . \\ . & . & \ldots & . \\ . & . & \ldots & . \\ {v_{N1} } & {v_{N2} } & \ldots & {v_{NN} } \\ \end{array} } \right) $$
(6)

Subsequently, the value of $\sigma_{i,j}$ is zero when j > M, so the product $WV^{T}$ produces zero values for rows M + 1 through N.

$$ \left( {\begin{array}{*{20}c} {a_{11} } & {a_{12} } & \ldots & {a_{1N} } \\ . & . & \ldots & . \\ . & . & \ldots & . \\ . & . & \ldots & . \\ {a_{M1} } & {a_{M2} } & \ldots & {a_{MN} } \\ \end{array} } \right) = \left( {\begin{array}{*{20}c} {u_{11} } & {u_{12} } & \ldots & {u_{1M} } \\ . & . & \ldots & . \\ . & . & \ldots & . \\ . & . & \ldots & . \\ {u_{M1} } & {u_{M2} } & \ldots & {u_{MM} } \\ \end{array} } \right)\left( {\begin{array}{*{20}c} {\sigma_{1} } & 0 & \ldots & 0 \\ . & . & \ldots & . \\ . & . & \ldots & . \\ . & . & \ldots & . \\ 0 & 0 & \ldots & {\sigma_{M} } \\ \end{array} } \right) \times \left( {\begin{array}{*{20}c} {v_{11} } & {v_{12} } & \ldots & {v_{1N} } \\ . & . & \ldots & . \\ . & . & \ldots & . \\ . & . & \ldots & . \\ {v_{M1} } & {v_{M2} } & \ldots & {v_{MN} } \\ \end{array} } \right) $$
(7)

This indicates that a column $A_{\cdot i}$ of A can be expressed as a linear combination of the M vectors in U ($u_{\cdot 1}, u_{\cdot 2}, \ldots, u_{\cdot M}$), using the singular values in W ($\sigma_{1}, \sigma_{2}, \ldots, \sigma_{M}$) and the i-th column of $V^{T}$. The diagonal nonnegative values of W can be ordered such that $\sigma_{1} \ge \sigma_{2} \ge \cdots \ge \sigma_{M}$. If some entries on the diagonal of W are zero, then for some K, $\sigma_{1} \ge \sigma_{2} \ge \cdots \ge \sigma_{K} > \sigma_{K+1} = \cdots = \sigma_{M} = 0$. The number of columns in U and the number of rows in $V^{T}$ can then be reduced to K dimensions; likewise, both the rows and columns of W can be reduced to K dimensions.

The rank of matrix A equals the number of its non-zero singular values. The required matrix is obtained by multiplying the first K columns of U by the first K singular values of W and the first K rows of $V^{T}$, as shown in Fig. 1.

Fig. 1
figure 1

Graphical representation of SVD

$$ \left( {\begin{array}{*{20}c} {a_{11} } & {a_{12} } & \ldots & {a_{1N} } \\ . & . & \ldots & . \\ . & . & \ldots & . \\ . & . & \ldots & . \\ {a_{M1} } & {a_{M2} } & \ldots & {a_{MN} } \\ \end{array} } \right) = \left( {\begin{array}{*{20}c} {u_{11} } & {u_{12} } & \ldots & {u_{1K} } \\ . & . & \ldots & . \\ . & . & \ldots & . \\ . & . & \ldots & . \\ {u_{M1} } & {u_{M2} } & \ldots & {u_{MK} } \\ \end{array} } \right)\left( {\begin{array}{*{20}c} {\sigma_{1} } & 0 & \ldots & 0 \\ . & . & \ldots & . \\ . & . & \ldots & . \\ . & . & \ldots & . \\ 0 & 0 & \ldots & {\sigma_{K} } \\ \end{array} } \right) \times \left( {\begin{array}{*{20}c} {v_{11} } & {v_{12} } & \ldots & {v_{1N} } \\ . & . & \ldots & . \\ . & . & \ldots & . \\ . & . & \ldots & . \\ {v_{K1} } & {v_{K2} } & \ldots & {v_{KN} } \\ \end{array} } \right) $$
(8)

Consequently, the matrix A of size M × N can be equivalently represented by the K × N matrix $WV^{T}$, where K < M.

Singular values that are close to zero are eliminated from the matrix, because $\sigma_{K+1}, \ldots, \sigma_{M}$ measure the distance from the subspace spanned by $u_{1}, \ldots, u_{K}$, and a very small distance should not affect the operations performed on the reduced data. In this study, we picked the top five dimensions, which account for 83 % of the variance.
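A minimal NumPy sketch of this truncation, keeping the top K singular directions; measuring the retained variance by squared singular values is our assumption here, and the rank-1 demo matrix is illustrative.

```python
import numpy as np

def svd_reduce(A, k):
    """Factorize A = U W V^T and keep only the top-k singular values:
    the k x N matrix W_k V_k^T equivalently represents A (Eq. 8), and
    'explained' is the fraction of variance those k values capture."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    reduced = np.diag(s[:k]) @ Vt[:k, :]
    explained = float((s[:k] ** 2).sum() / (s ** 2).sum())
    return reduced, explained

A = np.outer([1.0, 2.0, 3.0], [4.0, 5.0, 6.0])   # rank-1 demo matrix
reduced, explained = svd_reduce(A, k=1)
```

For the rank-1 demo matrix, a single singular direction captures essentially all the variance, so the 1 × 3 reduced matrix loses nothing.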

Proposed prediction system for TM helices (WRF-TMH)

In this study, we propose an effective model, WRF-TMH, for the prediction of TM helices. The proposed model is based on two informative protein sequence representation methods: physicochemical properties and the compositional index. In physicochemical-property-based feature extraction, each residue in a protein sequence is first substituted, according to the behavior of the amino acid, by the corresponding value of the physicochemical property. After that, the frequency of each value is computed within the specified peptide. The process is repeated up to the last residue of the protein sequence. Consequently, 16 features are extracted at each position in the sequence.

On the other hand, using the compositional index, we first calculate the occurrence frequency of each amino acid in TM and non-TM helix segments. Next, the compositional index of each amino acid is computed, and each amino acid in a protein sequence is substituted by its corresponding compositional index. Finally, taking odd window sizes from 7 to 25, 10 features are extracted at each position. To eliminate redundant and irrelevant features, we applied SVD to each feature space separately. Five features with high variance are selected from each feature space, and the selected features of both spaces are then combined to enhance the discriminative power of the representation. In addition, weighted RF is used as the learner; it is a collection of tree hypotheses in which each tree is grown on a different bootstrap sample drawn from the same distribution (Afridi et al. 2012; Hayat et al. 2012). The final prediction is made by majority voting; hence, the chance of error is minimized. Recently, RF has been successfully applied to a wide range of classification problems, especially predicting protein–protein binding sites (Bordner 2009), residue–residue contacts and helix–helix interactions (Wang et al. 2011), as well as the solvent-accessible surface area of TM helix residues (Wang et al. 2012) in membrane proteins.

In this study, the number of TM helix segments is smaller than the number of non-TM helix segments. In such situations, the learner’s predictions are often biased towards the majority class, whereas the purpose of our approach is to predict TM helix segments accurately. For this purpose, a weight is assigned to each class: a high weight to the minority class and a low weight to the majority class (Bush et al. 2008). The framework of the proposed approach is illustrated in Fig. 2.
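This weighting scheme can be sketched with scikit-learn’s RandomForestClassifier, whose class_weight parameter accepts per-class weights; the toy data and the 1:4 weighting below are illustrative, not the values used in WRF-TMH.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Toy imbalanced data: class 1 (the TM helix class) is the minority.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (80, 5)),    # majority class
               rng.normal(3.0, 1.0, (20, 5))])   # minority class
y = np.array([0] * 80 + [1] * 20)

# A higher weight on the minority class counteracts the majority bias;
# class_weight="balanced" would derive such weights from class frequencies.
clf = RandomForestClassifier(n_estimators=100,
                             class_weight={0: 1.0, 1: 4.0},
                             random_state=0)
clf.fit(X, y)
pred = clf.predict(np.array([[3.0, 3.0, 3.0, 3.0, 3.0]]))
```

Each tree is still grown on its own bootstrap sample; the class weights only rescale the impurity and voting contributions of the minority class.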

Fig. 2
figure 2

Framework of the proposed approach

User-friendly web predictor

In order to provide an easy way to access and use the developed resource for TM helix prediction, we have launched a user-friendly web predictor, “WRF-TMH predictor”. The predictor accepts a simple text format and displays the start and end location of each helix, coloring the residues of each helix in the sequence. The main page of the WRF-TMH predictor is shown in Fig. 3a, and the prediction output page in Fig. 3b.

Fig. 3
figure 3

a Illustrates the Main page of WRF-TMH Predictor b Shows the output of predictor

Performance measures

Various measures, including accuracy, recall, precision, and the Matthews correlation coefficient (MCC), are used to evaluate the performance of the WRF-TMH model at different levels: per protein, per segment, and per residue.

$$ Q_{\text{htm}}^{\%\text{obsd}} = \frac{\text{number of correctly predicted TM helices in the dataset}}{\text{total number of observed TM helices in the dataset}} \times 100 $$
(9)

where \( Q_{\text{htm}}^{\% obsd} \)indicates the recall of TM helix segments.

$$ Q_{\text{htm}}^{\%\text{prd}} = \frac{\text{number of correctly predicted TM helices in the dataset}}{\text{number of TM helices predicted in the dataset}} \times 100 $$
(10)

where \( Q_{\text{htm}}^{\% prd} \) represents the precision of TM helix segments.

$$ Q_{\text{ok}} = \frac{\sum\nolimits_{i}^{N_{\text{Prot}}} \delta_{i}}{N_{\text{Prot}}} \times 100, \qquad \delta_{i} = \begin{cases} 1, & \text{if } Q_{\text{htm}}^{\%\text{obsd}} = Q_{\text{htm}}^{\%\text{prd}} = 100 \text{ for protein } i \\ 0, & \text{otherwise} \end{cases} $$
(11)

where $Q_{\text{ok}}$ indicates the percentage of protein sequences in which all TM helix segments are correctly predicted.

$$ Q_{2} = \frac{\sum\nolimits_{i}^{N_{\text{Prot}}} \frac{\text{number of residues predicted correctly in protein } i}{\text{number of residues in protein } i}}{N_{\text{Prot}}} \times 100 $$
(12)

where \( Q_{2} \) shows the percentage of correctly predicted residues in both the TM helix and non-TM helix segments.

$$ Q_{2\text{T}}^{\%\text{obsd}} = \frac{\text{number of residues correctly predicted in TM helices}}{\text{number of residues observed in TM helices}} \times 100 $$
(13)

where $Q_{2\text{T}}^{\%\text{obsd}}$ measures the fraction of observed TM helix residues that are correctly predicted (per-residue recall).

$$ Q_{2\text{T}}^{\%\text{prd}} = \frac{\text{number of residues correctly predicted in TM helices}}{\text{number of residues predicted in TM helices}} \times 100 $$
(14)

where $Q_{2\text{T}}^{\%\text{prd}}$ measures the fraction of residues predicted as TM helix that are correct (per-residue precision).

$$ MCC = \frac{TP \times TN - FP \times FN}{{\sqrt {\left( {TP + FP} \right)\left( {TP + FN} \right)\left( {TN + FP} \right)\left( {TN + FN} \right)} }} $$
(15)

MCC is the Matthews correlation coefficient, whose value lies in the range −1 to 1. In Eq. (15), TP is the number of correctly predicted TM helix residues, FP is the number of residues incorrectly predicted as TM helix, TN is the number of correctly predicted non-TM helix residues, and FN is the number of TM helix residues incorrectly predicted as non-TM helix.
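The per-residue measures of Eqs. (13)–(15) follow directly from the four confusion counts; this small helper is a sketch, with illustrative counts in the example.

```python
import math

def residue_metrics(tp, fp, tn, fn):
    """Recall (Eq. 13), precision (Eq. 14), and MCC (Eq. 15)
    from per-residue confusion counts."""
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom
    return recall, precision, mcc

metrics = residue_metrics(tp=40, fp=10, tn=40, fn=10)
```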

Results and discussion

Generally, three cross-validation tests, self-consistency, jackknife, and independent dataset, are used to evaluate the performance of a prediction model. Among these, investigators have extensively applied the jackknife test due to its special attributes (Khan et al. 2010; Naveed and Khan 2012); however, it is computationally expensive. To reduce the computational cost while retaining the important characteristics of the jackknife test, we adopted tenfold cross validation. The jackknife test uses as many folds as there are samples, while tenfold cross validation randomly partitions the training dataset into ten approximately equal, mutually exclusive folds. In both tests, one fold is used for testing and the remaining folds for training. The whole process is repeated ten times, so that each fold serves exactly once as the test fold. Finally, the predictions of the folds are averaged to obtain the final output. Two benchmark datasets, low-resolution and high-resolution, are used, and performance is assessed at three levels (per protein, per segment, and per residue).
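The fold construction can be sketched as follows; the seed and the use of DT1’s 143 proteins are illustrative.

```python
import random

def tenfold_indices(n, seed=0):
    """Randomly partition indices 0..n-1 into ten approximately equal,
    mutually exclusive folds; each fold serves exactly once as the test
    fold while the other nine are used for training."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::10] for i in range(10)]

folds = tenfold_indices(143)   # e.g. the 143 proteins of DT1
```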

Performance analysis between selected feature space and full feature space

After identifying patterns and extracting information from protein sequences, the resulting features may still contain redundancy and noise, which degrades the performance of classification algorithms. In this study, we performed a comparative analysis of the selected feature space and the full feature space. Both are shown in Fig. 4, where black indicates the full feature space and red the selected feature space.

Fig. 4
figure 4

Full feature space and selected feature space using SVD

Performance of weighted RF using full feature space

In this work, we first examined the performance of weighted RF in conjunction with the full space of individual and hybrid features. Success rates of weighted RF are reported in Table 2. On the low-resolution dataset, weighted RF with physicochemical properties obtained 66.4 % accuracy at the per-protein level. At the per-segment level, it achieved 93.7 % precision and 92.9 % recall, while at the per-residue level it reached 86.2 % accuracy, 85.1 % precision, 77.6 % recall, and 0.75 MCC. With the compositional index, weighted RF performs better at the per-protein level than with physicochemical properties, but almost the same at the per-segment and per-residue levels. With the hybrid feature space, weighted RF obtained enhanced results compared with the individual feature spaces: 72.7 % accuracy at the per-protein level; 94.2 % precision and 94.1 % recall at the per-segment level; and 88.4 % accuracy, 87.0 % precision, 80.0 % recall, and 0.77 MCC at the per-residue level. On the high-resolution dataset, weighted RF with physicochemical properties achieved 69.0 % accuracy at the per-protein level, 90.3 % precision and 94.0 % recall at the per-segment level, and 89.7 % accuracy, 85.4 % precision, 90.6 % recall, and 0.81 MCC at the per-residue level. In contrast, the performance with compositional index-based features is not as good as with physicochemical properties. Using the hybrid feature space, weighted RF again outperforms the individual feature spaces, obtaining 71.7 % accuracy at the per-protein level; 91.2 % precision and 93.9 % recall at the per-segment level; and 90.3 % accuracy, 88.1 % precision, 91.8 % recall, and 0.83 MCC at the per-residue level.

Table 2 Success rates of WRF-TMH at different levels using individual and Hybrid feature space

Performance of weighted RF using selected feature space

After applying SVD, the performance of weighted RF is boosted on each feature space for both datasets, as reported in Table 3. On the low-resolution dataset, the performance of weighted RF with the individual feature spaces is comparable at each level. In contrast, with the hybrid feature space the performance is enhanced: 76.9 % accuracy at the per-protein level; 96.1 % precision and 95.1 % recall at the per-segment level; and 90.8 % accuracy, 87.8 % precision, 81.1 % recall, and 0.78 MCC at the per-residue level.

Table 3 Success rates of WRF-TMH at different levels after applying SVD using individual and hybrid feature spaces

On the high-resolution dataset, weighted RF performs better with the physicochemical-based feature space than with the compositional index-based feature space. At the per-protein level, weighted RF obtained 71.1 % accuracy using physicochemical properties and 69.7 % accuracy using the compositional index. As with the low-resolution dataset, the performance is further improved with the hybrid feature space compared with the individual feature spaces. The proposed model achieved 74.0 % accuracy at the per-protein level; 93.3 % precision and 95.5 % recall at the per-segment level; and 92.1 % accuracy, 89.3 % precision, 93.3 % recall, and 0.84 MCC at the per-residue level.

Empirical results reveal that the performance of weighted RF in conjunction with the hybrid feature space is promising on both datasets. The hybrid feature space is the fusion of two feature spaces, which compensate for each other’s weaknesses. In addition, the feature selection technique SVD further improved the performance of weighted RF because it selects only the features with the highest variance from each feature space.

Performance comparison with existing approaches

Performance comparisons of the proposed WRF-TMH model with existing approaches at different levels are presented below.

Performance analysis at protein level

The predicted outcomes of the WRF-TMH model at the per-protein level, along with those of previously published methods, are listed in Table 4. The WRF-TMH model achieved 76.92 % accuracy on the low-resolution dataset. Among the existing approaches, the model of Arai et al. (2004) obtained the highest accuracy, 74.83 %, whereas SVMtop, developed by Lo et al. (2008), obtained 73.29 %. In addition, the performance of the WRF-TMH model is compared with other published methods, including HMMTOP2, TMHMM2, MEMSAT3, Phobius, PHDhtm v.1.96, TopPred2, SOSUI 1.1, and SPLIT4. The accuracy of the WRF-TMH model is 3.63 and 2.09 % higher than that of SVMtop and ConPred-II, respectively. On the high-resolution dataset, the WRF-TMH model yielded 74 % accuracy. Among current state-of-the-art methods, SVMtop obtained the highest accuracy, 72.09 % (Lo et al. 2008), while ConPred-II obtained 69.14 % (Arai et al. 2004). The success rate of the WRF-TMH model is 1.91 and 4.86 % higher than that of SVMtop and ConPred-II, respectively, and exceeds that of the other existing methods.

Table 4 Performance comparison with existing approaches

Performance analysis at segment level

At the per-segment level, the performance of the WRF-TMH model is measured by the recall and precision of the TM helix segments. The recall and precision of the WRF-TMH model and of other existing methods are shown in columns 3–4 of Table 4. The performance of the WRF-TMH model is also substantially good at this level compared with previously published methods. On the low-resolution dataset, the WRF-TMH model obtained 96.06 % recall and 95.10 % precision, while SVMtop achieved 94.76 % recall and 93.94 % precision (Lo et al. 2008). Among the other existing methods, several yield comparable recall but worse precision, and vice versa; the recall and precision of ConPred-II are relatively better than those of the other state-of-the-art methods (Arai et al. 2004). The WRF-TMH model achieved results 1.84 and 1.16 % higher than those of SVMtop, and 1.84 and 2.89 % higher than those of ConPred-II. On the high-resolution dataset, our proposed approach achieved 93.26 % recall and 95.45 % precision, whereas SVMtop yielded 92.78 % recall and 94.46 % precision.

Performance analysis at residue level

The performance of the WRF-TMH model is also analyzed at the per-residue level. At the per-protein and per-segment levels only TM helix segments are considered, whereas at the per-residue level both TM and non-TM residues are assessed. At this level, the performance of the WRF-TMH model is evaluated using four measures: accuracy, recall, precision, and MCC. On the low-resolution dataset, the WRF-TMH model obtained 90.84 % accuracy, 87.81 % recall, 81.11 % precision, and an MCC of 0.78. In the existing literature, SVMtop achieved 89.23 % accuracy, 87.50 % recall, 80.35 % precision, and an MCC of 0.77 (Lo et al. 2008). In addition, TMHMM2, proposed by Krogh et al. (2001), achieved 89.23 % accuracy, 82.82 % recall, 83.03 % precision, and an MCC of 0.76, while ConPred-II, proposed by Arai et al. (2004), obtained 90.07 % accuracy, 84.37 % recall, 84.13 % precision, and an MCC of 0.78. On the high-resolution dataset, the WRF-TMH model achieved 92.13 % accuracy, 89.27 % recall, 93.33 % precision, and an MCC of 0.84, whereas the best of the existing methods, SVMtop, achieved 90.90 % accuracy, 87.84 % recall, 84.36 % precision, and an MCC of 0.81. Likewise, PolyPhobius yielded 88.79 % accuracy and an MCC of 0.77, whereas SPLIT4 obtained 83.84 % recall and ConPred-II 84.17 % precision. Note that earlier methods counted a helix segment as correctly predicted with an overlap of only three residues (Jayasinghe et al. 2001a), which was later increased to nine residues (Jayasinghe et al. 2001b); Moller et al. (2000) likewise considered a nine-residue-long segment in their proposed model, whereas our proposed model considers an 11-residue-long segment.
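The four per-residue measures follow directly from the counts of true/false positives and negatives accumulated over all residues. A minimal sketch of their standard definitions (not taken from the paper's own code):

```python
import math

def residue_metrics(tp: int, tn: int, fp: int, fn: int):
    """Per-residue accuracy, recall, precision, and MCC from confusion counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    recall = tp / (tp + fn)          # fraction of true TM residues recovered
    precision = tp / (tp + fp)       # fraction of predicted TM residues that are correct
    # Matthews correlation coefficient; 0 by convention when any marginal is empty.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return accuracy, recall, precision, mcc
```

Unlike accuracy, MCC stays informative when TM residues are a minority class, which is why it is reported alongside the other three measures.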

Finally, we conclude that our proposed method obtains remarkable results at all three levels. The advantage of our proposed approach over the existing methods is that it not only achieves the highest accuracy but also increases the length of the overlap segments. These gains are attributable to the fusion of two informative protein sequence representation methods with an ensemble classifier, i.e., the weighted RF.

Conclusion

Owing to the dynamic role of TM helices in living organisms, it is indispensable to develop an accurate, effective, and high-quality model for predicting TM helix. For this purpose, we propose the prediction model WRF-TMH, which shows superior performance compared with the existing approaches. The proposed model is based on two different feature extraction schemes: the compositional index and physicochemical properties. To prevent the model from being trained on unnecessary and irrelevant features, SVD is applied. A weighted RF is utilized to handle class imbalance by assigning different weights to the classes. The performance of the classifier is evaluated through tenfold cross-validation on two benchmark datasets. The predicted results of the WRF-TMH model are, so far, higher than those of the existing methods at each level. We therefore anticipate that our proposed method may play a significant role and provide vital information for further structural and functional studies of membrane proteins.
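The overall pipeline summarized above — feature extraction, SVD-based dimensionality reduction, and a class-weighted RF evaluated by tenfold cross-validation — can be sketched with scikit-learn on placeholder data. This is only an illustrative assembly: the actual WRF-TMH features, SVD rank, tree count, and per-class weights are not reproduced here, and `class_weight="balanced"` merely stands in for the paper's weighting scheme:

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 40))             # placeholder feature vectors
y = (rng.random(200) < 0.25).astype(int)   # imbalanced TM vs non-TM labels

model = make_pipeline(
    # Discard redundant/irrelevant feature dimensions before classification.
    TruncatedSVD(n_components=10, random_state=0),
    # Class weighting counters the bias toward the majority (non-TM) class.
    RandomForestClassifier(n_estimators=100,
                           class_weight="balanced",
                           random_state=0),
)
scores = cross_val_score(model, X, y, cv=10)  # tenfold cross-validation
```

Reducing features before the forest and weighting the classes inside it mirrors the two design decisions the conclusion highlights, independently of the specific dataset.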