Splice site identification in human genome using random forest

Pashaei, Elham; Ozen, Mustafa; Aydin, Nizamettin

doi:10.1007/s12553-016-0157-z


Original Paper
Published: 02 December 2016

Volume 7, pages 141–152, (2017)

Health and Technology


Abstract

Gene identification has become an increasingly important task since the Human Genome Project. Splice site prediction lies at the heart of identifying human genes, so the development of new methods that detect splice sites accurately is crucial. Machine learning classifiers are used to detect splice sites, and their performance depends mainly on the DNA encoding method (feature extraction) and on feature selection. Feature extraction methods try to capture as much of the information in the DNA sequences as possible, while feature selection methods provide useful biological knowledge by removing redundant information. According to the literature, Markovian models are popular encoding methods and the support vector machine (SVM) is regarded as the best algorithm for classifying splice sites. However, random forest (RF) may outperform the SVM in this domain when the same Markovian encoding methods are used. In this study, the performance of RF has been investigated for both feature selection and classification in the splice site domain. We propose three methods, namely MM1-RF, MM2-RF and MCM-RF, by combining RF with the first order Markov model (MM1), the second order Markov model (MM2), and the Markov chain model (MCM). We compared the performance of the RF with that of the SVM on the HS3D and NN269 benchmark datasets. We also compared the proposed methods with other current state-of-the-art methods such as Reduced MM1-SVM, SVM-B and LVMM2. The experimental results show that the RF outperforms the SVM when the same Markovian encoding methods are used on both donor and acceptor datasets. Furthermore, the RF classifier runs much faster than the SVM classifier in detecting splice sites.


1 Introduction

The amount of biological sequence data has increased rapidly over the past few decades, so there is a pressing need for effective methods to detect genes [1, 2]. Despite many efforts, the problem has not yet been solved satisfactorily [3]. Accurate splice site identification is essential in gene detection. In eukaryotic genomes, each gene is composed of exons and introns. Only the exons, which code for proteins, are retained in the mature mRNA; the introns are removed by splicing [4]. The term splice site refers to the boundary between an exon and an intron [5]. The intron-exon junction, with the consensus dinucleotide AG, is called the acceptor splice site, while the exon-intron junction, with the consensus dinucleotide GT, is called the donor splice site (see Fig. 1) [3]. In a DNA sequence, splice site prediction is a search problem: finding the donor and acceptor boundaries.

Almost all methods proposed for splice site prediction consist of three main steps: a suitable encoding scheme (feature extraction), feature selection (optional), and classification. Machine learning methods are used to detect splice sites (classification step). Machine learning classifiers take numerical input, whereas DNA sequences are given as strings. Therefore, encoding the DNA sequence into numbers is the first and central task of splice site prediction (feature extraction step) [6]. Probabilistic encoding approaches such as the zero order Markov model (MM0), the first order Markov model (MM1), the second order Markov model (MM2), and the Markov Chain Model (MCM) are well known and widely used [7–16].

In biology, where structures such as splice sites are described by a large number of features, feature selection is an important step towards classification. It provides useful biological knowledge and allows faster and better classification. Feature selection techniques can be divided into two groups according to how they evaluate features: wrapper methods and filter methods [17, 18]. Wrapper methods choose the feature subset based on the classifier's performance, whereas filter methods assess the relevance of features via statistical criteria instead of cross-validation performance. Wrapper methods therefore give better performance than filter methods, because they take feature dependencies into account and interact directly with the classifier. However, they are computationally more expensive than filter approaches [18]. Filter methods, on the other hand, are known as fast, easily scalable and efficient feature selection approaches in bioinformatics [17, 18]. There are two types of filter methods, univariate and multivariate. Most filter methods in the literature are univariate [17]. Multivariate filter methods can find relationships among the features, whereas univariate methods consider each feature individually. Therefore, univariate filter methods cannot disclose mutual information between features [19]. Many wrapper and filter approaches exist in the literature. Particle swarm optimization (PSO), genetic algorithms (GA), and sequential forward and backward selection are examples of the wrapper approach, while chi-square, correlation coefficient, and Fisher score (F-score) feature ranking are examples of filter approaches. Few works have applied feature selection techniques to splice site prediction. Principal feature analysis (PFA) is a multivariate filter method that has been employed by Maji [15] in human splice site prediction. F-score feature ranking [9, 14] and the estimation of distribution algorithm (EDA) ranking method [20] are two univariate filter methods that have been applied to human and plant splice sites, respectively. The EDA has also been used as a wrapper approach in [21], where it showed good performance in plant splice site prediction.

Random forests (RF) are among the most popular machine learning methods due to their relatively good performance. They also provide a method for feature selection [22–24]. Random forest feature ranking (variable importance) has been used in various domains such as integrated analysis of multiple data types [25], biomarker discovery [26] and multi-label classification [27]. In this study we investigate the ability of random forest feature ranking in the splice site prediction domain.

Various successful computational methods such as the support vector machine [1, 6, 8, 9, 14, 28], decision trees [29, 30], the hidden Markov model [13, 31], artificial neural networks [2, 32–34] and Bayesian networks [35, 36] have been developed to recognize splice junctions in DNA sequences. Among them, the SVM is the most popular classifier [5]. Baten [8] used the MM1 encoding method to extract features from splice site sequences and passed them to an SVM as input for classifying splice sites. Reduced MM1-SVM [9] was developed using the F-score feature ranking method to choose a subset of more informative MM1 parameters for the SVM. Zhang [6] constructed a mapping method from Bayes' rule and integrated it with a linear SVM (SVM-B) to predict splice sites. A length-variable Markov model (LVMM) [13] was developed by employing the MM2 encoding. The method can choose a particular subset of features to predict a candidate splice site according to the likelihood ratio at each position. Despite the high accuracy that the LVMM method produces, determining its threshold parameters is not an easy task [14]. In [15], a hybrid approach using the second order Markov model and an SVM with principal feature analysis as a new feature selection method (MM2F-SVM) has been proposed. The MCM encoding method [12], which is a combination of MM1 and MM2, has provided inputs for an SVM classifier in [16]. Despite these methods, splice site identification remains a major bottleneck in gene detection because of the complex dependencies between the bases around a splice site [37]. Therefore, the development of accurate methods to identify splice junctions continues [2].

This study is concerned with RF for feature selection [23] and classification in the splice site prediction domain. The performance of the RF ranking method has been compared with that of F-score feature ranking [38] using the learning curve concept. Liu [39] and Kocev [27] have remarked on the use of learning curves to show the effect of adding features when an ordered list of features is provided. We have investigated the effect of both ranking methods on the HS3D datasets, with the goal of using a small number of features to achieve better classification performance.

Due to its high performance, the SVM classifier is frequently used in the prediction of splice sites. However, some parameters of the SVM classifier, such as the penalty parameter, the kernel type, and the kernel parameters, must be tuned. Parameter tuning can be time-consuming when multiple parameters are involved in the training. It is therefore questionable whether the SVM is a suitable method for genome-wide splice site prediction [13]. In this study, we have combined RF, as an efficient and fast classifier, with three predefined encoding methods (MM1 [8], MM2 [15], MCM [3, 12, 16]) and compared the results with those of the SVM. We have also investigated the effect of our methods on the HS3D and NN269 datasets and have evaluated the efficiency of the proposed methods by comparing them with some current methods such as MM1-SVM [9], SVM-B [6], LVMM [13], MCM-SVM [16], and MM2F-SVM [15].

The remainder of the paper is organized as follows. In Section 2, Materials and methods are described. Experimental results are explained in Section 3. Section 4 provides the conclusion.

2 Materials and methods

2.1 Splice sites datasets

Experiments have been performed on the Homo sapiens Splice Site Data set (HS3D) [40], which is composed of 2796 confirmed true donor sites, 2880 confirmed true acceptor sites, 271,937 false donor sites, and 329,374 false acceptor sites. The performance of the proposed methods is examined on donor and acceptor sites separately. Each splice site sequence consists of 140 nucleotides, with the consensus dinucleotide AG at positions 69 and 70 for acceptor sites and the consensus dinucleotide GT at positions 71 and 72 for donor sites. Balanced (1:1) and unbalanced (1:10) datasets have been formed, both of which include all the true splice sites. In the balanced dataset the number of randomly selected false splice sites equals the number of true splice sites, whereas in the unbalanced dataset the number of randomly selected false splice sites is 10 times the number of true splice sites.

We have performed an additional evaluation on the NN269 dataset [10] to estimate the reproducibility and consistency of our method. The dataset has been gathered from 269 human genes and is composed of 1324 true acceptor sites, 5552 false acceptor sites, 1324 true donor sites, and 4922 false donor sites. The NN269 dataset has been divided into two subsets: the acceptor dataset and the donor dataset. The training dataset for acceptor (donor) sites is made up of 1116 true acceptor (donor) sites and 4672 false acceptor (4140 false donor) sites. The test dataset contains 208 true acceptor (donor) sites and 881 false acceptor (782 false donor) sites. We evaluate the efficiency of the proposed methods on acceptor sites and donor sites separately. The acceptor splice site sequences are 90 nucleotides long, whereas the donor splice site sequences are 15 nucleotides long. The consensus dinucleotide AG in acceptor splice sites is at positions 69 and 70, and the consensus dinucleotide GT in donor splice sites is at positions 8 and 9.

2.2 Markovian based encoding methods

To classify splice sites, DNA sequences must be represented as feature vectors. Different encoding methods are applied to DNA sequences to extract the associated features. Each encoding method tries to retain as much of the information in the sequences as possible. The performance of a classifier used in splice site prediction depends strongly on the DNA encoding method, so effective encoding methods for extracting feature vectors from DNA sequences are essential. In this study, MM1 encoding [8], MM2 encoding [15], and MCM encoding [12, 16] have been used. A Markov model describes a sequence of possible states in which the probability of each state depends only on the preceding states.

Consider a sequence $ (s_1, s_2, \dots, s_n) $ of length n. The nucleotide $ s_i $ is a realization of the i th state variable of the Markov chain. Each state is characterized by a position-specific probability parameter. The sets of parameters in the first order and second order Markov models are $ \{P(s_i \mid s_{i-1})\} $ and $ \{P(s_i \mid s_{i-1}, s_{i-2})\} $, respectively. The model parameters are estimated by Eq. (1):

$$ P\left({s}_i|{s}_{i-1},\dots, {s}_{i-k}\right)=\frac{N\left({s}_{i-k},\dots, {s}_i\right)}{N\left({s}_{i-k},\dots, {s}_{i-1}\right)} $$

(1)

where k denotes the order of the Markov model and $ N(s_{i-k}, \dots, s_i) $ is the number of occurrences of $ (s_{i-k}, \dots, s_i) $. In this study, k = 1 and k = 2 have been chosen for MM1 and MM2, respectively. As mentioned in [8], only true splice site sequences are used to build the Markov model.
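As an illustration of Eq. (1) for k = 1, the following minimal Python sketch estimates position-specific transition probabilities from a set of true splice site sequences and encodes a sequence as a vector of those probabilities. The function names `fit_mm1` and `encode_mm1` are hypothetical, the sketch assumes equal-length sequences over the alphabet A, C, G, T, and assigning probability 0 to unseen contexts is an assumption rather than a choice stated in the paper.

```python
from collections import defaultdict

ALPHABET = "ACGT"

def fit_mm1(true_sites):
    """Estimate position-specific P(s_i | s_{i-1}) from true sites (Eq. 1 with k = 1)."""
    n = len(true_sites[0])
    pair_counts = [defaultdict(int) for _ in range(n)]  # N(s_{i-1}, s_i) at position i
    prev_counts = [defaultdict(int) for _ in range(n)]  # N(s_{i-1}) at position i
    for seq in true_sites:
        for i in range(1, n):
            pair_counts[i][(seq[i - 1], seq[i])] += 1
            prev_counts[i][seq[i - 1]] += 1
    probs = [dict() for _ in range(n)]  # probs[0] stays empty: no predecessor
    for i in range(1, n):
        for a in ALPHABET:
            for b in ALPHABET:
                denom = prev_counts[i][a]
                # Assumption: unseen contexts get probability 0.
                probs[i][(a, b)] = pair_counts[i][(a, b)] / denom if denom else 0.0
    return probs

def encode_mm1(seq, probs):
    """Map a sequence of length n to n - 1 numeric features, one per transition."""
    return [probs[i][(seq[i - 1], seq[i])] for i in range(1, len(seq))]
```

Each encoded sequence has one feature per adjacent pair of positions, so a 140-nucleotide sequence would yield 139 features under this sketch.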

The MCM was used earlier by Lio in [12] and has recently been employed again in [3, 16]. This encoding method uses both the MM1 and MM2 encoding methods. Each sequence is broken down into three parts: the signal segment ($ S^S $), the upstream segment ($ S^U $), and the downstream segment ($ S^D $), as shown in Fig. 2. The signal segment is encoded by MM1 and the model is denoted by $ M_S $. The upstream and downstream segments are encoded using MM2 and the models are denoted by $ M_U $ and $ M_D $, respectively. We also define a false model $ M_F $ to characterize the signal segment of false splice sites. The final model is the combination of these, that is $ (M_U, M_S, M_F, M_D) $. We have set $ l_U = 30 $, $ l_S = 47 $, and $ l_D = 63 $ bp for donor sites and $ l_U = 48 $, $ l_S = 21 $, and $ l_D = 69 $ bp for acceptor sites in the HS3D dataset, and $ l_U = 3 $, $ l_S = 9 $, and $ l_D = 3 $ bp for donor sites and $ l_U = 52 $, $ l_S = 19 $, and $ l_D = 19 $ bp for acceptor sites in the NN269 dataset.
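The segmentation step of the MCM encoding can be sketched as follows. This is only an illustration of splitting a sequence into upstream, signal and downstream parts by the stated lengths; the function name `split_segments` is hypothetical, and the assumption that the three segments are laid out contiguously from the start of the sequence is ours.

```python
def split_segments(seq, l_u, l_s, l_d):
    """Split a sequence into (upstream, signal, downstream) segments.

    Assumption: segments are contiguous and start at the first base;
    any bases beyond l_u + l_s + l_d are ignored.
    """
    if len(seq) < l_u + l_s + l_d:
        raise ValueError("sequence is shorter than the three segments combined")
    upstream = seq[:l_u]
    signal = seq[l_u:l_u + l_s]
    downstream = seq[l_u + l_s:l_u + l_s + l_d]
    return upstream, signal, downstream
```

For an NN269 donor sequence of 15 nucleotides, `split_segments(seq, 3, 9, 3)` would return segments of 3, 9 and 3 bases; the signal segment would then go to an MM1 encoder and the other two to an MM2 encoder.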

2.3 Random forest classifiers

The RF, introduced by Breiman in 2001 [41], is an ensemble classification algorithm based on decision trees. Each tree in the forest is trained on samples drawn randomly with replacement (bootstrap) from the original dataset; the remaining samples are used as the test set for that tree. At each node, a tree considers m randomly chosen features out of the M available features when selecting the split (mtry). A random forest with k decision trees (ntree) repeats this procedure for each tree, and the final classification is obtained by a vote among the k trees on the test data. Figure 3 describes the steps of the Random Forest algorithm. We have implemented the Random Forest algorithm using the “randomForest” package in R. The Random Forest has two tuning parameters, “mtry” and “ntree”: the number of features considered at each node for splitting and the number of trees grown in the forest, respectively. In this study, “mtry” is equal to √M, while “ntree” is equal to 500 (the default value) on the HS3D dataset. The value of “ntree” has been set to 530 for the NN269 dataset.
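The three ingredients described above (bootstrap sampling, the number of features tried per split, and voting) can be sketched in a few lines of Python. This is not the R implementation used in the study; the helper names are hypothetical, and rounding √M down to an integer is an assumption.

```python
import math
import random
from collections import Counter

def bootstrap_split(n_samples, rng):
    """Draw n_samples indices with replacement; the rest are out-of-bag."""
    in_bag = [rng.randrange(n_samples) for _ in range(n_samples)]
    chosen = set(in_bag)
    out_of_bag = [i for i in range(n_samples) if i not in chosen]
    return in_bag, out_of_bag

def default_mtry(n_features):
    """Number of features tried at each split: sqrt(M), rounded down (assumption)."""
    return max(1, math.isqrt(n_features))

def majority_vote(tree_predictions):
    """Final class = most frequent prediction among the trees.

    Ties are broken in favour of the class seen first.
    """
    return Counter(tree_predictions).most_common(1)[0][0]
```

With 500 trees, `bootstrap_split` would be called once per tree, and `majority_vote` would combine the 500 per-tree predictions for each test sequence.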

2.4 Support vector machine classifier

The SVM [42] is one of the most important learning machines and has been used in many domains due to its excellent classification accuracy. The SVM aims to find a maximal margin hyperplane that separates the classes. Kernel functions are used to map the data to a higher dimensional space in order to learn decision functions that are not linearly separable in the original space. New instances are classified according to the side of the hyperplane on which they fall [43]. The accuracy of the SVM depends largely on a properly chosen kernel and its parameters. This study has adopted the radial basis function (RBF) kernel and used the SVM of the “e1071” package, which is an interface to LIBSVM in R. We have used a grid search to find the optimal parameters (the penalty parameter C and the kernel parameter γ).

2.5 Fisher score feature ranking method

Feature ranking methods typically assign a weight to each feature and rank the features accordingly. Informative features can then be selected and low-scoring features removed. The F-score is a simple univariate filter approach that ranks features according to their discriminative power. Given training instances $ x_i $, $ i = 1, \dots, l $, the F-score of the jth attribute is calculated by:

$$ F(j)=\frac{{\left({\overset{-}{x}}_j^{\left(+\right)}-{\overset{-}{x}}_j\right)}^2+{\left({\overset{-}{x}}_j^{\left(-\right)}-{\overset{-}{x}}_j\right)}^2}{\mathrm{variance}\left({x}_j^{\left(+\right)}\right)+\mathrm{variance}\left({x}_j^{\left(-\right)}\right)} $$

(2)

where $ {\overset{-}{x}}_j^{\left(+\right)} $, $ {\overset{-}{x}}_j^{\left(-\right)} $ and $ {\overset{-}{x}}_j $ are the average of the jth attribute of the positive, negative and whole datasets, respectively. The numerator indicates the inter class variance, while the sum of the variance inside each class is shown by the denominator. High F-score value of an attribute demonstrates that this attribute has more discriminative power [38].
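A minimal Python sketch of the F-score of Eq. (2) is given below. The function names are hypothetical, each dataset is assumed to be a list of numeric feature vectors, and using the sample variance (division by n − 1) within each class is an assumption, since the text only says "variance".

```python
from statistics import mean, variance

def f_score(pos, neg, j):
    """F-score of attribute j given positive and negative feature vectors."""
    xp = [row[j] for row in pos]
    xn = [row[j] for row in neg]
    mean_p, mean_n = mean(xp), mean(xn)
    mean_all = mean(xp + xn)
    numerator = (mean_p - mean_all) ** 2 + (mean_n - mean_all) ** 2
    # Assumption: sample variance within each class.
    denominator = variance(xp) + variance(xn)
    return numerator / denominator if denominator > 0 else 0.0

def rank_by_f_score(pos, neg):
    """Return attribute indices sorted from most to least discriminative."""
    n_features = len(pos[0])
    scores = [f_score(pos, neg, j) for j in range(n_features)]
    return sorted(range(n_features), key=lambda j: scores[j], reverse=True)
```

An attribute whose class means are far apart relative to the spread inside each class receives a high score and is ranked first.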

2.6 Random Forest feature ranking method

A ranking of variables can be obtained from the mechanism of the random forest itself. Each tree in the random forest is built on a bootstrap sample, drawn randomly with replacement, that covers about 2/3 of the training data. The split at each node of a tree is selected from a random subset of the variables (features). After the forest has been built, each tree is tested on the remaining 1/3 of the samples, which were not selected for its bootstrap sample. These samples are called the Out-Of-Bag (OOB) instances, and the prediction error on them is denoted Err(OOB). To rank the variables, each variable j is permuted in turn in the OOB data of every tree, and the prediction error on the permuted OOB data ($ Err_j $) is calculated. The difference between the two errors is then computed for each tree. Finally, the mean and variance of these differences over all trees are calculated. Figure 4 shows the algorithm for calculating the feature ranking using RF. The “FSelector” R package has been used to implement the RF feature ranking method. A more detailed explanation of RF can be found in [27, 44].
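The permutation step can be sketched in Python as follows. This is an illustration under simplifying assumptions, not the FSelector implementation: each "tree" is represented by a hypothetical pair of a prediction function and the list of its out-of-bag indices, and only the mean error difference is returned.

```python
import random

def error_rate(predict, X, y):
    """Fraction of rows for which the prediction differs from the label."""
    wrong = sum(1 for row, label in zip(X, y) if predict(row) != label)
    return wrong / len(y)

def permutation_importance(trees, X, y, j, rng):
    """Mean increase in OOB error when feature j is permuted.

    trees: list of (predict, oob_indices) pairs (hypothetical representation).
    """
    diffs = []
    for predict, oob in trees:
        if not oob:
            continue  # a tree without OOB samples contributes nothing
        X_oob = [list(X[i]) for i in oob]  # copies, so X is left untouched
        y_oob = [y[i] for i in oob]
        base = error_rate(predict, X_oob, y_oob)       # Err(OOB)
        column = [row[j] for row in X_oob]
        rng.shuffle(column)                            # permute feature j only
        for row, value in zip(X_oob, column):
            row[j] = value
        permuted = error_rate(predict, X_oob, y_oob)   # Err_j
        diffs.append(permuted - base)
    return sum(diffs) / len(diffs) if diffs else 0.0
```

A feature that the trees never rely on gets an importance of exactly zero, because permuting it cannot change any prediction.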

2.7 Classification performance evaluation metrics

In this study, sensitivity ($ S_n $), specificity ($ S_p $), global accuracy ($ Q^9 $), the Matthews correlation coefficient (Mcc), the area under the ROC curve (AUC), and the F-measure have been used as performance measures. These measures are defined as follows:

$$ {S}_n=TP/\left(TP+FN\right) $$

(3)

$$ {S}_p=TN/\left(TN+FP\right) $$

(4)

$$ {Q}^9=\frac{1+{q}^9}{2},\kern2em {q}^9=\begin{cases}\dfrac{TN-FP}{TN+FP} & \mathrm{if}\ \left(TP+FN\right)=0\\ \dfrac{TP-FN}{TP+FN} & \mathrm{if}\ \left(TN+FP\right)=0\\ 1-\sqrt{2\left[{\left(\dfrac{FN}{TP+FN}\right)}^2+{\left(\dfrac{FP}{TN+FP}\right)}^2\right]} & \mathrm{if}\ \left(TP+FN\right)\ne 0\ \mathrm{and}\ \left(TN+FP\right)\ne 0\end{cases} $$

(5)

$$ Mcc=\frac{\left(TP*TN\right)-\left(FP*FN\right)}{\sqrt{\left(TP+FN\right)*\left(TN+FP\right)*\left(TP+FP\right)*\left(TN+FN\right)}} $$

(6)

$$ F- measure=2*\frac{TP/\left(TP+FP\right)*{S}_n}{TP/\left(TP+FP\right)+{S}_n} $$

(7)

where TP, FP, TN and FN are the numbers of true positives, false positives, true negatives and false negatives, respectively. Larger values of $ S_n $, $ S_p $, $ Q^9 $, Mcc, and F-measure indicate better classification performance.
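For concreteness, the measures can be computed from the four confusion counts as in the following Python sketch. The function name `splice_metrics` is hypothetical. The sketch uses the standard Matthews correlation coefficient, whose denominator is the square root of the product of the four marginal sums, and takes the global accuracy as (1 + q9) / 2; returning 0 when a denominator vanishes is an assumption.

```python
import math

def splice_metrics(tp, fp, tn, fn):
    """Return Sn, Sp, Q9, Mcc and F-measure from confusion counts."""
    if tp + fp + tn + fn == 0:
        raise ValueError("at least one prediction is required")
    sn = tp / (tp + fn) if (tp + fn) else 0.0
    sp = tn / (tn + fp) if (tn + fp) else 0.0
    # Piecewise definition of q9, then rescaling to [0, 1].
    if tp + fn == 0:
        q9 = (tn - fp) / (tn + fp)
    elif tn + fp == 0:
        q9 = (tp - fn) / (tp + fn)
    else:
        q9 = 1 - math.sqrt(2 * ((fn / (tp + fn)) ** 2 + (fp / (tn + fp)) ** 2))
    big_q9 = (1 + q9) / 2
    denom = math.sqrt((tp + fn) * (tn + fp) * (tp + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0  # assumption: 0 if undefined
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    f_measure = (2 * precision * sn / (precision + sn)) if (precision + sn) else 0.0
    return {"Sn": sn, "Sp": sp, "Q9": big_q9, "Mcc": mcc, "F": f_measure}
```

A perfect classifier gives 1 for every measure, while a classifier that gets every prediction wrong gives a q9 of −1 and therefore a global accuracy of 0.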

The Receiver Operating Characteristic (ROC) curve is obtained by plotting sensitivity against 1 − specificity and is used to visualize the performance of a binary classifier. The area under the ROC curve (AUC) summarizes this performance in a single number. Similarly, plotting precision against recall (sensitivity) gives the precision-recall curve (PRC), and the area under the PRC (auPRC) again summarizes the performance in a single number. Higher values of AUC and auPRC indicate a more accurate model.
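As a small illustration, the AUC can be computed without drawing the curve, using the fact that it equals the probability that a randomly chosen true site receives a higher score than a randomly chosen false site (ties counted as one half). The function name `roc_auc` is hypothetical, and this quadratic-time version is only meant for small examples.

```python
def roc_auc(scores, labels):
    """Area under the ROC curve via pairwise comparison of scores.

    labels: 1 for true splice sites, 0 for false ones.
    """
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))
```

A value of 0.5 corresponds to random ranking and 1.0 to a ranking that places every true site above every false site.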

2.8 Cross-validation design

10-fold cross-validation has been used to evaluate the performance of our methods on the HS3D dataset [13, 14]. For this, we have divided the datasets into 10 parts (folds) of equal size. A model is then built using 9 of the folds as the training set and the remaining fold as the test set. This process is repeated 10 times, with a different fold used as the test set each time. Furthermore, we have repeated each experiment 5 times to increase the reliability of the evaluation. Each time, different folds are generated randomly, and the average of the 5 independent repeats is reported.
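The evaluation design can be sketched as a generator of train and test index lists. The function name `repeated_kfold` and the fixed seed are assumptions made for the sketch; the study itself was carried out in R.

```python
import random

def repeated_kfold(n_samples, k=10, repeats=5, seed=0):
    """Yield (repeat, fold, train_idx, test_idx) for repeated k-fold CV."""
    rng = random.Random(seed)
    for r in range(repeats):
        idx = list(range(n_samples))
        rng.shuffle(idx)                       # new random folds for every repeat
        folds = [idx[f::k] for f in range(k)]  # k folds of near-equal size
        for f in range(k):
            test_idx = folds[f]
            train_idx = [i for g in range(k) if g != f for i in folds[g]]
            yield r, f, train_idx, test_idx
```

With k = 10 and 5 repeats this yields 50 train and test splits; a performance measure would be averaged over the 10 folds of each repeat and then over the 5 repeats.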

Because of the large difference between the numbers of true and false sites in the unbalanced (1:10) HS3D datasets, the classifiers tend to be biased towards the majority class [45]. To overcome this problem, an under-sampling technique [46, 47] has been used. Only the training set (9 folds out of 10) is modified; each fold contains the same proportion of true to false sites as the unbalanced dataset. Training is performed on all the true sites in the training set together with the same number of randomly selected false sites, and the test set is left unmodified.
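The under-sampling of the training folds can be sketched as follows. The function name `undersample` is hypothetical; labels are assumed to be 1 for true sites and 0 for false sites.

```python
import random

def undersample(train_idx, labels, rng):
    """Keep all true sites and an equal number of randomly chosen false sites.

    Only training indices are passed in, so the test fold is never modified.
    """
    pos = [i for i in train_idx if labels[i] == 1]
    neg = [i for i in train_idx if labels[i] == 0]
    sampled_neg = rng.sample(neg, min(len(pos), len(neg)))  # without replacement
    return pos + sampled_neg
```

The returned index list is balanced 1:1 even though the fold it came from has a 1:10 ratio, while the held-out fold keeps the original ratio.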

For the NN269 dataset, in order to tune the parameters of the SVM, we divided the training dataset into 10 folds of equal size. Each fold contains the same proportion of true to false sequences. For each parameter combination, we trained on 9 of the 10 folds and evaluated the model on the remaining fold. We selected the model with the highest average auPRC over the 10 evaluation sets. This best model was then trained on the complete training dataset, and the final evaluation was performed on the corresponding independent test set. According to [48], the auPRC is more informative than the AUC when a binary classifier is evaluated on an imbalanced dataset. We therefore focused on the auPRC for model selection of the SVM.

2.9 Statistical comparison among classifiers

When classifiers are compared, it is important to determine whether the differences between their results are statistically significant. We therefore used a t-test to assess the significance of differences in classification performance. The null hypothesis of the test is that there is no difference between the performance of the SVM and the RF. A significance level of α = 0.01 has been used in this study.
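The text does not say which variant of the t-test was used. As one possible reading, the sketch below computes a paired t statistic from matched performance scores (for example, F-measures of the two classifiers on the same repeats). The paired design, the function name `paired_t_statistic`, and the input layout are all assumptions; the p-value would be obtained from a t distribution with the returned degrees of freedom.

```python
import math
from statistics import mean, stdev

def paired_t_statistic(scores_a, scores_b):
    """Return (t, degrees_of_freedom) for paired samples of equal length."""
    if len(scores_a) != len(scores_b) or len(scores_a) < 2:
        raise ValueError("need two equally long samples with at least 2 values")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    standard_error = stdev(diffs) / math.sqrt(len(diffs))
    if standard_error == 0:
        raise ValueError("all differences are identical; t is undefined")
    return mean(diffs) / standard_error, len(diffs) - 1
```

The null hypothesis of equal performance would be rejected at α = 0.01 when the absolute value of t exceeds the corresponding critical value of the t distribution.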

2.10 Proposed methods to assess performance of RF

RF as feature ranking

The proposed procedure for investigating the RF feature ranking approach in human splice site detection consists of two steps (see Fig. 5). In the first step, we apply the RF feature ranking method to the training dataset. This assigns a value to each feature that indicates its importance for classification accuracy. We then sort the features in decreasing order of these values. In the second step, we evaluate the ranking by performing a stepwise feature subset evaluation, which provides the learning curve. For this purpose, we select the top-k ranked features from the ordered list, evaluate the performance of the classifier on the chosen feature subset, and construct the forward feature addition (FFA) curve.
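The second step can be sketched as a loop over growing prefixes of the ranked feature list. The function names are hypothetical, and `evaluate` stands for any user-supplied routine (for example, cross-validated accuracy of a classifier restricted to the given features); it is not part of the original study's code.

```python
def forward_feature_addition(ranked_features, evaluate, step=1):
    """Return the FFA curve as a list of (k, score) for the top-k features."""
    curve = []
    for k in range(step, len(ranked_features) + 1, step):
        subset = ranked_features[:k]          # top-k features of the ranking
        curve.append((k, evaluate(subset)))
    return curve

def optimal_point(curve):
    """Smallest k that reaches the best score on the curve."""
    return max(curve, key=lambda point: point[1])
```

Plotting the returned pairs gives the learning curve, and `optimal_point` identifies the number of top-ranked features at which performance peaks.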

RF as classifier

Splice site prediction is subdivided into two separate classification problems: acceptor splice site classification and donor splice site classification. For both problems, we try to identify whether a candidate splice site is a true splice site (positive) or not (negative), so two different models are constructed to make the predictions. Each model consists of two phases: feature extraction using an encoding scheme, and classification. The proposed methods MM1-RF, MM2-RF, and MCM-RF use the Markovian encoding approaches MM1, MM2, and MCM to provide features and use RF for classification. The steps of the models are outlined in Fig. 6.

3 Results

3.1 Efficiency of RF as feature ranking approach

The performance of the selected attributes on the balanced and unbalanced datasets is shown in Fig. 7. The figure shows that the accuracy of the simple MM1-SVM is improved by using feature ranking approaches.

On the balanced datasets (see Fig. 7a and b), both feature ranking methods reach approximately the same accuracy at their optimal points, and these optimal points coincide for both acceptor and donor sites. With either ranking method, the optimal point is reached with the top 60% of features on the balanced acceptor dataset and the top 30% of features on the balanced donor dataset. On the unbalanced datasets, shown in the second row of Fig. 7, the RF ranking gives higher accuracy than the F-score on acceptor sites (see Fig. 7c), and its optimal point is reached with fewer attributes. On unbalanced donor splice sites (see Fig. 7d), the F-score performs better than the RF ranking method. Thus, over the 4 datasets, the RF ranking method records two ties, one win and one loss. On average, the RF feature ranking method is therefore a good candidate for feature selection as a preprocessing step in splice site prediction methods.

3.2 Efficiency of RF as classifier

The classification results are shown in Table 1. Because different encoding methods produce different training data, we considered each row of the table as an independent dataset. Our experiment therefore used 18 different datasets (9 for the acceptor sites, 9 for the donor sites). The performance was estimated using various measures, but we used the F-measure for the statistical comparison (reported P-value) between SVM and RF. Note that we could not carry out a statistical evaluation on the NN269 dataset because of its predefined separation into training and test sets. Instead, we consider the results on this dataset significant when the difference in F-measure between SVM and RF exceeds 1.50%.

Table 1 Comparison of classification performance of SVMs and RFs using Markovian encoding methods


According to the results, the RF outperforms the SVM significantly on 8 datasets and nominally on 4 datasets, while the SVM outperforms the RF significantly on 4 datasets and nominally on 2 datasets. Overall, the RF therefore performs better than the SVM on 12 of the 18 datasets. In terms of computational efficiency, as can be seen from the CPU time column of Table 1, the RF ran much faster than the SVM because of the parameter tuning process involved in the SVM.

In addition, the classification results of the proposed methods MM1-RF, MM2-RF and MCM-RF are compared with those of the MM1-SVM [8], Reduced MM1-SVM [9], SVM-B [6], LVMM2 [13], MM2F-SVM [15] and MCM-SVM [16] methods in Fig. 8, using the $ Q^9 $ criterion for the HS3D dataset and the auPRC for the NN269 dataset. The result of the LVMM2 was taken from [13].

On both balanced datasets in Fig. 8, the proposed method MM1-RF outperformed MM1-SVM, Reduced MM1-SVM, SVM-B and MM2F-SVM for both acceptor (Fig. 8a) and donor splice sites (Fig. 8b), but did not perform better than MCM-SVM. The two other proposed methods, MM2-RF and MCM-RF, performed better than MM1-RF for both acceptor and donor sites. On balanced acceptor splice sites (Fig. 8a), MM2-RF and MCM-RF showed the same performance and both outperformed the other methods. On balanced donor sites (Fig. 8b), MCM-RF performed better than MM2-RF and MM1-RF and outperformed all of the other methods except MCM-SVM.

On the unbalanced acceptor dataset (Fig. 8c), MM1-RF outperformed MM1-SVM, Reduced MM1-SVM and SVM-B and produced results comparable to those of LVMM and MM2F-SVM. The MCM-RF method performed better than MM1-RF and outperformed LVMM and MM2F-SVM. The MM2-RF method performed better than MCM-RF, outperformed all methods significantly, and stood out as the best method on unbalanced acceptor splice sites. On unbalanced donor sites (Fig. 8d), MM1-RF outperformed MM1-SVM, Reduced MM1-SVM, SVM-B and MM2F-SVM. The MM2-RF performed better than MM1-RF and produced results comparable to those of LVMM. The MCM-RF performed slightly better than the MM2-RF and, like the MM2-RF, outperformed all the methods except MCM-SVM. The proposed methods MM2-RF and MCM-RF also performed slightly better than LVMM2. Moreover, determining the threshold parameters of the LVMM [13] is difficult [14], so the proposed methods are less complex than LVMM2.

The overall comparison can be summarized as follows. On the balanced acceptor dataset, MM2-RF and MCM-RF showed the best performance. The MCM-SVM method showed better accuracy than the proposed methods on balanced donor splice sites. On the unbalanced datasets, the MM2-RF outperformed all the methods on acceptor sites, and MCM-SVM again showed higher accuracy on donor sites. These results suggest that our proposed methods are more suitable for acceptor sites than for donor sites. Additionally, when RF and SVM are compared using the same encoding methods, the proposed methods performed better in most cases.

To estimate the consistency of the proposed methods, we performed an additional evaluation on the NN269 dataset. For acceptor sites (Fig. 8e), the auPRC of the MM1-RF is better than that of MM1-SVM and Reduced MM1-SVM. The MM2-RF performed better than MM2F-SVM and SVM-B. The MCM-RF outperformed all of the methods except MCM-SVM, which performed better than the proposed methods. For donor sites (Fig. 8f), the auPRC of the MM1-RF method is lower than that of the other available models. The MM2-RF and MCM-RF showed the same accuracy in terms of auPRC, and both outperformed all methods except SVM-B and MCM-SVM. Overall, the proposed methods produced good results on the NN269 dataset.

4 Conclusion and discussion

In this study, we have examined RF as a new classifier and feature selection method in the human splice site prediction domain. Since a large number of features are used to describe structures or processes in biology, eliminating irrelevant and redundant information provides useful biological knowledge for human experts. F-score feature ranking is a simple and efficient method that is frequently used in the splice site prediction domain. We have investigated the efficiency of the RF feature ranking method by comparing it with the F-score, to show the capability of RF as a feature selection method in human splice site identification. The results show that RF feature ranking is a useful method in human splice site prediction.

SVM has been the most commonly used classifier for splice site prediction because of its high performance. However, several parameters must be set before it can be used, such as the penalty parameter, the kernel type and the kernel parameters. This makes training a time-consuming process and raises the question of whether SVM is suitable for genome-wide splice site prediction [13]. In this study, we employed RF as another highly successful classifier. One of the main advantages of RF-based methods over SVM-based methods is that they do not require a tuning step, and they are fast while retaining high performance.

By combining RF with three up-to-date encoding methods (MM1, MM2, and MCM), we showed that the proposed methods perform comparably to, and often better than, the SVM-based methods. In addition, the proposed methods are simple, fast and easy to use, and can be applied to large-scale human genome data for identifying splice sites. In future work, these methods can also be applied to the identification of other regulatory regions such as translation initiation sites and promoters.
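To make the encoding step concrete, the sketch below shows one common formulation of a position-specific first-order Markov (MM1) encoding. Transition probabilities are estimated from aligned training sequences, and each sequence is mapped to a vector of those probabilities, which could then be passed to an RF classifier. The paper's exact MM1 definition is given in its methods section and may differ. The function names, the pseudocount, and the toy sequences are assumptions for illustration only.

```python
ALPHABET = "ACGT"


def fit_mm1(sequences, pseudocount=1.0):
    """Estimate position-specific first-order Markov parameters.

    sequences: aligned, equal-length strings over A, C, G, T
    (other symbols such as N are not handled in this sketch).
    """
    length = len(sequences[0])
    first = {b: pseudocount for b in ALPHABET}
    # trans[i][(a, b)]: count of base a at position i - 1 followed by b at i.
    trans = [None] + [
        {(a, b): pseudocount for a in ALPHABET for b in ALPHABET}
        for _ in range(1, length)
    ]
    for seq in sequences:
        first[seq[0]] += 1
        for i in range(1, length):
            trans[i][(seq[i - 1], seq[i])] += 1
    total_first = sum(first.values())
    p_first = {b: count / total_first for b, count in first.items()}
    p_trans = [None]  # index 0 is unused: position 0 has no predecessor
    for i in range(1, length):
        row = {}
        for a in ALPHABET:
            denom = sum(trans[i][(a, b)] for b in ALPHABET)
            for b in ALPHABET:
                row[(a, b)] = trans[i][(a, b)] / denom
        p_trans.append(row)
    return p_first, p_trans


def encode_mm1(seq, model):
    """Map a sequence to one probability per position under the fitted model."""
    p_first, p_trans = model
    features = [p_first[seq[0]]]
    for i in range(1, len(seq)):
        features.append(p_trans[i][(seq[i - 1], seq[i])])
    return features


# Hypothetical toy training set of aligned "true site" windows.
toy_model = fit_mm1(["AGGT", "AGGT", "CGGT"])
toy_features = encode_mm1("AGGT", toy_model)
```

Each sequence of length n becomes an n-dimensional numeric vector, so any standard classifier can consume it. The pseudocount keeps unseen transitions from producing zero probabilities.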

References

1. Sonnenburg S, Schweikert G, Philips P, Behr J, Ratsch G. Accurate splice site prediction using support vector machines. BMC Bioinformatics. 2007;8(Suppl 10):S7.
2. Bin W, Jing Z. A novel artificial neural network and an improved particle swarm optimization used in splice site prediction. J Appl Comput Mathematics. 2014;3(4). doi:10.4172/2168-9679.1000166.
3. Nassa T, Singh S, Goel N. Splice site detection in DNA sequences using probabilistic neural network. Intern J Comp Appl (IJCA). 2013;76(4):1–4.
4. Salekdeh AY, Wiese KC. Improving splice-junctions classification employing a novel encoding schema and decision-tree. Evol Comput (CEC). 2011:1302–7. doi:10.1109/CEC.2011.5949766.
5. Bari AG, Reaz MR, Choi HJ, Jeong BS. Survey on nucleotide encoding techniques and SVM kernel design for human splice site prediction. Interdisciplinary Bio Central. 2012;4(14):1–6. doi:10.4051/ibc.2012.4.4.0014.
6. Zhang Y, Chu C-H, Chen Y, Zha H, Ji X. Splice site prediction using support vector machines with a Bayes kernel. Expert Syst Appl. 2006;30(1):73–81.
7. Burge C, Karlin S. Predictions of complete gene structures in human genomic DNA. J Mol Biol. 1997;9(5):499–509.
8. Baten A, Chang B, Halgamuge S, Li J. Splice site identification using probabilistic parameters and SVM classification. BMC Bioinformatics. 2006;7(Suppl 5):S15.
9. Baten A, Halgamuge S, Chang B. Fast splice site detection using information content and feature reduction. BMC Bioinformatics. 2008;9(Suppl 12):S8.
10. Reese M, Eeckman F, Kulp D, Haussler D. Improved splice site detection in Genie. J Comput Biol. 1997;4(3):311–24.
11. Hebsgaard SM, Korning PG, Tolstrup N, Engelbrecht J, Rouzé P, Brunak S. Splice site prediction in Arabidopsis thaliana pre-mRNA by combining local and global sequence information. Nucleic Acids Res. 1996;24:3439–52.
12. Loi HS, Rajapakse JC. Splice site detection with a higher-order Markov model implemented on a neural network. Genome Informatics. 2003;14:64–72.
13. Zhang Q, Peng Q, Zhang Q, Yan Y, Li K, Li J. Splice site prediction of human genome using length-variable Markov model and feature selection. Expert Syst Appl. 2010;37(4):2771–82.
14. Wei D, Zhang H, Wei Y, Jiang Q. A novel splice site prediction method using support vector machine. J Comput Inf Syst. 2013;9(20):8053–60.
15. Maji S, Garg D. Hybrid approach using SVM and MM2 in splice site junction identification. Curr Bioinforma. 2014;9(1):76–85.
16. Goel N, Singh S, Aseri TC. An improved method for splice site prediction in DNA sequences using support vector machines. Procedia Comp Sci. 2015;57:358–67. doi:10.1016/j.procs.2015.07.350.
17. Ang JC, Mirzal A, Haron H, Hamed HNA. Supervised, unsupervised, and semi-supervised feature selection: a review on gene selection. IEEE/ACM Transac Comp Biol Bioinformatics. 2016;13(5):971–89.
18. Kumari B, Swarnkar T. Filter versus wrapper feature subset selection in large dimensionality micro array: a review. Intern J Comp Sci Inform Technol (IJCSIT). 2011;2(3):1048–53.
19. Hira ZM, Gillies DF. A review of feature selection and feature extraction methods applied on microarray data. Adv Bioinforma. 2015;2015. doi:10.1155/2015/198363.
20. Saeys Y, Degroeve S, Aeyels D, Rouze P, Peer Y. Feature selection for splice site prediction: a new method using EDA-based feature ranking. BMC Bioinformatics. 2004;5(64). doi:10.1186/1471-2105-5-64.
21. Saeys Y, Degroeve S, Aeyels D, Van PD, Rouze P. Fast feature selection using a simple estimation of distribution algorithm: a case study on splice site prediction. Bioinformatics. 2003;19(SUPPL2):179–88.
22. Svetnik V, Liaw A, Tong C, editors. Variable Selection in Random Forest with Application to Quantitative Structure-Activity Relationship. Proceedings of the 7th Course on Ensemble Methods for Learning Machines. USA: Springer-Verlag; 2004.
23. Genuer R, Poggi JM, Tuleau-Malot C. Variable selection using random forests. Pattern Recognition Letters, Elsevier. 2010;31(14):2225–36.
24. Han L, Embrechts MJ, Szymanski B, Sternickel K, Ross A. Random Forests Feature Selection with Kernel Partial Least Squares: Detecting Ischemia from Magneto Cardiograms. Bruges, Belgium: European Symposium on Artificial Neural Networks; 2006. p. 221–6.
25. Reif DM, Motsinger AA, McKinney BA, Crowe JE, Moore JH. Feature Selection using a Random Forests Classifier for the Integrated Analysis of Multiple Data Types. Symposium on Computational Intelligence and Bioinformatics and Computational Biology (CIBCB'06); Toronto: IEEE; 2006. p. 1–8. doi:10.1109/CIBCB.2006.330987.
26. Slavkov I, Zenko B, Dzeroski S. Evaluation method for feature rankings and their aggregations for biomarker discovery. In: JMLR Workshop and Conference Proceedings: Machine Learning in Systems Biology. 2010. vol. 8. p. 122–35.
27. Kocev D, Slavkov I, Dzeroski S, editors. Feature ranking for multi-label classification using predictive clustering trees. International Workshop on Solving Complex Machine Learning Problems with Ensemble Methods, in Conjunction with ECML/PKDD; 2013.
28. Wei D, Zhuang W, Jiang Q, Wei Y. A new classification method for human gene splice site prediction. In: He J, Liu X, Krupinski E, Xu G, editors. Health information science. Lecture notes in computer science. Heidelberg: Springer; 2012. p. 121–30.
29. Lopes HS, Lima CRE, Murata NJ. A configware approach for high-speed parallel analysis of genomic data. J Circuits Syst Comp. 2007;16:527–40.
30. Sun H, Peng Q, Zhang Q, Mou D. Splice site prediction based on characteristic of sequential motifs and C4.5 algorithm. In: 5th International Conference on Fuzzy Systems and Knowledge Discovery. Jinan, Shandong, China: IEEE; 2008. p. 417–22. doi:10.1109/FSKD.2008.331.
31. Yin M, Wang J. Effective hidden Markov models for detecting splicing junction sites in DNA sequences. Inf Sci. 2001;139:139–63.
32. Rajapakse J, Ho L. Markov encoding for detecting signals in genomic sequences. IEEE-ACM Transact Comp Biol Bioinform. 2005;2(2):131–42.
33. Marashi S, Goodarzi H, Sadeghi M, Eslahchi C, Pezeshk H. Importance of RNA secondary structure information for yeast donor and acceptor splice site prediction by neural networks. Comput Biol Chem. 2006;30(1):50–7.
34. Johansen O, Ryen T, Eftesol T, Kjosmoen T, Ruoff P. Splice site prediction using artificial neural networks. In: Masulli F, Tagliaferri R, Verkhivker GM, editors. Computational intelligence methods for bioinformatics and biostatistics. Lecture notes in computer science. Heidelberg: Springer; 2009. p. 102–33.
35. Cai D, Delcher A, Kao B, Kasif S. Modeling splice sites with Bayes networks. Bioinformatics. 2000;16:152–8.
36. Chen T, Lu C, Li W. Prediction of splice sites with dependency graphs and their expanded Bayesian networks. Bioinformatics. 2005;21:471–82.
37. Tsai K, Lin S, Shih S, Lai J, Chenn C. Genomic splice site prediction algorithm based on nucleotide sequence pattern for RNA viruses. Comput Biol Chem. 2009;33:171–5.
38. Chen YW, Lin CJ. Combining SVMs with various feature selection strategies. In: Guyon I, Gunn S, Nikrevesh M, Zadeh L, editors. Feature extraction. Studies in fuzziness and soft computing. New York: Springer; 2006. p. 315–24.
39. Liu H, Motoda H. Feature selection for knowledge discovery and data mining. London: Kluwer Academic Publisher; 1998.
40. Pollastro P, Rampone S. HS3D, a dataset of Homo sapiens splice site regions, and its extraction procedure from a major public database. Inter J Modern Physics. 2002;C13(13):1105–17.
41. Breiman L. Random forests. Machine Learning. 2001;45(1):5–32. doi:10.1023/A:1010933404324.
42. Vapnik VN. Statistical learning theory. Adaptive and learning system for signal processing communications and control. New York: 1998.
43. Statnikov A, Wang L, Aliferis CF. A comprehensive comparison of random forests and support vector machines for microarray-based cancer classification. BMC Bioinformatics. 2008;9:319.
44. Filimon A. Hedge fund fraud prediction using classification algorithms. Merlin: University of Zurich; 2011.
45. Lin WJ, Che JJ. Class-imbalanced classifiers for high-dimensional data. Brief Bioinform. 2012;14(1):13–26. doi:10.1093/bib/bbs006.
46. Ganganwar V. An overview of classification algorithms for imbalanced datasets. Intern J Emerg Technol Advance Eng (IJETAE). 2012;2(4):42–7.
47. Longadge R, Dongre SS, Malik L. Class imbalance problem in data mining: review. Intern J Comp Sci Net (IJCSN). 2013;2(1):83–7.
48. Saito T, Rehmsmeier M. The precision-recall plot is more informative than the ROC plot when evaluating binary classifiers on imbalanced datasets. PLoS One. 2015;10(3). doi:10.1371/journal.pone.0118432.

Author information

Authors and Affiliations

Department of Computer Engineering, Yildiz Technical University, Istanbul, Turkey
Elham Pashaei & Nizamettin Aydin
Department of Pathology & Immunology, Baylor College of Medicine, Houston, 77030, TX, USA
Mustafa Ozen

Corresponding author

Correspondence to Nizamettin Aydin.

Ethics declarations

Conflict of interest

The authors declare that there is no conflict of interest regarding the publication of this paper.

Additional information

This article is part of the Topical collection on Systems Medicine

About this article

Cite this article

Pashaei, E., Ozen, M. & Aydin, N. Splice site identification in human genome using random forest. Health Technol. 7, 141–152 (2017). https://doi.org/10.1007/s12553-016-0157-z

Received: 28 June 2016
Accepted: 25 November 2016
Published: 02 December 2016
Issue Date: March 2017
DOI: https://doi.org/10.1007/s12553-016-0157-z
