1 Introduction

Human eukaryotic deoxyribonucleic acid (DNA) is the building block of life that holds the encoded genetic instructions of living organisms. In the central dogma of molecular biology, the transcription process converts the DNA information into a precursor mRNA. The splicing process then removes the non-coding regions (introns) and joins the coding regions (exons) to form the contiguous coding sequence (mRNA), which is in turn translated into protein, as shown in Fig. 1. Proteins are thus built according to the instructions stored in the DNA sequence. Accordingly, the understanding and analysis of DNA and RNA genetic sequences play critical roles in the treatment of genetic disorders [5, 30].

Fig. 1 Central dogma of molecular biology: RNA encoding and protein synthesis

The main objective of this study is to build a recognition system that predicts whether a particular sequence of DNA nucleotides includes an exon-intron border, an intron-exon border, or neither of them. It is a challenging problem that requires knowledge of the characteristics, dependencies, and relationships of the nucleotides in the regions surrounding the splice site. Genetic databases are usually imbalanced and mixed with noise, which weakens the learning process [19]. In the literature, this problem is treated as a classification task. Table 1 lists some recent studies that used different Artificial Intelligence (AI) models to automatically recognize these splice junctions. The present study combines the Random Forest (RF) model with Bayesian Optimization (BO) to build a splice-junction recognition system and to rank the input nucleotides according to their predictive power. The RF model is an ensemble of decision trees with several successful applications in healthcare [15], and the BO algorithm has emerged as an efficient tool for optimizing computation-intensive functions; it has proven highly effective in tuning machine learning and deep learning models [17]. Precisely, in this study, the BO algorithm optimizes four RF hyperparameters: the number of trees, the number of splitting features, the splitting criterion, and the leaf size. The optimized RF model automatically selects the most predictive nucleotides (features) of the training data. The use of the optimization algorithm reinforces confidence in the resulting model: the more accurate the model, the more we can trust the resulting feature importance.

Table 1 Related work on splice junction recognition

The BO algorithm is evaluated using the Gaussian process (GP), which is the standard surrogate model, and three different acquisition functions: probability-of-improvement, lower-confidence-bound, and expected improvement [9]. Our goal is to achieve the best optimization results in the fewest number of iterations. The proposed method is evaluated on a real-world dataset, publicly available from the UCI repository, which was obtained from the Genbank 64.1 primate data [31]. The optimized RF model automatically ranks the predictive features according to their importance, and this ranking is evaluated using BO-optimized versions of three popular AI models: Support Vector Machine (SVM), K-Nearest Neighbor (KNN), and Decision Tree (DT), trained using the cross-validation approach. Overall, the contributions of this study are as follows:

  • A hybrid machine learning model BO-RF for the recognition of the DNA splice junction sequence.

  • Evaluation of three BO acquisition functions: expected improvement, lower-confidence-bound, and probability-of-improvement.

  • Automatic feature ranking using the sensitivity analysis of the optimized RF model.

  • Performance evaluation of the optimized versions of RF, SVM, KNN, and DT models using the RF-based selected features.

The remainder of this paper is organized as follows: Section 2 describes the proposed methodology, including the dataset description, the cross-validation approach, and the performance metric. Section 3 briefly presents RF modeling, the out-of-bag error, and feature importance ranking. Section 4 reviews the BO algorithm, including the definition of the Gaussian process model and three acquisition functions. Section 5 presents the experimental results and discussion, and Section 6 concludes the paper.

2 Methodology

For this study, we propose a two-stage method based on the RF classification model. In the first stage, we fine-tune the RF hyperparameters on the splice-junction dataset using three different BO acquisition functions: expected improvement, lower-confidence-bound, and probability-of-improvement. The optimization process controls four RF hyperparameters: the number of trees, the number of splitting features, the splitting criterion, and the leaf size. The objective is to minimize the out-of-bag (OOB) error function. The resulting optimal RF model ranks all predictive features according to their contribution to the classification process. In the second stage, the ranked features are evaluated using BO-optimized versions of the SVM, KNN, and DT models. The ranked features are included in the modeling process one by one according to their importance, and these models are trained to minimize the cross-validation error. Fig. 2 shows the workflow.

Fig. 2 Workflow for BO-RF modeling, automatic feature ranking, and model evaluation

2.1 Dataset description

The proposed method is evaluated on a real-world dataset collected from Genbank 64.1 (FTP site: genbank.bio.net), which is available in the UCI machine learning repository [31].

The dataset consists of 3190 records with 60 categorical features (DNA nucleotides) and no missing values. They represent splice junctions where RNA splicing occurs. The 60 features are the DNA nucleotides starting at position −30 and ending at position +30, and each position takes one of the values {A, G, T, C}. Other characters indicate ambiguity among the standard characters according to Table 2; fields with ambiguities are processed as they are without being replaced. The dataset contains 767 exon-intron (EI) records (donor sites), 768 intron-exon (IE) records (acceptor sites), and 1655 not-splice-site (N) records, i.e., about 25% EI, 25% IE, and 50% N. The objectives of the proposed BO-RF classification model are to determine whether or not a particular sequence has a splice site, to define its type: EI (donor site), IE (acceptor site), or N (not a splice site), and to identify the most predictive features in the training data.
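To make the preprocessing concrete, the following is a minimal Python sketch of loading and encoding this dataset; it is illustrative only (the paper's experiments used MATLAB), and the file name, column names, and one-hot encoding scheme are our own assumptions based on the UCI distribution, which stores each record as "class, identifier, 60-character sequence".

```python
# Illustrative sketch (not from the paper): load the UCI splice-junction file
# and one-hot encode the 60 nucleotide positions. The file name "splice.data"
# and the column names are assumptions based on the UCI distribution.
import pandas as pd

def load_splice(path="splice.data"):
    df = pd.read_csv(path, header=None, names=["label", "id", "sequence"])
    df["label"] = df["label"].str.strip()              # EI, IE, or N
    seqs = df["sequence"].str.strip().apply(list)      # 60 characters each
    positions = [f"pos_{i}" for i in range(-30, 31) if i != 0]
    X = pd.DataFrame(seqs.tolist(), columns=positions)
    # One-hot encode every position; ambiguity codes (Table 2) are kept as
    # extra categories rather than being replaced, as in the paper.
    return pd.get_dummies(X), df["label"]

X, y = load_splice()
print(X.shape, y.value_counts().to_dict())
```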

Table 2 Dataset feature description (ambiguity)

2.2 Model training and evaluation

The entire dataset is initially divided into training and test subsets. In the first stage, the RF training process benefits from the OOB capability of the RF modeling methodology [2]; the OOB approach is described in Sec. 3. In the second stage, the objective is to compare the performances of the optimized versions of the RF, SVM, KNN, and DT models using the ranked features. This time, the models are trained to minimize the cross-validation error. K-fold cross-validation guarantees that all training records participate in model learning and validation [27]. The algorithm partitions the training records into k separate subsets and conducts k rounds of model training and validation: in each round, one partition is held out for validation and the model is trained on the remaining k-1 partitions. For each fold, the algorithm calculates the classification error of the in-fold data using a model trained on the records of the other k-1 partitions. The five-fold cross-validation procedure is depicted in Fig. 3. The trained models are evaluated and compared based on the overall accuracy on the training and test records. For a particular class y, the true positives (TP) are the actual y records that are correctly classified, and the true negatives (TN) are the non-y records that are correctly assigned to any non-y class. The accuracies are computed as follows:

  • General Accuracy of y class

Fig. 3 Graphical depiction of the five-fold cross-validation procedure: the complete dataset is randomly split into five equal partitions; four partitions are used to train the model and one for validation. The process is repeated five times such that all training records participate in model training and validation

$$ \frac{TP_y+{TN}_y}{N} $$
(1)

where N is the total number of records. Then the overall accuracy in this multi-class classification problem is defined by

  • Overall Accuracy

$$ \frac{\sum \limits_y{TP}_y}{N} $$
(2)
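As a worked illustration of Eqs. (1) and (2), the short sketch below computes the per-class and overall accuracies from a 3 × 3 confusion matrix; the example matrix values are hypothetical.

```python
# Worked example of Eqs. (1)-(2) from a confusion matrix, assuming rows are
# actual classes and columns are predicted classes (order: EI, IE, N).
# The matrix values below are hypothetical.
import numpy as np

def class_accuracy(cm, k):
    """(TP_y + TN_y) / N for class index k, as in Eq. (1)."""
    N = cm.sum()
    tp = cm[k, k]
    tn = N - cm[k, :].sum() - cm[:, k].sum() + cm[k, k]
    return (tp + tn) / N

def overall_accuracy(cm):
    """sum_y TP_y / N, as in Eq. (2)."""
    return np.trace(cm) / cm.sum()

cm = np.array([[220, 5, 10],
               [4, 218, 12],
               [8, 9, 480]])
print([round(class_accuracy(cm, k), 3) for k in range(3)],
      round(overall_accuracy(cm), 3))
```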

3 Random forest (RF)

In recent years, the RF model has emerged as a practical tool for classification, regression, and visualization, particularly in bioinformatics [1, 13, 35]. It is an ensemble of high-performance decision trees whose predictions are combined by majority voting. To keep the bias low and the dependence between trees low, Breiman [2] proposed two sources of diversity. First, each tree is trained on a different bootstrap subset of the training data. Second, at every splitting node, the algorithm uses only a random subset of the available features. Individual trees are built without pruning, that is, they are left to grow to their fullest depths. Let \( \mathbf{\mathfrak{D}}=\left\{\left({\mathbf{x}}_{\mathbf{1}},{y}_1\right),\dots, \left({\mathbf{x}}_{\boldsymbol{N}},{y}_N\right)\right\} \) represent the N training records, where \( {\mathbf{x}}_i={\left({x}_{i,1},\dots, {x}_{i,p}\right)}^T \) is the input feature vector with p features and \( {y}_i \) is the output class. Let the number of decision trees be \( {n}_{tree} \), the number of splitting features be \( {m}_{feature}<p \), the splitting criterion be \( {f}_{splitting} \), and let the trees grow until each terminal leaf contains at least \( {LS}_{records} \) records (the leaf size). The RF learning procedure works as illustrated by Algorithm 1, and Fig. 4 shows the structure of the RF algorithm. All trees can be grown in parallel, and aggregating many diverse trees reduces the variance of the model without increasing its bias. RF provides a reliable feature importance estimate and offers effective approximations of the test error without the cost of the repetitive model training associated with cross-validation, as described in the subsequent sections.

Fig. 4 Structure of random forest

Algorithm 1 The RF learning procedure
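The paper's experiments used MATLAB; as an illustrative, non-authoritative counterpart, the following scikit-learn sketch trains an RF with the four hyperparameters of Algorithm 1 (n_estimators, max_features, criterion, and min_samples_leaf correspond to \( {n}_{tree} \), \( {m}_{feature} \), \( {f}_{splitting} \), and the leaf size; note that scikit-learn offers gini/entropy splits rather than GDI/Twoing/Deviance). It assumes X and y from the loading sketch in Section 2.1.

```python
# Illustrative scikit-learn counterpart to Algorithm 1 (the paper used MATLAB).
# Assumes X, y from the loading sketch in Section 2.1.
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=0)

rf = RandomForestClassifier(
    n_estimators=300,        # n_tree
    max_features="sqrt",     # m_feature ~ sqrt(p)
    criterion="gini",        # f_splitting (no Twoing/Deviance in scikit-learn)
    min_samples_leaf=1,      # LS_records (leaf size)
    bootstrap=True,          # each tree sees a different bootstrap sample
    oob_score=True,          # keep out-of-bag validation information
    n_jobs=-1,               # trees can be grown in parallel
    random_state=0)
rf.fit(X_train, y_train)
print("test accuracy:", rf.score(X_test, y_test))
```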

3.1 Out-of-bag error

The out-of-bag (OOB) error is an estimate of the prediction error of the RF model. The RF algorithm trains each tree using about two-thirds of the training records (the in-bag data) and validates the tree with the remaining third (the out-of-bag, OOB, data) (Fig. 4). Consequently, each record contributes to the training of about two-thirds of the trees and to the validation of the remaining third. The OOB predictions are compared with the known classes to compute the OOB error, which estimates the RF generalization error. For a particular training record, the OOB error is the mean prediction error computed using only the trees that did not have this record in their in-bag data. Algorithm 2 describes the OOB prediction procedure.

Algorithm 2 The OOB prediction procedure

For classification with zero-one error function, the generalization error rate is computed as follows:

$$ {E}_{OOB}=\frac{1}{N}\sum \limits_{i=1}^N I\left({y}_i\ne {\hat{f}}_{OOB}\left({\mathbf{x}}_i\right)\right) $$
(3)

The OOB predictions \( {\hat{f}}_{OOB}\left(\mathbf{x}\right) \) also allow computing the class-wise error rate for each class and the OOB confusion matrix obtained by cross-tabulating \( {y}_i \) and \( {\hat{f}}_{OOB}\left({\mathbf{x}}_i\right) \).
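Continuing the scikit-learn sketch above, the OOB error of Eq. (3) and the OOB confusion matrix can be read off the fitted forest as follows (assuming rf, X_train, and y_train from the earlier sketch, with oob_score=True set at fitting time):

```python
# OOB error (Eq. (3)) and OOB confusion matrix, assuming rf, X_train, y_train
# from the earlier sketch (the forest must be fitted with oob_score=True).
import numpy as np
from sklearn.metrics import confusion_matrix

oob_error = 1.0 - rf.oob_score_                      # OOB misclassification rate
oob_pred = rf.classes_[np.argmax(rf.oob_decision_function_, axis=1)]
print("OOB error:", round(oob_error, 4))
print(confusion_matrix(y_train, oob_pred, labels=rf.classes_))
```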

3.2 Feature importance

The RF algorithm can measure the importance of all features according to their contributions to the prediction of the output class [35]. This measure is calculated by passing all the OOB records down the RF trees and recording the predicted outputs. Then, for every feature k, its values are randomly permuted in the OOB records while all other features are kept fixed, and the algorithm generates the predicted outputs once again (for the permuted data). We thus have two sets of OOB predictions: one obtained for the real data and one for the feature-k-permuted data. Let \( {errOOB}_t \) denote the error of a single tree t on the real OOB data and \( {errOOB}_{tk} \) the error on the permuted OOB data. The importance of feature \( {x}_k \) is then defined as follows:

$$ Importance\ \left({x}_k\right)=\frac{1}{n_{tree}}\sum \limits_t^{n_{tree}}\left({errOOB}_{tk}-{errOOB}_t\right) $$
(4)

A large value of this measure indicates an important feature and vice versa. However, the importance values can be high even for features that are not truly relevant (predictive) of the class label, as long as the RF model can use them to overfit; the measure therefore depends on how the RF model fits the training data.
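A hedged sketch of the permutation idea in Eq. (4): scikit-learn's permutation_importance permutes one feature at a time on held-out data and reports the mean accuracy drop. Using the test split here approximates, but is not identical to, the tree-by-tree OOB computation described above.

```python
# Permutation importance in the spirit of Eq. (4), computed on the test split
# (an approximation of the OOB-based, tree-by-tree measure in the paper).
from sklearn.inspection import permutation_importance

result = permutation_importance(rf, X_test, y_test,
                                n_repeats=10, random_state=0, n_jobs=-1)
ranking = result.importances_mean.argsort()[::-1]    # most important first
for idx in ranking[:10]:
    print(X.columns[idx], round(result.importances_mean[idx], 4))
```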

3.3 RF hyperparameter estimation

The RF algorithm has four hyperparameters to be prespecified: the number of trees \( {n}_{tree} \), the number of splitting features \( {m}_{feature} \), the splitting criterion \( {f}_{splitting} \), and the leaf size \( {LS}_{records} \). Small values of \( {n}_{tree} \) may lead to overfitting, while large values increase the model complexity; typically, \( {n}_{tree} \) is set between 100 and 500. Regarding \( {m}_{feature} \), small values increase the RF diversity but may raise the error rate; usually, it is set to the square root of the total number of features. There are three common splitting criteria: Gini's diversity index (GDI), the Twoing rule, and maximum deviance reduction (simply Deviance) [18]. The minimum number of records per terminal leaf \( {LS}_{records} \) controls the depth of the individual trees. Fig. 5 shows the influence of these hyperparameters on the performance of the RF model for the recognition of splice-junction sites. Experiments were done using the MATLAB package [21]. The dataset was divided into 70% training and 30% test subsets, and the criteria are the misclassification probabilities for the OOB and test records. We set the default values to 300 for \( {n}_{tree} \), 8 for \( {m}_{feature} \), GDI for \( {f}_{splitting} \), and 1 for the minimum leaf size. Figure 5a shows the effect of the number of trees: the OOB error reaches 0.0305 with 263 trees, whereas the minimum test error is 0.0230 with 137 trees; in general, the two curves behave similarly. Figure 5b shows the effect of the number of splitting features that are randomly selected at each splitting node, where the minimum OOB error is achieved with 8 features and the minimum test error requires 12 features. Similarly, Fig. 5c and d show the effects of the splitting criterion and the leaf size.

Fig. 5 The misclassification probabilities for OOB and test records for (a) number of trees, (b) number of splitting features, (c) splitting criterion, and (d) minimum leaf size

The optimal values of these hyperparameters ensure the construction of diverse and accurate RF trees. The literature shows that different optimization algorithms have been used to optimize all or some of these RF hyperparameters [10, 11, 26].
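The sensitivity study of Fig. 5a can be approximated with a simple sweep, sketched below for the number of trees using the scikit-learn setup from the earlier sketches; the resulting error values will differ from the paper's MATLAB figures.

```python
# Sweep the number of trees and record OOB and test misclassification
# probabilities, analogous to Fig. 5a (values will differ from the paper).
from sklearn.ensemble import RandomForestClassifier

errors = []
for ntree in range(50, 501, 50):
    m = RandomForestClassifier(n_estimators=ntree, max_features="sqrt",
                               criterion="gini", min_samples_leaf=1,
                               oob_score=True, n_jobs=-1, random_state=0)
    m.fit(X_train, y_train)
    errors.append((ntree, 1 - m.oob_score_, 1 - m.score(X_test, y_test)))
for ntree, oob_err, test_err in errors:
    print(f"ntree={ntree:3d}  OOB error={oob_err:.4f}  test error={test_err:.4f}")
```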

4 Bayesian optimization (BO)

The BO algorithm is applied to select the hyperparameters of the AI models at hand based on the accuracy of the resulting classifiers (the objective function). The algorithm runs iterative evaluations of a specified objective function while exploring the solution space. The main advantage of BO is its ability to find good solutions with only a few iterations [3]. Its strategy maintains a surrogate model that represents the relationship between the hyperparameters and the objective function and guides the movement through the solution space. This surrogate model is progressively improved in a closed-loop manner: initially, the surrogate is prototyped from a few seed points, this prototype then selects the next point at which to evaluate the objective function, the resulting values improve the prototype itself, and so on until enough information about the objective function is available and the global minimum is found. The BO algorithm employs an acquisition function that uses the surrogate model to determine the next evaluation point. The Gaussian process is the most popular surrogate model, while common acquisition functions include the probability of improvement, the lower confidence bound, and the expected improvement.

4.1 Gaussian process model

A Gaussian process (GP) is a probabilistic regression model that can represent a black-box objective function f(x) using a mean function m(x) and a kernel function k(x, x′). The values of f at any set of points x are assumed to have a joint Gaussian distribution [28].

$$ f\left(\mathbf{x}\right)\sim \mathcal{GP}\left(m\left(\mathbf{x}\right),k\left(\mathbf{x},{\mathbf{x}}^{\prime}\right)\right) $$
(5)

For simplicity, the mean function is assumed to be m(x) = 0, i.e., the model is completely defined by its kernel function k. The ARD Matérn 5/2 function is a common kernel; it is twice differentiable and depends only on the distance between the points x and x′ [24]:

$$ {K}_{M52}\left(\mathbf{x},{\mathbf{x}}^{\prime}\right)={\sigma}_f^2\left(1+\frac{\sqrt{5}r}{\sigma_l}+\frac{5{r}^2}{3{\sigma}_l^2}\right)\mathit{\exp}\left(-\frac{\sqrt{5}r}{\sigma_l}\right), $$
(6)

where \( r=\sqrt{{\left(\mathbf{x}-{\mathbf{x}}^{\prime}\right)}^T\left(\mathbf{x}-{\mathbf{x}}^{\prime}\right)} \) is the Euclidean distance between x and x′, σf is the function standard deviation, and σl is the characteristic length scale. Their values are found by maximizing the marginal log-likelihood of the available data \( {\mathbf{\mathcal{D}}}_{1:t}={\left\{\left({\mathbf{x}}_{\boldsymbol{i}},{y}_i\right)\right\}}_{\boldsymbol{i}=\mathbf{1}}^{\boldsymbol{t}} \), where t is the iteration index. Once the kernel is determined, the distribution at any new location \( {\mathbf{x}}_{t+1} \) can be predicted as follows:

$$ P\left({y}_{t+1}|{\mathcal{D}}_{1:t},{\mathbf{x}}_{t+1}\right)=\mathcal{N}\left({\mu}_t\left({\mathbf{x}}_{t+1}\right),{\sigma}_t^2\left({\mathbf{x}}_{t+1}\right)+{\sigma}_{noise}^2\right) $$
(7)
$$ {\mu}_t\left({\mathbf{x}}_{t+1}\right)={\mathbf{k}}^T{\left[\boldsymbol{K}+{\sigma}_{noise}^2\mathrm{I}\right]}^{-1}{y}_{1:t}^T $$
(8)
$$ {\sigma}_t^2\left({\mathbf{x}}_{t+1}\right)=k\left({\mathbf{x}}_{t+1},{\mathbf{x}}_{t+1}\right)-{\mathbf{k}}^T{\left[\boldsymbol{K}+{\sigma}_{noise}^2I\right]}^{-1}\mathbf{k} $$
(9)

where
$$ \boldsymbol{K}=\left[\begin{array}{ccc}k\left({\mathbf{x}}_1,{\mathbf{x}}_1\right)& \cdots & k\left({\mathbf{x}}_1,{\mathbf{x}}_t\right)\\ {}\vdots & \ddots & \vdots \\ {}k\left({\mathbf{x}}_t,{\mathbf{x}}_1\right)& \cdots & k\left({\mathbf{x}}_t,{\mathbf{x}}_t\right)\end{array}\right],\kern1.25em \mathbf{k}=\left[k\left({\mathbf{x}}_{t+1},{\mathbf{x}}_1\right),k\left({\mathbf{x}}_{t+1},{\mathbf{x}}_2\right),\dots, k\left({\mathbf{x}}_{t+1},{\mathbf{x}}_t\right)\right], $$
and \( {\sigma}_{noise}^2 \) represents the noise variance.
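For concreteness, the following NumPy sketch implements the GP posterior of Eqs. (6)-(9) with an isotropic Matérn 5/2 kernel; unlike the ARD kernel in the paper, it uses a single length scale, and the hyperparameters σf, σl, and σnoise are fixed rather than fitted by marginal likelihood maximization.

```python
# Minimal NumPy sketch of the GP posterior of Eqs. (6)-(9) with an isotropic
# Matern 5/2 kernel; sigma_f, sigma_l, sigma_noise are fixed for brevity
# instead of being fitted by marginal likelihood maximization.
import numpy as np

def matern52(A, B, sigma_f=1.0, sigma_l=1.0):
    """Eq. (6) for all pairs of rows of A and B."""
    r = np.sqrt(((A[:, None, :] - B[None, :, :]) ** 2).sum(-1))
    z = np.sqrt(5.0) * r / sigma_l
    return sigma_f ** 2 * (1.0 + z + z ** 2 / 3.0) * np.exp(-z)

def gp_posterior(X_obs, y_obs, X_new, sigma_noise=1e-3):
    """Posterior mean and variance at X_new, as in Eqs. (7)-(9)."""
    K = matern52(X_obs, X_obs) + sigma_noise ** 2 * np.eye(len(X_obs))
    k = matern52(X_obs, X_new)
    K_inv = np.linalg.inv(K)
    mu = k.T @ K_inv @ y_obs                                   # Eq. (8)
    var = np.diag(matern52(X_new, X_new) - k.T @ K_inv @ k)    # Eq. (9)
    return mu, var

X_obs = np.array([[0.1], [0.4], [0.9]])          # toy 1-D observations
y_obs = np.sin(6 * X_obs).ravel()
mu, var = gp_posterior(X_obs, y_obs, np.linspace(0, 1, 5)[:, None])
print(mu.round(3), var.round(3))
```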

4.2 Acquisition function

The BO algorithm employs an acquisition function α(x) to drive the optimization process through the solution space. It is computed from the GP mean μt and variance \( {\sigma}_t^2 \) (as in Eqs. (8) and (9)). Specifically, the algorithm works in a closed iterative loop: at each iteration, the point that maximizes α(x) is the best guess at which to sample the objective function; the sampling result then updates the GP model, and the algorithm again maximizes α(x) with the updated GP to propose a new candidate, and so on. The iteration continues until sufficient information about the objective function is available and the global minimum is reached. Algorithm 3 and Fig. 6 illustrate the BO process.

Fig. 6 Illustration of the BO concept: the optimization progressively refines the GP model, which in turn is used to generate the best guess at which to sample the objective function; the guess-and-sample iteration continues until the global minimum is reached

Algorithm 3 The Bayesian optimization procedure
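A skeleton of Algorithm 3 is sketched below, reusing gp_posterior from Section 4.1 and accepting any acquisition(mu, sigma, f_best) callable (three concrete choices are given after Section 4.2). The random candidate search and fixed kernel hyperparameters are simplifications of a full BO implementation, not the paper's MATLAB routine.

```python
# Skeleton of Algorithm 3 (a sketch, not the paper's MATLAB implementation).
# Assumes gp_posterior from Section 4.1 and an acquisition(mu, sigma, f_best)
# callable; candidates are drawn at random instead of using an inner optimizer.
import numpy as np

def bayes_opt(objective, bounds, acquisition, n_seed=10, n_iter=60, seed=0):
    rng = np.random.default_rng(seed)
    dim = bounds.shape[0]
    X = rng.uniform(bounds[:, 0], bounds[:, 1], size=(n_seed, dim))  # seed points
    y = np.array([objective(x) for x in X])
    for _ in range(n_iter):
        cand = rng.uniform(bounds[:, 0], bounds[:, 1], size=(2000, dim))
        mu, var = gp_posterior(X, y, cand)
        sigma = np.sqrt(np.maximum(var, 1e-12))
        x_next = cand[np.argmax(acquisition(mu, sigma, y.min()))]  # best guess
        X = np.vstack([X, x_next])                 # sample the objective and
        y = np.append(y, objective(x_next))        # update the GP data
    best = np.argmin(y)
    return X[best], y[best]
```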

Several acquisition functions have been introduced in the literature. They differ in how they balance exploring areas with high variance \( {\sigma}_t^2 \) and exploiting those with a low mean μt. In this paper, we evaluate the performance of three popular acquisition functions as follows:

  • Probability of improvement (PI)

The PI acquisition function αPI is the simplest one and requires the least computation [16]. It selects the candidate point x that is most likely to improve on the current best point \( {\mathbf{x}}_{best} \), as follows:

$$ {\displaystyle \begin{array}{c}{\alpha}_{PI}\left(\mathbf{x}\right)\triangleq P\left(f\left(\mathbf{x}\right)<f\left({\mathbf{x}}_{best}\right)\right)\\ {}{\alpha}_{PI}\left(\mathbf{x}\right)=\Phi \left(\frac{f\left({\mathbf{x}}_{best}\right)-{\mu}_t\left(\mathbf{x}\right)}{\sigma_t\left(\mathbf{x}\right)}\right)\end{array}} $$
(10)

where Φ(·) is the unit normal cumulative distribution function (CDF). The point that maximizes αPI is the best guess to minimize the objective function f.

  • Lower confidence bound (LCB)

The LCB acquisition function αLCB is based on the lower confidence envelope GLCB(x), a statistical lower bound of the objective function. It is computed by subtracting a weighted standard deviation σt from the GP predictive mean μt as follows:

$$ {G}_{LCB}\left(\mathbf{x}\right)={\mu}_t\left(\mathbf{x}\right)-\kappa {\sigma}_t\left(\mathbf{x}\right) $$
(11)

where the parameter κ controls the balance between exploitation and exploration; κ = 2 is a common choice [7]. The acquisition αLCB then maximizes the negative of GLCB as follows:

$$ {\alpha}_{LCB}\left(\mathbf{x}\right)=\kappa {\sigma}_t\left(\mathbf{x}\right)-{\mu}_t\left(\mathbf{x}\right) $$
(12)
  • Expected Improvement (EI)

The EI acquisition function is the most common one, popularized by the work of Jones et al. [14]. It accounts for the amount of expected improvement when selecting the next candidate point, as follows:

$$ {\alpha}_{EI}(x)=E\left[\max \left(0,f\left({\mathbf{x}}_{best}\right)-{\mu}_t\left(\mathbf{x}\right)\right)\right] $$
(13)
$$ {\alpha}_{EI}\left(\mathbf{x}\right)=\left\{\begin{array}{cc}\left(f\left({\mathbf{x}}_{best}\right)-{\mu}_t\left(\mathbf{x}\right)\right)\cdotp \Phi (Z)+{\sigma}_t\left(\mathbf{x}\right)\cdotp \phi (Z)& \left({\sigma}_t\left(\mathbf{x}\right)>0\right)\\ {}0& \left({\sigma}_t\left(\mathbf{x}\right)=0\right)\end{array}\right. $$
(14)

where \( Z=\frac{f\left({\mathbf{x}}_{best}\right)-{\mu}_t\left(\mathbf{x}\right)}{\sigma_t\left(\mathbf{x}\right)} \), Φ(·) is the cumulative distribution function (CDF) of the standard normal distribution, and ϕ(·) is its probability density function (PDF).
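The three acquisition functions of Eqs. (10), (12), and (14) can be written compactly as follows; each is maximized while the objective itself is being minimized, and all three plug into the bayes_opt skeleton sketched above.

```python
# Sketches of the acquisition functions in Eqs. (10), (12), (14); each is
# maximized while the underlying objective is being minimized.
import numpy as np
from scipy.stats import norm

def prob_improvement(mu, sigma, f_best):
    """Eq. (10): Phi((f_best - mu) / sigma)."""
    return norm.cdf((f_best - mu) / np.maximum(sigma, 1e-12))

def lower_confidence_bound(mu, sigma, f_best, kappa=2.0):
    """Eq. (12): kappa * sigma - mu (f_best is unused)."""
    return kappa * sigma - mu

def expected_improvement(mu, sigma, f_best):
    """Eq. (14): (f_best - mu) * Phi(Z) + sigma * phi(Z)."""
    sigma = np.maximum(sigma, 1e-12)
    z = (f_best - mu) / sigma
    return (f_best - mu) * norm.cdf(z) + sigma * norm.pdf(z)
```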

5 Experimental results and discussion

This section presents experiments that demonstrate the utility of BO for finding the optimal hyperparameters of the RF prediction model. BO controls four RF hyperparameters: the number of trees, the number of splitting features, the splitting criterion, and the leaf size. The available data is divided into 70% training and 30% test subsets. The training data is used to construct the optimized RF model that minimizes the OOB misclassification probability, and the optimized model is then evaluated on the test data. We compare the performances of three different BO acquisition functions: the probability of improvement, the lower confidence bound, and the expected improvement. Figure 7 shows the OOB misclassification probabilities for the three BO acquisition functions against the evaluation number, where the number of seed points is set to 10 and the objective function evaluation limit is set to 60, with no deterministic conditions on the optimizable variables or coupled constraints on the resulting models. Every hyperparameter has a prespecified range: the number of trees from 100 to 500, the number of splitting features from 2 to 10, the splitting criterion options (GDI, Twoing, and Deviance), and the leaf size from 2 to 10. Fig. 7 shows that the probability of improvement acquisition function achieves the minimum OOB error. The resulting optimal hyperparameters are 375 trees, 7 splitting features, the 'Twoing' splitting criterion, and a leaf size of 1. Table 3 shows the OOB values and the optimized hyperparameter values, as well as the training and test classification accuracies, for the three acquisition functions.
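As an illustrative, non-authoritative counterpart to the MATLAB setup above, the following sketch reuses the bayes_opt skeleton and acquisition functions from Section 4 to minimize the RF OOB error. The helper rf_oob_error, the CRITERIA list, and the rounding of the search variables are our own simplifications; scikit-learn offers gini/entropy rather than GDI/Twoing/Deviance, and a production implementation would also rescale the inputs and refit the GP kernel hyperparameters.

```python
# Illustrative stage-one search (not the paper's MATLAB bayesopt run): minimize
# the RF OOB error with the bayes_opt skeleton and acquisition functions from
# Section 4. Categorical and integer variables are handled by rounding.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

CRITERIA = ["gini", "entropy"]

def rf_oob_error(x):
    ntree, mfeat = int(round(x[0])), int(round(x[1]))
    crit, leaf = CRITERIA[int(round(x[2]))], int(round(x[3]))
    m = RandomForestClassifier(n_estimators=ntree, max_features=mfeat,
                               criterion=crit, min_samples_leaf=leaf,
                               oob_score=True, n_jobs=-1, random_state=0)
    m.fit(X_train, y_train)
    return 1.0 - m.oob_score_                     # OOB misclassification

bounds = np.array([[100, 500],   # number of trees
                   [2, 10],      # number of splitting features
                   [0, 1],       # splitting criterion index
                   [2, 10]])     # leaf size
best_x, best_err = bayes_opt(rf_oob_error, bounds, prob_improvement,
                             n_seed=10, n_iter=60)
print("best OOB error:", round(best_err, 4), "at", best_x.round(2))
```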

Fig. 7 The OOB misclassification probability of the RF model optimized using three different BO acquisition functions. Expected improvement achieves 0.030452 with 27 evaluations, lower confidence bound achieves 0.030004 with 46 evaluations, and the probability of improvement achieves 0.029557 with 33 evaluations

Table 3 Optimization results, the training and test accuracies for the three BO acquisition functions

The probability of improvement function achieves 99.96% training and 97.34% test accuracy. The lower confidence bound achieves 99.96% and 97.18%, and the expected improvement achieves 99.37% and 97.49%, respectively. In general, the optimization results of the three functions, as well as the classification results, are very close to each other. The optimal RF model generates the relative feature importances shown in Fig. 8, which displays the 30 most important features in the training data based on the BO-RF model. The model automatically selects the features that decrease the OOB prediction error. In addition to the RF model, the validity of this feature ranking is examined using BO-optimized versions of the SVM, KNN, and DT models, as follows:

  1. The SVM model builds a decision hyperplane with the maximal margin width to divide the feature space into two regions (a binary classifier) [6]. In non-linearly separable situations, the model allows a misclassification slack variable ξ around the margin, controlled by a regularization constant C. In our experiments, the radial basis function (RBF) is selected as the SVM kernel [20]; it is a common kernel with only one adjustable parameter (σ). The BO algorithm is applied to find the optimal values of C and σ that minimize the 10-fold cross-validation error (see the sketch after this list).

  2. The KNN model labels a record based on its similarity to the training records. For a new record, KNN computes its distance to each training record to define its neighbors, and then assigns the record to the class that holds the majority among its K nearest neighbors [33]. In this study, BO finds the optimal values of the number K, the distance measure, the distance weight, and the distance exponent that minimize the 10-fold cross-validation error.

  3. The DT model creates a classification tree by selecting the optimal partitioning features and can handle both continuous and categorical variables. The model recursively separates the records into branches to improve the classification accuracy [18]. In this study, BO tunes the maximum number of splits, the minimum leaf size, and the splitting criterion to minimize the 10-fold cross-validation error.
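As a sketch of this stage-two tuning, the snippet below applies the same bayes_opt skeleton to the SVM with an RBF kernel, minimizing the 10-fold cross-validation error over log-scaled C and the kernel width; the search ranges are illustrative assumptions rather than the paper's exact settings, and the KNN and DT cases follow the same pattern with their respective hyperparameters.

```python
# Illustrative stage-two tuning for the SVM (the KNN and DT cases are
# analogous): minimize the 10-fold cross-validation error over log-scaled C
# and the RBF kernel width using the same bayes_opt skeleton.
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

def svm_cv_error(x):
    C, gamma = 10.0 ** x[0], 10.0 ** x[1]          # log-scaled search space
    model = SVC(C=C, kernel="rbf", gamma=gamma)
    acc = cross_val_score(model, X_train, y_train, cv=10).mean()
    return 1.0 - acc                               # 10-fold CV error

bounds = np.array([[-3, 3],      # log10(C)
                   [-4, 1]])     # log10(gamma)
best_x, best_err = bayes_opt(svm_cv_error, bounds, expected_improvement,
                             n_seed=5, n_iter=30)
print("best CV error:", round(best_err, 4))
```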

Fig. 8 Relative importance of the splice junction features using the BO-RF model

A set of experiments was conducted to check the validity of the RF-based feature ranking using the SVM, KNN, and DT models. In the first experiment, the training and test data contained only the two most important features, and BO-optimized versions of the four models were trained and tested. Then, the features were added one by one according to their ranking, and the training and testing processes were repeated each time. The training and test accuracies determine how well each classification model suits the required splice recognition task. The resulting training and test accuracies for all these experiments are depicted in Fig. 9a and b, respectively.
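A simplified sketch of this incremental evaluation is shown below. For brevity it uses the permutation ranking over the encoded columns and a fixed (untuned) RF, whereas the paper ranks the 60 raw positions and re-optimizes each model with BO at every step.

```python
# Simplified incremental evaluation: add features in ranking order, retrain a
# fixed RF, and record train/test accuracy. Assumes `ranking`, X, X_train,
# X_test, y_train, y_test from the earlier sketches.
from sklearn.ensemble import RandomForestClassifier

ordered_cols = [X.columns[i] for i in ranking]
results = []
for k in range(2, 31):                              # top-2 up to top-30 features
    cols = ordered_cols[:k]
    m = RandomForestClassifier(n_estimators=300, n_jobs=-1, random_state=0)
    m.fit(X_train[cols], y_train)
    results.append((k, m.score(X_train[cols], y_train),
                    m.score(X_test[cols], y_test)))
for k, tr, te in results:
    print(f"top {k:2d} features  train={tr:.3f}  test={te:.3f}")
```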

Fig. 9 Training and test accuracies vs. number of features (ordered by their relative importance generated from the BO-RF model) using BO-optimized versions of the RF, SVM, KNN, and DT classification models: (a) training accuracies, (b) test accuracies

These curves demonstrate the high performance of all models using only a few features ranked by the BO-RF model. All models achieve training and test accuracies greater than 70% using only the top two features, and greater than 90% using the top five. With the addition of more ranked features, the performance of both RF and SVM improves, whereas the performance of DT and KNN fluctuates. The DT performance varies as the number of features increases, but not as severely as that of KNN; once the number of features exceeds 7, KNN begins to suffer from overfitting. In contrast, both RF and SVM provide accurate results with very few features, and their performances continue to improve and stabilize as the number of features increases. Using the top seven features, the RF and SVM models achieve training and test accuracies of more than 95%. Using the full set of features, RF achieves 99.96% training and 97.34% test accuracy, while SVM achieves 99.95% and 96.44%, respectively.

Both RF and SVM achieve very high performance with few features (ranked by the RF model). In general, their performances are very close to each other and outperform the other two models (KNN and DT). These results demonstrate that the RF-based model facilitates the selection of the most predictive features, ensuring strong predictive results even when other prediction models are used. That is, the RF model can be used both as a predictor and as a feature selector with high efficiency. Throughout these experiments, the BO methodology provided a practical tool for improving the performance of the different prediction systems.

6 Conclusion

Understanding the genetic sequences and processes in the central dogma of molecular biology is an essential step toward handling many genetic disorders. In this paper, we presented several experiments applying a set of artificial intelligence models to understand nucleotide sequences and locate splicing sites. The experiments demonstrated the predictive power of the RF model in the recognition of DNA splicing sites, as well as its high ability to identify the most predictive features. That is, RF was used to achieve two objectives: first, predicting the splicing sites and, second, determining the most important features. To ensure the best results, the BO method was used to adjust the model hyperparameters to achieve the minimum out-of-bag error. The more accurate the model, the more we can trust the resulting feature ranking. The optimization adjusted four hyperparameters: the number of trees, the number of splitting features, the splitting criterion, and the leaf size. The optimization used the Gaussian process surrogate with three different acquisition functions, namely the probability of improvement, the lower confidence bound, and the expected improvement. The probability of improvement yielded the best results, and the resulting RF model achieved classification accuracies of 99.96% and 97.34% on the training and test data, respectively. The validity of the RF-based feature ranking was tested using BO-optimized versions of the SVM, KNN, and DT models, which all achieved training and test accuracies greater than 70% using only the top two features and more than 90% using the top five. In general, RF and SVM provided high and steady performance. BO provides an efficient and smart manner to fine-tune the prediction models at hand.