
1 Introduction

Gene expression research is one of the major research areas in bioinformatics. The biological data produced by DNA microarray technology is growing exponentially [1,2,3]. This high-throughput approach allows scientists to measure and analyze multitudes of genes at the same time. DNA microarray technology is providing great insight into genomic data and is changing the field of bioinformatics. Drug discovery, disease prevention and cure, biological interactions, and plant and animal metabolism are among the underlying issues addressed through gene expression levels [4]. Additionally, there is widespread research in cancer studies aimed at finding potential biomarkers based on gene expression levels [5,6,7]. Research focuses on the small subset of genes, among all those measured, that are relevant to the phenomenon under study, also known as the feature subset problem. DNA microarray technology essentially measures the activity of different genes at the stages of transcription and translation. There are two major methods of obtaining DNA microarray data: hybridization of a sample to cDNA and high-density oligonucleotide chips [8]. Nevertheless, the data produced by these methods are highly redundant and large scale, and suffer from the curse of dimensionality [9]. To address these problems, feature selection is the approach widely adopted by the bioinformatics community [10,11,12].

In general, feature selection methods can be categorized into three groups: filter approaches, wrapper approaches and embedded approaches (a combination of the previous two) [13]. Filter methods focus on intrinsic characteristics of the genes, in terms of their relevance or their discriminative properties. The genes are ranked by the filter criterion; the highest-ranked genes are retained while the remaining ones are eliminated. This methodology does not rely on any machine learning algorithm, so its time complexity is low and it can be used on large datasets. Moreover, the results are simple and can easily be verified in wet labs by biological domain experts. Thus, univariate filter approaches have been widely leveraged to analyze and study gene expression levels [14]. Among the different filter approaches, Xing et al. [15] report IG (Information Gain) to be the best; however, it does not perform well on heterogeneous datasets, where Bayesian networks show their strength [16]. Different filter techniques therefore outperform each other depending on the dataset. In the wrapper approach, subsets of genes are searched and judged by the estimated accuracy of a classifier, and the selected genes are then used to train the classifier. Zhang et al. [17] assert that wrapper methods outperform filter methods in terms of the predictive accuracy of the classifier. This approach also integrates the interaction between gene selection and classification, which is absent in the filter approach. Nevertheless, it comes at the cost of being computationally intensive and can in some cases cause overfitting of the classifier [18]. Finally, in embedded approaches the search algorithm is rooted in the classification algorithm itself; they therefore retain the interaction between the search algorithm and the classifier while being far less computationally intensive [19]. One of the more successful approaches is to use SVMs (Support Vector Machines) with an embedded feature selection algorithm [20].
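As a concrete illustration of the filter approach described above, the minimal sketch below ranks genes with scikit-learn, using mutual information as a stand-in for the Information Gain criterion; the toy data and the choice of k = 50 are illustrative only, not from this paper.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, mutual_info_classif

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2000))     # 100 samples x 2000 genes (toy data)
y = rng.integers(0, 2, size=100)     # binary phenotype labels

# Rank all genes by a univariate relevance score and keep the top 50.
selector = SelectKBest(score_func=mutual_info_classif, k=50)
X_top = selector.fit_transform(X, y)
top_gene_idx = selector.get_support(indices=True)   # surviving gene indices
```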

SVM-RFE (Recursive Feature Elimination) [21] was introduced by authors who achieved very high accuracy with their classifier in comparison to other discriminant methods using SVM. In this method, the genes are ranked as features and the lowest-ranked features are removed. Yousef et al. [22] introduced SVM-RCE (Recursive Cluster Elimination), which was reported to outperform SVM-RFE. SVM-RCE uses K-means to cluster the genes and then uses SVM to score and rank the clusters, eliminating the lower-ranked ones. Building on the widespread interest in this method, Luo et al. [23] recently improved its computation time by scoring each cluster with the infinity norm of its weight coefficient vector; they remove the lowest-scoring genes, instead of clusters, when the number of clusters is small. We in turn empowered SVM-RCE by introducing SVM-RCE-R (Rank) [24], which extends SVM-RCE with a user-specified ranking function: the user can choose which clusters should be ranked higher based on different metrics (accuracy, sensitivity, f-measure, area under the curve and precision), thereby allowing scientists to explore their biological data to the depth their needs require.
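For reference, the SVM-RFE baseline described above can be sketched with scikit-learn's RFE wrapper around a linear SVM. This is a minimal sketch, not the authors' implementation; the parameter values are assumptions, and X, y continue from the previous sketch.

```python
from sklearn.feature_selection import RFE
from sklearn.svm import SVC

# A linear SVM exposes per-feature weights, which RFE uses as the ranking;
# step=0.1 drops the lowest-ranked 10% of the remaining features per round.
rfe = RFE(estimator=SVC(kernel="linear"), n_features_to_select=32, step=0.1)
rfe.fit(X, y)
selected_genes = rfe.get_support(indices=True)   # indices of surviving genes
```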

To improve this method on a greater scale, we now introduce SVM-RCE-R-OPT, which searches for the optimal set of weights, resulting in an improvement in our classification results. We use Bayesian optimization to find the values of our six different weights. We compare SVM-RCE and SVM-RCE-R-OPT across 14 datasets to validate that this approach does improve the classification results.

2 Methods and Implementation

We optimize SVM-RCE-R, which is itself based on the earlier SVM-RCE study; this led us to the present approach, SVM-RCE-R-Opt. The following sections describe the methods used in SVM-RCE and SVM-RCE-R; we then describe in detail how we optimized the SVM-RCE-R algorithm and the platform we used to implement SVM-RCE-R-Opt.

2.1 SVM-RCE

SVM-RCE was the first algorithm to suggest clustering genes with K-means, using a correlation metric, and performing feature selection by considering each cluster of genes as one unit. Each cluster is then scored by how well it classifies the two-class training set. For this purpose, the training data is represented using only the genes that belong to a specific cluster, together with the original class labels of the training set, and an internal cross-validation is performed; the cluster's score is the average accuracy over the cross-validation folds. This step is applied to each cluster detected by K-means. All clusters are then ranked by their scores, and SVM-RCE removes the cluster with the lowest score (or can be set to remove a percentage of the lowest-scoring clusters). The resulting data thus excludes the genes associated with the removed clusters, since they contribute little to the predictive capability of the classifier. A simplified sketch of one elimination round follows.
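This is a hedged sketch under the description above, not the original implementation. Note that scikit-learn's K-means uses Euclidean distance, so each gene is z-scored here to approximate the correlation metric of the original algorithm; all names and parameter values are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import scale
from sklearn.svm import SVC

def rce_round(X, y, n_clusters=100, cv=3):
    # Cluster genes (columns): K-means runs on the transposed, z-scored matrix.
    gene_profiles = scale(X).T
    labels = KMeans(n_clusters=n_clusters, n_init=10,
                    random_state=0).fit_predict(gene_profiles)
    scored = []
    for c in range(n_clusters):
        genes = np.flatnonzero(labels == c)
        # Score each cluster: mean internal CV accuracy of an SVM on its genes.
        acc = cross_val_score(SVC(kernel="linear"), X[:, genes], y, cv=cv).mean()
        scored.append((acc, genes))
    scored.sort(key=lambda t: t[0])               # worst-scoring cluster first
    keep = np.setdiff1d(np.arange(X.shape[1]), scored[0][1])
    return X[:, keep], keep                       # data without the worst cluster
```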

2.2 SVM-RCE-R

Given the interest in SVM-RCE in biological research, we decided to empower this algorithm with a user-specified ranking function in SVM-RCE-R. In SVM-RCE, clusters are scored solely by their accuracy. However, when analyzing high-dimensional data with small sample sizes, other metrics are often preferred for understanding which features contribute most to classification. This led to the implementation of a user-specified scoring function, in which researchers can choose the scoring function according to their needs. We therefore combine the commonly used scoring metrics as described in Eq. 1.

$$ S(w_1, w_2, w_3, w_4, w_5, w_6) = w_1 \times \mathrm{acc} + w_2 \times \mathrm{sen} + w_3 \times \mathrm{spe} + w_4 \times \mathrm{fm} + w_5 \times \mathrm{auc} + w_6 \times \mathrm{prec} $$
(1)

where acc, sen, spe, fm, auc and prec refer to accuracy, sensitivity, specificity, f-measure, area under the curve and precision, respectively.

The ranking is then computed as a list of clusters sorted by the score $S$. We noted that we could achieve significant improvements over SVM-RCE. We also noticed that some weight combinations led to a better accuracy score even when the focus was solely on accuracy (i.e., accuracy had the highest weight and the remaining metrics were zero). The most important question is therefore how to compute the optimal weights $w_1, \dots, w_6$ so that the overall performance of the algorithm improves. This motivated the optimization approach we take in the next step. The scoring and ranking step can be sketched as follows.
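A direct, hedged transcription of Eq. 1 and the ranking step; the per-cluster metric values are assumed to come from the internal cross-validation described in Sect. 2.1, and `cluster_metrics` is a hypothetical list of per-cluster metric dicts.

```python
def score(w, m):
    # w: the six weights (w1..w6); m: per-cluster metrics from cross-validation
    return (w[0] * m["acc"] + w[1] * m["sen"] + w[2] * m["spe"]
            + w[3] * m["fm"] + w[4] * m["auc"] + w[5] * m["prec"])

# Rank clusters by S in descending order; eliminate the tail of the list.
# ranked = sorted(cluster_metrics, key=lambda m: score(w, m), reverse=True)
```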

2.3 SVM-RCE-R Optimal

Our newly proposed approach, SVM-RCE-R-OPT, is implemented in KNIME [25], as was SVM-RCE-R, owing to its user-friendliness. We split the gene expression dataset into training and test sets with a ratio of 30:70, respectively, using stratified sampling so that both sets preserve the ratio of negative to positive samples. The parameter optimization node in KNIME is used to find the optimal values of the six weights used in SVM-RCE-R (acc, sen, spe, etc.). This node uses Bayesian optimization [26] as its search strategy. The expected improvement criterion that is maximized is given in Eq. 2, following the original paper.

$$ EI_{y^*}(x) = \int_{-\infty}^{\infty} \max(y^* - y,\, 0)\; p_M(y \mid x)\, dy $$
(2)

The search strategy works in two phases: a warm-up phase followed by a Tree-structured Parzen Estimator (TPE) phase. During the warm-up phase, random combinations of the weights are evaluated; the TPE phase then starts from the objective values found. Users can specify the step size of each weight as well as the number of iterations for each phase. The search algorithm draws weights with replacement from the search space.

Our objective function is based on the scoring function stated in Eq. 1: the mean of the scoring function across all cluster levels (numbers of clusters) is used as the objective, which is set to be maximized. We specified the step size to be 1000, since prior testing showed an improvement when heavier weights were used. After the optimal set of weights has been identified in the search space by the Bayesian optimization, it is used in a separate node that runs SVM-RCE-R with those weights on the training split of the data. For reference, we also ran SVM-RCE-R on the same training set with only the accuracy weight (acc) set to the maximum and the remaining weights set to zero; since the original SVM-RCE algorithm uses accuracy as its ranking function, this lets us validate whether the weights found are an improvement over it.
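The paper performs this search with KNIME's Parameter Optimization node; as a hedged illustration of the same TPE strategy, the sketch below uses the hyperopt library instead. The weight range 0..10000 is an assumption, the step of 1000 and the 15 evaluations follow the text, and `run_svm_rce_r` is a hypothetical wrapper that runs SVM-RCE-R on the training split and returns the mean Eq. 1 score across cluster levels.

```python
from functools import partial
from hyperopt import fmin, hp, tpe

# Six weights on a grid with step 1000 (range 0..10000 is an assumption).
space = {f"w{i}": hp.quniform(f"w{i}", 0, 10000, 1000) for i in range(1, 7)}

def objective(params):
    weights = [params[f"w{i}"] for i in range(1, 7)]
    # run_svm_rce_r: hypothetical; returns mean score over all cluster levels.
    return -run_svm_rce_r(X_train, y_train, weights)      # fmin minimizes

best = fmin(fn=objective, space=space,
            algo=partial(tpe.suggest, n_startup_jobs=5),  # random warm-up phase
            max_evals=15)                                 # 15 iterations (Sect. 2.4)
```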

2.4 KNIME Workflow

Figure 1 gives an overview of the workflow we programmed in KNIME [25]. The user sets the List Files node to the folder that contains the datasets; the workflow then generates a folder with all the relevant output files.

Fig. 1. KNIME workflow overview

The first step in the workflow is to split the data into two parts: one to calculate the optimal weights of the scoring function, and the other for cross-validation of the standard approach (SVM_RCE_ACC meta node) and of the optimal approach (SVM_RCE_OPT_Corr meta node). The data used to compute the optimal weights is not used in the subsequent steps, to avoid overfitting.
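A minimal sketch of this stratified split with scikit-learn, assuming the 30:70 ratio of Sect. 2.3 (the variable names are illustrative):

```python
from sklearn.model_selection import train_test_split

# stratify=y preserves the negative-to-positive ratio in both parts.
X_opt, X_cv, y_opt, y_cv = train_test_split(
    X, y, train_size=0.30, stratify=y, random_state=0)
# X_opt/y_opt -> weight optimization; X_cv/y_cv -> the two meta-node branches
```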

Fig. 2. Bayesian optimization of weights

SVM-RCE-R is applied with different weights (Parameter Optimization Loop Start node) on the training set (Fig. 2), and the results are collected at the Loop End node. The maximum number of iterations used to find the optimal weights was limited to 15. The successive weight combinations are generated by the search algorithm described in the previous section. Once the optimal set of weights is discovered, it is applied to the test split of the data (Save Output node).

3 Data

We have considered the same data used in the SVM-RCE-R study and have included two additional datasets. The datasets represent a range of different diseases, cancers and studies. Diseases: early-stage Parkinson's disease (GDS2519), ulcerative colitis (GDS3268), celiac disease (GDS3646), diabetes in children (GDS3875), effects of tobacco smoke on foetal cells (GDS3929), HIV (GDS4228), asthma (GDS5037), acute dengue (GDS5093), pulmonary hypertension (GDS5499) and a multi-omic analysis of COVID-19 (GSE157103). Cancers: glioma (GDS1962), prostate cancer (GDS2547) and colorectal cancer (GDS4516_4718); in addition, a fear conditioning study in mice (GDS3900). All of the datasets have at least 100 samples except GDS5093, which has 56. These 14 gene expression datasets were downloaded from the Gene Expression Omnibus (GEO) at NCBI [27]. Each dataset contains sample IDs as the column names and gene names (gene symbols), according to the respective platform, as the row names, with the corresponding gene expression values (Fig. 3). Since the datasets were produced on different chips or platforms, the number of genes varies; the exact counts can be found via their GEO accession numbers.
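For readers who want to reproduce this input format, here is a hedged sketch of fetching one of the listed datasets with the GEOparse library. This is not part of the paper's workflow (which reads downloaded files), and the column names are assumptions based on the usual GDS table layout.

```python
import GEOparse

gds = GEOparse.get_GEO(geo="GDS2519", destdir="./data")  # Parkinson's dataset
table = gds.table                # probe/gene rows, sample (GSM...) columns
expr = table.set_index("IDENTIFIER").filter(regex="^GSM")  # expression matrix
```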

Fig. 3. Input table (dataset) format in KNIME

4 Results

We have applied 100-fold Monte Carlo cross-validation [27] to both the original approach and the optimal approach. In each fold, we compute several performance metrics, such as accuracy, specificity and area under the curve (AUC), and the average over the 100 folds is reported for each metric. Accuracy and specificity are computed as in Eq. 3, where TP is true positives, TN true negatives, FP false positives and FN false negatives. AUC is the probability that the classifier ranks a randomly chosen positive instance higher than a randomly chosen negative one.

$$ \mathrm{Specificity} = TN / (TN + FP) $$
$$ \mathrm{Accuracy} = (TP + TN) / (TP + TN + FP + FN) $$
(3)
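A sketch of one Monte Carlo fold computing the Eq. 3 metrics plus AUC with scikit-learn; the classifier and split ratio are assumptions, and repeating with 100 random seeds and averaging gives the reported values.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

def one_fold(X, y, seed):
    Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.3,
                                          stratify=y, random_state=seed)
    clf = SVC(kernel="linear").fit(Xtr, ytr)
    tn, fp, fn, tp = confusion_matrix(yte, clf.predict(Xte)).ravel()
    return {"acc": (tp + tn) / (tp + tn + fp + fn),      # Eq. 3
            "spe": tn / (tn + fp),                       # Eq. 3
            "auc": roc_auc_score(yte, clf.decision_function(Xte))}

folds = [one_fold(X, y, s) for s in range(100)]    # 100-fold Monte Carlo CV
avg = {k: np.mean([f[k] for f in folds]) for k in folds[0]}
```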

To validate the optimal weights found during training, we compute the difference in accuracy between cluster level 90 of the optimal solution and the base accuracy used in the original SVM-RCE study [22]. Using the optimized weights from the training part of the algorithm, we observed an improvement for seven of the datasets, while two datasets (GDS3900, GDS4516_4718) showed similar accuracy, as shown in Fig. 4. The figure shows that in some cases the improvement over the standard approach may reach 10%; in most cases, the improvement is around 5%.

Since we are dealing with high-dimensional data, we also need to look at other metrics, specifically AUC and specificity. In terms of specificity, Fig. 5 illustrates that ten out of twelve datasets outperformed or matched the original SVM-RCE algorithm. This suggests that we are not overfitting with this approach and that the robustness of the classifier is preserved. Additionally, when comparing the AUC in Fig. 6, the optimal approach generally shows an even greater improvement: seven of the datasets show similar or better performance, and four of them (GDS2519, GDS3646, GDS3875, GDS5093) show an increase of about 10% in AUC, which is quite significant. From these results, we conclude that there is an overall increase in performance across the datasets. We believe that increasing the number of iterations used to search for the optimal set of weights may yield improvements across all the datasets, at the cost of additional computation time.

Fig. 4. The difference between the accuracy of the optimal weights and the base accuracy

Fig. 5. The difference between the AUC of the optimal weights and the base AUC

Fig. 6. The difference between the sensitivity of the optimal weights and the base sensitivity

5 Conclusions

In this study, we have proposed an optimization approach for computing the weights of the scoring function so as to provide the optimal ranking of the clusters. We do this by using Bayesian optimization to find the optimal set of weights, and we then compare the results with SVM-RCE. Since SVM-RCE operates with a different (accuracy-based) ranking function, we wanted to validate whether our approach provides an improvement. Comparing the results across 12 datasets, the overall performance improves in most cases in terms of accuracy, sensitivity and AUC. We note, however, that the algorithm is time consuming, since the optimization procedure requires many evaluations to find suitable weights in the search space. Moreover, because it is based on SVM-RCE, this approach also allows researchers to investigate the underlying genes relevant to their biological question: we can identify the genes that contribute most to a given disease or research topic (e.g. fear conditioning). This helps identify genes whose under- or overexpression characterizes a disease, which could aid the medical diagnosis of diseases and the understanding of the role genes play in biological processes.