Abstract
Gene expression data classification provides a challenge in classification due to it having high dimensionality and a relatively small sample size. Different feature selection approaches have been used to overcome this issue and SVM-RCE being one of the more successful approach. This study is a continuation of two previous research studies SVM-RCE and SVM-RCE-R. SVM-RCE-R suggests a new approach in the scoring function for the clusters, showing that for some different combination of weights the performance was improved. The aim of this study is to find the optimal weights for the scoring function suggested in the study of SVM-RCE-R using optimization approaches. We have discovered that finding the optimal weights for the scoring function would improve the performance of the SVM-RCE- in most cases. We have shown that in some cases the performance is increased dramatically by 10% in terms of accuracy and AUC. By increasing the performance of the algorithm, it is more likely that we can extract subset genes relating to the class association of a microarray sample.
Access provided by Autonomous University of Puebla. Download conference paper PDF
Similar content being viewed by others
Keywords
1 Introduction
Gene expression research is one of the major research areas in the field of bioinformatics. There is exponential growth in the biological data produced by DNA microarray technology [1,2,3]. This approach is high throughput, allowing scientists to measures multitudes of genes at the same time. Through this method, researchers can study and analyze numerous genes at the same time. DNA microarray technologies is providing great insight in genomic data and is changing the field of bioinformatics. Drug discovery, prevention of disease as well as cures, biological interactions, plant and animal metabolisms are underlying issues addressed by gene expression levels [4]. Additionally, there is widespread research in cancer studies to find potential biomarkers based on gene expression levels in order to find potential biomarkers [5,6,7]. The focus of research is on a small subset of genes that are relevant to the phenomenon under study among the different genes also known as the feature subset problem. DNA microarray technology are essentially the measurement of different genes at the stages of translation and transcription. There are two major methods of obtaining DNA microarray data: hybridization of sample to cDNA and high-density oligonucleotide chips [8]. Nevertheless, the data produced by these methods suffer from being highly redundant, large scale and the curse of dimensionality [9]. In order to solve the problems and dispel the curse, feature selection is the approach widely regarded by the bioinformatics community [10,11,12].
In general, feature selection can be categorized into three groups: filter approach, wrapper approach, embedded approach (combination of the previous methods) [13]. Filter methods focuses on the intrinsic characteristics of the genes in terms of their relevance or in their discriminative properties. The genes are ranked according to the filter method and the highest ranked genes are used are the remaining are eliminated. This methodology does not rely on any machine learning algorithm therefore the time complexity is quite low and can be used for large datasets. Moreover, the results are simplified and can be easily be verified in wet labs by biological domain experts. Thus, univariate filter approaches have widely leveraged to analyze and study gene expression levels [14]. Among the different filter approaches, Xing et al. [15] reports that IG (Information Gain) to be the best approach. However, this approach does not perform well for heterogeneous datasets whereas Bayesian Networks show their strength in this regard [16]. Therefore, different filter techniques outperform each other depending on the dataset. In wrapper approach, the genes are searched then judged based on the estimated accuracy of a classifier. The extracted genes are then used to train the classifier. Zhang et al. [17] asserted that wrapper methods outperform filter methods in terms of predictive accuracy of the classifier. Moreover, this approach also integrates the interaction of the gene selection with the classification that is independent in the filter approach. Nevertheless, this approach has cost of being computationally intensive and in some cases cause overfitting of the classifier [18]. Finally, we have embedded approaches wherein the search algorithm is rooted in the classification algorithm. Therefore, it has the advantage of the interaction of search algorithm with the classifier while being far less computationally intensive [19]. One of the more successful approaches is to use SVM (Support Vector Machines) with an embedded feature selection algorithm [20].
SVM-RFE (Recursive feature elimination) [21] was introduced where the authors achieved very high accuracy with their classifier in comparison to other discriminant methods using SVM. In this method, the genes are ranked as features and the lowest ranked features are removed. Yousef et al. [22] introduced SVM-RCE (Recursive cluster elimination); moreover, it was reported to outperform SVM-RFE. SVM-RCE uses KNN to cluster the genes and then uses SVM to rank the clusters with their respective scores while eliminating the lower ranked clusters. Based on its widespread interest, Luo et al. [23] recently improved the computation time by applying an infinite norm of weight coefficient vector to each cluster to score them. They removed the lowest performing genes instead clusters when the number of clusters are small. Additionally, we wanted to empower SVM-RCE and we introduced SVM-RCE-R (Rank) [24] that extended SVM-RCE with a user specific ranking function. Here the user can choose which clusters should be ranked higher based on different metrics (accuracy, sensitivity, f-measure, area under the curve and precision), thereby allowing scientists to explore the biological data in depth to their needs.
Based on improving this method on a greater scale, we are now introducing SVM-RCE-R-OPT which searches for the optimal set of weights resulting in an improvement in our classification results. We use Bayesian optimization to find the parameters for our six different weights. We compare SVM-RCE and SVM-RCE-OPT across 15 datasets to validate the findings that this approach does improve the classification results.
2 Methods and Implementation
We optimize SVM-RCE-R which is based on an early study SVM-RCE which lead us to the present approach of SVM-RCE-R-Opt. The methods and approaches used are described in the upcoming sections for SVM-RCE and SVM-RCE-R. We then describe in detail how we optimized SVM-RCE-R algorithm and the platform we used to implement SVM-RCE-R-Opt.
2.1 SVM-RCE
SVM-RCE is the first algorithm that suggests clustering genes using K-Means into clusters arranged according to correlation metric, in order to perform feature selection procedure by considering each cluster of genes as one unit. Then one needs to score each cluster of genes in terms of the classification of the training set that consist of two-classes. For that purpose, the training data was transformed to be represented based on the genes that belong to a specific cluster with the original class of the training set. Then an internal cross-validation is performed in order to compute the score. The score is the average of the accuracy performance of the cross-validation step. This step is applied for each cluster detected by k-mean. The next step is to rank all the clusters according to its score. The SVM-RCE removes the cluster with the lowest score or it can set to remove percentage of the lowest scored clusters. Thus, the results obtained is without the genes that are associated with the removed clusters as they do not contribute much to the prediction capabilities of the classifier.
2.2 SVM-RCE-R
Based on the interests of SVM-RCE in biological research, we decided to empower this algorithm with a user specific ranking function in SVM-RCE-R. In SVM-RCE, clusters were scored according to their accuracies. However, in data analysis of high dimensionality data with small sample size other metrics are preferred in the understanding of the features that contribute most to classification. This lead to the implementation of a user specific scoring function, in which researchers can choose the scoring function according to their needs. Therefore, we used the commonly used scoring metrics as described in Eq. 1 as our scoring function.
where acc, sen, spe, fm, auc and prec refer to accuracy, sensitivity, specificity, f-mean, precision and area under curve respectively.
The ranking is then computed as a sorted list of clusters based on the score \({\text{S()}}\). We noted that we could achieve significant improvements in our results in comparison to SVM-RCE. We also did notice that some combinations of clusters with different weights lead to better accuracy score even when the focus was solely on accuracy (Accuracy had the highest weight and the remaining metrics are zero). Therefore, the most important aspect is how one can compute the optimal weights \({\text{w1,w2,w3,w4,w5,w6}}\) such that it improves the overall performance of the algorithm. Therefore, we decided to focus on an optimization approach to find the best combination of weights for our next step in this approach.
2.3 SVM-RCE-R Optimal
Our new proposed approach SVM-RCE-R-OPT is implemented in KNIME [25] due to its user friendliness similarly to SVM-RCE-R. We split the gene expression dataset into train and test sets with a ratio of 30:70 respectively. Moreover, we used stratified sampling to make sure the training data and test data have the same ratio of negative to positive samples. The parameter optimization node in KNIME is used to find the optimal weights for our six different ranks which was used in SVM-RCE-R (acc, sen, spec, etc.). This node uses Bayesian optimization [26] as the search strategy to find the optimal weights for our six different ranks. The algorithm that is used for the maximization for the objective function is illustrated in Eq. 2 based on the original paper.
The search strategy works in two phases: warm up phase and then the Tree-structured Parzen Estimation (TPE) phase. During the warm up phase, random combinations of the weights are used, and then based on the objective values found, the TPE phase starts. Moreover, users can specify the step size of each weight as well as the number of iterations for each phase. Meanwhile, the search algorithm draws weights with replacement from the search space.
Our objective function is based on our scoring function which was stated in Eq. 1. The means the scoring function across all the cluster levels (number of clusters) is used as our objective function, which was set to be maximized. We specified the step size to be 1000, since prior testing showed that there was an improvement by using heavier weights. After the optimal set of weights have been identified in the search space from the Bayesian optimization, they are then used in a separate node that runs SVM-RCE-R with those weights with the training split of the data. Similarly, we also ran SVM-RCE-R with only the weight of accuracy (acc) set to the maximum and the remaining weights set to zero on the same training set. This provides us a reference to validate whether the weights found are an improvement over the original SVM-RCE algorithm, since the algorithm uses accuracy as the ranking function.
2.4 KNIME Workflow
There the overview of the workflow that we programmed in KNIME [25] as reflected in Fig. 1. The user has to set the list file node to the folder that contains the datasets. The workflow then generates a folder with all the relevant output files.
The first step in the workflow is to split the data into two parts. One to calculate the optimal weights of the scoring function and the other for cross-validation of the standard approach (SVM_RCE_ACC meta node) and for the optimal approach (SVM_RCE_OPT_Corr meta node). The data that was used in computing the optimal weights is not used in the next steps to avoid overfitting.
The SVM-RCE-R is applied with different weights (Parameter Optimization Loop start node) on the training set (Fig. 2) where the results are collected at the loop end node. The maximum iterations used to find the optimum weights was limited to 15 iterations. The different weights are then computed according to the search algorithm mentioned in the previous section. When the optimal set of weights are discovered, then it is used for testing set of the data (Save output node).
3 Data
We have considered the same data used in the study of SVM-RCE-R and we have included two additional new datasets. The datasets represent a range of different diseases, cancers and studies. In terms of diseases: Early-stage Parkinson's disease (GDS2519), Ulcerative colitis (GDS3268), celiac disease (GDS3646), diabetes in children (GDS3875), effects of tobacco smoke in foetal cells (GDS3929), HIV (GDS4228), asthma (GDS5037), acute dengue (GDS5093), pulmonary hypertensions (GDS5499), multi-omic analysis of COVID-19 (GSE157103). In terms of different types of cancer: glioma (GDS1962), prostate cancer (GDS2547), colorectal cancer, (GDS4516_4718), fear conditioning studies in mice (GDS3900). All of the datasets have at least 100 samples except for GDS5093 that has 56 samples. These 14 gene expression datasets are downloaded from Gene Expression Omnibus at NCBI (GEO) [27]. The format of datasets contains sample IDs as the column names and the gene name (gene symbols), according to their respective platform, as the row names with their relevant gene expression values (Fig. 3). Since most of the datasets produced are based on different chips or platforms, the number of genes vary and the exact amount can be found by their GEO accession numbers.
4 Results
We have applied 100-fold Monte Carlo cross-validation [27] for the original approach and for the optimal approach. In each fold, we compute different performance metrics such as accuracy, specificity and area under curve (AUC). The average of all the 100 folds is computed for each metric. Accuracy and specificity is computed in Eq. 3 where TP is true positives, TN is true negatives, FP is false positives and FN is false negatives. Meanwhile AUC is calculated based on the probability that a classifier will rank a randomly positive instance higher that a negative one.
To validate the optimal weights found from the training, we compute the difference in accuracy between cluster level 90 of the optimal solution and the base accuracy used in the original study SVM-RCE [22]. Using the optimized weights from the training part of the algorithm, we observed that we have an improvement for seven of the datasets, while two of the datasets (GDS3900, GDS4516_4718) showed similar accuracy as shown in Fig. 4. The figure shows that in some cases the improvement over the standard approach might even reach to 10%; in most cases, the improvement is around 5%.
Since we are dealing with high dimensional data, we need to look at other metrics more specifically AUC and specificity. In terms of the specificity, Fig. 5 illustrates that ten out of twelve datasets outperformed or had similar performance as the original SVM-RCE algorithm. This could imply that we are not overfitting in this approach and the robustness of the classifier is still preserved. Additionally, when comparing the AUC in Fig. 6, we note that the optimal approach generally shows an even greater improvement. We can see that seven of the datasets shows either similar or better performance. We can see about a 10% increase in AUC performance for four of the datasets (GDS2519, GDS3646, GDS3875, GDS5093) which is quite a significant improvement. From these results, we can conclude that there is an overall increase in the performance of the datasets. We believe if the number of iterations to search for the optimal set of weights is increased, we may see improvements across all the datasets. However, this could have the drawback of being computational expensive as well as time consuming.
5 Conclusions
In this study, we have proposed an optimization approach for computing the weights for the scoring function that would provide the optimal solution for ranking the clusters. We perform this by using Bayesian search optimization to find the optimal set of weights, then we compare the results with the SVM-RCE. Since SVM-RCE operated using a different ranking function, we wanted to validate whether this approach provides an improvement. When comparing the results across 12 datasets the overall performance is improved in most cases in terms of the accuracy, sensitivity and AUC. However, we would like to note that this algorithm is time consuming since the optimization procedure requires a long time in order to find the suitable weights in the search space. Moreover, approach also allows researchers to understand underlying genes related to biological research since it is based on the SVM-RCE. Therefore, we can find the genes that most contribute to the certain disease or specific research topic (e.g. fear factor). This would help us to find genes that help in identifying diseases in terms of expression levels, under expressed and overexpressed. This could potentially help in medical diagnosis of diseases and understanding of the role genes play in biological processes.
References
Chandra, B., Gupta, M.: An efficient statistical feature selection approach for classification of gene expression data. J. Biomed. Inf. 44, 529–535 (2011)
McConnell, P., Johnson, K., Lockhart, D.J.: An introduction to DNA microarrays. In: Methods of Microarray Data Analysis II. Proceedings of the Second Conference on Critical Assessment of Microarray Data Analysis, CAMDA 2001, pp. 9–21. Kluwer Academic Publishers, Dordrecht (2002)
Dopazo, J.: Microarray data processing and analysis. In: Methods of Microarray Data Analysis II, Proceedings of the Second Conference on Critical Assessment of Microarray Data Analysis, CAMDA 2001, pp. 43–63. Kluwer Academic Publishers, Dordrecht (2002)
Riva, A., Carpentier, A.S., Torresani, B., Henaut, A.: Comments on selected fundamental aspects of microarray analysis. Comput Biol Chem 29, 319–336 (2005)
Veer, L., Da, H., Bijver, M., et al.: Gene expression profiling predicts clinical outcome of breast cancer. Nature 415, 530–536 (2002)
Zajchowski, D., et al.: Identification of gene expression profiles that predict the aggressive behavior of breast cancer cells. Cancer Res. 61, 5168–5178 (2001)
Veer, L., Jone, D.: The microarray way to tailored cancer treatment. Nat. Med. 8, 13–14 (2002)
Allison, D.B., Cui, X., Page, G.P., Sabripour, M.: Microarray data analysis: from disarray to consolidation and consensus. Nat. Rev. Genet. 7, 55–65 (2006)
Ying, L., Han, J.: Cancer classification using gene expression data. Inf. Syst. 28, 243–268 (2003)
Lazar, C., et al.: A survey on filter techniques for feature selection in gene expression microarray analysis. IEEE/ACM Trans. Comput. Biol. Bioinf. 9, 1106–1119 (2012)
Li, T., Zhang, C., Ogihara, M.: A comparative study of feature selection and multiclass classification methods for tissue classification based on gene expression. Bioinformatics 20, 2429–2437 (2004)
Ang, J.C., Mirzal, A., Haron, H., Hamed, H.N.A.: Supervised, unsupervised, and semi-supervised feature selection: a review on gene selection. IEEE/ACM Trans. Comput. Biol. Bioinf. 13, 971–989 (2016)
Zhu, S., Wang, D., Yu, K., Li, T., Gong, Y.: Feature selection for gene expression using model-based entropy. IEEE/ACM Trans. Comput. Biol. Bioinf. 7, 25–36 (2010)
Aris, V., Recce, M.A.: Method to improve detection of disease using selectively expressed genes in microarray data. In: Methods of Microarray Data Analysis, Proceedings of the First Conference on Critical Assessment of Microarray Data Analysis, CAMDA 2000, pp. 69–80. Kluwer Academic Publishers, Dordrecht (2002)
Xing, E.P., Jordan, M.I., Karp, R.M.: Feature selection for high-dimensional genomic microarray data. In: Proceeding of 18th International Conference on Machine Learning (2001)
Giallourakis, C., Henson, C., Reich, M., Xie, X., Mootha, V.K.: Disease gene discovery through integrative genomics. Annu. Rev. Genomics Hum. Genet. 6, 381–406 (2005)
Zhang, H., Ho, T.B., Kawasaki, S.: Wrapper feature extraction for time series classification using singular value decomposition. Int. J. Knowl. Syst. Sci. 3, 53–60 (2006)
Loughrey, J., Cunningham, P.: Overfitting in wrapper-based feature subset selection: the harder you try the worse it gets. In: Bramer, M., Coenen, F., Allen, T. (eds.) Research and Development in Intelligent Systems XXI, pp. 33–43. Springer London, London (2005). https://doi.org/10.1007/1-84628-102-4_3
George, V.S., Raj, C.: Review on feature selection techniques and the impact of svm for cancer classification using gene expression profile. Int. J. Comput. Sci. Eng. Surv. 2, 16–27 (2011)
Li, F., Yang, Y.: Analysis of recursive gene selection approaches from microarray data. Bioinformatics 21, 3741–3747 (2005)
Guyon, I., Weston, J., Barnhill, S., Vapnik, V.: Gene selection for cancer classification using support vector machines. Mach. Learn. 46, 389–422 (2002)
Yousef, M., Jung, S., Showe, L.C., et al.: Recursive cluster elimination (RCE) for classification and feature selection from gene expression data. BMC Bioinformatics 8, 144 (2007)
Luo, L., Huang, D., Ye, L., Zhou, Q., Shao, G., Peng, H.: Improving the computational efficiency of recursive cluster elimination for gene selection. IEEE/ACM Trans. Comput. Biol. Bioinf. 8, 122–129 (2011)
Yousef, M., Bakir-Gungor, B., Jabeer, A., Goy, G., Qureshi, R., Showe, L.C.: Recursive cluster elimination based rank function (SVM-RCE-R) implemented in KNIME. F1000Research 9, 1255 (2020). https://doi.org/10.12688/f1000research.26880.1
Berthold, M.R., et al.: KNIME - the Konstanz information miner. SIGKDD Explorations 11, 26–31 (2009). https://doi.org/10.1145/1656274.1656280
Bergstra, J., Bardenet, R., Bengio, Y., Kégl, B.: Algorithms for hyper-parameter optimization. In: Proceedings of the 24th International Conference on Neural Information Processing Systems, pp. 2546–2554. Curran Associates Inc., Red Hook, NY (2011)
Barrett, T., et al.: NCBI GEO: archive for functional genomics data sets—update. Nucleic Acids Res. 41 (2013). https://doi.org/10.1093/nar/gks1193
Xu, Q.-S., Liang, Y.-Z.: Monte Carlo cross validation. Chemom. Intell. Lab. Syst. 56, 1–11 (2001). https://doi.org/10.1016/S0169-7439(00)00122-2
Author information
Authors and Affiliations
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2021 Springer Nature Switzerland AG
About this paper
Cite this paper
Yousef, M., Jabeer, A., Bakir-Gungor, B. (2021). SVM-RCE-R-OPT: Optimization of Scoring Function for SVM-RCE-R. In: Kotsis, G., et al. Database and Expert Systems Applications - DEXA 2021 Workshops. DEXA 2021. Communications in Computer and Information Science, vol 1479. Springer, Cham. https://doi.org/10.1007/978-3-030-87101-7_21
Download citation
DOI: https://doi.org/10.1007/978-3-030-87101-7_21
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-87100-0
Online ISBN: 978-3-030-87101-7
eBook Packages: Computer ScienceComputer Science (R0)