1 Introduction

Identifying differentially expressed genes (DEGs) from RNA-seq data is important for discovering the mechanisms and pathways underlying a disease. Conventional statistical methods for finding DEGs often apply a univariate test to each gene. Therefore, they do not take into account the correlations between genes or the concordant or discordant effects between gene groups [1]. These statistical methods can also generate a large number of false positives and false negatives because of small biases in the distribution estimates used to predict DEGs from RNA-seq data [2]. To address these problems, machine learning algorithms can be used to find DEGs associated with the disease. Wenric and Shemirani [1] used the permutation importance generated by the Random Forests algorithm to find DEGs in 12 datasets containing samples of various cancers. The Random Forests algorithm outperformed classical methods in most datasets. Wang et al. [3] compared three feature selection algorithms (Information Gain, Correlation Feature Selection, and ReliefF) using five classification algorithms (Logistic Regression, Classification via Regression, Random Forest, Logistic Model Trees, Random Subspace) to detect significant genes. Kakati et al. [2] proposed a deep neural network model called DEGnet to identify DEGs. Yu et al. [4] studied attaching the biological significance of regulatory information to differential expression analysis. For this purpose, they used Naive Bayes, Random Forest, and Support Vector Machine with a radial basis kernel. Al-Obeidat et al. [5] proposed discrete filtering for RNA-seq gene expression data to carry out feature selection. They used the Binary Artificial Bee Colony Algorithm and Support Vector Machine to select the fittest and most relevant subset of features for classifying tumor samples as malignant or benign.

As is clear from these examples, many machine learning algorithms are used for gene selection. These algorithms can propose quite different genes from each other. The number of proposed genes can also vary considerably from algorithm to algorithm. Therefore, in this paper, a web tool called GeneSelectML is developed using the shiny package [6]. GeneSelectML allows users to discover DEGs using different machine learning algorithms simultaneously. This web tool is available at www.softmed.hacettepe.edu.tr/GeneSelectML and a snapshot of the main page is given in Fig. 1.

We selected the machine learning algorithms from among those shown to be successful in the literature. During the preliminary analysis, we searched for and tested many other algorithms as well. However, some of them were eliminated because a predict function was unavailable, because they gave unstable results on a case study, or because the package had been discontinued on CRAN. The second criterion for inclusion is the accessibility of the algorithm in R or Bioconductor. Since the web tool we developed is a shiny-based application, the algorithms must be available in R. There are other shiny-based web tools designed for various purposes for RNA-seq gene expression data, such as pre-processing [7], discovering DEGs [8, 9], and conducting gene ontology analysis [10]. GeneSelectML is distinguished from other shiny-based web tools by the fact that it uses different machine learning algorithms simultaneously for gene selection and can perform pre-processing, graphical representation, and gene ontology analyses all in the same tool. There is also a software package called CAMUR, developed for RNA-seq datasets, that uses machine learning algorithms for selecting significant genes [11]. However, this software needs a MySQL database to run and is not appropriate for online use.

A real-life example dataset on Alzheimer’s disease is used in this study to illustrate the methods and the web tool. The Alzheimer’s disease dataset is obtained from the Gene Expression Omnibus (GEO) Database [12]. It includes miRNAs obtained from 48 Alzheimer’s disease patients and 22 controls. miRNAs have important functions at the post-transcriptional level of gene expression in many pathological conditions, including Alzheimer’s disease (AD) [13]. Previous studies showed that AD brains have significant miRNA alterations compared to healthy controls [14, 15]. Alzheimer’s disease, a major cause of dementia, is a progressive neurodegenerative disorder that is forecast to affect 1 in 85 people globally in 2050 [16]. Moreover, there is no curative treatment for AD and the pathological mechanism is not fully known. Therefore, new biomarkers and treatment strategies are needed. With the recent advent of new “-omics” based technologies, large amounts of data are being generated. To analyze and interpret these data quickly for diagnostic and therapeutic use, user-friendly and fast tools are needed. After the data are uploaded to the web tool, pre-processing steps including filtering, normalization, transformation, and univariate analysis are carried out. Different machine learning methods are applied to the pre-processed dataset simultaneously and the DEGs are discovered. Moreover, a network plot, heatmap, venn diagram, and box-and-whisker plot can be obtained, and gene ontology analysis can be conducted.

The sections of this paper are organized as follows: Section 2 introduces the proposed methodology, including the pre-processing procedures, the machine learning techniques, and the development of the web tool. Section 3 provides the implementation of the GeneSelectML web-based tool on the Alzheimer’s disease data, its findings, a case study based on this dataset, and the validation of the tool on a different dataset. Finally, the paper concludes with a summary of the main findings of our study.

2 Methods

2.1 Pre-processing

RNA-seq data must go through some pre-processing steps. These steps can be generalized as filtering, normalization, transformation, and univariate analysis.

2.1.1 Filtering

It is recommended to filter low expressed genes before analysis. Filtering can be done in different ways. The following filtering methods are available in the web tool:

  i) Genes with all readings lower than a specified threshold can be eliminated.

  ii) Genes with “near-zero variances” can be excluded.

These filtering methods are available in genefilter [17] and caret [18] packages, respectively.
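The two filters above can be sketched in a few lines of Python. This is only an illustration, not the tool's code: the function names are hypothetical, and the near-zero variance rule (frequency ratio above 95/5 and at most 10% unique values, or zero variance) is an assumption modelled on the defaults commonly documented for caret's nearZeroVar.

```python
from collections import Counter


def filter_low_counts(counts, threshold):
    """Drop genes whose readings are ALL below `threshold`.

    counts: dict mapping gene name -> list of read counts across samples.
    """
    return {g: v for g, v in counts.items() if max(v) >= threshold}


def is_near_zero_variance(values, freq_cut=95 / 5, unique_cut=10.0):
    """Assumed rule: zero variance, or a dominant value plus few unique values."""
    top_two = Counter(values).most_common(2)
    if len(top_two) == 1:
        return True  # a single distinct value means zero variance
    freq_ratio = top_two[0][1] / top_two[1][1]
    percent_unique = 100.0 * len(set(values)) / len(values)
    return freq_ratio > freq_cut and percent_unique <= unique_cut


def filter_near_zero_variance(counts):
    """Drop genes flagged by the near-zero variance rule."""
    return {g: v for g, v in counts.items() if not is_near_zero_variance(v)}


# Hypothetical toy data: g1 is lowly expressed, g3 is constant.
toy = {"g1": [0, 1, 0, 2], "g2": [10, 50, 30, 20], "g3": [5, 5, 5, 5]}
print(sorted(filter_low_counts(toy, threshold=5)))
print(sorted(filter_near_zero_variance(toy)))
```

Keeping the two filters as separate functions mirrors the text, where the user picks one of the two options.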

2.1.2 Normalization

Normalization is applied to RNA-seq data to minimize bias that may arise from technical processes. The number of readings required in RNA-seq data is determined by the minimum amount of RNA species of interest. Sequencing depth can be increased for the purposes such as identifying genes with low expression levels, identifying very small fold changes between different situations and detecting new transcripts. However, different sequencing depth values may lead to underestimation or overestimation of gene expression levels [19]. Another source of variation is gene length. Longer genes may have higher readings, i.e., expression levels, than genes with shorter sequences due to differences in their size [20].

Normalization aims to make the samples comparable by reducing the effect of such bias factors. Many methods have been developed for the normalization of RNA-seq data. The methods are generally based on scaling the data according to a calculated normalization factor.

Median ratio normalization

Consider a gene expression matrix with samples as rows \((i=1,\ldots ,n)\) and genes as columns \((g=1,\ldots ,p)\). This matrix contains raw gene read counts \(X_{ig}\). For each gene, a reference value is created by taking the geometric mean across all samples. Then, the ratio of the sample of interest to the reference is calculated for each gene. Finally, the normalization factor for the relevant sample is calculated as the median of these ratios. Normalized values are obtained by dividing the read counts of each gene by the normalization factor of the corresponding sample.

The normalization factor \((d_{i})\) for each sample can be calculated as follows [21]:

$$\begin{aligned} d_{i}=\operatorname{median}_{g}\frac{X_{ig}}{{\left( \prod _{j=1}^{n} X_{jg}\right) }^{1/n}} \end{aligned}$$
(1)

This method can be applied to the data using the DESeq2 package [22] in R Bioconductor.
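As a rough illustration of Eq. (1), the following Python sketch computes the factors \(d_{i}\) from a small count matrix. It is a minimal sketch, not the DESeq2 implementation: the function name is hypothetical, and skipping genes that contain a zero in any sample (whose geometric mean would be zero) is an assumption made here to keep the ratios defined.

```python
import math
from statistics import median


def median_ratio_factors(counts):
    """Normalization factor d_i per sample, following Eq. (1).

    counts: list of samples (rows); each row is a list of raw counts per gene.
    Assumption: genes with a zero count in any sample are left out of the median.
    Raises statistics.StatisticsError if no gene is free of zeros.
    """
    n, p = len(counts), len(counts[0])
    log_geo_mean = []
    for g in range(p):
        column = [counts[i][g] for i in range(n)]
        if min(column) > 0:
            log_geo_mean.append(sum(math.log(x) for x in column) / n)
        else:
            log_geo_mean.append(None)  # geometric mean would be zero
    factors = []
    for i in range(n):
        log_ratios = [
            math.log(counts[i][g]) - log_geo_mean[g]
            for g in range(p)
            if log_geo_mean[g] is not None
        ]
        factors.append(math.exp(median(log_ratios)))
    return factors


# Hypothetical data: the second sample is sequenced twice as deeply as the first.
raw = [[10, 20, 30], [20, 40, 60]]
d = median_ratio_factors(raw)
normalized = [[x / d[i] for x in row] for i, row in enumerate(raw)]
print(d, normalized)
```

Working on the log scale avoids overflow in the product over samples; after division by \(d_{i}\), the two toy samples become identical, which is the intended effect of the normalization.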

Trimmed mean of M values normalization (TMM)

Genes with very low or very high expression levels are removed from the dataset based on their M values. M values (\(M_{ig}\)) are trimmed by 30% by default [23]. Weight values (\(w_{ig}\)) are calculated for the remaining genes. Then, the normalization factor is calculated based on these weights. The log-transformed normalization factor is calculated as follows:

$$\begin{aligned} \log _{2}{(d_{i})}=\frac{\displaystyle \sum \limits _{g=1}^{p'} w_{ig}M_{ig}}{\displaystyle \sum \limits _{g=1}^{p'} w_{ig}} \end{aligned}$$
(2)

where \(p'\) indicates the number of genes after trimming.
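Eq. (2) can be illustrated with a simplified Python sketch for one sample against one reference sample. Several details are assumptions not stated in the text: M is taken as the log2 ratio of count proportions, the weights are inverse approximate variances, absolute expression values are additionally trimmed by 5%, and trimming is rank based. The function name is hypothetical and the choice of reference sample is left to the caller.

```python
import math


def tmm_factor(sample, ref, m_trim=0.30, a_trim=0.05):
    """Simplified TMM factor d_i of `sample` relative to `ref` (Eq. 2).

    sample, ref: lists of raw counts per gene (same gene order).
    Genes with a zero count in either library are skipped.
    Raises ZeroDivisionError if no gene survives the trimming.
    """
    n_s, n_r = sum(sample), sum(ref)
    stats = []  # (M value, A value, weight) per usable gene
    for y_s, y_r in zip(sample, ref):
        if y_s == 0 or y_r == 0:
            continue
        p_s, p_r = y_s / n_s, y_r / n_r
        m = math.log2(p_s / p_r)
        a = 0.5 * math.log2(p_s * p_r)
        # assumed weight: inverse of the approximate variance of M
        w = 1.0 / ((n_s - y_s) / (n_s * y_s) + (n_r - y_r) / (n_r * y_r))
        stats.append((m, a, w))
    k = len(stats)
    cut_m, cut_a = math.floor(k * m_trim), math.floor(k * a_trim)
    by_m = sorted(range(k), key=lambda j: stats[j][0])
    by_a = sorted(range(k), key=lambda j: stats[j][1])
    # keep genes that survive both the M trimming and the A trimming
    keep = set(by_m[cut_m:k - cut_m]) & set(by_a[cut_a:k - cut_a])
    numerator = sum(stats[j][2] * stats[j][0] for j in keep)
    denominator = sum(stats[j][2] for j in keep)
    return 2 ** (numerator / denominator)


# Hypothetical data: one highly expressed gene inflates the library size.
reference = [100] * 10
sample = [100] * 9 + [10000]
print(tmm_factor(sample, reference))
```

In the toy example the single dominant gene is trimmed away, so the factor reflects only the shift that it induces in the proportions of the other nine genes.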

TMM with singleton pairing normalization (TMMwsp)

This method is a variant of TMM that performs better for data with a high proportion of zeros. In the TMM method, one sample is chosen as the reference sample. The fold changes and absolute expression levels are obtained relative to the reference sample. Genes with a zero count in either the sample of interest or the reference sample are discarded. Unlike the TMM method, the TMMwsp method makes a correction that uses the read counts of these genes.

Upper quartile normalization

Transcripts with zero values are removed from the dataset, and the remaining counts are scaled by their 75th percentile. Because the scaling factor is an upper quantile, this method is affected by genes with high expression levels.
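A minimal Python sketch of this scaling is given here for illustration. It assumes that zeros are dropped within each sample before the 75th percentile is taken and that the percentile uses linear interpolation; package implementations may differ on both points, and the function names are hypothetical.

```python
def percentile(values, q):
    """Percentile with linear interpolation between order statistics (0 <= q <= 1)."""
    s = sorted(values)
    pos = q * (len(s) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(s) - 1)
    return s[lo] + (pos - lo) * (s[hi] - s[lo])


def upper_quartile_normalize(counts):
    """Scale each sample by the 75th percentile of its non-zero counts.

    counts: list of samples (rows); each row is a list of counts per gene.
    Raises IndexError for a sample that contains only zeros.
    """
    normalized = []
    for row in counts:
        nonzero = [x for x in row if x > 0]
        uq = percentile(nonzero, 0.75)
        normalized.append([x / uq for x in row])
    return normalized


# Hypothetical single-sample example.
print(upper_quartile_normalize([[0, 1, 2, 3, 4, 5]]))
```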

2.1.3 Transformation

Normalizing the data may not be sufficient for applying feature selection methods, since expression levels can span a wide range in RNA-seq data. The logarithmic transformation is also used in such a situation. With the logarithmic transformation, the data have a less skewed distribution and fewer extreme values than the untransformed data. The logarithm is undefined when the count for a gene is zero, which can occur under some conditions. To avoid this situation, the transformation is performed after a prior count of 1 is added.

Let the normalized count of gene \(g\) in sample \(i\) be denoted by \(X_{ig}^{'}\). In this case, the transformed values can be represented as follows:

$$\begin{aligned} Y_{ig}=\log _{2}\left( X_{ig}^{'}+1\right) \end{aligned}$$
(3)
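For completeness, Eq. (3) amounts to the following short Python sketch; the function name and the `prior_count` argument are illustrative choices, with the prior count fixed at 1 as in the text.

```python
import math


def log2_transform(normalized, prior_count=1.0):
    """Apply Y = log2(X' + prior_count) to every normalized count."""
    return [[math.log2(x + prior_count) for x in row] for row in normalized]


# A zero count maps to 0 instead of being undefined.
print(log2_transform([[0, 1, 3, 7]]))
```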

2.1.4 Filtering with univariate analysis

Univariate analysis can be used to reduce the size of the dataset and identify genes that differ significantly between groups. Our web tool offers two alternatives for the univariate analysis of each gene: comparing the two groups with Student’s t-test using the colttests function, or calculating the AUC with the rowpAUCs function in the genefilter package [17]. If Student’s t-test is selected, the genes are ordered from the smallest p-value to the largest. If the AUC method is chosen instead, the genes are ordered from the highest AUC value to the lowest. In both methods, the user specifies how many genes at the top of the ranking are selected. Together with the p-values obtained from Student’s t-test, adjusted p-values are also calculated according to the Benjamini-Hochberg (FDR) [24] or Benjamini-Yekutieli [25] correction methods. The default is the Benjamini-Hochberg method.
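The AUC ranking and the Benjamini-Hochberg adjustment described above can be sketched in Python as follows. This is an illustration under stated assumptions rather than the genefilter code: the AUC is computed from pairwise comparisons, AUC values below 0.5 are folded to 1 - AUC so that both directions of change rank highly (an assumption), and all function names are hypothetical. The t-test itself is not reproduced here.

```python
def auc(values, labels):
    """P(value of a class-1 sample > value of a class-0 sample); ties count 0.5."""
    pos = [v for v, lab in zip(values, labels) if lab == 1]
    neg = [v for v, lab in zip(values, labels) if lab == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def rank_by_auc(expr, labels, top):
    """Return the `top` genes with the highest folded AUC.

    expr: dict mapping gene name -> list of expression values per sample.
    """
    scores = {}
    for gene, values in expr.items():
        a = auc(values, labels)
        scores[gene] = max(a, 1.0 - a)  # assumed folding around 0.5
    return sorted(scores, key=scores.get, reverse=True)[:top]


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest p-value down
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical data: three controls (0) followed by three patients (1).
labels = [0, 0, 0, 1, 1, 1]
expr = {"up": [1, 2, 3, 7, 8, 9], "flat": [1, 5, 2, 6, 3, 4], "down": [9, 8, 7, 1, 2, 3]}
print(rank_by_auc(expr, labels, top=2))
print(benjamini_hochberg([0.01, 0.04, 0.03, 0.005]))
```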

2.2 Machine learning algorithms

Six different machine learning methods have been used in our web tool. These methods will be explained in the next four sub-titles.

2.2.1 Biosigner algorithm

Rinaudo et al. [26] proposed a four-step algorithm for selecting the important genes and provided the algorithm in biosigner R package. These steps involve constructing a model by using bootstrap sub-samples, ranking the genes by their importance, eliminating the non-significant ones, and deciding on the final model. The models are based on the Partial Least Squares-Discriminant Analysis, Random Forest, and Support Vector Machines. Our web tool provides the list of genes selected by any of these three models.

2.2.2 GMDH-type neural network algorithm

The GMDH-type neural network algorithm is a heuristic self-organizing system for learning complex relations between the explanatory variables and the dependent variable. In its architecture, the neurons that perform better than the rest in each layer, called living cells, are carried forward to the next layer until performance stops improving across layers. At the last layer, one neuron is selected to obtain the predicted output. The features that contribute to model performance are selected at the end. The algorithm is available in the GMDH2 package [27].

2.2.3 Determan’s optimal gene selection algorithm

Our web tool uses Support Vector Machines, Random Forest, and Elastic Net Generalized Linear Models within Determan’s algorithm [28] for gene selection. This algorithm uses the bootstrap to measure feature selection stability and uses cross-validation or leave-one-out procedures to avoid overfitting. The average of the cross-validation results is used to calculate performance measures (such as accuracy and sensitivity), and these measures are then used to obtain the list of best genes. The algorithms are available in the omicsMarkeR package [28].

2.2.4 Data mining algorithm for RNA-Seq data

Chiesa et al. [29] developed a data mining algorithm for RNA-seq data and implemented it in DaMiRseq package. It is possible to normalize, select genes, and classify via this algorithm. We use the gene selection procedures, specifically DaMiR.Fsort and DaMiR.FBest functions, of this algorithm in our web tool.

Genes are first ranked based on RReliefF [30] or standardized RReliefF scores in the DaMiR.Fsort function. RReliefF is a filtering algorithm that can also take the correlation between genes into account. These ranked genes are then used to pick the best subset via the DaMiR.FBest function. The user can either provide the number of selected genes, or the algorithm can automatically pick the best subset by using a threshold on the scaled importance scores. Our web tool uses the latter approach.

Fig. 1 GeneSelectML web tool

2.3 Evaluation of model performances

The performances of the models are obtained through a confusion matrix between the predicted and actual class labels. In our case, Table 1 presents a 2-by-2 classification table where the predicted and actual class labels are provided in the rows and columns, respectively. Various performance measures can be obtained from a confusion matrix. We assess the model performance with accuracy, kappa, Matthews correlation coefficient (MCC), sensitivity, specificity, positive predictive value (PPV), negative predictive value (NPV), balanced accuracy, Youden index, detection rate, detection prevalence, and F1 measure. The calculation of these measures is presented in Table 2. It is recommended to use fivefold or tenfold cross-validation to avoid overfitting in machine learning algorithms [31, 32]. In this study, we use fivefold cross-validation to guard against overfitting and obtain more reliable performance estimates. We obtain the performance measures on the test set of each fold. Then, we report the mean of the performance measures over the five folds.

Table 1 Confusion matrix
Table 2 Performance measures
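The measures listed above can be computed from the four cells of a 2-by-2 confusion matrix, as in the following Python sketch. The formulas are the standard textbook definitions and are assumed, not confirmed, to match Table 2; the function names and the example counts are hypothetical. Averaging over folds follows the procedure described in the text.

```python
import math


def performance_measures(tp, fp, fn, tn):
    """Standard measures from a 2-by-2 confusion matrix.

    tp, fp, fn, tn: true positives, false positives, false negatives, true negatives.
    Degenerate tables (an empty row or column) raise ZeroDivisionError.
    """
    n = tp + fp + fn + tn
    sens = tp / (tp + fn)
    spec = tn / (tn + fp)
    ppv = tp / (tp + fp)
    npv = tn / (tn + fn)
    acc = (tp + tn) / n
    # agreement expected by chance, used by Cohen's kappa
    p_e = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / n ** 2
    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": acc,
        "kappa": (acc - p_e) / (1 - p_e),
        "mcc": (tp * tn - fp * fn) / mcc_den,
        "sensitivity": sens,
        "specificity": spec,
        "ppv": ppv,
        "npv": npv,
        "balanced_accuracy": (sens + spec) / 2,
        "youden": sens + spec - 1,
        "detection_rate": tp / n,
        "detection_prevalence": (tp + fp) / n,
        "f1": 2 * ppv * sens / (ppv + sens),
    }


def mean_over_folds(fold_results):
    """Average each measure over the test-set results of the folds."""
    keys = fold_results[0].keys()
    return {k: sum(r[k] for r in fold_results) / len(fold_results) for k in keys}


# Hypothetical confusion matrices from two folds.
folds = [performance_measures(8, 2, 2, 8), performance_measures(9, 1, 1, 9)]
print(mean_over_folds(folds)["accuracy"])
```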

2.4 Web tool development

The tool is developed using R software. It is organized into seven parts: data upload, pre-processing, methods, selected genes, pathway analysis, visualization, and gene ontology analysis. In the Data upload part, researchers can upload raw gene expression counts in .txt format. The raw data must be an n\(\times\)(1+p) dimensional data matrix, where n is the total number of samples and p is the total number of genes. The first column must be the output variable. The data must include a header indicating gene names. In the Pre-processing part, the caret [18] and genefilter [17] packages are used for filtering. The DESeq2 [22] and edgeR [33] packages are used for normalization. The number of genes is reduced by univariate analysis, using either Student’s t-test or the AUC, with the genefilter package [17]. In the Methods part, the biosigner [26], GMDH2 [27], omicsMarkeR [28], and DaMiRseq [29] packages are used for the selection of DEGs. In this process, fivefold cross-validation is carried out to validate the models. The process is parallelized with the doParallel package [34] to handle the high computational load. The tool can recognize whether the data type is miRNA or mRNA with the miRNAmeConverter package [35]. The ReactomePA package [36] is used for the pathway analysis of the genes suggested by the models. The multiMiR package [37] is used to identify target genes for miRNA datasets prior to pathway analysis. The ComplexHeatmap [38], igraph [39], venn [40], and graphics [41] packages are used to visualize the genes suggested by the models in the Visualize part. For gene ontology analysis, the topGO [42], miRNAtap [43], and miRNAtap.db [44] packages are used. Gene annotation is provided by the org.Hs.eg.db package [45]. All analysis steps of the GeneSelectML web tool are presented in Fig. 2.
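The n\(\times\)(1+p) input layout described above can be checked before upload with a small Python sketch like the one here. The tab delimiter, the presence of a name for the first (output) column in the header, and the function name are all assumptions made for illustration; the manual of the tool should be consulted for the exact file format.

```python
def parse_count_table(text, sep="\t"):
    """Split an n x (1+p) text table into class labels, gene names, and counts.

    Assumed layout: a header row (output column name, then p gene names)
    followed by one row per sample (class label, then p raw counts).
    """
    lines = [ln for ln in text.splitlines() if ln.strip()]
    header = lines[0].split(sep)
    genes = header[1:]
    labels, counts = [], []
    for ln in lines[1:]:
        fields = ln.split(sep)
        if len(fields) != len(header):
            raise ValueError("row does not have 1 + p fields: " + ln)
        labels.append(fields[0])
        counts.append([int(x) for x in fields[1:]])  # raw counts are integers
    return labels, genes, counts


# Hypothetical toy file with two samples and two genes.
toy = "class\tg1\tg2\nAD\t5\t0\ncontrol\t3\t7\n"
print(parse_count_table(toy))
```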

There are two example datasets available in the web tool, the Alzheimer’s disease data (miRNA) and the Kidney chromophobe data (mRNA), to help users learn how to use the tool. There is also a toy dataset available in .txt format for learning how to upload data. The interface has two panels: the sidebar panel and the main panel. Researchers can specify the arguments of the methods in the sidebar panel. Parameters not set by the user take the defaults of the original algorithms. The results of the specified models are provided in the main panel. After the pre-processing and methods parts are completed, a summary of the process is provided in the summary section under the methods tab. Selected genes are listed by gene and by method in two sub-tabs of the selected genes tab. In this tab, there are two options for continuing to pathway analysis, graphical approaches, and gene ontology analysis: users can choose the genes suggested by at least one method or by at least two methods. After this choice, the results of the pathway analysis can be downloaded via the download link in this tab. The visualize tab offers various graphical approaches, including a number of options for editing plots. Gene ontology analysis is conducted in the GO tab. All results, including tables and plots, can be downloaded in different file formats. A detailed manual is available on the web page of the tool.

Fig. 2 Analysis steps of GeneSelectML web tool

Fig. 3 Uploading GSE46579 dataset to the tool

3 Results

3.1 Implementation of the tool

In this section, we analyze the Alzheimer RNA-seq data to demonstrate the use of this web-based tool. The dataset is uploaded via the Data upload tab (Fig. 3). After uploading, the dataset is pre-processed in the Pre-processing tab in four steps: conventional filtering, normalization, transformation, and filtering with univariate analysis. In the Methods tab, we construct the six models presented in Section 2.2. Selected genes are provided in the Selected genes tab. Users can specify the genes selected by at least one or at least two methods to continue to the Visualize and GO tabs. There are four graphical approaches in the Visualize tab: network plot, heatmap, venn diagram, and box-and-whisker plot. Finally, we perform gene ontology analysis of the DEGs in the GO tab.

3.2 Dataset

We analyze the Alzheimer RNA-seq dataset [13] for this illustration. This dataset includes a cohort of 70 samples (48 Alzheimer’s disease patients and 22 controls) and 503 features (i.e., miRNAs). The data can be found at GEO with accession number GSE46579 [46]. We load the dataset into the tool using the Data upload tab (Fig. 3) before starting the analysis.

3.3 Pre-processing

Dimension reduction is an essential step for improving the performance of methods that identify DEGs. In this part, we use near-zero variance filtering. Then, the data are normalized using median ratio normalization. After normalization, the logarithmic transformation is applied. The number of genes is reduced to 200 using univariate analysis (i.e., Student’s t-test).

3.4 Gene selection and classification performance

We fit the six machine learning models after the pre-processing stage is completed. We report the selected genes in two ways. One is listing the genes by method. The other is reporting a single list of genes (Table 4). Table 4 contains the list of genes, the frequency and percentage of methods suggesting each gene, the regulation status, and the names of the methods suggesting each gene. Twenty-four genes are suggested by at least one method, 11 of which are selected by at least two methods.

The classification performances of the methods are presented in Table 3. The results show that SVM performs better than the other methods on most of the performance measures. DaMiRseq performs best at classifying Alzheimer patients when sensitivity is assessed: the algorithm correctly classifies 100\(\%\) of the persons having Alzheimer’s disease. For SVM, the sensitivity is 0.950; that is, the method correctly classifies 95\(\%\) of the persons having Alzheimer’s disease. GLMNET outperforms the other algorithms in terms of MCC, is among the best in terms of detection rate, and is competitive on most of the other measures.

In the Selected genes tab, users can specify the genes selected by at least one or at least two methods. We select the genes suggested by at least two methods for further analysis.
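The bookkeeping behind this choice (how many methods suggest each gene, and which genes reach the chosen minimum) can be sketched in Python as follows. The function name, the method labels, and the gene names in the example are hypothetical and do not reproduce Table 4.

```python
from collections import defaultdict


def summarize_selection(selected_by_method, min_methods=2):
    """Count how many methods suggest each gene and build the short list.

    selected_by_method: dict mapping method name -> iterable of selected genes.
    Returns (table, short_list); each table row is
    (gene, frequency, percent of methods, sorted method names).
    """
    votes = defaultdict(set)
    for method, genes in selected_by_method.items():
        for gene in set(genes):  # a method counts once per gene
            votes[gene].add(method)
    n_methods = len(selected_by_method)
    table = [
        (gene, len(methods), 100.0 * len(methods) / n_methods, sorted(methods))
        for gene, methods in votes.items()
    ]
    table.sort(key=lambda row: (-row[1], row[0]))  # most frequent genes first
    short_list = [row[0] for row in table if row[1] >= min_methods]
    return table, short_list


# Hypothetical selections from three methods.
selections = {
    "methodA": ["geneX", "geneY"],
    "methodB": ["geneY", "geneZ"],
    "methodC": ["geneY", "geneX"],
}
table, short_list = summarize_selection(selections, min_methods=2)
print(short_list)
```

Setting `min_methods=1` gives the long list (genes suggested by at least one method), while `min_methods=2` gives the short list used for further analysis.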

Table 3 Cross-validation classification performances

3.5 Visualization

This web-based tool offers four graphical approaches: network plot (Fig. 4a), heatmap (Fig. 4b), venn diagram (Fig. 4c), and box-and-whisker plot (Fig. 4d). The network plot shows which selected genes are correlated and whether each correlation is positive or negative. The tool lets users choose the colors for positive and negative correlations. In our case, positive correlations are drawn in blue and negative correlations in red when the magnitude of the correlation is larger than 0.6. Researchers can draw the heatmap of the selected genes with class labels. The tool also provides a venn diagram, which shows the number of genes selected by each method and their intersections. Users can draw a box-and-whisker plot to compare the groups with respect to each of the selected genes.
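The edge rule of the network plot can be illustrated with a short Python sketch. It assumes Pearson correlation, which the text does not specify, and uses hypothetical function and gene names; the 0.6 cutoff and the blue and red colors follow the example above.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equally long lists (non-constant values assumed)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def correlation_edges(expr, cutoff=0.6):
    """Edges between genes whose |correlation| exceeds `cutoff`.

    expr: dict mapping gene name -> list of expression values per sample.
    Returns tuples (gene1, gene2, color, correlation).
    """
    genes = list(expr)
    edges = []
    for i, g1 in enumerate(genes):
        for g2 in genes[i + 1:]:
            r = pearson(expr[g1], expr[g2])
            if abs(r) > cutoff:
                edges.append((g1, g2, "blue" if r > 0 else "red", r))
    return edges


# Hypothetical expression values for four genes in four samples.
expr = {"a": [1, 2, 3, 4], "b": [2, 4, 6, 8], "c": [4, 3, 2, 1], "d": [1, -1, -1, 1]}
for g1, g2, color, r in correlation_edges(expr):
    print(g1, g2, color, round(r, 3))
```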

Fig. 4 Graphical approaches in GeneSelectML web tool. (a) indicates the correlation between selected genes with a magnitude greater than 0.60. Blue color states positive correlation while red color states negative correlation. (b) represents standardized values based on rows. Genes are given in the rows, samples are given in the columns. (c) demonstrates the number of genes selected by algorithms and their intersections. (d) shows the distribution of the expression values by groups for the gene of interest. If Student’s t-test is selected as univariate analysis, p-values are added to the bottom of the plot

3.6 Findings on Alzheimer RNA-seq data

In the Alzheimer study, we analyze 503 genes of 48 AD patients and 22 healthy controls. The number of genes is reduced to 200 after pre-processing. Out of these 200 genes, 11 are found to be differentially expressed by two or more algorithms in our GeneSelectML web tool, and an additional 13 genes are detected as DEGs by one algorithm (Table 4). Out of the 11 DEGs, only three are upregulated. The results highlight the strength of using a tool that incorporates many methods. For instance, using only DaMiRseq would detect 11 genes as significant instead of 24 DEGs. Similarly, using only OmicsMarkeR-GLMNET would miss 15 genes found by other methods. Our web tool is able to list a combination of genes detected by many algorithms in a reasonable time. The computational time for the process, including data upload, filtering, normalization, transformation, univariate analysis, and applying six different machine learning algorithms, is approximately 300 seconds.

By gene ontology analysis, we analyze the biological process of 11 miRNAs which are proposed via at least two algorithms. We find that decreased miRNAs affect positive regulation of phosphorylation, cell cycle, setting macromolecules, regulation of locomotion, and increased miRNAs affect positive regulation of nucleobase, chromosome organization processes.

The miRNA proposed by four different machine learning algorithms is hsa-miR-628-3p. It has been shown that hsa-miR-628-3p is related to many cancers and that it promotes apoptosis in lung cancer cell cultures [47]. Similar to the results of the current study, a previous study, which analyzed more than 1200 miRNAs in AD temporal cortex, showed that the expression levels of hsa-miR-628-3p, hsa-miR-1234, hsa-miR-144, and hsa-miR-148b were decreased in AD samples [48].

In the reference study from which the dataset of the current study is obtained, the authors selected 12 differentially expressed miRNAs [46]. Four of these 12 miRNAs, specifically miR-151a-3p, brain-miR-112, let-7f-5p, and hsa-miR-1285-5p, are found to be differentially expressed in the long list of our current study. Moreover, Satoh et al. [49] also analyzed the same miRNA dataset using the omiRas web tool. They identified 27 differentially expressed miRNAs [49]. Seven of the 11 miRNAs proposed by at least two algorithms in our study were also identified as differentially expressed miRNAs in the study of Satoh et al. These common genes include hsa-let-7a-5p, hsa-let-7g-5p, hsa-miR-144-5p, hsa-miR-151a-3p, hsa-let-7f-5p, hsa-miR-148a-3p, and hsa-miR-148b-5p. In fact, all of the differentially expressed miRNAs in our short list, except one, were also found significant by at least one of the references [46, 48,49,50,51]. The only exception is that our tool also detects hsa-miR-148a-3p.

Table 4 Suggested genes by web tool

3.7 A case study based on Alzheimer’s disease dataset

A case study is conducted to reveal the capacity of the web tool for suggesting DEGs. For this purpose, a dataset based on the Alzheimer’s disease dataset is simulated from the negative binomial distribution using the ssizeRNA package [52] in R. The number of genes is taken as 503 and the number of observations as 70. Two hundred genes remain after near-zero variance filtering and filtering with Student’s t-test. A response variable consisting of treatments and controls is generated to obtain a binary outcome. The proportion of treatments is taken as 0.686 (48/70). As the distribution parameters, a mean vector and a dispersion vector are specified based on the Alzheimer’s disease dataset. That is, the mean vector is obtained as the arithmetic mean of each gene in the control group of the Alzheimer’s disease data. The dispersion parameter is set to 0.1 for each gene. Ten of the genes are simulated to differ significantly between the groups. Our tool proposes 17 genes in the long list, 10 of which are placed in the short list. All genes in the short list are genes that are simulated to differ significantly between the two groups. That is, all 10 significant genes are suggested by at least two methods. Thus, the tool suggests 100% of the DEGs with at least two methods and 7 additional genes with a single method.

3.8 Implementation on KICH dataset

The Kidney Chromophobe (KICH) dataset is used to demonstrate the validity of the GeneSelectML web tool on a different dataset. This dataset is obtained via the TCGAbiolinks R/Bioconductor package [53] and includes mRNAs from 66 tumor samples and 25 matched-normal samples. The number of genes is 19,947. Near-zero variance filtering, median ratio normalization, and logarithmic transformation are applied to the data, in that order. Student’s t-test is performed as the univariate analysis. The number of genes decreases to 200 after pre-processing. Of the remaining 200 genes, 10 are found to be differentially expressed by two or more algorithms in our GeneSelectML web tool, and all of them are downregulated. Additionally, 14 genes are selected as DEGs by one method. Zhang et al. [54] analyzed the same dataset in their study and displayed the top 100 DEGs. Eight of the 10 genes proposed by at least two algorithms in our study were also identified as DEGs in the study of Zhang et al. These common genes are RALYL, IRX1, UGT2A3, UGT3A1, UPK1B, DACH2, SLC9A3, and UNCX. One of the remaining two genes, UMOD, was found significant in the references [55,56,57]. MYH8 is the only gene in our short list that is not supported by these references.

4 Conclusion

Identifying DEGs is a crucial step in exploring the causes of diseases. Compared with univariate analysis, modelling the data while considering the relationships among genes improves prediction performance. However, studies analyzing the same dataset can reach critically different results because they use different methods.

In this study, the objective is to minimize the risks arising from the choice of method. GeneSelectML is a web-based platform that brings various gene selection algorithms together for RNA-seq data. All steps can be conducted using separate R packages, but the process might be cumbersome and time-consuming for researchers who are inexperienced in the R programming language.

GeneSelectML is a user-friendly, comprehensive, and freely available tool for gene selection through machine learning algorithms that can handle computationally intensive analyses. Currently, the GeneSelectML tool involves six machine learning algorithms for gene selection. These are the Biosigner, GMDH, OmicsMarkeR-GLMNET, OmicsMarkeR-SVM, OmicsMarkeR-RF, and DaMiRseq algorithms. The tool also offers users easy-to-use pre-processing steps: filtering, normalization, transformation, and univariate analysis. Moreover, it has a user-friendly interface for graphical approaches: network plot, heatmap, venn diagram, and box-and-whisker plot. Gene ontology analysis is also provided for the selected genes.

In this study, we apply the aforementioned machine learning algorithms to the GSE46579 dataset to explore the features for Alzheimer’s disease as well as to show the implementation of the tool. Eleven features are found to be differentially expressed by at least two methods. One of these features, hsa-miR-148a-3p, might be considered as a new biomarker for Alzheimer’s disease diagnosis. This finding needs clinical assessment and verification. The KICH dataset is also used to demonstrate the validity of the GeneSelectML web tool on a different dataset.

GeneSelectML will be periodically updated as the R packages are updated and the novel approaches are developed. This tool is freely available at www.softmed.hacettepe.edu.tr/GeneSelectML.