
1 Background

DNA microarray datasets are characterized by high dimensionality: thousands of gene features and only a limited number of observations. For this reason, it becomes necessary to reduce the dimensionality of the dataset in order to decrease the size of the data matrix and to make the classification task easier and faster.

One form of dimensionality reduction is feature subset selection, an essential step in classification.

So, in order to classify a cancer dataset, we need to select the relevant features that best represent it. To do this, we apply a filter selection method to the original dataset, and then use the selected subset of features to classify it. The classification accuracy obtained by the classifier reflects the quality of the selected subset.

In this paper, we suggest using k-means clustering not only as a clustering algorithm but also as part of the feature selection process. We hybridize a filter selection method (signal-to-noise ratio (SNR), Correlation Coefficient (CC), ReliefF or Mutual Information (MI)) with k-means clustering. To compare these feature selection methods, the dimensionality reduction is evaluated with three supervised classifiers: k nearest neighbors (KNN), Support Vector Machine (SVM) and Linear Discriminant Analysis (LDA).

The goal of this hybridization is to improve classification performance and to accelerate the search to identify important feature subsets.

2 Related Works

Feature selection methods have become the focus of much research in application areas where datasets with thousands of features are available. Some of the methods used in the field of feature selection are:

  • Fisher, T-statistics, Signal to noise ratio and ReliefF selection methods [1].

  • The use of a two-step neural network classifier [2].

  • The (BW) discriminant score proposed by [3], based on the ratio of between-class dispersion to within-class dispersion.

  • A hybridization between Genetic Algorithm (GA) and Max-relevance, Min-Redundancy (MRMR) [4].

3 Materials and Methods

We used different feature selection methods and classifiers for cancer classification.

In the first step, we downloaded the dataset of each cancer, composed of thousands of features. In the second step, we reduced the number of features to only the relevant ones using feature subset selection. In the final step, we classified the datasets.
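As an illustration of this workflow, the following Python/NumPy sketch walks through the three steps on a generic expression matrix. It is only a minimal illustration (the paper's own experiments were run in MATLAB): the 70/30 split, the correlation-based filter score and the choice of KNN with k = 3 are assumptions made for the example, not parameters taken from the paper.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

def run_pipeline(X, y, n_genes=50):
    """Three-step workflow: X is a (samples x genes) matrix, y the class labels."""
    # Step 1: the dataset is assumed to be already loaded into X and y.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, stratify=y, random_state=0)

    # Step 2: rank genes with a filter score computed on the training data only
    # (absolute Pearson correlation with the label is used here as a stand-in for
    # SNR, CC, ReliefF or MI) and keep the n_genes best ones.
    Xc = X_train - X_train.mean(axis=0)
    yc = y_train - y_train.mean()
    scores = np.abs(Xc.T @ yc) / (np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc) + 1e-12)
    selected = np.argsort(scores)[::-1][:n_genes]

    # Step 3: classify using only the selected genes and report test accuracy.
    clf = KNeighborsClassifier(n_neighbors=3).fit(X_train[:, selected], y_train)
    return clf.score(X_test[:, selected], y_test)
```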

3.1 Dataset Description

In this paper, we studied the effect of feature selection methods on five commonly used gene expression datasets: leukemia, colon, prostate, lung and lymphoma cancer (Table 1).

Table 1. Datasets and parameters used for experiments
  • Leukemia is composed of 7129 features and 72 samples. It contains two classes: acute lymphocytic leukemia (ALL) and acute myelogenous leukemia (AML). It can be downloaded from the website (Footnote 1).

  • Colon cancer is composed of 6500 features and 62 samples. It contains two classes: Tumor and Not tumor. It can be downloaded from the website (Footnote 2).

  • Prostate cancer is composed of 12600 features and 101 samples. It contains two classes: Tumor and Not tumor. It can be downloaded from the website (Footnote 3).

  • Lung cancer is composed of 12533 features and 181 samples. It contains two classes: malignant pleural mesothelioma (MPM) and adenocarcinoma (ADCA). It can be downloaded from the website (Footnote 4).

  • Lymphoma cancer is composed of 7070 genes and 77 samples. It contains two classes: diffuse large B-cell lymphoma (DLBCL) and follicular lymphoma (FL). It is available to the public at the website (Footnote 5).

3.2 Feature Subset Selection

Feature selection is the process of selecting a subset of relevant features for model construction (Fig. 1).

Fig. 1. Feature subset selection

The main idea behind applying a feature selection method is that the dataset contains many features that are either redundant or irrelevant and can consequently be removed without significant loss of information.

A feature selection algorithm can be considered as the combination of a search technique for proposing new feature subsets, along with an evaluation measure which scores the different feature subsets. There are three main categories of feature selection algorithms: wrappers, filters and embedded methods [10].

  • Wrapper methods use a predictive model to score feature subsets.

  • Filter methods use a proxy measure instead of the error rate to score a feature subset.

  • Embedded methods are a catchall group of techniques which perform feature selection as part of the model construction process.

In this paper, we are interested in filter methods, which estimate a weight (score) for each feature (gene); the features are then ordered by this score and the most relevant descriptors are selected.

The methods used in this work are the Signal to Noise Ratio (SNR), Correlation Coefficient (CC), ReliefF, Mutual Information (MI) and k-means clustering.
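Whatever the filter criterion, the selection step itself is the same: the genes are ordered by their score and the m highest-scoring ones are kept. A minimal sketch (the cut-off m is a user-chosen parameter, not a value taken from the paper):

```python
import numpy as np

def select_top_features(scores, m):
    """Rank features by their filter score (higher = more relevant) and keep the best m."""
    return np.argsort(scores)[::-1][:m]
```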

The signal to noise ratio

The signal-to-noise ratio, also called the S/R test, identifies relevant features by computing the S/R score of each gene g [11].

This score was proposed by [5] and expressed as follows:

$$ \mathrm{S/R}(g) = \frac{M_{1g} - M_{2g}}{S_{1g} + S_{2g}} $$
(1)

Where M_{kg} and S_{kg} denote the mean and the standard deviation of feature g over the samples of class k (k = 1, 2).
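A small NumPy sketch of Eq. (1), assuming a two-class, samples-by-genes matrix; ranking by the absolute value of the score and the small constant added to the denominator are implementation choices of the example, not details stated in the paper.

```python
import numpy as np

def snr_scores(X, y):
    """Signal-to-noise ratio of Eq. (1) for every gene.
    X: (samples x genes) expression matrix, y: labels of a two-class problem."""
    c1, c2 = np.unique(y)
    X1, X2 = X[y == c1], X[y == c2]
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    s1, s2 = X1.std(axis=0, ddof=1), X2.std(axis=0, ddof=1)
    return (m1 - m2) / (s1 + s2 + 1e-12)   # small constant guards against zero variance

def top_snr_genes(X, y, m):
    # Genes with the largest |S/R| are treated as the most relevant ones.
    return np.argsort(np.abs(snr_scores(X, y)))[::-1][:m]
```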

ReliefF

This algorithm was first presented as Relief [12] and later developed and adjusted to the multi-class case by Kononenko as ReliefF [13].

This criterion measures the ability of each feature to group data of the same class and to discriminate between data of different classes. The algorithm is described as follows:

  • Initialize the score (or the Weight) wd = 0, d = 1,…, D

  • For t = 1, …, m (the number of randomly sampled instances)

  • Pick randomly an instance xi

  • Find the k nearest neighbors to xi having the same class (hits)

  • Find the k nearest neighbors to xi having a different class c (misses)

  • For each feature d, update the weight:

    $$ W_d = W_d - \sum\nolimits_{j=1}^{k} \frac{\mathrm{diff}(x_i, d, \mathrm{hits}_j)}{m \cdot k} + \sum\nolimits_{c \neq \mathrm{class}(x_i)} \frac{P(c)}{1 - P(\mathrm{class}(x_i))} \sum\nolimits_{j=1}^{k} \frac{\mathrm{diff}(x_i, d, \mathrm{misses}_j)}{m \cdot k} $$
    (2)

    The distance used is defined by:

    $$ \mathrm{diff}(x_i, d, x_j) = \frac{\left| x_{id} - x_{jd} \right|}{\max(d) - \min(d)} $$
    (3)

    max(d) (resp. min(d)) is the maximum (resp. minimum) value that the feature with index d takes over the data set; x_{id} is the value of the d-th feature of the instance x_i.

This method does not eliminate redundancy, but it defines a relevance criterion.
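The following sketch implements the weight update of Eq. (2) together with the diff function of Eq. (3). It is a simplified illustration: the min-max scaling of the features, the Manhattan distance used for the neighbour search and the default values of n_iter and k are assumptions of the example, not choices documented in the paper.

```python
import numpy as np

def relieff(X, y, n_iter=100, k=10, seed=0):
    """Simplified ReliefF weights following Eq. (2).
    X: (samples x genes), y: class labels. Features are min-max scaled so that
    diff() of Eq. (3) reduces to the absolute difference of scaled values."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    span = X.max(axis=0) - X.min(axis=0) + 1e-12
    Xn = (X - X.min(axis=0)) / span                   # diff(x_i, d, x_j) = |x_id - x_jd|
    classes, counts = np.unique(y, return_counts=True)
    prior = dict(zip(classes, counts / n))            # class priors P(c)
    w = np.zeros(d)
    for _ in range(n_iter):
        i = rng.integers(n)
        dist = np.abs(Xn - Xn[i]).sum(axis=1)         # Manhattan distance to x_i
        dist[i] = np.inf                              # exclude x_i itself
        same = np.where(y == y[i])[0]
        hits = same[np.argsort(dist[same])][:k]
        w -= np.abs(Xn[hits] - Xn[i]).mean(axis=0) / n_iter
        for c in classes:
            if c == y[i]:
                continue
            other = np.where(y == c)[0]
            misses = other[np.argsort(dist[other])][:k]
            coef = prior[c] / (1 - prior[y[i]])
            w += coef * np.abs(Xn[misses] - Xn[i]).mean(axis=0) / n_iter
    return w
```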

Correlation Coefficient

Correlation coefficients measure the strength of association between two features. The Pearson correlation coefficient measures the strength of the linear association between features [14].

Let \( S_x \) and \( S_y \) be the standard deviations of two random features X and Y respectively. Then the Pearson product-moment correlation coefficient between the features is:

$$ \rho_{x,y} = \frac{\mathrm{cov}(X, Y)}{S_x S_y} = \frac{E\big( (X - E(X))(Y - E(Y)) \big)}{S_x S_y} $$
(4)

Where cov(.) means covariance and E(.) denotes the expected value of the feature.
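One common way to use Eq. (4) as a filter is to correlate each gene with the (numeric) class label and rank genes by the absolute value of the coefficient; this pairing of X = gene and Y = label is an assumption of the sketch below, not a detail spelled out in the text.

```python
import numpy as np

def pearson_scores(X, y):
    """|Pearson correlation| (Eq. 4) between every gene and the class label.
    X: (samples x genes) expression matrix, y: numeric labels (e.g. 0/1)."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    cov = Xc.T @ yc / (len(y) - 1)        # cov(gene, label)
    sx = X.std(axis=0, ddof=1)
    sy = y.std(ddof=1)
    return np.abs(cov / (sx * sy + 1e-12))
```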

Mutual Information

Let us consider a random feature G that can take n values over several measurements. We can empirically estimate the probabilities P(G_1), …, P(G_n) for each state G_1, …, G_n of the feature. Shannon's entropy [15] of the feature is then defined as:

$$ H(G) = - \sum\nolimits_{i=1}^{N_G} P(G_i) \log P(G_i) $$
(5)

The mutual information measures the dependence between two features. In the context of gene selection, we use this measure to identify genes that are related to the class C. The mutual information between C and one gene G is measured by the following expression:

$$ \text{MI}\left( {\text{G, C}} \right) = \text{H}\left( \text{G} \right) + \text{H}\left( \text{C} \right) - \text{H}\left( {\text{G, C}} \right) $$
(6)
$$ H(G, C) = - \sum\nolimits_{i=1}^{N_G} \sum\nolimits_{j=1}^{N_C} P(i, j) \log P(i, j) $$
(7)
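A minimal sketch of Eqs. (5)-(7) for one gene. Expression values are continuous, so the gene is first discretized; the equal-width binning with n_bins = 10 and the base-2 logarithm are assumptions of the example, not choices documented in the paper.

```python
import numpy as np

def mutual_information(g, c, n_bins=10):
    """MI(G, C) = H(G) + H(C) - H(G, C), Eqs. (5)-(7).
    g: expression values of one gene (continuous), c: integer class labels (0, 1, ...)."""
    # Discretize the gene into equal-width bins so that its entropy can be estimated.
    edges = np.histogram_bin_edges(g, bins=n_bins)[1:-1]
    g_disc = np.digitize(g, edges)

    def entropy(labels):
        p = np.bincount(labels) / len(labels)
        p = p[p > 0]
        return -np.sum(p * np.log2(p))

    # Encode each (gene bin, class) pair as a single integer to estimate the joint entropy.
    joint = g_disc * (c.max() + 1) + c
    return entropy(g_disc) + entropy(c) - entropy(joint)
```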

Cluster analysis

Cluster analysis, or clustering, is the task of grouping a set of objects in such a way that objects in the same group are more similar to each other than to those in other groups. In our setting, the k-means algorithm is used to divide the input set of features into k groups (clusters) and returns, for each feature, the index of the cluster to which it has been assigned.

K-means algorithm is described as follows:

Given an initial set of k means m1(1),…,mk(1), the algorithm proceeds by alternating between two steps [16]:

  • Assignment step: assign each feature to the cluster whose mean yields the least within-cluster sum of squares. Since the sum of squares is the squared Euclidean distance, this is intuitively the "nearest" mean.

$$ S_i^{(t)} = \left\{ x_p : \left\| x_p - m_i^{(t)} \right\|^2 \le \left\| x_p - m_j^{(t)} \right\|^2,\ 1 \le j \le k \right\} $$
(8)

Where each x_p is assigned to exactly one cluster S_i^{(t)}, even if it could be assigned to two or more of them.

  • Update step: Calculate the new means to be the centroids of the features in the new clusters.

$$ m_i^{(t+1)} = \frac{1}{\left| S_i^{(t)} \right|} \sum\nolimits_{x_j \in S_i^{(t)}} x_j $$
(9)
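The two alternating steps translate directly into code. The sketch below is a plain NumPy implementation of Eqs. (8) and (9); the random initialization from k data points and the fixed iteration cap are choices made for the example.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Plain k-means: assignment step (Eq. 8) followed by update step (Eq. 9).
    X: (n_points x n_dims) matrix; in this paper's setting each point would be
    one gene's expression profile across the samples."""
    rng = np.random.default_rng(seed)
    means = X[rng.choice(len(X), size=k, replace=False)]   # initial means m_1, ..., m_k
    for _ in range(n_iter):
        # Assignment step: each point joins the cluster with the nearest mean.
        dists = ((X[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each mean becomes the centroid of its assigned points.
        new_means = np.array([X[labels == i].mean(axis=0) if np.any(labels == i)
                              else means[i] for i in range(k)])
        if np.allclose(new_means, means):
            break
        means = new_means
    return labels, means
```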

3.3 Classification

To compare all feature selection methods, an evaluation of the dimensionality reduction was done using a supervised classification of the five cancer datasets.

Supervised classification is the process of assigning data (more generally, a set of objects) to predefined classes, so that objects in the same class are closer to each other than to objects of other classes.

To study the performance of the feature selection methods, we used three classifiers: K nearest neighbors (KNN), Support Vector Machines (SVM) and Linear Discriminant Analysis (LDA).

K Nearest Neighbors

K nearest neighbors is a classifier that stores the training samples and classifies test samples based on a similarity measure.

In K nearest neighbors, we find the K training samples most similar to a given test sample (its nearest neighbors) and predict the class of the sample from the classes of the selected neighbors.

We can compute the Euclidean distance between two samples by using a distance function DE(X, Y), where X, Y are samples composed of N features, such that \( \text{X} = \left\{ {\text{X}_{\text{1}}, \ldots, \text{X}_{\text{N}} } \right\}, \text{Y} = \left\{ {\text{Y}_{\text{1}}, \ldots, \text{Y}_{\text{N}} } \right\} \).

$$ D_E(X, Y) = \sqrt{ \sum\nolimits_{i=1}^{N} \left( X_i - Y_i \right)^2 } $$
(10)
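A minimal sketch of the KNN rule built on the Euclidean distance of Eq. (10); the majority vote and the default k = 3 are standard choices assumed for the example.

```python
import numpy as np

def knn_predict(X_train, y_train, x_test, k=3):
    """Classify one test sample by majority vote among its k nearest training samples,
    using the Euclidean distance D_E of Eq. (10)."""
    d = np.sqrt(((X_train - x_test) ** 2).sum(axis=1))   # distance to every training sample
    neighbors = y_train[np.argsort(d)[:k]]                # labels of the k closest samples
    labels, counts = np.unique(neighbors, return_counts=True)
    return labels[counts.argmax()]
```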

Support Vector Machines (SVM)

Support vector machines are supervised learning models used for classification [17]. They are based on two key concepts: the notion of maximum margin and the use of kernel functions.

Linear Discriminant Analysis (LDA)

Linear Discriminant Analysis is an algorithm used in machine learning to find a linear combination of features that characterizes or separates two or more classes of objects [18].

To evaluate the performance of the classifiers, we measure the classification accuracy [19]:

$$ \mathrm{Accuracy} = 100 \times \frac{TP + TN}{TP + TN + FP + FN} $$
(11)

Where TP is true positive for correct prediction to disease class, TN is true negative for correct prediction to normal class, FP is false positive for incorrect prediction to disease class, and FN is false negative for incorrect prediction to normal class.
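As an illustration of the evaluation, the scikit-learn sketch below fits the three classifiers on one train/test split and reports the accuracy of Eq. (11). It is only a sketch: the paper's own runs were done in MATLAB, and the hyper-parameters here (k = 3 neighbours, linear SVM kernel) are assumptions of the example.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def evaluate(X_train, y_train, X_test, y_test):
    """Accuracy (Eq. 11) of the three classifiers on one train/test split."""
    classifiers = {
        "KNN": KNeighborsClassifier(n_neighbors=3),
        "SVM": SVC(kernel="linear"),
        "LDA": LinearDiscriminantAnalysis(),
    }
    results = {}
    for name, clf in classifiers.items():
        pred = clf.fit(X_train, y_train).predict(X_test)
        # Accuracy = 100 * (TP + TN) / (TP + TN + FP + FN), i.e. the percentage of correct predictions.
        results[name] = 100.0 * np.mean(pred == y_test)
    return results
```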

All the algorithms used in this paper were implemented and run in MATLAB.

4 Results

In this section, we report the results of an experimental study of the effect of the k-means clustering on five commonly used gene expression datasets.

Each dataset is characterized by a group of features; these features are the genes.

After dividing each initial dataset into training data and test data, we applied a subset selection method to the training data to select the most relevant features. This subset is then used to classify the dataset with a classifier (KNN, SVM or LDA). The test data is used to assess the performance of the selection methods and classifiers.

To improve the performance of the selection methods, we add a clustering step before selection. We divide the training data into clusters and then select the relevant features within each cluster. The obtained subset contains the most relevant features of the dataset.
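The sketch below shows one possible reading of this clustering-plus-selection step, in which the genes are clustered with k-means on the training data and the best-scoring gene(s) of each cluster are retained. The per-cluster selection rule, the number of clusters and the number of genes kept per cluster are hypothetical choices made for the illustration, not parameters reported by the paper.

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_then_select(X_train, y_train, score_fn, n_clusters=10, genes_per_cluster=1):
    """Cluster the genes with k-means, then keep the highest-scoring gene(s) of each cluster.
    score_fn(X, y) returns one relevance score per gene (e.g. SNR, CC, ReliefF or MI)."""
    scores = score_fn(X_train, y_train)
    # Cluster genes by their expression profiles across the training samples.
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(X_train.T)
    selected = []
    for c in range(n_clusters):
        in_cluster = np.where(labels == c)[0]
        best = in_cluster[np.argsort(scores[in_cluster])[::-1][:genes_per_cluster]]
        selected.extend(best.tolist())
    return np.array(selected)
```

With score_fn set to an SNR-like scoring function, for example, the returned gene indices would define the reduced training and test matrices fed to the classifiers.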

Tables 2, 3, 4, 5 and 6 compare the classification accuracy obtained (for leukemia, colon, prostate, lung and lymphoma cancer, respectively) before and after adding the k-means clustering step to the selection process.

Table 2. Performance of comparison for proposed classifiers (leukemia cancer)
Table 3. Performance of comparison for proposed classifiers (colon cancer)
Table 4. Performance of comparison for proposed classifiers (prostate cancer)
Table 5. Performance of comparison for proposed classifiers (lung cancer)
Table 6. Performance of comparison for proposed classifiers (lymphoma cancer)

We can clearly observe the advantage of adding the clustering step to the feature selection process: it increases the accuracy of the four selection methods investigated in this paper.

From the results in Tables 2, 3, 4, 5 and 6, we observe that the k-means clustering step achieves a substantial reduction in feature set size while maintaining better accuracy than before the clustering step, on the chosen high-dimensional gene datasets.

5 Conclusion and Discussion

We have shown in this paper that feature selection methods can be applied successfully to a classification situation, using only a limited number of training samples in a high dimensional space of thousands of features.

We performed several studies on leukemia, colon, prostate, lung and lymphoma cancer datasets. The objective was to classify datasets of each cancer into two classes.

The experimental results show that the proposed method has efficient searching strategies and is capable of producing a good classification accuracy with a small and limited number of features simultaneously.

The best result obtained for leukemia cancer is an accuracy of 100% for only 5 genes. For Colon cancer, we obtain 95% for only 6 genes. For prostate cancer, we obtain 90% for 1 gene. For lung cancer, we obtain 100% for 3 genes. For lymphoma cancer, we obtain 100% for 3 genes.

These results support adding a clustering step before feature selection: it increases classification accuracy and decreases the number of selected features.