
1 Introduction

With the advent of microarray technology, simultaneous profiling of thousands of genes across multiple samples in a single experiment became possible. Microarray technology generates huge amounts of gene expression data whose analysis is computationally challenging. Moreover, it has been observed that gene expression datasets have a large number of uninformative and redundant features, which increases the complexity of classification algorithms [1, 11–13].

To circumvent these problems, many feature selection techniques have been proposed. The purpose of feature selection is to extract relevant features from the observed data, which improves the results of machine learning models. Compared with dimensionality reduction techniques like Principal Component Analysis (PCA) and Linear Discriminant Analysis (LDA), feature selection algorithms only select a relevant subset of features instead of altering the original features. The selected relevant genes, also known as “biomarkers”, find application in medicine for the discovery of new diseases, the development of new pharmaceuticals [2, 12], etc. Existing work on feature subset selection from gene expression data can be categorized into (i) classification based and (ii) clustering based approaches. Classification based approaches like ReliefF [3] and Correlation based Feature Selection [4] rank features based on their intrinsic properties and select a subset of top ranked features. Clustering based approaches like k-medoids [5] and affinity propagation [6] cluster the features based on their similarity with each other; the feature set is then reduced to the representatives of the clusters.

Affinity propagation, proposed by Frey and Dueck [6], identifies a subset of features that can represent the dataset when applied to feature subset selection. Such features are called exemplars. Affinity propagation takes the similarity between features as input. Instead of the number of clusters, a real number called the preference value of each feature is also passed as input. The number of identified exemplars is influenced by the preference value: the larger the preference value, the more clusters are formed. Features exchange real-valued messages until clusters with their representative exemplars emerge. Affinity propagation has found application in the machine learning community, computational biology and computer vision. However, the aforementioned approach gives a uniform preference to all features, and messages are exchanged iteratively between features irrespective of their capability to differentiate samples of one class from samples of other classes.

In this paper, we propose a Class Aware Exemplar Discovery (CAED) algorithm, which calculates a class wise ranking of all features and incorporates this information while assigning preference values to features. The features are clustered by iteratively exchanging two types of messages, viz. responsibility and availability. The messages are exchanged in a way that features ranked higher in the class wise ranking are favored over features ranked lower, which leads to better exemplar discovery.

We evaluated the effectiveness of our approach by conducting experiments on 18 publicly available microarray gene expression datasets, recording classification accuracy as the performance metric. Compared to affinity propagation, classification accuracy improved for 16, 17 and 13 datasets with the Support Vector Machine, Naive Bayes and C4.5 Decision Tree classifiers respectively.

2 Overview of Our Approach

The workflow of our approach is shown in Fig. 1. The gene expression matrix is transformed into a similarity matrix using a distance measure. The diagonal values of the similarity matrix, also called preferences, are updated using a class aware ranking of features. The updated matrix is used for class aware clustering. The representative of each cluster, called an exemplar, is extracted, and the reduced set of features is evaluated with classifiers.

Fig. 1. Workflow of Class Aware Exemplar Discovery (CAED)

2.1 Gene Data

We selected 18 publicly available microarray datasets (available at http://faculty.iitr.ac.in/~patelfec/index.html) for the experimental evaluation of the Class Aware Exemplar Discovery algorithm. Microarray datasets are gene expression matrices in which the expression value of each gene is measured over different samples. Generally, the total number of samples is very small compared to the number of features.

2.2 Gene-Gene Similarity

For a gene expression matrix \( D_{n \times m} \), with n features and m samples, the similarity between every pair of features is calculated and stored in a similarity matrix \( S_{n \times n} \). The similarity \( s(i,k) \) indicates how well the feature with index k is suited to be the exemplar for feature i. The aim is to maximize the similarity, so we take the negative of the distance between features. We used the negative of the Euclidean distance as the similarity measure for experimental evaluation.
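
As an illustration, a minimal sketch of this computation is given below; it assumes the expression matrix is stored as a NumPy array D with one row per feature, and the function name is ours.

```python
import numpy as np

def similarity_matrix(D):
    """Pairwise similarity between features (rows of D), defined as the
    negative Euclidean distance so that larger values mean greater similarity."""
    # ||x - y||^2 = ||x||^2 + ||y||^2 - 2 x.y, evaluated for all pairs at once
    sq_norms = np.sum(D ** 2, axis=1)
    sq_dist = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (D @ D.T)
    sq_dist = np.maximum(sq_dist, 0.0)   # guard against small negatives from round-off
    return -np.sqrt(sq_dist)             # s(i, k) = -||D[i] - D[k]||
```

Here S = similarity_matrix(D) yields an n × n matrix whose entry S[i, k] plays the role of \( s(i,k) \).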

2.3 Class Aware Preference

The affinity propagation algorithm takes as input a similarity matrix and a real number called the preference of each feature. For a similarity matrix \( S_{n \times n} \), the value \( s(i,i) \), \( i \in \{1,2,\ldots,n\} \), is the preference value. These values are called preferences since a feature \( i \) with a larger value of \( s(i,i) \) is more likely to be chosen as an exemplar. The number of identified exemplars (number of clusters) is influenced by the input preferences: the larger the value, the more clusters are formed. The preference values can be uniform or non-uniform. If all features are equally suitable as exemplars, the preference is set to a common value. The preference value can be set to any number in the range \( \min_{j \, s.t. \, j \ne i} s(i,j) \) to \( \max_{j \, s.t. \, j \ne i} s(i,j) \), where \( i, j \in \{1,2,\ldots,n\} \).

We propose assigning the preference value of a feature based on its ability to distinguish samples of one class from those of the other classes. To this end, we perform a class wise ranking of features using the p-metric [1] with a one versus all strategy. The top 0.015 % of features from each class are selected and assigned a preference value of zero, thereby increasing their probability of being chosen as exemplars.
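
The exact definition of the p-metric appears in [1]; purely as an illustration, the sketch below ranks features one versus all with a signal-to-noise style score (absolute difference of class means divided by the sum of class standard deviations), where rank 1 denotes the most discriminative feature. The score formula and all names are assumptions, not taken from the text.

```python
import numpy as np

def class_wise_ranks(D, y):
    """Rank every feature (row of D) for every class, one versus all.
    Returns an (n_features, n_classes) array of ranks; rank 1 = most discriminative.
    Assumed score: |mean_in - mean_out| / (std_in + std_out)."""
    classes = np.unique(y)
    ranks = np.zeros((D.shape[0], classes.size), dtype=int)
    for j, c in enumerate(classes):
        in_c, out_c = D[:, y == c], D[:, y != c]
        score = np.abs(in_c.mean(axis=1) - out_c.mean(axis=1)) / (
            in_c.std(axis=1) + out_c.std(axis=1) + 1e-12)   # small constant avoids division by zero
        order = np.argsort(-score)                          # most discriminative feature first
        ranks[order, j] = np.arange(1, D.shape[0] + 1)
    return ranks
```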

Figure 2 depicts the class wise ranking of features, where each feature is assigned multiple ranks, one for each class. The high ranked features of each class are highlighted.

Fig. 2. Depiction of class wise ranking of features

The other features are assigned a uniform preference value, namely \( \mathrm{median}_{j \, s.t. \, j \ne i}\, s(i,j) \), where \( i, j \in \{1,2,\ldots,n\} \). Figure 3 shows how the accuracy of the classifiers, namely support vector machine (SVM), Naïve Bayes (NB) and C4.5 decision tree (DT), varies as the count of features assigned the high preference value is changed.

Fig. 3. Change in accuracy of classifiers by changing the percentage of high preference features on 2 datasets

We observed that selecting more than 0.015 % of features from each class yielded no further improvement in the classification accuracy of the classifiers. A sketch of the resulting preference assignment is given below.
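
The sketch reuses the similarity and ranking sketches above; the 0.015 % fraction comes from the text, while the function and parameter names are ours.

```python
import numpy as np

def class_aware_preferences(S, ranks, top_frac=0.00015):
    """Fill the diagonal of the similarity matrix S with class aware preferences.
    Features in the top `top_frac` of any class wise ranking get preference 0
    (the largest possible value, since similarities are negative distances);
    every other feature gets the median of its off-diagonal similarities."""
    n = S.shape[0]
    S = S.copy()
    for i in range(n):
        S[i, i] = np.median(np.delete(S[i], i))   # uniform preference: median over j != i of s(i, j)
    k = max(1, int(round(top_frac * n)))          # count of high preference features per class
    for c in range(ranks.shape[1]):
        top = np.argsort(ranks[:, c])[:k]         # smallest rank = most discriminative for class c
        S[top, top] = 0.0                         # raise their chance of being chosen as exemplars
    return S
```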

2.4 Class Aware Message Passing

The similarity matrix with class aware preference values is passed to affinity propagation for exemplar discovery. Affinity propagation is a message passing based clustering algorithm that iteratively transmits real-valued messages among features until a good set of clusters with their representative exemplars emerges. Two kinds of messages are exchanged:

The responsibility message, denoted \( r(i,k) \), is sent from feature \( i \) to candidate exemplar point \( k \). It reflects the accumulated evidence for how well suited feature \( k \) is to serve as the exemplar for feature \( i \).

The availability message, denoted \( a(i,k) \), is sent from candidate exemplar point \( k \) to feature \( i \). It reflects the accumulated evidence for how appropriate it would be for feature \( i \) to choose feature \( k \) as its exemplar.

We propose a class aware message passing algorithm in which the strength of the messages exchanged between two features depends on their ability to discriminate samples among different classes.

Each iteration of affinity propagation has three steps:

  • Step 1. Calculating responsibilities r(i, k) given the availabilities

  • Step 2. Updating all availabilities a(i, k) given the responsibilities

  • Step 3. Combining the responsibilities and availabilities to generate clusters

Calculating Responsibilities r(i,k) Given the Availabilities:

For a dataset \( D_{n \times m} \) with n features, m samples and p classes, we calculate the class wise rank of each feature using the p-metric. The class wise rank of feature i is denoted \( R_{i} = \{ C_{1}, C_{2}, C_{3}, \ldots, C_{p} \} \), where \( i \in \{1,2,\ldots,n\} \) and \( C_{1} \) is the rank of feature i for class 1. To calculate \( r(i,k) \) we evaluate \( R_{i} \) and \( R_{k} \). If the rank of feature i for class \( j \) is at most \( \frac{n}{2} \), i.e. \( C_{j} \le \frac{n}{2} \), the feature lies in the upper half of the ranking for that class and we denote it as H; otherwise it lies in the lower half and is denoted as L. Hence, the ranking of feature \( i \) might become, for example, \( R_{i} = \{ H,H,L, \ldots, L\} \), and similarly the ranking of feature \( k \) might become \( R_{k} = \{ L,H,L, \ldots, L\} \). The strength of the responsibility message sent from \( i \) to \( k \) is governed by the occurrences of \( H \) in \( R_{i} \) and \( R_{k} \).

The calculation of the responsibility \( r(i,k) \) can be divided into two cases depending on the number of class labels in the dataset.

Two Class Label Dataset.

For datasets with two classes, the class wise p-metric ranking of features is identical for both classes. The values in \( R_{i} \) and \( R_{k} \) can be of four kinds, as listed below:

  • \( R_{i} = \{ H,H\} \), \( R_{k} = \{ H,H\} \) – feature \( k \) should be assigned the responsibility of serving as exemplar for feature \( i \). Set \( s(i,k) = \max_{v \, s.t. \, v \ne u} s(u,v) \), where \( u, v \in \{1,2,\ldots,n\} \).

  • \( R_{i} = \{ L,L\} \), \( R_{k} = \{ L,L\} \) – no change in \( s(i,k) \).

  • \( R_{i} = \{ H,H\} \), \( R_{k} = \{ L,L\} \) – no change in \( s(i,k) \).

  • \( R_{i} = \{ L,L\} \), \( R_{k} = \{ H,H\} \) – feature \( k \) should be assigned the responsibility of serving as exemplar for feature \( i \). Set \( s(i,k) = \max_{v \, s.t. \, v \ne u} s(u,v) \), where \( u, v \in \{1,2,\ldots,n\} \).

Multi Class Label Dataset.

Suppose \( R_{i} = \{ H,H,H, \ldots, L\} \) and \( R_{k} = \{ H,H,L, \ldots, L\} \) are the class wise rankings of features \( i \) and \( k \) respectively. The counts of occurrences of H in the two sets are stored as \( H_{i} \) and \( H_{k} \). If \( H_{k} \ge H_{i} \), feature \( k \) should be assigned a high responsibility of serving as exemplar for feature \( i \): set \( s(i,k) = \max_{v \, s.t. \, v \ne u} s(u,v) \), where \( u, v \in \{1,2,\ldots,n\} \).
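
A sketch of this class aware similarity boost, covering both the two-class and the multi-class case and assuming the ranks array from the ranking sketch above, could look as follows; the names are illustrative.

```python
import numpy as np

def boost_similarities(S, ranks):
    """Raise s(i, k) to the maximum off-diagonal similarity whenever feature k
    should be given high responsibility of serving as exemplar for feature i,
    based on the count of classes in which each feature is high ranked (its 'H' count)."""
    n, p = ranks.shape
    H = (ranks <= n / 2).sum(axis=1)              # H[i] = classes in which feature i lies in the upper half
    s_max = np.max(S[~np.eye(n, dtype=bool)])     # max over u != v of s(u, v)
    S = S.copy()
    for i in range(n):
        for k in range(n):
            if i == k:
                continue
            if p == 2:
                if H[k] == 2:                     # two-class case: boost only when R_k = {H, H}
                    S[i, k] = s_max
            elif H[k] >= H[i]:                    # multi-class case: k covers at least as many classes as i
                S[i, k] = s_max
    return S
```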

The responsibilities are then calculated using the equation:

$$ r(i,k) \leftarrow s(i,k) - \max_{k' \, s.t. \, k' \ne k} \{ a(i,k') + s(i,k') \} $$

Setting \( s(i,k) \) to the maximum of all the similarities increases the strength of the responsibility message sent from feature \( i \) to candidate exemplar point \( k \).

Initially, the availabilities are set to zero. In the first iteration, \( r(i,k) \) is the similarity between feature \( i \) and its candidate exemplar \( k \), reduced by the maximum similarity between \( i \) and the other features. Later, when features highly similar to \( i \) are assigned to some other exemplar, their availability as candidate exemplars for \( i \) falls below zero. Such negative values reduce the effective value of \( a(i,k') + s(i,k') \) in the above equation.

Updating All Availabilities a(i, k) Given the Responsibilities.

The availabilities are calculated as:

$$ a(i,k) \leftarrow \min \Big\{ 0,\; r(k,k) + \sum_{i' \, s.t. \, i' \notin \{ i,k\}} \max \{ 0, r(i',k)\} \Big\} $$

The value of an availability can be zero or negative. A zero value indicates that \( k \) is available and can be assigned as the exemplar of \( i \). A negative value of \( a(i,k) \) indicates that \( k \) belongs to some other exemplar and is not available to become the exemplar of \( i \).
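
For concreteness, one round of the two updates could be sketched as follows; the damping factor and the self-availability update are borrowed from standard affinity propagation practice and are assumptions here rather than statements of the text.

```python
import numpy as np

def update_messages(S, R, A, damping=0.5):
    """One iteration of class aware message passing.  S holds similarities with
    class aware preferences on the diagonal; R and A are the responsibility and
    availability matrices from the previous iteration."""
    n = S.shape[0]

    # Responsibility: r(i,k) <- s(i,k) - max over k' != k of { a(i,k') + s(i,k') }
    AS = A + S
    R_new = np.empty_like(S)
    for i in range(n):
        for k in range(n):
            R_new[i, k] = S[i, k] - np.delete(AS[i], k).max()

    # Availability: a(i,k) <- min{ 0, r(k,k) + sum over i' not in {i,k} of max(0, r(i',k)) }
    Rp = np.maximum(R_new, 0)
    A_new = np.empty_like(S)
    for i in range(n):
        for k in range(n):
            support = Rp[:, k].sum() - Rp[i, k] - Rp[k, k]
            A_new[i, k] = min(0.0, R_new[k, k] + support)
    for k in range(n):                            # self-availability, as in standard affinity propagation
        A_new[k, k] = Rp[:, k].sum() - Rp[k, k]

    # Damping stabilizes the message values across iterations
    return damping * R + (1 - damping) * R_new, damping * A + (1 - damping) * A_new
```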

Combining the Responsibilities and Availabilities to Generate Clusters.

In every iteration, the point \( j \) that maximizes \( r(i,j) + a(i,j) \) is regarded as the exemplar of point \( i \). The algorithm terminates when the change in the messages falls below a previously set threshold. The final result is the list of all exemplars, which constitutes an optimal subset of features. Figure 4 shows the dynamics of the Class Aware Exemplar Discovery algorithm when applied to 15 two-dimensional features with 4 class labels. Initially, features are sized according to their class aware preference. Each feature is numbered according to the count of classes in which it is high ranked. Class aware messages are exchanged between the features. When the convergence condition is satisfied, the features are clustered, with each cluster represented by a red colored exemplar.

Fig. 4. (a) Two-dimensional features, sized according to their class aware preference value. (b) and (c) Messages are exchanged between features. The number associated with each feature corresponds to its count of occurrences in the upper half of the class wise ranking. The darkness of the arrow directed from point i to point k corresponds to the strength of the message that point i belongs to exemplar point k. (d) Clusters formed with their representative exemplars.

Algorithm 1 presents the steps followed by Class Aware Exemplar Discovery to select an optimal set of features. The input to the algorithm is a gene expression matrix with \( n \) genes, \( m \) samples and \( c \) classes. Three matrices, namely responsibility, availability and similarity, are initialized to zero. First, we calculate the negative of the Euclidean distance between every pair of features and store it in the similarity matrix (Line 1). Next, we calculate the class wise rank of all features (Line 2). We use \( N_{c} \) to denote the ranked list of genes obtained for class \( c \) (Lines 3–11).

Class Aware Preference (Line 4) – We select the top 0.015 % of features from \( N_{c} \) and assign them the high preference value, which is zero. The rest of the features are assigned the median of similarities as their preference value.

Class Aware Message Passing (Lines 15–43).

The features are then clustered by passing real-valued class aware messages. The algorithm terminates when the change in the messages falls below a previously set threshold. The exemplars are returned and can be used to accurately predict the class labels of unlabeled samples. An end-to-end sketch of this procedure is given below.
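
The sketch mirrors the flow of Algorithm 1 under the assumptions of the earlier sketches; it reuses the illustrative helpers defined above (similarity_matrix, class_wise_ranks, class_aware_preferences, boost_similarities, update_messages), and the iteration cap and convergence threshold are arbitrary choices, not values from the text.

```python
import numpy as np

def caed(D, y, max_iter=200, tol=1e-6):
    """Class Aware Exemplar Discovery sketch: returns indices of the exemplar features."""
    S = similarity_matrix(D)                 # negative Euclidean distances (Line 1)
    ranks = class_wise_ranks(D, y)           # class wise ranking of all features (Line 2)
    S = class_aware_preferences(S, ranks)    # class aware preferences on the diagonal (Lines 3-11)
    S = boost_similarities(S, ranks)         # class aware boost of selected similarities
    n = S.shape[0]
    R, A = np.zeros((n, n)), np.zeros((n, n))
    for _ in range(max_iter):                # class aware message passing (Lines 15-43)
        R_new, A_new = update_messages(S, R, A)
        converged = np.max(np.abs(R_new - R)) < tol and np.max(np.abs(A_new - A)) < tol
        R, A = R_new, A_new
        if converged:
            break
    # Each feature i is assigned the exemplar that maximizes r(i,j) + a(i,j);
    # the distinct assigned points form the reduced feature set.
    return np.unique(np.argmax(R + A, axis=1))
```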

3 Experimental Evaluation

We performed three sets of experiments to evaluate the performance of our approach. In the first experiment, we compared the performance of three different classifiers using the features selected by CAED against the features selected by affinity propagation. In the second experiment, we compared the performance of CAED with CFS [4] for feature selection with a greedy search strategy; WEKA [7] provided the implementation of CFS. The third experiment compared the performance of the classifiers using all features versus the features extracted by CAED. Details of all three experiments are discussed in the following subsections. The experiments were carried out on a machine with a 3.4 GHz Intel i7 CPU and 8 GB RAM running a Windows operating system.

3.1 Description of Experimental Datasets

We performed an experimental evaluation of Class Aware Exemplar Discovery over 18 publicly available microarray gene expression datasets [8–11]. Table 1 describes the datasets used for the experimental study. They are drawn from various cancer related research works.

Table 1. Description of datasets

3.2 Comparison of Class Aware Exemplar Discovery with Affinity Propagation

We evaluated the classification accuracy of three state-of-the-art classifiers, namely support vector machine (SVM), Naïve Bayes (NB) and C4.5 decision tree (DT), using the exemplars generated by Affinity Propagation (AP) and the exemplars generated by Class Aware Exemplar Discovery. The classification accuracy is calculated using 10-fold cross validation.

To visualize the results we performed a “Win-Loss Experiment”. If the classification accuracy of CAED is better than that of the baseline approach, the result is declared a win; if the classification accuracy degrades, the result is declared a loss; otherwise it is declared a draw. Figure 5 shows the results of the “Win-Loss Experiment” obtained when the classification accuracy of the three classifiers using exemplars generated by affinity propagation is compared against that using Class Aware Exemplar Discovery.
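
As a sketch of how such a tally can be computed, the snippet below uses scikit-learn classifiers as stand-ins for the three models (note that sklearn's DecisionTreeClassifier implements CART rather than C4.5) and assumes each dataset is stored as a features-by-samples matrix with a label vector; the paper does not specify the implementation used for this step, so everything here is illustrative.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

def cv_accuracy(X, y, clf):
    """Mean 10-fold cross validation accuracy on samples X (rows) with labels y."""
    return cross_val_score(clf, X, y, cv=10, scoring="accuracy").mean()

def win_loss_draw(datasets, caed_subsets, baseline_subsets):
    """Tally wins/losses/draws of CAED feature subsets against baseline subsets
    for each classifier over all datasets; D is (features x samples), so feature
    indices select rows and classifiers see the transposed matrix."""
    classifiers = {"SVM": SVC(), "NB": GaussianNB(), "DT": DecisionTreeClassifier()}
    tally = {name: {"win": 0, "loss": 0, "draw": 0} for name in classifiers}
    for (D, y), caed_idx, base_idx in zip(datasets, caed_subsets, baseline_subsets):
        for name, clf in classifiers.items():
            acc_caed = cv_accuracy(D[caed_idx].T, y, clf)
            acc_base = cv_accuracy(D[base_idx].T, y, clf)
            if acc_caed > acc_base:
                tally[name]["win"] += 1
            elif acc_caed < acc_base:
                tally[name]["loss"] += 1
            else:
                tally[name]["draw"] += 1
    return tally
```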

Fig. 5. Win-loss depiction of CAED versus affinity propagation carried over classifiers (a) support vector machine (b) Naïve Bayes (c) C4.5 decision tree

We observed that Class Aware Exemplar Discovery generates fewer exemplars than affinity propagation. Figure 6 shows the drop in the count of clusters (i.e., the number of exemplars) from affinity propagation to CAED.

Fig. 6. Drop in count of clusters in the 18 datasets

We also observed that Class Aware Exemplar Discovery converges in fewer message passing iterations than affinity propagation, which reduces the execution time tremendously. Figure 7 shows the drop in execution time; the Y axis is measured in seconds.

Fig. 7. Drop in execution time

3.3 Comparison of Class Aware Exemplar Discovery with Standard Feature Subset Selection Techniques

We compare the effectiveness of CAED with feature subsets generated using CFS (Correlation based Feature Selection). The maximum achievable 10-fold cross validation classification accuracy is recorded as the performance metric. Figure 8 shows the results of the “Win-Loss Experiment” obtained when the classification accuracy of the three classifiers using the feature subset generated by CFS is compared against that using exemplars generated by Class Aware Exemplar Discovery.

Fig. 8. Win-loss depiction of CAED over CFS carried over classifiers (a) support vector machine (b) Naïve Bayes (c) C4.5 decision tree

3.4 Comparison of Class Aware Exemplar Discovery with All Features

We evaluated the performance of the three classifiers, support vector machine (SVM), Naïve Bayes (NB) and C4.5 decision tree (DT), using all features of the 18 datasets. We compared these results with the classification accuracy obtained using the features produced by Class Aware Exemplar Discovery.

Figure 9 shows the results of the “Win-Loss Experiment” obtained when the classification accuracy of the three classifiers using all features is compared against that using Class Aware Exemplar Discovery.

Fig. 9. Win-loss depiction of CAED over all features carried over classifiers (a) support vector machine (b) Naïve Bayes (c) C4.5 decision tree

4 Conclusions

Gene expression datasets have a large number of features. For the effective application of any learning algorithm on gene expression datasets, feature subset selection is required. In this paper, we proposed a class aware, clustering based feature subset selection technique. Our approach quantifies the ability of a feature to distinguish samples of one class from those of the other classes and uses this value to influence the message passing procedure of affinity propagation. We observed that our approach leads to a more relevant selection of features in less time compared to the existing approach that uses a similar strategy for feature selection. We evaluated the effectiveness of our approach on 18 real world cancer datasets.

We evaluated Class Aware Exemplar Discovery against affinity propagation. Experiments have shown that our technique outperforms affinity propagation in terms of classification accuracy. In comparison to affinity propagation, CAED converges in fewer iterations, leading to a large drop in execution time.

We also evaluated the feature set generated by CAED against a state-of-the-art feature selection technique. Experimental results show that CAED gives better classification accuracy for all the classifiers used. Motivated by the recent growth in parallel computing [13] and NVIDIA CUDA Research Center support, we are developing a GPU based parallel algorithm for CAED. We are also working on improving the readability of the mathematical symbols in the printed version of the paper.