Abstract
Given a dataset, exemplars are a subset of data points that can represent the whole dataset without significant loss of information. Affinity propagation is an exemplar discovery technique that, unlike k–centres clustering, gives uniform preference to all data points. The data points iteratively exchange real–valued messages until clusters with their representative exemplars become apparent.
In this paper, we propose a Class Aware Exemplar Discovery (CAED) algorithm, which assigns a preference value to each data point based on its ability to differentiate samples of one class from others. To aid this, CAED performs class-wise ranking of data points and assigns each data point a preference value based on its class-wise rank. While exchanging messages, data points with better representative ability are favored over others for being chosen as exemplars.
The proposed method is evaluated over 18 gene expression datasets to check its efficacy in selecting relevant exemplars from large datasets. Experimental evaluation exhibits improvement in classification accuracy over affinity propagation and other state-of-the-art feature selection techniques. Class Aware Exemplar Discovery converges in fewer iterations than affinity propagation, thereby reducing execution time significantly.
1 Introduction
With the advent of microarray technology, simultaneous profiling of thousands of gene expressions across multiple samples in a single experiment became possible. Microarray technology generates huge amounts of gene expression data whose effective analysis is challenging. Moreover, it has been observed that gene expression datasets have a large number of uninformative and redundant features, which increases the complexity of classification algorithms [1, 11–13].
To circumvent these problems, many feature selection techniques have been proposed. The purpose of feature selection is to extract relevant features from the observed data, which improves the results of machine learning models. Compared with dimensionality reduction techniques like Principal Component Analysis (PCA) and Linear Discriminant Analysis (LDA), feature selection algorithms select a relevant subset of features instead of altering the original features. The selected relevant genes, also known as “biomarkers”, find application in medicine for the discovery of new diseases, the development of new pharmaceuticals [2, 12], etc. Existing work on feature subset selection from gene expression data can be categorized into (i) classification and (ii) clustering based approaches. Classification based approaches like ReliefF [3] and Correlation based Feature Selection [4] rank features based on their intrinsic properties and select a subset of top-ranked features. Clustering based approaches like k-medoids [5] and affinity propagation [6] cluster the features based on their similarity with each other. The feature set is reduced to the representatives of the clusters.
Affinity propagation, proposed by Frey and Dueck [6] and applied here to feature subset selection, identifies a subset of features that can represent the dataset. Such features are called exemplars. Affinity propagation takes the similarity between features as input. Instead of specifying the number of clusters, a real number called the preference value of each feature is also passed as input. The number of identified exemplars is influenced by the preference value: the larger the preference value, the more clusters are formed. Features exchange real-valued messages until clusters with their representative exemplars emerge. Affinity propagation has found application in the machine learning community, computational biology, and computer vision. However, it gives uniform preference to all features, and messages are exchanged iteratively between features irrespective of their capability to differentiate samples of one class from samples of other classes.
In this paper, we propose a Class Aware Exemplar Discovery (CAED) algorithm which calculates a class-wise ranking for all features and incorporates this information when assigning preference values. The features are clustered by iteratively exchanging two types of messages, viz. responsibility and availability. The messages are exchanged in such a way that features ranked higher in the class-wise ranking are favored over features ranked lower, which leads to better exemplar discovery.
We evaluated the correctness of our approach by conducting experiments on 18 publicly available microarray gene expression datasets, recording classification accuracy as the performance metric. Improvement in classification accuracy over affinity propagation is achieved for 16, 17, and 13 datasets for the three classifiers, namely Support Vector Machine, Naive Bayes, and C4.5 Decision Tree, respectively.
2 Overview of Our Approach
The workflow of our approach is shown in Fig. 1. The gene expression matrix is transformed into a similarity matrix using a distance measure. The diagonal values of the similarity matrix, also called preferences, are updated using class-aware ranking of features. The updated matrix is used for class-aware clustering. The representative of each cluster, called an exemplar, is extracted and the reduced set of features is evaluated over classifiers.
2.1 Gene Data
We selected 18 publicly available microarray datasets (available at http://faculty.iitr.ac.in/~patelfec/index.html) for experimental evaluation of the Class Aware Exemplar Discovery algorithm. Microarray datasets are gene expression matrices, where the expression value of each gene is measured over different samples. Generally, the number of samples is very small compared to the number of features.
2.2 Gene-Gene Similarity
For a gene expression matrix \( D_{n \times m} \), with n features and m samples, the similarity between every two features is calculated and stored in a similarity matrix \( S_{n \times n} \). The similarity \( s(i,k) \) indicates how well the feature with index \( k \) is suited to be the exemplar for feature \( i \). Since the aim is to maximize similarity, we take the negative of the distance between each pair of features. We used the negative of the Euclidean distance as the similarity measure for experimental evaluation.
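As an illustration, the similarity computation can be sketched as follows (a minimal NumPy sketch; the function and variable names are ours, not from the paper):

```python
import numpy as np

def similarity_matrix(X):
    """s(i, k) = -||x_i - x_k||: negative Euclidean distance between
    feature rows of the (n features x m samples) expression matrix X."""
    sq = np.sum(X ** 2, axis=1)
    # Squared pairwise distances via ||a-b||^2 = ||a||^2 + ||b||^2 - 2 a.b;
    # clip tiny negatives caused by floating-point round-off.
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * (X @ X.T), 0.0)
    return -np.sqrt(d2)  # less negative = more similar

# Toy expression matrix: 3 features measured over 3 samples.
X = np.array([[1.0, 2.0, 3.0],
              [1.0, 2.0, 3.1],
              [5.0, 0.0, 1.0]])
S = similarity_matrix(X)  # S[0, 1] is -0.1: rows 0 and 1 are near-identical
```

The matrix is symmetric with zeros on the diagonal; the preferences \( s(i,i) \) are filled in afterwards, as described next.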
2.3 Class Aware Preference
The affinity propagation algorithm takes the similarity matrix and a real number called the preference of each feature as input. For a similarity matrix \( S_{n \times n} \), the value \( s(i,i) \), where \( i \in \{1,2,\ldots,n\} \), is the preference value. These values are called preferences since a feature \( i \) with a larger value \( s(i,i) \) is more likely to be chosen as an exemplar. The number of identified exemplars (number of clusters) is influenced by the input preferences: the larger the values, the more clusters are formed. The preference values can be uniform or non-uniform. If all features are equally suitable as exemplars, the preferences are set to a common value. The preference value can be set to any number in the range \( \min_{j \ne i} s(i,j) \) to \( \max_{j \ne i} s(i,j) \), where \( i,j \in \{1,2,\ldots,n\} \).
We propose assigning the preference value of a feature based on its ability to distinguish samples of one class from other classes. To aid this, we perform class-wise ranking of features using the p-metric [1] with a one-versus-all strategy. The top 0.015 % of features from each class are selected and assigned a preference value of zero, thereby increasing their probability of being chosen as exemplars.
Figure 2 depicts the class-wise ranking of features, where each feature is assigned multiple ranks, one per class. The high-ranked features of each class are highlighted.
The other features are assigned a uniform preference value, namely \( \mathrm{median}_{j \ne i}\, s(i,j) \), where \( i,j \in \{1,2,\ldots,n\} \). Figure 3 shows how the accuracy of the classifiers, namely support vector machine (SVM), Naïve Bayes (NB), and C4.5 decision tree (DT), varied with the count of features assigned the high preference value.
We observed that selecting more than 0.015 % of features from each class yielded no further improvement in the classification accuracy of the classifiers.
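Assuming the p-metric scores a feature as \( |\mu_{c} - \mu_{\bar c}| / (\sigma_{c} + \sigma_{\bar c}) \) in the one-versus-all setting (the exact form is given in [1]), the preference assignment can be sketched as follows; the function name and example data are illustrative:

```python
import numpy as np

def class_aware_preferences(X, y, S, top_frac=0.00015):
    """Write class-aware preferences onto the diagonal of a copy of S:
    zero (the highest similarity) for the top 0.015 % of features per
    class, the median off-diagonal similarity for everything else."""
    n = X.shape[0]
    pref = np.full(n, np.median(S[~np.eye(n, dtype=bool)]))
    k = max(1, int(np.ceil(top_frac * n)))  # top 0.015 % of n features
    for c in np.unique(y):
        pos, neg = X[:, y == c], X[:, y != c]
        # Assumed p-metric form: |mean difference| / (sum of std devs).
        score = np.abs(pos.mean(1) - neg.mean(1)) / (pos.std(1) + neg.std(1) + 1e-12)
        pref[np.argsort(-score)[:k]] = 0.0
    S2 = S.copy()
    np.fill_diagonal(S2, pref)
    return S2

# Feature 0 perfectly separates the two classes, so it gets preference 0.
X = np.array([[10., 10., 0., 0.],
              [1., 2., 1.5, 2.5],
              [0., 1., 0.5, 1.5],
              [2., 1., 1., 2.],
              [3., 4., 3., 4.]])
y = np.array([0, 0, 1, 1])
S = -np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
S2 = class_aware_preferences(X, y, S)
```

Since zero is the largest possible value of a negative-distance similarity, a zero preference makes a feature maximally attractive as an exemplar candidate.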
2.4 Class Aware Message Passing
The similarity matrix with class-aware preference values is passed to affinity propagation for exemplar discovery. Affinity propagation is a message-passing based clustering algorithm: it iteratively transmits real-valued messages among features until a good set of clusters with their representative exemplars emerges. Two kinds of messages are exchanged:
The responsibility message, denoted \( r(i,k) \), is sent from feature \( i \) to candidate exemplar \( k \). It reflects the accumulated evidence for how well-suited feature \( k \) is to serve as the exemplar for feature \( i \).
The availability message, denoted \( a(i,k) \), is sent from candidate exemplar \( k \) to feature \( i \). It reflects the accumulated evidence for how appropriate it would be for feature \( i \) to choose feature \( k \) as its exemplar.
We propose a class-aware message passing algorithm where the strength of the messages exchanged between two features depends on their ability to discriminate samples among different classes.
Each iteration of affinity propagation has three steps:

- Step 1. Calculating responsibilities r(i, k) given the availabilities
- Step 2. Updating all availabilities a(i, k) given the responsibilities
- Step 3. Combining the responsibilities and availabilities to generate clusters
Calculating Responsibilities r(i,k) Given the Availabilities:
For a dataset \( D_{n \times m} \) with n features, m samples, and p classes, we calculate the class-wise rank of each feature using the p-metric. The class-wise rank of feature \( i \) is denoted \( R_{i} = \{ C_{1}, C_{2}, C_{3}, \ldots, C_{p} \} \), where \( i \in \{1,2,\ldots,n\} \) and \( C_{1} \) is the rank of feature \( i \) for class 1. To calculate \( r(i,k) \) we evaluate \( R_i \) and \( R_k \). If the rank of feature \( i \) for class \( j \) is at most \( \frac{n}{2} \), i.e. \( C_{j} \le \frac{n}{2} \), it lies in the upper half of the ranking and we denote it as H; otherwise it lies in the lower half and is denoted as L. Hence, the ranking of feature \( i \) may become, say, \( R_{i} = \{ H,H,L, \ldots, L\} \), and similarly the ranking of feature \( k \), say, \( R_{k} = \{ L,H,L, \ldots, L\} \). The strength of the responsibility message sent from \( i \) to \( k \) is governed by the occurrences of \( H \) in \( R_{i} \) and \( R_{k} \).
The calculation of the responsibility \( r(i,k) \) can be divided into two cases depending on the number of class labels in the dataset.
Two Class Label Dataset.
For datasets with 2 classes, the class-wise p-metric ranking of features is identical for both classes. The combinations of \( R_i \) and \( R_k \) can be of four kinds, as listed below:

- \( R_{i} = \{H,H\} \), \( R_{k} = \{H,H\} \) – feature \( k \) should be assigned the responsibility of serving as exemplar for feature \( i \). Set \( s(i,k) = \max_{v \ne u} s(u,v) \), where \( u, v \in \{1,2,\ldots,n\} \).
- \( R_{i} = \{L,L\} \), \( R_{k} = \{L,L\} \) – no change in \( s(i,k) \).
- \( R_{i} = \{H,H\} \), \( R_{k} = \{L,L\} \) – no change in \( s(i,k) \).
- \( R_{i} = \{L,L\} \), \( R_{k} = \{H,H\} \) – feature \( k \) should be assigned the responsibility of serving as exemplar for feature \( i \). Set \( s(i,k) = \max_{v \ne u} s(u,v) \), where \( u, v \in \{1,2,\ldots,n\} \).
Multi Class Label Dataset.
Suppose \( R_{i} = \{ H,H,H, \ldots, L\} \) and \( R_{k} = \{ H,H,L, \ldots, L\} \) are the class-wise rankings of features \( i \) and \( k \) respectively. The counts of occurrences of H in the two sets are stored as \( H_{i} \) and \( H_{k} \). If \( H_{k} \ge H_{i} \), feature \( k \) should be assigned high responsibility of serving as exemplar for feature \( i \): set \( s(i,k) = \max_{v \ne u} s(u,v) \), where \( u, v \in \{1,2,\ldots,n\} \).
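The H/L comparison above can be sketched as a similarity boost applied before message passing. This is our own sketch: we use the multi-class rule \( H_k \ge H_i \), and, to match the two-class case where \( \{L,L\} \)/\( \{L,L\} \) pairs are left unchanged, additionally require \( H_k > 0 \):

```python
import numpy as np

def boost_similarities(S, ranks):
    """ranks[i, c] = class-wise rank of feature i for class c (1 = best).
    A rank <= n/2 counts as 'H'. If candidate exemplar k has at least as
    many H entries as i (and at least one), raise s(i, k) to the maximum
    off-diagonal similarity, strengthening k's responsibility message."""
    n = ranks.shape[0]
    H = np.sum(ranks <= n / 2, axis=1)          # H_i count per feature
    smax = S[~np.eye(n, dtype=bool)].max()      # max_{u != v} s(u, v)
    Sb = S.copy()
    for i in range(n):
        for k in range(n):
            if i != k and H[k] > 0 and H[k] >= H[i]:
                Sb[i, k] = smax
    return Sb

# 4 features, 2 classes: features 0 and 2 are high-ranked (rank <= n/2 = 2).
ranks = np.array([[1, 1], [4, 4], [2, 2], [3, 3]])
S = np.array([[ 0., -1., -2., -3.],
              [-1.,  0., -4., -5.],
              [-2., -4.,  0., -6.],
              [-3., -5., -6.,  0.]])
Sb = boost_similarities(S, ranks)
```

In the example, every entry pointing toward the high-ranked features 0 and 2 is raised to the maximum off-diagonal similarity (−1), while entries pointing toward the low-ranked features 1 and 3 are left untouched.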
Then the responsibilities are calculated using the standard affinity propagation update [6]:
\( r(i,k) \leftarrow s(i,k) - \max_{k' \ne k} \left\{ a(i,k') + s(i,k') \right\} \)
Setting \( s(i,k) \) to the maximum of all similarities increases the strength of the responsibility message sent from feature \( i \) to candidate exemplar \( k \).
Initially, the availabilities are set to zero, so in the first iteration \( r(i,k) \) is the similarity between feature \( i \) and candidate exemplar \( k \), reduced by the maximum similarity between \( i \) and the other features. In later iterations, when features highly similar to \( i \) are assigned to some other exemplar, their availability as candidate exemplars for \( i \) falls below zero. Such negative values reduce the effective contribution of \( s(i,k') \) in the above equation.
Updating All Availabilities a(i, k) Given the Responsibilities:
The availabilities are calculated as:
\( a(i,k) \leftarrow \min \left\{ 0,\; r(k,k) + \sum_{i' \notin \{i,k\}} \max \{0, r(i',k)\} \right\} \) for \( i \ne k \), with self-availability \( a(k,k) \leftarrow \sum_{i' \ne k} \max \{0, r(i',k)\} \).
The value of the availability can be zero or negative. A zero value indicates that \( k \) is available and can be assigned as the exemplar for \( i \). A negative value of \( a(i,k) \) indicates that \( k \) belongs to some other exemplar and is not available to become the exemplar for \( i \).
Combining the Responsibilities and Availabilities to Generate Clusters.
In every iteration, the point \( j \) that maximizes \( r(i,j) + a(i,j) \) is regarded as the exemplar for point \( i \). The algorithm terminates when the changes in the messages fall below a previously set threshold. The final result is the list of exemplars, which is the selected subset of features. Figure 4 shows the dynamics of the Class Aware Exemplar Discovery algorithm when applied to 15 two-dimensional features with 4 class labels. Initially, features are sized according to their class-aware preference. Each feature is numbered according to the count of classes in which it is high-ranked. Class-aware messages are exchanged between features. When the convergence condition is satisfied, the features are clustered, with each cluster represented by a red-colored exemplar.
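The three steps can be sketched with the standard affinity propagation updates of [6]. This is a minimal NumPy sketch: the damping factor is our addition for numerical stability and is not discussed in the paper, and the class-aware boost is assumed to have been folded into S beforehand:

```python
import numpy as np

def exemplar_discovery(S, max_iter=200, damping=0.5, tol=1e-6):
    """Message passing on similarity matrix S (preferences on the
    diagonal); returns the indices of the discovered exemplars."""
    n = S.shape[0]
    R = np.zeros((n, n))
    A = np.zeros((n, n))
    for _ in range(max_iter):
        # Step 1: r(i,k) = s(i,k) - max_{k' != k} [a(i,k') + s(i,k')]
        AS = A + S
        idx = np.argmax(AS, axis=1)
        first = AS[np.arange(n), idx]
        AS[np.arange(n), idx] = -np.inf
        second = AS.max(axis=1)
        R_new = S - first[:, None]
        R_new[np.arange(n), idx] = S[np.arange(n), idx] - second
        R = damping * R + (1 - damping) * R_new
        # Step 2: a(i,k) = min(0, r(k,k) + sum_{i' not in {i,k}} max(0, r(i',k)))
        Rp = np.maximum(R, 0)
        np.fill_diagonal(Rp, R.diagonal())
        A_new = Rp.sum(axis=0)[None, :] - Rp
        dA = A_new.diagonal().copy()          # a(k,k) has no min(0, .) clamp
        A_new = np.minimum(A_new, 0)
        np.fill_diagonal(A_new, dA)
        A_old = A
        A = damping * A + (1 - damping) * A_new
        if np.abs(A - A_old).max() < tol:     # Step 3: convergence check
            break
    # The exemplar of point i maximizes r(i,j) + a(i,j).
    return np.unique(np.argmax(A + R, axis=1))

# Two tight groups of 2-D "features": one exemplar emerges per group.
pts = np.array([[0, 0], [0, .1], [.1, 0], [5, 5], [5, 5.1], [5.1, 5]])
S = -np.sqrt(((pts[:, None, :] - pts[None, :, :]) ** 2).sum(-1))
np.fill_diagonal(S, np.median(S[~np.eye(6, dtype=bool)]))
ex = exemplar_discovery(S)
```

With the median off-diagonal similarity as a uniform preference, the toy run recovers one exemplar per group, mirroring the behavior described above.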
Algorithm 1 presents the steps followed by Class Aware Exemplar Discovery to select an optimal set of features. The input to the algorithm is a gene expression matrix with \( n \) genes, \( m \) samples, and \( c \) classes. Three matrices, namely responsibility, availability, and similarity, are initialized to zero. First, we calculate the negative of the Euclidean distance between every pair of features and store it in the similarity matrix (Line 1). Next, we calculate the class-wise rank of all features (Line 2). We use \( N_{c} \) to denote the ranked list of genes obtained for class \( c \) (Lines 3–11).
Class Aware Preference (Line 4) – We select the top 0.015 % of features from \( N_{c} \) and assign them the high preference value, which is zero. The remaining features are assigned the median of similarities as their preference value.
Class Aware Message Passing (Lines 15–43).
The features are then clustered by passing real-valued class-aware messages. The algorithm terminates when the change in messages falls below a previously set threshold. The exemplars are returned and can be used to accurately predict the class labels of unlabeled samples.
3 Experimental Evaluation
We performed three sets of experiments to evaluate the performance of our approach. In the first experiment, we compared the performance of three different classifiers using features selected by CAED against features selected by affinity propagation. In the second experiment, we compared the performance of CAED with CFS [4] for feature selection with a greedy search strategy; WEKA [7] provided the implementation of CFS. The third experiment compared the performance of the classifiers using all features versus the features extracted by CAED. Details of all three experiments are discussed in the subsequent sections. The experiments were carried out on a 3.4 GHz Intel i7 CPU machine with 8 GB RAM running a Windows-based operating system.
3.1 Description of Experimental Datasets
We performed experimental evaluation of Class Aware Exemplar Discovery over 18 publicly available microarray gene expression datasets [8–11]. Table 1 describes the datasets used in the experimental study; they are drawn from various cancer-related research works.
3.2 Comparison of Class Aware Exemplar Discovery with Affinity Propagation
We evaluated the classification accuracy of three state-of-the-art classifiers, namely support vector machine (SVM), Naïve Bayes (NB), and C4.5 decision tree (DT), using the exemplars generated by Affinity Propagation (AP) and those generated by Class Aware Exemplar Discovery. The classification accuracy is calculated using 10-fold cross-validation.
To visualize the results we performed a “Win-Loss Experiment”. If the classification accuracy of CAED is better than the baseline approach, the result is declared a win; if the classification accuracy degraded, the result is declared a loss; otherwise it is declared a draw. Figure 5 shows the results of the “Win-Loss Experiment” obtained when the classification accuracy of the three classifiers using exemplars generated by affinity propagation is compared against Class Aware Exemplar Discovery.
We observed that Class Aware Exemplar Discovery generates fewer exemplars than affinity propagation. Figure 6 shows the drop in the count of clusters (i.e., the number of exemplars) from affinity propagation to CAED.
We also observed that Class Aware Exemplar Discovery converges in fewer message-passing iterations than affinity propagation, which reduces the execution time tremendously. Figure 7 shows the drop in execution time; the Y-axis is measured in seconds.
3.3 Comparison of Class Aware Exemplar Discovery with Standard Feature Subset Selection Techniques
We compared the effectiveness of CAED with the features generated using CFS (Correlation based Feature Selection). The maximum achievable 10-fold cross-validation classification accuracy is recorded as the performance metric. Figure 8 shows the results of the “Win-Loss Experiment” obtained when the classification accuracy of the three classifiers using the feature subset generated by CFS is compared against the exemplars generated by Class Aware Exemplar Discovery.
3.4 Comparison of Class Aware Exemplar Discovery with All Features
We evaluated the performance of the three classifiers, support vector machine (SVM), Naïve Bayes (NB), and C4.5 decision tree (DT), using all features of the 18 datasets, and compared these results with the classification accuracy obtained using the features produced by Class Aware Exemplar Discovery.
Figure 9 shows the results of the “Win-Loss Experiment” obtained when the classification accuracy of the three classifiers using all features is compared against Class Aware Exemplar Discovery.
4 Conclusions
Gene expression datasets have a large number of features. For the effective application of any learning algorithm on gene expression datasets, feature subset selection is required. In this paper, we proposed a class-aware clustering based feature subset selection technique. Our approach quantifies the ability of a feature to distinguish samples of one class from other classes and uses this value to influence the message-passing procedure of affinity propagation. We observed that our approach leads to more relevant feature selection in less time compared to the existing approach that uses a similar strategy for feature selection. We evaluated the effectiveness of our approach on 18 real-world cancer datasets.
We evaluated Class Aware Exemplar Discovery against affinity propagation. Experiments have shown that our technique outperforms affinity propagation in terms of classification accuracy. Compared to affinity propagation, CAED converges in fewer iterations, leading to a large drop in execution time.
We also evaluated the feature set generated by CAED against a state-of-the-art feature selection technique. Experimental results show that CAED gives better classification accuracy for all the classifiers used. Motivated by recent growth in parallel computing [13] and NVIDIA CUDA Research Center support, we are developing a GPU-based parallel algorithm for CAED.
References
Inza, I., Larrañaga, P., Blanco, R., Cerrolaza, A.J.: Filter versus wrapper gene selection approaches in DNA microarray domains. Artif. Intell. Med. 31(2), 91–103 (2004)
De Abreu, F.B., Wells, W.A., Tsongalis, G.J.: The emerging role of the molecular diagnostics laboratory in breast cancer personalized medicine. Am. J. Pathol. 183(4), 1075–1083 (2013)
Kononenko, I., Šimec, E., Robnik-Šikonja, M.: Overcoming the myopia of inductive learning algorithms with RELIEFF. Appl. Intell. 7(1), 39–55 (1997)
Hall, M.A.: Correlation-based feature selection for machine learning. Doctoral dissertation, The University of Waikato (1999)
Kashef, R., Kamel, M.S.: Efficient bisecting k-medoids and its application in gene expression analysis. In: Campilho, A., Kamel, M. (eds.) ICIAR 2008. LNCS, vol. 5112, pp. 423–434. Springer, Heidelberg (2008)
Frey, B.J., Dueck, D.: Clustering by passing messages between data points. Science 315(5814), 972–976 (2007)
Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., Witten, I.H.: The WEKA data mining software: an update. ACM SIGKDD Explor. Newsl. 11(1), 10–18 (2009)
De Souto, M.C., Costa, I.G., de Araujo, D.S., Ludermir, T.B., Schliep, A.: Clustering cancer gene expression data: a comparative study. BMC Bioinf. 9(1), 497 (2008)
Foithong, S., Pinngern, O., Attachoo, B.: Feature subset selection wrapper based on mutual information and rough sets. Expert Syst. Appl. 39(1), 574–584 (2012)
Mramor, M., Leban, G., Demšar, J., Zupan, B.: Visualization-based cancer microarray data classification analysis. Bioinformatics 23(16), 2147–2154 (2007)
Blum, A., Langley, P.: Selection of relevant features and examples in machine learning. Artif. Intell. 97(1/2), 245–271 (1997)
Chandrashekar, G., Sahin, F.: A survey on feature selection methods. Comput. Electr. Eng. 40, 16–28 (2014)
Soufan, O., Kleftogiannis, D., Kalnis, P., Bajic, V.B.: DWFS: a wrapper feature selection tool based on a parallel genetic algorithm. PLoS ONE 10, e0117988 (2015). doi:10.1371/journal.pone.0117988
© 2015 Springer International Publishing Switzerland
Sharma, S., Agrawal, A., Patel, D. (2015). Class Aware Exemplar Discovery from Microarray Gene Expression Data. In: Kumar, N., Bhatnagar, V. (eds) Big Data Analytics. BDA 2015. Lecture Notes in Computer Science(), vol 9498. Springer, Cham. https://doi.org/10.1007/978-3-319-27057-9_17