Abstract
Analysis of gene expression data obtained from microarray experiments is helpful for various biological purposes such as identifying Differentially Expressed genes, disease classification, predicting survival rate of patients etc. However, data from microarray experiments come with less sample size and thus have limited statistical power for any analysis. To overcome this problem, researchers are now relying on a more powerful technique called meta-analysis, an integrated analysis of existing data from different but related independent studies. Microarray data reveal that genes are normally expressed in related functional pattern, which suggests using clustering as an alternative technique to group genes into relatively homogenous clusters such as Differentially Expressed and Non-Differentially Expressed. In this paper, we explore k-Means Clustering technique to perform meta-analysis of gene expression data for finding Differentially Expressed genes. Comparative analysis of k-Means Clustering technique is performed, and the results are validated by various statistical meta-analysis techniques, which prove clustering as a robust alternative technique for meta-analysis of gene expression data.
Access provided by CONRICYT-eBooks. Download conference paper PDF
Similar content being viewed by others
Keywords
1 Introduction
With the increase in genome-wide investigation of gene expression data, analysis of high throughput microarray data has become a routine for biomedical research [1]. Investigation of gene expression levels within different organisms, in different tissue types, and in various clinical conditions may help in understanding the gene functions, detecting Differentially Expressed (DE) genes and help in the diagnosis of disease conditions. DE genes provide important information to understand the molecular mechanism of disease initiation and progression and are very helpful in disease diagnosis and prevention.
Microarray gene expression data is high dimensional in nature but have very less samples (hundreds in number). In order to address this issue, efforts have been put for microarray meta-analysis, where data from multiple microarray studies with relevant biological hypotheses are combined to improve DE gene detection [2, 3]. Despite the availability of various meta-analysis methods, ability of each method to identify DE genes under different hypothesis settings differs. Each method has its own advantages and disadvantages in terms of robustness, detection capabilities, biological significance of detected genes and stability [4].
The fact that genes in functionally related patterns show normal expression levels suggests that clustering techniques can be a helpful tool to group genes into homogenous clusters. Clustering acts as an initial step in the analysis of gene expression data for detecting groups of genes that show similar expression patterns. In gene expression data, clustering is done on the basis of similarity measure between gene expression levels which is measured by correlation coefficient between expression levels of different genes.
To group genes into homogeneous clusters for biological analysis, several clustering techniques were earlier used for low sample size gene expression data. Zaravinos et al. [5] performed Hierarchical and k-Means Clustering on microarray gene expression data to identify common DE genes among clinically relevant subclasses of Bladder Cancer. Alon et al. [6] performed analysis of 40 tumor and 22 normal colon tissue samples of different cell types using two way clustering technique to reveal hidden patterns in gene expression data. Heyer et al. [7] used a similarity measure to decrease the percentage of false positives, and then applied a clustering algorithm for grouping gene expression patterns of microarray gene expression data. Lin et al. [8] performed Hierarchical Clustering of hypomethylated regions, which identified differential methylated enhancers and tumor-specific hypermethylated clusters related to normal or breast cancer cell lines.
Many statistical methods have been previously used to perform meta-analysis on gene expression data for identification of DE genes [3, 8, 9]. These methods combine p-values from each test to a single test statistics. With these methods, genes with the significant statistical value are reported as DE genes. However, these methods ignore the dependency between the genes in gene expression data which may lead to poor performance in detecting DE genes under different hypothesis settings. Clustering has an advantage of considering the correlation that exists between genes, which can be an efficient technique for detecting DE genes.
In this paper, we explore k-Means Clustering technique as a novel approach for meta-analysis of microarray gene expression data to identify DE genes. This study is first of its kind to explore the applications and robustness of k-Means Clustering techniques to perform meta-analysis of microarray gene expression data for DE gene detection.
2 Datasets and Pre-processing
For clustering gene expression data, suitable microarray studies are identified with similar biological hypothesis corresponding to Prostate Cancer and Breast Cancer as described in Table 1. Probe sets (a collection of probes designed to interrogate a given sequence) suffer from duplication, i.e. many probe sets refer to same gene. During annotation phase, all the probe sets are annotated with the same gene, creating duplicates. To remove the duplicate genes, the genes with the significant probability value are kept, while others deleted. The annotation of genes is done with the Gene Published List file (GPL) HG-U133 plus 2. All the preprocessing and preparation of data for clustering is done using the tool developed by Shashirekha et al. [10], currently available at https://hussain.shinyapps.io/App-1/.
3 Methods
We explore the application of k-Means Clustering technique [11] for meta-analysis of microarray gene expression data to identify DE genes. k-Means method is a simple and computationally efficient technique to perform clustering analysis. The main aim of k-Means algorithm is to divide n entities into k groups, such that the total distance between the members of the group and its corresponding centroid (representative of the group) is minimized. In k-Means, the distance is termed as within-cluster sum of squares and is defined as an objective function.
where term \(\left\| {x_{i}^{j} - c^{j} } \right\|^{{^{2} }}\) gives the distance between an entity point \(x_{i}^{j}\) and the cluster’s centroid \(c^{j}\). In this paper, we used four k-Means Clustering algorithms. Each method is discussed below.
3.1 Hartigan-Wong Algorithm
Given a set of n objects with p variables measured on each object \(x(i,j)\) for \(i = 1,2, \ldots n;\,j = 1,2, \ldots p\); k-Means Hartigan-Wong algorithm [11] assigns each object to one of the k groups or clusters in such a way that within cluster sum of squares is minimum.
where \(\overline{{x\left( {k,j} \right)}}\) is the mean variable j of all elements in group k. A matrix \(k \,x\, p\) having k initial cluster centers is required in addition to the data matrix as input to the k-Means algorithm. Each object is then initially assigned to the cluster with the nearest cluster mean. The process of assigning the data objects to similar clusters is performed by iteratively searching for the k-partitions with locally optimal within-cluster sum of squares by shifting data objects from one cluster to another.
3.2 Lloyd Algorithm
k-Means Lloyd algorithm [12] is a simple iterative algorithm which efficiently finds a local minimum to assign n objects into k groups. This algorithm due to its efficient practical applications is sometimes referred to as k-Means algorithm. The algorithm can be thought of as a potential function reducing algorithm.
The sets \(S_{j}\) are the sets of points to which \(\mu_{j}\) is the closest center. The potential function is reduced in each step of the algorithm. Considering the set of cluster centers \(\mu_{j}\) fixed, each data object is assigned to the closest cluster center. Also, assume that \(\mu\) is the center of a set of points S. Then, if \(\mu\) is moved to \(\frac{1}{\left| S \right|}\sum\nolimits_{i \in S} {x_{i} }\) then only the potential of the function is reduced. This is because \(\frac{1}{\left| S \right|}\sum\nolimits_{i \in S} {x_{i} }\) is the best possible value for \(\mu\). The algorithm therefore terminates in a local minimum.
3.3 MacQueen Algorithm
k-Means MacQueen algorithm [13] is one of the simplest algorithm to cluster objects into homogenous groups. The process follows an easy and simple way of classifying a given set of data objects into k clusters. The main idea is to define k centroids, one of each cluster and then assigning each data object to the nearest centroid. Position of the centroids is calculated again after assigning each data object to a cluster. The process of assigning the data objects to its nearest clusters is repeated till centroids do not move any more. Finally, this algorithm aims at reducing the objective function i.e. squared error function. The objective function
where term \(\left\| {x_{i}^{j} - c^{j} } \right\|^{{^{2} }}\) is a chosen distance measure between a data point \(x_{i}^{j}\) and the cluster’s center \(c^{j}\), which show the distance of the n data objects from their respective cluster centers.
3.4 Forgy Algorithm
k-Means Forgy algorithm [14] is a simple alternating least-squares algorithm. This algorithm takes data matrix as an input in addition to the number of clusters to be constructed and k samples called seed points. The seed points can be chosen on random selection or knowledge of desired cluster structure could be used to direct their selection.
Steps of the algorithm:
-
1.
Initialize the cluster centroids randomly within the data domain.
-
2.
Create k clusters by assigning the samples to the identified nearest cluster centroid.
-
3.
Stop, if none of the samples change clusters.
-
4.
Compute the centroids of the resulting clusters and go to step 2.
For a set of data points \(x_{1} ,x_{2} , \ldots ,x_{n} \, { \in } \, R^{d}\) where \(R^{d}\) is the data space of d dimensions, Forgy algorithm tries to find the set of k cluster centers \(C = \left[ {c_{1} ,c_{2} , \ldots c_{k} } \right]{ \in } \, R^{d}\) that reduces the objective function E defined by
where \(p(x)\) is the probability distribution function and d is the distance function between cluster center \(c_{i}\) and mean value \(x_{i,j}\) of x.
All the above methods are able to clearly discriminate DE and NDE genes.
5 Results
All the k-Means Clustering algorithms select 15,206 and 5099, 15,554 and 4751, and 10,193 and 2561 DE and NDE genes for PC1, PC2 and BC datasets respectively. Only the best results obtained by various clustering and statistical techniques are displayed in Tables 3, 4 and 5. Statistical techniques for datasets PC1, PC2 and BC detects 12,948, 13,499 and 10,653 DE genes and 3345, 1904 and 12 NDE genes respectively as shown in Table 2. Detection ability of DE and NDE genes by Statistical and Clustering techniques are very close to each other.
Results described in Tables 2 and 3 for PC1 show that 14,923 genes are detected as DE by k-Means Forgy algorithm and Fisher’s method. The contradiction between these two methods is for only 17 genes. Similarly, Stouffer’s method and Forgy algorithm detects 15,179 common genes as DE with a difference of 1475 genes.
Forgy method and minP method detects 12,700 common genes, showing a difference in 2741 genes. The results of MacQueen Clustering technique and Fisher’s method are much closer to each other, selecting 14,904 common DE genes and differ only in 36 genes. Other Clustering and Statistical techniques such as MacQueen and minP, MacQueen and Stouffer’s method detects 12,684 and 15,159 common genes, showing a difference of 2757 and 1495 genes respectively.
Results of common genes detected by k-Means Clustering and Statistical techniques for PC2 dataset are shown in Table 4 and for BC, results are shown in Table 5. All these results illustrate that number of genes detected as DE and NDE by k-Means Clustering and statistical techniques are very close to each other. Although different k-Means algorithms selects different number of genes as DE and NDE, they all prove to be equally robust for all the three datasets for detecting common genes. From the results, k-Means Clustering technique is proven to be an effective technique for clustering gene expression data and suggests to be used as an alternative for identifying Differentially Expressed genes.
6 Conclusion
In this paper, we explore the application of k-Means Clustering technique for meta-analysis of microarray gene expression data to identify Differentially Expressed genes. The results illustrate that clustering techniques are more robust and accurate than conventional statistical techniques. More genes are detected by clustering techniques as DE as compared to the statistical techniques AWF and Fisher’s method and almost the same genes are detected by other statistical methods. k-Means is found to be a promising alternative to statistical techniques for meta-analysis of microarray gene expression data to identify DE genes. Comparison between four k-Means Clustering techniques prove all techniques to be equally robust to detect common Differentially Expressed and Non Differentially expressed genes with respect to statistical techniques.
References
Scheetz TE, Kim K-YA, Swiderski RE, Philp AR, Braun TA, Knudtson KL, Dorrance AM, DiBona GF, Huang J, Casavant TL, Sheffield VC, Stone EM (2006) Regulation of gene expression in the mammalian eye and its relevance to eye disease. Proc. Natl. Acad. Sci. U. S. A. 103(13):14429–14434
Li J, Tseng GC (2011) An adaptively weighted statistic for detecting differential gene expression when combining multiple transcriptomic studies. Ann. Appl. Stat. 5(2):9941019
Wang X, Kang DD, Shen K, Song C, Lu S, Chang LC, Liao SG, Huo Z, Tang S, Ding Y, Kaminski N, Sibille E, Lin Y, Li J, Tseng GC (2012) An r package suite for microarray meta-analysis in quality control, differentially expressed gene analysis and pathway enrichment detection. Bioinformatics 28(19):2534–2536
Sun H, Xing X, Li J, Zhou F, Chen Y, He Y, Li W, Wei G, Chang X (2013) Identification of gene fusions from human lung cancer mass spectrometry data. BMC Genomics 14(Suppl 8):S5
Zaravinos A, Lambrou GI, Boulalas I, Delakas D, Spandidos DA (2011) Identification of common differentially expressed genes in urinary bladder cancer, PLoS One 6(4)
Alon U, Barkai N, Notterman DA, Gish K, Ybarra S, Mack D, Levine AJ (1999) Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. Proc. Natl. Acad. Sci. USA 96(12):6745–6750
Heyer LJ, Kruglyak S, Yooseph S (1999) Exploring expression data: identification and analysis of coexpressed genes exploring expression data: identification and analysis of coexpressed genes. (213):1106–1115
Lin IH, Chen DT, Chang YF, Lee YL, Su CH, Cheng C, Tsai YC, Ng SC, Chen HT, Lee MC, Chen HW, Suen SH, Chen YC, Liu TT, Chang CH, Hsu MT (2015) Hierarchical clustering of breast cancer methylomes revealed differentially methylated and expressed breast cancer genes. PLoS ONE 10(2):130
Fisher R, Fisher RA (1925) Statistical methods for research workers. Genesis Publishing, Oliver and Boyd, Edinburgh
Shashirekha HL, Wani AH (2016) ShinyMDE: shiny tool for microarray metaanalysis for differentially expressed gene detection. In: 2016 international conference on bioinformatics and systems biology (BSB), Allahabad, 2016, pp. 1–5. doi:10.1109/BSB.2016.7552152
Morissette L, Chartier S (2013) The k-means clustering technique: general considerations and implementation in Mathematica. Tutor. Quant. Methods Psychol 9(1):15–24
Lloyd SP (1982) Least squares quantization in PCM. IEEE Trans Inf Theory 28(2):129–137
MacQueen JB (1967) K-means some methods for classification and analysis of multivariate observations. In: 5th Berkeley Symposium on mathematical statistics and probability, vol. 1(233), pp. 281–297
Forgy EW (1965) Cluster analysis of multivariate data: efficiency versus interpretability of classifications. Biometrics 21:786–796
Stouffer SA (1949) A study of attitudes. Sci Am 180(5):11
Lu S, Li J, Song C, Shen K, Tseng GC (2010) Biomarker detection in the integration of multiple multi-class genomic studies. Bioinformatics 26(3):33340
Author information
Authors and Affiliations
Corresponding author
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2018 Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Wani, A.H., Shashirekha, H.L. (2018). Clustering: A Novel Meta-Analysis Approach for Differentially Expressed Gene Detection. In: Guru, D., Vasudev, T., Chethan, H., Kumar, Y. (eds) Proceedings of International Conference on Cognition and Recognition . Lecture Notes in Networks and Systems, vol 14. Springer, Singapore. https://doi.org/10.1007/978-981-10-5146-3_12
Download citation
DOI: https://doi.org/10.1007/978-981-10-5146-3_12
Published:
Publisher Name: Springer, Singapore
Print ISBN: 978-981-10-5145-6
Online ISBN: 978-981-10-5146-3
eBook Packages: EngineeringEngineering (R0)