Keywords

1 Introduction

With the increase in genome-wide investigation of gene expression data, analysis of high throughput microarray data has become a routine for biomedical research [1]. Investigation of gene expression levels within different organisms, in different tissue types, and in various clinical conditions may help in understanding the gene functions, detecting Differentially Expressed (DE) genes and help in the diagnosis of disease conditions. DE genes provide important information to understand the molecular mechanism of disease initiation and progression and are very helpful in disease diagnosis and prevention.

Microarray gene expression data is high dimensional in nature but have very less samples (hundreds in number). In order to address this issue, efforts have been put for microarray meta-analysis, where data from multiple microarray studies with relevant biological hypotheses are combined to improve DE gene detection [2, 3]. Despite the availability of various meta-analysis methods, ability of each method to identify DE genes under different hypothesis settings differs. Each method has its own advantages and disadvantages in terms of robustness, detection capabilities, biological significance of detected genes and stability [4].

The fact that genes in functionally related patterns show normal expression levels suggests that clustering techniques can be a helpful tool to group genes into homogenous clusters. Clustering acts as an initial step in the analysis of gene expression data for detecting groups of genes that show similar expression patterns. In gene expression data, clustering is done on the basis of similarity measure between gene expression levels which is measured by correlation coefficient between expression levels of different genes.

To group genes into homogeneous clusters for biological analysis, several clustering techniques were earlier used for low sample size gene expression data. Zaravinos et al. [5] performed Hierarchical and k-Means Clustering on microarray gene expression data to identify common DE genes among clinically relevant subclasses of Bladder Cancer. Alon et al. [6] performed analysis of 40 tumor and 22 normal colon tissue samples of different cell types using two way clustering technique to reveal hidden patterns in gene expression data. Heyer et al. [7] used a similarity measure to decrease the percentage of false positives, and then applied a clustering algorithm for grouping gene expression patterns of microarray gene expression data. Lin et al. [8] performed Hierarchical Clustering of hypomethylated regions, which identified differential methylated enhancers and tumor-specific hypermethylated clusters related to normal or breast cancer cell lines.

Many statistical methods have been previously used to perform meta-analysis on gene expression data for identification of DE genes [3, 8, 9]. These methods combine p-values from each test to a single test statistics. With these methods, genes with the significant statistical value are reported as DE genes. However, these methods ignore the dependency between the genes in gene expression data which may lead to poor performance in detecting DE genes under different hypothesis settings. Clustering has an advantage of considering the correlation that exists between genes, which can be an efficient technique for detecting DE genes.

In this paper, we explore k-Means Clustering technique as a novel approach for meta-analysis of microarray gene expression data to identify DE genes. This study is first of its kind to explore the applications and robustness of k-Means Clustering techniques to perform meta-analysis of microarray gene expression data for DE gene detection.

2 Datasets and Pre-processing

For clustering gene expression data, suitable microarray studies are identified with similar biological hypothesis corresponding to Prostate Cancer and Breast Cancer as described in Table 1. Probe sets (a collection of probes designed to interrogate a given sequence) suffer from duplication, i.e. many probe sets refer to same gene. During annotation phase, all the probe sets are annotated with the same gene, creating duplicates. To remove the duplicate genes, the genes with the significant probability value are kept, while others deleted. The annotation of genes is done with the Gene Published List file (GPL) HG-U133 plus 2. All the preprocessing and preparation of data for clustering is done using the tool developed by Shashirekha et al. [10], currently available at https://hussain.shinyapps.io/App-1/.

Table 1 Description of Datasets

3 Methods

We explore the application of k-Means Clustering technique [11] for meta-analysis of microarray gene expression data to identify DE genes. k-Means method is a simple and computationally efficient technique to perform clustering analysis. The main aim of k-Means algorithm is to divide n entities into k groups, such that the total distance between the members of the group and its corresponding centroid (representative of the group) is minimized. In k-Means, the distance is termed as within-cluster sum of squares and is defined as an objective function.

$$J = \sum\limits_{j = 1}^{k} {\sum\limits_{i = 1}^{n} {\left\| {x_{i}^{j} - c^{j} } \right\|^{{^{2} }} } }$$
(1)

where term \(\left\| {x_{i}^{j} - c^{j} } \right\|^{{^{2} }}\) gives the distance between an entity point \(x_{i}^{j}\) and the cluster’s centroid \(c^{j}\). In this paper, we used four k-Means Clustering algorithms. Each method is discussed below.

3.1 Hartigan-Wong Algorithm

Given a set of n objects with p variables measured on each object \(x(i,j)\) for \(i = 1,2, \ldots n;\,j = 1,2, \ldots p\); k-Means Hartigan-Wong algorithm [11] assigns each object to one of the k groups or clusters in such a way that within cluster sum of squares is minimum.

$$Sum(k) = \sum\limits_{i = 0}^{n} {\sum\limits_{j = 0}^{p} {\left( {x(i,j) - \overline{x(k,j)} } \right)^{{^{2} }} } }$$
(2)

where \(\overline{{x\left( {k,j} \right)}}\) is the mean variable j of all elements in group k. A matrix \(k \,x\, p\) having k initial cluster centers is required in addition to the data matrix as input to the k-Means algorithm. Each object is then initially assigned to the cluster with the nearest cluster mean. The process of assigning the data objects to similar clusters is performed by iteratively searching for the k-partitions with locally optimal within-cluster sum of squares by shifting data objects from one cluster to another.

3.2 Lloyd Algorithm

k-Means Lloyd algorithm [12] is a simple iterative algorithm which efficiently finds a local minimum to assign n objects into k groups. This algorithm due to its efficient practical applications is sometimes referred to as k-Means algorithm. The algorithm can be thought of as a potential function reducing algorithm.

$$f(k{\text{-}}means) = \sum\limits_{j \in [k]} {\sum\limits_{{i \in S_{j} }} {\left\| {x_{i} - \mu_{j} } \right\|^{{^{2} }} } }$$
(3)

The sets \(S_{j}\) are the sets of points to which \(\mu_{j}\) is the closest center. The potential function is reduced in each step of the algorithm. Considering the set of cluster centers \(\mu_{j}\) fixed, each data object is assigned to the closest cluster center. Also, assume that \(\mu\) is the center of a set of points S. Then, if \(\mu\) is moved to \(\frac{1}{\left| S \right|}\sum\nolimits_{i \in S} {x_{i} }\) then only the potential of the function is reduced. This is because \(\frac{1}{\left| S \right|}\sum\nolimits_{i \in S} {x_{i} }\) is the best possible value for \(\mu\). The algorithm therefore terminates in a local minimum.

3.3 MacQueen Algorithm

k-Means MacQueen algorithm [13] is one of the simplest algorithm to cluster objects into homogenous groups. The process follows an easy and simple way of classifying a given set of data objects into k clusters. The main idea is to define k centroids, one of each cluster and then assigning each data object to the nearest centroid. Position of the centroids is calculated again after assigning each data object to a cluster. The process of assigning the data objects to its nearest clusters is repeated till centroids do not move any more. Finally, this algorithm aims at reducing the objective function i.e. squared error function. The objective function

$$J = \sum\limits_{j = 1}^{k} {\sum\limits_{i = 1}^{n} {\left\| {x_{i}^{j} - c^{j} } \right\|^{{^{2} }} } }$$
(4)

where term \(\left\| {x_{i}^{j} - c^{j} } \right\|^{{^{2} }}\) is a chosen distance measure between a data point \(x_{i}^{j}\) and the cluster’s center \(c^{j}\), which show the distance of the n data objects from their respective cluster centers.

3.4 Forgy Algorithm

k-Means Forgy algorithm [14] is a simple alternating least-squares algorithm. This algorithm takes data matrix as an input in addition to the number of clusters to be constructed and k samples called seed points. The seed points can be chosen on random selection or knowledge of desired cluster structure could be used to direct their selection.

Steps of the algorithm:

  1. 1.

    Initialize the cluster centroids randomly within the data domain.

  2. 2.

    Create k clusters by assigning the samples to the identified nearest cluster centroid.

  3. 3.

    Stop, if none of the samples change clusters.

  4. 4.

    Compute the centroids of the resulting clusters and go to step 2.

For a set of data points \(x_{1} ,x_{2} , \ldots ,x_{n} \, { \in } \, R^{d}\) where \(R^{d}\) is the data space of d dimensions, Forgy algorithm tries to find the set of k cluster centers \(C = \left[ {c_{1} ,c_{2} , \ldots c_{k} } \right]{ \in } \, R^{d}\) that reduces the objective function E defined by

$$E = \sum\limits_{i = 1}^{k} {\int {p(x)d\left( {c_{i} ,x_{i,j} } \right)d_{x} } }$$
(5)

where \(p(x)\) is the probability distribution function and d is the distance function between cluster center \(c_{i}\) and mean value \(x_{i,j}\) of x.

All the above methods are able to clearly discriminate DE and NDE genes.

4 Validation

To validate the results of clustering techniques, the well-known statistical techniques namely, Fisher’s Method [9], Stouffer’s Method [15], Adaptive Weighted Fisher’s (AWF) Method [2] and Minimum p-value (minP) [16] are used.

5 Results

All the k-Means Clustering algorithms select 15,206 and 5099, 15,554 and 4751, and 10,193 and 2561 DE and NDE genes for PC1, PC2 and BC datasets respectively. Only the best results obtained by various clustering and statistical techniques are displayed in Tables 3, 4 and 5. Statistical techniques for datasets PC1, PC2 and BC detects 12,948, 13,499 and 10,653 DE genes and 3345, 1904 and 12 NDE genes respectively as shown in Table 2. Detection ability of DE and NDE genes by Statistical and Clustering techniques are very close to each other.

Table 2 DE and NDE genes detected by different Statistical techniques

Results described in Tables 2 and 3 for PC1 show that 14,923 genes are detected as DE by k-Means Forgy algorithm and Fisher’s method. The contradiction between these two methods is for only 17 genes. Similarly, Stouffer’s method and Forgy algorithm detects 15,179 common genes as DE with a difference of 1475 genes.

Table 3 Common DE and NDE genes detected by Statistical and Clustering techniques in PC1 Dataset

Forgy method and minP method detects 12,700 common genes, showing a difference in 2741 genes. The results of MacQueen Clustering technique and Fisher’s method are much closer to each other, selecting 14,904 common DE genes and differ only in 36 genes. Other Clustering and Statistical techniques such as MacQueen and minP, MacQueen and Stouffer’s method detects 12,684 and 15,159 common genes, showing a difference of 2757 and 1495 genes respectively.

Results of common genes detected by k-Means Clustering and Statistical techniques for PC2 dataset are shown in Table 4 and for BC, results are shown in Table 5. All these results illustrate that number of genes detected as DE and NDE by k-Means Clustering and statistical techniques are very close to each other. Although different k-Means algorithms selects different number of genes as DE and NDE, they all prove to be equally robust for all the three datasets for detecting common genes. From the results, k-Means Clustering technique is proven to be an effective technique for clustering gene expression data and suggests to be used as an alternative for identifying Differentially Expressed genes.

Table 4 Common DE and NDE genes detected by statistical and clustering techniques in PC2 dataset
Table 5 Common DE and NDE genes detected by statistical and clustering techniques in BC dataset

6 Conclusion

In this paper, we explore the application of k-Means Clustering technique for meta-analysis of microarray gene expression data to identify Differentially Expressed genes. The results illustrate that clustering techniques are more robust and accurate than conventional statistical techniques. More genes are detected by clustering techniques as DE as compared to the statistical techniques AWF and Fisher’s method and almost the same genes are detected by other statistical methods. k-Means is found to be a promising alternative to statistical techniques for meta-analysis of microarray gene expression data to identify DE genes. Comparison between four k-Means Clustering techniques prove all techniques to be equally robust to detect common Differentially Expressed and Non Differentially expressed genes with respect to statistical techniques.