
1 Introduction

In our approach to inferring gene regulatory networks, we merge a powerful gene expression analysis technique, clustering, with the Entropy Reduction Technique (ERT) from information theory, aiming to outperform existing information theoretic approaches such as ARACNE (Algorithm for the Reconstruction of Accurate Cellular Networks). The goal is to combine these two techniques so as to handle large datasets without being given any information about the experimental conditions under which the gene expression levels were measured. Knowledge of the regulatory network, and of the role each gene plays within it, stands to transform fields such as drug discovery and personalized medicine.

1.1 Gene Regulatory Network

Genes of a biological system do not act independently of each other. They work in a complex regulatory network with other genes and gene products such as RNA and proteins to control the expression levels of certain genes in the network, through either activation or inhibition. Two genes can be considered connected by a regulatory interaction link if the expression level of one influences the expression level of the other.

1.2 Clustering of Gene Expression Data

As a useful data mining technique, clustering has been used in the context of gene expression data to identify groupings that exist in the data and to find hidden patterns among data points. In our algorithm, we use K-means clustering because of its simplicity and its ability to efficiently cluster large datasets containing 4000 to 5000 genes along with hundreds of samples, whose output is later used with the Entropy Reduction Technique (ERT). We also use the Elbow Method (see Fig. 2) to find the optimal number of clusters.

1.3 Entropy Reduction (ER)

We use the entropy reduction approach in our algorithm to determine statistically significant regulatory interactions among genes. Entropy is a fundamental concept of Information Theory, proposed by Shannon: it measures the uncertainty of a random variable [2].

$$\begin{aligned} H(X)= -\sum _{x \in X} p(x)\log p(x) \end{aligned}$$
(1)

where H is the entropy, X is a discrete random variable with alphabet X, and p(x) is its probability mass function.

Entropy is very closely related to Mutual Information (MI), which measures the amount of information about one random variable that is contained in another; knowing one variable reduces the uncertainty about the other [2]. In the biological context, if two genes share a regulatory interaction, the mutual information between them will be high. On the other hand, if two genes act independently in the biological process, their mutual information will be close to zero. The main idea of the entropy reduction technique is that if a variable A shares a regulatory link with another variable B, then

$$\begin{aligned} H (A|B) < H (A) \end{aligned}$$
(2)

where, H(A|B) is the Conditional Entropy of A given B and H (A) is the Entropy of A [2].
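The entropy-reduction test of Eq. (2) can be sketched with plug-in estimates on discretized expression levels. This is a minimal illustration, not the published implementation: the helper names `entropy` and `conditional_entropy` and the toy vectors are ours.

```python
import math
from collections import Counter

def entropy(xs):
    """Shannon entropy H(X) of a discrete sequence, in bits (Eq. 1)."""
    n = len(xs)
    return -sum((c / n) * math.log2(c / n) for c in Counter(xs).values())

def conditional_entropy(a, b):
    """H(A|B) = H(A,B) - H(B) for paired discrete sequences."""
    return entropy(list(zip(a, b))) - entropy(b)

# Toy discretized expression levels: a is fully determined by b,
# while c is independent of b in this sample.
b = [0, 0, 1, 1, 0, 0, 1, 1]
a = [0, 0, 1, 1, 0, 0, 1, 1]   # dependent:   H(A|B) < H(A), per Eq. (2)
c = [0, 1, 0, 1, 0, 1, 0, 1]   # independent: H(C|B) = H(C)
```

Conditioning on B drives the entropy of A to zero here, while leaving the entropy of C unchanged.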

The Entropy Reduction Technique (ERT) is a relatively new approach to inferring biological networks. Previously it has been used to generate regulatory networks for small biological systems [2]. To apply ERT to large datasets, where we would like to avoid calculating large three-dimensional matrices, we use a clustering algorithm to reduce the search space so that ERT can then be applied efficiently to smaller groups of genes.

1.4 Contribution

We propose a novel approach to inferring gene regulatory networks that combines clustering of genes with the Entropy Reduction Technique, making this effective idea applicable to large datasets. We evaluate the performance of our algorithm using Precision and Recall on the dataset from the DREAM5 Network Inference Challenge [5] as well as on an in-silico dataset generated by GeneNetWeaver [10]. The resulting network is compared with the regulatory network generated by ARACNE. We also compare the results from the No-Clustering, Unmerged-Clustering and Selected-Merged-Clustering versions of our algorithm to assess the effectiveness of clustering in the regulatory networks. Although the No-Clustering version is the most effective at determining regulatory interactions among genes, the Selected-Merged-Clustering version also performs well across all datasets. Different threshold values are applied after the ERT step to eliminate less significant interactions, and the outputs of the different versions of the algorithm are compared to highlight the range of effective threshold values.

2 Related Work

Several types of information theoretic approaches have been used to reverse engineer gene regulatory networks from expression data. Here we discuss ARACNE and the Entropy Reduction Technique (ERT) that we use in our algorithm.

2.1 ARACNE

ARACNE (Algorithm for the Reconstruction of Accurate Cellular Networks) is an information theoretic algorithm that uses microarray gene expression data to generate transcriptional regulatory networks [3]. It identifies a link between two genes as a true regulatory interaction if the statistical dependency between those genes is irreducible. ARACNE defines potential regulatory interactions between two genes based on their Mutual Information (MI). After generating pairwise mutual information, it applies a threshold value, eliminating as not significant the links between gene pairs that fall below it. The problem with this MI-based approach is that it also labels as true regulatory interactions the indirect interactions between genes that are highly co-regulated through a third gene, resulting in a large number of false positives. ARACNE solves this problem with a technique called the Data Processing Inequality (DPI) [3].

Fig. 1. DPI technique.

The idea of DPI is that if Gene g1 and Gene g3 (see Fig. 1) interact only indirectly through a third Gene g2, then the link between g1 and g3 will be removed by DPI.

$$\begin{aligned} I(g1, g3) \le \min [I(g1, g2), I(g2, g3)] \end{aligned}$$
(3)

where I is the MI between gene pairs.
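The DPI pruning rule of Eq. (3) can be sketched as follows. This is a minimal illustration rather than ARACNE's actual implementation: the function name `apply_dpi` and the dict-based MI representation are our assumptions.

```python
def apply_dpi(mi, eps=0.0):
    """Prune the weakest edge of every gene triplet (Data Processing
    Inequality): drop edge (i, j) if some third gene k satisfies
    I(i, j) <= min(I(i, k), I(k, j)).  `mi` maps unordered pairs to MI."""
    genes = sorted({g for pair in mi for g in pair})
    get = lambda i, j: mi.get((i, j), mi.get((j, i), 0.0))
    kept = {}
    for (i, j), v in mi.items():
        indirect = any(
            k not in (i, j) and v <= min(get(i, k), get(k, j)) - eps
            for k in genes
        )
        if not indirect:
            kept[(i, j)] = v
    return kept

# g1-g2 and g2-g3 are strong; the weaker g1-g3 edge is likely indirect.
mi = {("g1", "g2"): 0.9, ("g2", "g3"): 0.8, ("g1", "g3"): 0.3}
pruned = apply_dpi(mi)   # the g1-g3 link is removed
```

A small tolerance `eps` can be used, as in ARACNE, to avoid pruning edges whose MI values are nearly equal.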

2.2 Entropy Reduction Technique

The reason for using concepts from information theory, such as Entropy and Mutual Information, is to generate biological networks without requiring any background theoretical knowledge.

Entropy is closely related to Mutual Information, which measures how much information one variable contains about another. Mutual Information I can be expressed in terms of the Joint Entropy H(X,Y) and the Conditional Entropy H(Y|X) as follows [2]:

$$\begin{aligned} H(X,Y) = -\sum _{x} \sum _{y} p(x, y)\log p(x, y) \end{aligned}$$
(4)
$$\begin{aligned} H(Y|X) = -\sum _{x} \sum _{y} p(x, y)\log p(y|x) \end{aligned}$$
(5)
$$\begin{aligned} I(X,Y) = H(X) - H(X|Y) = H(X) + H(Y) - H(X,Y) \end{aligned}$$
(6)

The basic idea of the Entropy Reduction technique is that if a variable A does not depend on a variable B, then the entropy of A given B equals the entropy of A; but if A depends on B, then the entropy of A given B is less than the entropy of A [2].

$$\begin{aligned} H(A|B)=H(A), {\textbf {A and B are independent of each other}} \end{aligned}$$
(7)
$$\begin{aligned} H(A|B) < H(A), {\textbf {A and B have dependency relationship}} \end{aligned}$$
(8)

This works well when the regulatory network is very small [2], but for large networks containing thousands of genes it is extremely time consuming and infeasible in real applications.

3 Clustering of Gene Expression Data

The main goal of clustering is to group similar data points into the same cluster and dissimilar data points into different clusters. In the context of genetics, the similarity measure can be the expression or co-expression level of genes [7]. If Gene A and Gene B are grouped in the same cluster based on expression level, this suggests that they take part in the same biological process; moreover, strong co-expression among genes also suggests co-regulation [7]. Various supervised, semi-supervised and unsupervised algorithms have been used in systems biology [9, 11], but considering the globular shape of regulatory networks, we focus primarily on K-means clustering.

3.1 K-Means Clustering

K-means clustering belongs to the subgroup of clustering methods called Partitioning Clustering, in which each data point is assigned to exactly one of a set of non-overlapping clusters. It is a simple and fast unsupervised clustering technique.

To measure the quality of a clustering of data points, the Sum of Squared Errors (SSE) is computed:

$$\begin{aligned} SSE = \sum _{i=1}^{k} \sum _{X \in C_i} dist(C_i, X)^{2} \end{aligned}$$
(9)

Here dist is the Euclidean distance between a point X and the centroid \(C_i\) of the i-th cluster, which is defined (with m the number of points in cluster i) by

$$\begin{aligned} C_i = \frac{1}{m} \sum _{X \in C_i} X \end{aligned}$$
(10)

For our algorithm, we use Lloyd's version of K-means [8] with twenty runs for each value of K, keeping the run with the minimum SSE. To find the optimal value of K for a given dataset, we run K-means for K = 2 to K = 100 and plot the within-cluster SSE against the number of clusters to identify the point of maximum reduction in SSE. This is known as the Elbow Method (see Fig. 2).
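The elbow procedure can be sketched as follows. The minimal Lloyd's K-means below is a simplified stand-in for the twenty-run version described above (deterministic spread initialization instead of repeated random restarts), on toy two-cluster data of our own making.

```python
import random

def kmeans_sse(points, k, iters=50):
    """Minimal Lloyd's K-means on a list of feature vectors; returns SSE."""
    centers = [points[i * len(points) // k] for i in range(k)]  # spread init
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k),
                    key=lambda c: sum((a - b) ** 2
                                      for a, b in zip(p, centers[c])))
            clusters[i].append(p)
        centers = [[sum(dim) / len(cl) for dim in zip(*cl)] if cl else centers[i]
                   for i, cl in enumerate(clusters)]
    return sum(sum((a - b) ** 2 for a, b in zip(p, centers[i]))
               for i, cl in enumerate(clusters) for p in cl)

# Two well-separated toy "expression profiles": the SSE curve drops
# sharply up to K = 2 and flattens afterwards -- the elbow.
rng = random.Random(1)
data = [[rng.gauss(0, 0.1), rng.gauss(0, 0.1)] for _ in range(30)] + \
       [[rng.gauss(5, 0.1), rng.gauss(5, 0.1)] for _ in range(30)]
sse = {k: kmeans_sse(data, k) for k in (1, 2, 3, 4)}
```

In practice the SSE-vs-K curve is plotted and the K at the sharpest bend is chosen, as in Fig. 2.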

Gene expression data can be clustered in two ways: i) by row (genes), clustering genes into groups while treating samples as features, and ii) by column (samples), clustering samples while treating genes as features. Both have practical uses. For our method we use row-based, or gene-based, clustering to group genes by similar co-expression levels.

Fig. 2. Elbow method to find the optimal value of K.

We use three different versions of clustering together with the ERT step and compare their effects on the efficiency of the ERT process. The versions are described below:

No Cluster Version: In this version, we do not cluster the dataset; instead we run the ERT step on the entire dataset to find the true regulatory interactions among genes. Given the complexity of the ERT step, generating the connection matrix takes a significant amount of time.

Unmerged Version: For this version, we cluster the dataset into a given number of clusters and run ERT on the genes of each cluster separately. We then merge the connection matrices returned from each cluster into an n-by-n connection matrix, where n is the number of genes. In this version, genes that have true regulatory interactions with genes in a different cluster are not identified.

Selected Merged Version: With this version, after running the unmerged version of the algorithm, an additional merging of "close" clusters is carried out. For a given cluster, we find the Euclidean distance to its nearest cluster centroid and double it; every cluster whose centroid falls within this doubled distance is considered "close". Finally, after applying the Entropy Reduction step, we merge all the connection matrices from individual clusters and close-cluster pairs into an n-by-n connection matrix, where n is the number of genes.
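The "close cluster" rule just described can be sketched directly; `close_clusters` is a hypothetical helper name of ours.

```python
def close_clusters(centroids, i):
    """Indices of clusters 'close' to cluster i: those whose centroid lies
    within twice the distance of i's nearest neighbouring centroid."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
    others = [j for j in range(len(centroids)) if j != i]
    limit = 2 * min(dist(centroids[i], centroids[j]) for j in others)
    return [j for j in others if dist(centroids[i], centroids[j]) <= limit]

# Cluster 0's nearest neighbour is cluster 1 at distance 1, so the
# doubled limit is 2: clusters 1 and 2 are "close", cluster 3 is not.
centroids = [(0.0, 0.0), (1.0, 0.0), (1.5, 0.5), (10.0, 10.0)]
```

ERT is then re-run on each pair consisting of a cluster and one of its close clusters.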

4 Method

In this section, we describe the components of our algorithm in detail. The algorithm can be divided into two main parts: the Clustering part and the Entropy Reduction (ERT) part.

4.1 The Clustering Part

Input: Data matrix A of dimension n \(\times \) m, where n is the number of genes and m is the number of samples or experiments; the number of clusters K; the algorithm for K-means clustering; and the maximum number of iterations for K-means.

Algorithm:

  1. Cluster the dataset into K different clusters using the given algorithm.

  2. Generate a list L of K elements, where each element contains all the data points assigned to one cluster. This list is used in the Cluster Merging and ERT steps.

  3. Calculate a distance matrix D of dimension \(K \times K\) storing the distance between each pair of cluster centers.

Output: L, a list of K elements, and the distance matrix D.
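The three steps above can be sketched as follows, assuming labels and centroids produced by any finished K-means run; the function name `clustering_part` is illustrative.

```python
def clustering_part(labels, centers):
    """Build L (per-cluster gene indices) and D (the K x K matrix of
    pairwise centroid distances) from a finished K-means run."""
    k = len(centers)
    L = [[g for g, lab in enumerate(labels) if lab == c] for c in range(k)]
    D = [[sum((a - b) ** 2 for a, b in zip(centers[i], centers[j])) ** 0.5
          for j in range(k)] for i in range(k)]
    return L, D

# Hypothetical run: four genes assigned to two clusters.
L, D = clustering_part([0, 0, 1, 1], [(0.0, 0.0), (3.0, 4.0)])
```

L drives the per-cluster ERT runs, while D is later used to find close clusters for the Selected Merged version.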

4.2 Entropy Reduction Part for One Cluster [2]

Input: Cluster ID

Algorithm:

  1. Collect the n data points of the given cluster ID from the list L generated in the Clustering part.

  2. Generate a data matrix TD of these genes from the main data matrix A, where columns contain genes and rows contain samples.

  3. Discretize TD.

  4. Calculate the Mutual Information (MI) matrix M of dimension n \(\times \) n and normalize it into an n \(\times \) n matrix NMI using the Linfoot definition of normalization, so that the mutual information values lie in the range 0 to 1 [2].

    $$\begin{aligned} {\textbf {M [i, j] = MutualInformation(TD[, i], TD[, j])}} \end{aligned}$$

  5. Calculate the single-entropy vector E of dimension n.

    $$\begin{aligned} \mathbf {E[i] = Entropy (TD [, i])} \end{aligned}$$

  6. Calculate an n \(\times \) n Conditional Entropy matrix CE between all pairs of variables.

    $$\begin{aligned} {\textbf {CE [i, j] = ConditionalEntropy(TD[, i], TD[, j])}} \end{aligned}$$

  7. Calculate an n \(\times \) n Reduced Entropy matrix RE between each pair of variables i, j using the mutual information matrix M and the single-entropy vector E:

    $$\begin{aligned} {\textbf {RE [i, j] = M[i, j]/E[i]}} \end{aligned}$$

  8. Generate an n \(\times \) n ERT matrix ERTM using the following condition (an interaction is retained when conditioning on gene j reduces the entropy of gene i, as in Eq. (2)):

    $$\begin{aligned} {\textbf {If CE [i, j] < E[i], then ERTM [i, j] = 1 Else ERTM [i, j] = 0}} \end{aligned}$$

  9. Generate a connection matrix C of dimension n \(\times \) n using the following condition:

    $$\begin{aligned} {\textbf {If ERTM [i, j] == 1, then C [i, j] = RE [i, j] Else C [i, j] = 0}} \end{aligned}$$

After this step, each cell of the connection matrix contains the reduced entropy between two genes.

Output: Connection matrix C.
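Steps 3-9 above can be sketched as a single pass over a samples \(\times \) genes matrix. This is a simplified stand-in for the described procedure: we assume equal-width binning for the discretization (the paper does not specify one), use the strict inequality CE < E for the retention test (as stated in the Conclusion), and omit the Linfoot normalization.

```python
import math
from collections import Counter

def H(*cols):
    """(Joint) Shannon entropy of aligned discrete columns, in bits."""
    tuples = list(zip(*cols))
    n = len(tuples)
    return -sum((c / n) * math.log2(c / n) for c in Counter(tuples).values())

def discretize(col, bins=2):
    """Equal-width binning of one expression profile (assumed scheme)."""
    lo, hi = min(col), max(col)
    w = (hi - lo) / bins or 1.0
    return [min(int((v - lo) / w), bins - 1) for v in col]

def ert_connection_matrix(td, tol=1e-9):
    """Steps 3-9 for one cluster: td is a samples x genes matrix."""
    cols = [discretize([row[j] for row in td]) for j in range(len(td[0]))]
    n = len(cols)
    E = [H(c) for c in cols]                               # step 5
    C = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            mi = E[i] + E[j] - H(cols[i], cols[j])         # M[i, j], step 4
            ce = H(cols[i], cols[j]) - E[j]                # CE[i, j], step 6
            if E[i] > 0 and ce < E[i] - tol:               # ERTM, step 8
                C[i][j] = mi / E[i]                        # RE[i, j], steps 7, 9
    return C

# Toy data: gene 0 and gene 1 track each other; gene 2 is independent.
td = [[1, 1, 1], [1, 1, 5], [5, 5, 1], [5, 5, 5],
      [1, 1, 1], [1, 1, 5], [5, 5, 1], [5, 5, 5]]
C = ert_connection_matrix(td)
```

Here C[0][1] is 1 (full entropy reduction) while C[0][2] stays 0, so only the dependent pair survives into the connection matrix.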

To apply the ERT algorithm to two clusters for the Selected-Merged version of the algorithm, we first identify the closest clusters of a given cluster by the process described in Sect. 3.1 under the Selected Merged Version, and then run the ERT algorithm described above. The output connection matrix C has dimension n \(\times \) m, where n is the number of genes in the first cluster and m is the number of genes in the second cluster.

After running ERT on all individual clusters and all pairs of closest clusters, all the returned matrices are combined together in a connection matrix of dimension n \(\times \) n. Different threshold values are applied on the connection matrix to evaluate the performance of the algorithm.

The overall algorithm can be described as follows:

  1. Cluster the dataset into the given number of clusters.

  2. Apply the ERT algorithm to each cluster.

  3. Merge each cluster with its closest clusters.

  4. Apply the ERT algorithm to all the merged clusters and combine all the resulting connection matrices from ERT into an n \(\times \) n final connection matrix, where n is the number of genes.

We also experimented with the Data Processing Inequality (DPI) technique [3] from the ARACNE algorithm, applied after the final connection matrix of genes is generated, to verify whether combining the DPI technique with our algorithm improves the accuracy of the generated regulatory network even further. In the results section, we present only the "Before DPI" results, i.e. the results obtained without applying the DPI technique.

5 Results

To evaluate our algorithm, we use the dataset from the DREAM5 Network Inference Challenge of the DREAM Challenges [5]. We use precision and recall as performance measures. Precision is defined by

$$\begin{aligned} Precision = \frac{TP}{TP+FP} \end{aligned}$$
(11)

where TP is the True Positive prediction of a regulatory interaction between a pair of genes and FP is the False Positive prediction.

Recall is defined by,

$$\begin{aligned} Recall = \frac{TP}{TP+FN} \end{aligned}$$
(12)

where FN is a False Negative Prediction of regulatory interaction.
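Precision and recall over a predicted edge set can be computed as follows; the example edges are illustrative and the helper name is ours.

```python
def precision_recall(predicted, gold):
    """Precision and recall of predicted regulatory links against a gold
    standard; edges are treated as unordered gene pairs."""
    pred = {frozenset(e) for e in predicted}
    true = {frozenset(e) for e in gold}
    tp = len(pred & true)                      # true positives
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(true) if true else 0.0
    return precision, recall

pred = [("g1", "g2"), ("g2", "g3"), ("g1", "g4")]
gold = [("g1", "g2"), ("g2", "g3"), ("g3", "g4")]
p, r = precision_recall(pred, gold)   # 2 of 3 predictions and 2 of 3
                                      # gold edges match
```

Sweeping the threshold on the connection matrix shifts this trade-off: lower thresholds add predicted edges (raising recall), higher thresholds remove them (raising precision).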

5.1 Results for DREAM5-Network Inference Challenge Dataset [5]

First, we look at the effect of different threshold values (see Fig. 3, top), which are used as cutoff points to identify true positives, i.e. regulatory interactions among genes. If the reduced-entropy value between a pair of genes is above a given threshold, we deduce that the genes share a regulatory interaction. At the same time, we also measure the effectiveness of clustering in identifying interactions (see Fig. 3, bottom). Our goal is to find out whether clustering the datasets before applying ERT yields results similar to the no-clustering approach (denoted as NC in Fig. 3, bottom).

Fig. 3. True positives vs. clusters for different threshold values (top); true positives vs. threshold for different cluster numbers (bottom).

Fig. 4. True positives vs. cluster number for ARACNE, no-clustering, selected-merged and unmerged clustering.

Fig. 5. False positives vs. clusters for ARACNE, no-clustering, selected-merged and unmerged clustering.

Next, we compare the results from ARACNE (denoted as ARC in Fig. 4) and three different versions of our algorithm in identifying correct regulatory interactions under different threshold values (see Fig. 4).

Fig. 6. Precision and recall for ARACNE, no-clustering and selected-merged clustering.

We then compare the false positive results from ARACNE and different clustering versions (see Fig. 5) for similar threshold values.

Precision and Recall Graphs for Different Threshold Values

Finally, we compare the precision and recall rate derived from ARACNE, no-clustering and selected-merged clustering versions of our algorithm (Fig. 6).

6 Discussion

Without the DPI technique, our algorithm shows great improvements over ARACNE in inferring true regulatory interactions at lower threshold values. We also find that, before the DPI technique is applied, the No-Clustering version of our algorithm ranks highest among ARACNE and the Unmerged-Clustering and Selected-Merged-Clustering versions in identifying true interactions, but lower numbers of clusters in the Selected-Merged version produce results similar to the No-Clustering version. For higher numbers of clusters, both the true positives and the false positives are reduced compared to lower numbers of clusters. Threshold values lower than the mean of all non-zero values of the final connection matrix (which contains the reduced entropy for each pair of genes) result in high numbers of both true and false positives; for values greater than the mean, the reduction in false positives is much greater than the reduction in true positives. As is evident from the precision and recall graph for different cluster numbers, our algorithm produces a high recall rate before the DPI technique is applied. Another important observation from the same graph is that smaller threshold values yield a high recall rate, whereas higher threshold values yield a high precision rate. For our algorithm to be useful in real-world applications, reasonable threshold values must therefore be chosen according to the goal of the task.

7 Conclusion and Future Work

In this work, we have proposed a novel algorithm that makes the entropy reduction technique usable in real-world applications by reducing the search space of regulatory interactions for large networks. The algorithm shows significant improvements in generating accurate regulatory networks when used with appropriate threshold values. For the clustering part of our algorithm we have used only K-means clustering, for its simplicity; in the future we want to pair other clustering algorithms, such as spectral clustering and affinity propagation, with the Entropy Reduction Technique and compare their performance with our current approach. For the Selected-Merged-Clustering version of the algorithm we hope to use better measures for identifying close clusters. In the current implementation, we consider a gene pair to have a true interaction if the conditional entropy is less than the single entropy; in the future, it would be better to derive threshold values from the datasets and consider interactions to be true only if they exceed those thresholds.