1 Introduction

Gene regulatory networks specify how biological systems respond to perturbations through the rewiring of molecular interactions. Co-expression networks provide a framework to better understand molecular mechanisms and gene regulation. Thanks to the increasing availability of transcriptomic data, robust gene co-expression networks are becoming more widely available. A differential co-expression network is a particular type of network used to identify changes in response to external stimuli (e.g., changes in the activity of gene expression regulators or signaling [3, 7, 28]). Differential co-expression network analysis is an approach for identifying modules of genes with meaningful variation between different experimental conditions (e.g., control and stress). Such an analysis takes as input gene expression data measured under control and stress conditions, and outputs a set of genes that are likely to be involved in the biological response to the specific stress.

Technically, differential co-expression analysis builds a network from the relationships between the genes’ differential expression profiles, that is, the log fold change (LFC) between control and stress expression across multiple observations for each gene. The LFC is the logarithm of the ratio between the control expression and the expression under stress. Building the network involves two key steps. First, a correlation measure between the genes is set. Second, the list of gene pairs is filtered using a threshold value for the correlation score [29]. The Pearson correlation coefficient is the most popular measure used for the first step. It assumes linear correlation and normally distributed values, and it is sensitive to outliers [17]. A major limitation of using this approach for building a differential co-expression network is that it demands large enough sample sizes for statistically reliable results [6, 20, 22]. However, large sample sizes are often prohibitive due to time and computational constraints. For instance, a differential co-expression network with N genes across M control samples and M stressed samples is built by first building a matrix \(X \in \mathbb {R}^{N\times M}\), where each entry \(X(n,m)\) corresponds to the LFC value of gene n for sample m. Then, Pearson’s coefficient is computed for each pair of distinct genes \(n_1\) and \(n_2\) with the input vectors \(X(n_1,\_)\) and \(X(n_2, \_)\), each of length M. That is, assuming X as input, building the differential co-expression network takes \(O\left( \left( {\begin{array}{c}N\\ 2\end{array}}\right) M \right) = O(N^2M)\) time, where O(M) is the usual time for computing Pearson’s coefficient on M samples.
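The correlation-based construction described above can be sketched as follows. This is a minimal illustration and not the implementation used in [26]: the function names, the pseudocount, the sign convention of the LFC, and the threshold value are all assumptions introduced here.

```python
import math

def log_fold_change(control, stress, pseudocount=1.0):
    """LFC of one gene per sample. Pseudocount and sign convention are assumptions."""
    return [math.log2((s + pseudocount) / (c + pseudocount))
            for c, s in zip(control, stress)]

def pearson(x, y):
    """Pearson correlation of two equal-length vectors, computed in O(M) time."""
    m = len(x)
    mean_x, mean_y = sum(x) / m, sum(y) / m
    dx = [v - mean_x for v in x]
    dy = [v - mean_y for v in y]
    num = sum(a * b for a, b in zip(dx, dy))
    den = math.sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))
    return num / den if den > 0 else 0.0

def pearson_network(X, threshold):
    """Keep pair (n1, n2) when |r| >= threshold; visits all N(N-1)/2 pairs."""
    edges = []
    for n1 in range(len(X)):
        for n2 in range(n1 + 1, len(X)):
            if abs(pearson(X[n1], X[n2])) >= threshold:
                edges.append((n1, n2))
    return edges

# Toy LFC matrix: 3 genes (rows) observed in 4 samples (columns).
X = [[1.0, 2.0, 3.0, 4.0],
     [2.0, 4.0, 6.0, 8.0],
     [1.0, -1.0, 1.0, -1.0]]
edges = pearson_network(X, threshold=0.8)
```

The double loop over gene pairs, each costing O(M), is what gives the \(O(N^2M)\) running time mentioned above.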

This paper introduces a method based on the penalized least absolute shrinkage and selection operator (Lasso) [33] for building differential co-expression networks. It overcomes the above-mentioned computational limitation by computing a regression of the differential expression profile of one gene against all others, instead of using a correlation metric to identify significant edges between genes. The resulting coefficients represent the strength of the relationship between the corresponding pair of genes, where zero strength indicates no edge between them. Additionally, Lasso has the advantage of yielding accurate parameter estimates even with small sample sizes [13]. For N genes and M samples (both under control and stress), building the differential co-expression network with Lasso takes \(O(NM^2)\) time. However, for the case where there are many more variables than observations (i.e., \(N \gg M\)), as is usual for biological expression datasets, it takes O(NM) time [11]. While the proposed approach is used in this work for building differential co-expression networks, it can also be used more generally for building co-expression networks.

Lasso simultaneously performs variable selection and regularization by forcing the least significant coefficients to be zero, which naturally favors the inference of a sparse network. It performs \(\ell _1\) regularization by forcing the sum of the absolute values of the regression coefficients to be less than a fixed value. Lasso iteratively searches for a degree of penalty \(\lambda \) that minimizes the mean square error of the regression. At the optimal value of \(\lambda \), it performs variable selection, which results in a reduced number of non-zero coefficients. The variables with a zero-valued coefficient are excluded from impacting the regression, which prevents the model from over-fitting. These properties are particularly useful in the construction of co-expression networks since these types of networks are expected to be sparse [36]. Moreover, Lasso avoids the need to define a threshold for selecting meaningful edges from the pairwise relationships between genes. Finally, since Lasso supports small sample sizes, it is well suited for most typical transcriptomic data sets.

The proposed approach is used in combination with a slightly modified version of the workflow proposed in [26] (with a different module detection algorithm, namely, the ANGEL algorithm [27]) to identify genes that respond to salt stress. RNA-seq data was accessed from the GEO database [8], accession number GSE98455. It represents 57,845 gene expression profiles of shoot tissues measured under control and stress conditions in 92 accessions of the Rice Diversity Panel 1 [12]. A total of 25 genes are identified as responsive to salt stress, all of which are also differentially expressed genes (DEG). About half of these genes (11) are reported with a statistically significant number of different GO annotations relevant to salt stress response.

2 Methodology

This section presents a description of differential co-expression networks, introduces the Lasso-based approach to build differential co-expression networks, summarizes the ANGEL algorithm to detect overlapping modules in the network, and explains an enrichment technique to evaluate the biological significance of the detected modules.

2.1 Differential Co-expression Network

A network is an undirected graph \(G=(V,E)\) where \({V=\{v_1,v_2,\ldots ,v_{N}\}}\) is a set of N vertices (or nodes) and \(E=\{e_1,e_2,\ldots ,e_Q\}\) is a set of Q edges (or links) between vertices. \(G=(V,E)\) can be represented by an adjacency matrix \(A \in \{0,1\}^{N \times N}\) that is symmetric. A matrix entry in positions \((v_i,v_j)\) and \((v_j,v_i)\) is equal to 1 whenever there is an edge connecting vertices \(v_i\) and \(v_j\), and equal to 0 otherwise. Differential co-expression is the altered co-expression patterns of genes between two particular conditions (e.g., control and stress). In a differential co-expression network, each vertex corresponds to a gene. A link indicates a common alteration in the expression pattern between two genes when changing from one condition to the other. Differential co-expression networks are of biological interest because adjacent nodes in the network represent genes that jointly respond to similar stress conditions.
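As a small illustration of these definitions, the following sketch builds the symmetric adjacency matrix A from an edge list and reads off the vertices adjacent to a given vertex. The function names and the toy edge list are hypothetical and serve only as an example.

```python
def adjacency_matrix(num_vertices, edges):
    """Symmetric 0/1 adjacency matrix of an undirected graph."""
    A = [[0] * num_vertices for _ in range(num_vertices)]
    for i, j in edges:
        A[i][j] = 1
        A[j][i] = 1  # undirected: mirror every edge
    return A

def neighbors(A, i):
    """Set of vertices adjacent to vertex i."""
    return {j for j, entry in enumerate(A[i]) if entry == 1}

# Toy network with N = 4 vertices and Q = 3 edges (a path 0-1-2-3).
edges = [(0, 1), (1, 2), (2, 3)]
A = adjacency_matrix(4, edges)
```

In a differential co-expression network, `neighbors(A, i)` would return the genes whose expression changes jointly with gene i between the two conditions.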

2.2 Co-expression Network Construction with Lasso

Constructing co-expression networks using Lasso [33] offers key advantages compared to other methods that rely on the Pearson correlation coefficient (or any other correlation metric). Several assumptions behind the Pearson coefficient limit its effectiveness [17] and statistical significance, especially when sample sizes are small [6]. This condition is common across numerous efforts to build co-expression networks [20, 22], since most typical transcriptome datasets contain a small number of samples. Differential co-expression networks are built using differential expression profiles instead of the expression profiles themselves, as in co-expression networks. This section explains how building differential co-expression networks using Lasso can lead to robust networks even in the presence of small sample sizes.

Note that building a network \(G= (V, E)\), that is, a representation of pairwise relationships E over a set of vertices V, is equivalent to inferring a neighborhood for each vertex (i.e., the set of vertices to which it is connected). Given M observations on N genes (vertices) represented in a data matrix \(X \in \mathbb {R}^{N \times M}\), the set of neighbors of vertex \(v_i \in V\), defined as

$$\begin{aligned} V({v_i}):=\{v_j:(v_i, v_j) \in E\} \end{aligned}$$

is inferred by regressing \(x_i\) against all other variables

$$x_{\backslash i}:=\left[ x_{1}, \ldots , x_{i-1}, x_{i+1}, \ldots , x_{N}\right] ^{T} \in \mathbb {R}^{N-1}.$$

The result is a matrix \(B \in \mathbb {R}^{N \times N}\) whose diagonal is zero and in which the remaining \(N-1\) entries of row i correspond to the coefficients of the regression of \(x_i\) against \(x_{\backslash i}\). Each entry \(B(i,j)\) represents the strength of the relationship between vertices \(v_i\) and \(v_j\), where zero strength indicates no connection.

For each variable \(x_i\) the regression problem has the form:

$$\begin{aligned} \underset{\beta _{i}}{\textrm{minimize}} \left\| X_i-X_{\backslash i} \beta _{i}\right\| _{2}^{2}+\lambda \left\| \beta _{i}\right\| _{1}, \end{aligned}$$
(1)

where \(X_i\) and \(X_{\backslash i}\) represent the observations on \(x_i\) (i.e., the transpose of the i-th row of X) and on the rest of the variables, respectively. The vector \(\beta _i \in \mathbb {R}^{N-1}\) is a vector of coefficients for \(x_i\). In Eq. 1, the first term can be interpreted as a local log-likelihood of \(\beta _i\) and the \(\ell _1\) penalty is added to enforce sparsity. The regularization parameter \(\lambda \) balances the two terms. Lasso is repeated for all the variables, leading to a set of \(N \times N\) coefficients that are computed from \(\beta _1, \ldots , \beta _N\). Note that there is no guarantee that \(B(i,j) \ne 0\) implies \(B(j,i) \ne 0\). Therefore, the information in \(V(v_i)\) and \(V(v_j)\) is combined to enforce symmetry: an edge \((v_i,v_j)\) is considered meaningful only if \(B(i,j)\) and \(B(j,i)\) are both non-zero.
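The neighborhood-selection procedure can be sketched with a plain coordinate-descent solver for Eq. 1. This is only an illustrative sketch under stated assumptions: the solver, the function names, the fixed value of \(\lambda \), and the toy data are introduced here and are not the implementation used in this work. Rows of X are assumed to be centered, since Eq. 1 has no intercept.

```python
def soft_threshold(value, threshold):
    """Shrink value towards zero by threshold; exact zero inside the band."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0

def lasso_coordinate_descent(columns, y, lam, sweeps=200, tol=1e-10):
    """Minimize ||y - sum_j columns[j] * beta[j]||_2^2 + lam * ||beta||_1."""
    beta = [0.0] * len(columns)
    residual = list(y)  # equals y - X beta for beta = 0
    norms = [sum(v * v for v in col) for col in columns]
    for _ in range(sweeps):
        largest_change = 0.0
        for j, col in enumerate(columns):
            if norms[j] == 0.0:
                continue
            # Inner product of column j with the residual that ignores beta[j].
            rho = sum(c * r for c, r in zip(col, residual)) + norms[j] * beta[j]
            new_beta = soft_threshold(rho, lam / 2.0) / norms[j]
            delta = new_beta - beta[j]
            if delta != 0.0:
                residual = [r - delta * c for r, c in zip(residual, col)]
                beta[j] = new_beta
                largest_change = max(largest_change, abs(delta))
        if largest_change < tol:
            break
    return beta

def lasso_network(X, lam):
    """Regress each gene on all others; keep (i, j) if B[i][j] and B[j][i] are non-zero."""
    n = len(X)
    B = [[0.0] * n for _ in range(n)]
    for i in range(n):
        others = [j for j in range(n) if j != i]
        beta = lasso_coordinate_descent([X[j] for j in others], X[i], lam)
        for j, b in zip(others, beta):
            B[i][j] = b
    edges = {(i, j) for i in range(n) for j in range(i + 1, n)
             if B[i][j] != 0.0 and B[j][i] != 0.0}
    return B, edges

# Toy centered profiles: genes 0 and 1 co-vary, gene 2 is orthogonal to both.
X = [[1.0, -1.0, 2.0, -2.0],
     [2.0, -2.0, 1.0, -1.0],
     [1.0, 1.0, -1.0, -1.0]]
B, edges = lasso_network(X, lam=1.0)
```

In this toy example only genes 0 and 1 select each other, so the symmetric rule keeps a single edge. Each call to `lasso_coordinate_descent` is independent of the others, which is what makes the N regressions easy to distribute.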

Note also that including the \(\ell _1\) penalty allows Lasso to identify the variables that are strongly associated with the response variable (i.e., variable selection). Since the value of the regularization parameter \(\lambda \) determines the degree of penalty and the accuracy of the model, cross-validation is used to select a regularization parameter that minimizes the mean-squared error. If the degree of penalty \(\lambda \) is equal to zero, the solution is the same as least-squares (LS) linear regression [5]. For larger values of \(\lambda \), a larger number of coefficients are shrunk towards zero. Compared to LS, Lasso offers the following advantage. LS generally yields non-zero estimates for all coefficients, which would result in a fully connected network and give rise to the problem of setting a threshold above which an edge is considered significant. Lasso avoids this additional step as it simultaneously performs parameter estimation and variable selection by forcing the least significant coefficients to zero through the \(\ell _1\) penalty.
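To see why the \(\ell _1\) penalty produces coefficients that are exactly zero, it helps to consider a special case. Assume, purely for illustration (this does not hold for expression data in general), that the columns of \(X_{\backslash i}\) are orthonormal, and write \(\hat{\beta }^{\textrm{LS}}_{i}=X_{\backslash i}^{T}X_i\) for the least-squares solution. The objective in Eq. 1 then separates across coordinates, and its minimizer is obtained by soft-thresholding each least-squares coefficient:

$$\begin{aligned} \hat{\beta }_{i,j}=\textrm{sign}\left( \hat{\beta }^{\textrm{LS}}_{i,j}\right) \max \left( \left| \hat{\beta }^{\textrm{LS}}_{i,j}\right| -\frac{\lambda }{2},\, 0\right) . \end{aligned}$$

Every coefficient whose least-squares magnitude is at most \(\lambda /2\) is set to zero, so the corresponding edge is dropped without any separate thresholding step. For \(\lambda =0\) the expression reduces to the least-squares solution.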

2.3 Overlapping Clustering with ANGEL

ANGEL [27] is a static node-centric algorithm for detecting potentially overlapping modules in networks. It takes as input a graph G, a merging threshold \(\phi \), and an empty set of communities C. The algorithm’s main loop cycles over each node, extracts the corresponding ego-minus-ego network, and computes the local communities it contains using Label Propagation (LP) [23]. During LP, every node is initialized with a unique label. In the following steps, each node adopts the current label of the majority of its neighbors. In case of bow-tie situations, the classic LP formulation randomly selects a single label for the contended node. Here, however, soft community memberships are allowed, that is, each node can belong to multiple communities in the case of a bow-tie configuration. Once the outer loop on the network nodes is completed, the algorithm compacts the community set to avoid the presence of fully contained communities.
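The main loop described above can be illustrated with the following simplified sketch. It is not the reference ANGEL implementation: ties in label propagation are broken deterministically by the smallest label (the classic formulation picks at random), the merging rule based on the overlap fraction of the smaller community is an assumption, and all function names and default parameter values are hypothetical.

```python
from collections import Counter, defaultdict

def label_propagation(nodes, adj, max_sweeps=100):
    """Asynchronous LP restricted to `nodes`; ties go to the smallest label."""
    labels = {v: v for v in nodes}  # every node starts with a unique label
    for _ in range(max_sweeps):
        changed = False
        for v in sorted(nodes):
            counts = Counter(labels[u] for u in adj[v] if u in nodes)
            if not counts:
                continue  # isolated inside the ego-minus-ego network
            best = max(counts.values())
            new_label = min(l for l, c in counts.items() if c == best)
            if new_label != labels[v]:
                labels[v] = new_label
                changed = True
        if not changed:
            break
    groups = defaultdict(set)
    for v, label in labels.items():
        groups[label].add(v)
    return list(groups.values())

def merge_community(communities, candidate, phi):
    """Merge into the first community sharing >= phi of the smaller set, else append."""
    for k, existing in enumerate(communities):
        overlap = len(existing & candidate) / min(len(existing), len(candidate))
        if overlap >= phi:
            communities[k] = existing | candidate
            return
    communities.append(set(candidate))

def angel_sketch(adj, phi=0.5, min_size=3):
    """Loop over ego nodes, cluster each ego-minus-ego network, merge the results."""
    communities = []
    for ego in sorted(adj):
        ego_minus_ego = set(adj[ego])  # neighbors of ego, with ego itself removed
        for local in label_propagation(ego_minus_ego, adj):
            candidate = local | {ego}  # re-insert the ego node
            if len(candidate) >= min_size:
                merge_community(communities, candidate, phi)
    return communities

# Bow-tie configuration: two triangles sharing node 2.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3, 4}, 3: {2, 4}, 4: {2, 3}}
modules = angel_sketch(adj)
```

In the bow-tie example, node 2 ends up in both modules, which is the soft membership behavior described above.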

Finally, note that compared to HLC [1], the computational cost of ANGEL is significantly lower and can be approximated by O(|V|).

2.4 Functional Enrichment

Our analysis of differential co-expression networks relies on the detection of gene modules. Such modules are used to investigate relationships occurring between genes performing similar biological functions [34]. Functional enrichment of each module is a critical step to understand the underlying processes contributing to phenotype or stress responses. This section describes how to evaluate the quality of the modules using Gene Ontology (GO) [2, 9] enrichment.

Given a gene module, an enrichment analysis finds which GO terms are over-represented or under-represented by using annotations for that gene set. Gene module enrichment analysis is performed using the Fisher’s Exact Test [32] in combination with a robust False Discovery Rate (FDR) [19] correction for multiple testing. Fisher’s exact test is a statistical significance test used in the analysis of contingency tables. The FDR control is a statistical method used in multiple hypothesis testing to correct for multiple comparisons. In a list of statistically significant findings, FDR is used to control the expected proportion of incorrectly rejected null hypotheses (“false discoveries”). Here, a Benjamini-Hochberg correction is used [4]. The result for each module is a list of statistically significant GO terms ranked by their adjusted p-values.
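The Benjamini-Hochberg adjustment can be sketched in a few lines. The function name and the example p-values are illustrative assumptions; the procedure itself is the standard step-up correction.

```python
def benjamini_hochberg(p_values):
    """Return BH-adjusted p-values in the same order as the input."""
    m = len(p_values)
    order = sorted(range(m), key=lambda k: p_values[k])  # indices by ascending p
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        k = order[rank - 1]
        running_min = min(running_min, p_values[k] * m / rank)
        adjusted[k] = running_min
    return adjusted

# Hypothetical raw p-values for four GO terms tested in one module.
adjusted = benjamini_hochberg([0.01, 0.04, 0.03, 0.5])
```

GO terms would then be ranked by their adjusted p-values, and a term is reported as significant when its adjusted value falls below the chosen FDR level.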

For each GO term in each module, a contingency table is built (see Table 1). The hypothesis statement is the following for each module:

  • \(H_0\): The module is a random sample from the network.

  • \(H_1\): The module has more genes with the GO term than expected by chance.

Table 1. Contingency table configuration for each module.

Following the configuration in Table 1, the p-value is calculated:

$$\begin{aligned} p=\frac{\left( \begin{array}{c} a+b \\ a \end{array}\right) \left( \begin{array}{c} c+d \\ c \end{array}\right) }{\left( \begin{array}{c} N \\ a+c \end{array}\right) }=\frac{\left( \begin{array}{c} a+b \\ b \end{array}\right) \left( \begin{array}{c} c+d \\ d \end{array}\right) }{\left( \begin{array}{c} N \\ b+d \end{array}\right) } \end{aligned}$$
(2)

The p-value represents the probability (or chance) of seeing at least \(a\) genes out of the total \(a+c\) genes in the module annotated with a particular GO term, given the proportion \((a+b)/N\) of genes in the whole genome that are annotated with that GO term. That is, the GO terms shared by the genes in each module are compared to the background distribution of the annotations. The closer the p-value is to zero, the more significant is the association of the particular GO term with the module of genes (i.e., the less likely that the observed annotation of the GO term to the module occurs by chance). In other words, if all of the genes in a module were associated with, say, “DNA repair”, this term would be optimally significant. However, since all genes in the genome (with GO annotations) are indirectly associated with the top-level term “biological process”, this would not be significant if all the genes in a module were associated with this high-level term.
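The computation in Eq. 2 can be sketched as follows. The cell layout is an assumption inferred from the surrounding text: a and c count the module genes with and without the GO term, and b and d count the genes outside the module with and without it, so that \(N=a+b+c+d\). The function names and the example counts are hypothetical. The first function evaluates Eq. 2 for a single table, and the second sums that probability over all tables with at least a annotated genes in the module, which is the usual one-sided form of the test.

```python
from math import comb

def table_probability(a, b, c, d):
    """Probability of one contingency table with fixed margins (Eq. 2)."""
    n = a + b + c + d
    return comb(a + b, a) * comb(c + d, c) / comb(n, a + c)

def enrichment_p_value(a, b, c, d):
    """One-sided p-value: at least `a` annotated genes among the a + c module genes."""
    n = a + b + c + d
    annotated = a + b      # genes carrying the GO term
    module_size = a + c    # genes in the module
    total = 0
    for k in range(a, min(annotated, module_size) + 1):
        total += comb(annotated, k) * comb(n - annotated, module_size - k)
    return total / comb(n, module_size)

# Hypothetical counts: 10 genes, 4 annotated, module of 4 genes with 3 annotated.
p_single = table_probability(3, 1, 1, 5)
p_enrichment = enrichment_p_value(3, 1, 1, 5)
```

For these counts the single-table probability is 24/210 and the one-sided p-value is 25/210, since only one more extreme table (all four module genes annotated) exists.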

If a module has at least one GO term with a significant p-value, the module is said to be enriched. This binary classification of a module between enriched and non-enriched allows us to evaluate, in a general way, the biological significance of the modules. The higher the proportion of enriched modules in a differential co-expression network, the better they capture the biological interactions of genes that jointly respond to a specific stress condition.
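The binary classification and the resulting proportion can be expressed compactly. In this sketch the significance level of 0.05, the function name, and the example values are assumptions made for illustration.

```python
def enriched_proportion(adjusted_p_by_module, alpha=0.05):
    """Fraction of modules with at least one GO term whose adjusted p-value is < alpha."""
    if not adjusted_p_by_module:
        return 0.0
    enriched = [name for name, p_values in adjusted_p_by_module.items()
                if any(p < alpha for p in p_values)]
    return len(enriched) / len(adjusted_p_by_module)

# Hypothetical adjusted p-values of the GO terms tested in four modules.
modules = {"m1": [0.20, 0.01], "m2": [0.30], "m3": [], "m4": [0.04]}
proportion = enriched_proportion(modules)
```

Here modules m1 and m4 are enriched, giving a proportion of 0.5.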

3 Case Study

This section presents a case study on the identification of genes that respond to saline stress in rice. The differential co-expression network is built using the approach presented in Section 2.2 and the framework in [26], with the ANGEL module detection algorithm. The goal of this case study is to evaluate whether the proposed approach, in addition to being less computationally costly, finds a statistically significant number of differentially expressed genes (DEG) and of genes with GO annotations relevant to salt stress.

3.1 Association Network Construction with Lasso

Consider the input data \(X \in \mathbb {R}^{N \times M}\), which represents the matrix of differential expression profiles (corresponds to \(L_1\) in [26]). The matrix X results from pre-processing RNA-seq data of rice both under control and salt stress conditions (GEO database [8] accession number GSE98455). Therefore, X contains the LFC of \(N=8,929\) genes under control and salt stress conditions in \(M=92\) samples.

The differential co-expression network, inferred using Lasso regressions, is composed of \(|V|=7,474\) vertices and \(|E|=67,061\) edges. All the genes in this network are part of the network previously constructed in [26], and the network preserves 21,123 out of 39,850,128 of the original connections. In other words, the resulting network is a subnetwork of the one constructed in [26].

3.2 Identification of Co-expression Modules

In [26], the approach for module detection requires finding a threshold for Pearson correlations to define the adjacency matrix. Here the proposed Lasso-based approach bypasses this step since it is able to directly infer network connections without additional parameters.

The ANGEL algorithm distributes a total of 5,577 genes across 1,462 modules with at least 3 genes each. Using the module enrichment technique described in Section 2.4, a total of \(65\%\) of all modules are identified as enriched, meaning that they have some over-represented GO annotations. This suggests that the modules identified by ANGEL are biologically relevant. Figure 1 compares the threshold Pearson network (thP) in [26] with the Lasso-based network (nbL), both in terms of the proportion of enriched modules and gene overlaps. Regarding module enrichment, note that the proposed approach surpasses the approach of [26].

Fig. 1. Module enrichment proportion and overlapping proportion of genes for the thP and nbL networks.

Regarding the overlapping modules of genes for the nbL network, notice that the number of transcription factors (TF) in the gene set belonging to multiple modules is statistically significant (p-value less than 0.05 for Fisher’s exact test). This supports the biological relevance of the overlapping modules. TFs regulate the expression of multiple genes and hence affect multiple pathways of varying functions [25]. Since TFs control different functions, they are expected to be found in overlapping modules. Another interesting finding is that, according to an enrichment analysis in ShinyGO [14], one of the pathways with the highest over-representation in the set of overlapping genes corresponds to “response to stress” (GO:0006950).

3.3 Gene Selection

Based on the modules detected with ANGEL in the nbL network (following the workflow proposed in [26]), a total of 25 genes are identified as responsive to salt stress. All 25 genes are also identified as DEG. Genes LOC_Os07g39390, LOC_Os04g35010, and LOC_Os01g33450 are selected by both approaches, that is, in both the thP and nbL networks.

From the 25 identified genes, after individual gene enrichment with the RGAP [18] and UniProt [10] databases, 11 genes report 17 different GO annotations relevant to salt stress response (which is statistically significant based on Fisher’s exact test, p-value\(<0.05\)). Table 2 lists these genes and the corresponding GO annotations relevant to salt stress response. The remaining 14 selected genes are:

(figure a: list of the remaining 14 selected genes)
Table 2. Selected genes with associated GO terms relevant to salt stress response.

The apoplast (GO:0048046) is the first subcellular compartment confronted with stress conditions when plants are subjected to salt stress [31]. Stress is first sensed by the receptors in membranes (GO:0016020), which then generate secondary signal messengers like calcium, reactive oxygen species, kinases (GO:0004672, GO:0016301, GO:0016740), and phosphates, followed by the activation of transcription factor genes (GO:0003700) that eventually coordinate the plant’s adaptive biochemical and physiological responses [16] (GO:0006950, GO:0009628, GO:0006952). Protein kinases regulate the phosphorylation and dephosphorylation of other proteins, and play a crucial role in stress signal transduction. In addition, serine/threonine protein kinases (GO:0004674) are known to be involved in multi-stress tolerance in plants [16].

Salt-induced toxicity negatively affects CO\(_2\) fixation and thylakoid reactions of photosynthesis, which take place in thylakoids (GO:0009579) and the stroma of the chloroplast, resulting in poor plant growth and reduction in yield [15]. An essential process for growth, development, and homeostasis of organisms is the dynamic balance between ubiquitination and deubiquitination (GO:0071108, GO:1990380, GO:0004843) [30]. In particular, inhibition of shoot and root development (GO:2000280) is the primary response to salt stress [35]. Other independent studies confirm that the enhanced catalytic and transferase activities (GO:0016740) in salt-stressed rice plants, as well as the transport (GO:0006810) of salt and all related ions through the plant, reinforce salt stress tolerance [24].

4 Concluding Remarks

This work proposes a novel approach for constructing co-expression networks based on the penalized least absolute shrinkage and selection operator (Lasso) [33]. Edges between genes are established based on the Lasso regression coefficients of the differential expression profile of one gene against all others. The approach extends the workflow described in [26] for identifying genes related to salt stress in rice. In particular, it uses the ANGEL algorithm, a static node-centric algorithm for detecting modules with overlaps, which improves effectiveness and time complexity. Note that the proposed approach can be used for building differential and non-differential co-expression networks.

The Lasso-based approach is computationally appealing as each of the N Lasso problems can be solved independently. This makes the proposed approach a good candidate to exploit parallelization [21]. Moreover, this method avoids the additional problem that often arises in constructing co-expression networks based on correlations: the definition of a threshold to identify the strongest connections that define the edges of the network. Using Lasso is especially convenient in inferring differential co-expression networks because it yields accurate parameter estimates even with small sample sizes [13], a common condition when studying expression data under control and stress conditions.

The modified workflow was applied to a case study of rice under salt stress. The resulting network, inferred with the Lasso approach, contained 7,474 vertices and 67,061 edges, and \(65\%\) of the identified modules were enriched. The number of transcription factors in the set of genes belonging to multiple modules (overlapping genes) was statistically significant. Finally, a total of 25 genes were selected as genes that respond to salt stress in rice. All 25 genes were also identified as DEG. Of these identified genes, 11 reported a statistically significant number of different GO annotations relevant to the salt stress response.

As future work, the proposed Lasso-based approach can be applied to infer co-expression networks of other organisms and other types of stresses. Moreover, the inferred networks can be used in comparisons and downstream analysis of co-expression networks. Developing a parallel implementation of the current workflow is an important research direction to further reduce time complexity. Further exploration of co-expression networks should provide valuable insights into the gene interactions and their joint response to stresses.