Benchmarking a Simple Yet Effective Approach for Inferring Gene Regulatory Networks from Systems Genetics Data

Heise, Sandra; Flassig, Robert J.; Klamt, Steffen

doi:10.1007/978-3-642-45161-4_3

Abstract

We apply our recently proposed gene regulatory network (GRN) reconstruction framework for genetical genomics data to the StatSeq data. In a first step, this method uses simple genotype–phenotype and phenotype–phenotype correlation measures to construct an initial GRN. This graph contains a high number of false positive edges, which are reduced by (i) identifying eQTLs and retaining only one candidate edge per eQTL, and (ii) removing edges reflecting indirect effects by means of TRANSWESD, a transitive reduction approach. We discuss the general performance of our framework on the StatSeq in silico dataset by investigating the sensitivity of the two required threshold parameters and by analyzing the impact of certain network features (size, marker distance, and biological variance) on the reconstruction performance. Using selected examples, we also illustrate prominent sources of reconstruction errors. As expected, the best results are obtained with a large number of samples and larger marker distances. A less intuitive result is that significant (but not too large) biological variance can increase the reconstruction quality. Furthermore, a somewhat surprising finding was that the best performance (in terms of AUPR) was found for networks of medium size (1,000 nodes), whereas we had expected it for networks of small size (100 nodes).

3.1 Introduction

Systems Genetics approaches provide a new paradigm of large-scale genome and network analysis (Jansen and Nap 2001; Jansen 2003; Rockman and Kruglyak 2006; Rockman 2008). These methods use naturally occurring multifactorial perturbations (e.g., polymorphisms) to causally link genetic or chromosomal regions to observed phenotypic trait data. Identifying a chromosomal region (the quantitative trait locus (QTL)) that influences a certain phenotypic trait is known as QTL mapping. In genetical genomics, a particular subclass of systems genetics, gene-expression levels are considered as phenotypic traits (called etraits) and identified QTLs are referred to as expression-QTLs (eQTLs). One application of eQTL maps obtained from genetical genomics approaches is the reconstruction of gene regulatory networks (GRNs).

According to Liu et al. (2010), a GRN reconstruction pipeline for genetical genomics data consists of three major steps: (i) eQTL mapping, (ii) candidate regulator selection, and (iii) network refinement. Step (i) is used to identify chromosomal regions (eQTLs) that affect the expression levels (i.e., the etraits) of genes. A detailed review of eQTL mapping is given, for instance, by Michaelson et al. (2009). In step (ii), the eQTL map is combined with a genetic map to select single candidate (regulator) genes from the eQTLs. Frequently used methods include conditional correlation (Bing and Hoeschele 2005; Keurentjes et al. 2007), local regression (Liu et al. 2008), and analysis of between-strain SNPs (Li et al. 2005). In step (iii), network refinement methods are applied to the topology obtained in step (ii), e.g., with the goal of identifying and eliminating (false positive) edges arising from indirect effects. Here, Bayesian network approaches (Zhu et al. 2007) and structural equation modeling (SEM; Liu et al. 2008) have been used.

In this chapter, we apply our recently proposed GRN reconstruction framework for genetical genomics data (Flassig et al. 2013), which incorporates the three major reconstruction steps mentioned above in a modular fashion. The framework follows a simple-yet-effective paradigm: it is based on simple correlation measures and needs no computationally demanding optimization steps. This approach is therefore suited for small- and large-scale networks and performs comparably well in the case of few samples but many genes, as we illustrated in Flassig et al. (2013) using simulated and biological data. The workflow of the framework is shown in Fig. 3.1. The initial GRN is constructed based on genotype–phenotype and phenotype–phenotype correlation analyses. Due to genetic linkage, there are often groups of genetically adjacent regulator gene candidates that target the same gene, resulting in eQTLs. To avoid many false-positive interaction predictions, single candidate regulators are therefore identified from the eQTLs. Finally, as a method for network refinement in step (iii), indirect path effects are removed by TRANSWESD, a transitive reduction approach introduced recently (Klamt et al. 2010).

3.2 Methods

Figure 3.1 shows the general workflow of our reconstruction framework together with a simple illustrative example. Starting from a typical set of genetical genomics data that includes genotyped markers, phenotyped genes, and gene-to-marker associations, marker linkage analysis and genotype assignment for each gene are performed in a preprocessing step. In particular, a linkage map is generated in which two markers are indicated to be genetically linked if their genotype–genotype correlation exceeds a given threshold parameter \(d_\mathrm{{min}}\). Then, in a first step, an unweighted and unsigned perturbation graph G1 is derived in which an edge \(i\rightarrow j\) is included if the corresponding genotype–phenotype correlation exceeds a second threshold \(t^{QT}\). The nodes in the graph directly correspond to genes, while the linkage map (of the markers) is kept to allow later eQTL assignment for each gene. The perturbation graph G1 is refined to G2 by quantifying each identified edge with respect to edge sign and weight, which indicate activation/repression and interaction strength, respectively. Due to genetic linkage, true regulators may be masked by other genes (e.g., at adjacent positions on the genetic map), resulting in eQTLs. The eQTLs of a given target gene t are identified on the basis of all potential regulator genes of t (contained in G2) together with the marker linkage map. These relationships are captured in graph G3, which is the only graph where the nodes represent eQTLs. Graph G4 is subsequently obtained by selecting one candidate regulator per eQTL based on the maximum of the edge weights. We call G4 the final perturbation graph, whose edges reflect direct and indirect effects between genes induced by genetic variations.
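
The steps from G1 to G4 described above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not the implementation of Flassig et al. (2013): the function name `select_regulators` is hypothetical, genotypes are assumed to be numerically coded with one marker genotype assigned per gene, Pearson correlation is used throughout, the edge weight is assumed to be the absolute genotype–phenotype correlation, and an eQTL is approximated as a connected group of linked candidate regulators.

```python
import numpy as np

def select_regulators(genotypes, expression, t_qt, d_min):
    """Sketch of G1 -> G4 (hypothetical helper, see assumptions in the text).

    genotypes:  (samples, genes) array, genotype of the marker assigned to each gene
    expression: (samples, genes) array of expression levels
    Returns a dict mapping each target gene index to its selected regulators.
    Columns are assumed to be non-constant (non-zero standard deviation).
    """
    n_samples, n_genes = genotypes.shape
    g = (genotypes - genotypes.mean(axis=0)) / genotypes.std(axis=0)
    e = (expression - expression.mean(axis=0)) / expression.std(axis=0)
    gp = g.T @ e / n_samples          # gp[i, j]: genotype of i vs. expression of j
    gg = g.T @ g / n_samples          # genotype-genotype correlation (linkage)
    np.fill_diagonal(gp, 0.0)         # self-regulation is not considered
    weights = np.abs(gp)              # assumed edge weight
    linked = np.abs(gg) > d_min       # linkage map

    selected = {}
    for target in range(n_genes):
        # Candidate regulators of this target (edges of G1/G2 into the target).
        remaining = set(np.flatnonzero(weights[:, target] > t_qt).tolist())
        regulators = []
        while remaining:
            # Grow one eQTL: candidates connected through linked markers (G3).
            seed = remaining.pop()
            group, stack = [seed], [seed]
            while stack:
                current = stack.pop()
                neighbours = [c for c in remaining if linked[current, c]]
                for c in neighbours:
                    remaining.remove(c)
                    group.append(c)
                    stack.append(c)
            # Keep the candidate with the largest weight per eQTL (G4).
            regulators.append(max(group, key=lambda r: weights[r, target]))
        selected[target] = sorted(regulators)
    return selected
```

In this sketch, two linked genes that both correlate with a target form a single eQTL, and only the one with the larger weight survives; this is also the step at which a true regulator can be replaced by a linked neighbour.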
To identify and remove indirect edges in G4 that can be explained by the operation of sequences of edges (paths), we apply the transitive reduction method TRANSWESD (TRANSitive reduction in WEighted Signed Digraphs), resulting in the final graph G5 containing the identified gene interactions. If one intends to verify the interactions experimentally, it is desirable to have a list of edges sorted with respect to edge confidence. Such a list is also required by the evaluation procedure of the StatSeq Systems Genetics Benchmark to assess the quality of a reconstructed network (Sect. 3.3). We generate such a sorted list based on the edge weights. More details on the framework can be found in Flassig et al. (2013).
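
The idea of removing edges that are explained by paths can be illustrated with a deliberately simplified stand-in. The sketch below is not TRANSWESD itself (see Klamt et al. 2010 for the actual rules): it assumes that a direct edge is dropped whenever an alternative path with the same overall sign exists in which every edge has a larger absolute weight than the direct edge. The function names are hypothetical, and the exhaustive path search is only practical for small graphs.

```python
def _has_explaining_path(edges, src, dst, sign, strength):
    """Search for a simple path src -> dst (length >= 2) that has the same
    overall sign and only uses edges stronger than the direct edge."""
    stack = [(src, 1, frozenset([src]))]
    while stack:
        node, path_sign, visited = stack.pop()
        for (a, b), w in edges.items():
            if a != node or b in visited:
                continue
            if (a, b) == (src, dst):
                continue                      # never use the direct edge itself
            if abs(w) <= strength:
                continue                      # path edges must be stronger
            new_sign = path_sign * (1 if w > 0 else -1)
            if b == dst:
                if new_sign == sign:
                    return True
                continue
            stack.append((b, new_sign, visited | {b}))
    return False

def reduce_indirect_edges(edges):
    """edges: dict {(source, target): signed weight}. Returns a reduced copy.
    Every edge is tested against the original graph (simplifying assumption)."""
    kept = {}
    for (src, dst), weight in edges.items():
        sign = 1 if weight > 0 else -1
        if not _has_explaining_path(edges, src, dst, sign, abs(weight)):
            kept[(src, dst)] = weight
    return kept
```

For example, with edges A→B (0.9), B→C (0.8), and A→C (0.3), the weak edge A→C is removed because the stronger, sign-consistent path A→B→C explains it. If B→C were repressing instead, the path would have the opposite sign and A→C would be kept.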

Table 3.1 Reconstruction performance obtained for each network configuration achieved with the indicated optimal parameter values.

3.3 Application to the StatSeq Systems Genetics Benchmark: Results and Discussion

We applied our reconstruction framework described in Sect. 3.2 to the in silico StatSeq dataset provided to all contributors of this book. In this section, we will discuss the general performance of the algorithm and investigate the impact of certain network features (size, marker distance, and biological variance) on the reconstruction performance of our applied reconstruction framework. Using selected examples, we will also illustrate prominent sources of reconstruction errors (Sect. 3.3.2).

3.3.1 General Performance Analysis with Respect to Network Configurations

Table 3.1 shows the AUPR and AUROC reconstruction performance (obtained by using optimal values for the thresholds \(d_\mathrm{{min}}\) and \(t^{QT}\)) for all 72 studied network configurations: 3 different network sizes (100, 1000, 5000) \(\times \) 3 replicates (with the same topological parameters) \(\times \) 2 marker distances (close and far) \(\times \) 2 different biological variances (high and low) \(\times \) 2 different population sizes (300 and 900) (see also Chap. 1). The performance measures are given for graphs G2, G4, and G5 to allow assessment of the overall effects of the two major pruning steps within our approach (G2 \(\rightarrow \) G4: selection of one candidate edge per eQTL; G4 \(\rightarrow \) G5: removal of edges that most likely stem from indirect effects (TRANSWESD); see Fig. 3.1). We will mainly focus on the AUPR measure since it is the most appropriate one for sparse networks.
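
To make the two measures concrete, the following sketch computes both from a confidence-ranked edge list. It rests on assumptions that may differ from the StatSeq evaluation script: AUPR is approximated by average precision (the mean precision at each recovered true edge), AUROC by the fraction of (true edge, non-edge) pairs that are ranked correctly, and scores are assumed to contain no ties. The function name `aupr_auroc` is hypothetical.

```python
def aupr_auroc(scores, labels):
    """scores: edge confidences; labels: 1 for a true edge, 0 otherwise.
    Assumes at least one positive, one negative, and no tied scores."""
    order = sorted(range(len(scores)), key=lambda k: -scores[k])
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    tp = fp = 0
    precision_sum = 0.0
    correctly_ranked_pairs = 0
    for k in order:
        if labels[k]:
            tp += 1
            precision_sum += tp / (tp + fp)   # precision at this recalled edge
        else:
            fp += 1
            correctly_ranked_pairs += tp      # true edges ranked above this non-edge
    return precision_sum / n_pos, correctly_ranked_pairs / (n_pos * n_neg)
```

A sparse toy case shows why AUPR is the stricter measure: if a single true edge is ranked 10th among 100 candidate edges, this sketch gives an AUPR of 0.1 but an AUROC of 90/99 (about 0.91), because the many correctly down-ranked non-edges dominate the AUROC.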

As a general trend, we observe that the first (eQTL) pruning step improves the AUPR in all cases, particularly for large population sizes (see also the averaged values in Table 3.1). The second (TRANSWESD) pruning step achieves a significant AUPR improvement (though smaller than that of the eQTL pruning) when using the larger population size, whereas only a minor or even no effect can be seen for reconstructions based on the small population with 300 individuals. The effects of the two pruning steps are also well reflected by the number of true positive (TP) and false positive (FP) edges in Table 3.1.

As expected, we see that a larger population size always helps to yield a better reconstruction quality (see also Fig. 3.3). Somewhat surprisingly, the best (averaged) AUPR value was found for the G5 graph of medium-size networks (1,000 nodes); we had expected to see this for networks with 100 nodes.

In the following, we discuss the sensitivity of the reconstruction results with respect to the threshold parameters (\(t^{QT}\) and \(d_\mathrm{{min}}\)) and the impact of marker distance, biological variance, and population size, using the first 100-node network as an example (networks 100.1.1–100.1.8 in Table 3.1). Similar results can be found for the replicates (100.2.x and 100.3.x) and/or networks of larger size (1000.x.x; 5000.x.x). For configurations 100.1.1–100.1.8, Figure 3.2 shows the resulting AUPR and AUROC performances of the reconstructed G5 networks in the two-dimensional space of meaningful threshold parameters. Clearly, as already outlined above, a larger population size (900 samples instead of 300) improves the reconstruction quality (compare odd vs. even numbers of network configurations) although, in line with our results in Flassig et al. (2013), the differences are only moderate. We also see that the optimal threshold regions are similar for all 8 networks. However, in the case of low sample size (300), the optimal AUROC/AUPR region is more confined. Thus, the method seems to be fairly robust against a variation of thresholds, but an appropriate threshold selection strategy is important for small sample sizes. Generally, the genotype–phenotype threshold \(t^{QT}\) for edge detection in G1 seems more sensitive and important than the linkage analysis threshold \(d_\mathrm{{min}}\) required in preprocessing. Regarding the performance measures themselves, AUROC is much less sensitive to the parameters \(t^{QT}\) and \(d_\mathrm{{min}}\) than AUPR.

Larger marker distance seems beneficial for reconstruction because genotype correlations are then minimized. This can be seen, for instance, when comparing configuration 2 (marker distance N(5, 1)) with 6 (marker distance N(1, 0.1)) in Fig. 3.2. Weak performance due to small marker distance can partially be compensated by biological variability (configuration 2 vs. 8). However, in the case of small samples and larger marker distance, larger biological variability decreases performance. This is most likely due to a poor signal-to-noise ratio and can be understood as follows. Interactions between genes are derived from target expression variations induced by regulator genotype variations. This approach requires sufficient (i) variation of the regulator and (ii) sensitivity of targets with respect to expression variations of the regulator. Variation of the regulator can only be induced by upstream genes, i.e., the regulator itself is regulated by other genes, and/or by biological variability inducing expression variation in each gene across the sample population. The latter is important for identifying regulator–target interactions of regulators that have no upstream genes. In this case, the only source of topologically informative expression variation is biological variability, which, however, can only be distinguished from uninformative noise for larger sample sizes.
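
The dependence on sample size can be made tangible with a standard approximation that is not part of the framework itself: under Fisher's z-transformation, the standard error of a transformed Pearson correlation is \(1/\sqrt{n-3}\). The sketch below (function name hypothetical, single two-sided test at roughly the 5 % level, no multiple-testing correction) estimates the smallest absolute correlation that stands out from zero for a given number of samples.

```python
import math

def min_detectable_correlation(n_samples, z_crit=1.96):
    """Smallest |r| distinguishable from zero under the Fisher z approximation.
    z_crit = 1.96 corresponds to a two-sided test at about the 5 % level."""
    return math.tanh(z_crit / math.sqrt(n_samples - 3))

for n in (300, 900):
    print(n, round(min_detectable_correlation(n), 3))
```

Under these assumptions, the limit drops from roughly 0.11 for 300 samples to roughly 0.065 for 900 samples, which is consistent with weak, variability-driven correlations becoming usable only in the larger population.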

Figure 3.3 summarizes the AUROC and AUPR performances for all network configurations and sizes, averaged over the three network replicates. These results confirm many of the observations made for networks 100.1.x. Again, for our reconstruction algorithm, the worst scenario in terms of AUPR values is the one with small sample size, small marker distance, and small biological variance. We also see that the AUROC is more or less insensitive to sample size and to the configuration of marker distance/biological variance, but sensitive to the total number of nodes. Specifically, the AUROC is consistently lower in networks with only 100 nodes than in networks with 1,000 and 5,000 nodes. This is most likely due to the fact that there are fewer false negative edges in small than in large networks (if they have the same connectivity, which is the case for the given dataset), leading to a decreased AUROC. The best network configuration for reconstruction in terms of AUPR values is given by larger samples and large marker distance, of which only the first can be influenced by experimental design. Increased biological variance has noticeable effects on the reconstruction quality for small marker distance. Here, higher biological variance is favorable. The reconstruction quality decreases clearly with network size in one particular case: networks with 5,000 nodes perform poorly in terms of AUPR for small sample size (300). Precision is low in this setting because there are too few samples. For 900 samples, precision is raised, resulting in AUPR values similar to those for reconstructions of 100/1,000-node networks. Averaged over the eight configurations, networks with 1,000 nodes are reconstructed best with respect to both AUPR and AUROC values.

3.3.2 Prominent Sources of Reconstruction Errors

In the following, we restrict the analysis to (i) a well-identifiable configuration (100.1.4) and (ii) a poorly identifiable configuration (100.1.6). We further restrict our analysis to 900 samples, since the influence of the sample size should be clear from the discussions above. In Fig. 3.4 we show the genotype–phenotype correlation matrix and the weight matrix as density plots. In the weight matrix, we have indicated TPs (green circles), FPs (blue circles), and FNs (red circles); note that the green and blue circles together describe the reconstructed network G5. In the genotype–phenotype matrix plots we see horizontal gray lines (especially in 100.1.6), which correspond to eQTLs from which regulators have to be selected in order to reconstruct the GRN. We see that configuration 100.1.4 tends to have more confined eQTLs due to larger marker distances, i.e., smaller genotype correlation between adjacent markers. This of course improves reconstruction quality, as can be seen, e.g., in Table 3.1 (AUPR of 0.36 in 100.1.4 vs. 0.12 in 100.1.6).

From the weight matrix plots we also see that 100.1.6 contains more gray spots than 100.1.4. This results from many more correlations in the data of 100.1.6. Since many of these correlations are due to marker correlations, they do not reflect true interactions and thus hamper network inference. The diagonal gray line indicates self-regulations, which were not considered for reconstruction (and were not taken into account by the performance evaluation script). A vertical line of red or green circles indicates a true regulator with many targets. An example is regulator \(g_{92}\), from which many targets are correctly identified in the case of 100.1.4. In the case of 100.1.6, the algorithm selects \(g_{91}\) as the regulator and therefore induces many FPs (vertical line of blue circles at regulator position 91) and many FNs (vertical line of red circles at regulator position 92). The reason for this is that eQTLs in 100.1.6 are much larger due to smaller marker distances, corresponding to a strong correlation of genes \(g_{91}/g_{92}\) via their genotypes (see genotype–phenotype matrix plot in Fig. 3.4). For configurations 100.1.4/100.1.6, gene \(g_{92}\) has 1 true upstream gene, 21 true targets, and mean expressions \(\mu _{E}=1.57 / \mu _{E}=1.35\) with \(\sigma ^{2}_{E}=0.43 / \sigma ^{2}_{E}=0.098\). In contrast, gene \(g_{91}\) has 5 true upstream nodes, 0 true targets, and mean expressions \(\mu _{E}=0.35 / \mu _{E}=0.4\) with \(\sigma ^{2}_{E}=0.1 / \sigma ^{2}_{E}=0.03\) for configurations 100.1.4/100.1.6. Therefore, when deriving the weights for 100.1.6, gene \(g_{91}\) obtains larger weights with little variance than gene \(g_{92}\) and is thus wrongly selected during eQTL analysis.
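
The double penalty of a wrongly selected regulator can be seen in a small sketch that classifies edges as TP, FP, or FN. The helper name `classify_edges` and the target indices in the usage example are hypothetical; only the regulator indices 91 and 92 are taken from the discussion above. Self-regulation is ignored, as in the evaluation described in the text.

```python
def classify_edges(predicted, truth):
    """Compare directed edges given as (regulator, target) pairs.
    Self-loops are ignored on both sides."""
    predicted = {(a, b) for a, b in predicted if a != b}
    truth = {(a, b) for a, b in truth if a != b}
    return {
        "TP": predicted & truth,   # predicted and true
        "FP": predicted - truth,   # predicted but not true
        "FN": truth - predicted,   # true but missed
    }

# Hypothetical targets 5, 6, 7: the true regulator is 92, but the linked
# neighbour 91 was selected from the eQTL instead.
result = classify_edges(
    predicted=[(91, 5), (91, 6), (91, 7)],
    truth=[(92, 5), (92, 6), (92, 7)],
)
print({name: len(edges) for name, edges in result.items()})
```

In this toy case every affected target contributes one FP and one FN at the same time, which is why a single mis-selected regulator with many targets weighs so heavily on precision and recall.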

Notably, even when a gene has no upstream gene (regulator), we may still recover target interactions. For example, gene \(g_{4}\) has no regulator, but we do recover 8 / 12 interactions out of 26 for configurations 100.1.4/100.1.6, simply because the expression of gene \(g_{4}\) varies due to higher biological variance, resulting in expression variations of the targets (see mean edge weights of G4 targets in the table of Fig. 3.4).

Another example of the typical challenges in correctly reconstructing interactions from the provided dataset is gene \(g_{50}\). This gene has mean expressions \(\mu _{E}=0.47 / \mu _{E}=0.48\) with \(\sigma ^{2}_{E}=0.06 / \sigma ^{2}_{E}=0.02\) for configurations 100.1.4/100.1.6, with 1 true upstream gene. As the variation in the expression of gene \(g_{50}\) is small, we cannot extract information on its targets that rises above the variation caused by noise. Further, even in cases where a regulator varies strongly, it does not necessarily induce variation in the target (see FN histogram and the table in Fig. 3.5). This can happen when a gene has several regulators or when the kinetics of the target activation is in an insensitive range with respect to changes in the regulator (e.g., due to a very low or very large \(K_{m}\) parameter in a Hill function describing the dependency of the target on its regulator). Both effects result in low sensitivity with respect to regulators, thus again hampering the identification of interactions.
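
The effect of an unfavorable \(K_{m}\) can be illustrated with a generic activating Hill function. The sketch assumes the form \(f(x) = x^{n}/(K_{m}^{n} + x^{n})\) with a Hill coefficient of \(n = 2\); the exact kinetics used to generate the dataset are not specified here, and the function names are hypothetical.

```python
def hill(x, k_m, n=2.0):
    """Activating Hill function f(x) = x^n / (K_m^n + x^n) (assumed form)."""
    return x**n / (k_m**n + x**n)

def hill_slope(x, k_m, n=2.0):
    """Derivative df/dx = n * K_m^n * x^(n-1) / (K_m^n + x^n)^2."""
    return n * k_m**n * x**(n - 1) / (k_m**n + x**n) ** 2

# Response of the target to small changes of a regulator expressed at x = 1.
for k_m in (0.01, 1.0, 100.0):
    print(k_m, hill_slope(1.0, k_m))
```

With the regulator at an expression level of 1, the slope is 0.5 for \(K_{m} = 1\) but only about \(2 \times 10^{-4}\) for \(K_{m} = 0.01\) (target saturated) and for \(K_{m} = 100\) (target barely activated). In both extreme cases, even strong regulator variation produces almost no variation in the target.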

In Fig. 3.5 we show three histograms of mean and variance of the regulators’ expressions, classified according to whether the (non-)identified target interactions of the regulator are TPs/FPs/FNs. We use network configuration 100.1.4 with optimal threshold parameters as it belongs to the networks with the highest reconstruction quality. As expected, regions in the mean–variance expression plane in Fig. 3.5 where we find TPs also overlap with FP and FN regions. FNs and, in part, FPs are reduced only for mean and variance levels above 1.2 and 0.4, respectively. The drop in FNs is due to the fact that interactions are not missed in the high-level region of the mean–variance plane. Regulators are sometimes wrongly selected from the eQTLs, almost independently of their expression mean and variance. This explains why FPs are only slightly reduced in the high-level region.

Interactions of regulators with expression values roughly below 0.5 and variance levels below 0.1 are always misclassified as either FP or FN. Looking at the mean and variance of the expression levels of the target genes that belong to the TPs/FPs/FNs of regulator \(g_{92}\) (see table in Fig. 3.4), we see that sufficient variation at a sufficient expression level of the regulator does not guarantee that interactions (or their absence) are identified correctly. The expression level of the target and its variance also determine classification results. The more inputs a target has, the more likely it is to become an FN, since its sensitivity to variation of a specific input node is decreased (see mean expression variance over the FN target genes). False positives are also generated when the FP targets vary too strongly. In the example of Fig. 3.5, this is probably due to strong biological variance and experimental noise inducing variations in the FP targets; all five FP targets have a relatively low mean input number of 2.8.

3.4 Summary and Conclusions

We have analyzed the reconstruction results obtained with our recently developed framework for reconstructing gene regulatory networks based on simple correlation measures. Several different network topologies and data qualities have been used to illustrate limitations and challenges for network inference. We demonstrated that the reconstruction quality is influenced by (i) experimental design in terms of sample size and (ii) biological factors (marker distance, biological variability, and target sensitivity with respect to its regulators). Regarding the experimental design, our framework is relatively tolerant to small sample sizes, as the comparison of the reconstruction results from 300- and 900-sample data shows. However, the best results are obtained with a large number of samples and larger marker distances combined with significant (but not too large) biological variances. Biological factors that are beneficial for reconstruction are larger biological variance in the case of genetically close markers and input sensitivity, i.e., every gene varies when its regulators vary in expression or genotype.

Finally, we note that meaningful reconstruction results can only be achieved when marker distances are sufficiently large. Otherwise, one should restrict the reconstruction to G3, i.e., eQTL mapping, to narrow down potential interaction sites. Then, for specific genes, the true interactions may be obtained by further focused experimental analysis based on the initial reconstructed graph G3.

References

Bing N, Hoeschele I (2005) Genetical genomic analysis of a yeast segregant population for transcription network inference. Genetics 170:533–542
Flassig RJ, Heise S, Sundmacher K, Klamt S (2013) An effective framework for reconstructing gene regulatory networks from genetical genomics data. Bioinformatics 29(2):246–254
Jansen RC, Nap JP (2001) Genetical genomics: the added value from segregation. Trends Genet 17:388–391
Jansen R (2003) Studying complex biological systems using multifactorial perturbation. Nat Rev Genet 4:145–151
Keurentjes JJB, Fu J, Terpstra IR et al (2007) Regulatory network construction in Arabidopsis by using genome-wide gene expression quantitative trait loci. Proc Natl Acad Sci USA 104:1708–1713
Klamt S, Flassig RJ, Sundmacher K (2010) TRANSWESD: inferring cellular networks with transitive reduction. Bioinformatics 26:2160–2168
Li H, Lu L, Manly KF et al (2005) Inferring gene transcriptional modulatory relations: a genetical genomics approach. Hum Mol Genet 14:1119–1125
Liu B, de la Fuente A, Hoeschele I (2008) Gene network inference via structural equation modeling in genetical genomics experiments. Genetics 178:1763–1776
Liu B, Hoeschele I, de la Fuente A (2010) Inferring gene regulatory networks from genetical genomics data. In: Das S, Caragea D, Hsu WH, Welch SM (eds) Computational methodologies in gene regulatory networks. IGI Global, Hershey, pp 79–107
Michaelson JJ, Loguercio S, Beyer A (2009) Detection and interpretation of expression quantitative trait loci (eQTL). Methods 48:265–276
Rockman MV, Kruglyak L (2006) Genetics of global gene expression. Nat Rev Genet 7:862–872
Rockman MV (2008) Reverse engineering the genotype-phenotype map with natural genetic variation. Nature 456:738–744
Zhu J, Wiener MC, Zhang C et al (2007) Increasing the power to detect causal associations by combining genotypic and expression data in segregating populations. PLoS Comput Biol 3:e69

Author information

Authors and Affiliations

Max Planck Institute for Dynamics of Complex Technical Systems, Sandtorstrasse 1, D-39106, Magdeburg, Germany
Sandra Heise, Robert J. Flassig & Steffen Klamt

Corresponding author

Correspondence to Steffen Klamt.

Editor information

Editors and Affiliations

Leibniz-Institute for Farm Animal Biology, Dummerstorf, Germany
Alberto de la Fuente

About this chapter

Cite this chapter

Heise, S., Flassig, R.J., Klamt, S. (2013). Benchmarking a Simple Yet Effective Approach for Inferring Gene Regulatory Networks from Systems Genetics Data. In: de la Fuente, A. (eds) Gene Network Inference. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-45161-4_3

DOI: https://doi.org/10.1007/978-3-642-45161-4_3
Published: 04 January 2014
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-45160-7
Online ISBN: 978-3-642-45161-4
eBook Packages: Biomedical and Life Sciences (R0)
