Introduction

Microarray-based high-throughput technology has been extensively applied to profile genome-wide gene expression under diverse conditions (Aoki et al. 2007). Using advanced bioinformatics tools, researchers can find candidate genes for a specific phenotype, infer gene functions and regulation (Usadel et al. 2009; Li et al. 2015; Ficklin and Feltus 2011), and perform comparative co-expression analysis (Movahedi et al. 2012; Ruprecht et al. 2017). As robust systems, living organisms can respond to biotic and abiotic stresses (Amrine et al. 2015; Nishiyama et al. 2018). Complex life activity relies not only on individual genes but also on a dynamic and complex gene network. To identify this gene network, a large-scale analysis of transcriptome data can be performed.

The gene co-expression network (GCN) method has been used to explore global, temporal and spatial expression in Arabidopsis (Schmid et al. 2005). For example, Mao constructed an Arabidopsis GCN using 1094 arrays from AtGenExpress and functionally annotated 46 of the 382 identified modules (Mao et al. 2009). Furthermore, Mutwil used 351 microarray data points, identified 181 gene clusters as well as 27 of the 34 significant clusters, and then validated 6 genes predicted to be essential (Mutwil et al. 2010). Zheng used 1388 microarrays from ATTED-II to construct a GCN and predicted motifs in the promoter regions of co-expressed genes (Zheng et al. 2011). Giorgi used a modification of RMA to normalize 3707 Arabidopsis microarrays for correlation analysis (Giorgi et al. 2010). Feltus maximized gene co-expression relationships through pre-clustering of 7105 Arabidopsis expression samples (Feltus et al. 2013). There are also several targeted or condition-dependent network analyses (Ficklin et al. 2017). Targeted network analyses focus on the GCN of a subset of genes, while condition-dependent analyses emphasize the GCN under a limited number of biotic and abiotic conditions. For example, Peng constructed a GCN for organelles in Arabidopsis based on Gene Ontology Cellular Component information (Penga et al. 2016). Wang identified 2438 cell wall-related genes in Arabidopsis under 351 conditions based on GCN (Wang et al. 2012b). Boruc constructed a dynamic interaction network of core cell cycle genes through a combination of GCN and protein–protein interaction information (Boruc et al. 2010). Amrine analyzed 272 microarrays that involved microbial infections of Arabidopsis with a wide array of fungal and bacterial pathogens with biotrophic, hemibiotrophic, and necrotrophic lifestyles, and constructed a GCN of core biotic stress-responsive genes (Amrine et al. 2015). 
Prasch applied triple-stress conditions with heat, drought and virus exposure to Arabidopsis and revealed significant shifts in signaling networks using GCN (Prasch and Sonnewald 2013). Rasmussen constructed GCNs in 10 Arabidopsis ecotypes using cold, heat, light, salt and flagellin treatment as single-stress factors as well as their combinations (Rasmussen et al. 2013). Van Veen used GCN to compare 8 Arabidopsis accessions under compound stress imposed by submergence and revealed a core of conserved, genotype- and organ-specific responses to flooding stress (van Veen et al. 2016). Furthermore, a major task for scientists is to transfer acquired knowledge from the model organism Arabidopsis to crop species. The emerging comparative GCN approach has become a powerful tool for cross-species analysis. PlaNet combines sequence and comparative GCN to help identify homologs in valuable crop species (Mutwil et al. 2011). Ficklin used GCN to find conserved gene modules between maize and rice (Ficklin and Feltus 2011), while Shaik used GCN to identify common modules for drought and bacterial stress responses between Arabidopsis and rice (Shaik and Ramakrishna 2013). Finally, researchers have developed web tools for gene co-expression exploration. These tools have different features: AraNet focuses on functional annotation by combining multiple data sources (Lee et al. 2015); ATTED-II and CressExpress emphasize gene–gene co-expression queries (Aoki et al. 2016); and PLANEX also provides Cohen’s Kappa statistics for cross-species co-expression gene comparison (Yim et al. 2013).

One of the widely used GCN methods is weighted gene co-expression network analysis (WGCNA) (Zhang and Horvath 2005). It groups genes with similar expression patterns across biological samples, which may be members of the same pathway or biological process. The whole transcriptome can be simplified to several modules, which allows us to look into bio-system components easily. The relationships between genes within modules can be delineated. The higher-order module network can also be described. These network properties can further be correlated with other biological traits to find functional genes or modules. However, the inherent gene–gene connections that exist within a transcriptome can only be detected when enough perturbations viz. biological replications are pooled.

In this study, we applied WGCNA to publicly available microarray data covering several conditions for Arabidopsis. Genome-scale modules of co-expressed genes with clear functional annotations were identified. The module association with traits was inferred. Five potential heat shock-responsive genes were found. A higher-order module network analysis indicated the distinct expression pattern of chloroplast genes. Module preservation analysis suggested that there was a similarity between Arabidopsis and rice.

Materials and methods

Microarray data acquisition and processing

Microarray datasets were obtained from the National Center for Biotechnology Information (NCBI) Gene Expression Omnibus (GEO) database under the platform numbers GPL198 for Arabidopsis and GPL2025 for rice. The two platforms consist of experimental samples from assays using the Affymetrix Arabidopsis ATH1 Genome Array and the Affymetrix Rice Genome Array (http://www.affymetrix.com). The Arabidopsis array contains 22,810 probesets, and the rice array contains 57,381 probesets. Briefly, 931 Arabidopsis datasets with 12,112 samples and 191 rice datasets with 2043 samples were analyzed. Raw gene chip data were analyzed with Expression Console (v1.4.1.46) using the MAS5 method (Pepper et al. 2007), and probe-level gene expression data were retrieved. Duplicated samples were detected with the R (v3.3.1) duplicated function. After removing duplicated and disrupted samples, 11,896 Arabidopsis and 2025 rice samples remained. Control probes were removed before quantile normalization in R using the normalize.quantiles function (Bolstad et al. 2003). The probesets were mapped to Entrez gene IDs according to the array annotation table provided by NCBI GEO. Genes represented by more than one probeset were filtered by relative standard deviation (RSD): only the probeset with the highest RSD was retained, so that each gene was represented by its most variable, and therefore most informative, probeset. For convenience, we refer to each probeset by its corresponding gene throughout the manuscript. Finally, 21,275 genes from Arabidopsis and 19,449 genes from rice were included for downstream analysis. Detailed information for these datasets is provided in Supplementary Tables S1–4.
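The RSD-based probeset filter described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the code used in the study: the function names, the dictionary-based inputs, the use of the sample standard deviation, and the probeset and gene identifiers are all assumptions made for the example.

```python
from statistics import mean, stdev


def relative_sd(values):
    """Relative standard deviation: sample SD divided by the mean.

    Assumes a positive mean, as expected for MAS5 signal intensities.
    """
    return stdev(values) / mean(values)


def pick_probesets(expression, probe_to_gene):
    """Keep, for every gene, the probeset with the highest RSD.

    expression    -- dict: probeset ID -> list of expression values
    probe_to_gene -- dict: probeset ID -> gene ID (unmapped probesets are skipped)
    Returns a dict: gene ID -> retained probeset ID.
    """
    best = {}  # gene ID -> (probeset ID, RSD)
    for probe, values in expression.items():
        gene = probe_to_gene.get(probe)
        if gene is None:
            continue
        rsd = relative_sd(values)
        if gene not in best or rsd > best[gene][1]:
            best[gene] = (probe, rsd)
    return {gene: probe for gene, (probe, _) in best.items()}


# Hypothetical toy data: two probesets map to gene G1, one to gene G2.
expression = {
    "p1": [10.0, 10.0, 10.0, 10.0],  # flat profile, RSD = 0
    "p2": [5.0, 10.0, 15.0, 10.0],   # variable profile, higher RSD
    "p3": [8.0, 9.0, 7.0, 8.0],
}
probe_to_gene = {"p1": "G1", "p2": "G1", "p3": "G2"}
selected = pick_probesets(expression, probe_to_gene)
```

In this toy input, the variable probeset p2 is retained for G1 and the single probeset p3 is retained for G2.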

Weighted gene co-expression network analysis (WGCNA)

Network analysis was performed using the Bioconductor WGCNA package (v1.63) on a Dell PowerEdge R930 Server with the following parameters: networkType = ‘signed’, softPower = 10 or 14, minModuleSize = 30, deepSplit = 4 (Huber et al. 2015; Langfelder and Horvath 2008). Briefly, signed co-expression networks were constructed for Arabidopsis and rice separately. For each pair of genes in the gene expression matrix, a Pearson correlation coefficient was computed, and an adjacency matrix was calculated by raising the correlation matrix to a power (Zhang and Horvath 2005). Powers of 10 and 14 were chosen for Arabidopsis and rice, respectively, using the scale-free topology criterion. Then, the adjacency matrix was transformed into a topological overlap (TO) matrix, which measures not only the correlation of two genes but also the extent of their shared correlations across the weighted network (Zhang and Horvath 2005). The TO matrix was then hierarchically clustered to identify highly co-expressed genes. Finally, co-expression gene modules were identified by the Dynamic Tree Cut algorithm (Oldham et al. 2008). Each module was summarized by a module eigengene (ME) through singular value decomposition, so that each module expression profile was represented by its first principal component (Zhang and Horvath 2005). Thus, the ME explains the maximum amount of variation of the module expression levels and is considered the most representative gene expression profile of a module. To construct the higher-order network of modules and identify meta-modules, the same process was applied to the modules identified above, with the parameters power = 6 and minModuleSize = 2. The clustering was conducted with the hclust function in the WGCNA package.
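To make the adjacency and topological overlap steps concrete, the following Python sketch computes a signed adjacency, a = ((1 + cor) / 2)^power, and the topological overlap measure in the form given by Zhang and Horvath (2005). It is a minimal illustration under stated assumptions, not the WGCNA implementation: the function names and the toy profiles are hypothetical, every profile is assumed to be non-constant, and the diagonal of the adjacency matrix is set to zero so that it does not contribute to connectivity.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equally long, non-constant profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    r = sxy / sqrt(sxx * syy)
    return max(-1.0, min(1.0, r))  # guard against rounding outside [-1, 1]


def signed_adjacency(profiles, power):
    """Signed adjacency a_ij = ((1 + cor_ij) / 2) ** power, zero diagonal."""
    n = len(profiles)
    adj = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            a = ((1 + pearson(profiles[i], profiles[j])) / 2) ** power
            adj[i][j] = adj[j][i] = a
    return adj


def topological_overlap(adj):
    """TO_ij = (l_ij + a_ij) / (min(k_i, k_j) + 1 - a_ij), with
    l_ij the sum over u of a_iu * a_uj (u != i, j) and k_i the connectivity."""
    n = len(adj)
    k = [sum(row) for row in adj]  # the diagonal is zero, so this is sum over u != i
    tom = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            shared = sum(adj[i][u] * adj[u][j] for u in range(n) if u not in (i, j))
            t = (shared + adj[i][j]) / (min(k[i], k[j]) + 1 - adj[i][j])
            tom[i][j] = tom[j][i] = t
    return tom


# Hypothetical toy profiles: genes 0 and 1 rise together, gene 2 falls.
profiles = [[1, 2, 3, 4], [2, 4, 6, 8], [4, 3, 2, 1]]
adj = signed_adjacency(profiles, power=10)
tom = topological_overlap(adj)
```

In a signed network, strongly negatively correlated genes receive an adjacency near zero, so genes 0 and 2 in the toy data are effectively unconnected, whereas genes 0 and 1 are fully connected.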

Module stability was assessed by randomly sampling half of the samples 1000 times and averaging the correlation between the original connectivity and the connectivity recalculated from each half-sample. The process was run for every module. The module preservation of rice compared to Arabidopsis was analyzed with the WGCNA modulePreservation function using the following parameters: referenceNetworks = Arabidopsis, networkType = ‘signed’ and nPermutations = 100. The analysis yields quantitative statistics of module preservation, which provide a rigorous argument that a module is not preserved (Langfelder et al. 2011). Through permutations, the analysis provides a Zsummary value, which summarizes the evidence that a module is preserved and is indicative of module robustness and reproducibility. The Zsummary threshold for strongly preserved modules is 10. Zsummary scores between 2 and 10 indicate weakly to moderately preserved modules, and Zsummary scores < 2 indicate modules that are not preserved.
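The Zsummary thresholds above amount to a simple three-way classification, sketched here in Python. The function name is hypothetical, and the handling of values exactly equal to 2 or 10 is an assumption, since the text does not specify it.

```python
def classify_preservation(zsummary):
    """Map a Zsummary score to a preservation class.

    Thresholds follow the text (10 and 2). Treating exactly 10 as
    weak to moderate and exactly 2 as weak to moderate is an assumption.
    """
    if zsummary > 10:
        return "strong"
    if zsummary >= 2:
        return "weak to moderate"
    return "not preserved"


# Hypothetical scores for illustration only; these are not results from the study.
example_scores = {"module_a": 12.5, "module_b": 5.0, "module_c": 1.2}
classes = {name: classify_preservation(z) for name, z in example_scores.items()}
```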

Functional annotation of the modules

Gene ontology (GO) enrichment for network modules was performed using the Database for Annotation, Visualization and Integrated Discovery (DAVID 6.8) (Huang et al. 2009) with the Arabidopsis ATH1-121501 Genome Array as the background list. DAVID provides enrichment results not only for GO terms but also for Kyoto Encyclopedia of Genes and Genomes (KEGG) pathways, Pfam motifs and chromosomes. The overrepresentation of a term was assessed with a modified Fisher’s exact P value adjusted for multiple testing using the Benjamini method. For simplicity, only the most significant term was recorded. Modular genes enriched within chromosome regions were analyzed with the Positional Gene Enrichment analysis tool (De Preter et al. 2008). Statistical significance was set at a P value of 3E−7. The overrepresented chromosomal regions were visualized using the Ensembl Genome browser.

For gene expression variation analysis, the gene expression relative standard deviation for each gene in a module was calculated, and the average values for each module were provided.

Comparison with rice transcriptome data

Overall, 1094 rice microarray data points from the NCBI GEO database under the platform GPL2025 were collected and processed as mentioned in “Microarray data acquisition and processing”. The orthologs between Arabidopsis and rice were downloaded through the EnsemblPlants BioMart tool. Orthologs were subjected to module preservation analysis using the R WGCNA package (R Development Core Team 2013). KEGG pathway-based analysis and visualization were also performed in R according to the package tutorial.

Results

A gene co-expression network of Arabidopsis was successfully constructed

A total of 11,896 Arabidopsis samples were used to construct the gene co-expression network. A power of 10 was chosen so that the network was approximately scale-free, which is a property of natural biological networks (Fig. 1a, b). As described in the Materials and methods, the hierarchically clustered genes were cut iteratively with the dynamic tree cut method to find stable gene clusters (Fig. 1c). Similar clusters were merged to form 52 co-expressed gene modules (Table 1). Module stability was tested by examining the correlation between the original connectivity and the connectivity values calculated from 1000 half-samples for each module (Fig. 1d). All the modules had an average connectivity correlation larger than 0.9, except for M42 (for simplicity, the modules are presented as M plus the module number, such as M42). M42 had the lowest module stability, while M2 had the highest.

Fig. 1

A scale-free gene co-expression network for Arabidopsis was successfully constructed. a A plot of signed R2 against power shows the threshold (red line) for a scale-free network. b The Arabidopsis network obeys the power law when a power of 10 is chosen. The regression line (diagonal) indicates a good model fit (R2 = 0.99). c Dendrogram produced by average linkage hierarchical clustering of Arabidopsis genes based on the topological overlap matrix (TOM). The modules were assigned colors as indicated in the horizontal bar beneath the dendrogram. d Bar plots present the correlation between the original intramodule connectivity of each module and that obtained by half-sampling 1000 times (mean ± SD)

Table 1 Gene Ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) enrichment of the 52 gene co-expression modules identified in Arabidopsis

Functional enrichment shows that these modules, except for M50 and M51, are associated with various biological processes (only the top term is shown) (Table 1). Most of the modules are enriched with a specific biological term, which suggests that the network analysis is valid. The most significantly enriched module is M2, which is enriched with chloroplast genes, while M50 and M51 had no significant annotation. Some of the modules reflect the co-expression of protein complexes in biological processes, such as the M4 ribosome genes in translation and the M6 chloroplast genes in photosynthesis. Modules M17, M24, M28, M34 and M37 are biotic stress responsive and are involved in the response to microorganisms, while M25 and M32 are involved in abiotic stresses, such as light and heat. Modules M15, M16, M23, M35 and M41 are associated with protective tissues, such as cell wall, phloem, Casparian strip and seed coat. Modules M3, M18 and M30 are associated with reproduction, such as pollen tube growth and cell division. These modules are also enriched with specific Pfam terms (Supplementary Table S5). For example, M34 is enriched with a Leucine-Rich Repeat (1.1E−6), and M1 is enriched with a PPR repeat family (8.8E−56).

Connectivity-based analysis

In a network, hubs are genes with high connectivity and are usually considered more important than other genes (Batada et al. 2006). First, the genes with the highest intramodule connectivity for each module are summarized in Supplementary Table S5. Second, to find highly connected genes in the whole network, all genes were sorted in descending order according to their global connectivity, which represents how densely a gene is connected with others. The top 100 highly connected genes are from either M1 or M3, and they are enriched with kinases (4E−7), which indicates the important role of kinases in the network. To check whether these genes are date or party hubs, we calculated the proportion of each gene’s connections that lie within its own module (the intramodule connectivity divided by the global connectivity) or outside of it (Chang et al. 2013). The average intermodule connectivity proportion for the 100 genes is 8%, which is significantly lower than that of 100 random genes sampled 1000 times (P < 0.01) and suggests that the M1- or M3-associated kinases serve as party hubs. Party hubs interact with most of their partners simultaneously, whereas date hubs bind different partners at different locations and times. However, a yeast data analysis revealed that kinases fall largely into the date hub category (Agarwal et al. 2010). Recently, it was proposed that a party hub might undergo a rapid transition to a date hub (or vice versa) as expression levels or post-translational modifications change (Dietz et al. 2010). In our network, a total of 9 modules are annotated with kinase-related terms. To examine whether kinases in other modules are party hubs, kinases were extracted and their intermodule connectivity proportions were calculated. Compared to whole network genes, kinases in M1, M2 and M3 are party hubs (P < 1E−8, P < 0.01 and P < 1E−9, respectively), but kinases in M11 and M17 are not. The conclusion also stands when compared to all the kinases.
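The connectivity proportions used in this paragraph can be illustrated with a short Python sketch. It assumes that the intermodule proportion is one minus the ratio of intramodule to global connectivity, which is our reading of the definition above; the function name, the toy adjacency matrix and the module labels are hypothetical.

```python
def intermodule_proportion(adj, modules):
    """Fraction of each gene's connectivity that falls outside its own module.

    adj     -- symmetric weighted adjacency matrix (list of lists)
    modules -- module label of each gene, in the same order as adj
    Assumes intermodule proportion = 1 - k_within / k_total.
    """
    proportions = []
    for i, row in enumerate(adj):
        # Global connectivity: all connections except the self-connection.
        k_total = sum(w for j, w in enumerate(row) if j != i)
        # Intramodule connectivity: connections to genes with the same label.
        k_within = sum(
            w for j, w in enumerate(row) if j != i and modules[j] == modules[i]
        )
        proportions.append(1 - k_within / k_total if k_total > 0 else 0.0)
    return proportions


# Hypothetical toy network: genes 0 and 1 share a module, gene 2 is alone.
toy_adj = [
    [0.0, 0.9, 0.1],
    [0.9, 0.0, 0.0],
    [0.1, 0.0, 0.0],
]
toy_modules = ["module_a", "module_a", "module_b"]
props = intermodule_proportion(toy_adj, toy_modules)
```

A gene with a low proportion, such as gene 0 in the toy network, keeps most of its connections inside its own module, which is the pattern interpreted above as party hub behavior.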

When current lethal gene information was incorporated (Lloyd et al. 2015), a permutation test showed that the lethal genes have a higher connectivity at the P < 0.05 level, but the difference is not significant at the P < 0.01 level (60 tests with P > 0.01 in 1000 permutations). A hypergeometric test suggests that 4 modules are enriched with lethal genes, including M6 (P = 0), M5 (P = 1E−10), M18 (P = 1E−5), and M4 (P = 1E−4). As another trait, seed pigment-associated genes are also enriched in M2 (P = 1E−9) and M6 (P = 0). Interestingly, those modules are all preserved in the rice transcriptome, as discussed in the following section.
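A hypergeometric enrichment test of this kind can be sketched as follows. The exact parameterization used in the study is not stated, so this example assumes the usual upper-tail form: the probability of observing at least k lethal genes in a module of a given size drawn without replacement from all network genes. The function and argument names are hypothetical, and the numbers are toy values.

```python
from math import comb


def hypergeom_upper_tail(k, module_size, n_lethal, n_genes):
    """P(X >= k) for a hypergeometric draw.

    k           -- lethal genes observed in the module
    module_size -- number of genes in the module
    n_lethal    -- lethal genes in the whole network
    n_genes     -- all genes in the network
    """
    total = comb(n_genes, module_size)
    upper = min(module_size, n_lethal)
    # Sum exact integer counts first, then divide once to limit rounding error.
    favourable = sum(
        comb(n_lethal, x) * comb(n_genes - n_lethal, module_size - x)
        for x in range(k, upper + 1)
    )
    return favourable / total


# Toy example: 10 genes, 4 of them lethal, a module of 3 genes containing 2 lethal genes.
p_value = hypergeom_upper_tail(k=2, module_size=3, n_lethal=4, n_genes=10)
```

For the toy values, the favourable count is 6 × 6 + 4 × 1 = 40 out of 120 possible modules, so the P value is 1/3.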

Gene expression variation in modules

We have reduced the complexity of the transcriptome data by using gene co-expression modules, and we suppose that these modules can be treated as functional modules. The variation in modular gene expression may indicate whether a function is more basal or conditional. The relative gene variation (relative standard deviation of gene expression) was calculated for each gene and then averaged within each module. The 3 most stable modules are M48 (photorespiration), M20 (glycolysis) and M4 (translation). The 3 most variable modules are M28 (proteolysis), M49 (cytidine deamination), and M30 (pollen exine formation) (Supplementary Table S6).

The correlation between gene expression and connectivity was calculated for each module (Supplementary Table S6). Two types of correlation were found. One was positive correlation, as in M2, M4, M5, M6 and M15, and most of those modules are involved in biosynthetic processes. The other was negative correlation, as in M23, M40, M49 and M50, and most of those modules are involved in biological degradation processes.

Correlating modules with experimental conditions/traits

After identifying the co-expressed gene modules, we checked the expression status of specific modules across experiments. Linking modular gene expression with experimental conditions may help to discover modules that function under a specific condition (Supplementary Table S6). If a module has a high ME in a given experiment, then the module could be a signature for that trait or experiment. For example, combined heat and anoxia treatment leads to the highest modular expression of M32, which is enriched with heat-responsive genes. A significant overlap between the anoxic and the heat responses has been reported: the transcription factor heat shock factor A2 (HsfA2) is induced by both heat and anoxia, and it is strongly induced by anoxia (Banti et al. 2010). On the other hand, the lowest M32 expression is observed in the arrested development 3 (add3) mutant of Arabidopsis. It has been shown that the add3 mutation prevents the expansion of leaf blades at high temperature, which suggests that add3 affects genes involved in inherently temperature-sensitive developmental processes (Pickett et al. 1996). The hub genes of M32 include AT5G37670, AT1G30070 and HSP23.6-MITO (Fig. 2a). Our results confirm these studies from a module-based perspective. The module genes can serve as signatures or candidates for thermo-tolerance.

Fig. 2

Representative module network visualization for M32 and M51. a Network visualization for M32. b Network visualization for M51

Although M51 has no significant functional annotation, it is highly expressed in the seed coat at the bending cotyledon stage, as inferred from the transcriptome atlas of the Arabidopsis maternal seed subregions (Khan et al. 2015), which indicates a potential role of M51 in germination. On the other hand, the lowest M51 expression occurs when seedlings are treated with MG132 to block proteasome function and to increase the level of TOC1 protein, an essential component of the Arabidopsis circadian system (Gendron et al. 2012). It has been demonstrated that MG132 treatment completely inhibited seedling growth from dissected embryos, which suggests that proteasome activity is required for germination (Chiu et al. 2016). Eight out of 36 M51 genes have been reported to be seed coat epidermal-specific genes (Esfandiari et al. 2013). The hubs include the AtMES esterase family genes MES6 and MES4 (Fig. 2b), and the MES family has been implicated in hormone homeostasis and germination (Vlot et al. 2008; Rajjou et al. 2006; Yang et al. 2008). Therefore, M51 may play roles in germination. Similarly, two other modules have functional annotations that were not significant after P value adjustment: M50, which may function in nectar secretion, and M42, which may function in manganese ion transmembrane transport, as indicated by their expression in lateral nectaries and embryo roots, respectively.

Genomic positional gene enrichment analysis

To check whether the 52 modules were associated with specific chromosome regions, modular genes were tested with the Positional Gene Enrichment analysis tool. At a stringent P value (7E−7) and using “more than 3 gene hits” criteria, 5 modules, including M9, M24, M30, M36, and M49, were identified as enriched within a specific chromosome region (Supplementary Table S7).

Gene function prediction

M32, which contains 98 genes, was selected to demonstrate the application of a gene co-expression module in gene function prediction. Only 3 genes were annotated as hypothetical protein coding genes, and the other 95 genes have unambiguous functional annotations. To confirm their role in heat shock, these 98 genes were searched in NCBI PubMed, NCBI Gene and Google Scholar to check their association with heat shock. We found 25 HSPs, 6 HSFs, 14 chaperones and 48 heat shock response genes (Supplementary Table S8). However, some of the 48 heat shock response genes were reported only in microarray studies and lack support from functional experiments. The remaining 5 genes (AT2G41170, AT4G02550, AT1G44414, AT1G55530, and AT3G56250) have not been reported to be associated with heat shock in any previous studies. Therefore, based on guilt by association, these 5 genes are potential heat shock-responsive genes that merit functional verification.

Higher-order module organization

To observe the organization among these modules, the network of the 52 identified modules was also analyzed. These 52 modules are organized into 15 interconnecting meta-modules. A global connectivity analysis shows that the 3 most highly connected modules were M13, M8, and M4, which are annotated with biosynthesis of secondary metabolites, oxidation reduction, and translation, respectively (Fig. 3a, Supplementary Table S9). The results may indicate the importance and complexity of secondary metabolites (Kliebenstein 2004).

Fig. 3

Higher-order network and its relationships. a A network shows the connections between modules. b Hierarchical clustering shows the 52 modules organize into 15 meta-modules, which are denoted with colors and module numbers at corresponding leaves

To check the relationships between these meta-modules, a clustering diagram was plotted, which showed that these 15 meta-modules can be divided into 6 major branches. The two orphan branches are the meta-modules in green–yellow and tan, which correspond to chloroplast and to valine, leucine and isoleucine degradation, respectively (Fig. 3b). From these results, it can be inferred that chloroplast genes have a distinct transcriptional pattern compared to the other 14 meta-modules.

Comparison with previous Arabidopsis networks

To confirm our results, we compared them with several published networks. Mao and colleagues constructed a gene expression map from 1094 Arabidopsis microarrays and identified 382 sets of highly correlated genes (Mao et al. 2009). They identified 46 modules with significantly enriched GO terms, and 38 of those modules share common GO terms with our modules. However, their modules are small, and the analysis was limited to annotated genes. For example, they identified just 6 genes that respond to heat. Mutwil used 351 microarray data points and identified 181 gene clusters, and 27 of the 34 significant clusters share common functional annotations with our results (Mutwil et al. 2010).

Comparison with rice transcriptome

To test whether the identified modules were also present in rice, we further collected 2043 rice microarray data points and projected the transcriptomes onto the Arabidopsis modules using the R function modulePreservation in the WGCNA package. The analysis showed that only 4 modules (M2, M4, M5 and M6) were highly preserved, 11 modules (M1, M3, M10, M16, M18, M20, M26, M27, M33, M39 and M48) were weakly to moderately preserved, and the other 37 modules had lower preservation Zsummary values than the former 15 modules (Fig. 4). M4 showed the strongest preservation, while M21 showed the weakest (Supplementary Table S10). The well-preserved modules include modules associated with translation, rRNA processing, photosynthesis and chloroplast organization. An example of a non-preserved module is M7 (rank = 47), which involves lipid storage. Evidence suggests that the lipid change patterns in rice are different from those in Arabidopsis (Wang et al. 2012a). To provide more details about preservation between Arabidopsis and rice, 8 KEGG pathways, including Photosynthesis, Ribosome, Ubiquitin-mediated proteolysis, Endocytosis, Plant–pathogen interaction, Porphyrin and chlorophyll metabolism, Plant hormone signal transduction, and Phenylpropanoid biosynthesis, were analyzed. These pathways were identified in both the Arabidopsis and rice datasets. Module preservation statistics and module membership correlation analysis show that the Photosynthesis, Ribosome, Endocytosis, Porphyrin and chlorophyll metabolism, and Phenylpropanoid biosynthesis pathways were preserved, while the preservation Zsummary was low for Ubiquitin-mediated proteolysis, Plant–pathogen interaction and Plant hormone signal transduction (Fig. 5). Networks demonstrated the similarity of the Photosynthesis and Ribosome pathways, as well as differences in hormone signaling, between Arabidopsis and rice (Fig. 6).
The unique complexities of hormone-mediated defence networking have been summarized in recent publications (De Vleesschauwer et al. 2014; Ma et al. 2010). For a better exploration of the KEGG pathways between the two species, a shiny-based web viewer was developed (Chang et al. 2015). The tool is available here: http://bioinformatics.fafu.edu.cn/arabi/.

Fig. 4

Rice transcriptome preservation in Arabidopsis. Dashed green and blue lines represent the Zsummary threshold for strong (Z > 10) and weak–moderate (2 < Z < 10) module preservation. Numbers along with coloured dots represent the identified modules. Module size is the number of rice orthologs for Arabidopsis

Fig. 5

KEGG pathway-based analysis of the preservation between Arabidopsis and rice. a, b and c show the preservation statistics, and Z > 10 indicates strong preservation, whereas 2 < Z < 10 indicates weak–moderate preservation. The dot labels are extracted from the first four letters for each pathway name. d, e, f, g, h, i, j and k plot the kME correlation for orthologs in the 8 pathways, including the Porphyrin and chlorophyll metabolism, Endocytosis, Photosynthesis, Ubiquitin-mediated proteolysis, Phenylpropanoid biosynthesis, Plant–pathogen interaction, Plant hormone signal transduction, and Ribosome pathways

Fig. 6

KEGG pathway-based network visualization for Arabidopsis and rice. Networks are shown for a Photosynthesis; b Ribosome; c Hormone signal transduction. The size of the black dot denotes gene connectivity in Arabidopsis and its orthologs in rice. The red line connecting two dots represents the connection strength between them

Discussion

With the development of high-throughput technologies, large-scale data-based network analysis has become a robust method to discover gene function, cellular machinery, modularity, conservation or tissue differences (Boruc et al. 2010; He and Maslov 2016; Ruprecht et al. 2017; He et al. 2016). WGCNA has been widely used in biomedical research; however, its application in plants has lagged due to the small sample sizes of individual experiments, and sample size is an important factor in GCN analysis. Although state-of-the-art RNA-Seq data for Arabidopsis have accumulated over the years, the diversity in library type, sequencing depth, and instrument type makes it hard to conduct integrative analysis. Studies have shown that microarray data seem more suited for gene network analysis (Giorgi et al. 2013), and that a larger sample size may help to detect gene co-expression modules more robustly (Oldham et al. 2008). In this study, we collected a compendium of Arabidopsis transcriptome data and identified 52 co-expressed gene modules. All the sample data were pooled together to construct a non-targeted pan network, which differs from many previous works that targeted one or more specific biological conditions, as mentioned in the Introduction. Thus, we can identify pan modules that are a compilation of nearly all possible co-expressed modules under various conditions. We believe that gene–gene connections underlie any transcriptome, even if those connections cannot be detected when all the samples have the same status (e.g., a static transcriptome). The inherent gene–gene connections can be detected when perturbation occurs within the network, as in the biomedical project Connectivity Map, in which cell line transcriptomes were perturbed by drug treatments to find gene signatures (Lamb et al. 2006).
Therefore, the tissue origin or experimental treatment can both be considered as a perturbation, which is the reason for our cross-studies data integration. Fortunately, the standardized microarray platform data provide a possibility for integration.

Although early studies have constructed non-targeted GCNs for Arabidopsis (Mao et al. 2009; Mutwil et al. 2010; He and Maslov 2016), our analysis improves power and has a different perspective that is more concentrated on biological meanings. First, after identifying the gene modules, module stability was tested by half-sampling. Second, connectivity-based analysis shows the important genes in the global network, and module-based gene variation analysis may infer potential basal and responsive modules. Third, we further analyzed the associations between modules and experimental conditions (phenotype annotation), which may help to identify important modules under specific conditions/traits. These modules could be potential signatures for a trait. Finally, as an important crop, rice gene function research often starts from Arabidopsis orthologs. A comparative network analysis of relevant KEGG pathways shows network-based transcriptional conservation of some basal cellular machinery and divergence of some responsive modules between Arabidopsis and rice. The divergent components may reflect species diversity, but conserved components define preserved gene models across species that may facilitate standardization of experimental models (Mueller et al. 2017). Although the rank of preservation statistics can reflect the pathway preservation between Arabidopsis and rice, the statistics should be interpreted with caution due to the imbalanced tissues/conditions representation in the datasets analyzed.

Overall, our results provide a comprehensive network view for the Arabidopsis transcriptome and may serve as a valuable resource for candidate gene function investigation.

Author contribution statement

WL, LL, and HH conceived and designed this study. ZZ and YL collected data. SL, KG and HT analyzed the data. WL drafted the manuscript and developed the web tool. All authors read and approved the manuscript.