Introduction

Breast cancer is the most common and aggressive tumor affecting women, causing great physical and psychological harm [1]. The disease mainly affects women in their 40s to 60s, and women around the time of menopause are particularly prone to it. It is now the second most common cancer, just after lung cancer, and a principal cause of cancer death among women in both developing and developed countries [2]. However, the critical pathways involved in the occurrence and development of breast cancer, and the interactions among them, remain largely unknown. To date, early diagnosis is still the key to improving the curative effect of clinical treatment [3, 4]. In this study, we therefore aimed to explore the molecular mechanisms underlying the development of breast cancer and thereby provide evidence for further research.

Weighted Gene Co-expression Network Analysis (WGCNA) is a method frequently used to analyze correlations among co-expression modules in microarray samples [5]. It is also a comprehensive collection of R functions covering various aspects of weighted correlation network analysis. It has been widely applied to biological data, including cancer, genetics, and brain imaging data [6], and is helpful for identifying candidate biomarkers or therapeutic targets. It not only supports the comparison of differentially expressed genes but also helps to reveal the interactions among genes in different co-expression modules [7]. WGCNA has been performed on publicly available genome-wide microarray data and has been reported to be a promising and reliable tool for the clinical diagnosis of breast cancer. In this study, WGCNA identified a total of nine co-expression modules of genes with high topological overlap.

The Kyoto Encyclopedia of Genes and Genomes (KEGG) [8] is a bioinformatics resource for understanding the high-level functions and utilities of biological systems, such as the cell, the organism and the ecosystem, and is widely used in mechanism research. The KEGG analysis in this study showed that the enriched pathway hsa04120 (ubiquitin-mediated proteolysis) in co-expression module 9 was particularly relevant to the occurrence of breast cancer. We hope that our study will contribute to the discovery of biomarkers for the clinical diagnosis of breast cancer.

Materials and methods

Expression value analysis of microarray data of breast cancer samples

Probe values were downloaded from the GEO database of NCBI (https://www.ncbi.nlm.nih.gov/geo/) using the key word “breast cancer”. The annotation information of the microarray data was used to match probes with the corresponding genes. Probes matching more than one gene were eliminated, and for genes matching more than one probe the average expression value was calculated. The number of genes retained at different expression thresholds was calculated to determine an appropriate threshold. The WGCNA algorithm was used to evaluate the expression values of genes. In addition, the flashClust package in R [9] was used to perform cluster analysis of the samples at the chosen threshold.
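The probe-to-gene step described above can be sketched as follows. This is a minimal illustration in Python rather than the R workflow used in the study; the function name, the probe identifiers and the gene symbols are hypothetical, and dropping probes with no annotated gene is an added assumption.

```python
from collections import defaultdict
from statistics import mean


def collapse_probes_to_genes(probe_values, probe_to_genes):
    """Collapse probe-level values to one expression vector per gene.

    probe_values:   probe id -> list of expression values (one per sample)
    probe_to_genes: probe id -> list of gene symbols the probe matches
    """
    rows_per_gene = defaultdict(list)
    for probe, values in probe_values.items():
        genes = probe_to_genes.get(probe, [])
        if len(genes) != 1:
            # Probes matching several genes are eliminated; unannotated
            # probes are dropped too (assumption, not stated in the text).
            continue
        rows_per_gene[genes[0]].append(values)
    # Average sample by sample over all probes that match the same gene.
    return {
        gene: [mean(column) for column in zip(*rows)]
        for gene, rows in rows_per_gene.items()
    }


# Hypothetical toy data: two samples, four probes.
probe_values = {
    "p1": [2.0, 4.0],
    "p2": [4.0, 8.0],
    "p3": [1.0, 1.0],
    "p4": [5.0, 5.0],
}
probe_to_genes = {
    "p1": ["GENE_A"],
    "p2": ["GENE_A"],            # second probe for GENE_A, averaged with p1
    "p3": ["GENE_B", "GENE_C"],  # ambiguous probe, eliminated
    "p4": ["GENE_D"],
}
gene_expression = collapse_probes_to_genes(probe_values, probe_to_genes)
```

In this toy input, GENE_A receives the sample-wise average of probes p1 and p2, while the ambiguous probe p3 contributes nothing.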

Analysis of co-expression modules of breast cancer

Candidate power values were screened with the WGCNA algorithm [5] during the construction of co-expression modules. Scale independence and average connectivity of the modules were analyzed for each power value by a gradient test (power values ranging from 1 to 20). The appropriate power value was the one at which the scale independence value reached 0.8. The WGCNA algorithm was then used to construct the co-expression modules and to extract the gene information for each module. The minimum number of genes per module was set to 50 to ensure the reliability of the results.
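For orientation, the quantities behind this power screening can be written out as below. This is a sketch of the usual WGCNA soft-thresholding definitions, assuming an unsigned network (absolute correlation) and assuming that the smallest power reaching the threshold is chosen; the text does not state either choice. Here x_i is the expression profile of gene i across samples, beta is the power value, k_i is the connectivity of gene i, and p(k) is the frequency of genes with connectivity k.

```latex
\begin{align}
a_{ij}(\beta) &= \left|\operatorname{cor}(x_i, x_j)\right|^{\beta}, \\
k_i(\beta) &= \sum_{j \neq i} a_{ij}(\beta), \\
R^2(\beta) &= \text{squared correlation between } \log_{10} p(k) \text{ and } \log_{10} k, \\
\beta^{*} &= \min\{\beta \in \{1, \dots, 20\} : R^2(\beta) \geq 0.8\}.
\end{align}
```

Scale independence corresponds to R^2(beta), and average connectivity is the mean of k_i(beta) over all genes. Raising beta generally improves the scale-free fit but lowers connectivity, which is why both curves are inspected together.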

Interaction analysis of co-expression modules of breast cancer

The WGCNA algorithm was used to analyze the interactions among the co-expression modules. A heatmap package in R was used to visualize the strength of these relationships (from weak to strong).

Functional annotation analysis of co-expression genes of breast cancer

Co-expression modules were ranked from largest to smallest by number of genes. Functional enrichment analysis was then performed on the genes in these modules. The corresponding gene information was mapped to the DAVID database (https://david.ncifcrf.gov/summary.jsp) [10], and Gene Ontology (GO) [11] and KEGG [8, 12] enrichment analyses were performed to obtain the enriched biological processes and metabolic pathways. Results with P < 0.05 were considered significant. If more than five records were returned, the top five were selected for further analysis.
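The selection rule for enrichment records can be illustrated with a short Python sketch. It assumes that "top five" means the five records with the smallest P values, which the text does not specify; the function name, term labels and P values are hypothetical and are not results from this study.

```python
def select_top_terms(records, p_cutoff=0.05, max_terms=5):
    """Keep records with P < p_cutoff, then return at most max_terms of
    them, ordered by ascending P value (assumed ranking criterion)."""
    significant = [r for r in records if r["p_value"] < p_cutoff]
    significant.sort(key=lambda r: r["p_value"])
    return significant[:max_terms]


# Hypothetical enrichment records for one module.
records = [
    {"term": "T1", "p_value": 0.001},
    {"term": "T2", "p_value": 0.2},
    {"term": "T3", "p_value": 0.03},
    {"term": "T4", "p_value": 0.004},
    {"term": "T5", "p_value": 0.049},
    {"term": "T6", "p_value": 0.05},   # not below the cutoff, excluded
    {"term": "T7", "p_value": 0.02},
    {"term": "T8", "p_value": 0.0005},
]
top_terms = [r["term"] for r in select_top_terms(records)]
```

Six of the eight toy records pass the cutoff, so the one with the largest P value among them (T5) is dropped to leave five.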

Results

Expression values analysis of microarray data of breast cancer

A total of 136 typical breast cancer samples were obtained from NCBI under accession number GSE12903 (https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE12903) [13]. The platform was GPL96 ([HG-U133A] Affymetrix Human Genome U133A Array), and the sample accession numbers ranged from GSM305129 to GSM30526. This dataset was comparatively large and recent. The 136 tumors were from breast cancer patients who had received adjuvant tamoxifen therapy only. The sources of the frozen tumor specimens and the clinical information of the patients are listed in Table 1 [13]. The raw microarray data were converted to gene expression values. Probes matching more than one gene were eliminated, and for genes matching more than one probe the average expression value was taken as the final expression value of the gene. In addition, genes with negative values were eliminated. As a result, expression values for a total of 12,389 genes were obtained. The 4000 genes with the highest average expression values were then selected for cluster analysis with the flashClust package and WGCNA (Fig. 1). As shown in Fig. 1, the 136 breast cancer samples fell into two main clusters. Cluster I contained two samples (GSM305262 and GSM305263), and Cluster II contained 134 samples, which could be further divided into two sub-clusters of 124 samples (Sub-Cluster I) and 10 samples (Sub-Cluster II).
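The gene filtering that precedes the clustering can be sketched in Python as below. This is an illustration only: it assumes that a gene is eliminated if any of its values is negative, which is one possible reading of the text, and the function name and toy values are hypothetical.

```python
from statistics import mean


def top_genes_by_mean(gene_expression, n_top):
    """Return the n_top gene names with the highest mean expression.

    gene_expression maps gene -> list of per-sample values. Genes with any
    negative value are removed first (assumed interpretation).
    """
    kept = {g: v for g, v in gene_expression.items() if min(v) >= 0}
    ranked = sorted(kept, key=lambda g: mean(kept[g]), reverse=True)
    return ranked[:n_top]


# Hypothetical toy data; the study selected 4000 of 12,389 genes.
toy_expression = {
    "GENE_A": [5.0, 7.0],   # mean 6.0
    "GENE_B": [9.0, -1.0],  # removed: contains a negative value
    "GENE_C": [2.0, 2.0],   # mean 2.0
    "GENE_D": [8.0, 8.0],   # mean 8.0
}
selected = top_genes_by_mean(toy_expression, n_top=2)
```

With the study's settings the call would use n_top=4000 on the collapsed gene-level matrix.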

Table 1 Tumor characteristics for breast cancer patients in the present study
Fig. 1

Cluster analysis of breast cancer samples. The 4000 genes with the highest average expression values were used for the analysis with WGCNA and flashClust. The samples fell into two main clusters, cluster I (pale red) and cluster II (pale blue), containing two and 134 samples, respectively. Two sub-clusters were identified in cluster II: sub-cluster I with 124 samples and sub-cluster II with ten samples (color figure online)

Construction of co-expression module of breast cancer

Co-expression modules were constructed from the expression values of the 4000 genes in the 136 breast cancer samples using the WGCNA algorithm. The power value is one of the most critical parameters in the construction process, as it mainly affects the scale independence and average connectivity of the co-expression modules. We therefore first screened for an appropriate power value. At a power value of 8, the scale independence reached 0.8 (Fig. 2a) while the average connectivity remained relatively high (Fig. 2b). A power value of 8 was therefore used for further analysis. The 4000 genes with the highest expression values in the 136 breast cancer samples were used to construct the co-expression modules (Fig. 2c). As a result, a total of nine co-expression modules were constructed with the selected power value (8), and each module was shown in a different color. The modules were numbered from largest to smallest by number of genes. There were 996 genes in module 1 (gray), 607 genes in module 2 (turquoise), 563 genes in module 3 (blue), 553 genes in module 4 (brown), 403 genes in module 5 (yellow), 371 genes in module 6 (green), 305 genes in module 7 (red), 120 genes in module 8 (black) and 82 genes in module 9 (pink). The average number of genes per module was 444. The module to which each gene belongs is listed in Supplementary Table 2.
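The reported module sizes can be checked against the other figures in this section with a few lines of Python. The sizes are taken from the paragraph above; the variable names are illustrative.

```python
# Module number -> number of genes, as reported in the text.
module_sizes = {
    1: 996, 2: 607, 3: 563, 4: 553, 5: 403,
    6: 371, 7: 305, 8: 120, 9: 82,
}

total_genes = sum(module_sizes.values())           # all genes assigned to a module
mean_module_size = total_genes / len(module_sizes)  # average genes per module
smallest_module = min(module_sizes.values())        # compare with the minimum of 50
```

The sizes add up to the 4000 genes used as input, the mean rounds to the reported 444, and the smallest module (82 genes) respects the minimum module size of 50.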

Fig. 2

Construction of co-expression modules of breast cancer-related genes. a The effect of different power values on the scale independence of the co-expression modules. b The effect of different power values on the average connectivity of the co-expression modules. c The co-expression modules constructed by WGCNA. Each branch in the figure represents one gene, and each color below represents one co-expression module. The label M on the right stands for module, and the number in brackets gives the number of genes in that module (color figure online)

Table 2 GO enrichment analysis of genes in the co-expression module

Interaction relationship among co-expression modules of genes

The interactions among the nine co-expression modules were further analyzed (Fig. 3). As can be seen from the result, there was no obvious difference in the interactions among modules overall, indicating that genes in each module were expressed relatively independently and that scale independence among modules was high. In addition, the connectivity of the module eigengenes was analyzed to better understand the interactions among the constructed co-expression modules. First, cluster analysis was performed on these eigengenes (Fig. 4a), and the nine modules fell into two clusters, one containing six modules (modules 1, 3, 5, 7, 8 and 9) and the other containing three modules (modules 2, 4 and 6). Furthermore, connectivity differed clearly between pairs of modules. Apart from the self-comparisons, which had the highest adjacency, three pairs of modules had much higher adjacency than the rest: modules 2 and 6, modules 3 and 5, and modules 7 and 8.
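One common way to express the adjacency between module eigengenes is to rescale their Pearson correlation to the range 0 to 1. The Python sketch below assumes this signed convention, adjacency = (1 + correlation) / 2; the text does not state which definition was used, and the function names and eigengene values are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def eigengene_adjacency(eigengenes):
    """Map each ordered pair of modules to (1 + correlation) / 2.

    Assumed signed convention: 0 means perfectly anti-correlated
    eigengenes, 1 means perfectly correlated (including self-comparison).
    """
    names = list(eigengenes)
    return {
        (a, b): (1 + pearson(eigengenes[a], eigengenes[b])) / 2
        for a in names
        for b in names
    }


# Hypothetical eigengene values across four samples.
toy_eigengenes = {
    "M_a": [1.0, 2.0, 3.0, 4.0],
    "M_b": [2.0, 4.0, 6.0, 8.0],  # rises with M_a
    "M_c": [4.0, 3.0, 2.0, 1.0],  # falls as M_a rises
}
adjacency = eigengene_adjacency(toy_eigengenes)
```

Under this convention the self-comparisons always take the maximum value, which matches the observation that the diagonal shows the highest adjacency.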

Fig. 3

Interaction analysis of co-expression genes. The colors on the horizontal and vertical axes represent different modules. The brightness of yellow in the middle represents the connectivity between modules. There was little difference in the interactions among modules, indicating a high degree of scale independence among them. The labels below represent the modules, and the numbers in brackets give the number of genes in the corresponding module (color figure online)

Fig. 4

Connectivity analysis of critical genes in different modules. a Cluster analysis of critical genes in modules. Two clusters were found, containing six modules (modules 1, 3, 5, 7, 8 and 9) and three modules (modules 2, 4 and 6), respectively. b Connectivity heatmap of critical genes in modules. The color change from blue (0) to red (1) in the heatmap represents the connectivity of critical genes in different modules from weak to strong. The labels on the right represent the modules, and the numbers in brackets give the number of genes in each module (color figure online)

Functional enrichment analysis of critical modules

GO and KEGG enrichment analyses were performed on the genes in the nine constructed modules. The biological process analysis showed that the enriched functions differed considerably among modules. The enriched GO terms in module 1 mainly concerned cell division, cell adhesion and DNA replication, including GO:0098609 (cell–cell adhesion), GO:0051301 (cell division) and GO:0006260 (DNA replication). The GO terms in module 2 mainly concerned the splicing and regulation of mRNA, including GO:0000398 (mRNA splicing, via spliceosome) and GO:0043488 (regulation of mRNA stability). Genes in module 3 were similar to those in module 2 and were mainly enriched in mRNA splicing, including GO:0000398 (mRNA splicing, via spliceosome) and GO:0008380 (RNA splicing). Genes in module 4 were significantly enriched in rRNA processing and translation initiation, including GO:0006364 (rRNA processing) and GO:0006413 (translational initiation). Genes in module 5 were mainly enriched in mitochondrial processes associated with energy supply, including GO:0006120 (mitochondrial electron transport, NADH to ubiquinone). Modules 6 and 7 were similar to module 1 and were mainly enriched in GO:0098609 (cell–cell adhesion). Module 8 was mainly enriched in immune and defense responses, including GO:0006955 (immune response), GO:0006954 (inflammatory response) and GO:0051607 (defense response to virus). In module 9, genes were mainly enriched in protein ubiquitination and destabilization, including GO:0031648 (protein destabilization) and GO:0016925 (protein sumoylation). The results of the KEGG enrichment analysis of genes in the nine modules are shown in Fig. 5. Significantly enriched metabolic pathways were found in each module, and the degree of enrichment differed considerably among modules.
Metabolic pathways in module 8 had the highest degree of enrichment, while those in module 1 had the lowest. The results of the KEGG analysis are listed in Table 3. Genes in module 1 were mainly enriched in hsa01100 (metabolic pathways) and hsa04110 (cell cycle). Genes in module 2 were mainly enriched in hsa03040 (spliceosome) and hsa00190 (oxidative phosphorylation). Genes in module 3 were mainly enriched in splicing and antibiotic synthesis pathways, including hsa03040 (spliceosome) and hsa01130 (biosynthesis of antibiotics). Genes in module 4 were mainly enriched in the hsa03010 (ribosome) and hsa03040 (spliceosome) pathways. Genes in module 5 were mainly enriched in hsa00190 (oxidative phosphorylation), while genes in module 6 were mainly enriched in hsa04141 (protein processing in endoplasmic reticulum) and hsa01130 (biosynthesis of antibiotics). Genes in module 7 were mainly enriched in the hsa04512 (ECM-receptor interaction) and hsa04510 (focal adhesion) pathways. Genes in module 8 were mainly enriched in the hsa04612 (antigen processing and presentation) and hsa04145 (phagosome) pathways, in accordance with the GO result for the immune response biological process. Genes in module 9 were mainly enriched in hsa04120 (ubiquitin-mediated proteolysis).

Fig. 5

KEGG enrichment heatmap of genes in the co-expression modules. The labels on the right give the KEGG pathway identifiers, and the labels below give the modules constructed in this study. M stands for module (color figure online)

Table 3 KEGG enrichment analysis of genes in the co-expression modules

Discussion

Breast cancer is the second most common tumor worldwide and particularly affects women around the time of menopause. It is also one of the principal causes of cancer death [14]. At present, there is still no fully effective treatment for patients with breast cancer, and prevention remains the most effective measure against the disease [3]. Moreover, patients at the same stage of disease can differ considerably in treatment response and overall outcome, which complicates the situation and makes research on prognostic or predictive markers of breast cancer more urgent. In this study, we aimed to identify critical biomarkers for a better understanding of the molecular mechanism, which could then be applied in the diagnosis or treatment of breast cancer. Co-expression patterns in breast cancer samples were examined by WGCNA, a powerful method for extracting groups of co-expressed genes from large expression datasets. As a result, a total of nine co-expression modules were identified by WGCNA in the dataset GSE12903 from NCBI GEO. In addition, the critical co-expression modules and the genes they contain were identified by GO and KEGG functional enrichment analysis. Early studies on breast cancer mostly relied on gene expression profiles, which had some disadvantages. Although genome-wide gene expression datasets for breast cancer were available and offered opportunities for translational advances and personalized medicine, challenges remained in data analysis. For example, differentially expressed genes identified on one platform often could not be reproduced on another, which made the results unreliable.

However, the WGCNA approach can largely avoid this disadvantage because it performs well across data types and focuses on groups of gene modules rather than individual genes. In addition, it does not rely on prior assumptions about genes or covariates. WGCNA can therefore avoid biologically incorrect assumptions about the independence of gene expression levels, since it transforms gene expression profiles into functional modules of co-expressed genes. To date, WGCNA has been applied to many types of cancer, such as lung cancer, brain cancer, and breast cancer. In this study, we found that the genes in two co-expression modules, module 8 and module 9, played essential roles in the immune response and in ubiquitin-mediated proteolysis, respectively, and these two modules were recognized as the most important modules in the occurrence of breast cancer. GO analysis showed that genes in module 8 were mainly involved in immune, inflammatory, and defense responses. Similarly, genes in module 9 played important roles in protein-related processes, such as ubiquitin-mediated proteolysis, protein destabilization, and protein sumoylation. Furthermore, KEGG analysis revealed that module 8 was mainly enriched in the hsa04612 (antigen processing and presentation) and hsa04145 (phagosome) pathways, whereas hsa01130 (biosynthesis of antibiotics) and hsa00190 (oxidative phosphorylation) were each enriched in more than one module. Most co-expression modules were closely associated with the immune reaction and ubiquitin-mediated proteolysis, and these two pathways were regarded as potential biomarkers in the mechanistic study of breast cancer. The enriched pathway hsa04120 (ubiquitin-mediated proteolysis) was recognized as the most critical prognostic marker in the occurrence of breast cancer.
Two other enriched pathways, hsa01130 (biosynthesis of antibiotics) and hsa00190 (oxidative phosphorylation), were each enriched in more than one co-expression module and were also closely associated with ubiquitin-mediated proteolysis. Taken together, we have reason to believe that these enriched pathways can function as biomarkers in the diagnosis of breast cancer. Cell proliferation has been reported to correlate with relapse rate in pre- and postmenopausal women with breast cancer [15], and women around menopause experience changes in hormone levels in vivo. Ubiquitin-mediated proteolysis is closely associated with the protein synthesis required for cell proliferation and hormone synthesis. For example, estrogen and progesterone, the two main hormones of the menopausal period, were reported to be strongly affected in women with breast cancer [16, 17]. These observations support ubiquitin-mediated proteolysis as a critical candidate biomarker pathway, although further investigation is required.

In summary, our study used a systems biology-based WGCNA approach to construct co-expression modules that may play a critical role in breast cancer. The immune response and ubiquitin-mediated proteolysis pathways, significantly enriched in module 8 and module 9, respectively, could function as prognostic and predictive markers in the clinical management of breast cancer.