Abstract
Background
Breast cancer is the most common and aggressive tumor affecting women worldwide. Although gene expression analyses have been performed previously, a systematic co-expression analysis of this cancer is still lacking. We aimed to identify the critical co-expression modules of breast cancer.
Methods
Co-expression modules were constructed with WGCNA, and the interactions among them were analyzed in R. Biological processes and pathways of the co-expressed genes were identified by GO and KEGG functional enrichment analysis using the DAVID database.
Results
Expression data for 4,000 genes from 136 breast cancer samples were used to construct co-expression modules, and nine modules were identified. Interaction analysis indicated relatively high scale independence among the modules. Moreover, adjacency degree differed markedly among modules. GO and KEGG enrichment analysis identified the modules enriched in immune response and ubiquitin-mediated proteolysis as the most critical modules of breast cancer.
Conclusion
Our results suggest that immune response and ubiquitin-mediated proteolysis could serve as prognostic and predictive markers for breast cancer, providing evidence for further analysis of its prognosis and treatment.
Introduction
Breast cancer is the most common and aggressive tumor affecting women, causing great physical and mental harm [1]. The disease largely affects women in their 40s to 60s, and women around the period of menopause are more prone to be affected. It is now the second most common cancer after lung cancer and a principal cause of cancer death among women in both developing and developed countries [2]. However, the critical pathways involved in the occurrence and development of breast cancer, and the interactions among them, remain largely unknown. To date, early diagnosis is still the key to improving the curative effect of clinical treatment [3, 4]. Therefore, in this study, we aimed to explore the molecular mechanisms underlying the development of breast cancer and thus provide evidence for further research.
Weighted Gene Co-expression Network Analysis (WGCNA) is a method frequently used to analyze co-expression modules in microarray samples [5]. It is implemented as a comprehensive collection of R functions covering various aspects of weighted correlation network analysis. It has been widely applied in areas such as cancer, genetics, and brain imaging data analysis [6], and is helpful for identifying candidate biomarkers or therapeutic targets. It helps not only in comparing differentially expressed genes but also in characterizing the interactions among genes in different co-expression modules [7]. WGCNA has previously been applied to publicly available genome-wide microarray data and has been reported to be a promising and reliable tool for breast cancer research. In this study, WGCNA identified a total of nine co-expression modules of genes with high topological overlap.
The Kyoto Encyclopedia of Genes and Genomes (KEGG) [8] is a bioinformatics resource for understanding high-level functions and utilities of biological systems, such as the cell, the organism and the ecosystem, and is widely used in mechanistic research. The KEGG analysis in this study showed that the enrichment of hsa04120 (ubiquitin-mediated proteolysis) in co-expression module 9 was particularly relevant to the occurrence of breast cancer. We hope our study will support the discovery of biomarkers for the clinical diagnosis of breast cancer.
Materials and methods
Expression value analysis of microarray data of breast cancer samples
Probe values were downloaded from the GEO database of NCBI (https://www.ncbi.nlm.nih.gov/geo/) using the keyword “breast cancer”. The annotation information of the microarray platform was used to match probes to their corresponding genes. Probes matching more than one gene were eliminated, and for genes matched by more than one probe the average expression value was calculated. The number of genes retained at different expression thresholds was calculated to determine an appropriate threshold. The WGCNA algorithm was used to evaluate the expression values of genes. In addition, the flashClust package in R [9] was used to cluster the samples at the chosen threshold.
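The probe-to-gene step described above can be illustrated with a minimal Python sketch. It is not the authors' R code; the function name `collapse_probes`, the probe identifiers and the gene symbols are hypothetical and serve only to show the two rules (drop probes that map to more than one gene, average probes that map to the same gene).

```python
from collections import defaultdict
from statistics import mean


def collapse_probes(probe_values, probe_to_genes):
    """Average probe-level values into one value per gene and sample.

    probe_values: dict probe_id -> list of expression values (one per sample)
    probe_to_genes: dict probe_id -> list of gene symbols from the annotation
    """
    per_gene = defaultdict(list)
    for probe, values in probe_values.items():
        genes = probe_to_genes.get(probe, [])
        # Keep only probes that map to exactly one gene.
        if len(genes) != 1:
            continue
        per_gene[genes[0]].append(values)
    # Average across probes of the same gene, sample by sample.
    return {
        gene: [mean(column) for column in zip(*rows)]
        for gene, rows in per_gene.items()
    }


# Hypothetical toy data: three samples, four probes.
probes = {
    "p1": [2.0, 4.0, 6.0],
    "p2": [4.0, 6.0, 8.0],
    "p3": [1.0, 1.0, 1.0],
    "p4": [9.0, 9.0, 9.0],
}
annotation = {
    "p1": ["GENE_A"],
    "p2": ["GENE_A"],
    "p3": ["GENE_B"],
    "p4": ["GENE_C", "GENE_D"],  # ambiguous probe, dropped
}
gene_expr = collapse_probes(probes, annotation)
```

In this toy input, GENE_A receives the per-sample mean of probes p1 and p2, and the ambiguous probe p4 contributes to no gene.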
Analysis of co-expression modules of breast cancer
Candidate power values were screened with the WGCNA [5] algorithm during construction of the co-expression modules. Scale independence and average connectivity were evaluated over a gradient of power values ranging from 1 to 20. The appropriate power value was taken as the one at which the scale independence value reached 0.8. The WGCNA algorithm was then used to construct the co-expression modules and extract the gene information of each module. The minimum module size was set to 50 genes to ensure the reliability of the results.
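To make the role of the power value concrete, the following Python sketch shows how a soft-thresholding power turns correlations into a weighted adjacency, and how a power can be picked from a table of scale-independence values. Several points are assumptions rather than statements from the paper: an unsigned adjacency |r|^power is used, the selection rule is "smallest power whose fit reaches 0.8", and the fit values in `fit_by_power` are invented toy numbers, not results from this study.

```python
def adjacency(corr, power):
    """Unsigned soft-threshold adjacency: |r| ** power, diagonal set to 0.

    The zero diagonal is a convenience so that row sums give connectivity
    without self-connections.
    """
    n = len(corr)
    return [
        [abs(corr[i][j]) ** power if i != j else 0.0 for j in range(n)]
        for i in range(n)
    ]


def connectivity(adj):
    """Connectivity of each gene: sum of its adjacencies to all other genes."""
    return [sum(row) for row in adj]


def pick_power(fit_by_power, threshold=0.8):
    """Return the smallest power whose scale-independence fit reaches threshold."""
    for power in sorted(fit_by_power):
        if fit_by_power[power] >= threshold:
            return power
    return None


# Toy correlation matrix for three genes (hypothetical values).
corr = [
    [1.0, 0.5, -0.5],
    [0.5, 1.0, 0.25],
    [-0.5, 0.25, 1.0],
]
k = connectivity(adjacency(corr, power=2))
mean_k = sum(k) / len(k)

# Hypothetical fit values for a few powers, for illustration only.
fit_by_power = {2: 0.35, 4: 0.58, 6: 0.74, 8: 0.81, 10: 0.86}
chosen = pick_power(fit_by_power)
```

Raising the power shrinks weak correlations faster than strong ones, which is why average connectivity falls as the power increases and why the two criteria have to be balanced.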
Interaction analysis of co-expression modules of breast cancer
The WGCNA algorithm was used to analyze the interaction relationships among the co-expression modules. A heatmap package in R was used to display the strength of these relationships.
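One common way to quantify the relationship between two modules is an adjacency derived from the correlation of their summary profiles, of the form (1 + r) / 2. The Python sketch below illustrates this under stated assumptions: whether the authors used exactly this formula is not given in the text, the module names are hypothetical, and a plain per-sample profile stands in for the module eigengene, which in WGCNA is a first principal component.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)


def module_adjacency(profiles):
    """Pairwise adjacency (1 + r) / 2 between module summary profiles."""
    names = list(profiles)
    return {
        (a, b): (1 + pearson(profiles[a], profiles[b])) / 2
        for a in names
        for b in names
    }


# Hypothetical summary profiles over four samples.
profiles = {
    "module_A": [1.0, 2.0, 3.0, 4.0],
    "module_B": [2.0, 4.0, 6.0, 8.0],  # perfectly correlated with module_A
    "module_C": [4.0, 3.0, 2.0, 1.0],  # perfectly anti-correlated with module_A
}
adj = module_adjacency(profiles)
```

With this convention the adjacency lies between 0 and 1, so the resulting matrix can be drawn directly as a heatmap; the self-comparisons on the diagonal are always 1.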
Functional annotation analysis of co-expression genes of breast cancer
Co-expression modules were ranked from largest to smallest by number of genes. Functional enrichment analysis was then performed on the genes in these modules. The corresponding gene information was submitted to the DAVID database (https://david.ncifcrf.gov/summary.jsp) [10], and Gene Ontology (GO) [11] and KEGG [8, 12] enrichment analyses were performed to obtain the enriched biological processes and metabolic pathways. Terms with P < 0.05 were considered significant. If more than five records were significant, the top five were selected for further analysis.
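The selection rule above (P < 0.05, at most five records) can be sketched in Python as follows. The enrichment P value here is a plain one-sided hypergeometric test, used as a stand-in; DAVID reports its own modified statistic, so this is an illustration of the idea rather than a reimplementation. The term names and P values are hypothetical.

```python
from math import comb


def hypergeom_pvalue(k, n, K, N):
    """P(X >= k): chance of at least k annotated genes in a module of n genes,
    when K of the N background genes carry the annotation."""
    upper = min(n, K)
    hits = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return hits / comb(N, n)


def select_terms(pvalues, alpha=0.05, top=5):
    """Keep terms with P < alpha, sorted by P value, at most `top` of them."""
    significant = [(term, p) for term, p in pvalues.items() if p < alpha]
    significant.sort(key=lambda item: item[1])
    return [term for term, _ in significant[:top]]


# Toy example: 3 of 5 module genes annotated, 5 of 20 background genes annotated.
p = hypergeom_pvalue(k=3, n=5, K=5, N=20)

# Hypothetical P values for seven terms in one module.
pvalues = {
    "term_a": 0.001,
    "term_b": 0.20,
    "term_c": 0.03,
    "term_d": 0.004,
    "term_e": 0.049,
    "term_f": 0.01,
    "term_g": 0.02,
}
selected = select_terms(pvalues)
```

Six of the seven toy terms pass the P < 0.05 filter, and the cap of five drops the weakest of them.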
Results
Expression values analysis of microarray data of breast cancer
A total of 136 breast cancer samples were obtained from NCBI under accession number GSE12903 (https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE12903) [13]. The array platform was GPL96 ([HG-U133A] Affymetrix Human Genome U133A Array), and the cancer samples ranged from GSM305129 to GSM30526. This dataset was comparatively large and recent. These 136 tumors were from breast cancer patients who had received adjuvant tamoxifen therapy only. The sources of the frozen tumor specimens and the clinical information of the patients are listed in Table 1 [13]. The microarray data were converted to gene expression information from the original data. Probes matching more than one gene were eliminated, and for genes matched by more than one probe the average across probes was taken as the final expression value of the gene. In addition, genes with negative values were eliminated. As a result, expression values for a total of 12,389 genes were obtained. Then, the 4000 genes with the highest average expression values were selected for cluster analysis with the flashClust package alongside WGCNA (Fig. 1). As shown in Fig. 1, the 136 breast cancer samples were divided into two clusters overall. Cluster I contained two samples (GSM305262 and GSM305263), and Cluster II contained 134 samples, which could be further divided into two sub-clusters of 124 samples (Sub-Cluster I) and 10 samples (Sub-Cluster II).
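The gene filtering described above can be sketched in Python as follows. Two points are assumptions: a gene is treated as "negative" if any of its sample values is below zero, and ranking uses the arithmetic mean across samples. The function name `select_top_genes` and the gene labels are hypothetical.

```python
from statistics import mean


def select_top_genes(expr, top_n):
    """Drop genes with any negative value, then return the top_n genes
    ranked by mean expression across samples (highest first)."""
    kept = {gene: values for gene, values in expr.items() if min(values) >= 0}
    ranked = sorted(kept, key=lambda gene: mean(kept[gene]), reverse=True)
    return ranked[:top_n]


# Hypothetical toy data: four genes measured in two samples.
expr = {
    "g1": [5.0, 7.0],
    "g2": [-1.0, 10.0],  # contains a negative value, removed
    "g3": [2.0, 2.0],
    "g4": [8.0, 9.0],
}
top_genes = select_top_genes(expr, top_n=2)
```

In the study the same idea is applied with 12,389 genes remaining after filtering and the top 4000 retained.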
Construction of co-expression module of breast cancer
Co-expression modules were constructed from the expression values of 4000 genes in 136 breast cancer samples using the WGCNA algorithm. The power value is one of the most critical parameters in the construction process, as it mainly affects the scale independence and average connectivity of the co-expression modules. We therefore first screened for an appropriate power value. At a power value of 8, scale independence reached 0.8 (Fig. 2a) while average connectivity remained relatively high (Fig. 2b). Therefore, a power value of 8 was used for further analysis. The 4000 genes with the highest expression values in the 136 breast cancer samples were used to construct the co-expression modules (Fig. 2c). As a result, a total of nine co-expression modules were constructed with the selected power value (8), and each module was shown in a different color. The modules were numbered from largest to smallest by number of genes. There were 996 genes in module 1 (gray), 607 genes in module 2 (turquoise), 563 genes in module 3 (blue), 553 genes in module 4 (brown), 403 genes in module 5 (yellow), 371 genes in module 6 (green), 305 genes in module 7 (red), 120 genes in module 8 (black) and 82 genes in module 9 (pink). The average number of genes per module was 444. The module assignment of each gene is listed in Supplementary Table 2.
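The module sizes reported above can be checked with a few lines of Python. The sizes and colors are taken from the text; the variable names are illustrative.

```python
# Module sizes as reported in the text, keyed by module color.
module_sizes = {
    "gray": 996,
    "turquoise": 607,
    "blue": 563,
    "brown": 553,
    "yellow": 403,
    "green": 371,
    "red": 305,
    "black": 120,
    "pink": 82,
}

total_genes = sum(module_sizes.values())
average_size = total_genes / len(module_sizes)
smallest = min(module_sizes.values())

print(total_genes)          # 4000
print(round(average_size))  # 444
print(smallest >= 50)       # True: consistent with the minimum module size of 50
```

The nine sizes add up to the 4000 input genes, so every gene is assigned to exactly one of the reported modules, and the smallest module still exceeds the minimum size set in the Methods.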
Interaction relationship among co-expression modules of genes
The interaction relationships among the nine co-expression modules were further analyzed (Fig. 3). Overall, there was no obvious difference in the interaction relationships, indicating that genes in each module were expressed relatively independently and that scale independence among modules was high. In addition, the connectivity of the module eigengenes was analyzed to better understand the interactions among the constructed co-expression modules. First, cluster analysis was performed on the eigengenes (Fig. 4a), and the nine modules fell into two clusters, one containing six modules (modules 1, 3, 5, 7, 8 and 9) and the other containing three modules (modules 2, 4 and 6). Furthermore, connectivity differed markedly among modules. Apart from the self-comparisons, which had the highest adjacency, three pairs of modules had much higher adjacency than the rest: module 2 and module 6, module 3 and module 5, and module 7 and module 8.
Functional enrichment analysis of critical modules
GO and KEGG enrichment analyses were performed on the genes in the nine constructed modules. The biological process analysis showed that the enriched functions differed considerably among modules. The enriched GO terms in module 1 mainly concerned cell adhesion, cell division and DNA replication, including GO:0098609 (cell–cell adhesion), GO:0051301 (cell division) and GO:0006260 (DNA replication). Genes in module 2 were mainly enriched in mRNA splicing and regulation, including GO:0000398 (mRNA splicing, via spliceosome) and GO:0043488 (regulation of mRNA stability). Genes in module 3 were similar to those in module 2 and were mainly enriched in mRNA splicing, including GO:0000398 (mRNA splicing, via spliceosome) and GO:0008380 (RNA splicing). Genes in module 4 were significantly enriched in rRNA processing and translational initiation, including GO:0006364 (rRNA processing) and GO:0006413 (translational initiation). Genes in module 5 were mainly enriched in mitochondrial processes associated with energy supply, including GO:0006120 (mitochondrial electron transport, NADH to ubiquinone). Modules 6 and 7 were similar to module 1 and were mainly enriched in GO:0098609 (cell–cell adhesion). Module 8 was mainly enriched in immune and defense responses, including GO:0006955 (immune response), GO:0006954 (inflammatory response) and GO:0051607 (defense response to virus). Genes in module 9 were mainly enriched in protein ubiquitination and destabilization, including GO:0031648 (protein destabilization) and GO:0016925 (protein sumoylation). The results of the KEGG enrichment analysis of the nine modules are shown in Fig. 5. Each module had significantly enriched metabolic pathways, and the degree of enrichment differed considerably among modules.
Metabolic pathways in module 8 had the highest degree of enrichment, while those in module 1 had the lowest. The results of the KEGG analysis are summarized in Table 3. Genes in module 1 were mainly enriched in hsa01100 (metabolic pathways) and hsa04110 (cell cycle). Genes in module 2 were mainly enriched in hsa03040 (spliceosome) and hsa00190 (oxidative phosphorylation). Genes in module 3 were mainly enriched in splicing and antibiotic synthesis pathways, including hsa03040 (spliceosome) and hsa01130 (biosynthesis of antibiotics). Genes in module 4 were mainly enriched in hsa03010 (ribosome) and hsa03040 (spliceosome). Genes in module 5 were mainly enriched in hsa00190 (oxidative phosphorylation), while genes in module 6 were mainly enriched in hsa04141 (protein processing in endoplasmic reticulum) and hsa01130 (biosynthesis of antibiotics). Genes in module 7 were mainly enriched in hsa04512 (ECM-receptor interaction) and hsa04510 (focal adhesion). Genes in module 8 were mainly enriched in hsa04612 (antigen processing and presentation) and hsa04145 (phagosome), in accordance with the GO result of enrichment in immune response. Genes in module 9 were mainly enriched in hsa04120 (ubiquitin-mediated proteolysis).
Discussion
Breast cancer is the second most common tumor worldwide and particularly affects women around the period of menopause. It is also one of the principal causes of cancer death [14]. At present, treatment options for breast cancer remain limited in efficacy, and prevention is regarded as the most effective measure against the disease [3]. Moreover, patients at the same stage of disease can have quite different treatment responses and overall outcomes, which complicates clinical management and makes research on prognostic or predictive markers of breast cancer more urgent. In this study, we aimed to identify critical biomarkers for a better understanding of the molecular mechanism, which could then be applied in the diagnosis or treatment of breast cancer. Co-expression patterns in breast cancer samples were examined by WGCNA, a powerful method for extracting groups of co-expressed genes from large expression datasets. As a result, a total of nine co-expression modules were identified by WGCNA in the dataset GSE12903 from NCBI GEO. In addition, the critical co-expression modules and the genes they contain were characterized by GO and KEGG functional enrichment analysis. Early studies of breast cancer relied mostly on gene expression profiles, which had some disadvantages. Although genome-wide gene expression datasets for breast cancer were available and offered opportunities for translational advances and personalized medicine, challenges remained in data analysis. For example, the results of a differential expression analysis obtained on one platform may not agree with those obtained on another, which makes the results unreliable.
The WGCNA approach can largely avoid this disadvantage, because it performs well across data types and focuses on gene modules rather than individual genes. In addition, it does not rely on prior assumptions about genes or covariates. WGCNA can therefore avoid biologically incorrect assumptions about the independence of gene expression levels, since it transforms gene expression profiles into functional modules of co-expressed genes. To date, WGCNA has been applied to many types of cancer, such as lung cancer, brain cancer, and breast cancer. In this study, we found that the genes in two co-expression modules, module 8 and module 9, played essential roles in immune response and ubiquitin-mediated proteolysis, respectively, and these two modules were recognized as the most important modules in the occurrence of breast cancer. GO analysis showed that genes in module 8 were mainly involved in immune, inflammatory, and defense responses. Similarly, genes in module 9 played important roles in protein modification and degradation, such as ubiquitin-mediated proteolysis, protein destabilization, and protein sumoylation. Furthermore, KEGG analysis revealed that module 8 was mainly enriched in hsa04612 (antigen processing and presentation) and hsa04145 (phagosome). Immune response and ubiquitin-mediated proteolysis were therefore regarded as potential biomarkers in the mechanistic study of breast cancer. The enriched pathway hsa04120 (ubiquitin-mediated proteolysis) was recognized as the most critical prognostic marker in the occurrence of breast cancer.
Two other pathways, hsa01130 (biosynthesis of antibiotics) and hsa00190 (oxidative phosphorylation), were each enriched in more than one co-expression module and are also closely associated with ubiquitin-mediated proteolysis. Taken together, these results give us reason to believe that the enriched pathways could function as biomarkers in the diagnosis of breast cancer. Cell proliferation has been reported to correlate with relapse rate in pre- and postmenopausal women with breast cancer [15], and women around this period experience changes in hormone levels in vivo. Ubiquitin-mediated proteolysis is closely associated with the protein turnover required for cell proliferation and hormone signaling. For example, estrogen and progesterone, the two main hormones that change around menopause, act through receptor proteins whose levels carry prognostic value in breast cancer [16, 17]. Because these receptors are proteins subject to turnover, a role for the ubiquitin-mediated proteolysis pathway as a critical biomarker is plausible, although this requires further investigation.
In summary, our study used the systems biology-based WGCNA approach to construct co-expression modules that may play a critical role in breast cancer. Immune response and ubiquitin-mediated proteolysis, enriched in module 8 and module 9 respectively, could function as prognostic and predictive markers in the clinical management of breast cancer.
References
Berg JW, Robbins G. Factors influencing short and long term survival of breast cancer patients. Surg Gynecol Obstet. 1966;122:1311.
Adair F, Berg J, Joubert L, Robbins GF. Long-term followup of breast cancer patients: the 30-year report. Cancer. 1974;33:1145–50.
Saez RA, McGuire WL, Clark GM. Prognostic factors in breast cancer. Semin Surg Oncol. 1989;5:102–10.
Bloom H, Richardson W. Histological grading and prognosis in breast cancer: a study of 1409 cases of which 359 have been followed for 15 years. Br J Cancer. 1957;11:359.
Langfelder P, Horvath S. WGCNA: an R package for weighted correlation network analysis. BMC Bioinform. 2008;9:559.
Ivliev AE, 't Hoen PAC, Sergeeva MG. Coexpression network analysis identifies transcriptional modules related to proastrocytic differentiation and sprouty signaling in glioma. Cancer Res. 2010;70:10060–70.
Clarke C, Madden SF, Doolan P, Aherne ST, Joyce H, O’Driscoll L, et al. Correlating transcriptional networks to breast cancer survival: a large-scale coexpression analysis. Carcinogenesis. 2013;34:2300–8.
Kanehisa M, Goto S. KEGG: kyoto encyclopedia of genes and genomes. Nucleic Acids Res. 2000;28:27–30.
Ihaka R, Gentleman R. R: a language for data analysis and graphics. J Comput Graph Stat. 1996;5:299–314.
Dennis G, Sherman BT, Hosack DA, Yang J, Gao W, Lane HC, et al. DAVID: database for annotation, visualization, and integrated discovery. Genome Biol. 2003;4:R60.
Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, et al. Gene ontology: tool for the unification of biology. Nat Genet. 2000;25:25–9.
Kanehisa M, Goto S, Kawashima S, Okuno Y, Hattori M. The KEGG resource for deciphering the genome. Nucleic Acids Res. 2004;32:D277–80.
Zhang Y, Sieuwerts AM, McGreevy M, Casey G, Cufer T, Paradiso A, et al. The 76-gene signature defines high-risk patients that benefit from adjuvant tamoxifen therapy. Breast Cancer Res Treat. 2009;116:303–9.
Fisher B, Bauer M, Wickerham DL, Redmond CK, Fisher ER, Cruz AB, et al. Relation of number of positive axillary nodes to the prognosis of patients with primary breast cancer. An NSABP update. Cancer. 1983;52:1551–7.
Isola J, Visakorpi T, Holli K, Kallioniemi O-P. Association of overexpression of tumor suppressor protein p53 with rapid cell proliferation and poor prognosis in node-negative breast cancer patients. J Natl Cancer Inst. 1992;84:1109–14.
Foekens JA, Portengen H, Van Putten WL, Peters HA, Krijnen HL, Alexieva-Figusch J, et al. Prognostic value of estrogen and progesterone receptors measured by enzyme immunoassays in human breast tumor cytosols. Cancer Res. 1989;49:5823–8.
Berger U, Wilson P, Thethi S, McClelland RA, Greene GL, Coombes RC. Comparison of an immunocytochemical assay for progesterone receptor with a biochemical method of measurement and immunocytochemical examination of the relationship between progesterone and estrogen receptors. Cancer Res. 1989;49:5176–9.
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
About this article
Cite this article
Zhao, Q., Song, W., He, D.y. et al. Identification of key gene modules and pathways of human breast cancer by co-expression analysis. Breast Cancer 25, 213–223 (2018). https://doi.org/10.1007/s12282-017-0817-5
DOI: https://doi.org/10.1007/s12282-017-0817-5