Background

Breast cancer is the most common malignant tumor in women, and its incidence is increasing by about 0.5% per year [1]. According to Global Cancer Statistics 2018, there were 2,088,849 new cases of breast cancer and 626,679 related deaths worldwide in 2018 [2]. With improvements in treatment in recent years, the 5-year survival rate of breast cancer in the United States has reached 90%; however, once distant metastasis occurs, the survival rate drops to 26% [1, 3]. Current treatments cannot cure metastatic advanced breast cancer, and the benefits and risks of treatment need to be weighed individually to obtain the best outcome [4]. Immunotherapy has been used in patients with advanced breast cancer who are resistant to first-line therapy and has shown clinical benefit, making it a promising new therapy [5].

The tumor immune microenvironment plays a key role in the development and progression of tumors [6]. A large clinical trial has shown that the infiltration of CD8+ T cells predicts a better prognosis for breast cancer [7]. However, the state of tumor-infiltrating T cells is highly heterogeneous, and T cell function offers a window into the immune microenvironment [8]. T cells express immune checkpoints such as cytotoxic T-lymphocyte antigen 4 (CTLA-4), which directly inhibit the cellular immune response [9]. Immune checkpoint inhibitors (ICIs) are currently used in immunotherapy to induce anti-tumor immunity by blocking CTLA-4 and PD-1/PD-L1 [10]. The system for patient selection and efficacy evaluation in immune checkpoint blockade therapy is still incomplete, and tumor CD8+ T cells are one of its important reference elements.

A previous study used immunohistochemistry (IHC) to evaluate the relationship between the level of tumor-infiltrating CD8+ T cells and breast cancer prognosis [11]. With the advent of high-throughput sequencing, tumor gene expression data have become easy to obtain. Gene transcription and epigenetics have been shown to play a crucial role in the differentiation and functional activation of CD8+ T cells [5]. Xie et al. reported an immune-related gene prognostic model of breast cancer, but it had no direct correlation with tumor-infiltrating CD8+ T cells, and its relationship with immune checkpoint expression was not tested [12]. CD8+ T cell-related genes have shown potential for predicting the prognosis of breast cancer. Therefore, we aimed to predict the prognosis of breast cancer and evaluate the tumor immune status through the expression of CD8+ T cell-related genes.

This study employed gene transcription data to construct a prognostic signature of breast cancer. We used a training set and validation sets to investigate the relationship between the signature and the clinical characteristics of breast cancer. The risk score was further used to predict the response to immune checkpoint blockade. In addition, gene functions and biological pathways were explored.

Methods

Data Collection

The training set came from The Cancer Genome Atlas breast cancer (TCGA-BRCA) database. The RNA transcription data and clinical information of 1097 tumor cases and 121 normal breast tissue cases were downloaded through the UCSC Xena website (https://xenabrowser.net) [13]. The gene transcription data were normalized by RSEM and transformed using the formula log2(expression value + 1). The validation sets, GSE20685, GSE7390, GSE16446, GSE69031, GSE21653, GSE25066, GSE26971, GSE75688, and GSE124821 [14,15,16,17,18,19,20,21,22], were downloaded from the Gene Expression Omnibus (GEO) (https://www.ncbi.nlm.nih.gov/geo/). The IMvigor210 data are freely available in the IMvigor210CoreBiologies R package [23]. Details of the above datasets are described in Supplementary Table 1.
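As a minimal illustration of the log2(expression value + 1) transformation mentioned above, the following Python sketch applies it to a short list of made-up expression values. The function name and the input values are hypothetical and are not taken from the study data.

```python
import math


def log2_plus_one(values):
    """Apply log2(x + 1) to each non-negative expression value."""
    return [math.log2(v + 1.0) for v in values]


# Made-up expression values, for illustration only.
raw_expression = [0.0, 1.0, 3.0, 1023.0]
print(log2_plus_one(raw_expression))
```

Adding 1 before taking the logarithm keeps genes with zero expression defined (they map to 0) while compressing the large dynamic range of expression values.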

Identify CD8+ T Cell-Related Genes in the Training Set

The ESTIMATE algorithm calculates an immune-score from the transcription profile, based on the expression of specific genes in the tumor [24]. We calculated an immune-score for each TCGA-BRCA sample. Samples with an immune-score above the upper quartile or below the lower quartile were assigned to the high- and low-immune-score groups, respectively. The immune-related genes were defined as the differentially expressed genes between the high- and low-immune-score groups. We used only samples above the upper quartile or below the lower quartile because the difference in immune infiltration status is likely to be greater between these two groups; the differentially expressed genes may therefore be more closely related to the tumor immune microenvironment. In the WGCNA and later analyses, all 1097 tumor samples were included. Next, single-sample gene set enrichment analysis (ssGSEA) [25] used the expression of marker genes to evaluate the relative levels of tumor-infiltrating immune cells in the training set. Weighted gene co-expression network analysis (WGCNA) is a systems biology method for finding highly correlated gene modules [26]. We used WGCNA to identify CD8+ T cell-related gene modules in breast cancer. The details of ssGSEA and WGCNA are described in the Supplementary Method.
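The quartile-based grouping described above can be sketched in Python as follows. This is an illustrative sketch only: the sample identifiers and scores are made up, the function name is hypothetical, and the "inclusive" quantile method is an assumption, since the text does not state how the quartiles were computed.

```python
import statistics


def split_by_quartiles(scores):
    """Return (low_ids, high_ids): samples below Q1 and above Q3.

    scores maps a sample identifier to its immune-score. Samples between
    the lower and upper quartiles are left out of both groups.
    """
    # Assumption: "inclusive" quartiles; other conventions shift the cutoffs.
    q1, _, q3 = statistics.quantiles(list(scores.values()), n=4, method="inclusive")
    low_ids = sorted(s for s, v in scores.items() if v < q1)
    high_ids = sorted(s for s, v in scores.items() if v > q3)
    return low_ids, high_ids


# Made-up immune-scores for eight hypothetical samples.
example_scores = {f"s{i}": float(i) for i in range(1, 9)}
low_group, high_group = split_by_quartiles(example_scores)
print("low:", low_group)
print("high:", high_group)
```

Dropping the middle half of the samples trades sample size for contrast, which matches the stated rationale that the two extreme groups are likely to differ more in immune infiltration.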

Construction of Prognostic Signature Based on CD8+ T Cell-Related Genes

To make the signature reflect patient prognosis, univariate Cox regression analysis was used to screen for CD8+ T cell-related genes that affect overall survival (OS). Because this screening can yield a large number of prognosis-related genes, least absolute shrinkage and selection operator (LASSO) Cox regression analysis was used to reduce the dimensionality of the model and avoid overfitting. The prognostic signature was established by the LASSO Cox regression model, which provided the key prognostic genes and their regression coefficients, and was verified by tenfold cross-validation. The risk-score model was constructed using the formula \(\text{risk score}=\sum_{i=1}^{n}(\text{Coefficient}_{i}\times \text{Expression of gene}_{i})\), where n is the number of genes in the signature. In the individual validation sets, multivariate Cox regression analysis was used to calculate the coefficients of the risk-score model based on the same gene panel. The median risk score was used to divide the patients into high- and low-risk groups in both the training and validation sets.

Prognostic Ability of the Risk Score Model

Based on the risk score model, the risk score was calculated for each patient in the training set. Kaplan–Meier curves and the log-rank test were applied to verify the prognostic value of the risk score model. Time-dependent receiver operating characteristic (ROC) curves assessed the accuracy of the risk score for predicting breast cancer OS at 3, 5, and 10 years. Subsequently, univariate and multivariate Cox regression analyses tested whether the risk score was an independent prognostic indicator. A nomogram was also generated for clinical application. In addition, Kaplan–Meier analysis was performed on all the validation datasets.
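For readers unfamiliar with the Kaplan–Meier estimator, the following Python sketch computes the survival curve for a small set of made-up follow-up times. It is a textbook product-limit estimator written for illustration, not the implementation used in this study (the analyses were run in R and Stata); the function name and the data are hypothetical.

```python
def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct time with at least one death.

    times: follow-up time of each patient.
    events: 1 if death was observed at that time, 0 if the patient was censored.
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        removed = 0
        # Collect every patient whose follow-up ends at time t.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            removed += 1
            i += 1
        if deaths > 0:
            # Product-limit step: multiply by the fraction surviving time t.
            survival *= 1.0 - deaths / at_risk
            curve.append((t, survival))
        # Deaths and censored patients both leave the risk set after t.
        at_risk -= removed
    return curve


# Made-up data: four patients, the second one censored at time 2.
print(kaplan_meier([1, 2, 3, 4], [1, 0, 1, 1]))
```

Computing one such curve per risk group and comparing them with a log-rank test corresponds to the high- versus low-risk comparison described above.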

The Risk-Score Model Predicts Immunotherapy Response

We first explored the correlation between the risk score and the expression of immune checkpoints in the TCGA dataset. Subsequently, the CIBERSORT scores of 22 infiltrating immune cell types and the ssGSEA scores of 28 infiltrating immune cell types were calculated in the TCGA dataset. The detailed process for obtaining the CIBERSORT [27] scores is also described in the Supplementary Method. Because traditional bulk sequencing is limited in evaluating the expression of the gene panel in different cell types, we introduced the single-cell sequencing dataset GSE75688 to examine the expression of the gene panel in the breast cancer tumor microenvironment. We further obtained the predicted anti-CTLA-4 and anti-PD-1/PD-L1 response values of TCGA breast cancer patients from The Cancer Immunome Atlas (https://tcia.at/) [28]. The Cancer Immunome Atlas provides results of comprehensive immunogenomic analyses for 20 solid cancers from TCGA and other data sources. For a specific ICI therapy, each sample is given an immunophenoscore (IPS), which scores tumor immunogenicity from 0 to 10 and can predict the efficacy of that therapy. We then compared the IPS between the high- and low-risk breast cancer groups. The ability of the risk-score model to predict immunotherapy response was also verified in the IMvigor210 and GSE124821 datasets. The IMvigor210 dataset contains transcriptome data from a clinical trial of atezolizumab in locally advanced and metastatic urothelial carcinoma [29]. GSE124821 includes bulk mRNA-seq data of pre-treatment and anti-CTLA-4/anti-PD-1-treated samples from sensitive and resistant mouse mammary tumors [22].

Gene Functional Enrichment Analysis

To understand the biological functions related to the prognostic signature, differentially expressed genes between the high- and low-risk groups were identified across the whole TCGA-BRCA transcription set. These differentially expressed genes were then used for enrichment analysis of Gene Ontology terms and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathways. The enriched Gene Ontology terms and KEGG pathways may reveal differences in the activation of biological processes between the risk groups.
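Enrichment of a gene list in a Gene Ontology term or KEGG pathway is commonly assessed with a one-sided hypergeometric (over-representation) test. The text does not state which test was used here, so the Python sketch below is an assumption-labeled illustration of that common approach rather than a description of the study's method; the function name and the numbers are hypothetical.

```python
from math import comb


def hypergeom_enrichment_p(universe_size, term_size, list_size, overlap):
    """One-sided over-representation p-value P(X >= overlap).

    universe_size: number of background genes.
    term_size: background genes annotated to the term or pathway.
    list_size: number of differentially expressed genes.
    overlap: differentially expressed genes annotated to the term.
    """
    total = comb(universe_size, list_size)
    upper = min(term_size, list_size)
    tail = sum(
        comb(term_size, k) * comb(universe_size - term_size, list_size - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total


# Made-up numbers: 10 background genes, 3 in the term, 3 in the gene list,
# and all 3 listed genes fall in the term.
print(hypergeom_enrichment_p(10, 3, 3, 3))
```

In practice the p-values for all tested terms would also be corrected for multiple testing, which this sketch omits.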

Statistical Analysis

Statistical analyses were performed using R software (version 3.6.1) (https://www.r-project.org/) and Stata (version 15.0). The R packages used in this study are listed in the Supplementary Method. P < 0.05 was considered statistically significant.

Results

Sample Characteristics and Immune-Score

The baseline information of the TCGA-BRCA cohort is summarized in Table 1, and the clinical information of the GSE20685 cohort is provided in Supplementary Table 2. The flowchart of signature construction and verification is presented in Fig. 1. We first calculated an immune-score for each tumor sample in the TCGA-BRCA dataset (Fig. 2A, Supplementary Table 3). Then, 1557 differentially expressed genes between the high- and low-immune-score groups were identified as immune-related genes for subsequent gene screening (Fig. 2B, Supplementary Table 4). ssGSEA was used to assess the abundance of tumor-infiltrating immune cells in the TCGA-BRCA samples (Fig. 2C).

Table 1 Baseline information of training set TCGA cohort

Fig. 1 The flowchart of signature construction and verification

Fig. 2 Selection of CD8+ T cell-related genes. A: Immune-score of TCGA-BRCA samples. B: Differentially expressed genes between the high- and low-immune-score TCGA-BRCA groups. C: 28 types of tumor-infiltrating immune cells in the high- and low-immune-score TCGA-BRCA groups. D-G: WGCNA identified the CD8+ T cell-related genes. D: Selection of the optimal soft threshold (β = 2). E: 6 gene clusters in the gene dendrogram. F: The correlation coefficients between the 6 gene modules and 28 types of tumor-infiltrating immune cells. G: The regression coefficient of turquoise module gene expression and tumor-infiltrating CD8+ T cell level

Gene Screening and Prognostic Signature Establishment

WGCNA was performed on the 1557 immune-related genes, and the optimal soft threshold (β = 2) was chosen (Fig. 2D). The genes clustered into 6 modules (Fig. 2E). The turquoise module contained the most genes and had the highest correlation with tumor-infiltrating immune cells. Supplementary Table 5 lists the genes in the turquoise module. The correlation coefficient between the turquoise module and activated CD8+ T cells was 0.85 (Fig. 2F). The turquoise module genes with both correlation coefficient and significance greater than 0.6 were defined as CD8+ T cell-related genes in breast cancer (n = 290, Fig. 2G). Subsequently, univariate Cox regression analysis identified 21 prognosis-related genes (Fig. 3A). LASSO Cox regression analysis then constructed a prognostic signature comprising a panel of 8 CD8+ T cell-related genes (Fig. 3B, Table 2). The formula was as follows: risk score = (-0.121) × SOCS1 + 0.520 × IL10 + (-0.191) × CAMK4 + (-0.013) × CXCL13 + (-0.122) × KIR2DS4 + (-0.127) × TESPA1 + (-0.123) × CD70 + (-0.022) × ICAM4. The risk score of each patient in the training and validation sets was calculated, and the patients were divided into high- and low-risk groups according to the median risk score. The risk scores of TCGA-BRCA patients and the corresponding expression of the 8-gene panel are shown in Fig. 3C. Interestingly, the risk score of basal-like subtype patients was lower than that of Luminal B subtype patients (Fig. 3D). The risk score of patients in AJCC stage III/IV was also higher than that of patients in stage I/II (Fig. 3E).
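The risk-score formula above can be written out as a short Python sketch. The coefficients are those reported in the formula; the patient identifiers and expression values are made up for illustration, and assigning patients exactly at the median to the low-risk group is an assumption, since the text does not say how ties were handled.

```python
import statistics

# Coefficients of the 8-gene signature as reported in the formula above.
COEFFICIENTS = {
    "SOCS1": -0.121,
    "IL10": 0.520,
    "CAMK4": -0.191,
    "CXCL13": -0.013,
    "KIR2DS4": -0.122,
    "TESPA1": -0.127,
    "CD70": -0.123,
    "ICAM4": -0.022,
}


def risk_score(expression):
    """Weighted sum of the expression values of the 8 signature genes."""
    return sum(coef * expression[gene] for gene, coef in COEFFICIENTS.items())


def split_by_median(scores):
    """Label each patient 'high' or 'low' relative to the median risk score.

    Assumption: scores equal to the median are labeled 'low'.
    """
    cutoff = statistics.median(scores.values())
    return {pid: ("high" if s > cutoff else "low") for pid, s in scores.items()}


# Hypothetical patients with made-up expression values.
baseline = {gene: 0.0 for gene in COEFFICIENTS}
patients = {
    "p1": dict(baseline),
    "p2": {**baseline, "IL10": 2.0},
    "p3": {**baseline, "CAMK4": 2.0},
}
scores = {pid: risk_score(expr) for pid, expr in patients.items()}
print(scores)
print(split_by_median(scores))
```

Because IL10 is the only gene with a positive coefficient, higher IL10 expression raises the score while higher expression of the other seven genes lowers it.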

Fig. 3 Construction of the CD8+ T cell-related gene expression signature. A: Univariate Cox regression analysis results of the 21 survival-related genes in the study. B: Coefficients of the 8 genes in the signature. C: Risk score distribution and heatmap of the expression of the 8 signature component genes. D: Risk scores of TCGA-BRCA molecular subtypes. E: Risk scores of TCGA-BRCA in different AJCC stages

Table 2 Signature component genes and the corresponding coefficients

Prognostic Ability of the Risk Score Model

Kaplan–Meier analysis showed that high-risk patients had a worse survival probability in the TCGA-BRCA dataset (P < 0.001, Fig. 4A), and the result in GSE20685 was similar (P = 0.002, Fig. 4B). High-risk patients also had poorer OS, disease-free survival (DFS), or metastasis-free interval (MFI) in the validation sets (Supplementary Fig. 1). IL10 is the most heavily weighted gene in the risk-score model. It is also a risk factor for breast cancer patients (HR = 1.261, 95% CI 1.035–1.430, Fig. 4C), but no significant association was seen in the subgroup analysis (Supplementary Fig. 2). Time-dependent ROC curve analysis showed that the areas under the ROC curve (AUCs) of the risk score model for predicting 3-, 5-, and 10-year OS of TCGA breast cancer patients were 0.736, 0.676, and 0.761, respectively (Fig. 4D). The corresponding AUCs in the GSE20685 cohort were 0.818, 0.781, and 0.817, respectively (Fig. 4E). The AUCs for the TCGA-BRCA molecular subtypes are also displayed (Fig. 4F). The chi-square test showed that elderly patients (age > 60) and AJCC stage III/IV patients made up a higher proportion of the high-risk group (Table 1). Multivariate Cox regression analysis indicated that the risk score was an independent risk factor for breast cancer (Table 3). Further, a nomogram including age, AJCC stage, HER2 status, PR status, and risk score was drawn for clinical prognostic application, and a calibration curve was used to assess the nomogram's predictions (Fig. 4G, H). The C-index was 0.80, indicating relatively high accuracy of the nomogram.
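To make the C-index concrete, the following Python sketch computes Harrell's concordance index for right-censored data on a few made-up patients. It is a plain illustration of the statistic, not the nomogram evaluation performed in this study; the function name and the data are hypothetical.

```python
def concordance_index(times, events, risk_scores):
    """Harrell's C-index for right-censored survival data.

    A pair (i, j) is comparable when patient i died (event = 1) before the
    follow-up time of patient j. The pair is concordant when the patient who
    died earlier has the higher risk score; tied scores count as 0.5.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            if events[i] == 1 and times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Made-up data: higher risk scores go with earlier deaths, so C = 1.0.
print(concordance_index([1, 2, 3], [1, 1, 1], [3.0, 2.0, 1.0]))
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking, which is the scale on which the reported C-index of 0.80 should be read.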

Fig. 4 Survival analysis results. A-B: Kaplan–Meier plots of overall survival (OS) probability of high- and low-risk breast cancer patients in A: TCGA-BRCA; B: GSE20685. C: OS probability of the high- and low-IL10 expression groups in TCGA-BRCA. D-E: Time-dependent ROC curve analysis predicting 3-, 5-, and 10-year OS in D: TCGA-BRCA; E: GSE20685. F: AUCs of time-dependent ROC curve analysis of OS in TCGA-BRCA molecular subtypes. G: The nomogram for clinical use of the signature. H: The calibration curve used to examine the nomogram

Table 3 Univariate and multivariate Cox regression analyses of the signature and clinicopathological characteristics of OS

The Risk Score Associated with Immune Infiltration and Efficacy of Immunotherapy

The single-cell RNA sequencing dataset GSE75688 showed that CAMK4, CXCL13, and TESPA1 were more highly expressed in T cells, whereas IL10 was expressed at a lower level in T cells (Fig. 5A-H). Both the CIBERSORT and ssGSEA methods showed that high-risk tumors had fewer tumor-infiltrating CD8+ T cells (Supplementary Fig. 3). High-risk tumors also had fewer tumor-infiltrating M1 macrophages but more M2 macrophages (Supplementary Fig. 3A), which suggested an immunosuppressive microenvironment in high-risk tumors. Bulk data analysis showed that the risk score was negatively correlated with the expression of immune checkpoints such as PDCD1 (PD-1), CD274 (PD-L1), CTLA4, IDO1, and TIGIT, as well as immune stimulators including CD27, CD40, CD48, ICOS, and CXCR4 (Fig. 6A). In the prediction of the response of TCGA-BRCA patients to ICIs, the anti-CTLA-4 and anti-PD-1/PD-L1 IPS were lower in the high-risk group, which may indicate a lack of sensitivity to ICIs in high-risk breast cancer patients (Fig. 6B). In addition, the risk-score model was validated in the IMvigor210 dataset, and the results indicated that urothelial carcinoma patients who achieved a complete response (CR) to PD-L1 blockade had a lower pre-treatment risk score than patients with stable disease (SD) or progressive disease (PD) (Fig. 6C). Meanwhile, patients who achieved CR appeared to express higher IL10 before PD-L1 blockade than patients with a partial response (PR) or PD (Fig. 6D). In the bulk mRNA-seq dataset GSE124821 of breast cancer mouse models, we also found that tumors sensitive to PD-1 and CTLA4 blockade had higher pre-treatment IL10 expression (Fig. 6E).

Fig. 5 Expression of the 8 signature component genes in T cells in the single-cell RNA sequencing dataset GSE75688

Fig. 6 Immunotherapy response prediction. A: The correlation between risk score and immune molecule expression. B: IPS predicted the relative response to anti-CTLA-4 and anti-PD-1/PD-L1 ICI therapy. C: Risk scores of the different response groups of IMvigor210 clinical trial patients. D: Pre-treatment IL10 expression of the different response groups of IMvigor210 clinical trial patients. E: Pre-treatment IL10 expression in the PD-1/CTLA4 blockade-sensitive group in mouse mammary tumors. CR, complete response; PR, partial response; SD, stable disease; PD, progressive disease

Biological Pathways Associated with Risk Model

Gene functional enrichment analysis showed that the risk score was linked to Gene Ontology and KEGG terms related to tumor immunity and signal transduction, such as “T cell activation”, “JAK-STAT cascade”, “chemokine signaling pathway”, “protein kinase B signaling”, “JNK cascade”, “PI3K signaling pathway”, “Wnt signaling pathway”, “Ras signaling pathway”, “B cell receptor signaling pathway” and “PD-L1 expression and PD-1 checkpoint pathway in cancer” (Fig. 7A, B).

Fig. 7 The signature-related biological pathways. A: Gene Ontology terms. B: KEGG biological pathways

Pan-Cancer Analysis

The CD8+ T cell-related gene expression signature was further tested in the TCGA pan-cancer dataset. Risk-score models were additionally constructed in 8 types of malignant tumors: bladder urothelial carcinoma, cervical squamous cell carcinoma and endocervical adenocarcinoma, head and neck squamous cell carcinoma, lung adenocarcinoma, ovarian serous cystadenocarcinoma, stomach adenocarcinoma, liver hepatocellular carcinoma, and colon adenocarcinoma. Kaplan–Meier analysis showed that survival probability differed significantly between high- and low-risk patients in all 8 tumor types (Supplementary Fig. 4). Except in stomach adenocarcinoma, the IPSs were also lower in the high-risk group (Supplementary Fig. 5). However, IL10 expression was no longer a risk factor in the pan-cancer meta-analysis (Supplementary Fig. 6).

Discussion

In the present study, we constructed a CD8+ T cell-related gene prognostic signature based on the TCGA-BRCA dataset. Constructing the risk-score model is a process of hub-gene selection. We first used ESTIMATE to define 1557 immune-related genes from the whole gene expression profile of 20,530 genes. We then performed WGCNA to obtain 290 CD8+ T cell infiltration-related genes for establishing the prognostic model. Next, univariate Cox regression analysis identified 21 prognosis-related genes. Finally, LASSO Cox regression analysis penalized the regression coefficients of the 21 candidate genes and fitted the risk model. In total, 8 survival-related genes were included in the risk model. Among the 8 hub genes, SOCS1 has been found to be a “non-classical” checkpoint blocker that negatively regulates cytokine receptor signaling and plays a key role in the CD8+ T cell response [30]. Deng et al. reported that IL-10 activated STAT3 and acted as an immunosuppressive molecule in triple-negative breast cancer [31]. An existing study has shown that CAMK4 can activate transcription factors in immune cells such as T cells to enhance the immune response [32]. In recent years, TESPA1 has been found to be a specific regulator of TCR signaling [33]. In addition, ICAM4 is involved in antigen recognition by monocytes and macrophages in innate immunity [34]. Single-cell RNA sequencing data showed that CAMK4, CXCL13, and TESPA1 were highly expressed in T cells, while the other 5 genes, including IL10, were expressed at low levels in T cells.

A high risk, as defined by the risk-score model, indicated a poor prognosis for breast cancer. Since time-dependent ROC curve analysis considers both the follow-up time and death events [35], it was used to evaluate the prognostic accuracy of the signature. In both the training and validation sets, the risk score was able to distinguish patients with different survival outcomes, including overall survival, disease-free survival, and metastasis-free survival. The risk score, age, AJCC stage, HER2 status, and PR status were independent risk factors for breast cancer in this study.

Pathological stage is an important means for clinicians to evaluate the survival of cancer patients, but tumor immunity is also a key factor affecting prognosis [36]. Some immune-score models already complement the existing TNM staging [37, 38]. In addition to predicting prognosis, the signature we constructed has the potential to reflect the efficacy of immune checkpoint blockade. The immunotherapy response is closely related to the expression of tumor immune checkpoints [39]. We therefore first explored the correlation between the risk score and immune checkpoint expression. Interestingly, the risk score was negatively correlated with the expression of immune checkpoints and immune stimulators in tumors.

Further analysis suggested that high-risk tumors had lower levels of tumor-infiltrating CD8+ T cells. CD8+ T cells are the main anti-tumor cells, and ICIs can enhance their activity [40]. Therefore, our CD8+ T cell infiltration-related signature can predict the prognosis and may reflect the immunotherapy response of breast cancer. High-risk tumors also had fewer tumor-infiltrating M1 macrophages but more M2 macrophages. A previous study demonstrated that both innate and adaptive immune cells in breast cancer showed immunosuppressive gene expression characteristics [21]. A decrease in the M1/M2 ratio impairs anti-tumor immunity: M1 macrophages have an anti-tumor effect, whereas the infiltration of M2 macrophages is linked to poor tumor prognosis [41]. In addition, M2 macrophages can be induced by IL10 [42], which suggests that our signature is associated with macrophage infiltration.

In March 2019, atezolizumab became the first ICI to be approved for breast cancer, delivering improved outcomes and showing promising application potential [43]. However, large-scale transcriptome data on breast cancer immunotherapy are still lacking. We therefore used the IPS calculated from TCGA RNA-seq data as a surrogate for the efficacy of immune checkpoint blockade. Charoentong et al. demonstrated that the IPS can reflect cancer patients’ relative response to anti-PD-1/PD-L1 and anti-CTLA-4 treatment [28]. We found that low-risk TCGA-BRCA patients had higher anti-CTLA-4 and anti-PD-1/PD-L1 IPS, which may guide the selection of immune checkpoint blockade therapy for breast cancer patients. Furthermore, the IMvigor210 dataset confirmed that the anti-PD-L1 CR group had a lower risk score. Additionally, pre-treatment IL10 expression was higher in human patients who achieved CR to atezolizumab, and higher pre-treatment IL10 expression was found in the PD-1/CTLA4 blockade-sensitive group in mouse mammary tumors.

Our study established a CD8+ T cell-related gene expression prognostic signature of breast cancer. Although some positive results have been achieved, our study still has shortcomings. First, this is a retrospective study based on online databases, and its findings still need to be verified by prospective studies. Second, the IPS was used as a proxy for the efficacy of immunotherapy for breast cancer, which may differ from real-world outcomes. We will continue to improve the research to obtain further conclusions.

Conclusions

This study established a breast cancer prognostic signature based on 8 CD8+ T cell-related genes. A high risk score is an independent risk factor for breast cancer prognosis. High-risk tumors also show decreased tumor-infiltrating CD8+ T cells and increased M2 macrophages. Low-risk patients may benefit more from immune checkpoint blockade immunotherapy than high-risk patients. In addition, breast cancer with enhanced IL10 expression has an adverse prognosis but is more sensitive to ICIs. The results provide a new perspective on the prognosis and immunotherapy of breast cancer.