Introduction

Breast cancer is a heterogeneous disease comprising at least four major subtypes: luminal A, luminal B, basal-like, and HER2-like breast cancer [27]. The subgroups differ in the expression of estrogen (ER) and progesterone receptors (PR), HER2 expression/amplification status, and the proliferative activity of the tumor [31]. Basal-like breast cancer is often approximated by the triple negative breast cancer subtype [7]. Triple negative breast cancer is characterized by an adverse prognosis, particularly in cases of limited sensitivity to neoadjuvant chemotherapy [5, 16, 24]. It is well established that molecular breast cancer subtypes are associated with significant differences in prognosis [28, 35]. Similarly, a significant association between age at diagnosis and prognosis is well known [6]. Young age at diagnosis is widely understood to be associated with an adverse prognosis, and several studies have aimed to unveil the biology behind this phenomenon [9, 10, 30, 36]. Some studies identified differences in gene expression but did not control for the effect of different breast cancer subtypes in their analysis [2]. However, a subsequent analysis that included subtype and clinical variables suggested that age alone may not add complexity beyond that of the breast cancer subtypes [1]. A recent commentary on the bimodal age distribution in epidemiological data suggests that breast cancer in young women differs etiologically, being more aggressive (ER-negative, basal-like subtype), from the more indolent disease (ER-positive, non-basal-like subtype) in older women [3]. Conflicting data still exist as to whether age-related frequencies of molecular subtypes are sufficient to account for differences in prognosis. One large study based on microarray data from 3522 breast cancers reported a prognostic value of age independent of molecular subtypes in multivariate analysis [4].
However, very recently, another large microarray study (n = 3947) concluded that the prognostic effect of age is based only on differences in subtype composition [20]. We previously performed a study on age based on clinical and pathological data. In that study, we focused on a large cohort of a single breast cancer subtype (triple negative, n = 1732) and showed that patients with triple negative breast cancer aged <40 years have a significantly worse prognosis [23].

In the present analysis, we demonstrate in a large cohort with microarray data (n = 4467) that triple negative and HER2-positive subtypes are more frequent in young patients. Separate analyses by subtype reveal that a significant prognostic value of young age (<40 years) is mainly observed within triple negative breast cancer, only to a limited degree in luminal subtypes, and not within the HER2 subtype. In multivariate analysis, molecular subtype, young age, and lymph node status were all significant.

Methods

Patients

We compiled Affymetrix gene expression data (U133A or U133Plus2.0 arrays) of 4467 breast cancer patients from 40 publicly available datasets as previously described [17, 33] (Supplementary Table S1). All analyses were performed according to the “REporting recommendations for tumour MARKer prognostic studies” (REMARK) [26, 34] and the respective guidelines for microarray-based studies of clinical outcomes [12]. A diagram of the complete analytic strategy and the flow of patients through the study, including the number of patients analyzed in each stage of the analysis, is given in Supplementary Fig. S1.

No separate informed consent was obtained from the patients whose data were used in this study, since the data were already publicly available.

Data processing

All gene expression data are publicly available, and accession numbers are given in Supplementary Table S1. Affymetrix CEL files were processed with the MAS5.0 algorithm of the affy package [13] of the Bioconductor software project [14] in R 3.0.1 (www.r-project.org). Data from each array were log2-transformed and median-centered, and the expression values of all the probesets from the U133A array were multiplied by a scale factor S so that the magnitude (sum of the squares of the values) equals one. For single marker expression analyses, Affymetrix probesets 205225_at, 208305_at, and 216836_s_at were used for estrogen receptor (ESR1), progesterone receptor (PR), and HER2 (human epidermal growth factor receptor 2), respectively. The bimodal distributions of these markers were used to derive cutoffs to differentiate high and low expression, or positive and negative status, respectively, as described previously [21]. Ki67 expression was quantified as the mean value of its four probesets on U133A arrays (212020_s_at to 212023_s_at).
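The per-array normalization described above can be illustrated with a minimal Python sketch. This is not the authors' pipeline, which used the affy package in R. The function names are hypothetical, the input is assumed to be positive MAS5.0 signal values for the U133A probesets of one array, and the Ki67 probeset list assumes that the stated range denotes four consecutive probeset IDs.

```python
import math
from statistics import median

# Assumption: "212020_s_at to 212023_s_at" denotes these four consecutive IDs.
KI67_PROBESETS = ["212020_s_at", "212021_s_at", "212022_s_at", "212023_s_at"]


def normalize_array(raw_values):
    """Log2-transform, median-center, and scale one array to unit magnitude.

    raw_values: positive signal values for the U133A probesets of one array.
    """
    logged = [math.log2(v) for v in raw_values]
    center = median(logged)
    centered = [v - center for v in logged]
    # Magnitude is the sum of squares; S is chosen so that it becomes one.
    magnitude = sum(v * v for v in centered)
    if magnitude == 0:
        raise ValueError("array has no variation; cannot scale")
    scale = 1.0 / math.sqrt(magnitude)
    return [v * scale for v in centered]


def ki67_score(expression):
    """Mean of the four Ki67 probesets; expression maps probeset ID to value."""
    return sum(expression[p] for p in KI67_PROBESETS) / len(KI67_PROBESETS)
```

Scaling by the inverse square root of the sum of squares is one straightforward way to obtain a magnitude of one; the restriction to U133A probesets is what makes U133A and U133Plus2.0 arrays comparable.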

Assignment of molecular subtypes

To approximate the intrinsic subtypes of breast cancer, we applied the simple method according to Hugh et al. [18], which is based on the expression of single marker genes (ESR1, PR, HER2, Ki67) to define triple negative, HER2, luminal A, and luminal B subtypes. To distinguish the luminal A and luminal B subgroups, all 2884 ER-positive/HER2-negative samples were selected, and a median split according to Ki67 expression was performed. In addition, 106 ER-positive/HER2-positive cases were also assigned to the luminal B subtype, as in the above-referenced method. Age information was available for 3089 of the 4467 samples. The individual assignments of molecular subtypes are given for each sample in Supplementary Table S2.
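The assignment rule can be sketched as follows. This is an illustrative reading of the text, not the published implementation: the function name and the input layout are hypothetical, samples exactly at the Ki67 median are assumed to count as low, and ER-negative/PR-positive/HER2-negative samples are left unassigned because the text does not state how they were handled.

```python
from statistics import median


def assign_subtypes(samples):
    """Assign a subtype label to each sample.

    samples: list of dicts with boolean 'er', 'pr', 'her2' (marker status)
    and numeric 'ki67' (mean Ki67 expression).
    """
    # Median split is computed within ER-positive/HER2-negative samples only.
    luminal_ki67 = [s["ki67"] for s in samples if s["er"] and not s["her2"]]
    cutoff = median(luminal_ki67) if luminal_ki67 else None

    labels = []
    for s in samples:
        if s["er"] and s["her2"]:
            labels.append("luminal B")  # ER-positive/HER2-positive cases
        elif s["er"]:
            # Assumption: values equal to the median count as low Ki67.
            labels.append("luminal B" if s["ki67"] > cutoff else "luminal A")
        elif s["her2"]:
            labels.append("HER2")
        elif not s["pr"]:
            labels.append("triple negative")
        else:
            # Assumption: handling of ER-/PR+/HER2- is not given in the text.
            labels.append("unassigned")
    return labels
```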

Survival analysis

Follow-up information was available for 2590 of all 4467 samples and for 2185 of those with age information. Event-free survival (EFS) was defined as recurrence-free survival (RFS) where available, and otherwise as distant metastasis-free survival (DMFS). All results from survival analyses were verified by examining the effect of the different endpoints in stratified analyses. Follow-up data for those women in whom the envisaged endpoint was not reached were censored at the last follow-up date or at 120 months. Subjects with missing values were excluded from the analyses. We constructed Kaplan–Meier curves and used the log-rank test to determine the univariate significance of the variables. A Cox proportional-hazards model was used to simultaneously examine the effects of multiple covariates on survival. The effect of each individual variable was assessed with the Wald test and described by the hazard ratio with a 95 % confidence interval (95 % CI).
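As an illustration of the endpoint definition and censoring rule, the following Python sketch derives an EFS observation and computes a Kaplan–Meier estimate. It is a simplified stand-in for the SPSS procedures actually used. The function names and the (months, event) tuple layout are hypothetical, and treating any follow-up beyond 120 months as censored at 120 months is an assumption about how the cap was applied.

```python
def event_free_survival(rfs, dmfs, cap_months=120.0):
    """Return (months, event) for EFS, or None if no follow-up exists.

    rfs, dmfs: (months, event) tuples or None. RFS is preferred; DMFS is
    used only when RFS is unavailable.
    """
    endpoint = rfs if rfs is not None else dmfs
    if endpoint is None:
        return None  # excluded from survival analyses
    months, event = endpoint
    if months > cap_months:
        # Assumption: follow-up beyond the cap is censored at the cap.
        return (cap_months, False)
    return (months, bool(event))


def kaplan_meier(observations):
    """Kaplan-Meier estimate from a list of (months, event) tuples.

    Returns a list of (time, survival probability) at each event time.
    """
    survival = 1.0
    at_risk = len(observations)
    curve = []
    for t in sorted({time for time, _ in observations}):
        events = sum(1 for time, ev in observations if time == t and ev)
        leaving = sum(1 for time, _ in observations if time == t)
        if events:
            survival *= 1.0 - events / at_risk
            curve.append((t, survival))
        at_risk -= leaving  # events and censored subjects leave the risk set
    return curve
```

The 5-year rates reported in this study correspond to reading such a curve at 60 months.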

Categorical variables were analyzed using the chi-square test. Gradual association of age groups with molecular subtypes was studied by Spearman’s rank correlation. All P values are two-sided, and all analyses were performed using SPSS Statistics Version 22 (IBM Corp.).
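For readers unfamiliar with the chi-square test, a minimal sketch for a 2 × 2 table is given below. It assumes the Pearson statistic without continuity correction, which may differ from the SPSS option used here, and the function name is hypothetical. For one degree of freedom, the upper tail probability of the chi-square distribution equals erfc(sqrt(x / 2)).

```python
import math


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square test for the table [[a, b], [c, d]].

    Returns (statistic, two-sided P value) with one degree of freedom.
    Assumption: no continuity correction is applied.
    """
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    if 0 in (row1, row2, col1, col2):
        raise ValueError("every row and column needs at least one observation")
    statistic = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    p_value = math.erfc(math.sqrt(statistic / 2.0))
    return statistic, p_value
```

A call such as `chi_square_2x2(10, 20, 20, 10)` uses made-up counts purely for illustration; the actual subtype counts are those in Table 1.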

Results

Distribution of molecular subtypes according to age

Information on age at diagnosis was available for 3089 of the 4467 patients. Table 1 summarizes the distribution of molecular subtypes according to patient age groups. With increasing age, a significant increase in the proportion of luminal A breast cancers (P < 0.001) and a significant decrease in triple negative (P < 0.001) and HER2-subtype (P = 0.001) breast cancers were observed (Table 1). The proportion of luminal B breast cancers was not significantly affected (P = 0.29). The results suggest that subtype composition is an important confounding factor when studying effects of age on prognosis. Thus, separate analyses by subtype are required (see further below).

Table 1 Distribution of molecular subtypes according to age at diagnosis

Earlier data from ligand-binding assays showed a significant correlation between increasing age and ER protein levels [8]. Therefore, the difference in subtype composition according to age that we observed in Table 1 might simply reflect a gradual change in ER expression. The quantitative gene expression data enabled us to examine this possibility. Supplementary Figure S2A demonstrates that there is a clear bimodal distribution distinguishing ER-negative and ER-positive breast cancer subtypes among all age groups. In ER-positive tumors, the median ESR1 mRNA expression increases with age, but no change is seen in the ER-negative subgroup. As shown in Supplementary Fig. S2B, we observed a linear relationship between age and ESR1 mRNA expression in both luminal A and luminal B subtypes but not in HER2 and triple negative breast cancer. Thus, the gradual increase of ER expression does not confound the principal distinction between ER-positive (luminal A and B) and ER-negative subtypes (triple negative and HER2).

Univariate survival analysis according to age at diagnosis in molecular subtypes

A total of 2185 patients with complete prognostic information and age data were subjected to Kaplan–Meier analysis of the association between age at diagnosis (<40, 40–50, and >50 years) and prognosis (event-free survival). Overall, patients showed significant differences in prognosis associated with age at diagnosis (Fig. 1a, P < 0.001). While the 5-year event-free survival rates of patients aged 40–50 years and >50 years were 68.5 ± 1.9 and 70.4 ± 1.3 %, respectively, that of young patients (<40 years) was only 54.3 ± 3.5 % (univariate HR 1.64; 95 % CI 1.34–2.01; P < 0.001). When the analysis was restricted to luminal breast cancer, a highly significant association of age <40 years with poor prognosis was retained (Fig. 1b, P = 0.001). However, much of this effect may be driven by the worse prognosis of luminal B compared to luminal A cancers, since the frequency of the luminal B subtype is nearly doubled in patients <40 years (64.1 vs. 35.9 %, Table 1; P < 0.001, χ2-test). In line with this notion, we detected no significant effect when analyzing luminal A and luminal B subgroups separately (Fig. 1c, d, P = 0.11 and P = 0.21, respectively). Despite a smaller number of cases, we observed a clearly significant association between age at diagnosis and prognosis among patients with triple negative breast cancer (Fig. 1e, P = 0.024). The 5-year event-free survival rate was 45.5 ± 5.8 % in patients <40 years, in contrast to 64.7 ± 3.8 and 59.8 ± 3.2 % for patients aged 40–50 and >50 years, respectively. The univariate hazard ratio for age <40 years in triple negative breast cancer was 1.52 (95 % CI 1.09–2.12; P = 0.014). In contrast, no significant effect of age was detected in the HER2 subtype (Fig. 1f, P = 0.80).

Fig. 1 Kaplan–Meier analysis for event-free survival according to patients’ age at diagnosis. Patients were stratified into three age groups (<40, 40–50, and >50 years). Survival analysis was performed either in all 2185 patients with follow-up and age information (a), among luminal tumors only (b), or in the subgroups of luminal A tumors (c), luminal B tumors (d), triple negative (e), and HER2 subtype (f)

Multivariate regression of survival analysis according to age at diagnosis in molecular subtypes

We next studied whether the prognostic value of age <40 years that we had observed in the triple negative subgroup remains statistically significant in a multivariate analysis in the total cohort. We applied a multivariate Cox regression model that included patients’ age (<40 vs. ≥40 years), molecular subtype of the tumor, lymph node status, and histological grading. All of this information and follow-up data were available for 1804 patients. As presented in Table 2, three variables, namely age (P = 0.012), molecular subtype (P < 0.001), and lymph node status (P < 0.001), were significantly and independently prognostic in this model. Histological grading showed a trend toward significance (P = 0.068). We observed similar hazard ratios for age <40 years (HR 1.39; 95 % CI 1.07–1.79) and a positive lymph node status (HR 1.39; 95 % CI 1.16–1.68). For the majority of the patients in this model, no information on tumor size was available. Nevertheless, for completeness, we also applied a second multivariate Cox regression model including tumor size in the subset of 870 samples with information on this parameter. In that considerably smaller cohort, only molecular subtype remained significant (P < 0.001). Both age (HR 1.28; 95 % CI 0.93–1.76; P = 0.129) and lymph node status (HR 1.22; 95 % CI 0.92–1.61; P = 0.168) showed a nonsignificant trend, and we detected no significance for histological grading (P = 0.66) or tumor size (P = 0.66).
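The reported hazard ratios, confidence intervals, and Wald P values are linked through the standard error of the log hazard ratio. The sketch below back-calculates an approximate Wald P value from a reported hazard ratio and its 95 % CI. It is an illustrative consistency check with a hypothetical function name, it assumes a symmetric Wald interval on the log scale, and its precision is limited by the rounding of the published values.

```python
import math


def wald_from_ci(hr, lower, upper, z_crit=1.959964):
    """Approximate the Wald test from a hazard ratio and its 95 % CI.

    Assumes the interval is exp(log(HR) +/- z_crit * SE).
    Returns (standard error of log HR, z statistic, two-sided P value).
    """
    log_hr = math.log(hr)
    se = (math.log(upper) - math.log(lower)) / (2.0 * z_crit)
    z = log_hr / se
    # Two-sided normal tail probability expressed with the error function.
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return se, z, p_value
```

For the age effect above, `wald_from_ci(1.39, 1.07, 1.79)` gives a P value of approximately 0.012, in agreement with the value reported for the multivariate model.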

Table 2 Multivariate Cox regression analysis of survival according to age, molecular subtype, grading, and lymph node status

Young age and response to neoadjuvant chemotherapy in breast cancer

Young age (<40 years) has also been reported as a predictive factor for response to neoadjuvant chemotherapy in triple negative breast cancer [19, 25]. For 466 of the 4467 samples in our cohort, information on age and pathological complete response (pCR) after neoadjuvant treatment was available. We observed no overall effect on pCR of age <40 years (20.6 %) compared to age ≥40 years (20.9 %, P = 1.0; Supplementary Table S3). Among the 146 triple negative samples in this subset, we detected a trend toward a higher frequency of pCR in patients <40 years (45.2 %) than in patients ≥40 years (31.5 %, P = 0.15; Supplementary Table S3).

Discussion

In the present study, we used a large cohort of samples with gene expression data obtained from 4467 patients with breast cancer to study the influence of molecular subtypes and age on prognosis. The proportion of different molecular subtypes was strongly associated with age in our analysis. An increased frequency of triple negative and HER2 breast cancers was found in patients <40 years (Table 1). This was not unexpected and is in line with several earlier reports [14, 11, 20]. This finding strongly argues for performing analyses separately by subtype, as we and others have suggested previously, since the molecular subtype is one of the strongest prognostic factors [22, 23, 29, 31, 32]. It may also explain in part the differences in the survival rates in the overall cohort. The univariate 5-year event-free survival rates of patients aged <40, 40–50, and >50 years were 54.3 ± 3.5, 68.5 ± 1.9, and 70.4 ± 1.3 %, respectively (P < 0.001, Fig. 1a). According to a recent large microarray study, increasing age at diagnosis was not associated with patients’ prognosis after stratification for clinical/pathological variables including breast cancer subtype [20]. Nevertheless, when we controlled for subtype composition and clinical variables in multivariate analysis, we still detected a significant prognostic effect of young age (P = 0.012, Table 2). This result is in line with data from Azim and colleagues, who used the same cutoff of <40 years in their study [4]. However, there are some differences between that study and our data. Azim and colleagues observed a prognostic value mainly within luminal subtypes, but not in triple negative breast cancer. In our data, there is a clear prognostic effect in the triple negative group, but no effect in the HER2 subtype (Fig. 1e, f). The value in luminal breast cancer is less clear when luminal A and luminal B tumors are analyzed separately. However, a trend is still observed for both groups (Fig. 1c, d).
Most of the HER2-positive patients in our cohort did not receive anti-HER2 treatment, because of the period in which the primary datasets were generated. This may account for the generally poor prognosis in this subgroup. Whether age may be prognostic in patients with HER2-positive tumors treated by current standards needs to be explored in further studies.

It is important to note that the age cutoff for the difference in prognosis is 40 years, and not menopause or 50 years. This has also been observed before, e.g., in both the microarray study of Azim and colleagues [4] and our previous study on triple negative breast cancers [23]. One potential reason could be that awareness of breast cancer, among both patients and physicians, may be greater around menopause than in very young women. This effect and higher mammographic breast density at young age could lead to delayed diagnosis and an advanced stage. However, the observed effect of young age on prognosis persisted in multivariate analysis taking such variables into account. Moreover, differences in breast density seem not to affect the final outcome [15]. Thus, biological differences may underlie the distinct prognosis in very young patients compared to older women. This is supported by findings that these cancers are associated with an enrichment of biological processes related to immature mammary cell populations [4], which may suggest exploring inhibition of stem cell pathways or RANKL signaling in that population [4]. In the triple negative group, we observed a 5-year event-free survival of 45.5 ± 5.8 % for patients <40 years compared to 64.7 ± 3.8 and 59.8 ± 3.2 %, respectively, for patients 40–50 and >50 years (P = 0.024, Fig. 1e). Interestingly, the magnitude observed in the triple negative microarray cohort is similar to the results that we had previously obtained in the larger clinical cohort of 1732 triple negative breast cancers without microarray data (42 vs. 57 % DFS) [23]. This supports further study of potential biological differences in the gene expression data in the future.

Regarding the clinical implications of the results, lymph node status and molecular subtype still seem to be the most important parameters, and young age is just one indicator of prognosis. Moreover, the majority of very young patients already receive more aggressive treatment, in most cases including chemotherapy. Thus, it is not clear whether a further increase in aggressiveness would improve outcome. However, further studies on the biological differences of this early-onset disease may eventually lead to novel, more effective therapies.

Our analysis has limitations. Our dataset has been assembled from previously published datasets. Thus, limitations include incomplete information, inhomogeneous treatment, and distinct follow-up endpoints. Consequently, we defined event-free survival as a common endpoint, which included relapse-free survival or distant metastasis-free survival depending on availability. Also, our analysis was performed in a retrospective manner. Thus, differences in treatment depending on age at diagnosis cannot be excluded and could introduce substantial bias into our analyses. However, given current clinical practice, patients with young age at diagnosis are usually treated more intensively. Therefore, the differences in prognosis might in fact have been more pronounced if treatment had been homogeneous.

In summary, we demonstrated that the association of age at diagnosis with molecular breast cancer subtypes contributes to its important role as a prognostic factor among patients with breast cancer. Still, a significant prognostic value of young age (<40 years) was detected, especially within the group of triple negative breast cancers, and young age retained significance in multivariate analysis.