Introduction

Compared to Caucasian (CAU) women, African-American (AA) women have lower incidence, but higher breast cancer-specific mortality rates [1]. Higher prevalence of aggressive Basal-like breast cancers in AA women [2] may explain some disparities, but even when AA women are diagnosed with less-aggressive Luminal A breast cancers, they fare worse than CAU women with the same subtype [3]. There are likely multiple factors contributing to the differences, including differential access to care [4] and lifestyle factors. There is some evidence that there may be biological differences in the tumors of AA versus CAU women, even within subtype. For instance, even after controlling for some socioeconomic status (SES) variables in a study where all women received the same treatment based on tumor characteristics, the Southwest Oncology Group [5] reported survival differences between CAU and AA women. Specifically, AA women had a survival disadvantage compared with CAU women for ER+ premenopausal breast tumors [HR = 1.74, 95 % CI = (1.11, 2.71)] and ER+ postmenopausal breast cancer [HR = 1.61, 95 % CI = (1.35, 1.93)]. Although many social variables are difficult to study and cannot be ruled out, biological factors should be considered further.

Only a few studies [6–9] have characterized molecular differences in breast tumors by race. Martin et al. [8] hypothesized that the tumor microenvironment differed between AA and CAU. They reported that independent of ER status, 19 and eight genes were differentially expressed in the breast tumor stroma and epithelium, respectively, of 18 AA and 17 CAU women. Grunda et al. [7] evaluated expression of 84 genes associated with breast cancer aggressiveness, prognosis and response to therapy, and found that 20 of these genes were differentially expressed in 12 AA and 12 CAU age- and stage-matched breast tumors. Field et al. [6] identified genes that were differentially expressed in 26 AA and 26 CAU age-, grade-, and ER-matched breast tumors. They found that a few genes, including CRYBB2, PSPHL and SOS1, were differentially expressed in both normal and tumor tissues. Most recently, Stewart et al. [9] analyzed age- and stage-matched breast tumors from The Cancer Genome Atlas (TCGA) project and reported 674 unique genes or transcripts that were differentially expressed by race. Despite matching on clinical features in the TCGA analysis, AA had a significantly higher risk of mortality compared with CAU women (18.87 vs 10.28 %; time period not given), and these investigators found gene expression differences among Luminal A (46 genes), Basal-like (15 genes) and HER2 (25 genes) among stage 1–3 tumors and increasing numbers of differentially expressed genes with increasing stage (from 26 in stage 1 to 223 in stage 3). The TCGA gene signatures were not evaluated for associations with survival nor tested in independent data.

Each of these previous studies evaluated molecular features that may contribute to mortality disparities between AA and CAU breast cancer cases; however, we propose that a disparity-associated gene should meet the following criteria: (1) the gene should be differentially expressed by race in the tumor, and this association should not be driven solely by clinical features such as intrinsic subtype, ER status, or patient age, and (2) the differential expression of a candidate gene should be associated with a difference in breast cancer survival. If the gene is associated with race but has no consequences for survivorship, its utility in explaining mortality disparities is limited. We were also interested to know whether the gene was differentially expressed in normal tissue, because such gene expression differences are more likely to predate disease progression. Our goals were to extend previous studies by studying both expression and survival, to evaluate the joint effects of multiple disparity-associated genes on survival, and to evaluate how the disparity-associated genes are expressed in normal tissue.

Methods

Datasets and data preprocessing

We used several datasets that included tumor, tumor-adjacent normal, and reduction mammoplasty gene expression data. Most of these data sources are publicly available. Data characteristics, including Gene Expression Omnibus (GEO) accession numbers, are listed in Table 1. We used the UNC337 tumor gene expression dataset to evaluate race-associated tumor gene expression. UNC337 is a racially diverse population (race information listed in Supplemental Table 3), while the NKI295 public dataset is racially homogeneous (predominantly Caucasian European). Evaluating survival in an independent and racially homogeneous population allows us to make broader inferences about the importance of relative gene expression on survival. We compared gene expression in normal (RM), cancer-adjacent normal, and tumor (UNC337 + NKI295) datasets. Reduction mammoplasty samples were from previous reports [10, 11] and from the Normal Breast Study, a study of patients undergoing surgery at UNC Hospitals [12]. All patients provided informed consent via a protocol approved by the Institutional Review Board of the University of North Carolina at Chapel Hill. All four of these datasets were on the same expression platform (Agilent), allowing us to compare gene expression across tissue type. We used the METABRIC [13] dataset as an independent test dataset to evaluate the tumor-based survival associations.

Table 1 Data characteristics and GEO accession numbers

The isolation of RNA and methods of basic microarray processing are described in detail by Sun et al. [10], Prat et al. [14], and van de Vijver et al. [15] for RM, UNC337, and NKI295, respectively. Array filtering and cleaning of the RM and NKI295 datasets are described in Pirone et al. [11]. From 149 microarrays, there were 130 unique RM samples, of which 100 were from women who self-reported as AA or CAU. There were 92 cancer-adjacent, histologically normal samples from self-described CAU or AA women (NBS). Genes that were present (above detection limit) in fewer than 15 % of samples were excluded. From the UNC337 data, we removed autopsies and samples without corresponding demographic and race information, and averaged all replicates by probe. Genes with more than 30 % missing data across all samples were excluded. Missing data were imputed using k-nearest neighbors (KNN) (k = 10). Data for all the normal breast tissues (RM_NBS) were combined using distance-weighted discrimination (DWD) [16]. Data processing and analyses were completed using BioConductor and R Version 2.14.
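The missing-data steps described above (dropping genes with more than 30 % missing values, then KNN imputation) can be illustrated with a minimal Python sketch. This is not the authors' R/BioConductor code: the function names `filter_genes` and `knn_impute`, the use of `None` for missing values, and the root-mean-square distance over shared samples are all assumptions made for illustration.

```python
import math

def filter_genes(matrix, max_missing_frac=0.30):
    """Keep rows (genes) whose fraction of missing values (None) is at most the cutoff."""
    kept = []
    for row in matrix:
        missing = sum(1 for v in row if v is None)
        if missing / len(row) <= max_missing_frac:
            kept.append(row)
    return kept

def knn_impute(matrix, k=10):
    """Fill each missing value with the mean of that sample's value in the k nearest genes.

    Distance between two genes is the root-mean-square difference over the
    samples observed in both (an assumed choice, not taken from the source).
    """
    def distance(a, b):
        shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
        if not shared:
            return math.inf
        return math.sqrt(sum((x - y) ** 2 for x, y in shared) / len(shared))

    imputed = [list(row) for row in matrix]
    for i, row in enumerate(matrix):
        for j, value in enumerate(row):
            if value is not None:
                continue
            # Candidate neighbours: other genes with an observed value in sample j.
            candidates = sorted(
                (distance(row, other), other[j])
                for m, other in enumerate(matrix)
                if m != i and other[j] is not None
            )
            neighbours = [v for d, v in candidates[:k] if d != math.inf]
            if neighbours:
                imputed[i][j] = sum(neighbours) / len(neighbours)
    return imputed

# Hypothetical toy matrix: rows are genes, columns are samples.
expr = [
    [1.0, 2.0, None, 4.0],   # 25 % missing, kept
    [1.1, 2.1, 3.1, 4.1],
    [None, None, 0.5, 0.7],  # 50 % missing, dropped
    [5.0, 6.0, 7.0, 8.0],
]
filtered = filter_genes(expr)
completed = knn_impute(filtered, k=1)
```

With `k=1` the missing entry in the first gene is filled from its single closest neighbour (the second gene); the study's setting corresponds to `k=10`.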

Race-associated gene expression in tumor and normal tissue

Race-associated genes were identified in tumors overall and stratified by subtype. Tumors were classified into intrinsic subtype using the PAM50 [17]. Supervised analysis was performed on all subtypes in UNC337 after selection criteria, as described above, using Linear models for Microarray Data (LIMMA) [18] and a False Discovery Rate (FDR) of 5 %. Subtype-stratified supervised analyses were performed on N = 68 Luminal A tumors and N = 39 Basal-like tumors. Using Cluster 3.0 [19], we clustered the expression data by both gene and sample, and visualized the resulting cluster dendrogram in Java Treeview [20]. For genes with multiple probes, we selected the probe with the highest standard deviation to display in the two-dimensional cluster. We evaluated the statistical association between cluster and race using a Chi-square test. We performed sensitivity analyses wherein models were adjusted for tumor characteristics (grade, stage, node, age) to evaluate whether tumor characteristics confounded the association between race and gene expression.

LIMMA analyses were also used to identify genes associated with race in non-tumor tissue at an FDR = 5 %. Since these samples included both cancer-adjacent normal and normal tissues (N = 192) from two different populations, we statistically adjusted for data source in addition to performing DWD correction as described above.
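The 5 % FDR threshold used in these supervised analyses can be illustrated with a short Python sketch. The source does not name the FDR procedure; the sketch assumes the Benjamini-Hochberg adjustment, and the p-values are hypothetical. It does not reproduce LIMMA's moderated statistics, only the multiple-testing step applied to a list of per-gene p-values.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(n, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * n / rank)
        adjusted[idx] = running_min
    return adjusted

# Hypothetical per-gene p-values for the race term.
raw_p = [0.001, 0.008, 0.039, 0.041, 0.6]
adj_p = benjamini_hochberg(raw_p)
significant = [i for i, q in enumerate(adj_p) if q <= 0.05]
```

In this toy input only the first two genes pass a 5 % FDR, even though four raw p-values are below 0.05.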

NKI295 survival

We defined disparity-relevant gene expression as gene expression associated with race in the tumor and with survival in test data. The first criterion was met by identifying race-associated gene expression at an FDR = 5 % in the UNC337 (Luminal A or Basal-like tumors) data. These genes were mapped to the NKI dataset, and were then extracted and median centered. We then performed a survival analysis that compared individuals with above-median expression to those with below-median (referent) expression for each of the race-associated genes. Two race-associated genes (FAM177A1, GSTT2) were not available in the NKI dataset. For each gene, we plotted Kaplan–Meier curves and estimated hazard ratios (HR) using Cox Proportional hazards models. A HR <1 signifies that higher expression confers a survival advantage, whereas a HR >1 suggests that higher expression confers a survival disadvantage.
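The per-gene comparison of above-median versus below-median expression can be sketched as follows. This Python example is illustrative only: it implements the median split and a basic Kaplan–Meier estimator, not the Cox model used for the hazard ratios, and the function names and toy data are hypothetical.

```python
from statistics import median

def split_by_median(values):
    """Return True for patients whose expression is above the median."""
    cut = median(values)
    return [v > cut for v in values]

def kaplan_meier(times, events):
    """Kaplan-Meier estimate; returns (time, survival) at each distinct event time.

    events[i] is 1 if the patient died at times[i], 0 if censored.
    """
    at_risk = len(times)
    survival = 1.0
    curve = []
    for t in sorted(set(times)):
        deaths = sum(1 for tt, e in zip(times, events) if tt == t and e)
        leaving = sum(1 for tt in times if tt == t)
        if deaths:
            survival *= 1 - deaths / at_risk
            curve.append((t, survival))
        at_risk -= leaving
    return curve

# Hypothetical expression, follow-up times (years), and event indicators.
expression = [0.2, 1.5, -0.7, 0.9, -1.1, 0.4]
times = [10, 3, 12, 5, 9, 7]
events = [0, 1, 1, 1, 0, 1]

high = split_by_median(expression)
high_curve = kaplan_meier(
    [t for t, h in zip(times, high) if h],
    [e for e, h in zip(events, high) if h],
)
low_curve = kaplan_meier(
    [t for t, h in zip(times, high) if not h],
    [e for e, h in zip(events, high) if not h],
)
```

The below-median group is the referent in the text, so a curve for the above-median group that falls faster corresponds to HR > 1.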

There were six race-associated genes that showed a high magnitude association (HR <0.8 or HR >1.25) with survival in the NKI295 dataset. These genes were used to create a Multi-gene Race-associated Expression (MRE) score that varied between −6 and 6. A score of −6 should predict the best survival and a score of +6 the worst survival. We generated this score for each patient by summing the deleterious effects of each race- and survival-associated gene. The deleterious effect was −1 when the patient-level expression was below the median expression for genes with a HR >1 or above the median expression for genes with a HR <1. Conversely, the deleterious effect was +1 when the patient-level expression was above the median expression for genes with a HR >1 or below the median expression for genes with a HR <1.
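The scoring rule can be made concrete with a small Python sketch. It reads the rule so that expression on the unfavourable side of the median contributes +1 and expression on the favourable side contributes −1, consistent with +6 being the worst score. The gene directions follow the survival results reported in this paper (higher CRYBB2, PSPH, TYMS, SQLE with HR > 1; higher MUC1, ACOX2 with HR < 1); the function name, the toy values, and the handling of values exactly at the median (scored 0) are assumptions.

```python
from statistics import median

HIGH_RISK_GENES = ("CRYBB2", "PSPH", "TYMS", "SQLE")  # higher expression, HR > 1
PROTECTIVE_GENES = ("MUC1", "ACOX2")                  # higher expression, HR < 1

def mre_scores(expression):
    """expression maps gene name -> list of per-patient values; returns one score per patient."""
    n_patients = len(next(iter(expression.values())))
    scores = [0] * n_patients
    for gene in HIGH_RISK_GENES + PROTECTIVE_GENES:
        values = expression[gene]
        cut = median(values)
        for i, v in enumerate(values):
            if v == cut:
                continue  # assumption: a value exactly at the median adds 0
            above = v > cut
            deleterious = above if gene in HIGH_RISK_GENES else not above
            scores[i] += 1 if deleterious else -1
    return scores

# Hypothetical expression values for four patients.
toy = {
    "CRYBB2": [4.0, 3.0, 2.0, 1.0],
    "PSPH":   [4.0, 3.0, 2.0, 1.0],
    "TYMS":   [4.0, 3.0, 2.0, 1.0],
    "SQLE":   [4.0, 3.0, 2.0, 1.0],
    "MUC1":   [1.0, 4.0, 2.0, 3.0],
    "ACOX2":  [1.0, 2.0, 3.0, 4.0],
}
print(mre_scores(toy))  # [6, 4, -4, -6]
```

Because each of the six genes contributes either +1 or −1, scores fall between −6 and +6, and with no ties they are always even.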

The association between the MRE score and survival was evaluated using Cox Proportional Hazards models in both the training (NKI295 + UNC337, N = 450) and independent (METABRIC, N = 1584) datasets. We tested the statistical association between mean MRE score and tumor subtype in both datasets, and between CAU and AA tumors overall and in Luminal A tumors using either ANOVA (N > 2 groups) or Student's t test (2 groups). We also assessed the association between MRE score and survival among Luminal tumors. Finally, we calculated tumor proliferation scores (a marker of tumor proliferation capability, defined as the sum of expression of the following genes: {CCNB1, UBE2C, BIRC5, KNTC2, CDC20, PTTG1, RRM2, MKI67, TYMS, CEP55, CDCA1}) for all tumors, and evaluated the association between this marker and the MRE score.

Evaluating gene expression changes by race and tissue type

Patterns of gene expression in normal and tumor tissue stratified by race are informative for whether the differential expression is a disease feature or exists prior to carcinogenesis. Thus, we tested each of the tumor race- and survival-associated genes for their expression in reduction mammoplasty (N = 100), cancer-adjacent normal (N = 92) and tumor tissue (N = 460). In datasets where there were multiple probes for a particular gene, we chose the probe that was differentially expressed in the UNC337 dataset if available. Alternatively, any probe mapping to the same gene was used if the specific probe was unavailable. We median centered the dataset of all three tissue types and then plotted the average relative expression in boxplots, stratified by race and tissue type.

Results

Training data identification of race- and survival-associated genes

Compared with CAU tumors (Table 2), AA tumors were more likely to be node positive (60 vs 42 %, P = 0.03), ER negative (53 vs 31 %, P = 0.03), and less differentiated (61 vs 46 %, P = 0.10). At an FDR = 5 %, there were 40 probes, representing 38 distinct genes, that were differentially expressed by race across all tumors. A cluster of these 38 genes (rows) and race (columns) is shown in Fig. 1. The left cluster (Cluster 1, N = 63) was primarily composed of AA samples (N = 39, 62 %), and the right cluster (Cluster 2, N = 102) was predominately CAU samples (N = 84, 82 %), (Chi-square = 31.8, df = 1, P < 1.7e–8).
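The cluster-by-race test can be checked with a short Python sketch. The 2 × 2 counts below are derived from the figures above under the assumption that every clustered sample is either AA or CAU (Cluster 1: 39 AA and 63 − 39 = 24 CAU; Cluster 2: 84 CAU and 102 − 84 = 18 AA). The sketch also assumes a Yates continuity correction, which the text does not state but which is consistent with the reported statistic.

```python
import math

def yates_chi_square(a, b, c, d):
    """Chi-square test with Yates correction for the 2x2 table [[a, b], [c, d]].

    Returns (statistic, p_value) for 1 degree of freedom.
    """
    n = a + b + c + d
    numerator = n * max(abs(a * d - b * c) - n / 2, 0.0) ** 2
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    statistic = numerator / denominator
    # Upper tail of a chi-square distribution with 1 df.
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value

# Rows: Cluster 1, Cluster 2. Columns: AA, CAU (derived counts, see lead-in).
chi2, p = yates_chi_square(39, 24, 18, 84)
print(round(chi2, 1))  # 31.8
```

The resulting p-value is on the order of 1e-8, in line with the value reported above.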

Table 2 Demographic and tumor characteristics of UNC337
Fig. 1

Two-dimensional cluster of race-associated gene expression. All genes (rows) were median centered across the samples (columns). AA women are represented by black boxes immediately above the heatmap and CAU women with white boxes. There are two distinct gene clusters with the orange cluster including primarily AA and the purple cluster including primarily CAU

To identify genes that were differentially expressed by race among less-aggressive tumors (Luminal A) or more aggressive tumors (Basal-like), we performed two supervised analyses at an FDR = 10 %, restricted to Luminal A or Basal-like tumors. There were 23 genes differentially expressed by race at a 10 % FDR among Luminal A tumors, of which 10 genes (Table 3) were significant at a 5 % FDR. There were only two differentially expressed genes at either a 5 or 10 % FDR in Basal-like tumors (Table 4). We also adjusted for age, grade, size, and node status in multivariable analyses (Supplemental Tables 1a, b); among Luminal A breast cancers, most genes were still differentially expressed by race after statistical adjustment, including CRYBB2, PSPH, MUC1, HSDL1, GSTT2, CLEC2D, and FAM177A1. AMFR and PSPH remained differentially expressed by race among Basal-like tumors in the multivariable model.

Table 3 Luminal A race-associated genes at FDR = 10 %
Table 4 Basal-like race-associated genes at FDR = 10 %

Among these race-associated genes, six were also associated with survival in the NKI295 dataset. High expression of CRYBB2, PSPH, TYMS, and SQLE was associated with higher mortality, while low expression of MUC1 and ACOX2 predicted worse survival (Table 5). The CRYBB2 survival curves violated the proportional hazards assumption, with the crossover of the two curves occurring at ~8–10 years. This pattern of crossing hazards at ~8–10 years has been previously documented among ER-positive breast cancers [11]. Four other race-associated genes (AMFR, CLEC2D, HSDL1, SLC9A3R2) were not associated with survival in the NKI dataset (Supplemental Fig. 1).

Table 5 Hazard ratios in NKI295 and average gene expression in UNC337 tumors

Gene expression in normal versus tumor

To elucidate patterns of expression for race- and survival-associated genes from normal to tumor tissue, we evaluated the expression of these genes in reduction mammoplasty, cancer-adjacent normal and tumor tissue. On average, tumor expression of SQLE and TYMS was higher in AA compared to CAU, and tumor expression of ACOX2 and MUC1 was lower in AA compared to CAU. However, the pattern of expression across the normal-to-tumor continuum was similar between CAU and AA (Fig. 2a–d). In contrast, both CRYBB2 and PSPH were differentially expressed by race in both normal and tumor tissues, with higher gene expression among AA when compared with CAU women (Fig. 2e–f). Higher expression of these genes by race in benign tissue was not substantially attenuated even after statistical adjustment for normal tissue type (normal versus adjacent normal).

Fig. 2

Median-centered gene expression across samples, stratified by race and tissue type (normal, adjacent normal, tumor) for race- and survival-associated genes. There are two distinct patterns of expression. a–d, expression is most distinct by race among tumors, whereas levels are similar by race in normal tissue; e, f, however, CRYBB2 (e) and PSPH (f) levels are higher even in normal tissue of AA women. The increased expression persists in tumor

For all six MRE-associated genes, AA had higher expression of poor prognosis genes and lower expression of good prognosis genes (Table 5) compared with CAU. Higher relative expression of ACOX2 and of MUC1 in tumors was each associated with a ~35 % reduction in mortality (Table 5), and AA tended to have lower expression of these genes compared with CAU (Fig. 2a, b; Table 5). For CRYBB2, PSPH, TYMS, and SQLE, higher relative gene expression was associated with increased mortality in the NKI295 dataset (Table 5; Fig. 2c–f). In both CAU and AA women, gene expression of TYMS and SQLE increased dramatically from normal and adjacent normal tissue to tumor tissue; however, the tumor expression was highest among AA women (Table 5; Fig. 2c, d). For PSPH and CRYBB2, gene expression increased from normal to tumor tissue among CAU women, whereas relative expression was higher in both tumor and normal tissue of AA women.

Multi-gene Race-associated Expression (MRE) score

Applying the information learned from training on NKI data to an independent dataset, we computed MRE scores for all individuals in our test data (METABRIC, N = 1584) and training data (NKI295 + UNC337, N = 450), where −6 = best and +6 = worst risk. Each increasing MRE point was associated with a 6 % increase in hazard, HR = 1.06, 95 % CI = (1.04, 1.09), such that the HR comparing a 6-point individual to a −6-point individual was 2.03, 95 % CI = (1.98, 2.08) in METABRIC. This result was attenuated when we adjusted for size, grade, and node status (HR = 1.03, 95 % CI = (1.00, 1.06) for each increase in MRE points), but remained statistically significant. We also observed strong associations for MRE score and survival when we restricted to N = 401 Luminal tumors, with a HR = 1.76, 95 % CI = (1.64, 1.89) for the comparison of an individual with a score of +6 to an individual with a score of −6. After adjustment for size, grade, and node status, the association was HR = 1.36, 95 % CI = (1.26, 1.46). Interestingly, high MRE scores were associated with Basal-like tumors in both the training and test datasets (Fig. 3a, b). AA patients had a significantly higher (P < 0.001) MRE score (2.42) than CAU patients (−0.32) in our training dataset (NKI295 + UNC337) over all tumors (Fig. 3c), and specifically in Luminal A tumors (1.67 vs −2.43, P < 0.001) (Fig. 3d). Associations between MRE score and race could not be assessed in METABRIC due to the predominance of CAU patients in that dataset.
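The relationship between the per-point hazard ratio and the contrast between a +6 and a −6 score can be illustrated with a few lines of Python. The sketch assumes the MRE score enters the Cox model as a single continuous (log-linear) term, so that a difference of d points multiplies the hazard by the per-point HR raised to the power d; the function name is hypothetical.

```python
import math

def hazard_ratio_for_difference(hr_per_point, point_difference):
    """Scale a per-point hazard ratio to a larger score difference (log-linear assumption)."""
    log_hr = math.log(hr_per_point)
    return math.exp(log_hr * point_difference)

span = 6 - (-6)  # 12 points separate the best and worst scores
print(round(hazard_ratio_for_difference(1.06, span), 2))  # 2.01
```

The value of about 2.01 obtained from the rounded per-point estimate is close to the reported 2.03; the small gap is presumably due to rounding of the per-point HR.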

Fig. 3

Top: Boxplots showing the distribution of race-associated gene expression score by subtype and by race. Race-associated gene expression scores were highest for Basal-like, HER2, and Luminal B tumors in both test (METABRIC, a) and training (UNC337 + NKI295, b) data. Across all tumors (c), race-associated gene expression/MRE scores are higher in African Americans, but these differences are not driven solely by subtype because expression is higher even among Luminal A tumors (d)

Although there was a monotonic increase in the hazard ratio with each increase in the MRE score, the largest increase in risk occurs with just a few gene expression changes; in our test dataset, patients with the middle 50 % MRE scores (compared to those in the bottom quartile) had a strong elevation in risk (HR = 1.7, 95 % CI = (1.3, 2.1)), and those with the top quartile MRE scores had a HR = 1.9, 95 % CI = (1.4, 2.5) compared to the bottom quartile (Fig. 4b). When we restricted to Luminal A/B tumors, the associations between MRE and survival were slightly attenuated. Comparing those with the highest quartile MRE score and those with the middle 50 % of MRE score to the referent, lowest quartile group, there was a 50 % [95 % CI = (1.0, 2.3)] and 60 % [95 % CI = (1.2, 2.1)] increase in hazard, respectively (Fig. 4c). The relatively small number of Luminal A tumors with a high MRE score (Fig. 4d) limited our ability to examine the association between MRE score and survival among Luminal A tumors. In sensitivity analyses (data not shown), we found that relative measures of association were much stronger in the METABRIC dataset when we restricted to women ≤60 years of age. The UNC337 and NKI295 datasets comprise predominantly younger women, so the attenuation of effect in test data may be partially attributed to population differences between the METABRIC and UNC337 + NKI295 datasets.

Fig. 4

Survival curves and corresponding Hazard Ratios (HR) by MRE score (bottom 25 %, middle 50 %, top 25 %) in the training data (a) and test data: overall (b), Luminal A/B (c) only, and Luminal A only (d)

Tumor proliferation scores were correlated with MRE points in both the training (ρ = 0.55, P < 0.001) and test (ρ = 0.59, P < 0.001) datasets. Proliferation scores were significantly (P < 0.001) higher in AA women (1.63) than in CAU women (−0.73) over all tumors in the UNC337 dataset, and in Luminal A tumors (−0.99 AA versus −3.90 CAU, P < 0.001). It is important to note that the MRE score was attenuated but remained significant after adjusting for standard clinical variables, but was not significant after adjusting for breast cancer subtype (Table 6).

Table 6 Multivariable hazard ratios for MRE score

Discussion

It has been established that AA women suffer from worse breast cancer outcomes compared to CAU women. While aggressive forms of BC disproportionately affect AA women, this does not fully explain the disparities; even within subtype there are differences in survival by race. From previous studies, we have learned that (1) survival differences exist between AA and CAU despite equal treatment [5] or tumor subtype [3] and (2) genes are differentially expressed between AA and CAU tumors even when matching on clinical features [6, 7, 9]. Genes that are both differentially expressed by race in tumors, and confer a survival disadvantage could explain a portion of the observed racial survival disparity, although to date no study had evaluated whether race-associated genes conferred a survival disadvantage. To elucidate biological factors that predispose AA women to worse mortality outcomes, the current study showed that race-associated genes affect survival across multiple datasets.

Six candidate genes (CRYBB2, PSPH, ACOX2, MUC1, SQLE, TYMS) emerged from our analysis as both race and survival associated. Some of these genes have known biological functions, while others do not. Although AMFR tumor expression was associated with race both here and in a previous study [8], its expression is not associated with a survival advantage, suggesting that its differential expression may not contribute to racial mortality disparities. In contrast, we replicated the association between CRYBB2 [6, 8, 9, 21, 22] and PSPH [21, 23] tumor expression and race, and also found that higher expression is associated with poorer survival. CRYBB2, located at 22q11.23, encodes the beta-crystallin B2 protein. Genetic variation in CRYBB2 is associated with macular degeneration [24], but the protein has no documented or hypothesized role in carcinogenesis. PSPH (phosphoserine phosphatase, 7p11.2) is located near a region where gain of function is associated with advanced prostate tumor stage [25]. This gene has also been implicated in metabolism [26]. An emerging hallmark of cancer [27], the Warburg effect, is the ability of cancer cells to thrive in an oxygenated environment through glycolysis. Based on candidate gene studies of genes involved in cellular metabolism, Kim et al. [26] found that PSPH expression was higher in Basal-like tumors than in Luminal A tumors, and that high expression was associated with poor survival, HR = 2.07, 95 % CI = (1.10, 4.18), an effect size similar to what we found here.

There is limited literature on ACOX2 (acyl-CoA oxidase 2, branched chain, 3p14.3), but there may be a genetic variant that is a shared risk factor for preeclampsia and cardiovascular disease [28], and some evidence that a transcript is associated with hepatocellular carcinoma [29]. ACOX2 up-regulation in the tumor conferred the same protective effect on mortality as MUC1 (mucin 1, cell surface associated 1q21). Variants of MUC1 interact with estrogen [30], and higher expression is associated with late-stage epithelial ovarian cancer [31, 32] and prostate cancer [33].

Although little has been documented about the role that SQLE (squalene epoxidase, 8q24.1) plays in breast cancer progression, one study found that SQLE mRNA expression was inversely associated with survival among ER+ stage 1 or 2 patients [34]; this parallels our results that increased expression in tumor tissue is associated with almost a 2-fold increase in mortality in the NKI295 dataset. High TYMS (thymidylate synthetase, 18p11.32) expression was the largest independent predictor of mortality in our analysis. TYMS is associated with tumor proliferation and is one of the 50 genes whose expression is used to classify breast cancer into intrinsic subtype [17]. Genetic variants of TYMS or its expression predict sensitivity to 5-fluorouracil [35–37] and are prognostic factors for lymph node infiltration in CRC patients [38], and lower expression of TYMS is a positive prognostic factor for non-small cell lung cancer [39].

Disparities in survival may result from the joint expression of multiple genes, rather than from a single gene. Our MRE score captured the cumulative effects of multiple genes and showed that high MRE points were associated with worse outcomes in both our training and test datasets. This score was also positively correlated with tumor proliferation score, providing independent confirmation of biological relevance. However, this score should not be considered as a substitute for established prognostic markers such as intrinsic subtype. After controlling for intrinsic subtype, the MRE score was not statistically significantly associated with breast cancer survival in either the UNC/NKI dataset or the METABRIC data, and generally appears to be higher in Basal-like and Luminal B tumors. A major limitation to our study is that we were not sufficiently powered to evaluate subtype-specific survival advantages associated with the MRE score. For instance, there were only 7 METABRIC Luminal A individuals with an MRE score higher than 3. The direction of effect is also unknown: these genes may increase probability of progressing to a more aggressive subtype, or more aggressive subtypes may have increased probability of upregulating these genes. These two possibilities cannot be evaluated in human tumor specimens that are sampled only at a single point in time. However, future research identifying the mechanism of action of the genes in the MRE score could help establish their biological relevance.

Although some disparities in Luminal A breast cancer mortality may be attributable to treatment or access to care, the patterns of expression of these genes in the continuum of normal to malignancy suggest that intrinsic biological differences between at-risk AA and CAU women may also be operating, and these patterns guide our interpretations of the data. Furthermore, because CRYBB2 and PSPH expression were elevated in both normal and tumor tissue of AA compared with CAU, racial differences likely exist from the earliest stages of tumor development. Previous studies have suggested that PSPHL (a PSPH homolog) and CRYBB2 were differentially expressed in normal tissue of AA and CAU women who underwent reduction mammoplasty (N = 6) or those without evidence of a malignancy (N = 19) [6]. These investigators posited that SNP rs6700—located close to PSPHL—may explain the differences in expression, since the minor allele frequency of AA is higher compared with CAU. We note that future studies using RNAseq should evaluate the specific transcripts of PSPH and CRYBB2 with respect to race, because recent studies [40, 41] suggest that these genes have significant homology to pseudogenes that could produce signal on a microarray. Additionally, Sturtz et al. [41] concluded that the PSPHL signal observed in several studies may be due to population stratification.

Breast cancer mortality disparities are likely driven by a number of social and biological forces. Uncovering the factors that drive disparities is complicated and necessitates evaluating the problem from many different vantage points. Replication of findings across multiple study populations, and by investigators using different analytical and technical approaches, strengthens the evidence in support of these genes as possible targets. Continued evaluation of genes that differ by race in both tumor and normal, such as CRYBB2 and PSPH, as candidate markers of race-associated disparities should include larger population-based studies. Mechanistic studies are also needed, especially for CRYBB2, which has now been shown to associate with race and survival in multiple studies.