An apparent scientific consensus holds that US racial/ethnic groups intrinsically have disparate distributions of breast cancer estrogen receptor (ER) status, with white women purported to have the highest prevalence, and black women the lowest, of ER-positive (ER+) tumors [14]. Nevertheless, studies on this topic are affected by several limitations. Among US epidemiologic investigations designed to explore associations between race/ethnicity and ER status, virtually all of the 19 studies reporting positive associations (usually crude): (a) relied on medical records for ER status data, (b) had a high percentage of missing data on ER status (10–20% or more), with the data most likely to be missing for women of color (largely if not solely composed of black women), and (c) included little or no socioeconomic data [3, 5–20]. By contrast, the 9 studies reporting no association between race/ethnicity and ER status typically: (a) relied on laboratory assays performed specifically for the study, (b) had little or no missing data on ER status (0–3%), and (c) controlled for socioeconomic position, and also reported associations between socioeconomic position and ER status [21–29]. Thus, significant associations between race/ethnicity and breast cancer ER status (primarily comparing US black to white women) derive chiefly from studies with a relatively high degree of missing data on ER status and no socioeconomic data.

If the data on ER status were truly missing completely at random, and if ER status were unrelated to socioeconomic position, then estimates of racial/ethnic disparities in breast cancer ER status in these prior studies would be unbiased [30–32]. Indicating that concerns about bias may be warranted, however, evidence suggests: (a) ER status is more frequently missing among women of color and/or less affluent women [3, 13–15, 26, 33], most likely because of inadequacies of medical care [1, 33, 34], and (b) the major known risk factors for ER status, both those affecting endogenous hormone levels (e.g., hormone therapy, nulliparity, late age at first pregnancy, postmenopausal obesity) and those reflecting quality of medical care (e.g., stage of diagnosis, tumor size), are strongly associated with socioeconomic position, within and across diverse racial/ethnic groups [1, 26, 33–37]. Since ER status is a key tumor biomarker relevant to both breast cancer treatment and survival [14, 33–35], it is important to gauge how taking into account issues of missing data and confounding affects estimates of racial/ethnic disparities in ER status.

Recognizing deficiencies in extant research on ER status, one major review has recommended use of better and more consistent assays for ER status [1]. Also germane are longstanding debates over how racial/ethnic disparities in health are conceptualized: as embodied biological expressions of social inequality that are socially determined, versus biological consequences of intrinsic “racial” (usually meaning “genetic”) differences [38–40]. Cognizant of the implications of these debates for research on racial/ethnic health disparities, another review has called for research on how socioeconomic position “contributes to the stage, age at diagnosis, and biology of breast carcinoma” [34, p. 1995]. Accordingly, guided by the ecosocial theory of disease distribution [41–43] and its concern with both societal determinants of health inequities and biased assumptions affecting health research, we sought to examine how estimates and explanations of racial/ethnic inequities in ER status might be biased by missing ER data and omission of socioeconomic data.

The specific a priori hypothesis we sought to test was that estimates of racial/ethnic inequities in ER status would be attenuated by: (1) using appropriate methods to address issues of missing data, and (2) controlling for socioeconomic position. Deciding on the appropriate analytic methods for testing our hypothesis, moreover, led us to recognize a previously unremarked characteristic of most research on breast cancer ER status and race/ethnicity: its virtually exclusive reliance on the odds ratio [3, 6, 7, 9, 12–14, 17–20, 22–24, 26, 27], at times explicitly interpreted as a relative risk [12, 19]. Yet the prevalences of ER+ (the most commonly analyzed outcome; prevalence ≈ 75%) and ER− (prevalence ≈ 25%) both substantially exceed the “rare” disease condition (<10%) required for the odds ratio to provide a valid estimate of the risk ratio [44–46]. Our third question accordingly concerned whether interpretation of results would be influenced by choice of parameter estimate, i.e., the odds ratio (OR) versus the prevalence ratio (PR).

Materials and methods

Study population

The study base consisted of the population residing, between 1998 and 2002, in the catchment area of two well-established population-based cancer registries: (1) the Northern California Cancer Center’s (NCCC) San Francisco/Oakland SEER cancer registry, encompassing five counties (Alameda, Contra Costa, Marin, San Francisco, and San Mateo) [47], and (2) the Los Angeles Cancer Surveillance Program (LA CSP), encompassing Los Angeles County [48]. We chose these registries for three reasons: (1) demographically, their catchment areas have substantial heterogeneity with respect to socioeconomic position and were sufficiently large with enough racial/ethnic diversity to permit meaningful sub-analyses among white, black, Asian and Pacific Islander, and Hispanic populations; (2) the regions they cover are relatively high incidence areas for breast cancer, with rates on average exceeding or equaling those of all SEER registries combined; and (3) they rank highly for the completeness (estimated at ≥98%), timeliness, and accuracy of registering cancer cases [47–49]. All analyses performed for this study were approved by the Harvard School of Public Health Human Subjects Committee and the Institutional Review Boards of both cancer registries. Since we were provided only de-identified records for a secondary data analysis, we were not required to obtain informed consent from the women included in the cancer registries.

Breast cancer cases

From this study base, we included all cases of primary invasive breast cancer among women recorded by the two cancer registries as being diagnosed between 1 January 1998 and 31 December 2002 (n = 42,240). We obtained data from the cancer registries on: age at diagnosis, race/ethnicity, estrogen receptor status, tumor stage, tumor size, histologic type, and residential address at time of diagnosis.

All patient data were obtained from medical charts, and it is unknown whether their racial/ethnic data were based on self-report or observer-report [47–49]. The racial/ethnic categories employed by US cancer registries correspond to those used in the US census, with these categories defined by the US Office of Management and Budget as “social-political constructs and should not be interpreted as being scientific or anthropological in nature” [50]. Using the cancer registry racial/ethnic categories, we delineated the following mutually exclusive groups: white non-Hispanic (n = 26,491), black non-Hispanic (n = 4,102), Asian and Pacific Islander non-Hispanic (n = 4,970), American Indian non-Hispanic (n = 38), “other race” non-Hispanic (n = 356), and Hispanic (n = 4,961). Research on racial/ethnic misclassification of cancer registry and hospital records in California [51–54] and in the US nationally [55] indicates that while the sensitivity and specificity of racial/ethnic classification for the white and black populations are reasonably high (in excess of 95%), they are somewhat lower for other racial/ethnic groups. In our racial/ethnic-specific analyses, we do not include data on the American Indian and “other race” non-Hispanic women since small numbers preclude meaningful analyses of these data.

In the cancer registry records [47, 48], ER status was defined as: (a) positive: test done and results were positive; (b) negative: test done and results negative; and (c) unknown: “test not done (includes cases diagnosed at autopsy)”; “test done, results borderline or undetermined whether positive or negative”; “test ordered, results not in the chart”; or “unknown if test done or ordered; no information (includes death-certificate-only cases).” Among cases missing ER status, the most common category was “unknown if test done or ordered” (78%) followed by “test not done” (16%). No data were available on reproductive history or hormone therapy use, precluding analysis of ER data in relation to these variables.

Socioeconomic measures

We geocoded the breast cancer cases included in this study using a commercial geocoding company whose accuracy we previously had tested and found to be high (96%) [49, 56]. We accepted only results geocoded to high precision (based on either exact street address or ZIP + 4 code; the latter is an area typically the size of one city block). We were able to geocode fully 97% of our cases with high precision to their census tract (CT) geocodes, the geographic level chosen because, as shown by results of our prior Public Health Disparities Geocoding Project [56–59], the census tract provided maximal geocoding and linkage to area-based socioeconomic data (compared to block group and ZIP Code data) and consistently detected expected socioeconomic gradients in health across a wide range of health outcomes.

We selected and constructed our CT area-based socioeconomic measures (ABSMs) based on theoretical considerations and methods described in detail in the publications of the Public Health Disparities Geocoding Project [56–59]. ABSMs generated and available pertain to CT poverty, income, occupation, education, and several deprivation indices. For analyses we previously conducted of socioeconomic gradients in breast cancer incidence [49], we found results overall were robust to choice of ABSM and that the ABSM that most informatively delineated the socioeconomic gradient was a new composite variable we created combining data on poverty and high income (defined as ≥4 times the US median household income, and calculated from the categorical income distribution by interpolation, assuming a Pareto distribution within the income category) [49]. The composite measure employed five mutually exclusive categories: (1) <5% below poverty and ≥10% high income; (2) <5% below poverty and <10% high income; (3) 5.0–9.9% below poverty; (4) 10.0–19.9% below poverty; and (5) ≥20% below poverty (the federal definition of a poverty area [60]).

Additionally, because of the strong association documented between educational level and ER status [26], we also employed an ABSM pertaining to the proportion of adults age 25 and older who had completed four or more years of college education. The pairwise correlations between the three ABSMs used to create these measures (percent below poverty, percent high income, percent college graduates) were all modest (r < 0.4), indicating they were not collinear. The proportion of the study catchment population living in CTs for which the ABSM data were missing was small (0.0–0.3%) and did not vary by race/ethnicity.

Statistical analyses

Our analytic plan involved four steps. First, we determined the univariate distribution, within our study population, overall and by race/ethnicity, of both the study outcome (ER status) and the specified covariates (age, socioeconomic position, tumor stage, tumor size, histologic type), as well as each variable’s extent of missingness. We then created our analytic data set by excluding the small number of women (n = 496) missing data, singly or jointly, on the composite ABSM (n = 12; 0.03%), the college graduate ABSM (n = 2; 0.005%), the registry “race” variable (n = 349; 0.85%), and the registry “Hispanic” variable (n = 410; 1%). We opted not to impute these variables for two reasons: (1) the small number missing, and (2) maintaining comparability to the prior literature on racial/ethnic disparities in ER status, which included women only of known race/ethnicity.

Second, for each variable, using the relevant referent group, we calculated the crude OR and PR for being: (1) ER+ versus ER−, (2) ER− versus ER+, and (3) ER status unknown versus ER status known, among cases with completely observed data (prior to imputation), in order to assess the extent to which estimates of racial/ethnic disparities would be affected by these analytic choices. As noted previously, most prior research has focused on estimating the OR for being ER+ [3, 5, 7, 10, 11, 14, 16, 17, 19, 21, 23–28]. Arguably, however, risk of being ER− might be the more appropriate parameter, given that ER− is the more adverse outcome and also the rarer outcome (and hence less likely to result in the OR providing a biased estimate of the risk ratio [44–46]).

As our third step, we then employed multiple imputation to address potential limitations arising from analyzing only observations with fully observed data [30, 31]. The variables we imputed were: estrogen receptor status (21% missing), tumor stage (2% missing), and tumor size (7% missing). As our data set included the most important known predictors of both ER missingness (e.g., race/ethnicity and socioeconomic position) and ER status (e.g., sociodemographic and tumor characteristics), it is reasonable to posit that our use of multiple imputation was justified, given the key Missing At Random (MAR) criterion, whereby the probability of missingness depends only on variables that are observed [30, 31]. We conducted the imputation using the Amelia II program [61] to create 20 multiply imputed data sets and combined results using the SAS PROC MIANALYZE procedure.
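As an illustration only of how estimates from multiply imputed data sets are commonly combined, the following minimal sketch applies the standard combining rules (Rubin's rules) to a log prevalence ratio. It assumes these are the rules applied at the combining step; the function name `pool_rubin` and all input numbers are hypothetical, and the confidence interval uses a normal approximation for simplicity rather than the t-based reference distribution a full implementation would use.

```python
import math
from statistics import mean, variance


def pool_rubin(estimates, variances):
    """Pool per-imputation estimates and their variances with Rubin's rules."""
    m = len(estimates)
    q_bar = mean(estimates)      # pooled point estimate
    w = mean(variances)          # average within-imputation variance
    b = variance(estimates)      # between-imputation variance (m - 1 denominator)
    t = w + (1 + 1 / m) * b      # total variance
    return q_bar, t


# Hypothetical log prevalence ratios and squared standard errors from
# m = 5 imputed data sets (made-up values, not results from this study).
log_pr = [0.38, 0.40, 0.37, 0.41, 0.39]
se2 = [0.0009, 0.0010, 0.0009, 0.0011, 0.0010]

q, t = pool_rubin(log_pr, se2)
half_width = 1.96 * math.sqrt(t)  # normal approximation (a simplification)
print(f"pooled PR = {math.exp(q):.2f}")
print(f"95% CI: {math.exp(q - half_width):.2f}, {math.exp(q + half_width):.2f}")
```

Pooling is done on the log scale and exponentiated afterwards, since ratio estimates are closer to normally distributed on that scale.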

Fourth, using the data set with imputed values, and informed by the results of the preceding analyses, we built up models to assess the prevalence ratio for being ER− versus ER+ in relation to race/ethnicity and to socioeconomic position, independently and together, adjusting for relevant covariates (age, catchment area, tumor size, tumor stage, and histologic type). For these models, we used log binomial regression, an analytic approach specifically developed for conditions in which the “odds ratio is not a good approximation of the risk or prevalence ratio” [62]. The parameter estimates from these models can be expressed as prevalence ratios [62–65]. In order to calculate the percent change in excess risk comparing two parameter estimates (e.g., PR1 vs. PR2), we used the formula: ((PR1 − 1) − (PR2 − 1))/(PR1 − 1). Due to unexpected catchment area differences in the prevalence of ER unknown tumors (higher in Los Angeles than in the San Francisco Bay Area, as shown in Table 1), we tested for interaction effects between race/ethnicity and catchment area for risk of ER status; finding none, we controlled for catchment area in the models. We conducted all analyses in SAS [66].

Table 1 Estrogen receptor status distribution of invasive breast cancer cases by age, tumor characteristics, and area-based socioeconomic position, overall and by race/ethnicity: San Francisco Bay Area* and Los Angeles County, 1998–2002

Results

Table 1 presents selected descriptive data on the distribution of estrogen receptor (ER) status in the study population by tumor characteristics (stage, size, histologic type), socioeconomic position, and catchment area, overall and by race/ethnicity. Highlighting the strong association between race/ethnicity and socioeconomic position, a much higher proportion of the black non-Hispanic, Hispanic, and Asian and Pacific Islander non-Hispanic cases (47.8%, 34.6%, and 16.1%, respectively) than of the white non-Hispanic cases (7.6%) lived in impoverished census tracts (≥20% below poverty).

Figure 1 visually depicts the patterning of ER status by socioeconomic position across racial/ethnic groups. Overall, 21.0% of the women were missing ER status, with missingness highest among the Hispanic and black non-Hispanic women (28.5% and 24.7%, respectively), followed by the Asian and Pacific Islander non-Hispanic women (22.1%), and least among the white non-Hispanic women (18.4%). As would be expected, estimates of the percent ER+ and ER− were higher when based only on cases with known ER status, since these estimates ignore the percent with ER unknown. For the women overall, the contrast was 79.2% ER+ and 20.8% ER− (known ER status) versus 62.6% ER+ and 16.5% ER− (all cases, including the unknown). By race/ethnicity, these contrasts were: (a) white non-Hispanic: 82.8% ER+ and 17.2% ER− (known ER status) versus 67.5% ER+ and 14.0% ER− (all cases); (b) black non-Hispanic: 65.8% ER+ and 34.2% ER− (known ER status) versus 49.6% ER+ and 25.7% ER− (all cases); (c) Asian and Pacific Islander non-Hispanic: 77.2% ER+ and 22.8% ER− (known ER status) versus 60.1% ER+ and 17.8% ER− (all cases); and (d) Hispanic: 72.3% ER+ and 27.7% ER− (known ER status) versus 51.7% ER+ and 19.8% ER− (all cases).
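To make concrete why the odds ratio overstates the prevalence ratio when the outcome is common, the following sketch computes both crude measures from the ER− percentages among cases with known ER status quoted above. It is an unadjusted, illustrative calculation only and does not reproduce the adjusted estimates in the tables; the function names and dictionary keys are hypothetical.

```python
def prevalence_ratio(p_group, p_reference):
    """Ratio of two prevalences."""
    return p_group / p_reference


def odds_ratio(p_group, p_reference):
    """Ratio of the odds implied by two prevalences."""
    def odds(p):
        return p / (1 - p)
    return odds(p_group) / odds(p_reference)


# Proportion ER- among cases with known ER status, as quoted in the text.
er_neg_known = {
    "white": 0.172,
    "black": 0.342,
    "asian_pacific_islander": 0.228,
    "hispanic": 0.277,
}

ref = er_neg_known["white"]
for group, p in er_neg_known.items():
    if group == "white":
        continue  # referent group
    pr = prevalence_ratio(p, ref)
    orr = odds_ratio(p, ref)
    print(f"{group}: crude PR = {pr:.2f}, crude OR = {orr:.2f}")
```

For every comparison the crude OR lies further from 1 than the crude PR, because ER− prevalence in each group is well above the roughly 10% threshold under which the two measures approximately coincide.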

Fig. 1

Distribution of estrogen receptor (ER) status (positive, negative, unknown) among women with primary invasive breast tumors by race/ethnicity and two area-based measures of socioeconomic position: (a) census tract poverty/high income composite measure and (b) census tract percent of college graduates, San Francisco Bay Area and Los Angeles County, 1998–2002

Also as expected, the distribution of ER status (both known and unknown), in addition to differing by race/ethnicity, varied by age at diagnosis, tumor characteristics, and socioeconomic position (Table 1). Both ER− tumors and tumors missing data on ER status were most common, and ER+ least common, among the younger women, women diagnosed with regional and distant tumors and with ductal histologic type (ER− only) or “other” histologic type (especially if ER unknown), and women living in the more impoverished and less educated census tracts. For example, 24% of the women with ER status unknown and 19% with ER− tumors, versus 13% of the women with ER+ tumors, lived in census tracts with 20+% poverty.

Table 2 shows results for multivariable analyses regarding racial/ethnic disparities for risk of having ER status unknown, analyzed in relation to the PR. The excess risk of having ER status unknown among the women of color compared to white women (Model 1) was strongly attenuated by adjusting for socioeconomic position (Model 2), with the effect of this adjustment greater than adjustment for tumor characteristics and catchment area (Model 3). For example, comparing the black non-Hispanic to the white non-Hispanic women, the 33% greater crude risk for ER status unknown (Model 1) was reduced to 4% and rendered statistically non-significant in models that controlled only for socioeconomic position (Model 2), and remained statistically non-significant, at 7%, in models adjusting for all included covariates (Model 4). Similar patterns were evident for the Hispanic and the Asian and Pacific Islander non-Hispanic women, albeit the reduction in excess risk by controlling for the socioeconomic and other covariates was not sufficient to render the difference statistically non-significant.

Table 2 Multivariable analysis* of prevalence ratio (PR) for racial/ethnic disparities in missing estrogen receptor (ER) status, overall and adjusting for socio-demographic and tumor characteristics: primary invasive breast cancer cases among women, San Francisco Bay Area**, and Los Angeles County, 1998–2002

Next, Table 3 presents the multivariable analyses for racial/ethnic and socioeconomic disparities, separately and combined, for being ER− versus ER+, as measured using the PR and based on the imputed data. Racial/ethnic and socioeconomic disparities were evident in models adjusting solely for age and catchment area (Models 1 and 2). Compared to the white non-Hispanic women, risk of being ER− was greatest among the black non-Hispanic women (Model 1: PR = 1.76; 95% CI: 1.66, 1.86), followed by the Hispanic women (Model 1: PR = 1.42; 95% CI: 1.34, 1.50) and the Asian and Pacific Islander non-Hispanic women (Model 1: PR = 1.19; 95% CI: 1.11, 1.26); risk was lowest among women living in CTs with the highest versus lowest proportion of college graduates (Model 2: PR = 0.71; 95% CI: 0.66, 0.76). As shown by Model 3, adding socioeconomic data to Model 1 had nearly as great an impact on reducing the estimates of racial/ethnic disparities in ER status as did separately adjusting, in Model 4, for tumor characteristics. In the fully adjusted Model 5 (including data on socioeconomic position, tumor characteristics, age, and catchment area), both black non-Hispanic and Hispanic women (but not Asian and Pacific Islander non-Hispanic women) remained at elevated, albeit lower, risk of being ER− (PR = 1.47; 95% CI: 1.38, 1.56 and PR = 1.21; 95% CI: 1.14, 1.29, respectively), as did women who lived in the lowest compared to highest income census tracts (PR = 1.10; 95% CI: 1.00, 1.20); women who lived in the most compared to least educated census tracts were at lowest risk (PR = 0.85; 95% CI: 0.79, 0.91).

Table 3 Multivariable analyses* of racial/ethnic and socioeconomic disparities in the prevalence ratio (PR) for being ER− versus ER+, based on the imputed data, for primary invasive breast cancer among women, San Francisco Bay Area** and Los Angeles County, 1998–2002

Table 4 compares findings for: (a) the OR versus PR as the parameter estimate, using the imputed data, and (b) the PR, using the observed versus imputed data. Adjusting for age, socioeconomic position, tumor characteristics, and catchment area (Model 1), the excess risk was 43% lower when measured by the PR rather than the OR for the black non-Hispanic/white non-Hispanic comparison (OR = 1.82 vs. PR = 1.47), 34% lower for the Hispanic/white non-Hispanic comparison (1.32 vs. 1.21), and 36% lower for the Asian and Pacific Islander non-Hispanic/white non-Hispanic comparison (1.11 vs. 1.07). Adjusting for the same covariates in Model 2, the excess risk of being ER− versus ER+ measured by the PR was reduced, for analyses based on the imputed versus observed data, by 16% for both the black/white and Hispanic/white comparisons (1.56 vs. 1.47, and 1.25 vs. 1.21, respectively), and by 13% for the Asian/white comparison (1.08 vs. 1.07).
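The percentages quoted for Table 4 can be checked against the percent change in excess risk formula given in the statistical analyses section, ((PR1 − 1) − (PR2 − 1))/(PR1 − 1). The sketch below is illustrative only: the function name `pct_change_excess` and the dictionary labels are hypothetical, and the input pairs are the point estimates quoted in the text.

```python
def pct_change_excess(est1, est2):
    """Percent change in excess risk going from est1 to est2."""
    return 100 * ((est1 - 1) - (est2 - 1)) / (est1 - 1)


# (first estimate, second estimate) pairs as quoted in the text for Table 4.
comparisons = {
    "OR vs PR, black/white": (1.82, 1.47),
    "OR vs PR, Hispanic/white": (1.32, 1.21),
    "OR vs PR, Asian/white": (1.11, 1.07),
    "observed vs imputed PR, black/white": (1.56, 1.47),
    "observed vs imputed PR, Hispanic/white": (1.25, 1.21),
    "observed vs imputed PR, Asian/white": (1.08, 1.07),
}

for label, (est1, est2) in comparisons.items():
    print(f"{label}: {pct_change_excess(est1, est2):.1f}% lower excess risk")
```

Because the inputs are estimates rounded to two decimals, the recomputed values can differ slightly from the published ones; the last comparison, for instance, evaluates to about 12.5%, which the text reports as 13%.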

Table 4 Multivariable analysis* of racial/ethnic and socioeconomic disparities in: (a) the odds ratio (OR) versus prevalence ratio (PR) for being ER− versus ER+, based on the imputed data, and (b) the PR for being ER− versus ER+ for the observed versus imputed data, for primary invasive breast cancer among women, San Francisco Bay Area** and Los Angeles County, 1998–2002

Discussion

The central finding of our investigation of racial/ethnic disparities in breast cancer ER status is that estimates of the magnitude of these disparities are sensitive to inclusion of socioeconomic data and treatment of missing data for ER status, as well as choice of parameter estimate, i.e., the prevalence ratio versus the odds ratio. Not only was the racial/ethnic patterning of missing ER data driven chiefly by racial/ethnic socioeconomic disparities, but the observed crude racial/ethnic disparities in ER status were notably reduced by adjusting for socioeconomic position and, to a lesser extent, by using imputed data. In the case of black/white comparisons, in analyses based on the imputed data, the excess risk for being ER− versus ER+ in the fully adjusted model was 43% lower when measured by the PR rather than the OR, and the excess risk measured by the PR was 16% lower in analyses based on the imputed versus observed data. The net implication is that studies on race/ethnicity and ER status that neglect to include socioeconomic data and fail to account for missing data will yield inflated estimates of racial/ethnic disparities in ER status, a problem magnified by reporting the OR rather than the PR.

Study limitations

Before accepting this study’s results, it is important to consider potential limitations affecting the study design and data analysis. First, since we were able to obtain only data included in cancer registry records, we lacked information on several known risk factors for breast cancer ER status: use of hormone therapy, postmenopausal obesity, and reproductive history, including both nulliparity and late age at first pregnancy [1, 33–37]. Given that all of these risk factors, except for postmenopausal obesity, are more prevalent in the US among more affluent, more educated, and white women, compared to women of color and to more economically deprived and less educated women [3, 34, 37, 49], adjusting for these additional risk factors would presumably have further decreased the magnitude of racial/ethnic disparities in ER status. By the same logic, had we been able to adjust for individual-level as well as census tract socioeconomic measures (including across the lifecourse), instead of relying only on the area-based socioeconomic measures, the racial/ethnic disparities in ER status would likely have been further diminished [57, 67–69].

The lack of data on health system variables associated with ER status, such as access to and quality of screening and treatment, is also unlikely to have compromised our results, given our inclusion of data on what these health system variables are supposed to affect, e.g., tumor size and stage [1, 33–37]. That said, inclusion of data on health insurance, delays in obtaining screening, delays in obtaining medical care, and reasons for ER status being unknown would have been useful for better understanding health system variables affecting ER status. Moreover, racial/ethnic misclassification (likely low for the white non-Hispanic and black non-Hispanic cases [51–55]) is unlikely to have unduly biased the results, since such misclassification is unlikely to have been systematically linked to ER status.

Also meriting caution is our use of multiple imputation for the missing data. Justifying our use of this technique, as noted previously, was the inclusion of key known risk factors for ER status, thereby meeting the Missing At Random (MAR) assumption that the probability of missingness depends only on the observed variables [30, 31]. If, however, the data were Not Missing At Random (NMAR, i.e., there are additional unobserved predictors of both missingness and ER status), more complex models for non-ignorable non-response would be required [30]. Determining whether these assumptions are met depends on conceptual criteria, and cannot be empirically tested in the observed data [29–31].

One additional caveat concerns generalizability, since our study base was restricted to two regions, both within one US state, with cases diagnosed between 1998 and 2002. Our findings thus cannot be generalized to all breast cancer cases in the US for all time periods, especially given secular changes in many of the known risk factors for ER status, including reproductive history, use of hormone therapy, and body mass index [3, 37, 49], and also refinements in assays for ER status [1]. Even so, the results likely do have meaningful implications for the more recent US studies conducted on breast cancer estrogen receptor status and race/ethnicity, e.g., the 21 studies conducted during the past decade [3, 5–15, 20–24].

Interpretation of results

Assuming our results are reasonably valid and reflect the experiences of a reasonably heterogeneous study population, our study raises important questions about the seeming US scientific consensus that intrinsic racial/ethnic disparities exist in breast cancer ER status [14]. Our results instead imply this consensus is misleading, since it is based predominantly on studies that: (a) lacked socioeconomic data; (b) ignored the problem of missing data; and (c) reported only the odds ratio [3, 6, 7, 9, 12–15, 17, 18, 20, 23], or else a p-value for a chi-square test [5, 8, 10, 11, 16]. As with the contrasting prior negative studies, all of which controlled for socioeconomic position and had little or no missing ER data [20–28], we found that taking into account racial/ethnic socioeconomic disparities in ER status and missingness of ER data strongly reduced estimates of racial/ethnic disparities in ER status. Moreover, had we been able to include additional risk factors for ER status known to vary by socioeconomic position within and across racial/ethnic groups, such as hormone therapy, body mass index, and reproductive history [1, 33–35], it is likely that we would have further shrunk the observed racial/ethnic disparities in ER status.

Granted, our study data do not permit us to rule out whether there are particular candidate genes that vary in frequency by race/ethnicity and that shape risk of developing an ER+ versus ER− breast tumor, as some have hypothesized [24]. Such a hypothesis, however, would need to account not only for the well-known genetic heterogeneity among the racial/ethnic groups delimited by the official federal US racial/ethnic categories [38, 70–72] but also for why, even within these racial/ethnic groups, socioeconomic disparities exist for risk of being ER+.

Our finding of socioeconomic disparities in ER status even in models containing data on tumor characteristics further implies the existence of additional pathways (other than those captured by tumor size, stage, and histologic type) by which societal conditions influence ER status. Given that ER status remains a powerful predictor of breast cancer survival [13, 34, 35], and that research on determinants of ER status remains scant [1, 35, 73, 74], a research program on the social determinants of ER status is warranted. In light of our findings, we emphasize that research on ER status should not be restricted only to cases with known ER status, since doing so would, in the US context, disproportionately include white and more affluent women and exclude women subjected to economic deprivation and women of color. The potential harm to both population health and scientific inference resulting from failing to take into account the full population distribution of exposures and health outcomes and from ignoring socioeconomic confounding has been repeatedly demonstrated, most recently in research on hormone therapy and risk of cardiovascular disease [75, 76], with likely spillover consequences including increased breast cancer incidence attributable to HT use [37, 75–80]. Our findings likewise suggest that better understanding of the determinants of missing ER status and its utility as a health services marker of inadequate medical care [33, 34] would likely be beneficial for efforts to improve breast cancer survival.

A final implication of our study is that research on race/ethnicity and breast cancer estrogen receptor status, like any population health research, requires considering the social as well as biological determinants of health, along with the social determinants of missingness and data quality. At issue is the conduct not of “politically correct” science, but of correct science [38, 77]. Leave out socioeconomic data when studying racial/ethnic health inequities [38, 39, 57, 67–69], or ignore the social patterning of missing data [81, 82], and causal inferences are likely to be biased, resulting, in the case of ER status, in inflated estimates of racial/ethnic disparities.