Introduction

According to the National Cancer Institute, colorectal cancer is the fourth most commonly diagnosed cancer in the US among both sexes, but the second leading cause of cancer death when men and women are combined [1, 2]. Among those diagnosed, the 10-year survival rate is about 58% [3]. The incidence of colorectal cancer has decreased modestly in the past few decades due to new treatments, improvements in recognition and control of risk factors such as smoking, and early screening, diagnosis, and interventions [4]. Despite these overall improvements, colon cancer incidence and mortality remain substantially higher among blacks than whites and higher among men than among women [4, 5].

While prevention is always the best treatment, cancer stage at diagnosis is one of the most important determinants of survival [6]. Colorectal cancer diagnosed at an earlier stage has a higher survival rate than cancer diagnosed at a late stage [7, 8]. Furthermore, from a health disparity perspective, when analyzed by the stage of cancer (localized, regional, distant), black race was associated with poorer survival rates than white race at each stage [9]. Moreover, insurance coverage accounts for a significant amount of the disparity in survival rate between younger black and white patients with colon cancer [10]. In developed countries, later-stage diagnoses of colon cancer, as well as other cancers, are associated with socioeconomic disparity, potentially exacerbating the already known difference in overall survival [11].

By continuing to understand the effect of various social health disparities, such as age, gender, race, type of insurance coverage, socioeconomic status, education level, unemployment rates, and poverty rates, on colorectal cancer stage at presentation, we may gain a better understanding of the effect they have on cancer outcomes. Studies have focused on the role of race and insurance status in outcomes of colon cancer stratified by stage at presentation, but few studies have looked at the role of income status, education level, and geographic urban versus rural disparity, all of which are known disparities in cancer care. To this end, we endeavored to understand the role that race, age, insurance status, education level, income level, and geographic disparity may play in the presentation of late stage colon cancer, potentially shedding light on how health disparities lead to increased cancer mortality. We performed a retrospective database analysis to understand the association that colorectal cancer stage (late versus early) at presentation may have with patient-specific and geographic disparities.

Methods

Data Source

Data were sourced from the National Cancer Institute Surveillance, Epidemiology, and End Results (SEER) database for the years 2007–2014, which is supplemented by county-level demographic data from the US Census and American Community Survey (ACS). SEER is a nationally representative survey of cancer registries that collects incidence and survival data from population-based registries that cover approximately 28% of the US population [12]. SEER diagnosis records prior to the year 2011 were matched to data from the 2007–2011 ACS, while diagnosis records after the year 2011 were matched to ACS data for 2011–2015 to ensure temporal accuracy for the county-level attributes used in this analysis.
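The year-based matching of SEER records to ACS files can be sketched as a simple lookup. This is an illustrative Python sketch, not the authors' code; the function name is hypothetical. The text specifies the rule for diagnoses before and after 2011 but not for 2011 itself, so the treatment of 2011 below is an assumption.

```python
def acs_vintage(year_of_diagnosis: int) -> str:
    """Return the ACS 5-year file that a SEER diagnosis year is matched to."""
    if not 2007 <= year_of_diagnosis <= 2014:
        raise ValueError("year of diagnosis is outside the 2007-2014 study window")
    # ASSUMPTION: the source does not say how 2011 diagnoses were matched;
    # here 2011 is grouped with the earlier years.
    if year_of_diagnosis <= 2011:
        return "2007-2011"
    return "2011-2015"


for year in (2007, 2011, 2012, 2014):
    print(year, "->", acs_vintage(year))
```

Matching each record to the ACS file closest in time to its diagnosis keeps the county-level attributes from describing a county several years before or after the patient was diagnosed.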

Study Population and Variables

Patients were selected on the basis of a malignant primary site in the colon or rectum, a year of diagnosis between 2007 and 2014, and an age at diagnosis of 40 years or older. Records with missing gender or age data were excluded, as were records missing valid staging data or a state/county assignment. Covariates abstracted from SEER included patient-level variables such as age, race, year of diagnosis, primary site, state/county, insurance status, and derived American Joint Committee on Cancer (AJCC) 6th edition staging at diagnosis. Additionally, county-level data available from the ACS were sourced for each record and matched on state/county. These included percent urban population, median family income, rural–urban continuum code classification, percent of population that has not completed high school, percent of population below the poverty line, percent of population foreign-born, percent of language-isolated persons, and unemployment rate. The primary outcome analyzed in this study was cancer stage at diagnosis, which we defined as early for cancer with AJCC stage I or II and late for cancer with AJCC stage III or IV at diagnosis [13].
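The selection criteria and the early/late outcome definition can be expressed as two small functions. The Python sketch below is illustrative only: the record layout and field names (`age`, `sex`, `year`, `county`, `ajcc_stage`) are hypothetical and do not correspond to actual SEER variable names, and stripping substage letters is an assumption about how stage strings are coded.

```python
from typing import Optional

EARLY, LATE = "early", "late"


def stage_group(ajcc_stage: Optional[str]) -> Optional[str]:
    """Map an AJCC stage string to the study outcome; None if not stage I-IV."""
    if not ajcc_stage:
        return None
    # ASSUMPTION: substages are written with a trailing letter, e.g. "IIA", "IIIC".
    roman = ajcc_stage.strip().upper().rstrip("ABC")
    if roman in ("I", "II"):
        return EARLY
    if roman in ("III", "IV"):
        return LATE
    return None  # stage 0, unknown, or unstaged


def is_eligible(record: dict) -> bool:
    """Apply the inclusion and exclusion criteria described in the text."""
    age, year = record.get("age"), record.get("year")
    return (
        age is not None and age >= 40
        and year is not None and 2007 <= year <= 2014
        and record.get("sex") is not None
        and record.get("county") is not None
        and stage_group(record.get("ajcc_stage")) is not None
    )


# Hypothetical example records.
records = [
    {"age": 62, "sex": "F", "year": 2010, "county": "06037", "ajcc_stage": "IIA"},
    {"age": 38, "sex": "M", "year": 2012, "county": "06037", "ajcc_stage": "IV"},
    {"age": 71, "sex": "M", "year": 2013, "county": None, "ajcc_stage": "III"},
]
cohort = [r for r in records if is_eligible(r)]
print([stage_group(r["ajcc_stage"]) for r in cohort])
```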

Statistical Methods

A contingency table was first generated to identify counts of early and late stage diagnosis for each covariate under study and a χ2 analysis was performed as a first approximation of the associations between these covariates and stage at presentation (Table 1). Continuous county-level covariates were grouped into quintiles for the purposes of the contingency table. This was performed as a univariate analysis to inform the independent associations of each covariate with stage at presentation.
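To make the χ2 step concrete, the following Python sketch computes the Pearson statistic for a contingency table and, for the 2 × 2 case only, the corresponding p value. It is not the authors' R code. The counts are hypothetical: they assume 1000 patients per group and only borrow the late-stage percentages reported in the Results for black and white patients; the real group sizes are in Table 1.

```python
import math


def chi_square_statistic(table):
    """Pearson chi-squared statistic for a table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of row and column factors.
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (observed - expected) ** 2 / expected
    return statistic


def p_value_df1(statistic):
    """Upper-tail p value for 1 degree of freedom (2 x 2 tables only)."""
    return math.erfc(math.sqrt(statistic / 2.0))


# HYPOTHETICAL counts: columns are [early, late]; rows are two patient groups.
table = [
    [469, 531],  # group with 53.1% late stage
    [526, 474],  # group with 47.4% late stage
]
chi2 = chi_square_statistic(table)
p = p_value_df1(chi2)
print(f"chi-squared = {chi2:.2f}, p = {p:.4f}")
```

Tables with more than two rows or columns need the general χ2 distribution with (rows − 1) × (columns − 1) degrees of freedom, which this sketch does not implement.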

Table 1 Contingency table and χ2 analysis

However, since the χ2 test does not demonstrate the directionality or independence of these associations, we generated a model using a multivariate binary logistic regression on the same data to further elucidate the associations between study covariates and stage of cancer presentation. Study covariates that were determined to be significant in the χ2 analysis were included in this model, which classified cases into early or late stage at presentation, with continuous county-level covariates modeled as continuous variables. The risk of a late stage at diagnosis for colorectal cancer was calculated as an adjusted odds ratio (OR) and 95% confidence interval (CI) (Table 2). The regression and χ2 analyses were performed in the R statistical computing software using the speedglm package [14, 15]. Statistical tests were two-tailed with α = 0.05. This methodology is validated by previous literature analyzing disparities in healthcare outcomes with the SEER data set [16].
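An adjusted OR and its 95% CI are obtained from a fitted logistic regression coefficient by exponentiation. The Python sketch below shows only that conversion, using the usual Wald interval; it does not reproduce the speedglm fit. The standard error in the example is a hypothetical value chosen for illustration, and the paper does not state which interval method was used.

```python
import math


def odds_ratio_ci(beta, standard_error, z=1.96):
    """Convert a log-odds coefficient into an OR with a Wald confidence interval.

    z = 1.96 corresponds to a two-sided 95% interval.
    """
    odds_ratio = math.exp(beta)
    lower = math.exp(beta - z * standard_error)
    upper = math.exp(beta + z * standard_error)
    return odds_ratio, lower, upper


# A coefficient equivalent to OR = 1.19; the standard error is HYPOTHETICAL.
beta = math.log(1.19)
odds_ratio, lower, upper = odds_ratio_ci(beta, standard_error=0.02)
print(f"OR = {odds_ratio:.2f} (95% CI {lower:.2f} to {upper:.2f})")
```

An interval that excludes 1.0 corresponds to a two-sided Wald test that is significant at α = 0.05.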

Table 2 Logistic regression model for late stage at presentation

Results

Table 1 presents a contingency table showing the stage at presentation of the analyzed cases by patient demographics and county-level ACS attributes. Our study consisted of a total of 259,828 patients. Chi-squared analysis demonstrated significant associations (at p < 0.05) between stage at diagnosis and sex, race, age, insurance status, location of primary site, percent of population below poverty line, percent of language-isolated persons, and percent of unemployed.

Chi-squared analysis: demographics

Race, age, and insurance status were found to be significant in this analysis. Among black patients, 53.1% presented with late-stage cancer, compared with 47.4% of white patients. Increasing age showed a negative trend for late-stage presentation: older patients were more likely to present with early-stage cancers, while younger patients presented with a higher percentage of late-stage than early-stage cancers. Insurance status also demonstrated a significant association, with both Medicaid patients and uninsured patients presenting with a higher percentage of late-stage cancer (54.1% and 59.6%, respectively) than early-stage cancer (45.9% and 40.4%, respectively), while insured patients presented with a lower percentage of late-stage cancer (42.4%) than early-stage cancer (57.6%). By location of primary site, the large intestine, NOS (78.5%) showed a significantly greater rate of late-stage cancer than the rectum (52.8%). When the colon was broken down into parts, the cecum (50.4%) and the rectosigmoid junction (53.1%) had significantly greater rates of late-stage cancer. The ascending colon (56.2%), descending colon (53.5%), hepatic flexure (54.6%), sigmoid colon (52.9%), splenic flexure (50.1%), and transverse colon (56.9%) showed greater rates of early-stage cancer.

Chi-squared analysis: county-level

County-level ACS data were also examined to identify geographic associations with stage at diagnosis. Percent below poverty line, percent of language isolation, and percent unemployed were all significant in this analysis. The effect of poverty was analyzed using the percentage of the population below the poverty line in a patient’s county and showed higher rates of late-stage cancer for patients living in the most impoverished counties compared to the least (49.8% and 47.8%, respectively). For language isolation, early-stage cancer was more common than late-stage cancer at every level of county language isolation. Unemployment rates showed a positive association with late-stage presentation: counties with the lowest unemployment had 48.0% late-stage presentation, versus 49.6% in counties with the highest unemployment (Table 1).

A logistic regression model was also used to find significant disparities in stage at presentation and to identify the factors that contribute independently to stage at presentation. This analysis showed significant results for race, age, insurance, percent of language isolation, and percent of those who moved to the US in the past year.

Compared to white patients, black patients had the highest likelihood of a late-stage presentation (OR 1.19, p < 0.01), followed by patients of “other” race (American Indian, Alaska Native, Asian/Pacific Islander) (OR 1.12, p < 0.01). Additionally, increasing age was found to have an inverse relationship with late-stage colon cancer and was instead significantly associated with an earlier stage at presentation. This trend becomes significant in the 50–54 age group (OR 0.59, p < 0.01) and continues to the oldest age group in the data set, 85+ (OR 0.52, p < 0.01). Insurance status was also found to be predictive of stage at presentation, with Medicaid (OR 1.22, p < 0.01) and uninsured (OR 1.36, p < 0.01) patients significantly more likely than insured patients to have a late stage at presentation.

County-level attributes also demonstrated significant associations with stage at presentation, specifically percent of population that is language-isolated (OR 0.99, p = 0.01) and percent of population that moved to the US from another country in the past year (OR 1.11, p = 0.02).

Sex, median family income, percent urban, percent below poverty line, percent below high school education, and unemployment percentage were not found to be significant predictors of a patient’s cancer staging at diagnosis.

Discussion

Our study is a Type II prognostic study, which is used to identify factors associated with subsequent clinical outcomes in a patient with a given disease. Our study uniquely examines the association of important disparities such as income factors, education level, and geographic disparities (as well as the traditional disparities of race, gender, and insurance status) with stage at presentation for colorectal cancer diagnosed from 2007 through 2014 in a large nationally representative cohort. In this study, we assessed the role of patient and geographic disparities in stage at presentation of colorectal cancer.

At the patient level, black race and being uninsured or on Medicaid were all significantly associated with presenting with late-stage colon cancer. In contrast, increasing age was significantly associated with early-stage rather than late-stage colon cancer. This is likely due in part to the United States Preventive Services Task Force Grade “A” recommendation to begin colon cancer screening at the age of 50 (40 if there is a family history), which helps older patients detect the cancer earlier [17]. Another explanation and likely contributor is that a different, more biologically aggressive tumor occurs in younger patients. This has been hypothesized because of the higher percentage of poorly differentiated and mucin-producing cancers found in the younger population. Younger patients are also more likely to have colon cancer syndromes, such as familial adenomatous polyposis or Lynch syndrome [18]. In addition, we found in our univariate analysis that gender did have a significant relationship with stage at presentation; however, this association was later found to be explained by other risk factors in our multivariate analysis.

At the geographic level, percent of population that is language isolated and percent of population that immigrated to the US within the past year were significant predictors of late-stage presentation of colorectal cancer. Percent urban population, percent below high school education, percent below poverty line, and percent unemployed were not significant predictors of late-stage cancer. Economic factors, such as median family income, were not associated with late-stage presentation of colorectal cancer; however, median family income is not a fully representative variable in the SEER database. The lowest data point in SEER for median family income begins close to the US poverty line of $25,100 for a four-person household; therefore, it does not accurately represent the population of patients below the poverty line [18].

Previous studies that used data from single states or examined single disparities have found similar associations between colorectal cancer stage at presentation and patient or geographic disparities [7, 19–27]. A similar study in 2009 by Halpern et al. [27] used the National Cancer Database to look at multiple patient characteristics associated with stage at diagnosis. They found insurance status, race, gender, and age to be significantly associated with colorectal cancer stage at diagnosis. Our study did not find a significant association for gender in the multivariate analysis, but had consistent results for insurance status, race, and age. Their study also found increasing age to be significantly associated with early-stage presentation of colon cancer, further validating our results. Their study found that women had higher odds of presenting with late-stage cancer than men. The difference in the significance of gender may be due to the different national databases used, or it may indicate that this disparity has subsided over time. While both are large national databases, SEER registries cover about 28% of the entire US population, while the National Cancer Database captures about 70% of all newly diagnosed cancers [28, 29]. Another reason may be that the Halpern et al. study examined patients diagnosed with cancer between 1998 and 2004 and was conducted almost 10 years ago. The gender disparity may have diminished since then, which could explain why it was not found in our study. Our study went further and assessed geographic disparities as well as trends in the significant associations over the past 10 years. Overall, most of the disparities described in previous studies have not changed in recent years and remain disparities today.

A previous study by Valeri et al. used data from SEER to apply a counterfactual framework to colorectal cancer survival rates. This framework allows one to compare what actually happened with what would have happened if a variable were eliminated. Such frameworks help show the importance of the variable and the strength of its effect on the results or, in this case, the disparity. They used this method to estimate the extent to which racial disparities in colorectal cancer survival would be reduced if the differences in stage at diagnosis between races were eliminated [30]. The study looked specifically at black versus white patients and, consistent with our results, found that black patients are more likely to be diagnosed with Stage IV cancer than white patients. They found that removing these disparities in stage at diagnosis would reduce the overall difference in cancer outcomes between black and white patients by 35%. This is a substantial reduction and demonstrates the importance of stage at diagnosis to survival outcomes.

A study by Winawer et al. found that patients who underwent colonoscopic polypectomy, which removes adenomatous polyps, as a preventative measure showed a lower-than-expected incidence of colorectal cancer. Out of 1418 patients who had a colonoscopic polypectomy, only five had asymptomatic early-stage colorectal cancer detected later on, and no symptomatic cancers were detected [31]. Many patients are still not up-to-date with screenings despite various studies like this showing colorectal cancer screening to be an effective preventative measure. Another study by Siegel et al. looked at various colorectal cancer statistics. It identified the percentage of United States adults over 50 who had screening tests done, broken down by several disparities. It found that 68.3% of adults over 65 get screened, while only 57.8% of adults ages 50–64 get screened. Asian and Hispanic adults had the lowest screening rates, with 49.4% and 49.9% getting screened, respectively, compared to 65.4% of white adults. By education level, 71.3% of adults who graduated college get screened, while only 47.4% of those with a high school degree or less get screened. The largest difference exists in insurance status, with 59.6% of insured patients getting screened but only 25.1% of uninsured patients [32]. This study shows that the same disparities exist in screening rates as in stage at diagnosis, further indicating the importance of reducing or eliminating disparities to improve colorectal cancer outcomes.

An article by Zonderman et al. [33] stated that one of the most potent nonbiological factors influencing the development of health disparities is poverty. The article attributed lower education levels and insurance status to lower socioeconomic status. This in turn can lead to infrequent doctors’ visits, lower health literacy, and more unhealthy practices like smoking, which can all increase cancer risks dramatically. To help reduce these disparities, community resources and culturally appropriate techniques must be implemented that target the unique populations at greatest risk for developing the disease.

This study has several limitations. First, the SEER data are broadly representative of the United States cancer population, but there are minor differences: foreign-born patients and urban inhabitants are overrepresented [33]. Moreover, although SEER is considered the gold standard for data quality amongst cancer registries in the US and globally, the database is still incomplete and has inaccuracies, since its data are collected from selected registries that may not be representative of the entire US. Inaccuracies can arise from miscoding of the data transmitted to SEER or from inaccurate data being made available to the registrar for coding. Data about socioeconomic status in particular are lacking [34]. In addition, patient migration is an important limitation of SEER. Patients moving between SEER and non-SEER regions would be lost from the data, leading to bias in the conclusions. This is an intrinsic limitation of all cancer databases that use registries. We used the SEER Rural-Urban Continuum Code method to classify rural versus urban patients, which classifies patients based on the county they reside in. It is possible that patients do not seek care in their area of residence, causing inaccurate data. Also, SEER, like other observational data sources, provides detailed data on diagnosis, stage, and treatment at the time of diagnosis, but long-term outcomes and follow-ups are not available. Although assessing the stage at presentation of colorectal cancer is a common method to determine outcomes, it is not completely indicative of outcomes and variations can exist.

SEER also does not collect some important variables that can help explain the causes of poor outcomes, for example, comorbidities (such as diabetes, polyp history, genetic disorders), previous treatments, use of screening tests, or lifestyle factors (such as smoking, alcohol, tobacco, obesity, lack of physical activity, diet low in fruits/vegetables); therefore, those variables cannot be controlled for [36–38]. A 2013 study by Johnson et al. performed a meta-analysis using 12 established non-screening colorectal cancer risk factors to quantify each risk factor’s impact on colorectal cancer risk. They found inflammatory bowel disease and history of colorectal cancer in a first-degree relative to be associated with the highest risk of colorectal cancer. They found increased body mass index, red meat intake, cigarette smoking, low physical activity, low vegetable consumption, and low fruit consumption to be associated with moderate risk of colorectal cancer. Contrary to common associations, they did not find any significant association between colorectal cancer and alcohol use, post-menopausal hormone therapy, or processed meat. They also did not find any significant difference in the risk of colorectal cancer with 5 years of aspirin/non-steroidal anti-inflammatory drug use compared to no use [37].

In addition, there are several limitations of the statistical methods. The SEER database has only a select number of confounding variables that can be controlled for, so the results can be biased by confounding variables that are not captured in the database. Further, the covariates may be multicollinear, which makes it difficult to separate the association of each independent variable with the dependent variable. In addition, SEER classifies anyone with private insurance or Medicare as insured, and it is not possible to extract solely Medicare data. Since the majority of patients with colorectal cancer are elderly and have Medicare, a Medicare-specific result would have been of interest; however, it is not possible with SEER.

Further, while certain associations are noted to be significant with a p value < 0.01, the odds ratio may not deviate substantially from 1.0. This is due to the large sample size in the study. While these associations are statistically significant, they may not be clinically significant. We have presented the odds ratios in Table 2 to help assess which variables might have clinically significant findings as well.
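The effect of sample size on statistical significance can be illustrated with a short calculation. The Python sketch below holds an odds ratio fixed and varies only the number of patients per group, using an unadjusted 2 × 2 comparison with a Wald test on the log odds ratio. All inputs are hypothetical (a 48% late-stage rate in the reference group and an OR of 1.05), and the cell counts are idealised expected counts rather than study data.

```python
import math


def wald_test_for_or(p_late_reference, odds_ratio, n_per_group):
    """Return (z, two-sided p) for a fixed OR observed in two equal-sized groups."""
    odds_reference = p_late_reference / (1.0 - p_late_reference)
    odds_comparison = odds_ratio * odds_reference
    p_late_comparison = odds_comparison / (1.0 + odds_comparison)
    # Idealised cell counts: late and early cases in each group.
    cells = [
        n_per_group * p_late_reference,
        n_per_group * (1.0 - p_late_reference),
        n_per_group * p_late_comparison,
        n_per_group * (1.0 - p_late_comparison),
    ]
    # Standard error of the log odds ratio for a 2 x 2 table.
    standard_error = math.sqrt(sum(1.0 / count for count in cells))
    z = math.log(odds_ratio) / standard_error
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_value


for n in (1_000, 100_000):
    z, p = wald_test_for_or(p_late_reference=0.48, odds_ratio=1.05, n_per_group=n)
    print(f"n per group = {n:>7}: z = {z:.2f}, p = {p:.2g}")
```

Because the standard error shrinks in proportion to the square root of the sample size, the same modest odds ratio moves from non-significant to highly significant as the groups grow, which is why effect sizes should be read alongside p values.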

Lastly, although the data assessed were from the most recent SEER data, an inherent time lag exists in the database; therefore, the data might not reflect the most recent associations between colorectal cancer disparities and stage at presentation.

Conclusion

In conclusion, this study revealed that stage at presentation remains significantly associated with certain disparities, such as black race, lower socioeconomic class, and lack of insurance. Studies from more than 10 years ago demonstrated similar disparities, indicating that not enough progress has been made to reduce these disparities and thereby reduce colorectal cancer incidence and improve survival rates.