Introduction

Prostate cancer (PrCA) is the most common cancer in men, accounting for 22% of all new cancer cases in American men [1]. Identifying novel risk factors associated with PrCA risk beyond established risk factors has been met with little success. Established and actively researched risk factors include race, education, marital status, family history, diet, and age [1,2,3,4,5,6].

In the general United States (US) population, PrCA incidence displays a non-uniform geographical distribution; in 2015 the national average was 99.1 new PrCA cases per 100,000 [7]. The variation in PrCA incidence by state can be clearly seen by comparing the three states with the highest PrCA incidence [i.e., New Jersey (127.4.7 per 100,000), Mississippi (126.7 per 100,000), and Louisiana (125.7 per 100,000)] to the three states with the lowest PrCA incidence [i.e., Alaska (61.0 per 100,000), Nevada (69.5 per 100,000), and New Mexico (72.6 per 100,000)] [7]. These differences across states may be associated with macro-level factors such as area-level income, educational attainment, rurality, and environmental factors [8,9,10]. Examining these potential area-level differences while accounting for individual-level risk factors for PrCA may provide valuable insight into identifying individuals at higher-than-normal risk based on where they live.

Epidemiologic research is focused on evaluating putative disease risk factors through the generation and testing of hypotheses related to risk factors across the socioecological spectrum, including environmental exposures and social determinants of health. Social determinants of health are defined as factors that are measured at the community level but impact the health outcomes of individuals [11]. Identifying areas with higher- or lower-than-expected numbers of PrCA cases using spatial techniques [12,13,14,15,16], employing spatial nested models [8,9,10] or autoregressive models [17, 18], and using ecological factors of the geographical unit of investigation [8,9,10] can provide a more thorough understanding of PrCA by identifying the expanded universe of risk factors associated with PrCA [17,18,19,20]. In addition, accounting for spatial autocorrelation may allow researchers to improve model fit, which in turn will allow them to determine geographical regions that are at increased or decreased risk for PrCA [21].

The goal of this study is to identify the impact of area-level characteristics, including social vulnerability, on incident PrCA diagnosis while controlling for established individual-level risk factors among veterans who receive care in the VA health care system, the largest universal access, integrated delivery system in the nation. We limited our analysis to veterans with PrCA who were diagnosed and treated at a VA facility and who reside in South Carolina, a state with a PrCA incidence that is higher than the national average [22]. South Carolina is a small geographical area compared to other states, and the veteran population is broadly distributed throughout the state. A key consideration is that while the majority of men who receive care in the VA are of lower socioeconomic status, all have guaranteed access to PrCA screening and treatment in particular and health care in general.

Methods

Study population

This study employed a retrospective nested case–control design. The timeframe was from 1 January 1999 to 31 December 2015. The unit of investigation was the US Census Bureau-defined ZIP code tabulation area (ZCTA), which consists of aggregated ZIP codes developed by the US Postal Service [23].

Data were obtained from the United States Department of Veteran Affairs and the United States Census Bureau. From the United States Department of Veteran Affairs, the following datasets were used: all MedSAS® datasets, Master Vital Status dataset, Mini Vital Status dataset, and Primary Oncology dataset from the VA Cancer Registry [24, 25]. The VA Master Vital Status and VA Mini Vital Status datasets provided patient-level information (i.e., each unique VA patient is listed only once in these files), while the VA MedSAS® datasets provided visit-level information (i.e., with ≥ 1 record/subject) [24, 25]. From the United States Census Bureau, the following datasets were used: United States 2010 Decennial Census, the 2007–2011 Five-Year American Community Survey, and the 2015 ZCTA shapefile for the US [26,27,28]. All geography-related information was either linked to the shapefile (i.e., ZIP code patient-level information linked to ZCTAs in the shapefile), directly obtained at the ZCTA level (i.e., ecology-level information), or directly obtained from the 2015 ZCTA shapefile from the US Census Bureau (i.e., individual ZCTA boundaries) [28].

All data elements were screened to exclude females. Subjects were excluded based on the following criteria (see Fig. 1 for the STROBE diagram establishing the analytic cohort): data in the Primary Oncology Dataset from the VA Cancer Registry could not be linked to MedSAS® datasets [24], a non-PrCA diagnosis in the Primary Oncology Dataset from the VA Cancer Registry, PrCA diagnosis in the Primary Oncology Dataset from the VA Cancer Registry prior to 1 January 1999, date of birth differs by more than 365 days between VA MedSAS® datasets [24] and VA Vital Status datasets [25], age during the timeframe did not wholly or partially fall between 40 and 70 years old, missing ZIP code information, racial classification other than White or Black, and not residing in a South Carolina ZIP code.

Fig. 1 STROBE diagram for the generation of the analytical cohort comparing prostate cancer patients to non-cancer patients among the South Carolina veteran population who seek healthcare at Veteran Health Administration facilities between 1 January 1999 and 31 December 2015 and meet the study inclusion criteria

Overall, 964,047 unique subjects from a total of 1,159,188 (83.17%) unique cancer diagnoses were included in the Primary Oncology Dataset. Of these subjects, 80.47% could be linked to unique patients in the MedSAS® datasets [24]. There were 235,782 unique PrCA cases in the Primary Oncology Dataset that could be linked to the VA MedSAS® Files [24]. There were 230,401 (97.72%) PrCA cases that were excluded due to not residing in a South Carolina ZIP code. Of the 5,384 PrCA cases residing in a South Carolina ZIP code, 1,648 (30.61%) were ineligible due to diagnosis prior to 1 January 1999, date of birth discrepancy, age at first VA visit not satisfied, missing ZIP code, and/or not being White or Black. There were 9,976,241 unique subjects in the VA MedSAS® datasets that were not linked to any subjects in the VA Cancer Registry [24]. These subjects were classified as not having cancer. There were 9,798,934 (98.22%) non-cancer cases that were excluded due to not residing in a South Carolina ZIP code. Of the 177,306 non-cancer cases residing in a South Carolina ZIP code, 73,289 (41.33%) were ineligible due to date of birth discrepancy, age at first VA visit not satisfied, missing ZIP code, and/or not being White or Black.

The final South Carolina analytical level cohort consisted of 3,736 subjects with PrCA and 104,017 non-cancer subjects (Fig. 1). South Carolina veterans receiving care at VHA facilities resided in 413 of South Carolina’s 424 ZIP codes (97.41%). There were 67 (16.2%) ZIP codes without PrCA cases among the 413 South Carolina ZIP codes with veterans receiving care at VHA facilities in South Carolina.

Primary outcome variables

Primary PrCA patients between 1 January 1999 and 31 December 2015 were identified from the VA Primary Oncology Dataset [24, 25]. Non-cancer controls were identified by removing all cancer patients in the VA Primary Oncology Dataset from the VA MedSAS® Files [24, 25]. Non-cancer controls were then limited to those patients whose first VA visit occurred between 1 January 1999 and 31 December 2015.

Individual-level variables

The date of the first VA visit for each subject was obtained from the MedSAS® datasets and the date of birth (DOB) was obtained from the VA Mini Vital Status dataset [24, 25]. Four age strata variables were created: age < 40, age from 40 to 50, age from > 50 to 60, and age from > 60 to 70 years. The race and ethnicity information in the VA Master Vital Status file was classified into White, Black, Hispanic, Asian, Hawaiian/Pacific Islander, Native American/Alaskan, Other, and Unknown [25]. Subjects in racial categories other than White or Black were excluded because they accounted for 0.68% of the PrCA and non-cancer patients. The marital status field at the time of diagnosis in the Primary Oncology Dataset was used for PrCA cases, while the marital status field in the MedSAS® Outpatient dataset was used for controls [24]. Marital status was categorized as married, previously married (divorced, widowed, and separated), and never married/unknown. The unique VA facility numerical codes in the MedSAS® datasets were used to identify which facility a subject visited [24]. The number of different VA facilities visited, the most frequented VA facility, and the number of times each subject visited his most frequented VA facility were determined.

ZIP code-related information

The ZIP code listed most frequently across each subject’s visits in the MedSAS® datasets was used for all ZIP code-related information for that subject [24]. For PrCA cases without ZIP code information in the MedSAS® datasets (n = 6,784), the ZIP code at diagnosis from the Primary Oncology dataset was used instead [24].

There were 131 unique VA facilities (i.e., VA hospitals and VA Community-Based Outpatient Clinics) identified from the VA MedSAS® datasets [24]. The distance between the most frequented VA facility and the most frequently listed ZCTA-linked ZIP code was represented as a continuous variable and derived using the MedSAS® datasets [24]. This ZIP code information was linked to the 2015 ZCTA shapefile from the US Census Bureau [28]. Using the shapefile, a contiguity straight-line origin-to-destination distance matrix between VA facilities and ZCTA centroids was used to determine the distance traveled by a veteran to his most frequented VA facility. This distance was then categorized into 0–25 miles, > 25–55 miles, and > 55 miles [24, 28].
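The distance calculation and categorization described above can be sketched as follows. This is a minimal illustration only: it assumes a great-circle (haversine) distance between latitude/longitude points, whereas the text specifies only a straight-line distance derived from the shapefile. The function names and the Earth-radius constant are hypothetical choices, not taken from the study.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # assumed mean Earth radius, not from the study


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


def distance_category(miles):
    """Assign the three distance strata used in the text."""
    if miles <= 25:
        return "0-25 miles"
    if miles <= 55:
        return ">25-55 miles"
    return ">55 miles"
```

In use, the first function would be applied to a ZCTA centroid and a facility location, and the result passed to `distance_category` to obtain the stratum.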

Area-level variables

The Social Vulnerability Index (SoVI®): The SoVI®, initially developed by the Hazards and Vulnerability Research Institute (HVRI) at the University of South Carolina, is a composite measure of neighborhood-level factors obtained from publicly available population-based datasets of factors associated with the health of individuals within those neighborhoods [29, 30].

The SoVI® uses 10 ZCTA measurements obtained from the 2010 US Census and 17 ZCTA measurements obtained from 2007 to 2011 5-Year American Community Survey [26, 27]. Selection of measurements was based on community-level factors that have shown broad associations across multiple health-related outcomes. The development of the SoVI® scores proceeded in accordance with previously generated SoVI® scores for research purposes within the HVRI [29, 30].

The SoVI® is a relative measure represented on a continuous scale from negative infinity to positive infinity. A greater SoVI® score indicates that a geographical unit is less able to prepare for, respond to, and recover from a disaster than a geographical unit with a lower SoVI® score. A SoVI® score was assigned for each subject’s most frequently listed ZIP code. The SoVI® measurements were stratified into three categories using cutoffs of 1 standard deviation: low (− ∞, − 2.04], medium (− 2.04, 2.79], and high (2.79, ∞).

Other ZIP code-level factors include the following: The 2007–2011 Five-Year American Community Survey was used to determine the percent of people within each ZCTA with at least a college degree with tertile cutoffs of [0%, 12.86%], [12.86%, 22.60%], [22.60%, 100%], percent of people within each ZCTA living in poverty with tertile cutoffs of [0%, 7.96%], [7.96%, 15.86%], [15.86%, 100%], percent of Black Americans within each ZCTA with tertile cutoffs of: [0%, 0%], [0%, 2.73%], [2.73%, 100%], and percent of people within each ZCTA that were at least 65 years of age with tertile cutoffs of: [0%, 11.95%], [11.95%, 16.92%], [16.92%, 100%] [27].

Statistical analyses

Descriptive statistics were calculated for the analytic cohort (i.e., t-test for all continuous independent variables and chi-square tests for all categorical independent variables). Frequentist multivariate general linear models using SAS® were first developed to inform key steps in the development and evaluation of the final models [31]. These models were developed for all individual-level risk factors and for nested models incorporating both individual- and ecological-level risk factors with patient ZCTA-linked ZIP code residence. Nested models with both fixed effects and random effects were created for model selection purposes. Bayesian models were developed for the null (empty) models; the individual-level models; the nested models, which incorporated both individual- and area-level risk factors and accounted for patient ZCTA-linked ZIP code residence as a random effect; and the multivariate conditional autoregressive models, which incorporated both individual- and area-level risk factors, accounted for patient ZCTA-linked ZIP code residence as a random effect, and accounted for the spatial autocorrelation between ZCTA-linked ZIP codes (i.e., the extent of clustering between neighboring ZCTA-linked ZIP codes in the dataset).

All categorical variables were evaluated using dummy variables. Any independent individual-level variables with undefined 95% confidence intervals or extremely large point estimates with extremely wide 95% confidence intervals were excluded. Independent individual-level variables not excluded were used in a manual backward stepwise approach to generate the final multivariate individual-level model, in which a variable was retained if it changed the risk factor estimates of the other variables by at least 10%.

A manual backward stepwise approach was used to determine the final nested-level model. Each ecological-level independent risk factor was evaluated separately with the final multivariate individual-level model. The model with the lowest Akaike information criterion (AIC) was selected as the initial nested-level model [32]. Subsequently, the ecological factor with the next-lowest AIC was added to the initial nested model. If any risk factor estimates changed by more than 10% then the additional ecological factor was retained; this was repeated for all ecological-level factors.
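The 10% change-in-estimate rule used in these selection steps can be sketched as below. This is a minimal illustration under stated assumptions: the function names are hypothetical, the comparison is made directly on whatever estimates are supplied (the text does not say whether odds ratios or coefficients were compared), and the threshold is treated as inclusive.

```python
def max_relative_change(before, after):
    """Largest relative change among estimates present in both models.

    `before` and `after` map variable names to non-zero estimates from the
    model without and with the candidate factor, respectively.
    """
    return max(
        abs(after[name] - before[name]) / abs(before[name])
        for name in before
        if name in after
    )


def retain_candidate(before, after, threshold=0.10):
    """Keep the candidate factor if any shared estimate moves by >= threshold."""
    return max_relative_change(before, after) >= threshold
```

For example, with hypothetical estimates `{"black": 3.00}` before and `{"black": 2.60, "sovi_high": 1.40}` after adding a candidate factor, the shared estimate changes by about 13% and the candidate would be retained.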

The median odds ratio (MOR) was also calculated for each model. The MOR is derived from the area-level variance in the model and can be interpreted as the median odds ratio between the higher-risk and the lower-risk ZCTA-linked ZIP code for two individuals with the same evaluated risk factors who reside in two randomly chosen areas (i.e., the MOR is the risk estimate for the unexplained area-level variation in the model) [33]. Therefore, the model with the best fit will also have the lowest MOR because the risk factors in the model account for more of the variance in the outcome.
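As a sketch of how the MOR relates to the area-level variance, the snippet below uses the commonly used closed form MOR = exp(sqrt(2σ²) × Φ⁻¹(0.75)), where σ² is the area-level random-effect variance on the log-odds scale and Φ⁻¹ is the standard normal quantile function. That the study used exactly this formula is an assumption, and the function name is hypothetical.

```python
import math
from statistics import NormalDist


def median_odds_ratio(area_variance):
    """MOR from the area-level random-effect variance (log-odds scale)."""
    z75 = NormalDist().inv_cdf(0.75)  # 75th percentile of N(0, 1), about 0.6745
    return math.exp(math.sqrt(2.0 * area_variance) * z75)
```

A variance of zero gives an MOR of 1 (no unexplained area-level variation), and the MOR rises monotonically with the variance, which is why a smaller MOR is read as less unexplained between-area heterogeneity.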

A Global Moran’s I test was conducted on the ZCTA-linked ZIP code-level residuals of the final models to determine the existence of spatial autocorrelation [21]. If the Global Moran’s I test was statistically significant, indicating the presence of spatial autocorrelation, then models accounting for spatial autocorrelation would be constructed using the risk factors included in the final nested-level models. If the Global Moran’s I test was not statistically significant, indicating no evidence of spatial autocorrelation in the data, then the nested-level model would become the final overall model for that comparison.
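The statistic underlying this test can be sketched as follows. This is a minimal illustration that computes only the Global Moran's I value, not its significance (which would require a permutation or normal-approximation step). The binary adjacency weights and the four-area example are assumptions for illustration; the study's weighting scheme is not stated.

```python
def global_morans_i(values, weights):
    """Global Moran's I for area-level values and a square spatial weights matrix."""
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    s0 = sum(sum(row) for row in weights)  # sum of all weights
    cross = sum(
        weights[i][j] * dev[i] * dev[j] for i in range(n) for j in range(n)
    )
    denom = sum(d * d for d in dev)
    return (n / s0) * (cross / denom)


# Hypothetical example: four areas in a row, each adjacent to its neighbors.
line_weights = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]
smooth = global_morans_i([1, 2, 3, 4], line_weights)         # similar neighbors
alternating = global_morans_i([1, -1, 1, -1], line_weights)  # dissimilar neighbors
```

Positive values indicate that neighboring areas have similar residuals (clustering), while negative values indicate that neighbors tend to differ.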

Based on the frequentist models, the final Bayesian general linear models were developed. All Bayesian models used uninformative priors. The initial statistical inference from each Bayesian model was based on 15,000 iterations (i.e., samples) after a 5,000-iteration burn-in period (Supplemental Fig. 1). The convergence of the sample chains was evaluated using the Geweke diagnostic, which compares the first 10% of kept iterations to the last 50% of kept iterations [34]. The mean value was used as the point estimate for each individual- and area-level risk factor as well as for the MOR estimates. The 2.5th and 97.5th percentiles of the kept iterations were used as the bounds of the 95% credible interval (CrI) for each risk factor.
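The convergence check and posterior summaries described above can be sketched as below. This is a simplified illustration: the Geweke diagnostic proper estimates each segment's variance from the spectral density at frequency zero to allow for autocorrelation, whereas this sketch substitutes the plain sample variance. The function names and the nearest-rank percentile rule are also assumptions.

```python
import math
from statistics import mean, variance


def geweke_z(chain, first=0.10, last=0.50):
    """Simplified Geweke z-score comparing early and late segments of a chain."""
    a = chain[: int(first * len(chain))]
    b = chain[len(chain) - int(last * len(chain)):]
    # Assumption: naive sample variances instead of spectral density estimates.
    se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / se


def posterior_summary(chain, lower=0.025, upper=0.975):
    """Posterior mean with a percentile-based 95% credible interval."""
    ordered = sorted(chain)
    n = len(ordered)
    low = ordered[int(lower * (n - 1))]
    high = ordered[int(upper * (n - 1))]
    return mean(chain), low, high
```

A z-score near zero suggests that the early and late parts of the chain share the same mean, while a large absolute value points to a chain that is still drifting.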

The Bayesian models were compared to each other using the Deviance Information Criterion (DIC) [35]. The model with the lowest DIC was selected as the best-fitted model for that comparison in that analytical cohort.
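For reference, the DIC combines the posterior mean deviance with an effective number of parameters: DIC = D̄ + pD, where pD = D̄ − D(θ̄) and D(θ̄) is the deviance evaluated at the posterior means of the parameters. A minimal sketch follows; the function name and the deviance values in the usage note are hypothetical, and in practice these quantities are reported by the modeling software.

```python
def deviance_information_criterion(deviance_samples, deviance_at_posterior_mean):
    """Return (DIC, pD) from sampled deviances and the deviance at the posterior mean."""
    d_bar = sum(deviance_samples) / len(deviance_samples)  # posterior mean deviance
    p_d = d_bar - deviance_at_posterior_mean               # effective parameters
    return d_bar + p_d, p_d
```

With hypothetical deviance samples of 100, 102, and 104 and a deviance of 99 at the posterior mean, this gives pD = 3 and DIC = 105. A model that adds complexity lowers the DIC only if the gain in fit outweighs the increase in pD.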

Logistical aspects

All data storage, management, and analyses were conducted with the VA Informatics and Computing Infrastructure (VINCI) servers. All data management was done using Microsoft SQL Server. All univariate and frequentist modeling analyses were conducted using SAS® version 9.4 software [31]. All Bayesian modeling was conducted using WinBUGS® version 1.4.3 and R version 3.3.2 [35, 36]. All evaluation of Bayesian modeling was conducted using R [36]. All statistical tests used an α-level of 0.05. This project had Institutional Review Board Approval from the University of South Carolina (Pro00036431) and the WJB Dorn VA (10,404).

Results

There were 3,736 PrCA cases in 377 ZCTAs with a median number of PrCA cases of eight and a range of 0 to 153 across those ZCTAs. These PrCA cases were distributed by year, as shown in Supplementary Table 1. There were 104,017 non-cancer cases in 413 ZIP codes with a median of 130 and a range of 1 to 2,044 across those ZIP codes. PrCA patients accounted for 3.5% of the final analytical cohort. The plurality of patients were between > 60 and 70 years of age (37.7%). Most patients were White (65.5%), married (62.3%), and traveled more than 55 miles to a VA facility for care (52.0%) (Table 1). The median distance traveled to a VA facility was 60.0 miles.

Table 1 Descriptive statistics of South Carolina veteran population who seek healthcare at Veteran Health Administration facilities between 1 January 1999 and 31 December 2015 and meet the study inclusion criteria

Univariate statistics

PrCA patients were more likely than non-PrCA patients to be older at their first VA visit (age from > 50 to 60 years: 45.1% vs. 34.0%; age from > 60 to 70 years: 41.0% vs. 37.6%). PrCA patients were also more likely to be Black (54.5% vs. 33.6%), less likely to have been married at diagnosis (58.4% vs. 62.5%), more likely to have been previously married (32.7% vs. 22.4%), less likely to never have been married or have an unknown marital status (9.0% vs. 15.1%), and more likely to travel between 25 and 55 miles to a VA facility to receive care than their non-cancer counterparts (Table 2).

Table 2 Univariate χ2 analysis comparing all prostate cancer patients to non-prostate cancer patients for all categorical variables among the South Carolina veteran population who seek healthcare at Veteran Health Administration facilities between 1 January 1999 and 31 December 2015 and meet the study inclusion criteria

Multivariable Bayesian modeling

Seven models of increasing complexity were fit using a Bayesian framework (Table 3). All independent variables were significant for the first model (Model 1, Table 3). The second model was the nested null model, which accounted for the subject’s ZCTA-linked ZIP code. This model had a MOR of 1.37 (1.30, 1.44) as well as the highest DIC of all seven models. The third model accounted for the individual-level factors as well as nesting the subjects in their ZCTA-linked ZIP codes (Model 3, Table 3). The fourth model added area-level factors, namely the percent of the population living in poverty and the categorized SoVI® scores (Model 4, Table 3). While Model 3 and Model 4 had identical DICs, Model 4 had a smaller MOR, indicating a better-fitted model.

Table 3 Odds ratios and 95% credible intervals for multivariate models comparing prostate cancer patients to non-cancer patients among the South Carolina veteran population who seek healthcare at Veteran Health Administration facilities between 1 January 1999 and 31 December 2015 and meet the study inclusion criteria

The fifth model was the spatial null model with a spatial random effect and autocorrelation term. Model 5 fit the data better than Model 2, with both a smaller DIC and MOR. The sixth model was the spatial model with only the individual-level factors (Model 6, Table 3). The seventh model was the spatial model with the individual-level factors as well as the ZCTA poverty level and the categorized SoVI® scores for each model (Model 7, Table 3).

Model 7, the most complex model, had the smallest DIC and MOR [1.14 (1.11, 1.19)] of all the models, indicating that it is the model with the best fit. PrCA patients were more likely to be older when they had their first VA visit as compared to non-cancer patients, with the risk increasing with increasing age: 40–50 years old [ORadj 5.63 (3.94, 8.39)], > 50–60 years old [ORadj 13.11 (9.29, 19.53)], and > 60–70 years old [ORadj 13.31 (9.44, 19.81)]. PrCA patients were more likely to be Black [ORadj 2.98 (2.77, 3.20)]. PrCA patients were more likely to have been married previously as compared to being currently married [ORadj 1.47 (1.37, 1.58)]. However, those who had never been married or whose marital status was unknown were at lowest risk relative to those who were married [ORadj 0.55 (0.49, 0.62)].

A distance between 25 and 55 miles from the patient’s ZCTA-linked ZIP code centroid to the most frequented VA facility was not statistically significant in Model 7 [ORadj 1.06 (0.92, 1.21)]. This stratum was also not statistically significant in Model 4, whereas it was statistically significant in Models 1, 3, and 6. In Model 7, PrCA patients were less likely than non-cancer controls to live more than 55 miles from their most frequented VA facility, as measured from the ZCTA-linked ZIP code centroid [ORadj 0.86 (0.75, 0.98)]. This stratum was also statistically significant in Model 4, whereas it was not statistically significant in Models 1, 3, and 6.

PrCA patients were more likely to live in ZCTA-linked ZIP codes with a SoVI® class 1–2 standard deviations from the ZCTA-linked ZIP codes with the lowest Social Vulnerability Index [ORadj 1.35 (1.11, 1.65)] and 3 standard deviations from the ZCTA-linked ZIP codes with the lowest SoVI® score [ORadj 1.39 (1.11, 1.75)]. These estimates remained consistent and statistically significant for all models that accounted for them.

Residing in ZCTAs with the highest poverty level (≥ 15.86%) increased the risk of PrCA by 24% in Model 4 [ORadj 1.24 (1.01, 1.48)]. However, this ZCTA poverty level was statistically insignificant after accounting for the spatial autocorrelation between ZCTAs in Model 7 [ORadj 1.13 (0.93, 1.35)].

Discussion

Our study identified an association between a veteran’s ZCTA-linked ZIP code of residence and PrCA risk. This study used a unique approach to account for individual-level risk factors, area-level risk factors, spatial random effects, and autocorrelation within one modeling framework.

The distance traveled to receive care evaluates the geographic accessibility among the veteran population and demonstrates a unique dynamic in attempting to quantify the impact of travel experienced by veterans given a travel stipend [37, 38]. When distance is the only measure describing the location of a veteran (i.e., Model 1), it is statistically significant for those veterans traveling between 25 and ≤ 55 miles (i.e., a VA healthcare facility within an approximate 30-min to 1-h commute). That stratum remains statistically significant (i.e., Models 3 and 6), until accounting for ZCTA poverty level and SoVI® (i.e., Models 4 and 7).

Interestingly, for veterans traveling > 55 miles (i.e., a VA healthcare facility more than an approximate 1-h commute), distance became a statistically significant protective effect when ZCTA poverty level and SoVI® were accounted for (i.e., Models 4 and 7). This protective effect may be an indication of those veterans living in rural areas [39]. The primary VA healthcare facilities in South Carolina are located in urban areas, which are limited in South Carolina. The rural/urban distinction of a ZIP code proceeds towards rural as the distance between VA facilities in South Carolina and a veteran increases. Obertova et al. concluded in their systematic review article on the urban–rural disparity in PrCA that those living in rural communities could be less likely to seek preventive/wellness care visits and, in turn, less likely to be screened for PrCA and thereby less likely to be diagnosed with PrCA, resulting in fewer PrCA cases being associated with that area [39]. The potential influence of a rural–urban disparity warrants the evaluation of measurements characterizing such a disparity, such as the rural–urban commuting area codes (RUCAs), in future studies based on PrCA incidence [40].

A potential dynamic that will be explored in future research is the type of VA facility at which a veteran receives his care: a VA hospital or a VA Community-Based Outpatient Clinic (CBOC). CBOCs serve an important function in providing care to veterans living in less populated regions; however, CBOCs are limited in the types of care they can provide. This analysis did not differentiate between VA hospitals and CBOCs. However, based on the plausible inference from these results, determining whether differences exist between veterans who receive care at a VA hospital and those who receive care at a VA CBOC may be beneficial in improving how health care is delivered to this population.

Our results demonstrated that more than one area-level risk factor was associated with higher rates of incident PrCA in the VA population. The models confirmed that area poverty was associated with higher incidence of PrCA [41, 42]. However, of key interest is our finding that models accounting for the spatial autocorrelation between ZCTA-linked ZIP codes diminished the impact of area poverty, while the SoVI® estimates remained similar. While a poverty level of > 15.9% went from being statistically significant in Model 4 to statistically non-significant in Model 7, risk estimates for the SoVI® remained statistically significant and stable across those models. A credible explanation is that poverty levels for a community cannot be constrained by political/administrative boundaries such as ZIP codes; therefore, when spatial autocorrelation is accounted for, as in Model 7, the model inherently accounts for similarities such as poverty level. However, the SoVI® is a composite measure that includes multiple community-level measures that cannot be fully accounted for by spatial autocorrelation alone. This opens the possibility of further exploration of how the measures that comprise the SoVI® individually and collectively impact the outcome of interest.

While addressing societal issues of poverty and community social vulnerability is outside the scope of VA, these results indicate that veterans living within an approximate 30-min to 1-h commute to a VA healthcare facility may be at increased risk for being diagnosed with an incident case of PrCA, especially if they reside in a ZIP code with a high percentage of the population in poverty and an increased SoVI® class. These results may indicate (1) a false sense of health assurance among veterans who know that they live within 1 h of a VA facility but do not choose to utilize the VA facility for preventive, wellness, or routine care; (2) that a veteran who requires increased healthcare services chooses to live closer to a VA facility; or (3) the urban/rural dynamic within the veteran population seeking care at VHA facilities [39].

We also showed that established PrCA risk factors, such as marital status and being Black, were statistically significant factors associated with PrCA [1, 43,44,45]. In addition, their respective point estimates and 95% credible intervals were nearly identical for all models.

Findings on age at first VA visit can provide insight into understanding age of PrCA diagnosis for veterans. Veterans diagnosed with incident PrCA were more likely to begin their VA healthcare later in life than those veterans without incident PrCA or who did not develop incident PrCA during the 16-year study timeframe. Furthermore, the risk of incident PrCA more than doubled from the 40–50 to the > 50–60-year-old strata, after which there were only slight differences in incident PrCA risk between the > 50–60 and > 60–70-year-old strata. This effect of age at first VA visit may also reflect the recent influx of new veterans who served in Operations Enduring Freedom and Iraqi Freedom. This influx results in a larger population of younger veterans in the VA system who are still at risk for developing chronic diseases such as PrCA but who may introduce bias when used as part of a larger comparison group to evaluate those chronic diseases at this point in time.

Strengths and limitations

This study used electronic medical records for patient-level variables. As such, issues that are inherent in using such information are present [46]. The data files used have been extensively utilized and queried in other VA-based peer-reviewed research and are of high reliability and validity [47,48,49,50]. Furthermore, biases that may be present between PrCA patients and non-cancer patients will be non-directional because data management and cleaning did not differentiate between those patients. Of course, use of pre-existing datasets precludes examining data outside of its intended scope. For example, we had no information on putative PrCA risk factors including diet, physical activity, cardiorespiratory fitness, and a host of other psychosocial factors.

The use of ZCTAs as a proxy for ZIP codes as the geographical unit of analysis has been demonstrated to have boundary representation issues, especially for non-populated ZCTAs, as compared to ZIP codes [51]. However, limiting the population under investigation to South Carolina, which is a small, stably populated state as compared to the rest of the nation, should mitigate the boundary issue. Therefore, the only areas impacted by boundary issues are those ZCTAs bordering the states of Georgia and North Carolina, a limitation present in all spatial analyses where an adjacent area is not included in the analysis.

Another consequence of using VA electronic medical records is that patient addresses could not be obtained, and therefore the centroid of the ZCTA in which each patient resided was used to determine the straight-line distance traveled by patients to their most frequented VA facility. This issue could not be avoided given the patient confidentiality protections [52]. However, any resulting bias will be non-directional given the relatively small geographical area of each ZIP code, which would result in minor differences between the centroid distance to the VA facility and the actual distance. Furthermore, there would be equal probability that any random individual veteran lives closer to or farther from his most frequented VA facility than the distance from the centroid. Therefore, the analytical cohort would have a narrow, normally distributed curve in the difference between their actual straight-line distance to their most frequented VA facility and the centroid distance.

South Carolina veterans who were not excluded and yet received healthcare services at a VA facility not in South Carolina were still included in the analyses. Receiving care at a South Carolina VA facility was not an exclusion criterion for that reason. The closest VA healthcare facility may be located across the state line for many border areas of South Carolina. This occurrence does not impact the analyses in determining if residence-specific characteristics are associated with PrCA.

More importantly, the ZCTA was selected as the geographical unit of analysis because it was the most appropriate choice to assess the spatial impact of PrCA epidemiology, for two reasons. The first is that it was the smaller of the two geographical units available (the other being county). Although census tracts are more commonly used in spatial models evaluating areas smaller than counties, such geographical units were not available in the VA records. The second reason is a consequence of the first. The more homogeneous the population within each geographical unit, the more likely the models are to differentiate between ZCTAs and therefore to identify ZCTAs with higher- or lower-than-expected risks for PrCA. As the area of the geographical unit increases, the population becomes more heterogeneous in characteristics that could not be assessed. This increases the likelihood of introducing an unaccounted-for directional bias.

No South Carolina ZCTA maps identifying crude counts, crude rates, or ZCTA-specific risk estimates were presented because ten of the 413 ZIP codes (2.4%) had, at most, five VA patients. Therefore, any presentation identifying location, even at the aggregate level, would violate the Health Insurance Portability and Accountability Act of 1996 [53]. The research approach undertaken should not be discounted because of the limitation on what can be presented in the public domain. This research has the most benefit if used as an internal resource for identifying ZIP codes where the risk for a specific disease is above or below what is expected after controlling for known risk factors. In this regard, additional resources may be better directed toward understanding what is driving a higher-than-expected disease risk in a specific area. This, in turn, could lead public health researchers to target interventions that have the potential to benefit a specific community in need and improve its health outcomes. Furthermore, clinicians can use this research to factor in a patient’s residence as a potential risk factor for specific diseases.
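The small-cell concern above can be sketched as a simple suppression check. This is an illustrative sketch only, not the procedure used in this study: the function name, the example counts, and the choice to mask any unit with five or fewer patients are assumptions based on the threshold described in the text.

```python
def suppress_small_cells(counts, threshold=5):
    """Mask counts at or below the threshold.

    Returns (masked, flagged): a copy of `counts` with small cells
    replaced by None, and a sorted list of the flagged unit IDs.
    """
    flagged = sorted(unit for unit, n in counts.items() if n <= threshold)
    masked = {unit: (None if n <= threshold else n)
              for unit, n in counts.items()}
    return masked, flagged


# Hypothetical patient counts per geographical unit (not study data).
example_counts = {"A": 3, "B": 120, "C": 5, "D": 42}
masked, flagged = suppress_small_cells(example_counts)

# Share of small-cell units reported in the text: ten of 413 ZIP codes.
share_small = 10 / 413 * 100
print(flagged, f"{share_small:.1f}%")
```

Masking counts alone may not be sufficient for a map, because a unit's location remains visible even when its value is hidden, which is consistent with the decision not to present maps.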

The use of ZCTA-level measurements from one timeframe is a limitation. However, the measurements from the 5-Year American Community Survey that were used individually and as components of the SoVI®, as well as the SoVI® itself, have been shown to remain stable over several consecutive years [54]. Moreover, this analytical approach demonstrated that individual-level risk factors, area-level risk factors, and spatial autocorrelation can be modeled successfully within one modeling framework. This approach opens the possibility of expanding this research to incorporate a temporal component into the modeling framework.

Additionally, “age at PrCA diagnosis” could not be used in the multivariate general linear models because no such value exists for the non-cancer comparison group. For this reason, “age at first VA visit” was created and used as a novel approach that allowed age to be assessed in both the PrCA population and the non-cancer population for multivariate general linear modeling purposes. While this measures the age at which a veteran began accessing VHA care, any bias between the PrCA group and the non-cancer comparison group would most likely be non-directional; i.e., not biased toward the null. Chronic diseases other than PrCA (e.g., cardiovascular disease, type 2 diabetes mellitus) also are likely to occur in middle to late adulthood. Developing such chronic diseases is equally probable in the PrCA group and the non-cancer comparison group [55, 56]. It also is important to note that such non-cancer chronic diseases were not exclusion factors in this study.

Finally, in addition to the known risk factors for PrCA (race and marital status), we identified novel factors, such as area of residence and waiting until later in life to begin VA healthcare services, as risk factors for incident PrCA. We demonstrated that location-specific characteristics as risk factors for PrCA can be evaluated and accounted for at three levels: the individual level, the community level, and the spatial autocorrelation among the ZCTA-linked ZIP codes. Expanding on the known risk factors for PrCA will allow clinicians to better assess a man’s likelihood of developing PrCA, thereby improving the preventive care he receives.