Introduction

Colorectal cancer (CRC) is the third most common cancer in men and the second in women worldwide, with an estimated lifetime risk among western populations of 5–6 %. According to data from the American Cancer Society (ACS), approximately 132,700 new CRC cases are expected to be diagnosed in the USA in 2015, and approximately 49,700 people are expected to die of the disease [1–3]. Genetic syndromes, such as familial adenomatous polyposis (FAP), hereditary non-polyposis colorectal cancer (HNPCC), or MUTYH-associated polyposis (MAP), are estimated to account for 5 % of CRC cases. An additional 15–20 % of patients have a family history of the disease that might suggest a hereditary contribution. However, the majority of patients (about 75 %) have sporadic disease, with no family history of CRC [4, 5]. In this population, major risk factors that may influence the development of the disease include age, male gender, obesity, a diet high in fat and low in fiber, sedentary lifestyle, cigarette smoking, and alcohol consumption, as well as a medical history of inflammatory bowel disease (IBD), diabetes mellitus, and insulin resistance [6–8]. The use of aspirin and nonsteroidal anti-inflammatory drugs (NSAIDs) and of hormone replacement therapy (HRT) has been shown to decrease the risk [9].

Despite the clear recommendations in the medical literature, and availability of screening tests proven to reduce the incidence and mortality from CRC [10, 11], compliance among at-risk populations remains low. According to data of the ACS from 2013, only 59 % of adults 50 years of age and older underwent any CRC screening test within the recommended time intervals [12, 13].

There is an ongoing effort to enhance the risk stratification of individuals, increase compliance with initial screening, and streamline the use of surveillance colonoscopy in order to reduce the number of unnecessary tests. Risk scores are commonly used in medicine to quantify a person’s risk of developing a disease. Knowledge of an individual’s risk of CRC could be used to develop risk-tailored strategies to improve the efficiency of screening. While such scores exist for breast and prostate cancers [14, 15], limited data are available for CRC [16–22]. Although prior studies found differences in disease incidence between subjects categorized as high risk and those categorized as low risk, they had low discrimination ability, with an area under the curve (AUC) of 0.6–0.69. Additional limitations of these studies included incomplete assessment of colon cancer risk factors [16–21]; restricted age range of subjects [18, 20, 21]; selection bias in studies that evaluated only subjects who were self-referred for screening [18–20]; and lack of controlling for previous CRC screening or family history of CRC [17]. The above-mentioned limitations affected not only model performance but also model generalizability and validity.

The aim of the current study was to develop and validate a risk prediction model for sporadic CRC that incorporates clinical and laboratory data using a large population-representative electronic medical records (EMR) database. If validated, the model could be used to generate patient risk data automatically; such information can then be easily available for physicians and linked with patient directed interventions.

Methods

Study Design

We conducted a nested case–control study with incidence density sampling using The Health Improvement Network (THIN), a large EMR database from the UK. Case–control studies with incidence density sampling of controls yield odds ratios (ORs) that are statistically unbiased estimates of the incidence-rate ratios (or hazard ratios) from a corresponding cohort study with proportional hazards analysis [23]. The study was approved by the Institutional Review Board at the University of Pennsylvania and by the Scientific Review Committee of THIN.

Data Source

The THIN database contains comprehensive medical records on approximately ten million patients (5.7 % of the UK population) treated by general practitioners in 570 practices, providing data on exposures and potential confounders important for CRC risk assessment. Registration date is defined as the date when patients were first registered with a practice in THIN, and Vision date is the date that a practice began using in-practice Vision software that collects information for the THIN database [24]. Each medical diagnosis is defined using Read diagnostic codes, the standard coding system used by general practices in the UK. All practices contributing data to THIN are instructed to follow a standardized protocol of entering information. Data quality is monitored through routine analysis of the entered data [25, 26]. The database was shown to be representative of the UK population with excellent quality of information [27]. Cancer rates in THIN, including colon cancer, were shown to be comparable to those reported in cancer registry data [28].

Study Cohort

All people receiving medical care from 1995 to 2013 from a THIN practitioner were eligible for inclusion. Subjects with a diagnosis of CRC syndromes, familial history of CRC, or IBD were excluded in order to focus on sporadic CRC. Patients without acceptable medical records were excluded (i.e., patients with incomplete documentation or out of sequence date of birth, registration date, date of death, or date of exit from the database). Follow-up started at the later of either the Vision date or 6 months after the date at which the patient registered with the general practitioner [29], and ended on the earliest of CRC diagnosis date, date of death, transferring out of the database, or the end date of the database.
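The follow-up rule above can be illustrated with a minimal Python sketch. The function name `follow_up_window`, its parameter names, and the approximation of 6 months as 183 days are assumptions made for illustration; the source does not state how months were counted or how the rule was implemented.

```python
from datetime import date, timedelta

# Assumption: "6 months" is approximated as 183 days.
SIX_MONTHS = timedelta(days=183)


def follow_up_window(vision_date, registration_date, database_end,
                     crc_date=None, death_date=None, transfer_out_date=None):
    """Return (start, end) of follow-up, or None if the window is empty.

    Start: the later of the Vision date and registration + 6 months.
    End: the earliest of CRC diagnosis, death, transfer out, and database end.
    """
    start = max(vision_date, registration_date + SIX_MONTHS)
    end_candidates = [d for d in (crc_date, death_date, transfer_out_date,
                                  database_end) if d is not None]
    end = min(end_candidates)
    if end <= start:
        return None  # no person-time is contributed
    return start, end
```

Under this sketch, a CRC diagnosis recorded before follow-up starts yields an empty window, which is in line with excluding diagnoses made within the first 6 months after registration.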

Case Selection

Cases were defined as all individuals in the cohort with at least one Read code for CRC during the follow-up period who were 50–85 years old at the time of diagnosis. We limited our study population to individuals in this age group since 90 % of people diagnosed with CRC are above the age of 50 [2] and because current guidelines recommend screening in adults beginning at the age of 50 and continuing until age 75, and on an individualized basis in adults between the ages of 76 and 85 [6]. Subjects who were diagnosed within the first 6 months after registration were excluded in order to avoid prevalent cases [29]. In addition, in order to predict the risk of early-stage disease, and since 20 % of CRC cases in the UK have distant spread at the time of diagnosis [30] and 75 % of them will die within 2.5 years [2], we excluded subjects whose date of death was within 2.5 years of the diagnosis date.

Selection of Controls

The eligible control pool for each case comprised all individuals without a diagnosis of CRC at the time of sampling and without a history of previous colectomy. Up to four eligible controls were randomly selected and matched with the case on practice site and start date of follow-up. Controls were assigned the same index date as their matched cases.
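The control selection described above can be sketched as risk-set sampling. The `Person` record, the function names, and the exact-equality match on start date of follow-up are illustrative assumptions; the source does not describe the matching tolerance or the software routine used. In line with incidence density sampling, a person who develops CRC after the index date remains eligible as a control.

```python
import random
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class Person:
    person_id: int
    practice: str
    start: date                      # start of follow-up
    end: date                        # end of follow-up
    crc_date: Optional[date] = None
    colectomy_date: Optional[date] = None


def eligible_controls(case, cohort):
    """People at risk on the case's index date, matched on practice and start date."""
    index = case.crc_date
    pool = []
    for p in cohort:
        if p.person_id == case.person_id:
            continue
        # Assumption: matching on start of follow-up means an identical date.
        if p.practice != case.practice or p.start != case.start:
            continue
        if not (p.start <= index <= p.end):
            continue  # not under follow-up on the index date
        if p.crc_date is not None and p.crc_date <= index:
            continue  # already diagnosed with CRC
        if p.colectomy_date is not None and p.colectomy_date <= index:
            continue  # previous colectomy
        pool.append(p)
    return pool


def sample_controls(case, cohort, rng, max_controls=4):
    """Randomly pick up to max_controls eligible controls for one case."""
    pool = eligible_controls(case, cohort)
    return rng.sample(pool, min(max_controls, len(pool)))
```

Passing a seeded `random.Random` instance as `rng` keeps the sampling reproducible.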

Exposures and Covariates

We examined a comprehensive list of potential and known CRC risk factors based on literature review (supplementary index 1). All covariates were measured prior to index date. The risk factors were divided into five categories: anthropometric and lifestyle parameters (such as obesity and smoking history), health care utilization (including previous screening for CRC), medical comorbidities (such as diabetes mellitus), medications (such as aspirin and nonsteroidal anti-inflammatories) as categorical variables of any use before index date, and laboratory results [such as complete blood count (CBC) and inflammatory markers] as continuous variables. For laboratory results, we used both the last value within the year before index date and the difference between the last two values before index date in order to evaluate intra-individual trends.
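The two laboratory-derived quantities described above (last value in the year before index date, and the difference between the last two values) could be derived along the following lines. This is a sketch only: the function name `lab_features`, the 365-day window, and the use of strictly earlier dates are assumptions not stated in the source.

```python
from datetime import date, timedelta


def lab_features(measurements, index_date, window_days=365):
    """Return (last_value, delta) for one laboratory test in one person.

    measurements: list of (date, value) pairs.
    last_value: most recent value before index date, if within the window.
    delta: last value minus the previous value before index date.
    Either element is None when it cannot be computed.
    """
    prior = sorted((d, v) for d, v in measurements if d < index_date)
    if not prior:
        return None, None
    last_date, last_value = prior[-1]
    in_window = (index_date - last_date) <= timedelta(days=window_days)
    last = last_value if in_window else None
    delta = prior[-1][1] - prior[-2][1] if len(prior) >= 2 else None
    return last, delta
```

A negative `delta` for hemoglobin, for example, would indicate a decline between the two most recent measurements.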

Statistical Analysis

The entire study cohort was randomly divided in a 2:1 ratio in order to generate test and validation sets. The association between each variable and CRC risk was evaluated using a univariate conditional logistic regression analysis to estimate ORs and 95 % confidence intervals (CIs). All variables associated with a p value <0.25 in the univariate analysis were further evaluated in the multivariate model [31]. Laboratory test-associated variables with >67 % missing data were excluded. We performed a complete case analysis with the remaining variables. Three models were created: a model based only on variables used in previous risk models (reference model); a model based only on laboratory results (laboratory-based model); and a model based on all parameters (combined model). For the multivariate logistic regression in each of the models, we used backward elimination for variable selection with p values of <0.001 and >0.05 as inclusion and exclusion criteria, respectively. Additionally, we repeated the multivariate models after testing continuous variables for linearity and correcting with second-degree fractional polynomials (FP) with powers −2, −1, −0.5, 0, 0.5, 1, 2, 3, to improve the fit of the models [32, 33].
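To make the fractional polynomial step concrete, the sketch below builds the two basis terms of a second-degree FP under the usual convention, in which power 0 denotes the natural logarithm and a repeated power multiplies the second term by ln(x). The function names are illustrative, and the sketch shows only the transformation, not the model-selection procedure the authors ran in Stata.

```python
import math
from itertools import combinations_with_replacement

# Candidate powers listed in the text.
FP_POWERS = (-2, -1, -0.5, 0, 0.5, 1, 2, 3)


def fp_term(x, p):
    """Single fractional polynomial transform; power 0 denotes ln(x)."""
    if x <= 0:
        raise ValueError("fractional polynomials require x > 0")
    return math.log(x) if p == 0 else x ** p


def fp2_terms(x, p1, p2):
    """Two basis terms of a second-degree FP for one value x."""
    first = fp_term(x, p1)
    # A repeated power uses x^p and x^p * ln(x) instead of two identical terms.
    second = first * math.log(x) if p1 == p2 else fp_term(x, p2)
    return first, second


def candidate_power_pairs():
    """All unordered pairs of powers, including repeated powers."""
    return list(combinations_with_replacement(FP_POWERS, 2))
```

With eight candidate powers there are 36 such pairs; in a full analysis each pair would be fitted and the best-fitting transformation kept for that covariate.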

The models were tested for collinearity [variance inflation factor (VIF) >10] and two-way interactions. The risk for each individual was given as deciles of probability with values from 0 to 1. The calibration of each of the models was evaluated using McFadden’s R2 goodness-of-fit test, with a high p value (>0.05) indicating adequate fit of the logistic function and R2 values >0.2 indicating extremely good model fit [34]. The discrimination ability of the models was calculated using the area under the receiver operating characteristic (ROC) curve. The net reclassification index (NRI) [35], comparing either the combined or the laboratory-based model to the reference model, was calculated using the formula:

$$ \mathrm{NRI} = \left[ \frac{\text{net increase of classification for cases}}{\text{total number of cases}} + \frac{\text{net decrease of classification for controls}}{\text{total number of controls}} \right] \times 100\,\% $$
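The formula above can be sketched in Python as follows. The function name `category_nri` and its argument names are illustrative: "up" and "down" stand for subjects moved to a higher or lower risk category by the new model relative to the reference model.

```python
def category_nri(cases_up, cases_down, n_cases,
                 controls_up, controls_down, n_controls):
    """Category-based net reclassification index, in percent.

    Cases: moving up is an improvement, so the net increase is up minus down.
    Controls: moving down is an improvement, so the net decrease is down minus up.
    """
    case_component = (cases_up - cases_down) / n_cases
    control_component = (controls_down - controls_up) / n_controls
    return {
        "cases": 100.0 * case_component,
        "controls": 100.0 * control_component,
        "nri": 100.0 * (case_component + control_component),
    }
```

The overall NRI is the sum of the two components, so a component can be negative while the total remains positive.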

The analysis was repeated in the validation set of the data. All calculations were done using STATA 13 (Stata Corp., College Station, TX, USA).
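For readers who want to reproduce the two summary measures outside Stata, the following is a minimal Python sketch. It assumes an unconditional logistic model whose null model is intercept-only, which may differ from the exact Stata routine used; the function names are illustrative, and the pairwise AUC loop is written for clarity rather than speed.

```python
import math


def auc(case_scores, control_scores):
    """Probability that a random case scores above a random control.

    Ties count as one half; equivalent to the area under the ROC curve.
    """
    wins = 0.0
    for c in case_scores:
        for k in control_scores:
            if c > k:
                wins += 1.0
            elif c == k:
                wins += 0.5
    return wins / (len(case_scores) * len(control_scores))


def log_likelihood(outcomes, probabilities):
    """Bernoulli log-likelihood for 0/1 outcomes and predicted probabilities."""
    return sum(math.log(p) if y == 1 else math.log(1.0 - p)
               for y, p in zip(outcomes, probabilities))


def mcfadden_r2(outcomes, probabilities):
    """McFadden's pseudo-R2: 1 - lnL(model) / lnL(null)."""
    prevalence = sum(outcomes) / len(outcomes)
    ll_model = log_likelihood(outcomes, probabilities)
    # Assumption: the null model predicts the observed prevalence for everyone.
    ll_null = log_likelihood(outcomes, [prevalence] * len(outcomes))
    return 1.0 - ll_model / ll_null
```

A model that predicts the overall prevalence for every subject gives a pseudo-R2 of 0, and an AUC of 0.5 corresponds to no discrimination.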

Results

Study Cohort and Variables

The study cohort included 22,351 CRC cases that were diagnosed between 1995 and 2013. We excluded 125 cases with a family history of CRC; 3194 cases that were diagnosed before the age of 50 or after the age of 85 years; and 5110 cases that died within 2.5 years of index date and probably had metastatic disease at diagnosis. Eventually, we were left with 13,879 cases (9299 in the test set and 4580 in the validation set) and 54,109 matched controls (36,199 in the test set and 17,910 in the validation set). Forty-three cases had no matched controls. Characteristics of cases and controls are presented in Table 1. Results for the univariate analysis are presented in Supplementary index 2.

Table 1 Characteristics of cases and controls

Reference Model

A reference model based only on variables that were used in previous studies (age, sex, height, obesity, ever smoking, alcohol dependence and previous screening colonoscopy) demonstrated an AUC of 0.58 (95 % CI 0.57–0.59) and low goodness of fit with McFadden’s R2 of 0.03 (Fig. 1). There was no change in results when age was corrected for linearity.

Fig. 1 Receiver operating curve for the different risk models

Laboratory-Based Model

For the model including only laboratory test results, the variables that remained after conducting the backward elimination are presented in Table 2. The AUC for the model based on these parameters was 0.77 (95 % CI 0.76–0.78) with a goodness of fit >0.05 (McFadden’s R2 of 0.23). We excluded creatinine and BUN from the model due to lack of biological plausibility. We repeated the model for hematocrit [PCV (%)], MCV (in fl), lymphocytes (in billion cells/l) and NLR, with complete data available for 16,240 (35.7 %) individuals [4929 (53.0 %) cases and 11,311 (31.2 %) controls]. This model had an AUC of 0.74 (95 % CI 0.73–0.75) with a McFadden’s R2 of 0.16. The AUC improved modestly after correction for linearity (0.76, 95 % CI 0.76–0.77) and a McFadden’s R2 of 0.21. The final equation for this model is presented in supplementary index 3.

Table 2 Variables selected for multivariate model based only on laboratory results

Combined Model

For the combined model including all five groups of variables (anthropometric and lifestyle, health care utilization, medical comorbidities, medications, and laboratory results), the variables that remained after conducting the backward elimination are presented in Table 3. The AUC for the model based on these parameters was 0.80 (95 % CI 0.79–0.82) with a McFadden’s R2 of 0.33. We excluded BUN and spironolactone prescriptions due to lack of biological plausibility and antidepressants due to possible confounding by indication. We excluded height due to missing values. Red blood cells and lymphocytes were excluded due to collinearity. From the resulting model, age, eosinophil count, and aspirin/NSAID, digoxin, and recurrent TMP/SMX prescriptions were excluded due to lack of statistical significance. We added to the model platelets and white blood cell count, two additional blood lineages that were significant in the univariate analysis and were suggested as predictors of cancer in previous studies [36–38], and metformin use, which was suggested to decrease CRC risk [39]. The final combined model included sex, hemoglobin (in g/dl), MCV, white blood cells (billion cells/l), platelets (billion/l), and NLR as well as previous prescriptions of metformin or other oral hypoglycemic medications. All laboratory results were available for 13,640 (30.0 %) individuals [4098 (44.1 %) cases and 9542 (26.4 %) controls]. The AUC of the model was 0.79 (95 % CI 0.78–0.80) with a McFadden’s R2 of 0.26. The model reached an AUC of 0.80 (95 % CI 0.79–0.81) and a McFadden’s R2 of 0.27 after correction for linearity (Fig. 1). The final equation for this model is presented in supplementary index 3.

Table 3 Variables selected for multivariate model based on all variables

Figure 2 and Table 4 present the percent of observed CRC cases in the test set by model’s probability deciles (of note, the data describe the percent within the case–control population that had 1:4 ratio between cases and controls).
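The decile summary described above can be sketched as follows: subjects are ranked by predicted probability, split into ten equal-sized groups, and the percent of observed cases is reported per group. The function name is illustrative, and the handling of ties and of group sizes that do not divide evenly is an assumption; the source does not describe these details.

```python
def observed_cases_by_decile(probabilities, outcomes):
    """Percent of observed cases per decile of predicted probability.

    Returns a list of ten values; index 0 is the lowest-risk decile.
    A decile with no subjects is reported as None.
    """
    # Assumption: ties in probability are ordered by outcome (tuple sort).
    ranked = sorted(zip(probabilities, outcomes))
    n = len(ranked)
    result = []
    for d in range(10):
        chunk = ranked[d * n // 10:(d + 1) * n // 10]
        if chunk:
            result.append(100.0 * sum(y for _, y in chunk) / len(chunk))
        else:
            result.append(None)
    return result
```

Because the percentages are computed within a case–control sample, they reflect the sampling ratio of cases to controls and are not population-level absolute risks.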

Fig. 2 Observed CRC cases in the test and validation sets by probability percentiles of the combined model

Table 4 Distribution of the combined model deciles in the test and validation sets

We further looked at a model that contains only sex and laboratory values and might be easier to use as an automatic application. This model had an AUC of 0.79 (95 % CI 0.78–0.80) and a McFadden’s R2 of 0.26.

Validation

All models were evaluated in the validation set as well. For the reference model, we had 8210 subjects (36.5 % of the validation population, due to lack of height measurements). The AUC was 0.63 (95 % CI 0.61–0.64), and the McFadden’s R2 was 0.01. For the model based on laboratory test results, we had 5792 subjects (25.8 %) with all laboratory results available. The AUC was 0.77 (95 % CI 0.75–0.78), similar to the one from the test set, with a McFadden’s R2 of 0.14. For the combined model, 4946 subjects (22.0 %) had complete laboratory results. The AUC was 0.73 (95 % CI 0.71–0.74), and the McFadden’s R2 was 0.07. Figure 2 and Table 4 present the percent of observed CRC cases in the validation set by probability deciles of the combined model.

Net Reclassification Index

We further calculated the NRI for both the test and validation sets using the combined model compared to the model based on variables that were used in previous studies. Detailed reclassification tables are provided (supplementary index 4). The NRI was higher in both sets with values of 60.7 % for the test set (52.8 % for cases and 7.9 % for controls) and 14.7 % for the validation set (−4.9 % for cases and 19.6 % for controls). Since the model based on laboratory values had similar AUC as the overall model with the advantage of additional simplicity as an automatic model, we also calculated the NRI using this model in comparison with the model based on variables that were used in previous studies (supplementary index 5). The NRI was 47.6 % for the test set (47.3 % for cases and 0.3 % for controls) and 41.4 % for the validation set (12.2 % for cases and 29.2 % for controls).

Discussion

Using a UK population-representative dataset [27], we assessed for the first time a risk prediction model for sporadic CRC based on laboratory test results, mainly CBC and inflammatory markers, and compared it to a reference model based on variables that were previously used in CRC risk models, such as anthropometric and lifestyle parameters and medical comorbidities. The reference model had similarly low predictive value and goodness of fit as past models, with an AUC ranging between 0.58 in the test set and 0.63 in the validation set. However, models based on laboratory test results had high predictive values and discrimination with an AUC of up to 0.74 and 0.80, respectively, and high goodness of fit. The likelihood of a CRC diagnosis was 18 times higher in the highest compared to the lowest risk decile of the combined model (Fig. 2). These results were replicated in the validation set of the study.

A recent systematic review [40] that evaluated previous CRC risk models found weak discriminatory power, with AUCs ranging from 0.6 to 0.69 and large heterogeneity between studies. These models were limited by selection bias as most models used only data from subjects that underwent screening or diagnostic colonoscopies and had no information regarding individuals that were non-compliant with current screening recommendations (up to one-third of the general US population). Furthermore, some of these models focused on specific populations such as physicians, males, or city dwellers [17, 20] that differ in health literacy and the use of health care services from the general population. Thus, previous results lacked generalizability and several studies that tried to confirm the results in different populations showed an even lower discriminating ability [41]. Additionally, these models were also prone to recall bias due to the use of self-report data. Furthermore, numerous risk factors such as family history of CRC, previous screening colonoscopies, use of aspirin/NSAIDs and HRT were not assessed.

The current study had several important advantages. We used a large population-representative EMR that included information on individuals both with and without a history of previous CRC screening. The incidence of CRC in THIN was previously shown to be comparable to the incidence in the entire population of the UK as reported in cancer registry data [26, 28]. The study cohort had a long follow-up with a median of 6.2 years and a maximum of 18 years. By excluding individuals with a history of genetic CRC syndromes, family history of CRC, or IBD and those diagnosed before the age of 50, we were able to focus on sporadic CRC cases, a population that can benefit from better risk stratification. The median age of the study population (69 years) was in the upper age range recommended for screening colonoscopy. However, in contrast to previous models that evaluated only individuals who underwent screening colonoscopy, the current study evaluated all incident CRC cases. As such, the median age in our study represented the actual median age for CRC diagnosis in the entire population.

Laboratory parameters are good candidates for automatic EMR-based risk stratification, since after the age of 50 routine blood tests are recommended for other indications (such as lipid profile), at a minimum frequency of every 5 years. In contrast to other known risk factors, such as weight, smoking, or alcohol consumption, laboratory parameters are less prone to information bias, and in contrast to medications there is no possible bias due to lack of compliance. Moreover, the current study evaluated only commonly used laboratory parameters, with results available for at least one-third of the study cohort, and the final model included only variables that are part of the routine blood count and differential. Although changes in the CBC, mainly anemia, decrease in MCV, increase in red cell distribution width, and thrombocytosis, were previously described in the literature as features of CRC [38–40], to our knowledge no previous studies have assessed the value of incorporating these changes in a CRC risk prediction model. Furthermore, most studies evaluated anemia as a dichotomous rather than as a continuous variable. A single study suggested a gradual decline in hemoglobin levels starting 3–4 years before cancer diagnosis [41].

All medical diagnoses, medication prescriptions, and laboratory results were recorded before cancer diagnosis. For laboratory results, we focused only on values from the year prior to diagnosis, in order to evaluate data that were collected in a uniform time window in both cases and controls. In addition, we focused on cases without known metastatic disease at diagnosis, ensuring that the risk factors that were used were actually relevant for early detection.

Despite the large number of variables that were tested, the current study had more than 100 cases per variable, and thus, there was no need to apply penalized regression methods to the analysis and the risk for over-fitting of the model was low. Additionally, we repeated the model with and without correction for linearity for laboratory results, age and height, with no change in results.

The current study had several limitations. The THIN database lacks information regarding some of the known risk factors for CRC such as dietary habits, physical activity, and race. However, previous models that used those factors had low discriminatory power with AUC <0.7 [21]. THIN also lacks information regarding tumor location as well as histopathology and staging. Since 20 % of CRC cases in the UK are metastatic at the time of diagnosis [30] and 75 % of the patients die within 2.5 years of diagnosis [2], by excluding all CRC cases who died within 2.5 years of diagnosis, <5 % of our CRC cases would have had metastatic disease. Although there are differences in risk factors and pathogenesis between right- and left-sided tumors and between colon and rectal cancers, those malignancies are diagnosed and treated similarly, thus favouring one model for different disease subgroups. THIN also lacks information regarding the premalignant condition, the adenomatous polyp. However, since we evaluated local disease as the study outcome, our model is relevant for detection of early-stage disease.

Despite the large sample size of our study, several laboratory parameters that were previously described as CRC risk factors, such as C-reactive protein and Helicobacter pylori infection positivity, were not available for most individuals and were excluded from the multivariate analysis. Performing multiple imputations on variables with a large proportion of missing data (approximately 67 %) can be unreliable and can introduce bias [42]. Although we were able to demonstrate the importance of laboratory results as predictors of sporadic CRC risk, we did not have a sufficient number of individuals with repeated measures during follow-up to evaluate intra-individual changes in values. Of note, age, despite its known effect on CRC risk, was not included in the final model, probably secondary to our research methodology that matched cases and controls on duration of follow-up in order to ascertain equal “opportunity” to develop the disease. Despite this limitation, we were able to show better predictive power compared to previous models.

A possible selection bias could result from the exclusion of 6403 (28.6 %) of CRC cases due to suspected metastatic disease. However, since the aim of the current study was to predict early-stage disease that can lead to clinically meaningful interventions, analyzing individuals with advanced disease could bias such conclusions. An additional selection bias might stem from the fact that only one-third of the study population had values for all the laboratory results that were included in the model and the percent of cases with full laboratory results was higher compared to the percent of controls (40–50 vs. 25–30 %, respectively). Since the current study focused on the year before cancer diagnosis, it is possible that undiagnosed CRC might have influenced some of the laboratory results. However, it is important to note that our objective was not to identify etiological factors for CRC; our objective was to evaluate the power of a combination of biomarkers (causal or non-causal) and clinical factors in predicting early-stage CRC.

Finally, the current work created an internal validation set by splitting the data randomly in a 2:1 ratio. We did not perform an external validation of our results. Since the THIN database is representative of the entire UK population, such an internal validation supports the generalizability of the results. No PPV or NPV was calculated for the risk models due to the use of case–control methodology. Category-based NRI was calculated for both test and validation sets and was presented separately for cases and controls as previously described [43, 44].

In summary, we developed and internally validated a CRC risk prediction model that demonstrates superior prediction performance compared to the existing models (with an NRI of more than 40 %). The improved performance resulted from incorporation of routinely available laboratory results that are not susceptible to information bias. Future work will need to perform external validation of the model in diverse populations, for example patients with prior colonoscopy, and evaluate the significance of intra-individual changes in laboratory values on sporadic cancer risk. Such a model can be used for a risk-tailored screening approach that can help determine which patients would benefit most from CRC screening by colonoscopy. Cost-effectiveness analysis will be needed in each population to determine the risk threshold for different levels of screening. This tool cannot be used to determine such a threshold, but it can be used to assign an individual risk level in order to apply the risk-tailored screening approach once the thresholds are established. The suggested laboratory-based model might represent a shift in the paradigm by which we study CRC risk. Of note, this model does not suggest a need to change current screening guidelines or to forgo screening in individuals at low risk.