Introduction

Clinicians, researchers, and funders increasingly recognize the importance of conducting cancer research outside of the highly selected populations of controlled trials and specialty, referral-based clinics [1–5]. Research in community-based settings can provide valuable information on cancer incidence, survival, and costs, as well as on the effectiveness of cancer-related prevention, screening, treatment, and health care services as delivered in practice.

Studies that can advance knowledge of cancer risk factors, preventive services, treatment effectiveness, and outcomes in “real-world” health care settings include: retrospective studies of cancer etiology with adequate periods for exposure assessment; comparative effectiveness analyses of therapeutic and health care delivery interventions; pragmatic cancer treatment trials; and prospective and retrospective cohort studies of associations between post-diagnosis exposures and cancer prognosis. Research capabilities that enhance the validity of such studies include:

  1. The ability to identify and enumerate the underlying population giving rise to cases (i.e., “the denominator”) for calculating and comparing incidence rates across exposure groups, and, in case-control studies, for selecting controls from the same population from which cases arose.

  2. The ability to assess exposures many years before a cancer diagnosis to study factors whose association with cancer may have long induction/latent periods.

  3. The ability to follow cancer patients for many years after diagnosis to study clinical outcomes, long-term effects of treatment, and post-diagnosis exposures.

Integrated health care delivery systems are particularly well-suited to conduct health and health care outcomes research in community care settings, and their success in doing so depends, in part, on having the capabilities described above. This manuscript illustrates the Cancer Research Network’s (CRN’s) capacity to support “real-world” epidemiologic and health services research. Specifically, its purpose is to describe patients’ enrollment in CRN health care systems before and after their cancer diagnoses, capacities that are crucial for research on cancer risk factors, prevention, outcomes, and costs.

Methods

Setting

The CRN, established in 1999 [6, 7], is a cooperative agreement between the National Cancer Institute (NCI) and integrated non-profit health care delivery systems. It currently consists of eight supported and four affiliate research centers within these integrated health care systems (http://crn.cancer.gov/). The goal of the CRN is to conduct and foster public domain research across the cancer care continuum, from primary prevention to treatment, survivorship, and end-of-life care. The CRN has been a long-time contributor to epidemiologic and health services research on cancer etiology, prevention, screening, costs, utilization, communications, recurrence, effects of treatment, and survivorship (for examples, see references [8–17]).

In addition to having research scientists embedded in these health care settings, one of the CRN’s most important resources is its standardized, distributed data network. The distributed data network is now managed by the Health Care Systems Research Network (HCSRN; formerly HMO Research Network (HMORN)). The HCSRN is an umbrella organization of public domain research centers embedded in or affiliated with health care systems; it encompasses several collaborative research networks including the CRN. The distributed data network supports the Virtual Data Warehouse (VDW), a standardized de-centralized data model that was developed for research use, in which each HCSRN site maintains its data locally, but a programmer at one site can write data-extraction and analysis programs that can be run at other sites [7, 18]. The VDW includes tables that capture information on demographics of health plan enrollees and/or health system patients, health plan enrollment, health care encounters, diagnoses, procedures, pharmacy fills, vital signs, social history, vital status, and laboratory values. Within the VDW, the tumor table (also known as the Virtual Tumor Registry [VTR]) contains data consistent with the North American Association of Central Cancer Registries standards (www.naaccr.org/) [13]. Tumor data are obtained from manual review of cancer patients’ medical records by trained medical record abstractors.

All eight funded CRN sites contributed data to this analysis. Sites, listed by research center and associated health plan, are: Meyers Primary Care Institute, Reliant Medical Group and Fallon Community Health Plan (FCHP), Massachusetts; Group Health Research Institute, Group Health (GH), Washington; Department of Public Health Sciences, Henry Ford Health System and Health Alliance Plan (HFHS), Michigan; Institute for Health Research, Kaiser Permanente Colorado (KPCO); Center for Health Research-Hawaii, Kaiser Permanente Hawaii (KPHI); Center for Health Research, Kaiser Permanente Northwest (KPNW), Oregon; Division of Research, Kaiser Permanente Northern California (KPNC); and Marshfield Clinic Research Foundation, Marshfield Clinic Health System and Security Health Plan (MCHS), Wisconsin. Three affiliate sites also provided data (HealthPartners Institute, HealthPartners (HP), Minnesota; Harvard Pilgrim Health Care (HPHC), Department of Population Medicine, Harvard Medical School, Massachusetts; Department of Research and Evaluation, Kaiser Permanente Southern California (KPSC)).

Data collection

In July 2016, data were extracted from the tumor, enrollment, demographics, and death tables in the VDW via a SAS program developed by the CRN Informatics Core. The program was designed to query standard VDW data elements and required only minimal site-specific customization (e.g., to specify the path to local site data storage locations). Program output consisted of tabulations (for review by local site investigators) and aggregated SAS data sets, the latter of which were returned to and compiled by the Informatics Core programmer at Group Health. Ten of 11 participating sites received the SAS program and returned results via the CRN installation of PopMedNet™, a software application designed to facilitate data sharing activities in distributed networks; the remaining site did so via secure email. As the lead institution, KPNC’s IRB approved the aforementioned CRN infrastructure activities. CRN funded and affiliate sites either ceded IRB authority to KPNC’s IRB (6 sites) or submitted materials to their own institution for review and approval (1 site) or exemption (2 sites); one site’s IRB determined these data were preparatory to research and IRB approval was not required. Additional IRB approval was received from the Massachusetts State Tumor Registry to use cancer data from the two CRN sites in Massachusetts that maintain VDW tumor tables.
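The distributed workflow described above can be illustrated with a minimal sketch. This is not the CRN Informatics Core program, which was written in SAS; the record layout and the names `site_tabulation` and `compile_results` are hypothetical and serve only to show the pattern of running one program locally at each site and returning aggregated counts, rather than patient-level rows, to a coordinating programmer.

```python
from collections import Counter


def site_tabulation(local_records):
    """Run locally at one site: return aggregate counts only, never patient rows."""
    return Counter((r["dx_year"], r["cancer_site"]) for r in local_records)


def compile_results(site_outputs):
    """Run centrally: sum the aggregated counts returned by each site."""
    total = Counter()
    for counts in site_outputs.values():
        total.update(counts)
    return total


# Hypothetical local data held at two sites (illustrative values only).
site_a = [
    {"dx_year": 2014, "cancer_site": "breast"},
    {"dx_year": 2014, "cancer_site": "prostate"},
]
site_b = [{"dx_year": 2014, "cancer_site": "breast"}]

combined = compile_results(
    {"A": site_tabulation(site_a), "B": site_tabulation(site_b)}
)
```

In this arrangement only the output of `site_tabulation` leaves a site, which mirrors the description of aggregated data sets being returned for compilation.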

Each CRN site contributed population demographic characteristics (gender, age, race, and ethnicity) and insurance coverage data as of 12/31/2014 or 12/31/2015, depending on when their data were last refreshed. Each CRN site also sent the number of new malignant cancer diagnoses by year of diagnosis, age at diagnosis (<13, 13–18, 19–39, 40–49, 50–59, 60–64, 65–69, 70–79, ≥80 years), and cancer site. Cancer site was categorized based on the Surveillance, Epidemiology, and End Results program’s (SEER’s) ICD-O3/WHO 2008 Site Recode definition [19]. Only cases who were enrolled at a CRN site at the time of diagnosis or whose diagnosis was within 45 days after disenrollment or 60 days before enrollment were included [20]. Persons with multiple primary tumors diagnosed during a given year were counted only once per year but were not de-duplicated across diagnosis years.
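The inclusion window and the once-per-year counting rule can be sketched as follows. This is an illustration under stated assumptions, not the program used for the analysis: enrollment is assumed to be stored as (start, end) date pairs, and the function names and default parameters are hypothetical, with the 45-day and 60-day windows taken from the text above.

```python
from datetime import date, timedelta


def eligible(dx_date, enrollment_periods,
             days_after_disenroll=45, days_before_enroll=60):
    """True if diagnosis falls within an enrollment period or the allowed windows."""
    for start, end in enrollment_periods:
        earliest = start - timedelta(days=days_before_enroll)
        latest = end + timedelta(days=days_after_disenroll)
        if earliest <= dx_date <= latest:
            return True
    return False


def count_cases_by_year(tumors):
    """Count each person once per diagnosis year; no de-duplication across years.

    tumors: iterable of (person_id, dx_date) pairs, one per primary tumor.
    """
    person_years = {(person_id, dx.year) for person_id, dx in tumors}
    counts = {}
    for _, year in person_years:
        counts[year] = counts.get(year, 0) + 1
    return counts


# Hypothetical example: one enrollment period and four primary tumors.
periods = [(date(2014, 4, 1), date(2015, 12, 31))]
tumors = [
    ("p1", date(2014, 1, 5)),
    ("p1", date(2014, 9, 1)),   # second primary in the same year: not recounted
    ("p1", date(2015, 2, 1)),   # new year: counted again
    ("p2", date(2014, 3, 3)),
]
yearly = count_cases_by_year(tumors)
```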

Analysis

For cancers diagnosed in the most recent year of available data (2012 at one site, 2013 at one site, 2014 at six sites, and 2015 at two sites), we computed the percent continuously enrolled for ≥1, ≥5, and ≥10 years before diagnosis. We stratified analyses by CRN site, cancer site, and age at diagnosis.
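One way to compute these percentages is sketched below. The sketch assumes that the definition of continuous enrollment used in this manuscript (no gaps in enrollment >60 days) also applies before diagnosis, that enrollment periods do not overlap, and that a year can be approximated as 365.25 days. The function names and data layout are hypothetical.

```python
from datetime import date

MAX_GAP_DAYS = 60  # assumed gap tolerance for continuous enrollment


def continuous_start_before(dx_date, periods, max_gap=MAX_GAP_DAYS):
    """Start of the continuous enrollment span that contains dx_date, else None."""
    # Periods that began on or before diagnosis, most recent first.
    prior = sorted((p for p in periods if p[0] <= dx_date), reverse=True)
    if not prior or prior[0][1] < dx_date:
        return None  # not enrolled on the diagnosis date
    span_start = prior[0][0]
    for start, end in prior[1:]:
        gap_days = (span_start - end).days - 1
        if gap_days > max_gap:
            break  # gap too long: the continuous span starts here
        span_start = min(span_start, start)
    return span_start


def percent_enrolled_at_least(cases, years):
    """Percent of cases with at least `years` of continuous pre-diagnosis enrollment."""
    met = 0
    for dx_date, periods in cases:
        start = continuous_start_before(dx_date, periods)
        if start is not None and (dx_date - start).days >= years * 365.25:
            met += 1
    return 100.0 * met / len(cases)


# Hypothetical cases: (diagnosis date, list of enrollment periods).
cases = [
    (date(2014, 6, 1), [(date(2000, 1, 1), date(2014, 12, 31))]),
    (date(2014, 6, 1), [(date(2005, 1, 1), date(2009, 12, 31)),
                        (date(2010, 2, 1), date(2014, 12, 31))]),  # 31-day gap bridged
    (date(2014, 7, 1), [(date(2000, 1, 1), date(2012, 12, 31)),
                        (date(2013, 6, 1), date(2014, 12, 31))]),  # long gap resets span
    (date(2014, 6, 1), [(date(2014, 3, 1), date(2014, 12, 31))]),
]
```

Calling `percent_enrolled_at_least(cases, n)` for n = 1, 5, and 10 yields the three thresholds reported in the analysis; stratification would amount to applying the same function within each CRN site, cancer site, or age group.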

In analyses of disenrollment and death after cancer diagnosis, we focused on cases diagnosed in 2000 because all CRN sites could provide at least 10 years of follow-up. As of each 12-month interval after diagnosis, we computed the following cumulative percentages: continuously enrolled and presumed alive (no indicator of death); continuously enrolled until death; and disenrolled. These categories were mutually exclusive and added to 100 %. Continuous enrollment was defined as no gaps in enrollment >60 days. We conducted stratified analyses by CRN site, cancer site, and age at diagnosis.
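The three mutually exclusive categories can be expressed as a small classification routine. The sketch below is illustrative only: it assumes each case is summarized by the number of days from diagnosis to the end of continuous enrollment and to death (either may be missing), approximates each 12-month interval as 365.25 days, and uses hypothetical function and category names.

```python
from collections import Counter

CATEGORIES = ("enrolled_presumed_alive", "enrolled_until_death", "disenrolled")


def status_at(year, days_to_disenroll, days_to_death):
    """Classify one case as of `year` years after diagnosis."""
    cutoff = 365.25 * year
    died = days_to_death is not None and days_to_death <= cutoff
    left = days_to_disenroll is not None and days_to_disenroll <= cutoff
    # Disenrollment takes precedence only if it happened before any known death.
    if left and (not died or days_to_disenroll < days_to_death):
        return "disenrolled"
    if died:
        return "enrolled_until_death"
    return "enrolled_presumed_alive"


def cumulative_percentages(cases, max_years=10):
    """Percent of cases in each category at every 12-month interval."""
    table = {}
    for year in range(1, max_years + 1):
        counts = Counter(status_at(year, d, x) for d, x in cases)
        table[year] = {k: 100.0 * counts[k] / len(cases) for k in CATEGORIES}
    return table


# Hypothetical cases: (days to disenrollment, days to death); None = not observed.
cases = [(None, None), (400, None), (None, 1000), (2000, 3000)]
table = cumulative_percentages(cases)
```

Because every case falls into exactly one category at each interval, the percentages in each row of `table` sum to 100.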

Because KPSC restricted their submitted data to a subset of cancer types (breast [n = 1,968], colon and rectum [n = 1,166], prostate [n = 2,171], lymphoma [n = 528], lung and bronchus [n = 1,165], and urinary bladder [n = 210]), we excluded KPSC from our analyses. Including these data would have skewed our overall proportions toward specific cancers. However, we verified in sensitivity analyses restricted to breast, colon and rectum, prostate, lymphoma, lung and bronchus, and urinary bladder that including KPSC data would not meaningfully have altered cancer-specific results.

No statistical testing was performed as our goal was to characterize available data, not to draw inferences or test hypotheses. Analyses were conducted in SAS® version 9.3 (Cary, North Carolina) and Stata® version 12 (College Station, Texas).

Results

Approximately 8 million people were enrolled in participating CRN health plans (Table 1) as of 12/31/2014 or 12/31/2015 (date varied by CRN site). About half of enrollees were aged 40 and older. Race was known in 66.1 % of the enrollees; approximately 30 % of persons with known race were non-white. Demographic characteristics differed across CRN sites (Online Resource S1), primarily with respect to race, ethnicity, and insurance.

Table 1 Selected characteristics of Cancer Research Network health plan enrollees (n = 8,039,569)

Among more than 30,000 cancers diagnosed in the most recently available year of data, the majority occurred in patients aged 60 and older (Table 2). The most common were cancers of the breast (17.8 %), lung, bronchus, and pleura (10.4 %), prostate (9.8 %), colon and rectum (8.7 %), and skin (excluding basal cell and squamous cell carcinoma) (7.8 %) (Table 3). However, there were large numbers of even relatively uncommon cancers, such as those of the ovary (n = 424) or urinary bladder (n = 740). The relative incidence of different cancers within the CRN generally matched SEER program data (Fig. 1), except that prostate, bladder, and lung and bronchus cancers accounted for a smaller percentage of cases within the CRN and breast and melanoma accounted for higher percentages. Of note, CRN data included both male and female breast cancers, while SEER data were limited to female breast cancers.

Table 2 Age and Cancer Research Network site of cancers diagnosed in most recent year of available data (n = 30,420)
Table 3 Types of cancer diagnosed in most recent year of available data (n = 30,420)
Fig. 1 Proportional incidence of common cancer types in the SEER program compared to the Cancer Research Network. Note: Breast cancer cases from SEER are in females only but are for both males and females in CRN data. SEER data are from 2015. CRN data are from the most recent year available (Table 2). CRN Cancer Research Network; SEER Surveillance, Epidemiology, and End Results

Most patients were continuously enrolled in their health plan for many years before cancer diagnosis (Table 4). More than two thirds were enrolled for ≥5 years and more than half for ≥10 years before diagnosis (69.5 and 55.6 %, respectively). Of the specific cancer sites examined, patients with thyroid cancer had the lowest ≥10 year enrollment. Prior enrollment varied by CRN site (Online Resource S2), with the percent enrolled for ≥10 years ranging from 30.1 to 67.1 %. Age at diagnosis was also related to pre-cancer enrollment (Online Resource S3), with patients aged 40 years and older having longer pre-diagnosis enrollment than those under 40.

Table 4 Length of continuous enrollment before cancer diagnosis, overall and by cancer site

Among cancers diagnosed in the year 2000 (n = 25,274), the majority either stayed enrolled (28 %) or died (45 %) within 10 years (Fig. 2). Only 27 % disenrolled within 10 years. Nearly all CRN sites had similar patterns of disenrollment in the 5 years following cancer diagnosis (Online Resource S4). However, occurrences of death and disenrollment differed by cancer site (Online Resource S5). Disenrollment tended to be lower among people with cancers that usually affect older persons or are more rapidly fatal (e.g., lung and bronchus) and higher among less fatal cancers affecting younger persons (e.g., thyroid). Patients under age 40 years at cancer diagnosis were more likely to disenroll than older patients, who were more likely to stay enrolled or, in the case of patients aged 70 years or more, to have died during the observation period (Online Resource S6).

Fig. 2 Retention after cancer diagnosis in 2000 (n = 25,274), cumulative by year since diagnosis

Discussion

Combined, CRN sites have a large, diverse population with relatively high numbers of common cancers as well as certain rare ones. We observed that about half of incident cancer cases had at least 10 years of enrollment before diagnosis, and only about one in four (27 %) were lost to follow-up in the first 10 years after diagnosis. Some variation by CRN site was observed, as shown in Online Resource S2 and Online Resource S4. Our overall results are consistent with a report from a decade ago showing high retention rates among cancer survivors in a subset of CRN sites [20]. In the current analysis, we noted differences in loss to follow-up by cancer type, but these were generally consistent with what we would expect based on age at diagnosis and known survival statistics [21]. Stratification of length of enrollment by CRN site, cancer site, and age at diagnosis provides important information for researchers planning studies within the CRN. Estimates in this manuscript can be used to plan future, large-scale studies of cancer incidence and outcomes within the CRN. For example, these statistics can help investigators select CRN sites for their research and conduct power analyses. Of note, when data from all CRN sites are combined, the average is weighted toward larger sites like KPNC. We, therefore, provided CRN site-specific estimates in Online Resources S2 and S4. However, investigators planning studies may need more granular information (e.g., stratified by cancer site and CRN site), which can be obtained through the Preparatory-to-Research process (https://crn.cancer.gov/collaboration/process.html).

All data in this manuscript were obtained using a single SAS program written at one site and distributed to the other sites. This process illustrates one of the main efficiencies in conducting research within the CRN. Researchers should be aware, however, of several limitations in using CRN data. First, not all CRN sites have data available over the same years. Second, most CRN sites experience a lag in the availability of tumor data. If newly diagnosed cases are needed for a study, researchers should take this into account when selecting CRN sites to be included in their studies. Other methods for quickly identifying cases (e.g., via pathology reports) may still be possible, but employing them may require different approaches at each CRN site. A third important consideration is that some data are collected differently across CRN sites. For example, at some CRN sites (i.e., KPNC, HP), everyone who is not verified as Hispanic is classified as unknown ethnicity, whereas at other sites, persons may be classified as non-Hispanic. Fourth, the catchment area for a CRN site’s VDW tumor table may not always include the entire health plan enrollment. For example, Group Health’s VDW tumor table is populated for Group Health enrollees who reside in the western Washington SEER catchment area (about 80 % of enrollees). For HP, cancer diagnoses are not available for health plan members who are not also patients of their medical group (about half of enrollees). These subtleties highlight the importance of incorporating local expertise and knowledge of how each site’s VDW is populated when conducting multi-site research.

This manuscript illustrates two important strengths of the CRN that make it an excellent setting for conducting population-based research along the cancer care continuum from prevention to survivorship and end of life: long pre-diagnosis enrollment and low loss to follow-up post-diagnosis. Pre-diagnosis information is important for studies of cancer incidence in relation to exposures like medication use (e.g., Boudreau et al. [22]) or screening (e.g., Doubeni et al. [11]). Low rates of disenrollment after cancer diagnosis are important for studies of outcomes requiring extended follow-up periods, such as second malignancies (e.g., Clough-Gorr et al. [23]), late-effects of cancer (e.g., Bowles et al. [15]), and health care utilization in survivors (e.g., Buist et al. [24]).

Because the CRN is based on community practice, it is positioned to facilitate multi-site research on comparative effectiveness, multi-morbidity, real-world adherence to screening and treatment regimens, and other topics that are difficult to address in academic cancer settings or by randomized controlled trials. The VDW does not contain information on all exposures of interest (e.g., diet), but is an excellent source of data on health care utilization, diagnoses, and medication fills (Online Resource S7). Unlike claims datasets, CRN data include clinical measures such as anthropometrics and laboratory results. To benefit most from the CRN as a research platform, researchers need to understand not only the wide range of available data [18] but also the characteristics that make it well-suited to population-based cancer research.