Introduction

Cardiovascular disease (CVD) is the leading cause of morbidity and mortality in the United States (US) [1]. CVD healthcare expenditures totaled $316.1 billion in 2012–2013, accounting for 14% of total healthcare expenditures in those years [1]. Accordingly, epidemiological research has focused on reducing the burden of CVD through elucidating etiological factors, identifying early detection strategies, and informing prevention and disease management efforts. Much of this work has been facilitated by large prospective cohort studies, such as the Framingham Heart Study; however, substantial financial costs and logistical difficulties associated with developing and maintaining these resources have spurred investigators to seek more efficient ways to conduct this work. In particular, use of data from electronic health records (EHRs) has enabled epidemiologists to conduct cross-sectional and longitudinal investigations of CVD risk without the burdens imposed by assembling traditional cohort studies. Additionally, EHR data can be leveraged as an existing data source to conduct rapid and more efficient investigations into the population burden of CVD and its risk factors [2]. Although easier to obtain, EHR data present unique challenges to epidemiological research. These data are primarily collected for clinical purposes, and repurposing them for surveillance research creates the potential for bias (defined as a systematic deviation of observed results or inferences from the “truth”) [3].

In this paper, we review recent literature pertaining to the use of EHRs in epidemiological research aimed at surveillance of CVD risk in populations. We then explore and outline the most common types of bias that arise in EHR-based studies of CVD. Finally, we conclude with recommended strategies informed by the literature and our own work to reduce these biases in future research.

Overview of Sources of Bias

Sources of bias that can threaten the validity of findings from CVD risk surveillance studies can broadly be divided into two categories: (1) selection bias, which arises when the study population is not representative of the particular target population to whom inferences are made; and (2) information bias, related to misclassification, mismeasurement, or non-random missingness of data used to characterize the population. Ideally, the dataset would contain variables that directly capture the exact characteristic we aim to surveil, measured perfectly among all members of our target population. In reality, we nearly always end up with imperfectly measured variables, collected from a biased subsample of the target population, that either approximate the true characteristic we hope to measure or serve as a surrogate for it (e.g., body mass index as a measure of obesity or C-reactive protein as a proxy for inflammation).

Bias affects findings from research and surveillance efforts in two key ways. First, if the population included in the study differs from the target population with regard to factors associated with CVD, we may incorrectly estimate the burden of a particular risk factor. For example, an over-representation of older adults in an EHR system might result in an over-estimate of the prevalence of hypertension (since blood pressure is positively associated with aging). Second, if the data collected are of low or inconsistent quality, or have a high degree of missingness, we may misclassify individual patients’ cardiovascular health. For example, we may over-estimate the prevalence of hyperglycemia if fasting status is not confirmed before a glucose measurement is obtained in a clinical setting. Similarly, if a fasting glucose test is only ordered for patients with known diabetes risk factors, we may also inaccurately estimate the prevalence of undiagnosed diabetes if we ignore the non-random nature of the missing data (e.g., assume that all those with missing data are normoglycemic or simply exclude patients with missing data from calculations of hyperglycemia prevalence). While height, weight, and blood pressure are routinely measured at most clinical encounters, tests for other CVD risk factors such as hyperlipidemia or hyperglycemia may only be ordered when there is a clinical indication, potentially introducing bias. This limitation is unlike ascertainment of CVD risk factors in a traditional prospective cohort study where all participants uniformly undergo the same measurements, obtained in a standardized way, regardless of clinical indication.

In designing a surveillance study, we need to identify potential sources of bias so that we can (1) prevent or reduce bias in the planning/design phase, (2) measure and quantify the remaining bias after the data collection phase, and/or (3) adjust for this bias in the analysis phase. At a minimum, we should describe the magnitude and direction of bias after the data are collected so that we can project how our observed estimates may approximate or differ from the “truth.” A thoughtful approach to designing surveillance studies can minimize bias by optimizing selection of patient populations, improving data collection approaches, addressing quality control issues, and capturing data to understand, quantify, and potentially adjust for bias.

Selection Bias

Selection bias, a type of systematic error, is introduced into observational studies either by flawed recruitment/data extraction or by factors that affect subjects’ participation in the study [4]. In a traditional prospective observational cohort study, this can occur in the design and implementation phases via inappropriately defined target populations and assembled sampling frames, lack of participation from eligible subjects, or both [5]. If the study population that is observed differs from the target population on key variables, then the study population is not considered to be “representative” of the target population, and study validity may be compromised if the research question of interest requires representativeness to produce generalizable findings [6].

A major consequence of selection bias is its infringement on internal validity, where investigators make inferences unique to the sample that may not reflect the actual association in the intended target population [4]. Studies are considered internally valid when inferences are made in the context of minimal systematic error [7]. Internal validity is a prerequisite for external validity [4]. The extent to which such inferences can be generalized beyond the study sample and the pre-defined target population is encompassed by external validity [7]. Applications of external validity, particularly in the context of generating risk factor prevalence estimates, are only justified when the sample is representative of the population to which results are to be generalized.

Even though traditional health surveys and community-based studies are well-accepted approaches for conducting surveillance of CVD risk factors in a population, reductions in cost and resources as well as an interest in improving efficiency have motivated a shift towards utilizing EHRs and other existing administrative data sources over primary data collection [7]. However, selection bias from EHR-based surveillance of risk factors can inhibit the ability to accurately estimate CVD risk. A representative sample is fundamental to studies estimating disease burden in the general population [8]. Critics of the use of EHR data for characterizing population health assert that EHR datasets are composed of convenience samples, and that individuals accessing the health system (and thus populating EHR systems) are systematically different from those who abstain in a way that would bias findings obtained from such studies [8].

Errors attributed to selection bias are minimized when each person in the target population has an opportunity to be selected for the study [8]. Obtaining truly random samples for EHR-based observational studies that measure CVD risk factors is difficult, though, because inclusion requires that the patient actively seek medical care and that their data be entered into an EHR [9]. If information (e.g., demographic characteristics) is lacking on the population from which the patients arose, estimating the bias becomes challenging. Thus, identifying types of selection bias and the stages in which they occur in EHR-based surveillance studies is critical to understanding potential implications for effect estimates and generalizability of results.

Informed presence is the premise that patients do not appear randomly in an EHR data repository; rather, illness or symptoms may influence entrance into the healthcare system as well as the data which are likely to be collected on those patients [10••]. Thus, patients in the healthcare system are systematically different and more likely to be diagnosed with conditions that are also tracked in CVD risk factor surveillance than non-patients (i.e., healthcare system non-utilizers). When using EHR data for surveillance, we unintentionally condition on patients being ill for inclusion into the study. The exception to this is records that capture preventive care interactions, yet these too are subject to selection bias because factors such as education, health insurance coverage, and access to transportation might influence who uses these primary care services [11, 12].

The relationship between selection bias and the adequacy of EHRs for CVD risk factor surveillance is well described [10, 13••, 14, 15•]. CVD risk factor surveillance using EHRs hinges on records offering complete information [16•]. Missing data in EHRs are considered missing at random (MAR) or not missing at random (NMAR), due to systematic biases from the clinical care process or to a key characteristic of the population included. Missingness is also user-defined and meaningful only in reference to the data structure’s ability to answer the research question of interest [15•]. Criteria for meeting complete data requirements restrict inclusion into the study and exclude eligible subjects whose data may be relevant.

Patient wellness is inversely correlated with the breadth and frequency of information recorded in the EHR. The frequency of certain elements in the EHR, such as laboratory results and medication orders, is negatively associated with patient health [16•]. Thus, those with more complete records are often patients with underlying health conditions that prompt more frequent visits with a healthcare provider [16•]. This is consistent with the notion of informed presence: inclusion in an EHR is not random but rather indicates that the subject is more likely to be ill. It then follows that persons represented in EHRs are systematically different from those not in EHRs. As other authors have noted, individuals contained within an EHR dataset tend to be non-representative of the larger population to whom results are meant to be generalized [6, 8]. Since patients are not observed randomly or at set intervals (but rather only when they have a medical encounter), there exists potential for bias in the data. One way this can manifest is that patients with more medical encounters have more opportunity to receive clinical diagnoses. When complete data requirements are imposed for inclusion into surveillance studies, the surveillance “system” may contain an over-representation of patients with poorer health. Risk factors for CVD may appear more or less prevalent than they are in the general population, and generalizing associations observed under such circumstances to healthier populations violates external validity. In fact, adults seeking medical care tend to have higher rates of diabetes, hypercholesterolemia, hypertension, and obesity, and a lower rate of smoking compared to adults in the general population [17].

The populations in EHR data repositories are less heterogeneous than the target population; regardless, they are still being used for CVD risk factor surveillance. Inclusion of eligible patients into EHR-based surveillance is hampered by factors that influence healthcare utilization. For example, a study by Romo et al. using survey data from the 2013 New York City Community Health Survey (CHS) and the 2013–2014 New York City Health and Nutrition Examination Survey (NYC HANES) found that visiting a healthcare provider is more common among women, the unemployed, non-Hispanic Whites, and residents of neighborhoods with the lowest levels of poverty [11]. Negative perceptions about the healthcare system regarding cost, service availability, and culturally competent care also influence likelihood of visiting a provider [11]. Additionally, health insurance status is associated with entry into the healthcare system. Compared to those with health insurance, those who lack health insurance are more likely to be ill and less likely to receive medical care [18]. According to the US Census Bureau, the highest uninsured rates are for young and middle-aged adults, those living below 100% of the poverty line, Hispanics, and non-citizens [18].

Longitudinal surveillance studies, such as the Atherosclerosis Risk in Communities (ARIC) surveillance study that was designed to monitor trends in coronary heart disease and associated mortality, are often used in parallel with national health surveys to draw inferences about population cardiovascular health [19]. While migration bias, which arises when patients move out of a study's catchment area or between healthcare systems, is less problematic with diligent patient follow-up and tracking, additional design features of ARIC make migration bias even less likely. However, sampling from geographic locations with low migration and/or a single hospital servicing the medical needs of most patients may limit generalizability, as follow-up at these sites may not reflect the typical healthcare setting. Further, healthcare systems use different EHR software, and data formats may be inconsistent across platforms. Tracking patients longitudinally also introduces challenges, as the same measurements may not be conducted at regular intervals or in the same healthcare system for all patients [20].

A logical next step is to combine EHR data from multiple healthcare systems and repositories to increase the diversity of EHR datasets and address migration bias. However, simply pooling EHR datasets does not necessarily solve the issue, since higher-revenue healthcare systems with capabilities for big data analytics are the most common contributors to these collaborative efforts [15•]. Consequently, patient diversity and representativeness in such data repositories may still be a concern.

Information Bias

Information bias occurs when data that appear in the EHR are inaccurate due to missing data, data entry errors, or measurement errors. The majority of bias arising from coding inaccuracies due to data entry errors is considered a form of non-differential misclassification, meaning that the misclassification is not systematically an over- or under-estimate of the truth in the case of a continuous numeric variable (e.g., systolic blood pressure) and, therefore, is not considered a true bias. However, an important factor driving coding inaccuracies with diagnostic (International Classification of Diseases, or ICD) codes [21] is a billing incentive to place conditions that are likely to be reimbursed higher on the list of recorded diagnoses; this type of bias is indeed considered “differential” because it results in a systematic over-reporting of procedures or conditions with more favorable reimbursement structures. Behavioral history information such as alcohol consumption or smoking may also be differentially misclassified if self-reported by the patient rather than directly observed; patients are more likely to underreport than to overreport substance abuse and smoking behaviors. Of note, this bias due to inaccuracies in self-reported data, often termed “social desirability bias,” is similar to that seen in most observational research studies.

Differential misclassification often occurs when data from multiple EHR systems are aggregated for research purposes. Some EHR systems do not have compatible interfaces to simply merge data, leading to systematic missing values. Therefore, the investigator must address data harmonization issues. In the aggregated data setting, another challenge is to identify hospital-specific policies that might impact surveillance of a particular disease. Examples include enhanced screening for deep vein thrombosis, recording adverse effects of certain drugs, and monitoring specific types of complications of interest after procedures [22••]. Additionally, use of certain ICD codes may vary between providers and across time [23••]. Furthermore, medical equipment and laboratory tests may define different ranges for “normal” values, and using strict cutoffs for defining abnormal values in aggregated data may introduce bias in the estimates. Finally, data may be missing if non-standardized terminology is used, technical problems occur with data capture, or similar data fields are not uniform across EHR systems [24•].

Diagnostic suspicion bias, also known as over-diagnosis bias, leads to higher prevalence estimates and occurs when symptomatic or high-risk patients are more likely to undergo screening, which in turn raises the likelihood of diagnosis and receipt of treatment [22••]. This may also be labeled as “surveillance bias” if measures of disease burden increase because of policies that target quality control conditions such as adverse effects of drugs [22••].

Missing data in EHRs arise from multiple sources, including values for measurements or laboratory tests that fall outside the detectable range, the frequency with which specific ICD codes/diagnoses are entered, errors in coding during data extraction, and missed deaths that occur outside the medical system. When a patient leaves a particular EHR system or is not seen regularly, it is difficult to know if the patient is healthy, receiving treatment from another provider, or sick but not seeing a provider [23••]. These missing values are NMAR, and appropriate techniques are warranted to handle them [14]. As patients within the EHR system represent a dynamic cohort, calculation of at-risk person-time is also a challenge [23••].

Data entry errors may occur when patients are treated for multiple conditions by different providers using different EHRs; this may be because a thorough medical history is not verified or because comorbid conditions do not appear in discrete fields and so may be missed in data extraction or pooling efforts [23••, 25]. Important demographic information regarding factors such as race/ethnicity, geographic location, and socioeconomic status is either self-reported, assigned by the provider without input from the patient, or not queried at every encounter. In addition, data regarding domains of social or behavioral health such as psychological stress, physical activity, and social isolation are also commonly missing or incomplete.

Recommendations for Assessing and Reducing Bias in EHR Studies

EHRs must meet five criteria to be considered valid data sources for CVD risk factor surveillance: (1) coverage of the EHR system(s) must include the entire population or a representative subset of the population, (2) cardiovascular health measures should be obtained in a standardized way, (3) measures should be recorded in the EHR in a standardized way, (4) records need to be linked such that equivalent data are correctly merged, and (5) legal authority for data sharing needs to be in place [26••, 27••]. Efforts to meet these requisites can also address concerns about bias. For example, selection bias would be eliminated if coverage of the EHR system(s) captures the entire target population or a simple random sample of that population. Universal healthcare coverage can increase the likelihood of entry into an EHR data repository, as evidenced by the observed increased testing for diabetes and hypercholesterolemia due to expansion of governmental insurance [13••].

Implementation of Unique Patient Identifiers

Tracking patients over time and across systems is a challenge to conducting EHR research. Improvement of interoperability between record systems can be accomplished through legislation or agreements that require a unique identifier to be assigned to each record [28, 29], allowing for easier tracking of individual patients if they move between EHR systems over time. For example, the National Institutes of Health implemented the Global Rare Diseases Patient Registry Data Repository, in which each de-identified record is assigned a global unique identifier (GUID) [27••]. This process enables data from patients to be “integrated; tracked over time; and linked across projects, databases, and biobanks” [30]. More generally, the GUID approach created by the National Institutes of Health can be used to link records across different systems [29]. Mandating collection of specific metrics for population-based studies can also enhance measurement standardization in EHRs [30]. A goal of the Query Health Project, for example, is to validate a standard strategy for clinics to capture quality measures that can then be used for public health research [27••].
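As a rough illustration of how a shared identifier lets records be linked across systems, the following Python sketch derives a stable pseudonymous identifier from normalized identifying fields using a keyed hash. This is not the algorithm used by the National Institutes of Health GUID; the function name `pseudonymous_id`, the choice of fields, and the shared key are all hypothetical and chosen only for illustration.

```python
import hashlib
import hmac

# Hypothetical secret agreed between participating sites; in practice it
# would be managed securely rather than written into source code.
SHARED_KEY = b"example-key-agreed-between-sites"


def pseudonymous_id(first_name: str, last_name: str, birth_date: str) -> str:
    """Derive a stable identifier from normalized identifying fields."""
    # Normalize case and whitespace so trivial formatting differences
    # between EHR systems do not produce different identifiers.
    normalized = "|".join(
        part.strip().lower() for part in (first_name, last_name, birth_date)
    )
    return hmac.new(SHARED_KEY, normalized.encode("utf-8"), hashlib.sha256).hexdigest()


# The same (made-up) patient recorded with different formatting in two systems.
id_a = pseudonymous_id("Maria", "Lopez", "1970-03-15")
id_b = pseudonymous_id(" maria ", "LOPEZ", "1970-03-15")
# A different (made-up) patient.
id_c = pseudonymous_id("Mario", "Lopez", "1970-03-15")

print(id_a == id_b)  # records from the two systems can be linked
print(id_a == id_c)  # a different person receives a different identifier
```

A deterministic keyed hash of this kind only links records whose identifying fields agree after normalization; misspellings or name changes would still require probabilistic matching or manual review.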

Adoption of Standardized Rules for Inclusion/Exclusion Criteria

In estimating the prevalence of CVD risk factors, decisions need to be made systematically to determine which patients within an EHR dataset are included in the denominator (i.e., the population total). For example, if quantifying the prevalence of hyperlipidemia in an EHR dataset, criteria need to be defined regarding whether patients with missing blood cholesterol information in the EHR are considered to be lost to follow-up (and therefore should be excluded from the prevalence calculation) or whether it can be assumed that, because the test was not ordered, they are likely healthy (and therefore can be included in the population total and be considered to have values in the optimal range). If the latter, additional rules need to be applied for how patients with missing values will be categorized with regard to CVD risk. Adoption of clear guidelines on how missing data will be handled is important for consistency, for transparency, and for ensuring that important subpopulations of interest are not excluded from the analytic dataset.
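To make the consequences of the denominator choice concrete, the following minimal Python sketch computes hyperlipidemia prevalence under the two rules described above. The cholesterol values, the 240 mg/dL cutoff, and the function names are illustrative assumptions and are not drawn from any study.

```python
# Hypothetical total cholesterol values in mg/dL; None means never tested.
patients = [250.0, 180.0, None, 260.0, None, 190.0, None, 200.0]

CUTOFF = 240.0  # illustrative threshold for "high" total cholesterol


def prevalence_excluding_missing(values):
    """Rule 1: untested patients are dropped from the denominator."""
    tested = [v for v in values if v is not None]
    return sum(v >= CUTOFF for v in tested) / len(tested)


def prevalence_missing_as_normal(values):
    """Rule 2: untested patients stay in the denominator and count as normal."""
    return sum(v is not None and v >= CUTOFF for v in values) / len(values)


complete_case = prevalence_excluding_missing(patients)
assume_normal = prevalence_missing_as_normal(patients)

print(f"Excluding untested patients: {complete_case:.3f}")
print(f"Counting untested as normal: {assume_normal:.3f}")
```

With these made-up values, the same records yield a prevalence of 0.40 under the first rule and 0.25 under the second, which is why the rule needs to be fixed and reported in advance.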

Application of Statistical Procedures

Statistical approaches can be used to describe and/or reduce the impact of bias after data collection has occurred. In the analysis phase, use of external data sources can help to evaluate and quantify bias in the study population. Several methods, described below, can be considered to integrate external data to reduce the impact of bias on study findings.

Describing how the study sample differs from the target population can be achieved by leveraging publicly available data sources. For example, US Census or state-level vital statistics data can be used to quantify differences in demographic and clinical characteristics of patients in the EHR dataset compared to the general population. Comparing distributions of these characteristics between the study sample and the target population can help inform generalizability of results [15•, 23]. Further, information that includes data on birth, death, pregnancies, and cancer can be validated through linkage with centralized databases and registries that when combined also serve to enhance EHR datasets. In some countries, individual-level data can be linked to population health and lifestyle surveys and data collected by other sectors regarding social factors [31].
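One simple way to describe such differences is sketched below in Python: it compares the age distribution of an EHR sample with target-population shares and summarizes the discrepancy with a Pearson chi-square goodness-of-fit statistic. The counts and population shares are invented for illustration; in practice the shares would come from a source such as census tables.

```python
# Hypothetical inputs: EHR patient counts by age group and the corresponding
# shares of the target population (e.g., taken from census tables).
ehr_counts = {"18-44": 200, "45-64": 300, "65+": 500}
pop_shares = {"18-44": 0.50, "45-64": 0.30, "65+": 0.20}

n = sum(ehr_counts.values())

# Gap between sample share and population share, in percentage points.
gap_pp = {g: 100 * (ehr_counts[g] / n - pop_shares[g]) for g in ehr_counts}

# Pearson chi-square goodness-of-fit statistic: sum of
# (observed - expected)^2 / expected, where the expected count is what a
# representative sample of size n would contain in each group.
chi_square = sum(
    (ehr_counts[g] - n * pop_shares[g]) ** 2 / (n * pop_shares[g])
    for g in ehr_counts
)

for group, gap in gap_pp.items():
    print(f"{group}: {gap:+.1f} percentage points")
print(f"Chi-square statistic: {chi_square:.1f}")
```

In this invented example, adults aged 65 and older are over-represented by 30 percentage points, the kind of imbalance that would motivate the adjustment methods discussed next.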

Post-stratification adjustment standardizes crude estimates according to variables implicated in the selection bias. In the context of EHR data, these variables might include demographic factors such as sex, race/ethnicity, insurance status, and poverty level [15•]. Sample weights can be generated to adjust for over- and under-representation of key population subgroups in the EHR dataset in comparison to the target population. Since inclusion in an EHR system is non-random, controlling for the number of health encounters also accounts for systematic differences between those who regularly or irregularly visit their provider [17].
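The weighting step can be sketched in a few lines of Python. The example below post-stratifies a hypertension prevalence estimate by age group; all counts and population shares are invented for illustration, and a real analysis would typically stratify on several variables at once.

```python
# Hypothetical stratum-level inputs (all numbers are made up).
# ehr_n: patients in the EHR sample; ehr_cases: those with hypertension;
# pop_share: the stratum's share of the target population.
strata = {
    "18-44": {"ehr_n": 200, "ehr_cases": 20, "pop_share": 0.50},
    "45-64": {"ehr_n": 300, "ehr_cases": 90, "pop_share": 0.30},
    "65+": {"ehr_n": 500, "ehr_cases": 300, "pop_share": 0.20},
}

total_n = sum(s["ehr_n"] for s in strata.values())

# Crude prevalence ignores the over-representation of older adults.
crude = sum(s["ehr_cases"] for s in strata.values()) / total_n

# Post-stratified prevalence: weight each stratum-specific prevalence
# by that stratum's share of the target population.
post_stratified = sum(
    s["pop_share"] * s["ehr_cases"] / s["ehr_n"] for s in strata.values()
)

# Equivalent per-patient sample weights: population share / sample share.
weights = {k: s["pop_share"] / (s["ehr_n"] / total_n) for k, s in strata.items()}

print(f"Crude prevalence:           {crude:.2f}")
print(f"Post-stratified prevalence: {post_stratified:.2f}")
print(weights)
```

Here the crude estimate (0.41) overstates the post-stratified estimate (0.26) because the oldest stratum, which has the highest prevalence, makes up half of the sample but only a fifth of the target population.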

Additional frameworks have also been validated for dealing with selection bias specifically [10••]. Propensity score adjustment/matching can be employed to account for systematic differences in health system “users” versus “non-users.” Propensity scores can then be used in analyses to create inverse probability weights to balance observed differences in the two populations with the goal of mimicking a scenario where individuals would be randomized to be included versus excluded from the EHR dataset [32, 33]. Inverse probability weighting to achieve representativeness is still debated but provides the ability to address factors that are associated with inclusion or exclusion in the dataset [8, 15•, 34].
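A minimal sketch of inverse probability weighting follows, written in plain Python. It assumes that covariates are available for everyone in the target population (for example, from a sampling frame), which is often not the case in practice, and it estimates inclusion probabilities within covariate strata as a stand-in for a fitted model such as logistic regression. All counts are invented.

```python
# Hypothetical counts per stratum:
# (population size, number appearing in the EHR, EHR patients with the outcome).
counts = {
    ("insured", "<50"): (400, 200, 20),
    ("insured", "50+"): (200, 160, 64),
    ("uninsured", "<50"): (300, 60, 9),
    ("uninsured", "50+"): (100, 40, 20),
}

# Expand to one record per person in the target population.
people = []
for stratum, (n_pop, n_ehr, n_cases) in counts.items():
    for i in range(n_pop):
        in_ehr = i < n_ehr
        # The outcome is only observed for people who appear in the EHR.
        outcome = (i < n_cases) if in_ehr else None
        people.append({"stratum": stratum, "in_ehr": in_ehr, "outcome": outcome})


def inclusion_probability(stratum):
    """Step 1: estimate P(in EHR | covariates) as the observed stratum fraction."""
    members = [p for p in people if p["stratum"] == stratum]
    return sum(p["in_ehr"] for p in members) / len(members)


propensity = {s: inclusion_probability(s) for s in counts}

# Step 2: weight each EHR patient by the inverse of the inclusion probability.
ehr = [p for p in people if p["in_ehr"]]
for p in ehr:
    p["weight"] = 1.0 / propensity[p["stratum"]]

naive = sum(p["outcome"] for p in ehr) / len(ehr)
ipw = sum(p["weight"] * p["outcome"] for p in ehr) / sum(p["weight"] for p in ehr)

print(f"Unweighted EHR prevalence: {naive:.3f}")
print(f"IPW-adjusted prevalence:   {ipw:.3f}")
```

The weights only correct for differences captured by the measured covariates; if inclusion also depends on unmeasured factors, the weighted estimate remains biased.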

Finally, efforts to reduce missing data at the outset should be explored when possible. For example, use of open-source natural language processing tools can help to incorporate CVD risk factor data that may not appear in discrete fields in the EHR (e.g., family history or behavioral factors that may appear in clinic notes as free text) [35]. Additionally, imputation methods can be applied in scenarios where data are NMAR [13••]. For example, in the case where a fasting glucose test is only ordered for patients with known diabetes risk factors, missing values may be imputed based on observed glucose data from these patients. By contrast, it would be problematic to assume either (1) that all patients with missing glucose data have normal values or (2) that all patients with missing glucose data were lost to follow-up.

Incorporation of Patient-reported Data

Integration of patient-reported outcomes and other contextual information could also reduce the impact of missing data and should be considered for inclusion in EHR systems and surveillance efforts moving forward. In the case of tracking the prevalence of use of tobacco products over time, relying only on providers to accurately record and update this information affects data quality. The addition of standardized questionnaires to the clinic workflow in order to ascertain behavioral factors, then, has the potential to improve EHR completeness and accuracy.

Conclusion

From a public health perspective, understanding the strengths and limitations of using the EHR for surveillance of CVD risk can inform more thoughtful design of epidemiological studies and interpretation of findings that utilize these data sources. Several strategies can also be incorporated in the data collection and analysis phases to reduce the impact of selection or information bias. Provided that its limitations are acknowledged and addressed, the EHR offers a powerful platform for monitoring and characterizing cardiovascular health on a large scale in an efficient and meaningful way, with the ultimate goal of advancing efforts to prevent, detect, and treat CVD to improve population health.