1 Introduction

Fueled by evolving federal regulation [1,2,3] and the Covid-19 pandemic [3,4,5], real-world data (RWD), largely in the form of Electronic Health Records (EHRs) [6], are increasingly being used to generate real-world evidence (RWE). While primarily intended for clinical and administrative use, EHRs contain massive amounts of information about patients’ “medical history including, diagnoses, treatment plans, immunization dates, allergies, radiology images, pharmacy records, and laboratory and test results” [7] that are available for secondary use in research [1]. Unlike randomized clinical trials (RCTs), which are largely conducted in specialized research environments with highly selected participants, EHRs reflect health and disease in the general population and the realities of clinical practice [8,9,10]. EHRs can also provide RWE more quickly and at a lower cost than RCTs [8].

We acknowledge that collections of EHRs can identify trends [11,12,13,14,15], assist in creating clinical decision support [14, 16], and provide valuable post-marketing drug information [10, 17,18,19,20,21], associations [11,12,13], and opportunities for improvement [14,15,16]. On the other hand, we advise against their use for making decisions about causation, unless unusual circumstances apply, such as having uncovered very large relative risks in settings where bias and confounding are minimal. We are most concerned with how faulty or incomplete data in the EHR might lead to errors in drawing inferences about the relationship between characteristics of the patient or treatment and the occurrence, severity, and course of the patient’s disease or response to a treatment. We specifically focus in this paper on the challenges that limit the usefulness of RWE generated from observational studies based on EHR data. We provide a comprehensive overview of the sources and types of biases in EHR research, offer words of caution to readers, and outline potential remedies.

In this paper we identify two sets of processes that are likely sources of the biases that arise along the pathway of generating RWE from EHRs. The first operates at the healthcare system level, where the selection of patients and healthcare interventions, together with imperfect data collection, creates biases; in this setting, EHR data might poorly reflect the actual experiences of patients with the conditions studied. The second operates at the research level and includes the design of the study and the extraction, analysis, and interpretation of the data, all of which can distort inferences derived from RWD. Figure 1 illustrates these two sets of processes with a breakdown of the potential sources of bias associated with each. These sources of bias also trace the flow of RWD from its initial creation to its transformation into RWE, a concept first described by Verheij et al. in 2018 [17].

Fig. 1 Pathway of generating RWE from EHR data

2 Biases Arising at the Healthcare System Level

Research based on EHR data is subject to selection and information biases that are inherent in the healthcare system’s patient population and its EHR format. Selection bias, defined as “systematic differences in characteristics between those who are selected and those who are not” [18], mainly arises from a lack of representativeness, because the population captured in the EHR rarely fully represents its source population. Selection bias also arises from data missingness, either from missing visits or from missing information within visits. Information bias, on the other hand, which is “a flaw in measuring exposure or outcome that results in differential quality of information between compared groups” [18], mainly arises from misclassification of the clinical picture (EHR discontinuity, missing/incomplete data, and inability of the EHR system to capture the patient’s true health status), measurement errors, and variability in data collection methods.

3 Access to Medical Care

3.1 EHR Representativeness

Unlike the general population, all individuals represented in an EHR system have sought and obtained medical care from a specific set of providers [19]. Healthy individuals and patients with milder diseases, without medical coverage, or who use other practitioners are unlikely to be represented in a given EHR system [20, 21]. The geographic “catchment” area of the EHR is the first restriction on the population, but social, demographic, and economic factors further determine a patient’s enrollment in any given EHR [5]. Additional constraints include distance to the healthcare provider, the number of available healthcare systems, the type of healthcare system [22, 23], and the health insurance plan [22, 24]. EHR systems will preferentially include women, the elderly, whites, the more educated, and others more likely to seek medical care, especially primary care [21].

Lack of representativeness is one of the major challenges of EHR data [5, 19], introducing selection bias when study inferences are generalized to broader target populations [19, 24,25,26]. Unlike population-based studies, which also include non-recipients of medical care and in which participant selection is determined by the study sampling plan, enrollment in an EHR is largely driven by the individual patient [19]. Because the mechanisms that drive patients to interact with the healthcare system are incompletely known, analytic methods cannot fully address and control for them [19, 27].

3.2 EHR Discontinuity

Patients whose medical information is captured in EHRs are considered members of “open cohorts,” in which they enter and leave the record system during the period of observation [28, 29]. Patients can drop in and out of an EHR system, for example to seek specialty care elsewhere or to return to a primary healthcare provider [19], or they can drop out of the system entirely because of changes in their disease characteristics or insurance coverage [29].

Patients are referred to as “censored” for the time when they are not being followed. The absence of patient information prior to their entry into the EHR is referred to as left censoring. Absence after their departure from the EHR is referred to as right censoring [5, 29]. Absence of information for a period of time with information available both before and after that period is referred to as interval censoring. The movement of patients across provider systems can result in all three forms of censoring in EHR research.
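To make these distinctions concrete, the short Python sketch below classifies a patient’s follow-up by comparing EHR enrollment dates against a study window. The study window, field names, and enrollment data are illustrative assumptions, not part of any specific EHR schema.

```python
from dataclasses import dataclass
from datetime import date

# Hypothetical study window; dates and field names are illustrative assumptions.
STUDY_START = date(2015, 1, 1)
STUDY_END = date(2020, 12, 31)

@dataclass
class EhrEnrollment:
    first_record: date                  # earliest encounter captured in this EHR
    last_record: date                   # latest encounter captured in this EHR
    gaps: list[tuple[date, date]]       # periods with no coverage between records

def censoring_types(e: EhrEnrollment) -> list[str]:
    """Classify which forms of censoring affect a patient's follow-up."""
    kinds = []
    if e.first_record > STUDY_START:
        kinds.append("left-censored")       # history before EHR entry is unobserved
    if e.last_record < STUDY_END:
        kinds.append("right-censored")      # follow-up after EHR exit is unobserved
    if e.gaps:
        kinds.append("interval-censored")   # coverage gaps between observed periods
    return kinds

print(censoring_types(EhrEnrollment(date(2016, 3, 1), date(2019, 6, 30),
                                    gaps=[(date(2017, 1, 1), date(2017, 9, 1))])))
# -> ['left-censored', 'right-censored', 'interval-censored']
```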

Censoring can cause structural missingness in EHRs [30], leading to an incomplete picture of the history of disease in a given patient. Components of the diagnosis, progression, management, and treatment of disease, as well as the times of their occurrence, are often missing. These absences can lead to information bias and/or selection bias, depending on how the researcher deals with the missingness. Removing censored patients is one solution but may worsen selection bias [31], even when advanced statistical techniques are used to address these biases [19, 22, 27, 32]. Information bias, on the other hand, arises when these patients are included but outcomes are misclassified due to the EHR’s inability to capture the true history of the disease [19].

4 Provision of Care

Whether a clinical event is recorded in the EHR is dependent on policies, practices, referral and reimbursement systems, and professional guidelines that differ across healthcare systems, creating “healthcare system bias” [17]. These factors affect the initiation, frequency, content, and documentation of clinical encounters recorded in the EHR [17]. For example, professional guidelines and reimbursement systems promote blood pressure readings at most in-person visits in some countries, but only if required by clinical conditions in others, creating selection bias for this measure [17].

Patient encounters can take place in inpatient or outpatient settings and can represent primary, specialty, or emergent care, depending on the services the healthcare system provides. These settings, as well as the practice workload, shape the nature and intensity of data recorded in the EHR [22]. Provision of medical services might also be influenced by the patient’s type of insurance [4]. Understanding the coding processes in the EHR system is essential to minimize information bias [17], since insurance type often dictates the billing codes used to establish diagnoses and treatments in EHRs [4, 5, 23].

Moreover, the 21st Century Cures Act [33] requires all medical-care providers to offer patients access to their EHR information, which may influence the recording of data thought to carry a stigma or might stimulate medical litigation [17, 34].

5 Acquisition and Documentation of Medical and Administrative Data

5.1 Data Collection and Measurement

EHR data are composed of structured fields (smart forms and clinical templates), unstructured data (clinical notes, free text), and peripheral documents (imaging data, pathology reports) [10]. Clinical information is often obtained by examining all of these EHR sections [35]. Owing to the variability in data collection methods, EHR data are subject to misclassification bias and/or measurement errors. Recent papers have recommended statistical techniques for minimizing such biases in EHR research [19, 27, 36,37,38].

5.1.1 Structured Data

The International Classification of Diseases (ICD) is the main diagnostic coding scheme used across the US healthcare system [4, 39], and researchers need to consider any changes to the coding system or provider terminology over the years, as occurred in 2015 when the US healthcare system transitioned from ICD-9 to ICD-10 [4, 5]. The presence of an ICD code in the EHR indicates the existence of a disease, but its absence does not assure the absence of the disease; ICD codes in the EHR thus have high specificity but low sensitivity [28]. When the diagnosis is not obvious, a patient can also carry a series of “rule-out” ICD codes for testing and follow-up until the final correct diagnosis is made, and these rule-out codes can be mistaken for actual diagnoses [39].
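As a hedge against rule-out codes, phenotyping studies often require corroborating evidence before accepting a coded diagnosis. The Python sketch below applies one such heuristic, requiring two coded encounters on distinct dates; the threshold, column names, and data are illustrative assumptions rather than a validated rule.

```python
import pandas as pd

# Illustrative sketch: require at least two coded encounters on distinct dates
# before accepting a diagnosis, reducing false positives from rule-out codes.
codes = pd.DataFrame({
    "patient_id": [1, 1, 2, 3, 3],
    "icd10":      ["E11.9", "E11.9", "E11.9", "E11.9", "E11.9"],
    "date":       pd.to_datetime(["2021-01-05", "2021-04-12",
                                  "2021-02-01",                  # single code: possibly rule-out
                                  "2021-03-03", "2021-03-03"]),  # same-day duplicate
})

confirmed = (
    codes.drop_duplicates(["patient_id", "icd10", "date"])       # collapse same-day repeats
         .groupby(["patient_id", "icd10"])["date"].nunique()
         .loc[lambda n: n >= 2]                                  # require >= 2 distinct dates
)
print(confirmed)  # only patient 1 meets the two-encounter rule
```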

Disease status may be incompletely estimated from diagnostic and billing codes, while other structured data may provide a fuller picture of disease status [35]. Although a structured item, medication recording is also not straightforward. The same chemical formulation of a medication can be described under several brand or generic names [39] and in extended- or immediate-release variations. While information on ordered drugs is usually recorded in the EHR, information on whether the patient actually obtained and took the prescribed medication at the required dosage and timing is rarely available, because many EHRs are not linked to pharmacy dispensing information systems [39].

Laboratory measures and vital signs are also among the readily available structured data in EHRs, but their use in research has its own challenges [39]. The main standard coding systems for laboratory tests and results are the Logical Observation Identifiers Names and Codes (LOINC), the Systematized Nomenclature of Medicine (SNOMED), and the Current Procedural Terminology (CPT) [40]. Unfortunately, implementation is still not universal [1] and many clinical laboratories use local coding systems [40]. Similarly, the coding of vital signs generally lacks standardized approaches, resulting in inconsistency. According to the working group of the National Heart, Lung, and Blood Institute, blood pressure measurement in clinical practice is still sub-optimal and deviates from the recommended guidelines [41]. Measures that are self-reported can also differ from those obtained by medical professionals, resulting in additional sources of variance [40, 42].

Substantial variability in the quality of data recorded in EHRs has also been attributed to the software packages themselves [17, 43]. Fields are at times flawed because they are too broad, demand unnecessary detail, or require a response/value for an inapplicable field [4]. EHR software can also change over time to reflect emerging technologies or changes in coding or billing practices [4, 44].

5.1.2 Unstructured Data

Unstructured or semi-structured data are any form of data in the EHR that does not conform to a pre-designed, organized structure [39]. Clinician notes, for example, can provide important information not present in structured fields, such as changes in symptom severity and side effects of medication [5, 7, 9]. Artificial intelligence approaches, including natural language processing (NLP) and machine learning (ML), may be able to transform unstructured notes into useful data [7].
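As a simplified illustration of this idea, the sketch below flags medication side effects mentioned in a free-text note while skipping negated mentions. Real NLP pipelines use far richer context and negation handling; the keyword list, regular expressions, and note text are assumptions for illustration only.

```python
import re

# Minimal sketch of turning free-text notes into a structured flag. Real
# pipelines use full NLP toolkits (negation detection, context, ontologies);
# the term list and note below are illustrative assumptions.
SIDE_EFFECT_TERMS = re.compile(r"\b(nausea|dizziness|rash|fatigue)\b", re.I)
NEGATION = re.compile(r"\b(denies|no|without)\b[^.]*$", re.I)

def flag_side_effects(note: str) -> list[str]:
    hits = []
    for sentence in note.split("."):
        match = SIDE_EFFECT_TERMS.search(sentence)
        # Keep the term only if no negation cue precedes it in the sentence
        if match and not NEGATION.search(sentence[:match.start()]):
            hits.append(match.group(0).lower())
    return hits

print(flag_side_effects("Patient reports nausea since starting metformin. Denies rash."))
# -> ['nausea']
```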

In addition, patients now have electronic opportunities to record self-reported events (e.g. asthma attacks, seizures), episodes of pain, and periodic assessments of “quality of life” [45]. Recording of these patient-reported outcome measures (PROMs) in the EHR is highly variable. Data from new technologies, such as wearable devices and health-related apps, are also increasingly being transferred to EHRs [46, 47], often with little attention to their quality, accuracy, and reliability [8, 48].

Finally, some of the unstructured information in the EHR is based on the clinician’s discretion and might be subject to implicit bias [5]. Clinicians’ unintentional judgments and evaluations about patients’ attributes such as race or gender may affect what is documented or omitted [49]. Patients and their families have identified errors [50,51,52] and found offensive content [34] in EHRs.

5.1.3 Peripheral Documents

Peripheral documents are mainly composed of imaging (x-rays, computerized tomography [CT] scans, magnetic resonance imaging [MRI] scans, ultrasounds, etc.), non-numerical test results (electrocardiograms [ECGs], pathology results, etc.), and PDF files (documents scanned from sources outside the EHR). Extracting this information through NLP can provide a better understanding of the disease status of the patient than reliance on structured diagnostic codes [39]. However, these documents are not easily extractable [10].

5.2 Data Missingness

Structured data elements in EHRs may also be incompletely recorded [30, 53]. Structured data elements are likely to be most complete for health status and clinical information [10], but elements such as substance use and family history of illness are often missing, and the absence of an entry cannot be assumed to represent absence of the behavior or condition [35]. Missingness can occur at random, but it can also arise when data were never intended to be collected (missing not at random) [7]. For example, some health information is not recorded in the EHR because it is not relevant to the patient. This is known as “informative missingness”: the absence of data conveys information about why it is missing [20]. Laboratory testing requested for some patients, for instance, suggests an underlying health condition and hence provides information about the health status of the patient. Not only do the actual values of the laboratory measurement carry information, but so do the number of measurements, the indications for the multiple measurements, and the dates of the measurements [23, 54].
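One common response is to treat the measurement process itself as data. The sketch below derives a measurement count and an “ever measured” flag alongside the lab value, under the assumption (illustrative, not prescriptive) that ordering frequency carries clinical signal; column names and data are made up.

```python
import numpy as np
import pandas as pd

# Sketch of "informative missingness": the fact that a lab was ordered at all,
# and how often, can itself be predictive. Data and names are assumptions.
labs = pd.DataFrame({
    "patient_id": [1, 1, 2],
    "hba1c":      [7.2, 6.8, np.nan],   # patient 2 was never tested
})

features = labs.groupby("patient_id")["hba1c"].agg(
    hba1c_mean="mean",
    hba1c_n_measured="count",           # measurement count is itself a signal
)
features["hba1c_ever_measured"] = features["hba1c_n_measured"] > 0
print(features)
```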

Missing data in EHRs can lead to selection bias (when only complete records are analyzed) and misclassification bias [37]. Statistical techniques for missing data exist [30, 32, 37, 55, 56], but adjustments, such as imputation, cannot fully compensate for the bias in all circumstances [57,58,59].

Variables used to address confounding, such as sociodemographic status, health risk behaviors (diet, physical activity, substance use), and psychological stress, are often recorded infrequently [20, 60] and, when recorded, are usually of lower quality than data collected for research, which presents a challenge in accounting for confounders in EHR research. The FDA recommends linking EHR data to other data sources or attempting to collect additional data to capture information about important unmeasured or improperly measured confounders [7]. Several statistical techniques also attempt to address this problem, including inverse probability weighting [61], instrumental variable analysis [62,63,64,65], difference-in-differences [62, 66], interrupted time series [67], perturbation variable adjustment [63, 68], propensity score calibration [63, 69,70,71,72], and sensitivity analysis [73].
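To illustrate one of these techniques, the sketch below applies inverse probability of treatment weighting with a propensity score to synthetic data in which treatment assignment is confounded by disease severity. All variable names, coefficients, and the data-generating process are assumptions chosen to make the bias and its correction visible.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Minimal sketch of inverse probability of treatment weighting (IPTW).
rng = np.random.default_rng(0)
n = 5_000
age = rng.normal(60, 10, n)
severity = rng.normal(0, 1, n)
p_treat = 1 / (1 + np.exp(-(-0.05 * (age - 60) + 0.8 * severity)))  # confounded assignment
treated = rng.binomial(1, p_treat)
outcome = 0.5 * severity - 0.3 * treated + rng.normal(0, 1, n)      # true effect: -0.3

# Fit a propensity model and weight each patient by the inverse probability
# of the treatment they actually received.
X = np.column_stack([age, severity])
ps = LogisticRegression().fit(X, treated).predict_proba(X)[:, 1]
weights = np.where(treated == 1, 1 / ps, 1 / (1 - ps))

naive = outcome[treated == 1].mean() - outcome[treated == 0].mean()
iptw = (np.average(outcome[treated == 1], weights=weights[treated == 1])
        - np.average(outcome[treated == 0], weights=weights[treated == 0]))
print(f"naive: {naive:.2f}, IPTW: {iptw:.2f}")  # IPTW estimate is closer to -0.3
```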

6 Biases Arising at the Research Level

Biases can arise at every step of the research process, from extracting the research database to interpreting and reporting the findings.

7 Extraction and Acquisition of Research Data

Although Epic, Cerner, and Meditech are the most common EHR software packages in the US, hundreds of vendors develop EHR software, and no common or uniform format exists [39, 47]. The structure of EHRs is mainly vendor-specific [4], and differences in user interface can affect what data can be recorded and extracted from the database [17, 35]. Since the data extraction software is usually proprietary, the limitations and biases inherent in the software are nearly impossible to determine. Confidentiality agreements with EHR vendors make it difficult to evaluate the quality of the data extracted and the accuracy of the extraction tool [17, 35], and different extraction procedures may produce different results [74]. The challenges are further exacerbated when integrating EHR software that lacks interoperability, or “the ability of two or more products, technologies, or systems to exchange information and to use the information that has been exchanged without special effort on the part of the user” [75]. This lack of interoperability also makes it difficult to track patients and identify duplicate patient records across multiple EHR systems, especially because, unlike the situation in some European countries, no single identification number is used across healthcare systems in the United States [9].
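In the absence of a shared identifier, deduplication across systems typically relies on record linkage. The sketch below shows a minimal deterministic match on normalized name and birth date; production systems use probabilistic linkage over many more fields, and the field names and records here are assumptions.

```python
import pandas as pd

# Sketch of deterministic record linkage across two EHR extracts with no
# shared patient identifier. Matching on normalized name + birth date is a
# simplification; data and column names are illustrative assumptions.
ehr_a = pd.DataFrame({"name": ["Jane Doe"], "dob": ["1980-02-01"], "mrn_a": ["A17"]})
ehr_b = pd.DataFrame({"name": ["JANE DOE "], "dob": ["1980-02-01"], "mrn_b": ["B93"]})

for df in (ehr_a, ehr_b):
    # Normalize before matching so trivial formatting differences don't split records
    df["key"] = df["name"].str.strip().str.lower() + "|" + df["dob"]

linked = ehr_a.merge(ehr_b, on="key")   # same person, two local identifiers
print(linked[["mrn_a", "mrn_b"]])
```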

Further, raw EHR data can be transformed into research-grade variables through operational phenotyping algorithms [39, 76]. Two types of algorithms help define the study variables. The first involves the creation of a rule-based algorithm and sequential flow chart that identifies codes and clinical information in structured and unstructured fields; this method can be lengthy, might not use all the information available in the EHR, and can be biased by the judgments of the team creating the algorithm [76]. The second involves machine learning algorithms [77], which have their own challenges and biases [4, 6, 39].
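The sketch below illustrates the rule-based flavor of phenotyping: a toy diabetes phenotype that requires agreement between at least two independent evidence streams (codes, labs, medications). The code lists, thresholds, and field names are illustrative assumptions, not a validated algorithm.

```python
# Sketch of a rule-based computable phenotype, loosely in the spirit of
# published type 2 diabetes algorithms. Thresholds, code lists, and field
# names are illustrative assumptions.
def has_t2d_phenotype(patient: dict) -> bool:
    coded   = any(c.startswith("E11") for c in patient["icd10_codes"])
    lab     = any(v >= 6.5 for v in patient["hba1c_values"])       # diagnostic HbA1c
    treated = bool(set(patient["medications"]) & {"metformin", "glipizide"})
    # Require agreement between two independent evidence streams to reduce
    # misclassification from rule-out codes or one-off lab values.
    return sum([coded, lab, treated]) >= 2

print(has_t2d_phenotype({
    "icd10_codes": ["E11.9"],
    "hba1c_values": [7.1],
    "medications": ["lisinopril"],
}))  # -> True (code + lab agree)
```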

8 Research Design and Analysis

We list here several specific biases that can arise because of design and/or analysis issues in EHR research [78,79,80]. The list is not exhaustive but covers the most common biases affecting validity in EHR-derived research. Solutions to address these biases are suggested in the references cited for each one.

8.1 Berkson Bias

Berkson bias, or admission rate bias, is a type of selection bias resulting from the fact that patients with more than one condition are more likely to be hospitalized than patients with just one condition, creating a spurious correlation between diseases that are independent in the general population [81,82,83].
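A short simulation makes the mechanism visible: two diseases that are independent in the population become negatively correlated once analysis is restricted to admitted patients, because either disease raises the probability of admission. All prevalences and admission probabilities below are arbitrary assumptions.

```python
import numpy as np

# Sketch: Berkson bias as collider stratification. Conditioning on hospital
# admission induces a spurious (negative) association between two diseases
# that are independent in the population. Parameters are assumptions.
rng = np.random.default_rng(2)
n = 1_000_000
a = rng.binomial(1, 0.10, n)              # disease A, independent of B
b = rng.binomial(1, 0.10, n)              # disease B
p_admit = 0.02 + 0.30 * a + 0.30 * b      # either disease raises admission risk
admitted = rng.binomial(1, p_admit) == 1

print(f"population corr(A, B):  {np.corrcoef(a, b)[0, 1]:+.3f}")                      # ~ 0
print(f"in-hospital corr(A, B): {np.corrcoef(a[admitted], b[admitted])[0, 1]:+.3f}")  # negative
```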

8.2 Informed Presence Bias

Analogous to Berkson’s bias [84, 85], informed presence bias is a consequence of the nonrandom presence of patients and their details in the EHR system. People who have health challenges are more likely than others to seek care and have more medical encounters [24, 84]. When using data from these visits [86], a researcher unintentionally conditions on presence in the study sample [24, 84]. A prevalent but under-reported disease in the population will be more commonly documented in the sick because they are monitored more closely [87].

8.3 Prevalent-User Bias

Prevalent-user bias is a form of selection bias that is mainly present in drug effect studies. The bias arises when prevalent users are compared to non-users, especially when the treatment effects or the hazard of developing the outcome vary with time [25, 88]. Prevalent users are usually considered to be more tolerant of the treatment; i.e., they have “survived” early use [25, 88]. Therefore, if the risk of treatment-related outcomes is highest at the beginning of treatment, the prevalent-user sample will consist of less susceptible patients and will consequently favor the treatment [25, 80, 88].

8.4 Immortal Time Bias

Immortal time bias arises when a time interval exists between the assigned time of entry into the study and the time of exposure assignment; it is often found in real-world cohort studies assessing treatment/drug effects [89]. This waiting time requires exposed participants to remain “immortal” and outcome-free until treatment assignment. Participants who experience the outcome before they have the chance to receive the treatment are classified as unexposed [79, 89,90,91]. Consequently, the exposed group has a built-in survival advantage over the unexposed group and will appear to be protected, but the protection is artificial [89].
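The simulation below sketches this mechanism under the assumption of no true treatment effect: naively crediting all follow-up of ever-treated patients to the exposed group yields an artificially protective rate ratio, while reallocating pre-treatment person-time to the unexposed group recovers a null ratio. All rates and distributions are arbitrary assumptions.

```python
import numpy as np

# Sketch of immortal time bias with synthetic data and no true effect.
rng = np.random.default_rng(1)
n = 100_000
t_treat = rng.exponential(2.0, n)        # time from entry to treatment start
t_event = rng.exponential(5.0, n)        # time from entry to outcome (treatment has no effect)
ever_treated = t_treat < t_event         # only outcome-free patients can start treatment

# Naive: all follow-up of ever-treated patients counted as exposed,
# including the "immortal" pre-treatment interval.
naive_exposed_rate = ever_treated.sum() / t_event[ever_treated].sum()
naive_unexposed_rate = (~ever_treated).sum() / t_event[~ever_treated].sum()

# Corrected: pre-treatment time contributes to the unexposed person-time.
exposed_time = (t_event - t_treat)[ever_treated].sum()
unexposed_time = t_event[~ever_treated].sum() + t_treat[ever_treated].sum()
corr_exposed_rate = ever_treated.sum() / exposed_time
corr_unexposed_rate = (~ever_treated).sum() / unexposed_time

print(f"naive rate ratio:     {naive_exposed_rate / naive_unexposed_rate:.2f}")  # < 1: artificial protection
print(f"corrected rate ratio: {corr_exposed_rate / corr_unexposed_rate:.2f}")    # ~ 1: no true effect
```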

8.5 Lag-Time Bias

Lag-time bias is also a time-dependent bias, but one that relates to follow-up time after exposure assignment [88, 92]. The principal idea of lag-time bias is that the risk of manifesting the outcome might not start immediately after the onset of exposure. Similarly, the risk might not end immediately after the exposure terminates.

8.6 Verification Bias

Verification bias occurs “when there is a difference in testing strategy between groups of individuals, leading to differing ways of verifying the disease of interest” [93]. It arises either when all patients receive an index test but only a proportion of them continue to receive the reference test for disease verification, or when patients are allocated to one of two reference tests based on the results of their index test [94].

In essence, verification bias occurs when the patients are not randomly selected to receive the reference test. In real-world clinical practice, factors like cost, invasiveness, individual patient susceptibility to risks as well as their preferences, and other healthcare system factors can all play a role in the nonrandom assignment of patients to subsequent diagnostic testing [95, 96].

8.7 Protopathic Bias

Protopathic bias occurs when a treatment/drug is prescribed to treat early signs and symptoms of a disease that has not yet been diagnosed [97,98,99,100,101]. It is the erroneous assumption that the drug caused the outcome when in fact the outcome gave rise to the treatment, a form of “reverse causation” [102, 103].

8.8 Confounding by Indication

Although not a bias per se [88, 104, 105], confounding by indication is the misinterpretation of an association between a drug/treatment and an outcome when the indication for selecting the drug/treatment contributes to the outcome [25, 106, 107]. The “indications” or reasons for treatment, such as the severity of the disease [88, 104,105,106,107], the frailty of the patient [88, 107], and the physician’s preference for this drug for this patient [104, 105], are considered confounders since they are associated with both the treatment and the outcome [105, 106].

9 Research Results and Interpretation

Since most EHR data are collected without a-priori research questions, their validity, relevance, and fitness for the specific research question need to be assessed [108]. Some analyses, however, go beyond this exploratory step, modifying pre-established study elements (i.e. inclusion/exclusion criteria, variable selection, variable definition, and analysis plan) so that they yield only favorable results [9, 109]. Up-front transparency about the research protocol, therefore, is instrumental in evaluating the quality and validity of the study. Using reporting guidelines like RECORD (Reporting of Studies Conducted using Observational Routinely Collected Health Data), MINORS (Methodological index for non-randomized studies), GRACE (The Good Research for Comparative Effectiveness), HARPER (HARmonized Protocol Template to Enhance Reproducibility), and STaRT-RWE (Structured Template for planning and Reporting on the implementation of Real World Evidence studies) is recommended for better transparency [2, 108,109,110,111,112,113]. Also, registering RWE studies in publicly available databases prior to the execution of the study is likely to maximize transparency [109].

The Newcastle-Ottawa Quality Assessment Scale [114] provides a composite score of assessments of the representativeness of non-randomized studies, the quality of the exposure and outcome measures, avoidance of biases, and adjustments in the analyses. As a composite score, it has the potential to avoid information overload [115] and to intensify the signal of interest [116]. With these advantages comes the potential for weighing components in less-than-desirable ways [117, 118]. Consequently, we advise paying attention to the components as well as the score.

10 Challenges of Pooling Multi-Institutional EHRs

While pooling EHR databases is recommended for validating study variables [24, 25], minimizing information bias due to EHR discontinuity [25], and increasing power [1], doing so adds another level of complexity to the above-mentioned challenges encountered within an individual EHR system. Integrating multi-institutional EHR data requires careful evaluation of the heterogeneity in medical practices, reimbursement systems, organizational policies, and demographic characteristics of the catchment areas [5, 9, 35]. For example, the patient profiles in academic healthcare systems, suburban practices, and federally qualified health centers are likely to differ [5]. Such heterogeneity can affect the type and quality of data captured, which in turn complicates the integration of these EHR databases.

Pooling multiple EHR systems is probably best achieved by a multidisciplinary team of clinicians, scientists, informaticians, ML experts, and ethics experts who know the practices of their healthcare system and how data are captured and recorded at their own institutions [35]. Adjusting for clustering or center effects can reduce bias, but is unlikely to eliminate it [119]. Finally, even when data precision, completeness, interoperability, and harmonization are addressed, security and patient privacy remain a concern [44, 78].

Given the challenges of pooling multiple EHR databases, several initiatives have created networks of RWD that collect and map information to a common data model (CDM) with a consistent format and content [9]. A CDM standardizes data collection across EHR systems and facilitates interoperability and data sharing [39]. Some of the main initiatives are the Observational Health Data Sciences and Informatics (OHDSI) [120], the National Patient-Centered Clinical Research Network (PCORnet) [121], and, most recently, the National COVID Cohort Collaborative (N3C) consortium, which emerged in response to the COVID-19 pandemic [122]. Although these initiatives are promising for establishing standardized RWD, they are still under development [4].

11 The Future

More recently, the Office of the National Coordinator for Health Information Technology, in the US Department of Health and Human Services, released the Trusted Exchange Framework and Common Agreement (TEFCA), which enables the creation of nationwide data-sharing networks known as Qualified Health Information Networks (QHINs). These QHINs are expected to connect to one another to support a “network of networks,” resulting in national health information exchange. TEFCA also aims to create a common set of practices to promote homogeneity of data collected in EHRs [123, 124]. Among these practices is federated learning [125, 126], which enables multiple contributors “to build a common, robust machine learning model without sharing data, thus addressing critical issues such as data privacy, data security, data access rights and access to heterogeneous data” [127]. With the implementation of TEFCA, we can probably expect a marked increase in the use of RWE for research and post-marketing surveillance.
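As a rough sketch of the federated idea, the toy example below fits a model locally at three hypothetical sites and averages only the fitted parameters, never the patient-level data. The local learner, data, and one-shot averaging scheme are simplifying assumptions; real deployments add iteration, secure aggregation, and governance.

```python
import numpy as np

# Minimal sketch of federated averaging: each site fits a model on its own
# data, and only model parameters are shared and averaged. Synthetic data;
# ordinary least squares stands in for any local learner.
rng = np.random.default_rng(3)

def local_fit(X, y):
    return np.linalg.lstsq(X, y, rcond=None)[0]

site_params = []
true_w = np.array([0.5, -1.0])
for _ in range(3):                       # three hospitals; data never pooled
    X = rng.normal(size=(500, 2))
    y = X @ true_w + rng.normal(0, 0.1, 500)
    site_params.append(local_fit(X, y))

global_w = np.mean(site_params, axis=0)  # server averages parameters only
print(global_w)                           # ~ [0.5, -1.0]
```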

12 Conclusion

Now, more than ever, EHRs are being perceived as unique sources of data for clinical research, providing unprecedentedly large volumes of real-time data from real-world settings. Merely having access to big data, however, does not minimize or eliminate bias. Because a large sample size increases statistical precision, flawed big data increase the chance of confidently drawing biased inferences, a phenomenon known as the “big data paradox” [19, 20]. Large datasets, therefore, do not necessarily lead to quality research and valid RWE.

In summary, EHRs might not provide a complete reflection of the patient and his/her health status. Instead, they reflect the utilization of healthcare services and the EHR recording process [76]. Limited data quality and the plethora of biases in EHR data prompt us to conclude with words of caution. As appealing as EHR data might appear, we recommend that investigators carefully design their studies with the above-mentioned challenges in mind. Even after what might be considered extreme efforts to maximize data quality and minimize bias, humility is encouraged.