Background

Electronic Health Records (EHR) have been increasingly used as a tool for public health surveillance by local and national jurisdictions [1]. For example, recent studies in New York City (NYC) reported that prevalence estimates from NYC Macroscope, an EHR-based surveillance system in NYC [2], were comparable to survey-based estimates for diabetes, hypertension, and smoking [3, 4]. EHR often cover more people (n ≥ 100,000) than traditional population health surveys, and once the infrastructure is in place, data collection occurs in near real time without additional recruitment or interviewing costs.

Despite these advantages, prevalence estimates from EHR can be biased for two main reasons. The first is selection bias: EHR may not represent the target population. For example, the patient population in NYC Macroscope under-represents young men, over-represents patients living in high-poverty neighborhoods, and includes only patients who visit primary care providers connected to a particular EHR system [2]. Selection bias can be corrected, if modeled correctly, by post-stratification. The other source of error is the misclassification of health outcomes, which is the main interest of our study. It comprises measurement error (e.g., due to the use of non-standardized instruments across sites), extraction error, and the collection of proxy measurements (e.g., when self-reports and actual measurements are recorded without distinction). McVeigh et al. [3] reported such subject-level discrepancies by examining a chart review of participants who both visited NYC Macroscope providers and participated in the NYC Health and Nutrition Examination Survey (HANES), a population-representative survey with field interviews and biospecimen collection. Treating the NYC HANES measurements as the “gold standard,” the chart review found a 5% subject-level error for obesity, 19% for depression, and 19% for influenza vaccination. Notably, the sensitivity (i.e., the proportion of medical conditions identified in NYC HANES that were also indicated in the EHR) was as low as 31% for depression and 19% for influenza vaccination. In a later study, McVeigh et al. [5] extracted chart data from more than 20 additional EHR software systems used by primary care providers and repeated a similar study for 190 participants of the 2013–14 NYC HANES. For public health surveillance systems using EHR, there is an urgent need for methods that estimate the prevalence of health indicators from large, real-time EHR while correcting the potential bias using external sources.

Many existing methods allow investigators to pool multiple data sources, and some may be suitable for the unique context of combining big data with a small gold-standard survey. They can be classified by whether the subjects are linked at the individual level and whether potential biases are accounted for. For data sources that are unavailable at the individual level, aggregate statistics are pooled across the sources. For example, Thompson [6] developed methods to combine aggregate statistics from standardized surveys of an international tobacco control project to find programs that are effective in reducing tobacco use. She studied several approaches, including a model with random effects for country. However, her model assumed that all surveys were equally likely to be biased and that the biases across countries canceled each other out. A handful of works address pooling a gold-standard source with potentially biased sources [7,8,9,10,11]. Earlier, Mosteller [9] studied ways to combine the means from two samples when one is potentially biased. Mosteller’s estimator, chosen as one end of the spectrum of methods we consider, will be discussed further in the following section. Lohr and Brick [7] explored methods for pooling domain-level estimates from two surveys that measure victimization prevalence: their gold-standard survey, the United States National Crime Victimization Survey, and a larger but potentially biased telephone companion survey. They compared ten methods that combine a gold-standard survey with another, biased data source, including calibration methods and weighted averages of the estimators from the two sources with no bias adjustment (i.e., unadjusted dual-frame estimators), with a bias adjustment pooled across the domains, and with domain-specific bias adjustments. The last method performed best. Another estimator that performed well was the multiplicative bias estimator published earlier [11]. Manzi et al. [8] used a Bayesian hierarchical model to pool domain-level smoking prevalence estimates from seven surveys in the eastern regions of England. Similarly, Raghunathan et al. [10] used a Bayesian hierarchical model to combine potentially biased county-level prevalences of cancer outcomes and risk factors from a larger telephone survey, the Behavioral Risk Factor Surveillance System, with the unbiased (or less biased) face-to-face National Health Interview Survey (NHIS), which covers fewer counties and fewer households.

When data are available at the individual level, Kim and Rao [12] developed a method to combine a small survey containing both the outcome measurement and auxiliary information with a larger independent survey containing only the auxiliary information. Park et al. [13] developed a model to pool a gold-standard source with the outcome measurement and auxiliary information with another independent source with a potentially biased outcome and the same auxiliary measures. Schenker et al. [14] used multiple imputation to combine self-reported outcomes from a large survey, the NHIS, with the smaller National Health and Nutrition Examination Survey (NHANES), which has both clinical and self-reported outcomes. They imputed clinical measurements of health outcomes for the participants of the larger survey by modeling both the underlying mechanism of misclassification of the outcomes and the mechanism of inclusion into each survey. We will study this method further in the following section as the other end of the spectrum. For more than two proxy outcome variables measured with lagged overlaps, Gelman et al. [15] and He et al. [16] used similar multiple imputation approaches.

In this study, we aim to demonstrate that the joint analysis of a large EHR with a much smaller gold-standard health survey can improve the accuracy of prevalence estimates. Our aim is not to study all available methods but to demonstrate two statistical procedures at the two ends of the spectrum. First, we adopt Mosteller’s method [9] to pool two estimators when one is potentially biased. It only requires the prevalence estimates from the two data sources and their standard errors. Second, we adopt the method of Schenker et al. [14], which uses iterative multiple imputation of subject-level health outcomes for both sources. This procedure requires information to link some subjects between the two sources and modeling both the mechanisms underlying the misclassifications in the EHR and the inclusion probabilities for both sources. We demonstrate the statistical properties of the two estimators using simulation studies. Finally, we illustrate the methods by analyzing the 2013–14 NYC HANES, the 2013 NYC Macroscope, and a small study that linked some subjects between the two sources.

Methods

We consider two data sources. The first is a health survey of a smaller sample S1 with survey weights w1 that is representative of the target population. Measurement Y1 in the survey is the gold standard, and hence \( \hat{p}_1 = \sum_{i\in S_1} w_{1,i} Y_{1,i} / \sum_{i\in S_1} w_{1,i} \) is an unbiased estimator of the prevalence of interest p1. The other data source is an EHR of a larger sample S2 that becomes representative of the population with post-stratified weights w2. Measurement Y2 in the EHR may have subject-level errors, and \( \hat{p}_2 = \sum_{i\in S_2} w_{2,i} Y_{2,i} / \sum_{i\in S_2} w_{2,i} \) may be a biased estimator of p1. We denote the logit of the prevalence as ϕ1 = logit(p1) and the logits of the prevalence estimators from the two sources as y1 = logit(\( \hat{p}_1 \)) and y2 = logit(\( \hat{p}_2 \)). We assume that the covariance between the two estimators is ignorable since the number of overlapping subjects (S1 ∩ S2) is typically very small relative to the size of the EHR (S2). We can link the subset of overlapping subjects (Sc) between the two sources. Figure 1 outlines the data structure. We used the statistical software R for all statistical analyses [17, 18].
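
As a minimal illustration of these quantities (not the authors' code), the following R sketch computes the weighted prevalence estimates and their logits; the data frames, column names, and values are assumptions made purely for illustration.

```r
## Minimal sketch: weighted prevalence and its logit from each source.
## The data frames, columns, and values below are hypothetical.
set.seed(1)
survey <- data.frame(y1 = rbinom(500, 1, 0.30),       # gold-standard outcome Y1
                     w1 = runif(500, 0.5, 2))         # survey weights
ehr    <- data.frame(y2 = rbinom(100000, 1, 0.32),    # EHR outcome Y2
                     w2 = runif(100000, 0.5, 2))      # post-stratified weights

weighted_prev <- function(y, w) sum(w * y) / sum(w)
logit <- function(p) log(p / (1 - p))

p1_hat <- weighted_prev(survey$y1, survey$w1)   # \hat{p}_1, unbiased for p1
p2_hat <- weighted_prev(ehr$y2, ehr$w2)         # \hat{p}_2, possibly biased for p1
y1 <- logit(p1_hat)
y2 <- logit(p2_hat)
```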

Fig. 1 Data elements in the 2013–14 NYC HANES, limited to the in-care population and stratified by whether the participant was in the chart review study, and the 2013 NYC Macroscope

Mosteller estimator

At the core of the problem is a simple question: “Can we gain by pooling two estimates when one is possibly biased but comes from a larger sample?” Mosteller (1948) [9] studied whether to pool two sample means when one is potentially biased. He compared the mean squared error (MSE) of several mean estimators: the unbiased mean, a test-then-pool estimator (i.e., pooling the two means only when their difference was not significant), and the maximum likelihood estimator (MLE) assuming a mean-zero Gaussian bias. The last estimator showed the smallest MSE. We adapt his approach to account for unequal sample sizes and unequal variances. The estimator is a weighted average of y1 and y2:

$$ {\hat{\phi}}^{\mathrm{M}}=\left({k}_1{y}_1+{k}_2{y}_2\right)/\left({k}_1+{k}_2\right). $$

It can be shown that the MSE of this family of estimators is minimized when \( k_1 = 1/\sigma_1^2 \) and \( k_2 = 1/(\tau^2 + \sigma_2^2) \), where σ1 and σ2 are the standard errors of y1 and y2, and τ = E(y2) − ϕ1 is the bias of y2. The estimator is also the MLE of ϕ1 under the model yj = ϕ1 + 1(j = 2)θ + ej, where θ and ej are mutually independent zero-mean normal variables with variances τ2 and \( \sigma_j^2 \), respectively. The variance and bias parameters are estimated by the consistent estimators \( \hat{\sigma}_1^2 = s_1^2 \), \( \hat{\sigma}_2^2 = s_2^2 \), and \( \hat{\tau}^2 = (y_1 - y_2)^2 \). For example, \( s_j^2 \) can be the sample variance estimated using the survey weights.

The same estimator can also be derived from an approximate Bayesian perspective [19] by combining priors with the asymptotically normal sampling distribution of yj. If we place a non-informative prior (i.e., normal with infinite variance) on ϕ1 and a zero-mean normal prior with variance τ2 on the bias E(y2) − ϕ1, then the posterior distribution of ϕ1 is normal with mean \( {\hat{\phi}}^{\mathrm{M}} \) and variance \( \sigma_1^2(\sigma_2^2 + \tau^2)/(\sigma_1^2 + \sigma_2^2 + \tau^2) \). Here τ reflects the prior belief in the closeness of the prevalence measured by the EHR and by the health survey. The 95% highest-density credible interval of the logit prevalence is given as

$$ {\hat{\phi}}^{\mathrm{M}}\pm 1.96\ {\sigma}_1\sqrt{\left({\sigma}_2^2+{\tau}^2\right)/\left({\sigma}_1^2+{\sigma}_2^2+{\tau}^2\right)} $$

The estimator, while less efficient than the subject-level imputation estimator below, is simpler to implement for practitioners who often lack the resources to link subjects across the two sources or to model the misclassification mechanisms in the EHR.
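
As a rough sketch of the computation (a simplified illustration, not the authors' code), the following R function forms the weighted average above with the plug-in estimates \( \hat{\tau}^2 = (y_1 - y_2)^2 \), \( s_1^2 \), and \( s_2^2 \), and returns the approximate 95% credible interval on the probability scale; the example inputs are invented.

```r
## Minimal sketch of the Mosteller-type estimator on the logit scale.
## y1, y2: logit prevalences from survey and EHR; s1, s2: their standard errors.
mosteller <- function(y1, y2, s1, s2) {
  tau2 <- (y1 - y2)^2                                  # plug-in estimate of squared bias of y2
  k1 <- 1 / s1^2
  k2 <- 1 / (tau2 + s2^2)
  phi <- (k1 * y1 + k2 * y2) / (k1 + k2)               # pooled logit prevalence
  se  <- s1 * sqrt((s2^2 + tau2) / (s1^2 + s2^2 + tau2))
  ci  <- phi + c(-1.96, 1.96) * se
  expit <- function(x) 1 / (1 + exp(-x))
  list(prevalence = expit(phi), ci95 = expit(ci),
       weight_on_survey = k1 / (k1 + k2))
}

## Illustrative call (standard errors are invented):
mosteller(y1 = log(0.19 / 0.81), y2 = log(0.086 / 0.914), s1 = 0.12, s2 = 0.01)
```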

Subject-level imputation estimator

Misclassification model

We adapted the approach of Schenker et al. [14] and modeled the misclassification between the binary outcomes of the ith subject in the health survey (Y1, i) and the EHR (Y2, i):

$$ \mathrm{logit}\ P\left({Y}_{2,i}=1|{Y}_{1,i}={y}_{1,i}\right)={\beta}_{0l}+{\beta}_{1l}{z}_i+{\beta}_{2l}{y}_{1,i} $$
(1)
$$ \mathrm{logit}\ P\left({Y}_{1,i}=1|{Y}_{2,i}={y}_{2,i}\right)={\gamma}_{0l}+{\gamma}_{1l}{z}_i+{\gamma}_{2l}{y}_{2,i} $$
(2)

where zi is a vector of predictors. Since the relationship may depend on the design factors of the surveys, each model is stratified into four levels (l = 1, 2, 3, 4) defined by the quartiles of the inclusion probabilities, with cut points q11, q12, q13 for the health survey and q21, q22, q23 for the EHR.
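
A minimal sketch of fitting model (1) within a single stratum l is given below, assuming a hypothetical linked data set standing in for Sc. The paper specifies Bayesian logistic regression with Cauchy priors of location zero and scale 2.5 (see the following subsections); `bayesglm()` from the `arm` package implements that prior family, although we do not know which software the authors used.

```r
## Minimal sketch (hypothetical data): misclassification model (1) within one stratum l.
library(arm)   # bayesglm(): logistic regression with Student-t/Cauchy priors on coefficients

set.seed(2)
linked <- data.frame(y1 = rbinom(200, 1, 0.3),   # gold-standard outcome Y_1 (health survey)
                     z  = rnorm(200))            # predictor z_i
linked$y2 <- rbinom(200, 1, plogis(-1 + 0.5 * linked$z + 2 * linked$y1))  # EHR outcome Y_2

m1 <- bayesglm(y2 ~ z + y1, family = binomial, data = linked,
               prior.scale = 2.5, prior.df = 1)  # Cauchy(0, 2.5) priors
summary(m1)
```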

Model for inclusion to each source

Since the inclusion probabilities for the health survey (π1i) are unknown for most EHR subjects, we model them as logit π1i = a0 + a1ui, where ui is a vector of survey design factors. The model is fit over all EHR subjects, weighted by their post-stratified weights (w2). Similarly, we model the inclusion probability for the EHR as logit π2i = b0 + b1vi and fit it over all health survey subjects, weighted by their survey weights.
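
A hedged sketch of the first inclusion model is shown below. The paper does not spell out the response variable used when fitting; here we assume an indicator of membership in the health survey, observed via linkage, together with hypothetical design factors. The `quasibinomial` family is used only so that `glm()` accepts non-integer weights.

```r
## Minimal sketch (assumptions noted above): weighted logistic model for inclusion in the survey.
set.seed(3)
ehr <- data.frame(u1 = rbinom(5000, 1, 0.5),      # design factor (e.g., demographic group)
                  u2 = rnorm(5000),               # design factor (e.g., poverty score)
                  w2 = runif(5000, 0.5, 2))       # post-stratified weights
ehr$in_survey <- rbinom(5000, 1, plogis(-4 + ehr$u1 + ehr$u2))  # assumed linkage indicator

fit_pi1 <- glm(in_survey ~ u1 + u2, family = quasibinomial, data = ehr, weights = w2)
ehr$pi1_hat <- predict(fit_pi1, type = "response")     # estimated pi_1i for every EHR subject
quantile(ehr$pi1_hat, c(0.25, 0.50, 0.75))             # cut points q11, q12, q13
```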

Bayesian iterative regression imputation

While we are ultimately interested in imputing the missing health survey outcomes (1) in Fig. 1, we follow Schenker et al. [14] and perform iterative imputations between two models: M1, to impute the missing EHR values (2) in the figure, and M2, to impute the missing health survey values (1) in the figure. This is repeated B times. Imputing the missing EHR values (2) increases the sample size when fitting M2, the model in which we are ultimately interested. The additional variation caused by using imputed values is accounted for by the multiple imputation standard error formula below. The detailed procedure follows.

To impute missing Y2,i, we divided the subjects S1 ∪ S2 into four groups (l = 1, …, 4) by the quartiles q21, q22, q23 and, within each group, fit the Bayesian regression model M1 with a weakly informative prior for βl = (β0l, β1l, β2l) of independent Cauchy distributions with location zero and scale 2.5, first on the subjects Sc whose identities can be linked between the two data sources. We then drew a posterior sample of βl and, in turn, of Y2,i conditional on βl for all health survey subjects missing Y2,i. Subsequently, treating the imputed Y2,i as observed, we imputed missing Y1,i by dividing the subjects into four groups by q11, q12, q13 and fitting the regression model M2 on all EHR subjects, with independent Cauchy priors for γl = (γ0l, γ1l, γ2l) with location zero and scale 2.5. We drew a posterior sample of γl and, in turn, of Y1,i for all EHR subjects missing Y1,i. We iterated B times to fit models M1 and M2, treating imputed values from the previous step as observed and imputing the missing outcome variables, until convergence. Then we calculated a prevalence estimator \( \hat{p}_m = \sum_{i\in S_2} w_{2,i} \hat{Y}_{m,1,i} / \sum_{i\in S_2} w_{2,i} \) based on the imputed health survey measurements of all EHR subjects. Note that outcome values were imputed only when they were missing; in other words, \( \hat{Y}_{m,1,i} \) = Y1, i for subjects whose health survey outcome was observed. Finally, we combined inferences from M such multiple imputations. The resulting prevalence estimator is unbiased when the specified models are correct:

$$ {\hat{P}}^{\mathrm{R}}={\sum}_{m=1}^M{\hat{p}}_m/M $$

The standard error of \( \hat{\phi}^{\mathrm{R}} = \mathrm{logit}(\hat{P}^{\mathrm{R}}) \) was estimated in the standard way [20, 21]:

$$ SE\left({\hat{\phi}}^{\mathrm{R}}\right)=\sqrt{W+\left(1+1/M\right)B} $$

where \( W = \sum_m s_m^2 / M \), \( B = \sum_m (\hat{\phi}_m - \hat{\phi}^{\mathrm{R}})^2/(M-1) \), and sm is the naïve standard error of the logit prevalence (\( \hat{\phi}_m \)) calculated from the mth imputation. Since the overlap between the two sources can be small, we used the Barnard-Rubin degrees of freedom [22, 23] to compute credible intervals, first on the log-odds scale before transforming them to the probability scale.
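
The combining step can be sketched as follows (a simplified illustration under assumed inputs, not the authors' code): given the M imputation-specific logit prevalences and their naïve standard errors, it applies the formula above and a Barnard-Rubin adjusted degrees of freedom for the interval.

```r
## Minimal sketch: pool M logit prevalences with Rubin's rules and Barnard-Rubin df.
combine_mi <- function(phi_m, s_m, n_complete) {
  M <- length(phi_m)
  phi_R <- mean(phi_m)                            # pooled logit prevalence
  W <- mean(s_m^2)                                # within-imputation variance
  B <- sum((phi_m - phi_R)^2) / (M - 1)           # between-imputation variance
  total_var <- W + (1 + 1/M) * B
  se <- sqrt(total_var)
  lambda <- (1 + 1/M) * B / total_var             # fraction of missing information
  df_old <- (M - 1) / lambda^2
  df_com <- n_complete - 1
  df_obs <- (df_com + 1) / (df_com + 3) * df_com * (1 - lambda)
  df_BR  <- 1 / (1 / df_old + 1 / df_obs)         # Barnard-Rubin degrees of freedom
  ci <- phi_R + c(-1, 1) * qt(0.975, df_BR) * se
  list(prevalence = plogis(phi_R), ci95 = plogis(ci), se_logit = se)
}

## Illustrative call with invented values for M = 5 imputations
combine_mi(phi_m = c(-0.85, -0.83, -0.88, -0.86, -0.84),
           s_m = rep(0.12, 5), n_complete = 190)
```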

Results

Simulation studies

We performed simulation studies to assess the performance of the methods under various settings. We generated correlated binary outcomes (Y1, Y2) for a target population (N = 10,000,000) whose conditional distributions follow the logistic models logit P(Y1 = 1|Y2) = η10 + φY2 and logit P(Y2 = 1|Y1) = η01 + φY1, where η10 = γ0 + γ1x1 + γ2x2 and η01 = β0 + β1x1 + β2x2. To do so, we first generated an independent Bernoulli variable x1 with success probability 0.5 and a standard normal variable x2. Then we generated the correlated binary outcomes (Y1, Y2), which have four possible values (0,0), (0,1), (1,0), (1,1) with corresponding joint probabilities p00, p01, p10, p11, where p11 : p10 : p01 : p00 = exp(φ + η10 + η01) : exp(η10) : exp(η01) : 1. This setup guarantees that the conditional distributions of the outcomes are the two stated logistic models. The log odds ratio φ and the linear coefficients were set so that the true prevalences based on the two datasets were p1 = p11 + p10 = 0.3 and p2 = p11 + p01 = 0.3, 0.31, 0.32, 0.33, or 0.35.
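
The construction of (Y1, Y2) can be sketched in R as follows (an illustration only; the coefficient values are placeholders, not those used in the simulations).

```r
## Minimal sketch: generate correlated binary outcomes with the stated joint odds.
set.seed(4)
N   <- 1e5                                  # small population for illustration
x1  <- rbinom(N, 1, 0.5)
x2  <- rnorm(N)
phi <- 3                                    # log odds ratio linking Y1 and Y2 (placeholder)
eta10 <- -1.5 + 0.5 * x1 + 0.3 * x2         # gamma0 + gamma1*x1 + gamma2*x2 (placeholders)
eta01 <- -1.4 + 0.5 * x1 + 0.3 * x2         # beta0  + beta1*x1  + beta2*x2  (placeholders)

## Unnormalized joint probabilities in the order p11 : p10 : p01 : p00
u <- cbind(exp(phi + eta10 + eta01), exp(eta10), exp(eta01), 1)
p <- u / rowSums(u)

## Draw one of the four cells per subject and map back to (Y1, Y2)
cell <- apply(p, 1, function(pr) sample(4, 1, prob = pr))
Y1 <- as.integer(cell %in% c(1, 2))         # Y1 = 1 in cells (1,1) and (1,0)
Y2 <- as.integer(cell %in% c(1, 3))         # Y2 = 1 in cells (1,1) and (0,1)
c(p1 = mean(Y1), p2 = mean(Y2))             # realized prevalences
```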

Then, we randomly selected subjects for the health survey (n1 = 250, 500, or 1000) and the EHR (n2 = 100,000) with inclusion probabilities given by logit π1i = a0 + a1u1i + a2u2i + a3x1i and logit π2i = b0 + b1u1i + b2u2i + b3x1i, where u1 was an independent Bernoulli variable with success probability 0.5 and u2 was a standard normal variable. We set (a0, a1, a2, a3) = (b0, b1, b2, b3) = (1, 1, 1, 0.187). x1, the predictor of misclassification, was also included as a survey design factor so that the missingness mechanism is missing-at-random but not missing-completely-at-random. We then selected additional EHR subjects from among the health survey participants so that the proportion of health survey participants also in the EHR was 20%, 50%, or 100%. Finally, we deleted the values of Y1 and π1 for subjects not in the health survey and Y2 for subjects not in the EHR. All π2 values were deleted because inclusion probabilities are unknown in a typical EHR.

For each simulated survey and EHR, we used u1, u2, and x1 to calculate the post-stratified weights w2 for the EHR. We then calculated four prevalence estimates: the estimator based only on the health survey, the estimator based only on the EHR, the Mosteller estimator, and the subject-level imputation estimator. For the subject-level imputation estimator, we included burn-in iterations and combined inferences from M = 30 multiple imputations. The overall process of generating the target population, sampling the health survey and the EHR from it, and calculating the prevalence estimates was repeated 200 times.

Table 1 shows the average prevalence estimates for the four estimators. The size of the health survey (n1) and the number of subjects linked between the two sources (n12) were both 500. The health survey estimator was unbiased in all settings. In contrast, the EHR estimator was biased except when there was no misclassification bias (i.e., p2 = 0.3), in which case post-stratification successfully adjusted for the selection bias. Both the Mosteller estimator and the subject-level imputation estimator showed less than 3% bias in all settings.

Table 1 Simulation studies: prevalence estimate by four methods

Table 2 shows the MSE of the estimators. When the bias was less than or equal to 5% (i.e., p2 = 0.3 or 0.31), the EHR estimator outperformed the health survey estimator due to its larger sample size. When the bias was more substantial, however, it overwhelmed the benefit of the sample size. In that case, the subject-level imputation estimator and the Mosteller estimator performed better than the estimators based on either source alone. Notably, they either outperformed or were similar to the health survey estimator in all settings. Between the two, the Mosteller estimator performed better than the subject-level imputation estimator when the bias was small to moderate (p2 = 0.3–0.33) but worse when the bias was large (p2 = 0.35).

Table 2 Simulation studies: square root of MSE of four methods

We studied how the size of the health survey and the number of subjects linked between the two sources affect performance (Table 3). We fixed the true prevalence (p1) at 0.3 and the prevalence (p2) measured from the EHR (Y2) at 0.32. The EHR estimator performed best when the health survey was small (n1 = 250), but the Mosteller estimator performed best when the health survey size was moderate (n1 = 500 or 1000). The subject-level imputation estimator requires a sufficient number of subjects linked between the two sources. Mosteller's method, on the other hand, performed well in most settings.

Table 3 Simulation studies: square root of MSE by different sample sizes

Analysis of NYC Macroscope and NYC HANES

We illustrate the methods with data from NYC. To protect patient privacy, the authors did not directly access the data but submitted R code to the NYC Department of Health and Mental Hygiene (DOHMH) and received back the results of the joint analysis of the two data sources presented below.

Description of data sources

NYC Macroscope is an EHR-based surveillance system developed by the NYC DOHMH in collaboration with the City University of New York School of Public Health to estimate the prevalence of chronic diseases and risk factors for the adult population (20 years or older) in care with participating primary care providers in NYC [2, 5]. The data were available only as aggregate statistics stratified by age group, sex, and neighborhood poverty level. Detailed provider and patient inclusion and exclusion criteria are documented elsewhere [2]. In this study, we used the 2013 data, which included 716,076 patients.

The 2013–14 NYC HANES is a population-representative survey of NYC residents aged 20 or older (n = 1527) with interviews, physical examinations, and biospecimen collection [24]. The data used in this study were limited to in-care participants (i.e., participants who had seen a provider for primary care in the previous year; n = 1135). Recently, a chart review study was conducted among a subsample (n = 190) of in-care participants from NYC HANES (Fig. 1) [5]. In that study, charts from more than 20 EHR software systems used by primary care providers were abstracted for the chart review participants, and the data were linked to the NYC HANES data at the individual level. The chart review sample consisted of participants who received primary care from NYC Macroscope or non-NYC Macroscope providers. Because there was little difference in demographic and clinical characteristics between the two groups, we used data from all participants in this study. The chart review was performed on subjects enrolled in NYC HANES 2013–14 (n = 1524) who had a doctor's visit during the year (n = 1135), signed a consent form and Health Insurance Portability and Accountability Act (HIPAA) waiver (n = 491), and whose EHR were available and valid (n = 190).

Definition of health indicators

We selected six health indicators from the sources to demonstrate the methods: hypertension diagnosis, diabetes diagnosis, smoking, obesity, depression, and influenza vaccination. Newton-Dame and her colleagues describe these indicators in detail [2]. Hypertension diagnosis was defined as systolic blood pressure ≥ 140 mmHg, diastolic blood pressure ≥ 90 mmHg, or an existing record of a hypertension diagnosis (based on ICD-9 codes in NYC Macroscope and self-report in NYC HANES). The diabetes indicator was based on the presence of an ICD-9 diagnosis in NYC Macroscope and on self-report in NYC HANES. The smoking indicator was based on an indication of ‘current smoking’ in the most recent smoking status in NYC Macroscope and on a self-report of current smoking in NYC HANES. The obesity indicator was based on the most recent body mass index (BMI) ≥ 30 in NYC Macroscope and on height and weight measured at the interview in NYC HANES. The depression indicator was based on the presence of an ICD-9 depression diagnosis ever recorded or a Patient Health Questionnaire (PHQ-9) score ≥ 10 in NYC Macroscope and on a self-reported diagnosis or a PHQ-9 score ≥ 10 at interview in NYC HANES. The influenza vaccination indicator was based on the presence of a relevant ICD-9/CPT/CVX code in NYC Macroscope and on self-report of receiving an influenza vaccination in the past 12 months in NYC HANES.

Illustration of the methods on NYC data

NYC Macroscope used post-stratification to address the selection bias of the Macroscope data [25, 26] by matching the joint distribution of gender, age group, and neighborhood-level poverty to that of the city’s in-care population. The prevalence estimates among the in-care city population based on NYC HANES and NYC Macroscope were close for hypertension diagnosis (NYC HANES 34.3% vs. NYC Macroscope 33.7%), moderately different for diabetes diagnosis (13.3% vs. 14.8%), smoking (17.3% vs. 15.9%), and obesity (31.7% vs. 29.1%), and significantly different for depression (19.0% vs. 8.6%) and influenza vaccination (48.6% vs. 21.2%). The discrepancies in the depression prevalence and influenza vaccination rate were likely due to the under-diagnosis of depression in primary care settings and to influenza vaccinations administered outside of clinics (e.g., at pharmacies) that are not recorded in the primary care EHR. The population characteristics in NYC HANES and NYC Macroscope for the adult in-care population are described elsewhere [27].
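
For readers unfamiliar with the weighting step, the following is a generic post-stratification sketch in R with invented cell counts; it is not the DOHMH implementation, and the categories and numbers are placeholders.

```r
## Minimal sketch: post-stratify EHR records to a known joint distribution of
## sex, age group, and neighborhood poverty (all values invented).
set.seed(5)
ehr <- data.frame(sex = sample(c("F", "M"), 2000, replace = TRUE),
                  age = sample(c("20-44", "45-64", "65+"), 2000, replace = TRUE),
                  pov = sample(c("low", "high"), 2000, replace = TRUE))

## Known in-care population counts per cell (placeholders)
pop <- expand.grid(sex = c("F", "M"), age = c("20-44", "45-64", "65+"),
                   pov = c("low", "high"))
pop$N <- round(runif(nrow(pop), 50000, 200000))

cell_n <- aggregate(list(n = rep(1, nrow(ehr))), ehr[, c("sex", "age", "pov")], sum)
cells  <- merge(cell_n, pop, by = c("sex", "age", "pov"))
cells$w2 <- cells$N / cells$n                                  # post-stratification weight
ehr <- merge(ehr, cells[, c("sex", "age", "pov", "w2")], by = c("sex", "age", "pov"))
```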

We estimated prevalence with the four estimators: the estimator based only on NYC HANES, the estimator based only on Macroscope data, the Mosteller estimator, and the subject-level imputation estimator. We treated NYC HANES as the gold standard since its data were collected using a population-representative sample design with a controlled and standardized data collection. The chart review study, with 190 subjects whose identities were linked between NYC HANES and NYC Macroscope, enabled us to calculate the subject-level imputation estimates, for which we used age group, sex, and neighborhood poverty level as covariates in the inclusion and misclassification models. Predictors that could properly model misclassifications in the EHR, such as hospital size, instrument labels, or types of visits, were not available.

The Mosteller prevalence estimates showed improvement over both the NYC HANES and NYC Macroscope estimates (Table 4). For all six health outcomes, they showed smaller standard errors than the NYC HANES estimates and smaller biases than the Macroscope estimates. The bias reduction was especially substantial (> 99% reduction) for the depression and influenza vaccination estimates because, for these indicators, the EHR estimates were given little weight (Table 5). On the other hand, the subject-level imputation estimates did not outperform the NYC HANES estimates: their credible intervals were wider than those of the NYC HANES estimates. This was due to the lack of predictors, as mentioned above, that could model the mechanism of misclassification in the EHR. The subject-level imputation method requires us to correctly model the misclassification as well as to approximate the inclusion probabilities for the health survey for the EHR subjects.

Table 4 Prevalence estimate and 95% confidence/credibility intervals of select health outcomes among adults in care in New York City (NYC), obtained from the NYC Macroscope 2013 and NYC HANES 2013–14
Table 5 Relative weights used in Mosteller estimator

Table 4 also demonstrates that the selection bias in Macroscope was smaller than the bias due to subject-level misclassifications: the ranges of differences in prevalence estimates between Macroscope and NYC HANES for diabetes, smoking, and obesity were similar with (1.6–3.7%) and without (1.5–2.6%) post-stratification, but decreased to 0.4–0.6% for the Mosteller estimator. The ranges of differences in depression prevalence and influenza vaccination rate were also similar with (10.7–26.9%) and without (10.4–27.4%) post-stratification but were reduced dramatically to 0.1% for the Mosteller estimator. This shows that post-stratification alone was insufficient to correct the bias in the EHR for these outcomes. In contrast, both the Mosteller estimator and the subject-level imputation estimator used NYC HANES as a safeguard against potential bias in the EHR.

Discussion

Compared to traditional health surveys, EHR have much larger sample sizes and the potential to reduce the standard errors of prevalence estimates. They can be especially helpful for estimating prevalence in small subgroups of the population. In the NYC Macroscope analysis and our simulation study, we found that correcting the subject-level errors of EHR is both necessary and possible.

In the simulation study, the health survey estimator was unbiased but had the largest standard error. In contrast, the bias in the EHR estimator can overwhelm the benefit of its sample size. When that happened, both the Mosteller estimator and the subject-level imputation estimator yielded negligible bias and small standard errors: they either outperformed or were comparable to the estimators based solely on either source. The subject-level imputation estimator may outperform the Mosteller estimator when the EHR bias is large. However, it requires a sufficient number of subjects linked between the two sources, a correctly specified misclassification model, and models for the inclusion probabilities for both sources.

The difficulty of such a task was demonstrated in the analysis of the NYC data. The Mosteller estimators showed considerably smaller standard errors than the NYC HANES estimates, especially when the NYC Macroscope and NYC HANES estimates were close. The subject-level imputation estimator did not outperform the NYC HANES estimator, in part due to a lack of predictors for misclassification. Predictors of misclassification can be both patient-level characteristics, such as type of visit, and institution-level characteristics, such as hospital size or instrument labels. These variables are typically found in the EHR (or in administrative data sets that accompany the EHR), while some patient characteristics will still be found in a health survey. In practice, the fit of the misclassification model should guide the choice between the considered approaches, that is, whether to model the underlying mechanism of misclassification or to use Mosteller’s estimator. This can be done, for example, by cross-validated estimation of the area under the receiver operating characteristic (ROC) curve as one varies the probability cutoff in the logistic regression model M2, as sketched below.
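
One way to carry out such a check is sketched below (hypothetical data and a plain rank-based AUC; this is an illustration, not the authors' procedure).

```r
## Minimal sketch: 5-fold cross-validated AUC for a logistic misclassification model M2.
auc <- function(labels, scores) {               # rank-based (Mann-Whitney) AUC
  r  <- rank(scores)
  n1 <- sum(labels == 1); n0 <- sum(labels == 0)
  (sum(r[labels == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

set.seed(6)
d <- data.frame(z = rnorm(300), y2 = rbinom(300, 1, 0.4))      # hypothetical linked data
d$y1 <- rbinom(300, 1, plogis(-1 + 0.5 * d$z + 1.5 * d$y2))

folds <- sample(rep(1:5, length.out = nrow(d)))
cv_auc <- sapply(1:5, function(k) {
  fit  <- glm(y1 ~ z + y2, family = binomial, data = d[folds != k, ])
  pred <- predict(fit, newdata = d[folds == k, ], type = "response")
  auc(d$y1[folds == k], pred)
})
mean(cv_auc)                                    # cross-validated AUC estimate
```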

In this article, we considered the health survey as the gold standard. We acknowledge that survey measurements are rarely unbiased. However, it is often helpful to treat one survey as the gold standard relative to another. For example, investigators have treated a smaller in-person survey as the gold standard relative to a larger telephone survey [10], or clinical surveys as the gold standard relative to self-reported outcomes [14, 28]. EHR are often administrative data collected for billing purposes with non-standardized instruments and protocols and with complex, unknown inclusion mechanisms. NYC HANES was designed for health survey purposes, with standardized instruments and protocols, and was collected through representative probability sampling. We assumed that the typical bias treatments for the health survey, such as post-stratification and calibration for non-response, had been successfully performed.

Conclusions

We demonstrated that the joint use of a small gold-standard health survey with a larger EHR improves the accuracy of prevalence estimation. Depending on the available data, one can aim to model the misclassification completely or simply calculate a weighted average of the prevalence estimates from the two sources. The studied approaches can improve the quality of EHR as a public health surveillance tool. In ongoing work, we are extending the methods to model subgroup-level prevalence estimates from health surveys and EHR.