1 Introduction

Generalized linear models (e.g., McCullagh & Nelder 1989) are the workhorses in many applications of statistical methods. A tacit assumption in these models is that all covariates are perfectly measured without error. Violation of this assumption will produce inconsistent estimators unless the measurement error problem is addressed. A body of research has hence evolved to allow at least approximate inference in generalized linear models with covariate measurement error (see Carroll, Ruppert, Stefanski, & Crainiceanu, 2006, and Buonaccorsi 2010, for comprehensive overviews; we will discuss some of this literature in more detail later).

In this article, we consider structural covariate measurement error models, where a parametric distribution is specified for the erroneously measured covariates. An obvious approach to estimation is then maximum likelihood which produces consistent estimates if the model is correctly specified (e.g., Schafer 1987; Schafer & Purdy 1986; Higdon & Schafer 2001). Unfortunately, the joint likelihood of the response and the measures cannot in general be expressed in closed form and computationally intensive methods based on numerical integration or simulation must be used. The computational burden involved in a full likelihood analysis is, therefore, often considerable.

Regression calibration has been proposed as a computationally efficient approach to estimating generalized linear models with covariate measurement error (e.g., Armstrong 1985; Rosner, Willett, & Spiegelman, 1989; Rosner, Spiegelman, & Willett, 1990; Carroll & Stefanski 1990). It is based on an approximation of the likelihood function where the basic idea is to plug in “best” predictions for the covariates measured with error and proceed in estimating the generalized linear model as if the predictions were covariates measured without error. Unfortunately, estimates of the regression parameters from regression calibration are, in general, inconsistent. The inconsistency is typically small when the true effects of the covariates measured with error are moderate and/or the measurement error variances are small, but more pronounced when these conditions do not hold.

In this article, we propose a pseudo maximum likelihood approach, called improved regression calibration (IRC), which simultaneously addresses the computational challenge in likelihood analysis and the inconsistency problem in conventional regression calibration. The basic idea is to consider a decomposed form of the likelihood where one component is expressed in closed form and trivial to maximize, and the second component is accurately maximized using crude and fast numerical integration. In contrast to conventional regression calibration, where predicted covariates measured with error are treated as fixed in point estimation, the stochastic nature of the predictions is handled by using predictive densities of the covariates measured with error as mixing distributions.

2 Generalized Linear Models with Covariate Measurement Error

Let y i be the outcome variable for unit i, i=1,…,N, x i an m×1 vector of covariates or “exposures” measured with error by the measures w i , and z i a vector of perfectly measured covariates, including a constant 1.

Following Clayton (1992), we can view a generalized linear model with covariate measurement error as composed of three parts: (i) an outcome model g(y i |x i ,z i ;ϑ O ), (ii) a measurement model g(w i |x i ,z i ;ϑ M ), and (iii) an exposure model g(x i |z i ;ϑ E ), where g(⋅|⋅) are conditional density functions and ϑ O , ϑ M , and ϑ E the corresponding parameter vectors. We define the complete parameter vector as \(\boldsymbol {\vartheta }=(\boldsymbol {\vartheta }_{\mathsf{O}}',\boldsymbol {\vartheta }_{\mathsf{M}}',\boldsymbol {\vartheta }_{\mathsf{E}}')'\). Throughout, we make the standard assumption of “nondifferential measurement error” that y i and w i are independent conditional on (x i ,z i ).

2.1 Outcome Model g(y i |x i ,z i ;ϑ O )

The outcome model is a generalized linear model (e.g., McCullagh & Nelder 1989) with three parts: (i) a linear predictor, which in the present context takes the form \(\eta_i = {\mathbf{x}}_i'\boldsymbol{\beta}_x + {\mathbf{z}}_i'\boldsymbol{\beta}_z\), (ii) a link function g(⋅) that links the linear predictor to the conditional expectation of the response, given the covariates, E(y i |x i ,z i )=g −1(η i ), and (iii) a conditional distribution for the response, given the covariates, taken from the exponential family,

$$g(y_i|{\mathbf{x}}_i, {\mathbf{z}}_i; \boldsymbol {\vartheta }_{\mathsf {O}}) = \exp\biggl \{\frac{y_i\theta_i -b(\theta_i)}{\phi} +c(y_i,\phi) \biggr\}. $$

Here, θ i =θ i (x i ,z i ;ϑ O ) is the canonical or natural parameter, ϕ=ϕ(ϑ O ) is the scale or dispersion parameter, and b(⋅) and c(⋅) are functions depending on the member of the exponential family. The most common nonlinear instance of this is the binary logistic model where y i follows a Bernoulli distribution and θ i =η i =log{E(y i |x i ,z i )/[1−E(y i |x i ,z i )]}. For this model, ϕ=1, \(b(\theta_i)=\log\{1+\exp(\theta_i)\}\), and \(c(y_i,\phi)=0\). Due to its popularity, we will consider a logistic outcome model in our simulations and data analysis.
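As a one-line check of this correspondence, write \(\pi_i = \mathrm{E}(y_i|{\mathbf{x}}_i,{\mathbf{z}}_i)\); the Bernoulli probability mass function then takes the exponential-family form directly,

$$\pi_i^{y_i}(1-\pi_i)^{1-y_i} = \exp\biggl\{y_i \log\frac{\pi_i}{1-\pi_i} + \log(1-\pi_i)\biggr\} = \exp\bigl\{y_i\theta_i - \log\bigl(1+e^{\theta_i}\bigr)\bigr\}, $$

which identifies \(\theta_i=\log\{\pi_i/(1-\pi_i)\}\), \(b(\theta_i)=\log(1+e^{\theta_i})\), and \(c(y_i,\phi)=0\).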

2.2 Measurement Model g(w i |x i ;ϑ M )

The form of the measurement model depends on the nature of the available data. Here we focus on the case of replication data, where at least a subsample of subjects provides several measures for each fallibly measured covariate. The main alternative is validation data where both x i and w i are observed for a subsample, in which case the proposed estimation procedures can be modified in a straightforward manner.

In general, the measurements w i may depend on the covariates z i measured without error as well as on x i , similarly to differential item functioning in item response theory. This would be straightforward to handle in our suggested approach, but here we omit z i for simplicity and consider measurement models of the form g(w i |x i ;ϑ M ).

The vector x i is measured by fallible measures \(\mathbf{w}_{i}=(\mathbf{w}_{1i}',\dots,\mathbf{w}_{mi}')'\), where each \(\mathbf{w}_{li}=(w_{li1},\dots,w_{lin_{li}})'\) is a vector of n li replicate measurements. For the moment, consider balanced data where n li =n l . A general multidimensional measurement model for m sets of congeneric measures (e.g., Jöreskog 1971) can be expressed as

$$ {\mathbf{w}}_i = \boldsymbol{\nu}+ \boldsymbol{\Lambda}{\mathbf{x}}_i + \boldsymbol{\delta}_i $$
(1)

where \(\boldsymbol{\nu}\) is a vector of intercept parameters, \(\boldsymbol{\delta}_i\) is a vector of measurement errors with \(\boldsymbol{\delta}_i \sim \mathrm{N}(\mathbf{0},\boldsymbol{\Theta})\), and it is assumed that \(\mathrm{Cov}({\mathbf{x}}_i,\boldsymbol{\delta}_i)=\mathbf{0}\). The matrix Λ is partitioned as

$$ \boldsymbol{\Lambda}= \begin{bmatrix} \boldsymbol{\lambda}_1 & \mathbf{0}& \cdots& \mathbf{0}\\ \mathbf{0}& \boldsymbol{\lambda}_2 & \cdots& \mathbf{0}\\ \vdots& \vdots& \ddots& \vdots\\ \mathbf{0}& \mathbf{0}& \cdots& \boldsymbol{\lambda}_m \end{bmatrix} $$
(2)

where λ l is a vector of scale parameters for the measures of covariate l. Further constraints are often imposed on the parameters of the measurement model, e.g., to obtain tau-equivalent or parallel models.

2.3 Exposure Model g(x i |z i ;ϑ E )

The dependence between the exposures measured with error x i and the covariates measured without error z i is specified as

$$ {\mathbf{x}}_i = \boldsymbol{\Gamma}{\mathbf{z}}_i + \boldsymbol{\zeta}_i $$
(3)

where Γ is a regression parameter matrix, ζ i ∼N(0,Ψ), and Cov(z i ,ζ i )=0. As the scale of x i is not identifiable from (1) and (3), some standard identification restrictions are imposed on the parameters. The parameter vector ϑ M then consists of the unique elements of ν, Λ and Θ, and ϑ E of the unique elements of Γ and Ψ.

A generalized linear model with covariate measurement error is shown graphically in Figure 1 for the simple case of an exposure x i fallibly measured by two measures w i1 and w i2, and a covariate z i measured without error. A common identifiability constraint for this case is to assume ν 1=ν 2=0 and λ 1=λ 2=1, which gives the “classical” measurement error model w ij =x i +δ ij .

Figure 1. Graph of generalized linear model with covariate measurement error.

The method that we propose below is not dependent on this specific combination of measurement and outcome models, but applies also more generally. Looking ahead to the rest of the paper, other study designs, and corresponding changes to measurement and outcome models, affect only Stage 1 of our two-stage estimation. For example, a situation where the number of replicate measurements is not the same for all units i is accounted for by the selection matrix C i included in Equation (8), and the case where y i is not observed for some units by omitting these from the log-likelihood component \(\ell_{2}(\boldsymbol {\vartheta }_{\mathsf {O}},\boldsymbol {\vartheta }_{\mathsf{ME}})\) in (5). If a validation sample rather than replication data is available, Stage 1 of the estimation could be done by modeling the conditional moments of x i given w i and z i (Equations (11) and (12)) directly rather than via the exposure and measurement models; in this case, the formulas of the variance estimation in the appendix would also be simplified.

3 Estimation Methods

We now consider different approaches to estimation of generalized linear models with covariate measurement error. We start by briefly describing maximum likelihood (ML) estimation, then proceed by developing our suggested approach of improved regression calibration (IRC) before contrasting this with conventional regression calibration (RC). We then conclude this section by a discussion of previous literature on these approaches to measurement error modeling. Throughout, we consider likelihoods for the response y i and the measures w i conditional on the perfectly measured covariates z i .

3.1 Maximum Likelihood (ML) Estimation

The likelihood contribution for a single unit i is

$$ g(y_i,{\mathbf{w}}_i|{\mathbf{z}}_i;\boldsymbol {\vartheta }) = \int g(y_i|{\mathbf{x}}_i,{\mathbf{z}}_i;\boldsymbol {\vartheta }_{\mathsf{O}}) g( {\mathbf{w}}_i|{\mathbf{x}}_i;\boldsymbol {\vartheta }_{\mathsf{M}}) g({\mathbf{x}}_i| {\mathbf{z}}_i;\boldsymbol {\vartheta }_{\mathsf{E}}) \,\mathrm{d}{\mathbf{x}}_i, $$
(4)

the log-likelihood contribution is \(\ell_{i}(\boldsymbol{\vartheta}) = \log g(y_i,{\mathbf{w}}_i|{\mathbf{z}}_i;\boldsymbol{\vartheta})\), and the log-likelihood \(\ell(\boldsymbol{\vartheta}) = \sum_{i=1}^{N} \ell_{i}(\boldsymbol{\vartheta})\). When Θ is diagonal, as is often assumed, \(g({\mathbf{w}}_{i}|{\mathbf{x}}_{i};\boldsymbol {\vartheta }_{\mathsf{M}}) = \prod_{l=1}^{m}\, \prod_{j=1}^{n_{li}} g(w_{lij}|{\mathbf{x}}_{i};\boldsymbol {\vartheta }_{\mathsf{M}})\). The ML estimator \(\widehat{\boldsymbol {\vartheta }}\) is obtained by maximizing \(\ell(\boldsymbol{\vartheta})\) with respect to ϑ.

Unfortunately, the joint likelihood of generalized linear models with covariate measurement error cannot generally be expressed in closed form and requires integration, typically accomplished by Gaussian quadrature. In general, the performance of Gaussian quadrature depends on the smoothness of the integrand. According to the fundamental theorem of Gaussian quadrature (e.g., Davis & Rabinowitz 1984; Thisted 1988, Theorem 5.3-1), ordinary Gaussian quadrature is exact if the function in the integrand is a polynomial of degree at most 2R−1 (where R is the number of quadrature points). However, a likelihood component including a product of conditional response distributions for continuous responses, such as \(\prod_{l=1}^{m}\, \prod_{j=1}^{n_{li}} g(w_{lij}| {\mathbf{x}}_{i};\boldsymbol {\vartheta }_{\mathsf{M}})\) above, tends to produce a peaked integrand in the marginal likelihood (a tendency exacerbated as the number of measures and their intraclass correlation increases). Such likelihood contributions are poorly approximated by low-degree polynomials, and ordinary Gauss–Hermite quadrature does not work well for this situation (e.g., Albert & Follmann 2000; Lesaffre & Spiessens 2001). This is illustrated in the left panel of Figure 2 where we see that all quadrature points completely miss the integrand.
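The failure mode is easy to reproduce numerically. In the sketch below (a stand-in for the ML integrand, not the paper's code), a sharply peaked Gaussian kernel with assumed peak location and width plays the role of the product of replicate measurement densities, and 3-point ordinary Gauss–Hermite quadrature is compared with the exact value:

```python
import numpy as np
from numpy.polynomial.hermite import hermgauss
from scipy.stats import norm

# A sharply peaked Gaussian kernel centered away from 0, standing in for
# the product of replicate measurement densities g(w_i | x_i); the peak
# location and width are illustrative assumptions.
mu, sd = 1.0, 0.05
peaked = lambda x: np.exp(-0.5 * ((x - mu) / sd) ** 2)

# Target: integral of peaked(x) * N(x; 0, 1) dx, by 3-point ordinary GH.
nodes, weights = hermgauss(3)
x = np.sqrt(2.0) * nodes                      # change of variable for N(0, 1)
approx = np.sum(weights * peaked(x)) / np.sqrt(np.pi)

# Exact value, available in closed form for this Gaussian-kernel example:
exact = sd * np.sqrt(2 * np.pi) * norm.pdf(mu, scale=np.sqrt(1 + sd**2))

print(f"3-point GH: {approx:.6f}   exact: {exact:.6f}")
# The rescaled GH nodes (0 and +/-1.73) all miss the peak near 1.0, so the
# quadrature value is essentially zero while the true value is not.
```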

Figure 2. Illustration of integrand and quadrature points (locations and weights) for 3-point ordinary Gauss–Hermite quadrature. Maximum likelihood in left panel and improved regression calibration in right panel.

Therefore, more computationally demanding adaptive Gaussian quadrature methods that align the quadrature points under the integrand are recommended when continuous responses are involved (e.g., Rabe-Hesketh, Skrondal, & Pickles, 2005). A limitation of the full likelihood approach is, hence, that it becomes computationally intensive.

3.2 Improved Regression Calibration (IRC)

As an alternative to full ML we propose to break the estimation problem into two parts, allocating as many parameters as possible to a likelihood component that is easy to maximize. This is an instance of a general two-stage approach to estimation known as pseudo maximum likelihood (PML) estimation (Gong & Samaniego 1981).

Letting \(\boldsymbol {\vartheta }_{\mathsf{ME}}=(\boldsymbol {\vartheta }_{\mathsf{M}}', \boldsymbol {\vartheta }_{\mathsf{E}}')'\), we first re-express g(w i |x i ;ϑ M )g(x i |z i ;ϑ E ) in (4) as g(x i |w i ,z i ;ϑ ME )g(w i |z i ;ϑ ME ), and the log-likelihood as

$$ \ell(\boldsymbol{\vartheta}) = \sum_{i=1}^N \log g(y_i|\mathbf{w}_i,\mathbf{z}_i; \boldsymbol{\vartheta}_{\mathsf{O}}, \boldsymbol{\vartheta }_{\mathsf{ME}}) + \sum_{i=1}^N \log g({\mathbf{w}}_i| {\mathbf{z}}_i;\boldsymbol {\vartheta }_{\mathsf{ME}}) \equiv\ell_2(\boldsymbol {\vartheta }_{\mathsf {O}}, \boldsymbol{\vartheta}_{\mathsf{ME}}) +\ell_1(\boldsymbol{ \vartheta}_{\mathsf{ME}}) $$
(5)

where

$$ g(y_i|{\mathbf{w}}_i,{\mathbf{z}}_i;\boldsymbol {\vartheta }_{\mathsf{O}}, \boldsymbol {\vartheta }_{\mathsf {ME}}) = \int g(y_i|{\mathbf{x}}_i,{\mathbf{z}}_i; \boldsymbol {\vartheta }_{\mathsf{O}})\, g({\mathbf{x}}_i|{\mathbf{w}}_i, {\mathbf{z}}_i; \boldsymbol {\vartheta }_{\mathsf{ME}})\, \mathrm{d}{\mathbf{x}}_i . $$
(6)

In Stage 1 of IRC, we estimate the combined measurement and exposure model g(w i |z i ;ϑ ME ) by maximizing just \(\ell_{1}(\boldsymbol {\vartheta }_{\mathsf{ME}})\), to obtain estimates \(\widehat{\boldsymbol {\vartheta }}_{\mathsf{ME}}\). These are not full ML estimates because they omit the typically small amount of information about ϑ ME contained in y i . In Stage 2, these estimates from Stage 1 are then treated as known, and estimates \(\widehat{\boldsymbol {\vartheta }}_{\mathsf{O}}^{\mathsf{IRC}}\) for the parameters of primary interest ϑ O are obtained by maximizing \(\ell_{2}(\boldsymbol {\vartheta }_{\mathsf {O}},\widehat{\boldsymbol{\vartheta}}_{\mathsf{ME}})\). A detailed description of the two stages is provided in the next section.

The basic idea of IRC is that maximizing the approximate decomposed likelihood is considerably less demanding than maximizing the joint likelihood. In Stage 1, the component g(w i |z i ;ϑ ME ) is in closed form and trivial to maximize. In Stage 2, the mixing distribution in the integral (6) is the predictive density \(g({\mathbf{x}}_{i}|{\mathbf{w}}_{i},{\mathbf{z}}_{i};\widehat {\boldsymbol {\vartheta }}_{\mathsf{ME}})\) of the covariates measured with error, given their observed measures and covariates measured without error, which is also trivial to obtain.

The dimensionality of integration (the number of covariates measured with error) in Stage 2 is the same as for full ML. At first glance, there does, hence, not appear to be any computational benefits to be reaped from using IRC. However, the integrand is now the single logistic function g(y i |x i ,z i ;ϑ O ), which due to its smoothness is well approximated by a low order polynomial. For instance, the seminal work on nonlinear factor analysis by McDonald (1967) demonstrated that a cubic function sufficed for approximating the normal ogive (which is very close to the logistic function). We therefore expect that crude and fast ordinary Gauss–Hermite quadrature, using just a few quadrature points, would work well for IRC. This is illustrated in the right panel of Figure 2, where all three quadrature points nicely cover the logistic integrand, in contrast to the case for the likelihood in the left panel.
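The contrast with the previous sketch can be made concrete. Here the Stage-2 integrand (6) is a single logistic function mixed over a normal predictive density; the outcome parameters and predictive moments below are assumed illustrative values, not estimates:

```python
import numpy as np
from numpy.polynomial.hermite import hermgauss
from scipy.integrate import quad
from scipy.stats import norm

inv_logit = lambda eta: 1.0 / (1.0 + np.exp(-eta))
beta0, beta_x = -2.0, 1.0    # assumed outcome-model parameters
xi, omega = 0.8, 0.4         # assumed predictive mean and variance

# Stage-2 integrand (6): one smooth logistic function mixed over the
# predictive density x_i | w_i, z_i ~ N(xi, omega).
nodes, weights = hermgauss(3)
x = xi + np.sqrt(2.0 * omega) * nodes         # nodes aligned with the density
approx = np.sum(weights * inv_logit(beta0 + beta_x * x)) / np.sqrt(np.pi)

accurate, _ = quad(lambda t: inv_logit(beta0 + beta_x * t) *
                   norm.pdf(t, loc=xi, scale=np.sqrt(omega)),
                   -np.inf, np.inf)
print(f"3-point GH: {approx:.6f}   accurate: {accurate:.6f}")
# The two values agree closely: the logistic integrand is smooth and well
# approximated by a low-order polynomial over the range of the density.
```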

It is likely that direct maximization of the full likelihood expressed as (5) could also be based on more crude Gauss–Hermite quadrature than what is required for the standard form (4). In this article, however, we focus on the two-stage approach to (5), since it is straightforward to implement in publicly available software.

The savings compared to ML are especially pronounced in three settings and their combinations: (i) large datasets, (ii) when the relative number of parameters allocated to the easily maximized likelihood component is large (a large number of measures and/or realistically complex measurement models), and (iii) when the same predictive distributions can be used in several models, so that the Stage-1 likelihood components need only be maximized once.

3.3 Conventional Regression Calibration (RC)

Conventional regression calibration is also a two-stage method which can be seen as an approximation of pseudo-ML (IRC) estimation. Stage 1 is the same as for IRC, but estimation in Stage 2 is based on the further approximation

$$ g(y_i|\mathbf{w}_i,\mathbf{z}_i; \boldsymbol{\vartheta}_{\mathsf{O}}, \widehat{\boldsymbol {\vartheta}}_{\mathsf{ME}}) \approx g(y_i|\widetilde{\boldsymbol{\xi}}_i, {\mathbf{z}}_i;\boldsymbol {\vartheta }_{ \mathsf{O}}) $$
(7)

where \(g(y_{i}|\widetilde{\boldsymbol{\xi}}_{i},{\mathbf{z}}_{i};\boldsymbol {\vartheta }_{ \mathsf{O}})\) is of the same form as the outcome model g(y i |x i ,z i ;ϑ O ), now with the “predictive mean” \(\widetilde{\boldsymbol{\xi}}_{i}=\mathrm{E}(\mathbf{x}_{i}|\mathbf {w}_{i},\mathbf{z}_{i}; \widehat{\boldsymbol{\vartheta}}_{\mathsf{ME}})\) used in the place of x i . RC thus carries only \(\widetilde{\boldsymbol{\xi}}_{i}\) forward from Stage 1 to Stage 2 of the estimation, whereas IRC takes the whole predictive density \(g(\mathbf{x}_{i}|\mathbf{w}_{i},\mathbf{z}_{i}; \widehat{\boldsymbol{\vartheta}}_{\mathsf{ME}})\) into account in Stage 2. In contrast to IRC, RC is generally inconsistent because it employs the approximation (7) of the likelihood function (6).
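A minimal numerical contrast between the two methods, for a single unit and assumed parameter values, shows the approximation error that drives the inconsistency of RC:

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

inv_logit = lambda eta: 1.0 / (1.0 + np.exp(-eta))
beta0, beta_x = -2.0, 1.5    # assumed outcome-model parameters
xi, omega = 0.8, 0.5         # assumed predictive mean and variance

rc = inv_logit(beta0 + beta_x * xi)    # RC: plug in the predictive mean (7)
irc, _ = quad(lambda x: inv_logit(beta0 + beta_x * x) *
              norm.pdf(x, loc=xi, scale=np.sqrt(omega)),
              -np.inf, np.inf)         # IRC: mix over the predictive density (6)
print(f"RC: {rc:.4f}   IRC: {irc:.4f}")
# E{g(y|x)} differs from g(y|E[x]) under a nonlinear link; the gap grows
# with beta_x and omega, exactly the conditions under which RC's
# inconsistency becomes pronounced.
```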

3.4 ML, PML, and RC in the Measurement Error Literature

The books by Carroll et al. (2006) and Buonaccorsi (2010) provide excellent summaries of methods of estimation in measurement error modeling. The use of full ML estimation has been advocated in a series of papers by Daniel Schafer and coauthors. Schafer (1993), for binary probit models, and Schafer and Purdy (1986), for normal linear models, consider cases where the likelihood can be evaluated in closed form. For cases where this is not possible, such as binary logistic regression, Schafer (1987) uses a closed-form approximation to avoid numerical integration, while Higdon and Schafer (2001) employ ordinary Gauss–Hermite quadrature to evaluate the likelihood. Rabe-Hesketh, Skrondal, and Pickles (2003) propose using more accurate adaptive quadrature in this setting. Another possibility is to estimate the models in a Bayesian framework, using simulation-based MCMC methods (e.g., Stephens & Dellaportas 1992; Richardson & Gilks 1993; Kuha 1997; Gustafson 2004).

Key references for regression calibration include Armstrong (1985), Rosner et al. (1989, 1990), Carroll and Stefanski (1990), and Gleser (1990), and the overview in Carroll et al. (2006). Buonaccorsi (2010) points out that regression calibration is also a “pseudo-type” two-stage method, which can be viewed as an approximation of PML estimation.

The possibility of PML estimation for regression models with covariates measured with error was noted early, for example, by Carroll, Spiegelman, Lan, Bailey, and Abbott (1984), who apply it for a binary probit model, and Armstrong (1985). PML estimation has been suggested for some specific models where its implementation is relatively straightforward, such as probit models with a single covariate (Burr 1988), linear mixed models (Buonaccorsi, Demidenko, & Tosteson, 2000) and linear structural equation models with latent variables (Skrondal & Laake 2001). For other models, however, the approach has not been developed, perhaps because of a perception that its implementation requires “specialized programming” (Buonaccorsi 2010, p. 227). The IRC method proposed here provides a general approach to PML for covariate measurement models which largely avoids such programming.

4 The Anatomy of Improved Regression Calibration

We will now have a closer look at each of the stages of IRC.

4.1 Stage 1: Estimation of the MIMIC Model g(w i |z i ;ϑ ME )

We can view (1) as representing the measurement model for a possibly hypothetical complete set of replicate measurements w i , where the numbers of measurements in w li are n l for each unit i. The numbers of actually observed replicates may in fact be n li <n l for some i,l, due to design and/or nonresponse. The most common case of unbalanced data by design is one where replicate measurements are only collected for a subsample, so that n li =1 outside the subsample. Defining n i =∑ l n li and n=∑ l n l , the model for such possibly incomplete measurements is obtained by multiplying the right-hand side of (1) by an n i ×n selection matrix C i . We will henceforth include C i where appropriate in the formulae since this is required for obtaining correct results in the unbalanced case where the n li are not constant.

Together, the measurement and exposure models constitute a multiple-indicator multiple-cause (MIMIC) model (e.g., Robinson 1974; Jöreskog & Goldberger 1975). To obtain g(w i |z i ;ϑ ME ), we substitute the exposure model into the measurement model, producing the reduced form MIMIC model

$$ {\mathbf{w}}_i = {\mathbf{C}}_i \bigl(\boldsymbol{\nu}+ \boldsymbol{\Lambda}\boldsymbol{\Gamma}{\mathbf{z}}_i + \boldsymbol{\Lambda}\boldsymbol{\zeta}_i + \boldsymbol{\delta}_i\bigr) $$
(8)

for which the conditional first and second order moments are

$$ \boldsymbol{\mu}_i = \mathrm{E}({\mathbf{w}}_i|{\mathbf{z}}_i) = {\mathbf{C}}_i(\boldsymbol{\nu}+ \boldsymbol{\Lambda}\boldsymbol{\Gamma}{\mathbf{z}}_i), $$
(9)
$$ \boldsymbol{\Sigma}_i = \mathrm{Cov}({\mathbf{w}}_i|{\mathbf{z}}_i) = {\mathbf{C}}_i\bigl(\boldsymbol{\Lambda}\boldsymbol{\Psi}\boldsymbol{\Lambda}' + \boldsymbol{\Theta}\bigr){\mathbf{C}}_i'. $$
(10)

The density for the measures, given the perfectly measured covariates, becomes w i |z i ∼N(μ i ,Σ i ), and the log-likelihood \(\ell_{1}(\boldsymbol {\vartheta }_{\mathsf{ME}})\) for the combined measurement and exposure model can be expressed in closed form.
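For concreteness, the Stage-1 moments (9)–(10) and the corresponding Gaussian log-likelihood term amount to a few lines of code. The sketch below assumes a single exposure (m=1) measured by up to two replicates under the classical model (ν=0, λ=1), with illustrative parameter values; the function name is ours:

```python
import numpy as np
from scipy.stats import multivariate_normal

nu    = np.zeros(2)                  # measurement intercepts
Lam   = np.ones((2, 1))              # loadings (lambda_1 = lambda_2 = 1)
Theta = 0.33 * np.eye(2)             # measurement error variances
Gam   = np.array([[0.0, 0.3]])       # exposure regression on z_i = (1, z_i1)'
Psi   = np.array([[1.0]])            # exposure residual variance

def stage1_loglik(w, z, C):
    """log N(w; mu_i, Sigma_i), with selection matrix C_i for unbalanced data."""
    mu    = C @ (nu + Lam @ Gam @ z)                   # Equation (9)
    Sigma = C @ (Lam @ Psi @ Lam.T + Theta) @ C.T      # Equation (10)
    return multivariate_normal(mu, Sigma).logpdf(w)

z = np.array([1.0, 0.4])
print(stage1_loglik(np.array([0.2, 0.5]), z, np.eye(2)))     # both replicates
print(stage1_loglik(np.array([0.2]), z, np.eye(2)[:1]))      # only w_i1 observed
```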

The estimates \(\widehat{\boldsymbol {\vartheta }}_{\mathsf{ME}}\) that maximize \(\ell_{1}(\boldsymbol {\vartheta }_{\mathsf{ME}})\) can be obtained in a very computationally efficient manner using standard methods for moment structure modeling (e.g., Bentler 1983). The estimates are consistent as N→∞ for fixed n i under mild regularity conditions, not requiring the normality assumptions imposed above (e.g., Shapiro 2007). They remain consistent also when measurements are missing at random (MAR) in the sense of Rubin (1976), although MAR is slightly more restrictive here than for full ML since y i is not a part of the Stage-1 likelihood.

4.2 Stage 2: Estimation of the Model \(g(y_{i}|\mathbf{w}_{i}, \mathbf{z}_{i}; \boldsymbol{\vartheta}_{ \mathsf{O}}, \widehat{\boldsymbol {\vartheta }}_{\mathsf{ME}})\)

Under the models (1) and (3) assumed in Stage 1, the predictive density of the covariates measured with error given their observed measures and the covariates measured without error becomes x i |w i ,z i ∼N(ξ i ,Ω i ), with the conditional mean and variance matrix

$$ \boldsymbol{\xi}_i = \mathrm{E}({\mathbf{x}}_i|{\mathbf{w}}_i,{\mathbf{z}}_i) = \boldsymbol{\Gamma}{\mathbf{z}}_i + \boldsymbol{\Psi}\boldsymbol{\Lambda}'{\mathbf{C}}_i'\, \boldsymbol{\Sigma}_i^{-1} ({\mathbf{w}}_i - \boldsymbol{\mu}_i), $$
(11)
$$ \boldsymbol{\Omega}_i = \mathrm{Cov}({\mathbf{x}}_i|{\mathbf{w}}_i,{\mathbf{z}}_i) = \boldsymbol{\Psi}- \boldsymbol{\Psi}\boldsymbol{\Lambda}'{\mathbf{C}}_i'\, \boldsymbol{\Sigma}_i^{-1} {\mathbf{C}}_i \boldsymbol{\Lambda}\boldsymbol{\Psi}$$
(12)

where we note the role of the selection matrix C i . Substituting estimates \(\widehat{\boldsymbol {\vartheta }}_{\mathrm{ME}}\) for the parameters in (11) and (12), we obtain empirical Bayes (EB) predictions \(\widetilde{\boldsymbol{\xi}}_{i}\) for x i for each unit i, and their predictive variances (e.g., Skrondal & Rabe-Hesketh 2004, Chap. 6, and 2009). The EB predictions are identical to the empirical best linear unbiased predictions (EBLUP), which do not hinge on distributional assumptions (e.g., Robinson 1991).
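A sketch of the EB computation under the same illustrative setup as the Stage-1 example above (classical model, single exposure, two replicates; the function name is ours):

```python
import numpy as np

nu, Lam = np.zeros(2), np.ones((2, 1))
Theta, Psi = 0.33 * np.eye(2), np.array([[1.0]])
Gam = np.array([[0.0, 0.3]])

def eb_predict(w, z, C):
    """Return the predictive mean xi_i and variance Omega_i of x_i | w_i, z_i."""
    mu    = C @ (nu + Lam @ Gam @ z)
    Sigma = C @ (Lam @ Psi @ Lam.T + Theta) @ C.T
    K     = Psi @ Lam.T @ C.T @ np.linalg.inv(Sigma)   # "regression" of x on w
    xi    = Gam @ z + K @ (w - mu)                     # Equation (11)
    Omega = Psi - K @ C @ Lam @ Psi                    # Equation (12)
    return xi, Omega

xi, Omega = eb_predict(np.array([0.2, 0.5]), np.array([1.0, 0.4]), np.eye(2))
print(xi, Omega)   # carried into Stage 2 as xi_tilde and Omega_hat
```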

We finally estimate the parameters of primary interest ϑ O . Note that, conditional on (w i ,z i ) and given the estimates \(\widehat{\boldsymbol {\vartheta }}_{\mathrm{ME}}\), we can write \(\mathbf{x}_{i} = \widetilde{\boldsymbol{\xi}}_{i} + \mathbf{u}_{i}\) where \(\mathbf{u}_i \sim \mathrm{N}(\mathbf{0},\widehat{\boldsymbol{\Omega}}_i)\), independent of w i and z i . Substituting this into (6) gives

$$ g(y_i|{\mathbf{w}}_i,{\mathbf{z}}_i;\boldsymbol{\vartheta}_{\mathsf{O}}, \widehat{\boldsymbol{\vartheta}}_{\mathsf{ME}}) = \int g^{*}(y_i|\widetilde{\boldsymbol{\xi}}_i,{\mathbf{z}}_i,{\mathbf{u}}_i;\boldsymbol{\vartheta}_{\mathsf{O}})\, g({\mathbf{u}}_i;\widehat{\boldsymbol{\Omega}}_i)\, \mathrm{d}{\mathbf{u}}_i $$
(13)

where \(g({\mathbf{u}}_i;\widehat{\boldsymbol{\Omega}}_i)\) denotes the N(0,\(\widehat{\boldsymbol{\Omega}}_i\)) density of u i and \(g^{*}(y_{i}|\widetilde{\boldsymbol{\xi}}_{i},{\mathbf{z}}_{i},\mathbf {u}_{i};\boldsymbol {\vartheta }_{\mathsf{O}})\) is a generalized linear model of the same kind as g(y i |x i ,z i ;ϑ O ), but with the linear predictor

$$ \eta_i = \widetilde{\boldsymbol{\xi}}_i'\boldsymbol{\beta}_x + {\mathbf{z}}_i'\boldsymbol{\beta}_z + {\mathbf{u}}_i'\boldsymbol{\beta}_x $$
(14)

which includes the vector of random effects u i . For the case of a single covariate x i measured with error, the linear predictor can be expressed as \(\eta_i = {\mathbf{z}}_i'\boldsymbol{\beta}_z + \beta_x \widetilde{\xi}_i + \beta_x u_i\), where \(u_{i} \sim\mathrm{N}(0,\hat{\omega}_{i})\) and the predictive variance \(\hat{\omega}_{i}\) is a scalar.
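Concretely, the likelihood contribution (13) then reduces to a one-dimensional Gauss–Hermite sum. A sketch with illustrative values (the function name is ours):

```python
import numpy as np
from numpy.polynomial.hermite import hermgauss

def loglik_i(y, xi, omega, z, beta0, beta_z, beta_x, R=3):
    """Likelihood contribution (13), single fallible covariate, R-point GH."""
    nodes, weights = hermgauss(R)
    u   = np.sqrt(2.0 * omega) * nodes                    # u_i ~ N(0, omega_hat_i)
    eta = beta0 + beta_z * z + beta_x * xi + beta_x * u   # linear predictor
    p   = 1.0 / (1.0 + np.exp(-eta))
    lik = np.sum(weights * np.where(y == 1, p, 1.0 - p)) / np.sqrt(np.pi)
    return np.log(lik)

# Illustrative values for one unit:
print(loglik_i(y=1, xi=0.8, omega=0.4, z=0.2, beta0=-2.0, beta_z=0.5, beta_x=1.0))
```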

Model (13) is a special case of a generalized linear latent and mixed model (GLLAMM), see, for instance, Rabe-Hesketh, Skrondal, and Pickles (2004a) and Skrondal and Rabe-Hesketh (2004, 2007). It differs from a conventional generalized linear mixed model (GLMM) in several regards. First, the model is for single-level data instead of multilevel or clustered data. The model is identified because the covariance matrix \(\widehat{\boldsymbol{\Omega}}_{i}\) of u i is treated as known from Stage 1, and β x is constrained to be equal to the coefficients of \(\widetilde{\boldsymbol{\xi}}_{i}\) (a model simply introducing level-1 random effects with a free variance matrix, without any parameter restriction, is not identified). Second, the mixing distribution is the predictive density of the unobserved x i . Third, the random effects are multiplied by unknown parameters. An important practical merit of IRC is that model (13) can be estimated using the gllamm program (e.g., Rabe-Hesketh, Skrondal, & Pickles, 2004b; Rabe-Hesketh & Skrondal 2012).

5 Properties of Improved Regression Calibration

The IRC estimator \(\widehat{\boldsymbol {\vartheta }}_{\mathsf{O}}^{\mathsf{IRC}}\) is the value of ϑ O which maximizes the second-stage log-likelihood \(\ell_{2}(\boldsymbol {\vartheta }_{\mathsf {O}}, \widehat{\boldsymbol {\vartheta }}_{\mathsf{ME}})\), where \(\widehat{\boldsymbol {\vartheta }}_{\mathsf{ME}}\) is a consistent estimator of ϑ ME obtained by maximizing \(\ell_{1}(\boldsymbol {\vartheta }_{\mathsf{ME}})\) in the first stage. This is an instance of a general approach to estimation where the parameters of a model are divided into two sets, one of which contains the parameters of interest and the other involves only nuisance parameters. The nuisance parameters are first estimated by some consistent and computationally convenient estimators, and the parameters of interest are then estimated by maximizing an appropriate objective function with the estimates of the nuisance parameters from the first step treated as known. This is known as pseudo maximum likelihood (PML) estimation when, as here, the second-stage objective function is a likelihood (Gong & Samaniego 1981), and more generally as quasi generalized extremum estimation (Gourieroux & Monfort 1995).

It is well known that such two-stage estimators are consistent and asymptotically normally distributed under very general regularity conditions. The conditions and a proof of the consistency are given by Gourieroux and Monfort (1995, Sects. 24.2.4 and 24.2.2). In the notation of our problem, denote the true parameter value by \(\boldsymbol {\vartheta }^{*}=(\boldsymbol {\vartheta }_{\mathsf {O}}^{*\prime},\boldsymbol {\vartheta }_{\mathsf {ME}}^{*\prime})'\). Then \(\widehat{\boldsymbol {\vartheta }}_{\mathsf{O}}^{\mathsf{IRC}}\) is consistent for \(\boldsymbol {\vartheta }_{\mathsf {O}}^{*}\) if, first, standard regularity conditions hold so that the ML estimator of the whole of ϑ is itself consistent for ϑ∗ and, second, if (i) ϑ O and ϑ ME can vary independently of each other, and (ii) \(\widehat{\boldsymbol {\vartheta }}_{ \mathsf{ME}}\) is consistent for \(\boldsymbol {\vartheta }_{\mathsf {ME}}^{*}\). All of these conditions are satisfied in the case considered here.

Let u(ϑ)=∂ℓ(ϑ)/∂ϑ be the score function, partitioned as

$$\mathbf{u}(\boldsymbol{\vartheta}) = \biggl( \frac{\partial\ell(\boldsymbol{\vartheta})}{\partial \boldsymbol{\vartheta}'_{\mathsf{O}}}, \; \frac{\partial\ell(\boldsymbol{\vartheta})}{\partial \boldsymbol{\vartheta}'_{\mathsf{ME}}} \biggr)' =\bigl(\mathbf {u}_{\vartheta_{\mathsf{O}}}( \boldsymbol{\vartheta})',\; \mathbf{u}_{\vartheta_{\mathsf {ME}}}(\boldsymbol{ \vartheta})'\bigr)', $$

and define the mean score as \(\bar{\mathbf{u}}(\boldsymbol{\boldsymbol {\vartheta }}) = (\bar{\mathbf{u}}_{\vartheta_{\mathsf{O}}}(\boldsymbol {\vartheta})',\; \bar{\mathbf{u}}_{\vartheta_{\mathsf{ME}}}(\boldsymbol {\vartheta})')' =N^{-1}\,\mathbf{u}(\boldsymbol {\vartheta })\). Define the Fisher information matrix

$$\boldsymbol{\mathcal{I}}\bigl(\boldsymbol{\boldsymbol {\vartheta }}^{*}\bigr)= \lim_{N\rightarrow\infty} \mathrm{E}_{\boldsymbol {\vartheta }^{*}} \biggl[ -\frac{\partial \bar{\mathbf{u}}(\boldsymbol{\boldsymbol {\vartheta }})}{\partial \boldsymbol {\vartheta }'}\bigg \vert_{\boldsymbol {\vartheta }=\boldsymbol {\vartheta }^{*}} \biggr] = \begin{bmatrix} \boldsymbol{\mathcal{I}}_{\mathsf{O},\mathsf{O}} & \\ \boldsymbol{\mathcal{I}}_{\mathsf{ME},\mathsf{O}} &\quad \boldsymbol{\mathcal{I}}_{\mathsf{ME},\mathsf{ME}} \end{bmatrix} $$

with partitions corresponding to ϑ O and ϑ ME . For the asymptotic normality of \(\widehat{\boldsymbol {\vartheta }}_{\mathsf {O}}^{\mathsf{IRC}}\), it is further supposed that

$$ N^{1/2}\, \left[ \begin{array}{c} \bar{\mathbf{u}}_{\vartheta_{\mathsf{O}}}(\boldsymbol {\vartheta }_{\mathsf {O}}^{*}, \boldsymbol {\vartheta }_{\mathsf {ME}}^{*}) \\ \widehat{\boldsymbol{\vartheta}}_{\mathsf{ME}} - \boldsymbol{\vartheta}_{\mathsf{ME}}^{*} \end{array} \right] \stackrel{{\mathcal{L}}}{\longrightarrow} \ \mathsf{ N} \left( \mathbf {0}, \; \begin{bmatrix} \boldsymbol{\mathcal{I}}_{\mathsf{O},\mathsf{O}} & \\ \mathbf{V}_{\mathsf{ME},\mathsf{O}} &\quad \mathbf{V}_{\mathsf{ME},\mathsf{ME}} \end{bmatrix} \right). $$
(15)

Then

$$ N^{1/2}\, \bigl(\widehat{\boldsymbol {\vartheta }}_{\mathsf{O}}^{\mathsf{IRC}}- \boldsymbol {\vartheta }_{\mathsf {O}}^{*}\bigr) \ \stackrel{{\mathcal{L}}}{\longrightarrow} \ {\mathsf{N}}( \mathbf{0}, \boldsymbol{\varXi}) $$
(16)

where

$$ \boldsymbol{\varXi} = \boldsymbol{\mathcal{I}}_{\mathsf{O},\mathsf{O}}^{-1} + \boldsymbol{\mathcal{I}}_{\mathsf{O},\mathsf{O}}^{-1}\, \boldsymbol{\mathcal{I}}_{\mathsf{ME},\mathsf{O}}'\, \mathbf{V}_{\mathsf{ME},\mathsf{ME}}\, \boldsymbol{\mathcal{I}}_{\mathsf{ME},\mathsf{O}}\, \boldsymbol{\mathcal{I}}_{\mathsf{O},\mathsf{O}}^{-1}. $$
(17)

The relatively simple form of (17) follows from the fact that for PML estimators V ME,O =0 in general, so terms involving V ME,O disappear from the expression (Parke 1986). The asymptotic covariance matrix of the IRC estimator, which also takes into account the uncertainty of the Stage-1 estimates, is then given as \({\mbox{ACOV}}(\widehat{\boldsymbol {\vartheta }}_{\mathsf{O}}^{ \mathsf{IRC}})=N^{-1}\, \boldsymbol{\varXi}\).

In (17), \(N^{-1}\, \boldsymbol{\mathcal{I}}_{\mathsf {O},\mathsf{O}}^{-1}\) is the asymptotic covariance matrix of \(\widehat{\boldsymbol {\vartheta }}_{\mathsf {O}}^{\mathsf{IRC}}\) if ϑ ME were known. An estimate of it is obtained as a by-product of fitting model (13), and an estimate of N −1 V ME,ME similarly from fitting (8). The remaining part of (17) is \(\boldsymbol {{\mathcal{I}}}_{\mathsf{ME},\mathsf{O}}\), which we estimate by

$$ \widehat{\boldsymbol{\mathcal{I}}}_{\mathsf{ME},\mathsf{O}} = N^{-1} \, \sum _{i=1}^{N} \, {\mathbf{u}}_{\vartheta_{\mathsf{ME}},i} \bigl(\widehat{\boldsymbol {\vartheta }}^{\mathsf{IRC}}\bigr) \; {\mathbf {u}}_{\vartheta_{\mathsf{O}},i}\bigl( \widehat{\boldsymbol {\vartheta }}^{\mathsf{IRC}}\bigr)' $$
(18)

where \({\mathbf{u}}_{\vartheta_{\mathsf{O}},i}(\widehat{\boldsymbol {\vartheta }}^{\mathsf{IRC}})\) and \({\mathbf{u}}_{\vartheta_{\mathsf{ME}},i}(\widehat {\boldsymbol {\vartheta }}^{\mathsf{IRC}})\) are the gradients of the log-likelihood i (ϑ) for unit i, evaluated at the parameter estimates \(\widehat{\boldsymbol {\vartheta }}^{\mathsf{IRC}}= (\widehat{\boldsymbol {\vartheta }}_{\mathsf{O}}^{\mathsf{IRC}\prime}, \widehat{\boldsymbol {\vartheta }}_{\mathsf{ME}}')'\). How to obtain the required gradients is demonstrated in the appendix.
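Given estimates of the individual blocks, assembling (17) and (18) is mechanical. A minimal sketch (the function names are ours and the inputs are placeholder matrices of conforming dimensions):

```python
import numpy as np

def irc_acov(I_OO, I_ME_O, V_ME_ME, N):
    """ACOV of the IRC estimator: N^{-1} * Xi from Equation (17).

    I_OO    : information block for theta_O (by-product of fitting model (13))
    I_ME_O  : cross block, estimated by the outer-product formula (18)
    V_ME_ME : N times the covariance matrix of the Stage-1 estimator (from (8))
    """
    I_OO_inv = np.linalg.inv(I_OO)
    Xi = I_OO_inv + I_OO_inv @ I_ME_O.T @ V_ME_ME @ I_ME_O @ I_OO_inv
    return Xi / N

def cross_information(U_ME, U_O):
    """Estimate (18) from N x p matrices of unit-level score contributions."""
    return U_ME.T @ U_O / U_ME.shape[0]
```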

In summary, the difference between ML and IRC does not concern consistency, as both estimators are consistent. Rather, the difference is the loss of efficiency, compared to ML, which is incurred by IRC when it discards the data on y i in estimating ϑ ME in the first stage. However, we would expect this inefficiency to be slight because very little information about ϑ ME is contained in the y i in the sample. This is examined further in the next section.

6 Simulations

We use a simulation study to compare the performance of maximum likelihood (ML), improved regression calibration (IRC) and conventional regression calibration (RC) estimators. This is done in two parts, comparing first ML and IRC—which turn out to be virtually identical—and then IRC with RC.

For the exposure model we simulate a covariate measured with error as x i =0.3z i +ζ i , with \(z_i \sim \mathrm{N}(0,1)\) distributed independently of \(\zeta_i \sim \mathrm{N}(0,\psi)\), where ψ=1. For the measurement model we consider n i =2 measures w ij of x i for each i, and simulate from a parallel or classical linear measurement model w ij =x i +δ ij , where δ ij ∼N(0,θ). Finally, for the outcome model we simulate from the logistic regression model \(\mbox{logit}\{ \operatorname{Pr}(y_{i}=1 | x_{i},z_{i}) \} = {\beta_{0}} + {\beta_{z}} z_{i} + {\beta_{x}} x_{i}\).
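For reference, one dataset from this design can be generated as follows (a sketch; the seed is arbitrary, the function name is ours, and the regression and error-variance settings discussed below are passed as arguments):

```python
import numpy as np

rng = np.random.default_rng(2012)   # arbitrary seed

def simulate(N, beta_x, theta, beta0=-2.0, beta_z=0.5, psi=1.0):
    """One dataset from the simulation design described above."""
    z   = rng.standard_normal(N)
    x   = 0.3 * z + rng.normal(0.0, np.sqrt(psi), N)             # exposure model
    w   = x[:, None] + rng.normal(0.0, np.sqrt(theta), (N, 2))   # two replicates
    eta = beta0 + beta_z * z + beta_x * x                        # linear predictor
    y   = rng.binomial(1, 1.0 / (1.0 + np.exp(-eta)))            # logistic outcome
    return y, w, z

y, w, z = simulate(N=1000, beta_x=1.0, theta=1.0)   # reliability rho = 0.5
```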

Three values of the coefficient β x of the fallibly measured covariate are considered: a moderate magnitude β x =0.5, a high magnitude β x =1, and a very high magnitude β x =1.5, which correspond respectively to odds ratios of 1.65, 2.72, and 4.48 for one standard deviation change in x. The very high magnitude case is included in the spirit of Buzas and Stefanski (1995, p. 546) to provide a tough test. For the measurement error variance θ, we use values θ=1 and θ=0.33. These give two different values for the reliabilities ρ=ψ/(ψ+θ), a moderate reliability case where ρ=0.5 and a high reliability case where ρ=0.75. The parameters β z and β 0 are fixed at 0.5 and −2, respectively, throughout all simulations. We consider the sample sizes N=200, N=1000, and N=5000. For each setting, 1000 replications of datasets are simulated.

ML estimation was carried out using numerical integration with 8-point adaptive quadrature. For IRC we used 3-point ordinary Gauss–Hermite quadrature, motivated by the earlier discussion of crude and fast quadrature approximation in this setting. There were, however, a handful of cases where the latter was not accurate enough, indicated by clearly divergent estimates from ML and IRC. To rectify this, we re-estimated the models using adaptive quadrature whenever the IRC estimate of β x or β z was larger than 3 in absolute value, which was required for only four data sets in one simulation setting. This decision rule is straightforward to apply also in the analysis of real data, since the ML estimates need not be known.

We first compare ML and IRC estimators, and also assess the performance of estimators of the variance (17) of the IRC estimator. These results are reported in Tables 1 and 2. It is clear that the estimates of the regression coefficients from IRC are almost identical to those from ML, regardless of the sample size and the parameter values. This is the case not only on average, but also for nearly every individual data set. As a result, the simulation standard deviations of the estimators are also very similar. There thus appears to be virtually no loss of efficiency from the two-stage method of estimation employed by IRC.

Table 1. Simulation results for maximum likelihood (ML) and improved regression calibration (IRC) estimators of regression parameter β x for covariate measured with error under different measurement reliabilities ρ, true values of β x , and sample sizes N. In each case, the true value of the other regression coefficient β z is 0.5. The results are based on 1000 replications. The table shows the simulation mean and standard deviation (SD) of the point estimates \(\hat{\beta}_{x}\), mean of their estimated standard errors (m(SE)), and coverage percentage of 95 % confidence intervals (C95). For IRC estimates, also shown are coverage of 95 % intervals based on a naive estimated standard error which ignores the first-stage uncertainty (C95-2), and the average percentage that this uncertainty contributes to the full standard errors (%-1).
Table 2. Simulation results for maximum likelihood (ML) and improved regression calibration (IRC) estimators of regression parameter β z for perfectly measured covariate under different measurement reliabilities ρ, true values of the other regression coefficient β x , and sample sizes N. In each case, the true value of β z is 0.5. The results are based on 1000 replications. The columns of the table are the same as in Table 1.

On the other hand, computing times for the two approaches can be very different. On a desktop PC with a 2.4 GHz Intel Core 2 processor and 2 GB RAM, estimation for one dataset of sample sizes 200, 1000, and 5000, respectively, took around 15, 45, and 360 seconds for ML, and around 1, 3, and 15 seconds for IRC. It thus appears that the relative advantage in computing time of IRC over ML increases as the sample sizes increase. The same is true when the number of replicate measurements w ij is increased. In tests with n i =3 replicates (not shown here), the computing times for IRC were essentially unchanged, while the times for ML increased to about 17, 55, and 520 seconds for N=200, 1000, and 5000, respectively.

The estimated standard errors of the IRC estimates, taking into account uncertainty from both stages of the estimation, are obtained by estimating (17) as shown in the appendix. It can be seen that this approach performs well. In the most difficult cases, with small sample size, large effects and low reliability of measurement, the standard errors somewhat underestimate the true sampling variation. This is mainly due to right-skewed sampling distributions of the estimates in these cases, which is also reflected in a small upward bias of both ML and IRC estimates. The tails of the sampling distribution do not affect the coverage of the Wald-based 95 % confidence intervals for the parameters, which is 93.6–97.1 % across all the simulations.

The last two columns of Tables 1 and 2 examine a simplified estimate of the standard errors of the IRC estimates that is obtained by using only the first term on the right-hand side of (17), and omitting the second. In other words, this simply ignores the uncertainty in the estimated parameters of the exposure and measurement models from the first stage. Such an approach would be very convenient in practice because it entails using the estimated standard errors from the second-stage model directly, without any further adjustment. In the cases considered here, this simplification would do us little harm since the coverage of the confidence intervals (shown in the column “C95-2” of the tables) is still quite satisfactory. The reason for this is indicated by the last column of the tables, which shows the average percentage that the second term of (17) contributes to the full estimated standard error. This is mostly around 2 %, rising to 6.4 % in the most challenging configuration considered here.

Tables 3 and 4 compare the simulation results for IRC and RC estimators, omitting the full ML estimators because they are so similar to IRC. The focus here is on the finite-sample means and variabilities of the estimators, to examine their relative performances in different settings. We note also that computing times for IRC and RC were very similar, typically around 10 % higher for IRC.

Table 3. Simulation results for improved regression calibration (IRC) and conventional regression calibration (RC) estimators of regression parameter β x for covariate measured with error under different measurement reliabilities ρ, true values of β x , and sample sizes N. In each case, the true value of the other regression coefficient β z is 0.5. The results are based on 1000 replications. The table shows the simulation mean, % bias, and root mean squared error (RMSE) of the point estimates of β x , and coverage percentage of 95 % confidence intervals (C95).
Table 4. Simulation results for improved regression calibration (IRC) and conventional regression calibration (RC) estimators of regression parameter β z for perfectly measured covariate under different measurement reliabilities ρ, true values of the other regression coefficient β x , and sample sizes N. In each case, the true value of β z is 0.5. The results are based on 1000 replications. The columns of the table are the same as in Table 3.

The results show that the two estimators perform best in different circumstances. IRC (and ML) estimators have an upward bias in small samples, due to the right-skewness of their sampling distributions, but the bias disappears in larger samples because these estimators are consistent. In contrast, RC estimators have a bias due to their approximate nature, which is largest when the reliability of measurement is low or when the regression coefficients are large. Taking into account both the biases and sampling variances, root mean squared errors tend to be smaller for RC when the sample size is small or moderate, and for IRC when the sample size is reasonably large. The bias of RC means that in the most difficult cases the coverage of confidence intervals based on them is substantially below the nominal level, while for IRC the coverage levels are always adequate.

In summary, the simulation study suggests, first, that we can generally replace ML with pseudo-ML (IRC) estimation, with essentially no loss in efficiency of estimation but with a substantial gain in computational speed. Second, when comparing IRC with RC, we find that the preferred estimator can depend on the circumstances of the analysis. RC tends to perform best with smaller samples and relatively mild measurement error problems, whereas IRC does best when the sample sizes are large, measurement error is severe or the effects being estimated are strong. The choice between RC and IRC is not informed by speed of computation, which is essentially the same for both of them.

7 Empirical Illustration: Ability and High Earnings

To illustrate covariate measurement error modeling in practice, we apply the investigated methods to a dataset on 935 men from the 1980 wave of the Young Men’s Cohort of the U.S. National Longitudinal Survey (NLS), previously analyzed by Griliches (1976) and Blackburn and Neumark (1992), among others.

The binary outcome y i we consider here is being a high earner, defined as having a salary above the 90th percentile of the sample distribution. The covariate of main interest is ability x i , also denoted [Ability], which is measured with error. Three covariates which are assumed measured without error are also included: working experience in years z i1 [Exper] (sample mean 11.6, s.d. 4.4), a dummy variable for living in an urban area z i2 [Urban] (71.8 % of the sample), and a dummy variable for being black z i3 [Black] (12.8 %).

Under the standard assumptions previously stated, the outcome model is

$$\mbox{logit}\bigl\{ \operatorname{Pr}(y_i=1 | x_i,z_{i1},z_{i2},z_{i3}) \bigr\} = \beta_{z_0} + \beta_{z_1} z_{i1} + \beta_{z_2} z_{i2}+ \beta_{z_3} z_{i3}+ { \beta_x} x_i, $$

and the exposure model is

$$x_i = \gamma_{0}+\gamma_{1} z_{i1} + \gamma_{2} z_{i2}+ \gamma_{3} z_{i3} + \zeta_i, \quad\zeta_i \sim \mathrm{N}(0,\psi). $$

The men’s abilities are measured by two fallible measures. The first measure is an IQ test w i1 [IQ], collected as part of a survey of the respondents’ schools conducted in 1968. Since a wide variety of IQ tests were used in different states, these were recoded into “IQ equivalents” by the Center for Human Resources Research at the Ohio State University which administers the NLS. The second measure is a test of “Knowledge of World of Work” w i2 [Know], which examines respondents’ knowledge of the labor market, covering the duties, educational attainment, and relative earnings of ten occupations. It is intended to reflect both the quantity and quality of schooling, intelligence, and motivation (curiosity about the outside world). The seminal paper by Griliches (1976) provides a lucid discussion of the data, variables and specification issues.

We use versions of the two fallible measures standardized to have sample mean 0 and variance 1. Denoting these standardized variables by w i1 and w i2, we consider the classical measurement model

$$w_{ij} = x_{i} + \delta_{ij}, \quad \delta_{ij} \sim\mathrm{N}(0,\theta), \ j = 1,2. $$

This is obtained from the general model (1) for a scalar x i by assuming λ 1=λ 2=1, and then setting ν 1=ν 2=0 and θ 1=θ 2=θ because the marginal means and variances of w i1 and w i2 are equal. Note that for identifiability the model thus specifies that the two measures have equal loadings, i.e., that on the scale of the standardized measures they are equally discriminating measures of ability. This assumption could be relaxed if more than two fallible measures were available.

Estimates from ML, IRC, and RC are shown in Table 5. The parameter estimates for the outcome model are practically identical for ML and IRC, whereas the estimates from RC are smaller, as expected. In particular, the estimate for the parameter of main interest β x from IRC, \(\hat{\beta}_{x}=2.50\), is essentially identical to the ML estimate, whereas the estimate from RC is \(\hat{\beta}_{x}=2.35\).

Table 5. Ability and high earnings: Estimates for logistic regression with covariate measurement error based on maximum likelihood (ML), improved regression calibration (IRC), and conventional regression calibration (RC). For IRC, SE are estimated standard errors based on asymptotic covariance matrix derived in this article and SE-2 are naive estimated standard errors ignoring uncertainty in Stage-1 estimates.

The estimated standard errors of estimates of β are practically identical for ML and IRC, apart from numerical differences. This indicates that the loss of efficiency in estimating the parameters of the exposure and measurement models from only Stage 1 of IRC is effectively nil; indeed, estimates of these parameters and associated estimated standard errors are identical to the full ML results to at least three decimal places. Uncertainty from Stage 1, i.e., the second term of the variance matrix (17), contributes around 8 % of the estimated standard error of \(\hat{\beta}_{x}\) for IRC. We also note that the sum of the maximized log-likelihood components for IRC, \(\ell_{1}+\ell_{2}=-2738.41\), is very close to the maximized full log-likelihood, \(\ell(\widehat{\boldsymbol{\vartheta}})=-2738.38\).

From the estimated exposure model, the ability measure is significantly associated with urbanity, race and working experience. Its conditional variance given these covariates is \(\hat{\psi}=0.29\). The estimated measurement error variance is \(\hat{\theta}=0.58\), and the conditional reliability of the measures (given the covariates) is thus \(\hat{\psi}/(\hat{\psi}+\hat{\theta})=0.33\).

Regarding the outcome model, there is a strong estimated association between the ability measure and high earnings when controlling for working experience, urbanity, and race. The estimated coefficient of \(\hat{\beta}_{x}=2.50\) translates to an odds ratio of 3.8 for being a high earner corresponding to an increase of one conditional standard deviation in ability. The other covariates are retained in the model, but they could possibly also have been omitted because they do not have statistically significant associations with high earnings at the 5 % level. It is worth noting that if the model was simplified by omitting some control variables, we could still choose to use the predicted values \(\widetilde{\xi}_{i}\) and variances \(\hat{\omega}_{i}\) conditional on all of them, without re-calculating these predictions. This only requires the modification that in the calculation of the standard errors (as shown in the appendix) the corresponding elements of β z are set to 0.

8 Discussion

In this article, we have proposed an improved regression calibration approach to the estimation of generalized linear models with covariate measurement error, a pseudo maximum likelihood method that simultaneously addresses the computational challenge of maximum likelihood and the inconsistency of conventional regression calibration. A decomposed form of the likelihood was exploited, where the component for the measurement and exposure models is in closed form and trivial to maximize, and the component for the outcome model is accurately maximized using crude and fast numerical integration.

Our simulations show that improved regression calibration produces parameter estimates that are practically indistinguishable from those produced by maximum likelihood. Interval estimation based on the asymptotic covariance matrix for improved regression calibration that was derived in this article has excellent performance. Even interval estimation based on the naive estimator of the asymptotic covariance matrix (ignoring the uncertainty incurred in the first step) usually performs well. Compared to conventional regression calibration, improved regression calibration offers little or no advantage when sample sizes are small, but performs best when samples are reasonably large and especially when the measurement error or the effects are not small.

Both the fallibly measured covariates and their measures are continuous in the models considered here. Improved regression calibration can also be used when the observed measures are categorical, in which case categorical factor models would be used as measurement models. Since the predictive distributions are then no longer normal, it is not obvious that improved regression calibration would work well. If both the fallibly measured covariates and their measures are categorical, the problem is one of misclassification where integration is replaced by summation and maximum likelihood estimation becomes computationally straightforward.