1 Introduction

Increasingly, data analysts seek to link records across data sets to facilitate statistical analyses. As a prototypical example, a health researcher seeks to link data from a previously completed study to patients’ electronic medical records to collect long-term outcomes, with the ultimate goal of estimating relationships between the long-term outcomes and baseline covariates. Such linkages are performed readily when unique identifiers, such as social security numbers, are available for all records in all data sets.

Often, however, one or more of the data sets do not have unique identifiers, perhaps because they were never collected or are not made available due to privacy concerns. In such cases, analysts have to link records based on indirect identifiers, such as name, date of birth, and demographic variables [1, 2]. Generally, such indirect identifiers contain distortions and errors. As a result, they can differ across the data sets, which can make it difficult to determine the correct record linkages. This uncertainty should be quantified and propagated to statistical inferences, although typically this is not done.

In the statistics literature, the most popular method for linking records via indirect identifiers is based on the probabilistic record linkage (RL) approach of Newcombe et al. [15], which was later extended and formalized by Fellegi and Sunter [4]. Many extensions to the Fellegi-Sunter (FS) model have been proposed [e.g., 18, 23, 24]. A common drawback of these and other probabilistic RL methods [e.g., 7, 12] is the difficulty in quantifying linkage uncertainty, and propagating that uncertainty to statistical inferences. These limitations have led to developments of RL approaches from Bayesian perspectives [e.g., 3, 5, 6, 9, 10, 11, 14, 16, 17, 19, 20, 21, 25, 26].

In this article, we propose a Bayesian model for performing probabilistic RL and linear regression simultaneously. The proposed model quantifies uncertainty about the linkages and propagates this uncertainty to inferences about the regression parameters. We focus on bipartite RL—that is, the analyst seeks to merge two data sets—assuming that individuals appear at most once in each data set. As we illustrate, the model can leverage relationships among the dependent and independent variables in the regression to potentially improve the quality of the linkages. This also can increase the accuracy of resulting inferences about the regression parameters.

We use a Bayesian hierarchical model that builds on prior work by Sadinle [17], who proposed a Bayesian version of the FS model for merging two data sets. In fact, one of our primary contributions is to turn the model in [17] into a procedure for jointly performing probabilistic RL and fully Bayesian inference for regression parameters. We also propose and evaluate the effectiveness of three algorithms for fitting the Bayesian hierarchical model, focusing on both the quality of the linkages and on the accuracy of the parameter estimates.

2 Review of Bayesian Probabilistic Record Linkage

In this section, we review the Bayesian bipartite RL model of [17]. Consider two data sets \(\mathbf {A}_1\) and \(\mathbf {A}_2\), containing \(n_1\) and \(n_2\) records, respectively. Without loss of generality, assume \(n_1 \ge n_2\). Our goal is to link records in \(\mathbf {A}_1\) to records in \(\mathbf {A}_2\). We further assume that \(\mathbf {A}_1\) and \(\mathbf {A}_2\) do not contain duplicate records; that is, each record in \(\mathbf {A}_1\) corresponds to a single individual, as does each record in \(\mathbf {A}_2\). Finally, we assume that some of the same individuals appear in both \(\mathbf {A}_1\) and \(\mathbf {A}_2\).

To characterize this, we define the random variable \(\mathbf {Z} = (Z_1, \dots , Z_{n_2})\) as the vector of matching labels for the records in \(\mathbf {A}_2\). For \(j=1, \dots , n_2\), let

$$ Z_j = \begin{cases} i, & \text{if records } i \in \mathbf{A}_1 \text{ and } j \in \mathbf{A}_2 \text{ refer to the same entity;} \\ n_1+j, & \text{if record } j \in \mathbf{A}_2 \text{ does not have a match in } \mathbf{A}_1. \end{cases} $$

Analysts determine whether a pair of records \((i, j)\) is a link, i.e., whether or not \(Z_j=i\), by comparing values of variables that are common to \(\mathbf {A}_1\) and \(\mathbf {A}_2\). Suppose we have F common variables, also known as linking variables or fields. For \(f=1, \dots , F\), let \(\gamma _{ij}^f\) represent a score that reflects the similarity of field f for records i and j. For example, when field f is a binary variable, we can set \(\gamma _{ij}^f = 1\) when record i agrees with record j on field f, and \(\gamma _{ij}^f = 0\) otherwise. When field f is a string variable like name, we can calculate a similarity metric like the Jaro-Winkler distance [8, 22] or the Levenshtein edit distance [13]. We can convert these string metrics to \(\gamma _{ij}^f\) by categorizing the scores into a multinomial variable, where the categories represent the strength of agreement. We illustrate this approach in Sect. 4.

For each record pair \((i, j)\) in \(\mathbf {A}_1 \times \mathbf {A}_2\), let \(\pmb {\gamma }_{ij} = (\gamma _{ij}^1, \dots , \gamma _{ij}^F)\). We assume \(\pmb {\gamma }_{ij}\) is a realization of a random vector \({\varGamma }_{ij}\) distributed as

$$\begin{aligned} \varGamma _{ij}|Z_j=i {\mathop {\sim }\limits ^{iid}} \mathcal {M}(\mathbf {m}), \qquad \varGamma _{ij}|Z_j \ne i {\mathop {\sim }\limits ^{iid}} \mathcal {U}(\mathbf {u}), \qquad \text { where} \end{aligned}$$

\(\mathcal {M}(\mathbf {m})\) represents the model for comparison vectors among matches, and \(\mathcal {U}(\mathbf {u})\) represents the model for comparison vectors among non-matches. For each field f with \(L_f + 1\) agreement levels, we let \(m_{f\ell } = \mathbb {P}(\varGamma _{ij}^f = \ell |Z_j=i)\) be the probability of a match having agreement level \(\ell \) in field f, and let \(u_{f\ell } = \mathbb {P}(\varGamma _{ij}^f = \ell |Z_j \ne i)\) be the probability of a non-match having agreement level \(\ell \) in field f. Let \(\mathbf {m}_f = (m_{f0}, \dots , m_{fL_f})\) and \(\mathbf {u}_f = (u_{f0}, \dots , u_{fL_f})\); let \(\mathbf {m} = (\mathbf {m}_1, \dots , \mathbf {m}_F)\) and \(\mathbf {u} = (\mathbf {u}_1, \dots , \mathbf {u}_F)\).

For computational convenience, it is typical to assume the comparison fields are conditionally independent given the matching status of the record pairs. Let \(\mathbf {\Theta } = (\mathbf {m},\mathbf {u})\). The likelihood of the comparison data can be written as

$$\begin{aligned} \mathcal {L}(\mathbf {Z}|\mathbf {\Theta }, \pmb {\gamma }) = \prod _{i=1}^{n_1} \prod _{j=1}^{n_2} \prod _{f=1}^F \prod _{\ell = 0}^{L_f} \Big [m_{f\ell }^{I(Z_j=i)} u_{f\ell }^{I(Z_j \ne i)} \Big ]^{I(\gamma _{ij}^f = \ell )}, \end{aligned}$$
(1)

where \(I(\cdot )=1\) when its argument is true and \(I(\cdot )=0\) otherwise.
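To make the computation concrete, the following R sketch (a hypothetical helper, not code from the paper) evaluates the log of (1), given an \(n_1 \times n_2 \times F\) array of observed agreement levels and lists of the per-field probability vectors:

```r
# Sketch: log of the complete-data likelihood (1).
# gamma_arr[i, j, f] = agreement level of pair (i, j) on field f (0, ..., L_f)
# m, u: lists of length nF; m[[f]][l + 1] = m_{fl}, u[[f]][l + 1] = u_{fl}
# Z: length-n2 vector of matching labels (Z[j] = i, or n1 + j for no match)
loglik_comparisons <- function(gamma_arr, m, u, Z) {
  n1 <- dim(gamma_arr)[1]; n2 <- dim(gamma_arr)[2]; nF <- dim(gamma_arr)[3]
  total <- 0
  for (j in seq_len(n2)) for (i in seq_len(n1)) for (f in seq_len(nF)) {
    l <- gamma_arr[i, j, f]     # observed agreement level for this pair
    total <- total + log(if (Z[j] == i) m[[f]][l + 1] else u[[f]][l + 1])
  }
  total
}
```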

Sometimes, there are missing values in the linking variables. Although we do not consider this possibility in our simulations, we summarize how to handle missing values in the model, assuming ignorable missing data. With conditional independence and ignorability, we can marginalize over the missing comparison variables. The likelihood of the observed comparison data can be written as

$$\begin{aligned} \mathcal {L}(\mathbf {Z}|\mathbf {\Theta }, \pmb {\gamma }^{\mathrm {obs}}) = \prod _{f=1}^F \prod _{\ell = 0}^{L_f}m_{f\ell }^{a_{f \ell }(\mathbf {Z})}u_{f\ell }^{b_{f\ell }(\mathbf {Z})}, \qquad \text {where} \end{aligned}$$
(2)

\(a_{f\ell }(\mathbf {Z}) = \sum _{i,j}I_{\mathrm {obs}}(\gamma _{ij}^f)I(\gamma _{ij}^f = \ell )I(Z_j =i)\) and \(b_{f\ell }(\mathbf {Z}) = \sum _{i,j}I_{\mathrm {obs}}(\gamma _{ij}^f)I(\gamma _{ij}^f = \ell )I(Z_j \ne i)\). For a given \(\mathbf {Z}\), these represent the numbers of matches and non-matches, respectively, with observed agreement level \(\ell \) in field f. Here, \(I_\mathrm {obs}(\cdot ) =1\) when its argument is observed, and \(I_\mathrm {obs}(\cdot ) =0\) when its argument is missing.

To define the prior distributions for \(\mathbf {m}\) and \(\mathbf {u}\), for all fields f, let \(\varvec{\alpha }_f = (\alpha _{f0}, \dots , \alpha _{fL_f})\) and \(\varvec{\beta }_f = (\beta _{f0}, \dots , \beta _{fL_f})\). We assume that \(\mathbf {m}_f \sim \mathrm {Dirichlet}(\varvec{\alpha }_f)\) and \(\mathbf {u}_f \sim \mathrm {Dirichlet}(\varvec{\beta }_f)\), where \(\varvec{\alpha }_f\) and \(\varvec{\beta }_f\) are known parameters. In our simulation studies, we set all entries of \(\varvec{\alpha }_f\) and \(\varvec{\beta }_f\) equal to 1 for every field.

We use the prior distribution for \(\mathbf {Z}\) from [17]. For \(j = 1, \dots , n_2\), let the indicator variable \(I(Z_j \le n_1) \mid \pi {\mathop {\sim }\limits ^{iid}} \mathrm {Bernoulli}(\pi )\), where \(\pi \) is the proportion of matches expected a priori. Let \(\pi \sim \mathrm {Beta}(\alpha _\pi , \beta _\pi )\), where the prior mean \(\alpha _\pi /(\alpha _\pi + \beta _\pi )\) represents the expected proportion of overlap. Let \(n_{12}(\mathbf {Z}) = \sum _j I(Z_j \le n_1)\) be the number of matches according to \(\mathbf {Z}\). The prior specification implies that \(n_{12}(\mathbf {Z}) \sim \) Beta-Binomial\((n_2,\alpha _\pi ,\beta _\pi )\) after marginalizing over \(\pi \). That is,

$$\begin{aligned} \mathbb {P}(n_{12}(\mathbf {Z})) = \left( {\begin{array}{c}n_2\\ n_{12}(\mathbf {Z})\end{array}}\right) \frac{\mathrm {B}(n_{12}(\mathbf {Z})+\alpha _\pi ,n_2-n_{12}(\mathbf {Z})+\beta _\pi )}{\mathrm {B}(\alpha _\pi ,\beta _\pi )}. \end{aligned}$$
(3)

We assume that, conditional on the value of \(n_{12}(\mathbf {Z})\), all the possible bipartite matchings are equally likely a priori. There are \(n_1!/(n_1-n_{12}(\mathbf {Z}))!\) such bipartite matchings. Thus, the prior distribution for \(\mathbf {Z}\) is

$$\begin{aligned} \mathbb {P}(\mathbf {Z}|\alpha _\pi ,\beta _\pi )&= \mathbb {P}(n_{12}(\mathbf {Z})|\alpha _\pi ,\beta _\pi )\mathbb {P}(\mathbf {Z}|n_{12}(\mathbf {Z}),\alpha _\pi ,\beta _\pi ) \end{aligned}$$
(4)
$$\begin{aligned}&= \frac{(n_1-n_{12}(\mathbf {Z}))!}{n_1!}\frac{\mathrm {B}(n_{12}(\mathbf {Z})+\alpha _\pi ,n_2-n_{12}(\mathbf {Z})+\beta _\pi )}{\mathrm {B}(\alpha _\pi ,\beta _\pi )}. \end{aligned}$$
(5)
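In log form, (5) is straightforward to evaluate with base R; a small sketch:

```r
# Sketch: log of the prior (5) on a bipartite matching Z.
# n12: the number of matches n_{12}(Z); n1, n2: file sizes;
# a, b: the hyperparameters alpha_pi and beta_pi.
log_prior_Z <- function(n12, n1, n2, a, b) {
  lfactorial(n1 - n12) - lfactorial(n1) +
    lbeta(n12 + a, n2 - n12 + b) - lbeta(a, b)
}
```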

3 The Bayesian Hierarchical Model for Simultaneous Regression and Record Linkage

In this section, we present the Bayesian hierarchical model for regression and RL, and propose three Markov chain Monte Carlo (MCMC) algorithms for fitting the model in practice. Throughout, we assume the explanatory variables \(\mathbf {X}\) are in \(\mathbf {A}_1\), and the response variable \(\mathbf {Y}\) is in \(\mathbf {A}_2\).

3.1 Model Specification

We assume the standard linear regression, \(\mathbf {Y}|\mathbf {X},\mathbf {V},\mathbf {Z} \sim N(\mathbf {X\beta }, \sigma ^2 \mathbf {I})\). Here, \(\mathbf {V}\) are linking variables used in the RL model but not in the regression model. Analysts can specify prior distributions on \((\mathbf {\beta },\sigma ^2)\) that represent their beliefs. A full specification of the joint distribution of \((\mathbf {Y}, \mathbf {X} | \mathbf {V})\) requires analysts to specify some marginal model for \(\mathbf {X}\), written generically as \(f(\mathbf {X}|\mathbf {V})\). In some contexts, however, it is not necessary to specify \(f(\mathbf {X}|\mathbf {V})\), as we explain in Sect. 3.2. Critically, this model assumes that the distribution of \((\mathbf {Y}, \mathbf {X} | \mathbf {V})\) is the same for matches and non-matches. Finally, for the RL component of the model, we model \(\mathbf {Z}\) using the Bayesian FS approach in Sect. 2.

For the simulation studies, we illustrate computations with the Bayesian hierarchical model using univariate Y and univariate X. Assume \(X \sim N(\mu , \tau ^2)\). As a result, in the simulations, the random variable (XY) follows a bivariate normal distribution with

$$\begin{bmatrix} X \\ Y \end{bmatrix} \sim N\Big (\begin{bmatrix} \mu \\ \beta _0 + \beta _1 \mu \end{bmatrix}, \begin{bmatrix} \tau ^2 & \beta _1 \tau ^2 \\ \beta _1 \tau ^2 & \sigma ^2+\beta _1^2\tau ^2 \end{bmatrix} \Big ). $$
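This covariance matrix follows from writing \(Y = \beta _0 + \beta _1 X + \varepsilon \) with \(\varepsilon \sim N(0, \sigma ^2)\) independent of X, so that

$$ \mathrm {Cov}(X, Y) = \beta _1 \mathrm {Var}(X) = \beta _1 \tau ^2, \qquad \mathrm {Var}(Y) = \beta _1^2 \tau ^2 + \sigma ^2. $$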

We assume a normal-Gamma prior on the regression parameters. Letting \(\phi = 1/\sigma ^2\), we have \( \phi \sim G(.5, .5) \) and \( \beta |\phi \sim N(b_0, (\phi \varPhi _0)^{-1}) \), where \(b_0 = [3,1]^T\) and \(\varPhi _0\) is a \(2\times 2\) identity matrix. When needed, we assume the Jeffreys prior \(p(\mu ,\tau ^2) \propto 1/\tau ^2.\)

3.2 Estimation Strategies

Even in the relatively uncomplicated set-up of the simulation study, it is not possible to compute the posterior distribution of the model parameters in closed form. Therefore, we consider three general strategies for MCMC sampling in order to approximate the posterior distribution.

First, we propose an MCMC sampler that uses only the linked records when estimating the full conditional distribution of the regression parameters in each iteration of the sampler. This method generates potentially different samples of linked records at each iteration. A key advantage of this method is that it does not require imputation of \(\mathbf {X}\); hence, analysts need not specify \(f(\mathbf {X}|\mathbf {V})\). We call this the joint model without imputation, abbreviated as JM. Second, we propose an MCMC sampler that imputes the missing values of \(\mathbf {X}\) for those records in \(\mathbf {A}_2\) without a link, and updates the regression parameters in each iteration using the linked pairs as well as the imputed data. We call this the joint model with imputation, abbreviated as JMI. Third, we propose an MCMC sampler that is similar to JMI but uses an extra step when imputing the missing values of \(\mathbf {X}\). Specifically, at each iteration, we (i) sample values of the regression parameters from a conditional distribution based on only the linked records, (ii) use the sampled parameters to impute the missing values in \(\mathbf {X}\), and (iii) update the regression coefficients based on the linked as well as the imputed pairs. By adding step (i), we aim to reduce the potential effects of a feedback loop in which less accurate estimates of the regression parameters result in less accurate estimates of the conditional distribution of \(\mathbf {X}\), and so on through the MCMC iterations. We call this the joint model with imputation and reduced feedback (JMIF).

For JMI and JMIF, inferences are based on every entity in \(\mathbf {A}_2\), whereas for JM inferences are based on the subsets of linked pairs, which can differ across MCMC iterations. Analysts should keep these differences in mind when selecting an algorithm that suits their goals.

3.3 Details of the MCMC Samplers

In this section, we present the mathematical details for implementing the three proposed MCMC samplers. Before doing so, we present an algorithm for a two-step approach, where we perform RL and then use the linked data for regression. The three proposed MCMC samplers for the Bayesian hierarchical model utilize parts of the algorithm for the two-step approach.

3.3.1 Two-Step Approach (TS)

Given the parameter values at iteration t of the sampler, we need to sample new values \({\mathbf {m}_f^{[t+1]} = (m_{f0}^{[t+1]}, \dots , m_{fL_f}^{[t+1]})}\) and \({\mathbf {u}_f^{[t+1]} = (u_{f0}^{[t+1]}, \dots , u_{fL_f}^{[t+1]})}\), where \(f=1, \dots , F\). We then sample a new value \(\mathbf {Z}^{[t+1]} = (Z_1^{[t+1]}, \dots , Z_{n_2}^{[t+1]})\). The steps are as follows.

T.1:

For \(f=1, \dots , F\), sample

$$\begin{aligned} \mathbf {m}_f^{[t+1]}|\pmb {\gamma }^{obs},\mathbf {Z}^{[t]} \sim \mathrm {Dirichlet}(a_{f0}(\mathbf {Z}^{[t]})+ \alpha _{f0}, \dots , a_{fL_f}(\mathbf {Z}^{[t]})+ \alpha _{fL_f}), \end{aligned}$$
(6)
$$\begin{aligned} \mathbf {u}_f^{[t+1]}|\pmb {\gamma }^{obs},\mathbf {Z}^{[t]} \sim \mathrm {Dirichlet}(b_{f0}(\mathbf {Z}^{[t]})+ \beta _{f0}, \dots , b_{fL_f}(\mathbf {Z}^{[t]})+ \beta _{fL_f}). \end{aligned}$$
(7)

Collect these new draws into \(\mathbf {\Theta }^{[t+1]}\). Here, each

$$\begin{aligned} a_{fl}(\mathbf {Z})&= \sum _{i,j}I_{obs}(\pmb {\gamma }_{ij}^f)I(\gamma _{ij}^f=l)I(Z_j=i),\end{aligned}$$
(8)
$$\begin{aligned} b_{fl}(\mathbf {Z})&= \sum _{i,j}I_{obs}(\pmb {\gamma }_{ij}^f)I(\gamma _{ij}^f=l)I(Z_j \ne i). \end{aligned}$$
(9)
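Since the Dirichlet draws in (6) and (7) factor across fields, step T.1 is simple to implement. The following R sketch (hypothetical function names; Dirichlet draws via normalized Gamma variates, so no extra package is needed) updates one field:

```r
# Sketch of step T.1 for a single field f.
rdirichlet1 <- function(alpha) {        # one Dirichlet draw via Gammas
  g <- rgamma(length(alpha), shape = alpha, rate = 1)
  g / sum(g)
}
# a_counts[l + 1] = a_{fl}(Z), b_counts[l + 1] = b_{fl}(Z);
# alpha_f, beta_f: prior vectors (all ones in our simulations)
update_m_u <- function(a_counts, b_counts, alpha_f, beta_f) {
  list(m = rdirichlet1(a_counts + alpha_f),
       u = rdirichlet1(b_counts + beta_f))
}
```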
T.2:

Sample the entries of \(\mathbf {Z}^{[t+1]}\) sequentially. Having sampled the first \(j-1\) entries of \(\mathbf {Z}^{[t+1]}\), we define \(\mathbf {Z}_{-j}^{[t+(j-1)/n_2]} = (Z_1^{[t+1]}, \dots , Z_{j-1}^{[t+1]}, Z_{j+1}^{[t]}, \dots , Z_{n_2}^{[t]})\). We sample a new label \(Z_j^{[t+1]}\), with the probability of selecting label \(q \in \{1, \dots , n_1, n_1+j\}\) given by \(p_{qj}(\mathbf {Z}_{-j}^{[t+(j-1)/n_2]} \mid \mathbf {\Theta }^{[t+1]})\). This can be expressed for generic \(\mathbf {Z}_{-j}\) and \(\mathbf {\Theta }\) as

$$\begin{aligned} p_{qj}(\mathbf {Z}_{-j}|\mathbf {\Theta }) \propto \begin{cases} \exp [w_{qj}]\, I(Z_{j'} \ne q,\ \forall j' \ne j), & q \le n_1; \\ \big [n_1 - n_{12}(\mathbf {Z}_{-j})\big ]\, \dfrac{n_2-n_{12}(\mathbf {Z}_{-j})-1+\beta _\pi }{n_{12}(\mathbf {Z}_{-j})+\alpha _\pi }, & q = n_1+j, \end{cases} \end{aligned}$$
(10)

where \(w_{qj} = \log [\mathbb {P}(\pmb {\gamma }_{qj}^{obs}|Z_j=q,\mathbf {m})/\mathbb {P}(\pmb {\gamma }_{qj}^{obs}|Z_j \ne q,\mathbf {u})]\), which can be written equivalently as

$$\begin{aligned} w_{qj} = \sum _{f=1}^FI_{obs}(\pmb {\gamma }_{qj}^f)\sum _{l=0}^{L_f}\log \Big (\frac{m_{fl}}{u_{fl}}\Big )I(\gamma _{qj}^f = l). \end{aligned}$$
(11)

The normalizing constant for \(p_{qj}(\mathbf {Z}_{-j}|\mathbf {\Theta })\) follows by summing the right-hand side of (10) over the candidate labels \(q \in \{1, \dots , n_1, n_1+j\}\), namely

$$\begin{aligned} \sum _{q=1}^{n_1} \exp [w_{qj}]\, I(Z_{j'} \ne q,\ \forall j' \ne j) + \big [n_1 - n_{12}(\mathbf {Z}_{-j})\big ]\, \frac{n_2-n_{12}(\mathbf {Z}_{-j})-1+\beta _\pi }{n_{12}(\mathbf {Z}_{-j})+\alpha _\pi }. \end{aligned}$$
(12)
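A sketch of step T.2 for one record j, assuming a hypothetical \(n_1 \times n_2\) matrix `W` with `W[q, j]` \(= w_{qj}\) precomputed from (11):

```r
# Sketch of step T.2: resample the matching label of record j in A_2.
# W: n1 x n2 matrix of log likelihood ratios w_{qj} from (11)
# Z: current matching labels; a, b: alpha_pi, beta_pi
sample_Zj <- function(j, Z, W, n1, n2, a, b) {
  taken <- Z[-j][Z[-j] <= n1]           # labels used by the other records
  free  <- setdiff(seq_len(n1), taken)  # candidate links for record j
  n12   <- length(taken)                # matches among the other records
  # unnormalized probabilities, following (10)
  p_link   <- exp(W[free, j])
  p_nolink <- (n1 - n12) * (n2 - n12 - 1 + b) / (n12 + a)
  pick <- sample.int(length(free) + 1, 1, prob = c(p_link, p_nolink))
  if (pick <= length(free)) free[pick] else n1 + j
}
```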

T.3:

We now add the regression parameters to the sampler. For any draw of \(\mathbf {Z}^{[t+1]}\), we sample \(\beta ^{[t+1]}\) and \((\sigma ^2)^{[t+1]}=\phi ^{-1}\) from

$$\begin{aligned}&\beta ^{[t+1]}|\phi ,\mathbf {Y},\mathbf {Z}^{[t+1]} \sim N(b_n,(\phi \varPhi _n)^{-1}) \end{aligned}$$
(13)
$$\begin{aligned}&\phi |\mathbf {Y},\mathbf {Z}^{[t+1]} \sim G\Big (\frac{n+1}{2},\frac{1}{2}(SSE+1 + \hat{\beta }^T\widetilde{\mathbf {X}}^T\widetilde{\mathbf {X}}\hat{\beta }+b_0^T\varPhi _0b_0 - b_n^T\varPhi _nb_n)\Big ) \end{aligned}$$
(14)

where \(SSE = \widetilde{\mathbf {Y}}^T[\mathbf {I} - \widetilde{\mathbf {X}}(\widetilde{\mathbf {X}}^T\widetilde{\mathbf {X}})^{-1}\widetilde{\mathbf {X}}^T]\widetilde{\mathbf {Y}}\), \(\varPhi _n = \widetilde{\mathbf {X}}^T\widetilde{\mathbf {X}} + \varPhi _0\), \(b_n = \varPhi _n^{-1}(\widetilde{\mathbf {X}}^T\widetilde{\mathbf {X}}\hat{\beta }+\varPhi _0 b_0)\), and \(\hat{\beta } = (\widetilde{\mathbf {X}}^T\widetilde{\mathbf {X}})^{-1}\widetilde{\mathbf {X}}^T\widetilde{\mathbf {Y}}\). Here, \(\widetilde{\mathbf {X}}\) and \(\widetilde{\mathbf {Y}}\) are the subsets of \(\mathbf {X}\) and \(\mathbf {Y}\) belonging to only the cases linked under the conditioning \(\mathbf {Z}\), and n is the number of such cases.
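A sketch of the draw in T.3, with hypothetical inputs `Xt` (design matrix with an intercept column) and `Yt` (responses) for the currently linked pairs:

```r
# Sketch of step T.3: draw (beta, sigma^2) from (13) and (14).
draw_regression <- function(Xt, Yt, b0 = c(3, 1), Phi0 = diag(2)) {
  n    <- length(Yt)
  XtX  <- crossprod(Xt)                       # X'X
  bhat <- solve(XtX, crossprod(Xt, Yt))       # least-squares estimate
  Phin <- XtX + Phi0
  bn   <- solve(Phin, XtX %*% bhat + Phi0 %*% b0)
  SSE  <- sum((Yt - Xt %*% bhat)^2)
  rate <- 0.5 * drop(SSE + 1 + t(bhat) %*% XtX %*% bhat +
                     t(b0) %*% Phi0 %*% b0 - t(bn) %*% Phin %*% bn)
  phi  <- rgamma(1, shape = (n + 1) / 2, rate = rate)
  # draw beta ~ N(bn, (phi * Phin)^{-1}) via a Cholesky solve
  beta <- bn + backsolve(chol(phi * Phin), rnorm(length(b0)))
  list(beta = drop(beta), sigma2 = 1 / phi)
}
```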

Steps T.1 and T.2 are the same as those used in [17]; we add T.3 to sample the regression parameters. Alternatively, and equivalently, analysts can run T.1 and T.2 until MCMC convergence, then apply T.3 to each of the resulting draws of \(\mathbf {Z}^{[t]}\) to obtain the draws of the regression parameters.

3.3.2 Joint Method Without Imputation (JM)

The sampler for the JM method uses T.1, but it departs from T.2 and T.3. As we shall see, in JM we need the marginal density \(f(\mathbf {Y})\). This can be approximated with a standard univariate density estimator. Alternatively, one can derive it from \(f(\mathbf {Y}| \mathbf {X})\) and extra assumptions about \(f(\mathbf {X})\), although these extra assumptions obviate one of the advantages of JM over JMI and JMIF. In the simulations, for convenience we use the fact that \((Y, X)\) are bivariate normal when computing the marginal density of \(\mathbf {Y}\), as is evident in step J.2. Step J.2 can be omitted when \(f(\mathbf {Y})\) is computed by methods that do not leverage a joint model for \((\mathbf {Y}, \mathbf {X})\).

J.1:

Sample \(\mathbf {m}_f^{[t+1]}\) and \(\mathbf {u}_f^{[t+1]}\) using T.1.

J.2:

Sample \(\mu ^{[t+1]}\) and \((\tau ^2)^{[t+1]}\) using \(1/\tau ^2 \sim G\big ((n-1)/2,\sum _i (X_i - \bar{X})^2/2\big )\) and \(\mu |\tau ^2 \sim N(\bar{X},\tau ^2/n)\), the standard posterior under the Jeffreys prior. We use all of \(\mathbf {X}\) in this step.

J.3:

Given \(\mathbf {Z}^{[t]}\), sample \(\beta ^{[t+1]}\) and \((\sigma ^2)^{[t+1]}\) from (13) and (14).

J.4:

Sample \(\mathbf {Z}^{[t+1]}\) sequentially. Having sampled the first \(j-1\) entries of \(\mathbf {Z}^{[t+1]}\), we define \(\mathbf {Z}_{-j}^{[t+(j-1)/n_2]} = (Z_1^{[t+1]}, \dots , Z_{j-1}^{[t+1]},Z_{j+1}^{[t]}, \dots , Z_{n_2}^{[t]})\). Then we sample a new label \(Z_j^{[t+1]}\) with probability \(p_{qj}(\mathbf {Z}_{-j}^{[t+(j-1)/n_2]} \mid \mathbf {\Theta }^{[t+1]}, \mathbf {X}, \mathbf {Y})\) of selecting label \(q \in \{1, \dots , n_1, n_1+j\}\). For generic \((\mathbf {Z}_{-j}, \mathbf {\Theta }, \mathbf {X}, \mathbf {Y})\), we have \(f(\mathbf {Z}_{-j}|\mathbf {\Theta }, \mathbf {X},\mathbf {Y}) \propto f(\mathbf {Y},\mathbf {X}|\mathbf {\Theta }, \mathbf {Z}_{-j}) f(\mathbf {Z}_{-j}|\mathbf {\Theta })\). For \(q \le n_1\), and using the definition for \(w_{qj}\) in (11), we thus have

$$\begin{aligned} p_{qj}(\mathbf {Z}_{-j}| \mathbf {\Theta }, \mathbf {X},\mathbf {Y})&\propto \exp [w_{qj}]I(Z_{j'} \ne q, \forall j' \ne j) \prod _{i \ne q,i\in \mathbf {A}_{12}} f(X_i, Y_i|\mathbf {Z}_{-j}) \nonumber \\&\times \prod _{i \ne q,i \in \mathbf {A}_{1^-}} f(X_i) \prod _{i \ne q,i \in \mathbf {A}_{2^-}} f(Y_i) f(Y_j,X_q)\end{aligned}$$
(15)
$$\begin{aligned}&\propto \exp [w_{qj}]I(Z_{j'} \ne q, \forall j' \ne j) \frac{f(Y_j,X_q)}{f(Y_j)f(X_q)}\end{aligned}$$
(16)
$$\begin{aligned}&= \exp [w_{qj}]I(Z_{j'} \ne q, \forall j' \ne j) \frac{f(Y_j|X_q)}{f(Y_j)}. \end{aligned}$$
(17)

Here, \(\mathbf {A}_{12}\) is the set of matched records, \(\mathbf {A}_{1^-}\) is the set of records in \(\mathbf {A}_1\) without a match in \(\mathbf {A}_2\), and \(\mathbf {A}_{2^-}\) is the set of records in \(\mathbf {A}_2\) without a match in \(\mathbf {A}_1\). For \(q = n_1+j\), after some algebra to collect constants, we have

$$p_{qj}(\mathbf {Z}_{-j}|\mathbf {\Theta }, \mathbf {X}, \mathbf {Y}) \propto [n_1 - n_{12}(\mathbf {Z}_{-j})]\frac{n_2-n_{12}(\mathbf {Z}_{-j})-1+\beta _{\pi }}{n_{12}(\mathbf {Z}_{-j})+\alpha _\pi }.$$
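Under the bivariate normal set-up, the factor \(f(Y_j|X_q)/f(Y_j)\) in (17) is available in closed form, since \(Y \sim N(\beta _0 + \beta _1 \mu , \sigma ^2 + \beta _1^2\tau ^2)\) marginally. A sketch on the log scale:

```r
# Sketch: log of f(y | x) / f(y) in (17) under the bivariate normal model.
log_tilt <- function(y, x, beta0, beta1, sigma2, mu, tau2) {
  dnorm(y, mean = beta0 + beta1 * x, sd = sqrt(sigma2), log = TRUE) -
    dnorm(y, mean = beta0 + beta1 * mu,
          sd = sqrt(sigma2 + beta1^2 * tau2), log = TRUE)
}
# In J.4, multiply exp(W[q, j]) by exp(log_tilt(Y[j], X[q], ...)) for
# q <= n1; the q = n1 + j case keeps the prior-only weight above.
```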

3.3.3 Joint Method with Imputation (JMI)

The sampler for JMI is similar to the sampler for JM, except we impute \(\mathbf {X}\) for non-matches in \(\mathbf {A}_2\). Thus, we require a model for \(\mathbf {X}\), which we also use to compute \(f(\mathbf {Y})\). In accordance with the simulation set-up, we present the sampler with \(X \sim N(\mu ,\tau ^2)\).

I.1:

Sample \(\mathbf {m}_f^{[t+1]}\) and \(\mathbf {u}_f^{[t+1]}\) using J.1.

I.2:

Sample \(\mu ^{[t+1]}\) and \((\tau ^2)^{[t+1]}\) using J.2.

I.3:

Impute \(\mathbf {X}_{mis}\) for those records in \(\mathbf {A}_2\) without a matched X. For the sampler in the simulation study, the predictive distribution is

$$\begin{aligned} X_{mis} \sim N\Big (\mu + \frac{\beta _1 \tau ^2}{\sigma ^2+\beta _1^2 \tau ^2}(Y - \beta _0 - \beta _1\mu ), \tau ^2 - \frac{\beta _1^2\tau ^4}{\sigma ^2 + \beta _1^2\tau ^2}\Big ). \end{aligned}$$
(18)

In JMI, we use the values of \((\beta ^{[t]}, (\sigma ^2)^{[t]}, (\tau ^2)^{[t+1]})\) in (18). Once we have \(\mathbf {X}_{mis}^{[t+1]}\), in the full conditional distributions for \((\beta ^{[t+1]}, (\sigma ^2)^{[t+1]})\) we use both the matched and imputed data for all records in \(\mathbf {A}_2\), with the priors in Sect. 3.3.1. As a result, we draw \(\beta ^{[t+1]}\) and \((\sigma ^2)^{[t+1]}\) based on (13) and (14), but let \((\widetilde{\mathbf {X}}, \widetilde{\mathbf {Y}})\) include both the linked pairs and imputed pairs in \(\mathbf {A}_2\).
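A sketch of the imputation draw (18), vectorized over the unlinked responses:

```r
# Sketch of step I.3: impute X for records in A_2 without a link,
# drawing from the conditional normal in (18).
impute_X <- function(y, beta0, beta1, sigma2, mu, tau2) {
  s2y   <- sigma2 + beta1^2 * tau2                         # marginal Var(Y)
  cmean <- mu + (beta1 * tau2 / s2y) * (y - beta0 - beta1 * mu)
  cvar  <- tau2 - beta1^2 * tau2^2 / s2y
  rnorm(length(y), mean = cmean, sd = sqrt(cvar))
}
```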

I.4:

Sample \(\mathbf {Z}^{[t+1]}\) sequentially using J.4.

3.3.4 Joint Method with Imputation and Reduced Feedback (JMIF)

The sampler for JMIF is like the sampler for JMI, but we use a different predictive model for \(\mathbf {X}_{mis}\). We again present the sampler with \(X \sim N(\mu ,\tau ^2)\).

F.1:

Sample \(\mathbf {m}_f^{[t+1]}\) and \(\mathbf {u}_f^{[t+1]}\) using J.1.

F.2:

Sample \(\mu ^{[t+1]}\) and \((\tau ^2)^{[t+1]}\) using J.2.

F.3:

Given \(\mathbf {Z}^{[t]}\), take a draw \((\beta ^*, \sigma ^{2*})\) from the full conditional distributions in (13) and (14), using only the linked cases at iteration t. We impute \(\mathbf {X}_{mis}\) for those records in \(\mathbf {A}_2\) without a matched X using \((\beta ^*, \sigma ^{2*})\). For the sampler in the simulation study, we use (18) with \((\beta ^*, \sigma ^{2*})\) and \((\tau ^2)^{[t+1]}\). Once we have \(\mathbf {X}_{mis}^{[t+1]}\), in the full conditional distributions for \((\beta ^{[t+1]}, (\sigma ^2)^{[t+1]})\) we use both the matched and imputed data for all records in \(\mathbf {A}_2\), with the priors in Sect. 3.3.1. As a result, we draw \(\beta ^{[t+1]}\) and \((\sigma ^2)^{[t+1]}\) based on (13) and (14), but let \((\widetilde{\mathbf {X}}, \widetilde{\mathbf {Y}})\) include both the linked pairs and the imputed pairs in \(\mathbf {A}_2\).

F.4:

Sample \(\mathbf {Z}^{[t+1]}\) sequentially using J.4.

3.4 MCMC Starting Values

Sadinle [17] starts the MCMC sampler by assuming none of the records in file \(\mathbf {A}_2\) have a link in file \(\mathbf {A}_1\). We do not recommend this starting point for the hierarchical model, as it is beneficial to specify an initial set of links to determine sensible starting values for the linear regression parameters. Instead, we employ a standard FS algorithm—implemented using the RecordLinkage package in R—to determine a set of links to use as a starting point.
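For instance, starting links might be obtained along the following lines; this is a sketch, with placeholder data frames `df1` and `df2` and an arbitrary classification threshold:

```r
# Sketch: initial links from a standard FS algorithm (RecordLinkage).
library(RecordLinkage)
pairs <- compare.linkage(df1, df2, strcmp = TRUE)  # compare common fields
pairs <- emWeights(pairs)                          # FS weights via EM
res   <- emClassify(pairs, threshold.upper = 0)    # classify record pairs
init  <- res$pairs[res$prediction == "L", c("id1", "id2")]  # declared links
```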

4 Simulation Studies

We generate simulated data sets using the RLdata10000 data set from the RecordLinkage package in R. The RLdata10000 contains 10,000 records; 10% of these records are duplicates belonging to 1,000 individuals. The RLdata10000 includes linking variables, which the developers of the data set have distorted to create uncertainty in the RL task. To create \(\mathbf {A}_2\), we first randomly sample \(n_{12}= 750\) individuals from the 1,000 individuals with duplicates. We then sample 250 individuals from the remaining 8,000 individuals in RLdata10000. This ensures that each record in \(\mathbf {A}_2\) belongs to one and only one individual. To create \(\mathbf {A}_1\), we first take the duplicates for the 750 records in \(\mathbf {A}_2\); these are true matches. Next, we sample another 250 records from the individuals in RLdata10000 but not in \(\mathbf {A}_2\). Thus, we have 750 records that are true links, and 250 records in each file that do not have a match in the other file. We repeat this process independently in each simulation run.

In each simulation run, we generate the response and explanatory variables in both files, as none are available in the RLdata10000. For each sampled record i, we generate \(x_i \sim N(0,1)\) and \(y_i \sim N(\beta _0 + \beta _1 x_i, \sigma ^2)\). We set \(\beta _0 = 3\) and \(\sigma ^2= 1\). We consider \(\beta _1 \in \{.4, .65, .9\}\) to study how the correlation between X and Y affects the performance of the methods.
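A sketch of one simulation run's data construction (with \(\beta _1 = .9\) shown); `identity.RLdata10000`, supplied by the RecordLinkage package, gives the true entity behind each record:

```r
# Sketch: build A_1 and A_2 from RLdata10000 and generate (x, y).
library(RecordLinkage)
data(RLdata10000)
tab       <- table(identity.RLdata10000)
match_ids <- sample(names(tab[tab == 2]), 750)       # individuals in both files
sing_ids  <- names(tab[tab == 1])
a2_only   <- sample(sing_ids, 250)                   # in A_2 only
a1_only   <- sample(setdiff(sing_ids, a2_only), 250) # in A_1 only
first_rec <- function(ids) match(ids, identity.RLdata10000)
last_rec  <- function(ids) length(identity.RLdata10000) + 1 -
               match(ids, rev(identity.RLdata10000))
A2 <- RLdata10000[c(first_rec(match_ids), first_rec(a2_only)), ]
A1 <- RLdata10000[c(last_rec(match_ids),  first_rec(a1_only)), ]
# regression variables: X lives in A_1, Y in A_2; true matches share an x
beta0 <- 3; beta1 <- .9
A1$x <- rnorm(1000)
A2$y <- rnorm(1000, beta0 + beta1 * c(A1$x[1:750], rnorm(250)), 1)
```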

We use four linking variables: the first name and last name, and two constructed binary variables based on birth year and birth day. For the constructed variables, we create an indicator of whether the individual in the record was born before or after 1974, and another indicator of whether the individual in the record was born before or after the 16th day of the month.

To compare first and last names, we use the Levenshtein edit distance (LD), defined as the minimum number of insertions, deletions, or substitutions required to change one string into the other. We divide this distance by the length of the longer string to standardize it. The final measure lies in [0, 1], where 0 represents total agreement and 1 total disagreement. Following [17], we categorize the LD into four levels of agreement. We set \(f=1\) and \(\gamma _{ij}^f= 3\) when the first names for records i and j match perfectly (\(LD=0\)); \(\gamma _{ij}^f= 2\) when these names show mild disagreement (\(0 < LD \le .25\)); \(\gamma _{ij}^f= 1\) when these names show moderate disagreement (\(.25 < LD \le .5\)); and \(\gamma _{ij}^f= 0\) when these names show extreme disagreement (\(LD > .5\)). We define \(\gamma _{ij}^f\) analogously for last names, with \(f=2\). For the constructed binary variables based on birth day and year, we set \(\gamma _{ij}^f= 1\) when the values for records i and j agree, and \(\gamma _{ij}^f= 0\) otherwise.
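A sketch of this categorization for one string field, using base R's `adist` for the Levenshtein distance:

```r
# Sketch: categorized, standardized Levenshtein comparison (gamma in 0:3).
ld_level <- function(s1, s2) {
  ld <- mapply(adist, s1, s2) / pmax(nchar(s1), nchar(s2))
  ifelse(ld == 0,   3L,          # perfect agreement
  ifelse(ld <= .25, 2L,          # mild disagreement
  ifelse(ld <= .50, 1L, 0L)))    # moderate vs. extreme disagreement
}
ld_level("SMITH", "SMYTH")  # LD = 1/5 = .2, so level 2
```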

We create two scenarios to represent different strengths of the information available for linking. The strong linkage scenario uses all four linking variables, and the weak linkage scenario uses first and last name only.

4.1 Results

Table 1 displays averages of the estimated regression coefficients over 100 simulation runs. Across all scenarios, on average the point estimates of \(\beta _1\) from the Bayesian hierarchical model, regardless of the MCMC algorithm, are at least as close to the true \(\beta _1\) as the point estimates from the two-step approach. The hierarchical model offers the largest improvements in accuracy over the two-step approach when the correlation between X and Y is strongest and the information in the linking variables is weakest. In this scenario, the model takes advantage of information in the relationship between the study variables that the two-step approach cannot. In contrast, when the correlation between X and Y is weakest and the information in the linking variables is strongest, there is little difference in the performances of the hierarchical model and two-step approach. These patterns are illustrated in Fig. 1.

Fig. 1. Posterior density of \(\beta _1\) in one arbitrarily chosen data set for each value of \(\beta _1\) under the strong linking information scenario. The posterior distribution for the two-step approach is in solid green, for JM in dashed red, for JMI in dotted blue, and for JMIF in dotdash black. Vertical lines are estimates of \(\beta _1\) when using all the correct links.

Generally, all the algorithms tend to underestimate \(\beta _1\) in these simulations. Because it is practically impossible to identify all the true links, the regression is estimated with some invalid pairs \((x_i, y_i)\), which attenuates the estimates of \(\beta _1\). The hierarchical model tends to overestimate \(\beta _0\) slightly; the difference is most noticeable when the correlation between X and Y is strong. Generally, on average the two-step approach offers more accurate estimates of \(\beta _0\), although the differences are practically irrelevant.

Among the hierarchical models, JM outperforms JMI and JMIF, with JMIF slightly better than JMI. This is because inaccuracies in the estimated distribution of \(\mathbf {X}_{mis}\) in JMI and JMIF are propagated to the estimated distributions of \((\beta _0, \beta _1)\). To illustrate, suppose in a particular MCMC iteration the value of \(\beta _1\) is somewhat attenuated, which in turn leads to inaccuracy in the parameters of the imputation model for X|Y. As a result, the imputed values of \(\mathbf {X}_{mis}\) are not samples from an accurate representation of f(X|Y). Thus, the full conditional distribution of \((\beta _0, \beta _1)\) is estimated from completed data that do not follow the true relationship between X and Y. The inaccurate samples of \((\beta _0, \beta _1)\) then create inaccurate imputations, and the cycle continues. In contrast, in any iteration, JM samples the coefficients using only the records deemed to be links in that iteration, thereby reducing the effects of feedback from imprecise imputations. This also explains why JMIF yields slightly more accurate estimates of \(\beta _1\) than JMI when the correlation between X and Y is strong.

Table 1. Summary of simulation results for regression coefficients, based on 100 runs per scenario. The true \(\beta _0 = 3\) in all scenarios. For each reported average, the Monte Carlo standard errors are smaller than .01. “Strong” refers to scenarios where we use all four comparison fields, and “Weak” refers to scenarios where we use only two comparison fields.
Table 2. Summary of linkage quality across 100 simulation runs. Averages in first four columns have standard errors less than 3. Averages in the last two columns have Monte Carlo standard errors less than .002.

Table 2 displays averages across the 100 simulation runs of six standard metrics for the quality of the record linkages. These include the average numbers of correct links (CL), correct non-links (CNL), false negatives (FN), and false positives (FP), as well as the false negative rate (FNR) and false discovery rate (FDR). These are formalized in Appendix A. The results in Table 2 indicate that the hierarchical model offers improved linkage quality over the two-step approach, regardless of the estimation algorithm. In particular, the hierarchical model tends to have smaller FP and larger CNL than the two-step approach. The difference in CNL is most apparent when the information in the linking variables is weak and the correlation between X and Y is strong. The hierarchical model also tends to have higher CL than the two-step approach, but the difference is practically important only when the linkage information is weak and the correlation is relatively strong (\(\beta _1 = .65\) and \(\beta _1 = .9\)). Overall, the hierarchical model has lower FDR than the two-step approach.
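The exact definitions appear in Appendix A; one common formalization, given a point estimate \(\hat{\mathbf {Z}}\) of the matching labels, is sketched below (hypothetical helper):

```r
# Sketch: linkage-quality metrics for an estimated matching Z_hat,
# with Z_true the true labels in the coding of Sect. 2.
linkage_metrics <- function(Z_hat, Z_true, n1) {
  link_hat  <- Z_hat  <= n1                 # declared links
  link_true <- Z_true <= n1                 # true links
  CL  <- sum(link_hat & Z_hat == Z_true)    # correctly linked pairs
  CNL <- sum(!link_hat & !link_true)        # correctly declared non-links
  FP  <- sum(link_hat) - CL                 # incorrect declared links
  FN  <- sum(link_true) - CL                # missed or mislinked true links
  c(CL = CL, CNL = CNL, FN = FN, FP = FP,
    FNR = FN / sum(link_true), FDR = FP / sum(link_hat))
}
```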

5 Discussion

The simulation results suggest that the Bayesian hierarchical model for simultaneous regression and RL can offer more accurate coefficient estimates than the two-step approach in which one first performs RL then runs regression on linked data. The hierarchical model is most effective when the correlation between the response and explanatory variable is strong. The hierarchical model also can improve linkage quality, in particular by identifying more non-links. This is especially the case when the information in the linking variables is not strong. In all scenarios, the relationship between the response and explanatory variable complements the information from the comparison vectors, which helps us declare record pairs more accurately.

As with any simulation study, we investigate only a limited set of scenarios. Our simulations have 75% of the individuals in the target file as true matches. Future studies could test whether the hierarchical model continues to offer gains with lower overlap rates, as well as different values of other simulation parameters. We used a correctly specified linear regression with only one predictor. Appendix B presents a simulation where the linear regression is mis-specified; the model continues to perform well. We note that the hierarchical model without imputation for missing \(\mathbf {X}\) extends readily to multivariate \(\mathbf {X}\). When outcomes are binary, analysts can use probit regression in the hierarchical model. The model also can be modified for scenarios where one links to a smaller file containing explanatory variables. In this case, we use the marginal distribution of \(\mathbf {X}\) and conditional distribution of \(\mathbf {X}|\mathbf {Y}\) rather than those for \(\mathbf {Y}\) and \(\mathbf {Y}|\mathbf {X}\) in (17).