1 Introduction

Record linkage (de-duplication or entity resolution) is the process of merging noisy databases in order to identify and remove duplicate entities. While record linkage removes duplicate entities from such databases, the downstream task is any inferential, predictive, or post-linkage task on the linked data. In this paper, we propose a joint model for record linkage and the downstream task of linear regression. Our proposed model can link records over an arbitrary number of databases (lists or files), and it allows for duplication within each database, a setting known as “duplicate detection.” Our record linkage model can be expressed as a random partition model, which leads to a large family of distributions. We then jointly model the record linkage task and the downstream task (linear regression), which allows for the exact propagation of the record linkage uncertainty into the downstream task. Crucially, this generates a feedback mechanism from the proposed Bayesian record linkage model into the downstream task of linear regression. This feedback effect is essential to eliminate potential biases that can jeopardize the resulting inference in the downstream task. We apply our methodology to multiple linear regression and illustrate empirically that the “feedback effect” is able to improve the performance of the record linkage task.

1.1 Prior Work

Our work builds on [14, 16, 17, 18], which all proposed Bayesian record linkage models well suited to categorical data. [18] modeled the fully observed records through the “hit-and-miss” measurement error model [2]. One natural way to handle record linkage uncertainty is via a joint model of the record linkage and downstream tasks. [10] introduced a record linkage model for continuous data based on a multivariate normal model with measurement error. Turning to record linkage alone, [16, 17] were the first to perform simultaneous record linkage and de-duplication on multiple files using the fully observed records, creating a scalable record linkage algorithm. In similar work, de-duplication in a single-database framework was tackled from a Bayesian perspective in [14] using the information provided by the comparison data.

Related work regarding record linkage and a downstream task has been considered under specific assumptions. [9] assumed that the two databases represent a permutation of the same database of units and proposed an estimator (LL) of the regression coefficients which is unbiased, conditionally on the matching probabilities provided by the record linkage task. [7] extended this approach to handle more complex and realistic linkage scenarios and logistic regression problems. Generalizations of the LL estimator have also been provided by [8] using estimating equations. In addition, [4] proposed to use the probabilities of being a match—provided by the record linkage algorithm—as an ingredient within a multiple imputation scenario. Finally, [5] proposed a Bayesian method that jointly models the record linkage and the association between the overlapping features in two different databases. The authors consider a somewhat simpler situation where the number of records to match in the two databases is relatively small and rely upon a specific blocking criterion. In addition, one potential limitation of the approach is the assumption of a specific matching pattern: for each single block of comparisons, all cases in the smaller database will certainly appear in the other database. We refer to [6] for details.

Section 2 introduces our Bayesian record linkage model, providing extensions to priors on random partitions. Section 3 generalizes our record linkage methodology to the downstream task of linear regression. Section 4 provides experiments for the record linkage task on synthetic data. We then provide three experiments on the joint record linkage and downstream task of linear regression on synthetic data. Section 5 provides a discussion and extensions to future work.

2 Bayesian Record Linkage and Priors on Partitions

In this section, we introduce the notation used throughout the paper, our Bayesian record linkage model, and an alternative, more intuitive construction for the prior on co-referent records, known as the linkage structure \(\lambda \).

2.1 Notation

Assume L databases (lists, data sets, or files) \(F_1, F_2,\ldots , F_L\) that consist of qualitative (categorical) records, which are noisy due to the data collection process. Each record corresponds to an underlying latent entity (statistical unit), and the databases represent partially overlapping samples (or populations). In addition, assume all databases have p overlapping features (fields). Assume the L sets of records are collected from a given population of size \(N_{\text {pop}}\), where \(1\le N_{\text {pop}}\le \infty \), in the same framework as [15, 17]. As such, assign a label \(j'\) (\(j'=1,\ldots , N_{\text {pop}}\)) to each member of the population. Next, let \(\tilde{v}_{j'}=(\tilde{v}_{j'1},\ldots , \tilde{v}_{j' p})\) be the vector of the p categorical overlapping features for the population individual \(j'\). Finally, denote the entire set of population records by \(\tilde{v}=(\tilde{v}_1,\ldots , \tilde{v}_{N_{\text {pop}}})\).

2.2 Bayesian Record Linkage Model

Assume the set of population records \(\tilde{v}\) is generated independently, for \(j'=1,\ldots ,N_{\text {pop}}\), from a vector of independent categorical variables \(\tilde{V}=(\tilde{V}_1,\ldots , \tilde{V}_\ell ,\ldots , \tilde{V}_p)\) such that \(\tilde{V}_\ell \in \{v_{\ell \, 1},\ldots , v_{\ell \, M_\ell } \}\) and

$$\begin{aligned} P\left( \tilde{V}_\ell =v_{\ell \, s}\right) =\theta _{\ell \, v_{\ell \, s}} \quad s=1,\ldots , M_\ell , \end{aligned}$$
(1)

where \(M_\ell \) is the number of categorical values for the \(\ell \)-th feature. At the sample level, assume that one does not observe the “true” population values due to measurement errors. Thus, the i-th observed database, of size \(N_i\), \(i=1, \dots , L\), consists of distorted versions of a subset of the vectors \(\tilde{v}_{j'}\). Let \(v_{ij}=(v_{ij1}, \ldots , v_{ijp})\) denote the observed values for the j-th record of the i-th database, where \(i=1,\ldots , L\) and \(j=1,\ldots , N_i\). Denote the observed records (across the L databases) by \(v=(v_{11}, \ldots , v_{1 N_1}, \ldots , v_{L 1}, \ldots, v_{L N_L})\). Next, let the set of latent indicator variables \(\lambda _{ij}\in \{ 1,\ldots , N_{\text {pop}}\}\) denote the unknown co-reference (matching) pattern between the observed records v and the population records \(\tilde{v}\), where \(\lambda _{ij}=j'\) indicates that the population record \(j'\) generated the observed record \(v_{ij}\). In general, let \(\lambda =( \lambda _{11}, \dots , \lambda _{1N_1}, \ldots , \lambda _{L1}, \dots , \lambda _{L N_L})\) denote the linkage structure.

Next, we formalize the distortion mechanism when the population records are observed in the L databases using the hit-and-miss model [2]. Let \(V_{ij\ell }\) be the random variable that generates the observed value \(v_{ij\ell }\). Assume that \(V_{ij\ell }\in \{v_{\ell \, 1},\ldots , v_{\ell \, M_\ell } \}\), that is, \(V_{ij\ell }\) has the same support as \(\tilde{V}_\ell \). Let \(\delta _{a,b}=1\) if \(a=b\) and \(\delta _{a,b}=0\) if \(a\ne b\); then we assume that

$$\begin{aligned} P(V_{ij\ell }=v_{\ell \, s} \mid \lambda _{ij}, \tilde{v} ,\alpha _\ell ) =(1-\alpha _\ell ) \delta _{\tilde{v}_{\lambda _{ij} \ell } ,v_{\ell \,s }}+\alpha _\ell \theta _{\ell \,v_{\ell \,s}} \quad s=1,\ldots , M_\ell \end{aligned}$$
(2)

for \(i=1,\ldots , L; \, j=1,\ldots , N_i; \, \ell =1,\ldots , p\), where \(\alpha _\ell \in [0,1]\) represents the distortion probability for the \(\ell \)-th overlapping feature. Here, the true population value is observed with probability \(1-\alpha _\ell \), while with probability \(\alpha _\ell \) a value is drawn afresh from the random variable \(\tilde{V}_\ell \) that generates the population values. Finally, assuming conditional independence among all the overlapping features given their respective unobserved population counterparts, one obtains

$$\begin{aligned} p(v\mid \tilde{v}, \lambda , \alpha )=\prod _{i=1}^L \prod _{j=1}^{N_i} \prod _{\ell =1}^p P(v_{ij\ell } \mid \tilde{v},\lambda ,\alpha ) =\prod _{i=1}^L \prod _{j=1}^{N_i} \prod _{\ell =1}^p[ (1-\alpha _\ell ) \delta _{\tilde{v}_{\lambda _{ij} \ell } ,v_{ij\ell }}+\alpha _\ell \theta _{\ell \, v_{ij\ell }}]. \end{aligned}$$
(3)

We assume that the distortion probabilities are exchangeable, that is

$$\begin{aligned} \alpha _{\ell } {\mathop {\sim }\limits ^{iid}} \text {Beta}(f,g), \, \ell =1,\dots ,p, \end{aligned}$$

and the probabilities \(\theta _{\ell \,1}, \ldots, \theta _{\ell \, M_\ell }\) are assumed known and equal to the corresponding population frequencies. The model summarized by Eqs. (1) and (3) can be viewed as a latent variable model where the unobserved population records \(\tilde{v}\) generate the observed records v, and \(\alpha =(\alpha _1,\ldots ,\alpha _p)\) is the unknown distortion parameter.
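To make the generative mechanism of Eqs. (1)–(3) concrete, the following minimal sketch (in Python, with illustrative names such as theta, alpha, and lam that are not part of the model specification) simulates a single categorical feature: population values are drawn from Eq. (1), a linkage structure assigns each observed record to a population unit, and each observed value is a hit-and-miss distortion of its population counterpart as in Eq. (2).

```python
import numpy as np

rng = np.random.default_rng(0)

# Known category probabilities theta_{l s} for one feature (Eq. 1)
# and its distortion probability alpha_l (Eq. 2).
theta = np.array([0.5, 0.3, 0.2])
alpha = 0.1
N_pop = 100                                   # latent population size

# Latent population values v~_{j'} for this feature.
pop_values = rng.choice(len(theta), size=N_pop, p=theta)

# A linkage structure lambda: each observed record points to a population unit.
N = [6, 4]                                    # sizes of L = 2 databases (illustrative)
lam = [rng.integers(0, N_pop, size=n) for n in N]

def distort(true_value):
    """Hit-and-miss draw (Eq. 2): keep the true value with prob. 1 - alpha,
    otherwise redraw from the population distribution theta."""
    if rng.random() < 1 - alpha:
        return true_value                     # "hit"
    return rng.choice(len(theta), p=theta)    # "miss"

observed = [np.array([distort(pop_values[j]) for j in lam_i]) for lam_i in lam]
print(observed)
```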

Remark: A convenient property of the hit-and-miss model is that one can integrate out the unknown population values \(\tilde{v}\) to obtain the distribution \(p(v|\alpha ,\lambda )\) directly. The resulting marginal distribution \(p(v|\alpha ,\lambda )\) is a product of within-cluster distributions. To improve mixing, we use a Metropolis-within-Gibbs algorithm to simulate from the joint posterior \(p(\lambda ,\alpha \vert v)\) (see Appendix A).
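The within-cluster marginal can be written explicitly: for each feature, the latent population value of the cluster is summed out against Eq. (1), with each record in the cluster contributing a hit-and-miss factor as in Eq. (2). Below is a minimal sketch of this computation under that reading of the remark; the function name and array layout are our own illustrative choices.

```python
import numpy as np

def cluster_marginal_loglik(records, theta, alpha):
    """Marginal log-likelihood of one cluster, integrating out the latent
    population values (Eqs. 1-3).

    records : (n, p) integer array of observed category codes for the n
              records in the cluster
    theta   : list of length p; theta[l] is the vector of known category
              probabilities for feature l
    alpha   : length-p vector of distortion probabilities
    """
    n, p = records.shape
    loglik = 0.0
    for l in range(p):
        th, a = np.asarray(theta[l]), alpha[l]
        # For each candidate population value s, the within-cluster product
        # of hit-and-miss terms (Eq. 2), then a theta-weighted sum over s.
        hit_miss = (1 - a) * (records[:, l][:, None] == np.arange(len(th))) \
                   + a * th[records[:, l]][:, None]
        loglik += np.log(np.sum(th * np.prod(hit_miss, axis=0)))
    return loglik
```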

2.3 The Prior Distribution for \(\lambda \)

In this section, we propose a more intuitive and subjective construction of a prior distribution on \(\lambda \). Let z denote the random partition of the observed records determined by \(\lambda \), and let \(\mathcal {P}\) denote the set of all possible partitions of the N observed records. The distribution on the sample labels \(\lambda \) induces a distribution on \(\mathcal {P}\). Furthermore, matches and duplicates are completely specified given knowledge of the random partition z, which is invariant with respect to the labelings of the partition blocks. Given this construction, one can focus directly on the partition distribution of the observed records without linking the label distribution to a sampling design and to a population size \(N_{\text {pop}}\); see, for example, [14]. One can effectively consider the distribution of \(\lambda \) as a prior distribution for the latent linkage structure and concentrate only on its probabilistic properties. Both interpretations of the role of \(\lambda \) (either as a consequence of the sampling design or as a model represented by partitions) may provide useful insights for a correct choice of its prior distribution. One difficult, related question in the record linkage literature has been the subjective specification of a prior on the space of partitions. A simple alternative prior for the number of distinct entities k(z) can be obtained from an allocation rule for the record labels based on a generalization of the Chinese restaurant process, namely the Pitman–Yor process (PYP) [3, 13]; see Appendix B for details and the sketch below.
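As a concrete illustration, the following sketch evaluates the exchangeable partition probability function of the Pitman–Yor process, which is the kind of prior on z alluded to above; the parameter names strength and discount (and the helper rising) are our own, and the exact parametrization used in Appendix B may differ.

```python
import numpy as np
from scipy.special import gammaln

def pyp_log_prior(block_sizes, strength, discount):
    """Log prior probability of a partition z under the Pitman-Yor process
    (its exchangeable partition probability function).

    block_sizes : sizes n_1, ..., n_k of the k clusters of the N records
    strength    : PYP strength parameter (> -discount)
    discount    : PYP discount parameter in [0, 1)
    """
    block_sizes = np.asarray(block_sizes)
    N, k = block_sizes.sum(), len(block_sizes)

    def rising(x, m):
        # log rising factorial x (x+1) ... (x+m-1)
        return gammaln(x + m) - gammaln(x)

    logp = sum(np.log(strength + i * discount) for i in range(1, k))
    logp -= rising(strength + 1, N - 1)
    logp += sum(rising(1 - discount, n - 1) for n in block_sizes)
    return logp

# Example: 5 records partitioned into clusters of sizes 2, 2, 1.
print(pyp_log_prior([2, 2, 1], strength=1.0, discount=0.3))
```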

3 The Downstream Task of Linear Regression

In this section, we propose record linkage methodology for the downstream task of linear regression. Consider the model \(\tilde{Y}=\sum _{l=1}^p \tilde{X}_{l}\beta _l+\epsilon \) for the population units, where the goal is to estimate the regression coefficients \(\beta =(\beta _1,\ldots , \beta _p)^t\). We observe Y and \(X=(X_1,\ldots , X_p)\), where X represents a noisy measurement of the true covariates \(\tilde{X}=(\tilde{X}_1,\ldots, \tilde{X}_p)\) and Y is a random copy of the corresponding population variable \(\tilde{Y}\).

To better illustrate our approach, we consider two scenarios. In the first scenario—the complete regression scenario—each database reports a set of overlapping features, the response variable, and the covariates. Let \(y_{ij}\) and \(x_{ij}=(x_{ij1},\ldots ,x_{ijp})\) denote the observed values for the j-th unit of the i-th database, where \(i=1,\ldots , L\) and \(j=1,\ldots , N_i\). In addition, let (y, x) denote the entire set of regression data observed across the L databases. In the complete scenario, there is no bias problem concerning the estimation of the \(\beta \) coefficients. In the second scenario—the broken regression scenario—we assume that the overlapping features are observed in each database, the response variable is observed only in the first database, and specific subsets of covariates are observed in the other databases. In this situation, let (y, x) denote the observations \(y_{1j}\), where \(j=1,\ldots ,N_1\), and \(x_{ij}\), where \(i=2,\ldots , L\) and \(j=1,\ldots , N_i\). Note that \(x_{ij}\) here represents only a fixed subset of the values \(x_{ij1},\ldots, x_{ijp}\) for \(j=1,\ldots ,N_i\). In this case, estimation of the \(\beta \) coefficients is subject to a bias problem.

3.1 Simple Linear Regression

In this section, we consider linear regression and the two scenarios mentioned above with a single covariate X. First, consider the complete regression scenario. Let \(\tilde{X}_{j'}\) be the true value of X corresponding to the records of cluster \(C_{j'}\). Now consider a cluster \(C_{j'}=\{(i,j)\}\) with one record. Given the true value \(\tilde{X}_{j'}=\tilde{x}_{j'}\) and membership in cluster \(C_{j'}\), we assume that the response variable \(Y_{ij}\) follows a standard normal linear regression model with covariate \(\tilde{x}_{j'}\), that the observed covariate \(X_{ij}\) is normal with mean \(\tilde{x}_{j'}\), and that \(Y_{ij}\) and \(X_{ij}\) are independent. That is,

$$\begin{aligned} \left[ \begin{array}{c} Y_{ij}\\ X_{ij} \end{array} \right] \left| \right. \tilde{X}_{j'}=\tilde{x}_{j'}\sim N_2 \left[ \left( \begin{array}{c c} \beta &{} 0\\ 0 &{} 1 \end{array} \right) \left[ \begin{array}{c} \tilde{x}_{j'}\\ \tilde{x}_{j'}\\ \end{array} \right] , \left( \begin{array}{c c} \sigma ^2_{y|\tilde{x}} &{} 0\\ 0 &{} \sigma ^2_{x|\tilde{x}} \end{array} \right) \right] . \end{aligned}$$
(4)

We assume that \(\tilde{X}_{j'} \sim N(0, \sigma ^2_{\tilde{x}})\), which allows one to integrate \(\tilde{X}_{j'}\) out of Eq. 4. In fact, setting \(Z_{ij}= \left( Y_{ij},X_{ij} \right) ^\prime \), one can easily show that, conditionally on the event \(\{ (i,j)\in C_{j'}\}\),

$$\begin{aligned} Z_{ij} \sim N_2 \left[ \left( \begin{array}{c} 0\\ 0 \end{array} \right) , \sigma ^2_{\tilde{x}} \left( \begin{array}{c c} \beta ^2 &{} \beta \\ \beta &{} 1 \\ \end{array} \right) + \left( \begin{array}{c c} \sigma ^2_{y|\tilde{x}} &{} 0\\ 0 &{} \sigma ^2_{x|\tilde{x}} \end{array} \right) \right] . \end{aligned}$$
(5)

For ease of notation, let \(I_{n}\) denote the \(n\times n\) identity matrix, \(0_n\) the n-vector of zeros, \(1_{n}\) the n-vector of ones, and \(J_n = 1_n 1_n^\prime \). Next, set

$$\begin{aligned} B = \left( \begin{array}{c c} \beta ^2 &{} \beta \\ \beta &{} 1 \end{array} \right) \quad \hbox {and } \quad \varSigma = \left( \begin{array}{c c} \sigma ^2_{y \vert \tilde{x}} &{} 0 \\ 0 &{} \sigma ^2_{x \vert \tilde{x}} \end{array} \right) . \end{aligned}$$

Consider a cluster \(C_{j'}=\{ (i_1,j_1), (i_2,j_2) \}\) with two records. The two pairs \(Z_{i_1 j_1}\) and \(Z_{i_2 j_2}\) are random vectors, both depending on the same “true” value \(\tilde{X}_{j'}\). Let \(\otimes \) be the Kronecker product. Conditionally on \(\tilde{X}_{j'}=\tilde{x}_{j'}\) and on the cluster membership, we replicate the model for a cluster with one record by assuming that \(Z_{i_1 j_1}\) and \(Z_{i_2 j_2}\) are two independent bivariate normal random variables with joint distribution

$$\begin{aligned} N_4 \left[ \left( I_2 \otimes \left( \begin{array}{c c} \beta &{} 0\\ 0 &{} 1 \end{array} \right) \right) \left( 1_4 \tilde{x}_{j'} \right) , I_2 \otimes \varSigma \right] . \end{aligned}$$
(6)

Then the marginal distribution of \((Z_{i_1 j_1}, Z_{i_2 j_2} )^\prime \) is

$$\begin{aligned} \begin{pmatrix} Z_{i_1 j_1} \\ Z_{i_2 j_2} \\ \end{pmatrix} \sim N_4 \big ( 0_4 , I_2 \otimes \varSigma + \sigma _{\tilde{x}}^2 J_2 \otimes B \big ). \end{aligned}$$

This argument can be extended to any cluster size. When card\((C_{j^\prime }) = n\), the marginal distribution of \(Z=(Z_{i_1 j_1}, \dots , Z_{i_n j_n })\) is again multivariate normal: \( Z \sim N_{2n} \left( 0_{2n}, I_n \otimes \varSigma + \sigma _{\tilde{x}}^2 J_n \otimes B \right) \).
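To illustrate how this marginal enters the computations, here is a minimal sketch (with hypothetical argument names s2_y, s2_x, and s2_xt standing for \(\sigma ^2_{y|\tilde{x}}\), \(\sigma ^2_{x|\tilde{x}}\), and \(\sigma ^2_{\tilde{x}}\)) that evaluates the log-density of a complete-scenario cluster of arbitrary size.

```python
import numpy as np
from scipy.stats import multivariate_normal

def cluster_loglik(z, beta, s2_y, s2_x, s2_xt):
    """Log-likelihood of one cluster in the complete simple-regression
    scenario: z stacks the pairs (y_{ij}, x_{ij}) of the n records in the
    cluster, and the covariance is I_n (x) Sigma + s2_xt J_n (x) B.

    z     : array of length 2n, ordered as (y_1, x_1, ..., y_n, x_n)
    beta  : regression coefficient
    s2_y  : sigma^2_{y|x~};  s2_x : sigma^2_{x|x~};  s2_xt : sigma^2_{x~}
    """
    n = len(z) // 2
    B = np.array([[beta ** 2, beta], [beta, 1.0]])
    Sigma = np.diag([s2_y, s2_x])
    cov = np.kron(np.eye(n), Sigma) + s2_xt * np.kron(np.ones((n, n)), B)
    return multivariate_normal(mean=np.zeros(2 * n), cov=cov).logpdf(z)
```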

Next, consider the broken regression scenario. In this case, when some information is missing—either the covariate in the first database or the response variable in some of the other databases—one can easily marginalize over the missing variables by using standard properties of the multivariate normal distribution. Let \((y,x)_{C_{j'}}=((y_{ij}, x_{ij} ) : \lambda _{ij}=j')\) denote the set of regression observations which, conditionally on \(\lambda \), correspond to the \(j^\prime \)-th population unit. For example, for a cluster \(C_{j'}=\{(1,j)\}\) with one record in the first database, we write \((y,x)_{C_{j'}}=y_{1j}\). Using the marginal density of \(Y_{ij}\) in Eq. 5, we can write the likelihood, conditional on \(\lambda \), as \(p((y,x)_{C_{j'}}|\lambda , \beta , \sigma ^2_{y|\tilde{x}}, \sigma ^2_{x|\tilde{x}})\). Similarly, suppose \(C_{j'}=\{(i,j)\}\) with \(i>1\); then \((y,x)_{C_{j'}}=x_{ij}\) and the likelihood is given by the marginal density of \(X_{ij}\). Next, consider a cluster \(C_{j'}=\{ (1, j_1), (i_2,j_2) \}\) with one record in the first database and the other record in a different database, i.e., \(i_2>1\). It follows that \((y,x)_{C_{j'}}=(y_{1j_1},x_{i_2 j_2})\), and the corresponding likelihood is found by marginalizing over the missing values \(X_{1 j_1}\) and \(Y_{i_2 j_2}\) in Eq. 6, which yields the joint density in Eq. 5. Finally, the likelihood function (as a function of \(\lambda , \beta , \sigma ^2_{y|\tilde{x}}, \sigma ^2_{x|\tilde{x}}\)) for both the complete and broken regression scenarios can be written generally as \(p(y,x| \lambda , \beta , \sigma ^2_{x|\tilde{x}}, \sigma ^2_{y|\tilde{x}})=\prod _{j'=1}^{N_{\text {pop}}} p( (y,x)_{C_{j'}}|\beta , \sigma ^2_{x|\tilde{x}}, \sigma ^2_{y|\tilde{x}})\). A sketch of this marginalization appears below.
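The following sketch shows one way to carry out that marginalization numerically: because the stacked cluster vector is zero-mean Gaussian, dropping the unobserved coordinates simply means selecting the corresponding rows and columns of the complete-scenario covariance. Argument names (observed_idx, s2_y, and so on) are illustrative assumptions.

```python
import numpy as np
from scipy.stats import multivariate_normal

def broken_cluster_loglik(values, observed_idx, n, beta, s2_y, s2_x, s2_xt):
    """Log-likelihood of one cluster in the broken-regression scenario.

    The complete-scenario vector (y_1, x_1, ..., y_n, x_n) for a cluster of
    n records is zero-mean Gaussian with covariance I_n (x) Sigma
    + s2_xt J_n (x) B; marginalizing over unobserved y's or x's amounts to
    keeping only the observed rows and columns of that covariance.

    values       : observed entries, in the order given by observed_idx
    observed_idx : positions of the observed entries in the stacked vector,
                   e.g. [0, 3] for (y_{1 j_1}, x_{i_2 j_2}) with n = 2
    n            : number of records in the cluster
    """
    B = np.array([[beta ** 2, beta], [beta, 1.0]])
    Sigma = np.diag([s2_y, s2_x])
    full_cov = np.kron(np.eye(n), Sigma) + s2_xt * np.kron(np.ones((n, n)), B)
    sub_cov = full_cov[np.ix_(observed_idx, observed_idx)]
    return multivariate_normal(mean=np.zeros(len(observed_idx)),
                               cov=sub_cov).logpdf(values)
```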

In order to handle the record linkage and downstream regression tasks simultaneously, we assume conditional independence, given \(\lambda \), between the overlapping features in the record linkage model and the set of variables in the downstream task of linear regression. Under this assumption, we find

$$\begin{aligned} p(\lambda ,\beta ,\alpha , \sigma ^2_{y|\tilde{x}}, \sigma ^2_{x|\tilde{x}} \mid v,x,y) \propto \; & p(v\mid \lambda ,\alpha )\, p(y,x\mid \lambda ,\beta ,\sigma ^2_{y|\tilde{x}}, \sigma ^2_{x|\tilde{x}}) \nonumber \\ & \times \, p(\lambda )\, p(\alpha )\, p(\beta , \sigma ^2_{y|\tilde{x}}, \sigma ^2_{x|\tilde{x}}). \end{aligned}$$
(7)

The first factor relates to the record linkage process, the second factor relates to the downstream task of linear regression, and the remaining factors represent the prior distributions. We assume independent diffuse priors for \(\beta , \sigma ^2_{y|\tilde{x}}, \sigma ^2_{x|\tilde{x}}\). To update the regression parameters \(\beta , \sigma ^2_{y|\tilde{x}}, \sigma ^2_{x|\tilde{x}}\), we use the Metropolis-Hastings algorithm in Appendix A; an illustrative sketch of such an update is given below. Using the factorization of the posterior in Eq. (7), the proposed method can be generalized to any statistical model for the downstream task.
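As an illustration only (the actual sampler is described in Appendix A), the sketch below performs one random-walk Metropolis step for \(\beta \) conditional on the current clusters; loglik_fn stands for a per-cluster log-likelihood such as the cluster_loglik sketch above, and log_prior_fn for the diffuse prior on \(\beta \).

```python
import numpy as np

rng = np.random.default_rng(1)

def mh_update_beta(beta, clusters, loglik_fn, log_prior_fn, step=0.1):
    """One random-walk Metropolis step for the regression coefficient beta,
    conditional on the current linkage structure (the list of clusters).
    loglik_fn(beta, cluster) returns the cluster log-likelihood and
    log_prior_fn(beta) the log prior density; this is only an illustrative
    stand-in for the sampler in Appendix A.
    """
    proposal = beta + step * rng.standard_normal()
    log_acc = (log_prior_fn(proposal) - log_prior_fn(beta)
               + sum(loglik_fn(proposal, c) for c in clusters)
               - sum(loglik_fn(beta, c) for c in clusters))
    return proposal if np.log(rng.random()) < log_acc else beta
```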

3.2 Multiple Linear Regression

We extend the downstream task to multiple regression, first considering the complete regression scenario. Let \(C_{j'}\) denote a cluster of size n, \(Y_{C_{j'}}\) denote the vector with the n observations of the response variable in this cluster, and \(X_{C_{j'}}\) denote the \(n \times p\) matrix with the values of the p covariates observed for the cluster units. Let \([YX]_{C_{j'}}\) denote the vector of \(n (p+1)\) elements obtained by stacking the n rows of the matrix \((Y_{C_ {j'}}, X_{C_{j'}})\), and let \(\tilde{X}_{j'}\) denote the vector containing the true values of the p covariates. Equation 4 can be generalized by assuming that

$$\begin{aligned}{}[YX]_{C_{j'}} \left| \right. \tilde{X}_{j'}\,\sim N_{n(p+1)} \left[ \left( I_{n\times n} \otimes \left( \begin{array}{c c} \beta ^t &{} 0_{p}^t\\ 0_{p\times p} &{} I_{p\times p} \end{array} \right) \right) \! \left( 1_{2n} \otimes \tilde{X} \right) , I_{n\times n} \otimes \left( \begin{array}{c c} \sigma ^2_{y|\tilde{x}} &{} 0\\ 0 &{} \varSigma _{x|\tilde{x}} \end{array} \right) \right] , \end{aligned}$$

where

$$\begin{aligned} 1_{2n} \otimes \tilde{X}\sim N_{2 n p} \left( 0_{2np}, (1_n 1_n^t )\otimes \left( \begin{array}{c c} \varSigma _{\tilde{x}} &{} \varSigma _{\tilde{x}}\\ \varSigma _{\tilde{x}} &{} \varSigma _{\tilde{x}} \end{array} \right) \right) . \end{aligned}$$

In this way, the marginal distribution of \([YX]_{C_{j'}}\) is \(n(p+1)\)-variate normal with zero mean and covariance matrix

$$\begin{aligned} \begin{gathered} \left( I_{n\times n} \otimes \left( \begin{array}{c c} \beta ^t &{} 0_{p}^t\\ 0_{p\times p} &{} I_{p\times p} \end{array} \right) \right) \left( (1_n 1_n^t )\otimes \left( \begin{array}{c c} \varSigma _{\tilde{x}} &{} \varSigma _{\tilde{x}}\\ \varSigma _{\tilde{x}} &{} \varSigma _{\tilde{x}} \end{array} \right) \right) \left( I_{n \times n}\otimes \left( \begin{array}{c c} \beta ^t &{} 0_{p}^t\\ 0_{p\times p} &{} I_{p\times p} \end{array} \right) \right) ^t +\\ \left( I_{n\times n} \otimes \left( \begin{array}{c c} \sigma ^2_{y|\tilde{x}} &{} 0\\ 0 &{} \varSigma _{x|\tilde{x}} \end{array} \right) \right) , \end{gathered} \end{aligned}$$

which simplifies into

$$\begin{aligned} (1_n 1_n^t )\otimes \left( \begin{array}{c c} \beta ^t \varSigma _{\tilde{x}} \beta &{} \beta ^t \varSigma _{\tilde{x}} \\ \varSigma _{\tilde{x}} \beta &{} \varSigma _{\tilde{x}} \end{array} \right) + I_{n\times n} \otimes \left( \begin{array}{c c} \sigma ^2_{y|\tilde{x}} &{} 0\\ 0 &{} \varSigma _{x|\tilde{x}} \end{array} \right) . \end{aligned}$$

The likelihood provided by the multiple regression model is the product of the factors \(p([YX]_{C_{j'}}=[y,x]_{C_{j'}}|\beta , \sigma ^2_{y|\tilde{x}}, \varSigma _{x|\tilde{x}})\) over the observed clusters. The same considerations regarding the prior specification and the computational aspects of simple linear regression apply to multiple linear regression. The major difference is the marginalization pattern in the broken regression scenario: for a cluster joining records across more than one database, we may need to integrate out the covariate values that are missing in the databases sharing the cluster. A sketch of the covariance construction above follows.
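The following minimal sketch builds the marginal covariance displayed above for a cluster of size n in the complete multiple-regression scenario; the argument names (s2_y, Sigma_x_given_xt, Sigma_xt) are illustrative stand-ins for \(\sigma ^2_{y|\tilde{x}}\), \(\varSigma _{x|\tilde{x}}\), and \(\varSigma _{\tilde{x}}\).

```python
import numpy as np

def multi_reg_cluster_cov(n, beta, s2_y, Sigma_x_given_xt, Sigma_xt):
    """Covariance of the stacked cluster vector [YX]_{C_{j'}} in the
    complete multiple-regression scenario:
    J_n (x) [[b'Sb, b'S], [Sb, S]] + I_n (x) blockdiag(s2_y, Sigma_x|x~).

    beta             : length-p coefficient vector
    s2_y             : sigma^2_{y|x~}
    Sigma_x_given_xt : p x p measurement-error covariance of the covariates
    Sigma_xt         : p x p covariance of the true covariates
    """
    beta = np.asarray(beta, dtype=float)
    Sigma_xt = np.asarray(Sigma_xt, dtype=float)
    Sigma_x_given_xt = np.asarray(Sigma_x_given_xt, dtype=float)
    p = len(beta)
    # "Signal" block shared by all record pairs in the cluster.
    top = np.concatenate([[beta @ Sigma_xt @ beta], beta @ Sigma_xt])
    bottom = np.column_stack([Sigma_xt @ beta, Sigma_xt])
    signal = np.vstack([top, bottom])                       # (p+1) x (p+1)
    # Record-specific measurement-error block.
    noise = np.zeros((p + 1, p + 1))
    noise[0, 0] = s2_y
    noise[1:, 1:] = Sigma_x_given_xt
    return np.kron(np.ones((n, n)), signal) + np.kron(np.eye(n), noise)

# Example: a cluster of n = 2 records with p = 2 covariates.
cov = multi_reg_cluster_cov(2, [1.0, -0.5], 1.0, np.eye(2), np.eye(2))
```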

4 Experiments

To investigate the performance of our proposed methodology, we consider the RLdata500 data set from the RecordLinkage package in R. This synthetic data set consists of 500 records, each comprising first and last name and full date of birth. We modify this data set to obtain two databases of 250 records each, with duplicates within and across the two databases. To consider the case without duplicate detection, we further modify the original RLdata500 so that it has no duplicate records within either of the two databases. The setting without duplicate detection is a special case of our general methodology (see Appendix C). We provide experiments for both record linkage and the downstream task.

4.1 Record Linkage with and Without Duplicate-Detection

We provide two record linkage experiments—one with duplicate detection and one without duplicate detection. In Figs. 1 and 2, we report the prior and the posterior for k(z) and the performance of the record linkage procedure measured in terms of the posteriors of the false negative rates (FNR) and the false discovery rates (FDR). (For a review of FNR and FDR, see [1, 15]).
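As a reference for how these metrics can be computed, the sketch below scores an estimated linkage against the truth using a common pairwise definition of FNR and FDR; see [1, 15] for the definitions adopted in the experiments.

```python
def fnr_fdr(estimated_pairs, true_pairs):
    """Pairwise false negative rate and false discovery rate (a sketch).

    estimated_pairs, true_pairs : sets of frozensets, each a pair of record
    identifiers declared (resp. known) to refer to the same entity.
    """
    missed = true_pairs - estimated_pairs          # true links not declared
    false_links = estimated_pairs - true_pairs     # declared links that are wrong
    fnr = len(missed) / len(true_pairs) if true_pairs else 0.0
    fdr = len(false_links) / len(estimated_pairs) if estimated_pairs else 0.0
    return fnr, fdr

# Example: one true co-referent pair, one correct link and one false link.
true_pairs = {frozenset({"a", "b"})}
est_pairs = {frozenset({"a", "b"}), frozenset({"a", "c"})}
print(fnr_fdr(est_pairs, true_pairs))   # (0.0, 0.5)
```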

Fig. 1. Prior and posteriors for k(z) (first row), FNR posteriors (second row), and FDR posteriors (third row) for the RLdata500 data set.

Figure 1 (with duplicate detection) illustrates that the resulting posteriors of k(z) appear robust to the choices of \(\theta \) and \(\sigma \) (first row). We observe similar behavior for the posteriors of FNR and FDR (second and third rows). Figure 2 illustrates that, as we vary the PYP parameters, the posterior of t is only weakly dependent on their values. The two-database framework without duplicate detection leads, a posteriori, to similar FNR (second row) and lower FDR (third row) compared to the previous case. (See Appendix D for the PYP parameter settings.)

Fig. 2. Prior and posteriors for t (first row), FNR posteriors (second row), and FDR posteriors (third row) for the RLdata500 data set, assuming a two-database record linkage framework without duplicate detection.

4.2 Regression Experiments

We consider three regression experiments on the RLdata500 data set. In Experiment I, we consider the complete regression scenario in a single-database framework with duplicate detection. In Experiment II, we consider the broken regression scenario with record linkage and duplicate detection. In Experiment III, we consider the broken multiple regression scenario in a two-database framework without duplicate detection. (See Appendix E for details.)

Fig. 3. Experiment I. Upper panels: prior (dot-dash lines) and posterior of \(\beta , \sigma _{y|\tilde{x}}, \sigma _{x|\tilde{x}} \) with the joint record linkage and regression model (solid lines) and with the true linkage structure (dotted lines). Lower panels: posteriors for k(z), FNR, and FDR.

Figure 3 gives the results of Experiment I. The posteriors of \((\beta \), \(\sigma _{y|\tilde{x}}\), \(\sigma _{x|\tilde{x}})\) from our joint modeling approach (first row, solid lines) do not show remarkable differences when compared to their true counterparts (first row, dotted lines), which were obtained by fitting the regression model conditional on the true value of \(\lambda \). The similarity between the posteriors is mainly due to the large concentration of \(\lambda \) around the true pattern of duplications. The mode of the posterior of the number of distinct entities is exactly the true value (450), and the FNR and FDR are considerably smaller than in the case without the y and x columns. Hence, incorporating the information provided by the regression model has improved the record linkage process.

Figure 4 gives the results of Experiment II. The posteriors (first row, solid lines) of \((\beta , \sigma _{y|\tilde{x}}, \sigma _{x|\tilde{x}})\) are similar to the corresponding true posteriors (first row, dashed lines). We also report the posteriors obtained by fixing \(\lambda \) equal to the point estimate provided by the hit-and-miss model applied to the categorical variables alone (first row, dotted lines). The posteriors of \(\beta \) and \(\sigma ^2_{y|\tilde{x}}\) obtained with this plug-in approach are strongly biased due to the presence of false matches, which, on the other hand, do not affect the posterior of \(\sigma _{x|\tilde{x}}\); this distribution depends on the 13 duplicated entities with two copies of x, which are correctly accounted for in the plug-in approach. To better illustrate the causes of the distortion in the estimation of the regression parameters, the right panel on the top row shows all the (x, y) pairs resulting from the plug-in approach. The solid black circles represent the true matches, and the empty red circles represent the false matches, with independent y and x values. We report the corresponding regression lines, where the three false matches lower the \(\beta \) estimate and increase the \(\sigma _{y|\tilde{x}}\) estimate. Further analysis reveals that the posterior for k(z) (second row) with the integrated hit-and-miss and regression model is less concentrated than in the first experiment but more concentrated than with the hit-and-miss model alone. We reduce the FDR while leaving the FNR almost unchanged. We call this the feedback effect of the downstream regression task: for example, if we consider a false link, its posterior probability of being a match will typically be down-weighted by the low likelihood arising from the regression part of the model. Hence, in addition to centering the estimates of the regression coefficient \(\beta \), the joint regression and hit-and-miss model improves record linkage performance.

Fig. 4. Experiment II. Left upper panels: prior (dot-dash lines) and posterior of \(\beta , \sigma _{y|\tilde{x}}, \sigma _{x|\tilde{x}} \) with the joint record linkage and regression model (solid lines), the true linkage structure (dotted lines), and the plug-in approach (dashed lines). Right upper panel: estimated regression line and (x, y) pairs with the joint model (solid line and full circles) and the plug-in approach (dashed line and empty circles). Lower panels: posteriors for k(z), FNR, and FDR.

Fig. 5. Experiment III. Same caption as Fig. 3.

Figure 5 gives the results of Experiment III. The joint model gives posteriors similar to the true ones, while the plug-in approach gives biased estimates and larger variability (first row, left upper panels). The presence of false matches in the plug-in approach produces a positive bias in estimating \(\sigma _{y|\tilde{x}}\) and affects the posteriors of the measurement error parameters (first row, right upper panels). The posteriors of \(\sigma _{x_1|\tilde{x}}\) and \(\sigma _{x_2|\tilde{x}}\) (not reported), both with the joint model and with the true \(\lambda \), are essentially equal to the prior, while the plug-in posterior is concentrated on larger values. Under such conditions, even with the true linkage structure, we do not have any useful information for estimating the measurement error variances due to the lack of duplicated x values. Thus, while the joint model correctly does not contradict the information provided by the prior, the presence of false matches creates (x, y) pairs that could also be explained by a larger measurement error of the covariates. We observe that the joint modeling of the record linkage and regression data improves the matching process, as shown by the higher concentration of k(z) (second row, left lower panel) around the true value of 450 and the lower FNRs and FDRs (second row, right lower panels) with respect to the results obtained with the hit-and-miss model only.

5 Discussion

We have made three major contributions in this paper. First, we have proposed a Bayesian record linkage model and investigated the role that prior partition models may play in the matching process. Second, we have proposed a generalized framework for record linkage and regression that accounts for the record linkage error exactly. Using our methodology, one is able to generate a feedback mechanism from the working statistical model onto the record linkage process. This feedback mechanism is essential to eliminate potential biases that can jeopardize the resulting post-linkage inference. Third, we have illustrated our record linkage and multiple regression methodology in several experiments on a synthetic data set, where improvements are gained in terms of standard record linkage evaluation metrics.