Abstract
Combining information from several surveys of the same target population is an important practical problem in survey sampling. This paper is motivated by work that the authors undertook, sponsored by the Food and Nutrition Technical Assistance III Project (FANTA), with funding from the U.S. Agency for International Development (USAID) Bureau of Food Security (BFS). In the project, two surveys were conducted independently in some of the same areas, and we present a measurement error model approach to integrate the mean estimates obtained from the two surveys. The predicted values for the counterfactual outcome are used to create composite estimates for the overlapping areas. An application of the technique to the project is provided.
1 Introduction
Survey integration is an emerging research area of statistics, which concerns combining information from two or more independent surveys to get improved estimates for various parameters of interest for the target population. One of the early applications of survey integration is the Consumer Expenditure Survey [20], where two survey vehicles (a Diary survey and a quarterly interview survey) were used to obtain improved estimates for the Diary survey items. Renssen and Nieuwenbroek [16], Merkouris [12, 13], Wu [18] and Ybarra and Lohr [19] considered the problem of combining data from two independent surveys to estimate totals at the population and domain levels.
Combining information from two or more independent surveys is a problem frequently encountered in survey sampling. One of the classical setups used to combine information is two-phase sampling, where the measurement x is observed in both surveys and the study variable y is observed in only one survey, say, Survey A. There is no measurement of y in Survey B. In this case, we can treat the union of the Survey A and Survey B samples as a phase one sample and treat the Survey A sample as a phase two sample. Hidiroglou [6] formulated this problem and developed efficient estimation using a two-phase regression estimation method. Fuller [4], Legg and Fuller [11], and Kim and Rao [9] considered this problem as a missing data problem and developed mass imputation to obtain improved estimation for the total as well as domain totals. Our setup differs from the two-phase sampling approach in that we have different measurements of y from the two surveys.
We consider a situation where two surveys have common measurement for x but different measurements for y. For example, x can be demographic information that does not suffer from measurement errors but y can suffer from survey-specific measurement errors. The survey-specific difference can occur due to differences in survey questions or survey modes (e.g. [2]). In Table 1, for example, the Survey A sample contains observations in x and \(y_1\) while the Survey B sample contains observations in x and \(y_2\). In the case of \(y_1\) being the study variable of interest, if we can assume that \(y_2\) is a measurement for \(y_1\) with measurement errors, then at issue is the estimation of the population mean of \(y_1\) combining two surveys.
Our research is motivated by work sponsored by The Food and Nutrition Technical Assistance III Project (FANTA) with funding from U.S. Agency for International Development (USAID), to produce integrated estimates from two independent surveys conducted in Guatemala where the geographic areas covered by the two surveys have substantial overlap.
Section 2 provides background on the projects and data descriptions and Sect. 3 introduces the proposed method for survey integration. In Sect. 4, we illustrate the estimation process and results of the work sponsored by FANTA, and Sect. 5 provides concluding remarks.
2 The Food and Nutrition Technical Assistance III Project
2.1 Background
FANTA is a 5-year cooperative agreement between USAID and FHI 360. FANTA aims to improve the health and well-being of vulnerable groups through technical support in the areas of maternal and child health and nutrition in development and emergency contexts, HIV and other infectious diseases, food security and livelihood strengthening, agriculture and nutrition linkages, and emergency assistance in nutrition crises.
USAID is the lead U.S. government agency that works to end extreme global poverty and enable resilient, democratic societies to realize their potential. The Feed the Future Initiative (FTF) was launched in 2010 by the United States government to address global hunger and food insecurity. The Initiative is coordinated primarily by USAID and is housed within the Bureau of Food Security (BFS), but includes the Office of Food for Peace (FFP). The main objectives of the FTF initiative are the advancement of global agricultural development, increased food production and food security, and improved nutrition, particularly for vulnerable populations such as women and children. The FTF initiative is active in 19 developing focus countries in Africa, Asia and Latin America. One of these focus countries is Guatemala.
Both BFS (through the FTF initiative) and FFP sponsor periodic baseline, interim and end-line household surveys to gauge the extent of progress towards achieving the goals of the FTF initiative. In 2013, FFP engaged a third-party contractor, ICF International, to conduct a baseline household survey in five departments of the Western Highlands of Guatemala. In the same year, BFS/FTF (henceforth referred to as FTF) engaged a third-party contractor, UNC MEASURE, to conduct an interim household survey in the same five departments in Guatemala. Although the surveys were conducted in the same five departments, the geography of the two surveys did not exactly coincide; however, there was substantial geographic overlap. The union of the geography covered by the two surveys represents the FTF Zone of Influence (ZOI), where some of the most food insecure parts of the population in the country reside. Because FTF was interested in obtaining ZOI-level estimates for a number of key indicators using data from the two independent surveys, they provided funding to FANTA, which in turn engaged the authors to undertake the work. Because the geography of the two surveys overlapped, it was necessary to use data integration methods to produce overall ZOI-level estimates.
Guatemala has 22 departments, which are geographic entities, divided into 334 municipalities. The two surveys were each conducted in the following five departments of the Western Highlands of Guatemala: San Marcos, Totonicapan, Quiche, Quetzaltenango, and Huehuetenango. Thus, two surveys were conducted in these areas and the survey data from the two samples are ready to be combined for survey integration. More details of this project can be found in the reference provided by USAID [17].
2.2 Common indicators
ICF International (FFP) and UNC MEASURE (FTF) each used their own questionnaire for the surveys, and among the indicators in the questionnaires, there were 11 common indicators in both surveys indicating maternal and child health status. Among the 11 common indicators, 4 were collected at the household level and the remaining 7 were collected at the individual level. Five of the 7 individual-level indicators pertained to children and the remaining 2 to women. Table 2 presents the common indicators and their descriptions.
Most indicator variables are dichotomous, taking the values of either 0 or 1 in both data sets, but the other two indicator variables, which are ‘PCE’ and ‘WDDS,’ are numeric in both data sets. In this paper, we focus on the ‘PCE’ and the ‘HHS’ indicators for analysis as examples of a numeric variable and a dichotomous variable, respectively.
2.3 Survey design
2.3.1 FFP survey
The survey for the FFP project used a three-stage sampling design. In the first stage, the primary sampling unit is the village. The village population is stratified by department, and each department is divided into two substrata, except for Quetzaltenango, which has a single stratum. This gives nine strata in total. The first-stage sample selection probability is based on the number of villages in the sampling frame and the size of the village within each stratum. The sampling frame for the first stage sampling included all the villages identified for program implementation. Table 3 shows the summary of sample clusters in each stratum.
In the second stage sampling, sample households were selected randomly from each sampled village. The target number of households selected for each village was 40. The second stage sample selection probability is based on the number of households selected for each village divided by the total number of households in each village.
The third stage sampling was done at the individual level to select women and children in households. The third stage sample selection probability is based on the total number of individuals selected for each interview module and the number of eligible individuals in the household. Only one eligible woman was randomly selected using the Kish grid [10], but all children were selected to be interviewed.
The final sampling weights are computed as the inverse of the product of the first-order inclusion probabilities from the three stages.
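As a minimal sketch of this weight calculation, the function below multiplies three stage-wise inclusion probabilities and inverts the product. The function name, argument names and the numbers in the example are hypothetical; in particular, the first-stage probability is simply passed in, since its exact form depends on the stratum and village sizes.

```python
def final_weight(pi_stage1, n_hh_selected, n_hh_total, n_ind_selected, n_ind_eligible):
    """Design weight as the inverse of the product of three stage-wise inclusion probabilities."""
    # Stage 2: households sampled within a selected village.
    pi_stage2 = n_hh_selected / n_hh_total
    # Stage 3: individuals sampled within a selected household.
    pi_stage3 = n_ind_selected / n_ind_eligible
    return 1.0 / (pi_stage1 * pi_stage2 * pi_stage3)


# Hypothetical example: first-stage probability 0.1, 40 of 200 households
# selected, and 1 of 2 eligible women selected.
w = final_weight(0.1, 40, 200, 1, 2)
```

For a child in the same household, where all eligible children are interviewed, the third-stage probability would be 1 and the weight correspondingly smaller.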
2.3.2 FTF survey
The survey for the FTF project also used a three-stage sampling design using census sectors as the primary sampling units. In the first stage, the census areas (urban/rural) were formed in each department and census sectors were sampled within the census area. From the sampled census sectors, the sample households were randomly selected in the second stage sampling. For the third stage sampling, data on individual-level women and children were collected. All women and children in a household are included in the sample, but the weights associated with women and children are adjusted for nonresponse. Table 4 shows the summary of sample clusters in each stratum.
3 Survey data integration
We present the proposed method in the context of measurement error models. In a classical measurement error model problem, the interest lies in estimating the regression coefficient for the regression of y on x and the covariate x is subject to measurement errors [5]. In our problem, the measurement error occurs in y for one survey (Survey B) and we are interested in combining two surveys to estimate the population mean of y more efficiently. Thus, we still consider the data structure in Table 1. We treat \(y_{1}\) as the gold standard, \(y_{1}=y\), in the sense that there is no measurement error in \(y_{1}\).
Let \(f_1( y_1 \mid x; \theta _1)\) be the density for the conditional distribution of \(y_1\) given x, characterized by parameter \(\theta _1\). The model for \(f_{1}(y_{1} \mid x; \theta _1)\) can be called a structural equation model [3]. Let \(f_2( y_2 \mid x, y_1 ; \theta _2)\) be the density for the conditional distribution of \(y_2\) given \((x, y_1)\), characterized by parameter \(\theta _2\). For parameter identifiability, we assume that
$$\begin{aligned} f_2( y_2 \mid x, y_1 ; \theta _2) = f_2( y_2 \mid y_1 ; \theta _2). \qquad (1) \end{aligned}$$
Such an assumption is sometimes called the nondifferential measurement error assumption [1, p. 7] in the measurement error model literature. That is, x is an instrumental variable for \(y_1\). The nondifferential measurement error assumption is used to obtain a reduced model.
Given the sample with the data structure in Table 1, the imputed values for \(y_{1}\) in sample B are used to obtain the composite estimator that combines direct observations in sample A and synthetic values in sample B. The imputed values are the best predicted values of the counterfactual outcome variable \(y_{1}\) in sample B, which correct for measurement errors in the observed values of \(y_2\). The imputed values are generated using the prediction model for \(y_{1}\), \(f(y_{1} \mid x,y_{2})\).
For the parameter estimation, the (pseudo) maximum likelihood estimator of \(\theta _1\) and \(\theta _2\) can be obtained by using the full EM algorithm as follows:
[E-step] Compute
$$\begin{aligned} Q_1(\theta _1|\theta _1^{(t)},\theta _2^{(t)})= & {} \sum _{i \in S_a}w_{ia} \text {log}f_1(y_{1i}|x_i;\theta _1) \\&+ \sum _{i \in S_b}w_{ib}\text {E}\big [ \text {log}f_{1}(y_{1i}|x_i;\theta _1)\;|\;x_{i},y_{2i};\theta _{1}^{(t)},\theta _{2}^{(t)} \big ] \end{aligned}$$and
$$\begin{aligned} Q_2(\theta _2|\hat{\theta }_1^{(t)},\theta _2^{(t)})= & {} \sum _{i \in S_a}w_{ia} \text {E}\big [ \text {log}f_2(y_{2i}|y_{1i};\theta _2)\;|\;x_{i},y_{1i};\hat{\theta }_{1}^{(t)},\theta _2^{(t)} \big ] \\&+ \sum _{i \in S_b}w_{ib}\text {E}\big [ \text {log}f_2(y_{2i}|y_{1i};\theta _2)\;|\;x_{i},y_{2i};\hat{\theta }_{1}^{(t)},\theta _2^{(t)} \big ], \end{aligned}$$where \(S_a\) and \(S_b\) are the index sets for the Survey A sample and the Survey B sample, respectively. Also, \(w_{ia}\) and \(w_{ib}\) are the sampling weight for unit \(i \in S_{a}\) and for unit \(i \in S_{b}\), respectively. The conditional expectation in \(Q_{1}\) is taken with respect to
$$\begin{aligned} f(y_{1}|x,y_{2};\theta _1,\theta _2)=\frac{f_{1}(y_1|x;\theta _1)f_2(y_{2}|y_{1};\theta _2)}{\int f_{1}(y_1|x;\theta _1)f_2(y_{2}|y_{1};\theta _2) dy_{1}} \end{aligned}$$
evaluated at \(\theta _1=\theta _{1}^{(t)}\) and \(\theta _2=\theta _{2}^{(t)}\) for \(Q_1\), and at \(\theta _1=\hat{\theta }_{1}^{(t)}\) and \(\theta _2=\theta _{2}^{(t)}\) for the second term of \(Q_2\). For \(Q_2\), the first conditional expectation is taken with respect to \(f(y_{2i}|x_{i},y_{1i})=f(y_{2i}|y_{1i})\) by assumption (1), evaluated at \(\theta _2=\theta _{2}^{(t)}\).
[M-step] Update \(\theta _1\) by maximizing \(Q_1(\theta _1|\theta _1^{(t)},\theta _2^{(t)})\) with respect to \(\theta _1\) and update \(\theta _2\) by maximizing \(Q_2(\theta _2|\hat{\theta }_1^{(t)},\theta _2^{(t)})\) with respect to \(\theta _2\).
Based on the estimated parameters \(\hat{\theta }_1\) and \(\hat{\theta }_2\), the best predictor of \(y_{1}\) for the Survey B sample is obtained as the expectation of the predictive distribution, which is the conditional distribution of \(y_{1}\) given x and \(y_{2}\). That is, the best predictor of \(y_{1i}\) is
$$\begin{aligned} \hat{y}_{1i} = \text {E}\big ( y_{1i} \mid x_{i},y_{2i};\hat{\theta }_{1},\hat{\theta }_{2} \big ). \qquad (2) \end{aligned}$$
The parametric fractional imputation of Kim [7] can be used to generate fractionally imputed values for \(y_{1}\) in sample B under general parametric models [14]. When \(f_1(y_{1}|x;\theta _1)\) and \(f_2(y_{2}|x,y_{1};\theta _2)\) have general parametric models, the prediction model may not have a closed form. In this case, parametric fractional imputation can be implemented by the following two-step method:
1. For each \(i \in S_b\), generate \(y_{1i}^{*(j)}\) from \(f_1( y_{1i} \mid x_i ; \hat{\theta }_1 )\) for \(j=1,\ldots ,m\).
2. Let \(y_{1i}^{*(j)}\) be the j-th imputed value of \(y_{1i}\) obtained from Step 1. The fractional weight assigned to \(y_{1i}^{*(j)}\) is
$$\begin{aligned} w_{i}^{*(j)} = \frac{ f_2 ( y_{2i} \mid x_i , y_{1i}^{*(j)} ; \hat{\theta }_2) }{ \sum _{k=1}^m f_2 ( y_{2i} \mid x_i , y_{1i}^{*(k)} ; \hat{\theta }_2)}. \end{aligned}$$
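Once parametric fractional imputation is used, the conditional expectation in (2) can be computed by a Monte Carlo approximation. That is, the conditional expectation can be written as
$$\begin{aligned} \text {E}\big ( y_{1i} \mid x_{i},y_{2i};\hat{\theta }_{1},\hat{\theta }_{2} \big ) \cong \sum _{j=1}^{m} w_{i}^{*(j)} y_{1i}^{*(j)}. \end{aligned}$$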
Using the predicted counterfactual values (2) for the Survey B sample and the observations from the Survey A sample, we can construct a composite estimator that combines the two. The combined estimator is
Kim et al. [8] have investigated the parametric fractional imputation of Kim [7] in the context of statistical matching where the main interest lies in estimating \(\theta _2\) in \(f_2( y_2 \mid x, y_1; \theta _2)\). In their simulation study, the imputation model is based on the nondifferential measurement error assumption, but they noticed that departure from the assumption does not affect the validity of the imputation estimator for the population mean of \(y_1\), even though it leads to biased estimation of the regression parameters. Note that if the assumption does not hold, then the imputation model (based on the assumption) is incorrectly specified. Under the incorrectly specified model, the imputed estimator is still unbiased for the mean estimation, as long as an intercept term is included in the model [9].
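The two-step parametric fractional imputation described in this section can be sketched for a single unit of sample B as follows. This is only an illustration under assumed models: both \(f_1\) and \(f_2\) are taken to be normal (with a linear measurement error model), so that the Monte Carlo answer can be compared with the known closed-form conditional mean. The function name `pfi_predict` and all parameter values are hypothetical.

```python
import numpy as np


def pfi_predict(x_beta, y2, alpha0, alpha1, sigma_e, sigma_u, m=20000, seed=0):
    """Approximate E[y1 | x, y2] for one unit by parametric fractional imputation.

    Assumed (illustrative) models:
      f1: y1 | x  ~ N(x_beta, sigma_e^2)
      f2: y2 | y1 ~ N(alpha0 + alpha1 * y1, sigma_u^2)
    """
    rng = np.random.default_rng(seed)
    # Step 1: draw m imputed values from the structural model f1(y1 | x).
    y1_star = rng.normal(loc=x_beta, scale=sigma_e, size=m)
    # Step 2: fractional weights proportional to f2(y2 | y1*); the constant
    # factor of the normal density cancels when the weights are normalised.
    resid = (y2 - alpha0 - alpha1 * y1_star) / sigma_u
    w = np.exp(-0.5 * resid**2)
    w /= w.sum()
    # The weighted mean of the imputed values approximates the conditional mean.
    return float(np.sum(w * y1_star)), w


y1_hat, frac_w = pfi_predict(x_beta=1.0, y2=1.4, alpha0=0.2, alpha1=0.8,
                             sigma_e=1.0, sigma_u=0.5)
```

Drawing from \(f_1\) and weighting by \(f_2\) is a self-normalised importance sampling scheme, so the same code applies to non-normal models by replacing the sampler in Step 1 and the density in Step 2.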
4 Application of methodology to USAID surveys in Guatemala
Based on the two estimates obtained from the two independent surveys on the overlap areas, we can improve the efficiency of the estimation by combining the two estimates.
4.1 Survey data integration
In this section, we use the measurement error model approach presented in Sect. 3 to integrate the two surveys, the FFP and the FTF. Under the measurement error model approach, we treat one sample as a gold standard and the other sample as containing measurement errors.
Throughout this study, the FFP sample was used as a benchmark and we predicted the counterfactual outcomes of the FTF sample, that is, the values that would have been obtained had the FTF sample been collected by ICF International, which conducted the FFP survey. This is based on the idea that measurement errors between the two surveys are reduced when we use the predicted counterfactual values instead of the original values from the survey. We chose the FFP sample as the reference point since it has a smaller residual sum of squares than the FTF sample.
4.1.1 Case 1: continuous study variable
Since the PCE indicator takes continuous values, we assume that both the structural equation model and the measurement error model follow normal distributions. Assume that the structural equation model for \(y_{1}\) is
$$\begin{aligned} y_{1i} = \varvec{\beta }_{1}\mathbf {x}_{1i}+\beta _{2}x_{2i}+e_{i}, \qquad (3) \end{aligned}$$
where \(\mathbf {x}_{1i}\) is a department indicator and \(x_{2i}\) is a variable indicating the total number of household members, and \(e_{i} \sim N(0,\sigma _{e}^{2})\). Also, the measurement error model for \(y_{2}\) is
$$\begin{aligned} y_{2i} = \alpha _0+\alpha _1y_{1i}+u_{i}, \end{aligned}$$
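where \(u_{i} \sim N(0,\sigma _{u}^{2})\). By Bayes' theorem, the predictive distribution can be derived as
$$\begin{aligned} y_{1i} \mid \mathbf {x}_{i},y_{2i} \sim N\big ( \tilde{\mu }_{i},\tilde{\sigma }^{2} \big ), \qquad (4) \end{aligned}$$
where \(\mathbf {x}_{i}=(\mathbf {x}_{1i},x_{2i})\) with \(\varvec{\beta }=(\varvec{\beta }_{1},\beta _{2})\) and
$$\begin{aligned} \tilde{\mu }_{i}=\frac{\sigma _u^2\,\varvec{\beta }\mathbf {x}_{i}+\alpha _1\sigma _e^2\,(y_{2i}-\alpha _0)}{\sigma _u^2+\alpha _1^2\sigma _e^2}, \qquad \tilde{\sigma }^{2}=\frac{\sigma _e^2\sigma _u^2}{\sigma _u^2+\alpha _1^2\sigma _e^2}. \end{aligned}$$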
For the analysis of the PCE indicator, we assumed a linear regression model (3). The model diagnostics for the model assumptions are given in Fig. 1. Two plots show that the normality assumption and the homogeneity of variance assumption are appropriate. The residual plot also shows no particular pattern in the residuals, so the model assumptions in (3) are regarded as reasonable.
For the parameter estimation, we write \(\theta _{1}=(\varvec{\beta }_1,\beta _2,\sigma _e^2)\) and \(\theta _2=(\alpha _0,\alpha _1,\sigma _u^2)\). The estimators of \(\theta _1\) and \(\theta _2\) can be obtained by the full EM algorithm, as explained in Sect. 3. In this example, \(Q_1\) and \(Q_2\) are as follows:
[E-step] Compute
$$\begin{aligned} Q_1(\theta _1|\theta _1^{(t)},\theta _2^{(t)})= & {} \sum _{i \in S_a}w_{ia} \left\{ -\frac{1}{2}\text {log}(\sigma _e^2)-\frac{1}{2\sigma _e^2}\left( y_{1i}-\varvec{\beta }\mathbf {x}_{i}\right) ^2 \right\} \\&+ \sum _{i \in S_b}w_{ib}\text {E}\left[ -\frac{1}{2}\text {log}(\sigma _e^2)-\frac{1}{2\sigma _e^2}\left( y_{1i}-\varvec{\beta }\mathbf {x}_{i}\right) ^2 \,|\, \mathbf {x}_{i},y_{2i};\theta _{1}^{(t)},\theta _{2}^{(t)} \right] \end{aligned}$$and
$$\begin{aligned}&Q_2(\theta _2|\hat{\theta }_1^{(t)},\theta _2^{(t)}) \nonumber \\&\quad = \sum _{i \in S_a}w_{ia}\text {E}\left[ -\frac{1}{2}\text {log}(\sigma _u^2)-\frac{1}{2\sigma _u^2}\left( y_{2i}-\alpha _0-\alpha _1y_{1i}\right) ^2 \Bigm | \mathbf {x}_{i},y_{1i}; \hat{\theta }_{1}^{(t)},\theta _{2}^{(t)} \right] \nonumber \\&\qquad + \sum _{i \in S_b}w_{ib}\text {E}\left[ -\frac{1}{2}\text {log}(\sigma _u^2)-\frac{1}{2\sigma _u^2}\left( y_{2i}-\alpha _0-\alpha _1y_{1i}\right) ^2 \Bigm | \mathbf {x}_{i},y_{2i};\hat{\theta }_{1}^{(t)},\theta _{2}^{(t)} \right] , \end{aligned}$$where the conditional distribution for
$$\begin{aligned} f(y_{1}|\mathbf {x},y_{2};\theta _1,\theta _2)=\frac{f_{1}(y_1|\mathbf {x};\theta _1)f_2(y_{2}|y_{1};\theta _2)}{\int f_{1}(y_1|\mathbf {x};\theta _1)f_2(y_{2}|y_{1};\theta _2) dy_{1}} \end{aligned}$$is also normal as in (4), evaluated at \(\theta _1=\hat{\theta }_{1}^{(t)}\) and \(\theta _2=\hat{\theta }_{2}^{(t)}\).
[M-step] Update \(\theta _1\) by maximizing \(Q_1(\theta _1|\theta _1^{(t)},\theta _2^{(t)})\) with respect to \(\theta _1\) and update \(\theta _2\) by maximizing \(Q_2(\theta _2|\hat{\theta }_1^{(t)},\theta _2^{(t)})\) with respect to \(\theta _2\).
Based on the estimated parameters \(\hat{\theta }_1\) and \(\hat{\theta }_2\), the best predictor of \(y_{1}\) for the FTF sample is obtained as the mean of the predictive distribution, which is the conditional expectation of \(y_{1}\) given \(\mathbf {x}\) and \(y_{2}\). That is,
$$\begin{aligned} \hat{y}_{1i}=\text {E}\big ( y_{1i} \mid \mathbf {x}_{i},y_{2i};\hat{\theta }_{1},\hat{\theta }_{2} \big ) \end{aligned}$$
is the best prediction of \(y_{1i}\) in the FTF sample that corrects for measurement errors in \(y_{2i}\).
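Under the normal structural equation model and the linear normal measurement error model \(y_{2i}=\alpha _0+\alpha _1y_{1i}+u_{i}\), the conditional mean and variance of \(y_{1i}\) given \((\mathbf {x}_{i},y_{2i})\) follow from completing the square in \(y_{1i}\). A minimal sketch of that calculation is given below; the function name and the parameter values are hypothetical and are not estimates from the surveys.

```python
def predict_y1_normal(x_beta, y2, alpha0, alpha1, sigma_e2, sigma_u2):
    """Mean and variance of y1 | x, y2 under the normal-normal model.

    x_beta   : linear predictor of the structural model (beta * x)
    sigma_e2 : variance of the structural model error e
    sigma_u2 : variance of the measurement error u
    """
    denom = sigma_u2 + alpha1**2 * sigma_e2
    # Precision-weighted combination of the model prediction x_beta and the
    # back-transformed observation (y2 - alpha0) / alpha1.
    mean = (sigma_u2 * x_beta + alpha1 * sigma_e2 * (y2 - alpha0)) / denom
    var = sigma_e2 * sigma_u2 / denom
    return mean, var


mean, var = predict_y1_normal(x_beta=1.0, y2=1.4, alpha0=0.2, alpha1=0.8,
                              sigma_e2=1.0, sigma_u2=0.25)
```

When \(\sigma _u^2\) is small relative to \(\alpha _1^2\sigma _e^2\), the predictor is close to \((y_{2i}-\alpha _0)/\alpha _1\); when it is large, the predictor shrinks toward the regression prediction \(\varvec{\beta }\mathbf {x}_{i}\).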
Using the counterfactual values of the FTF sample and observations of the FFP sample, we can construct a composite estimator that combines two values. The combined estimator is
where \(S_{a}\) and \(S_{b}\) denote the FFP sample and the FTF sample, respectively.
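One simple way to combine the observed values in \(S_{a}\) with the predicted values in \(S_{b}\) is a pooled weighted mean over both samples. The sketch below assumes that form purely for illustration; the function name, the pooling rule and the toy numbers are all assumptions of this example, and other composite weightings are possible.

```python
import numpy as np


def composite_mean(y1_a, w_a, y1_hat_b, w_b):
    """Pooled weighted mean of observed y1 (sample A) and predicted y1 (sample B).

    Assumption of this sketch: the two samples are pooled with their own
    sampling weights and normalised by the total weight.
    """
    y1_a = np.asarray(y1_a, dtype=float)
    w_a = np.asarray(w_a, dtype=float)
    y1_hat_b = np.asarray(y1_hat_b, dtype=float)
    w_b = np.asarray(w_b, dtype=float)
    numerator = np.sum(w_a * y1_a) + np.sum(w_b * y1_hat_b)
    denominator = w_a.sum() + w_b.sum()
    return float(numerator / denominator)


# Toy example: two observed values in sample A, one predicted value in sample B.
est = composite_mean([1.0, 3.0], [1.0, 1.0], [2.0], [2.0])
```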
4.1.2 Case 2: dichotomous study variable
When the study variable is dichotomous, such as the HHS indicator in the project, the normal distribution assumption holds for neither the structural equation model nor the measurement error model. In this case, we consider a logistic regression model for the structural equation model, and a misclassification model is used instead of the measurement error model [1]. The structural equation model for \(y_{1}\) is
$$\begin{aligned} y_{1i} \mid \mathbf {x}_{i} \sim \text {Bernoulli}(\pi _{i}), \end{aligned}$$
where \(\mathbf {x}_{i}=(\mathbf {x}_{1i},x_{2i})\) and
$$\begin{aligned} \log \left( \frac{\pi _{i}}{1-\pi _{i}} \right) =\varvec{\beta }_{1}\mathbf {x}_{1i}+\beta _{2}x_{2i}, \end{aligned}$$
where \(\mathbf {x}_{1i}\) is a department indicator and \(x_{2i}\) is a variable indicating the total number of household members. The misclassification model is given by
$$\begin{aligned} P(y_{2i}=1 \mid y_{1i})=p\,y_{1i}+q\,(1-y_{1i}), \end{aligned}$$
where \(p=P(y_{2i}=1|y_{1i}=1)\) and \(q=P(y_{2i}=1|y_{1i}=0)\) are the misclassification parameters.
Denote the parameters \(\theta _1=(\varvec{\beta }_1,\beta _2)\) and \(\theta _2=(p,q)\). Then, the implementation of the EM algorithm via parametric fractional imputation involves the following steps:
[E-step]
$$\begin{aligned}&Q_1(\theta _1|\theta _1^{(t)},\theta _2^{(t)}) = \sum _{i \in S_a}w_{ia}\big [ y_{1i}(\varvec{\beta }_{1}\mathbf {x}_{1i}+\beta _{2}x_{2i})-\log \left\{ 1+\text {exp}(\varvec{\beta }_{1}\mathbf {x}_{1i}+\beta _{2}x_{2i}) \right\} \big ]\\&\quad + \sum _{i \in S_b}w_{ib}\sum _{j=1}^{2}w_{1i}^{*(j)}\big [y_{1i}^{*(j)}(\varvec{\beta }_{1}\mathbf {x}_{1i}+\beta _{2}x_{2i})-\log \left\{ 1+\text {exp}(\varvec{\beta }_{1}\mathbf {x}_{1i}+\beta _{2}x_{2i}) \right\} \big ] \end{aligned}$$
and
$$\begin{aligned}&Q_2(\theta _2|\hat{\theta }_1^{(t)},\theta _2^{(t)})=\sum _{i \in S_a}w_{ia}\sum _{j=1}^{2}w_{2i}^{*(j)} \big [ y_{2i}^{*(j)}\left\{ y_{1i}\log p+(1-y_{1i})\log q \right\} \big ] \\&\quad +\sum _{i \in S_a}w_{ia}\sum _{j=1}^{2}w_{2i}^{*(j)} \big [ (1-y_{2i}^{*(j)})\left\{ y_{1i}\log (1-p)+(1-y_{1i})\log (1-q) \right\} \big ] \\&\quad +\sum _{i \in S_b}w_{ib}\sum _{j=1}^{2}w_{1i}^{*(j)} \big [ y_{1i}^{*(j)}\left\{ y_{2i}\log p+(1-y_{2i})\log (1-p) \right\} \big ] \\&\quad +\sum _{i \in S_b}w_{ib}\sum _{j=1}^{2}w_{1i}^{*(j)} \big [ (1-y_{1i}^{*(j)})\left\{ y_{2i}\log q+(1-y_{2i})\log (1-q) \right\} \big ], \end{aligned}$$where \(y_{ki}^{*(1)}=1\) and \(y_{ki}^{*(2)}=0\) for \(k=1,2\) and
$$\begin{aligned} w_{1i}^{*(j)}= & {} P(y_{1i}^{*(j)}|y_{2i},\mathbf {x}_{i})\\\propto & {} f(y_{1i}^{*(j)}|\mathbf {x}_{i})P(y_{2i}|y_{1i}^{*(j)})\\ w_{2i}^{*(j)}= & {} P(y_{2i}^{*(j)}|y_{1i},\mathbf {x}_{i})\\= & {} P(y_{2i}^{*(j)}|y_{1i}), \end{aligned}$$where \(\sum _{j}w_{1i}^{*(j)}=1\) and \(\sum _{j}w_{2i}^{*(j)}=1\).
[M-step] Update \(\theta _1\) by maximizing \(Q_1(\theta _1|\theta _1^{(t)},\theta _2^{(t)})\) with respect to \(\theta _1\) and update \(\theta _2\) by maximizing \(Q_2(\theta _2|\hat{\theta }_1^{(t)},\theta _2^{(t)})\) with respect to \(\theta _2\).
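The best predictor of \(y_{1i}\) for the FTF sample can be written as
$$\begin{aligned} \hat{y}_{1i}=P(y_{1i}=1 \mid \mathbf {x}_{i},y_{2i};\hat{\theta }_{1},\hat{\theta }_{2})=\frac{\hat{\pi }_{i}\,\hat{p}^{\,y_{2i}}(1-\hat{p})^{1-y_{2i}}}{\hat{\pi }_{i}\,\hat{p}^{\,y_{2i}}(1-\hat{p})^{1-y_{2i}}+(1-\hat{\pi }_{i})\,\hat{q}^{\,y_{2i}}(1-\hat{q})^{1-y_{2i}}}, \qquad (6) \end{aligned}$$
where \(\hat{\pi }_{i}=P(y_{1i}=1 \mid \mathbf {x}_{i};\hat{\theta }_{1})\) is the fitted probability from the logistic regression model, and the composite estimator combining the two samples can be calculated as in (5) using (6).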
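For a dichotomous outcome, the predictor of \(y_{1i}\) is the conditional probability \(P(y_{1i}=1 \mid \mathbf {x}_{i},y_{2i})\), which corresponds to the fractional weight \(w_{1i}^{*(1)}\) evaluated at the final parameter estimates. A minimal sketch of this Bayes calculation follows; the function name and the numerical values of the structural probability and of p and q are hypothetical.

```python
def predict_y1_binary(pi, y2, p, q):
    """P(y1 = 1 | x, y2) by Bayes' theorem.

    pi : P(y1 = 1 | x) from the logistic structural model
    p  : P(y2 = 1 | y1 = 1)
    q  : P(y2 = 1 | y1 = 0)
    """
    like1 = p if y2 == 1 else 1.0 - p  # P(y2 | y1 = 1)
    like0 = q if y2 == 1 else 1.0 - q  # P(y2 | y1 = 0)
    numerator = pi * like1
    return numerator / (numerator + (1.0 - pi) * like0)


# Hypothetical values: structural probability 0.3, p = 0.9, q = 0.2.
w_pos = predict_y1_binary(0.3, 1, 0.9, 0.2)  # unit observed with y2 = 1
w_neg = predict_y1_binary(0.3, 0, 0.9, 0.2)  # unit observed with y2 = 0
```

With these illustrative values, observing \(y_{2i}=1\) raises the predicted probability above the structural probability, while \(y_{2i}=0\) lowers it.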
4.2 Variance estimation of the combined estimator
For variance estimation of the combined estimator, a replication variance estimation method is applied. More precisely, we used the bootstrap method of Rao and Wu [15]. For each bootstrap dataset \(D_{(b)}\), \(b=1,\ldots ,B\), we calculate the estimate from that bootstrap sample, say \(\hat{\mu }_{(b)}\). Then, the bootstrap approach computes the estimated variance of the estimator \(\bar{y}\) by
where \(\hat{\bar{\mu }}=B^{-1}\sum _{b=1}^{B}\hat{\mu }_{(b)}\) is the mean of B bootstrap estimates. We used \(B=500\) in this study.
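A rough sketch of the replication idea is given below. It assumes a commonly used variant of the rescaling bootstrap in which \(n_h-1\) primary sampling units are drawn with replacement from the \(n_h\) units of a stratum and the sampling weights are multiplied by a rescaling factor. The function names, the choice of \(n_h-1\) draws and the divisor used in the variance (B by default here, B-1 with `ddof=1`) are assumptions of this example.

```python
import numpy as np


def rao_wu_multipliers(n_psu, rng):
    """Bootstrap weight multipliers for one stratum with n_psu >= 2 PSUs.

    Draw n_psu - 1 PSUs with replacement and rescale the draw counts so that
    the multipliers sum to n_psu within the stratum.
    """
    draws = rng.integers(0, n_psu, size=n_psu - 1)
    counts = np.bincount(draws, minlength=n_psu)
    return counts * n_psu / (n_psu - 1)


def bootstrap_variance(estimates, ddof=0):
    """Variance of the bootstrap replicate estimates around their mean."""
    est = np.asarray(estimates, dtype=float)
    return float(np.var(est, ddof=ddof))


rng = np.random.default_rng(0)
mult = rao_wu_multipliers(n_psu=5, rng=rng)  # multiply the PSU weights by these
v = bootstrap_variance([1.0, 2.0, 3.0])      # toy replicate estimates
```

In a full implementation, each replicate would re-run the whole procedure (parameter estimation, prediction and the composite estimate) with the rescaled weights, so that the variance reflects the uncertainty in the estimated parameters as well.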
4.3 Results
In this section, results of the two examples in Sect. 4.1 are presented: the result for the PCE indicator is shown in Table 5 and the result for the HHS indicator is shown in Table 6. Both tables contain mean estimates of the FFP project (FFP), mean estimates of the FTF project (FTF) and combined mean estimates (Combined) using the original estimate of the FFP project and the new FTF mean estimates. Standard errors of each mean estimate are also reported.
Mean estimates of the FFP sample and the new mean estimates of the FTF sample are combined using (5) in order to obtain the composite estimates, and the result is listed in the last column of both tables. From the results in Tables 5 and 6, we find that the combined estimator provides reasonable estimates for the population mean with smaller standard errors.
Estimates of parameters of the measurement error model for PCE variable are \((\hat{\alpha }_{0},\hat{\alpha }_{1})=(0.261, 0.732)\). The \(\hat{\alpha }_{0}=0.261\) can be thought of as the mean of the measurement error model and it can explain why some combined estimates are outside the confidence interval of the estimate from the FTF.
In some cases, the combined estimate is not in between the FFP and the FTF. For example, the combined estimate of PCE in Totonicapan and the combined estimate of HHS in Huehuetenango are smaller than the FFP and the FTF. The new estimate of the FTF, which was adjusted for measurement errors, is even smaller than the FFP and it leads to the combined estimate that is not between the two original values. The new FTF estimate is not tabulated in the result, but the new estimate of PCE in Totonicapan is 0.275 and the new one of HHS in Huehuetenango is 8.70, which are smaller than the FFP for both cases.
5 Discussion
This study suggests a new approach to combine information from two surveys using the measurement error model approach, and it can be generalized to combine more than two sources of information. Using a structural equation model and a measurement error model, we present guidance on data integration, illustrated with the work sponsored by FANTA. The results shown in Tables 5 and 6 indicate that the reference estimate and the counterfactual predicted values of the other sample can be used to produce the combined estimates.
The choice of a benchmark among several surveys can be decided in various ways. We considered a smaller mean squared error as a criterion in our study. If we have auxiliary information, such as previous experiences on the surveys, it can be used to determine a gold standard among several surveys.
The proposed approach can be applied to combine data from more than two surveys. The method can be implemented in a similar way: set one survey as a benchmark, remove the measurement errors in the remaining survey data and calculate the composite estimator using the estimates from the surveys. Also, multivariate modeling for the structural equation model can provide more efficient estimation. Such an extension will be a topic for future research.
References
1. Buonaccorsi, J.P.: Measurement Error: Models, Methods, and Applications. Chapman & Hall, London (2010)
2. Dillman, D.A., Phelps, G., Tortora, R., Swift, K., Kohrell, J., Berck, J., Messer, B.L.: Response rate and measurement differences in mixed-mode surveys using mail, telephone, interactive voice response (IVR) and the internet. Soc. Sci. Res. 38(1), 1–18 (2009)
3. Fornell, C., Larcker, D.F.: Evaluating structural equation models with unobservable variables and measurement error. J. Mark. Res. 18, 39–50 (1981)
4. Fuller, W.A.: Estimation for multiple phase samples. In: Chambers, R.L., Skinner, C.J. (eds.) Analysis of Survey Data, pp. 307–322. Wiley, Chichester (2003)
5. Fuller, W.A.: Measurement Error Models. Wiley, New York (2009)
6. Hidiroglou, M.: Double sampling. Surv. Methodol. 27(2), 143–154 (2001)
7. Kim, J.K.: Parametric fractional imputation for missing data analysis. Biometrika 98, 119–132 (2011)
8. Kim, J.K., Berg, E., Park, T.: Statistical matching using fractional imputation. Surv. Methodol. 42, 19–40 (2016)
9. Kim, J.K., Rao, J.N.: Combining data from two independent surveys: a model-assisted approach. Biometrika 99(1), 85–100 (2012)
10. Kish, L.: A procedure for objective respondent selection within the household. J. Am. Stat. Assoc. 44(247), 380–387 (1949)
11. Legg, J.C., Fuller, W.A.: Two-phase sampling. Handb. Stat. 29, 55–70 (2009)
12. Merkouris, T.: Combining independent regression estimators from multiple surveys. J. Am. Stat. Assoc. 99(468), 1131–1139 (2004)
13. Merkouris, T.: Combining information from multiple surveys by using regression for efficient small domain estimation. J. R. Stat. Soc. Ser. B (Stat. Methodol.) 72(1), 27–48 (2010)
14. Park, S., Kim, J.K., Park, S.: An imputation approach for handling mixed-mode surveys. Ann. Appl. Stat. 10(2), 1063–1085 (2016)
15. Rao, J.N., Wu, C.: Resampling inference with complex survey data. J. Am. Stat. Assoc. 83(401), 231–241 (1988)
16. Renssen, R.H., Nieuwenbroek, N.J.: Aligning estimates for common variables in two or more sample surveys. J. Am. Stat. Assoc. 92(437), 368–374 (1997)
17. USAID: Baseline study of Food For Peace Title II development food assistance program in Guatemala (2013). https://www.usaid.gov/data/dataset/beafc8ed-c5cf-41a0-84a4-19303c309516
18. Wu, C.: Combining information from multiple surveys through the empirical likelihood method. Can. J. Stat. 32(1), 15–26 (2004)
19. Ybarra, L.M., Lohr, S.L.: Small area estimation when auxiliary information is measured with error. Biometrika 95(4), 919–931 (2008)
20. Zieschang, K.D.: Sample weighting methods and estimation of totals in the consumer expenditure survey. J. Am. Stat. Assoc. 85(412), 986–1001 (1990)
Acknowledgements
The research was partially supported by a grant from the US National Science Foundation (Grant no MMS-1324922).
Cite this article
Park, S., Kim, J.K. & Stukel, D. A measurement error model approach to survey data integration: combining information from two surveys. METRON 75, 345–357 (2017). https://doi.org/10.1007/s40300-017-0124-0