1 Introduction

Drawing inference on superpopulation parameters by combining data from different surveys has attracted considerable recent interest among survey practitioners (Citro 2014; Kim and Rao 2012; Gelman et al. 1998). For an up-to-date and comprehensive review of the methods, we refer to Lohr and Raghunathan (2016). The central idea behind any such method is to use information from different sources effectively to enhance the efficiency of the estimators. In this paper, we propose a method for combining data based on the quadratic inference function (QIF) (Lindsay and Qu 2003) in the context of linear regression analysis. To the best of our knowledge, the use of QIF has not been considered before in the survey sampling literature.

For the methodological development in this paper, we consider the model-design-based randomization approach to inference discussed in Roberts and Binder (2009), Graubard and Korn (2002), and Godambe and Thompson (1986). Specifically, we consider two finite populations \(\mathcal {P}_1 = \{(y_i,x_{1i},x_{2i}):i\in U_1\}\) and \(\mathcal {P}_2 = \{(y_i,x_{1i}):i\in U_2\}\) of sizes \(N_1\) and \(N_2\), respectively, where \(U_1\) and \(U_2\) are index sets of the population units in \(\mathcal {P}_1\) and \(\mathcal {P}_2\), respectively. Notice that \(\mathcal {P}_1\) and \(\mathcal {P}_2\) can be considered as random samples from a superpopulation. We assume:

  (i) The study variables in each finite population are independent realizations of the random variables \((y,x_1,x_2)\), where \(x_1\) and \(x_2\) are exogenous, and y is a continuous endogenous variable. Also, given \(x_1\) and \(x_2\), y is generated by a linear regression model \(y=\beta _0 + \beta _1 x_1 + \beta _2 x_2 +\epsilon \), where \(\epsilon \) is the error term independent of \(x_1\) and \(x_2\), and has mean 0 and variance \(\sigma ^2\). However, in \(\mathcal {P}_2\) observations on \(x_2\) are missing.

  (ii) A probability sample is selected from each resulting finite population using either the same or different sampling designs.

The above theoretical set-up may represent an important practical situation that often arises in survey sampling. Suppose in a survey with a relatively small sample size, the data are collected on a comprehensive set of exogenous variables; whereas in a different survey from the same population with a considerably larger sample size, the data are collected on a smaller subset of the same set of exogenous variables. The problem is to combine these independent samples effectively to get a better estimator.

Clearly, the problem stated above may be considered as a missing data problem where, for some units in the bigger sample, the data on one or more exogenous variables are missing. Multiple imputation is an often-used method (Rendall et al. 2013; Gelman et al. 1998; Rubin 1986) in such situations, but how it overcomes the omitted variable bias is not quite clear. In contrast, the QIF-based methodology that we propose here recognizes and takes into account the omitted variable bias explicitly. Although the proposed methodology is applicable for combining data from any number of surveys in the set-up described above, we restrict our discussion to two surveys simply for ease of exposition.

The paper is organized as follows. In Sect. 2, we briefly discuss the estimation methodology based on QIF in a general setting, keeping in view the context of our application. In Sect. 3, we propose design-weighted QIF estimators of the regression coefficients using data from multiple surveys. Our methodology explicitly takes into account the omitted variable bias. In Sect. 4, we report the results of a limited simulation study. As expected, the simulation results show that the design-weighted QIF estimators based on the combined sample are substantially more efficient than the standard least squares estimators based on the sample with more covariates. Concluding remarks are given in Sect. 5.

2 Quadratic Inference Function

In this section we briefly introduce the QIF-based estimation methodology in a general setting. Suppose \(\mathbf b (x,\varvec{\theta })=(b_{1}(x,\varvec{\theta }), b_{2}(x,\varvec{\theta }),...,b_{q}(x,\varvec{\theta }))^{T}\) is a q-dimensional vector of distinct score functions, where \(\varvec{\theta }= (\theta _1,\theta _2,...,\theta _p)^{T}\) is a p-dimensional vector of parameters. The score functions are also called estimating functions in the statistics literature and moment conditions in the economics literature. Application of the QIF-based estimation methodology makes sense only if q is greater than p, that is, when there are more score equations than parameters.

Suppose \(\mathcal {F}_{\theta }\) is the semi-parametric model defined by the parameter \(\varvec{\theta }\) and the score equations

$$\begin{aligned} E_{F}{} \mathbf b (X,\varvec{\theta })=0, \end{aligned}$$
(1)

such that if a distribution \(F \in \mathcal {F}_{\theta }\), then (1) is satisfied and vice versa. On the other hand, if the true \(F\notin \mathcal {F}_{\theta }\), then \( E_{F}{} \mathbf b (X,\varvec{\theta })=\delta (\varvec{\theta }) \ne 0 \), where \(\delta (\varvec{\theta })\) represents the vector of discrepancy between the model and the true distribution F.

The quadratic distance function (QDF) between the true distribution F and the semi-parametric model \(\mathcal {F}_{\theta }\) as determined through the basic scores is then defined as

$$\begin{aligned} d(F, \mathcal {F}_{\varvec{\theta }})=\delta (\varvec{\theta })^T \Sigma _{\varvec{\theta }}^{-1}\delta (\varvec{\theta }), \end{aligned}$$
(2)

where \( \Sigma _{\varvec{\theta }}=Var(\mathbf b (X,\varvec{\theta }))\). For an arbitrary F, the value of \(\varvec{\theta }\) for which the basic scores are closest to mean 0 is then given by

$$\begin{aligned} \varvec{\theta }(F)= \arg min_{\varvec{\theta }}\, d(F, \mathcal {F}_{\varvec{\theta }}). \end{aligned}$$
(3)

For making data-based inference on \(\varvec{\theta }\), the QDF in (2) needs to be replaced by its empirical analogue, called the quadratic inference function. Suppose \(X_1,X_2,...,X_n \) are independently and identically distributed random variables following the distribution F. Then a natural estimator of \( E_{F}{} \mathbf b (X,\varvec{\theta })=\delta (\varvec{\theta })\) is \(\bar{\mathbf{b }}(\varvec{\theta })=n^{-1}\sum _{i=1}^{n}\mathbf b (X_i, \varvec{\theta })\). If, further, \(\hat{\Sigma }\) is a suitably chosen estimator of \(Var(\bar{\mathbf{b }}(\varvec{\theta }))\), the QIF is given by

$$\begin{aligned} Q (\varvec{\theta }) =\bar{\mathbf{b }}(\varvec{\theta })^T\hat{\Sigma }^{-1} \bar{\mathbf{b }}(\varvec{\theta }). \end{aligned}$$
(4)

The choice of \(\hat{\Sigma }^{-1}\) is an important issue. We refer to Lindsay and Qu (2003) for a detailed discussion of it. The QIF estimator of \(\varvec{\theta }\) is given by

$$\begin{aligned} \hat{\varvec{\theta }}=\arg min_{\varvec{\theta }}\,Q(\varvec{\theta }). \end{aligned}$$
(5)

If \(F \in \mathcal {F}_{\theta }\), \(\hat{\varvec{\theta }}\) is consistent for the true value of \(\varvec{\theta }\); otherwise it is consistent for the nonparametric functional \(\varvec{\theta }(F)\) (cf. (3)). For a discussion of the optimum properties of \(\hat{\varvec{\theta }}\), we refer to Lindsay and Qu (2003).
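As an illustration of (4) and (5), the following is a minimal sketch, not taken from Lindsay and Qu (2003). It assumes a toy model with one parameter (p = 1) and two score functions (q = 2), namely a distribution with mean \(\theta\) and variance assumed equal to 1. The function names, the choice of \(\hat{\Sigma }\) as the sample covariance of the scores divided by n, and the grid-refinement optimiser are all illustrative assumptions.

```python
import numpy as np

def scores(x, theta):
    """q = 2 scores for p = 1 parameter: mean theta, variance assumed to be 1."""
    return np.column_stack([x - theta, x**2 - theta**2 - 1.0])

def qif(x, theta):
    """Q(theta) = bbar' Sigma_hat^{-1} bbar, as in (4)."""
    b = scores(x, theta)                           # n x q matrix of scores
    bbar = b.mean(axis=0)                          # empirical mean of the scores
    sigma_hat = np.cov(b, rowvar=False) / len(x)   # assumed estimator of Var(bbar)
    return float(bbar @ np.linalg.solve(sigma_hat, bbar))

def qif_estimate(x, lo, hi, rounds=4, points=201):
    """Minimise Q over [lo, hi] by repeated grid refinement (a deliberately simple optimiser)."""
    best = lo
    for _ in range(rounds):
        grid = np.linspace(lo, hi, points)
        values = [qif(x, t) for t in grid]
        best = float(grid[int(np.argmin(values))])
        step = (hi - lo) / (points - 1)
        lo, hi = best - step, best + step          # zoom in around the current minimiser
    return best

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.0, size=500)       # simulated data with true theta = 2
theta_hat = qif_estimate(x, lo=0.0, hi=4.0)
```

Because q exceeds p here, no value of \(\theta\) sets both empirical scores to zero in general, and the quadratic form weighs the two scores against each other through \(\hat{\Sigma }^{-1}\).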

3 Design-Weighted QIF Estimator

Let us now consider the estimation of the regression parameter \(\varvec{\beta }=(\beta _0,\beta _1,\beta _2)^{T}\) of the superpopulation model introduced in Sect. 1. First, we introduce some notation. Suppose \(\mathbf S _1= \{(y_i,x_{1i},x_{2i}):i\in I_1 \subset U_1\} \) and \(\mathbf S _2= \{(y_i,x_{1i}):i\in I_2 \subset U_2\} \) represent the probability samples of sizes \(n_1(<N_1)\) and \(n_2(<N_2)\) drawn from the populations \(\mathcal {P}_1\) and \(\mathcal {P}_2\) using sampling designs \(p_1(.)\) and \(p_2(.)\), respectively, where \(I_1\) and \(I_2\) are index sets of selected sample units.

As stated at the outset, we adopt the model-design-based randomization approach (Roberts and Binder 2009) to the estimation of the superpopulation parameters. Like Chen and Sitter (1999), we propose a two-step design-weighted QIF estimator of \(\varvec{\beta }\) that could be used for complex surveys. In the first step, we define the QIF of \(\varvec{\beta }\), say, \(Q_U(\varvec{\beta })\), assuming \(\mathcal {P}_1\) and \(\mathcal {P}_2\) to be known. In the second step, we estimate \(Q_U(\varvec{\beta })\) by replacing the population-based quantities with their design-based estimators computed from the samples. We denote it by \(\widetilde{Q}_U(\varvec{\beta })\). Finally, the estimator of \(\varvec{\beta }\) is obtained by minimizing \(\widetilde{Q}_U(\varvec{\beta })\) with respect to \(\varvec{\beta }\). We now describe the two steps in detail.

Assuming \(\mathcal {P}_1\) to be known and to represent a random sample from the superpopulation, the basic score vector for \(\varvec{\beta }\) is given by:

$$\begin{aligned} \mathbf b _1(y,\mathbf x ,\varvec{\beta })= (y-\beta _0 -\beta _1 x_1 -\beta _2 x_2)\mathbf x , \end{aligned}$$
(6)

where \(\mathbf x =(1, x_1, x_2)^T\). Also, the assumed regression model of y given \(x_1\) and \(x_2\) entails \(E_{\varvec{\beta }}{} \mathbf b _1(Y,\mathbf X ,\varvec{\beta })= 0\). However, for \(\mathcal {P}_2\), the basic score function for \(\varvec{\beta }^{(1)}=(\beta _0,\beta _1)^{T}\) is given by:

$$\begin{aligned} \mathbf b _2^{*}(y,\mathbf x ^{(1)},\varvec{\beta }^{(1)})= (y-\beta _0 -\beta _1 x_1)\mathbf x ^{(1)}, \end{aligned}$$
(7)

where \(\mathbf x ^{(1)}=(1, x_1)^T.\) However, because \(x_2\) is omitted, the omitted variable bias leads to \(E_{\varvec{\beta }}{} \mathbf b _2^{*}(Y,\mathbf X ^{(1)},\varvec{\beta }^{(1)})= \varvec{\delta (\beta _2)}\), where \(\varvec{\delta (\beta _2)}= (0,\beta _2\sigma _{12})^T\), and \(\sigma _{12}=Cov(x_1,x_2).\)
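The form of \(\varvec{\delta }(\beta _2)\) can be checked by substituting the regression model into (7). The following is a sketch under the additional assumption, not stated explicitly above, that \(E(x_2)=0\) (as holds in the simulation design of Sect. 4). It also uses \(E(\epsilon )=0\) and the independence of \(\epsilon \) and \(x_1\):

$$\begin{aligned} E_{\varvec{\beta }}\left[ (y-\beta _0-\beta _1x_1)\mathbf x ^{(1)}\right] = E\left[ (\beta _2x_2+\epsilon )(1,x_1)^T\right] = \left( \beta _2E(x_2),\, \beta _2E(x_1x_2)\right) ^T = (0,\beta _2\sigma _{12})^T. \end{aligned}$$

Without the centering assumption, the first component would be \(\beta _2E(x_2)\) rather than 0, and \(E(x_1x_2)\) would equal \(\sigma _{12}+E(x_1)E(x_2)\).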

Assuming \(\sigma _{12}\) to be known for the time being, we define a modified score function for \(\varvec{\beta }\) that explicitly takes into account the omitted variable bias as follows:

$$\begin{aligned} \mathbf b _2(y,\mathbf x ^{(1)},\varvec{\beta })= (y-\beta _0 -\beta _1 x_1)\mathbf x ^{(1)} -\varvec{\delta }(\beta _2). \end{aligned}$$
(8)

Thus, by definition, we have \(E_{\varvec{\beta }}{} \mathbf b _2(Y,\mathbf X ^{(1)}, \varvec{\beta }) =0\). The population version of the QIF is thus based on the basic score functions given by (6) and (8).

Let us define \(\bar{\mathbf{b }}_1(\varvec{\beta })=N_1^{-1} \sum _{i\in U_1}{} \mathbf b _1(y_i,\mathbf x _i,\varvec{\beta })\), \(\bar{\mathbf{b }}_2(\varvec{\beta })=N_2^{-1} \sum _{i\in U_2}{} \mathbf b _2(y_i,{} \mathbf x ^{(1)}_i,\varvec{\beta })\), and \(\bar{\mathbf{b }}(\varvec{\beta })= (\bar{\mathbf{b }}_1(\varvec{\beta }), \bar{\mathbf{b }}_2(\varvec{\beta }))^T\). Let \(\hat{\Sigma }_{1\varvec{\beta }}\), \(\hat{\Sigma }_{2\varvec{\beta }}\), and \(\hat{\Sigma }_{\varvec{\beta }}\) be suitable finite population based estimators of \(Var(\mathbf b _1(Y,\mathbf X ,\varvec{\beta }))=\Sigma _{1\beta }\), \(Var(\mathbf b _2(Y,\mathbf X ^{(1)},\varvec{\beta }))=\Sigma _{2\beta }\) and \(Var(\mathbf b (Y,\mathbf X ,\varvec{\beta }))= \Sigma _{\beta }\), respectively, where \(\mathbf b (y,\mathbf x , \varvec{\beta })=(\mathbf b _1(y,\mathbf x , \varvec{\beta }), \mathbf b _2(y,\mathbf x ^{(1)}, \varvec{\beta }))^{T}\).

Then the first-step QIF of \(\varvec{\beta }\) is given by

$$\begin{aligned} Q_U(\varvec{\beta }) = W_1 \bar{\mathbf{b }}_1(\varvec{\beta })^T \hat{\Sigma }_{1\varvec{\beta }}^{-1}\bar{\mathbf{b }}_1(\varvec{\beta }) + W_2 \bar{\mathbf{b }}_2(\varvec{\beta })^T \hat{\Sigma }_{2\varvec{\beta }}^{-1}\bar{\mathbf{b }}_2(\varvec{\beta }) , \end{aligned}$$
(9)

where \(W_k=N_kN^{-1}, k=1,2,\) and \(N = N_1+N_2.\)

Let us now define the second-step QIF, \(\widetilde{Q}_U(\varvec{\beta })\), an estimator of \(Q_U(\varvec{\beta })\), based on the samples \(\mathbf S _1\) and \(\mathbf S _2\). Suppose \(\pi _{ik}= P_k(i\in I_k|i\in U_k) (>0)\) denotes the inclusion probability of the \(i\)-th unit of the \(k\)-th population in the sample \(\mathbf S _k\), where \(P_k(.)\) is the probability measure corresponding to the sampling design \(p_k(.)\) for \(i=1,2,...,N_k, k=1,2\). The design weights are then given by \(d_{ik}=\frac{\pi _{ik}^{-1}}{\sum _{j\in I_k} \pi _{jk}^{-1}}\), for \(i\in I_k, k=1,2.\) Defining \(\widetilde{\mathbf{b }}_{i1}(\varvec{\beta })= \mathbf b _1(y_i,\mathbf x _i,\varvec{\beta })\) for \(i\in I_1\), \(\widetilde{\mathbf{b }}_{i2}(\varvec{\beta })= \mathbf b _2(y_i,\mathbf x _i^{(1)},\varvec{\beta })\) for \(i\in I_2\), \(\widetilde{\mathbf{b }}_1(\varvec{\beta }) =\sum _{i\in I_1}d_{i1} \widetilde{\mathbf{b }}_{i1}(\varvec{\beta })\), \(\widetilde{\mathbf{b }}_2(\varvec{\beta }) =\sum _{i\in I_2}d_{i2} \widetilde{\mathbf{b }}_{i2}(\varvec{\beta })\), and \(\widetilde{\Sigma }_{k\varvec{\beta }} = \sum _{i\in I_k}d_{ik} (\widetilde{\mathbf{b }}_{ik}(\varvec{\beta })-\widetilde{\mathbf{b }}_{k}(\varvec{\beta })) (\widetilde{\mathbf{b }}_{ik}(\varvec{\beta })-\widetilde{\mathbf{b }}_{k}(\varvec{\beta }))^{T}\) for \(k=1,2,\) we obtain

$$\begin{aligned} \widetilde{Q}_U(\varvec{\beta }) = W_1\widetilde{\mathbf{b }}_1(\varvec{\beta })^{T} \widetilde{\Sigma }_{1\varvec{\beta }}^{-1}\widetilde{\mathbf{b }}_1(\varvec{\beta }) + W_2\widetilde{\mathbf{b }}_2(\varvec{\beta })^{T} \widetilde{\Sigma }_{2\varvec{\beta }}^{-1}\widetilde{\mathbf{b }}_2(\varvec{\beta }). \end{aligned}$$
(10)

The design-weighted QIF estimator of \(\varvec{\beta }\) is then given by

$$\begin{aligned} \hat{\varvec{\beta }}=\arg min_{\varvec{\beta }}\,\widetilde{Q}_U(\varvec{\beta }). \end{aligned}$$
(11)

Notice that throughout the development we assume \(\sigma _{12}\) to be known. This may be a reasonable assumption if information on \(x_1\) and \(x_2\) is available at the population level while the values of \((y, x_1, x_2)\) are known for the sample only. In this case, the design-weighted QIF estimators lead to a substantial improvement over the standard least squares estimators. If \(\sigma _{12}\) is not known, we plug its sample estimate into \(\widetilde{Q}_U(\varvec{\beta }).\) The latter also shows some improvement, as is evident from the numerical studies reported in the next section.
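The construction in (6), (8), (10) and (11) can be sketched in code as follows. This is an illustrative sketch under stated assumptions, not the implementation used for the numerical studies. All function names are hypothetical, \(\sigma _{12}\) is taken as known, and the demonstration data are drawn directly from the superpopulation with equal inclusion probabilities. The minimisation is an assumption as well: since the weighted score means are linear in \(\varvec{\beta }\), the sketch holds \(\widetilde{\Sigma }_{k\varvec{\beta }}\) fixed within each iteration and solves the resulting normal equations, which need not coincide exactly with minimising (10) directly.

```python
import numpy as np

def design_weights(pi):
    """Normalised design weights d_i = pi_i^{-1} / sum_j pi_j^{-1}."""
    w = 1.0 / np.asarray(pi, dtype=float)
    return w / w.sum()

def weighted_mean_and_cov(b, d):
    """Design-weighted score mean b-tilde and covariance Sigma-tilde."""
    bbar = d @ b
    c = b - bbar
    return bbar, (c * d[:, None]).T @ c

def scores_1(y, x1, x2, beta):
    """Score (6) for units observed on (y, x1, x2)."""
    X = np.column_stack([np.ones_like(x1), x1, x2])
    return (y - X @ beta)[:, None] * X

def scores_2(y, x1, beta, sigma12):
    """Bias-corrected score (8) for units observed on (y, x1) only."""
    X1 = np.column_stack([np.ones_like(x1), x1])
    delta = np.array([0.0, beta[2] * sigma12])
    return (y - X1 @ beta[:2])[:, None] * X1 - delta

def qif_tilde(beta, s1, s2, d1, d2, W, sigma12):
    """Second-step QIF (10)."""
    b1, S1 = weighted_mean_and_cov(scores_1(*s1, beta), d1)
    b2, S2 = weighted_mean_and_cov(scores_2(*s2, beta, sigma12), d2)
    return float(W[0] * b1 @ np.linalg.solve(S1, b1) + W[1] * b2 @ np.linalg.solve(S2, b2))

def qif_fit(s1, s2, d1, d2, W, sigma12, iters=20):
    """Iterated scheme: b-tilde_k(beta) = a_k - A_k beta, Sigma-tilde fixed per iteration."""
    (y1, x11, x21), (y2, x12) = s1, s2
    X = np.column_stack([np.ones_like(x11), x11, x21])
    X1 = np.column_stack([np.ones_like(x12), x12])
    a1, A1 = X.T @ (d1 * y1), X.T @ (d1[:, None] * X)
    a2 = X1.T @ (d2 * y2)
    A2 = np.column_stack([X1.T @ (d2[:, None] * X1), [0.0, sigma12]])  # 2 x 3
    beta = np.linalg.solve(A1, a1)            # start from the weighted LSE of sample 1
    for _ in range(iters):
        _, S1 = weighted_mean_and_cov(scores_1(y1, x11, x21, beta), d1)
        _, S2 = weighted_mean_and_cov(scores_2(y2, x12, beta, sigma12), d2)
        G1, G2 = np.linalg.solve(S1, A1), np.linalg.solve(S2, A2)      # Sigma^{-1} A
        lhs = W[0] * A1.T @ G1 + W[1] * A2.T @ G2
        rhs = W[0] * G1.T @ a1 + W[1] * G2.T @ a2
        beta = np.linalg.solve(lhs, rhs)
    return beta

# Hypothetical demonstration data (sizes and seed are arbitrary choices).
rng = np.random.default_rng(1)
beta_true, sigma12 = np.array([1.0, 0.5, 0.25]), 0.5
cov = np.array([[0.5, sigma12], [sigma12, 1.0]])

def draw(n):
    x = rng.multivariate_normal([0.0, 0.0], cov, size=n)
    y = beta_true[0] + x @ beta_true[1:] + rng.normal(0.0, 0.5, size=n)
    return y, x[:, 0], x[:, 1]

y1, x11, x21 = draw(400)
y2, x12, _ = draw(2000)                       # x2 is discarded for the second sample
d1 = design_weights(np.full(400, 0.1))
d2 = design_weights(np.full(2000, 0.1))
W = (4000 / 24000, 20000 / 24000)             # W_k = N_k / N with N_1 = 4000, N_2 = 20000
beta_hat = qif_fit((y1, x11, x21), (y2, x12), d1, d2, W, sigma12)
```

When \(\sigma _{12}\) is unknown, the same sketch could be run with a sample estimate passed in place of `sigma12`.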

4 Numerical Studies

We present the results of a limited simulation study comparing the performance of the design-weighted quadratic inference function estimator (QIFE) with that of the design-weighted least squares estimator (LSE).

Suppose the covariate vector \((x_1,x_2)^T\) has a bivariate normal distribution with mean vector \((0,0)^T\) and covariance matrix \(\varvec{\Sigma }(2\times 2)\). Given \((x_1,x_2)\), y has a normal distribution with mean \(1+0.5x_1+0.25x_2\) and variance 0.25. We consider two superpopulation models M1 and M2 corresponding to two choices of \(\varvec{\Sigma }\), say, \(\varvec{\Sigma }_{1}\) and \(\varvec{\Sigma }_2\), respectively, where

$$\begin{aligned} \Sigma _1 = \left( \begin{array}{cc} 0.5 &{} 0.5\\ 0.5 &{} 1.0 \end{array}\right) \end{aligned}$$

and

$$\begin{aligned} \Sigma _2 = \left( \begin{array}{cc} 0.5 &{} 0.14\\ 0.14 &{} 1.0 \end{array}\right) . \end{aligned}$$

Notice that for model M1 the correlation coefficient between \(x_1\) and \(x_2\) is 0.7 while for M2, it is 0.2.
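These correlations follow from the entries of the two covariance matrices, with the off-diagonal entry divided by the square root of the product of the diagonal entries. The values are rounded to one decimal place:

$$\begin{aligned} \rho _{M1} = \frac{0.5}{\sqrt{0.5\times 1.0}} \approx 0.71, \qquad \rho _{M2} = \frac{0.14}{\sqrt{0.5\times 1.0}} \approx 0.20. \end{aligned}$$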

Following are the steps of the simulation study:

Step 1: We generate finite populations \(U_1\) and \(U_2\) of sizes \(N_1\) and \(N_2\) using the above superpopulation model. First, we randomly generate a value of \(\mathbf x =(x_1,x_2)^T\), and then generate a value of y given \(\mathbf x \) using the conditional distribution of y given \(\mathbf x \). The finite populations \(U_1\) and \(U_2\) then comprise \(N_1\) and \(N_2\) such observations on \((y,x_1,x_2)\) generated independently. Next, by simple random sampling without replacement (SRSWOR), we select L samples of sizes \(n_1(=f_1N_1)\) and \(n_2(=f_2N_2)\) from \(U_1\) and \(U_2\), respectively, where \(f_1\) and \(f_2\) are the sampling fractions. The selected samples from \(U_1\) and \(U_2\) are denoted by \(\mathbf S _{1}^{(l)}\) and \(\mathbf S _{2}^{(l)}\), \(l=1,2,...,L\) respectively.

Step 2: Based on \(\mathbf S _1\), we compute the usual design-weighted LSE of \(\varvec{\beta }\). Also, based on \(\mathbf S _{1}\) and \(\mathbf S _{2}\), we compute the design-weighted QIFE from (11).

Step 3: We repeat Step 1 R times. At the r-th (\(r=1,2,...,R\)) replication, let the populations generated be \(U_1^{(r)}\) and \(U_2^{(r)}\). For each r, the selected samples from \(U_1^{(r)}\) and \(U_2^{(r)}\) are denoted by \(\mathbf S _1^{(rl)}\) and \(\mathbf S _2^{(rl)}\), \(l=1,2,...,L\), respectively. For each r and l, following Step 2, we compute the LSE and QIFE of \(\beta _j\), \(j=0,1,2\), say, \(\hat{\beta }_{j(LS)}^{(rl)} \) and \(\hat{\beta }_{j(QIF)}^{(rl)}\), respectively.

Step 4: For each estimator of \(\beta _j\), say, \(\hat{\beta }_{j}^{(rl)}\) (a generic notation), we compute the relative bias (RB), \([(RL)^{-1}\sum _{r,l}\hat{\beta }_j^{(rl)}-\beta _j]/|\beta _j|\), and the relative root mean squared error (RRMSE), \( \sqrt{(RL)^{-1}\sum _{r,l}(\hat{\beta }_j^{(rl)}-\beta _j)^2 }/|\beta _j|\).
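The four steps above can be sketched as the following skeleton. It is a simplified illustration, not the code behind Table 1: the function names are hypothetical, only the first population and the design-weighted LSE of \(\beta _1\) under model M1 are shown, and far fewer replications are used than in the study. A QIFE routine would be called at the same point in the inner loop.

```python
import numpy as np

def generate_population(N, cov, rng, beta=(1.0, 0.5, 0.25), sd=0.5):
    """Step 1: draw N units (y, x1, x2) from the superpopulation model."""
    x = rng.multivariate_normal([0.0, 0.0], cov, size=N)
    y = beta[0] + beta[1] * x[:, 0] + beta[2] * x[:, 1] + rng.normal(0.0, sd, size=N)
    return y, x[:, 0], x[:, 1]

def srswor(N, n, rng):
    """Step 1: indices of a simple random sample without replacement."""
    return rng.choice(N, size=n, replace=False)

def weighted_lse(y, x1, x2, d):
    """Step 2: design-weighted least squares estimate of (beta0, beta1, beta2)."""
    X = np.column_stack([np.ones_like(x1), x1, x2])
    return np.linalg.solve(X.T @ (d[:, None] * X), X.T @ (d * y))

def relative_bias(estimates, beta_j):
    """Step 4: RB = (mean of estimates - beta_j) / |beta_j|."""
    return (np.mean(estimates) - beta_j) / abs(beta_j)

def rrmse(estimates, beta_j):
    """Step 4: RRMSE = sqrt(mean squared error) / |beta_j|."""
    return np.sqrt(np.mean((np.asarray(estimates) - beta_j) ** 2)) / abs(beta_j)

rng = np.random.default_rng(2024)
cov_m1 = np.array([[0.5, 0.5], [0.5, 1.0]])     # covariance matrix of model M1
R, L, N1, f1 = 5, 5, 1000, 0.10                 # reduced R and L for illustration only
n1 = int(round(f1 * N1))

lse_beta1 = []
for r in range(R):                              # Step 3: regenerate the population
    y, x1, x2 = generate_population(N1, cov_m1, rng)
    for l in range(L):                          # L samples from each population
        idx = srswor(N1, n1, rng)
        d = np.full(n1, 1.0 / n1)               # SRSWOR gives equal normalised weights
        lse_beta1.append(weighted_lse(y[idx], x1[idx], x2[idx], d)[1])

rb, err = relative_bias(lse_beta1, 0.5), rrmse(lse_beta1, 0.5)
```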

For our simulation study, we consider \((N_1,N_2)\): (1000, 2000), (1000, 5000), \(R=L=100\) and \(f_1=f_2=0.10\). In Table 1, we report the RRMSE values for the LSEs and QIFEs of \(\beta _j, j=0,1,2\). The RB values are not shown. However, it has been observed that for \(n_1= 100,n_2=500\), i.e., when the second sample size is relatively large compared to the first, the relative biases of the two estimators are comparable. For \(n_1= 100,n_2=200\), the relative bias of the QIFE is slightly higher than that of the LSE. This is expected, as the LSE is unbiased while the QIFE is not. It is interesting to observe that, with an increase in the relative magnitude of \(N_2\) compared to \(N_1\), the performance of the QIFEs of \(\beta _j, j=0,1\) improves substantially over that of the LSEs. The improvement is also greater when the correlation between \(x_1\) and \(x_2\) is higher. The performances of the QIFE and the LSE of \(\beta _2\) are more or less the same.

Table 1 RRMSE of the least squares (LS) and quadratic inference function (QIF) estimators of the superpopulation parameters for models M1 and M2

5 Concluding Remarks

In this article we propose a quadratic inference function estimator of the superpopulation parameters that incorporates the design weights and uses information from multiple samples from the same superpopulation. For illustrative purposes, we have considered a linear regression superpopulation model. Our design-adjusted QIF estimator is appealing in the sense that it can be applied to complex survey designs. The simulation study shows encouraging results in situations where the sample containing observations on only a subset of the covariates is large relative to the sample with the full set of covariates. In future work, we plan to investigate the asymptotic properties of the proposed QIF estimator under complex survey designs.