1 Introduction

It is not uncommon in sample surveys for some of the intended observations to be unavailable. Missing data occur in survey research because an element in the target population is not included in the survey sampling frame (noncoverage), because a sample element does not participate in the survey (total nonresponse), or because a responding sample element fails to provide an acceptable response to one or more of the survey items (item nonresponse). This latter type of nonresponse is a common occurrence and may arise for different reasons: a respondent refuses to answer an item or does not know the answer, an answer is inconsistent with answers to other items, or the interviewer fails to ask the question or to record the answer. In short, the problem of missing data arises frequently in practice.

One obvious consequence of nonresponse is that the actual sample size is smaller than the planned one. This can bias the estimates if nonrespondents differ from respondents on the characteristic of interest, and it also leads to greater sampling variance.

Different methods exist to handle missing data during the stages of data collection and processing. Their aim is to obtain a precise and complete data set. Nevertheless, errors and missing entries may remain even after the data have been collected and edited.

When some observations in the sample are missing (item nonresponse), a first option is to carry out a complete-case analysis. Methods based on completely recorded units create a rectangular data set by discarding all observations with any missing variable. Thus, when parameters are estimated, only those observations for which all the variables of interest have valid values are used. Little and Rubin (1987) pointed out the statistical shortcomings of all methods that ignore incomplete observations. While these methods can provide satisfactory results when the percentage of incomplete cases is low, in general they lead to biased estimates, since they assume that the data are missing completely at random. King et al. (1998) illustrate how complete-case methods are prone to serious errors. In short, this practice can introduce bias into the estimates and increase the sampling variance due to the reduction in sample size; see, e.g., Brick and Kalton (1996), Schafer (1997).

Alternatively, an imputation method may be used to find substitutes for the missing observations; see, e.g., Little and Rubin (1987), Särndal (1992) and Rubin (1987) for an interesting account. Certain commonly used imputation methods treat the imputed values as true observations, so that the statistical analysis can be carried out using the standard procedures developed for data without missing observations. It is well recognized that such a practice may invalidate the inferences and can have serious consequences. Some statisticians are reluctant to apply imputation because it manipulates the original information, although there are also reasons to justify its use. Other procedures, such as multiple imputation and model-assisted approaches, account for the fact that imputed values are not true observations, as they reflect the additional variance due to imputation error.

As a third option, we could try to improve the precision of the estimators by including all the cases available for their calculation.

Indirect estimation methods are easily comprehensible techniques for estimating the population total in survey sampling when an auxiliary characteristic correlated with the study characteristic is available; see, e.g., Sukhatme, Sukhatme, Sukhatme and Asok (1984). These techniques generally provide biased but more efficient estimators in comparison with the traditional unbiased estimator. They assume, however, that the sample data contain no missing observations, a specification that may not be tenable in many practical applications; see, e.g., Rubin (1977). Some authors have defined indirect estimators when the sample is drawn by simple random sampling without replacement and some observations are missing; see, e.g., Tracy and Osahan (1994) and Toutenburg and Srivastava (1998, 1999, 2000). However, no investigation appears to have been reported in the literature for other sample designs, and this is the main concern of the present paper. In this article, therefore, we consider the indirect estimation of the population total on the basis of a random sample drawn according to an arbitrary sample design. Using the methods of ratio, difference and regression estimation, we propose estimators for the population total of the study characteristic, in addition to the conventional estimators that discard incomplete observations.

This article is structured as follows: in Section 2 we present estimators for the population total that are more precise than the traditional estimators. Section 3 examines the properties of these estimators through a simulation study in the case of simple random sampling without replacement.

Lastly, in the Appendix, the problem is developed for the case of simple random sampling without replacement and for the case of stratified sampling.

2 Proposed estimators

Consider a population of N units from which a random sample, s, of fixed size, n, is drawn according to a sample design d = (Sd, Pd) with first-order inclusion probabilities πi. For this sample, the values of two variables, (yi, xi), i = 1,…, n, are observed for the estimation of the population total, Y.

It is assumed that a set of (n − p − q) complete observations on selected units in the sample is available. In addition, observations on the x characteristic are available for p units in the sample, but the corresponding observations on the y characteristic are missing. Similarly, we have a set of q observations on the y characteristic for which the associated values of the x characteristic are missing. Further, p and q are assumed to be integers satisfying 0 < p, q < n/2.

This population has the following structure:

Table 1

For the sake of simplicity, we separate the units of the sample s into three disjoint sets:

$$s_1=\left\{i\in s : x_i\ \text{and}\ y_i\ \text{are available}\right\}\\s_2=\left\{i\in s : x_i\ \text{is available but}\ y_i\ \text{is not}\right\}\\s_3=\left\{i\in s : y_i\ \text{is available but}\ x_i\ \text{is not}\right\}$$

If we write:

$$\hat{y}_{HT}^1=\sum_{i\in{s_1}}\frac{y_i}{\pi_i},\;\;\hat{y}_{HT}^3=\sum_{i\in{s_3}}\frac{y_i}{\pi_i},\;\;\hat{x}_{HT}^1=\sum_{i\in{s_1}}\frac{x_i}{\pi_i},\;\;\text{and}\;\hat{x}_{HT}^2=\sum_{i\in{s_2}}\frac{x_i}{\pi_i}$$
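As an illustration, these four Horvitz–Thompson totals are straightforward to compute; the sketch below uses made-up values and inclusion probabilities (a hypothetical example, not data from the paper) and assumes only NumPy:

```python
import numpy as np

def ht_total(values, pi):
    """Horvitz-Thompson estimator of a total: sum of v_i / pi_i over a subsample."""
    values = np.asarray(values, dtype=float)
    pi = np.asarray(pi, dtype=float)
    return float(np.sum(values / pi))

# Toy sample split into the three subsamples (hypothetical data):
# s1: both x and y observed; s2: only x observed; s3: only y observed.
pi1, y1, x1 = [0.1, 0.2], [10.0, 14.0], [5.0, 6.0]
pi2, x2 = [0.1], [4.0]
pi3, y3 = [0.2], [12.0]

y_ht_1 = ht_total(y1, pi1)   # \hat{y}^1_{HT}, about 170
y_ht_3 = ht_total(y3, pi3)   # \hat{y}^3_{HT}, about 60
x_ht_1 = ht_total(x1, pi1)   # \hat{x}^1_{HT}, about 80
x_ht_2 = ht_total(x2, pi2)   # \hat{x}^2_{HT}, about 40
```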

The following indirect estimators for the population total based on complete cases can be formulated:

$$\hat y_{r1} = \frac{\hat y_{HT}^1}{\hat x_{HT}^1}\,X = \frac{\sum\limits_{i \in s_1} \tfrac{y_i}{\pi_i}}{\sum\limits_{i \in s_1} \tfrac{x_i}{\pi_i}}\,X$$
(1)
$$\hat{y}_{d1}={\hat{y}_{HT}^1}+(X-{\hat{x}_{HT}^1})$$
(2)
$$\hat{y}_{Reg1}={\hat{y}_{HT}^1}+b(X-{\hat{x}_{HT}^1})$$
(3)

where b may be fixed and known, or unknown. In the latter case, minimizing the mean squared error yields:

$$b=\frac{\text{Cov}(x,y)}{\text{Var}(x)}$$

which must then be estimated from the sample.
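A minimal sketch of the complete-case estimators (1)–(3), again on hypothetical data with a known auxiliary total X; the helper `ht` and the example values are illustrative assumptions:

```python
import numpy as np

def ht(v, pi):
    """Horvitz-Thompson total over one subsample."""
    return float(np.sum(np.asarray(v, float) / np.asarray(pi, float)))

def ratio_est(y1, x1, pi1, X):
    """Complete-case ratio estimator (1): (yHT1 / xHT1) * X."""
    return ht(y1, pi1) / ht(x1, pi1) * X

def diff_est(y1, x1, pi1, X):
    """Complete-case difference estimator (2): yHT1 + (X - xHT1)."""
    return ht(y1, pi1) + (X - ht(x1, pi1))

def reg_est(y1, x1, pi1, X, b):
    """Complete-case regression estimator (3): yHT1 + b (X - xHT1)."""
    return ht(y1, pi1) + b * (X - ht(x1, pi1))

# Hypothetical complete cases s1 and known auxiliary total X:
y1, x1, pi1 = [10.0, 14.0], [5.0, 6.0], [0.1, 0.2]
X = 90.0
print(ratio_est(y1, x1, pi1, X))  # approximately 170/80 * 90 = 191.25
print(diff_est(y1, x1, pi1, X))   # approximately 170 + (90 - 80) = 180
```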

All these estimators discard the information available on incomplete cases. This practice can introduce biases and errors into the estimation. For this reason, we propose the following classes of estimators, which incorporate all the available observations:

$${\hat y_{r2}^*} = \frac{\alpha_r\,{\hat y_{HT}^3}+(1-\alpha_r)\,{{\hat y_{HT}^1}}}{{\beta_r\,\hat x_{HT}^2}+(1-\beta_r)\,\hat x_{HT}^1}\,X$$
(4)
$${\hat y_{d2}^*} = {\alpha_d{\hat y_{HT}^1}+(1-\alpha_d){{\hat y_{HT}^3}}}+(X-({{\beta_d\hat x_{HT}^1}+(1-\beta_d)\hat x_{HT}^2}))$$
(5)
$${\hat y_{Reg2}^*} = {\alpha_{reg}{\hat y_{HT}^1}+(1-\alpha_{reg}){{\hat y_{HT}^3}}}+b[X-({{\beta_{reg}\hat x_{HT}^1}+(1-\beta_{reg})\hat x_{HT}^2})]$$
(6)

In the case of the regression estimator, if b is unknown, we can proceed as in the case of no nonresponse. Thus, we present two possible estimators for b:

$$\hat b_{1} = \frac{\widehat{\text{Cov}}_{i\in s_{1}}(x,y)}{\widehat{\text{Var}}_{i\in s_{1}}(x)}$$
(7)
$$\hat b_{2} = \frac{\widehat{\text{Cov}}_{i\in s_{1}}(x,y)}{\widehat{\text{Var}}_{i\in s_{1}\cup s_{2}}(x)}$$
(8)

where \(\widehat{\text{Cov}}_{i\in s_{1}}(x,y)\), \(\widehat{\text{Var}}_{i\in s_{1}}(x)\) and \(\widehat{\text{Var}}_{i\in s_{1}\cup s_{2}}(x)\) denote the covariance and variances computed from the corresponding subsamples. Using these estimates of b, we define the classes of regression estimators ŷ*Reg21 and ŷ*Reg22 by replacing b with its respective estimate.
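The two slope estimates (7) and (8) differ only in which subsample the x-variance is computed over. A sketch with unweighted sample moments (a simplification: design-weighted moments could be used instead) and hypothetical data:

```python
import numpy as np

def b_hats(x1, y1, x2):
    """Slope estimates b1 (eq. 7) and b2 (eq. 8).

    b1 uses the variance of x over the complete cases s1 only;
    b2 uses the variance of x over s1 U s2 (all units with x observed).
    The covariance in both numerators is computed over s1.
    """
    x1, y1, x2 = (np.asarray(a, float) for a in (x1, y1, x2))
    cov_s1 = np.cov(x1, y1, ddof=1)[0, 1]          # Cov_{s1}(x, y)
    b1 = cov_s1 / np.var(x1, ddof=1)               # Var_{s1}(x)
    b2 = cov_s1 / np.var(np.concatenate([x1, x2]), ddof=1)  # Var_{s1 U s2}(x)
    return b1, b2

# Hypothetical subsamples: three complete cases, two x-only cases.
b1, b2 = b_hats([5.0, 6.0, 8.0], [10.0, 14.0, 18.0], [4.0, 7.0])
```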

Note that the estimators with subindex 1 are the traditional ratio, difference and regression estimators, which are based on complete observations and ignore the incomplete pairs of observations. We propose the estimators with subindex 2, which incorporate all the available observations.

The next step is to look for the estimators with the best behaviour among the proposed classes. This choice is made by minimizing the estimator error. The expressions for the mean squared errors of the estimators are easily obtained, and minimizing these errors yields the estimators with minimum error within each class.

Thus we have:

$$\alpha_{r_{opt}}=\frac{-C_{r}+(E_{r}B_{r}-\frac{C_{r}}{A_{r}}B_r^2)/(D_{r}-B_r^2/A_r)}{A_{r}}$$
$$\beta_{r_{opt}}=\frac{-E_{r}+\frac{C_{r}}{A_{r}}B_{r}}{D_{r}-B_r^2/A_r}$$
$$\alpha_{d_{opt}}=\frac{A_{d}-\frac{C_{d}D_{d}-A_{d}B_{d}}{E_{d}C_{d}-B_d^2}B_{d}}{C_{d}}$$
$$\beta_{d_{opt}}=\frac{C_{d}D_{d}-A_{d}B_{d}}{E_{d}C_{d}-B_d^2}$$
$$\alpha_{reg_{opt}}=\frac{-C_{reg}}{A_{reg}}-\frac{B_{reg}}{A_{reg}}\;\frac{B_{reg}C_{reg}-A_{reg}E_{reg}}{A_{reg}D_{reg}-B_{reg}^2}$$
$$\beta_{reg_{opt}}=\frac{B_{reg}C_{reg}-A_{reg}E_{reg}}{A_{reg}D_{reg}-B_{reg}^2}$$

where:

$$A_r=2R^2\;\text{Var}(\hat{x}_{HT}^2)+2R^2\;\text{Var}(\hat{x}_{HT}^1)-4R^2\;\text{Cov}(\hat{x}_{HT}^2,\hat{x}_{HT}^1)$$
$$B_r=-2R\;\text{Cov}(\hat{y}_{HT}^3,\hat{x}_{HT}^2)+2R\;\text{Cov}(\hat{y}_{HT}^3,\hat{x}_{HT}^1)+2R\;\text{Cov}(\hat{y}_{HT}^1,\hat{x}_{HT}^2)-2R\;\text{Cov}(\hat{y}_{HT}^1,\hat{x}_{HT}^1)$$
$$C_r=-2R^2\;\text{Var}(\hat{x}_{HT}^1)+2R^2\;\text{Cov}(\hat{x}_{HT}^2,\hat{x}_{HT}^1)-2R\;\text{Cov}(\hat{y}_{HT}^1,\hat{x}_{HT}^2)+2R\;\text{Cov}(\hat{y}_{HT}^1,\hat{x}_{HT}^1)$$
$$D_r=2\;\text{Var}(\hat{y}_{HT}^3)+2\;\text{Var}(\hat{y}_{HT}^1)-4\;\text{Cov}(\hat{y}_{HT}^3,\hat{y}_{HT}^1)$$
$$E_r=-2\;\text{Var}(\hat{y}_{HT}^1)+2\;\text{Cov}(\hat{y}_{HT}^3,\hat{y}_{HT}^1)-2R\;\text{Cov}(\hat{y}_{HT}^3,\hat{x}_{HT}^1)+2R\;\text{Cov}(\hat{y}_{HT}^1,\hat{x}_{HT}^1)$$
$$A_d=\text{Var}(\hat{y}_{HT}^3)-\;\text{Cov}(\hat{y}_{HT}^1,\hat{y}_{HT}^3)+\;\text{Cov}(\hat{y}_{HT}^1,\hat{x}_{HT}^2)-\;\text{Cov}(\hat{y}_{HT}^3,\hat{x}_{HT}^2)$$
$$B_d=-\text{Cov}(\hat{y}_{HT}^1,\hat{x}_{HT}^1)+\;\text{Cov}(\hat{y}_{HT}^1,\hat{x}_{HT}^2)+\;\text{Cov}(\hat{y}_{HT}^3,\hat{x}_{HT}^1)-\;\text{Cov}(\hat{y}_{HT}^3,\hat{x}_{HT}^2)$$
$$C_d=\text{Var}(\hat{y}_{HT}^1)\;+\;\text{Var}(\hat{y}_{HT}^3)-2\;\text{Cov}(\hat{y}_{HT}^1,\hat{y}_{HT}^3)$$
$$D_d=\text{Var}(\hat{x}_{HT}^2)\;-\;\text{Cov}(\hat{x}_{HT}^1,\hat{x}_{HT}^2)\;+\;\text{Cov}(\hat{y}_{HT}^3,\hat{x}_{HT}^1)-\text{Cov}(\hat{y}_{HT}^3,\hat{x}_{HT}^2)$$
$$E_d=\text{Var}(\hat{x}_{HT}^2)\;+\;\text{Var}(\hat{x}_{HT}^1)\;-\;2\;\text{Cov}(\hat{x}_{HT}^1,\hat{x}_{HT}^2)$$
$$A_{reg}=2\;\text{Var}(\hat{y}_{HT}^1)\;+\;2\;\text{Var}(\hat{y}_{HT}^3)\;-\;4\;\text{Cov}(\hat{y}_{HT}^1,\hat{y}_{HT}^3)$$
$$B_{reg}=2b[-\text{Cov}(\hat{y}_{HT}^1,\hat{x}_{HT}^1)\;+\;\text{Cov}(\hat{y}_{HT}^1,\hat{x}_{HT}^2)+\text{Cov}(\hat{y}_{HT}^3,\hat{x}_{HT}^1)-\text{Cov}(\hat{y}_{HT}^3,\hat{x}_{HT}^2)]$$
$$C_{reg}=-2\;\text{Var}(\hat{y}_{HT}^3)\;+\;2\;\text{Cov}(\hat{y}_{HT}^1,\hat{y}_{HT}^3)+2b[-\text{Cov}(\hat{y}_{HT}^1,\hat{x}_{HT}^2)+\text{Cov}(\hat{y}_{HT}^3,\hat{x}_{HT}^2)]$$
$$D_{reg}=b^2\;[2\;\text{Var}(\hat{x}_{HT}^1)\;+\;2\;\text{Var}(\hat{x}_{HT}^2)-4\;\text{Cov}(\hat{x}_{HT}^1,\hat{x}_{HT}^2)]$$
$$E_{reg}=-2b^2\;\text{Var}(\hat{x}_{HT}^2)\;+\;2b^2\;\text{Cov}(\hat{x}_{HT}^1,\hat{x}_{HT}^2)-2b\;\text{Cov}(\hat{y}_{HT}^3,\hat{x}_{HT}^1)+2b\;\text{Cov}(\hat{y}_{HT}^3,\hat{x}_{HT}^2)$$

The expressions of these variances and covariances for the case of simple random sampling without replacement and for the case of stratified sampling can be seen in the Appendix.

Unfortunately, these optimum values depend on theoretical variances and covariances among the Horvitz-Thompson estimators, which are generally unknown, so the optimal estimators cannot be used directly. However, these quantities can be estimated once the sample is drawn, for example by replication methods; see, e.g., Wolter (1985).
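As a rough illustration of such replication estimates, the sketch below bootstraps the variance of a subsample Horvitz–Thompson total by resampling units with replacement. This is only a naive sketch on hypothetical data: a design-consistent replication scheme (jackknife, balanced repeated replication) with proper rescaling would be used in practice, as in Wolter (1985):

```python
import numpy as np

rng = np.random.default_rng(0)

def ht(v, pi):
    """Horvitz-Thompson total over one subsample."""
    return float(np.sum(np.asarray(v, float) / np.asarray(pi, float)))

def bootstrap_var_ht(values, pi, B=2000):
    """Naive with-replacement bootstrap estimate of Var(HT total).

    Each replicate resamples the subsample's units with replacement and
    recomputes the HT total; the variance of the B replicates is returned.
    """
    values = np.asarray(values, float)
    pi = np.asarray(pi, float)
    n = len(values)
    reps = np.empty(B)
    for b in range(B):
        idx = rng.integers(0, n, n)      # resample unit indices with replacement
        reps[b] = ht(values[idx], pi[idx])
    return float(np.var(reps, ddof=1))

# Hypothetical y-values and inclusion probabilities:
v = bootstrap_var_ht([10.0, 14.0, 12.0], [0.1, 0.2, 0.2], B=500)
```

Pairwise covariances between two HT totals can be estimated the same way by recording both totals in each replicate.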

In the absence of good a priori knowledge of these quantities, we replace the optimal α- and β-values in (4), (5) and (6) by sample-based estimates, thus obtaining the following estimators, which can be evaluated from the sample obtained:

$$\hat{y}_{r2}=\frac{\widehat{\alpha}_{r}\,\hat{y}_{HT}^3+(1-\widehat{\alpha}_{r})\,\hat{y}_{HT}^1}{\widehat{\beta}_{r}\,\hat{x}_{HT}^2+(1-\widehat{\beta}_{r})\,\hat{x}_{HT}^1}\,X$$
(9)
$$\widehat{y}_{d2}=\widehat{\alpha}_{d}\widehat{y}_{HT}^1+(1-\widehat{\alpha}_{d})\widehat{y}_{HT}^3\;+\;(X-(\widehat{\beta}_{d}\widehat{x}_{HT}^1\;+\;(1-\widehat{\beta}_{d})\widehat{x}_{HT}^2))$$
(10)
$$\hat{y}_{Reg2}=\widehat{\alpha}_{reg}\hat{y}_{HT}^1+(1-\widehat{\alpha}_{reg})\hat{y}_{HT}^3+b[X-(\widehat{\beta}_{reg}\hat{x}_{HT}^1\;+\;(1-\widehat{\beta}_{reg})\hat{x}_{HT}^2)]$$
(11)
$$\hat{y}_{Reg21}=\widehat{\alpha}_{reg}\hat{y}_{HT}^1+(1-\widehat{\alpha}_{reg})\hat{y}_{HT}^3+\hat{b}_1\;[X-(\widehat{\beta}_{reg}\hat{x}_{HT}^1\;+\;(1-\widehat{\beta}_{reg})\hat{x}_{HT}^2)]$$
(12)
$$\hat{y}_{Reg22}=\widehat{\alpha}_{reg}\hat{y}_{HT}^1+(1-\widehat{\alpha}_{reg})\hat{y}_{HT}^3+\hat{b}_2\;[X-(\widehat{\beta}_{reg}\hat{x}_{HT}^1\;+\;(1-\widehat{\beta}_{reg})\hat{x}_{HT}^2)]$$
(13)

These estimators do not coincide with the theoretical estimators in expressions (4), (5) and (6), since they involve estimated parameters. Randles (1982) derived the limiting distribution of such statistics. Following his notation, we denote the estimator $\hat{y}_{d2}$ by $T_n(\hat{\lambda})$, with $\hat{\lambda} = (\hat{\alpha}_d, \hat{\beta}_d)$, and replace $\hat{\lambda}$ in $T_n(\cdot)$ by a variable $\varsigma$. We then calculate the limit of the expectation of the statistic $T_n(\varsigma)$ when the true value of the parameter is $\lambda = (\alpha_d, \beta_d)$:

$$\mu(\varsigma)\;=\;\lim\;E_\lambda\;(T_n(\varsigma))=Y$$

where Eλ denotes the expectation with respect to the design.

Since μ(·) has partial derivatives equal to zero at ϛ = λ, it now follows from Randles (1982) that Tn(̂λ) and Tn(λ) have the same limiting distribution; that is, ŷd2 has the same limiting distribution as ŷ*d2 with αd_opt and βd_opt, and it is reasonable to assume that the sampling errors will be close to the theoretical ones for large samples.

Finally, note that the usual estimators are included in the proposed classes of estimators, and so the estimators obtained by minimizing the errors in these classes will be better, in the sense of mean square error, than the traditional ones.

3 Simulation study

This section examines estimator properties by means of a simulation study.

The populations considered can be divided into two groups: natural populations and simulated populations.

The fam1500 population consists of 1500 families in Andalusia (Spain) taken from Fernández and Mayor (1994). The variable of interest, y, denotes family income and the auxiliary x denotes expenditure on food and drink.

The second class includes three simulated populations used by Meeden (1995). For the simulation, a superpopulation model is considered in which it is assumed that, for each i, $y_i = b x_i + u_i e_i$, where the $e_i$ are independent identically distributed random variables with zero expectation.

In the first population, sim1, the xi’s form a random sample from a gamma distribution with a shape parameter of twenty and a scale parameter of one.

In the second population, sim2, the auxiliary variable is a random sample from a log-normal population with mean and standard deviation 4.9 and 0.586 respectively.

In sim3 the auxiliary variable is fifty plus a random sample from the standard exponential distribution.

All the simulated populations contain 500 units.
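The three auxiliary variables and the superpopulation model can be sketched as follows. The constants `b_model` and `u`, the normality of the $e_i$, and reading 4.9 and 0.586 as the log-scale parameters of the log-normal are all assumptions for illustration, since the text does not pin them down:

```python
import numpy as np

rng = np.random.default_rng(42)
N = 500
b_model, u = 2.0, 1.0   # hypothetical constants; only y_i = b x_i + u_i e_i is stated

# Auxiliary variable for each simulated population:
x_sim1 = rng.gamma(shape=20.0, scale=1.0, size=N)       # sim1
x_sim2 = rng.lognormal(mean=4.9, sigma=0.586, size=N)   # sim2 (log-scale parameters assumed)
x_sim3 = 50.0 + rng.exponential(scale=1.0, size=N)      # sim3

# Study variable from the superpopulation model, e_i i.i.d. with zero mean
# (standard normal assumed here):
e = rng.normal(0.0, 1.0, size=N)
y_sim1 = b_model * x_sim1 + u * e
```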

The following algorithm is applied to each population for several sample sizes. Specifically, sample sizes of 25, 50, 75 and 100 were taken for the simulated populations, and 50, 100, 150 and 200 for the fam1500 population, due to its larger size.

Algorithm

  • step 1: Take a sample of size n according to the procedure of simple random sampling without replacement.

  • step 2: Set the missingness rates, p and q.

  • step 3: Randomly delete the study-characteristic values of p sampled elements and the auxiliary-characteristic values of q sampled elements.

  • step 4: Define the subsamples s1, s2 and s3.

  • step 5: Calculate: ŷr1, ŷr2, ŷReg1, ŷReg2, ŷReg11, ŷReg21, ŷReg12, ŷReg22, ŷd1, ŷd2

  • step 6: Use the values obtained in 1000 replications to calculate the mean squared errors of the estimators.

  • step 7: Normalize these mean squared errors by dividing them by the mean squared error of the simple estimator, and then take the logarithms of these ratios.
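The steps above can be sketched for one estimator pair under simple random sampling without replacement (πi = n/N). Here the complete-case difference estimator ŷd1 of equation (2) is compared against a simple expansion estimator N·mean(available y); the exact form of the paper's simple estimator, and the test population, are assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)

def log_mse_ratio(x, y, n, p, q, reps=1000):
    """Steps 1-7 of the algorithm for yd1 versus a simple estimator."""
    N = len(y)
    X, Y = float(np.sum(x)), float(np.sum(y))
    e_simple, e_d1 = [], []
    for _ in range(reps):
        s = rng.choice(N, size=n, replace=False)   # step 1: SRSWOR
        order = rng.permutation(n)                 # steps 2-3: random deletion
        s2 = s[order[:p]]                          # y missing, x observed
        s3 = s[order[p:p + q]]                     # x missing, y observed
        s1 = s[order[p + q:]]                      # step 4: complete cases
        # step 5 (two of the ten estimators):
        y_simple = N * np.concatenate([y[s1], y[s3]]).mean()
        y_d1 = (N / n) * y[s1].sum() + (X - (N / n) * x[s1].sum())
        e_simple.append((y_simple - Y) ** 2)
        e_d1.append((y_d1 - Y) ** 2)
    # steps 6-7: MSEs over the replications, normalization, log ratio
    return float(np.log(np.mean(e_d1) / np.mean(e_simple)))

# Hypothetical population resembling sim3:
x = 50.0 + rng.exponential(1.0, 500)
y = 2.0 * x + rng.normal(0.0, 1.0, 500)
r = log_mse_ratio(x, y, n=50, p=16, q=20)   # p = 0.32n, q = 0.4n
```

A negative value of `r` would indicate that ŷd1 outperforms the simple estimator for this population; the full study repeats this for all ten estimators.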

Results of the application of this algorithm for some values of p and q can be seen in figures 1, 2 and 3.

Figure 1

Log ratios of standard errors comparing the cases-available estimators and the complete-data estimators against the simple estimator, p = 0.32n, q = 0.4n. The dotted curve corresponds to the complete-data estimator and the dashed curve to the cases-available estimator.

Figure 2

Log ratios of standard errors comparing the cases-available estimators and the complete-data estimators against the simple estimator, p = 0.32n, q = 0.48n. The dotted curve corresponds to the complete-data estimator and the dashed curve to the cases-available estimator.

Figure 3

Log ratios of standard errors comparing the cases-available estimators and the complete-data estimators against the simple estimator, p = 0.4n, q = 0.48n. The dotted curve corresponds to the complete-data estimator and the dashed curve to the cases-available estimator.

Each figure plots the log ratios of the standard errors of the estimators considered. The dashed curves correspond to the proposed estimators and the dotted curves to the estimators based on complete observations. The central horizontal line corresponds to the simple estimator.

It is worth noting that the missingness rates were chosen so that p and q were integers for all sample sizes.

In the fam1500 population, all the estimators based on the available cases present a smaller error than the respective estimators based on the complete data. The latter, in general, perform no better than the simple estimator, which makes no use of auxiliary data, whereas all the estimators proposed in this paper present a smaller error than the simple estimator used as the baseline.

A similar pattern was observed in the artificial populations sim2 and sim3. The estimators based on the available cases always improved considerably on the results provided by those based on the complete data, and were nearly always better than the simple estimator.

A noteworthy feature is that in the sim1 population the results obtained with the difference estimator based on the complete cases were very poor (the error was more than twice that obtained with the baseline estimator). Nevertheless, the error of the proposed estimator ŷd2 was only a fifth of that of the ŷd1 estimator, and less than half that of the direct estimator ŷ, for all the sample sizes considered. In this population, the ratio and regression estimators, based both on the complete data and on the available cases, considerably improved on the precision of the direct estimator, while the error reduction between the two versions was less marked than in the other populations.

The relative behaviour of the estimators ŷReg21 and ŷReg22 is unclear: depending on the population and on the sample size considered, one has a smaller error than the other. As expected, the best behaviour is presented by the regression estimator based on the true value of b.

It has also been observed that, as expected, the gain in precision of the proposed estimators grows as the total missingness rate \(\frac{p+q}{n}\) increases.

The simulations were repeated interchanging the values of p and q, and the results obtained were very similar.

To sum up, these simulations show how the use of all the available data by the proposed estimators leads to a considerable reduction in error in the estimation of totals, with respect to the estimators usually applied. This error reduction can be very large in certain cases, such as estimation by differences, which often performs unsatisfactorily. Moreover, there is a direct relation between the error reduction and the missingness rate.