Estimation of Population Mean Using Imputation Methods for Missing Data Under Two-Phase Sampling Design

Singh, G. N.; Suman, S.

doi:10.1007/s42519-018-0016-5

Original Article
Published: 05 November 2018

Volume 13, article number 19, (2019)

Journal of Statistical Theory and Practice

Abstract

This manuscript addresses the estimation of the population mean in two-phase sampling when non-response occurs in both phases of the sample. To cope with the missing data, new imputation methods that use information on two auxiliary variables are suggested for estimating the population mean. The properties of the resultant estimators are studied, followed by empirical and simulation studies on real and artificial data sets that support the suggested imputation methods. The results are analyzed, and suitable recommendations are made to survey practitioners.

1 Introduction

Missing data are among the most frequently occurring features of sample surveys, and recognizing their stochastic nature is essential for choosing an appropriate methodology for handling such data sets. Failure to recognize this nature may distort inferences about population characteristics or parameters; careful treatment of data sets with missing values is therefore needed. A fundamental question in this regard is which assumptions justify ignoring the mechanism that causes the missing data. Rubin [1] addressed this question by establishing ignorability conditions under the classical and Bayesian approaches to statistical inference. Subsequently, [2, 3] generalized the model of [1] to include other forms of incompleteness. Initially, [1] introduced three key concepts related to the missing pattern of data sets: missing at random (MAR), observed at random (OAR) and parameter distribution (PD). He stated: “The data are MAR if the probability of the observed missingness pattern, given the observed and unobserved data, does not depend on the values of the unobserved data. The data are OAR, if for every possible value of the missing data, the probability of the observed missingness pattern, given the observed and unobserved data, does not depend on the values of observed data.” The combination of MAR and OAR was later called missing completely at random (MCAR). Heitjan and Basu [4] distinguished the MAR and MCAR mechanisms through a series of examples. Based on these works, the missing-data mechanism of a data set is identified, and inference about population parameters is made using strategies suited to that mechanism. These strategies are termed “imputation methods.” Imputation is the procedure of replacing missing data with substituted values. A large body of work is based on imputation methods, such as [5,6,7,8,9,10,11,12,13,14,15,16,17].

Information on an auxiliary variable may be used at the planning, design, survey or estimation stage to improve the precision of the estimates. When information on an auxiliary variable correlated with the study variable is readily available, ratio, regression and their transformed and improved versions are widely used to obtain efficient estimates, provided that the population mean of the auxiliary variable is known. However, the population mean of the auxiliary variable is not always known. In such circumstances, two-phase (double) sampling is widely used to obtain a reliable estimate of the unknown population mean of the auxiliary variable. Missing data in a survey conducted under a two-phase sampling design compel researchers to apply imputation methods in order to draw trustworthy conclusions about population parameters. Several researchers, such as [18,19,20,21], have suggested imputation methods to compensate for missing data under the assumption that complete response may not be available on the study variable as well as on the auxiliary variable in the second-phase sample. It is worth mentioning that very limited attention has been paid to situations in which complete response is not available in the first-phase sample either.

Following the above arguments and motivated by the work of [9], we propose some effective imputation methods under the missing completely at random (MCAR) response mechanism, which lead to point estimators of the population mean of the study variable in a two-phase sampling design. The properties of the proposed estimators are discussed. Empirical and simulation studies are carried out to validate the suggested imputation methods and the resultant estimators. Suitable recommendations are made to survey practitioners for real-life applications.

2 Sampling Design and Notations

Let $P =(P_1,P_2, \ldots , P_N)$ be a finite population of size N on which the triplet of characters (y, x, z) is defined. It is assumed that y is the study variable and that x and z are the first and second auxiliary variables, respectively, such that y is positively correlated with both x and z but is more weakly correlated with z than with x. When the population mean ${\bar{X}}$ of the first auxiliary variable is not known but information on the second auxiliary variable z is available for all the units of the population, the following two-phase sampling scheme is used for making inference about the population parameters.

Let $s^{\prime }$ be the first-phase sample of size $n^{\prime }$ drawn using simple random sampling without replacement (SRSWOR) scheme from the population and surveyed for the auxiliary variable x to estimate its population mean ${\bar{X}}$. The second-phase sample of size $n < n^{\prime }$ is drawn to measure the study characteristic y under the following design:

Design I The second-phase sample s is drawn from the first-phase sample $s^{\prime }$
Design II The second-phase sample s is independently drawn from the entire population.

We have assumed that non-response occurs in the first- and second-phase samples where $r^{\prime }$ and r are the number of responding units in the first- and second-phase samples of sizes $n^{\prime }$ and n, respectively. The corresponding sets of responding units are denoted by ($R_1$ and $R_2$) and the sets of non-responding units by ($R^{c}_1$ and $R^{c}_2$), respectively. We have also assumed that sample units in the second-phase sample s have been drawn from the responding set $R_1$.

3 Proposed Methods of Imputation and Subsequent Estimators

In this section, using the compromised method of imputation in the first-phase sample, we have proposed some new compromised imputation methods under MCAR response mechanism in the second-phase sample for missing data on the study variable y. The proposed imputation methods and resultant estimators are given below:

3.1 Imputation for Missing Data in the First-Phase Sample

To compensate for the missing values of the auxiliary variable x in the first-phase sample, we consider the ratio method of imputation; after imputation, the sample data on x take the following form:

$$x_{.i} = {\left\{ \begin{array}{ll} \dfrac{\alpha n' x_i }{r'} +(1-\alpha ){\hat{b}}^{\prime }z_i&{} {\text {if}}\, i\in R_{1}\\ (1-\alpha ) {\hat{b}}^{\prime }z_i&{} {\text {if}}\,i\in R_{1}^{c} \end{array}\right. }$$

(1)

where ${\hat{b}}^{\prime }= \dfrac{\sum _{i=1}^{r^{\prime }} {x_i}}{\sum _{i=1}^{r^{\prime }} {z_i}}$ and $\alpha$ is an unknown constant. Under the imputation method described in Eq. (1), the point estimator of the population mean ${\bar{X}}$ in the first-phase sample is derived as

$${\bar{x}}^{\prime }= \dfrac{1}{n^{\prime }}\left\{ \sum _{i\in R_1}{x_{.i}}+\sum _{i\in R^{c}_1}{x_{.i}} \right\}$$

which, since ${\hat{b}}^{\prime }={\bar{x}}_{r^{\prime }}/{\bar{z}}_{r^{\prime }}$, simplifies to

$${\bar{x}}^{\prime }=\alpha {\bar{x}}_{r^{\prime }} + (1-\alpha ) {\bar{x}}_{r^{\prime }}\dfrac{{\bar{z}}_{n^{\prime }}}{{\bar{z}}_{r^{\prime }}}$$

(2)

where ${\bar{x}}_{r^{\prime }}= \dfrac{\sum _{i\in R_1}{x_{i}}}{r^{\prime }}$, ${\bar{z}}_{r^{\prime }}= \dfrac{\sum _{i\in R_1}{z_{i}}}{r^{\prime }}$ and ${\bar{z}}_{n^{\prime }}= \dfrac{\sum _{i=1}^{n'}{z_{i}}}{n^{\prime }}$.
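The step from Eq. (1) to Eq. (2) can be checked numerically. The following is a minimal sketch, assuming toy data and a hypothetical value $\alpha = 0.4$ (neither is from the paper); the function names are illustrative. It imputes the first-phase values as in Eq. (1), averages them over all $n^{\prime }$ units, and compares the result with the closed form of Eq. (2).

```python
def impute_first_phase(x, z, alpha):
    """Fill in x following Eq. (1); x[i] is None for non-respondents, z is complete."""
    n1 = len(z)
    resp = [i for i in range(n1) if x[i] is not None]
    r1 = len(resp)
    # Ratio coefficient b' = (sum of x over respondents) / (sum of z over respondents)
    b_hat = sum(x[i] for i in resp) / sum(z[i] for i in resp)
    imputed = []
    for i in range(n1):
        ratio_part = (1 - alpha) * b_hat * z[i]
        if x[i] is not None:
            imputed.append(alpha * n1 * x[i] / r1 + ratio_part)
        else:
            imputed.append(ratio_part)
    return imputed


def first_phase_estimator(x, z, alpha):
    """Closed form of Eq. (2)."""
    resp = [i for i in range(len(z)) if x[i] is not None]
    xbar_r = sum(x[i] for i in resp) / len(resp)
    zbar_r = sum(z[i] for i in resp) / len(resp)
    zbar_n = sum(z) / len(z)
    return alpha * xbar_r + (1 - alpha) * xbar_r * zbar_n / zbar_r


# Toy first-phase sample (illustrative values only)
x = [12.0, None, 15.0, 9.0, None, 11.0]
z = [6.0, 7.0, 8.0, 5.0, 6.5, 5.5]
alpha = 0.4

mean_of_imputed = sum(impute_first_phase(x, z, alpha)) / len(z)
closed_form = first_phase_estimator(x, z, alpha)
print(mean_of_imputed, closed_form)
```

The two printed values agree up to floating-point rounding, which reflects that Eq. (2) is the sample mean of the imputed data in Eq. (1).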

3.2 Imputation for Missing Data in the Second-Phase Sample

To derive the reliable substitutes for missing values in the second-phase sample, we suggest two new compromised imputation methods which are presented below:

First Imputation Method Under this method of imputation, sample data take the following forms

$$y_{.i} = {\left\{ \begin{array}{ll} \dfrac{\alpha _1 n y_i c}{r} +(1-\alpha _1){\hat{b}}z_i c &{}{\text {if}}\,i\in R_{2} \\ (1-\alpha _1)c {\hat{b}}z_i&{}{\text {if}}\,i\in R_{2}^{c} \end{array}\right. }$$

(3)

where $c=\dfrac{1}{{\bar{x}}_n} \left\{ \alpha {\bar{x}}_{r^{\prime }} + (1-\alpha ) {\bar{x}}_{r ^{\prime }}\dfrac{{\bar{z}}_{n^{\prime }}}{{\bar{z}}_{r^{\prime }}} \right\}$, ${\hat{b}}= \dfrac{\sum _{i=1}^{r}{y_i}}{\sum _{i=1}^{r} {z_i}}$ and $\alpha _1$ is a suitably chosen constant such that the mean square error of the resultant estimator is minimum.

Under the imputation method described in Eq. (3), the point estimator of the population mean ${\bar{Y}}$ takes the following form

$$\zeta _{1}=\dfrac{\left\{ {\alpha _{1} {\bar{y}}_{r}}+(1-\alpha _{1}) {\bar{y}}_{r} \frac{{\bar{z}}_{n}}{{\bar{z}}_{r}} \right\} }{{\bar{x}}_n} \left\{ {\alpha {\bar{x}}_{r^{\prime }}+(1-\alpha ) {\bar{x}}_{r^{\prime }} \frac{{\bar{z}}_{n^{\prime }}}{{\bar{z}}_{r^{\prime }}}} \right\}.$$

(4)
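As a sanity check on Eq. (4), the sketch below imputes the second-phase data as in Eq. (3) and confirms that the sample mean of the imputed values equals $\zeta _1$. The data, the value $\alpha _1 = 0.6$ and the first-phase estimate 11.2 are hypothetical, and x is assumed to be observed on every second-phase unit, in line with the assumption that s is drawn from the responding set $R_1$.

```python
def zeta1_via_imputation(y, x, z, alpha1, xbar_first_phase):
    """Average of the imputed data of Eq. (3); y[i] is None for non-respondents."""
    n = len(z)
    resp = [i for i in range(n) if y[i] is not None]
    r = len(resp)
    c = xbar_first_phase / (sum(x) / n)  # c = (first-phase estimate of X-bar) / x-bar_n
    b_hat = sum(y[i] for i in resp) / sum(z[i] for i in resp)
    imputed = []
    for i in range(n):
        ratio_part = (1 - alpha1) * c * b_hat * z[i]
        if y[i] is not None:
            imputed.append(alpha1 * n * y[i] * c / r + ratio_part)
        else:
            imputed.append(ratio_part)
    return sum(imputed) / n


def zeta1_closed_form(y, x, z, alpha1, xbar_first_phase):
    """Closed form of Eq. (4)."""
    n = len(z)
    resp = [i for i in range(n) if y[i] is not None]
    ybar_r = sum(y[i] for i in resp) / len(resp)
    zbar_r = sum(z[i] for i in resp) / len(resp)
    zbar_n = sum(z) / n
    xbar_n = sum(x) / n
    compromise = alpha1 * ybar_r + (1 - alpha1) * ybar_r * zbar_n / zbar_r
    return compromise / xbar_n * xbar_first_phase


# Toy second-phase sample (illustrative values only)
y = [20.0, None, 26.0, 17.0, None, 21.0]
x = [10.0, 11.0, 13.0, 8.0, 12.0, 10.5]
z = [6.0, 7.0, 8.0, 5.0, 6.5, 5.5]

via_imputation = zeta1_via_imputation(y, x, z, alpha1=0.6, xbar_first_phase=11.2)
closed_form = zeta1_closed_form(y, x, z, alpha1=0.6, xbar_first_phase=11.2)
print(via_imputation, closed_form)
```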

Second Imputation Method Under this method of imputation, sample data take the following forms

$$y_{.i} = {\left\{ \begin{array}{ll} \dfrac{\alpha _2 n y_i }{r} +(1-\alpha _2){\hat{b}}z_i &{}\text {if}\,i\in R_{2} \\ (1-\alpha _2){\hat{b}}z_i + \dfrac{n}{n-r} {\hat{b}}_{yx}(r)\left\{ {\alpha {\bar{x}}_{r^{\prime }}+(1-\alpha ) {\bar{x}}_{r^{\prime }} \frac{{\bar{z}}_{n^{\prime }}}{{\bar{z}}_{r^{\prime }}}} -{\bar{x}}_{n}\right\} &{}\text {if}\,i\in R_{2}^{c} \end{array}\right. }$$

(5)

where ${\hat{b}}_{yx}(r)= \dfrac{s_{yx}(r)}{s^2_x(r)}$ is the sample regression coefficient computed from the r responding units and $\alpha _2$ is a suitably chosen constant such that the mean square error of the resultant estimator is minimum.

Under the imputation method described in Eq. (5), the point estimator of the population mean ${\bar{Y}}$ takes the following form

$$\zeta _{2}=\left\{ \alpha _2 {\bar{y}}_{r}+(1-\alpha _2) {\bar{y}}_{r} \frac{{\bar{z}}_{n}}{{\bar{z}}_{r}} \right\} + {{\hat{b}}}_{yx}(r)\left\{ \ {\alpha {\bar{x}}_{r^{\prime }}+(1-\alpha ) {\bar{x}}_{r^{\prime }} \frac{{\bar{z}}_{n^{\prime }}}{{\bar{z}}_{r^{\prime }}}} -{\bar{x}}_{n}\right\}.$$

(6)
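The closed form in Eq. (6) can be evaluated directly from the sample quantities. The sketch below is one possible implementation under stated assumptions: the data are toy values, x is taken as observed on all second-phase units, and ${\hat{b}}_{yx}(r)$ is computed from the responding units with the usual $(r-1)$ divisor. Function names are illustrative.

```python
def sample_cov(a, b):
    """Unbiased sample covariance; sample_cov(a, a) is the sample variance."""
    m = len(a)
    mean_a, mean_b = sum(a) / m, sum(b) / m
    return sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b)) / (m - 1)


def zeta2(y, x, z, alpha2, xbar_first_phase):
    """Closed form of Eq. (6); y[i] is None for second-phase non-respondents."""
    n = len(z)
    resp = [i for i in range(n) if y[i] is not None]
    y_r = [y[i] for i in resp]
    x_r = [x[i] for i in resp]
    ybar_r = sum(y_r) / len(resp)
    zbar_r = sum(z[i] for i in resp) / len(resp)
    zbar_n = sum(z) / n
    xbar_n = sum(x) / n
    b_yx = sample_cov(y_r, x_r) / sample_cov(x_r, x_r)
    compromise = alpha2 * ybar_r + (1 - alpha2) * ybar_r * zbar_n / zbar_r
    return compromise + b_yx * (xbar_first_phase - xbar_n)


# Toy second-phase sample (illustrative values only)
y = [20.0, None, 26.0, 17.0, None, 21.0, 24.0]
x = [10.0, 11.0, 13.0, 8.0, 12.0, 10.5, 12.5]
z = [6.0, 7.0, 8.0, 5.0, 6.5, 5.5, 7.5]

estimate = zeta2(y, x, z, alpha2=0.5, xbar_first_phase=11.4)

# Special case: alpha2 = 1 and a first-phase estimate equal to x-bar_n
# reduce Eq. (6) to the respondent mean of y.
observed = [v for v in y if v is not None]
ybar_r = sum(observed) / len(observed)
check = zeta2(y, x, z, alpha2=1.0, xbar_first_phase=sum(x) / len(x))
print(estimate, check, ybar_r)
```

Setting $\alpha _2 = 1$ turns the first bracket into ${\bar{y}}_r$, so $\zeta _2$ then has the form of a regression-type estimator with the first-phase estimate of ${\bar{X}}$ in place of the unknown population mean.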

4 Properties of Estimators $\zeta _{1}$ and $\zeta _{2}$

The properties of the proposed estimators $\zeta _{1}$ and $\zeta _{2}$ are explored under the two types of two-phase sampling design and the MCAR response mechanism. Large sample approximations are used to obtain the expressions of the biases and mean square errors of the proposed estimators, using the following transformations:

$$\begin{aligned}&{\bar{y}}_{r}= {\bar{Y}} \left( 1+e_{0}\right) , {\bar{x}}_{r}= {\bar{X}} \left( 1+e_{1}\right) , {\bar{x}}_{r^{\prime }}= {\bar{X}} \left( 1+e^{\prime }_{1}\right) , {\bar{x}}_{n}= {\bar{X}} (1+e_{2}), {\bar{x}}_{n^{\prime }}= {\bar{X}} (1+e^{\prime }_{2}),\\&{\bar{z}}_{r}= {\bar{Z}} (1+e_{3}), {\bar{z}}_{r^{\prime }}= {\bar{Z}} (1+e^{\prime }_{3}), {\bar{z}}_{n}= {\bar{Z}} (1+e_{4}), {\bar{z}}_{n^{\prime }}= {\bar{Z}} (1+e^{\prime }_{4}), s_{yx}(r)=S_{YX}(1+e_{5}), s^2_{x}(r)=S^2_{X}(1+e_{6}),\\&\quad {\text {such that }}\, E(e^{\prime }_{i})= E(e_{i})=0, |e^{\prime }_{i} |< 1 \,{\text {and}}\,|e_{i} |< 1 \;\forall \, i=0,1,2,3,4,5,6. \end{aligned}$$

Under the above transformations, the estimators $\zeta _{1}$ and $\zeta _{2}$ take the following forms:

$$\zeta _{1}={\bar{Y}}\left\{ \alpha _1(1+e_{0})+(1-\alpha _1) (1+e_{0}) \dfrac{1+e_{4}}{1+e_{3}} \right\} \dfrac{(1+e^{\prime }_{1})}{(1+e_{2})} \left\{ \alpha +(1-\alpha ) \dfrac{(1+e^{\prime }_{4})}{(1+e^{\prime }_{3})} \right\}$$

(7)

and

$$\begin{aligned}&\zeta _{2}={\bar{Y}} (1+e_{0})\left\{ \alpha _2+(1-\alpha _2)\frac{(1+e_{4})}{(1+e_{3})} \right\} \nonumber \\&\qquad \qquad + \beta _{YX}\dfrac{(1+e_{5})}{(1+e_{6})} {\bar{X}} \left[ (1+ e^{\prime }_{1}) \left\{ \alpha +(1-\alpha ) \dfrac{(1+e^{\prime }_4)}{(1+e^{\prime }_3)} \right\} - (1+e_2)\right] \end{aligned}$$

(8)

where $\beta _{YX} = \dfrac{S_{YX}}{S^2_X}.$

4.1 Biases and Mean Square Errors of Estimators $\zeta _{1}$ and $\zeta _{2}$

Let $B(.)_{d}$ and ${\text {MSE}}(.)_{d}$ be the bias and mean square error, respectively, of an estimator under a given two-phase sampling design $d (=I,II)$.

Theorem 4.1

The biases of the estimators $\zeta _{1}$ and $\zeta _{2}$ are given by

$$B(\zeta _{1})_{I} ={\bar{Y}} \left[ \delta _2(C^2_X - \rho _{YX}C_Y C_X)+\lbrace \delta _3 (1-\alpha _1) + \delta _4 (1- \alpha ) \rbrace (C^2_Z-\rho _{YZ}C_Y C_Z) \right]$$

(9)

$$\begin{aligned}&B(\zeta _{1})_{II} ={\bar{Y}} \Bigg [ f_1 (C^2_X - \rho _{YX}C_Y C_X)+ \delta _3 (1-\alpha _1) (C^2_Z-\rho _{YZ}C_Y C_Z) \nonumber \\&\qquad\qquad \qquad + \delta _4 (1- \alpha ) (C^2_Z-\rho _{XZ}C_X C_Z) \Bigg ] \end{aligned}$$

(10)

$$\begin{aligned}&B(\zeta _{2})_{I} ={\bar{Y}} \delta _3 (1-\alpha _2) (C^2_Z-\rho _{YZ}C_Y C_Z)+ \beta _{YX} {\bar{X}} \left[ \delta _4 (1-\alpha ) (C^2_Z-\rho _{XZ}C_X C_Z) \right] \nonumber \\&\qquad \qquad + \beta _{YX} {\bar{X}} \left[ \dfrac{\delta _2}{{\bar{X}}} \left( \dfrac{\mu _{030}}{\mu _{020}}- \dfrac{\mu _{120}}{\mu _{110}} \right) + \dfrac{(1- \alpha ) (\delta _4)}{{\bar{Z}}} \left( \dfrac{\mu _{021}}{\mu _{020}}- \dfrac{\mu _{111}}{\mu _{110}} \right) \right] \end{aligned}$$

(11)

$$\begin{aligned}&B(\zeta _{2})_{II} ={\bar{Y}} \delta _3 (1-\alpha _2) (C^2_Z-\rho _{YZ}C_Y C_Z)+ \beta _{YX} {\bar{X}} \left[ \delta _4 (1-\alpha ) (C^2_Z-\rho _{XZ}C_X C_Z) \right] \nonumber \\&\qquad \qquad + \beta _{YX} {\bar{X}} \dfrac{f_1}{{\bar{X}}} \left( \dfrac{\mu _{120}}{\mu _{110}}- \dfrac{\mu _{030}}{\mu _{020}} \right) \end{aligned}$$

(12)

where

$$\delta _1=\left( \dfrac{1}{r}- \dfrac{1}{N}\right) , \delta _2=\left( \dfrac{1}{n}- \dfrac{1}{r^{\prime }}\right) , \delta _3=\left( \dfrac{1}{r}- \dfrac{1}{n}\right) , \delta _4=\left( \dfrac{1}{r^{\prime }}- \dfrac{1}{n^{\prime }}\right), \delta _5= \left( \dfrac{1}{n^{\prime }}-\dfrac{1}{N}\right),$$

$\delta _6= \left( \dfrac{1}{r^{\prime }}-\dfrac{1}{N}\right)$ and $f_1=\left( \dfrac{1}{n}-\dfrac{1}{N} \right) .$
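The constants $\delta _1, \ldots , \delta _6$ and $f_1$ depend only on the sample sizes. As an illustration, the sketch below evaluates them exactly for the sizes reported for Population I in the empirical study ($N=25, n^{\prime }=18, r^{\prime }=11, n=9, r=7$); the function name and dictionary keys are illustrative.

```python
from fractions import Fraction


def design_constants(N, n1, r1, n, r):
    """Finite-population constants; n1 and r1 stand for n' and r'."""
    def inv(k):
        return Fraction(1, k)

    return {
        "delta1": inv(r) - inv(N),
        "delta2": inv(n) - inv(r1),
        "delta3": inv(r) - inv(n),
        "delta4": inv(r1) - inv(n1),
        "delta5": inv(n1) - inv(N),
        "delta6": inv(r1) - inv(N),
        "f1": inv(n) - inv(N),
    }


consts = design_constants(N=25, n1=18, r1=11, n=9, r=7)
for name, value in consts.items():
    print(name, value, float(value))
```

Two identities follow directly from the definitions and are convenient for checking hand calculations: $\delta _1 = \delta _3 + f_1$ and $\delta _6 = \delta _4 + \delta _5$.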

Proof

The bias of the estimator $\zeta _{1}$ is derived as

$$\begin{aligned} B(\zeta _{1})_d &= E(\zeta _{1}- {\bar{Y}}) \nonumber \\ & = E \left[ {\bar{Y}}\left\{ \alpha _1(1+e_{0})+(1-\alpha _1) (1+e_{0}) \dfrac{1+e_{4}}{1+e_{3}} \right\} \dfrac{(1+e^{\prime }_{1})}{(1+e_{2})} \left\{ \alpha +(1-\alpha ) \dfrac{(1+e^{\prime }_{4})}{(1+e^{\prime }_{3})} \right\} -{\bar{Y}} \right] \end{aligned}$$

(13)

Now, expanding the right-hand side of Eq. (13) binomially, taking expectation under sampling designs I and II, respectively, and retaining terms up to the first order of approximation, we get the expressions of the bias of the proposed estimator $\zeta _{1}$ under sampling designs I and II given in Eqs. (9)–(10).

In a similar fashion, we derive the expressions of the bias of the proposed estimator $\zeta _{2}$ under sampling designs I and II given in Eqs. (11)–(12). $\square$

Theorem 4.2

The mean square errors of the estimators $\zeta _{1}$ and $\zeta _{2}$ are given by

$$\begin{aligned} {\text {MSE}}(\zeta _{1})_{I} & = {\bar{Y}}^2 \Big [ \delta _{1}C^2_Y + \delta _{2} (C^2_X- 2 \rho _{YX} C_Y C_X)+ \delta _{3} \lbrace (1-\alpha _{1})^2 C^2_Z \nonumber \\&\qquad - 2(1-\alpha _{1})\rho _{YZ} C_Y C_Z \rbrace + \delta _{4} \lbrace (1-\alpha )^2 C^2_Z - 2(1-\alpha )\rho _{YZ} C_Y C_Z \rbrace \Big ] \end{aligned}$$

(14)

$$\begin{aligned} {\text {MSE}}(\zeta _{1})_{II} & = {\bar{Y}}^2 \Big [ \delta _{1}C^2_Y + \delta _{6}C^2_X+ f_1 (C^2_X- 2 \rho _{YX} C_Y C_X)+ \delta _{3} \lbrace (1-\alpha _{1})^2 C^2_Z \nonumber \\&\qquad -2(1-\alpha _{1})\rho _{YZ} C_Y C_Z \rbrace + \delta _{4}\lbrace (1-\alpha )^2 C^2_Z - 2(1-\alpha )\rho _{XZ} C_XC_Z \rbrace \Big ] \end{aligned}$$

(15)

$$\begin{aligned} {\text {MSE}}(\zeta _{2})_{I} & = {\bar{Y}}^2 \Big [ (\delta _{1}-\delta _{2} \rho ^2_{YX})C^2_Y + \delta _{3} \lbrace (1-\alpha _2)^2 C^2_Z - 2(1-\alpha _{2})\rho _{YZ} C_Y C_Z \rbrace \Big ] \nonumber \\&\qquad + \delta _{4} \Big [ (1-\alpha )^2\beta ^{2}_{YX} {\bar{X}}^2 C^2_Z - 2 (1-\alpha ){\bar{Y}} {\bar{X}} \beta _{YX} \rho _{YZ} C_Y C_Z \Big ] \end{aligned}$$

(16)

and

$$\begin{aligned} {\text {MSE}}(\zeta _{2})_{II}& = {\bar{Y}}^2 \left[ (\delta _{1}-f_1 \rho ^2_{YX})C^2_Y + \delta _{3} \lbrace (1-\alpha _2)^2 C^2_Z - 2(1-\alpha _{2})\rho _{YZ} C_Y C_Z \rbrace \right] \nonumber \\&\qquad +\beta ^{2}_{YX} {\bar{X}}^2 \left[ \delta _4 \lbrace (1-\alpha )^2 C^2_Z - 2 (1-\alpha ) \rho _{XZ} C_X C_Z \rbrace +\delta _6 C^2_X \right] . \end{aligned}$$

(17)

Proof

The mean square error of the estimator $\zeta _{1}$ is derived as

$$\begin{aligned} {\text {MSE}}(\zeta _{1})_d & = E(\zeta _{1}- {\bar{Y}})^2 \nonumber \\ & = E \left[ {\bar{Y}}\left\{ \alpha _1(1+e_{0})+(1-\alpha _1) (1+e_{0}) \dfrac{1+e_{4}}{1+e_{3}} \right\} \dfrac{(1+e^{\prime }_{1})}{(1+e_{2})} \left\{ \alpha +(1-\alpha ) \dfrac{(1+e^{\prime }_{4})}{(1+e^{\prime }_{3})} \right\} -{\bar{Y}} \right] ^2 \end{aligned}$$

(18)

Now, expanding the right-hand side of Eq. (18) binomially, taking expectation under sampling designs I and II, respectively, and retaining terms up to the first order of approximation, we get the expressions of the mean square error of the proposed estimator $\zeta _{1}$ under sampling designs I and II given in Eqs. (14)–(15).

In a similar fashion, we derive the expressions of the mean square error of the proposed estimator $\zeta _{2}$ under sampling designs I and II given in Eqs. (16)–(17). $\square$

4.2 Minimum Biases and Mean Square Errors of the Estimators $\zeta _{1}$ and $\zeta _{2}$

Since the mean square errors of estimators $\zeta _{1}$ and $\zeta _{2}$ under two types of sampling designs mentioned in Eqs. (14)–(17) are the functions of unknown scalars $\alpha , \alpha _{1}$ and $\alpha _2$, the optimum choices of $\alpha , \alpha _{1}$ and $\alpha _2$ are obtained by minimizing the mean square errors given in Eqs. (14)–(17) with respect to $\alpha , \alpha _{1}$ and $\alpha _2$ as

$$\alpha _{1(\mathrm{{opt}})_I} = \alpha _{1(\mathrm{{opt}})_{II}} =1- \rho _{YZ} \dfrac{C_Y}{C_Z}$$

(19)

$$\alpha _{2(\mathrm{{opt}})_I} = 1- \rho _{YZ} \dfrac{C_Y}{C_Z}\quad{\text {and}}\quad\alpha _{2(\mathrm{{opt}})_{II}} =1- \rho _{YZ} \dfrac{C_Y}{C_Z}$$

(20)

For estimator $\zeta _{1}$, we have

$$\alpha _{(\mathrm{{opt}})_I} =1- \rho _{YZ} \dfrac{C_Y}{C_Z}\quad{\text {and}}\quad\alpha _{(\mathrm{{opt}})_{II}} =1- \rho _{XZ} \dfrac{C_X}{C_Z}$$

(21)

For estimator $\zeta _{2}$, we have

$$\alpha _{(\mathrm{{opt}})_I} = 1- \dfrac{\rho _{YZ} C_X}{\rho _{YX}C_Z}\quad{\text {and}}\quad\alpha _{(\mathrm{{opt}})_{II}} =1- \rho _{XZ} \dfrac{C_X}{C_Z}$$

(22)
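The optimum constants in Eqs. (19)–(22) are simple functions of the population correlations and coefficients of variation. The sketch below collects them in one helper and verifies numerically that the optimum $\alpha _1$ minimizes the $\alpha _1$-dependent bracket of Eqs. (14)–(15). The correlations are those stated for Population V; the coefficients of variation are hypothetical values chosen only for illustration.

```python
def optimum_alphas(rho_yx, rho_yz, rho_xz, c_y, c_x, c_z):
    """Optimum constants of Eqs. (19)-(22), keyed by estimator and design."""
    a_yz = 1 - rho_yz * c_y / c_z
    a_xz = 1 - rho_xz * c_x / c_z
    return {
        "alpha1": a_yz,                                        # Eq. (19), both designs
        "alpha2": a_yz,                                        # Eq. (20), both designs
        "alpha_zeta1_I": a_yz,                                 # Eq. (21)
        "alpha_zeta1_II": a_xz,                                # Eq. (21)
        "alpha_zeta2_I": 1 - rho_yz * c_x / (rho_yx * c_z),    # Eq. (22)
        "alpha_zeta2_II": a_xz,                                # Eq. (22)
    }


def alpha1_bracket(alpha1, rho_yz, c_y, c_z):
    """Bracket multiplying delta_3 in Eqs. (14)-(15), as a function of alpha_1."""
    return (1 - alpha1) ** 2 * c_z ** 2 - 2 * (1 - alpha1) * rho_yz * c_y * c_z


# Correlations as stated for Population V; CVs are hypothetical.
rho_yx, rho_yz, rho_xz = 0.7, 0.6, 0.5
c_y, c_x, c_z = 0.30, 0.25, 0.35

opt = optimum_alphas(rho_yx, rho_yz, rho_xz, c_y, c_x, c_z)
at_opt = alpha1_bracket(opt["alpha1"], rho_yz, c_y, c_z)
nearby = [alpha1_bracket(opt["alpha1"] + step, rho_yz, c_y, c_z) for step in (-0.05, 0.05)]
print(opt)
print(at_opt, nearby)
```

At the optimum the bracket equals $-\rho ^2_{YZ} C^2_Y$, which is the term that appears with $\delta _3$ in the minimum mean square errors.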

The optimum biases of the proposed estimators $\zeta _{1}$ and $\zeta _{2}$ have been obtained by putting the optimum choices of $\alpha , \alpha _{1}$ and $\alpha _2$ from Eqs. (19)–(22) in Eqs. (9)–(12). The optimum biases of the proposed estimators $\zeta _{1}$ and $\zeta _{2}$ under two types of two-phase sampling designs are given as

$$B^{*}(\zeta _{1})_{I} = {\bar{Y}} \left[ \delta _2(C^2_X - \rho _{YX}C_Y C_X)+ (\delta _3 + \delta _4) (\rho _{YZ}C_Y C_Z-\rho _{YZ}^2C^2_Y) \right]$$

(23)

$$\begin{aligned} B^{*}(\zeta _{1})_{II} & = {\bar{Y}} \Big [ f_1 (C^2_X - \rho _{YX}C_Y C_X)+ \delta _3 (\rho _{YZ}C_Y C_Z-\rho _{YZ}^2C^2_Y) \nonumber \\&\qquad + \delta _4 (\rho _{XZ}C_X C_Z-\rho _{XZ}^2C^2_X) \Big ] \end{aligned}$$

(24)

$$\begin{aligned} B^{*}(\zeta _{2})_{I} & = {\bar{Y}} \delta _3 (\rho _{YZ}C_Y C_Z-\rho _{YZ}^2 C^2_Y) + \beta _{YX} {\bar{X}} \left[ \delta _4 \dfrac{ \rho _{YZ} }{\rho _{YX} } (C_X C_Z-\rho _{XZ}C^2_X ) \right] + \beta _{YX} {\bar{X}} \nonumber \\ &\qquad \left[ \dfrac{\delta _2}{{\bar{X}}} \left( \dfrac{\mu _{030}}{\mu _{020}}- \dfrac{\mu _{120}}{\mu _{110}} \right) + \dfrac{(1- \alpha ) (\delta _4)}{{\bar{Z}}} \left( \dfrac{\mu _{021}}{\mu _{020}}- \dfrac{\mu _{111}}{\mu _{110}} \right) \right] \end{aligned}$$

(25)

$$\begin{aligned} B^{*}(\zeta _{2})_{II} & = {\bar{Y}} \delta _3 (\rho _{YZ}C_Y C_Z-\rho _{YZ}^2 C^2_Y) + \beta _{YX} {\bar{X}} \left[ \delta _4 (\rho _{XZ}C_X C_Z -\rho _{XZ}^2 C^2_X) \right] \nonumber \\&\qquad + \beta _{YX} f_1 \left( \dfrac{\mu _{120}}{\mu _{110}}- \dfrac{\mu _{030}}{\mu _{020}} \right) . \end{aligned}$$

(26)

The minimum mean square errors of the proposed estimators $\zeta _{1}$ and $\zeta _{2}$ have been obtained by putting the optimum choices of $\alpha , \alpha _{1}$ and $\alpha _2$ from Eqs. (19)–(22) in Eqs. (14)–(17). The minimum mean square errors of the proposed estimators $\zeta _{1}$ and $\zeta _{2}$ under the two types of two-phase sampling design are denoted by $M(\zeta _{1})_{d}$ and $M(\zeta _{2})_{d}$, respectively, and given as

$$M(\zeta _{1})_{I} = {\bar{Y}}^2 \left[ \lbrace \delta _{1}-(\delta _3+ \delta _4) \rho ^{2}_{YZ}\rbrace C^2_Y + \delta _{2} (C^2_X- 2 \rho _{YX} C_Y C_X) \right]$$

(27)

$$M(\zeta _{1})_{II} = {\bar{Y}}^2 \left[ ( \delta _{1}- \delta _3 \rho ^2_{YZ}) C^2_Y + (\delta _{6}- \delta _{4} \rho ^{2}_{XZ} ) C^2_X+ f_1 (C^2_X- 2 \rho _{YX} C_Y C_X) \right]$$

(28)

$$M(\zeta _{2})_{I} ={\bar{Y}}^2 \left[ (\delta _{1}-\delta _{2} \rho ^{2}_{YX}) - (\delta _{3} + \delta _4) \rho ^2_{YZ} \right] C^2_Y$$

(29)

and

$$M(\zeta _{2})_{II}={\bar{Y}}^2 \left[ (\delta _{1}-f_1 \rho _{YX}^2) - \delta _{3} \rho ^2_{YZ} \right] C^2_Y + \beta ^{2}_{YX} {\bar{X}}^2C^2_X \left( \delta _{6} -\delta _{4}\rho ^2_{XZ}\right).$$

(30)
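The substitution that leads from Eq. (14) to Eq. (27) can be verified numerically. The sketch below is illustrative only: the sample-size constants use the sizes reported for Population I, while ${\bar{Y}}$, the coefficients of variation and the correlations are hypothetical values, and the parameter names are assumptions of this example.

```python
def mse_zeta1_design1(alpha1, alpha, p):
    """Eq. (14) for given alpha_1 and alpha."""
    t3 = (1 - alpha1) ** 2 * p["c_z"] ** 2 - 2 * (1 - alpha1) * p["rho_yz"] * p["c_y"] * p["c_z"]
    t4 = (1 - alpha) ** 2 * p["c_z"] ** 2 - 2 * (1 - alpha) * p["rho_yz"] * p["c_y"] * p["c_z"]
    ratio_term = p["c_x"] ** 2 - 2 * p["rho_yx"] * p["c_y"] * p["c_x"]
    bracket = p["d1"] * p["c_y"] ** 2 + p["d2"] * ratio_term + p["d3"] * t3 + p["d4"] * t4
    return p["ybar"] ** 2 * bracket


def min_mse_zeta1_design1(p):
    """Eq. (27)."""
    ratio_term = p["c_x"] ** 2 - 2 * p["rho_yx"] * p["c_y"] * p["c_x"]
    first = (p["d1"] - (p["d3"] + p["d4"]) * p["rho_yz"] ** 2) * p["c_y"] ** 2
    return p["ybar"] ** 2 * (first + p["d2"] * ratio_term)


params = {
    "ybar": 50.0, "c_y": 0.30, "c_x": 0.25, "c_z": 0.35,   # hypothetical
    "rho_yx": 0.7, "rho_yz": 0.6,                          # hypothetical
    "d1": 1 / 7 - 1 / 25, "d2": 1 / 9 - 1 / 11,            # Population I sample sizes
    "d3": 1 / 7 - 1 / 9, "d4": 1 / 11 - 1 / 18,
}

alpha_opt = 1 - params["rho_yz"] * params["c_y"] / params["c_z"]  # Eqs. (19) and (21), design I
mse_at_opt = mse_zeta1_design1(alpha_opt, alpha_opt, params)
mse_min = min_mse_zeta1_design1(params)
mse_elsewhere = mse_zeta1_design1(0.9, 0.9, params)
print(mse_at_opt, mse_min, mse_elsewhere)
```

The first two printed values coincide up to rounding, and any other choice of the constants gives a value that is at least as large.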

5 Some Well-Known Methods of Imputation

For a single-phase sampling design in which a sample of size n is selected from the population under the SRSWOR scheme and non-response occurs in the sample data, some classical methods of imputation are presented in this section under the assumption that information on the auxiliary variable x is available for every unit of the population.

5.1 Mean Method of Imputation

The mean method of imputation gives the data as:

$$y_{.i}= {\left\{ \begin{array}{ll} y_i&{}{\text {if}}\,i \in R \\ {\bar{y}}_{r}&{}{\text {if}}\,i\in R^{c} \end{array}\right. }$$

(31)

Under the imputation method discussed in Eq. (31), the corresponding point estimator of the population mean ${\bar{Y}}$ is derived as

$${\bar{y}}_m = \dfrac{1}{n} \sum _{i=1}^{n} y_{.i}= {\bar{y}}_{r}.$$

(32)

The variance of the estimator ${\bar{y}}_m$ is obtained as

$$v({\bar{y}}_m)=\delta _{1} {\bar{Y}}^2 C^2_Y.$$

(33)

5.2 Ratio Method of Imputation

The ratio method of imputation gives the data as:

$$y_{.i} = {\left\{ \begin{array}{ll} y_i &{}{\text {if}}\,i \in R \\ {\hat{b}}x_i&{}{\text {if}}\,i\in R^{c} \end{array}\right. }$$

(34)

where ${\hat{b}}=\dfrac{\sum _{i \in R} {y_i} }{\sum _{i \in R}x_i}$.

Under the imputation method discussed in Eq. (34), the corresponding point estimator of the population mean ${\bar{Y}}$ is derived as

$${\bar{y}}_\mathrm{{rat}}= \dfrac{1}{n} \sum _{i=1}^{n} y_{.i}={\bar{y}}_{r} \dfrac{{\bar{x}}_n}{{\bar{x}}_r}.$$

(35)

The mean square error of the estimator ${\bar{y}}_\mathrm{{rat}}$ up to the first order of approximations is obtained as

$$M({\bar{y}}_\mathrm{{rat}})={\bar{Y}}^2 \left[ \delta _{1} C^2_Y + \delta _{3} (C^2_X- 2 \rho _{YX} C_Y C_X)\right].$$

(36)

5.3 Regression Method of Imputation

The regression method of imputation gives the data as

$$y_{.i} = {\left\{ \begin{array}{ll} y_i &{}{\text {if}}\,i \in R \\ {\hat{a}}+{\hat{b}}_{yx} x_i &{}{\text {if}}\,i\in R^{c} \end{array}\right.}$$

(37)

where ${\hat{b}}_{yx}=\dfrac{s_{yx}(r)}{s^2_x(r)}$ and ${\hat{a}}=\left( {\bar{y}}_{r}-{\hat{b}}_{yx} {\bar{x}}_r \right).$ Under the imputation method discussed in Eq. (37), the corresponding point estimator of the population mean ${\bar{Y}}$ is derived as

$${\bar{y}}_\mathrm{{reg}}= \dfrac{1}{n} \sum _{i=1}^{n} y_{.i}= {\bar{y}}_r + {\hat{b}}_{yx}\left( {\bar{x}}_n - {\bar{x}}_r \right).$$

(38)

The mean square error of the estimator ${\bar{y}}_\mathrm{{reg}}$ up to the first order of approximation is obtained as

$$M({\bar{y}}_\mathrm{{reg}})={\bar{Y}}^2 C^2_Y \left[ \delta _{1} - \delta _{3} \rho ^2_{YX} \right].$$

(39)
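The three classical methods can be compared side by side on a small example. The sketch below, with toy data and illustrative function names, completes the sample as in Eqs. (31), (34) and (37), averages the completed data over all n units, and checks the result against the closed forms ${\bar{y}}_r$, Eq. (35) and Eq. (38).

```python
def sample_cov(a, b):
    """Unbiased sample covariance; sample_cov(a, a) is the sample variance."""
    m = len(a)
    mean_a, mean_b = sum(a) / m, sum(b) / m
    return sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b)) / (m - 1)


def impute_and_average(y, x, method):
    """Mean of the completed data; y[i] is None for non-respondents, x is complete."""
    n = len(y)
    resp = [i for i in range(n) if y[i] is not None]
    y_r = [y[i] for i in resp]
    x_r = [x[i] for i in resp]
    ybar_r, xbar_r = sum(y_r) / len(resp), sum(x_r) / len(resp)
    if method == "mean":            # Eq. (31): impute y-bar_r
        intercept, slope = ybar_r, 0.0
    elif method == "ratio":         # Eq. (34): impute b-hat * x_i
        intercept, slope = 0.0, sum(y_r) / sum(x_r)
    elif method == "regression":    # Eq. (37): impute a-hat + b-hat_yx * x_i
        slope = sample_cov(y_r, x_r) / sample_cov(x_r, x_r)
        intercept = ybar_r - slope * xbar_r
    else:
        raise ValueError("unknown method: " + method)
    completed = [y[i] if y[i] is not None else intercept + slope * x[i] for i in range(n)]
    return sum(completed) / n


# Toy sample (illustrative values only)
y = [20.0, None, 26.0, 17.0, None, 21.0, 24.0]
x = [10.0, 11.0, 13.0, 8.0, 12.0, 10.5, 12.5]

resp = [i for i in range(len(y)) if y[i] is not None]
ybar_r = sum(y[i] for i in resp) / len(resp)
xbar_r = sum(x[i] for i in resp) / len(resp)
xbar_n = sum(x) / len(x)
b_yx = sample_cov([y[i] for i in resp], [x[i] for i in resp]) / sample_cov(
    [x[i] for i in resp], [x[i] for i in resp]
)

y_mean = impute_and_average(y, x, "mean")
y_rat = impute_and_average(y, x, "ratio")
y_reg = impute_and_average(y, x, "regression")
print(y_mean, y_rat, y_reg)
```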

6 Analytical Comparison

In this section, we compare the suggested estimators with existing classical estimators ${\bar{y}}_{m}$ , ${\bar{y}}_\mathrm{{rat}}$ and ${\bar{y}}_\mathrm{{reg}}$.

Lemma 6.1

(i) The proposed estimator $\zeta _1$ under design I is more efficient than ${\bar{y}}_{m}$ if
$$M(\zeta _{1})_{I} -v({\bar{y}}_m)<0 \Rightarrow \dfrac{1-2\rho _{YX}}{\rho _{YZ}^2} < \dfrac{\delta _3 + \delta _4 }{\delta _2}.$$
(ii) The proposed estimator $\zeta _1$ under design II is more efficient than ${\bar{y}}_{m}$ if
$$M(\zeta _{1})_{II} -v({\bar{y}}_m)<0 \Rightarrow 1-2\rho _{YX}< \dfrac{\delta _3\rho _{YZ}^2 + \delta _4 \rho _{XZ}^2 -\delta _6 }{f_1}.$$
(iii) The proposed estimator $\zeta _2$ under design I is more efficient than ${\bar{y}}_{m}$ if
$$M(\zeta _{2})_{I} -v({\bar{y}}_m) <0 \Rightarrow \delta _2 \rho _{YX}^2 + (\delta _3 + \delta _4 ) \rho _{YZ}^2 >0$$
which is always true.
(iv) The proposed estimator $\zeta _2$ under design II is more efficient than ${\bar{y}}_{m}$ if
$$M(\zeta _{2})_{II} -v({\bar{y}}_m) <0 \Rightarrow {\bar{Y}}^2 (f_1 \rho _{YX}^2 + \delta _3 \rho _{YZ}^2 ) >{\bar{X}}^2 \beta _{YX}^2 (\delta _6- \delta _4 \rho _{XZ}^2).$$

Lemma 6.2

(i) The proposed estimator $\zeta _1$ under design I is more efficient than ${{\bar{y}}}_\mathrm{{rat}}$ if
$$M(\zeta _{1})_{I} -M({{\bar{y}}}_\mathrm{{rat}} )<0 \Rightarrow \dfrac{1-2\rho _{YX}}{\rho _{YZ}^2} < \dfrac{\delta _3 + \delta _4 }{\delta _2-\delta _3}.$$
(ii) The proposed estimator $\zeta _1$ under design II is more efficient than ${\bar{y}}_\mathrm{{rat}}$ if
$$M(\zeta _{1})_{II} -M({{\bar{y}}}_\mathrm{{rat}} )<0 \Rightarrow 1-2\rho _{YX}< \dfrac{\delta _3\rho _{YZ}^2 + \delta _4 \rho _{XZ}^2 -\delta _6 }{f_1-\delta _3}.$$
(iii) The proposed estimator $\zeta _2$ under design I is more efficient than ${\bar{y}}_\mathrm{{rat}}$ if
$$M(\zeta _{2})_{I} -M({{\bar{y}}}_\mathrm{{rat}} ) <0 \Rightarrow \delta _2 \rho _{YX}^2+ (\delta _3+ \delta _4 ) \rho _{YZ}^2 + \delta _3 (1-2\rho _{YX})>0$$
which always holds if $\rho _{YX} \le \dfrac{1}{2}$, since every term on the left-hand side is then non-negative.
(iv) The proposed estimator $\zeta _2$ under design II is more efficient than ${\bar{y}}_\mathrm{{rat}}$ if
$$M(\zeta _{2})_{II} -M({{\bar{y}}}_\mathrm{{rat}} ) <0 \Rightarrow 1-2\rho _{YX}>\dfrac{\beta _{YX}^2 {\bar{X}}^2 (\delta _6- \delta _4 \rho _{XZ}^2) - ( \delta _3 \rho _{YZ}^2+ f_1 \rho _{YX}^2 ) {\bar{Y}}^2}{ \delta _3 {\bar{Y}}^2 }.$$

Lemma 6.3

(i) The proposed estimator $\zeta _1$ under design I is more efficient than ${{\bar{y}}}_\mathrm{{reg}}$ if
$$M(\zeta _{1})_{I} -M({{\bar{y}}}_\mathrm{{reg}} )<0 \Rightarrow \delta _3 \rho _{YX}^2 + \delta _2(1-2\rho _{YX}) < (\delta _3 + \delta _4)\rho _{YZ}^2.$$
(ii) The proposed estimator $\zeta _1$ under design II is more efficient than ${\bar{y}}_\mathrm{{reg}}$ if
$$M(\zeta _{1})_{II} -M({{\bar{y}}}_\mathrm{{reg}} )<0 \Rightarrow \delta _3 \rho _{YX}^2 + f_1(1-2\rho _{YX}) < (\delta _4 \rho _{XZ}^2 + \delta _3 \rho _{YZ}^2) - \delta _6.$$
(iii) The proposed estimator $\zeta _2$ under design I is more efficient than ${\bar{y}}_\mathrm{{reg}}$ if
$$M(\zeta _{2})_{I} -M({{\bar{y}}}_\mathrm{{reg}} )<0 \Rightarrow (\delta _3 - \delta _2 ) \rho _{YX}^2 <(\delta _3 + \delta _4)\rho _{YZ}^2.$$
(iv) The proposed estimator $\zeta _2$ under design II is more efficient than ${\bar{y}}_\mathrm{{reg}}$ if
$$M(\zeta _{2})_{II} -M({{\bar{y}}}_\mathrm{{reg}} )<0 \Rightarrow {\bar{Y}}^2 \left\{ ( \delta _3-f_1) \rho _{YX}^2 - \delta _3 \rho _{YZ}^2 \right\} + {\bar{X}}^2 \beta _{YX}^2 (\delta _6- \delta _4 \rho _{XZ}^2)<0.$$

Remark 6.1

The conditions in Lemmas 6.1–6.3 are stated under the assumption that $C_Y \approx C_X \approx C_Z$ in the population.

7 Efficiency Comparison

In this section, empirical and simulation studies are carried out to demonstrate the performance of the proposed methods of imputation and the resultant estimators relative to the mean, ratio and regression methods of imputation.

7.1 Empirical Study

To show the practicability of the proposed methods of imputation in the real-life scenario, four natural populations from various survey studies have been chosen for empirical study. The optimum mean square errors of proposed estimators are taken under consideration in empirical study. The percent relative efficiencies of the proposed methods with respect to the classical methods of imputations (mean, ratio and regression) are obtained as

$$\begin{aligned} E_{11}&=\dfrac{v({\bar{y}}_m)}{M(\zeta _1)}\times 100, \quad E_{12}=\dfrac{M({\bar{y}}_\mathrm{{rat}})}{M(\zeta _1)}\times 100, \quad E_{13}=\dfrac{M({\bar{y}}_\mathrm{{reg}})}{M(\zeta _1)}\times 100; \\ E_{21}&=\dfrac{v({\bar{y}}_m)}{M(\zeta _2)}\times 100, \quad E_{22}=\dfrac{M({\bar{y}}_\mathrm{{rat}})}{M(\zeta _2)}\times 100\quad {\text {and}} \quad E_{23}=\dfrac{M({\bar{y}}_\mathrm{{reg}})}{M(\zeta _2)}\times 100. \end{aligned}$$
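Each of the six efficiencies is the same ratio applied to a different pair of mean square errors. A minimal sketch, with a hypothetical function name and hypothetical input values that are not results from the paper:

```python
def percent_relative_efficiency(benchmark_mse, proposed_mse):
    """PRE of a proposed estimator with respect to a benchmark estimator."""
    return 100.0 * benchmark_mse / proposed_mse


# Hypothetical mean square errors, for illustration only
pre = percent_relative_efficiency(benchmark_mse=2.5, proposed_mse=2.0)
print(pre)
```

A value above 100 indicates that the proposed estimator has the smaller mean square error and is therefore the more efficient of the two.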

The detailed information of populations is given below:

Population I [Source: [22]] (Page No. 58)

Y: Head length of second son
X: Head length of first son
Z: Head breadth of first son
$N=25, n^{\prime }=18, r^{\prime }=11, n=9, r=7$.

Population II [Source: [23] ] (Page No. 399)

Y: Area under wheat in 1964
X: Area under wheat in 1963
Z: Cultivated area in 1961
$N=34, n^{\prime }=22, r^{\prime }=14, n=11, r=8$.

Population III [Source: [24]] (Page No. 182)

Y: Number of ‘placebo’ children
X: Number of paralytic polio cases in the placebo group
Z: Number of paralytic polio cases in the ‘not inoculated’ group
$N=33, n^{\prime }=22,r^{\prime }=18, n=12, r=8$.

Population IV [Source: [25]] (Page No. 349)

Y: Volume
X: Diameter
Z: Height
$N=31, n^{\prime }=22,r^{\prime }=16, n=10,r=7$.

The percent relative efficiencies are computed for the above-mentioned populations under both sampling designs I and II and shown in Tables 1, 2 and 3.

Table 1 Percent relative efficiencies of the proposed methods of imputation with respect to mean method of imputation

Table 2 Percent relative efficiencies of the proposed methods of imputation with respect to ratio method of imputation

Table 3 Percent relative efficiencies of the proposed methods of imputation with respect to regression method of imputation

7.2 Simulation Study

A computer simulation is an attempt to model a real-life or hypothetical scenario on a computer so that the behavior of a proposed system, strategy or method can be studied. Inferences about this behavior may be drawn by changing the parameters of the simulation. Motivated by this, we have run simulation studies to investigate the behavior of the proposed imputation methods relative to the classical methods of imputation. The simulation studies have been performed on three artificial, computer-generated data sets to obtain the percent relative efficiencies and losses of the proposed estimators due to the presence of non-response. The artificial data sets are described below:

Population V [Source: Artificially generated data set]

A population of size $N=2000$ is generated from the multivariate normal distribution in R software. The study variable y is positively correlated with the auxiliary variables x and z, with fixed correlations $\rho _{YX}=0.7$, $\rho _{YZ}=0.6$ and $\rho _{XZ}=0.5$. The parameters used for this population are $n^{\prime }=800, r^{\prime }=640, n= 256, r=204$.
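A population of this kind can be generated along the following lines. This is a minimal sketch in Python rather than R, and it rests on assumptions the text does not state: zero means and unit variances for all three variables, and population correlations equal to the quoted values (so the correlations computed from the generated data match them only approximately). The function names `cholesky`, `generate_population` and `pearson` are illustrative.

```python
import math
import random

def cholesky(matrix):
    """Return the lower-triangular factor L with L L^T = matrix."""
    size = len(matrix)
    lower = [[0.0] * size for _ in range(size)]
    for i in range(size):
        for j in range(i + 1):
            partial = sum(lower[i][k] * lower[j][k] for k in range(j))
            if i == j:
                lower[i][j] = math.sqrt(matrix[i][i] - partial)
            else:
                lower[i][j] = (matrix[i][j] - partial) / lower[j][j]
    return lower

def generate_population(size, corr, seed=0):
    """Draw `size` units (y, x, z) from N(0, corr); means and variances are assumed."""
    rng = random.Random(seed)
    lower = cholesky(corr)
    dim = len(corr)
    population = []
    for _ in range(size):
        noise = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        # Correlated unit = L * (independent standard normal vector)
        unit = tuple(sum(lower[i][k] * noise[k] for k in range(i + 1))
                     for i in range(dim))
        population.append(unit)
    return population

def pearson(u, v):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    var_u = sum((a - mean_u) ** 2 for a in u)
    var_v = sum((b - mean_v) ** 2 for b in v)
    return cov / math.sqrt(var_u * var_v)

# Variable order (y, x, z): rho_YX = 0.7, rho_YZ = 0.6, rho_XZ = 0.5
CORR_V = [[1.0, 0.7, 0.6],
          [0.7, 1.0, 0.5],
          [0.6, 0.5, 1.0]]

population_v = generate_population(2000, CORR_V)
y, x, z = zip(*population_v)
print(round(pearson(y, x), 2), round(pearson(y, z), 2), round(pearson(x, z), 2))
```

If the correlations are required to hold exactly in the generated data, as the word "fixed" may suggest, the draws would need an additional rescaling step that is not shown here.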

Population VI [Source: Artificially generated data set]

The triplet (y, x, z) is generated for a population of size $N=200$. The study variable y is highly correlated with the auxiliary variables, with fixed correlations $\rho _{YX}= 0.93$, $\rho _{YZ}=0.87$ and $\rho _{XZ}= 0.95$. We have taken $n^{\prime }=80, r^{\prime }=64, n= 50, r=40$.

Population VII [Source: Artificially generated data set]

The triplet (y, x, z) is generated for a population of size $N=1000$ such that $x\sim \mathrm{gamma}(4, 2.5)$, $e \sim N(0,1)$, $z=1.5x^{0.5}+e$ and $y=8x+7z+e$, where $\rho _{YX} > \rho _{YZ}$. We have taken $n^{\prime }= 400, r^{\prime }=320 , n= 128, r=102$.
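The construction of Population VII can be sketched as follows. Two points are assumptions, since the text does not settle them: gamma(4, 2.5) is read as shape 4 and scale 2.5, and the same error draw e is used in both the equation for z and the equation for y, as the notation literally suggests. The function names are illustrative.

```python
import math
import random

def generate_population_vii(size=1000, shape=4.0, scale=2.5, seed=0):
    """Generate (y, x, z) with x ~ gamma, z = 1.5*sqrt(x) + e, y = 8x + 7z + e."""
    rng = random.Random(seed)
    population = []
    for _ in range(size):
        x = rng.gammavariate(shape, scale)  # assumed: shape 4, scale 2.5
        e = rng.gauss(0.0, 1.0)             # assumed: one shared error per unit
        z = 1.5 * math.sqrt(x) + e
        y = 8.0 * x + 7.0 * z + e
        population.append((y, x, z))
    return population

def pearson(u, v):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    var_u = sum((a - mean_u) ** 2 for a in u)
    var_v = sum((b - mean_v) ** 2 for b in v)
    return cov / math.sqrt(var_u * var_v)

population_vii = generate_population_vii()
y, x, z = zip(*population_vii)
rho_yx, rho_yz = pearson(y, x), pearson(y, z)
print(rho_yx > rho_yz)  # checks the stated ordering of the two correlations
```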

In these simulation studies, the following steps have been followed:

Step I Draw a random sample $s^{\prime }$ of size $n^{\prime }$ from the population of size N.
Step II Remove $(n^{\prime }-r^{\prime })$ units at random from the first-phase sample in each replication. Impute the dropped units using the imputation method proposed for the first-phase sample.
Step III Draw a random subsample of size n from $s^{\prime }$ for design I, or an independent random sample of size n from the population of size N for design II.
Step IV Remove $(n-r)$ units at random from the second-phase sample in each replication. Impute the dropped units using the imputation method proposed for the second-phase sample.
Step V Compute the relevant statistics.
Step VI Repeat the above steps $\binom{N}{n} = M$ (say) times.
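The flow of Steps I to VI can be sketched as below. This is an illustration of the sampling and deletion mechanics only: plain mean imputation stands in for the proposed first-phase and second-phase imputation methods, which are defined in Sect. 3 and are not reproduced here. It is further assumed that x is the variable missing in the first phase and y in the second, that units are stored as (y, x, z) tuples, and that all names and the toy population are hypothetical.

```python
import random

def mean_impute(values, missing):
    """Fill positions listed in `missing` with the mean of the responding units."""
    observed = [v for i, v in enumerate(values) if i not in missing]
    fill = sum(observed) / len(observed)
    return [fill if i in missing else v for i, v in enumerate(values)]

def one_replicate(population, n1, r1, n, r, design, rng):
    """Run Steps I-V once; return (first-phase mean of x, second-phase mean of y)."""
    size = len(population)
    # Step I: first-phase sample s' of size n'
    s1 = rng.sample(range(size), n1)
    # Step II: drop n' - r' units at random and impute them (stand-in method)
    drop1 = set(rng.sample(range(n1), n1 - r1))
    x1 = mean_impute([population[i][1] for i in s1], drop1)
    # Step III: subsample of s' (design I) or independent sample (design II)
    if design == "I":
        s2 = rng.sample(s1, n)
    else:
        s2 = rng.sample(range(size), n)
    # Step IV: drop n - r units at random and impute them (stand-in method)
    drop2 = set(rng.sample(range(n), n - r))
    y2 = mean_impute([population[i][0] for i in s2], drop2)
    # Step V: statistics of interest
    return sum(x1) / n1, sum(y2) / n

def simulate(population, n1, r1, n, r, design, replicates, seed=0):
    """Step VI: repeat Steps I-V `replicates` times."""
    rng = random.Random(seed)
    return [one_replicate(population, n1, r1, n, r, design, rng)
            for _ in range(replicates)]

# Hypothetical toy population with the sample sizes of Population VI
demo_rng = random.Random(1)
toy_population = []
for _ in range(200):
    x_value = demo_rng.uniform(1.0, 10.0)
    toy_population.append((2.0 * x_value + demo_rng.gauss(0.0, 1.0), x_value, 0.0))

results = simulate(toy_population, n1=80, r1=64, n=50, r=40,
                   design="I", replicates=200)
print(len(results))
```

Replacing `mean_impute` with the proposed imputation rules, and passing `design="II"`, would give the corresponding replicates for the other estimators and design.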

The simulated variance and mean square errors of the existing and proposed estimators are obtained as:

$$\begin{aligned} {\text {var}}^{*}({\bar{y}}_{m})&=\dfrac{1}{M}\sum _{j=1}^{M}(({\bar{y}}_{m})_j-{\bar{Y}})^2, \quad M^{*}({\bar{y}}_\mathrm{{rat}})= \dfrac{1}{M}\sum _{j=1}^{M}(({\bar{y}}_\mathrm{{rat}})_j -{\bar{Y}})^2, \\ M^{*}({\bar{y}}_\mathrm{{reg}})&=\dfrac{1}{M}\sum _{j=1}^{M}(({\bar{y}}_\mathrm{{reg}})_j -{\bar{Y}})^2, \\ M^{*}(\zeta _{1})_d&= \dfrac{1}{M}\sum _{j=1}^{M}((\zeta _{1})_{dj} -{\bar{Y}})^2\quad {\text {and}}\quad M^{*}(\zeta _{2})_d = \dfrac{1}{M}\sum _{j=1}^{M}((\zeta _{2})_{dj} -{\bar{Y}})^2 \end{aligned}$$

The simulated percent relative efficiencies are given as

$$\begin{aligned} E^{\prime }_{11}&=\dfrac{{\text {var}}^{*}({\bar{y}}_{m})}{M^{*}(\zeta _{1})_d }\times 100, \quad E^{\prime }_{12}=\dfrac{M^{*}({\bar{y}}_\mathrm{{rat}})}{M^{*}(\zeta _{1})_d }\times 100, \quad E^{\prime }_{13}=\dfrac{M^{*}({\bar{y}}_\mathrm{{reg}})}{M^{*}(\zeta _{1})_d }\times 100; \\ E^{\prime }_{21}&=\dfrac{{\text {var}}^{*}({\bar{y}}_{m})}{M^{*}(\zeta _{2})_d }\times 100, \quad E^{\prime }_{22}=\dfrac{M^{*}({\bar{y}}_\mathrm{{rat}})}{M^{*}(\zeta _{2})_d }\times 100\quad {\text {and}}\quad E^{\prime }_{23}=\dfrac{M^{*}({\bar{y}}_\mathrm{{reg}})}{M^{*}(\zeta _{2})_d}\times 100. \end{aligned}$$
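As a small illustration of these formulas, the sketch below computes a simulated mean square error about the true mean and the resulting percent relative efficiency. The function names and the two short lists of replicate estimates are made up for the example and are not results from the study.

```python
def simulated_mse(estimates, true_mean):
    """Average squared deviation of the replicate estimates from the true mean."""
    return sum((value - true_mean) ** 2 for value in estimates) / len(estimates)

def percent_relative_efficiency(mse_existing, mse_proposed):
    """PRE of the proposed estimator with respect to an existing one."""
    return mse_existing / mse_proposed * 100.0

# Made-up replicate estimates of a population mean equal to 10
existing_estimates = [9.0, 11.0]
proposed_estimates = [9.5, 10.5]

mse_existing = simulated_mse(existing_estimates, 10.0)   # 1.0
mse_proposed = simulated_mse(proposed_estimates, 10.0)   # 0.25
pre = percent_relative_efficiency(mse_existing, mse_proposed)
print(pre)  # 400.0
```

A value above 100 means the proposed estimator has the smaller simulated mean square error.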

The percent relative losses in efficiencies of the estimators $\zeta _1$ and $\zeta _2$ due to non-response are obtained with respect to the corresponding estimators when no non-response is observed in either phase. The estimators $T_1$ and $T_2$ are defined under the same circumstances as the estimators $\zeta _1$ and $\zeta _2$, respectively, but under complete response. The simulated percent relative losses in efficiencies of the proposed estimators $\zeta _1$ and $\zeta _2$ with respect to $T_1$ and $T_2$, respectively, under their respective design are given as

$$l_1=\dfrac{M^{*}(\zeta _{1})_d - {\text {MSE}}(T_{1})_d}{M^{*}(\zeta _{1})_d}\times 100\quad {\text {and}}\quad l_2=\dfrac{M^{*}(\zeta _{2})_d -{\text {MSE}}(T_{2})_d}{M^{*}(\zeta _{2})_d}\times 100$$

where

$${\text {MSE}}(T_{1})_d = \dfrac{1}{M}\sum _{j=1}^{M}((T_{1})_{dj} -{\bar{Y}})^2\quad {\text {and}}\quad {\text {MSE}}(T_{2})_d = \dfrac{1}{M}\sum _{j=1}^{M}((T_{2})_{dj} -{\bar{Y}})^2.$$
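The loss measure can be illustrated in the same way. The sketch below assumes the two mean square errors have already been obtained from the replicates; the function name and the numerical values are hypothetical.

```python
def percent_relative_loss(mse_with_nonresponse, mse_complete_response):
    """Percent relative loss in efficiency of an estimator due to non-response."""
    return (mse_with_nonresponse - mse_complete_response) / mse_with_nonresponse * 100.0

# Hypothetical values: non-response inflates the MSE from 1.0 to 1.25
loss = percent_relative_loss(1.25, 1.0)

# Hypothetical values: the imputed estimator has the smaller MSE,
# so the loss is negative, i.e. a gain in precision
gain = percent_relative_loss(0.8, 1.0)
print(round(loss, 6), round(gain, 6))
```

Because the denominator is the mean square error under non-response, the loss is bounded above by 100 and becomes negative whenever the estimator based on imputed data is the more precise one.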

In this study, $M=50{,}000$ has been taken for convenience in calculation. The values of $E^{\prime }_{ij}\ (i=1,2;\ j=1,2,3)$ and $l_k\ (k=1,2)$ are calculated following the above procedure and presented in Tables 5, 6, 7, 8, 9 and 10.

Table 4 Bias of proposed, mean, ratio and regression estimators under imputation method

Table 5 Percent relative efficiencies of proposed method with respect to mean, ratio and regression methods of imputation under design I

Table 6 Percent relative efficiencies of proposed method with respect to mean, ratio and regression methods of imputation under design II

Table 7 Percent relative loss in efficiencies of $T_{1}$ and $T_{2}$ for population V

Table 8 Percent relative loss in efficiencies of $T_{1}$ and $T_{2}$ for population VI

Table 9 Percent relative loss in efficiencies of $T_{1}$ and $T_{2}$ for population II

Table 10 Percent relative loss in efficiencies of $T_{1}$ and $T_{2}$ for population VII

Following the above-mentioned simulation study, we have also calculated the biases of the proposed estimators $\zeta _{1}$, $\zeta _{2}$ and the existing estimators ${\bar{y}}_m$, ${\bar{y}}_\mathrm{{rat}}$ and ${\bar{y}}_\mathrm{{reg}}$ for populations I-IV; these are shown in Table 4.

8 Interpretations of Empirical and Simulation Results

The following interpretations may be drawn from Tables 1, 2, 3, 4, 5, 6, 7, 8, 9 and 10:

(i) From Tables 1, 2 and 3, it is seen that the percent relative efficiencies of the proposed estimators $\zeta _1$ and $\zeta _2$ with respect to the estimators ${\bar{y}}_m$, ${\bar{y}}_\mathrm{{rat}}$ and ${\bar{y}}_\mathrm{{reg}}$ are more than 100 in almost all cases when the percent relative efficiencies are obtained using the large-sample approximations. This reflects the dominance of the proposed methods of imputation and the resultant estimators over the classical methods of imputation.
(ii) From Tables 5 and 6, it is observed that the simulated percent relative efficiencies of the proposed estimators $\zeta _1$ and $\zeta _2$ with respect to the estimators ${\bar{y}}_m$, ${\bar{y}}_\mathrm{{rat}}$ and ${\bar{y}}_\mathrm{{reg}}$ are more than 100 in most of the cases when simulation studies are performed on artificial data sets.
(iii) Tables 7, 8, 9 and 10 indicate that the percent relative losses in efficiencies $l_1$ and $l_2$ of the estimators $\zeta _1$ and $\zeta _2$ under the two types of two-phase sampling designs are not more than 30% for both artificial and real populations.
(iv) In Tables 7 and 8, negative percent relative losses in efficiencies are observed for some cases under two-phase sampling design I, which indicates a gain in the precision of the estimates.
(v) From Tables 8, 9 and 10, it is also seen that the percent relative losses in efficiencies $l_1$ and $l_2$ decrease as the value of r increases for fixed values of $N, n^{\prime }, r^{\prime }$ and n under both types of two-phase sampling designs. This shows that the percent relative losses in efficiencies decrease as the percentage of non-response in the second-phase sample decreases.

In Tables 7 and 8, the percent relative losses in efficiencies of the proposed estimators are examined closely under small changes in the percentage of non-response in the second-phase sample, and the results are shown graphically in Figs. 1, 2, 3, 4, 5 and 6 to make the pattern more visible under sampling designs I and II separately.

From Figs. 1, 2, 3, 4, 5 and 6, it is easily seen that the percent relative losses in efficiencies of the proposed estimators decrease as the percentage of non-response decreases under both types of sampling designs.

9 Conclusions and Recommendations

When the proposed methods of imputation are applied to real-life data, they are rewarding in terms of percent relative efficiencies. They also show their superiority, in terms of percent relative efficiencies, over the classical mean, ratio and regression methods of imputation when simulation studies are performed on artificial data sets. The percent relative losses in efficiency of the proposed estimators are less than 30% whenever non-response affects 20% or less of the sample. These results indicate that the proposed methods of imputation reduce the adverse effect of non-response on inference to a greater extent than the classical methods of imputation. Hence, in view of the encouraging behavior of the suggested imputation methods, survey practitioners may be encouraged to apply them whenever non-response is unavoidable in the survey data.

References

1. Rubin DB (1976) Inference and missing data. Biometrika 63:581–593
2. Heitjan DF, Rubin DB (1991) Ignorability and coarse data. Ann Stat 19(4):2244–2253
3. Heitjan DF (1994) Ignorability in general incomplete-data models. Biometrika 81:701–708
4. Heitjan DF, Basu S (1996) Distinguishing “missing at random” and “missing completely at random”. Am Stat 50(3):207–213
5. Sande IG (1979) A personal view of hot deck approach to automatic edit and imputation. J Imput Proced Surv Methodol 5:238–246
6. Kalton G, Kasprzyk D, Santos R (1981) Issues of non-response and imputation in the survey of income and program participation. In: Krewski D, Platek R, Rao JNK (eds) Current topics in survey sampling. Academic Press, New York, pp 455–480
7. Lee H, Rancourt E, Sarndal CE (1994) Experiments with variance estimation from survey data with imputed values. J Off Stat 10(3):231–243
8. Lee H, Rancourt E, Sarndal CE (1995) Variance estimation in the presence of imputed data for the generalized estimation system. In: Proceedings of the American Statistical Association (Survey Research Methods Section of the American Statistical Association (ASA)). pp 384–389
9. Singh S, Horn S (2000) Compromised imputation in survey sampling. Metrika 51:266–276
10. Singh S, Deo B (2003) Imputation by power transformation. Stat Pap 44:555–579
11. Ahmed MS, Al-Titi O, Al-Rawi Z, Abu-Dayyeh W (2006) Estimation of population mean using different imputation methods. Stat Transit 7(6):1247–1264
12. Kadilar C, Cingi H (2008) Estimators for the population mean in the case of missing data. Commun Stat Theory Methods 37:2226–2236
13. Singh S (2009) A new method of imputation in survey sampling. Statistics 43(5):499–511
14. Diana G, Perri PF (2010) Improved estimators of the population mean for missing data. Commun Stat Theory Methods 39:3245–3251
15. Singh GN, Karna JP (2010) Some imputation methods to minimize the effect of non response in two-occasion rotation patterns. Commun Stat Theory Methods 39(18):3264–3281
16. Gira Abdeltawab A (2015) Estimation of population mean with a new imputation methods. Appl Math Sci 9(34):1663–1672
17. Bhushan S, Pandey PP (2016) Optimality of ratio type estimation methods for population mean in presence of missing data. Commun Stat Theory Methods. https://doi.org/10.1080/03610926.2016.1167906
18. Thakur NS, Yadav K, Pathak S (2011) Estimation of mean in presence of missing data under two-phase sampling scheme. J Reliab Stat Stud 4(2):93–104
19. Thakur NS, Yadav K, Pathak S (2012) Some imputation methods in double sampling scheme for estimation of population mean. Int J Mod Eng Res 2(1):200–207
20. Thakur NS, Yadav K, Pathak S (2013) On mean estimation with imputation in two-phase sampling. Res J Math Stat Sci 1(13):1–9
21. Pandey R, Yadav K (2016) Mean estimation under imputation based on two-phase sampling design using an auxiliary variable. Pak J Stat Oper Res XII(4):639–658
22. Anderson TW (1958) An introduction to multivariate statistical analysis. Wiley, New York
23. Murthy MN (1967) Sampling theory and methods. Statistical Publishing Society, Calcutta
24. Cochran WG (1977) Sampling techniques. Wiley, New York
25. Wang SG, Chow SC (1994) Advanced linear models: theory and applications. Marcel Dekker, Inc., New York

Acknowledgements

Authors are thankful to the Indian Institute of Technology (Indian School of Mines), Dhanbad, for providing necessary support to carry out the present research work. Authors are also thankful to the reviewers for their valuable suggestions which improved the quality of the paper.

Author information

Authors and Affiliations

Department of Applied Mathematics, Indian Institute of Technology (Indian School of Mines), Dhanbad, 826004, India
G. N. Singh & S. Suman

Corresponding author

Correspondence to S. Suman.

About this article

Cite this article

Singh, G.N., Suman, S. Estimation of Population Mean Using Imputation Methods for Missing Data Under Two-Phase Sampling Design. J Stat Theory Pract 13, 19 (2019). https://doi.org/10.1007/s42519-018-0016-5

Published: 05 November 2018
DOI: https://doi.org/10.1007/s42519-018-0016-5

Mathematics Subject Classification

62D05
