Abstract
This manuscript addresses the estimation of the population mean in two-phase sampling when non-response occurs in both phases of the sample. To cope with the missing data, some new imputation methods that utilize information on two auxiliary variables are suggested for estimating the population mean. The properties of the resultant estimators are studied, followed by empirical and simulation studies on real and artificial data sets that support the suggested imputation methods. The results are analyzed, and appropriate recommendations are made to survey practitioners.
1 Introduction
Missing data are among the most frequently occurring features of sample surveys, and recognizing their stochastic nature is essential for choosing an appropriate methodology. Failure to recognize the nature of the missingness may distort inferences about population characteristics or parameters, so data sets with missing values must be handled carefully. A fundamental question in this regard is which assumptions justify ignoring the mechanism that causes the missing data. Rubin [1] addressed this question by establishing ignorability conditions under the classical and Bayesian approaches to statistical inference. Heitjan and Rubin [2] and Heitjan [3] subsequently generalized the model of [1] to include other forms of incompleteness. Rubin [1] introduced three key concepts related to the missingness pattern of a data set: missing at random (MAR), observed at random (OAR) and parameter distinctness (PD). He stated: “The data are MAR if the probability of the observed missingness pattern, given the observed and unobserved data, does not depend on the values of the unobserved data. The data are OAR, if for every possible value of the missing data, the probability of the observed missingness pattern, given the observed and unobserved data, does not depend on the values of observed data.” The combination of MAR and OAR was later called missing completely at random (MCAR). Heitjan and Basu [4] distinguished the MAR and MCAR mechanisms through a series of examples. Based on these works, the missingness mechanism of a data set is identified, and inference about the population parameter is then made using strategies suited to that mechanism. These strategies are termed “imputation methods.” Imputation is the procedure of replacing missing data with substituted values. A large body of work has been carried out on imputation methods, such as [5,6,7,8,9,10,11,12,13,14,15,16,17].
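As a rough illustration of the distinction between MCAR and MAR, the following sketch deletes values of a study variable in two ways: independently of the data (MCAR), and with a probability that depends only on an always-observed covariate (MAR). It is not taken from the cited works; the linear model, the response probabilities and all variable names are assumptions made for illustration.

```python
import random
import statistics

rng = random.Random(2019)  # fixed seed, illustrative only

# Toy data (assumed model): x is always observed, y depends linearly on x.
n_units = 20000
x = [rng.gauss(0.0, 1.0) for _ in range(n_units)]
y = [2.0 * xi + rng.gauss(0.0, 1.0) for xi in x]

# MCAR: every unit responds with the same probability, whatever x or y is.
mcar_resp = [rng.random() < 0.7 for _ in range(n_units)]

# MAR: the response probability depends on the observed x only, not on y.
mar_resp = [rng.random() < (0.9 if xi > 0 else 0.5) for xi in x]

full_mean = statistics.fmean(y)
mcar_mean = statistics.fmean(yi for yi, r in zip(y, mcar_resp) if r)
mar_mean = statistics.fmean(yi for yi, r in zip(y, mar_resp) if r)

print(f"mean of y, all units        : {full_mean:.3f}")
print(f"mean of y, MCAR respondents : {mcar_mean:.3f}")
print(f"mean of y, MAR respondents  : {mar_mean:.3f}")
```

In this toy setting the respondent mean stays close to the full mean under MCAR, whereas under MAR it drifts upward because units with large x respond more often. This is why the response mechanism has to be identified before an imputation strategy is chosen.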
Information on an auxiliary variable may be used at the planning, design, survey or estimation stage to improve the precision of the estimates. When information on an auxiliary variable correlated with the study variable is readily available, ratio and regression methods and their transformed and improved versions have been widely used to obtain efficient estimates, on the premise that the population mean of the auxiliary variable is known. However, the population mean of the auxiliary variable is not always known. In such circumstances, two-phase (double) sampling is widely used to obtain a reliable estimate of the unknown population mean of the auxiliary variable. The presence of missing data in a two-phase sampling design compels researchers to use imputation methods in order to reach trustworthy conclusions about population parameters. Several researchers, such as [18,19,20,21], have suggested imputation methods to compensate for missing data under the assumption that complete response may not be available on the study variable as well as on the auxiliary variable in the second-phase sample. It is worth mentioning that very limited attention has been paid to situations in which complete response is also unavailable in the first-phase sample.
Following the aforementioned arguments and motivated by the work of [9], we propose some effective imputation methods under the missing completely at random (MCAR) response mechanism, which result in point estimators of the population mean of the study variable in a two-phase sampling design. The properties of the proposed estimators are discussed. Empirical and simulation studies are carried out to validate the suggested imputation methods and the resultant estimators. Suitable recommendations are made to survey practitioners for real-life applications.
2 Sampling Design and Notations
Let \(P =(P_1,P_2, \ldots , P_N)\) be a finite population of size N, with each unit characterized by the triplet (y, x, z). It is assumed that y is the study variable and that x and z are the first and second auxiliary variables, respectively, such that y is positively correlated with both x and z, but less strongly with z than with x. When the population mean \({\bar{X}}\) of the first auxiliary variable is not known but information on the second auxiliary variable z is available for all the units of the population, the following two-phase sampling scheme is used for making inference about the population parameters.
Let \(s^{\prime }\) be the first-phase sample of size \(n^{\prime }\) drawn using the simple random sampling without replacement (SRSWOR) scheme from the population and surveyed for the auxiliary variable x to estimate its population mean \({\bar{X}}\). The second-phase sample of size \(n < n^{\prime }\) is drawn to measure the study characteristic y under one of the following designs:
- Design I: The second-phase sample s is drawn from the first-phase sample \(s^{\prime }\).
- Design II: The second-phase sample s is drawn independently from the entire population.
We have assumed that non-response occurs in the first- and second-phase samples where \(r^{\prime }\) and r are the number of responding units in the first- and second-phase samples of sizes \(n^{\prime }\) and n, respectively. The corresponding sets of responding units are denoted by (\(R_1\) and \(R_2\)) and the sets of non-responding units by (\(R^{c}_1\) and \(R^{c}_2\)), respectively. We have also assumed that sample units in the second-phase sample s have been drawn from the responding set \(R_1\).
3 Proposed Methods of Imputation and Subsequent Estimators
In this section, using the compromised method of imputation in the first-phase sample, we have proposed some new compromised imputation methods under MCAR response mechanism in the second-phase sample for missing data on the study variable y. The proposed imputation methods and resultant estimators are given below:
3.1 Imputation for Missing Data in the First-Phase Sample
To compensate for the missing values of the auxiliary variable x in the first-phase sample, we considered the ratio method of imputation; hence, after imputation, the sample data on x take the following form:
where \(\hat{b^{\prime }}= \dfrac{\sum _{i=1}^{r^{\prime }} {x_i}}{\sum _{i=1}^{r^{\prime }} {z_i}}\) and \(\alpha\) is an unknown constant. Under the imputation method described in Eq. (1), the point estimator of the population mean \({\bar{X}}\) in the first-phase sample is derived as
which produces the point estimator of the population mean \({\bar{X}}\) in the first-phase sample as
where \({\bar{x}}_{r^{\prime }}= \dfrac{\sum _{i\in R_1}{x_{i}}}{r^{\prime }}\), \({\bar{z}}_{r^{\prime }}= \dfrac{\sum _{i\in R_1}{z_{i}}}{r^{\prime }}\) and \({\bar{z}}_{n^{\prime }}= \dfrac{\sum _{i=1}^{n'}{z_{i}}}{n^{\prime }}\).
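The first-phase imputation can be illustrated numerically. The sketch below assumes a compromised scheme of the Singh and Horn [9] type, in which a responding unit receives \(\alpha (n^{\prime }/r^{\prime }) x_i + (1-\alpha ){\hat{b}}^{\prime } z_i\) and a non-responding unit receives \((1-\alpha ){\hat{b}}^{\prime } z_i\). This form is an assumption, chosen because its completed-sample mean reduces to \(\alpha {\bar{x}}_{r^{\prime }} + (1-\alpha ){\bar{x}}_{r^{\prime }}{\bar{z}}_{n^{\prime }}/{\bar{z}}_{r^{\prime }}\), the expression that appears in the definition of c in Sect. 3.2. The function names and the data are hypothetical.

```python
import statistics


def first_phase_completed(x_resp, z_resp, z_nonresp, alpha):
    """Completed x-values under the ASSUMED compromised imputation scheme."""
    n1 = len(z_resp) + len(z_nonresp)   # first-phase sample size n'
    r1 = len(x_resp)                    # number of responding units r'
    b_hat = sum(x_resp) / sum(z_resp)   # ratio of respondent totals
    resp = [alpha * n1 / r1 * xi + (1 - alpha) * b_hat * zi
            for xi, zi in zip(x_resp, z_resp)]
    nonresp = [(1 - alpha) * b_hat * zi for zi in z_nonresp]
    return resp + nonresp


def first_phase_estimator(x_resp, z_resp, z_nonresp, alpha):
    """Closed form: alpha * x_bar_r' + (1 - alpha) * x_bar_r' * z_bar_n' / z_bar_r'."""
    x_bar_r = statistics.fmean(x_resp)
    z_bar_r = statistics.fmean(z_resp)
    z_bar_n = statistics.fmean(z_resp + z_nonresp)
    return alpha * x_bar_r + (1 - alpha) * x_bar_r * z_bar_n / z_bar_r


# Hypothetical first-phase sample: three responders, two non-responders.
x_resp = [10.0, 12.0, 14.0]
z_resp = [5.0, 6.0, 7.0]
z_nonresp = [4.0, 10.0]

completed = first_phase_completed(x_resp, z_resp, z_nonresp, alpha=0.5)
mean_of_completed = statistics.fmean(completed)
closed_form = first_phase_estimator(x_resp, z_resp, z_nonresp, alpha=0.5)
print(mean_of_completed, closed_form)
```

Setting \(\alpha =1\) gives the respondent mean \({\bar{x}}_{r^{\prime }}\), and \(\alpha =0\) gives the pure ratio-type estimator, so \(\alpha\) interpolates between the two.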
3.2 Imputation for Missing Data in the Second-Phase Sample
To derive the reliable substitutes for missing values in the second-phase sample, we suggest two new compromised imputation methods which are presented below:
First Imputation Method Under this method of imputation, sample data take the following forms
where \(c=\dfrac{1}{{\bar{x}}_n} \left[ \alpha {\bar{x}}_{r^{\prime }} + (1-\alpha ) {\bar{x}}_{r ^{\prime }}\dfrac{{\bar{z}}_{n^{\prime }}}{{\bar{z}}_{r^{\prime }}}\right]\), \({\hat{b}}= \dfrac{\sum _{i=1}^{r}{y_i}}{\sum _{i=1}^{r} {z_i}}\) and \(\alpha _1\) is a suitably chosen constant such that the mean square error of the resultant estimator is minimum.
Under the imputation method described in Eq. (3), the point estimator of the population mean \({\bar{Y}}\) takes the following form
Second Imputation Method Under this method of imputation, sample data take the following forms
where \({\hat{b}}_{yx}(r)= \dfrac{s_{yx}}{s^2_x}\) and \(\alpha _2\) is a suitably chosen constant such that the mean square error of the resultant estimator is minimum.
Under the imputation method described in Eq. (5), the point estimator of the population mean \({\bar{Y}}\) takes the following form
4 Properties of Estimators \(\zeta _{1}\) and \(\zeta _{2}\)
The properties of the proposed estimators \(\zeta _{1}\) and \(\zeta _{2}\) have been explored under two different types of two-phase sampling design opted for MCAR response mechanism. Large sample approximations have been used in order to obtain the expressions of biases and mean square errors of the proposed estimators using the following transformations:
Under the above transformations, the estimators \(\zeta _{1}\) and \(\zeta _{2}\) take the following forms:
and
where \(\beta _{YX} = \dfrac{S_{YX}}{S^2_X}.\)
4.1 Biases and Mean Square Errors of Estimators \(\zeta _{1}\) and \(\zeta _{2}\)
Let \(B(.)_{d}\) and \({\text {MSE}}(.)_{d}\) be the bias and mean square error, respectively, of an estimator under a given two-phase sampling design \(d (=I,II)\).
Theorem 4.1
The biases of the estimators \(\zeta _{1}\) and \(\zeta _{2}\) are given by
where
\(\delta _6= \left( \dfrac{1}{r^{\prime }}-\dfrac{1}{N}\right)\) and \(f_1=\left( \dfrac{1}{n}-\dfrac{1}{N} \right) .\)
Proof
The bias of the estimator \(\zeta _{1}\) is derived as
Now, expanding the right-hand sides of Eq. (13) binomially, taking expectation under the sampling designs I and II, respectively, and retaining the terms up to the first order of approximations, we get the expression of the bias of the proposed estimator \(\zeta _{1}\) under sampling designs I and II as obtained in Eqs. (9)–(10).
In a similar fashion, we derive the expression for the bias of the proposed estimator \(\zeta _{2}\) under sampling designs I and II as obtained in Eqs. (11)–(12). \(\square\)
Theorem 4.2
The mean square errors of the estimators \(\zeta _{1}\) and \(\zeta _{2}\) are given by
and
Proof
The mean square error of the estimator \(\zeta _{1}\) is derived as
Now, expanding the right-hand sides of Eq. (18) binomially, taking expectation under the sampling designs I and II, respectively, and retaining the terms up to the first order of approximations, we get the expressions of the mean square error of the proposed estimator \(\zeta _{1}\) under sampling designs I and II as obtained in Eqs. (14)–(15).
In similar fashion, we derive the expression of mean square error of the proposed estimator \(\zeta _{2}\) under sampling designs I and II as obtained in Eqs. (16)–(17). \(\square\)
4.2 Minimum Biases and Mean Square Errors of the Estimators \(\zeta _{1}\) and \(\zeta _{2}\)
Since the mean square errors of estimators \(\zeta _{1}\) and \(\zeta _{2}\) under two types of sampling designs mentioned in Eqs. (14)–(17) are the functions of unknown scalars \(\alpha , \alpha _{1}\) and \(\alpha _2\), the optimum choices of \(\alpha , \alpha _{1}\) and \(\alpha _2\) are obtained by minimizing the mean square errors given in Eqs. (14)–(17) with respect to \(\alpha , \alpha _{1}\) and \(\alpha _2\) as
For estimator \(\zeta _{1}\), we have
For estimator \(\zeta _{2}\), we have
The optimum biases of the proposed estimators \(\zeta _{1}\) and \(\zeta _{2}\) have been obtained by putting the optimum choices of \(\alpha , \alpha _{1}\) and \(\alpha _2\) from Eqs. (19)–(22) in Eqs. (9)–(12). The optimum biases of the proposed estimators \(\zeta _{1}\) and \(\zeta _{2}\) under two types of two-phase sampling designs are given as
The minimum mean square errors of the proposed estimators \(\zeta _{1}\) and \(\zeta _{2}\) have been obtained by putting the optimum choices of \(\alpha , \alpha _{1}\) and \(\alpha _2\) from Eqs. (19)–(22) in Eqs. (14)–(17). The optimum mean square errors of the proposed estimators \(\zeta _{1}\) and \(\zeta _{2}\) under the two types of two-phase sampling designs are denoted by \(M(\zeta _{1})_{d}\) and \(M(\zeta _{2})_{d}\), respectively, and given as
and
5 Some Well-Known Methods of Imputation
This section presents some classical methods of imputation for the single-phase sampling design, in which a sample of size n is selected from the population under the SRSWOR scheme and non-response occurs in the sample data. It is assumed that information on the auxiliary variable x is available for every unit of the population.
5.1 Mean Method of Imputation
The mean method of imputation gives the data as:
Under the imputation method discussed in Eq. (31), the corresponding point estimator of the population mean \({\bar{Y}}\) is derived as
The variance of the estimator \({\bar{y}}_m\) is obtained as
5.2 Ratio Method of Imputation
The ratio method of imputation gives the data as:
where \({\hat{b}}=\dfrac{\sum _{i \in R} {y_i} }{\sum _{i \in R}x_i}\).
Under the imputation method discussed in Eq. (34), the corresponding point estimator of the population mean \({\bar{Y}}\) is derived as
The mean square error of the estimator \({\bar{y}}_\mathrm{{rat}}\) up to the first order of approximations is obtained as
5.3 Regression Method of Imputation
The regression method of imputation gives the data as
where \({\hat{b}}_{yx}=\dfrac{s_{yx}(r)}{s^2_x(r)}\) and \({\hat{a}}=\left( {\bar{y}}_{r}-{\hat{b}}_{yx} {\bar{x}}_r \right)\). Under the imputation method discussed in Eq. (37), the corresponding point estimator of the population mean \({\bar{Y}}\) is derived as
The mean square error of the estimator \({\bar{y}}_\mathrm{{reg}}\) up to the first order of approximations is obtained as
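As a numerical illustration of the three classical schemes of this section, the sketch below completes a small sample by mean, ratio and regression imputation and compares the completed-sample means with the familiar closed forms \({\bar{y}}_r\), \({\bar{y}}_r {\bar{x}}_n/{\bar{x}}_r\) and \({\bar{y}}_r + {\hat{b}}_{yx}({\bar{x}}_n-{\bar{x}}_r)\). It assumes that a missing y-value is replaced by \({\bar{y}}_r\), \({\hat{b}}x_i\) and \({\hat{a}}+{\hat{b}}_{yx}x_i\), respectively, and that x is observed on the whole sample. The data and function names are hypothetical.

```python
import statistics


def mean_imputed(y_resp, n):
    """Replace each missing y by the respondent mean."""
    y_bar_r = statistics.fmean(y_resp)
    return y_resp + [y_bar_r] * (n - len(y_resp))


def ratio_imputed(y_resp, x_resp, x_nonresp):
    """Replace each missing y by b_hat * x_i, with b_hat = sum(y_R) / sum(x_R)."""
    b_hat = sum(y_resp) / sum(x_resp)
    return y_resp + [b_hat * xi for xi in x_nonresp]


def regression_slope(y_resp, x_resp):
    """Slope s_yx(r) / s_x^2(r) computed from the responding units."""
    x_bar = statistics.fmean(x_resp)
    y_bar = statistics.fmean(y_resp)
    s_yx = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x_resp, y_resp))
    s_xx = sum((xi - x_bar) ** 2 for xi in x_resp)
    return s_yx / s_xx  # the common divisor (r - 1) cancels


def regression_imputed(y_resp, x_resp, x_nonresp):
    """Replace each missing y by a_hat + b_hat_yx * x_i."""
    b = regression_slope(y_resp, x_resp)
    a = statistics.fmean(y_resp) - b * statistics.fmean(x_resp)
    return y_resp + [a + b * xi for xi in x_nonresp]


# Hypothetical sample of n = 6 units, of which r = 4 respond on y.
y_resp = [12.0, 15.0, 20.0, 25.0]
x_resp = [5.0, 6.0, 8.0, 10.0]
x_nonresp = [7.0, 9.0]

y_bar_r = statistics.fmean(y_resp)
x_bar_r = statistics.fmean(x_resp)
x_bar_n = statistics.fmean(x_resp + x_nonresp)

est_mean = statistics.fmean(mean_imputed(y_resp, n=6))
est_ratio = statistics.fmean(ratio_imputed(y_resp, x_resp, x_nonresp))
est_reg = statistics.fmean(regression_imputed(y_resp, x_resp, x_nonresp))
print(est_mean, est_ratio, est_reg)
```

The ratio and regression estimates differ from the respondent mean only through \({\bar{x}}_n - {\bar{x}}_r\), which is how the auxiliary information from the non-responding units enters the estimate.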
6 Analytical Comparison
In this section, we compare the suggested estimators with existing classical estimators \({\bar{y}}_{m}\) , \({\bar{y}}_\mathrm{{rat}}\) and \({\bar{y}}_\mathrm{{reg}}\).
Lemma 6.1
- (i) The proposed estimator \(\zeta _1\) under sampling design I is more efficient than \({\bar{y}}_{m}\) if
$$M(\zeta _{1})_{I} -v({\bar{y}}_m)<0 \Rightarrow \dfrac{1-2\rho _{YX}}{\rho _{YZ}^2} < \dfrac{\delta _3 + \delta _4 }{\delta _2}.$$
- (ii) The proposed estimator \(\zeta _1\) under sampling design II is more efficient than \({\bar{y}}_{m}\) if
$$M(\zeta _{1})_{II} -v({\bar{y}}_m)<0 \Rightarrow 1-2\rho _{YX}< \dfrac{\delta _3\rho _{YZ}^2 + \delta _4 \rho _{XZ}^2 -\delta _6 }{f_1}.$$
- (iii) The proposed estimator \(\zeta _2\) under sampling design I is more efficient than \({\bar{y}}_{m}\) if
$$M(\zeta _{2})_{I} -v({\bar{y}}_m) <0 \Rightarrow \delta _2 \rho _{YX}^2 + (\delta _3 + \delta _4 ) \rho _{YZ}^2 >0$$
which is always true.
- (iv) The proposed estimator \(\zeta _2\) under sampling design II is more efficient than \({\bar{y}}_{m}\) if
$$M(\zeta _{2})_{II} -v({\bar{y}}_m) <0 \Rightarrow {\bar{Y}}^2 (f_1 \rho _{YX}^2 + \delta _3 \rho _{YZ}^2 ) >{\bar{X}}^2 \beta _{YX}^2 (\delta _6- \delta _4 \rho _{XZ}^2).$$
Lemma 6.2
- (i) The proposed estimator \(\zeta _1\) under sampling design I is more efficient than \({\bar{y}}_\mathrm{{rat}}\) if
$$M(\zeta _{1})_{I} -M({\bar{y}}_\mathrm{{rat}} )<0 \Rightarrow \dfrac{1-2\rho _{YX}}{\rho _{YZ}^2} < \dfrac{\delta _3 + \delta _4 }{\delta _2-\delta _3}.$$
- (ii) The proposed estimator \(\zeta _1\) under sampling design II is more efficient than \({\bar{y}}_\mathrm{{rat}}\) if
$$M(\zeta _{1})_{II} -M({\bar{y}}_\mathrm{{rat}} )<0 \Rightarrow 1-2\rho _{YX}< \dfrac{\delta _3\rho _{YZ}^2 + \delta _4 \rho _{XZ}^2 -\delta _6 }{f_1-f_3}.$$
- (iii) The proposed estimator \(\zeta _2\) under sampling design I is more efficient than \({\bar{y}}_\mathrm{{rat}}\) if
$$M(\zeta _{2})_{I} -M({\bar{y}}_\mathrm{{rat}} ) <0 \Rightarrow \delta _2 \rho _{YX}^2+ (\delta _3+ \delta _4 ) \rho _{YZ}^2 + \delta _3 (1-2\rho _{YX})>0$$
which is always true if \(\rho _{YX} < \dfrac{1}{2}\).
- (iv) The proposed estimator \(\zeta _2\) under sampling design II is more efficient than \({\bar{y}}_\mathrm{{rat}}\) if
$$M(\zeta _{2})_{II} -M({\bar{y}}_\mathrm{{rat}} ) <0 \Rightarrow 1-2\rho _{YX}>\dfrac{\beta _{YX}^2 {\bar{X}}^2 (\delta _6- \delta _4 \rho _{XZ}^2) - ( \delta _3 \rho _{YZ}^2+ f_1 \rho _{YX}^2 ) {\bar{Y}}^2}{ \delta _3 {\bar{Y}}^2 }.$$
Lemma 6.3
- (i) The proposed estimator \(\zeta _1\) under sampling design I is more efficient than \({\bar{y}}_\mathrm{{reg}}\) if
$$M(\zeta _{1})_{I} -M({\bar{y}}_\mathrm{{reg}} )<0 \Rightarrow \delta _3 \rho _{YX}^2 + \delta _2(1-2\rho _{YX}) < (\delta _3 + \delta _4)\rho _{YZ}^2.$$
- (ii) The proposed estimator \(\zeta _1\) under sampling design II is more efficient than \({\bar{y}}_\mathrm{{reg}}\) if
$$M(\zeta _{1})_{II} -M({\bar{y}}_\mathrm{{reg}} )<0 \Rightarrow \delta _3 \rho _{YX}^2 + f_1(1-2\rho _{YX}) < (\delta _4 \rho _{XZ}^2 + \delta _3 \rho _{YZ}^2) - \delta _6.$$
- (iii) The proposed estimator \(\zeta _2\) under sampling design I is more efficient than \({\bar{y}}_\mathrm{{reg}}\) if
$$M(\zeta _{2})_{I} -M({\bar{y}}_\mathrm{{reg}} )<0 \Rightarrow (\delta _3 - \delta _2 ) \rho _{YX}^2 <(\delta _3 + \delta _4)\rho _{YZ}^2.$$
- (iv) The proposed estimator \(\zeta _2\) under sampling design II is more efficient than \({\bar{y}}_\mathrm{{reg}}\) if
$$M(\zeta _{2})_{II} -M({\bar{y}}_\mathrm{{reg}} )<0 \Rightarrow {\bar{Y}}^2 \left\{ ( \delta _3-f_1) \rho _{YX}^2 - \delta _3 \rho _{YZ}^2 \right\} + {\bar{X}}^2 \beta _{YX}^2 (\delta _6- \delta _4 \rho _{XZ}^2)<0.$$
Remark 6.1
It may be assumed that \(C_Y \approx C_X \approx C_Z\) in the population.
7 Efficiency Comparison
In this section, empirical and simulation studies are carried out to demonstrate the performance of the proposed methods of imputation and the resultant estimators relative to the mean, ratio and regression methods of imputation.
7.1 Empirical Study
To show the practicability of the proposed methods of imputation in the real-life scenario, four natural populations from various survey studies have been chosen for empirical study. The optimum mean square errors of proposed estimators are taken under consideration in empirical study. The percent relative efficiencies of the proposed methods with respect to the classical methods of imputations (mean, ratio and regression) are obtained as
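Assuming the usual definition of percent relative efficiency, namely the mean square error (or variance) of the classical estimator divided by the optimum mean square error of the proposed estimator and multiplied by 100, the computation can be sketched as follows. The function name and the numerical values are hypothetical and are not results from this study.

```python
def percent_relative_efficiency(mse_classical, mse_proposed):
    """PRE of a proposed estimator with respect to a classical one (assumed definition)."""
    if mse_proposed <= 0:
        raise ValueError("the MSE of the proposed estimator must be positive")
    return 100.0 * mse_classical / mse_proposed


# Hypothetical MSE values, for illustration only.
pre = percent_relative_efficiency(mse_classical=2.5, mse_proposed=2.0)
print(pre)  # 125.0
```

Under this definition, a value above 100 means that the proposed estimator has the smaller mean square error.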
The detailed information of populations is given below:
Population I [Source: [22]] (Page No. 58)
- Y: Head length of second son
- X: Head length of first son
- Z: Head breadth of first son
- \(N=25, n^{\prime }=18, r^{\prime }=11, n=9, r=7\).
Population II [Source: [23]] (Page No. 399)
- Y: Area under wheat in 1964
- X: Area under wheat in 1963
- Z: Cultivated area in 1961
- \(N=34, n^{\prime }=22, r^{\prime }=14, n=11, r=8\).
Population III [Source: [24]] (Page No. 182)
- Y: Number of ‘placebo’ children
- X: Number of paralytic polio cases in the placebo group
- Z: Number of paralytic polio cases in the ‘not inoculated’ group
- \(N=33, n^{\prime }=22, r^{\prime }=18, n=12, r=8\).
Population IV [Source: [25]] (Page No. 349)
- Y: Volume
- X: Diameter
- Z: Height
- \(N=31, n^{\prime }=22, r^{\prime }=16, n=10, r=7\).
The percent relative efficiencies are computed for the above-mentioned populations under both sampling designs I and II and shown in Tables 1, 2 and 3.
7.2 Simulation Study
A computer simulation models a real-life or hypothetical scenario so that the behavior of a proposed system, strategy or method can be studied, in particular by changing the parameters of the simulation. With this motivation, we have run simulation studies to investigate the behavior of the proposed imputation methods relative to the classical methods of imputation. The simulation studies have been performed on three artificial, computer-generated data sets to obtain the percent relative efficiencies of the proposed estimators and their losses in efficiency due to the presence of non-response. The artificial data sets are described below:
Population V Source: [Artificially Generated Data Set]
A population of size \(N=2000\) is generated from the multivariate normal distribution in R software. The study variable y is positively correlated with the auxiliary variables with fixed correlations \(\rho _{YX}=0.7\), \(\rho _{YZ}=0.6\) and \(\rho _{XZ}=0.5\). The parameters used for this population are \(n^{\prime }=800, r^{\prime }=640, n= 256, r=204\).
Population VI Source: [Artificially Generated Data Set]
A population of triplets (y, x, z) of size \(N=200\) is generated. The study variable y is highly correlated with the auxiliary variables with fixed correlations \(\rho _{YX}= 0.93\), \(\rho _{YZ}=0.87\) and \(\rho _{XZ}= 0.95\). We have taken \(n^{\prime }=80, r^{\prime }=64, n= 50, r=40\).
Population VII Source: [Artificially Generated Data Set]
A population of triplets (y, x, z) of size \(N=1000\) is generated such that \(x\sim gamma(4, 2.5)\), \(e \sim N(0,1)\), \(z=1.5x^{0.5}+e\) and \(y=8x+7z+e\), where \(\rho _{YX} > \rho _{YZ}\). We have taken \(n^{\prime }= 400, r^{\prime }=320 , n= 128, r=102\).
In these simulation studies, the following steps have been followed:
- Step I Draw a random sample \(s^{\prime }\) of size \(n^{\prime }\) from the population of size N.
- Step II Remove \((n^{\prime }-r^{\prime })\) units at random from the first-phase sample each time. Impute the dropped units using the imputation method proposed for the first-phase sample.
- Step III Draw a random subsample of size n from \(s ^{\prime }\) for design I, or an independent random sample of size n from the population for design II.
- Step IV Remove \((n-r)\) units at random from the second-phase sample each time. Impute the dropped units using the imputation method proposed for the second-phase sample.
- Step V Compute the relevant statistics.
- Step VI Repeat the above steps \(\binom{N}{n} = M\) (say) times.
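The steps above can be sketched as a small Monte Carlo loop. This is a minimal illustration and not the code used in this study: the toy population, the sample sizes and the number of replications are hypothetical, the classical mean and ratio imputation estimators stand in for \(\zeta _1\) and \(\zeta _2\), and the first-phase step only selects the responding set without using it in these stand-in estimators. A proposed estimator could be passed in through the same `estimator` argument. The simulated bias and mean square error are taken as the average deviation and the average squared deviation from the true mean.

```python
import random
import statistics


def mean_imputation(x, y, s1, resp1, s2, resp2):
    # Under mean imputation the completed-sample mean is the respondent mean.
    return statistics.fmean(y[i] for i in resp2)


def ratio_imputation(x, y, s1, resp1, s2, resp2):
    # Classical ratio-imputation estimator: y_bar_r * x_bar_n / x_bar_r.
    y_r = statistics.fmean(y[i] for i in resp2)
    x_r = statistics.fmean(x[i] for i in resp2)
    x_n = statistics.fmean(x[i] for i in s2)
    return y_r * x_n / x_r


def simulate(x, y, n1, r1, n, r, design, estimator, reps, seed=0):
    """Simulated bias and MSE of `estimator` following Steps I to VI."""
    rng = random.Random(seed)
    N = len(y)
    true_mean = statistics.fmean(y)
    estimates = []
    for _ in range(reps):                             # Step VI
        s1 = rng.sample(range(N), n1)                 # Step I: SRSWOR
        resp1 = rng.sample(s1, r1)                    # Step II: MCAR responders
        pool = s1 if design == "I" else range(N)      # Step III
        s2 = rng.sample(pool, n)
        resp2 = rng.sample(s2, r)                     # Step IV: MCAR responders
        estimates.append(estimator(x, y, s1, resp1, s2, resp2))  # Step V
    bias = statistics.fmean(estimates) - true_mean
    mse = statistics.fmean((e - true_mean) ** 2 for e in estimates)
    return bias, mse


# Hypothetical toy population with y roughly proportional to x.
pop_rng = random.Random(1)
x = [pop_rng.uniform(10, 20) for _ in range(500)]
y = [3 * xi + pop_rng.gauss(0, 2) for xi in x]

bias_m, mse_m = simulate(x, y, 200, 160, 50, 35, "I", mean_imputation, reps=2000)
bias_r, mse_r = simulate(x, y, 200, 160, 50, 35, "I", ratio_imputation, reps=2000)
pre = 100 * mse_m / mse_r  # simulated percent relative efficiency
print(f"mean imputation : bias={bias_m:.4f}, MSE={mse_m:.4f}")
print(f"ratio imputation: bias={bias_r:.4f}, MSE={mse_r:.4f}, PRE={pre:.1f}")
```

Because both calls use the same seed, the two estimators are evaluated on identical samples, which removes sampling noise from the comparison of their simulated mean square errors.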
The simulated variance and mean square errors of the existing and proposed estimators are obtained as:
The simulated percent relative efficiencies are given as
The percent relative losses in efficiency of the estimators \(\zeta _1\) and \(\zeta _2\) due to non-response are obtained with respect to the corresponding estimators when no non-response is observed in either phase. The estimators \(T_1\) and \(T_2\) are defined under the same circumstances as the estimators \(\zeta _1\) and \(\zeta _2\), respectively, but under complete response. The simulated percent relative losses in efficiency of the proposed estimators \(\zeta _1\) and \(\zeta _2\) with respect to \(T_1\) and \(T_2\), respectively, under their respective designs are given as
where
In this study, \(M=50{,}000\) has been taken for convenience in calculation. The values of \(E^{\prime }_{ij}\ (i=1,2;\ j=1,2,3)\) and \(l_k\ (k=1,2)\) are calculated based on the above procedures and presented in Tables 5, 6, 7, 8, 9 and 10.
Following the above-mentioned simulation study, we have also calculated the biases of the resultant estimators \(\zeta _{1}\), \(\zeta _{2}\) and existing estimators \({\bar{y}}_m\), \({\bar{y}}_\mathrm{{rat}}\) and \({\bar{y}}_\mathrm{{reg}}\) for populations I-IV and shown in Table 4.
8 Interpretations of Empirical and Simulation Results
The following interpretations may be drawn from Tables 1, 2, 3, 4, 5, 6, 7, 8, 9 and 10:
- (i) From Tables 1, 2 and 3, it is seen that the percent relative efficiencies of the proposed estimators \(\zeta _1\) and \(\zeta _2\) with respect to the estimators \({\bar{y}}_m\), \({\bar{y}}_\mathrm{{rat}}\) and \({\bar{y}}_\mathrm{{reg}}\) are more than 100 in almost all cases when the percent relative efficiencies are obtained using the large sample approximations. This reflects the dominance of the proposed methods of imputation and the resultant estimators over the classical methods of imputation.
- (ii) From Tables 5 and 6, it is observed that the simulated percent relative efficiencies of the proposed estimators \(\zeta _1\) and \(\zeta _2\) with respect to the estimators \({\bar{y}}_m\), \({\bar{y}}_\mathrm{{rat}}\) and \({\bar{y}}_\mathrm{{reg}}\) are more than 100 in most of the cases when the simulation studies are performed on artificial data sets.
- (iii) From Tables 7, 8, 9 and 10, it is seen that the percent relative losses in efficiency \(l_1\) and \(l_2\) of the estimators \(\zeta _1\) and \(\zeta _2\) under the two types of two-phase sampling designs are not more than 30% for both artificial and real populations.
- (iv) From Tables 7 and 8, negative percent relative losses in efficiency are observed in some cases under two-phase sampling design I, which indicates a gain in the precision of the estimates.
- (v) From Tables 8, 9 and 10, it is also seen that the percent relative losses in efficiency \(l_1\) and \(l_2\) decrease as r increases for fixed values of \(N, n^{\prime }, r^{\prime }\) and n under both types of two-phase sampling designs. This shows that the percent relative losses in efficiency decrease as the percentage of non-response in the second-phase sample decreases.
In Tables 7 and 8, the percent relative losses in efficiency of the proposed estimators are examined closely for small changes in the percentage of non-response in the second-phase sample, and the results are shown graphically in Figs. 1, 2, 3, 4, 5 and 6 to make the pattern more visible under sampling designs I and II separately.
From Figs. 1, 2, 3, 4, 5 and 6, it is easily seen that the percent relative losses in efficiency of the proposed estimators decrease as the percentage of non-response decreases under both types of sampling designs.
9 Conclusions and Recommendations
When the proposed methods of imputation are applied to real data, they perform well in terms of percent relative efficiency. They also show superiority, in terms of percent relative efficiency, over the classical mean, ratio and regression methods of imputation in the simulation studies performed on artificial data sets. The percent relative losses in efficiency of the proposed estimators are less than 30% whenever non-response affects 20% or less of the sample. These results indicate that the proposed methods of imputation reduce the adverse effect of non-response on inference to a greater extent than the classical methods of imputation. Hence, in view of the encouraging behavior of the suggested imputation methods, survey practitioners may be encouraged to apply them whenever non-response is unavoidable in survey data.
References
Rubin DB (1976) Inference and missing data. Biometrika 63:581–593
Heitjan DF, Rubin DB (1991) Ignorability and coarse data. Ann Stat 19(4):2244–2253
Heitjan DF (1994) Ignorability in general incomplete-data models. Biometrika 81:701–708
Heitjan DF, Basu S (1996) Distinguishing “missing at random” and “missing completely at random”. Am Stat 50(3):207–213
Sande IG (1979) A personal view of hot deck approach to automatic edit and imputation. J Imput Proced Surv Methodol 5:238–246
Kalton G, Kasprzyk D, Santos R (1981) Issues of non-response and imputation in the survey of income and program participation. In: Krewski D, Platek R, Rao JNK (eds) Current topics in survey sampling. Academic Press, New York, pp 455–480
Lee H, Rancourt E, Sarndal CE (1994) Experiments with variance estimation from survey data with imputed values. J Off Stat 10(3):231–243
Lee H, Rancourt E, Sarndal CE (1995) Variance estimation in the presence of imputed data for the generalized estimation system. In: Proceeding of the American Statistical Association (Survey Research Methods Section of the American Statistical Association (ASA)). pp 384–389
Singh S, Horn S (2000) Compromised imputation in survey sampling. Metrika 51:266–276
Singh S, Deo B (2003) Imputation by power transformation. Stat Pap 44:555–579
Ahmed MS, Al-Titi O, Al-Rawi Z, Abu-Dayyeh W (2006) Estimation of population mean using different imputation methods. Stat Transit 7(6):1247–1264
Kadilar C, Cingi H (2008) Estimators for the population mean in the case of missing data. Commun Stat Theory Methods 37:2226–2236
Singh S (2009) A new method of imputation in survey sampling. Statistics 43(5):499–511
Diana G, Perri PF (2010) Improved estimators of the population mean for missing data. Commun Stat Theory Methods 39:3245–3251
Singh GN, Karna JP (2010) Some imputation methods to minimize the effect of non response in two-occasion rotation patterns. Commun Stat Theory Methods 39(18):3264–3281
Gira Abdeltawab A (2015) Estimation of population mean with a new imputation methods. Appl Math Sci 9(34):1663–1672
Bhushan S, Pandey PP (2016) Optimality of ratio type estimation methods for population mean in presence of missing data. Commun Stat Theory Methods. https://doi.org/10.1080/03610926.2016.1167906
Thakur NS, Yadav K, Pathak S (2011) Estimation of mean in presence of missing data under two-phase sampling scheme. J Reliab Stat Stud 4(2):93–104
Thakur NS, Yadav K, Pathak S (2012) Some imputation methods in double sampling scheme for estimation of population mean. Int J Mod Eng Res 2(1):200–207
Thakur NS, Yadav K, Pathak S (2013) On mean estimation with imputation in two-phase sampling. Res J Math Stat Sci 1(13):1–9
Pandey R, Yadav K (2016) Mean estimation under imputation based on two-phase sampling design using an auxiliary variable. Pak J Stat Oper Res XII(4):639–658
Anderson TW (1958) An introduction to multivariate statistical analysis. Wiley, New York
Murthy MN (1967) Sampling theory and methods. Statistical Publishing Society, Calcutta
Cochran WG (1977) Sampling techniques. Wiley, New York
Wang SG, Chow SC (1994) Advanced linear models: theory and applications. Marcel Dekker, Inc., New York
Acknowledgements
Authors are thankful to the Indian Institute of Technology (Indian School of Mines), Dhanbad, for providing necessary support to carry out the present research work. Authors are also thankful to the reviewers for their valuable suggestions which improved the quality of the paper.
Cite this article
Singh, G.N., Suman, S. Estimation of Population Mean Using Imputation Methods for Missing Data Under Two-Phase Sampling Design. J Stat Theory Pract 13, 19 (2019). https://doi.org/10.1007/s42519-018-0016-5
DOI: https://doi.org/10.1007/s42519-018-0016-5