1 Introduction

Nowadays, many researchers use email or phone surveys, which are easier, cheaper, and faster ways to obtain information. However, these modes suffer from a high non-response rate, which reduces the accuracy of parameter estimates. Among data-collection methods, the face-to-face interview reduces the non-response rate the most, but its cost is considerably higher. Hansen and Hurwitz [11] were the first to suggest a procedure of taking a subsample of non-respondents after the first email or phone attempt and then obtaining information from this group by personal interview.

The problem of non-response has been discussed in many papers. Many researchers have suggested estimators for population parameters based on the Hansen and Hurwitz [11] double sampling plan. Another way to increase the accuracy of population estimates is to use auxiliary information. Studies of mean estimation using information on auxiliary variables include Khare and Srivastava [14], Rao [22], Khare and Sinha [15,16,17], Kumar and Singh [19], Yaqub et al. [28], Bhushan and Pandey [3, 4], and Unal and Kadilar [26].

The Hansen and Hurwitz [11] method can obtain more information through face-to-face interviews in the second phase, but it may also introduce response bias if the variable of interest is sensitive in nature, since respondents are unlikely to answer such questions truthfully in a face-to-face interview. To reduce the social desirability bias (SDB) caused by sensitive questions, one can use randomized response technique (RRT) models when targeting the group of non-respondents. Subjects may refuse to respond on the first call but may provide a scrambled response on the second call with a personal interview. Diana et al. [6] proposed an unbiased population mean estimator under this two-phase sampling. Their estimator reduces non-response but increases the estimator variance due to the use of an RRT model in the non-respondent group. Later, Ahmed et al. [1] proposed generalized ratio and regression estimators utilizing the known coefficient of variation of the study variable for the second-phase sample using the RRT approach. These estimators improve efficiency when the auxiliary variable and the study variable are highly correlated. Makhdum et al. [21] also proposed a generalized class of estimators for a sensitive study variable in the presence of non-response using an RRT model.

Measurement error is another important issue in sample surveys. Measurement errors are usually assumed to be negligible, but when they are not, the resulting estimates become unreliable. Studies that discuss measurement errors in estimating population parameters include Kumar et al. [18], Kumar et al. [20], Khalil et al. [12], and Singh et al. [24]. Singh and Sharma [23], Singh and Vishwakarma [25], and Audu et al. [2] considered the problem of estimating the finite population mean in the presence of non-response and measurement errors simultaneously. Also, Khalil et al. [13] studied mean estimation under measurement errors using optional RRT models.

Based on the previous studies, one may consider estimating the population mean of a sensitive variable in the simultaneous presence of both measurement error and non-response, a problem that has not drawn much attention in the existing literature. The RRT models used in the previous studies [1, 6, 21] are non-optional RRT models, where all respondents are required to provide a scrambled response. However, a survey question may be sensitive for one person but not for another. Gupta et al. [7] pointed out that if we give respondents the option to choose whether to answer the sensitive question directly or provide a scrambled response, the model becomes more efficient with no extra loss of privacy [10].

We will briefly discuss the Hansen and Hurwitz [11] (HH) two-phase sampling procedure in Sect. 2.1 and the optional RRT (ORRT) model in Sect. 2.2. Some existing mean estimators are presented in Sect. 3.1, and a generalized mean estimator is introduced in Sect. 3.2. Section 4 provides the results of a simulation study, and Sect. 5 provides some concluding remarks.

2 Modified Hansen and Hurwitz [11] Procedure (HH)

2.1 Hansen and Hurwitz [11]: Two-Phase Sampling

Let \(U = \{U_1, U_2,\ldots ,U_N\}\) be a finite population of size N, from which a random sample of size n is taken without replacement. We assume that only \(n_1\) units respond on the first call and the remaining \(n_2=n-n_1\) units do not. Then, a subsample of size \(n_s=\frac{n_2}{f}\) (\(f>1\)) is taken from the \(n_2\) non-responding units. Hansen and Hurwitz [11] used a mail survey for the first attempt and face-to-face interviews on the second call.

Let \(\mu _y=\frac{\sum _{i=1}^{N}y_i}{N}\) and \(\sigma _y^2=\frac{\sum _{i=1}^{N}(y_i-\mu _y)^2}{N-1}\), respectively, be the population mean and variance of the study variable y. Let \(\mu _{y_{(1)}}=\frac{\sum _{i=1}^{N_1}y_i}{N_1}\) and \(\sigma _{y_{(1)}}^2=\frac{\sum _{i=1}^{N_1}(y_i-\mu _{y_{(1)}})^2}{N_1-1}\), respectively, be the mean and variance of the respondent group of size \(N_1\), and \(\mu _{y_{(2)}}=\frac{\sum _{i=1}^{N_2}y_i}{N_2}\) and \(\sigma _{y_{(2)}}^2=\frac{\sum _{i=1}^{N_2}(y_i-\mu _{y_{(2)}})^2}{N_2-1}\), respectively, be the mean and variance of the non-respondent group of size \(N_2\). Then, the population mean is given by

$$\mu _{y}=W_1 \mu _{y_{(1)}}+W_2 \mu _{y_{(2)}},$$
(1)

where \(W_1=\frac{N_1}{N}\) and \(W_2=\frac{N_2}{N}\). Not knowing \(N_1\) poses a challenge of its own.

Let \({\bar{y}}_1=\frac{\sum _{i=1}^{n_1}y_i}{n_1}\) be the sample mean for the response group, and \({\bar{y}}_2=\frac{\sum _{i=1}^{n_s}y_i}{n_s}\) be the sample mean for non-response group. One may note here that \({\bar{y}}_1\) and \({\bar{y}}_2\) are unbiased estimators for \(\mu _{y_{(1)}}\) and \(\mu _{y_{(2)}}\), respectively.

Hansen and Hurwitz [11] suggested an unbiased population mean estimator given by

$${\bar{y}}=w_1 \bar{y_1}+w_2 \bar{y_2},$$
(2)

where \(w_1=\frac{n_1}{n}\) and \(w_2=\frac{n_2}{n}\).

The variance of \({\bar{y}}\) is given by

$${\text {Var}}({\bar{y}})=\left( \frac{N-n}{Nn}\right) \sigma _y^2+\frac{W_2(f-1)}{n}\sigma _{y_{(2)}}^2.$$
(3)
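The combination in (2) and its variance (3) are simple to evaluate. Below is a minimal Python sketch (the paper's own computations use R) with purely illustrative values assumed for demonstration: hypothetical sample means \({\bar{y}}_1=10.2\), \({\bar{y}}_2=9.5\), and \(N=5000\), \(n=500\), \(W_2=0.6\), \(f=2\), equal variances in the two groups.

```python
def hh_estimate(ybar1, ybar2, n1, n2):
    """Hansen-Hurwitz estimator, Eq. (2): weighted mean of the
    respondent mean and the non-respondent subsample mean."""
    n = n1 + n2
    return (n1 / n) * ybar1 + (n2 / n) * ybar2

def hh_variance(N, n, W2, f, sigma_y2, sigma_y2_2):
    """Variance of the HH estimator, Eq. (3)."""
    return (N - n) / (N * n) * sigma_y2 + W2 * (f - 1) / n * sigma_y2_2

# Illustrative (assumed) values, not taken from the paper's tables
est = hh_estimate(ybar1=10.2, ybar2=9.5, n1=200, n2=300)  # 0.4*10.2 + 0.6*9.5
var = hh_variance(N=5000, n=500, W2=0.6, f=2, sigma_y2=16.0, sigma_y2_2=16.0)
```

Note that the weights \(w_1, w_2\) come from the realized sample split, while the variance uses the population quantities \(W_2\) and \(\sigma_{y_{(2)}}^2\).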

2.2 Optional RRT (ORRT) Models

Let Y be a sensitive study variable, and let \(y_i\, (i=1,2,\ldots ,n)\) be a simple random sample drawn without replacement from the population values \(y_i\) (\(i=1,2,\ldots ,N\)). Let \(\mu _{y}=\frac{1}{N}\sum _{i=1}^{N} y_{i}\), \({\bar{y}}=\frac{1}{n}\sum _{i=1}^{n} y_{i}\), \(\sigma _y^2=\frac{1}{N-1}\sum _{i=1}^{N} (y_i-\mu _y)^2\), and \(s_y^2=\frac{1}{n-1}\sum _{i=1}^{n} (y_i-{\bar{y}})^2\). Let T and S be two scrambling variables with respective means \(\mu _T\) and \(\mu _S\), and known variances \(\sigma _T^2\) and \(\sigma _S^2\). Let T, S, X and Y be mutually independent. The respondent is asked to report a scrambled response for the study variable (Y) if he/she considers the question sensitive, and a true response otherwise.

One could use a simple additive RRT model where the scrambled response is given by \(Y+S\) (as in Gupta et al. [8]), or a more general RRT model where the scrambled response is given by \(TY+S\) (as in Diana and Perri [5]). Note that the simple additive model is a special case of the second model obtained by letting \({\text {Var}}(T)=0\) and \(E(T)=1\). Khalil et al. [13] showed that the simple additive model is more efficient but the general model offers greater privacy. However, the general RRT model is better under the combined measure of efficiency and privacy \(\delta = \frac{{\text {Var}}(Z)}{\Delta }\) proposed by Gupta et al. [10], where Z is the scrambled response and \(\Delta =E(Z-Y)^2\) is the privacy level of the same model, as given by Yan et al. [27]. One may note that the model with the smaller \(\delta\) value is preferred because it means a larger privacy level, a smaller value of \({\text {Var}}({\hat{\mu }})\), or both. It may be observed that

$$\delta _{\mathrm{additive}\,\mathrm{RRT}}=1+\frac{\sigma _y^2}{\sigma _s^2}>1+\frac{\sigma _y^2}{\sigma _s^2+\sigma _T^2(\mu _y^2+\sigma _y^2)}=\delta _{\mathrm{general}\,\mathrm{RRT}}.$$
(4)

Hence, while working with the general RRT model, the scrambling variable T will put a burden on the model efficiency but will improve the privacy level. Overall, the general model is better in terms of the unified measure of efficiency and privacy.
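As an arithmetic check of (4), the two \(\delta\) values can be computed directly. The Python sketch below uses illustrative assumed values (\(\sigma_y^2=16\), \(\mu_y=10\), \(\sigma_S^2=4\), \(\sigma_T^2=0.5\), echoing the later simulation setup) and confirms that the general model attains the smaller, i.e., better, \(\delta\).

```python
def delta_additive(sigma_y2, sigma_S2):
    """Unified efficiency-privacy measure for the additive model Z = Y + S."""
    return 1 + sigma_y2 / sigma_S2

def delta_general(sigma_y2, sigma_S2, sigma_T2, mu_y):
    """Unified measure for Z = TY + S; the sigma_T^2 term enlarges the
    privacy level Delta in the denominator, shrinking delta."""
    return 1 + sigma_y2 / (sigma_S2 + sigma_T2 * (mu_y ** 2 + sigma_y2))

# Illustrative assumed parameters
d_add = delta_additive(16.0, 4.0)            # 1 + 16/4 = 5.0
d_gen = delta_general(16.0, 4.0, 0.5, 10.0)  # 1 + 16/62, roughly 1.26
```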

Therefore, we will use the general scrambling model in this study. The optional version of model \(Z=TY+S\) is given by

$$Z=\left\{ \begin{array}{ll} Y&{\text {with probability}}\ 1-W\\ TY+S&{\text {with probability}}\ W, \end{array} \right.$$
(5)

where W is the probability that a respondent finds the question sensitive. The mean and variance, respectively, for Z are given by

$$E(Z)=E(Y)(1-W)+E(TY+S)W= E(Y)$$
(6)

and

$${\text {Var}}(Z)=E(Z^2)-E^2(Z)=\sigma _y^2+\sigma _S^2W+\sigma _T^2(\sigma _y^2+\mu _y^2)W.$$
(7)

Clearly, the optional RRT model is more efficient than the non-optional RRT model, since the variance of Z increases as W increases. When \(W=1\), the optional model reduces to the non-optional model.

2.3 Modified Version of Hansen and Hurwitz [11]: Two-Phase Sampling

To encourage respondents to answer a sensitive survey question truthfully, we give them the opportunity to scramble the response using ORRT in the second phase of the HH procedure, when there is a face-to-face interview. We thus modify the HH procedure by assuming that in the first phase the respondent group gives direct answers, and in the second phase the ORRT model is used to obtain responses from a subsample of the non-respondents.

From Sect. 2.2, we can write the general RRT model as \(Z=(YT+S)J+Y(1-J)\), where J \(\sim\) Bernoulli(W). Therefore, \(E(J)=W\), \({\text {Var}}(J)=W(1-W)\) and \(E(J^2)={\text {Var}}(J)+ E^2(J)=W\).

The expectation under randomization mechanism is given by

$$\begin{aligned} E_R(Z)&=E_R(TYJ+SJ+Y-YJ)\\&=Y E_R(TJ)+E_R(SJ)+Y-YE_R(J)\\&=Y\mu _T W+\mu _SW+Y-YW\\&=(\mu _TW+1-W)Y+\mu _SW. \end{aligned}$$
(8)

Also

$$\begin{aligned} V_R(Z)&=V_R(TYJ+SJ+Y-YJ)\\&=V_R(TYJ)+V_R(SJ)+V_R(YJ)+2{\text {Cov}}(TYJ,SJ)-2{\text {Cov}}(TYJ,YJ)\\&\quad -2{\text {Cov}}(SJ,YJ)\\&=Y^2[(\sigma _T^2+\mu _T^2)W-\mu _T^2W^2] + [(\sigma _S^2+\mu _S^2)W-\mu _S^2W^2]+Y^2W(1-W)\\&\quad + 2Y\mu _T\mu _SW(1-W)-2Y^2\mu _TW(1-W)-2Y\mu _SW(1-W)\\&=(Y^2\sigma _T^2+\sigma _S^2)W. \end{aligned}$$
(9)

Let \({\hat{y}}_i\) be a transformation of the randomized response on the ith unit whose expectation under the randomization mechanism is the true response \(y_i\). It is given by

$${\hat{y}}_i=\frac{z_i-\mu _S W}{\mu _TW+1-W}$$
(10)

with

$$E_R({\hat{y}}_i) = y_i$$
(11)

(from (8)), and

$$\begin{aligned} V_R({\hat{y}}_i)&=\frac{V_R(z_i)}{(\mu _TW+1-W)^2}\\&=\frac{[y_i^2\sigma _T^2+\sigma _s^2]W}{(\mu _TW+1-W)^2} = \tau _i \end{aligned}$$
(12)

(from (9)).
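The unbiasedness in (11) and the variance in (12) can be checked by simulating the randomization mechanism. The Python sketch below uses illustrative assumptions (one fixed true response \(y=10\), \(W=0.8\), \(T\sim N(1,0.5)\), \(S\sim N(0,4)\)); with \(\mu_T=1\) the denominator \(\mu_T W+1-W\) equals 1, and (12) gives \(\tau_i=(y^2\sigma_T^2+\sigma_S^2)W=43.2\).

```python
import random

random.seed(1)
W, mu_T, mu_S = 0.8, 1.0, 0.0
sd_T, sd_S = 0.5 ** 0.5, 2.0      # sigma_T^2 = 0.5, sigma_S^2 = 4
y = 10.0                          # one fixed true response

denom = mu_T * W + 1 - W          # denominator of Eq. (10); equals 1 here
reps = 100000
y_hats = []
for _ in range(reps):
    J = 1 if random.random() < W else 0   # J ~ Bernoulli(W)
    z = (random.gauss(mu_T, sd_T) * y + random.gauss(mu_S, sd_S)) * J \
        + y * (1 - J)                     # Z = (TY + S)J + Y(1 - J)
    y_hats.append((z - mu_S * W) / denom) # transformation, Eq. (10)

mean_hat = sum(y_hats) / reps
var_hat = sum((v - mean_hat) ** 2 for v in y_hats) / (reps - 1)
# mean_hat should be near y = 10 (Eq. (11)); var_hat near 43.2 (Eq. (12))
```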

With ORRT model, a modified version of the HH estimator is given by

$$\hat{{\bar{y}}}=w_1 {\bar{y}}_1+w_2 \hat{{\bar{y}}}_2,$$
(13)

where \(\hat{{\bar{y}}}_2 = \sum _{i=1}^{n_s} (\frac{\hat{y_i}}{n_s})\).

Let \(E_i\) and \(V_i\) be the expectation and variance in the ith phase (\(i=1,2\)) under the two-phase sampling. It is easy to verify that

$$\begin{aligned} E(\hat{{\bar{y}}})&=E_1E_2[w_1{\bar{y}}_1+w_2\hat{{\bar{y}}}_2]\\&=E_1[w_1{\bar{y}}_1+w_2E_R(\hat{{\bar{y}}}_2)]\\&=E_1[w_1{\bar{y}}_1+w_2{\bar{y}}_2]\\&=W_1\mu _{y_{(1)}} + W_2\mu _{y_{(2)}}\\&= \mu _y \end{aligned}$$
(14)

since \(E_R(\hat{{\bar{y}}}_2)=\frac{1}{n_s}\sum _{i=1}^{n_s}E_R({\hat{y}}_i)={\bar{y}}_2\).

The variance of \(\hat{{\bar{y}}}\) can be written as

$$\begin{aligned} {\text {Var}}(\hat{{\bar{y}}})&=E_1[V_2(\hat{{\bar{y}}})]+V_1[E_2(\hat{{\bar{y}}})]\\&=E_1[V_2(w_1{\bar{y}}_1+w_2\hat{{\bar{y}}}_2)]+V_1[E_2(w_1{\bar{y}}_1+w_2\hat{{\bar{y}}}_2)]\\&=E_1[0+V_2(w_2\hat{{\bar{y}}}_2)]+V_1[w_1{\bar{y}}_1+w_2{\bar{y}}_2]\\&=E_1[V_2(w_2\hat{{\bar{y}}}_2)]+V_1({\bar{y}})\\&=E_1[\frac{w_2^2}{n_s}\frac{\sum _{i=1}^{N_2}\frac{(y_i^2\sigma _T^2+\sigma _s^2)W}{(\mu _TW+1-W)^2}}{N_2}]+V({\bar{y}})\\&={\text {Var}}({\bar{y}})+\frac{W_2f}{n}\frac{\sum _{i=1}^{N_2}\tau _i}{N_2} . \end{aligned}$$
(15)

Note \(E(y_i^2)=\sigma _y^2+\mu _y^2\), and

$$E\left( \frac{w_2^2}{n_s}\right) =E\left( \frac{n_2^2}{n^2}\frac{f}{n_2}\right) =E\left( \frac{n_2f}{n^2}\right) =\frac{f}{n^2}E(n_2)=\frac{f}{n^2}(nW_2)=\frac{W_2f}{n},$$
(16)

if we assume \(\frac{n}{N} \approx \frac{n_2}{N_2}\).

Since \({\bar{y}}\) is the original HH mean estimator, the variance of \(\hat{{\bar{y}}}\) is given by

$${\text {Var}}(\hat{{\bar{y}}})=\theta \sigma _y^2 + \lambda \sigma _{y_{(2)}}^2 + \frac{W_2f}{n}\left[ \frac{[(\sigma _{y_{(2)}}^2+\mu _{y_{(2)}}^2)\sigma _T^2+\sigma _S^2]W}{(\mu _TW+1-W)^2}\right] ,$$
(17)

where \(\theta =\frac{(N-n)}{Nn}\) and \(\lambda =\frac{(f-1)W_2}{n}\).
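Expression (17) is straightforward to evaluate. Below is a small Python helper with illustrative assumed values matching the later simulation setup (\(N=5000\), \(n=500\), \(W_2=0.6\), \(f=2\), \(W=0.8\), \(\mu_T=1\), \(\sigma_T^2=0.5\), \(\sigma_S^2=4\), \(\sigma_y^2=\sigma_{y_{(2)}}^2=16\), \(\mu_{y_{(2)}}=10\)).

```python
def var_modified_hh(N, n, W2, f, W, mu_T, s_T2, s_S2, s_y2, s_y2_2, mu_y_2):
    """Variance of the modified HH estimator, Eq. (17)."""
    theta = (N - n) / (N * n)
    lam = (f - 1) * W2 / n
    # The last term of (17): extra variance due to the ORRT scrambling
    G = (W2 * f / n) * (((s_y2_2 + mu_y_2 ** 2) * s_T2 + s_S2) * W
                        / (mu_T * W + 1 - W) ** 2)
    return theta * s_y2 + lam * s_y2_2 + G

v = var_modified_hh(N=5000, n=500, W2=0.6, f=2, W=0.8, mu_T=1.0,
                    s_T2=0.5, s_S2=4.0, s_y2=16.0, s_y2_2=16.0, mu_y_2=10.0)
# v = 0.0288 + 0.0192 + 0.11904 = 0.16704
```

The third term dominates here, quantifying the price paid for privacy protection in the second phase.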

3 Mean Estimators Under Measurement Errors and Non-response

3.1 Existing Mean Estimators

Using the standard terminology of Sect. 2.1, let \(\mu _x=\frac{\sum _{i=1}^{N}x_i}{N}\) and \(\sigma _x^2=\frac{\sum _{i=1}^{N}(x_i-\mu _x)^2}{N-1}\), respectively, be the known population mean and variance of the auxiliary variable X. Let \(\mu _{x_{(1)}}=\frac{\sum _{i=1}^{N_1}x_i}{N_1}\) and \(\sigma _{x_{(1)}}^2=\frac{\sum _{i=1}^{N_1}(x_i-\mu _{x_{(1)}})^2}{N_1-1}\), respectively, be the population mean and variance of the respondent group of size \(N_1\), and \(\mu _{x_{(2)}}=\frac{\sum _{i=1}^{N_2}x_i}{N_2}\) and \(\sigma _{x_{(2)}}^2=\frac{\sum _{i=1}^{N_2}(x_i-\mu _{x_{(2)}})^2}{N_2-1}\), respectively, be the population mean and variance of the non-respondent group of size \(N_2\). Let \(\rho _{xy}=\frac{\sigma _{xy}}{\sigma _x\sigma _y}\) be the correlation coefficient between X and Y. Similarly, let \(\rho _{{xy}_{(1)}}=\frac{\sigma _{{xy}_{(1)}}}{\sigma _{x_{(1)}}\sigma _{y_{(1)}}}\) and \(\rho _{{xy}_{(2)}}=\frac{\sigma _{{xy}_{(2)}}}{\sigma _{x_{(2)}}\sigma _{y_{(2)}}}\), respectively, be the correlation coefficients between X and Y for the respondent group and the non-respondent group. Let the measurement error (ME) for the auxiliary variable (X) in the population be given by \(V_i=x_i-X_i\), and let the MEs associated with the study variable (Y) in the population and the scrambled variable (Z) in the face-to-face interview phase be given by \(U_i=y_i-Y_i\) and \(P_i=z_i-Z_i\), respectively. These measurement errors are assumed to be random and uncorrelated, with mean zero and variances \(\sigma _v^2\), \(\sigma _u^2\), and \(\sigma _p^2\), respectively.

Assume the population mean \(\mu _x\) of the auxiliary variable is known, and that non-response occurs on both X and Y. ORRT versions of some of the existing mean estimators are listed below.

1. The ordinary mean estimator for a sensitive variable in a finite population under the modified HH procedure is given by

$${\hat{\mu }}_{yw}= \hat{{\bar{y}}}^*= w_1 \bar{y_1}+w_2 \bar{y_2}^*,\quad {\text {where}} \,\bar{y_2}^* = \frac{1}{n_s}\sum _{i=1}^{n_s} z_{i}.$$
(18)

The MSE of \({\hat{\mu }}_{yw}\) in the presence of measurement errors is given by

$${\text {MSE}}({\hat{\mu }}_{yw})=\theta (\sigma _y^2 + \sigma _u^2) + \lambda (\sigma _{y_{(2)}}^2+\sigma _{p}^2) + G,$$
(19)

where \(\theta =\frac{N-n}{Nn}\), \(\lambda = \frac{N_2(f-1)}{Nn}\), and \(G = \frac{W_2f}{n}[\frac{[(\sigma _{y_{(2)}}^2+\mu _{y_{(2)}}^2)\sigma _T^2+\sigma _s^2]W}{(\mu _TW+1-W)^2}]\).

2. A ratio estimator corresponding to the one in Gupta et al. [9] under the modified HH procedure is given by

$${\hat{\mu }}_{rw}=\frac{\hat{{\bar{y}}}^*}{{\bar{x}}^*}\mu _{x}={\hat{R}}_W^*\mu _{x},$$
(20)

where \(\hat{{\bar{y}}}^*\) is the ordinary mean estimator under modified HH and \({\bar{x}}^* = w_1 \bar{x_1}+w_2 \bar{x_2}\) is the ordinary mean estimator under original HH procedure. The MSE of \({\hat{\mu }}_{rw}\) in the presence of measurement errors is given by

$$\begin{aligned} {\text {MSE}}^*({\hat{\mu }}_{rw})&= \theta (\sigma _y^2+R^2\sigma _x^2-2R\rho _{yx}\sigma _y \sigma _x)+ \lambda (\sigma _{y_{(2)}}^2+R^2\sigma _{x_{(2)}}^2-2R\rho _{zx_{(2)}}\sigma _z \sigma _{x_{(2)}})\\&\quad +\theta (\sigma _u^2+R^2\sigma _v^2) + \lambda (\sigma _p^2+R^2\sigma _v^2)+G, \end{aligned}$$
(21)

where \(R = \mu _y/\mu _{x}\) and \(\rho _{zx_{(2)}}=\frac{\rho _{yx(2)}}{\sqrt{1+\frac{[\sigma _s^2 +\sigma _T^2(\sigma _{y_{(2)}}^2+\mu _{y_{(2)}}^2)]W}{\sigma _{y_{(2)}}^2}}}\).

The MSE of \({\hat{\mu }}_{yw}\) and \({\hat{\mu }}_{rw}\), without measurement errors, may be obtained by putting \(\sigma _v^2 =\sigma _u^2=\sigma _{p}^2=0\) in the above equations.
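To make the comparison of (19) and (21) concrete, the Python sketch below evaluates both MSEs under illustrative assumed parameter values (the same ones echoed in the simulation section: \(\mu_y=10\), \(\mu_x=6\), \(\sigma_y^2=16\), \(\sigma_x^2=8\), \(\rho_{yx}=0.8\), \(W=0.8\), \(\sigma_T^2=0.5\), \(\sigma_S^2=4\), \(\sigma_u^2=\sigma_v^2=\sigma_p^2=1\), with respondent and non-respondent group parameters taken equal for simplicity). With strong correlation and small \(\sigma_v^2\), the ratio estimator beats the ordinary one, in line with condition (32).

```python
import math

# Assumed illustrative parameters
N, n, W2, f, W, mu_T = 5000, 500, 0.6, 2, 0.8, 1.0
mu_y, mu_x = 10.0, 6.0
s_y2 = s_y2_2 = 16.0          # sigma_y^2 and sigma_y(2)^2
s_x2 = s_x2_2 = 8.0
mu_y_2 = 10.0                 # mu_y(2)
s_T2, s_S2 = 0.5, 4.0
s_u2 = s_v2 = s_p2 = 1.0      # measurement-error variances
rho_yx = rho_yx2 = 0.8

theta = (N - n) / (N * n)
lam = W2 * (f - 1) / n
G = (W2 * f / n) * (((s_y2_2 + mu_y_2 ** 2) * s_T2 + s_S2) * W
                    / (mu_T * W + 1 - W) ** 2)

# Eq. (19): ordinary ORRT mean estimator with measurement errors
mse_yw = theta * (s_y2 + s_u2) + lam * (s_y2_2 + s_p2) + G

# Eq. (21): ratio estimator; rho_zx(2) is the attenuated correlation
R = mu_y / mu_x
A = (s_S2 + s_T2 * (s_y2_2 + mu_y_2 ** 2)) * W
s_z = math.sqrt(s_y2_2 + A)                    # sd of Z, from Eq. (7)
rho_zx2 = rho_yx2 / math.sqrt(1 + A / s_y2_2)
mse_rw = (theta * (s_y2 + R**2 * s_x2 - 2 * R * rho_yx * math.sqrt(s_y2 * s_x2))
          + lam * (s_y2_2 + R**2 * s_x2_2
                   - 2 * R * rho_zx2 * s_z * math.sqrt(s_x2_2))
          + theta * (s_u2 + R**2 * s_v2) + lam * (s_p2 + R**2 * s_v2) + G)
```

Note that \(\rho_{zx_{(2)}}\sigma_z=\rho_{yx_{(2)}}\sigma_{y_{(2)}}\): the attenuation of the correlation is exactly offset by the inflation of \(\sigma_z\).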

3.2 Proposed Mean Estimator

With this background, we use the generalized mean estimator considered in Khalil et al. [12, 13] but with non-response. This mean estimator includes a wide variety of mean estimators as special cases. The non-response version of this estimator is given by

$${\hat{\mu }}_{pw}=(\hat{{\bar{y}}}^*+k(\mu _x-{\bar{x}}^*))\left( \frac{{\bar{D}}}{{\bar{d}}}\right) ^v$$
(22)

where \({\bar{d}}=\phi (\alpha {\bar{x}}^*+\beta )+(1-\phi )(\alpha \mu _x+\beta )\) , \({\bar{D}}=\alpha \mu _x+\beta\), k and v are suitable constants, and \(\phi\) is assumed to be an unknown constant whose value is to be determined from optimality considerations. Also \(\alpha\) and \(\beta\) are assumed to be some known parameters of the auxiliary variable X. Various estimators may be obtained by using different values of \(\alpha\) and \(\beta\). With \(v=1\), we get various regression-in-ratio estimators, and with \(v=-1\), we get various regression-in-product estimators.
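For concreteness, the class (22) can be coded directly. The Python sketch below uses hypothetical sample values (\(\hat{{\bar{y}}}^*=9.9\), \({\bar{x}}^*=6.1\), \(\mu_x=6\)) and the choice \(\alpha=1\), \(\beta=0\), so that \({\bar{D}}=\mu_x\); setting \(\phi=0\) recovers the ordinary regression estimator as a special case.

```python
def generalized_estimator(ybar_star, xbar_star, mu_x, k, v, phi,
                          alpha=1.0, beta=0.0):
    """Proposed class of estimators, Eq. (22)."""
    D = alpha * mu_x + beta
    d = phi * (alpha * xbar_star + beta) + (1 - phi) * D
    return (ybar_star + k * (mu_x - xbar_star)) * (D / d) ** v

# Hypothetical sample means: ybar* = 9.9, xbar* = 6.1, with mu_x = 6
full = generalized_estimator(9.9, 6.1, 6.0, k=1.0, v=1.0, phi=0.5)
regr = generalized_estimator(9.9, 6.1, 6.0, k=1.0, v=1.0, phi=0.0)
# phi = 0 gives the plain regression estimator 9.9 + 1*(6 - 6.1) = 9.8
```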

To obtain the MSE of this estimator, we define \(\hat{{\bar{y}}}^*=\mu _{y}(1+e_0^*)\) and \({\bar{x}}^*=\mu _{x}(1+e_1^*)\) such that \(E(e_0^*) = E(e_1^*) = 0\); \(E(e_0^{*2}) = \frac{1}{\mu _{y}^2}\left[ \theta (\sigma _y^2+\sigma _u^2)+\lambda (\sigma _{y_{(2)}}^2+\sigma _{p}^2)+\frac{W_2f}{n}\frac{[(\sigma _{y_{(2)}}^2+\mu _{y_{(2)}}^2)\sigma _T^2 +\sigma _s^2]W}{(\mu _T W + 1 - W)^2}\right]\); \(E(e_1^{*2})= \frac{1}{\mu _{x}^2}[\theta (\sigma _x^2+\sigma _v^2)+\lambda (\sigma _{x_{(2)}}^2+\sigma _{v}^2)]\); and \(E(e_0^*e_1^*)=\theta \rho _{xy} \frac{\sigma _y}{\mu _y} \frac{\sigma _x}{\mu _x}+\lambda \rho _{zx_{(2)}} \frac{\sigma _z}{\mu _z} \frac{\sigma _{x_{(2)}}}{\mu _x}\), where \(\rho _{zx_{(2)}}=\frac{\rho _{yx(2)}}{\sqrt{1+\frac{[\sigma _s^2 +\sigma _T^2(\sigma _{y_{(2)}}^2+\mu _{y_{(2)}}^2)]W}{\sigma _{y_{(2)}}^2}}}\).

The bias of the proposed estimator, up to the second order of approximation, in the presence of measurement errors, is given by

$$\begin{aligned} Bias^{*}({\hat{\mu }}_{pw})&\approx \theta \left[ \left( kH+\frac{v+1}{v}\mu _y H^2\right) (\sigma _x^2+\sigma _v^2)-H\rho _{yx}\sigma _y\sigma _x\right] \\&\quad +\lambda \left[ \left( kH+\frac{v+1}{v}\mu _y H^2\right) (\sigma _{x(2)}^2+\sigma _{v}^2)-H\rho _{zx(2)}\sigma _z\sigma _{x(2)}\right] , \end{aligned}$$
(23)

where \(H=\frac{\alpha \phi v}{\alpha \mu _x+\beta }\). The bias of \({\hat{\mu }}_{pw}\) without measurement errors may be obtained by setting \(\sigma _v^2=0\) in the above equation.

Using Taylor’s approximation up to the first order, we have

$${\hat{\mu }}_{pw}-\mu _y\approx e_0^* \mu _y - k \mu _x e_1^* - H\mu _x\mu _y e_1^*.$$
(24)

Squaring (24) and taking expectations, we have

$$\begin{aligned} ({\hat{\mu }}_{pw}-\mu _y)^2&= e_0^{*2} \mu _y^2+k^2\mu _x^2e_{1}^{*2} + (H \mu _x \mu _y e_1^*)^2-2e_0^* e_1^* k \mu _x \mu _y - 2e_0^*e_1^*H\mu _x\mu _y^2 \\&\quad + 2e_1^{*2}H\mu _x^2\mu _y, \end{aligned}$$
(25)

and

$$\begin{aligned} {\text {MSE}}^*({\hat{\mu }}_{pw})&= E({\hat{\mu }}_{pw}-\mu _y)^2\\&=\theta [\sigma _y^2 + (k+\phi v R_{pw})^2 \sigma _x^2 -2(k+\phi vR_{pw})\rho _{yx}\sigma _x\sigma _y]\\&\quad + \lambda [\sigma _{y_{(2)}}^2 + (k+\phi v R_{pw})^2 \sigma _{x_{(2)}}^2 -2(k+\phi vR_{pw})\rho _{zx_{(2)}}\sigma _{x_{(2)}}\sigma _z]\\&\quad + \theta [\sigma _u^2 + (k+\phi v R_{pw})^2 \sigma _v^2]+\lambda [\sigma _p^2 + (k+\phi v R_{pw})^2 \sigma _v^2]+G, \end{aligned}$$
(26)

where \(R_{pw}=\frac{\alpha \mu _y}{\alpha \mu _x+\beta }\).

Minimization of the above expression (26) with respect to \(\phi\) yields its optimum value as:

$$\phi _{opt}\cong \frac{\theta (\rho _{xy}\sigma _x\sigma _y-k(\sigma _x^2+\sigma _v^2))+\lambda (\rho _{zx_{(2)}}\sigma _z\sigma _{x_{(2)}}-k(\sigma _{x_{(2)}}^2+\sigma _{v}^2))}{vR_{pw}[\theta (\sigma _x^2+\sigma _v^2)+\lambda (\sigma _{x_{(2)}}^2+\sigma _{v}^2)]}.$$
(27)

Substitution of \(\phi _{opt}\) in MSE(\({\hat{\mu }}_{pw}\)) yields the minimum value as:

$$\begin{aligned} {\text {MSE}}_{\mathrm{min}}^*({\hat{\mu }}_{pw})&\cong \theta (\sigma _y^2 + P^2 \sigma _x^2 -2P\rho _{yx}\sigma _x\sigma _y) + \lambda (\sigma _{y_{(2)}}^2 + P^2 \sigma _{x_{(2)}}^2 -2P\rho _{zx_{(2)}} \sigma _z\sigma _{x_{(2)}}) \\&\quad +\theta (\sigma _u^2+ P^2\sigma _v^2)+ \lambda (\sigma _{p}^2 + P^2 \sigma _{v}^2) + G, \end{aligned}$$
(28)

where \(P=\frac{\theta \rho _{yx}\sigma _x\sigma _y+\lambda \rho _{zx_{(2)}}\sigma _z\sigma _{x_{(2)}}}{\theta (\sigma _x^2+\sigma _v^2)+\lambda (\sigma _{x_{(2)}}^2+\sigma _{v}^2)}\).

The expression for the minimized MSE of the proposed estimator without ME may be obtained by putting \(\sigma _u^2=\sigma _v^2 =\sigma _{p}^2=0\) in the above expression, which gives

$${\text {MSE}}_{\mathrm{min}}({\hat{\mu }}_{pw}) \cong \theta (\sigma _y^2 + P^2 \sigma _x^2 -2P\rho _{yx}\sigma _x\sigma _y) + \lambda (\sigma _{y_{(2)}}^2 + P^2 \sigma _{x_{(2)}}^2 -2P\rho _{zx_{(2)}} \sigma _z\sigma _{x_{(2)}}) + G.$$
(29)
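Continuing the illustrative numbers used earlier (assumed: \(\mu_y=10\), \(\mu_x=6\), \(\sigma_y^2=16\), \(\sigma_x^2=8\), \(\rho_{yx}=0.8\), \(W=0.8\), \(\sigma_T^2=0.5\), \(\sigma_S^2=4\), \(\sigma_u^2=\sigma_v^2=\sigma_p^2=1\), group parameters equal, \(N=5000\), \(n=500\), \(W_2=0.6\), \(f=2\)), the Python sketch below evaluates P and the minimized MSE (28) and checks that it falls below the ordinary estimator's MSE (19), as condition (30) guarantees.

```python
import math

# Assumed illustrative parameters (same as the simulation setup)
N, n, W2, f, W, mu_T = 5000, 500, 0.6, 2, 0.8, 1.0
mu_y_2 = 10.0
s_y2 = s_y2_2 = 16.0
s_x2 = s_x2_2 = 8.0
s_T2, s_S2 = 0.5, 4.0
s_u2 = s_v2 = s_p2 = 1.0
rho_yx = rho_yx2 = 0.8

theta = (N - n) / (N * n)
lam = W2 * (f - 1) / n
G = (W2 * f / n) * (((s_y2_2 + mu_y_2 ** 2) * s_T2 + s_S2) * W
                    / (mu_T * W + 1 - W) ** 2)

A = (s_S2 + s_T2 * (s_y2_2 + mu_y_2 ** 2)) * W
s_z = math.sqrt(s_y2_2 + A)
rho_zx2 = rho_yx2 / math.sqrt(1 + A / s_y2_2)
s_x, s_x_2, s_y = math.sqrt(s_x2), math.sqrt(s_x2_2), math.sqrt(s_y2)

# P as defined below Eq. (28); phi_opt in (27) follows from P given k
P = ((theta * rho_yx * s_x * s_y + lam * rho_zx2 * s_z * s_x_2)
     / (theta * (s_x2 + s_v2) + lam * (s_x2_2 + s_v2)))

# Minimized MSE, Eq. (28)
mse_min = (theta * (s_y2 + P**2 * s_x2 - 2 * P * rho_yx * s_x * s_y)
           + lam * (s_y2_2 + P**2 * s_x2_2 - 2 * P * rho_zx2 * s_z * s_x_2)
           + theta * (s_u2 + P**2 * s_v2) + lam * (s_p2 + P**2 * s_v2) + G)

mse_yw = theta * (s_y2 + s_u2) + lam * (s_y2_2 + s_p2) + G   # Eq. (19)
```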

Comparing the MSE expressions of \({\hat{\mu }}_{yw}\) in (19), \({\hat{\mu }}_{rw}\) in (21), and \({\hat{\mu }}_{pw}\) in (28) with measurement errors, it can be verified easily that

  • \({\text {MSE}}_{\mathrm{min}}^*({\hat{\mu }}_{pw})<{\text {MSE}}^*({\hat{\mu }}_{yw})\) if

    $$-\frac{(\theta \rho _{yx}\sigma _x\sigma _y+\lambda \rho _{zx_{(2)}}\sigma _z\sigma _{x_{(2)}})^2}{\theta (\sigma _x^2+\sigma _v^2)+\lambda (\sigma _{x_{(2)}}^2+\sigma _{v}^2)} < 0,$$
    (30)
  • \({\text {MSE}}_{\mathrm{min}}^*({\hat{\mu }}_{pw})<{\text {MSE}}^*({\hat{\mu }}_{rw})\) if

    $$\frac{1}{2}- \frac{\mu _y}{2\mu _x}\frac{\theta (\sigma _x^2+\sigma _v^2)+\lambda (\sigma _{x_{(2)}}^2+\sigma _{v}^2)}{\theta \rho _{yx}\sigma _x\sigma _y+\lambda \rho _{zx_{(2)}}\sigma _z\sigma _{x_{(2)}}}< 1$$
    (31)

    and

  • \({\text {MSE}}^*({\hat{\mu }}_{rw})<{\text {MSE}}^*({\hat{\mu }}_{yw})\) if

    $$\frac{\mu _y}{2\mu _x}\frac{\theta (\sigma _x^2+\sigma _v^2) +\lambda (\sigma _{x_{(2)}}^2+\sigma _{v}^2)}{\theta \rho _{yx}\sigma _x\sigma _y +\lambda \rho _{zx_{(2)}}\sigma _z\sigma _{x_{(2)}}} < 1$$
    (32)

The conditions (30) and (31) always hold true. From (32), the ratio estimator is generally more efficient than the ordinary mean estimator if the measurement error on auxiliary variable X (\(\sigma _v^2\)) is small, and X and Y are strongly correlated.

4 Simulations

We now compare the performance of the generalized mean estimator under simple random sampling with the other two estimators through a simulation study. In the generalized mean estimator, we choose v and k to be 1, and \(\phi\) to be its optimum value; in the simulation, \(\phi\) is calculated by plugging the corresponding sample values into (27). As for \(\alpha\) and \(\beta\), we could use various parameters associated with the auxiliary variable, such as the coefficient of variation (\(C_x\)) or the kurtosis, but these choices do not impact the results in any meaningful way: as seen in (28), the minimized MSE is independent of \(\alpha\) and \(\beta\), and extensive simulations showed that the empirical MSEs are also almost the same for all choices of \(\alpha\) and \(\beta\). Therefore, we only show the results for \(\alpha =1\) and \(\beta =0\). The scrambling variable S is taken to be a normal variate with mean zero and variance \(\sigma _s^2=0.5\sigma _x^2\). T is also taken to be a normal variate, but with mean one and different variances. The measurement errors on X in both phases, on Y in the first phase, and on Z in the second phase all follow normal distributions with mean zero. We use different variances (1, 5, 10) for the measurement errors.

We consider a finite population of size 5000 generated from a bivariate normal distribution with means and covariance of (Y, X) as given below.

$$\mathbf{Population} \quad \mu =\begin{bmatrix} 10\\ 6 \end{bmatrix}, \quad \Sigma =\begin{bmatrix} 16 & 9.051\\ 9.051 & 8 \end{bmatrix}, \quad \rho _{yx}=0.8$$

The parameters of the set of 5000 data points generated using R are very close to the parameter values in (A) but not exactly the same. For the simulation study, we use the parameter values in (B), not those in (A).

$$\mu _{x}=6, \quad \sigma _{x}^2=8, \quad \mu _{y}=10,\quad \sigma _{y}^2=16,\quad \rho _{yx}=0.8$$
(A)
$$\mu _{x}=6.0228,\quad \sigma _{x}^2=8.1830, \quad \mu _{y}=9.9864, \quad \sigma _{y}^2=16.1215, \quad \rho _{yx}=0.8024$$
(B)

We consider samples of size \(n = 500\) drawn by SRSWOR and assume a response rate of 40% in the first phase. This means that in the first phase, only 200 (\(n_1\)) subjects respond to the survey question and 300 (\(n_2\)) do not. In the second phase, we take a subsample (\(n_s=\frac{n_2}{f}\)) from the non-respondent group using \(f=2\), 3, and 4.

Coding for the simulations was done in R, and the results are averaged over 5000 iterations. The empirical MSE of an estimator \({\hat{\mu }}_w\) is computed by

$${\text {MSE}}^*({\hat{\mu }}_w)=\frac{1}{5000}\sum _{i=1}^{5000}({\hat{\mu }}_w-\mu )^2,$$
(33)

where \({\hat{\mu }}_w={\hat{\mu }}_{yw}\), \({\hat{\mu }}_{rw}\), and \({\hat{\mu }}_{pw}\). Here, \(\mu\) is the population mean of the sensitive study variable. The percent relative efficiency (PRE) of the estimator (\({\hat{\mu }}_w\)) with respect to the ordinary mean estimator (\({\hat{\mu }}_{yw}\)) is defined as

$${\text {PRE}}=\frac{{\text {MSE}}^*({\hat{\mu }}_{yw})}{{\text {MSE}}^*({\hat{\mu }}_{w})}*100.$$
(34)
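As a sanity check on the design above, the sketch below is a scaled-down Python stand-in for the paper's R code (2000 rather than 5000 iterations, the ordinary estimator (18) only, and non-response assigned at random): it generates a finite population close to (A) via the conditional-normal representation, draws two-phase samples with a 40% first-call response rate, scrambles the phase-2 subsample with the ORRT model, and computes the empirical MSE (33). With \(\mu_T=1\), \(\sigma_T^2=0.5\), \(\sigma_S^2=4\), and no measurement errors, this lands near the theoretical value of about 0.167 from (17).

```python
import random

random.seed(42)
N, n, f, resp_rate = 5000, 500, 2, 0.4
W, mu_T, sd_T, sd_S = 0.8, 1.0, 0.5 ** 0.5, 2.0  # sigma_T^2=0.5, sigma_S^2=4

# Finite population close to (A), built from the conditional normal of Y | X
rho, s_y, s_x = 0.8, 4.0, 8 ** 0.5
pop_y = []
for _ in range(N):
    x = random.gauss(6.0, s_x)
    pop_y.append(random.gauss(10.0 + rho * s_y / s_x * (x - 6.0),
                              s_y * (1 - rho ** 2) ** 0.5))
mu_y = sum(pop_y) / N

def one_estimate():
    sample = random.sample(pop_y, n)        # SRSWOR; order is random
    n1 = int(resp_rate * n)                 # 200 respondents on first call
    resp, nonresp = sample[:n1], sample[n1:]
    sub = random.sample(nonresp, len(nonresp) // f)  # n_s = n_2 / f
    ybar1 = sum(resp) / n1
    zbar2 = 0.0
    for y in sub:                           # ORRT response, Eq. (5)
        J = 1 if random.random() < W else 0
        zbar2 += ((random.gauss(mu_T, sd_T) * y + random.gauss(0.0, sd_S)) * J
                  + y * (1 - J)) / len(sub)
    return (n1 / n) * ybar1 + ((n - n1) / n) * zbar2  # Eq. (18)

reps = 2000
emp_mse = sum((one_estimate() - mu_y) ** 2 for _ in range(reps)) / reps
```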

We will also use the unified measure \(\delta\) of the efficiency and the privacy as defined in Gupta et al. [10]. It is given by

$$\delta =\frac{{\text {MSE}}^*({\hat{\mu }}_w)}{\Delta _{DP}}.$$
(35)

In (35), MSE is used in place of Var(.) to account for biased estimators.

The simulation results are provided in the two tables below. In Table 1, we fix the response rate, Var(T), Var(S), and W, and study the impact of varying the size of the measurement errors and the sampling fraction (f) in phase 2. In Table 2, we examine the impact of Var(T) and W.

Table 1 Theoretical (bold) and empirical MSEs/PREs of the ORRT estimators with \(\sigma _v^2 = \sigma _u^2 = \sigma _p^2 = 1, 5, 10\) when response rate \(= 40\%\), \(W = 0.8\), \(\sigma _T^2 = 0.5\) and \(\sigma _s^2 = 0.5\sigma _x^2\)
Table 2 Theoretical (bold) and empirical MSEs/PREs of the ORRT estimators when response rate \(= 40\%\), \(\sigma _v^2 = \sigma _u^2= \sigma _p^2 = 1\), \(f = 2\) and \(\sigma _s^2 = 0.5\sigma _x^2\)

These simulation results are discussed in Sect. 5.

5 Discussion

From the two tables, the empirical results are in good agreement with the corresponding theoretical results.

As the measurement errors increase, the MSE of each mean estimator increases. Also, the efficiency of each estimator worsens as the value of f increases. For example, in Table 1, the MSE of the generalized mean estimator increases from 0.1397 to 0.1804 as the variance of the measurement errors increases from 1 to 10 when \(f = 2\), and increases from 0.1601 to 0.3113 as f increases from 2 to 4 when the variance of the measurement errors is 5. This is reasonable: larger measurement errors have a larger negative impact on mean estimation, and a larger f means a smaller subsample from the second call.

The results also show that the MSEs of all mean estimators increase as W increases under non-response, both with and without measurement errors. For example, in Table 2, the MSE of the generalized mean estimator increases from 0.0966 to 0.1680 as the sensitivity level increases from 0.5 to 1 when the variance of T is 0.5. Hence the optional RRT model leads to better results than the non-optional model; note that the model approaches the non-optional model as W increases toward 1. Furthermore, the simple additive RRT model (\(\sigma _T^2=0\)) is more efficient in terms of PRE, but the general RRT model is better if we examine the performance of the various estimators with respect to the unified measure (\(\delta\)) of efficiency and privacy. For instance, in Table 2, when the sensitivity level W is 0.5, the MSE of the generalized mean estimator increases from 0.0284 to 0.1641 as the variance of T increases from 0 to 1, but the \(\delta\) value decreases from 0.0069 to 0.0014.

It is clear from the theoretical conditions (30), (31), and (32), and from the simulation results, that the generalized mean estimator is always more efficient than the ordinary RRT mean estimator and the ratio estimator, while the ratio estimator is less efficient than the ordinary mean estimator when the measurement errors on X are large. For example, in Table 1, the MSE of the generalized mean estimator (0.1804) is less than that of the ordinary mean estimator (0.1948) when the variance of the measurement errors is 10 and f is 2, whereas the MSE of the ratio estimator (0.2518) is the largest of the three. This is because the ordinary mean estimator is not impacted by the measurement error in X, while the ratio estimator is; the generalized estimator, through its regression term, is able to overcome the measurement-error burden due to X.