Abstract
In this paper, we suggest a new method of constructing an unbiased regression type estimator in randomized response sampling. We introduce two new randomized response estimators, one obtained through a sum of special products technique and the other through the method used for computing a matrix determinant. The resulting unbiased regression type estimator proves to be more efficient with no loss in respondent protection. Analytical comparisons show that the proposed unbiased regression type estimator is always more efficient than the competitors considered. Because the theoretical justification that the proposed estimator has a smaller variance than its competitors is clear, no simulation study is required to establish it. However, to study the magnitude of the gain in relative efficiency, a simulation study has been carried out.
1 Introduction
Statisticians have long used randomized response techniques to estimate the proportion of individuals who belong to a group defined by a sensitive characteristic. Inaccurate results have been a problem when surveying respondents about sensitive questions, because trust between the interviewer and the respondent is difficult to establish. The techniques of randomized response sampling were created to address this. Pioneered by Warner (1965), randomized response techniques allow the interviewer to ask sensitive questions in a way that lets respondents reply without revealing their true status. These methods have allowed many researchers, including social scientists, to conduct surveys on sensitive subjects and obtain more accurate and efficient results. Since Warner (1965) first proposed his method, many other statisticians have made strides in randomized response, and their improvements are covered in the recent monographs by Fox (2016), Chaudhuri et al. (2016), Chaudhuri and Christofides (2013), and Chaudhuri (2011).
In the next section, we will discuss the Warner (1965), Mangat and Singh (1990), Kuk (1990) and Odumade and Singh (2009) models.
2 Background
In brief, survey sampling statisticians have long dealt with the difficulties of estimating the true population proportion of those individuals belonging to a group defined by a sensitive characteristic. One popular solution, first proposed by Warner (1965), is the implementation of a randomization device which protects the privacy of those individuals being surveyed. The respondent is instructed to use, in private, a randomization device such as a deck of cards. The respondent selects a card from the deck. Each card in this deck carries either the statement “I belong to group A” or the statement “I do not belong to group A”, with proportions P0 and (1 − P0), respectively. After selecting a card, the respondent reads it privately and tells the interviewer only ‘yes’ or ‘no’, according to whether the statement on the drawn card matches his/her status. By letting π represent the true population proportion of those individuals belonging to group A, the Warner (1965) model gives the estimator in Eq. 2.1 of the true proportion π, for a given P0, as:
where \(\hat {\theta }_{w}=n_{w}/n\) is the observed proportion of ‘yes’ replies out of n respondents selected from the population by simple random sampling with replacement (SRSWR), and nw is the observed number of ‘yes’ replies received by the interviewer. The above estimator is unbiased, and its variance for two trials per respondent, for a given P0, is given in Eq. 2.2 as
Since the Warner (1965) randomized response model only requires the respondent to deal with a single randomization device, such as a deck of cards, we also consider the case where the Warner (1965) model is performed twice, independently. We do this because the models proposed after Warner (1965), which are discussed below, make use of two randomization devices, such as two decks of cards. Models that use a second device gain efficiency and also improve protection for the participating respondents. Thus, while still using Warner’s (1965) model, consider the case in which the interviewer receives 0, 1, or 2 ‘yes’ replies based on two independent randomizing devices with parameters P and T. Letting π represent the true population proportion of people belonging to group A, the probability mass function (p.m.f.) of the i-th reply Zi is obtained as in Table 1:
From this, the expected value of Zi is given as
The variance of the response Zi, V (Zi), is given by
Now we have
By plugging Eqs. 2.3 and 2.5 into Eq. 2.4, the variance of Zi is given as
From Eq. 2.3, an unbiased estimator of π is given by
The variance of the Warner (1965) estimator \(\hat {\pi }_{w}\) for two trials per respondent with two independent devices with parameters, P and T, is given as
Clearly, if we let P = T = P0, this model reduces back to the original Warner (1965) model with two independent trials per respondent as in Eq. 2.2. To our knowledge the result in Eq. 2.8 is new, although we present it here in the background section of the present investigation.
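The two-device Warner setup described above can be illustrated with a minimal sketch. The displayed Eqs. 2.3 and 2.7 are not reproduced here, so the expected value and the moment estimator below are derived directly from the verbal description of the model and may differ in form from the paper's expressions. The function names are hypothetical, and the sketch assumes P + T ≠ 1.

```python
import random


def expected_yes_count(pi, P, T):
    """E[Z_i]: expected number of 'yes' replies (0, 1 or 2) from one respondent."""
    # A member of group A says 'yes' on a device when the drawn card reads
    # "I belong to group A" (probability P or T). A non-member says 'yes'
    # when the card reads "I do not belong to group A" (1 - P or 1 - T).
    return pi * (P + T) + (1 - pi) * ((1 - P) + (1 - T))


def estimate_pi(mean_z, P, T):
    """Moment estimator: solve E[Z_i] = mean_z for pi (assumes P + T != 1)."""
    return (mean_z - (2 - P - T)) / (2 * (P + T - 1))


def simulate_mean_z(pi, P, T, n, rng):
    """Average number of 'yes' replies over n respondents drawn with replacement."""
    total = 0
    for _ in range(n):
        in_group = rng.random() < pi
        p1, p2 = (P, T) if in_group else (1 - P, 1 - T)
        total += (rng.random() < p1) + (rng.random() < p2)
    return total / n


if __name__ == "__main__":
    rng = random.Random(0)
    z_bar = simulate_mean_z(pi=0.3, P=0.7, T=0.8, n=20_000, rng=rng)
    print(estimate_pi(z_bar, P=0.7, T=0.8))
```

Setting P = T = P0 in this sketch corresponds to running the single-device Warner (1965) procedure twice per respondent.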
Mangat and Singh (1990) improved on the Warner (1965) model by proposing a two-stage randomized response procedure that makes use of two decks of cards. In the Mangat and Singh (1990) model, each respondent is asked to use two randomization devices, R1 and R2. The device R1 has two outcomes, “Are you a member of group A?”, with relative frequency T0, and “Go to the second randomization device R2”, with relative frequency (1 − T0). The second randomization device R2 is the same as the Warner (1965) randomization device. As in the Warner (1965) model, we let π represent the true population proportion of those individuals belonging to group A, and nms be the number of ‘yes’ replies received by the interviewer from n respondents selected from the population by SRSWR. The following estimator of π is derived for the Mangat and Singh (1990) model:
where \(\hat {\theta }_{ms}=n_{ms}/n\) is the proportion of the observed ‘yes’ answers in the sample. The estimator in Eq. 2.9, provided by Mangat and Singh (1990), is unbiased and has the variance:
Following Warner (1986) and Fox and Tracy (1986, p. 30), Kuk (1990) suggests using the theory of recoding of responses to overcome an undesirable feature of randomized response techniques. The Kuk (1990) model avoids printing statements like “I belong to group A” on the cards, because A is a sensitive group and the respondent may become sceptical and uncooperative. He suggests using cards of the same size and type but of different colors (red and blue, say). A respondent who belongs to group A (Ac) draws a card from deck-1 (deck-2) and reports the color of the drawn card, instead of answering the sensitive question. Let 𝜃1 be the proportion of red cards in deck-1 and 𝜃2 be the proportion of red cards in deck-2. By letting π represent the true population proportion of those individuals belonging to group A, Kuk’s (1990) model gives the probability of a “red” color response from a respondent:
Additionally, suppose the n respondents are selected from the population utilizing a SRSWR. Then, letting nkuk be the number of ‘red’ replies received by the interviewer, we have that nkuk follows a Binomial distribution with parameters n and 𝜃kuk. If each individual being interviewed is requested to give k ≥ 1 replies, then Kuk’s (1990) model gives the variance:
The Kuk (1990) model serves as a special case of many suggested randomized response models such as those proposed by Warner (1965) and Mangat and Singh (1990). While each of these models was proposed as an improvement on its predecessors, a more recent paper has proposed a method that has been shown to be more efficient than all of them. The Odumade and Singh (2009) randomized response model is shown to be more efficient than the Warner (1965), Mangat and Singh (1990), and Kuk (1990) models. Naturally, this means that the Odumade and Singh (2009) estimator is the one we wish to modify. The Odumade and Singh (2009) model consists of two decks of cards as shown in Fig. 1.
In this model, each respondent selected in the sample experiences an ordered pair of a deck of cards. This ordered pair is (Deck-I, Deck-II). The respondent matches his/her status with the two outcomes from the two decks and replies either (yes, yes), (yes, no), (no, yes), or (no, no). There are four types of responses that can be observed from both types of respondents either belonging to group A or Ac. This randomized response model produces the following probabilities:
and
As before, we let π represent the true population proportion of those individuals belonging to group A, and consider n respondents selected from the population by SRSWR, with n11, n10, n01, and n00 being the numbers of (yes, yes), (yes, no), (no, yes) and (no, no) replies, respectively, received by the interviewer. Odumade and Singh (2009) considered the minimization of a distance function defined as
where \(\hat {\theta }_{ij} = n_{ij}/n\), i = 0,1; j = 0,1, are the observed proportions of (yes, yes), (yes, no), (no, yes) and (no, no) replies.
Then, the estimator of true proportion derived from the Odumade and Singh (2009) model is given as
The estimator \(\hat {\pi }_{os}\) of Odumade and Singh (2009) is unbiased and has the variance:
An unbiased estimator of the \(V(\hat {\pi }_{os})\) is given by
In the next section, we consider alternatives to the squared distance function for analyzing the data collected by the two-deck method.
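The two-deck design and the distance-minimization idea reviewed above can be sketched as follows. Fig. 1 and the displayed probabilities are not reproduced here, so the sketch assumes that Deck-I carries the statement “I belong to group A” with proportion P, that Deck-II carries it with proportion T, and that the two draws are independent. The closed form in `least_squares_pi` is obtained by minimizing the sum of squared differences between the model and observed proportions under that assumption. It is offered as an illustration, not as a quotation of Eq. 2.18, and the function names are hypothetical.

```python
def cell_probabilities(pi, P, T):
    """Return (theta11, theta10, theta01, theta00) under the assumed two-deck design."""
    q = 1 - pi
    return (
        P * T * pi + (1 - P) * (1 - T) * q,   # (yes, yes)
        P * (1 - T) * pi + (1 - P) * T * q,   # (yes, no)
        (1 - P) * T * pi + P * (1 - T) * q,   # (no, yes)
        (1 - P) * (1 - T) * pi + P * T * q,   # (no, no)
    )


def least_squares_pi(theta_hat, P, T):
    """Value of pi minimizing sum_ij (theta_ij(pi) - theta_hat_ij)^2.

    Each theta_ij is linear in pi, with slopes +a, +b, -b, -a, where
    a = P + T - 1 and b = P - T, so the minimizer has a closed form.
    Requires a^2 + b^2 > 0, which fails only at P = T = 0.5.
    """
    a, b = P + T - 1, P - T
    t11, t10, t01, t00 = theta_hat
    return 0.5 + (a * (t11 - t00) + b * (t10 - t01)) / (2 * (a * a + b * b))


if __name__ == "__main__":
    theta = cell_probabilities(pi=0.2, P=0.7, T=0.6)
    print(sum(theta))                          # the four cells form a distribution
    print(least_squares_pi(theta, 0.7, 0.6))   # recovers pi from the true proportions
```

Because the estimator is linear in the observed proportions and returns π exactly when the true proportions are supplied, it is unbiased under the assumed design.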
3 New Randomized Response Estimators Based Upon a Sum of Special Products Technique and Upon the Method of Solving for a Matrix Determinant
We became curious as to what types of methods we could use to produce a new and more efficient randomized response estimator, and in particular how best to utilize the data at hand. The method developed here, which leads to an unbiased and efficient estimator of the population proportion without any additional cost (or loss of protection), is expected to encourage survey statisticians to think more along these lines.
Now, instead of trying to minimize the distance function as Odumade and Singh (2009) did, we attempted two new methods. The first idea is to optimize the Sum of Special Products (SSP) given by
We named it SSP because the first product is obtained by fixing the first response as “Yes” and the second product is obtained by fixing the first response as “No”. No doubt there are many possibilities, but this SSP leads to a useful estimator and also opens a window for future research. Now we have the following theorems:
Theorem 3.1.
The estimator given below, which optimizes the SSP, is unique and unbiased.
Proof.
See Online Supplementary Material in Appendix-A. □
Theorem 3.2.
The variance of the estimator \(\hat {\pi }_{SSP}\) is given by:
Proof.
See Online Supplementary Material in Appendix-A.□
Theorem 3.3.
An unbiased estimator of \(V(\hat {\pi }_{SSP})\) is given by:
Proof.
Following Odumade and Singh (2009), it is easy to verify \(E[\hat {v}(\hat {\pi }_{SSP})]=V(\hat {\pi }_{SSP})\), which proves the theorem. □
The second method for creating an estimator from the data generated by the Odumade and Singh (2009) device takes inspiration from the computation of the determinant of a 2 × 2 matrix. Consider the following square matrix of differences in Fig. 2:
Theorem 3.4.
The estimator below, which optimizes the determinant of the true differences, is unique and unbiased.
Proof.
See Online Supplementary Material in Appendix-A.□
Theorem 3.5.
The variance of the estimator \(\hat {\pi }_{DET}\) is given by
Proof.
See Online Supplementary Material in Appendix-A. □
Theorem 3.6.
An unbiased estimator of \(V(\hat {\pi }_{DET})\) is given by
Proof.
It is easy to verify \(E[\hat {v}(\hat {\pi }_{DET})]=V(\hat {\pi }_{DET})\), which proves the theorem. □
In the next section, we propose a new unbiased regression type estimator that makes use of estimators obtained from optimizing the SSP approach and Determinant (DET) approach along with the Odumade and Singh (2009) estimator.
4 A New Unbiased Regression Type Estimator
Unfortunately, when checking the efficiency of the randomized response estimators derived in Section 3, we found that the new estimators, \(\hat {\pi }_{SSP}\) and \(\hat {\pi }_{DET}\), did not improve upon the Odumade and Singh (2009) estimator, \(\hat {\pi }_{os}\), in terms of relative efficiency. However, we considered the idea of combining the SSP and DET type estimators previously derived to obtain a regression type estimator.
Theorem 4.1.
An unbiased regression type estimator of the true proportion π of individuals belonging to a sensitive group A is given by:
where β is a constant to be derived.
Proof.
The estimator is unbiased whatever the choice of β since \(\hat {\pi }_{os}\), \(\hat {\pi }_{SSP}\) and \(\hat {\pi }_{DET}\) are all unbiased. The optimum value of β is free from the value of π. See Online Supplementary Material in Appendix-A. □
Theorem 4.2.
The minimum variance of \(\hat {\pi }_{reg}\) is given by:
where
and
which are free from the value of π.
Proof.
See Online Supplementary Material in Appendix-A. □
Theorem 4.3.
The optimum value of β which minimizes the variance of \(\hat {\pi }_{reg}\) in (4.1) can be written as
which is again free from the value of π.
Proof.
See Online Supplementary Material in Appendix-A. □
Theorem 4.4.
An unbiased estimator of \(V(\hat {\pi }_{reg})\) is given by
where \(\hat {v}(\hat {\pi }_{os})\) is given in (2.20).
Proof.
Trivial, because \(Cov(\hat {\pi }_{os}, \hat {\pi }_{SSP}-\hat {\pi }_{DET} )\) in Eq. 4.3 and \(V(\hat {\pi }_{SSP}-\hat {\pi }_{DET} )\) in Eq. 4.4 are free from the value of π. □
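The role of β in Theorems 4.1 to 4.3 can be illustrated with a short sketch. Eq. 4.1 is not reproduced here, so the sketch assumes the combination takes the form \(\hat {\pi }_{os} + \beta (\hat {\pi }_{SSP}-\hat {\pi }_{DET})\), which is suggested by the covariance and variance terms named in the proof of Theorem 4.4. The sign convention for β in the paper may differ. The function names and the numerical inputs are hypothetical.

```python
def variance_of_combination(var_os, var_diff, cov, beta):
    """Variance of pi_os + beta * (pi_SSP - pi_DET).

    var_os   : V(pi_os)
    var_diff : V(pi_SSP - pi_DET)
    cov      : Cov(pi_os, pi_SSP - pi_DET)
    """
    return var_os + beta ** 2 * var_diff + 2 * beta * cov


def optimal_beta(var_diff, cov):
    """beta minimizing the quadratic above (set its derivative in beta to zero)."""
    return -cov / var_diff


def minimum_variance(var_os, var_diff, cov):
    """Variance attained at the optimal beta: V(pi_os) - cov^2 / var_diff."""
    return var_os - cov ** 2 / var_diff


if __name__ == "__main__":
    # Hypothetical moments, chosen only to exercise the formulas.
    var_os, var_diff, cov = 0.010, 0.004, 0.003
    beta = optimal_beta(var_diff, cov)
    print(beta, variance_of_combination(var_os, var_diff, cov, beta))
```

Since the subtracted term cov² / var_diff is never negative, the minimum variance under this assumed form cannot exceed \(V(\hat {\pi }_{os})\), whatever the sign of the covariance.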
5 Efficiency Comparisons
In order to show that our new estimator is better than the randomized response estimators derived for two trials by Warner (1965), Mangat and Singh (1990), Kuk (1990), and Odumade and Singh (2009), we must compute the relative efficiencies. With respect to Warner (1965), the relative efficiency criteria for the two cases considered are given by
Similarly, the relative efficiency criteria with respect to Mangat and Singh (1990), Kuk (1990) and Odumade and Singh (2009) are given by
and
We ran SAS code (given in Arias (2019)) to compare the efficiency of the proposed model with respect to Warner (1965), Mangat and Singh (1990), Kuk (1990), and Odumade and Singh (2009). For Warner (1965) with two trials per respondent in Eq. 2.2 and for Mangat and Singh (1990), we set P0 = P and T0 = T. For Kuk (1990), we set 𝜃1 = P, 𝜃2 = T, and k = 2. For Odumade and Singh (2009), Warner (1965) with two trials in Eq. 2.8, and the suggested model, we allowed the values of P to range from 0.55 to 0.80, and the value of T to range from 0.10 to 0.25, both with a step of 0.05.
Tables 2, 3 and 4 display summaries of the results, while the full set of results can be obtained by executing the SAS code. For each value of π, we found the mean, standard deviation, maximum, and minimum of the relative efficiencies over the various choices of P and T.
In Tables 2 to 4, freq stands for the number of times the proposed estimator is more efficient than all the competitors considered, out of the 24 possible combinations of P and T. As one can clearly see from Tables 2 to 4, the suggested model is much more efficient than the competing models. For π ranging over 0.05 ≤ π ≤ 0.50 with a step of 0.05, the proposed estimator performs much better than the models proposed by Warner (1965) for both cases, Mangat and Singh (1990), Kuk (1990), and Odumade and Singh (2009). To visualize this, Figs. 3 and 4 plot the relative efficiencies against the unknown proportion π.
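The parameter grid described above can be enumerated with a short sketch. The relative efficiency helper assumes the usual percentage form, the variance of the competitor divided by the variance of the proposed estimator, times 100, because the displayed criteria in Section 5 are not reproduced here. The names are hypothetical, and the authors' own computations were carried out in SAS.

```python
from itertools import product

# P from 0.55 to 0.80 and T from 0.10 to 0.25, both in steps of 0.05.
# Integer steps followed by rounding avoid floating-point drift.
P_values = [round(0.55 + 0.05 * i, 2) for i in range(6)]
T_values = [round(0.10 + 0.05 * j, 2) for j in range(4)]
grid = list(product(P_values, T_values))


def relative_efficiency(var_competitor, var_proposed):
    """Percent relative efficiency; values above 100 favor the proposed estimator."""
    return 100.0 * var_competitor / var_proposed


if __name__ == "__main__":
    print(len(grid))   # number of (P, T) combinations
    print(grid[:3])
```

Six values of P and four values of T give the 24 combinations over which the summary statistics are taken for each value of π.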
From Figs. 3 and 4, one can clearly see that when compared with each of the randomized response models produced by Warner (1965) for two trials, Mangat and Singh (1990), Kuk (1990), and Odumade and Singh (2009), the suggested model performs better. In each respective case, it is easy to see that each of the values for relative efficiency, with respect to the other models, will remain above 100%. These graphs and summary of results all show that the suggested estimator can perform much better than all of the other estimators provided by Warner (1965), Mangat and Singh (1990), Kuk (1990), and Odumade and Singh (2009).
6 Simulation Study
A simulation study, designed to resemble real survey data, was conducted using SAS. We wanted to see how the proposed estimator would perform against the Odumade and Singh (2009) estimator in a real survey. For 0.05 ≤ π ≤ 0.50, with P = 0.316 and T = 0.845, we first determined the true probabilities 𝜃11, 𝜃10, 𝜃01, and 𝜃00. Then, in SAS, we generated samples of n = 50 replies with the call function RandMultinomial(ntrials,ns,prob), where ntrials represents the number of trials in each simulation, ns is the sample size and prob represents the probabilities. We used NITR = 10,000, which means each simulation had 10,000 iterations. We then computed the simulated variances of \(\hat {\pi }_{os}\) and \(\hat {\pi }_{reg}\) as follows
and
The relative efficiency can be determined from Eqs. 6.1 and 6.2 as follows
where REsim is the simulated relative efficiency of \(\hat {\pi }_{reg}\) with respect to \(\hat {\pi }_{os}\). A total of 10 simulations were run, one for each value of π in 0.05 ≤ π ≤ 0.50 with a step of 0.05. Each simulation consisted of 10,000 trials, and each trial used a sample of 50 individuals. For each simulation we studied the relative efficiency of the suggested estimator with respect to the Odumade and Singh (2009) estimator. The results are given below in Table 5.
Clearly from Table 5, the results of the simulation studies run in SAS show that the suggested estimator performs better than the Odumade and Singh (2009) estimator. The added β coefficient forces the suggested estimator to always be more efficient, especially when the optimum choice is used for the minimum variance of the estimator. The original version of this work is available in Arias (2019), and suggests that there is potential for further research.
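The simulation design described above can be sketched in Python, although the authors used SAS and RandMultinomial. The sketch covers only the Odumade and Singh (2009) type estimator, because the displayed formulas for \(\hat {\pi }_{SSP}\), \(\hat {\pi }_{DET}\) and \(\hat {\pi }_{reg}\) are not reproduced here. It assumes the two-deck cell probabilities implied by the model description, a least-squares closed form for the estimator, and a simulated variance defined as the mean squared deviation from the true π. All function names are hypothetical.

```python
import random


def cell_probabilities(pi, P, T):
    """(theta11, theta10, theta01, theta00) under the assumed two-deck design."""
    q = 1 - pi
    return (P * T * pi + (1 - P) * (1 - T) * q,
            P * (1 - T) * pi + (1 - P) * T * q,
            (1 - P) * T * pi + P * (1 - T) * q,
            (1 - P) * (1 - T) * pi + P * T * q)


def draw_proportions(probs, n, rng):
    """One multinomial sample of n replies, returned as observed proportions."""
    counts = [0, 0, 0, 0]
    for cell in rng.choices(range(4), weights=probs, k=n):
        counts[cell] += 1
    return [c / n for c in counts]


def os_type_estimate(theta_hat, P, T):
    """Least-squares estimate of pi from the four observed proportions."""
    a, b = P + T - 1, P - T
    t11, t10, t01, t00 = theta_hat
    return 0.5 + (a * (t11 - t00) + b * (t10 - t01)) / (2 * (a * a + b * b))


def simulated_variance(estimates, true_pi):
    """Mean squared deviation of the simulated estimates from the true pi."""
    return sum((e - true_pi) ** 2 for e in estimates) / len(estimates)


def run_simulation(pi, P=0.316, T=0.845, n=50, nitr=10_000, seed=1):
    """Return nitr estimates, each computed from a fresh sample of n replies."""
    rng = random.Random(seed)
    probs = cell_probabilities(pi, P, T)
    return [os_type_estimate(draw_proportions(probs, n, rng), P, T)
            for _ in range(nitr)]


if __name__ == "__main__":
    for step in range(1, 11):          # pi = 0.05, 0.10, ..., 0.50
        pi = round(0.05 * step, 2)
        print(pi, simulated_variance(run_simulation(pi), pi))
```

A simulated relative efficiency would then be the ratio of two such simulated variances, expressed as a percentage, once the second estimator is implemented from the paper's formulas.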
7 Conclusion
We knew that the model produced by Odumade and Singh (2009) could perform better than those of Warner (1965), Mangat and Singh (1990), and Kuk (1990). However, the natural question was how to improve on the Odumade and Singh (2009) model. The new SSP and determinant type techniques did not, on their own, improve on it. Nevertheless, it became clear to us that a regression type estimator with an optimum value for β would achieve this. As one can see from the proofs, tables, and figures, the regression type estimator is more efficient than all the other estimators we compared it against. Naturally, additional research is required to see if we can further improve an estimator in the field of randomized response sampling.
References
Arias, R. (2019). New methods for efficient results using randomized response sampling. Unpublished M.Sc. thesis submitted to the Department of Mathematics, Texas A & M University-Kingsville.
Chaudhuri, A. (2011). Randomized response and indirect questioning technique in surveys. CRC Press, Boca Raton.
Chaudhuri, A. and Christofides, T.C. (2013). Indirect questioning in sample surveys. Springer Science & Business Media, Berlin.
Chaudhuri, A., Christofides, T.C. and Rao, C.R. (2016). Data gathering, analysis and protection of privacy through randomized response techniques: Qualitative and quantitative human traits, 34. Elsevier, North-Holland.
Fox, J.A. (2016). Randomized response and related methods, 2nd edn. SAGE, Los Angeles.
Fox, J.A. and Tracy, P.E. (1986). Randomized response: A method for sensitive surveys. SAGE, Los Angeles.
Kuk, A.Y.C. (1990). Asking sensitive questions indirectly. Biometrika 77, 2, 439–442.
Mangat, N.S. and Singh, R. (1990). An alternative randomized response procedure. Biometrika 77, 2, 439–442.
Odumade, O. and Singh, S. (2009). Efficient use of two decks of cards in randomized response sampling. Commun. Stat.-Theory Methods 38, 439–446.
Warner, S.L. (1965). Randomized response: a survey technique for eliminating evasive answer bias. J. Amer. Statist. Assoc. 60, 63–69.
Warner, S.L. (1986). The omitted digit randomized response model for telephone applications. In Proceedings Survey Res. Meth. Sect. Am. Statist. Assoc., pp. 441–443.
Acknowledgments
The authors are thankful to the Editor-in-Chief Dr. Dipak K. Dey, an Associate Editor, a referee and Editorial Assistant: Mr. Sarvagnan Subramanian for their comments and help on the original version of this manuscript.
Cite this article
Arias, R., Sedory, S.A. & Singh, S. An Unbiased Regression Type Estimator In Randomized Response Sampling. Sankhya B 84, 243–258 (2022). https://doi.org/10.1007/s13571-021-00256-z
DOI: https://doi.org/10.1007/s13571-021-00256-z