Respondent privacy and estimation efficiency in randomized response surveys for discrete-valued sensitive variables

Bose, Mausumi

doi:10.1007/s00362-014-0624-4

Regular Article
Published: 19 August 2014

Volume 56, pages 1055–1069, (2015)

Statistical Papers

Abstract

In some socio-economic surveys, data are collected on sensitive issues such as tax evasion, criminal conviction, drug use, etc. In such surveys, direct questioning of respondents is not of much use and the randomized response technique is used instead. A few researchers have studied the issue of privacy protection for surveys where the objective is to estimate the proportion of persons bearing the sensitive trait. Not much is known about respondent protection when the variable under study is a discrete quantitative variable and the objective is to estimate the population mean. In this article we study this issue. We propose a scheme for this issue and a measure of privacy. We show that given a stipulated level of this privacy measure, we can determine the parameter of the randomization device so as to maximize the efficiency of estimation, while guaranteeing the desired level of privacy protection.

1 Introduction

The randomized response technique is a useful method for collecting data on variables which are considered sensitive, incriminating or stigmatizing for the respondents. Examples of such situations are common in socio-economic surveys; for instance, we may need to collect data on tax evasion, alcohol addiction, illegal drug use, criminal behaviour or past criminal convictions. In such surveys, direct questions are not useful as the respondents will either refuse to answer embarrassing questions or, even if they do answer, may give false answers. In a randomized response model, the respondents use a randomization device to generate a randomized response and the parameter under study can be estimated from these responses. So, the respondent is not required to disclose his true response and it is expected that this will lead to better participation in surveys on sensitive issues.

Warner (1965) introduced the randomized response technique for estimating the proportion of persons in a dichotomous population bearing a sensitive qualitative character, such as alcoholism or drug addiction. Let $A$ denote such a sensitive character and $A^c$ its complement. Suppose in the population, the proportion of individuals bearing the character $A$ is $\pi _A$. Then the proportion bearing $A^c$ is $\pi _{A^c}=1-\pi _{A}$. The objective of the survey is to estimate the unknown value $\pi _A$. In Warner’s randomized response model, a box with two types of cards labeled $A$ and $A^c$ (in proportion $p:1-p$, $p\ne 1/2)$ is used as the randomization device. Each respondent is asked to draw a card at random from the box and then respond simply ‘yes’ or ‘no’ according as whether or not he bears the character on the label of the card he draws, without disclosing this label. Thus the actual character of the respondent is not disclosed. If this randomization procedure is adopted, then the probability of a ‘yes’ response is $p\pi _A + (1-p)(1-\pi _A)$, so on the basis of a simple random sample with replacement of size $n$, the proportion $\pi _A$ may be unbiasedly estimated by $\{w-(1-p)\}/(2p-1)$, where $w$ is the sample proportion of ‘yes’ responses. Thus $\pi _A$ may be unbiasedly estimated without having recourse to direct questioning.
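As a minimal sketch of Warner's device and estimator, the following Python snippet simulates the card draw and recovers $\pi _A$ from the proportion of ‘yes’ responses. The function names, the seed and the population values ($\pi _A = 0.2$, $p = 0.7$) are illustrative assumptions and are not taken from the article.

```python
import random

def warner_response(has_trait: bool, p: float, rng: random.Random) -> bool:
    """One randomized 'yes'/'no' answer under Warner's device."""
    card_is_A = rng.random() < p          # card labelled A is drawn with probability p
    # 'yes' means the respondent bears the character printed on the drawn card
    return has_trait if card_is_A else not has_trait

def warner_estimate(yes_proportion: float, p: float) -> float:
    """Unbiased estimate of pi_A (requires p != 1/2).

    Follows from P(yes) = p * pi_A + (1 - p) * (1 - pi_A).
    """
    return (yes_proportion - (1 - p)) / (2 * p - 1)

if __name__ == "__main__":
    rng = random.Random(0)                # fixed seed, purely for reproducibility
    pi_A, p, n = 0.2, 0.7, 100_000        # hypothetical values
    answers = [warner_response(rng.random() < pi_A, p, rng) for _ in range(n)]
    w = sum(answers) / n                  # sample proportion of 'yes'
    print(warner_estimate(w, p))          # should be close to pi_A for large n
```

The estimator is undefined at $p = 1/2$, which is why the device excludes that value.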

Since then, many researchers have contributed extensively to this area: some have proposed alternative models for estimating proportions in dichotomous populations, some have extended this to proportions in polychotomous models, and others have studied the case of quantitative sensitive variables, e.g., Kuk (1990); Ljungqvist (1993); Mangat (1994); Chua and Tsui (2000); Van den Hout and Van der Heijden (2002); Christofides (2005); Kim (2007); Arnab and Dorffner (2007); Pal (2008); Diana and Perri (2009); Chaudhuri et al. (2011a, b); Barabesi et al. (2012) and many others. For details on the results available on this technique we refer to the review paper by Chaudhuri and Mukerjee (1987) and the books by Chaudhuri and Mukerjee (1988) and Chaudhuri (2011).

Lanke (1976) and Leysieffer and Warner (1976) initiated the study of efficiency versus privacy protection in randomized response surveys where the population is divided into two complementary sensitive groups, $A$ and $A^c$, and the objective is to estimate the proportions of persons belonging to these two groups. They suggested measures of jeopardy based on the ‘revealing probabilities’, i.e., the posterior probabilities of a respondent belonging to groups $A$ and $A^c$ given his randomized response. Since then, this dichotomous case has been widely studied. Loynes (1976) extended the jeopardy measure of Leysieffer and Warner (1976) to polychotomous populations. Ljungqvist (1993) gave a unified and utilitarian approach to measures of privacy for the dichotomous case. For estimating the proportion $\pi _A$, Nayak and Adeshiyan (2009) proposed a measure of jeopardy for surveys from dichotomous populations and developed an approach for comparing the available randomization procedures. Recently, for estimating $\pi _A$, Giordano and Perri (2012) compared some randomization models from the point of view of efficiency and privacy protection.

All the references given above are for sensitive variables which are categorical or qualitative in nature and the objective is to estimate $\pi _A$. However, in randomized response surveys it is quite common to have situations where the study variable $X$ is quantitative, e.g., in studies on the number of criminal convictions of a person, the number of induced abortions, the amount of time spent in a correction centre, the amount of undisclosed income, etc. Anderson (1977) studied the case of continuous sensitive variables and considered the amount of information provided by the randomized responses. For ensuring more privacy he recommended that the expectation of the conditional variance of $X$ given the randomized response be made as large as possible. Diana and Perri (2011) studied quantitative sensitive data and, for estimating the mean, they used auxiliary information at the estimation stage and compared different models from the efficiency and privacy protection aspects. However, notwithstanding the rich literature on the randomized response technique, not much work seems to have been done in studying the respondent privacy aspect for discrete-valued sensitive variables, even though surveys are often undertaken on such variables.

To fill this gap, in this article we focus on studying the issue of privacy protection when the underlying sensitive variable under study is quantitative and discrete. For instance, the variable of interest may be the number of convictions for criminal offences, number of times one has used illegal drugs or the number of induced abortions, etc. We propose the use of a randomization device and give the associated estimation method for such studies. Then, we consider two separate cases, one where all values of $X$ are sensitive and another where not all values of $X$ are sensitive. For each of these cases, we propose a measure for protecting the privacy of the respondents. We finally show how one can choose the randomization device parameter in each case, so as to guarantee a certain pre-specified level of respondent protection and then maximize the efficiency of estimating the parameter of interest under this constraint. Our study also covers qualitative sensitive variables, i.e., cases where the population is dichotomous or polychotomous, and allows us to estimate the proportions of individuals belonging to each category.

In Sect. 2 we give some preliminaries. In Sects. 3 and 4 we consider the issues of estimation and privacy protection, respectively. In Sect. 5 we obtain the randomization device parameter which allows efficient estimation while assuring the required level of respondent protection. We also present some illustrative numerical examples. In Sect. 6 we show how our study covers the case of polychotomous variables. Finally we conclude with some remarks.

2 Preliminaries

Consider a population of individuals and let $X$ denote the sensitive variable of interest. We assume that $X$ takes a finite number of values $x_1, \ldots , x_m$ and without loss of generality, we may suppose these $m$ values to be known. For $1\le i \le m$, let $\pi _i$ be the unknown population proportion of individuals for whom $X$ equals $x_i$, i.e.,

$$\begin{aligned} {\mathrm {Prob}}(X=x_i) = \pi _i, \ \ 1\le i \le m, \ \ \mathrm {where} \ \ \pi _i \ge 0, \ \ \sum _{i=1}^m \pi _i =1. \end{aligned}$$

(1)

The objective of the survey is to estimate the population mean of $X$. For this, we suppose as usual (cf. Warner (1965), Nayak and Adeshiyan (2009) and others), that a sample of $n$ individuals is drawn from the population by simple random sampling with replacement. As for the randomization device, since we are interested in the numerical values of $X$, we propose the use of a device as described below.

Consider a box containing cards of $(m+1)$ types, the $i$th type of card being marked ‘Report $x_i$ as your response’, $1\le i\le m$, while the $(m+1)$th type of card is marked: ‘Report your true value of $X$ as your response.’ The box has a large number of cards, say $M$, there being $Mp$ cards of type $(m+1)$ and $M\frac{1-p}{m}$ cards of each of the types $i$, $1\le i\le m$, $0<p<1.$ A sampled respondent is asked to draw a card at random from the box and then give a truthful response according to the card drawn by him, without disclosing the label on the card to the investigator. Thus the true value of $X$ for the respondent is not known. The $n$ responses so received are the data from this survey.

Let $R$ denote the randomized response variable. Clearly, with this device, the ranges of $R$ and $X$ match. The efficiency in estimation and respondent protection will depend on the choice of the value of $p$, which we call the device parameter. The above device is such that with probability $p$, a respondent will report his true value, while with probability $\frac{1-p}{m}$, he will report any one of the possible values $x_1, \ldots , x_m$ chosen at random, i.e.,

$$\begin{aligned} {\mathrm {Prob}}(R=x_i |X=x_j)&= \frac{1-p}{m}, \ \ 1\le i \ne j \le m, \end{aligned}$$

(2)

$$\begin{aligned} {\mathrm {Prob}}(R=x_j | X=x_j)&= p +\frac{1-p}{m}, \ \ 1\le j \le m. \end{aligned}$$

(3)
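The device of this section can be sketched in a few lines of Python. The snippet below encodes the conditional probabilities (2) and (3) and simulates a single card draw; indices are 0-based and the function names are hypothetical, chosen only for illustration.

```python
import random

def response_probability(i: int, j: int, m: int, p: float) -> float:
    """Prob(R = x_i | X = x_j) from (2) and (3); i and j are 0-based."""
    return (p if i == j else 0.0) + (1 - p) / m

def draw_response(true_index: int, m: int, p: float, rng: random.Random) -> int:
    """Simulate one card draw and return the index of the reported value."""
    if rng.random() < p:
        # card of type m+1: report the true value of X
        return true_index
    # card of one of the types 1..m, each equally likely: report that x_i
    return rng.randrange(m)

if __name__ == "__main__":
    rng = random.Random(0)
    m, p = 4, 0.25                        # hypothetical device settings
    print([draw_response(2, m, p, rng) for _ in range(10)])
    print([response_probability(i, 2, m, p) for i in range(m)])
```

Each column of conditional probabilities sums to one, since $p + m\cdot \frac{1-p}{m} = 1$.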

3 Estimation of population mean

The population mean and variance of $X$ are given by

$$\begin{aligned} \mu _X =\sum _{i=1}^m x_i \pi _i \ \text{ and } \ \ \sigma _X^2 = \sum _{i=1}^m (x_i -\mu _X)^2\pi _i, \end{aligned}$$

respectively. Our objective is to estimate $\mu _X$ from the $n$ randomized responses collected as described in Sect. 2. Let $w_i$ be the sample proportion of randomized responses which equal $x_i$, $1\le i\le m.$ Hence, from (1)-(3),

$$\begin{aligned} {\mathrm {E}}(w_i) = {\mathrm {Prob}}(R=x_i) = p\pi _i + \frac{1-p}{m} = \lambda _i, \ \ {\mathrm {say}}. \end{aligned}$$

(4)

So, an unbiased estimator of $\pi _i$ will be given by $\hat{\pi }_i = \frac{1}{p}(w_i - \frac{1-p}{m}),$ leading to an unbiased estimator of $\mu _X$ as

$$\begin{aligned} \hat{\mu }_X= \sum _{i=1}^m x_i\hat{\pi }_i = \frac{1}{p}\sum _{i=1}^m x_iw_i - \frac{1-p}{mp}\sum _{i=1}^m x_i. \end{aligned}$$
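A minimal sketch of these estimators in Python is given below, assuming the randomized responses are available as a plain list of reported values. The function names and the toy data are assumptions made for illustration.

```python
from collections import Counter

def estimate_proportions(responses, values, p):
    """Unbiased estimates pi_hat_i = (w_i - (1 - p)/m) / p for each x_i."""
    m, n = len(values), len(responses)
    counts = Counter(responses)           # counts[x] is 0 for values never reported
    return [(counts[x] / n - (1 - p) / m) / p for x in values]

def estimate_mean(responses, values, p):
    """Unbiased estimate of mu_X as the sum of x_i * pi_hat_i."""
    pi_hat = estimate_proportions(responses, values, p)
    return sum(x * ph for x, ph in zip(values, pi_hat))

if __name__ == "__main__":
    values = [0, 1, 2]                    # hypothetical support of X
    responses = [0, 0, 1, 2]              # hypothetical randomized responses
    print(estimate_proportions(responses, values, p=0.5))
    print(estimate_mean(responses, values, p=0.5))
```

In small samples an individual $\hat{\pi }_i$ can fall outside $[0,1]$; the estimator is unbiased but not constrained to the parameter space, although the $\hat{\pi }_i$ always sum to one.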

Let us write $\bar{X}= \frac{1}{m}\sum _{i=1}^m x_i$. Then, on simplification using (4), the variance of $\hat{\mu }_X$ for a given value of $p$ is given by

$$\begin{aligned} {\mathrm {Var}}_p(\hat{\mu }_X)&= \frac{1}{p^2} {\mathrm {Var}} (\sum _{i=1}^m x_iw_i ) = \frac{1}{np^2} \left\{ \sum _{i=1}^m x_i^2 \lambda _i (1-\lambda _i) - \sum _{i=1}^m\sum _{j(\ne i)=1}^m x_i x_j\lambda _i \lambda _j \right\} \nonumber \\&= \frac{1}{np^2} \left\{ p\sum _{i=1}^m x_i^2\pi _i + \frac{1-p}{m}\sum _{i=1}^m x_i^2 - \left( p\mu _X + (1-p)\bar{X}\right) ^2 \right\} \nonumber \\&= \frac{1}{np^2}\left\{ p \sigma _X^2 + (1-p) \frac{1}{m}\sum _{i=1}^m (x_i-\bar{X})^2 + p(1-p) (\mu _X - \bar{X})^2 \right\} . \end{aligned}$$

(5)

Our aim is to estimate $\mu _X$ keeping ${\mathrm {Var}}_p(\hat{\mu }_X)$ as small as possible. It is clear from the expression on the right side of (5) that ${\mathrm {Var}}_p(\hat{\mu }_X)$ is decreasing in $p$, irrespective of the values of $\pi _1, \ldots , \pi _m$. So, this variance may be decreased, or equivalently, the efficiency of estimation may be increased by increasing $p$, whatever may be the proportions of the $x_i$ values in the population.
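The monotonicity of (5) in $p$ can be checked numerically. The sketch below evaluates the last line of (5) for a hypothetical population; the support, the proportions and the sample size are assumptions used only to illustrate the behaviour.

```python
def variance_of_mean_estimator(values, pi, p, n):
    """Var_p(mu_hat) computed from the last line of (5)."""
    m = len(values)
    mu = sum(x * q for x, q in zip(values, pi))                   # mu_X
    sigma2 = sum((x - mu) ** 2 * q for x, q in zip(values, pi))   # sigma_X^2
    x_bar = sum(values) / m                                       # unweighted mean of the x_i
    spread = sum((x - x_bar) ** 2 for x in values) / m
    inner = p * sigma2 + (1 - p) * spread + p * (1 - p) * (mu - x_bar) ** 2
    return inner / (n * p ** 2)

if __name__ == "__main__":
    values, pi = [0, 1, 2, 3], [0.4, 0.3, 0.2, 0.1]   # hypothetical population
    for p in (0.1, 0.3, 0.5, 0.9):
        # the printed variance shrinks as p grows
        print(p, variance_of_mean_estimator(values, pi, p, n=500))
```

At $p=1$ the expression reduces to $\sigma _X^2/n$, the variance of the sample mean under direct questioning.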

4 Privacy protection

In this section we consider the degree of privacy protection available to respondents in a randomized response survey where the randomization device is as described in Sect. 2. In the literature, while studying the respondent privacy aspect for dichotomous populations, Leysieffer and Warner (1976) studied the case where both $A$ and $A^c$ are sensitive categories while Lanke (1975) also considered the case where only $A$ is sensitive and there is no jeopardy in a ‘no’ answer to the sensitive question. For polychotomous populations, Loynes (1976) studied two cases, one where all categories are stigmatizing and another where one of the categories is not stigmatizing.

In line with these studies for qualitative stigmatizing variables, we too consider the privacy issue for discrete-valued variables for two situations, one where all the $m$ values of $X$ are stigmatizing and another where not all values of $X$ are stigmatizing. For instance, we may want to study the number of times a person has voted for a certain political party in the last 5 elections. Here possible values of $X$ are $X=0, \ldots , 5$, all of which may be sensitive. On the other hand, if we are studying the number of times one has under-reported his income for income tax, the value $X=0$ is not stigmatizing, but any value of $X$ larger than zero may well be sensitive. Again, if one is studying the number of induced abortions, then values $X=0$ or $X=1$ may not be considered stigmatizing but higher values of $X$ may be deemed to be so. Many such examples may be cited to show that both these situations arise commonly in practice. We will show that we require separate privacy protection measures in these cases.

For a randomly chosen respondent from the population, the ‘true’ probability that the value of $X$ for this respondent equals $x_i$ is given by Prob$(X=x_i)$. On the other hand, when this respondent gives a randomized response, say $x_j$, then the probability that the value of $X$ for this respondent equals $x_i$ is now given by the conditional probability Prob$(X=x_i|R=x_j)$, or the ‘revealing’ probability. A respondent will be assured that his privacy is protected if he can be convinced that given his response, the probability of his having a particular value of $X$ does not change much, i.e., he needs to be assured that the difference between his true probability and his revealing probability is as small as possible, for all possible true values and all responses. With this in mind, in the next subsections we develop measures of privacy protection separately for the different cases. Further, in a given situation, for a certain target level of privacy protection by the relevant measure, we obtain a range of values of $p$ for the randomization device which can achieve this.

4.1 All values of $X$ are stigmatizing

Suppose all the values $x_1, \ldots , x_m$ are stigmatizing. In this case, a respondent would feel comfortable in participating in the survey if the perception of his having a value $X=x_i$ is not much altered after knowing his randomized response, for all $1\le i\le m$. This would require that his true and revealing probabilities be sufficiently close. Starting from this basic premise we define

$$\begin{aligned} \alpha _{ij} = |{\mathrm {Prob}}(X=x_i|R=x_j) -{\mathrm {Prob}}(X=x_i)| \end{aligned}$$

(6)

and since each respondent would want $\alpha _{ij}$ to be as small as possible for all $1\le i,j \le m$, as a measure of privacy protection we propose the following measure:

$$\begin{aligned} \alpha = \max _{1\le i,j \le m} \alpha _{ij}. \end{aligned}$$

(7)

A randomization device with a privacy protection value $\alpha =\alpha _0$ would guarantee that the discrepancies between the true and revealing probabilities will be at most $\alpha _0$ for all respondents, irrespective of their true values. Thus a device which results in a lower value of $\alpha $ gives a higher level of privacy protection than one with a higher value of $\alpha $.
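For a given population and device parameter, the measure in (7) can be evaluated directly. The following sketch computes the revealing probabilities by Bayes' theorem from (2) and (3) and then takes the largest discrepancy; the proportions used are hypothetical, and in practice they are unknown, which is why a bound that holds for every population is sought.

```python
def revealing_probability(i, j, pi, p):
    """Prob(X = x_i | R = x_j) by Bayes' theorem, using (2) and (3); 0-based indices."""
    m = len(pi)
    numerator = ((p if i == j else 0.0) + (1 - p) / m) * pi[i]
    denominator = p * pi[j] + (1 - p) / m     # Prob(R = x_j), positive when p < 1
    return numerator / denominator

def alpha(pi, p):
    """Privacy measure (7): largest |revealing probability - true probability|."""
    m = len(pi)
    return max(abs(revealing_probability(i, j, pi, p) - pi[i])
               for i in range(m) for j in range(m))

if __name__ == "__main__":
    pi = [0.5, 0.3, 0.2]                      # hypothetical population proportions
    for p in (0.1, 0.4, 0.8):
        print(p, alpha(pi, p))                # alpha grows with p: less protection
```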

Suppose the scientist planning a certain survey would like to keep the privacy protection available to respondents above a certain threshold, i.e., would like to achieve $\alpha \le L$, where $L$ is a pre-assigned quantity, $0< L < 1$. Moreover, this bound on $\alpha $ should hold irrespective of the unknown values of $\pi _1, \ldots , \pi _m.$ The following theorem shows how the device parameter can be chosen to achieve this.

Theorem 1

For $\alpha $ as in (7) and a preassigned L, where $0< L < 1$, $\alpha \le L $ will hold, irrespective of the values of $\pi _1, \ldots , \pi _m, $ if and only if $p \le p_0$, where

$$\begin{aligned} p_0 = \frac{1}{1+ \frac{m}{L}(\frac{1-L}{2})^2}. \end{aligned}$$

(8)

Proof

From (1)-(3), using Bayes’ Theorem it follows that for $1\le i,j\le m$,

$$\begin{aligned} {\mathrm {Prob}} (X=x_i|R=x_j) = \frac{(p\delta _{ij} + \frac{1-p}{m})\pi _i}{\sum _{u=1}^m(p\delta _{ju} + \frac{1-p}{m})\pi _u}= \frac{(p\delta _{ij} + \frac{1-p}{m})\pi _i}{p\pi _j + \frac{1-p}{m}}, \end{aligned}$$

(9)

where $\delta _{ij}$ is the Kronecker delta. Hence from (6) it follows that $ \alpha _{ij} = \frac{p\pi _i|\pi _j-\delta _{ij}|}{p\pi _j + \frac{1-p}{m}}$ and for any $i\ne j$,

$$\begin{aligned} \alpha _{ij} =\frac{p\pi _i\pi _j}{p\pi _j + \frac{1-p}{m}}\le \frac{p(1-\pi _j)\pi _j}{p\pi _j + \frac{1-p}{m}} = \alpha _{jj}, \end{aligned}$$

(10)

as $\pi _i + \pi _j \le 1$ for all $i \ne j$. Thus $ \alpha = \max _{1\le j\le m} \alpha _{jj} = \max _{1\le j\le m} \frac{\pi _j(1-\pi _j)}{\pi _j + \frac{1-p}{mp}}. $ Hence, $\alpha \le L$ if and only if

$$\begin{aligned} \pi _j(1-\pi _j) -L\pi _j \le \frac{L(1-p)}{mp} \ \ \text{for all} \ \ 1\le j \le m. \end{aligned}$$

(11)

First suppose $p\le p_0$. Then for $1\le j \le m$,

$$\begin{aligned} \pi _j(1-\pi _j) -L\pi _j&= \left( \frac{1-L}{2}\right) ^2 - \left( \frac{1-L}{2} - \pi _j \right) ^2 \nonumber \\&\le \left( \frac{1-L}{2}\right) ^2 \nonumber \\&= \frac{L(1-p_0)}{mp_0}, \ \ \text{using the expression of } p_0 \text{ in (8)} \nonumber \\&\le \frac{L(1-p)}{mp}, \ \ \text{since } p\le p_0. \end{aligned}$$

Thus the inequalities in (11) hold, or equivalently $\alpha \le L$, irrespective of the values of $\pi _1, \ldots , \pi _m.$

To prove the converse, suppose $\alpha \le L$, or equivalently, the inequalities in (11) hold, irrespective of the values of $\pi _1, \ldots , \pi _m.$ Then, for $\pi _1=\frac{1-L}{2}, \pi _2 = \frac{1+L}{2}, \pi _3 = \ldots = \pi _m=0$, in particular, these inequalities will also hold. So, for this choice of $\pi _j$ values in (11) with $j=1$, we have

$$\begin{aligned}&\left( \frac{1-L}{2}\right) \left( \frac{1+L}{2}\right) - L\left( \frac{1-L}{2}\right) \le \frac{L(1-p)}{mp}, \nonumber \\&\text{i.e., } \ \ \left( \frac{1-L}{2} \right) ^2 \le \frac{L(1-p)}{mp}, \nonumber \\&\text{i.e., } \ \ \frac{m}{L}\left( \frac{1-L}{2} \right) ^2 \le \frac{1-p}{p}= \frac{1}{p} -1, \nonumber \\&\text{i.e., } \ \ \frac{1}{p_0} \le \frac{1}{p}, \ \ \text{using the expression of } p_0 \text{ in (8)}. \end{aligned}$$

(12)

Hence $p\le p_0$, which proves the theorem. $\square $

Remark 1

It is clear from (8) that in order to maintain the same level of protection, the value of $p_0$ monotonically decreases with the number of possible values of $X$. Again, for a given number of possible values of $X$, $p_0$ monotonically increases with L. We may reiterate that these values of $p$ do not depend on how the values of $X$ are distributed in the population.
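The bound in (8) is simple to compute. The sketch below, with a hypothetical function name, evaluates $p_0$ and illustrates the two monotonicity properties noted in Remark 1.

```python
def p0_all_sensitive(m: int, L: float) -> float:
    """Largest device parameter p that guarantees alpha <= L, from (8)."""
    if m < 1 or not 0 < L < 1:
        raise ValueError("need m >= 1 and 0 < L < 1")
    return 1.0 / (1.0 + (m / L) * ((1.0 - L) / 2.0) ** 2)

if __name__ == "__main__":
    print(round(p0_all_sensitive(4, 0.1), 4))   # 0.1099, the value in Example 5.1
    # p0 decreases with m for fixed L ...
    print([p0_all_sensitive(m, 0.1) for m in (2, 3, 4, 5)])
    # ... and increases with L for fixed m
    print([p0_all_sensitive(4, L) for L in (0.05, 0.1, 0.2, 0.3)])
```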

4.2 Not all values of $X$ are stigmatizing

In many surveys it may so happen that not all values of $X$ are sensitive or stigmatizing. For instance, in a survey for estimating the average number of criminal convictions of persons in a certain population, the value $X=0$ is not stigmatizing but any value of $X \ge 1$ could well be stigmatizing. Similarly, for a survey for estimating the average of the number (X) of induced abortions, the values $X=0$ or $X=1$ might not be considered as stigmatizing values while other larger values might be considered stigmatizing by the respondents.

To study the respondents’ privacy protection for such surveys, we first present the simpler case where only one of the values of $X$, say $x_1$, is not stigmatizing, while values $x_2, \ldots , x_m$ are considered stigmatizing. We develop the protection measure for this case in detail. Later we remark that the results obtained for this case may be easily extended to the case where $X$ has more than one non-stigmatizing value.

As before, the data collection and estimation proceed as in Sects. 2 and 3. To study the respondent protection we note that since the value $x_1$ is non-stigmatizing, respondents will feel comfortable with a randomization device for which the ‘revealing’ probability of their having a true value $x_1$ will be large. So, we propose the following measure of privacy:

$$\begin{aligned} \beta = \min _{1\le j\le m} P(X=x_1|R=x_j) = \min _{1\le j\le m} \frac{(p\delta _{1j} + \frac{1-p}{m})\pi _1}{p\pi _j + \frac{1-p}{m}}, \end{aligned}$$

(13)

on simplification using (9). A device with a privacy protection value $\beta $ will guarantee that all respondents are perceived to have $X=x_1$ with probability at least $\beta $. So, a device leading to a larger value of $\beta $ will ensure greater privacy to respondents than one with a smaller $\beta $.
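As with $\alpha $, the measure in (13) can be evaluated for any assumed population. The sketch below does so, taking the first entry of the list of proportions to be $\pi _1$; the numbers are hypothetical.

```python
def beta(pi, p):
    """Privacy measure (13): min over j of Prob(X = x_1 | R = x_j).

    pi[0] plays the role of pi_1, the proportion with the non-stigmatizing value.
    """
    m = len(pi)
    base = (1 - p) / m
    return min(((p if j == 0 else 0.0) + base) * pi[0] / (p * pi[j] + base)
               for j in range(m))

if __name__ == "__main__":
    pi = [0.5, 0.3, 0.2]                  # hypothetical population proportions
    for p in (0.1, 0.5, 0.9):
        print(p, beta(pi, p))             # beta shrinks as p grows: less protection
```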

Let $L$, $0<L<1$, denote a preassigned level of respondents’ privacy. Then in order to achieve this level of protection we require that $\beta \ge L$, irrespective of the values of $\pi _1, \ldots , \pi _m$. Thus we should have

$$\begin{aligned} (p\delta _{1j}+\frac{1-p}{m})\pi _1 \ge L (p\pi _j + \frac{1-p}{m}), \ \ \ 1\le j\le m, \end{aligned}$$

or equivalently, the following inequalities should hold:

$$\begin{aligned}{}[p(1-L) + \frac{1-p}{m}]\pi _1&\ge \frac{L(1-p)}{m} \end{aligned}$$

(14)

$$\begin{aligned} \mathrm{and} \ \ \ \frac{1-p}{m}\pi _1 -L p\pi _j&\ge \frac{L(1-p)}{m}, \ \ 2\le j \le m. \end{aligned}$$

(15)

Clearly, for any given $L$, no $p$ can satisfy (14) and (15) irrespective of $\pi _1, \ldots , \pi _m$, since both inequalities fail as $\pi _1 \rightarrow 0.$ So we assume that $\pi _1 > 0$ and we also assume some prior knowledge about a lower bound on $\pi _1$. This assumption is quite realistic because in most populations there will be an appreciable number of persons with a non-stigmatizing variable value and hence, a lower bound to the proportion of such stigma-free persons in the population will be available.

Thus, suppose we have prior knowledge that $\pi _1 \ge c$. We work with $L<c$. This is again realistic because if the only knowledge about $\pi _1$ is that $\pi _1\ge c$, it is impractical to demand that $P(X=x_1|R=x_j) \ge L (\ge c) $ for all $j$. Now, the following theorem gives the value of the device parameter $p$ which will guarantee the desired level of respondent protection L.

Theorem 2

Let $\beta $ be as in (13) and $\pi _1 \ge c$ for some known $c$. Then given a preassigned L, where $0< L < c $, $\beta \ge L $ will hold, irrespective of the values of $\pi _1, \ldots , \pi _m, $ if and only if $p \le p_0$, where

$$\begin{aligned} p_0 = \frac{\frac{c-L}{m}}{\frac{c-L}{m} + L(1-c)}. \end{aligned}$$

(16)

Proof

Since $\pi _1 \ge c$, it is clear that $\pi _j \le 1-c$ for $2\le j\le m$ and we have

$$\begin{aligned} \left[ p(1-L) + \frac{1-p}{m}\right] \pi _1&\ge \left[ p(1-L) + \frac{1-p}{m}\right] c \nonumber \\ \mathrm{and } \ \ \ \frac{1-p}{m}\pi _1 - L p\pi _j&\ge \frac{1-p}{m} c - L p(1-c), \ \ 2\le j \le m. \nonumber \end{aligned}$$

As a result, (14) and (15) will hold, irrespective of the true values of $\pi _1(\ge c), \pi _2, \ldots , \pi _m$ iff

$$\begin{aligned} \left[ p(1-L) + \frac{1-p}{m}\right] c&\ge L \frac{1-p}{m}\end{aligned}$$

(17)

$$\begin{aligned} \mathrm{and } \ \ \ \frac{1-p}{m}c -L p(1-c)&\ge L \frac{1-p}{m} \end{aligned}$$

(18)

hold. Now, (17) reduces to

$$\begin{aligned} \left( p+\frac{1-p}{m}\right) c \ge L \left( cp + \frac{1-p}{m}\right) \end{aligned}$$

which will always hold for every $p$ since $L (cp+\frac{1-p}{m}) \le L(p+\frac{1-p}{m}) < c(p+\frac{1-p}{m})$ as $L<c$ and $p+\frac{1-p}{m} >0.$ So, it is enough to only consider (18). Note that

$$\begin{aligned} (18) \Leftrightarrow \frac{c-c p}{m} - L p(1-c)&\ge \frac{L-L p}{m} \\ \Leftrightarrow p&\le \frac{\frac{c-L}{m}}{\frac{c-L}{m} + L(1-c)} = p_0, \end{aligned}$$

thus proving the theorem. $\square $
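The bound in (16) can be computed and checked against its worst case, $\pi _1 = c$ and $\pi _j = 1-c$ for some $j \ge 2$, where (18) holds with equality. The function names below are hypothetical.

```python
def p0_one_nonsensitive(m: int, c: float, L: float) -> float:
    """Largest p that guarantees beta >= L when pi_1 >= c, from (16)."""
    if not 0 < L < c <= 1:
        raise ValueError("need 0 < L < c <= 1")
    slack = (c - L) / m
    return slack / (slack + L * (1 - c))

def worst_case_beta(m: int, c: float, p: float) -> float:
    """Prob(X = x_1 | R = x_j), j >= 2, when pi_1 = c and pi_j = 1 - c."""
    base = (1 - p) / m
    return base * c / (p * (1 - c) + base)

if __name__ == "__main__":
    p0 = p0_one_nonsensitive(3, 0.15, 0.10)
    print(round(p0, 4))                   # 0.1639, the value in Example 5.2
    print(worst_case_beta(3, 0.15, p0))   # equals L = 0.10 up to rounding error
```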

Remark 2

The above discussion can be extended to include the more general case where $X$ has $t$ non-stigmatizing values $x_1, \ldots , x_t$, say, while its remaining $m-t$ values are stigmatizing, $1<t<m.$ In that case too, it can be shown that $p_0$ takes the form as in Theorem 2, but now with

$$\begin{aligned} \beta =\min _{1\le j \le m} P(X=x_1 \text{ or } x_2 \text{ or } \ldots \text{ or } x_t|R=x_j) \ \text{ and } \ \pi _1+\cdots + \pi _t \ge c \ \text{ with } \ L<c. \end{aligned}$$

5 Privacy protection together with efficiency in estimation

We now consider the issue of efficiency in estimation together with privacy protection in randomized response surveys. It was seen from (5) that, irrespective of the values of $\pi _1, \ldots , \pi _m$, the efficiency of estimation may be increased by increasing $p$. On the other hand, for a given $L$ and irrespective of the values of $\pi _1, \ldots , \pi _m$, Theorems 1 and 2 show that a protection of level $L$ may be guaranteed iff $p\le p_0$, where $p_0$ is as in (8) or (16), respectively. So, the best choice of $p$ with regard to maximizing the efficiency of estimation of $\mu _X$, subject to the stipulated level of privacy protection $L$, is $p=p_0$. If we use a randomization device with $p$ equal to any value less than $p_0$, then the efficiency of estimation will be less than that with $p=p_0$, even though the level of protection will still be $L$. The following examples illustrate this.

Example 5.1

Let $X$ take four values which are all sensitive. Suppose $L =0.1$. Then by Theorem 1, taking $ m=4$, we get $p_0 = 0.1099.$ So, if we use a randomization device with $p=0.1099$ then the efficiency of estimation can be maximized while guaranteeing that the maximum discrepancy between the true probability and the revealing probability of all respondents will be at most 0.1. However, if we use a device with $p>p_0$, then the desired level of privacy protection will not be realized.

Table 1 gives the $p_0$ values in (8) for some choices of $L$ and $m$ for achieving maximum efficiency of estimation.

Table 1 Values of $p_0$ for various $m$ and $L$

For given $m$ and $L$, if we use a device with $p<p_0$, then the efficiency of estimation will drop, even though the level of privacy protection will be guaranteed. To illustrate how the efficiency changes as $p$ decreases from $p_0$, we use the case of $m=3$, assuming for illustration that $X$ takes the values $X=1, 2, 3$ with probabilities $0.50, 0.35$ and $0.25,$ respectively, in the population. In Table 2, for some illustrative values of $p$, we give the values of relative efficiency, defined as: ${\mathrm {RelEff}}(p) = \{{\mathrm {Var}}_{p_0}(\hat{\mu }_X)\}/\{{\mathrm {Var}}_{p}(\hat{\mu }_X)\}$ where ${\mathrm {Var}}_p(\hat{\mu }_X)$ is as given by (5). The $p_0$ values used are as given in Table 1.

Table 2 Relative Efficiencies for various values of $p$

Figure 1 shows how, for various values of $L$, the relative efficiencies drop as $p$ is decreased from the optimal $p_0$ value.
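The relative efficiency can be sketched as follows. Since the sample size $n$ cancels in the ratio, the code works with $n\,{\mathrm {Var}}_p(\hat{\mu }_X)$. The population proportions below are a hypothetical choice for illustration and are not the ones behind Table 2, so the printed numbers are not meant to reproduce that table.

```python
def scaled_variance(values, pi, p):
    """n * Var_p(mu_hat): the variance of one reported value divided by p^2, as in (5)."""
    m = len(values)
    lam = [p * q + (1 - p) / m for q in pi]           # Prob(R = x_i), see (4)
    first = sum(x * l for x, l in zip(values, lam))
    second = sum(x * x * l for x, l in zip(values, lam))
    return (second - first ** 2) / p ** 2

def relative_efficiency(values, pi, p, p0):
    """RelEff(p) = Var_{p0}(mu_hat) / Var_p(mu_hat)."""
    return scaled_variance(values, pi, p0) / scaled_variance(values, pi, p)

if __name__ == "__main__":
    values, pi = [1, 2, 3], [0.5, 0.3, 0.2]           # hypothetical population
    m, L = 3, 0.1
    p0 = 1.0 / (1.0 + (m / L) * ((1.0 - L) / 2.0) ** 2)   # p0 from (8)
    for frac in (1.0, 0.75, 0.5, 0.25):
        p = frac * p0
        print(round(p, 4), relative_efficiency(values, pi, p, p0))
```

By construction ${\mathrm {RelEff}}(p_0)=1$, and the ratio falls below one for every $p<p_0$ because (5) is decreasing in $p$.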

Example 5.2

Let $X$ take one nonsensitive value and two sensitive values. Suppose it can be assumed that at least 15% of the individuals in the population possess the nonsensitive value and suppose it is stipulated that $L= 0.10$. Then by Theorem 2, taking $m=3, c=0.15, L=0.1$, we obtain $p_0 = 0.1639$. So, if we use a device with $p=0.1639$ then estimation efficiency will be maximum while guaranteeing that all respondents will have at least a 10% probability of being revealed as belonging to the non-stigmatizing class. As in Example 5.1, in this case too, the relative efficiencies drop as $p$ is decreased from the optimal $p_0$ value.

6 Estimation of population proportions

As mentioned in Sect. 1, several researchers have estimated the proportions of individuals belonging to the two categories in dichotomous populations, while Loynes (1976) extended this to estimating the different proportions in a polychotomous population. In our case where $X$ takes $m$ numerical values, we may also readily estimate the population proportions $\pi _1, \ldots , \pi _m$ from the responses collected as in Sect. 2 and again use the measures of privacy as given in (7) and (13) to achieve the stipulated level of privacy protection.

As seen in Sect. 3, an unbiased estimate of $\pi _i$ is

$$\begin{aligned} \hat{\pi }_i = \frac{1}{p}(w_i -\frac{1-p}{m}), \ \ \ \ 1\le i\le m. \end{aligned}$$

Suppose, in the spirit of $A$-optimality commonly used in optimal design theory, we would like to minimize the average variance of these estimates. For this, we can show that the sum of the variances of the estimates of $\pi _i$ is given by

$$\begin{aligned} \sum _{i=1}^m {\mathrm {Var}}_p (\hat{\pi }_i) = \frac{1}{np^2} \sum _{i=1}^m \lambda _i (1-\lambda _i) = \frac{1}{n}\left\{ \frac{1}{p^2} - \sum _{i=1}^m \pi _i^2 - \frac{1}{m}\left( \frac{1}{p^2} -1\right) \right\} , \end{aligned}$$

(19)

on simplification, using (4). Clearly, (19) is decreasing in $p$, irrespective of the true values of $\pi _1, \ldots , \pi _m.$ So as in the case of estimating the mean, here too, subject to the stipulated level $L$ of privacy protection, the best choice for $p$ for minimizing the average variance of the estimates of the proportions may be obtained by applying Theorem 1 or 2, as the case may be. So, if all categories are sensitive, one uses $p=p_0$, with $p_0$ being given by (8) and if not all categories are sensitive, one uses $p_0$ given by (16). The following example illustrates this in the popular case of dichotomous populations.

Example 6.1

Suppose in a dichotomous population both categories are sensitive and we have to estimate the proportion of persons with these traits. Then the equivalent problem in our context is one where $m=2$, i.e., $X$ can take only two values $x_1$ and $x_2$ and we want to estimate the population proportions $\pi _1$ and $\pi _2.$ This is because on the basis of values $x_1$ and $x_2$, the population units can be divided into 2 groups, say $A$ and $A^c$.

When both $A$ and $A^c$ are sensitive, given $L$, one can apply Theorem 1 and compute $p_0$ using (8). For various levels of privacy protection as quantified by some illustrative values of $L$, the corresponding values of $p_0$ are given in Table 3, while the relative efficiency values for other values of $p$ are shown in Table 4. When $A$ is sensitive and $A^c$ is not, we can proceed similarly by applying Theorem 2.

Table 3 Values of $p_0$ for various $L$ in a dichotomous population


Table 4 Relative Efficiencies for various values of $p$ in a dichotomous population


Figure 2 shows how the relative efficiencies drop as $p$ is decreased from the optimal $p_0$ value.

7 Concluding remarks

In this paper we have proposed a randomized response scheme for use in surveys where the sensitive or stigmatizing variable of interest is a discrete quantitative variable and the target is to estimate the population mean. We focus on the privacy protection afforded to respondents who participate in a randomized response survey under this scheme and develop measures of that protection. We study two broad situations: one where all values of the variable are sensitive and another where not all values are sensitive. In the latter situation we first treat the case where only one of the possible values of $X$ is non-stigmatizing and all remaining values are stigmatizing, and then generalize to the case where $t$ of the values of $X$ are non-stigmatizing and the remaining values are stigmatizing. We give examples to show that all these cases can arise in surveys.

We develop measures of privacy protection in these two situations and show how, given a target level of privacy protection, the randomization device parameter may be chosen to achieve this level. Finally, we obtain the optimal value of the device parameter, which maximizes the efficiency of estimation while guaranteeing the desired level of privacy protection.

We show that our results may also be applied to ensure a desired level of privacy protection in traditional studies with qualitative sensitive attributes in dichotomous (or polychotomous) populations, where the target is to efficiently estimate the proportion (or proportions) of persons bearing the sensitive attribute (or attributes).

The issue of privacy protection when the sensitive variable is continuous and quantitative is yet to be developed. There is a need for a randomized response technique for such variables when the objective is to estimate the population mean efficiently while ensuring a given level of privacy protection. This problem is currently under investigation.

References

Anderson H (1977) Efficiency versus protection in a general randomized response model. Scand J Stat 4:11–19
Arnab R, Dorffner G (2007) Randomized response techniques for complex survey designs. Stat Papers 48:131–141
Barabesi L, Franceschi S, Marcheselli M (2012) A randomized response procedure for multiple-sensitive questions. Stat Papers 53:703–718
Chaudhuri A (2011) Randomized response and indirect questioning techniques in surveys. CRC Press, Boca Raton
Chaudhuri A, Bose M, Dihidar K (2011a) Estimation of a sensitive proportion by Warner’s randomized response data through inverse sampling. Stat Papers 52:343–354
Chaudhuri A, Bose M, Dihidar K (2011b) Estimating sensitive proportions by Warner’s randomized response technique using multiple randomized responses from distinct persons sampled. Stat Papers 52:111–124
Chaudhuri A, Mukerjee R (1987) Randomized response techniques: a review. Stat Neerlandica 41:27–44
Chaudhuri A, Mukerjee R (1988) Randomized responses: theory and techniques. Marcel Dekker, New York
Christofides TC (2005) Randomized response in stratified sampling. J Stat Plann Inference 128:303–310
Chua TC, Tsui AK (2000) Procuring honest responses indirectly. J Stat Plann Inference 90:107–116
Diana G, Perri PF (2009) Estimating a sensitive proportion through randomized response procedures based on auxiliary information. Stat Papers 50:661–672
Diana G, Perri PF (2011) A class of estimators for quantitative sensitive data. Stat Papers 52:633–650
Giordano S, Perri PF (2012) Efficiency comparison of unrelated question models based on same privacy protection degree. Stat Papers 53:987–999
Kim J (2007) A stratified unrelated question randomized response model. Stat Papers 48:215–233
Kuk AYC (1990) Asking sensitive questions indirectly. Biometrika 77:436–438
Lanke J (1975) On the choice of the unrelated question in Simmons’ version of randomized response. J Amer Stat Assoc 70:80–83
Lanke J (1976) On the degree of protection in randomized interviews. Int Stat Rev 44:197–203
Leysieffer RW, Warner SL (1976) Respondent jeopardy and optimal designs in randomized response models. J Amer Stat Assoc 71:649–656
Ljungqvist L (1993) A unified approach to measures of privacy protection in randomized response models: a utilitarian perspective. J Amer Stat Assoc 88:97–103
Loynes RM (1976) Asymptotically optimal randomized response procedures. J Amer Stat Assoc 71:924–928
Mangat NS (1994) An improved randomized response strategy. J Roy Stat Soc 56:93–95
Nayak TK, Adeshiyan SA (2009) A unified framework for analysis and comparison of randomized response surveys of binary characteristics. J Stat Plann Inference 139:2757–2766
Pal S (2008) Unbiasedly estimating the total of a stigmatizing variable from a complex survey on permitting options for direct or randomized responses. Stat Papers 49:157–164
Van den Hout A, Van der Heijden PGM (2002) Randomized response, statistical disclosure control and misclassification: a review. Int Stat Rev 70:269–288
Warner SL (1965) Randomized response: a survey technique for eliminating evasive answer bias. J Amer Stat Assoc 60:63–69


Acknowledgments

The author is grateful to the reviewers for their careful reading of the earlier version and highly constructive comments.

Author information

Authors and Affiliations

Indian Statistical Institute, Kolkata, 700108, India
Mausumi Bose


Corresponding author

Correspondence to Mausumi Bose.


About this article

Cite this article

Bose, M. Respondent privacy and estimation efficiency in randomized response surveys for discrete-valued sensitive variables. Stat Papers 56, 1055–1069 (2015). https://doi.org/10.1007/s00362-014-0624-4


Received: 18 September 2013
Revised: 24 December 2013
Published: 19 August 2014
Issue Date: November 2015
DOI: https://doi.org/10.1007/s00362-014-0624-4

Keywords

MSC Classification

62D05
