Mahdizadeh, M.; Zamanzade, Ehsan

doi:10.1007/s00180-018-0795-x

Interval estimation of $P(X<Y)$ in ranked set sampling

Original Paper
Published: 07 February 2018

Volume 33, pages 1325–1348, (2018)
Computational Statistics

Abstract

This article deals with constructing a confidence interval for the reliability parameter using ranked set sampling. Some asymptotic and resampling-based intervals are suggested, and compared with their simple random sampling counterparts using Monte Carlo simulations. Finally, the methods are applied on a real data set in the context of agriculture.

1 Introduction

Ranked set sampling (RSS) is a data collection method introduced by McIntyre (1952) in an agricultural context involving pasture yields. It serves as an alternative to the usual simple random sampling (SRS) in situations in which exact measurements of sample units are difficult or expensive to obtain but judgment ranking of them according to the variable of interest is relatively easy and cheap. The judgment ranking is usually performed visually (by a field expert, say), or using one or more concomitant variables, but it must not require actual measurements on the selected units.

The RSS design can be explained as follows:

1. Draw k random samples, each of size k, from the target population.
2. Apply the judgement ordering, by any cheap method without the actual measurement of the variable of interest, on the elements of the rth ($r=1, \ldots ,k$) sample and identify the $r\hbox {th}$ smallest unit.
3. Actually measure the k units identified in step 2.
4. Repeat steps 1–3 m times (cycles), if needed, to obtain a ranked set sample of size mk.
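The four steps above can be sketched in a simulation setting as follows. This is only an illustration under stated assumptions: the function name `draw_rss` and the `sampler` callback are hypothetical, and ranking is done by sorting the true values, which mimics perfect judgment ranking. In a real study the ranking would be made without measuring the units.

```python
import numpy as np

def draw_rss(sampler, k, m, rng):
    """Return an (m, k) array whose [i, r] entry is the unit ranked r+1 in cycle i."""
    sample = np.empty((m, k))
    for i in range(m):                   # step 4: m cycles
        for r in range(k):
            units = sampler(k, rng)      # step 1: a fresh set of k units
            ranked = np.sort(units)      # step 2: ranking by true value (assumed perfect ranking)
            sample[i, r] = ranked[r]     # step 3: only the (r+1)-th smallest unit is "measured"
    return sample

# Hypothetical usage: standard normal population, set size 3, 4 cycles.
rng = np.random.default_rng(0)
x = draw_rss(lambda size, gen: gen.standard_normal(size), k=3, m=4, rng=rng)
```

Each cycle uses k sets of k units, so $mk^2$ units are ranked while only mk are measured.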

Let $X_{[r]i}$ be the $r\hbox {th}$ judgement order statistic from the ith cycle. This is a standard notation in the RSS literature (see Chapter 1 in Chen et al. 2004, for example). Then, the resulting ranked set sample is denoted by $\{X_{[r]i}: r=1,\ldots ,k\,;i=1,\ldots ,m \}$. The design parameter k is called set size. To facilitate the judgment ranking, the set size should be kept small in practice, say 2–8. Nonetheless, larger set sizes can be used as long as the ranking process is not hampered.

A ranked set sample comprising m cycles and with set size k exploits information about far more units than a simple random sample of size mk. In the RSS, the judgment ranking information about $mk(k-1)$ unmeasured units contributes to drawing a more representative sample. The SRS, however, has no mechanism for incorporating the judgment ranking information. Thus, the RSS-based procedures are usually more efficient than their SRS competitors. The extent of improvement hinges on the accuracy of the judgment ranking. The RSS has been applied in a variety of fields, including forestry (Halls and Dell 1966), entomology (Howard et al. 1982), environmental monitoring (Kvam 2003), clinical trials and genetic quantitative trait loci mappings (Chen 2007), segmentation of Terahertz images (Ayech and Ziou 2015), and medicine (Zamanzade and Mahdizadeh 2017).

In reliability theory, the probability $\theta =P(X<Y)$ represents the reliability of a stress–strength model, where X and Y represent the stress and strength variables, respectively. This probability also quantifies the steady state availability of a repairable system with X and Y denoting the repair time and lifetime of the system, respectively. In fact, $\theta $ provides a general measure of the difference between two populations, which has found applications in diverse areas (Kotz et al. 2003). For example, it is a measure of household financial fragility in economics when X and Y are disposable household income and consumption, respectively. In medicine, it is interpreted as a measure of a treatment’s effectiveness if X and Y are the response variables from control and treatment groups, respectively. The latter situation is illustrated using a real data set in Sect. 5.

It is well known that a point estimate is generally different from the true parameter value, say $\vartheta $. Moreover, a point estimate by itself carries no measure of its precision. Interval estimation conveys more information about the data used to obtain the point estimate: it provides a range that captures $\vartheta $ with a stated degree of confidence. Interval estimators are called confidence intervals (CIs). Let L and U be two statistics such that $P\left( L< \vartheta <U\right) =1-\alpha $, for some $\alpha \in (0,1)$. Then, (L, U) is a CI for $\vartheta $ with coverage probability (confidence level) $1-\alpha $. The width of the CI reflects the amount of variability inherent in the point estimate. A good interval should be relatively narrow on average, with a high probability of enclosing the true parameter.

This article deals with constructing some CIs for $\theta $ in the RSS design. It is worth noting that point estimation of different population attributes has been comprehensively studied in the RSS literature, while hypothesis testing and interval estimation problems have received little attention (see Chen et al. 2004; Wolfe 2012; Chapter 15 in Hollander et al. 2014 for good reviews of the RSS and its applications). Yin et al. (2016) proposed a CI for $\theta $ based on kernel density estimation. A simpler approach is to use the empirical distribution function, which has not been investigated yet. We set out to fill this gap in this work. It emerges that the resulting intervals have an edge over the existing one.

In Sect. 2, our point estimator is introduced and its theoretical properties are studied. Some estimators of the variance of this estimator are also presented. In Sect. 3, six types of intervals are developed. Section 4 contains results of Monte Carlo simulations assessing the performance of the suggested intervals in terms of coverage probability and expected length. An agricultural data set is analyzed in Sect. 5. Final conclusions appear in Sect. 6. Proofs are deferred to an appendix.

2 Nonparametric estimation

Let $\{X_{[r]i}: r=1,\ldots ,k\,;i=1,\ldots ,m \}$ and $\{Y_{[s]j}: s=1,\ldots ,\ell \,;j=1,\ldots ,n \}$ be independent ranked set samples from two populations with the distribution functions F and G, respectively. Also, the survival function associated with G is denoted by ${\bar{G}}$. The standard estimator of $\theta $ is given by

$$\begin{aligned} {{\hat{\theta }}}_{\text {RSS}}=\frac{1}{mk n \ell } \sum _{i=1}^m \sum _{j=1}^n \sum _{r=1}^k \sum _{s=1}^\ell I(X_{[r]i}<Y_{[s]j}), \end{aligned}$$

(1)

where $I\left( .\right) $ is the indicator function.
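A minimal sketch of how (1) can be computed: the estimator is simply the fraction of all $(X,Y)$ pairs with $X<Y$, so the cycle and rank labels play no role at this stage. The function name `theta_hat` and the toy numbers are illustrative assumptions, not data from the paper.

```python
import numpy as np

def theta_hat(x, y):
    """Estimator (1): proportion of pairs (X, Y) with X < Y."""
    x = np.asarray(x, dtype=float).ravel()
    y = np.asarray(y, dtype=float).ravel()
    # Compare every X with every Y via broadcasting, then average the indicators.
    return float(np.mean(x[:, None] < y[None, :]))

# Toy data: rows are cycles, columns are ranks.
x = [[1.0, 4.0], [2.0, 6.0]]   # m = 2, k = 2
y = [[3.0, 5.0]]               # n = 1, l = 2
est = theta_hat(x, y)          # 5 of the 8 pairs satisfy X < Y
```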

The properties of ${\hat{\theta }}_{\text {RSS}}$ in the special case of $m=n=1$ were investigated by Sengupta and Mukhuti (2008). They showed that this estimator is unbiased and more efficient than its SRS counterpart, even in the presence of ranking errors. The following result shows asymptotic normality of ${\hat{\theta }}_{\text {RSS}}$.

Proposition 1

Let ${\hat{\theta }}_{\text {RSS}}$ be as in (1), and $N=mk+n\ell $. If $m,n \rightarrow \infty $ and $(mk)/N \rightarrow \lambda \in (0,1)$, then

$$\begin{aligned} \sqrt{N}({\hat{\theta }}_{\text {RSS}}-\theta ) {\mathop {\rightarrow }\limits ^{d}} N\left( 0,\frac{\sigma _1^2}{\lambda }+\frac{\sigma _2^2}{1-\lambda }\right) , \end{aligned}$$

where

$$\begin{aligned} \sigma _1^2=Var\left( {\bar{G}}(X)\right) -\frac{1}{k} \sum _{r=1}^k \left[ E\left( {\bar{G}}(X_{[r]})\right) -\theta \right] ^2, \end{aligned}$$

and

$$\begin{aligned} \sigma _2^2=Var\left( F(Y)\right) -\frac{1}{\ell } \sum _{s=1}^\ell \left[ E\left( F(Y_{[s]})\right) -\theta \right] ^2. \end{aligned}$$

Suppose ${\hat{\theta }}_{\text {SRS}}$ is the counterpart of (1) based on two independent simple random samples of sizes mk and $n\ell $ from F and G, respectively. By virtue of the next result, ${\hat{\theta }}_{\text {RSS}}$ is asymptotically more efficient than ${\hat{\theta }}_{\text {SRS}}$. This statement is valid regardless of the accuracy of the judgment ranking process.

Proposition 2

The asymptotic relative efficiency of ${\hat{\theta }}_{\text {RSS}}$ to ${\hat{\theta }}_{\text {SRS}}$ is

$$\begin{aligned} ARE({\hat{\theta }}_{\text {RSS}},{\hat{\theta }}_{\text {SRS}})=1+\frac{\frac{1-\lambda }{k}\sum _{r=1}^k \left[ E\left( {\bar{G}}(X_{[r]})\right) -\theta \right] ^2 + \frac{\lambda }{\ell }\sum _{s=1}^\ell \left[ E\left( F(Y_{[s]})\right) -\theta \right] ^2 }{\frac{1-\lambda }{k}\sum _{r=1}^k Var\left( {\bar{G}}(X_{[r]})\right) + \frac{\lambda }{\ell }\sum _{s=1}^\ell Var\left( F(Y_{[s]})\right) }. \end{aligned}$$
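As a check on the formula, consider a special case that is an assumption of this example only: perfect ranking, $F=G$ continuous, and $k=\ell $. Then $\theta =1/2$, ${\bar{G}}(X)$ is uniform on (0, 1) with variance 1/12, and ${\bar{G}}(X_{[r]})$ follows a Beta$(k-r+1,r)$ distribution, so that

$$\begin{aligned} \frac{1}{k}\sum _{r=1}^k Var\left( {\bar{G}}(X_{[r]})\right) =\frac{1}{k}\sum _{r=1}^k \frac{r(k-r+1)}{(k+1)^2(k+2)}=\frac{1}{6(k+1)}, \qquad \frac{1}{k}\sum _{r=1}^k \left[ E\left( {\bar{G}}(X_{[r]})\right) -\theta \right] ^2=\frac{1}{12}-\frac{1}{6(k+1)}=\frac{k-1}{12(k+1)}. \end{aligned}$$

By symmetry the same two values are obtained for $F(Y_{[s]})$, so $\lambda $ cancels and

$$\begin{aligned} ARE({\hat{\theta }}_{\text {RSS}},{\hat{\theta }}_{\text {SRS}})=1+\frac{(k-1)/\{12(k+1)\}}{1/\{6(k+1)\}}=\frac{k+1}{2}. \end{aligned}$$

For instance, $k=\ell =5$ gives an asymptotic relative efficiency of 3 in this idealized setting.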

In light of the above result, interval estimation of $\theta $ based on the RSS is expected to be more efficient than that based on the SRS. An estimate of $Var({\hat{\theta }}_{\text {RSS}})$ is needed to propose an interval based on Proposition 1. To the best of our knowledge, this has not been studied yet. In the sequel, we introduce an estimator for $\sigma _1^2$. Similar arguments yield an estimate of $\sigma _2^2$. These are combined to arrive at the final estimator.

Let $\{X_{[r]i}: r=1,\ldots ,k;\, i=1,\ldots ,m\}$ be a ranked set sample from a population with finite mean $\mu $ and variance $\sigma ^2$. If $\mu _{[r]}$ and $\sigma ^2_{[r]}$ denote the mean and variance of $X_{[r]1}$, respectively, then Stokes (1980) showed that

$$\begin{aligned} \sigma ^2=\frac{1}{k} \sum _{r=1}^k \sigma ^2_{[r]}+\frac{1}{k} \sum _{r=1}^k \left( \mu _{[r]}-\mu \right) ^2. \end{aligned}$$

(2)

Suppose the random variable W is defined as $W={\bar{G}}(X)$. Then using (2), we get

$$\begin{aligned} Var(W)=\frac{1}{k} \sum _{r=1}^k Var(W_{[r]})+\frac{1}{k} \sum _{r=1}^k \left[ E(W_{[r]})-E(W) \right] ^2, \end{aligned}$$

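where $W_{[r]}={\bar{G}}(X_{[r]1})$. Since $E(W)=E\left( {\bar{G}}(X)\right) =\theta $, comparing this decomposition with the definition of $\sigma _1^2$ in Proposition 1 gives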

$$\begin{aligned} \sigma _1^2=\frac{1}{k} \sum _{r=1}^k Var(W_{[r]}). \end{aligned}$$

Now, from Equation 3 in MacEachern et al. (2002), one can construct an estimator for $\sigma _1^2$ as

$$\begin{aligned} \hat{\sigma }_1^2=\frac{1}{2k m(m-1)} \sum _{r=1}^k \sum _{i=1}^m \sum _{i'=1}^m \left( \mathcal {W}_{[r]i}-\mathcal {W}_{[r]i'} \right) ^2, \end{aligned}$$

(3)

where

$$\begin{aligned} \mathcal {W}_{[r]i}=\frac{1}{n \ell } \sum _{s=1}^\ell \sum _{j=1}^n I(X_{[r]i}<Y_{[s]j}). \end{aligned}$$

Similarly, an estimator of $\sigma _2^2$ is obtained as

$$\begin{aligned} \hat{\sigma }_2^2=\frac{1}{2\ell n(n-1)} \sum _{s=1}^\ell \sum _{j=1}^n \sum _{j'=1}^n \left( \mathcal {Z}_{[s]j}-\mathcal {Z}_{[s]j'} \right) ^2, \end{aligned}$$

(4)

where

$$\begin{aligned} \mathcal {Z}_{[s]j}=\frac{1}{m k} \sum _{r=1}^k \sum _{i=1}^m I(X_{[r]i}<Y_{[s]j}). \end{aligned}$$

Combining (3) and (4), we conclude that

$$\begin{aligned} \widehat{Var}({\hat{\theta }}_{\text {RSS}})=\frac{1}{N}\left( \frac{\hat{\sigma }_1^2}{\hat{\lambda }}+\frac{\hat{\sigma }_2^2}{1-\hat{\lambda }}\right) , \end{aligned}$$

(5)

where $\hat{\lambda }=(mk)/N$. The above estimator is expected to work well for moderate to large values of m and n, but not for small ones. In the sequel, three alternatives are suggested.
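A minimal sketch of (3), (4) and (5), assuming the samples are stored as arrays with one row per cycle and one column per rank. It relies on the identity $\sum _{i}\sum _{i'}(a_i-a_{i'})^2=2m\sum _i (a_i-\bar{a})^2$, so that (3) equals the average over ranks of the usual sample variance (divisor $m-1$) of the $\mathcal {W}_{[r]i}$ across cycles. The function name `var_hat` is illustrative, and $m,n\ge 2$ is required.

```python
import numpy as np

def var_hat(x, y):
    """Estimator (5) of Var(theta_hat_RSS); x is (m, k), y is (n, l), with m, n >= 2."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    m, k = x.shape
    n, l = y.shape
    ind = x[:, :, None, None] < y[None, None, :, :]   # indicators, shape (m, k, n, l)
    w = ind.mean(axis=(2, 3))                         # W_[r]i, shape (m, k)
    z = ind.mean(axis=(0, 1))                         # Z_[s]j, shape (n, l)
    sigma1_sq = np.var(w, axis=0, ddof=1).mean()      # equals (3)
    sigma2_sq = np.var(z, axis=0, ddof=1).mean()      # equals (4)
    N = m * k + n * l
    lam = m * k / N
    return float((sigma1_sq / lam + sigma2_sq / (1 - lam)) / N)

# Toy data with k = l = 1 and two cycles per sample.
v = var_hat([[1.0], [4.0]], [[2.0], [3.0]])
```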

The jackknife methodology has been proposed to serve two purposes, namely, to reduce a possible bias of an estimator, and to yield an approximation for its variance (see Quenouille 1956; Tukey 1958). Let ${\hat{\theta }}(X_1,\ldots ,X_n)$ be a statistic of interest, where the $X_i$ are iid random variables, and ${\hat{\theta }}$ is invariant under permutation of the arguments. If ${\hat{\theta }}^{(i)}$ denotes the value of ${\hat{\theta }}$ based on $X_1,\ldots ,X_{i-1},X_{i+1},\ldots ,X_n$, then the jackknife estimate of $Var({\hat{\theta }})$ is given by

$$\begin{aligned} \widehat{Var}({\hat{\theta }})=\frac{n-1}{n}\sum _{i=1}^n \left( {\hat{\theta }}^{(i)}-\hat{\theta }^{(0)}\right) ^2, \end{aligned}$$

where ${\hat{\theta }}^{(0)}=\sum _{i=1}^n {\hat{\theta }}^{(i)}/n$.

A ranked set sample consists of independent but not identically distributed random variables. Therefore, one should adapt the above technique to estimate $Var({\hat{\theta }}_{\text {RSS}})$. The first method is to treat the data as $m+n$ iid random variables ${\mathbf {X}}_1,\ldots ,{\mathbf {X}}_m,\mathbf {Y}_1,\ldots ,\mathbf {Y}_n$, where ${\mathbf {X}}_i=(X_{[1]i},\ldots ,X_{[k]i})$ ($i=1,\ldots ,m$) and $\mathbf {Y}_j=(Y_{[1]j},\ldots ,Y_{[\ell ]j})$ ($j=1,\ldots ,n$). That is, ${\mathbf {X}}_i$ ($\mathbf {Y}_j$) contains the elements of the X (Y) sample drawn in the ith (jth) cycle. Suppose $\tilde{\theta }_{\text {RSS}}^{(t)}$ is the value of the reliability estimator when $\mathbf {Z}_t$, $t=1,\ldots ,m+n$, is omitted from the data, where

$$\begin{aligned} \mathbf {Z}_t = \left\{ \begin{array}{ll} {\mathbf {X}}_t &{} t=1,\ldots ,m\\ \mathbf {Y}_{t-m} &{} t=m+1,\ldots ,m+n \end{array} \right. . \end{aligned}$$

Now, the jackknife estimate of the variance is

$$\begin{aligned} {\widetilde{Var}}_1({\hat{\theta }}_{\text {RSS}})=\frac{m+n-1}{m+n} \sum _{t=1}^{m+n} \left( \tilde{\theta }_{\text {RSS}}^{(t)}-\tilde{\theta }^{(0)}\right) ^2, \end{aligned}$$

(6)

where $\tilde{\theta }^{(0)}=\sum _{t=1}^{m+n} \tilde{\theta }_{\text {RSS}}^{(t)}/(m+n)$.
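A minimal sketch of (6), assuming each sample is stored with one row per cycle so that deleting $\mathbf {Z}_t$ amounts to deleting one row. The helper names `theta_hat` and `jackknife_var_1` are illustrative.

```python
import numpy as np

def theta_hat(x, y):
    """Estimator (1): proportion of pairs with X < Y."""
    return float(np.mean(np.ravel(x)[:, None] < np.ravel(y)[None, :]))

def jackknife_var_1(x, y):
    """Delete-one-cycle jackknife variance (6); x is (m, k), y is (n, l)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    m, n = x.shape[0], y.shape[0]
    # Leave out one X cycle at a time, then one Y cycle at a time.
    reps = [theta_hat(np.delete(x, i, axis=0), y) for i in range(m)]
    reps += [theta_hat(x, np.delete(y, j, axis=0)) for j in range(n)]
    reps = np.array(reps)
    t = m + n
    return float((t - 1) / t * np.sum((reps - reps.mean()) ** 2))

# Toy data: the four leave-one-out values are 0, 1, 0.5 and 0.5.
v1 = jackknife_var_1([[1.0], [4.0]], [[2.0], [3.0]])
```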

It is possible to obtain another jackknife-type estimate of the variance by excluding the cycles from the two samples simultaneously. Let $\breve{\theta }_{\text {RSS}}^{(u,v)}$ denote the value of the estimator when ${\mathbf {X}}_u$ and $\mathbf {Y}_v$ are removed from the data. The second estimator is then

$$\begin{aligned} {\widetilde{Var}}_2({\hat{\theta }}_{\text {RSS}})=\frac{m n-1}{m n} \sum _{u=1}^{m} \sum _{v=1}^{n} \left( \breve{\theta }_{\text {RSS}}^{(u,v)}-\breve{\theta }^{(0)}\right) ^2, \end{aligned}$$

(7)

where $\breve{\theta }^{(0)}=\sum _{u=1}^{m} \sum _{v=1}^{n} \breve{\theta }_{\text {RSS}}^{(u,v)}/(m n)$.
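The second jackknife estimate (7) can be sketched in the same way, now looping over all $mn$ pairs of deleted cycles. As before, the row-per-cycle layout and the function names are assumptions made for illustration.

```python
import numpy as np

def theta_hat(x, y):
    """Estimator (1): proportion of pairs with X < Y."""
    return float(np.mean(np.ravel(x)[:, None] < np.ravel(y)[None, :]))

def jackknife_var_2(x, y):
    """Jackknife variance (7): one X cycle and one Y cycle removed at a time."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    m, n = x.shape[0], y.shape[0]
    reps = np.array([
        theta_hat(np.delete(x, u, axis=0), np.delete(y, v, axis=0))
        for u in range(m) for v in range(n)
    ])
    return float((m * n - 1) / (m * n) * np.sum((reps - reps.mean()) ** 2))

# Toy data: the four leave-pair-out values are 0, 0, 1 and 1.
v2 = jackknife_var_2([[1.0], [4.0]], [[2.0], [3.0]])
```

The sum in (7) runs over $mn$ terms instead of $m+n$, which is consistent with this estimate tending to be larger than (6).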

The bootstrap method, introduced by Efron (1979), can also be used to estimate the variance. The method involves drawing samples repeatedly from the empirical distribution function. Suppose $X_1,\ldots ,X_n$ is a random sample from the target population, and ${\hat{\theta }}$ is an estimator of interest. First we draw a sample of size n, with replacement, from the data points (called a bootstrap sample). This sampling procedure is repeated B times, and the estimator is computed from each bootstrap sample. The sample variance of these B values is then the bootstrap estimate of $Var({\hat{\theta }})$.

Modarres et al. (2006) suggested two bootstrap algorithms in the RSS design. The bootstrap ranked set sampling (BRSS) method, which is the most efficient one, is now delineated. Let $F_{m k}$ be the empirical distribution function based on the ranked set sample $\{X_{[r]i}: r=1,\ldots ,k;\, i=1,\ldots ,m\}$, i.e.

$$\begin{aligned} F_{m k}(x)=\frac{1}{m k} \sum _{r=1}^k \sum _{i=1}^m I(X_{[r]i}\le x). \end{aligned}$$

According to the BRSS algorithm, a bootstrap sample is drawn as follows:

1. Assign to each element of the ranked set sample a probability of $(mk)^{-1}$.
2. Randomly draw k elements ${\mathcal {X}}_1,\ldots ,{\mathcal {X}}_k {\mathop {\sim }\limits ^{iid}} F_{m k}$, sort them in ascending order ${\mathcal {X}}_{(1)},\ldots ,{\mathcal {X}}_{(k)}$, and retain $X_{[r]1}^*={\mathcal {X}}_{(r)}$.
3. Perform step 2 for $r=1,\ldots ,k$.
4. Repeat steps 2 and 3 m times to obtain $\{X_{[r]i}^*\}$.

Following similar steps, a bootstrap copy of $\{Y_{[s]j}: s=1,\ldots ,\ell ;\, j=1,\ldots ,n\}$ is generated. Suppose B pairs of bootstrap samples are drawn as described above, and let ${\hat{\theta }}_{\text {RSS}}^b$ be the value of the reliability estimator based on the data in the bth ($b=1,\ldots ,B$) replication. Then the bootstrap variance estimator is given by

$$\begin{aligned} {\widehat{Var}}_{\text {boot}}({\hat{\theta }}_{\text {RSS}})=\frac{1}{B-1}\sum _{b=1}^B \left( {\hat{\theta }}_{\text {RSS}}^b-\bar{\theta }^* \right) ^2, \end{aligned}$$

(8)

where $\bar{\theta }^*=\sum _{b=1}^B {\hat{\theta }}_{\text {RSS}}^b/B$.
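The BRSS steps and the variance estimate (8) can be sketched as below. This is one reading of the algorithm, under the assumption that every bootstrap position $(i,r)$ gets its own set of k draws from $F_{mk}$ and keeps the rth smallest of them. The function names are illustrative.

```python
import numpy as np

def theta_hat(x, y):
    """Estimator (1): proportion of pairs with X < Y."""
    return float(np.mean(np.ravel(x)[:, None] < np.ravel(y)[None, :]))

def brss_resample(sample, rng):
    """One BRSS bootstrap copy of an (m, k) ranked set sample."""
    sample = np.asarray(sample, dtype=float)
    m, k = sample.shape
    # For each of the m*k positions, draw k values iid from the pooled data (F_mk).
    sets = rng.choice(sample.ravel(), size=(m, k, k), replace=True)
    sets.sort(axis=2)
    r = np.arange(k)
    return sets[:, r, r]   # position (i, r) keeps the (r+1)-th smallest of its own set

def bootstrap_replicates(x, y, B, rng):
    """B bootstrap values of theta_hat from independently resampled X and Y samples."""
    return np.array([theta_hat(brss_resample(x, rng), brss_resample(y, rng))
                     for _ in range(B)])

def bootstrap_var(reps):
    """Estimator (8): sample variance of the bootstrap replications."""
    return float(np.var(reps, ddof=1))

# Hypothetical usage on simulated data.
rng = np.random.default_rng(0)
x = rng.standard_normal((4, 3))
y = rng.standard_normal((4, 3)) + 0.5
reps = bootstrap_replicates(x, y, B=200, rng=rng)
v_boot = bootstrap_var(reps)
```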

3 Proposed CIs

In this section, we construct several CIs for $\theta $ using asymptotic and resampling methods. Based on Proposition 1, one can employ the pivotal quantity

$$\begin{aligned} T=\frac{{\hat{\theta }}_{\text {RSS}}-\theta }{\sqrt{\widehat{Var}({\hat{\theta }}_{\text {RSS}})}} \thickapprox N(0, 1), \end{aligned}$$

where $\widehat{Var}({\hat{\theta }}_{\text {RSS}})$ is defined in (5). The corresponding approximate ($1-\alpha $)-CI is

$$\begin{aligned} \left( {\hat{\theta }}_{\text {RSS}}-z_{\alpha /2}\sqrt{\widehat{Var}({\hat{\theta }}_{\text {RSS}})},{\hat{\theta }}_{\text {RSS}}+z_{\alpha /2}\sqrt{\widehat{Var}({\hat{\theta }}_{\text {RSS}})} \right) , \end{aligned}$$

(9)

where $z_{\alpha /2}$ is the ($1-\alpha /2$) quantile of the standard normal distribution. The pivotal quantity T can be altered if $\widehat{Var}({\hat{\theta }}_{\text {RSS}})$ is replaced by one of the estimates presented in (6), (7) and (8). Accordingly, natural modifications of (9) would be

$$\begin{aligned}&\left( {\hat{\theta }}_{\text {RSS}}-z_{\alpha /2}\sqrt{{\widetilde{Var}}_1({\hat{\theta }}_{\text {RSS}})},{\hat{\theta }}_{\text {RSS}}+z_{\alpha /2}\sqrt{{\widetilde{Var}}_1({\hat{\theta }}_{\text {RSS}})} \right) , \end{aligned}$$

(10)

$$\begin{aligned}&\left( {\hat{\theta }}_{\text {RSS}}-z_{\alpha /2}\sqrt{{\widetilde{Var}}_2({\hat{\theta }}_{\text {RSS}})},{\hat{\theta }}_{\text {RSS}}+z_{\alpha /2}\sqrt{{\widetilde{Var}}_2({\hat{\theta }}_{\text {RSS}})} \right) , \end{aligned}$$

(11)

and

$$\begin{aligned} \left( {\hat{\theta }}_{\text {RSS}}-z_{\alpha /2}\sqrt{{\widehat{Var}}_{\text {boot}}({\hat{\theta }}_{\text {RSS}})},{\hat{\theta }}_{\text {RSS}}+z_{\alpha /2}\sqrt{{\widehat{Var}}_{\text {boot}}({\hat{\theta }}_{\text {RSS}})} \right) . \end{aligned}$$

(12)

We can construct a two-sided equal-tailed ($1-\alpha $)-CI for $\theta $ from the empirical distribution function of a series of bootstrap replications of ${\hat{\theta }}_{\text {RSS}}$. The $\alpha /2$ and the $1-\alpha /2$ quantiles of the bootstrap replications are used as lower and upper confidence bounds. This procedure is called percentile bootstrap, and the corresponding interval is given by

$$\begin{aligned} \left( {\hat{\theta }}_{\text {RSS}}^{\alpha /2}, {\hat{\theta }}_{\text {RSS}}^{1-\alpha /2}\right) , \end{aligned}$$

(13)

where ${\hat{\theta }}_{\text {RSS}}^{\beta }$ is the $\beta $ quantile of ${\hat{\theta }}_{\text {RSS}}^1,\ldots ,{\hat{\theta }}_{\text {RSS}}^B$.
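Given the B bootstrap replications, the percentile interval (13) reduces to two sample quantiles. The sketch below assumes the replications are already available as an array. The function name `percentile_ci` is illustrative, and NumPy's default (linearly interpolated) quantile definition is an assumption, since the text does not specify one.

```python
import numpy as np

def percentile_ci(reps, alpha=0.05):
    """Percentile bootstrap interval (13) from bootstrap replications of theta_hat."""
    lower, upper = np.quantile(np.asarray(reps, dtype=float), [alpha / 2, 1 - alpha / 2])
    return float(lower), float(upper)

# Toy replications on an even grid, used only to exercise the function.
toy_reps = np.arange(101) / 100
lo, hi = percentile_ci(toy_reps)
```

Because the replications are themselves proportions, the endpoints always lie in [0, 1].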

The bootstrap-t method approximates quantiles of the distribution of T from sample quantiles of the quantities

$$\begin{aligned} T_b=\frac{{\hat{\theta }}_{\text {RSS}}^b-{\hat{\theta }}_{\text {RSS}}}{\sqrt{{\widehat{Var}}({\hat{\theta }}_{\text {RSS}}^b)}} \quad (b=1,\ldots ,B), \end{aligned}$$

where ${\hat{\theta }}_{\text {RSS}}^b$ and ${\widehat{Var}}({\hat{\theta }}_{\text {RSS}}^b)$ are computed from the bth bootstrap sample. The bootstrap-t interval is defined as

$$\begin{aligned} \left( {\hat{\theta }}_{\text {RSS}}-t_{1-\alpha /2}\sqrt{\widehat{Var}({\hat{\theta }}_{\text {RSS}})},{\hat{\theta }}_{\text {RSS}}-t_{\alpha /2}\sqrt{\widehat{Var}({\hat{\theta }}_{\text {RSS}})} \right) , \end{aligned}$$

(14)

where $t_{\beta }$ is the $\beta $ quantile of $T_1,\ldots ,T_B$.
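A minimal sketch of (14), assuming the studentized values $T_1,\ldots ,T_B$ have already been computed (each requires the variance estimate (5) evaluated on its own bootstrap sample). The function name `boot_t_ci` is illustrative, and the endpoints are clipped to [0, 1].

```python
import numpy as np

def boot_t_ci(theta_hat, var_hat, t_reps, alpha=0.05):
    """Bootstrap-t interval (14), clipped to the unit interval."""
    t_low, t_high = np.quantile(np.asarray(t_reps, dtype=float), [alpha / 2, 1 - alpha / 2])
    sd = var_hat ** 0.5
    # The upper quantile sets the lower endpoint, and vice versa.
    lower = theta_hat - t_high * sd
    upper = theta_hat - t_low * sd
    return max(0.0, float(lower)), min(1.0, float(upper))

# Toy studentized values on a symmetric grid, used only to exercise the function.
lo, hi = boot_t_ci(0.5, 0.01, np.linspace(-2.0, 2.0, 401))
```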

The intervals (9), (10), (11), (12), (13) and (14) will be referred to as Normal, Normal-J1, Normal-J2, Normal-B, Boot-p and Boot-t, respectively. It should be mentioned that all the above intervals, except Boot-p, may have endpoints outside the interval (0,1). Therefore, we correct the original interval (L, U) as $\left( \max \{0,L\},\min \{1,U\}\right) $.
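The normal-type intervals (9) to (12) differ only in the variance estimate plugged in, so one helper covers all four, including the truncation to [0, 1]. The sketch below is an illustration. The function name `normal_ci` is hypothetical, and the inputs are the point estimate and two of the variance estimates reported for the apple trees data in Sect. 5.

```python
from statistics import NormalDist

def normal_ci(theta_hat, var_hat, alpha=0.05):
    """Interval of the form (9), truncated to [0, 1]."""
    z = NormalDist().inv_cdf(1 - alpha / 2)   # (1 - alpha/2) standard normal quantile
    half_width = z * var_hat ** 0.5
    return max(0.0, theta_hat - half_width), min(1.0, theta_hat + half_width)

# Normal interval with the estimate (5), and Normal-J2 interval with the estimate (7).
lo, hi = normal_ci(0.6184, 0.001344)
lo_j2, hi_j2 = normal_ci(0.6184, 0.019038)
```

With these inputs the first interval lies entirely above 0.5, while the much larger second variance estimate produces an interval that contains 0.5.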

4 Simulation results

This section contains results of simulation studies conducted to compare the performances of the different intervals suggested in the previous section. We consider the cases where both X and Y follow either a normal or exponential distribution. If $X-\mu \, (\mu \in \mathbb {R}$) and Y are independent standard normal random variables, then

$$\begin{aligned} \theta =\Phi \left( \frac{-\mu }{\sqrt{2}} \right) , \end{aligned}$$

where $\Phi (.)$ is the distribution function of Y. Similarly, for independent standard exponential random variables $X/\beta \,(\beta >0$) and Y, it can be shown that

$$\begin{aligned} \theta =\frac{1}{1+\beta }. \end{aligned}$$

Under each parent distribution, three values were assigned to the associated parameter so as to produce $\theta =0.25,0.5,0.75$ which are referred to as case A, B and C, respectively. The appropriate parameter values are given in Table 1. If the total sample sizes are denoted by $N_1=mk$ and $N_2=n\ell $, then we select $(N_1,N_2) \in \big \{(10,10),(10,20),(10,30),(20,20)\big \}$. Also, ranked set samples are drawn from the two populations using common set sizes $k=\ell =1,2,5$, where the set size one simply represents the SRS design.
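Inverting the two closed forms gives the parameter value for any target $\theta $: $\mu =-\sqrt{2}\,\Phi ^{-1}(\theta )$ and $\beta =1/\theta -1$. The sketch below computes them for the three cases. The values are derived from the formulas and are not transcribed from Table 1, and the function names are illustrative.

```python
from statistics import NormalDist

def theta_normal(mu):
    """theta when X - mu and Y are independent standard normal."""
    return NormalDist().cdf(-mu / 2 ** 0.5)

def theta_exponential(beta):
    """theta when X / beta and Y are independent standard exponential."""
    return 1.0 / (1.0 + beta)

def mu_for(theta):
    """Shift mu that produces a given theta in the normal case."""
    return -(2 ** 0.5) * NormalDist().inv_cdf(theta)

def beta_for(theta):
    """Scale beta that produces a given theta in the exponential case."""
    return 1.0 / theta - 1.0

params = {theta: (mu_for(theta), beta_for(theta)) for theta in (0.25, 0.5, 0.75)}
```

This yields $\mu \approx 0.954,\,0,\,-0.954$ and $\beta =3,\,1,\,1/3$ for cases A, B and C, respectively.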

We assume that the ranking of the variables of interest X and Y is based on the concomitant variables ${\mathcal {X}}$ and ${\mathcal {Y}}$, which are related to X and Y through the equations

$$\begin{aligned} {\mathcal {X}}=\rho _1 \left( \frac{X-\mu _x}{\sigma _x} \right) + \sqrt{1-\rho _1^2} Z_1, \end{aligned}$$

and

$$\begin{aligned} {\mathcal {Y}}=\rho _2 \left( \frac{Y-\mu _y}{\sigma _y} \right) + \sqrt{1-\rho _2^2} Z_2, \end{aligned}$$

where $\rho _i \in [0,1]\,\, (i=1,2)$, and $Z_1\, (Z_2$) is a standard normal random variable independent of $X \,(Y$). Moreover, $Z_1$ and $Z_2$ are independent. The quality of the rankings is controlled by the parameters $\rho _i$. It is easy to see that $Corr(X,{\mathcal {X}})=\rho _1$ and $Corr(Y,{\mathcal {Y}})=\rho _2$. The chosen values of $\left( \rho _1,\rho _2\right) $ are (1, 1) for perfect rankings of X and Y, (1, 0.8) for perfect ranking of X and fairly accurate ranking of Y, and (0.8, 0.8) for fairly accurate rankings of X and Y.
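The imperfect ranking mechanism can be sketched as follows: each set is ordered by its concomitant values, and the true value of the unit holding the rth smallest concomitant is recorded. The function name, the `sampler` callback and the argument names are assumptions made for illustration.

```python
import numpy as np

def draw_rss_concomitant(sampler, mean, sd, rho, k, m, rng):
    """Ranked set sample of shape (m, k), ranked on a concomitant with Corr = rho."""
    sample = np.empty((m, k))
    for i in range(m):
        for r in range(k):
            units = sampler(k, rng)
            noise = rng.standard_normal(k)
            # Concomitant variable as in the displayed equations.
            conc = rho * (units - mean) / sd + np.sqrt(1 - rho ** 2) * noise
            order = np.argsort(conc)
            sample[i, r] = units[order[r]]   # true value of the unit judged (r+1)-th smallest
    return sample

# Hypothetical usage: standard normal X with fairly accurate ranking (rho = 0.8).
rng = np.random.default_rng(0)
x = draw_rss_concomitant(lambda size, gen: gen.standard_normal(size),
                         mean=0.0, sd=1.0, rho=0.8, k=5, m=2, rng=rng)
```

Setting `rho=1` reproduces ranking by the true values, while `rho=0` makes the ranking purely random, so the design then carries no more information than the SRS.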

Table 1 Parameter values corresponding to case A, B and C

For each combination of distribution, sample sizes and correlations, 5000 pairs of samples were generated in the RSS design (with the aforesaid set sizes). The six intervals were constructed from each pair of samples for $\alpha =0.05$. In doing so, the number of bootstrap replications was set to 500. The coverage rate and expected length of each interval were then estimated by the fraction of intervals containing the true $\theta $ and the mean of the interval lengths, respectively. The results under perfect ranking are reported in Tables 2, 3, 4 and 5, where the lengths of the intervals appear in parentheses.
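The structure of such a study can be sketched for a single configuration. The code below is a simplified illustration, not the simulation code behind the tables: it covers only the Normal interval (9), the normal parent distribution and perfect ranking, and it uses far fewer replications than the 5000 stated above. All function names are hypothetical.

```python
import numpy as np
from statistics import NormalDist

def rss_normal(shift, k, m, rng):
    """Perfectly ranked set sample (m, k) from N(shift, 1)."""
    sets = np.sort(shift + rng.standard_normal((m, k, k)), axis=2)
    r = np.arange(k)
    return sets[:, r, r]

def theta_and_var(x, y):
    """Estimator (1) together with the variance estimate (5)."""
    m, k = x.shape
    n, l = y.shape
    ind = x[:, :, None, None] < y[None, None, :, :]
    s1 = np.var(ind.mean(axis=(2, 3)), axis=0, ddof=1).mean()
    s2 = np.var(ind.mean(axis=(0, 1)), axis=0, ddof=1).mean()
    # Algebraically the same as (5), since N * lambda_hat = m * k.
    return float(ind.mean()), float(s1 / (m * k) + s2 / (n * l))

def coverage(mu, k, m, n, reps, alpha, rng):
    """Monte Carlo coverage rate and mean length of the truncated interval (9)."""
    theta = NormalDist().cdf(-mu / 2 ** 0.5)
    z = NormalDist().inv_cdf(1 - alpha / 2)
    hits, total_length = 0, 0.0
    for _ in range(reps):
        est, var = theta_and_var(rss_normal(mu, k, m, rng), rss_normal(0.0, k, n, rng))
        lo = max(0.0, est - z * var ** 0.5)
        hi = min(1.0, est + z * var ** 0.5)
        hits += int(lo < theta < hi)
        total_length += hi - lo
    return hits / reps, total_length / reps

# Case B (theta = 0.5) with N1 = N2 = 10 and k = l = 2, i.e. m = n = 5.
cov, length = coverage(mu=0.0, k=2, m=5, n=5, reps=300, alpha=0.05,
                       rng=np.random.default_rng(1))
```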

In general, the longer the interval, the better its coverage probability. The Normal-J2 and Boot-t CIs have the best coverage rates, and the latter is always shorter. Also, Normal-B and Boot-p are the shortest CIs, and their coverage rates are more or less the same. For a fixed $N_1+N_2$, the performance of the CIs generally improves when the two sample sizes are equal; compare, for example, corresponding intervals for sample sizes (10, 30) and (20, 20) under the different parent distributions.

Table 2 Estimated coverage rates and lengths of 95% intervals under normal distribution with the perfect ranking when $(N_1,N_2)=(10,10),(10,20)$

Table 3 Estimated coverage rates and lengths of 95% intervals under normal distribution with the perfect ranking when $(N_1,N_2)=(10,30),(20,20)$

Table 4 Estimated coverage rates and lengths of 95% intervals under exponential distribution with the perfect ranking when $(N_1,N_2)=(10, 10),(10,20)$

Table 5 Estimated coverage rates and lengths of 95% intervals under exponential distribution with the perfect ranking when $(N_1,N_2)=(10,30),(20,20)$

Given a pair of total sample sizes, the lengths of all intervals are decreasing in the set size regardless of the case (A, B or C). However, changes in the coverage probabilities are not regular. The above trends are consistent with some results in the RSS literature. For example, Terpstra and Miller (2006) studied exact inference for a population proportion based on the RSS. According to their findings, expected length of the RSS-based CI is uniformly (as a function of the true population proportion) smaller than that of the SRS-based CI. However, there is not a uniform superiority for the coverage probability. See Figure 3 in Terpstra and Miller (2006). In our problem, the situation is more complex because the intervals are based on asymptotic and/or resampling methods.

If the perfect rankings are assumed, the performances of each interval for cases A and C are in close agreement when the parent distribution is normal. This statement is true about the exponential distribution if $N_1=N_2$. These properties can also be observed in the imperfect ranking setup (see Tables 1–8 in the supplementary material), but the additional assumption $\rho _1=\rho _2$ is needed for the exponential distribution. In the presence of ranking errors, lengths of the CIs increase (as compared with the perfect ranking case) but the coverage probabilities do not behave regularly. Overall, Normal-J2 and Boot-t CIs have satisfactory coverage rates (which are close to the nominal level or higher than it) in this situation, although they are longer than the other CIs.

Mahdizadeh and Zamanzade (2016) used kernel density estimation to estimate $\theta $ in the RSS. Let $h_1$ and $h_2$ be the bandwidths of the kernel density estimators based on $\{X_{[r]i}: r=1,\ldots ,k\,;i=1,\ldots ,m \}$ and $\{Y_{[s]j}: s=1,\ldots ,\ell \,;j=1,\ldots ,n \}$, respectively (see Chen 1999 for the kernel density estimation in the RSS). If $t=\sqrt{h_1^2+h_2^2}$, then the kernel-based estimator is given by

$$\begin{aligned} \tilde{\theta }_{\text {RSS}}=\frac{1}{mk n \ell } \sum _{i=1}^m \sum _{j=1}^n \sum _{r=1}^k \sum _{s=1}^\ell \Phi \left( \frac{Y_{[s]j}-X_{[r]i}}{t} \right) , \end{aligned}$$

where $\Phi (.)$ is the distribution function of a standard normal random variable. Yin et al. (2016) established asymptotic normality of the above estimator, and employed this result in developing a CI for $\theta $. The corresponding interval is defined as

$$\begin{aligned} \left( \tilde{\theta }_{\text {RSS}}-z_{\alpha /2}\sqrt{\widehat{Var}(\tilde{\theta }_{\text {RSS}})},\tilde{\theta }_{\text {RSS}}+z_{\alpha /2}\sqrt{\widehat{Var}(\tilde{\theta }_{\text {RSS}})} \right) , \end{aligned}$$

(15)

where $\widehat{Var}(\tilde{\theta }_{\text {RSS}})$ is computed similarly to (5) based on

$$\begin{aligned} \mathcal {W}_{[r]i}=\frac{1}{n \ell } \sum _{s=1}^\ell \sum _{j=1}^n \Phi \left( \frac{Y_{[s]j}-X_{[r]i}}{t} \right) , \end{aligned}$$

and

$$\begin{aligned} \mathcal {Z}_{[s]j}=\frac{1}{m k} \sum _{r=1}^k \sum _{i=1}^m \Phi \left( \frac{Y_{[s]j}-X_{[r]i}}{t} \right) . \end{aligned}$$
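A minimal sketch of the kernel-based point estimator $\tilde{\theta }_{\text {RSS}}$ follows. It assumes Gaussian kernels and the normal reference bandwidth $h=1.06\,s\,n^{-1/5}$, where s is the sample standard deviation and n the sample size. That constant is the usual textbook form of the rule and is an assumption here, since the exact variant is not stated. The function names are illustrative.

```python
import math
import numpy as np

# Standard normal distribution function, applied elementwise.
_phi = np.vectorize(lambda z: 0.5 * (1.0 + math.erf(z / math.sqrt(2.0))))

def nr_bandwidth(values):
    """Normal reference bandwidth, treating the data as if collected by the SRS."""
    values = np.ravel(values)
    return 1.06 * np.std(values, ddof=1) * values.size ** (-0.2)

def theta_kernel(x, y):
    """Kernel-based estimator: average of Phi((Y - X) / t) over all pairs."""
    x = np.asarray(x, dtype=float).ravel()
    y = np.asarray(y, dtype=float).ravel()
    t = math.hypot(nr_bandwidth(x), nr_bandwidth(y))   # t = sqrt(h1^2 + h2^2)
    return float(np.mean(_phi((y[None, :] - x[:, None]) / t)))

# Toy data: identical samples give exactly 0.5 by symmetry of Phi.
sample = np.array([[0.0, 1.0], [2.0, 3.0]])
same = theta_kernel(sample, sample)
```

Replacing the indicator in (1) by $\Phi $ smooths the estimator, and letting $t\rightarrow 0$ recovers ${\hat{\theta }}_{\text {RSS}}$ when there are no ties.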

We conducted a partial simulation study to compare CIs (9) and (15) in terms of the coverage probability and length, based on 10,000 pairs of samples. In determining $h_1$ and $h_2$, the following three methods of the bandwidth selection were utilized: normal reference (NR) rule, unbiased cross-validation (UCV), and plug-in (PI). Although these techniques are developed for the SRS (see Sheather 2004 for more details), they can be applied in the RSS setup by considering data as if collected by the SRS.

Figures 1 and 2 display the results for $(N_1,N_2)=(10, 10)$ with $k=\ell =2,5$, when the perfect rankings are assumed. Here, the black/solid curve corresponds to the interval (9), while the blue/dashed, red/dotted and orange/longdash curves correspond to the interval (15) using the NR, UCV and PI methods, respectively. It is observed that the interval (9) has the better coverage rate, while the interval (15) is always shorter. Hence, neither interval is preferred on both criteria. Among the kernel-based intervals, the overall performance of the CI using the PI method is satisfactory.

5 Illustration

We now apply the proposed procedures to an agricultural data set. Murray et al. (2000) conducted an experiment in which apple trees were sprayed with a chemical containing a fluorescent tracer, Tinopal CBS-X, at 2% concentration level in water. Two nine-tree plots were chosen for spraying. One plot was sprayed at high volume, using coarse nozzles on the sprayer to give a large average droplet size. The other plot was sprayed at low volume, using fine nozzles to give a small average droplet size. Fifty sets of five leaves were identified from the central five trees of each plot, and used to draw 10 copies of a ranked set sample of size five from each plot. The variable of interest is the percentage of area covered by the spray on the surface of the leaves. The formal measurement entails chemical analysis of the solution collected from the surface of the leaves, and is thereby a time-consuming and expensive process. The judgment ranking within each set is based on the visual appearance of the spray deposits on the leaf surfaces when viewed under ultraviolet light. Clearly, the latter method is cheap, and fairly accurate if implemented by an expert observer.

Table 6 Ranked set sample data for the percentage area covered on the surface of the leaves of apple trees

Table 7 95% CIs for $\theta $ based on the apple trees data

The data are given in Table 6, where measurements obtained from the plot sprayed at high (low) volume constitute the control (treatment) group. The interest centers on whether the sprayer settings affect the percentage area coverage. If $X\, (Y$) denotes the response variable from the control (treatment) group, then ${\hat{\theta }}_{\text {RSS}}$ is a measure of the treatment effect. From the data in Table 6, ${\hat{\theta }}_{\text {RSS}}=0.6184$ is obtained with estimated variances ${\widehat{Var}}({\hat{\theta }}_{\text {RSS}})=0.001344$, ${\widetilde{Var}}_1({\hat{\theta }}_{\text {RSS}})=0.001825$, ${\widetilde{Var}}_2({\hat{\theta }}_{\text {RSS}})=0.019038$, and ${\widehat{Var}}_{\text {boot}}({\hat{\theta }}_{\text {RSS}})=0.001169$. For the bootstrap-based estimate, $B=5000$ is used. With the exception of ${\widetilde{Var}}_2({\hat{\theta }}_{\text {RSS}})$, all of the estimates are in good agreement. Table 7 displays 95% CIs for $\theta $ based on the different methods. Apart from the Normal-J2 interval, none of the intervals contains 0.5, so we may conclude that the treatment effect is significant at the 0.05 level. It should be mentioned that Normal-J2 is the longest interval in this example, which is consistent with the simulation results in Sect. 4.

As a reviewer pointed out, the accuracy of sampling and statistical inference largely hinges on the properties of the population, the conditions of the sample, and the method of estimation, the so-called sampling and statistical trinity (see Wang et al. 2012, for example). Here, the proposed procedures are illustrated using agricultural data. A spatial population may be dominated by spatial autocorrelation, spatial stratified heterogeneity, or both. Also, there may be significant covariates. The properties of the population should be tested before choosing the most suitable among numerous estimators. To justify the choice of a method, a table may be drawn up to compare the assumptions of the mainstream models in the topic with the properties of the data under study (e.g. spatial autocorrelation, spatial stratified heterogeneity, and the significance of covariates). Unfortunately, we have no information about the population from which our sample in Table 6 is drawn. Thus, it is not possible to check the aforesaid properties.

6 Conclusion

The RSS method combines measurement with judgment ranking information for the purpose of statistical inference. It is advantageous in settings where precise measurement of the variable of interest is difficult (e.g., time-consuming, expensive or destructive), but small sets of units can be accurately ranked without actual quantification.

While point estimation of different population attributes has been exhaustively studied in the RSS literature, hypothesis testing and interval estimation problems have received little attention. This article aims to fill this gap in the context of estimating the reliability parameter. Several asymptotic and resampling-based intervals are developed and compared with their SRS analogs through an extensive simulation study. The results show that the RSS-based CIs are preferable with respect to length, although their coverage rates are not uniformly superior. An agricultural data set is used to illustrate the suggested interval estimation procedures.

The intervals presented in this work utilize a point estimator based on the empirical distribution function. We have partly investigated the performance of one of the CIs modified using kernel density estimation. The other intervals can be adapted similarly; this will be considered in a separate article.

References

Ayech MW, Ziou D (2015) Segmentation of Terahertz imaging using $k$-means clustering based on ranked set sampling. Expert Syst Appl 42:2959–2974
Chen Z (1999) Density estimation using ranked-set sampling data. Environ Ecol Stat 6:135–146
Chen Z (2007) Ranked set sampling: its essence and some new applications. Environ Ecol Stat 14:355–363
Chen Z, Bai Z, Sinha BK (2004) Ranked set sampling: theory and applications. Springer, New York
Efron B (1979) Bootstrap methods: another look at the jackknife. Ann Stat 7:1–26
Halls LK, Dell TR (1966) Trial of ranked-set sampling for forage yields. For Sci 12:22–26
Hollander M, Wolfe DA, Chicken E (2014) Nonparametric statistical methods, 3rd edn. Wiley, New York
Howard RW, Jones SC, Mauldin JK, Beal RH (1982) Abundance, distribution, and colony size estimates for Reticulitermes spp. (Isoptera: Rhinotermitidae) in Southern Mississippi. Environ Entomol 11:1290–1293
Kotz S, Lumelskii Y, Pensky M (2003) The stress-strength model and its generalizations. Theory and applications. World Scientific, Singapore
Kvam P (2003) Ranked set sampling based on binary water quality data with covariates. J Agric Biol Environ Stat 8:271–279
MacEachern SN, Ozturk O, Stark GV, Wolfe DA (2002) A new ranked set sample estimator of variance. J R Stat Soc Ser B 64:177–188
Mahdizadeh M, Zamanzade E (2016) Kernel-based estimation of $P(X>Y)$ in ranked set sampling. SORT 40:243–266
McIntyre GA (1952) A method of unbiased selective sampling using ranked sets. Aust J Agric Res 3:385–390
Modarres R, Hui TP, Zheng G (2006) Resampling methods for ranked set samples. Comput Stat Data Anal 51:1039–1050
Murray RA, Ridout MS, Cross JV (2000) The use of ranked set sampling in spray deposit assessment. Asp Appl Biol 57:141–146
Presnell B, Bohn L (1999) U-statistics and imperfect ranking in ranked set sampling. J Nonparametr Stat 10:111–126
Quenouille MH (1956) Notes on bias in estimation. Biometrika 43:353–360
Sengupta S, Mukhuti S (2008) Unbiased estimation of $P(X>Y)$ using ranked set sample data. Statistics 42:223–230
Sheather SJ (2004) Density estimation. Stat Sci 19:588–597
Stokes SL (1980) Estimation of variance using judgment ordered ranked set samples. Biometrics 36:35–42
Terpstra J, Miller ZA (2006) Exact inference for a population proportion based on a ranked set sample. Commun Stat Simul Comput 35:19–26
Tukey JW (1958) Bias and confidence in not quite large samples (abstract). Ann Math Stat 29:614
Wang JF, Stein A, Gao BB, Ge Y (2012) A review of spatial sampling. Spat Stat 2:1–14
Wolfe DA (2012) Ranked set sampling: its relevance and impact on statistical inference. ISRN Probability and Statistics, pp 1–32
Yin J, Hao Y, Samawi H, Rochani H (2016) Rank-based kernel estimation of the area under the ROC curve. Stat Methodol 32:91–106
Zamanzade E, Mahdizadeh M (2017) A more efficient proportion estimator in ranked set sampling. Stat Probab Lett 129:28–33


Acknowledgements

This research was supported by Iran National Science Foundation (INSF). The authors wish to thank the reviewers for insightful comments and suggestions that improved an earlier version of this paper.

Author information

Authors and Affiliations

Department of Statistics, Hakim Sabzevari University, P.O. Box 397, Sabzevar, Iran
M. Mahdizadeh
Department of Statistics, University of Isfahan, Isfahan, 81746-73441, Iran
Ehsan Zamanzade

Corresponding author

Correspondence to M. Mahdizadeh.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 72 KB)

Appendix

In this section, we first provide some results on two-sample U-statistics, and then present proofs of Propositions 1 and 2.

Suppose that $h(x_1,\ldots ,x_p;y_1,\ldots ,y_q)$ is a symmetric kernel of degree (p, q) for the parameter $\theta =E\left( h \right) $. For independent simple random samples $X_1,\ldots ,X_{mk}$ from F, and $Y_1,\ldots ,Y_{n\ell }$ from G, the corresponding two-sample U-statistic for $\theta $ is given by

$$\begin{aligned} U_{\text {SRS}}= & {} U(X_1,\ldots ,X_{mk};Y_1,\ldots ,Y_{n\ell }) \\= & {} \frac{1}{{mk \atopwithdelims ()p}{n\ell \atopwithdelims ()q}} \sum _{{\varvec{\alpha }} \in \mathcal {A}} \sum _{{\varvec{\beta }} \in \mathcal {B}} h(X_{\alpha _1},\ldots ,X_{\alpha _p};Y_{\beta _1},\ldots ,Y_{\beta _q}), \end{aligned}$$

where ${\varvec{\alpha }}=(\alpha _1,\ldots ,\alpha _p)$, ${\varvec{\beta }}=(\beta _1,\ldots ,\beta _q)$, and $\mathcal {A}$ ($\mathcal {B}$) is the collection of all subsets of size $p$ ($q$) chosen from the integers $1,\ldots ,mk$ ($1,\ldots ,n\ell $). Let

$$\begin{aligned} h_{10}(x)=E\left( h(x,X_2,\ldots ,X_p;Y_1,\ldots ,Y_q) \right) , \end{aligned}$$

and

$$\begin{aligned} h_{01}(y)=E\left( h(X_1,\ldots ,X_p;y,Y_2,\ldots ,Y_q) \right) . \end{aligned}$$

Also, in connection with the above functions, define $\zeta _{10}=Var\left( h_{10}(X_1) \right) $ and $\zeta _{01}=Var\left( h_{01}(Y_1) \right) $.

Let $\{X_{[r]i}: r=1,\ldots ,k\,;i=1,\ldots ,m \}$ and $\{Y_{[s]j}: s=1,\ldots ,\ell \,;j=1,\ldots ,n \}$ be independent ranked set samples from two populations with distribution functions F and G, respectively. The two-sample U-statistic for $\theta $ under RSS is given by

$$\begin{aligned} U_{\text {RSS}}=U({\mathbf {X}}_1,\ldots ,{\mathbf {X}}_m;\mathbf {Y}_1,\ldots ,\mathbf {Y}_n), \end{aligned}$$

where ${\mathbf {X}}_i=(X_{[1]i},\ldots ,X_{[k]i})$ ($i=1,\ldots ,m$), and $\mathbf {Y}_j=(Y_{[1]j},\ldots ,Y_{[\ell ]j})$ ($j=1,\ldots ,n$). In addition, suppose $\gamma _{r 0}=E\left( h_{10}(X_{[r]1}) \right) $, $\gamma _{0 s}=E\left( h_{01}(Y_{[s]1}) \right) $, $\xi _{r 0}=Var\left( h_{10}(X_{[r]1}) \right) $, and $\xi _{0 s}=Var\left( h_{01}(Y_{[s]1}) \right) $.

Assume that $N=mk+n\ell $, and $(mk)/N \rightarrow \lambda \in (0,1)$ as $m,n \rightarrow \infty $. Then, according to Theorem 2 in Presnell and Bohn (1999),

$$\begin{aligned} \sqrt{N}(U_{\text {RSS}}-\theta ) {\mathop {\rightarrow }\limits ^{d}} N\left( 0,\frac{p^2 \phi }{\lambda }+\frac{q^2 \varphi }{1-\lambda }\right) , \end{aligned}$$

where

$$\begin{aligned} \phi =\zeta _{10}-\frac{1}{k} \sum _{r=1}^k \left( \gamma _{r 0} -\theta \right) ^2, \end{aligned}$$

and

$$\begin{aligned} \varphi =\zeta _{01}-\frac{1}{\ell } \sum _{s=1}^\ell \left( \gamma _{0 s} -\theta \right) ^2. \end{aligned}$$
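A short intermediate step, added here for clarity, links $\phi $ and $\varphi $ to the per-rank variances $\xi _{r0}$ and $\xi _{0s}$ defined above. It assumes that the distributions of the judgment order statistics average to the parent distribution, i.e. $\frac{1}{k}\sum _{r=1}^k F_{[r]}=F$ and $\frac{1}{\ell }\sum _{s=1}^\ell G_{[s]}=G$, where $F_{[r]}$ and $G_{[s]}$ (notation introduced only for this remark) denote the distribution functions of $X_{[r]1}$ and $Y_{[s]1}$. This holds under perfect ranking and under the usual consistent-ranking models. Since $E\left( h_{10}(X_1) \right) =\theta $,

$$\begin{aligned} \zeta _{10}=\frac{1}{k}\sum _{r=1}^k E\left( h_{10}(X_{[r]1})-\theta \right) ^2 =\frac{1}{k}\sum _{r=1}^k \xi _{r 0}+\frac{1}{k}\sum _{r=1}^k \left( \gamma _{r 0}-\theta \right) ^2, \end{aligned}$$

so that $\phi =\frac{1}{k}\sum _{r=1}^k \xi _{r 0}$ and, by the same argument, $\varphi =\frac{1}{\ell }\sum _{s=1}^\ell \xi _{0 s}$. In particular, $0\le \phi \le \zeta _{10}$ and $0\le \varphi \le \zeta _{01}$.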

If we set $h(x,y)=I(x<y)$, which is a kernel of degree (1, 1), then $\theta =P(X<Y)$ and Proposition 1 follows.
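To make this specialization explicit, the following worked reduction assumes that F and G are continuous (an assumption stated here for illustration; the exact form of Proposition 1 is given in the main text). With $h(x,y)=I(x<y)$ and $p=q=1$,

$$\begin{aligned} h_{10}(x)&=P(x<Y)=1-G(x), \qquad h_{01}(y)=P(X<y)=F(y), \\ \zeta _{10}&=Var\left( G(X_1) \right) , \qquad \zeta _{01}=Var\left( F(Y_1) \right) , \\ \gamma _{r 0}&=1-E\left( G(X_{[r]1}) \right) , \qquad \gamma _{0 s}=E\left( F(Y_{[s]1}) \right) , \end{aligned}$$

and the limiting variance of $\sqrt{N}(U_{\text {RSS}}-\theta )$ becomes $\phi /\lambda +\varphi /(1-\lambda )$.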

Also, from Corollary 2 in Presnell and Bohn (1999), the asymptotic relative efficiency of $U_{\text {RSS}}$ to $U_{\text {SRS}}$ is

$$\begin{aligned} ARE(U_{\text {RSS}},U_{\text {SRS}})=1+\frac{\frac{(1-\lambda ) p^2}{k}\sum _{r=1}^k \left( \gamma _{r 0}-\theta \right) ^2 + \frac{\lambda q^2}{\ell }\sum _{s=1}^\ell \left( \gamma _{0 s}-\theta \right) ^2 }{\frac{(1-\lambda ) p^2}{k}\sum _{r=1}^k \xi _{r 0} + \frac{\lambda q^2}{\ell }\sum _{s=1}^\ell \xi _{0 s}}. \end{aligned}$$

Again, with the same choice of kernel as above, Proposition 2 follows.
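The efficiency gain expressed by the formula above can be checked informally by simulation. The sketch below is not the authors' simulation design: it assumes perfect ranking, normal populations $N(0,1)$ and $N(\text{shift},1)$, and small illustrative values of $k$, $\ell $, $m$ and $n$, and it estimates a finite-sample relative efficiency (ratio of Monte Carlo variances) rather than the asymptotic quantity. All function names are hypothetical.

```python
import numpy as np


def rss_sample(rng, draw, set_size, cycles):
    """Ranked set sample under perfect ranking (an assumption of this sketch).

    In each cycle, `set_size` sets of `set_size` units are drawn and the
    r-th smallest unit of the r-th set is retained.
    """
    # Axes: (cycle, set within cycle, unit within set); sort within each set.
    sets = np.sort(draw(rng, (cycles, set_size, set_size)), axis=2)
    r = np.arange(set_size)
    return sets[:, r, r].ravel()  # r-th order statistic of the r-th set


def theta_hat(x, y):
    """Two-sample U-statistic with kernel h(x, y) = I(x < y)."""
    x = np.asarray(x)
    y = np.asarray(y)
    return np.mean(x[:, None] < y[None, :])


def relative_efficiency(k=3, ell=3, m=5, n=5, shift=0.5, reps=2000, seed=0):
    """Monte Carlo estimate of Var(SRS estimator) / Var(RSS estimator)."""
    rng = np.random.default_rng(seed)

    def draw_x(g, size):
        return g.normal(0.0, 1.0, size)

    def draw_y(g, size):
        return g.normal(shift, 1.0, size)

    rss = np.empty(reps)
    srs = np.empty(reps)
    for b in range(reps):
        rss[b] = theta_hat(rss_sample(rng, draw_x, k, m),
                           rss_sample(rng, draw_y, ell, n))
        # SRS samples with the same numbers of measured units, mk and n*ell.
        srs[b] = theta_hat(draw_x(rng, m * k), draw_y(rng, n * ell))
    return srs.var(ddof=1) / rss.var(ddof=1)


if __name__ == "__main__":
    print(f"estimated relative efficiency: {relative_efficiency():.2f}")
```

A ratio above one indicates that, for the same number of measured units, the RSS-based estimator varies less than its SRS counterpart in this setting; the size of the gain depends on the set sizes and on ranking quality.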

About this article

Cite this article

Mahdizadeh, M., Zamanzade, E. Interval estimation of $P(X<Y)$ in ranked set sampling. Comput Stat 33, 1325–1348 (2018). https://doi.org/10.1007/s00180-018-0795-x

Received: 19 May 2017
Accepted: 29 January 2018
Published: 07 February 2018
Issue Date: September 2018
DOI: https://doi.org/10.1007/s00180-018-0795-x
