Estimation of a Finite Population Mean and Total Using Population Ranks of Sample Units

Ozturk, Omer

doi:10.1007/s13253-015-0231-4


Published: 20 October 2015

Volume 21, pages 181–202, (2016)

Journal of Agricultural, Biological, and Environmental Statistics


Omer Ozturk¹


Abstract

This paper introduces new estimators for population total and mean in a finite population setting, where ranks (or approximate ranks) of population units are available before selecting sample units. The proposed estimators require selecting a simple random sample and identifying the population ranks of sample units. Selection of the sample can be performed with- or without-replacement. The population ranks of the selected units of with-replacement samples are determined among all population units. On the other hand, the ranks of the sample units of without-replacement samples are identified in two different ways: (1) The rank of a sample unit is determined sequentially among the remaining population units after excluding all previously ranked sample units from the population; (2) The ranks are determined among all units in the population. By conditioning on these population ranks, we construct a set of weighted estimators, develop a bootstrap re-sampling procedure to estimate the variances of the estimators, and construct percentile confidence intervals for the population mean and total. We show that the new estimators provide a substantial amount of efficiency gain over their competitors. We apply the proposed estimators to estimate corn production in one of the counties in Ohio.


1 Introduction

In many survey sampling studies, in addition to the variable of interest, researchers often have auxiliary information that can improve statistical inference. In many instances, this auxiliary information may not be accurate, cannot be turned into a numerical covariate, or may even be subjective. Even though it contains valuable information, this type of information is often ignored in practice since using it may require strong modeling assumptions. For example, ratio and regression estimators are constructed based on auxiliary variables under strong modeling assumptions. This paper uses such rough information to provide ranks (or approximate ranks) for the population units and thereby improves the information content of the sample while avoiding strong modeling assumptions.

Auxiliary variables are very common in survey sampling studies. Hence, finite population settings provide a natural platform to obtain exact or approximate ranks of all population units. The ranking process can be achieved through the use of auxiliary variables, such as the size of sampling units, previous survey outcomes, census tracts, etc. One such setting is given in Husby et al. (2005), namely, the United States Department of Agriculture’s (USDA) National Agricultural Statistics Service (NASS) county crop estimation program. This program samples farms across the United States from sampling frames that include obvious auxiliary variables, such as acreage in the farm, size of the farm, etc. These auxiliary variables provide a reasonable mechanism to rank the farms based on their crop production. A detailed description of the USDA/NASS county estimation program can be found in Iwig (1993).

In an infinite population setting, the use of subjective information has generated extensive research interest in judgment post-stratified (JPS) and ranked set sampling (RSS) designs. Readers can find recent research activities and a detailed description of JPS sampling designs in MacEachern et al. (2004), Frey and Feeman (2012, 2013), Frey and Ozturk (2011), Stokes et al. (2007), Wang et al. (2006, 2008, 2012), and Ozturk (2014a, b). The ranked set sampling design was originally developed by McIntyre (1952, reprinted in 2005) to keep the overall cost of data collection minimal in estimating mean pasture yield in agricultural fields in an infinite population setting. In recent years, there has been a surge in research on RSS designs in both finite and infinite population settings. A small part of the literature in the finite population setting includes Patil et al. (1995), Jafari Jozani and Johnson (2011, 2012), Frey (2011), Gokpinar and Ozdemir (2010), Ozturk and Jafari Jozani (2013), Al-Saleh and Samawi (2007), Ozdemir and Gokpinar (2008) and Ozturk (2014c). A comprehensive, up-to-date literature review of both JPS and RSS can be found in the recent review paper by Wolfe (2012).

Both RSS and JPS sampling designs use ranking information from a few units in a set, not from the entire population, to divide the data into homogeneous groups of judgment strata. This ranking process is subjective and does not require strong modeling assumptions. It only needs a consistent ranking scheme to create ranks for the units in a set without requiring an established standard of measurement. On the other hand, RSS and JPS use ranks locally in a set and ignore the global ranking information in the population. This paper, unlike JPS and RSS sampling designs, concentrates on global ranking information in the entire population and creates informative samples.

The paper considers finite population settings and assumes the ranks of population units are available before sampling. It selects a simple random sample (SRS) of n units from a finite population of size N either with- or without-replacement. For each selected unit in the sample, we measure the characteristic of interest along with its population rank. If the sample is selected with-replacement, the rank of the selected unit is determined from the entire population including all population units. If the sample is taken without-replacement, the population ranks can be determined in two different ways. In the first approach, before selecting the sample, ranks are assigned in the entire population including all units, and then a simple random sample without replacement is selected along with the population ranks of the sample units. In the second approach, ranks are determined sequentially. The rank of a sample unit is determined among the remaining population units by excluding all the previously selected sample units from the population. Even though these two approaches yield without-replacement SRS samples, they create different ranking structures in the sample. Hence, replacement policies and ranking structures lead to three different designs: design-0, design-1, and design-2. Design-0 selects the sample with replacement and assigns the ranks in the entire population; design-1 selects the sample without-replacement and assigns the ranks sequentially; and finally, design-2 selects the sample without-replacement and assigns the ranks in the entire population.
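The three selection schemes described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `draw_sample`, its arguments, and the representation of the population as a list of known ranks are hypothetical and do not come from the paper.

```python
import random

def draw_sample(ranks, n, design, rng=random):
    """Return (unit_index, rank) pairs for design 0, 1, or 2.

    ranks[i] is the known population rank (1..N) of unit i (assumed distinct).
    """
    N = len(ranks)
    if design == 0:
        # With replacement; ranks are taken in the full population.
        idx = [rng.randrange(N) for _ in range(n)]
        return [(i, ranks[i]) for i in idx]
    idx = rng.sample(range(N), n)  # without replacement
    if design == 2:
        # Ranks are taken in the full population.
        return [(i, ranks[i]) for i in idx]
    if design == 1:
        out = []
        for j, i in enumerate(idx):
            # Sequential rank: drop earlier-selected units ranked below unit i.
            removed_below = sum(ranks[k] < ranks[i] for k in idx[:j])
            out.append((i, ranks[i] - removed_below))
        return out
    raise ValueError("design must be 0, 1, or 2")
```

In design-1 the j-th selected unit (counting from 1) receives a rank between 1 and N - j + 1, because it is ranked among the units that have not yet been selected.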

Ranking structures in these sampling designs provide a lot of information about the exact (or approximate) population location of the sample units. This location information can be used to borrow additional information from other unmeasured population units to improve the information content of the sample. For each measured unit, we consider selecting additional $H-1$ unmeasured units without replacement from the remaining population units to form a set of size H. The relative position of each measured unit in these sets can be computed through its rank in the set by conditioning on its population rank. The rank of the measured unit in a set of size H yields a discrete conditional probability distribution given the population rank of the same measured unit. These conditional probabilities of within-set ranks further provide a mechanism to compute conditional inclusion probabilities for the population units. The final sample in this process consists of three pieces of information: measured values, conditional ranking probabilities, and the conditional inclusion probabilities. Even though the measured observations form an SRS sample in design-0, design-1, and design-2, the conditional probabilities are different. Hence, these designs show different characteristics.

Section 2 provides detailed developments for the construction of design-0, design-1, and design-2. For each design, we construct a probability distribution for the approximate location of the measured units among the unmeasured population units in a set of size H and compute the first-order conditional inclusion probabilities of the population units given the population ranks of a sample. Section 3 uses ranking information and conditional inclusion probabilities to construct estimators for the population mean and total. Section 4 provides empirical evidence to compare the proposed estimators with their competitors. Section 5 develops a bootstrap re-sampling procedure to estimate the variance of the estimators and to construct percentile confidence intervals for the population mean and total. Section 6 applies the proposed estimators to USDA 1992 Ohio corn data. Finally, Sect. 7 provides a concluding remark.

2 Construction of sampling designs

Consider a finite population of N units labeled as $\mathcal{P}=\{u_1,\ldots , u_N\}$. For each population unit $u_i$, we assume that its population rank $s_i$, $1 \le s_i \le N$, is known. If the population ranks are not known, we assume that there exists an auxiliary variable Y highly correlated with the variable of interest X. We then estimate the population rank of X from the population rank of the auxiliary variable Y. In the remaining part of the paper, we use $s_i$ to denote the population rank of the unit $u_i$ whether it is estimated or not.

Design-0 We select a simple random sample, $U=\{u_{s_1},\ldots ,u_{s_n}\}$, with-replacement from $\mathcal{P}$ and measure all of them for the variable of interest X, $\varvec{X}=(X_1,\ldots , X_n)$. We then identify the population ranks of the measured units, $\varvec{S}=\{s_1,\ldots , s_n\}$, where $s_j$ is the population rank of $X_j$. Our sample then consists of n measurements and population ranks that correspond to these measured units:

$$\begin{aligned} \varvec{X}_{\varvec{S}}=\left\{ X_j,s_j\right\} , j=1,\ldots , n. \end{aligned}$$

(1)

We note that since the sample is selected with replacement, some pairs $(X_i,s_i)$ may appear more than once in the sample. If we ignore the ranks in Eq. (1), $\varvec{X}_{\varvec{S}}$ becomes a simple random sample, and the inference can be developed based on standard theory in the finite population setting. Since $\varvec{S}$ contains the true (or estimated) population ranks of the sample units, it provides a lot of information about the approximate location (in a relative sense) of the sample units in the population. This location indicator allows us to borrow additional information from $H-1$ other unmeasured population units, $1 \le H \le N-1$. To achieve this goal, for each selected unit $u_{s_j}$, we consider selecting $H-1$ additional units at random without-replacement from the remaining $N-1$ population units and form a set of size H

$$\begin{aligned} U_{j,H}=\{u_{s_j},u_{t_1},\ldots , u_{t_{H-1}}\}, t_h \ne s_j, h=1,\ldots , H-1, \end{aligned}$$

(2)

where $u_{t_h}$ is the $t_h$-th smallest unit among N population units. Let $R_{s_j}$ be the rank of $X_j$, the random variable obtained from $u_{s_j}$, in the set $U_{j,H}$. The conditional probability that $R_{s_j}$ is equal to h given that $X_j$ is the $s_j$-th smallest unit in the population can be computed by

$$\begin{aligned} \alpha ^{(0)} (h|s_j)=P(R_{s_j}=h|X_j=x_{s_j})= \frac{ \left( \begin{array}{c} {s_j-1}\\ {h-1} \end{array}\right) \left( \begin{array}{c} {N-s_j}\\ {H-h} \end{array} \right) }{\left( \begin{array}{c} {N-1} \\ {H-1} \end{array}\right) }. \end{aligned}$$

(3)

The above expression shows that the rank of random variable $X_j$ in a set of size H has a conditional probability distribution over integers $(1,\ldots ,H)$ given that $X_j$ is the $s_j$-th smallest unit in the population. This conditional distribution helps us to borrow information from additional $H-1$ unmeasured units in the population in addition to the information each measured unit has in the sample.
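As a small numerical illustration of Eq. (3), the sketch below evaluates the conditional rank distribution with Python's `math.comb`. The function name `alpha0` and the example values of N, H, and s are chosen here for illustration only.

```python
from math import comb

def alpha0(h, s, N, H):
    """P(within-set rank = h | population rank = s), following Eq. (3)."""
    return comb(s - 1, h - 1) * comb(N - s, H - h) / comb(N - 1, H - 1)

# Illustrative values (not from the paper): N = 10 units, sets of size 3,
# and a measured unit that is the 4th smallest in the population.
N, H, s = 10, 3, 4
probs = [alpha0(h, s, N, H) for h in range(1, H + 1)]
print(probs)  # probabilities of within-set ranks 1, 2, 3
```

For these values the three probabilities are 15/36, 18/36, and 3/36, so a unit near the lower middle of the population is most likely to be ranked second in a set of three.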

We now look at the problem from a different perspective. Instead of treating $\varvec{X}$ as a simple random sample, we treat it as a sample of independent order statistics by conditioning on the rank vector $\varvec{R}=\{R_{s_1},\ldots , R_{s_n}\}$ generated by Eq. (3). It is clear that the conditional distribution of $X_j$ given that it has rank $R_{s_j}=h_j$ in a set of size H is the same as that of the $h_j$-th order statistic, $X_{(h_j)}\mathop {=}\limits ^{D}X_j|R_{s_j}=h_j$. Let $\varvec{X}_{H|\varvec{S}}=(X_{(h_1)},\ldots ,X_{(h_n)})$ be the n order statistics based on this conditional distribution.

Let $\beta ^{(0)}(i|h,s_j)$ be the probability that the h-th order statistic in set $U_{j,H}$ equals the i-th smallest unit in the population given that $R_{s_j}=h$. This conditional probability can be computed from

$$\begin{aligned} \beta ^{(0)}(i|h,s_j)=P(X_{(R_{s_j})}=x_{i}|R_{s_j}=h)=\frac{ \left( \begin{array}{c} {i-1}\\ {h-1} \end{array}\right) \left( \begin{array}{c} {N-i}\\ {H-h} \end{array}\right) }{\left( \begin{array}{c} {N}\\ {H} \end{array}\right) }. \end{aligned}$$

One can interpret $\beta ^{(0)}(i|h,s_j)$ as the probability mass function of the h-th order statistic given that the rank of $X_j$ equals h, $R_{s_j}=h$. By using $\alpha ^{(0)}(h|s_j)$ and $\beta ^{(0)}(i|h,s_j)$, we obtain the conditional probability that the random variable $X_j$ equals the i-th smallest value in the population given the population rank $s_j$ of $X_j$

$$\begin{aligned} \beta ^{(0)}(i|s_j)=P(X_j=x_{i}|s_j)=\sum _{h=1}^H \beta ^{(0)}(i|h,s_j) \alpha ^{(0)} (h|s_j). \end{aligned}$$

Then the conditional inclusion probability of the i-th population unit in the sample $\varvec{X}_{H|\varvec{S}}$ is given by

$$\begin{aligned} \pi ^{(0)}(i|\varvec{S})= 1- \prod _{j=1}^n(1- \beta ^{(0)}(i|s_j)). \end{aligned}$$

Note that $\pi ^{(0)}(i|\varvec{S})$, $i=1,\ldots ,N$, are not the inclusion probabilities of the sample $\varvec{X}$. They are the inclusion probabilities of the sample $\varvec{X}_{H|\varvec{S}}$. Since i is arbitrary, we can replace it with $s_j$. In this case, $\pi ^{(0)}(s_j|\varvec{S})$ would be the probability that population unit $u_{s_j}$ in sample $\varvec{X}$ would be included in sample $\varvec{X}_{H|\varvec{S}}$.
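The design-0 quantities above can be chained together numerically. The following is a minimal sketch; the function names (`alpha0`, `beta_given_h`, `beta0`, `pi0`) and the example inputs are assumptions made here for illustration.

```python
from math import comb, prod

def alpha0(h, s, N, H):
    """P(within-set rank = h | population rank = s), Eq. (3)."""
    return comb(s - 1, h - 1) * comb(N - s, H - h) / comb(N - 1, H - 1)

def beta_given_h(i, h, N, H):
    """P(h-th order statistic of a set of size H is the i-th smallest unit)."""
    return comb(i - 1, h - 1) * comb(N - i, H - h) / comb(N, H)

def beta0(i, s, N, H):
    """Mix the order-statistic pmf over the conditional rank distribution."""
    return sum(beta_given_h(i, h, N, H) * alpha0(h, s, N, H)
               for h in range(1, H + 1))

def pi0(i, S, N, H):
    """Conditional inclusion probability of unit i given the rank vector S."""
    return 1.0 - prod(1.0 - beta0(i, s, N, H) for s in S)

# Illustrative inputs: N = 10, H = 3, observed population ranks S.
N, H, S = 10, 3, [2, 5, 5]
print([round(pi0(i, S, N, H), 4) for i in range(1, N + 1)])
```

For each fixed $s_j$, the values $\beta^{(0)}(i|s_j)$ sum to one over $i=1,\ldots,N$, which is a convenient sanity check on an implementation.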

Remark 1

If either $H=1$ or $H>1$ and $\alpha ^{(0)}(h|s_j)=1/H$ for $h=1,\ldots , H$, then $\beta ^{(0)}(i|s_j)= 1/N$ for $j=1,\ldots , n$ and $i=1,\ldots ,N$. The inclusion probabilities in these cases reduce to $\pi ^{(0)}(i|\varvec{S})= 1-(\frac{N-1}{N})^n, i=1, \ldots , N$.

For design-0, the data structure of the sample will be denoted by

$$\begin{aligned} \left\{ X_j,\alpha ^{(0)}(h|s_j),\pi ^{(0)}(s_j|\varvec{S})\right\} , \quad h=1,\ldots ,H, \quad j=1, \ldots , n. \end{aligned}$$

Design-1 Design-1 selects the sample units without replacement from the population and measures all of them, $\varvec{X}=(X_{1},\ldots , X_{n})$. Unlike design-0, where ranks in $\varvec{S}$ are computed from all units in the population, the population ranks of the measured units are identified sequentially by removing all of the previously ranked units in the sample from the population. Let $s^*_j$ be the rank of the unit $u_j$ after removing all the previously ranked units from the population. Since the ranks of the selected units are assigned sequentially, we introduce additional notation to accommodate the ranking structure. Let $\mathcal{P}_{-j}$ be the finite population of size $N+1-j$ after removing $j-1$ units from the original population $\mathcal{P}$

$$\begin{aligned} \mathcal{P}_{-j}=\{\hbox {All population units excluding the first } j-1 \hbox { units in the sample}\}. \end{aligned}$$

Since the population rank of each selected unit is determined after removing all previously selected units from the population, the sample with this new ranking structure becomes

$$\begin{aligned} U^*=\{ u_{s_1^*}, u_{s_2^*}, \ldots , u_{s_n^*}\}, \end{aligned}$$

where $u_{s_j^*}$ is the j-th selected unit in the sample that has a rank $s_j^*$ in population $\mathcal{P}_{-j}$, and $\varvec{S}^*=\{s_1^*,\ldots , s_n^*\}$ is the set of the ranks obtained from the reduced populations. The expression $s_j^*=s_j-\sum _{k=1}^{j-1} I(s_k < s_j)$, which counts only the units selected before the j-th one, provides the connection between $s^*_j$ in the reduced population and $s_j$ in the full population. In this sample, for each selected unit $u_{s_j^*}$, we again construct a set of size H to borrow information from additional $H-1$ unmeasured units from population $\mathcal{P}_{-j}$:

$$\begin{aligned} U^*_{j,H}=\{u_{s_j^*}, u_{t_1^*}, \ldots , u_{t_{H-1}^*}\}, \quad j=1, \ldots ,n, \end{aligned}$$

where $u_{t_h^*}$ is the $t_h^*$-th smallest unit in population $\mathcal{P}_{-j}$. Let $R_{s_j^*}$ be the rank of $u_{s_j^*}$ in the set $U^*_{j,H}$. The conditional distribution of $R_{s_j^*}$ given that $s_j^* \in \mathcal{P}_{-j}$ is given by $\alpha ^{(1)}(h|s^*_j)$:

$$\begin{aligned} \alpha ^{(1)}(h_j|s_j^*)= P(R_{s_j^*}=h_j|s_j^* \in \mathcal{P}_{-j})=\frac{ \left( \begin{array}{c} {s_j^*-1}\\ {h_j-1} \end{array} \right) \left( \begin{array}{c} {N-j+1-s^*_j}\\ {H-h_j} \end{array} \right) }{\left( \begin{array}{c} {N-j} \\ {H-1} \end{array}\right) }. \end{aligned}$$
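The sequential ranks and the design-1 rank distribution can be sketched as follows. The function names are hypothetical, and the conversion from full-population ranks assumes that only units selected before the j-th one are removed, in line with the definition of $\mathcal{P}_{-j}$.

```python
from math import comb

def sequential_ranks(S):
    """Convert full-population ranks (in selection order) to sequential ranks.

    The j-th rank is reduced by the number of earlier-selected units
    whose full-population rank is smaller.
    """
    return [s - sum(t < s for t in S[:j]) for j, s in enumerate(S)]

def alpha1(h, s_star, j, N, H):
    """P(within-set rank = h | rank s_star in P_{-j}); j is 1-based."""
    return (comb(s_star - 1, h - 1) * comb(N - j + 1 - s_star, H - h)
            / comb(N - j, H - 1))

S = [5, 2, 9, 3]             # illustrative ranks in the full population
S_star = sequential_ranks(S)  # ranks in the successively reduced populations
print(S_star)
```

For the illustrative input, the third unit has full-population rank 9 but sequential rank 7, because two smaller units (ranks 5 and 2) were already removed.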

As in design-0, we again consider a conditional sample $\varvec{X}_{H|\varvec{S}^*}$ given the population ranks $\varvec{S}^*$. By adopting the notation of design-0, the conditional probability that the h-th order statistic in set $U^*_{j,H}$ equals the i-th smallest unit in the population $\mathcal{P}_{-j}$ given that $R_{s_j^*}=h$ is given by

$$\begin{aligned} \beta ^{(1)}(i|h,s_j^*)=P(X_{(h)}=x_i|R_{s_j^*}=h)= \frac{ \left( \begin{array}{c} {i-1}\\ {h-1} \end{array} \right) \left( \begin{array}{c} {N-j+1-i}\\ {H-h} \end{array} \right) }{\left( \begin{array}{c} {N-j+1}\\ {H} \end{array} \right) }, x_i \in \mathcal{P}_{-j}. \end{aligned}$$

In design-1, the conditional probability that the random variable $X_j$ equals the i-th smallest value in the population $\mathcal{P}_{-j}$ given its population rank $s_j^*$ follows from the law of total probability over the conditional distribution of the rank $R_{s_j^*}$:

$$\begin{aligned} \beta ^{(1)}(i|s_j^*)= P( X_j = x_i|s^*_j) = \sum _{h=1}^H\beta ^{(1)}(i|h,s_j^*) \alpha ^{(1)}(h|s_j^*), x_i \in \mathcal{P}_{-j}. \end{aligned}$$

To compute the conditional inclusion probability of the i-th population unit in the sample $\varvec{X}_{H|\varvec{S}^*}$ given the rank vector $\varvec{S}^*$, $\pi ^{(1)}(i|\varvec{S}^*)$, we use the sequential algorithm given in Frey (2013) in a slightly different context. To compute $\pi ^{(1)}(i|\varvec{S}^*)$, we first need to develop some additional notation due to the sequential identification of population ranks. For $j=0, \ldots , i-1$, let W(j, d) be the probability that the first d units in the sample include j units smaller than the i-th unit but not the i-th unit itself. It is obvious that, if $d=0$, the values of $\{ W(j,0),0 \le j \le i-1\}$ can be computed from

$$\begin{aligned} W(j,0)= \left\{ \begin{array}{ll} 1, &{} j=0 \\ 0, &{}\hbox {otherwise.} \end{array} \right. \end{aligned}$$

The values of W(j, d) for $d>0$ can be computed from a recursive relationship between adjacent selection steps. Assume that the values of $\{ W(j,d), 0 \le j \le i-1 \}$ are known for a fixed d. There are then two ways to obtain the values of $\{ W(j,d+1), 0 \le j \le i-1 \}$ from stage d: (1) There could be j units smaller than the i-th unit among the first d selected units in the sample and the next selected unit in the sample is larger than the i-th unit in the population. (2) There could be $j-1$ units smaller than the i-th unit among the first d selected units in the sample and the next selected unit in the sample is smaller than the i-th unit in the population. These two statements define a recursive equation as follows:

$$\begin{aligned} W(0,d+1)=W(0,d) \sum _{\lambda =i+1}^{N-d} \beta ^{(1)}(\lambda |s_{d+1}^*) \end{aligned}$$

and for $j=1, \ldots , i-1$

$$\begin{aligned} W(j,d+1)=W(j,d) \sum _{\lambda =i+1-j}^{N-d} \beta ^{(1)}(\lambda |s_{d+1}^*) +W(j-1,d)\sum _{\lambda =1}^{(i-1)-(j-1)} \beta ^{(1)}(\lambda |s_{d+1}^*), \end{aligned}$$

where $s_{d+1}^*$ is the rank of the $(d+1)$-th selected unit in the reduced population $\mathcal{P}_{-(d+1)}$, which contains $N-d$ units. Going through this recursive equation for $d=0,\ldots , n-1$, we compute the probabilities $\{W(j,n), 0 \le j \le i-1\}$. The probability that the i-th unit is not included in the sample is then given by $\sum _{j=0}^{i-1} W(j,n)$. The first-order conditional inclusion probability of the i-th unit given $\varvec{S}^*$ and H is then given by

$$\begin{aligned} \pi ^{(1)}(i|\varvec{S}^*)=1- \sum _{j=0}^{i-1} W(j,n), \quad i= 1, \ldots , N. \end{aligned}$$
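The recursion for W(j, d) lends itself to a short dynamic-programming sketch. The code below is an illustration, not the author's implementation: the function names `beta1` and `pi1` are hypothetical, and it assumes that the step from d to d + 1 selected units uses the sequential rank of the (d + 1)-th selected unit, whose reduced population contains N - d units.

```python
from math import comb

def beta1(lam, s_star, j, N, H):
    """P(X_j is the lam-th smallest unit of P_{-j} | rank s_star); j is 1-based."""
    M = N - j + 1  # size of the reduced population P_{-j}
    total = 0.0
    for h in range(1, H + 1):
        a = comb(s_star - 1, h - 1) * comb(M - s_star, H - h) / comb(M - 1, H - 1)
        b = comb(lam - 1, h - 1) * comb(M - lam, H - h) / comb(M, H)
        total += a * b
    return total

def pi1(i, S_star, N, H):
    """Conditional inclusion probability of the i-th smallest population unit."""
    W = [1.0] + [0.0] * (i - 1)  # W[j] holds W(j, 0)
    for d, s_star in enumerate(S_star):  # d units have been selected so far
        M = N - d  # units remaining before selection d + 1
        # b[lam] for lam = 1..M; index 0 is padding
        b = [0.0] + [beta1(lam, s_star, d + 1, N, H) for lam in range(1, M + 1)]
        new = [0.0] * i
        for j in range(i):
            # Next unit is larger than unit i, whose current rank is i - j.
            new[j] = W[j] * sum(b[i - j + 1:M + 1])
            if j >= 1:
                # Next unit is smaller than unit i (current rank i - j + 1).
                new[j] += W[j - 1] * sum(b[1:i - j + 1])
        W = new
    return 1.0 - sum(W)

print(pi1(3, [2, 4, 1], 10, 2))  # illustrative inputs only
```

One way to check such an implementation is the case H = 1, where every remaining unit is equally likely at each step and the result should equal n/N, as for simple random sampling without replacement.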

Note that even though it is not explicitly stated in the notation, W(j, d) is a conditional probability for given population rank vector $\varvec{S}^*$. The data structure of design-1 will be denoted by

$$\begin{aligned} \{X_j,\alpha ^{(1)}(h|s_j^*), \pi ^{(1)}(s^*_j|\varvec{S}^*)\}, h=1,\ldots , H, j=1, \ldots , n \end{aligned}$$

Design-2 We select a simple random sample, $\varvec{X}=(X_1,\ldots , X_n)$, of size n without replacement and identify their ranks, $\varvec{S}=(s_1,\ldots , s_n)$, in population $\mathcal{P}$

$$\begin{aligned} U=\{u_{s_1},u_{s_2},\ldots , u_{s_n}\}. \end{aligned}$$

To borrow additional information from the unmeasured population units, we select n disjoint sets, each of size $H-1$. We then randomly match these n sets with selected units in set U to form n sets, each of size H

$$\begin{aligned} U_{j,H}=\{u_{s_j},u_{t_1},\ldots , u_{t_{H-1}}\}, j=1,\ldots , n. \end{aligned}$$

The conditional probability distribution of the rank $R_{s_j}$ of $X_j$ in set $U_{j,H}$ given that $X_j$ has the rank $s_j$ in the population $\mathcal{P}$ is given by

$$\begin{aligned} \alpha ^{(2)}(h_j|s_j)=P(R_{s_j}=h_j|s_j)=\frac{\left( \begin{array}{c} {s_j-1}\\ {h_j-1} \end{array} \right) \left( \begin{array}{c} {N-s_j}\\ {H-h_j} \end{array} \right) }{\left( \begin{array}{c} {N-1} \\ {H-1} \end{array}\right) }, h_j=1,\ldots , H. \end{aligned}$$

In a similar fashion, in the sample $\varvec{X}_{H|\varvec{S}}$, the conditional probability that the h-th order statistic in set $U_{j,H}$ equals the i-th smallest unit in the population $\mathcal{P}$ given that $R_{s_j}=h$ is given by

$$\begin{aligned} \beta ^{(2)}(i|h,s_j)=P(X_{(h)}=x_i|R_{s_j}=h)=\frac{\left( \begin{array}{c} {i-1} \\ {h-1} \end{array} \right) \left( \begin{array}{c} {N-i}\\ {H-h} \end{array}\right) }{\left( \begin{array}{c} {N}\\ {H} \end{array} \right) }, x_i \in \mathcal{P}. \end{aligned}$$

In Eq. (4), the sum over possible values of h yields the conditional probability that $X_j$ equals the i-th smallest unit in the population given the population rank $s_j$

$$\begin{aligned} \beta ^{(2)}(i|s_j)=\sum _{h=1}^H \beta ^{(2)}(i|h,s_j) \alpha ^{(2)}(h|s_j). \end{aligned}$$

(4)

The first-order conditional inclusion probabilities given the population ranks of the observed measurements then follow from

$$\begin{aligned} \pi ^{(2)}(i|\varvec{S})= \sum _{j=1}^n \beta ^{(2)}(i|s_j), i=1,\ldots ,N. \end{aligned}$$

Finally, the data structure of the sample from design-2 is denoted with

$$\begin{aligned} \{X_{j},\alpha ^{(2)}(h|s_j),\pi ^{(2)}(s_j|\varvec{S})\}, h=1,\ldots ,H, j=1,\ldots ,n. \end{aligned}$$

Remark 2

If either $H=1$ or $H>1$ and $\alpha ^{(2)}(h|s_j)=1/H$ for $h=1,\ldots , H$, then $\beta ^{(2)}(i|s_j)= 1/N$ for $j=1,\ldots , n$ and $i=1,\ldots ,N$. The inclusion probabilities in these cases reduce to $\pi ^{(2)}(i|\varvec{S})= n/N, i=1, \ldots , N$.
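Remark 2 can be checked numerically. The sketch below computes the design-2 inclusion probabilities as the sum of $\beta^{(2)}(i|s_j)$ over the sample; the function names and the example rank vector are assumptions made for illustration.

```python
from math import comb

def beta2(i, s, N, H):
    """Eq. (4): P(X_j is the i-th smallest unit | population rank s)."""
    total = 0.0
    for h in range(1, H + 1):
        a = comb(s - 1, h - 1) * comb(N - s, H - h) / comb(N - 1, H - 1)
        b = comb(i - 1, h - 1) * comb(N - i, H - h) / comb(N, H)
        total += a * b
    return total

def pi2(i, S, N, H):
    """Design-2 conditional inclusion probability of unit i given S."""
    return sum(beta2(i, s, N, H) for s in S)

N, S = 12, [3, 7, 11]  # illustrative population size and observed ranks
print([round(pi2(i, S, N, 1), 4) for i in range(1, N + 1)])  # H = 1: all n/N
print([round(pi2(i, S, N, 3), 4) for i in range(1, N + 1)])  # H = 3: varies with i
```

Because each $\beta^{(2)}(\cdot|s_j)$ is a probability mass function over $i=1,\ldots,N$, the design-2 inclusion probabilities sum to n over the population for any H.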

3 Estimators for population mean and total

In this section, we introduce three estimators for population mean and total for each sampling design. The estimators for population total use the data structures established in the previous section:

$$\begin{aligned} T_1^{(L)}= & {} \left\{ \begin{array}{ll} {\sum }_{j=1}^n w^{(L)}_1(j|\varvec{S})X_j &{} L=0,2 \\ {\sum }_{j=1}^n w^{(L)}_1(j|\varvec{S}^*)X_j &{} L=1 \end{array} \right. \\ T_2^{(L)}= & {} \left\{ \begin{array}{ll} \frac{N}{H}{\sum }_{h=1}^H{\sum }_{j=1}^nw^{(L)}_2(h,j|\varvec{S}) X_j &{} L=0,2,\\ \ \frac{N}{H}{\sum }_{h=1}^H{\sum }_{j=1}^nw^{(L)}_2(h,j|\varvec{S}^*) X_j &{} L=1 \end{array} \right. \end{aligned}$$

and

$$\begin{aligned} T_3^{(L)}=\left\{ \begin{array}{ll} \frac{N}{H}{\sum }_{h=1}^H{\sum }_{j=1}^nw^{(L)}_3(h,j|\varvec{S}) X_j &{} L=0,2 \\ \frac{N}{H}{\sum }_{h=1}^H{\sum }_{j=1}^nw^{(L)}_3(h,j|\varvec{S}^*) X_j &{} L=1 \end{array} \right. \end{aligned}$$

where

$$\begin{aligned}&w_1^{(L)}(j|\varvec{S})=\frac{1}{\pi ^{(L)}(s_j|\varvec{S})}, \quad w_2^{(L)}(h,j|\varvec{S}) = \frac{\alpha ^{(L)}(h|s_j)}{{\sum }_{k=1}^n\alpha ^{(L)}(h|s_k)}, \\&w_3^{(L)}(h,j|\varvec{S}) = \frac{\frac{\alpha ^{(L)}(h|s_j)}{\pi ^{(L)}(s_j|\varvec{S})}}{{\sum }_{k=1}^n\frac{\alpha ^{(L)}(h|s_k)}{\pi ^{(L)}(s_k|\varvec{S})}}. \end{aligned}$$

We note that design-1 estimators always use the population ranks $\varvec{S}^*$ in the reduced populations $\mathcal{P}_{-j}$, $j=1,\ldots ,n$, to compute the conditional probabilities $\alpha ^{(1)}(h|s^*_j)$ and $\pi ^{(1)}(s^*_j|\varvec{S}^*)$. Estimator $T^{(L)}_1$ is motivated by the Horvitz–Thompson estimator (Horvitz and Thompson 1952), where units having a smaller inclusion probability in the sample are given higher weight. On the other hand, it should be clear that $T^{(L)}_1$ is not a Horvitz–Thompson estimator since $\pi ^{(L)}(i|\varvec{S})$ and $\pi ^{(L)}(i|\varvec{S}^*)$ are not inclusion probabilities for sample $\varvec{X}$. They are the inclusion probabilities for samples $\varvec{X}_{H|\varvec{S}}$ and $\varvec{X}_{H|\varvec{S}^*}$, respectively.

The estimator $T^{(L)}_2$ is motivated by the JPS estimator in MacEachern et al. (2004), where each measured observation is prorated to H ranking classes. The prorated share is proportional to the probability that the measured unit has rank h in a set of size H. This prorating process creates H strata. Hence, improvement over the simple random sample (SRS) estimator can be anticipated from the theory of stratified sampling designs in survey sampling. Even though the estimators $T^{(0)}_2$ and $T^{(2)}_2$ have the same form, they yield different efficiency results since the sample $\varvec{X}$ is constructed with and without replacement in design-0 and design-2, respectively.

The estimator $T^{(L)}_3$ uses the same idea as in estimator $T^{(L)}_2$, but it gives more weight to observations that are less likely to be included in the sample to reduce the variance of the estimator. One then anticipates that the estimator $T^{(L)}_3$ performs better than the other two estimators.

Estimators for the population mean can be obtained by dividing $T^{(L)}_r$ by N

$$\begin{aligned} \hat{\mu }^{(L)}_r= \frac{1}{N} T^{(L)}_r, \quad r=1,2,3, \quad L=0,1,2. \end{aligned}$$
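The estimators $T_2^{(L)}$ and $T_3^{(L)}$ share one computational pattern, sketched below for designs 0 and 2, where $\alpha^{(L)}$ is given by Eq. (3). The function names, the `pi` argument, and the example data are hypothetical. The text does not say what to do when a ranking class has zero total weight, so this sketch simply raises an error in that case.

```python
from math import comb

def alpha(h, s, N, H):
    """Eq. (3): P(within-set rank = h | population rank = s)."""
    return comb(s - 1, h - 1) * comb(N - s, H - h) / comb(N - 1, H - 1)

def weighted_total(X, S, N, H, pi=None):
    """T_2 when pi is None; T_3 when pi[j] is the inclusion probability of unit j."""
    total = 0.0
    for h in range(1, H + 1):
        # Unnormalised weight of each observation in ranking class h.
        w = [alpha(h, s, N, H) / (pi[j] if pi is not None else 1.0)
             for j, s in enumerate(S)]
        denom = sum(w)
        if denom == 0.0:
            # Assumption of this sketch: an empty class is treated as an error.
            raise ValueError(f"ranking class {h} has zero total weight")
        total += sum(wj * xj for wj, xj in zip(w, X)) / denom
    return N / H * total

# Illustrative data: three measurements with population ranks 2, 5, 9.
X, S, N, H = [1.0, 2.0, 3.0], [2, 5, 9], 10, 2
t2 = weighted_total(X, S, N, H)
print(t2, t2 / N)  # estimated total and mean
```

Since the weights are normalised within each ranking class, a sample of identical measurements c returns a total of exactly N times c, whatever the ranks or inclusion probabilities are.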

4 Empirical evidence

In this section, we investigate the efficiency of the estimators. Even though it is theoretically possible to construct the probability distributions of the estimators by computing the weight functions over all possible values of $\varvec{S}$, this would be computationally intensive even for moderate sample and population sizes. Hence, to reduce the computational burden, we use a simulation study to investigate the properties of the estimators.

The simulation study considered two sample sizes (n) and two set sizes (H), $n=20,50$ and $H=2,5$, respectively. Ranking accuracy is controlled by the correlation coefficient $\rho =1.00,0.75, 0.5$ between X and Y. Populations of size $N=300$ are generated from discrete normal and exponential distributions by $x_i= Q((i-0.5)/N)$, $i=1,\ldots , N$, where Q is the quantile function of either the standard normal or the standard exponential distribution depending on the underlying population. The simulation size is taken to be 1000.

Ranking accuracy is simulated by the perceived size ranking model of Dell and Clutter (1972). This model, for the population values $\varvec{x}=(x_1,\ldots , x_N)$, generates an N-dimensional random vector, $\varvec{\epsilon }=(\epsilon _1, \ldots , \epsilon _N)$, from a normal distribution having mean zero and variance $\tau ^2$. These two vectors are added to create a ranking vector $\varvec{y}=\varvec{x}+\varvec{\epsilon }$. The ranks of the observations in the vector $\varvec{y}$ are used to predict the ranks of the values ($\varvec{x}$) of population units. The accuracy of ranking is controlled by the correlation coefficient $\rho =\mathrm{corr}(Y,X)=\frac{1}{\sqrt{1+\tau ^2/\sigma ^2}}$, or equivalently by the variance $\tau ^2$, where $\sigma ^2$ is the variance of X.
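The population and ranking model described above can be sketched with the Python standard library. The function names are hypothetical; solving the correlation formula for the noise variance gives $\tau^2=\sigma^2(1/\rho^2-1)$, and the sketch assumes $\sigma^2$ is taken as the variance of the N population values.

```python
import random
from statistics import NormalDist, pvariance

def discrete_normal_population(N, mu=0.0, sigma=1.0):
    """x_i = Q((i - 0.5) / N), with Q the normal quantile function."""
    q = NormalDist(mu, sigma).inv_cdf
    return [q((i - 0.5) / N) for i in range(1, N + 1)]

def perceived_ranks(x, rho, rng):
    """Ranks of y = x + eps, with eps ~ N(0, tau^2) chosen to give corr rho."""
    sigma2 = pvariance(x)  # assumption: population variance of the x values
    tau = (sigma2 * (1.0 / rho ** 2 - 1.0)) ** 0.5
    y = [xi + rng.gauss(0.0, tau) for xi in x]
    order = sorted(range(len(y)), key=y.__getitem__)
    ranks = [0] * len(y)
    for r, idx in enumerate(order, start=1):
        ranks[idx] = r
    return ranks

x = discrete_normal_population(300)
ranks = perceived_ranks(x, 0.75, random.Random(1))
```

With $\rho=1$ the noise variance is zero and the perceived ranks coincide with the true ranks; smaller values of $\rho$ produce increasingly noisy ranks.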

Table 1 Biases of the estimators $\hat{\mu }_{r}^{(L)}$, $r=1,2,3$ and SRS mean.

Table 2 Biases of the estimators $\hat{\mu }_{r}^{(L)}$, $r=1,2,3$ and SRS mean.

Tables 1 and 2 present the biases of the estimators for discrete normal and exponential distributions. It is clear from these tables that the estimator $\hat{\mu }_1^{(L)}$ has a substantial amount of bias in all sampling designs when the population mean is large, or equivalently when the coefficient of variation, CV$=\sigma /\mu $, is small. For example, the biases of the estimator $\hat{\mu }_1^{(L)}$ are practically zero when $\mu =0$, but they become very large for $\mu =100$. The other estimators, $\hat{\mu }_r^{(L)}$, $r=2,3$; $L=0,1,2$, appear to be essentially unbiased for all $\mu $ and sampling designs. Since the biases of the estimators $\hat{\mu }_1^{(L)}$, $L=0,1,2$, are very large when the coefficient of variation is small, these estimators are not considered any further in this paper.

Table 3 Relative efficiencies of $\hat{\mu }_{r}^{(L)}$, $r=2,3$ and SRS estimators with respect to $\hat{\mu }_{3}^{(2)}$, $R^{(L)}_r=\mathrm{MSE}(\hat{\mu }_{r}^{(L)})/\mathrm{MSE}(\hat{\mu }_{3}^{(2)})$ and $R_{\mathrm{SRS}}=\mathrm{MSE}(\mathrm{SRS})/\mathrm{MSE}(\hat{\mu }_{3}^{(2)})$.


Table 4 Relative efficiencies of $\hat{\mu }_{r}^{(L)}$, $r=2,3$ and SRS estimators with respect to $\hat{\mu }_{3}^{(2)}$, $R^{(L)}_r=\mathrm{MSE}(\hat{\mu }_{r}^{(L)})/\mathrm{MSE}(\hat{\mu }_{3}^{(2)})$, and $R_{\mathrm{SRS}}=\mathrm{MSE}(\mathrm{SRS})/\mathrm{MSE}(\hat{\mu }_{3}^{(2)})$.


Tables 3 and 4 present the relative efficiencies of $\hat{\mu }_r^{(L)}$, $r=2,3$, and SRS mean with respect to the estimator $\hat{\mu }_3^{(2)}$

$$\begin{aligned} R_r^{(L)}= \frac{\mathrm{MSE}(\hat{\mu }_r^{(L)} )}{\mathrm{MSE}(\hat{\mu }_3^{(2)})}, r=2,3; L=0,1,2, \hbox { and } R_{\mathrm{SRS}}= \frac{\mathrm{MSE}(\mathrm{SRS})}{\mathrm{MSE}(\hat{\mu }_3^{(2)})}. \end{aligned}$$

Values of $R_r^{(L)} > 1$ and $R_{\mathrm{SRS}} >1$ indicate that the estimator $\hat{\mu }_3^{(2)}$ outperforms $\hat{\mu }_r^{(L)}$ and the SRS mean, respectively.

There are several important features in Tables 3 and 4 that need to be discussed. It is clear that $R_{\mathrm{SRS}}/R_r^{(L)} >1$, for $r=2,3$ and $L=0,1,2$, which indicates that all of the proposed estimators have higher efficiencies than the SRS mean. The efficiency gain is substantial if the ranking information is accurate and the set size is large. For example, in Table 3, the efficiencies of $\hat{\mu }_2^{(0)}$, $\hat{\mu }_3^{(0)}$, $\hat{\mu }_2^{(1)}$, $\hat{\mu }_3^{(1)}$, $\hat{\mu }_2^{(2)}$, and $\hat{\mu }_3^{(2)}$ with respect to the SRS mean are 5.480 (13.809/2.52), 11.604 (13.809/1.19), 6.529 (13.809/2.115), 14.310 (13.809/0.965), 5.960 (13.809/2.317), and 13.809, respectively, when $n=20$, $H=5$, $\rho =1$, and $\mu =0$. Even if $\rho = 0.50$, the new estimators are still better than the SRS mean. On the other hand, the efficiency gain is not as high as that under perfect ranking.
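The efficiencies quoted above are ratios of table entries, since $R_{\mathrm{SRS}}/R_r^{(L)}=\mathrm{MSE}(\mathrm{SRS})/\mathrm{MSE}(\hat{\mu }_r^{(L)})$. The short sketch below reproduces the arithmetic from the values given in the text for $n=20$, $H=5$, $\rho=1$, $\mu=0$; the dictionary keys are labels chosen here for illustration.

```python
# Relative efficiencies with respect to mu_3^(2), as quoted in the text.
r_srs = 13.809
r = {
    "mu2_design0": 2.52,
    "mu3_design0": 1.19,
    "mu2_design1": 2.115,
    "mu3_design1": 0.965,
    "mu2_design2": 2.317,
    "mu3_design2": 1.0,  # reference estimator, so its ratio is 1 by definition
}

# Efficiency of each estimator relative to the SRS mean.
eff_vs_srs = {name: round(r_srs / value, 3) for name, value in r.items()}
for name, value in eff_vs_srs.items():
    print(f"{name}: {value:.3f}")
```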

Tables 3 and 4 also reveal that the relative efficiencies of design-0 estimators are generally lower than those of design-1 and design-2 estimators. This is mostly due to the replacement policy of design-0, where units are selected with replacement. Among the three designs, design-1 appears to be the most efficient. For example, $R^{(2)}_2 /R_2^{(1)} >1$. This can be anticipated from the fact that design-1 determines the ranks of sample units sequentially by removing all the previously ranked units in the sample from the population. This sequential ranking provides a stronger data structure in the sample in design-1 than in design-2, and hence increases the efficiency. On the other hand, design-1 and design-2 are comparable in efficiency for the estimators $\hat{\mu }^{(1)}_3$ and $\hat{\mu }^{(2)}_3$. They have practically the same efficiency, $R_3^{(1)} \approx 1$. The estimator $\hat{\mu }^{(2)}_3$ is slightly less efficient when $\rho =1$, but for $\rho <1$ the estimators are essentially equivalent in efficiency.

The efficiencies of the estimators in Tables 3 and 4 appear to be an increasing function of set size H. To investigate the impact of set size further, we performed another simulation study using a discrete normal population ($\mu =10$ and $\sigma = 10$) of size $N=100$ with $\rho =1$. In this part of the simulation study, sample and simulation sizes are taken to be 20 and 1000, respectively. Figure 1 plots the mean square errors (MSEs) of the estimators $\hat{\mu }_r^{(L)}$, $r=2,3$; $L=0,1,2$, against set size H. It is clear that the MSEs are decreasing functions of set size H for $H \le 15$ and almost flat for $H > 15$. The reason that the MSE plots become flat for large H can be anticipated from the behavior of $\alpha ^{(L)}(h|s_j)$. For large H, this probability will be very small (or zero) for the values of h that are inconsistent with $s_j$. For example, if $s_j$ is small, then large values of h yield $\alpha ^{(L)}(h|s_j) \approx 0$. Hence, these ranking classes contribute negligibly to reducing the MSE of the estimators.

Figure 1 also indicates that design-1 estimators yield smaller MSEs than the design-0 and design-2 estimators. In design-1, population ranks of the selected units are determined sequentially after removing all the previously selected units. This sequential ranking induces a stronger data structure (presumably negative correlations among measured observations). Hence, design-1 yields higher efficiency than the other designs in a finite population setting.

5 Bootstrap variance estimate and confidence interval

In this section, we develop statistical inference for the population mean, but similar inference can also be developed for the population total with a slight change in the notation. The exact sampling distribution of the estimator is not computationally feasible for reasonable sample and population sizes. Therefore, to reduce the computational burden, we use the bootstrap distribution to draw statistical inference.

Discussion in the previous section indicates that design-1 performs slightly better than the other two designs. On the other hand, the computation of the conditional inclusion probabilities in design-1 requires extensive computing time when the population and/or sample sizes are large. Since the estimators $\hat{\mu }^{(1)}_3$ and $\hat{\mu }^{(2)}_3$ are equivalent in their efficiencies for all practical purposes, to increase the bootstrap simulation size, we develop the inference based on design-2 estimator $\hat{\mu }^{(2)}_3$.

Let $\theta $ be the parameter of interest. The parameter $\theta $ can be considered as a statistical functional $\theta = F(\mathcal{P})$. The estimate of $\theta $ can then be obtained by the plug-in method, replacing $\mathcal{P}$ with the empirical population $\hat{\mathcal{P}}$, $\hat{\theta } = F(\hat{\mathcal{P}})$. The empirical bootstrap population $\hat{\mathcal{P}}$ should preserve the without-replacement structure of design-0, design-1, and design-2. Let $\varvec{x}_S^{(L)}=\{x_j,s_j\}$; $j=1,\ldots , n$, be the measured values of the simple random sample, $U=\{u_{s_1},\ldots , u_{s_n}\}$, selected from population $\mathcal{P}$ based on design-L, $L=0,1,2$. Let D be the integer part of the ratio N / n. We construct the empirical bootstrap population by repeating the set $\varvec{x}_S^{(L)}$ D times and selecting $d=N-Dn$ pairs at random from $\varvec{x}_S^{(L)}$ to create an empirical population of size N:

$$\begin{aligned} \hat{\mathcal{P}}^{(L)}=\{\varvec{x}_S^{(L)}, \ldots , \varvec{x}_S^{(L)}, z_1,\ldots , z_d\}, \end{aligned}$$

where $z_t$, $t=1,\ldots ,d$, are randomly selected pairs from $\varvec{x}_S^{(L)}$. It is clear that the size of the empirical population is the same as that of the original population. The bootstrap samples are then selected without replacement from population $\hat{\mathcal{P}}^{(L)}$. Let $\varvec{x}^*=\{x^*_j,a_j\}$; $j=1, \ldots , n,$ be a re-sample from population $\hat{\mathcal{P}}^{(L)}$, where $a_j \in S$ for $j=1,\ldots ,n$. For each $b=1,\ldots , B$, let $\varvec{x}^*_b=\{x^*_{j,b},a_{j,b}\}$; $j=1,\ldots ,n$, be a re-sample selected from $\hat{\mathcal{P}}^{(L)}$. We apply our estimator to each one of these bootstrap re-samples to obtain

$$\begin{aligned} \hat{\mu }_{2,b}^{(L)*}= & {} \frac{N}{H}\sum _{h=1}^H\sum _{j=1}^n w_2^{(L)}(h,j|A_b,H)x^*_{j,b}, \hbox { and } \\ \hat{\mu }_{3,b}^{(L)*}= & {} \frac{N}{H}\sum _{h=1}^H\sum _{j=1}^n w_3^{(L)}(h,j|A_b)x^*_{j,b}, \quad b=1,\ldots ,B, \end{aligned}$$

where $A_b=(a_{1,b},\ldots , a_{n,b})$. The bootstrap variance estimates of the estimators $\hat{\mu }_r^{(L)}$, $r=2,3$, are then given by

$$\begin{aligned} BV(\hat{\mu }_{r}^{(L)})=\frac{1}{B-1} \sum _{b=1}^B \left( \hat{\mu }_{r,b}^{(L)*}-\bar{\hat{\mu }}_r^{(L)*}\right) ^2, r=2,3, \end{aligned}$$

where $\bar{\hat{\mu }}_r^{(L)*}$ is the mean of $\hat{\mu }_{r,b}^{(L)*}$, $b=1,\ldots ,B$.
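The re-sampling steps above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the implementation used in the simulations: the function names are hypothetical, the $d$ extra pairs are assumed to be drawn without replacement from the sample, the empirical population is built once and reused for all B re-samples, and `plain_mean` is a stand-in that ignores the ranks, since the weights $w_r^{(L)}$ are not reproduced in this section. Any function of the re-sampled pairs $(x^*_{j,b}, a_{j,b})$ can be passed in its place.

```python
import random
import statistics

def build_empirical_population(sample, N, rng):
    """Repeat the sample D = N // n times and add d = N - D*n extra pairs.

    Assumption: the d extra pairs are drawn without replacement from the sample.
    """
    n = len(sample)
    D = N // n
    d = N - D * n
    extra = rng.sample(sample, d)
    return list(sample) * D + extra

def bootstrap_variance(sample, N, estimator, B, seed=0):
    """Return the bootstrap variance (divisor B - 1) and the B re-sample estimates."""
    rng = random.Random(seed)
    population = build_empirical_population(sample, N, rng)
    n = len(sample)
    estimates = []
    for _ in range(B):
        resample = rng.sample(population, n)  # drawn without replacement
        estimates.append(estimator(resample))
    return statistics.variance(estimates), estimates

def plain_mean(pairs):
    """Placeholder estimator: mean of the x values, ignoring the ranks."""
    return sum(x for x, _ in pairs) / len(pairs)

# Toy (x_j, s_j) pairs with n = 5 and N = 23, so D = 4 and d = 3.
toy_sample = [(1.0, 3), (2.0, 7), (4.0, 12), (8.0, 20), (9.0, 22)]
bv, boot_estimates = bootstrap_variance(toy_sample, N=23, estimator=plain_mean, B=200)
print(bv)
```

`statistics.variance` uses the divisor $B-1$, which matches the definition of $BV(\hat{\mu }_{r}^{(L)})$.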

A $(1-\gamma )100\,\%$ bootstrap percentile confidence interval is given by $(Q^{\gamma /2}_r,Q^{1-\gamma /2}_r)$, where $Q^{a}_r$ is the a-th quantile of $\hat{\mu }_{r}^{(L)}$ satisfying $a= P(\hat{\mu }_{r}^{(L)}<Q^{a}_r | \mathcal{P})$. The quantiles $Q^{\gamma /2}_r$ and $Q^{1-\gamma /2}_r$ are obtained from the bootstrap distribution of $\hat{\mu }_{r}^{(L)}$.
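As a sketch of the percentile interval, the function below takes a list of bootstrap estimates and returns the empirical $\gamma /2$ and $1-\gamma /2$ quantiles. The function name and the linear interpolation between order statistics are assumptions made for illustration; the text does not specify a quantile convention.

```python
def percentile_interval(estimates, gamma=0.05):
    """Percentile interval from bootstrap estimates.

    Assumption: quantiles are obtained by linear interpolation between
    adjacent order statistics of the bootstrap estimates.
    """
    ordered = sorted(estimates)
    B = len(ordered)

    def quantile(a):
        pos = a * (B - 1)          # fractional position in the sorted list
        lo = int(pos)
        hi = min(lo + 1, B - 1)
        frac = pos - lo
        return ordered[lo] * (1 - frac) + ordered[hi] * frac

    return quantile(gamma / 2), quantile(1 - gamma / 2)

# Toy input: 101 equally spaced "estimates" 0, 1, ..., 100.
lower, upper = percentile_interval(list(range(101)), gamma=0.05)
print(lower, upper)
```

In a coverage study, the interval would be recomputed for every simulated sample and the proportion of intervals containing the true population mean reported.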

In order to investigate the properties of the bootstrap variance estimate of $\hat{\mu }_{r}^{(2)}$, for $r=2,3$, and the bootstrap percentile confidence interval of the population mean, we performed another simulation study. The simulation parameters are taken to be $n=30,50$, $H=2,5$, $\rho =1, 0.75, 0.5$. The shift parameter $\mu $ is selected to be $\mu = 0,100$. Datasets are generated from discrete normal and exponential distributions. Simulation and bootstrap replications are selected to be 2000 and 1000, respectively.

Table 5 Coverage probabilities (C) of 95 % bootstrap percentile confidence intervals of population mean based on estimators $\hat{\mu }_{r}^{(2)}$, $r=2,3$.


Table 5 presents coverage probabilities (C) of the 95 % bootstrap percentile confidence intervals based on estimators $\hat{\mu }_{r}^{(2)}$, $r=2,3$, and the SRS mean. Table 5 reveals that coverage probabilities are reasonably close to the nominal coverage probability 0.95 for the discrete normal distribution. The coverage probabilities for the discrete exponential distribution appear to be slightly lower than the nominal level for small sample sizes. This may be because the super-population exponential distribution is skewed. Hence, it may require larger sample sizes to satisfy the regularity conditions of the bootstrap procedure in Booth et al. (1994).

Table 6 Bootstrap (B) and simulation (S) variance estimates of the estimators $\hat{\mu }_{r}^{(2)}$, $r=2,3$.

Table 6 presents bootstrap (B) and simulated (S) variance estimates of the estimators $\hat{\mu }_{r}^{(2)}$, $r=2,3$, and the SRS mean. The bootstrap variance estimates are almost identical to those estimated from the simulation study. The simulation results provide convincing evidence that variance estimates of the proposed estimators can be computed from the bootstrap distribution.

6 Application

In this section, we apply the proposed sampling designs and estimators to 1992 Ohio corn yield data which were used by the Ohio Agricultural Statistics Department in its county estimation program. This dataset includes responses from farms in the USDA’s National Quarterly Agricultural Survey and from farms responding to the Ohio supplemental survey (Husby et al. 2005). (Also, see Ohio Department of Agriculture, 1993, for published estimates based on these data.) The success of the proposed sampling designs depends heavily on accurate ranking information of the population units. To get a reasonably correct rank ordering of the population units, we select our population as one of the counties from the Ohio Corn Data having 202 farms. Hence, the population size is $N= 202$. In this population, there are five variables: corn production (bushels, X), farm size (acreage, $Y_1$), group size ($Y_2$), acre planted ($Y_3$), and acre harvested ($Y_4$). Our interest lies in estimation of the mean corn production in the county. The construction of the proposed sampling designs requires rank ordering of the 202 farms based on their corn production. We use the variables $Y_1$, $Y_2$, $Y_3$, and $Y_4$ as auxiliary variables to predict the rank ordering of X. There is high correlation between X and the auxiliary variables, $\rho _k=\mathrm{corr}(X,Y_k)$, $k=1,\ldots ,4$. The histogram of X reveals that the population is strongly skewed to the right. The parameters of this population are given in Ozturk (2014c) and reproduced in Table 7.

Table 7 Population characteristics of an Ohio county in 1992 Ohio corn data, $\rho _k=\mathrm{corr}(X,Y_k)$.


The auxiliary variable group size $Y_2$ is an integer-valued random variable which only takes values 1, 2, and 3. There are also ties in the other auxiliary variables $Y_1$, $Y_3$, and $Y_4$. In order to break the ties, we generated a random vector $\varvec{\epsilon }$ of size $N=202$ from a normal distribution with mean 0 and standard deviation 0.001 and constructed $\varvec{y}^*_j=\varvec{y}_j+\varvec{\epsilon }$, $j=1,\ldots ,4$. Since all entries in $\varvec{y}^*_j $ are unique, the rank ordering of vector $\varvec{x}$ with no ties is estimated from $\varvec{y}^*_j$, $j=1,\ldots ,4$.
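The tie-breaking step can be illustrated with a short sketch. The function name, the seed, and the toy data are hypothetical; only the normal perturbation with mean 0 and standard deviation 0.001 comes from the text. The perturbation is small relative to the spacing of distinct values, so it changes the ordering only among tied units.

```python
import random

def jittered_ranks(values, sd=0.001, seed=0):
    """Break ties with small normal noise, then return ranks 1..N.

    Rank 1 is assigned to the smallest perturbed value.
    """
    rng = random.Random(seed)
    jittered = [v + rng.gauss(0.0, sd) for v in values]
    order = sorted(range(len(values)), key=lambda i: jittered[i])
    ranks = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        ranks[idx] = rank
    return ranks

# Toy auxiliary variable with ties, in the spirit of group size Y_2.
group_size = [1, 2, 2, 3, 1]
print(jittered_ranks(group_size))
```

The order within each tied group depends on the random perturbation, so the resulting ranks are one of several equally valid orderings.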

By treating these 202 farms as a finite population, we performed another simulation study to investigate the biases and efficiencies of $\hat{\mu }_2^{(2)}$ and $\hat{\mu }_3^{(2)}$. In this part of the simulation, we also included the ratio ($\hat{\mu }_\mathrm{Ra}$) and regression ($\hat{\mu }_\mathrm{Reg}$) estimators of the population mean:

$$\begin{aligned} \hat{\mu }_\mathrm{Ra}=\frac{1}{N}\frac{\sum _{j=1}^n x_j }{\sum _{j=1}^ns_j}\frac{N(N+1)}{2} \hbox { and } \hat{\mu }_\mathrm{Reg}=\hat{B}_0+\hat{B}_1(N+1)/2, \end{aligned}$$

where $\hat{B}_0$ and $\hat{B}_1$ are the estimated regression coefficients obtained by regressing $x_j$ on $s_j$. The simulation study also considered the bootstrap estimates of the standard deviations of the estimators and the coverage probabilities of the percentile confidence intervals. Samples in the simulation are selected with the following sample and set size combinations: $(n,H)=(20,10)$, $(30,6)$, and $(50,4)$. Simulation and bootstrap replication sizes are taken to be 3000 and 2000, respectively.
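Both benchmark estimators use the population ranks $s_j$ as the auxiliary variable, whose population mean $(N+1)/2$ is known. A minimal sketch follows, assuming the regression coefficients are ordinary least squares estimates, which the text does not state explicitly; the function names and toy data are hypothetical.

```python
def ratio_estimator(x, s, N):
    """(1/N) * (sum x / sum s) * N(N+1)/2, which simplifies to the line below."""
    return (sum(x) / sum(s)) * (N + 1) / 2

def regression_estimator(x, s, N):
    """Fitted value of the regression of x on s at the mean rank (N+1)/2.

    Assumption: B0 and B1 are ordinary least squares estimates.
    """
    n = len(x)
    x_bar = sum(x) / n
    s_bar = sum(s) / n
    s_ss = sum((si - s_bar) ** 2 for si in s)
    s_x = sum((si - s_bar) * (xi - x_bar) for si, xi in zip(s, x))
    b1 = s_x / s_ss
    b0 = x_bar - b1 * s_bar
    return b0 + b1 * (N + 1) / 2

# Toy data with N = 10: responses exactly linear in the ranks, x = 3 + 2 s.
ranks = [1, 4, 6, 9]
resp = [3 + 2 * s for s in ranks]
print(ratio_estimator(resp, ranks, N=10))
print(regression_estimator(resp, ranks, N=10))
```

The ratio estimator implicitly assumes proportionality between the response and its rank, while the regression estimator allows an intercept, so the two can differ noticeably when the relationship does not pass through the origin.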

Table 8 Biases (Bi) and relative efficiencies (R) of the estimators $\hat{\mu }_2^{(2)}$, $\hat{\mu }_3^{(2)}$, $\hat{\mu }_\mathrm{Ra}$, $\hat{\mu }_\mathrm{Reg}$, and SRS mean.


Table 8 presents the biases (Bi) of $\hat{\mu }_2^{(2)}$, $\hat{\mu }_3^{(2)}$, $\hat{\mu }_\mathrm{Ra}$, $\hat{\mu }_\mathrm{Reg}$, and the SRS mean. It also contains the relative efficiencies of the estimator $\hat{\mu }_3^{(2)}$ with respect to $\hat{\mu }_2^{(2)}$, the SRS sample mean, $\hat{\mu }_\mathrm{Ra}$, and $\hat{\mu }_\mathrm{Reg}$:

$$\begin{aligned} R_2^{(2)}= & {} \frac{\mathrm{MSE}(\hat{\mu }_2^{(2)})}{\mathrm{MSE}(\hat{\mu }_3^{(2)})}, \quad R_\mathrm{SRS}^{(2)}=\frac{\mathrm{MSE}(\mathrm{SRS})}{\mathrm{MSE}(\hat{\mu }_3^{(2)})}, \\ R_\mathrm{Ra}^{(2)}= & {} \frac{\mathrm{MSE}(\hat{\mu }_\mathrm{Ra})}{\mathrm{MSE}(\hat{\mu }_3^{(2)})} \, \hbox { and } \, R_\mathrm{Reg}^{(2)} = \frac{\mathrm{MSE}(\hat{\mu }_\mathrm{Reg})}{\mathrm{MSE}(\hat{\mu }_3^{(2)})}. \end{aligned}$$

Again, relative efficiencies greater than one indicate that the estimator $\hat{\mu }_3^{(2)}$ has the smaller mean square error.

Biases in Table 8 follow a pattern similar to the one observed in Tables 1 and 2. For large set sizes, the estimators have slightly larger negative biases than the SRS estimator. This can be anticipated from the fact that the proposed estimators borrow information from unmeasured units in a set of size H. For skewed distributions and large set sizes H, the effects of extreme observations in the sample are divided into H different strata. Hence, the influence of extreme observations on the estimator is reduced. For this reason, for skewed distributions, the estimators slightly underestimate the population mean.

The relative efficiencies also follow a pattern similar to that in Tables 3 and 4. The estimator $\hat{\mu }_3^{(2)}$ is the best estimator, and the efficiencies increase with the quality of ranking information, set size, and sample size. The efficiency gain with respect to the SRS sample mean, ratio, and regression estimators is substantial if the ranking information is reasonably accurate.

The same dataset is analyzed in Ozturk (2014c) using RSS design by combining ranking information from different sources. Sampling designs in this paper are different from the RSS designs used in Ozturk (2014c). Our designs select simple random samples and construct weights based on global ranking information in the entire population, whereas Ozturk (2014c) uses local ranking information in a given set of size H. Since we use global ranking information in the population, our estimator performs better than RSS designs in Ozturk (2014c).

Table 9 Coverage probabilities of bootstrap percentile confidence intervals of population mean based on estimators $\hat{\mu }_r^{(2)}$, $r=2,3$, and SRS mean.

Table 10 Bootstrap (B)- and simulation (S)-based estimates of standard deviation of the estimators, $\hat{\mu }_r^{(2)}$, $r=2,3$, and SRS mean.


Table 9 presents the coverage probabilities of the bootstrap percentile confidence intervals for the population mean based on estimators $\hat{\mu }_2^{(2)}$, $\hat{\mu }_3^{(2)}$, and the SRS mean. All these coverage probabilities appear to be close to the nominal value 0.95 for large sample sizes. Since the population is strongly skewed to the right, for small sample sizes the coverage probabilities are slightly smaller than the nominal coverage probability 0.95. Table 10 presents the standard error estimates of the estimators. Again, the bootstrap estimates of the standard deviations are close to the estimates of the standard deviations of the estimators obtained from simulation.

7 Concluding remark

We have developed three sampling designs to estimate the population total and mean in a finite population setting. The proposed sampling designs select a simple random sample and identify their population ranks. Selection of the sample could be either with- or without-replacement. Replacement policy and the way that population ranks of selected units are identified define the three different sampling designs: design-0, design-1, and design-2. In these designs, population ranks of the measured units provide information about the relative positions of the sample units in the population. This positional information is used to borrow additional information from other unmeasured units in the population to reduce the uncertainty in the sample. We introduced three different estimators for the population mean and total for each one of these sampling designs. We show that the estimators perform better than simple random sample estimator as long as there is meaningful ranking information to rank the population units. The efficiencies of the estimators are an increasing function of the correlation coefficient between the response and auxiliary variable.

The population ranks of the selected units can be considered as a covariate. These ranks are not strongly attached to the measured values. They can be ignored completely, and the sample can be analyzed as a simple random sample. Similar to a ranked set sample, where a strong tie is established between the rank and the measurement, it is possible to attach ranking information to the measured units strongly to induce further stratification in the data, but the resulting sample may not be reducible to a simple random sample. In this case, the properties of this sample need to be developed further. One of our current projects investigates this type of sampling design.

References

Al-Saleh, M. F. and Samawi, H. M. (2007). A note on inclusion probability in ranked set sampling and some of its variations. Test, 16, 198–209.
Booth, J. G., Butler, W. R. and Hall, P. (1994). Bootstrap methods for finite populations. Journal of the American Statistical Association, 89, 1282–1289.
Dell, T. R., and Clutter, J. L. (1972). Ranked set sampling theory with order statistics background. Biometrics, 28, 545–555.
Frey, J. (2011). Recursive computation of inclusion probabilities in ranked set sampling. Journal of Statistical Planning and Inference, 141, 3632–3639.
Frey, J. and Feeman, T. G. (2012). An improved mean estimator for judgment post-stratification. Computational Statistics and Data Analysis, 56, 418–426.
Frey, J. and Feeman, T. G. (2013). Variance estimation using judgment post-stratification. To appear, Annals of the Institute of Statistical Mathematics.
Frey, J. and Ozturk, O. (2011). Constrained estimation using judgment post-stratification. Annals of the Institute of Statistical Mathematics, 63, 769–789.
Gokpinar, F., and Ozdemir, Y. A. (2010). Generalization of inclusion probabilities in ranked set sampling. Hacettepe Journal of Mathematics and Statistics, 39, 89–95.
Horvitz, D. G., and Thompson, D. J. (1952). A generalization of sampling without replacement from a finite universe. Journal of the American Statistical Association, 47, 663–685.
Husby, C. E., Stasny, E. A., and Wolfe, D. A. (2005). An application of ranked set sampling for mean and median estimation using USDA crop production data. Journal of Agricultural, Biological, and Environmental Statistics, 10, 354–373.
Iwig, W. C. (1993). “The National Agricultural Statistics Service County Estimates Program” in Indirect Estimators in Federal Programs, Statistical Policy Working Paper 21, Report of the Federal Committee on Statistical Methodology, Subcommittee on Small Area Estimation, Washington DC, 7.1–7.15.
Jafari Jozani, M., and Johnson, B. C. (2011). Design based estimation for ranked set sampling in finite population. Environmental and Ecological Statistics, 18, 663–685.
Jafari Jozani, M., and Johnson, B. C. (2012). Randomized nomination sampling in finite populations. Journal of Statistical Planning and Inference, 142, 2103–2115.
MacEachern, S. N., Stasny, E. A., and Wolfe, D. A. (2004). Judgment post-stratification with imprecise ranking. Biometrics, 60, 207–215.
McIntyre, G. A. (1952). A method for unbiased selective sampling using ranked-sets. Australian Journal of Agricultural Research, 3, 385–390.
McIntyre, G. A. (2005). A method of unbiased selective sampling using ranked-sets. The American Statistician, 59, 230–232.
Ozdemir, Y. A., and Gokpinar, F. (2008). A new formula for inclusion probabilities in median ranked set sampling. Communications in Statistics – Theory and Methods, 37, 2022–2033.
Ozturk, O. and Jafari Jozani (2013). Inclusion probabilities in partially rank ordered set sampling. In review, Computational Statistics and Data Analysis.
Ozturk, O. (2014a). Statistical inference for population quantiles and variance in judgment post-stratified samples. Computational Statistics and Data Analysis, 77, 188–205.
Ozturk, O. (2014b). Distribution free two-sample methods for judgment-post stratified data. In review, Statistica Sinica.
Ozturk, O. (2014c). Estimation of population mean and total in finite population setting using multiple auxiliary variables. Journal of Agricultural, Biological, and Environmental Statistics, 19, 161–184.
Patil, G. P., Sinha, A. K. and Taillie, C. (1995). Finite population correction for ranked set sampling. Annals of the Institute of Statistical Mathematics, 47, 621–636.
Stokes, S. L., Wang, X. and Chen, M. (2007). Judgment post stratification with multiple rankers. Journal of Statistical Theory and Applications, 6, 344–359.
Wang, X., Lim, J., and Stokes, S. L. (2008). A nonparametric mean estimator for judgment post-stratified data. Biometrics, 64, 355–363.
Wang, X., Wang, K., and Lim, J. (2012). Isotonized CDF estimation from judgment post-stratification data with empty strata. Biometrics, 68, 194–202.
Wang, X., Stokes, L., Lim, J., and Chen, M. (2006). Concomitant of multivariate order statistics with application to judgment poststratification. Journal of the American Statistical Association, 101(476), 1693–1704.
Wolfe, D. A. (2012). Ranked set sampling: its relevance and impact on statistical inference. ISRN Probability and Statistics, doi:10.5402/2012/568385.

Acknowledgments

The author thanks Cheryl Turner at the USDA NASS Ohio Field Office for arranging to use the Ohio Estimation Data in this study and Christy Meyer at the USDA NASS Head, Census Data Section for providing feedback on an earlier version of this work.

Author information

Authors and Affiliations

Department of Statistics, The Ohio State University, Columbus, OH, 43210, USA
Omer Ozturk

Corresponding author

Correspondence to Omer Ozturk.

About this article

Cite this article

Ozturk, O. Estimation of a Finite Population Mean and Total Using Population Ranks of Sample Units. JABES 21, 181–202 (2016). https://doi.org/10.1007/s13253-015-0231-4

Received: 03 June 2015
Accepted: 23 September 2015
Published: 20 October 2015
Issue Date: March 2016
DOI: https://doi.org/10.1007/s13253-015-0231-4
