
1.1 Introduction

The theory of sample surveys aims at selecting a sample of units so as to represent a larger population. Although such techniques are now in routine use, in particular by national statistical agencies, the legitimacy of this approach has long been debated. The introduction of the representative method dates back to [42], who proposed producing estimates from a non-random, controlled sample of municipalities and individuals rather than from a census. It is truly with [46], however, that the basics of modern survey sampling were established. Neyman proposed a rigorous setup for random surveys, laying the foundations of probabilistic yet controlled sampling, which makes it possible to statistically control the accuracy of estimators.

For decades, many of the available data sets emerged from surveys, the largest ones being produced by the national statistical agencies. These samples of reasonably large size (a few thousand up to a few tens of thousands of units) aimed at shortening the production time of statistical information, at limited cost. For the past ten years, we have been facing a completely new situation with an explosion in the volume of generated data. It is estimated that 200,000 SMS messages are sent every second worldwide, that Facebook users generate four million likes every minute, and that in 2020 each person will produce 1.7 MB of data per second. Such data cannot be stored and processed exhaustively. The theory of sample surveys is therefore needed to select large but tractable samples, sometimes processed in real time, which faithfully represent these huge masses of data.

More than ever, we need to control the statistical properties of the estimators produced. The purpose of this chapter is therefore to provide an introduction to the estimation framework of survey sampling, by reviewing some of the classical sampling methods. For the sake of simplicity, we focus on direct sampling procedures, which may be seen as the basic building blocks needed to build a sampling design in practice. The statistical properties of the sampling methods presented are also discussed. For the reader interested in a more complete presentation of survey sampling tools, see [1, 16, 52, 58] and [28], for example.

This chapter is organized as follows: In Sect. 1.2, we present the survey sampling framework, and describe the classical Horvitz–Thompson estimator and the associated measures of accuracy. In Sect. 1.3, we focus on simple random sampling, for which we describe the statistical properties and give a number of possible sampling algorithms. In Sect. 1.4, we describe a number of classical unequal probability sampling algorithms, and we discuss their statistical properties in Sect. 1.5. We conclude in Sect. 1.6.

1.2 Notation and Estimation

In this section, we define our main notation and introduce the Horvitz–Thompson estimator which is of routine use in survey sampling. We then discuss variance computation and variance estimation for the Horvitz–Thompson estimator.

1.2.1 Notation

We are interested in a finite population U N consisting of N statistical units, which are supposed to be easily identified by a label. Therefore, it is common practice to make no distinction between a unit and its label, and we simply write the population as

$$\displaystyle \begin{aligned} \begin{array}{rcl} {} U_N &\displaystyle = &\displaystyle \{ 1,\ldots,N \} \end{array} \end{aligned} $$
(1.1)

We are interested in some quantitative variable of interest y, taking the value y k on unit k.

We suppose that the population of interest U N is embedded into a nested sequence {U N} of finite populations with increasing sizes, and all limiting processes will be taken as N →∞. This is essentially the asymptotic framework of [41], which is often used to study the asymptotic properties of a sampling design and of the assorted estimators. Also, this is a natural framework if we are interested in a population which is growing over time, for example, if we wish to select a sample in a data stream.

A without-replacement sampling design p N(⋅) is a probability distribution on the subsets in U N, namely

$$\displaystyle \begin{aligned} \begin{array}{rcl} {} \forall s \subset U_N \quad p_N(s) \geq 0 &\displaystyle \mathrm{and} &\displaystyle \sum_{s \subset U_N} p_N(s)=1 \end{array} \end{aligned} $$
(1.2)

It enables selecting the random sample S N of units used for estimation, in the sense that Pr(S N = s) = p N(s). Once the sampling design is defined, we need to choose a sampling algorithm, which is an effective procedure for the selection of the sample. For a given sampling design, there is usually a variety of sampling algorithms possible [58], see Sect. 1.3 for an illustration on simple random sampling.

The quantity π k|N ≡ Pr(k ∈ S N) for unit k to be selected is called the first-order inclusion probability. The π k|N’s are involved in the computation of estimators, and their sum

$$\displaystyle \begin{aligned} \begin{array}{rcl} {} \sum_{k \in U_N} \pi_{k|N} &\displaystyle \equiv &\displaystyle n \end{array} \end{aligned} $$
(1.3)

gives the average sample size. The probability π kl|N ≡ Pr(k, l ∈ S N) for units k and l to be jointly selected in S N is called the second-order inclusion probability. The π kl|N’s are involved in the computation of the variance of estimators. For a given set of first-order inclusion probabilities π k|N, k ∈ U N, the second-order inclusion probabilities depend on the design used for the selection of the sample.

1.2.2 Horvitz–Thompson Estimator

The Horvitz and Thompson [39] estimator (HT) of the total \(t_{yN}=\sum _{k \in U_N} y_k\) is

$$\displaystyle \begin{aligned} \begin{array}{rcl} {} \hat{t}_{y\pi} &\displaystyle = &\displaystyle \sum_{k \in S_N} \frac{y_k}{\pi_{k|N}} = \sum_{k \in U_N} I_{kN} \frac{y_k}{\pi_{k|N}} \end{array} \end{aligned} $$
(1.4)

with I kN the sample membership indicator of unit k. We denote by

$$\displaystyle \begin{aligned} \begin{array}{rcl} {} I_N &\displaystyle = &\displaystyle \left(I_{1N},\ldots,I_{kN},\ldots,I_{NN}\right) \end{array} \end{aligned} $$
(1.5)

the vector of sample membership indicators. If all the π k|N’s are positive, which is assumed in the rest of this chapter, there is no selection bias. In that case, the HT-estimator is design-unbiased for t yN, i.e., unbiased under the randomization associated with the sampling design. Remarkably, this property is completely model-free: it holds for any variable of interest, without requiring any model assumptions.

There is no severe loss of generality in focusing on the total t yN, since many other parameters of interest can be written as smooth functions of totals. Such parameters are therefore easily estimated by the plug-in principle once an estimator of a total is available, see [20]. For example, the population mean is \(\mu _{yN}=N^{-1} \sum _{k \in U_N} y_k\), and is estimated by

$$\displaystyle \begin{aligned} \begin{array}{rcl} {} \hat{\mu}_{y\pi} &\displaystyle = &\displaystyle \frac{\hat{t}_{y\pi}}{\hat{N}_{\pi}} \end{array} \end{aligned} $$
(1.6)

where \(\hat {N}_{\pi }=\sum _{k \in S_N} \frac {1}{\pi _{k|N}}\) is the HT-estimator of the population size N. Similarly, the population distribution function for some real number t is \(F_{yN}(t)=N^{-1} \sum _{k \in U_N} 1(y_k \leq t)\), with 1(⋅) the indicator function. The plug-in estimator of F yN(t) is

$$\displaystyle \begin{aligned} \begin{array}{rcl} {} \hat{F}_{y\pi} &\displaystyle = &\displaystyle \frac{1}{\hat{N}_{\pi}} \sum_{k \in S_N} \frac{1(y_k \leq t)}{\pi_{k|N}} \end{array} \end{aligned} $$
(1.7)
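As an illustration, the following minimal Python sketch computes the HT-estimator of the total (1.4) and the plug-in estimators (1.6) and (1.7) from the sampled values and their first-order inclusion probabilities; the sample data used below are hypothetical.

```python
import numpy as np

def ht_total(y_s, pi_s):
    """Horvitz-Thompson estimator of the total t_y, Eq. (1.4)."""
    return np.sum(y_s / pi_s)

def ht_mean(y_s, pi_s):
    """Plug-in estimator of the population mean, Eq. (1.6)."""
    N_hat = np.sum(1.0 / pi_s)          # HT-estimator of the population size
    return ht_total(y_s, pi_s) / N_hat

def ht_cdf(y_s, pi_s, t):
    """Plug-in estimator of the distribution function F_yN(t), Eq. (1.7)."""
    N_hat = np.sum(1.0 / pi_s)
    return np.sum((y_s <= t) / pi_s) / N_hat

# Hypothetical sampled values and their first-order inclusion probabilities
y_s = np.array([12.0, 7.5, 9.0])
pi_s = np.array([0.75, 0.50, 0.25])
print(ht_total(y_s, pi_s), ht_mean(y_s, pi_s), ht_cdf(y_s, pi_s, 9.0))
```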

1.2.3 Variance Computation

The variance of the HT-estimator is

$$\displaystyle \begin{aligned} V\left(\hat{t}_{y\pi}\right) &= \sum_{k \in U_N} \pi_{k|N}(1-\pi_{k|N}) \left(\frac{y_k}{\pi_{k|N}}\right)^2\notag\\ &\quad + \sum_{k \neq l \in U_N} (\pi_{kl|N}-\pi_{k|N} \pi_{l|N}) \frac{y_k}{\pi_{k|N}} \frac{y_l}{\pi_{l|N}} \end{aligned} $$
(1.8)

In Eq. (1.8), the first term on the right-hand side is the variance we would obtain if the units in the population were selected independently, a design known as Poisson sampling (see Sect. 1.4.1).

A gain in efficiency can be obtained by using fixed-size sampling designs, which are such that only the subsets s of U N of size n have positive selection probabilities p N(s). Many fixed-size sampling designs satisfy the so-called Sen–Yates–Grundy conditions:

$$\displaystyle \begin{aligned} \begin{array}{rcl} {} \pi_{kl|N} &\displaystyle \leq &\displaystyle \pi_{k|N} \pi_{l|N} \mathrm{~for~any~} k \neq l \in U_N \end{array} \end{aligned} $$
(1.9)

Under Eq. (1.9), the second term on the right-hand side of (1.8) is non-positive for a non-negative variable of interest y k, resulting in a variance reduction as compared to Poisson sampling.

For fixed-size sampling designs, the variance of the HT-estimator may be rewritten as

$$\displaystyle \begin{aligned} \begin{array}{rcl} {} V\left(\hat{t}_{y\pi}\right) &\displaystyle = &\displaystyle \frac{1}{2} \sum_{k \neq l \in U_N} (\pi_{k|N} \pi_{l|N} - \pi_{kl|N}) \left(\frac{y_k}{\pi_{k|N}}-\frac{y_l}{\pi_{l|N}}\right)^2 \end{array} \end{aligned} $$
(1.10)

which is known as the Sen–Yates–Grundy (SYG) variance formula [53, 62].

Multinomial sampling (a.k.a. sampling with replacement) is an important benchmark for a sampling design p N(⋅). It consists in n independent draws in the population with probabilities p k|N = n −1 π k|N at each draw. Under multinomial sampling, t yN is unbiasedly estimated by the Hansen–Hurwitz estimator

$$\displaystyle \begin{aligned} \begin{array}{rcl} {} \hat{t}_{yHH} &\displaystyle = &\displaystyle \sum_{k \in U_N} W_{kN} \frac{y_k}{\pi_{k|N}} \end{array} \end{aligned} $$
(1.11)

with W kN the number of selections for unit k, see [35]. The variance is

$$\displaystyle \begin{aligned} \begin{array}{rcl} {} V\left(\hat{t}_{yHH}\right) &\displaystyle = &\displaystyle \sum_{k \in U_N} \pi_{k|N} \left(\frac{y_k}{\pi_{k|N}}-\frac{t_{yN}}{n} \right)^2 \end{array} \end{aligned} $$
(1.12)

The sampling design p N(⋅) is more efficient than multinomial sampling if

$$\displaystyle \begin{aligned} \begin{array}{rcl} {} V\left(\hat{t}_{y\pi}\right) &\displaystyle \leq &\displaystyle V\left(\hat{t}_{yHH}\right) \mathrm{~for~any~variable~of~interest~} y \end{array} \end{aligned} $$
(1.13)

Some sufficient conditions for Eq. (1.13) are listed in [29].
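A minimal sketch of multinomial sampling and of the Hansen–Hurwitz estimator (1.11), assuming the drawing probabilities p k|N = n −1 π k|N at each draw; the population values and inclusion probabilities used below are hypothetical.

```python
import numpy as np

def hansen_hurwitz(y, pi, n, rng):
    """Multinomial sampling: n independent draws with probabilities
    p_k = pi_k / n, followed by the Hansen-Hurwitz estimator, Eq. (1.11)."""
    p = pi / n                              # drawing probabilities at each draw
    W = rng.multinomial(n, p)               # W_k = number of selections of unit k
    return np.sum(W * y / pi)

# Hypothetical population values and inclusion probabilities summing to n = 3
rng = np.random.default_rng(42)
y = np.array([10.0, 4.0, 6.0, 3.0, 8.0, 12.0])
pi = np.array([0.75, 0.25, 0.50, 0.25, 0.50, 0.75])
print(hansen_hurwitz(y, pi, 3, rng))
```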

1.2.4 Variance Estimation

For any sampling design, the variance of the HT-estimator may be estimated by the HT-variance estimator

$$\displaystyle \begin{aligned} \hat{V}_{HT}\left(\hat{t}_{y\pi}\right) &= \sum_{k \in S_N} (1-\pi_{k|N}) \left(\frac{y_k}{\pi_{k|N}}\right)^2\notag\\ &\quad + \sum_{k \neq l \in S_N} \frac{\pi_{kl|N}-\pi_{k|N} \pi_{l|N}}{\pi_{kl|N}} \frac{y_k}{\pi_{k|N}} \frac{y_l}{\pi_{l|N}} \end{aligned} $$
(1.14)

For a fixed-size sampling design, we may alternatively use the SYG-variance estimator

$$\displaystyle \begin{aligned} \begin{array}{rcl} {} \hat{V}_{SYG}\left(\hat{t}_{y\pi}\right) &\displaystyle = &\displaystyle \frac{1}{2} \sum_{k \neq l \in S_N} \frac{\pi_{k|N} \pi_{l|N} - \pi_{kl|N}}{\pi_{kl|N}} \left(\frac{y_k}{\pi_{k|N}}-\frac{y_l}{\pi_{l|N}}\right)^2 \end{array} \end{aligned} $$
(1.15)

These two variance estimators usually take different values.

Three properties related to the second-order inclusion probabilities are particularly useful for variance estimation. First and obviously, the π kl|N’s need to be calculable. Second, the two variance estimators are design-unbiased if and only if all the π kl|N’s are positive. Finally, for a fixed-size sampling design, the variance estimator \(\hat {V}_{SYG}\left (\hat {t}_{y\pi }\right )\) is non-negative for any variable of interest if and only if the SYG conditions given in (1.9) are respected.
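For illustration, the two variance estimators (1.14) and (1.15) may be computed as follows, assuming the first- and second-order inclusion probabilities of the sampled units are available; the sketch is not tied to any particular design, and the example data are hypothetical.

```python
import numpy as np

def var_ht(y_s, pi_s, pi2_s):
    """HT-variance estimator, Eq. (1.14); pi2_s[k, l] is the second-order
    inclusion probability of sampled units k and l (pi2_s[k, k] = pi_k)."""
    z = y_s / pi_s
    v = np.sum((1.0 - pi_s) * z ** 2)
    m = len(y_s)
    for k in range(m):
        for l in range(m):
            if k != l:
                v += (pi2_s[k, l] - pi_s[k] * pi_s[l]) / pi2_s[k, l] * z[k] * z[l]
    return v

def var_syg(y_s, pi_s, pi2_s):
    """Sen-Yates-Grundy variance estimator, Eq. (1.15), for fixed-size designs."""
    z = y_s / pi_s
    m = len(y_s)
    v = 0.0
    for k in range(m):
        for l in range(m):
            if k != l:
                v += (pi_s[k] * pi_s[l] - pi2_s[k, l]) / pi2_s[k, l] * (z[k] - z[l]) ** 2
    return 0.5 * v

# Example under SI, where pi_k = n/N and pi_kl = n(n-1)/(N(N-1)): the two
# estimators coincide, as noted in Sect. 1.3.1
n, N = 3, 6
y_s = np.array([4.0, 7.0, 10.0])
pi_s = np.full(n, n / N)
pi2_s = np.full((n, n), n * (n - 1) / (N * (N - 1)))
np.fill_diagonal(pi2_s, n / N)
print(var_ht(y_s, pi_s, pi2_s), var_syg(y_s, pi_s, pi2_s))
```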

1.3 Simple Random Sampling

When defining the sampling design p N(⋅), it may happen that no particular information is known about the population U N, or that such information is irrelevant for our target of inference. In such a case, we have no reasonable choice other than giving the units equal inclusion probabilities. In view of Eq. (1.3), this leads to

$$\displaystyle \begin{aligned} \begin{array}{rcl} {} \pi_{k|N} &\displaystyle = &\displaystyle \frac{n}{N} \end{array} \end{aligned} $$
(1.16)

For example, equal inclusion probabilities are obtained under simple random sampling.

1.3.1 Definition

Simple random sampling without replacement (SI) is the fixed-size sampling design which gives to any subset s ⊂ U N of size n the same selection probability, namely

$$\displaystyle \begin{aligned} \begin{array}{rcl} {} p(s) & = & \left\{\begin{array}{cc} \frac{1}{{N \choose n}} & \mathrm{ if~} Card(s)=n, \\ 0 & \mathrm{otherwise} \end{array} \right. \end{array} \end{aligned} $$
(1.17)

Under SI, the HT-estimator simplifies as

$$\displaystyle \begin{aligned} \begin{array}{rcl} {} \hat{t}_{y\pi} = N \bar{y}_N &\displaystyle \mathrm{with} &\displaystyle \bar{y}_N = \frac{1}{n} \sum_{k \in S_N} y_k \mathrm{~the~sample~mean} \end{array} \end{aligned} $$
(1.18)

Also, the variance of the HT-estimator given in (1.8) simplifies as

$$\displaystyle \begin{aligned} \begin{array}{rcl} {} V\left(\hat{t}_{y\pi}\right) &\displaystyle = &\displaystyle N^2 \left(\frac{1}{n}-\frac{1}{N}\right) S_{yN}^2 \\ \mathrm{with~} S_{yN}^2 &\displaystyle = &\displaystyle \frac{1}{N-1} \sum_{k \in U_N} (y_k-\mu_{yN})^2 \mathrm{~the~population~dispersion} \end{array} \end{aligned} $$
(1.19)

In the particular case of SI, the HT-variance estimator and the SYG-variance estimator are identical, and equal to

$$\displaystyle \begin{aligned} \begin{array}{rcl} {} \hat{V}_{HT}\left(\hat{t}_{y\pi}\right) &\displaystyle = &\displaystyle N^2 \left(\frac{1}{n}-\frac{1}{N}\right) s_{yN}^2 \\ \mathrm{with~} s_{yN}^2 &\displaystyle = &\displaystyle \frac{1}{n-1} \sum_{k \in S_N} (y_k-\bar{y}_{N})^2 \mathrm{~the~sample~dispersion} \end{array} \end{aligned} $$
(1.20)

Among the large number of sampling algorithms proposed for SI, see Sect. 4.4 in [58] for a review, we consider three important implementations.

1.3.2 Draw by Draw Sampling Algorithm

The draw by draw procedure given in Algorithm 1 consists in successively drawing n units with equal probabilities, among the units remaining after the last draw. The procedure is computationally very intensive, since n readings of the file are necessary for sample selection. This is therefore not usable for sampling in large populations.

Algorithm 1 Draw by draw procedure for simple random sampling

  • Initialize with t = 1 and U N(1) = U N.

  • For t = 1, …, n:

    • Select some unit (k, say) from U N(t) with equal probabilities.

    • Take U N(t + 1) = U N(t) ∖{k}.
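A minimal Python sketch of this draw by draw procedure; under SI, the resulting sample may then be used with the estimators (1.18) and (1.20).

```python
import random

def si_draw_by_draw(N, n, rng=random):
    """Draw by draw simple random sampling: n successive draws with equal
    probabilities among the units not yet selected (Algorithm 1)."""
    remaining = list(range(1, N + 1))   # U_N(1) = {1, ..., N}
    sample = []
    for _ in range(n):
        k = rng.choice(remaining)       # equal probabilities among remaining units
        sample.append(k)
        remaining.remove(k)             # U_N(t + 1) = U_N(t) \ {k}
    return sorted(sample)

print(si_draw_by_draw(6, 3))
```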

1.3.3 Selection–Rejection Method

The selection–rejection method [26] presented in Algorithm 2 requires only one pass of the file. We write \(\mathcal {F}_{t-1}\) for the σ-field generated by the random variables up to step t − 1: \(\pi _{t|\mathcal {F}_{t-1}}\) is therefore the inclusion probability conditionally on selection steps 1, …, t − 1, which in the case of SI is simply the sampling rate among the remaining units. Algorithm 2 is a fast implementation of SI, but improvements are possible, see, for example, [21] for a review.

Algorithm 2 Selection–rejection procedure for simple random sampling

We consider an illustration of the selection–rejection procedure on a small population of size N = 6, presented in Table 1.1. We wish to select a sample of size n = 3, and the inclusion probabilities are therefore π k|N = 0.5. We write j(t) for the number of units selected after step t. Since u 1 > π 1|N, unit 1 is not selected in the sample. We have j(1) = 0, and the conditional inclusion probability of units k > 1 is therefore \(\pi _{k|\mathcal {F}_{1}}=3/5=0.60\). Since \(u_2 \leq \pi _{2|\mathcal {F}_{1}}\), unit 2 is selected in the sample and j(2) = 1. The conditional inclusion probability of units k > 2 is therefore \(\pi _{k|\mathcal {F}_{2}}=2/4=0.50\). The final sample is S N = {2, 4, 5}, see Table 1.1.

Table 1.1 Example of selection by the selection–rejection method
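A minimal sketch of the selection–rejection procedure as illustrated above: unit t is retained with the conditional probability (n − j(t − 1))∕(N − t + 1), the sampling rate among the remaining units.

```python
import random

def si_selection_rejection(N, n, rng=random):
    """Selection-rejection procedure for SI: one sequential pass over the file,
    selecting unit t with probability (n - j) / (N - t + 1), with j the number
    of units already selected."""
    sample, j = [], 0
    for t in range(1, N + 1):
        if rng.random() <= (n - j) / (N - t + 1):  # conditional inclusion probability
            sample.append(t)
            j += 1
        if j == n:                                 # sample complete, remaining units rejected
            break
    return sample

print(si_selection_rejection(6, 3))
```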

1.3.4 Reservoir Procedure

The selection–rejection procedure requires knowing the size N of the population to compute the inclusion probabilities. It is therefore not feasible when sampling in a data stream, where the population size is not fixed but grows over time. The reservoir procedure [45] presented in Algorithm 3 makes it possible to select an SI sample without knowing the population size N in advance. The principle is to maintain at any time t a reservoir S t, which is in fact an SI sample of n units selected in U t = {1, …, t}. We consider in Sect. 1.4.5 a generalization to unequal probability sampling, called Chao’s procedure [9].

Algorithm 3 Reservoir procedure for simple random sampling

  • Initialize with t = n, and with S n = {1, …, n}.

  • For t = n + 1, …, N, do:

    • Generate a random number u t according to a uniform distribution.

    • If u t ≤ n∕t, remove one unit (k, say) from S t−1 with equal probabilities. Take S t = S t−1 ∪{t}∖{k}.

    • Otherwise, take S t = S t−1.
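A minimal Python sketch of the reservoir procedure, usable on a stream whose final size N need not be known in advance.

```python
import random

def si_reservoir(stream, n, rng=random):
    """Reservoir procedure for SI: maintains at any time t an SI sample of
    size n from the first t units of the stream (Algorithm 3)."""
    reservoir = []
    for t, unit in enumerate(stream, start=1):
        if t <= n:
            reservoir.append(unit)              # S_n = {1, ..., n}
        elif rng.random() <= n / t:             # unit t enters with probability n/t
            k = rng.randrange(n)                # a current member, removed uniformly
            reservoir[k] = unit
    return reservoir

print(si_reservoir(range(1, 7), 3))
```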

We consider an illustration of the reservoir procedure on the population presented in Table 1.1. We initialize with S 3 = {1, 2, 3}. At Step 4, u 4 ≤ 3∕4 = 0.75, and unit 4 is therefore selected in the reservoir, while one unit of S 3 (2, say) is randomly removed. At Step 5, u 5 ≤ 3∕5 = 0.60, and unit 5 is therefore selected in the reservoir, while one unit of S 4 (4, say) is randomly removed. At Step 6, u 6 > 3∕6 = 0.50, and unit 6 is therefore not selected. The final sample is S N = {1, 3, 5}, see Table 1.2.

Table 1.2 Example of selection by the reservoir procedure

1.3.5 Stratification

It is often possible to have access at the sampling stage to a q-vector x k of auxiliary variables, known for any unit in the population. Rather than directly selecting an SI sample, we can use x k to partition the population into groups. Sub-samples are then independently selected within each of these groups by SI. The groups are called strata, and the sampling design is called stratified simple random sampling (STSI).

We denote by U Nh, h = 1, …, H, the strata, of sizes N h, in which SI samples S Nh are independently selected. The HT-estimator may be rewritten as

$$\displaystyle \begin{aligned} \begin{array}{rcl} {} \hat{t}_{y\pi} = \sum_{h=1}^H N_h \bar{y}_{Nh} \mathrm{~with~} \bar{y}_{Nh} = \frac{1}{n_h} \sum_{k \in S_{Nh}} y_k \mathrm{~the~sample~mean~in~} S_{Nh} \end{array} \end{aligned} $$
(1.22)

The variance of the HT-estimator may be rewritten as

$$\displaystyle \begin{aligned} \begin{array}{rcl} {} V\left(\hat{t}_{y\pi}\right) &\displaystyle = &\displaystyle \sum_{h=1}^H N_h^2 \left(\frac{1}{n_h}-\frac{1}{N_h}\right) S_{yNh}^2 \end{array} \end{aligned} $$
(1.23)

with \(S_{yNh}^2\) the dispersion within the stratum U Nh. The comparison between Eqs. (1.19) and (1.23) shows that stratification reduces the variance by dropping the dispersion between strata, leaving only the dispersion within strata. STSI is therefore particularly efficient if the strata are homogeneous with respect to the variable of interest. A further gain in efficiency can be obtained by adequately allocating the global sample size among the strata [46].

Stratification is of routine use in surveys, and may result in large gains in efficiency. It is also useful if we wish to produce estimates for sub-populations, and if these sub-populations may be used as strata. In this case, stratification makes it possible to control the sample size selected inside these sub-populations. Methods to define appropriate stratifications of populations are discussed in [37] and [38].
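As an illustration, the STSI estimator (1.22) and the associated variance estimator, obtained by applying the SI formulas within each stratum and summing over strata, may be sketched as follows; the strata and sampled values are hypothetical.

```python
import numpy as np

def stsi_estimates(strata):
    """HT-estimator and variance estimator under STSI: the SI formulas are
    applied within each stratum and summed over strata, Eqs. (1.22)-(1.23)."""
    t_hat, v_hat = 0.0, 0.0
    for N_h, y_sh in strata:
        n_h = len(y_sh)
        t_hat += N_h * np.mean(y_sh)                                    # stratum total estimate
        v_hat += N_h ** 2 * (1 / n_h - 1 / N_h) * np.var(y_sh, ddof=1)  # stratum variance term
    return t_hat, v_hat

# Hypothetical strata: (stratum size N_h, sampled y-values in the stratum)
strata = [(40, np.array([2.0, 3.0, 2.5])), (60, np.array([10.0, 12.0, 9.0, 11.0]))]
print(stsi_estimates(strata))
```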

1.4 Sampling with Unequal Probabilities

Suppose that a positive auxiliary variable x 0k is known for any unit in the population, and that this variable is positively correlated to y k. For a fixed-size sampling design, it follows from Eq. (1.10) that the variance of the HT-estimator is small if π k|N is roughly proportional to y k. Since y k is unknown at the sampling stage, this suggests defining the inclusion probabilities proportionally to x 0k. This leads to probability proportional to size (π-ps) sampling, and the inclusion probabilities are

$$\displaystyle \begin{aligned} \begin{array}{rcl} {} \pi_{k|N} &\displaystyle = &\displaystyle n \frac{x_{0k}}{\sum_{l \in U_N} x_{0l}} \end{array} \end{aligned} $$
(1.24)

Equation (1.24) may lead to inclusion probabilities greater than 1 for units with large values of x 0k. In such a case, these inclusion probabilities are set to 1, and the other probabilities are recomputed until none of them exceeds 1.

For example, imagine that we wish to select a sample of size n = 5 in the population presented in Table 1.1, with probabilities proportional to x 0k. Applying Eq. (1.24) leads to

$$\displaystyle \begin{aligned} \begin{array}{rcl} \pi_{1|N}=\pi_{6|N}=1.25 &\displaystyle \pi_{2|N}=\pi_{4|N}=0.42 &\displaystyle \pi_{3|N}=\pi_{5|N}=0.83 \end{array} \end{aligned} $$

We set π 1|N = π 6|N = 1, and recompute the other probabilities on the remaining population, i.e.,

$$\displaystyle \begin{aligned} \begin{array}{rcl} \pi_{k|N} &\displaystyle = &\displaystyle (n-2) \frac{x_{0k}}{\sum_{l \in U \setminus \{1,6\}} x_{0l}} \mathrm{~for~} k \in U_N \setminus \{1,6\} \end{array} \end{aligned} $$

The final inclusion probabilities are

$$\displaystyle \begin{aligned} \begin{array}{rcl} \pi_{1|N}=\pi_{3|N}=\pi_{5|N}=\pi_{6|N}=1, &\displaystyle \pi_{2|N}=\pi_{4|N}=0.5 \end{array} \end{aligned} $$
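This recomputation may be sketched as follows: probabilities exceeding 1 are set to 1, and the remaining probabilities are rescaled on the residual sample size, iterating until none exceeds 1. The size values used below are hypothetical, chosen to be consistent with the inclusion probabilities of the example above.

```python
import numpy as np

def pps_inclusion_probabilities(x, n):
    """Inclusion probabilities proportional to x (Eq. (1.24)), with iterative
    capping at 1 of the probabilities exceeding 1."""
    x = np.asarray(x, dtype=float)
    pi = np.ones(len(x))
    free = np.ones(len(x), dtype=bool)          # units not (yet) taken with certainty
    while True:
        n_free = n - np.sum(~free)              # residual sample size
        pi[free] = n_free * x[free] / np.sum(x[free])
        over = free & (pi > 1)
        if not over.any():
            return pi
        pi[over] = 1.0                          # these units are selected with certainty
        free[over] = False

# Sizes consistent with the example above (n = 5): returns (1, 0.5, 1, 0.5, 1, 1)
print(pps_inclusion_probabilities([3, 1, 2, 1, 2, 3], 5))
```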

In the rest of this section, we present some important sampling designs with unequal inclusion probabilities. A more comprehensive treatment of unequal probability sampling procedures may be found in [8] and [58].

1.4.1 Poisson Sampling

Poisson sampling, which is described in Algorithm 4, is a very simple method for π-ps sampling. The units are selected through a series of N independent heads-or-tails experiments. In the particular case when all the units have the same inclusion probability π k|N = n∕N, the algorithm is called Bernoulli sampling.

Algorithm 4 Poisson sampling

  • At step t = 1, …, N, do:

    • Generate a random number u t according to a uniform distribution, independently of u 1, …, u t−1.

    • The unit t is selected if u t ≤ π t|N.
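A minimal Python sketch of Poisson sampling, with hypothetical inclusion probabilities.

```python
import numpy as np

def poisson_sample(pi, rng):
    """Poisson sampling: unit t is selected if u_t <= pi_t, with the u_t
    independent uniform random numbers (Algorithm 4)."""
    u = rng.random(len(pi))
    return np.flatnonzero(u <= pi) + 1          # labels of the selected units

rng = np.random.default_rng(123)
pi = np.array([0.75, 0.25, 0.50, 0.25, 0.50, 0.75])   # hypothetical pi-ps probabilities
print(poisson_sample(pi, rng))
```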

For illustration, we consider the population in Table 1.1. The inclusion probabilities are defined proportionally to x 0k. All units are given independent random numbers u k generated from a uniform distribution. Comparing them with the π k|N’s leads to the sample S N = {2, 4, 5, 6}, see Table 1.3.

Table 1.3 Example of selection by Poisson sampling

The simplicity of Poisson sampling makes it attractive for coordination sampling, i.e., when we wish to select several samples with controlled overlap. Positive coordination targets a maximum overlap: this is useful in repeated surveys, where we are interested in comparisons over time. Negative coordination targets a minimum overlap: this is useful to minimize the response burden, and is currently used in business surveys, for example. For a good overview of sample coordination methods, see, for example, [25, 49], and [31].

1.4.2 Rejective Sampling

Since the units are independently selected under Poisson sampling, the sample size is random. In the example given in Table 1.3, we targeted an average sample size of n = 3, but the size of the sample finally selected is 4. This random sample size results in a greater variability of estimators. To circumvent this problem, [34] introduced rejective sampling (a.k.a. conditional Poisson sampling) which is presented in Algorithm 5. It consists in repeatedly performing Poisson sampling, with a basic set of inclusion probabilities π bk|N, until the target sample size is attained. This is a fixed-size sampling design by construction. In the particular case of equal inclusion probabilities, rejective sampling is equivalent to SI.

Due to the rejection of some samples, the final inclusion probability π k|N differs from the basic inclusion probability π bk|N. More precisely, denoting by S Np the Poisson random sample and by S Nr the rejective random sample, the final inclusion probabilities are

$$\displaystyle \begin{aligned} \begin{array}{rcl} \pi_{k|N} \equiv Pr(k \in S_{Nr}) = Pr(k \in S_{Np} | Card(S_{Np})=n) \end{array} \end{aligned} $$
(1.25)

[24] showed that for any prescribed inclusion probabilities π k|N, associated basic probabilities π bk|N always exist. Some efficient algorithms to compute these probabilities are also available, see [15] and [18]. Rejective sampling has been widely considered in the sampling literature, see, for example, [44] for a review of variance approximations, and [6] for approximations of inclusion probabilities up to any order.

Algorithm 5 Rejective sampling

  • Let π b|N = (π b1|N, …, π bN|N) denote a set of basic inclusion probabilities, with \(\sum _{k \in U_N} \pi _{bk|N}=n\).

  • Select a Poisson sample with inclusion probabilities π b|N.

  • We stop if the sample is of size n. Otherwise, we select another Poisson sample.
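A minimal Python sketch of rejective sampling, with hypothetical basic inclusion probabilities; recall that these differ from the final inclusion probabilities, see Eq. (1.25).

```python
import numpy as np

def rejective_sample(pi_b, n, rng, max_tries=100000):
    """Rejective (conditional Poisson) sampling: repeat Poisson sampling with
    basic probabilities pi_b until the selected sample has size exactly n."""
    for _ in range(max_tries):
        u = rng.random(len(pi_b))
        sample = np.flatnonzero(u <= pi_b) + 1
        if len(sample) == n:                    # accept; otherwise reject and redraw
            return sample
    raise RuntimeError("no sample of size n obtained")

rng = np.random.default_rng(7)
pi_b = np.array([0.75, 0.25, 0.50, 0.25, 0.50, 0.75])  # hypothetical basic probabilities
print(rejective_sample(pi_b, 3, rng))
```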

For illustration, we consider the population with inclusion probabilities π k|N given in Table 1.3. The associated basic inclusion probabilities π bk|N are given in Table 1.4. Using the first series of random numbers u 1k, we obtain a sample of size 4 which is therefore rejected. Using the second series of random numbers u 2k, we obtain the sample S N = {2, 4, 6} which is of appropriate size.

Table 1.4 Example of selection by rejective sampling

1.4.3 Systematic Sampling

Systematic sampling [43], which is presented in Algorithm 6, is a very popular method for π-ps sampling. It consists in ordering the units in the population, generating a random start to determine the first unit selected, and then making jumps of length 1 along the cumulated inclusion probabilities to determine the other sampled units. This sampling design is very simple, but it involves very little randomness, since a single random number suffices for the whole sample selection, and its statistical properties are therefore limited.

Algorithm 6 Systematic sampling

For illustration, we consider the population in Table 1.3, and denote by \(V_k=\sum _{l \leq k} \pi _{l|N}\) the cumulated inclusion probabilities, with V 0 = 0. We generate a random number u = 0.65 ∈ [V 0, V 1[, and unit 1 is therefore selected. By making jumps of length 1, we obtain the complete sample S N = {1, 4, 6} (Table 1.5).

Table 1.5 Example of selection by systematic sampling
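A minimal sketch of systematic π-ps sampling as illustrated above, assuming the units are kept in their frame order: with V k the cumulated inclusion probabilities, unit k is selected if one of the points u, u + 1, …, u + n − 1 falls in [V k−1, V k[.

```python
import numpy as np

def systematic_sample(pi, rng):
    """Systematic pi-ps sampling: a single uniform random start u, then jumps
    of length 1 along the cumulated inclusion probabilities V_k."""
    V = np.concatenate(([0.0], np.cumsum(pi)))  # V_0 = 0, ..., V_N = n
    n = int(round(V[-1]))
    u = rng.random()                            # random start in [0, 1[
    points = u + np.arange(n)                   # u, u + 1, ..., u + n - 1
    # unit k is selected if some point falls in [V_{k-1}, V_k[
    return np.searchsorted(V, points, side="right")

rng = np.random.default_rng(2024)
pi = np.array([0.75, 0.25, 0.50, 0.25, 0.50, 0.75])   # hypothetical pi-ps probabilities
print(systematic_sample(pi, rng))
```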

There is a huge body of literature on systematic sampling, see, for example, [40] for a critical review. Systematic sampling enables selecting samples which are well-spread, which is in particular useful in spatial sampling. The popular generalized random tessellation stratified (GRTS) sampling [56] consists in defining an order on the spatial units by tessellation, and then applying systematic sampling; see also [60] and [3] for a review of spatial sampling methods.

1.4.4 Pivotal Sampling

Pivotal sampling [22], which is presented in Algorithm 7, consists in a succession of duels between units. At each step, we consider the first two units remaining in the population. If the sum of their probabilities is lower than 1, one of them gets the sum of their probabilities, while the other is discarded. If the sum of their probabilities is greater than 1, one of the units is selected, while the other one gets the residual probability. Pivotal sampling is also known in computer science as the Srinivasan sampling procedure, see [55].

An illustration of sample selection is given in Table 1.6. At the first step, units 1 and 2 fight, which results in selecting unit 1 and in discarding unit 2. At the second step, units 3 and 4 fight, and unit 3 gets the cumulative probability, while unit 4 is discarded. At the third step, units 3 and 5 fight, and unit 5 is selected, while unit 3 gets the residual probability 0.5 + 0.75 − 1 = 0.25. At the fourth step, units 3 and 6 fight, and unit 6 is selected, while unit 3 is discarded. The final sample is S = {1, 5, 6}.

Table 1.6 Example of selection by pivotal sampling

Algorithm 7 Pivotal sampling
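A minimal sketch of the duel process described above, in a sequential version where each duel involves the first two units still carrying a residual probability in ]0, 1[.

```python
import random

def pivotal_sample(pi, rng=random):
    """Sequential pivotal sampling: successive duels between the first two
    units still carrying a residual probability."""
    sample = []
    carrier = None                               # (unit label, residual probability)
    for k, p in enumerate(pi, start=1):
        if carrier is None:
            carrier = (k, p)
            continue
        (i, a), (j, b) = carrier, (k, p)
        if a + b < 1:                            # one unit takes the sum, the other is discarded
            winner = i if rng.random() < a / (a + b) else j
            carrier = (winner, a + b)
        else:                                    # one unit is selected, the other keeps the residual
            if rng.random() < (1 - b) / (2 - a - b):
                sample.append(i)
                carrier = (j, a + b - 1)
            else:
                sample.append(j)
                carrier = (i, a + b - 1)
    if carrier is not None and rng.random() < carrier[1]:
        sample.append(carrier[0])                # leftover residual probability, if any
    return sorted(sample)

print(pivotal_sample([0.75, 0.25, 0.50, 0.25, 0.50, 0.75]))
```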

Like systematic sampling, pivotal sampling makes it possible to select samples which are well-spread over space. Since more randomization is introduced in the sampling process, the method possesses better statistical properties, see Sect. 1.5. It has therefore been proposed as an alternative for spatial sampling, see, for example, [30] and [13].

Pivotal sampling is a particular case of the cube method [23], which makes it possible to perform balanced sampling. If a q-vector x k of quantitative variables is known for any unit k ∈ U N at the design stage, a sampling design is balanced on x k if the random sample S N is such that

$$\displaystyle \begin{aligned} \begin{array}{rcl} {} \sum_{k \in U_N} I_{kN} \frac{x_k}{\pi_k} &\displaystyle = &\displaystyle \sum_{k \in U_N} x_k \end{array} \end{aligned} $$
(1.26)

with I kN the sample membership indicator for unit k. This means that the HT-estimator of \(t_{xN}=\sum _{k \in U_N} x_k\) exactly matches the known total, see [59] for a review on the cube method. Apart from the cube method, balanced samples may also be selected by rejective sampling. A sample is first selected by means of a basic sampling strategy. If Eq. (1.26) is approximately matched, we keep the sample. Otherwise, another candidate sample is selected by means of the basic sampling strategy [28, 34].
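A minimal sketch of this rejective approach to balanced sampling, assuming Poisson sampling conditioned on the sample size n as the basic strategy and a relative tolerance on the balancing constraints (1.26); the balancing variable and inclusion probabilities are hypothetical.

```python
import numpy as np

def balanced_by_rejection(x, pi, n, rng, tol=0.02, max_tries=100000):
    """Rejective balanced sampling: repeat the basic design (here, Poisson
    sampling kept only when its size equals n) until the HT-estimators of the
    x-totals are within a relative tolerance of the known totals, Eq. (1.26)."""
    t_x = x.sum(axis=0)                              # known totals of the balancing variables
    for _ in range(max_tries):
        s = np.flatnonzero(rng.random(len(pi)) <= pi)
        if len(s) != n:
            continue                                 # enforce the fixed sample size
        t_x_hat = (x[s] / pi[s, None]).sum(axis=0)   # HT-estimators of the x-totals
        if np.all(np.abs(t_x_hat - t_x) <= tol * np.abs(t_x)):
            return s + 1                             # labels of the selected units
    raise RuntimeError("no suitably balanced sample found")

rng = np.random.default_rng(0)
pi = np.array([0.75, 0.25, 0.50, 0.25, 0.50, 0.75])  # hypothetical inclusion probabilities
# Balancing on the size variable itself (x proportional to pi): the constraint
# is then met exactly by any fixed-size sample
x = np.array([[3.0], [1.0], [2.0], [1.0], [2.0], [3.0]])
print(balanced_by_rejection(x, pi, 3, rng))
```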

1.4.5 Chao’s Procedure

Rejective sampling requires that the inclusion probabilities π k|N are computable from the start of the algorithm. Poisson sampling, systematic sampling, and pivotal sampling require that the probability π k|N is computable, at least when unit k is considered for selection. In view of Eq. (1.24), we therefore need to know at each step the total \(t_{x_0N}=\sum _{k \in U_N} x_{0k}\) of the auxiliary variable on the entire population.

This is feasible if we have a sampling frame of the units in the population. However, if we wish to sample in a data stream, the units arrive successively and \(t_{x_0N}\) is not known in advance. Chao’s procedure [9], which is presented in Algorithm 8, makes it possible to perform π-ps sampling without a sampling frame. The presentation in Algorithm 8, due to [58], is somewhat simpler than the original algorithm.

Algorithm 8 Chao’s procedure

We consider an illustration of Chao’s procedure on the population presented in Table 1.1. We initialize with S 3 = {1, 2, 3}. At Step 4, u 4 ≤ π 4|4 = 0.5, and unit 4 is therefore selected in the reservoir, while one unit of S 3 is randomly removed with probabilities p k|4 (in this particular case, unit 2 is removed with certainty). At Step 5, u 5 ≤ π 5|5 = 0.67, and unit 5 is therefore selected in the reservoir, while one unit of S 4 (4, say) is randomly removed with probabilities p k|5. At Step 6, u 6 ≤ π 6|6 = 0.75, and unit 6 is therefore selected while one unit of S 5 (1, say) is randomly removed with probabilities p k|6. The final sample is S N = {3, 5, 6}, see Table 1.7.

Table 1.7 Example of selection by Chao’s procedure
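A minimal sketch of a Chao-type reservoir, restricted to the case where no inclusion probability needs to be capped at 1: at each step t, unit t enters the reservoir with probability π t|t = n x 0t∕∑ l≤t x 0l and, if it enters, one current member is removed with equal probabilities 1∕n. The general procedure of [9], which handles the capping illustrated above, requires adjusted removal probabilities p k|t; the stream below is hypothetical, with sizes chosen so that no capping is needed.

```python
import random

def chao_reservoir(stream, n, rng=random):
    """Chao-type pi-ps reservoir on a stream of (label, size) pairs, restricted
    to the case where n * x_t / X_t never exceeds 1: unit t enters the
    reservoir with probability n * x_t / X_t and, if so, one current member is
    removed with equal probabilities 1/n."""
    reservoir, X = [], 0.0
    for t, (label, x) in enumerate(stream, start=1):
        X += x                                    # running total of the size variable
        if t <= n:
            reservoir.append(label)               # initial reservoir S_n = {1, ..., n}
            continue
        pi_t = n * x / X                          # first-order inclusion probability of unit t
        assert pi_t <= 1, "capping at 1 is not handled in this sketch"
        if rng.random() <= pi_t:
            reservoir[rng.randrange(n)] = label   # uniform removal in the uncapped case
    return sorted(reservoir)

# Hypothetical stream of (label, size) pairs
stream = [(1, 1.0), (2, 1.0), (3, 1.0), (4, 1.2), (5, 1.5), (6, 2.0)]
print(chao_reservoir(stream, 3))
```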

1.5 Statistical Properties of a Sampling Design

The choice of using some specific sampling design is based on both practical and theoretical matters. To produce consistent estimators with appropriate confidence intervals, it is desirable that the three following statistical properties hold: (a) the HT-estimator is weakly consistent for the true total; (b) the HT-estimator satisfies a central-limit theorem; (c) a consistent variance estimator is available for the HT-estimator.

In order to study the statistical properties of the sampling designs presented in this chapter, we consider the following assumptions:

  • H1: There exist constants c 1, C 1 > 0 such that for any k ∈ U:

    $$\displaystyle \begin{aligned} \begin{array}{rcl} {} c_1 \frac{n}{N} &\displaystyle \leq \pi_{k} \leq &\displaystyle C_1 \frac{n}{N} \end{array} \end{aligned} $$
    (1.28)
  • H2: There exists some constant C 2 such that:

    $$\displaystyle \begin{aligned} \begin{array}{rcl} {} \frac{1}{N} \sum_{k \in U} y_k^4 &\displaystyle \leq &\displaystyle C_2 \end{array} \end{aligned} $$
    (1.29)
  • H3: There exists some constant c 3 > 0 such that:

    $$\displaystyle \begin{aligned} \begin{array}{rcl} {} c_3 \frac{N^2}{n} &\displaystyle \leq &\displaystyle V(\hat{t}_{y\pi}) \end{array} \end{aligned} $$
    (1.30)

Assumption (H1) states that the inclusion probabilities are not too variable, and do not depart much from those obtained when sampling with equal probabilities. This assumption is under the control of the survey sampler. It is assumed in (H2) that the variable of interest has a finite moment of order 4, and in (H3) that the variance of the HT-estimator is non-vanishing. These assumptions are fairly weak, although we may find situations under which they are not respected. For example, (H2) does not hold for heavily skewed populations where some units in the population have very large y k’s.

1.5.1 Weak Consistency of the HT-Estimator

The HT-estimator is weakly consistent for the total t y if

$$\displaystyle \begin{aligned} \begin{array}{rcl} N^{-1} \left(\hat{t}_{y\pi}-t_y\right) &\displaystyle \rightarrow_{Pr} &\displaystyle 0 \end{array} \end{aligned} $$
(1.31)

where →Pr stands for the convergence in probability. Under assumptions (H1) and (H2), this property holds if the SYG conditions in (1.9) are respected. These conditions hold for SI, and for all the sampling designs presented in Sect. 1.4 except systematic sampling. The result is trivial for Poisson sampling, and has been proved by [15] for rejective sampling, [22] for pivotal sampling, and [9] for Chao’s procedure. More generally, this property holds if the sampling design is negatively associated [7].

The weak consistency of the HT-estimator also holds under assumptions (H1) and (H2) if the sampling design is more efficient than multinomial sampling, see Eq. (1.13). This property holds true for SI. It does not generally hold for Poisson sampling and systematic sampling, but has been proved for the three other designs: see [51] for rejective sampling, [12] for pivotal sampling, and [54] for Chao’s procedure.

1.5.2 Central-Limit Theorem

Several different methods have been used in survey sampling to prove the property (b), which states that the HT-estimator satisfies a central-limit theorem. For example, two different and very elegant proofs for SI are due to [32] and [33].

The simplest case occurs when the sampling design may be seen as a series of independent experiments, like for Poisson sampling where the units are selected independently, or for multinomial sampling where the draws are independent. In this case, the asymptotic normality follows from the Lyapunov central-limit theorem.

The other unequal probability sampling designs are more difficult to handle, due to the dependence in the selection of the units. Reference [32] introduced a coupling method to prove a CLT for simple random sampling, which he later extended in [34] to cover rejective sampling. The basic idea of the coupling method is to link the sampling design under study to another one where the units are selected independently, and such that the two sampling designs are close with respect to some distance function, see, for example, [57]. The coupling method has also been used by [48] for sequential Poisson sampling, and [11] and [14] for multistage sampling, see also [4].

Another technique consists in applying the weaker martingale CLT, see, for example, [36]. This is used by [47] for the Rao–Hartley–Cochran method, [48] for two-stage sampling designs, and [13] for pivotal sampling.

1.5.3 Consistency of a Variance Estimator

The property (c) that a weakly consistent variance estimator is available is somewhat difficult to prove, except for Poisson sampling for which it holds automatically from assumptions (H1) and (H2), and for SI for which it can be proved by tedious, but fairly straightforward computation, see exercise 2.21 in [2].

The second-order inclusion probabilities can be computed for all the sampling designs presented in Sect. 1.4, either by means of an explicit formula for systematic sampling [17] and pivotal sampling [10, 19] or by means of a recursive formula, see [18] for rejective sampling and [9] for Chao’s procedure. However, it can be easily proved that many second-order inclusion probabilities are zero for both systematic sampling and pivotal sampling, and consequently unbiased variance estimators are not available. Numerous variance estimators have been proposed for systematic sampling, see, for example, [50, 61] or [27]. Variance estimators for pivotal sampling are considered in [12, 30] and [13].

Even for sampling designs with positive second-order inclusion probabilities, the consistency of the HT-variance estimator or the SYG-variance estimator requires in particular that the second-order inclusion probabilities are bounded below, which is difficult to prove. For rejective sampling, [34] proposed a variance approximation which does not require the second-order inclusion probabilities. This results in a simplified variance estimator, whose consistency is proved in [14].

1.6 Conclusion

Among the unequal probability sampling designs presented in this chapter, Poisson sampling, rejective sampling, and pivotal sampling possess good statistical properties. Poisson sampling has the disadvantage of leading to a random sample size, and therefore to an inflated variance. Rejective sampling is presumably the best sampling method, presenting many favorable properties needed for statistical inference. Pivotal sampling also possesses good statistical properties, but no unbiased variance estimator is available. However, conservative variance estimators are possible [13], and pivotal sampling leads to well-spread samples; it may be more efficient than rejective sampling when the units are ranked with respect to some variable which is related to the variable of interest.

Despite its great interest for sampling in data streams, Chao’s procedure has limited known statistical properties. Reference [5] studies the validity of Hajek’s variance estimator [34] for Chao’s procedure, but under very restrictive conditions on the first-order inclusion probabilities. Establishing a CLT and the consistency of a variance estimator for Chao’s procedure is both an important and challenging task.