1 Introduction

There are dozens of methods of unequal probability sampling, most of which are described in the books of Hanif and Brewer (1980), Gabler (1990), Tillé (2006) and Chaudhuri and Pal (2022). Nevertheless, very few methods allow one to decide on the selection of units in the order in which they appear. Systematic sampling with unequal probabilities (Madow 1949) allows sequential sampling from a stream. However, once the first units have been examined, this method quickly becomes deterministic and therefore predictable, which can be problematic: sampling should not be predictable if it is used to perform a control (Busnel and Tillé 2020).

Several methods are already available for sampling in a stream. The sequential version of the Hanurav-Vijayan sampling design (Vijayan 1968; Aubry 2023) can be considered a generalization of the Sunter sampling procedure (Sunter 1977, 1986). The ordered pivotal method (a special case of the splitting method of Deville and Tillé 1998) and Deville’s systematic sampling are two different implementations of the same sampling design (Chauvet 2012). A disadvantage of the ordered pivotal method is that the decision whether or not to select the first unit may be made only after examining a very large number of units in the sampling frame. The sequential version of Hanurav-Vijayan is not generally consistent using the Narain-Horvitz-Thompson (Narain 1951; Horvitz and Thompson 1952) estimator. The second phase of this design may be implemented more simply as a sequential procedure (Chauvet 2022), but then cannot completely respect the order of the observations. Disadvantages of Deville’s systematic method include the numerous calculations required at each step and problems with receiving data in groups in the middle of the process. In addition to these algorithms, reservoir sampling methods are available to collect samples from a stream (Chao 1982; Cohen et al. 2009; Tillé 2019). Using the idea of a reservoir in data mining approaches, such as pattern sampling (Boley et al. 2011; Diop et al. 2018), has provided powerful tools for mining online stream populations to discover patterns (Giacometti and Soulet 2021). In such methods, the reservoir is updated each time a unit enters, so the final decision on certain units may be made very late. In stream sampling, it is a convenient property to be able to decide on each unit immediately after observing it.

The Chromy sequential method (Chromy 1979) leads to an immediate sampling algorithm with the same design as Deville’s systematic and ordered pivotal methods (Chauvet 2021). The main drawback of these three methods is the large number of zero second-order inclusion probabilities. To solve this problem, Chromy (1979) proposed to partially randomize the order of the units in the population before applying the sampling algorithm. With such a reorganization of the population units, however, the method is no longer applicable to streams. Here, to overcome all the discussed drawbacks, we first propose a method that allows us to decide on the selection of units according to the order of their appearance in a list. The probabilities are updated only in a window containing a very small number of units after the unit for which the decision has been made. The method is therefore particularly interesting for selecting units from a stream. The great advantage of this new method is that it forces the decision on the units according to their order of appearance in the stream.

Recently, Jauslin et al. (2022) presented a two-phase balanced version of sequential sampling that respects the order of units in the sample selection process. However, depending on the auxiliary variables, this method may not lead directly to a sample. The decision regarding some units is then postponed until the completion of the first phase of sampling, during the implementation of the landing phase of balanced sampling.

We then extend the method to windows of integer size, where the size of each window is defined as the expected number of sample units within it. With windows of size two, we simultaneously take care of three important issues in streaming populations: the positivity of the second-order inclusion probabilities, the applicability to streaming populations and the spreading of the sample along the population indices. In this version of the method, immediately after observing enough units whose inclusion probabilities sum to at least two, we can decide on all of them, and exactly two of them will be selected.

Also, as a special case of the method, windows of size one are equivalent to the Chromy method. It can also easily be shown that Deville’s systematic method and the Chromy method lead to the same sampling design. The method is therefore quite appropriate for stream data and can be applied to unequal or equal inclusion probabilities.

To discuss these matters, in Sect. 2, we introduce the notation and the principal concepts of sampling theory. In Sect. 3, we present the proposed sampling method while, in Sect. 4, we extend the method to stream sampling. Section 5 is devoted to the method for windows of integer size and to the calculation of the second-order inclusion probabilities. In Sect. 6, we present stream sampling with windows of size one and extend it to an immediate decision sampling. In Sect. 7, we discuss the size of the windows. In Sect. 8, we run some small examples and simulations. The manuscript ends with a conclusion on the proposed methods, in Sect. 9.

2 Notation

Consider a finite population of size N labeled by \(U=\{1,2,\dots ,N\}\), a vector of values taken by a variable of interest \(\varvec{y}=(y_1,y_2,\dots ,y_N)^\top \) and a vector of first-order inclusion probabilities \(\varvec{\pi }=(\pi _1,\pi _2,\dots ,\pi _N)^\top \), each associated with the respective labels in U. Consider also the total of the variable of interest,

$$\begin{aligned} Y=\sum _{k\in U}y_k, \end{aligned}$$

as the main parameter. A sample s is a subset of U and a sampling design p(.) is a probability distribution on all the subsets of U. A random sample is defined by:

$$\begin{aligned} \text{ Pr }(S = s) = p(s),~ p(s)\ge 0 \text{ for } \text{ all } s\subset U, \text{ and } \sum _{s\subset U}p(s) = 1. \end{aligned}$$

To estimate this parameter, two approaches can be used:

  (1)

    Consider a sampling design p(.) and find a sampling method (or algorithm) to implement this design. Based on this approach, according to p(.), all the first- and second-order inclusion probabilities can be specified (a small enumeration sketch follows this list) using

    $$\begin{aligned} \pi _k&=\text {Pr}(k\in S)=\sum \limits _{s\ni k} p(s) \text{ and } \\ \pi _{k\ell }&=\text {Pr}(k,\ell \in S)=\sum \limits _{s\ni \{k,\ell \} } p(s), \text{ for } \text{ all } k,\ell \in U. \end{aligned}$$
  (2)

    Consider a vector of first-order inclusion probabilities \(\varvec{\pi }\) and find a sampling method to select a random sample S such that \(\text {Pr}(k\in S)=\pi _k,\) for all \(k\in U\).
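To make approach (1) concrete, the following R sketch enumerates a toy fixed-size design on \(U=\{1,2,3\}\) (the design itself is illustrative, not taken from the paper) and recovers \(\pi _k\) and \(\pi _{k\ell }\) from p(.):

```r
# Toy design on U = {1,2,3}: three samples of size 2 with probabilities p(s).
U <- 1:3
samples <- list(c(1, 2), c(1, 3), c(2, 3))
p <- c(0.5, 0.3, 0.2)                     # p(s) >= 0 and sums to 1
pik <- sapply(U, function(k)
  sum(p[sapply(samples, function(s) k %in% s)]))
pikl <- outer(U, U, Vectorize(function(k, l)
  sum(p[sapply(samples, function(s) k %in% s && l %in% s)])))
pik        # 0.8 0.7 0.5: first-order inclusion probabilities
sum(pik)   # 2: the fixed sample size
```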

In this article, the second approach is considered. If \(\pi _k>0\), for all \(k\in U\), then Y can be estimated unbiasedly employing the first-order inclusion probabilities using the Narain-Horvitz-Thompson estimator

$$\begin{aligned} {\widehat{Y}}=\sum _{k\in S}\frac{y_k}{\pi _k}. \end{aligned}$$

Its accuracy depends on the second-order inclusion probabilities

$$\begin{aligned} \text {Var}({\widehat{Y}})=\sum _{k\in U}\sum _{\ell \in U} (\pi _{k\ell }-\pi _k\pi _\ell )\frac{y_k}{\pi _k}\frac{y_\ell }{\pi _\ell }. \end{aligned}$$
(1)

Indeed, the values of \(\varvec{\pi }\) indicate the desired chance (calculated using some auxiliary variables) for the units to be selected in the final sample, and different sampling methods can lead to the same \(\varvec{\pi }\). Methods with the same \(\varvec{\pi }\) can lead to different matrices of second-order inclusion probabilities \(\varvec{\Pi }\), which directly affect the accuracy of the estimation. If \(\pi _{k\ell }>0,\) for all \(k,\ell \in U,\) then \(\text {Var}({\widehat{Y}})\) can be unbiasedly estimated by

$$\begin{aligned} \widehat{\text {Var}}({\widehat{Y}})=\sum _{k\in S}\sum _{\ell \in S} \frac{\pi _{k\ell }-\pi _k\pi _\ell }{\pi _{k\ell }} \frac{y_k}{\pi _k}\frac{y_\ell }{\pi _\ell }. \end{aligned}$$
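As a compact illustration, here is a minimal R sketch of these two estimators; it assumes the vector pik and the full matrix Pi of first- and second-order inclusion probabilities are available, with s a 0/1 sample-membership vector (the function names are ours):

```r
# Narain-Horvitz-Thompson estimator of Y.
ht_estimate <- function(y, pik, s) sum(y[s == 1] / pik[s == 1])

# Unbiased variance estimator, valid when all pi_kl > 0;
# Pi carries pi_k on its diagonal and pi_kl off the diagonal.
ht_varest <- function(y, pik, Pi, s) {
  idx <- which(s == 1)
  yp <- y[idx] / pik[idx]                 # expanded values y_k / pi_k
  sum((Pi[idx, idx] - outer(pik[idx], pik[idx])) / Pi[idx, idx] * outer(yp, yp))
}
```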

Accuracy is however not always the most important criterion for evaluating a strategy (a combination of a sampling method and an estimator). Another important criterion is the applicability of the strategy according to the accessibility of the data over time. Since efficient designs are usually complex, it is important to be able to calculate the second-order inclusion probabilities in order to assess the accuracy of a design. These topics are discussed further in Sect. 4.

3 One-step one-decision sampling method

In this section, we present a procedure called the One-Step One-Decision (OSOD) sampling method. The general idea of the method is that, at each step, a decision is made about the selection of one unit. After that, the inclusion probabilities are updated according to the decision made on this unit. This procedure is repeated on the following units, and it takes at most N steps to obtain the final sample. This method does not allow the selection of a sample from a stream but serves as a basis for the construction of the following methods.

Consider a finite population U. Define the sum of the inclusion probabilities

$$\begin{aligned} n=\sum _{k\in U}\pi _k, \end{aligned}$$

which can be non-integer. For simplicity, we first describe how the decision is made for the first unit. Also suppose that

$$\begin{aligned} \sum _{k=1}^N\pi _k\ge 1 \text{ and } \sum _{k=1}^N(1-\pi _k)\ge 1. \end{aligned}$$

The first unit is selected with probability \(\pi _1\). Let \(\pi _1^1\) be the updated inclusion probability of unit 1. It can be 0 or 1 and can be summarized as follows

$$\begin{aligned} \pi ^1_1=\left\{ \begin{array}{ll} 1&{} \text{ with } \text{ probability } \pi _1 \\ 0 &{} \text{ with } \text{ probability } 1-\pi _1. \end{array}\right. \end{aligned}$$
(2)

After deciding on the first unit, the remaining inclusion probabilities are updated. The general idea of the method is to increase (respectively decrease) the inclusion probabilities of the following units depending on whether \(\pi _1^1\) has been changed to 0 (respectively 1). The inclusion probabilities of the remaining units are updated as follows

$$\begin{aligned} \pi ^1_k=\left\{ \begin{array}{ll} \pi ^{1(0)}_k= \min (c_1\pi _k,1) &{} \text{ if } \pi ^1_1=0\\ \displaystyle \pi ^{1(1)}_k=\frac{\pi _k-\pi ^{1(0)}_k(1-\pi _1)}{\pi _1} &{}\text{ if } \pi ^1_1=1, \end{array} \right. \quad \text {for } k = 2,\dots ,N, \end{aligned}$$
(3)

where constant \(c_1\) is defined by

$$\begin{aligned} \sum _{k=2}^N \min (c_1\pi _k,1)=n. \end{aligned}$$
(4)

Equation (3) is constructed in such a way that the overall chances of the units to be selected are exactly equal to the predetermined inclusion probabilities, which also preserves the expected sample size. Indeed, in the first part of (3), the first unit is not selected, and \(\pi _1\) is distributed over the other units as \(\pi ^{1(0)}_k=\pi _k+\alpha _k \pi _1,\) for some \(0<\alpha _k\le 1\); Equation (4) guarantees that \(\alpha _k\) is proportional to \(\pi _k\) and that the updated inclusion probabilities do not exceed the value of 1. The second part of (3) is built so as to preserve the first-order inclusion probabilities.
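The only non-trivial computation in this update is the constant \(c_1\) of (4). Since \(\sum _k \min (c\pi _k,1)\) is continuous and nondecreasing in c, a plain bisection suffices; a minimal R sketch (the function name find_c is ours) is:

```r
# Solve sum(pmin(c * pik, 1)) = target for c, as in (4); the left-hand
# side is continuous and nondecreasing in c, so bisection is enough.
find_c <- function(pik, target, iter = 100) {
  stopifnot(target <= sum(pik > 0))       # otherwise no solution exists
  g <- function(cc) sum(pmin(cc * pik, 1))
  lo <- 0; hi <- 2
  while (g(hi) < target) hi <- 2 * hi     # g(hi) tends to the number of nonzero pik
  for (i in seq_len(iter)) {
    mid <- (lo + hi) / 2
    if (g(mid) < target) lo <- mid else hi <- mid
  }
  (lo + hi) / 2
}
```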

After the first step, the decision is irrevocably made for unit 1. This operation is then simply repeated at the following steps \(t=2,\dots ,N.\) At step t, decisions have been made on the first \(t-1\) units and \(\varvec{\pi }\) has been updated \(t-1\) times. The updated vector is denoted by

$$\begin{aligned} \varvec{\pi }^{t-1}=(\pi ^{t-1}_1,\pi ^{t-1}_2,\dots ,\pi ^{t-1}_k,\dots ,\pi ^{t-1}_N)^\top , \end{aligned}$$

whose first \(t-1\) components are in \(\{0,1\}\). Again, it is supposed that

$$\begin{aligned} \sum _{k=t}^{N}\pi ^{t-1}_k\ge 1 \text{ and } \sum _{k=t}^{N}(1-\pi ^{t-1}_k)\ge 1, \end{aligned}$$
(5)

and a decision is taken for the \(t^{th}\) unit with probability

$$\begin{aligned} \pi ^{t}_t=\left\{ \begin{array}{ll} 1&{} \text{ with } \text{ probability } \pi ^{t-1}_t\\ 0 &{} \text{ with } \text{ probability } 1-\pi ^{t-1}_t. \end{array} \right. \end{aligned}$$

Then, the inclusion probabilities are updated by

$$\begin{aligned} \pi ^t_k=\left\{ \begin{array}{ll} \pi ^{t(0)}_k=\min (c_t\pi ^{t-1}_k,1)&{} \text{ if } \pi ^{t}_t=0 \\ \displaystyle \pi ^{t(1)}_k=\frac{\pi ^{t-1}_k-\pi ^{t(0)}_k(1-\pi ^{t-1}_t)}{\pi ^{t-1}_t} &{} \text{ if } \pi ^t_t=1, \end{array}\right. \quad \text {for } k=t+1,t+2,\dots ,N, \end{aligned}$$

where \(c_t\) is defined by

$$\begin{aligned} \sum _{k=t+1}^N \min (c_t\pi ^{t-1}_k,1)=n-n_t, \end{aligned}$$

and \(n_t\) is the number of selected units up to step t.

Conditions (5) are required to be able to update the remaining probabilities. For example, at the first step, there must be enough amplitude in the other inclusion probabilities either to distribute \(\pi _1\) over them as \(\pi ^{1(0)}_k=\pi _k+\alpha _k\pi _1\) (if unit 1 is not selected) or to remove \(1-\pi _1\) from them as \(\pi ^{1(1)}_k=\pi _k-\alpha _k(1-\pi _1)\) (if unit 1 is selected), for some \(0<\alpha _k\le 1\). Henceforth, OSOD stands for the design applied to a population that satisfies Conditions (5).

The procedure is formally presented in Algorithm 1.

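The algorithm figure is not reproduced here; the following R sketch implements the steps above (it assumes Conditions (5) hold at every step and reuses the find_c sketch):

```r
# One-Step One-Decision (OSOD): decide on units in order, redistributing
# each decided unit's probability mass over the remaining units.
osod <- function(pik) {
  N <- length(pik)
  for (t in seq_len(N - 1)) {
    rest <- (t + 1):N
    target <- sum(pik[rest]) + pik[t]     # remaining expected size if unit t is rejected
    p0 <- pmin(find_c(pik[rest], target) * pik[rest], 1)
    if (runif(1) < pik[t]) {              # pi_t^t = 1: unit t selected
      pik[rest] <- (pik[rest] - p0 * (1 - pik[t])) / pik[t]
      pik[t] <- 1
    } else {                              # pi_t^t = 0: unit t rejected
      pik[rest] <- p0
      pik[t] <- 0
    }
  }
  pik[N] <- as.numeric(runif(1) < pik[N]) # already 0 or 1 when n is an integer
  pik
}
```

As a check, Result 1 below implies that rowMeans(replicate(10000, osod(pik))) should reproduce pik up to Monte Carlo error.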

Result 1 shows that the sampling procedure respects the inclusion probabilities and thus the fixed sample size.

Result 1

With OSOD, for all \(t = 1,2,\dots \) we have

$$\begin{aligned} \text {E}(\pi ^{t}_k)=\pi _k,\; \text{ for } \text{ all } k\in U,\quad \text{ and }\quad \sum _{k=1}^N\pi ^t_k=\sum _{k=1}^N\pi _k. \end{aligned}$$

The proofs of Result 1 and of all the following results are given in the Appendix.

So far, the sampling procedure considers the entire population when modifying the inclusion probabilities. In the next section, an improvement is proposed for sampling from a stream: only a small part of the following units is enough to update the inclusion probabilities. This improvement completely changes the range of application of the method.

4 Extending the method to stream sampling

A data stream is a sequence of information arriving sequentially in time. When sampling streams, there is usually a very large amount of data coming in continuously. Because of the limited space available for data storage, it is desirable to make a decision on the selection of units at the right time.

As pointed out in Sect. 1, in the methods proposed so far, the decision about a unit at the beginning of a stream can be made too late. Ideally, the decision to select a unit from a stream or not must be taken as soon as possible, which the OSOD method allows. The idea for extending the OSOD method to stream sampling consists of considering a window of units, instead of the entire population, to update the inclusion probabilities.

Result 2 shows that the number of units that must be considered depends on the inclusion probabilities of the following units. Essentially, it is enough to wait until we have a set of units such that there exists a constant \(c_t\) that satisfies \(\pi _t \ge 1-1/c_t\). Furthermore, Result 3 shows that a sufficient condition to be able to update the inclusion probabilities is that the inclusion probabilities within the window sum up to an integer.

Result 2

In OSOD, at step t, a necessary and sufficient condition to have enough space to update the following inclusion probabilities is that

$$\begin{aligned} \pi _t\ge 1-1/c_t. \end{aligned}$$

Result 3

In OSOD, at step t, a sufficient condition to have enough space to update the following inclusion probabilities is that \(n=\sum _{k\in U}\pi _k\) is an integer.

Therefore, after deciding on the first unit, we need enough units so that the constant \(c_1\) satisfies \(\pi _1\ge 1-1/c_1\). Moreover, if there is a window of integer size, then it is possible to apply the method on this window. Still, it may happen that no window sums up to an integer and no constant \(c_1\) satisfies the condition until the end of the stream. In this case, it is possible to use Result 3 and the notion of a phantom unit (Grafström et al. 2012) to artificially transform the sum of the inclusion probabilities into an integer and finish the procedure. An interesting alternative solution to this problem will be presented in Sect. 5.

Algorithm 2 gives a detailed explanation of the proposed method.

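Algorithm 2 itself is not reproduced here, but its key ingredient is the test, based on Results 2 and 3, of whether the units buffered so far are enough to decide on unit t. A minimal R sketch of this test, reusing find_c from Sect. 3 (the function name and tolerance are ours):

```r
# Can unit t be decided using only the buffered units that follow it?
window_ready <- function(pi_t, buffer, tol = 1e-9) {
  total <- pi_t + sum(buffer)
  if (abs(total - round(total)) < tol) return(TRUE)  # Result 3: integer window
  if (sum(buffer > 0) < total) return(FALSE)         # c_t does not exist yet
  ct <- find_c(buffer, total)                        # rejection-branch constant
  pi_t >= 1 - 1 / ct - tol                           # Result 2
}
```

In a stream, one keeps appending incoming units to the buffer until window_ready() returns TRUE, then decides on unit t and updates the buffer as in Sect. 3.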

In summary, OSOD has many advantages for streams:

  • The decision about units can be made based on their order.

  • To update the inclusion probabilities after deciding on a unit, the entire population is not needed: only a small part of it (based on Result 2 or Result 3) is sufficient. In fact, in the case of streams, there is usually no finite set called the population, and the data are generated online.

  • A decision can be made about one part of the population independently of other parts.

5 Stratified stream sampling (SSS)

In this section, based on the idea of Result 3, we partition the population into windows of integer size and then implement different designs inside each window. Consider a population with \(\varvec{\pi }=(\pi _1,\pi _2,\dots ,\pi _N)^\top \) and an arbitrary integer m. Let \(k_1\in U\) be the index such that

$$\begin{aligned} \sum _{k=1}^{k_1-1}\pi _k< m \text{ and } \sum _{k=1}^{k_1}\pi _k \ge m. \end{aligned}$$
(6)

Then we split \(\pi _{k_1}=\pi _{a_1}+\pi _{b_1}\) such that

$$\begin{aligned} \sum _{k=1}^{k_1-1}\pi _k+\pi _{a_1}=m. \end{aligned}$$

Now, for building the first window, say \(w_1\), as well as the other windows, we have

$$\begin{aligned} \varvec{\pi }=( \underbrace{ \pi _1,\quad \pi _2,\quad \dots ,\quad \pi _{k_1-2},\quad \pi _{k_1-1},\quad \pi _{a_1} }_{w_1:\;\;\text {The first window}}+\pi _{b_1},\quad \pi _{k_1+1},\quad \dots ,\quad \pi _N)^\top . \end{aligned}$$

With the same approach, we build the other windows of size m:

$$\begin{aligned} \varvec{\pi }=\big (&\underbrace{\pi _1,\ \pi _2,\ \dots ,\ \pi _{k_1-2},\ \pi _{k_1-1},\ \pi _{a_1}}_{w_1, \text{ of } \text{ size } m} +\underbrace{\pi _{b_1},\ \pi _{k_1+1},\ \dots ,\ \pi _{k_2-1},\ \pi _{a_2}}_{w_2, \text{ of } \text{ size } m} +\pi _{b_2},\ \dots ,\\ &\pi _{a_{i-1}}+\underbrace{\pi _{b_{i-1}},\ \pi _{k_{i-1}+1},\ \dots ,\ \pi _{k_i-1},\ \pi _{a_i}}_{w_i, \text{ of } \text{ size } m} +\pi _{b_i},\ \dots \big )^\top . \end{aligned}$$

We call unit \(k=k_i\) the cross-border unit of window \(w_i\). To continue, we first consider the first window. For the first unit inside \(w_1\), the inclusion probability is updated as in (2), and the other units inside the first window are updated as in (3), for \(k=2,3,\dots ,a_1\), with the constant \(c_1\) defined by \(\sum _{k=2}^{a_1}\min (c_1\pi _k,1)=m.\) The rest of the population units stay unchanged:

$$\begin{aligned} \pi _k^1=\pi _k;\quad k=b_1,k_1+1,k_1+2,\dots ,N. \end{aligned}$$

After deciding on the first unit, since \(\sum _{k=2}^{a_1}\pi _k^1\) is an integer, according to Result 3 we can make a decision on all the units inside the respective window step by step, except possibly for the last one, which is part of a cross-border unit and not a real unit.
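A minimal R sketch of this partitioning step, returning for each window its cross-border unit \(k_i\) and the split \(\pi _{a_i}\), \(\pi _{b_i}\) (the names below are illustrative):

```r
# Cut pik into consecutive windows whose probabilities sum to m, splitting
# the cross-border unit k_i into pi_a (closing window i) and pi_b (opening
# window i + 1), as in (6); any leftover mass after the last full window
# is left to the phantom-unit fix.
make_windows <- function(pik, m, tol = 1e-12) {
  cs <- cumsum(pik)
  out <- list(); i <- 1; start <- 1
  while (i * m <= cs[length(pik)] + tol) {
    k <- which(cs >= i * m - tol)[1]      # cross-border unit k_i
    pi_a <- i * m - (if (k > 1) cs[k - 1] else 0)
    out[[i]] <- list(from = start, k = k, pi_a = pi_a, pi_b = pik[k] - pi_a)
    start <- k; i <- i + 1                # pi_b opens the next window
  }
  out
}
```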

From now on, to simplify notation, \(\pi ^+_k\) and \(\pi ^*_k\) denote, respectively, the final update (which leads to a 0-1 decision) and the initial update (which must be imposed on the units of a new window before entering it) of unit k. With these notations, after a complete decision about the first window (which leads to selecting a sample of size m), \(\pi _{a_1}\) is finally updated to 1 (or 0) with probability \(\pi _{a_1}\) (or \(1-\pi _{a_1}\)). Before starting to update the units inside \(w_2\), we first need to make a decision on unit \(k=k_1\) and update the other units based on \(\pi ^+_{a_1}\). For this purpose, we proceed as follows:

  • if \(\pi ^+_{a_1}=1\), we consider \(k=k_1\) as a sample unit (\(\pi ^*_{k_1}=\pi ^+_{k_1}=1\)); the extra part in \(w_2\), i.e. \(\pi _{b_1}\), is then distributed over the other units inside \(w_2\) to initially update them as

    $$\begin{aligned} \pi ^{*(1)}_k= \min (c^*_2\pi _k,1), k=k_1+1,\dots ,a_2, \text { where } \sum _{k=k_1+1}^{a_2}\min (c^*_{2}\pi _k,1)=m. \end{aligned}$$
    (7)
  • if \(\pi ^+_{a_1}=0\), we do not yet decide on the cross-border unit \(k_1\), and the units inside \(w_2\), including \(b_1\), are updated as

    $$\begin{aligned} \pi ^{*(0)}_k=\frac{\pi _k-\pi ^{*(1)}_k\pi _{a_1}}{(1-\pi _{a_1})}, k=b_1,k_1+1,\dots ,a_2. \end{aligned}$$
    (8)

The details of the implementation of the method are given in Algorithm 3.


Result 4 shows that SSS respects the first-order inclusion probabilities and the fixed sample size.

Result 4

In SSS, for \(k=1,2,\dots \) and \(i=1,2,\dots \),

  (i)

    \(E(\pi ^*_k)=\pi _k,\)

  (ii)

    \(0\le \pi ^{*(0)}_k,\pi ^{*(1)}_k\le 1,\)

  (iii)

    \(\sum _{k=b_i}^{a_{i+1}}\pi ^{*(0)}_k=m\), and \(\sum _{k=b_i}^{a_{i+1}}\pi ^{*(1)}_k=m-1\).

Since the size of each window is an integer, we can implement any arbitrary design and sample size inside each stratum. Different designs and sample sizes inside different strata make the design completely flexible. Based on the next result, we can calculate all the second-order inclusion probabilities, which makes SSS a flexible stream sampling design for which the accuracy of the estimates can be assessed.

Result 5

With m as the size of the windows, considering \(p_i\) as the sampling design implemented in the \(i^{th}\) window, and defining

$$\begin{aligned} f(k,d_1,c,d_2)=\pi _{d_1|k}\min (c\pi _{d_2},1)+(1-\pi _{d_1|k}) \frac{\pi _{d_2}-\min (c\pi _{d_2},1)\pi _{d_1}}{1-\pi _{d_1}}, \end{aligned}$$

and

$$\begin{aligned} \pi ^{p_i(a_i)}_{d_1d_2}=\pi _{a_i}\pi ^{p_i(a_i\in S)}_{d_1d_2}+(1-\pi _{a_i})\pi ^{p_i(a_i\notin S)}_{d_1d_2} \end{aligned}$$

where \(\pi ^{p_i}_{d_1d_2}\), \(\pi ^{p_i(a_i\in S)}_{d_1d_2}\) and \(\pi ^{p_i(a_i\notin S)}_{d_1d_2}\) are the second-order inclusion probabilities of units \(d_1\) and \(d_2\) when design \(p_i\) is implemented in window i: as such, when the units are updated given \(\pi ^+_{a_i}=1\) as in (7), and when they are updated given \(\pi ^+_{a_i}=0\) as in (8), respectively, we have

  (i)

    if k and \(\ell \) are two non-cross-border units belonging to the same window, say i, then

    $$\begin{aligned} \pi _{k\ell }=\left\{ \begin{array}{ll} \pi ^{p_i}_{k\ell }, &{}i=1 \\ \\ \pi ^{p_i(a_i)}_{k\ell }, &{}i>1. \end{array}\right. \end{aligned}$$
  (ii)

    if k and \(\ell \) are two non-cross-border units belonging to distinct windows i and j, respectively, where \(i<j\) then,

    $$\begin{aligned} \pi _{k\ell }=\pi _kf(k,a_{j-1},c_j,\ell ), \end{aligned}$$

    where \(\pi _{a_{j-1}|k}\) can be calculated based on a recursive relation as

    $$\begin{aligned} \pi _{a_{j'}|k}=\left\{ \begin{array}{ll} f(k,a_{j'-1},c_{j'},a_{j'}), &{}i< j'< j\\ \\ \frac{\pi ^{p_i(a_{i-1})}_{ka_i}}{\pi _k} &{}j'=i. \end{array}\right. \end{aligned}$$
  (iii)

    if \(k=k_i\) and \(\ell \) is a non-cross-border unit belonging to window j, where \(i<j\), then

    $$\begin{aligned} \pi _{k\ell }=\pi _kf(k,a_{j-1},c_j,\ell ), \end{aligned}$$

    where

    $$\begin{aligned} \pi _{a_{j'}|k}=\left\{ \begin{array}{ll} f(k,a_{j'-1},c_{j'},a_{j'}), &{}i+1<j'<j \\ \\ \pi _{a_{i}|k}\min (c_{i+1}\pi _{a_{i+1}},1) +(1-\pi _{a_{i}|k})\frac{\pi ^{p_i(a_i\notin s)}_{{b_{i}}{a_{i+1}}}}{\pi _{b_{i}}/(1-\pi _{a_{i}})}, &{}j'=i+1,\\ \\ \frac{\pi _{a_i}}{\pi _k}&{} j'=i. \end{array}\right. \end{aligned}$$
  (iv)

    if \(\ell =k_{j}\) and k is a non-cross-border unit belonging to window i, where \(i < j\) then

    $$\begin{aligned} \pi _{k\ell }=\pi _k\left[ \pi _{a_{j}|k}+(1-\pi _{a_{j}|k}) \frac{\pi _{b_j}}{1-\pi _{a_{j}}}\right] , \end{aligned}$$

    where

    $$\begin{aligned} \pi _{a_{j'}|k}=\left\{ \begin{array}{lll} f(k,a_{j'-1},c_{j'},a_{j'}), &{}\quad i<j'<j \\ \\ \frac{\pi ^{p_i(a_{i-1})}_{ka_i}}{\pi _k}, &{}\quad j'=i>1\\ \\ \frac{\pi ^{p_i}_{ka_i}}{\pi _k} &{}\quad j'=i=1. \end{array}\right. \end{aligned}$$
  (v)

    if \(k=k_i\) and \(\ell =k_{j}\), where \(i<j\), we have

    $$\begin{aligned} \pi _{k\ell }=\pi _k\pi _{\ell |k}, \end{aligned}$$

    then

    $$\begin{aligned} \pi _{\ell |k}= & {} \left[ \pi _{a_{j}|k}+(1-\pi _{a_{j}|k}) \frac{\pi _{b_j}}{1-\pi _{a_{j}}}\right] ,\\ \pi _{a_{j'}|k}= & {} \left\{ \begin{array}{ll} f(k,a_{j'-1},c_{j'},a_{j'}),&{}i+1<j'\le j \\ \\ \frac{\pi _{a_{i}}}{\pi _k}\min (c_{j}\pi _{a_{j}},1) +(1-\frac{\pi _{a_{i}}}{\pi _k})\frac{\pi ^{p_i(a_i \notin s)}_{{b_{i}}{a_{j}}}}{\pi _{b_i}/(1-\pi _{a_i})}, &{}j'=i+1. \end{array}\right. \end{aligned}$$
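To ease reading, here is a direct R transcription of the function f used throughout Result 5, written with explicit numeric arguments (pi_d1_k stands for the conditional probability \(\pi _{d_1|k}\)):

```r
# f(k, d1, c, d2) of Result 5: a two-branch mixture over whether d1 is
# selected, given that unit k is in the sample.
f <- function(pi_d1_k, c, pi_d1, pi_d2) {
  mcd <- min(c * pi_d2, 1)
  pi_d1_k * mcd + (1 - pi_d1_k) * (pi_d2 - mcd * pi_d1) / (1 - pi_d1)
}
```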

In addition to the proof of Result 5 given in the Appendix, we run simulations in Sect. 8 to check the calculation of the second-order inclusion probabilities.

6 Immediate decision sampling (IDS)

Here we show that the method with windows of size one leads to an immediate decision and is equivalent to the Chromy sequential method, Deville’s systematic sampling, and the ordered pivotal method. We also present a very simple algorithm for the method, involving only one condition.

If we set \(m=1\) in (6), then we have windows of size one. Since the sum of the inclusion probabilities within a window is one, if a unit is selected, all the other inclusion probabilities in the window are updated to zero to compensate, i.e. \(\pi _{k}^{1(1)}=0\) in (3), while \(\pi _{k}^{1(0)}=\pi _k/(1-\pi _1)\). Result 6 shows that setting \(m=1\) leads to an immediate decision on each population unit.

Result 6

With \(F_0=0 \text { and } F_\ell =\sum _{k=1}^{\ell }\pi _k,\) windows of size one are equivalent to the following process:

After observing unit \(\ell \) in window \(w_i\), if

  (I)

    \(\ell \) is not a cross-border unit,

    (1)

      if \(n_\ell <i\),

      $$\begin{aligned} \pi ^{*}_\ell =\frac{\pi _\ell }{1-(F_{\ell -1}-\lfloor F_{\ell -1}\rfloor )}, \end{aligned}$$
    (2)

      if \(n_\ell =i\),

      $$\begin{aligned} \pi ^{*}_\ell =0. \end{aligned}$$
  (II)

    \(\ell \) is a cross-border unit,

    (1)

      if \(n_\ell <i\),

      $$\begin{aligned} \pi ^{*}_\ell =1, \end{aligned}$$
    (2)

      if \(n_\ell =i\),

      $$\begin{aligned} \pi _\ell ^{*}=\frac{\pi _\ell -\left\{ 1-(F_{\ell -1}-\lfloor F_{\ell -1}\rfloor )\right\} }{F_{\ell -1}-\lfloor F_{\ell -1}\rfloor }. \end{aligned}$$

Then, based on IDS, there is no need to know the unobserved units coming in the future in order to decide on the observed unit. The four IDS scenarios of Result 6 can be summarized in one condition, as presented in Algorithm 4.

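Algorithm 4 is not reproduced here; the following R sketch is a direct transcription of Result 6, where the current window index is \(i=\lfloor F_{\ell -1}\rfloor +1\) and unit \(\ell \) is cross-border exactly when the cumulated probabilities cross an integer at \(\ell \):

```r
# Immediate Decision Sampling: a 0/1 decision on each unit as it arrives;
# nsel plays the role of n_l, the number of units selected so far.
ids <- function(pik) {
  Fl <- cumsum(pik); s <- integer(length(pik)); nsel <- 0
  for (l in seq_along(pik)) {
    Fp <- if (l == 1) 0 else Fl[l - 1]
    r <- Fp - floor(Fp)                           # fractional part of F_{l-1}
    i <- floor(Fp) + 1                            # index of the current window
    cross <- floor(Fl[l] + 1e-12) > floor(Fp + 1e-12)
    p <- if (!cross) {
      if (nsel < i) pik[l] / (1 - r) else 0       # case (I)
    } else {
      if (nsel < i) 1 else (pik[l] - (1 - r)) / r # case (II)
    }
    s[l] <- as.integer(runif(1) < p)
    nsel <- nsel + s[l]
  }
  s
}
```

As a check, rowMeans(replicate(10000, ids(pik))) reproduces pik up to Monte Carlo error.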

Result 7

IDS, the Chromy sequential method, Deville’s systematic sampling, and the ordered pivotal method are four implementations of the same design.

It is worth noting that, although IDS and the Chromy sequential method are the same method, Algorithm 4 provides a very easy way to code the design, with only one condition. Then, with a fixed order of the population units, IDS can be considered a simpler version of Deville’s method. In IDS, according to Algorithm 4, there is no need to check whether a unit is a cross-border unit or not, and there is no need to generate random variables from different distributions for each window.

7 Size of the windows, spreading the sample, and the designs inside

There are a few points to make about the window size. Regarding the optimal window size, in the case of streaming data it is preferable to decide on the current units as soon as possible. In Sect. 6, we showed that if we set the window size to one, the method is equivalent to Chromy (1979), which provides an immediate decision for each observed unit. As a drawback, windows of size one lead to many zero second-order inclusion probabilities: the second-order inclusion probabilities are zero among all the units lying completely inside the same window. A solution investigated by Chauvet (2021) is to change the starting unit randomly, but this approach conflicts with the streaming aspect of the data. The balance between having (almost) all the second-order inclusion probabilities positive, respecting the streaming aspect and deciding immediately therefore leads to setting \(m=2\): after receiving enough data to form a window of size \(m=2\), we can completely decide on the data inside the window.

Regarding the feasibility of the method, if we are dealing with a streaming population, we will always have enough data to form the current window of size m. To implement the method in a finite population, one can choose m according to the population size and the inclusion probabilities. For the last window, we may have to resize the window and, if the sum of the inclusion probabilities of the last window is not an integer, the problem can be solved using a phantom unit (Grafström et al. 2012).

Moreover, if the population has spatial coordinates, we can construct the windows based on the coordinates. Then, the smaller the window size, the more the sample is spread over the population coordinates, which can lead to more efficient samples (see Grafström and Lundström 2013).

8 Examples and simulations

In this section, the validity of the calculation of the second-order inclusion probabilities and the efficiency of the proposed design are evaluated using several examples and simulations.

8.1 Evaluating Result 5

To cover a general case, consider a population of size \(N=15\) with \(n=7\) and strata of size \(H=3\), with

$$\begin{aligned} \varvec{\pi }=(.34,.82,.47,.66,.47,.36,.40,.53,.20,.22,.81,.67,.53,.11,.41). \end{aligned}$$
(9)

With strata of size \(H=3\) and a total size of \(n=7\), the stratification is unbalanced: the population is partitioned into \(L=3\) strata, the first two of size \(H_1=H_2=3\) and the last one of size \(H_3=1\) (see Table 1).

Table 1 Stratification of population (9) with \(N=15\) and \(n=7\) into \(L=3\) windows of sizes \(H_1=3\), \(H_2=3\) and \(H_3=1\), respectively

Let matrix \(\varvec{\Pi }\) be the second-order inclusion probabilities calculated based on Result 5,

$$\begin{aligned} \tiny \varvec{\Pi }=\left[ \begin{array}{ccccccccccccccc}.34&{}.26&{}.11&{}.18&{}.11&{}.08&{}.14&{}.18&{}.07&{}.07&{}.28&{}.23&{}.18&{}.04&{}.14\\ &{}.82&{}.35&{}.49&{}.35&{}.28&{}.33&{}.43&{}.17&{}.18&{}.66&{}.55&{}.43&{}.09&{}.33\\ &{}&{}.47&{}.25&{}.15&{}.13&{}.19&{}.24&{}.09&{}.10&{}.38&{}.31&{}.25&{}.05&{}.19\\ &{}&{}&{}.66&{}.25&{}.21&{}.26&{}.34&{}.13&{}.14&{}.53&{}.44&{}.35&{}.07&{}.27\\ &{}&{}&{}&{}.47&{}.14&{}.19&{}.25&{}.10&{}.10&{}.38&{}.32&{}.25&{}.05&{}.19\\ &{}&{}&{}&{}&{}.36&{}.13&{}.18&{}.06&{}.06&{}.29&{}.24&{}.19&{}.04&{}.15\\ &{}&{}&{}&{}&{}&{}.40&{}.15&{}.05&{}.05&{}.30&{}.22&{}.21&{}.04&{}.16\\ &{}&{}&{}&{}&{}&{}&{}.53&{}.08&{}.08&{}.39&{}.29&{}.28&{}.06&{}.21\\ &{}&{}&{}&{}&{}&{}&{}&{}.20&{}.01&{}.15&{}.11&{}.10&{}.02&{}.08\\ &{}&{}&{}&{}&{}&{}&{}&{}&{}.22&{}.16&{}.12&{}.11&{}.02&{}.09\\ &{}&{}&{}&{}&{}&{}&{}&{}&{}&{}.81&{}.50&{}.43&{}.09&{}.33\\ &{}&{}&{}&{}&{}&{}&{}&{}&{}&{}&{}.67&{}.35&{}.07&{}.27\\ &{}&{}&{}&{}&{}&{}&{}&{}&{}&{}&{}&{}.53&{}.01&{}.04\\ &{}&{}&{}&{}&{}&{}&{}&{}&{}&{}&{}&{}&{}.11&{}.00\\ &{}&{}&{}&{}&{}&{}&{}&{}&{}&{}&{}&{}&{}&{}.41 \end{array} \right] \end{aligned}$$

and matrix \(\widehat{\varvec{\Pi }}\) be the second-order inclusion probabilities estimated by a Monte Carlo simulation with \(M=10,000\) iterations. In each iteration, we implemented the SSS design of Sect. 5 in R, using within each stratum the elimination procedure (Tillé 1996) through the function UPtille of the sampling package. From the 10,000 samples thus obtained, the element \((k,\ell )\) of \(\widehat{\varvec{\Pi }}\) is the proportion of samples in which units k and \(\ell \) were selected together, resulting in:

$$\begin{aligned} \tiny \widehat{\varvec{\Pi }}=\left[ \begin{array}{ccccccccccccccc}.35&{}.26&{}.11&{}.19&{}.11&{}.08&{}.14&{}.18&{}.07&{}.08&{}.28&{}.23&{}.18&{}.04&{}.15\\ &{}.81&{}.34&{}.49&{}.35&{}.28&{}.32&{}.42&{}.17&{}.18&{}.67&{}.55&{}.43&{}.09&{}.33\\ &{}&{}.46&{}.25&{}.15&{}.13&{}.18&{}.24&{}.10&{}.09&{}.38&{}.31&{}.24&{}.05&{}.19\\ &{}&{}&{}.66&{}.25&{}.21&{}.26&{}.35&{}.14&{}.14&{}.54&{}.45&{}.35&{}.07&{}.27\\ &{}&{}&{}&{}.47&{}.14&{}.19&{}.24&{}.09&{}.10&{}.38&{}.31&{}.25&{}.05&{}.19\\ &{}&{}&{}&{}&{}.36&{}.13&{}.17&{}.06&{}.06&{}.29&{}.23&{}.19&{}.04&{}.14\\ &{}&{}&{}&{}&{}&{}.40&{}.14&{}.04&{}.05&{}.30&{}.22&{}.21&{}.04&{}.16\\ &{}&{}&{}&{}&{}&{}&{}.52&{}.07&{}.08&{}.39&{}.30&{}.27&{}.06&{}.21\\ &{}&{}&{}&{}&{}&{}&{}&{}.21&{}.01&{}.16&{}.11&{}.11&{}.02&{}.08\\ &{}&{}&{}&{}&{}&{}&{}&{}&{}.21&{}.16&{}.12&{}.11&{}.02&{}.08\\ &{}&{}&{}&{}&{}&{}&{}&{}&{}&{}.82&{}.51&{}.43&{}.09&{}.34\\ &{}&{}&{}&{}&{}&{}&{}&{}&{}&{}&{}.67&{}.35&{}.07&{}.27\\ &{}&{}&{}&{}&{}&{}&{}&{}&{}&{}&{}&{}.53&{}.01&{}.04\\ &{}&{}&{}&{}&{}&{}&{}&{}&{}&{}&{}&{}&{}.11&{}.00\\ &{}&{}&{}&{}&{}&{}&{}&{}&{}&{}&{}&{}&{}&{}.41 \end{array} \right] . \end{aligned}$$

The comparison of \(\varvec{\Pi }\) and \(\widehat{\varvec{\Pi }}\) confirms the validity of Result 5. As can be seen in \(\varvec{\Pi }\), all the second-order inclusion probabilities are positive, except \(\pi _{k\ell }\) for \(k=14, \ell =15\), which was expected because of the sample size \(H_3=1\) in the third stratum.

Also, for larger population sizes, \(N=100, 200\), with different sample sizes and different strata sizes, we ran Monte Carlo simulations with \(M=10,000\) iterations. The first-order inclusion probabilities were obtained by passing random uniform values to the inclusionprobabilities function of the sampling package. Again, within each stratum, the elimination procedure was used to select a sample. To compare the calculated (\(\varvec{\Pi }\), based on Result 5) and the estimated (\(\widehat{\varvec{\Pi }}\), based on Monte Carlo) matrices of second-order inclusion probabilities, we defined the following criterion

$$\begin{aligned} {\mathcal {Z}}_{\pi }=\frac{\sum _{k=1}^{N}{\mathcal {I}}\left( \left| \frac{{\hat{\pi }}_{kk}-\pi _{k}}{\sqrt{\pi _{k}(1-\pi _{k})/M}}\right| >1.96\right) }{N},\quad {\mathcal {Z}}_{\diamondsuit }=\frac{\sum \sum _{k<\ell }^{N}{\mathcal {I}}\left( \left| \frac{{\hat{\pi }}_{k\ell }-\pi _{k\ell }}{\sqrt{\pi _{k\ell }(1-\pi _{k\ell })/M}}\right| >1.96\right) }{N(N-1)/2}, \end{aligned}$$

where \(\pi _{k\ell }\) and \({\hat{\pi }}_{k\ell }\) are the \((k,\ell )\) elements of \(\varvec{\Pi }\) and \(\widehat{\varvec{\Pi }}\), respectively, and \({\mathcal {I}}(\cdot )\) is the indicator function, equal to 1 if its argument holds and 0 otherwise. Based on the Central Limit Theorem, we accept the validity of the results at a confidence level of 0.95 if \({\mathcal {Z}}\le 0.05\). The results, shown in Table 2, once again confirm the validity of Result 5, independently of N, n, H, and L.
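A minimal R sketch of this criterion, assuming Pi (from Result 5) and Pi.hat (Monte Carlo) are full symmetric matrices carrying \(\pi _k\) on their diagonals:

```r
# Share of (co-)inclusion probabilities whose Monte Carlo estimate falls
# outside 1.96 binomial standard errors of the computed value.
z_criterion <- function(Pi, Pi.hat, M) {
  se <- sqrt(Pi * (1 - Pi) / M)
  dev <- abs(Pi.hat - Pi)
  flag <- ifelse(se > 0, dev / se > 1.96, dev > 0)  # degenerate entries checked exactly
  list(Z.pi = mean(diag(flag)), Z.offdiag = mean(flag[upper.tri(flag)]))
}
```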

Table 2 Results for 10,000 Monte Carlo iterations of implementing the SSS design in random populations of sizes \(N=200, 100\)

8.2 Efficiency of SSS

The R datasets package (R Core Team 2022) contains the swiss data. Switzerland, in 1888, was entering a period known as the demographic transition; i.e., its fertility was beginning to fall from the high level typical of underdeveloped countries. The data concern \(N=47\) French-speaking districts at about this period of the demographic transition.

We believe that SSS is particularly well suited to stream sampling. To evaluate the accuracy of the SSS estimator relative to other sampling designs, such as the elimination and maximum entropy designs, we used the swiss data, including 3 variables (see Fig. 1):

  • Fertility: common standardized fertility measure,

  • Agriculture: \(\%\) of males involved in agriculture as occupation,

  • Infant Mortality: live births who live less than 1 year.

Fig. 1 The distribution and relationship among three variables of interest, Fertility, Agriculture and Infant Mortality, for the swiss data. In this \(3\times 3\) matrix of plots, the lower off-diagonal draws scatter plots, the diagonal represents density plots and the upper off-diagonal reports the Pearson correlations

The “Fertility” variable was used as the main variable y, while the “Agriculture” and “Infant Mortality” variables were used as auxiliary variables to create the first-order inclusion probabilities. We defined the efficiency of SSS as

$$\begin{aligned} \text{ EF(.) }=\frac{{Var_M}({\hat{Y}}_{\text{. }})}{{Var_M} ({\hat{Y}}_{\text{ SSS }})}, \end{aligned}$$

where “.” indicates the “elimination” or “max entropy” design, and \({Var_M}\) indicates the Monte Carlo variance of the respective estimator based on 10,000 iterations. The results, shown in Table 3, indicate that SSS is quite comparable in accuracy with these commonly used sampling designs. The smaller the size of the strata and the larger the sample size, the more efficient SSS is. Interestingly, for all cases with \(m=1\), SSS is more efficient than both other methods.
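The comparison can be sketched in R as follows. How the two auxiliary variables are combined into \(\varvec{\pi }\) is not specified above, so the x below is purely illustrative, as is the sample size; for the SSS sampler we reuse the ids sketch of Sect. 6, i.e. the \(m=1\) special case:

```r
library(sampling)
data(swiss)
y <- swiss$Fertility
x <- swiss$Agriculture + swiss$Infant.Mortality  # illustrative combination
pik <- inclusionprobabilities(x, 10)             # n = 10 is also illustrative
ht <- function(s) sum(y[s == 1] / pik[s == 1])   # NHT estimator
M <- 1000
v_elim   <- var(replicate(M, ht(UPtille(pik))))      # elimination design
v_maxent <- var(replicate(M, ht(UPmaxentropy(pik)))) # max entropy design
v_sss    <- var(replicate(M, ht(ids(pik))))          # SSS with m = 1 (ids, Sect. 6)
c(EF.elim = v_elim / v_sss, EF.maxent = v_maxent / v_sss)
```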

Table 3 Efficiency of SSS relative to elimination and maximum entropy designs based on Monte Carlo variance of the estimators

9 Conclusion

SSS is a general stream sampling method that allows decisions on units based on their order. The order can be arbitrary, but if the data come from a stream, this method can be very useful for deciding which units should be stored as samples over time. After deciding on each unit, it is not necessary to use the whole population to update the inclusion probabilities: a small window of the population is usually sufficient. SSS is flexible for stratifying populations and allocating samples to the strata, leading to an almost immediate decision on units. For spatial data, it is possible to spread the sample over the population coordinates, and the level of spreading can be controlled through the window sizes. The window size is a lever for trading off design entropy against sample dispersion over the population indices: the smaller the window size, the greater the sample dispersion and the lower the entropy of the design, and vice versa. Windows of size one and two are special cases with interesting properties. The former leads to an immediate decision, where after observing each unit we can immediately decide on its selection without depending on unobserved units. The latter is one of the optimal options: it spreads the sample over the population indices with a reasonable entropy while keeping all the second-order inclusion probabilities positive.

For future research, it will be interesting to investigate the problem of spreading the sample over the population coordinates using SSS and to compare it with other successful methods for the spatial spreading of sample units, such as the local pivotal (Grafström et al. 2012) and weakly associated vectors (Jauslin and Tillé 2020) methods. For this purpose, in two-dimensional coordinates, one unit can be considered as the center, and the units close to it can then be used to build the strata. Choosing the central units is one of the challenges of this approach; such units can be selected randomly or non-randomly. Since windows of size one split the inclusion probabilities to form the smallest possible strata, the level of spreading would probably be good and is worth studying.