
1 Introduction

The presence of outliers can strongly bias the sampling design and hence the survey results. In particular, it can lead to an incorrect computation of the number of statistical units to sample, usually an overestimate.

In what follows we focus on the stratified sampling design, which has been proven to be the most efficient surveying technique under some basic assumptions (see Tillé 2001) and is currently in use at several National Statistical Institutes (NSIs) for business surveys.

Suppose, for instance, that some outliers arise in the stratification variable X. Outliers are observations arbitrarily far from the majority of the data; they are often due to mistakes such as editing, measurement and observational errors. Intuitively, when outliers are present in a given stratum they affect both the location and the scale measures of X, so a higher dispersion than the “true” one will be observed in that stratum.

Such a situation biases the outcome of the HL method. For instance, the sample size would be larger than it should be, because observations appear more dispersed (on average) than they are in reality. Moreover, both the strata bounds and the sample allocation would be biased. This is clear when we consider, for example, the Neyman allocation, which is based on the within-stratum dispersion: since the principle is to survey more units in the strata in which the auxiliary variable is more dispersed, outliers may inflate the sample size in each stratum enormously and unduly.
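As a toy illustration of this effect (not taken from the paper: the stratum data and overall sample size below are made up), a minimal numpy sketch of the Neyman allocation \(n_h \propto N_h S_h\) with and without a single gross error in X:

```python
import numpy as np

def neyman_allocation(n_total, N_h, S_h):
    """Neyman allocation: n_h proportional to N_h * S_h."""
    w = N_h * S_h
    return n_total * w / w.sum()

rng = np.random.default_rng(0)

# Two toy take-some strata of the auxiliary variable X (illustrative values only).
x_h1 = rng.lognormal(mean=2.0, sigma=0.3, size=200)
x_h2 = rng.lognormal(mean=3.0, sigma=0.3, size=100)

N_h = np.array([x_h1.size, x_h2.size])
S_clean = np.array([x_h1.std(ddof=1), x_h2.std(ddof=1)])

# Plant a single gross error in stratum 1 (e.g. a value reported in the wrong unit).
x_h1_dirty = np.append(x_h1, 1000 * x_h1.max())
N_h_dirty = np.array([x_h1_dirty.size, x_h2.size])
S_dirty = np.array([x_h1_dirty.std(ddof=1), x_h2.std(ddof=1)])

print("allocation of n = 100 without the outlier:",
      neyman_allocation(100, N_h, S_clean).round(1))
print("allocation of n = 100 with the outlier:   ",
      neyman_allocation(100, N_h_dirty, S_dirty).round(1))
```

A single contaminated value is enough to drive nearly the whole sample into the affected stratum.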

For this reason we build two robust versions of the HL method, the naive robust and the robust HL sampling strategies, which we compare through a simulation study.

2 The Problem

We focus on simple stratified samples with one take-all stratum and several take-some strata. This is because we deal with:

  • skewed distributions (a small number of units accounts for a large share of the study variables)

  • availability of administrative information providing a list of the statistical units of the target population (e.g. tax declarations, social security registers)

  • survey burdens for firms and costs for NSIs

  • data quality (administrative sources and survey collection)

  • compliance requirements established by EUROSTAT

It is known that there is a discrepancy between the auxiliary variable X used for stratification and the survey variable Y. The strategy suggested by Rivest (2002) is to account for this discrepancy by means of a regression model.

Indeed, the auxiliary information X is only a proxy for the target variable Y, which requires estimating the discrepancy between Y and X, as done by Rivest (2002) in the modified HL algorithm.

In the business survey literature, the relationship between Y and X is often modeled by a log-linear regression. Let X and Y be continuous random variables and f(x), \(x \in \mathbb{R}\), the density of X. The data \(x_{1},\ldots,x_{N}\) are considered as N independent realizations of the random variable X.

Since stratum h consists of the population units with an X-value in the interval \((b_{h-1},b_{h}]\), the stratification process uses the values of \(E(Y \mid b_{h-1} < X \leq b_{h})\) and \(\mathrm{Var}(Y \mid b_{h-1} < X \leq b_{h})\), the conditional mean and variance of Y given that the unit falls in stratum h, for \(h = 1,\ldots,L - 1\).

This model considers the regression relationship between Y and X expressed by

$$\displaystyle{\log Y =\alpha +\beta _{\text{log}}\log X+\varepsilon,}$$

where \(\varepsilon\) is assumed to be a zero-mean random variable, normally distributed with variance \(\sigma_{\text{log}}^{2}\) and independent of X, whereas \(\alpha\) and \(\beta_{\text{log}}\) are the parameters to be estimated.
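As a baseline for comparison with the robust fits discussed later, a minimal sketch (synthetic data; variable names are hypothetical) of estimating \(\alpha\), \(\beta_{\text{log}}\) and \(\sigma_{\text{log}}^{2}\) by ordinary least squares on the log scale:

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic data following the log-linear model (alpha = 1, beta_log = 0.75, sigma_log = 0.5).
x = rng.lognormal(mean=3.0, sigma=1.0, size=500)
y = np.exp(1.0 + 0.75 * np.log(x) + rng.normal(0.0, 0.5, size=500))

# Ordinary least squares on the log scale.
design = np.column_stack([np.ones_like(x), np.log(x)])
coef, *_ = np.linalg.lstsq(design, np.log(y), rcond=None)
alpha_hat, beta_hat = coef

resid = np.log(y) - design @ coef
sigma2_hat = resid.var(ddof=2)              # estimate of sigma_log^2

print(alpha_hat, beta_hat, sigma2_hat)
```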

However, this approach presents some weaknesses:

  1. \(s_{\mathrm{yh}}^{2}\) is unknown, which makes the use of the auxiliary information X crucial

  2. the number L of strata is selected by the user

  3. the administrative records are often of low quality (errors)

We can distinguish three main sources of anomalies:

  • erroneous records in the surveyed data (Y) (vertical outliers)

  • quality issues in the administrative registers (X) (leverage points)

  • outliers in both variables (X, Y) (good/bad leverage points)

The presence of such anomalies makes the conditional mean and variance of Y | X unreliable, thereby affecting the determination of the sample size and the strata bounds, as well as the sample allocation.

In what follows we propose a possible alternative to the Rivest (2002) modified HL algorithm. Strata bounds and sizes are derived by minimizing the conditional variance in each stratum after re-weighting the information according to its degree of outlyingness. We refer to this approach as the robust modified HL algorithm.

3 The Robust Modified HL Algorithm

Supposing that a log-linear relationship exists between the survey variable Y and the auxiliary variable X, consider the regression S-estimator of Rousseeuw and Yohai (1984),

$$\displaystyle{S(x,y) =\arg \min _{\beta }s(r_{1}(\beta ),\ldots,r_{N}(\beta ))}$$

where \(r_{i}(\beta)\) are the regression residuals and s is a scale measure which solves

$$\displaystyle{ \frac{1} {N}\sum _{i=1}^{N}\rho \left (\frac{r_{i}(\beta )} {s} \right ) = b}$$

for a conveniently chosen ρ function and a constant b. This estimator is robust with respect to both vertical outliers and leverage points. Then, with some straightforward calculations (expanding ρ(·)), the following approximation holds

$$\displaystyle{\mathrm{Var}[Y \vert b_{h} \geq X > b_{h-1}] \thickapprox e^{\sigma ^{2} }\psi _{h}/W_{h} - (\phi _{h}/W_{h})^{2},}$$

where

$$\displaystyle{W_{h} =\int _{b_{h-1}}^{b_{h}}\omega (x^{\beta })\,f(x)\,dx}$$
(1)
$$\displaystyle{\phi _{h} =\int _{b_{h-1}}^{b_{h}}x^{\beta }\,\omega (x^{\beta })\,f(x)\,dx}$$
(2)
$$\displaystyle{\psi _{h} =\int _{b_{h-1}}^{b_{h}}x^{2\beta }\,\omega (x^{\beta })\,f(x)\,dx,}$$
(3)

β and σ are the parameters of the log-linear model in the previous section, and \(\omega (x) =\rho ^{{\prime}}(x)/x\) is the weighting function.
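For concreteness, the sketch below shows Tukey's biweight ρ, the induced weight function ω(u) = ρ′(u)∕u, the S-scale defined by the equation above solved by bisection, and empirical counterparts of the moments (1)–(3) in which the integrals against f(x) are replaced by averages over the frame. The choice of the biweight, the tuning constants, and the median/MAD standardization of \(x^{\beta}\) before weighting are assumptions of this sketch, not prescriptions taken from the paper:

```python
import numpy as np

def rho_biweight(u, c=1.547):
    """Tukey biweight rho function, bounded at c**2 / 6."""
    u = np.asarray(u, dtype=float)
    out = np.full(u.shape, c**2 / 6)
    inside = np.abs(u) <= c
    out[inside] = (c**2 / 6) * (1.0 - (1.0 - (u[inside] / c) ** 2) ** 3)
    return out

def omega_biweight(u, c=1.547):
    """Weight function omega(u) = rho'(u) / u for the biweight."""
    u = np.asarray(u, dtype=float)
    w = np.zeros(u.shape)
    inside = np.abs(u) <= c
    w[inside] = (1.0 - (u[inside] / c) ** 2) ** 2
    return w

def s_scale(residuals, b, c=1.547, tol=1e-8):
    """Solve (1/N) * sum_i rho(r_i / s) = b for s by bisection."""
    lo, hi = 1e-12, 10.0 * np.abs(residuals).max() + 1e-12
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if rho_biweight(residuals / mid, c).mean() > b:
            lo = mid          # s too small: the average rho is still above b
        else:
            hi = mid
        if hi - lo < tol:
            break
    return 0.5 * (lo + hi)

rng = np.random.default_rng(2)
b_const = rho_biweight(rng.normal(size=200_000)).mean()   # approx. E[rho(Z)], 50% breakdown

# Toy residuals from a (hypothetical) robust log-linear fit, with a few gross errors added.
r = np.concatenate([rng.normal(0.0, 1.0, 500), rng.normal(12.0, 1.0, 25)])
print("robust S-scale:", s_scale(r, b_const))

# Empirical counterparts of (1)-(3) for one stratum (b_lo, b_hi]: the integrals against
# f(x) are replaced by averages over the frame, and each unit is down-weighted by omega
# evaluated at a standardized x**beta (median/MAD standardization -- a sketch assumption).
x = rng.lognormal(3.0, 1.0, size=2000)
beta_hat = 0.75
b_lo, b_hi = np.quantile(x, [0.5, 0.9])
xb = x[(x > b_lo) & (x <= b_hi)] ** beta_hat

u = (xb - np.median(xb)) / (1.4826 * np.median(np.abs(xb - np.median(xb))))
w = omega_biweight(u, c=4.685)            # larger tuning constant for weighting (assumption)

W_h = w.sum() / x.size
phi_h = (w * xb).sum() / x.size
psi_h = (w * xb ** 2).sum() / x.size
print("W_h, phi_h, psi_h:", W_h, phi_h, psi_h)
```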

The problem then reduces to solving for the bounds \(b_{1},\ldots,b_{h},\ldots,b_{L}\) which minimize n under the Neyman allocation scheme. In symbols, under the log-linear specification the objective function is

$$\displaystyle{n_{\hat{t}_{\text{ystrat}}} = N_{L} + \frac{\left(\sum _{h=1}^{L-1}\left(e^{\sigma ^{2}}\psi _{h}W_{h} -\phi _{h}^{2}\right)^{1/2}\right)^{2}}{\left(c\sum x_{i}^{\beta }/N\right)^{2} +\sum _{h=1}^{L-1}\frac{e^{\sigma ^{2}}\psi _{h} -\phi _{h}^{2}/W_{h}}{N}}}$$
(4)

where the robust moments \(W_{h}\), \(\phi_{h}\) and \(\psi_{h}\) are those in (1)–(3), and β and σ are the parameters of the log-linear model estimated by robust regression (S-estimator or LTS).

Then Sethi's iterations are run for a given L and target precision c, computing the optimal strata bounds and sample size.
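Sethi's algorithm itself is not reproduced here; as an illustration only, the following sketch evaluates the objective (4) from empirical (optionally ω-weighted) moments and minimizes it by a crude random search over candidate bounds. Data, tuning values and function names are made up:

```python
import numpy as np

def target_n(bounds, x, beta, sigma2, c_cv, weights=None):
    """Evaluate the sample-size objective (4) for increasing take-some bounds
    b_1 < ... < b_{L-1}; units above the last bound form the take-all stratum.
    The integrals (1)-(3) are replaced by (optionally weighted) empirical sums."""
    N = x.size
    w = np.ones(N) if weights is None else weights
    xb = x ** beta
    edges = np.concatenate([[0.0], np.asarray(bounds)])
    num, den_var = 0.0, 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_h = (x > lo) & (x <= hi)
        W_h = w[in_h].sum() / N
        phi_h = (w[in_h] * xb[in_h]).sum() / N
        psi_h = (w[in_h] * xb[in_h] ** 2).sum() / N
        num += np.sqrt(max(np.exp(sigma2) * psi_h * W_h - phi_h ** 2, 0.0))
        den_var += (np.exp(sigma2) * psi_h - phi_h ** 2 / max(W_h, 1e-12)) / N
    N_take_all = int((x > edges[-1]).sum())
    return N_take_all + num ** 2 / ((c_cv * xb.mean()) ** 2 + den_var)

# Crude random search over candidate bounds, standing in for Sethi's iterations.
rng = np.random.default_rng(3)
x = rng.lognormal(3.0, 1.0, size=5000)
grid = np.quantile(x, np.linspace(0.05, 0.995, 200))
best = None
for _ in range(2000):
    cand = np.sort(rng.choice(grid, size=5, replace=False))   # L = 6: 5 take-some + 1 take-all
    n = target_n(cand, x, beta=0.75, sigma2=0.25, c_cv=0.01)
    if best is None or n < best[0]:
        best = (n, cand)
print("approximate minimal n:", round(best[0], 1), "bounds:", np.round(best[1], 1))
```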

4 Simulation Study

The aim of the simulation study is to compare the performance of the two robust sampling strategies proposed in this paper with the strategy of Rivest (2002), which is based on classical least-squares (LS) regression.

Simulations are performed using the business sampling frame of the 2002 Structural Business Survey (SBS), where we consider as target variable (y) the value added of enterprises in the Construction industry, stratified by economic-size class. The number of strata, \(h = 1,\ldots,6\), is set according to the common practice in the SBS, with 1 take-all stratum and 5 take-some strata. The auxiliary information x is the turnover (from the VAT register). The population is then generated from

$$\displaystyle{\log y_{i} =\beta \log x_{i} +\varepsilon _{i}}$$

with a choice of β = 0.75.
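A minimal sketch of this population-generating step, with a synthetic log-normal x standing in for the (confidential) VAT turnover register:

```python
import numpy as np

rng = np.random.default_rng(4)

# Synthetic stand-in for the turnover register (the real SBS frame is confidential).
N = 20_000
x = rng.lognormal(mean=4.0, sigma=1.2, size=N)

beta = 0.75
eps = rng.normal(0.0, 1.0, size=N)            # design 1 below: no outliers
y = np.exp(beta * np.log(x) + eps)            # log y_i = beta * log x_i + eps_i

print(y.mean(), np.corrcoef(np.log(x), np.log(y))[0, 1])
```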

We then consider the following error designs (a sketch of how the contaminated designs can be generated follows the list):

  1. no outliers: \(\varepsilon _{i} \sim \mathcal{N}(0,1)\)

  2. long-tailed errors: \(\varepsilon _{i} \sim \text{Cauchy}_{1}\)

  3. long-tailed errors: \(\varepsilon _{i} \sim \text{t}_{3}\)

  4. vertical outliers: δ % of \(\varepsilon _{i} \sim \mathcal{N}\left (5\sqrt{\chi _{1;0.99 }^{2}},1.5\right )\)

  5. bad leverage points: δ % of \(\varepsilon _{i} \sim \mathcal{N}(10,10)\) with the corresponding \(X \sim \mathcal{N}(-10,10)\).
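For illustration, a hedged sketch of how the contaminated designs 4 and 5 could be generated for a contamination level δ. The scale parameters are read as standard deviations and the leverage x-values are made positive so that log x is defined; both choices are assumptions of this sketch, not statements from the paper:

```python
import numpy as np

rng = np.random.default_rng(5)
N, beta, delta = 20_000, 0.75, 0.15                 # delta = 15 % contamination
x = rng.lognormal(mean=4.0, sigma=1.2, size=N)      # synthetic auxiliary variable

n_out = int(delta * N)
out_idx = rng.choice(N, size=n_out, replace=False)

# Design 4: vertical outliers -- delta % of the errors are shifted far from zero.
eps = rng.normal(0.0, 1.0, size=N)
eps[out_idx] = rng.normal(5 * np.sqrt(6.635), 1.5, size=n_out)   # 6.635 ~ chi-square(1) quantile at 0.99
y_vertical = np.exp(beta * np.log(x) + eps)

# Design 5: bad leverage points -- both the errors and the auxiliary values are contaminated.
eps = rng.normal(0.0, 1.0, size=N)
eps[out_idx] = rng.normal(10.0, 10.0, size=n_out)
x_lev = x.copy()
x_lev[out_idx] = np.abs(rng.normal(-10.0, 10.0, size=n_out))     # abs() keeps log x defined (sketch assumption)
y_leverage = np.exp(beta * np.log(x_lev) + eps)
print(y_vertical.mean(), y_leverage.mean())
```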

The contamination level, i.e. the percentage of outliers in the data, is set to δ = 15 % and 30 %. The three procedures are then used to compute the strata bounds, sizes and allocation:

  • generalized HL method (Rivest 2002)

  • naive robust generalized HL method

  • robust generalized HL method

at a 1 % target precision, and compared by means of the relative MSE of the Horvitz–Thompson estimator of the mean. The main results are displayed in Table 1.

Table 1 Summary of results comparing Robust modified HL method versus modified HL (Rivest 2002), target precision: 1 %