
1 Introduction

The presence of outliers can strongly bias the sampling design and hence the survey results. In particular, it can lead to an incorrect computation of the number of statistical units to sample, usually an overestimate.

In what follows we focus on the stratified sampling design, which has been proven to be the most efficient surveying technique under some basic assumptions (see Tillé 2001) and is currently in use at several National Statistical Institutes (NSIs) for business surveys.

Suppose, for instance, that some outliers arise in the stratification variable X. Outliers are observations arbitrarily far from the majority of the data; they are often due to mistakes such as editing, measurement and observational errors. Intuitively, when outliers are present in a given stratum they affect both the location and the scale measures of X, so a higher dispersion than the “true” one will be observed in that stratum.

Such a situation biases the outcome of the HL method. For instance, the sample size would be larger than it should be, because observations appear more dispersed (on average) than they are in reality. Moreover, both the strata bounds and the sample allocation would be biased. This is clear when we consider, for example, the Neyman allocation, which is based on the within-stratum dispersion: since the principle is to survey more units in the strata in which the auxiliary variable is more dispersed, outliers may inflate the sample size in each stratum enormously and unduly.
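As a toy illustration of this effect (not taken from the paper: the stratum data and overall sample size below are made up), a minimal numpy sketch of the Neyman allocation \(n_h \propto N_h S_h\) with and without a single gross error in X:

```python
import numpy as np

def neyman_allocation(n_total, N_h, S_h):
    """Neyman allocation: n_h proportional to N_h * S_h."""
    w = N_h * S_h
    return n_total * w / w.sum()

rng = np.random.default_rng(0)

# Two toy take-some strata of the auxiliary variable X (illustrative values only).
x_h1 = rng.lognormal(mean=2.0, sigma=0.3, size=200)
x_h2 = rng.lognormal(mean=3.0, sigma=0.3, size=100)

N_h = np.array([x_h1.size, x_h2.size])
S_clean = np.array([x_h1.std(ddof=1), x_h2.std(ddof=1)])

# Plant a single gross error in stratum 1 (e.g. a value reported in the wrong unit).
x_h1_dirty = np.append(x_h1, 1000 * x_h1.max())
N_h_dirty = np.array([x_h1_dirty.size, x_h2.size])
S_dirty = np.array([x_h1_dirty.std(ddof=1), x_h2.std(ddof=1)])

print("allocation of n = 100 without the outlier:",
      neyman_allocation(100, N_h, S_clean).round(1))
print("allocation of n = 100 with the outlier:   ",
      neyman_allocation(100, N_h_dirty, S_dirty).round(1))
```

A single contaminated value is enough to drive nearly the whole sample into the affected stratum.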

For this reason we build two robust versions of the HL method, the naive robust and the robust HL sampling strategies, which we compare through a simulation study.

2 The Problem

We focus on simple stratified samples with one take-all stratum and several take-some strata. This is because we deal with:

  • skewed distributions (a small number of units accounts for a large share of the study variables)

  • availability of administrative information providing a list of the statistical units of the target population (e.g. tax declarations, social security registers)

  • survey burdens for firms and costs for NSIs

  • data quality (administrative sources and survey collection)

  • compliance requirements established by EUROSTAT

It is known that there is a discrepancy between the auxiliary variable X used for stratification and the survey variable Y. The strategy suggested by Rivest (2002) is to account for this discrepancy by means of a regression model.

Indeed, the auxiliary information X is only a proxy for the target variable Y, which requires estimating the discrepancy between Y and X, as done by Rivest (2002) in the modified HL algorithm.

In the business survey literature, the relationship between Y and X is often modeled by a log-linear regression. Let X and Y be continuous random variables and f(x), \(x \in \mathbb{R}\), the density of X. The data \(x_{1},\ldots,x_{N}\) are considered as N independent realizations of the random variable X.

Since stratum h consists of the population units with an X-value in the interval \((b_{h-1},b_{h}]\), the stratification process uses the values of \(E(Y \mid b_{h-1} < X \leq b_{h})\) and \(\mathrm{Var}(Y \mid b_{h-1} < X \leq b_{h})\), the conditional mean and variance of Y given that the unit falls in stratum h, for \(h = 1,\ldots,L - 1\).

This model considers the regression relationship between Y and X expressed by

$$\displaystyle{\log Y =\alpha +\beta _{\text{log}}\log X+\varepsilon,}$$

where \(\varepsilon\) is assumed to be a zero-mean random variable, normally distributed with variance \(\sigma_{\text{log}}^{2}\) and independent of X, whereas \(\alpha\) and \(\beta_{\text{log}}\) are the parameters to be estimated.
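As a baseline for comparison with the robust fits discussed later, a minimal sketch (synthetic data; variable names are hypothetical) of estimating \(\alpha\), \(\beta_{\text{log}}\) and \(\sigma_{\text{log}}^{2}\) by ordinary least squares on the log scale:

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic data following the log-linear model (alpha = 1, beta_log = 0.75, sigma_log = 0.5).
x = rng.lognormal(mean=3.0, sigma=1.0, size=500)
y = np.exp(1.0 + 0.75 * np.log(x) + rng.normal(0.0, 0.5, size=500))

# Ordinary least squares on the log scale.
design = np.column_stack([np.ones_like(x), np.log(x)])
coef, *_ = np.linalg.lstsq(design, np.log(y), rcond=None)
alpha_hat, beta_hat = coef

resid = np.log(y) - design @ coef
sigma2_hat = resid.var(ddof=2)              # estimate of sigma_log^2

print(alpha_hat, beta_hat, sigma2_hat)
```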

However, this approach presents some weaknesses:

  1. \(s_{\mathrm{yh}}^{2}\) is unknown, which makes the use of the auxiliary information X crucial

  2. the number L of strata is selected by the user

  3. the administrative records are often of low quality (errors)

We can distinguish three main sources of anomalies:

  • erroneous records in the surveyed data (Y) (vertical outliers)

  • quality issues in the administrative registers (X) (leverage points)

  • outliers in both variables (X, Y) (good/bad leverage points)

The presence of such anomalies makes the conditional mean and variance of Y | X unreliable, thereby affecting the determination of the sample size and the strata bounds, as well as the sample allocation.

In what follows we propose a possible alternative to the Rivest (2002) modified HL algorithm. Strata bounds and sizes are derived by minimizing the conditional variance in each stratum after re-weighting the information according to its degree of outlyingness. We refer to this approach as the robust modified HL algorithm.

3 The Robust Modified HL Algorithm

Supposing that a log-linear relationship exists between the survey variable Y and the auxiliary variable X, consider the regression S-estimator of Rousseeuw and Yohai (1984),

$$\displaystyle{S(x,y) =\arg \min _{\beta }s(r_{1}(\beta ),\ldots,r_{N}(\beta ))}$$

where \(r_{i}(\beta)\) are the regression residuals and s is a scale measure which solves

$$\displaystyle{ \frac{1} {N}\sum _{i=1}^{N}\rho \left (\frac{r_{i}(\beta )} {s} \right ) = b}$$

for a conveniently chosen ρ function and a constant b. This estimator is robust with respect to both vertical outliers and leverage points. Then, with some straightforward calculations (expanding ρ(·)), the following approximation holds

$$\displaystyle{\mathrm{Var}[Y \vert b_{h} \geq X > b_{h-1}] \thickapprox e^{\sigma ^{2} }\psi _{h}/W_{h} - (\phi _{h}/W_{h})^{2},}$$

where

$$\displaystyle{W_{h} =\int _{b_{h-1}}^{b_{h}}\omega (x^{\beta })\,f(x)\,dx}$$
(1)
$$\displaystyle{\phi _{h} =\int _{b_{h-1}}^{b_{h}}x^{\beta }\,\omega (x^{\beta })\,f(x)\,dx}$$
(2)
$$\displaystyle{\psi _{h} =\int _{b_{h-1}}^{b_{h}}x^{2\beta }\,\omega (x^{\beta })\,f(x)\,dx,}$$
(3)

β and σ are the parameters of the log-linear model in the previous section, and \(\omega (x) =\rho ^{{\prime}}(x)/x\) is the weighting function.
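For concreteness, the sketch below shows Tukey's biweight ρ, the induced weight function ω(u) = ρ′(u)∕u, the S-scale defined by the equation above solved by bisection, and empirical counterparts of the moments (1)–(3) in which the integrals against f(x) are replaced by averages over the frame. The choice of the biweight, the tuning constants, and the median/MAD standardization of \(x^{\beta}\) before weighting are assumptions of this sketch, not prescriptions taken from the paper:

```python
import numpy as np

def rho_biweight(u, c=1.547):
    """Tukey biweight rho function, bounded at c**2 / 6."""
    u = np.asarray(u, dtype=float)
    out = np.full(u.shape, c**2 / 6)
    inside = np.abs(u) <= c
    out[inside] = (c**2 / 6) * (1.0 - (1.0 - (u[inside] / c) ** 2) ** 3)
    return out

def omega_biweight(u, c=1.547):
    """Weight function omega(u) = rho'(u) / u for the biweight."""
    u = np.asarray(u, dtype=float)
    w = np.zeros(u.shape)
    inside = np.abs(u) <= c
    w[inside] = (1.0 - (u[inside] / c) ** 2) ** 2
    return w

def s_scale(residuals, b, c=1.547, tol=1e-8):
    """Solve (1/N) * sum_i rho(r_i / s) = b for s by bisection."""
    lo, hi = 1e-12, 10.0 * np.abs(residuals).max() + 1e-12
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if rho_biweight(residuals / mid, c).mean() > b:
            lo = mid          # s too small: the average rho is still above b
        else:
            hi = mid
        if hi - lo < tol:
            break
    return 0.5 * (lo + hi)

rng = np.random.default_rng(2)
b_const = rho_biweight(rng.normal(size=200_000)).mean()   # approx. E[rho(Z)], 50% breakdown

# Toy residuals from a (hypothetical) robust log-linear fit, with a few gross errors added.
r = np.concatenate([rng.normal(0.0, 1.0, 500), rng.normal(12.0, 1.0, 25)])
print("robust S-scale:", s_scale(r, b_const))

# Empirical counterparts of (1)-(3) for one stratum (b_lo, b_hi]: the integrals against
# f(x) are replaced by averages over the frame, and each unit is down-weighted by omega
# evaluated at a standardized x**beta (median/MAD standardization -- a sketch assumption).
x = rng.lognormal(3.0, 1.0, size=2000)
beta_hat = 0.75
b_lo, b_hi = np.quantile(x, [0.5, 0.9])
xb = x[(x > b_lo) & (x <= b_hi)] ** beta_hat

u = (xb - np.median(xb)) / (1.4826 * np.median(np.abs(xb - np.median(xb))))
w = omega_biweight(u, c=4.685)            # larger tuning constant for weighting (assumption)

W_h = w.sum() / x.size
phi_h = (w * xb).sum() / x.size
psi_h = (w * xb ** 2).sum() / x.size
print("W_h, phi_h, psi_h:", W_h, phi_h, psi_h)
```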

The problem then reduces to solving for the bounds \(b_{1},\ldots,b_{h},\ldots,b_{L}\) which minimize n under the Neyman allocation scheme. In symbols, under the log-linear specification the objective function is

$$\displaystyle{n_{\hat{t}_{\text{ystrat}}} = N_{L} + \frac{\left(\sum _{h=1}^{L-1}\left(e^{\sigma ^{2}}\psi _{h}W_{h} -\phi _{h}^{2}\right)^{1/2}\right)^{2}}{\left(c\sum x_{i}^{\beta }/N\right)^{2} +\sum _{h=1}^{L-1}\frac{e^{\sigma ^{2}}\psi _{h} -\phi _{h}^{2}/W_{h}}{N}}}$$
(4)

where the robust moments \(W_{h}\), \(\phi_{h}\) and \(\psi_{h}\) are those in (1)–(3), and β and σ are the parameters of the log-linear model estimated by robust regression (S-estimator or LTS).

Then Sethi's iterations are run for a given L and target precision c, computing the optimal strata bounds and sample size.
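Sethi's algorithm itself is not reproduced here; as an illustration only, the following sketch evaluates the objective (4) from empirical (optionally ω-weighted) moments and minimizes it by a crude random search over candidate bounds. Data, tuning values and function names are made up:

```python
import numpy as np

def target_n(bounds, x, beta, sigma2, c_cv, weights=None):
    """Evaluate the sample-size objective (4) for increasing take-some bounds
    b_1 < ... < b_{L-1}; units above the last bound form the take-all stratum.
    The integrals (1)-(3) are replaced by (optionally weighted) empirical sums."""
    N = x.size
    w = np.ones(N) if weights is None else weights
    xb = x ** beta
    edges = np.concatenate([[0.0], np.asarray(bounds)])
    num, den_var = 0.0, 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_h = (x > lo) & (x <= hi)
        W_h = w[in_h].sum() / N
        phi_h = (w[in_h] * xb[in_h]).sum() / N
        psi_h = (w[in_h] * xb[in_h] ** 2).sum() / N
        num += np.sqrt(max(np.exp(sigma2) * psi_h * W_h - phi_h ** 2, 0.0))
        den_var += (np.exp(sigma2) * psi_h - phi_h ** 2 / max(W_h, 1e-12)) / N
    N_take_all = int((x > edges[-1]).sum())
    return N_take_all + num ** 2 / ((c_cv * xb.mean()) ** 2 + den_var)

# Crude random search over candidate bounds, standing in for Sethi's iterations.
rng = np.random.default_rng(3)
x = rng.lognormal(3.0, 1.0, size=5000)
grid = np.quantile(x, np.linspace(0.05, 0.995, 200))
best = None
for _ in range(2000):
    cand = np.sort(rng.choice(grid, size=5, replace=False))   # L = 6: 5 take-some + 1 take-all
    n = target_n(cand, x, beta=0.75, sigma2=0.25, c_cv=0.01)
    if best is None or n < best[0]:
        best = (n, cand)
print("approximate minimal n:", round(best[0], 1), "bounds:", np.round(best[1], 1))
```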

4 Simulation Study

The aim of the simulation study is to compare the performance of the two robust sampling strategies proposed in this paper with the strategy of Rivest (2002), which is based on classical least-squares (LS) regression.

Simulations are performed using the business sampling frame of the 2002 Structural Business Survey (SBS), where we consider as target variable (y) the value added of enterprises in the Construction industry, stratified by economic-size class. The number of strata, \(h = 1,\ldots,6\), is set according to the common practice in the SBS, with 1 take-all stratum and 5 take-some strata. The auxiliary information x is the turnover (from the VAT register). The population is then generated from

$$\displaystyle{\log y_{i} =\beta \log x_{i} +\varepsilon _{i}}$$

with a choice of β = 0.75.
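A minimal sketch of this population-generating step, with a synthetic log-normal x standing in for the (confidential) VAT turnover register:

```python
import numpy as np

rng = np.random.default_rng(4)

# Synthetic stand-in for the turnover register (the real SBS frame is confidential).
N = 20_000
x = rng.lognormal(mean=4.0, sigma=1.2, size=N)

beta = 0.75
eps = rng.normal(0.0, 1.0, size=N)            # design 1 below: no outliers
y = np.exp(beta * np.log(x) + eps)            # log y_i = beta * log x_i + eps_i

print(y.mean(), np.corrcoef(np.log(x), np.log(y))[0, 1])
```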

We then consider the following error designs (a sketch of how the contaminated designs can be generated follows the list):

  1. no outliers: \(\varepsilon _{i} \sim \mathcal{N}(0,1)\)

  2. long-tailed errors: \(\varepsilon _{i} \sim \text{Cauchy}_{1}\)

  3. long-tailed errors: \(\varepsilon _{i} \sim \text{t}_{3}\)

  4. vertical outliers: δ % of \(\varepsilon _{i} \sim \mathcal{N}\left (5\sqrt{\chi _{1;0.99 }^{2}},1.5\right )\)

  5. bad leverage points: δ % of \(\varepsilon _{i} \sim \mathcal{N}(10,10)\) with the corresponding \(X \sim \mathcal{N}(-10,10)\).
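For illustration, a hedged sketch of how the contaminated designs 4 and 5 could be generated for a contamination level δ. The scale parameters are read as standard deviations and the leverage x-values are made positive so that log x is defined; both choices are assumptions of this sketch, not statements from the paper:

```python
import numpy as np

rng = np.random.default_rng(5)
N, beta, delta = 20_000, 0.75, 0.15                 # delta = 15 % contamination
x = rng.lognormal(mean=4.0, sigma=1.2, size=N)      # synthetic auxiliary variable

n_out = int(delta * N)
out_idx = rng.choice(N, size=n_out, replace=False)

# Design 4: vertical outliers -- delta % of the errors are shifted far from zero.
eps = rng.normal(0.0, 1.0, size=N)
eps[out_idx] = rng.normal(5 * np.sqrt(6.635), 1.5, size=n_out)   # 6.635 ~ chi-square(1) quantile at 0.99
y_vertical = np.exp(beta * np.log(x) + eps)

# Design 5: bad leverage points -- both the errors and the auxiliary values are contaminated.
eps = rng.normal(0.0, 1.0, size=N)
eps[out_idx] = rng.normal(10.0, 10.0, size=n_out)
x_lev = x.copy()
x_lev[out_idx] = np.abs(rng.normal(-10.0, 10.0, size=n_out))     # abs() keeps log x defined (sketch assumption)
y_leverage = np.exp(beta * np.log(x_lev) + eps)
print(y_vertical.mean(), y_leverage.mean())
```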

The contamination level, i.e. the percentage of outliers in the data, is set to δ = 15 % and 30 %. The three procedures are then used to compute the strata bounds, sizes and allocation:

  • generalized HL method (Rivest 2002)

  • naive robust generalized HL method

  • robust generalized HL method

at a 1 % target precision, and compared by means of the relative MSE of the Horvitz–Thompson estimator of the mean. The main results are displayed in Table 1.

Table 1 Summary of results comparing Robust modified HL method versus modified HL (Rivest 2002), target precision: 1 %