Nonresponse Bias Adjustment in Regression Analysis

Fushiki, Tadayoshi; Maeda, Tadahiko

doi:10.1007/s42519-020-0086-z

Original Article
Published: 21 February 2020

Volume 14, article number 20, (2020)

Journal of Statistical Theory and Practice

Abstract

Nonresponse is an unavoidable problem in most sample surveys. If the proportion of nonrespondents is very small, nonresponse bias may be negligible. However, nonresponse rates in sample surveys have recently increased in many countries, so methods for dealing with nonresponse bias are becoming increasingly important. Regression analysis is often used to analyze survey data. In this paper, we discuss regression analysis with unit nonresponse. The least squares estimator of regression coefficients may be asymptotically biased if nonresponse is not ignorable. We establish a sufficient condition under which a consistent estimator of regression coefficients can be obtained. Whether this condition holds can be determined from a causal diagram. Furthermore, we examine the results of this study through numerical experiments.

1 Introduction

1.1 Background

Nonresponse is an unavoidable problem in most sample surveys. If the proportion of nonrespondents is very small, nonresponse bias may be negligible. However, nonresponse rates in sample surveys have recently increased in many countries. Thus, estimation methods taking nonresponse into account have become more important (for example, [1, 4, 5, 11]).

Regression analysis is often used to analyze survey data. The linear models assumed in regression analysis are generally misspecified. In that situation, the least squares estimator of regression coefficients is asymptotically biased if nonresponse is not ignorable. Even if the linear model is correctly specified, the least squares estimator of regression coefficients is asymptotically biased when nonresponse depends not only on the explanatory variables but also on the response variable.

Studies of nonresponse adjustment have focused on the estimation of population totals, whereas the estimation of regression coefficients has received very little attention. In the present study, we establish a condition under which a consistent estimator of regression coefficients can be obtained by bias adjustment, and we examine the result through numerical experiments.

Missing data arise in various forms in statistical analysis, and there is an extensive literature on this problem (for example, [8]). Regression analysis with missing data has also been studied (for example, [6]). The present study differs from these previous studies in that we treat unit nonresponse and use auxiliary information for bias adjustment.

1.2 Example

For simplicity, we assume sampling with replacement and an infinite population. Let Y be a response variable and X be explanatory variables. Let Z be a random variable that is set to 1 if an individual cooperates in the survey and 0 otherwise.

Here, we consider an example. We are interested in the relationship between income and age. Let Y be income and X be age. The scatter plot of (X, Y) for a sample is shown in the left panel of Fig. 1. From the figure, we can see that income increases rapidly in the twenties, but increases slowly in the fifties. Assume that the right panel of Fig. 1 is the scatter plot for respondents. The figure shows that many young individuals did not respond.

In Fig. 2, the solid line is the regression line of income on age estimated from the respondents, whereas the dotted line is that estimated from the entire sample. The slope obtained from the respondents is less than the slope obtained from the entire sample, and the intercept estimated from the respondents is larger. These results are explained by the fact that the estimate from the respondents is dominated by the observations of older respondents because the response rate of old individuals is higher than that of young individuals.

In this study, we establish a condition that guarantees that a bias-corrected estimator of regression coefficients can be obtained.

2 Nonresponse Bias Adjustment in Regression Analysis

2.1 Condition for Obtaining a Consistent Estimator

Assume that $X_1$ is a part of the explanatory variables X, that U is a set of variables, and that the population information of $(X_1,U)$ is available. In the following, we assume for simplicity that $(X_1,U)$ is discrete. We denote by $X_2$ the remainder of X. If $(Y,X_2)$ and Z are conditionally independent given $(X_1,U)$, we can obtain a consistent estimator of regression coefficients, as we show below. The conditional independence is written as $(Y,X_2) \perp\!\!\!\perp Z \mid (X_1,U)$.

Regression coefficients for the population are given by

$$\begin{aligned} \left( E \left[ \begin{array}{cc} X_1{X_1}^T &{}\quad X_1{X_2}^T\\ X_2{X_1}^T &{}\quad X_2{X_2}^T \end{array} \right] \right) ^{-1} E \left( \begin{array}{c} X_1Y\\ X_2Y \end{array} \right) . \end{aligned}$$

(1)

Therefore, if we obtain consistent estimators for $E(X_i{X_j}^T)$ and $E(X_iY)$, a consistent estimator of regression coefficients can be obtained by applying the continuous mapping theorem (for example, [13]). By using the conditional independence, the following holds:

$$\begin{aligned} E(X_1{X_2}^T) = E_{X_1,U}\{ X_1E({X_2}^T|X_1,U)\} = E_{X_1,U}\{ X_1E({X_2}^T|X_1,U,Z=1)\} . \end{aligned}$$

(2)

Since we can consistently estimate $E({X_2}^T|X_1,U,Z=1)$ based on data from the respondents, a consistent estimator of $E(X_1{X_2}^T)$ can be obtained by using the information on the population distribution of $(X_1,U)$. We can obtain consistent estimators of $E(X_2{X_1}^T)$, $E(X_2{X_2}^T)$ and $E(X_iY)$ in the same way.
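Spelled out for the remaining blocks of (1), the same argument gives the identities below. This is only a written-out version of "in the same way": it uses the conditional independence of $(Y,X_2)$ and Z given $(X_1,U)$ and the fact that $X_1$ is fixed once $(X_1,U)$ is given, and it introduces no further assumption.

$$\begin{aligned} E(X_1{X_1}^T)& = {} E_{X_1,U}( X_1{X_1}^T) ,\\ E(X_2{X_2}^T)& = {} E_{X_1,U}\{ E(X_2{X_2}^T|X_1,U,Z=1)\} ,\\ E(X_1Y)& = {} E_{X_1,U}\{ X_1E(Y|X_1,U,Z=1)\} ,\\ E(X_2Y)& = {} E_{X_1,U}\{ E(X_2Y|X_1,U,Z=1)\} . \end{aligned}$$

The first identity needs only the population distribution of $X_1$; the other three combine stratum-wise averages among respondents with the population distribution of $(X_1,U)$.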

2.2 Graphical Expression for the Conditional Independence

In Sect. 2.1, it was shown that a consistent estimator of regression coefficients can be obtained if the population information of $(X_1,U)$ is available and $(Y,X_2) \perp\!\!\!\perp Z \mid (X_1,U)$ holds. We now consider a graphical expression for the conditional independence.

A causal diagram is a graph expressing causal relationships between variables. We assume that causal relationships are specified and depicted by a directed acyclic graph $G=(V,E)$. Thus, the distribution of the variables of G satisfies a recursive factorization.

If $(Y,X_2)$ and Z are d-separated by $(X_1,U)$, then $(Y,X_2) \perp\!\!\!\perp Z \mid (X_1,U)$ holds. It is also known that $(Y,X_2) \perp\!\!\!\perp Z \mid (X_1,U)$ holds if $(Y,X_2)$ and Z are separated by $(X_1,U)$ in $G^{\mathrm{m}}(Y,X_2,Z,X_1,U)$, which is the moral graph of the smallest ancestral set containing $Y\cup X_2\cup Z\cup X_1\cup U$ (for example, [7]).

Example 1

Let Y be income and X be age. We assume that X affects Y and Z. In Fig. 3, the left panel shows the causal diagram and the right panel shows the moral graph. In the causal diagram, Y and Z are d-separated by X, thus $Y \perp\!\!\!\perp Z \mid X$ holds. Therefore, if the population information of X is available, we can obtain a consistent estimator of the regression coefficients.

Example 2

Let Y be income, $X_1$ be age, and $X_2$ be education level. We assume that gender U affects $X_2$ and Z. In Fig. 4, the left panel shows the causal diagram and the right panel shows the moral graph. In the causal diagram, $(Y,X_2)$ and Z are d-separated by $(X_1,U)$, thus $(Y,X_2) \perp\!\!\!\perp Z \mid (X_1,U)$ holds. Therefore, if the population information of $(X_1,U)$ is available, we can obtain a consistent estimator of the regression coefficients.

Example 3

Let Y be income, $X_1$ be age, and $X_2$ be education level. We assume that marital status U is affected by Y and affects Z. In Fig. 5, the left panel shows the causal diagram and the right panel shows the moral graph. In the causal diagram, $(Y,X_2)$ and Z are d-separated by $(X_1,U)$, thus $(Y,X_2) \perp\!\!\!\perp Z \mid (X_1,U)$ holds. Therefore, if the population information of $(X_1,U)$ is available, we can obtain a consistent estimator of the regression coefficients.
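The moral-graph criterion of this section can be checked mechanically. The following is a minimal sketch, not the authors' code: it builds the smallest ancestral set, moralizes it, removes the separating set, and tests reachability. The edge lists for Examples 2 and 3 are an assumption, read off the text and the simulation designs of Sect. 3, because Figs. 4 and 5 are not reproduced here; the function names are hypothetical.

```python
from collections import deque

def ancestral_set(parents, nodes):
    """Return the given nodes together with all of their ancestors."""
    seen = set(nodes)
    stack = list(nodes)
    while stack:
        v = stack.pop()
        for p in parents.get(v, ()):
            if p not in seen:
                seen.add(p)
                stack.append(p)
    return seen

def moral_graph(parents, keep):
    """Undirected moral graph of the DAG restricted to the node set `keep`."""
    adj = {v: set() for v in keep}
    for child in keep:
        ps = [p for p in parents.get(child, ()) if p in keep]
        for p in ps:                       # drop edge directions
            adj[child].add(p)
            adj[p].add(child)
        for i, p in enumerate(ps):         # join parents of a common child
            for q in ps[i + 1:]:
                adj[p].add(q)
                adj[q].add(p)
    return adj

def separated(parents, a, b, s):
    """True if a and b are separated by s in the moral graph of the
    smallest ancestral set containing a, b and s."""
    a, b, s = set(a), set(b), set(s)
    adj = moral_graph(parents, ancestral_set(parents, a | b | s))
    queue = deque(a - s)
    seen = set(queue)
    while queue:
        v = queue.popleft()
        if v in b:
            return False
        for w in adj[v] - s - seen:
            seen.add(w)
            queue.append(w)
    return True

# Each DAG is written as child -> list of parents (assumed edge lists).
example1 = {"Y": ["X"], "Z": ["X"]}
example2 = {"Y": ["X1", "X2"], "X2": ["U"], "Z": ["X1", "U"]}
example3 = {"Y": ["X1", "X2"], "U": ["Y"], "Z": ["X1", "U"]}

print(separated(example1, {"Y"}, {"Z"}, {"X"}))
print(separated(example2, {"Y", "X2"}, {"Z"}, {"X1", "U"}))
print(separated(example3, {"Y", "X2"}, {"Z"}, {"X1", "U"}))
print(separated(example3, {"Y", "X2"}, {"Z"}, {"X1"}))  # U left out
```

Under these assumed graphs, the first three checks report separation, whereas dropping U from the separating set in Example 3 leaves the path from Y through U to Z open.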

3 Numerical Experiments

In the previous sections, we assumed sampling with replacement and an infinite population. In this section, we perform simulations in which sampling is without replacement and the population is finite. Two different sample sizes are used: 3500, which is typical of social surveys, and 10,000, which is fairly large. The following three examples correspond to the examples of Sect. 2.2.

Example 1

We assume that age X takes values 1, 2, 3, or 4 if the individual is in their twenties, thirties, forties, or fifties, respectively. A population is generated from the distribution

$$\begin{aligned} \Pr (X=1)& = {} \Pr (X=2) = \Pr (X=3) = \Pr (X=4) = 0.25,\\ Y& = {} [-50(X-4)^2+750+\varepsilon ]_{+},\quad \varepsilon \;\sim \; N(0,100^2),\\ \Pr (Z=1|X=i)& = {} 0.2i \quad (i=1,2,3,4), \end{aligned}$$

where $[ x ]_{+}$ is a function returning x if x is positive and 0 otherwise. The size of the population is 100 million. The possibly misspecified regression model

$$\begin{aligned} Y=\beta _0+\beta _1X+\varepsilon ,\quad \varepsilon \,\sim \, N(0,\sigma ^2) \end{aligned}$$

is used in the estimation.

To investigate the properties of the estimators, sampling of size n was repeated 10,000 times, and averages were calculated over these 10,000 replicates. Table 1 shows the means of the estimates of the regression coefficients, with standard deviations in parentheses. The row labeled PopVal contains the values of the regression coefficients for the population. The row labeled NonAdj contains the mean of the ordinary least squares estimates of the regression coefficients based on the data from the respondents. The row labeled PostStr contains the mean of the post-stratification estimates of the regression coefficients. The post-stratification estimate is obtained as follows. First, $E(X_2|X_1,U,Z=1)$, $E(X_2X_2^T|X_1,U,Z=1)$, $E(Y|X_1,U,Z=1)$ and $E(X_2Y|X_1,U,Z=1)$ are estimated by the simple averages within each stratum. Second, each element of (1) is estimated as in (2). Third, the regression coefficients are estimated by substituting these estimates into (1). The results show that the nonadjusted estimate of the slope (intercept) tends to be smaller (larger) than the population value. The post-stratification estimates are distributed around the population values, and their standard deviation becomes smaller as n increases.
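The three post-stratification steps can be sketched for the design of Example 1, where the strata are the four age classes and there is no $X_2$ or U. The sketch makes simplifying assumptions that differ from the experiment above: it draws a single sample of size 3500 directly from the stated distribution rather than from a finite population of 100 million, and it does not repeat the sampling 10,000 times. All function names are hypothetical.

```python
import random

def draw_unit(rng):
    """Draw (x, y, z) from the Example 1 design."""
    x = rng.randint(1, 4)                          # Pr(X = i) = 0.25
    y = max(0.0, -50 * (x - 4) ** 2 + 750 + rng.gauss(0, 100))
    z = rng.random() < 0.2 * x                     # Pr(Z = 1 | X = i) = 0.2 i
    return x, y, z

def line_from_moments(ex, exx, ey, exy):
    """Intercept and slope of the least squares line from the moments in (1)."""
    slope = (exy - ex * ey) / (exx - ex ** 2)
    return ey - slope * ex, slope

def non_adjusted(resp):
    """Ordinary least squares on the respondents only (NonAdj)."""
    n = len(resp)
    ex = sum(x for x, _ in resp) / n
    exx = sum(x * x for x, _ in resp) / n
    ey = sum(y for _, y in resp) / n
    exy = sum(x * y for x, y in resp) / n
    return line_from_moments(ex, exx, ey, exy)

def post_stratified(resp, pop_dist):
    """Post-stratification estimate (PostStr) with strata defined by X.

    Assumes every stratum contains at least one respondent.
    """
    cell_mean = {}
    for x in pop_dist:
        ys = [y for xi, y in resp if xi == x]
        cell_mean[x] = sum(ys) / len(ys)           # estimate of E(Y | X = x, Z = 1)
    # Combine stratum means with the known population distribution of X.
    ex = sum(p * x for x, p in pop_dist.items())
    exx = sum(p * x * x for x, p in pop_dist.items())
    ey = sum(p * cell_mean[x] for x, p in pop_dist.items())
    exy = sum(p * x * cell_mean[x] for x, p in pop_dist.items())
    return line_from_moments(ex, exx, ey, exy)

rng = random.Random(0)
sample = [draw_unit(rng) for _ in range(3500)]
respondents = [(x, y) for x, y, z in sample if z]
pop_dist = {1: 0.25, 2: 0.25, 3: 0.25, 4: 0.25}

b0_na, b1_na = non_adjusted(respondents)
b0_ps, b1_ps = post_stratified(respondents, pop_dist)
print(f"NonAdj : intercept {b0_na:.1f}, slope {b1_na:.1f}")
print(f"PostStr: intercept {b0_ps:.1f}, slope {b1_ps:.1f}")
```

As a rough check derived from the stated design (not taken from Table 1), and ignoring the negligible truncation at zero, the population line is about 200 + 150X, while the least squares line among respondents converges to about 260 + 130X. A single run should therefore show a smaller slope and a larger intercept for NonAdj than for PostStr.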

Table 1 The mean of estimates (Example 1)

Example 2

As in Example 1, we assume that age $X_1$ takes values 1, 2, 3, or 4. The variable $X_2$ representing an individual’s education level is 1 if the individual is a university graduate and 0 otherwise. Gender U is 1 for a man and 0 for a woman.

A population of size 100 million is generated from the distribution

$$\begin{aligned} \Pr (U=1)& = {} \Pr (U=0) = 0.5,\\ \Pr (X_1=1) & = {} \Pr (X_1=2) = \Pr (X_1=3) = \Pr (X_1=4) = 0.25,\\ \Pr (X_2=1|U=1)& = {} 0.4,\; \Pr (X_2=1|U=0) = 0.2,\\ Y& = {} \left\{ \begin{array}{ll} {[}-50(X_1-4)^2+750+\varepsilon {]}_{+} &{} \text{ if } X_2=1\\ {[}-30(X_1-4)^2+500+\varepsilon {]}_{+} &{} {\text{ if }} X_2=0 \end{array} \right. ,\quad \varepsilon \;\sim \; N(0,100^2),\\ \Pr (Z=1|X_1=i,U=j)& = {} 0.2i-0.2j+0.1 \quad (i=1,2,3,4,\; j=0,1). \end{aligned}$$

The regression model

$$\begin{aligned} Y=\beta _0+\beta _1X_1+\beta _2X_2+\varepsilon ,\;\; \varepsilon \,\sim \, N(0,\sigma ^2) \end{aligned}$$

is used in the estimation.

To investigate the properties of the estimators, sampling of size n was repeated 10,000 times as in Example 1. Table 2 shows the means of the estimates of the regression coefficients. PopVal, NonAdj, and PostStr have the same meaning as in Example 1. The row labeled Rak contains the mean of the raking estimates of the regression coefficients with auxiliary variable $(X_1,U)$. The raking estimates are obtained by the weighted least squares method, where the weight for each respondent is determined by the raking procedure (for example, [10]). The row labeled Rak1 (Rak2) contains the mean of the raking estimates of the regression coefficients with auxiliary variable $X_1$ (U). In NonAdj, the estimates of $\beta _0$ and $\beta _2$ are biased upward and the estimate of $\beta _1$ is biased downward. Post-stratification removes almost all of the bias. Rak leaves more bias than post-stratification. In Rak2, the bias adjustment has almost no effect.
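For readers unfamiliar with raking, the weights can be computed by iterative proportional fitting: respondent weights are rescaled in turn so that each auxiliary variable matches its population margin. The sketch below is a generic illustration under assumptions, not the exact procedure of [10] as used in the experiment; the function name `rake`, the stopping rule, and the respondent counts are hypothetical.

```python
def rake(cells, targets, n_iter=100, tol=1e-10):
    """Iterative proportional fitting of respondent weights.

    cells   : list of tuples, one per respondent, e.g. (x1, u)
    targets : list of dicts, one per raking variable, mapping each
              category to its population proportion
    Returns one weight per respondent; the weights sum to 1.
    Assumes every category has at least one respondent.
    """
    n = len(cells)
    w = [1.0 / n] * n
    for _ in range(n_iter):
        max_change = 0.0
        for k, target in enumerate(targets):
            # Current weighted share of each category of variable k.
            share = {c: 0.0 for c in target}
            for wi, cell in zip(w, cells):
                share[cell[k]] += wi
            # Rescale so that variable k matches its population margin.
            factor = {c: target[c] / share[c] for c in target}
            w = [wi * factor[cell[k]] for wi, cell in zip(w, cells)]
            max_change = max(max_change,
                             max(abs(f - 1.0) for f in factor.values()))
        if max_change < tol:
            break
    return w

# Hypothetical respondent counts by (age class, gender); not data from the paper.
counts = {(1, 0): 30, (2, 0): 50, (3, 0): 70, (4, 0): 90,
          (1, 1): 10, (2, 1): 30, (3, 1): 50, (4, 1): 70}
cells = [cell for cell, k in counts.items() for _ in range(k)]
age_margin = {1: 0.25, 2: 0.25, 3: 0.25, 4: 0.25}
gender_margin = {0: 0.5, 1: 0.5}
weights = rake(cells, [age_margin, gender_margin])
```

Raking matches only the two margins, whereas post-stratification matches the full cross-classification of $(X_1,U)$, which is one plausible reason why the two adjustments can behave differently.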

Table 2 The mean of estimates (Example 2)

Example 3

As in Example 2, we assume that age $X_1$ takes values 1, 2, 3, or 4 and education level $X_2$ takes values 0 or 1. Marital status U is 1 if the individual is married and 0 otherwise.

A population of size 100 million is generated from the distribution

$$\begin{aligned} \Pr (X_1=1)& = {} \Pr (X_1=2) = \Pr (X_1=3) = \Pr (X_1=4) = 0.25,\\ \Pr (X_2=1)& = {} 0.4,\\ Y& = {} \left\{ \begin{array}{ll} {[}-50(X_1-4)^2+750+\varepsilon {]}_{+} &{} \text{ if } X_2=1\\ {[}-30(X_1-4)^2+500+\varepsilon {]}_{+} &{} \text{ if } X_2=0 \end{array} \right. ,\quad \varepsilon \;\sim \; N(0,100^2),\\ \Pr (U=1|X_1,Y)& = {} \frac{1}{1+\exp (-0.004Y+1)},\\ \Pr (Z=1|X_1=i,U=j)& = {} 0.1i+0.3j+0.1 \quad (i=1,2,3,4,\; j=0,1). \end{aligned}$$

The regression model

$$\begin{aligned} Y=\beta _0+\beta _1X_1+\beta _2X_2+\varepsilon ,\;\; \varepsilon \,\sim \, N(0,\sigma ^2) \end{aligned}$$

is used in the estimation.

To investigate the properties of the estimators, sampling of size n was repeated 10,000 times as in Examples 1 and 2. Table 3 shows the means of the estimates of the regression coefficients. PopVal, NonAdj, PostStr, Rak, Rak1, and Rak2 have the same meaning as in Example 2. In NonAdj, the estimates of $\beta _0$ and $\beta _2$ are biased upward and the estimate of $\beta _1$ is biased downward. Post-stratification removes almost all of the bias. Rak leaves more bias than post-stratification. In Rak1, the estimates of $\beta _1$ and $\beta _2$ have small biases, whereas in Rak2 only the estimate of $\beta _0$ has a small bias.

Table 3 The mean of estimates (Example 3)

4 Application to the SSP-I2010 Survey Data

In this section, the “Interview Survey for Stratification and Social Psychology in 2010” (SSP-I2010 Survey) is analyzed. The survey was administered to 3500 Japanese males and females aged 25–59 (at the end of 2009) through face-to-face interviews. The main purpose of this survey was to investigate factors affecting individual stratum identification (economic/social status perception) and public consciousness of economic inequality in Japan. The number of respondents was 1763, yielding a response rate of $50.4\%$. Detailed information on the survey can be found in SSP Project [12].

In the following analysis, only data for males are used. The number of males in the entire sample is 1717, and the number of observations used in the analysis is 701. The response variable Y is individual stratum identification, measured on a scale from 1 (upper) to 10 (lower). The explanatory variables are age $X_{(1)}$, education level $X_{(2)}$, EGP class category $X_{(3)}$, which measures the occupational prestige of a person [2, 3], and annual income $X_{(4)}$. Education level is divided into three categories (primary, secondary, and higher). Income (in units of ten thousand yen) is categorized as follows: 0, 1–199, 200–399, 400–699, 700–999, 1000–1499, and 1500 or more. The auxiliary variables used are city size $U_{(1)}$, occupational status $U_{(2)}$, marital status $U_{(3)}$, house ownership $U_{(4)}$, household composition $U_{(5)}$, and duration of residence $U_{(6)}$. Details of the auxiliary variables are shown in Table 4.

Table 4 Variables used in the analysis of the SSP-I2010 Survey

We assume the causal relationships shown in Fig. 6. From Fig. 6, $(Y,X_2) \perp\!\!\!\perp Z \mid (X_1,U)$ holds, where $X_1=X_{(1)}$, $X_2=(X_{(2)},X_{(3)},X_{(4)})$, and $U=(U_{(1)},\ldots ,U_{(6)})$. Population information for $X_{(1)}$ and each $U_{(i)}$ can be obtained from the 2010 Population Census of Japan (Ministry of Internal Affairs and Communications, [9]).

Estimated regression coefficients are shown in Table 5, where Age (20s), Education level (primary), EGP class (without occupation), and Income (0) are the reference categories. NonAdj is the ordinary least squares estimate of the regression coefficients. Rak is the weighted least squares estimate, where the weights are determined by the raking procedure with auxiliary variable U. Table 5 shows that the absolute values of the estimated coefficients for education level tend to be smaller in Rak, whereas those for EGP class tend to be larger.

Table 5 Estimated regression coefficients

5 Summary

In this study, we considered a nonresponse bias adjustment problem in regression analysis. If the linear model assumed in regression analysis is not correct, the least squares estimator based on the respondents may be biased. We established a sufficient condition that allows one to obtain a consistent estimator. Whether the condition holds is determined by the causal diagram.

For the analysis, we first create a causal diagram based on prior knowledge. Next, we find auxiliary information U that satisfies the conditional independence $(Y,X_2) \perp\!\!\!\perp Z \mid (X_1,U)$ and for which population information of $(X_1,U)$ is available. Finally, the weighted least squares estimate is calculated, where the weight for each respondent is obtained by the calibration procedure [11] with auxiliary variables $(X_1,U)$.
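The final step can be sketched as follows. Given respondent weights, the weighted least squares estimate replaces each expectation in (1) by a weighted sum over respondents and solves the resulting normal equations. This is a minimal pure-Python illustration, not the authors' implementation; the helper names, the toy data, and the weights are hypothetical, and in practice the weights would come from the calibration step.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]    # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):                      # back substitution
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x

def weighted_least_squares(rows, y, w):
    """Coefficients minimising sum_i w_i (y_i - x_i^T beta)^2.

    rows : covariate vectors x_i, each with a leading 1 for the intercept
    y    : responses
    w    : positive respondent weights
    """
    p = len(rows[0])
    xtwx = [[sum(wi * r[j] * r[k] for wi, r in zip(w, rows)) for k in range(p)]
            for j in range(p)]
    xtwy = [sum(wi * r[j] * yi for wi, r, yi in zip(w, rows, y)) for j in range(p)]
    return solve(xtwx, xtwy)

# Toy data lying exactly on y = 1 + 2*x1 - 3*x2, with arbitrary weights.
rows = [[1, 1, 0], [1, 2, 1], [1, 3, 0], [1, 4, 1], [1, 2, 0]]
y = [1 + 2 * r[1] - 3 * r[2] for r in rows]
w = [0.1, 0.3, 0.2, 0.3, 0.1]
beta = weighted_least_squares(rows, y, w)
print([round(b, 6) for b in beta])
```

Because the toy responses lie exactly on a plane, any positive weights recover the same coefficients; with real survey data the weights change the estimate, which is the purpose of the adjustment.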

In this study, linear regression models were analyzed. In real data analysis, a generalized linear regression model such as a logistic regression model is often used. The sufficient condition established in this study is also valid for the generalized linear regression model. Thus, the condition can be widely applied to analyze survey data.

References

1. Bethlehem J, Cobben F, Schouten B (2011) Handbook of nonresponse in household surveys. Wiley, New Jersey
2. Erikson R, Goldthorpe JH, Portocarero L (1979) Intergenerational class mobility in three Western European societies. Br J Sociol 30:415–441
3. Ganzeboom HBG, Treiman D (1996) Internationally comparative measures of occupational status for 1988 international standard classification of occupations. Soc Sci Res 25:201–239
4. Groves RM, Couper MP (1998) Nonresponse in household interview surveys. Wiley, New York
5. Groves RM, Dillman D, Eltinge JL, Little RJA (eds) (2002) Survey nonresponse. Wiley, New York
6. Ibrahim JG, Chen M-H, Lipsitz SR, Herring AH (2005) Missing data methods for generalized linear models: a comparative review. J Am Stat Assoc 100:332–346
7. Lauritzen SL (1996) Graphical models. Oxford University Press, Oxford
8. Little RJA, Rubin DB (2002) Statistical analysis with missing data, 2nd edn. Wiley, Hoboken
9. Ministry of Internal Affairs and Communications (2014) 2010 Population Census of Japan. http://www.e-stat.go.jp/. Accessed 19 Feb 2020
10. Oh HL, Scheuren FJ (1983) Weighting adjustment for unit nonresponse. In: Madow WG, Olkin I, Rubin DB (eds) Incomplete data in sample surveys, vol 2. Academic Press, New York
11. Särndal CE, Lundström S (2005) Estimation in surveys with nonresponse. Wiley, Chichester
12. SSP Project (2013) Codebooks and basic summary tables of the SSP-I2010 Survey, Osaka: Author (in Japanese)
13. van der Vaart AW (1998) Asymptotic statistics. Cambridge University Press, Cambridge

Acknowledgements

This work was supported by JSPS KAKENHI Grant Number 15K00043.

Author information

Authors and Affiliations

Niigata University, Niigata, Japan
Tadayoshi Fushiki
The Institute of Statistical Mathematics/Joint Support-Center for Data Science Research, Tachikawa, Japan
Tadahiko Maeda

Corresponding author

Correspondence to Tadayoshi Fushiki.

Ethics declarations

Conflict of interest

On behalf of all authors, the corresponding author states that there is no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Fushiki, T., Maeda, T. Nonresponse Bias Adjustment in Regression Analysis. J Stat Theory Pract 14, 20 (2020). https://doi.org/10.1007/s42519-020-0086-z
