1 Introduction

Quantile regression, first proposed by Koenker and Bassett (1978), has become an important statistical method. By considering different quantiles, quantile regression provides a more complete description of the conditional distribution of responses given covariates. In addition, compared to mean regression, quantile regression demonstrates robustness in the presence of heavy-tailed errors. A detailed review of quantile regression can be found in Koenker et al. (2017).

Traditional quantile regression assumes that the data are fully observed, that is, that there is no missingness or measurement error. In many applications, however, especially in biomedical and social science studies, this assumption may be violated. It is well known that ignoring measurement errors and missing data may produce large biases in the regression coefficients (Carroll et al. 1995; Little and Rubin 2002). Therefore, when measurement errors and missingness coexist, both problems must be handled to obtain reliable results. Depending on the missingness mechanism, Little and Rubin (2002) defined three types of missingness: missing completely at random, missing at random (MAR), and missing not at random (MNAR). In this paper, we consider data that are MNAR, also called nonignorable missing data.

In the context of quantile regression, numerous methods have been proposed to handle measurement errors or nonignorable missing responses separately. For quantile regression with covariate measurement errors, He and Liang (2006) studied an orthogonal regression method under the assumption that the regression errors and the measurement errors follow the same symmetric distribution; this assumption limits the flexibility of the model. Wei and Carroll (2009) established joint estimating equations and developed an iterative estimation procedure, which produces a consistent estimator but is computationally complex. Wang et al. (2012) developed a smooth corrected quantile estimation procedure that avoids the symmetry assumption and is simple to implement. Nonignorable missing responses in quantile regression have also been studied. For example, based on an instrumental variable, Zhao et al. (2017) considered an empirical likelihood method for linear models, and Ding et al. (2020) introduced a regularized estimation for ultrahigh-dimensional data. For further literature, see Jiang et al. (2016), Ma et al. (2022) and Yu et al. (2022), among others.

To the best of our knowledge, little literature in quantile regression addresses the concurrent biases arising from nonignorable missing responses and covariate measurement errors. We focus on this topic in this paper. Specifically, we propose a two-stage procedure for constructing a weighted bias-corrected quantile loss function, which yields a consistent estimator for linear quantile regression models with both covariate measurement errors and nonignorable nonresponse. In the first stage, we employ the bias-corrected quantile loss function to eliminate the bias introduced by measurement errors. Subsequently, a nonresponse instrument and a generalized method of moments (GMM) approach are utilized to estimate the unknown parameters in the propensity. Once the propensity is consistently estimated, in the second stage, we construct a weighted bias-corrected quantile loss function based on the inverse probability weighting (IPW) approach. Furthermore, under some regularity conditions, the asymptotic properties of the proposed estimators are derived.

The remainder of this article is organized as follows. In Sect. 2, the linear quantile regression model with covariate measurement errors and nonignorable missing responses is described. In Sect. 3, we propose a weighted bias-corrected quantile loss function; asymptotic properties of the proposed estimators are also presented in this section. Simulation studies are given in Sect. 4. Section 5 concludes with a discussion. Proofs of the theorems are deferred to Appendix A.

2 Statistical modeling

2.1 Linear quantile regression

For a given quantile level \(\tau \in (0,1)\), consider the following linear quantile regression model

$$\begin{aligned} Y_{i}={\textbf{X}}_{i}^\top \varvec{\beta }_{\tau 0}+e_{i}, \ i=1,2,\ldots ,n, \end{aligned}$$
(1)

where \(Y_{i} \in {\mathbb {R}} \) is the response, \({\textbf{X}}_{i}=(X_{i1},\ldots ,X_{ip})^\top \in {\mathbb {R}}^p\) is the corresponding covariate vector, \(\varvec{\beta }_{\tau 0}\) is a p-dimensional vector of unknown parameters and \(e _{i}\) is an error term satisfying \({\text {Pr}}(e_{i}< 0 \mid {\textbf{X}}_{i})=\tau \). Let \(Q_{Y_i}(\tau \vert {\textbf{X}}_{i})\) be the conditional quantile of \(Y_i\) given \({\textbf{X}}_{i}\); then

$$\begin{aligned} Q_{Y_i}(\tau \vert {\textbf{X}}_{i})={\textbf{X}}_{i}^\top \varvec{\beta }_{\tau 0}. \end{aligned}$$

To simplify the notation, in the remainder of the paper, we omit the subscript \(\tau \) from \(\varvec{\beta }_{\tau 0}\).

When \({\textbf{X}}_{i}\) is measured without an error and \(Y_i\) is fully observed, \(\varvec{\beta }_0\) can be estimated consistently by

$$\begin{aligned} \tilde{\varvec{\beta }}=\underset{\varvec{\beta }}{\text {argmin}}\sum _{i=1}^{n} \rho \left( Y_i,{\textbf{X}}_{i},\varvec{\beta }\right) , \end{aligned}$$
(2)

where \(\rho (Y,{\textbf{X}},\varvec{\beta })=\rho _\tau (Y-{\textbf{X}}^\top \varvec{\beta })\), \(\rho _{\tau }(t)=\tau t-t \textrm{I}(t<0)\) is the quantile loss function and \(\textrm{I}(\cdot )\) is the indicator function.
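As a concrete illustration, the check loss and the minimization in (2) can be sketched as follows (a minimal Python sketch; the paper's own computations use R, and the function names here are ours):

```python
import numpy as np
from scipy.optimize import minimize

def check_loss(t, tau):
    """Quantile (check) loss: rho_tau(t) = tau*t - t*I(t < 0) = t*(tau - I(t < 0))."""
    t = np.asarray(t, dtype=float)
    return t * (tau - (t < 0))

def fit_quantile_regression(Y, X, tau):
    """Minimize the summed check loss over beta, as in (2).
    Nelder-Mead is used because the check loss is not differentiable at 0."""
    beta0 = np.linalg.lstsq(X, Y, rcond=None)[0]  # least-squares starting value
    obj = lambda b: np.sum(check_loss(Y - X @ b, tau))
    return minimize(obj, beta0, method="Nelder-Mead").x
```

In practice one would use a dedicated linear-programming solver (as R's quantreg does); the direct minimization above is only meant to make the definition of (2) concrete.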

2.2 Measurement error process

Assume that \({\textbf{X}}_i\) is measured with error and consider the following additive measurement error model

$$\begin{aligned} {\textbf{W}}_{i}={\textbf{X}}_{i}+{\textbf{U}}_{i} , \ i=1,2,\ldots ,n, \end{aligned}$$

where \({\textbf{U}}_{i} \in {\mathbb {R}}^p\) follows a certain distribution with mean \({\textbf{0}}\) and covariance matrix \(\varvec{\Sigma }\), and is independent of \({\textbf{X}}_{i}\) and \(Y_{i}\). In the subsequent sections, our focus is on two types of measurement errors: normal and Laplace, as these error distributions provide reasonable error models in many applications (Wang et al. 2012). Compared to the normal distribution, the Laplace distribution has heavier tails, so random variables that follow the Laplace distribution are more likely to have extreme values.

In practice, often not all covariates are measured with errors. In this paper, we therefore suppose that only the first q (\(q<p\)) components of \({\textbf{X}}\) have measurement errors, then

$$\begin{aligned} \varvec{\Sigma }=\left( \begin{array}{cc} \varvec{\Sigma }^{\prime }_{q \times q} &{} \quad {\textbf {0}} \\ {\textbf {0}}^\top &{} \quad {\textbf {0}} \end{array}\right) , \end{aligned}$$

where \(\varvec{\Sigma }^{\prime }_{q \times q}\) is a \(q \times q\) matrix.

2.3 Nonignorable missing process

Consider the case where \(Y_i\) is subject to nonignorable missingness. Let \(\delta _i\) be a binary response indicator that equals 1 if and only if \(Y_i\) is observed. In this case, the propensity \({\text {Pr}}(\delta _{i}=1 \mid {\textbf{W}}_{i},Y_{i})\) is not identifiable without further assumptions. To address the identifiability problem, similar to the method of Wang et al. (2014), we assume that \({\textbf{W}}_{i}\) can be decomposed into two parts \({\textbf{W}}_{i}=({\textbf{V}}_i,{\textbf{Z}}_i)\), such that

$$\begin{aligned} {\text {Pr}}(\delta _i=1 \mid {\textbf{W}}_{i},Y_i)={\text {Pr}}(\delta _i=1 \mid {\textbf{V}}_{i},Y_i). \end{aligned}$$
(3)

Furthermore, we impose a parametric model on the propensity

$$\begin{aligned} {\text {Pr}}(\delta _i=1 \mid {\textbf{V}}_{i}, Y_i)=\Psi \left( \alpha _1+{\varvec{\alpha }_2^{\top }{\textbf{V}}_{i}} +\alpha _3 Y_i\right) , \end{aligned}$$
(4)

where \(\varvec{\alpha }=(\alpha _1,\varvec{\alpha }_2^\top , \alpha _3)^\top \) is a \(d_{\alpha }\)-dimensional unknown parameter vector and \(\Psi \) is a known monotone function taking values in [0, 1]. Popular choices of \(\Psi \) include the cLog-log model with \(\Psi (t)=1-\exp [-\exp (t)]\), the probit model with \(\Psi \) being the standard normal distribution function, and the logistic model with \(\Psi (t)=\exp (t)/[1+\exp (t)]\). Equation (3) shows that, given \(Y_i\) and \({\textbf{V}}_{i}\), \({\textbf{Z}}_i\) can be excluded from the propensity; this fact will be used to construct estimating equations for the unknown parameter vector \(\varvec{\alpha }\) and to ensure that \(\Psi \) is identifiable. \({\textbf{Z}}_i\) is referred to as a nonresponse instrument.
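The three link functions just listed can be written out explicitly (a trivial sketch; function names are ours):

```python
import numpy as np
from math import erf, sqrt

def logistic(t):
    """Psi(t) = exp(t) / [1 + exp(t)]."""
    return 1.0 / (1.0 + np.exp(-t))

def probit(t):
    """Psi = standard normal distribution function, written via erf."""
    return 0.5 * (1.0 + erf(t / sqrt(2.0)))

def cloglog(t):
    """Complementary log-log: Psi(t) = 1 - exp[-exp(t)]."""
    return 1.0 - np.exp(-np.exp(t))
```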

3 Inference method

3.1 Weighted corrected-loss estimation

The quantile regression estimator \(\tilde{\varvec{\beta }}\) obtained by (2) satisfies

$$\begin{aligned} n^{-1}\sum \limits _{i=1}^n {\textbf{X}}_i \left\{ \textrm{I} \left( Y_i-{\textbf{X}}_i^\top \tilde{\varvec{\beta }}<0 \right) -\tau \right\} =o_p(1). \end{aligned}$$

Under model (1), \({\text {Pr}}(Y<{\textbf{X}}^\top \varvec{\beta }_0\mid {\textbf{X}})=\tau \), so

$$\begin{aligned} {\mathbb {E}}\left[ {\textbf{X}} \left\{ \textrm{I} \left( Y-{\textbf{X}}^\top \varvec{\beta }_0<0 \right) -\tau \right\} \right] =0, \end{aligned}$$
(5)

so \(\varphi (Y,{\textbf{X}},\varvec{\beta })={\textbf{X}}\{\textrm{I}(Y-{\textbf{X}}^\top \varvec{\beta }<0)-\tau \}\) is an unbiased estimating function of \(\varvec{\beta }_0\). When \({\textbf{X}}_i\) is measured with error, replacing \({\textbf{X}}_i\) in (2) with the surrogate variable \({\textbf{W}}_i\) usually results in an inconsistent estimator, because \({\mathbb {E}}\left[ {\textbf{W}}\{\textrm{I}(Y-{\textbf{W}}^\top \varvec{\beta }_0<0)-\tau \}\right] =0\) may not be satisfied. To account for the measurement error, we adopt the approach proposed by Wang et al. (2012).

Assume that \({\textbf{U}}_i\sim {\mathcal {N}}({\textbf{0}},\varvec{\Sigma })\) is a p-dimensional normal random vector, and define

$$\begin{aligned} \rho _{\mathcal {N}}(\epsilon _1,h)=\epsilon _1 \left\{ \tau -1/2+G_{\mathcal {N}}(\epsilon _1/h)\right\} , \end{aligned}$$

where \(\epsilon _1\sim {\mathcal {N}}(\mu ,\sigma ^2)\), \(G_{\mathcal {N}}(x) = \pi ^{-1}\int _0^x \sin (t)/t \,\textrm{d}t\) and h is a smoothing parameter. \(\rho _{\mathcal {N}}(\epsilon _1,h)\) offers a smooth approximation to \(\rho _{\tau }(\epsilon _1)\). Let

$$\begin{aligned} \begin{aligned} f\left( \epsilon _1,\sigma ^2,h\right)&={\mathbb {E}}\left[ \rho _{\mathcal {N}}(\epsilon _1+\sqrt{-1}\sigma u,h)\mid \epsilon _1\right] \\&=\epsilon _1(\tau -1/2)+\pi ^{-1} \\&\quad \times \int _0^{1/h}\{t^{-1}\epsilon _1\sin (t\epsilon _1)-\sigma ^2\cos (t\epsilon _1)\}\exp (t^2\sigma ^2/2) \textrm{d}t. \end{aligned} \end{aligned}$$
(6)

where \(u\sim {\mathcal {N}}(0,1)\) is independent of \(\epsilon _1\). Note that \((Y-{\textbf{W}}^\top \varvec{\beta })\mid (Y,{\textbf{X}})\sim \) \({\mathcal {N}}(Y-{\textbf{X}}^\top \varvec{\beta },\varvec{\beta }^\top \varvec{\Sigma }\varvec{\beta })\); then, motivated by Wang et al. (2012), the bias-corrected quantile loss function of model (1) involving only normal measurement error is defined as

$$\begin{aligned} \rho _{\mathcal {N}}(Y,{\textbf{W}},\varvec{\beta },h) =f\left( Y-{\textbf{W}}^\top \varvec{\beta },\varvec{\beta }^\top \varvec{\Sigma }\varvec{\beta },h\right) . \end{aligned}$$
(7)
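Equation (6) can be transcribed numerically; the sketch below uses scipy's quad for the one-dimensional integral. The integrand contains the factor \(\exp (t^2\sigma ^2/2)\), so the evaluation is only stable for moderate \(1/h\) and small \(\sigma ^2\):

```python
import numpy as np
from scipy.integrate import quad

def corrected_normal_loss(eps, sigma2, tau, h):
    """f(eps, sigma^2, h) of Eq. (6), with eps = Y - W'beta and
    sigma2 = beta' Sigma beta; with sigma2 = 0 it reduces to rho_N(eps, h)."""
    def integrand(t):
        first = eps * np.sin(t * eps) / t if t > 0 else eps * eps  # limit at t = 0
        return (first - sigma2 * np.cos(t * eps)) * np.exp(t**2 * sigma2 / 2)
    val, _ = quad(integrand, 0.0, 1.0 / h, limit=200)
    return eps * (tau - 0.5) + val / np.pi
```

As h shrinks with \(\sigma ^2=0\), the value approaches the ordinary check loss, which is the smoothing property used throughout this section.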

Next, we consider Laplace measurement error. Suppose that \({\textbf{U}}_{i}\) is a p-dimensional Laplace random vector, denoted \({\textbf{U}}_{i}\sim {\mathcal {L}}({\textbf{0}},\varvec{\Sigma })\). Let \(\epsilon _2=Y-{\textbf{W}}^\top \varvec{\beta }\); then \(\epsilon _2\mid (Y,{\textbf{X}})\sim {\mathcal {L}}(Y-{\textbf{X}}^\top \varvec{\beta },\varvec{\beta }^\top \varvec{\Sigma }\varvec{\beta })\). The corrected quantile loss function of model (1) involving only Laplace measurement error is then defined as

$$\begin{aligned} \begin{aligned} \rho _{\mathcal {L}}(Y,{\textbf{W}},\varvec{\beta },h)&=\rho _{\mathcal {L}}(\epsilon _2,h)-\frac{\sigma ^2}{2}\frac{\partial ^2\rho _{\mathcal {L}}(\epsilon _2,h)}{\partial \epsilon _2^2} \\&=\epsilon _2(\tau -1)+\epsilon _2 G_{\mathcal {L}}\left( \dfrac{\epsilon _2}{h}\right) \\&\quad -\dfrac{\sigma ^2}{2} \left\{ \dfrac{2}{h}K\left( \dfrac{\epsilon _2}{h}\right) +\dfrac{\epsilon _2}{h^2}K^{\prime }\left( \dfrac{\epsilon _2}{h}\right) \right\} , \end{aligned} \end{aligned}$$
(8)

where \(\rho _{\mathcal {L}}(\epsilon _2,h)=\epsilon _2\{\tau -1+G_{\mathcal {L}}(\epsilon _2/h)\}\), \(G_{\mathcal {L}}(x)=\int _{t<x} K(t) \textrm{d}t\), \(K(\cdot )\) is a kernel density function and \(\sigma ^2=\varvec{\beta }^\top \varvec{\Sigma }\varvec{\beta }\). Direct calculations show that, as \(h \rightarrow 0\),

$$\begin{aligned} {\mathbb {E}}^{*}[\rho _{\mathcal {N}}(Y,{\textbf{W}},\varvec{\beta },h)]&=\rho _{\mathcal {N}}(Y-{\textbf{X}}^\top \varvec{\beta },h) \triangleq \dot{\rho }_{\mathcal {N}}(Y,{\textbf{X}},\varvec{\beta },h) \longrightarrow \rho (Y,{\textbf{X}},\varvec{\beta }),\nonumber \\ {\mathbb {E}}^{*}[\rho _{\mathcal {L}}(Y,{\textbf{W}},\varvec{\beta },h)]&=\rho _{\mathcal {L}}(Y-{\textbf{X}}^\top \varvec{\beta },h) \triangleq \dot{\rho }_{\mathcal {L}}(Y,{\textbf{X}},\varvec{\beta },h) \longrightarrow \rho (Y,{\textbf{X}},\varvec{\beta }), \end{aligned}$$
(9)

where \({\mathbb {E}}^{*}\) is the expectation with respect to \({\textbf{W}}\) given Y and \({\textbf{X}}\). Hence, the minimizers of \(\sum _{i=1}^n\rho _{{\mathcal {N}}}(Y_i,{\textbf{W}}_i,\varvec{\beta },h)\) and \(\sum _{i=1}^n\rho _{{\mathcal {L}}}(Y_i,{\textbf{W}}_i,\varvec{\beta },h)\) are consistent estimators of \(\varvec{\beta }_0\).
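The Laplace-corrected loss (8) is available in closed form once a kernel is fixed. The sketch below takes K to be the standard normal density (one admissible choice, not prescribed by the paper), for which \(G_{\mathcal {L}}=\Phi \) and \(K^{\prime }(t)=-t\,K(t)\):

```python
import numpy as np
from math import erf, sqrt, pi

def corrected_laplace_loss(eps, sigma2, tau, h):
    """Eq. (8) with Gaussian kernel K: the smooth loss rho_L(eps, h) minus the
    second-order correction (sigma2/2) * d^2 rho_L / d eps^2."""
    K = lambda t: np.exp(-t * t / 2) / sqrt(2 * pi)   # kernel density
    G = lambda t: 0.5 * (1 + erf(t / sqrt(2)))        # G_L = Phi
    u = eps / h
    smooth = eps * (tau - 1 + G(u))                   # rho_L(eps, h)
    correction = 0.5 * sigma2 * (2 / h * K(u) + eps / h**2 * (-u * K(u)))
    return smooth - correction
```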

When Y has nonignorable missing values, however, the consistency above no longer holds. To eliminate the effect of the missingness, the IPW method is employed to adjust the bias-corrected quantile loss functions (7) and (8), resulting in the following weighted bias-corrected quantile loss functions

$$\begin{aligned} {\rho ^\star _{\mathcal {N}}}\left( Y, {\textbf{W}}, \varvec{\beta }, h, \delta ,\varvec{\alpha }\right)&=\frac{\delta }{\Delta ({\textbf{V}}, Y, \varvec{\alpha })} \rho _{\mathcal {N}}\left( Y, {\textbf{W}}, \varvec{\beta }, h\right) ,\nonumber \\ {\rho ^\star _{\mathcal {L}}}\left( Y, {\textbf{W}}, \varvec{\beta }, h,\delta , \varvec{\alpha }\right)&=\frac{\delta }{\Delta ({\textbf{V}}, Y, \varvec{\alpha })} \rho _{\mathcal {L}}\left( Y, {\textbf{W}}, \varvec{\beta }, h\right) , \end{aligned}$$
(10)

where \({\Delta }({\textbf{V}}, Y, \varvec{\alpha })=\Psi \left( \alpha _1+\varvec{\alpha }_2^\top {\textbf{V}}+\alpha _3 Y\right) \). One obstacle remains in (10): \(\varvec{\alpha }\) is unknown.

To estimate the unknown parameter \(\varvec{\alpha }\), we construct the following estimating equation

$$\begin{aligned} g(Y, {\textbf{W}}, \delta , \varvec{\alpha })={\eta }({\textbf{W}})\left[ \frac{\delta }{\Delta ({\textbf{V}}, Y, \varvec{\alpha })}-1\right] , \end{aligned}$$

where \(\eta ({\textbf{W}})\) is a known vector-valued function with dimension \(d_{\eta } \ge d_{\alpha }\). When \(d_{\eta }=d_{\alpha }\), the estimator \({\varvec{\hat{\alpha }}}\) is obtained by solving \(\sum _{i=1}^n g\left( Y_i,{\textbf{W}}_i, \delta _i,\varvec{\alpha }\right) =0\). When \(d_{\eta }>d_{\alpha }\), we apply the GMM (Hansen 1982) approach as follows

$$\begin{aligned} {\varvec{\hat{\alpha }}}=\underset{\varvec{\alpha }}{\arg \min } {\bar{g}}(\varvec{\alpha })^{\top } {\varvec{\hat{\Omega }}}^{-1} {\bar{g}}(\varvec{\alpha }), \end{aligned}$$

where \({\bar{g}}(\varvec{\alpha })=n^{-1} \sum _{i=1}^n g\left( Y_i,{\textbf{W}}_i, \delta _i, \varvec{\alpha }\right) \), \({\varvec{\hat{\Omega }}}^{-1}\) is the inverse of the matrix \(n^{-1}\) \(\sum _{i=1}^n g\left( Y_i,{\textbf{W}}_i, \delta _i, {\varvec{\hat{\alpha }}}^{(1)}\right) g\left( Y_i,{\textbf{W}}_i, \delta _i, {\varvec{\hat{\alpha }}}^{(1)}\right) ^{\top }\) and \({\varvec{\hat{\alpha }}}^{(1)}=\underset{\varvec{\alpha }}{\arg \min } {\bar{g}}(\varvec{\alpha })^{\top } {\bar{g}}(\varvec{\alpha })\). Once a consistent estimator \({\varvec{\hat{\alpha }}}\) is obtained, we define the weighted bias-corrected quantile estimators as

$$\begin{aligned}{} & {} {\varvec{\hat{\beta }}}_{\mathcal {N}}=\underset{\varvec{\beta } }{{\text {argmin}}} \sum _{i=1}^n \rho ^\star _{\mathcal {N}}\left( Y_i,{\textbf{W}}_{i}, \varvec{\beta }, h,\delta _i,{\varvec{\hat{\alpha }}}\right) , \end{aligned}$$
(11)
$$\begin{aligned}{} & {} {\varvec{\hat{\beta }}}_{\mathcal {L}}=\underset{\varvec{\beta } }{{\text {argmin}}} \sum _{i=1}^n \rho ^\star _{\mathcal {L}}\left( Y_i,{\textbf{W}}_{i}, \varvec{\beta }, h,\delta _i, {\varvec{\hat{\alpha }}}\right) . \end{aligned}$$
(12)
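The GMM step for \(\varvec{\alpha }\) described above can be sketched as follows (a Python sketch for the logistic \(\Psi \), scalar V and Z, and \(\eta ({\textbf{W}})=(1,V,Z)^\top \), so that \(d_\eta =d_\alpha \) here and the second step is mainly illustrative; names are ours):

```python
import numpy as np
from scipy.optimize import minimize

def gmm_propensity(Y, V, Z, delta, alpha_start):
    """Two-step GMM estimate of alpha = (alpha1, alpha2, alpha3) in the logistic
    propensity Delta = Psi(alpha1 + alpha2*V + alpha3*Y). Entries of Y with
    delta = 0 may hold any value: they enter only through delta/Delta = 0."""
    eta = np.column_stack([np.ones_like(V), V, Z])

    def moments(alpha):
        a1, a2, a3 = alpha
        Delta = 1.0 / (1.0 + np.exp(-(a1 + a2 * V + a3 * Y)))
        r = np.where(delta > 0, 1.0 / Delta, 0.0) - 1.0   # delta/Delta - 1
        return eta * r[:, None]

    gbar = lambda alpha: moments(alpha).mean(axis=0)
    # Step 1: identity weighting matrix.
    a_step1 = minimize(lambda a: gbar(a) @ gbar(a), alpha_start,
                       method="Nelder-Mead").x
    # Step 2: weight by the inverse estimated covariance of the moments.
    G = moments(a_step1)
    W = np.linalg.inv(G.T @ G / len(delta))
    return minimize(lambda a: gbar(a) @ W @ gbar(a), a_step1,
                    method="Nelder-Mead").x
```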

It is not difficult to show that the expectations of \(\rho ^\star _{\mathcal {N}}\left( Y, {\textbf{W}}, \varvec{\beta }, h, \delta ,\hat{\varvec{\alpha }}\right) \) and \(\rho ^\star _{\mathcal {L}}\left( Y, {\textbf{W}}, \varvec{\beta }, h, \delta ,\hat{\varvec{\alpha }}\right) \) with respect to \(\delta \) given Y and \({\textbf{W}}\) equal \(\rho _{\mathcal {N}}\left( Y, {\textbf{W}}, \varvec{\beta }, h\right) \) and \(\rho _{\mathcal {L}}\left( Y, {\textbf{W}}, \varvec{\beta }, h\right) \), respectively. Thus, according to Eq. (9), \(\hat{\varvec{\beta }}_{\mathcal {N}}\) and \(\hat{\varvec{\beta }}_{\mathcal {L}}\) are consistent estimators of \(\varvec{\beta }_0\) that can handle both covariate measurement errors and nonignorable missing responses.

Remark 1

The minimization problems (11) and (12) can be solved by the “optim” function in the R software. The initial value of \(\varvec{\beta }\) is obtained by regressing the observed \(Y_i\) on \({\textbf{W}}_i\). The smoothing parameter h can be selected through a simulation-extrapolation-type strategy proposed by Wang et al. (2012).
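A Python analogue of the strategy in Remark 1, shown for the error-free smooth loss \(\rho _{\mathcal {N}}(\epsilon ,h)=\epsilon \{\tau -1/2+G_{\mathcal {N}}(\epsilon /h)\}\) with \(G_{\mathcal {N}}(x)=\pi ^{-1}\,\mathrm {Si}(x)\) (the same optimizer-plus-crude-start recipe carries over to the weighted corrected losses in (11) and (12); this is a sketch, not the paper's R code):

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import sici  # sici(x) returns (Si(x), Ci(x))

def smooth_loss(eps, tau, h):
    """rho_N(eps, h) = eps * {tau - 1/2 + Si(eps/h)/pi}: a differentiable
    surrogate for the check loss."""
    return eps * (tau - 0.5 + sici(eps / h)[0] / np.pi)

def fit_smoothed(Y, X, tau, h=0.1):
    """Gradient-based minimization started from a crude regression fit,
    mirroring the initial-value strategy of Remark 1."""
    beta0 = np.linalg.lstsq(X, Y, rcond=None)[0]
    obj = lambda b: np.sum(smooth_loss(Y - X @ b, tau, h))
    return minimize(obj, beta0, method="BFGS").x
```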

3.2 Large sample properties

Theorem 1

When the measurement error \({\textbf{U}}_{i} \sim {\mathcal {L}}({\textbf {0}}, \varvec{\Sigma })\), suppose that Conditions (C1)–(C4), (C6) and (C8) in Appendix A hold. If \(h \rightarrow 0\) and \((n h)^{-1 / 2} \log (n) \rightarrow 0\), then \(\hat{\varvec{\beta }}_{\mathcal {L}}\) converges to \(\varvec{\beta }_0\) in probability as \(n \rightarrow \infty \).

Theorem 2

When the measurement error \({\textbf{U}}_{i} \sim {\mathcal {N}}({\textbf {0}}, \varvec{\Sigma })\), suppose that Conditions (C1)–(C5) and (C8) in Appendix A hold. If \(h \rightarrow 0\) and \(h=c(\log n)^{-\xi }\), where \(\xi <1 / 2\) and c is a positive constant, then \(\hat{\varvec{\beta }}_{\mathcal {N}}\) converges to \(\varvec{\beta }_0\) in probability as \(n \rightarrow \infty \).

Theorem 3

Suppose that the conditions given in Appendix A hold, that \(\varvec{\alpha }_0 \in \Theta _{\alpha }\) is the unique solution to \({\mathbb {E}}[g(Y, {\textbf{W}},\delta , \varvec{\alpha })]=0\), that \(\varvec{\Lambda }={\mathbb {E}}\left[ \partial g\left( Y, {\textbf{W}}, \delta , \varvec{\alpha }_0\right) / \partial \varvec{\alpha }\right] \) is of full rank, and that \(\varvec{\Omega }={\mathbb {E}}\left[ g\left( Y, {\textbf{W}},\delta , \varvec{\alpha }_0\right) g\left( Y, {\textbf{W}}, \delta , \varvec{\alpha }_0\right) ^{\top }\right] \) is positive definite. Then, as \(n \rightarrow \infty \), we have

$$\begin{aligned} \sqrt{n}\left( \hat{\varvec{\beta }}-\varvec{\beta }_0\right) {\mathop {\longrightarrow }\limits ^{d}} {\mathcal {N}}\left( 0, {\textbf{A}}^{-1} {\textbf{D}} {\textbf{A}}^{-1}\right) \end{aligned}$$

where \(\hat{\varvec{\beta }}\) is the consistent estimator of \(\varvec{\beta }_0\), either \(\hat{\varvec{\beta }}_{\mathcal {N}}\) or \(\hat{\varvec{\beta }}_{\mathcal {L}}\) defined in Sect. 3.1. The definitions of \({\textbf{A}}\) and \({\textbf{D}}\) are given in Appendix A.

Remark 2

The large sample properties above are developed under the assumption that \(\varvec{\Sigma }\) is known. When \(\varvec{\Sigma }\) is unknown, it needs to be estimated. A common approach is the partial replication method proposed by Carroll et al. (1995). We assume that each \({\textbf{W}}_{i}\) is itself the average of m replicate measurements \({\textbf{W}}_{i,k}, k=1, \ldots , m\), each having variance \(m \varvec{\Sigma }\). Then, a consistent and unbiased estimate of \(\varvec{\Sigma }\) is

$$\begin{aligned} \hat{\varvec{\Sigma }}=\left\{ n\left( m-1\right) \right\} ^{-1} \sum _{i=1}^n \sum _{k=1}^{m}\left( {\textbf{W}}_{i, k}-{\textbf{W}}_{i }\right) \left( {\textbf{W}}_{i,k}-{\textbf{W}}_{i}\right) ^{\top }. \end{aligned}$$

4 Numerical studies

4.1 Instrument and propensity model selection

How to find a suitable nonresponse instrument from a set of covariates is an important question. For example, when \({\textbf{W}}=(W_1,W_2)^\top \) is a two-dimensional random vector, \({\textbf{Z}}\) has the following three choices

$$\begin{aligned} {\textbf{Z}}_0=\{W_1,W_2\},~{\textbf{Z}}_1=\{W_1\},~{\textbf{Z}}_2=\{W_2\}. \end{aligned}$$
(13)

Several studies have attempted to address the issues mentioned above. Let \(p(Y\mid {\textbf{X}})\) be generic notation for a conditional distribution. By assuming a parametric model for \(p(Y\mid {\textbf{X}})\) and an unspecified propensity, Chen et al. (2021) developed a two-step instrument search procedure. In contrast, Wang et al. (2021) proposed a penalized validation criterion (PVC) under a parametric model for the propensity but an unspecified \(p(Y\mid {\textbf{X}})\). The assumptions about \(p(Y\mid {\textbf{X}})\) and the propensity in this paper are consistent with Wang et al. (2021), which motivates us to consider the following PVC

$$\begin{aligned} \begin{aligned} \textrm{PVC}_\lambda (k)&={\text {VC}}(k)+\lambda \log \left( d_k\right) , \\ {\hat{k}}&=\underset{1 \le k \le K}{{\text {argmin}}} \textrm{PVC}_\lambda (k), \end{aligned} \end{aligned}$$
(14)

where \(\textrm{VC}(k)=\frac{1}{n}\sum _{i=1}^n|{\hat{F}}_k({\textbf{W}}_i)-{\hat{F}}({\textbf{W}}_i)|\), \({\hat{F}}({\textbf{w}})=n^{-1}\sum _{i=1}^n\textrm{I}({\textbf{W}}_i\le {\textbf{w}})\), \({\hat{F}}_k({\textbf{w}})=\frac{1}{n}\sum _{i=1}^n\frac{\delta _i\textrm{I}({\textbf{W}}_i\le {\textbf{w}})}{\Delta _k({\textbf{V}}_i,Y_i,\hat{\varvec{\alpha }}^k)},1\le k \le K\), \(\Delta _k({\textbf{V}}_i,Y_i,\varvec{\alpha }^k)\) are the candidate models, K is the total number of candidate models, \(d_k\) is the dimension of \(\varvec{\alpha }^k\), and \(\lambda \ge 0\) is a regularization parameter whose value can be determined by cross-validation.
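The criterion itself is simple to evaluate once each candidate propensity has been fitted (a sketch; `Delta_hat` holds the fitted values \(\Delta _k({\textbf{V}}_i,Y_i,\hat{\varvec{\alpha }}^k)\), and the inequality \({\textbf{W}}_i\le {\textbf{w}}\) is taken componentwise):

```python
import numpy as np

def pvc(W, delta, Delta_hat, lam, d_k):
    """PVC_lambda(k) = VC(k) + lam * log(d_k) of Eq. (14): VC(k) compares the
    IPW-weighted empirical CDF of W with the ordinary empirical CDF."""
    leq = np.all(W[:, None, :] <= W[None, :, :], axis=2)  # leq[i, j] = I(W_i <= W_j)
    F_hat = leq.mean(axis=0)                              # F-hat evaluated at each W_j
    w = delta / Delta_hat                                 # IPW weights delta_i / Delta_k
    F_k = (w[:, None] * leq).mean(axis=0)
    return np.mean(np.abs(F_k - F_hat)) + lam * np.log(d_k)
```

A well-specified propensity drives the two empirical CDFs together, so the correct candidate yields a small VC value.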

To be specific, when \(\Psi (\vartheta )=\exp (\vartheta )/ [1+\exp (\vartheta )]\), the candidate models \(\Delta _k({\textbf{V}}, Y, \varvec{\alpha }^k)\) corresponding to (13) are as follows

$$\begin{aligned} \begin{aligned} \Delta _{0} \left( {\textbf{V}}_{0},Y,\varvec{\alpha }^0 \right)&=\frac{\exp \left( \alpha ^0_{1}+\alpha ^0_{2}Y \right) }{ \left\{ 1+\exp \left( \alpha ^0_{1}+\alpha ^0_{2}Y \right) \right\} }, \\ \Delta _1 \left( {\textbf{V}}_{1},Y,\varvec{\alpha }^1 \right)&=\frac{\exp \left( \alpha ^1_{1}+\varvec{\alpha }^1_{2}{\textbf{V}}_{1}+\alpha ^1_{3}Y \right) }{ \left\{ 1+\exp \left( \alpha ^1_{1}+\varvec{\alpha }^1_{2}{\textbf{V}}_{1}+\alpha ^1_{3}Y \right) \right\} }, \\ \Delta _{2} \left( {\textbf{V}}_{2},Y,\varvec{\alpha }^2\right)&=\frac{\exp \left( \alpha ^2_{1}+\varvec{\alpha }^2_{2}{\textbf{V}}_{2}+\alpha ^2_{3}Y \right) }{ \left\{ 1+\exp \left( \alpha ^2_{1}+\varvec{\alpha }^2_{2}{\textbf{V}}_{2}+\alpha ^2_{3}Y \right) \right\} }, \end{aligned} \end{aligned}$$

where \({\textbf{V}}_0=\emptyset , {\textbf{V}}_1=\{W_2\},{\textbf{V}}_2=\{W_1\}\). Note that criterion (14) enables the simultaneous selection of both the propensity model and the nonresponse instrument. By replacing \(\exp (\vartheta )/ [1+\exp (\vartheta )]\) with an alternative link function, we can derive three additional candidate models; selection among these six candidates can then be carried out according to criterion (14).

4.2 Monte Carlo studies

In this section, we conduct Monte Carlo simulations to study the finite-sample performance of the proposed estimation. Simulated data are generated from the model:

$$\begin{aligned} Y_{i}=\beta _1 X_{i1} +\beta _2 X_{i2}+{e_i}(\tau ), \quad i=1,2,\ldots ,n, \end{aligned}$$

where \(X_{i1}\sim \textrm{Uniform}(-3,3)\), \(X_{i2}\sim {\mathcal {N}}(0,2^2)\), \(\beta _{1}=1\), \(\beta _{2}=2\), \({e_i}(\tau )={e_i}-F_{{e_i}}^{-1}(\tau )\) and \(F_{{e_i}}(\cdot )\) is the distribution function of \({e_i}\). We consider three different distributions for \({e_i}\):

  1. (1)

    Normal distribution (E1): \({\mathcal {N}}(0,2^2)\);

  2. (2)

    Heteroscedastic normal distribution (E2): \({\mathcal {N}}(0,(1+{|X_{i2} |})^2)\);

  3. (3)

    t-distribution with 3 degrees of freedom (E3): t(3).

Note that E2 is a heteroscedastic error and E3 is a heavy-tailed error. Furthermore, the measurement error model is \({\textbf{W}}_{i}={\textbf{X}}_{i}+{\textbf{U}}_{i}\), where the \({\textbf{U}}_{i}\) are generated from \( {\mathcal {N}}({\textbf{0}},\varvec{\Sigma })\) with

$$\begin{aligned} \varvec{\Sigma }=\left( \begin{array}{cc} 0.5^2 &{} 0 \\ 0 &{} 0 \end{array}\right) . \end{aligned}$$

We generate \(\delta _{i}\) from the Bernoulli distribution according to the following probability

$$\begin{aligned} \Pr (\delta _i=1\mid {\textbf{V}}_i,Y_i)=\frac{\exp \left( 1.2-0.3 W_{i1}-0.3 Y_i\right) }{1+\exp \left( 1.2-0.3 W_{i1}-0.3 Y_i\right) }. \end{aligned}$$
(15)

The coefficients are chosen such that the missing rate is between \(25\%\) and \(40\%\). Then, we choose \(\eta ({\textbf{W}}) = (1, {\textbf{V}}^\top ,{\textbf{Z}}^\top )^\top \), which is consistent with Wang et al. (2014) and Wang et al. (2021). More specifically, according to Eq. (15), \({\textbf{V}}_i=W_{i1}\) and \({\textbf{Z}}_i=W_{i2}\); therefore, in this example, \(\eta ({\textbf{W}}_i) = (1, W_{i1}, W_{i2})^\top \).
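For reproducibility, the design of this subsection can be simulated as follows (a sketch for the E1 error at \(\tau = 0.5\); the seed and sample size are illustrative):

```python
import numpy as np

def generate_data(n, rng):
    """Simulate (W1, W2, Y, delta): the model of Sect. 4.2 with E1 normal errors,
    normal measurement error on X1 only, and missingness mechanism (15)."""
    X1 = rng.uniform(-3, 3, size=n)
    X2 = rng.normal(0.0, 2.0, size=n)
    Y = 1.0 * X1 + 2.0 * X2 + rng.normal(0.0, 2.0, size=n)  # at tau = 0.5, e(tau) = e
    W1 = X1 + rng.normal(0.0, 0.5, size=n)   # only X1 is measured with error
    W2 = X2
    prob = 1.0 / (1.0 + np.exp(-(1.2 - 0.3 * W1 - 0.3 * Y)))  # propensity (15)
    delta = rng.binomial(1, prob)
    return W1, W2, Y, delta
```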

Table 1 Number of times PVC criterion selects each instrument in 100 simulations
Table 2 Bias and RMSE of four estimates with \(n=300\) and \(\eta ({\textbf{W}}) = (1, W_{i1},W_{i2})^\top \)
Table 3 Bias and RMSE of four estimates with \(n=500\) and \(\eta ({\textbf{W}}) = (1, W_{i1},W_{i2})^\top \)

First, we conduct simulations with sample sizes \(n = 300\), 500 and 800 to assess the PVC outlined in Sect. 4.1. Table 1 reports the number of times each candidate model is selected by the PVC in 100 Monte Carlo replications. According to Table 1, the PVC selects the correct propensity \(\Delta _{2}({\textbf{V}}_{2},Y,\varvec{\alpha }^2)\) with higher empirical probability than the other candidates. Remarkably, the probability of selecting \(\Delta _{2}({\textbf{V}}_{2},Y,\varvec{\alpha }^2)\) almost reaches 1 when the sample size grows to 800.

Furthermore, to evaluate the estimation efficiency, we conduct simulation studies of the following four estimators:

  1. (1)

    N: The naive estimator that ignores both the measurement errors and missingness is defined as follows

    $$\begin{aligned} \underset{\varvec{\beta }}{{\text {argmin}}}\sum _{i=1}^n\delta _i\rho _\tau \left( Y_i-{\textbf{W}}_i^\top \varvec{\beta } \right) . \end{aligned}$$
  2. (2)

    D: The estimator that only considers the missingness is obtained by

    $$\begin{aligned} \underset{\varvec{\beta }}{{\text {argmin}}}\sum _{i=1}^n\frac{\delta _i}{\Delta \left( {\textbf{V}}_{i},Y_i, \hat{\varvec{\alpha }}\right) } \rho _\tau \left( Y_i-{\textbf{W}}_i^\top \varvec{\beta } \right) . \end{aligned}$$
  3. (3)

    M: The estimator that only considers the measurement errors is defined as

    $$\begin{aligned} \underset{\varvec{\beta }}{{\text {argmin}}}\sum _{i=1}^n\rho _{{\mathcal {N}}}(Y_i,{\textbf{W}}_i,\varvec{\beta },h). \end{aligned}$$
  4. (4)

    DM: The proposed estimator, which considers the measurement errors and missingness simultaneously.

Table 4 Bias and RMSE of the proposed estimator with two different \(\eta ({\textbf{W}})\)

All results are based on 200 simulation replications and the sample sizes \(n = 300\) and 500. The biases (Bias) and the root mean square errors (RMSE) are utilized to assess the performance of the aforementioned estimators. Bias and RMSE are defined as follows

$$\begin{aligned} \text {Bias}({\hat{\beta }}_j)\!=\! \frac{1}{200}\sum _{a\!=\!1}^{200} \left( {\hat{\beta }}_j^{(a)}\!-\!\beta _{0j} \right) ,\text {RMSE}\left( {\hat{\beta }}_j \right) \!=\! \sqrt{\frac{1}{200}\sum _{a\!=\!1}^{200} \left( {\hat{\beta }}_j^{(a)}\!-\!\beta _{0j} \right) ^2}, j=1,2. \end{aligned}$$
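Computing these summaries from the stacked replicate estimates is immediate (a small helper sketch; `estimates` has one row per replication):

```python
import numpy as np

def bias_rmse(estimates, beta0):
    """Monte Carlo bias and RMSE per coefficient; estimates has shape (R, p)."""
    err = np.asarray(estimates, dtype=float) - np.asarray(beta0, dtype=float)
    return err.mean(axis=0), np.sqrt((err**2).mean(axis=0))
```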

Simulation results are presented in Tables 2 and 3. Figure 1 presents the boxplots of \({\hat{\beta }}_j-\beta _{0j}\) \((j=1,2)\) at \((\tau ,n)=(0.5,300)\) for all four methods. A few conclusions can be drawn as follows:

  1. (1)

    The proposed estimator has negligible biases in all cases, which also demonstrates that it is less sensitive to the distribution of the error term \(e_i\). These results are consistent with our theory. As expected, the naive estimator is biased due to the presence of measurement errors and nonignorable missingness.

  2. (2)

    From Fig. 1, it can be seen that the variance of the proposed estimator is larger than that of the naive estimator. However, as the sample size n increases, the RMSE of the proposed estimator tends to be consistently lower. This indicates that, despite the increased variance associated with the proposed method, the benefit from bias correction effectively offsets it, leading to an overall improvement in estimation accuracy.

Finally, we investigate the robustness of the proposed estimator to the choice of \(\eta ({\textbf{W}})\). Specifically, we consider \(\eta _1({\textbf{W}}) = (W_{i1}, W_{i2}, W_{i2}^2)^\top \) and \(\eta _2({\textbf{W}}) = (1, W_{i1}, W_{i2}, W_{i2}^2)^\top \); the simulation results with \(n = 500\) are reported in Table 4. The empirical results show that the proposed estimator is robust to the choice of \(\eta ({\textbf{W}})\).

Fig. 1

Boxplots of \({\hat{\beta }}_{1} -\beta _{01}\) (left) and \({\hat{\beta }}_2-\beta _{02}\) (right) for different error distributions at \((\tau ,n)=(0.5,300)\)

Simulation studies under Laplace measurement error are presented as Simulation I in Appendix B.1; the results yield conclusions that align with those in the example above. Simulation II in Appendix B.2 shows that the proposed estimators perform well even when the measurement error distribution is misspecified.

4.3 Real data example: Boston housing data

As an illustration, the proposed methodology is now applied to the Boston housing data, which are available in the MASS package in R. The data contain 506 observations on fourteen variables. Many studies have used these data and found potential relationships between MEDV and PTRATIO, RM, TAX and LSTAT; see Yu and Lu (2004) and Jiang et al. (2016). In this paper, we also focus on the following five variables:

  1. MEDV:

    Median value of owner-occupied homes in $1000;

  2. PTRATIO:

    Pupil-teacher ratio by town;

  3. RM:

    Average number of rooms per dwelling;

  4. TAX:

    Full-value property-tax rate per $10,000;

  5. LSTAT:

    Percentage of the population of lower socioeconomic status.

We follow previous studies by log-transforming TAX and LSTAT. For simplicity of notation, the variables MEDV, PTRATIO, RM, \(\log (\textrm{TAX})\) and \(\log (\textrm{LSTAT})\) are denoted, respectively, by \(Y_i\), \(X_{i1}\), \(X_{i2}\), \(X_{i3}\) and \(X_{i4}\). The model

$$\begin{aligned} {Y_i=\beta _1X_{i1}+\beta _2X_{i2}+\beta _3X_{i3}+\beta _4X_{i4}+e_i}, \end{aligned}$$

is used to fit the data at quantile level \(\tau = 0.5\). To better illustrate our proposed method, we assume that \(X_{i1}\) is subject to measurement error. The measurement error model is constructed as \({\textbf{W}}_{i}={\textbf{X}}_{i}+{\textbf{U}}_{i}\), where \({\textbf{X}}_{i}=(X_{i1},X_{i2},X_{i3},X_{i4})^\top \) and the \({\textbf{U}}_{i}\) are generated from \( {\mathcal {N}}({\textbf{0}},\varvec{\Sigma })\) with

$$\begin{aligned} \varvec{\Sigma }=\left( \begin{array}{cccc} 0.5^2 &{} \quad 0 &{} \quad 0&{} \quad 0\\ 0 &{} \quad 0 &{} \quad 0&{} \quad 0\\ 0 &{}\quad 0&{}\quad 0&{}\quad 0\\ 0 &{}\quad 0&{}\quad 0&{}\quad 0 \end{array}\right) . \end{aligned}$$
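The construction of the contaminated covariates can be sketched as follows (a minimal illustration with placeholder covariate values; only the first component of \({\textbf{X}}_{i}\) is measured with error, with standard deviation 0.5):

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder stand-in for the four covariates (PTRATIO, RM, log TAX, log LSTAT).
n = 506
X = rng.normal(size=(n, 4))

# Measurement error covariance: only the first covariate is error-prone.
Sigma = np.diag([0.5**2, 0.0, 0.0, 0.0])
U = rng.multivariate_normal(np.zeros(4), Sigma, size=n)
W = X + U  # observed surrogate covariates

# Only the first column differs from the true covariates.
assert np.allclose(W[:, 1:], X[:, 1:])
```

Because \(\varvec{\Sigma }\) is singular, the error affects only the first coordinate; the remaining covariates are observed exactly.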

Because our proposed method is robust to misspecification of the measurement error distribution, only the case \({\textbf{U}}_{i}\sim {\mathcal {N}}({\textbf{0}},\varvec{\Sigma })\) is studied in this example. To introduce missing data, binary response indicators \(\delta _i\sim {\text {Bernoulli}}(p_i)\) are generated prior to estimation. We consider three choices of \(p_{i}\) as follows:

  1. M1

    \(p_i=1/\{1+\exp (-1.5+0.9W_{i1}+0.9Y_i)\}\);

  2. M2

    \(p_i=1/\{1+\exp (-1.4+0.9W_{i1}+0.8\sin (Y_i))\}\);

  3. M3

    \(p_i=|\sin (-1+0.2W_{i1}^{-1}+0.1Y_i)|\).
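The three missingness mechanisms can be sketched as follows (a hedged illustration using placeholder values for \(W_{i1}\) and \(Y_i\); with real data the stated coefficients give a missing ratio near 20%, but the ratio printed here depends on the placeholder inputs):

```python
import numpy as np

rng = np.random.default_rng(1)

# Placeholder surrogate covariate and response, for illustration only.
n = 506
W1 = rng.normal(size=n)
Y = rng.normal(size=n)

def p_m1(w1, y):
    """M1: logistic propensity, linear in W1 and Y."""
    return 1.0 / (1.0 + np.exp(-1.5 + 0.9 * w1 + 0.9 * y))

def p_m2(w1, y):
    """M2: logistic propensity with a sin(Y) term."""
    return 1.0 / (1.0 + np.exp(-1.4 + 0.9 * w1 + 0.8 * np.sin(y)))

def p_m3(w1, y):
    """M3: non-logistic propensity, bounded in [0, 1] via |sin|."""
    return np.abs(np.sin(-1.0 + 0.2 / w1 + 0.1 * y))

for name, p in [("M1", p_m1), ("M2", p_m2), ("M3", p_m3)]:
    pi = p(W1, Y)
    delta = rng.binomial(1, pi)  # 1 = response observed, 0 = missing
    print(f"{name}: missing ratio = {1.0 - delta.mean():.2f}")
```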

The coefficients in M1, M2 and M3 are chosen so that the missing ratio is approximately 20%. To apply the proposed method, we use the working model

$$\begin{aligned} \Pr (\delta _i=1\mid {\textbf{V}}_i,Y_i)=1/\{1+\exp \left( \alpha _0+\alpha _1 W_{i1}+\alpha _2 Y_i\right) \}. \end{aligned}$$

Therefore, under M1 the working model is correctly specified, while under M2 and M3 it is misspecified. Table 5 summarizes the coefficient estimates obtained with the four methods under M1. The standard errors in parentheses are based on 200 bootstrap samples. The findings in Table 5 reveal that only RM positively influences housing prices, while PTRATIO, TAX, and LSTAT negatively impact them, consistent with the conclusions in Yu and Lu (2004) and Jiang et al. (2016).

Table 5 The estimates (with standard errors in parentheses) for Boston housing data

For comparison, we assess the performance of these estimators via out-of-sample prediction. Specifically, we fit the regression model with each of the four methods using 300 observations and then use the estimated coefficients to predict the remaining 206 observations. We compare the mean squared error (MSE) and mean absolute deviation (MAD) of the predictions, defined as

$$\begin{aligned} \textrm{MSE}=\frac{1}{206}\sum _{i=1}^{206} \left( Y_i-{\hat{Y}}_i \right) ^2,~\textrm{MAD}=\frac{1}{206}\sum _{i=1}^{206} \left|Y_i-{\hat{Y}}_i \right|. \end{aligned}$$
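These two criteria are direct to compute; a minimal sketch (function names are ours):

```python
import numpy as np

def mse(y, y_hat):
    """Mean squared error of out-of-sample predictions."""
    y, y_hat = np.asarray(y, dtype=float), np.asarray(y_hat, dtype=float)
    return np.mean((y - y_hat) ** 2)

def mad(y, y_hat):
    """Mean absolute deviation of out-of-sample predictions."""
    y, y_hat = np.asarray(y, dtype=float), np.asarray(y_hat, dtype=float)
    return np.mean(np.abs(y - y_hat))

# Toy check: perfect predictions give zero error.
assert mse([1.0, 2.0], [1.0, 2.0]) == 0.0
print(mad([1.0, 3.0], [2.0, 2.0]))  # 1.0
```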

The MSE and MAD for the four estimators under M1 are given in Table 6. The results show that our proposed method outperforms the remaining three methods. Additionally, Jiang et al. (2018) used a partially linear varying coefficient model to fit the Boston housing data via weighted composite quantile regression. It is noteworthy that they used the same evaluation criteria as we did, while the MSE and MAD of their method were 4.4039 and 0.9560, respectively. This again demonstrates that our method is effective in correcting the biases caused by measurement errors and missing data.

Finally, we computed the MSE and MAD of the proposed estimator under M2 and M3, obtaining (MSE, MAD) values of (0.653, 0.549) and (0.639, 0.543), respectively. Compared to the results under M1, using a misspecified propensity in the estimation process leads to poorer results. Consequently, in applications where the true propensity is unknown, it is advisable to use the PVC outlined in Sect. 4.1 to select a suitable propensity model, thereby conducting better statistical inference.

Table 6 The MSE and MAD for real data example

5 Conclusion and discussion

In this paper, a robust method has been proposed to deal simultaneously with nonignorable nonresponse and covariate measurement errors in the linear quantile regression model. We also established the asymptotic properties of the proposed estimators. Simulation studies and a real data analysis examine the finite-sample performance of the proposed approaches. Several extensions can be investigated in the future. To obtain more efficient estimates, the results of this paper can be generalized to composite quantile regression (Kai et al. 2011). In addition, penalized variable selection can be used to identify the significant predictors.