1 Introduction

In statistical data analysis, the data are often collected subject to measurement errors. One typical way to treat the measurement error is the errors-in-variables model, which assumes that the actual observation Z is a surrogate of the true unobserved variable X, i.e., \(Z = X + u\), where u is the measurement error. Regression models with measurement errors in covariates have received broad attention in the literature over the last century. In the last three decades, they have been the focus of numerous researchers, as is evidenced by the three monographs of Fuller (1987), Cheng and Ness (1999) and Carroll et al. (2006), and the references therein. However, as Berkson (1950) argued, in many situations it is more appropriate to treat the true unobserved variable X as the observed variable Z plus an error, i.e., \(X = Z + \eta \), where now \(\eta \) is the measurement error. As an example, Wang (2003) mentions that in a chemical analysis, in order to study the effect of the temperature used to dry a sample on the resulting concentration of a certain volatile matter, an oven is typically used to keep the samples at a certain temperature. However, the actual temperature (X) in the oven can vary from the set temperature (Z) due to the working mechanism of the oven. As a second example, mentioned in Wang (2004), in order to study the yield of a crop in agriculture, an important covariate is the amount of a fertilizer absorbed by the crop. Typically, only the amount of fertilizer (Z) applied to the crop is observed. The actual amount of absorption (X) is not easily observed and may vary randomly around the amount Z applied to the crop. For more examples, see Carroll et al. (2006), Du et al. (2011), Schennach (2013), and the references therein.

Proceeding a bit more precisely, in the Berkson measurement error regression model of interest here, one has the triple \((X, Y, Z)\) obeying the relations

$$\begin{aligned} Y = \mu (X) + \varepsilon , \quad X = Z + \eta . \end{aligned}$$
(1.1)

Here Y is a scalar response variable and \(\varepsilon \) is an error variable, independent of X, with \(E\varepsilon =0\), so that \(\mu (x)=E(Y|X=x)\), \(x\in {\mathbb R}^p\). The random vectors \(X, Z, \eta \) are p-dimensional, with X being the true unobservable covariate vector, Z representing an observation on X and \(\eta \) denoting the measurement error having \(E\eta =0\). We also assume that the three r.v.’s \(\varepsilon , Z, \eta \) are mutually independent. See Remark 2 for a further discussion on this assumption.

Let \(\varTheta \subset {\mathbb R}^q\) be a compact set, \(\{m_\theta (x); \theta \in \varTheta , x\in {\mathbb R}^p\}\) be a family of given functions and \({\mathcal C}\) be a compact subset in \({\mathbb R}^p\). The problem of interest is to test

$$\begin{aligned}&H_0: \mu (x) = m_{\theta _0} (x),\quad \text { for some }{\theta _0}\in {\varTheta } \text { and all } x\in {\mathcal C}, \quad \text { versus} \\&H_1: H_0 \text { is not true,} \end{aligned}$$

based on the primary sample \(\{(Z_i, Y_i), i=1,\ldots ,n\}\) and an independent validation sample \(\{({\widetilde{Z}}_k,{\widetilde{X}}_k), k = 1,\ldots ,N\}\), all satisfying (1.1). Then the empirical version of \(\eta \) is naturally obtained by \({\widetilde{\eta }}_k := {\widetilde{X}}_k - {\widetilde{Z}}_k, 1\le k \le N\).

Koul and Song (2009) provide a class of tests for the above testing problem when \(f_\eta \) is known. To describe their tests, assume \(E|\mu (X)|<\infty , E|m_\theta (X)|<\infty \), for all \(\theta \in \varTheta \), and define \( H(z):= E\big [\mu (X)\big |Z=z\big ], H_{\theta } (z) := E[m_{\theta } (X)|Z=z], z\in {\mathbb R}^p. \) Then the original model is transformed to

$$\begin{aligned} Y = H(Z) + \xi ,\quad E(\xi |Z) = 0, \end{aligned}$$

and the hypothesis \(H_0\) implies \(H'_0\), so that the transformed testing problem becomes

$$\begin{aligned}&H'_0: H(z) = H_{\theta _0} (z), \,\, \text { for some } \theta _0\in \varTheta \text { and all } z\in {\mathcal C}, \quad \text {versus}\,\, H'_1: H'_0 \text { is not true.} \end{aligned}$$

Note that known \(f_\eta \) implies that \(H_\theta \) is a known parametric function.

Next, let \(w\equiv w_n = c (\log n/n)^{1/(p+4)}\), \(c>0\), and \(h\equiv h_n\) be two bandwidth sequences depending on n, let K be a density kernel, and let G be a nondecreasing right-continuous real-valued function on \({\mathbb R}^p\), and define

$$\begin{aligned} K_{hi}(z)= & {} \frac{1}{h^p} K\left( \frac{z - Z_i}{h}\right) , \quad \hat{f}_w(z) = \frac{1}{n}\sum _{i=1}^nK_{wi}(z), \\ M_n({\theta })= & {} \int _{\mathcal C}\left[ \frac{1}{n \hat{f}_w(z)} \sum _{i=1}^nK_{hi}(z) [Y_i - H_{\theta } (Z_i)] \right] ^2 \mathrm{d}G(z), \quad {\tilde{\theta }}_n = \text {argmin}_{\theta } M_n({\theta }). \end{aligned}$$

The class of tests, one for each K and G, proposed in Koul and Song (2009) (KS) is based on the class of minimized integrated square distances \(M_n({\tilde{\theta }}_n)\). They establish the consistency and asymptotic normality of suitably standardized \({\tilde{\theta }}_n\) and \(M_n({\tilde{\theta }}_n)\).

In the current paper, we extend this minimum distance (m.d.) methodology to the case when \(f_\eta \) is unknown but external validation data are available. The lack of knowledge of \(f_\eta \) makes \(H_\theta \) an unknown function. The validation data \(\{({\widetilde{Z}}_k,{\widetilde{X}}_k), k = 1,\ldots ,N\}\) are used to estimate \(H_\theta \), which in turn is used to construct a class of tests analogous to \(M_n\). More precisely, let

$$\begin{aligned}&{\widehat{H}}_{\theta } (z) = N^{-1} \sum _{k=1}^Nm_{\theta } (z + {\widetilde{\eta }}_k), \quad {\widetilde{\eta }}_k:= {\widetilde{X}}_k-{\widetilde{Z}}_k, \quad 1\le k\le N, \\&{\widehat{M}}_n({\theta }) = \int _{\mathcal C}\left[ \frac{1}{n \hat{f}_w(z)} \sum _{i=1}^nK_{hi}(z) [Y_i - {\widehat{H}}_{\theta } (Z_i)]\right] ^2 \mathrm{d}G(z), \quad \hat{\theta }_n = \text {argmin}_{\theta } {\widehat{M}}_n({\theta }). \end{aligned}$$

The proposed class of tests investigated in this paper is based on \({\widehat{M}}_n(\hat{\theta }_n).\)
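
To make the construction concrete, the following is a minimal numerical sketch, for \(p=q=1\), of how \({\widehat{H}}_\theta \), \({\widehat{M}}_n\) and \(\hat{\theta }_n\) can be computed; the Epanechnikov kernel, the grid-based Riemann approximation of the G-integral over \({\mathcal C}\), the simulated data, and the fixed bandwidths are illustrative choices only (in practice the bandwidths would be chosen data-adaptively as described in Sect. 5).

```python
# Illustrative sketch of the m.d. estimator with validation data (p = q = 1).
import numpy as np
from scipy.optimize import minimize_scalar

def epanechnikov(u):
    return 0.75 * (1.0 - u**2) * (np.abs(u) <= 1.0)

def H_hat(theta, z, eta_tilde, m):
    # \widehat{H}_theta(z) = N^{-1} sum_k m_theta(z + eta_tilde_k), for each entry of z
    return np.mean(m(theta, z[:, None] + eta_tilde[None, :]), axis=1)

def M_hat(theta, Y, Z, eta_tilde, m, h, w, z_grid, dG):
    # K_{hi}(z) and \hat f_w(z) evaluated on a grid over C
    Kh = epanechnikov((z_grid[:, None] - Z[None, :]) / h) / h
    Kw = epanechnikov((z_grid[:, None] - Z[None, :]) / w) / w
    f_w = Kw.mean(axis=1)
    resid = Y - H_hat(theta, Z, eta_tilde, m)            # Y_i - \widehat{H}_theta(Z_i)
    U = (Kh * resid[None, :]).mean(axis=1) / f_w         # inner kernel average at each grid point
    return np.sum(U**2 * dG)                             # Riemann sum approximating the G-integral

# Illustrative example: m_theta(x) = exp(theta*x), C = [-1, 1], G = Lebesgue measure on C.
m = lambda theta, x: np.exp(theta * x)
z_grid = np.linspace(-1.0, 1.0, 101)
dG = np.full(z_grid.shape, 2.0 / z_grid.size)            # Riemann weights for dG = dz on [-1, 1]

rng = np.random.default_rng(0)
n, N, theta0 = 200, 200, -1.0
Z = rng.uniform(-1.0, 1.0, n)
X = Z + rng.normal(0.0, 0.2, n)                          # Berkson error: X = Z + eta
Y = np.exp(theta0 * X) + rng.normal(0.0, 0.2, n)
eta_tilde = rng.normal(0.0, 0.2, N)                      # validation residuals X_tilde - Z_tilde

h, w = 0.3, 0.4                                          # placeholder bandwidths; see Sect. 5
fit = minimize_scalar(lambda t: M_hat(t, Y, Z, eta_tilde, m, h, w, z_grid, dG),
                      bounds=(-3.0, 3.0), method="bounded")
print("minimum distance estimate:", fit.x)
```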

We establish the asymptotic normality of \(n^{1/2}(\hat{\theta }_n-\theta _0)\) under \(H_0\) and that of suitably standardized \({\widehat{M}}_n(\hat{\theta }_n)\) under \(H_0\) and under a sequence of local alternatives. We also discuss the choice of the optimal G that maximizes the asymptotic power against a given sequence of local alternatives.

A surprising finding is that the asymptotic distributions of suitably standardized \({\widehat{M}}_n(\hat{\theta }_n)\) under \(H_0\) and under certain sequences of local alternatives are the same as those of the similarly standardized \(M_n({\tilde{\theta }}_n)\), see Theorems 2 and 5 below. In comparison, Theorem 1 below shows that the asymptotic distributions of \(n^{1/2}(\hat{\theta }_n -\theta _0)\) and \(n^{1/2}({\tilde{\theta }}_n -\theta _0)\), under \(H_0\), are different, in general.

If \(p=q\) and \(m_\theta (x)=\theta ^T x\) is linear, then \(H_\theta (z)=\theta ^T z\) is a known function, so there is no need to estimate it and one can use \(M_n({\tilde{\theta }}_n)\) to fit a linear model to \(\mu (x)\) regardless of the knowledge of \(f_\eta \). Moreover, Proposition 1 below proves that in this case, \(n^{1/2}(\hat{\theta }_n -{\tilde{\theta }}_n)\rightarrow _p 0\). However, as pointed out in Example 1 below, the need to estimate \(f_\eta \) cannot be avoided even when the regression model is a polynomial of order 2 or more, a small departure from linearity.

The paper is organized as follows. Section 2 describes the needed assumptions for the derivation of the consistency and asymptotic normality of the proposed estimators and tests. Section 3 establishes the consistency and asymptotic normality of the m.d. estimators, while in Sect. 4 we state the main results about the m.d. tests under the null and certain fixed and local alternative hypotheses. The proofs of many results stated in Sects. 3 and 4 appear in the supplement of this paper. Section 5 reports the findings of a simulation study that assesses some finite sample properties of an estimator and a test in the proposed classes of these inference procedures.

In this paper, \({\mathcal N}_q(\nu , \varSigma )\) stands for the q-variate normal distribution with mean vector \(\nu \) and covariance matrix \(\varSigma \), \(x^T\) denotes the transpose of a Euclidean vector x, \(\Vert \cdot \Vert \) denotes the Euclidean norm, all limits are taken as \(N\wedge n\rightarrow \infty \), unless mentioned otherwise, and \(\rightarrow _d\) (\(\rightarrow _p\)) denotes convergence in distribution (in probability).

2 Assumptions under \(H_0\)

In this section, we shall describe the assumptions for the asymptotic normality of suitably standardized \(\hat{\theta }_n\) and \({\widehat{M}}_n(\hat{\theta }_n)\), under \(H_0\). Many of these assumptions are the same as in KS. Define, for \(x,y \in {\mathbb R}^p\) and \(\theta \in \varTheta \),

$$\begin{aligned}&\sigma _{\theta }(x,y) := \text {Cov}\big (m_{\theta } (x + \eta ), m_{\theta } (y + \eta )\big ), \quad \sigma _{\theta }^2(x) := \sigma _\theta (x,x)=\text {Var}(m_{{\theta }}(x+ \eta )). \end{aligned}$$

All the integrals with respect to the measure G are supposed to be over the compact set \({\mathcal C}\), unless specified otherwise. We are now ready to state the needed assumptions.

  1. (A1)

    \(\{(Y_i, Z_i), Z_i \in {\mathbb R}^p, i = 1,\ldots ,n\}\) is an i.i.d. sample with regression function \(H(z) = E(Y|Z=z)\) satisfying \(\int H^2 \mathrm{d}G < \infty \). The integrating measure G has a continuous Lebesgue density g on \({\mathcal C}\). The validation data \(\{({\widetilde{Z}}_k, {\widetilde{X}}_k), {\widetilde{Z}}_k \in {\mathbb R}^p, {\widetilde{X}}_k \in {\mathbb R}^p, k=1,\ldots ,N\}\) is an i.i.d. sample from the Berkson measurement error model \(X = Z + \eta \). Furthermore, the two samples are independent.

  2. (A2)

    \(0<\sigma _\varepsilon ^2:=\text {Var}(\varepsilon ) < \infty \), \(\tau ^2(z) = E[(m_{\theta _0}(X) - H_{\theta _0}(Z))^2|Z=z]\) is continuous on \({\mathcal C}\).

  3. (A3)

    For some \(\delta >0\), \(E|\varepsilon |^{2+\delta }+E|m_{\theta _0}(X) - H_{\theta _0}(Z)|^{2+\delta }<\infty .\)

  4. (A4)

    \(E|\varepsilon |^{4} + E|m_{\theta _0}(X) - H_{\theta _0}(Z)|^{4}<\infty .\)

  5. (A5)

    \(\int \sigma _\theta ^2(z)\mathrm{d}G(z) < \infty \), for all \(\theta \in \varTheta \).

  6. (F1)

    The density \(f_Z\) is uniformly continuous on \({\mathcal C}\) and \(\inf _{z\in {\mathcal C}}f_Z(z)>0\).

  7. (F2)

    The density \(f_Z\) is twice continuously differentiable on \({\mathcal C}\).

  8. (H1)

    \(m_\theta (x)\) is a.e. continuous in x, for every \(\theta \in \varTheta \).

  9. (H2)

    The parametric function family \(H_\theta (z)\) is identifiable with respect to \(\theta \), i.e., \(H_{\theta _1}(z) = H_{\theta _2}(z)\) a.e. in z implies \(\theta _1 = \theta _2\).

  10. (H3)

    For some positive continuous function r on \({\mathcal C}\), and for some \(0<\beta \le 1\), \(| H_{\theta _1}(z) - H_{\theta _2}(z) | \le \Vert \theta _1 - \theta _2\Vert ^\beta r(z),\) for all \(\theta _1, \theta _2\in \varTheta \) and \(z\in {\mathcal C}\).

  11. (H4)

    For each x, \(m_\theta (x)\) is differentiable with respect to \(\theta \) in a neighborhood of \(\theta _0\) with the derivative vector \(\dot{m}_\theta (x)\) such that for any consistent estimator \(\theta _n\) of \(\theta _0\),

    $$\begin{aligned} \sup _{i} \frac{\big | \frac{1}{N}\sum \nolimits _{k=1}^N [m_{\theta _n}(Z_i + {\widetilde{\eta }}_k) - m_{\theta _0}(Z_i + {\widetilde{\eta }}_k) - (\theta _n-\theta _0)^T \dot{m}_{\theta _0}(Z_i + {\widetilde{\eta }}_k)]\big |}{\Vert \theta _n - \theta _0\Vert } = o_p(1), \end{aligned}$$

    where the supremum is taken over \(1\le i \le n\).

  12. (H5)

    The vector function \(\dot{m}_{\theta _0}(x)\) is continuous in \(x\in {\mathcal C}\) and for every \(\epsilon >0\), there are \(n_\epsilon , N_\epsilon \) such that for every \(0< a < \infty \), and for all \(n> n_\epsilon , N > N_\epsilon \),

    $$\begin{aligned} P\left( \max \limits _{1\le i \le n, 1\le k \le N, (nh^{p})^{1/2}\Vert \theta -\theta _0\Vert \le a} h^{-p/2} \Vert \dot{m}_\theta (Z_i + {\widetilde{\eta }}_k) - \dot{m}_{\theta _0} (Z_i + {\widetilde{\eta }}_k)\Vert \ge \epsilon \right) \le \epsilon . \end{aligned}$$
  13. (H6)

    \(\int \Vert \dot{H}_{\theta _0}\Vert ^2 \mathrm{d}G < \infty \) and \(\varSigma _0 = \int \dot{H}_{\theta _0} \dot{H}_{\theta _0}^T \mathrm{d}G\) is positive definite.

  14. (K)

    The density kernel K is positive, symmetric and square integrable on \([-1,1]^p\).

  15. (W1)

    \(nh^{2p} \rightarrow \infty \) and \(N/n\rightarrow \lambda , \lambda >0\).

  16. (W2)

    \(h\sim n^{-a}\), where \(0<a<\min (1/(2p), 4/(p(p+4)))\).

We state some important facts that will be used later. From Mack and Silverman (1982), we obtain that under (F1), (K), (W1) and (W2),

$$\begin{aligned}&\sup \limits _{z\in {\mathcal C}} |\hat{f}_h(z) - f_Z(z)| = o_p(1),\quad \sup \limits _{z\in {\mathcal C}} |\hat{f}_w(z) - f_Z(z)| = o_p(1), \nonumber \\&\sup \limits _{z\in {\mathcal C}} \left| \frac{f_Z^2(z)}{\hat{f}_w^2(z)} -1\right| = o_p(1). \end{aligned}$$
(2.1)

Let \(\mathrm{d}\varphi = f_Z^{-2}\mathrm{d}G, \mathrm{d}\hat{\varphi }= \hat{f}_w^{-2}\mathrm{d}G.\) We also recall the following facts from Koul and Ni (2004) (KN) and KS. For any continuous function \(\alpha \) on \({\mathcal C}\), \(\int |\alpha (z)| \mathrm{d}\varphi (z)<\infty \), and by (2.1),

$$\begin{aligned} \int \alpha (z)\mathrm{d}\hat{\varphi }(z) = \int \alpha (z) \mathrm{d}\varphi (z) + o_p\left( \int |\alpha (z)| \mathrm{d}\varphi (z)\right) . \end{aligned}$$
(2.2)

Using equation (3.9) on page 117 of Koul and Ni (2004), for any \(\alpha \) as above, (F1), (K) and (W1) imply

$$\begin{aligned} \int E\left\{ \frac{1}{n} \sum _{i=1}^nK_h(z-Z_i)\alpha (Z_i)\right\} ^2 \mathrm{d}\varphi (z) = \int \alpha ^2(z)\mathrm{d}G(z)+o(1)=O(1). \end{aligned}$$
(2.3)

3 Estimation of \(\theta _0\)

In this section, we establish the consistency of \(\hat{\theta }_n\) and asymptotic normality of \(n^{1/2}(\hat{\theta }_n - \theta _0)\), under \(H_0\). Let

$$\begin{aligned}&W_n({\theta }) = \int _{\mathcal C}\left[ \frac{1}{n \hat{f}_w(z)} \sum _{i=1}^nK_{hi}(z) [{\widehat{H}}_{\theta } (Z_i) - H_{\theta }(Z_i)]\right] ^2 \mathrm{d}G(z). \end{aligned}$$

This quantity is a measure of the essential difference between \({\widehat{M}}_n(\theta )\) and \(M_n(\theta )\) as is seen in the following decomposition:

$$\begin{aligned} {\widehat{M}}_n({\theta })= & {} \int \left[ \frac{1}{n \hat{f}_w(z)} \sum _{i=1}^nK_{hi}(z) [Y_i - H_{\theta }(Z_i) + H_{\theta }(Z_i) - {\widehat{H}}_{\theta } (Z_i)]\right] ^2 \mathrm{d}G(z)\\= & {} M_n({\theta }) + W_n({\theta }) + 2 R_n({\theta }), \end{aligned}$$

where \(R_n(\theta )\) is the cross product term. The following lemma about \(W_n\) is found to be useful in deriving various results in the sequel. Let

$$\begin{aligned}&\gamma (\theta ) = \int \int \sigma ^2_{\theta }(x,y) \mathrm{d}G(x) \mathrm{d}G(y),\quad A_N(\theta ) = \frac{1}{N} \int \sigma _\theta ^2(z) \mathrm{d}G(z). \end{aligned}$$

Lemma 1

Suppose (A1), (A2), (A5), (F1), (H1), (K), and (W1) hold. Then for every \(\theta \in \varTheta \) for which \(\mu (x)=m_\theta (x), x\in {\mathcal C}\),

$$\begin{aligned} N\, (W_n(\theta ) - A_N(\theta )) \rightarrow _d {\mathcal N}_1(0, 2\gamma (\theta )). \end{aligned}$$

3.1 Consistency of \(\hat{\theta }_n\)

In this subsection, we shall establish the consistency of the m.d. estimators \(\hat{\theta }_n\) for \(\theta _0\). Many details below are similar to those in KN and KS. For a \(\nu \in L_2(G),\) let

$$\begin{aligned}&\rho (\nu , H_\theta ) = \int (\nu - H_\theta )^2 \mathrm{d}G, \quad T(\nu ) = \text {argmin}_{\theta }\, \rho (\nu , H_\theta ). \end{aligned}$$
(3.1)

The proof of the following lemma is similar to that of Lemma 3.2 and Corollary 3.1 of KS. Details are left out for the sake of brevity.

Lemma 2

Suppose (A1), (A2), (A5), (F1), (H1), (H3), (K) and (W1) hold. If T(H) is unique, then \(\hat{\theta }_n = T(H) + o_p(1).\) If, in addition \(H_0\) and (H2) hold, then \(\hat{\theta }_n \rightarrow _p \theta _0\).

3.2 Asymptotic normality of \(\hat{\theta }_n\)

Here we present the asymptotic normality result about \(\hat{\theta }_n\) under \(H_0\).

Theorem 1

Suppose (A1)–(A3), (A5), (F1), (F2), (H1)–(H6), (K), (W1), (W2) and \(H_0\) hold. Then \( \sqrt{n} (\hat{\theta }_n - \theta _0) \rightarrow _d {\mathcal N}_q \Big (0,\varSigma _0^{-1}(\varSigma _1 + \lambda ^{-1}\varSigma _2)\varSigma _0^{-1}\Big ), \) where \(\varSigma _0\) is given in (H6) and

$$\begin{aligned}&\varSigma _1 = \int \frac{(\sigma _\varepsilon ^2 + \tau ^2(u))\dot{H}_{\theta _0}(u) \dot{H}_{\theta _0}^T(u) g^2(u)}{f_Z(u)}\mathrm{d}u,\nonumber \\&\varSigma _2 = \int \sigma _{\theta _0}(x,y) \dot{H}_{\theta _0}(x) \dot{H}_{\theta _0}^T(y) \mathrm{d}G(x) \mathrm{d}G(y). \end{aligned}$$
(3.2)

Note that the asymptotic covariance matrix of \(\sqrt{n}(\hat{\theta }_n-\theta _0)\) is determined by \(\varSigma _1\), \(\varSigma _2\) and the limit \(\lambda \) of N/n. The matrix \(\varSigma _1\) represents the variation due to the Berkson measurement error when \(f_\eta \) is known. From Koul and Song (2009), we recall that \(\sqrt{n}({\tilde{\theta }}_n -\theta _0)\rightarrow _d {\mathcal N}_q(0,\varSigma _0^{-1}\varSigma _1\varSigma _0^{-1})\). The matrix \(\varSigma _2\) represents the additional variation due to the estimation of \(H_{\theta }\) by \({\widehat{H}}_{\theta }\) when \(f_\eta \) is unknown. Moreover, the asymptotic covariance decreases as N/n increases. When \(N/n \rightarrow \infty \), that is, when the validation sample size N is sufficiently large compared to the primary sample size n, the above asymptotic covariance unsurprisingly reduces to that of the known-\(f_\eta \) case.

Remark 1

Here we shall verify that the quantities \(\varSigma _1\) and \(\varSigma _2\) of (3.2) are well defined under the given assumptions. By (A2) and the compactness of \({\mathcal C}\), \(\tau ^2 (u)\) is bounded on \({\mathcal C}\). Assumption (H6) further implies that \(0<a^T\varSigma _1a<\infty ,\) for all nonzero \(a\in {\mathbb R}^q\).

Next, consider \(\varSigma _2\). The Cauchy–Schwarz inequality implies that \(|\sigma _{\theta }(x,y)| \le \sigma _{\theta }(x) \sigma _{\theta }(y)\) for all \(x, y \in {\mathbb R}^p, \theta \in \varTheta \), and that for any \(a\in {\mathbb R}^q\),

$$\begin{aligned} |a^T \varSigma _2 a|\le & {} \int \big |\sigma _{\theta _0}(x,y)\big | \, \big | a^T \dot{H}_{\theta _0}(x) \dot{H}_{\theta _0}^T(y) a\big | \mathrm{d}G(x) \mathrm{d}G(y) \\\le & {} \int \sigma _{\theta _0}(x) \sigma _{\theta _0}(y) \Vert a^T \dot{H}_{\theta _0}(x)\Vert \, \Vert a^T\dot{H}_{\theta _0}(y)\Vert \mathrm{d}G(x) \mathrm{d}G(y) \\= & {} \left( \int \sigma _{\theta _0}(x) \Vert a^T \dot{H}_{\theta _0}(x)\Vert \mathrm{d}G(x)\right) ^2\\\le & {} \Vert a\Vert ^2 \int \sigma _{\theta _0}^2(x) \mathrm{d}G(x) \int \Vert \dot{H}_{\theta _0}(x)\Vert ^2 \mathrm{d}G(x). \end{aligned}$$

Hence assumptions (A5) and (H6) ensure that the entries of \(\varSigma _2\) exist and are finite. Moreover, as seen in the proof of the theorem in the supplement, \(\varSigma _2\) is positive definite.

Remark 2

Here we shall discuss some of the assumptions of Sect. 2, give examples of \(m_\theta \)’s that satisfy the assumptions (H3)–(H5) and provide explicit expressions of \(\varSigma _2\).

In the Berkson model (1.1), \(\varepsilon \) and \(\eta \) are typically assumed to be independent to ensure the identifiability of the model, while Z and \(\eta \) can be correlated. One of the entities we have to estimate consistently for implementing our methodology is \(H_\theta (z)\). Under the assumption of independence of Z and \(\eta \), it is relatively easy to show that the proposed estimator \({\widehat{H}}_\theta (z)\) is consistent for \( H_\theta (z)\). In the case of correlated Z and \(\eta \), one could instead use a Nadaraya–Watson type estimator of \(H_\theta (z)\). However, we refrain from doing this in the current paper for the sake of brevity.

As for the validation sample assumed in (A1), two types of validation data arise in practice. In the first type, an external validation sample is collected after the primary study, for instance because measurement errors were not accounted for in the primary study. In this case, the validation study is usually carried out on a sample different from the primary sample, so the validation data can reasonably be treated as independent of the primary data, as assumed in (A1). A real data example dealing with a breast cancer study can be found in Yi et al. (2015). Based on this independence, the theory developed for two-sample statistics in Sepanski and Lee (1995) and Geng and Koul (2017) can be applied to derive the asymptotic results. In the second type, the validation study is carried out simultaneously with the primary study, so that the validation sample is a subset of the primary sample. In this case, the minimum distance idea is still applicable; however, the results in this paper would not hold due to the lack of independence between the two samples, and different assumptions and theory would be needed. This paper focuses on the first type of validation data, as stipulated in (A1).

Regarding the identifiability assumption (H2) on \(H_\theta \), various conditions can be imposed to ensure that (H2) holds. For instance, the following two assumptions together imply (H2):

  1. (1)

    \(m_{\theta _1}(x) = m_{\theta _2}(x)\), for a.e. \(x\in \mathbb {R}^p\), implies \(\theta _1 = \theta _2\).

  2. (2)

    The location family \(\{f_\eta (\cdot - z), z \in \mathbb {R}^p\}\) is complete. More details can be found in Koul and Song (2009). Furthermore, it can be easily verified that all the examples of \(m_\theta \)’s given below satisfy (1).

Example 1

(The linear and polynomial cases) Suppose \(q=p\), \(m_\theta (x) = \theta ^T x\), \(\theta , x\in \mathbb {R}^p\) and \(E|X|<\infty \), where for any vector \(x=(x_1,\cdots , x_p)^T\in {\mathbb R}^p\), \(|x|=\sum _{j=1}^p|x_j|\). Then \(H_\theta (z) = \theta ^T z\) is a known function. In this case, there is no need to estimate this function and one can also use \({\tilde{\theta }}_n\) as an m.d. estimator of \(\theta \). This is similar in spirit to the fact that the classical least squares estimator, obtained by regressing Y on Z, continues to be unbiased and consistent in the Berkson linear model. See Remark 3 for an asymptotic equivalence between \(\hat{\theta }_n\) and \({\tilde{\theta }}_n\).

In the polynomial regression of order p, \(q=p + 1\) and \(m_\theta (x)=\theta ^T \ell (x)\), \(x \in \mathbb {R}\), where \(\theta = (\theta _1, \ldots , \theta _{p+1})^T\) and \(\ell (x) := (1,x,\ldots ,x^p)^T\) such that \(E\Vert \ell (X)\Vert <\infty \). Then

$$\begin{aligned}&L(z):= E(\ell (X)|Z=z) = (1, z, E(z+\eta )^2,\ldots , E(z+\eta )^p)^T,\quad H_\theta (z)=\theta ^T L (z). \end{aligned}$$

This model is a simple deviation from the linear model, yet one already sees the need to estimate \(H_\theta (z)\). Given the validation data, an estimate of \(H_\theta (z)\) in this case is given by

$$\begin{aligned} {\widehat{H}}_\theta (z)= & {} \frac{1}{N} \sum _{k=1}^N m_\theta (z + {\widetilde{\eta }}_k) = \frac{1}{N} \sum _{k=1}^N \big [\theta _1 + \theta _2 (z + {\widetilde{\eta }}_k) + \theta _3 (z + {\widetilde{\eta }}_k)^2 \\&+\cdots + \theta _{p+1} (z + {\widetilde{\eta }}_k)^p \big ]. \end{aligned}$$

Here (H3) is satisfied with \(r = \Vert L\Vert \) and \(\beta =1\). Furthermore, \(\dot{m}_\theta (x) = \ell (x)\) and \(\dot{H}_\theta (z) = L(z)\) for all \(\theta \in \varTheta \). Therefore, similar to the linear case, (H4) and (H5) hold. Moreover,

$$\begin{aligned}&\displaystyle \sigma _{\theta }(x,y) = \theta ^T \big [E\ell (x+ \eta )\ell ^T (y + \eta ) - L (x) L^T(y)\big ] \theta , \\&\displaystyle \varSigma _2 = \int \theta _0^T [E\ell (x+ \eta )\ell ^T (y + \eta ) - L (x) L ^T(y)] \theta _0 \, L(x) L^T(y) \mathrm{d}G(x) \mathrm{d}G(y). \end{aligned}$$
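
For concreteness, a short numerical sketch of \({\widehat{H}}_\theta \) and its population counterpart for a quadratic regression function is given below; the parameter values, the choice \(\sigma _\eta =0.2\) and the simulated validation residuals are purely illustrative assumptions, but they make explicit how \(H_\theta \) depends on the (unknown) moments of \(\eta \).

```python
# Sketch of H_hat_theta for a quadratic m_theta, with simulated validation residuals.
import numpy as np

rng = np.random.default_rng(1)
sigma_eta = 0.2
eta_tilde = rng.normal(0.0, sigma_eta, size=5000)    # validation residuals X_tilde - Z_tilde

def H_hat_quadratic(theta, z):
    # m_theta(x) = theta[0] + theta[1]*x + theta[2]*x**2, averaged over x = z + eta_tilde_k
    x = z + eta_tilde
    return np.mean(theta[0] + theta[1] * x + theta[2] * x**2)

# Population counterpart: H_theta(z) = theta[0] + theta[1]*z + theta[2]*(z**2 + Var(eta)),
# which involves the second moment of eta and hence is unknown when f_eta is unknown.
theta, z = (1.0, 0.5, -0.3), 0.7
print(H_hat_quadratic(theta, z))
print(theta[0] + theta[1] * z + theta[2] * (z**2 + sigma_eta**2))
```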

Example 2

(The nonlinear case) In biochemistry, one of the well-known models for enzyme kinetics relates enzyme reaction rate to the concentration of a substrate x by the formula \(\alpha _0 x/(\theta + x), \alpha _0>0, \theta>0, x>0.\) This is the so-called Michaelis–Menten model, see Bates and Watts (1998). The ratio \(\gamma _0 = \alpha _0/\theta \) is defined as the catalytic efficiency that measures how efficiently an enzyme converts a substrate into product. When \(\gamma _0\) is known, the function can be written as

$$\begin{aligned} m_{\theta }(x) := \frac{\gamma _0 \theta x}{\theta + x}, \quad \theta>0, \quad x>0. \end{aligned}$$
(3.3)

We will verify that this nonlinear function satisfies (H3)–(H5).

Regarding (H3), as argued in KS, it suffices to verify that (H3) holds with \(H_\theta \) replaced by \(m_\theta \). Here, direct calculation shows that

$$\begin{aligned} |m_{\theta _1}(x) - m_{\theta _0}(x)| = \frac{\gamma _0 x^2 |\theta _1 - \theta _0|}{(\theta _0 + x)(\theta _1 + x)} \le \gamma _0 |\theta _1 - \theta _0|. \end{aligned}$$

Hence (H3) holds for the \(m_\theta \) of (3.3).

Furthermore, suppose for each \(x\in {\mathbb R}^p\), the \(q\times q\) matrix \(\ddot{m}_\theta (x):= \partial ^2 m_\theta (x)/\partial \theta ^2\) exists for all \(\theta \) in a neighborhood \(U_0\) of \(\theta _0\) and \(\Vert \ddot{m}_\theta (x)\Vert \le C\), for all \(\theta \in U_0\) and \(x\in {\mathbb R}^p\), where the constant C may depend on \(\theta _0\). Then, by the mean value theorem, with probability 1, for all \(1\le i\le n\), \(N\ge 1\),

$$\begin{aligned}&\frac{\big | \frac{1}{N}\sum \nolimits _{k=1}^N [m_{\theta }(Z_i + {\widetilde{\eta }}_k) - m_{\theta _0}(Z_i + {\widetilde{\eta }}_k) - (\theta -\theta _0)^T \dot{m}_{\theta _0}(Z_i + {\widetilde{\eta }}_k)]\big |}{\Vert \theta - \theta _0\Vert } \le C\Vert \theta - \theta _0\Vert . \end{aligned}$$

Now apply this with \(\theta =\theta _n\), where \(\theta _n\) is any consistent estimator of \(\theta _0\), to conclude that (H4) holds.

In particular, for the function \(m_\theta \) of (3.3), \(p=1=q\) and \(\ddot{m}_{\theta }(x) = - 2 \gamma _0 x^2/(\theta + x)^3\) is bounded for \(\theta >0\) and \(x>0\), so (H4) holds in this case.

As for (H5), with \(\sqrt{nh^p}|\theta - \theta _0|\le a\) and \(\theta ^*\) falling between \(\theta \) and \(\theta _0\), we have

$$\begin{aligned}&\sup _{i,k,\theta } h^{-p/2} |\dot{m}_{\theta }(Z_i + {\widetilde{\eta }}_k) - \dot{m}_{\theta _0}(Z_i + {\widetilde{\eta }}_k)| = \sup _{i,k,\theta ^* } h^{-p/2} |\ddot{m}_{\theta ^*}(Z_i + {\widetilde{\eta }}_k) (\theta - \theta _0)| \\&\quad \le \sup _{\theta } C' h^{-p/2} |\theta - \theta _0| = O_p(h^{-p/2}/\sqrt{nh^p}) = O_p\big (1/(\sqrt{n} h^p)\big )= o_p(1), \end{aligned}$$

where \(C'\) is the upper bound for the second derivative \(\ddot{m}_{\theta }(x)\). Therefore, (H5) is satisfied.

Another nonlinear example is the exponential function \(m_\theta (x) = e^{\theta x}\), with \(\theta , x \in \mathbb {R}\). In practice, it is reasonable to assume that both \(\varTheta \) and the domain of X are bounded subsets of \({\mathbb R}\), i.e., \(|\theta |\le C_1\) and \(|x| \le C_2\). To verify (H3), it suffices to show that this condition holds with \(H_\theta (z)\) replaced by \(m_\theta (x)\). With \(\theta ^*\) falling between \(\theta _1\) and \(\theta _2\), we obtain

$$\begin{aligned} |m_{\theta _2}(x) - m_{\theta _1}(x)|= |\dot{m}_{\theta ^*}(x)(\theta _2 - \theta _1)| \le (|x|e^{C_1|x|})|\theta _2 - \theta _1| := r(x)|\theta _2 - \theta _1|. \end{aligned}$$

Therefore, (H3) holds for the exponential regression function. Moreover, the second derivative \(\ddot{m}_{\theta } (x) = x^2 e^{\theta x}\) is bounded by the constant \(C_2^2 e^{C_1 C_2}\). Hence an argument similar to that for (3.3) yields that the exponential function also satisfies (H4) and (H5). The expression of \(\varSigma _2\) is given in Sect. 5.1.

Remark 3

(Connection between \(\hat{\theta }_n\) and \({\tilde{\theta }}_n\) in linear regression) Here we shall show, under some conditions, that \(\hat{\theta }_n-{\tilde{\theta }}_n=o_p(n^{-1/2})\) in the linear regression model

$$\begin{aligned} p=q, \quad \mu (x) = m_{\theta _0}(x) = \theta _0^T x, \quad x\in {\mathcal C}\subset {\mathbb R}^p, \quad \text {for some } \theta _0 \in \varTheta \subset {\mathbb R}^p.\qquad \end{aligned}$$
(3.4)

In this case, \(H_\theta (z)=\theta ^T z\) and by solving the equation \(\partial {\widehat{M}}_n(\theta )/\partial \theta = 0\), we obtain \(B_n\hat{\theta }_n = A_n\), where

$$\begin{aligned} A_n= & {} \int \left[ \frac{1}{n} \sum _{i=1}^nK_{hi}(z)Y_i\right] \left[ \frac{1}{n} \sum _{i=1}^nK_{hi}(z) (Z_i + \bar{\eta }) \right] \mathrm{d}\hat{\varphi }(z),\\ B_n= & {} \int \left[ \frac{1}{n} \sum _{i=1}^nK_{hi}(z) (Z_i + \bar{\eta })\right] \left[ \frac{1}{n} \sum _{i=1}^nK_{hi}(z) (Z_i + \bar{\eta })^T\right] \mathrm{d}\hat{\varphi }(z), \end{aligned}$$

with \(\bar{\eta }= N^{-1}\sum _{k=1}^N {\widetilde{\eta }}_k\). Similarly, \({\tilde{B}}_n {\tilde{\theta }}_n = {\widetilde{A}}_n\), where

$$\begin{aligned} {\widetilde{A}}_n= & {} \int \left[ \frac{1}{n} \sum _{i=1}^nK_{hi}(z)Y_i\right] \left[ \frac{1}{n} \sum _{i=1}^nK_{hi}(z) Z_i \right] \mathrm{d}\hat{\varphi }(z), \\ {\widetilde{B}}_n= & {} \int \left[ \frac{1}{n} \sum _{i=1}^nK_{hi}(z) Z_i\right] \left[ \frac{1}{n} \sum _{i=1}^nK_{hi}(z) Z_i^T\right] \mathrm{d}\hat{\varphi }(z). \end{aligned}$$

Roughly speaking, because \(\bar{\eta }\rightarrow _p 0\), \(A_n - {\widetilde{A}}_n = o_p(1)\), \(B_n - {\widetilde{B}}_n = o_p(1)\) and hence \(\hat{\theta }_n - {\tilde{\theta }}_n \rightarrow _p 0\). Furthermore, under some specific conditions, both \(\hat{\theta }_n\) and \({\tilde{\theta }}_n\) can achieve the same asymptotic efficiency. We present two such assumptions here.

  1. (A6)

    \(E\eta ^2<\infty \), \(\tau _1(z) := E\big (|\varepsilon |\big |Z=z\big )\) is a.e. (G) continuous.

  2. (A7)

    \(\nu _G:= \int z \mathrm{d}G(z) = 0\), \(\int zz^T \mathrm{d}G(z)\) is positive definite.

Proposition 1

Suppose (1.1) and (3.4) hold. In addition, suppose (A1), (F1), (K), (W1), (A6) and (A7) hold. Then \(\sqrt{n} (\hat{\theta }_n - {\tilde{\theta }}_n) \rightarrow _p 0\).

Proof

For simplicity of exposition, we give details for the case \(p=1\) only. Then \({\widetilde{B}}_n = \int \big [n^{-1}\sum _{i=1}^n K_{hi}(z)Z_i\big ]^2 \mathrm{d}\hat{\varphi }(z)\). By (2.1), (2.2), (2.3) and direct calculations, \({\widetilde{B}}_n=\kappa _G+o_p(1),\) where \(\kappa _G=\int z^2 \mathrm{d}G(z)\). By (A7), \(\kappa _G>0\). Then \({\tilde{\theta }}_n={\widetilde{B}}_n^{-1} {\widetilde{A}}_n\) is well defined for all sufficiently large n, and the consistency of \({\tilde{\theta }}_n\) yields that \({\widetilde{A}}_n = O_p(1)\). We shall shortly show that

$$\begin{aligned} \mathrm{(a)} \quad \sqrt{n} \big (A_n - {\widetilde{A}}_n\big ) = o_p(1), \quad \mathrm{(b)} \quad B_n - {\widetilde{B}}_n = o_p(n^{-1/2}). \end{aligned}$$
(3.5)

Then for all sufficiently large n, \(\hat{\theta }_n=B_n^{-1}A_n\) and

$$\begin{aligned} \sqrt{n} (\hat{\theta }_n - {\tilde{\theta }}_n)= & {} \frac{\sqrt{n} (A_n {\widetilde{B}}_n - {\widetilde{A}}_n B_n)}{B_n {\widetilde{B}}_n} = \frac{\sqrt{n} [A_n {\widetilde{B}}_n - {\widetilde{A}}_n ({\widetilde{B}}_n + o_p(n^{-1/2}))]}{ {\widetilde{B}}_n ({\widetilde{B}}_n + o_p(n^{-1/2}))} \\= & {} \frac{\sqrt{n} (A_n - {\widetilde{A}}_n) {\widetilde{B}}_n - o_p({\widetilde{A}}_n)}{\kappa _G^2 + o_p(1)} = o_p(1). \end{aligned}$$

To prove (3.5) (a), rewrite

$$\begin{aligned} \sqrt{n} (A_n - {\widetilde{A}}_n) = \sqrt{n} \bar{\eta }\int \left[ \frac{1}{n} \sum _{i=1}^nK_{hi}(z)Y_i\right] \left[ \frac{1}{n} \sum _{i=1}^nK_{hi}(z) \right] \mathrm{d}\hat{\varphi }(z) := \sqrt{n} \bar{\eta }\, {\mathcal A}_n. \end{aligned}$$

By (A6) and the central limit theorem, \(\sqrt{n} \bar{\eta }= O_p(1)\). It thus suffices to show that \({\mathcal A}_n = o_p(1)\). Let \({\mathcal A}_n^*\) denote the \({\mathcal A}_n\) with \(\hat{\varphi }\) replaced by \(\varphi \). Then (2.2), \(E(|Y|\big |Z = z) \le |\theta _0^T z| + \tau _1(z)\), assumption (A6) and direct calculations yield that

$$\begin{aligned} |{\mathcal A}_n - {\mathcal A}_n^*| = o_p\left( \int \frac{1}{n^2} \sum \limits _{i,j=1}^n K_{hi}(z)K_{hj}(z) |Y_i| \mathrm{d}\varphi (z)\right) = o_p(1). \end{aligned}$$

Next, rewrite

$$\begin{aligned} {\mathcal A}_n^*= & {} \frac{1}{n^2} \sum _{i=1}^n\int K_{hi}^2(z)Y_i \mathrm{d}\varphi (z) + \frac{1}{n^2} \sum _{i=1}^n\sum _{j\ne i=1}^n \int K_{hi}(z) K_{hj}(z) Y_i \mathrm{d}\varphi (z)\\:= & {} {\mathcal A}_{n1} + {\mathcal A}_{n2}. \end{aligned}$$

Calculation of moments shows that \( E ({\mathcal A}_{n1}) = O((nh)^{-1})\), \(E ({\mathcal A}_{n2}) = \theta _0 \nu _G + o(1)\), \(\text {Var}({\mathcal A}_{n1}) = O(n^{-3}h^{-2})\) and \(\text {Var}({\mathcal A}_{n2}) = O(n^{-1}).\) Hence \({\mathcal A}_n^* = \theta _0 \nu _G + o_p(1)\), and (A7) implies (3.5)(a).

Now we prove (3.5)(b). Let

$$\begin{aligned} {\mathcal B}_n:= \int \left[ \frac{1}{n}\sum _{i=1}^nK_{hi}(z) Z_i\right] \left[ \frac{1}{n}\sum _{i=1}^nK_{hi}(z) \right] \mathrm{d}\hat{\varphi }(z). \end{aligned}$$

Then, by (2.3),

$$\begin{aligned} B_n - {\widetilde{B}}_n= & {} 2 \bar{\eta }{\mathcal B}_n + \bar{\eta }^2 \int \left[ \frac{1}{n}\sum _{i=1}^nK_{hi}(z) \right] ^2 \mathrm{d}\hat{\varphi }(z) = 2 \bar{\eta }\, {\mathcal B}_n + O_p(n^{-1}). \end{aligned}$$

Argue as in the analysis of \({\mathcal A}_n\) to obtain that \( {\mathcal B}_n = \nu _G + o_p(1). \) This fact and \(\sqrt{n} \bar{\eta }= O_p(1)\) imply that \(\sqrt{n} (B_n - {\widetilde{B}}_n) = 2 (\sqrt{n} \bar{\eta }) \nu _G + o_p(1)\), which together with (A7) implies (3.5)(b). This also completes the proof of the proposition.

4 Testing

In this section, we establish the asymptotic distributions of the m.d. test statistics \({\widehat{M}}_n(\hat{\theta }_n)\) under the null and certain fixed and local alternative hypotheses. Let

$$\begin{aligned}&\xi _i = Y_i - H_{\theta _0} (Z_i), \quad \hat{\xi }_i = Y_i - {\widehat{H}}_{\hat{\theta }_n}(Z_i), \\&{\widetilde{C}_n} = \frac{1}{n^2} \sum _{i=1}^n\int K_{hi}^2(z) \xi _i^2 \mathrm{d}\varphi (z), \quad {\widetilde{\varGamma }}_{n} = \frac{2h^p}{n^2} \sum _{i \ne j} \left( \int K_{hi}(z) K_{hj}(z) \xi _i \xi _j \mathrm{d}\varphi (z)\right) ^2, \\&{\widehat{C}}_n = \frac{1}{n^2} \sum _{i=1}^n\int K_{hi}^2(z) \hat{\xi }_i^2 \mathrm{d}\hat{\varphi }(z), \quad {\widehat{\varGamma }}_n = \frac{2h^p}{n^2} \sum _{i \ne j} \left( \int K_{hi}(z) K_{hj}(z) \hat{\xi }_i \hat{\xi }_j \mathrm{d}\hat{\varphi }(z)\right) ^2,\\&{\widehat{{\mathcal T}}}_n := nh^{p/2}{\widehat{\varGamma }}^{-1/2}_n\big ({\widehat{M}}_n(\hat{\theta }_n) - {\widehat{C}}_n\big ). \end{aligned}$$

Because \(\xi =Y-H_{\theta _0}(Z) = \varepsilon + m_{\theta _0}(X) - H_{\theta _0}(Z)\) and because \(Z,\eta \) and \(\varepsilon \) are mutually independent, \(E(\xi ^2|Z=z) = \sigma _\varepsilon ^2 + \tau ^2(z)\), where \(\tau ^2\) is as in (A2). Since \({\mathcal C}\) is compact, the continuity of \(\tau ^2\) implies that it is bounded on \({\mathcal C}\) and hence \(\int E\big (\xi ^2|Z=z\big ) \mathrm{d}G(z) < \infty \).
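
A minimal numerical sketch of \({\widehat{C}}_n\), \({\widehat{\varGamma }}_n\) and \({\widehat{{\mathcal T}}}_n\) for \(p=1\) is given below. It takes as input the residuals \(\hat{\xi }_i = Y_i - {\widehat{H}}_{\hat{\theta }_n}(Z_i)\) (computed, e.g., as in the estimation sketch of Sect. 1) and, as there, approximates the G-integrals by Riemann sums over a grid on \({\mathcal C}\); all function names and numerical choices are illustrative.

```python
# Sketch of the standardized test statistic T_hat_n for p = 1.
import numpy as np

def epanechnikov(u):
    return 0.75 * (1.0 - u**2) * (np.abs(u) <= 1.0)

def T_hat(xi, Z, h, w, z_grid, dG, p=1):
    # xi_i = Y_i - H_hat_{theta_hat_n}(Z_i) are the residuals from the fitted null model
    n = len(Z)
    Kh = epanechnikov((z_grid[:, None] - Z[None, :]) / h) / h**p   # K_hi(z) on a grid over C
    Kw = epanechnikov((z_grid[:, None] - Z[None, :]) / w) / w**p
    f_w = Kw.mean(axis=1)                                          # density estimate f_hat_w(z)
    dphi = dG / f_w**2                                             # d(phi_hat) = f_hat_w^{-2} dG
    U = (Kh * xi[None, :]).mean(axis=1)                            # n^{-1} sum_i K_hi(z) xi_i
    M = np.sum(U**2 * dphi)                                        # M_hat_n(theta_hat_n)
    C = np.sum((Kh**2 * xi[None, :]**2).sum(axis=1) * dphi) / n**2 # C_hat_n
    B = Kh * xi[None, :]
    A = np.einsum('gi,gj,g->ij', B, B, dphi)                       # int K_hi K_hj xi_i xi_j d(phi_hat)
    np.fill_diagonal(A, 0.0)                                       # keep only i != j terms
    Gamma = 2.0 * h**p * np.sum(A**2) / n**2                       # Gamma_hat_n
    return n * h**(p / 2) * (M - C) / np.sqrt(Gamma)

# The test rejects H_0 at asymptotic level 0.05 when |T_hat(...)| > 1.96.
```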

4.1 Asymptotic null distribution of \({\widehat{{\mathcal T}}}_n\)

The following theorem states the asymptotic distribution of the proposed m.d. tests under the null hypothesis \(H_0\).

Theorem 2

Suppose (A1), (A2), (A4), (A5), (F1)–(F2), (K), (H1)–(H6), (W1) and (W2) hold. Then, under \(H_0\), \({\widehat{{\mathcal T}}}_n\rightarrow _d {\mathcal N}_1(0,1)\).

Consequently, the test that rejects \(H_0\) whenever \(\big |{\widehat{{\mathcal T}}}_n\big |>z_{\alpha /2}\) is of the asymptotic size \(\alpha \), where \(z_\alpha \) is the upper 100\(\alpha \)th percentile of \({\mathcal N}_1(0,1)\).

Theorem 2 shows that the ratio parameter N/n does not play a role in the limiting null distribution of \({\widehat{{\mathcal T}}}_n\). This finding is also reflected in the finite sample simulation study of Sect. 5.2 below.

Define

$$\begin{aligned}&U_{n1}(z) = \frac{1}{n} \sum _{i=1}^nK_{hi}(z)[Y_i - H_{\theta _0} (Z_i)], \nonumber \\&U_{n2}(z) = \frac{1}{n} \sum _{i=1}^nK_{hi}(z) [{\widehat{H}}_{\theta _0}(Z_i) - H_{\theta _0}(Z_i)], \nonumber \\&V_n(z,\theta ) = \frac{1}{n} \sum _{i=1}^nK_{hi}(z) [{\widehat{H}}_\theta (Z_i) - {\widehat{H}}_{\theta _0}(Z_i)]. \end{aligned}$$
(4.1)

The following decomposition is important to study the asymptotic behavior of the proposed m.d. tests. Rewrite

$$\begin{aligned} {\widehat{M}}_n(\hat{\theta }_n)= & {} \int \left\{ \frac{1}{n} \sum _{i=1}^nK_{hi}(z) \big [Y_i - H_{\theta _0} (Z_i) + H_{\theta _0} (Z_i) - {\widehat{H}}_{\theta _0}(Z_i)\right. \\&\left. + {\widehat{H}}_{\theta _0}(Z_i) - {\widehat{H}}_{\hat{\theta }_n}(Z_i)\big ]\right\} ^2 \mathrm{d}\hat{\varphi }(z) \\= & {} \int [U_{n1}(z) - U_{n2}(z) - V_{n}(z,\hat{\theta }_n)]^2 \mathrm{d}\hat{\varphi }(z) \\= & {} \int [U_{n1}(z) - U_{n2}(z)]^2 \mathrm{d}\hat{\varphi }(z) + \int [V_{n}(z,\hat{\theta }_n)]^2 \mathrm{d}\hat{\varphi }(z) \\&\quad - 2 \int [U_{n1}(z) - U_{n2}(z)] V_{n}(z,\hat{\theta }_n) \mathrm{d}\hat{\varphi }(z) \\:= & {} J_n + {\widehat{D}}_n(\hat{\theta }_n) - 2 K_n(\hat{\theta }_n), \quad \text {say}. \end{aligned}$$

The following three lemmas yield the conclusion of Theorem 2 in a routine fashion.

Lemma 3

Suppose assumptions (A1), (A2), (A4), (A5), (F1)–(F2), (K), (H1)–(H6), (W1), (W2) and \(H_0\) hold. Then

$$\begin{aligned} nh^{p/2}{\widetilde{\varGamma }}^{-1/2}_n (J_n - {\widetilde{C}}_n) \rightarrow _d {\mathcal N}_1 (0,1). \end{aligned}$$

Lemma 4

Under the assumptions of Lemma 3, the following holds.

$$\begin{aligned} \mathrm{(a)}\quad nh^{p/2}{\widehat{D}}_n(\hat{\theta }_n) = o_p(1), \quad \mathrm{(b)} \quad nh^{p/2} K_n(\hat{\theta }_n) = o_p(1). \end{aligned}$$

Lemma 5

Suppose assumptions (A1), (A2), (F1), (K), (H1)–(H6), (W1) with \(\lambda <\infty \), (W2) and \(H_0\) hold. Then

$$\begin{aligned} \mathrm{(a)}\quad nh^{p/2}({\widehat{C}}_n - {\widetilde{C}}_n) = o_p(1), \quad \mathrm{(b)} \quad {\widehat{\varGamma }}_n - {\widetilde{\varGamma }}_n = o_p(1). \end{aligned}$$

4.2 Consistency

Next, we shall briefly discuss the consistency of these tests. Recall (3.1). Let \(\theta _n\) be a consistent estimator of T(H), \( \xi _i = Y_i - H(Z_i), \xi _{ni} = Y_i - {\widehat{H}}_{\theta _n}(Z_i),\) and let

$$\begin{aligned} C_n = \frac{1}{n^2} \sum _{i=1}^n\int K_{hi}^2(z) \xi _{ni}^2 \mathrm{d}\hat{\varphi }(z), \quad \varGamma _n = \frac{2h^p}{n^2}\sum \limits _{i\ne j=1}^n \left( \int K_{hi}(z)K_{hj}(z) \xi _{ni}\xi _{nj} \mathrm{d}\hat{\varphi }(z)\right) ^2. \end{aligned}$$

Let \( {\mathcal T}_n := nh^{p/2} \varGamma _n^{-1/2} ({\widehat{M}}_n(\theta _n) - C_n)\). Then the theorem below presents the asymptotic behavior of the proposed test under certain alternative hypotheses.

Theorem 3

Suppose (A1), (A2), (A4), (A5), (F1), (F2), (H3), (K), (W1) and (W2) hold and the alternative hypothesis \(H_1: \mu (x) = m(x)\), \(x\in {\mathcal C}\), satisfies \( \inf _\theta \rho (H,H_\theta ) >0\) and T(H) is unique. Then \(|{\mathcal T}_n| \rightarrow _p \infty \) for any consistent estimator \(\theta _n\) of T(H).

By Lemma 2, \(\hat{\theta }_n\) is consistent for T(H); therefore, the above theorem implies that \(|{\widehat{\mathcal {T}}}_n| \rightarrow \infty \) in probability under the same regularity conditions, and the test based on \({\widehat{\mathcal {T}}}_n\) is consistent against the alternative m for which \(\inf _\theta \rho (H,H_\theta ) >0\). The proof of Theorem 3 is similar to that of Theorem 5.1 in KS with slight modifications. The techniques used for analyzing \(W_n(\theta )\) in Lemma 1 and \({\widehat{D}}_n(\theta )\) in the proof of Theorem 1 are enough to produce the conclusions. Details are skipped for the sake of brevity.

4.3 Power at local alternatives

We further investigate the asymptotic power of the proposed test against certain local alternatives. Let a be a known real-valued function with continuous derivative. Define \(A(z) = E(a(X)|Z = z)\) and \(A_2(z) = E([a(X)]^2|Z = z)\), \(z \in {\mathcal C}\). Furthermore, suppose both A and \(A_2\) are continuous on \({\mathcal C}\) and

$$\begin{aligned} \int H_\theta A \mathrm{d}G = 0, \quad \forall \theta \in \varTheta . \end{aligned}$$
(4.2)

We consider a sequence of local alternatives

$$\begin{aligned} {\mathcal H}_{1,n}: \mu (x) = m_{\theta _0}(x) + b_n \, a(x), \quad b_n = 1/\sqrt{nh^{p/2}}. \end{aligned}$$
(4.3)

The asymptotic distribution of \(\hat{\theta }_n\) under \({\mathcal H}_{1,n}\) is given in the following theorem.

Theorem 4

Assume (A1)–(A3), (A5), (F1), (F2), (H1)–(H6), (K), (W1) and (W2) hold. Then under (4.2) and (4.3), \(\sqrt{n}(\hat{\theta }_n - \theta _0) \rightarrow _d {\mathcal N}_q\Big (0, \varSigma _0^{-1}(\varSigma _1 + \lambda ^{-1}\varSigma _2)\varSigma _0^{-1}\Big )\), where \(\varSigma _0\) is given in (H6), \(\varSigma _1\) and \(\varSigma _2\) are defined in (3.2).

The asymptotic distribution of the proposed m.d. tests against the local alternatives \({\mathcal H}_{1,n}\) in (4.3) is presented in the following theorem. Define

$$\begin{aligned} K_2(v) = \int K(v+u)K(u) \mathrm{d}u, \quad \varGamma = 2 \int K_2^2(v)\mathrm{d}v \int [\sigma _\varepsilon ^2 + \tau ^2(z)]^2 g(z) \mathrm{d}\varphi (z). \end{aligned}$$

Theorem 5

Suppose (A1)–(A3), (A5), (F1), (F2), (H1)–(H6), (K), (W1) and (W2) hold. Then under (4.2) and (4.3), \({\widehat{{\mathcal T}}}_n \rightarrow _d {\mathcal N}_1(\varGamma ^{-1/2}\int A^2 \mathrm{d}G, 1)\).

As in Theorem 2, under the chosen local alternative sequences, the limit \(\lambda \) of the sample ratio N/n does not play a critical role in the asymptotic properties of the m.d. test.

Remark 4

(Optimal G) Let \({\mathcal K}(g):= \varGamma ^{-1/2}\int A^2 \mathrm{d}G\). From the above theorem, the asymptotic power of the level \(\alpha \) test based on \({\widehat{{\mathcal T}}}_n\) is

$$\begin{aligned} 1-\varPhi \big (z_{\alpha /2}-{\mathcal K}(g)\big )+ \varPhi \big (-z_{\alpha /2}-{\mathcal K}(g)\big ). \end{aligned}$$

This power is an increasing function of \({\mathcal K}(g)\). Thus it will be maximized by that g which maximizes \({\mathcal K}(g)\). Let \(c= 2 \int K_2^2(v)\mathrm{d}v\) and \(\kappa (z):= \sigma _\varepsilon ^2 + \tau ^2(z)\). Note that \(\kappa (z)\ge \sigma _\varepsilon ^2>0\). Under (F1), \(f_Z(z)>0\) for all \(z\in {\mathcal C}\). Then by the Cauchy–Schwarz inequality, we obtain

$$\begin{aligned} {\mathcal K}(g)= & {} \frac{\int A^2(z) g(z) \mathrm{d}z}{\sqrt{c \int \kappa ^2(z) g^2(z) f_Z^{-2}(z) \mathrm{d}z}} =c^{-1/2}\frac{\int A^2(z) f_Z(z)\kappa ^{-1}(z)\, \kappa (z)g(z)f_Z^{-1}(z) \mathrm{d}z}{\sqrt{\int \kappa ^2(z) g^2(z) f_Z^{-2}(z) \mathrm{d}z}}\\\le & {} c^{-1/2} \left( \int \frac{A^4(z) f_Z^2(z)}{\kappa ^2(z)} \mathrm{d}z\right) ^{1/2}, \end{aligned}$$

with equality holding if, and only if, \(g(z)\propto A^2(z)f_Z^2(z)/\kappa ^2(z)\), for all \(z\in {\mathcal C}\). Since \({\mathcal K}(g)\) is scale invariant, i.e., \({\mathcal K}(bg)={\mathcal K}(g)\), for all \(b>0\), we can take the optimal \(g(z)=A^2(z) f_Z^2(z)/\kappa ^2(z)\).

5 Simulation study

In this section, we present the results of a Monte Carlo study of the proposed estimation and testing procedures for \(p=1, 2\). For \(p=1\), a nonlinear regression model is considered. For \(p=2\), a linear regression model is assumed. Three different values of the ratio N/n, namely 4, 1 and 1/4, are selected to assess its effect on the performance of these inference procedures. Throughout the simulation, \(K(u) = 0.75(1-u^2)I_{(|u|\le 1)}\) for \(p=1\) and \(K(u) = 0.75^2 (1-u_1^2)(1-u_2^2)I_{(|u_1|\le 1, |u_2|\le 1)}\) for \(p=2\). The set \({\mathcal C}\) and the integrating measure G are specified later. All of the results obtained are based on 1000 replications.

We need to determine the two bandwidths for the implementation of the above inference procedures. As mentioned in Sect. 1, one bandwidth used for estimating \(f_Z\) is \(w = c (\log n/n)^{1/(p+4)}\), \(c>0\). We propose to obtain c by minimizing, w.r.t. c, the unbiased cross-validation criterion UCV(w) of Härdle et al. (1990), where

$$\begin{aligned} UCV(w) = \frac{(R(K))^p}{nw^p} + \frac{1}{n(n-1)w^p}\sum \limits _{i\ne j=1}^n (K*K - K)\left( \frac{Z_i-Z_j}{w}\right) , \end{aligned}$$

with \(R(K) = \int K^2(x)\mathrm{d}x\) and \(K*K(x) = \int K(y)K(x-y)\mathrm{d}y\). We applied a grid search to choose the optimal coefficient c starting from 0.1 with step 0.02, i.e.,

$$\begin{aligned} c_n^* := \text {argmin}_{0.1\le c \le 10} \,UCV\left( c (\log n/n)^{1/(p+4)}\right) , \quad w_{opt} = c_n^* (\log n/n)^{1/(p+4)}. \end{aligned}$$

In order for the bandwidth h to satisfy (W2), we used \(h = d\, n^{-1/\varDelta }, \varDelta = \max \{2p, p(p+4)/4\} + 1\). We further adopted the leave-one-out cross-validation approach for the nonparametric regression function estimator \(\hat{\mu }\) based on \(\{(Y_i, Z_i), 1\le i \le n\}\) to select the optimal coefficient d, i.e.,

$$\begin{aligned}&d_n^* = \text {argmin}_{0.1\le d \le 10} \sum _{i=1}^n\big \{Y_i - \hat{\mu }_{-i}(Z_i)\big \}^2, \quad \hat{\mu }_{-i}(z) = \frac{\sum _{j=1,j\ne i}^n K_{hj}(z)Y_j}{\sum _{j=1,j\ne i}^n K_{hj}(z)}, \\&h_{\mathrm{opt}} = d_n^* n^{-1/\varDelta }. \end{aligned}$$
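
The two data-driven choices just described can be implemented, for \(p=1\) and the Epanechnikov kernel, along the following lines; the numerical evaluation of the convolution \(K*K\), the small guard against empty leave-one-out neighbourhoods, and the function names are illustrative choices of ours, not prescriptions from the paper.

```python
# Sketch of the bandwidth selection: UCV for w, leave-one-out CV for h (p = 1).
import numpy as np

def K(u):
    return 0.75 * (1.0 - u**2) * (np.abs(u) <= 1.0)

def K_conv_K(x, grid=np.linspace(-1.0, 1.0, 2001)):
    # numerical evaluation of (K*K)(x) = \int K(y) K(x - y) dy
    return np.sum(K(grid) * K(x - grid)) * (grid[1] - grid[0])

def UCV(w, Z):
    # unbiased cross-validation criterion for the density bandwidth w
    n = len(Z)
    RK = 0.6                                        # R(K) = \int K^2 for the Epanechnikov kernel
    D = (Z[:, None] - Z[None, :]) / w
    off = ~np.eye(n, dtype=bool)
    term = np.array([K_conv_K(d) for d in D[off]]) - K(D[off])
    return RK / (n * w) + term.sum() / (n * (n - 1) * w)

def loo_cv(h, Y, Z):
    # leave-one-out CV criterion for the kernel regression estimate of mu based on (Y_i, Z_i)
    W = K((Z[:, None] - Z[None, :]) / h)
    np.fill_diagonal(W, 0.0)
    mu_loo = (W @ Y) / np.maximum(W.sum(axis=1), 1e-12)
    return np.sum((Y - mu_loo)**2)

def select_bandwidths(Y, Z, p=1):
    n = len(Z)
    grid = np.arange(0.1, 10.0 + 1e-9, 0.02)        # search grid for the coefficients c and d
    w0 = (np.log(n) / n)**(1.0 / (p + 4))
    c_star = min(grid, key=lambda c: UCV(c * w0, Z))
    Delta = max(2 * p, p * (p + 4) / 4) + 1
    h0 = n**(-1.0 / Delta)
    d_star = min(grid, key=lambda d: loo_cv(d * h0, Y, Z))
    return c_star * w0, d_star * h0                  # (w_opt, h_opt)
```

This brute-force grid search is slow but transparent; faster implementations would vectorize the convolution term.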

In order to interpret the performance of the proposed estimator \(\hat{\theta }_n\), we also present the performance of the KS estimator \({\tilde{\theta }}_n\). Recall that in KS, \(f_\eta \) is assumed to be known.

Both the absolute bias and the root mean square error (RMSE) of the two estimators are reported. In both the linear and nonlinear cases, the absolute bias and RMSE decrease as the sample sizes increase. In the linear case, as shown in Example 1, there is no need to estimate \(H_\theta \), and the asymptotic variance of \(\hat{\theta }_n\) is the same as that of \({\tilde{\theta }}_n\). This is also reflected in this finite sample study: in the case \(p=2\), the RMSEs of the components of \(\hat{\theta }_n\) and \({\tilde{\theta }}_n\) in Table 2 are very similar to each other for all the chosen values of N/n. In the nonlinear case, Table 1 shows that the RMSE of \(\hat{\theta }_n\) is larger than that of \({\tilde{\theta }}_n\) and that it decreases as N/n increases from 1/4 to 4.

Table 1 Performance of \(\hat{\theta }_n\) and \({\tilde{\theta }}_n\) in the nonlinear case (5.1) with \(p=1\)
Table 2 Performance of \(\hat{\theta }_n\) and \({\tilde{\theta }}_n\) in the linear case with \(p=2=q\)

We compared the proposed test \({\widehat{{\mathcal T}}}_n\) with the test \({\widetilde{{\mathcal T}}}_n\) of KS, where

$$\begin{aligned}&{{\mathcal C}_{n}} = \frac{1}{n^2} \sum _{i=1}^n\int K_{hi}^2(z) {\tilde{\xi }}_i^2 \mathrm{d}\hat{\varphi }(z), \quad {\tilde{\xi }}_i = Y_i - H_{{\tilde{\theta }}_n} (Z_i),\\&{\tilde{\varGamma }}_n = \frac{2h^p}{n^2} \sum _{i \ne j} \left( \int K_{hi}(z) K_{hj}(z) {\tilde{\xi }}_i {\tilde{\xi }}_j \mathrm{d}\hat{\varphi }(z)\right) ^2, \quad {\widetilde{{\mathcal T}}}_n := nh^{p/2}{\tilde{\varGamma }}^{-1/2}_n\big ( M_n({\tilde{\theta }}_n) - {\mathcal C}_n\big ). \end{aligned}$$

The \({\widetilde{{\mathcal T}}}_n\) test rejects \(H_0\) at the significance level \(\alpha \) whenever \(|{\widetilde{{\mathcal T}}}_n|\ge z_{\alpha /2}\). With the nominal level 0.05, the empirical levels and powers of these two tests are obtained by computing \(\#\{|{\widehat{\mathcal {T}}}_n| \ge 1.96\}/1000\) and \(\#\{|{\widetilde{{\mathcal T}}}_n|\ge 1.96\} /1000\).

5.1 Finite sample performance of \(\hat{\theta }_n\)

In this subsection, we report the findings of a finite sample performance of the estimator \(\hat{\theta }_n\) in nonlinear and linear cases.

The nonlinear case with \(q=1=p\). Here, data are generated from the model (1.1) with

$$\begin{aligned} \mu (x)=m_{\theta _0}(x) = e^{{\theta _0} x}, \quad \theta _0 = -1, \end{aligned}$$
(5.1)

where \(\varepsilon \sim {\mathcal N}_1(0,0.2^2), \eta \sim {\mathcal N}_1(0,0.2^2), Z \sim U[-1,1].\) Then

$$\begin{aligned} H_{\theta _0}(z) = e^{\theta ^2_0\sigma _\eta ^2/2}\,e^{\theta _0 z}, \quad {\widehat{H}}_{\theta _0} (z) = \frac{1}{N}\sum \limits _{k=1}^N e^{\theta _0(z+{\widetilde{\eta }}_k)} = e^{\theta _0 z} \frac{1}{N}\sum _{k=1}^N e^{\theta _0{\widetilde{\eta }}_k}, \end{aligned}$$

and the second term \(\varSigma _2\) in the asymptotic variance is calculated as

$$\begin{aligned} \sigma _{\theta _0}(x,y) = e^{\sigma _\eta ^2}(e^{\sigma _\eta ^2}-1)e^{\theta _0(x+y)},\quad \varSigma _2 = e^{2\sigma _\eta ^2}(e^{\sigma _\eta ^2}-1) \left[ \int (x + \sigma _\eta ^2 \theta _0) e^{2\theta _0 x}\mathrm{d}G(x)\right] ^2. \end{aligned}$$

We used \({\mathcal C}=[-1,1]\) and G equal to the uniform measure on \([-1,1]\).
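
As a quick numerical check of the displayed closed forms, the following sketch simulates validation residuals and compares \({\widehat{H}}_{\theta _0}(z)\) with \(e^{\theta _0^2\sigma _\eta ^2/2}e^{\theta _0 z}\) at a few points of \({\mathcal C}\); the seed and the value of N are arbitrary illustrative choices.

```python
# Monte Carlo check of H_hat_{theta_0} against its closed form in the exponential model.
import numpy as np

rng = np.random.default_rng(2024)
theta0, sigma_eta, N = -1.0, 0.2, 10000
eta_tilde = rng.normal(0.0, sigma_eta, N)            # simulated validation residuals

z = np.linspace(-1.0, 1.0, 5)
H_hat = np.exp(theta0 * z) * np.mean(np.exp(theta0 * eta_tilde))
H_true = np.exp(theta0**2 * sigma_eta**2 / 2.0) * np.exp(theta0 * z)
print(np.round(H_hat, 4))                            # Monte Carlo estimate
print(np.round(H_true, 4))                           # closed form
```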

Table 1 shows little empirical bias in \(\hat{\theta }_n\), and its RMSE decreases as the sample sizes and N/n increase. For \(N/n=4, 1\), the RMSE of \(\hat{\theta }_n\) is similar to that of \({\tilde{\theta }}_n\), while for \(N/n=1/4\), the RMSE of \(\hat{\theta }_n\) is much larger than that of \({\tilde{\theta }}_n\), because of the smaller size of the validation data set.

The linear case with \(q=2=p\). Here we consider the model \(m_\theta (x) = \theta _1 x_1 + \theta _2 x_2\), \(\theta = (\theta _1,\theta _2)^T \in \mathbb {R}^2\), \(x=(x_1,x_2)^T \in \mathbb {R}^2\). The true parameter \(\theta _0 = (0.5,1)^T\) is used to generate the data. Denote \(Z_i = (Z_{i1}, Z_{i2})^T\) and \(\eta _i = (\eta _{i1},\eta _{i2})^T\) for \(1\le i \le n\). Both \(Z_{i1}\) and \(Z_{i2}\) are generated independently from \(U[-1,1]\), while the \(\eta _{i}\) are generated from a bivariate normal distribution \({\mathcal N}_2(0,\varSigma _\eta )\) with \(\varSigma _\eta = (\sigma _{ij})_{i,j=1,2}\), \(\sigma _{11} = 0.1^2, \sigma _{22} = 0.2^2, \sigma _{12} = \sigma _{21} = 0.01\). Then \(X_i = (X_{i1},X_{i2})^T =Z_i+\eta _i.\) The primary data \(\{(Y_i, Z_i), 1\le i\le n\}\) are generated from the above regression function with the error \(\varepsilon \) following \({\mathcal N}_1(0,0.3^2)\). The validation data \(\{{\widetilde{\eta }}_k, 1\le k \le N\}\) are independently simulated from \({\mathcal N}_2(0,\varSigma _\eta )\). The bandwidths h and w are obtained based on the criteria mentioned above. In this case, \({\mathcal C}= [-1,1]^2\) and G is the uniform measure on \([-1,1]^2\). The choices of N/n are the same as in the previous case.
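
For reference, a sketch of this data-generating step is given below; the sample sizes and the seed are illustrative, while the parameter values follow the setup just described.

```python
# Data generation for the linear case with p = q = 2.
import numpy as np

rng = np.random.default_rng(7)
n, N = 200, 200                                      # illustrative primary/validation sizes
theta0 = np.array([0.5, 1.0])
Sigma_eta = np.array([[0.1**2, 0.01], [0.01, 0.2**2]])

Z = rng.uniform(-1.0, 1.0, size=(n, 2))              # observed covariates Z_i
eta = rng.multivariate_normal(np.zeros(2), Sigma_eta, size=n)
X = Z + eta                                          # true, unobserved covariates X_i
Y = X @ theta0 + rng.normal(0.0, 0.3, size=n)        # primary sample (Y_i, Z_i)
eta_tilde = rng.multivariate_normal(np.zeros(2), Sigma_eta, size=N)  # validation residuals
```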

Both bias and RMSE of the estimators \(\hat{\theta }_n = (\hat{\theta }_{n,1}, \hat{\theta }_{n,2})^T\) and \({\tilde{\theta }}_n = ({\tilde{\theta }}_{1}, {\tilde{\theta }}_{2})^T\) are presented in Table 2. It shows small estimation bias and reduced RMSE for increased sample sizes. As noted in Proposition 1, in this linear setup, the asymptotic variances of \(\hat{\theta }_n\) and \({\tilde{\theta }}_n\) are equivalent to each other. This fact is also reflected in this finite sample study by observing that the RMSEs of the components of \(\hat{\theta }_n\) are very similar to those of the components of \({\tilde{\theta }}_n\) for the different chosen values of N / n.

5.2 Test performance

Here we present the performance of a member of the proposed class of m.d. tests based on \({\widehat{{\mathcal T}}}_n\) and a member of the KS tests based on \({\widetilde{{\mathcal T}}}_n\).

Table 3 Empirical levels and powers of \({\widehat{{\mathcal T}}}_n\) and \({\widetilde{{\mathcal T}}}_n\) tests for the nonlinear null model (5.1) with \(p=1=q\)
Table 4 Empirical levels and powers of \({\widehat{{\mathcal T}}}_n\) and \({\widetilde{{\mathcal T}}}_n\) tests under the linear null model with \(p=2=q\)

The case \(q=1=p\). The finite sample performance of these tests is assessed for the nonlinear model (5.1) as the null. Three different alternatives defined below are chosen to obtain the empirical power of a member of the class of the proposed tests.

$$\begin{aligned}&\text {Model 0:}\quad Y = e^{- X} + \varepsilon .\\&\text {Model 1:}\quad Y = e^{- X} - 0.2 X^2 + \varepsilon . \\&\text {Model 2:}\quad Y = e^{- X} + 0.2 \sin (2X) + \varepsilon . \\&\text {Model 3:}\quad Y = e^{- X}I_{(X\le 0.4)} + e^{-0.4} I_{(X>0.4)} + \varepsilon . \end{aligned}$$

The entities G, K, \(f_Z\), \(\eta \) and \(\varepsilon \) are as in the case of \(q=1=p\) in Sect. 5.1.

The empirical levels under model 0 and the empirical powers under models 1, 2, 3 for both the proposed and the KS tests are shown in Table 3 with increasing sample sizes. With the nominal level 0.05, the empirical level of the \({\widehat{{\mathcal T}}}_n\) test is well controlled for the larger sample sizes when \(N/n = 1, 4\). For \(N/n = 1/4\), the empirical level is slightly inflated for small and moderate sample sizes due to the limited validation data, and it decreases toward 0.05 when the sample size increases. The empirical levels of the test \({\widehat{{\mathcal T}}}_n\) are very similar to those of the \({\widetilde{{\mathcal T}}}_n\) test for the chosen larger sample sizes, \(n\ge 200\). This finding is consistent with the theoretical result that the asymptotic null distribution of \({\widehat{{\mathcal T}}}_n\) does not depend on the ratio N / n and is the same as that of \({\widetilde{{\mathcal T}}}_n\).

We also find that the empirical powers of the \({\widehat{{\mathcal T}}}_n\) test for all chosen alternatives are very similar to the empirical powers of the \({\widetilde{{\mathcal T}}}_n\) test for all the three choices of N / n and for moderate-to-large sample sizes. This finding is in some sense consistent with the results in Sect. 4 that the asymptotic distribution of \({\widehat{{\mathcal T}}}_n\) does not depend on the limit of N / n for certain fixed and local alternatives.

The linear case \(q=2=p\). In this case, the setup is the same as that in Sect. 5.1 for \(p=2\). We investigate the empirical level and power of the proposed test under the models defined below. With \(\theta _0 = (0.5,1)^T\) and \(X = (X_1, X_2)^T\),

$$\begin{aligned}&\text {Model } \emptyset : Y = \theta _0^T X + \varepsilon , \\&\text {Model I}: Y = \theta _0^T X + 0.2 X_1X_2 + \varepsilon , \\&\text {Model II}: Y = \theta _0^T X + 0.5 \sin (2X_1X_2) + \varepsilon ,\\&\text {Model III}: Y = \theta _0^T X I_{(\theta _0^T X\le 0.5)} + 0.5 I_{(\theta _0^T X>0.5)} + \varepsilon . \end{aligned}$$

The numerical findings are summarized in Table 4. In this linear case, it is observed that the empirical levels of the proposed test are close to the nominal level 0.05 for the larger chosen sample sizes and for all three choices of N / n. The empirical power performance pattern is similar to that in the nonlinear case. Again, not surprisingly, the empirical powers of the \({\widehat{{\mathcal T}}}_n\) test are similar to those of the \({\widetilde{{\mathcal T}}}_n\) test for the larger chosen sample sizes and all the three choices of N / n.