
1 Introduction

This chapter develops a new approach to estimating structures defined by moment inequalities. Moment inequalities often arise as optimality conditions in discrete choice problems or in structures where economic variables are subject to some type of censoring. Typically, parametric models are used to estimate such structures. For example, in their analysis of an entry game in the airline markets, Ciliberto and Tamer (2009) use a linear specification for airlines’ profit functions and assume that unobserved heterogeneity in the profit functions can be captured by independent normal random variables. In asset pricing theory with short sales prohibited, Luttmer (1996) specifies the functional form of the pricing kernel as a power function of consumption growth, based on the assumption that the investor’s utility function is additively separable and isoelastic.

Any conclusions drawn from such methods rely on the validity of the model specification. Although commonly used estimation and inference methods for moment inequality models are robust to potential lack of identification, typically they are not robust to misspecification. Compared to cases where the parameter of interest is point identified, much less is known about the consequences of misspecified moment inequalities. As we will discuss, these can be serious. In general, misspecification makes it hard to interpret the estimated set of parameter values; an even more serious possibility is that the identified set could be an empty set. If the identified set is empty, every nonempty estimator sequence is inconsistent. Furthermore, it is often hard to see if the estimator is converging to some object that can be given any meaningful interpretation. An exception is the estimation method developed by Ponomareva and Tamer (2010), which focuses on estimating a regression function with interval censored outcome variables.

This chapter develops a new estimation method that is robust to potential parametric misspecification in general moment inequality models. Our contributions are three-fold. First, we define a pseudo-true identified set that is nonempty under mild assumptions and that can be interpreted as the projection of the set of function-valued parameters identified by the moment inequalities. Second, we construct a set estimator using a two-stage estimation procedure, and we show that the estimator is consistent for the pseudo-true identified set in the Hausdorff metric. Third, we give conditions under which the proposed estimator converges to the pseudo-true identified set at the \(n^{-1/2}\)-rate.

The first stage is a nonparametric estimator of the true moment function. Given this, why perform a parametric second-stage estimation? After all, the nonparametric first stage estimates the same object of interest, without the possibility of parametric misspecification. There are a variety of reasons a researcher may nevertheless prefer to implement the parametric second stage: first is the undeniably appealing interpretability of the parametric specification; second is the much more precise estimation and inference afforded by using a parametric specification; and third, the second term of the second-stage objective function may offer a potentially useful model specification diagnostic. Future research may permit deriving the asymptotic distribution of this term under the null of correct parametric specification to provide a formal test. The two-stage procedure proposed here delivers these benefits, while avoiding the more serious adverse consequences of potential misspecification.

The chapter is organized as follows. Section 2 describes the data generating process and gives examples that fall within the scope of this chapter. We also introduce our definition of the pseudo-true identified set. Section 3 defines our estimator and presents our main results. We conclude in Sect. 4. We collect all proofs into the appendix.

2 The Data Generating Process and the Model

Our first assumption describes the data generating process (DGP).

Assumption 2.1

Let \((\Omega ,\mathfrak F ,\mathbb P _{0})\) be a complete probability space. Let \(k,\ell \in \mathbb N \). Let \(X:\Omega \rightarrow \mathbb R ^{k}\) be a Borel measurable map, let \(\mathcal X \subseteq \mathbb R ^{k}\) be the support of \(X\), and let \(P_{0}\) be the probability measure induced by \(X\) on \(\mathcal X \). Let \(\rho _{0}:\mathcal X \rightarrow \mathbb R ^{\ell }\) be an unknown measurable function such that \(E[\rho _{0}(X)]\) exists and

$$\begin{aligned} E[\rho _{0}(X)]\le 0, \end{aligned}$$
(1)

where the expectation is taken with respect to \(P_{0}\).

In what follows, we call \(\rho _{0}\) the true moment function. The moment inequalities (1) often arise as an optimality condition in game-theoretic models (Bajari et al. 2007; Ciliberto and Tamer 2009) or models that involve variables that are subject to some kind of censoring (Manski and Tamer 2002). In empirical studies of such models, it is common to specify a parametric model for \(\rho _{0}\).

Assumption 2.2

Let \(p\in \mathbb N \) and let \(\Theta \) be a subset of \(\mathbb R ^{p}\) with nonempty interior. Let \(m:\mathcal X \times \Theta \rightarrow \mathbb R ^{\ell }\) be such that \(m(\cdot ,\theta )\) is measurable for each \(\theta \in \Theta \) and \(m(x,\cdot )\) is continuous on \(\Theta ,\) \(\mathrm{a.e.}-P_{0}\). For each \(\theta \in \Theta \), \(m(\cdot ,\theta )\in L_{\ell }^{2}:=\{f:\mathcal X \rightarrow \mathbb R ^{\ell }:E[f(X)^{\prime }f(X)]<\infty \}.\)

Throughout, we call \(m(\cdot ,\cdot )\) the parametric moment function.

Definition 2.1

Let \(m_{\theta }(\cdot ):=m(\cdot ,\theta )\). Define \(\mathcal M _{\Theta }:=\{m_{\theta }\in L_{\ell }^{2}:\theta \in \Theta \}.\) \(\mathcal M _{\Theta }\) is correctly specified (\(-P_{0}\)) if there exists \(\theta _{0}\in \Theta \) such that

$$\begin{aligned} P_{0}[\rho _{0}(X)=m(X,\theta _{0})]=1. \end{aligned}$$

Otherwise, the model is misspecified.

If the model is correctly specified, we may define the set of parameter values that can be identified by the inequalities in (1):

$$\begin{aligned} \Theta _{I}:=\{\theta \in \Theta :E[m(X,\theta )]\le 0\}. \end{aligned}$$

We call \(\Theta _{I}\) the conventional identified set. This set collects all parameter values that yield parametric moment functions that are observationally equivalent to \(\rho _{0}\).

It becomes difficult to interpret \(\Theta _{I}\) when the model is misspecified, as pointed out by Ponomareva and Tamer (2010) for a regression model with an interval-valued outcome variable. Suppose first that the model is misspecified but \(\Theta _{I}\) is nonempty. The set is still a collection of parameter values that are observationally equivalent to each other, but since there is no \(\theta \) in \(\Theta _{I}\) that corresponds to the true moment function, further structure is required to unambiguously interpret \(\Theta _{I}\) as a collection of “pseudo-true parameter(s)”. Further, \(\Theta _{I}\) may be empty, especially if \(\mathcal M _{\Theta }\) is a small class of functions. This makes the interpretation of \(\Theta _{I}\) even more difficult. In fact, interpretation is impossible, as there is nothing to interpret.

Often, the economics of a given problem impose further structure on the DGP. To specify this, we let \(0<L\le \ell ,\) and for measurable \(s:\mathcal X \rightarrow \mathbb R ^{L}\), let \(\Vert s\Vert _{L}:=E[s(X)^{\prime }s(X)]^{1/2}\). Let \(L_{L}^{2}:=\{s:\mathcal X \rightarrow \mathbb R ^{L},\Vert s\Vert _{L}<\infty \}\), and let \(\mathcal S \subseteq L_{L}^{2}\).

Assumption 2.3

There exists \(\varphi :{\mathcal X }\times \mathcal S \rightarrow \mathbb R ^{\ell }\) such that for each \(x\in \mathcal X \), \( \varphi (x,\cdot )\) is continuous on \(\mathcal S \) and for each \(s\in \mathcal S \), \(\varphi (\cdot ,s)\) is measurable. Further, there exists \( s_{0}\in \mathcal S \) such that

$$\begin{aligned} \rho _{0}(x)=\varphi (x,s_{0}),\quad \forall x\in \mathcal X . \end{aligned}$$

When \(\rho _{0}\in L_{\ell }^{2}\) and there is no further structure on \(\rho _{0}\) available, we let \(L=\ell ,\) \(\mathcal S =L_{\ell }^{2},\) and take \(\varphi \) to be the evaluation functional \(e:\mathcal X \times \mathcal S \rightarrow \mathbb R ^{\ell }\):

$$\begin{aligned} \varphi (x,s)=e(x,s)\equiv s(x), \end{aligned}$$

as then \(\varphi (x,\rho _{0})=e(x,\rho _{0})\equiv \rho _{0}(x)\) and \(s_{0}=\rho _{0}.\) In this case, it is not necessary to explicitly introduce \(\varphi \). Often, however, further structure on the form of \(\rho _{0}\) is available. Typically, this is reflected in \(s\) depending non-trivially only on a strict subvector of \(X,\) say \(X_{1}.\) In such cases, we may write \(\mathcal S \subseteq L_{\mathcal X _{1}}^{2}\) for clarity. We give several examples below.

When Assumption 2.3 holds, we typically parametrize the unknown function \(s_{0}\). For example, it is common to specify \(s_{0}\) as a linear function of some of the components of \(x\). As we will see in the examples, a common modeling assumption is

Assumption 2.4

There exists \(r:\mathcal X \times \Theta \rightarrow \mathbb R ^{L}\) such that with \(r_{\theta }:=r(\cdot ,\theta )\),

$$\begin{aligned} m(x,\theta )=\varphi (x,r_{\theta }),\quad \forall (x,\theta )\in \mathcal X \times \Theta . \end{aligned}$$

Thus, misspecification occurs when there is no \(\theta _{0}\) in \(\Theta \) such that \(s_{0}=r_{\theta _{0}}.\)

More generally, misspecification can occur because the researcher mistakenly imposes Assumption 2.3, in which case \(s_{0}\) fails to exist and there is again no \(\theta _{0}\) in \(\Theta \) such that \(\rho _{0}(x)=\varphi (x,r_{\theta _{0}}).\) As \(s_{0}\) is an element of an infinite-dimensional space, we may refer to this as “nonparametric” misspecification. To proceed, we assume that, as is often plausible, the researcher is sufficiently able to specify the structure of interest that nonparametric misspecification is not an issue, either because correct \(\varphi \) restrictions are imposed or no \(\varphi \) restrictions are imposed. We thus focus on the case of parametric misspecification, where \(s_{0}\) exists but there is no \(\theta _{0}\) in \(\Theta \) such that \(s_{0}=r_{\theta _{0}}.\)

2.1 Examples

In this section, we present several motivating examples and also give commonly used parametric specifications in these examples. For any vector \(x\), we use \(x^{(j)}\) to denote the \(j\)th component of the vector. Similarly, for a vector valued function \(f(x)\), we use \(f^{(j)}(x)\) to denote the \(j\)th component of \(f(x)\).

Example 2.1

(Interval censored outcome) Let \(Z:\Omega \rightarrow \mathbb R ^{d_{Z}}\) be a regressor with support \(\mathcal Z \). Let \(Y:\Omega \rightarrow \mathbb R \) be an outcome variable that is generated as:

$$\begin{aligned} Y=s_{0}(Z)+\epsilon , \end{aligned}$$
(2)

where \(s_{0}\in \mathcal S :=L_{\mathcal Z }^{2},\) say, and \(\epsilon \) satisfies \(E[\epsilon |Z]=0\). We let \(\mathcal Y \) denote the support of \(Y\). Suppose \(Y\) is unobservable, but there exist \((Y_{L},Y_{U})^{\prime }:\Omega \rightarrow \mathcal Y \times \mathcal Y \) such that \(Y_{L}\le Y\le Y_{U}\) almost surely. Then, \((Y_{L},Y_{U},Z)^{\prime }\) satisfies the following inequalities almost surely:

$$\begin{aligned} E[Y_{L}|Z]-s_{0}(Z)&\le 0 \end{aligned}$$
(3)
$$\begin{aligned} s_{0}(Z)-E[Y_{U}|Z]&\le 0. \end{aligned}$$
(4)

Let \(x=(y_{L},y_{U},z)^{\prime }\in \mathcal X :=\mathcal Y \times \mathcal Y \times \mathcal Z \). Given a collection \(\{A_{1},{\ldots } ,A_{K}\}\) of Borel subsets of \(\mathcal Z \), the inequalities in (3), (4) imply that the moment inequalities in (1) hold with

$$\begin{aligned} \rho _{0}(x)=\varphi (x,s_{0}):= \left[\begin{array}{l} y_{L}-s_{0}(z) \\ s_{0}(z)-y_{U} \end{array}\right] \otimes 1_{A}(z), \end{aligned}$$
(5)

where \(1_{A}(z):=(1\{z\in A_{1}\},{\ldots } ,1\{z\in A_{K}\})^{\prime }\). For each \(x\in \mathcal X \) and \(s\in \mathcal S \), the functional \(\varphi \) evaluates the vertical distances of \(s(z)\) from \(y_{L}\) and \(y_{U}\) and multiplies them by the indicator functions evaluated at \(z\). The additional information on \(\rho _{0}\) available in this example is that the moment functions are based on these vertical distances.

A common specification for \(s_{0}\) is \(s_{0}(z)=r_{\theta _{0}}(z)=z^{\prime }\theta _{0}\) for some \(\theta _{0}\in \Theta \subseteq \mathbb R ^{d_{Z}}\). The parametric moment function is then given for each \(x\in \mathcal X \) by \(m(x,\theta )=\varphi (x,r_{\theta })\). Therefore, this example satisfies Assumption 2.4.
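
To fix ideas, the following is a minimal numerical sketch of this parametric moment function. The function name, array shapes, and the toy instrument sets are our own illustrative choices, not part of the chapter's formal development; the column ordering follows the Kronecker product in (5).

```python
import numpy as np

def m_interval(theta, y_l, y_u, z, cells):
    """Parametric moment function m(x, theta) = phi(x, r_theta) of
    Example 2.1 with the linear specification r_theta(z) = z'theta.
    Columns are ordered as in (5): all (Y_L - Z'theta) moments over
    A_1..A_K, then all (Z'theta - Y_U) moments."""
    fitted = z @ theta                                   # r_theta(Z_i)
    lows = [(y_l - fitted) * in_A(z) for in_A in cells]  # (Y_L - Z'theta) 1{Z in A_j}
    ups = [(fitted - y_u) * in_A(z) for in_A in cells]   # (Z'theta - Y_U) 1{Z in A_j}
    return np.column_stack(lows + ups)                   # (n, 2K) array

# Toy usage: one regressor, two half-line instrument sets A_1, A_2.
rng = np.random.default_rng(0)
z = rng.uniform(size=(500, 1))
y = z[:, 0] + rng.normal(scale=0.1, size=500)            # latent outcome
y_l, y_u = y - 0.2, y + 0.2                              # interval censoring
cells = [lambda z: (z[:, 0] < 0.5).astype(float),
         lambda z: (z[:, 0] >= 0.5).astype(float)]
print(m_interval(np.array([1.0]), y_l, y_u, z, cells).mean(axis=0))
```

At the true slope, the printed sample means of all \(2K\) moments should be (weakly) negative, consistent with (1).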

Example 2.2

Tamer (2003) considers a simultaneous game of complete information. For each \(j=1,2\), let \(Z_{j}:\Omega \rightarrow \mathbb R ^{d_{Z}}\) and \(\epsilon _{j}:\Omega \rightarrow \mathbb R \) be firm \(j\)’s characteristics that are observable to the firms. The econometrician observes the \(Z\)’s but not the \(\epsilon \)’s. For each \(j\), let \(g_{j}:\mathcal Z \times \{0,1\}\rightarrow \mathbb R \). These functions are known to the firms but not to the econometrician. Suppose that each firm’s payoff is given by

$$\begin{aligned} \pi _{j}(Z_{j},Y_{j},Y_{-j})=(\epsilon _{j}-g_{j}(Z_{j},Y_{-j}))Y_{j},\quad j=1,2, \end{aligned}$$

where \(Y_{j}\in \mathcal Y :=\{0,1\}\) is firm \(j\)’s entry decision, and \( Y_{-j}\in \mathcal Y \) is the other firm’s entry decision. The econometrician observes these decisions. Given \((z_{1},z_{2})\), the firms’ payoffs can be summarized in Table 1.

Suppose the firms and the econometrician know that \(g_{j}(z,1)\ge g_{j}(z,0)\) for any value of \(z\). This means that, other things equal, the opponent’s entry would reduce the firm’s own profit. In this setting, there are several possible equilibrium outcomes depending on the realization of \((\epsilon _{1},\epsilon _{2})\). If \(\epsilon _{1}>g_{1}(z_{1},1)\) and \(\epsilon _{2}>g_{2}(z_{2},1)\), then \((1,1)\) is the unique Nash equilibrium (NE) outcome. Similarly, if \(\epsilon _{1}>g_{1}(z_{1},1)\) and \(\epsilon _{2}<g_{2}(z_{2},1)\), then \((1,0)\) is the unique NE outcome, and if \(\epsilon _{1}<g_{1}(z_{1},1)\) and \(\epsilon _{2}>g_{2}(z_{2},1)\), then \((0,1)\) is the unique NE outcome. If \(\epsilon _{1}<g_{1}(z_{1},1)\) and \(\epsilon _{2}<g_{2}(z_{2},1)\), there are two Nash equilibria, giving the outcomes \((1,0)\) and \((0,1)\). Let \(F_{j}\), \(j=1,2\), be the unknown CDF of \(\epsilon _{j}\). Without any assumptions on the equilibrium selection mechanism, the model predicts the following set of inequalities:

$$\begin{aligned}&P(Y_{1}=1,Y_{2}=1|Z_{1}=z_{1},Z_{2}=z_{2}) =(1-F_{1}(g_{1}(z_{1},1)))(1-F_{2}(g_{2}(z_{2},1))) \end{aligned}$$
(6)
$$\begin{aligned}&P(Y_{1}=1,Y_{2}=0|Z_{1}=z_{1},Z_{2}=z_{2}) \ge (1-F_{1}(g_{1}(z_{1},1)))F_{2}(g_{2}(z_{2},1)) \end{aligned}$$
(7)
$$\begin{aligned}&P(Y_{1}=1,Y_{2}=0|Z_{1}=z_{1},Z_{2}=z_{2}) \le F_{2}(g_{2}(z_{2},1)). \end{aligned}$$
(8)

Let \(x:=(y_{1},y_{2},z_{1},z_{2})^{\prime }\in \mathcal X :=\mathcal Y \times \mathcal Y \times \mathcal Z \times \mathcal Z \). Let \(s_{0}\in \mathcal S :=\{s\in L_{\mathcal Z \times \mathcal Z }^{2}:s(z_{1},z_{2})\in [ 0,1]^{2},\forall (z_{1},z_{2})\in \mathcal Z \times \mathcal Z \}\) be defined by

$$\begin{aligned} s_{0}^{(1)}(z_{1},z_{2})&:=F_{1}(g_{1}(z_{1},1)) \nonumber \\ s_{0}^{(2)}(z_{1},z_{2})&:=F_{2}(g_{2}(z_{2},1)). \end{aligned}$$

Here, \(s_{0}^{(j)}(z_{1},z_{2})\) is the conditional probability that firm \(j\)’s profit upon entry is negative given \(z_{1}\) and \(z_{2}\). Given a collection \(\{A_{j},j=1,{\ldots } ,K\}\) of Borel subsets of \(\mathcal Z \times \mathcal Z \), let \(1_{A}(z):=(1\{(z_{1},z_{2})\in A_{1}\},{\ldots },1\{(z_{1},z_{2})\in A_{K}\})^{\prime }\). The inequalities (6)–(8) imply the moment inequalities in (1) hold with

$$\begin{aligned} \rho _{0}(x)&=\varphi (x,s_{0}) \nonumber \\&= \left(\begin{array}{c} 1\{y_{1}=1,y_{2}=1\}-(1-s_{0}^{(1)}(z_{1},z_{2}))(1-s_{0}^{(2)}(z_{1},z_{2})) \\ (1-s_{0}^{(1)}(z_{1},z_{2}))(1-s_{0}^{(2)}(z_{1},z_{2}))-1\{y_{1}=1,y_{2}=1\} \\ (1-s_{0}^{(1)}(z_{1},z_{2}))s_{0}^{(2)}(z_{1},z_{2})-1\{y_{1}=1,y_{2}=0\} \\ 1\{y_{1}=1,y_{2}=0\}-s_{0}^{(2)}(z_{1},z_{2}) \end{array}\right) \otimes 1_{A}(z). \end{aligned}$$

The additional information on \(\rho _{0}\) is that it is based on the differences between some combinations of the conditional probabilities \(s_{0}(z_{1},z_{2})\) and indicators for specific events.

A common parametric specification for \(g_j\) is \(g_j(z_j,y_{-j})=z_{j}^{\prime }\gamma _{0}-y_{-j}\beta _{j,0}\) for some \(\beta _{j,0}\in B\subseteq \mathbb R _+\) and \(\gamma _0\in \Gamma \subseteq \mathbb R ^{d_Z}\). It is also common to assume that \(F_j,j=1,2\) belong to a known parametric class \(\{F(\cdot ;\alpha ),\alpha \in \mathcal A \}\) of distributions. Then the parametric moment function can be defined for each \(x\) by \(m(x,\theta ):=\varphi (x,r_\theta )\), where \(\theta :=(\alpha _1,\alpha _2,\beta _1,\beta _2,\gamma )^{\prime }\) and

$$\begin{aligned} r^{(1)}_\theta (z_1,z_2)&= F(z_{1}^{\prime }\gamma -\beta _{1};\alpha _1) \end{aligned}$$
(9)
$$\begin{aligned} r^{(2)}_\theta (z_1,z_2)&= F(z_{2}^{\prime }\gamma -\beta _{2};\alpha _2). \end{aligned}$$
(10)

This example also satisfies Assumption 2.4.
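
As a concrete illustration of (9)–(10), the sketch below evaluates \(r_\theta\) when the parametric family \(F(\cdot ;\alpha )\) is taken to be the CDF of a \(N(0,\alpha ^{2})\) distribution. This family, the function name, and the parameter packing are illustrative assumptions only.

```python
import numpy as np
from scipy.stats import norm

def r_theta_entry(theta, z1, z2):
    """Sketch of the parametric first stage (9)-(10) for Example 2.2,
    taking F(.; alpha) to be the N(0, alpha^2) CDF (illustrative only).
    theta = (alpha1, alpha2, beta1, beta2, gamma_1, ..., gamma_d)."""
    d = z1.shape[1]
    a1, a2, b1, b2 = theta[:4]
    gamma = theta[4:4 + d]
    r1 = norm.cdf((z1 @ gamma - b1) / a1)   # F(z1'gamma - beta1; alpha1)
    r2 = norm.cdf((z2 @ gamma - b2) / a2)   # F(z2'gamma - beta2; alpha2)
    return np.column_stack([r1, r2])        # (n, 2) array: (r^(1), r^(2))
```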

Table 1 The entry game payoff matrix

Example 2.3

(Discrete choice) Suppose an agent chooses \(Z\in \mathbb R ^{d_{Z}}\) from a set \(\mathcal Z :=\{z_{1},{\ldots } ,z_{K}\}\) in order to maximize her expected payoff \(E[s_{0}(Y,Z)\mid \mathcal I ]\), where \(Y\) is a vector of observable random variables, \(s_{0}\in \mathcal S :=L_{\mathcal Y \times \mathcal Z }^{2}\) is the payoff function, and \(\mathcal I \) is the agent’s information set. The optimality condition for the agent’s choice is given by:

$$\begin{aligned} E[s_{0}(Y,z_{j})-s_{0}(Y,Z)\mid \mathcal I ]\le 0,\quad j=1,{\ldots } ,K. \end{aligned}$$
(11)

Let \(x:=(y,z)^{\prime }\in \mathcal X :=\mathcal Y \times \mathcal Z \). The optimality conditions in (11) imply that the unconditional moment inequalities in (1) hold with

$$\begin{aligned} \rho _{0}(x)=\varphi (x,s_{0})=\left(\begin{array}{c} \left[\begin{array}{c} s_{0}(y,z_{1})-s_{0}(y,z_{1}) \\ \vdots \\ s_{0}(y,z_{K})-s_{0}(y,z_{1}) \end{array}\right] \times 1\{z=z_{1}\} \\ \vdots \\ \left[\begin{array}{c} s_{0}(y,z_{1})-s_{0}(y,z_{K}) \\ \vdots \\ s_{0}(y,z_{K})-s_{0}(y,z_{K})\end{array}\right] \times 1\{z=z_{K}\} \end{array}\right) .\end{aligned}$$

For given \(y,\) the functional \(\varphi \) evaluates the profit differences between a given choice \(z\) (e.g., \(z_{1}\)) and every other possible choice. The additional information on \(\rho _{0}\) is that it is based on the profit differences.

A common specification for \(s_{0}\) is \(s_{0}(y,z)=r_{\theta _{0}}(y,z)=\psi (y,z;\alpha _{0})+z^{\prime }\beta _{0}+\epsilon _{z}\) for some known function \(\psi \), unknown \((\alpha _{0},\beta _{0})\in \Theta \subset \mathbb R ^{d_{\alpha }+d_{\beta }}\), and an unobservable choice-dependent error \(\epsilon _{z}\). For simplicity, we assume that \(\epsilon _{z}\) satisfies \(E[\epsilon _{z_{i}}-\epsilon _{z_{j}}\mid \mathcal I ]=0\) for any \(i,j\); see Pakes et al. (2006) and Pakes (2010) for detailed discussions. The parametric moment function is then given for each \(x\in \mathcal X \) by \(m(x,\theta )=\varphi (x,r_{\theta })\). This example satisfies Assumption 2.4.

Example 2.4

(Pricing kernel) Let \(Z:\Omega \rightarrow \mathbb R ^{d_{Z}}\) be the payoffs of \(d_{Z}\) securities that are traded at a price of \(P\in \mathcal P \subseteq \mathbb R ^{d_{Z}}\). If short sales are not allowed for any securities, then the feasible set of portfolio weights is restricted to \(\mathbb R _{+}^{d_{Z}}\) and the standard Euler equation does not hold. Instead, the following Euler inequalities hold (see Luttmer 1996):

$$\begin{aligned} E[s_{0}(Y)Z-P]\le 0, \end{aligned}$$

where \(Y:\Omega \rightarrow \mathcal Y \) is a state variable, e.g. consumption growth, and \(s_{0}\in \mathcal S :=\{s\in L_{\mathcal Y }^{2}:s(y)\ge 0,\forall y\in \mathcal Y \}\) is the pricing kernel function. The moment inequalities thus hold with the true moment function:

$$\begin{aligned} \rho _{0}(x)=\varphi (x,s_{0})=s_{0}(y)z-p, \end{aligned}$$

where \(x:=(y,z,p)^{\prime }\in \mathcal Y \times \mathcal Z \times \mathcal P \). This functional evaluates the pricing kernel \(s_{0}\) at \(y\) and computes a vector of pricing errors. The additional information on \(\rho _{0}\) is that it is based on the pricing errors.

A common specification for \(s_{0}\) is \(s_{0}(y)=r_{\theta _{0}}(y)=\beta _{0}y^{-\gamma _{0}}\), where \(\beta _{0}\in B\subseteq [ 0,1]\) is the investor’s subjective discount factor and \(\gamma _{0}\in \Gamma \subseteq \mathbb R _{+}\) is the relative risk aversion coefficient. Let \(\theta :=(\beta ,\gamma )^{\prime }\). The parametric moment function is then given for each \(x\in \mathcal X \) by \(m(x,\theta )=\varphi (x,r_{\theta })\), satisfying Assumption 2.4.
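
A minimal sketch of the resulting parametric moment function follows; the function name and array shapes are illustrative assumptions.

```python
import numpy as np

def m_pricing(theta, y, z, p):
    """Euler-inequality moments of Example 2.4 under the isoelastic
    specification r_theta(y) = beta * y**(-gamma), so that
    m(x, theta) = beta * y**(-gamma) * z - p.
    y: (n,) consumption growth; z, p: (n, d) payoffs and prices."""
    beta, gamma = theta
    kernel = beta * y ** (-gamma)       # pricing kernel evaluated at y
    return kernel[:, None] * z - p      # (n, d) pricing errors, E[.] <= 0
```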

2.2 Projection

The inequality restrictions \(E[\varphi (X,s_{0})]\le 0\) may not uniquely identify \(s_{0}\). Define

$$\begin{aligned} \mathcal S _{0}:=\{s\in \mathcal S :E[\varphi (X,s)]\le 0\}. \end{aligned}$$

We define a pseudo-true identified set of parameters as a collection of projections of elements in \(\mathcal S _{0}\). Let \(W\) be a given non-random finite \(L\times L\) symmetric positive-definite matrix. For each \(s\in \mathcal S \), define the norm \(\Vert s\Vert _{W}:=E[s(X)^{\prime }Ws(X)]^{1/2}\). For each \(s\in \mathcal S \) and \(A\subseteq \mathcal S \), the projection map \(\Pi _{A}:\mathcal S \rightarrow A\) is the map such that

$$\begin{aligned} \Vert s-\Pi _{A}s\Vert _{W}=\inf _{a\in A}\Vert s-a\Vert _{W}. \end{aligned}$$

Let \(\mathcal R _{\Theta }:=\{r_{\theta }\in \mathcal S :\theta \in \Theta \} \). Given Assumption 2.4, we can define

$$\begin{aligned} \Theta _{*}:=\{\theta \in \Theta :r_{\theta }=\Pi _{\mathcal R _{\Theta }}s,s\in \mathcal S _{0}\}. \end{aligned}$$

When \(\varphi \) is the evaluation map \(e\), \(\Theta _{*}\) is simply \(\Theta _{*}:=\{\theta \in \Theta :m_{\theta }=\Pi _{\mathcal M _{\Theta }}s,s\in \mathcal S _{0}\}.\)

\(\Theta _{*}\) can be interpreted as the set of parameters whose associated functions \(r_{\theta }\) lie in the \(\mathcal R _{\Theta }\)-projection of \(\mathcal S _{0}\). This set is nonempty (under some regularity conditions), and each of its elements is the projection of some \(s\in \mathcal S _{0}\), a function inducing a functional \(\varphi (\cdot ,s)\) that is observationally equivalent to \(\rho _{0}\). In this sense, each element in \(\Theta _{*}\) has an interpretation as a pseudo-true value. Thus, we call \(\Theta _{*}\) the pseudo-true identified set. [White (1982) uses \(\theta _{*}\) to denote the unique pseudo-true value in the fully identified case.]

We illustrate the relationship between \(\Theta _{I}\) and \(\Theta _{*}\) with an example. Consider Example 2.1. Let \(\Theta \subseteq \mathbb R ^{d_{Z}}\). The conventional identified set is given by

$$\begin{aligned} \Theta _{I}=\{\theta \in \Theta&:E[(Y_{L}-Z^{\prime }\theta )1\{Z\in A_{j}\}]\le 0, \nonumber \\&\qquad {\text{ and}}\;E[(Z^{\prime }\theta -Y_{U})1\{Z\in A_{j}\}]\le 0,\quad j=1,{\ldots },K\}. \end{aligned}$$
(12)

The pseudo-true identified set is given by

$$\begin{aligned} \Theta _{*}=\{\theta \in \Theta :\theta =E[ZZ^{\prime }]^{-1}E[Zs(Z)],s\in \mathcal {S}_{0}\}. \end{aligned}$$
(13)
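
For this example, the projection in (13) has an explicit sample analogue. The sketch below computes it; the function name and signature are our own illustrative choices.

```python
import numpy as np

def project_onto_linear_class(s, z):
    """Sample analogue of the projection in (13):
    theta_*(s) = E[ZZ']^{-1} E[Z s(Z)], the L2(P0) projection of a
    candidate function s from S_0 onto the linear class {z'theta}.
    s: callable mapping the (n, d) regressor array to an (n,) array."""
    n = len(z)
    zz = z.T @ z / n                    # sample analogue of E[ZZ']
    zs = z.T @ s(z) / n                 # sample analogue of E[Z s(Z)]
    return np.linalg.solve(zz, zs)      # (d,) projection coefficients
```

Applying this map to (an estimate of) each element of \(\mathcal S _{0}\) traces out a sample counterpart of \(\Theta _{*}\).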

Let \(D\) be a \(d_{Z}\times K\) matrix whose \(j\)th column is \(E[Z\,1\{Z\in A_{j}\}]\). For this example, the following result holds:

Proposition 2.1

Let the conditions of Example 2.1 hold, and let \(\Theta _{*}\) be given as in (13). Let \(\Theta _{I}\) be given as in (12). Then \(\Theta _{I}\subseteq \Theta _{*}\). Suppose further that \(\mathcal M _{\Theta }\) is correctly specified, that \(E[Y_{U}|Z]=E[Y_{L}|Z]=Z^{\prime }\theta _{0}\) a.s., and that \(d_{Z}\le \mathrm{rank}(D)\). Then \(\Theta _{I}=\Theta _{*}=\{\theta _{0}\}\).

As this example shows, unless there is some information that helps restrict \(\mathcal S _{0}\) very tightly, \(\Theta _{I}\) is often a proper subset of \(\Theta _{*}\). This is because, without such information, \(\mathcal S _{0}\) is typically a much richer class of functions than \(\mathcal R _{\Theta }\). Another important point is that, although \(\Theta _{*}\) is generally well-defined, \(\Theta _{I}\) can be empty quite easily. In particular, for any \(x,x^{\prime }\in \mathcal X \), let \(x_{\lambda }:=\lambda x+(1-\lambda )x^{\prime },0\le \lambda \le 1\). \(\Theta _{I}\) is empty if there exist \((x,x^{\prime })\) and \(\lambda \in [0,1]\) such that (i) \(x_{\lambda }\in \mathcal X \) and \((E[Y_{L}|x_{\lambda }]-E[Y_{U}|x])/\Vert x_{\lambda }-x\Vert >(E[Y_{U}|x^{\prime }]-E[Y_{U}|x])/\Vert x^{\prime }-x\Vert \) or (ii) \(x_{\lambda }\in \mathcal X \) and \((E[Y_{U}|x_{\lambda }]-E[Y_{L}|x])/\Vert x_{\lambda }-x\Vert <(E[Y_{L}|x^{\prime }]-E[Y_{L}|x])/\Vert x^{\prime }-x\Vert \). Fig. 1, which is similar to Fig. 1 in Ponomareva and Tamer (2010), illustrates an example that satisfies condition (i) for the one-dimensional case.

In this example, each element in \(\Theta _{*}\) solves the following moment restrictions:

$$\begin{aligned} E[Z(Z^{\prime }\theta -Y)]=E[Zu(X)], \end{aligned}$$
(14)

with \(u(x)=s(z)-y\) for some \(s\in \mathcal S _0\). This can be viewed as a special case of the incomplete linear moment restrictions studied in Bontemps, Magnac, and Maurin (2011) (BMM, henceforth). BMM show that the set of parameters solving incomplete linear moment restrictions is necessarily convex and develop an inference method that exploits this property.

We note that this connection to BMM’s work arises only when the parametric class takes the form \(\mathcal R _\Theta =\{r_\theta : r_\theta (z)=z^{\prime }\theta ,~\theta \in \Theta \}\). The elements of \(\Theta _{*}\) do not generally solve incomplete linear moment restrictions when \(\mathcal R _\Theta \) includes nonlinear functions of \(\theta \). Therefore, BMM’s inference method is applicable only when \(r_\theta \) is linear. Our estimation procedure is more flexible than theirs in two respects. First, we allow projection onto a more general class of parametric functions that includes nonlinear functions of \(\theta \). Second, as a consequence, we do not require \(\Theta _{*}\) to be convex. We pay a price for this generality, however: we require \(s\) to satisfy suitable smoothness conditions, which BMM do not. We discuss these conditions in detail in the following section.

Fig. 1 An example with an empty conventional identified set

3 Estimation

3.1 Set Estimator

For \(W\) as above and each \((\theta ,s)\in \Theta \times \mathcal S \), let the population criterion function be defined by

$$\begin{aligned} Q(\theta ,s):=&\;E[(s(X_{i})-r_{\theta }(X_{i}))^{\prime }W(s(X_{i})-r_{\theta }(X_{i}))] \nonumber \\&\;-\inf _{\vartheta \in \Theta }E[(s(X_{i})-r_{\vartheta }(X_{i}))^{\prime }W(s(X_{i})-r_{\vartheta }(X_{i}))]. \end{aligned}$$
(15)

Using the population criterion function, the “pseudo-true” identified set \(\Theta _{*}\) can be equivalently written as

$$\begin{aligned} \Theta _{*}=\{\theta :Q(\theta ,s)=0,\quad s\in \mathcal S _{0}\}. \end{aligned}$$

Given a sample \(\{X_{1},{\ldots } ,X_{n}\}\) of observations, let the sample criterion function be defined for each \((\theta ,s)\in \Theta \times \mathcal S \) by

$$\begin{aligned} Q_{n}(\theta ,s) :=&\;\frac{1}{n}\sum _{i=1}^{n}(s(X_{i})-r_{\theta }(X_{i}))^{\prime }W(s(X_{i})-r_{\theta }(X_{i})) \nonumber \\&\; -\inf _{\vartheta \in \Theta }\frac{1}{n}\sum _{i=1}^{n}(s(X_{i})-r_{\vartheta }(X_{i}))^{\prime }W(s(X_{i})-r_{\vartheta }(X_{i})). \end{aligned}$$
(16)
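
A minimal computational sketch of (16) follows. The names, the representation of \(s\) by its values \(s(X_i)\), and the use of a local optimizer for the profiled infimum are all illustrative assumptions, not part of the formal definition.

```python
import numpy as np
from scipy.optimize import minimize

def Q_n(theta, s_vals, r, x, W):
    """Sample criterion (16). s_vals is the (n, L) array of s(X_i),
    r(theta, x) returns the (n, L) array of r_theta(X_i), W is L x L.
    The profiled term inf over vartheta is computed by a derivative-free
    local optimizer started at theta -- a shortcut that presumes a
    well-behaved inner minimization."""
    def ssq(v):                          # weighted mean squared distance
        e = s_vals - r(v, x)
        return float(np.mean(np.einsum('ij,jk,ik->i', e, W, e)))
    inner = minimize(ssq, x0=np.asarray(theta, float), method='Nelder-Mead')
    return ssq(theta) - inner.fun        # Q_n(theta, s) >= 0 by construction
```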

Ideally, we would like to estimate \(\Theta _{*}\) by \(\tilde{\Theta }_{n}:=\{\theta :Q_{n}(\theta ,s)\le c_{n},s\in \mathcal S _{0}\}\). But \(\mathcal S _{0}\) is unknown, so we must estimate it. Thus, we employ a two-stage procedure, similar to that studied in Kaido and White (2010). Section 3.3 discusses how to construct a first-stage estimator of \(\mathcal S _{0}\); for now, we simply suppose that such an estimator exists. To state this formally, let \(\mathcal F (A)\) denote the set of closed subsets of a set \(A\). See Kaido and White (2010) for background, including discussion of Effros measurability.

Assumption 3.1

(First-stage estimator) For each \(n\), let \(\mathcal S _{n}\subseteq \mathcal S \). \(\hat{\mathcal S }_{n}:\Omega \rightarrow \mathcal F (\mathcal S _{n})\) is (Effros-) measurable.

Given a first-stage estimator, we define a set estimator for the pseudo-true identified set. Let \(\{c_{n}\}\) be a sequence of non-negative constants. The set estimator for \(\Theta _{*}\) is defined by

$$\begin{aligned} \hat{\Theta }_{n}:=\{\theta \in \Theta :Q_{n}(\theta ,s)\le c_{n},s\in \hat{\mathcal S }_{n}\}. \end{aligned}$$
(17)
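
Reusing the `Q_n` sketch above, the set estimator (17) can be approximated by a brute-force search. The grid over \(\Theta \) and the finite list standing in for \(\hat{\mathcal S }_{n}\) are computational devices of this sketch; `tol` merely absorbs numerical error in the profiled minimum when \(c_n=0\).

```python
import numpy as np

def theta_hat_set(theta_grid, s_hat_list, r, x, W, c_n=0.0, tol=1e-8):
    """Second-stage set estimator (17): keep every theta on a grid for
    which Q_n(theta, s) <= c_n for some s in the first-stage estimate
    (here a finite list of (n, L) arrays of first-stage values)."""
    kept = [theta for theta in theta_grid
            if any(Q_n(theta, s_vals, r, x, W) <= c_n + tol
                   for s_vals in s_hat_list)]
    return np.array(kept)
```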

We establish our consistency results using the Hausdorff metric. Let \(||\cdot ||\) denote the Euclidean norm, and for any closed subsets \(A\) and \(B\) of a finite-dimensional Euclidean space (here, the space containing \(\Theta \)), let

$$\begin{aligned} d_{H}(A,B):=\max \{\vec {d}_{H}(A,B),\vec {d}_{H}(B,A)\},\quad \vec {d}_{H}(A,B):=\sup _{a\in A}\inf _{b\in B}\Vert a-b\Vert , \end{aligned}$$
(18)

where \(d_{H}\) and \(\vec {d}_{H}\) denote the Hausdorff metric and the directed Hausdorff distance, respectively.
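
For finite point sets, such as grid approximations of \(\hat{\Theta }_{n}\) and \(\Theta _{*}\), (18) can be computed directly; the sketch below is illustrative.

```python
import numpy as np

def hausdorff(A, B):
    """Hausdorff metric (18) between two finite point sets,
    given as (m, p) and (k, p) arrays."""
    D = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)  # pairwise distances
    d_ab = D.min(axis=1).max()          # directed distance from A to B
    d_ba = D.min(axis=0).max()          # directed distance from B to A
    return max(d_ab, d_ba)
```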

Before stating our assumptions, we introduce some additional notation. Let \( D_{\theta }^{\alpha }\) denote the differential operator \(\partial ^{|\alpha |}/\partial \theta _{1}^{\alpha _{1}}\cdots \partial \theta _{p}^{\alpha _{p}} \) with \(|\alpha |:=\sum _{j=1}^{p}\alpha _{j}\). Similarly, we let \( D_{x}^{\beta }\) denote the differential operator \(\partial ^{|\beta |}/\partial x_{1}^{\beta _{1}}\cdots \partial x_{k}^{\beta _{k}}\) with \( |\beta |:=\sum _{j=1}^{k}\beta _{j}\). For a function \(f:\mathcal X \rightarrow \mathbb R \) and \(\gamma >0\), let \(\underline{\gamma }\) be the largest integer smaller than \(\gamma \) and define

$$\begin{aligned} \Vert f\Vert _{\gamma }:=\max _{|\beta |\le \underline{\gamma }}\sup _{x\in \mathcal X }\big |D_{x}^{\beta }f(x)\big |+\max _{|\beta |=\underline{\gamma }}\sup _{x\ne y\in \mathcal X }\frac{\big |D_{x}^{\beta }f(x)-D_{x}^{\beta }f(y)\big |}{\Vert x-y\Vert ^{\gamma -\underline{\gamma }}}. \end{aligned}$$

Let \(\mathcal C _{M}^{\gamma }(\mathcal X )\) be the set of all continuous functions \(f:\mathcal X \rightarrow \mathbb R \) such that \(\Vert f\Vert _{\gamma }\le M\). Let \(\mathcal C _{M,L}^{\gamma }(\mathcal X ):=\{f:\mathcal X \rightarrow \mathbb R ^L:f^{(j)}\in \mathcal C ^\gamma _M(\mathcal X ), j=1,{\ldots },L\}\). Finally, for any \(\eta >0\), let \(\mathcal S _{0}^{\eta }:=\{s\in \mathcal S :\inf _{s^{\prime }\in \mathcal S _{0}}\Vert s-s^{\prime }\Vert _{W}<\eta \}\).

The next assumption places conditions on the parameter spaces \(\Theta \) and \(\mathcal S \). We let \(\mathrm{int}(\Theta )\) denote the interior of \(\Theta \).

Assumption 3.2

(i) \(\Theta \) is compact; (ii) \(\mathcal S \) is a compact convex set with nonempty interior; (iii) there exists \(\gamma >k/2\) such that \(\mathcal S \subseteq \mathcal C _{M,L}^{\gamma }(\mathcal X )\); (iv) \(\mathcal R _{\Theta }\) is a convex subset of \(\mathcal S \); (v) \(\Theta _{*}\subseteq \mathrm{int}(\Theta )\).

Assumption 3.2 (i) is standard in the literature on extremum estimation and also ensures the compactness of the pseudo-true identified set. Assumption 3.2 (iii) imposes a smoothness requirement on each component of \(s\in \mathcal S \). Together with Assumption 3.2 (ii), this implies that \(\mathcal S \) is compact under the uniform norm, which will also be used for establishing the Hausdorff consistency of \(\hat{\mathcal S }_{n}\) in the following section. For the Hausdorff consistency of \(\hat{\Theta }_{n}\), the requirement \(\gamma >k/2\) can be relaxed to \(\gamma >0\), and it also suffices that the smoothness requirement holds for functions in neighborhoods of \(\mathcal S _{0}\). The stronger requirement given here, however, will be useful for deriving the rates of convergence of \(\hat{\Theta }_{n}\) and \(\hat{\mathcal S }_{n}\).

For ease of analysis, we assume below that the observations are from a sample of IID random vectors.

Assumption 3.3

The observations \(\{X_i,i=1,{\ldots },n\}\) are independently and identically distributed.

The following two assumptions impose regularity conditions on \(r_\theta \).

Assumption 3.4

(i) \(r(x,\cdot )\) is twice continuously differentiable on the interior of \(\Theta \), \(\mathrm{a.e.}-P_{0}\), and there exists a measurable bounded function \(C:\mathcal X \rightarrow \mathbb R \) such that for any \(j\), \(x\), and \(|\alpha |\le 2\), \(|D_\theta ^{\alpha }r_{\theta }^{(j)}(x)-D_\theta ^{\alpha }r_{\theta ^{\prime }}^{(j)}(x)|\le C(x)\Vert \theta -\theta ^{\prime }\Vert \); (ii) there exists a measurable bounded function \(R:\mathcal X \rightarrow \mathbb R \) such that

$$\begin{aligned} \max _{\begin{matrix} j=1,{\ldots },L \\ |\alpha |\le 2 \end{matrix}}~ \sup _{\theta \in \Theta } \big |D^\alpha _\theta r^{(j)}_\theta (x)\big |\le R(x). \end{aligned}$$

For each \(x\), let \(\nabla _{\theta }r_{\theta }(x)\) be the \(L\times p\) matrix whose \(j\)th row is the gradient vector of \(r_{\theta }^{(j)}\) with respect to \(\theta \). For each \(x\in \mathcal X \) and \(i,j\in \{1,{\ldots },p\}\), let \(\partial ^{2}/\partial \theta _{i}\partial \theta _{j}\,r_{\theta }(x)\) be the \(L\times 1\) vector whose \(k\)th component is \(\partial ^{2}/\partial \theta _{i}\partial \theta _{j}\,r^{(k)}_{\theta }(x)\). For each \(\theta \in \Theta \), \(s\in \mathcal S \), and \(x\in \mathcal X \), let \(H_{W}(\theta ,s,x)\) be the \(p\times p\) matrix whose \((i,j)\)th component is given by

$$\begin{aligned} H_{W}^{(i,j)}(\theta ,s,x)=2\left(\frac{\partial ^{2}}{\partial \theta _{i}\partial \theta _{j}}r_{\theta }(x)\right)^{\prime }W(r_{\theta }(x)-s(x)). \end{aligned}$$
(19)

Let \(\eta >0\). For each \(s\in \mathcal S _0^{\eta }\) and \(\epsilon >0\), let \(V^\epsilon (s)\) be the neighborhood of \(\theta _{*}(s):=\Pi _{\mathcal R _{\Theta }}s\) defined by

$$\begin{aligned} V^\epsilon (s) :=\{\theta \in \Theta :\Vert \theta -\theta _{*}(s)\Vert \le \epsilon \}. \end{aligned}$$

Let \(\mathcal N _{\epsilon ,\eta }:=\{(\theta ,s):\theta \in V^\epsilon (s),s\in \mathcal S _0^{\eta }\}\) be the graph of the correspondence \(V^\epsilon \) on \(\mathcal S _0^{\eta }\).

Assumption 3.5

There exist \(\bar{\epsilon }>0\) and \(\bar{\eta }>0\) such that the Hessian matrix \(\nabla _\theta ^2Q(\theta ,s):=E[H_{W}(\theta ,s,X_{i})+2\nabla _{\theta }r_{\theta }(X_{i})^{\prime }W\nabla _{\theta }r_{\theta }(X_{i})]\) is positive definite uniformly over \(\mathcal N _{\bar{\epsilon },\bar{\eta }}\).

Assumption 3.4 imposes a smoothness requirement on \(r_\theta \) as a function of \(\theta \), enabling us to expand the first-order condition for minimization, as is standard in the literature. Assumption 3.5 requires the Hessian of \(Q(\theta ,s)\) with respect to \(\theta \) to be positive definite uniformly on a suitable neighborhood of \(\Theta _{*}\times \mathcal S _0\). For the consistency of \(\hat{\Theta }_n\), it suffices to assume that the Hessian is uniformly non-singular over \(\mathcal N _{\bar{\epsilon },\bar{\eta }}\), but the stronger condition given here will be useful to ensure a quadratic approximation of the criterion function, which is crucial for the \(\sqrt{n}\)-consistency of \(\hat{\Theta }_{n}\).

Further, we assume that \(\hat{\mathcal S }_{n}\) is consistent for \(\mathcal S _{0}\) in a suitable Hausdorff metric. Specifically, for subsets \(A,B\) of \( \mathcal S \), let

$$\begin{aligned} d_{H,W}(A,B):=\max \left\{ \sup _{a\in A}\inf _{b\in B}\Vert a-b\Vert _{W},\sup _{b\in B}\inf _{a\in A}\Vert a-b\Vert _{W}\right\} . \end{aligned}$$

Assumption 3.6

\(d_{H,W}(\hat{\mathcal S }_{n},\mathcal S _{0})=o_{p}(1)\).

Theorem 3.1 is our first main result, which establishes the consistency of the set estimator defined in (17) with \(c_n\) set to 0. This result is established by extending the standard consistency proof for extremum estimators to the current setting. Note that, under Assumption 3.2 (iv), the projection \(\theta _{*}(s):=\Pi _{\mathcal R _\Theta }s\) of each point \(s\in \mathcal S \) onto \(\mathcal R _\Theta \) exists and is uniquely determined. In other words, for each \(s\in \mathcal S \), \(\theta _{*}(s)\) is point identified. By setting \(c_n=0\), the set estimator is then asymptotically equivalent to the collection of minimizers \(\hat{\theta }_n (s):={\text{ argmin}}_{\theta \in \Theta }Q_n(\theta ,s)\) of the sample criterion function. The main challenge for establishing Hausdorff consistency is to show that \(\hat{\theta }_n(s)-\theta _{*}(s)\) vanishes in probability uniformly over a sufficiently large neighborhood of \(\mathcal S _0\). The proof of the theorem in the appendix formally establishes this and gives the desired result.

Theorem 3.1

Suppose Assumptions 2.1–2.4 and 3.1–3.6 hold. Let \(\hat{\Theta }_{n}\) be defined as in (17) with \(c_{n}=0\) for all \(n\). Then \(d_{H}(\hat{\Theta }_{n},\Theta _{*})=o_{p}(1)\).

The result of Theorem 3.1 is similar to that of Theorem 3.2 in Chernozhukov et al. (2007), who establish the Hausdorff consistency of a level-set estimator with \(c_n=0\) when \(Q_n\) degenerates on a neighborhood of the identified set. When Assumption 3.2 (iv) fails to hold, this estimator may not be consistent. We conjecture, however, that it would be possible to construct a Hausdorff consistent estimator of \(\Theta _{*}\) even in such a setting by choosing a positive sequence \(\{c_n\}\) of levels that tends to 0 as \(n\rightarrow \infty \) and by exploiting the fact that \(\hat{\mathcal S }_n\) converges to \(\mathcal S _0\) in a suitable Hausdorff metric. In fact, Kaido and White (2010) establish the Hausdorff consistency of their two-stage set estimator using this argument, but in their analysis, the first-stage parameter (\(s\) in our setting) must be finite dimensional. Extending Theorem 3.1 to allow non-convex parametric classes is of definite interest, but to keep our tight focus here, we leave this for future work.

3.2 The Rate of Convergence

Theorem 3.1 uses the fact that \(d_{H}(\hat{\Theta }_{n},\Theta _{*})\) can be bounded by \(d_{H,W}(\hat{\mathcal S }_{n},\mathcal S _{0})\). Although \(\hat{\mathcal S }_{n}\) does not converge at a parametric rate generally, the convergence rate of \(\hat{\Theta }_{n}\) can be improved when \(\hat{\mathcal S }_{n}\) converges to \(\mathcal S _{0}\) at a rate \(o_{p}(n^{-1/4})\). This is analogous to the results obtained for the point identified case; see, for example, Newey (1994), Ai and Chen (2003), and Ichimura and Lee (2010).

Assumption 3.7

\(d_{H,W}(\hat{\mathcal S }_{n},\mathcal S _{0})=o_{p}(n^{-1/4})\).

Theorem 3.2

Suppose the conditions of Theorem 3.1 hold. Suppose in addition Assumption 3.7 holds. Let \(\hat{\Theta }_{n}\) be defined as in (17) with \(c_n=0\) for all \(n\). Then, \(d_H(\hat{\Theta }_{n},\Theta _{*})=O_p(n^{-1/2})\).

Setting \(c_n\) to 0 is crucial for achieving the \(O_p(n^{-1/2})\) rate. We note that Theorem 3.2 builds on Lemma A.2 in the appendix, which establishes the convergence rate (in the directed Hausdorff distance) of \(\hat{\Theta }_{n}\) in (17) with a possibly nonzero level \(c_n\). This lemma does not require Assumption 3.2 (iv) but assumes the Hausdorff consistency of \(\hat{\Theta }_{n}\) as a high-level condition. This is why Theorem 3.2 is stated for \(\hat{\Theta }_{n}\) with \(c_n=0\). As previously discussed, however, if Theorem 3.1 is extended to allow non-convex parametric classes, this lemma can be used to characterize the estimator’s convergence rate in a more general setting.

3.3 The First-Stage Estimator

This section discusses how to construct a first-stage set estimator. A challenge is that the object of interest \(\mathcal S _{0}\) is a subset of an infinite-dimensional space, which requires us to use a nonparametric estimation technique. This type of estimation problem was recently analyzed by Santos (2011), who studies estimation of linear functionals of function-valued parameters in nonparametric instrumental variable problems. We rely on his results on consistency and the rate of convergence, which extend the analysis of Chernozhukov et al. (2007) to a nonparametric setting. Specifically, for each \(s\in \mathcal S \), let

$$\begin{aligned} \mathcal Q _{n}(s):=\sum _{j=1}^{\ell }\Big (\frac{1}{n}\sum _{i=1}^{n}\varphi ^{(j)} (X_{i},s)\Big )_{+}^{2}. \end{aligned}$$
(20)

This is a sample criterion function defined on \(\mathcal S \). For instance, \({\mathcal Q }_{n}\) for Example 2.1 is given by

$$\begin{aligned} \mathcal Q _{n}(s)=\sum _{j=1}^{K}\Big (\frac{1}{n}\sum _{i=1}^{n}(Y_{L,i}-s(Z_i))1_{A_j}(Z_i)\Big )_{+}^{2}+\sum _{j=1}^{K}\Big (\frac{1}{n}\sum _{i=1}^{n}(s(Z_i)-Y_{U,i})1_{A_j}(Z_i)\Big )_{+}^{2}. \end{aligned}$$

Our first-stage set estimator is a level set of \(\mathcal Q _{n}\) over a sieve \(\mathcal S _{n}\subseteq \mathcal S \). Given sequences \(\{a_{n}\}\) and \(\{b_{n}\}\) of positive constants, define

$$\begin{aligned} \hat{\mathcal S }_{n}:=\{s\in \mathcal S _{n}:\mathcal Q _{n}(s)\le b_{n}/a_{n}\}. \end{aligned}$$
(21)
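
A brute-force sketch of (20)–(21) for Example 2.1 follows. Representing the sieve by a finite basis with coefficients on a finite grid is an illustrative computational device (Corollary 3.1 below uses quadratic splines with \(J_n\) knots); all names here are our own.

```python
import numpy as np
from itertools import product

def S_hat(y_l, y_u, z, cells, basis, coef_grid, a_n, b_n):
    """First-stage level-set estimator (21) for Example 2.1, using the
    criterion (20). The sieve S_n is spanned by `basis` functions, each
    mapping the (n, d) regressor array to an (n,) array. Returns the
    retained coefficient vectors, each indexing an element of S_hat_n."""
    B = np.column_stack([b(z) for b in basis])          # (n, J) sieve design
    kept = []
    for beta in product(*coef_grid):
        s_vals = B @ np.asarray(beta)                   # s(Z_i) for s in S_n
        moments = []                                    # sample moments in (20)
        for in_A in cells:
            w = in_A(z)                                 # 1{Z_i in A_j}
            moments += [np.mean((y_l - s_vals) * w),
                        np.mean((s_vals - y_u) * w)]
        Qn = np.sum(np.maximum(moments, 0.0) ** 2)      # criterion (20)
        if Qn <= b_n / a_n:                             # level set (21)
            kept.append(np.asarray(beta))
    return kept
```

The retained coefficient vectors can then be mapped to first-stage values \(s(X_i)\) and passed to the second-stage sketch in Sect. 3.1.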

We add regularity conditions on \(\varphi \), \(\{\mathcal S _{n}\}\), and \(\{(a_{n},b_{n})\}\) to ensure the Hausdorff consistency of \(\hat{\mathcal S }_{n}\) and derive its convergence rate. The following two assumptions impose smoothness requirements on the map \(\varphi \).

Assumption 3.8

For each \(j\), there is a function \(B_{j}:\mathcal X \rightarrow \mathbb R _{+}\) such that

$$\begin{aligned} |\varphi ^{(j)}(x,s)-\varphi ^{(j)}(x,s^{\prime })|\le B_{j}(x)\rho (s,s^{\prime }),\quad \forall s,s^{\prime }\in \mathcal S , \end{aligned}$$

where \(\rho (s,s^{\prime }):=\sup _{x\in \mathcal X }\max _{j=1,{\ldots },L}|s^{(j)}(x)-s^{\prime (j)}(x)|\).

For each \(s\in \mathcal S \), let \(\mathcal I (s):=\{j\in \{1,{\ldots },\ell \}:E[\varphi ^{(j)}(X_{i},s)]>0\}\); \(\mathcal I (s)\) is the set of indexes whose associated moments violate the inequality restrictions. For each \(j\), define \(\bar{\varphi }^{(j)}:\mathcal S \rightarrow \mathbb R \) by \(\bar{\varphi }^{(j)}(s):=E[\varphi ^{(j)}(X_{i},s)]\).

Assumption 3.9

(i) For each \(j\), \(\bar{\varphi }^{(j)}:\mathcal S \rightarrow \mathbb R \) is continuously Fréchet differentiable with Fréchet derivative \(\dot{\varphi }_{s}^{(j)}\) at \(s\), and for each \(s\in \mathcal S \), the operator norm \(\Vert \dot{\varphi }_{s}^{(j)}\Vert _{op}\) is bounded away from 0 for some \(j\in \{1,{\ldots } ,\ell \}\); (ii) for each \(s\notin \mathcal S _{0}\), there exist \(j\in \mathcal I (s)\) and \(C_{j}>0\) such that \(E[\varphi ^{(j)}(X_{i},s)]\ge C_{j}\Vert s-s_{0}\Vert _{W}\) for some \(s_{0}\in \mathcal S _{0}\).

We also add regularity conditions on \(\mathcal S _{n}\), which can be satisfied by commonly used sieves including polynomials, splines, wavelets, and certain artificial neural network sieves.

Assumption 3.10

(i) For each \(n\), \(\mathcal S _{n}\subseteq \mathcal S \), and both \(\mathcal S _{n}\) and \(\mathcal S \) are closed with respect to \(\rho \); (ii) for each \(s\in \mathcal S \), there is \(\Pi _{n}s\in \mathcal S _{n}\) such that \(\sup _{s\in \mathcal S }\Vert s-\Pi _{n}s\Vert _{W}=O(\delta _{n})\) for some sequence \(\{\delta _{n}\}\) of non-negative constants with \(\delta _{n}\rightarrow 0\).

Theorem 3.3

Suppose Assumptions 2.1–2.3, 3.2 (i)–(iii), 3.3, 3.8, 3.9 (i), and 3.10 hold. Let \(a_{n}=O(\max \{n^{-1},\delta _{n}^{2}\}^{-1})\) and \(b_{n}\rightarrow \infty \) with \(b_{n}=o(a_{n})\). Then

$$\begin{aligned} d_{H,W}(\hat{\mathcal S }_{n},\mathcal S _{0})=o_{p}(1). \end{aligned}$$

In addition, suppose that Assumption 3.9 (ii) holds. Then

$$\begin{aligned} d_{H,W}(\hat{\mathcal S }_{n},\mathcal S _{0})=O_{p}\big (\sqrt{b_{n}/a_{n}}\big ). \end{aligned}$$

Theorem 3.3 can be used to establish Assumptions 3.6 and 3.7, which are imposed in Theorems 3.1 and 3.2. As the following corollary shows, these conditions are satisfied in Example 2.1 with a single regressor.

In what follows, for any two sequences of positive constants \(\{c_{n}\},\) \(\{d_{n}\}\), let \(c_{n}\asymp d_{n}\) mean there exist constants \(0<C_{1}<C_{2}<\infty \) such that \(C_{1}\le |c_{n}/d_{n}|\le C_{2}\) for all \(n\).

Corollary 3.1

In Example 2.1, suppose that \(\mathcal Z \) is a compact convex subset of the real line and \(r_{\theta }(z)=\theta ^{(1)}+\theta ^{(2)}z\), where \(\theta \in \Theta \subseteq \mathbb R ^{2}\). Suppose that \(\Theta \) is compact and convex. Suppose further that \(\{(Y_{L,i},Y_{U,i},Z_{i})\}_{i=1,{\ldots },n}\) is a random sample from \(P_{0}\) and that \(P_{0}(Z\in A_{k})>0\) for all \(k\) and \(Var(Z)>0\). Let \(\mathcal S :=\{s\in L_{\mathcal Z }^{2}:\Vert s\Vert _{\infty }\le M,|s(z)-s(z^{\prime })|\le M|z-z^{\prime }|,\forall z,z^{\prime }\in \mathcal Z \}\) for some \(M>0\). Let \(\{r_{q}(\cdot )\}_{q=1}^{J_{n}}\) be splines of order two with \(J_{n}\) knots on \(\mathcal Z \). Define \(\mathcal S _{n}:=\{s:s(z)=\sum _{q=1}^{J_{n}}\beta _{q}r_{q}(z)\}\) with \(J_{n}\asymp n^{c_{1}},c_{1}>1/3\). Let \(\hat{\mathcal S }_{n}\) be defined as in (21) with \(a_{n}\asymp n^{c_{2}}\), where \(2/3<c_{2}<1\), and \(b_{n}\asymp \ln n\). Then: (i) \(\hat{\mathcal S }_{n}\) is (Effros-) measurable; (ii) \(d_{H,W}(\hat{\mathcal S }_{n},\mathcal S _{0})=o_{p}(1)\); (iii) \(d_{H,W}(\hat{\mathcal S }_{n},\mathcal S _{0})=o_{p}(n^{-1/4})\).

Given these results, we further show that the estimator of the pseudo-true identified set is consistent and converges at an \(n^{-1/2}\)-rate.

Corollary 3.2

Suppose that the conditions of Corollary 3.1 hold. Let \(Q\) be defined as in (15) with \(W=1\). Let \(Q_{n}\) be defined as in (16) and \(\hat{\Theta }_{n}\) be defined as in (17) with \(c_{n}=0\) and \(\hat{\mathcal S }_{n}\) as in Corollary 3.1. Then \(d_{H}(\hat{\Theta }_{n},\Theta _{*})=O_{p}(n^{-1/2})\).

4 Concluding Remarks

Moment inequalities are widely used to estimate discrete choice problems and structures that involve censored variables. In many empirical applications, potentially misspecified parametric models are used to estimate such structures. This chapter studies a novel estimation procedure that is robust to misspecification of moment inequalities. To overcome the challenge that the conventional identified set may be empty under misspecification, we defined a pseudo-true identified set as the least squares projection of the set of functions at which the moment inequalities are satisfied. This set is nonempty under mild assumptions. We also proposed a two-stage set estimator for estimating the pseudo-true identified set. Our estimator first estimates the identified set of function-valued parameters by a level-set estimator over a suitable sieve. The pseudo-true identified set can then be estimated by projecting the first-stage estimator onto a finite-dimensional parameter space. We give conditions under which the estimator is consistent for the pseudo-true identified set in the Hausdorff metric and converges at rate \(O_p(n^{-1/2})\).

Developing inference procedures based on the proposed estimator would be interesting future work. Another interesting extension would be to study the optimal choice of the weighting matrix. In this chapter, we maintained the assumption that \(W\) is fixed and does not depend on \((\theta ,s)\). Given the form of the criterion function, the most natural choice of \(W\) would be the inverse of the variance-covariance matrix of \(s(X_i)-r_{\theta }(X_i)\). This matrix is generally unknown but can be consistently estimated by its sample analog: \( \hat{W}_n(\theta ,s):=(\frac{1}{n}\sum _{i=1}^n(s(X_i)-r_{\theta }(X_i)) (s(X_i)-r_{\theta }(X_i))^{\prime })^{-1}.\) Defining a sample criterion function using \(\hat{W}_n(\theta ,s)\) as a weighting matrix would lead to a three-step procedure, which may result in more efficient estimation of \(\Theta _{*}\). Yet another interesting direction would be to develop a specification test for moment inequality models based on the current framework. This would extend the results of Guggenberger et al. (2008), who study a procedure for testing the nonemptiness of the identified set.