
1 Introduction

This chapter develops a new approach to estimating structures defined by moment inequalities. Moment inequalities often arise as optimality conditions in discrete choice problems or in structures where economic variables are subject to some type of censoring. Typically, parametric models are used to estimate such structures. For example, in their analysis of an entry game in the airline markets, Ciliberto and Tamer (2009) use a linear specification for airlines’ profit functions and assume that unobserved heterogeneity in the profit functions can be captured by independent normal random variables. In asset pricing theory with short sales prohibited, Luttmer (1996) specifies the functional form of the pricing kernel as a power function of consumption growth, based on the assumption that the investor’s utility function is additively separable and isoelastic.

Any conclusions drawn from such methods rely on the validity of the model specification. Although commonly used estimation and inference methods for moment inequality models are robust to potential lack of identification, typically they are not robust to misspecification. Compared to cases where the parameter of interest is point identified, much less is known about the consequences of misspecified moment inequalities. As we will discuss, these can be serious. In general, misspecification makes it hard to interpret the estimated set of parameter values; an even more serious possibility is that the identified set could be an empty set. If the identified set is empty, every nonempty estimator sequence is inconsistent. Furthermore, it is often hard to see if the estimator is converging to some object that can be given any meaningful interpretation. An exception is the estimation method developed by Ponomareva and Tamer (2010), which focuses on estimating a regression function with interval censored outcome variables.

This chapter develops a new estimation method that is robust to potential parametric misspecification in general moment inequality models. Our contributions are three-fold. First, we define a pseudo-true identified set that is nonempty under mild assumptions and that can be interpreted as the projection of the set of function-valued parameters identified by the moment inequalities. Second, we construct a set estimator using a two-stage estimation procedure, and we show that the estimator is consistent for the pseudo-true identified set in the Hausdorff metric. Third, we give conditions under which the proposed estimator converges to the pseudo-true identified set at the \(n^{-1/2}\)-rate.

The first stage is a nonparametric estimator of the true moment function. Given this, why perform a parametric second-stage estimation? After all, the nonparametric first stage estimates the same object of interest, without the possibility of parametric misspecification. There are a variety of reasons a researcher may nevertheless prefer to implement the parametric second stage: first is the undeniably appealing interpretability of the parametric specification; second is the much more precise estimation and inference afforded by using a parametric specification; and third, the second term of the second-stage objective function may offer a potentially useful model specification diagnostic. Future research may permit deriving the asymptotic distribution of this term under the null of correct parametric specification to provide a formal test. The two-stage procedure proposed here delivers these benefits, while avoiding the more serious adverse consequences of potential misspecification.

The chapter is organized as follows. Section 2 describes the data generating process and gives examples that fall within the scope of this chapter. We also introduce our definition of the pseudo-true identified set. Section 3 defines our estimator and presents our main results. We conclude in Sect. 4. We collect all proofs into the appendix.

2 The Data Generating Process and the Model

Our first assumption describes the data generating process (DGP).

Assumption 2.1

Let \((\Omega ,\mathfrak F ,\mathbb P _{0})\) be a complete probability space. Let \(k,\ell \in \mathbb N \). Let \(X:\Omega \rightarrow \mathbb R ^{k}\) be a Borel measurable map, let \(\mathcal X \subseteq \mathbb R ^{k}\) be the support of \(X\), and let \(P_{0}\) be the probability measure induced by \(X\) on \(\mathcal X \). Let \(\rho _{0}:\mathcal X \rightarrow \mathbb R ^{\ell }\) be an unknown measurable function such that \(E[\rho _{0}(X)]\) exists and

$$\begin{aligned} E[\rho _{0}(X)]\le 0, \end{aligned}$$
(1)

where the expectation is taken with respect to \(P_{0}\).

In what follows, we call \(\rho _{0}\) the true moment function. The moment inequalities (1) often arise as an optimality condition in game-theoretic models (Bajari et al. 2007; Ciliberto and Tamer 2009) or models that involve variables that are subject to some kind of censoring (Manski and Tamer 2002). In empirical studies of such models, it is common to specify a parametric model for \(\rho _{0}\).

Assumption 2.2

Let \(p\in \mathbb N \) and let \(\Theta \) be a subset of \(\mathbb R ^{p}\) with nonempty interior. Let \(m:\mathcal X \times \Theta \rightarrow \mathbb R ^{\ell }\) be such that \(m(\cdot ,\theta )\) is measurable for each \(\theta \in \Theta \) and \(m(x,\cdot )\) is continuous on \(\Theta ,\) \(\mathrm{a.e.}-P_{0}\). For each \(\theta \in \Theta \), \(m(\cdot ,\theta )\in L_{\ell }^{2}:=\{f:\mathcal X \rightarrow \mathbb R ^{\ell }:E[f(X)^{\prime }f(X)]<\infty \}.\)

Throughout, we call \(m(\cdot ,\cdot )\) the parametric moment function.

Definition 2.1

Let \(m_{\theta }(\cdot ):=m(\cdot ,\theta )\). Define \(\mathcal M _{\Theta }:=\{m_{\theta }\in L_{\ell }^{2}:\theta \in \Theta \}.\) \(\mathcal M _{\Theta }\) is correctly specified (\(-P_{0}\)) if there exists \(\theta _{0}\in \Theta \) such that

$$\begin{aligned} P_{0}[\rho _{0}(X)=m(X,\theta _{0})]=1. \end{aligned}$$

Otherwise, the model is misspecified.

If the model is correctly specified, we may define the set of parameter values that can be identified by the inequalities in (1):

$$\begin{aligned} \Theta _{I}:=\{\theta \in \Theta :E[m(X,\theta )]\le 0\}. \end{aligned}$$

We call \(\Theta _{I}\) the conventional identified set. This set collects all parameter values that yield parametric moment functions that are observationally equivalent to \(\rho _{0}\).

It becomes difficult to interpret \(\Theta _{I}\) when the model is misspecified, as pointed out by Ponomareva and Tamer (2010) for a regression model with an interval-valued outcome variable. Suppose first that the model is misspecified but \(\Theta _{I}\) is nonempty. The set is still a collection of parameter values that are observationally equivalent to each other, but since there is no \(\theta \) in \(\Theta _{I}\) that corresponds to the true moment function, further structure is required to unambiguously interpret \(\Theta _{I}\) as a collection of “pseudo-true parameter(s)”. Further, \(\Theta _{I}\) may be empty, especially if \(\mathcal M _{\Theta }\) is a small class of functions. This makes the interpretation of \(\Theta _{I}\) even more difficult. In fact, interpretation is impossible, as there is nothing to interpret.

Often, the economics of a given problem impose further structure on the DGP. To specify this, we let \(0<L\le \ell ,\) and for measurable \(s:\mathcal X \rightarrow \mathbb R ^{L}\), let \(\Vert s\Vert _{L}:=E[s(X)^{\prime }s(X)]^{1/2}\). Let \(L_{L}^{2}:=\{s:\mathcal X \rightarrow \mathbb R ^{L},\Vert s\Vert _{L}<\infty \}\), and let \(\mathcal S \subseteq L_{L}^{2}\).

Assumption 2.3

There exists \(\varphi :{\mathcal X }\times \mathcal S \rightarrow \mathbb R ^{\ell }\) such that for each \(x\in \mathcal X \), \( \varphi (x,\cdot )\) is continuous on \(\mathcal S \) and for each \(s\in \mathcal S \), \(\varphi (\cdot ,s)\) is measurable. Further, there exists \( s_{0}\in \mathcal S \) such that

$$\begin{aligned} \rho _{0}(x)=\varphi (x,s_{0}),\quad \forall x\in \mathcal X . \end{aligned}$$

When \(\rho _{0}\in L_{\ell }^{2}\) and there is no further structure on \(\rho _{0}\) available, we let \(L=\ell ,\) \(\mathcal S =L_{\ell }^{2},\) and take \(\varphi \) to be the evaluation functional \(e:\mathcal X \times \mathcal S \rightarrow \mathbb R ^{\ell }\):

$$\begin{aligned} \varphi (x,s)=e(x,s)\equiv s(x), \end{aligned}$$

as then \(\varphi (x,\rho _{0})=e(x,\rho _{0})\equiv \rho _{0}(x)\) and \(s_{0}=\rho _{0}.\) In this case, it is not necessary to explicitly introduce \(\varphi \). Often, however, further structure on the form of \(\rho _{0}\) is available. Typically, this is reflected in \(s\) depending non-trivially only on a strict subvector of \(X,\) say \(X_{1}.\) In such cases, we may write \(\mathcal S \subseteq L_{\mathcal X _{1}}^{2}\) for clarity. We give several examples below.

When Assumption 2.3 holds, we typically parametrize the unknown function \(s_{0}\). For example, it is common to specify \(s_{0}\) as a linear function of some of the components of \(x\). As we will see in the examples, a common modeling assumption is

Assumption 2.4

There exists \(r:\mathcal X \times \Theta \rightarrow \mathbb R ^{L}\) such that with \(r_{\theta }:=r(\cdot ,\theta )\),

$$\begin{aligned} m(x,\theta )=\varphi (x,r_{\theta }),\quad \forall (x,\theta )\in \mathcal X \times \Theta . \end{aligned}$$

Thus, misspecification occurs when there is no \(\theta _{0}\) in \(\Theta \) such that \(s_{0}=r_{\theta _{0}}.\)

More generally, misspecification can occur because the researcher mistakenly imposes Assumption 2.3, in which case \(s_{0}\) fails to exist and there is again no \(\theta _{0}\) in \(\Theta \) such that \(\rho _{0}(x)=\varphi (x,r_{\theta _{0}}).\) As \(s_{0}\) is an element of an infinite-dimensional space, we may refer to this as “nonparametric” misspecification. To proceed, we assume that, as is often plausible, the researcher is sufficiently able to specify the structure of interest that nonparametric misspecification is not an issue, either because correct \(\varphi \) restrictions are imposed or no \(\varphi \) restrictions are imposed. We thus focus on the case of parametric misspecification, where \(s_{0}\) exists but there is no \(\theta _{0}\) in \(\Theta \) such that \(s_{0}=r_{\theta _{0}}.\)

2.1 Examples

In this section, we present several motivating examples and also give commonly used parametric specifications in these examples. For any vector \(x\), we use \(x^{(j)}\) to denote the \(j\)th component of the vector. Similarly, for a vector valued function \(f(x)\), we use \(f^{(j)}(x)\) to denote the \(j\)th component of \(f(x)\).

Example 2.1

(Interval censored outcome) Let \(Z:\Omega \rightarrow \mathbb R ^{d_{Z}}\) be a regressor with support \(\mathcal Z \). Let \(Y:\Omega \rightarrow \mathbb R \) be an outcome variable that is generated as:

$$\begin{aligned} Y=s_{0}(Z)+\epsilon , \end{aligned}$$
(2)

where \(s_{0}\in \mathcal S :=L_{\mathcal Z }^{2},\) say, and \(\epsilon \) satisfies \(E[\epsilon |Z]=0\). We let \(\mathcal Y \) denote the support of \(Y\). Suppose \(Y\) is unobservable, but there exist \((Y_{L},Y_{U})^{\prime }:\Omega \rightarrow \mathcal Y \times \mathcal Y \) such that \(Y_{L}\le Y\le Y_{U}\) almost surely. Then, \((Y_{L},Y_{U},Z)^{\prime }\) satisfies the following inequalities almost surely:

$$\begin{aligned} E[Y_{L}|Z]-s_{0}(Z)&\le 0 \end{aligned}$$
(3)
$$\begin{aligned} s_{0}(Z)-E[Y_{U}|Z]&\le 0. \end{aligned}$$
(4)

Let \(x=(y_{L},y_{U},z)^{\prime }\in \mathcal X :=\mathcal Y \times \mathcal Y \times \mathcal Z \). Given a collection \(\{A_{1},{\ldots } ,A_{K}\}\) of Borel subsets of \(\mathcal Z \), the inequalities in (3), (4) imply that the moment inequalities in (1) hold with

$$\begin{aligned} \rho _{0}(x)=\varphi (x,s_{0}):= \left[\begin{array}{l} y_{L}-s_{0}(z) \\ s_{0}(z)-y_{U} \end{array}\right] \otimes 1_{A}(z), \end{aligned}$$
(5)

where \(1_{A}(z):=(1\{z\in A_{1}\},{\ldots } ,1\{z\in A_{K}\})^{\prime }\). For each \(x\in \mathcal X \) and \(s\in \mathcal S \), the functional \(\varphi \) evaluates the vertical distances of \(s(z)\) from \(y_{L}\) and \(y_{U}\) and multiplies them by the indicator functions evaluated at \(z\). The additional information on \(\rho _{0}\) available in this example is that the moment functions are based on these vertical distances.

A common specification for \(s_{0}\) is \(s_{0}(z)=r_{\theta _{0}}(z)=z^{\prime }\theta _{0}\) for some \(\theta _{0}\in \Theta \subseteq \mathbb R ^{d_{Z}}\). The parametric moment function is then given for each \(x\in \mathcal X \) by \(m(x,\theta )=\varphi (x,r_{\theta })\). Therefore, this example satisfies Assumption 2.4.
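
To fix ideas, the following is a minimal numerical sketch of this parametric moment function. The function name, array shapes, and the toy instrument sets are our own illustrative choices, not part of the chapter's formal development; the column ordering follows the Kronecker product in (5).

```python
import numpy as np

def m_interval(theta, y_l, y_u, z, cells):
    """Parametric moment function m(x, theta) = phi(x, r_theta) of
    Example 2.1 with the linear specification r_theta(z) = z'theta.
    Columns are ordered as in (5): all (Y_L - Z'theta) moments over
    A_1..A_K, then all (Z'theta - Y_U) moments."""
    fitted = z @ theta                                   # r_theta(Z_i)
    lows = [(y_l - fitted) * in_A(z) for in_A in cells]  # (Y_L - Z'theta) 1{Z in A_j}
    ups = [(fitted - y_u) * in_A(z) for in_A in cells]   # (Z'theta - Y_U) 1{Z in A_j}
    return np.column_stack(lows + ups)                   # (n, 2K) array

# Toy usage: one regressor, two half-line instrument sets A_1, A_2.
rng = np.random.default_rng(0)
z = rng.uniform(size=(500, 1))
y = z[:, 0] + rng.normal(scale=0.1, size=500)            # latent outcome
y_l, y_u = y - 0.2, y + 0.2                              # interval censoring
cells = [lambda z: (z[:, 0] < 0.5).astype(float),
         lambda z: (z[:, 0] >= 0.5).astype(float)]
print(m_interval(np.array([1.0]), y_l, y_u, z, cells).mean(axis=0))
```

At the true slope, the printed sample means of all \(2K\) moments should be (weakly) negative, consistent with (1).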

Example 2.2

Tamer (2003) considers a simultaneous game of complete information. For each \(j=1,2\), let \(Z_{j}:\Omega \rightarrow \mathbb R ^{d_{Z}}\) and \(\epsilon _{j}:\Omega \rightarrow \mathbb R \) be firm \(j\)’s characteristics that are observable to the firms. The econometrician observes the \(Z\)’s but not the \(\epsilon \)’s. For each \(j\), let \(g_{j}:\mathcal Z \times \{0,1\}\rightarrow \mathbb R \). These functions are known to the firms but not to the econometrician. Suppose that each firm’s payoff is given by

$$\begin{aligned} \pi _{j}(Z_{j},Y_{j},Y_{-j})=(\epsilon _{j}-g_{j}(Z_{j},Y_{-j}))Y_{j},\quad j=1,2, \end{aligned}$$

where \(Y_{j}\in \mathcal Y :=\{0,1\}\) is firm \(j\)’s entry decision, and \( Y_{-j}\in \mathcal Y \) is the other firm’s entry decision. The econometrician observes these decisions. Given \((z_{1},z_{2})\), the firms’ payoffs can be summarized in Table 1.

Suppose the firms and the econometrician know that \(g_{j}(z,1)\ge g_{j}(z,0)\) for any value of \(z\). This means that, other things equal, the opponent’s entry would reduce the firm’s own profit. In this setting, there are several possible equilibrium outcomes depending on the realization of \((\epsilon _{1},\epsilon _{2})\). If \(\epsilon _{1}>g_{1}(z_{1},1)\) and \(\epsilon _{2}>g_{2}(z_{2},1)\), then \((1,1)\) is the unique Nash equilibrium (NE) outcome. Similarly, if \(\epsilon _{1}>g_{1}(z_{1},1)\) and \(\epsilon _{2}<g_{2}(z_{2},1)\), then \((1,0)\) is the unique NE outcome, and if \(\epsilon _{1}<g_{1}(z_{1},1)\) and \(\epsilon _{2}>g_{2}(z_{2},1)\), then \((0,1)\) is the unique NE outcome. If \(\epsilon _{1}<g_{1}(z_{1},1)\) and \(\epsilon _{2}<g_{2}(z_{2},1)\), there are two Nash equilibria, giving the outcomes \((1,0)\) and \((0,1)\). Let \(F_{j}\), \(j=1,2\), be the unknown CDF of \(\epsilon _{j}\). Without any assumptions on the equilibrium selection mechanism, the model predicts the following set of inequalities:

$$\begin{aligned}&P(Y_{1}=1,Y_{2}=1|Z_{1}=z_{1},Z_{2}=z_{2}) =(1-F_{1}(g_{1}(z_{1},1)))(1-F_{2}(g_{2}(z_{2},1))) \end{aligned}$$
(6)
$$\begin{aligned}&P(Y_{1}=1,Y_{2}=0|Z_{1}=z_{1},Z_{2}=z_{2}) \ge (1-F_{1}(g_{1}(z_{1},1)))F_{2}(g_{2}(z_{2},1)) \end{aligned}$$
(7)
$$\begin{aligned}&P(Y_{1}=1,Y_{2}=0|Z_{1}=z_{1},Z_{2}=z_{2}) \le F_{2}(g_{2}(z_{2},1)). \end{aligned}$$
(8)

Let \(x:=(y_{1},y_{2},z_{1},z_{2})^{\prime }\in \mathcal X :=\mathcal Y \times \mathcal Y \times \mathcal Z \times \mathcal Z \). Let \(s_{0}\in \mathcal S :=\{s\in L_{\mathcal Z \times \mathcal Z }^{2}:s(z_{1},z_{2})\in [ 0,1]^{2},\forall (z_{1},z_{2})\in \mathcal Z \times \mathcal Z \}\) be defined by

$$\begin{aligned} s_{0}^{(1)}(z_{1},z_{2})&:=F_{1}(g_{1}(z_{1},1)) \nonumber \\ s_{0}^{(2)}(z_{1},z_{2})&:=F_{2}(g_{2}(z_{2},1)). \end{aligned}$$

Here, \(s_{0}^{(j)}(z_{1},z_{2})\) is the conditional probability that firm \(j\)’s profit upon entry is negative given \(z_{1}\) and \(z_{2}\). Given a collection \(\{A_{j},j=1,{\ldots } ,K\}\) of Borel subsets of \(\mathcal Z \times \mathcal Z \), let \(1_{A}(z):=(1\{(z_{1},z_{2})\in A_{1}\},{\ldots },1\{(z_{1},z_{2})\in A_{K}\})^{\prime }\). The inequalities (6)–(8) imply the moment inequalities in (1) hold with

$$\begin{aligned} \rho _{0}(x)&=\varphi (x,s_{0}) \nonumber \\&= \left(\begin{array}{c} 1\{y_{1}=1,y_{2}=1\}-(1-s_{0}^{(1)}(z_{1},z_{2}))(1-s_{0}^{(2)}(z_{1},z_{2})) \\ (1-s_{0}^{(1)}(z_{1},z_{2}))(1-s_{0}^{(2)}(z_{1},z_{2}))-1\{y_{1}=1,y_{2}=1\} \\ (1-s_{0}^{(1)}(z_{1},z_{2}))s_{0}^{(2)}(z_{1},z_{2})-1\{y_{1}=1,y_{2}=0\} \\ 1\{y_{1}=1,y_{2}=0\}-s_{0}^{(2)}(z_{1},z_{2}) \end{array}\right) \otimes 1_{A}(z). \end{aligned}$$

The additional information on \(\rho _{0}\) is that it is based on the differences between some combinations of the conditional probabilities \(s_{0}(z_{1},z_{2})\) and indicators for specific events.

A common parametric specification for \(g_j\) is \(g_j(z_j,y_{-j})=z_{j}^{\prime }\gamma _{0}-y_{-j}\beta _{j,0}\) for some \(\beta _{j,0}\in B\subseteq \mathbb R _+\) and \(\gamma _0\in \Gamma \subseteq \mathbb R ^{d_Z}\). It is also common to assume that \(F_j,j=1,2\) belong to a known parametric class \(\{F(\cdot ;\alpha ),\alpha \in \mathcal A \}\) of distributions. Then the parametric moment function can be defined for each \(x\) by \(m(x,\theta ):=\varphi (x,r_\theta )\), where \(\theta :=(\alpha _1,\alpha _2,\beta _1,\beta _2,\gamma )^{\prime }\) and

$$\begin{aligned} r^{(1)}_\theta (z_1,z_2)&= F(z_{1}^{\prime }\gamma -\beta _{1};\alpha _1) \end{aligned}$$
(9)
$$\begin{aligned} r^{(2)}_\theta (z_1,z_2)&= F(z_{2}^{\prime }\gamma -\beta _{2};\alpha _2). \end{aligned}$$
(10)

This example also satisfies Assumption 2.4.
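
As a concrete illustration of (9)–(10), the sketch below evaluates \(r_\theta\) when the parametric family \(F(\cdot ;\alpha )\) is taken to be the CDF of a \(N(0,\alpha ^{2})\) distribution. This family, the function name, and the parameter packing are illustrative assumptions only.

```python
import numpy as np
from scipy.stats import norm

def r_theta_entry(theta, z1, z2):
    """Sketch of the parametric first stage (9)-(10) for Example 2.2,
    taking F(.; alpha) to be the N(0, alpha^2) CDF (illustrative only).
    theta = (alpha1, alpha2, beta1, beta2, gamma_1, ..., gamma_d)."""
    d = z1.shape[1]
    a1, a2, b1, b2 = theta[:4]
    gamma = theta[4:4 + d]
    r1 = norm.cdf((z1 @ gamma - b1) / a1)   # F(z1'gamma - beta1; alpha1)
    r2 = norm.cdf((z2 @ gamma - b2) / a2)   # F(z2'gamma - beta2; alpha2)
    return np.column_stack([r1, r2])        # (n, 2) array: (r^(1), r^(2))
```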

Table 1 The entry game payoff matrix

Example 2.3

(Discrete choice) Suppose an agent chooses \(Z\in \mathbb R ^{d_{Z}}\) from a set \(\mathcal Z :=\{z_{1},{\ldots } ,z_{K}\}\) in order to maximize her expected payoff \(E[s_{0}(Y,Z)\mid \mathcal I ]\), where \(Y\) is a vector of observable random variables, \(s_{0}\in \mathcal S :=L_{\mathcal Y \times \mathcal Z }^{2}\) is the payoff function, and \(\mathcal I \) is the agent’s information set. The optimality condition for the agent’s choice is given by:

$$\begin{aligned} E[s_{0}(Y,z_{j})-s_{0}(Y,Z)\mid \mathcal I ]\le 0,\quad j=1,{\ldots } ,K. \end{aligned}$$
(11)

Let \(x:=(y,z)^{\prime }\in \mathcal X :=\mathcal Y \times \mathcal Z \). The optimality conditions in (11) imply that the unconditional moment inequalities in (1) hold with

$$\begin{aligned} \rho _{0}(x)=\varphi (x,s_{0})=\left(\begin{array}{c} \left[\begin{array}{c} s_{0}(y,z_{1})-s_{0}(y,z_{1}) \\ \vdots \\ s_{0}(y,z_{K})-s_{0}(y,z_{1}) \end{array}\right] \times 1\{z=z_{1}\} \\ \vdots \\ \left[\begin{array}{c} s_{0}(y,z_{1})-s_{0}(y,z_{K}) \\ \vdots \\ s_{0}(y,z_{K})-s_{0}(y,z_{K})\end{array}\right] \times 1\{z=z_{K}\} \end{array}\right) .\end{aligned}$$

For given \(y,\) the functional \(\varphi \) evaluates the profit differences between a given choice \(z\) (e.g., \(z_{1}\)) and every other possible choice. The additional information on \(\rho _{0}\) is that it is based on the profit differences.

A common specification for \(s_{0}\) is \(s_{0}(y,z)=r_{\theta _{0}}(y,z)=\psi (y,z;\alpha _{0})+z^{\prime }\beta _{0}+\epsilon _{z}\) for some known function \(\psi \), unknown \((\alpha _{0},\beta _{0})\in \Theta \subset \mathbb R ^{d_{\alpha }+d_{\beta }}\), and an unobservable choice-dependent error \(\epsilon _{z}\). For simplicity, we assume that \(\epsilon _{z}\) satisfies \(E[\epsilon _{z_{i}}-\epsilon _{z_{j}}\mid \mathcal I ]=0\) for any \(i,j\); see Pakes et al. (2006) and Pakes (2010) for detailed discussions. The parametric moment function is then given for each \(x\in \mathcal X \) by \(m(x,\theta )=\varphi (x,r_{\theta })\). This example satisfies Assumption 2.4.

Example 2.4

(Pricing kernel) Let \(Z:\Omega \rightarrow \mathbb R ^{d_{Z}}\) be the payoffs of \(d_{Z}\) securities that are traded at a price of \(P\in \mathcal P \subseteq \mathbb R ^{d_{Z}}\). If short sales are not allowed for any securities, then the feasible set of portfolio weights is restricted to \(\mathbb R _{+}^{d_{Z}}\) and the standard Euler equation does not hold. Instead, the following Euler inequalities hold (see Luttmer 1996):

$$\begin{aligned} E[s_{0}(Y)Z-P]\le 0, \end{aligned}$$

where \(Y:\Omega \rightarrow \mathcal Y \) is a state variable, e.g. consumption growth, and \(s_{0}\in \mathcal S :=\{s\in L_{\mathcal Y }^{2}:s(y)\ge 0,\forall y\in \mathcal Y \}\) is the pricing kernel function. The moment inequalities thus hold with the true moment function:

$$\begin{aligned} \rho _{0}(x)=\varphi (x,s_{0})=s_{0}(y)z-p, \end{aligned}$$

where \(x:=(y,z,p)^{\prime }\in \mathcal Y \times \mathcal Z \times \mathcal P \). This functional evaluates the pricing kernel \(s_{0}\) at \(y\) and computes a vector of pricing errors. The additional information on \(\rho _{0}\) is that it is based on the pricing errors.

A common specification for \(s_{0}\) is \(s_{0}(y)=r_{\theta _{0}}(y)=\beta _{0}y^{-\gamma _{0}}\), where \(\beta _{0}\in B\subseteq [ 0,1]\) is the investor’s subjective discount factor and \(\gamma _{0}\in \Gamma \subseteq \mathbb R _{+}\) is the relative risk aversion coefficient. Let \(\theta :=(\beta ,\gamma )^{\prime }\). The parametric moment function is then given for each \(x\in \mathcal X \) by \(m(x,\theta )=\varphi (x,r_{\theta })\), satisfying Assumption 2.4.
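
A minimal sketch of the resulting parametric moment function follows; the function name and array shapes are illustrative assumptions.

```python
import numpy as np

def m_pricing(theta, y, z, p):
    """Euler-inequality moments of Example 2.4 under the isoelastic
    specification r_theta(y) = beta * y**(-gamma), so that
    m(x, theta) = beta * y**(-gamma) * z - p.
    y: (n,) consumption growth; z, p: (n, d) payoffs and prices."""
    beta, gamma = theta
    kernel = beta * y ** (-gamma)       # pricing kernel evaluated at y
    return kernel[:, None] * z - p      # (n, d) pricing errors, E[.] <= 0
```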

2.2 Projection

The inequality restrictions \(E[\varphi (X,s_{0})]\le 0\) may not uniquely identify \(s_{0}\). Define

$$\begin{aligned} \mathcal S _{0}:=\{s\in \mathcal S :E[\varphi (X,s)]\le 0\}. \end{aligned}$$

We define a pseudo-true identified set of parameters as a collection of projections of elements in \(\mathcal S _{0}\). Let \(W\) be a given non-random finite \(L\times L\) symmetric positive-definite matrix. For each \(s\in \mathcal S \), define the norm \(\Vert s\Vert _{W}:=E[s(X)^{\prime }Ws(X)]^{1/2}\). For each \(s\in \mathcal S \) and \(A\subseteq \mathcal S \), the projection map \(\Pi _{A}:\mathcal S \rightarrow A\) is the map such that

$$\begin{aligned} \Vert s-\Pi _{A}s\Vert _{W}=\inf _{a\in A}\Vert s-a\Vert _{W}. \end{aligned}$$

Let \(\mathcal R _{\Theta }:=\{r_{\theta }\in \mathcal S :\theta \in \Theta \} \). Given Assumption 2.4, we can define

$$\begin{aligned} \Theta _{*}:=\{\theta \in \Theta :r_{\theta }=\Pi _{\mathcal R _{\Theta }}s,s\in \mathcal S _{0}\}. \end{aligned}$$

When \(\varphi \) is the evaluation map \(e\), \(\Theta _{*}\) is simply \(\Theta _{*}:=\{\theta \in \Theta :m_{\theta }=\Pi _{\mathcal M _{\Theta }}s,s\in \mathcal S _{0}\}.\)

\(\Theta _{*}\) can be interpreted as the set of parameters whose associated functions \(r_{\theta }\) lie in the \(\mathcal R _{\Theta }\)-projection of \(\mathcal S _{0}\). This set is nonempty (under some regularity conditions), and each of its elements is the projection of some \(s\in \mathcal S _{0}\), a function inducing a functional \(\varphi (\cdot ,s)\) that is observationally equivalent to \(\rho _{0}\). In this sense, each element in \(\Theta _{*}\) has an interpretation as a pseudo-true value. Thus, we call \(\Theta _{*}\) the pseudo-true identified set. [White (1982) uses \(\theta _{*}\) to denote the unique pseudo-true value in the fully identified case.]

We illustrate the relationship between \(\Theta _{I}\) and \(\Theta _{*}\) with an example. Consider Example 2.1. Let \(\Theta \subseteq \mathbb R ^{d_{Z}}\). The conventional identified set is given by

$$\begin{aligned} \Theta _{I}=\{\theta \in \Theta&:E[(Y_{L}-Z^{\prime }\theta )1\{Z\in A_{j}\}]\le 0, \nonumber \\&\qquad {\text{ and}}\;E[(Z^{\prime }\theta -Y_{U})1\{Z\in A_{j}\}]\le 0,\quad j=1,{\ldots },K\}. \end{aligned}$$
(12)

The pseudo-true identified set is given by

$$\begin{aligned} \Theta _{*}=\{\theta \in \Theta :\theta =E[ZZ^{\prime }]^{-1}E[Zs(Z)],s\in \mathcal {S}_{0}\}. \end{aligned}$$
(13)
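
For this example, the projection in (13) has an explicit sample analogue. The sketch below computes it; the function name and signature are our own illustrative choices.

```python
import numpy as np

def project_onto_linear_class(s, z):
    """Sample analogue of the projection in (13):
    theta_*(s) = E[ZZ']^{-1} E[Z s(Z)], the L2(P0) projection of a
    candidate function s from S_0 onto the linear class {z'theta}.
    s: callable mapping the (n, d) regressor array to an (n,) array."""
    n = len(z)
    zz = z.T @ z / n                    # sample analogue of E[ZZ']
    zs = z.T @ s(z) / n                 # sample analogue of E[Z s(Z)]
    return np.linalg.solve(zz, zs)      # (d,) projection coefficients
```

Applying this map to (an estimate of) each element of \(\mathcal S _{0}\) traces out a sample counterpart of \(\Theta _{*}\).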

Let \(D\) be a \(d_{Z}\times K\) matrix whose \(j\)th column is \(E[Z\,1\{Z\in A_{j}\}]\). For this example, the following result holds:

Proposition 2.1

Let the conditions of Example 2.1 hold, and let \(\Theta _{*}\) be given as in (13). Let \(\Theta _{I}\) be given as in (12). Then \(\Theta _{I}\subseteq \Theta _{*}\). Suppose further that \(\mathcal M _{\Theta }\) is correctly specified, that \(E[Y_{U}|Z]=E[Y_{L}|Z]=Z^{\prime }\theta _{0}\) a.s., and that \(d_{Z}\le \mathrm{rank}(D)\). Then \(\Theta _{I}=\Theta _{*}=\{\theta _{0}\}\).

As this example shows, unless there is some information that helps restrict \(\mathcal S _{0}\) very tightly, \(\Theta _{I}\) is often a proper subset of \(\Theta _{*}\). This is because, without such information, \(\mathcal S _{0}\) is typically a much richer class of functions than \(\mathcal R _{\Theta }\). Another important point is that, although \(\Theta _{*}\) is generally well-defined, \(\Theta _{I}\) can be empty quite easily. In particular, for any \(x,x^{\prime }\in \mathcal X \), let \(x_{\lambda }:=\lambda x+(1-\lambda )x^{\prime },0\le \lambda \le 1\). \(\Theta _{I}\) is empty if there exist \((x,x^{\prime })\) and \(\lambda \in [0,1]\) such that (i) \(x_{\lambda }\in \mathcal X \) and \((E[Y_{L}|x_{\lambda }]-E[Y_{U}|x])/\Vert x_{\lambda }-x\Vert >(E[Y_{U}|x^{\prime }]-E[Y_{U}|x])/\Vert x^{\prime }-x\Vert \) or (ii) \(x_{\lambda }\in \mathcal X \) and \((E[Y_{U}|x_{\lambda }]-E[Y_{L}|x])/\Vert x_{\lambda }-x\Vert <(E[Y_{L}|x^{\prime }]-E[Y_{L}|x])/\Vert x^{\prime }-x\Vert \). Fig. 1, which is similar to Fig. 1 in Ponomareva and Tamer (2010), illustrates an example that satisfies condition (i) for the one-dimensional case.

In this example, each element in \(\Theta _{*}\) solves the following moment restrictions:

$$\begin{aligned} E[Z(Z^{\prime }\theta -Y)]=E[Zu(X)], \end{aligned}$$
(14)

with \(u(x)=s(z)-y\) for some \(s\in \mathcal S _0\). This can be viewed as a special case of the incomplete linear moment restrictions studied in Bontemps, Magnac, and Maurin (2011) (BMM, henceforth). BMM show that the set of parameters solving incomplete linear moment restrictions is necessarily convex and develop an inference method that exploits this property.

We note that this connection to BMM’s work arises only when the parametric class takes the form \(\mathcal R _\Theta =\{r_\theta : r_\theta (z)=z^{\prime }\theta ,~\theta \in \Theta \}\). The elements of \(\Theta _{*}\) do not generally solve incomplete linear moment restrictions when \(\mathcal R _\Theta \) includes nonlinear functions of \(\theta \). Therefore, BMM’s inference method is applicable only when \(r_\theta \) is linear. Our estimation procedure is more flexible than theirs in two respects. First, we allow projection onto a more general class of parametric functions that includes nonlinear functions of \(\theta \). Second, as a consequence, we do not require \(\Theta _{*}\) to be convex. We pay a price for this generality, however: we require \(s\) to satisfy suitable smoothness conditions, which BMM do not. We discuss these conditions in detail in the following section.

Fig. 1 An example with an empty conventional identified set

3 Estimation

3.1 Set Estimator

For \(W\) as above and each \((\theta ,s)\in \Theta \times \mathcal S \), let the population criterion function be defined by

$$\begin{aligned} Q(\theta ,s):=&\;E[(s(X_{i})-r_{\theta }(X_{i}))^{\prime }W(s(X_{i})-r_{\theta }(X_{i}))] \nonumber \\&\;-\inf _{\vartheta \in \Theta }E[(s(X_{i})-r_{\vartheta }(X_{i}))^{\prime }W(s(X_{i})-r_{\vartheta }(X_{i}))]. \end{aligned}$$
(15)

Using the population criterion function, the “pseudo-true” identified set \(\Theta _{*}\) can be equivalently written as

$$\begin{aligned} \Theta _{*}=\{\theta :Q(\theta ,s)=0,\quad s\in \mathcal S _{0}\}. \end{aligned}$$

Given a sample \(\{X_{1},{\ldots } ,X_{n}\}\) of observations, let the sample criterion function be defined for each \((\theta ,s)\in \Theta \times \mathcal S \) by

$$\begin{aligned} Q_{n}(\theta ,s) :=&\;\frac{1}{n}\sum _{i=1}^{n}(s(X_{i})-r_{\theta }(X_{i}))^{\prime }W(s(X_{i})-r_{\theta }(X_{i})) \nonumber \\&\; -\inf _{\vartheta \in \Theta }\frac{1}{n}\sum _{i=1}^{n}(s(X_{i})-r_{\vartheta }(X_{i}))^{\prime }W(s(X_{i})-r_{\vartheta }(X_{i})). \end{aligned}$$
(16)
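
A minimal computational sketch of (16) follows. The names, the representation of \(s\) by its values \(s(X_i)\), and the use of a local optimizer for the profiled infimum are all illustrative assumptions, not part of the formal definition.

```python
import numpy as np
from scipy.optimize import minimize

def Q_n(theta, s_vals, r, x, W):
    """Sample criterion (16). s_vals is the (n, L) array of s(X_i),
    r(theta, x) returns the (n, L) array of r_theta(X_i), W is L x L.
    The profiled term inf over vartheta is computed by a derivative-free
    local optimizer started at theta -- a shortcut that presumes a
    well-behaved inner minimization."""
    def ssq(v):                          # weighted mean squared distance
        e = s_vals - r(v, x)
        return float(np.mean(np.einsum('ij,jk,ik->i', e, W, e)))
    inner = minimize(ssq, x0=np.asarray(theta, float), method='Nelder-Mead')
    return ssq(theta) - inner.fun        # Q_n(theta, s) >= 0 by construction
```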

Ideally, we would like to estimate \(\Theta _{*}\) by \(\tilde{\Theta }_{n}:=\{\theta :Q_{n}(\theta ,s)\le c_{n},s\in \mathcal S _{0}\}\). But \(\mathcal S _{0}\) is unknown, so we must estimate it. Thus, we employ a two-stage procedure, similar to that studied in Kaido and White (2010). Section 3.3 discusses how to construct a first-stage estimator of \(\mathcal S _{0}\); for now, we simply suppose that such an estimator exists. To state this formally, let \(\mathcal F (A)\) denote the set of closed subsets of a set \(A\). See Kaido and White (2010) for background, including discussion of Effros measurability.

Assumption 3.1

(First-stage estimator) For each \(n\), let \(\mathcal S _{n}\subseteq \mathcal S \). \(\hat{\mathcal S }_{n}:\Omega \rightarrow \mathcal F (\mathcal S _{n})\) is (Effros-) measurable.

Given a first-stage estimator, we define a set estimator for the pseudo-true identified set. Let \(\{c_{n}\}\) be a sequence of non-negative constants. The set estimator for \(\Theta _{*}\) is defined by

$$\begin{aligned} \hat{\Theta }_{n}:=\{\theta \in \Theta :Q_{n}(\theta ,s)\le c_{n},s\in \hat{\mathcal S }_{n}\}. \end{aligned}$$
(17)
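
Reusing the `Q_n` sketch above, the set estimator (17) can be approximated by a brute-force search. The grid over \(\Theta \) and the finite list standing in for \(\hat{\mathcal S }_{n}\) are computational devices of this sketch; `tol` merely absorbs numerical error in the profiled minimum when \(c_n=0\).

```python
import numpy as np

def theta_hat_set(theta_grid, s_hat_list, r, x, W, c_n=0.0, tol=1e-8):
    """Second-stage set estimator (17): keep every theta on a grid for
    which Q_n(theta, s) <= c_n for some s in the first-stage estimate
    (here a finite list of (n, L) arrays of first-stage values)."""
    kept = [theta for theta in theta_grid
            if any(Q_n(theta, s_vals, r, x, W) <= c_n + tol
                   for s_vals in s_hat_list)]
    return np.array(kept)
```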

We establish our consistency results using the Hausdorff metric. Let \(||\cdot ||\) denote the Euclidean norm, and for any closed subsets \(A\) and \(B\) of a finite-dimensional Euclidean space (here, the space containing \(\Theta \)), let

$$\begin{aligned} d_{H}(A,B):=\max \{\vec {d}_{H}(A,B),\vec {d}_{H}(B,A)\},\quad \vec {d}_{H}(A,B):=\sup _{a\in A}\inf _{b\in B}\Vert a-b\Vert , \end{aligned}$$
(18)

where \(d_{H}\) and \(\vec {d}_{H}\) denote the Hausdorff metric and the directed Hausdorff distance, respectively.
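
For finite point sets, such as grid approximations of \(\hat{\Theta }_{n}\) and \(\Theta _{*}\), (18) can be computed directly; the sketch below is illustrative.

```python
import numpy as np

def hausdorff(A, B):
    """Hausdorff metric (18) between two finite point sets,
    given as (m, p) and (k, p) arrays."""
    D = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)  # pairwise distances
    d_ab = D.min(axis=1).max()          # directed distance from A to B
    d_ba = D.min(axis=0).max()          # directed distance from B to A
    return max(d_ab, d_ba)
```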

Before stating our assumptions, we introduce some additional notation. Let \( D_{\theta }^{\alpha }\) denote the differential operator \(\partial ^{|\alpha |}/\partial \theta _{1}^{\alpha _{1}}\cdots \partial \theta _{p}^{\alpha _{p}} \) with \(|\alpha |:=\sum _{j=1}^{p}\alpha _{j}\). Similarly, we let \( D_{x}^{\beta }\) denote the differential operator \(\partial ^{|\beta |}/\partial x_{1}^{\beta _{1}}\cdots \partial x_{k}^{\beta _{k}}\) with \( |\beta |:=\sum _{j=1}^{k}\beta _{j}\). For a function \(f:\mathcal X \rightarrow \mathbb R \) and \(\gamma >0\), let \(\underline{\gamma }\) be the largest integer smaller than \(\gamma \) and define

$$\begin{aligned} \Vert f\Vert _{\gamma }:=\max _{|\beta |\le \underline{\gamma }}\sup _{x\in \mathcal X }\big |D_{x}^{\beta }f(x)\big |+\max _{|\beta |=\underline{\gamma }}\sup _{x\ne y\in \mathcal X }\frac{\big |D_{x}^{\beta }f(x)-D_{x}^{\beta }f(y)\big |}{\Vert x-y\Vert ^{\gamma -\underline{\gamma }}}. \end{aligned}$$

Let \(\mathcal C _{M}^{\gamma }(\mathcal X )\) be the set of all continuous functions \(f:\mathcal X \rightarrow \mathbb R \) such that \(\Vert f\Vert _{\gamma }\le M\). Let \(\mathcal C _{M,L}^{\gamma }(\mathcal X ):=\{f:\mathcal X \rightarrow \mathbb R ^L:f^{(j)}\in \mathcal C ^\gamma _M(\mathcal X ), j=1,{\ldots },L\}\). Finally, for any \(\eta >0\), let \(\mathcal S _{0}^{\eta }:=\{s\in \mathcal S :\inf _{s^{\prime }\in \mathcal S _{0}}\Vert s-s^{\prime }\Vert _{W}<\eta \}\).

The next assumption places conditions on the parameter spaces \(\Theta \) and \(\mathcal S \). We let \(\mathrm{int}(\Theta )\) denote the interior of \(\Theta \).

Assumption 3.2

(i) \(\Theta \) is compact; (ii) \(\mathcal S \) is a compact convex set with nonempty interior; (iii) there exists \(\gamma >k/2\) such that \(\mathcal S \subseteq \mathcal C _{M,L}^{\gamma }(\mathcal X )\); (iv) \(\mathcal R _{\Theta }\) is a convex subset of \(\mathcal S \); (v) \(\Theta _{*}\subseteq \mathrm{int}(\Theta )\).

Assumption 3.2 (i) is standard in the literature on extremum estimation and also ensures the compactness of the pseudo-true identified set. Assumption 3.2 (iii) imposes a smoothness requirement on each component of \(s\in \mathcal S \). Together with Assumption 3.2 (ii), this implies that \(\mathcal S \) is compact under the uniform norm, which will also be used for establishing the Hausdorff consistency of \(\hat{\mathcal S }_{n}\) in the following section. For the Hausdorff consistency of \(\hat{\Theta }_{n}\), the requirement \(\gamma >k/2\) can be relaxed to \(\gamma >0\), and it also suffices that the smoothness requirement holds for functions in neighborhoods of \(\mathcal S _{0}\). The stronger requirement given here, however, will be useful for deriving the rates of convergence of \(\hat{\Theta }_{n}\) and \(\hat{\mathcal S }_{n}\).

For ease of analysis, we assume below that the observations are from a sample of IID random vectors.

Assumption 3.3

The observations \(\{X_i,i=1,{\ldots },n\}\) are independently and identically distributed.

The following two assumptions impose regularity conditions on \(r_\theta \).

Assumption 3.4

(i) \(r(x,\cdot )\) is twice continuously differentiable on the interior of \(\Theta \), \(\mathrm{a.e.}-P_{0}\), and there exists a measurable bounded function \(C:\mathcal X \rightarrow \mathbb R \) such that for any \(j\), \(x\), and \(|\alpha |\le 2\), \(|D_\theta ^{\alpha }r_{\theta }^{(j)}(x)-D_\theta ^{\alpha }r_{\theta ^{\prime }}^{(j)}(x)|\le C(x)\Vert \theta -\theta ^{\prime }\Vert \); (ii) there exists a measurable bounded function \(R:\mathcal X \rightarrow \mathbb R \) such that

$$\begin{aligned} \max _{\begin{matrix} j=1,{\ldots },L \\ |\alpha |\le 2 \end{matrix}}~ \sup _{\theta \in \Theta } \big |D^\alpha _\theta r^{(j)}_\theta (x)\big |\le R(x). \end{aligned}$$

For each \(x\), let \(\nabla _{\theta }r_{\theta }(x)\) be the \(L\times p\) matrix whose \(j\)th row is the gradient vector of \(r_{\theta }^{(j)}\) with respect to \(\theta \). For each \(x\in \mathcal X \) and \(i,j\in \{1,{\ldots },p\}\), let \(\partial ^{2}/\partial \theta _{i}\partial \theta _{j}\,r_{\theta }(x)\) be the \(L\times 1\) vector whose \(k\)th component is \(\partial ^{2}/\partial \theta _{i}\partial \theta _{j}\,r^{(k)}_{\theta }(x)\). For each \(\theta \in \Theta \), \(s\in \mathcal S \), and \(x\in \mathcal X \), let \(H_{W}(\theta ,s,x)\) be the \(p\times p\) matrix whose \((i,j)\)th component is given by

$$\begin{aligned} H_{W}^{(i,j)}(\theta ,s,x)=2\left(\frac{\partial ^{2}}{\partial \theta _{i}\partial \theta _{j}}r_{\theta }(x)\right)^{\prime }W(r_{\theta }(x)-s(x)). \end{aligned}$$
(19)

Let \(\eta >0\). For each \(s\in \mathcal S _0^{\eta }\) and \(\epsilon >0\), let \(V^\epsilon (s)\) be the neighborhood of \(\theta _{*}(s):=\Pi _{\mathcal R _{\Theta }}s\) defined by

$$\begin{aligned} V^\epsilon (s) :=\{\theta \in \Theta :\Vert \theta -\theta _{*}(s)\Vert \le \epsilon \}. \end{aligned}$$

Let \(\mathcal N _{\epsilon ,\eta }:=\{(\theta ,s):\theta \in V^\epsilon (s),s\in \mathcal S _0^{\eta }\}\) be the graph of the correspondence \(V^\epsilon \) on \(\mathcal S _0^{\eta }\).

Assumption 3.5

There exist \(\bar{\epsilon }>0\) and \(\bar{\eta }>0\) such that the Hessian matrix \(\nabla _\theta ^2Q(\theta ,s):=E[H_{W}(\theta ,s,X_{i})+2\nabla _{\theta }r_{\theta }(X_{i})^{\prime }W\nabla _{\theta }r_{\theta }(X_{i})]\) is positive definite uniformly over \(\mathcal N _{\bar{\epsilon },\bar{\eta }}\).

Assumption 3.4 imposes a smoothness requirement on \(r_\theta \) as a function of \(\theta \), enabling us to expand the first-order condition for minimization, as is standard in the literature. Assumption 3.5 requires the Hessian of \(Q(\theta ,s)\) with respect to \(\theta \) to be positive definite uniformly on a suitable neighborhood of \(\Theta _{*}\times \mathcal S _0\). For the consistency of \(\hat{\Theta }_n\), it suffices to assume that the Hessian is uniformly non-singular over \(\mathcal N _{\bar{\epsilon },\bar{\eta }}\), but the stronger condition given here will be useful to ensure a quadratic approximation of the criterion function, which is crucial for the \(\sqrt{n}\)-consistency of \(\hat{\Theta }_{n}\).

Further, we assume that \(\hat{\mathcal S }_{n}\) is consistent for \(\mathcal S _{0}\) in a suitable Hausdorff metric. Specifically, for subsets \(A,B\) of \( \mathcal S \), let

$$\begin{aligned} d_{H,W}(A,B):=\max \left\{ \sup _{a\in A}\inf _{b\in B}\Vert a-b\Vert _{W},\sup _{b\in B}\inf _{a\in A}\Vert a-b\Vert _{W}\right\} . \end{aligned}$$

Assumption 3.6

\(d_{H,W}(\hat{\mathcal S }_{n},\mathcal S _{0})=o_{p}(1)\).

Theorem 3.1 is our first main result, which establishes the consistency of the set estimator defined in (17) with \(c_n\) set to 0. This result is established by extending the standard consistency proof for extremum estimators to the current setting. Note that, under Assumption 3.2 (iv), the projection \(\theta _{*}(s):=\Pi _{\mathcal R _\Theta }s\) of each point \(s\in \mathcal S \) onto \(\mathcal R _\Theta \) exists and is uniquely determined. In other words, for each \(s\in \mathcal S \), \(\theta _{*}(s)\) is point identified. By setting \(c_n=0\), the set estimator is then asymptotically equivalent to the collection of minimizers \(\hat{\theta }_n (s):={\text{ argmin}}_{\theta \in \Theta }Q_n(\theta ,s)\) of the sample criterion function. The main challenge for establishing Hausdorff consistency is to show that \(\hat{\theta }_n(s)-\theta _{*}(s)\) vanishes in probability uniformly over a sufficiently large neighborhood of \(\mathcal S _0\). The proof of the theorem in the appendix formally establishes this and gives the desired result.

Theorem 3.1

Suppose Assumptions 2.1–2.4 and 3.1–3.6 hold. Let \(\hat{\Theta }_{n}\) be defined as in (17) with \(c_{n}=0\) for all \(n\). Then \(d_{H}(\hat{\Theta }_{n},\Theta _{*})=o_{p}(1)\).

The result of Theorem 3.1 is similar to that of Theorem 3.2 in Chernozhukov et al. (2007), who establish the Hausdorff consistency of a level-set estimator with \(c_n=0\) when \(Q_n\) degenerates on a neighborhood of the identified set. When Assumption 3.2 (iv) fails to hold, this estimator may not be consistent. We conjecture, however, that it would be possible to construct a Hausdorff consistent estimator of \(\Theta _{*}\) even in such a setting by choosing a positive sequence \(\{c_n\}\) of levels that tends to 0 as \(n\rightarrow \infty \) and by exploiting the fact that \(\hat{\mathcal S }_n\) converges to \(\mathcal S _0\) in a suitable Hausdorff metric. In fact, Kaido and White (2010) establish the Hausdorff consistency of their two-stage set estimator using this argument, but in their analysis, the first-stage parameter (\(s\) in our setting) must be finite dimensional. Extending Theorem 3.1 to allow non-convex parametric classes is of definite interest, but to keep our tight focus here, we leave this for future work.

3.2 The Rate of Convergence

Theorem 3.1 uses the fact that \(d_{H}(\hat{\Theta }_{n},\Theta _{*})\) can be bounded by \(d_{H,W}(\hat{\mathcal S }_{n},\mathcal S _{0})\). Although \(\hat{\mathcal S }_{n}\) does not converge at a parametric rate generally, the convergence rate of \(\hat{\Theta }_{n}\) can be improved when \(\hat{\mathcal S }_{n}\) converges to \(\mathcal S _{0}\) at a rate \(o_{p}(n^{-1/4})\). This is analogous to the results obtained for the point identified case; see, for example, Newey (1994), Ai and Chen (2003), and Ichimura and Lee (2010).

Assumption 3.7

\(d_{H,W}(\hat{\mathcal S }_{n},\mathcal S _{0})=o_{p}(n^{-1/4})\).

Theorem 3.2

Suppose the conditions of Theorem 3.1 hold. Suppose in addition Assumption 3.7 holds. Let \(\hat{\Theta }_{n}\) be defined as in (17) with \(c_n=0\) for all \(n\). Then, \(d_H(\hat{\Theta }_{n},\Theta _{*})=O_p(n^{-1/2})\).

Setting \(c_n\) to 0 is crucial for achieving the \(O_p(n^{-1/2})\) rate. We note that Theorem 3.2 builds on Lemma A.2 in the appendix, which establishes the convergence rate (in the directed Hausdorff distance) of \(\hat{\Theta }_{n}\) in (17) with a possibly nonzero level \(c_n\). This lemma does not require Assumption 3.2 (iv) but assumes the Hausdorff consistency of \(\hat{\Theta }_{n}\) as a high-level condition. This is why Theorem 3.2 is stated for \(\hat{\Theta }_{n}\) with \(c_n=0\). As previously discussed, however, if Theorem 3.1 is extended to allow non-convex parametric classes, this lemma can be used to characterize the estimator’s convergence rate in a more general setting.

3.3 The First-Stage Estimator

This section discusses how to construct a first-stage set estimator. A challenge is that the object of interest \(\mathcal S _{0}\) is a subset of an infinite-dimensional space, which requires us to use a nonparametric estimation technique. This type of estimation problem was recently analyzed by Santos (2011), who studies estimation of linear functionals of function-valued parameters in nonparametric instrumental variable problems. We rely on his results on consistency and the rate of convergence, which extend the analysis of Chernozhukov et al. (2007) to a nonparametric setting. Specifically, for each \(s\in \mathcal S \), let

$$\begin{aligned} \mathcal Q _{n}(s):=\sum _{j=1}^{\ell }\Big (\frac{1}{n}\sum _{i=1}^{n}\varphi ^{(j)} (X_{i},s)\Big )_{+}^{2}. \end{aligned}$$
(20)

This is a sample criterion function defined on \(\mathcal S \). For instance, \({\mathcal Q }_{n}\) for Example 2.1 is given by

$$\begin{aligned} \mathcal Q _{n}(s)=\sum _{j=1}^{K}\Big (\frac{1}{n}\sum _{i=1}^{n}(Y_{L,i}-s(Z_i))1_{A_j}(Z_i)\Big )_{+}^{2}+\sum _{j=1}^{K}\Big (\frac{1}{n}\sum _{i=1}^{n}(s(Z_i)-Y_{U,i})1_{A_j}(Z_i)\Big )_{+}^{2}. \end{aligned}$$

Our first-stage set estimator is a level set of \(\mathcal Q _{n}\) over a sieve \(\mathcal S _{n}\subseteq \mathcal S \). Given sequences \(\{a_{n}\}\) and \(\{b_{n}\}\) of positive constants, define

$$\begin{aligned} \hat{\mathcal S }_{n}:=\{s\in \mathcal S _{n}:\mathcal Q _{n}(s)\le b_{n}/a_{n}\}. \end{aligned}$$
(21)
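
A brute-force sketch of (20)–(21) for Example 2.1 follows. Representing the sieve by a finite basis with coefficients on a finite grid is an illustrative computational device (Corollary 3.1 below uses quadratic splines with \(J_n\) knots); all names here are our own.

```python
import numpy as np
from itertools import product

def S_hat(y_l, y_u, z, cells, basis, coef_grid, a_n, b_n):
    """First-stage level-set estimator (21) for Example 2.1, using the
    criterion (20). The sieve S_n is spanned by `basis` functions, each
    mapping the (n, d) regressor array to an (n,) array. Returns the
    retained coefficient vectors, each indexing an element of S_hat_n."""
    B = np.column_stack([b(z) for b in basis])          # (n, J) sieve design
    kept = []
    for beta in product(*coef_grid):
        s_vals = B @ np.asarray(beta)                   # s(Z_i) for s in S_n
        moments = []                                    # sample moments in (20)
        for in_A in cells:
            w = in_A(z)                                 # 1{Z_i in A_j}
            moments += [np.mean((y_l - s_vals) * w),
                        np.mean((s_vals - y_u) * w)]
        Qn = np.sum(np.maximum(moments, 0.0) ** 2)      # criterion (20)
        if Qn <= b_n / a_n:                             # level set (21)
            kept.append(np.asarray(beta))
    return kept
```

The retained coefficient vectors can then be mapped to first-stage values \(s(X_i)\) and passed to the second-stage sketch in Sect. 3.1.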

We add regularity conditions on \(\varphi \), \(\{\mathcal S _{n}\}\), and \(\{(a_{n},b_{n})\}\) to ensure the Hausdorff consistency of \(\hat{\mathcal S }_{n}\) and derive its convergence rate. The following two assumptions impose smoothness requirements on the map \(\varphi \).

Assumption 3.8

For each \(j\), there is a function \(B_{j}:\mathcal X \rightarrow \mathbb R _{+}\) such that

$$\begin{aligned} |\varphi ^{(j)}(x,s)-\varphi ^{(j)}(x,s^{\prime })|\le B_{j}(x)\rho (s,s^{\prime }),\quad \forall s,s^{\prime }\in \mathcal S , \end{aligned}$$

where \(\rho (s,s^{\prime }):=\sup _{x\in \mathcal X }\max _{j=1,{\ldots },L}|s^{(j)}(x)-s^{\prime (j)}(x)|\).

For each \(s\in \mathcal S \), let \(\mathcal I (s):=\{j\in \{1,{\ldots },\ell \}:E[\varphi ^{(j)}(X_{i},s)]>0\}\); \(\mathcal I (s)\) is the set of indexes whose associated moments violate the inequality restrictions. For each \(j\), define \(\bar{\varphi }^{(j)}:\mathcal S \rightarrow \mathbb R \) by \(\bar{\varphi }^{(j)}(s):=E[\varphi ^{(j)}(X_{i},s)]\).

Assumption 3.9

(i) For each \(j\), \(\bar{\varphi }^{(j)}:\mathcal S \rightarrow \mathbb R \) is continuously Fréchet differentiable with Fréchet derivative \(\dot{\varphi }_{s}^{(j)}\) at \(s\), and for each \(s\in \mathcal S \), the operator norm \(\Vert \dot{\varphi }_{s}^{(j)}\Vert _{op}\) is bounded away from 0 for some \(j\in \{1,{\ldots } ,\ell \}\); (ii) for each \(s\notin \mathcal S _{0}\), there exist \(j\in \mathcal I (s)\) and \(C_{j}>0\) such that \(E[\varphi ^{(j)}(X_{i},s)]\ge C_{j}\Vert s-s_{0}\Vert _{W}\) for some \(s_{0}\in \mathcal S _{0}\).

We also add regularity conditions on \(\mathcal S _{n}\), which can be satisfied by commonly used sieves including polynomials, splines, wavelets, and certain artificial neural network sieves.

Assumption 3.10

(i) For each \(n\), \(\mathcal S _{n}\subseteq \mathcal S \), and both \(\mathcal S _{n}\) and \(\mathcal S \) are closed with respect to \(\rho \); (ii) for each \(s\in \mathcal S \), there is \(\Pi _{n}s\in \mathcal S _{n}\) such that \(\sup _{s\in \mathcal S }\Vert s-\Pi _{n}s\Vert _{W}=O(\delta _{n})\) for some sequence \(\{\delta _{n}\}\) of non-negative constants with \(\delta _{n}\rightarrow 0\).

Theorem 3.3

Suppose Assumptions 2.1–2.3, 3.2 (i)–(iii), 3.3, 3.8, 3.9 (i), and 3.10 hold. Let \(a_{n}=O(\max \{n^{-1},\delta _{n}^{2}\}^{-1})\) and \(b_{n}\rightarrow \infty \) with \(b_{n}=o(a_{n})\). Then

$$\begin{aligned} d_{H,W}(\hat{\mathcal S }_{n},\mathcal S _{0})=o_{p}(1). \end{aligned}$$

In addition, suppose that Assumption 3.9 (ii) holds. Then

$$\begin{aligned} d_{H,W}(\hat{\mathcal S }_{n},\mathcal S _{0})=O_{p}\big (\sqrt{b_{n}/a_{n}}\big ). \end{aligned}$$

Theorem 3.3 can be used to establish Assumptions 3.6 and 3.7, which are imposed in Theorems 3.1 and 3.2. As the following corollary shows, these conditions are satisfied in Example 2.1 with a single regressor.

In what follows, for any two sequences of positive constants \(\{c_{n}\},\) \(\{d_{n}\}\), let \(c_{n}\asymp d_{n}\) mean there exist constants \(0<C_{1}<C_{2}<\infty \) such that \(C_{1}\le |c_{n}/d_{n}|\le C_{2}\) for all \(n\).

Corollary 3.1

In Example 2.1, suppose that \(\mathcal Z \) is a compact convex subset of the real line and \(r_{\theta }(z)=\theta ^{(1)}+\theta ^{(2)}z\), where \(\theta \in \Theta \subseteq \mathbb R ^{2}\). Suppose that \(\Theta \) is compact and convex. Suppose further that \(\{(Y_{L,i},Y_{U,i},Z_{i})\}_{i=1,{\ldots },n}\) is a random sample from \(P_{0}\) and that \(P_{0}(Z\in A_{k})>0\) for all \(k\) and \(Var(Z)>0\). Let \(\mathcal S :=\{s\in L_{\mathcal Z }^{2}:\Vert s\Vert _{\infty }\le M,|s(z)-s(z^{\prime })|\le M|z-z^{\prime }|,\forall z,z^{\prime }\in \mathcal Z \}\) for some \(M>0\). Let \(\{r_{q}(\cdot )\}_{q=1}^{J_{n}}\) be splines of order two with \(J_{n}\) knots on \(\mathcal Z \). Define \(\mathcal S _{n}:=\{s:s(z)=\sum _{q=1}^{J_{n}}\beta _{q}r_{q}(z)\}\) with \(J_{n}\asymp n^{c_{1}},c_{1}>1/3\). Let \(\hat{\mathcal S }_{n}\) be defined as in (21) with \(a_{n}\asymp n^{c_{2}}\), where \(2/3<c_{2}<1\), and \(b_{n}\asymp \ln n\). Then: (i) \(\hat{\mathcal S }_{n}\) is (Effros-) measurable; (ii) \(d_{H,W}(\hat{\mathcal S }_{n},\mathcal S _{0})=o_{p}(1)\); (iii) \(d_{H,W}(\hat{\mathcal S }_{n},\mathcal S _{0})=o_{p}(n^{-1/4})\).

Given these results, we further show that the estimator of the pseudo-true identified set is consistent and converges at an \(n^{-1/2}\)-rate.

Corollary 3.2

Suppose that the conditions of Corollary 3.1 hold. Let \(Q\) be defined as in (15) with \(W=1\). Let \(Q_{n}\) be defined as in (16) and \(\hat{\Theta }_{n}\) be defined as in (17) with \(c_{n}=0\) and \(\hat{\mathcal S }_{n}\) as in Corollary 3.1. Then \(d_{H}(\hat{\Theta }_{n},\Theta _{*})=O_{p}(n^{-1/2})\).

4 Concluding Remarks

Moment inequalities are widely used to estimate discrete choice problems and structures that involve censored variables. In many empirical applications, potentially misspecified parametric models are used to estimate such structures. This chapter studies a novel estimation procedure that is robust to misspecification of moment inequalities. To overcome the challenge that the conventional identified set may be empty under misspecification, we defined a pseudo-true identified set as the least squares projection of the set of functions at which the moment inequalities are satisfied. This set is nonempty under mild assumptions. We also proposed a two-stage set estimator for estimating the pseudo-true identified set. Our estimator first estimates the identified set of function-valued parameters by a level-set estimator over a suitable sieve. The pseudo-true identified set can then be estimated by projecting the first-stage estimator onto a finite-dimensional parameter space. We give conditions under which the estimator is consistent for the pseudo-true identified set in the Hausdorff metric and converges at rate \(O_p(n^{-1/2})\).

Developing inference procedures based on the proposed estimator would be interesting future work. Another interesting extension would be to study the optimal choice of the weighting matrix. In this chapter, we maintained the assumption that \(W\) is fixed and does not depend on \((\theta ,s)\). Given the form of the criterion function, the most natural choice of \(W\) would be the inverse of the variance-covariance matrix of \(s(X_i)-r_{\theta }(X_i)\). This matrix is generally unknown but can be consistently estimated by its sample analog: \( \hat{W}_n(\theta ,s):=(\frac{1}{n}\sum _{i=1}^n(s(X_i)-r_{\theta }(X_i)) (s(X_i)-r_{\theta }(X_i))^{\prime })^{-1}.\) Defining a sample criterion function using \(\hat{W}_n(\theta ,s)\) as a weighting matrix would lead to a three-step procedure, which may result in more efficient estimation of \(\Theta _{*}\). Yet another interesting direction would be to develop a specification test for moment inequality models based on the current framework. This would extend the results of Guggenberger et al. (2008), who study a procedure for testing the nonemptiness of the identified set.