Abstract
This chapter studies partially identified structures defined by a finite number of moment inequalities. When the moment function is misspecified, it becomes difficult to interpret the conventional identified set; even more seriously, that set may be empty. We define a pseudo-true identified set whose elements can be interpreted as the least-squares projections of the moment functions that are observationally equivalent to the true moment function. We then construct a set estimator for the pseudo-true identified set and establish its \(O_{p}(n^{-1/2})\) rate of convergence.
1 Introduction
This chapter develops a new approach to estimating structures defined by moment inequalities. Moment inequalities often arise as optimality conditions in discrete choice problems or in structures where economic variables are subject to some type of censoring. Typically, parametric models are used to estimate such structures. For example, in their analysis of an entry game in the airline markets, Ciliberto and Tamer (2009) use a linear specification for airlines’ profit functions and assume that unobserved heterogeneity in the profit functions can be captured by independent normal random variables. In asset pricing theory with short sales prohibited, Luttmer (1996) specifies the functional form of the pricing kernel as a power function of consumption growth, based on the assumption that the investor’s utility function is additively separable and isoelastic.
Any conclusions drawn from such methods rely on the validity of the model specification. Although commonly used estimation and inference methods for moment inequality models are robust to potential lack of identification, typically they are not robust to misspecification. Compared to cases where the parameter of interest is point identified, much less is known about the consequences of misspecified moment inequalities. As we will discuss, these can be serious. In general, misspecification makes it hard to interpret the estimated set of parameter values; an even more serious possibility is that the identified set could be an empty set. If the identified set is empty, every nonempty estimator sequence is inconsistent. Furthermore, it is often hard to see if the estimator is converging to some object that can be given any meaningful interpretation. An exception is the estimation method developed by Ponomareva and Tamer (2010), which focuses on estimating a regression function with interval censored outcome variables.
This chapter develops a new estimation method that is robust to potential parametric misspecification in general moment inequality models. Our contributions are three-fold. First, we define a pseudo-true identified set that is nonempty under mild assumptions and that can be interpreted as the projection of the set of function-valued parameters identified by the moment inequalities. Second, we construct a set estimator using a two-stage estimation procedure, and we show that the estimator is consistent for the pseudo-true identified set in Hausdorff metric. Third, we give conditions under which the proposed estimator converges to the pseudo-true identified set at the \(n^{-1/2}\)-rate.
The first stage is a nonparametric estimator of the true moment function. Given this, why perform a parametric second-stage estimation? After all, the nonparametric first stage estimates the same object of interest, without the possibility of parametric misspecification. There are a variety of reasons a researcher may nevertheless prefer to implement the parametric second stage: first is the undeniably appealing interpretability of the parametric specification; second is the much more precise estimation and inference afforded by using a parametric specification; and third, the second term of the second-stage objective function may offer a potentially useful model specification diagnostic. Future research may permit deriving the asymptotic distribution of this term under the null of correct parametric specification to provide a formal test. The two-stage procedure proposed here delivers these benefits, while avoiding the more serious adverse consequences of potential misspecification.
The chapter is organized as follows. Section 2 describes the data generating process and gives examples that fall within the scope of this chapter; it also introduces our definition of the pseudo-true identified set. Section 3 defines our estimator and presents our main results. We conclude in Sect. 4. We collect all proofs into the appendix.
2 The Data Generating Process and the Model
Our first assumption describes the data generating process (DGP).
Assumption 2.1
Let \((\Omega ,\mathfrak F ,\mathbb P _{0})\) be a complete probability space. Let \(k,\ell \in \mathbb N \). Let \(X:\Omega \rightarrow \mathbb R ^{k}\) be a Borel measurable map, let \(\mathcal X \subseteq \mathbb R ^{k}\) be the support of \(X\), and let \(P_{0}\) be the probability measure induced by \(X\) on \(\mathcal X \). Let \(\rho _{0}:\mathcal X \rightarrow \mathbb R ^{\ell }\) be an unknown measurable function such that \(E[\rho _{0}(X)]\) exists and
$$E[\rho _{0}(X)]\le 0, \qquad \qquad (1)$$
where the expectation is taken with respect to \(P_{0}\).
In what follows, we call \(\rho _{0}\) the true moment function. The moment inequalities (1) often arise as an optimality condition in game-theoretic models (Bajari et al. 2007; Ciliberto and Tamer 2009) or models that involve variables that are subject to some kind of censoring (Manski and Tamer 2002). In empirical studies of such models, it is common to specify a parametric model for \(\rho _{0}\).
Assumption 2.2
Let \(p\in \mathbb N \) and let \(\Theta \) be a subset of \(\mathbb R ^{p}\) with nonempty interior. Let \(m:\mathcal X \times \Theta \rightarrow \mathbb R ^{\ell }\) be such that \(m(\cdot ,\theta )\) is measurable for each \(\theta \in \Theta \) and \(m(x,\cdot )\) is continuous on \(\Theta ,\) \(\mathrm{a.e.}-P_{0}\). For each \(\theta \in \Theta \), \(m(\cdot ,\theta )\in L_{\ell }^{2}:=\{f:\mathcal X \rightarrow \mathbb R ^{\ell }:E[f(X)^{\prime }f(X)]<\infty \}.\)
Throughout, we call \(m(\cdot ,\cdot )\) the parametric moment function.
Definition 2.1
Let \(m_{\theta }(\cdot ):=m(\cdot ,\theta )\). Define \(\mathcal M _{\Theta }:=\{m_{\theta }\in L_{\ell }^{2}:\theta \in \Theta \}.\) \(\mathcal M _{\Theta }\) is correctly specified (\(-P_{0}\)) if there exists \(\theta _{0}\in \Theta \) such that
$$m_{\theta _{0}}=\rho _{0},\quad \mathrm{a.e.}-P_{0}.$$
Otherwise, the model is misspecified.
If the model is correctly specified, we may define the set of parameter values that can be identified by the inequalities in (1):
$$\Theta _{I}:=\{\theta \in \Theta :E[m(X,\theta )]\le 0\}.$$
We call \(\Theta _{I}\) the conventional identified set. This set collects all parameter values that yield parametric moment functions that are observationally equivalent to \(\rho _{0}\).
It becomes difficult to interpret \(\Theta _{I}\) when the model is misspecified, as pointed out by Ponomareva and Tamer (2010) for a regression model with an interval-valued outcome variable. Suppose first that the model is misspecified but \(\Theta _{I}\) is nonempty. The set is still a collection of parameter values that are observationally equivalent to each other, but since there is no \(\theta \) in \(\Theta _{I}\) that corresponds to the true moment function, further structure is required to unambiguously interpret \(\Theta _{I}\) as a collection of “pseudo-true parameter(s)”. Further, \(\Theta _{I}\) may be empty, especially if \(\mathcal M _{\Theta }\) is a small class of functions. This makes the interpretation of \(\Theta _{I}\) even more difficult. In fact, interpretation is impossible, as there is nothing to interpret.
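To see how \(\Theta _{I}\) can be empty while a projection-based set remains well defined, consider a minimal numerical sketch (all numbers are hypothetical, in the spirit of the interval-censored Example 2.1 below): a scalar regressor \(Z\in \{1,2\}\), conditional outcome bounds \([0,1]\) and \([3,3.5]\), and the linear class \(r_{\theta }(z)=\theta z\). No single \(\theta \) lies between the bounds in both cells, so the conventional identified set is empty; yet every function between the bounds has a well-defined least-squares projection onto the linear class, and these projections form a nonempty interval.

```python
import numpy as np

# Hypothetical DGP: Z uniform on {1, 2}; conditional bounds on the outcome.
lo = {1.0: 0.0, 2.0: 3.0}   # E[Y_L | Z = z]
up = {1.0: 1.0, 2.0: 3.5}   # E[Y_U | Z = z]
zs = [1.0, 2.0]

# Conventional identified set: theta with lo(z) <= theta*z <= up(z) in each cell.
grid = np.linspace(-5.0, 5.0, 20001)
Theta_I = [t for t in grid if all(lo[z] <= t * z <= up[z] for z in zs)]
Theta_I_empty = (len(Theta_I) == 0)   # cells demand theta in [0,1] AND [1.5,1.75]

# Pseudo-true set: project each s between the bounds onto {theta * z} in L2(P_0),
# with P_0(Z=z) = 1/2 and scalar W (which drops out):
# theta_*(s) = E[Z s(Z)] / E[Z^2] = (1*s(1) + 2*s(2)) / (1 + 4), increasing in s.
def theta_star(s1, s2):
    return (1.0 * s1 + 2.0 * s2) / 5.0

theta_min = theta_star(lo[1.0], lo[2.0])   # projection of the lowest s
theta_max = theta_star(up[1.0], up[2.0])   # projection of the highest s
```

Here the projection interval is \([1.2,1.6]\) even though the conventional set is empty, illustrating why the pseudo-true set is the more robust target.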
Often, the economics of a given problem impose further structure on the DGP. To specify this, we let \(0<L\le \ell ,\) and for measurable \(s:\mathcal X \rightarrow \mathbb R ^{L}\), let \(\Vert s\Vert _{L}:=E[s(X)^{\prime }s(X)]^{1/2}\). Let \(L_{L}^{2}:=\{s:\mathcal X \rightarrow \mathbb R ^{L},\Vert s\Vert _{L}<\infty \}\), and let \(\mathcal S \subseteq L_{L}^{2}\).
Assumption 2.3
There exists \(\varphi :{\mathcal X }\times \mathcal S \rightarrow \mathbb R ^{\ell }\) such that for each \(x\in \mathcal X \), \( \varphi (x,\cdot )\) is continuous on \(\mathcal S \) and for each \(s\in \mathcal S \), \(\varphi (\cdot ,s)\) is measurable. Further, there exists \( s_{0}\in \mathcal S \) such that
$$\rho _{0}(x)=\varphi (x,s_{0})\quad \text{for each }x\in \mathcal X .$$
When \(\rho _{0}\in L_{\ell }^{2}\) and there is no further structure on \(\rho _{0}\) available, we let \(L=\ell ,\) \(\mathcal S =L_{\ell }^{2},\) and take \(\varphi \) to be the evaluation functional \(e:\mathcal X \times \mathcal S \rightarrow \mathbb R ^{\ell }\):
$$e(x,s):=s(x),$$
as then \(\varphi (x,\rho _{0})=e(x,\rho _{0})\equiv \rho _{0}(x)\) and \(s_{0}=\rho _{0}.\) In this case, it is not necessary to explicitly introduce \(\varphi \). Often, however, further structure on the form of \(\rho _{0}\) is available. Typically, this is reflected in \(s\) depending non-trivially only on a strict subvector of \(X,\) say \(X_{1}.\) In such cases, we may write \(\mathcal S \subseteq L_{\mathcal X _{1}}^{2}\) for clarity. We give several examples below.
When Assumption 2.3 holds, we typically parametrize the unknown function \(s_{0}\). For example, it is common to specify \(s_{0}\) as a linear function of some of the components of \(x\). As we will see in the examples, a common modeling assumption is
Assumption 2.4
There exists \(r:\mathcal X \times \Theta \rightarrow \mathbb R ^{L}\) such that with \(r_{\theta }:=r(\cdot ,\theta )\),
$$m(x,\theta )=\varphi (x,r_{\theta })\quad \text{for each }(x,\theta )\in \mathcal X \times \Theta .$$
Thus, misspecification occurs when there is no \(\theta _{0}\) in \(\Theta \) such that \(s_{0}=r_{\theta _{0}}.\)
More generally, misspecification can occur because the researcher mistakenly imposes Assumption 2.3, in which case \(s_{0}\) fails to exist and there is again no \(\theta _{0}\) in \(\Theta \) such that \(\rho _{0}(x)=\varphi (x,r_{\theta _{0}}).\) As \(s_{0}\) is an element of an infinite-dimensional space, we may refer to this as “nonparametric” misspecification. To proceed, we assume that, as is often plausible, the researcher is sufficiently able to specify the structure of interest that nonparametric misspecification is not an issue, either because correct \(\varphi \) restrictions are imposed or no \(\varphi \) restrictions are imposed. We thus focus on the case of parametric misspecification, where \(s_{0}\) exists but there is no \(\theta _{0}\) in \(\Theta \) such that \(s_{0}=r_{\theta _{0}}.\)
2.1 Examples
In this section, we present several motivating examples and also give commonly used parametric specifications in these examples. For any vector \(x\), we use \(x^{(j)}\) to denote the \(j\)th component of the vector. Similarly, for a vector valued function \(f(x)\), we use \(f^{(j)}(x)\) to denote the \(j\)th component of \(f(x)\).
Example 2.1
(Interval censored outcome) Let \(Z:\Omega \rightarrow \mathbb R ^{d_{Z}}\) be a regressor with support \(\mathcal Z \). Let \(Y:\Omega \rightarrow \mathbb R \) be an outcome variable that is generated as:
$$Y=s_{0}(Z)+\epsilon ,$$
where \(s_{0}\in \mathcal S :=L_{\mathcal Z }^{2},\) say, and \(\epsilon \) satisfies \(E[\epsilon |Z]=0\). We let \(\mathcal Y \) denote the support of \(Y\). Suppose \(Y\) is unobservable, but there exist \((Y_{L},Y_{U})^{\prime }:\Omega \rightarrow \mathcal Y \times \mathcal Y \) such that \(Y_{L}\le Y\le Y_{U}\) almost surely. Then, \((Y_{L},Y_{U},Z)^{\prime }\) satisfies the following inequalities almost surely:
$$E[Y_{L}|Z]\le s_{0}(Z), \qquad \qquad (3)$$
$$s_{0}(Z)\le E[Y_{U}|Z]. \qquad \qquad (4)$$
Let \(x=(y_{L},y_{U},z)^{\prime }\in \mathcal X :=\mathcal Y \times \mathcal Y \times \mathcal Z \). Given a collection \(\{A_{1},{\ldots } ,A_{K}\}\) of Borel subsets of \(\mathcal Z \), the inequalities in (3), (4) imply that the moment inequalities in (1) hold with
$$\rho _{0}(x)=\varphi (x,s_{0}),\qquad \varphi (x,s):=\big ((y_{L}-s(z))1_{A}(z)^{\prime },\ (s(z)-y_{U})1_{A}(z)^{\prime }\big )^{\prime },$$
where \(1_{A}(z):=(1\{z\in A_{1}\},{\ldots } ,1\{z\in A_{K}\})^{\prime }\).Footnote 1 For each \(x\in \mathcal X \) and \(s\in \mathcal S \), the functional \(\varphi \) evaluates the vertical distances of \(s(z)\) from \(y_{L}\) and \(y_{U}\) and multiplies them by the indicator function evaluated at \(z\). The additional information on \(\rho _{0}\) available in this example is that the moment functions are based on these vertical distances.
A common specification for \(s_{0}\) is \(s_{0}(z)=r_{\theta _{0}}(z)=z^{\prime }\theta _{0}\) for some \(\theta _{0}\in \Theta \subseteq \mathbb R ^{d_{Z}}\). The parametric moment function is then given for each \(x\in \mathcal X \) by \(m(x,\theta )=\varphi (x,r_{\theta })\). Therefore, this example satisfies Assumption 2.4.
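A sample analogue of \(E[m(X,\theta )]\) for this example can be coded directly. The sketch below stacks the lower and upper moments over \(K\) cells; the cells, simulated data, and function names are illustrative assumptions, not from the chapter:

```python
import numpy as np

def sample_moments(theta, y_lo, y_up, z, cells):
    """Sample analogue of E[m(X, theta)] for the interval-censored example,
    with r_theta(z) = z'theta and moments (y_L - r_theta(z)) 1_A(z) and
    (r_theta(z) - y_U) 1_A(z) stacked over the K cells."""
    s = z @ theta                                        # r_theta(Z_i), shape (n,)
    ind = np.column_stack([cell(z) for cell in cells])   # 1_A(Z_i), shape (n, K)
    m_lower = (y_lo - s)[:, None] * ind                  # lower-bound moments
    m_upper = (s - y_up)[:, None] * ind                  # upper-bound moments
    return np.concatenate([m_lower, m_upper], axis=1).mean(axis=0)  # shape (2K,)

# Illustrative data: Y = Z + noise, brackets [Y - 0.5, Y + 0.5] around Y.
rng = np.random.default_rng(0)
z = rng.uniform(0.0, 2.0, size=(200, 1))
y = z[:, 0] + rng.normal(scale=0.1, size=200)
y_lo, y_up = y - 0.5, y + 0.5
cells = [lambda zz: (zz[:, 0] <= 1.0).astype(float),
         lambda zz: (zz[:, 0] > 1.0).astype(float)]
mbar = sample_moments(np.array([1.0]), y_lo, y_up, z, cells)
```

At the true slope, all \(2K\) sample moments are negative, consistent with the inequalities (1).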
Example 2.2
Tamer (2003) considers a simultaneous game of complete information. For each \(j=1,2\), let \(Z_{j}:\Omega \rightarrow \mathbb R ^{d_{Z}}\) and \(\epsilon _{j}:\Omega \rightarrow \mathbb R \) be firm \(j\)’s characteristics that are observable to the firms. The econometrician observes the \(Z\)’s but not the \(\epsilon \)’s. For each \(j\), let \(g_{j}:\mathcal Z \times \{0,1\}\rightarrow \mathbb R \). These functions are known to the firms but not to the econometrician. Suppose that each firm’s payoff is given by
$$\pi _{j}=Y_{j}\,\big (\epsilon _{j}-g_{j}(Z_{j},Y_{-j})\big ),\quad j=1,2,$$
where \(Y_{j}\in \mathcal Y :=\{0,1\}\) is firm \(j\)’s entry decision, and \( Y_{-j}\in \mathcal Y \) is the other firm’s entry decision. The econometrician observes these decisions. Given \((z_{1},z_{2})\), the firms’ payoffs can be summarized in Table 1.
Suppose the firms and the econometrician know that \(g(z,1)\ge g(z,0)\) for any value of \(z\). This means that, other things equal, the opponent’s entry would reduce the firm’s own profit. In this setting, there are several possible equilibrium outcomes depending on the realization of \((\epsilon _{1},\epsilon _{2})\). If \(\epsilon _{1}>g_{1}(z_{1},1)\) and \(\epsilon _{2}>g_{2}(z_{2},1)\), then \((1,1)\) is the unique Nash equilibrium (NE) outcome. Similarly, if \(\epsilon _{1}>g_{1}(z_{1},1)\) and \(\epsilon _{2}<g_{2}(z_{2},1)\), \((1,0)\) is the unique NE outcome, and if \(\epsilon _{1}<g_{1}(z_{1},1)\) and \(\epsilon _{2}>g_{2}(z_{2},1)\), \((0,1)\) is the unique NE outcome. Now, if \(\epsilon _{1}<g_{1}(z_{1},1)\) and \(\epsilon _{2}<g_{2}(z_{2},1)\), there are two Nash equilibria, and they give the outcomes \((1,0)\) and \((0,1)\). Let \(F_{j},j=1,2\) be the unknown CDFs of \(\epsilon _{1}\) and \(\epsilon _{2}\).Footnote 2 Without any assumptions on the equilibrium selection mechanism, the model predicts the following set of inequalities:
Let \(x:=(y_{1},y_{2},z_{1},z_{2})^{\prime }\in \mathcal X :=\mathcal Y \times \mathcal Y \times \mathcal Z \times \mathcal Z \). Let \(s_{0}\in \mathcal S :=\{s\in L_{\mathcal Z \times \mathcal Z }^{2}:s(z_{1},z_{2})\in [ 0,1]^{2},\forall (z_{1},z_{2})\in \mathcal Z \times \mathcal Z \}\) be defined by
Here, \(s_{0}^{(j)}(z_{1},z_{2})\) is the conditional probability that firm \(j\)’s profit upon entry is negative given \(z_{1}\) and \(z_{2}\). Given a collection \(\{A_{j},j=1,{\ldots } ,K\}\) of Borel subsets of \(\mathcal Z \times \mathcal Z \), let \(1_{A}(z):=(1\{(z_{1},z_{2})\in A_{1}\},{\ldots },1\{(z_{1},z_{2})\in A_{K}\})^{\prime }\). The inequalities (6)–(8) imply the moment inequalities in (1) hold with
The additional information on \(\rho _{0}\) is that it is based on the differences between some combinations of the conditional probabilities \(s_{0}(z_{1},z_{2})\) and indicators for specific events.
A common parametric specification for \(g_j\) is \(g_j(z_j,y_{-j})=z_{j}^{\prime }\gamma _{0}-y_{-j}\beta _{j,0}\) for some \(\beta _{j,0}\in B\subseteq \mathbb R _+\) and \(\gamma _0\in \Gamma \subseteq \mathbb R ^{d_Z}\). It is also common to assume that \(F_j,j=1,2\) belong to a known parametric class \(\{F(\cdot ;\alpha ),\alpha \in \mathcal A \}\) of distributions. Then the parametric moment function can be defined for each \(x\) by \(m(x,\theta ):=\varphi (x,r_\theta )\), where \(\theta :=(\alpha _1,\alpha _2,\beta _1,\beta _2,\gamma )^{\prime }\) and
This example also satisfies Assumption 2.4.
Example 2.3
(Discrete choice) Suppose an agent chooses \(Z\in \mathbb R ^{d_{Z}}\) from a set \(\mathcal Z :=\{z_{1},{\ldots } ,z_{K}\}\) in order to maximize her expected payoff \(E[s_{0}(Y,Z)\mid \mathcal I ]\), where \(Y\) is a vector of observable random variables, \(s_{0}\in \mathcal R :=L_{\mathcal Y \times \mathcal Z }^{2}\) is the payoff function, and \(\mathcal I \) is the agent’s information set. The optimality condition for the agent’s choice is given by:
$$E[s_{0}(Y,Z)\mid \mathcal I ]\ge E[s_{0}(Y,z)\mid \mathcal I ]\quad \text{for all }z\in \mathcal Z . \qquad \qquad (11)$$
Let \(x:=(y,z)^{\prime }\in \mathcal X :=\mathcal Y \times \mathcal Z \). The optimality conditions in (11) imply that the unconditional moment inequalities in (1) hold with
$$\rho _{0}(x)=\varphi (x,s_{0}),\qquad \varphi ^{(k)}(x,s):=s(y,z_{k})-s(y,z),\quad k=1,{\ldots } ,K,$$
For given \(y,\) the functional \(\varphi \) evaluates the profit differences between a given choice \(z\) (e.g., \(z_{1}\)) and every other possible choice. The additional information on \(\rho _{0}\) is that it is based on the profit differences.
A common specification for \(s_{0}\) is \(s_{0}(y,z)=r_{\theta _{0}}(y,z)=\psi (y,z;\alpha _{0})+z^{\prime }\beta _{0}+\epsilon _{z}\) for some known function \(\psi \), unknown \((\alpha _{0},\beta _{0})\in \Theta \subset \mathbb R ^{d_{\alpha }+d_{\beta }}\), and an unobservable choice-dependent error \(\epsilon _{z}\). For simplicity, we assume that \(\epsilon _{z}\) satisfies \(E[\epsilon _{z_{i}}-\epsilon _{z_{j}}\mid \mathcal I ]=0\) for any \(i,j\); see Pakes et al (2006) and Pakes (2010) for detailed discussions. The parametric moment function is then given for each \(x\in \mathcal X \) by \(m(x,\theta )=\varphi (x,r_{\theta })\). This example satisfies Assumption 2.4.
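To illustrate, a sample analogue of the profit-difference moments can be formed by comparing each alternative's payoff with the observed choice's payoff. The sketch below is a hypothetical instance: the functional form \(\psi (y,z;\alpha )=\alpha yz\), the two-point choice set, and all names are illustrative assumptions, and the choice-dependent errors are averaged out for simplicity:

```python
import numpy as np

def choice_moments(theta, y, z_obs, alternatives, psi):
    """Sample analogue of the profit-difference moments: for each alternative z_k,
    average payoff(y, z_k) - payoff(y, z_obs). At the true parameter these
    averages are <= 0, since the observed choice maximizes expected payoff."""
    alpha, beta = theta
    payoff = lambda yy, zz: psi(yy, zz, alpha) + zz * beta
    chosen = payoff(y, z_obs)
    return np.array([(payoff(y, zk) - chosen).mean() for zk in alternatives])

# Hypothetical data: psi(y, z; alpha) = alpha*y*z, choice set {0, 1}; agents
# choose z = 1 exactly when it yields the higher payoff (alpha0 = 1, beta0 = -1).
rng = np.random.default_rng(4)
y = rng.uniform(1.0, 2.0, size=300)
z_obs = (1.0 * y - 1.0 > 0).astype(float)   # optimal rule: z = 1 iff y > 1
psi = lambda yy, zz, a: a * yy * zz
mbar = choice_moments((1.0, -1.0), y, z_obs, [0.0, 1.0], psi)
```

Both sample moments are weakly negative at the true parameter, as the optimality condition requires.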
Example 2.4
(Pricing kernel) Let \(Z:\Omega \rightarrow \mathbb R ^{d_{Z}}\) be the payoffs of \(d_{Z}\) securities that are traded at a price of \(P\in \mathcal P \subseteq \mathbb R ^{d_{Z}}\). If short sales are not allowed for any securities, then the feasible set of portfolio weights is restricted to \(\mathbb R _{+}^{d_{Z}}\) and the standard Euler equation does not hold. Instead, the following Euler inequalities hold (see Luttmer 1996):
$$E[s_{0}(Y)Z-P]\le 0,$$
where \(Y:\Omega \rightarrow \mathcal Y \) is a state variable, e.g. consumption growth, and \(s_{0}\in \mathcal S :=\{s\in L_{\mathcal Y }^{2}:s(y)\ge 0,\forall y\in \mathcal Y \}\) is the pricing kernel function. The moment inequalities thus hold with the true moment function:
$$\rho _{0}(x)=\varphi (x,s_{0}),\qquad \varphi (x,s):=s(y)z-p,$$
where \(x:=(y,z,p)^{\prime }\in \mathcal Y \times \mathcal Z \times \mathcal P \). This function evaluates the pricing kernel \(s\) at \(y\) and computes a vector of pricing errors. The additional information on \(\rho _{0}\) is that it is based on the pricing errors.
A common specification for \(s_{0}\) is \(s_{0}(y)=r_{\theta _{0}}(y)=\beta _{0}y^{-\gamma _{0}}\), where \(\beta _{0}\in B\subseteq [ 0,1]\) is the investor’s subjective discount factor and \(\gamma _{0}\in \Gamma \subseteq \mathbb R _{+}\) is the relative risk aversion coefficient. Let \(\theta :=(\beta ,\gamma )^{\prime }\). The parametric moment function is then given for each \(x\in \mathcal X \) by \(m(x,\theta )=\varphi (x,r_{\theta })\), satisfying Assumption 2.4.
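Following the \(\le 0\) convention in (1), the sample pricing errors for the CRRA kernel can be computed as below. This is a sketch on simulated data; the return processes and all numbers are hypothetical assumptions:

```python
import numpy as np

def pricing_errors(theta, y, z, p):
    """Sample analogue of E[s(Y)Z - P] with the CRRA kernel s(y) = beta * y**(-gamma)."""
    beta, gamma = theta
    kernel = beta * y ** (-gamma)            # pricing kernel at each Y_i, shape (n,)
    return (kernel[:, None] * z - p).mean(axis=0)

# Hypothetical data: two assets, both priced at 1; payoffs are gross returns.
rng = np.random.default_rng(1)
y = rng.lognormal(mean=0.02, sigma=0.02, size=500)        # consumption growth
z = np.column_stack([np.full(500, 1.03),                   # riskless payoff
                     rng.lognormal(mean=0.05, sigma=0.15, size=500)])
p = np.ones(2)
errs = pricing_errors((0.98, 2.0), y, z, p)
```

Under no-short-sales, a kernel is consistent with the data whenever every component of `errs` is (weakly) negative in population.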
2.2 Projection
The inequality restrictions \(E[\varphi (X,s_{0})]\le 0\) may not uniquely identify \(s_{0}\). Define
$$\mathcal S _{0}:=\{s\in \mathcal S :E[\varphi (X,s)]\le 0\}.$$
We define a pseudo-true identified set of parameters as a collection of projections of elements in \(\mathcal S _{0}\). Let \(W\) be a given non-random finite \(L\times L\) symmetric positive-definite matrix. For each \(s\in \mathcal S \), define the norm \(\Vert s\Vert _{W}:=E[s(X)^{\prime }Ws(X)]^{1/2}\). For each \(s\in \mathcal S \) and \(A\subseteq \mathcal S \), the projection map \(\Pi _{A}:\mathcal S \rightarrow A\) is the map such that
$$\Vert s-\Pi _{A}s\Vert _{W}=\inf _{a\in A}\Vert s-a\Vert _{W}.$$
Let \(\mathcal R _{\Theta }:=\{r_{\theta }\in \mathcal S :\theta \in \Theta \} \). Given Assumption 2.4, we can define
$$\Theta _{*}:=\{\theta \in \Theta :r_{\theta }=\Pi _{\mathcal R _{\Theta }}s,\ s\in \mathcal S _{0}\}.$$
When \(\varphi \) is the evaluation map \(e\), \(\Theta _{*}\) is simply \(\Theta _{*}:=\{\theta \in \Theta :m_{\theta }=\Pi _{\mathcal M _{\Theta }}s,s\in \mathcal S _{0}\}.\)
\(\Theta _{*}\) can be interpreted as the set of parameters that correspond to the elements \(m_{\theta }\) in the \(\mathcal R _{\Theta }\) -projection of \(\mathcal S _{0}\). This set is nonempty (under some regularity conditions), and each element can be interpreted as a projection of \(s\) inducing a functional \(\varphi (\cdot ,s)\) that is observationally equivalent to \(\rho _{0}\). In this sense, each element in \(\Theta _{*}\) has an interpretation as a pseudo-true value. Thus, we call \(\Theta _{*}\) the pseudo-true identified set. [White (1982) uses \(\theta _{*}\) to denote the unique pseudo-true value in the fully identified case.]
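When \(L=1\) and \(\mathcal R _{\Theta }\) is linear in \(\theta \), the projection \(\Pi _{\mathcal R _{\Theta }}\) reduces to a population least-squares coefficient, \(\theta _{*}(s)=E[ZZ^{\prime }]^{-1}E[Zs(Z)]\) (the scalar weight \(W\) drops out). A sample-analogue sketch, with hypothetical variable names:

```python
import numpy as np

def project(s_vals, z):
    """Sample analogue of theta_*(s) = argmin_theta E[(s(Z) - Z'theta)^2],
    i.e. the least-squares coefficient of s(Z_i) on Z_i."""
    theta, *_ = np.linalg.lstsq(z, s_vals, rcond=None)
    return theta

rng = np.random.default_rng(2)
z = np.column_stack([np.ones(1000), rng.uniform(0.0, 1.0, 1000)])
s_vals = 1.0 + 2.0 * z[:, 1]       # an s that already lies in the linear class
theta_hat = project(s_vals, z)     # projection returns s itself: theta = (1, 2)
```

When \(s\) already belongs to \(\mathcal R _{\Theta }\), the projection recovers it exactly; in general it returns the closest element of the parametric class.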
We illustrate the relationship between \(\Theta _{I}\) and \(\Theta _{*}\) with an example. Consider Example 2.1. Let \(\Theta \subseteq \mathbb R ^{d_{Z}}\). The conventional identified set is given by
$$\Theta _{I}=\{\theta \in \Theta :E[(Y_{L}-Z^{\prime }\theta )1_{A}(Z)]\le 0,\ E[(Z^{\prime }\theta -Y_{U})1_{A}(Z)]\le 0\}. \qquad (12)$$
The pseudo-true identified set is given by
$$\Theta _{*}=\{\theta \in \Theta :r_{\theta }=\Pi _{\mathcal R _{\Theta }}s,\ s\in \mathcal S _{0}\},\quad \text{with }r_{\theta }(z)=z^{\prime }\theta \text{ and }\mathcal S _{0}=\{s\in \mathcal S :E[(Y_{L}-s(Z))1_{A}(Z)]\le 0,\ E[(s(Z)-Y_{U})1_{A}(Z)]\le 0\}. \qquad (13)$$
Let \(D\) be a \(d_{Z}\times K\) matrix whose \(j\)th column is \(E[Z\,1\{Z\in A_{j}\}]\). For this example, the following result holds:
Proposition 2.1
Let the conditions of Example 2.1 hold, and let \(\Theta _{*}\) be given as in (13). Let \(\Theta _{I}\) be given as in (12). Then \(\Theta _{I}\subseteq \Theta _{*}\). Suppose further that \(\mathcal M _{\Theta }\) is correctly specified, that \(E[Y_{U}|Z]=E[Y_{L}|Z]=Z^{\prime }\theta _{0}\) a.s., and that \(d_{Z}\le \mathrm{rank}(D)\). Then \(\Theta _{I}=\Theta _{*}=\{\theta _{0}\}\).
As this example shows, unless there is some information that helps restrict \( \mathcal S _{0}\) very tightly, \(\Theta _{I}\) is often a proper subset of \( \Theta _{*}\). This is because without such information, \(\mathcal S _{0}\) is typically a much richer class of functions than \(\mathcal R _{\Theta }\). Another important point to note is that, although \(\Theta _{*}\) is well-defined generally, \(\Theta _{I}\) can be empty quite easily. In particular, for any \(x,x^{\prime }\in \mathcal X \), let \(x_{\lambda }:=\lambda x+(1-\lambda )x^{\prime },0\le \lambda \le 1\). \(\Theta _{I}\) is empty if there exists \((x,x^{\prime })\) and \(\lambda \in [0,1]\) such that (i) \(x_{\lambda }\in \mathcal X \) and \((E[Y_{L}|x_{\lambda }]-E[Y_{U}|x])/\Vert x_{\lambda }-x\Vert >(E[Y_{U}|x^{\prime }]-E[Y_{U}|x])/\Vert x^{\prime }-x\Vert \) or (ii) \(x_{\lambda }\in \mathcal X \) and \((E[Y_{U}|x_{\lambda }]-E[Y_{L}|x])/\Vert x_{\lambda }-x\Vert <(E[Y_{L}|x^{\prime }]-E[Y_{L}|x])/\Vert x^{\prime }-x\Vert \).Footnote 3 A figure similar to Fig. 1 in Ponomareva and Tamer (2010) illustrates an example that satisfies condition (i) for the one-dimensional case.
In this example, each element in \(\Theta _{*}\) solves the following moment restrictions:
$$E[Z\,(Y+u(X)-Z^{\prime }\theta )]=0,$$
with \(u(x)=s(z)-y\) for some \(s\in \mathcal S _0.\) This can be viewed as a special case of incomplete linear moment restrictions studied in Bontemps, Magnac, and Maurin (2011) (BMM, henceforth).Footnote 4 BMM shows that the set of parameters that solves incomplete linear moment restrictions is necessarily convex and develops an inference method that exploits this property.
We note here that this connection to BMM’s work only occurs when the parametric class is of the form: \(\mathcal R _\Theta =\{r_\theta : r_\theta (z)=z^{\prime }\theta ,~\theta \in \Theta \}\). The elements of \(\Theta _{*}\), however, do not generally solve incomplete linear moment restrictions when \(\mathcal R _\Theta \) includes nonlinear functions of \(\theta \). Therefore, BMM’s inference method is only applicable when \(r_\theta \) is linear. Our estimation procedure is more flexible than theirs in the following two respects. First, one may allow projection to a more general class of parametric functions that includes nonlinear functions of \(\theta \). Second, as a consequence of the first point, we do not require \(\Theta _{*}\) to be convex. We, however, pay a price for achieving this generality. We require \(s\) to satisfy suitable smoothness conditions, which are not required by BMM. We discuss these conditions in detail in the following section.
3 Estimation
3.1 Set Estimator
For \(W\) as above and each \((\theta ,s)\in \Theta \times \mathcal S \), let the population criterion function be defined by
$$Q(\theta ,s):=\Vert s-r_{\theta }\Vert _{W}^{2}=E[(s(X)-r_{\theta }(X))^{\prime }W(s(X)-r_{\theta }(X))].$$
Using the population criterion function, the “pseudo-true” identified set \(\Theta _{*}\) can be equivalently written as
$$\Theta _{*}=\{\theta \in \Theta :Q(\theta ,s)=\min _{\theta ^{\prime }\in \Theta }Q(\theta ^{\prime },s),\ s\in \mathcal S _{0}\}.$$
Given a sample \(\{X_{1},{\ldots } ,X_{n}\}\) of observations, let the sample criterion function be defined for each \((\theta ,s)\in \Theta \times \mathcal S \) by
$$Q_{n}(\theta ,s):=\frac{1}{n}\sum _{i=1}^{n}(s(X_{i})-r_{\theta }(X_{i}))^{\prime }W(s(X_{i})-r_{\theta }(X_{i})).$$
Ideally, we would like to estimate \(\Theta _{*}\) by \(\tilde{\Theta }_{n}\), say, where \(\tilde{\Theta }_{n}:=\{\theta :Q_{n}(\theta ,s)\le c_{n},s\in \mathcal S _{0}\}\). But \(\mathcal S _{0}\) is unknown, so we must estimate it. Thus, we employ a two-stage procedure, similar to that studied in Kaido and White (2010). Section 3.3 discusses how to construct a first-stage estimator of \(\mathcal S _{0}\). For now, we suppose that such an estimator exists. For this, let \(\mathcal F (A)\) be the set of closed subsets of a set \(A\). See Kaido and White (2010) for background, including discussion of Effros measurability.
Assumption 3.1
(First-stage estimator) For each \(n\), let \(\mathcal S _{n}\subseteq \mathcal S \). \(\hat{\mathcal S }_{n}:\Omega \rightarrow \mathcal F (\mathcal S _{n})\) is (Effros-) measurable.
Given a first-stage estimator, we define a set estimator for the pseudo-true identified set. Let \(\{c_{n}\}\) be a sequence of non-negative constants. The set estimator for \(\Theta _{*}\) is defined by
$$\hat{\Theta }_{n}:=\{\theta \in \Theta :Q_{n}(\theta ,s)\le \inf _{\theta ^{\prime }\in \Theta }Q_{n}(\theta ^{\prime },s)+c_{n},\ s\in \hat{\mathcal S }_{n}\}. \qquad \qquad (17)$$
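Schematically, with \(c_{n}=0\) and a linear second stage, the estimator collects the second-stage minimizers over the first-stage set. The sketch below is a toy instance: the two-element stand-in for \(\hat{\mathcal S }_{n}\), the identity weight, and all names are illustrative assumptions, not the chapter's estimator:

```python
import numpy as np

def second_stage(s_vals, z):
    """theta_hat(s): minimizer of the sample criterion for the linear class
    r_theta(z) = z'theta with W = I (ordinary least squares of s(Z_i) on Z_i)."""
    theta, *_ = np.linalg.lstsq(z, s_vals, rcond=None)
    return theta

def set_estimator(first_stage_fits, z):
    """With c_n = 0: the collection of minimizers theta_hat(s), s in S_hat_n."""
    return np.array([second_stage(s, z) for s in first_stage_fits])

# Hypothetical first stage: two extreme fits bracketing the unknown s_0,
# standing in for the estimated set S_hat_n.
rng = np.random.default_rng(3)
z = np.column_stack([np.ones(500), rng.uniform(0.0, 1.0, 500)])
y = z @ np.array([0.5, 1.0]) + rng.normal(scale=0.1, size=500)
s_lower, s_upper = y - 0.3, y + 0.3      # stand-ins for first-stage bounds
Theta_hat = set_estimator([s_lower, s_upper], z)
```

Because the two first-stage fits differ by a constant, the two projections share the same slope and differ only in their intercepts, tracing out the spread of the estimated set.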
We establish our consistency results using the Hausdorff metric. Let \(||\cdot ||\) denote the Euclidean norm, and for any closed subsets \(A\) and \(B\) of a finite-dimensional Euclidean space (e.g., containing \(\theta \)), let
$$\vec{d}_{H}(A,B):=\sup _{a\in A}\inf _{b\in B}\Vert a-b\Vert ,\qquad d_{H}(A,B):=\max \{\vec{d}_{H}(A,B),\vec{d}_{H}(B,A)\},$$
where \(d_{H}\) and \(\vec {d}_{H}\) are the Hausdorff metric and directed Hausdorff distance respectively.
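For finite grids approximating two sets, these distances can be computed directly. A short sketch, assuming point sets small enough that the full pairwise distance matrix fits in memory:

```python
import numpy as np

def directed_hausdorff(A, B):
    """Directed distance: sup over a in A of the distance from a to the set B."""
    # Pairwise Euclidean distances ||a_i - b_j||, shape (|A|, |B|).
    D = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=-1)
    return D.min(axis=1).max()

def hausdorff(A, B):
    """Hausdorff metric: the larger of the two directed distances."""
    return max(directed_hausdorff(A, B), directed_hausdorff(B, A))

A = np.array([[0.0], [1.0]])
B = np.array([[0.0], [1.0], [3.0]])
```

Here every point of `A` is matched exactly in `B`, so the directed distance from `A` to `B` is zero, while the extra point `3.0` in `B` makes the reverse directed distance, and hence the Hausdorff distance, equal to 2.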
Before stating our assumptions, we introduce some additional notation. Let \( D_{\theta }^{\alpha }\) denote the differential operator \(\partial ^{\alpha }/\partial \theta _{1}^{\alpha _{1}}\cdots \partial \theta _{p}^{\alpha _{p}} \) with \(|\alpha |:=\sum _{j=1}^{p}\alpha _{j}\). Similarly, we let \( D_{x}^{\beta }\) denote the differential operator \(\partial ^{\beta }/\partial x_{1}^{\beta _{1}}\cdots \partial x_{k}^{\beta _{k}}\) with \( |\beta |:=\sum _{j=1}^{k}\beta _{j}\). For a function \(f:\mathcal X \rightarrow \mathbb R \) and \(\gamma >0\), let \(\underline{\gamma }\) be the largest integer smaller than \(\gamma \) and define
$$\Vert f\Vert _{\gamma }:=\max _{|\beta |\le \underline{\gamma }}\sup _{x\in \mathcal X }|D_{x}^{\beta }f(x)|+\max _{|\beta |=\underline{\gamma }}\sup _{x\ne x^{\prime }}\frac{|D_{x}^{\beta }f(x)-D_{x}^{\beta }f(x^{\prime })|}{\Vert x-x^{\prime }\Vert ^{\gamma -\underline{\gamma }}}.$$
Let \(\mathcal C _{M}^{\gamma }(\mathcal X )\) be the set of all continuous functions \(f:\mathcal X \rightarrow \mathbb R \) such that \(\Vert f\Vert _{\gamma }\le M\). Let \(\mathcal C _{M,L}^{\gamma }(\mathcal X ):=\{f:\mathcal X \rightarrow \mathbb R ^L:f^{(j)}\in \mathcal C ^\gamma _M(\mathcal X ), j=1,{\ldots },L\}\). Finally, for any \(\eta >0\), let \(\mathcal S _{0}^{\eta }:=\{s\in \mathcal S :\inf _{s^{\prime }\in \mathcal S _{0}}\Vert s-s^{\prime }\Vert _{W}<\eta \}\).
Our first assumption places conditions on the parameter spaces \(\Theta \) and \(\mathcal S \). We let \( int (\Theta )\) denote the interior of \( \Theta \).
Assumption 3.2
(i) \(\Theta \) is compact; (ii) \(\mathcal S \) is a compact convex set with nonempty interior; (iii) there exists \(\gamma >k/2\) such that \(\mathcal S \subseteq \mathcal C _{M,L}^{\gamma }(\mathcal X )\); (iv) \(\mathcal R _{\Theta }\) is a convex subset of \(\mathcal S \); (v) \(\Theta _{*}\subseteq int (\Theta ).\)
Assumption 3.2 (i) is standard in the literature on extremum estimation and also ensures the compactness of the pseudo-true identified set. Assumption 3.2 (iii) imposes a smoothness requirement on each component of \(s\in \mathcal S \). Together with Assumption 3.2 (ii), this implies that \(\mathcal S \) is compact under the uniform norm, which will also be used for establishing the Hausdorff consistency of \(\hat{\mathcal S } _{n}\) in the following section. For the Hausdorff consistency of \(\hat{\Theta }_{n} \), the requirement \(\gamma >k/2\) can be relaxed to \(\gamma >0\), and it also suffices that the smoothness requirement holds for functions in neighborhoods of \(\mathcal S _{0}\). The stronger requirement given here, however, will be useful for deriving the rates of convergence of \(\hat{\Theta }_{n}\) and \(\hat{\mathcal S }_{n}\).
For ease of analysis, we assume below that the observations are from a sample of IID random vectors.
Assumption 3.3
The observations \(\{X_i,i=1,{\ldots },n\}\) are independently and identically distributed.
The following two assumptions impose regularity conditions on \(r_\theta \).
Assumption 3.4
(i) \(r(x,\cdot )\) is twice continuously differentiable on the interior of \(\Theta \) \(\mathrm{a.e.}-P_{0}\), and there exists a measurable bounded function \(C:\mathcal X \rightarrow \mathbb R \) such that for any \(j\), \(x\), and \(|\alpha |\le 2\), \(|D_\theta ^{\alpha }r_{\theta }^{(j)}(x)-D_\theta ^{\alpha }r_{\theta ^{\prime }}^{(j)}(x)|\le C(x)\Vert \theta -\theta ^{\prime }\Vert \); (ii) there exists a measurable bounded function \(R:\mathcal X \rightarrow \mathbb R \) such that
$$\sup _{\theta \in \Theta }|D_{\theta }^{\alpha }r_{\theta }^{(j)}(x)|\le R(x)\quad \text{for all }x\in \mathcal X ,\ j\in \{1,{\ldots } ,L\},\ \text{and }|\alpha |\le 2.$$
For each \(x\), let \(\nabla _{\theta }r_{\theta }(x)\) be a \(L\times p\) matrix whose \(j\)th row is the gradient vector of \(r_{\theta }^{(j)}\) with respect to \(\theta \). For each \(x\in \mathcal X \) and \(i,j\in \{1,{\ldots },p\}\), let \(\partial ^{2}/\partial \theta _{i}\partial \theta _{j}r_{\theta }(x)\) be an \(L\times 1\) vector whose \(k\)th component is given by \(\partial ^{2}/\partial \theta _{i}\partial \theta _{j}r^{(k)}_{\theta }(x)\). For each \(\theta \in \Theta \), \(s\in \mathcal S \), and \(x\in \mathcal X \), let \(H_{W}(\theta ,s,x)\) be a \(p\times p\) matrix whose \((i,j)\)th component is given by
$$-2\,\Big (\frac{\partial ^{2}}{\partial \theta _{i}\partial \theta _{j}}r_{\theta }(x)\Big )^{\prime }W\,(s(x)-r_{\theta }(x)).$$
Let \(\eta >0\). For each \(s\in \mathcal S _0^{\eta }\) and \(\epsilon >0\), let \(V^\epsilon (s)\) be the neighborhood of \(\theta _{*}(s)\) defined by
$$V^{\epsilon }(s):=\{\theta \in \Theta :\Vert \theta -\theta _{*}(s)\Vert <\epsilon \}.$$
Let \(\mathcal N _{\epsilon ,\eta }:=\{(\theta ,s):\theta \in V^\epsilon (s),s\in \mathcal S _0^{\eta }\}\) be the graph of the correspondence \(V^\epsilon \) on \(\mathcal S _0^{\eta }\).
Assumption 3.5
There exist \(\bar{\epsilon }>0\) and \(\bar{\eta }>0\) such that the Hessian matrix \(\nabla _\theta ^2Q(\theta ,s):=E[H_{W}(\theta ,s,X_{i})+2\nabla _{\theta }r_{\theta }(X_{i})^{\prime }W\nabla _{\theta }r_{\theta }(X_{i})]\) is positive definite uniformly over \(\mathcal N _{\bar{\epsilon },\bar{\eta }}\).
Assumption 3.4 imposes a smoothness requirement on \(r_\theta \) as a function of \(\theta \), enabling us to expand the first-order condition for minimization, as is standard in the literature. Assumption 3.5 requires the Hessian of \(Q(\theta ,s)\) with respect to \(\theta \) to be positive definite uniformly on a suitable neighborhood of \(\Theta _{*}\times \mathcal S _0\). For the consistency of \(\hat{\Theta }_n\), it suffices to assume that the Hessian is uniformly non-singular over \(\mathcal N _{\bar{\epsilon },\bar{\eta }}\), but the stronger condition given here will be useful to ensure a quadratic approximation of the criterion function, which is crucial for the \(\sqrt{n}\)-consistency of \(\hat{\Theta }_{n}\).
Further, we assume that \(\hat{\mathcal S }_{n}\) is consistent for \(\mathcal S _{0}\) in a suitable Hausdorff metric. Specifically, for subsets \(A,B\) of \( \mathcal S \), let
$$\vec{d}_{H,W}(A,B):=\sup _{a\in A}\inf _{b\in B}\Vert a-b\Vert _{W},\qquad d_{H,W}(A,B):=\max \{\vec{d}_{H,W}(A,B),\vec{d}_{H,W}(B,A)\}.$$
Assumption 3.6
\(d_{H,W}(\hat{\mathcal S }_{n},\mathcal S _{0})=o_{p}(1)\).
Theorem 3.1 is our first main result, which establishes the consistency of the set estimator defined in (17) with \(c_n\) set to 0. This result is established by extending the standard consistency proof for extremum estimators to the current setting. Note that, under Assumption 3.2 (iv), the projection \(\theta _{*}(s):=\Pi _{\mathcal R _\Theta }s\) of each point \(s\in \mathcal S \) to \(\mathcal R _\Theta \) exists and is uniquely determined. In other words, for each \(s\in \mathcal S \), \(\theta _{*}(s)\) is point identified. By setting \(c_n=0\), the set estimator is then asymptotically equivalent to the collection of minimizers \(\hat{\theta }_n (s):={\text{argmin}}_{\theta ^{\prime }\in \Theta }Q_n(\theta ^{\prime },s)\) of the sample criterion function. The main challenge in establishing Hausdorff consistency is to show that \(\hat{\theta }_n(s)-\theta _{*}(s)\) vanishes in probability uniformly over a sufficiently large neighborhood of \(\mathcal S _0\). The proof of the theorem in the appendix formally establishes this and gives the desired result.
Theorem 3.1
Suppose Assumptions 2.1–2.4 and 3.1–3.6 hold. Let \(\hat{\Theta }_{n}\) be defined as in (17) with \(c_{n}=0\) for all \(n\). Then \(d_{H}(\hat{\Theta }_{n},\Theta _{*})=o_{p}(1)\).
The result of Theorem 3.1 is similar to that of Theorem 3.2 in Chernozhukov et al. (2007), who establish the Hausdorff consistency of a level-set estimator with \(c_n=0\) when \(Q_n\) degenerates on a neighborhood of the identified set.Footnote 5 When Assumption 3.2 (iv) fails to hold, this estimator may not be consistent. We conjecture, however, that it would be possible to construct a Hausdorff consistent estimator of \(\Theta _{*}\) even in such a setting by choosing a positive sequence \(\{c_n\}\) of levels that tends to 0 as \(n\rightarrow \infty \) and by exploiting the fact that \(\hat{\mathcal S }_n\) converges to \(\mathcal S _0\) in a suitable Hausdorff metric. In fact, Kaido and White (2010) establish the Hausdorff consistency of their two-stage set estimator using this argument, but in their analysis, the first-stage parameter (\(s\) in our setting) must be finite dimensional. Extending Theorem 3.1 to allow non-convex parametric classes is certainly of interest, but to keep our focus tight here, we leave this for future work.
3.2 The Rate of Convergence
Theorem 3.1 uses the fact that \(d_{H}(\hat{\Theta }_{n},\Theta _{*})\) can be bounded by \(d_{H,W}(\hat{\mathcal S }_{n},\mathcal S _{0})\). Although \(\hat{\mathcal S }_{n}\) generally does not converge at a parametric rate, the convergence rate of \(\hat{\Theta }_{n}\) can be improved when \(\hat{\mathcal S }_{n}\) converges to \(\mathcal S _{0}\) at the rate \(o_{p}(n^{-1/4})\). This is analogous to results obtained for the point identified case; see, for example, Newey (1994), Ai and Chen (2003), and Ichimura and Lee (2010).
Assumption 3.7
\(d_{H,W}(\hat{\mathcal S }_{n},\mathcal S _{0})=o_{p}(n^{-1/4})\).
Theorem 3.2
Suppose the conditions of Theorem 3.1 hold. Suppose in addition Assumption 3.7 holds. Let \(\hat{\Theta }_{n}\) be defined as in (17) with \(c_n=0\) for all \(n\). Then, \(d_H(\hat{\Theta }_{n},\Theta _{*})=O_p(n^{-1/2})\).
Setting \(c_n\) to 0 is crucial for achieving the \(O_p(n^{-1/2})\) rate. We note that Theorem 3.2 builds on Lemma A.2 in the appendix, which establishes the convergence rate (in the directed Hausdorff distance) of \(\hat{\Theta }_{n}\) in (17) with a possibly nonzero level \(c_n\). This lemma does not require Assumption 3.2 (iv) but assumes the Hausdorff consistency of \(\hat{\Theta }_{n}\) as a high-level condition. This is why Theorem 3.2 is stated for \(\hat{\Theta }_{n}\) with \(c_n=0\). As previously discussed, however, if Theorem 3.1 is extended to allow non-convex parametric classes, this lemma can be used to characterize the estimator’s convergence rate in a more general setting.
3.3 The First-Stage Estimator
This section discusses how to construct a first-stage set estimator. A challenge is that the object of interest \(\mathcal S _{0}\) is a subset of an infinite-dimensional space, which requires us to use nonparametric techniques for estimating \(\mathcal S _{0}\). This type of estimation problem was recently analyzed by Santos (2011), who studies estimation of linear functionals of function-valued parameters in nonparametric instrumental variable problems. We rely on his results on consistency and the rate of convergence, which extend the analysis of Chernozhukov et al. (2007) to a nonparametric setting. Specifically, for each \(s\in \mathcal S \), let
This is a sample criterion function defined on \(\mathcal S \). For instance, \({\mathcal Q }_{n}\) for Example 2.1 is given by
Our first-stage set estimator is a level set of \(\mathcal Q _{n}\) over a sieve \(\mathcal S _{n}\subseteq \mathcal S \). Given a sequence of non-negative constants \(\{a_{n}\}\) and \(\{b_{n}\}\), define
We add regularity conditions on \(\varphi \), \(\{\mathcal S _{n}\}\), and \(\{(a_{n},b_{n})\}\) to ensure the Hausdorff consistency of \(\hat{\mathcal S }_{n}\) and derive its convergence rate. The following two assumptions impose smoothness requirements on the map \(\varphi \).
Assumption 3.8
For each \(j\), there is a function \(B_{j}:\mathcal X \rightarrow \mathbb R _{+}\) such that
where \(\rho (s,s^{\prime }):=\sup _{x\in \mathcal X }\max _{j=1,{\ldots },l}|s^{(j)}(x)-s^{\prime (j)}(x)|\).
For each \(s\in \mathcal S \), let \(\mathcal I (s):=\{j\in \{1,{\ldots },l\}:E[\varphi ^{(j)}(X_{i},s)]>0\}\); \(\mathcal I (s)\) is the set of indices whose associated moments violate the inequality restrictions. For each \(j\), let \(\bar{\varphi }^{(j)}(s):=E[\varphi ^{(j)}(X_{i},s)]\).
Assumption 3.9
(i) For each \(j\), \(\bar{\varphi }^{(j)}:\mathcal S \rightarrow \mathbb R \) is continuously Fréchet differentiable with Fréchet derivative \(\dot{\varphi }_{s}^{(j)}\) at each \(s\in \mathcal S \), and for each \(s\in \mathcal S \), the operator norm \(\Vert \dot{\varphi }_{s}^{(j)}\Vert _{op}\) of \(\dot{\varphi }_{s}^{(j)}\) is bounded away from 0 for some \(j\in \{1,{\ldots } ,l\}\); (ii) for each \(s\notin \mathcal S _{0}\), there exist \(j\in \mathcal I (s)\) and \(C_{j}>0\) such that \(E[\varphi ^{(j)}(X_{i},s)]\ge C_{j}\Vert s-s_{0}\Vert _{W}\) for some \(s_{0}\in \mathcal S _{0}\).
We also add regularity conditions on \(\mathcal S _{n}\), which can be satisfied by commonly used sieves, including polynomials, splines, wavelets, and certain artificial neural network sieves.
Assumption 3.10
(i) For each \(n\), \(\mathcal S _{n}\subseteq \mathcal S \), and both \(\mathcal S _{n}\) and \(\mathcal S \) are closed with respect to \(\rho \); (ii) for every \(s\in \mathcal S \), there is \(\Pi _{n}s\in \mathcal S _{n}\) such that \(\sup _{s\in \mathcal S }\Vert s-\Pi _{n}s\Vert _{W}=O(\delta _{n})\) for some sequence \(\{\delta _{n}\}\) of non-negative constants with \(\delta _{n}\rightarrow 0\).
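The first-stage construction — evaluate the sample criterion \(\mathcal Q _n\) on a finite sieve and keep every element whose scaled criterion value falls below the level — can be sketched numerically. We assume here that the estimator takes the form \(\{s\in \mathcal S _n:a_n\mathcal Q _n(s)\le b_n\}\) (the precise definition is in (21)); the sieve of constant functions and the toy criterion are purely illustrative.

```python
import numpy as np

# Level-set estimator over a finite sieve: keep every sieve element whose
# scaled criterion value a_n * Qn(s) is at most b_n.  The form
# {s in S_n : a_n * Qn(s) <= b_n} is an assumption for illustration; the
# sieve of constant functions and the criterion below are toys.
def level_set(sieve, Qn, a_n, b_n):
    return [s for s in sieve if a_n * Qn(s) <= b_n]

# constants c on a grid; the criterion penalizes distance from c = 0.3
sieve = [lambda x, c=c: c for c in np.linspace(0.0, 1.0, 11)]
Qn = lambda s: (s(0.0) - 0.3) ** 2
S_hat = level_set(sieve, Qn, a_n=100.0, b_n=1.5)  # keeps c in {0.2, 0.3, 0.4}
```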
Theorem 3.3
Suppose Assumptions 2.1–2.3, 3.2 (i)–(iii), 3.3, 3.8, 3.9 (i), and 3.10 hold. Let \(a_{n}=O(\max \{n^{-1},\delta _{n}^{2}\}^{-1})\) and \(b_{n}\rightarrow \infty \) with \(b_{n}=o(a_{n})\). Then
In addition, suppose that Assumption 3.9 (ii) holds. Then
Theorem 3.3 can be used to establish Assumptions 3.6 and 3.7, which are imposed in Theorems 3.1 and 3.2. These conditions are satisfied for Example 2.1 with a single regressor.
In what follows, for any two sequences of positive constants \(\{c_{n}\},\) \(\{d_{n}\}\), let \(c_{n}\asymp d_{n}\) mean there exist constants \(0<C_{1}<C_{2}<\infty \) such that \(C_{1}\le |c_{n}/d_{n}|\le C_{2}\) for all \(n\).
Corollary 3.1
In Example 2.1, suppose that \(\mathcal Z \) is a compact convex subset of the real line and \(r_{\theta }(z)=\theta ^{(1)}+\theta ^{(2)}z\), where \(\theta \in \Theta \subseteq \mathbb R ^{2}\). Suppose that \(\Theta \) is compact and convex. Suppose further that \(\{(Y_{L,i},Y_{U,i},Z_{i})\}_{i=1,{\ldots },n}\) is a random sample from \(P_{0}\) and that \(P_{0}(Z\in A_{k})>0\) for all \(k\) and \(Var(Z)>0\). Let \(\mathcal S :=\{s\in L_{\mathcal Z ,1}^{2}:\Vert s\Vert _{\infty }\le M,|s(z)-s(z^{\prime })|\le M|z-z^{\prime }|,\forall z,z^{\prime }\in \mathcal Z \}\) for some \(M>0\). Let \(\{r_{q}(\cdot )\}_{q=1}^{J_{n}}\) be splines of order two with \(J_{n}\) knots on \(\mathcal Z \). Define \(\mathcal S _{n}:=\{s:s(z)=\sum _{q=1}^{J_{n}}\beta _{q}r_{q}(z)\}\) with \(J_{n}\asymp n^{c_{1}}\), \(c_{1}>1/3\). Let \(\hat{\mathcal S }_{n}\) be defined as in (21) with \(a_{n}\asymp n^{c_{2}}\), where \(2/3<c_{2}<1\), and \(b_{n}\asymp \ln n\). Then: (i) \(\hat{\mathcal S }_{n}\) is (Effros-) measurable; (ii) \(d_{H,W}(\hat{\mathcal S }_{n},\mathcal S _{0})=o_{p}(1)\); (iii) \(d_{H,W}(\hat{\mathcal S }_{n},\mathcal S _{0})=o_{p}(n^{-1/4})\).
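An order-two (piecewise-linear) spline sieve of the kind used in the corollary can be written in the truncated-power form \(\{1,z,(z-t_q)_+\}\); this particular basis and the knot placement below are illustrative assumptions, not the corollary's exact construction.

```python
import numpy as np

# Sketch of an order-two (piecewise-linear) spline sieve on Z = [0, 1] in
# the truncated-power form {1, z, (z - t_q)_+}.  The basis choice and
# knot placement are assumptions for illustration.
def spline_basis(z, knots):
    z = np.asarray(z, dtype=float)
    cols = [np.ones_like(z), z]
    cols += [np.maximum(z - t, 0.0) for t in knots]
    return np.column_stack(cols)      # n x (2 + J_n) design matrix

knots = np.linspace(0.1, 0.9, 5)      # J_n = 5 interior knots
z = np.linspace(0.0, 1.0, 50)
B = spline_basis(z, knots)

# A sieve element is s(z) = sum_q beta_q r_q(z); e.g. a least-squares fit
# of the Lipschitz target |z - 1/2| onto the sieve (exact here, since the
# kink at 0.5 coincides with a knot):
beta, *_ = np.linalg.lstsq(B, np.abs(z - 0.5), rcond=None)
```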
Given these results, we further show that the estimator of the pseudo-true identified set is consistent and converges at an \(n^{-1/2}\) rate.
Corollary 3.2
Suppose that the conditions of Corollary 3.1 hold. Let \(Q\) be defined as in (17) with \(W=1\). Let \(Q_{n}\) be defined as in (16) and \(\hat{\Theta }_{n}\) be defined as in (17) with \(c_{n}=0\) and \(\hat{\mathcal S }_{n}\) as in Corollary 3.1. Then \(d_{H}(\hat{\Theta }_{n},\Theta _{*})=O_{p}(n^{-1/2})\).
4 Concluding Remarks
Moment inequalities are widely used to estimate discrete choice problems and structures that involve censored variables. In many empirical applications, potentially misspecified parametric models are used to estimate such structures. This chapter studies a novel estimation procedure that is robust to misspecification of moment inequalities. To overcome the challenge that the conventional identified set may be empty under misspecification, we defined a pseudo-true identified set as the least squares projection of the set of functions at which the moment inequalities are satisfied. This set is nonempty under mild assumptions. We also proposed a two-stage set estimator for estimating the pseudo-true identified set. Our estimator first estimates the identified set of function-valued parameters by a level-set estimator over a suitable sieve. The pseudo-true identified set can then be estimated by projecting the first-stage estimator onto a finite-dimensional parameter space. We give conditions under which the estimator is consistent for the pseudo-true identified set in the Hausdorff metric and converges at the rate \(O_p(n^{-1/2})\). Developing inference procedures based on the proposed estimator would be interesting future work. Another interesting extension would be to study the optimal choice of the weighting matrix. In this chapter, we maintained the assumption that \(W\) is fixed and does not depend on \((\theta ,s)\). Given the form of the criterion function, the most natural choice of \(W\) would be the inverse of the variance-covariance matrix of \(s(X_i)-r_{\theta }(X_i)\). This matrix is generally unknown but can be consistently estimated by its sample analog: \( \hat{W}_n(\theta ,s):=(\frac{1}{n}\sum _{i=1}^n(s(X_i)-r_{\theta }(X_i)) (s(X_i)-r_{\theta }(X_i))^{\prime })^{-1}.\) Defining a sample criterion function using \(\hat{W}_n(\theta ,s)\) as a weighting matrix would lead to a three-step procedure.
Such a procedure may result in more efficient estimation of \(\Theta _{*}\).Footnote 6 Yet another interesting direction would be to develop a specification test for moment inequality models based on the current framework. This would extend the results of Guggenberger et al. (2008), who study a procedure for testing the nonemptiness of the identified set.
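The sample analog \(\hat{W}_n(\theta ,s)\) above is straightforward to compute; a minimal sketch, with simulated residuals standing in for \(s(X_i)-r_\theta (X_i)\):

```python
import numpy as np

# Sample analog of the natural weighting matrix from the text:
# W_hat_n = ( (1/n) sum_i e_i e_i' )^{-1}, where e_i = s(X_i) - r_theta(X_i).
# The residuals below are simulated stand-ins for illustration.
def W_hat(residuals):
    """residuals: n x L array with rows s(X_i) - r_theta(X_i)."""
    n = residuals.shape[0]
    V = residuals.T @ residuals / n   # (1/n) sum of outer products
    return np.linalg.inv(V)

rng = np.random.default_rng(1)
res = rng.normal(size=(500, 2))
W = W_hat(res)                        # close to the identity for N(0, I) noise
```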
Notes
- 1.
Here, we take the indicators (or instruments) \(1_A(z)\) as given. The indicators \(1_{A}(z)\) could be replaced by any finite vector of measurable non-negative functions of \(z\). Andrews and Shi (2011) give examples of such functions.
- 2.
The players do not need to know the \(F\)’s, but these are important to the econometrician.
- 3.
For this example, \(\Theta _I\) is never empty as long as the number (\(2K\)) of moment inequalities equals the number of parameters \((\ell )\).
- 4.
We are indebted to an anonymous referee for pointing out a relationship between BMM’s framework and ours. General incomplete linear moment restrictions are given by \(E[V(Z^{\prime }\theta -Y)]=E[Vu(V)]\), where \(V\) is a vector of random variables, and \(u\) is an unknown bounded function. See BMM for details.
- 5.
Their framework does not consider misspecification. Their object of interest is therefore the conventional identified set \(\Theta _I\). In our setting, the sample criterion function degenerates, i.e., \(Q_n(\theta ,s)=0\), on a neighborhood of \(\Theta _{*}\times \mathcal S _0\) under Assumption 3.2 (iv).
- 6.
We are indebted to an anonymous referee for this point.
- 7.
Since the mean value theorem only applies element by element to the vector in (A.8), the mean value \(\bar{\theta }_n\) differs across the elements. For notational simplicity, we use \(\bar{\theta }_n\) in what follows, but the fact that they differ element to element should be understood implicitly. For the measurability of these mean values, see Jennrich (1969) for example.
References
Ai, C., and X. Chen (2003): “Efficient Estimation of Models with Conditional Moment Restrictions Containing Unknown Functions”, Econometrica, 71(6), 1795–1843.
Aliprantis, C. D., and K. C. Border (2006): Infinite Dimensional Analysis-A Hitchhiker’s Guide. Springer, Berlin.
Andrews, D. W. K. (1994): “Chapter 37: Empirical Process Methods in Econometrics”, vol. 4 of Handbook of Econometrics, pp. 2247–2294. Elsevier, Amsterdam.
Andrews, D. W. K., and X. Shi (2011): “Inference for Parameters Defined by Conditional Moment Inequalities”, Discussion Paper, Yale University.
Bajari, P., C. L. Benkard, and J. Levin (2007): “Estimating Dynamic Models of Imperfect Competition”, Econometrica, 75(5), 1331–1370.
Bontemps, C., T. Magnac, and E. Maurin (2011): “Set Identified Linear Models”, CeMMAP Working Paper.
Chen, X. (2007): “Large Sample Sieve Estimation of Semi-Nonparametric Models”, Handbook of Econometrics, 6, 5549–5632.
Chernozhukov, V., H. Hong, and E. Tamer (2007): “Estimation and Confidence Regions for Parameter Sets in Econometric Models”, Econometrica, 75(5), 1243–1284.
Ciliberto, F., and E. Tamer (2009): “Market Structure and Multiple Equilibria in Airline Markets”, Econometrica, 77(6), 1791–1828.
Folland, G. (1999): Real Analysis: Modern Techniques and Their Applications, vol. 40. Wiley-Interscience, New York.
Guggenberger, P., J. Hahn, and K. Kim (2008): “Specification Testing under Moment Inequalities”, Economics Letters, 99(2), 375–378.
Ichimura, H., and S. Lee (2010): “Characterization of the Asymptotic Distribution of Semiparametric M-Estimators”, Journal of Econometrics, 159(2), 252–266.
Jennrich, R. I. (1969): “Asymptotic Properties of Nonlinear Least Squares Estimators”, Annals of Mathematical Statistics, 40(2), 633–643.
Kaido, H., and H. White (2010): “A Two-Stage Approach for Partially Identified Models”, Discussion Paper, University of California San Diego.
Lindenstrauss, J., D. Preiss, and J. Tiser (2007): “Differentiability of Lipschitz Maps”, in Banach Spaces and Their Applications in Analysis, pp. 111–123.
Luttmer, E. G. J. (1996): “Asset Pricing in Economies with Frictions”, Econometrica, 64(6), 1439–1467.
Manski, C. F., and E. Tamer (2002): “Inference on Regressions with Interval Data on a Regressor or Outcome”, Econometrica, 70(2), 519–546.
Molchanov, I. S. (2005): Theory of Random Sets. Springer, Berlin.
Newey, W. (1994): “The Asymptotic Variance of Semiparametric Estimators”, Econometrica, 62(6), 1349–1382.
Newey, W. K., and D. McFadden (1994): “Large Sample Estimation and Hypothesis Testing”, Handbook of Econometrics, 4, 2111–2245.
Pakes, A. (2010): “Alternative Models for Moment Inequalities”, Econometrica, 78(6), 1783–1822.
Pakes, A., J. Porter, K. Ho, and J. Ishii (2006): “Moment Inequalities and Their Application”, Working Paper, Harvard University.
Ponomareva, M., and E. Tamer (2010): “Misspecification in Moment Inequality Models: Back to Moment Equalities?” Econometrics Journal, 10, 1–21.
Santos, A. (2011): “Instrumental Variables Methods for Recovering Continuous Linear Functionals”, Journal of Econometrics, 161, 129–146.
Sherman, R. P. (1993): “The Limiting Distribution of the Maximum Rank Correlation Estimator”, Econometrica, 61(1), 123–137.
Tamer, E. (2003): “Incomplete Simultaneous Discrete Response Model with Multiple Equilibria”, The Review of Economic Studies, 70(1), 147–165.
van der Vaart, A. W., and J. A. Wellner (1996): Weak Convergence and Empirical Processes: with Applications to Statistics. Springer, New York.
White, H. (1982): “Maximum Likelihood Estimation of Misspecified Models”, Econometrica, 50(1), 1–25.
Mathematical Proofs
1.1 Notation
Throughout the appendix, let \(\Vert \cdot \Vert \) denote the usual Euclidean norm. For each \(s,s^{\prime }\in \mathcal S \), let \(\rho (s,s^{\prime }):=\sup _{x\in \mathcal X }\max _{j=1,{\ldots } ,l}|s^{(j)}(x)-s^{\prime (j)}(x)| \). For each \(a\times b\) matrix \(A\), let \(\Vert A\Vert _{op}:=\min \{c:\Vert Av\Vert \le c\Vert v\Vert \text{ for all }v\in \mathbb R ^{b}\}\) be the operator norm. For any symmetric matrix \(A\), let \(\xi (A)\) denote the smallest eigenvalue of \(A\).
For a given pseudometric space \((T,\rho )\), let \(N(\epsilon ,T,\rho )\) be the covering number, i.e., the minimal number of \(\epsilon \)-balls needed to cover \(T\). For each measurable function \(f:\mathcal X \rightarrow \mathbb R \) and \(1\le p<\infty \), let \(\Vert f\Vert _{L^{p}}:=E[|f(X)|^p]^{1/p}\), provided that the integral exists. Similarly, let \(\Vert f\Vert _\infty :=\inf \{c:P(|f(X)|>c)=0\}\). For a given function space \(\mathcal G \) equipped with a norm \(\Vert \cdot \Vert _{\mathcal G }\) and \(l,u\in \mathcal G \), let \([l,u]:=\{f\in \mathcal G :l\le f\le u\}\). For each \(f\in \mathcal G \), let \(B_{\epsilon ,f}:=\{[l,u]:l\le f\le u,\Vert u-l\Vert _{\mathcal G }<\epsilon \}\) be the set of \(\epsilon \)-brackets containing \(f\). The bracketing number \(N_{[\,]}(\epsilon ,\mathcal G ,\Vert \cdot \Vert _{\mathcal G })\) is the minimum number of \(\epsilon \)-brackets needed to cover \(\mathcal G \). An envelope function \(G\) of a function class \(\mathcal G \) is a measurable function such that \(|g(x)|\le G(x)\) for all \(g\in \mathcal G \) and \(x\in \mathcal X \). For each \(\delta >0\), the bracketing integral of \(\mathcal G \) with envelope function \(G\) is defined as \(J_{[]}(\delta ,\mathcal G ,\Vert \cdot \Vert _{\mathcal G }):=\int _0^\delta \sqrt{1+\ln N_{[]}(\epsilon \Vert G\Vert _{\mathcal G },\mathcal G ,\Vert \cdot \Vert _{\mathcal G })}\,d\epsilon \).
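As a concrete instance of the covering number \(N(\epsilon ,T,\rho )\): for \(T=[0,1]\) with the absolute-value metric and closed \(\epsilon \)-balls, \(N(\epsilon ,[0,1],|\cdot |)=\lceil 1/(2\epsilon )\rceil \), since each ball covers a length of \(2\epsilon \). A one-line check:

```python
import math

# Covering number of [0, 1] under |.|: each closed epsilon-ball (an
# interval of length 2 * epsilon) covers at most 2 * epsilon of the unit
# interval, and centers at epsilon, 3 * epsilon, ... attain the bound.
def covering_number_interval(eps):
    return math.ceil(1.0 / (2.0 * eps))

print(covering_number_interval(0.25))  # 2
```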
1.2 Projection
Proof of Proposition 2.1.
Note that under the conditions of Example 2.1, Assumption 2.3 holds. This ensures \(\mathcal S _0\) is nonempty. By Eq. (13), \(\Theta _{*}\) is nonempty. Furthermore, let \(\theta \in \Theta _I,\) and for each \(z\in \mathcal Z \), let \(r_\theta (z):=z^{\prime }\theta \). Note that \(r_\theta \in \mathcal S _0\). Thus, (13) holds with \(s=r_\theta \), which ensures the first claim.
For the second claim, note that the condition \(E[Y_U|Z]=E[Y_L|Z]=Z^{\prime }\theta _0\) a.s. implies that any \(\theta \in \Theta _I\) must satisfy
By the rank condition on \(D\), the unique solution to (A.1) is \(\theta _0-\theta =0\). Thus, \(\{\theta _0\}=\Theta _I\). Since \(\{\theta _0\}\subseteq \Theta _{*}\) by the first claim, it suffices to show that \(\theta _0\) is the unique element of \(\Theta _{*}\). For this, note that under our assumptions, \(\mathcal S _0=\{s_0\}\) with \(s_0(z)=z^{\prime }\theta _0\). Thus, \(\Theta _{*}=\{\theta _0\}\). This completes the proof.\(\square \)
1.3 Consistency of the Parametric Part
For each \(s\in \mathcal S \), let \(\theta _{\ast }(s):=\mathop{\rm arg\,min}_{\theta \in \Theta }Q(\theta ,s)\) and \(\hat{\theta}_{n}(s):=\mathop{\rm arg\,min}_{\theta \in \Theta}Q_{n}(\theta ,s)\).
Lemma A.1
Suppose that Assumptions 3.4 and 3.2 (iv) hold. Then, (i) for each \(x\in \mathcal X \) and any \(s,s^{\prime }\in \mathcal S \), there exists a function \(C_{1}:\mathcal X \rightarrow \mathbb R _{+}\) such that
(ii) For each \(x\in \mathcal X \), \(j=1,{\ldots } ,L,\) and any \(s,s^{\prime }\in \mathcal S \), there exists a function \(C_{2}:\mathcal X \rightarrow \mathbb R _{+}\) such that
Proof of Lemma A.1
Assumption 3.4 ensures that
Assumption 3.2 (iv) ensures that for each \(s\in L^2_{\mathcal S ,L}\), \(\theta _{*}(s)=\Pi _{\mathcal R _\Theta }s\) is uniquely determined, where \(\Pi _{\mathcal R _\Theta }\) is the projection mapping from the Hilbert space \(L^2_{\mathcal S ,L}\) to the closed convex subset \(\mathcal R _{\Theta }\). Furthermore, Lemma 6.54 (d) in Aliprantis and Border (2006) and the fact that \(\rho \) is stronger than \(\Vert \cdot \Vert _W\) imply
for some \(c>0\). Combining (A.4) and (A.5) ensures (i). Similarly, Assumption 3.4 ensures that for each \(x\in \mathcal X \)
Combining (A.5) and (A.6) ensures (ii). \(\square \)
Proof of Theorem 3.1
Step 1: Let \(s\in \mathcal S \) be given. For each \(\theta \in \Theta \), let \(Q_s(\theta ):=Q(\theta ,s)\) and \(Q_{n,s}(\theta ):=Q_n(\theta ,s)\). By Assumption 3.2 (iv) and Theorem 6.53 in Aliprantis and Border (2006), \(Q_s\) is uniquely minimized at \(\theta _{*}(s)\). By Assumption 3.2 (i), \(\Theta \) is compact. By Assumption 3.2, \(Q_s\) is continuous. Furthermore, Assumption 3.4 ensures the applicability of the uniform law of large numbers. Thus, \(\sup _{\theta \in \Theta }|Q_{n,s}(\theta )-Q_s(\theta )|=o_p(1)\). Hence, by Theorem 2.1 in Newey and McFadden (1994), \(\hat{\theta }_n(s)-\theta _{*}(s)=o_p(1)\).
By Assumptions 3.2 (v), 3.4 (ii), and the fact that \(\hat{\theta }_n(s)\) is consistent for \(\theta _{*}(s)\), \(\hat{\theta }_n(s)\) solves the first order condition:
with probability approaching one. Expanding this condition at \(\theta _{*}(s)\) using the mean-value theorem applied to each element of \(\nabla _\theta Q_n(\theta ,s)\) yields
where \(\bar{\theta }_n(s)\) lies on the line segment that connects \(\hat{\theta }_n(s)\) and \(\theta _{*}(s)\).Footnote 7 For each \(s\in \mathcal S _0^{\bar{\eta }}\), let
Below, we show that the function class \(\Psi :=\{f_s:f_s=\psi ^{(j)}_s, s\in \mathcal S _0^{\bar{\eta }}, j=1,2,{\ldots },J\}\) is a Glivenko–Cantelli class.
By Assumption 3.4 (ii), Lemma A.1, the triangle inequality, and the Cauchy–Schwarz inequality, for any \(s,s^{\prime }\in \mathcal S \),
where \(F(x):=(C_2(x)\Vert W\Vert _{op}(M+R(x))+(1+C_1(x))\Vert W\Vert _{op}R(x))\times \sqrt{L}\). For any \(\epsilon >0\), let \(u:=\epsilon /(2\Vert F\Vert _{L^1})\). By Theorem 2.7.11 in van der Vaart and Wellner (1996) and Assumption 3.2 (ii), we obtain
For each \(j=1,{\ldots },L\), let \(\mathcal S _0^{\bar{\eta },(j)}:=\{s^{(j)}:s\in \mathcal S _0^{\bar{\eta }}\}\). For each \(j\), \(g\in \mathcal S _0^{\bar{\eta },(j)}\), and \(\epsilon >0\), let \(B^{(j)}_\epsilon (g):=\{f\in \mathcal S _0^{\bar{\eta },(j)}:\Vert f-g\Vert _{\infty }<\epsilon \}\). Similarly, for each \(s\in \mathcal S _0^{\bar{\eta }}\), let \(B_{u,\rho }(s):=\{f\in \mathcal S _0^{\bar{\eta }}:\rho (f,s)<u\}\). As we will show below, \(N_j:=N(u,\mathcal S _0^{\bar{\eta },(j)},\Vert \cdot \Vert _\infty )\) is finite for all \(j\). Thus, for each \(j\) there exist \(f_{1,j},{\ldots },f_{N_j,j}\in \mathcal S _0^{\bar{\eta },(j)}\) such that \(\mathcal S _0^{\bar{\eta },(j)}\subseteq \bigcup _{l=1}^{N_j}B_{u}^{(j)}(f_{l,j})\). We can then obtain a grid of distinct points \(f_1,{\ldots },f_N\in \mathcal S _0^{\bar{\eta }}\) such that \(f_i^{(j)}=f_{l,j}\) for some \(1\le l\le N_j\), where \(N=\prod _{j=1}^L N_j\). Then, by the definition of \(\rho \), \(\mathcal S _0^{\bar{\eta }}\subseteq \bigcup _{i=1}^N B_{u,\rho }(f_i)\). Thus,
where the last inequality follows from Assumption 3.2 (ii)–(iii) and Theorem 2.7.1 in van der Vaart and Wellner (1996). By Theorem 2.4.1 in van der Vaart and Wellner (1996), \(\Psi \) is a Glivenko–Cantelli class.
Note that, by Assumptions 3.2 (v) and 3.4, \(\theta _{*}(s)\) solves the population analog of (A.7). Thus,
These results together with the strong law of large numbers whose applicability is ensured by Assumptions 3.3 and 3.4 (ii) imply
Step 2: In this step, we show that the Hessian \(\nabla ^2_\theta Q_n(\theta ,s)\) is invertible with probability approaching 1 uniformly over \(\mathcal N _{\bar{\epsilon },\bar{\eta }}\). Let \(\mathcal H :=\{h_{\theta ,s}:\mathcal X \rightarrow \mathbb R :h_{\theta ,s}(x)=H^{(i,j)}_W(\theta ,s,x)+2\nabla _\theta r^{(i)}_{\theta }(x)^{\prime }W\nabla _\theta r^{(j)}_{\theta }(x), 1\le i,j\le p,\theta \in \Theta ,s\in \mathcal S _0^{\bar{\eta }}\}\). Note that \(h_{\theta ,s}\) takes the form:
for some \(1\le i,j\le p, \theta \in \Theta \), and \(s\in \mathcal S ^{\bar{\eta }}_0\). Consider the function classes \(\mathcal F _1:=\{D^\alpha _\theta r^{(k)}_\theta :\theta \in \Theta ,|\alpha |\le 2,k=1,{\ldots },L\}\) and \(\mathcal F _2:=\{s^{(k)}:s\in \mathcal S _0^{\bar{\eta }},k=1,{\ldots },L\}\). Assumptions 3.2 (i), 3.4, and Theorem 2.7.11 in van der Vaart and Wellner (1996) ensure \(N_{[]}(\epsilon ,\mathcal F _1,\Vert \cdot \Vert _{L^2})\le N(u,\Theta ,\Vert \cdot \Vert )<\infty \) with \(u:=\epsilon /(2\Vert C\Vert _{L^2})\). Assumption 3.2 (ii)–(iii) and Corollary 2.7.2 in van der Vaart and Wellner (1996) ensure \(N_{[]}(\epsilon ,\mathcal F _2,\Vert \cdot \Vert _{L^2})\le N_{[]}(\epsilon , \mathcal C ^\gamma _M(\mathcal X ),\Vert \cdot \Vert _{L^2})<\infty \). Since \(\mathcal H \) can be obtained by combining functions in \(\mathcal F _1\) and \(\mathcal F _2\) through additions and pointwise multiplications, Theorem 6 in Andrews (1994) implies \(N_{[]}(\epsilon ,\mathcal H ,\Vert \cdot \Vert _{L^2})<\infty \). This bracketing number is given in terms of the \(L^2\)-norm, but we can also obtain a bracketing number in terms of the \(L^1\)-norm. To see this, let \(h_1,{\ldots },h_p\) be the centers of \(\Vert \cdot \Vert _{L^2}\)-balls that cover \(\mathcal H \). Then, the brackets \([h_i-\epsilon ,h_i+\epsilon ],i=1,{\ldots },p,\) cover \(\mathcal H \), and each bracket has length at most \(2\epsilon \) in \(\Vert \cdot \Vert _{L^1}\). Thus, \(N_{[]}(\epsilon ,\mathcal H ,\Vert \cdot \Vert _{L^1})<\infty \). By Theorem 2.7.1 in van der Vaart and Wellner (1996), \(\mathcal H \) is a Glivenko–Cantelli class. Hence, uniformly over \(\Theta \times \mathcal S _0^{\bar{\eta }}\),
Note that \(d_{H,W}(\hat{\mathcal S }_n,\mathcal S _0)=o_p(1)\) by Assumption 3.6. Thus, \((\bar{\theta }_n(s),s)\in \mathcal N _{\bar{\epsilon },\bar{\eta }}\) with probability approaching one. By Assumption 3.5 and (A.15), there exists \(\delta >0\) such that \(\nabla ^2_\theta Q_n(\bar{\theta }_n(s),s)\)’s smallest eigenvalue is above \(\delta \) uniformly over \(\mathcal N _{\bar{\epsilon },\bar{\eta }}\). Thus, the Hessian \(\nabla ^2_\theta Q_n(\bar{\theta }_n(s),s) \) in (A.8) is invertible with probability approaching 1.
Step 3: Steps 1–2 imply that, uniformly over \(\mathcal S _0^{\bar{\eta }}\),
where we used the fact that \(\Vert \theta _{*}(s)-\theta _{*}(s^{\prime })\Vert \le \Vert s-s^{\prime }\Vert _{W}\) by Lemma 6.54 (d) in Aliprantis and Border (2006).
Step 4: Finally, note that by Step 3,
Equation (18) and Assumption 3.6 then ensure the desired result. \(\square \)
1.4 Convergence Rate
The following lemma controls the rate at which \(\hat{\Theta }_n\) covers \(\Theta _{*}\). Given a sequence \(\{\eta _{n}\}\) such that \(\eta _{n}\rightarrow 0\), we let \(V^{\eta _{n}}(s):=\{\theta ^{\prime }:\Vert \theta ^{\prime }-\theta _{*}(s)\Vert \le e_n\}\), where \(e_n=O_p(\eta _n)\), and let \(\mathcal N _{\eta _n,0}:=\{(\theta ,s):\theta \in V^{\eta _n}(s),s\in \mathcal S _0\}\).
Lemma A.2
Suppose Assumptions 2.1–2.3, 3.1–3.2, and 3.6 hold. Let \(\{\delta _{1n}\}\) and \(\{\epsilon _{n}\}\) be sequences of non-negative numbers converging to 0 as \(n\rightarrow \infty \). Let \(G:\Theta \times \mathcal S \rightarrow \mathbb R _{+}\) be a function such that \(G\) is jointly measurable and lower semicontinuous. For each \(n\), let \(G_{n}:\Omega \times \Theta \times \mathcal S \rightarrow \mathbb R \) be a function such that for each \(\omega \in \Omega \), \(G_{n}(\omega ,\cdot ,\cdot )\) is jointly measurable and lower semicontinuous, and for each \((\theta ,s)\in \Theta \times \mathcal S \), \(G_{n}(\cdot ,\theta ,s)\) is measurable. Let \(\Theta _{*}:=\{\theta \in \Theta :G(\theta ,s)=0,s\in \mathcal S _{0}\}\) and \(\hat{\Theta }_{n}:=\{\theta \in \Theta :G_{n}(\theta ,s)\le \inf _{\theta ^{\prime }\in \Theta }G_{n}(\theta ^{\prime },s)+c_{n},s\in \hat{\mathcal S }_{n}\}\). Suppose that \(d_{H}(\hat{\Theta }_{n},\Theta _{*})=O_{p}(\delta _{1n})\). Suppose further that there exists a positive constant \(\kappa \) and a neighborhood \(V(s)\) of \(\theta _{*}(s)\) such that
for all \(\theta \in V(s),s\in \mathcal S _{0}\). Suppose that uniformly over \(\mathcal N _{\delta _{1n},0}\),
Then
Proof of Lemma A.2
The proof of this Lemma is similar to that of Theorem 1 in Sherman (1993). By (A.19), (A.20), and the Hausdorff consistency of \(\hat{\Theta }_n\), it follows that, uniformly over \(\mathcal N _{\delta _{1n},0}\),
with probability approaching 1. As in Theorem 1 in Sherman (1993), write \(K_n\Vert \theta -\theta _{*}(s)\Vert \) for the \(O_p(\Vert \theta -\theta _{*}(s)\Vert /\sqrt{n})\) term, where \(K_n=O_p(1/\sqrt{n})\), and note that the \(o_p(\Vert \theta -\theta _{*}(s)\Vert ^2)\) term is bounded from below by \(-\frac{\kappa }{2}\Vert \theta -\theta _{*}(s)\Vert ^2\) with probability approaching 1. Thus, we obtain
Completing the square, we obtain
Taking square roots gives
Thus,
This completes the proof. \(\square \)
The following lemma controls the rate at which \(\hat{\Theta }_n\) is contracted into a neighborhood of \(\Theta _{*}\). Given \(s\in \mathcal S \) and a sequence \(\{\delta _n\}\) such that \(\delta _n\rightarrow 0\), let \(U^{\delta _n}(s):=\{\theta \in \Theta :\Vert \theta -\theta _{*}(s)\Vert \ge \delta _n\}\).
Lemma A.3
Suppose Assumptions 2.1–2.3, 3.1–3.2, and 3.6 hold. Let \(G_{n}\) be defined as in Lemma A.2. Suppose that there exist positive constants \((k,\kappa _{2})\) and a sequence \(\{\delta _{1n}\}\) such that
with probability approaching 1 for all \(\theta \in U^{\delta _{n}}(s)\) with \(\delta _{n}:=(k\delta _{1n}/\sqrt{n})^{1/2}\) and \(s\in \mathcal S _{0}^{\bar{\eta }}\). Then,
Proof of Lemma A.3
Note first that \(\hat{\mathcal S }_n\) is in \(\mathcal S _0^{\bar{\eta }}\) with probability approaching 1 by Assumption 3.6. Let \(\tilde{c}_n:=\sqrt{n}c_n\) and \(\bar{c}_n:=\max \{\kappa _2k\delta _{1n},\tilde{c}_n\}\). Let \(\epsilon _n:=(\bar{c}_n /\kappa _2 \sqrt{n})^{1/2}\). Then, uniformly over \(\mathcal S _0^{\bar{\eta }}\),
Since \(\sqrt{n}G_n(\hat{\theta }_n(s),s)\le \tilde{c}_n\) for all \(s\in \hat{\mathcal S }_n\), the results above ensure
This ensures the claim of the Lemma. \(\square \)
Proof of Theorem 3.2
We first show (A.19) holds with \(G(\theta ,s)=Q(\theta ,s)\). For this, we use the second-order Taylor expansion of \(Q(\theta ,s)\). For \(\theta \in V^{\delta _{1n}}(s)\), it holds by Assumptions 3.2 (v) and 3.4 that
where \(\bar{\theta }(s)\) is on the line segment that connects \(\theta \) and \(\theta _{*}(s)\). By (15), \(Q(\theta _{*}(s),s)=0\), and by the first order condition of the optimality, \(\nabla _\theta Q(\theta _{*}(s),s)=0\). Thus, it follows that
where \(\kappa :=\inf _{\theta \in \Theta ,s\in \mathcal S _0} \xi (\nabla ^2_\theta Q(\theta ,s))/2\), and \(\kappa >0\) by Assumption 3.5.
We next show that (A.20) holds for
In what follows, let \(\hat{E}_n\) denote the expectation with respect to the empirical distribution. Using the Taylor expansion of \(G_n\) and \(G\) with respect to \(\theta \) at \(\theta _{*}(s)\), we may write
where
Thus, for (A.20) to hold, it suffices to show that \( S_{1n}(\theta ,s)=O_p(\Vert \theta -\theta _{*}(s)\Vert /\sqrt{n})+o_p(\Vert \theta -\theta _{*}(s)\Vert ^2)\) and \(S_{2n}(\theta ,s)=O_p(\epsilon _n)\) for some \(\epsilon _n\rightarrow 0\). For \(S_{1n}\), note that our assumptions suffice for the conditions of Lemma A.4. Thus, \(\Phi \) is a \(P_0\)-Donsker class. This ensures \(S_{1n}(\theta ,s)=O_p(\Vert \theta -\theta _{*}(s)\Vert /\sqrt{n})+o_p(\Vert \theta -\theta _{*}(s)\Vert ^2)\). We now consider \(S_{2n}\). For each \(s\in \mathcal S _0\) and \(x\in \mathcal X \), let \(\phi _s(x):=\nabla _\theta r_{\theta _{*}(s)}(x)^{\prime }W\nabla _\theta r_{\theta _{*}(s)}(x)\). Note that
where the last inequality follows from Lemma B.1 of Ichimura and Lee (2010). Now, Markov’s inequality, Lemma A.4, and Assumption 3.4 (ii) ensure that \(S_{2n}=O_p(\epsilon _n)\), where \(\epsilon _n=n^{-1/2}\delta _{1n}^2\).
We further set \(c_n=0\). Note that the estimator defined in (17) with \(c_n=0\) equals the set estimator \(\hat{\Theta }_n=\{\theta \in \Theta :G_n(\theta ,s)\le \inf _{\theta ^{\prime }\in \Theta }G_n(\theta ^{\prime },s),s\in \hat{\mathcal S }_n\}.\) By Assumption 3.7 and Step 4 of the proof of Theorem 3.1, we may take \(\delta _{1n}=O_p(n^{-1/4})\) as an initial rate. Lemma A.2 then implies that \(\vec d_H(\Theta _{*},\hat{\Theta }_n)=O_p(\epsilon ^{1/2}_n)\), where \(\epsilon _n=O_p(n^{-1/2}\delta _{1n}^2)=O_p(n^{-1})\). Thus, \(\vec d_H(\Theta _{*},\hat{\Theta }_n)=O_p(n^{-1/2})\).
Now we consider \(\vec d_H(\hat{\Theta }_n,\Theta _{*})\). We show that (A.28) holds for \(G_n\). For each \(\theta \) and \(s\), let \(L_n(\theta ,s):=\frac{1}{n}\sum _{i=1}^{n}(s(X_{i})-r_{\theta }(X_{i}))^{\prime }W(s(X_{i})-r_{\theta }(X_{i}))\). Let \(s\in \mathcal S ^{\bar{\eta }}_0\) and \(\theta \in U^{\delta _{1n}}(s)\). A second-order Taylor expansion of \(G_n(\theta ,s)=L_n(\theta ,s)-L_n(\theta _{*}(s),s)\) with respect to \(\theta \) at \(\theta _{*}(s)\) gives
with probability approaching 1 for some \(\kappa _2>0\), where \(\bar{\theta }_n(s)\) is a point on the line segment that connects \(\theta \) and \(\theta _{*}(s)\). The last inequality follows from Step 3 of the proof of Theorem 3.1 and Assumption 3.5.
Set \(\tilde{c}_n=0\). Then, Lemma A.3 implies \(\vec d_H(\hat{\Theta }_n,\Theta _{*})=O_p(\delta _{1n}^{1/2}/n^{1/4})\). Setting \(\delta _{1n}=O_p(n^{-1/4})\) refines this rate to \(O_p(n^{-3/8})\). Repeated applications of Lemma A.3 then imply \(\vec d_H(\hat{\Theta }_n,\Theta _{*})=O_p(n^{-1/2})\). Since both directed Hausdorff distances converge to 0 at the stochastic order \(n^{-1/2}\), the claim of the theorem follows. \(\square \)
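The exponent arithmetic behind the repeated application of Lemma A.3 can be checked numerically. This is a sketch: the recursion \(a\mapsto a/2+1/4\) below is our reading of the update \(O_p(n^{-a})\mapsto O_p(\delta _{1n}^{1/2}/n^{1/4})=O_p(n^{-(a/2+1/4)})\), and \(1/2\) is its fixed point.

```python
# Rate-refinement iteration: starting from the initial rate delta_{1n} = O_p(n^{-a})
# with a = 1/4, each application of Lemma A.3 maps the exponent a to a/2 + 1/4
# (the exponent of delta^{1/2} / n^{1/4}). The iterates converge to the fixed
# point 1/2, i.e. the O_p(n^{-1/2}) rate claimed in the theorem.
a = 0.25
history = [a]
for _ in range(20):
    a = a / 2 + 0.25
    history.append(a)

print(history[:4])          # first few exponents: 0.25, 0.375, 0.4375, 0.46875
print(abs(a - 0.5) < 1e-6)  # True: the iteration has converged to 1/2
```

The contraction factor is \(1/2\) per step, so the exponent error \(|a_k-1/2|=2^{-k}/4\) vanishes geometrically, which is why finitely many refinements already give the first rate improvement and the limit justifies the \(n^{-1/2}\) claim.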
Lemma A.4
Suppose Assumptions 3.2 and 3.4 hold. Then \(\Phi \) is a \(P_0\)-Donsker class.
Proof of Lemma A.4
The proof of Theorem 3.1 shows that each \(f_s\in \Phi \) is Lipschitz in \(s\). For any \(\epsilon >0\), Assumption 3.2 (ii)–(iii), Theorems 2.7.11 and 2.7.2 in van der Vaart and Wellner (1996), and (A.12) imply
where \(C\) is a constant that depends only on \(k,\gamma ,L\), and \({\text{ diam}}(\mathcal X )\). Thus, for any \(\delta >0\),
Example 2.14.4 in van der Vaart and Wellner (1996) then ensures that \(\Phi \) is \(P_0\)-Donsker. \(\square \)
1.5 First Stage Estimation
In what follows, we work with the following population criterion function: for each \(s\in \mathcal S \), let \(\mathcal Q \) be defined by
Lemma A.5
Suppose that Assumption 3.9 (i) holds. Let the criterion function be given as in (A.40). Then, there exists a positive constant \(C_{2}\) such that
Proof of Lemma A.5
Let \(s\in \mathcal S \) be arbitrary. For any \(s_0\in {\mathcal S }_0\), \(E[\varphi ^{(j)}(X,s_0)]\le 0\) for \(j=1,{\ldots }, l\). Let \(V\) be an open set that contains \(s\) and \(s_0\). By Assumption 3.9 (i) and Theorem 1.7 in Lindenstrauss et al. (2007), it holds that
where \(\tilde{V}_j:=\{g\in V:\dot{\varphi }^{(j)}_{g}\;{\text{ exists}}\}\). Let \(C_2:=\sum _{j=1}^l \sup _{g \in \mathcal S }\Vert \dot{\varphi }^{(j)}_{g}\Vert ^2_{op}\). It holds that \(0<C_2<\infty \) by the hypothesis. We thus obtain
for all \(s_0\in \mathcal S _0\). Note that \(s_0\mapsto \Vert s-s_0\Vert _W\) is continuous and \(\mathcal S _0\) is compact by Assumption 3.2 (ii)–(iii) and Assumption 3.10 (i). Taking infimum over \(\mathcal S _0\) then ensures the desired result. \(\square \)
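The chain of inequalities behind \(C_2\) can be sketched as follows (our reconstruction, taking \(\mathcal Q (s)=\sum _{j=1}^l(E[\varphi ^{(j)}(X,s)])_+^2\) as in (A.40) and reading the supremum as taken over operator norms): since \(E[\varphi ^{(j)}(X,s_0)]\le 0\),

```latex
\bigl(E[\varphi^{(j)}(X,s)]\bigr)_+
  \le \bigl(E[\varphi^{(j)}(X,s)]-E[\varphi^{(j)}(X,s_0)]\bigr)_+
  \le \sup_{g\in\mathcal{S}}\bigl\Vert \dot{\varphi}^{(j)}_{g}\bigr\Vert_{op}\,
      \Vert s-s_0\Vert_W,
\qquad\text{hence}\qquad
\mathcal{Q}(s) \le C_2\,\Vert s-s_0\Vert_W^2.
```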
Lemma A.6
Suppose Assumption 3.9 (ii) holds. Let the criterion function be given as in (A.40). Then there exists a positive constant \(C\) such that
Proof of Lemma A.6
If \(s\in \mathcal S _0\), the conclusion is immediate. Suppose that \(s\notin \mathcal S _0\). By Assumption 3.9 (ii), there exists \(s_0\in \mathcal S _0\) such that
Let \(C_3:= C_j\). Thus, the claim of the lemma follows. \(\square \)
In the following, let \(\mathcal G :=\{g:g(x)=\varphi _{s}^{(j)}(x),s\in \mathcal S ,j=1,{\ldots } ,l\}\).
Lemma A.7
Suppose Assumptions 3.2, 3.4 , and 3.8 hold. Then \(\mathcal G \) is a \(P_0\)-Donsker class.
Proof of Lemma A.7
By Assumption 3.8, \(\varphi ^{(j)}_s\) is Lipschitz in \(s\). The rest of the proof is the same as that of Lemma A.4. \(\square \)
Proof of Theorem 3.3
We establish the claims of the theorem by applying Theorem B.1 in Santos (2011). Note first that Assumption 3.2 (ii)–(iii) and Assumption 3.10 (i) ensure that \(\mathcal S \) is compact. This ensures condition (i) of Theorem B.1 in Santos (2011). Condition (ii) of Theorem B.1 in Santos (2011) is ensured by Assumption 3.10. Lemma A.7 ensures that uniformly over \(\Theta _n\)
Thus, condition (iii) of Theorem B.1 in Santos (2011) holds with \(C_1=1\) and \(c_{2n}=n^{-1}\). Lemma A.5 ensures that \(\mathcal Q (s)\le \inf _{s_0\in \mathcal S _0}C_2\Vert s-s_0\Vert _W^2\) for some \(C_2>0\). Thus, condition (iv) of Theorem B.1 in Santos (2011) holds with \(\kappa _1=2\). Now, the first claim of Theorem B.1 in Santos (2011) establishes
Furthermore, Lemma A.6 ensures \(\mathcal Q (s)\ge \inf _{s_0\in \mathcal S _0}C_3\Vert s-s_0\Vert ^2\) for some \(C_3>0\). This ensures condition (v) of Theorem B.1 in Santos (2011) with \(\kappa _2=2\). Now, the second claim of Theorem B.1 in Santos (2011) ensures
Since \((b_n/a_n)^{1/2}/\delta _n\rightarrow \infty \), the claim of the theorem follows. \(\square \)
Proof of Corollary 3.1
In what follows, we explicitly show \(\mathcal Q _{n}\)’s dependence on \(\omega \in \Omega \). Let \(\mathcal Q _{n}:\Omega \times \mathcal S \rightarrow \mathbb R \) be defined by \(\mathcal Q _{n}(\omega ,s)=\sum _{j=1}^l(\frac{1}{n}\sum _{i=1}^{n}\varphi ^{(j)}(X_{i}(\omega ),s))_+^2\). By Assumption 2.3, \(\varphi \) is continuous in \(s\) for every \(x\) and measurable for every \(s\). Also note that \(X_i\) is measurable for every \(i\). Thus, by Lemma 4.51 in Aliprantis and Border (2006), \(\mathcal Q _{n}\) is jointly measurable in \((\omega ,s)\) and lower semicontinuous in \(s\) for every \(\omega \). Note that \(\mathcal S \) is compact by Assumptions 3.2 (ii)–(iii) and 3.10 (i), which implies \(\mathcal S \) is locally compact. Since \(\mathcal S \) is a metric space, it is a Hausdorff space. Thus, by Proposition 5.3.6 in Molchanov (2005), \(\mathcal Q _n\) is a normal integrand defined on a locally compact Hausdorff space. Proposition 5.3.10 in Molchanov (2005) then ensures the first claim.
Now we show the second claim using Theorem 3.3 (i). Assumptions 2.1–2.3 hold with \(\varphi \) defined in (5). Assumption 3.2 holds by our hypothesis with \(\gamma =1\). Assumption 3.3 is also satisfied by the hypothesis. Note that for each \(j\), \(\varphi ^{(j)}(x,s)=(y_L-s(z))1_{A_k}(z)\) or \(\varphi ^{(j)}(x,s)=(s(z)-y_U)1_{A_k}(z)\) for some \(k\in \{1,{\ldots }, K\}\). Without loss of generality, let \(j\) be an index for which \(\varphi ^{(j)}(x,s)=(y_L-s(z))1_{A_k}(z)\) for some Borel set \(A_k\). For any \(s,s^{\prime }\in \mathcal S \),
It is straightforward to show the same result for other indexes. Thus, Assumption 3.8 is satisfied.
Now for \(j\) such that \(\varphi ^{(j)}(x,s)=(y_L-s(z))1_{A_k}(z)\), note that
Thus, the Fréchet derivative is given by \(\dot{\varphi }^{(j)}_s(h)=E[h(Z)(-1_{A_k}(Z))]\). By Proposition 6.13 in Folland (1999), the norm of the operator is given by \(\Vert \dot{\varphi }^{(j)}_s\Vert _{op}=E[|-1_{A_k}(Z)|^2]^{1/2}=P_0(Z\in A_k)^{1/2}>0\), which ensures the boundedness (continuity) of the operator. It is straightforward to show the same result for other indexes. Hence, Assumption 3.9 (i) is satisfied. By construction, Assumption 3.10 (i) is satisfied, and Assumption 3.10 (ii) holds with \(\delta _n\asymp J_n^{-1}\) (see Chen 2007). These ensure the conditions of Theorem 3.3 (i). Thus, the second claim follows.
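As a check on the operator-norm computation above, the bound can be sketched via Cauchy–Schwarz, under the assumption that \(\Vert \cdot \Vert _W\) coincides with the \(L^2(P_0)\) norm on this coordinate:

```latex
\bigl|\dot{\varphi}^{(j)}_s(h)\bigr|
  = \bigl|E[h(Z)(-1_{A_k}(Z))]\bigr|
  \le E[h(Z)^2]^{1/2}\,E[1_{A_k}(Z)^2]^{1/2}
  = \Vert h\Vert_W\; P_0(Z\in A_k)^{1/2},
```

with equality attained at \(h\propto 1_{A_k}\), so \(\Vert \dot{\varphi }^{(j)}_s\Vert _{op}=P_0(Z\in A_k)^{1/2}\), which is strictly positive whenever \(P_0(Z\in A_k)>0\).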
For the third claim, let \(s\in \mathcal S \setminus \mathcal S _0\). Then, there exists \(j\) such that \(E[\varphi ^{(j)}(X_i,s)]>0\). Without loss of generality, suppose that \(E[\varphi ^{(j)}(X_i,s)]=E[(Y_{L,i}-s(Z_i))1_{A_k} (Z_i)]\ge \delta >0\). Let \(s_0\in \mathcal S _0\) be such that
Such \(s_0\) always exists by the intermediate value theorem. Then, for \(j\) with which \(\varphi ^{(j)}(x,s)=(y_L-s(z))1_{A_k}(z)\), it follows that
Thus, we have
where \(C:=\inf _{q\in E}E[q(Z_i)1_{A_k}(Z_i)]\) and \(E:=\{q\in \mathcal S :\Vert q\Vert _W=1,E[q(Z_i)1_{A_k} (Z_i)]>0\}\). Since \(C\) is the minimum value of a linear function over a convex set, it is finite. Furthermore, by the construction of \(E\), it holds that \(C>0\). Thus, Assumption 3.9 (ii) holds, and the third claim follows by Theorem 3.3 (ii). \(\square \)
Proof of Corollary 3.2
We show the claim of the corollary using Theorem 3.2. Note that we have shown, in the proof of Corollary 3.1, that Assumptions 2.1–2.3, 3.2 (i)–(iii), and 3.3 hold. Thus, to apply Theorem 3.2, it remains to show Assumptions 2.4, 3.2 (iv), and 3.4–3.7.
Assumption 2.4 is satisfied by the parameterization \(r_\theta (z)=\theta ^{(1)}+\theta ^{(2)}z\). For Assumption 3.2 (iv), note that \(\mathcal R _\Theta \) is given by
Since \(\Theta \) is convex, for any \(\lambda \in [0,1]\), it holds that \(\lambda r_\theta +(1-\lambda )r_{\theta ^{\prime }}=r_{\lambda \theta +(1-\lambda )\theta ^{\prime }}\in \mathcal R _\Theta \). Thus, Assumption 3.2 (iv) is satisfied. For Assumption 3.4, note first that \(r_\theta \) is twice continuously differentiable on the interior of \(\Theta \). Because \(r_\theta \) is linear, \(\max _{|\alpha |\le 2}|D^{\alpha }_\theta r_\theta (z)-D^{\alpha }_\theta r_{\theta ^{\prime }}(z)|= (1+z^2)^{1/2}\Vert \theta -\theta ^{\prime }\Vert \) by the Cauchy–Schwarz inequality. By the compactness of \(\mathcal Z \), \(C(z):= (1+z^2)^{1/2}\) is bounded. Thus, Assumption 3.4 (i) is satisfied. Similarly, \(\max _{|\alpha |\le 2}\sup _{\theta \in \Theta }|D^\alpha _\theta r_\theta |\le \max \{1,|z|,C(1+z^2)^{1/2}\}=:R(z)\), where \(C:=\sup _{\theta \in \Theta }\Vert \theta \Vert \). By the compactness of \(\mathcal Z \) and \(\Theta \), \(R\) is bounded. Thus, Assumption 3.4 (ii) is satisfied. Note that the Hessian of \(Q(\theta ,s)\) with respect to \(\theta \) is given by \(2E[(1,z)(1,z)^{\prime }]\), which depends on neither \(\theta \) nor \(s\) and is positive definite by the assumption that \(Var(Z)>0\). Thus, Assumption 3.5 is satisfied. Assumptions 3.6 and 3.7 are ensured by Corollary 3.1. Now the conditions of Theorem 3.2 are satisfied. Thus, the claim of the corollary follows. \(\square \)
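The positive-definiteness check for Assumption 3.5 can be made explicit. Under the linear specification \(r_\theta (z)=\theta ^{(1)}+\theta ^{(2)}z\), the Hessian expression given in the proof satisfies

```latex
\nabla^2_\theta Q(\theta,s) = 2\,E\!\begin{bmatrix} 1 & Z \\ Z & Z^2 \end{bmatrix},
\qquad
\det E\!\begin{bmatrix} 1 & Z \\ Z & Z^2 \end{bmatrix}
  = E[Z^2]-(E[Z])^2 = \operatorname{Var}(Z) > 0,
```

so the Hessian has positive trace and positive determinant, hence is positive definite, precisely when \(Var(Z)>0\).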
© 2013 Springer Science+Business Media New York
Kaido, H., White, H. (2013). Estimating Misspecified Moment Inequality Models. In: Chen, X., Swanson, N. (eds) Recent Advances and Future Directions in Causality, Prediction, and Specification Analysis. Springer, New York, NY. https://doi.org/10.1007/978-1-4614-1653-1_13
Print ISBN: 978-1-4614-1652-4
Online ISBN: 978-1-4614-1653-1