
1 Introduction

We consider a moment condition model, namely a family \(\mathcal {M}\) of probability measures Q, all defined on the same measurable space \(({\mathbb {R}^m},\mathcal {B}({\mathbb {R}^m}))\), such that \(\int _{\mathbb {R}^m} g(x,\theta )\,dQ(x) = 0\) for some \(\theta \in \varTheta \). The unknown parameter \(\theta \) belongs to the interior of a compact set \(\varTheta \subset \mathbb {R}^{d}\), and the function \(g:=(g_{1},\ldots ,g_{\ell })^{\top }\), with \(\ell \ge d\), is defined on the set \({\mathbb {R}^m}\times \varTheta \), each \(g_{i}\) being a real-valued function. Denoting by M the set of all probability measures on \(({\mathbb {R}^m},\mathcal {B}(\mathbb {R}^m))\) and defining the sets

$$\begin{aligned} \mathcal {M}_{\theta } := \left\{ Q\in M \text { s.t. } \int _{\mathbb {R}^m} g(x,\theta )\, dQ(x) = 0\right\} ,\; \theta \in \varTheta , \end{aligned}$$

the moment condition model \(\mathcal {M}\) can be written in the form

$$\begin{aligned} \mathcal {M} = \bigcup _{\theta \in \varTheta }\mathcal {M}_{\theta }. \end{aligned}$$
(1)

Let \(X_1,\ldots ,X_n\) be an i.i.d. sample with unknown probability measure \(P_{0}\). We assume that the equation \(\int _{\mathbb {R}^m} g(x,\theta )\,dP_0(x) = 0\) has a unique solution (in \(\theta \)), which will be denoted \(\theta _0\). We consider the problem of estimating the true unknown value \(\theta _{0}\).

Among the best-known methods for estimating the parameter \(\theta _{0}\), we recall the Generalized Method of Moments (GMM) of [6], the Continuous Updating (CU) estimator of [7], the Empirical Likelihood (EL) estimator of [8, 14, 15], the Exponential Tilting (ET) of [11], as well as the Generalized Empirical Likelihood (GEL) class of estimators of [13], which contains in particular the EL, ET and CU estimators. Alternative methods have been proposed in order to improve the finite-sample accuracy or the robustness under misspecification of the model, for example in [4, 8, 11, 12, 16].

The authors of [3] have developed a general methodology for estimation and testing in moment condition models. Their approach is based on minimizing divergences in dual form and allows the asymptotic study of the estimators (called minimum empirical divergence estimators) and of the associated test statistics, both under the model and under misspecification of the model. Using the approach based on the influence function, [18] studied robustness properties for these classes of estimators and test statistics, showing that the minimum empirical divergence estimators of the parameter \(\theta _{0}\) of the model are generally not robust. This approach based on divergences and duality was initially used for parametric models; the corresponding results were published in [2, 19, 20].

The classical EL estimator represents a particular case of the class of estimators from [3], namely the one obtained when using the modified Kullback-Leibler divergence. Although the EL estimator is superior to the other estimators mentioned above in terms of higher-order asymptotic efficiency, this property holds only when the moment conditions are correctly specified. It is known that the EL estimator and the EL ratio test for moment condition models are not robust with respect to the presence of outliers in the sample. Moreover, [17] showed that, when the support of the p.m. corresponding to the model and the orthogonality functions are not bounded, the EL estimator is not root-n consistent under misspecification.

In this paper, we present a robust version of the EL estimator for moment condition models. This estimator is defined by minimizing an empirical version of the modified Kullback-Leibler divergence in dual form, using truncated orthogonality functions. For this estimator, we present asymptotic properties regarding both consistency and limit laws. The robust EL estimator is root-n consistent even under misspecification, which provides a solution to the problem pointed out in [17] for the EL estimator.

2 A Robust Version of the Empirical Likelihood Estimator

Let \(\left\{ P_{\theta }; \theta \in \varTheta \right\} \) be an identifiable reference model, containing probability measures such that, for each \(\theta \in \varTheta \), \(P_{\theta }\in \mathcal {M}_{\theta }\), meaning that \(\int _{\mathbb {R}^m} g(x,\theta )\,dP_{\theta } (x)=0\) and \(\theta \) is the unique solution of this equation. We assume that the p.m. \(P_{0}\) of the data, corresponding to the true unknown value \(\theta _{0}\) of the parameter to be estimated, belongs to this reference model. The reference model will be associated with the truncated orthogonality function \(g_c\), defined hereafter, that will be used in the definition of the robust version of the EL estimator of the parameter \(\theta _{0}\). We use the notation \(\Vert \cdot \Vert \) for the Euclidean norm. As in [16], using the reference model \(\left\{ P_{\theta };\, \theta \in \varTheta \right\} \), define the function \(g_{c}: \mathbb {R}^{m}\times \varTheta \rightarrow \mathbb {R}^{\ell }\),

$$\begin{aligned} g_{c}(x,\theta ) := H_{c}\left( A_\theta \, \left[ g(x,\theta )-\tau _\theta \right] \right) , \end{aligned}$$
(2)

where \(H_{c}:\mathbb {R}^{\ell }\rightarrow \mathbb {R}^{\ell }\) is the Huber’s function

$$\begin{aligned} H_{c}(y) := \left\{ \begin{array}{l} y \cdot \min \left( 1,\frac{c}{\Vert y\Vert }\right) \, \text { if }\, y\ne 0, \\ 0 \, \text { if } \, y=0, \end{array}\right. \end{aligned}$$
(3)

and \(A_\theta \), \(\tau _\theta \) are determined by the solutions of the system of implicit equations

$$\begin{aligned} \left\{ \begin{array}{l} \int g_{c}(x,\theta )\, dP_{\theta }(x) = 0, \\ \int g_{c}(x,\theta )\, g_{c}(x,\theta )^{\top }\,dP_\theta (x) = I_\ell , \end{array}\right. \end{aligned}$$
(4)

where \(I_\ell \) is the \(\ell \times \ell \) identity matrix and \(c >0\) is a given constant. Therefore, we have \(\Vert g_{c}(x,\theta )\Vert \le c\), for all x and \(\theta \). We also use the function

$$\begin{aligned} h_{c}(x,\theta ,A,\tau ):=H_{c}\left( A\, [g(x,\theta )-\tau ]\right) , \end{aligned}$$
(5)

whenever we need to make explicit the dependence on the \(\ell \times \ell \) matrix A and on the \(\ell \)-dimensional vector \(\tau \). Then,

$$\begin{aligned} g_{c}(x,\theta ) = h_{c}(x,\theta ,A_\theta ,\tau _\theta ), \end{aligned}$$
(6)

where \(A_\theta \) and \(\tau _\theta \) are the solution of (4). For given \(P_{\theta }\) from the reference model, the triplet \((\theta , A_\theta , \tau _\theta )\) is the unique solution of the system

$$\begin{aligned} \left\{ \begin{array}{l} \int g(x,\theta )\, dP_{\theta }(x)=0, \\ \int g_{c}(x,\theta )\, dP_{\theta }(x)=0, \\ \int g_{c}(x,\theta )\, g_{c}(x,\theta )^{\top }\,dP_\theta (x)=I_\ell . \end{array}\right. \end{aligned}$$
(7)

The uniqueness is justified in [16], p. 48.
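
To fix ideas, a minimal numerical sketch of the truncation in (2)–(6) is given below, in Python. The function names huber and h_c, and the way the pair \((A_\theta ,\tau _\theta )\) is passed as an argument, are our own illustrative choices; in practice, \((A_\theta ,\tau _\theta )\) has to be obtained beforehand by numerically solving the implicit system (4) under the reference measure \(P_{\theta }\).

```python
import numpy as np

def huber(y, c):
    """Huber's function H_c of (3): leaves y unchanged if ||y|| <= c,
    otherwise shrinks it radially so that its norm equals c."""
    norm = np.linalg.norm(y)
    if norm == 0.0:
        return np.zeros_like(y)
    return y * min(1.0, c / norm)

def h_c(x, theta, A, tau, g, c):
    """Truncation map h_c(x, theta, A, tau) of (5); g(x, theta) is the original
    (possibly unbounded) orthogonality function, A an ell x ell matrix, tau an ell-vector."""
    return huber(A @ (g(x, theta) - tau), c)
```

With \((A_\theta ,\tau _\theta )\) solving (4), h_c(x, theta, A_theta, tau_theta, g, c) returns \(g_c(x,\theta )\) of (6), whose norm never exceeds c.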

In what follows, we will use the so-called modified Kullback-Leibler divergence between probability measures, say Q and P, defined by

$$\begin{aligned} KL_m(Q,P):=\int _{\mathbb {R}^m} \varphi \left( \frac{dQ}{dP}(x)\right) \, dP(x), \end{aligned}$$
(8)

if Q is absolutely continuous with respect to P, and \(KL_m(Q,P):=+\infty \) otherwise. The strictly convex function \(\varphi \) is defined by \(\varphi (x):=-\log x+x-1\) if \(x>0\), and \(\varphi (x):=+\infty \) if \(x\le 0\). A straightforward calculation (written out below) shows that the convex conjugate of the convex function \(\varphi \) is \(\psi (u)=-\log (1-u)\) if \(u<1\), and \(\psi (u)=+\infty \) if \(u\ge 1\). Recall that the Kullback-Leibler divergence, between any probability measures Q and P, is defined by

$$\begin{aligned} KL(Q,P):=\int _{\mathbb {R}^m} \varphi \left( \frac{dQ}{dP}(x)\right) \, dP(x), \end{aligned}$$

if Q is absolutely continuous with respect to P, and \(KL(Q,P):=+\infty \) otherwise. Here, the strictly convex function \(\varphi \) is defined by \(\varphi (x):=x\log x -x +1\) if \(x\ge 0\), and \(\varphi (x) := +\infty \) if \(x<0\). Notice also that \(KL_m(Q,P)=KL(P,Q)\), for all probability measures Q and P. For any subset \(\varOmega \) of M, we define the \(KL_m\)-divergence between \(\varOmega \) and any probability measure P by

$$KL_m(\varOmega ,P) := \inf _{Q\in \varOmega } KL_m(Q,P).$$
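
For completeness, the conjugate computation announced above can be written out; this short verification is not part of the original development. For \(u<1\), the first-order condition in \(\sup _{x>0}\{ux-\varphi (x)\}\) gives \(x=1/(1-u)\), so that

$$\begin{aligned} \psi (u)=\sup _{x>0}\left\{ ux+\log x-x+1\right\} = \frac{u}{1-u}+\log \frac{1}{1-u}-\frac{1}{1-u}+1 = -\log (1-u), \end{aligned}$$

while for \(u\ge 1\) the expression \(ux+\log x-x+1\) tends to \(+\infty \) as \(x\rightarrow \infty \), whence \(\psi (u)=+\infty \).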

Define the moment condition model

$$\begin{aligned} \mathcal {M}_c := \bigcup _{\theta \in \varTheta } \mathcal {M}_{c, \theta } := \bigcup _{\theta \in \varTheta }\left\{ Q\in M \text { s.t. } \int _{\mathbb {R}^m} g_c(x,\theta )\, dQ(x)=0\right\} . \end{aligned}$$
(9)

For any \(\theta \in \varTheta \), define the set

$$\varLambda _{c,\theta }:=\varLambda _{c,\theta }(P_0) := \left\{ t\in \mathbb {R}^{\ell }\text { s.t. } \int _{\mathbb {R}^m} |\psi \left( t^{\top }\,g_{c}(x,\theta )\right) |\, dP_0(x)<\infty \right\} .$$

Since \(g_{c}(x,\theta )\) is bounded (with respect to x), by Theorem 1.1 in [1] and Proposition 4.2 in [3], the following dual representation of the divergence holds:

$$\begin{aligned} KL_m(\mathcal {M}_{c,\theta }, P_0) = \sup _{t\in \varLambda _{c,\theta }} \int _{\mathbb {R}^m} m_{c}(x, \theta , t)\, dP_0(x), \end{aligned}$$
(10)

where

$$\begin{aligned} m_{c}(x,\theta , t):=-\psi (t^{\top } g_{c}(x,\theta ))= \log (1-t^{\top } g_{c}(x,\theta )), \end{aligned}$$
(11)

and the supremum in (10) is attained, provided that \(KL_m(\mathcal {M}_{c,\theta },P_0)\) is finite. Moreover, the maximizer in (10) is unique under the following assumption:

$$\begin{aligned} P_0\left( \{x\in \mathbb {R}^{m} \text { s.t. } \overline{t}^{\top }\,\overline{g}_{c}(x,\theta )\ne 0\}\right) >0, \, \text { for all } \overline{t}\in \mathbb {R}^{1+\ell }\backslash \{0\}, \end{aligned}$$
(12)

where \(\overline{t}:=(t_0,t_1,\ldots ,t_\ell )^\top \) and \(\overline{g}_c:=(g_0,g_{c,1},\ldots , g_{c,\ell })^\top \). This last condition is satisfied if the functions \(g_0(\cdot ):=\mathbf {1}_{\mathbb {R}^{m}}(\cdot ),g_{c,1}(\cdot ,\theta ),\dots ,g_{c,\ell }(\cdot ,\theta )\) are linearly independent and \(P_0\) is not degenerate. The empirical measure associated with the sample \(X_1,\ldots , X_n\) is defined by

$$P_n(\cdot ) := \frac{1}{n}\sum _{i=1}^n \delta _{X_i}(\cdot ),$$

\(\delta _{x}(\cdot )\) being the Dirac measure at the point x. Denote

$$\begin{aligned} \varLambda _{c,\theta , n} := \varLambda _{c,\theta }(P_n) &= \left\{ t\in \mathbb {R}^{\ell }\text { s.t. } \int _{\mathbb {R}^m} \left| \psi \left( t^{\top }\,g_{c}(x,\theta )\right) \right| \, dP_n(x)<\infty \right\} \\ &= \left\{ t\in \mathbb {R}^{\ell }\text { s.t. } \frac{1}{n}\sum _{i=1}^n \left| \log \left( 1-\sum _{j=1}^\ell t_j \, g_{c,j}(X_i,\theta )\right) \right| <\infty \right\} . \end{aligned}$$
(13)

In view of relation (10), for given \(\theta \in \varTheta \), a natural estimator of

$$\begin{aligned} t_{c, \theta } := \underset{t\in \varLambda _{c,\theta }}{\arg \sup } \, \int m_{c} (x,\theta ,t)\,dP_0(x), \end{aligned}$$
(14)

can be defined by “plug-in” as follows

$$\begin{aligned} \widehat{t}_{c,\theta } := \underset{t\in \varLambda _{c,\theta ,n}}{\arg \sup } \, \int m_{c}(x,\theta ,t) \,dP_{n}(x). \end{aligned}$$
(15)
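
For a fixed \(\theta \), the inner maximization (15) can be carried out numerically. The following Python sketch assumes that the rows of the matrix G contain the values \(g_c(X_i,\theta )\) (for instance computed with the h_c routine sketched earlier) and uses a generic derivative-free optimizer; this is only one possible illustration, not the authors' implementation.

```python
import numpy as np
from scipy.optimize import minimize

def dual_criterion(t, G):
    """Empirical dual criterion at t: (1/n) sum_i log(1 - t' g_c(X_i, theta)).
    G is the n x ell matrix with rows g_c(X_i, theta); the value is -inf
    whenever some argument of the logarithm is nonpositive."""
    u = 1.0 - G @ t
    if np.any(u <= 0.0):
        return -np.inf
    return np.mean(np.log(u))

def t_hat(G):
    """Plug-in estimator t_hat_{c,theta} of (15), together with the attained value."""
    ell = G.shape[1]
    res = minimize(lambda t: -dual_criterion(t, G),
                   x0=np.zeros(ell), method="Nelder-Mead")
    return res.x, -res.fun
```

Since \(\Vert g_c\Vert \le c\), the criterion equals 0 at \(t=0\), so the zero vector is a natural starting point; the value attained at the maximizer is precisely the estimator \(\widehat{KL_m}(\mathcal {M}_{c,\theta },P_0)\) introduced next in (16).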

A “dual” plug-in estimator of the modified Kullback-Leibler divergence, between \(\mathcal {M}_{c,\theta }\) and \(P_0\), can then be defined by

$$\begin{aligned} \widehat{KL_m}(\mathcal {M}_{c,\theta },P_0) &:= \sup _{t\in \varLambda _{c, \theta , n}} \int m_{c}(x,\theta ,t)\,dP_{n}(x) \\ &= \sup _{(t_1, \ldots , t_\ell )\in \mathbb {R}^\ell } \left\{ \frac{1}{n}\sum _{i=1}^n\overline{\log }\left( 1-\sum _{j=1}^\ell t_j\,g_{c,j}(X_i,\theta )\right) \right\} , \end{aligned}$$
(16)

where \(\overline{\log }(\cdot )\) is the extended logarithm function, i.e., the function defined by \(\overline{\log }(u) = \log (u)\) if \(u>0\), and \(\overline{\log }(u) = -\infty \) if \(u\le 0\). Hence,

$$\begin{aligned} KL_m(\mathcal {M}_c, P_0) := \inf _{\theta \in \varTheta }\, KL_m(\mathcal {M}_{c,\theta }, P_0) \end{aligned}$$
(17)

can be estimated by

$$\begin{aligned} \widehat{KL_m}(\mathcal {M}_c, P_0) &:= \inf _{\theta \in \varTheta } \, \widehat{KL_m}(\mathcal {M}_{c,\theta },P_0) \\ &= \inf _{\theta \in \varTheta } \, \sup _{(t_1, \ldots , t_\ell )\in \mathbb {R}^\ell } \left\{ \frac{1}{n}\sum _{i=1}^n\overline{\log }\left( 1-\sum _{j=1}^\ell t_j\,g_{c,j}(X_i,\theta )\right) \right\} . \end{aligned}$$
(18)

Since \(\theta _0 = \underset{\theta \in \varTheta }{\arg \inf }\, KL_m(\mathcal {M}_{c,\theta }, P_0)\), where the infimum is uniquely attained, we then propose to estimate \(\theta _0\) by

$$\begin{aligned} \widehat{\theta }_{c} &:= \underset{\theta \in \varTheta }{\arg \inf }\, \sup _{t\in \varLambda _{c,\theta ,n}} \, \int m_{c}(x,\theta ,t)\,dP_{n}(x) \\ &= \underset{\theta \in \varTheta }{\arg \inf } \, \sup _{(t_1,\ldots ,t_\ell )\in \mathbb {R}^\ell } \, \left\{ \frac{1}{n}\sum _{i=1}^n\overline{\log }\left( 1-\sum _{j=1}^\ell t_j\,g_{c,j}(X_i,\theta )\right) \right\} , \end{aligned}$$
(19)

which can be seen as a “robust” version of the classical EL estimator. Recall that the EL estimator, see e.g. [14], can be written as

$$\widehat{\theta } = \underset{\theta \in \varTheta }{\arg \inf }\, \sup _{(t_1,\ldots ,t_\ell )\in \mathbb {R}^\ell } \, \left\{ \frac{1}{n}\sum _{i=1}^n\overline{\log }\left( 1-\sum _{j=1}^\ell t_j\,g_{j}(X_i,\theta )\right) \right\} .$$
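
Continuing the previous sketch (and reusing its imports and the t_hat routine), the robust EL estimator (19) and the divergence estimator (18) can be approximated by profiling the inner supremum over t and then minimizing in \(\theta \). The helper g_c_matrix(theta, X), returning the \(n\times \ell \) matrix with rows \(g_c(X_i,\theta )\), is an assumed user-supplied routine, and the compactness constraint \(\theta \in \varTheta \) is not enforced in this illustration.

```python
def profile_value(theta, X, g_c_matrix):
    """Inner supremum in (19) at a fixed theta."""
    G = g_c_matrix(np.atleast_1d(theta), X)   # assumed helper: n x ell matrix of g_c(X_i, theta)
    _, value = t_hat(G)
    return value

def robust_el(X, g_c_matrix, theta_start):
    """Robust EL estimator (19) and the associated estimate (18) of KL_m(M_c, P_0)."""
    res = minimize(lambda th: profile_value(th, X, g_c_matrix),
                   x0=np.atleast_1d(theta_start), method="Nelder-Mead")
    return res.x, res.fun
```

The returned pair is \((\widehat{\theta }_c,\, \widehat{KL_m}(\mathcal {M}_c,P_0))\), up to the numerical accuracy of the nested optimizations.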

A slightly different definition of an estimator for the parameter \(\theta _{0}\) was introduced in [10], where robustness and consistency properties are also stated. However, the limiting distribution of the estimator in [10] is not standard and is not easy to obtain, because the bounded orthogonality functions used there depend on both \(\theta \) and the data. The present version is simpler and does not present this difficulty. In the following subsections, we give the influence function of the estimator (19) and state both the consistency and the limiting distributions of the proposed estimators (15), (16), (18) and (19).

2.1 Robustness Property

The classical EL estimator of the parameter \(\theta _{0}\) of a moment condition model can be obtained as a particular case of the class of minimum empirical divergence estimators introduced by [3]. [18] showed that the influence functions of the estimators from this class, in particular the influence function of the EL estimator, are proportional to the orthogonality function \(g(x,\theta _{0})\) of the model. These influence functions also coincide with the influence function of the GMM estimator obtained by [16]. Therefore, when \(g(x,\theta )\) is not bounded in x, all these estimators, and in particular the EL estimator of \(\theta _{0}\), are not robust.

Denote by \(T_c(\cdot )\) the statistical functional associated with the estimator \(\widehat{\theta }_c\), so that \(\widehat{\theta }_c = T_c(P_n)\). The influence function \(\mathrm {IF}(x;T_{c},P_{0})\) of \(T_c\) at \(P_0\) is defined by

$$\mathrm {IF}(x;T_{c},P_{0}) := \left. \frac{\partial }{\partial \varepsilon }T_c(P_{\varepsilon ,x})\right| _{\varepsilon =0},$$

where \(P_{\varepsilon ,x}(\cdot ) = (1-\varepsilon )\, P_0(\cdot ) + \varepsilon \, \delta _x(\cdot )\), \(\varepsilon \in \,]0,1[\); see e.g. [5]. The influence function \(\mathrm {IF}(x;T_{c},P_{0})\) of the estimator \(\widehat{\theta }_{c}\) presented in this paper is linearly related to the bounded function \(g_{c}(x,\theta _{0})\); more precisely, the following result holds:

$$\begin{aligned} \mathrm {IF}(x;T_{c},P_{0}) &= \left\{ \left[ \int \frac{\partial }{\partial \theta }g_{c}(y,\theta _{0})\, dP_{0}(y)\right] ^{\top } \int \frac{\partial }{\partial \theta }g_{c}(y,\theta _{0})\, dP_{0}(y)\right\} ^{-1} \\ &\quad \cdot \left[ \int \frac{\partial }{\partial \theta }g_{c}(y,\theta _{0})\,dP_{0}(y)\right] ^{\top }g_{c}(x,\theta _{0}), \end{aligned}$$

which implies the robustness of the estimator \(\widehat{\theta }_{c}\) of the parameter \(\theta _{0}\). The proof of this result is similar to the one presented in [10].
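
An empirical counterpart of this influence function is obtained by replacing \(\theta _0\) with \(\widehat{\theta }_c\) and the integrals with sample means, as in the following sketch; the Jacobian routine dg_c_dtheta, returning the \(\ell \times d\) matrix \(\frac{\partial }{\partial \theta }g_c(x,\theta )\), is an assumed ingredient and not part of the original text.

```python
def empirical_influence(x, theta_hat, X, g_c, dg_c_dtheta):
    """Empirical version of IF(x; T_c, P_0): sample means replace the integrals."""
    D = np.mean([dg_c_dtheta(xi, theta_hat) for xi in X], axis=0)   # ell x d averaged Jacobian
    return np.linalg.solve(D.T @ D, D.T @ g_c(x, theta_hat))        # bounded in x, since g_c is bounded
```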

2.2 Asymptotic Properties

In this subsection, we give the limiting distributions of the proposed estimators, under some regularity assumptions similar to those used by [3].

Proposition 1

For any fixed \(\theta \in \varTheta \), under some regularity assumptions, we have

(1) \(\sqrt{n}(\widehat{t}_{c,\theta } - t_{c,\theta })\) converges in distribution to a centered normal random vector;

(2) If \(P_0\not \in \mathcal {M}_{c,\theta }\), then \(\sqrt{n}(\widehat{KL_m} (\mathcal {M}_{c,\theta }, P_0)-KL_m (\mathcal {M}_{c,\theta }, P_0))\) converges in distribution to a centered normal random variable;

(3) If \(P_0\in \mathcal {M}_{c,\theta }\), then \(2n\, \widehat{KL_m} (\mathcal {M}_{c,\theta }, P_0)\) converges in distribution to a \(\chi ^2(\ell )\) random variable.

Proposition 2

Under some regularity assumptions, we have

(1) \(\sqrt{n}\left( \widehat{\theta }_c - \theta _0\right) \) converges in distribution to a centered normal random vector;

(2) If \(\ell >d\), then \(2n \, \widehat{KL_m}(\mathcal {M}_c,P_0)\) converges in distribution to a \(\chi ^2(\ell - d)\) random variable.
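
As a usage illustration of Proposition 2(2), the statistic \(2n\,\widehat{KL_m}(\mathcal {M}_c,P_0)\) can serve as an over-identification (specification) test of the moment conditions when \(\ell >d\). The sketch below reuses the robust_el routine sketched earlier and compares the statistic with \(\chi ^2(\ell -d)\) quantiles; the level \(\alpha \) and the function name are our own illustrative choices.

```python
from scipy.stats import chi2

def specification_test(X, g_c_matrix, theta_start, ell, d, alpha=0.05):
    """Reject the moment condition model M_c when 2n * KL_m-hat(M_c, P_0)
    exceeds the (1 - alpha) quantile of chi2(ell - d)."""
    n = len(X)
    _, kl_hat = robust_el(X, g_c_matrix, theta_start)
    statistic = 2.0 * n * kl_hat
    p_value = chi2.sf(statistic, df=ell - d)
    return statistic, p_value, p_value < alpha
```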