
The goals of this chapter are to:

  • Consider the problem of parameter estimation by the method of minimal distances,

  • Study the properties of the estimators.

Notation introduced in this chapter:

  • \(w^{\circ }\): Brownian bridge

  • \(F_{\theta }(x) = F(x,\theta )\): distribution function with parameter θ

  • \(p_{\theta }(x) = p(x,\theta )\): density of \(F_{\theta }(x)\)

1 Introduction

In this chapter, we consider minimal distance estimators resulting from using the \(\mathfrak{N}\)-metrics and compare them with classical M-estimators. This chapter, like Chap. 22, is not directly related to quantitative convergence criteria, although it does demonstrate the importance of \(\mathfrak{N}\)-metrics.

2 Estimating a Location Parameter: First Approach

Let us begin by considering a simple case of estimating a one-dimensional location parameter. Assume that

$$\mathcal{L}(x,y) = \mathcal{L}(x - y)$$

is a strongly negative definite kernel and

$$N(F,G) = -\int \nolimits \limits _{-\infty }^{\infty }\int \nolimits \limits _{-\infty }^{\infty }\mathcal{L}(x,y)\mathrm{d}R(x)\mathrm{d}R(y),\,\,\,R = F - G,$$

is the corresponding functional defined on the class of distribution functions (DFs). As we noted in Chap. 22, \(\mathfrak{N}(F,G) = {\mathcal{N}}^{1/2}(F,G)\) is a distance on the class \(\mathbf{B}(\mathcal{L})\) of DFs under the condition

$$\int \nolimits \limits _{-\infty }^{\infty }\int \nolimits \limits _{-\infty }^{\infty }\mathcal{L}(x,y)\mathrm{d}F(x)\mathrm{d}F(y) < \infty.$$

Suppose that \(x_{1},\ldots ,x_{n}\) is a random sample from a population with DF \(F_{\theta }(x) = F(x - \theta )\), where \(\theta \in \Theta \subset {\mathbb{R}}^{1}\) is an unknown parameter (Θ is some interval, which may be infinite). Assume that there exists a density p(x) of F(x) (with respect to the Lebesgue measure). Let \(F_{n}^{\ast}(x)\) be the empirical distribution based on the random sample, and let \(\theta ^{\ast}\) be a minimum distance estimator of θ, so that

$$N(F_{n}^{\ast},F_{\theta ^{\ast}}) =\min \limits _{\theta \in \Theta }N(F_{n}^{\ast},F_{\theta })$$
(23.2.1)

or

$${\theta }^{{_\ast}} = \mbox{ argmin}_{ \theta \in \Theta }N(F_{n}^{{_\ast}},F_{ \theta }).$$
(23.2.2)

We have

$$\begin{array}{rcl} N(F_{n}^{{_\ast}},F_{ \theta })& =& \frac{2} {n}\sum \limits _{j=1}^{n} \int \nolimits \limits _{-\infty }^{\infty }\mathcal{L}(x_{ j} - \theta - y)p(y)\mathrm{d}y \\ & & - \frac{1} {{n}^{2}} \sum \limits _{ij}\mathcal{L}(x_{i} - x_{j}) \\ & & -\int \nolimits \limits _{-\infty }^{\infty }\int \nolimits \limits _{-\infty }^{\infty }\mathcal{L}(x - y)p(x)p(y)\mathrm{d}x\mathrm{d}y.\end{array}$$
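
Every term in this expansion can be computed directly from the sample, so \({\theta }^{\ast}\) can be found by a one-dimensional numerical minimization. The following sketch is purely illustrative (it is not part of the text, and its helper names and numerical values are arbitrary); it assumes the kernel \(\mathcal{L}(x,y) = \vert x - y\vert \) and a standard normal density p, and the three summands mirror the three terms displayed above.

```python
import numpy as np
from scipy.stats import norm
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(0)
theta_true = 2.0                                      # assumed value for the example
x = rng.normal(loc=theta_true, scale=1.0, size=200)   # sample from F(x - theta)

def mean_abs_dev(c):
    # E|c - Y| for Y ~ N(0, 1), in closed form: 2*phi(c) + c*(2*Phi(c) - 1)
    return 2.0 * norm.pdf(c) + c * (2.0 * norm.cdf(c) - 1.0)

def N_emp(theta):
    n = len(x)
    cross = 2.0 / n * np.sum(mean_abs_dev(x - theta))   # first term of the expansion
    within = np.abs(x[:, None] - x[None, :]).sum() / n**2
    model = 2.0 / np.sqrt(np.pi)                        # E|Y - Y'| for Y, Y' ~ N(0, 1)
    return cross - within - model

res = minimize_scalar(N_emp, bounds=(x.min(), x.max()), method="bounded")
print("theta* =", res.x)                               # close to theta_true
```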

Suppose that \(\mathcal{L}(u)\) is differentiable and \(\mathcal{L}\) and p are such that

$$\begin{array}{rcl} \int \nolimits \limits _{-\infty }^{\infty }\mathcal{L}(x)p^{\prime}(x + \theta )\mathrm{d}x& =& \frac{\mathrm{d}} {\mathrm{d}\theta }\int \nolimits \limits _{-\infty }^{\infty }\mathcal{L}(x - \theta )p(x)\mathrm{d}x \\ & =& -\int \nolimits \limits _{-\infty }^{\infty }\mathcal{L}^{\prime}(x - \theta )p(x)\mathrm{d}x.\end{array}$$
(23.2.3)

Then, (23.2.2) implies that \(\theta ^{\ast}\) is the root of

$$\frac{\mathrm{d}} {\mathrm{d}\theta }N(F_{n}^{{_\ast}},F_{ \theta })\vert _{\theta ={\theta }^{{_\ast}}} = 0$$

or

$$\sum \limits _{j=1}^{n} \int \nolimits \limits _{-\infty }^{\infty }\mathcal{L}^{\prime}(x_{ j} - {\theta }^{{_\ast}}- v)p(v)\mathrm{d}v = 0.$$
(23.2.4)

Since the estimator \(\theta ^{\ast}\) satisfies the equation

$$\sum \limits _{j=1}^{n}g_{ 1}(x_{j} - \theta ) = 0,$$
(23.2.5)

where

$$g_{1}(x) = \int \nolimits \limits _{-\infty }^{\infty }\mathcal{L}^{\prime}(x - v)p(v)\mathrm{d}v,$$

it is an M-estimator. It is well known [see, e.g., Huber (1981)] that (23.2.4) [or (23.2.5)] determines a consistent estimator only if

$$\int \nolimits \limits _{-\infty }^{\infty }g_{ 1}(x)p(x)\mathrm{d}x = 0,$$

that is,

$$\int \nolimits \limits _{-\infty }^{\infty }\int \nolimits \limits _{-\infty }^{\infty }\mathcal{L}^{\prime}(u - v)p(u)p(v)\mathrm{d}u\mathrm{d}v = 0.$$
(23.2.6)
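
For illustration, if \(\mathcal{L}(u) = \vert u\vert \) and p is the standard normal density, then \(g_{1}(x) = 2\Phi (x) - 1\), condition (23.2.6) holds because \(g_{1}\) is odd and p is symmetric, and (23.2.5) reduces to a one-dimensional root-finding problem. A minimal sketch under these assumed choices (the sample parameters are arbitrary):

```python
import numpy as np
from scipy.stats import norm
from scipy.optimize import brentq

rng = np.random.default_rng(1)
x = rng.normal(loc=2.0, scale=1.0, size=200)   # assumed location 2.0, unit scale

def estimating_eq(theta):
    # sum_j g1(x_j - theta) with g1(u) = 2*Phi(u) - 1 (kernel |u|, standard normal p)
    return np.sum(2.0 * norm.cdf(x - theta) - 1.0)

# the sum is positive at x.min() and negative at x.max(), so a root is bracketed
theta_star = brentq(estimating_eq, x.min(), x.max())
print("M-estimator root:", theta_star)
```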

We show that if (23.2.3) holds, then (23.2.6) does as well. The integral

$$\int \nolimits \limits _{-\infty }^{\infty }\int \nolimits \limits _{-\infty }^{\infty }\mathcal{L}(u-v)p(u+\theta )p(v+\theta )\mathrm{d}u\mathrm{d}v = \int \nolimits \limits _{-\infty }^{\infty }\mathcal{L}(u-v)p(u)p(v)\mathrm{d}u\mathrm{d}v$$

does not depend on θ. Therefore,

$$\frac{\mathrm{d}} {\mathrm{d}\theta }\int \nolimits \limits _{-\infty }^{\infty }\int \nolimits \limits _{-\infty }^{\infty }\mathcal{L}(u - v)p(u + \theta )p(v + \theta )\mathrm{d}u\mathrm{d}v = 0.$$
(23.2.7)

On the other hand,

$$\begin{array}{rcl} \frac{\mathrm{d}} {\mathrm{d}\theta }\int \nolimits \limits _{-\infty }^{\infty }\int \nolimits \limits _{-\infty }^{\infty }\mathcal{L}(u - v)p(u + \theta )p(v + \theta )\mathrm{d}u\mathrm{d}v&& \\ & =& \int \nolimits \limits _{-\infty }^{\infty }\int \nolimits \limits _{-\infty }^{\infty }\mathcal{L}(u - v)p^{\prime}(u + \theta )p(v + \theta )\mathrm{d}u\mathrm{d}v \\ & & +\int \nolimits \limits _{-\infty }^{\infty }\int \nolimits \limits _{-\infty }^{\infty }\mathcal{L}(u - v)p(u + \theta )p^{\prime}(v + \theta )\mathrm{d}u\mathrm{d}v \\ & =& 2\int \nolimits \limits _{-\infty }^{\infty }\int \nolimits \limits _{-\infty }^{\infty }\mathcal{L}(u - v)p^{\prime}(u + \theta )p(v + \theta )\mathrm{d}u\mathrm{d}v.\end{array}$$

Here, we used the equality \(\mathcal{L}(u - v) = \mathcal{L}(v - u)\). Comparing this with (23.2.7), we find that for θ = 0

$$\int \nolimits \limits _{-\infty }^{\infty }\int \nolimits \limits _{-\infty }^{\infty }\mathcal{L}(u - v)p^{\prime}(u)p(v)\mathrm{d}u\mathrm{d}v = 0.$$
(23.2.8)

However,

$$\begin{array}{rcl} \int \nolimits \limits _{-\infty }^{\infty }\int \nolimits \limits _{-\infty }^{\infty }\mathcal{L}(u - v)p^{\prime}(u)p(v)\mathrm{d}u\mathrm{d}v& =& -\int \nolimits \limits _{-\infty }^{\infty }\left ( \frac{\mathrm{d}} {\mathrm{d}u}\int \nolimits \limits _{-\infty }^{\infty }\mathcal{L}(u - v)p(v)\mathrm{d}v\right )p(u)\mathrm{d}u \\ & =& -\int \nolimits \limits _{-\infty }^{\infty }\int \nolimits \limits _{-\infty }^{\infty }\mathcal{L}^{\prime}(u - v)p(u)p(v)\mathrm{d}u\mathrm{d}v.\end{array}$$

Consequently [see (23.2.8)],

$$\int \nolimits \limits _{-\infty }^{\infty }\int \nolimits \limits _{-\infty }^{\infty }\mathcal{L}^{\prime}(u - v)p(u)p(v)\mathrm{d}u\mathrm{d}v = 0,$$

which proves (23.2.6).

We see that the minimum \(\mathfrak{N}\)-distance estimator is an M-estimator, and the necessary condition for its consistency is automatically fulfilled.

The standard theory of M-estimators shows that the asymptotic variance of \(\theta ^{\ast}\) [i.e., the variance of the limiting distribution of \(\sqrt{n}(\theta ^{\ast} - \theta )\) as \(n \rightarrow \infty \)] is

$$\sigma _{{\theta }^{{_\ast}}}^{2} = \frac{\int \nolimits \limits _{-\infty }^{\infty }{\left [\int \nolimits \limits _{-\infty }^{\infty }\mathcal{L}^{\prime}(u - v)p(v)\mathrm{d}v\right ]}^{2}p(u)\mathrm{d}u} {{\left [\int \nolimits \limits _{-\infty }^{\infty }\int \nolimits \limits _{-\infty }^{\infty }\mathcal{L}^{\prime\prime}(u - v)p(u)p(v)\mathrm{d}u\mathrm{d}v\right ]}^{2}}\,,$$

where we assumed the existence of \(\mathcal{L}^{\prime\prime}\) and that the differentiation can be carried out under the integral. Note that when the parameter space Θ is compact, it is clear from geometric considerations that \({\theta }^{\ast} = \mbox{ argmin}_{\theta \in \Theta }N(F_{n}^{\ast},F_{\theta })\) is unique for sufficiently large n.
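
As a purely numerical illustration, the asymptotic variance formula above can be evaluated by quadrature. The sketch below assumes the smooth negative definite kernel \(\mathcal{L}(x) = 1 - {\mathrm{e}}^{-{x}^{2}/2}\) (one minus the Gaussian characteristic function) and a standard normal density p; these choices, like the helper names, are assumptions made only for the example.

```python
import numpy as np
from scipy.stats import norm
from scipy.integrate import quad, dblquad

# assumed choices for the example: L(x) = 1 - exp(-x^2/2), p = standard normal density
Lp  = lambda x: x * np.exp(-x**2 / 2)             # L'(x)
Lpp = lambda x: (1.0 - x**2) * np.exp(-x**2 / 2)  # L''(x)
p = norm.pdf

def inner(u):
    # \int L'(u - v) p(v) dv, by one-dimensional quadrature
    return quad(lambda v: Lp(u - v) * p(v), -np.inf, np.inf)[0]

num = quad(lambda u: inner(u) ** 2 * p(u), -np.inf, np.inf)[0]
den = dblquad(lambda u, v: Lpp(u - v) * p(u) * p(v),
              -np.inf, np.inf, -np.inf, np.inf)[0]
print("asymptotic variance of theta*:", num / den**2)
```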

3 Estimating a Location Parameter: Second Approach

We now consider another method for estimating a location parameter θ. Let

$$\theta ^{\prime} = \mbox{ argmin}_{\theta \in \Theta }N(F_{n}^{{_\ast}},\delta _{ \theta }),$$
(23.3.1)

where \(\delta _{\theta }\) is the distribution concentrated at the point θ and \(F_{n}^{\ast}\) is the empirical DF. Proceeding as in Sect. 23.2, it is easy to verify that \(\theta ^{\prime}\) is a root of

$$\sum \limits _{j=1}^{n}\mathcal{L}^{\prime}(x_{ j} - \theta ) = 0,$$
(23.3.2)

and so it is a classic M-estimator. A consistent solution of (23.3.2) exists only if

$$\int \nolimits \limits _{-\infty }^{\infty }\mathcal{L}^{\prime}(u)p(u)\mathrm{d}u = 0.$$
(23.3.3)
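
For illustration, if \(\mathcal{L}(u) = \vert u\vert \), then (23.3.2) reads \(\sum _{j}\mathrm{sign}(x_{j} - \theta ) = 0\), whose solution is the sample median, and (23.3.3) states that the model density p has median zero (true for any p symmetric about zero). A brief sketch under these assumed choices:

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(loc=2.0, scale=1.0, size=201)    # odd size gives a unique sample median

theta_prime = np.median(x)                      # solves sum_j sign(x_j - theta) = 0
print("theta' (sample median):", theta_prime)
print("estimating equation:", np.sum(np.sign(x - theta_prime)))  # 0 when there are no ties
```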

What is the geometric interpretation of (23.3.3)? More precisely, how is the parameter of the measure \(\delta _{\theta }\) related to the parameter of the family, that is, of the DF \(F_{\theta }\)? It must be the same parameter; that is, for all \(\theta _{1}\) we must have

$$N(F_{\theta },\delta _{\theta }) \leq N(F_{\theta },\delta _{\theta _{1}}).$$

Consequently,

$$\frac{\mathrm{d}} {\mathrm{d}\theta _{1}}N(F_{\theta },\delta _{\theta _{1}})\vert _{\theta _{1}=\theta } = 0.$$

It is easy to verify that the last condition is equivalent to (23.3.3). Thus, (23.3.3) has to do with the accuracy of parameterization and has the following geometric interpretation. The space of measures with metric \(\mathfrak{N}\) is isometric to some simplex in a Hilbert space. In this case, δ-measures correspond to the extreme points (vertices) of the simplex. Consequently, (23.3.3) signifies that the vertex closest to the measure with DF \(F_{\theta }\) corresponds to the same value of the parameter θ (and not to some other value \(\theta _{1}\)).

4 Estimating a General Parameter

We now consider the case of an arbitrary one-dimensional parameter, which can be treated in much the same way as the location-parameter case. We simply carry out formal computations, assuming that all necessary regularity conditions are satisfied.

Let \(x_{1},\ldots ,x_{n}\) be a random sample from a population with DF \(F(x,\theta )\), \(\theta \in \Theta \subset {\mathbb{R}}^{1}\). Assume that \(p(x,\theta ) = p_{\theta }(x)\) is the density of \(F(x,\theta )\). The estimator

$${\theta }^{{_\ast}} = \mbox{ argmin}_{ \theta \in \Theta }N(F_{n}^{{_\ast}},F_{ \theta })$$

is an M-estimator defined by the equation

$$\frac{1} {n}\sum \limits _{j=1}^{n}g(x_{ j},\theta ) = 0,$$
(23.4.1)

where

$$g(x,\theta ) = \int \nolimits \limits _{-\infty }^{\infty }\mathcal{L}(x,v)p_{ \theta }^{\prime}(v)\mathrm{d}v -\int \nolimits \limits _{-\infty }^{\infty }\int \nolimits \limits _{-\infty }^{\infty }\mathcal{L}(u,v)p_{ \theta }(u)p_{\theta }^{\prime}(v)\mathrm{d}u\mathrm{d}v.$$

Here, \(\mathcal{L}(u,v)\) is a negative definite kernel, which does not necessarily depend on the difference of arguments, and the prime denotes the derivative with respect to θ. As in Sect. 23.2, the necessary condition for consistency,

$$E_{\theta }g(x,\theta ) = 0,$$

is automatically fulfilled. The asymptotic variance of θ ∗  is given by

$$\sigma _{{\theta }^{{_\ast}}}^{2} = \frac{\mbox{ Var}\left (\int \nolimits \limits _{-\infty }^{\infty }\mathcal{L}(x,v)p_{\theta }^{\prime}(v)\mathrm{d}v\right )} {{\left (\int \nolimits \limits _{-\infty }^{\infty }\int \nolimits \limits _{-\infty }^{\infty }\mathcal{L}(u,v)p_{\theta }^{\prime}(u)p_{\theta }^{\prime}(v)\mathrm{d}u\mathrm{d}v\right )}^{2}}.$$

We can proceed similarly to Sect. 23.3 to obtain the corresponding results in this case. Since the calculations are quite similar, we do not state these results explicitly. Note that to establish the existence and uniqueness of \(\theta ^{\ast}\) for sufficiently large n, we do not need standard regularity conditions such as the existence of a variance, differentiability of the density with respect to θ, and so on. These are used only to derive the estimating equation and to express the asymptotic variance of the estimator.
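
As an illustration of the general-parameter case (not part of the text, with all numerical choices assumed), the sketch below estimates the rate λ of the exponential family \(F(x,\lambda ) = 1 - {\mathrm{e}}^{-\lambda x}\) by minimizing \(N(F_{n}^{\ast},F_{\lambda })\) over λ, using the kernel \(\mathcal{L}(x,y) = \vert x - y\vert \); the closed-form model terms in the code are elementary properties of the exponential distribution.

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(3)
lam_true = 1.5                                          # assumed value for the example
x = rng.exponential(scale=1.0 / lam_true, size=200)     # F(x, lambda) = 1 - exp(-lambda*x)

def N_emp(lam):
    n = len(x)
    # E|c - Y| = c - 1/lam + 2*exp(-lam*c)/lam for Y ~ Exp(lam) and c >= 0
    cross = 2.0 / n * np.sum(x - 1.0 / lam + 2.0 * np.exp(-lam * x) / lam)
    within = np.abs(x[:, None] - x[None, :]).sum() / n**2
    model = 1.0 / lam                                   # E|Y - Y'|: Y - Y' is Laplace(1/lam)
    return cross - within - model

res = minimize_scalar(N_emp, bounds=(0.05, 20.0), method="bounded")
print("lambda* =", res.x)                               # should be near lam_true
```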

In general, from the construction of \(\theta ^{\ast}\) we have

$$N(F_{n}^{{_\ast}},F_{{ \theta }^{{_\ast}}}) \leq N(F_{n}^{{_\ast}},F_{ \theta })\ \mathrm{a.s.},$$

and hence

$$\begin{array}{rcl} E_{\theta }N(F_{n}^{{_\ast}},F_{{ \theta }^{{_\ast}}})& \leq & E_{\theta }N(F_{n}^{{_\ast}},F_{ \theta }) \\ & =& \frac{1} {n}\int \nolimits \limits _{-\infty }^{\infty }\int \nolimits \limits _{-\infty }^{\infty }\mathcal{L}(x,y)\mathrm{d}F(x,\theta )\mathrm{d}F(y,\theta )\mathop{\longrightarrow}\limits_{n \rightarrow \infty }^{}0.\end{array}$$
(23.4.2)

In the case of a bounded kernel \(\mathcal{L}\), the convergence is uniform with respect to θ. In this case it is easy to verify that \(nN(F_{n}^{\ast},F_{\theta })\) converges in distribution to

$$-\int \nolimits \limits _{-\infty }^{\infty }\int \nolimits \limits _{-\infty }^{\infty }\mathcal{L}(x,y)\mathrm{d}{w}^{\circ }(F(x,\theta ))\mathrm{d}{w}^{\circ }(F(y,\theta ))$$

as \(n \rightarrow \infty \), where \(w^{\circ }\) is the Brownian bridge.
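
A small simulation sketch (illustrative only, assuming \(\mathcal{L}(x,y) = \vert x - y\vert \) and a standard normal model evaluated at the true parameter) makes this visible: \(nN(F_{n}^{\ast},F_{\theta })\) does not collapse to zero but fluctuates around a nondegenerate limit, and its mean equals \(\int \int \mathcal{L}\,\mathrm{d}F\,\mathrm{d}F = 2/\sqrt{\pi }\) for every n, in agreement with (23.4.2).

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(4)

def n_times_N(n):
    # n * N(F_n^*, F) for a sample of size n from the model F = N(0, 1),
    # kernel L(x, y) = |x - y|; the model terms are in closed form
    x = rng.normal(size=n)
    cross = 2.0 / n * np.sum(2.0 * norm.pdf(x) + x * (2.0 * norm.cdf(x) - 1.0))
    within = np.abs(x[:, None] - x[None, :]).sum() / n**2
    model = 2.0 / np.sqrt(np.pi)
    return n * (cross - within - model)

for n in (100, 400, 1600):
    print(n, round(np.mean([n_times_N(n) for _ in range(100)]), 3))
    # the mean stays near 2/sqrt(pi) ~ 1.128 for every n
```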

5 Estimating a Location Parameter: Third Approach

Let us return to the case of estimating a location parameter. We will present an example of an estimator, obtained by minimizing the \(\mathfrak{N}\)-distance, that has good robustness properties. Let

$$\mathcal{L}_{r}(x) = \left \{\begin{array}{ll} \vert x\vert &\mbox{ for }\vert x\vert < r\,\\ r &\mbox{ for } \vert x\vert \geq r, \end{array} \right.$$

where r > 0 is a fixed number. The famous Pólya criterion implies that the function \(f(t) = 1 -\frac{1} {r}\mathcal{L}_{r}(t)\) is the characteristic function of some probability distribution. Consequently, \(\mathcal{L}_{r}(t)\) is a negative definite function. This implies that for a sufficiently large sample size n there exists an estimator \(\theta ^{\ast}\) of minimal \({\mathfrak{N}}^{r}\) distance, where \({\mathcal{N}}^{r}\) is the kernel constructed from \(\mathcal{L}_{r}(x - y)\). If the distribution function F(x − θ) has a symmetric unimodal density p(x − θ) that is absolutely continuous and has finite Fisher information

$$I = \int\nolimits_{-\infty }^{\infty }{\left (\frac{p^{\prime}(x)} {p(x)}\right )}^{2}p(x)\mathrm{d}x,$$

then we conclude by (23.4.2) that \(\theta ^{\ast}\) is consistent and asymptotically normal. The estimator \(\theta ^{\ast}\) satisfies (23.2.5), where

$$g_{1}(x) = \int \nolimits \limits _{-\infty }^{\infty }\mathcal{L}^{\prime}(x - v)p(v)\mathrm{d}v\,$$

and

$$\mathcal{L}^{\prime}(u) = \left \{\begin{array}{rl} 0&\mbox{ for }\vert u\vert \geq r, \\ 1&\mbox{ for }0 < u < r, \\ 0&\mbox{ for }u = 0, \\ - 1&\mbox{ for } - r < u < 0. \end{array} \right.$$

This implies that \(\theta ^{\ast}\) has a bounded influence function and, hence, is B-robust.
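
A short sketch of this robust estimator (illustrative only): assuming a standard normal density p and an arbitrary truncation point r = 1.5, one has \(g_{1}(x) = 2\Phi (x) - \Phi (x - r) - \Phi (x + r)\), a bounded score, and (23.2.5) can be solved by a root search bracketed near the bulk of the data.

```python
import numpy as np
from scipy.stats import norm
from scipy.optimize import brentq

rng = np.random.default_rng(5)
x = rng.normal(loc=2.0, scale=1.0, size=200)
x[:5] += 50.0                        # plant a few gross outliers
r = 1.5                              # assumed truncation point of L_r

def g1(u):
    # g1(u) = int L_r'(u - v) p(v) dv = 2*Phi(u) - Phi(u - r) - Phi(u + r); bounded score
    return 2.0 * norm.cdf(u) - norm.cdf(u - r) - norm.cdf(u + r)

# bracket the root near the bulk of the sample: the redescending score can
# create spurious roots near outlier clusters
m = np.median(x)
theta_star = brentq(lambda t: np.sum(g1(x - t)), m - 5.0, m + 5.0)
print("robust theta*:", theta_star)  # close to 2 despite the outliers
```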

Consider now the estimator \(\theta ^{\prime}\) obtained by the method discussed in Sect. 23.3. It is easy to verify that this estimator is consistent under the same assumptions. However, \(\theta ^{\prime}\) satisfies the equation

$$\sum \limits _{j=1}^{n}\mathcal{L}_{r}^{\prime}(x_{j} - \theta ) = 0,$$

so that it is a trimmed median. It is well known that a trimmed median is the most B-robust estimator in the corresponding class of M-estimators.
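
For comparison, a brief sketch (again illustrative, with assumed values) of the trimmed-median equation: it balances the number of observations in \((\theta ,\theta + r)\) against the number in \((\theta - r,\theta )\), so a simple grid scan over the sample range suffices.

```python
import numpy as np

rng = np.random.default_rng(6)
x = rng.normal(loc=2.0, scale=1.0, size=200)
r = 1.5                               # assumed truncation point

def psi(t):
    # number of observations in (t, t + r) minus the number in (t - r, t)
    d = x - t
    return np.sum((d > 0) & (d < r)) - np.sum((d < 0) & (d > -r))

# psi is an integer-valued step function, so scan a grid instead of using a root finder
grid = np.linspace(x.min(), x.max(), 2001)
theta_prime = grid[np.argmin([abs(psi(t)) for t in grid])]
print("trimmed median (approx.):", theta_prime)
```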

6 Semiparametric Estimation

Let us now briefly discuss semiparametric estimation. This problem is similar to that considered in Sect. 23.4, except that here we do not assume that the sample comes from a parametric family. Let \(x_{1},\ldots ,x_{n}\) be a random sample from a population with DF F(x), which belongs to some class of distributions \(\mathcal{P}\). Suppose that the metric \(\mathfrak{N}\) is generated by the negative definite kernel \(\mathcal{L}(x,y)\) and that \(\mathcal{P}\subset \mathcal{B}(\mathcal{L})\). \(\mathcal{B}(\mathcal{L})\) is isometric to a subset of a Hilbert space \(\mathfrak{H}\). Moreover, Aronszajn’s theorem implies that \(\mathfrak{H}\) can be chosen to be minimal in a certain sense. In this case, the definition of \(\mathfrak{N}\) extends to all of \(\mathfrak{H}\).

We assume that the distributions under consideration lie on some “nonparametric curve.” In other words, there exists a nonlinear functional φ on \(\mathfrak{H}\) such that the distributions F satisfy the condition

$$\varphi (F) = c = \mbox{ const.}$$

The functional φ is assumed to be smooth. For any \(H \in \mathfrak{H}\)

$$\begin{array}{rcl} \lim \limits _{t\rightarrow 0}\frac{N(F + tH,G) - N(F,G)} {t} & =& 2\int \nolimits \limits _{-\infty }^{\infty }\int \nolimits \limits _{-\infty }^{\infty }\mathcal{L}(x,y)\mathrm{d}(G(x) - F(x))\mathrm{d}H(y) \\ & =& \langle \mbox{ grad }N(F,G),H\rangle,\end{array}$$

where G is fixed.

Under the parametric formulation of Sect. 23.4, the equation for θ has the form

$$\frac{\mathrm{d}} {\mathrm{d}\theta }N(F_{\theta },F_{n}^{{_\ast}}) = 0,$$

that is,

$$\left \langle \mbox{ grad }N(F,F_{n}^{{_\ast}})\vert _{ F=F_{\theta }},\ \frac{\mathrm{d}} {\mathrm{d}\theta }F_{\theta }\right \rangle = 0.$$

Here, the equation explicitly depends on the gradient of the functional \(N(F,F_{n}^{\ast})\). Under the nonparametric formulation, however, we seek a conditional minimum of the functional \(N(F,F_{n}^{\ast})\), subject to the constraint that F lies on the surface φ(F) = c. In this case, our estimator is

$$\tilde{{F}}^{\ast} =\mathop{ \mbox{ argmin}}\limits _{F\in \{F:\,\varphi (F)=c\}}N(F,F_{n}^{\ast}).$$

According to the general rule for finding constrained critical points (the method of Lagrange multipliers), we have

$$\mbox{ grad }N(\tilde{{F}}^{\ast},F_{n}^{\ast}) = \lambda \mbox{ grad }\varphi (\tilde{{F}}^{\ast}),$$
(23.6.1)

where λ is a number (the Lagrange multiplier). Thus, in the general case, (23.6.1) is an eigenvalue problem. This is a general framework for semiparametric estimation.
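
To make (23.6.1) concrete, consider the illustrative (assumed) constraint that the mean is fixed, \(\varphi (F) =\int x\,\mathrm{d}F(x) = c\). Then \(\mbox{ grad }\varphi (F)\) is represented by the function \(y\mapsto y\), and, by the expression for \(\mbox{ grad }N\) given above, (23.6.1) states that for all y

$$2\int \nolimits \limits _{-\infty }^{\infty }\mathcal{L}(x,y)\,\mathrm{d}\bigl (F_{n}^{\ast}(x) -\tilde{ {F}}^{\ast}(x)\bigr ) = \lambda y,$$

up to an additive constant (admissible perturbations preserve total mass), with λ determined by the side condition \(\int x\,\mathrm{d}\tilde{{F}}^{\ast}(x) = c\). This is only a sketch of how the eigenvalue-type problem specializes in the simplest case; for a genuinely nonlinear φ, (23.6.1) must be solved together with the constraint φ(F) = c.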