1 Composite Hypotheses Defined by Half Spaces of Distributions

Let \(\nu _0,\nu _1\) be fixed distributions on \(\mathbb {R}^d\) which are the nominal distributions under two hypotheses. Let

$$ V(\nu ,\mu )=\sup _{A\subseteq \mathbb {R}^d} \left| \nu (A) - \mu (A)\right| $$

denote the total variation distance between two distributions \(\nu \) and \(\mu \), where the supremum is taken over all Borel sets of \(\mathbb {R}^d\).

Let \(X,X_1,X_2,\dots \) be i.i.d. random vectors with common distribution \(\mu \). We observe \(X_1,\dots ,X_n\). Under the hypothesis \(H_j\) (\(j=0,1\)) the distribution \(\mu \) is a distorted version of \(\nu _j\). Formally define the two hypotheses by

$$\begin{aligned} H_0 =\left\{ {\mu }: V(\mu ,\nu _0) < V(\mu ,\nu _1)\right\} , \end{aligned}$$

and

$$\begin{aligned} H_1 =\left\{ {\mu }: V(\mu ,\nu _1) < V(\mu ,\nu _0)\right\} . \end{aligned}$$

Our aim is to construct a distribution-free strongly consistent test, which makes an error only finitely many times almost surely (a.s.). The concept of a strongly consistent test is quite unusual: it means that both under \( H_0\) and under \( H_1\) the test a.s. makes no error after a random sample size. In other words, denoting by \({\mathbb {P}}_0\) and \({\mathbb {P}}_1\) the probability under the hypotheses \( H_0\) and \( H_1\), we have

$$ {\mathbb {P}}_0\{ \text {rejecting}\,H_0 \,\text {for only finitely many }\,n\}=1 $$

and

$$ {\mathbb {P}}_1\{ \text {rejecting}\,H_1\,\text {for only finitely many } \,n\}=1. $$

In a real-life problem, for example when the data arrive sequentially, one observes a single data sequence and has to draw good inferences from it. Strong consistency means that this single sequence of inferences is a.s. error-free once the sample size is large enough. This concept is close to the definition of discernibility introduced by Dembo and Peres [5]. For a discussion and references, we refer the reader to Biau and Györfi [3], Devroye and Lugosi [7], Gretton and Györfi [10], and Györfi and Walk [15].

Motivated by a related goodness of fit test statistic of Györfi and van der Meulen [14], we put

$$ L_{n,0} = \sum _{j=1}^{m_n}|\mu _n(A_{n,j})-\nu _0(A_{n,j})|, $$

and

$$ L_{n,1} = \sum _{j=1}^{m_n}|\mu _n(A_{n,j})-\nu _1(A_{n,j})|, $$

where \(\mu _n\) denotes the empirical measure associated with the sample \(X_1, \ldots ,X_n\), so that

$$\mu _{n}(A)=\frac{\#\{i : X_i \in A, i=1, \ldots ,n\}}{n}$$

for any Borel subset A, and \(\mathcal {P}_n=\{A_{n,1},\ldots ,A_{n,m_n}\}\) is a finite partition of \({\mathbb R}^d\).

We introduce a test such that the hypothesis \(H_0\) is accepted if

$$\begin{aligned} L_{n,0}<L_{n,1}, \end{aligned}$$
(22.1)

and rejected otherwise.
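To make the procedure concrete, the following minimal univariate sketch (in Python) computes \(L_{n,0}\), \(L_{n,1}\) and the decision (22.1). The Gaussian nominal distributions, the equidistant interior cell boundaries with two unbounded tail cells, the choice \(m_n\approx \sqrt{n}\) (so that \(m_n/n\rightarrow 0\)), and the helper names are illustrative assumptions, not prescribed by the text.

```python
# A minimal 1-d sketch of the test (22.1); the Gaussian nominal distributions,
# the equidistant partition, and m_n ~ sqrt(n) are illustrative assumptions.
import numpy as np
from scipy.stats import norm


def cell_probs_empirical(x, edges):
    # cells of P_n: two unbounded tail cells plus the intervals between edges
    idx = np.searchsorted(edges, x)
    counts = np.bincount(idx, minlength=len(edges) + 1)
    return counts / len(x)                       # mu_n(A_{n,j})


def cell_probs_nominal(cdf, edges):
    return np.diff(np.concatenate(([0.0], cdf(edges), [1.0])))   # nu_j(A_{n,j})


def test_221(x, nu0, nu1):
    """Accept H_0 (return 0) iff L_{n,0} < L_{n,1}."""
    n = len(x)
    m_n = max(2, int(np.sqrt(n)))                # m_n / n -> 0
    edges = np.linspace(-5.0, 6.0, m_n - 1)      # interior cell boundaries, m_n cells
    mu_n = cell_probs_empirical(x, edges)
    L0 = np.abs(mu_n - cell_probs_nominal(nu0.cdf, edges)).sum()
    L1 = np.abs(mu_n - cell_probs_nominal(nu1.cdf, edges)).sum()
    return 0 if L0 < L1 else 1


rng = np.random.default_rng(0)
x = rng.normal(0.1, 1.0, size=5000)              # a distorted version of nu_0
print(test_221(x, norm(0, 1), norm(1, 1)))       # expected output: 0
```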

The sequence of partitions \(\mathcal {P}_1,\mathcal {P}_2,\dots \) is called asymptotically fine if for any sphere S centered at the origin

$$\begin{aligned} \lim _{n\rightarrow \infty }\max _{A\in \mathcal {P}_n,A\cap S \ne \emptyset }\text {diam}(A)=0. \end{aligned}$$

Theorem 22.1

Assume that the sequence of partitions \(\mathcal {P}_1,\mathcal {P}_2,\dots \) is asymptotically fine and

$$\begin{aligned} \lim _{n\rightarrow \infty }\frac{m_n}{n}=0. \end{aligned}$$
(22.2)

Then the test (22.1) is strongly consistent.

Proof

Assume \(H_0\) without loss of generality. Then an error at sample size n means that

$$ L_{n,0}\ge L_{n,1}. $$

Thus,

$$\begin{aligned} 0&\le \sum _{j=1}^{m_n}|\mu _n(A_{n,j})-\nu _0(A_{n,j})| - \sum _{j=1}^{m_n}|\mu _n(A_{n,j})-\nu _1(A_{n,j})|\\&\le 2L_n + \sum _{j=1}^{m_n}|\mu (A_{n,j})-\nu _0(A_{n,j})| - \sum _{j=1}^{m_n}|\mu (A_{n,j})-\nu _1(A_{n,j})|, \end{aligned}$$

where

$$ L_n=\sum _{j=1}^{m_n}|\mu _n(A_{n,j})-\mu (A_{n,j})|. $$

Under \(H_0\), introduce the notation

$$ \epsilon = V(\mu ,\nu _1) - V(\mu ,\nu _0)>0. $$

The sequence of partitions \(\mathcal {P}_1,\mathcal {P}_2,\dots \) is asymptotically fine, which implies that

$$\begin{aligned}&\lim _{n\rightarrow \infty }\left( \sum _{j=1}^{m_n}|\mu (A_{n,j})-\nu _0(A_{n,j})| - \sum _{j=1}^{m_n}|\mu (A_{n,j})-\nu _1(A_{n,j})| \right) \\&=2\left( V(\mu ,\nu _0) - V(\mu ,\nu _1) \right) \\&=-2\epsilon , \end{aligned}$$

(cf. Biau and Györfi [3]). Thus, for all n large enough,

$$ P_{e,n}={\mathbb {P}}\{\text {error} \} \le {\mathbb {P}}\{0\le 2L_n-\epsilon \}. $$

Beirlant et al. [2] and Biau and Györfi [3] proved that, for any \(\delta >0\),

$$ {\mathbb {P}}\{L_n>\delta \} \le 2^{m_n}e^{-n\delta ^2/2}. $$

Therefore, applying this inequality with \(\delta =\epsilon /2\),

$$ P_{e,n} \le 2^{m_n}e^{-n\epsilon ^2/8}. $$

Because of (22.2), \(2^{m_n}e^{-n\epsilon ^2/8}=e^{m_n\log 2-n\epsilon ^2/8}\le e^{-n\epsilon ^2/16}\) for all n large enough, hence

$$ \sum _{n=1}^{\infty }P_{e,n}<\infty , $$

and so the Borel-Cantelli lemma implies that a.s.

$$ L_{n,0}- L_{n,1}<0 $$

for all n large enough, i.e., the error

$$ L_{n,0}- L_{n,1}\ge 0 $$

occurs a.s. for only finitely many n. Thus, strong consistency is proved.    \(\Box \)

In a straightforward way, the proof of Theorem 22.1 can be extended to infinite partitions if we assume that for each sphere S centered at the origin

$$ \lim _{n\rightarrow \infty }\frac{|\{j:A_{n,j}\cap S\ne \emptyset \}|}{n}=0. $$

Next, a variant of the test (22.1) with much smaller computational complexity will be defined. The test statistic is based on a recursive histogram. In this section we assume that the partitions are infinite and all cells of the partitions have finite and positive Lebesgue measure \(\lambda \). Let \(A_n(x)\) denote the cell of \(\mathcal {P}_n\) to which x belongs. The density estimate

$$ f_n(x):=\frac{1}{n}\sum _{i=1}^n \frac{{\mathbb {I}}_{\{X_i\in A_i(x)\}}}{\lambda (A_i(x))} $$

is called a recursive histogram.

For \(A\in \mathcal {P}_n\), introduce the estimate

$$ \mu _n^*(A):=\int _A f_n(x)dx. $$

Notice that \(\mu _n^*(A)\) can be calculated in a recursive way, which is important in on-line applications. These definitions imply that

$$\begin{aligned} \mu _n^*(A)=\frac{1}{n}\sum _{i=1}^n\int _A \frac{{\mathbb {I}}_{\{X_i\in A_i(x)\}}}{\lambda (A_i(x))}dx&=\frac{1}{n}\sum _{i=1}^n\int _A \frac{{\mathbb {I}}_{\{x\in A_i(X_i)\}}}{\lambda (A_i(X_i))}dx\\&\qquad \qquad =\frac{1}{n}\sum _{i=1}^n \frac{\lambda (A\cap A_i(X_i))}{\lambda (A_i(X_i))}. \end{aligned}$$

If the sequence of partitions \(\mathcal {P}_1,\mathcal {P}_2,\dots \) is nested, i.e., the sequence of \(\sigma \)-algebras \(\sigma (\mathcal {P}_n)\) is non-decreasing, then for \(A\in \mathcal {P}_n\) let the ancestor \(B_A^{(i)}\in \mathcal {P}_i\) be such that \(A \subseteq B_A^{(i)}\) (\(i\le n\)). One can check that for nested partitions the estimate has the following form:

$$ \mu _n^*(A)=\frac{1}{n}\sum _{i=1}^n {\mathbb {I}}_{\{X_i\in B_A^{(i)}\}}\frac{\lambda (A)}{\lambda (B_A^{(i)})}. $$

Put

$$ L^*_{n,j}:=\sum _{A\in \mathcal {P}_n}|\mu _n^*(A)-\nu _j(A)| $$

(\(j=0,1\)). We introduce a test such that the hypothesis \(H_0\) is accepted if

$$\begin{aligned} L^*_{n,0}<L^*_{n,1}, \end{aligned}$$
(22.3)

and rejected otherwise.
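The following minimal univariate sketch (in Python) illustrates the recursive computation of \(\mu _n^*\) for nested partitions and the decision (22.3). The nested dyadic partitions of \(\mathbb {R}\) with a slowly growing resolution level, the Gaussian nominal distributions, and the truncation of the sum over cells to a finite range are illustrative assumptions, not part of the text.

```python
# A minimal 1-d sketch of the recursive-histogram test (22.3).  Illustrative
# assumptions (not from the text): nested dyadic partitions of R into intervals
# of width 2^{-k_n} with k_n = floor(log2(n)/3), so inf_j lambda(A_{n,j}) is of
# order n^{-1/3} and the summability condition of Theorem 22.2 holds; Gaussian
# nominal distributions; sum over cells truncated to a finite range.
import numpy as np
from scipy.stats import norm


def level(n):
    return max(0, int(np.log2(n) / 3))           # non-decreasing => nested partitions


def recursive_histogram(x):
    """Return (level k_n, dict: cell index j -> mu_n^*([j 2^{-k_n}, (j+1) 2^{-k_n})))."""
    k, s = 0, {}                                 # current level and cell weights S(A)
    for n, xi in enumerate(x, start=1):
        k_new = level(n)
        if k_new > k:                            # refinement: each cell splits into r
            r = 2 ** (k_new - k)                 # children, each inheriting the fraction
            s = {r * j + t: w / r                # lambda(A) / lambda(B) = 1/r of its weight
                 for j, w in s.items() for t in range(r)}
            k = k_new
        j = int(np.floor(xi * 2 ** k))           # cell A_n(x_n) of P_n
        s[j] = s.get(j, 0.0) + 1.0               # indicator term for X_n
    return k, {j: w / len(x) for j, w in s.items()}


def test_223(x, nu0, nu1, lo=-8.0, hi=8.0):
    """Accept H_0 (return 0) iff L*_{n,0} < L*_{n,1}, summing over cells in [lo, hi)."""
    k, mass = recursive_histogram(x)
    h = 2.0 ** (-k)
    cells = range(int(np.floor(lo / h)), int(np.ceil(hi / h)))
    L0 = sum(abs(mass.get(j, 0.0) - (nu0.cdf((j + 1) * h) - nu0.cdf(j * h))) for j in cells)
    L1 = sum(abs(mass.get(j, 0.0) - (nu1.cdf((j + 1) * h) - nu1.cdf(j * h))) for j in cells)
    return 0 if L0 < L1 else 1


rng = np.random.default_rng(1)
x = rng.normal(0.1, 1.0, size=4000)              # a distorted version of nu_0
print(test_223(x, norm(0, 1), norm(1, 1)))       # expected output: 0
```

Each observation adds weight one to its current cell; when the partition is refined, the stored weights are split among the children in proportion to \(\lambda (A)/\lambda (B)\), which reproduces the nested-partition formula for \(\mu _n^*\) above and keeps the update fully online.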

Theorem 22.2

Assume that the sequence of partitions \(\mathcal {P}_1,\mathcal {P}_2,\dots \) is asymptotically fine such that

$$\begin{aligned} \sum _{n=1}^{\infty }\frac{1}{n^2\inf _{j}\lambda (A_{n,j})}<\infty . \end{aligned}$$

Further suppose that \(\mu \) has a density. Then the test (22.3) is strongly consistent.

Proof

Assume \(H_0\) without loss of generality. Observe that

$$ L^*_{n,0}-L^*_{n,1} \le 2L_n^*+Q_n^*, $$

where

$$ L_n^*=\sum _{A\in \mathcal {P}_n}|\mu ^*_n(A)-\mu (A)|, $$

and

$$ Q_n^*=\sum _{A\in \mathcal {P}_n}\left| \mu (A)-\nu _0(A) \right| -\sum _{A\in \mathcal {P}_n}\left| \mu (A)-\nu _1(A) \right| . $$

By Biau and Györfi [3],

$$ Q^*_n\rightarrow 2(V(\mu ,\nu _0)-V(\mu ,\nu _1))<0, $$

where the limit is negative because of \(H_0\). Next we show that \(L_n^*\rightarrow 0 \) a.s. Denote the density of \(\mu \) by \(f\). Thus

$$ L_n^*=\sum _{A\in \mathcal {P}_n}\left| \int _A f_n(x)dx-\int _A f(x)dx\right| \le \int | f_n(x)- f(x)|dx. $$

Therefore we have to prove the strong \(L_1\)-consistency of the recursive histogram. Consider the bias part. Introduce the ordinary histogram:

$$ \tilde{f}_n(x):=\frac{1}{n}\sum _{i=1}^n \frac{{\mathbb {I}}_{\{X_i\in A_n(x)\}}}{\lambda (A_n(x))}, $$

and put

$$ \bar{f}_n(x):={\mathbb {E}}\{ \tilde{f}_n(x) \}= \frac{\mu (A_n(x))}{\lambda (A_n(x))}. $$

According to the Abou-Jaoude theorem, if the sequence of partitions \(\mathcal {P}_1,\mathcal {P}_2,\dots \) is asymptotically fine, then

$$ \int |\bar{f}_n-f|\rightarrow 0 $$

(cf. Devroye and Györfi [6]). Thus, for the bias term of the recursive histogram, we get

$$\begin{aligned} \int |{\mathbb {E}}\{f_n\}-f|= \int \left| \frac{1}{n}\sum _{i=1}^n\bar{f}_i-f\right| \le \frac{1}{n}\sum _{i=1}^n\int |\bar{f}_i-f|\rightarrow 0. \end{aligned}$$
(22.4)

For the variation term of the recursive histogram, we apply the generalized theorem of Kolmogorov. Let \(U_n\), \(n=1,2,\dots \) be an \(L_2\)-valued sequence of independent, zero-mean random variables such that

$$ \sum _{n=1}^{\infty }\frac{{\mathbb {E}}\{\Vert U_n\Vert ^2_2\}}{n^2} <\infty $$

where \(\Vert \cdot \Vert _2\) denotes the \(L_2\) norm. Then

$$ \lim _{n\rightarrow \infty }\left\| \frac{1}{n} \sum _{i=1}^n U_i\right\| _2=0 $$

a.s. (cf. Györfi et al. [11]). For

$$ U_n:=\frac{{\mathbb {I}}_{\{X_n\in A_n(\cdot )\}}}{\lambda (A_n(\cdot ))}-{\mathbb {E}}\left\{ \frac{{\mathbb {I}}_{\{X_n\in A_n(\cdot )\}}}{\lambda (A_n(\cdot ))}\right\} , $$

one has to verify the condition of the generalized Kolmogorov theorem:

$$\begin{aligned} \sum _{n=1}^{\infty }\frac{{\mathbb {E}}\left\{ \left\| \frac{{\mathbb {I}}_{\{X_n\in A_n(\cdot )\}}}{\lambda (A_n(\cdot ))}-{\mathbb {E}}\left\{ \frac{{\mathbb {I}}_{\{X_n\in A_n(\cdot )\}}}{\lambda (A_n(\cdot ))}\right\} \right\| ^2_2\right\} }{n^2}&\le \sum _{n=1}^{\infty }\frac{{\mathbb {E}}\left\{ \left\| \frac{{\mathbb {I}}_{\{X_n\in A_n(\cdot )\}}}{\lambda (A_n(\cdot ))} \right\| ^2_2\right\} }{n^2}\\&=\sum _{n=1}^{\infty }\frac{{\mathbb {E}}\left\{ \int \frac{{\mathbb {I}}_{\{X_n\in A_n(x)\}}}{\lambda (A_n(x))^2} dx\right\} }{n^2}\\&=\sum _{n=1}^{\infty }\frac{{\mathbb {E}}\left\{ \int \frac{{\mathbb {I}}_{\{x\in A_n(X_n)\}}}{\lambda (A_n(X_n))^2} dx\right\} }{n^2}\\&=\sum _{n=1}^{\infty }\frac{{\mathbb {E}}\left\{ \frac{1}{\lambda (A_n(X_n))} \right\} }{n^2}\\&\le \sum _{n=1}^{\infty }\frac{1}{n^2\inf _{j}\lambda (A_{n,j})} < \infty , \end{aligned}$$

by the condition of the theorem, and so

$$\begin{aligned} \int |f_n-{\mathbb {E}}\{f_n\}|^2\rightarrow 0 \end{aligned}$$
(22.5)

a.s. From Lemma 3.1 in Györfi and Masry [13] we get that the limit relations (22.4) and (22.5) imply

$$ \int |f_n-f|\rightarrow 0 $$

a.s. Therefore a.s.

$$ L^*_{n,0}-L^*_{n,1} < 0 $$

for all n large enough, and so strong consistency is proved.    \(\Box \)

2 Composite Hypotheses Defined by Half Spheres of Distributions

Again, under the hypothesis \(H'_j\) (\(j=0,1\)) the distribution \(\mu \) is a distorted version of \(\nu _j\). In this section we assume that the true distribution lies within a certain total variation distance of the underlying nominal distribution.

We formally define the two hypotheses by

$$\begin{aligned} H'_j =\left\{ \mu : V(\mu ,\nu _j) < \varDelta \right\} , \quad j=0,1, \end{aligned}$$
(22.6)

where

$$ \varDelta := (1/2) V(\nu _0,\nu _1). $$

Because of

$$ H'_j \subseteq H_j, \quad j=0,1~, $$

the test (22.1) of the previous section is strongly consistent. In this section we introduce a simpler test. With the notation

$$ f=\frac{d\nu }{d(\nu +\mu )} \qquad \text {and} \qquad g=\frac{d\mu }{d(\nu +\mu )}, $$

the general version of Scheffé’s theorem implies that

$$ V(\nu ,\mu )=\nu (A^*) - \mu (A^*), $$

where

$$ A^* = \{x:f(x)>g(x)\}~. $$

Introduce the notation

$$ A_{0,1}= \left\{ x:f_0(x)>f_1(x)\right\} = \left\{ x:f_0(x)>1/2\right\} ~, $$

where

$$ f_0=\frac{d\nu _0}{d(\nu _0+\nu _1)} \qquad \text {and} \qquad f_1=\frac{d\nu _1}{d(\nu _0+\nu _1)}. $$

The proposed test is the following: accept hypothesis \(H'_0\) if

$$\begin{aligned} \mu _n(A_{0,1})\ge \frac{\nu _0(A_{0,1})+\nu _1(A_{0,1})}{2}, \end{aligned}$$
(22.7)

and reject otherwise.
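The following minimal sketch (in Python) illustrates how little computation the test requires; the Gaussian nominal distributions \(N(0,1)\) and \(N(1,1)\), for which \(A_{0,1}=\{f_0>f_1\}=(-\infty ,1/2)\), are an illustrative assumption.

```python
# A minimal sketch of the test (22.7) for two illustrative Gaussian nominal
# distributions on R; these choices are assumptions, not from the text.
import numpy as np
from scipy.stats import norm

nu0, nu1 = norm(0, 1), norm(1, 1)
# For these nominals f_0 > f_1 iff x < 1/2, hence A_{0,1} = (-inf, 1/2).
p0, p1 = nu0.cdf(0.5), nu1.cdf(0.5)              # nu_0(A_{0,1}), nu_1(A_{0,1})
threshold = (p0 + p1) / 2                        # precomputed before seeing data


def test_227(x):
    """Accept H'_0 (return 0) iff mu_n(A_{0,1}) >= (nu_0(A_{0,1}) + nu_1(A_{0,1})) / 2."""
    mu_n = np.mean(x < 0.5)                      # empirical mass of A_{0,1}
    return 0 if mu_n >= threshold else 1


rng = np.random.default_rng(2)
x = rng.normal(0.1, 1.0, size=3000)              # a distorted version of nu_0
print(test_227(x))                               # expected output: 0
```

Note that \(A_{0,1}\) and the nominal probabilities are computed once, before any data are seen; the data enter only through the empirical frequency of a single set.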

We then have the following.

Theorem 22.3

The test (22.7) is strongly consistent.

Proof

Assume \(H'_0\) without loss of generality. Put

$$ \epsilon =\varDelta -V(\mu ,\nu _0)>0. $$

Observe that by the Scheffé theorem [22],

$$\begin{aligned} \nu _0(A_{0,1}) - \mu (A_{0,1})&\le V(\nu _0,\mu )\\&= \varDelta - \epsilon \\&= \frac{1}{2} V(\nu _0,\nu _1) - \epsilon \\&= \frac{1}{2}\left( \nu _0(A_{0,1}) - \nu _1(A_{0,1})\right) - \epsilon . \end{aligned}$$

Rearranging this inequality, we get

$$\begin{aligned} \mu (A_{0,1}) \ge \frac{\nu _0(A_{0,1}) + \nu _1(A_{0,1})}{2} + \epsilon ~. \end{aligned}$$
(22.8)

Therefore, (22.8) and Hoeffding’s inequality [16] imply that

$$\begin{aligned} {\mathbb {P}}\{ \text {error}\}&= {\mathbb {P}}\left\{ \mu _n(A_{0,1}) < \frac{\nu _0(A_{0,1}) + \nu _1(A_{0,1})}{2} \right\} \\&\le {\mathbb {P}}\left\{ \mu (A_{0,1}) -\mu _n(A_{0,1}) > \epsilon \right\} \\&\le e^{-2n{\epsilon }^2}. \end{aligned}$$

Therefore the Borel-Cantelli lemma implies strong consistency.    \(\Box \)

3 Discussion

3.1 Indistinguishability

For the hypotheses \(H_0\) and \(H_1\) there is no positive margin, because the gap between \(H_0\) and \(H_1\) is just the hyperplane

$$ \left\{ \mu : V(\mu ,\nu _0) = V(\mu ,\nu _1)\right\} . $$

Moreover, the margin is zero:

$$ \inf _{\mu \in H_0, \nu \in H_1}V(\mu ,\nu )=0. $$

Without any positive margin condition it is impossible to derive a uniform bound on the error probabilities. The pair \((H_0,H_1)\) of hypotheses is called distinguishable if there is a sequence of uniformly consistent tests, which means that the errors of the first and second kind tend to zero uniformly. For a test \(T_n\) with sample size n, let \(\alpha _{n,\mu }(T_n)\) and \(\beta _{n,\mu }(T_n)\) denote the errors of the first and second kind, respectively. Put

$$ \alpha _{n}(T_n,H_0)=\sup _{\mu \in H_0}\alpha _{n,\mu }(T_n), \qquad \beta _{n}(T_n,H_1)=\sup _{\mu \in H_1}\beta _{n,\mu }(T_n). $$

A sequence of tests \(T_n\), \(n=1,2,\dots \) is called uniformly consistent if

$$ \lim _{n\rightarrow \infty }(\alpha _{n}(T_n,H_0) + \beta _{n}(T_n,H_1))=0. $$

It is known that a necessary condition for distinguishability is that for any distribution \(\mu \)

$$ \max \left\{ \inf _{\nu \in H_0} V(\mu ,\nu ),\inf _{\nu \in H_1} V(\mu ,\nu ) \right\} >0. $$

(See Barron [1], Ermakov [9], Hoeffding and Wolfowitz [17], Kraft [18], LeCam [19], LeCam and Schwartz [20], Schwartz [23].) Obviously, this necessary condition is not satisfied when \(\mu ^*=(\nu _0+\nu _1)/2\). Because of

$$ \max \left\{ \inf _{\nu \in H'_0} V(\mu ^*,\nu ),\inf _{\nu \in H'_1} V(\mu ^*,\nu ) \right\} =0, $$

the pair \((H'_0,H'_1)\) of hypotheses is indistinguishable, too.

3.2 Computation

The hypothesis testing method (22.7) proposed above is computationally quite simple. The set \(A_{0,1}\) and the nominal probabilities \(\nu _0(A_{0,1})\) and \(\nu _1(A_{0,1})\) may be computed and stored before seeing the data. Then one merely needs to calculate \(\mu _n(A_{0,1})\).

3.3 Hypotheses Formulated by Densities

Devroye et al. [8] formulated a special case of hypotheses \((H'_0,H'_1)\), when \(\mu \), \(\nu _0\), and \(\nu _1\) have densities f, \(f_0\), and \(f_1\). Under some mild margin condition they proved uniform exponential bounds for the probability of failure for \(k\ge 2\) hypotheses. Moreover, they illustrated robustness of these bounds under additive noise, and showed examples where the test (22.7) is consistent and the maximum likelihood test does not work. Formally, the maximum likelihood test \(T_n\) is defined by

$$ T_n = \left\{ \begin{array}{ll} 0 &{} \text { if } \sum _{i=1}^n(\log f_0 (X_i) - \log f_1(X_i))> 0 \\ 1 &{} \text { otherwise.} \end{array} \right. $$

For \(f\in H'_0\), the strong law of large numbers implies the strong consistency of the maximum likelihood test if both integrals \(\int f\log f_0\) and \(\int f\log f_1\) are well defined, and

$$ \int f\log f_0>\int f\log f_1. $$
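For two illustrative Gaussian nominal densities (an assumption made here for concreteness, not taken from the text), the maximum likelihood test takes the following minimal form in Python.

```python
# A minimal sketch of the maximum likelihood test T_n for two illustrative
# Gaussian nominal densities f_0, f_1; these choices are assumptions.
import numpy as np
from scipy.stats import norm


def ml_test(x):
    """Return 0 iff sum_i (log f_0(X_i) - log f_1(X_i)) > 0, else 1."""
    llr = norm(0, 1).logpdf(x) - norm(1, 1).logpdf(x)   # log f_0(X_i) - log f_1(X_i)
    return 0 if llr.sum() > 0 else 1


rng = np.random.default_rng(3)
print(ml_test(rng.normal(0.1, 1.0, size=3000)))         # expected output: 0
```

For these nominals \(\log f_0(x)-\log f_1(x)=1/2-x\), so \(T_n\) simply compares the sample mean with 1/2; a heavy-tailed contamination without a finite mean is the kind of example in which such a test can fail while the test (22.7) remains consistent.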

3.4 Robustness

Note that Theorem 22.3 does not require any assumptions about the nominal distributions. The test is robust in a very strong sense: we obtain consistency under the sole assumption that the distorted distribution remains within a certain total variation distance of the nominal distribution. For example, if \(\mu \) is either \((1-\delta )\nu _0 + \delta \tau \) or \((1-\delta )\nu _1 + \delta \tau \) with an arbitrary “strange” distribution \(\tau \) and \(\delta < \varDelta \), then we have (22.6); in the first case,

$$\begin{aligned} V(\mu ,\nu _0)&= V((1-\delta )\nu _0 + \delta \tau ,\nu _0)\\&= \sup _{A}\delta \left| \tau (A) - \nu _0(A)\right| \\&= \delta V(\tau ,\nu _0)\\&\le \delta \\&< \varDelta . \end{aligned}$$

The outliers’ distribution \(\tau \) is completely arbitrary: for example, it need not have finite expectation, and it may even be a discrete distribution. The outlier probability \(\delta \) can be any value smaller than \(\varDelta \). The outlier mechanism can be formulated as follows: we are given three independent i.i.d. sequences \(\{U_i\}\), \(\{V_i\}\), \(\{I_i\}\), where \(\{U_i\}\), \(\{V_i\}\) are \(\mathbb {R}^d\)-valued, and \(\{I_i\}\) are binary. Put

$$ X_n=(1-I_n)U_n + I_nV_n. $$

If \(U_n\) is \(\nu _0\) distributed, \(V_n\) is \(\tau \) distributed, and \({\mathbb {P}}\{I_n=1\}=\delta \), then we recover the previous scheme. Other applications include the case of censored observations, where \(V_n\) is a distortion of \(U_n\) in which some components of the vector \(U_n\) are censored. In this scheme \(\delta \) is the probability of censoring. Notice that in order to estimate the distribution from censored observations one needs the samples \(\{(X_i,I_i)\}_{i=1}^n\) (cf. Györfi et al. [12]), while for detection it is enough to have \(\{X_i\}_{i=1}^n\).
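A minimal simulation of this outlier scheme (in Python) is given below; the nominal distributions \(N(0,1)\) and \(N(1,1)\), the standard Cauchy outlier distribution \(\tau \), and \(\delta =0.1<\varDelta \approx 0.19\) are illustrative assumptions, not taken from the text.

```python
# A minimal simulation of the outlier scheme X_n = (1 - I_n) U_n + I_n V_n and
# of the test (22.7).  Illustrative assumptions: nu_0 = N(0,1), nu_1 = N(1,1)
# (so A_{0,1} = (-inf, 1/2) and Delta ~ 0.19), tau = standard Cauchy, delta = 0.1.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(4)
n, delta = 5000, 0.1
U = rng.normal(0.0, 1.0, size=n)                 # nu_0 distributed
V = rng.standard_cauchy(size=n)                  # tau distributed (no expectation)
ind = rng.random(size=n) < delta                 # P{I_n = 1} = delta
X = np.where(ind, V, U)                          # X_n = (1 - I_n) U_n + I_n V_n

# decision rule (22.7) with A_{0,1} = (-inf, 1/2) for these nominals
threshold = (norm(0, 1).cdf(0.5) + norm(1, 1).cdf(0.5)) / 2
print(0 if np.mean(X < 0.5) >= threshold else 1)  # expected output: 0 (accept H'_0)
```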

3.5 Open Problems

  1.

    Characterize the distributions \(\mu \in H_0{\setminus } H'_0\) for which the simple test (22.7) is strongly consistent. As in the proof of Theorem 22.3, strong consistency can be verified if

    $$ \mu (A_{0,1}) > \frac{\nu _0(A_{0,1}) + \nu _1(A_{0,1})}{2}. $$

    We are interested in non-consistent examples, too.

  2.

    The test (22.1) could perhaps be improved, since the construction of the partitions does not take the properties of \(\nu _0\) and \(\nu _1\) into account. For example, the set \(A_{0,1}\) could somehow be incorporated into the partitions.

3.6 Sequential Tests

We dealt with sequences of nonparametric tests with increasing sample size n, where almost surely type I and type II errors occur only for finitely many n. These have to be distinguished from nonparametric sequential tests with power one (cf. Darling and Robbins [4], Sect. 6 in Robbins [21], Sect. 9.2 in Sen [24]). Such tests almost surely terminate at a random sample size, rejecting the null hypothesis \(H_0\) after finitely many observations if the alternative hypothesis is valid, and with positive probability do not terminate if \(H_0\) is valid (open-ended procedures). In the latter case, an upper bound on the complementary probability, i.e., on the probability of termination under \(H_0\), is an upper bound for the type I error probability.