1 Formulation of the Problem

Let us consider the classical problem of testing the hypothesis of equality of two distributions

$$\begin{aligned} H_0\,:\,F_1 = F_2 \end{aligned}$$
(1)

against the alternative

$$\begin{aligned} H_1\,:\,F_1 \not = F_2 \end{aligned}$$
(2)

in the case of two independent samples \(X=(X_{1},\ldots , X_{n})\) and \(Y=(Y_{1},\ldots , Y_{m})\) with distribution functions \(F_1\) and \(F_2\), respectively.

It is well known (see e.g. [1]) that when both distributions are normal and differ only in their means, the classical Student test has several optimality properties. If the distributions are not normal but still differ only in their means, the widely popular Wilcoxon-Mann-Whitney (WMW) U-statistic is often used instead. However, it can be shown that if two normal populations differ only in their variances, the power of the WMW test is very low. If the distributions are arbitrary, there are universal techniques that can be applied, such as the Kolmogorov-Smirnov and Cramer-von Mises tests (see [2]) and the Anderson-Darling test (see [3]), but in many cases these tests may not be powerful.

Recently, Zech and Aslan [4] suggested a test based on U-statistics with a logarithmic kernel and provided its numerical justification in one- and multi-dimensional cases in comparison with several alternative techniques. However, to the best of the authors' knowledge, there are no analytical results about its asymptotic power. Here we introduce a similar but different test and provide some analytical results on its power.

2 The New Test and Its Statistical Motivation

Assume that the distribution functions \(F_1\) and \(F_2\) belong to the class of distribution functions of random variables \(\xi \) such that

$$\begin{aligned} E [\ln ^2 (1+ \xi ^2) ] < \infty . \end{aligned}$$
(3)

Many distributions and, in particular, the Cauchy distribution have this property.

Among all distributions with a given left-hand side of (3), the Cauchy distribution has the maximum entropy.

Consider the test based on the following statistics:

$$\begin{aligned} \varPhi _{A}=-\frac{1}{n^2}\sum _{1\le i<j\le n} g(X_i-X_j), \qquad \varPhi _{B}=-\frac{1}{m^2}\sum _{1\le i<j\le m} g(Y_i-Y_j), \end{aligned}$$
(4)
$$\begin{aligned} \varPhi _{AB}=-\frac{1}{nm}\sum _{i=1}^n\sum _{j=1}^m g(X_i-Y_j), \qquad \varPhi _{nm}=\varPhi _{AB}-\varPhi _{A}- \varPhi _{B}, \end{aligned}$$
(5)

where

$$ g(u)= -\ln (1+|u|^2), $$

Up to an additive constant, g(x) is the logarithm of the density of the standard Cauchy distribution. (Note that Zech and Aslan (2005) took \(g(u)= -\ln (|u|)\).)
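As an illustration, the statistics (4)–(5) can be computed directly (a minimal sketch; the function names `g` and `phi_nm` are ours, and the quadratic-time loops are kept for clarity):

```python
import numpy as np

def g(u):
    # Logarithmic kernel: g(u) = -ln(1 + |u|^2)
    return -np.log1p(u ** 2)

def phi_nm(x, y):
    """The statistic Phi_nm = Phi_AB - Phi_A - Phi_B of (4)-(5)."""
    n, m = len(x), len(y)
    # Within-sample sums over pairs i < j, normalised by n^2 (resp. m^2)
    phi_a = -sum(g(x[i] - x[j]) for i in range(n) for j in range(i + 1, n)) / n ** 2
    phi_b = -sum(g(y[i] - y[j]) for i in range(m) for j in range(i + 1, m)) / m ** 2
    # Cross-sample sum over all n*m pairs
    phi_ab = -g(x[:, None] - y[None, :]).sum() / (n * m)
    return phi_ab - phi_a - phi_b
```

For two identical samples the within-sample and cross-sample terms cancel exactly, so the statistic vanishes; it grows as the samples separate.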

We would like to have a test that is appropriate for the case where the basic distribution belongs to a rather general class of distributions and the alternative distribution differs only by shift and scale transformations.

In particular, we consider the class of distributions satisfying (3), but the approach can be generalized for other classes of distributions.

Note that if the parameters of the distributions are known, the test based on the likelihood ratio is the most powerful one for those parameters.

The test suggested above can be regarded as an approximation of the logarithm of this likelihood ratio for the Cauchy distribution. We expect it to be highly efficient for all distributions with property (3).

3 The Analytical Study of Asymptotic Power

Let us consider the case of two distributions having property (3) and, in particular, two that differ only by a shift. To simplify notation, assume that \(m=n\); the case \(m\ne n\) is similar. The criterion (4)–(5) then takes the form

$$\begin{aligned} T_n=\varPhi _{nn}= \frac{1}{n^2}\sum _{i,j=1}^n \ln (1 + (X_i - Y_j)^2)-\frac{1}{n^2}\sum _{1\le i<j\le n} \ln (1 + (X_i - X_j)^2) \end{aligned}$$
(6)
$$\begin{aligned} - \frac{1}{n^2}\sum _{1\le i<j\le n} \ln (1 + (Y_i - Y_j)^2). \end{aligned}$$
(7)

Denote by \(C(u,v)\) the Cauchy distribution with the density function

$$ v/(\pi (v^2 + (x-u)^2)). $$

Let f(x) denote the density of \(F_1\). Denote

$$\begin{aligned} J_h =-\int _{R^2} g(x-y-|h|/\sqrt{n})f(x)f(y)\,dx\,dy, \end{aligned}$$

where \(g(u)= - \ln (1+|u|^2)\).

By expanding the function g(u) into a Taylor series, we obtain that for an arbitrary density function f(x) there exists the finite limit

$$\begin{aligned} J^*(h)= \lim _{n\rightarrow \infty } n(J_h - J_0) \end{aligned}$$
(8)

and it is equal to

$$ -\frac{1}{2}h^2 \int _{R^2} g''_\theta (x-y-\theta )f(x)f(y)\,dx\,dy\Big |_{\theta =0}. $$

(Note that the differentiation under the integral sign is justified since the derivative \(g ''_\theta (x-y-\theta )|_{\theta =0}\) is bounded by 2 in absolute value.) That is,

$$ J^*(h)=h^2 \int _{R^2} \frac{1-(x-y)^2}{(1+(x-y)^2)^2} f(x)f(y)\,dx\,dy. $$

Denote

$$ \bar{b} =\sqrt{J^*(h)/h^2}. $$
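As a numerical sanity check (a Monte Carlo sketch, not part of the derivation): for \(F_1=F_2=C(0,1)\), the difference of two independent observations is Cauchy with scale 2, so \(J^*(h)/h^2\) can be estimated by averaging the integrand over such differences; the estimate agrees with the value \(\bar b = 1/3\) implied by \(b=h/3\) for the Cauchy case in Theorem 1 below:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 2_000_000
# For X, Y independent standard Cauchy, X - Y is Cauchy with scale 2
u = rng.standard_cauchy(N) - rng.standard_cauchy(N)
# Monte Carlo estimate of J*(h)/h^2 = E[(1 - U^2) / (1 + U^2)^2], U = X - Y
ratio = np.mean((1.0 - u ** 2) / (1.0 + u ** 2) ** 2)
b_bar = np.sqrt(ratio)
print(b_bar)  # close to 1/3 (cf. b = h/3 in Theorem 1)
```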

The basic analytical result of the present paper is the following theorem.

Theorem 1

Consider the problem of testing the hypothesis of equality of two distributions (1)–(2), where both distribution functions have property (3). Then

  1. (i)

as \(n \rightarrow \infty \), the distribution function of \(nT_n\) converges under \(H_0\) to that of the random variable

    $$\begin{aligned} (aL)^2, \end{aligned}$$
    (9)

where L has the normal distribution with zero expectation and unit variance, and \(a>0\) is some constant.

  2. (ii)

Let \(F_1(x)= F(x)\), \(F_2(x)=F(x+\theta )\), where F is an arbitrary distribution function that is symmetric about a point and possesses property (3), and \(\theta =h/\sqrt{n}\), where h is an arbitrary given number. Then the distribution function of \(nT_n\) converges under \(H_1\) to that of the random variable

    $$ (aL + b)^2, $$

where \(b=0\) under \(H_0\) and \(b=\bar{b} h\) under \(H_1\). In this case the power of the criterion \(T_n\) with significance level \(\alpha \) is asymptotically given by the formula

    $$ Pr\{L\ge z_{1-\alpha /2}-\bar{b}h/a\} + Pr\{L\le - z_{1-\alpha /2}-\bar{b}h/a\}, $$

    where \(z_{1-\alpha /2}\) is such that

    $$ Pr \{L\ge z_{1-\alpha /2}\}= \alpha /2. $$

If \(F_1=C(\nu ,1)\), \(F_2=C(\nu + \theta ,1)\), then \( b= h/3\).

Note that obtaining an analytical expression for the coefficient a is a difficult problem that has not been solved to date. However, this coefficient can easily be found by stochastic simulation. In the case of the Cauchy distribution we found the heuristic formula \(3a^2= J_0\), which means \(a =\sqrt{(2/3)\ln 3}\). This formula provides a very accurate approximation of the empirical power (see Tables 1, 2 and 3 in the next section).
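The ingredients of the heuristic formula can be checked numerically (a sketch under our assumptions): for \(F_1=F_2=C(0,1)\), \(X-Y\) is Cauchy with scale 2, and \(J_0 = E\ln (1+(X-Y)^2) = 2\ln 3\), so \(3a^2=J_0\) gives exactly \(a=\sqrt{(2/3)\ln 3}\):

```python
import math
import numpy as np

rng = np.random.default_rng(0)
N = 2_000_000
# For X, Y independent standard Cauchy, X - Y is Cauchy with scale 2
u = rng.standard_cauchy(N) - rng.standard_cauchy(N)
j0 = np.log1p(u ** 2).mean()   # Monte Carlo estimate of J_0 = E[ln(1 + (X - Y)^2)]
a = math.sqrt(j0 / 3.0)        # heuristic 3 a^2 = J_0
print(j0, a)  # J_0 close to 2 ln 3 ~ 2.197, a close to sqrt((2/3) ln 3) ~ 0.856
```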

Thus, in the case of Cauchy distributions with scale parameter equal to 1, the power of the criterion \(T_n\) with significance level \(\alpha \) is approximately equal to

$$ Pr\{L\ge z_{1-\alpha /2}-(1/\sqrt{6\ln 3})h\} + Pr\{L\le - z_{1-\alpha /2}-(1/\sqrt{6\ln 3})h\}. $$

The proof of the theorem is given in the Appendix.
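The approximate power above is easy to evaluate (a sketch; `cauchy_shift_power` is our name, and it combines \(\bar b = 1/3\) with the heuristic \(a=\sqrt{(2/3)\ln 3}\), so that \(\bar b h/a = h/\sqrt{6\ln 3}\)):

```python
import math
from statistics import NormalDist

def cauchy_shift_power(h, alpha=0.05):
    """Approximate asymptotic power of T_n for X ~ C(0,1), Y ~ C(h/sqrt(n), 1)."""
    nd = NormalDist()
    z = nd.inv_cdf(1 - alpha / 2)          # two-sided critical value z_{1-alpha/2}
    shift = h / math.sqrt(6 * math.log(3)) # bbar * h / a = h / sqrt(6 ln 3)
    return (1 - nd.cdf(z - shift)) + nd.cdf(-z - shift)
```

At \(h=0\) this reduces to the significance level \(\alpha \), and the power grows with |h|.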

4 Simulation Results

In this section we present numerical results on the efficiency of the new criterion in comparison with several alternative criteria.

Tables 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11 and 12 give results for the cases n = 100, 500, 1000 and different values of h with \(\alpha =0.05\) for normal and Cauchy distributions that differ either in shift or in scale parameters. The critical values were calculated in two ways: by simulation of the initial distribution and by random permutations (we used 800 random permutations in all cases). It is worth noting that the results are very similar. Since the permutation technique is more universal, it can be recommended for practical applications.
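The permutation calculation can be sketched as follows (a generic two-sample permutation p-value; the helper name is ours, and in our setting `stat` would be the statistic \(T_n\)):

```python
import numpy as np

def permutation_pvalue(x, y, stat, n_perm=800, seed=0):
    """Two-sample permutation p-value for a statistic stat(x, y),
    using n_perm random permutations of the pooled sample."""
    rng = np.random.default_rng(seed)
    pooled = np.concatenate([x, y])
    n = len(x)
    observed = stat(x, y)
    count = 0
    for _ in range(n_perm):
        perm = rng.permutation(pooled)
        # Relabel the pooled sample and recompute the statistic
        if stat(perm[:n], perm[n:]) >= observed:
            count += 1
    # Add-one correction keeps the p-value strictly positive
    return (count + 1) / (n_perm + 1)
```

Comparing `observed` with the permutation distribution is valid under \(H_0\) because the pooled observations are then exchangeable.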

Note that in all these cases where the distributions differ only in the shift parameter, the power of \(T_n\) and that of the Wilcoxon-Mann-Whitney, Kolmogorov-Smirnov and Anderson-Darling tests were approximately equal to each other. It can also be pointed out that if the variances are not standard but known, one should simply apply the corresponding normalisation. However, in the cases where the distributions differ in scale parameters, the Wilcoxon-Mann-Whitney test is not appropriate at all, and the power of the Kolmogorov-Smirnov and Anderson-Darling tests is considerably lower.

Table 1. Cauchy distribution, \(X\sim C(0,1)\), \(Y\sim C(h/\sqrt{n},1)\), \(n=100\)
Table 2. Cauchy distribution, \(X\sim C(0,1)\), \(Y\sim C(h/\sqrt{n},1)\), \(n=500\)
Table 3. Cauchy distribution, \(X\sim C(0,1)\), \(Y\sim C(h/\sqrt{n},1)\), \(n=1000\)
Table 4. Cauchy distribution, \(X\sim C(0,1)\), \(Y\sim C(0, 1 + h/\sqrt{n})\), \(n=100\)
Table 5. Cauchy distribution, \(X\sim C(0,1)\), \(Y\sim C(0, 1 + h/\sqrt{n})\), \(n=500\)
Table 6. Cauchy distribution, \(X\sim C(0,1)\), \(Y\sim C(0, 1 + h/\sqrt{n})\), \(n=1000\)
Table 7. Normal distribution, \(X\sim N(0,1)\), \(Y\sim N(h/\sqrt{n},1)\), \(n=100\)
Table 8. Normal distribution, \(X\sim N(0,1)\), \(Y\sim N(h/\sqrt{n},1)\), \(n=500\)
Table 9. Normal distribution, \(X\sim N(0,1)\), \(Y\sim N(h/\sqrt{n},1)\), \(n=1000\)
Table 10. Normal distribution, \(X\sim N(0,1)\), \(Y\sim N(0, 1 + h/\sqrt{n})\), \(n=100\)
Table 11. Normal distribution, \(X\sim N(0,1)\), \(Y\sim N(0, 1 + h/\sqrt{n})\), \(n=500\)
Table 12. Normal distribution, \(X\sim N(0,1)\), \(Y\sim N(0, 1 + h/\sqrt{n})\), \(n=1000\)

5 Conclusion

In this paper we suggested a new test for the equality of two distributions. For a wide class of distributions it was proved that the limiting distribution of the test statistic is that of the square of a normal random variable. This allows one to find the asymptotic power analytically in the case of distributions that differ only by a shift, up to an unknown parameter that can be found by stochastic simulation. The high efficiency of the test was confirmed by stochastic simulations.