Abstract
The paper introduces a new test for equality of two distributions in a class of models. We show analytically and by stochastic simulation that the test is highly efficient. For normal and Cauchy distributions that differ only by a shift, the asymptotic power of the test is approximately the same as that of the Wilcoxon-Mann-Whitney, Kolmogorov-Smirnov and Anderson-Darling tests. If the distributions differ in scale parameters, however, the power of the new test is considerably higher.
1 Formulation of the Problem
Let us consider the classical problem of testing the hypothesis of equality of two distributions
$$ H_0: F_1 = F_2 \tag{1} $$
against the alternative
$$ H_1: F_1 \ne F_2 \tag{2} $$
in the case of two independent samples \(X=(X_{1},\ldots , X_{n})\) and \(Y=(Y_{1},\ldots , Y_{m})\) with distribution functions \(F_1\) and \(F_2\), respectively.
It is well known (see e.g. [1]) that when both distributions are normal and differ only in their means, the classical Student test has several optimality properties. If the distributions are not normal but still differ only in their means, the widely used Wilcoxon-Mann-Whitney (WMW) U-statistic is often applied instead. However, it can be shown that if two normal populations differ only in variances, the power of the WMW test is very low. For arbitrary distributions there are universal techniques, such as the Kolmogorov-Smirnov and Cramer-von Mises tests (see [2]) and the Anderson-Darling test (see [3]), but in many cases these tests have low power.
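The remark about variances can be illustrated numerically. The WMW statistic is driven by the probability \(P(X<Y)\), and for two centred normal populations that differ only in variance this probability equals 1/2, exactly as under the null hypothesis. The sketch below estimates it by simulation; the function name, the sample size and the seed are illustrative choices, not taken from the paper.

```python
import random


def prob_x_less_y(sigma_x, sigma_y, n_pairs=20000, seed=0):
    """Monte Carlo estimate of P(X < Y) for X ~ N(0, sigma_x^2), Y ~ N(0, sigma_y^2)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_pairs):
        # Draw one independent pair and record whether X falls below Y.
        if rng.gauss(0.0, sigma_x) < rng.gauss(0.0, sigma_y):
            hits += 1
    return hits / n_pairs


if __name__ == "__main__":
    # Equal variances (null hypothesis) versus a threefold difference in scale.
    print(prob_x_less_y(1.0, 1.0))
    print(prob_x_less_y(1.0, 3.0))
```

Both estimates stay close to 0.5, so a statistic based on this probability cannot separate the two situations.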
Recently, Zech and Aslan [4] suggested a test based on U-statistics with a logarithmic kernel and justified it numerically, for the one-dimensional and multidimensional cases, by comparison with a few alternative techniques. However, to the best of the authors' knowledge, there are no analytical results on its asymptotic power. Here we introduce a similar but different test and provide some analytical results on its power.
2 The New Test and Its Statistical Motivation
Assume that the distribution functions \(F_1\) and \(F_2\) belong to the class of distribution functions of random variables \(\xi \) such that
$$ E\,\ln (1+\xi ^2) < \infty . \tag{3} $$
Many distributions, in particular the Cauchy distribution, have this property.
Among all distributions with a given value of the left-hand side of (3), the Cauchy distribution has maximum entropy.
Consider the following test
where
Here \(g(u)= -\ln (1+|u|^2)\), which is, up to an additive constant, the logarithm of the density of the standard Cauchy distribution. (Note that Zech and Aslan [4] took \(g(u)= -\ln (|u|)\).)
We would like to have a test that is appropriate when the basic distribution belongs to a rather general class of distributions and the alternative distribution differs from it only by shift and scale transformations.
In particular, we consider the class of distributions satisfying (3), but the approach can be generalized to other classes of distributions.
Note that if the parameters are known, the test based on the likelihood ratio is the most powerful among tests of a given size.
The test suggested above can be considered as an approximation of the logarithm of this ratio for the Cauchy distribution. We expect it to be very efficient for all distributions with property (3).
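As an illustration of how a statistic of this kind can be computed, the following minimal sketch uses the kernel \(g(u)=-\ln (1+u^2)\) in the layout of the Zech and Aslan energy statistic: two within-sample sums minus one between-sample sum. The normalisation by \(n^2\), \(m^2\) and \(nm\) is an assumption borrowed from that layout and may differ from the one in (4)–(5); the function names are hypothetical.

```python
import math


def g(u):
    """Kernel g(u) = -ln(1 + u^2), the standard Cauchy log-density up to a constant."""
    return -math.log1p(u * u)


def energy_type_statistic(x, y, kernel=g):
    """Two-sample statistic in the Zech-Aslan layout (assumed normalisation)."""
    n, m = len(x), len(y)
    # Within-sample sums over unordered pairs i < j.
    within_x = sum(kernel(x[i] - x[j]) for i in range(n) for j in range(i + 1, n)) / n**2
    within_y = sum(kernel(y[i] - y[j]) for i in range(m) for j in range(i + 1, m)) / m**2
    # Between-sample sum over all pairs (i, j).
    between = sum(kernel(xi - yj) for xi in x for yj in y) / (n * m)
    return within_x + within_y - between


if __name__ == "__main__":
    print(energy_type_statistic([0.0, 1.0], [0.0, 1.0]))    # identical samples
    print(energy_type_statistic([0.0, 1.0], [10.0, 11.0]))  # well separated samples
```

Under this layout, large values of the statistic indicate that the two samples are far apart relative to their internal spread.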
3 The Analytical Study of Asymptotic Power
Let us consider the case of two distributions having property (3) and, in particular, two distributions that differ only by a shift. To simplify the notation, assume that \(m=n\); the case \(m\ne n\) is similar. The criterion (4)–(5) now assumes the form
Denote by C(u, v) the Cauchy distribution with location u, scale v and density function
$$ \frac{v}{\pi \left( v^2 + (x-u)^2\right) }. $$
Let f(x) denote the density of \(F_1\). Denote
where \(g(u)= - \ln (1+|u|^2)\).
By expanding the function \(g(u)= -\ln (1+|u|^2)\) into a Taylor series we obtain that for an arbitrary density function f(x) there exists the finite limit
and it is equal to
(Note that differentiation under the integral sign is justified since the second derivative \(g ''_\theta (x-y-\theta )|_{\theta =0}\) does not exceed 2 in absolute value.) That is
Denote
The main analytical result of the present paper is the following.
Theorem 1
Consider the problem of testing the hypothesis of equality of two distributions (1)–(2), where both distribution functions have property (3). Then:
(i) As \(n \rightarrow \infty \), the distribution function of \(nT_n\) converges under \(H_0\) to that of the random variable
$$ (aL)^2, \tag{9} $$
where L has the normal distribution with zero expectation and variance equal to 1, and \(a>0\) is some number.
(ii) Let \(F_1(x)= F(x)\), \(F_2(x)=F(x+\theta )\), where F is an arbitrary distribution function that is symmetric around a point and possesses property (3), \(\theta =h/\sqrt{n}\), and h is an arbitrary given number. Then the distribution function of \(nT_n\) converges under \(H_1\) to that of the random variable
$$ (aL + b)^2, $$
where \(b=0\) under \(H_0\) and \(b=\bar{b} h\) under \(H_1\). In this case the power of the criterion \(T_n\) at significance level \(\alpha \) is asymptotically equal to
$$ Pr\{L\ge z_{1-\alpha /2}-\bar{b}h/a\} + Pr\{L\le - z_{1-\alpha /2}-\bar{b}h/a\}, $$
where \(z_{1-\alpha /2}\) is such that
$$ Pr \{L\ge z_{1-\alpha /2}\}= \alpha /2. $$
If \(F_1=C(\nu ,1)\) and \(F_2=C(\nu + \theta ,1)\), then \( b= h/3\).
Note that finding an analytical expression for the coefficient a is a difficult problem that remains unsolved. However, this coefficient can easily be found by stochastic simulation. In the case of the Cauchy distribution we found the heuristic formula \(3a^2= J_0\), which means \(a =\sqrt{(2/3)\ln 3}\). This formula provides a very accurate approximation of the empirical power (see Tables 1, 2 and 3 in the next section).
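Thus, in the case of Cauchy distributions with scale parameter equal to 1, the power of the criterion \(T_n\) at significance level \(\alpha \) is approximately equal to
$$ Pr\{L\ge z_{1-\alpha /2}-h/\sqrt{6\ln 3}\} + Pr\{L\le - z_{1-\alpha /2}-h/\sqrt{6\ln 3}\}. $$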
The proof of the theorem is given in the Appendix.
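The power formula of part (ii) of Theorem 1 is easy to evaluate numerically. The sketch below does so with the standard library; the function and constant names are hypothetical, and the Cauchy values \(\bar{b}=1/3\) and \(a=\sqrt{(2/3)\ln 3}\) are the ones quoted above, the latter being the heuristic value rather than a proved one.

```python
import math
from statistics import NormalDist


def asymptotic_power(h, a, b_bar, alpha=0.05):
    """Evaluate Pr{L >= z - b_bar*h/a} + Pr{L <= -z - b_bar*h/a} for standard normal L."""
    std_normal = NormalDist()
    z = std_normal.inv_cdf(1.0 - alpha / 2.0)  # upper alpha/2 quantile
    delta = b_bar * h / a                      # standardised shift
    upper_tail = 1.0 - std_normal.cdf(z - delta)
    lower_tail = std_normal.cdf(-z - delta)
    return upper_tail + lower_tail


# Values quoted in the text for the Cauchy case with unit scale.
A_CAUCHY = math.sqrt((2.0 / 3.0) * math.log(3.0))  # heuristic: 3 a^2 = J_0
B_BAR_CAUCHY = 1.0 / 3.0


def cauchy_shift_power(h, alpha=0.05):
    """Asymptotic power for C(nu, 1) against C(nu + h/sqrt(n), 1)."""
    return asymptotic_power(h, A_CAUCHY, B_BAR_CAUCHY, alpha)


if __name__ == "__main__":
    for h in (0.0, 1.0, 2.0, 4.0, 8.0):
        print(h, round(cauchy_shift_power(h), 4))
```

At \(h=0\) the expression reduces to \(\alpha \), and it is symmetric in h, as expected for a two-sided criterion.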
4 Simulation Results
In this section we present numerical results on the efficiency of the new criterion in comparison with a few alternative criteria.
Tables 1–12 give the results for n = 100, 500, 1000 and different values of h with \(\alpha =0.05\), for normal and Cauchy distributions that differ either by shift or by scale parameters. The critical values were calculated in two ways: by simulation of the initial distribution and by random permutations (we used 800 random permutations in all cases). It is worth noting that the results are very similar. Since the permutation technique is more universal, it can be recommended for practical applications.
Note that in all cases where the distributions differ only in the shift parameter, the powers of \(T_n\) and of the Wilcoxon-Mann-Whitney, Kolmogorov-Smirnov and Anderson-Darling tests are approximately equal. Note also that if the variances are not standard but are known, we should simply apply the corresponding normalisation. For the cases where the distributions differ in scale parameters, however, the Wilcoxon-Mann-Whitney test is not appropriate at all, and the power of the Kolmogorov-Smirnov and Anderson-Darling tests is considerably lower.
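The permutation calibration mentioned in this section can be sketched as follows. The statistic uses the kernel \(-\ln (1+u^2)\) in the Zech and Aslan layout, which is an assumption about normalisation rather than a transcription of (4)–(5). The default of 800 permutations follows the text; the function names, the seed and the p-value convention (counting the observed arrangement once) are illustrative choices.

```python
import math
import random


def statistic(x, y):
    """Energy-type two-sample statistic with kernel -ln(1 + u^2) (assumed normalisation)."""
    def kernel(u):
        return -math.log1p(u * u)

    n, m = len(x), len(y)
    within_x = sum(kernel(x[i] - x[j]) for i in range(n) for j in range(i + 1, n)) / n**2
    within_y = sum(kernel(y[i] - y[j]) for i in range(m) for j in range(i + 1, m)) / m**2
    between = sum(kernel(xi - yj) for xi in x for yj in y) / (n * m)
    return within_x + within_y - between


def permutation_p_value(x, y, n_perm=800, seed=0):
    """Permutation p-value: share of random relabellings with a statistic >= the observed one."""
    rng = random.Random(seed)
    observed = statistic(x, y)
    pooled = list(x) + list(y)
    n = len(x)
    exceed = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # random reassignment of the pooled observations
        if statistic(pooled[:n], pooled[n:]) >= observed:
            exceed += 1
    # Add one to numerator and denominator so the p-value is never exactly zero.
    return (exceed + 1) / (n_perm + 1)


if __name__ == "__main__":
    rng = random.Random(1)
    x = [rng.gauss(0.0, 1.0) for _ in range(30)]
    y = [rng.gauss(0.0, 3.0) for _ in range(30)]  # same centre, different scale
    print(permutation_p_value(x, y, n_perm=200))
```

The cost is quadratic in the sample size for each permutation, so for n = 1000 a vectorised implementation would be preferable to this plain loop.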
5 Conclusion
In this paper we suggested a new test for equality of two distributions. For a wide class of distributions it was proved that the limiting distribution of the test statistic is that of the square of a normal random variable. This makes it possible to find the asymptotic power analytically for distributions that differ only by a shift, up to an unknown parameter that can be found by stochastic simulation. The high efficiency of the test was confirmed by stochastic simulations.
References
1. Lehmann, E.: Testing Statistical Hypotheses, Probability and Statistics Series. Wiley, Hoboken (1986)
2. Buening, H.: Kolmogorov-Smirnov and Cramer-von Mises type two-sample tests with various weight functions. Commun. Stat.-Simul. Comput. 30, 847–865 (2001)
3. Anderson, T.W.: Anderson-Darling tests of goodness-of-fit. In: Lovric, M. (ed.) International Encyclopedia of Statistical Science. Springer, Heidelberg (2011). https://doi.org/10.1007/978-3-642-04898-2_118
4. Zech, G., Aslan, B.: New test for the multivariate two-sample problem based on the concept of minimum energy. J. Stat. Comput. Simul. 75(2), 109–119 (2005)
5. Hoeffding, W.: A class of statistics with asymptotically normal distribution. Ann. Math. Stat. 19, 293–325 (1948)
6. Gradshteyn, I.S., Ryzhik, I.M.: Table of Integrals, Series and Products. 7th edn. Amsterdam, Boston, Heidelberg, London
7. Prudnikov, A.P., Brychkov, Y.A., Marichev, O.I.: Integrals and Series. Elementary Functions. Nauka, Moscow (1981). [in Russian]
Acknowledgments
The authors are indebted to Professor Yakov Nikitin for his help in calculating the integrals. The work of Viatcheslav Melas was supported by RFBR (grant N 20-01-00096).
6 Appendix
Proof of Theorem 1. Let us first consider the test (4)–(5) with the function \(g(u)=-u^2\), which is, up to a constant factor and an additive constant, the logarithm of the density of the standard normal distribution.
Lemma 1
For \(g(x)= x^2\) the following identity holds
where
Denote
The proof follows from the known formula [see e.g. [5], p. 296]
and the obvious identity
by direct but nontrivial calculations.
Indeed, let us use the standard notation
The quantities \(S_y^2\) and \(S_z^2\) are defined similarly. Denote
Note that due to formula (10) for X replaced by Z
Therefore
and we obtain
Thus Lemma 1 is proved. It follows from this lemma that the criterion \(\varPhi _{nn}\) in this case is equivalent to the criterion \((\bar{x} - \bar{y})^2.\)
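The reduction to \((\bar{x} - \bar{y})^2\) can be checked numerically. The sketch below assumes the Zech and Aslan layout with normalisation by \(n^2\), \(m^2\) and \(nm\) and the kernel \(g(u)=-u^2\); under that assumption the statistic coincides with the squared difference of sample means exactly, whereas the identity of Lemma 1 may carry additional terms if the normalisation in (4)–(5) differs. All names and data are illustrative.

```python
from statistics import fmean


def statistic(x, y, kernel):
    """Energy-type statistic: two within-sample sums minus the between-sample sum."""
    n, m = len(x), len(y)
    within_x = sum(kernel(x[i] - x[j]) for i in range(n) for j in range(i + 1, n)) / n**2
    within_y = sum(kernel(y[i] - y[j]) for i in range(m) for j in range(i + 1, m)) / m**2
    between = sum(kernel(xi - yj) for xi in x for yj in y) / (n * m)
    return within_x + within_y - between


def quadratic_kernel(u):
    """g(u) = -u^2, the normal log-density up to a factor and a constant."""
    return -u * u


if __name__ == "__main__":
    x = [0.3, -1.2, 2.5, 0.7]
    y = [1.1, 4.0, -0.4, 2.2, 3.3]
    lhs = statistic(x, y, quadratic_kernel)
    rhs = (fmean(x) - fmean(y)) ** 2
    print(lhs, rhs)  # the two values agree up to rounding error
```

The agreement holds for unequal sample sizes as well, because the within-sample sums cancel the variance terms of the between-sample sum.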
Let us turn to the proof of the theorem.
Assume that either \(H_0\) or \(H_1\) holds. Then, by the law of large numbers for U-statistics [5], each of the sums
tends to \(J_0\).
Moreover,
Note that
Let us apply the limit theorem for U-statistics (see Theorem 7.1 in [5]) to each of the three terms in brackets. We obtain that \(nT_n\) tends to a random variable with a finite variance. Note that the conditions of the limit theorem are fulfilled for distributions \(F_1\) with property (3).
Note that \(0 \le \ln (1+x^2) \le x^2\). For this reason \(\varPhi _{AB}\) lies between 0 and \(S_{xy}\). By the mean value theorem it is equal to \(c_nS_{xy}\) with \(0<c_n<1\), and \(c_n\) tends to a constant c as \(n \rightarrow \infty \). In a similar way, \(\varPhi _{A}+ \varPhi _{B}= c_{1n}\frac{n}{n-1}(S_x^2+S_y^2)\), where \(c_{1n}\) tends to \(c_1\) and \(c_1=c\).
Let C be an arbitrary positive number,
where \(\tilde{X_{i}}=X_{i}\) if \( |X_{i}| \le C\); otherwise \(\tilde{X_i}=C\) if \(X_{i}>0\) and \(\tilde{X_i}=-C\) if \(X_{i}<0\). The variables \(\tilde{Y_{i}}\) are defined similarly.
Consider the function
Due to the representations for \(\varPhi _{AB}\), \(\varPhi _{A}\) and \(\varPhi _{B}\) derived above, it can be checked that there exist a value \(t_n\), depending on \(\tilde{X}\) and \(\tilde{Y}\), and numbers \(B_n\) such that it is equal to
and \(B_n\) is o(1).
Consider expression (14)–(15). Note that for distributions \(F_1\) and \(F_2\) satisfying (3), with \(\tilde{X}_i\) and \(\tilde{Y}_i\) replaced by \(X_i\) and \(Y_i\), respectively, its variance is bounded from above because \(nT_n\) tends to a random variable with a finite variance. Therefore the expression (14)–(15) tends, as \(n\rightarrow \infty \), to a random variable with a finite variance for arbitrary C. Passing to the limit as \(n\rightarrow \infty \), we obtain by the central limit theorem that (16) has a limit distribution of the form (9), where L has the standard normal distribution. Since C is arbitrary, the limiting distribution has the required form.
For determining b in the part (ii) of the theorem we now can use the equality
which follows from the equality between (14)–(15) and (16). If \(H_0\) holds we obviously have \(b=0\). If \(H_1\) holds, \(EnT_n\) is asymptotically equivalent to
where \(\hat{T_n}\) is obtained from \(T_n\) by replacing \(Y_i\) by \(Y_i - b/\sqrt{n},\) \(i=1,\dots ,n\), and, passing to the limit as \(n\rightarrow \infty \), we obtain
The asymptotic behaviour of the power stated in (ii) follows from the asymptotic normality of \(\sqrt{nT_n}\). In order to calculate \(\bar{b}\) when \(F_1\) is the standard Cauchy distribution, the following result is crucial.
Lemma 2
If X and Y are independent random variables with the distribution C(0, 1), then
In order to prove this Lemma we need the following integrals
([6] 4.296.2 and 4.295.7.)
[see [7], formula (2.6.14.19)]. Using these integrals we obtain
Substituting \(\theta =0\) here we obtain both formulas of the Lemma. Note that \(n\theta ^2=h^2\) and
Therefore we obtain \(\bar{b} = 1/3\), which completes the proof of the theorem.
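The value \(\bar{b}=1/3\) can be cross-checked numerically. For independent X and Y with distribution C(0, 1), the difference \(X-Y\) has distribution C(0, 2), and the Poisson-kernel property of the Cauchy density gives \(E\ln (1+(X-Y-\theta )^2)=\ln (9+\theta ^2)\). This closed form is a standard fact stated here for the check and is not quoted from the paper. It yields \(2\ln 3\) at \(\theta =0\), consistent with \(a=\sqrt{(2/3)\ln 3}\) and \(3a^2=J_0\), and \(n\,[\ln (9+h^2/n)-\ln 9]\rightarrow h^2/9=(h/3)^2\). The sketch compares the closed form with a midpoint-rule quadrature; the function names and the grid size are illustrative.

```python
import math


def expected_log_kernel(theta, n_grid=100_000):
    """Midpoint-rule value of E ln(1 + (X - Y - theta)^2) for X, Y iid C(0, 1)."""
    # X - Y ~ C(0, 2), so X - Y = 2 tan(t) with t uniform on (-pi/2, pi/2).
    step = math.pi / n_grid
    total = 0.0
    for k in range(n_grid):
        t = -math.pi / 2.0 + (k + 0.5) * step
        d = 2.0 * math.tan(t) - theta
        total += math.log1p(d * d)
    return total / n_grid  # average over the uniform grid in t


def closed_form(theta):
    """ln(9 + theta^2), from the Poisson-kernel property of the Cauchy density."""
    return math.log(9.0 + theta * theta)


if __name__ == "__main__":
    for theta in (0.0, 0.5, 1.0, 2.0):
        print(theta, expected_log_kernel(theta), closed_form(theta))
    # Scaled increment n * (J(h / sqrt(n)) - J(0)) against its limit h^2 / 9.
    h, n = 2.0, 10_000
    print(n * (closed_form(h / math.sqrt(n)) - closed_form(0.0)), h * h / 9.0)
```

The integrand has only a logarithmic singularity at the ends of the interval, so the plain midpoint rule is accurate enough for this comparison.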
© 2021 The Author(s), under exclusive license to Springer Nature Switzerland AG
Melas, V., Salnikov, D. (2021). On Asymptotic Power of the New Test for Equality of Two Distributions. In: Shiryaev, A.N., Samouylov, K.E., Kozyrev, D.V. (eds) Recent Developments in Stochastic Methods and Applications. ICSM-5 2020. Springer Proceedings in Mathematics & Statistics, vol 371. Springer, Cham. https://doi.org/10.1007/978-3-030-83266-7_15
DOI: https://doi.org/10.1007/978-3-030-83266-7_15
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-83265-0
Online ISBN: 978-3-030-83266-7
eBook Packages: Mathematics and Statistics, Mathematics and Statistics (R0)