1 Introduction

The Gini distance correlation of Dang et al. (2021) was proposed to measure dependence between a numerical random variable \(\varvec{X}\) in \(\mathbb {R}^q\) and a categorical variable Y. Suppose that Y takes values \(L_1,...,L_K\) with distribution \(P_Y\) given by \(P(Y = L_k) = p_k>0\) for \(k=1,2,...,K\). Let \(\varvec{X}\) have distribution F, and let the conditional distribution of \(\varvec{X}\) given \(Y=L_k\) be \(F_k\). Let \((\varvec{X}, \varvec{X}')\) and \((\varvec{X}^{(k)}, \varvec{X}^{(k)'})\) be independent pairs of variables from F and \(F_k\), respectively. The Gini distance covariance is then defined as

$$\begin{aligned} {gCov}(\varvec{X}, Y)= \sum _{k=1}^K p_k T(\varvec{X}^{(k)}, \varvec{X}), \end{aligned}$$
(1)

where \(T(\varvec{X}^{(k)},\varvec{X})=2\mathbb {E}\Vert \varvec{X}^{(k)}-\varvec{X}\Vert -\mathbb {E}\Vert \varvec{X}^{(k)}-\varvec{X}^{(k)'}\Vert -\mathbb {E}\Vert \varvec{X}-\varvec{X}'\Vert \) is the energy distance between \(F_k\) and F (Székely and Rizzo 2013, 2017). The Gini distance covariance is the weighted average of the energy distances between the \(F_k\) and F, which implies that \({gCov}(\varvec{X}, Y)=0\) if and only if \(F_1=F_2=\cdots =F_K=F\). That is, zero Gini distance covariance is equivalent to independence between \(\varvec{X}\) and Y. The Gini distance correlation standardizes the Gini distance covariance by

$$\begin{aligned} \rho _g(\varvec{X}, Y) = \dfrac{\sum _{k=1}^K p_k T(\varvec{X}^{(k)}, \varvec{X})}{\mathbb {E}\Vert \varvec{X}-\varvec{X}'\Vert }, \end{aligned}$$
(2)

which takes values in [0, 1]. The naive estimator of the Gini distance covariance in (1) is a linear combination of U-statistics or V-statistics. Under independence between \(\varvec{X}\) and Y, these estimators are degenerate and hence converge to an infinite sum of quadratic forms of centered Gaussian random variables (Dang et al. 2021). This limit cannot easily be applied to test the equality of K distributions because it is an infinite sum, and finding the weights in the degenerate limit is also a difficult problem. In high dimension, as q diverges, the degenerate estimator admits a normal limit (Sang and Dang 2023). In this paper, we aim to establish a normal limit under the regular setting where q is fixed.

Ahmad (1993) provided a method for testing goodness of fit by adding weights to the Cramér-von Mises statistic, so that the modified estimator is asymptotically normal under the null hypothesis of the goodness-of-fit problem. The Cramér-von Mises statistic estimates the \(L_2\) distance between a completely specified distribution and the underlying distribution. The Gini distance covariance and correlation are Gini-distance-based dependence measures. In order to achieve asymptotic normality under the null of independence between \(\varvec{X}\) and Y, we modify the aforementioned V-estimator by adopting the approach proposed in Ahmad (1993).

Zhang et al. (2019) extended the Gini distance covariance and the Gini distance correlation (GDC) to a reproducing kernel Hilbert space (RKHS) via a Mercer kernel induced distance. The generalized covariance and correlation also characterize independence between \(\varvec{X}\) and Y. As with the GDC, the empirical versions of the generalized measures are degenerate under independence between \(\varvec{X}\) and Y. We provide modified estimators for the generalized Gini distance covariance and GDC in an RKHS which admit normal limits under the null of independence. Makigusa and Naito (2020) constructed a consistent estimator of the maximum mean discrepancy in a Hilbert space that yields a normal limit when the maximum mean discrepancy is zero. Their result was generalized to the K-sample problem in Balogoun et al. (2021), and Manfoumbi Djonguet et al. (2024) adopted this method for independence testing between two functional variables.

Throughout this paper, \(\Vert \cdot \Vert \) represents the Euclidean norm, that is, \(\Vert \varvec{a}\Vert =\sqrt{a^2_1+a^2_2+\cdots +a^2_q}\) for a q-vector \(\varvec{a}=(a_1, a_2, \cdots , a_q)^T\) in \(\mathbb {R}^q\). For two sequences of real numbers, \(a_n\) and \(b_n\), \(a_n=o(b_n)\) means \(\lim _{n \rightarrow \infty }{a_n}/{b_n}=0\), and \(a_n=O(b_n)\) means \(L \le {a_n}/{b_n} \le U\) for some finite constants L and U. For sequences of random variables, the analogous notations \(o_p(b_n)\) and \(O_p(b_n)\) denote the corresponding relationships holding in probability.

The remainder of the paper is organized as follows. In Sect. 2, we provide the modified estimator for the Gini distance covariance and the asymptotic distribution. Section 3 is devoted to the modified estimator for the generalized Gini distance covariance in RKHS. In Sect. 4, we conduct simulation studies to evaluate the performance of the proposed modified test statistics. We conclude and discuss future works in Sect. 5. All technical proofs are provided in Appendix.

2 Modified Gini distance covariance estimator

There is an alternative representation of the Gini distance covariance and correlation using multivariate Gini mean differences (GMD), defined as

$$\begin{aligned} \Delta&=\mathbb {E}\Vert \varvec{X}-\varvec{X}'\Vert , \ \ \Delta _k=\mathbb {E}\Vert \varvec{X}^{(k)}-\varvec{X}^{(k)'}\Vert , \ k=1,2,...,K,\\ \Delta _{kl}&=\mathbb {E}\Vert \varvec{X}^{(k)}-\varvec{X}^{(l)}\Vert , \ k \ne l, k, l=1,2,...,K. \end{aligned}$$

where \(\Delta \) and \(\Delta _k\) are the GMDs of F and \(F_k\), respectively. The Gini mean difference was introduced as an alternative measure of variability to the standard deviation (Gini 1914; Yitzhaki and Schechtman 2013). The Gini distance covariance between \(\varvec{X}\) and Y defined in (1) can be represented in terms of GMDs as

$$\begin{aligned} \text{ gCov }(\varvec{X},Y) = \Delta -\sum _{k=1}^Kp_k\Delta _k, \end{aligned}$$
(3)

and the Gini correlation is

$$\begin{aligned} \rho _g(\varvec{X}, Y) = \frac{ \Delta -\sum _{k=1}^Kp_k\Delta _k}{\Delta }. \end{aligned}$$
(4)

This representation not only shows a nice interpretation of the new dependence measurement (Dang et al. 2021) but also makes the analytical calculation feasible. In the proof of Theorem 1 in Dang et al. (2021), it has been shown that

$$\begin{aligned} \text {gCov}({\varvec{X}, Y})=2 \sum _{1 \le k <l \le K}p_k p_l \Delta _{kl}-\sum _{k=1}^K p_k(1-p_k)\Delta _k. \end{aligned}$$
(5)

All three representations (1), (3) and (5) are equivalent (Dang et al. 2021). We will use equation (5) to develop new estimators, as it contains the distances between different groups, to which we will add the weights.

Suppose a sample \(\mathcal{D} =\{(\varvec{X}_1, Y_1), (\varvec{X}_2, Y_2),...., (\varvec{X}_n, Y_n)\}\) is drawn from the joint distribution of \(\varvec{X}\) and Y. We can write \(\mathcal{D} =\mathcal{D}_1\cup \mathcal{D}_2\cup \cdots \cup \mathcal{D}_K\), where \(\mathcal{D}_k=\left\{ \varvec{X}^{(k)}_{1}, \varvec{X}^{(k)}_{2},...,\varvec{X}^{(k)}_{n_k}\right\} \) is the subsample with \(Y_i=L_k\) and \(n_k\) is the number of sample points in the \(k^{th}\) class. Then the Gini distance covariance in (5) can be estimated by

$$\begin{aligned} T_n=&\sum _{1 \le k \ne l \le K}\hat{p}_k \hat{p}_l \dfrac{1}{n_k n_l}\sum _{i=1}^{n_k}\sum _{j=1}^{n_l}\Vert \varvec{X}^{(k)}_i-\varvec{X}^{(l)}_j\Vert \nonumber \\&-\sum _{k=1}^K \hat{p}_k (1-\hat{p}_k) {n_k \atopwithdelims ()2}^{-1}\sum _{1 \le i <j \le n_k}\Vert \varvec{X}^{(k)}_i-\varvec{X}^{(k)}_j\Vert , \end{aligned}$$
(6)

where \(\hat{p}_k=\dfrac{n_k}{n}\). Under independence of \(\varvec{X}\) and Y, \(T_n\) is a degenerate statistic and hence converges to an infinite sum of weighted Chi-squared random variables (Dang et al. 2021).
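As a concrete illustration, the estimator \(T_n\) in (6) can be computed in a few lines of numpy; the function name `gini_cov` and the array layout are our own choices for this sketch, not code from the paper.

```python
import numpy as np

def gini_cov(X, y):
    """Naive estimator T_n of the Gini distance covariance, eq. (6).

    X : (n, q) array of numerical observations.
    y : (n,) array of class labels.
    """
    n = len(y)
    labels, counts = np.unique(y, return_counts=True)
    p_hat = counts / n
    groups = [X[y == lab] for lab in labels]
    T = 0.0
    # between-group term: sum over ordered pairs k != l
    for k, Xk in enumerate(groups):
        for l, Xl in enumerate(groups):
            if k == l:
                continue
            D = np.linalg.norm(Xk[:, None, :] - Xl[None, :, :], axis=2)
            T += p_hat[k] * p_hat[l] * D.mean()
    # within-group term: U-statistic average over unordered pairs
    for k, Xk in enumerate(groups):
        nk = len(Xk)
        D = np.linalg.norm(Xk[:, None, :] - Xk[None, :, :], axis=2)
        within = D.sum() / (nk * (nk - 1))  # equals the average over i < j pairs
        T -= p_hat[k] * (1 - p_hat[k]) * within
    return T
```

Note that the sample version can be negative even though the population quantity is nonnegative.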

In order to overcome the degeneracy of the naive estimator, \(T_n\), under independence, we propose a modified estimator as

$$\begin{aligned} T_{n, \gamma }=&\sum _{1 \le k \ne l \le K}\hat{p}_k \hat{p}_l \dfrac{1}{n_k n_l}\sum _{i=1}^{n_k}\sum _{j=1}^{n_l}\omega _{i, n_k}(\gamma )\Vert \varvec{X}^{(k)}_i-\varvec{X}^{(l)}_j\Vert \nonumber \\&-\sum _{k=1}^K \hat{p}_k (1-\hat{p}_k) \frac{1}{{n_k \atopwithdelims ()2}}\sum _{1 \le i <j \le n_k}\Vert \varvec{X}^{(k)}_i-\varvec{X}^{(k)}_j\Vert ,\nonumber \\ \end{aligned}$$
(7)

where the weights \(\{\omega _{i, s}(\gamma )\}_{i=1}^s\) form a triangular array of positive real numbers depending on a parameter \(\gamma \) \((0 < \gamma \le 1)\) and satisfying the following conditions (Makigusa and Naito 2020):

C1. There exist a real number \(\kappa >0\) and a positive integer \(s_0\) such that

$$\begin{aligned} s\Big |\dfrac{1}{s}\sum _{i=1}^s \omega _{i, s}(\gamma )-1\Big | \le \kappa \end{aligned}$$

for all \(s > s_0\);

C2. There exists a constant \(c\) such that \(\max _{1 \le i \le s}\omega _{i, s}(\gamma )<c\) for all s and all \(0<\gamma \le 1\);

C3. For all \(0<\gamma <1\), \(\lim _{s \rightarrow \infty }\dfrac{1}{s}\sum _{i=1}^s \big (\omega _{i, s}(\gamma )-1\big )^2=\eta (\gamma )>0\).

Then the corresponding modified estimator for GDC is

$$\begin{aligned} \hat{\rho }_{g,\gamma }=\dfrac{T_{n, \gamma }}{\hat{\Delta }}, \end{aligned}$$
(8)

where \(\hat{\Delta }={n \atopwithdelims ()2}^{-1}\sum _{1 \le i <j \le n}\Vert \varvec{X}_i-\varvec{X}_j\Vert .\)

A typical choice of the weights \(\{\omega _{i, s}(\gamma )\}_{i=1}^s\), suggested by Ahmad (1993), is \(\omega _{i, s}(\gamma )=1+(-1)^i \gamma \); it has been adopted to develop the modified maximum mean discrepancy estimators in Balogoun et al. (2021) and Makigusa and Naito (2020). Manfoumbi Djonguet et al. (2024) provided other examples of \(\omega _{i, s}(\gamma )\) satisfying conditions C1-C3: \(\omega _{i, s}(\gamma )=1+\sin (i \pi \gamma )\) and \(\omega _{i, s}(\gamma )=1+\cos (i \pi \gamma )\). The first choice of weights yields \(\eta (\gamma )=\gamma ^2\), and the latter two yield \(\eta (\gamma )=1/2\).
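These weight families are easy to inspect numerically. The following sketch (helper names are ours) checks condition C3 empirically by averaging \((\omega _{i, s}(\gamma )-1)^2\) over a long triangular-array row; the averages approach \(\eta (\gamma )=\gamma ^2\) and \(\eta (\gamma )=1/2\) for the two weight functions.

```python
import numpy as np

def weight1(i, gamma):
    # omega_{i,s}(gamma) = 1 + (-1)^i * gamma  (Ahmad 1993)
    return 1.0 + ((-1.0) ** i) * gamma

def weight2(i, gamma):
    # omega_{i,s}(gamma) = 1 + sin(i * pi * gamma)
    return 1.0 + np.sin(i * np.pi * gamma)

def eta_empirical(weight, gamma, s=100000):
    """Empirical version of the C3 limit: (1/s) sum_i (omega_{i,s}(gamma) - 1)^2."""
    i = np.arange(1, s + 1)
    w = weight(i, gamma)
    return np.mean((w - 1.0) ** 2)
```

For example, `eta_empirical(weight1, 0.8)` equals \(0.8^2 = 0.64\) exactly, while `eta_empirical(weight2, 0.8)` is numerically close to \(1/2\).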

Applying weights satisfying conditions C1-C3 to \(T_{n, \gamma }\) in (7), we obtain the asymptotic normality of the modified estimator in the following theorem.

Define \(h(\varvec{x}, \varvec{x}')=\Vert \varvec{x}-\varvec{x}'\Vert \), \(h_1(\varvec{x})=\mathbb {E}\Vert \varvec{x}-\varvec{X}_1\Vert \) and \(\sigma ^2_g=\text {Var}(h_{1}(\varvec{X}))>0\).

Theorem 2.1

Under independence of \(\varvec{X}\) and Y and conditions C1-C3, if \(\mathbb {E}\Vert \varvec{X}\Vert ^2< \infty \), then as \(\min \{n_1, n_2,..., n_K\} \rightarrow \infty \), we have

$$\begin{aligned} \sqrt{n}T_{n, \gamma } {\mathop {\longrightarrow }\limits ^{{d}}} \mathcal{N}(0,\sigma ^2_{\gamma }),\ \end{aligned}$$

with \(\sigma ^2_{\gamma }=\sum _{k=1}^K p_k(1-p_k)^2\sigma ^2_1(\gamma )\) where \(\sigma ^2_1(\gamma )=\eta (\gamma ) \sigma ^2_g\).

Theorem 2.1 shows that the modified estimator \(T_{n, \gamma }\) of the Gini distance covariance has a normal limit, which can be applied to test independence between \(\varvec{X}\) and Y and hence the equality of K distributions.

Applying Slutsky's theorem, we obtain a central limit theorem (CLT) for the modified GDC estimator \(\hat{\rho }_{g, \gamma }\) defined in (8).

Corollary 2.1

Under independence of \(\varvec{X}\) and Y and conditions C1-C3, if \(\mathbb {E}\Vert \varvec{X}\Vert ^2< \infty \), then as \(\min \{n_1, n_2,..., n_K\} \rightarrow \infty \), we have

$$\begin{aligned} \sqrt{n}\hat{\rho }_{g, \gamma } {\mathop {\longrightarrow }\limits ^{{d}}} \mathcal{N}(0,\sigma ^2_{\rho _g, \gamma }), \end{aligned}$$

where \(\sigma ^2_{\rho _g, \gamma }=\sum _{k=1}^K p_k(1-p_k)^2\sigma ^2_1(\gamma )/\Delta ^2\).

In order to apply Theorem 2.1 for inference, we provide a consistent estimator of \(\sigma ^2_\gamma \). The quantity \(\sigma ^2_g\) can be estimated by its empirical version,

$$\begin{aligned} \hat{\nu }&=\dfrac{1}{n-1}\sum _{i=1}^n \left\{ \dfrac{1}{n-1} \sum _{j=1}^n \Vert \varvec{X}_j-\varvec{X}_i\Vert -\dfrac{1}{n}\sum _{m=1}^n \Big (\dfrac{1}{n-1} \sum _{j=1}^n \Vert \varvec{X}_j-\varvec{X}_m\Vert \Big )\right\} ^2\\&=\dfrac{1}{n-1}\sum _{i=1}^n \left\{ \dfrac{1}{n-1} \sum _{j=1}^n \Vert \varvec{X}_j-\varvec{X}_i\Vert -\dfrac{1}{n(n-1)}\sum _{m, j=1}^n \Vert \varvec{X}_j-\varvec{X}_m\Vert \right\} ^2. \end{aligned}$$

Then a consistent estimator of \(\sigma ^2_{\gamma }\) is given by \(\hat{\sigma }^2_0=\eta (\gamma )\hat{\nu }\sum _{k=1}^K\hat{p}_k(1-\hat{p}_k)^2 \), matching the form of \(\sigma ^2_{\gamma }\) in Theorem 2.1.
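A minimal numpy sketch of this plug-in estimator is given below; we use the variance form \(\sum _{k} \hat{p}_k(1-\hat{p}_k)^2\) from Theorem 2.1, and the function name is our own.

```python
import numpy as np

def sigma0_sq_hat(X, y, eta_gamma):
    """Plug-in estimate of sigma^2_gamma: eta(gamma) * nu_hat * sum_k p_k (1 - p_k)^2,
    following the variance form in Theorem 2.1.

    X : (n, q) data array; y : (n,) labels; eta_gamma : eta(gamma) for the weights used.
    """
    n = len(X)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    # row averages estimate h_1(X_i) = E||x - X|| at x = X_i (diagonal zeros excluded)
    h1 = D.sum(axis=1) / (n - 1)
    # nu_hat: sample variance of the h_1 values, eq. for nu_hat above
    nu_hat = np.sum((h1 - h1.mean()) ** 2) / (n - 1)
    _, counts = np.unique(y, return_counts=True)
    p_hat = counts / n
    return eta_gamma * nu_hat * np.sum(p_hat * (1 - p_hat) ** 2)
```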

Corollary 2.2

Under independence of \(\varvec{X}\) and Y and conditions C1-C3, if \(\mathbb {E}\Vert \varvec{X}\Vert ^2< \infty \), then as \(\min \{n_1, n_2,..., n_K\} \rightarrow \infty \), we have

$$\begin{aligned} \dfrac{\sqrt{n}T_{n, \gamma } }{\hat{\sigma }_0} {\mathop {\longrightarrow }\limits ^{{d}}} \mathcal{N}(0,1). \end{aligned}$$

These CLTs can be applied to test independence of \(\varvec{X}\) and Y. We will use the CLT for the Gini distance covariance for the test; the test based on the Gini distance correlation is asymptotically equivalent. The independence test is stated as

$$\begin{aligned} \mathcal{H}_0: \text{ gCov }(\varvec{X}, Y) = 0,\;\;\;\; \text{ vs }\;\;\;\; \mathcal{H}_1: \text{ gCov }(\varvec{X}, Y) > 0. \end{aligned}$$
(9)

Note that the null hypothesis of the test in (9) is equivalent to the null of the K-sample test

$$\begin{aligned} \mathcal{H}_0^\prime : F_1 = F_2 =\cdots =F_K =F. \end{aligned}$$

In the K-sample test, we can view a sample point \((\varvec{X}_i, Y_i)\) as follows: \(Y_i\) is the class label of \(\varvec{X}_i\), and \(Y_i=L_k\) indicates that \(\varvec{X}_i\) is drawn from \(F_k\). The pooled sample \(\mathcal{D} =\mathcal{D}_1\cup \mathcal{D}_2\cup \cdots \cup \mathcal{D}_K\) has the distribution F, which is the mixture \(F=\sum _{k=1}^K p_k F_k\).

By Corollary 2.2, we reject \(\mathcal{H}_0\) (equivalently, \(\mathcal{H}_0^\prime \)) at level \(\alpha \) if \(\sqrt{n} T_{n, \gamma }>Z_{\alpha }\hat{\sigma }_0\), where \(Z_{\alpha }\) is the \((1-\alpha )\) quantile of the standard normal distribution.
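Putting the pieces together, a minimal sketch of this rejection rule with the Ahmad weights \(\omega _{i, s}(\gamma )=1+(-1)^i \gamma \) (so \(\eta (\gamma )=\gamma ^2\)) might look as follows; the function name and code structure are our own, not the authors' implementation.

```python
import numpy as np
from statistics import NormalDist

def weighted_gini_test(X, y, gamma=0.8, alpha=0.05):
    """One-sided test of H0: gCov(X, Y) = 0; reject if sqrt(n) T_{n,gamma} > Z_alpha * sigma0_hat."""
    n = len(y)
    labels, counts = np.unique(y, return_counts=True)
    p_hat = counts / n
    groups = [X[y == lab] for lab in labels]

    # modified statistic T_{n,gamma}, eq. (7): weights on between-group distances only
    T = 0.0
    for k, Xk in enumerate(groups):
        w = 1.0 + ((-1.0) ** np.arange(1, len(Xk) + 1)) * gamma  # omega_{i,n_k}(gamma)
        for l, Xl in enumerate(groups):
            if k == l:
                continue
            D = np.linalg.norm(Xk[:, None, :] - Xl[None, :, :], axis=2)
            T += p_hat[k] * p_hat[l] * (w[:, None] * D).mean()
    for k, Xk in enumerate(groups):
        nk = len(Xk)
        D = np.linalg.norm(Xk[:, None, :] - Xk[None, :, :], axis=2)
        T -= p_hat[k] * (1 - p_hat[k]) * D.sum() / (nk * (nk - 1))

    # variance estimate: eta(gamma) = gamma^2 for these weights (Theorem 2.1 form)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    h1 = D.sum(axis=1) / (n - 1)
    nu_hat = np.sum((h1 - h1.mean()) ** 2) / (n - 1)
    sigma0 = np.sqrt(gamma ** 2 * nu_hat * np.sum(p_hat * (1 - p_hat) ** 2))

    z = np.sqrt(n) * T / sigma0
    return z, z > NormalDist().inv_cdf(1 - alpha)
```

Under a clear mean shift between groups the standardized statistic is large and the test rejects; under the null it behaves approximately as a standard normal.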

3 Modified Gini distance covariance estimator in RKHS

Distance-based statistics can be generalized from a Euclidean space to metric spaces. With a Mercer kernel (Mercer 1909), distributions can be mapped into an RKHS equipped with a kernel induced distance. The Gini distance covariance has been generalized to an RKHS, \(\mathcal {H}_M\), as (Zhang et al. 2019)

$$\begin{aligned} \text {gCov}_{\mathcal {H}(M)}(\varvec{X}, Y)=&2 \sum _{1 \le k <l \le K}p_k p_l \mathbb {E}d_M(\varvec{X}^{(k)},\varvec{X}^{(l)})\nonumber \\&-\sum _{k=1}^K p_k(1-p_k)\mathbb {E}d_M(\varvec{X}^{(k)}_1,\varvec{X}^{(k)}_2), \end{aligned}$$
(10)

where \(M: \mathbb {R}^q \times \mathbb {R}^q \rightarrow \mathbb {R}\) is a Mercer kernel, which induces a distance function \(d_M: \mathbb {R}^q \times \mathbb {R}^q \rightarrow \mathbb {R}\) in \(\mathcal {H}_M\) given by

$$\begin{aligned} d_M(\varvec{x}, \varvec{x}')= \sqrt{M(\varvec{x}, \varvec{x})+M(\varvec{x}', \varvec{x}')-2M(\varvec{x}, \varvec{x}')}. \end{aligned}$$

As with the regular Gini distance covariance in \(\mathbb {R}^q\), the generalized Gini distance covariance characterizes independence in the RKHS: \(\text {gCov}_{\mathcal {H}(M)}(\varvec{X}, Y)=0\) if and only if \(\varvec{X}\) and Y are independent (Zhang et al. 2019).

The generalized Gini distance covariance can be estimated by

$$\begin{aligned} G_n=&\sum _{1 \le k \ne l \le K}\hat{p}_k \hat{p}_l \dfrac{1}{n_k n_l}\sum _{i=1}^{n_k}\sum _{j=1}^{n_l} d_M(\varvec{X}^{(k)}_i,\varvec{X}^{(l)}_j)\nonumber \\&-\sum _{k=1}^K \hat{p}_k(1-\hat{p}_k) {n_k \atopwithdelims ()2}^{-1}\sum _{1 \le i <j \le n_k} d_M(\varvec{X}^{(k)}_i,\varvec{X}^{(k)}_j), \end{aligned}$$
(11)

which has been shown to be degenerate, converging to an infinite mixture of chi-square distributions under independence of \(\varvec{X}\) and Y (Zhang et al. 2019).

We give a modified estimator as

$$\begin{aligned} G_{n, \gamma }=&\sum _{1 \le k \ne l \le K}\hat{p}_k \hat{p}_l \dfrac{1}{n_k n_l}\sum _{i=1}^{n_k}\sum _{j=1}^{n_l} \omega _{i, n_k}(\gamma )d_M(\varvec{X}^{(k)}_i,\varvec{X}^{(l)}_j)\nonumber \\&-\sum _{k=1}^K \hat{p}_k(1-\hat{p}_k) {n_k \atopwithdelims ()2}^{-1}\sum _{1 \le i <j \le n_k} d_M(\varvec{X}^{(k)}_i,\varvec{X}^{(k)}_j), \end{aligned}$$
(12)

where the weights \(\{\omega _{i, s}(\gamma )\}_{i=1}^s\) are chosen in the same way as in Sect. 2.

Theorem 3.1

Assume M is a Mercer kernel on \(\mathbb {R}^q \times \mathbb {R}^q\) that induces a distance function \(d_M(\cdot , \cdot )\) with bounded range [0, 1). Under independence of \(\varvec{X}\) and Y and conditions C1-C3, as \(\min \{n_1, n_2,..., n_K\} \rightarrow \infty \), we have

$$\begin{aligned} \sqrt{n}G_{n, \gamma } {\mathop {\longrightarrow }\limits ^{{d}}} \mathcal{N}(0,\sigma ^2_{M, \gamma }), \end{aligned}$$

with \(\sigma ^2_{M, \gamma }=\sum _{k=1}^K p_k(1-p_k)^2 \sigma ^2_{2, M}(\gamma )\) where \(\sigma ^2_{2, M}(\gamma )\) is given in the proof.

A consistent estimator of \(\sigma ^2_{M, \gamma }\) is \(\hat{\sigma }^2_{M,0}=\eta (\gamma )\hat{\nu }_M\sum _{k=1}^K\hat{p}_k(1-\hat{p}_k)^2\), where

$$\begin{aligned} \hat{\nu }_M&=\dfrac{1}{n-1}\sum _{i=1}^n \left\{ \dfrac{1}{n-1} \sum _{j=1}^n d_M(\varvec{X}_i, \varvec{X}_j)-\dfrac{1}{n}\sum _{m=1}^n \Big (\dfrac{1}{n-1} \sum _{j=1}^n d_M(\varvec{X}_m, \varvec{X}_j)\Big )\right\} ^2\\&=\dfrac{1}{n-1}\sum _{i=1}^n \left\{ \dfrac{1}{n-1} \sum _{j=1}^n d_M(\varvec{X}_i, \varvec{X}_j)-\dfrac{1}{n(n-1)}\sum _{m, j=1}^n d_M(\varvec{X}_m, \varvec{X}_j)\right\} ^2. \end{aligned}$$

Corollary 3.1

Assume M is a Mercer kernel on \(\mathbb {R}^q \times \mathbb {R}^q\) that induces a distance function \(d_M(\cdot , \cdot )\) with bounded range [0, 1). Under independence of \(\varvec{X}\) and Y and conditions C1-C3, as \(\min \{n_1, n_2,..., n_K\} \rightarrow \infty \), we have

$$\begin{aligned} \dfrac{\sqrt{n}G_{n, \gamma }}{\hat{\sigma }_{M, 0}} {\mathop {\longrightarrow }\limits ^{{d}}} \mathcal{N}(0, 1). \end{aligned}$$

The modified estimator of the generalized Gini distance covariance can also be used to test the equality of K populations. By Corollary 3.1, we can reject \(\mathcal{H}_0\) or \(\mathcal{H}_0^\prime \) if \(\sqrt{n} G_{n, \gamma }>Z_{\alpha }\hat{\sigma }_{M,0}\) at level \(\alpha \).

Fig. 1

Histograms of the proposed weighted Gini covariance under the weight function \(\omega _{i,n}(\gamma ) = 1+(-1)^i \gamma \) with \(\gamma \) values of 0.1, 0.2, 0.5 and 0.8, respectively. The left plots are for dimension \(q=3\), and the right ones for dimension \(q=5\)

4 Simulation

In this section, we conduct simulation studies to verify the theoretical properties of the modified Gini covariance statistic and to compare its performance in K-sample tests with other methods. Based on the empirical results, we also discuss how to select the weight function.

4.1 Limiting normality

We generate K independent samples from the same multivariate normal distribution and compute the weighted Gini covariance statistic with weights \(\omega _{i,n}(\gamma )=1+(-1)^i \gamma, i=1,...,n\). The procedure is repeated 10000 times.

Example 1

\(K=2\) samples of size \((n_1, n_2)=(200, 200)\) are generated from \( \mathcal {N}(\varvec{0}, \varvec{\Sigma })\), where \(\varvec{\Sigma }=(\Sigma _{ij}) \in \mathbb {R}^{q \times q}\) with \(\Sigma _{ij}=0.7^{|i-j|}\). We consider \(q=3, 5\) and \(\gamma =0.1, 0.2, 0.5, 0.8\), respectively.
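The sampling scheme of Example 1 can be sketched as follows; the function name and defaults are ours.

```python
import numpy as np

def example1_sample(n1=200, n2=200, q=3, rho=0.7, rng=None):
    """Two samples from N(0, Sigma) with Sigma_ij = rho^{|i-j|}, so the null holds."""
    rng = np.random.default_rng() if rng is None else rng
    idx = np.arange(q)
    Sigma = rho ** np.abs(idx[:, None] - idx[None, :])  # AR(1)-type covariance
    X = rng.multivariate_normal(np.zeros(q), Sigma, size=n1 + n2)
    y = np.repeat([0, 1], [n1, n2])  # class labels for the two (identical) groups
    return X, y
```

Feeding such samples into the weighted statistic and standardizing by \(\hat{\sigma }_0\) produces the histograms summarized in Fig. 1.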

For each dimension and each value of \(\gamma \), the histogram of the 10000 standardized weighted Gini covariance statistics is plotted in Fig. 1. The kernel density estimates (KDE) for the weighted Gini covariance and the standard Gini covariance are added to the plots. We also add the standard normal density curve to visualize the closeness between the empirical and asymptotic densities. First, the KDEs for the regular Gini covariance are always skewed to the right, which agrees well with its limiting distribution, a mixture of \(\chi ^2\) distributions, reflecting the degeneracy of the regular Gini covariance statistic. We also notice that the histograms at \(\gamma =0.1\) for both dimensions are skewed to the right, with some discrepancy between the KDE of the weighted Gini covariance and the normal curve. However, as \(\gamma \) increases, the discrepancy diminishes. This suggests that larger \(\gamma \) values are preferred for this weight function. We use \(\gamma =0.8\) in the next subsection for the performance comparison in K-sample tests. The impact of the choice of \(\gamma \) in \(\omega _{i,n}(\gamma )=1+(-1)^i \gamma \) as well as in \(\omega _{i, s}(\gamma )=1+\sin (i \pi \gamma )\) is explored in Subsection 4.3.

Table 1 Size and Power of Tests in Example 2

4.2 Size and power in K-sample tests

In this simulation, we compare three methods for the K-sample problem by computing their type I errors and powers.

mmd: the generalized maximum mean discrepancy method developed in Balogoun et al. (2021).

wrg: our proposed method using the weighted Gini covariance statistic.

wkrg: our proposed method using the weighted Gini covariance statistic in an RKHS, where the distance function \(d_M(\varvec{x}, \varvec{x}')=\sqrt{1-e^{{-\Vert \varvec{x}-\varvec{x}'\Vert ^2}/{\sigma ^2}}}\) is induced by the weighted Gaussian kernel \(M(\varvec{x}, \varvec{x}')=0.5e^{-{\Vert \varvec{x}-\varvec{x}'\Vert ^2}/{\sigma ^2}}\) (Zhang et al. 2019).

Both mmd and wkrg are kernel methods. The bandwidth of the Gaussian kernel in both methods is chosen to be the median of pairwise distances, as used and suggested in Chen et al. (2009). All three methods use the weight function \(\omega _{i,n}(\gamma )=1+(-1)^i\gamma \) with \(\gamma =0.8\).
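The median heuristic for the bandwidth can be sketched in a few lines; the helper name is ours.

```python
import numpy as np

def median_bandwidth(X):
    """Median of pairwise Euclidean distances, used as the Gaussian-kernel
    bandwidth for mmd and wkrg (the median heuristic)."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    iu = np.triu_indices(len(X), k=1)  # distinct pairs only, excluding the diagonal
    return np.median(D[iu])
```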

We consider cases for \(K=3\) and \(q=5\) with \(\varvec{p} = (p_1,p_2,...,p_K)\), where \(p_k = P(Y_i=L_k)\): (I) balanced, \(\varvec{p} = (1/3, 1/3, 1/3)\); (II) slightly unbalanced, \(\varvec{p} = (3/12, 4/12, 5/12)\); (III) heavily unbalanced, \(\varvec{p} = (0.1, 0.3, 0.6)\). We conduct 10000 simulations for sample sizes \(n=120\) and \(n=240\), respectively. The type I error and power of each test are computed at significance level \(\alpha =0.05\) for Examples 2 and 3.

Example 2

Generate samples \(\varvec{X}^{(k)} = \varvec{\mu }_k + \varvec{\epsilon }\), \(k=1,2,3\), where the mean vectors are \(\varvec{\mu }_1 = (0, 0,...,0)\), \(\varvec{\mu }_2 =(\delta _1, \delta _1,..., \delta _1)\), \(\varvec{\mu }_3 = (\delta _2, \delta _2,...,\delta _2)\), and \(\varvec{\epsilon }= (\epsilon _{1}, \epsilon _{2},...,\epsilon _{q})\) is a q-dimensional error term whose components \(\epsilon _{j}\) are iid N(0, 1). Here \(\varvec{\delta }=(\delta _1, \delta _2)\) measures the differences in means.

Results for Example 2 are reported in Table 1. The column \(\varvec{\delta }=(0, 0)\) corresponds to the size of the tests. At \(n=120\), all tests are slightly oversized, with sizes 1-2% above the nominal level, and they all have higher power in the balanced case than in the unbalanced cases. For the unbalanced cases, our method wrg gains a 1%-4% power advantage over mmd at small values of \(\varvec{\delta }\), especially in the heavily unbalanced case. As the sample size increases, the type I errors of wrg and wkrg move closer to the nominal level and their powers improve. However, mmd suffers from severe undersizing, with very low power in the unbalanced cases when the difference in means is small.

Example 3

We generate samples \(\varvec{X}^{(k)} = (Z_{k1}, Z_{k2},...,Z_{kq})^T\), \(k=1,2,3\). For \(j=1,...,q\), the \(Z_{1j}\)'s are i.i.d. Exp(1), the \(Z_{2j}\)'s are i.i.d. Exp(\(\delta _1\)), and the \(Z_{3j}\)'s are i.i.d. Exp(\(\delta _2\)).

We present the results for this example in Table 2. The mmd test appears sensitive to the asymmetry of the distributions: it is undersized and its power is much lower than that of the weighted Gini covariance based tests. Our wrg performs best, with well-controlled size and higher power.

Table 2 Size and Power of Tests in Example 3

4.3 Discussion on weights

Manfoumbi Djonguet et al. (2024) provided two weight schemes based on sine and cosine functions, but they did not study the performance of those weights. In this simulation, we compare the weight \(\omega _{i, s}(\gamma )=1+\sin (i \pi \gamma )\) (weight2) with the previously used \(\omega _{i, s}(\gamma ) = 1+(-1)^i \gamma \) (weight1). We compare the effects of \(\gamma \) values of 0.2, 0.4, 0.6, 0.8 and 0.9 for each weight function on our wrg method. Based on the empirical results, we offer some suggestions on the choice of weights.

The balanced \(\varvec{p} = (1/3, 1/3, 1/3)\) and unbalanced \(\varvec{p} =(0.1,0.3,0.6)\) scenarios of Example 2 are considered. For sample sizes \(n=60, 90, 120, 150, 180, 210, 240, 270, 300\), the type I errors and/or powers of the tests under the different weighting schemes are calculated based on 10000 repetitions and reported in plots.

Figure 2 shows the balanced case, with the top two plots showing type I error and the bottom two showing power at \((\delta _1, \delta _2)=(0.3, 0.6)\). The left panels are for weight1, and the right ones for weight2.

The nominal size is 0.05. From Fig. 2, we observe that the tests based on both weight functions are oversized, but the issue becomes less serious as the sample size increases. For weight1 with \(\gamma \) values 0.8 and 0.9, the type I errors decrease from 0.10 to 0.06 as the sample size increases from 60 to 150, but small \(\gamma \) values of 0.2 and 0.4 produce unacceptable type I errors of 0.16 and 0.10, respectively, even at sample size 300. For weight2 with a wide range of \(\gamma \) values from 0.2 to 0.8, the type I errors decrease from 0.12 to 0.08 as the sample size increases from 60 to 150; further reducing the type I error to 0.06 requires a sample size as large as 300. As the sample size increases, the \(\gamma \) value 0.9 yields relatively large zig-zag oscillations in the type I errors, which is undesirable.

Fig. 2

Empirical size and power versus total sample size at different \(\gamma \) values in Example 2 with \(\varvec{p} = (1/3, 1/3, 1/3)\). The left plots are for \(\omega _{i,n}(\gamma ) = 1+(-1)^i \gamma \), and the right ones for \(\omega _{i,n}(\gamma ) = 1+\sin (i\pi \gamma )\)

The tests based on both weight schemes produce relatively high power; the power of all tests exceeds 0.90 at sample size 150. For weight1, the power decreases in \(\gamma \). Balancing controllable type I error against high power, \(\gamma =0.8\) is recommended; this suggestion is also supported by the earlier empirical results in Subsections 4.1 and 4.2. weight2 performs very well in terms of power: except for 0.9, all \(\gamma \) values produce almost the same power at each sample size, so it is quite robust to the choice of \(\gamma \) in the balanced case. However, in the unbalanced case, weight2 fails badly, both in terms of huge type I errors and in terms of sensitivity to the choice of \(\gamma \).

The type I errors of the tests under the unbalanced scenario are reported in Fig. 3. The left plot is for weight1, and the right one for weight2; both plots use the same scale to provide a fair visual comparison. From Fig. 3, we see that for weight2, all \(\gamma \) values produce unacceptable type I errors when the sample size is less than 300. With \(\gamma =0.2\) and sample sizes less than 240, the test is meaningless, with type I errors higher than 0.9. For \(\gamma =0.4\), the type I errors jump up and down over a wide range, reaching 0.94, then 0.53, then up to 0.97 followed by 0.08, and then 0.43 followed by 0.64 as the sample size changes from 60 to 210. Oscillation patterns over large ranges of type I errors are also present for larger \(\gamma \) values. However, when the sample size is as large as 300, all tests have good size; we increased the sample size up to 600 and found that weight2 then maintains the nominal size well. For weight1 with sample sizes less than 300, the type I errors also show zig-zag oscillations, but over much smaller ranges. Except for \(\gamma =0.2\), all weight1 tests yield a reasonably good empirical size when the sample size is 300 or larger.

Overall, for the balanced case, both weights are acceptable depending on the choice of \(\gamma \), and weight2 performs better than weight1 in that it allows a wide range of choices for \(\gamma \). For unbalanced cases, weight2 is not applicable unless the sample size is sufficiently large (at least 300). weight1 with \(\gamma =0.8\) is recommended because of its controllable type I errors and high powers in both balanced and unbalanced cases.

Fig. 3

Empirical size versus the sample size at different \(\gamma \) values when \(\varvec{p} = (0.1, 0.3, 0.6)\). The left plot is for \(\omega _{i,n}(\gamma ) = 1+(-1)^i \gamma \), and the right one for \(\omega _{i,n}(\gamma ) = 1+\sin (i\pi \gamma )\)

5 Conclusions and future work

We have proposed a modified estimator of the Gini distance correlation. By adding weights to the distances between different groups, the modified estimator admits a normal limit under independence of the numerical and categorical variables. We have also generalized the results to RKHS, where normal limits also hold. All of the asymptotic results have been applied to test the equality of K distributions. With a proper choice of weight function, the modified Gini correlation estimator performs well. We have studied two types of weights, \(\omega _{i,n}(\gamma )=1+(-1)^i \gamma \) and \(\omega _{i, s}(\gamma )=1+\sin (i \pi \gamma )\). The second weight works well over a wide range of \(\gamma \) for the balanced K-sample problem, but in unbalanced cases it is not applicable unless the sample size is sufficiently large. The first weight with a large \(\gamma \) value such as 0.8 is recommended: it controls the type I error close to the nominal level in all cases, balanced and unbalanced, and keeps reasonably high power.

In real applications of the K-sample problem, most existing omnibus tests are based on permutation procedures. Permutation tests have some optimality properties (Gebhard and Schmitz 1998a), but they are computation-intensive (Gebhard and Schmitz 1998b). By considering all permutations, exact tests such as that of Neuhäuser (2005) are only feasible for very small sample sizes; for larger sample sizes, a large number of random permutations must be drawn to approximate the p-value or determine the critical value. The test based on the regular Gini distance covariance (Dang et al. 2021) relies on such a computationally expensive procedure. Our proposed weighted Gini covariance statistic admits a normal limit and can be applied directly in real-world analyses of the K-sample problem, avoiding permutation procedures.

In this paper, we used the median of pairwise distances as the bandwidth of the Gaussian kernel for wkrg and mmd. This choice makes half of the pairwise distances in the induced feature space greater than 0.8871 and 0.6065 for the wkrg and mmd methods, respectively. The choice is simple and appears effective, but it is by no means "optimal". How to select an optimal bandwidth (in terms of some criterion) is always a challenge for any kernel method. For the wkrg and mmd methods, the task is particularly difficult because the bandwidth should be selected jointly with the weight function. How to jointly select the optimal kernel parameter and weight scheme in wkrg and mmd is worthy of further investigation.