1 Introduction

The Gini distance correlation of Dang et al. (2021) was proposed to measure dependence between a numerical random variable \(\varvec{X}\) in \(\mathbb {R}^q\) and a categorical variable Y. Suppose that Y takes values \(L_1,...,L_K\) with distribution \(P_Y\) given by \(P(Y = L_k) = p_k>0\) for \(k=1,2,...,K\). Let \(\varvec{X}\) have distribution F, and let the conditional distribution of \(\varvec{X}\) given \(Y=L_k\) be \(F_k\). Let \((\varvec{X}, \varvec{X}')\) and \((\varvec{X}^{(k)}, \varvec{X}^{(k)'})\) be independent pairs of variables from F and \(F_k\), respectively. The Gini distance covariance is then defined as

$$\begin{aligned} {gCov}(\varvec{X}, Y)= \sum _{k=1}^K p_k T(\varvec{X}^{(k)}, \varvec{X}), \end{aligned}$$
(1)

where \(T(\varvec{X}^{(k)},\varvec{X})=2\mathbb {E}\Vert \varvec{X}^{(k)}-\varvec{X}\Vert -\mathbb {E}\Vert \varvec{X}^{(k)}-\varvec{X}^{(k)'}\Vert -\mathbb {E}\Vert \varvec{X}-\varvec{X}'\Vert \) is the energy distance between \(F_k\) and F (Székely and Rizzo 2013, 2017). The Gini distance covariance is the weighted average of the energy distances between the \(F_k\) and F, which implies that \({gCov}(\varvec{X}, Y)=0\) if and only if \(F_1=F_2=\cdots =F_K=F\). That is, zero Gini distance covariance is equivalent to independence between \(\varvec{X}\) and Y. The Gini distance correlation standardizes the Gini distance covariance by

$$\begin{aligned} \rho _g(\varvec{X}, Y) = \dfrac{\sum _{k=1}^K p_k T(\varvec{X}^{(k)}, \varvec{X})}{\mathbb {E}\Vert \varvec{X}-\varvec{X}'\Vert }, \end{aligned}$$
(2)

which takes values in [0, 1]. The naive estimator of the Gini distance covariance in (1) is a linear combination of U-statistics or V-statistics. Under independence between \(\varvec{X}\) and Y, these estimators are degenerate and hence converge to an infinite sum of quadratic forms of centered Gaussian random variables (Dang et al. 2021). This limit cannot easily be applied to test the equality of K distributions because it is an infinite sum, and finding the weights in the degenerate limit is also a difficult problem. In high dimension, as q diverges, the degenerate estimator admits a normal limit (Sang and Dang 2023). In this paper, we aim to establish a normal limit under the regular setting where q is fixed.

Ahmad (1993) provided a method for testing goodness of fit by adding weights to the Cramér-von Mises statistic, so that the modified estimator is asymptotically normal under the null hypothesis of the goodness-of-fit problem. The Cramér-von Mises statistic estimates the \(L_2\) distance between a completely specified distribution and the underlying distribution. The Gini distance covariance and correlation are Gini-distance-based dependence measures. In order to achieve asymptotic normality under the null of independence between \(\varvec{X}\) and Y, we modify the aforementioned V-estimator by adopting the approach proposed in Ahmad (1993).

Zhang et al. (2019) extended the Gini distance covariance and the Gini distance correlation (GDC) to a reproducing kernel Hilbert space (RKHS) via a Mercer kernel induced distance. The generalized covariance and correlation also characterize independence between \(\varvec{X}\) and Y. As with the GDC, the empirical versions of the generalized measures are degenerate under independence between \(\varvec{X}\) and Y. We provide modified estimators for the generalized Gini distance covariance and GDC in an RKHS which admit normal limits under the null of independence. Makigusa and Naito (2020) constructed a consistent estimator of the maximum mean discrepancy in a Hilbert space that yields a normal limit when the maximum mean discrepancy is zero. Their result was generalized to the K-sample problem in Balogoun et al. (2021), and Manfoumbi Djonguet et al. (2024) adopted this method for independence testing between two functional variables.

Throughout this paper, \(\Vert \cdot \Vert \) represents the Euclidean norm, that is, \(\Vert \varvec{a}\Vert =\sqrt{a^2_1+a^2_2+\cdots +a^2_q}\) for a q-vector \(\varvec{a}=(a_1, a_2, \cdots , a_q)^T\) in \(\mathbb {R}^q\). For two sequences of real numbers, \(a_n\) and \(b_n\), \(a_n=o(b_n)\) means \(\lim _{n \rightarrow \infty }{a_n}/{b_n}=0\), and \(a_n=O(b_n)\) means \(L \le {a_n}/{b_n} \le U\) for some finite constants L and U. For sequences of random variables, the analogous notations \(o_p(b_n)\) and \(O_p(b_n)\) denote the corresponding relationships holding in probability.

The remainder of the paper is organized as follows. In Sect. 2, we provide the modified estimator for the Gini distance covariance and the asymptotic distribution. Section 3 is devoted to the modified estimator for the generalized Gini distance covariance in RKHS. In Sect. 4, we conduct simulation studies to evaluate the performance of the proposed modified test statistics. We conclude and discuss future works in Sect. 5. All technical proofs are provided in Appendix.

2 Modified Gini distance covariance estimator

There is an alternative representation of the Gini distance covariance and correlation using multivariate Gini mean differences (GMD), defined as

$$\begin{aligned} \Delta&=\mathbb {E}\Vert \varvec{X}-\varvec{X}'\Vert , \ \ \Delta _k=\mathbb {E}\Vert \varvec{X}^{(k)}-\varvec{X}^{(k)'}\Vert , \ k=1,2,...,K,\\ \Delta _{kl}&=\mathbb {E}\Vert \varvec{X}^{(k)}-\varvec{X}^{(l)}\Vert , \ k \ne l, k, l=1,2,...,K. \end{aligned}$$

where \(\Delta \) and \(\Delta _k\) are the GMDs of F and \(F_k\), respectively. The Gini mean difference was introduced as an alternative measure of variability to the standard deviation (Gini 1914; Yitzhaki and Schechtman 2013). The Gini distance covariance between \(\varvec{X}\) and Y defined in (1) can be represented in terms of GMDs as

$$\begin{aligned} \text{ gCov }(\varvec{X},Y) = \Delta -\sum _{k=1}^Kp_k\Delta _k, \end{aligned}$$
(3)

and the Gini correlation is

$$\begin{aligned} \rho _g(\varvec{X}, Y) = \frac{ \Delta -\sum _{k=1}^Kp_k\Delta _k}{\Delta }. \end{aligned}$$
(4)

This representation not only shows a nice interpretation of the new dependence measurement (Dang et al. 2021) but also makes the analytical calculation feasible. In the proof of Theorem 1 in Dang et al. (2021), it has been shown that

$$\begin{aligned} \text {gCov}({\varvec{X}, Y})=2 \sum _{1 \le k <l \le K}p_k p_l \Delta _{kl}-\sum _{k=1}^K p_k(1-p_k)\Delta _k. \end{aligned}$$
(5)

All three representations (1), (3) and (5) are equivalent (Dang et al. 2021). We will use equation (5) to develop new estimators, as it contains the distances between different groups, to which we will add the weights.

Suppose a sample \(\mathcal{D} =\{(\varvec{X}_1, Y_1), (\varvec{X}_2, Y_2),...., (\varvec{X}_n, Y_n)\}\) is drawn from the joint distribution of \(\varvec{X}\) and Y. We can write \(\mathcal{D} =\mathcal{D}_1\cup \mathcal{D}_2\cup \cdots \cup \mathcal{D}_K\), where \(\mathcal{D}_k=\left\{ \varvec{X}^{(k)}_{1}, \varvec{X}^{(k)}_{2},...,\varvec{X}^{(k)}_{n_k}\right\} \) is the subsample with \(Y_i=L_k\) and \(n_k\) is the number of sample points in the \(k^{th}\) class. Then the Gini distance covariance in (5) can be estimated by

$$\begin{aligned} T_n=&\sum _{1 \le k \ne l \le K}\hat{p}_k \hat{p}_l \dfrac{1}{n_k n_l}\sum _{i=1}^{n_k}\sum _{j=1}^{n_l}\Vert \varvec{X}^{(k)}_i-\varvec{X}^{(l)}_j\Vert \nonumber \\&-\sum _{k=1}^K \hat{p}_k (1-\hat{p}_k) {n_k \atopwithdelims ()2}^{-1}\sum _{1 \le i <j \le n_k}\Vert \varvec{X}^{(k)}_i-\varvec{X}^{(k)}_j\Vert , \end{aligned}$$
(6)

where \(\hat{p}_k=\dfrac{n_k}{n}\). Under independence of \(\varvec{X}\) and Y, \(T_n\) is a degenerate statistic and hence converges to an infinite sum of weighted Chi-squared random variables (Dang et al. 2021).
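As a concrete illustration, the estimator \(T_n\) in (6) can be computed in a few lines of numpy; the function name `gini_cov` and the array layout are our own choices for this sketch, not code from the paper.

```python
import numpy as np

def gini_cov(X, y):
    """Naive estimator T_n of the Gini distance covariance, eq. (6).

    X : (n, q) array of numerical observations.
    y : (n,) array of class labels.
    """
    n = len(y)
    labels, counts = np.unique(y, return_counts=True)
    p_hat = counts / n
    groups = [X[y == lab] for lab in labels]
    T = 0.0
    # between-group term: sum over ordered pairs k != l
    for k, Xk in enumerate(groups):
        for l, Xl in enumerate(groups):
            if k == l:
                continue
            D = np.linalg.norm(Xk[:, None, :] - Xl[None, :, :], axis=2)
            T += p_hat[k] * p_hat[l] * D.mean()
    # within-group term: U-statistic average over unordered pairs
    for k, Xk in enumerate(groups):
        nk = len(Xk)
        D = np.linalg.norm(Xk[:, None, :] - Xk[None, :, :], axis=2)
        within = D.sum() / (nk * (nk - 1))  # equals the average over i < j pairs
        T -= p_hat[k] * (1 - p_hat[k]) * within
    return T
```

Note that the sample version can be negative even though the population quantity is nonnegative.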

In order to overcome the degeneracy of the naive estimator, \(T_n\), under independence, we propose a modified estimator as

$$\begin{aligned} T_{n, \gamma }=&\sum _{1 \le k \ne l \le K}\hat{p}_k \hat{p}_l \dfrac{1}{n_k n_l}\sum _{i=1}^{n_k}\sum _{j=1}^{n_l}\omega _{i, n_k}(\gamma )\Vert \varvec{X}^{(k)}_i-\varvec{X}^{(l)}_j\Vert \nonumber \\&-\sum _{k=1}^K \hat{p}_k (1-\hat{p}_k) \frac{1}{{n_k \atopwithdelims ()2}}\sum _{1 \le i <j \le n_k}\Vert \varvec{X}^{(k)}_i-\varvec{X}^{(k)}_j\Vert ,\nonumber \\ \end{aligned}$$
(7)

where the weights \(\{\omega _{i, s}(\gamma )\}_{i=1}^s\) form a triangular array of positive real numbers depending on a parameter \(\gamma \) \((0 < \gamma \le 1)\) and satisfying the following conditions (Makigusa and Naito 2020):

C1. There exist a real number \(\kappa >0\) and a positive integer \(s_0\) such that

$$\begin{aligned} s\Big |\dfrac{1}{s}\sum _{i=1}^s \omega _{i, s}(\gamma )-1\Big | \le \kappa \end{aligned}$$

for all \(s > s_0\);

C2. There exists a constant \(c\) such that \(\max _{1 \le i \le s}\omega _{i, s}(\gamma )<c\) for all s and all \(0<\gamma \le 1\);

C3. For all \(0<\gamma <1\), \(\lim _{s \rightarrow \infty }\dfrac{1}{s}\sum _{i=1}^s \big (\omega _{i, s}(\gamma )-1\big )^2=\eta (\gamma )>0\).

Then the corresponding modified estimator for GDC is

$$\begin{aligned} \hat{\rho }_{g,\gamma }=\dfrac{T_{n, \gamma }}{\hat{\Delta }}, \end{aligned}$$
(8)

where \(\hat{\Delta }={n \atopwithdelims ()2}^{-1}\sum _{1 \le i <j \le n}\Vert \varvec{X}_i-\varvec{X}_j\Vert .\)

A typical choice of the weights \(\{\omega _{i, s}(\gamma )\}_{i=1}^s\), suggested by Ahmad (1993), is \(\omega _{i, s}(\gamma )=1+(-1)^i \gamma \); it has been adopted to develop the modified maximum mean discrepancy estimators in Balogoun et al. (2021) and Makigusa and Naito (2020). Manfoumbi Djonguet et al. (2024) provided other examples of \(\omega _{i, s}(\gamma )\) satisfying conditions C1-C3: \(\omega _{i, s}(\gamma )=1+\sin (i \pi \gamma )\) and \(\omega _{i, s}(\gamma )=1+\cos (i \pi \gamma )\). The first choice of weights yields \(\eta (\gamma )=\gamma ^2\), and the latter two yield \(\eta (\gamma )=1/2\).
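These weight families are easy to inspect numerically. The following sketch (helper names are ours) checks condition C3 empirically by averaging \((\omega _{i, s}(\gamma )-1)^2\) over a long triangular-array row; the averages approach \(\eta (\gamma )=\gamma ^2\) and \(\eta (\gamma )=1/2\) for the two weight functions.

```python
import numpy as np

def weight1(i, gamma):
    # omega_{i,s}(gamma) = 1 + (-1)^i * gamma  (Ahmad 1993)
    return 1.0 + ((-1.0) ** i) * gamma

def weight2(i, gamma):
    # omega_{i,s}(gamma) = 1 + sin(i * pi * gamma)
    return 1.0 + np.sin(i * np.pi * gamma)

def eta_empirical(weight, gamma, s=100000):
    """Empirical version of the C3 limit: (1/s) sum_i (omega_{i,s}(gamma) - 1)^2."""
    i = np.arange(1, s + 1)
    w = weight(i, gamma)
    return np.mean((w - 1.0) ** 2)
```

For example, `eta_empirical(weight1, 0.8)` equals \(0.8^2 = 0.64\) exactly, while `eta_empirical(weight2, 0.8)` is numerically close to \(1/2\).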

Applying weights satisfying conditions C1-C3 to \(T_{n, \gamma }\) in (7), we obtain the asymptotic normality of the modified estimator in the following theorem.

Define \(h(\varvec{x}, \varvec{x}')=\Vert \varvec{x}-\varvec{x}'\Vert \), \(h_1(\varvec{x})=\mathbb {E}\Vert \varvec{x}-\varvec{X}_1\Vert \) and \(\sigma ^2_g=\text {Var}(h_{1}(\varvec{X}))>0\).

Theorem 2.1

Under independence of \(\varvec{X}\) and Y and conditions C1-C3, if \(\mathbb {E}\Vert \varvec{X}\Vert ^2< \infty \), then as \(\min \{n_1, n_2,..., n_K\} \rightarrow \infty \), we have

$$\begin{aligned} \sqrt{n}T_{n, \gamma } {\mathop {\longrightarrow }\limits ^{{d}}} \mathcal{N}(0,\sigma ^2_{\gamma }),\ \end{aligned}$$

with \(\sigma ^2_{\gamma }=\sum _{k=1}^K p_k(1-p_k)^2\sigma ^2_1(\gamma )\) where \(\sigma ^2_1(\gamma )=\eta (\gamma ) \sigma ^2_g\).

Theorem 2.1 shows that the modified estimator \(T_{n, \gamma }\) of the Gini distance covariance has a normal limit, which can be applied to test independence between \(\varvec{X}\) and Y and hence the equality of K distributions.

Applying Slutsky's theorem, we obtain a central limit theorem (CLT) for the modified GDC estimator \(\hat{\rho }_{g, \gamma }\) defined in (8).

Corollary 2.1

Under independence of \(\varvec{X}\) and Y and conditions C1-C3, if \(\mathbb {E}\Vert \varvec{X}\Vert ^2< \infty \), then as \(\min \{n_1, n_2,..., n_K\} \rightarrow \infty \), we have

$$\begin{aligned} \sqrt{n}\hat{\rho }_{g, \gamma } {\mathop {\longrightarrow }\limits ^{{d}}} \mathcal{N}(0,\sigma ^2_{\rho _g, \gamma }), \end{aligned}$$

where \(\sigma ^2_{\rho _g, \gamma }=\sum _{k=1}^K p_k(1-p_k)^2\sigma ^2_1(\gamma )/\Delta ^2\).

In order to apply Theorem 2.1 for inference, we provide a consistent estimator of \(\sigma ^2_\gamma \). The quantity \(\sigma ^2_g\) can be estimated by its empirical version,

$$\begin{aligned} \hat{\nu }&=\dfrac{1}{n-1}\sum _{i=1}^n \left\{ \dfrac{1}{n-1} \sum _{j=1}^n \Vert \varvec{X}_j-\varvec{X}_i\Vert -\dfrac{1}{n}\sum _{m=1}^n \Big (\dfrac{1}{n-1} \sum _{j=1}^n \Vert \varvec{X}_j-\varvec{X}_m\Vert \Big )\right\} ^2\\&=\dfrac{1}{n-1}\sum _{i=1}^n \left\{ \dfrac{1}{n-1} \sum _{j=1}^n \Vert \varvec{X}_j-\varvec{X}_i\Vert -\dfrac{1}{n(n-1)}\sum _{m, j=1}^n \Vert \varvec{X}_j-\varvec{X}_m\Vert \right\} ^2. \end{aligned}$$

Then a consistent estimator of \(\sigma ^2_{\gamma }\) is given by \(\hat{\sigma }^2_0=\eta (\gamma )\hat{\nu }\sum _{k=1}^K\hat{p}_k(1-\hat{p}_k)^2 \), matching the form of \(\sigma ^2_{\gamma }\) in Theorem 2.1.
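A minimal numpy sketch of this plug-in estimator is given below; we use the variance form \(\sum _{k} \hat{p}_k(1-\hat{p}_k)^2\) from Theorem 2.1, and the function name is our own.

```python
import numpy as np

def sigma0_sq_hat(X, y, eta_gamma):
    """Plug-in estimate of sigma^2_gamma: eta(gamma) * nu_hat * sum_k p_k (1 - p_k)^2,
    following the variance form in Theorem 2.1.

    X : (n, q) data array; y : (n,) labels; eta_gamma : eta(gamma) for the weights used.
    """
    n = len(X)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    # row averages estimate h_1(X_i) = E||x - X|| at x = X_i (diagonal zeros excluded)
    h1 = D.sum(axis=1) / (n - 1)
    # nu_hat: sample variance of the h_1 values, eq. for nu_hat above
    nu_hat = np.sum((h1 - h1.mean()) ** 2) / (n - 1)
    _, counts = np.unique(y, return_counts=True)
    p_hat = counts / n
    return eta_gamma * nu_hat * np.sum(p_hat * (1 - p_hat) ** 2)
```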

Corollary 2.2

Under independence of \(\varvec{X}\) and Y and conditions C1-C3, if \(\mathbb {E}\Vert \varvec{X}\Vert ^2< \infty \), then as \(\min \{n_1, n_2,..., n_K\} \rightarrow \infty \), we have

$$\begin{aligned} \dfrac{\sqrt{n}T_{n, \gamma } }{\hat{\sigma }_0} {\mathop {\longrightarrow }\limits ^{{d}}} \mathcal{N}(0,1). \end{aligned}$$

These CLTs can be applied to test independence of \(\varvec{X}\) and Y. We will use the CLT for the Gini distance covariance for the test; the test based on the Gini distance correlation is asymptotically equivalent. The independence test is stated as

$$\begin{aligned} \mathcal{H}_0: \text{ gCov }(\varvec{X}, Y) = 0,\;\;\;\; \text{ vs }\;\;\;\; \mathcal{H}_1: \text{ gCov }(\varvec{X}, Y) > 0. \end{aligned}$$
(9)

Note that the null hypothesis of the test in (9) is equivalent to the null of the K-sample test

$$\begin{aligned} \mathcal{H}_0^\prime : F_1 = F_2 =\cdots =F_K =F. \end{aligned}$$

In the K-sample test, we can view a sample point \((\varvec{X}_i, Y_i)\) as follows: \(Y_i\) is the class label of \(\varvec{X}_i\), and \(Y_i=L_k\) indicates that \(\varvec{X}_i\) is drawn from \(F_k\). The pooled sample \(\mathcal{D} =\mathcal{D}_1\cup \mathcal{D}_2\cup \cdots \cup \mathcal{D}_K\) has the distribution F, which is the mixture \(F=\sum _{k=1}^K p_k F_k\).

By Corollary 2.2, we reject \(\mathcal{H}_0\) (equivalently, \(\mathcal{H}_0^\prime \)) at level \(\alpha \) if \(\sqrt{n} T_{n, \gamma }>Z_{\alpha }\hat{\sigma }_0\), where \(Z_{\alpha }\) is the \((1-\alpha )\) quantile of the standard normal distribution.
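Putting the pieces together, a minimal sketch of this rejection rule with the Ahmad weights \(\omega _{i, s}(\gamma )=1+(-1)^i \gamma \) (so \(\eta (\gamma )=\gamma ^2\)) might look as follows; the function name and code structure are our own, not the authors' implementation.

```python
import numpy as np
from statistics import NormalDist

def weighted_gini_test(X, y, gamma=0.8, alpha=0.05):
    """One-sided test of H0: gCov(X, Y) = 0; reject if sqrt(n) T_{n,gamma} > Z_alpha * sigma0_hat."""
    n = len(y)
    labels, counts = np.unique(y, return_counts=True)
    p_hat = counts / n
    groups = [X[y == lab] for lab in labels]

    # modified statistic T_{n,gamma}, eq. (7): weights on between-group distances only
    T = 0.0
    for k, Xk in enumerate(groups):
        w = 1.0 + ((-1.0) ** np.arange(1, len(Xk) + 1)) * gamma  # omega_{i,n_k}(gamma)
        for l, Xl in enumerate(groups):
            if k == l:
                continue
            D = np.linalg.norm(Xk[:, None, :] - Xl[None, :, :], axis=2)
            T += p_hat[k] * p_hat[l] * (w[:, None] * D).mean()
    for k, Xk in enumerate(groups):
        nk = len(Xk)
        D = np.linalg.norm(Xk[:, None, :] - Xk[None, :, :], axis=2)
        T -= p_hat[k] * (1 - p_hat[k]) * D.sum() / (nk * (nk - 1))

    # variance estimate: eta(gamma) = gamma^2 for these weights (Theorem 2.1 form)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    h1 = D.sum(axis=1) / (n - 1)
    nu_hat = np.sum((h1 - h1.mean()) ** 2) / (n - 1)
    sigma0 = np.sqrt(gamma ** 2 * nu_hat * np.sum(p_hat * (1 - p_hat) ** 2))

    z = np.sqrt(n) * T / sigma0
    return z, z > NormalDist().inv_cdf(1 - alpha)
```

Under a clear mean shift between groups the standardized statistic is large and the test rejects; under the null it behaves approximately as a standard normal.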

3 Modified Gini distance covariance estimator in RKHS

Distance-based statistics can be generalized from a Euclidean space to metric spaces. With a Mercer kernel (Mercer 1909), distributions can be mapped into an RKHS equipped with a kernel induced distance. The Gini distance covariance has been generalized to an RKHS, \(\mathcal {H}_M\), as (Zhang et al. 2019)

$$\begin{aligned} \text {gCov}_{\mathcal {H}(M)}(\varvec{X}, Y)=&2 \sum _{1 \le k <l \le K}p_k p_l \mathbb {E}d_M(\varvec{X}^{(k)},\varvec{X}^{(l)})\nonumber \\&-\sum _{k=1}^K p_k(1-p_k)\mathbb {E}d_M(\varvec{X}^{(k)}_1,\varvec{X}^{(k)}_2), \end{aligned}$$
(10)

where \(M: \mathbb {R}^q \times \mathbb {R}^q \rightarrow \mathbb {R}\) is a Mercer kernel, which induces a distance function \(d_M: \mathbb {R}^q \times \mathbb {R}^q \rightarrow \mathbb {R}\) in \(\mathcal {H}_M\) given by

$$\begin{aligned} d_M(\varvec{x}, \varvec{x}')= \sqrt{M(\varvec{x}, \varvec{x})+M(\varvec{x}', \varvec{x}')-2M(\varvec{x}, \varvec{x}')}. \end{aligned}$$

As with the regular Gini distance covariance in \(\mathbb {R}^q\), the generalized Gini distance covariance characterizes independence in the RKHS: \(\text {gCov}_{\mathcal {H}(M)}(\varvec{X}, Y)=0\) if and only if \(\varvec{X}\) and Y are independent (Zhang et al. 2019).

The generalized Gini distance covariance can be estimated by

$$\begin{aligned} G_n=&\sum _{1 \le k \ne l \le K}\hat{p}_k \hat{p}_l \dfrac{1}{n_k n_l}\sum _{i=1}^{n_k}\sum _{j=1}^{n_l} d_M(\varvec{X}^{(k)}_i,\varvec{X}^{(l)}_j)\nonumber \\&-\sum _{k=1}^K \hat{p}_k(1-\hat{p}_k) {n_k \atopwithdelims ()2}^{-1}\sum _{1 \le i <j \le n_k} d_M(\varvec{X}^{(k)}_i,\varvec{X}^{(k)}_j), \end{aligned}$$
(11)

which has been shown to be degenerate, converging to an infinite mixture of chi-square distributions under independence of \(\varvec{X}\) and Y (Zhang et al. 2019).

We give a modified estimator as

$$\begin{aligned} G_{n, \gamma }=&\sum _{1 \le k \ne l \le K}\hat{p}_k \hat{p}_l \dfrac{1}{n_k n_l}\sum _{i=1}^{n_k}\sum _{j=1}^{n_l} \omega _{i, n_k}(\gamma )d_M(\varvec{X}^{(k)}_i,\varvec{X}^{(l)}_j)\nonumber \\&-\sum _{k=1}^K \hat{p}_k(1-\hat{p}_k) {n_k \atopwithdelims ()2}^{-1}\sum _{1 \le i <j \le n_k} d_M(\varvec{X}^{(k)}_i,\varvec{X}^{(k)}_j), \end{aligned}$$
(12)

where the weights \(\{\omega _{i, s}(\gamma )\}_{i=1}^s\) are chosen in the same way as in Sect. 2.

Theorem 3.1

Assume M is a Mercer kernel on \(\mathbb {R}^q \times \mathbb {R}^q\) that induces a distance function \(d_M(\cdot , \cdot )\) with bounded range [0, 1). Under independence of \(\varvec{X}\) and Y and conditions C1-C3, as \(\min \{n_1, n_2,..., n_K\} \rightarrow \infty \), we have

$$\begin{aligned} \sqrt{n}G_{n, \gamma } {\mathop {\longrightarrow }\limits ^{{d}}} \mathcal{N}(0,\sigma ^2_{M, \gamma }), \end{aligned}$$

with \(\sigma ^2_{M, \gamma }=\sum _{k=1}^K p_k(1-p_k)^2 \sigma ^2_{2, M}(\gamma )\) where \(\sigma ^2_{2, M}(\gamma )\) is given in the proof.

A consistent estimator of \(\sigma ^2_{M, \gamma }\) is \(\hat{\sigma }^2_{M,0}=\eta (\gamma )\hat{\nu }_M\sum _{k=1}^K\hat{p}_k(1-\hat{p}_k)^2\), where

$$\begin{aligned} \hat{\nu }_M&=\dfrac{1}{n-1}\sum _{i=1}^n \left\{ \dfrac{1}{n-1} \sum _{j=1}^n d_M(\varvec{X}_i, \varvec{X}_j)-\dfrac{1}{n}\sum _{m=1}^n \Big (\dfrac{1}{n-1} \sum _{j=1}^n d_M(\varvec{X}_m, \varvec{X}_j)\Big )\right\} ^2\\&=\dfrac{1}{n-1}\sum _{i=1}^n \left\{ \dfrac{1}{n-1} \sum _{j=1}^n d_M(\varvec{X}_i, \varvec{X}_j)-\dfrac{1}{n(n-1)}\sum _{m, j=1}^n d_M(\varvec{X}_m, \varvec{X}_j)\right\} ^2. \end{aligned}$$

Corollary 3.1

Assume M is a Mercer kernel on \(\mathbb {R}^q \times \mathbb {R}^q\) that induces a distance function \(d_M(\cdot , \cdot )\) with bounded range [0, 1). Under independence of \(\varvec{X}\) and Y and conditions C1-C3, as \(\min \{n_1, n_2,..., n_K\} \rightarrow \infty \), we have

$$\begin{aligned} \dfrac{\sqrt{n}G_{n, \gamma }}{\hat{\sigma }_{M, 0}} {\mathop {\longrightarrow }\limits ^{{d}}} \mathcal{N}(0, 1). \end{aligned}$$

The modified estimator of the generalized Gini distance covariance can also be used to test the equality of K populations. By Corollary 3.1, we can reject \(\mathcal{H}_0\) or \(\mathcal{H}_0^\prime \) if \(\sqrt{n} G_{n, \gamma }>Z_{\alpha }\hat{\sigma }_{M,0}\) at level \(\alpha \).

Fig. 1

Histograms of the proposed weighted Gini covariance under the weight function \(\omega _{i,n}(\gamma ) = 1+(-1)^i \gamma \) with \(\gamma \) values of 0.1, 0.2, 0.5 and 0.8, respectively. The left plots are for dimension \(q=3\), and the right ones for dimension \(q=5\)

4 Simulation

In this section, we conduct simulation studies to verify the theoretical properties of the modified Gini covariance statistic and to compare its performance in K-sample tests with other methods. Based on the empirical results, we also discuss how to select the weight function.

4.1 Limiting normality

We generate K independent samples from the same multivariate normal distribution and compute the weighted Gini covariance statistic with weights \(\omega _{i,n}(\gamma )=1+(-1)^i \gamma, i=1,...,n\). The procedure is repeated 10000 times.

Example 1

\(K=2\) samples of size \((n_1, n_2)=(200, 200)\) are generated from \( \mathcal {N}(\varvec{0}, \varvec{\Sigma })\), where \(\varvec{\Sigma }=(\Sigma _{ij}) \in \mathbb {R}^{q \times q}\) with \(\Sigma _{ij}=0.7^{|i-j|}\). We consider \(q=3, 5\) and \(\gamma =0.1, 0.2, 0.5, 0.8\), respectively.
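The sampling scheme of Example 1 can be sketched as follows; the function name and defaults are ours.

```python
import numpy as np

def example1_sample(n1=200, n2=200, q=3, rho=0.7, rng=None):
    """Two samples from N(0, Sigma) with Sigma_ij = rho^{|i-j|}, so the null holds."""
    rng = np.random.default_rng() if rng is None else rng
    idx = np.arange(q)
    Sigma = rho ** np.abs(idx[:, None] - idx[None, :])  # AR(1)-type covariance
    X = rng.multivariate_normal(np.zeros(q), Sigma, size=n1 + n2)
    y = np.repeat([0, 1], [n1, n2])  # class labels for the two (identical) groups
    return X, y
```

Feeding such samples into the weighted statistic and standardizing by \(\hat{\sigma }_0\) produces the histograms summarized in Fig. 1.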

For each dimension and each value of \(\gamma \), the histogram of the 10000 standardized weighted Gini covariance statistics is plotted in Fig. 1. The kernel density estimates (KDE) for the weighted Gini covariance and the standard Gini covariance are added to the plots. We also add the standard normal density curve to visualize the closeness between the empirical and asymptotic densities. First, the KDEs for the regular Gini covariance are always skewed to the right, which agrees well with its limiting distribution, a mixture of \(\chi ^2\) distributions, reflecting the degeneracy of the regular Gini covariance statistic. We also notice that the histograms at \(\gamma =0.1\) for both dimensions are skewed to the right, with some discrepancy between the KDE of the weighted Gini covariance and the normal curve. However, as \(\gamma \) increases, the discrepancy diminishes. This suggests that larger \(\gamma \) values are preferred for this weight function. We use \(\gamma =0.8\) in the next subsection for the performance comparison in K-sample tests. The impact of the choice of \(\gamma \) in \(\omega _{i,n}(\gamma )=1+(-1)^i \gamma \) as well as in \(\omega _{i, s}(\gamma )=1+\sin (i \pi \gamma )\) is explored in Subsection 4.3.

Table 1 Size and Power of Tests in Example 2

4.2 Size and power in K-sample tests

In this simulation, we compare three methods for the K-sample problem by computing their type I errors and powers.

mmd: the generalized maximum mean discrepancy method developed in Balogoun et al. (2021).

wrg: our proposed method using the weighted Gini covariance statistic.

wkrg: our proposed method using the weighted Gini covariance statistic in an RKHS, where the distance function \(d_M(\varvec{x}, \varvec{x}')=\sqrt{1-e^{{-\Vert \varvec{x}-\varvec{x}'\Vert ^2}/{\sigma ^2}}}\) is induced by the weighted Gaussian kernel \(M(\varvec{x}, \varvec{x}')=0.5e^{-{\Vert \varvec{x}-\varvec{x}'\Vert ^2}/{\sigma ^2}}\) (Zhang et al. 2019).

Both mmd and wkrg are kernel methods. The bandwidth of the Gaussian kernel in both methods is chosen to be the median of pairwise distances, as used and suggested in Chen et al. (2009). All three methods use the weight function \(\omega _{i,n}(\gamma )=1+(-1)^i\gamma \) with \(\gamma =0.8\).
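The median heuristic for the bandwidth can be sketched in a few lines; the helper name is ours.

```python
import numpy as np

def median_bandwidth(X):
    """Median of pairwise Euclidean distances, used as the Gaussian-kernel
    bandwidth for mmd and wkrg (the median heuristic)."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    iu = np.triu_indices(len(X), k=1)  # distinct pairs only, excluding the diagonal
    return np.median(D[iu])
```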

We consider cases for \(K=3\) and \(q=5\) with \(\varvec{p} = (p_1,p_2,...,p_K)\), where \(p_k = P(Y_i=L_k)\): (I) balanced, \(\varvec{p} = (1/3, 1/3, 1/3)\); (II) slightly unbalanced, \(\varvec{p} = (3/12, 4/12, 5/12)\); (III) heavily unbalanced, \(\varvec{p} = (0.1, 0.3, 0.6)\). We conduct 10000 simulations for sample sizes \(n=120\) and \(n=240\), respectively. The type I error and power of each test are computed at significance level \(\alpha =0.05\) for Examples 2 and 3.

Example 2

Generate samples \(\varvec{X}^{(k)} = \varvec{\mu }_k + \varvec{\epsilon }\), \(k=1,2,3\), where the mean vectors are \(\varvec{\mu }_1 = (0, 0,...,0)\), \(\varvec{\mu }_2 =(\delta _1, \delta _1,..., \delta _1)\), \(\varvec{\mu }_3 = (\delta _2, \delta _2,...,\delta _2)\), and \(\varvec{\epsilon }= (\epsilon _{1}, \epsilon _{2},...,\epsilon _{q})\) is a q-dimensional error term whose components \(\epsilon _{j}\) are iid N(0, 1). Here \(\varvec{\delta }=(\delta _1, \delta _2)\) measures the differences in means.

Results for Example 2 are reported in Table 1. The column \(\varvec{\delta }=(0, 0)\) corresponds to the size of the tests. At \(n=120\), all tests are slightly oversized, with sizes 1-2% above the nominal level, and they all have higher power in the balanced case than in the unbalanced cases. For the unbalanced cases, our method wrg gains a 1%-4% power advantage over mmd at small values of \(\varvec{\delta }\), especially in the heavily unbalanced case. As the sample size increases, the type I errors of wrg and wkrg move closer to the nominal level and their powers improve. However, mmd suffers from severe undersizing, with very low power in the unbalanced cases when the difference in means is small.

Example 3

We generate samples \(\varvec{X}^{(k)} = (Z_{k1}, Z_{k2},...,Z_{kq})^T\), \(k=1,2,3\). For \(j=1,...,q\), the \(Z_{1j}\)'s are i.i.d. Exp(1), the \(Z_{2j}\)'s are i.i.d. Exp(\(\delta _1\)), and the \(Z_{3j}\)'s are i.i.d. Exp(\(\delta _2\)).

We present the results for this example in Table 2. The mmd test appears sensitive to the asymmetry of the distributions: it is undersized and its power is much lower than that of the weighted Gini covariance based tests. Our wrg performs best, with well-controlled size and higher power.

Table 2 Size and Power of Tests in Example 3

4.3 Discussion on weights

Manfoumbi Djonguet et al. (2024) provided two weight schemes based on sine and cosine functions, but they did not study the performance of those weights. In this simulation, we compare the weight \(\omega _{i, s}(\gamma )=1+\sin (i \pi \gamma )\) (weight2) with the previously used \(\omega _{i, s}(\gamma ) = 1+(-1)^i \gamma \) (weight1). We compare the effects of \(\gamma \) values of 0.2, 0.4, 0.6, 0.8 and 0.9 for each weight function on our wrg method. Based on the empirical results, we offer some suggestions on the choice of weights.

The balanced \(\varvec{p} = (1/3, 1/3, 1/3)\) and unbalanced \(\varvec{p} =(0.1,0.3,0.6)\) scenarios of Example 2 are considered. For sample sizes \(n=60, 90, 120, 150, 180, 210, 240, 270, 300\), the type I errors and/or powers of the tests under the different weighting schemes are calculated based on 10000 repetitions and reported in plots.

Figure 2 shows the balanced case, with the top two plots showing type I error and the bottom two showing power at \((\delta _1, \delta _2)=(0.3, 0.6)\). The left panels are for weight1, and the right ones for weight2.

The nominal size is 0.05. From Fig. 2, we observe that the tests based on both weight functions are oversized, but the issue becomes less serious as the sample size increases. For weight1 with \(\gamma \) values 0.8 and 0.9, the type I errors decrease from 0.10 to 0.06 as the sample size increases from 60 to 150, but small \(\gamma \) values of 0.2 and 0.4 produce unacceptable type I errors of 0.16 and 0.10, respectively, even at sample size 300. For weight2 with a wide range of \(\gamma \) values from 0.2 to 0.8, the type I errors decrease from 0.12 to 0.08 as the sample size increases from 60 to 150; further reducing the type I error to 0.06 requires a sample size as large as 300. As the sample size increases, the \(\gamma \) value 0.9 yields relatively large zig-zag oscillations in the type I errors, which is undesirable.

Fig. 2

Empirical size and power versus total sample size at different \(\gamma \) values in Example 2 with \(\varvec{p} = (1/3, 1/3, 1/3)\). The left plots are for \(\omega _{i,n}(\gamma ) = 1+(-1)^i \gamma \), and the right ones for \(\omega _{i,n}(\gamma ) = 1+\sin (i\pi \gamma )\)

The tests based on both weight schemes produce relatively high power; the power of all tests exceeds 0.90 at sample size 150. For weight1, the power decreases in \(\gamma \). Balancing controllable type I error against high power, \(\gamma =0.8\) is recommended; this suggestion is also supported by the earlier empirical results in Subsections 4.1 and 4.2. weight2 performs very well in terms of power: except for 0.9, all \(\gamma \) values produce almost the same power at each sample size, so it is quite robust to the choice of \(\gamma \) in the balanced case. However, in the unbalanced case, weight2 fails badly, both in terms of huge type I errors and in terms of sensitivity to the choice of \(\gamma \).

The type I errors of the tests under the unbalanced scenario are reported in Fig. 3. The left plot is for weight1, and the right one for weight2; both plots use the same scale to provide a fair visual comparison. From Fig. 3, we see that for weight2, all \(\gamma \) values produce unacceptable type I errors when the sample size is less than 300. With \(\gamma =0.2\) and sample sizes less than 240, the test is meaningless, with type I errors higher than 0.9. For \(\gamma =0.4\), the type I errors jump up and down over a wide range, reaching 0.94, then 0.53, then up to 0.97 followed by 0.08, and then 0.43 followed by 0.64 as the sample size changes from 60 to 210. Oscillation patterns over large ranges of type I errors are also present for larger \(\gamma \) values. However, when the sample size is as large as 300, all tests have good size; we increased the sample size up to 600 and found that weight2 then maintains the nominal size well. For weight1 with sample sizes less than 300, the type I errors also show zig-zag oscillations, but over much smaller ranges. Except for \(\gamma =0.2\), all weight1 tests yield a reasonably good empirical size when the sample size is 300 or larger.

Overall, for the balanced case, both weights are acceptable depending on the choice of \(\gamma \), and weight2 performs better than weight1 in that it allows a wide range of choices for \(\gamma \). For unbalanced cases, weight2 is not applicable unless the sample size is sufficiently large (at least 300). weight1 with \(\gamma =0.8\) is recommended because of its controllable type I errors and high powers in both balanced and unbalanced cases.

Fig. 3

Empirical size versus the sample size at different \(\gamma \) values when \(\varvec{p} = (0.1, 0.3, 0.6)\). The left plot is for \(\omega _{i,n}(\gamma ) = 1+(-1)^i \gamma \), and the right one for \(\omega _{i,n}(\gamma ) = 1+\sin (i\pi \gamma )\)

5 Conclusions and future work

We have proposed a modified estimator of the Gini distance correlation. By adding weights to the distances between different groups, the modified estimator admits a normal limit under independence of the numerical and categorical variables. We have also generalized the results to RKHS, where normal limits also hold. All of the asymptotic results have been applied to test the equality of K distributions. With a proper choice of weight function, the modified Gini correlation estimator performs well. We have studied two types of weights, \(\omega _{i,n}(\gamma )=1+(-1)^i \gamma \) and \(\omega _{i, s}(\gamma )=1+\sin (i \pi \gamma )\). The second weight works well over a wide range of \(\gamma \) for the balanced K-sample problem, but in unbalanced cases it is not applicable unless the sample size is sufficiently large. The first weight with a large \(\gamma \) value such as 0.8 is recommended: it controls the type I error close to the nominal level in all cases, balanced and unbalanced, and keeps reasonably high power.

In real applications of the K-sample problem, most existing omnibus tests are based on permutation procedures. Permutation tests have some optimality properties (Gebhard and Schmitz 1998a), but they are computation-intensive (Gebhard and Schmitz 1998b). By considering all permutations, exact tests such as that of Neuhäuser (2005) are only feasible for very small sample sizes; for larger sample sizes, a large number of random permutations must be drawn to approximate the p-value or determine the critical value. The test based on the regular Gini distance covariance (Dang et al. 2021) relies on such a computationally expensive procedure. Our proposed weighted Gini covariance statistic admits a normal limit and can be applied directly in real-world analyses of the K-sample problem, avoiding permutation procedures.

In this paper, we used the median of pairwise distances as the bandwidth of the Gaussian kernel for wkrg and mmd. This choice makes half of the pairwise distances in the induced feature space greater than 0.8871 and 0.6065 for the wkrg and mmd methods, respectively. The choice is simple and appears effective, but it is by no means "optimal". How to select an optimal bandwidth (in terms of some criterion) is always a challenge for any kernel method. For the wkrg and mmd methods, the task is particularly difficult because the bandwidth should be selected jointly with the weight function. How to jointly select the optimal kernel parameter and weight scheme in wkrg and mmd is worthy of further investigation.