1 Introduction

Traditional statistical methods for measuring the association between random vectors are generally based on the covariance or correlation; see Wilks (1935), Anderson (2003), Robert et al. (1985) and Székely et al. (2007), among others. Wilks (1935) introduced an effective likelihood ratio test (LRT) for block independence under a Gaussian population, and Anderson (2003) detailed the LRT for the Gaussian population. The RV correlation coefficient proposed by Escoufier (1973) was used in Robert et al. (1985) to measure multivariate association between two sets of variables. Székely et al. (2007) developed distance covariance and distance correlation, and provided an approach to testing the joint independence of random vectors. The asymptotic properties of these classical procedures are established under the scheme that the sample size n tends to infinity while the dimensions (p and q) are fixed. This is the so-called “small p and q, large n” paradigm.

However, high-dimensional data, such as those arising in microarray analysis, tumor classification and biomedical imaging, tend to have dimensions (p and/or q) comparable to, or much larger than, the sample size n. This brings great challenges to the traditional methods. For example, the empirical power of a conventional test may be severely affected by the increasing dimension, and may even converge to the significance level \(\alpha \) due to “the curse of dimensionality”, meaning that the test cannot distinguish the null hypothesis from the alternatives. Therefore more and more statisticians are pursuing new methods to address high-dimensional problems. To accommodate the large dimensionality, Jiang et al. (2013) proposed the corrected likelihood ratio test and a large-dimensional trace criterion to test the independence of two large sets of multivariate variables when the dimension \(p+q\) and the sample size n tend to infinity simultaneously. To make the RV coefficient applicable to high-dimensional data, test procedures were introduced in the following two papers. Srivastava and Reid (2012) proposed a new statistic based on the RV coefficient for testing the independence of two sub-vectors, and obtained its asymptotic properties under the scheme that \(\min (p,q,n)\rightarrow \infty \), \( p/(p+q)\rightarrow d_1>0, q/(p+q)\rightarrow d_2>0\) and \(n=O((p+q)^{\delta })\) for some constant \(\delta >0\). By constructing an unbiased estimator of the numerator of the RV coefficient, Li et al. (2017) considered the independence test under the assumption that only one random vector has a divergent dimension, that is, \(\min (p,n)\rightarrow \infty \) and q is fixed. The asymptotic properties in these three papers (Jiang et al. 2013; Srivastava and Reid 2012; Li et al. 2017) are established under the assumption that the random vectors are multivariate normally distributed. Without the normality constraint, Yang and Pan (2015) proposed a test statistic based on regularized canonical correlation coefficients, and obtained its limiting distributions when both p and q are comparable to the sample size n. Having discovered that the empirical distance correlation of two independent vectors converges to one as the dimensions tend to infinity, Székely and Rizzo (2013) extended the distance correlation with a modified version for high-dimensional settings, and obtained a distance correlation t-test for independence of random vectors in arbitrarily high dimension. Heller et al. (2012) presented a powerful test of association based on ranks of distances which is consistent against all alternatives and can be applied for any dimensions p and q, even greater than n. On the basis of the power enhancement technique introduced by Fan et al. (2015), Zheng et al. (2022) developed a powerful test of block-structured correlation of a high-dimensional random vector for sparse or non-sparse alternatives without a normality assumption, and obtained its statistical properties under the asymptotic regime that \((p+q)/n\rightarrow y \in (0,\infty )\).

This paper aims to develop a new and powerful test of high-dimensional association under no strict distributional assumptions. To this end, we propose a U-statistic of order four based on the RV covariance introduced in Escoufier (1973). To boost the power, especially under sparse alternatives, a screening term is added. It is worth mentioning that the proposed test statistic is effective not only for non-sparse alternatives but also for sparse alternatives. Four distinguishing features can be summarized as follows. First, although the proposed U-statistic is of order four, it can be computed quickly via an equivalent pairwise form based on \({\mathcal {U}}\)-centred inner products. Second, the asymptotic distributions of the U-statistic under both the null hypothesis and local alternatives are derived under the scheme that \(p+q\) tends to infinity. It is noteworthy that this scheme covers two cases: one in which only p or q is comparable to or much larger than the sample size n, and the other in which both p and q tend to infinity with n simultaneously. Third, the power enhancement technique can dramatically improve the performance of the proposed test, as demonstrated by Monte Carlo simulations. Last but not least, some examples under specific structures are given to gain more insight into the regularity conditions.

The rest of this work is organized as follows. In Sect. 2, a new statistic for testing the association between two random vectors is proposed on the basis of the RV covariance, and its asymptotic properties are established. In Sect. 3 we investigate the properties of the screening term based on the power enhancement technique. Numerical studies and a real data analysis are presented in Sect. 4 to examine the size and power of the proposed test. The Appendix gathers the technical proofs.

2 Association test in high dimension

In this section, we study and test the association between two random vectors \(\textbf{X}=(X_1,\ldots ,X_p)^\top \in {\mathbb {R}}^p\) and \(\textbf{Y}= (Y_1,\ldots ,Y_q)^\top \in {\mathbb {R}}^q\). Let \(\varvec{\Sigma }_{\textbf{X}\textbf{Y}}\) denote the population covariance matrix of \(\textbf{X}\) and \(\textbf{Y}\). Our association test hypothesis can be formulated as follows:

$$\begin{aligned} H_0: \varvec{\Sigma }_{\textbf{X}\textbf{Y}}=0_{p\times q} \hspace{5mm} \text {VS} \hspace{5mm} H_1: \varvec{\Sigma }_{\textbf{X}\textbf{Y}}\ne 0_{p\times q}. \end{aligned}$$

It is worth noting that \(\varvec{\Sigma }_{\textbf{X}\textbf{Y}}=0_{p\times q}\) if and only if the two random vectors are uncorrelated. Escoufier (1973) defined \(\textrm{tr}(\varvec{\Sigma }_{\textbf{X}\textbf{Y}}\varvec{\Sigma }_{\textbf{Y}\textbf{X}})\) as the “covariance” of two random vectors \(\textbf{X}\) and \(\textbf{Y}\), where \(\textrm{tr}(\cdot )\) denotes the trace operator. It is evident that \(\textrm{tr}(\varvec{\Sigma }_{\textbf{X}\textbf{Y}}\varvec{\Sigma }_{\textbf{Y}\textbf{X}})= 0\) is equivalent to \(\varvec{\Sigma }_{\textbf{X}\textbf{Y}}=0_{p\times q}\). This motivates us to utilize \(\textrm{tr}(\varvec{\Sigma }_{\textbf{X}\textbf{Y}}\varvec{\Sigma }_{\textbf{Y}\textbf{X}})\) to quantify the discrepancy between \(\varvec{\Sigma }_{\textbf{X}\textbf{Y}}\) and \(0_{p\times q}\). Adopting the idea of \({\mathcal {U}}\)-centring in Székely and Rizzo (2014) and Yao et al. (2018), we construct an unbiased estimator, denoted by \(T_{n,p,q}(\textbf{X},\textbf{Y})\), of \(\textrm{tr}(\varvec{\Sigma }_{\textbf{X}\textbf{Y}}\varvec{\Sigma }_{\textbf{Y}\textbf{X}})\). Suppose that \(\textbf{Z}_i=(\textbf{X}_i,\textbf{Y}_i)^{^\top }\in {\mathbb {R}}^{p+q}\), \(i=1,\ldots ,n\), are random samples of \(\textbf{Z}=(\textbf{X},\textbf{Y})^{^\top }\), where \(\textbf{X}_i=(X_{i1},\ldots ,X_{ip})^\top \) and \(\textbf{Y}_i=(Y_{i1},\ldots ,Y_{iq})^\top \). Then \(T_{n,p,q}(\textbf{X},\textbf{Y})\) is given by

$$\begin{aligned} T_{n,p,q}(\textbf{X},\textbf{Y})=\left( {\begin{array}{c}n\\ 4\end{array}}\right) ^{-1}\sum _{i<j<k<l} h(\textbf{Z}_i,\textbf{Z}_j,\textbf{Z}_k,\textbf{Z}_l), \end{aligned}$$

where

$$\begin{aligned} h(\textbf{Z}_1,\textbf{Z}_2,\textbf{Z}_3,\textbf{Z}_4)=\frac{1}{ 4!}\sum _{(s,t,u,v)}^{(1,2,3,4)} \frac{1}{4}(\textbf{X}_s-\textbf{X}_t)^\top (\textbf{X}_u-\textbf{X}_v)(\textbf{Y}_s-\textbf{Y}_t)^\top (\textbf{Y}_u-\textbf{Y}_v) \end{aligned}$$

and the summation is over all 24 permutations of the 4-tuples of indices (1, 2, 3, 4).

Clearly, \(T_{n,p,q}(\textbf{X},\textbf{Y})\) is a U-statistic of order four and an unbiased estimator of \(\textrm{tr}(\varvec{\Sigma }_{\textbf{X}\textbf{Y}}\varvec{\Sigma }_{\textbf{Y}\textbf{X}})\).

Remark 1

The tests for high-dimensional regression coefficients in Zhong and Chen (2011) and Cui et al. (2018) can be seen as special cases of our test hypothesis with respect to the measure of discrepancy. Consider the linear regression model

$$\begin{aligned} Y=\alpha +\textbf{X}^\top \varvec{\beta }+\varepsilon \end{aligned}$$

where \(\varvec{\beta }=(\beta _1,\ldots ,\beta _p)^\top \in {\mathbb {R}}^p\) is a \(p\)-dimensional vector of regression coefficients of interest and \(\alpha \) is a nuisance intercept parameter. The random error \(\varepsilon \) has mean zero and variance \(\sigma ^2\), and is independent of \(\textbf{X}\). Testing the high-dimensional regression coefficients simultaneously in the linear model can be formulated as follows:

$$\begin{aligned} H_0:\varvec{\beta }=0_{p\times 1} \hspace{5mm} \text {VS} \hspace{5mm} H_1:\varvec{\beta }\ne 0_{p\times 1}. \end{aligned}$$

This tests the overall significance of the linear regression coefficients. Both Zhong and Chen (2011) and Cui et al. (2018) adopted \(\varvec{\beta }^\top \varvec{\Sigma }_{\textbf{X}}^2\varvec{\beta }\) as an effective measure of the difference between \(\varvec{\beta }\) and \(0_{p \times 1}\). A simple calculation shows that \(\textrm{tr}(\varvec{\Sigma }_{\textbf{X}Y}\varvec{\Sigma }_{Y\textbf{X}})=\varvec{\beta }^\top \varvec{\Sigma }_{\textbf{X}}^2\varvec{\beta }\).
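For completeness, the calculation is immediate: under the model above, \(\varvec{\Sigma }_{\textbf{X}Y}=\textrm{Cov}(\textbf{X},\alpha +\textbf{X}^\top \varvec{\beta }+\varepsilon )=\varvec{\Sigma }_{\textbf{X}}\varvec{\beta }\), so that

$$\begin{aligned} \textrm{tr}(\varvec{\Sigma }_{\textbf{X}Y}\varvec{\Sigma }_{Y\textbf{X}})=\textrm{tr}(\varvec{\Sigma }_{\textbf{X}}\varvec{\beta }\varvec{\beta }^\top \varvec{\Sigma }_{\textbf{X}})=\varvec{\beta }^\top \varvec{\Sigma }_{\textbf{X}}^2\varvec{\beta }. \end{aligned}$$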

Remark 2

\(\textrm{tr}(\varvec{\Sigma }_{\textbf{X}\textbf{Y}}\varvec{\Sigma }_{\textbf{Y}\textbf{X}})\), the squared distance covariance dCov\(^2(\textbf{X},\textbf{Y})\) introduced by Székely et al. (2007) and the squared martingale difference divergence MDD(\(Y|\textbf{X})^2\) proposed in Shao and Zhang (2014) can be constructed in an analogous way. In fact, write

$$\begin{aligned} \tau (\textbf{X},\textbf{Y})&=\textrm{E}(\Vert \textbf{X}-\textbf{X}'\Vert ^{\alpha }\Vert \textbf{Y}-\textbf{Y}'\Vert ^{\beta })+\textrm{E}(\Vert \textbf{X}-\textbf{X}'\Vert ^{\alpha })\textrm{E}(\Vert \textbf{Y}-\textbf{Y}'\Vert ^{\beta })\\ &\quad - 2\textrm{E}(\Vert \textbf{X}-\textbf{X}'\Vert ^{\alpha }\Vert \textbf{Y}-\textbf{Y}''\Vert ^{\beta }), \end{aligned}$$

where \(\textbf{Z}'=(\textbf{X}',\textbf{Y}')^\top \) and \(\textbf{Z}''=(\textbf{X}'',\textbf{Y}'')^\top \) are independent copies of \(\textbf{Z}=(\textbf{X},\textbf{Y})^\top \). When \(\alpha =\beta =1\), \(\tau (\textbf{X},\textbf{Y})=\text {dCov}^2(\textbf{X},\textbf{Y})\), which characterizes independence of the random vectors \(\textbf{X}\) and \(\textbf{Y}\). When \(\alpha =1\) and \(\beta =2\), \(\tau (\textbf{X},Y)= 2\text {MDD}(Y|\textbf{X})^2\), which measures the departure from conditional mean independence of a scalar response variable Y given a vector predictor \(\textbf{X}\). When \(\alpha =\beta =2\), \(\tau (\textbf{X},\textbf{Y})=4\textrm{tr}(\varvec{\Sigma }_{\textbf{X}\textbf{Y}}\varvec{\Sigma }_{\textbf{Y}\textbf{X}})\). Furthermore, it is worth mentioning that \(\textrm{tr}(\varvec{\Sigma }_{\textbf{X}\textbf{Y}}\varvec{\Sigma }_{\textbf{Y}\textbf{X}})\) degenerates into the squared covariance of the random variables X and Y when \(p=q=1\). This suggests considering not only the simultaneous measure between the random vectors \(\textbf{X}\) and \(\textbf{Y}\), but also the marginal measures between their components \(X_i\) and \(Y_j\), \(i=1,\ldots ,p\), \(j=1,\ldots ,q\), owing to the curse of dimensionality.

Remark 3

Following Székely and Rizzo (2014) and Yao et al. (2018), \(T_{n,p,q}(\textbf{X},\textbf{Y})\) can be implemented quickly. Define the respective \({\mathcal {U}}\)-centred versions of \(a_{ij}=\textbf{X}_i^\top \textbf{X}_j\) and \({b}_{ij}=\textbf{Y}_i^\top \textbf{Y}_j\) as follows:

$$\begin{aligned} A_{ij}&={a}_{ij}-\frac{1}{n-2}\sum _{l\ne i} {a}_{il}- \frac{1}{n-2}\sum _{k\ne j}{a}_{kj}+ \frac{1}{(n-1)(n-2)}\sum _{k\ne l} {a}_{kl},\\ B_{ij}&={b}_{ij}-\frac{1}{n-2}\sum _{l\ne i} {b}_{il} -\frac{1}{n-2}\sum _{k\ne j}{b}_{kj}+\frac{1}{(n-1)(n-2)}\sum _{k\ne l}{b}_{kl}. \end{aligned}$$

Then \(T_{n,p,q}(\textbf{X},\textbf{Y})\) has the following reformulation:

$$\begin{aligned} T_{n,p,q}(\textbf{X},\textbf{Y})=\frac{1}{n(n-3)}\sum _{i\ne j} {A}_{ij}{B}_{ij} \end{aligned}$$

which proves to be a quick implementation in the numerical simulations: once the inner-product matrices \((a_{ij})\) and \((b_{ij})\) are formed, only \(O(n^2)\) additional operations are required.
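To make the reformulation concrete, here is a minimal R sketch (our own code, not the authors') of this fast computation; the helper names `u_center` and `T_npq` are ours, and `X`, `Y` denote the \(n\times p\) and \(n\times q\) data matrices.

```r
# U-centring of an n x n inner-product matrix, following the display above.
u_center <- function(M) {
  n <- nrow(M)
  row_off <- rowSums(M) - diag(M)     # sum over l != i of M[i, l]
  col_off <- colSums(M) - diag(M)     # sum over k != j of M[k, j]
  tot_off <- sum(M) - sum(diag(M))    # sum over k != l of M[k, l]
  M - outer(row_off, rep(1, n)) / (n - 2) -
      outer(rep(1, n), col_off) / (n - 2) +
      tot_off / ((n - 1) * (n - 2))
}

# Fast evaluation of T_{n,p,q}(X, Y) = sum_{i != j} A_ij * B_ij / {n(n - 3)}.
T_npq <- function(X, Y) {
  n <- nrow(X)
  A <- u_center(X %*% t(X))           # a_ij = X_i^T X_j, U-centred
  B <- u_center(Y %*% t(Y))           # b_ij = Y_i^T Y_j, U-centred
  off <- !diag(n)                     # logical mask for i != j
  sum(A[off] * B[off]) / (n * (n - 3))
}
```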

2.1 Asymptotic analysis of \(T_{n,p,q}(\textbf{X},\textbf{Y})\) under null hypothesis

In this subsection the asymptotic properties of \(T_{n,p,q}(\textbf{X},\textbf{Y})\) are investigated under some regularity assumptions. Some notation is introduced before studying the properties of the proposed statistic. Write \(L_1(\textbf{Z},\textbf{Z}')=\textrm{E}\{L(\textbf{Z},\textbf{Z}'')L(\textbf{Z}',\textbf{Z}'')|(\textbf{Z},\textbf{Z}')\}\) and \(\zeta ^2=\textrm{E}\{L(\textbf{Z},\textbf{Z}')^2\}\), where \(L(\textbf{Z},\textbf{Z}')=\textbf{X}^\top \textbf{X}'\textbf{Y}^\top \textbf{Y}'\). To obtain the asymptotic distribution of \(T_{n,p,q}(\textbf{X},\textbf{Y})\), we require the following technical assumption.

(A1) \(\textrm{E}\{L_1(\textbf{Z},\textbf{Z}')^2\}=o(\zeta ^4)\), \(\textrm{E}\{(\textbf{X}^\top \textbf{X}'\textbf{Y}^\top \textbf{Y}')^4\}=o(n\zeta ^4)\), \(\textrm{E}\{(\textbf{X}^\top \textbf{X}'\textbf{Y}^\top \textbf{Y}'')^4\}=o(n\zeta ^4)\) and \(\textrm{E}\{(\textbf{X}^\top \textbf{X}')^4\}\textrm{E}\{(\textbf{Y}^\top \textbf{Y}')^4\}=o(n\zeta ^4)\).

The following theorem presents the limiting null distribution of \(T_{n,p,q}(\textbf{X},\textbf{Y})\).

Theorem 1

Suppose assumption (A1) holds. Then under \(H_0\), as \((n,p+q)\rightarrow \infty \) we have that

$$\begin{aligned} \frac{nT_{n,p,q}(\textbf{X},\textbf{Y})}{\sqrt{2\zeta ^2}} \overset{d}{\longrightarrow }\ {\mathcal {N}}(0,1), \end{aligned}$$

and a ratio-consistent estimator of \(\zeta ^2\) is

$$\begin{aligned} \zeta ^2_n=\frac{1}{n(n-3)}\sum _{i\ne j} A_{ij}^2B_{ij}^2, \end{aligned}$$

where \(\overset{d}{\longrightarrow }\ \) denotes convergence in distribution.

By referring to Zhang et al. (2018) and its supplementary material, we impose assumption (A1) to ensure the asymptotic normality of the degenerate U-statistic \(T_{n,p,q}(\textbf{X},\textbf{Y})\) under the null hypothesis. Meanwhile, this assumption guarantees that \(\zeta ^2_n\) is ratio-consistent. To further understand this assumption, the following proposition is established for the case where \(\textbf{Z}=(\textbf{X},\textbf{Y})^{^\top }\) follows a multivariate normal distribution.

Proposition 1

Suppose that \(\textbf{Z}=(\textbf{X},\textbf{Y})^{^\top }\sim {\mathcal {N}}(0_{(p+q)\times 1},\textbf{I}_{p+q})\). Then we have

$$\begin{aligned} \zeta ^2&=pq,\quad \textrm{E}\{L_1(\textbf{Z},\textbf{Z}')^2\}=pq,\\ \textrm{E}\{(\textbf{X}^\top \textbf{X}'\textbf{Y}^\top \textbf{Y}')^4\}&=9pq(p+2)(q+2),\\ \textrm{E}\{(\textbf{X}^\top \textbf{X}'\textbf{Y}^\top \textbf{Y}'')^4\}&=9pq(p+2)(q+2),\\ \textrm{E}\{(\textbf{X}^\top \textbf{X}')^4\}\textrm{E}\{(\textbf{Y}^\top \textbf{Y}')^4\}&=9pq(p+2)(q+2). \end{aligned}$$

Proposition 1 sheds light on assumption (A1). When \(\textbf{Z}\) follows a standard multivariate normal distribution, \(\textrm{E}\{L_1(\textbf{Z},\textbf{Z}')^2\}/\zeta ^4=1/(pq)\rightarrow 0\) as \(p+q\rightarrow \infty \), and each of the fourth-moment terms divided by \(n\zeta ^4\) equals \(9(p+2)(q+2)/(npq)=O(n^{-1})\rightarrow 0\) as \(n\rightarrow \infty \). Hence condition (A1) holds automatically as long as n and \(p+q\) tend to infinity, which suggests that it is a reasonable assumption.

Remark 4

Theorem 1 states that the asymptotic null distribution of \(T_{n,p,q}\) is normal, and we can reject \(H_0\) at a significance level \(\alpha \) if

$$\begin{aligned} nT_{n,p,q}(\textbf{X},\textbf{Y}) \ge z_\alpha \sqrt{2\zeta ^2_n}, \end{aligned}$$

where \(z_{\alpha }\) denotes the upper \(\alpha \) quantile of \({\mathcal {N}}(0,1)\).
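For illustration, here is a minimal R sketch (assuming the helpers `u_center` and `T_npq` from Remark 3) of the variance estimator \(\zeta ^2_n\) and the resulting level-\(\alpha \) decision rule; the function names are ours.

```r
# Ratio-consistent variance estimator from Theorem 1.
zeta2_n <- function(X, Y) {
  n <- nrow(X)
  A <- u_center(X %*% t(X))
  B <- u_center(Y %*% t(Y))
  off <- !diag(n)
  sum(A[off]^2 * B[off]^2) / (n * (n - 3))
}

# Reject H0 when n * T_{n,p,q} >= z_alpha * sqrt(2 * zeta_n^2).
test_T <- function(X, Y, alpha = 0.05) {
  n <- nrow(X)
  stat <- n * T_npq(X, Y) / sqrt(2 * zeta2_n(X, Y))
  list(statistic = stat,
       p.value   = pnorm(stat, lower.tail = FALSE),
       reject    = stat >= qnorm(1 - alpha))
}
```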

2.2 Asymptotic analysis of \(T_{n,p,q}(\textbf{X},\textbf{Y})\) under the local alternatives

In this subsection, we turn to the asymptotic analysis of \(T_{n,p,q}(\textbf{X},\textbf{Y})\) under the local alternatives.

The following assumption is required for theoretical study.

(A2) \(\textrm{E}\{(\textbf{X}^\top \varvec{\Sigma }_{\textbf{X}\textbf{Y}}\textbf{Y})^2\}=o(n^{-1}\zeta ^2)\) and \(\textrm{E}\{(\textbf{X}^\top \varvec{\Sigma }_{\textbf{X}\textbf{Y}}\textbf{Y}')^2\}=o(\zeta ^2)\).

Theorem 2

Suppose that assumptions (A1) and (A2) hold. Then as \((n,p+q)\rightarrow \infty \) we have

$$\begin{aligned} \frac{n\{T_{n,p,q}(\textbf{X},\textbf{Y})-\textrm{tr}(\varvec{\Sigma }_{\textbf{X}\textbf{Y}}\varvec{\Sigma }_{\textbf{Y}\textbf{X}})\}}{\sqrt{2\zeta ^2}} \overset{d}{\longrightarrow }\ {\mathcal {N}}(0,1), \end{aligned}$$

and \(\zeta _n^2\) is still a ratio-consistent estimator of \(\zeta ^2\).

Assumption (A2) characterizes the local alternatives in the sense that the alternative is not too far away from the null hypothesis, and it ensures that our proposed statistic remains a degenerate U-statistic. It is noteworthy that this assumption holds automatically under the null hypothesis, and it is given in light of Zhang et al. (2018). In the following we illustrate assumptions (A1) and (A2) under a linear regression model.

Proposition 2

Assume that \(\textbf{Y}=\textbf{X}^\top \varvec{\beta }+\varepsilon \) where \(\textbf{X}\) and \(\varepsilon \) are independent with \(\textbf{X}\sim {\mathcal {N}}(0_{p\times 1},\textbf{I}_p)\) and \(\varepsilon \sim {\mathcal {N}}(0,1)\). Then we have

$$\begin{aligned} \zeta ^2&=(p+8)\Vert \varvec{\beta }\Vert ^4+2(p+2)\Vert \varvec{\beta }\Vert ^2+p,\\ \textrm{E}\{(\textbf{X}^\top \varvec{\Sigma }_{\textbf{X}\textbf{Y}}\textbf{Y})^2\}&=3\Vert \varvec{\beta }\Vert ^4+\Vert \varvec{\beta }\Vert ^2,\\ \textrm{E}\{(\textbf{X}^\top \varvec{\Sigma }_{\textbf{X}\textbf{Y}}\textbf{Y}')^2\}&=\Vert \varvec{\beta }\Vert ^4+\Vert \varvec{\beta }\Vert ^2,\\ \textrm{E}\{(\textbf{X}^\top \textbf{X}'\textbf{Y}^\top \textbf{Y}')^4\}&= O\{p^2(\Vert \varvec{\beta }\Vert ^2+1)^4\},\\ \textrm{E}\{(\textbf{X}^\top \textbf{X}'\textbf{Y}^\top \textbf{Y}'')^4\}&= O\{p^2(\Vert \varvec{\beta }\Vert ^2+1)^4\},\\ \textrm{E}\{(\textbf{X}^\top \textbf{X}')^4\}\textrm{E}\{(\textbf{Y}^\top \textbf{Y}')^4\}&=27p(p+2)(\Vert \varvec{\beta }\Vert ^2+1)^4,\\ \textrm{E}\{L_1(\textbf{Z},\textbf{Z}')^2\}&=(3\Vert \varvec{\beta }\Vert ^2+1)^4+(p-1)(\Vert \varvec{\beta }\Vert ^2+1)^4. \end{aligned}$$

Proposition 2 implies that assumption (A1) holds automatically as \((n,p)\rightarrow \infty \). Furthermore, assumption (A2) is satisfied when \(\Vert \varvec{\beta }\Vert ^2/(1+\Vert \varvec{\beta }\Vert ^2)=o(p/n)\). Therefore, assumption (A2) can indeed be viewed as specifying local alternatives.

Remark 5

It can be inferred from Theorem 2 that the asymptotic power of the proposed test under the local alternatives is

$$\begin{aligned} \Phi \left( -z_{\alpha }+ \frac{n\textrm{tr}(\varvec{\Sigma }_{\textbf{X}\textbf{Y}}\varvec{\Sigma }_{\textbf{Y}\textbf{X}})}{\sqrt{2\zeta ^2}}\right) , \end{aligned}$$

where \(\Phi (\cdot )\) is the cumulative distribution function of the standard normal distribution. It is clear that this power is mainly determined by the signal-to-noise ratio \(\eta (\varvec{\Sigma }_{\textbf{X}\textbf{Y}})=\textrm{tr}(\varvec{\Sigma }_{\textbf{X}\textbf{Y}}\varvec{\Sigma }_{\textbf{Y}\textbf{X}})/\sqrt{2\zeta ^2}\). Specifically, the power tends to \(\alpha \) when \(\eta (\varvec{\Sigma }_{\textbf{X}\textbf{Y}})=o(n^{-1})\), which implies that the test fails to distinguish the local alternatives from the null hypothesis. Meanwhile, if \(\eta (\varvec{\Sigma }_{\textbf{X}\textbf{Y}})\) is of larger order than \(n^{-1}\), the power converges to 1 and the test is therefore consistent.
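As a small illustration, the asymptotic power formula above can be evaluated directly; the R sketch below is ours, with the signal \(\textrm{tr}(\varvec{\Sigma }_{\textbf{X}\textbf{Y}}\varvec{\Sigma }_{\textbf{Y}\textbf{X}})\) and the noise \(\zeta ^2\) supplied as known (or plug-in) quantities, and the numbers in the example call are purely hypothetical.

```r
# Asymptotic power from Remark 5: Phi(-z_alpha + n * signal / sqrt(2 * zeta2)).
asym_power <- function(n, signal, zeta2, alpha = 0.05) {
  pnorm(-qnorm(1 - alpha) + n * signal / sqrt(2 * zeta2))
}
asym_power(n = 200, signal = 0.5, zeta2 = 1000)  # hypothetical values
```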

3 Power enhancement technique

In Sect. 2, \(T_{n,p,q}\) is constructed based on \(\textrm{tr}(\varvec{\Sigma }_{\textbf{X}\textbf{Y}}\varvec{\Sigma }_{\textbf{Y}\textbf{X}})\), and its asymptotic distributions are obtained under regularity conditions (A1)–(A2). It is worth noting that \(T_{n,p,q}\) is a simultaneous measure of association, and its power is adversely affected by the increasing dimensions, especially under sparse alternatives. In this case we adopt the power enhancement technique introduced by Fan et al. (2015) and utilize the marginal associations to boost the empirical power. Define

$$\begin{aligned} {\mathcal {S}}&=\{(i,j): \rho ^2(X_i,Y_j)\ge 2\delta _{n,p,q}, 1\le i\le p,1\le j\le q\},\\ {\mathcal {R}}_n(X_i,Y_j)&=\frac{T_{n,1,1}(X_i,Y_j)}{\sqrt{T_{n,1,1}(X_i,X_i)T_{n,1,1}(Y_j,Y_j)}}, \quad 1\le i\le p,\ 1\le j\le q, \end{aligned}$$

where \(\delta _{n,p,q}=c\{\log (\log (n))/\log (\log (pq))\}(pq)^{1/4}(\log (n))^{3/4}/n\) and \(\rho (X_i,Y_j)\) is the Pearson correlation coefficient of \(X_i\) and \(Y_j\).

The conditions below are imposed to derive the limiting properties of \({\mathcal {R}}_n(X_i,Y_j)\).

(A3) \(pq=O(n^\kappa ),0<\kappa <4\) and \(\textrm{E}\{(\textbf{X}^\top \textbf{X}'\textbf{Y}^\top \textbf{Y}')^4\}\) exists.

(A4) \(\max \limits _{1\le i\le p,1\le j\le q} \xi _{ij} =O((pq)^{-1/2})\), where \(\xi _{ij}=\sigma ^2(X_i,Y_j)\textrm{Var}\{(X_i-\textrm{E}(X_i))(Y_j-\textrm{E}(Y_j))\}\) and \(\sigma (X_i,Y_j)\) is the covariance of random variables \(X_i\) and \(Y_j\).

We present the following result regarding the asymptotic behavior of \({\mathcal {R}}_n(X_i,Y_j)\) under both the null hypothesis and the alternatives.

Theorem 3

Suppose conditions (A3)–(A4) hold. Then we have that

(1) under \(H_0\), almost surely

$$\begin{aligned} \max \limits _{1\le i \le p,1\le j\le q}|{\mathcal {R}}_n(X_i,Y_j)| =o(\delta _{n,p,q}), \hspace{5mm} n \rightarrow \infty . \end{aligned}$$

(2) when \({\mathcal {S}} \ne \emptyset \), almost surely

$$\begin{aligned} \max \limits _{1\le i \le p,1\le j \le q} |{\mathcal {R}}_n(X_i,Y_j)| \ge \delta _{n,p,q}, \hspace{5mm} n \rightarrow \infty . \end{aligned}$$

Assumptions (A3)–(A4) are imposed based on the results of Section 5.3 in Serfling (1980). The condition \(pq=O(n^\kappa ),0<\kappa <4\), specifies an explicit relationship among p, q and n, and ensures that \(\delta _{n,p,q}\) converges to 0 as n tends to infinity. It is worth noting that assumption (A3) guarantees an almost sure representation of the difference between the U-statistic \(T_{n,1,1}(X_i,Y_j)\) and its projection. As for condition (A4), it holds automatically under the null hypothesis, and under sparse alternatives it requires \(\xi _{ij}\) to be of order \((pq)^{-1/2}\) uniformly over the pq pairs of components, which guarantees an almost sure representation for the projection of the U-statistic \(T_{n,1,1}(X_i,Y_j)\). Details can be found in the proof of Theorem 3 in the Appendix.

Remark 6

In light of Theorem 3, the screening term used to enhance the empirical power can be constructed as follows:

$$\begin{aligned} T_{n,p,q}^0 = pq\,\textbf{1}\left( \max \limits _{1\le i \le p,1\le j \le q}|{\mathcal {R}}_n(X_i,Y_j)|\ge \delta _{n,p,q}\right) \end{aligned}$$

where \(\textbf{1}{(\cdot )}\) is the indicator function. It can be inferred from this theorem that \(T_{n,p,q}^{0}\) is negligible under the null hypothesis whereas it diverges to infinity when \({\mathcal {S}} \ne \emptyset \). Therefore the power of the test is well enhanced once the term \(T_{n,p,q}^{0}\) is added, owing to its taking more information from the alternative into account. Similar to the discussion in Fan et al. (2015), a general form of the test statistic can be proposed as follows:

$$\begin{aligned} \widehat{T}_{n,p,q}=T_{n,p,q}+T_{n,p,q}^0. \end{aligned}$$

It is clear from Theorems 1 and 3 that the test statistic \(\widehat{T}_{n,p,q}\) shares the same limiting null distribution as \(T_{n,p,q}\), which implies that the proposed test rejects \(H_0\) at the significance level \(\alpha \) if \(n\widehat{T}_{n,p,q}\ge z_\alpha \sqrt{2\zeta ^2_n}\). Furthermore, it follows from Theorems 2 and 3 that the empirical power of \(\widehat{T}_{n,p,q}\) tends to 1 under some regularity conditions. This also indicates that the power enhancement technique can be utilized to boost the power as long as the test statistic admits a tractable limiting null distribution, such as a normal approximation.
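The sketch below (our own R code, not the authors') shows one way to compute the marginal statistics \({\mathcal {R}}_n(X_i,Y_j)\), the screening term \(T^0_{n,p,q}\) and the enhanced statistic \(\widehat{T}_{n,p,q}\), reusing the helpers `T_npq`, `zeta2_n` and `test_T` sketched in Sect. 2; the grouping used for \(\delta _{n,p,q}\) follows the formula displayed above.

```r
# Marginal statistic R_n(X_i, Y_j) for a single pair of components (n-vectors x, y).
R_n <- function(x, y) {
  T_npq(matrix(x), matrix(y)) /
    sqrt(T_npq(matrix(x), matrix(x)) * T_npq(matrix(y), matrix(y)))
}

# Screening term T^0_{n,p,q} = pq * 1(max |R_n| >= delta_{n,p,q}).
screening_term <- function(X, Y, c = 1) {
  n <- nrow(X); p <- ncol(X); q <- ncol(Y)
  delta <- c * (log(log(n)) / log(log(p * q))) *
    (p * q)^(1 / 4) * (log(n))^(3 / 4) / n
  Rmax <- max(abs(outer(seq_len(p), seq_len(q),
                        Vectorize(function(i, j) R_n(X[, i], Y[, j])))))
  p * q * (Rmax >= delta)
}

# Power-enhanced statistic: T_hat = T + T^0.
T_hat <- function(X, Y, c = 1) T_npq(X, Y) + screening_term(X, Y, c)
```

The brute-force loop over all pq pairs only mirrors the definitions; in practice the marginal quantities can be vectorized for speed.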

Remark 7

Given that the kernel function of the U-statistic \(T_{n,1,1}(X_i,Y_j)\) has finite fourth moments, the convergence rate \((\log (n))^{3/4}/n\) of the difference between a U-statistic and its projection is obtained in Theorem 5.3.3 of Serfling (1980). Since \(\max _{1\le i\le p,1\le j\le q}|{\mathcal {R}}_n(X_i,Y_j)|\) is the maximum over pq marginal correlations, the convergence rate should be multiplied by a function of pq; in light of the fact that the kernel function of \(T_{n,1,1}(X_i,Y_j)\) has finite fourth moments, we choose \((pq)^{1/4}\). More details can be found in the proof of Theorem 3. Furthermore, it is worth noting that the choice of the tuning parameter c affects the performance of the proposed test in both empirical size and empirical power. To be specific, given the sample size and the dimension, a larger c leads to a higher probability of controlling the empirical size while a smaller c has a greater chance of boosting the empirical power, so the choice of c becomes a trade-off between controlling the empirical size and boosting the empirical power. In practical applications, a grid of candidate values is examined to identify those that meet the need of size control; among these, we choose the smallest value as c, which increases the empirical power. More details can be found in Example 1 of the simulation studies.
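In code, this tuning step might look roughly as follows (a hedged sketch: `simulate_null_data()` is a hypothetical generator of data under \(H_0\) for the given \((n,p,q)\), the tolerance of 0.01 is illustrative, and `T_hat` and `zeta2_n` are the helpers sketched earlier).

```r
c_grid <- seq(0.1, 2, by = 0.1)
sizes <- sapply(c_grid, function(cc) {
  mean(replicate(200, {
    dat <- simulate_null_data(n, p, q)          # hypothetical H0 data generator
    stat <- n * T_hat(dat$X, dat$Y, c = cc) / sqrt(2 * zeta2_n(dat$X, dat$Y))
    stat >= qnorm(0.95)                         # reject at nominal level 0.05
  }))
})
c_star <- min(c_grid[sizes <= 0.05 + 0.01])     # smallest c that controls the size
```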

4 Numerical studies

In this section we illustrate the proposed test procedure by investigating its finite-sample performance through simulations and a real microarray gene data analysis. For the purpose of comparison, we also consider the following methods. The first one, proposed in Zheng et al. (2022), adopts the Frobenius distance of the covariance matrix. The second procedure is based on the trace of the covariance matrix, as advocated in Li et al. (2017). The third one utilizes the modified distance correlation established in Székely and Rizzo (2013). The fourth one is built on ranks of distances as in Heller et al. (2012), and the remaining two are based on \(T_{n,p,q}\) and \(\widehat{T}_{n,p,q}\) proposed in this paper. To be specific, we denote the methods used in this section as follows:

  • NEW1: the test based on \(T_{n,p,q}\);

  • NEW: the test based on \(\widehat{T}_{n,p,q}\);

  • FDS: the test of Zheng et al. (2022);

  • TCM: the test of Li et al. (2017);

  • MDC: the test of Székely and Rizzo (2013);

  • HHG: the test of Heller et al. (2012).

To implement MDC and HHG test procedures, we adopt the dcor.ttest function in the energy package and the hhg.test function in the HHG package, respectively. We set the sample size \(n=100,200\), and the dimension \(p+q=500\). The nominal significance level is fixed at \(\alpha =0.05\), and the number of independent replications is 1000. All simulation studies are conducted using R version 4.1.2.
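For reference, the competing tests might be invoked roughly as follows (a hedged sketch; exact arguments may differ across package versions, `X` and `Y` are the \(n\times p\) and \(n\times q\) data matrices, and the permutation count for HHG is illustrative):

```r
library(energy)  # provides dcor.ttest for the MDC test
library(HHG)     # provides hhg.test for the HHG test

mdc_out <- dcor.ttest(X, Y)                    # modified distance correlation t-test
mdc_out$p.value

Dx <- as.matrix(dist(X)); Dy <- as.matrix(dist(Y))
hhg_out <- hhg.test(Dx, Dy, nr.perm = 1000)    # permutation-based HHG test
```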

Fig. 1 The empirical size of the proposed test NEW under different values of c in Example 1 when \(n=100\)

Fig. 2 The empirical size of the proposed test NEW under different values of c in Example 1 when \(n=200\)

Example 1

This example is designed to compare the finite sample performance of the test procedures. In this example, we assume that

$$\begin{aligned} \textbf{Z}=(\textbf{X},\textbf{Y})^{^\top }=\varvec{\Sigma }^{1/2}\textbf{W}, \end{aligned}$$

where the components of \(\textbf{W}=(W_1,\ldots ,W_{p+q})^{^\top }\) are i.i.d., and \(W_1\) is drawn from the standard normal distribution \({\mathcal {N}}(0,1)\), the uniform distribution \(U(-\sqrt{3},\sqrt{3})\), the scaled Student's t distribution \(t(6)/\sqrt{6/4}\) and the scaled chi-square distribution \((\chi ^2(6)-6)/\sqrt{12}\) with 6 degrees of freedom, respectively. In this example, the dimension is \(p=1,5,10,20,50,100\) with \(p+q=500\).
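To make the data-generating scheme concrete, here is a hedged R sketch of one setting (scaled chi-square components and \(\varvec{\Sigma }=\textbf{I}_{p+q}\), so \(\textbf{Z}=\textbf{W}\)) together with an empirical-size estimate for the test of Theorem 1; it reuses `test_T` from Sect. 2, and the replication count is reduced for illustration.

```r
set.seed(1)
n <- 100; p <- 20; q <- 480                    # p + q = 500 as in the simulations
reject <- replicate(200, {                     # the paper uses 1000 replications
  W <- matrix((rchisq(n * (p + q), df = 6) - 6) / sqrt(12), n, p + q)
  X <- W[, 1:p]; Y <- W[, (p + 1):(p + q)]     # Sigma = I_{p+q}, so Z = W
  test_T(X, Y, alpha = 0.05)$reject
})
mean(reject)                                   # empirical size (nominal level 0.05)
```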

Recall from Remark 7 that it is important to determine c for different choices of (n, p, q). Thus we set \(c=0.1,0.2,\ldots ,2\) and investigate the empirical size of our proposed test when \(\varvec{\Sigma }=\textbf{I}_{p+q}\). Note that the choice of c should not be affected by the distribution of \(W_1\); here we choose \(W_1\sim {\mathcal {N}}(0,1)\). Figures 1 and 2 display the empirical size of the proposed test procedure NEW and show that \(c=1\) can be applied to all the settings in this example except for the case \((n,p,q)=(100,1,499)\), for which we choose \(c=1.5\).

Based on the choice of c, we investigate the empirical sizes of these tests when \(\varvec{\Sigma }=\textbf{I}_{p+q}\). Table 1 presents the results and shows that all the empirical sizes are close to the nominal significance level \(\alpha =0.05\). Meanwhile, the small difference between the empirical sizes of NEW and NEW1 indicates that the screening term \(T_{n,p,q}^0\) has little effect on the size under the null hypothesis.

Table 1 Empirical sizes (%) of the tests at the nominal significance level 5% in Example 1

To examine the powers of these test procedures, we set \(\varvec{\Sigma }=(0.5^{|i-j|})_{i,j=1}^{p+q}\). The simulation results are summarized in Table 2. From this table it can be seen that NEW and FDS outperform the remaining methods, especially when \(n=200\), which can be attributed to the adoption of the power enhancement technique. Furthermore, it is worth noting that the empirical power of each procedure when \(n=200\) is much higher than that under the setting \(n=100\), which is in line with the large-sample theory.

Table 2 Empirical power(%) of the tests in Example 1
Table 3 Empirical power(%) of the tests in Example 2

Example 2

The power of the test procedures is evaluated via the model studied in Jiang et al. (2013), in which the populations \(\textbf{X}\) and \(\textbf{Y}\) are defined as

$$\begin{aligned} \textbf{X}= \textbf{U}_1 + \gamma \textbf{U}_2^p, \textbf{Y}= \textbf{U}_2 + \gamma \textbf{U}_2 \end{aligned}$$

where \(\textbf{U}_1=(U_{11},\ldots ,U_{1p})^{^\top }\) and \(\textbf{U}_2=(U_{21},\ldots ,U_{2q})^{^\top }\) are independent, \(\textbf{U}_2^p\) is the sub-vector of \(\textbf{U}_2\) consisting of its first p variables, and the factor \(\gamma \) represents the degree of dependence. In this example \(\gamma =0.3\). We assume that \(U_{11},\ldots , U_{1p}, U_{21},\ldots , U_{2q}\) are i.i.d. and follow the same distribution as \(W_1\) in Example 1. The dimension is \(p=20,50,100\) when the sample size is \(n=100\), and \(p=5,10,20\) when \(n=200\). From this model it is easy to calculate that the covariance matrices are, respectively,

$$\begin{aligned} \varvec{\Sigma }_{\textbf{X}}=(1+\gamma ^2)\textbf{I}_{p},\quad \varvec{\Sigma }_{\textbf{Y}}=(1+\gamma )^2\textbf{I}_{q},\quad \varvec{\Sigma }_{\textbf{X}\textbf{Y}}=\gamma (1+\gamma )(\textbf{I}_{p},O_{p,q-p}), \end{aligned}$$

where \(O_{p,q-p}\) denotes a \(p\times (q-p)\) zero matrix. Table 3 displays the performance of these methods and shows that the empirical power of each procedure increases with the dimension p, which implies that these methods are effective against dense alternatives. At the same time, to achieve the same performance for each test procedure, the dimension p under the setting \(n=200\) can be much smaller than that under \(n=100\), which implies that the powers of these methods tend to 1 as long as the sample size n is sufficiently large.
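The model of Example 2 can be simulated along the following lines (a hedged sketch with uniform components and \(\gamma =0.3\), reusing `test_T` from Sect. 2; the replication count is reduced for illustration).

```r
set.seed(2)
n <- 100; p <- 50; q <- 450; gamma <- 0.3
reject <- replicate(200, {
  U1 <- matrix(runif(n * p, -sqrt(3), sqrt(3)), n, p)
  U2 <- matrix(runif(n * q, -sqrt(3), sqrt(3)), n, q)
  X <- U1 + gamma * U2[, 1:p]     # U_2^p: the first p variables of U_2
  Y <- (1 + gamma) * U2           # Y = U_2 + gamma * U_2
  test_T(X, Y)$reject
})
mean(reject)                      # empirical power estimate
```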

Example 3

To illustrate the application of our proposed procedure in high-dimensional settings, we analyze a microarray data set reported in Scheetz et al. (2006). In order to gain a broad perspective of gene regulation in the mammalian eye and to identify genetic variation relevant to human eye disease, 120 twelve-week-old male F2 offspring were chosen for eye tissue harvesting and microarray analysis. Of the probes on the array used to analyze the RNA from the eyes of these F2 animals, 18,976 were found to be sufficiently expressed and variable. More details of the experiment can be found in Scheetz et al. (2006).

Note that 1389163_at, one of the 18,976 sufficiently expressed probes, is from the gene TRIM32. This gene was found in Chiang et al. (2006) to cause an extremely heterogeneous human obesity syndrome known as Bardet-Biedl syndrome. The relationship between the probe 1389163_at and the remaining 18,975 ones is first investigated. The P-values of the three tests are listed in Table 4 for the case \(p=18{,}975\), and suggest that gene TRIM32 and the rest of the genes exhibit some type of association. Huang et al. (2008) verified this situation and studied the data set further. They used the adaptive Lasso in sparse high-dimensional linear regression models and selected 19 genes whose expression is most correlated with that of gene TRIM32. Excluding these 19 probes, we perform another test to check whether the probe 1389163_at is associated with the remaining 18,956 ones. To remove the effects of these 19 genes on TRIM32, we replace the value of TRIM32 with the residual from a multiple linear regression with TRIM32 as the response variable and these 19 genes as the predictors. Table 4 also presents the result for \(p=18{,}956\), which indicates the absence of a linear relationship between the gene TRIM32 and the remaining 18,956 probes.
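The residual adjustment described above might be carried out roughly as follows (a hypothetical sketch: `expr` is assumed to be a data frame of probe expressions containing a column `trim32`, and `genes19` holds the names of the 19 selected probes; these object names are ours, not from the original analysis).

```r
# Regress TRIM32 on the 19 selected genes and keep the residuals.
fit <- lm(trim32 ~ ., data = expr[, c("trim32", genes19)])
trim32_adj <- resid(fit)
# trim32_adj then replaces TRIM32 when testing association with the other 18,956 probes.
```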

Table 4 P-value in Example 3