Abstract
A unified testing framework is presented for large-dimensional mean vectors of one or several populations which may be non-normal with unequal covariance matrices. Beginning with the one-sample case, the construction of the tests, the underlying assumptions and the asymptotic theory are systematically extended to the multi-sample case. The tests are defined in terms of U-statistics-based consistent estimators, and their limits are derived under a few mild assumptions. The accuracy of the tests is demonstrated through simulations. Real data applications, including a five-sample unbalanced MANOVA analysis of count data, are also given.
1 Introduction
Let \(\mathbf{X}_k = (X_{k1}, \ldots , X_{kp})' \sim {\mathcal {F}}\), \(k = 1, \ldots , n\) be iid random vectors, where \({\mathcal {F}}\) denotes a p-variate distribution, with \(E(\mathbf{X}_k) = {\varvec{\mu }} \in {\mathbb {R}}^p\) and \({{\mathrm{Cov}}}(\mathbf{X}_k) = {\varvec{\Sigma }} \in {\mathbb {R}}^{p\times p}_{> 0}\). A hypothesis of foremost interest to be tested in this setup is \(H_0: {\varvec{\mu }} = \mathbf{0}\) against an appropriate alternative, say \(H_1:\) Not \(H_0\). For an extension to \(g \ge 2\) samples, let \(\mathbf{X}_{ik} = (X_{ik1}, \ldots , X_{ikp})' \sim {\mathcal {F}}_i\) be iid random vectors with \(E(\mathbf{X}_{ik}) = {\varvec{\mu }}_i \in {\mathbb {R}}^p\), \({{\mathrm{Cov}}}(\mathbf{X}_{ik}) = {\varvec{\Sigma }}_i \in {\mathbb {R}}^{p\times p}_{>0}\), \(k = 1, \ldots , n_i\), \(i = 1, \ldots , g\). The corresponding hypothesis of interest is \(H_{0g}:{\varvec{\mu }}_1 = \cdots = {\varvec{\mu }}_g\) vs. \(H_{1g}:\) Not \(H_{0g}\).
Our objective here is to present test statistics for the aforementioned one- and multi-sample hypotheses when \(p > n_i\), the \({\mathcal {F}}_i\) are not necessarily normal, and the \({\varvec{\Sigma }}_i\), likewise the \(n_i\), may be unequal in the multi-sample case. The proposed tests are thus valid for high-dimensional, non-normal, unbalanced data under the Behrens–Fisher problem. In particular, for \(g \ge 3\), this amounts to testing the high-dimensional one-way MANOVA hypothesis under non-normality and the multi-sample Behrens–Fisher problem.
When \(p < n_i\), tests of \(H_0\) or \(H_{0g}\) are most often carried out by Hotelling’s \(T^2\) or Wilks’ Lambda statistic which are uniformly most powerful invariant likelihood ratio tests. They, however, collapse for high-dimensional case, particularly due to singularity of the empirical covariance matrix involved (see Sects. 2, 3). A number of proposals have recently been put forth in the literature on the modification of these classical tests for high-dimensional data.
Whereas most modifications assume normality, some are based on a more flexible model, and still others offer a completely nonparametric solution to the problem. The same holds for the homoscedasticity assumption, \({\varvec{\Sigma }}_i = {\varvec{\Sigma }}\)\(~\forall ~i = 1, \ldots , g\), \(g \ge 2\). For details, see e.g., Dempster (1958), Bai and Saranadasa (1996), Läuter et al. (1998), Läuter (2004), Fujikoshi (2004), Schott (2007), Chen and Qin (2010), Aoshima and Yata (2011, 2015), Katayama and Kano (2014), Wang et al. (2015), Feng et al. (2016) and Hu et al. (2017). For a review, see Hu and Bai (2015) and Fujikoshi et al. (2010).
We present a coherent testing theory encompassing one- and multi-sample cases. The construction of the tests, the assumptions, and the strategy of obtaining limiting distribution of the test statistics is succinctly threaded together via a common approach, initiating with the one-sample case and extending systematically to the multi-sample cases. The main distinguishing feature of the proposed tests is that we simultaneously relax commonly adopted linear model assumptions such as normality and homoscedasticity, for all cases up to one-way MANOVA. Further, all tests are defined in terms of U-statistics with simple, bivariate, product kernels composed of bilinear forms of independent vectors. This helps us determine the limits of the test statistics for a general multivariate model. These limits are derived under \((n_i, p)\)- or high-dimensional asymptotics, i.e., \(n_i, p \rightarrow \infty \), using only a few mild assumptions.
The basic idea is introduced in detail for one-sample case in the next section, with an extension to two-sample case. Multi-sample extension follows in Sect. 3. Sections 4 and 5 deal with simulations and applications. Proofs and technical details are deferred to “Appendix”.
2 The one- and two-sample tests
2.1 The one-sample case
For the one-sample data setup in Sect. 1, let the unbiased estimators of \({\varvec{\mu }}\) and \({\varvec{\Sigma }}\) be defined as \(\overline{\mathbf{X}} = \sum _{k = 1}^n\mathbf{X}_k/n\) and \(\widehat{\varvec{\Sigma }} = \sum _{k = 1}^n(\mathbf{X}_k - \overline{\mathbf{X}})(\mathbf{X}_k - \overline{\mathbf{X}})'/(n - 1)\). If \(n > p\) and \({\mathcal {F}}\) is multivariate normal, then \(H_0: {\varvec{\mu }} = \mathbf{0}\) can be tested using Hotelling’s statistic \(T^2 = n\overline{\mathbf{X}}'\widehat{\varvec{\Sigma }}^{-1}\overline{\mathbf{X}} = \overline{\mathbf{X}}'[\widehat{\varvec{\Sigma }}/n]^{-1}\overline{\mathbf{X}}\) where \(\widehat{\varvec{\Sigma }}/n\) estimates \({\varvec{\Sigma }}/n = {{\mathrm{Cov}}}(\overline{\mathbf{X}})\). When \(p > n\), \(\widehat{\varvec{\Sigma }}\) is singular and \(T^2\) collapses, requiring a careful modification that can provide valid inference when \(p \rightarrow \infty \), possibly along with \(n \rightarrow \infty \). The two indices may sometimes be assumed to grow at the same rate so that \(p/n \rightarrow c \in (0, \infty )\). Alternatively, a sequential asymptotics, first letting \(p \rightarrow \infty \) followed by \(n \rightarrow \infty \), may be considered under which conditions like \(p/n \rightarrow c\) may be dispensed with.
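The singularity of \(\widehat{\varvec{\Sigma }}\) for \(p > n\) can be seen directly: the sample covariance matrix has rank at most \(n - 1\). A minimal numerical sketch (the Gaussian setup and all names are illustrative, not from the paper):

```python
import numpy as np

# Sketch: the sample covariance matrix has rank at most n - 1, so
# Hotelling's T^2 cannot be formed when p > n.
rng = np.random.default_rng(0)
n, p = 20, 50                      # high-dimensional: p > n
X = rng.standard_normal((n, p))    # rows are iid p-variate observations

S = np.cov(X, rowvar=False)        # unbiased estimator, divisor n - 1
rank = np.linalg.matrix_rank(S)
print(rank, rank < p)              # rank is n - 1 = 19, far below p = 50
```

Since centering removes one degree of freedom, the rank deficit persists no matter how the inverse is attempted, which motivates the modifications discussed next.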
Note that \(\widehat{\varvec{\Sigma }}\) may be ill-conditioned even for \(p < n\), in which case \((\cdot )^{-1}\) can be replaced with the Moore–Penrose inverse (see e.g., Duchesne and Francq 2015). For \(p \gg n\), this approach is unreliable and inefficient. An alternative is to remove \(\widehat{\varvec{\Sigma }}^{-1}\) from \(T^2\) and consider the Euclidean distance \(Q = \overline{\mathbf{X}}'\overline{\mathbf{X}} = \Vert \overline{\mathbf{X}}\Vert ^2\). An interesting consequence of this can be witnessed by a simple split of Q as
where \(Q_1 = (E - U_n)/n\) with \(E = \sum _{k = 1}^n\mathbf{X}'_k\mathbf{X}_k/n\) and \(U_n = \sum _{k \ne r}^n\mathbf{X}'_k\mathbf{X}_r/n(n - 1)\). Note that E is an average of quadratic forms, and \(U_n\) is an average of bilinear forms composed of independent components. It is shown below that the limiting distribution of the statistic mainly follows from \(U_n\), whereas \(Q_1\) converges in probability to a constant. With \(E(\mathbf{X}'_k\mathbf{X}_k) = {{\mathrm{tr}}}({\varvec{\Sigma }}) + {\varvec{\mu }}'{\varvec{\mu }}\), \(E(\mathbf{X}'_k\mathbf{X}_r) = {\varvec{\mu }}'{\varvec{\mu }}\), we get \(E(Q_1) = {{\mathrm{tr}}}({\varvec{\Sigma }})/n\), \(E(U_n) = {\varvec{\mu }}'{\varvec{\mu }}\). Thus,
which is \({{\mathrm{tr}}}({\varvec{\Sigma }})/n\) under \(H_0\). We observe a few salient features of this bifurcation of Q. First, \(E(Q_1) = {{\mathrm{tr}}}({\varvec{\Sigma }})/n = {{\mathrm{tr}}}({{\mathrm{Cov}}}(\overline{\mathbf{X}}))\) implies that the removal of the inverse of the estimator of \({{\mathrm{Cov}}}(\overline{\mathbf{X}})\) results in a bias term composed of the trace of the same estimator, since it can be verified that \(Q_1 = {{\mathrm{tr}}}(\widehat{\varvec{\Sigma }})/n\) or \(\mathbf{E}_1 - \mathbf{Q}_0 = \widehat{\varvec{\Sigma }}\) with \(\mathbf{E}_1 = \sum _{k = 1}^n\mathbf{X}_k\mathbf{X}'_k/n\) and \(\mathbf{Q}_0 = \sum _{k \ne r}^n\mathbf{X}_k\mathbf{X}'_r/n(n - 1)\) as matrix versions of E and \(U_n\). Note also that \(E(Q_1)\) does not depend on \({\varvec{\mu }}\), and \(E(U_n)\) does not depend on \({\varvec{\Sigma }}\), under both \(H_0\) and \(H_1\). Now, \(E(U_n) = \Vert {\varvec{\mu }}\Vert ^2\), which is 0 under \(H_0\). Together, the last two facts imply that \(U_n\) can be used to construct the modified test statistic for \(H_0\), whereas \(Q_1\) can help compensate for the removal of the estimator from the original test statistic. For this, write \(Q = {{\mathrm{tr}}}(\widehat{\varvec{\Sigma }})/n + U_n = Q_1 + U_n\) and, by a simple scaling and re-writing, consider the statistic
where \(Q_0 = U_n/p\) is \(U_n\), but with kernel normed by p, \(h(\mathbf{x}_k, \mathbf{x}_r) = \mathbf{X}'_k\mathbf{X}_r/p\). \(T_1\) is the proposed modified statistic for \(H_0: {\varvec{\mu }} = \mathbf{0}\) when \(p \gg n\) and \({\mathcal {F}}\) may be non-normal.
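A minimal numerical sketch of the split \(Q = Q_1 + U_n\), under Gaussian data with \({\varvec{\mu }} = \mathbf{0}\); since Eq. (3) is not reproduced here, the statistic is illustrated in the form \(T_1 = 1 + nU_n/{{\mathrm{tr}}}(\widehat{\varvec{\Sigma }})\) suggested by Remark 7 (an assumption of this sketch, as the p-scaling of the kernel cancels in the ratio):

```python
import numpy as np

# Sketch: split Q = ||Xbar||^2 into Q1 = tr(Sigma_hat)/n and the
# U-statistic U_n, then verify both identities stated in the text.
rng = np.random.default_rng(1)
n, p = 15, 200
X = rng.standard_normal((n, p))    # H0 holds: mu = 0

xbar = X.mean(axis=0)
Q = xbar @ xbar                           # squared Euclidean norm
E = np.mean(np.einsum('kj,kj->k', X, X))  # average of quadratic forms
G = X @ X.T                               # Gram matrix of bilinear forms
U_n = (G.sum() - np.trace(G)) / (n * (n - 1))
Q1 = (E - U_n) / n

S = np.cov(X, rowvar=False)
assert np.isclose(Q1, np.trace(S) / n)    # identity Q1 = tr(Sigma_hat)/n
assert np.isclose(Q, Q1 + U_n)            # the split Q = Q1 + U_n
T1 = 1 + n * U_n / np.trace(S)            # illustrative form, Remark 7
print(T1)                                 # close to 1 under H0
```

The two assertions confirm that the bias term removed from Q is exactly the trace of the covariance estimator, as claimed above.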
For the limit of \(T_1\), \(nQ_1/p\) is first shown to converge in probability to a constant as \(n, p \rightarrow \infty \). Then, \(nQ_0\) is shown to converge weakly to a normal limit. Under \(H_0\), the kernel of \(U_n\) degenerates so that the null limit follows through a weighted sum of independent \(\chi ^2\) variables. The limit of \(T_1\) then follows by Slutsky's lemma. As the same scheme will later be extended to \(g \ge 2\), we treat the one-sample case in detail. Let \(\lambda _s\), \(s = 1, \ldots , p\), be the eigenvalues of \({\varvec{\Sigma }}\), so that \(\nu _s = \lambda _s/p\) are the eigenvalues of \({\varvec{\Sigma }}/p\). We need the following assumptions.
Assumption 1
\(E(X^4_{ks}) = \gamma _s \le \gamma < \infty ~\forall ~s = 1, \ldots , p\), \(\gamma \in {\mathbb {R}}^+\).
Assumption 2
\(\lim _{p \rightarrow \infty }\sum _{s = 1}^p\nu _s = \nu _0 \in {\mathbb {R}}^+\).
Assumption 3
\(\lim _{n, p \rightarrow \infty } p/n = c = O(1)\).
Assumption 4
\(\lim _{p \rightarrow \infty } {\varvec{\mu }}'{\varvec{\Sigma }}{\varvec{\mu }}/p = \phi = O(1)\).
Assumption 1 helps us relax normality. By Assumption 2, \(\sum _{s = 1}^p\nu ^2_s = O(1)\), as \(p \rightarrow \infty \). Assumption 4 is only required under \(H_1\). We have the following theorem, proved in “Appendix B.1.”
Theorem 5
For \(T_1\) in Eq. (3), \((T_1 - E(T_1))/\sqrt{{{\mathrm{Var}}}(T_1)}\xrightarrow {{\mathcal {D}}} N(0, 1)\) under Assumptions 1–4, as \(n, p \rightarrow \infty \), where \(E(T_1)\) and \({{\mathrm{Var}}}(T_1)\) denote the mean and variance of \(T_1\).
From the proof of Theorem 5, \(E(T_1)\) and \({{\mathrm{Var}}}(T_1)\) approximate 1 and \(2{{\mathrm{tr}}}({\varvec{\Xi }}^2)/[{{\mathrm{tr}}}({\varvec{\Xi }})]^2\), respectively, \({\varvec{\Xi }} = n{\varvec{\Sigma }}/p\). As the limit follows from a weighted sum of \(\chi ^2_1\) variables, the moments in fact approximate those of a scaled Chi-square variable, say \(\chi ^2_f/f\) with moments 1 and \(2/f\), where \(f = f_1/f_2\), \(f_1 = [{{\mathrm{tr}}}({\varvec{\Xi }})]^2\), \(f_2 = {{\mathrm{tr}}}({\varvec{\Xi }}^2)\). Thus, to estimate \({{\mathrm{Var}}}(T_1)\), we need consistent estimators of \({{\mathrm{tr}}}({\varvec{\Sigma }}^2)\) and \([{{\mathrm{tr}}}({\varvec{\Sigma }})]^2\). Define \(Q = \sum _{k = 1}^n(\widetilde{\mathbf{X}}'_k\widetilde{\mathbf{X}}_k)^2/(n - 1)\), \(\widetilde{\mathbf{X}}_k = \mathbf{X}_k - \overline{\mathbf{X}}\), \(\eta = (n - 1)/[n(n - 2)(n - 3)]\). Then, \(E_2 = \eta \{(n - 1)(n - 2)\text {tr}(\widehat{\varvec{\Sigma }}^2) + [\text {tr}(\widehat{\varvec{\Sigma }})]^2 - nQ\}\) and \(E_3 = \eta \{2\text {tr}(\widehat{\varvec{\Sigma }}^2) + (n^2 - 3n + 1)[\text {tr}(\widehat{\varvec{\Sigma }})]^2 - nQ\}\) are unbiased and consistent estimators of \({{\mathrm{tr}}}({\varvec{\Sigma }}^2)\) and \([{{\mathrm{tr}}}({\varvec{\Sigma }})]^2\), respectively. Then \({\widehat{f}} = {\widehat{f}}_1/{\widehat{f}}_2\), with \({\widehat{f}}_1\), \({\widehat{f}}_2\) obtained by plugging \(E_3\), \(E_2\) into \(f_1\), \(f_2\) (the factor \(n^2/p^2\) cancels in the ratio), is a consistent estimator of f, hence \(\widehat{{{\mathrm{Var}}}}(T_1)\) of \({{\mathrm{Var}}}(T_1)\) such that \(\widehat{{{\mathrm{Var}}}}(T_1)/{{\mathrm{Var}}}(T_1) \rightarrow 1\); see Ahmad (2017b) and the end of Sect. 3. We have the following corollary.
Corollary 6
Theorem 5 remains valid when \({{\mathrm{Var}}}(T_1)\) is replaced with \({\widehat{{{\mathrm{Var}}}}}(T_1)\).
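The estimators \(E_2\) and \(E_3\) can be checked numerically. The sketch below, under the illustrative assumption \({\varvec{\Sigma }} = \mathbf{I}\) (so that \({{\mathrm{tr}}}({\varvec{\Sigma }}^2) = p\), \([{{\mathrm{tr}}}({\varvec{\Sigma }})]^2 = p^2\) and \(f = p\)), averages both estimators over Monte Carlo replications:

```python
import numpy as np

# Sketch of the unbiased trace estimators E2, E3 from the text,
# checked by Monte Carlo against tr(Sigma^2) = p and [tr(Sigma)]^2 = p^2
# for the illustrative choice Sigma = I with Gaussian data.
def E2_E3(X):
    n = X.shape[0]
    S = np.cov(X, rowvar=False)                        # Sigma_hat
    Xc = X - X.mean(axis=0)
    Q = np.sum(np.einsum('kj,kj->k', Xc, Xc) ** 2) / (n - 1)
    eta = (n - 1) / (n * (n - 2) * (n - 3))
    tr_S2, tr_S_sq = np.trace(S @ S), np.trace(S) ** 2
    E2 = eta * ((n - 1) * (n - 2) * tr_S2 + tr_S_sq - n * Q)
    E3 = eta * (2 * tr_S2 + (n ** 2 - 3 * n + 1) * tr_S_sq - n * Q)
    return E2, E3      # estimate tr(Sigma^2) and [tr(Sigma)]^2

rng = np.random.default_rng(6)
n, p, reps = 60, 40, 200
E2s, E3s = zip(*(E2_E3(rng.standard_normal((n, p))) for _ in range(reps)))
print(np.mean(E2s) / p, np.mean(E3s) / p ** 2)   # both near 1
f_hat = np.mean(E3s) / np.mean(E2s)              # f = p = 40 for Sigma = I
```

The averaged ratios settling near 1 reflect the unbiasedness of \(E_2\) and \(E_3\), and the ratio \(E_3/E_2\) recovers f without any p-scaling.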
Power of \(T_1\) Let \(z_{\alpha }\) be the \(100\alpha \%\hbox {th}\) quantile of N(0, 1) and \(\beta ({\varvec{\theta }})\) the power function of \(T_1\) with \({\varvec{\theta }} \in {\varvec{\Theta }}_0\) or \({\varvec{\theta }} \in {\varvec{\Theta }}_1\), where \({\varvec{\Theta }}_0 = \{\mathbf{0}\}\) and \({\varvec{\Theta }}_1 = {\varvec{\Theta }}\backslash \{\mathbf{0}\}\) are the respective parameter spaces under \(H_0\) and \(H_1\), with \({\varvec{\Theta }} = {\varvec{\Theta }}_0\cup {\varvec{\Theta }}_1\), \({\varvec{\Theta }}_0\cap {\varvec{\Theta }}_1 = \emptyset \). By Theorem 5, \(\beta ({\varvec{\theta }}) = \text {P}(z_1 \ge z_{\alpha })\) with \(\beta ({\varvec{\theta }}|H_0) = \alpha \), \(\beta ({\varvec{\theta }}|H_1) = 1 - \beta \), as \(n, p \rightarrow \infty \), where \(z_1 = (T_1 - E(T_1))/\sqrt{{{\mathrm{Var}}}(T_1)}\). Then, \(1 - \beta = P(z_1 \ge z_\alpha - n\delta )\), \(\delta = \delta _1/\delta _2\), \(\delta _1 = {\varvec{\mu }}'{\varvec{\mu }}/p\), \(\delta ^2_2 = \sum _{s = 1}^p\nu ^2_s\). By the convergence of \(nQ_1/p\), and as \(\delta _1\), \(\delta _2\) are uniformly bounded under the assumptions, \(1 - \beta \rightarrow 1\) as \(n, p \rightarrow \infty \).
Remark 7
A remark on the structure of \(T_1\) is in order. With \([nQ_1/p]/{{\mathrm{tr}}}({\varvec{\Xi }})\) converging in probability to 1, consider \(T_1 = 1 + nU_n/{{\mathrm{tr}}}({\varvec{\Sigma }})\), also ignoring p for convenience. Then, \(E(T_1) = 1 + n\Vert {\varvec{\mu }}\Vert ^2/{{\mathrm{tr}}}({\varvec{\Sigma }}) = 1 + E(\overline{\mathbf{X}})'E(\overline{\mathbf{X}})/{{\mathrm{tr}}}({{\mathrm{Cov}}}(\overline{\mathbf{X}}))\), where \(E(T_1) = 1\) under \(H_0\). In this sense, \(T_1\) is similar to an F-statistic: it is close to 1 under \(H_0\) and moves away from 1 as \({\varvec{\mu }}\) deviates from \(\mathbf{0}\). Since \({{\mathrm{tr}}}({{\mathrm{Cov}}}(\overline{\mathbf{X}})) + E(\overline{\mathbf{X}})'E(\overline{\mathbf{X}}) = E(\overline{\mathbf{X}}'\overline{\mathbf{X}})\), the partitioning used to define \(T_1\) not only helps adjust for the bias term but also makes the resulting statistic computationally much simpler, particularly under non-normality. A similar argument holds for the multi-sample tests presented in the next sections.
2.2 The two-sample case
For the multi-sample setup in Sect. 1, let \(g = 2\). We are interested to test \(H_{02}: {\varvec{\mu }}_1 = {\varvec{\mu }}_2\) versus \(H_{12}:\) Not \(H_{02}\). Let \(\overline{\mathbf{X}}_i = \sum _{k = 1}^{n_i}{} \mathbf{X}_{ik}/n_i\) and \(\widehat{\varvec{\Sigma }}_i = \sum _{k = 1}^{n_i}(\mathbf{X}_{ik} - \overline{\mathbf{X}}_i)(\mathbf{X}_{ik} - \overline{\mathbf{X}}_i)'/(n_i - 1)\) be unbiased estimators of \({\varvec{\mu }}_i\) and \({\varvec{\Sigma }}_i\). Denote \(n = n_1 + n_2\). Assuming normality, \({\varvec{\Sigma }}_i = {\varvec{\Sigma }} \forall ~i\) and \(n - 2 > p\), \(H_{02}\) is usually tested by two-sample \(T^2\), \(T^2 = [n_1n_2/n](\overline{\mathbf{X}}_1 - \overline{\mathbf{X}}_2)'\widehat{\varvec{\Sigma }}^{-1}(\overline{\mathbf{X}}_1 - \overline{\mathbf{X}}_2)\), where \(\widehat{\varvec{\Sigma }} = \sum _{i = 1}^2(n_i - 1)\widehat{\varvec{\Sigma }}_i/\sum _{i = 1}^2(n_i - 1)\) is an estimator of common \({\varvec{\Sigma }}\). For \(p > n - 2\) or more generally for \(p > n_i\), \(T^2\) is invalid by the same token as for its one-sample counterpart. We consider a likewise partition of \(Q = \Vert \overline{\mathbf{X}}_1 - \overline{\mathbf{X}}_2\Vert ^2 = \overline{\mathbf{X}}'_{1}\overline{\mathbf{X}}_{1} + \overline{\mathbf{X}}'_{2}\overline{\mathbf{X}}_{2} - 2\overline{\mathbf{X}}'_{1}\overline{\mathbf{X}}_{2}\) as
with \(Q_1 = \sum _{i = 1}^2Q_{i1} = \sum _{i = 1}^2{{\mathrm{tr}}}(\widehat{\varvec{\Sigma }}_i)/n_i = {{\mathrm{tr}}}(\widehat{\varvec{\Sigma }}_0)\), \(Q_{i1} = (E_i - U_{n_i})/n_i = {{\mathrm{tr}}}(\widehat{\varvec{\Sigma }}_i)/n_i\), \(U_0 = \sum _{i = 1}^2U_{n_i} - 2U_{n_1n_2}\), where \(E_i = \sum _{k = 1}^{n_i}\mathbf{X}'_{ik}{} \mathbf{X}_{ik}/n_i\) and
are one- and two-sample U-statistics, respectively, with symmetric kernels as bilinear forms of independent vectors. As in the one-sample case, \(E(Q_{i1}) = {{\mathrm{tr}}}({\varvec{\Sigma }}_i)/n_i \Rightarrow E(Q_1) = {{\mathrm{tr}}}({\varvec{\Sigma }}_0)\), \({\varvec{\Sigma }}_0 = \sum _{i = 1}^2{\varvec{\Sigma }}_i/n_i\) and \(E(U_0) = \Vert {\varvec{\mu }}_1 - {\varvec{\mu }}_2\Vert ^2\) which vanishes under \(H_{02}\). Thus,
Again, \(E(Q_1)\) is independent of \({\varvec{\mu }}_i\), and \(E(U_0)\) is independent of \({\varvec{\Sigma }}_i\), under \(H_{02}\) and \(H_{12}\). Further, \(E(Q_1) = {{\mathrm{tr}}}({\varvec{\Sigma }}_0)\), \({\varvec{\Sigma }}_0 = {{\mathrm{Cov}}}(\overline{\mathbf{X}}_1 - \overline{\mathbf{X}}_2)\). We thus extend \(T_1\) in Eq. (3) for \(H_{02}\) as
where \(Q_0 = U_0/p\) is \(U_0\) with the kernels of \(U_{n_i}\) and \(U_{n_1n_2}\) scaled by p, i.e., \(h(\mathbf{x}_k, \mathbf{x}_r) = \mathbf{X}'_{ik}{} \mathbf{X}_{ir}/p\) and \(h(\mathbf{x}_k, \mathbf{x}_l) = \mathbf{X}'_{1k}{} \mathbf{X}_{2l}/p\), respectively. The following assumptions extend those of the one-sample case, where \(\nu _{is} = \lambda _{is}/p\) are the eigenvalues of \({\varvec{\Xi }}_i = {\varvec{\Sigma }}_i/p\), \(i = 1, 2\).
Assumption 8
\(E(X^4_{iks}) = \gamma _{is} \le \gamma < \infty \forall ~s = 1, \ldots , p\), \(i = 1, \ldots , g\), \(\gamma \in {\mathbb {R}}^+\).
Assumption 9
\(\lim _{p \rightarrow \infty }\sum _{s = 1}^p\nu _{is} = \sum _{s = 1}^\infty \nu _{is} = \nu _{i0} \in {\mathbb {R}}^+\), \(i = 1, \ldots , g\).
Assumption 10
\(\lim _{n_i, p \rightarrow \infty } p/n_i = c_i = O(1)\), \(i = 1, \ldots , g\).
Assumption 11
\(\lim _{n_i \rightarrow \infty } n/n_i = \rho _i = O(1)\), \(i = 1, \ldots , g\).
Assumption 12
\(\lim _{p \rightarrow \infty } {\varvec{\mu }}'_i{\varvec{\Sigma }}_k{\varvec{\mu }}_j/p = \phi _{ijk} \le \phi = O(1)\), \(i, j, k = 1, \ldots , g\).
As the same assumptions will be used in Sect. 3, they are stated for \(g \ge 2\). Assumption 11 is additional to those for the one-sample case. It is needed to keep the limit non-degenerate when \(n_i \rightarrow \infty \), \(n = \sum _{i = 1}^gn_i\). Assumption 12 is again needed only under \(H_{12}\). The following theorem, proved in “Appendix B.2”, extends Theorem 5 to the two-sample case.
Theorem 13
For \(T_2\) in Eq. (7), \((T_2 - E(T_2))/\sqrt{{{\mathrm{Var}}}(T_2)} \xrightarrow {{\mathcal {D}}} N(0, 1)\) under Assumptions 8–12, as \(n_i, p \rightarrow \infty \), where \(E(T_2)\) and \({{\mathrm{Var}}}(T_2)\) denote the mean and variance of \(T_2\).
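To make the two-sample construction concrete, the following sketch verifies the partition \(Q = Q_1 + U_0\) on simulated data; since Eq. (7) is not reproduced here, the statistic is illustrated in the assumed form \(T_2 = 1 + U_0/{{\mathrm{tr}}}(\widehat{\varvec{\Sigma }}_0)\), which has mean 1 under \(H_{02}\) (the p-scaling of the kernels cancels in this ratio):

```python
import numpy as np

# Sketch of the two-sample split Q = Q1 + U0 under H_{02} with
# Gaussian data; sizes and seed are illustrative.
rng = np.random.default_rng(5)
n1, n2, p = 12, 24, 150
X1, X2 = rng.standard_normal((n1, p)), rng.standard_normal((n2, p))

def U_one(X):
    # one-sample U-statistic with kernel X_k'X_r, k != r
    n, G = X.shape[0], X @ X.T
    return (G.sum() - np.trace(G)) / (n * (n - 1))

xb1, xb2 = X1.mean(axis=0), X2.mean(axis=0)
Q = (xb1 - xb2) @ (xb1 - xb2)
U0 = U_one(X1) + U_one(X2) - 2 * (xb1 @ xb2)   # xb1'xb2 = U_{n1 n2}
Q1 = sum(np.trace(np.cov(X, rowvar=False)) / X.shape[0] for X in (X1, X2))
assert np.isclose(Q, Q1 + U0)   # the partition displayed above
T2 = 1 + U0 / Q1                # illustrative rescaling, mean 1 under H_{02}
print(T2)
```

The assertion confirms that \(Q_1 = {{\mathrm{tr}}}(\widehat{\varvec{\Sigma }}_0)\) is exactly the bias term removed from the squared norm.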
It is interesting to see how the limit for degenerate case sums up. With \(\nu _0\) as the limit of \(nQ_1/p\), it follows from (21) and (22) that (see e.g., Anderson et al. 1994)
with 1 and \(2\sum _{s = 1}^\infty (\rho _1\nu _{1s} + \rho _2\nu _{2s})^2/\nu ^2_0\) as limiting mean and variance, where the variance approximates \(2{{\mathrm{tr}}}({\varvec{\Xi }}^2)/[{{\mathrm{tr}}}({\varvec{\Xi }})]^2\), \({\varvec{\Xi }} = n{\varvec{\Sigma }}_0/p\), \({\varvec{\Sigma }}_0 = \sum _{i = 1}^2{\varvec{\Sigma }}_i/n_i\). By the same scaled Chi-square approximation argument as for the one-sample case, the moments correspond to those of \(\chi ^2_f/f\), i.e., 1 and \(2/f\), \(f = f_1/f_2\), \(f_1 = [{{\mathrm{tr}}}({\varvec{\Xi }})]^2\), \(f_2 = {{\mathrm{tr}}}({\varvec{\Xi }}^2)\). Let \(E_{2i} = \eta _i\{(n_i - 1)(n_i - 2)\text {tr}(\widehat{\varvec{\Sigma }}^2_i) + [\text {tr}(\widehat{\varvec{\Sigma }}_i)]^2 - n_iQ_i\}\), \(E_{3i} = \eta _i\{2\text {tr}(\widehat{\varvec{\Sigma }}^2_i) + (n^2_i - 3n_i + 1)[\text {tr}(\widehat{\varvec{\Sigma }}_i)]^2 - n_iQ_i\}\), where \(Q_i = \sum _{k = 1}^{n_i}(\widetilde{\mathbf{X}}'_{ik}\widetilde{\mathbf{X}}_{ik})^2/(n_i - 1)\), \(\widetilde{\mathbf{X}}_{ik} = \mathbf{X}_{ik} - \overline{\mathbf{X}}_i\), \(\eta _i = (n_i - 1)/[n_i(n_i - 2)(n_i - 3)]\). Further, by independence, \({{\mathrm{tr}}}(\widehat{\varvec{\Sigma }}_1\widehat{\varvec{\Sigma }}_2)\) is an unbiased and consistent estimator of \({{\mathrm{tr}}}({\varvec{\Sigma }}_1{\varvec{\Sigma }}_2)\). Plugging these estimators into \(f_1\), \(f_2\) leads to a consistent estimator of f, hence of \({{\mathrm{Var}}}(T_2)\), i.e., \({\widehat{{{\mathrm{Var}}}}}(T_2)\). We have the following corollary.
Corollary 14
Theorem 13 remains valid when \({{\mathrm{Var}}}(T_2)\) is replaced with \({\widehat{{{\mathrm{Var}}}}}(T_2)\).
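A sketch of the plug-in estimator of f for the two-sample case, combining \(E_{2i}\), \(E_{3i}\) and \({{\mathrm{tr}}}(\widehat{\varvec{\Sigma }}_1\widehat{\varvec{\Sigma }}_2)\) as described above; the setup (\({\varvec{\Sigma }}_i = \mathbf{I}\), Gaussian data, so the true value is \(f = [{{\mathrm{tr}}}({\varvec{\Sigma }}_0)]^2/{{\mathrm{tr}}}({\varvec{\Sigma }}^2_0) = 100\)) is illustrative, and the p-scaling cancels in the ratio \(f_1/f_2\):

```python
import numpy as np

# Sketch: plug-in estimator of f = [tr(Sigma0)]^2 / tr(Sigma0^2),
# Sigma0 = Sigma1/n1 + Sigma2/n2, using the estimators in the text.
def trace_estimators(X):
    n = X.shape[0]
    S = np.cov(X, rowvar=False)
    Xc = X - X.mean(axis=0)
    Qi = np.sum(np.einsum('kj,kj->k', Xc, Xc) ** 2) / (n - 1)
    eta = (n - 1) / (n * (n - 2) * (n - 3))
    tr_S2, tr_S_sq = np.trace(S @ S), np.trace(S) ** 2
    E2 = eta * ((n - 1) * (n - 2) * tr_S2 + tr_S_sq - n * Qi)  # tr(Sigma^2)
    E3 = eta * (2 * tr_S2 + (n**2 - 3*n + 1) * tr_S_sq - n * Qi)  # [tr(Sigma)]^2
    return E2, E3, S

rng = np.random.default_rng(3)
n1, n2, p = 30, 60, 100
X1, X2 = rng.standard_normal((n1, p)), rng.standard_normal((n2, p))
E21, E31, S1 = trace_estimators(X1)
E22, E32, S2 = trace_estimators(X2)
cross = np.trace(S1 @ S2)                # unbiased for tr(Sigma1 Sigma2)
f2_hat = E21/n1**2 + E22/n2**2 + 2*cross/(n1*n2)            # tr(Sigma0^2)
f1_hat = (E31/n1**2 + E32/n2**2
          + 2*np.trace(S1)*np.trace(S2)/(n1*n2))            # [tr(Sigma0)]^2
f_hat = f1_hat / f2_hat
print(f_hat)    # near f = 100 in this setup
```

The estimated variance of \(T_2\) then follows as \(2/{\widehat{f}}\), mirroring the scaled Chi-square approximation.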
Remark 15
Due to its special practical value, the two-sample test has been investigated the most, also for high-dimensional case. We briefly discuss three tests, most closely related to \(T_2\). Denote \(\kappa = n_1n_2/n\), \(\omega _1 = (n - 1)/(n - 2)\), \(\omega _2 = (n - 2)^2/n(n - 1)\), \(n = n_1 + n_2\). Let \(\xi = \Vert \overline{\mathbf{X}}_1 - \overline{\mathbf{X}}_2\Vert ^2 - {{\mathrm{tr}}}(\widehat{\varvec{\Sigma }})/\kappa \), where \(\widehat{\varvec{\Sigma }}\) is the pooled estimator of common \({\varvec{\Sigma }}\) as given in the context of \(T^2\) above.
Dempster (1958) proposed the first two-sample test for high-dimensional data under normality, motivated by a problem put forth by his colleagues (see Sect. 5). The test, in simpler form, is given as \(T_D = \Vert \overline{\mathbf{X}}_1 - \overline{\mathbf{X}}_2\Vert ^2/\kappa {{\mathrm{tr}}}(\widehat{\varvec{\Sigma }})\). An alternative form of \(T_D\) follows by partitioning the norm in the numerator into several independent quadratic forms using an orthonormal transformation, so that the test follows an approximate F distribution with degrees of freedom estimated using a scaled Chi-square distribution. See also Dempster (1960, 1968) for details, where Bai and Saranadasa (1996) give a detailed evaluation of the approximation and power of Dempster’s test.
Bai and Saranadasa (1996)’s test, \(T_\text {BS} = \kappa \xi /\sqrt{2\omega _1}B\), is a standardization of \(\xi \) under homoscedasticity, where \(B^2 = \omega _2\{{{\mathrm{tr}}}(\widehat{\varvec{\Sigma }}^2) - [{{\mathrm{tr}}}(\widehat{\varvec{\Sigma }})]^2/n\}\). Chen and Qin (2010)’s test, \(T_\text {CQ}\), is a standardization of \(U_0 = \sum _{i = 1}^2U_{n_i} - 2U_{n_1n_2}\); see (4). \(T_\text {CQ}\) is based on the same model as \(T_\text {BS}\) but relaxes normality and homoscedasticity. From the partition of Q in (4), it follows that, under homoscedasticity, \(T_\text {D}\) divides the norm by the bias term, whereas \(T_\text {BS}\) and \(T_\text {CQ}\) subtract the same bias term from the norm, so that the numerator in both tests is \(U_0\) with \(E(U_0) = \Vert {\varvec{\mu }}_1 - {\varvec{\mu }}_2\Vert ^2 = 0\) under \(H_{02}\); for \({\varvec{\Sigma }}_i = {\varvec{\Sigma }}\), \(i = 1, 2\), the two tests coincide.
The proposed test, \(T_g\), \(g \ge 1\), differs from both in that it uses the removed bias term to rescale the statistic, and it requires neither the normality nor the homoscedasticity assumption. Note that \(T_\text {CQ}\) is also defined without the two assumptions, but the bias adjustment, the assumptions and the computation of the variance of the statistic differ considerably between the two tests.
To get a more precise idea on the comparison of these tests, we did a simulation study to assess their test sizes and power. Two independent random samples of iid vectors of sizes (\(n_1\), \(n_2\)), \(n_1 \in \{10, 20, 50\}\), \(n_2 = 2n_1\), each of dimension \(p \in \{50, 100, 300, 500\}\), are generated from normal, \(t_7\) and Unif[0, 1] distributions with covariance matrices, \({\varvec{\Sigma }}_i\), i = 1, 2, compound symmetry, CS, and autoregressive of order 1, AR(1). The CS and AR(1) are defined, respectively, as \(\kappa \mathbf{I} + \rho \mathbf{J}\) and \(\text {Cov}(X_k, X_l) = \kappa \rho ^{|k - l|}\), \(\forall ~ k, l\), with \(\mathbf{I}\) as identity matrix and \(\mathbf{J}\) a matrix of 1s. For size, we pair \({\varvec{\Sigma }}_i\) for the two populations: both \({\varvec{\Sigma }}_1\) and \({\varvec{\Sigma }}_2\) CS with \(\rho = 0.5\) and \(\rho = 0.8\), respectively; \({\varvec{\Sigma }}_1\) as CS, \({\varvec{\Sigma }}_2\) as AR(1), both with \(\rho = 0.5\). For power, we use CS with \(\rho \) = 0.4 and 0.8. We take \(\kappa = 1\) for all cases. For brevity, power results are only reported for p = 100, for normal and t distributions.
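The covariance structures of the simulation design can be sketched as follows; function names and the Cholesky-based sampler are illustrative, not from the paper:

```python
import numpy as np

# Sketch of the simulation covariances: CS = kappa*I + rho*J and
# AR(1) with Cov(X_k, X_l) = kappa * rho^{|k-l|}, kappa = 1 as in the text.
def cs_cov(p, rho, kappa=1.0):
    return kappa * np.eye(p) + rho * np.ones((p, p))

def ar1_cov(p, rho, kappa=1.0):
    idx = np.arange(p)
    return kappa * rho ** np.abs(idx[:, None] - idx[None, :])

def sample(n, Sigma, rng):
    # rows ~ N(0, Sigma) via a Cholesky factor (normal case only)
    L = np.linalg.cholesky(Sigma)
    return rng.standard_normal((n, Sigma.shape[0])) @ L.T

rng = np.random.default_rng(2)
S1, S2 = cs_cov(50, 0.5), ar1_cov(50, 0.5)      # heteroscedastic pair
X1, X2 = sample(10, S1, rng), sample(20, S2, rng)  # n2 = 2*n1, unbalanced
print(X1.shape, X2.shape)
```

The \(t_7\) and Unif[0, 1] cases of the design would replace the Gaussian generator while keeping the same covariance targets.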
Table 1 reports estimated test sizes of \(T_2\), \(T_\text {BS}\) and \(T_\text {CQ}\) for all distributions with both pairs of \({\varvec{\Sigma }}_i\). We observe an accurate performance of \(T_2\) for all parameters, whereas \(T_\text {BS}\) and \(T_\text {CQ}\) prove to be very liberal and very conservative, respectively, with their performance not improving with increasing p or (particularly) increasing n. The inaccuracy of \(T_\text {BS}\) may be attributed to the homoscedasticity assumption on which the test is based and which is violated in the simulations. The performance of \(T_\text {CQ}\), on the other hand, can be ascribed to its assumptions, particularly the vanishing of trace ratios such as \({{\mathrm{tr}}}({\varvec{\Sigma }}^4)/[{{\mathrm{tr}}}({\varvec{\Sigma }}^2)]^2\), \({{\mathrm{tr}}}({\varvec{\Sigma }}^2)/[{{\mathrm{tr}}}({\varvec{\Sigma }})]^2\) and \({{\mathrm{tr}}}({\varvec{\Sigma }}^3)/{{\mathrm{tr}}}({\varvec{\Sigma }}){{\mathrm{tr}}}({\varvec{\Sigma }}^2)\), which are not satisfied for certain covariance structures, e.g., compound symmetry. A discussion of \(T_g\) is deferred to Sect. 4, where it is evaluated in more detail.
From Fig. 1, we also observe that the power of \(T_2\) is higher than that of its competitors, with the curves coming closer as the non-centrality parameter and the sample sizes increase; this pattern is very similar for both distributions. Generally, a similar comparative performance and effect of sample sizes are observed for other values of p; hence, not all results are reported here.
3 Multi-sample test: one-way MANOVA
Here, we extend \(T_2\) to the general case, \(g \ge 2\). As usual, \(\overline{\mathbf{X}}_i\) and \(\widehat{\varvec{\Sigma }}_i\) are unbiased estimators of \({\varvec{\mu }}_i\) and \({\varvec{\Sigma }}_i\), \(i = 1, \ldots , g\). Recall \(T_2\) in (7) as a modification of \(T^2\) using the Euclidean distance \(\Vert \overline{\mathbf{X}}_1 - \overline{\mathbf{X}}_2\Vert ^2\). For \(H_{0g}\), we sum over all pairwise norms, \(\sum _{i< j}\Vert \overline{\mathbf{X}}_i - \overline{\mathbf{X}}_j\Vert ^2 = (g - 1)\sum _{i = 1}^g(E_i - U_{n_i})/n_i + \sum _{i< j}(U_{n_i} + U_{n_j} - 2U_{n_in_j}) = (g - 1)\sum _{i = 1}^g{{\mathrm{tr}}}(\widehat{\varvec{\Sigma }}_i)/n_i + (g - 1)\sum _{i = 1}^gU_{n_i} - 2\sum _{i < j}^gU_{n_in_j}\), and define the MANOVA statistic as
\(Q_1 = \sum _{i = 1}^gQ_{i1}\), \(Q_{i1} = (E_i - U_{n_i})/n_i = {{\mathrm{tr}}}(\widehat{\varvec{\Sigma }}_i)/n_i\), \(E_i = \sum _{k = 1}^{n_i}\mathbf{X}'_{ik}{} \mathbf{X}_{ik}/n_i\), \(Q_0 = \sum _{i < j}Q_{0ij}\), \(Q_{0ij} = U_{n_i} + U_{n_j} - 2U_{n_in_j}\), where \(U_{n_i}\), \(U_{n_in_j}\) are as defined in (5) with kernels \(h(\mathbf{x}_{ik}, \mathbf{x}_{ir}) = \mathbf{X}'_{ik}{} \mathbf{X}_{ir}\), \(k \ne r\), \(h(\mathbf{x}_{ik}, \mathbf{x}_{jl}) = \mathbf{X}'_{ik}{} \mathbf{X}_{jl}\), \(i \ne j\), \(k, r, l = 1, \ldots , n_i\), \(i, j = 1, \ldots , g\), \(n = \sum _{i = 1}^gn_i\). Further, \(Q_1 = {{\mathrm{tr}}}(\widehat{\varvec{\Sigma }}_0)\), \(\widehat{\varvec{\Sigma }}_0 = \sum _{i = 1}^g\widehat{\varvec{\Sigma }}_i/n_i\), is an unbiased estimator of \({{\mathrm{tr}}}({\varvec{\Sigma }}_0)\), \({\varvec{\Sigma }}_0 = \sum _{i = 1}^g{\varvec{\Sigma }}_i/n_i\). We begin with the moments of \(Q_0\). In particular,
where the covariances vanish when \(i \ne i', j \ne j'\). Denoting \({\varvec{\Sigma }}_{0ij} = {\varvec{\Sigma }}_i/n_i + {\varvec{\Sigma }}_j/n_j\), \(i < j\), and using the moments of U-statistics from Sect. A.2, we obtain
Theorem 16 summarizes the moments which reduce to those of two-sample case for \(g = 2\).
Theorem 16
For \(Q_0\) defined above, we have
Now, consider the limit of \(T_g\) under Assumptions 8–12. By the independence of g samples, the convergence of \(Q_1\) follows exactly as for \(g = 2\) so that, as \(n_i, p \rightarrow \infty \),
where \(\nu _0 = \sum _{i = 1}^g\rho _i\nu _{i0} = \sum _{i = 1}^g\sum _{s = 1}^\infty \rho _i\nu _{is}\). For the limit of \(Q_0\), we note from the formulation \((g - 1)\sum _{i = 1}^gU_{n_i} - 2\sum _{i < j}^gU_{n_in_j}\), and by the independence of \(U_{n_i}\), \(U_{n_j}\), \(i \ne j\), that we need the distribution of \(\mathbf{U}_N = (U_{n_i}, U_{n_in_{i + 1}}, \ldots , U_{n_in_g})'\), \(i = 1, \ldots , g - 1\). Alternatively, we can consider \(\mathbf{Q}_0 = (Q_{012}, \ldots , Q_{0g-1,g})'\). For \(g = 2\), \(\mathbf{U}_N = (U_{n_1}, ~U_{n_1n_2})'\), \(\mathbf{Q}_0 = Q_{012}\). We can use either of the two options and proceed as for \(g = 2\). \(\mathbf{Q}_0\) is a \(G\times 1\) vector, \(G = g(g - 1)/2\), with \({{\mathrm{Cov}}}(\mathbf{Q}_0)\) a \(G\times G\) partitioned matrix \({\varvec{\Lambda }} = ({\varvec{\Lambda }}_{ij}/p^2)_{i, j = 1}^G\) where
Thus, \({\varvec{\Lambda }}_{ii}/p^2 = {{\mathrm{Cov}}}(\mathbf{Q}_{0i})\): \((g - i)\times (g - i)\), and \({\varvec{\Lambda }}_{ij}/p^2 = {{\mathrm{Cov}}}(\mathbf{Q}_{0i}, \mathbf{Q}_{0j})\): \((g - i)\times (g - j)\), \({\varvec{\Lambda }}_{ji} = {\varvec{\Lambda }}'_{ij}\), \(i = 1, \ldots , g-1\), \(j = i+1, \ldots , g\). Denote \(a_i = {{\mathrm{tr}}}({\varvec{\Sigma }}^2_i/n^2_i)\), \(a_{0ij} = {{\mathrm{tr}}}({\varvec{\Sigma }}^2_{0ij})\). Then \({\varvec{\Lambda }}_{ii} = 2(\oplus _{j = i + 1}^ga_{0ij} + (\mathbf{J} - \mathbf{I})_{g-i}a_i)/p^2\), \({\varvec{\Lambda }}_{ij} = 2(\mathbf{0}' ~~ \mathbf{1}'_{g-i}a_i ~~ \oplus _{j = i+2}^ga_j)'/p^2\), where \(\mathbf{1}\) is vector of 1s, \(\mathbf{J} = \mathbf{11}'\), \(\mathbf{I}\) is identity matrix, \(\oplus \) is Kronecker sum and \(\mathbf{0}\) in \({\varvec{\Lambda }}_{ij}\) is \((j-i-1)\times (g-j)\) with no \(\mathbf{0}\) if \(j - i - 1 = 0\). For any i, \({\varvec{\Lambda }}_{ii}\) has same off-diagonal element \(a_i\) with diagonal elements \(a_{0ij} = {{\mathrm{tr}}}({\varvec{\Sigma }}^2_{0ij})\), \({\varvec{\Sigma }}_{0ij} = {\varvec{\Sigma }}_i/n_i + {\varvec{\Sigma }}_j/n_j = {{\mathrm{Cov}}}(\overline{\mathbf{X}}_i - \overline{\mathbf{X}}_j)\), \(j = i +1\). Further, most off-diagonals in \({\varvec{\Lambda }}_{ij}\) are 0, and the number of (rows with) zeros increases with j for every i, making \({\varvec{\Lambda }}\) an increasingly sparse matrix.
The weak convergence holds for \(Q_{0ij}\) for any (i, j) in \(\mathbf{Q}_0\), and we only need to take care of the nonzero off-diagonal elements in \({\varvec{\Lambda }}\), i.e., \(a_i/p^2\), which are uniformly bounded under the assumptions; the same holds for Eqs. (9)–(11). The limit of \(n\mathbf{Q}_0\), hence of \(nQ_0\), then follows as that of \(\mathbf{U}_N\) for \(g = 2\). Finally, Slutsky's lemma gives the limit of \(T_g\). For the limit under \(H_{0g}\), \(E(\mathbf{Q}_0) = \mathbf{0}\), all covariances of U-statistics vanish (Sect. A.2) and Eqs. (9)–(11) reduce to \({{\mathrm{Var}}}(Q_{0ij}) = 2{{\mathrm{tr}}}({\varvec{\Sigma }}^2_{0ij})\), \({{\mathrm{Cov}}}(Q_{0ij}, Q_{0ij'}) = 2{{\mathrm{tr}}}({\varvec{\Sigma }}^2_i)/n^2_i\), \({{\mathrm{Cov}}}(Q_{0ij}, Q_{0i'j}) = 2{{\mathrm{tr}}}({\varvec{\Sigma }}^2_j)/n^2_j\), which are independent of \({\varvec{\mu }}_i\), so that we may assume \({\varvec{\mu }}_i = \mathbf{0}~\forall ~i\) without loss of generality. In particular, from Theorem 16, \(E(Q_0) = 0\) under \(H_{0g}\) and
which is \(2{{\mathrm{tr}}}({\varvec{\Sigma }}^2_{012})\) for \(g = 2\); see Eq. (19). The null limit then also follows on the same lines as for \(g = 2\). The following theorem generalizes Theorem 13 for \(g \ge 2\) samples.
Theorem 17
For \(T_g\) in Eq. (8), \((T_g - E(T_g))/\sqrt{{{\mathrm{Var}}}(T_g)}\xrightarrow {{\mathcal {D}}} N(0, 1)\) under Assumptions 8–12, as \(n_i, p \rightarrow \infty \), where \(E(T_g)\) and \({{\mathrm{Var}}}(T_g)\) denote the mean and variance of \(T_g\).
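The building blocks of \(T_g\) can be illustrated for \(g = 3\), verifying the pairwise-norm decomposition stated at the start of this section; since Eq. (8) is not reproduced here, the final rescaling below uses \(\sum _{i < j}\Vert \overline{\mathbf{X}}_i - \overline{\mathbf{X}}_j\Vert ^2/{{\mathrm{tr}}}(\widehat{\varvec{\Sigma }}_0)\), an assumed form consistent with \(E(T_g) = g - 1\) under \(H_{0g}\):

```python
import numpy as np

# Sketch of the MANOVA building blocks for g = 3 Gaussian samples
# under H_{0g}; sizes and seed are illustrative.
rng = np.random.default_rng(4)
p, ns = 120, [10, 15, 20]
Xs = [rng.standard_normal((n, p)) for n in ns]
g = len(Xs)

def U_one(X):                      # one-sample U-statistic, kernel X_k'X_r
    n, G = X.shape[0], X @ X.T
    return (G.sum() - np.trace(G)) / (n * (n - 1))

def U_two(X, Y):                   # two-sample U-statistic, kernel X_k'Y_l
    return X.mean(axis=0) @ Y.mean(axis=0)

xbars = [X.mean(axis=0) for X in Xs]
Q_pair = sum((xbars[i] - xbars[j]) @ (xbars[i] - xbars[j])
             for i in range(g) for j in range(i + 1, g))

tr_Sig0 = sum(np.trace(np.cov(X, rowvar=False)) / X.shape[0] for X in Xs)
rhs = ((g - 1) * tr_Sig0 + (g - 1) * sum(U_one(X) for X in Xs)
       - 2 * sum(U_two(Xs[i], Xs[j])
                 for i in range(g) for j in range(i + 1, g)))
assert np.isclose(Q_pair, rhs)     # decomposition of the pairwise norms
T_g = Q_pair / tr_Sig0             # assumed rescaling, mean g - 1 under H_{0g}
print(T_g)
```

For \(g = 2\) the same code reduces to the two-sample sketch, with \(Q_{\text {pair}} = Q_1 + U_0\).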
For the moments of \(T_g\), note that the general distribution follows from the projection \({\widehat{Q}}_0 = \sum _{i < j}^g{\widehat{Q}}_{0ij} = \sum _{i < j}^g({\widehat{U}}_{n_i} + {\widehat{U}}_{n_j} - 2{\widehat{U}}_{n_in_j})\) of \({\widetilde{Q}}_0 = \sum _{i < j}^g{\widetilde{Q}}_{0ij}\), where \({\widetilde{Q}}_{0ij} = Q_{0ij} - E(Q_{0ij})\), so that
Likewise, under \(H_{0g}\), the convergence of degenerate \(U_{n_i}\) and \(U_{n_in_j}\) gives
such that the limiting moments \(E(nQ_0) = 0\) and \({{\mathrm{Var}}}(nQ_0) = 2\sum _{i < j}^g\sum _{s = 1}^\infty (\rho _i\nu _{is} + \rho _j\nu _{js})^2\) approximate the exact moments of \(Q_0\) under \(H_{0g}\). Combined with the limit of \(nQ_1/p\), it gives
with \(E(T_g) = g - 1\) and variance \({{\mathrm{Var}}}(T_g) = 2\sum _{i < j}^g\sum _{s = 1}^\infty (\rho _i\nu _{is} + \rho _j\nu _{js})^2/\nu ^2_0\), which approximates \(2{{\mathrm{tr}}}({\varvec{\Xi }}^2)/[{{\mathrm{tr}}}({\varvec{\Xi }})]^2\), \({\varvec{\Xi }} = n{\varvec{\Sigma }}_0/p\), \({\varvec{\Sigma }}_0 = \sum _{i = 1}^g{\varvec{\Sigma }}_i/n_i\). Further, \(z_{ijs} = \sqrt{\rho _i\nu _{is}}z_{is} - \sqrt{\rho _j\nu _{js}}z_{js}\) is a linear combination of independent N(0, 1) variables, hence itself normal with mean 0 and variance \(\rho _i\nu _{is} + \rho _j\nu _{js}\). To estimate \({{\mathrm{Var}}}(T_g)\), we note that the set of distinct non-zero elements in \({\varvec{\Lambda }}\) is
with cardinality \(s_0 = \#\{{\mathcal {S}}\} = g(g + 1)/2\), i.e., for any g, we only need to estimate \(s_0\) elements out of \(G(G + 1)/2\) in order to estimate \({\varvec{\Lambda }}\). With the estimators of \({{\mathrm{tr}}}({\varvec{\Sigma }}^2_i)\), \([{{\mathrm{tr}}}({\varvec{\Sigma }}_i)]^2\) and \({{\mathrm{tr}}}({\varvec{\Sigma }}_i{\varvec{\Sigma }}_j)\) the same as given in the two-sample case, a consistent plug-in estimator of \({{\mathrm{Var}}}(T_g)\) follows, leading to the following generalization of Corollary 14.
Corollary 18
Theorem 17 remains valid when \({{\mathrm{Var}}}(T_g)\) is replaced with \({\widehat{{{\mathrm{Var}}}}}(T_g)\).
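The limiting approximation \(2{{\mathrm{tr}}}({\varvec{\Xi }}^2)/[{{\mathrm{tr}}}({\varvec{\Xi }})]^2\) for \({{\mathrm{Var}}}(T_g)\) can be evaluated numerically; the sketch below (not the paper's plug-in estimator, which uses the U-statistic trace estimators discussed in Remark 19) computes it directly from hypothetical population matrices \({\varvec{\Sigma }}_i\) and sample sizes \(n_i\).

```python
import numpy as np

def var_Tg_approx(sigmas, ns, p):
    """Approximate Var(T_g) by 2 tr(Xi^2)/[tr(Xi)]^2, where Xi = n*Sigma0/p
    and Sigma0 = sum_i Sigma_i/n_i, as in the text following Theorem 17."""
    n = sum(ns)
    sigma0 = sum(S / ni for S, ni in zip(sigmas, ns))
    xi = n * sigma0 / p
    return 2.0 * np.trace(xi @ xi) / np.trace(xi) ** 2

# Toy check: for Sigma_i = I_p the ratio tr(Xi^2)/[tr(Xi)]^2 is 1/p,
# so the approximation reduces to 2/p regardless of the n_i.
p = 50
print(var_Tg_approx([np.eye(p)] * 3, [10, 15, 20], p))  # ~ 2/p = 0.04
```

Note how the approximation shrinks with p, reflecting the increasing stability of \(T_g\) in high dimensions.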
Power of \(T_g\) For \(z_{\alpha }\) as before, \(\text {P}(T_g \ge z_{\alpha }\sqrt{{{\mathrm{Var}}}(T_g)} + (g - 1)) = \alpha \). With \(z_g = (T_g - E(T_g))/\sqrt{{{\mathrm{Var}}}(T_g)}\), the power is \(1 - \beta = \text {P}(z_g \ge z_\alpha - n\delta )\), where \(\delta = \delta _1/\delta _2\), \(\delta _1 = \sum _{i < j}^g\Vert {\varvec{\mu }}_i - {\varvec{\mu }}_j\Vert ^2/p\), \(\delta ^2_2 = {{\mathrm{tr}}}({\varvec{\Xi }}_0)\), \({\varvec{\Xi }}_0 = n{\varvec{\Sigma }}_0/p\), \({\varvec{\Sigma }}_0 = \sum _{i = 1}^g{\varvec{\Sigma }}_i/n_i\). For \(g = 2\), \(\delta _1 = \Vert {\varvec{\mu }}_1 - {\varvec{\mu }}_2\Vert ^2/p\), \({\varvec{\Sigma }}_0 = \sum _{i = 1}^2{\varvec{\Sigma }}_i/n_i\). A case of particular interest is when the \({\varvec{\mu }}_i\) are mutually orthogonal, \({\varvec{\mu }}'_i{\varvec{\mu }}_j = 0\), \(\forall ~i < j\). The power function remains the same, now with \(\delta _1 = (g - 1)\sum _{i = 1}^g\Vert {\varvec{\mu }}_i\Vert ^2/p\) or, for \(g = 2\), \(\delta _1 = (\Vert {\varvec{\mu }}_1\Vert ^2 + \Vert {\varvec{\mu }}_2\Vert ^2)/p\).
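The power approximation above is straightforward to evaluate: since \(z_g\) is asymptotically standard normal, \(1 - \beta \approx \Phi (n\delta - z_\alpha )\). The sketch below uses hypothetical values of \(\delta _1\) and \(\delta _2\) (recall \(\delta _2 = \sqrt{{{\mathrm{tr}}}({\varvec{\Xi }}_0)}\)) purely for illustration.

```python
from statistics import NormalDist

def approx_power(delta1, delta2, n, alpha=0.05):
    """Asymptotic power 1 - beta = Phi(n*delta - z_alpha), delta = delta1/delta2."""
    z_alpha = NormalDist().inv_cdf(1 - alpha)
    return NormalDist().cdf(n * (delta1 / delta2) - z_alpha)

# At delta1 = 0 (no shift) the 'power' equals the size alpha;
# it grows with the separation delta1 and the total sample size n.
print(approx_power(0.0, 1.0, 30))   # ~ 0.05
print(approx_power(0.1, 1.0, 30))   # well above alpha
```

The monotone dependence on \(n\delta \) mirrors the consistency of \(T_g\): for any fixed \(\delta > 0\), the power tends to 1 as \(n \rightarrow \infty \).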
Remark 19
This remark pertains to the trace estimators used to define consistent estimators of \({{\mathrm{Var}}}(T_g)\). Consider the one-sample case, where \(E_2\), \(E_3\), as estimators of \({{\mathrm{tr}}}({\varvec{\Sigma }}^2)\), \([{{\mathrm{tr}}}({\varvec{\Sigma }})]^2\) given after Theorem 5, are defined as functions of \(\widehat{\varvec{\Sigma }}\) to keep them simple in formulation and efficient in computation. Alternatively, however, the same estimators can be defined as U-statistics, which helps study their properties, particularly consistency, more conveniently. Let \(\mathbf{D}_{kr} = \mathbf{X}_k - \mathbf{X}_r\), \(k \ne r\), and define \(A_{kr} = \mathbf{D}'_{kr}\mathbf{D}_{kr}\), \(A^2_{krls} = (\mathbf{D}'_{kr}\mathbf{D}_{ls})^2\). Then, we can equivalently write
where \(B_{krls} = A^2_{krls} + A^2_{klrs} + A^2_{ksrl}\), \(C_{krls} = A_{kr}A_{ls} + A_{kl}A_{rs} + A_{ks}A_{lr}\), \(\pi (\cdot )\) means all indices pairwise unequal and \(P(n) = n(n - 1)(n - 2)(n - 3)\). This formulation makes \(E_2\), \(E_3\) readily amenable to the theory of U-statistics. For details, see Ahmad (2016). The form extends directly to multi-sample cases by defining \(E_{2i}\), \(E_{3i}\) for the ith independent sample in the same way, with \({{\mathrm{tr}}}(\widehat{\varvec{\Sigma }}_i){{\mathrm{tr}}}(\widehat{\varvec{\Sigma }}_j)\) and \({{\mathrm{tr}}}(\widehat{\varvec{\Sigma }}_i\widehat{\varvec{\Sigma }}_j)\) estimating \({{\mathrm{tr}}}({\varvec{\Sigma }}_i){{\mathrm{tr}}}({\varvec{\Sigma }}_j)\) and \({{\mathrm{tr}}}({\varvec{\Sigma }}_i{\varvec{\Sigma }}_j)\) as usual, where a U-statistic form of \({{\mathrm{tr}}}(\widehat{\varvec{\Sigma }})\) is \(\sum _{k \ne r}^nA_{kr}/n(n - 1)\). For details, see Ahmad (2017a, b).
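The U-statistic form of \({{\mathrm{tr}}}(\widehat{\varvec{\Sigma }})\) can be sketched as below; since \(E(A_{kr}) = E\Vert \mathbf{X}_k - \mathbf{X}_r\Vert ^2 = 2\,{{\mathrm{tr}}}({\varvec{\Sigma }})\), a factor 1/2 is included here (one choice of normalization, under which the average over \(k \ne r\) coincides exactly with the trace of the unbiased sample covariance matrix).

```python
import numpy as np

def trace_sigma_ustat(X):
    """U-statistic trace estimator from pairwise differences D_kr = X_k - X_r.
    Averages A_kr = ||D_kr||^2 over k != r; E(A_kr) = 2 tr(Sigma), hence the 1/2."""
    n = X.shape[0]
    # sum_{k,r} ||X_k - X_r||^2 = 2n * sum_k ||X_k||^2 - 2 ||sum_k X_k||^2
    sq = (X ** 2).sum()
    total = 2 * n * sq - 2 * (X.sum(axis=0) ** 2).sum()
    return total / (2 * n * (n - 1))

rng = np.random.default_rng(1)
X = rng.normal(size=(20, 5))
# Coincides with tr(S), S the unbiased sample covariance matrix:
print(trace_sigma_ustat(X) - np.trace(np.cov(X, rowvar=False)))  # ~ 0
```

The Gram-matrix identity in the comment avoids forming all \(n(n-1)\) pairwise differences explicitly, keeping the cost at O(np).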
Remark 20
Note that the Chi-square approximation in both one- and multi-sample cases follows through a two-moment approximation of the limit of the test statistics with that of a scaled Chi-square variable. Box (1954a, b) used this approximation to study the violation of the assumptions of homoscedastic and uncorrelated errors in ANOVA settings, later extended and modified by Geisser and Greenhouse (1958), Greenhouse and Geisser (1959) and Huynh and Feldt (1970, 1976).
4 Simulations
We evaluate the accuracy of the tests for size control and power, focusing specifically on violations of the normality and homoscedasticity assumptions. We take g = 1 and 3 and generate data from Normal, Exponential and Uniform distributions with \(n = 10, 20, 50\) for \(T_1\) and \((n_1, n_2, n_3) = (10, 15, 20), (5, 25, 50), (10, 30, 60)\) for \(T_3\), where the last two triplets represent seriously unbalanced designs. For dimension, we take \(p \in \{50, 100, 300, 500, 1000\}\). For covariance structures, we use compound symmetry (CS) and autoregressive of order 1, AR(1), as defined in Sect. 2.2, with \(\mathbf{I}\) as the identity matrix and \(\mathbf{J}\) as the matrix of 1s, and unstructured (UN), defined as \({\varvec{\Sigma }} = (\sigma _{ij})_{i, j = 1}^d\) with \(\sigma _{ij} = 1(1)d\) (\(i = j\)), \(\rho _{ij} = (i - 1)/d\) (\(i > j\)). We use \(\rho = 0.5\), \(\kappa = 1\).
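The covariance structures used in the simulations can be sketched as follows; the CS form \(\kappa \mathbf{I} + \rho \mathbf{J}\) and the reading of the UN correlations as \((\max (i, j) - 1)/d\) are assumptions made here for illustration, since Sect. 2.2 is not reproduced above.

```python
import numpy as np

def cs(d, rho=0.5, kappa=1.0):
    # Compound symmetry: kappa*I + rho*J (assumed form, cf. Sect. 2.2)
    return kappa * np.eye(d) + rho * np.ones((d, d))

def ar1(d, rho=0.5):
    # AR(1): sigma_ij = rho^|i - j|
    idx = np.arange(d)
    return rho ** np.abs(idx[:, None] - idx[None, :])

def un(d):
    # Unstructured: variances 1, 2, ..., d; off-diagonal correlation (max(i,j) - 1)/d
    i = np.arange(1, d + 1)
    R = (np.maximum.outer(i, i) - 1) / d
    np.fill_diagonal(R, 1.0)
    return R * np.sqrt(np.outer(i, i))

for S in (cs(4), ar1(4), un(4)):
    assert np.allclose(S, S.T)       # all three are symmetric
print(np.diag(un(4)))                # variances 1, 2, 3, 4
```

Each sample is then drawn with its own structure, e.g., CS, AR and UN assigned to the three populations of \(T_3\).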
We use \(\alpha = 0.01, 0.05, 0.10\) and estimate the test size by averaging \(P(T\le T_o|H_0)\) over 1000 simulations, where \(T\) denotes \(T_1\) or \(T_3\) and \(T_o\) is their observed value under \(H_0\). Tables 2 and 3 report the estimated size and power of \(T_1\) for normal and exponential distributions, and Tables 4 and 5 report the same for \(T_3\) for all distributions. For power, we fix \(\alpha = 0.05\) and estimate the power by averaging \(P(T\ge T_o|H_1)\) over 1000 runs, where \(H_1\) is defined as \({\varvec{\mu }} = \delta _r\mathbf{p}_1\), \(\mathbf{p}_1 = (1/p, \ldots , p/p)\), \(\delta _r = 0.2(0.2)1\). Note that \(T_3\) is assessed under a triplet of covariance structures (CS, AR, UN), assigned in this order to the three populations.
We observe accurate size control for normal as well as non-normal distributions and under all covariance structures. The stability of the size control for increasing p, for n as small as 10, is also evident. We observe a similar performance for power, with discernibly better performance under the AR and UN structures than under CS, for all distributions, which might be attributed to the spiky nature of CS. The power, however, also improves reasonably under CS for increasing n and p. For \(g = 3\), we also observe accuracy for unbalanced designs, with a drastic improvement for the last triplet of \(n_i\). Although not reported here, similar results were observed for other \(\rho \) values in CS and AR, for other covariance structures, e.g., Toeplitz, and for other distributions, e.g., t.
We also assessed the power of the proposed tests under possible sparse alternatives. For simplicity, we report results for \(T_1\) for the normal distribution with the same n as used above and \(p \in \{60, 100, 200\}\). We consider three levels of sparsity: small, medium and large, with 25%, 50% and 75% zeros in the mean vector, respectively. Note that 0% sparsity corresponds to the full alternative \(H_1\), whereas 100% sparsity implies the null case. Table 6 reports the results. Generally, the power is high under all parameter settings, indicating the validity of the tests for such alternatives. Further, the power increases with increasing sample size, so that even under sparsity, the test shows a high probability of telling the null from the alternative, particularly as the sample size grows.
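The sparse alternatives can be generated as in the sketch below, combining the mean \({\varvec{\mu }} = \delta _r\mathbf{p}_1\) with a prescribed fraction of zero entries; the placement of the zeros is not specified in the text, so random positions are an assumption here.

```python
import numpy as np

def sparse_mean(p, delta, sparsity, rng):
    """Mean vector delta*(1/p, ..., p/p) with a given fraction of entries zeroed."""
    mu = delta * np.arange(1, p + 1) / p
    mu[rng.choice(p, size=int(sparsity * p), replace=False)] = 0.0
    return mu

rng = np.random.default_rng(0)
mu = sparse_mean(100, 0.4, 0.75, rng)
print((mu == 0).sum())  # 75 of the 100 entries are zero
```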
5 Analyses of real data sets
Figure 2 depicts average counts of macrobenthos observed along an approximately 2000 km long transect of the Norwegian continental shelf. The transect under observation comprised a range of water depths and sediment properties. A total of \(p = 809\) species were observed from \(n = 101\) independent sites in five different regions of the transect, where \(n_1 = 16\), \(n_2 = 21\), \(n_3 = 25\), \(n_4 = 19\), \(n_5 = 20\). Each count is a five-replicate pooled observation, and the data contain a large number of zeros where no species could be recorded. For details, see Ellingsen and Gray (2002).
In our notation, \(\mathbf{X} = (\mathbf{X}'_1, \ldots , \mathbf{X}'_5)' \in {\mathbb {R}}^{n\times p}\) represents the complete data matrix with regionwise data matrices \(\mathbf{X}_i = (\mathbf{X}'_{i1}, \ldots , \mathbf{X}'_{in_i})' \in {\mathbb {R}}^{n_i\times p}\), \(\mathbf{X}_{ik} \in {\mathbb {R}}^p\), where \(n_i\) and p are given above. It is thus an unbalanced one-way MANOVA experiment with \(g = 5\) independent samples, each of \(n_i\) iid vectors of dimension 809, where \(n = \sum _{i = 1}^5n_i = 101\). The linear model can be expressed as
where the vector \(\mathbf{X}_{ik}\) consists of 809 species counts measured for \(k\hbox {th}\) replicate (site) from ith region, \({\varvec{\mu }}_i \in {\mathbb {R}}^p\) is the true average count vector of \(i\hbox {th}\) region, and \({\varvec{\epsilon }}_{ik} \in {\mathbb {R}}^p\) are random error vectors, associated with each \(\mathbf{X}_{ik}\), with \(E({\varvec{\epsilon }}_{ik}) = \mathbf{0}\) and \({{\mathrm{Cov}}}({\varvec{\epsilon }}_{ik}) = {\varvec{\Sigma }}_i\forall ~k\), \(i = 1, \ldots , 5\). The hypothesis of interest can be formulated as \(H_{05}: {\varvec{\mu }}_1 = \ldots = {\varvec{\mu }}_5\) vs \(H_{15}: {\varvec{\mu }}_i \ne {\varvec{\mu }}_j\) for at least one pair \(i \ne j\), i, \(j = 1, \ldots , 5\). We use \(T_g\) in Eq. (8) to test \(H_{05}\).
We also apply the proposed tests to two well-known data sets, referred to here as the alcohol and leukemia data. The alcohol data are a two-group (\(g = 2\)) data set that motivated Dempster to construct the first two-sample high-dimensional test (Dempster 1958); see also Dempster (1960, 1968). The data consist of \(p = 59\) biochemistry measurements on \(n_1 = 8\) alcoholic and \(n_2 = 4\) control individuals aged 16–39 years; see also Beerstecher et al. (1950). The three-group (\(g = 3\)) leukemia data are often also used for classification. They consist of measurements on patients with acute lymphoblastic leukemia (ALL) carrying a chromosomal translocation involving the mixed-lineage leukemia (MLL) gene. A total of \(p = 11225\) gene expression profiles of leukemia cells are taken from patients in the ALL group (\(n_1 = 28\)), B-precursor ALL carrying an MLL translocation (\(n_2 = 24\)) and conventional B-precursor without MLL translocation (\(n_3 = 20\)); see Armstrong et al. (2002) for details.
Model (15) remains the same for the alcohol and leukemia data sets, with \(g = 2\) and 3, respectively, and with the corresponding sample sizes given above. The analyses of all three data sets are reported in Table 7. The first three columns report the data sizes, the next three the Chi-square approximation for \(T_g\), and the final two columns provide the corresponding normal approximation. Only for the alcohol data do the results provide evidence in support of the null hypothesis of no difference of mean vectors, whereas the hypotheses are significantly rejected for both the leukemia and species data. The conclusions for all three data sets are consistent across both approximations. In particular, the results for the species data substantiate what can be roughly witnessed in Fig. 2.
6 Discussion and remarks
Test statistics for high-dimensional mean vectors are presented. A unified strategy is proposed that systematically encompasses one- and multi-sample cases. The tests are constructed as linear combinations of U-statistics-based estimators and are valid for any distribution with finite fourth moment. The limiting distributions of the tests are derived under a few mild assumptions. Simulations are used to show the accuracy of the tests for moderate sample sizes and any large dimension. The tests are location invariant, so that the mean vectors need not be assumed zero. Owing to the singularity of the empirical covariance matrix in the high-dimensional case, an affine-invariant test is not possible, and location invariance is the best that can be achieved.
References
Ahmad, M.R.: A \(U\)-statistic approach for a high-dimensional two-sample mean testing problem under non-normality and Behrens–Fisher setting. Ann. Inst. Stat. Math. 66, 33–61 (2014)
Ahmad, M.R.: On testing sphericity and identity of a covariance matrix with large dimensions. Math. Methods Stat. 25, 121–132 (2016)
Ahmad, M.R.: Location-invariant multi-sample \(U\)-tests for covariance matrices with large dimension. Scand. J. Stat. 44, 500–523 (2017a)
Ahmad, M.R.: Location-invariant tests of homogeneity of large dimensional covariance matrices. J. Stat. Theory Pract. 11, 731–745 (2017b)
Anderson, N.H., Hall, P., Titterington, D.M.: Two-sample test statistics for measuring discrepancies between two multivariate probability density functions using kernel-based density estimates. J. Multivar. Anal. 50, 41–54 (1994)
Aoshima, M., Yata, K.: Two-stage procedures for high-dimensional data. Seq. Anal. 30, 356–399 (2011)
Aoshima, M., Yata, K.: Asymptotic normality for inference on multisample high-dimensional mean vectors under mild conditions. Methodol. Comput. Appl. Probab. 17, 419–439 (2015)
Armstrong, S.A., Staunton, J.E., Silverman, L.B., et al.: MLL translocations specify a distinct gene expression profile that distinguishes a unique leukemia. Nat. Genet. 30, 41–47 (2002)
Bai, Z., Saranadasa, H.: Effect of high dimension: by an example of a two sample problem. Stat. Sin. 6, 311–329 (1996)
Beerstecher, E., Sutton, H.E., Berry, H.K., et al.: Biochemical individuality. V. Explorations with respect to the metabolic patterns of compulsive drinkers. Arch. Biochem. 29, 27–40 (1950)
Box, G.E.P.: Some theorems on quadratic forms applied in the study of analysis of variance problems, I: effect of inequality of variance in the one-way classification. Ann. Math. Stat. 25, 290–302 (1954a)
Box, G.E.P.: Some theorems on quadratic forms applied in the study of analysis of variance problems, II: effect of inequality of variance and of correlation between errors in the two-way classification. Ann. Math. Stat. 25, 484–498 (1954b)
Chen, S.X., Qin, Y.-L.: A two-sample test for high-dimensional data with applications to gene-set testing. Ann. Stat. 38(2), 808–835 (2010)
Dempster, A.P.: A high dimensional two sample significance test. Ann. Math. Stat. 29, 995–1010 (1958)
Dempster, A.P.: A significance test for the separation of two highly multivariate small samples. Biometrics 16, 41–50 (1960)
Dempster, A.P.: Elements of Continuous Multivariate Analysis. Addison-Wesley, Reading (1968)
Duchesne, P., Francq, C.: Multivariate hypothesis testing using generalized and 2-inverses with applications. Statistics 49, 475–496 (2015)
Ellingsen, K.E., Gray, J.S.: Spatial patterns of benthic diversity: is there a latitudinal gradient along the Norwegian continental shelf? J. Anim. Ecol. 71, 373–389 (2002)
Feng, L., Zou, C., Wang, Z.: Multivariate sign-based high-dimensional test for the two-sample location problem. J. Am. Stat. Assoc. 111, 721–735 (2016)
Fujikoshi, Y.: Multivariate analysis for the case when the dimension is large compared to the sample size. J. Korean Stat. Soc. 33, 1–24 (2004)
Fujikoshi, Y., Ulyanov, V.V., Shimizu, R.: Multivariate Statistics: High-Dimensional and Large-Sample Approximations. Wiley, New York (2010)
Geisser, S., Greenhouse, S.W.: An extension of Box’s results on the use of the \(F\) distribution in multivariate analysis. Ann. Math. Stat. 29, 885–891 (1958)
Greenhouse, S.W., Geisser, S.: On methods in the analysis of profile data. Psychometrika 24, 95–112 (1959)
Hájek, J., Šidák, Z., Sen, P.K.: Theory of Rank Tests. Academic Press, Cambridge (1999)
Hu, J., Bai, Z.: A review of 20 years of naive tests of significance for high-dimensional mean vectors and covariance matrices. Sci. China Math. 55, 1–19 (2015)
Hu, J., Bai, Z., Wang, C., Wang, W.: On testing equality of high dimensional mean vectors with unequal covariance matrices. Ann. Inst. Stat. Math. 69, 365–387 (2017)
Huynh, H., Feldt, L.S.: Conditions under which mean square ratios in repeated measurements designs have exact \(F\)-distributions. J. Am. Stat. Assoc. 65, 1582–1589 (1970)
Huynh, H., Feldt, L.S.: Estimation of the Box correction for the degrees of freedom from sample data in randomized block and split-plot designs. J. Educ. Stat. 1, 69–82 (1976)
Jiang, J.: Large Sample Techniques for Statistics. Springer, New York (2010)
Katayama, S., Kano, Y.: A new test on high-dimensional mean vectors without any assumption on population covariance matrix. Commun. Stat. Theory Methods 43, 5290–5304 (2014)
Koroljuk, V.S., Borovskich, Y.V.: Theory of \(U\)-Statistics. Kluwer, Dordrecht (1994)
Läuter, J.: Two new multivariate tests, in particular for a high dimension. Acta et Comment. Univ. Tartu. Math. 8, 179–186 (2004)
Läuter, J., Glimm, E., Kropf, S.: Multivariate tests based on left-spherically distributed linear scores. Ann. Stat. 26, 1972–1988 (1998). Corrections: 27, 1441
Lee, A.J.: U-Statistics: Theory and Practice. CRC, Boca Raton (1990)
Lehmann, E.L.: Elements of Large-Sample Theory. Springer, New York (1999)
Reed, M., Simon, B.: Methods of Modern Mathematical Physics: Functional Analysis, vol. I. Academic Press, Cambridge (1980)
Schott, J.R.: Some high-dimensional tests for a one-way MANOVA. J. Multivar. Anal. 98, 1825–1839 (2007)
Serfling, R.J.: Approximation Theorems of Mathematical Statistics. Wiley, Weinheim (1980)
Shao, J.: Mathematical Statistics, 2nd edn. Springer, New York (2003)
van der Vaart, A.W.: Asymptotic Statistics. Cambridge University Press, Cambridge (1998)
Wang, L., Peng, B., Li, R.: A high-dimensional nonparametric multivariate test for mean vector. J. Am. Stat. Assoc. 110, 1658–1669 (2015)
Acknowledgements
The author is thankful to the editor, the associate editor and anonymous referee for their constructive comments which helped improve the original version of the manuscript. Kari Ellingsen’s kind permission to use species data is also duly acknowledged.
Appendices
Some miscellaneous results
1.1 U-statistics
First, we set some notation. For details, see e.g., Serfling (1980), Koroljuk and Borovskich (1994), van der Vaart (1998) and Lehmann (1999). For iid \(X_i\), let \(h(X_1, \ldots , X_m): {\mathbb {R}}^m \rightarrow {\mathbb {R}}\) denote the kernel of an mth order U-statistic, \(U_n\), with \(E(U_n) = \theta = E[h(\cdot )]\), its projections \(h_c(x_1, \ldots , x_c) = E[h(\cdot )|x_1, \ldots , x_c]\), \(h_m(\cdot ) = h(\cdot )\), and \(\xi _c = {{\mathrm{Var}}}[h_c(\cdot )]\), \(c = 1, \ldots , m\), so that \({{\mathrm{Var}}}(U_n) = \sum _{c = 1}^m\left( {\begin{array}{c}m\\ c\end{array}}\right) \left( {\begin{array}{c}n - m\\ m - c\end{array}}\right) \xi _{c}/\left( {\begin{array}{c}n\\ m\end{array}}\right) \). If \(0< \xi _c < \infty ~\forall ~c\), then \((U_n - E(U_n))/\sqrt{{{\mathrm{Var}}}(U_n)} \xrightarrow {{\mathcal {D}}} N(0, 1)\). For two U-statistics, \(U_{ni}\), of orders \(m_i\), kernels \(h_i(\cdot )\), projections \(h_{ic}(\cdot )\), \(i = 1, 2\), let \(\xi _{cc} = {{\mathrm{Cov}}}[h_{1c}(\cdot ), h_{2c}(\cdot )]\), \(c = 1, \ldots , m_1 \le m_2\). Then, \({{\mathrm{Cov}}}(U_{n1}, U_{n2}) = \sum _{c = 1}^{m_1}\left( {\begin{array}{c}m_2\\ c\end{array}}\right) \left( {\begin{array}{c}n - m_2\\ m_1 - c\end{array}}\right) \xi _{cc}/\left( {\begin{array}{c}n\\ m_1\end{array}}\right) \). Let \(U_{n_1n_2}\) be a U-statistic of two independent samples, with kernel \(h(X_{11}, \ldots , X_{1m_1}, X_{21}, \ldots , X_{2m_2})\), symmetric in each sample, projections \(h_{c_1c_2} = E[h(\cdot )|X_{11}, \ldots , X_{1c_1}; X_{21}, \ldots , X_{2c_2}]\), \(\xi _{c_1c_2} = {{\mathrm{Cov}}}[h(\cdot ), h_{c_1c_2}(\cdot )]\), \(\xi _{00} = 0\), \(c_i = 0, 1, \ldots , m_i\).
Then, \({{\mathrm{Var}}}(U_{n_1n_2}) = \sum _{c_1 = 0}^{m_1}\sum _{c_2 = 0}^{m_2}\left( {\begin{array}{c}m_1\\ c_1\end{array}}\right) \left( {\begin{array}{c}n_1 - m_1\\ m_1 - c_1\end{array}}\right) \left( {\begin{array}{c}m_2\\ c_2\end{array}}\right) \left( {\begin{array}{c}n_2 - m_2\\ m_2 - c_2\end{array}}\right) \xi _{c_1c_2}/\left( {\begin{array}{c}n_1\\ m_1\end{array}}\right) \left( {\begin{array}{c}n_2\\ m_2\end{array}}\right) \). If \(0 \le n_i/n \le 1\), \(n = n_1 + n_2\), \(0< \xi _{c_1c_2} < \infty \)\(\forall ~c_i\), then \((U_{n_1n_2} - E(U_{n_1n_2}))/\sqrt{{{\mathrm{Var}}}(U_{n_1n_2})}\)\(\xrightarrow {{\mathcal {D}}} N(0, 1)\).
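The one-sample variance formula can be checked numerically. The sketch below evaluates \({{\mathrm{Var}}}(U_n) = \sum _c \binom{m}{c}\binom{n-m}{m-c}\xi _c/\binom{n}{m}\) for the scalar kernel \(h(x_1, x_2) = x_1x_2\) (the \(p = 1\) version of the kernels in Sect. A.2) and compares it with the closed form given there; the values of \(\mu \) and \(\sigma ^2\) are arbitrary.

```python
from math import comb

def var_Un(n, m, xis):
    """Var(U_n) = sum_c C(m,c) C(n-m, m-c) xi_c / C(n,m), xi_c = Var[h_c]."""
    return sum(comb(m, c) * comb(n - m, m - c) * xis[c - 1]
               for c in range(1, m + 1)) / comb(n, m)

# Check against the closed form for h(x1, x2) = x1*x2 with scalar data,
# E(x) = mu, Var(x) = s2: xi_1 = mu^2 s2 and
# xi_2 = Var(x1*x2) = (s2 + mu^2)^2 - mu^4 = s2^2 + 2 mu^2 s2.
n, mu, s2 = 12, 1.5, 2.0
xi1 = mu**2 * s2
xi2 = s2**2 + 2 * mu**2 * s2
lhs = var_Un(n, 2, [xi1, xi2])
rhs = 2 * (2 * (n - 1) * mu**2 * s2 + s2**2) / (n * (n - 1))  # Sect. A.2 form
print(abs(lhs - rhs))  # ~ 0
```

Expanding the combinatorial sum gives \((4(n-2)\xi _1 + 2\xi _2)/n(n-1)\), which matches the Sect. A.2 expression term by term.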
Lemma 21
(Jiang 2010, p. 183; Hájek et al. 1999, p. 184) Let \(Y_1, Y_2, \ldots \) be iid random variables, \(E(Y_i) = 0\), \({{\mathrm{Var}}}(Y_i) = 1\). Let \(b_{ni}\) be a sequence of constants, \(1 \le i \le n\). Then \(\sum _{i = 1}^nb_{ni}Y_i \xrightarrow {{\mathcal {D}}} N(0, 1)\) given \(\max _ib^2_{ni} \rightarrow 0\), as \(n \rightarrow \infty \).
1.2 Basic moments of U-statistics
For \(U_{n_i}\), \(h(\mathbf{X}_{ik}, \mathbf{X}_{ir}) = \mathbf{X}'_{ik}\mathbf{X}_{ir}\), \(m = 2\), \(h_1(\mathbf{X}_{ik}) = {\varvec{\mu }}'_i\mathbf{X}_{ik}\), \(\xi _1 = {\varvec{\mu }}'_i{\varvec{\Sigma }}_i{\varvec{\mu }}_i\), \(\xi _2 = {{\mathrm{tr}}}({\varvec{\Sigma }}^2_i) + 2{\varvec{\mu }}'_i{\varvec{\Sigma }}_i{\varvec{\mu }}_i\). For \(U_{n_in_j}\), \(h(\mathbf{X}_{ik}, \mathbf{X}_{jl})= \mathbf{X}'_{ik}\mathbf{X}_{jl}\), \(m_1 = 1 = m_2\), \(h_{10} = {\varvec{\mu }}'_j\mathbf{X}_{ik}\), \(h_{01} = {\varvec{\mu }}'_i\mathbf{X}_{jl}\), \(h_{11}(\cdot ) = h(\cdot )\), \(\xi _{10} = {\varvec{\mu }}'_j{\varvec{\Sigma }}_i{\varvec{\mu }}_j\), \(\xi _{01} = {\varvec{\mu }}'_i{\varvec{\Sigma }}_j{\varvec{\mu }}_i\), \(\xi _{11} = {\varvec{\mu }}'_i{\varvec{\Sigma }}_j{\varvec{\mu }}_i + {\varvec{\mu }}'_j{\varvec{\Sigma }}_i{\varvec{\mu }}_j + {{\mathrm{tr}}}({\varvec{\Sigma }}_i{\varvec{\Sigma }}_j)\). Then, for \(i \ne j\), \(i \ne j'\), \(i' \ne j\), \({{\mathrm{Var}}}(U_{n_i}) = 2[2(n_i - 1){\varvec{\mu }}'_i{\varvec{\Sigma }}_i{\varvec{\mu }}_i + {{\mathrm{tr}}}({\varvec{\Sigma }}^2_i)]/n_i(n_i - 1)\), \({{\mathrm{Var}}}(U_{n_in_j}) = [n_i{\varvec{\mu }}'_i{\varvec{\Sigma }}_j{\varvec{\mu }}_i + n_j{\varvec{\mu }}'_j{\varvec{\Sigma }}_i{\varvec{\mu }}_j + {{\mathrm{tr}}}({\varvec{\Sigma }}_i{\varvec{\Sigma }}_j)]/n_in_j\), \({{\mathrm{Cov}}}(U_{n_i}, U_{n_in_j}) = 2{\varvec{\mu }}'_j{\varvec{\Sigma }}_i{\varvec{\mu }}_i/n_i\), \({{\mathrm{Cov}}}(U_{n_j}, U_{n_in_j}) = 2{\varvec{\mu }}'_i{\varvec{\Sigma }}_j{\varvec{\mu }}_j/n_j\), \({{\mathrm{Cov}}}(U_{n_in_j}, U_{n_in_{j'}}) = {\varvec{\mu }}'_j{\varvec{\Sigma }}_i{\varvec{\mu }}_{j'}/n_i\), \({{\mathrm{Cov}}}(U_{n_in_j}, U_{n_{i'}n_j}) = {\varvec{\mu }}'_i{\varvec{\Sigma }}_j{\varvec{\mu }}_{i'}/n_j\). See Sect. A.1 for basic notations and general moment expressions.
Main proofs
1.1 Proof of Theorem 5
First, \(E(Q_1/p) = {{\mathrm{tr}}}({\varvec{\Sigma }})/p = \sum _{s = 1}^p\nu _s\), bounded by \(\nu _0\), under Assumption 2, as \(p \rightarrow \infty \). Now
With \({{\mathrm{Var}}}(\mathbf{X}'_k\mathbf{X}_k) \le \gamma p^2\) under Assumption 1, \({{\mathrm{Var}}}(E/p) \le \gamma /n = O(1/n)\). From Sect. A.2, \({{\mathrm{Var}}}(U_n/p) = \gamma /n + 2{{\mathrm{tr}}}({\varvec{\Sigma }}^2)/n(n - 1)p^2 + 4{\varvec{\mu }}'{\varvec{\Sigma }}{\varvec{\mu }}/(n - 1)p^2 = O(1/n)\) under the assumptions. Finally, \({{\mathrm{Cov}}}(E/p, U_n/p) = 0\) for \({\varvec{\mu }} = \mathbf{0}\), which can be assumed without loss of generality since \(E(Q_1/p)\) does not depend on \({\varvec{\mu }}\). Alternatively, by the Cauchy–Schwarz inequality, \({{\mathrm{Cov}}}(E/p, U_n/p) \le \left[ {{\mathrm{Var}}}(E){{\mathrm{Var}}}(U_n)\right] ^{1/2}\), which simplifies to O(1/n). This proves the consistency of \(nQ_1/p\). Note that this consistency holds both under simultaneous and sequential (n, p)-asymptotics, where in the latter case the last term vanishes with p, before the limit over n is carried out. Further, the limit is the same under \(H_0\) and \(H_1\).
Now, consider \(nU_n\) with \(h(\mathbf{x}_k, \mathbf{x}_r) = \mathbf{X}'_k\mathbf{X}_r/p\) so that \(E[h(\cdot )] = \Vert {\varvec{\mu }}\Vert ^2 = E(U_n)\). Define \({\widetilde{U}}_n = U_n - E(U_n)\) with corresponding kernel \({\widetilde{h}}(\cdot ) = h(\cdot ) - E[h(\cdot )]\). Let \({\widehat{U}}_n\) denote the projection of \({\widetilde{U}}_n\). As \(U_n\) is a second-order U-statistic with product kernel (a bilinear form of independent components), \(h(\cdot )\), following the notation in Sect. A.1,
with \(E({\widehat{U}}_n) = 0 = E({\widetilde{U}}_n)\) and \({{\mathrm{Cov}}}({\widehat{U}}_n, {\widetilde{U}}_n) = 4{\varvec{\mu }}'{\varvec{\Sigma }}{\varvec{\mu }}/n = {{\mathrm{Var}}}({\widehat{U}}_n)\) so that, with \({{\mathrm{Var}}}({\widetilde{U}}_n)\) as given above (see Sect. A.2), it follows that \({{\mathrm{Var}}}(n{\widehat{U}}_n)\) and \({{\mathrm{Var}}}(n{\widetilde{U}}_n)\) are uniformly bounded under Assumptions 2 and 4, such that \({{\mathrm{Var}}}(n{\widehat{U}}_n)/{{\mathrm{Var}}}(n{\widetilde{U}}_n) \rightarrow 1\); see e.g., Lehmann (1999, Ch.6), Serfling (1980, Ch.5) or van der Vaart (1998, Ch.12). This, along with the convergence of \(nQ_1/p\), gives normal limit of \(nU_n/[nQ_1/p]\), hence of T, by Slutsky’s theorem.
Some remarks concerning the aforementioned limit will help us extend it further under the null. To begin with, the first-order projection of \(h(\cdot )\), \(h_1(\mathbf{x}_k) = E[h(\cdot )|\mathbf{x}_k] = {\varvec{\mu }}'\mathbf{X}_k/p\), along with its variance, \(\xi _1 = {{\mathrm{Var}}}[h_1(\mathbf{x}_k)] = {\varvec{\mu }}'{\varvec{\Sigma }}{\varvec{\mu }}/p^2\), vanishes exactly under \(H_0: {\varvec{\mu }} = \mathbf{0}\), making the kernel (first-order) degenerate under \(H_0\). Note that, for the limit under \(H_1\) above, the term involving this projection, \(4{\varvec{\mu }}'{\varvec{\Sigma }}{\varvec{\mu }}/np^2\), is eventually bounded under Assumption 4, for simultaneous (n, p)-asymptotics, when used for \(nU_n\). Under sequential asymptotics, if \(p \rightarrow \infty \) first, then the projection vanishes asymptotically; the limit under \(H_1\) nevertheless still holds since the total variance \({{\mathrm{Var}}}(U_n)\) remains bounded under the assumptions. In fact, an additional advantage under sequential asymptotics is that the power of T then does not depend on any specific \({\varvec{\mu }}\).
Under \(H_0\), however, the projection and its variance \(\xi _1\) are exactly zero and the limit needs to be derived differently. Since \(E[h^2(\cdot )] = {{\mathrm{tr}}}({\varvec{\Sigma }}^2)/p^2 < \infty \) as a consequence of Assumption 2, the kernel is square integrable. As we shall see in the sequel, \(h(\cdot )\) being a product kernel makes it further convenient to derive the limit. Without loss of generality, we can assume that the data \(\mathbf{X}_k\) are generated by a separable (Hilbert) space \({\mathcal {L}}_2({\mathcal {X}}, {\mathcal {A}}, {\mathcal {P}})\). By symmetry and square integrability of \(h(\cdot )\), the map \(T: {\mathcal {L}}_2({\mathcal {X}}, {\mathcal {A}}, {\mathcal {P}}) \rightarrow {\mathcal {L}}_2({\mathcal {X}}, {\mathcal {A}}, {\mathcal {P}})\), being a (bounded, linear) integral operator, i.e.,
is self-adjoint and Hilbert–Schmidt. With \(\lambda \)’s and \(\nu \)’s introduced just before the assumptions, let \((\nu _s, f_s)\) form its orthonormal eigendecomposition, i.e., \(h(\mathbf{x}_k, \mathbf{x}_r) = \sum _{s = 0}^\infty \lambda _sf_s(\mathbf{x}_k)f_s(\mathbf{x}_r)\), where \(\sum _s\nu ^2_s < \infty \) and \(f_0 = 1\) corresponds to \(\lambda _0 = 0\). For details, see e.g., van der Vaart (1998) and Koroljuk and Borovskich (1994). By the Hilbert–Schmidt Theorem (Reed and Simon 1980, p. 203), the convergence of the kernel to its basis is in \({\mathcal {L}}_2\), i.e.,
A general theorem on the limit of a degenerate U-statistic under this setup is given in van der Vaart (1998, Theorem 12.10, p. 169) or Lee (1990, Theorem 1, p. 90). The limit holds for \(n^{c/2}U_{n,c}\) with variance \(c!E[h^2_c(\cdot )]\), where \(U_{n,c}\) is a U-statistic with (projected) kernel \(h_c(\cdot )\) and c is the least value for which \(h_c(\cdot )\) is non-degenerate (see Sect. A.1).
Thus, in the present context with \(m = 2\), \(c = 2\), \(nU_n\) has a finite limit with variance approximating \(2\xi _2 = 2{{\mathrm{tr}}}({\varvec{\Sigma }}^2)/p^2 = 2\sum _{s = 1}^p\nu ^2_s\). Specifically, for first-order degeneracy, the limit is \([m(m - 1)/2]\sum _{s = 1}^\infty \nu _s(z^2_s - 1)\), where \(z_s\) are independent N(0, 1) variables; see Koroljuk and Borovskich (1994, Ch. ) and Shao (2003, Ch. 3). With \(m = 2\), we thus have, for \(n, p \rightarrow \infty \),
with \(z^2_s \sim \chi ^2_1\) iid, where the limiting mean is 0 and variance is \(2\sum _{s = 1}^\infty \nu ^2_s\) which approximates \(2{{\mathrm{tr}}}({\varvec{\Sigma }}^2)/p^2\). Combined with the limit of \(nQ_1/p\) by Slutsky’s theorem, we have
Now write \(\omega _s = \nu _s/\sum _s\nu _s\) such that \(\sum _s\omega _s = 1\) and \(\max \omega ^2_s \rightarrow 0\). Also let \(Y_s = (z^2_s - 1)/\sqrt{2}\) so that \(E(Y_s) = 0\), \({{\mathrm{Var}}}(Y_s) = 1\). Then, the normal limit follows by the Hájek–Šidák Lemma (Lemma 21).
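A seeded Monte Carlo sketch of the degenerate limit \(\sum _{s}\nu _s(z^2_s - 1)\), truncated to a few illustrative eigenvalues \(\nu _s = 2^{-s}\) (an assumption made here; any summable sequence works), confirms the limiting mean 0 and variance \(2\sum _s\nu ^2_s\).

```python
import numpy as np

rng = np.random.default_rng(42)
nu = 0.5 ** np.arange(1, 11)       # illustrative eigenvalues nu_s, s = 1..10
reps = 200_000
z2 = rng.chisquare(1, size=(reps, nu.size))
W = (z2 - 1) @ nu                  # draws of sum_s nu_s (z_s^2 - 1)
print(W.mean(), W.var())           # mean ~ 0, variance ~ 2*sum(nu_s^2) ~ 0.667
```

Normalizing \(W\) by \(\sum _s\nu _s\), as in the construction of \(\omega _s\) above, then yields the centered limit whose asymptotic normality follows from Lemma 21.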
1.2 Proof of Theorem 13
With \(Q_1\) composed of two independent components, the probability convergence of \(nQ_1/p\) follows exactly as in the one-sample case, so that
where \(\nu _0 = \sum _{i = 1}^2\rho _i\nu _{i0} = \sum _{i = 1}^2\sum _{s = 1}^\infty \rho _i\nu _{is}\), as \(n_i, p \rightarrow \infty \). Now consider \(Q_0\), which we first write as \(Q_0 = \mathbf{a}'\mathbf{U}_N\), where \(\mathbf{a} = (1~1~-2)'\) and \(\mathbf{U}_N = (U_{n_1}~U_{n_2}~U_{n_1n_2})'\), so that the limit of \(Q_0\) follows from that of \(\mathbf{U}_N\). Obviously \(E(Q_0) = \Vert {\varvec{\mu }}_1 - {\varvec{\mu }}_2\Vert ^2\) where, from “Appendix A.2”,
so that \({{\mathrm{Var}}}(Q_0) = \mathbf{a}'{{\mathrm{Cov}}}(\mathbf{U}_N)\mathbf{a}\) results in
Note that, as \(n_i, p \rightarrow \infty \), the terms involving \({\varvec{\mu }}'{\varvec{\Sigma }}{\varvec{\mu }}\) are finite under the assumptions, making \({{\mathrm{Cov}}}(n\mathbf{U}_N)\), hence \({{\mathrm{Var}}}(nQ_0)\), uniformly bounded, implying in turn that \(nQ_0\) may have a finite limit. Let \({\widehat{Q}}_0 = \sum _{i = 1}^2{\widehat{U}}_{n_i} - 2{\widehat{U}}_{n_1n_2}\) be the projection of \({\widetilde{Q}}_0 = \sum _{i = 1}^2{\widetilde{U}}_{n_i} - 2{\widetilde{U}}_{n_1n_2}\), where \({\widetilde{U}}_{n_i} = U_{n_i} - E(U_{n_i})\), with kernel \({\widetilde{h}}(\mathbf{x}_{ik}, \mathbf{x}_{ir}) = \mathbf{X}'_{ik}\mathbf{X}_{ir}/p - \Vert {\varvec{\mu }}_i\Vert ^2/p\), and similarly \({\widetilde{U}}_{n_1n_2}\) with kernel \({\widetilde{h}}(\mathbf{x}_{1k}, \mathbf{x}_{2l})\). Further, \({\widehat{U}}_{n_i}\) are as defined for the one-sample case, whereas
where \({\widetilde{h}}_{10}(\mathbf{x}_{1k}) = E[{\widetilde{h}}(\mathbf{x}_{1k}, \mathbf{x}_{2l})|\mathbf{x}_{1k}]\), and similarly \({\widetilde{h}}_{01}(\mathbf{x}_{2l})\); see Sect. A.1 for notation. Thus
is the projection of \({\widetilde{Q}}_0\) with \(E({\widehat{Q}}_0) = 0\), \({{\mathrm{Var}}}({\widehat{Q}}_0) = 4({\varvec{\mu }}_1 - {\varvec{\mu }}_2)'{\varvec{\Sigma }}_0({\varvec{\mu }}_1 - {\varvec{\mu }}_2)/p^2\). The term within brackets in Eq. (20) is the sum of two independent components, a direct extension of the one-sample case. By the same procedure, it follows that \({{\mathrm{Cov}}}({\widetilde{Q}}_0, {\widehat{Q}}_0) = {{\mathrm{Var}}}({\widehat{Q}}_0)\), so that \({{\mathrm{Var}}}(n{\widehat{Q}}_0)/{{\mathrm{Var}}}(n{\widetilde{Q}}_0) \rightarrow 1\). Under the assumptions, \(n{\widetilde{Q}}_0 = n{\widehat{Q}}_0 + o_P(1)\), with the limit of \(n{\widehat{Q}}_0\) following by the central limit theorem, which leads to the limit of \(n{\widetilde{Q}}_0\), hence of \(T_2\), by Slutsky's theorem.
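As an illustration of how \(Q_0\) is computed in practice, the following sketch implements the component U-statistics and their combination \(Q_0 = U_{n_1} + U_{n_2} - 2U_{n_1n_2}\). This is a minimal sketch with ad hoc names; for clarity the \(1/p\) scaling of the kernels is omitted, so that \(Q_0\) here is unbiased for \(\Vert {\varvec{\mu }}_1 - {\varvec{\mu }}_2\Vert ^2\) itself.

```python
import numpy as np

def u_stat_one(X):
    """One-sample U-statistic: sum over k != r of x_k' x_r, divided by n(n - 1)."""
    n = X.shape[0]
    s = X.sum(axis=0)
    # sum_{k,r} x_k' x_r = ||s||^2; subtract the diagonal terms k = r
    return (s @ s - np.einsum('ij,ij->', X, X)) / (n * (n - 1))

def u_stat_two(X1, X2):
    """Two-sample U-statistic: sum over all pairs (k, l) of x_{1k}' x_{2l} / (n1 n2)."""
    return (X1.sum(axis=0) @ X2.sum(axis=0)) / (X1.shape[0] * X2.shape[0])

def Q0(X1, X2):
    """Q_0 = a' U_N with a = (1, 1, -2)'; unbiased for ||mu_1 - mu_2||^2."""
    return u_stat_one(X1) + u_stat_one(X2) - 2.0 * u_stat_two(X1, X2)
```

The sum-based identities avoid the \(O(n_i^2p)\) double loops over pairs, which matters when \(p > n_i\) but the \(n_i\) are still moderately large.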
Now consider \(H_0\), under which the projection \({\widehat{Q}}_0 = 0\), making the component U-statistics, hence \(Q_0\), degenerate, so that the normal limit above is no longer valid under the null. To simplify matters, assume without loss of generality that \({\varvec{\mu }}_i = {\varvec{\mu }} = \mathbf{0}\), \(i = 1, 2\). Then
and \({{\mathrm{Var}}}(Q_0) = 2{{\mathrm{tr}}}({\varvec{\Sigma }}^2_0)/p^2\). This is again a direct extension of the one-sample case under \(H_0\), so that we can proceed similarly to obtain the limit, except that here we deal with a three-dimensional vector instead of a scalar. The limits of \(nU_{n_i}\), \(i = 1, 2\), then follow from (17) (see also Ahmad 2014) as
as \(n_i, p \rightarrow \infty \), where \(n = n_1 + n_2\). Writing \(n/n_1n_2 = [\sqrt{n/n_1}\sqrt{n/n_2}][1/\sqrt{n_1n_2}]\), the corresponding limit for \(nU_{n_1n_2}\) follows from Koroljuk and Borovskich (1994, Ch. 4) as
as \(n_i, p \rightarrow \infty \), where \(z_{is}\) are iid N(0, 1) variables in both limits and \(z_{1s}\), \(z_{2s}\) are also independent of each other. To combine the three limits, define \(w^2_{is} = \rho _i\nu ^2_{is}/\sum _s\rho _i\nu ^2_{is}\), \(i = 1, 2\) such that \(\lim _{p \rightarrow \infty }\max _sw^2_{is} = 0\). Then, a multivariate extension of Lemma 21 gives the normal limit
where \(\widetilde{\mathbf{U}}_N = (U_{n_1}/\sqrt{{{\mathrm{Var}}}(U_{n_1})}, ~U_{n_2}/\sqrt{{{\mathrm{Var}}}(U_{n_2})}, ~U_{n_1n_2}/\sqrt{{{\mathrm{Var}}}(U_{n_1n_2})})'\) is the standardized form of \(\mathbf{U}_N\), with each component having mean zero. Finally, under Assumption 2, with the covariance matrix diagonal, the limit extends by Slutsky's theorem to \(nQ_0/[nQ_1/p]\), and hence to \(T_2\), as a linear combination of the three components.
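The covariance structure used above can be checked by simulation. The sketch below is illustrative only (ad hoc names, \(1/p\) kernel scaling omitted): it generates data under \(H_0\) with \({\varvec{\mu }}_i = \mathbf{0}\), estimates \({{\mathrm{Cov}}}(\mathbf{U}_N)\) over replications, and verifies that the three component U-statistics are essentially uncorrelated, that \(E(Q_0) = 0\), and that \({{\mathrm{Var}}}(Q_0) = \mathbf{a}'{{\mathrm{Cov}}}(\mathbf{U}_N)\mathbf{a}\) with \(\mathbf{a} = (1~1~-2)'\).

```python
import numpy as np

rng = np.random.default_rng(2)
n1, n2, p, reps = 30, 40, 50, 4000

def u_stats(X1, X2):
    """Return (U_{n1}, U_{n2}, U_{n1 n2}), without the 1/p kernel scaling."""
    s1, s2 = X1.sum(axis=0), X2.sum(axis=0)
    U1 = (s1 @ s1 - (X1 * X1).sum()) / (n1 * (n1 - 1))
    U2 = (s2 @ s2 - (X2 * X2).sum()) / (n2 * (n2 - 1))
    U12 = (s1 @ s2) / (n1 * n2)
    return U1, U2, U12

# Replicate under H_0: both samples from N_p(0, I)
U = np.array([u_stats(rng.standard_normal((n1, p)),
                      rng.standard_normal((n2, p))) for _ in range(reps)])
a = np.array([1.0, 1.0, -2.0])
C = np.cov(U.T)                       # estimated Cov(U_N); off-diagonals near 0
q0 = U @ a                            # Q_0 = a' U_N in each replication
print(np.round(C, 4))
print(q0.mean())                      # near 0: E(Q_0) = 0 under H_0
print(q0.var(ddof=1) - a @ C @ a)     # essentially 0: Var(Q_0) = a' Cov(U_N) a
```

The last identity holds exactly in-sample, since the sample variance of a linear combination equals the same quadratic form in the sample covariance matrix; the near-diagonal \({{\mathrm{Cov}}}(\mathbf{U}_N)\) is what lets the three limits be combined componentwise.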
Ahmad, M.R. A unified approach to testing mean vectors with large dimensions. AStA Adv Stat Anal 103, 593–618 (2019). https://doi.org/10.1007/s10182-018-00343-z