Abstract
We present a systematic theory for tests for means of high-dimensional data. Our testing procedure is based on an invariance principle which provides distributional approximations of functionals of non-Gaussian vectors by those of Gaussian ones. Unlike the widely used Bonferroni approach, our procedure is dependence-adjusted and has asymptotically correct size and power. To obtain cutoff values for our test, we propose a half-sampling method which avoids estimating the underlying covariance matrix of the random vectors. Extensive simulations show that the latter method has excellent performance.
Keywords
- Gaussian approximation
- Goodness-of-Fit Test
- Half-sampling
- High-dimensional data
- Hypothesis testing
- Large p small n
- Rademacher weighted differencing
1 Introduction
With the advance of modern data collection techniques, high-dimensional data appear in various fields including physics, biology, healthcare, finance, marketing, social networks, and engineering, among others. A common feature of such datasets is that the data dimension or the number of involved parameters can be quite large. As a fundamentally important problem in the study of such data, one would like to perform statistical inference on those parameters, such as multiple testing or the construction of confidence regions. This allows one to answer the question of whether there is signal in the dataset, or whether the dataset consists only of random noise. Due to the high dimensionality, inferential procedures developed for low-dimensional problems may no longer be valid in the high-dimensional setting, and different approaches must be designed to account for it. There exists a huge literature on multiple testing; see, for example, Dudoit and van der Laan (2008), Efron (2010) and Dickhaus (2014).
We now introduce the setting of our testing problem. Assume that X 1, X 2, …, are independent and identically distributed (i.i.d.) p-dimensional random vectors, with mean vector μ = (μ 1, …, μ p)T = E(X i) and covariance matrix Σ = cov(X i) = (σ jk)j,k≤p. We are testing the hypothesis of existence of a signal
based on the sample X 1, …, X n. This formulation is actually very general and its solution can be applied to many other problems; see Sect. 8.2. We can estimate μ by the sample mean vector \(\hat \mu = \bar X_n = n^{-1} \sum _{i=1}^n X_i\). The classical Hotelling’s T-squared test has the form
where
is the sample covariance matrix estimate of Σ. If p is small and fixed, by the Central Limit Theorem (CLT),
By the Law of Large Numbers, if Σ is non-singular,
Clearly (8.4) and (8.5) imply that under H 0, the Hotelling’s T-squared statistic \(n T \Rightarrow \chi ^2_p\) (χ 2 distribution with degrees of freedom p). Thus we can reject H 0 at level 0 < α < 1 if \(n T > \chi ^2_{p, 1-\alpha }\), the (1 − α)th quantile of \(\chi ^2_p\).
In the high-dimensional situation in which p can be much larger than n, the CLT (8.4) is no longer valid; see Portnoy (1986). Furthermore, \(\hat \Sigma _n\) is singular and thus T is not well-defined. Also the matrix convergence (8.5) may not hold, see Marčenko and Pastur (1967). In this chapter we shall apply a testing functional approach that does not use \(\hat \Sigma _n^{-1}\) or the precision matrix Σ−1. A function \(g: \mathbb {R}^p \to [0, \infty )\) is said to be a testing functional if the following requirements are satisfied: (1) (monotonicity) for any \(x = (x_1, \ldots, x_p)^T \in \mathbb {R}^p\) and 0 < c < 1, g(cx) ≤ g(x); (2) (identifiability) g(x) = 0 if and only if x = 0. We shall consider the test statistic
Examples of g include the L 2-based test with \(g(x) = \sum _{j=1}^p x_j^2\), the L ∞-based test with g(x) =maxj≤p|x j|, the weighted empirical process \(g(x) = \sup _{u \ge 0} ( \sum _{j=1}^p \mathbf {1} _{|x_j| \ge u} h(u) )\), where h(⋅) is a nonnegative-valued non-decreasing function, among others. We reject H 0 in (8.1) if T n is too big.
As a theoretical foundation, we base our testing procedure on the following invariance principle result
where Z, Z 1, Z 2, … are i.i.d. N(0, Σ) random vectors and \(\bar Z_n = n^{-1} \sum _{i=1}^n Z_i =_{\mathcal {D}} n^{-1/2} Z\). Interestingly, though the CLT (8.4) does not generally hold in the high-dimensional setting, the testing functional form (8.7) may still be valid. Chernozhukov et al. (2014) proved (8.7) with the L ∞ norm g(x) =maxj≤p|x j|, while Xu et al. (2014) considered the L 2 based test with \(g(x) = \sum _{j=1}^p x_j^2\). In Sect. 8.5 we shall provide a sufficient condition so that (8.7) holds for certain testing functionals.
In applying (8.7) for testing (8.1), one needs to know the distribution of \(g(\sqrt {n} \bar Z_n) =_{\mathcal {D}} g(Z)\) so that a suitable cutoff value can be obtained. The latter problem is highly nontrivial since the covariance matrix Σ, which is viewed as a nuisance parameter here, is typically not known and the associated estimation issue can be quite challenging. In Sect. 8.5 we shall propose a half-sampling technique which can avoid estimating the nuisance covariance matrix Σ.
2 Applications
Our paradigm (8.1) is actually quite general: it can be applied to testing of high-dimensional covariance matrices, testing of independence of high-dimensional data, and analysis of variance with non-normal and heteroscedastic errors.
2.1 Testing of Covariance Matrices
There is a huge literature on testing covariance matrices for uncorrelatedness, sphericity, or other patterns. For Gaussian data, tests for Σ = σ 2Ip, where Ip is the identity matrix, can be found in Ahmad (2010), Birke and Dette (2005), Chen et al. (2010), Fisher et al. (2010) and Ledoit and Wolf (2002). Tests for equality of covariance matrices are studied in Bai et al. (2009) and Jiang et al. (2012), and tests for sphericity in Onatski et al. (2013). Minimax properties are considered in Cai and Ma (2013). For other contributions, see Qu and Chen (2012), Schott (2005, 2007), Srivastava (2005), Xiao and Wu (2013) and Zhang et al. (2013).
Assume that we have data matrix Y n = (Y i,j)1≤i≤n,1≤j≤p, where \((Y_{i, j})_{j=1}^p\), i = 1, …, n, are i.i.d. p-dimensional random vectors. Let
be the covariance function. Consider testing hypothesis for uncorrelatedness:
For simplicity assume that E(Y i,j) = 0. For a pair a = (j, k) write X i,a = Y i,j Y i,k, and \(\bar X_a = n^{-1} \sum _{i=1}^n X_{i, a}\) and the (p 2 − p)-dimensional vector \(\bar X = (\bar X_a)_{a \in {\mathcal {A}}}\), where \({\mathcal {A}} = \{(j, k): \, j\neq k, \, j \le p, k \le p\}\). The hypothesis H 0 in (8.9) can be tested by using the test statistics \(T= g(\sqrt {n} \bar X)\). Xiao and Wu (2013) considered the L ∞ based test with g(x) =maxi|x i|, generalizing the result in Jiang (2004) which concerns the special case for i.i.d. vectors with independent entries. Han and Wu (2017) performed an L 2 based test for patterns of covariances with the test statistic
With slight modifications, one can also test the sphericity hypothesis
where Ip is the p × p identity matrix. Let \({\mathcal {A}}_0 = \{(j, k): \, j, k \le p\}\) be \({\mathcal {A}}\) with the diagonal pairs added. For \(a = (j, j) \in {\mathcal {A}}_0\), let \(X_{i, a} = Y_{i, j}^2 - \sigma ^2\). If σ 2 is known, then H 0 in (8.11) can be rejected at level α ∈ (0, 1) if \(T = g(\sqrt {n} \bar X) > t_{1-\alpha }\), where t 1−α is the (1 − α)th quantile of g(Z) and Z is a centered Gaussian vector with covariance structure cov(Z a, Z b) = E(X i,a X i,b), \(a, b \in {\mathcal {A}}_0\). If σ 2 is not known, we can plug in an estimate, for example \(\hat \sigma ^2 = p^{-1} \sum _{j=1}^p \hat \sigma _{j j}\), and consider \(X_{i, a}^\circ = Y_{i, j}^2 - \hat \sigma ^2\) for a = (j, j); let \(X_{i, a}^\circ = X_{i, a}\) if a = (j, k) with j ≠ k. The hypothesis H 0 in (8.11) can then be tested by the statistic \(T^\circ = g(\sqrt {n} \bar X^\circ )\).
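As an illustration of how the (p 2 − p)-dimensional vector \(\bar X\) of Sect. 8.2.1 is formed from the data matrix, here is a short Python sketch (function names are ours; the L ∞ functional is used as in Xiao and Wu 2013):

```python
import numpy as np

def offdiag_products(Y):
    # For each observation i, form X_{i,a} = Y_{i,j} * Y_{i,k} over all
    # ordered pairs a = (j, k) with j != k.
    n, p = Y.shape
    prods = Y[:, :, None] * Y[:, None, :]   # (n, p, p) array of Y_ij * Y_ik
    mask = ~np.eye(p, dtype=bool)           # keep the off-diagonal pairs only
    return prods[:, mask]                   # shape (n, p*p - p)

def uncorrelatedness_stat(Y, g):
    # T = g(sqrt(n) * Xbar) with Xbar the mean vector of the products
    X = offdiag_products(Y)
    n = X.shape[0]
    return g(np.sqrt(n) * X.mean(axis=0))
```

For the sphericity test, the diagonal pairs (j, j) would be appended with \(X_{i,a} = Y_{i,j}^2 - \sigma^2\) as described above.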
2.2 Testing of Independence
Let \(Y_i = (Y_{i, j})_{j=1}^p\), i = 1, …, n, be i.i.d. p-dimensional random vectors with joint cumulative distribution function
Consider the problem of testing whether entries of Y i are independent. Assume that the marginal distributions are standard uniform[0, 1]. For j = (j 1, …, j d), write \(F_{\mathbf {j}} (y_{\mathbf {j}}) = F_{j_1, \ldots, j_d} (y_{j_1}, \ldots, y_{j_d})\). For fixed d, the hypothesis of d-wise independence is
where \({\mathcal {A}}_d = \{ \mathbf {j} = (j_1, \ldots, j_d): \, j_1 < \cdots < j_d \le p \}\). Pairwise and triple-wise independence correspond to d = 2 and d = 3, respectively. We estimate F j(y j) by the empirical cdf
where the notation Y i,j ≤ y j means \(Y_{i, j_h} \le y_{j_h}\) for all h = 1, …, d. Let \(y_{\mathbf { m}_1}, \ldots, y_{{\mathbf {m}}_N}\), N →∞, form a dense subset of [0, 1]d. For example, we can choose them to be the lattice set {1∕K, …, (K − 1)∕K}d with N = (K − 1)d. Let X i, 1 ≤ i ≤ n, be the Np!∕(d!(p − d)!)-dimensional vector with the (ℓ, j)th component being \({\mathbf {1}}_{ Y_{i, \mathbf {j}} \le y_{{\mathbf {m}}_\ell }} - \prod _{h \in {\mathbf {m}}_\ell } y_h \), 1 ≤ ℓ ≤ N, \(\mathbf {j} \in {\mathcal {A}}_d\). Then the L 2-based test for (8.13) on the dense set \((y_{{\mathbf {m}}_\ell } )_{\ell =1}^N\) has the form \(n | \bar X |{ }_2^2\).
2.3 Analysis of Variance
Consider the following two-way ANOVA model
where μ is the grand mean, α i and β j are the main effects of the first and the second factors, respectively, and δ ij are the interaction effects. Assume that (Y ijk)i≤I,j≤J, k = 1, …, K, are i.i.d. Consider the hypothesis of no interaction:
In the classical ANOVA procedure, one assumes that ε ijk, i ≤ I, j ≤ J, are i.i.d. N(0, σ 2) and makes use of the fact that the sum of squares
is distributed as \(\sigma ^2 \chi ^2_{(I-1)(J-1)}\). Here \(\bar Y_{i j \cdot } = K^{-1} \sum _{k=1}^K Y_{i j k}\) and other sample averages \(\bar Y_{i \cdot \cdot }\), \(\bar Y_{\cdot j \cdot }\) and \(\bar Y_{\cdot \cdot \cdot }\) are similarly defined. The null hypothesis H 0 is rejected at level α ∈ (0, 1) if
where F (I−1)(J−1),IJ(K−1),1−α is the (1 − α)th quantile of the F-distribution F (I−1)(J−1),IJ(K−1) and
is an estimate of σ 2.
The classical ANOVA procedure can be invalid when the assumption that ε ijk, i ≤ I, j ≤ J, are i.i.d. N(0, σ 2) is violated. In the latter case SS I may no longer have a χ 2 distribution. However, we can still approximate the distribution of SS I in terms of (8.7). For a = (i, j) let \(X_{a k} = Y_{i j k} - \bar Y_{i \cdot k} - \bar Y_{\cdot j k} + \bar Y_{\cdot \cdot k}\), where \(\bar Y_{i \cdot k} = J^{-1} \sum _{j=1}^J Y_{i j k}\) and \(\bar Y_{\cdot j k}\), \(\bar Y_{\cdot \cdot k}\) are defined similarly. Then \(SS_I = \sum _{a \in {\mathcal {A}}} \bar X_{a \cdot }^2\), where \(\bar X_{a \cdot } = K^{-1} \sum _{k=1}^K X_{a k}\).
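The decomposition in the last display is easy to vectorize. The following sketch (our own naming, with the responses stored as an I × J × K array) computes SS I via the \(X_{ak}\):

```python
import numpy as np

def interaction_ss(Y):
    # Y has shape (I, J, K). Form
    #   X_{ak} = Y_{ijk} - Ybar_{i.k} - Ybar_{.jk} + Ybar_{..k},  a = (i, j),
    # and return SS_I = sum_a (K^{-1} sum_k X_{ak})^2.
    X = (Y
         - Y.mean(axis=1, keepdims=True)        # Ybar_{i.k}
         - Y.mean(axis=0, keepdims=True)        # Ybar_{.jk}
         + Y.mean(axis=(0, 1), keepdims=True))  # Ybar_{..k}
    Xbar = X.mean(axis=2)                       # average over replicates k
    return float(np.sum(Xbar ** 2))
```

For a purely additive model (no interaction), the statistic is exactly zero, as the double-centering cancels both main effects.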
3 Tests Based on L ∞ Norms
Fan et al. (2007) considered the L ∞ norm based test of (8.1) with the form
Assume that the dimension p satisfies
and the uniform bounded third moment condition
Let Φ be the standard normal cumulative distribution function and z α = Φ−1(α). Then
Namely, if we perform the test by rejecting H 0 of (8.1) whenever M n ≥ z 1−α∕(2p), the familywise type I error of the latter test is asymptotically bounded by α. As a finite sample correction, the cutoff value z 1−α∕(2p) in (8.23) can be replaced by the t-distribution quantile t n−1,1−α∕(2p) with n − 1 degrees of freedom, noting that \((n-1)^{1/2} \hat \mu _j / \hat \sigma _j \sim t_{n-1}\) if the X ij are Gaussian. Due to the Bonferroni correction, the test of Fan et al. (2007) can be quite conservative if the dependence among entries of X i is strong. For example, if X i1 = X i2 = ⋯ = X ip, then instead of the cutoff value z 1−α∕(2p) one should use z 1−α∕2, since z 1−α∕(2p) leads to the extremely conservative type I error α∕(2p). If the entries of X i are independent and X i is Gaussian, then the type I error is 1 − (1 − α∕p)^p → 1 − e −α, which is only slightly conservative; for example, when α = 0.05, 1 − e −α = 0.04877058.
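For concreteness, the Bonferroni-corrected test just described can be coded as follows (a sketch; the function name is ours, and the standard normal quantile \(z_{1-\alpha/(2p)}\) is obtained from the standard library):

```python
import numpy as np
from statistics import NormalDist

def fhy_test(X, alpha=0.05):
    # Bonferroni-corrected L_inf test of H0: mu = 0 in the spirit of (8.20)/(8.23):
    # reject when M_n = max_j sqrt(n)|muhat_j|/sigmahat_j >= z_{1-alpha/(2p)}.
    n, p = X.shape
    muhat = X.mean(axis=0)
    sigmahat = X.std(axis=0, ddof=1)
    M_n = np.max(np.sqrt(n) * np.abs(muhat) / sigmahat)
    cutoff = NormalDist().inv_cdf(1 - alpha / (2 * p))
    return M_n, cutoff, M_n >= cutoff
```

Replacing `NormalDist().inv_cdf` by the t n−1 quantile gives the finite-sample correction mentioned above.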
Liu and Shao (2013) obtained Gumbel convergence of M n under the following conditions: (1) for some r > 3, the uniform bounded rth moment condition maxj≤p E|X ij − μ j|r = O(1) holds, which is slightly stronger than (8.22), and (2) weak dependence among entries of X i. For Σ = (σ jk)j,k≤p, assume that the correlation matrix R = (r jk)j,k≤p with \(r_{jk} = \sigma _{j k} / (\sigma _{j j}^{1/2} \sigma _{k k}^{1/2})\) has the property that, for some γ > 0,
holds for all ρ > 0. Then under (8.21), Theorem 3.1 in Liu and Shao (2013) asserts the Gumbel convergence
where \({\mathcal {G}}\) follows the Gumbel distribution \(P({\mathcal {G}} \le y) = \exp (-e^{-y/2}/\pi ^{1/2})\). By (8.25), one can reject H 0 in (8.1) at level α ∈ (0, 1) based on the L ∞ norm test
where g 1−α is chosen such that \(P({\mathcal {G}} \le g_{1-\alpha }) = 1 - \alpha \). Clearly the latter test has an asymptotically correct size.
Applying Theorem 2.2 in Chernozhukov et al. (2014), we have the following Gaussian approximation result. Assume that there exist constants c 1, c 2 > 0 such that c 1 ≤ E(X ij − μ j)2 ≤ c 2 holds for all j ≤ p and assume that u = u n,p satisfies
Let m k =maxj≤p(E|X 1j − μ j|k)1∕k and further assume that
Let Z ∼ N(0, R). Then we have the Gaussian approximation result: as n →∞
Let t 1−α be the (1 − α)th quantile of |Z|∞. The Gaussian approximation (8.29) leads to the L ∞ norm based test: H 0 is rejected at level α if \(\max _{j \le p} \sqrt {n} |\hat \mu _j| / \hat \sigma _j \ge t_{1-\alpha }\). In comparison with the result in Fan et al. (2007), the latter test has an asymptotically correct size and is dependence-adjusted. To obtain an estimate of the cutoff value t 1−α, Chernozhukov et al. (2014) proposed a Gaussian Multiplier Bootstrap (GMB) method. Given X 1, …, X n, let \(\hat t_{1-\alpha }\) be such that
where e i are i.i.d. N(0, 1) random variables independent of (X ij)i≥1,j≥1. Note that \( \hat t_{1-\alpha }\) can be numerically calculated by extensive Monte Carlo simulations. In Sect. 8.5 we shall propose Hadamard matrix and Rademacher weighting approaches. The simulation study in Sect. 8.6 shows that, in terms of finite-sample performance, the latter approaches give a more accurate size than the method based on the Gaussian Multiplier Bootstrap (8.30).
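A minimal Monte Carlo implementation of the multiplier bootstrap cutoff might look as follows (a sketch under our reading of (8.30); the studentization and the number of multiplier draws B are our choices):

```python
import numpy as np

def gmb_cutoff(X, alpha=0.05, B=1000, seed=0):
    # Gaussian multiplier bootstrap estimate of the (1-alpha)th quantile of the
    # studentized L_inf statistic. Each draw replaces X_i - Xbar by
    # e_i * (X_i - Xbar) with e_i iid N(0,1), holding the data fixed.
    rng = np.random.default_rng(seed)
    n, p = X.shape
    Xc = X - X.mean(axis=0)             # centered data
    sig = X.std(axis=0, ddof=1)
    stats = np.empty(B)
    for b in range(B):
        e = rng.standard_normal(n)
        stats[b] = np.max(np.abs(e @ Xc) / (np.sqrt(n) * sig))
    return np.quantile(stats, 1 - alpha)
```

The returned \(\hat t_{1-\alpha}\) is then compared with \(\max_{j\le p}\sqrt{n}|\hat\mu_j|/\hat\sigma_j\).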
Chen et al. (2016) generalized Fan, Hall and Yao’s L ∞ norm to high-dimensional dependent vectors. Assume that \((X_i)_{i \in \mathbb {Z}}\) is a p-dimensional stationary process of the form
where ε t, \(t\in \mathbb {Z}\), are i.i.d. random variables, \({\mathcal {F}}_t = (\ldots, \varepsilon _{t-1}, \varepsilon _t)\) and G(⋅) is a measurable function such that X t is well-defined. Assume that the long-run covariance matrix
exists. Let \(\varepsilon _i^\ast, \varepsilon _j, i, j\in \mathbb {Z}\), be i.i.d. random variables. Assume that X t has finite rth moment, r > 2. Define the functional dependence measures (see, Wu 2005, 2011) as
If the X i are i.i.d., then Σ∞ = Σ and θ r(m) = 0 for m ≥ 1. We say that (X t) satisfies the geometric moment contraction (GMC; see Wu and Shao 2004) property if there exist ρ ∈ (0, 1) and a 1 > 0 such that
Let μ = EX t. To test the hypothesis H 0 in (8.1), Chen et al. (2016) introduced the following dependence-adjusted versions of Fan, Hall, and Yao’s M n. Let n = mk, where m ≍ n 1∕4 and blocks B l = {i : m(l − 1) + 1 ≤ i ≤ ml}. Let \(Y_{l j} = \sum _{i \in B_l} X_{i j}\), 1 ≤ j ≤ p, 1 ≤ l ≤ k, be the block sums. Define the block-normalized sum
and the interlacing normalized sum: let k ∗ = k∕2, \(\mu ^\dagger _j = (m k^*)^{-1} \sum _{l=1}^{k^*} Y_{2 l j}\) and
By Chen et al. (2016), we have the following result. Assume that there exists a constant ζ > 0 such that the long-run variance ω jj ≥ ζ for all j ≤ p, that (8.34) holds with r = 3, and that
Then (8.23) holds for both the block-normalized sum \(M_n^\circ \) and the interlacing normalized sum \(M_n^\dagger \). Note that, while (8.37) still allows ultra high dimensions, due to dependence, the allowed dimension p in condition (8.37) is smaller than the one in (8.21). Additionally, if the GMC (8.34) holds with some r > 3, (8.24) holds with the long-run correlation matrix R = D −1∕2 Σ∞ D −1∕2, where D = diag( Σ∞), and for some 0 < τ < 1∕4,
then we have the Gumbel convergence for the interlacing normalized sum:
where \({\mathcal {G}}\) is given in (8.25). Similarly as (8.26), one can perform the following test which has an asymptotically correct size: we reject H 0 in (8.1) at level α ∈ (0, 1) if
4 Tests Based on L 2 Norms
In this section we shall consider the test which is based on the L 2 functional with \(g(x) = \sum _{j=1}^p x_j^2\). Let λ 1 ≥⋯ ≥ λ p ≥ 0 be the eigenvalues of Σ. For Z ∼ N(0, Σ), we have the distributional equality \(g(Z) = Z^T Z =_{\mathcal {D}} \sum _{j=1}^p \lambda _j \eta _j^2\), where η j are i.i.d. standard N(0, 1) random variables. Let \(f_k = (\sum _{j=1}^p \lambda _j^k)^{1/k}\), k > 0, and f = f 2. Then Eg(Z) = f 1 = tr( Σ) and var(g(Z)) = 2f 2. Xu et al. (2014) provide a sufficient condition for the invariance principle (8.7) with the quadratic functional g. For some 0 < δ ≤ 1 let q = 2 + δ.
Condition 1
Let δ > 0. Assume EX 1 = 0, E|X 1|2q < ∞ and let
Observe that Condition 1, (8.41) and (8.42) are Lyapunov-type conditions. Assume that
Then (8.7) holds (cf Xu et al. 2014). Consequently we have
In the literature, researchers primarily focus on developing the central limit theorem
or its modified version; see, for example, Bai and Saranadasa (1996), Chen and Qin (2010) and Srivastava (2009). Xu et al. (2014) clarified an important issue on the CLT of R n. By the Lindeberg–Feller central limit theorem, V ⇒ N(0, 2) as p →∞ holds if and only if λ 1∕f → 0. The distributional approximation (8.44) indicates that, if λ 1∕f does not go to 0, then the central limit theorem cannot hold for R n.
Let t 1−α be the (1 − α)th quantile of g(Z) = |Z|2 = Z T Z. By (8.7) we can reject (8.1) at level α ∈ (0, 1) if
To calculate t 1−α, one needs to know the eigenvalues λ 1, …, λ p. However, estimation of those eigenvalues is a very challenging problem, in particular if one does not impose certain structural assumptions on Σ. In Sect. 8.5.2 we shall propose a half-sampling based approach which does not need estimation of the covariance matrix Σ.
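When Σ (hence its spectrum) is known, t 1−α can be obtained by simulating the weighted χ 2 representation \(\sum_{j=1}^p \lambda_j \eta_j^2\) above; a sketch (function name ours):

```python
import numpy as np

def l2_null_quantile(Sigma, alpha=0.05, B=4000, seed=0):
    # g(Z) = Z^T Z  =_D  sum_j lambda_j * eta_j^2 with eta_j iid N(0,1);
    # simulate this weighted chi-square mixture and return the empirical
    # (1-alpha)th quantile as an estimate of t_{1-alpha}.
    rng = np.random.default_rng(seed)
    lam = np.linalg.eigvalsh(Sigma)                      # eigenvalues of Sigma
    draws = (rng.standard_normal((B, len(lam))) ** 2) @ lam
    return np.quantile(draws, 1 - alpha)
```

This is exactly the step that becomes infeasible when Σ is unknown, which motivates the half-sampling approach of Sect. 8.5.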
The L ∞ based tests discussed in Sect. 8.3 have good power when the alternative consists of a few large signals. If the signals are small and of similar magnitude, then the L 2 test is more powerful. To see this, assume that there exist a constant c > 0 and a small δ > 0 such that cδ ≤ μ j ≤ δ∕c holds for all j = 1, …, p. We can interpret δ as the departure parameter (from the null H 0 with μ = 0). For the L ∞-based test to have power approaching 1, one necessarily requires \(\sqrt {n} \delta \to \infty \). An elementary calculation shows that, under the much weaker condition np 1∕2 δ 2 →∞, the power of the L 2 based test, namely the probability that the event in (8.46) occurs, goes to one. In the latter condition, a larger dimension p is actually a blessing, as it requires a smaller departure δ.
5 Asymptotic Theory
In Sects. 8.3 and 8.4, we discussed the classical L ∞ and L 2 functionals, respectively. For a general testing functional, we have the following invariance principle (cf Theorem 1), which asserts that functionals of sample means of non-Gaussian random vectors X 1, X 2, … can be approximated by those of Gaussian vectors Z 1, Z 2, … with the same covariance structure. Assume \(g \in \mathbb {C}^3(\mathbb {R}^p)\). For x = (x 1, …, x p)T write g j = g j(x) = ∂g(x)∕∂x j. Similarly we define the partial derivatives g jk and g jkl. For all j, k, l = 1, …, p, assume that
For Z 1 ∼ N(0, Σ) write Z 1 = (Z 11, …, Z 1p)T. Define
For \(g(Z_1) = _{\mathcal {D}} g(\sqrt {n} \bar Z_n)\), we assume that its c.d.f. F(t) = P[g(Z) ≤ t] is Hölder continuous: there exist ℓ p > 0 and an index α > 0 such that, for all ψ > 0, the concentration function
Theorem 1 (Lou and Wu (2018))
Assume (8.47), (8.49) and \(\mathcal {K}_p \ell _p^{3/\alpha } = o(\sqrt {n})\) . Then
To apply Theorem 1 for hypothesis testing, we need to know the c.d.f. F(t) = P[g(Z) ≤ t]. Note that F(⋅) depends on g and the covariance matrix Σ. Thus we can also write F(⋅) = F g,Σ(⋅). If Σ is known, the distribution of g(Z) is completely known and its cdf F(t) = P[g(Z) ≤ t] can be calculated either analytically or by extensive Monte Carlo simulations. Let t 1−α, 0 < α < 1, be the (1 − α)th quantile of g(Z). Namely
Then the null hypothesis H 0 in (8.1) is rejected at level α if the test statistic \(T_n = g(\sqrt {n} \bar X_n) > t_{1-\alpha }\). This test has asymptotically correct size α. Additionally, the (1 − α) confidence region for μ can be constructed as
If Σ is not known, as a straightforward way to approximate F(t) = F g,Σ(t), one may use an estimate \(\tilde \Sigma \) so that F g,Σ(t) can be approximated by \(F_{g, \tilde \Sigma }(t)\). Here we do not adopt this approach for the following two reasons. First, it can be quite difficult to consistently estimate Σ without assuming sparseness or other structural conditions. The latter assumptions are widely used in the literature; see, for example, Bickel and Levina (2008a), Bickel and Levina (2008b), Cai et al. (2011) and Fan et al. (2013). Second, it is difficult to quantify the difference \(F_{g, \tilde \Sigma }(\cdot ) - F(\cdot )\) based on operator norm or other type of matrix convergence of the estimate \(\tilde \Sigma \). Xu et al. (2014) argued that, for the L 2 test with \(g(x) = \sum _{j=1}^p x_j^2\), one needs to use the normalized consistency of \(\tilde \Sigma \), instead of the widely used operator norm consistency. We propose using half-sampling and balanced Rademacher schemes.
5.1 Preamble: i.i.d. Gaussian Data
In practice, however, the covariance matrix Σ is typically unknown. Assume at the outset that X 1, …, X n are i.i.d. N(μ, Σ) vectors and that n = 4m, where m is a positive integer. Then we can estimate the cumulative distribution function F(t) = P[g(Z) ≤ t] by using Hadamard matrices (see Georgiou et al. 2003; Hedayat and Wallis 1978; Yarlagadda and Hershey 1997). We say that H is an n × n Hadamard matrix if its first row consists entirely of 1s and all its entries take values 1 or − 1 such that
where I n is the n × n identity matrix. Let
By (8.53), we have \(\sum _{i=1}^n H_{j i} = 0\) for 2 ≤ j ≤ n and \(\sum _{i=1}^n H_{j i} H_{j' i}= 0\) if j ≠ j′. Since X 1, …, X n are i.i.d. N(μ, Σ), it is clear that Y 2, …, Y n are also i.i.d. N(0, Σ) vectors. Hence the random variables g(Y 2), …, g(Y n) are independent and identically distributed as g(Z). Therefore we can construct the empirical cumulative distribution function
which converges uniformly to F(t) as n →∞, and t 1−α can be estimated by \(\hat t_{1-\alpha } = \hat F_n^{-1}(1-\alpha ) \), the (1 − α)th empirical quantile of \(\hat F_n(\cdot )\). As an important feature of the latter method, one does not need to estimate the nuisance covariance matrix Σ. In combinatorial experimental design, however, it is highly nontrivial to construct Hadamard matrices. If n is a power of 2, one can simply apply Sylvester’s construction. The Hadamard conjecture states that a Hadamard matrix of order n exists whenever 4 divides n; this conjecture is still open. For example, it is unclear whether a Hadamard matrix exists for n = 668 (see Brent et al. 2015).
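For n a power of 2, Sylvester’s construction and the resulting empirical quantile estimate can be sketched as follows (our own function names; recall that the first row of H is dropped because it carries the nonzero mean):

```python
import numpy as np

def sylvester_hadamard(n):
    # Sylvester's construction; valid when n is a power of 2.
    H = np.array([[1]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H

def hadamard_ecdf_quantile(X, g, alpha=0.05):
    # Y_j = n^{-1/2} sum_i H_{ji} X_i, j = 2..n, are iid N(0, Sigma) for
    # Gaussian data; estimate t_{1-alpha} by the empirical (1-alpha)th
    # quantile of g(Y_2), ..., g(Y_n).
    n = X.shape[0]
    assert n & (n - 1) == 0, "n must be a power of 2 for this construction"
    H = sylvester_hadamard(n)
    Y = (H[1:] @ X) / np.sqrt(n)   # drop the all-ones first row
    vals = np.array([g(y) for y in Y])
    return np.quantile(vals, 1 - alpha)
```

The orthogonality of the rows of H is what makes the g(Y j) exactly independent in the Gaussian case.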
5.2 Rademacher Weighted Differencing
To circumvent the existence problem of Hadamard matrices in Sect. 8.5.1, we shall construct asymptotically independent realizations by using Rademacher random variables. Let \(\varepsilon _{j k}, j, k \in \mathbb {Z}\), independent of (X i)i≥1, be i.i.d. Bernoulli random variables with P(ε jk = 1) = P(ε jk = −1) = 1∕2. Define the Rademacher weighted differences
where the random set
When defining Y j, we require that A j satisfies |A j|≠ 0 and |A j|≠ n. By the Hoeffding inequality, |A j| concentrates around n∕2 in the sense that, for u ≥ 0, \(P( ||A_j| -n/2| \ge u) \le 2 \exp (-2 u^2 / n)\). Alternatively, we consider the balanced Rademacher weighted differencing: let \(A_1^\circ, A_2^\circ, \ldots \) be simple random samples drawn equally likely from \({\mathcal {A}}_m = \{ A\subset \{1, \ldots, n \}: \, |A| = m\}\), where m = ⌊n∕2⌋. Similarly to Y j in (8.56), we define
Clearly, given A j (resp. \(A_j^\circ \)), Y j (resp. \(Y_j^\circ \)) has mean 0 and covariance matrix Σ. Based on Y j in (8.56) (resp. \(Y_j^\circ \) in (8.58)), define the empirical distribution functions
where N →∞ and
For sets A, B ⊂{1, …, n}, let A c = {1, …, n}− A, B c = {1, …, n}− B and
If A, B are chosen according to a Hadamard matrix, then d(A, B) = 0. Assume that
Then there exists an absolute constant c > 0 such that
Again by the Hoeffding inequality, if we choose A 1, A 2 according to (8.57), there exist absolute constants c 1, c 2 > 0 such that \(P( d(A_1, A_2) \ge u) \le c_1 \exp (-c_2 u^2 / n)\), indicating that (8.61) holds with probability close to 1, that d(A 1, A 2) = O P(n 1∕2), and hence that weak orthogonality holds with δ(A 1, A 2) = O P(n −1∕2).
Theorem 2 (Lou and Wu (2018))
Under conditions of Theorem 1 , we have \(\sup _t |\hat F^\circ _N(t) - F(t)| \to 0\) in probability as N →∞.
5.3 Calculating the Power
The asymptotic power expression is
Given the sample X 1, …, X n whose mean vector μ may not necessarily be 0, based on the estimated \(\hat t_{1-\alpha }\) from the empirical cumulative distribution functions (8.59) and (8.60), we can actually estimate the power function by the following:
5.4 An Algorithm with General Testing Functionals
For ease of application, we shall in this section provide details of testing the hypothesis H 0 in (8.1) using the Rademacher weighting scheme described in Sect. 8.5.2.
Algorithm 1: Rademacher weighted testing procedure
_________________
-
1.
Input X 1, …, X n;
-
2.
Compute the average \(\bar X_n\) and the test statistic \(T = g(\sqrt {n} \bar X_n)\);
-
3.
Choose a large N in (8.60) and obtain the empirical quantile \(\hat t^\circ _{1-\alpha }\);
-
4.
Reject H 0 at level α if \(T > \hat t^\circ _{1-\alpha }\);
-
5.
Report the p-value as \(1 - \hat F^\circ _N(T)\).
To construct a confidence region for μ, one can use (8.52) with t 1−α therein replaced by the empirical quantile \(\hat t^\circ _{1-\alpha }\).
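Putting the pieces together, Algorithm 1 admits the following compact implementation (a sketch: the exact normalization of \(Y_j^\circ\) in (8.58) is our assumption, chosen so that, given \(A_j^\circ\), the difference has mean 0 and covariance Σ for any μ):

```python
import numpy as np

def balanced_halfsample_quantile(X, g, alpha=0.05, N=1000, seed=0):
    # Balanced Rademacher weighted differencing. For each j draw A_j uniformly
    # from {A : |A| = m}, m = floor(n/2), and form the scaled difference of the
    # two half-sample means,
    #   Y_j = (1/|A_j| + 1/|A_j^c|)^(-1/2) * (Xbar_{A_j} - Xbar_{A_j^c}),
    # then estimate t_{1-alpha} by the empirical quantile of g(Y_1),...,g(Y_N).
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    m = n // 2
    scale = (1.0 / m + 1.0 / (n - m)) ** -0.5
    vals = np.empty(N)
    for j in range(N):
        mask = np.zeros(n, dtype=bool)
        mask[rng.choice(n, size=m, replace=False)] = True
        vals[j] = g(scale * (X[mask].mean(axis=0) - X[~mask].mean(axis=0)))
    return np.quantile(vals, 1 - alpha), vals

def rademacher_test(X, g, alpha=0.05, N=1000, seed=0):
    # Steps 1-5 of Algorithm 1: statistic, empirical cutoff, decision, p-value.
    n = X.shape[0]
    T = g(np.sqrt(n) * X.mean(axis=0))
    cutoff, vals = balanced_halfsample_quantile(X, g, alpha, N, seed)
    pvalue = np.mean(vals >= T)   # fraction of g(Y_j) exceeding T
    return T, cutoff, T > cutoff, pvalue
```

Note that no estimate of Σ appears anywhere: the half-sample differences play the role of the Gaussian replicates g(Y 2), …, g(Y n) of Sect. 8.5.1.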
6 Numerical Experiments
In this section, we shall perform a simulation study and evaluate the finite-sample performance of our Algorithm 1 with \(\hat {F}_N^\circ (t)\) defined in (8.60). Tests for mean vectors and covariance matrices are considered in Sects. 8.6.1 and 8.6.2, respectively. Section 8.6.3 contains a real data application on testing correlations between different pathways of a pancreatic ductal adenocarcinoma dataset.
6.1 Test of Mean Vectors
We consider three different testing functionals: for \(x=(x_1,\ldots,x_p)^\top \in \mathbb {R}^p\), let
For the L ∞ form g 1(x), four different testing procedures are compared: the procedure using our Algorithm 1 with \(\hat {F}_N^\circ (\cdot )\) replaced by \(\hat {F}_N (\cdot )\); cf (8.59); or by
and ε ji are i.i.d. Bernoulli(1∕2) independent of (X ij); the test of Fan et al. (2007) (FHY, see (8.20) and (8.23)) and the Gaussian Multiplier Bootstrap method in Chernozhukov et al. (2014) (CCK, see (8.30)).
For g 2(x), we compare the performance of our Algorithm 1 with \(\hat {F}_N^\circ (\cdot )\), \(\hat {F}_N(\cdot )\) and \(\hat {F}_N^\dagger (\cdot )\), and also the CLT-based procedure of Chen and Qin (2010) (CQ), which is a variant of (8.45) with the numerator \(n \bar X_n^T \bar X_n - f_1\) therein replaced by \(n^{-1} \sum _{i\neq j}X_i^\top X_j\).
The portmanteau testing functional g 3(x) is a marked weighted empirical process.
For our Algorithm 1 and the Gaussian Multiplier Bootstrap method, we calculate the empirical cutoff values with N = 4000. For each functional, we consider two models and use n = 40, 80 and p = 500, 1000. The empirical sizes for each case are calculated based on 1000 simulations.
Example 1 (Factor Model)
Let Z ij be i.i.d. N(0, 1) and consider
Then the X i are i.i.d. N(0, Σ) with \(\Sigma = I_p + p^{2\delta} \mathbf {1} \mathbf {1}^\top \), where 1 = (1, …, 1)⊤. A larger δ implies stronger correlation among the entries X i1, …, X ip.
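One way to realize this factor model (our reading of (8.66); the latent factor z i and the function name are ours) is:

```python
import numpy as np

def simulate_factor_model(n, p, delta, seed=0):
    # X_ij = Z_ij + p^delta * z_i with Z_ij, z_i iid N(0,1), so that the rows
    # X_i are iid N(0, Sigma) with Sigma = I_p + p^(2*delta) * 1 1^T.
    rng = np.random.default_rng(seed)
    Z = rng.standard_normal((n, p))
    z = rng.standard_normal(n)          # common latent factor
    return Z + (p ** delta) * z[:, None]
```

The off-diagonal entries of Σ all equal p 2δ, which is why increasing δ strengthens the dependence.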
Table 8.1 reports empirical sizes for the factor model with g 1(⋅) at the 5% significance level. For each choice of p, n, and δ, our Algorithm 1 with \(\hat {F}_N^\circ (\cdot )\) and \(\hat {F}_N (\cdot )\) perform reasonably well, while the empirical sizes using \(\hat {F}_{N}^\dagger (\cdot )\) are generally slightly larger than 5%. The empirical sizes using Chernozhukov et al.’s (8.30) or Fan et al.’s (8.23) are substantially different from the nominal level 5%. For large δ, as expected, the procedure of Fan, Hall, and Yao can be very conservative.
The empirical sizes for the factor model using g 2(⋅) are summarized in Table 8.2. Our Algorithm 1 with \(\hat {F}_N^\circ (\cdot )\) and \(\hat {F}_N(\cdot )\) perform quite well. The empirical sizes for Chen and Qin’s procedure deviate significantly from 5%. This can be explained by the fact that CLT of type (8.45) is no longer valid for model (8.66); see the discussion following (8.45) and Theorem 2.2 in Xu et al. (2014).
When using functional g 3(x), our Algorithm 1 with \(\hat {F}_N^\circ (\cdot )\) and \(\hat {F}_N (\cdot )\) perform slightly better than \(\hat {F}_N^\dagger (\cdot )\) and approximate the nominal 5% level well (Table 8.3).
Example 2 (Multivariate t-Distribution)
Consider the multivariate t ν vector
where the degrees of freedom ν = 4, \(\Sigma =(\sigma _{jk})_{j,k=1}^p\), σ jj = 1 for j = 1, …, p and
and Y i ∼ N(0, Σ), \(W_i \sim \chi _{\nu }^2\) are independent. The above covariance structure allows long-range dependence among X i1, …, X ip; see Veillette and Taqqu (2013).
We summarize the simulated sizes for model (8.67) in Tables 8.4, 8.5, and 8.6. As in Example 1, similar conclusions apply here. Due to long-range dependence, the procedure of Fan, Hall, and Yao appears conservative. The Gaussian Multiplier Bootstrap (8.30) yields empirical sizes that are quite different from 5%. The CLT-based procedure of Chen and Qin is severely affected by the dependence. In practice we suggest using Algorithm 1 with \(\hat {F}_N^\circ (\cdot )\) which has a good size accuracy.
6.2 Test of Covariance Matrices
6.2.1 Size Accuracy
We first consider testing for H 0a : Σ = I for the following model:
where the ε ij are i.i.d. (1) standard normal; (2) centralized Gamma(4,1); or (3) Student t 5 random variables. We then study the second test H 0b : Σ1,2 = 0 by partitioning the entire random vector X i = (X i1, …, X ip)T equally into two subvectors of dimensions p 1 = p∕2 and p 2 = p − p 1. In the simulation, we generate the samples of the two subvectors independently according to model (8.68). We shall use Algorithm 1 with the L 2 functional. Tables 8.7 and 8.8 report the simulated sizes based on 1000 replications with N = 1000 half-sampling implementations; they are reasonably close to the nominal level 5%.
6.2.2 Power Curve
To assess the power for testing H 0 : Σ = Ip using the L 2 test, we consider the model
where ε ij and ζ i are i.i.d. Student t 5 and ρ is chosen to be 0, 0.02, 0.04, …, 0.7. The power curve is shown in Fig. 8.1. As expected, the power increases with n.
6.3 A Real Data Application
We now apply our testing procedures to a pancreatic ductal adenocarcinoma (PDAC) dataset, preprocessed from NCBI’s Gene Expression Omnibus and accessible through GEO Series accession number GSE28735 (http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE28735). The dataset consists of two classes of gene expression levels, from 45 pancreatic tumor patients and 45 pancreatic normal patients, over a total of 28,869 genes. We shall test the existence of correlations between two subvectors, which can be useful for identifying sets of genes that are significantly correlated.
We consider genetic pathways of the PDAC dataset. Pathways can be highly significantly associated with the disease even if they harbor only a small number of individually significant genes. According to the KEGG database, the pathway “hsa05212” is relevant to pancreatic cancer. Among the 28,869 genes, 66 are mapped to this pathway. We are interested in testing whether the pathway for pancreatic cancer is correlated with some common pathways: “hsa04950” (21 genes, “Maturity onset diabetes of the young”), “hsa04940” (59 genes, “Type I diabetes mellitus”), and “hsa04972” (87 genes, “Pancreatic secretion”). Let W i, X i, Y i, and Z i be the expression levels of individual i from the tumor group for pathways “hsa05212,” “hsa04950,” “hsa04940,” and “hsa04972,” respectively. The null hypotheses are \(H^T_{0 1}: \mathrm {cov}(W_i, X_i) = 0_{66 \times 21}\), \(H^T_{0 2}: \mathrm {cov}(W_i, Y_i) = 0_{66 \times 59}\) and \(H^T_{0 3}: \mathrm {cov}(W_i, Z_i) = 0_{66 \times 87}\). Similar null hypotheses \(H^N_{0 1}, H^N_{0 2}, H^N_{0 3}\) can be formulated for the normal group. Our L 2 test of Algorithm 1 is compared with the Gaussian multiplier bootstrap (8.30). The results are summarized in Table 8.9. The CCK test is not able to reject the null hypothesis \(H^T_{0 3}\) at the 5% level since it gives a p-value of 0.063291. Using the L 2 test, however, \(H^T_{0 3}\) is rejected, suggesting that there is a substantial correlation between pathways “hsa05212” and “hsa04972.” Similar claims can be made for the other cases. The L 2 test also suggests that, at the 0.1% level, the hypotheses \(H^T_{0 2}\) and \(H^T_{0 3}\) are rejected for the tumor group, while the hypotheses \(H^N_{0 2}\) and \(H^N_{0 3}\) are not rejected for the normal group.
References
Ahmad MR (2010) Tests for covariance matrices, particularly for high dimensional data. Technical Reports, Department of Statistics, University of Munich. http://epub.ub.uni-muenchen.de/11840/1/tr091.pdf. Accessed 3 Apr 2018
Bai ZD, Saranadasa H (1996) Effect of high dimension: by an example of a two sample problem. Stat Sin 6:311–329
Bai ZD, Jiang DD, Yao JF, Zheng SR (2009) Corrections to LRT on large-dimensional covariance matrix by RMT. Ann Stat 37:3822–3840
Bickel PJ, Levina E (2008a) Regularized estimation of large covariance matrices. Ann Stat 36:199–227
Bickel PJ, Levina E (2008b) Covariance regularization by thresholding. Ann Stat 36:2577–2604
Birke M, Dette H (2005) A note on testing the covariance matrix for large dimension. Stat Probab Lett 74:281–289
Brent RP, Osborn JH, Smith WD (2015) Probabilistic lower bounds on maxima determinants of binary matrices. Available at http://arxiv.org/pdf/1501.06235. Accessed 3 Apr 2018
Cai Y, Ma ZM (2013) Optimal hypothesis testing for high dimensional covariance matrices. Bernoulli 19:2359–2388
Cai T, Liu WD, Luo X (2011) A constrained l 1 minimization approach to sparse precision matrix estimation. J Am Stat Assoc 106:594–607
Chen SX, Qin Y-L (2010) A two-sample test for high-dimensional data with applications to gene-set testing. Ann Stat 38:808–835
Chen SX, Zhang L-X, Zhong P-S (2010) Tests for high-dimensional covariance matrices. J Am Stat Assoc 105:810–819
Chen XH, Shao QM, Wu WB, Xu LH (2016) Self-normalized Cramér type moderate deviations under dependence. Ann Stat 44:1593–1617
Chernozhukov V, Chetverikov D, Kato K (2013) Gaussian approximations and multiplier bootstrap for maxima of sums of high-dimensional random vectors. Ann Stat 41:2786–2819
Dickhaus T (2014) Simultaneous statistical inference: with applications in the life sciences. Springer, Heidelberg
Dudoit S, van der Laan M (2008) Multiple testing procedures with applications to genomics. Springer, New York
Efron B (2010) Large-scale inference: empirical Bayes methods for estimation, testing, and prediction. Cambridge University Press, Cambridge
Fan J, Hall P, Yao Q (2007) To how many simultaneous hypothesis tests can normal, Student’s t or bootstrap calibration be applied. J Am Stat Assoc 102:1282–1288
Fan J, Liao Y, Mincheva M (2013) Large covariance estimation by thresholding principal orthogonal complements. J R Stat Soc Ser B Stat Methodol 75:603–680
Fisher TJ, Sun XQ, Gallagher CM (2010) A new test for sphericity of the covariance matrix for high dimensional data. J Multivar Anal 101:2554–2570
Georgiou S, Koukouvinos C, Seberry J (2003) Hadamard matrices, orthogonal designs and construction algorithms. In: Designs 2002: further computational and constructive design theory. Kluwer, Boston, pp 133–205
Han YF, Wu WB (2017) Test for high dimensional covariance matrices. Submitted to Ann Stat
Hedayat A, Wallis WD (1978) Hadamard matrices and their applications. Ann Stat 6:1184–1238
Jiang TF (2004) The asymptotic distributions of the largest entries of sample correlation matrices. Ann Appl Probab 14:865–880
Jiang DD, Jiang TF, Yang F (2012) Likelihood ratio tests for covariance matrices of high-dimensional normal distributions. J Stat Plann Inference 142:2241–2256
Ledoit O, Wolf M (2002) Some hypothesis tests for the covariance matrix when the dimension is large compared to the sample size. Ann Stat 30:1081–1102
Liu WD, Shao QM (2013) A Cramér moderate deviation theorem for Hotelling’s T 2-statistic with applications to global tests. Ann Stat 41:296–322
Lou ZP, Wu WB (2018) Construction of confidence regions in high dimension (Paper in preparation)
Marčenko VA, Pastur LA (1967) Distribution of eigenvalues for some sets of random matrices. Math USSR Sb 1:457–483
Onatski A, Moreira MJ, Hallin M (2013) Asymptotic power of sphericity tests for high-dimensional data. Ann Stat 41:1204–1231
Portnoy S (1986) On the central limit theorem in \(\mathbb {R}^p\) when p →∞. Probab Theory Related Fields 73:571–583
Qu YM, Chen SX (2012) Test for bandedness of high-dimensional covariance matrices and bandwidth estimation. Ann Stat 40:1285–1314
Schott JR (2005) Testing for complete independence in high dimensions. Biometrika 92:951–956
Schott JR (2007) A test for the equality of covariance matrices when the dimension is large relative to the sample size. Comput Stat Data Anal 51:6535–6542
Srivastava MS (2005) Some tests concerning the covariance matrix in high-dimensional data. J Jpn Stat Soc 35:251–272
Srivastava MS (2009) A test for the mean vector with fewer observations than the dimension under non-normality. J Multivar Anal 100:518–532
Veillette MS, Taqqu MS (2013) Properties and numerical evaluation of the Rosenblatt distribution. Bernoulli 19:982–1005
Wu WB (2005) Nonlinear system theory: another look at dependence. Proc Natl Acad Sci USA 102:14150–14154 (electronic)
Wu WB (2011) Asymptotic theory for stationary processes. Stat Interface 4:207–226
Wu WB, Shao XF (2004) Limit theorems for iterated random functions. J Appl Probab 41:425–436
Xiao H, Wu WB (2013) Asymptotic theory for maximum deviations of sample covariance matrix estimates. Stoch Process Appl 123:2899–2920
Xu M, Zhang DN, Wu WB (2014) L2 asymptotics for high-dimensional data. Available at http://arxiv.org/pdf/1405.7244v3. Accessed 3 Apr 2018
Yarlagadda RK, Hershey JE (1997) Hadamard matrix analysis and synthesis. Kluwer, Boston
Zhang RM, Peng L, Wang RD (2013) Tests for covariance matrix with fixed or divergent dimension. Ann Stat 41:2075–2096
© 2018 Springer International Publishing AG, part of Springer Nature

Wu, W.B., Lou, Z., Han, Y. (2018). Hypothesis Testing for High-Dimensional Data. In: Härdle, W., Lu, HS., Shen, X. (eds) Handbook of Big Data Analytics. Springer Handbooks of Computational Statistics. Springer, Cham. https://doi.org/10.1007/978-3-319-18284-1_8

Print ISBN: 978-3-319-18283-4. Online ISBN: 978-3-319-18284-1.