1 Introduction

High dimensional problems have become increasingly important as advances in technology have increased the capacity to collect high dimensional data. For genomic or microarray data, thousands of gene expression levels can be collected for each subject, in which case the number of variables is very large compared to the sample size. When there is interest in identifying mean differences between gene sets in two samples, this leads to the problem of simultaneously testing differences in the means of two gene sets with a large number of genes in each set (Nettleton et al. 2008; Newton et al. 2007). In such fields there is a growing demand for methods that can handle data whose dimension is very large compared to the sample size.

Key issues in two sample tests for high dimensional problems arise because the dimension of the mean vector is large relative to the sample size, and because Hotelling’s \(T^{2}\) statistic performs poorly when the sample covariance matrix is singular. To overcome this issue, Srivastava and Khatri (1979) proposed a test statistic based on the Moore–Penrose inverse, and covariance shrinkage techniques were introduced to obtain a positive definite sample covariance matrix (Ledoit and Wolf 2004). The question remains whether the issues related to the inverse covariance matrix can be avoided altogether in two sample tests for high dimensional problems. Among the various contributions to the literature, Bai and Saranadasa (1996) proposed modifying Hotelling’s \(T^{2}\) statistic by excluding the sample covariance matrix under the assumption \(p/n \rightarrow c\), where c is a constant with \(c < \infty \). However, in high dimensional problems it is in general too restrictive to assume that p / n converges to a constant. Chen and Qin (2010) noted this and constructed a test statistic that, under given assumptions, allows p to be arbitrarily large without requiring p to be of the same order as n. Wang et al. (2013) proposed a jackknife empirical likelihood (JEL) test, which works under weaker conditions than those of Chen and Qin and achieves good statistical power. Wang et al. (2013) showed that their proposed statistic has a chi-square limit regardless of whether p is finite or diverges. The key ideas in Wang et al. (2013)’s methodology are (1) splitting the samples into two independent groups, (2) the use of empirical likelihood, and (3) the use of jackknife samples. Point (1) is essential for the derivation of the necessary asymptotic results; however, the extent to which (2) and (3) contribute to power has not been explored in detail.
This understanding is important because it offers insight into what should be considered first in more complicated problems. Wang et al. (2013)’s methodology requires the sufficient conditions given in the “Appendix”, and these conditions are restrictive, requiring the rate of increase of p to be controlled. Relaxing these conditions would lead to an important improvement in the methodology.

In this paper, the restrictive conditions in Wang et al. (2013)’s approach are significantly relaxed under our proposed model, which allows flexible dependence structures. The explicit form of the model is given in (2.1). In addition, we derive simpler alternative statistics for testing the equality of two high dimensional means and study the contributions of the empirical likelihood and of the jackknife samples. The proposed statistics have a chi-square or the maximum of two independent chi-square statistics as limiting distributions, and the asymptotic results hold regardless of whether p is finite or diverges.

To study the contribution of the jackknife samples, we investigate one statistic based on the jackknife sample and another that is not. The proposed statistics are not based on the empirical likelihood and do not require any optimization procedure. A simulation study is performed to compare the performance of the new statistics with that of Wang et al. (2013)’s statistic. In the simulation study, we consider various factors that can affect the performance of the two sample test, including the skewness of the distribution, correlation between variables, sample size, the number of variables, and the signs of the mean shifts. It turns out that the signs of the mean shifts are critical for obtaining good power. To account for mean shifts in different directions, we investigate different choices of the coefficient vector. In Wang et al. (2013)’s approach, the coefficient vector is chosen a priori to boost the statistical power, with \((1,\ldots ,1)\) as a convenient choice. Although this choice is useful in some settings, the simulation results show that \((1,\ldots ,1)\) does not always yield good power in practical settings. The vector can also be chosen based on prior information (Wang et al. 2013), but such information is often unavailable in practice. In this paper, we propose a data-adaptive method to select the coefficient vector and show by simulation that the proposed choice substantially improves the power. The simple statistics proposed in this paper, together with the data-adaptive choice of the coefficient vector, yield good power and can be used for high dimensional problems in various areas of research.

The organization of this paper is as follows. In Sect. 2, we review Wang et al. (2013)’s approach in detail and explain how we derive the new alternative statistic for two sample tests. A numerical study is given in Sect. 3. The data-adaptive choice of the coefficient vector is explained in Sect. 4 followed by simulation results. We apply the methods in the analysis of gene expression data in Sect. 5. Concluding remarks are given in Sect. 6. All the details of the simulation results are given in the supplementary document.

2 Review of Wang et al. (2013) and proposed statistics

Since our proposed statistics are closely related to the setting in Wang et al. (2013), in the sense that they require common asymptotic results, we start this section with a detailed review of the JEL approach in Wang et al. (2013). To avoid additional difficulty for the reader, we keep most of the notation used in Wang et al. (2013). Assume that \(X_{i}=(X_{i1},\ldots ,X_{ip})^{T}\) \((i=1,\ldots ,n_{1})\) and \(Y_{j}=(Y_{j1},\ldots ,Y_{jp})^{T}\) \((j=1,\ldots ,n_{2})\) are two independent random samples with mean vectors \(\mu _{1}\) and \(\mu _{2}\), respectively, where p denotes the dimension of the variables and \(n_{1}\) and \(n_{2}\) are the sample sizes of the two groups. In this paper, we are concerned with testing \(H_{0}:\mu _{1}=\mu _{2}\) while allowing different covariances for the two groups. Note that this null hypothesis is equivalent to \(H_{0}:(\mu _{1}-\mu _{2})^{T}(\mu _{1}-\mu _{2})=0.\) Let \(m_{1}=[n_{1}/2]\), \(m_{2}=[n_{2}/2]\), \(m=m_{1}+m_{2}\), and let \(\tilde{X}_{i}=X_{i+m_{1}}\) for \(i=1,\ldots ,m_{1}\), and \(\tilde{Y}_{j}=Y_{j+m_{2}}\) for \(j=1,\ldots ,m_{2}\).

To test \(H_{0}\), Wang et al. (2013) proposed a JEL method. The jackknife sample is formulated as

$$\begin{aligned} Z_{k,1}= & {} \frac{m_{1}+m_{2}}{m_{1}m_{2}}\sum _{i=1}^{m_{1}}\sum _{j=1}^{m_{2}}(X_{i}-Y_{j})^{T}(\tilde{X}_{i}-\tilde{Y}_{j}) -\frac{m_{1}+m_{2}-1}{(m_{1}-1)m_{2}}\sum _{i=1,i\ne k}^{m_{1}}\sum _{j=1}^{m_{2}}(X_{i}-Y_{j})^{T}(\tilde{X}_{i}-\tilde{Y}_{j})\\ Z_{k,2}= & {} \frac{m_{1}+m_{2}}{m_{1}m_{2}}\sum _{i=1}^{m_{1}}\sum _{j=1}^{m_{2}}\left\{ \alpha ^{T}(X_{i}-Y_{j})+\alpha ^{T}(\tilde{X}_{i}-\tilde{Y}_{j})\right\} \\&-\,\frac{m_{1}+m_{2}-1}{(m_{1}-1)m_{2}}\sum _{i=1,i\ne k}^{m_{1}}\sum _{j=1}^{m_{2}}\left\{ \alpha ^{T}(X_{i}-Y_{j})+\alpha ^{T}(\tilde{X}_{i}-\tilde{Y}_{j})\right\} \end{aligned}$$

for \(k=1,\ldots ,m_{1}\), and

$$\begin{aligned} Z_{k,1}= & {} \frac{m_{1}+m_{2}}{m_{1}m_{2}}\sum _{i=1}^{m_{1}}\sum _{j=1}^{m_{2}}(X_{i}-Y_{j})^{T}(\tilde{X}_{i}-\tilde{Y}_{j}) -\frac{m_{1}+m_{2}-1}{m_{1}(m_{2}-1)}\sum _{i=1}^{m_{1}}\sum _{j=1,j\ne k-m_{1}}^{m_{2}}(X_{i}-Y_{j})^{T}(\tilde{X}_{i}-\tilde{Y}_{j})\\ Z_{k,2}= & {} \frac{m_{1}+m_{2}}{m_{1}m_{2}}\sum _{i=1}^{m_{1}}\sum _{j=1}^{m_{2}}\left\{ \alpha ^{T}(X_{i}-Y_{j})+\alpha ^{T}(\tilde{X}_{i}-\tilde{Y}_{j})\right\} \\&-\,\frac{m_{1}+m_{2}-1}{m_{1}(m_{2}-1)}\sum _{i=1}^{m_{1}}\sum _{j=1,j\ne k-m_{1}}^{m_{2}}\left\{ \alpha ^{T}(X_{i}-Y_{j})+\alpha ^{T}(\tilde{X}_{i}-\tilde{Y}_{j})\right\} \end{aligned}$$

for \(k=m_{1}+1,\ldots ,m\). Here, \(\alpha \) denotes the coefficient vector previously referred to in the introduction. The JEL ratio function for testing \(H_{0}:\mu _{1}=\mu _{2}\) is given by
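To make the construction above concrete, the jackknife values can be computed from the \(m_1 \times m_2\) matrices of pairwise terms \(u_{ij}\) and \(v_{ij}\); the following NumPy sketch illustrates this (the function name and the vectorization are ours, not from Wang et al. 2013).

```python
import numpy as np

def jackknife_values(X, Y, alpha):
    """Compute the jackknife values Z_{k,1}, Z_{k,2} of Wang et al. (2013).

    X: (n1, p) array, Y: (n2, p) array, alpha: (p,) coefficient vector.
    Only the first 2*m1 rows of X and 2*m2 rows of Y are used.
    """
    m1, m2 = X.shape[0] // 2, Y.shape[0] // 2
    X1, Xt = X[:m1], X[m1:2 * m1]          # X_i and tilde{X}_i
    Y1, Yt = Y[:m2], Y[m2:2 * m2]          # Y_j and tilde{Y}_j

    # u_ij = (X_i - Y_j)^T (tilde X_i - tilde Y_j), as an (m1, m2) matrix
    D1 = X1[:, None, :] - Y1[None, :, :]
    D2 = Xt[:, None, :] - Yt[None, :, :]
    u = np.einsum('ijk,ijk->ij', D1, D2)
    # v_ij = alpha^T (X_i - Y_j) + alpha^T (tilde X_i - tilde Y_j)
    v = (D1 + D2) @ alpha

    m = m1 + m2
    Z = np.empty((m, 2))
    for mat, col in ((u, 0), (v, 1)):
        total = mat.sum()
        full = m / (m1 * m2) * total
        # leave out X_k for k = 1, ..., m1 (drop row k of the pairwise matrix)
        Z[:m1, col] = full - (m - 1) / ((m1 - 1) * m2) * (total - mat.sum(axis=1))
        # leave out Y_{k-m1} for k = m1+1, ..., m (drop the corresponding column)
        Z[m1:, col] = full - (m - 1) / (m1 * (m2 - 1)) * (total - mat.sum(axis=0))
    return Z
```

The jackknife variance estimators introduced below are then simply the means of the squared columns of the returned array.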

$$\begin{aligned} L_{m}=\sup \left\{ \prod _{i=1}^{m}(mp_{i}):p_{i}\ge 0,\sum _{i=1}^{m}p_{i}=1,\quad \sum _{i=1}^{m}p_{i}Z_{i}=(0,0)^{T} \right\} . \end{aligned}$$

where \(Z_{i}=(Z_{i,1},Z_{i,2})^{T}\). Under either condition A1 or A2 in the “Appendix”, Wang et al. (2013) showed that \(-2\log L_{m}\rightarrow \chi ^2_{2}\) in distribution. A remarkable property of this statistic is that the asymptotic result holds regardless of whether p is finite or diverges. However, it holds only in restricted situations, since the sufficient conditions A1 and A2 are required. For instance, to satisfy condition A2, we need \(p=o\left( m^{\frac{\delta +\min (\delta ,2)}{2(2+\delta )}}\right) \) for some \(\delta >0\). Since \(\frac{\delta +\min (\delta ,2)}{2(2+\delta )}\le 1/2\) for any \(\delta >0\), p must increase at a slower rate than \(m^{1/2}\). To relax these restrictive conditions, we consider models that allow flexible dependence structures. Let

$$\begin{aligned} \varSigma= & {} E\left[ (X_{1}-\mu _{1})(X_{1}-\mu _{1})^{T}\right] \\ \tilde{\varSigma }= & {} E\left[ (Y_{1}-\mu _{2})(Y_{1}-\mu _{2})^{T}\right] \\ \rho _{1}= & {} tr(\varSigma ^2)=E\left[ \left( (X_{1}-\mu _{1})^{T}(X_{2}-\mu _{1})\right) ^2\right] \\ \rho _{2}= & {} tr(\tilde{\varSigma }^2)=E\left[ \left( (Y_{1}-\mu _{2})^{T}(Y_{2}-\mu _{2})\right) ^2\right] . \end{aligned}$$

Our model assumes

$$\begin{aligned} X_{i}-\mu _{1}=\varSigma ^{1/2}\varepsilon _{i} \quad \text { and } \quad Y_{i}-\mu _{2}=\tilde{\varSigma }^{1/2}\tilde{\varepsilon }_{i} \end{aligned}$$
(2.1)

where the elements in \(\varepsilon _{i}\) and \(\tilde{\varepsilon }_{i}\) are i.i.d random variables with mean 0 and finite fourth moment. If

$$\begin{aligned} \frac{\lambda ^4_{p}p^2}{m_{1}\rho ^2_{1}}=o(1),\quad \frac{\tilde{\lambda }^4_{p}p^2}{m_{2}\rho ^2_{2}}=o(1) \end{aligned}$$

hold, where \(\lambda _{p}\) and \(\tilde{\lambda }_{p}\) are the largest eigenvalues of \(\varSigma \) and \(\tilde{\varSigma }\), respectively, then the asymptotic chi-square limiting distribution is obtained as described above.

In fact, our model yields a significantly relaxed condition on p. Specifically, if we impose boundedness of all the eigenvalues of \(\varSigma \) and \(\tilde{\varSigma }\), as in Wang et al. (2013), then

$$\begin{aligned} \frac{\lambda ^4_{p}p^2}{m_{1}\rho ^2_{1}}=O\left( \frac{1}{m_{1}}\right) \rightarrow 0,\quad \frac{\tilde{\lambda }^4_{p}p^2}{m_{2}\rho ^2_{2}}=O\left( \frac{1}{m_{2}}\right) \rightarrow 0. \end{aligned}$$

Therefore, the asymptotic chi-square limiting distribution is obtained for any order of p. More details and proofs are given in “Appendices 2 and 3”.
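To spell out this rate (a short check under our reading of the boundedness assumption, namely that all eigenvalues of \(\varSigma \) lie in a fixed interval \([\lambda _{\min },\lambda _{\max }]\) with \(0<\lambda _{\min }\le \lambda _{\max }<\infty \)): since \(\rho _{1}=tr(\varSigma ^2)=\sum _{i=1}^{p}\lambda _{i}^2\ge p\lambda _{\min }^2\) and \(\lambda _{p}\le \lambda _{\max }\),

$$\begin{aligned} \frac{\lambda ^4_{p}p^2}{m_{1}\rho ^2_{1}} \le \frac{\lambda _{\max }^4\,p^2}{m_{1}\,p^2\lambda _{\min }^4} = \frac{1}{m_{1}}\left( \frac{\lambda _{\max }}{\lambda _{\min }}\right) ^4 = O\left( \frac{1}{m_{1}}\right) , \end{aligned}$$

and similarly for the second ratio with \(\tilde{\varSigma }\) and \(m_{2}\) in place of \(\varSigma \) and \(m_{1}\).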

Now we propose a new statistic for testing the equality of two high dimensional means that can be used instead of the JEL. This is a simpler statistic that follows from an intermediate step instead of deriving the JEL.

Denote \(U_{1}=\frac{1}{m_{1}m_{2}}\sum _{i=1}^{m_{1}}\sum _{j=1}^{m_{2}}(X_{i}-Y_{j})^{T}(\tilde{X}_{i}-\tilde{Y}_{j})\) and \(U_{2}=\frac{1}{m_{1}m_{2}}\sum _{i=1}^{m_{1}}\sum _{j=1}^{m_{2}}\left\{ \alpha ^{T}(X_{i}-Y_{j})+\alpha ^{T}(\tilde{X}_{i}-\tilde{Y}_{j})\right\} \). Under condition A1, A2, or B1 in the “Appendix” and \(H_{0}:\mu _{1}=\mu _{2}\), we have

$$\begin{aligned} \sqrt{m}\left( \begin{array}{c} \frac{U_{1}}{\sqrt{\rho }} \\ \frac{U_{2}}{\sqrt{\tau }} \\ \end{array} \right) \rightarrow _{d} N(0,I_{2}) \end{aligned}$$
(2.2)

where \(\rho =\frac{m}{m_{1}}\rho _{1}+\frac{m}{m_{2}}\rho _{2}\) and \(\tau =\frac{m}{m_{1}}\tau _{1}+\frac{m}{m_{2}}\tau _{2}\),

$$\begin{aligned} \tau _{1}= & {} 2tr(\alpha ^{T}\varSigma \alpha )=2E\left( (\alpha ^{T}(X_{1}-\mu _{1}))^2\right) \\ \tau _{2}= & {} 2tr(\alpha ^{T}\tilde{\varSigma } \alpha )=2E\left( (\alpha ^{T}(Y_{1}-\mu _{2}))^2\right) \end{aligned}$$

We immediately have

$$\begin{aligned} m\left( \frac{U^2_{1}}{\rho }+\frac{U^2_{2}}{\tau } \right) \rightarrow _{d} \chi ^2_{2}. \end{aligned}$$

Replacing \(\rho \) and \(\tau \) with their consistent estimators provides us with simple statistics for two sample high dimensional testing without introducing the empirical likelihood. In fact, Wang et al. (2013) developed two consistent estimators for \(\rho \) and \(\tau \). First, denote \(\hat{\rho }_{jack}=\frac{1}{m}\sum _{k=1}^{m}Z^2_{k,1}\) and \(\hat{\tau }_{jack}=\frac{1}{m}\sum _{k=1}^{m}Z^2_{k,2}\). The subscript “jack” highlights the use of the jackknife samples. Following Wang et al. (2013), we have

$$\begin{aligned} \frac{\hat{\rho }_\mathrm{jack}}{\rho } \rightarrow _{p} 1 \quad \text { and }\quad \frac{\hat{\tau }_\mathrm{jack}}{\tau } \rightarrow _{p} 1. \end{aligned}$$
(2.3)

Secondly, let \(\hat{\rho }_{ss}=\frac{m}{m^2_{1}m^2_{2}}\sum ^{m_{1}}_{k=1}(\sum ^{m_{2}}_{j=1}u_{kj})^2+\frac{m}{m^2_{1}m^2_{2}}\sum ^{m_{2}}_{k=1}(\sum ^{m_{1}}_{i=1}u_{ik})^2\) and \(\hat{\tau }_{ss}=\frac{m}{m^2_{1}m^2_{2}}\sum ^{m_{1}}_{k=1}(\sum ^{m_{2}}_{j=1}v_{kj})^2+\frac{m}{m^2_{1}m^2_{2}}\sum ^{m_{2}}_{k=1}(\sum ^{m_{1}}_{i=1}v_{ik})^2 \) where \(u_{ij}=(X_{i}-Y_{j})^{T}(\tilde{X}_{i}-\tilde{Y}_{j})\) and \(v_{ij}=\alpha ^{T}(X_{i}-Y_{j})+\alpha ^{T}(\tilde{X}_{i}-\tilde{Y}_{j})\). Then, we have

$$\begin{aligned} \frac{\hat{\rho }_{ss}}{\rho } \rightarrow _{p} 1 \quad \text { and }\quad \frac{\hat{\tau }_{ss}}{\tau } \rightarrow _{p} 1. \end{aligned}$$
(2.4)

By combining the results above, we have the following:

Theorem 2.1

Under condition A1, A2, or B1 in the “Appendix” and \(H_{0}:\mu _{1}=\mu _{2}\), for both \(i=\mathrm{jack}\) and \(i=\mathrm{ss}\), we have

$$\begin{aligned} m\left( \frac{U^2_{1}}{\widehat{\rho }_{i}}+\frac{U^2_{2}}{\widehat{\tau }_{i}} \right) \rightarrow _{d} \chi ^2_{2}. \end{aligned}$$
(2.5)

Proof

Assume that

$$\begin{aligned} \frac{\hat{\rho }}{\rho }\rightarrow _{p} 1\quad \hbox {and} \quad \frac{\hat{\tau }}{\tau }\rightarrow _{p}1. \end{aligned}$$

By (2.2) and Slutsky’s theorem, we have

$$\begin{aligned} \sqrt{m}\left( \begin{array}{c} \textstyle \frac{U_{1}}{\sqrt{\widehat{\rho }}} \\ \textstyle \frac{U_{2}}{\sqrt{\widehat{\tau }}} \\ \end{array} \right) \rightarrow _{d} N(0,I_{2}). \end{aligned}$$

which, together with the continuous mapping theorem, establishes (2.5); the assumed consistency holds for both \(i=\mathrm{jack}\) and \(i=\mathrm{ss}\) by (2.3) and (2.4). \(\square \)

We will call these simple \(\chi ^2_{2}\) statistics S1 and S2 where \((\rho ,\tau )\) is replaced by \((\widehat{\rho }_\mathrm{jack},\widehat{\tau }_\mathrm{jack})\) and \((\widehat{\rho }_{ss},\widehat{\tau }_{ss})\), respectively. Furthermore, by exploiting the asymptotic independence of \(U_{1}\) and \(U_{2}\),

$$\begin{aligned} \max \left( \frac{mU^2_{1}}{{\widehat{\rho }}},\frac{mU^2_{2}}{{\widehat{\tau }}}\right) \end{aligned}$$

can be used for testing \(H_{0}:\mu _{1}=\mu _{2}\) as well. Here, the null distribution is that of the maximum of two independent chi-square statistics, each with one degree of freedom. We call these maximum statistics M1 and M2, where \((\rho ,\tau )\) is replaced by \((\widehat{\rho }_\mathrm{jack},\widehat{\tau }_\mathrm{jack})\) and \((\widehat{\rho }_{ss},\widehat{\tau }_{ss})\), respectively.
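For concreteness, the statistics S2 and M2 (the versions standardized by \(\widehat{\rho }_{ss}\) and \(\widehat{\tau }_{ss}\)) can be computed as in the following sketch. The function name is ours; the p value formulas use the standard closed forms \(P(\chi ^2_{2}>x)=e^{-x/2}\) and \(P(\chi ^2_{1}\le x)=\mathrm{erf}(\sqrt{x/2})\).

```python
import math
import numpy as np

def two_sample_test(X, Y, alpha):
    """Sketch of S2 and M2: the chi^2_2 statistic and the maximum statistic,
    both standardized with the sums-of-squares estimators (rho_ss, tau_ss).
    X: (n1, p), Y: (n2, p), alpha: (p,) coefficient vector."""
    m1, m2 = X.shape[0] // 2, Y.shape[0] // 2
    X1, Xt = X[:m1], X[m1:2 * m1]
    Y1, Yt = Y[:m2], Y[m2:2 * m2]
    D1 = X1[:, None, :] - Y1[None, :, :]
    D2 = Xt[:, None, :] - Yt[None, :, :]
    u = np.einsum('ijk,ijk->ij', D1, D2)   # u_ij = (X_i-Y_j)^T (X~_i - Y~_j)
    v = (D1 + D2) @ alpha                  # v_ij = a^T(X_i-Y_j) + a^T(X~_i-Y~_j)

    m = m1 + m2
    U1, U2 = u.mean(), v.mean()
    c = m / (m1 ** 2 * m2 ** 2)
    rho_ss = c * (np.sum(u.sum(axis=1) ** 2) + np.sum(u.sum(axis=0) ** 2))
    tau_ss = c * (np.sum(v.sum(axis=1) ** 2) + np.sum(v.sum(axis=0) ** 2))

    S = m * (U1 ** 2 / rho_ss + U2 ** 2 / tau_ss)        # -> chi^2_2 under H0
    M = max(m * U1 ** 2 / rho_ss, m * U2 ** 2 / tau_ss)  # max of two chi^2_1
    p_S = math.exp(-S / 2)                        # chi^2_2 survival function
    p_M = 1.0 - math.erf(math.sqrt(M / 2)) ** 2   # P(max of two chi^2_1 > M)
    return S, p_S, M, p_M
```

A level-0.05 rejection for the maximum statistic corresponds to \(M>5\) (approximately), matching the threshold used in Sect. 3.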

3 Simulation study

In this section, we compare several approaches in a simulation study by investigating the sizes and powers of our proposed methods (S1, S2, M1, M2) as well as the JEL test. For comparison, we consider simulation settings similar to those in Wang et al. (2013), but diversify the factors that can affect the statistical power. Assume that \(W_{1},\ldots ,W_{p}\) are i.i.d. random variables, and \(\bar{W}_{1},\ldots ,\bar{W}_p\) are i.i.d. random variables independent of the \(W_{i}\)’s. We consider eight simulation settings: four (Setting I) assume independence between the variables, and the other four (Setting II) are correlated settings. Within the independent and correlated settings, we vary the skewness and also allow the mean shifts to have opposite signs. In each setting, \(100c_{2}\)% of the components of \(Y_{1}\) have a shifted mean compared to the mean of \(X_{1}\). Detailed descriptions are given below for each simulation setting:

  • Setting I (Independent cases)

  • Setting I-1: Let \(W_{i} \sim N(0,1)\) and \(\bar{W}_{i}\sim t(8)\). Define \(X_{1,1} = W_{1}, X_{1,2} = W_{2}, \ldots , X_{1,p} = W_{p}, Y_{1,1} = \bar{W}_{1}+\mu _{2,1}, Y_{1,2} =\bar{W}_{2}+\mu _{2,2}, \ldots , Y_{1,p} = \bar{W}_{p} +\mu _{2,p}\), where \(\mu _{2,i} = c_{1}\) if \(i \le [c_{2}p]\), and \(\mu _{2,i} = 0\) if \(i > [c_{2}p]\).

  • Setting I-2: The same setting as I-1, except that \(\mu _{2,i}=c_{1}\) for odd i, \(\mu _{2,i}=-c_{1}\) for even i.

  • Setting I-3: The same setting as I-1, except that \(\bar{W}_{i}\sim \chi ^2(1)-1\).

  • Setting I-4: The same setting as I-3, except that \(\mu _{2,i}=c_{1}\) for odd i, \(\mu _{2,i}=-c_{1}\) for even i.

  • Setting II (Correlated cases used in Wang et al. (2013))

  • Setting II-1 : Let \(W_{i} \sim N(0,1)\) and \(\bar{W}_{i}\sim t(8).\) Define \(X_{1,1} = W_{1}, X_{1,2} = W_{1}+W_{2}, \ldots , X_{1,p} = W_{p-1}+W_{p}, Y_{1,1} = \bar{W}_{1}+\mu _{2,1}, Y_{1,2} =\bar{W}_{1}+\bar{W}_{2}+\mu _{2,2}, \ldots , Y_{1,p} = \bar{W}_{p-1} + \bar{W}_{p} +\mu _{2,p}\), where \(\mu _{2,i} = c_{1}\) if \(i \le [c_{2}p]\), and \(\mu _{2,i} = 0\) if \(i > [c_{2}p]\).

  • Setting II-2: The same setting as II-1, except that \(\mu _{2,i}=c_{1}\) for odd i, \(\mu _{2,i}=-c_{1}\) for even i.

  • Setting II-3: The same setting as II-1, except that \(\bar{W}_{i}\sim \chi ^2(1)-1\).

  • Setting II-4: The same setting as II-3, except that \(\mu _{2,i}=c_{1}\) for odd i, \(\mu _{2,i}=-c_{1}\) for even i.

In this simulation, the null hypothesis to be tested is \(H_{0}: E(X_{1})=E(Y_{1})\). Note that when \(c_{1}=0\) there is no shift in the mean vector, so this case is used to investigate the sizes of the tests. After generating 1000 random samples of sizes \(n_{1} = 30, 100, 150\) from \(X = (X_{1,1},\ldots ,X_{1,p})^{T}\) and independently generating 1000 random samples of sizes \(n_{2} = 30, 100, 200\) from \(Y = (Y_{1,1},\ldots , Y_{1,p})^{T}\), with \(p = 10, 20, \ldots , 100, 300, 500\), \(c_{1} = 0, 0.1\), and \(c_{2} = 0.25, 0.75\), we compute the powers of the five tests.
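As an illustration, one dataset from Setting II-1 can be generated as follows; this is a sketch of our reading of the setting (the moving-average structure and the shift pattern), with function names of our own.

```python
import numpy as np

def setting_II1(n1, n2, p, c1, c2, rng=None):
    """Generate one dataset from (our reading of) Setting II-1: each coordinate
    is a moving average of two i.i.d. innovations, with N(0,1) innovations for
    X and t(8) innovations for Y; the first [c2*p] means of Y are shifted by c1."""
    rng = np.random.default_rng(rng)

    def ma(W):  # column 1 is W_1; column k is W_{k-1} + W_k for k >= 2
        V = W.copy()
        V[:, 1:] += W[:, :-1]
        return V

    X = ma(rng.standard_normal((n1, p)))
    Y = ma(rng.standard_t(df=8, size=(n2, p)))
    mu2 = np.zeros(p)
    mu2[: int(c2 * p)] = c1    # mean shift in the first [c2*p] coordinates
    return X, Y + mu2
```

Settings II-2 to II-4 differ only in the innovation distribution for Y or in the sign pattern of the shift, so they are obtained by small modifications of this sketch.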

For comparisons of the five methods, we report the empirical sizes and powers for each simulation setting. The results are given in Tables S1 to S8 in the supplementary document, showing the proportion of rejections of the null \(H_{0}:\mu _{1}=\mu _{2}\) out of 1000 replications. Each table is divided into three parts: the top shows the sizes and powers of our proposed tests (S1, M1, S2, M2) and the JEL for (\(n_1\), \(n_2\)) = (30, 30), the middle shows the results for (\(n_1\), \(n_2\)) = (100, 100), and the bottom for \((n_{1},n_{2})=(150,200)\). For M1 and M2, a rejection is declared when the statistic is larger than 5, which corresponds to the 95% quantile of the maximum of two independent \(\chi ^2_{1}\) variables. Assuming that no prior information is available, \(\alpha =(1,\ldots ,1)\) was used. The key findings from the simulation study can be summarized as follows:

  • S2 and M2 yield very low statistical power for (\(n_1\), \(n_2\)) = (30, 30) and (100, 100), illustrating that the use of jackknife samples is critical for boosting the statistical power when the sample sizes are not large.

  • The Type I error of the JEL is slightly higher than the nominal level (0.05) when (\(n_1\), \(n_2\)) = (30, 30), whereas those of S1 and M1 are slightly lower than 0.05. This explains why the JEL has slightly higher power than S1 and M1 when (\(n_1\), \(n_2\)) = (30, 30). The performances of the JEL and S1 are comparable for (\(n_1\), \(n_2\)) = (100, 100) and (150, 200). Thus, the use of the empirical likelihood does not seem to be critical.

  • When the mean shifts have opposite signs, all the statistics have extremely low power. See Tables S2 and S4.

  • The skewness of the distribution of data does not seem to affect the power much. This can be seen by comparing Tables S1 and S3.

4 The choice of \(\alpha \)

The simulation study in Sect. 3 showed that both the JEL and our proposed statistics perform badly when the shifted means have opposite signs. The low statistical power in this case is mainly due to an inappropriate choice of \(\alpha \). To understand this, suppose that \(\alpha =(1,\ldots ,1)\) and the signs of the components of \(\mu _{1}-\mu _{2}\) alternate. Then \(\alpha ^{T}(X_{i}-Y_{j})\approx \alpha ^{T}(\mu _{1}-\mu _{2})\approx 0\) because the positive and negative mean shifts cancel each other out. We expect the choice \(\alpha =(1,\ldots ,1)\) to be effective only when either positive or negative shifts dominate in \(\mu _{1}-\mu _{2}\). Otherwise, \(\alpha \) must be chosen so that the mean shifts do not cancel each other out.

In particular, we consider the situation where there is no strong prior knowledge about the variables. Our strategy is to estimate the signs of the shifted means from the data. We first split the samples into three independent parts instead of two. The first two parts are used to construct the two sample statistics as described in Sect. 2, and the remaining part is used to estimate the signs. Let

$$\begin{aligned} \alpha ^{*}=I\left( \tilde{\tilde{X}}-\tilde{\tilde{Y}}>0\right) -I\left( \tilde{\tilde{X}}-\tilde{\tilde{Y}}<0\right) \end{aligned}$$

where \(\tilde{\tilde{X}}\) and \(\tilde{\tilde{Y}}\) denote the sample means of the part of the dataset used to estimate the signs, and the indicator functions are applied componentwise. Since \(\alpha ^{*}\) is independent of the construction of the two sample statistics, the choice of \(\alpha ^{*}\) does not change the asymptotic properties of the JEL and our proposed statistics under some regularity conditions. This can be rephrased as follows:

Corollary 1

Suppose that for any \(s\in S\), \(Var(s^{T}X_{i})>0\) and \(Var(s^{T}Y_{i})>0\) where \(S=\{(s_{1},\ldots ,s_{p})|s_{i}=\pm 1\}\). Under either condition A1 or A2 or B1 and \(H_{0}:\mu _{1}=\mu _{2}\), (2.5) holds by conditioning on \(\alpha =\alpha ^*\).

The conditions \(\hbox {Var}(s^{T}X_{i})>0\) and \(\hbox {Var}(s^{T}Y_{i})>0\) simply require that \(X_{i}\) and \(Y_{i}\) not be degenerate along any sign combination s. A different estimation method for \(\alpha \) is possible, as discussed in Wang et al. (2013). However, as they pointed out, the derived theorems cannot be applied directly to their choice, whereas they apply directly to our proposed choice.
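The data-adaptive choice can be sketched in a few lines; the function names and the hold-out mechanics are ours, under the reading that \(\alpha ^{*}\) is computed from the held-out sample means, componentwise.

```python
import numpy as np

def estimate_alpha(X_hold, Y_hold):
    """Data-adaptive sign vector: alpha*_j = I(d_j > 0) - I(d_j < 0), where d is
    the difference of the held-out sample mean vectors (ties give 0)."""
    d = X_hold.mean(axis=0) - Y_hold.mean(axis=0)
    return np.sign(d)

def three_way_split(X, Y, frac=0.10, rng=None):
    """Hold out a fraction of each group to estimate alpha; the remainder is then
    split in half again by the two sample statistics themselves (Sect. 2)."""
    rng = np.random.default_rng(rng)
    k1 = max(1, int(frac * X.shape[0]))
    k2 = max(1, int(frac * Y.shape[0]))
    ix, iy = rng.permutation(X.shape[0]), rng.permutation(Y.shape[0])
    return X[ix[k1:]], Y[iy[k2:]], X[ix[:k1]], Y[iy[:k2]]
```

Because the held-out part is independent of the part used for testing, conditioning on the estimated \(\alpha ^{*}\) leaves the limiting distributions of the statistics unchanged, as stated in Corollary 1.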

A simulation study was conducted to evaluate the performance of the proposed approach. We use the same settings as II-2 and II-4, except that now \(c_{1}=0.5\). The proportions of rejections of \(H_{0}:\mu _{1}=\mu _{2}\) are shown in Tables S9–S12 in the supplementary file. Since statistical power improves with larger \(c_{1}\) even when \(\alpha =(1,\ldots ,1)\), \(c_{1}\) must be held fixed when comparing the results for \(\alpha =(1,\ldots ,1)\) with those for the data-adaptive \(\alpha \). Tables S9 and S10 provide the results for \(\alpha =(1,\ldots ,1)\), and Tables S11 and S12 give the results when \(\alpha \) is estimated from a randomly selected 10% of the dataset. A substantial increase in statistical power is observed with our data-adaptive method.

5 Analysis of Gene expression data

There are two major categories of gene set tests: competitive gene set tests and self-contained gene set tests (Goeman and Bühlmann 2007). Competitive gene set tests are concerned with the comparison of the set of genes of interest, say G, with the complementary set of genes which are not in G. On the other hand, self-contained gene set tests focus on the gene set of interest itself without reference to the complementary set of genes. An example of the former is Wu and Smyth (2012), which considered inter-gene correlation. The proposed two sample statistics in this paper belong to the category of self-contained gene set tests.

We analyze the Colon data available in the R package “plsgenomics”. This data set comes from the microarray experiments on colon tissue samples of Alon et al. (1999) and contains 2000 gene expression levels measured on 62 tissues, of which \(n_{1}=22\) are normal tissues and \(n_{2}=40\) are tumor tissues. To see the effect of the genes with significant differences in sample means, Wang et al. (2013) restricted the test to the genes satisfying

$$\begin{aligned} \left| \frac{1}{n_{1}}\sum _{i=1}^{n_{1}}X_{ij}-\frac{1}{n_{2}}\sum _{i=1}^{n_{2}}Y_{ij}\right| \le c_{3} \end{aligned}$$

for some given threshold \(c_{3}>0\). We report in Table 1 the p values for testing the equality of the means of those genes whose absolute difference of sample means is less than the threshold \(c_{3}\).
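The gene selection step is a simple componentwise thresholding; a minimal sketch (the function name is ours) is:

```python
import numpy as np

def filter_genes(X, Y, c3):
    """Keep the genes whose absolute difference of sample means is at most c3.
    X: (n1, p) expression matrix for one group, Y: (n2, p) for the other."""
    keep = np.abs(X.mean(axis=0) - Y.mean(axis=0)) <= c3
    return X[:, keep], Y[:, keep]
```

With the Colon data, X and Y would hold the 22 normal and 40 tumor expression profiles; varying \(c_{3}\) then produces the gene subsets underlying the rows of Table 1.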

Table 1 Colon data: p values for testing equal means of those genes with the absolute difference of sample means less than the threshold \(c_{3}\)

For \(c_{3}=3000\), the p values of the JEL and S1 are 0.136 and 0.182, respectively. These results are based on \(\alpha =(1,\ldots ,1)\) as the coefficient vector. However, since the directions of differentially expressed genes can be inconsistent, it is reasonable to apply the data-adaptive choice of \(\alpha \) in the analysis. Two tissues are randomly selected from the normal group and four from the tumor group, and \(\alpha \) is computed from these selected tissues. The two sample mean test is then performed on the remaining samples, 20 from the normal group and 36 from the tumor group. The results are given in Table 1. When \(c_{3}\) is greater than 1000 and \(\alpha \) is estimated, the p values from all the methods are highly significant. Although this is an encouraging result, it should be interpreted carefully because a few observations with large differences can have a large influence on the test results. To see whether the results remain significant when the effects of the large observations are removed, we apply a log transformation to the 2000 gene expression levels. The results are given in Table 2. For testing the equality of the means of the logarithms of the 2000 gene expression levels on normal and tumor colon tissues, the JEL and S1 with \(\alpha =(1,\ldots ,1)\) give p values of 0.206 and 0.180, respectively. However, with the data-adaptive \(\alpha \), all the results are highly significant. Normal and tumor tissues thus appear to have different mean vectors, but instead of making a quick judgement based on these results, we recommend investigating how such a large difference can arise in the experiments and checking for a biological justification of the mean difference.

Table 2 Colon data (logarithm scale): p values for testing equal means of those genes with the absolute difference of sample means less than the threshold \(c_{3}\)

6 Conclusion

In this paper, we propose alternative statistics for testing the equality of two high dimensional means and study their finite sample properties. In our simulation study, we observe that the use of jackknife samples is essential for gaining good statistical power, whereas the contribution of the empirical likelihood does not seem substantial. We propose new statistics that do not involve the empirical likelihood, eliminating the need for an optimization procedure. We also significantly relax the sufficient conditions required by Wang et al. (2013). Simulation results show that the choice of the coefficient vector is critical for all of the proposed methods. In many practical settings, \(\alpha =(1,\ldots ,1)\) is a naive choice, so we propose a simple data-adaptive estimator of \(\alpha \). A numerical study shows a substantial increase in statistical power for the practical settings that were considered, and this is also observed in the analysis of the gene expression data.

Some issues remain as possible future research topics. First, we may consider different functional forms for \(U_{2}\) to complement \(U_{1}\); to keep the necessary asymptotic theory simple, they need to have mean zero and zero correlation with \(U_{1}\) under \(H_{0}\). Otherwise, new theoretical developments will be required. It would be interesting to see whether power can increase substantially with different functional forms of \(U_{2}\). Second, an enormous amount of biological information has accumulated in the modern research environment, and it would be interesting to incorporate such information into the estimation of \(\alpha \).