
Although the literature on multiple test procedures (MTPs) is growing rapidly, it is still possible to systematize the proposed methods according to some general categories. For instance, one class of methods only models the marginal distributions of the involved test statistics explicitly and combines these test statistics or, equivalently, corresponding \(p\)-values following probabilistic calculations. We call the resulting procedures margin-based multiple test procedures. Different margin-based MTPs employ different qualitative assumptions on the dependency structure between test statistics or \(p\)-values, cf. our Chap. 2. Examples of this kind of procedure are discussed in Sect. 3.1.

Another class of MTPs considers the full joint distribution of all test statistics and relies on calculating or approximating quantiles of this joint distribution, for instance by resampling or by proving asymptotic normality by means of central limit theorems. We term such procedures multivariate multiple test procedures and discuss them in Sect. 3.2. A class of hybrid multiple test procedures (neither purely margin-based nor entirely multivariate), specifically tailored to control the FWER in structured systems of hypotheses, is constituted by closed test procedures, which we treat in Sect. 3.3.

Further criteria to distinguish MTPs are their structure (single-step or stepwise rejective), and the type of error control (\(k\)-FWER-controlling, FDR-controlling, FDX-controlling, etc.) that they provide. We exclude a distinction between frequentist and Bayesian procedures here, because this work is not concerned with Bayesian approaches to multiple hypothesis testing. As far as frequentist procedures are concerned, the aforementioned criteria in our opinion allow us to treat the majority of the most popular MTPs up to the present.

One type of procedure which does not fit in a clear-cut way into the categories defined above is constituted by so-called augmentation procedures. Augmentation procedures for control of the \(k\)-FWER, the FDR or the FDX work in two stages: In the first stage, an FWER-controlling MTP is applied. In the second stage, a certain number of hypotheses not rejected in the first stage is rejected additionally, where this number in general depends on the data and on probabilistic bounds. Although augmentation procedures have attracted some attention recently, we do not cover them in the present work. References for augmentation procedures include van der Laan et al. (2004, 2005) and Farcomeni (2009).

1 Margin-Based Multiple Test Procedures

The multiple tests discussed in this section only require that each marginal test \(\varphi _i\) can be calibrated to keep a local significance level \(\alpha _{\text {loc.}}\) (say). The multiple test \(\varphi = (\varphi _i: 1 \le i \le m)\) is then built up from these marginal tests by adjusting \(\alpha _{\text {loc.}}\) for the multiplicity of the problem. This adjustment may be given by an explicit “correction for multiplicity” based on probabilistic considerations or in a data-dependent manner, for instance by defining \(\alpha _{\text {loc.}}\) by the value of an order statistic of marginal \(p\)-values \(p_1, \ldots , p_m\).

1.1 Single-Step Procedures

Single-step multiple test procedures carry out each individual test \(\varphi _i\), \(1 \le i \le m\), at (local) significance level \(\alpha _{\text {loc.}}\), where \(\alpha _{\text {loc.}}\) is the result of a multiplicity correction of \(\alpha \). In view of Theorem 2.1, single-step multiple tests are extremely easy to carry out in practice: Just calculate marginal \(p\)-values \(p_1, \ldots , p_m\) and reject \(H_i\) if and only if \(p_i < \alpha _{\text {loc.}}\). The choice of \(\alpha _{\text {loc.}}\) depends on qualitative assumptions regarding the joint distribution of \((p_1, \ldots , p_m)\). Two classical procedures are the Bonferroni correction (or Bonferroni test) and the Šidák correction (or Šidák test).

Example 3.1

(Bonferroni correction, cf. Bonferroni (1935; 1936)). The Bonferroni correction is based on the union bound and consists in choosing \(\alpha _{\text {loc.}}= \alpha / m\). It provides strong control of the FWER without any assumptions on the dependency structure among \((p_1, \ldots , p_m)\), because for a Bonferroni test \(\varphi \), it holds for all \(\vartheta \in \varTheta \) that

$$\begin{aligned} \text {FWER}_\vartheta (\varphi )&= {\mathbb {P}}_\vartheta (\bigcup _{i \in I_0(\vartheta )}\{\varphi _i = 1\})\\&\le \sum _{i \in I_0(\vartheta )} {\mathbb {P}}_\vartheta (\{\varphi _i = 1\})\\&\le m_0 \alpha / m \le \alpha . \end{aligned}$$

The inequality \({\mathbb {P}}(\bigcup _{i=1}^m A_i) \le \sum _{i=1}^m {\mathbb {P}}(A_i)\) is referred to as Bonferroni inequality in the multiple testing literature.

The disadvantage of Bonferroni tests is that \(\alpha / m\) is very small for large \(m\). Therefore, Bonferroni tests have low multiple power if \(m\) is large. If joint independence of all \(m\) marginal \(p\)-values can be assumed, \(\alpha _{\text {loc.}}\) can be chosen slightly larger than \(\alpha / m\).
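In code, the Bonferroni correction is a one-liner. A minimal sketch (function name is ours), rejecting \(H_i\) if and only if \(p_i < \alpha / m\), in line with the rejection rule of Theorem 2.1:

```python
def bonferroni(pvals, alpha=0.05):
    """Bonferroni single-step test: carry out each marginal test at the
    local level alpha / m (function name ours, for illustration)."""
    m = len(pvals)
    return [p < alpha / m for p in pvals]
```

For instance, with \(m = 3\) and \(\alpha = 0.05\), the local level is \(\alpha/3 \approx 0.0167\), so only \(p\)-values below this threshold lead to rejections.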

Example 3.2

(Šidák correction, cf. Šidák 1967). The Šidák correction consists in choosing \(\alpha _{\text {loc.}}= 1 - (1 - \alpha )^{1/m}\). It provides strong control of the FWER if \((p_1, \ldots , p_m)\) are jointly stochastically independent, because for a Šidák test \(\varphi \), it then holds for all \(\vartheta \in {\varTheta }\) that

$$\begin{aligned} \text {FWER}_\vartheta (\varphi )&= {\mathbb {P}}_\vartheta (\bigcup _{i \in I_0(\vartheta )}\{\varphi _i = 1\})\\&= 1 - {\mathbb {P}}_\vartheta (\bigcap _{i \in I_0(\vartheta )}\{\varphi _i = 0\})\\&= 1 - \prod _{i \in I_0(\vartheta )} {\mathbb {P}}_\vartheta (\{\varphi _i = 0\})\\&\le 1 - \prod _{i \in I_0(\vartheta )} (1 - \alpha )^{1/m}\\&= 1 - (1 - \alpha )^{m_0/m}\\&\le 1 - (1 - \alpha ) = \alpha . \end{aligned}$$

As mentioned before, for all \(m \ge 2\) it holds \(\alpha / m < 1 - (1 - \alpha )^{1/m}\), so that the more restrictive model assumptions made for a Šidák test allow one to increase multiple power uniformly. We may remark here that Šidák tests control the FWER under certain forms of positive dependence among \((p_1, \ldots , p_m)\), too. More details are provided in Chap. 4. Also asymptotically, it holds \(m [1 - (1 - \alpha )^{1/m}] \rightarrow -\ln (1-\alpha ) > \alpha = m \alpha / m\), \(m \rightarrow \infty \), for any \(\alpha \in (0,1)\). However, also for the Šidák correction, we have \(\alpha _{\text {loc.}}\rightarrow 0\), \(m \rightarrow \infty \).
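The comparison of the two local levels is easy to verify numerically. A short sketch (function name ours) computing both corrections, whose outputs illustrate the strict inequality \(\alpha/m < 1 - (1-\alpha)^{1/m}\) and the limit \(m\,\alpha_{\text{loc.}} \to -\ln(1-\alpha)\):

```python
from math import log

def local_levels(alpha, m):
    """Return the Bonferroni and Sidak local significance levels
    (function name ours, for illustration)."""
    bonf = alpha / m
    sidak = 1.0 - (1.0 - alpha) ** (1.0 / m)
    return bonf, sidak

# e.g. for alpha = 0.05, m = 100: the Sidak level exceeds alpha/m,
# and m * sidak is close to (and below) -log(1 - alpha)
```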

In the particular context of testing linear contrasts in Gaussian models, Scheffé (1953) obtained the following result.

Theorem 3.1

(Scheffé (1953)). Let \(k \ge 3\) and \(n_i \ge 2\) for all \(1 \le i \le k\) be given integers and \(X=(X_{ij} : 1 \le i \le k, 1 \le j \le n_i)\). Assume that all \(X_{ij}\) are stochastically independent and normally distributed, \(X_{ij}\sim {\fancyscript{N}}(\mu _i,\sigma ^2)\), where \(\mu _i\in {\mathbb {R}}\), \(1 \le i \le k\), and \(\sigma ^2>0\). For notational convenience, denote \(n. = \sum _{i=1}^k n_i\). Consider the linear subspace

$$\begin{aligned} {\fancyscript{L}}=\left\{ \sum _{j=1}^q h_j a^{(j)} : h_1, \ldots , h_q \in {\mathbb {R}}\right\} \end{aligned}$$

of \({\mathbb {R}}^k\) of dimension \(q\le k\), where \(a^{(1)},\ldots ,a^{(q)} \in {\mathbb {R}}^k\) are linearly independent vectors. Then it holds for all \(\mu \in {\mathbb {R}}^k\) and for all \(\sigma ^2 > 0\) that

$$\begin{aligned} {\mathbb {P}}_{(\mu ,\sigma ^2)}\left( \forall c\in {\fancyscript{L}}: c^T\mu \in \left[ c^T\hat{\mu } \mp \sqrt{q\widehat{\text {Var}}(c^T\hat{\mu }) F_{q, n.-k;\alpha }}\right] \right) = 1-\alpha , \end{aligned}$$
(3.1)

where \(\mu = (\mu _1, \ldots , \mu _k)^\top \), \(\hat{\mu }=(\overline{X}_{1.},\ldots ,\overline{X}_{k.})^T\) (vector of empirical group means), and \(\widehat{\text {Var}}(c^T\hat{\mu })= s^2\sum _{i=1}^k (c_i^2 /n_i)\), with \(s^2\) denoting the pooled unbiased estimator of \(\sigma ^2\), and \(F_{q, n.-k;\alpha }\) the upper \(\alpha \)-quantile of Fisher’s \(F\)-distribution with \(q\) and \(n.-k\) degrees of freedom.

Equation (3.1) yields a simultaneous \(1-\alpha \) confidence region for all linear contrasts of group means defined by \(\fancyscript{L}\) in the considered analysis of variance model. By duality of tests and confidence regions (see Theorem 1.1), this also entails a multiple single-step test for such contrasts.
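For a single contrast \(c\), the interval in (3.1) can be sketched as follows (function name ours, for illustration); the upper \(\alpha\)-quantile \(F_{q, n.-k;\alpha}\) is passed in as a number, which may for instance be obtained as `scipy.stats.f.ppf(1 - alpha, q, n - k)`:

```python
from math import sqrt

def scheffe_ci(groups, c, q, f_quantile):
    """Scheffe simultaneous confidence interval (3.1) for one contrast
    c^T mu (function name ours).  `groups` holds the k samples as lists;
    `f_quantile` is the upper alpha-quantile F_{q, n.-k; alpha}."""
    k = len(groups)
    n = sum(len(g) for g in groups)                 # n. in the text
    means = [sum(g) / len(g) for g in groups]       # empirical group means
    # pooled unbiased estimator s^2 of sigma^2
    ss_within = sum(sum((x - mu) ** 2 for x in g) for g, mu in zip(groups, means))
    s2 = ss_within / (n - k)
    est = sum(ci * mu for ci, mu in zip(c, means))  # c^T mu_hat
    var_hat = s2 * sum(ci ** 2 / len(g) for ci, g in zip(c, groups))
    half = sqrt(q * var_hat * f_quantile)
    return est - half, est + half
```

The interval is symmetric around the point estimate \(c^\top \hat{\mu}\), and its half-width grows with \(q\), reflecting that validity is required simultaneously over the whole subspace \(\fancyscript{L}\).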

1.2 Stepwise Rejective Multiple Tests

Another interesting class of multiple test procedures is that of stepwise rejective tests. In contrast to single-step tests, the hypotheses are here ordered by a pre-defined criterion and tested one after the other, where testing can stop at every step due to the occurrence of a rejection or a non-rejection. This means that the test result for a particular pair of hypotheses \(H_i\) versus \(K_i\) depends on the data not only directly via the test statistic \(T_i\) or the \(p\)-value \(p_i\), but also indirectly via potentially all other test statistics or \(p\)-values. The way the ordering among the hypotheses is defined leads to different subtypes of stepwise rejective multiple tests.

1.2.1 Step-Up-Down Tests

Step-up-down tests, introduced by Tamhane et al. (1998), rely on an ordering of the hypotheses \(H_1, \ldots , H_m\) which is induced by the order statistics of marginal \(p\)-values \(p_1, \ldots , p_m\).

Definition 3.1

(Step-up-down test of order \(\kappa \) , cf. Finner et al. (2012)). Let \(p_{1:m} < p_{2:m} < \cdots < p_{m:m}\) denote the ordered marginal \(p\)-values for a multiple test problem. For a tuning parameter \(\kappa \in \{1, \ldots , m\}\) a step-up-down (SUD) test \( \varphi ^{ \kappa } = ( \varphi _1^\kappa , \ldots , \varphi _m^\kappa ) \) of order \( \kappa \) based on some critical values \(\alpha _{1:m} \le \cdots \le \alpha _{m:m}\) is defined as follows. If \( p_{\kappa :m} \le \alpha _{\kappa :m} \), set \( j^{*} = \max \{ j \in \{ \kappa , \ldots , m \} : p_{i:m} \le \alpha _{i:m} \text{ for } \text{ all } i \in \{ \kappa , \ldots , j \} \} \), whereas for \( p_{\kappa :m} > \alpha _{\kappa :m} \), put \( j^{*} = \sup \{ j \in \{ 1 , \ldots , \kappa - 1 \} : p_{j:m} \le \alpha _{j:m} \} \) \(( \sup \emptyset = - \infty )\). Define \( \varphi _i^\kappa = 1 \) if \( p_i \le \alpha _{j^{*}:m} \) and \( \varphi _i^\kappa = 0 \) otherwise \( ( \alpha _{- \infty :m} = - \infty ) \).

A step-up-down test of order \( \kappa = 1 \) or \( \kappa = m \), respectively, is called step-down (SD) or step-up (SU) test, respectively. If all critical values are identical, we obtain a single-step test.
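Definition 3.1 translates directly into code. A sketch (function name ours, for illustration):

```python
import numpy as np

def sud_test(pvals, crit, kappa):
    """Step-up-down test of order kappa as in Definition 3.1.

    pvals : the m marginal p-values p_1, ..., p_m
    crit  : non-decreasing critical values alpha_{1:m} <= ... <= alpha_{m:m}
    kappa : tuning parameter in {1, ..., m}; kappa = 1 gives the
            step-down, kappa = m the step-up test
    Returns the rejection vector (True means reject H_i)."""
    p = np.asarray(pvals, dtype=float)
    m = len(p)
    p_sorted = np.sort(p)      # p_{1:m} <= ... <= p_{m:m}
    k = kappa - 1              # 0-based position of p_{kappa:m}
    if p_sorted[k] <= crit[k]:
        # largest j >= kappa with p_{i:m} <= alpha_{i:m} for all i in {kappa,...,j}
        j_star = k
        while j_star + 1 < m and p_sorted[j_star + 1] <= crit[j_star + 1]:
            j_star += 1
    else:
        # largest j < kappa with p_{j:m} <= alpha_{j:m}; sup(empty set): retain all
        candidates = [j for j in range(k) if p_sorted[j] <= crit[j]]
        if not candidates:
            return np.zeros(m, dtype=bool)
        j_star = candidates[-1]
    return p <= crit[j_star]
```

With Holm's critical values \(\alpha_{i:m} = \alpha/(m-i+1)\) and \(\kappa = 1\), this reproduces Holm's step-down test; with Simes' critical values \(\alpha_{i:m} = i\alpha/m\) and \(\kappa = m\), it reproduces the linear step-up test of Benjamini and Hochberg.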

Figure 3.1 illustrates the decision rule of an SUD test schematically.

Fig. 3.1

Decision rule of an SUD test. If \(\kappa = m\) (SU test) and \(p_{m:m} \le \alpha _{m:m}\), all \(m\) null hypotheses are rejected. If \(\kappa = 1\) (SD test) and \(p_{1:m} > \alpha _{1:m}\), all \(m\) null hypotheses are retained

As we will discuss in Chap. 5, many commonly used step-up-down tests are margin-based and only employ qualitative assumptions regarding the joint distribution of test statistics or \(p\)-values. For instance, this holds true for the multiple tests by Holm (1979) (which are FWER-controlling step-down tests) and the famous linear step-up test by Benjamini and Hochberg (1995) for FDR control. However, there are remarkable exceptions, especially shortcuts of closed test procedures, cf. Sect. 3.3. The following obvious lemma can be used to compare different SUD tests which keep the same type I error criterion.

Lemma 3.1.

Consider two SUD tests \(\varphi ^{(1)}\) and \(\varphi ^{(2)}\) for the same multiple test problem \(({\fancyscript{X}}, {\fancyscript{F}}, ({\mathbb {P}}_{\vartheta })_{\vartheta \in {\varTheta }}, {\fancyscript{H}}_m)\). Assume that one of the following properties holds true.

  (a) The two tests \(\varphi ^{(1)}\) and \(\varphi ^{(2)}\) employ the same set of critical values and the tuning parameter \(\kappa _2\) of \(\varphi ^{(2)}\) is larger than the tuning parameter \(\kappa _1\) of \(\varphi ^{(1)}\).

  (b) The two tests \(\varphi ^{(1)}\) and \(\varphi ^{(2)}\) employ the same tuning parameter \(\kappa \) and the critical values utilized in \(\varphi ^{(2)}\) are index-wise not smaller than the ones utilized in \(\varphi ^{(1)}\).

  (c) Both tests \(\varphi ^{(1)}\) and \(\varphi ^{(2)}\) are single-step tests and the critical value utilized in \(\varphi ^{(2)}\) is larger than that utilized in \(\varphi ^{(1)}\).

Then, for any realization of \((p_1, \ldots , p_m)^\top \), \(\varphi ^{(2)}\) rejects all hypotheses that are rejected by \(\varphi ^{(1)}\), and possibly more.

Hence, under the constraint of type I error control of given type and at given level, an optimal SUD test (with respect to multiple power, cf. Definition 1.4) is given by choosing \(\kappa \) and \(\alpha _{1:m}, \ldots , \alpha _{m:m}\) as large as possible. For instance, SU tests have multiple power not smaller than that of the corresponding SD tests (with the same set of critical values). On the other hand, the same ordering holds true for the FWER. Let us mention that additional assumptions are required in order that more rejections entail a larger FDR, cf. Theorem 5.7.

Notice that we implicitly used part (c) of Lemma 3.1 for the comparison of Bonferroni tests and Šidák tests. In Chap. 5, Lemma 3.1 will be used for discussing relationships between the dependency structure among \(p_1, \ldots , p_m\) and the choice of tuning parameters and critical values for SUD tests.

1.2.2 Fixed Sequence Multiple Tests

Similarly to step-up-down tests, fixed sequence multiple tests also rely on an ordering of the hypotheses \(H_1, \ldots , H_m\). However, the ordering is now not data-dependently given by the ordering of \(p\)-values or test statistics, but is pre-defined before testing starts, for instance by weighting the hypotheses for importance. With respect to control of the FWER, the following fixed sequence procedure is widely used.

Theorem 3.2.

Let \(({\fancyscript{X}}, {\fancyscript{F}}, {\fancyscript{P}}, {\fancyscript{H}})\) with \({\fancyscript{H}} = (H_i: 1 \le i \le m)\) denote a multiple test problem and assume that valid marginal \(p\)-values \(p_1, \ldots , p_m\) are at hand. Let \(\alpha \in (0, 1)\) be a given constant and consider the multiple test \(\varphi \) defined by the following rule: Reject exactly hypotheses \(H_1, \ldots , H_{k^*}\), where

$$\begin{aligned} k^* = \max \{1 \le i \le m: p_j \le \alpha \text { for all } j=1, \ldots , i\}. \end{aligned}$$

If \(k^*\) does not exist, retain all \(m\) null hypotheses. Then, \(\varphi \) strongly controls the FWER at level \(\alpha \).
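The rule of Theorem 3.2 amounts to testing along the pre-defined order at full level \(\alpha\) and stopping at the first non-rejection. A minimal sketch (function name ours, for illustration):

```python
def fixed_sequence_test(pvals, alpha=0.05):
    """Fixed sequence test as in Theorem 3.2: test H_1, H_2, ... in the
    pre-defined order, each at full level alpha; stop at the first
    p-value exceeding alpha and retain all subsequent hypotheses
    without testing them (function name ours)."""
    reject = [False] * len(pvals)
    for i, p in enumerate(pvals):
        if p > alpha:
            break
        reject[i] = True
    return reject
```

Note that a small \(p\)-value late in the sequence (the last entry in the example below) cannot lead to a rejection once an earlier hypothesis has been retained.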

Proof.

First, consider the case \(m = 2\). We have to distinguish four cases.

  1. If both \(H_1\) and \(H_2\) are false, no type I error can occur, hence \(\text {FWER}_\vartheta (\varphi ) = 0\) for such \(\vartheta \).

  2. If only \(H_1\) is true, \(\text {FWER}_\vartheta (\varphi ) = \mathbb {P}_\vartheta (p_1(X) \le \alpha ) \le \alpha \).

  3. If only \(H_2\) is true, \(\text {FWER}_\vartheta (\varphi ) = \mathbb {P}_\vartheta (\{p_1(X) \le \alpha \} \cap \{p_2(X) \le \alpha \}) \le \mathbb {P}_\vartheta (p_2(X) \le \alpha ) \le \alpha \).

  4. If both \(H_1\) and \(H_2\) are true, \(\text {FWER}_\vartheta (\varphi ) = \mathbb {P}_\vartheta (p_1(X) \le \alpha ) \le \alpha \).

It is easy to check that the latter reasoning carries over to \(m > 2\). \(\square \)

The obvious drawback of the multiple test \(\varphi \) from Theorem 3.2 is that, once a particular hypothesis cannot be rejected, the remaining not yet rejected hypotheses have to be retained without being tested explicitly. Wiens (2003) developed a method based on a Bonferroni-type adjustment of \(\alpha \) that allows for continuing testing after potential non-rejections. Other related testing strategies for fixed sequences of (pre-ordered) hypotheses ensuring strict FWER control have been discussed by Westfall and Krishen (2001) and Bauer et al. (1998), among many others. Such methods are particularly important for clinical trials with multiple endpoints.

1.3 Data-Adaptive Procedures

From the calculations in Examples 3.1 and 3.2, it follows that the realized \(k\)-FWER of the investigated margin-based multiple tests crucially depends on the proportion \(\pi _0 = m_0 / m\) of true null hypotheses. In Chap. 5, we will show that the same holds true for the realized FDR of many classical step-up-down tests. Data-adaptive procedures aim at adapting to the unknown quantity \(\pi _0\) in order to exhaust the type I error level better and, consequently, increase multiple power of standard procedures. Explicitly adaptive (plug-in) procedures employ an estimate \(\hat{\pi }_0\) and plug \(\hat{\pi }_0\) into critical values, typically replacing \(m\) by \(m \cdot \hat{\pi }_0\). In view of Definition 1.4 and Lemma 3.1, this increases multiple power at least on parameter subspaces on which \({\mathbb {P}}_\vartheta (\hat{\pi }_0 < 1)\) is large.

Probably the most popular, and at the same time the oldest, estimation technique for \(\pi _0\) is that of Schweder and Spjøtvoll (1982). It relies on a tuning parameter \(\lambda \in [0, 1)\). Denoting the empirical cumulative distribution function (ecdf) of the \(m\) marginal \(p\)-values by \(\hat{F}_m\), the estimator proposed by Schweder and Spjøtvoll (1982) can be written as

$$\begin{aligned} \hat{\pi }_0 \equiv \hat{\pi }_0(\lambda ) = \frac{ 1 - \hat{F}_m(\lambda )}{1 - \lambda }. \end{aligned}$$
(3.2)

Among others, Storey et al. (2004), Langaas et al. (2005), Finner and Gontscharuk (2009), Dickhaus et al. (2012) and Dickhaus (2013) have investigated theoretical properties of \(\hat{\pi }_0\) and slightly modified versions of this estimator. There exist several possible heuristic motivations for the usage of \(\hat{\pi }_0\). The simplest one considers a histogram of the marginal \(p\)-values with exactly two bins, namely \([0, \lambda ]\) and \((\lambda , 1]\). Then, the height of the bin associated with \((\lambda , 1]\) equals \(\hat{\pi }_0(\lambda )\), see graph (a) in Fig. 3.2. A graphical algorithm for computing \(\hat{\pi }_0\) connects the point \((\lambda , \hat{F}_m(\lambda ))\) with the point \((1, 1)\). The intercept of the resulting straight line at \(t = 0\) equals \(\hat{\pi }_1 = \hat{\pi }_1(\lambda )= 1 - \hat{\pi }_0(\lambda )\), see graph (b) in Fig. 3.2.

Fig. 3.2

Two graphical representations of the Schweder-Spjøtvoll estimator \(\hat{\pi }_0(\lambda )\)
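Estimator (3.2) is straightforward to compute: it is the fraction of \(p\)-values exceeding \(\lambda\), rescaled by \(1/(1-\lambda)\). A sketch (function name ours, for illustration):

```python
def pi0_estimate(pvals, lam=0.5):
    """Schweder-Spjotvoll estimator (3.2) of pi_0: the proportion of
    p-values exceeding lambda, rescaled by 1 / (1 - lambda).  Note that
    the estimate may exceed 1; in practice it is often truncated at 1
    (function name ours)."""
    m = len(pvals)
    return sum(p > lam for p in pvals) / (m * (1.0 - lam))
```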

The following lemma is due to Dickhaus et al. (2012), see Lemma 1 in their paper.

Lemma 3.2.

Whenever \((p_1, \ldots , p_m)\) are valid \(p\)-values, i.e., marginally stochastically not smaller than \(\text{ UNI }[0, 1]\) under null hypotheses, the value of \(\hat{\pi }_0\) is a conservative estimate of \(\pi _0\), meaning that \(\hat{\pi }_0\) has a non-negative bias. More specifically, it holds

$$\begin{aligned} {\mathbb {E}}_{\vartheta }[\hat{\pi }_0(\lambda )] - \pi _0 \ge \frac{1}{m (1-\lambda )} \sum _{i \in I_1} {\mathbb {P}}_{\vartheta }(p_i > \lambda ) \ge 0. \end{aligned}$$

The data-adaptive Bonferroni plug-in (BPI) test by Finner and Gontscharuk (2009) replaces \(m\) by \(m \cdot \hat{\pi }_0\) in the Bonferroni-corrected threshold for marginal \(p\)-values and the asymptotic version of the data-adaptive multiple test procedure by Storey et al. (2004) (STS test) replaces \(m\) by \(m \cdot \hat{\pi }_0\) in Simes’ critical values, cf. Sect. 5.3.
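The plug-in idea for the BPI test can be sketched as follows (function name ours; the Schweder-Spjøtvoll estimate, truncated at 1, replaces \(m\) by \(m \cdot \hat{\pi}_0\) in the Bonferroni threshold; the exact finite-sample calibration of Finner and Gontscharuk (2009) involves further adjustments not reproduced here):

```python
def bpi_test(pvals, alpha=0.05, lam=0.5):
    """Sketch of a Bonferroni plug-in (BPI) style test: reject H_i iff
    p_i < alpha / (m * pi0_hat), with pi0_hat the Schweder-Spjotvoll
    estimate (3.2) truncated at 1 (function name ours)."""
    m = len(pvals)
    pi0_hat = min(1.0, sum(p > lam for p in pvals) / (m * (1.0 - lam)))
    # degenerate case pi0_hat = 0: every null hypothesis appears false
    threshold = alpha / (m * pi0_hat) if pi0_hat > 0 else 1.0
    return [p < threshold for p in pvals]
```

When many \(p\)-values are small, \(\hat{\pi}_0 < 1\) and the local level exceeds the plain Bonferroni level \(\alpha/m\), which is exactly the intended power gain.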

Another class of data-adaptive multiple tests is constituted by two-stage or multistage adaptive procedures, see Benjamini and Hochberg (2000) or Benjamini et al. (2006), for example. Such methods employ the number of rejections of a multiple test applied in the first stage in an estimator for \(m_0\). This estimator is then used to calibrate the second stage test which leads to the actual decisions, where this principle may be applied iteratively. A third class of methods is given by implicitly adaptive procedures. Here, the idea is to find critical values that automatically (for as many values of \(\pi _0\) as possible) lead to full exhaustion of the type I error level. To this end, worst-case situations (i.e., LFCs) build the basis for the respective calculations. We will present some of such implicitly adaptive multiple tests in Sect. 5.5. Further estimation techniques for \(\pi _0\) have also been proposed in the multiple testing literature. We refer the reader to the introduction in Finner and Gontscharuk (2009) for an overview.

2 Multivariate Multiple Test Procedures

The basic idea behind multivariate multiple test procedures is to incorporate the dependency structure of the data explicitly into the multiple test and thereby to optimize its power. The general reason why this is often possible is that margin-based procedures which control a specific multiple type I error rate have to provide this multiple type I error control generically over a potentially very large family of dependency structures. Hence, if it is possible to derive or to approximate the particular dependency structure for the data-generating distribution at hand, this information may be helpful to fine-tune a multiple test for this specific case. This is particularly important for applications from modern life sciences, because the data there are often spatially, temporally, or spatio-temporally correlated as we will demonstrate in later chapters. Three alternative ways to approximate dependency structures are resampling (Sect. 3.2.1), proving asymptotic normality by means of central limit theorems (Sect. 3.2.2), and fitting copula models (Sect. 3.2.3).

2.1 Resampling-Based Methods

It is fair to say that the basic reference for resampling-based FWER control is the book by Westfall and Young (1993), who introduced simultaneous and step-down multiple tests based on resampling under the assumption of subset pivotality (see Definition 4.3, basically meaning that the joint distribution of test statistics corresponding to true null hypotheses does not depend on the distribution of the remaining test statistics such that resampling under the global hypothesis \(H_0\) is not only providing weak, but also strong FWER control). This assumption has been criticized as too restrictive such that (among others) Troendle (1995) and Romano and Wolf (2005a, b) generalized the methods of Westfall and Young (1993) to dispense with subset pivotality.

FDR-controlling (asymptotic) multiple tests based on resampling have been derived by Yekutieli and Benjamini (1999), Troendle (2000), and Romano et al. (2008). The resampling methods developed by Dudoit and van der Laan (2008) (see also the references therein) provide a general framework for controlling a variety of error rates (some of which we have introduced in Definitions 1.2 and 1.3), with particular emphasis on applications in genetics. While resampling often only asymptotically (for the sample size \(n\) tending to infinity) reproduces the true data distribution, Arlot et al. (2010) provide an in-depth study of resampling methods that control the FWER strictly for finite \(n\).

2.2 Methods Based on Central Limit Theorems

Asymptotic normality of moment and maximum likelihood estimators is a classical result in mathematical statistics, see, for instance, Chap. 12 of Lehmann and Romano (2005) or Chap. 5 of Van der Vaart (1998). We will discuss the special cases of multiple linear regression models and of generalized linear models in Chap. 4. If the vector \(T\) of test statistics for a given multiple test problem is (a transformation of) such an asymptotically normal point estimator, the asymptotic distribution of \(T\) can be derived and utilized for calibrating the multiple test. This has been demonstrated, for instance, by Hothorn et al. (2008) and Bretz et al. (2010) in general parametric models. For particular applications in genetic association studies (cf. Chap. 9), central limit theorems for multinomial distributions, together with positive dependency properties of multivariate chi-square distributions, have been exploited by Moskvina and Schmidt (2008) and Dickhaus and Stange (2013) (see also the references therein).

2.3 Copula-Based Methods

As discussed in Chap. 2, \(p\)-values are under certain assumptions uniformly distributed on \([0, 1]\) under null hypotheses. In particular, this holds true in many models which are typically used in life science applications. One example is the problem of multiple testing for differential gene expression, see Chap. 10. Hence, according to Theorem 2.4, in such cases it suffices to estimate the (often unknown) copula of \(p_1(X), \ldots , p_m(X)\) in order to calibrate a multivariate multiple test procedure operating on these \(p\)-values. In particular, parametric copula models are convenient, because the dependency structure can in such models be condensed into a low-dimensional copula parameter. A flexible class of copula models is constituted by the family of Archimedean copulae.

Definition 3.2

(Archimedean copula). The joint distribution of the random vector \((p_i(X): 1 \le i \le m)\) under \(\vartheta \in {\varTheta }\) is given by an Archimedean copula with copula generator \(\psi \), if for all \((t_1, \ldots , t_m)^\top \in [0, 1]^m\),

$$\begin{aligned} {\mathbb {P}}_{\vartheta , \psi }(p_1(X) \le t_1, \ldots , p_m(X) \le t_m) = \psi \left( \sum _{i=1}^m \psi ^{-1}\left( F_{p_i(X)}(t_i)\right) \right) , \end{aligned}$$
(3.3)

where \(F_{p_i(X)}\) denotes the marginal cdf of \(p_i(X)\) under \(\vartheta \in {\varTheta }\).
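For illustration, the right-hand side of (3.3) can be evaluated directly once a generator is fixed. The sketch below (function names ours) uses the Clayton generator \(\psi(t) = (1+t)^{-1/\eta}\), \(\eta > 0\), a standard Archimedean example which is not singled out in the text, and assumes uniformly distributed \(p\)-values, i.e., \(F_{p_i(X)}(t) = t\):

```python
def clayton_psi(t, eta):
    """Clayton generator psi(t) = (1 + t)^(-1/eta), eta > 0."""
    return (1.0 + t) ** (-1.0 / eta)

def clayton_psi_inv(u, eta):
    """Inverse generator psi^{-1}(u) = u^(-eta) - 1."""
    return u ** (-eta) - 1.0

def archimedean_cdf(ts, eta):
    """Evaluate (3.3) with uniform margins F_{p_i(X)}(t) = t under the
    Clayton generator (names ours, for illustration)."""
    return clayton_psi(sum(clayton_psi_inv(t, eta) for t in ts), eta)
```

For \(m = 2\) this reproduces the familiar bivariate Clayton copula \(C_\eta(u, v) = (u^{-\eta} + v^{-\eta} - 1)^{-1/\eta}\), and setting one argument to 1 recovers the marginal cdf, as it must for any copula.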

Dickhaus and Gierl (2013) demonstrated the usage of Archimedean copula models for FWER control, while Bodnar and Dickhaus (2013) are concerned with FDR control under Archimedean \(p\)-value copulae. If the generator \(\psi \) only depends on a copula parameter \(\eta \) (say), standard parametric estimation approaches can be employed to estimate \(\eta \). Two plausible estimation strategies are the maximum likelihood method (see, e.g., Hofert et al. (2012)) or the method of moments (referred to as “realized copula” method by Fengler and Okhrin (2012)). For the latter approach, the “inversion formulas” provided in the following lemma are helpful.

Lemma 3.3.

Let \(X\) and \(Y\) be two real-valued random variables with marginal cdfs \(F_X\) and \(F_Y\) and bivariate copula \(C_\eta \), depending on a copula parameter \(\eta \). Let \(\sigma _{X, Y}\), \(\rho _{X, Y}\) and \(\tau _{X, Y}\) denote (the population versions of) the covariance, Spearman’s rank correlation coefficient and Kendall’s tau, respectively, of \(X\) and \(Y\). Then it holds:

$$\begin{aligned} \sigma _{X, Y} = f_1(\eta )&= \int _{\mathbb {R}^2} \left[ C_\eta \{F_X(x), F_Y(y)\} - F_X(x) F_Y(y)\right] dx \, dy, \end{aligned}$$
(3.4)
$$\begin{aligned} \rho _{X, Y} = f_2(\eta )&= 12 \int _{{[0,1]}^2} C_\eta (u, v) \, du \, dv - 3, \end{aligned}$$
(3.5)
$$\begin{aligned} \tau _{X, Y} = f_3(\eta )&= 4 \int _{{[0,1]}^2} C_\eta (u, v) \, dC_\eta (u, v) - 1. \end{aligned}$$
(3.6)

Proof.

Equation (3.4) is due to Höffding (1940), Eq. (3.5) is Theorem 5.1.6. in Nelsen (2006) and (3.6) is Theorem 5.1.3 in Nelsen (2006). \(\square \)

The “realized copula” method for empirical calibration of a one-dimensional parameter \(\eta \) of an \(m\)-variate copula essentially considers each of the \(m (m-1)/2\) pairs of the \(m\) underlying random variables \(X_1, \ldots , X_m\), inverts (3.4) each time with respect to \(\eta \), replaces the population covariance by its empirical counterpart and aggregates the resulting \(m (m-1)/2\) estimates in an appropriate way. More specifically, Fengler and Okhrin (2012) define for \(1 \le i < j \le m\): \(g_{ij}(\eta ) = \hat{\sigma }_{ij} - f_1(\eta )\), set \(\mathbf {g}(\eta ) = (g_{ij}(\eta ))_{1 \le i < j \le m}\), and propose to estimate

$$\begin{aligned} \hat{\eta } = \arg \min _{\eta } {\mathbf {g}}^\top (\eta ) {\mathbf {W}} \mathbf {g}(\eta ) \end{aligned}$$

for an appropriate weight matrix \({\mathbf {W}} \in {\mathbb {R}}^{{\left( {\begin{array}{c}m\\ 2\end{array}}\right) } \times {\left( {\begin{array}{c}m\\ 2\end{array}}\right) }}\). Here, \(\hat{\sigma }_{ij}\) denotes the empirical covariance of \(X_i\) and \(X_j\). Indeed, any of the functions \(f_\ell \), \(\ell = 1,2,3\), corresponding to relationships (3.4)–(3.6) may be employed in this realized copula method. Moreover, they may be combined to estimate two- or three-dimensional copula parameters \(\eta \).
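As a concrete illustration of the inversion idea, consider a family where \(f_3\) in (3.6) is available in closed form. For the Clayton family (used here purely as an example, not singled out in the text) it is known that \(\tau = \eta/(\eta + 2)\), so (3.6) can be inverted explicitly as \(\eta = 2\tau/(1-\tau)\). A sketch (function names ours), with a small ties-free sample version of Kendall's tau:

```python
from itertools import combinations

def kendall_tau(x, y):
    """Sample version of Kendall's tau (no ties assumed): the average
    of +1 for concordant and -1 for discordant pairs."""
    n = len(x)
    s = sum(1 if (x[i] - x[j]) * (y[i] - y[j]) > 0 else -1
            for i, j in combinations(range(n), 2))
    return s / (n * (n - 1) / 2)

def clayton_eta_from_tau(x, y):
    """Moment-type estimate of eta by inverting (3.6) for the Clayton
    family, where f_3(eta) = eta / (eta + 2) (names ours)."""
    tau = kendall_tau(x, y)
    return 2.0 * tau / (1.0 - tau)
```

Replacing the population quantity \(\tau_{X,Y}\) by its empirical counterpart and inverting \(f_3\) is exactly the moment-matching step; aggregating such pairwise estimates over all \(m(m-1)/2\) pairs yields the realized copula estimate described above.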

In the particular context of estimating \(p\)-value copulae in multiple testing models, it is infeasible to actually draw independent replications of the vector \((p_i(X): 1 \le i \le m)\) from the target population, because this would essentially mean to carry out the entire experiment several times. Hence, one typically employs resampling methods for estimating the dependency structure among the \(p\)-values, namely the parametric bootstrap or permutations if \(H_1, \ldots , H_m\) correspond to marginal two-sample problems. Pollard and van der Laan (2004) compared both approaches and argued that the permutation method reproduces the correct null distribution only under some conditions. However, if these conditions are met, the permutation approach is often superior to bootstrapping (see also Westfall and Young (1993) and Meinshausen et al. (2011)). Furthermore, it is important to notice that both bootstrap and permutation-based methods estimate the joint distribution of \((p_i(X): 1 \le i \le m)\) under the global null hypothesis \(H_0\). Hence, the assumption that \(\eta \) is a nuisance parameter which does not depend on \(\vartheta \) is an essential prerequisite for the applicability of such resampling methods for estimating \(\eta \).

3 Closed Test Procedures

An important class of FWER-controlling multiple tests which do not exactly fall into one of the categories “margin-based” and “multivariate” is constituted by closed test procedures, introduced by Marcus et al. (1976).

Theorem 3.3.

Let \({\fancyscript{H}} = \{H_i: i \in I\}\) denote a \(\cap \)-closed system of hypotheses and \(\varphi = (\varphi _i: i \in I)\) a coherent multiple test for \(({\fancyscript{X}}, {\fancyscript{F}}, {\fancyscript{P}}, {\fancyscript{H}})\) at local level \(\alpha \). Then, \(\varphi \) is a strongly FWER-controlling multiple test at FWER level \(\alpha \) for \(({\fancyscript{X}}, {\fancyscript{F}}, {\fancyscript{P}}, {\fancyscript{H}})\).

Proof.

Let \(\vartheta \in {\varTheta }\) with \(I_0(\vartheta ) \not = \emptyset \). Since \({\fancyscript{H}}\) is \(\cap \)-closed, there exists an \(i \in I\) with \(H_i = \bigcap _{j \in I_0(\vartheta )} H_j\), and \(\vartheta \in H_i\). Hence, for all \(j \in I_0(\vartheta )\), we have \(H_j \supseteq H_i\). Now, coherence of \(\varphi \) entails \(\{\varphi _i = 1\} \supseteq \bigcup _{j \in I_0(\vartheta )} \{\varphi _j = 1\}\). We conclude that

$$\begin{aligned} \text {FWER}_\vartheta (\varphi ) = \mathbb {P}_\vartheta \left( \bigcup _{j \in I_0(\vartheta )}\{\varphi _j = 1\}\right) \le \mathbb {P}_\vartheta (\{\varphi _i = 1\}) \le \alpha , \end{aligned}$$

because \(\varphi _i\) is a level \(\alpha \) test. \(\square \)

Notice that there is no restriction at all regarding the explicit form of the local level \(\alpha \) tests \(\varphi _i\) in Theorem 3.3. One is completely free in choosing these tests. The decisive property of \(\varphi \), however, is coherence. Not all multiple tests fulfill this property in the first place. This leads to the closed test principle, a “general solution to multiple testing problems” (Sonnemann (2008)).

Theorem 3.4

(Closure Principle, see Marcus et al. (1976), Sonnemann (2008)). Let \(\fancyscript{H} = \{H_i: i \in I\}\) denote a \(\cap \)-closed system of hypotheses and \(\varphi = (\varphi _i: i \in I)\) an (arbitrary) multiple test for \((\fancyscript{X}, \fancyscript{F}, \fancyscript{P}, \fancyscript{H})\) at local level \(\alpha \). Then, we define the closed multiple test procedure (closed test) \(\bar{\varphi } = (\bar{\varphi }_i : i \in I)\) based on \(\varphi \) by

$$\begin{aligned} \forall i \in I: \bar{\varphi }_i(x) = \min _{j: H_j \subseteq H_i} \varphi _j(x). \end{aligned}$$

Then the following assertions hold:

(a) The closed test \(\bar{\varphi }\) strongly controls the FWER at level \(\alpha \).

(b) For all \(\emptyset \not = I' \subset I\), the “restricted” closed test \(\bar{\varphi }' = (\bar{\varphi }_i: i \in I')\) is a strongly FWER-controlling multiple test at level \(\alpha \) for \(\fancyscript{H}' = \{H_i : i \in I'\}\).

(c) Both tests \(\bar{\varphi }\) and \(\bar{\varphi }'\) are coherent.

Proof.

The assertions follow immediately from the definitions of \(\bar{\varphi }\) and \(\bar{\varphi }'\) by making use of Theorem 3.3. \(\square \)
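The closure construction of Theorem 3.4 can be sketched in code. The following is an illustrative sketch, not part of the text: hypotheses are represented as index sets \(J \subseteq \{1, \ldots , m\}\) (so \(H_K \subseteq H_J\) corresponds to \(K \supseteq J\)), and for concreteness the local level-\(\alpha \) tests are chosen as Bonferroni tests of the intersection hypotheses — with this particular choice, the closed test on the elementary hypotheses reproduces Holm's procedure.

```python
from itertools import combinations

def closed_test(p_elementary, alpha):
    """Sketch of the closure principle with Bonferroni local tests.
    p_elementary: dict {i: p_i} of marginal p-values for the elementary
    hypotheses H_i.  Each intersection hypothesis H_J is tested locally
    at level alpha via the Bonferroni bound p_J = |J| * min_{i in J} p_i.
    The closed test rejects H_J iff every H_K with K >= J (i.e., every
    H_K contained in H_J) is locally rejected."""
    idx = list(p_elementary)
    # local decisions for all 2^m - 1 intersection hypotheses
    local_reject = {}
    for r in range(1, len(idx) + 1):
        for J in combinations(idx, r):
            J = frozenset(J)
            p_J = len(J) * min(p_elementary[i] for i in J)
            local_reject[J] = (p_J <= alpha)
    # closed test decision: min over all local tests of subsets of H_J
    return {J for J in local_reject
            if all(local_reject[K] for K in local_reject if K >= J)}
```

For instance, with marginal \(p\)-values \(0.01, 0.04, 0.30\) and \(\alpha = 0.05\), only the first elementary hypothesis is rejected: the second fails because the intersection hypothesis indexed by \(\{2,3\}\) has Bonferroni \(p\)-value \(0.08 > \alpha \).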

Remark 3.1.

(a) The closed test \(\bar{\varphi }\) based on \(\varphi \) rejects a particular hypothesis \(H_i \in \fancyscript{H}\) if and only if \(\varphi \) rejects \(H_i\) and all hypotheses \(H_j \in \fancyscript{H}\) with \(H_j \subseteq H_i\), i.e., all hypotheses of which \(H_i\) is a superset.

(b) If \(\fancyscript{H}\) is not \(\cap \)-closed, then one can extend \(\fancyscript{H}\) by adding all missing intersection hypotheses, leading to the \(\cap \)-closed system of hypotheses \(\bar{\fancyscript{H}}\). If there are \(\ell \) elementary hypotheses in \(\fancyscript{H}\), then \(\bar{\fancyscript{H}}\) can consist of up to \(2^\ell - 1\) hypotheses. However, as we will demonstrate by specific examples, it is typically not necessary to test all elements in \(\bar{\fancyscript{H}}\) explicitly.

(c) Theorem 3.3 shows that, under certain assumptions, a multiple test at local level \(\alpha \) is a strongly FWER-controlling multiple test at level \(\alpha \). The reverse implication is always true: strong FWER control at level \(\alpha \) entails local level \(\alpha \).

(d) If \(\fancyscript{H}\) is disjoint in the sense that \(\forall i,j \in I, i \not =j: H_i \cap H_j = \emptyset \), and \(\varphi \) is a multiple test for \((\fancyscript{X},\fancyscript{F}, \fancyscript{P}, \fancyscript{H})\) at local level \(\alpha \), then \(\varphi \) automatically strongly controls the FWER at level \(\alpha \), because \(\varphi \) is coherent and \(\fancyscript{H}\) is \(\cap \)-closed by the respective definitions. Often, there exist many possibilities for partitioning \({\varTheta }\) into disjoint subsets, leading to the more general partitioning principle, see Finner and Strassburger (2002).

(e) If \(I = {\varTheta }\) and \(H_\vartheta = \{\vartheta \}\) for all \(\vartheta \in {\varTheta }\), and if \(\varphi = (\varphi _\vartheta : \vartheta \in {\varTheta })\) is a multiple test at local level \(\alpha \), then \(\varphi \) strongly controls the FWER at level \(\alpha \).

Fig. 3.3 Closed test for \(\{H_{=}, H_{\le }, H_{\ge }\}\) in the two-sample Gaussian model

A nice application of the closed test principle is the problem of directional or type III errors, cf. Finner (1999) and references therein.

Example 3.3

(Two-sample t-test). Assume that we can observe \(X = (X_{ij})\) for \(i=1,2\) and \(j=1, \ldots , n_i\), that all \(X_{ij}\) are stochastically independent and \(X_{ij} \sim \mathcal{{N}}(\mu _i, \sigma ^2)\) with unknown variance \(\sigma ^2 > 0\). Consider the hypothesis \(H_{=} : \{\mu _1 = \mu _2\}\). The two-sample \(t\)-test \(\varphi _{=}\) (say) for testing \(H_{=}\) is based on the test statistic

$$\begin{aligned} T(X) = \sqrt{\frac{n_1 n_2}{n_1+n_2}} \frac{\bar{X}_{1.} - \bar{X}_{2.}}{S}, \mathrm{{~~where~~}} S^2 = \frac{1}{\nu } \sum _{i=1}^2 \sum _{j=1}^{n_i} (X_{ij} - \bar{X}_{i.})^2, \quad \nu = n_1 + n_2 - 2, \end{aligned}$$

and is given by

$$\begin{aligned} \varphi _{=}(x) = \left\{ \begin{array}{ll} 1, &{} |T(x)| > t_{\nu ; \alpha /2},\\ 0, &{} |T(x)| \le t_{\nu ; \alpha /2}, \end{array} \right. \end{aligned}$$

where \(t_{\nu ; \alpha /2}\) denotes the upper \(\alpha /2\)-quantile of Student’s \(t\)-distribution with \(\nu \) degrees of freedom. Let us restrict our attention to the case \(\alpha \in (0, 1/2)\). The problem of directional or type III errors can be stated as follows. Assume that \(H_{=}\) is rejected by \(\varphi _{=}\). Can one then infer that \(\mu _1 < \mu _2\) (\(\mu _1 > \mu _2\)) if \(T(x) < -t_{\nu ; \alpha /2}\) (\(T(x) > t_{\nu ; \alpha /2}\))? There is the possibility of an error of the third kind, namely, that \(\mu _1 < \mu _2\) and \(T(x) > t_{\nu ; \alpha /2}\) (\(\mu _1 > \mu _2\) and \(T(x) < -t_{\nu ; \alpha /2}\)). The formal mathematical solution to this problem is given by the closed test principle. We add the two hypotheses \(H_{\le } : \{\mu _1 \le \mu _2\}\) and \(H_{\ge } : \{\mu _1 \ge \mu _2\}\) and notice that \(H_{=} = H_{\le } \cap H_{\ge }\). Level \(\alpha \) tests for \(H_{\le }\) and \(H_{\ge }\) are given by one-sided \(t\)-tests, say

$$\begin{aligned} \varphi _{\le }(x) = \left\{ \begin{array}{ll} 1, &{} T(x) > t_{\nu ; \alpha },\\ 0, &{} T(x) \le t_{\nu ; \alpha }, \end{array} \right. \quad \varphi _{\ge }(x) = \left\{ \begin{array}{ll} 1, &{} T(x) < -t_{\nu ; \alpha },\\ 0, &{} T(x) \ge -t_{\nu ; \alpha }. \end{array} \right. \end{aligned}$$

We construct the closed test \(\bar{\varphi } = (\bar{\varphi }_{\le }, \bar{\varphi }_{=}, \bar{\varphi }_{\ge })\), given by \(\bar{\varphi }_{=} = \varphi _{=}\), \(\bar{\varphi }_{\le } = \varphi _{=} \varphi _{\le }\), \(\bar{\varphi }_{\ge } = \varphi _{=} \varphi _{\ge }\).

Due to the nestedness of the rejection regions of \(\varphi _{\le }\) and \(\bar{\varphi }_{\le }\) (\(\varphi _{\ge }\) and \(\bar{\varphi }_{\ge }\)), see Fig. 3.3, it follows from Theorem 3.4 that type III errors are automatically controlled at level \(\alpha \); hence, one-sided decisions after two-sided testing are allowed in this model. This argument further shows that the same holds generally for likelihood ratio test statistics, provided that the model possesses an isotone likelihood ratio.
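As a small sketch of the decision logic in Example 3.3 (not from the text; the quantiles \(t_{\nu ; \alpha /2}\) and \(t_{\nu ; \alpha }\) are supplied by the caller, e.g. from a \(t\)-table, and the function name is illustrative):

```python
def closed_directional_test(t_stat, crit_two_sided, crit_one_sided):
    """Closed test for {H_<=, H_=, H_>=} based on a t statistic.
    crit_two_sided = t_{nu, alpha/2}, crit_one_sided = t_{nu, alpha};
    for alpha < 1/2 we have crit_one_sided < crit_two_sided, so the
    one-sided rejection regions contain the two-sided one (nestedness).
    Returns the set of rejected hypotheses."""
    rejected = set()
    if abs(t_stat) > crit_two_sided:      # phi_= rejects H_=
        rejected.add("H=")
        # closure: bar{phi}_<= = phi_= * phi_<=, bar{phi}_>= = phi_= * phi_>=
        if t_stat > crit_one_sided:       # conclude mu_1 > mu_2
            rejected.add("H<=")
        if t_stat < -crit_one_sided:      # conclude mu_1 < mu_2
            rejected.add("H>=")
    return rejected
```

By the nestedness noted above, whenever \(|T(x)| > t_{\nu ; \alpha /2}\) exactly one of the two one-sided hypotheses is additionally rejected, so the two-sided rejection always comes with a directional decision.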

The presumably most intensively studied application of closed test procedures, however, is the context of analysis of variance models, where linear contrasts regarding the group-specific means are of interest. Since this field of application has already been studied in depth in earlier books (Hochberg and Tamhane (1987), Hsu (1996)), we abstain from covering it here. Closed test-related multiple testing strategies for systems of hypotheses with a tree structure have been worked out by Meinshausen (2008) and Goeman and Finos (2012); see also the references in these papers. In the latter case, power can be gained by exploiting the logical restrictions among the hypotheses which are given by the tree structure. This has some similarities to the methods considered by Westfall and Tobias (2007).