
7.1 Introduction

It is well recognized that a clinical trial of fixed-sample design planned without interim looks can falsely reject a hypothesis of no treatment effect on an endpoint by chance alone. This error, commonly known as the false positive error or the Type I error, can be excessive if the trial tests more than one hypothesis in the same study. This inflation of the Type I error is of concern, as it can lead to false conclusions of treatment benefit in a trial. However, many statistical approaches for confirmatory clinical trials are now available for keeping the probability of falsely rejecting any hypothesis in testing a family of hypotheses (i.e., the familywise Type I error rate) controlled at a specified level; see, for example, the recently released FDA draft guidance “Multiple Endpoints in Clinical Trials” and Alosh et al. (2014).

However, many confirmatory clinical trials accrue patients over many months and enroll hundreds to thousands of patients; this is a widespread practice, for example, in cardiovascular and oncology trials. Investigators, bound by ethical and economic constraints, usually design these large trials with interim looks, with the possibility of stopping the trial early at an interim stage if the study treatment shows clinically relevant efficacy, or if it is futile to continue the study, either for lack of efficacy of the study treatment or for safety concerns. Such trials are commonly known as group sequential (GS) clinical trials. The Type I error rate for GS trials, even for the simplest case of testing a single hypothesis, can be inflated relative to conventional non-GS trials if there is no adjustment for multiple looks, because of the repeated tests of the same hypothesis at interim looks. In GS trials, the same hypothesis is tested at different looks as the trial data accumulate over the time course of the trial, until the hypothesis is rejected or the trial reaches the final look for the last test of the hypothesis. Consequently, for assuring the credibility of a treatment benefit result even for a single-hypothesis GS trial, it is considered necessary to use a prospectively planned statistical adjustment method that controls the probability of a Type I error at a pre-specified level.

There is an extensive literature on GS trials that plan to test a single primary hypothesis with repeated testing on accumulating data observed at different looks, and to stop the trial early at a look either for efficacy or for futility reasons. This literature covers in detail the technical and operational aspects of such trials, explaining how to plan, conduct, and analyze them. Emerson (2007) is an excellent review article on this topic. There are also useful books on this topic, including Whitehead (1997), Jennison and Turnbull (2000), and Proschan et al. (2006), as well as classical papers of historical importance, such as Armitage et al. (1969), Pocock (1977), O’Brien and Fleming (1979), and Lan and DeMets (1983). In addition, there are extensions of the methods to multi-arm group sequential trials, e.g., comparison of multiple doses of the same treatment to a common control on a single primary endpoint with interim looks; see, for example, Follmann et al. (1994), Jennison and Turnbull (2000), Hellmich (2001), and Stallard and Friede (2008).

However, modern clinical trials are designed with multiple endpoints; some of these endpoints are given primary and secondary designations. The primary endpoint family, along with its hypotheses, holds a special position: If the study wins on one or more of its primary endpoint hypotheses then, depending on the level of evidence desired for this win, one can characterize a clinically relevant benefit of the study treatment. In this regard, O’Neill (1997), based on clinical and statistical considerations, made the case that secondary endpoint hypotheses need to be tested only when there is at least one rejection of the primary endpoint hypotheses leading to a clinically relevant benefit of the study treatment. Several innovative statistical procedures for confirmatory clinical trials have been proposed that maximize the power for the tests of the primary hypotheses. In doing so, these approaches consider O’Neill’s stipulation along with the possibility of assigning weights to the different endpoint hypotheses and other logical restrictions. Further, these test procedures control the familywise Type I error rate (FWER) in the “strong sense” (see, e.g., Hochberg and Tamhane 1987), so that conclusions of treatment efficacy can be made at the level of individual endpoints or hypotheses.

There is a fair amount of literature on these novel procedures for fixed-sample clinical trials, but not so for GS clinical trials, which are common in cardiovascular and oncology settings. Examples of such procedures for fixed-sample trial designs include the gatekeeping procedures (see, e.g., Dmitrienko et al. 2003, 2008; Dmitrienko and Tamhane 2009; and Huque et al. 2013, among others) and the graphical procedures (see, e.g., Bretz et al. 2009, 2011, 2014). The development of the gatekeeping procedures and the graphical method has relied, either explicitly or implicitly, on shortcuts to the closed test procedure, as discussed by Hommel et al. (2007). These developments that utilize short-cut testing have been possible for weighted Bonferroni tests of the intersection hypotheses that satisfy the “consonance” property (Hommel et al. 2007). Thereafter, interest has turned to whether a similar approach for testing multiple hypotheses is possible for GS clinical trials. Recent publications, including Glimm et al. (2010), Tamhane et al. (2010), Maurer and Bretz (2013), Ye et al. (2013), Xi and Tamhane (2015), and Xi et al. (2016), have made this possible and have advanced multiple hypothesis testing methods for GS trials.

Tang and Geller (1999) proposed a general closed testing scheme for testing multiple hypotheses in GS clinical trials. This scheme, though conceptually simple to follow, seems complex to apply in practice, except for certain special situations. By taking advantage of the findings of Hommel et al. and of others, we make the case that Tang and Geller’s scheme can be simplified for application purposes by developing short-cut closed test procedures using, for example, the weighted Bonferroni tests. These short-cut procedures for testing multiple hypotheses in GS clinical trials also allow, indirectly, recycling the unused significance level of a rejected hypothesis to the testing of other hypotheses in a trial.

In this chapter, we first review the classical O’Brien-Fleming (OF) and Pocock (PK) approaches as well as the \( \alpha \)-spending function methods for setting the boundaries in a standard GS clinical trial for repeated testing of a single primary hypothesis. Hereafter, we refer to the \( \alpha \)-spending function methods simply as spending function methods. As we will see later, the boundaries computed from the spending function approaches for testing a single hypothesis can still be used for testing multiple hypotheses in GS trials. Consequently, software developed for standard GS trials with a single-hypothesis test can also be used for multiple hypothesis tests. We also touch on the Tang and Geller (1999) closed testing approach, as it is of historical importance, and show that for testing two primary hypotheses of a trial, this approach simplifies when the weighted Bonferroni test is used for testing the intersection hypothesis. We then visit the graphical approach for testing multiple primary and secondary hypotheses of GS trials, as discussed by Maurer and Bretz (2013), and present an illustrative example for testing two primary and two secondary endpoints of a trial. Thereafter, we consider the case in which the trial stops after the rejection of a primary hypothesis at a look, say for ethical reasons, so that the other hypotheses need to be tested at that same look, as discussed by Tamhane et al. (2010). We close this chapter with some concluding remarks. Finally, we point out that all boundaries derived for the GS trials and all tests considered are 1-sided, comparing a study treatment to control.

7.2 Testing of a Single Hypothesis in a GS Trial

As in fixed-sample trials, the endpoints in a GS trial can be continuous, binary, or time-to-event. Although the associated test statistics for these endpoints may appear dissimilar, they share a common property: They can be expressed in terms of the standardized sums of independent observations of a random variable. Consequently, they have asymptotically the same joint distribution across the time points of multiple looks of the data. Therefore, for the sake of simplicity in this chapter, we assume that the multiple endpoints considered are continuous, and that the sample size for each arm of a 2-arm trial designed to compare the study treatment to control remains equal for each endpoint at each look. This case of equal sample sizes can easily be extended to the case in which the treated and control arms of the trial have different sample sizes at a look. Also, we consider the case in which the total sample size for the final look is fixed in advance. In our discussion of GS trials, we do not consider adaptive study designs, in which the investigator modifies the trial design based on earlier results. Adaptive study designs may allow for the possibility of adjusting the sample size of the trial, redefining the endpoint, or modifying the patient population based on the results of an interim look of the data. Methodological approaches for GS trials with such adaptations are more complex, and some of the assumptions and statements made here may not be valid for them. With these considerations, we first consider the case of testing a single endpoint hypothesis \( H_{0} :\delta \le 0 \) against the alternative hypothesis \( H_{a} :\delta > 0 \) for a trial with \( K - 1 \) interim looks and a final look, for a total of \( K \ge 2 \) looks. A positive value of \( \delta \) indicates that the test treatment is better than the control.

7.2.1 Test Statistics and Their Distributions

Consider a 2-arm randomized trial designed to compare a treatment with a control on a single primary endpoint based on a total sample size of N subjects per arm. Let \( S_{{n_{1} }} \) be the sum statistic for the treatment difference at look 1 based on \( n_{1} \) subjects per treatment arm. This sum statistic at look 1 is the sum of endpoint observations on \( n_{1} \) subjects in the treatment arm minus the sum of endpoint observations on \( n_{1} \) subjects in the control arm. Define the B-value at look 1 as

$$ B\left( {t_{1} } \right) = S_{{n_{1} }} /\sqrt {V_{N} } ,\,{\text{where}}\,V_{N} = {\text{Var}}(S_{N} ) = 2\,N\sigma^{2} . $$
(7.2.1)

In (7.2.1), \( S_{N} \) is the sum statistic for the final look, yet to be observed, and \( \sigma^{2} \) is the known variance of individual observations, which remains constant throughout the trial regardless of whether a subject is in the treatment arm or the control arm. The value \( t_{1} \) at look 1, usually known as the information fraction or the information time at look 1, is given by

$$ {\text{Var}}\{ B\left( {t_{1} } \right)\} = n_{1} /N = t_{1} . $$
(7.2.2)

Note that calling \( n_{1} /N = t_{1} \) the information fraction or information time here assumes that the sample sizes for the treatment and control groups are equal at each look and that the variance of individual observations remains constant. In general, if \( d_{1} \) and \( d_{2} \) denote asymptotically normal estimates of a treatment group difference at the interim and final looks, then the information fraction is defined as \( I = Var\left( {d_{2} } \right)/Var\left( {d_{1} } \right) \). For normal outcomes, the information time is the proportion of data available at the interim look, relative to the planned maximum if the trial is not stopped early. However, in presenting our results, for simplicity, we maintain our assumptions of equal sample sizes and constant variance. These results extend easily to the general case (Jennison and Turnbull 2000).
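As a small illustration of the general definition \( I = Var\left( {d_{2} } \right)/Var\left( {d_{1} } \right) \) (a sketch of ours, not from the chapter; the arm sizes below are arbitrary), consider a difference in means with common variance \( \sigma^{2} \), where \( Var(d) = \sigma^{2}(1/n_{T} + 1/n_{C}) \):

```python
# Sketch: information fraction I = Var(d_final) / Var(d_interim) for a
# mean difference, allowing unequal arm sizes (hypothetical numbers).
def information_fraction(n_trt, n_ctl, N_trt, N_ctl, sigma2=1.0):
    """Var(d) = sigma2 * (1/n_trt + 1/n_ctl) for a difference in means."""
    var_interim = sigma2 * (1.0 / n_trt + 1.0 / n_ctl)
    var_final = sigma2 * (1.0 / N_trt + 1.0 / N_ctl)
    return var_final / var_interim

# With equal arms this reduces to n1/N, as in (7.2.2):
print(information_fraction(50, 50, 100, 100))  # 0.5
# With unequal interim arms (e.g., staggered accrual):
print(information_fraction(40, 60, 100, 100))  # 0.48
```

Note that the common variance \( \sigma^{2} \) cancels in the ratio, so only the arm sizes matter here.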

The standardized test statistic \( Z\left( {t_{1} } \right) \) for testing \( H_{0} \) at look 1 can then be expressed as

$$ Z\left( {t_{1} } \right) = S_{{n_{1} }} /\sqrt {V_{{n_{1} }} } = (S_{{n_{1} }} /\sqrt {V_{N} } )\sqrt {V_{N} /V_{{n_{1} }} } = B\left( {t_{1} } \right)/\sqrt {t_{1} } . $$
(7.2.3)

The relationship in (7.2.3) follows from \( {\text{Var}}(S_{{n_{1} }} ) = 2n_{1} \sigma^{2} \) and \( V_{N} /V_{{n_{1} }} = 1/t_{1} \). Now consider the second look with the sample size of \( n_{2} = n_{1} + r \) per treatment arm. Then \( B\left( {t_{2} } \right) = (S_{{n_{1} }} + S_{r} )/\sqrt {V_{N} } \) where \( S_{r} \) is the sum statistic for the treatment difference based on the new data available at look 2. Consequently,

$$ {\text{Var}}\{ B\left( {t_{2} } \right)\} = n_{2} /N = t_{2} ,{\text{Cov}}\{ B\left( {t_{1} } \right), B\left( {t_{2} } \right)\} = t_{1} , $$

and

$$ {\text{Corr}}\{ B\left( {t_{1} } \right), B\left( {t_{2} } \right)\} = {\text{Corr}}\left\{ { Z\left( {t_{1} } \right), Z\left( {t_{2} } \right)} \right\} = \sqrt {t_{1} /t_{2} } \,\,{\text{for}}\,t_{1} \le t_{2}. $$
(7.2.4)

Given \( t_{1} \le t_{2} \le \cdots \le t_{k} \le \cdots \le t_{K} = 1 \), we assume that \( B\left( {t_{1} } \right),B(t_{2} ), \ldots ,B\left( {t_{K} } \right) \) follow a multivariate normal distribution with

$$ {\text{E}}\{ B\left( {t_{k} } \right)\} = 0\,{\text{under}}\,H_{0} \, {\text{and}}\,{\text{Cov}}\{ B\left( {t_{k} } \right), B\left( {t_{l} } \right)\} = t_{k} \,\,{\text{for}}\,t_{k} \le t_{l} \le t_{K} . $$
(7.2.5)

Therefore, the normal Z-statistics \( \{ Z\left( {t_{k} } \right) = B\left( {t_{k} } \right)/\sqrt {t_{k} } \} \) for \( k = 1, \ldots ,K \) follow a multivariate normal distribution with

$$ {\text{E}}\{ Z\left( {t_{k} } \right)\} = 0\,{\text{under}}\,H_{0} \,{\text{and}}\,{\text{Cov}}\{ Z\left( {t_{k} } \right), Z\left( {t_{l} } \right)\} = \sqrt {t_{k} /t_{l} } \,{\text{for}}\,t_{k} \le t_{l} \le t_{K} . $$
(7.2.6)
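The covariance structure in (7.2.4) and (7.2.6) can be checked by simulating subject-level data; the following is an illustration of ours (the per-arm sample sizes, number of replicates, and seed are arbitrary choices):

```python
# Simulation sketch: verify Corr{Z(t1), Z(t2)} = sqrt(t1/t2) when the
# B-values are built from subject-level normal data as in (7.2.1).
import numpy as np

rng = np.random.default_rng(0)
N, n1 = 100, 40            # per-arm final and interim sample sizes, t1 = 0.4
sigma = 1.0
reps = 20000               # number of simulated trials

trt = rng.normal(0.0, sigma, size=(reps, N))
ctl = rng.normal(0.0, sigma, size=(reps, N))
V_N = 2 * N * sigma**2     # Var(S_N) under equal arms

# B-values at the interim and final looks
B1 = (trt[:, :n1].sum(axis=1) - ctl[:, :n1].sum(axis=1)) / np.sqrt(V_N)
B2 = (trt.sum(axis=1) - ctl.sum(axis=1)) / np.sqrt(V_N)

t1, t2 = n1 / N, 1.0
Z1, Z2 = B1 / np.sqrt(t1), B2 / np.sqrt(t2)
# Empirical correlation vs. the theoretical value sqrt(t1/t2) = 0.632...
print(np.corrcoef(Z1, Z2)[0, 1], np.sqrt(t1 / t2))
```

The empirical correlation and variance of the B-values agree with (7.2.5) up to Monte Carlo error.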

The non-central expected value of \( B\left( {t_{k} } \right) \) in terms of the information fraction \( t_{k} \) is given by:

$$ {\text{E}}\{ B\left( {t_{k} } \right)\} = n_{k} \delta /\sqrt {2N\sigma^{2} } = (n_{k} /N)\sqrt {N/2} \left( {\delta /\sigma } \right) = t_{k} \theta , $$
(7.2.7)

where \( \theta = \sqrt {N/2} \left( {\delta /\sigma } \right) \) is the “drift parameter.” Consequently, the non-central expected value is \( {\text{E}}\{ Z\left( {t_{k} } \right)\} = \sqrt {t_{k} } \theta . \)

Note that \( \theta = z_{1 - \alpha } + z_{1 - \beta } \) for a fixed-sample non-GS trial, where for such a trial, \( \alpha \) is the probability of falsely rejecting the null hypothesis \( H_{0} : \delta \le 0 \) of no treatment effect in favor of the alternative hypothesis \( H_{a} : \delta > 0 \) of a treatment effect, and the power 1−β is the probability of rejecting \( H_{0} \) given the true treatment difference \( \delta = \delta_{0} > 0 \). For example, when α = 0.025 and the power 1−β = 0.90, then \( \theta = 3.2415 \). Here the notation \( z_{1 - x} \) stands for the deviate such that \( { \Pr }(U \le z_{1 - x} ) = 1 - x \) with 0 ≤ x ≤ 1, where U is the standard normal N(0, 1) random variable. More details about \( B\left( t \right) \) values and \( Z\left( t \right) \) normal scores can be found in Proschan et al. (2006) and Lan and Wittes (1988). In the following, we show how the well-known methods of Pocock (1977) and O’Brien and Fleming (1979) rely on these B-values and z-scores in finding their local significance levels, i.e., GS-boundary values, for the repeated testing of \( H_{0} \). For convenience, we will call these historical methods the PK and OF methods and their boundaries the PK and OF classical boundaries.
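The drift-parameter arithmetic is easy to reproduce; below is a quick numeric sketch of ours (the standardized effect size 0.5 used to invert for N is a hypothetical choice):

```python
# Sketch: theta = z_{1-alpha} + z_{1-beta} for alpha = 0.025 (1-sided)
# and power 1 - beta = 0.90, matching the value quoted in the text.
from scipy.stats import norm

alpha, beta = 0.025, 0.10
theta = norm.ppf(1 - alpha) + norm.ppf(1 - beta)
print(round(theta, 4))  # 3.2415

# Since theta = sqrt(N/2) * (delta/sigma), one can invert for the per-arm
# sample size N given a standardized effect delta/sigma (hypothetically 0.5):
N = 2 * (theta / 0.5) ** 2
print(N)  # about 84 subjects per arm
```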

7.2.2 Classical PK and OF Boundaries

When analyses of accumulating data of a GS trial occur at equally spaced information times, the PK boundary is a constant boundary on the z-scale. That is, if \( t_{k} = k/K \) for \( k = 1, \ldots , K \), the constant PK boundary \( c_{PK} \left( {\alpha ,K} \right) = x \) for 1-sided tests can be obtained by solving for x in the following equation:

$$ { \Pr }[\bigcap\limits_{k = 1}^{K} {\left\{ {Z\left( {t_{k} } \right) \le x} \right\}|H_{0} } ] = 1 - \alpha \,{\text{with}}\, t_{k} = k/K \,{\text{for}}\,k = 1, \ldots ,K, $$
(7.2.8)

such that the Type I error rate is controlled at level α. This equation can be solved under the assumption that the joint distribution of the test statistics \( \left\{ {Z\left( {t_{k} } \right);k = 1, \ldots ,K} \right\} \) is multivariate normal with zero mean vector and correlation matrix \( (\rho_{kl} ) = \left( {\sqrt {t_{k} /t_{l} } } \right) \) with \( t_{k} \le t_{l} \). For example, \( c_{PK} \left( {\alpha ,K} \right) = 2.28947 \) for \( K = 3 \) (\( t_{1} = 1/3 \), \( t_{2} = 2/3 \), and \( t_{3} = 1 \)) and α = 0.025. To solve for x in (7.2.8), we wrote SAS/IML code that calculated the left-hand side of the equation using the PROBBNRM and QUAD functions of SAS. PROBBNRM is a SAS function that returns the cumulative distribution function of a standard bivariate normal distribution given the values of the two variables and the correlation coefficient between them. QUAD is a SAS function that numerically integrates a function over an interval. This calculation expressed the joint distribution of \( \left\{ {Z\left( {t_{k} } \right);k = 1, 2, 3} \right\} \) as the product of the distribution of \( Z\left( {t_{1} } \right) \) and the conditional bivariate distribution of \( Z\left( {t_{2} } \right) \) and \( Z\left( {t_{3} } \right) \) given \( Z\left( {t_{1} } \right) = z\left( {t_{1} } \right) \).
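The same boundary can be computed with general-purpose software; the following is a sketch of ours using Python/SciPy in place of the SAS/IML code described above (the bracketing interval passed to the root finder is an arbitrary choice):

```python
# Sketch of the boundary calculation in (7.2.8): solve for the constant PK
# boundary x with K = 3 equally spaced looks and 1-sided alpha = 0.025.
import numpy as np
from scipy.stats import multivariate_normal
from scipy.optimize import brentq

K, alpha = 3, 0.025
t = np.arange(1, K + 1) / K                       # t_k = k/K
# Correlation matrix (rho_kl) = sqrt(t_k / t_l) for t_k <= t_l
corr = np.sqrt(np.minimum.outer(t, t) / np.maximum.outer(t, t))

def joint_cdf(x):
    # Pr{Z(t_1) <= x, ..., Z(t_K) <= x} under H0
    return multivariate_normal(mean=np.zeros(K), cov=corr).cdf(np.full(K, x))

x = brentq(lambda x: joint_cdf(x) - (1 - alpha), 1.5, 3.5)
print(round(x, 4))  # close to 2.2895, the value quoted in the text
```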

Jennison and Turnbull (2000) and Proschan et al. (2006) include 2-sided PK boundary values for different values of K and α = 0.01, 0.05, and 0.10. These 2-sided boundary values at level α, if taken as 1-sided boundary values at level α/2, may not be identical to the actual 1-sided boundary values obtained from (7.2.8); see, for example, Sect. 2.4 in Wassmer and Brannath (2016). The PK boundary values for 2-sided tests are obtained by replacing \( Z\left( {t_{k} } \right) \le x \) by \( \left| { Z\left( {t_{k} } \right)} \right| \le x \) in (7.2.8). Thus, a GS trial designed with the PK boundary, with looks at equally spaced information times for given α and K, would reject \( H_{0} \) for efficacy and stop the trial at look k with information fraction \( t_{k} \) when \( Z\left( {t_{k} } \right) > c_{PK} \left( {\alpha ,K} \right) \).

Likewise, the OF boundary is a constant boundary on the B-value scale when the trial looks occur at equally spaced information times. Therefore, when \( t_{k} = k/K, \) for \( k = 1, \ldots ,K \), the 1-sided OF boundary value can be obtained by solving for x in the following equation:

$$ { \Pr }[\bigcap\limits_{k = 1}^{K} {\left\{ {B\left( {t_{k} } \right) \le x} \right\}|H_{0} } ] = 1 - \alpha \,{\text{with}}\,t_{k} = k/K\,\,{\text{for}}\,k = 1, \ldots , K. $$

Using \( Z\left( {t_{k} } \right) = B\left( {t_{k} } \right)/\sqrt {t_{k} } \) the above equation can be expressed as in (7.2.9) to solve for x using the joint distribution of the test statistics \( \left\{ {Z\left( {t_{k} } \right);k = 1, \ldots ,K} \right\} \) as a multivariate normal with zero mean vector and correlation matrix \( (\rho_{kl} ) = \left( {\sqrt {t_{k} /t_{l} } } \right) \) for \( t_{k} \le t_{l} \):

$$ { \Pr }[\bigcap\limits_{k = 1}^{K} {\left\{ {Z\left( {t_{k} } \right) \le x/\sqrt {t_{k} } } \right\} |H_{0} } ]= 1 - \alpha \,{\text{with}}\,t_{k} = k/K\,{\text{for}}\,k = 1, \ldots ,K. $$
(7.2.9)

For example, when K = 2 (\( t_{1} = 1/2 \) and \( t_{2} = 1 \)), α = 0.025, and the tests are 1-sided, solving the equation PROBBNRM (\( x\sqrt 2 \), x, \( \sqrt {1/2} \)) = 0.975 gives the value x = 1.97742, which in turn gives the OF boundary values \( c_{1} \left( {\alpha , K} \right) = x\sqrt 2 = 2.796494 \) for the first look at \( t_{1} = 1/2 \) and \( c_{2} \left( {\alpha , K} \right) = x = 1.97742 \) for the final look on the z-score scale, with the corresponding boundary values \( \alpha_{1} \left( {\alpha , K} \right) = 0.002583 \) and \( \alpha_{2} \left( {\alpha ,K} \right) = 0.023997 \) on the p-value scale. Thus, if a GS trial is designed with two looks, with an interim look at \( t_{1} = 1/2 \) and \( \alpha = 0.025 \), then \( H_{0} \) will be rejected when the p-value at this look is less than \( \alpha_{1} \left( {\alpha ,K} \right) = 0.002583 \), stopping the trial early; otherwise, the trial will continue to the final look, and \( H_{0} \) will be rejected there when the p-value at that look is less than \( \alpha_{2} \left( {\alpha ,K} \right) = 0.023997 \).
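This K = 2 calculation can be reproduced with a bivariate normal CDF in any environment; here is a sketch of ours with Python/SciPy standing in for the SAS PROBBNRM function (the root-finding bracket is an arbitrary choice):

```python
# Sketch reproducing the K = 2 O'Brien-Fleming calculation in the text.
import numpy as np
from scipy.stats import multivariate_normal, norm
from scipy.optimize import brentq

alpha, t1 = 0.025, 0.5
rho = np.sqrt(t1)  # Corr{Z(t1), Z(t2)} = sqrt(t1/t2) with t2 = 1

def joint_cdf(x):
    # Pr{Z(t1) <= x*sqrt(2), Z(t2) <= x | H0}, the analog of
    # PROBBNRM(x*sqrt(2), x, sqrt(1/2))
    mvn = multivariate_normal(mean=[0, 0], cov=[[1, rho], [rho, 1]])
    return mvn.cdf([x * np.sqrt(2), x])

x = brentq(lambda x: joint_cdf(x) - (1 - alpha), 1.0, 3.0)
c1, c2 = x * np.sqrt(2), x
print(round(c1, 4), round(c2, 4))  # about 2.7965 and 1.9774 on the z-scale
print(norm.sf(c1), norm.sf(c2))    # about 0.002583 and 0.023997 on the p-scale
```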

Jennison and Turnbull (2000) and Proschan et al. (2006) provide values of x for 2-sided tests for different values of K and \( \alpha = 0.01 \), 0.05, and 0.1. These 2-sided boundary values at level α, if read as 1-sided boundary values at level α/2, may not agree with the actual 1-sided boundary values. Note that the methods described in this section are of historical importance but are not frequently used in practice; they lack flexibility because scheduling analyses at equally spaced information times can be challenging. A more flexible approach for GS trials is the spending function approach described in the next section.

7.2.3 Spending Function Approach

The classical PK and OF boundaries introduced above require specifying the total number of looks at equally spaced information times. This can be inconvenient for clinical trial applications, as the Data Safety Monitoring Board (DSMB) or any other group charged with performing interim looks of the accumulating clinical trial data may have to postpone a look for logistical reasons, or may decide to have a look at an unspecified time because of certain concerns. Lan and DeMets (1983) proposed the spending function approach to address this and showed that the construction of GS boundaries does not require pre-specification of the number or timings of looks.

Any non-decreasing function \( f\left( {\alpha , t} \right) \) in the information time t, over the interval 0 ≤ t ≤ 1 and parameterized by the overall significance level \( \alpha \) for testing \( H_{0} \), can be a spending function if it satisfies the following conditions: \( f\left( {\alpha , t} \right) \le f\left( {\alpha , t^{{\prime }} } \right) \) for \( 0 \le t \le t^{{\prime }} \le 1 \); \( f\left( {\alpha , t = 0} \right) = 0 \); and \( f\left( {\alpha , t = 1} \right) = \alpha \). A commonly used spending function for clinical trials is the OF-like:

$$ f_{1} \left( {\alpha , t} \right) = 2\{ 1 - \varPhi (z_{1 - \alpha /2} /\sqrt {t)} \} , $$

where \( \varPhi \)(.) is the cumulative distribution function of the standard normal distribution .

Note that \( f_{1} \left( {\alpha ,0} \right) = 0 \) and \( f_{1} \left( {\alpha ,1} \right) = \alpha \). If the trial had only 2 looks, one at t = 1/2 and the other at t = 1, and \( \alpha = 0.025 \), then \( f_{1} \left( {\alpha = 0.025,t = 1/2} \right) = 2\left( {1 - \varPhi \left( {2.241403/0.70711} \right)} \right) = 2\left\{ {1 - \varPhi \left( {3.1698} \right)} \right\} = 0.001525 \) and \( f_{1} \left( {\alpha = 0.025,\,t = 1} \right) = \alpha \). One can then find the significance level x for the final look by solving the equation Pr {(P1 < 0.001525) ∪ (P2 < x)} = 0.025. The next section shows how such equations are solved. The advantage of using the OF-like spending function for clinical trials is its convex shape. This allows spending very little of the total α at early looks and saving most of it for later looks, when the trial has a sufficient number of patients exposed to the new treatment. The idea is to stop the trial early only when the treatment effect size is sufficiently large and clinically convincing.
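Evaluating the OF-like spending function itself requires only the standard normal CDF; the following short sketch of ours reproduces the two values computed above:

```python
# Sketch: the OF-like spending function
# f1(alpha, t) = 2 * (1 - Phi(z_{1-alpha/2} / sqrt(t))).
from scipy.stats import norm

def of_like(alpha, t):
    if t <= 0:
        return 0.0  # f1(alpha, 0) = 0 by convention
    return 2 * norm.sf(norm.ppf(1 - alpha / 2) / t ** 0.5)

alpha = 0.025
print(of_like(alpha, 0.5))  # about 0.001525: alpha spent through t = 1/2
print(of_like(alpha, 1.0))  # 0.025: the full alpha is spent at t = 1
```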

Table 7.1 includes a few other spending functions. These and other spending functions give the cumulative Type I error rate spent at look k with the associated information fraction \( t_{k} \). This cumulative value does not directly give the local significance level \( \alpha_{k} \left( {\alpha , t_{k} } \right) \) (i.e., the boundary value) for testing \( H_{0} \) at look k, except when \( k = 1 \) (the first look). Note that these boundary values are on the p-value scale and need to be converted for presentation on the z-scale. Finding \( \alpha_{k} \left( {\alpha , t_{k} } \right) \) requires additional calculations, which we describe in the following with an example. These calculations usually require solving equations with multiple integrals and are not easy when K ≥ 3; special computer software is normally used for this.

Table 7.1 Examples of spending functions

7.2.4 Calculations of Boundary Values Using Spending Functions

We illustrate the use of spending functions for finding the local significance level \( \alpha_{k} \left( {\alpha , t_{k} } \right) \) at look k with the information fraction \( t_{k} \), so that \( H_{0} \) will be rejected when the 1-sided p-value \( p_{k} \) at this look is less than \( \alpha_{k} \left( {\alpha , t_{k} } \right) \). Suppose a trial uses the OF-like spending function to control the Type I error rate at level \( \alpha = 0.025 \). Suppose that the first look occurs at \( t_{1} = 0.30 \). Then at this look, we spend

$$ \begin{aligned} & f_{1} \left( {\alpha = 0.025,t_{1} = 0.30} \right) = 2\left\{ {1 - \varPhi \left( {z_{1 - \alpha /2} /\sqrt {0.30} } \right)} \right\} \\ & \quad = 2\left\{ {1 - \varPhi \left( {\frac{2.2414027}{{\sqrt {0.30} }}} \right)} \right\} = 0.0000427 \end{aligned} $$

Therefore, at this look, \( \alpha_{1} \left( {\alpha , t_{1} } \right) = 0.0000427 \) and the critical value \( c_{1} \left( {\alpha ,t_{1} } \right) = 3.9285725 \) from \( \Pr \left\{ {Z\left( {t_{1} } \right) > c_{1} \left( {\alpha , t_{1} } \right)} \right\} = 0.0000427 \); one will reject \( H_{0} \) and stop the trial at the first look if \( p_{1} < 0.0000427 \) or \( Z\left( {t_{1} } \right) > 3.9285725 \). Thus, at this look the investigator spends very little of the total \( \alpha = 0.025 \).

Suppose that the trial did not stop at the first look and the investigator decides to have the second look at \( t_{2} = 0.65 \). Then the cumulative alpha spent at this look is

$$ \begin{aligned} & f_{1} \left( {\alpha = 0.025,t = 0.65} \right) = 2\left\{ {1 - \varPhi \left( {z_{1 - \alpha /2} /\sqrt {0.65} } \right)} \right\} \\ & \quad = 2\left\{ {1 - \varPhi \left( {\frac{2.2414027}{{\sqrt {0.65} }}} \right)} \right\} = 0.0054339 \end{aligned} $$

Therefore, we determine the boundary critical values \( c_{2} \left( {\alpha , t_{2} } \right) = 2.5479 \) or \( \alpha_{2} \left( {\alpha , t_{2} } \right) = 0.0054187 \) by solving the equation: \( \Pr [\{ Z\left( {t_{1} } \right) > 3.9285725\} \cup \{ Z\left( {t_{2} } \right) > c_{2} \left( {\alpha , t_{2} } \right)\} ]= 0.0054339 \). Therefore, one can reject \( H_{0} \) at the second look and stop the trial if, at this look, the observed p-value \( p_{2} < 0.0054187 \) or \( Z\left( {t_{2} } \right) > 2.5479 \).

Suppose the trial did not stop at this second look and the investigator moves to the final look at \( t_{3} = 1 \). Then the cumulative alpha spent at the final look is \( \alpha = 0.025 \). One can then find \( c_{3} \left( {\alpha , t_{3} } \right) \) by solving the equation:

$$ \Pr [\{ Z\left( {t_{1} } \right) > 3.9285725\} \cup \{ Z\left( {t_{2} } \right) > 2.5479\} \cup \{ Z\left( {t_{3} } \right) > c_{3} \left( {\alpha , t_{3} } \right)\} ]= 0.025 $$

Solving this equation gives \( c_{3} \left( {\alpha , t_{3} } \right) = 1.9897 \) and \( \alpha_{3} \left( {\alpha , t_{3} } \right) = 0.023312 \). Therefore, one can reject \( H_{0} \) at the final look if at this look the p-value \( p_{3} < 0.023312 \) or \( Z\left( {t_{3} } \right) > 1.9897 \).

A general recursive equation for finding \( c_{k} \left( {\alpha , t_{k} } \right) \) and \( \alpha_{k} \left( {\alpha , t_{k} } \right) \) for a spending function \( f\left( {\alpha , t} \right) \) is given by \( f\left( {\alpha , t_{k} } \right) = f\left( {\alpha ,t_{k - 1} } \right) + \Pr \left[ {\{ \bigcap\nolimits_{i = 1}^{k - 1} {Z\left( {t_{i} } \right)} \le c_{i} \left( {\alpha , t_{i} } \right)\} \cap \{ Z\left( {t_{k} } \right) > c_{k} \left( {\alpha ,t_{k} } \right)\} } \right] \) for \( k \ge 2 \). Software packages are available that give values of \( c_{k} \left( {\alpha , t_{k} } \right) \) and \( \alpha_{k} \left( {\alpha , t_{k} } \right) \) for the OF-like and other spending functions; see Zhu et al. (2011) for a review of this software. Table 7.2 shows the results from one such package. We show in Sect. 7.3 that such boundaries can also be used for testing multiple hypotheses in GS trials.
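The recursion can also be carried out directly with a multivariate normal CDF and a root finder; the following sketch of ours (Python/SciPy in place of specialized GS software) reproduces the worked example with looks at t = 0.30, 0.65, and 1:

```python
# Sketch: recursive calculation of the local boundaries c_k(alpha, t_k)
# for the OF-like spending function.
import numpy as np
from scipy.stats import norm, multivariate_normal
from scipy.optimize import brentq

def of_like(alpha, t):
    # OF-like spending function f1(alpha, t)
    return 2 * norm.sf(norm.ppf(1 - alpha / 2) / np.sqrt(t))

def boundaries(alpha, times):
    c = []
    for k, t_k in enumerate(times):
        spent = of_like(alpha, t_k)        # cumulative alpha through look k
        if k == 0:
            c.append(norm.ppf(1 - spent))  # first look: univariate
            continue
        ts = np.array(times[: k + 1])
        corr = np.sqrt(np.minimum.outer(ts, ts) / np.maximum.outer(ts, ts))
        mvn = multivariate_normal(mean=np.zeros(k + 1), cov=corr)
        # Solve Pr{Z(t_1) <= c_1, ..., Z(t_k) <= c_k | H0} = 1 - spent,
        # which is equivalent to the recursion for the newly spent alpha.
        c.append(brentq(lambda x: mvn.cdf(np.append(c[:k], x)) - (1 - spent),
                        0.5, 6.0))
    return c

c = boundaries(0.025, [0.30, 0.65, 1.0])
print([round(v, 4) for v in c])  # about [3.9286, 2.5479, 1.9897]
```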

Table 7.2 Examples for the OF-like spending function with α = 0.025, 0.0125, K = 3, and 1-sided tests

7.3 Testing of Multiple Hypotheses in GS Trials

Many GS trials are designed for testing multiple endpoint hypotheses, frequently two. Two situations generally arise; consider, for example, a GS trial for testing two endpoint hypotheses. The first case arises when, after the rejection of one of the two hypotheses at an interim look, the trial does not stop but continues to later looks for testing the other hypothesis. The second case arises when the two hypotheses are hierarchically ordered, e.g., one is primary and the other is secondary. The first hypothesis in the hierarchy (i.e., the primary hypothesis) is tested first using the full trial α (e.g., α = 0.025). If this hypothesis is rejected at an interim look, then the trial stops because of ethical considerations. For example, if the first hypothesis is associated with a mortality endpoint and the second hypothesis with a quality of life measure, and the trial wins at a look on the mortality endpoint, then the trial would generally be discontinued for ethical reasons. In that case, the second hypothesis (i.e., the secondary hypothesis) is tested at the same look at which the first hypothesis was rejected. The remainder of this section considers the first case, and Sect. 7.4 considers the second case. In the following, we first address methods based on the Bonferroni inequality, then move on to α-recycling approaches based on the closed testing principle (CTP) of Marcus et al. (1976), and finally to the more recent graphical approach of Maurer and Bretz (2013).

7.3.1 Methods Based on the Bonferroni Inequality

Consider, for example, a trial which, for the demonstration of superiority of a new treatment to control, specifies two null hypotheses: \( H_{1} \) and \( H_{2} \). Rejection of either of the two hypotheses at a look can establish efficacy of the new treatment. However, if the trial rejects one of the two hypotheses at an interim look, it can continue to later looks for testing the other hypothesis. For such a trial, the use of the Bonferroni inequality leads to two approaches. The first approach splits the significance level α as \( \alpha_{1} + \alpha_{2} \le \alpha \) for testing \( H_{1} \) at level \( \alpha_{1} \) and \( H_{2} \) at level \( \alpha_{2} \). For example, it may assign \( \alpha_{1} = 0.005 \) for testing \( H_{1} \) and \( \alpha_{2} = 0.02 \) for testing \( H_{2} \), controlling the overall Type I error rate at \( \alpha = 0.025 \). Tests for \( H_{1} \) and \( H_{2} \) can then proceed separately in a univariate GS testing framework for the separate control of the Type I error rates at levels \( \alpha_{1} \) and \( \alpha_{2} \), respectively, using the same or different spending functions for each. In Sect. 7.3.2, we show that this approach extends to an α-recycling approach, such that, if one of the multiple hypotheses is rejected at a look, the boundary values for testing the other hypotheses are updated to larger values.

The second approach uses the Bonferroni inequality differently. It specifies the rejection boundary values as \( \alpha_{k}^{{\prime }} \left( {t_{k} } \right) > 0 \) for looks \( k = 1, \ldots ,K \) such that \( \sum\nolimits_{k = 1}^{K} {\alpha_{k}^{{\prime }} (t_{k} ) = \alpha } \). It then applies a conventional multiple hypothesis testing method at a look for the control of the Type I error rate at the local level \( \alpha_{k}^{{\prime }} \left( {t_{k} } \right) \) at that look. Suppose that \( K = 2 \), i.e., the trial is designed with two looks, and \( \alpha_{1}^{{\prime }} \left( {t_{1} } \right) = 0.005 \) and \( \alpha_{2}^{{\prime }} \left( {t_{2} = 1} \right) = 0.02 \), for the first and second looks, respectively. One can then apply, for example, the conventional Hochberg (1988) procedure for testing \( H_{1} \) and \( H_{2} \) at level 0.005 at the first look, and similarly, can apply the same procedure for testing these hypotheses at the final look at level 0.02. The methods discussed in this section for testing two hypotheses generalize to testing more than two hypotheses.
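As an illustration of the second approach, the following is a minimal Python sketch applying Hochberg's step-up procedure at the look-local levels \( \alpha_{k}^{{\prime }} \left( {t_{k} } \right) \); the p-values used are hypothetical.

```python
def hochberg_reject(pvals, alpha):
    """Indices rejected by Hochberg's (1988) step-up procedure at level alpha."""
    m = len(pvals)
    # examine p-values from largest to smallest
    for step, i in enumerate(sorted(range(m), key=lambda i: pvals[i],
                                    reverse=True)):
        if pvals[i] <= alpha / (step + 1):
            # reject this hypothesis and all with smaller or equal p-values
            return sorted(j for j in range(m) if pvals[j] <= pvals[i])
    return []

# Second Bonferroni approach with K = 2 looks: alpha'_1 = 0.005 and
# alpha'_2 = 0.02 sum to alpha = 0.025; Hochberg is run locally at each
# look. The p-values below are hypothetical.
look1 = hochberg_reject([0.0090, 0.0060], 0.005)  # no rejection at look 1
look2 = hochberg_reject([0.0190, 0.0080], 0.020)  # both rejected at look 2
```

Because Hochberg's procedure controls the Type I error rate at each local level, the Bonferroni inequality over the looks bounds the overall error rate by α.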

7.3.2 Method Based on the Closed Testing Principle

The closed testing principle of Marcus et al. (1976) provides a general framework for constructing powerful closed test procedures (CTPs) for testing individual hypotheses based on tests of intersection hypotheses of different orders. One starts with a family of individual hypotheses \( H_{1} , \ldots , H_{h} \) and constructs a closed set \( \tilde{\varvec{H}} \) of \( 2^{h} - 1 \) non-empty intersection hypotheses as follows:

$$ \tilde{\varvec{H}} = \left\{ {H_{\varvec{J}} = \bigcap\nolimits_{{j \in \varvec{J}}} {H_{j} } ,\quad \varvec{J}\,\subseteq\,\varvec{I}= \left\{ {1, \ldots ,h} \right\}} \right\}. $$

One then performs an α-level test for each hypothesis \( H_{\varvec{J}} \) in \( \tilde{\varvec{H}} \) by using, for example, the weighted Bonferroni test . One then rejects an individual hypothesis \( H_{j} \) when all \( H_{\varvec{J}} \) for \( j \in \varvec{J} \) are rejected by their corresponding α-level tests.
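The closure can be enumerated mechanically; a brief Python sketch:

```python
from itertools import combinations

def closed_set(h):
    """All 2**h - 1 non-empty index sets J of I = {1, ..., h}, i.e., the
    index sets of the intersection hypotheses H_J in the closure."""
    return [frozenset(c) for r in range(1, h + 1)
            for c in combinations(range(1, h + 1), r)]
```

For h = 2 this yields {1}, {2}, and {1, 2}, i.e., the closure \( \left\{ {H_{1} , H_{2} , H_{12} } \right\} \).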

For example, when h = 2, the closed set \( \tilde{\varvec{H}} = \left\{ {H_{12} , H_{1} , H_{2} } \right\} \). A CTP will reject the individual hypothesis \( H_{1} \) only when \( H_{1} \) and \( H_{12} \) are both rejected, each by an α-level test. If one uses, for example, the weighted Bonferroni test for \( H_{12} \), then the procedure cuts down the extra step of testing \( H_{1} \) after rejecting \( H_{12} \). The weighted Bonferroni test rejects \( H_{12} \), when \( p_{j} < w_{j} \alpha \) for at least one \( j \in \left\{ {1, 2} \right\} \), where \( w_{1} \) and \( w_{2} \) are the nonnegative weights assigned to \( H_{1} \) and \( H_{2} \), respectively, such that \( w_{1} + w_{2} \le 1 \), and \( p_{j} \) are the observed p-values associated with \( H_{j} \) for \( j \in \left\{ {1, 2} \right\} \). Suppose that this test rejects \( H_{12} \) for \( j = 1 \) on observing \( p_{1} < w_{1} \alpha \), then \( H_{1} \) is automatically rejected, as the significance level \( \alpha \) for the test of \( H_{1} \) satisfies \( \alpha \ge w_{1} \alpha \). This property in its general form, known as the consonance property, when satisfied for testing intersection hypotheses in a closed testing procedure, leads to short-cuts of closed test procedures and allows recycling of the significance level of a rejected hypothesis to other hypotheses (Hommel et al. 2007). This property basically means that the rejection of an intersection hypothesis \( H_{\varvec{J}} \) by an α-level test implies the rejection of at least one individual hypothesis \( H_{j} \) for \( j \in \varvec{J} \).

As a numerical example, consider testing the two hypotheses \( H_{1} \) and \( H_{2} \) with \( \alpha = 0.025 \), and suppose that the weights assigned to \( H_{1} \) and \( H_{2} \) are \( w_{1} = 0.8 \) and \( w_{2} = 0.2 \), respectively, so that \( w_{1} + w_{2} = 1 \). Further, suppose that the observed p-values for the tests of \( H_{1} \) and \( H_{2} \) are \( p_{1} = 0.024 \) for \( H_{1} \) and \( p_{2} = 0.004 \) for \( H_{2} \). The simple weighted Bonferroni test would reject only \( H_{2} \), as \( p_{1} > w_{1} \alpha = 0.020 \) and \( p_{2} < w_{2} \alpha = 0.005. \) However, the weighted Bonferroni-based CTP with these weights would reject both hypotheses. This CTP, in its initial step, would reject the intersection hypothesis \( H_{12} \) as \( p_{j} < w_{j} \alpha \) for \( j = 2 \). Consequently, as the procedure assigns a weight of one for testing each singleton hypothesis, satisfying consonance, it would then reject each of the two hypotheses as \( p_{j} < 1 \cdot \alpha = 0.025 \) for each \( j \in \left\{ {1, 2} \right\}. \)
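This numerical example can be reproduced with a small closed-testing routine; a minimal Python sketch (the generic closure enumeration is for illustration):

```python
from itertools import combinations

def weighted_bonf_ctp(pvals, weights, alpha=0.025):
    """Closed testing with weighted Bonferroni tests of the intersections.
    weights[J] maps each non-empty index set J to its weight vector
    {j: w_j(J)}; H_j is rejected iff every H_J with j in J is rejected."""
    idx = sorted(pvals)
    closure = [frozenset(c) for r in range(1, len(idx) + 1)
               for c in combinations(idx, r)]
    rejected_J = {J for J in closure
                  if any(pvals[j] < weights[J][j] * alpha for j in J)}
    return {j for j in idx if all(J in rejected_J for J in closure if j in J)}

# The example from the text: w1 = 0.8, w2 = 0.2 on H_12, weight one on
# each singleton hypothesis; p1 = 0.024 and p2 = 0.004.
weights = {frozenset({1, 2}): {1: 0.8, 2: 0.2},
           frozenset({1}): {1: 1.0}, frozenset({2}): {2: 1.0}}
print(weighted_bonf_ctp({1: 0.024, 2: 0.004}, weights))  # both rejected
```

The run confirms the text: \( H_{12} \) falls via \( p_{2} < 0.005 \), and both singletons then fall at the full level 0.025.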

In the following, we first visit the GS closed test procedure of Tang and Geller (1999) for testing multiple hypotheses and show that this procedure leads to α-recycling procedures when weighted Bonferroni tests of intersection hypotheses that satisfy consonance are used. The Tang and Geller procedure is of historical importance with respect to using the closed testing principle for testing multiple hypotheses in group sequential trials. Although the procedure appears complicated in its original form, it simplifies if weighted Bonferroni tests, with weights satisfying the consonance property, are used for testing its intersection hypotheses. However, the selection of such weights can be cumbersome when testing more than three hypotheses. The end of Sect. 7.3.3 illustrates how to find these weights when testing two primary hypotheses and a secondary hypothesis. In general, the graphical approach (Sect. 7.3.4) is easier to use in this regard when testing multiple hypotheses.

Consider testing \( h \ge 2 \) endpoint hypotheses in a GS trial designed to compare a new treatment to control. Consider, as before, the intersection hypotheses \( H_{\varvec{J}} \) for \( \varvec{J}\,\subseteq\,\varvec{I} = \left\{ {1, \ldots , h} \right\} \), i.e., the treatment difference between the new treatment and control is \( \delta_{j} \le 0 \) for all endpoints \( j \in \varvec{J}\,\subseteq\,\varvec{I} \). Also, consider that the multiple looks for the trial occur at different information times \( t \in \left\{ {t_{1} , t_{2} , \ldots , t_{K} } \right\} \) such that \( t_{1} \le t_{2} \le \cdots \le t_{K} = 1 \). Let \( Z_{\varvec{J}} \) be a test statistic for testing \( H_{\varvec{J}} \) (e.g., by a weighted Bonferroni test) and let \( Z_{\varvec{J}} \left( t \right) \) be the value of \( Z_{\varvec{J}} \) at a look with information fraction t. Further, let \( c_{\varvec{J}} \left( t \right) \) be the critical value for performing an α-level test of \( H_{\varvec{J}} \) at this look by using \( Z_{\varvec{J}} \left( t \right) \). That is, for each \( \varvec{J}\,\subseteq\,\varvec{I} \), the \( c_{\varvec{J}} \left( t \right) \) values for the different t (at which the repeated tests occur) satisfy \( { \Pr }\{ Z_{\varvec{J}} \left( t \right) > c_{\varvec{J}} \left( t \right)\,{\text{for}}\,{\text{some}}\,t|H_{\varvec{J}} \} \le\upalpha \). Then the closed test procedure for GS trials proposed by Tang and Geller (1999) can be stated as follows:

Step 1::

Start testing \( H_{\varvec{I}} \) as in a univariate case of a GS trial but using the group sequential boundary values \( c_{\varvec{I}} \left( t \right) \) for the test statistics \( Z_{\varvec{I}} \left( t \right) \), where \( \varvec{I} = \left\{ {1, \ldots , h} \right\} \).

Step 2::

Suppose that \( H_{\varvec{I}} \) is rejected for the first time at the look with \( t = t^{*} \). Then, for rejecting at least one individual hypothesis at this look, apply a CTP to test \( H_{\varvec{J}} \) with \( \varvec{J} \subseteq \varvec{I} \) using \( Z_{\varvec{J}} \left( {t^{*} } \right) \) and its critical value \( c_{\varvec{J}} \left( {t^{*} } \right) \). Note that \( c_{\varvec{J}} \left( {t^{*} } \right) \) can be different for different \( H_{\varvec{J}} \)’s. In applying this CTP at \( t = t^{*} \) either (a) none of the individual hypotheses will be rejected, or (b) at least one individual hypothesis \( H_{j} \) will be rejected for \( j \in \varvec{I} \).

Step 3(a)::

In Step 2, if none of the individual hypotheses are rejected at \( t = t^{*} \) then continue to the next look; however, if \( t^{*} = 1 \) and none of the individual hypotheses are rejected, the trial will stop without the rejection of any hypothesis.

Step 3(b)::

In Step 2, if at least one hypothesis is rejected at \( t = t^{*} \), then exclude the indices of the rejected hypotheses from the index set \( \varvec{I} \). With this updated index set \( \varvec{I} \), continue to the next look and repeat Step 1 and Step 2. Note that in this process, all previously rejected hypotheses are assumed rejected at later looks and are removed for further testing.

Step 4::

Reiterate the above steps until all hypotheses are rejected or the trial reaches the final look.
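The four steps above can be sketched in Python, here using a min-p test of each intersection hypothesis (one of the consonant tests discussed in this section). The boundary function and the p-values below are hypothetical placeholders; valid \( c_{\varvec{J}} \left( t \right) \) levels would come from spending-function software.

```python
from itertools import combinations

def tang_geller(pvals_by_look, boundary, hypotheses):
    """Sketch of the Tang-Geller GS closed test with a min-p test of each
    intersection. pvals_by_look[k][j] is the unadjusted p-value of H_j at
    look k; boundary(J, k) returns the p-scale critical level c_J(t_k)."""
    open_idx = set(hypotheses)
    rejected = set()
    for k, pvals in enumerate(pvals_by_look):
        if not open_idx:
            break
        # Steps 1-2: closed testing restricted to the still-open indices
        closure = [frozenset(c) for r in range(1, len(open_idx) + 1)
                   for c in combinations(sorted(open_idx), r)]
        rej_J = {J for J in closure
                 if min(pvals[j] for j in J) < boundary(J, k)}
        newly = {j for j in open_idx
                 if all(J in rej_J for J in closure if j in J)}
        # Steps 3(a)/3(b): drop rejected hypotheses, continue to next look
        rejected |= newly
        open_idx -= newly
    return rejected

# Hypothetical two-look run; the boundary levels are placeholders only.
looks = [{1: 0.030, 2: 0.010}, {1: 0.015, 2: 0.001}]
rejected = tang_geller(looks, lambda J, k: 0.005 if k == 0 else 0.02, {1, 2})
```

In this illustrative run, nothing is rejected at the first look, and both hypotheses fall at the second look.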

Implementing the Tang and Geller (1999) approach for the general case can be complicated because of the computational difficulties in finding \( c_{\varvec{J}} \left( t \right) \) values for testing \( H_{\varvec{J}} \) for different \( \varvec{J} \) and different looks. However, this approach simplifies on using univariate tests for \( H_{\varvec{J}} \) that satisfy consonance. Examples of such tests are the max-T or min-p test and the un-weighted Bonferroni test. The weighted Bonferroni test, which is more useful for clinical trial applications, also serves this purpose, but its weights need to be pre-selected to satisfy consonance. This may be difficult when testing more than three hypotheses. An alternative that does not have this issue is the graphical approach addressed in Sect. 7.3.4. The following, however, addresses the weighted Bonferroni test approach and illustrates its application for testing two hypotheses in a GS trial.

In the weighted Bonferroni test approach, to satisfy consonance for the tests of \( H_{\varvec{J}} \) for \( \varvec{J} \subseteq \varvec{I} \), one pre-selects weights \( w_{j} \left( \varvec{J} \right) \) for \( j \in \varvec{J} \) with \( \sum\nolimits_{{j \in \varvec{J}}} {w_{j} \left( \varvec{J} \right) \le 1} \) so that \( w_{j} \left( {\varvec{J}^{{*}} } \right) \ge w_{j} \left( \varvec{J} \right) \) for every \( \varvec{J}^{{*}} \subseteq \varvec{J} \). For these cases, standard software developed for testing a single hypothesis with a spending function approach can still be used for testing multiple hypotheses. The following is an illustrative example for testing two hypotheses \( H_{1} \) and \( H_{2} \) in a GS trial .

In the case of testing two hypotheses, a CTP considers a single intersection hypothesis \( H_{\varvec{J}} \) with \( \varvec{J} = \left\{ {1,2} \right\} \), written as \( H_{12} \), and two individual hypotheses \( H_{1} \) and \( H_{2} \). Suppose that for testing \( H_{12} \) one assigns weights \( w_{1} \left\{ {1, 2} \right\} = 0.8 \) and \( w_{2} \left\{ {1, 2} \right\} = 0.2 \) so that \( w_{1} \left\{ {1, 2} \right\}\alpha = 0.02 \) and \( w_{2} \left\{ {1, 2} \right\}\alpha = 0.005 \) with the trial \( \alpha = 0.025 \). Consonance is satisfied, because after \( H_{12} \) is rejected, the weight for testing each of the two individual hypotheses in the CTP is one. The following illustrates how one would test \( H_{1} \) and \( H_{2} \) in a GS trial with such initial weights.

Tests at the First Look

Suppose that the first look for the trial occurs at \( t = t_{1} = 0.30 \), and suppose that at this look the unadjusted p-values associated with \( H_{1} \) and \( H_{2} \) are \( p_{1} \left( {t_{1} } \right) \) and \( p_{2} \left( {t_{1} } \right) \), respectively. The CTP will reject \( H_{12} \) by the weighted Bonferroni test if either \( p_{1} \left( {t_{1} } \right) < \alpha_{1} \left( {w_{1} \left\{ {1, 2} \right\}\alpha = 0.02,t_{1} = 0.30} \right) = \alpha_{1} \left( {0.02,t_{1} = 0.30} \right) \) or \( p_{2} \left( {t_{1} } \right) < \alpha_{2} \left( {0.005, t_{1} = 0.30} \right) \), where these boundary critical values can be obtained by specifying spending functions \( f_{1} \) and \( f_{2} \). If \( f_{1} \) and \( f_{2} \) are each OF-like, then

$$ \begin{aligned} & \alpha_{1} \left( {0.020,t_{1} = 0.30} \right) = f_{1} \left( {w_{1} \left\{ {1,2} \right\}\alpha = 0.02,t_{1} = 0.30} \right) = 0.00002 \\ & \alpha_{2} \left( {0.005,t_{1} = 0.30} \right) = f_{2} \left( {w_{2} \left\{ {1,2} \right\}\alpha = 0.005, t_{1} = 0.30} \right) = 2.977{\text{E}} - 07 \\ \end{aligned} $$

Suppose that \( H_{12} \) is not rejected at this look with \( t_{1} = 0.30 \) and the trial continues to the second look.
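At a first look, the boundary level equals the cumulative spending \( f_{j} \left( { \cdot ,t_{1} } \right) \), so the displayed values can be reproduced directly; a minimal sketch in Python (scipy assumed available):

```python
from scipy.stats import norm

def of_like(gamma, t):
    """OF-like alpha-spending: f(gamma, t) = 2*(1 - Phi(z_{1-gamma/2}/sqrt(t)))."""
    return 2 * norm.sf(norm.ppf(1 - gamma / 2) / t ** 0.5)

# Boundary levels at the first look, t1 = 0.30, for the weighted
# Bonferroni test of H_12 at levels w1*alpha = 0.02 and w2*alpha = 0.005
a1 = of_like(0.020, 0.30)  # approximately 0.00002
a2 = of_like(0.005, 0.30)  # approximately 2.977e-07
```

At later looks, software accounts for the correlation between the look statistics when converting cumulative spending into nominal boundary values.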

Tests at the Second Look

Suppose that the second look occurs at \( t_{2} = 0.65 \). Further, suppose that at this look the unadjusted p-values associated with \( H_{1} \) and \( H_{2} \) are \( p_{1} \left( {t_{2} } \right) \) and \( p_{2} \left( {t_{2} } \right) \), respectively. Consequently, the CTP will reject \( H_{12} \) at this look if either \( p_{1} \left( {t_{2} } \right) < \alpha_{1} \left( {0.02,t_{2} = 0.65} \right) \) or \( p_{2} \left( {t_{2} } \right) < \alpha_{2} \left( {0.005, t_{2} = 0.65} \right) \). The use of the spending functions \( f_{1} \) and \( f_{2} \) as OF-like for this look gives the boundary values

$$ \alpha_{1} \left( {0.020, t_{2} = 0.65} \right) = 0.0039\,{\text{and}}\,\alpha_{2} \left( {0.005, t_{2} = 0.65} \right) = 0.000498. $$

Section 7.2.4 has addressed how these boundary values are calculated. As indicated before, computer software is used to calculate such boundary values .

Now, suppose that \( p_{2} \left( {t_{2} } \right) < 0.000498 \); then \( H_{12} \) will be rejected, leading to the automatic rejection of \( H_{2} \) because the consonance condition is satisfied. Therefore, as \( H_{12} \) and \( H_{2} \) are rejected at \( t^{*} = t_{2} = 0.65 \), the CTP will test the remaining hypothesis \( H_{1} \) at the same look \( (t^{*} = t_{2} = 0.65) \) with the updated boundary value of \( \alpha_{1} \left( {0.025, t_{2} = 0.65} \right) = 0.00542 \) by the same OF-like spending function. Thus, there is a recycling of the alpha of 0.005 from the rejected \( H_{2} \) to \( H_{1} \), updating the alpha of 0.02 to 0.02 + 0.005 = 0.025, which is incorporated in the first argument of \( \alpha_{1} \left( {0.025, t_{2} = 0.65} \right) \). Thus, a CTP with consonance allows recycling of alpha for GS trials, but here, this recycling updates the boundary values for testing \( H_{1} \) starting at \( t^{*} = t_{2} = 0.65 \) using a spending function. Suppose that \( p_{1} \left( {t_{2} } \right) = 0.015 \), which is greater than 0.00542; then \( H_{1} \) remains unrejected at this second look. The trial then continues to the final look with \( t_{3} = 1 \) for testing \( H_{1} \).

Test at the Final Look

The final look occurs at \( t_{3} = 1 \) for testing \( H_{1} \), with the assumption that \( H_{2} \) (which was rejected at the second look) remains rejected at this look. Therefore, \( H_{1} \) would be tested at this look at level \( \alpha_{1} \left( {0.025, t_{3} = 1} \right) = 0.02331 \) by the same OF-like spending function.
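The final-look value 0.02331 is smaller than the full cumulative level \( f\left( {0.025, 1} \right) = 0.025 \) because the nominal boundary accounts for the correlation \( \sqrt {t_{2} /t_{3} } \) between the second- and final-look statistics. A minimal sketch of this computation with scipy, ignoring the negligible alpha spent at the first look (an assumption of this illustration):

```python
from scipy.optimize import brentq
from scipy.stats import multivariate_normal, norm

t2, alpha = 0.65, 0.025
rho = t2 ** 0.5                 # corr(Z(t2), Z(t3)) = sqrt(t2/t3), t3 = 1
spent2 = 2 * norm.sf(norm.ppf(1 - alpha / 2) / t2 ** 0.5)  # f(0.025, 0.65)
c2 = norm.isf(spent2)           # z-scale boundary at the second look
joint = multivariate_normal(mean=[0.0, 0.0], cov=[[1.0, rho], [rho, 1.0]])

def excess_spend(c3):
    # total rejection probability under the null minus alpha:
    # P(Z2 > c2) + P(Z2 <= c2, Z3 > c3) - alpha
    cross = norm.cdf(c2) - joint.cdf([c2, c3])
    return spent2 + cross - alpha

c3 = brentq(excess_spend, 1.0, 3.0)
nominal_final = norm.sf(c3)     # nominal level at the final look, ~0.0233
```

The solved nominal level reproduces the quoted 0.02331 to the numerical accuracy of the bivariate normal integration.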

7.3.3 Some Key Considerations and Comments

For applications, the spending functions to be used for testing the different hypotheses need to be pre-specified, and for interpreting study findings, it is good practice to use the same spending functions for testing the different hypotheses. Although the total number of looks may not be pre-specified, specifying it may help reduce concerns about unnecessary looks at the data. In addition, in our previous discussion, including the illustrative example in Sect. 7.3.2, we assumed that the information fractions for the two endpoints are equal at each look. This can be the case for continuous or binary endpoints; however, it may not hold in general. That is, if \( t_{k} \left( {E_{1} } \right) \) and \( t_{k} \left( {E_{2} } \right) \) are the information fractions for the two endpoints at looks \( k = 1, \ldots , K \), then it is possible that \( t_{k} \left( {E_{1} } \right) \ne t_{k} \left( {E_{2} } \right) \) for at least one k. This can occur, for example, when \( E_{1} \) or \( E_{2} \) is a time-to-event endpoint; it may also occur in other situations. The question then arises as to how to adapt the above procedure to this general case.

In this regard, we note that the above procedure can be easily adapted to address this general case. To illustrate, suppose that in the above example, at the first look \( t_{1} \left( {E_{1} } \right) = t_{1} \left( {E_{2} } \right) = 0.30 \), but at the second look \( t_{2} \left( {E_{1} } \right) = 0.40 \) and \( t_{2} \left( {E_{2} } \right) = 0.65 \), and assume that \( H_{12} \) is not rejected at the first look; yet, it can be rejected at the second look if either \( p_{1} \left( {t_{2} } \right) < \alpha_{1} \left( {0.02, t_{2} \left( {E_{1} } \right) = 0.40} \right) \) or \( p_{2} \left( {t_{2} } \right) < \alpha_{2} \left( {0.005, t_{2} \left( {E_{2} } \right) = 0.65} \right) \). Now, suppose that at this stage \( H_{12} \) is rejected by observing that \( p_{2} \left( {t_{2} } \right) < \alpha_{2} \left( {0.005, t_{2} \left( {E_{2} } \right) = 0.65} \right) \), leading to the rejection of \( H_{2} \) as before. Therefore, the alpha of 0.005 for the rejected \( H_{2} \) will now be recycled for testing \( H_{1} \), that is, by updating the old boundary value of \( \alpha_{1} \left( {0.02, t_{2} \left( {E_{1} } \right) = 0.40} \right) \) to a new boundary value \( \alpha_{1} \left( {0.025, t_{2} \left( {E_{1} } \right) = 0.40} \right) \) at this second look, and to \( \alpha_{1} \left( {0.025, t_{3} \left( {E_{1} } \right) = 1} \right) \) at the final look.

Note that in the above, after rejecting \( H_{2} \) at the second look, the significance level for testing \( H_{1} \) is \( \alpha_{1} \left( {0.025, t_{2} \left( {E_{1} } \right) = 0.40} \right) \), which is not equal to \( \alpha = 0.025 \). Wrongly testing \( H_{1} \) at \( \alpha = 0.025 \), instead of testing it at level \( \alpha_{1} \left( {0.025, t_{2} \left( {E_{1} } \right) = 0.40} \right) \), after the rejection of \( H_{2} \) can inflate the overall Type I error rate. Also, if the trial stops at a look after rejecting a hypothesis for ethical reasons, say after the rejection of \( H_{2} \), then one cannot test a second hypothesis such as \( H_{1} \) at the full significance level of \( \alpha = 0.025 \). Doing so can inflate the overall Type I error rate, except for the special case when the test statistics for the two hypotheses are independent. We consider this type of GS trial in Sect. 7.4.

The spending function used to test each hypothesis needs to satisfy a monotonicity property. That is, the difference function \( f\left( {\lambda , t_{k} } \right) - f\left( {\lambda , t_{k - 1} } \right) \) is monotonically non-decreasing in λ for \( k = 1, \ldots , K \). For example, the OF-like α-spending function satisfies this condition for \( \lambda < 0.318 \) (Maurer and Bretz 2013).
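This monotonicity condition can be checked numerically; a small sketch for the OF-like spending function over a grid of λ values below 0.318 (the look times used are illustrative):

```python
from scipy.stats import norm

def of_like(lam, t):
    """OF-like alpha-spending: f(lam, t) = 2*(1 - Phi(z_{1-lam/2}/sqrt(t)))."""
    return 2 * norm.sf(norm.ppf(1 - lam / 2) / t ** 0.5)

# Check that the increment f(lam, t_k) - f(lam, t_{k-1}) is non-decreasing
# in lam over a grid covering (0, 0.318).
looks = [0.30, 0.65, 1.0]
lams = [0.001 * i for i in range(1, 318)]
for t_prev, t_next in zip(looks, looks[1:]):
    incr = [of_like(lam, t_next) - of_like(lam, t_prev) for lam in lams]
    assert all(a <= b + 1e-12 for a, b in zip(incr, incr[1:]))
```

The check passes on this grid, consistent with the stated bound \( \lambda < 0.318 \).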

The above weighted Bonferroni-based CTP for testing two hypotheses can be extended to testing more than two hypotheses if the weights assigned for testing intersection hypotheses in the CTP are such that the consonance property is guaranteed, that is, the rejection of an intersection hypothesis in the CTP leads to the rejection of at least one individual hypothesis in that intersection hypothesis. For example, for testing two primary hypotheses \( H_{1} \) and \( H_{2} \) and a secondary hypothesis \( H_{3} \) of a trial, the CTP would consider four intersection hypotheses \( H_{123} \), \( H_{12} \), \( H_{13} \) and \( H_{23} \) and three individual hypotheses.

The following selection of weights for performing Bonferroni-based tests of the intersection hypotheses in the CTP would then satisfy the consonance property. Assign nonnegative weights \( w_{1} \), \( w_{2} \), and \( w_{3} \), associated with the indices (1, 2, and 3) of \( H_{123} \), to test this hypothesis with \( w_{1} + w_{2} = 1 \) and \( w_{3} = 0 \); the selection of \( w_{3} = 0 \) indicates that \( H_{3} \) is tested only after at least one of the two primary hypotheses is first rejected. Assign weights of \( \{ w_{1} ,w_{2} \} \) to \( H_{1} \) and \( H_{2} \), respectively, to test \( H_{12} \). Similarly, assign weights of \( \{ w_{1} + \delta_{2} w_{2} ,(1 - \delta_{2} )w_{2} \} \) to test \( H_{13} \), and weights of \( \{ w_{2} + \delta_{1} w_{1} ,(1 - \delta_{1} )w_{1} \} \) to test \( H_{23} \), where \( 0 \le \delta_{1} \le 1 \) and \( 0 \le \delta_{2} \le 1. \) The weight assigned to each of the individual hypotheses will be one. The selection of these weights and the recycling parameters \( \delta_{1} \) and \( \delta_{2} \) can be based, for example, on the trial objectives. Once such weights for performing the weighted Bonferroni tests satisfy consonance, a CTP for testing the above three hypotheses in a GS trial can be proposed.
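Because these weights satisfy the sufficient condition \( w_{j} \left( {\varvec{J}^{{*}} } \right) \ge w_{j} \left( \varvec{J} \right) \) for every \( \varvec{J}^{{*}} \subseteq \varvec{J} \), consonance can be verified mechanically; a minimal sketch (the numeric values of \( w_{1} \), \( w_{2} \), \( \delta_{1} \), and \( \delta_{2} \) are illustrative):

```python
def weights_scheme(w1, w2, d1, d2):
    """Weight scheme of the text for two primary hypotheses (H1, H2) and a
    secondary hypothesis (H3), with w1 + w2 = 1 and 0 <= d1, d2 <= 1."""
    return {
        frozenset({1, 2, 3}): {1: w1, 2: w2, 3: 0.0},
        frozenset({1, 2}): {1: w1, 2: w2},
        frozenset({1, 3}): {1: w1 + d2 * w2, 3: (1 - d2) * w2},
        frozenset({2, 3}): {2: w2 + d1 * w1, 3: (1 - d1) * w1},
        frozenset({1}): {1: 1.0},
        frozenset({2}): {2: 1.0},
        frozenset({3}): {3: 1.0},
    }

def is_consonant(weights):
    """Sufficient condition for consonance of weighted Bonferroni tests:
    w_j(J*) >= w_j(J) for every J* subset of J and every j in J*."""
    return all(wJ2[j] >= wJ[j] - 1e-12
               for J, wJ in weights.items()
               for J2, wJ2 in weights.items() if J2 <= J
               for j in J2)
```

For example, `is_consonant(weights_scheme(0.8, 0.2, 0.5, 0.5))` returns `True`, confirming that every subset inherits at least its weight from each superset.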

GS trials that are not properly conducted have the potential of unblinding the trial prematurely, which may impact the integrity of the trial and its results. To address this important issue, an independent Data Monitoring Committee (DMC), along with a charter, is usually set up for GS trials. As our focus in this chapter is to overview the general multiple testing approaches for group sequential trials, we do not discuss this issue here. The interested reader may consult the relevant literature in this regard; see, e.g., Ellenberg et al. (2017). The concerns about potential unblinding when testing a single hypothesis over the course of a GS trial remain the same for GS trials testing multiple hypotheses related to multiple endpoints.

For a GS trial that includes testing of multiple hypotheses, a Statistical Analysis Plan (SAP) that explains in sufficient detail the design, the analysis methods, and the DMC charter is essential for proper interpretation of study findings. Such a SAP should in general be developed a priori and agreed upon by those involved before launching the trial.

7.3.4 Graphical Approach

The above weighted Bonferroni-based CTP for testing multiple hypotheses of a GS trial, though possible, can be challenging because finding appropriate weights that guarantee consonance is difficult when the number of hypotheses tested is more than a few. The graphical approach of Bretz et al. (2009), which includes a special algorithm for updating weights, solves this problem. In this approach, one can graphically visualize the weighted Bonferroni tests for multiple hypotheses along with an α-propagation rule by which the procedure recycles the significance level of a rejected hypothesis to the other remaining unrejected hypotheses. This graphical approach, originally developed for testing multiple hypotheses of non-GS trials, can also be conveniently used for testing multiple hypotheses of GS trials; see, for example, Maurer and Bretz (2013). The following explains the key concepts of this approach for testing multiple hypotheses.

In this graphical approach, the h individual hypotheses are represented initially by a set of h nodes with a nonnegative weight \( w_{i} \) at node \( i \left( {i = 1, \ldots ,h} \right) \) such that \( \sum\nolimits_{i = 1}^{h} {w_{i} \le 1} \). These weights, when multiplied by \( \alpha \), represent the local significance levels at the respective nodes. The weight \( g_{ij} \) (with \( 0 \le g_{ij} \le 1 \)) associated with a directed edge connecting node \( i \) to node \( j \) indicates the fraction of the local significance level at the tail node \( i \) that is added to the significance level at the terminal node \( j \) if the hypothesis at the tail node \( i \) is rejected. For convenience, we will call these directed edges “arrows” running from one node to the other, and the weight \( g_{ij} \) the “transition weight” on the arrow running from node \( i \) to node \( j \).

Figure 7.1 illustrates the key concepts of this graphical approach for testing two primary hypotheses \( H_{1} \) and \( H_{2} \) and a secondary hypothesis \( H_{3} \) of a trial. In this figure, the initial Graph (a) shows three nodes. Two nodes represent \( H_{1} \) and \( H_{2} \) with weights \( w_{1} = 3/4 \) and \( w_{2} = \left( {1 - w_{1} } \right) = 1/4 \), respectively. The node for \( H_{3} \) shows a weight \( w_{3} = 0 \), which can increase only after the rejection of a primary hypothesis. The nonnegative number \( g_{12} = 1/4 \) is the transition weight on the arrow going from \( H_{1} \) to \( H_{2} \); similarly, \( g_{21} = 1/4 \) is the transition weight on the arrow going from \( H_{2} \) to \( H_{1} \). The transition weight on the arrow going from \( H_{1} \) to \( H_{3} \) is 3/4 and that on the arrow going from \( H_{2} \) to \( H_{3} \) is also 3/4, satisfying the condition that the sum of the transition weights of all outgoing arrows from a single node must be bounded above by 1.

Fig. 7.1
figure 1

Graphical representation of testing with two primary hypotheses H1 and H2, and one secondary hypothesis H3

Graph (b) of Fig. 7.1 represents the resulting graph after \( H_{2} \) is rejected in Graph (a). The rejection of this hypothesis frees its weight \( w_{2} \) which is then recycled to \( H_{1} \) and \( H_{3} \) according to an α-propagation rule addressed in the following for the general case. This rule also calculates new transition weights going from one node to the other for the new graph. Graph (c) of Fig. 7.1 similarly shows the resulting graph if \( H_{1} \) is rejected in Graph (a). The following shows the general graphical procedure for testing h individual hypotheses \( H_{1} , \ldots , H_{h} \) for a non-GS trial given their individual unadjusted p-values \( p_{j} \) for \( j = 1, \ldots , h \).

Step (0)::

Set \( \varvec{I} = \left\{ {1, \ldots , h} \right\} \). The weights \( \{ w_{j} \left( \varvec{I} \right), j \in \varvec{I}\} \) are such that \( 0 \le w_{j} \left( \varvec{I} \right) \le 1 \) with \( \sum\nolimits_{{j \in \varvec{I}}} {w_{j} \left( \varvec{I} \right)} \le 1 \).

Step (i)::

Select a \( j \in \varvec{I} \) such that \( p_{j} < w_{j} \left( \varvec{I} \right)\alpha \) and reject \( H_{j} \); otherwise stop.

Step (ii)::

Update the graph as:

(a) \( \varvec{I} = \varvec{I}\backslash \left\{ j \right\} \), i.e., the index set \( \varvec{I} \) without the index j

(b)
$$ w_{l} \left( \varvec{I} \right) = w_{l} \left( \varvec{I} \right) + w_{j} \left( \varvec{I} \right)g_{jl} ,l \in \varvec{I};\,0,\,{\text{otherwise}} $$
(7.3.1)

(c)
$$ g_{lk} = \frac{{g_{lk} + g_{lj} g_{jk} }}{{1 - g_{lj} g_{jl} }},\,{\text{where}}\, \left( {l, k} \right) \in \varvec{I},l \ne k\,{\text{and}}\,g_{lj} g_{jl} < 1;\,0,\,{\text{otherwise}} $$
(7.3.2)

Step (iii)::

If \( \left| \varvec{I} \right| \ge 1 \) then go to step (i); otherwise stop.

After rejecting \( H_{j} \), Eq. (7.3.1) for the new graph updates the weight for \( H_{l} \) to its old weight \( w_{l} \left( \varvec{I} \right) \) plus the weight \( w_{j} \left( \varvec{I} \right) \) at \( H_{j} \) multiplied by the transition weight \( g_{jl} \) on the arrow connecting \( H_{j} \) to \( H_{l} \). Also, the transition weights \( g_{lk} \) for the new graph are obtained from (7.3.2), whose numerator \( g_{lk} + g_{lj} g_{jk} \) is the transition weight on the arrow connecting \( H_{l} \) to \( H_{k} \) plus the product of the transition weights on the arrows going from \( H_{l} \) to \( H_{k} \) through the rejected hypothesis \( H_{j} \). The term \( g_{lj} g_{jl} \) in (7.3.2) is the product of the transition weights on the arrows connecting \( H_{l} \) to \( H_{j} \) and then returning to \( H_{l} \). The approach produces weights \( w_{l} \left( \varvec{I} \right) \) which satisfy consonance.

To explain this procedure, consider a trial which, for demonstrating superiority of a new treatment A + Standard of Care (SOC) to placebo + SOC, plans to test two primary hypotheses \( H_{1} \) and \( H_{2} \) and two secondary hypotheses \( H_{3} \) and \( H_{4} \), where the pairs (\( H_{1} \), \( H_{3} \)) and (\( H_{2} \), \( H_{4} \)) are considered parent–descendant (Maurer et al. 2011). That is, \( H_{3} \) is tested only when \( H_{1} \) is rejected, and similarly, \( H_{4} \) is tested only when \( H_{2} \) is rejected. Suppose that the trial specifies a graphical test strategy as in Fig. 7.2 for testing these four hypotheses. The initial Graph (a) in Fig. 7.2 gives a smaller weight of \( w_{1} = 1/5 \) to \( H_{1} \) as compared to a weight of \( w_{2} = 4/5 \) to \( H_{2} \), based on the prior experience that the trial may win easily for \( H_{1} \) at the significance level of \( w_{1} \alpha = 0.005 \), but may require a larger significance level of \( w_{2} \alpha = 0.02 \) for winning for \( H_{2} \). As stated before, we assume that all tests in the procedure are 1-sided and the control of the overall Type I error rate is at level \( \alpha = 0.025 \). Graph (a) assigns zero weights to the two secondary hypotheses, indicating that we do not want to reject a secondary hypothesis until its parent primary hypothesis is first rejected.

Fig. 7.2
figure 2

Graphical test procedure for two primary hypotheses H1 and H2, and two secondary hypotheses H3 and H4, where pairs (\( H_{1} \), \( H_{3} \)) and (\( H_{2} \), \( H_{4} \)) are parent–descendant

In Graph (a) of Fig. 7.2, \( g_{12} = g_{21} = g_{13} = g_{24} = 1/2 \) and \( g_{32} = g_{41} = 1 \). These settings mean that if \( H_{1} \) was rejected in Graph (a) then a fraction 1/2 of \( w_{1} \) would be recycled to \( H_{2} \) so that the weight at \( H_{2} \) would become \( w_{2} + \left( {1/2} \right)w_{1} = 9/10 \) and the remainder \( \left( {1/2} \right)w_{1} = 1/10 \) would go to \( H_{3} \); the weight at \( H_{4 } \) would remain 0 because there is no arrow going from \( H_{1} \) to \( H_{4 } \) meaning that \( g_{14} = 0 \). The rejection of \( H_{1} \) in Graph (a) would lead to Graph (b) with new transition weights obtained from (7.3.2) as: \( g_{23} = 1/3, \,g_{24} = 2/3 ,\, g_{42} = g_{43} = 1/2 \) and \( g_{32} = 1 \). Similarly, if \( H_{2} \) was initially rejected in Graph (a), then a fraction 1/2 of \( w_{2} \) would be recycled to \( H_{1} \) so that the weight at \( H_{1} \) would become \( w_{1} + \left( {1/2} \right)w_{2} = 3/5 \) and the remainder \( \left( {1/2} \right)w_{2} = 2/5 \) would go to \( H_{4 } \); the weight at \( H_{3} \) would remain 0 as there is no arrow going from \( H_{2} \) to \( H_{3} \) giving \( g_{23} = 0 \). The rejection of \( H_{2} \) in Graph (a) would lead to Graph (c) with transition weights obtained from (7.3.2) as: \( g_{13} = 2/3, g_{14} = 1/3 , g_{34} = g_{31} = 1/2 \) and \( g_{41} = 1. \)
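The updates just described can be checked mechanically; a minimal sketch of the α-propagation step of (7.3.1)–(7.3.2), applied to the Graph (a) weights of Fig. 7.2:

```python
def update_graph(weights, G, j):
    """One alpha-propagation step of the graphical approach: remove the
    rejected node j, recycle its weight via Eq. (7.3.1), and update the
    transition weights via Eq. (7.3.2)."""
    I = [l for l in weights if l != j]
    new_w = {l: weights[l] + weights[j] * G.get((j, l), 0.0) for l in I}
    new_G = {}
    for l in I:
        for k in I:
            if l != k:
                loop = G.get((l, j), 0.0) * G.get((j, l), 0.0)
                if loop < 1:
                    new_G[(l, k)] = (G.get((l, k), 0.0) +
                                     G.get((l, j), 0.0) * G.get((j, k), 0.0)) / (1 - loop)
    return new_w, new_G

# Graph (a) of Fig. 7.2: weights (1/5, 4/5, 0, 0), transition weights
# g12 = g21 = g13 = g24 = 1/2 and g32 = g41 = 1 (all others zero).
w = {1: 0.2, 2: 0.8, 3: 0.0, 4: 0.0}
G = {(1, 2): 0.5, (2, 1): 0.5, (1, 3): 0.5, (2, 4): 0.5,
     (3, 2): 1.0, (4, 1): 1.0}
w_b, G_b = update_graph(w, G, 1)  # rejecting H1 yields Graph (b)
# w_b: H2 gets 9/10, H3 gets 1/10; G_b: g23 = 1/3, g24 = 2/3, g42 = g43 = 1/2
```

The computed weights and transition weights reproduce the Graph (b) values stated above.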

The value \( g_{32} = 1 \) in Graph (b) indicates that if \( H_{3} \) was rejected after the rejection of \( H_{1} \), then the entire weight of \( \left( {1/2} \right)w_{1} = 1/10 \) at \( H_{3} \) would be recycled to \( H_{2} \), so that the total weight at \( H_{2} \) after the rejection of both \( H_{1} \) and \( H_{3} \) would be \( (w_{2} + \left( {1/2} \right)w_{1} = 9/10) + \left( {\left( {1/2} \right)w_{1} = 1/10} \right) = 1 \); the weight at \( H_{4} \) would remain zero as in this graph there is no arrow going from \( H_{3} \) to \( H_{4} \). Therefore, after the rejection of both \( H_{1} \) and \( H_{3} \), Graph (b) would reduce to Graph (d). Similarly, \( g_{41} = 1 \) in Graph (c) indicates that if \( H_{4} \) was rejected after the rejection of \( H_{2} \), then the entire weight \( \left( {1/2} \right)w_{2} = 2/5 \) at \( H_{4} \) would be recycled to \( H_{1} \), so that the total weight at \( H_{1} \) after the rejection of both \( H_{2} \) and \( H_{4} \) would be \( (w_{1} + \left( {1/2} \right)w_{2} = 3/5) + \left( {\left( {1/2} \right)w_{2} = 2/5} \right) = 1 \); the weight at \( H_{3} \) would remain zero. Therefore, after the rejection of both \( H_{2} \) and \( H_{4} \), Graph (c) would reduce to Graph (e). However, if either \( H_{2} \) was rejected in Graph (b) or \( H_{1} \) was rejected in Graph (c), then these graphs would reduce to Graph (f).

7.3.5 Illustrative Example of the Graphical Approach for GS Trials

The above graphical approach, originally developed for testing multiple hypotheses in non-GS trials, also applies to GS trials. Recycling of the alpha of a rejected hypothesis to other hypotheses occurs similarly, but the boundary values for testing the unrejected hypotheses are calculated using spending functions. For example, consider the above trial for testing two primary hypotheses \( H_{1} \) and \( H_{2} \) and two secondary hypotheses \( H_{3} \) and \( H_{4} \), where the pairs (\( H_{1} \), \( H_{3} \)) and (\( H_{2} \), \( H_{4} \)) are parent–descendant.

In the beginning, we start with Graph (a) of Fig. 7.2 with four hypotheses \( \left\{ {H_{j} , j \in \varvec{I}_{1} = \left\{ {1, 2, 3, 4} \right\}} \right\} \) identified by four nodes and the associated weights \( \{ w_{j} \left( {\varvec{I}_{1} } \right), j \in \varvec{I}_{1} \} = \{ 1/5,4/5,0, 0\} \). These weights give the starting overall significance levels \( \left\{ {w_{j} \left( {\varvec{I}_{1} } \right)\alpha , j \in \varvec{I}_{1} ;\alpha = 0.025 } \right\} = \left\{ {0.005,0.02,0,0} \right\} \), where the j-th level is used for testing \( H_{j} \), with its spending function \( f_{j} \) determining its boundary values. That is, in the beginning, with Graph (a), we test each \( H_{j} \) \( \left( {j \in \varvec{I}_{1} } \right) \) in the univariate GS testing framework with the overall Type I error rate controlled at level \( w_{j} \left( {\varvec{I}_{1} } \right)\alpha \), so that the total overall Type I error rate for the trial is controlled at level \( \sum\nolimits_{{j \in \varvec{I}_{1} }} {w_{j} \left( {\varvec{I}_{1} } \right)\alpha } = \alpha . \)

For this example, we assume that the \( f_{j} \)’s are all equal to \( f(\gamma ,t) = 2\{ 1 - \varPhi (z_{1 - \gamma /2} /\sqrt {t} )\} \), which is OF-like, where \( \gamma \) is the overall significance level for the repeated testing of a hypothesis. The weights \( w_{3} \left( {\varvec{I}_{1} } \right) = w_{4} \left( {\varvec{I}_{1} } \right) = 0 \) indicate that \( H_{3} \) and \( H_{4} \) are effectively not tested in Graph (a); their boundary values are zero, so they could not be rejected. The following describes how the procedure performs tests of these hypotheses at different looks and how it recycles the unused alpha of a rejected hypothesis to other unrejected hypotheses.

Tests at the First Interim Look:

Suppose that at the first look, the information fraction is \( t_{1} = 1/2 \). For this example, we assume that the information fraction at a look remains the same for different hypotheses. If this is not the case, the procedure will proceed as discussed in Sect. 7.3.3. The univariate group sequential procedure for testing a hypothesis in a single-hypothesis trial calculates the boundary values for interim looks given the overall significance level \( \alpha \). However, in our case, there is more than one significance level, namely \( \left\{ {w_{j} \left( {\varvec{I}_{1} } \right)\alpha , j \in \varvec{I}_{1} ;\alpha = 0.025 } \right\} = \left\{ {0.005,0.02,0,0} \right\} \) assigned to \( \{ H_{j} ,j \in \varvec{I}_{1} \} \). These overall significance levels, and the use of the OF-like spending function at \( t_{1} = 0.5 \), then give the boundary values \( \{ \alpha_{j} \left( { w_{j} \left( {\varvec{I}_{1} } \right)\alpha , t_{1} } \right), j \in \varvec{I}_{1} \} = \left\{ {0.00007,0.0010,0,0} \right\} \) for testing \( \{ H_{j} ,j \in \varvec{I}_{1} \} \) at the first look. Note that the subscript of t identifies the look number and the subscript j identifies the hypothesis \( H_{j} \) being tested. Also note that the boundary value \( \alpha_{j} \left( { w_{j} \left( {\varvec{I}_{1} } \right)\alpha , t_{k} } \right) \) is a function of the overall significance level \( w_{j} \left( {\varvec{I}_{1} } \right)\alpha \) assigned to \( H_{j} \) and the information fraction \( t_{k} \) at look k; here \( k = 1 \).
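At the first look nothing has yet been spent, so the nominal boundary on the p-value scale equals the cumulative spend \( f(w_{j}(\varvec{I}_{1})\alpha, t_{1}) \). A minimal standard-library sketch reproduces the boundary values quoted above:

```python
from math import sqrt
from statistics import NormalDist

nd = NormalDist()

def of_like_spend(gamma, t):
    """OF-like alpha-spending function f(gamma, t) = 2{1 - Phi(z_{1-gamma/2}/sqrt(t))}."""
    z = nd.inv_cdf(1 - gamma / 2)
    return 2 * (1 - nd.cdf(z / sqrt(t)))

alpha = 0.025
weights = [1/5, 4/5, 0, 0]                 # w_j(I_1) from Graph (a)
levels = [w * alpha for w in weights]      # overall levels {0.005, 0.02, 0, 0}
# At t1 = 1/2 the first-look boundary equals the cumulative spend:
bounds = [of_like_spend(g, 0.5) if g > 0 else 0.0 for g in levels]
print([round(b, 5) for b in bounds])       # [7e-05, 0.001, 0.0, 0.0]
```

These are the values {0.00007, 0.0010, 0, 0} given in the text.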

Suppose that at the first look, the unadjusted p-values \( \{ p_{j} \left( {t_{1} } \right),j \in \varvec{I}_{1} \} \) associated with \( \{ H_{j} , j \in \varvec{I}_{1} \} \) are such that \( p_{j} \left( {t_{1} } \right) \ge \alpha_{j} \left( { w_{j} \left( {\varvec{I}_{1} } \right)\alpha , t_{1} } \right) \) for \( j \in \varvec{I}_{1} \); consequently, the trial will continue to the second look without the rejection of any hypothesis at the first look. For recording purposes, one can summarize the above testing information at the first look as in Table 7.3.

Table 7.3 Tests information at the first look at \( t_{1} = 1/2 \) according to Graph (a)

Tests at the Second Look:

Suppose that the trial conducts the second look when the information fraction is \( t_{2} = 3/4 \). Since none of the hypotheses was rejected at the first look, we begin with Graph (a) at the second look, using the same overall significance levels \( \{ w_{j} \left( {\varvec{I}_{1} } \right)\alpha ,j \in \varvec{I}_{1} \} = \left\{ {0.005,0.02,0,0} \right\} \) that were used at the first look. However, as \( t_{2} = 3/4 \) at the second look, the use of the OF-like spending function leads to the boundary values \( \{ \alpha_{j} \left( { w_{j} \left( {\varvec{I}_{1} } \right)\alpha , t_{2} } \right), j \in \varvec{I}_{1} \} = \left\{ {0.00117,0.0069,0,0} \right\} \) for testing \( H_{j} \) for \( j \in \varvec{I}_{1} \). The boundary values for testing \( H_{3} \) and \( H_{4} \) remain zero, as no primary hypothesis has been rejected so far. Suppose that at this second look, the observed p-values associated with \( H_{1} \), \( H_{2} \), \( H_{3} \), and \( H_{4} \) are \( p_{1} \left( {t_{2} } \right) = 0.001, p_{2} \left( {t_{2} } \right) = 0.020 \), \( p_{3} \left( {t_{2} } \right) = 0.040, \) and \( p_{4} \left( {t_{2} } \right) = 0.091 \), respectively. These results lead to the rejection of \( H_{1} \) at the second look, as \( p_{1} \left( {t_{2} } \right) = 0.001 \) is less than its boundary value of 0.00117; see Table 7.4a.

Table 7.4 a Tests information at the second look at \( t_{2} = 3/4 \) according to Graph (a) after no rejection at the first look. b Tests information at the second look at \( t_{2} = 3/4 \) according to Graph (b) after the rejection of \( H_{1} \) at this look

The above rejection of \( H_{1} \) at the second look then frees its overall significance level \( w_{1} \left( {\varvec{I}_{1} } \right)\alpha = 0.005 \) as unused alpha, which is recycled to the remaining three hypotheses for their tests according to Graph (b). This revised graph, constructed after the rejection of \( H_{1} \), allows retesting of the remaining hypotheses \( \{ H_{j} , j \in \varvec{I}_{2} = \left\{ {2, 3, 4} \right\}\} \) at their corresponding overall significance levels \( \left\{ {w_{j} \left( {\varvec{I}_{2} } \right)\alpha , j \in \varvec{I}_{2} } \right\} = \left\{ { - ,\left( {9/10} \right)\alpha ,\left( {1/10} \right)\alpha ,\left( 0 \right)\alpha } \right\} = \left\{ { - ,0.0225,0.00255,0} \right\} \). Note that the overall significance levels for testing \( H_{2} \) and \( H_{3} \) are now increased, creating the possibility of additional rejections at the second look according to Graph (b). The use of the OF-like spending function with these updated overall significance levels and \( t_{2} = 3/4 \) then produces the boundary values \( \{ \alpha_{j} \left( { w_{j} \left( {\varvec{I}_{2} } \right)\alpha , t_{2} } \right), j \in \varvec{I}_{2} \} = \left\{ { - ,0.00802,0.00047,0} \right\} \) for testing \( H_{j} \) for \( j \in \varvec{I}_{2 } \); see Table 7.4b. However, in this table, as \( p_{2} \left( {t_{2} } \right) = 0.020 > 0.00802 \) and \( p_{3} \left( {t_{2} } \right) = 0.040 > 0.00047 \), there are no additional rejections at the second look. Therefore, the trial moves to the next look, which is the final look.
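Updated boundary values such as 0.00802 come from re-running the univariate spending computation at the recycled overall level, as if that level had been used from the start. A standard-library sketch for \( H_{2} \) at level \( (9/10)\alpha = 0.0225 \) (the quadrature grid, bisection, and function names are our choices, not the chapter's):

```python
from math import sqrt, exp, pi
from statistics import NormalDist

nd = NormalDist()

def of_like_spend(gamma, t):
    """OF-like spending f(gamma, t) = 2{1 - Phi(z_{1-gamma/2}/sqrt(t))}."""
    z = nd.inv_cdf(1 - gamma / 2)
    return 2 * (1 - nd.cdf(z / sqrt(t)))

def second_look_p_boundary(gamma, t1, t2, n=4000):
    """Nominal p-value boundary at look 2: choose c2 so that the probability
    of a first boundary crossing at look 2 equals f(gamma, t2) - f(gamma, t1)."""
    rho = sqrt(t1 / t2)                          # Corr(Z(t1), Z(t2)) for GS statistics
    c1 = nd.inv_cdf(1 - of_like_spend(gamma, t1))
    target = of_like_spend(gamma, t2) - of_like_spend(gamma, t1)

    def cross_prob(c2):
        # Pr(Z(t1) <= c1, Z(t2) > c2) by midpoint quadrature over Z(t1)
        h = (c1 + 8.0) / n
        total = 0.0
        for i in range(n):
            u = -8.0 + (i + 0.5) * h
            pdf = exp(-0.5 * u * u) / sqrt(2 * pi)
            total += pdf * (1 - nd.cdf((c2 - rho * u) / sqrt(1 - rho * rho))) * h
        return total

    lo, hi = 0.0, 6.0                            # bisection on the z-scale
    for _ in range(60):
        mid = 0.5 * (lo + hi)
        if cross_prob(mid) > target:
            lo = mid                             # boundary too low: error overspent
        else:
            hi = mid
    return 1 - nd.cdf(0.5 * (lo + hi))

# H2's recycled level (9/10)*0.025 = 0.0225, looks at t1 = 1/2 and t2 = 3/4:
p2 = second_look_p_boundary(0.0225, 0.5, 0.75)
print(round(p2, 5))    # close to the chapter's 0.00802
```

The same call with \( \gamma = 0.00255 \) reproduces, to quadrature accuracy, the boundary 0.00047 quoted for \( H_{3} \).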

Tests at the Final Look:

After the rejection of \( H_{1} \) at the second look, the tests for the remaining three hypotheses \( \{ H_{j} , j \in \varvec{I}_{2} \} \) at the final look start with the same Graph (b) and the same overall significance levels \( \left\{ {w_{j} \left( {\varvec{I}_{2} } \right)\alpha ,j \in \varvec{I}_{2} } \right\} = \left\{ { - ,\left( {9/10} \right)\alpha ,\left( {1/10} \right)\alpha ,\left( 0 \right)\alpha } \right\} = \left\{ { - ,0.0225,0.00255,0} \right\} \) for testing \( \{ H_{j} , j \in \varvec{I}_{2} = \left\{ {2, 3, 4} \right\}\} \). However, as \( t_{3} = 1 \) at this look, the use of the same OF-like spending function produces the boundary values \( \{ \alpha_{j} \left( { w_{j} \left( {\varvec{I}_{2} } \right)\alpha , t_{3} } \right), j \in \varvec{I}_{2} \} = \left\{ { - ,0.01988,0.00234,0} \right\} \) for testing \( \{ H_{j} , j \in \varvec{I}_{2} \} \) at this look. Suppose that at this final look, the observed p-values associated with \( H_{2} \), \( H_{3} \), and \( H_{4} \) are \( p_{2} \left( {t_{3} } \right) = 0.012 \), \( p_{3} \left( {t_{3} } \right) = 0.008, \) and \( p_{4} \left( {t_{3} } \right) = 0.041 \), respectively. These results then lead to the rejection of \( H_{2} \) at the final look, as \( p_{2} \left( {t_{3} } \right) = 0.012 \) is less than its corresponding boundary value of 0.01988; see Table 7.5a.

Table 7.5 a Tests information at the final look at \( t_{3} = 1 \) according to Graph (b) after the rejection of \( H_{1} \) at the second look. b Tests information at the final look at \( t_{3} = 1 \) according to Graph (f) after the rejection of \( H_{1} \) at the second look and the rejection of \( H_{2} \) at the final look

Now, as \( H_{1} \) was rejected at the second look and \( H_{2} \) is rejected at the final look, the tests of hypotheses \( H_{3} \) and \( H_{4} \) at the final look will be at the increased overall significance levels \( \{ w_{j} \left( {\varvec{I}_{3} } \right)\alpha ,j \in \varvec{I}_{3} = \left\{ {3,4} \right\}\} = \left\{ {\left( {2/5} \right)\alpha ,\left( {3/5} \right)\alpha } \right\} = \left\{ {0.010,0.015} \right\} \) according to Graph (f). These levels, with the OF-like spending function, give the boundary values {–, –, 0.00907, 0.01344} for testing \( \{ H_{j} , j \in \varvec{I}_{3} \} \), rejecting also \( H_{3} \) at this final look, as \( p_{3} \left( {t_{3} } \right) = 0.008 \) is less than 0.00907; see Table 7.5b. Consequently, the remaining \( H_{4} \) can be tested at this look at the full overall significance level of α = 0.025, which gives the boundary value of 0.0220 for its testing. Therefore, as \( p_{4} \left( {t_{3} } \right) = 0.041 > 0.0220 \) for \( H_{4} \), the trial stops without the rejection of this hypothesis.

7.4 Testing a Secondary Hypothesis When the Trial Stops After the Rejection of a Primary Hypothesis

Consider, for example, a trial for testing a primary hypothesis \( H_{1} \) and a secondary hypothesis \( H_{2} \) with one interim look and a final look at information fractions \( t_{1} \) and \( t_{2} = 1\;(0 < t_{1} < t_{2} ) \), respectively. The trial, if it rejects \( H_{1} \) at the interim look, stops at that look for ethical reasons. This will in general be the case when \( H_{1} \) is associated with an endpoint such as mortality. Therefore, \( H_{2} \) must be tested at the same interim look at which \( H_{1} \) is rejected, and this test of \( H_{2} \) must occur after the rejection of \( H_{1} \).

A question often arises: Can the test of \( H_{2} \) at the interim look, after the rejection of \( H_{1} \) at that look, be at the full significance level \( \alpha \) (e.g., \( \alpha = 0.025) \)? This question may arise based on the considerations that \( H_{2} \) is not tested unless \( H_{1} \) is first rejected and that there is no repeated testing of \( H_{2} \) after the rejection of \( H_{1} \). Tamhane et al. (2010) (see also Xi and Tamhane 2015) showed that the answer to this question is affirmative only for the special case when the test statistics for testing \( H_{1} \) and \( H_{2 } \) are independent; testing \( H_{2} \) at the full level can inflate the overall Type I error rate if the test statistics are correlated. They showed that, under certain distributional assumptions on the test statistics, the exact adjusted significance level for testing \( H_{2} \) can be found if this correlation is known. However, if this correlation is unknown, then an upper bound on the adjusted significance level can be set that covers all correlations. The following revisits this work in some detail because of its importance for clinical trial applications.

We assume that the trial is designed to demonstrate superiority of a new treatment over control such that \( H_{i} : \delta_{i} \le 0 \left( {i = 1,2 } \right) \), where \( \delta_{i} \) is the treatment difference parameter. Also, X and Y are the test statistics for testing \( H_{1} \) and \( H_{2} , \) respectively, which become \( \left( {X\left( {t_{k} } \right), Y\left( {t_{k} } \right)} \right) \) at information times \( t_{k} \left( {k = 1, 2} \right) \). Also, following the results of Sect. 7.2, we assume that each pair \( \left( {X\left( {t_{1} } \right), X\left( {t_{2} } \right)} \right) \) and \( \left( {Y\left( {t_{1} } \right), Y\left( {t_{2} } \right)} \right) \) follows a standard bivariate normal distribution with the same correlation of \( \sqrt {t_{1} } \). Further, we assume that each pair \( \left( {X\left( {t_{1} } \right), Y\left( {t_{1} } \right)} \right) \) and \( \left( {X\left( {t_{2} } \right), Y\left( {t_{2} } \right)} \right) \) follows a standard bivariate normal distribution with correlation coefficient \( \rho \ge 0 \). Furthermore, we assume that \( \left( {c_{1} , c_{2} } \right) \) and \( \left( {d_{1} , d_{2} } \right) \) are boundary values for testing \( H_{1} \) and \( H_{2} \), respectively, such that \( d_{1} \) is used only when \( H_{1} \) is rejected at the first look; similarly, \( d_{2} \) is used only when \( H_{1} \), retained at the first look, is rejected at the final look. The test strategy for this 2-stage design can then be stated as follows:

(Test strategy: at the interim look, reject \( H_{1} \) if \( X\left( {t_{1} } \right) > c_{1} \); if \( H_{1} \) is rejected, reject \( H_{2} \) if \( Y\left( {t_{1} } \right) > d_{1} \), and stop the trial. Otherwise, continue to the final look and reject \( H_{1} \) if \( X\left( {t_{2} } \right) > c_{2} \); if \( H_{1} \) is rejected there, reject \( H_{2} \) if \( Y\left( {t_{2} } \right) > d_{2} \).)

Determining the Boundary Values of the Procedure

Tests for \( H_{1} \) and \( H_{2} \) for the above 2-stage design can be carried out by the method based on the closed testing for GS trials as addressed in Sect. 7.3.2. The intersection hypothesis \( H_{12} \) would be tested by the weighted Bonferroni tests with weights of \( w_{1} = 1 \) and \( w_{2} = 0 \) associated with the tests of \( H_{1} \) and \( H_{2} \), respectively; \( w_{2} = 0 \) for \( H_{2} \) implies that this weight can increase only after \( H_{1} \) is rejected. Therefore, for this design, the rejection of \( H_{1} \) at level \( \alpha \) implies the rejection of \( H_{12} \) at level \( \alpha \). Consequently, \( H_{2} \) can be tested at the full significance level \( \alpha \). But as the trial is a GS trial with one interim look, the boundary values \( c_{1} \) and \( c_{2} \) for testing \( H_{1} \) can then be found from the following two equations:

$$ \Pr \{ X\left( {t_{1} } \right) > c_{1} |H_{1} \} = f_{1} \left( {\alpha , t_{1} \left( X \right)} \right) $$

and

$$ f_{1} \left( {\alpha , t_{1} \left( X \right)} \right) + \Pr \left\{ {X\left( {t_{1} } \right) \le c_{1} \cap X\left( {t_{2} } \right) > c_{2} |H_{1} } \right\} = f_{1} \left( {\alpha , t_{2} \left( X \right) = 1} \right), $$

where \( f_{1} \left( {\alpha , t} \right) \) is the spending function for testing \( H_{1} \), and \( t_{1} \left( X \right) \) and \( t_{2} \left( X \right) \) are the information fractions for testing \( H_{1} \) at the first and final looks, respectively. For example, when \( f_{1} \left( {\alpha , t} \right) \) is OF-like, α = 0.025, and \( t_{1} \left( X \right) = 0.5 \), then \( c_{1} = 2.95901 \) and \( c_{2} = 1.96869 \) on the normal z-scale, which translate to \( \alpha_{1} \left( {0.025, t_{1} \left( X \right) = 0.5} \right) = 0.00153 \) and \( \alpha_{2} \left( {0.025, t_{2} \left( X \right) = 1} \right) = 0.02449 \) on the p-value scale.

Since the significance level α used for testing \( H_{1} \) recycles to the test of \( H_{2} \) after the rejection of \( H_{1} \), the boundary values \( \left( {d_{1} , d_{2} } \right) \) for \( H_{2} \) also need to be calculated by a GS method at the same level α. The reason is that, though \( H_{2} \) is tested after the rejection of \( H_{1} \), the rejection of \( H_{2} \), similar to that of \( H_{1} \), can occur either at the first look or at the final look. Thus, if one uses the Pocock (1977) method for calculating the boundary values for testing \( H_{2} \), then at α = 0.025, \( t_{1} \left( Y \right) = 0.5 \), and \( t_{2} \left( Y \right) = 1 \), the value \( d = d_{1} = d_{2} = 2.17828 \) on the z-scale, which is 0.01469 on the p-value scale. However, the test statistics X and Y in many applications will be positively correlated. Therefore, if this correlation is ρ, and remains the same for the two looks, then it is natural to ask a key question: Is it possible to take advantage of this correlation and find \( d^{*} \le d \) while maintaining the control of the overall Type I error rate at level α = 0.025?
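The Pocock constant quoted above can be reproduced numerically by solving for the common boundary d at which the probability of crossing at either look equals α; a standard-library sketch (the quadrature grid and bisection are our choices, not the chapter's):

```python
from math import sqrt, exp, pi
from statistics import NormalDist

nd = NormalDist()

def two_look_error(d, t1=0.5, n=4000):
    """Pr(Z1 > d or Z2 > d) for standard GS statistics with Corr(Z1, Z2) = sqrt(t1)."""
    rho = sqrt(t1)
    first = 1 - nd.cdf(d)                    # cross at look 1
    h = (d + 8.0) / n
    second = 0.0
    for i in range(n):                       # Pr(Z1 <= d, Z2 > d) by quadrature
        u = -8.0 + (i + 0.5) * h
        pdf = exp(-0.5 * u * u) / sqrt(2 * pi)
        second += pdf * (1 - nd.cdf((d - rho * u) / sqrt(1 - rho * rho))) * h
    return first + second

# Bisection for the common (Pocock) boundary at alpha = 0.025, looks at t = 1/2 and 1:
lo, hi = 1.0, 4.0
for _ in range(60):
    mid = 0.5 * (lo + hi)
    if two_look_error(mid) > 0.025:
        lo = mid
    else:
        hi = mid
d = 0.5 * (lo + hi)
print(round(d, 3))    # 2.178
```

The result agrees with the value \( d = 2.17828 \) cited in the text.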

The following shows that this is possible, but the extent of the gain depends on the value of ρ. The larger the value of ρ on the interval \( 0 \le \rho \le 1 \), the smaller the gain; as ρ approaches one, the value of \( d^{*} \) approaches the value d determined by the Pocock (1977) method.

Determining the Value of \( d^{*} \)

Testing \( H_{1} \) and \( H_{2} \) gives rise to three null hypothesis configurations \( H_{12} = H_{1} \cap H_{2} \), \( H_{1} \cap K_{2} \), and \( K_{1} \cap H_{2} \), where \( K_{1} \) and \( K_{2} \) are the alternatives to \( H_{1} \) and \( H_{2} \), respectively. The overall Type I error rate for testing \( H_{1} \) and \( H_{2} \) under the first two configurations is ≤ α. That is, the tests for \( H_{1} \) control this error rate at level \( \alpha \) regardless of whether \( H_{2} \) is true or false. Therefore, we need to find \( z_{y} = d^{*} \) by solving for \( z_{y} \) in the following equation under \( K_{1} \cap H_{2} \).

$$ { \Pr }\{ X\left( {t_{1} } \right) > c_{1} \cap Y\left( {t_{1} } \right) > z_{y} \} + { \Pr }\{ X\left( {t_{1} } \right) \le c_{1} \cap X\left( {t_{2} } \right) > c_{2} \cap Y\left( {t_{2} } \right) > z_{y} \} =\upalpha. $$
(7.4.1)

Now, \( {\text{Cov}}\left\{ {X\left( {t_{1} } \right), X\left( {t_{2} } \right)} \right\} = \sqrt {t_{1} } \), \( {\text{Cov}}\left\{ {X\left( {t_{1} } \right), Y\left( {t_{2} } \right)} \right\} = \sqrt {t_{1} } \,\rho \), and \( {\text{Cov}}\left\{ {X\left( {t_{1} } \right), Y\left( {t_{1} } \right)} \right\} = {\text{Cov}}\left\{ {X\left( {t_{2} } \right), Y\left( {t_{2} } \right)} \right\} = \rho \). Also, \( {\text{E}}\left\{ {X\left( {t_{1} } \right)} \right\} = \theta \sqrt {t_{1} } \), \( E\left\{ {X\left( {t_{2} } \right)} \right\} = \theta \), and \( E\left\{ {Y\left( {t_{i} } \right)} \right\} = 0 \) for \( i = 1, 2 \), because \( K_{1} \cap H_{2} \) holds and \( \theta \) is the drift parameter for X. Further, one can show that, conditional on \( X\left( {t_{2} } \right) = x\left( {t_{2} } \right) \), the test statistics \( X\left( {t_{1} } \right) \) and \( Y\left( {t_{2} } \right) \) are independently normally distributed as:

\( X\left( {t_{1} } \right) \) is \( N\left\{ {x\left( {t_{2} } \right)\sqrt {t_{1} } , 1 - t_{1} } \right\} \) and \( Y\left( {t_{2} } \right) \) is \( N\left\{ {\left( {x\left( {t_{2} } \right) - \theta } \right)\rho , 1 - \rho^{2} } \right\} \)

Therefore, the Eq. (7.4.1) for finding \( z_{y} = d^{*} \) can be written as:

$$ \begin{aligned} \alpha & = 1 - \varPhi \left( {c_{1} - \theta \sqrt {t_{1} } } \right) - \varPhi \left( {z_{y} } \right) + \varPhi_{12} \left( {c_{1} - \theta \sqrt {t_{1} } , z_{y} ; \rho } \right) \\ & \quad + \int\limits_{{c_{2} - \theta }}^{\infty } {\varPhi \left( {\frac{{c_{1} - \theta \sqrt {t_{1} } - u\sqrt {t_{1} } }}{{\sqrt {1 - t_{1} } }}} \right)} \,\varPhi \left( {\frac{{u\rho - z_{y} }}{{\sqrt {1 - \rho^{2} } }}} \right)\phi \left( u \right)du, \\ \end{aligned} $$
(7.4.2)

where ϕ and Φ are the density and the cumulative distribution functions of the N(0,1) random variable, respectively, and \( \varPhi_{12} \) is the cumulative distribution function of the standard bivariate normal distribution with correlation coefficient ρ.

Therefore, specifying values of ρ, \( t_{1} , c_{1} \), and \( c_{2} \), one can construct a graph of \( z_{y} = f\left( \theta \right) \) over the interval θ > 0 satisfying Eq. (7.4.2). Figure 7.3 shows such graphs for different values of ρ when α = 0.025 (1-sided), \( t_{1} = 0.5 \), \( c_{1} = 2.95901 \), and \( c_{2} = 1.96869 \) on using the OF-like α-spending function. Constructing such a graph for a given ρ then gives \( d^{*} = z_{y} \) where the maximum occurs for that ρ. This selection of \( d^{*} \) assures that the right side of (7.4.2) is ≤ α for all θ > 0. Table 7.6, for the above values of α, \( t_{1} \), \( c_{1} \), and \( c_{2} \), gives \( d^{*} \) values and the corresponding \( \alpha_{d^{*}} \) values on the p-value scale for the values of ρ shown in column 1 of this table. The table also includes the values \( \theta^{*} \) at which the \( d^{*} \) values occur. The results show that if the test statistics for testing \( H_{1} \) and \( H_{2} \) are uncorrelated, then the test for \( H_{2} \) at a look, after the rejection of \( H_{1} \) at that look, can be at the full significance level α. However, if these test statistics are correlated, then the significance level for testing \( H_{2} \) is correlation dependent. For positive correlations, it decreases with increasing correlation and approaches the value given by the Pocock (1977) method.
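That the Pocock value \( d \) (an upper bound for \( d^{*} \)) keeps the overall Type I error rate for \( H_{2} \) at or below α can be spot-checked by simulation. The sketch below is our construction of the joint normals via independent increments; the seed, sample size, and the particular (ρ, θ) pair are arbitrary choices:

```python
import random
from math import sqrt

random.seed(7)

def reject_h2_rate(rho, theta, t1=0.5, c1=2.95901, c2=1.96869, d=2.17828, n=200_000):
    """Monte Carlo estimate of P(falsely reject H2) under K1 ∩ H2
    for the 2-stage design with d1 = d2 = d (the Pocock value)."""
    count = 0
    for _ in range(n):
        # Independent-increment construction giving Corr(X(t1), X(1)) = sqrt(t1),
        # Corr(X(t), Y(t)) = rho, and Corr(X(t1), Y(1)) = sqrt(t1) * rho.
        u1, u2 = random.gauss(0, 1), random.gauss(0, 1)
        w1 = rho * u1 + sqrt(1 - rho * rho) * random.gauss(0, 1)
        w2 = rho * u2 + sqrt(1 - rho * rho) * random.gauss(0, 1)
        x1 = u1 + theta * sqrt(t1)                       # X(t1), drift theta (K1)
        x2 = sqrt(t1) * u1 + sqrt(1 - t1) * u2 + theta   # X(1)
        y1 = w1                                          # Y(t1), H2 true (no drift)
        y2 = sqrt(t1) * w1 + sqrt(1 - t1) * w2           # Y(1)
        if (x1 > c1 and y1 > d) or (x1 <= c1 and x2 > c2 and y2 > d):
            count += 1
    return count / n

rate = reject_h2_rate(rho=0.5, theta=2.0)
print(rate)    # should not exceed 0.025 by more than Monte Carlo noise
```

Repeating this over a grid of ρ and θ values gives an empirical counterpart to Fig. 7.3, with the estimated error rate approaching α only as ρ nears one.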

Fig. 7.3

Graph of \( z_{y} = f\left( \theta \right) \) over the interval θ > 0 satisfying Eq. (7.4.2). In this graph, theta = θ and \( z = z_{y} \). The horizontal dashed line in the graph represents the Pocock boundary

Table 7.6 Values of d* for the 2-stage design for different correlations when α = 0.025 (1-sided), \( t_{1} = 0.5 \), and \( c_{1} = 2.95901 \) and \( c_{2} = 1.96869 \) on using the OF-like α-spending function

7.5 Concluding Remarks

Confirmatory clinical trials are the gold standard for establishing the efficacy of new treatments. However, such trials, when designed with a single primary endpoint, do not provide sufficient information when one must assess the effect of the new treatment on multiple distinct but important characteristics of the disease. For these situations, trials include multiple endpoints related to these disease characteristics and a statistical plan for testing multiple hypotheses on these endpoints for establishing efficacy findings of new treatments. However, testing multiple hypotheses in a trial can raise multiplicity issues, causing inflation of the Type I error rate. Fortunately, many novel statistical methods, such as gatekeeping and graphical methods, are now available in the literature for addressing the various multiplicity issues of clinical trials. These methods have advanced the role of statistics in designing modern clinical trials with multiple endpoints or multiple objectives.

In clinical trials with serious endpoints, such as death, a new treatment is often added to an existing therapy for detecting a relatively small but clinically relevant improvement in the treatment effect beyond what the existing therapy provides. Designing and conducting such trials for serious diseases can be complex, as these trials may require thousands of patients and several years to complete. Ethical and economic reasons may necessitate that these trials be designed with interim looks for assessing the effect of the treatment at an earlier time point, allowing the possibility of stopping the trial early when it becomes clear that the study treatment has the desired efficacy or that it is futile to continue the trial further. Such trials, which allow analyses of the accumulated data at interim looks for the possibility of stopping the trial early for efficacy or futility reasons, are commonly known as group sequential trials.

Obviously, interim analyses of the data in a group sequential trial amount to repeated testing of one or more hypotheses and would result in Type I error rate inflation, so multiplicity adjustment is required for drawing valid inferences. As mentioned in this chapter, several approaches have been cited in the literature for controlling the Type I error rate for repeated tests of a single hypothesis related to a single primary endpoint of the trial. However, approaches for addressing the multiplicity issues in testing multiple hypotheses related to multiple endpoints of group sequential trials are less frequent in the literature.

This chapter, in addition to providing a brief review of procedures, and citing key references, for the repeated testing of a single endpoint hypothesis in group sequential trials, considers procedures for handling multiplicity issues in the repeated testing of multiple endpoint hypotheses. In this regard, we distinguish two cases of multiple endpoints which guide the approach for handling the multiplicity issue. The first case arises when, after a hypothesis is rejected at an interim look, the trial can continue to test other hypotheses at subsequent looks for additional claims. One testing approach is to use the Bonferroni inequality, which requires splitting the significance level either among the endpoints or among the different looks. This approach is now rarely used because of the low power of the resulting tests.

A better approach (discussed in Sect. 7.3.2) is to use closed testing with weighted Bonferroni tests of the intersection hypotheses, where the weights satisfy the consonance property. This approach allows recycling of the significance level of a rejected hypothesis to the other hypotheses, thus increasing the power of the test procedure. However, as discussed, the recycling of the significance level from a rejected hypothesis to other hypotheses occurs through an α-spending function and is not as simple as with non-group sequential trials.

The closed-testing-based approach can be manageable when testing two or three hypotheses, but it may be difficult to set up for more than three, for example, when testing two primary and two secondary hypotheses in a trial, as selecting weights for the weighted Bonferroni tests that satisfy the consonance property can be complicated. For these advanced cases, a graphical approach is recommended, which is easier to plan, to use, and to communicate to non-statisticians. This chapter illustrates the application of these two approaches through illustrative examples, showing details of the derivations of the significance levels.

The second case (discussed in Sect. 7.4) arises, for example, in a group sequential trial designed for testing a primary and a secondary endpoint hypothesis, where the trial stops at an interim look for ethical reasons when the primary hypothesis is rejected at that look in favor of the study treatment. The issue then arises as to what the significance level should be for testing the secondary hypothesis at that look, given that the secondary hypothesis is tested only after the primary one is rejected. This issue has been investigated in the literature in detail, but we have revisited it to increase its awareness, as group sequential trials are frequently designed with a single primary hypothesis and multiple secondary hypotheses. A natural way to address this problem is to use the graphical procedure and recycle the significance level of the rejected primary hypothesis to the secondary hypotheses using a Pocock-like α-spending function.

Glimm et al. (2010) showed that using a Pocock-like group sequential test for the secondary hypotheses has a power advantage over the O’Brien-Fleming boundary. Other approaches that use correlation information between the test statistics can also be applied in simple cases, for example, when testing a single primary and a single secondary hypothesis.

Power considerations in designing GS trials that test multiple hypotheses are also important. However, this topic is beyond the scope of this chapter. The power issues would generally be similar to those for testing multiple hypotheses in a non-GS trial.