The size of the study should be considered early in the planning phase. In some instances, no formal sample size is ever calculated. Instead, the number of participants available to the investigators during some period of time determines the size of the study. Many clinical trials that do not carefully consider the sample size requirements turn out to lack the statistical power or ability to detect intervention effects of a magnitude that has clinical importance. In 1978, Freiman and colleagues [1] reviewed the power of 71 published randomized controlled clinical trials which failed to find significant differences between groups. “Sixty-seven of the trials had a greater than 10% risk of missing a true 25% therapeutic improvement, and with the same risk, 50 of the trials could have missed a 50% improvement.” The situation was not much improved in 1994, when a similar survey found only 16% of negative trials had 80% power for a 25% effect, and only 36% for a 50% effect [2]. In other instances, the sample size estimation may assume an unrealistically large intervention effect. Thus, the power for more realistic intervention effects will be low or less than desired. The danger in studies with low statistical power is that interventions that could be beneficial are discarded without adequate testing and may never be considered again. Certainly, many studies do contain appropriate sample size estimates, but in spite of many years of critical review many are still too small [3, 4].

This chapter presents an overview of sample size estimation with some details. Several general discussions of sample size can be found elsewhere [5–21]. For example, Lachin [11] and Donner [9] have each written a more technical discussion of this topic. For most of the chapter, the focus is on sample size where the study is randomizing individuals. In some sections, the concept of sample size for randomizing clusters of individuals or organs within individuals is presented.

Fundamental Point

Clinical trials should have sufficient statistical power to detect differences between groups considered to be of clinical importance. Therefore, calculation of sample size with provision for adequate levels of significance and power is an essential part of planning.

Before a discussion of sample size and power calculations, it must be emphasized that, for several reasons, a sample size calculation provides only an estimate of the needed size of a trial [6]. First, parameters used in the calculation are estimates, and as such, have an element of uncertainty. Often these estimates are based on small studies. Second, the estimate of the relative effectiveness of the intervention over the control and other estimates may be based on a population different from that intended to be studied. Third, the effectiveness is often overestimated since published pilot studies may be highly selected and researchers are often too optimistic. Fourth, during the final planning stage of a trial, revisions of inclusion and exclusion criteria may influence the types of participants entering the trial and thus alter earlier assumptions used in the sample size calculation. Assessing the impact of such changes in criteria and the screening effect is usually quite difficult. Fifth, trial experience indicates that participants enrolled into control groups usually do better than the population from which the participants were drawn. The reasons are not entirely clear. One factor could be that participants with the highest risk of developing the outcome of interest are excluded in the screening process. In trials involving chronic diseases, because of the research protocol, participants might receive more care and attention than they would normally be given, or change their behavior because they are part of a study, thus improving their prognosis, a phenomenon sometimes called the Hawthorne or trial effect [22]. Also, secular trends toward improved care may result in risk estimates from past studies being higher than what will be found in current patient populations [23]. Participants assigned to the control group may, therefore, be better off than if they had not been in the trial at all. Finally, sample size calculations are based on mathematical models that may only approximate the true, but unknown, distribution of the response variables.

Due to the approximate nature of sample size calculations, the investigator should be as conservative as can be justified while still being realistic in estimating the parameters used in the calculation. If a sample size is drastically overestimated, the trial may be judged as unfeasible. If the sample size is underestimated, there is a good chance the trial will fall short of demonstrating any differences between study groups or be faced with the need to justify an increase in sample size or an extension of follow-up [24–26]. In general, as long as the calculated sample size is realistically obtainable, it is better to overestimate the size and possibly terminate the trial early (Chap. 16) than to modify the design of an ongoing trial, or worse, to arrive at incorrect conclusions.

Statistical Concepts

An understanding of the basic statistical concepts of hypothesis testing, significance level, and power is essential for a discussion of sample size estimation. A brief review of these concepts follows. Further discussion can be found in many basic medical statistics textbooks [27–37] as well as textbooks on sample size [17–21]. Those with no prior exposure to these basic statistical concepts might find these resources helpful.

Except where indicated, trials of one intervention group and one control group will be discussed. With some adjustments, sample size calculations can be made for studies with more than two groups [8]. For example, in the Coronary Drug Project (CDP), five active intervention arms were each compared against one control arm [38]. The Antihypertensive and Lipid-Lowering Treatment to Prevent Heart Attack trial (ALLHAT) compared four active treatment arms, three newer drugs against an older one, as first-line therapy for hypertension [39]. Both trials used the method of Dunnett [40], where the number of participants in the control group is equal to the number assigned to each of the active intervention groups times the square root of the number of active groups. The optimal size of the control arm in the CDP was determined to be 2.24 times the size of each individual active intervention arm [38]. In fact, the CDP used a factor of 2.5 in order to minimize variance. Another approach is to use the Bonferroni adjustment to the alpha level [41]; that is, divide the overall alpha level by the number of comparisons, and use that revised alpha level in the sample size calculation.
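To make these allocation rules concrete, the short sketch below (a minimal illustration with our own function names, not code from the CDP or ALLHAT) computes the square-root allocation ratio and a Bonferroni-adjusted significance level:

```python
import math

def dunnett_control_ratio(n_active_arms: int) -> float:
    """Ratio of control-arm size to each active-arm size under the square-root rule."""
    return math.sqrt(n_active_arms)

def bonferroni_alpha(overall_alpha: float, n_comparisons: int) -> float:
    """Per-comparison significance level after a Bonferroni adjustment."""
    return overall_alpha / n_comparisons

# With five active arms, the control arm is about 2.24 times each active arm.
print(round(dunnett_control_ratio(5), 2))  # 2.24
# An overall alpha of 0.05 split across five comparisons.
print(bonferroni_alpha(0.05, 5))           # 0.01
```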

Before computing sample size, the primary response variable used to judge the effectiveness of intervention must be identified (see Chap. 3). This chapter will consider sample size estimation for three basic kinds of outcomes: (1) dichotomous response variables, such as success and failure, (2) continuous response variables, such as blood pressure level or a change in blood pressure, and (3) time to failure (or occurrence of a clinical event).

For the dichotomous response variables, the event rates in the intervention group (pI) and the control group (pC) are compared. For continuous response variables, the true, but unknown, mean level in the intervention group (μI) is compared with the mean level in the control group (μC). For survival data, a hazard rate, λ, is often compared for the two study groups or at least is used for sample size estimation. Sample size estimates for response variables which do not exactly fall into any of the three categories can usually be approximated by one of them.

In terms of the primary response variable, pI will be compared with pC or μI will be compared with μC. This discussion will use only the event rates, pI, and pC, although the same concepts will hold if response levels μI and μC are substituted appropriately. Of course, the investigator does not know the true values of the event rates. The clinical trial will give him only estimates of the event rates, \( \widehat{p_{I\ }}\ \mathrm{and}\ \widehat{p_C} \). Typically, an investigator tests whether or not a true difference exists between the event rates of participants in the two groups. The traditional way of indicating this is in terms of a null hypothesis, denoted H0, which states that no difference between the true event rates exists (H0: pC − pI = 0). The goal is to test H0 and decide whether or not to reject it. That is, the null hypothesis is assumed to be true until proven otherwise.

Because only estimates of the true event rates are obtained, it is possible that, even if the null hypothesis is true (pC − pI = 0), the observed event rates might by chance be different. If the observed differences in event rates are large enough by chance alone, the investigator might reject the null hypothesis incorrectly. This false positive finding, or Type I error, should be made as few times as possible. The probability of this Type I error is called the significance level and is denoted by α. The probability of observing differences as large as, or larger than the difference actually observed given that H0 is true is called the “p-value,” denoted as p. The decision will be to reject H0 if p ≤ α. While the chosen level of α is somewhat arbitrary, the ones used and accepted traditionally are 0.01, 0.025 or, most commonly, 0.05. As will be shown later, as α is set smaller, the required sample size estimate increases.

If the null hypothesis is not in fact true, then another hypothesis, called the alternative hypothesis, denoted by HA, must be true. That is, the true difference between the event rates pC and pI is some value δ where δ ≠ 0. The observed difference \( \widehat{p_{\mathrm{C}\ }}-\widehat{p_{\mathrm{I}}} \) can be quite small by chance alone even if the alternative hypothesis is true. Therefore, the investigator could, on the basis of small observed differences, fail to reject H0 even when it is not true. This is called a Type II error, or a false negative result. The probability of a Type II error is denoted by β. The value of β depends on the specific value of δ, the true but unknown difference in event rates between the two groups, as well as on the sample size and α. The probability of correctly rejecting H0 is denoted 1 − β and is called the power of the study. Power quantifies the potential of the study to find true differences of various values δ. Since β is a function of α, the sample size and δ, 1 − β is also a function of these parameters. The plot of 1 − β versus δ for a given sample size is called the power curve and is depicted in Fig. 8.1. On the horizontal axis, values of δ are plotted from 0 to an upper value, δA (0.25 in this figure). On the vertical axis, the probability or power of detecting a true difference δ is shown for a given significance level and sample size. In constructing this specific power curve, a sample size of 100 in each group, a one-sided significance level of 0.05 and a control group event rate of 0.5 (50%) were assumed. Note that as δ increases, the power to detect δ also increases. For example, if δ = 0.10 the power is approximately 0.40. When δ = 0.20 the power increases to about 0.90. Typically, investigators like to have a power (1 − β) of at least 0.80, but often around 0.90 or 0.95 when planning a study; that is to have an 80%, 90% or 95% chance of finding a statistically significant difference between the event rates, given that a difference, δ, actually exists.

Fig. 8.1

A power curve for increasing differences (δ) between the control group rate of 0.5 and the intervention group rate with a one-sided significance level of 0.05 and a total sample size (2N) of 200

Since the significance level α should be small, say 0.05 or 0.01, and the power (1 − β) should be large, say 0.90 or 0.95, the only quantities which are left to vary are δ, the size of the difference being tested for, and the total sample size. In planning a clinical trial, the investigator hopes to detect a difference of specified magnitude δ or larger. One factor that enters into the selection of δ is the minimum difference between groups judged to be clinically important. In addition, previous research may provide estimates of δ. This is part of the question being tested as discussed in Chap. 3. The exact nature of the calculation of the sample size, given α, 1 − β and δ is considered here. It can be assumed that the randomization strategy will allocate an equal number (N) of participants to each group, since the variability in the responses for the two groups is approximately the same; equal allocation provides a slightly more powerful design than unequal allocation. For unequal allocation to yield an appreciable increase in power, the variability needs to be substantially different in the groups [42]. Since equal allocation is usually easier to implement, it is the more frequently used strategy and will be assumed here for simplicity.

Before a sample size can be calculated, classical statistical theory says that the investigator must decide whether he is interested in differences in one direction only (one-sided test)—say improvements in intervention over control—or in differences in either direction (two-sided test). This latter case would represent testing the hypothesis that the new intervention is either better or worse than the control. In general, two-sided tests should be used unless there is a very strong justification for expecting a difference in only one direction. An investigator should always keep in mind that any new intervention could be harmful as well as helpful. However, as discussed in Chap. 16, some investigators may not be willing to prove the intervention harmful and would terminate a study if the results are suggestive of harm. A classic example of this issue was provided by the Cardiac Arrhythmia Suppression Trial or CAST [43]. This trial was initially designed as a one-sided, 0.025 significance level hypothesis test that anti-arrhythmic drug therapy would reduce the incidence of sudden cardiac death. Since the drugs were already marketed, harmful effects were not expected. Despite the one-sided hypothesis in the design, the monitoring process used a two-sided, 0.05 significance level approach. In this respect, the level of evidence for benefit was the same for either the one-sided 0.025 or two-sided 0.05 significance level design. As it turned out, the trial was terminated early due to increased mortality in the intervention group (see Chaps. 16 and 17).

If a one-sided test of hypothesis is chosen, in most circumstances the significance level ought to be half what the investigator would use for a two-sided test. For example, if 0.05 is the two-sided significance level typically used, 0.025 would be used for the one-sided test. As done in the CAST trial, this requires the same degree of evidence or scientific documentation to declare a treatment effective, regardless of the one-sided vs. two-sided question. In this circumstance, a test for negative or harmful effects might also be done at the 0.025 level. This in effect, provides two one-sided 0.025 hypothesis tests for an overall 0.05 significance level.

As mentioned above, the total sample size 2N (N per arm) is a function of the significance level (α), the power (1 − β) and the size of the difference in response (δ) which is to be detected. Changing either α, 1 − β or δ will result in a change in 2N. The smaller the magnitude of the difference δ, the larger the sample size must be to guarantee a high probability of finding that difference. If the calculated sample size is larger than can be realistically obtained, then one or more of the parameters in the design may need to be reconsidered. Since the significance level is usually fixed at 0.05, 0.025, or 0.01, the investigator should generally reconsider the value selected for δ and increase it, or keep δ the same and settle for a less powerful study. If neither of these alternatives is satisfactory, serious consideration should be given to abandoning the trial.

Rothman [44] argued that journals should encourage using confidence intervals to report clinical trial results instead of significance levels. Several researchers [44–46] discuss sample size formulas from this approach. Confidence intervals are constructed by computing the observed difference in event rates and then adding and subtracting a constant times the standard error of the difference. This provides an interval surrounding the observed estimated difference obtained from the trial. The constant is determined so as to give the confidence interval the correct probability of including the true, but unknown difference. This constant is related directly to the critical value used to evaluate test statistics. Trials often use a two-sided α level test (e.g., α = 0.05) and a corresponding (1 − α) confidence interval (e.g., 95%). If the 1 − α confidence interval excludes zero or no difference, we would conclude that the intervention has an effect. If the interval contains zero difference, no intervention effect would be claimed. However, differences of importance could exist, but might not be detected or not be statistically significant because the sample size was too small. For testing the null hypothesis of no treatment effect, hypothesis testing and confidence intervals give the same conclusions. However, confidence intervals provide more information on the range of the likely difference that might exist. For sample size calculations, the desired confidence interval width must be specified. This may be determined, for example, by the smallest difference between two event rates that would be clinically meaningful and important. Under the null hypothesis of no treatment effect, half the desired interval width is equal to the difference specified in the alternative hypothesis. The sample size calculation methods presented here do not preclude the presentation of results as confidence intervals and, in fact, investigators ought to do so. However, unless there is an awareness of the relationship between the two approaches, as McHugh and Le [46] have pointed out, the confidence interval method might yield a power of only 50% to detect a specified difference. This can be seen later, when sample size calculations for comparing proportions are presented. Thus, some care needs to be taken in using this method.

So far, it has been assumed that the data will be analyzed only once at the end of the trial. However, as discussed in Chaps. 16 and 17, the response variable data may be reviewed periodically during the course of a study. Thus, the probability of finding significant differences by chance alone is increased [47]. This means that the significance level α may need to be adjusted to compensate for the increase in the probability of a Type I error. For purposes of this discussion, we assume that α carries the usual values of 0.05, 0.025 or 0.01. The sample size calculation should also employ the statistic which will be used in data analysis. Thus, there are many sample size formulations. Methods that have proven useful will be discussed in the rest of this chapter.

Dichotomous Response Variables

We shall consider two cases for response variables which are dichotomous, that is, yes or no, success or failure, presence or absence. The first case assumes two independent groups or samples [48–59]. The second case is for dichotomous responses within an individual, or paired responses [60–64].

Two Independent Samples

Suppose the primary response variable is the occurrence of an event over some fixed period of time. The sample size calculation should be based on the specific test statistic that will be employed to compare the outcomes. The null hypothesis H0 (pC − pI = 0) is compared to an alternative hypothesis HA (pC − pI ≠ 0). The estimates of pI and pC are \( \widehat{p_{\mathrm{I}}} \) and \( \widehat{p_{\mathrm{C}}} \), where \( \widehat{p_{\mathrm{I}}}={r}_{\mathrm{I}}/{N}_{\mathrm{I}} \) and \( \widehat{p_{\mathrm{C}}}={r}_{\mathrm{C}}/{N}_{\mathrm{C}} \) with rI and rC being the number of events in the intervention and control groups and NI and NC being the number of participants in each group. The usual test statistic for comparing such dichotomous or binomial responses is

$$ Z=\left(\widehat{p_{\mathrm{C}}}-\widehat{p_{\mathrm{I}}}\right)/\sqrt{\overline{p}\left(1-\overline{p}\right)\;\left(1/{N}_{\mathrm{C}}+1/{N}_{\mathrm{I}}\right)} $$

where \( \overline{p}=\left({r}_{\mathrm{I}}+{r}_{\mathrm{C}}\right)/\left({N}_{\mathrm{I}}+{N}_{\mathrm{C}}\right) \). The square of the Z statistic is algebraically equivalent to the chi-square statistic, which is often employed as well. For large values of NI and NC, the statistic Z has approximately a normal distribution with mean 0 and variance 1. If the test statistic Z is larger in absolute value than a constant Zα, the investigator will reject H0 in the two-sided test.

The constant Zα is often referred to as the critical value. The probability of a standard normal random variable being larger in absolute value than Zα is α. For a one-sided hypothesis, the constant Zα is chosen such that the probability that Z is greater (or less) than Zα is α. For a given α, Zα is larger for a two-sided test than for a one-sided test (Table 8.1). Zα for a two-sided test with α = 0.10 has the same value as Zα for a one-sided test with α = 0.05. While a smaller sample size can be achieved with a one-sided test compared to a two-sided test at the same α level, we in general do not recommend this approach as discussed earlier.

Table 8.1 Zα for sample size formulas for various values of α

The sample size required for the design to have a significance level α and a power of 1 − β to detect true differences of at least δ between the event rates pI and pC can be expressed by the formula [11]:

$$ 2N=2{\left\{{Z}_{\upalpha}\sqrt{2\overline{p}\left(1-\overline{p}\right)}+{Z}_{\upbeta}\sqrt{{p}_{\mathrm{C}}\left(1-{p}_{\mathrm{C}}\right)+{p}_{\mathrm{I}}\left(1-{p}_{\mathrm{I}}\right)}\right\}}^2/{\left({p}_{\mathrm{C}}-{p}_{\mathrm{I}}\right)}^2 $$

where 2N = total sample size (N participants/group) with \( \overline{p}=\left({p}_{\mathrm{C}}+{p}_{\mathrm{I}}\right)/2 \); Zα is the critical value which corresponds to the significance level α; and Zβ is the value of the standard normal distribution exceeded with probability β. Zβ corresponds to the power 1 − β (e.g., if 1 − β = 0.90, Zβ = 1.282). Values of Zα and Zβ are given in Tables 8.1 and 8.2 for several values of α and 1 − β. More complete tables may be found in most introductory textbooks [27–29, 31, 33–37, 51], sample size texts [17–21, 65], or by using software packages and online resources [66–73]. Note that the definition of \( \overline{p} \) given earlier is equivalent to the definition of \( \overline{p} \) given here when NI = NC; that is, when the two study groups are of equal size. An alternative to the above formula is given by

Table 8.2 Zβ for sample size formulas for various values of power (1 − β)
$$ 2N=4{\left({Z}_{\upalpha}+{Z}_{\upbeta}\right)}^2\overline{p}\left(1-\overline{p}\right)/{\left({p}_{\mathrm{C}}-{p}_{\mathrm{I}}\right)}^2 $$

These two formulas give approximately the same answer and either may be used for the typical clinical trial.

Example: Suppose the annual event rate in the control group is anticipated to be 20%. The investigator hopes that the intervention will reduce the rate to 15%. The study is planned so that each participant will be followed for 2 years. Therefore, if the assumptions are accurate, approximately 40% of the participants in the control group and 30% of the participants in the intervention group will develop an event. Thus, the investigator sets pC = 0.40, pI = 0.30, and, therefore, \( \overline{p}=\left(0.4+0.3\right)/2=0.35 \). The study is designed as two-sided with a 5% significance level and 90% power. From Tables 8.1 and 8.2, Zα is 1.96 for the two-sided 0.05 significance level and Zβ is 1.282 for 90% power. Substituting these values into the right-hand side of the first sample size formula yields 2N to be

$$ 2{\left\{1.96\sqrt{2(0.35)\;(0.65)}+1.282\sqrt{0.4(0.6)+0.3(0.7)}\right\}}^2/{\left(0.4-0.3\right)}^2 $$

Evaluating this expression, 2N equals 952.3. Using the second formula, 2N is 4(1.96 + 1.282)²(0.35)(0.65)/(0.4 − 0.3)² or 2N = 956. Therefore, after rounding up to the nearest ten, the calculated total sample size by either formula is 960, or 480 in each group.
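These calculations are easy to script. The sketch below (illustrative only; the function names are ours, not from a standard package) evaluates both formulas with exact standard normal quantiles from the Python standard library in place of Tables 8.1 and 8.2:

```python
import math
from statistics import NormalDist

def z_upper(p: float) -> float:
    """Standard normal value exceeded with probability p."""
    return NormalDist().inv_cdf(1 - p)

def total_n_first_formula(p_c, p_i, alpha=0.05, power=0.90, two_sided=True):
    """2N from the first formula above."""
    z_a = z_upper(alpha / 2 if two_sided else alpha)
    z_b = z_upper(1 - power)
    p_bar = (p_c + p_i) / 2
    term = (z_a * math.sqrt(2 * p_bar * (1 - p_bar))
            + z_b * math.sqrt(p_c * (1 - p_c) + p_i * (1 - p_i)))
    return 2 * term ** 2 / (p_c - p_i) ** 2

def total_n_second_formula(p_c, p_i, alpha=0.05, power=0.90, two_sided=True):
    """2N from the simpler alternative formula."""
    z_a = z_upper(alpha / 2 if two_sided else alpha)
    z_b = z_upper(1 - power)
    p_bar = (p_c + p_i) / 2
    return 4 * (z_a + z_b) ** 2 * p_bar * (1 - p_bar) / (p_c - p_i) ** 2

print(total_n_first_formula(0.40, 0.30))   # ~952, rounded up to 960 in total
print(total_n_second_formula(0.40, 0.30))  # ~956
```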

Sample size estimates using the first formula are given in Table 8.3 for a variety of values of pI and pC, for two-sided tests, and for α = 0.01, 0.025 and 0.05 and 1 − β = 0.80 or 0.90. For the example just considered with α = 0.05 (two-sided), 1 − β = 0.90, pC = 0.4 and pI = 0.3, the total sample size using Table 8.3 is 960. This table shows that, as the difference in rates between groups increases, the sample size decreases.

Table 8.3 Sample size

The event rate in the intervention group can be written as pI = (1 − k) pC where k represents the proportion that the control group event rate is expected to be reduced by the intervention. Figure 8.2 shows the total sample size 2N versus k for several values of pC using a two-sided test with α = 0.05 and 1 − β = 0.90. In the example where pC = 0.4 and pI = 0.3, the intervention is expected to reduce the control rate by 25% or k = 0.25. In Fig. 8.2, locate k = 0.25 on the horizontal axis and move up vertically until the curve labeled pC = 0.4 is located. The point on this curve corresponds to a 2N of approximately 960. Notice that as the control group event rate pC decreases, the sample size required to detect the same proportional reduction increases. Trials with small event rates (e.g., pC = 0.1) require large sample sizes unless the interventions have a dramatic effect.

Fig. 8.2

Relationship between total sample size (2N) and reduction in event rate (k) for several control group event rates (pC), with a two-sided significance level of 0.05 and power of 0.90

In order to make use of the sample size formula or table, it is necessary to know something about pC and k. The estimate for pC is usually obtained from previous studies of similar people. In addition, the investigator must choose k based on preliminary evidence of the potential effectiveness of the intervention or be willing to specify some minimum difference or reduction that he wants to detect. Obtaining this information is difficult in many cases. Frequently, estimates may be based on a small amount of data. In such cases, several sample size calculations based on a range of estimates help to assess how sensitive the sample size is to the uncertain estimates of pC, k, or both. The investigator may want to be conservative and take the largest, or nearly largest, estimate of sample size to be sure his study has sufficient power. The power (1 − β) for various values of δ can be compared for a given sample size 2N, significance level α, and control rate pC. By examining a power curve such as in Fig. 8.1, it can be seen what power the trial has for detecting various differences in rates, δ. If the power is high, say 0.80 or larger, for the range of values δ that are of interest, the sample size is probably adequate. The power curve can be especially helpful if the number of available participants is relatively fixed and the investigator wants to assess the probability that the trial can detect any of a variety of reductions in event rates.

Investigators often overestimate the number of eligible participants who can be enrolled in a trial. The actual number enrolled may fall short of the goal. To examine the effects of smaller sample sizes on the power of the trial, the investigator may find it useful to graph power as a function of various sample sizes. If the power falls far below 0.8 for a sample size that is very likely to be obtained, he can expand the recruitment effort, hope for a larger intervention effect than was originally assumed, accept the reduced power and its consequences, or abandon the trial.

To determine the power, the first sample size equation in this section is solved for Zβ:

$$ {Z}_{\upbeta}=\left\{-{Z}_{\upalpha}\sqrt{2\overline{p}\left(1-\overline{p}\right)}+\sqrt{N}\left({p}_{\mathrm{C}}-{p}_{\mathrm{I}}\right)\right\}/\sqrt{p_{\mathrm{C}}\left(1-{p}_{\mathrm{C}}\right)+{p}_{\mathrm{I}}\left(1-{p}_{\mathrm{I}}\right)} $$

where \( \overline{p} \) as before is (pC + pI)/2. The term Zβ can be translated into a power of 1 − β by use of Table 8.2. For example, let pC = 0.4 and pI = 0.3. For a significance level of 0.05 in a two-sided test of hypothesis, Zα is 1.96. In a previous example, it was shown that a total sample of approximately 960 participants or 480 per group is necessary to achieve a power of 0.90. Substituting Zα = 1.96, N = 480, pC = 0.4 and pI = 0.3, a value for Zβ = 1.295 is obtained. The closest value of Zβ in Table 8.2 is 1.282 which corresponds to a power of 0.90. (If the exact value of N = 476 were used, the value of Zβ would be 1.282.) Suppose an investigator thought he could get only 350 participants per group instead of the estimated 480. Then Zβ = 0.818 which means that the power 1 − β is somewhat less than 0.80. If the value of Zβ is negative, the power is less than 0.50. For more details of power calculations, a standard text in biostatistics [27–29, 31, 33–37, 51] or sample size [17–21, 65] should be consulted.
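A minimal sketch of this power calculation (a hypothetical helper implementing the formula above) reproduces the Zβ values quoted in this paragraph:

```python
import math
from statistics import NormalDist

def power_two_proportions(n_per_group, p_c, p_i, alpha=0.05, two_sided=True):
    """Return (Z_beta, power) for comparing two proportions with N per group."""
    norm = NormalDist()
    z_a = norm.inv_cdf(1 - (alpha / 2 if two_sided else alpha))
    p_bar = (p_c + p_i) / 2
    z_b = (-z_a * math.sqrt(2 * p_bar * (1 - p_bar))
           + math.sqrt(n_per_group) * (p_c - p_i)) \
        / math.sqrt(p_c * (1 - p_c) + p_i * (1 - p_i))
    return z_b, norm.cdf(z_b)

print(power_two_proportions(480, 0.40, 0.30))  # Z_beta ~ 1.295, power ~ 0.90
print(power_two_proportions(350, 0.40, 0.30))  # Z_beta ~ 0.818, power just below 0.80
```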

For a given 2N, α, 1 − β, and pC the reduction in event rate that can be detected can also be calculated. This function is nonlinear and, therefore, the details will not be presented here. Approximate results can be obtained by scanning Table 8.3, by using the calculations for several pI until the sample size approaches the planned number, or by using a figure where sample sizes have been plotted. In Fig. 8.2, α is 0.05 and 1 − β is 0.90. If the sample size is selected as 1000, with pC = 0.4, k is determined to be about 0.25. This means that the expected pI would be 0.3. As can be seen in Table 8.3, the actual sample size for these assumptions is 960.

The above approach yields an estimate which is more accurate as the sample size increases. Modifications [49, 51–55, 58, 59, 74] have been developed which give some improvement in accuracy to the approximate formula presented for small studies. However, the availability of computer software to perform exact computations [66–73] has reduced the need for good small sample approximations. Also, given that sample size estimation is somewhat imprecise due to assumptions of intervention effects and event rates, the formulation presented is probably adequate for most clinical trials.

Designing a trial comparing proportions using the confidence interval approach, we would need to make a series of assumptions as well [6, 42, 52]. A 100(1 − α)% confidence interval for a treatment comparison θ would be of the general form \( \widehat{\theta}\pm {Z}_{\upalpha}\mathrm{S}\mathrm{E}\left(\widehat{\theta}\right) \), where \( \widehat{\theta} \) is the estimate for θ and \( \mathrm{S}\mathrm{E}\left(\widehat{\theta}\right) \) is the standard error of \( \widehat{\theta} \). In this case, the specific form would be:

$$ \left(\widehat{p_{\mathrm{C}}}-\widehat{p_{\mathrm{I}}}\right)\pm {Z}_{\upalpha}\sqrt{\overline{p}\left(1-\overline{p}\right)\;\left(1/{N}_{\mathrm{I}}+1/{N}_{\mathrm{C}}\right)} $$

If we want the width of the confidence interval (CI) not to exceed WCI, where WCI is the difference between the upper confidence limit and the lower confidence limit, then if N = NI = NC, the width WCI can be expressed simply as:

$$ {W}_{\mathrm{CI}}=2{Z}_{\upalpha}\sqrt{\overline{p}\left(1-\overline{p}\right)\;\left(2/N\right)} $$

or after solving this equation for N,

$$ N=8\;{Z}_{\upalpha}^2\overline{p}\left(1-\overline{p}\right)/{\left({W}_{\mathrm{CI}}\right)}^2 $$

Thus, if α is 0.05 for a 95% confidence interval, pC = 0.4 and pI = 0.3, so that \( \overline{p} \) = 0.35, then N = 8(1.96)²(0.35)(0.65)/(WCI)². If we desire the upper limit of the confidence interval to be not more than 0.10 from the estimate or the width to be twice that, then WCI = 0.20 and N = 175 or 2N = 350. Notice that even though we are essentially looking for differences in pC − pI to be the same as our previous calculation, the sample size is smaller. If we let pC − pI = WCI/2 and substitute this into the previous sample size formula, we obtain

$$ \begin{array}{c}\hfill 2N=4{\left\{{Z}_{\upalpha}+{Z}_{\upbeta}\right\}}^2\overline{p}\left(1-\overline{p}\right)/{\left({W}_{\mathrm{CI}}/2\right)}^2\hfill \\ {}\hfill =16{\left\{{Z}_{\upalpha}+{Z}_{\upbeta}\right\}}^2\overline{p}\left(1-\overline{p}\right)/{\left({W}_{\mathrm{CI}}\right)}^2\hfill \end{array} $$

This formula is very close to the confidence interval formula for two proportions. If we select 50% power, β is 0.50 and Zβ is 0 which would yield the confidence interval formula. Thus, a confidence interval approach gives 50% power to detect differences of WCI/2. This may not be adequate, depending on the situation. In general, we prefer to specify greater power (e.g., 80–90%) and use the previous approach.
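The confidence-interval-width calculation can be sketched the same way (again a hypothetical helper, not a standard routine); it reproduces the N of about 175 per group used above:

```python
from statistics import NormalDist

def n_per_group_for_ci_width(p_c, p_i, width, alpha=0.05):
    """N per group so the (1 - alpha) CI for p_c - p_i has the specified width."""
    z_a = NormalDist().inv_cdf(1 - alpha / 2)
    p_bar = (p_c + p_i) / 2
    return 8 * z_a ** 2 * p_bar * (1 - p_bar) / width ** 2

# A total width of 0.20, i.e., +/- 0.10 around the observed difference:
print(n_per_group_for_ci_width(0.40, 0.30, 0.20))  # ~175 per group, 2N ~ 350
```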

Analogous sample size estimation using the confidence interval approach may be used for comparing means, hazard rates, or regression slopes. We do not present details of these since we prefer to use designs which yield power greater than that obtained from a confidence interval approach.

Paired Dichotomous Response

For designing a trial where the paired outcomes are binary, the sample size estimate is based on McNemar’s test [60–64]. We want to compare the frequency of success within an individual on intervention with the frequency of success on control (i.e., pI − pC). McNemar’s test bases this comparison on the discordant responses, that is, pairs in which the response on intervention differs from the response on control.

In this case, the number of paired observations, Np, may be estimated by:

$$ {N}_{\mathrm{p}}={\left[{Z}_{\upalpha}\sqrt{\mathrm{f}}+{Z}_{\upbeta}\sqrt{f-{d}^2}\right]}^2/{d}^2 $$

where d = difference in the proportion of successes (d = pI − pC) and f is the proportion of participants whose response is discordant. An alternative approximate formula for Np is

$$ {N}_{\mathrm{p}}={\left({Z}_{\upalpha}+{Z}_{\upbeta}\right)}^2f/{d}^2 $$

Example: Consider an eye study where one eye is treated for loss in visual acuity by a new laser procedure and the other eye is treated by standard therapy. The failure rate on the control, pC, is estimated to be 0.40 and the new procedure is projected to reduce the failure rate to 0.20. The discordant rate f is assumed to be 0.50. Using the latter sample size formula for a two-sided 5% significance level and 90% power, the number of pairs Np is estimated as 132. If the discordant rate is 0.8, then 210 pairs of eyes will be needed.
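The paired calculation can be sketched similarly (a hypothetical helper; d and f are as defined above):

```python
from statistics import NormalDist

def n_pairs_mcnemar(d, f, alpha=0.05, power=0.90, two_sided=True):
    """Approximate number of pairs: (Z_alpha + Z_beta)^2 f / d^2."""
    norm = NormalDist()
    z_a = norm.inv_cdf(1 - (alpha / 2 if two_sided else alpha))
    z_b = norm.inv_cdf(power)
    return (z_a + z_b) ** 2 * f / d ** 2

print(n_pairs_mcnemar(d=0.20, f=0.5))  # ~132 pairs
print(n_pairs_mcnemar(d=0.20, f=0.8))  # ~210 pairs
```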

Adjusting Sample Size to Compensate for Nonadherence

During the course of a clinical trial, participants will not always adhere to their prescribed intervention schedule. The reason is often that the participant cannot tolerate the dosage of the drug or the degree of intervention prescribed in the protocol. The investigator or the participant may then decide to follow the protocol with less intensity. At all times during the conduct of a trial, the participant’s welfare must come first and meeting those needs may not allow some aspects of the protocol to be followed. Planners of clinical trials must recognize this possibility and attempt to account for it in their design. Examples of adjusting for nonadherence with dichotomous outcomes can be found in several clinical trials [75–82].

In the intervention group a participant who does not adhere to the intervention schedule is often referred to as a “drop-out.” Participants who stop the intervention regimen lose whatever potential benefit the intervention might offer. Similarly, a participant on the control regimen may at some time begin to use the intervention that is being evaluated. This participant is referred to as a “drop-in.” In the case of a drop-in, a physician may decide, for example, that surgery is required for a participant assigned to medical treatment in a clinical trial of surgery versus medical care [77]. Drop-in participants from the control group who start the intervention regimen will receive whatever potential benefit or harm that the intervention might offer. Therefore, both the drop-out and drop-in participants must be acknowledged because they tend to dilute any difference between the two groups which might be produced by the intervention. This simple model does not take into account the situation in which one level of an intervention is compared to another level of the intervention. More complicated models for nonadherence adjustment can be developed. Regardless of the model, it must be emphasized that the assumed event rates in the control and intervention groups are modified by participants who do not adhere to the study protocol.

People who do not adhere should remain in the assigned study groups and be included in the analysis. The rationale for this is discussed in Chap. 18. The basic point to be made here is that eliminating participants from analysis or transferring participants to the other group could easily bias the results of the study. However, the observed δ is likely to be less than projected because of nonadherence and thus have an impact on the power of the clinical trial. A reduced δ, of course, means that either the sample size must be increased or the study will have smaller power than intended. Lachin [11] has proposed a simple formula to adjust crudely the sample size for a drop-out rate of proportion RO. This can be generalized to adjust for drop-in rates, RI, as well. The unadjusted sample size N should be multiplied by the factor {1/(1 − RO − RI)}² to get the adjusted sample size per arm, N*. Thus, if RO = 0.20 and RI = 0.05, the originally calculated sample should be multiplied by 1/(0.75)², or 16/9, and increased by 78%. This formula gives some quantitative idea of the effect of drop-out on the sample size:

$$ {N}^{*}=N/{\left(1-{R}_{\mathrm{O}}-{R}_{\mathrm{I}}\right)}^2 $$
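A one-line sketch of this adjustment (the per-arm N of 480 is borrowed from the earlier two-proportion example purely for illustration):

```python
def adjust_for_nonadherence(n, dropout_rate, dropin_rate):
    """Inflate a per-arm sample size N for expected drop-outs and drop-ins."""
    return n / (1 - dropout_rate - dropin_rate) ** 2

# 20% drop-out and 5% drop-in inflate the sample size by 1/(0.75)^2, about 78%.
print(adjust_for_nonadherence(480, 0.20, 0.05))  # ~853 per arm
```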

However, more refined models to adjust sample sizes for drop-outs from the intervention to the control [83–89] and for drop-ins from the control to the intervention regimen [83] have been developed. They adjust for the resulting changes in pI and pC, the adjusted rates being denoted pI* and pC*. These models also allow for another important factor, which is the time required for the intervention to achieve maximum effectiveness. For example, an anti-platelet drug may have an immediate effect; in contrast, even though a cholesterol-lowering drug reduces serum levels quickly, it may require years to produce a maximum effect on coronary mortality.

Example: A drug trial [76] in post-myocardial infarction participants illustrates the effect of drop-outs and drop-ins on sample size. In this trial, total mortality over a 3-year follow-up period was the primary response variable. The mortality rate in the control group was estimated to be 18% (pC = 0.18) and the intervention was believed to have the potential for reducing pC by 28% (k = 0.28), yielding pI = 0.1296. These estimates of pC and k were derived from previous studies. Those studies also indicated that the drop-out rate might be as high as 26% over the 3 years: 12% in the first year, an additional 8% in the second year, and an additional 6% in the third year. For the control group, the drop-in rate was estimated to be 7% each year for a total drop-in rate of 21%.

Using these models for adjustment, pC* = 0.1746 and pI* = 0.1375. Therefore, instead of δ being 0.0504 (0.18 − 0.1296), the adjusted δ* is 0.0371 (0.1746 − 0.1375). For a two-sided test with α = 0.05 and 1 − β = 0.90, the adjusted sample size was 4020 participants compared to an unadjusted sample size of 2160 participants. The adjusted sample size almost doubled in this example due to the expected drop-out and drop-in experiences and the recommended policy of keeping participants in the originally assigned study groups. The remarkable increases in sample size because of drop-outs and drop-ins strongly argue for major efforts to keep nonadherence to a minimum during trials.

Sample Size Calculations for Continuous Response Variables

Similar to dichotomous outcomes, we consider two sample size cases for response variables which are continuous [9, 11, 90]. The first case is for two independent samples. The other case is for paired data.

Two Independent Samples

For a clinical trial with continuous response variables, the previous discussion is conceptually relevant, but not directly applicable to actual calculations. “Continuous” variables such as length of hospitalization, blood pressure, spirometric measures, neuropsychological scores and level of a serum component may be evaluated. Distributions of such measurements frequently can be approximated by a normal distribution. When this is not the case, a transformation of values, such as taking their logarithm, can often make the normality assumption approximately correct.

Suppose the primary response variable, denoted as x, is continuous with NI and NC participants randomized to the intervention and control groups respectively. Assume that the variable x has a normal distribution with mean μ and variance σ2. The true levels of μI and μC for the intervention and control groups are not known, but it is assumed that σ2 is known. (In practice, σ2 is not known and must be estimated from some data. If the data set used is reasonably large, the estimate of σ2 can be used in place of the true σ2. If the estimate for σ2 is based on a small set of data, it is necessary to be cautious in the interpretation of the sample size calculations.)

The null hypothesis is H0: δ = μC − μI = 0 and the two-sided alternative hypothesis is HA: δ = μC − μI ≠ 0. If the variance is known, the test statistic is:

$$ Z=\left({\overline{\mathrm{x}}}_{\mathrm{C}}-{\overline{\mathrm{x}}}_{\mathrm{I}}\right)/\left[\upsigma \sqrt{1/{N}_{\mathrm{C}}+1/{N}_{\mathrm{I}}}\right] $$

where \( {\overline{\mathrm{x}}}_{\mathrm{I}} \) and \( {\overline{\mathrm{x}}}_{\mathrm{C}} \) represent mean levels observed in the intervention and control groups, respectively. For adequate sample size (e.g., 50 participants per arm) this statistic has approximately a standard normal distribution. The hypothesis-testing concepts previously discussed apply to the above statistic. If |Z| > Zα, then an investigator would reject H0 at the α level of significance. By use of the above test statistic it can be determined how large a total sample 2N would be needed to detect a true difference δ between μI and μC with power (1 − β) and significance level α by the formula:

$$ 2N=4{\left({Z}_{\upalpha}+{Z}_{\upbeta}\right)}^2{\upsigma}^2/{\updelta}^2 $$

Example: Suppose an investigator wishes to estimate the sample size necessary to detect a 10 mg/dL difference in cholesterol level in a diet intervention group compared to the control group. The variance from other data is estimated to be (50 mg/dL)2. For a two-sided 5% significance level, Zα = 1.96 and for 90% power, Zβ = 1.282. Substituting these values into the above formula, 2N = 4(1.96 + 1.282)2(50)2/102 or approximately 1,050 participants. As δ decreases, the value of 2N increases, and as σ2 increases the value of 2N increases. This means that the smaller the difference in intervention effect an investigator is interested in detecting and the larger the variance, the larger the study must be. As with the dichotomous case, setting a smaller α and larger 1 − β also increases the sample size. Figure 8.3 shows total sample size 2 N as a function of δ/σ. As in the example, if δ = 10 and σ = 50, then δ/σ = 0.2 and the sample size 2N for 1 − β = 0.9 is approximately 1,050.

Fig. 8.3

Total sample size (2N) required to detect the difference (δ) between control group mean and intervention group mean as a function of the standardized difference (δ/σ) where σ is the common standard deviation, with two-sided significance level of 0.05 and power (1 − β) of 0.80 and 0.90

Paired Data

In some clinical trials, paired outcome data may increase power for detecting differences because individual or within-participant variation is reduced. Trial participants may be assessed at baseline and at the end of follow-up. For example, instead of looking at the difference between mean levels in the groups, an investigator interested in mean levels of change might want to test whether diet intervention lowers serum cholesterol from baseline levels when compared to a control. This is essentially the same question as asked before in the two independent sample case, but each participant’s initial cholesterol level is taken into account. Because of the likelihood of reduced variability, this type of design can lead to a smaller sample size if the question is correctly posed. Assume that ΔC and ΔI represent the true, but unknown levels of change from baseline to some later point in the trial for the control and intervention groups, respectively. Estimates of ΔC and ΔI would be \( {\overline{d}}_{\mathrm{C}}={\overline{\mathrm{x}}}_{{\mathrm{C}}_1}-{\overline{\mathrm{x}}}_{{\mathrm{C}}_2} \) and \( {\overline{d}}_{\mathrm{I}}={\overline{\mathrm{x}}}_{{\mathrm{I}}_1}-{\overline{\mathrm{x}}}_{{\mathrm{I}}_2} \). These represent the differences in mean levels of the response variable at two points for each group. The investigator tests H0: ΔC − ΔI = 0 versus HA: ΔC − ΔI = δ ≠ 0. The variance σΔ² in this case reflects the variability of the change, from baseline to follow-up, and is assumed here to be the same in the control and intervention arms. This variance is likely to be smaller than the variability at a single measurement. This is the case if the correlation between the first and second measurements is greater than 0.5. Using δ and σΔ², as defined in this manner, the previous sample size formula for two independent samples and graph are applicable. That is, the total sample size 2N can be estimated as

$$ 2N=4{\left({Z}_{\upalpha}+{Z}_{\upbeta}\right)}^2{\upsigma}_{\varDelta}^2/{\updelta}^2 $$

Another way to represent this is

$$ 2N=8{\left({Z}_{\upalpha}+{Z}_{\upbeta}\right)}^2\left(1-\uprho \right){\upsigma}^2/{\updelta}^2 $$

where \( {\upsigma}_{\varDelta}^2=2{\upsigma}^2\left(1-\uprho \right) \) and σ2 is the variance of a measurement at a single point in time, the variability is assumed to be the same at both time points (i.e. at baseline and at follow-up), and ρ is the correlation coefficient between the first and second measurement. As indicated, if the correlation coefficient is greater than 0.5, comparing the paired differences will result in a smaller sample size than just comparing the mean values at the time of follow-up.

Example: Assume that an investigator is still interested in detecting a 10 mg/dL difference in cholesterol between the two groups, but that the variance of the change is now (20 mg/dL)2. The question being asked in terms of δ is approximately the same, because randomization should produce baseline mean levels in each group which are almost equal. The comparison of differences in change is essentially a comparison of the difference in mean levels of cholesterol at the second measurement. Using Fig. 8.3, where δ/σΔ = 10/20 = 0.5, the sample size is 170. This impressive reduction in sample size from 1,050 is due to a reduction in the variance from (50 mg/dL)2 to (20 mg/dL)2.
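Both the independent-sample and the change-score calculations use the same form of formula, differing only in the variance plugged in; a minimal sketch (our own helper) reproduces the 1,050 and 170 figures above:

```python
from statistics import NormalDist

def total_n_means(delta, sigma, alpha=0.05, power=0.90, two_sided=True):
    """2N = 4 (Z_alpha + Z_beta)^2 sigma^2 / delta^2 for comparing two means."""
    norm = NormalDist()
    z_a = norm.inv_cdf(1 - (alpha / 2 if two_sided else alpha))
    z_b = norm.inv_cdf(power)
    return 4 * (z_a + z_b) ** 2 * sigma ** 2 / delta ** 2

# 10 mg/dL difference with SD 50 mg/dL at a single time point:
print(total_n_means(10, 50))  # ~1,050 in total
# Same difference using change from baseline, with SD of the change of 20 mg/dL:
print(total_n_means(10, 20))  # ~170 in total
```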

Another type of pairing occurs in diseases that affect paired organs such as lungs, kidneys, and eyes. In ophthalmology, for example, trials have been conducted where one eye is randomized to receive treatment and the other to receive control therapy [61–64]. Both the analysis and the sample size estimation need to take account of this special kind of stratification. For continuous outcomes, a mean difference in outcome between a treated eye and untreated eye would measure the treatment effect and could be compared using a paired t-test [9, 11], \( Z=\overline{d}/\left({S}_{\mathrm{d}}\sqrt{1/N}\right) \), where \( \overline{d} \) is the average difference in response and Sd is the standard deviation of the differences. The mean difference μd is equal to the mean response of the treated or intervention eye, for example, minus the mean response of the control eye; that is, μd = μI − μC. The null hypothesis is that μd equals zero; under the alternative hypothesis, μd equals δd, the difference to be detected. An estimate of this difference, \( \overline{d} \), can be obtained by taking the average of the within-pair differences or by calculating \( {\overline{\mathrm{x}}}_{\mathrm{I}}-{\overline{\mathrm{x}}}_{\mathrm{C}} \). The variance of the paired differences σd² is estimated by Sd². Thus, the formula for paired continuous outcomes within an individual is a slight modification of the formula for comparison of means in two independent samples. To compute the sample size, Nd, for the number of pairs, we compute:

$$ {N}_{\mathrm{d}}={\left({Z}_{\upalpha}+{Z}_{\upbeta}\right)}^2{\upsigma}_{\mathrm{d}}^2/{\updelta}_{\mathrm{d}}^2 $$

As discussed previously, participants in clinical trials do not always fully adhere to the intervention being tested. Some fraction (RO) of participants on intervention drop out of the intervention and some other fraction (RI) drop in and start following the intervention. If we assume that the participants who drop out respond as if they had been on control and those who drop in respond as if they had been on intervention, then the sample size adjustment is the same as for the case of proportions. That is, the adjusted sample size N* is a function of the drop-out rate, the drop-in rate, and the sample size N for a study with fully compliant participants:

$$ {N}^{*}=N/{\left(1-{R}_{\mathrm{O}}-{R}_{\mathrm{I}}\right)}^2 $$

Therefore, if the drop-out rate were 0.20 and the drop-in rate 0.05, then the original sample size N must be multiplied by 16/9, or 1.78; that is, a 78% increase in sample size.

Sample Size for Repeated Measures

The previous section briefly presented the sample size calculation for trials where only two points, say a baseline and a final visit, are used to determine the effect of intervention and these two points are the same for all study participants. Often, a continuous response variable is measured at each follow-up visit. Considering only the first and last values would give one estimate of change but would not take advantage of all the available data. Many models exist for the analysis of repeated measurements, and formulae [13, 91–97] as well as computer software [66, 67, 69–73] for sample size calculation are available for most. In some cases, the response variable may be categorical. We present one of the simpler models for continuous repeated measurements. While other models are beyond the scope of this book, the basic concepts presented are still useful in thinking about how many participants are needed, how many measurements should be taken per individual, and when they should be taken. In such a case, one possible approach is to assume that the change in response variable is approximately a linear function of time, so that the rate of change can be summarized by a slope. This model is fit to each participant’s data by the standard least squares method and the estimated slope is used to summarize the participant’s experience. In planning such a study, the investigator must be concerned about the frequency of the measurement and the duration of the observation period. As discussed by Fitzmaurice and co-authors [98], the observed measurement x can be expressed as x = a + bt + error, where a = intercept, b = slope, t = time, and error represents the deviation of the observed measurement from a regression line. This error may be due to measurement variability, biological variability or the nonlinearity of the true underlying relationship. On average, this error is expected to be distributed around 0 and have a variability denoted as σ(error)². Though it is not necessary, it simplifies the calculation to assume that σ(error)² is approximately the same for each participant.

The investigator evaluates intervention effectiveness by comparing the average slope in one group with the average slope in another group. Obviously, participants in a group will not have the same slope, but the slopes will vary around some average value which reflects the effectiveness of the intervention or control. The amount of variability of slopes over participants is denoted as σb². If D represents the total time duration for each participant and P represents the number of equally spaced measurements, σb² can be expressed as:

$$ {\upsigma}_b^2 = {\upsigma}_B^2+\left\{12\left(P-1\right)\ {\upsigma}_{\left(\mathrm{error}\right)}^2/\left({D}^2P\left(P+1\right)\right)\right\} $$

where σB² is the component of variance attributable to differences in participants’ slopes as opposed to measurement error and lack of a linear fit. The sample size required to detect a difference δ between the average rates of change in the two groups is given by:

$$ 2N=\left[4{\left({Z}_{\upalpha}+{Z}_{\upbeta}\right)}^2/{\updelta}^2\ \right]\left[{\upsigma}_B^2+\left\{12\left(P-1\right)\ {\upsigma}_{\left(\mathrm{error}\right)}^2/\left({D}^2P\left(P+1\right)\right)\right\}\right] $$

As in the previous formulas, when δ decreases, 2N increases. The factor on the right-hand side relates D and P with the variance components σB² and σ(error)². Obviously as σB² and σ(error)² increase, the total sample size increases. By increasing P and D, however, the investigator can decrease the contribution made by σ(error)². The exact choices of P and D will depend on how long the investigator can feasibly follow participants, how many times he can afford to have participants visit a clinic and other factors. By manipulating P and D, an investigator can design a study which will be the most cost effective for his specific situation.

Example: In planning for a trial, it may be assumed that a response variable declines at the rate of 80 units/year in the control group. Suppose a 25% reduction is anticipated in the intervention group. That is, the rate of change in the intervention group would be 60 units/year. Other studies provided an estimate for σ(error) of 150 units. Also, suppose data from a study of people followed every 3 months for 1 year (D = 1 and P = 5) gave a value for the standard deviation of the slopes, σb = 200. The calculated value of σB is then 63 units. Thus, for a 5% significance level and 90% power (Zα = 1.96 and Zβ = 1.282), the total sample size would be approximately 630 for a 3-year study with four visits per year (D = 3, P = 13). Increasing the follow-up time to 4 years, again with four measurements per year, would decrease the variability with a resulting sample size calculation of approximately 510. This reduction in sample size could be used to decide whether or not to plan a 4-year or a 3-year study.
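The slope-based calculation can be sketched as follows (hypothetical helpers; the pilot values of σ(error) = 150 and a slope standard deviation of 200 over 1 year with 5 visits are taken from the example above):

```python
from statistics import NormalDist

def slope_variance(sigma_b_sq, sigma_err, duration, n_visits):
    """Variance of an individual least-squares slope with equally spaced visits."""
    p, d = n_visits, duration
    return sigma_b_sq + 12 * (p - 1) * sigma_err ** 2 / (d ** 2 * p * (p + 1))

def total_n_slopes(delta, sigma_b_sq, sigma_err, duration, n_visits,
                   alpha=0.05, power=0.90):
    """2N for comparing average slopes between two groups."""
    norm = NormalDist()
    z_a = norm.inv_cdf(1 - alpha / 2)
    z_b = norm.inv_cdf(power)
    var_slope = slope_variance(sigma_b_sq, sigma_err, duration, n_visits)
    return 4 * (z_a + z_b) ** 2 * var_slope / delta ** 2

# Back out sigma_B^2 from the 1-year pilot with 5 visits and slope SD of 200:
sigma_b_sq = 200 ** 2 - 12 * (5 - 1) * 150 ** 2 / (1 ** 2 * 5 * (5 + 1))  # ~4,000
print(total_n_slopes(20, sigma_b_sq, 150, duration=3, n_visits=13))  # ~630
print(total_n_slopes(20, sigma_b_sq, 150, duration=4, n_visits=17))  # ~510
```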

Sample Size Calculations for “Time to Failure”

For many clinical trials, the primary response variable is the occurrence of an event and thus the proportion of events in each group may be compared. In these cases, the sample size methods described earlier will be appropriate. In other trials, the time to the event may be of special interest. For example, if the time to death or a nonfatal event can be increased, the intervention may be useful even though at some point the proportion of events in each group is similar. Methods for analysis of this type of outcome are generally referred to as life table or survival analysis methods (see Chap. 15). In this situation, other sample size approaches are more appropriate than that described for dichotomous outcomes [99–118]. At the end of this section, we also discuss estimating the number of events required to achieve a desired power.

The basic approach is to compare the survival curves for the groups. A survival curve may be thought of as a graph of the probability of surviving, or not having an event, up to any given point in time. The methods of analysis now widely used are non-parametric; that is, no mathematical model about the shape of the survival curve is assumed. However, for the purpose of estimating sample size, some assumptions are often useful. A common model assumes that the survival curve, S(t), follows an exponential distribution, S(t) = exp(−λt), where λ is called the hazard rate or force of mortality. Using this model, survival curves are totally characterized by λ. Thus, the survival curves from a control and an intervention group can be compared by testing H0: λC = λI. An estimate of λ is obtained as the inverse of the mean survival time. If the median survival time, TM, is known, the hazard rate λ may also be estimated by − ln(0.5)/TM. Sample size formulations have been considered by several investigators [103, 112, 119]. One simple formula is given by

$$ 2N=4{\left({Z}_{\upalpha}+{Z}_{\upbeta}\right)}^2/{\left[ \ln \left({\lambda}_{\mathrm{C}}/{\lambda}_{\mathrm{I}}\right)\right]}^2 $$

where N is the size of the sample in each group and Zα and Zβ are defined as before. As an example, suppose one assumes that the force of mortality is 0.30 in the control group and expects it to be 0.20 for the intervention being tested; that is, λC/λI = 1.5. If α = .05 (two-sided) and 1 − β = 0.90, then N = 128 or 2N = 256. The corresponding mortality rates for 5 years of follow-up are 0.7769 and 0.6321 respectively. Using the comparison of two proportions, the total sample size would be 412. Thus, the time to failure method may give a more efficient design, requiring a smaller number of participants.
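A short calculation, sketched below with the same assumptions (λC = 0.30, λC/λI = 1.5, two-sided α = 0.05, 90% power), reproduces both the time-to-failure sample size and the two-proportion comparison; the two-proportion formula used here is the standard one from earlier in the chapter, so the result may differ from the quoted 412 by a participant or two because of rounding.

```python
from math import exp, log, sqrt

Z_ALPHA, Z_BETA = 1.959964, 1.281552   # two-sided alpha = 0.05, power = 0.90

lam_c, lam_i = 0.30, 0.20              # hazard rates: control and intervention

# Time-to-failure approach: 2N = 4 (Z_alpha + Z_beta)^2 / [ln(lam_c / lam_i)]^2
print(round(4 * (Z_ALPHA + Z_BETA)**2 / log(lam_c / lam_i)**2))        # 256

# Dichotomous-outcome approach: 5-year event proportions and the two-proportion formula
p_c = 1 - exp(-lam_c * 5)              # 0.7769
p_i = 1 - exp(-lam_i * 5)              # 0.6321
p_bar = (p_c + p_i) / 2
n_per_group = ((Z_ALPHA * sqrt(2 * p_bar * (1 - p_bar))
                + Z_BETA * sqrt(p_c * (1 - p_c) + p_i * (1 - p_i)))**2
               / (p_c - p_i)**2)
print(round(2 * n_per_group))          # close to the 412 quoted in the text
```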

The method just described assumes that all participants will be followed to the event. With few exceptions, clinical trials with a survival outcome are terminated at time T before all participants have had an event. For those still event-free, the time to event is said to be censored at time T. For this situation, Lachin [11] gives the approximate formula:

$$ 2N=2{\left({Z}_{\upalpha}+{Z}_{\upbeta}\right)}^2\left[\upvarphi \left({\lambda}_{\mathrm{C}}\right)+\upvarphi \left({\lambda}_{\mathrm{I}}\right)\right]/{\left({\lambda}_{\mathrm{I}}-{\lambda}_{\mathrm{C}}\right)}^2 $$

where \( \upvarphi \left(\lambda \right)={\lambda}^2/\left(1-{\mathrm{e}}^{-\lambda T}\right) \), and φ(λC) or φ(λI) are defined by replacing λ with λC or λI, respectively. If a 5-year study were being planned (T = 5) with the same design specifications as above, then the sample size, 2N, is equal to 376. Thus, the loss of information due to censoring must be compensated for by increasing the sample size. If the participants are to be recruited continually during the 5 years of the trial, the formula given by Lachin is identical but with \( \upvarphi \left(\lambda \right)={\lambda}^3T/\left(\lambda T-1+{\mathrm{e}}^{-\lambda T}\right) \). Using the same design assumptions, we obtain 2N = 620, showing that not having all the participants at the start requires an additional increase in sample size.
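The censoring adjustment can be computed directly; the sketch below (assumed variable names, same design values as above) reproduces the figures for censoring at T = 5 years and for continual recruitment over the 5 years, to within rounding of the Z values.

```python
from math import exp

Z_ALPHA, Z_BETA = 1.959964, 1.281552   # two-sided alpha = 0.05, power = 0.90
lam_c, lam_i, T = 0.30, 0.20, 5.0

def total_n(phi):
    return 2 * (Z_ALPHA + Z_BETA)**2 * (phi(lam_c) + phi(lam_i)) / (lam_i - lam_c)**2

# All participants entered at time 0 and censored at T:
phi_censored = lambda lam: lam**2 / (1 - exp(-lam * T))
print(round(total_n(phi_censored)))     # 376

# Participants recruited continually over the T years of the trial:
phi_continual = lambda lam: lam**3 * T / (lam * T - 1 + exp(-lam * T))
print(round(total_n(phi_continual)))    # about 620
```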

More typically participants are recruited uniformly over a period of time, T0, with the trial continuing for a total of T years (T > T0). In this situation, the sample size can be estimated as before using:

$$ \upphi \left(\lambda \right)={\lambda}^2/\left[1-\left({\mathrm{e}}^{-\lambda \left(T-{T}_0\right)}-{\mathrm{e}}^{-\lambda T}\right)/\left(\lambda {T}_0\right)\right] $$

Here, with recruitment over the first 3 years of a 5-year study (T0 = 3 and T = 5) and the same design assumptions, the sample size (2N) of 466 falls between the previous two examples, suggesting that it is preferable to recruit participants as rapidly as possible in order to maximize follow-up or exposure time.
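For the staggered-entry case, the same machinery applies with the more general ϕ; the sketch below assumes recruitment over the first 3 years of a 5-year study, which reproduces the 466 noted above.

```python
from math import exp

Z_ALPHA, Z_BETA = 1.959964, 1.281552
lam_c, lam_i = 0.30, 0.20
T0, T = 3.0, 5.0                        # uniform recruitment over (0, 3), follow-up to year 5

def phi_staggered(lam):
    return lam**2 / (1 - (exp(-lam * (T - T0)) - exp(-lam * T)) / (lam * T0))

total_n = (2 * (Z_ALPHA + Z_BETA)**2
           * (phi_staggered(lam_c) + phi_staggered(lam_i)) / (lam_i - lam_c)**2)
print(round(total_n))                   # 466
```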

One of the methods used for comparing survival curves is the proportional hazards model or the Cox regression model which is discussed briefly in Chap. 15. For this method, sample size estimates have been published [101, 115]. As it turns out, the formula by Schoenfeld for the Cox model [115] is identical to that given above for the simple exponential case, although developed from a different point of view. Further models are given by Lachin [11].

All of the above methods assume that the hazard rate remains constant during the course of the trial. This may not be the case. The Beta-Blocker Heart Attack Trial [76] compared 3-year survival in two groups of participants with intervention starting one to 3 weeks after an acute myocardial infarction. The risk of death was high initially, decreased steadily, and then became relatively constant.

For cases where the event rate is relatively small and the clinical trial will have considerable censoring, most of the statistical information will be in the number of events. Thus, the sample size estimates using simple proportions will be quite adequate. In the Beta-Blocker Heart Attack Trial, the 3-year control group event rate was assumed to be 0.18. For the intervention group, the event rate was assumed to be approximately 0.13. In the situation of \( \upphi \left(\lambda \right)={\lambda}^2/\left(1-{\mathrm{e}}^{-\lambda T}\right) \), a sample size 2N = 2,208 is obtained, before adjustment for estimated nonadherence. In contrast, the unadjusted sample size using simple proportions is 2,160. Again, it should be emphasized that all of these methods are only approximations and the estimates should be viewed as such.

As the previous example indicates, the power of a survival analysis still is a function of the number of events. The expected number of events E(D) is a function of sample size, hazard rate, recruitment rate, and censoring distribution [11, 106]. Specifically, the expected number of events in the control group can be estimated as

$$ E(D)=N{\lambda}_{\mathrm{C}}^2/\upphi \left({\lambda}_{\mathrm{C}}\right) $$

where ϕ(λC) is defined as before, depending on the recruitment and follow-up strategy. If we assume uniform recruitment over the interval (0, T0) and follow-up over the interval (0, T), then E(D) can be written using the most general form for ϕ(λC):

$$ E(D)=N\left[1-\left({\mathrm{e}}^{-{\lambda}_{\mathrm{C}}\left(T-{T}_0\right)}-{\mathrm{e}}^{-{\lambda}_{\mathrm{C}}T}\right)/\left({\lambda}_{\mathrm{C}}{T}_0\right)\right] $$

This estimate of the number of events can be used to predict the number of events at various time points during the trial including the end of follow-up. This prediction can be compared to the observed number of events in the control group to determine if an adjustment needs to be made to the design. That is, if the number of events early in the trial is larger than expected, the trial may be more powerful than designed or may be stopped earlier than the planned T years of follow-up (see Chap. 16). However, more worrisome is when the observed number of events is smaller than what is expected and needed to maintain adequate power. Based on this early information, the design may be modified to attain the necessary number of events by increasing the sample size or expanding recruitment effort within the same period of time, increasing follow-up, or a combination of both.

This method can be illustrated based on a placebo-controlled trial of congestive heart failure [82]. Severe or advanced congestive heart failure has an expected 1-year event rate of 40%, where the events are all-cause mortality and nonfatal myocardial infarction. A new drug was to be tested to reduce the event rate by 25%, using a two-sided 5% significance level and 90% power. If participants are recruited over 1.5 years (T0 = 1.5) during a 2-year study (T = 2) and a constant hazard rate is assumed, the total sample size (2N) is estimated to be 820 participants with congestive heart failure. The formula for E(D) can be used to calculate that approximately 190 events (deaths plus nonfatal myocardial infarctions) must be observed in the control group to attain 90% power. If the first-year event rate turns out to be less than 40%, fewer events than the required 190 will be observed by 2 years. Table 8.4 shows the expected number of control group events at 6 months and 1 year into the trial for annual event rates of 40, 35, 30, and 25%. Two years is also shown to illustrate the projected number of events at the completion of the study. These numbers are obtained by calculating the number of control group participants enrolled by 6 months (33% of 410) and 1 year (66% of 410) and multiplying by the bracketed term on the right-hand side of the equation for E(D). If the assumed annual event rate of 40% is correct, 60 control group events should be observed at 1 year. However, if at 1 year only 44 events are observed, the annual event rate might be closer to 30% (i.e., λ = 0.357) and some design modification should be considered to assure achieving the desired 190 control group events. One year would be a sensible time to make this decision, based only on control group events, since recruitment efforts are still underway. For example, if recruitment efforts could be expanded to 1,220 participants in 1.5 years, then by 2 years of follow-up the 190 events in the placebo group would be observed and the 90% power maintained. If recruitment efforts were instead to continue for another 6 months at a uniform rate (T0 = 2 years), another 135 control group participants would be enrolled. In this case, E(D) is 545 × 0.285 = 155 events, which would not be sufficient without some additional follow-up. If recruitment and follow-up continued for 27 months (i.e., T0 = T = 2.25), then 605 control group participants would be recruited and E(D) would be 187, yielding the desired power.
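The monitoring calculation in this example can be sketched as follows, assuming uniform recruitment and the E(D) expression above; small differences from the rounded figures in the text and in Table 8.4 reflect rounding.

```python
from math import exp, log

def expected_events(n_enrolled, lam, T, T0):
    """E(D) = N [1 - (exp(-lam (T - T0)) - exp(-lam T)) / (lam T0)]."""
    return n_enrolled * (1 - (exp(-lam * (T - T0)) - exp(-lam * T)) / (lam * T0))

lam_40 = -log(1 - 0.40)   # hazard corresponding to a 40% annual event rate (about 0.511)
lam_30 = -log(1 - 0.30)   # hazard corresponding to a 30% annual event rate (about 0.357)

# Design: 410 control participants recruited uniformly over 1.5 years of a 2-year study.
print(round(expected_events(410, lam_40, T=2.0, T0=1.5)))    # ~188; the text quotes about 190

# Monitoring at 1 year: roughly two thirds of the 410 control participants are enrolled.
n_1yr = 410 * (1.0 / 1.5)
print(round(expected_events(n_1yr, lam_40, T=1.0, T0=1.0)))  # ~59; Table 8.4 shows 60
print(round(expected_events(n_1yr, lam_30, T=1.0, T0=1.0)))  # ~43; the text cites 44 if the rate is 30%
```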

Table 8.4 Number of expected events (in the control group) at each interim analysis given different event rates in control group

Sample Size for Testing “Equivalency” or Noninferiority of Interventions

In some instances, an effective intervention has already been established and is considered the standard. New interventions under consideration may be preferred because they are less expensive, have fewer side effects, or have less adverse impact on an individual’s general quality of life. This issue is common in the pharmaceutical industry where a product developed by one company may be tested against an established intervention manufactured by another company. Studies of this type are sometimes referred to as trials with positive controls or as noninferiority designs (see Chaps. 3 and 5).

Given that several trials have shown that certain beta-blockers are effective in reducing mortality in post-myocardial infarction participants [76, 120, 121], it is likely that any new beta-blockers developed will be tested against proven agents. The Nocturnal Oxygen Therapy Trial [122] tested whether the daily amount of oxygen administered to chronic obstructive pulmonary disease participants could be reduced from 24 to 12 h without impairing oxygenation. The Intermittent Positive Pressure Breathing [80] trial considered whether a simple and less expensive method for delivering a bronchodilator into the lungs would be as effective as a more expensive device. A breast cancer trial compared the tumor regression rates between subjects receiving the standard, diethylstilbestrol, or the newer agent, tamoxifen [123].

The problem in designing noninferiority trials is that there is no statistical method to demonstrate complete equivalence. That is, it is not possible to show δ = 0. Failure to reject the null hypothesis is not a sufficient reason to claim two interventions to be equal but merely that the evidence is inadequate to say they are different [124]. Assuming no difference when using the previously described formulas results in an infinite sample size.

While demonstrating perfect equivalence is an impossible task, one possible approach has been discussed for noninferiority designs [125–128]. The strategy is to specify some value, δ, such that interventions with differences less than this might be considered “equally effective” or “noninferior” (see Chap. 5 for discussion of noninferiority designs). Specification of δ may be difficult, but it is a necessary element of the design. The null hypothesis states that pC > pI + δ while the alternative specifies pC < pI + δ. The methods developed require that if the two interventions really are equally effective or at least noninferior, the upper 100(1 − α)% confidence limit for the intervention difference will not exceed δ with probability 1 − β. One can alternatively approach this from a hypothesis testing point of view, testing the null hypothesis that the two interventions differ by at least δ against the alternative that they differ by less than δ.

For studies with a dichotomous response, one might assume the event rate for the two interventions to be equal to p (i.e., p = pC = pI). This simplifies the previously shown sample size formula to

$$ 2N=4p\left(1-p\right){\left({\mathrm{Z}}_{\upalpha}+{\mathrm{Z}}_{\upbeta}\right)}^2/{\updelta}^2 $$

where N, Zα and Zβ are defined as before. Makuch and Simon [127] recommend for this situation that α = 0.10 and β = 0.20. However, in many situations the Type II error β needs to be 0.10 or smaller in order to be confident that a new therapy is correctly determined to be equivalent to an older standard. We prefer α = 0.05, but this is a matter of judgment and will depend on the situation. (This formula differs slightly from its analogue presented earlier because of the different way the hypothesis is stated.) The formula for continuous variables,

$$ 2N=4{\left({\mathrm{Z}}_{\upalpha}+{\mathrm{Z}}_{\upbeta}\right)}^2/{\left(\updelta /\upsigma \right)}^2 $$

is identical to the formula for determining sample size discussed earlier. Blackwelder and Chang [126] give graphical methods for computing sample size estimates for studies of equivalency.
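As a hypothetical illustration (the success rate and margin below are not from the text), the two formulas can be evaluated as follows; whether Zα is taken one-sided or two-sided depends on how the noninferiority hypothesis is framed, and it is taken one-sided here.

```python
from scipy.stats import norm

def noninferiority_2n_binary(p, delta, alpha=0.05, beta=0.10):
    """2N = 4 p (1 - p) (Z_alpha + Z_beta)^2 / delta^2 for a dichotomous response."""
    z = norm.ppf(1 - alpha) + norm.ppf(1 - beta)   # one-sided alpha; use 1 - alpha/2 for two-sided
    return 4 * p * (1 - p) * z**2 / delta**2

def noninferiority_2n_continuous(delta, sigma, alpha=0.05, beta=0.10):
    """2N = 4 (Z_alpha + Z_beta)^2 / (delta / sigma)^2 for a continuous response."""
    z = norm.ppf(1 - alpha) + norm.ppf(1 - beta)
    return 4 * z**2 / (delta / sigma)**2

# Hypothetical values: common success rate 0.75, noninferiority margin delta = 0.10
print(round(noninferiority_2n_binary(0.75, 0.10)))      # about 642
print(round(noninferiority_2n_continuous(5.0, 20.0)))   # about 548 when delta/sigma = 0.25
```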

As mentioned above and in Chap. 5, specifying δ is a key part of the design and sample size calculations of all equivalency and noninferiority trials. Trials should be sufficiently large, with enough power, to address properly the questions about equivalence or noninferiority that are posed.

Sample Size for Cluster Randomization

So far, sample size estimates have been presented for trials where individuals are randomized. For some prevention trials or health care studies, it may not be possible to randomize individuals. For example, a trial of a smoking prevention strategy for teenagers may be implemented most easily by randomizing schools, with some schools exposed to the new prevention strategy while other schools remain with a standard approach. Individual students are grouped or clustered within each school. As Donner et al. [129] point out, “Since one cannot regard the individuals within such groups as statistically independent, standard sample size formulas underestimate the total number of subjects required for the trial.” Several authors [129–133] have suggested incorporating a single inflation factor in the usual sample size calculation to account for the cluster randomization. That is, the sample size per intervention arm N computed by the previous formulas will be adjusted to N* to account for the randomization of \( {N}_m \) clusters, each with m individuals.

A continuous response is measured for each individual within a cluster. Differences among individuals within a cluster and differences among individuals between clusters both contribute to the overall variability of the response. We can separate the between-cluster variance σb² and the within-cluster variance σw². Estimates are denoted \( {S}_{\mathrm{b}}^2 \) and \( {S}_{\mathrm{w}}^2 \), respectively, and can be obtained by standard analysis of variance. One measure of the relationship of these components is the intra-class correlation coefficient. The intra-class correlation coefficient ρ is \( {\upsigma}_{\mathrm{b}}^2/\left({\upsigma}_{\mathrm{b}}^2+{\upsigma}_{\mathrm{w}}^2\right) \) where 0 ≤ ρ ≤ 1. If ρ = 0, all clusters respond identically so all of the variability is within a cluster. If ρ = 1, all individuals in a cluster respond alike so there is no variability within a cluster. An estimate of ρ is given by \( r={S}_{\mathrm{b}}^2/\left({S}_{\mathrm{b}}^2+{S}_{\mathrm{w}}^2\right) \). The intra-class correlation typically ranges from 0.1 to 0.4 in clinical studies. If the sample size calculation is done assuming no clustering, it yields N participants per treatment arm. Now, instead of randomizing N individuals, we randomize \( {N}_m \) clusters of m individuals each, for a total of \( {N}^{*}={N}_m\times m \) participants per treatment arm. The inflation factor [133] is [1 + (m − 1)ρ] so that

$$ {N}^{*}={N}_m\times m=N\left[1+\left(m-1\right)\uprho \right] $$

Note that the inflation factor is a function of both the cluster size m and the intra-class correlation. If the intra-class correlation is zero (ρ = 0), then each individual in one cluster responds like any individual in another cluster, and the inflation factor is unity (N* = N). That is, no penalty is paid for the convenience of cluster randomization. At the other extreme, if all individuals in a cluster respond the same (ρ = 1), there is no added information within each cluster, so only one individual per cluster is needed, and the inflation factor is m. That is, the adjusted sample size is N* = N × m and we pay a severe price for this type of cluster randomization. However, it is unlikely that ρ is either 0 or 1; as indicated, it is more likely to be in the range of 0.1–0.4 in clinical studies.

Example: Donner et al. [129] provide an example for a trial randomizing households to a sodium-reducing diet in order to reduce blood pressure. Previous studies estimated the intra-class correlation coefficient to be 0.2; that is, \( \widehat{\uprho}=r={S}_{\mathrm{b}}^2/\left({S}_{\mathrm{b}}^2+{S}_{\mathrm{w}}^2\right)=0.2 \). The average household size was estimated at 3.5 (m = 3.5). The sample size per arm N must be adjusted by the factor 1 + (m − 1)ρ = 1 + (3.5 − 1)(0.2) = 1.5. Thus, the usual sample size must be inflated by 50% to account for this cluster randomization, even though ρ = 0.2 indicates relatively small between-cluster variability. If ρ = 0.1, the factor is 1 + (3.5 − 1)(0.1) or 1.25. If ρ = 0.4, indicating a larger between-cluster component of variability, the inflation factor is 2.0, a doubling of the usual sample size.
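A minimal sketch of this inflation-factor adjustment for a continuous response, reproducing the household example (function names are illustrative):

```python
def inflation_factor(m, rho):
    """Design effect for cluster randomization: 1 + (m - 1) * rho."""
    return 1 + (m - 1) * rho

def cluster_adjusted_n(n_per_arm, m, rho):
    """N* = N [1 + (m - 1) rho], the per-arm sample size after adjusting for clustering."""
    return n_per_arm * inflation_factor(m, rho)

# Household example: average cluster size 3.5, several plausible intra-class correlations
for rho in (0.1, 0.2, 0.4):
    print(rho, inflation_factor(3.5, rho))   # 1.25, 1.50, 2.00
```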

For binomial responses, a similar expression for adjusting the standard sample size can be developed. In this setting, a measure of the degree of within-cluster dependency, or the concordancy rate in participant responses, is used in place of the intra-class correlation. The commonly used measure is the kappa coefficient, denoted κ, which may be thought of as an intra-class correlation coefficient for binomial responses, analogous to ρ for continuous responses. A fully concordant cluster is one where all responses within the cluster are identical, all successes or all failures; if κ = 1, every cluster is concordant and a cluster contributes no more information than a single individual. A simple estimate for κ is provided [129]:

$$ \kappa =\left\{{p}^{*}-\left[{p}_{\mathrm{C}}^m+{\left(1-{p}_{\mathrm{C}}\right)}^m\right]\right\}/\left\{1-\left[{p}_{\mathrm{C}}^m+{\left(1-{p}_{\mathrm{C}}\right)}^m\right]\right\} $$

Here p* is the observed proportion of concordant clusters in the control group, and pC is the underlying success rate in the control group. The authors then show that the inflation factor is [1 + (m − 1)κ]; that is, the regular sample size per treatment arm N must be multiplied by this factor to attain the adjusted sample size N*:

$$ {N}^{*}=N\left[1+\left(m-1\right)\kappa \right] $$

Example: Donner et al. [129] continue the sodium diet example where couples (m = 2) are randomized to either a low sodium or a normal diet. The outcome is the hypertension rate. Other data suggest that the concordancy of hypertension status among married couples is 0.85 (p* = 0.85). The control group hypertension rate is 0.15 (pC = 0.15). In this case, κ = 0.41, so the inflation factor is 1 + (2 − 1)(0.41) = 1.41; that is, the regular sample size must be inflated by 41% to adjust for the couple being the randomization unit. If there is perfect control group concordance, p* = 1 and κ = 1, in which case N* = 2N.
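The kappa calculation for the couples example can be sketched as follows (illustrative function names):

```python
def kappa_binary(p_concordant, p_c, m):
    """Kappa estimate of Donner et al. [129] for binary responses in clusters of size m."""
    chance = p_c**m + (1 - p_c)**m          # probability of a concordant cluster by chance alone
    return (p_concordant - chance) / (1 - chance)

k = kappa_binary(p_concordant=0.85, p_c=0.15, m=2)
print(round(k, 2))                          # 0.41
print(round(1 + (2 - 1) * k, 2))            # inflation factor 1.41
```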

Cornfield proposed another adjustment procedure [130]. Consider a trial where C clusters will be randomized, each cluster of size mi (i = 1, 2, …, C) and each having a different success rate pi (i = 1, 2, …, C). Define the average cluster size \( \overline{m}=\varSigma {m}_{\mathrm{i}}/C \) and \( \overline{p}=\varSigma {m}_{\mathrm{i}}{p}_{\mathrm{i}}/\varSigma\;{m}_{\mathrm{i}} \) as the overall success rate weighted by cluster size. The between-cluster variance of the success rates is \( {\upsigma}_p^2=\varSigma\;{m}_{\mathrm{i}}{\left({p}_{\mathrm{i}}-\overline{p}\right)}^2/\left(C\overline{m}\right) \). In this setting, the efficiency of cluster randomization relative to simple randomization is \( E=\overline{p}\left(1-\overline{p}\right)/\left(\overline{m}\;{\upsigma}_p^2\right) \). The inflation factor (IF) for this design is \( \mathrm{IF}=1/E=\overline{m}\;{\upsigma}_{\mathrm{p}}^2/\left[\overline{p}\left(1-\overline{p}\right)\right] \). Note that if the response rate varies across clusters, the usual sample size must be increased.

While cluster randomization may be logistically required, the process of making the cluster the randomization unit has serious sample size implications. It would be unwise to ignore this consequence in the design phase. As shown, the sample size adjustments can easily be factors of 1.5 or higher. For clusters which are schools or cities, the intra-class correlation is likely to be quite small. However, the cluster size is multiplied by the intra-class correlation so the impact might still be nontrivial. Not making this adjustment would substantially reduce the study power if the analyses were done properly, taking into account the cluster effect. Ignoring the cluster effect in the analysis would be viewed critically in most cases and is not recommended.

Multiple Response Variables

We have stressed the advantages of having a single primary question and a single primary response variable, but clinical trials occasionally have more than one of each. More than one question may be asked because investigators cannot agree about which outcome is most important. As an example, one clinical trial involving two schedules of oxygen administration to participants with chronic obstructive pulmonary disease had three major questions in addition to comparing the mortality rate [122]. Measures of pulmonary function, neuro-psychological status, and quality of life were evaluated. For the participants, all three were important.

Sometimes more than one primary response variable is used to assess a single primary question. This may reflect uncertainty as to how the investigator can answer the question. A clinical trial involving participants with pulmonary embolism [134] employed three methods of determining a drug’s ability to resolve emboli. They were: lung scanning, arteriography, and hemodynamic studies. Another trial involved the use of drugs to limit myocardial infarct size [135]. Precordial electrocardiogram mapping, radionuclide studies, and enzyme levels were all used to evaluate the effectiveness of the drugs. Several approaches to the design and analysis of trials with multiple endpoints have been described [136–139].

Computing a sample size for such clinical trials is not easy. One could attempt to define a single model for the multidimensional response and use one of the previously discussed formulas. Such a method would require several assumptions about the model and its parameters and might require information about correlations between different measurements. Such information is rarely available. A more reasonable procedure would be to compute sample sizes for each individual response variable. If the results give about the same sample size for all variables, then the issue is resolved. However, more commonly, a range of sample sizes will be obtained. The most conservative strategy would be to use the largest sample size computed. The other response variables would then have even greater power to detect the hoped-for reductions or differences (since they required smaller sample sizes). Unfortunately, this approach is the most expensive and difficult to undertake. Of course, one could also choose the smallest sample size of those computed. That would probably not be desirable, because the other response variables would have less power than usually required, or only larger differences than expected would be detectable. It is possible to select a middle range sample size, but there is no assurance that this will be appropriate. An alternative approach is to look at the difference between the largest and smallest sample sizes. If this difference is very large, the assumptions that went into the calculations should be re-examined and an effort should be made to resolve the difference.

As is discussed in Chap. 18, when multiple comparisons are made, the chance of finding a significant difference in one of the comparisons (when, in fact, no real differences exist between the groups) is greater than the stated significance level. In order to maintain an appropriate significance level α for the entire study, the significance level required for each test to reject H0 should be adjusted [41]. The significance level required for rejection (α′) in a single test can be approximated by α/k where k is the number of multiple response variables. For several response variables this can make α′ fairly small (e.g., k = 5 implies α′ = 0.01 for each of k response variables with an overall α = 0.05). If the correlation between response variables is known, then the adjustment can be made more precisely [140, 141]. In all cases, the sample size would be much larger than if the use of multiple response variables were ignored, so that most studies have not strictly adhered to this solution of modifying the significance level. Some investigators, however, have attempted to be conservative in the analysis of results [142]. There is a reasonable limit as to how much α′ can be decreased in order to give protection against false rejection of the null hypothesis. Some investigators have chosen α′ = 0.01 regardless of the number of tests. In the end, there are no easy solutions. A somewhat conservative value of α′ needs to be set and the investigators need to be aware of the multiple testing problem during the analysis.
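To see how the Bonferroni-type adjustment affects the required size, the hypothetical sketch below applies the simple two-proportion formula from earlier in the chapter with the per-test significance level set to α/k; the event rates are illustrative only.

```python
from scipy.stats import norm

def total_2n_two_proportions(p_c, p_i, alpha, power=0.90):
    """2N = 4 p_bar (1 - p_bar) (Z_alpha + Z_beta)^2 / (p_c - p_i)^2 (simple approximation)."""
    z = norm.ppf(1 - alpha / 2) + norm.ppf(power)
    p_bar = (p_c + p_i) / 2
    return 4 * p_bar * (1 - p_bar) * z**2 / (p_c - p_i)**2

# Hypothetical: control rate 0.40, intervention rate 0.30, overall two-sided alpha = 0.05
for k in (1, 2, 5):
    alpha_per_test = 0.05 / k   # Bonferroni-type adjustment for k primary response variables
    print(k, round(total_2n_two_proportions(0.40, 0.30, alpha_per_test)))
```

In this illustration, moving from one to five response variables increases the total sample size by roughly 40%.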

Estimating Sample Size Parameters

As shown in the methods presented, sample size estimation is quite dependent upon the assumptions made about variability of the response, the level of response in the control group, and the difference anticipated or judged to be clinically relevant [16, 143–148]. Obtaining reliable estimates of variability or levels of response can be challenging, since the information is often based on very small studies or studies not exactly relevant to the trial being designed. Bayesian methods that explicitly incorporate the uncertainty in these estimated parameters have been applied [149]. Sometimes, pilot or feasibility studies may be conducted to obtain these data. In such cases, the term external pilot has been used [148].

In some cases, the information may not exist prior to starting the trial, as was the case for early trials in AIDS; that is, no incidence rates were available in an evolving epidemic. Even in cases where data are available, other factors affect the variability or level of response observed in a trial. Typically, the variability observed in the planned trial is larger than expected or the level of response is lower than assumed. Numerous examples of this experience exist [143]. One is provided by the Physicians’ Health Study [150]. In this trial, 22,000 U.S. male physicians were randomized into a 2 × 2 factorial design. One factor was aspirin versus placebo in reducing cardiovascular mortality. The other factor was beta-carotene versus placebo for reducing cancer incidence. The aspirin portion of the trial was terminated early in part due to a substantially lower mortality rate than expected. In the design, the cardiovascular mortality rate was assumed to be approximately 50% of the U.S. age-adjusted rate in men. However, after 5 years of follow-up, the rate was approximately 10% of the U.S. rate in men. This substantial difference reduced the power of the trial dramatically. In order to compensate for the extremely low event rate, the trial would have had to be extended another 10 years to get the necessary number of events [150]. One can only speculate about reasons for low event rates, but screening of potential participants prior to entry almost certainly played a part. That is, screenees had to complete a run-in period and be able to tolerate aspirin. Those at risk for other competing events were also excluded. This type of effect is referred to as a screening effect. Physicians who began to develop cardiovascular signs may have obtained care earlier than non-physicians. In general, volunteers for trials tend to be healthier than the general population, a phenomenon often referred to as the healthy volunteer effect.

Another approach to obtaining estimates for ultimate sample size determination is to design so-called internal pilot studies [148]. In this approach, a small study is initiated based on the best available information. A general sample size target for the full study may be proposed, but the goal of the pilot is to refine that estimate based on screening and healthy volunteer effects. The pilot study uses a protocol very close to, if not identical with, the protocol for the full study, and thus parameter estimates will reflect those effects. If the protocol for the pilot and the main study are essentially identical, then the small pilot can become an internal pilot. That is, the data from the internal pilot become part of the data for the overall study. This approach was used successfully in the Diabetes Control and Complications Trial [151]. If data from the internal pilot are used only to refine estimates of variability or control group response rates, and not to revise the assumed treatment effect, then the impact of this two-step approach on the significance level is negligible. The benefit is that this design is more likely to have the desired power than if data from external pilots and other sources are relied on exclusively [147]. It must be emphasized that pilot studies, either external or internal, should not be viewed as providing reliable estimates of the intervention effect [152]. Because pilot studies have too little power to establish that no effect exists, small or absent differences may erroneously be taken as a reason not to pursue the question. A positive trend may also be viewed as evidence that a large study is not necessary, or that clinical equipoise no longer exists.

Our experience indicates that both external and internal pilot studies are quite helpful. Internal pilot studies should be used if at all possible in prevention trials, where screening and healthy volunteer effects can cause major design problems. Design modifications based on an internal pilot are more prudent than allowing an inadequate sample size to yield misleading results.

One approach is to specify the number of events needed for a desired power level. Obtaining the specified number of events requires a number of individuals followed for a period of time. How many participants and how long a follow-up period can be adjusted during the early part of the trial, or during an internal pilot study, but the target number of events does not change. This is also discussed in more detail in Chaps. 16 and 17.

Another approach is to use adaptive designs which modify the sample size based on an emerging trend, referred to as trend adaptive designs (see Chaps. 5 and 17). Here the sample size may be adjusted for an updated estimate of the treatment effect, δ, using the methods described in this chapter. However, an adjustment must then be made at the analysis stage which may require a substantially larger critical value than the standard one in order to maintain a prespecified α level.