Introduction

Every clinical trial should be planned in advance. This plan should include the study’s objectives, primary and secondary endpoints, data collection, inclusion and exclusion criteria, required sample size with scientific justification, statistical methodology, and an approach to handling missing data [1]. A sample size calculation is used to determine the minimum number of participants needed in a clinical trial to answer the research question under investigation. During the planning phase of a clinical trial, sample size estimation should be one of the first and most important components of the study design to be considered. Knowing the anticipated sample size allows investigators to determine whether a study is feasible, to develop an appropriate budget, and to identify the resources needed to carry out the study. The calculation of a sample size with a sufficient level of significance and power is essential to the success of a trial.

Requirements for Sample Size Calculation

The estimation of sample size involves the consideration of multiple components, including the study’s objective and primary hypothesis, the type of endpoint to be analyzed, the expected treatment effect and its variability, the treatment allocation ratio (if more participants are to be randomized to one group than to the other), the anticipated recruitment rate, and the estimated number of dropouts. Other parameters influencing the sample size calculation include the Type I and Type II error rates and power [1, 2].

Types of Error and Power

Consider the multisite randomized clinical trial comparing operative and nonoperative treatment using accelerated functional rehabilitation for acute Achilles tendon ruptures [3]. For the primary outcome of rerupture, the null hypothesis, denoted \(H_{0}\), would be that there is no difference between the two population proportions of rerupture. That is, there is no difference in the rate of rerupture between those with acute Achilles tendon rupture undergoing surgical repair and those treated nonoperatively. The alternative hypothesis (for a two-sided test; typically denoted \(H_{a}\)) is that there is a difference in the rate of rerupture. A Type I error, commonly referred to as the significance level and denoted α, is the probability of erroneously rejecting the null hypothesis when it is in fact true. In this example, a Type I error would be concluding that there is a difference in the rate of rerupture between treatment procedures when no such difference actually exists, i.e., a false positive. A Type II error, denoted β, is the probability of failing to reject a false null hypothesis; that is, erroneously missing an actual difference in rerupture rates between treatment procedures, a false negative. Power (equal to 1 − β) is the probability of rejecting the null hypothesis when it is false and should be rejected (Table 16.1) [1, 2].

Table 16.1 Summary of type I and II errors

Study’s Primary Hypothesis

The primary purpose of a clinical trial, written as a scientific hypothesis, guides the design of the trial. Traditionally, a two-arm parallel-group design is employed to look for a difference between treatments (two-sided). A two-sided p-value gives the probability, assuming the null hypothesis (\(H_{0}\)) is true, of observing a result at least as extreme as the one obtained. When the p-value is small (say, p-value < 0.05), the null hypothesis is rejected (reject \(H_{0}\)) and there is evidence to support a difference in treatment effects. The direction of the test statistic establishes whether the new treatment is superior or inferior to the control treatment. In some instances, there is no interest in rejecting the null hypothesis in both directions (i.e., there is no interest in an inferiority result), and a superiority trial may be preferred to examine whether a new treatment is superior (better) to the established alternative (one-sided) [4].

While the traditional approach is intended to determine whether there is a difference between the experimental treatment and control, this may not be the relevant approach when the control is known to be effective and it is hoped that the experimental treatment can be shown to be as effective. In this instance, it is usually the case that the experimental treatment offers other advantages over the control treatment, such as convenience or tolerability, provided it can be shown to be as effective as the control. Equivalence trials are designed to establish that the new procedure is neither worse nor better than the conventional procedure; rejecting the null hypothesis requires that the two treatment effects be identical within some acceptable margin, δ (normally ±20%) [5]. Lastly, for a non-inferiority trial, the aim is to show that the new treatment is as good as or better than (no worse than) the established treatment [4]. Each of these designs is selected according to the study’s primary hypothesis and relies on prior information about the effects of the new procedure on a specific endpoint [1].

Study Design Considerations

Various study designs, such as parallel-group, crossover, factorial, or cluster designs, may be employed to address a study’s objectives and ensure the required sample size is achieved. Each design has its own approach to sample size calculation. When the event of interest is rare, a multisite trial is more likely to be needed to recruit a sufficient number of participants.

Study Endpoint and Expected Response

A study’s endpoint, whether continuous, dichotomous, or time-to-event, will govern the type of model and sample size calculation. In the case of multiple comparisons, an adjustment to the significance level may be necessary. For a continuous endpoint, information on the expected central tendency (mean) and variability (standard deviation) of the new procedure and its comparator is needed to more precisely estimate the sample size. The greater the variation within groups, or the smaller the expected difference between groups, the larger the sample size needed to achieve the same power. For a dichotomous endpoint, the proportion of participants achieving success in each group is needed. Most importantly, the expected treatment effect, relative to the comparator, should be clinically meaningful [1].

Participant Retention Rates and Treatment Allocation

While sample size calculations determine the required number of participants for specific analyses, other aspects of recruitment should also be considered, such as screen failures, dropouts, and patients who are lost to follow-up. A trial should enroll more subjects to account for potential dropouts and those lost to follow-up. Attrition rates can vary tremendously: a rate of ≤5% is of little concern, but ≥20% poses serious threats to the validity of the trial [6]. Most RCTs (60–89%) published in leading journals have missing endpoint data, with complete case analysis the most frequently used strategy for handling these missing data [7, 8]. In a notable share of these trials (18%), dropout rates exceeded 20% [8, 9]. For this reason, in trials where the primary outcome measure is continuous or binary, the number of enrollments can be determined by adjusting the sample size for the estimated dropout rate using the formula Enrollment = Sample Size/(1 − dropout rate) [1]. For time-to-event (survival) data, the adjustment for dropout is more involved. In some instances, interim analyses may be requested to monitor treatment effects and ensure enrollment follows a specific trajectory [10, 11].
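
As a simple illustration of this enrollment adjustment, the Python sketch below inflates the number of evaluable participants to the number that would need to be enrolled; the function name and the input values are hypothetical, chosen only to show the arithmetic.

import math

def adjust_enrollment(sample_size, dropout_rate):
    """Inflate a required sample size to account for anticipated dropouts."""
    # Enrollment = Sample Size / (1 - dropout rate), rounded up to a whole participant
    return math.ceil(sample_size / (1.0 - dropout_rate))

# Hypothetical values: 200 evaluable participants required, 15% anticipated dropout
print(adjust_enrollment(200, 0.15))  # -> 236 participants to enroll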

If one treatment arm is anticipated to have a greater dropout rate than its comparator, an unequal treatment allocation may be employed to ensure a balanced distribution at the end of the trial. Additionally, varied allocation and enrollment can occur in cases where it is unethical to assign an equal number of patients to each arm (e.g., placebo or sham treatment) [1]. Thus, sample size is adjusted in these scenarios. Note that departures from 1:1 randomization will increase the sample size requirement.

Conventional Guidelines

In sample size calculations, the level of significance (α) for a study is typically assumed to be 0.05 (or 5%) [12], although 1% or less may be used for larger samples and 10% for smaller samples. The minimum power for which sample size is typically calculated is 80%. Higher power may be used to provide a more conservative estimate of sample size in case treatment effects or recruitment are smaller than anticipated.
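
These conventional choices correspond to the standard normal critical values that appear in the formulas of the next section; a minimal Python sketch, assuming only the standard normal distribution from scipy, is shown below.

from scipy.stats import norm

alpha, power = 0.05, 0.80          # conventional two-sided significance level and power
z_alpha = norm.ppf(1 - alpha / 2)  # critical value z_{alpha/2}, approximately 1.96
z_beta = norm.ppf(power)           # critical value z_{beta}, approximately 0.84
print(round(z_alpha, 2), round(z_beta, 2))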

Calculation of Sample Size

There are many approaches to sample size estimation, with some of the more common calculations involving the comparison of two means, proportions, or a time-to-event measure and testing for a difference between groups. The next few sections describe these in more detail.

Comparing Two Means

The formula for calculating sample size when comparing the means of two treatment arms is

$$n_{1} = \kappa n_{2};\quad n_{2} = \left( 1 + \frac{1}{\kappa} \right)\left[ \frac{\left( z_{\alpha/2} + z_{\beta} \right)^{2}}{d^{2}} \right] = \left( 1 + \frac{1}{\kappa} \right)\left[ \frac{\left( z_{\alpha/2} + z_{\beta} \right)^{2}\left( \sigma_{1}^{2} + \sigma_{2}^{2} \right)}{2\left( \mu_{1} - \mu_{2} \right)^{2}} \right],$$

where \(z_{\alpha /2}\) is the critical value of the standard normal distribution at α/2 (e.g., 1.96 for a two-sided Type I error of α = 0.05), \(z_{\beta }\) is the critical value of the standard normal distribution at β (e.g., 0.84 for 80% power, i.e., Type II error β = 20%), \(\kappa = n_{1}/n_{2}\) is the matching (allocation) ratio, \(\mu_{i}\) is the population mean of the endpoint in group i, \(\sigma_{i}^{2}\) is the population variance of the endpoint in group i, and d is Cohen’s effect size [13]. For studies with 1:1 randomization, \(\kappa = 1\).
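
As an illustration, the formula can be evaluated directly in Python; the function name and the example means and standard deviations below are hypothetical, chosen only to show the calculation for a 1:1 allocation with conventional α and power.

from math import ceil
from scipy.stats import norm

def n_two_means(mu1, mu2, sd1, sd2, alpha=0.05, power=0.80, kappa=1.0):
    """Per-group sample sizes for comparing two means with a two-sided test."""
    z_alpha = norm.ppf(1 - alpha / 2)   # 1.96 for alpha = 0.05
    z_beta = norm.ppf(power)            # 0.84 for 80% power
    # n2 from the formula above; 1/d^2 = (sd1^2 + sd2^2) / (2 * (mu1 - mu2)^2)
    n2 = (1 + 1 / kappa) * (z_alpha + z_beta) ** 2 * (sd1 ** 2 + sd2 ** 2) / (2 * (mu1 - mu2) ** 2)
    n1 = kappa * n2
    return ceil(n1), ceil(n2)

# Hypothetical example: detect a 5-point difference in means, standard deviation of 12 in each group
print(n_two_means(mu1=50, mu2=45, sd1=12, sd2=12))  # -> (91, 91)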

Comparing Two Proportions

The formula for calculating sample size comparing two proportions is

$$n_{1} = \kappa n_{2} {;}\quad n_{2} = \left[ {\frac{{p_{1} \left( {1 - p_{1} } \right)}}{\kappa } + p_{2} \left( {1 - p_{2} } \right)} \right] \left( {\frac{{z_{\alpha /2} + z_{\beta } }}{{p_{1} - p_{2} }}} \right)^{2} ,$$

where \(p_{i}\) is the population proportion of group i, and \(p_{1} - p_{2}\) is the effect size or difference desired to be detected [13].
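
A corresponding Python sketch is shown below; the function name and the example proportions are hypothetical and serve only to illustrate the formula. Software that pools the two proportions or applies a continuity correction may return slightly different (typically larger) sample sizes.

from math import ceil
from scipy.stats import norm

def n_two_proportions(p1, p2, alpha=0.05, power=0.80, kappa=1.0):
    """Per-group sample sizes for comparing two proportions with a two-sided test."""
    z_alpha = norm.ppf(1 - alpha / 2)
    z_beta = norm.ppf(power)
    n2 = (p1 * (1 - p1) / kappa + p2 * (1 - p2)) * ((z_alpha + z_beta) / (p1 - p2)) ** 2
    n1 = kappa * n2
    return ceil(n1), ceil(n2)

# Hypothetical example: 60% success with the new treatment versus 40% with the comparator
print(n_two_proportions(0.60, 0.40))  # -> (95, 95)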

Comparing Time-to-Event

The formula for calculating sample size for a time-to-event analysis (Cox proportional hazards model) is

$$n = \frac{1}{{p_{1} p_{2} p_{A} }}\left( {\frac{{z_{\alpha /2} + z_{\beta } }}{{\ln \left( \theta \right) - \ln \left( {\theta_{0} } \right)}}} \right)^{2} ,$$

where \(p_{i}\) is the proportion of participants allocated to group i, \(p_{A}\) is the overall event rate (the proportion of all participants expected to experience the event), \(\theta\) is the hazard ratio, \(\theta_{0}\) is the hazard ratio under the null hypothesis (typically 1), and \(\ln \left( \theta \right) - \ln \left( {\theta_{0} } \right)\) is the regression coefficient for the treatment indicator [13, 14]. Note that sample size formulae accounting for the length of the recruitment and follow-up periods, and for dropouts, are more sophisticated.
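
Under these definitions, the formula can be sketched in Python as follows; the hazard ratio, overall event probability, and function name are hypothetical and intended only to illustrate the calculation for equal allocation with \(\theta_{0}\) = 1.

from math import ceil, log
from scipy.stats import norm

def n_cox(hr, p_event, alloc1=0.5, alloc2=0.5, hr0=1.0, alpha=0.05, power=0.80):
    """Total sample size for detecting a hazard ratio with a Cox model, two-sided test."""
    z_alpha = norm.ppf(1 - alpha / 2)
    z_beta = norm.ppf(power)
    # Required number of events, inflated by the overall event probability to give total n
    return ceil(((z_alpha + z_beta) / (log(hr) - log(hr0))) ** 2 / (alloc1 * alloc2 * p_event))

# Hypothetical example: target hazard ratio of 0.7 with 60% of participants expected to have an event
print(n_cox(hr=0.7, p_event=0.60))  # -> 412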

Available Software

Statistical software packages with tools for sample size and power analysis include SAS (SAS Institute, Inc.; Cary, NC), G*Power (Faul, Erdfelder, Lang, & Buchner, 2007), PASS (NCSS, LLC; Kaysville, Utah), R (The R Foundation for Statistical Computing; Vienna, Austria), Mplus (Muthén & Muthén; Los Angeles, CA), and PS, available online from Vanderbilt University (Dupont & Plummer, 1990) [15]. Several of these packages are available at no cost.

Common Pitfalls Related to and Affecting Sample Size

Sample size calculations pose several challenges, including obtaining an accurate estimate of treatment effects, selecting an appropriate power and significance level, and even selecting the correct formula to be used [16]. As a result, sample size underestimation or overestimation may occur.

Sample Size Underestimation

Sample size underestimation refers to a calculated sample size for a trial that is smaller than the one actually required [16]. This results in lower power than intended and may lead to misleading results, such as the conclusion of no treatment effect (p-value > α) when one really exists; that is, a clinically significant treatment effect fails to reach statistical significance. Recruiting too few participants can thus lead to inconclusive results because of the low likelihood of finding a clinically relevant difference statistically significant.

Revisiting the Achilles tendon rupture trial, a small sample size was a limitation of the study (72 participants per arm), which was therefore underpowered to draw a definitive conclusion about rerupture rates [3]. A meta-analysis had shown rerupture rates of approximately 2.8% following operative repair and 11.7% for nonoperative treatment [17]; detecting a difference of that size would require 104 participants in each group using a one-sided two-sample test of independent proportions with a significance level of α = 0.05. Instead, rates of 2.8% and 4.2% for operative and nonoperative treatment, respectively, were observed; detecting that smaller difference would require 2148 participants per arm (Fig. 16.1). Consequently, the trial underestimated the required sample size. Although the actual power for comparing rerupture rates was 12% (Fig. 16.2), corresponding to a Type II error of 88%, this study was the largest of its kind to date, and its findings provide clinical insight and pilot data should a larger trial be pursued.
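
The reported power of roughly 12% can be reproduced, approximately, with a normal-approximation calculation using the observed rates and the 72 participants per arm noted above; the sketch below is illustrative, and exact or software-specific methods (such as those used for Fig. 16.2) may differ slightly.

from math import sqrt
from scipy.stats import norm

def power_two_proportions_one_sided(p1, p2, n_per_arm, alpha=0.05):
    """Approximate power of a one-sided two-sample test of independent proportions."""
    se = sqrt(p1 * (1 - p1) / n_per_arm + p2 * (1 - p2) / n_per_arm)
    z_alpha = norm.ppf(1 - alpha)
    return norm.cdf(abs(p1 - p2) / se - z_alpha)

# Observed rerupture rates of 2.8% (operative) and 4.2% (nonoperative), 72 participants per arm
print(round(power_two_proportions_one_sided(0.028, 0.042, 72), 2))  # -> 0.12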

Fig. 16.1

Sample size estimation for comparing rerupture rates, varying rates in the nonoperative group [created through the use of: PASS 14 Power Analysis and Sample Size Software (2015). NCSS, LLC. Kaysville, Utah, USA, ncss.com/software/pass]

Fig. 16.2

Power analysis for observed difference in rerupture rates [created through the use of: R (The R Foundation for Statistical Computing; Vienna, Austria)]

Sample Size Overestimation

On the contrary, a sample size selected to be much larger than required describes sample size overestimation [16]. Studies that are too large are problematic for at least two reasons. First, exceptionally strong statistical significance (a very small p-value) may indicate that more subjects were exposed to an inferior treatment than necessary, or that resources were wasted, which raises ethical concerns. Second, with larger sample sizes, smaller differences can be detected and reach statistical significance even when the difference is not clinically meaningful. Overestimation can occur when each assumption in the trial design is made too conservatively, to avoid the risk of failure, and the analysis of the study’s primary objective becomes overpowered as a result.

Selecting a Clinically Meaningful Difference

Determining the clinically meaningful difference that a study is powered to detect is generally the most difficult task in the sample size calculation process. A very thorough literature search should be conducted to obtain any available data on the potential effect of the proposed new treatment. This may include published abstracts, results of phase II trials or pilot studies, and subgroup analyses from previously conducted trials. If enough publications are available, meta-analysis techniques can be used to obtain an estimate of the potential treatment effect.

Often, the data available to inform the estimate of the potential treatment effect are limited. In those cases, an investigator might look to other published studies in the area to determine the magnitude of effect used when those studies were designed. Often, the FDA has determined the degree of treatment effect needed to establish efficacy, and its guidelines may be a useful resource. Additionally, a panel of experts in the area of investigation can be convened to develop a consensus estimate of the treatment effect.

Available Databases

There are multiple databases available for obtaining estimates for sample size calculations. In 1994, the VA established the VA National Surgical Quality Improvement Program (NSQIP), in which all medical centers performing major surgery participated [18]. The database contains 135 variables collected preoperatively and up to 30 days postoperatively. Data are categorized as demographic, surgical profile, preoperative, intraoperative, or postoperative. Each hospital submits an average of 1,600 major operations per year to the database [19]. While the aim of NSQIP was initially quality improvement in surgical care through periodic reports and assessments of performance, VA investigators can also query the database for scientific research purposes and to obtain event-rate estimates for a power analysis, such as rates of mortality, cardiac and noncardiac complications, postoperative pneumonia, intubation, pulmonary embolism and venous thrombosis, renal dysfunction, and infection. Similarly, the American College of Surgeons National Surgical Quality Improvement Program (ACS NSQIP) can be used for sample size estimation, as in the comparison of postoperative complication rates for regional versus general anesthesia among surgical patients with chronic obstructive pulmonary disease [20, 21]. Other useful databases include the Society of Thoracic Surgeons (STS) National Database, which includes separate databases for adult cardiac, general thoracic, and congenital heart surgery [22], and the Centers for Disease Control and Prevention (CDC) cancer registries [23].