1 The Concept of Optimal Study Size

Assuming that a study is perfectly valid, we are presented with an issue of how informative that study will be, a main determinant of which is sample size or study size. These synonymous terms are used to describe the number of subjects in a study. A concern of most reviewers of study proposals is whether the amount of information produced by a study will be large enough to make a ‘substantial contribution.’ Therefore, study size becomes a critical issue not only for planning studies but also for obtaining funding.

This situation raises many difficult questions: What is a ‘substantial contribution?’ Is a ‘substantial contribution’ the same for all stakeholders and for all study objectives? Is there – or under what conditions is there – a way to determine optimal study size? These questions demand complex answers that have been the subjects of many textbooks. Space constraints preclude a fully comprehensive review of this topic here; however, the present chapter deals with these questions in an introductory manner and provides useful tools and equations to make arguments about study size. Selected terms and concepts relevant to this discussion are listed in Panel 7.1.

1.2 Factors Influencing Optimal Study Size

The planning of study size is often presented as a pure matter of statistical calculations of sample size and/or power. We urge readers who have not already done so to embrace a broader view: that there are generally no statistical means of determining what the optimal study size is (Miettinen 1985) because non-statistical factors greatly influence what is considered to be the optimal study size. A few reflections may clarify and illustrate this point.

First, problems with internal validity reduce the amount of information gained by all studies to varying degrees, even when that information is generated from very large studies. If we imagine that an idealized perfect study provides 100 ‘units of information,’ any real study will provide only a fraction of that amount. In a real study, the maximum amount of information that can be provided depends on the design of the study, and as the study progresses, information is lost to various errors and issues. Thus, even if it were possible to design a perfect study with the potential to provide 100 ‘units of information,’ measurement error and even the most minor ethical mishaps (both of which are impossible to avoid completely) would cause information to be ‘lost.’ Since we cannot measure the amount of information that a study can potentially provide or how much has been lost, we cannot account for ‘information loss’ using statistical means of determining optimal study size. By extension, therefore, using only statistics to determine optimal study size is impossible. Statistical calculations of study size do indeed help us to maximize the likelihood that a given study will provide an acceptable amount of information; however, study size calculations must always be contextualized and modified with non-statistical factors.

Second, apart from validity concerns and ‘information loss,’ a variety of additional factors co-determine what study size will be perceived as useful or optimal, and these factors tend to differ according to the study design. For example, the optimal size of a Phase-1 trial is very low (typically n = 6–12) because of ethical considerations. This type of study carries high risk because it represents the first time humans are exposed to a new pharmaceutical formulation. The risks are unknown and therefore considered to be very high; thus, an optimal sample size might be n = 10 in spite of statistical arguments suggesting that n = 50 is better. In this case, ethical considerations are valued more highly than the additional information that 40 more participants would provide. On the other hand, a Phase-3 trial might be very large (on the order of n = 10,000). At this stage, the short-to-medium-term safety and tolerability of a pharmaceutical formulation is better understood, and only reasonably safe interventions are considered suitable for a Phase-3 trial. In this study design, the primary objective is to assess effect sizes, which may be quite small, and medium-to-long-term safety, which can relate to rare events. Consequently, the optimal size of a Phase-3 trial using the same formulation as a Phase-1 study might be three orders of magnitude larger.

Third, financial and other resource limitations can ultimately weigh heavily on the perceived usefulness of study size. Since resource availability is sometimes dynamic during a study, the perceived optimal size of a study can change during the data collection phase of a prospective study. Such changes in perceived optimal study size can relate to an entire study, but sometimes optimal study size changes for one specific aim but not another. To illustrate this point, let us consider a 3-year-long prospective study in which Specific Aim 1 is to investigate whether zinc deficiency increases the risk of acquiring acute cholera and the severity of the disease, and Specific Aim 2 is designed to test whether various factors are effect modifiers for the effect of zinc deficiency. In year 2, the funding agency experiences financial difficulties that force redistribution of research funding, requiring the research team to scale back the study. Since effect modification tends to increase the optimal study size considerably, the research team and the funding agency meet and agree to re-craft Specific Aim 2 to address only the two most important effect modifiers. Such a decision reduces the overall study size by 30 % and reduces the optimal study size for Specific Aim 2, while potentially having no study size consequences for Specific Aim 1.

From the above we deduce that scientific, ethical, and practical concerns drive study size planning. Though we highlighted only one factor for each of the three dimensions, many factors may need to be considered when optimal size is to be determined. Table 7.1 summarizes several of these factors, most of which are derived directly from the general principles of epidemiology (See: Chap. 1). The discussion above and Table 7.1 indicate that study size optimization is a complex process involving the simultaneous consideration of numerous counteracting phenomena.

Table 7.1 Considerations for determining optimal study size for a single specific aim

1.3 Useful Precision or Power

One of the major considerations listed in Table 7.1 concerns a desired limit of precision for an estimate or a minimum power and significance for the detection of an anticipated effect, beyond which evidence is considered increasingly useless. The epidemiological and statistical literature on sample size has mainly focused on this aspect, and later in this chapter we will further expand on such statistical aspects of study size planning. As mentioned previously, the use of statistics to determine the optimal study size must be contextualized with the factors discussed above and in Table 7.1.

1.3.1 The Range of Useful Precision

Outcome parameter estimates consist of a point estimate surrounded by an interval estimate. For example, a point estimate of a prevalence rate is surrounded by a 95 % confidence interval. The interval estimate is an expression of the uncertainty surrounding the point estimate and derives mainly from sampling variation as well as measurement variation/error. In general, the degree of uncertainty is inversely related to the size of the study. On one hand, if a study is too small, the uncertainty may increase to a level considered to be undesirable or useless (an exception, however, is that well-designed small studies may contribute meaningfully to later meta-analyses; see: Chap. 25). On the other hand, as the study size increases, the degree of uncertainty decreases, and the interval estimate becomes narrower.

In the latter case, increasing study size to achieve higher precision may be considered undesirable and useless beyond some threshold. For example, in case–control studies it is generally accepted that having more than 4 ‘controls’ for each case is inefficient. In practice, the starting question is thus often: What range of study sizes – at analysis – will give us an interval estimate narrow enough to be considered useful but not so exceedingly narrow as to be inefficient? Note that this refers to final precision, after any necessary adjustments in the analysis, e.g., after corrections for misclassification of outcome or determinant.

Can this range in fact be determined, considering that ‘usefulness’ and ‘optimal’ are both subjective perceptions? We argue that subjectivity does not imply total arbitrariness. As pointed out by Snedecor and Cochran (1980), what is needed is careful thinking about the use to be made of the estimate and the consequences of a particular margin of error. In this respect, the researcher planning study size may consider that:

  • Perceived usefulness of a particular precision is often influenced by the fact that narrower confidence intervals enhance the precision of any subsequent projections of cost or efficiency of envisaged larger-scale policies. Thus, when the research study falls within a comprehensive evaluation of a possible new policy, high precision tends to become a necessity.

  • It may be necessary to get the opinion of some stakeholders on the matter (especially those that are providing funding). Sometimes there is an explicit wish of a sponsor to obtain evidence with a specific margin of uncertainty, e.g., a desire to know the prevalence ‘within ± 1 %.’ This is frequently the case in diagnostic particularistic studies, such as surveys. The stated reasons for this are not always clear. Perhaps a similar margin was used in a previous study about the same occurrence and, if a similar level of precision is reached at the end of the new study, a 2 % or higher increase in prevalence could then roughly be seen as evidence for the existence of a real change in prevalence, although this is not the ideal way to make such a determination (Altman et al. 2008). When there is such a desirable margin of uncertainty, the required study size to achieve this is usually easy to calculate (See: Sect. 7.4). When using this approach one should not forget to take into account possible necessary adjustments, perhaps for an expected refusal rate, the sampling scheme, finite population correction, measurement error, covariate adjustment, or other reasons, as will be discussed below.

  • The perceived clinical or community health relevance of particular effect sizes is important. Stakeholders sometimes set a prior threshold for an effect size as a basis for decisions, e.g., about pursuing further research, about further development of a drug or clinical strategy, or about further exploration of a public health policy. For instance, it may be stated that ‘only if the effect can, with reasonable certainty, be larger than x can it be considered clinically relevant.’ This type of expectation is generally easier to take into account using a power-based outlook (See: below) rather than a precision-based outlook, although the latter has also been proposed (e.g., Greenland 1988; Bristol 1989; Goodman and Berlin 1994).

  • When there is a desire for very high precision, it is unrealistic to aim for a precision that is so high that it approximates the expected variation due to measurement error.

  • There are possible (dis)advantages of wide or narrow confidence intervals, beyond issues of cost and feasibility. Narrow confidence intervals can give a false impression of validity and a false impression of generalizability (Hilden 1998). Wide confidence intervals often give a false impression of lack of validity.

Sometimes, given the multiplicity of factors influencing optimal study size (Table 7.1), there is little room for choosing a sample size in studies that plan for estimation. For example, there may be an upper limit to study size that is lower than statistical calculations suggest. The question may then become: given the maximum sample size imposed, will the precision be useful and worth the effort, resources, and potential risks? The issue of sample size calculation then becomes an issue of precision calculation.

1.3.2 Power to Detect an Anticipated Effect with a Chosen Confidence

Many studies plan for statistical testing. In such studies the outcome parameters are test statistics with P-values, e.g., a t-test statistic with an associated P-value. Statistical power is interpreted as the probability of detecting a statistically significant association of a particular magnitude or greater (Daly 2008). An important question in study size planning is then often: What range of study sizes – at analysis – will give enough statistical power (e.g., one often uses a power of 80 % at a 95 % level of confidence) to detect true differences of magnitudes considered meaningful? If the true effect is smaller than this anticipated meaningful effect, then we can accept a non-significant test result (Daly 2008). This refers to final power after any necessary adjustments in the analysis, e.g., after corrections for misclassification of the outcome or determinant (Edwards et al. 2005; Burton et al. 2009). Based on this, a statistical sample size calculation can often be done. In the later sections of this chapter examples will be given.

In current epidemiological practice, the abovementioned type of sample size or power calculation is frequently performed, not only in studies that plan for testing but also in studies that plan for the estimation of effects. This may partly be because the methods for precision-based sample size calculation are not yet fully part of epidemiological tradition and are less well known, less developed, and sometimes more difficult to use. This is one of the factors that perpetuate the use of statistical testing in studies that do not need it.

Sometimes there is little room for choosing a study size in studies where statistical testing is planned. The question that needs to be addressed in that case may be whether the statistical power of the study is expected to be useful or whether the power would only provide for detecting effects that are so extreme that one might as well abandon the study plans. The issue of sample size calculation then becomes an issue of power calculation.

2 The Process of Study Size Planning

As we have noted above, the process of study size planning is not only a matter of statistical calculations but also a matter of many other considerations. Below we describe the usual process of study size planning in studies where there is indeed room for choice:

In the balancing of considerations listed in Table 7.1, the focus should first go to those factors that would require reductions of the study size to zero. In other words, any major issues about design validity and ethics should be addressed and solved first. This is obviously something that should already have been done at the stage of designing general objectives, specific aims, and general study design. However, it happens regularly that proposal reviewers, statisticians consulted for sample size and power calculations, and even article reviewers come across problems of this nature. This suggests that it is valuable to reconsider this aspect at this stage of the study planning process.

Conditional on satisfying concerns about design validity, design efficiency, and ethics, remaining major issues are the useful precision of statistical estimates (or, when statistical testing is planned, the statistical power to detect useful effects with some degree of certainty) and the costs of various hypothetical study sizes. Both may need to be calculated and balanced. At this stage statistical methods may be useful. Once an opinion is formed regarding the optimal study size, the next step is to project what sizes at preceding study stages (recruitment, sampling, eligibility screening, and enrollment) are expected to lead up to this optimal size at analysis. This determination will require considerations of expected rates of non-contact, refusal, and attrition as well as anticipated adjustments for measurement error and confounders, etc.

The process described is repeated for each specific aim separately. It may then turn out that optimal sizes, at analysis or before, for different specific aims are incompatible. This may even lead to the abandonment of one or more of the initial specific aims or to their ‘downgrading’ to a secondary or tertiary level aim because of expected lack of useful precision or power. The balancing exercise may also entail other study design changes, e.g., a choice for a more efficient design, a choice for another measurement level for the outcome variable, a reduction in the size of a reference series, etc. The balancing effort may even lead to the conclusion that financial resources, time, and availability of subjects do not allow for continuation of the study plans (Miettinen 1985).

3 The ‘Sample Size and Power’ Section of the Study Proposal

Written justifications of the chosen study size are usually located in the ‘sample size and power’ section of the study proposal or the methods section of a paper. These justifications may need to include elements listed in Panel 7.2.

Let us now look at study size planning and sample size calculation in some particularly common situations. In the next sections we will discuss the optimal size of surveys, cohort studies, case–control studies, and trials. For each of those we will discuss how optimal sample size is influenced by ethical, scientific, and practical concerns. As far as sample size calculation is concerned, the next sections give examples of both the precision- and the power-based approach; however, we do not present both approaches for any one scenario. If the alternative approach is desired, we recommend consulting Kirkwood and Sterne (2003) or another book on medical statistics or sample size planning.

4 Size Planning for Surveys

A typical survey addresses multiple specific aims, each of which often contains two or more sub-aims. These sub-aims frequently entail subgroup analyses, such as comparisons of estimates for different catchment areas or across subgroups (e.g., age categories). The planning of study size therefore often requires an extensive exploratory phase to determine the size requirements for different subgroup analyses and to use this information to derive an optimal size for the entire study. A common approach in determining optimal study size is to prioritize the specific aims and sub-aims and to consider each in the context of resource limitations. This process may then lead to revisions to or refinements of one or more specific aims and/or sub-aims. Such revisions sometimes involve abandoning certain sub-aims (especially those that are very resource intensive). A common alternative approach is to retain the sub-aim in question while acknowledging that findings may be statistically imprecise or underpowered.

As discussed in Chap. 9, target populations are rarely studied in their entirety, unless that population is very small and is contained within a small area. Attempting to survey an entire target population becomes increasingly inefficient as the size of the target population or its catchment area increases, introducing an ethical issue concerning the appropriate use of limited resources. Therefore, large surveys are more likely to require statistical sampling and, as will be discussed below, the sampling proportion then becomes an important consideration in the sample size calculation. When exploring the study size implications of the various research questions addressed in the survey, the following sample size calculations may be helpful.

A note on notation:

  • N (capital N) refers to the size of the target population
  • n (lower-case n) refers to the size of the sample

4.1 Sample Size Calculation for Estimating a Prevalence

When the purpose of a survey is to estimate the prevalence of a health phenomenon, a main concern is the degree of confidence in the prevalence estimate. Therefore, it is said that sample size calculations for estimating prevalence are precision-based. The following formula is often used (Kirkwood and Sterne 2003):

$$ n=\frac{p(1-p)}{{e}^{2}}$$
(7.1)

Where:

n = sample size for estimating a prevalence

p = expected proportion (e.g., 0.12 for a prevalence of 12 %)

e = desired size of the standard error (e.g., 0.01 for ±1 %)

As an example, consider a proposed Study A, in which one purpose is to estimate the prevalence of depression in people over the age of 18 years. Based on similar studies from another region in the same country, the researchers predict that the prevalence of depression will be 12 % (p = 0.12) in their study. They want to achieve a standard error of 1 % around their estimate (i.e., 12 % ±1 %). In order to achieve this degree of precision, the researchers will likely need to have at least n = 1056 participants (based on Eq. 7.1), a value that can be usefully rounded up to n = 1100.
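Equation 7.1 is easy to automate. The following sketch (the function name is ours, not the chapter's) reproduces the Study A calculation using only the Python standard library:

```python
import math

def n_for_prevalence(p: float, e: float) -> int:
    """Sample size needed to estimate a prevalence p with standard error e (Eq. 7.1)."""
    # Round before taking the ceiling to guard against floating-point noise.
    return math.ceil(round(p * (1 - p) / e ** 2, 6))

# Study A: expected prevalence of depression 12 %, desired standard error 1 %
n = n_for_prevalence(p=0.12, e=0.01)
print(n)  # 1056
```

Rounding the result up to a convenient figure (here, 1,100) is then a planning judgment, not a statistical one.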

It is important to note that the above equation is valid only if the calculated sample size (n) is less than 5 % of the target population (N). The value n/N is known as the sampling fraction. If n/N is greater than 0.05 (i.e., 5 %), then a finite population correction is necessary. In practice, such a correction is rarely needed, as most studies sample well under 5 % of the target population; we therefore do not discuss the topic further in this chapter.

4.2 Sample Size Calculation for Estimating a Mean

A similar approach is used when the purpose of the survey is to estimate the mean value of a continuous health-related parameter. Again, a main concern is the degree of confidence in the estimated mean; therefore, in this case too a precision-based approach is useful. The following formula is often used:

$$ n=\frac{{\sigma }^{2}}{{e}^{2}}$$
(7.2)

Where:

n = sample size for estimating a mean

σ = expected standard deviation, and e = desired size of the standard error

As an example, consider a sub-aim of Study A, in which the goal is to compare the mean body mass index (BMI) of participants with depression and those without depression. Previous studies of the target population indicate that the standard deviation of BMI is expected to be 4.0 kg/m² (σ = 4.0), and the desired standard error is 0.5 kg/m² (e = 0.5). Based on Eq. 7.2, Study A will likely need at least n = 64 participants in each group to achieve the desired degree of precision. This is a very realistic proposition because the researchers anticipate a 12 % prevalence of depression with a sample size of n = 1,100; therefore, the smallest group in which BMI will be measured will likely comprise 0.12 × 1,100 = 132 participants. This sub-aim of Study A will therefore likely achieve greater-than-desired precision.
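The BMI sub-aim can be checked the same way; this sketch (function name ours) implements Eq. 7.2:

```python
import math

def n_for_mean(sigma: float, e: float) -> int:
    """Sample size needed to estimate a mean with standard error e,
    given an expected standard deviation sigma (Eq. 7.2)."""
    # Round before taking the ceiling to guard against floating-point noise.
    return math.ceil(round(sigma ** 2 / e ** 2, 6))

# Study A sub-aim: BMI, with sigma = 4.0 and desired standard error 0.5 (kg/m^2)
n = n_for_mean(sigma=4.0, e=0.5)
print(n)  # 64
```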

4.3 Sample Size Calculations When Comparing Proportions

When the outcome parameter is the difference between two proportions, such as prevalence estimates, a power-based approach is usually taken to calculate sample size. The following series of formulas can be useful:

$$ n=\frac{{c}_{pp}\left[{p}_{1}\left(1-{p}_{1}\right)+{p}_{2}\left(1-{p}_{2}\right)\right]}{{\left({p}_{1}-{p}_{2}\right)}^{2}}$$
(7.3)

Where:

n = sample size for estimating the difference between two proportions

cpp = constant defined by the selected P-value and desired power

p1 = expected prevalence in group 1

p2 = expected prevalence in group 2

The constant cpp is determined by taking the square of the sum of the Z scores for the selected P-value and desired power:

$$ {c}_{pp}={\left({Z}_{\alpha }+{Z}_{\beta }\right)}^{2}$$
(7.4)

Where:

cpp = constant defined by the selected P-value and desired power

Zα = Z score defined by the P-value (See: Table 7.2)

Zβ = Z score defined by the statistical power (See: Table 7.2)

For example, the Z score for a P-value of 0.05 is equal to 1.96, and the Z score for 80 % power is 0.840. The sum of these values is 1.96 + 0.84 = 2.8, and this quantity squared is 7.84. This calculation has been performed for various common P-value and power combinations; the results of these calculations are shown in Table 7.2.
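Rather than looking up Table 7.2, cpp can be computed directly from Eq. 7.4. The sketch below (function name ours) obtains the Z scores from the standard library's NormalDist; note that using exact Z scores (e.g., 0.8416 rather than 0.84 for 80 % power) gives 7.85:

```python
from statistics import NormalDist

def c_pp(p_value: float, power: float) -> float:
    """Constant c_pp = (Z_alpha + Z_beta)^2 for a two-sided P-value and power (Eq. 7.4)."""
    z_alpha = NormalDist().inv_cdf(1 - p_value / 2)  # e.g., 1.96 for P = 0.05
    z_beta = NormalDist().inv_cdf(power)             # e.g., 0.8416 for 80 % power
    return (z_alpha + z_beta) ** 2

print(round(c_pp(0.05, 0.80), 2))  # 7.85
print(round(c_pp(0.05, 0.90), 2))  # 10.51
```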

It is critical to note that Eq. 7.3 assumes that groups 1 and 2 are of equal size. In practice, however, equal group sizes are fairly uncommon. Therefore, one may need to adjust the value n to account for unequal group sizes. After using Eq. 7.3, one can employ Eq. 7.5 to execute the adjustment:

Table 7.2 Values for cpp based on common P-value and statistical power Z scores
$$ {n}^{\prime }=\frac{n{\left(1+k\right)}^{2}}{4k}$$
(7.5)

Where:

n′ = calculated sample size with adjustment for unequal group sizes

n = calculated sample size assuming equal sample size (i.e., unadjusted)

k = the ratio of planned sample sizes of the two groups, where the larger group’s size is divided by the smaller group’s size.

Equation 7.5 can be used to adjust for unequal group sizes in other sample size calculations, not just for the comparison of two proportions.

As an example, imagine a study in which one is comparing the prevalence estimates of ovarian cancer in women aged 45–50 years versus 70–75 years. Significance is set at P < 0.05, and a power of 90 % is considered acceptable for this study. For this P-value and this power, the correct value for cpp is 10.51 (Eq. 7.4 and Table 7.2). Based on previous studies, it is hypothesized that the prevalence of ovarian cancer will be 1 % in the younger age group and 4 % in the older age group. Using this information and Eq. 7.3, n is calculated to be 564 people per group.

This value assumes that one desires groups of equal size. However, if one anticipates or desires different group sizes, the value 564 must be adjusted to account for unequal group sizes. If your study will involve 3 times as many women in the younger age category as in the older age category (as the prevalence of ovarian cancer in women aged 45–50 years is much lower), then the ratio k will be 3/1 = 3. Using this value and n = 564 (the value to be adjusted), the total sample size for the entire study, n′, will be 752. This value can be usefully rounded to 800 participants in total. Let us assume that only women in the 45–50 and 70–75-year-old age groups will be enrolled in the study. Since one plans to enroll 3 times as many women in the younger age group as the older age group, the 800 total participants will comprise 800 ÷ (3 + 1) = 200 women aged 70–75 years and 800 − 200 = 600 women aged 45–50 years (600 ÷ 200 = k = 3).
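The ovarian-cancer example can be reproduced end to end. This sketch (function names ours) chains Eqs. 7.4, 7.3, and 7.5:

```python
import math
from statistics import NormalDist

def n_two_proportions(p1: float, p2: float, p_value: float, power: float) -> int:
    """Per-group sample size for comparing two proportions, equal groups (Eq. 7.3)."""
    z = NormalDist().inv_cdf
    cpp = (z(1 - p_value / 2) + z(power)) ** 2  # Eq. 7.4
    return math.ceil(round(cpp * (p1 * (1 - p1) + p2 * (1 - p2)) / (p1 - p2) ** 2, 6))

def adjust_for_unequal_groups(n: int, k: float) -> int:
    """Adjust a sample size for a k : 1 ratio of group sizes (Eq. 7.5)."""
    return math.ceil(round(n * (1 + k) ** 2 / (4 * k), 6))

n = n_two_proportions(p1=0.01, p2=0.04, p_value=0.05, power=0.90)
n_adj = adjust_for_unequal_groups(n, k=3)
print(n, n_adj)  # 564 752
```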

4.4 Sample Size Calculation When Comparing Means

When the outcome parameter is the difference between two means, the following equation can be used to calculate the sample size:

$$ n=\frac{{c}_{pp}\left({\sigma }_{1}^{2}+{\sigma }_{2}^{2}\right)}{{D}^{2}}$$
(7.6)

Where:

n = sample size for estimating the difference between two means

σ1 and σ2 = expected standard deviations of groups 1 and 2

cpp = constant defined by the selected P-value and desired power (Table 7.2)

D = expected minimum difference between the means

For example, consider a study in which one wishes to compare the magnitude of weight loss in ovarian cancer patients being treated with regimen A or regimen B. Both groups are expected to lose weight on average; however, one hypothesizes that regimen A will be associated with greater weight loss than B. A pilot study allowed the investigator to predict the standard deviations of the weight loss to be 7 kg for A (σ1) and 9 kg for B (σ2). The investigator considers a minimum difference in weight loss of 3.0 kg (D = 3.0) to be clinically important. A power of 95 % and a P-value of 0.05 are considered adequate for this study; using these parameters and Eq. 7.4, cpp is 13.00 (Table 7.2). These pieces of information can be plugged into Eq. 7.6 to calculate a sample size of n = 188 per group. As discussed in the previous sub-section, Eq. 7.5 can be used to adjust this result to account for unequal group sizes.
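A quick sketch of Eq. 7.6 (function name ours) confirms the weight-loss example:

```python
import math
from statistics import NormalDist

def n_two_means(sigma1: float, sigma2: float, d: float,
                p_value: float, power: float) -> int:
    """Sample size for detecting a minimum difference d between two means (Eq. 7.6)."""
    z = NormalDist().inv_cdf
    cpp = (z(1 - p_value / 2) + z(power)) ** 2  # Eq. 7.4
    return math.ceil(round(cpp * (sigma1 ** 2 + sigma2 ** 2) / d ** 2, 6))

n = n_two_means(sigma1=7.0, sigma2=9.0, d=3.0, p_value=0.05, power=0.95)
print(n)  # 188
```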

5 The Size of an Observational Etiognostic Study

The sample size calculations discussed thus far relate to fairly straightforward, common scenarios in epidemiology: the estimation or comparison of proportions or means. Yet many investigators wish to address questions about etiology, i.e., the causal factors that contribute to a health phenomenon. In this section we discuss sample size calculations in two typical etiognostic research scenarios: the traditional cohort study and the traditional case–control study. Such studies require additional considerations based on specific details of the study design. For example, choosing to contrast more levels of a determinant, to study a larger number of causal co-factors, or to study more effect modifiers will tend to increase the required sample size beyond what sample size calculations suggest. On the other hand, making a strategic decision to use a more sensitive measure that does not compromise specificity will tend to reduce the required sample size (Miettinen 1985). Study design factors will need to be considered on a case-by-case basis and their implications for sample size may need to be addressed.

5.1 Sample Size Calculation for a Traditional Case–Control Study

In case–control approaches the following formula is often found helpful for developing an argumentation about study size:

$$ n=\frac{2{\left[{Z}_{\beta }\sqrt{{p}_{1}\left(1-{p}_{1}\right)+{p}_{2}\left(1-{p}_{2}\right)}+{Z}_{\alpha }\sqrt{2{p}_{ave}\left(1-{p}_{ave}\right)}\right]}^{2}}{{\left({p}_{2}-{p}_{1}\right)}^{2}}$$
(7.7)

Where:

n = sample size of a case–control study

Zα = Z score for the desired level of significance

Zβ = Z score for the desired power

p1 = expected proportion of exposure among controls, as derived from p2 and an anticipated odds ratio (See: Eq. 7.8)

p2 = expected proportion of exposure among cases

pave = average of p1 and p2

To use this equation, one must establish an anticipated odds ratio (OR). This quantity can sometimes be based on knowledge of the strength of association for other risk factors, but ultimately, the anticipated OR should be driven primarily by the hypothesis being tested. A second piece of information that must be obtained is an expected proportion of exposure among cases (p2). The value p2 can often be anticipated using external survey data, employing pilot studies, or locating relevant literature. These two pieces of information, the anticipated OR and p2, can be plugged into the following equation, which is then solved for p1:

$$ {p}_{2}=\frac{{p}_{1}(OR)}{1+{p}_{1}(OR-1)}$$
(7.8)

As an example, consider a case–control study aimed at investigating whether chronic chewing of smokeless tobacco is associated with increased odds of developing any form of mouth cancer. Patients with mouth cancer (cases) and without mouth cancer (controls) are recruited to the study. A pilot study allowed the investigator to estimate that 24 % of cases will have a history of chronic chewing of smokeless tobacco (p2 = 0.24). The investigator anticipates that the OR of having a history of chronic chewing of smokeless tobacco is 6. The values p2 = 0.24 and OR = 6 can be plugged into Eq. 7.8 to determine that p1 = 0.05. In other words, this investigator predicts that 24 % of cases and 5 % of controls will have a history of chronic chewing of smokeless tobacco. The average of p1 and p2 is equal to 0.145 (pave). The investigator has set the level of significance at 0.05 and desires to achieve a power of 90 %; the corresponding Z scores are Zα = 1.96 and Zβ = 1.282 (Table 7.2). With all of these pieces of information at hand, it is possible to calculate that this study will likely need to include n = 141 participants in total, a value that can be usefully rounded up to 150 total participants. Assuming that this case–control study uses the fairly common ratio of 4 controls for each case, this study should include approximately 30 cases and 120 controls.
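The smokeless-tobacco example can be verified with the following sketch (function names ours), which first solves Eq. 7.8 for p1 and then applies Eq. 7.7:

```python
import math
from statistics import NormalDist

def p1_from_odds_ratio(p2: float, odds_ratio: float) -> float:
    """Expected exposure proportion among controls: Eq. 7.8 solved for p1."""
    return p2 / (odds_ratio * (1 - p2) + p2)

def n_case_control(p2: float, odds_ratio: float, p_value: float, power: float) -> int:
    """Sample size for a traditional case-control study (Eq. 7.7)."""
    z = NormalDist().inv_cdf
    z_alpha, z_beta = z(1 - p_value / 2), z(power)
    p1 = p1_from_odds_ratio(p2, odds_ratio)
    p_ave = (p1 + p2) / 2
    numerator = 2 * (z_beta * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))
                     + z_alpha * math.sqrt(2 * p_ave * (1 - p_ave))) ** 2
    return math.ceil(round(numerator / (p2 - p1) ** 2, 6))

print(round(p1_from_odds_ratio(0.24, 6), 3))                 # 0.05
print(n_case_control(p2=0.24, odds_ratio=6, p_value=0.05, power=0.90))  # 141
```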

5.2 Sample Size Calculation for a Traditional Cohort Study

To develop an argument around study size for a traditional (independent) cohort study, a slightly more complicated calculation can be useful:

$$ n=\frac{{\left[{Z}_{\alpha }\sqrt{\left(1+\frac{1}{m}\right)\overline{p}\left(1-\overline{p}\right)}+{Z}_{\beta }\sqrt{\left[\frac{{p}_{1}\left(1-{p}_{1}\right)}{m}\right]+{p}_{2}\left(1-{p}_{2}\right)}\right]}^{2}}{{\left({p}_{1}-{p}_{2}\right)}^{2}}$$
(7.9)

Where:

n = the number of exposed participants required (the total sample size is n(m + 1))

Zα = Z score for the desired level of significance

Zβ = Z score for the desired power

m = the number of unexposed participants per exposed participant

p1 = the probability of the event in unexposed participants

p2 = the probability of the event in exposed participants

\( \overline{p}=(m{p}_{1}+{p}_{2})/(m+1)\)

To illustrate this equation, let us consider a study in which an investigator aims to determine whether obese adults are more likely to develop colon cancer than non-obese adults. The investigator has set the level of significance at 0.05 and desires to achieve a power of 80 % (Zα = 1.96 and Zβ = 0.842). The investigator hypothesizes that obese participants will have a twofold increase in the risk of developing colon cancer compared to non-obese participants (i.e., a relative risk or RR = 2). Since RR = 2 = p2/p1, the probability of developing colon cancer in one group (either non-obese or obese) is sufficient to predict the probability of developing the disease in the other group. Assuming that the study will last for 5 years and that the probability of developing colon cancer in non-obese participants during that time is estimated (based on previous work by other researchers) to be 5 % (p1 = 0.05), the expected probability of developing colon cancer in the obese group is 10 % (p2 = 0.10). The investigator plans to enroll three non-obese participants per obese participant (m = 3). Knowing p1, p2, and m, it is possible to calculate the value \( \overline{p}\) = 0.0625. With all of this information on hand, the investigator executes Eq. 7.9 and determines that she will likely need to enroll at least 271 obese participants and, given the 3:1 ratio, 813 non-obese participants (1,084 participants in total), an estimate that can be usefully rounded up to 275 obese and 825 non-obese participants (n = 1,100).
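Eq. 7.9 is tedious to evaluate by hand. The following sketch (illustrative; the function name is ours, and Zβ = 0.842 is the standard value for 80 % power) computes it for the obesity and colon cancer example:

```python
import math

def n_cohort(p1, p2, m, z_alpha=1.96, z_beta=0.842):
    """Evaluate Eq. 7.9.

    p1, p2 -- event probabilities in unexposed and exposed participants
    m      -- number of unexposed participants per exposed participant
    """
    p_bar = (m * p1 + p2) / (m + 1)
    term_alpha = z_alpha * math.sqrt((1 + 1 / m) * p_bar * (1 - p_bar))
    term_beta = z_beta * math.sqrt(p1 * (1 - p1) / m + p2 * (1 - p2))
    return (term_alpha + term_beta) ** 2 / (p1 - p2) ** 2

# Worked example: obesity (exposed) and colon cancer over 5 years
n = math.ceil(n_cohort(p1=0.05, p2=0.10, m=3))   # -> 271
```

The raw value is just above 270, so it is rounded up to the next whole participant.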

There is an important caveat, however, for cohort studies of relatively minor effects: in such cases, sample size calculations tend to produce underestimates. One way to address this issue is to increase the number of exposed participants, though this approach requires more resources. A more efficient approach is to over-represent extreme degrees of exposure in the exposed group. For example, among the obese participants in the example above, it may be wise to include more severely obese participants than would be enrolled by chance alone. Two critical assumptions of this approach are that there is a dose-dependent relationship between the exposure and the outcome, and that potential confounding exposures are similar in obese and severely obese participants. If such an approach is taken, it should be clearly reported in the methods section, and these assumptions should be tested, with the results disclosed in detail in the results section.

6 The Size of an Intervention Study

Study size gains elevated importance in intervention studies, such as randomized controlled trials (RCTs), because the risks associated with intervention studies are generally greater than those associated with diagnostic or etiognostic studies. Any intervention poses some degree of risk to the participants; therefore, over-enrollment could expose an unnecessarily large number of participants to a potentially harmful intervention when fewer would have been sufficient. In other words, in intervention studies, sample size becomes a major ethical issue, and the main concern is to balance the need for useful results against the need to limit potential harm.

The importance of sample size in an intervention study is directly proportional to the riskiness of that study and is informed by the aims of the study. For example, the optimal size of a Phase-1 clinical trial, in which a new drug is given to humans for the first time for safety and tolerability testing, is very heavily influenced by ethical considerations. Consequently, Phase-1 trials tend to be very small (e.g., n = 6–12). In a Phase-2 study, when a drug’s dosing regimen is being evaluated, safety and tolerability are better established but still unclear; therefore, Phase-2 studies tend to be larger than Phase-1 studies but still relatively small (e.g., n = 15–60). The increase in study size in Phase-2 studies often allows preliminary hypothesis testing of efficacy and negative outcomes, though only large effect sizes tend to be detectable in such small studies. In a Phase-3 study, on the other hand, the planned study size must be larger than in a Phase-2 study because effect size must be determined. By the time a drug reaches Phase-3, its safety has already been established in Phases 1 and 2; therefore, the risk of harm is lower in a Phase-3 study. Consequently, depending on the goals of the study and the predicted effect size, Phase-3 studies can be as small as n = 200 or as large as n = 22,000 (e.g., the Physicians’ Health Study I).

In any intervention study, it is generally useful to consider ways to increase the efficiency of the study without compromising statistical power. One commonly employed approach is to tweak elements of the study design in order to increase the total number of participants who are likely to experience an outcome of interest. Such tweaks often include: selecting subjects from high-risk populations, lengthening the follow-up period, maximizing compliance, and minimizing drop-out/attrition rates (See: Chap. 17). In making such tweaks, one must be very careful to avoid introducing unacceptable bias or ethical errors, and it is critical to report these intentional tweaks in the methods section so others can critically appraise the results.

When calculating sample size for an intervention study, by far the most common approach is to use formulae for the calculation of study size for the comparison of means or proportions (see: Eqs. 7.3 and 7.6). Therefore, in this chapter we will not further discuss study size planning formulae for intervention studies. However, there are many comprehensive resources covering a wide range of intervention scenarios. Readers interested in this topic may find useful additional information in statistical textbooks and books on clinical trials, e.g., Meinert (1986).
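For orientation, the widely used two-sample comparison-of-means calculation mentioned above can be sketched as follows. This is the standard textbook formula, not a reproduction of Eq. 7.3, and the blood pressure numbers are invented for illustration:

```python
import math

def n_per_group_means(delta, sd, z_alpha=1.96, z_beta=0.842):
    """Per-group sample size to detect a difference `delta` between two
    means with common standard deviation `sd` (standard formula for a
    two-sided alpha = 0.05 and power = 80 %)."""
    return 2 * (z_alpha + z_beta) ** 2 * sd ** 2 / delta ** 2

# Hypothetical trial: detect a 5 mmHg difference in mean blood pressure
# between treatment arms, assuming SD = 10 mmHg in each arm
n = math.ceil(n_per_group_means(delta=5, sd=10))   # -> 63 per group
```

As with the proportion-based formulae, smaller anticipated effects or larger variability inflate the required size quadratically.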

7 Accounting for Attrition

In every study there will be some proportion of participants who withdraw from the study or are otherwise lost to follow-up. The best way to account for these phenomena is to increase study size calculations by a known factor based on previous studies or a pilot study. However, such information is not always available, so a common approach is to round up to a useful study size (e.g., n = 271 can be usefully rounded up to n = 300). Although this approach provides some leeway, some researchers have advocated a simple further adjustment: to add an additional 10 % (e.g., n = 271 is rounded to n = 300 and 10 % is added to make n = 330). Though such accounting is helpful for planning and budgeting studies, it should be noted that there is no standardized approach to dealing with this issue. Indeed, there is a great deal of controversy regarding approaches to account for attrition. In order to limit potential criticisms, an important task when writing successful grants or other funding requests, we recommend making adjustments to sample size based on pilot studies or literature, and resorting to the 10 % add-on approach only if no such pilots or literature are available.
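The round-then-add procedure described above is easy to make explicit. A minimal sketch (the rounding granularity of 50 is our own choice, not a prescription from the chapter):

```python
import math

def planned_size(n_calculated, round_to=50, attrition_addon=0.10):
    """Round a calculated sample size up to a 'useful' figure, then add a
    fixed percentage to allow for withdrawals and loss to follow-up."""
    n_rounded = math.ceil(n_calculated / round_to) * round_to
    return round(n_rounded * (1 + attrition_addon))

n = planned_size(271)   # 271 -> rounded to 300 -> plus 10 % = 330
```

A pilot-based attrition estimate, when available, should replace the flat 10 % add-on.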

An ideal scenario would be to plan the size and procedures of a study from a scientific point of view only and, consequently, to aim for very high power/precision and to employ only the most accurate, sophisticated procedures (which are often the most expensive). In practice, however, such an ideal scenario is quite rare, in part because stakeholders typically place some restrictions on such ambitions. It is therefore evident that interactions with stakeholders, a topic discussed in the next chapter, are crucial if one wants to develop a realistic and ethical plan for a study.