1 Introduction

Phase I oncology trials aim to find the maximum tolerated dose (MTD), the highest dose with a toxicity rate close to a pre-specified target level, \(p_{T}\). The 3+3 design [3, 4] is the leading method for phase I dose-escalation trials in oncology: over 90 % of published phase I trials in the past two decades have been based on 3+3 [1, 5, 6]. This popularity is striking, since biostatisticians developed numerous model-based dose-escalation methods during the same period and almost all of the new methods seemed to perform better than 3+3 [7–10].

The main reason for the popularity of the 3+3 design is its simplicity, transparency, and costless implementation in practice. In contrast, implementing most model-based designs often requires considerable logistic support and complexity. Even if the practical burden could be overcome, protocols based on model-based designs are often subject to more thorough review by IRBs or biostatisticians, as operating characteristics of these new designs are required. If the protocol is instead based on the 3+3 design, this requirement disappears because 3+3 has been widely used. As a result, despite the accelerating development of adaptive model-based designs, the lower standard in the review process and the cost-free implementation in practice make 3+3 an increasingly popular design among physicians. Setting aside the logistic issues, we ask exactly how much better the model-based designs are than 3+3. In reviewing the statistical literature on phase I adaptive designs, we found that most comparisons with 3+3 did not match the sample size across the designs. For example, Ji et al. (2010) [2] showed that 3+3 exhibits a smaller average sample size in computer simulations than model-based designs, and consequently 3+3 also identifies the true MTD in a smaller percentage of these simulations. Since the sample size is not matched in the comparison, it is difficult to assess the reason for the reduced percentage under 3+3. More importantly, since phase I trials focus on patient safety, comparisons without matched sample sizes cannot accurately assess the safety characteristics of the designs. In fact, designs resulting in larger sample sizes should usually be safer, since patients enrolled in the later stage of a larger trial will be better protected by more precise statistical inference.

In this paper, we construct a comprehensive simulation study to evaluate the operating characteristics of 3+3 and a newly developed adaptive design known as the modified toxicity probability interval (mTPI) method [2, 6]. In doing so we match the sample size between the two designs. We choose the mTPI design for comparison mainly because it is equally simple, transparent, and costless to implement. In other words, the logistic burden of mTPI and 3+3 is comparable, which allows us to focus on the simulation performance. Although only recently introduced to the community, mTPI has already received attention from both research and industry entities [11, 12]. For example, through personal communication we are informed that almost all phase I oncology trials conducted at Merck Co., Inc. in the past 2 years have been based on the mTPI design or its variations. Recently, phase I trials based on the mTPI design have been published [13, 14]. Considering the short time since the publication of the mTPI design, this popularity is encouraging.

In a nutshell, the 3+3 design consists of a set of deterministic rules that dictate dose-escalation decisions based on observed patient outcomes. For example, if out of three treated patients 0, 1, or more than 1 toxicities are observed, 3+3 will recommend escalating dose level, continuing at the same dose level, or de-escalating dose level, respectively (see, e.g., [15, 16]).

The mTPI design uses a Bayesian statistics framework and a beta/binomial hierarchical model to compute the posterior probability of three intervals that reflect the relative distance between the toxicity rate of each dose level and the target rate \(p_{T}\). Let \(p_{d}\) denote the probability of toxicity for dose d, d = 1, …, D, where D is the total number of candidate doses. Using the posterior samples for \(p_{d}\), mTPI computes the unit probability mass, defined as

$$\displaystyle{ \mbox{ UPM}_{(a,b)}(d) = \frac{Pr\{p_{d} \in (a,b)\mid data\}} {b - a}, }$$
(1)

for three intervals corresponding to under-, proper-, and over-dosing, in reference to whether a dose is lower than, close to, or higher than the MTD, respectively. Specifically, the under-dosing interval is defined as \((0,p_{T} -\epsilon _{1})\) and implies that the dose level is lower than the MTD, the over-dosing interval \((p_{T} +\epsilon _{2},1)\) implies that the dose level is higher than the MTD, and the proper-dosing interval \((p_{T} -\epsilon _{1},p_{T} +\epsilon _{2})\) suggests that the dose level is close to the MTD. Here \(\epsilon _{1}\) and \(\epsilon _{2}\) are small fractions, say 0.05. Inference is robust with respect to the choice of \(\epsilon\), as shown in [2]. A large UPM value for an interval implies a large per-unit posterior probability mass for that interval and therefore the corresponding decision: if the UPM at dose d is largest for the under-, proper-, or over-dosing interval, the decision should be to escalate (E), stay (S) at dose d, or de-escalate (D), respectively. Therefore, assuming that dose d is currently used to treat patients, the mTPI design assigns the next cohort of patients based on the decision rule \(\mathbf{B}_{d}\), given by

$$\displaystyle{ \mathbf{B}_{d} = \mbox{ arg}\max _{m\in \{D,S,E\}}\mbox{ UPM}(m,d), }$$
(2)

where UPM(m, d) is the value of UPM for the dosing interval associated with decision m. Decisions D, S, or E warrant the use of dose (d − 1), d, or (d + 1) for the next cohort of patients, respectively. Ji et al. [2] proved that the decision rule \(\mathbf{B}_{d}\) is consistent and optimal in that it minimizes the posterior expected loss, in which the loss function is determined to achieve equal prior expected loss for the three decisions, D, S, and E. More importantly, all the dose-escalation decisions for a given trial can be pre-calculated under the mTPI design and presented in a two-way table (Fig. 1). Once the trial starts, clinicians can easily monitor the trial and select the appropriate doses following the pre-calculated table. The simplicity and transparency of mTPI make it a strong candidate as a model-based counterpart of the 3+3 design in practice. An Excel-based software tool is provided at https://biostatistics.mdanderson.org/SoftwareDownload/SingleSoftware.aspx?Software_Id=72. We will show surprising and important findings and, based on these findings, recommend the use of mTPI in future phase I trials.
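The UPM comparison in Eqs. (1) and (2) can be sketched in a few lines of Python. This is a minimal illustration and not the authors' software: it assumes a Beta(1, 1) prior on the toxicity probability (the text only states that a beta/binomial model is used), so that after x toxicities in n patients the posterior is Beta(1 + x, 1 + n − x), and it uses the value 0.05 mentioned above for both interval half-widths. The function names `beta_cdf_int` and `mtpi_decision` are hypothetical, and the dose exclusion rule and the handling of the lowest and highest doses are left out.

```python
from math import comb


def beta_cdf_int(p, a, b):
    """CDF of Beta(a, b) at p for positive integers a and b.

    Uses the identity P(Beta(a, b) <= p) = P(Binomial(a + b - 1, p) >= a),
    which is exact for integer shape parameters.
    """
    m = a + b - 1
    return sum(comb(m, j) * p**j * (1 - p) ** (m - j) for j in range(a, m + 1))


def mtpi_decision(x, n, p_t, eps1=0.05, eps2=0.05):
    """Return (decision, upm) after x toxicities among n patients at a dose.

    Assumption: Beta(1, 1) prior, hence posterior Beta(1 + x, 1 + n - x).
    """
    a, b = 1 + x, 1 + n - x
    lo, hi = p_t - eps1, p_t + eps2
    c_lo, c_hi = beta_cdf_int(lo, a, b), beta_cdf_int(hi, a, b)
    # Unit probability mass: posterior interval probability / interval length.
    upm = {
        "E": c_lo / lo,                  # under-dosing interval (0, lo)
        "S": (c_hi - c_lo) / (hi - lo),  # proper-dosing interval (lo, hi)
        "D": (1 - c_hi) / (1 - hi),      # over-dosing interval (hi, 1)
    }
    return max(upm, key=upm.get), upm


decision, upm = mtpi_decision(x=1, n=3, p_t=0.3)
```

Looping `mtpi_decision` over every pair (x, n) up to the maximum sample size would give the kind of pre-calculated two-way decision table described above, again without the exclusion rule.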

Fig. 1

Dose-finding spreadsheet of the mTPI method. The spreadsheet is generated based on a Beta/Binomial model and pre-calculated before a trial starts. The letters in different colors are computed based on the decision rules under the mTPI method and represent different dose-finding actions. In addition to actions D, S, and E, the table includes action U, which is defined as the execution of the dose exclusion rule in mTPI.

2 Comparison of 3+3 and mTPI

2.1 Simulation Setup

We perform computer simulation of phase I trials based on the 3+3 and mTPI designs and compare their operating characteristics summarized over thousands of simulated trials.

2.1.1 Clinical Scenarios

We consider 6 doses in the simulated trials. We construct 14 scenarios for each of the three target \(p_{T}\) values, resulting in a total of 42 scenarios. In each scenario, true toxicity probabilities are specified for the 6 doses. These scenarios are set up to capture a wide range of dose–response shapes in practice, as shown in Fig. 2 (see also a discussion in Ji et al., 2012 [17]). Specifically, Scenario 1 represents a case where all doses are safe, with low toxicity probabilities; Scenario 2 represents a case where all doses have high toxicity probabilities; in Scenarios 3–4 doses cover a wide range of toxicity probabilities and the toxicity probability of one dose equals \(p_{T}\); Scenarios 5–7 also cover a wide range of toxicity probabilities but the MTD is bracketed by two adjacent doses; in Scenarios 8–10, dose toxicity probabilities do not vary much and center around the target \(p_{T}\); Scenarios 11–12 are similar to Scenarios 8–10, except that doses have a wider range of toxicity; lastly, Scenarios 13–14 represent two rare cases in which the MTD is the lowest and highest dose, respectively.

Fig. 2

Dose-response patterns for the 42 clinical scenarios in the simulation. For each of the values \(p_{T} = 0.1, 0.2, 0.3\), 14 scenarios are constructed.

2.1.2 Values of \(p_{T}\)

In practice, target \(p_{T}\) values are rarely larger than 30 %, as a higher target implies unnecessary exposure of patients to doses with high toxicity. Below, we make three choices of \(p_{T}\): 0.1, 0.2, and 0.3, i.e., the target toxicity rates of the MTD in our simulated trials are 10 %, 20 %, or 30 %. For each \(p_{T}\) and each scenario, we simulate 2,000 trials.

2.1.3 Matching Sample Size

A unique feature of our comparison is that we attempt to match the average sample size of the 3+3 and mTPI designs for each of the clinical scenarios used in the simulation study. To achieve this, for each scenario we first apply the 3+3 design to 2,000 simulated trials and obtain the mean of the 2,000 sample sizes. We then apply the mTPI design, for which we need to specify the maximum sample size. The mTPI design stops the trial when the total number of patients enrolled is equal to or larger than the maximum sample size. We calibrate the maximum sample size of mTPI for each \(p_{T}\) value and each scenario, so that the average sample sizes over simulated trials under both designs are similar across all the scenarios. Figure 3 shows the differences in the average sample sizes (over 2,000 simulated trials) between 3+3 and mTPI. The two designs exhibit comparable sample sizes overall. Our calibration of mTPI only involves varying the maximum sample size, while keeping all the other design features unchanged.
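The calibration step can be viewed as a one-dimensional search over candidate maximum sample sizes. The sketch below is only an illustration of that idea, not the procedure actually used in the study: the function `calibrate_max_n`, the candidate grid, the target value 16.4, and the stand-in simulator `toy_avg_n` are all hypothetical. In a real calibration the simulator would run the 2,000 mTPI trials for the scenario and return their mean sample size.

```python
def calibrate_max_n(target_avg_n, candidates, simulate_avg_n):
    """Pick the candidate maximum sample size whose simulated average
    sample size is closest to the 3+3 average (target_avg_n)."""
    return min(candidates, key=lambda m: abs(simulate_avg_n(m) - target_avg_n))


def toy_avg_n(max_n):
    # Hypothetical stand-in for "simulate mTPI trials and average the
    # sample sizes": pretends the average is 90 % of the maximum.
    return 0.9 * max_n


# Candidate maxima in multiples of an assumed cohort size of 3;
# 16.4 plays the role of the 3+3 average for one scenario.
best = calibrate_max_n(16.4, range(3, 61, 3), toy_avg_n)
```

Because only the maximum sample size is varied, every other mTPI design feature stays fixed during the search, in line with the description above.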

Fig. 3

Difference in the average sample size per trial between 3+3 and mTPI. Each boxplot summarizes the differences for 14 scenarios for a given target toxicity \(p_{T}\) value.

2.1.4 Variations of the 3+3

To account for different target \(p_{T}\) values, we use one of two 3+3 variations (3+3L and 3+3H); see Fig. 4. Briefly, the two designs differ only when 6 patients have been treated at a dose and 1 or 2 of them experience toxicity. If 2 out of 6 patients experience toxicity at the dose, 3+3L stops the trial and declares that the MTD has been exceeded, whereas 3+3H stops the trial and declares that dose to be the MTD. Likewise, if 1 toxicity is observed among 6 patients, 3+3H escalates, while 3+3L stops and declares the dose to be the MTD. Here, L or H means that the target toxicity rate \(p_{T}\) of the MTD is low or high. We use 3+3L for trials with \(p_{T} = 0.1\) or \(p_{T} = 0.2\), and 3+3H for trials with \(p_{T} = 0.3\).

Fig. 4

Schema of the enhanced 3+3 design. The two versions of 3+3L and 3+3H represent the cases where the MTD is defined as the highest dose on which no more than 1 and 2 dose-limiting toxicities (DLT) are observed from 6 patients, respectively.
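The per-dose rules of the two variations, as described in the text, can be written as a small lookup function. This is a sketch under stated assumptions rather than a full trial simulator: the function name `three_plus_three_action` and the action labels are hypothetical, and cases the text does not spell out (for example 0 toxicities among 6 patients, or what happens after the MTD is declared exceeded) are deliberately rejected or left out instead of guessed.

```python
def three_plus_three_action(n_tox, n_treated, variant):
    """Action at the current dose under 3+3L ("L") or 3+3H ("H")."""
    if variant not in ("L", "H"):
        raise ValueError("variant must be 'L' or 'H'")
    if n_treated == 3:
        # First cohort of three: identical in both variations.
        if n_tox == 0:
            return "escalate"
        if n_tox == 1:
            return "stay"  # treat three more patients at the same dose
        return "de-escalate"
    if n_treated == 6:
        if n_tox == 1:
            # 3+3L declares this dose the MTD; 3+3H escalates.
            return "declare_mtd" if variant == "L" else "escalate"
        if n_tox == 2:
            # 3+3L declares the MTD exceeded; 3+3H declares this dose the MTD.
            return "mtd_exceeded" if variant == "L" else "declare_mtd"
        if n_tox >= 3:
            return "mtd_exceeded"  # more than 2 of 6 exceeds the MTD in both
        raise ValueError("0 of 6 toxicities is not covered by the description")
    raise ValueError("rules are described only for 3 or 6 treated patients")
```

For instance, `three_plus_three_action(2, 6, "L")` and `three_plus_three_action(2, 6, "H")` return different actions, which is exactly the case where the two variations diverge.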

2.2 Performance Evaluation

Summarizing results from 42 scenarios over three different \(p_{T}\) values for three designs can be subjective, depending on the criterion used in the comparison. Since the average sample sizes between the two methods are roughly matched, we focus our comparison on two summary statistics simultaneously,

$$\displaystyle\begin{array}{rcl} n_{>MTD}& =& \mbox{ the number of patients treated above the true MTD} {}\\ \%Sel_{MTD}& =& \mbox{ the percentage of selecting the true MTD}. {}\\ \end{array}$$

\(n_{>MTD}\) directly evaluates the safety of each design, since under matched sample sizes a smaller \(n_{>MTD}\) value implies fewer toxicities. To calculate \(\%Sel_{MTD}\), we need to decide which doses will be considered as the MTD for each scenario.
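As an illustration of how these two summary statistics could be computed from simulation output, consider the following sketch. The record layout (a `doses` list giving the dose level of each patient and a `selected` dose), the function name `summarize`, and the choice to average the patient count over trials are all assumptions made for this example; the sketch also assumes a single true MTD per scenario.

```python
def summarize(trials, true_mtd):
    """Return (n_above, pct_sel) for a list of simulated trial records.

    n_above: average number of patients per trial treated above the true MTD.
    pct_sel: percentage of trials that select the true MTD.
    """
    n_trials = len(trials)
    n_above = sum(
        sum(1 for d in t["doses"] if d > true_mtd) for t in trials
    ) / n_trials
    pct_sel = 100.0 * sum(1 for t in trials if t["selected"] == true_mtd) / n_trials
    return n_above, pct_sel


# Two hypothetical trials with the true MTD at dose level 2.
example = [
    {"doses": [1, 1, 1, 2, 2, 2, 3, 3, 3], "selected": 2},
    {"doses": [1, 1, 1, 2, 2, 2], "selected": 1},
]
n_above, pct_sel = summarize(example, true_mtd=2)
```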

2.3 Main Results

Figure 5 summarizes the comparison between the 3+3 and mTPI designs in terms of the differences in \(n_{>MTD}\) and \(\%Sel_{MTD}\). We present the comparison results for \(n_{>MTD}\) in the left panel. Compared with the mTPI design, the 3+3 design has lower \(n_{>MTD}\) values for two scenarios, higher \(n_{>MTD}\) for 34 scenarios, and the same \(n_{>MTD}\) for six scenarios. In other words, in 40 of the 42 scenarios, mTPI treats fewer or the same number of patients at doses higher than the MTD compared with 3+3. In addition, Fig. 6 examines the overall toxicity percentage, defined as

$$\displaystyle{ \frac{\mbox{ the total number of toxicities over all simulated trials}} {\mbox{ the total number of patients treated over all simulated trials}} \times 100\%.}$$

In only one of the 42 scenarios does the 3+3 design exhibit a lower overall toxicity percentage than the mTPI design.
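The overall toxicity percentage pools patients across all simulated trials before dividing, so it is generally not the same as the average of per-trial toxicity rates. A minimal sketch, assuming each simulated trial is stored as a list of 0/1 toxicity outcomes (a hypothetical layout, as is the function name):

```python
def overall_toxicity_percentage(trials):
    """Total toxicities over total patients across all trials, in percent."""
    total_tox = sum(sum(outcomes) for outcomes in trials)
    total_patients = sum(len(outcomes) for outcomes in trials)
    return 100.0 * total_tox / total_patients


# Two hypothetical trials: 1 toxicity in 3 patients, 2 toxicities in 9 patients.
example = [[0, 0, 1], [1, 1, 0, 0, 0, 0, 0, 0, 0]]
pooled = overall_toxicity_percentage(example)  # 3 toxicities / 12 patients
```

In this toy example the pooled value is 25 %, whereas averaging the two per-trial rates (about 33.3 % and 22.2 %) would give roughly 27.8 %.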

Fig. 5

Comparison between 3+3 and mTPI based on matched sample sizes. The left panel presents the differences in the numbers of patients treated at doses above the MTD (\(n_{>MTD}\)), i.e., values of \(n_{>MTD}^{3+3} - n_{>MTD}^{mTPI}\) for all 42 scenarios. The right panel presents the differences in the selection percentages of the true MTD (\(\%Sel_{MTD}\)), i.e., values of \(\%Sel_{MTD}^{3+3} - \%Sel_{MTD}^{mTPI}\) for all 42 scenarios. The three colors in the plots represent the results corresponding to the three different \(p_{T}\) values.

Fig. 6

Overall toxicity percentages for the 3+3 and mTPI designs across all the simulated trials.

We now turn to the right panel of Fig. 5, which compares \(\%Sel_{MTD}\) between the two designs. In 10 out of 42 scenarios, 3+3 has a higher selection percentage of the true MTD than mTPI. Among these scenarios, the 3+3 design selects the MTD up to about 25 % more often than the mTPI design (Scenario 2 for \(p_{T} = 0.3\)). In the remaining 32 scenarios, mTPI selects the MTD more often than 3+3, by up to more than 40 % (Scenario 14 for \(p_{T} = 0.1\)). A closer examination reveals that 3+3 has higher \(\%Sel_{MTD}\) values in scenarios where none of the doses has a toxicity probability close to \(p_{T}\) or where the MTD is at the lower or higher end of the dosing set. We performed additional simulations and confirmed this finding. We found that when the MTD is out of the range of the dosing set, 3+3 usually has a higher selection percentage than mTPI. In other words, 3+3 is a better method when none of the investigational doses is close to the true MTD. This advantage seems to be of limited utility in practice, since doses are usually chosen based on scientific and historical data in the expectation that some of them are close to the MTD, not the opposite.

Summarizing the two plots in Fig. 5 and considering that (1) the overall sample sizes between the two designs are roughly matched for all the scenarios and (2) the 42 scenarios are constructed to cover a wide range of practical dose–response shapes, we conclude that the 3+3 design is more likely to treat patients at toxic doses above the MTD and less likely to identify the true MTD than the mTPI design.

3 Conclusion and Discussion

The mTPI design has all the attractive properties that 3+3 enjoys for practical considerations and implementation. In addition, compared to the 3+3 design, the mTPI design is safer, in that it treats fewer patients at doses above the MTD, and in general yields higher probabilities of identifying the true MTD.

In practice, a single value n must be provided as the maximum sample size for the mTPI design in any dose escalation study. In implementing the mTPI design, we recommend a sample size of \(n = k \times (d + 1)\) to ensure that the design will reach the highest dose if needed and still has one more cohort to use. Here k is the cohort size and d the number of doses.
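As a worked instance of this rule, under the illustrative assumption of cohorts of three patients (k = 3) and the six-dose setting used in the simulations above (d = 6), the recommended maximum sample size would be

$$\displaystyle{ n = k \times (d + 1) = 3 \times (6 + 1) = 21. }$$

That is, six cohorts suffice to treat one cohort at each dose while escalating to the highest dose, and the seventh cohort remains available.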

It is commonly accepted that phase I trials are of small sizes. This myth is poorly addressed in the literature. Small phase I trials often provide wrong recommended doses for phase II, resulting in either low efficacy or high toxicity if the recommended doses are too low or too high, respectively. More discussion and investigation of the proper sample sizes of phase I trials are needed. For example, a streamlined and seamless phase I/II design may result in higher power in the identification of safe and effective doses [18] due to increased sample sizes from the seamless features.

We note that comparisons between the continual reassessment method (CRM) and 3+3 have been investigated by various authors [19–21] and thus are not included in this paper. A downside of CRM is the lack of easy ways to implement it in practice. We have included the CRM design in our software so that interested users can examine all three designs together: 3+3, CRM, and mTPI.