1 Introduction

It is required by law that the sponsor of a new drug that is intended for chronic use by patients for certain indications conduct carcinogenicity studies in animals to assess the carcinogenic potential of the drug. These studies are reviewed independently within CDER/FDA. These independent reviews are conducted by interdisciplinary groups. The statistical component of such a review includes an assessment of the design and conduct of the study, and a complete reanalysis of all statistical data. This work is performed by members of the Pharmacology and Toxicology (Pharm/Tox) Statistics Team. However, the decision of whether the drug should be considered a potential carcinogen is based on more than just statistical evidence, and as such the statistical review comprises just one part of the FDA internal decision process.

In 2001, the FDA released a draft document entitled “Guidance for Industry: Statistical Aspects of the Design, Analysis, and Interpretation of Chronic Rodent Carcinogenicity Studies of Pharmaceuticals” (US Food and Drug Administration—Center for Drug Evaluation and Research 2001). This guidance was issued in the Federal Register (Tuesday, May 8, 2001, Vol. 66, No. 89) and was reviewed by the public over a 90-day comment period in 2001. Sixteen comments were received from drug companies, professional organizations of the pharmaceutical industry, and individual experts from the U.S., Europe, and Japan; these are available in FDA Docket No. 01D0194. Great efforts are being made within the FDA to finalize this document.

The draft guidance describes the general process and methods used by the Pharm/Tox team in their reviews. These methods are also used widely by drug companies in the U.S. and abroad. In addition, the team also draws on results from other published research.

In addition to writing reviews, the statistical Pharm/Tox team contributes to the development of FDA policy and conducts statistical research. Indeed, research published by members of the team (Lin 1998; Lin and Ali 2006; Lin and Rahman 1998; Lin et al. 2010; Rahman and Lin 2008, 2009, 2010; Rahman and Tiwari 2012) is often incorporated into the team’s statistical reviews. More generally, in their efforts to keep abreast of new advancements in this area and to improve the quality of their reviews, members of the team have conducted regulatory research studies in collaborations with experts within and outside the agency.

The purposes of this chapter are twofold. One is to provide updates of results of some research studies that have been presented or published previously (Sects. 12.4 and 12.6). The other is to share with the professional community the results of some research studies that have not been presented or published (Sects. 12.2, 12.3 and 12.5).

In this chapter, results of five recent research projects by members of the Pharm/Tox Statistics Team are presented. Sections 12.2 and 12.3 describe two simulation studies investigating the effect on Type 1 and Type 2 error of varying the decision rules (Sect. 12.2) and experimental design (Sect. 12.3). Sections 12.4 and 12.5 discuss the development of exact methods for the poly-k test for a dose response relationship. Finally, Sect. 12.6 is a general discussion of the use of Bayesian methods in reviews of carcinogenicity studies.

2 An Evaluation of the Alternative Statistical Decision Rules for the Interpretation of Study Results

2.1 Introduction

It is specifically recommended in the draft guidance document (US Food and Drug Administration—Center for Drug Evaluation and Research 2001) that when evaluating the carcinogenic potential of a drug:

  1. Trend tests, which have been extensively studied in the literature (Lin 1995, 1997, 1998, 2000a,b; Lin and Ali 1994, 2006; Lin and Rahman 1998; Rahman and Lin 2008, 2009), should be the primary tests used.

  2. Pairwise tests may be used in lieu of trend tests, but only in those rare cases where they are deemed more appropriate than the trend tests.

The reason for preferring the trend test to the pairwise test is that, under most circumstances, the trend test will be more powerful than the pairwise test (for any given significance level). Note that the above guidance document recommends that only one test, either the trend test or the pairwise test, be used to conclude a statistically significant carcinogenic effect.

In the context of carcinogenicity studies (and safety studies in general), the Type 1 error rate is primarily a measure of the producer’s risk; if a drug is withheld from the market due to an incorrect finding of a carcinogenic effect (or even if its usage is merely curtailed), then it is the producer of the drug who faces the greatest loss. Conversely, the Type 2 error rate is primarily a measure of the consumer’s risk, as it is the consumer who stands to suffer in the event that a truly carcinogenic drug is brought to market without the carcinogenic effect being reported. Proceeding from this philosophical stance (and the similar position of Center for Drug Evaluation and Research 2005), the draft guidance (US Food and Drug Administration—Center for Drug Evaluation and Research 2001) recommends a goal of maximizing power while keeping the overall (study-wise) false positive rate at approximately 10 %. In order to achieve this goal, a collection of significance thresholds is recommended. These thresholds, presented in Table 12.1, are grounded in Lin and Rahman (1998) and Rahman and Lin (2008) (for the trend test) and in Haseman (1983, 1984) (for the pairwise test).

Table 12.1 Recommended significance levels for the trend test or the pairwise comparisons (US Food and Drug Administration—Center for Drug Evaluation and Research 2001: Lines 1093–1094 on page 30)

However, this goal has not been universally accepted. There is a desire on the part of some non-statistical scientists within the agency to restrict positive findings to those where there is statistical evidence of both a positive dose response relationship and an increased incidence in the high dose group compared to the control group. In other words, a joint test is desired. This is not an intrinsically unreasonable position. Nonetheless, every test needs significance thresholds, and since the only significance thresholds included in US Food and Drug Administration—Center for Drug Evaluation and Research (2001) are for single tests, it is natural (but incorrect!) for non-statistical scientists to construct a joint test using these thresholds. We will refer to this decision rule as the joint test rule. See Table 12.2.

Table 12.2 The joint test rule (not recommended!)

We are very concerned about the ramifications of the use of this rule. While the trend and pairwise test are clearly not independent, their association is far from perfect. Accordingly, the requirement that both tests yield individually statistically significant results necessarily results in a more conservative test than either the trend test or the pairwise test alone (at the same significance thresholds). The purpose of this section is to present the results of our simulation study showing a serious consequence of the adoption of this rule: a huge inflation of the false negative rate (i.e., the consumer’s risk) for the final interpretation of the carcinogenicity potential of a new drug.

2.2 Design of Simulation Study

The objective of this simulation study is to evaluate the inflation of the false negative rate resulting from the joint test rule, compared with the use of the trend test alone.

We modeled survival and tumor data using Weibull distributions (see Eqs. (12.1) and (12.2)). The values of the parameters A, B, C, and D were taken from the landmark National Toxicology Program (NTP) study by Dinse (1985) (see Tables 12.3 and 12.4). Values of these parameters were chosen to vary four different factors, ultimately resulting in 36 different sets of simulation conditions.

Table 12.3 Data generation parameters for the Weibull models for time to tumor onset (Dinse 1985)
Table 12.4 Data generation parameters for the Weibull models for time to death (Dinse 1985)

The factors used in the NTP study were defined as follows:

  1. Low or high tumor background rate: The prevalence rate at 2 years in the control group is 5 % (low) or 20 % (high).

  2. Tumors appear early or late: The prevalence rate of the control group at 1.5 years is 50 % (appearing early) or 10 % (appearing late) of the prevalence rate at 2 years.

  3. No dose effect, a small dose effect, or a large dose effect on tumor prevalence: The prevalence of the high dose group at 2 years minus the prevalence of the control group at 2 years is 0 % (no effect), 10 % (small effect), or 20 % (large effect).

  4. No dose effect, a small dose effect, or a large dose effect on mortality: The expected proportion of animals alive in the high dose group at 2 years is 70 % (no effect), 40 % (small effect), or 10 % (large effect). The expected proportion of animals alive in the control group at 2 years is taken as 70 %.

However, there are important differences between the NTP design described above and the design used in our simulation study. Whereas the NTP study simulated three treatment groups with doses x = 0, x = 1, and x = 2 (called the control, low, and high dose groups), our study used four treatment groups (with doses x = 0, x = 1, x = 2, and x = 3, called the control, low, mid, and high dose groups respectively). Since the values of the parameters A, B, C, and D used were the same in the two studies (see Tables 12.3 and 12.4), the characterizations of the effect of the dose level on tumorigenesis and mortality, factors 3 and 4, apply to the dose level x = 2, i.e., to the mid dose level. To recast these descriptions in terms of the effect at the x = 3 (high dose) level, factors 3 and 4 become factors 3′ and 4′:

  3′. No dose effect, a small dose effect, or a large dose effect on tumor prevalence: The prevalence of the high dose group at 2 years minus the prevalence of the control group at 2 years is 0 % (no effect), approximately 15 % (small effect), or approximately 28 % (large effect).

  4′. No dose effect, a small dose effect, or a large dose effect on mortality: The expected proportion of animals alive in the high dose group at 2 years is 70 % (no effect), 30 % (small effect), or 4 % (large effect). The expected proportion of animals alive in the control group at 2 years is taken as 70 %.

These differences can be expected to have the following effects on the Type 2 error rates for our study (relative to the NTP study):

  • The higher tumorigenesis rates in the high dose groups should help to reduce the false negative rates (or to increase the levels of power) of statistical tests.

  • On the other hand, higher levels of mortality will reduce the effective sample size and thus tend to increase the false negative rates (or to decrease the levels of power) (Footnote 1).

In our study, tumor data were generated for 4 treatment groups with equally spaced increasing doses (i.e., x = 0, x = 1, x = 2, and x = 3). There were 50 animals per group. The study duration was 2 years (104 weeks), and all animals surviving after 104 weeks were terminally sacrificed. All tumors were assumed to be incidental.

The tumor detection time \(T_0\) (measured in weeks) and the time to natural death \(T_1\) of an animal receiving dose level x were modeled by four-parameter Weibull distributions:

$$\displaystyle{ S(t,x) = P[T_{i} > t\vert X = x] = \left\{\begin{array}{ll} \mathrm{e}^{-(C+Dx)(t-A)^{B}} & \mathrm{if}\ t > A \\ 1 & \mathrm{if}\ t \leq A \end{array}\right. }$$
(12.1)

where A is the location parameter, B is the shape parameter, C is the baseline scale parameter, and D is the dose effect parameter. Tables 12.3 and 12.4 list the sets of values for these parameters used in Dinse (1985).

The prevalence function for incidental tumors equals the cumulative function of time to tumor onset, i.e.,

$$\displaystyle{ P(t\vert x) =\Pr [T_{0} \leq t\vert X = x] = 1 - S(t,x). }$$
(12.2)

Each of the 36 simulation conditions described in Tables 12.3 and 12.4 was simulated 10,000 times. For each simulation, 200 animals were generated; each animal was assigned to a dose group (50 animals per group) and had a tumor onset time \(T_0\) and death time \(T_1\) simulated using Eq. (12.1). The actual time of death T for each animal was defined as the minimum of \(T_1\) and 104 weeks, i.e., \(T =\min \{ T_{1},104\}\). The animal developed the tumor (i.e., became a tumor bearing animal (TBA)) only if the time to tumor onset did not exceed the time to death. The actual tumor detection time was assumed to be the time of death T. Animals in the same dose group were equally likely to develop the tumor in their lifetimes, and it was assumed that tumors developed independently of each other. The first panel of Fig. 12.1 graphically represents the Weibull models used to generate the tumor prevalence data when the background tumor rate is low, the dose effect on tumor prevalence is large, and the tumor appears early (the model used in simulation conditions 3, 15, and 27). The second panel graphically represents the Weibull models used to generate the survival data when the dose effect on mortality is small (simulation conditions 13–24). The age-adjusted Peto method for testing a dose response relationship (Peto et al. 1980) and the age-adjusted Fisher exact test for pairwise differences in tumor incidence (in each case using the NTP partition of time intervals (Footnote 2)) were applied to calculate p-values.
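To make the data-generation step concrete, Eq. (12.1) can be inverted at a uniform random draw: for \(t > A\), setting \(S(t,x) = u\) gives \(t = A + (-\ln u/(C + Dx))^{1/B}\). The following minimal Python sketch reflects our reading of this setup; the function names are ours, the parameter values from Tables 12.3 and 12.4 are not reproduced, and the Peto and Fisher testing step is omitted.

```python
import numpy as np

rng = np.random.default_rng(20250101)

def sample_weibull_time(a, b, c, d, x, size):
    """Draw times from the four-parameter Weibull model of Eq. (12.1).
    For t > A, S(t, x) = exp(-(C + D*x) * (t - A)**B), so setting S equal
    to a uniform draw u gives t = A + (-ln(u) / (C + D*x))**(1/B)."""
    u = rng.uniform(size=size)
    return a + (-np.log(u) / (c + d * x)) ** (1.0 / b)

def simulate_one_experiment(onset_pars, death_pars, doses=(0, 1, 2, 3),
                            n_per_group=50, terminal_week=104.0):
    """Generate one dataset: per-animal death times and TBA indicators.
    onset_pars and death_pars are (A, B, C, D) tuples whose values would
    be taken from Tables 12.3 and 12.4 (not reproduced here)."""
    data = []
    for x in doses:
        t_onset = sample_weibull_time(*onset_pars, x, n_per_group)
        t_death = sample_weibull_time(*death_pars, x, n_per_group)
        t = np.minimum(t_death, terminal_week)  # terminal sacrifice at week 104
        tba = t_onset <= t                      # tumor present at death/sacrifice
        data.append({"dose": x, "death_time": t, "tba": tba})
    return data
```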

Fig. 12.1 Sample tumor prevalence and mortality curves

Three rules for determining if a test of the drug effect on development of a given tumor type was statistically significant were applied to the simulated data. They were:

  1. Requiring a statistically significant result in the trend test alone. This is the rule recommended in US Food and Drug Administration—Center for Drug Evaluation and Research (2001).

  2. Requiring statistically significant results both in the trend test and in any of the three pairwise comparison tests (control versus low, control versus mid, control versus high).

  3. Requiring statistically significant results both in the trend test and in the control versus high pairwise comparison test. This is the joint test rule.

In each case, it was assumed that the tests were being conducted as part of a standard two-species study. The rules for rare tumor types were used when the incidence rate in the control group was below 1 %; otherwise the rules for common tumor types were used.
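A schematic rendering of the three rules may help fix ideas. The trend test levels below (0.005 for common and 0.025 for rare tumors in a two-species study) are stated later in this section; the pairwise levels (0.01 common, 0.05 rare) follow Haseman’s rule and are our assumption about the contents of Tables 12.1 and 12.2, which are not reproduced here.

```python
# Nominal significance levels for a standard two-species submission.
# The trend levels are stated in the text; the pairwise levels are our
# assumption (Haseman's rule), since Tables 12.1-12.2 are not shown above.
TREND_ALPHA = {"common": 0.005, "rare": 0.025}
PAIRWISE_ALPHA = {"common": 0.01, "rare": 0.05}

def classify(control_tumor_rate):
    """Empirical classification from the concurrent control group."""
    return "rare" if control_tumor_rate < 0.01 else "common"

def reject_null(rule, tumor_class, p_trend, p_low, p_mid, p_high):
    """Apply one of the three decision rules of Sect. 12.2.2 to the trend
    p-value and the three control-versus-dose pairwise p-values."""
    trend_sig = p_trend <= TREND_ALPHA[tumor_class]
    pair_sigs = [p <= PAIRWISE_ALPHA[tumor_class]
                 for p in (p_low, p_mid, p_high)]
    if rule == 1:    # trend test alone (the guidance recommendation)
        return trend_sig
    if rule == 2:    # trend test plus any one pairwise comparison
        return trend_sig and any(pair_sigs)
    if rule == 3:    # the joint test: trend plus control versus high
        return trend_sig and pair_sigs[2]
    raise ValueError("rule must be 1, 2, or 3")
```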

After simulating and analyzing tumor data 10,000 times for each of the 36 sets of simulation conditions, the Type 1 and Type 2 error rates were estimated.

2.3 Results of the Simulation Study

Since we are simultaneously considering both models where the null hypothesis is true (so that there is no genuine dose effect on tumor incidence) and models where it is false (where there is a genuine dose effect), we need terminology that can apply equally well to both of these cases. For any given set of simulation conditions, the retention rate is the probability of retaining the null hypothesis. If the null hypothesis is true, then this rate is the probability of a true negative, and is 1 − the false positive rate (Type 1 error). If the null hypothesis is false, then the retention rate is the probability of a false negative or Type 2 error. In this case, it is 1 − power. Correspondingly, the rejection rate is 1 − the retention rate, and is the probability that the null hypothesis is rejected. It is either the false positive rate (if the null hypothesis is true) or the level of power (if the alternative hypothesis is true). The results (retention rates and percent changes of retention rates) of the simulation study are presented in Table 12.5.

Table 12.5 Estimated retention rates under three decision rules

Results of the evaluation of Type 1 error patterns in the study conducted and reported in Dinse (1985) show that the Peto test without continuity correction, and with the partition of time intervals of the study duration proposed by NTP (see Footnote 2), yields attained false positive rates close to the nominal levels (0.05 and 0.01) used in the test. That is, the test is neither conservative nor anti-conservative.

The evaluation of Type 1 error patterns found by this simulation study is based on the rates at which the null hypothesis was rejected under those simulation conditions for which there was no dose effect on tumor rate (simulation conditions 1, 4, 7, 10, 13, 16, 19, 22, 25, 28, 31, and 34 in Table 12.5). The tumor types were classified as rare or common based on the incidence rate of the concurrent control. The results of this simulation study show a very interesting pattern in the levels of attained Type 1 error: the attained levels under the various simulation conditions divided into two groups, according to the background rate factor (20 % or 5 %). The attained Type 1 error rates of the first group were around 0.005; those of the second group were around 0.015. The observed results and pattern make sense. For the simulated conditions with a 20 % background rate, almost all of the 10,000 generated datasets (each containing tumor and survival data for four treatment groups of 50 animals each) can be expected to have a tumor rate equal to or greater than 1 % (the definition of a common tumor) in the control group. The attained levels of Type 1 error rates under the simulated conditions in this group are close to the nominal significance levels for common tumors (Footnote 3).

The attained Type 1 error rates for the other group fell between the nominal levels of significance of 0.005 (for the trend test for common tumors) and 0.025 (for the trend test for rare tumors), rather than around 0.005. The reason for this phenomenon is that, though the background rate in the simulated conditions for this group was 5 %, which is considered a common tumor rate, some of the 10,000 generated datasets had tumor rates less than 1 % in the control group. For this subset of the 10,000 datasets, the nominal level of 0.025 was used in the trend test. See Sect. 12.3.4.3 for a more detailed discussion of this factor.

As mentioned previously, the main objective of our study is the evaluation of the Type 2 error rate under various conditions. As expected, the Type 2 error (or false negative) rates resulting from the joint test decision rule are higher than those from the procedure recommended in the guidance document of using the trend test alone. This is because the false positive rate (measuring the producer’s risk in the regulatory review of toxicology studies) and the false negative rate (measuring the consumer’s risk) run in opposite directions; use of the joint test decision rule will cut down the former rate only at the expense of inflating the latter.

The estimated false negative rates resulting from the extensive simulation study under the three decision rules listed in Sect. 12.2.2 are shown in Table 12.5. The last two columns of the table show the percentage changes in the retention rates of decision rules (2) and (3) respectively, compared to those of (1). For those simulation conditions where the null hypothesis is true (1, 4, 7, 10, 13, 16, 19, 22, 25, 28, 31, and 34), these values measure the percentage change in the probability of not committing a Type 1 error. For the remaining simulation conditions, these values measure the inflation of the Type 2 error rate attributable to the adoption of the more stringent rules.

The magnitude of the inflation of the false negative rate resulting from the joint test decision rule (requiring statistically significant results in both the trend test and the C-H (high versus control) pairwise comparison test) depends on all four factors considered in the simulation, namely, drug effect on mortality, background tumor rate, time of tumor appearance, and drug effect on tumor incidence; these factors are also listed in the notes at the bottom of Tables 12.3 and 12.4.

2.4 Discussion

Results of the simulation study show that the effect of the dose on tumor prevalence rates has the largest impact on the inflation of the false negative rate when both the trend test and the C-H pairwise comparison test are required to be statistically significant simultaneously in order to conclude that the effect is statistically significant. The inflation is most serious in the situations in which the dose has a large effect on tumor prevalence: it can be as high as 153.3 %, so the actual Type 2 error rate can more than double.

The above finding is the most alarming result of our simulation study. When the dose of a new test drug has a large effect on tumor prevalence (up to a 28 % difference in incidence rates between the high dose and the control groups), that is a clear indication that the drug is carcinogenic. It is exactly in these most important situations that the joint test decision rule causes the most serious inflation of the false negative error rate (or the most serious reduction in statistical power to detect the true carcinogenic effect). The net result is that using the levels of significance recommended for the trend test alone and the pairwise test alone in the joint test decision rule can multiply the probability of failing to detect a true carcinogenic effect by a factor of two or even more, compared with the procedure based on the trend test alone.

It is true that the results in Table 12.5 show that, for the situations in which the dose has a small effect on tumor prevalence (up to a 15 % difference in incidence rates between the high dose and the control groups), the increases in false negative rates caused by the joint test decision rule over the trend test alone are modest (up to 27 %). However, this observation does not imply that the joint test decision rule is justified. The reason is that standard carcinogenicity studies use small group sizes as a surrogate for a large population with low tumor incidence. Because of the low tumor incidence rates, a statistical test using a small level of significance such as 0.005 has very little power to detect any true carcinogenic effect (i.e., the false negative rates are close to 100 %). In those situations there is little room for further increase in the false negative rate, no matter how many additional tests are put on top of the original trend test.

It might be argued that the large inflations in false negative rates in the use of the joint test over the trend test alone could be due to the large dose effect on death (only 4 % of animals alive at 2 years) introduced by the additional treatment group with x = 3. The argument may sound valid since, as mentioned previously, a decrease in the percentage of animals alive at 2 years increases the false negative rates. We are aware of the small number of animals alive at 2 years under the simulation condition of a large dose effect on death, caused by using the built-in Weibull model described in Dinse (1985) and by including the additional group with x = 3 in our study.

However, as mentioned previously, our main interest in this study is to evaluate the percentage inflation in false negative rates attributable to the use of the joint test compared with the trend test alone. The false negative rates of the joint test and of the trend test alone are certainly also of interest, but they are not the main interest. So the issue of excessive mortality under the simulation condition of a large dose effect on death should not be a major issue in this study, since it has similar impacts on the false negative rates of both the joint test and the trend test alone. Furthermore, it is seen from Table 12.5 that the largest inflations (63–153 %) in the false negative rate occurred under the conditions in which the dose effect on death is small (30 % of animals alive at 2 years) rather than under the conditions in which the effect is large (4 % of animals alive at 2 years).

The extremely large false negative rates in the above simulated situations, caused by the nature (low cancer rates and small group sample sizes) of a carcinogenicity experiment, reinforce the important argument that it is necessary to allow an overall (for a compound across studies in two species and two sexes) false positive rate of about 10 % in order to raise the power (or reduce the false negative rate) of an individual statistical test. This important finding of the simulation study clearly supports our serious concern about failing to detect carcinogenic effects when the joint test decision rule is used to determine the statistical significance of the carcinogenicity of a new drug. Again, the producer’s risk using the trend test alone is known at the level of significance used (0.5 % for a common tumor and 2.5 % for a rare tumor in a two-species study) and is small in relation to the consumer’s risk, which can be 100 or 200 times the level of the known producer’s risk. The levels of significance recommended in the guidance for industry document were developed with consideration of those situations in which the carcinogenicity experiment has great limitations. Trying to cut down only the producer’s risk (the false positive rate in toxicology studies) beyond what safeguards against the huge consumer’s risk (the false negative rate in toxicology studies) is not consistent with the FDA mission as a regulatory agency with the duty to protect the health and well-being of the American general public.

As mentioned previously, the decision rules (levels of significance) recommended in US Food and Drug Administration—Center for Drug Evaluation and Research (2001) are for trend tests alone and for pairwise comparisons alone, not for the joint test. Some non-statistical scientists within the agency wish to require statistically significant results in both the trend test and the C-H pairwise comparison simultaneously in order to conclude that the effect on the development of a given tumor/organ combination is statistically significant. To meet this desire, while still considering the special nature of standard carcinogenicity studies (i.e., the use of small group sizes as a surrogate for a large population with a low tumor incidence endpoint), we have conducted additional studies and proposed new sets of significance levels for a joint test, along with some updates of the previously recommended ones. These are presented in Table 12.6. We have found that the use of these new levels keeps the overall false positive rate (for the joint test) to approximately 10 %, again for a compound across studies in two species and two sexes.

Table 12.6 Recommended decision rules (levels of significance) for controlling the overall false positive rates for various statistical tests performed and submission types

3 The Relationship Between Experimental Design and Error Rates

In this section, we describe the results of a second simulation study. The aim of the simulation study discussed in Sect. 12.2 was to compare decision rules, evaluating the impact on error rates of the adoption of the joint test rule (which is more conservative than the trend test rule recommended in US Food and Drug Administration—Center for Drug Evaluation and Research (2001)—see Tables 12.1 and 12.2). By contrast, this second study, which was conducted independently, compares the effects of the use of different experimental designs on the error rates, all under the same decision rule. The decision rule used in this study is the joint test rule (Table 12.2) since, despite the absence of any theoretical justification for its use, this rule is currently used by non-statistical scientists within the agency as the basis for labeling and other regulatory decisions (Footnote 4).

We first consider the nature of the various hypotheses under consideration, and the associated error rates (Sect. 12.3.1). This provides us with the terminology to express our motivation (Sect. 12.3.2). We then describe in detail the four designs that have been compared (Sect. 12.3.3), and the simulation models used to test these designs (Sect. 12.3.4). The results of the simulation are discussed in Sects. 12.3.5 (power) and 12.3.6 (Type 1 error). We conclude with a brief discussion (Sect. 12.3.7).

3.1 Endpoints, Hypotheses, and Error Rates

The task of analyzing data from long term rodent bioassays is complicated by a severe multiplicity problem. But it is not quite the case that we are merely faced with a multitude of equally important tumor endpoints. Rather, we are faced with a hierarchy of hypotheses.

  • At the lowest level, we have individual tumor types, and some selected tumor combinations, associated with null hypotheses of the form:

    Administration of the test article is not associated with an increase in the incidence rate of malignant astrocytomas of the brain in female rats.

    We call such hypotheses the local null hypotheses.

  • The next level of the hierarchy is the experiment level. A standard study includes four experiments: on male mice, on female mice, on male rats, and on female rats. Each of these experiments is analyzed independently, leading to four global null hypotheses of the form:

    There is no organ–tumor pair, or reasonable combination of organ-tumor pairs, for which administration of the test article is positively associated with tumorigenesis in male mice.

    Note that some studies consist of just two experiments in a single species (or, very rarely, in two species and a single sex).

  • The highest level of the hierarchy of hypotheses is the study level. There is a single study-wise null hypothesis:

    For none of the experiments conducted is the corresponding global null hypothesis false.

For any given local null hypothesis, the probability of rejecting that null hypothesis is called either the local false positive rate (LFPR) or the local power, depending on whether the null hypothesis is in fact true. If all the local null hypotheses in a given experiment are true, then the global false positive rate (GFPR) for that experiment is the probability of rejecting the global null hypothesis, and can be estimated from the various estimates of the LFPRs for the endpoints under consideration (Footnote 5). The goal of the multiplicity adjustments in US Food and Drug Administration—Center for Drug Evaluation and Research (2001) is to maintain the study-wise false positive rate at about 10 %. Since most studies consist of four independent experiments, we consider our target level for false positives to be a GFPR of approximately 2.5 % (Footnote 6).

The calculation of a GFPR from the LFPR depends on the relationship between the local and global null hypotheses. We capture this relationship with the notion of a tumor spectrum: if \(\mathcal{T}\) is the parameter space for tumor types, then a spectrum is a function \(S: \mathcal{T} \rightarrow \mathbb{N}\); S(t) is the number of independent tumor types being tested with parameter value t. In our case, \(\mathcal{T}\) is one dimensional: under the global null hypothesis we assume that each tumor can be characterized by its background prevalence rate (Footnote 7).

In our simulations, we generate estimates for the power and LFPR for three different classes of tumor:

  1. Rare tumors have a background prevalence rate (i.e., the lifetime incidence rate among those animals that do not die from causes unrelated to the particular tumor type before the end of the study, typically at 104 weeks) of 0.5 %.

  2. Common tumors have a background prevalence rate of 2 %.

  3. Very common tumors have a background prevalence rate of 10 %.

A tumor spectrum for us therefore consists of a triple \(\langle n_{1},n_{2},n_{3}\rangle\), indicating that the global null hypothesis is the conjunction of \(n_{1} + n_{2} + n_{3}\) local null hypotheses, and asserts the absence of a treatment effect on tumorigenicity for n 1 rare, n 2 common, and n 3 very common independent tumor endpoints.

Given such a spectrum, and under any given set of conditions, the GFPR is easy to calculate from the LFPR estimates for the three tumor types under those conditions:

$$\displaystyle{ \mathrm{GFPR} = 1 -\prod _{i=1}^{3}\left(1 - F_{i}\right)^{n_{i}} }$$
(12.3)

where \(F_i\) is the estimated LFPR for the i-th class of tumors. Since our desired false positive rates are phrased in terms of the study-wise false positive rate (which we want to keep to a level of approximately 10 %), we are more concerned with the GFPR than the LFPR.
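Once LFPR estimates are in hand, Eq. (12.3) is immediate to apply; a minimal sketch follows, in which the numeric values are purely illustrative and are not taken from the simulation results.

```python
def global_fpr(lfprs, spectrum):
    """Eq. (12.3): GFPR = 1 - prod_i (1 - F_i)**n_i, assuming the local
    tests are independent.
    lfprs    -- (F_rare, F_common, F_very_common), estimated LFPRs
    spectrum -- (n_rare, n_common, n_very_common), the tumor spectrum"""
    retain = 1.0
    for f, n in zip(lfprs, spectrum):
        retain *= (1.0 - f) ** n
    return 1.0 - retain

# Illustrative numbers only: 40 rare, 6 common, and 4 very common endpoints
# with LFPRs of 0.025 %, 0.2 %, and 0.4 % give a GFPR of about 3.7 %.
print(global_fpr((0.00025, 0.002, 0.004), (40, 6, 4)))
```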

Global power is slightly harder to calculate, since it is a function of a specific global alternative hypothesis. It is unclear what a realistic global alternative hypothesis might look like, except that it is likely to be the conjunction of a very small number of local alternative hypotheses with a large number of local null hypotheses. Accordingly, we focus our attention on the local power.

In summary then, the two quantities that we most wish to estimate are the local power and the GFPR.

3.2 Motivation

For any given experimental design, there is a clear and well understood trade-off between the Type 1 error rate (the false positive rate) and the Type 2 error rate (1 minus the power): by adjusting the rejection region for a test, usually by manipulating the significance thresholds, the test can be made more or less conservative. A more conservative test has a lower false positive rate, but only at the expense of a higher Type 2 error rate (i.e., lower power), while a more liberal test lowers the Type 2 error rate at the cost of raising the Type 1 error rate. Finding an appropriate balance of Type 1 and Type 2 errors is an important part of the statistical design of an experiment, and requires a consideration of the relative costs of the two types of error. It is generally acknowledged (see Center for Drug Evaluation and Research 2005) that for safety studies this balance should prioritize Type 2 error control.

However, this trade-off applies only to a fixed experimental design; by adjusting the design, it may be possible to simultaneously improve both Type 1 and Type 2 error rates (Footnote 8). Beyond this general principle, there is a particular reason to suspect that adjusting the design might affect error rates for carcinogenicity studies. It has been shown (Lin and Rahman 1998; Rahman and Lin 2008) that, using the trend test alone and the significance thresholds in Table 12.1, the study-wise false positive rate for rodent carcinogenicity studies is approximately 10 %. However, under this decision rule, the nominal false positive rate for a single rare tumor type is 2.5 %. Given that each study includes dozens of rare endpoints, the tests must be strongly over-conservative for rare tumor types (Footnote 9); the decision rules in US Food and Drug Administration—Center for Drug Evaluation and Research (2001) rely heavily on this over-conservativeness in order to keep the GFPR to an acceptable level. But this sort of over-conservativeness is exactly the sort of phenomenon that one would expect to be quite sensitive to changes in study design.

3.3 Designs Compared

To get a sense of the designs currently in use, we conducted a brief investigation of 32 recent submissions, and drew the following general conclusions:

  • While most designs use a single vehicle control group, a substantial proportion do use two duplicate vehicle control groups.

  • A large majority of designs use three treated groups.

  • The total number of animals used can vary considerably, but is typically between 250 and 300.

  • The “traditional” design of four equal groups of 50 is still in use, but is not common; most designs use larger samples of animals.

Bearing these observations in mind, we compare four designs, outlined in Table 12.7. Three of these designs (D1, D2, and D4) utilize the same number of animals (260), so that any effects due to differences in the disposition of the animals will not be obscured by differences due to overall sample size.

Table 12.7 Experimental designs considered

The first two designs, D1 and D2, are representative of designs currently in use. Design D1 uses four equal groups of 65 animals whereas design D2 uses a larger control group (104 animals) and three equal dose groups (52 animals each). This is equivalent to a design with five equal groups, comprising two identical vehicle control groups (which, since they are identical, may be safely combined) and three treated groups.

The third design tested (D3) is the “traditional” 200 animal design. Although D3 uses fewer animals than the other designs (but is otherwise similar to D1), it has been included to enable comparison with the many simulation studies and investigations which use this design, such as that described in Sect. 12.2, and those in Dinse (1985), Portier and Hoel (1983), Lin and Rahman (1998), and Rahman and Lin (2008).

In light of the investigation (Jackson 2015) of the possible benefits of unbalanced designs (where the animals are not allocated equally to the various dose groups), we have also included an unbalanced design for comparison. This design (D4) follows the suggestions of Portier and Hoel (1983):

…we feel that a design with 50 to 60 of the experimental animals at control (\(d_0 = 0\)), 40 to 60 of the animals at the MTD (\(d_3 = 1\)) and the remaining animals allocated as one-third to a group given a dose of 10–30 % MTD (\(d_1 = 0.25\) seems best) and two-thirds to a group given a dose of 50 % MTD (\(d_2 = 0.5\)). No less than 150 experimental animals should be used, and more than 300 animals is generally wasteful. An acceptable number of animals would be 200.

Accordingly, 60 animals have been allocated to the control group and 50 to the high dose group, with the remaining 150 animals allocated 2:1 to the mid and low dose groups.

3.4 Statistical Methodology

3.4.1 Simulation Schema

We have conducted two separate simulation studies. The first study was designed to compare the (local) power of the four designs to detect genuine increases in tumorigenicity for the three tumor types (rare, common, and very common) described in Sect. 12.3.1. In each case, about fifty different effect sizes (measured as the odds ratio for tumor incidence between high dose and control animals at 104 weeks) were tested 1000 times. While 1000 simulations are not adequate to estimate the power for a particular effect size accurately (we can expect a margin of error of approximately 3 %), we may still form an accurate impression of the general shape of the power curves.
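The quoted margin of error is simply the half-width of an approximate 95 % binomial confidence interval at its worst case, p = 0.5:

$$\displaystyle{ 1.96\sqrt{\frac{p(1 - p)}{1000}} \leq 1.96\sqrt{\frac{0.5 \times 0.5}{1000}} \approx 0.031. }$$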

The second simulation study was aimed at evaluating false positive (Type 1 error) rates. The immediate focus was on the LFPR, the rate at which individual organ-tumor endpoints for which there is no genuine effect are falsely found to be targets of a carcinogenic effect. Because local false positives are very rare, and because imprecision in the estimate of the local false positive rate is amplified when computing the global false positive rate, each simulation scenario has been repeated at least 250,000 times. The resulting estimates are amalgamated to compute the GFPR by appealing to independence and applying Eq. (12.3) to three different tumor spectra.

For both of the simulation studies, data were simulated using a competing risks model. The two competing hazards were tumorigenesis and death due to a non-tumor cause.

  • Since these simulations are intended to evaluate power and GFPRs under fairly optimal circumstances, only one toxicity model has been considered: the hazard function for non-tumor death has the form \(h_{M}(t) =\lambda t(\mu x + 1)\), where x is the dose and t is the time. The parameters \(\lambda\) and \(\mu\) are chosen so that the probabilities of a control animal and a high dose animal experiencing non-tumor death before the scheduled termination date are 0.4 and 0.7 respectively.

  • Tumor onset time is modeled according to the poly-3 assumptions. This means that for any given animal, the probability of tumorigenesis before time t has the form \(P[T \leq t] =\lambda t^{3}\), where the parameter \(\lambda\) is a measure of the animal’s tumor risk, and so depends on the dose x, the background prevalence rate (i.e., the tumor incidence rate when x = 0), and the dose effect on tumorigenesis to be simulated. (In the case of the LFPR simulations, it is assumed that there is no dose effect on tumorigenesis, and \(\lambda\) therefore depends on the background prevalence rate alone.) A sketch of both model components follows this list.
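The sketch below shows one way to calibrate and sample from this competing risks model. Treating the high dose as x = 1 is our assumption (the text does not fix the dose scale), and the function names are ours.

```python
import numpy as np

rng = np.random.default_rng(20250102)
T_END = 104.0  # weeks

# Calibrate h_M(t) = lam * t * (mu*x + 1), with cumulative hazard
# lam * (mu*x + 1) * t**2 / 2, so that P(non-tumor death by week 104)
# is 0.4 at x = 0 and 0.7 at x = 1 (high dose as x = 1 is an assumption).
lam_death = -2.0 * np.log(0.6) / T_END**2
mu = np.log(0.3) / np.log(0.6) - 1.0  # approximately 1.36

def sample_death_time(x, size):
    """Invert the cumulative hazard at a standard exponential draw."""
    e = rng.exponential(size=size)
    return np.sqrt(2.0 * e / (lam_death * (mu * x + 1.0)))

def sample_onset_time(background_prev, size):
    """Poly-3 onset: P[T <= t] = lam_t * t**3, with lam_t chosen so that
    the 104-week prevalence under no dose effect equals background_prev."""
    lam_t = background_prev / T_END**3
    return (rng.uniform(size=size) / lam_t) ** (1.0 / 3.0)

# One control group of 65 animals (design D1) under the null, rare tumor:
death = np.minimum(sample_death_time(0.0, 65), T_END)  # sacrifice at week 104
onset = sample_onset_time(0.005, 65)
tba = onset <= death  # tumor-bearing animals
```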

Although these simulations were devised independently of those in Sect. 12.2, the resulting models are in practice quite similar. Tumor onset times modeled by this approach are very similar to those of the “early onset” models, although non-tumor mortality times tend to be earlier than those simulated in Sect. 12.2. The effect of this difference is likely to be a small reduction in power (and LFPRs) in the present model compared with those used in Sect. 12.2.

3.4.2 Decision Rule

As noted above, we are initially concerned with estimating local power and LFPRs. Accordingly, under each scenario, we simulate data for a single 24 month experiment (male mice, for example), and a single tumor endpoint (cortical cell carcinoma of the adrenal gland, for example). Each set of simulated data includes a death time for each animal and information about whether the animal developed a tumor. From these data, two poly-3 tests (see Sect. 12.4 and Bailer and Portier 1988; Bieler and Williams 1993; US Food and Drug Administration—Center for Drug Evaluation and Research 2001) are conducted: a trend test across all groups, and a pairwise test between the control and high dose groups. As we are using the joint test rule (discussed at length in Sect. 12.2.1), the null hypothesis of no tumorigenic effect is rejected only when both the trend and pairwise tests yield individually significant results, at the levels indicated in Table 12.2.

3.4.3 Misclassification

The use of the observed incidence rate in the control group to classify a tumor as rare or common is potentially problematic. There is clearly a substantial likelihood that common tumors (with a background prevalence rate of 2 %) will be misclassified as rare, and judged against the “wrong” significance thresholds. Given the difference between the significance thresholds used for rare and for common tumors, it is to be expected that this misclassification effect could have an appreciable liberalizing effect on the decision rules used. Furthermore, this liberalizing effect will be amplified by the fact that misclassification is positively associated with low p-values (Footnote 10). This effect was noted, discussed, and even quantified (albeit for different decision rules and simulation scenarios than those used here) in Westfall and Soper (1998).

The probability of misclassification is dependent on both the background prevalence rate of the tumor and the number of animals in the control group. Since D2 has more than 100 animals in the control group, an experiment using this design will treat a particular tumor endpoint as rare if there is no more than one tumor bearing control animal; the other designs will consider an endpoint to be rare only if no tumor bearing control animals are found at all. The effects of this difference are seen in Fig. 12.2.
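Under a simple binomial model, in which each control animal independently bears the tumor with the background prevalence probability (ignoring the mortality adjustment, so this is only an approximation), the classification probabilities underlying Fig. 12.2 can be computed as follows; the helper name and the simplification are ours.

```python
from math import comb

def p_classified_rare(n_control, p_tumor, big_control=100):
    """Probability that an endpoint is treated as rare: no tumor bearing
    control animals for designs with fewer than 100 control animals, at
    most one for designs like D2 with 100 or more."""
    k_max = 1 if n_control >= big_control else 0
    return sum(comb(n_control, k) * p_tumor**k * (1 - p_tumor)**(n_control - k)
               for k in range(k_max + 1))

# Common tumor (2 % background prevalence) under each design's control group:
for design, n in [("D1", 65), ("D2", 104), ("D3", 50), ("D4", 60)]:
    print(design, round(p_classified_rare(n, 0.02), 3))
# D1 0.269, D2 0.382, D3 0.364, D4 0.298 -- D2 is the likeliest to call it rare.
```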

Fig. 12.2 Probability of classifying a tumor type as rare

Under more traditional circumstances, given an exact test procedure and a fixed set of significance standards, one would expect the false positive rate to increase asymptotically to the nominal level as the expected event count increased. However, given the misclassification effect in this context, we expect something different: for tumors with a 2 % background prevalence rate (and for those with a 5 % background rate in the simulation study in Sect. 12.2), the LFPR can be anticipated to converge to a value below the nominal significance level for rare tumors, but above the nominal significance level for common tumors. For tumors with a 10 % background prevalence rate, by contrast, we can expect the LFPR to be somewhat closer to the nominal value for common tumors (Footnote 11).

It is uncertain whether the differential effect of misclassification on the four designs should be viewed as intrinsic to the designs, or as an additional, unequal source of noise. However, given the paucity of relevant historical control data (Rahman and Lin 2009), and given that a two tiered decision rule is in use, there seems to be little alternative to this method for now (Footnote 12). It is the most commonly used method for classifying tumors as rare or common, and we have elected to treat it as an intrinsic feature of the statistical design.

Nonetheless, it should also be remembered that statistical analysis is only one stage in the FDA review process, and that pharmacology and toxicology reviewers are free to exercise their professional prerogative and overturn the empirical determination. This is especially likely for the rarest and commonest tumors. More generally though, it is apparent that this misclassification effect must be taken into account when designing, conducting, and interpreting any simulations to evaluate carcinogenicity studies.

3.5 Power

The results of the power simulations are shown in Fig. 12.3.

Fig. 12.3 Estimated power

Designs D1 and D2 are clearly more powerful than D3 and D4. For very common tumors, there is little difference between the two, but for both rare and common tumors, design D2 appears to be appreciably more powerful than D1. For rare tumors and an effect size of 30 (corresponding to a risk difference (RD) between the control group and the high dose group of 12.6 %), D1 and D2 have approximately 60 and 70 % power respectively. For common tumors and an effect size of 10 (RD = 14.9 %), D1 and D2 have approximately 45 and 55 % power respectively. More generally, Fig. 12.3 suggests that design D2 delivers about 10 % more power than D1 across a fairly wide range of scenarios of interest.
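The RD figures quoted here follow directly from the background prevalence and the odds ratio; a small helper makes the conversion explicit.

```python
def risk_difference(p0, odds_ratio):
    """Convert a background prevalence p0 and an odds ratio into the
    implied risk difference between the high dose and control groups."""
    odds1 = odds_ratio * p0 / (1.0 - p0)
    return odds1 / (1.0 + odds1) - p0

print(risk_difference(0.005, 30))  # rare tumor, OR 30: about 0.126
print(risk_difference(0.02, 10))   # common tumor, OR 10: about 0.149
```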

Since it uses the fewest animals, it is not surprising that D3 is the least powerful of the four designs. Direct comparison of D3 and D1 (which are similar except for the fact that D1 uses 30 % more animals in each group) shows the benefit in power that an increased sample size can bring.

That said, it is striking that the design D4, with 260 animals, is barely more powerful than D3. As we have seen from our comparison of D1 with D3, adding animals in such a way that the groups remain equal in size does increase the power of the design (absent any sawtooth effects). Furthermore, adding animals unequally to the groups can improve the power even more (see Jackson 2015). However, in the case of design D4, the extra animals (compared with D3) were not added with the goal of improving power. Indeed, the notion of optimality which D4 is intended to satisfy is quite different from our narrow goal of maximizing power while keeping the GFPR to approximately 2.5 % (Portier and Hoel 1983):

For our purposes, an optimal experimental design is a design that minimizes the mean-squared error of the maximum likelihood estimate of the virtually safe dose from the Armitage-Doll multistage model and maintains a high power for the detection of increased carcinogenic response.

In addition, the intended maintenance of “a high power for the detection of increased carcinogenic response” was predicated on a decision rule using the trend test alone, with a significance threshold of 0.05—a much more liberal testing regime even than that recommended in US Food and Drug Administration—Center for Drug Evaluation and Research (2001), let alone than the more conservative joint test rule used here.

It is worth noting that the unbalanced approach of design D4 is almost the antithesis of that proposed in Jackson (2015); in the latter, power is maximized by concentrating animals in the control and high dose groups, whereas in D4 they are concentrated in the intermediate groups.

3.6 False Positive Rate

3.6.1 The Local False Positive Rate

For each design, at least 250,000 simulations were conducted to estimate the rate at which the local null hypothesis is rejected when the tumor hazard is unchanged across dose groups. The resulting estimates of the LFPR (with 95 % confidence intervals) for each of the four designs and three tumor types (rare, common, and very common) are shown in Table 12.8.

Table 12.8 Local false positive rates (%) with 95 % confidence intervals

Generally speaking, we can expect two aspects of design to affect the LFPRs. As the sample size increases, the expected number of tumors also increases, which in turn means that exact tests will behave more like asymptotic tests. In particular, we can expect the LFPRs to converge to the nominal α-level as the sample size increases. Since the tests are exact (and hence conservative), this means that the LFPRs will tend to increase with sample size, and since we know that the tests are strongly over-conservative for design D3 (see Sect. 12.3.2), we know that there is considerable room for growth from the levels associated with this design.

The second effect will tend to act in the opposite direction. The determination of significance thresholds depends on the number of tumor bearing animals (TBAs) in the control group: for a design with fewer than one hundred control animals, the tumor type will be considered rare just in case no control animals develop the tumor (see Sect. 12.3.4.3). Thus, increasing the number of control animals (while keeping this number below 100) will increase the likelihood of a tumor type being classified as common. Since the significance thresholds for common tumors are lower than for rare tumors, this effect will make designs with more animals more conservative. (This reasoning does not apply to D2, which has more than 100 control animals. Since for this design at least two tumor bearing control animals must be found in order for a tumor type to be considered common, D2 is more likely than the other designs to classify tumors as rare.) See Table 12.9 for the tumor misclassification rates of the different designs (Footnote 13).

Table 12.9 Probability of misclassification

3.6.1.1 Comparison of D1 and D2

The most striking feature of Table 12.8 is the difference between D1 and D2. As discussed in Sect. 12.3.3, both of these designs are in regular use, and are treated similarly for analysis purposes. However, D2 is prone to far higher false positive rates than D1, raising doubts about whether it is reasonable to treat the two designs interchangeably.

For rare tumor types, the LFPR for D1 is so low as to be almost negligible (approximately 1 in 40,000). Even if there are 200 such endpoints, their combined false positive rate (Eq. (12.3)) will be under 0.5 %. As a result, genuinely rare tumor endpoints do not contribute much to the GFPR under this design. The LFPR for D2 is about 17 times higher than that for D1. While still small in absolute terms, and still strongly over-conservative, this rate is high enough that it would take only a relatively small number of rare endpoints to generate an unacceptably high combined false positive rate. In other words, despite being over-conservative, the LFPR for D2 is not over-conservative enough to meet our goals for the GFPR. See Sect. 12.3.6.2 for a more detailed discussion.

Compounding this problem, the LFPR for common tumors under D2 is exceedingly high: almost 0.6 %. This level is actually above the nominal rate for the trend test for common tumors (0.5 %), and so must be at least partially attributable to the misclassification effect noted above. The LFPR for common tumors under D1 is much lower (about 40 % of the rate under D2), but still far higher than that for rare tumors.

The LFPRs for very common tumors are much closer to each other than the rates for less common tumors (although D2 still has a higher LFPR than D1). This confirms the idea that D2’s very high LFPR for common tumors is largely due to misclassification, as this misclassification effect would be expected to be less pronounced for the very common tumors (see Table 12.9).

3.6.1.2 Comparison of D1 and D3

For both rare and common tumors, D1 has a considerably higher LFPR than D3; about twice as high in both cases. This difference is to be expected, although it is still unsettling that an increase of just 30 % in the sample size (from D3) can result in a doubling of the LFPRs. As the background rate of tumors increases, the difference in over-conservativeness of the designs becomes less influential than D3’s greater tendency toward misclassification as rare, so that D3 actually exhibits a higher LFPR than D1 for very common tumors (see Fig. 12.2).

3.6.1.3 Comparison of D3 and D4

The designs D3 and D4 have comparable LFPRs for rare and common tumors, although D4 has a noticeably lower LFPR for very common tumors than D3. This is somewhat surprising since D4 was not proposed with the specific goal of constraining the Type 1 error rate. Overall, we can say that of the four designs considered, D4 has the lowest false positive rates.

3.6.2 The Global False Positive Rate

As discussed in Sect. 12.3.1, a tumor spectrum must be selected before the GFPR may be estimated. However, the selection of a realistic tumor spectrum is difficult. The endpoints under consideration are the potential tumor types which a pathologist may report if they are found. This list is not fixed from study to study. For example, one pathologist might group tumors found in the cervix or uterus together, whereas another might report these separately. Similarly, one pathologist might group hibernomas (lipomas of brown adipose tissue) with classic (white adipose tissue) lipomas under the general heading lipoma—adipose tissue, whereas another might not. Nonetheless, these variations are relatively minor, and the requirement that pathologists conduct a “complete necropsy,” the widely accepted suggestions of Bergman et al. (2003), and the forthcoming adoption of the SEND data standard (Clinical Data Interchange Standards Consortium (CDISC) 2011) all ensure a reasonable level of uniformity.

However, the data from individual studies only reference tumor types which were actually found in those studies, so that many rare tumor endpoints are not mentioned at all in study reports. The data presented in the compilations by Charles River Laboratories (Giknis and Clifford 2004, 2005) are helpful in this regard. On the basis of the data in these tabulations, we have chosen three tumor spectra (shown in Table 12.10) which seem to form a plausible range for a typical tumor spectrum (Footnote 14).

Table 12.10 Estimated global false positive rates

For two of the three spectra, the GFPR for D1 is close to the target level of 2.5 %, and even for S3, it is “only” twice the desired level. However, the GFPRs for D2 consistently exceed the target GFPR by a very wide margin. The GFPR for D3 is close to the target, confirming the results of Sect. 12.2. The conservativeness of D4 carries over to the calculation of the GFPR which is actually below the target level for the thinner spectra S1 and S2, and close to 2.5 % for S3.

Some care must be taken when interpreting these results. The LFPR for a tumor type with a true background prevalence rate of, say, 0.1 % is likely to be somewhat lower than that for a tumor with a background prevalence rate of 0.5 % (our rare tumor exemplar). This is unlikely to affect the estimates of the GFPRs for D1, D3, and D4, since the LFPR for rare tumors is so low for these designs that the GFPR is largely insensitive to the number of rare endpoints in the spectrum. This is not true of D2, though. A spectrum of 100 independent tumor types, half with a background prevalence rate of 0.1 % and half with a rate of 0.5 %, might yield an appreciably lower GFPR under D2 than a spectrum of 100 independent tumor types each with a rate of 0.5 %. However, this effect can only explain a small part of the difference in GFPR between D1 and D2. For example, even for S2, with just 10 non-rare endpoints, the 6 common and 4 very common endpoints between them contribute over 40 % of the GFPR: if all the rare endpoints were disregarded for D2, the GFPR for S2 would still be 3.9 %, above the 2.5 % target and higher than the GFPR of any of the other designs.

3.7 Discussion

3.7.1 Comparison of Currently Used Designs

There are clearly substantial differences between designs D1 and D2. Figure 12.3 shows that D2 has appreciably more power than D1 over a range of meaningful scenarios—in many cases close to 10 % more—although the difference is slight for very common tumors. But as demonstrated in Tables 12.8 and 12.10, D2 also suffers from a considerably worse false positive rate than D1. Taken together, these two observations lead to the conclusion that D2 is more liberal than D1.

It is reasonable to consider phasing out the use of D2 altogether. The duplicate control design was originally introduced to test for the effects of extra-binomial within-study variability between the two concurrent control groups (Baldrick and Reeve 2007; Haseman et al. 1986; US Food and Drug Administration—Center for Drug Evaluation and Research 2001); a significant difference between the control groups is taken as an indication of a failure of experimental conduct, such as when two groups of cages are subject to different environmental conditions. However, comparisons between control groups are clearly underpowered for detecting such environmental effects, especially given how hard it is to detect even moderately strong treatment effects.

If D2 is to be retained as an acceptable design, the fact that it is more liberal than D1 needs to be taken into account. At the very least, it seems inappropriate to continue to use the same significance thresholds for designs D1 and D2.

Designs D1 and D3 are very similar, differing only in the total number of animals used; both designs distribute animals equally among the single vehicle and three dose groups. Differences between these designs’ statistical properties are therefore entirely attributable to differences in sample size. D1 is appreciably more powerful than D3 across a wide range of scenarios, but also suffers from slightly higher false positive rates. In the case of D3, the GFPR tends to be well below the target rate of 2.5 %, confirming the findings of Sect. 12.2 that there is room to use more liberal significance thresholds (see Table 12.6), and thereby increase power. However, this does not appear to be the case for the more widely used D1 design.

In general, the differences between the three designs’ statistical properties are not negligible, meaning that our ability to draw conclusions about the behavior of one design from the study of another is limited. This is especially problematic since a great deal of our understanding of rodent carcinogenicity studies has its foundations in studies of design D3, but this design is fading from popularity.

3.7.2 The Design D4

The unbalanced design D4 was not optimized to maximize power using the sort of decision rule currently in place, and so it is not surprising that it is substantially less powerful than the other 260 animal designs. However, it is striking that the distribution of animals does yield a very low false positive rate (comparable to D3, although better for very common tumors). Compared with the traditional 200 animal design, then, the addition of 60 animals to D4 yields a real but modest benefit in both power and Type 1 error. It is reasonable to think that these extra animals instead provide considerable added information about the virtually safe dose (that being the intent behind this design), but testing this notion is outside the scope of this study.

3.7.3 The Balance Between Type 1 and Type 2 Error

The goal of achieving satisfactory power whilst keeping the GFPR to approximately 2.5 % is difficult to attain, even aside from the ambiguity over what exactly constitutes “satisfactory power”.

Inspection of Fig. 12.3 shows that for a tumor with a background prevalence rate of 0.5 %, design D1 delivers at least 50 % power when the effect size is above 27 (risk difference (RD) = 11.4 %), and 75 % when the effect size is above 42 (RD = 16.9 %). For a tumor with a background prevalence rate of 2 %, the power is above 50 % when the effect size is above 11 (RD = 16.3 %) and above 75 % when the effect size is above about 16 (RD = 22.6 %).

But even if we conclude that these levels of power are adequate, we are faced with the fact that the GFPR for D1 is generally somewhat above the target rate of 2.5 %. Lowering the significance thresholds to further limit the GFPR could only be done at the expense of the power, and so seems unwise, given that these studies are essentially safety studies.

Jackson (2015) investigates an alternative design which delivers considerably more power than even D2 (the most powerful design considered here) while simultaneously lowering the GFPR to a level similar to or below that associated with D1.

4 The Exact Poly-k Test

4.1 Introduction

The poly-k method is a mortality adjusted trend test for tumor incidence. This method was originally suggested in Bailer and Portier (1988), and was improved in Bieler and Williams (1993).

As some tumors may have long latency periods, animals with shorter life spans may face a disproportionately reduced risk of tumor onset. The poly-k method suggests correcting this problem by adjusting the number \(n_{i}\) of animals at risk in the ith dose group to compensate for early deaths. Operationally, the jth animal in the ith dose group gets a score \(w_{ij} \leq 1\); this score is 1 if the animal lives for the full study period (T), or develops the tumor type being tested before dying. Conversely, if this animal dies at time \(t_{ij} <T\) before the end of the study without developing the tumor being tested, it gets a score of

$$\displaystyle{w_{ij} = \left (\frac{t_{ij}} {T} \right )^{k} <1.}$$

The adjusted group size for Group i is then defined as

$$\displaystyle{n_{i}^{{\ast}} =\sum _{j}w_{ij}.}$$

As an interpretation, an animal with score \(w_{ij} = 1\) can be considered as a whole animal, while an animal with score \(w_{ij} <1\) can be considered as a partial animal. The adjusted group size \(n_{i}^{{\ast}}\) is equal to \(n_{i}\) (the original group size) if all the animals in the group either survive until the end of the study or develop at least one tumor of the type being tested; otherwise the adjusted group size is less than \(n_{i}\), except in some marginal cases due to rounding. These adjusted group sizes are then used to perform trend and pairwise (between treated groups and the control group) tests of tumor incidence rates using the Cochran-Armitage test procedure (Armitage 1955).
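
As a concrete illustration of this weighting scheme, the following minimal Python sketch computes the scores \(w_{ij}\) and the adjusted group size \(n_{i}^{{\ast}}\) from per-animal survival times and tumor indicators; the data layout and function names are ours, not part of any standard package.

```python
# Minimal sketch of the poly-k weighting scheme. Each animal contributes
# weight 1 if it survives to terminal sacrifice at week T or develops the
# tumor under test; otherwise it contributes (t/T)^k.

def poly_k_weight(t, tumor, T=104.0, k=3.0):
    """Score w_ij for one animal: 1 if it reaches T or bears the tumor,
    else (t_ij / T)^k."""
    return 1.0 if (tumor or t >= T) else (t / T) ** k

def adjusted_group_size(times, tumors, T=104.0, k=3.0):
    """n_i* = sum over j of w_ij for one dose group."""
    return sum(poly_k_weight(t, x, T, k) for t, x in zip(times, tumors))

# Example group: three survivors and one tumor-free death at week 70.
print(adjusted_group_size([104, 104, 104, 70], [False, False, False, False]))
# 3.0 + (70/104)**3, approximately 3.305
```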

One critical point to consider when using the poly-k test is the choice of an appropriate value of k, which depends on the pattern of tumor incidence with increasing dose. For standard long term 104 week rat and mouse studies, a value of k = 3 is suggested in the literature.Footnote 15 In this case we refer to the procedure as the poly-3 test. It should be noted that the Cochran-Armitage test assumes that the marginal totals \(n_{i}\) are fixed; here, however, \(n_{i}^{{\ast}}\) is a random variable. As a result, the calculation of the variance of the test statistic needs to be modified. An estimate of this variance, using the delta method and the weighted least squares technique, is suggested in Bieler and Williams (1993).

It may be noted that, unlike the methods suggested in Peto et al. (1980), the poly-k analysis does not require information on the context of observation of the tumor (i.e., whether the tumor was observed in an incidental or fatal context), which is a major advantage of this method over the Peto method.

4.2 The Exact Poly-k Method

The outcome of the experiment, for a specific tumor endpoint, can typically be summarized by a results table such as Table 12.11. Replacing the number of animals in each cell by the corresponding adjusted group sizes, we get a new results table, Table 12.12.

Table 12.11 Results table for single endpoint without survival adjustments
Table 12.12 Results table for single endpoint with survival adjustments

A naïve exact poly-k test could now be conducted by performing the Cochran-Armitage test using the data in Table 12.12. However, in order to use the exact Cochran-Armitage test, the row and column totals for all permuted configurations of the observed table must be fixed. Since the calculation of the \(n_{i}^{{\ast}}\) terms (the column totals) depends on the survival pattern of the animals, these terms cannot be assumed to be fixed, and this naïve use of the Cochran-Armitage test is not correct. For an appropriate exact test, the adjusted column totals must be recalculated for every permutation of all animals.

We illustrate this method by considering a simple example:

4.2.1 Illustrative Example

Consider an experiment with two dose groups and five animals per group, continued up to 104 weeks. Suppose that the observed data for a specific tumor type are as shown in Table 12.13. Calculating the adjusted group sizes, and rounding so we may use discrete tests, we have

$$\displaystyle\begin{array}{rcl} n_{0}^{{\ast}}& =& \mathrm{Round}\left (0.46 + 1.00 + 1.00 + 0.79 + 1.00\right ) =\mathrm{ Round}(4.25) = 4 {}\\ n_{1}^{{\ast}}& =& \mathrm{Round}\left (1.00 + 0.84 + 1.00 + 0.11 + 1.00\right ) =\mathrm{ Round}(3.95) = 4. {}\\ \end{array}$$
Table 12.13 Raw output for example

Tables 12.14 and 12.15 summarize these data for analysis in the styles of Tables 12.11 and 12.12. The table p-values for the asymptotic Cochran-Armitage tests are p = 0.06681 for the unadjusted data and p = 0.06332 for the adjusted data.

Table 12.14 Example of results table without survival adjustments
Table 12.15 Example of results table with survival adjustments
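
For readers who wish to reproduce these numbers, the sketch below implements the one-sided asymptotic Cochran-Armitage trend test with the conditional (hypergeometric) variance; applied to the unadjusted and adjusted tables above, it returns the quoted p-values. The function name and data layout are ours.

```python
# Sketch of the one-sided asymptotic Cochran-Armitage trend test using the
# conditional (hypergeometric) variance of the statistic T = sum x_i d_i.
from math import sqrt, erf

def ca_trend_pvalue(x, n, d):
    """x: tumor-bearing animals per group; n: (possibly adjusted) group
    sizes; d: dose scores. Returns the one-sided asymptotic p-value."""
    N, X = sum(n), sum(x)
    t_obs = sum(xi * di for xi, di in zip(x, d))
    mean = X * sum(ni * di for ni, di in zip(n, d)) / N
    s2 = sum(ni * di**2 for ni, di in zip(n, d)) / N \
        - (sum(ni * di for ni, di in zip(n, d)) / N) ** 2
    var = X * (N - X) / (N - 1) * s2          # conditional variance of T
    z = (t_obs - mean) / sqrt(var)
    return 0.5 * (1 - erf(z / sqrt(2)))       # upper-tail normal p-value

print(ca_trend_pvalue(x=[0, 2], n=[5, 5], d=[0, 1]))  # ~0.06681 (unadjusted)
print(ca_trend_pvalue(x=[0, 2], n=[4, 4], d=[0, 1]))  # ~0.06332 (adjusted)
```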

4.2.2 Exact Method

An exact methodology for the poly-k test is possible. In the following we will describe this method.

The approach is combinatorial; we consider permutations of the animals among the dose groups. Each permutation generates a summary table of the form of Table 12.15 (although many permutations may generate the same table). For each table, we calculate the test statistic T defined by:

$$\displaystyle{ T =\sum _{ i=0}^{r}x_{ i}d_{i} }$$
(12.4)

where \(d_{i}\) is the dose score of group i and \(x_{i}\) is the number of tumor-bearing animals (TBAs) in group i. The table p-value of a given outcome with test statistic T = t can be calculated using the hypergeometric distribution.

We define an equivalence relation on permutations, saying that two permutations \(p_{1}\) and \(p_{2}\) are equivalent (\(p_{1} \sim p_{2}\)) if they both allocate the same number of TBAs to each dose group. It is clear that the test statistic T respects this equivalence relation, but it is possible to have \(T_{p_{1}} = T_{p_{2}}\) without \(p_{1} \sim p_{2}\) (Table 12.16).

Table 12.16 Output after a typical permutation

For this permutation, we have

$$\displaystyle\begin{array}{rcl} n_{0}^{{\ast}}& =& \mathrm{Round}\left (0.46 + 1.00 + 0.84 + 0.11 + 1.00\right ) =\mathrm{ Round}(3.41) = 3 {}\\ n_{1}^{{\ast}}& =& \mathrm{Round}\left (1.00 + 0.7865 + 1.00 + 0.3602 + 1.00\right ) =\mathrm{ Round}(4.1467) = 4. {}\\ \end{array}$$

The resulting summary table is displayed in Table 12.17:

Table 12.17 Summary table for permuted data

We now calculate the probability that a randomly selected permutation induces a particular table of permuted values (such as Table 12.17). The random variables \(x_{0}\) and \(x_{1}\) can be assumed to be drawn from a hypergeometric distribution \(g(n_{0}^{{\ast}},n_{1}^{{\ast}})\) whose parameters \(n_{0}^{{\ast}}\) and \(n_{1}^{{\ast}}\) are themselves random; the probability that a particular such distribution arises is denoted \(\Pr \left [g(n_{0}^{{\ast}},n_{1}^{{\ast}})\right ]\). The probability of observing a particular table is therefore:

$$\displaystyle{ \Pr \big[x_{0},x_{1},n_{0}^{{\ast}},n_{ 1}^{{\ast}},g(n_{ 0}^{{\ast}},n_{ 1}^{{\ast}})\big] =\Pr \left [g(n_{ 0}^{{\ast}},n_{ 1}^{{\ast}})\right ] \cdot \Pr \big [x_{ 0},x_{1},n_{0}^{{\ast}},n_{ 1}^{{\ast}}\big\vert g(n_{ 0}^{{\ast}},n_{ 1}^{{\ast}})\big] }$$
(12.5)

where

$$\displaystyle{ \Pr \big[x_{0},x_{1},n_{0}^{{\ast}},n_{ 1}^{{\ast}}\big\vert g(n_{ 0}^{{\ast}},n_{ 1}^{{\ast}})\big] = \left.\frac{{n_{0}^{{\ast}}\choose x_{ 0}}{n_{1}^{{\ast}}\choose x_{ 1}}} {{n^{{\ast}}\choose x}} \right /C\left (x_{0},x_{1}\right ) }$$
(12.6)

with \(C(x_{0},x_{1})\) the total number of ways in which the x TBAs and the \(n - x\) non-tumor-bearing animals (NTBAs) can be arranged into groups of size \(n_{0}\) and \(n_{1}\) such that exactly \(x_{0}\) of the TBAs are in Group 0. This quantity is given by:

$$\displaystyle{ C\left (x_{0},x_{1}\right ) ={ x\choose x_{0}}{n - x\choose n_{0} - x_{0}}{x - x_{0}\choose x_{1}}{(n - x) - (n_{0} - x_{0})\choose n_{1} - x_{1}}. }$$
(12.7)

Note that \(C\left (x_{0},x_{1}\right )\) is equal to the total number of ways in which the n animals can be arranged so that \(x_{0}\) of the tumor-bearing animals are in Group 0 and the remaining \(x_{1}\) are in Group 1. For r + 1 treatment groups, the general formula for \(C(x_{0},x_{1},\ldots,x_{r})\) is

$$\displaystyle{ C(\mathbf{x}) =\prod _{ i=0}^{r-1}{x -\sum _{ k<i}x_{k}\choose x_{i}}{(n - x) -\sum _{k<i}(n_{k} - x_{k})\choose n_{i} - x_{i}}. }$$
(12.8)

with \(x = x_{0} + x_{1} +\ldots +x_{r}\), and with empty sums taken to be zero.

For example, for Table 12.14, with \(x_{0} = 0\), \(x_{1} = 2\), \(n_{0} = 5\), and \(n_{1} = 5\), we have \(x = 0 + 2 = 2\) and \(n = 5 + 5 = 10\), and

$$\displaystyle{C(0,2) ={ 2\choose 0}{10 - 2\choose 5 - 0}{(2 - 0)\choose 2}{(10 - 2) - (5 - 0)\choose 5 - 2} = 56.}$$

The complete list of these 56 arrangements of 10 animals is given in Table 12.18. For each permutation p in this list, the test statistic \(T_{p}\) is equal to \(0 \times 0 + 2 \times 1 = 2\). Furthermore, since there are just two groups, this is an exhaustive list of all permutations satisfying \(T_{p} = 2\). The probability that T = 2 is therefore calculated by adding the probabilities for each of these 56 permutations, using Eq. (12.5).

Table 12.18 All possible arrangements of 10 animals at risk with 0 TBAs in Group 0 and 2 TBAs in Group 1
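
The counting formula of Eq. (12.8) is straightforward to implement directly; the sketch below (function name and layout ours) reproduces the count of 56 arrangements for the example above.

```python
# Sketch of Eq. (12.8): the number of permutations C(x) allocating x_i
# tumor-bearing animals (TBAs) to group i, with group sizes n_i.
from math import comb

def n_arrangements(x, n):
    """x: TBAs per group (x_0..x_r); n: group sizes (n_0..n_r)."""
    X, N = sum(x), sum(n)
    total, tba_used, non_tba_used = 1, 0, 0
    for i in range(len(x) - 1):                   # i = 0, ..., r-1
        total *= comb(X - tba_used, x[i])         # place TBAs in group i
        total *= comb((N - X) - non_tba_used, n[i] - x[i])  # place NTBAs
        tba_used += x[i]
        non_tba_used += n[i] - x[i]
    return total

print(n_arrangements(x=[0, 2], n=[5, 5]))  # 56, matching the text
```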

As a simple choice, the \(\Pr [g(n_{0}^{{\ast}},n_{1}^{{\ast}})]\) can be taken to be equal for each \(g(n_{0}^{{\ast}},n_{1}^{{\ast}})\), and can be determined from the identity

$$\displaystyle\begin{array}{rcl} & & \sum \Pr \left [x_{0},x_{1},n_{0}^{{\ast}},n_{ 1}^{{\ast}},g(n_{ 0}^{{\ast}},n_{ 1}^{{\ast}})\right ] \\ & & \quad =\sum \Pr \left [g(n_{0}^{{\ast}},n_{ 1}^{{\ast}})\right ]\Pr \left [x_{ 0},x_{1},n_{0}^{{\ast}},n_{ 1}^{{\ast}}\vert g(n_{ 0}^{{\ast}},n_{ 1}^{{\ast}})\right ] = 1{}\end{array}$$
(12.9)

so that

$$\displaystyle{ \Pr \left [g(n_{0}^{{\ast}},n_{ 1}^{{\ast}})\right ] = \frac{1} {\sum \Pr \left [x_{0},x_{1},n_{0}^{{\ast}},n_{1}^{{\ast}}\vert g(n_{0}^{{\ast}},n_{1}^{{\ast}})\right ]}. }$$
(12.10)

In this respect, the factor \(\Pr \left [g(n_{0}^{{\ast}},n_{1}^{{\ast}})\right ]\) can also be considered as a normalizing factor. Therefore, operationally we first calculate \(\sum \Pr \left [x_{0},x_{1},n_{0}^{{\ast}},n_{1}^{{\ast}}\vert g(n_{0}^{{\ast}},n_{1}^{{\ast}})\right ]\) and then normalize using \(\Pr \left [g(n_{0}^{{\ast}},n_{1}^{{\ast}})\right ]\).

It may be noted here that, for the use of the hypergeometric distribution, although we round the \(n_{i}^{{\ast}}\)s to the nearest integer, it is possible to use the ceiling or floor functions instead. This is a matter of individual discretion. The choice will have some effect on the calculation of the p-value if the sample size is very small (as would have been the case with our example, with five animals per group); however, for moderately large sample sizes the choice will have minimal effect. It should be noted that, as discussed in Sect. 12.3.3, regular carcinogenicity studies have 50–70 animals per group.

4.3 Second Example

We further illustrate the use and properties of our method with another example:

4.3.1 Framework of Second Example

4.3.1.1 Design

Two dose groups with dose levels 0 and 1. Twelve animals, with animal numbers 1, …, 12, are randomly divided into two treatment groups, G0 and G1, of six animals each. Terminal sacrifice is at Week 104. The example data are presented in Table 12.19. The observed value of the test statistic T is \(1 \times 0 + 4 \times 1 = 4\).

Table 12.19 Observed data from second example

As described in Sect. 12.4.2.1, we can calculate the distribution of T; the distribution is shown in Table 12.20. Since the observed value of T is 4, the p-value of the test is \(\Pr [T \geq 4]\). Using the exact poly-3 test, we get a p-value of 0.12905. For the unadjusted Cochran-Armitage test, the p-value is \(0.11364 + 0.00758 = 0.12122\). For the asymptotic one-tailed test (using StatXact), it is 0.05124.

Table 12.20 Distribution of test statistic for second example

It should be noted that this method is based on an extensive computational procedure requiring the evaluation of all possible allocations of the animals to the r + 1 dose groups. This computational complexity is a major challenge for the application of the proposed method in the analysis of real studies. However, some modifications of commercially available software for calculating probabilities under the hypergeometric distribution may facilitate these calculations.

5 Modified Exact Poly-3 Method

Since the exact poly-3 method described in Sects. 12.4.2.1 and 12.4.3 is computationally prohibitive for group sizes of 50 or more, a more practical alternative is to estimate p-values from a sample of permutations. A survival-adjusted exact randomization trend test procedure using such permutation samples was proposed in Mancuso et al. (2002); that test is carried out using PROC STRATIFY under the assumption of fixed row and column sums. In order to reduce the biases caused by this fixed-margins assumption and by the binomial null variance estimate of Mancuso et al. (2002), we propose a modified exact poly-3 trend test that can be regarded as an exact version of the poly-3 test of Bieler and Williams (1993).

5.1 Motivating Problem

As discussed earlier (see Sect. 12.4.2, and especially Table 12.13), animal survival time is not a fixed quantity. The adjusted quantal response estimates, \(p_{i}^{{\ast}} = x_{i}/n_{i}^{{\ast}}\), are actually ratios of linear statistics. Hence, the numerators and denominators of these estimates are both subject to random variation.

5.2 The Modified Poly-3 Trend Test

Quantal response tests that focus on crude lifetime tumor incidence rates and make no adjustment for differences in survival across dose groups are often biased, since they implicitly assume that all animals are at equal risk of developing a tumor over the course of the study. As mentioned in Sect. 12.4.1, in order to address this issue, Bailer and Portier (1988) introduced a modification of the Cochran-Armitage test for trend that adjusts for differences in treatment lethality while requiring no assumptions regarding tumor lethality and no changes to the study design. This poly-3 trend test incorporates a weighting scheme that allows fractional information into the analysis for animals not at full risk of tumor development. The weighting scheme essentially modifies the denominators of the crude quantal response estimates of lifetime tumor incidence so that they more closely approximate the total number of animal-years at risk in each experimental group.

Using this weighting scheme and the notation used previously, we define \(p_{i}^{{\ast}} = x_{i}/n_{i}^{{\ast}}\) as the adjusted quantal response estimate of lifetime tumor incidence in group i; and

$$\displaystyle{ p^{{\ast}} = \frac{\sum _{i}x_{i}} {\sum _{i}n_{i}^{{\ast}}} }$$
(12.11)

as the experiment-wide adjusted quantal response estimate of lifetime tumor incidence; and \(q_{i}^{{\ast}} = 1 - p_{i}^{{\ast}}\) and \(q^{{\ast}} = 1 - p^{{\ast}}\). As previously stated, binomial null variance estimates of \(p_{i}^{{\ast}}\) do not apply since they are based on the assumption that the number of animals at risk is fixed. However, by using a Taylor expansion, a pooled null variance estimate for \(p_{i}^{{\ast}}\) can be found:

$$\displaystyle{ \mathrm{var}_{0}\left (p_{i}^{{\ast}}\right ) \approx \left ( \frac{n_{i}} {n_{i}^{{\ast}}}\right )^{2} \cdot \frac{\sum _{i}\sum _{j}(r_{ij} -\bar{ r}_{i})^{2}} {n - (g + 1)}. }$$
(12.12)

where \(n =\sum _{i}n_{i}\), \(r_{ij} = x_{ij} - p_{i}^{{\ast}}w_{ij}\) (with \(x_{ij}\) the tumor indicator for animal j in group i), and g + 1 is the number of experimental groups in the study.

A computational formula for the modified Cochran-Armitage test statistic, referred to as the ratio test, proposed in Bieler and Williams (1993) and denoted by \(Z_{r}\), is given as follows:

$$\displaystyle{ Z_{r} = \frac{\sum _{i}a_{i}p_{i}^{{\ast}}d_{i} -\left.\left (\sum _{i}a_{i}d_{i}\right )\left (\sum _{i}a_{i}p_{i}^{{\ast}}\right )\right /\sum _{i}a_{i}} {\sqrt{C\left (\sum _{i } a_{i } d_{i }^{2 } - \left.\left (\sum _{i } a_{i } d_{i } \right ) ^{2 } \right / \sum _{i } a_{i } \right )}} }$$
(12.13)

where

$$\displaystyle{C =\sum _{i}\sum _{j}\frac{\left (r_{ij} -\bar{ r}_{i}\right )^{2}} {n - (g + 1)}\quad \mathrm{and}\quad a_{i} = \frac{n_{i}} {\left (n_{i}^{{\ast}}\right )^{2}}.}$$
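
As a concrete rendering of Eqs. (12.12) and (12.13), the following Python sketch computes \(Z_{r}\) from per-animal poly-3 weights and tumor indicators; the data layout and names are ours, and the residuals follow \(r_{ij} = x_{ij} - p_{i}^{{\ast}}w_{ij}\) as above.

```python
# Sketch of the Bieler-Williams ratio statistic Z_r of Eq. (12.13),
# computed from per-animal poly-3 weights w_ij and tumor indicators x_ij.
import numpy as np

def z_ratio(weights, tumors, doses):
    """weights/tumors: lists (one entry per group) of per-animal arrays;
    doses: dose score d_i per group. Returns Z_r."""
    g1 = len(doses)                               # g + 1 groups
    n = sum(len(w) for w in weights)
    a, p, resid_ss = [], [], 0.0
    for w, x in zip(weights, tumors):
        w, x = np.asarray(w, float), np.asarray(x, float)
        n_star = w.sum()                          # adjusted group size n_i*
        p_star = x.sum() / n_star                 # adjusted response p_i*
        r = x - p_star * w                        # residuals r_ij
        resid_ss += ((r - r.mean()) ** 2).sum()
        a.append(len(w) / n_star**2)              # a_i = n_i / (n_i*)^2
        p.append(p_star)
    a, p, d = map(np.asarray, (a, p, doses))
    c = resid_ss / (n - g1)
    num = (a * p * d).sum() - (a * d).sum() * (a * p).sum() / a.sum()
    den = np.sqrt(c * ((a * d**2).sum() - (a * d).sum() ** 2 / a.sum()))
    return num / den
```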

5.3 Permutational Distribution for the Modified Poly-3 Test

Exact methods are preferable for sparse data. Permutation tests consider all possible assignments of animals to dose groups as equally likely, while fixing the rest of the information obtained in the experiment. Under the null hypothesis of no treatment effect, this results in an exact conditional distribution of the test statistic when intercurrent mortality patterns are equal across groups, and it is asymptotically correct when the mortality patterns are unequal (Fairweather et al. 1998; Heimann and Neuhaus 1998). Given a data set, consider all, say M, possible allocations of animals to groups while keeping the observed data for each animal fixed. Corresponding to these M arrangements, we may obtain M values of the test statistic. The permutational distribution of the modified poly-3 test statistic results from assigning equal probability to each of these M values. Letting \(Z_{r}^{{\ast}}\) be the observed value, the p-value is the proportion of the M values that are at least as extreme as \(Z_{r}^{{\ast}}\). By exhaustive enumeration, the computation for the p-value using the permutational distribution for the test is straightforward and efficient if the number of animals in the study is small. For data involving large numbers of subjects, the p-value associated with the permutational distribution of the test statistic may be approximated by a sample of the set of all permutations.
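
A minimal Monte Carlo rendering of this permutational procedure is sketched below; it reuses the z_ratio function from the previous sketch, reallocates each animal's fixed \((w_{ij},x_{ij})\) record to groups at random, and reports the fraction of resampled statistics at least as extreme as the observed one. Names and layout are ours.

```python
# Monte Carlo sketch of the permutational p-value for Z_r. Each animal's
# (weight, tumor indicator) record travels with it under reallocation.
import numpy as np

def permutation_pvalue(weights, tumors, doses, n_perm=5000, seed=0):
    rng = np.random.default_rng(seed)
    sizes = [len(w) for w in weights]
    w_all = np.concatenate([np.asarray(w, float) for w in weights])
    x_all = np.concatenate([np.asarray(x, float) for x in tumors])
    z_obs = z_ratio(weights, tumors, doses)       # observed statistic
    hits = 0
    for _ in range(n_perm):
        idx = rng.permutation(len(w_all))
        w_p, x_p, start = [], [], 0
        for s in sizes:                           # reassemble the groups
            w_p.append(w_all[idx[start:start + s]])
            x_p.append(x_all[idx[start:start + s]])
            start += s
        if z_ratio(w_p, x_p, doses) >= z_obs:
            hits += 1
    return hits / n_perm
```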

5.4 Simulations and Results

We conducted a Monte Carlo simulation study to evaluate the following tests: the exact version of the modified poly-3 test (Bieler and Williams 1993), the exact version of the poly-3 test (Bailer and Portier 1988) and PROC STRATIFY in two simulation designs (Dinse 1985; Portier et al. 1986). For each configuration, 10,000 simulated data sets were generated and tested by various methods at the nominal significance level α = 0. 05. Additionally, these methods were tested against the significance thresholds described in Table 12.1 for a standard study (where, as in Sects. 12.2 and 12.3, rarity was determined by the incidence rate in the control group).

In conducting the exact versions of the modified and unmodified poly-3 tests using the permutational distribution, p-values were estimated from samples of 5000 permutations.

5.4.1 Monte Carlo Simulation Design 1

A typical bioassay design with four groups of 50 animals each and an experimental duration of 104 weeks is used in the study. The design is simulated to have a single terminal sacrifice at the end of the experiment, as in the customary long-term rodent bioassay. The dose levels used are (0, 1, 2, 3) across groups. The three independent variables \(T_{0}\) (time to tumor onset), \(T_{2}\) (time from tumor onset until death from the tumor), and \(T_{1}\) (time until death from a competing risk) are used to model animal tumorigenicity data. These variables are generated from the modified Weibull distributions used by Portier et al. (1986) and others in the literature (Ahn and Kodell 1995; Chang et al. 2000; Kodell and Ahn 1997; Kodell et al. 1994). The survival function for \(T_{0}\) is

$$\displaystyle{S(t) =\exp \left [-\delta _{1}(t/104)^{\delta _{2}}\right ]}$$

with \(\delta _{2} \in \{ 1.5,3,6\}\) and \(\delta _{1}\) chosen so that the probability of tumor onset by the end of the study attains the desired rate. Since the study is concerned with rare events, tumor rates between 0.01 and 0.15 are used.

The survival function for T 1 is

$$\displaystyle{ Q(t) =\exp \left [-\phi \left (\gamma _{1}t +\gamma _{2}t^{\gamma _{3}}\right )\right ] }$$
(12.14)

with \(\gamma _{1} = 10^{-4}\), \(\gamma _{2} = 10^{-16}\) and \(\gamma _{3} = 7.425531\), and the value \(\phi \) chosen such that the competing-risks survival rate, with respect to all causes of death except the tumor of interest, at 104 weeks is either 0.5 for all groups or (0.5, 0.4, 0.3, 0.2) across groups. The control survival rate chosen represents the one recently observed in the NTP studies for male Fischer 344 rats (Haseman et al. 1998), although it is somewhat below average for B6C3F1 mice and F344∕N female rats in the NTP feeding studies.

For simplicity, the survival function for T 2 has the same form as Q(t) with the same values of γ 1, γ 2 and γ 3.
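
The sketch below shows one way to generate \(T_{0}\) and \(T_{1}\) from the distributions just described, inverting S(t) analytically and Q(t) numerically; the parameter-solving helpers are our own scaffolding under the stated parameter values, not the original simulation code.

```python
# Sketch of the simulation design: tumor-onset times T0 from
# S(t) = exp[-delta1 (t/104)^delta2], and competing-risk death times T1
# from Q(t) = exp[-phi (gamma1 t + gamma2 t^gamma3)], inverted numerically.
import numpy as np
from scipy.optimize import brentq

G1, G2, G3 = 1e-4, 1e-16, 7.425531

def delta1_for_onset_prob(p_onset):
    """Choose delta1 so P(T0 <= 104) = p_onset (note (104/104)^delta2 = 1)."""
    return -np.log(1.0 - p_onset)

def sample_t0(rng, p_onset, delta2, size):
    d1 = delta1_for_onset_prob(p_onset)
    u = rng.uniform(size=size)
    return 104.0 * (-np.log(1.0 - u) / d1) ** (1.0 / delta2)

def phi_for_survival(s104):
    """Choose phi so Q(104) = s104 (e.g. 0.5 in the control group)."""
    return -np.log(s104) / (G1 * 104.0 + G2 * 104.0**G3)

def sample_t1(rng, s104, size):
    phi = phi_for_survival(s104)
    u = rng.uniform(size=size)
    f = lambda t, target: phi * (G1 * t + G2 * t**G3) - target
    # Invert Q(t) = 1 - u by root-finding on [0, 1e4] weeks.
    return np.array([brentq(f, 0.0, 1e4, args=(-np.log(1.0 - ui),))
                     for ui in u])

rng = np.random.default_rng(1)
t0 = sample_t0(rng, p_onset=0.05, delta2=3.0, size=50)
t1 = sample_t1(rng, s104=0.5, size=50)
```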

5.4.2 Results of First Monte Carlo Simulation

The results of the first set of simulations are presented in Table 12.21 (common tumors) and Table 12.22 (rare tumors). In these tables, Proposed refers to our proposed method and Mancuso refers to the method described in Mancuso et al. (2002). For the trend test, the multiplicity adjustment decision rule (referred to as Adjusted α in the tables) is that common and rare tumors are tested at the 0.005 and 0.025 significance levels, respectively.

Table 12.21 Null hypothesis rejection rates for Monte Carlo simulation of modified poly-3 trend test and the poly-3 test described in Bailer and Portier (1988) and Mancuso et al. (2002)
Table 12.22 Summary sizes of tests of rare tumors with spontaneous incidence rates 0–1 % in simulation design I

The results in Table 12.21 indicate that there were inflations of the Type 1 error under both the adjusted α and the 0.05 significance levels in all cases. The proposed method (based on Bieler and Williams 1993) and the method of Mancuso et al. (2002) have very similar power. The results in Table 12.22 show that, in the four treatment group design with rare tumors with background incidence rates between 0 and 1 %, Type 1 error was inflated under both the proposed and Mancuso’s methods, similar to the inflation seen for common tumors; however, the proposed method has slightly better control of the Type 1 error than Mancuso’s method.

5.4.3 Second Set of Monte Carlo Simulations

We also conducted a second set of simulations to investigate the modified poly-3 trend test. For this set, thirty-six simulation models were obtained by varying the levels of the four factors described in Dinse (1985) and presented in Sect. 12.2.

In general we observed that:

  1. 1.

    With the multiplicity adjustment, sizes are all at most 0.041. An inflation of size still occurs in 8 out of 12 cases: seven in the range 0.0071–0.03 and one at 0.041, based on the adjusted levels of significance.

  2. 2.

    When a small dose effect on tumor rate, a high background tumor rate and later tumor appearance were simulated, power appeared to be lower.

  3. 3.

    Some of the high background tumor rate cases have very low levels of power. In these cases, using just 5000 replicates to estimate the p-values was inadequate.

5.5 Test Results Using Some NDA Datasets

Three NDA submissions were randomly chosen from the set of submissions reviewed by the authors to compare the proposed method with PROC STRATIFY. The permutation sample sizes were all around 6000. There were roughly tenfold differences in p-values between the proposed and Mancuso’s methods. We investigated the significant cases further to see how different permutation sample sizes would affect the p-value calculations for the trend test, using permutation samples of size 1,000,000. However, the resulting p-values were not much different from those based on samples of around 6000; the only changes were in the third, fourth, or fifth decimal places. Since the number of possible permutations is astronomically large (on the order of \(10^{100}\) for 50 animals per group in a four-group design), computational limitations are a very real concern for all of these methods.

6 A Short Review of the General Bayesian Approaches to Possible Survival and Carcinogenicity Analyses

6.1 Points We Want to Make Before We Get Going

  • Bayesian methods base conclusions solely on the posterior distribution of the parameters.

  • It may be that an improper prior (i.e. a distribution that is not a proper density) is useful, provided it results in a valid posterior.

  • Conclusions about a single parameter are based on the posterior distribution for that parameter found by integrating out all other parameters, including any nuisance parameters. This is a very general procedure, but may seem inflexible in comparison to the variety of frequentist methods.

  • An approach to Bayesian hypothesis testing is to consider the null hypothesis as a Bernoulli variable, and to use the observed data to construct a posterior for the probability that the hypothesis is in fact true.

A central Bayesian result having far-reaching implications for both Bayesian and frequentist statistical analysis is the so-called likelihood principle:

Results should be based only on the data that were actually observed, and not on data that could have occurred but did not.

6.2 Bayesianism and Nonclinical Biostatistics

Historically, except in those cases where researchers have been able to exploit conjugate priors (families of distributions for which, if the prior is in the family, then any posterior will also be in the family), Bayesian methods have been limited by the computational complexity of calculating posterior distributions, especially after repeated updating. However, as computational power has increased, more areas of statistics, including nonclinical biostatistics, have seen Bayesian methods become viable.

It is generally the case that as more data is collected, the influence of the initial prior diminishes, and the posterior becomes primarily a reflection of the observed empirical data. When working with large datasets, this is reassuring, and addresses the common criticism that Bayesian methods are founded on an unjustifiable choice of a prior. However, as has already been noted repeatedly in this chapter, one of the greatest challenges of reviewing rodent carcinogenicity data is the rarity of the events. So in contrast to the reassuring asymptotic case, Bayesian analyses of such data are more, rather than less, sensitive to the choice of a prior. For these smaller sample sizes, we need to use so-called noninformative, vague, or objective priors that do not dominate the data. The so-called reference prior method described in Bernardo (1979) and Bernardo and Smith (1994) is an automatic procedure for generating such priors.

In the analysis of most rodent carcinogenicity studies there are two primary goals:

  1. 1.

    To analyze the effect of the compound under study on survival.

  2. 2.

    To analyze the effect of the compound on the development of neoplasms.

In typical frequentist tests of carcinogenicity or survival hypotheses, the usual null hypothesis is that some set of parameters (typically slope parameters, such as D in the Weibull parameterization described in Sect. 12.2.2) is equal to zero. Testing this hypothesis in the manner described above (Sect. 12.6.1), we are interested in the posterior probability of the event that D is identically equal to 0.

The following proposed Bayesian analyses are intended to be illustrative only, and not prescriptive.

6.3 Notational Conventions for Examples

In the examples below, we adopt the following notational conventions:

There are I tumor types (as discussed in Sect. 12.3.6.2), J animals, and K dose groups. Animal j is a member of group \(\kappa (j)\), and the total number of animals in group k is denoted \(n_{k}\). Without loss of generality, let k = 0 denote the control group. The animals in group k are treated with dose \(d_{k}\) (so \(d_{0} = 0\)). The maximum time in the study is denoted T, and the time at which animal j leaves the study (either through natural death or sacrifice) is denoted \(t_{j}\).

6.4 Survival Analysis Example: Finite Dimensional Proportional Hazards Model

The probability of an animal surviving past time t is given by the survival function \(S(t) =\Pr (T> t)\). Let f(t) denote the density of T. The instantaneous hazard function is \(h(t) = f(t)/S(t)\), and the cumulative hazard H is defined by:

$$\displaystyle{H(t) =\int _{ 0}^{t}h(u)\mathrm{d}u.}$$

The following identities follow immediately:

$$\displaystyle{f(t) = h(t)S(t)\quad \ln (S(t)) = -H(t)\quad S(t) =\mathrm{ e}^{-H(t)}\quad f(t) = h(t)\,\mathrm{e}^{-H(t)}.}$$

The standard Cox regression form of the proportional hazards model for such survival models specifies the hazard function:

$$\displaystyle{h(t\vert \mathbf{x}) = h_{0}(t)\mathrm{e}^{\mathbf{x}^{\top }\boldsymbol{\beta } }.}$$

Then treatment effects can be investigated by assessing the differential effects of treatment in the \(\mathrm{e}^{\mathbf{x}^{T}\boldsymbol{\beta } }\) term. Among other possible specifications, this can reflect a trend over dose or individual dose effects.

Statistical inference on survival is based on proposing a probability model for S(t) or one of its derivations. The probability model is defined so that hypotheses to be investigated are specified as parameters in the model. A frequentist analysis takes parameters as fixed and assesses the likelihood of the observed data. A Bayesian analysis starts by noting that parameters are not known, and assumes that a prior distribution is a natural measure of this lack of exact knowledge. Then the Bayesian analysis assesses the impact of the actual observed data on this prior.

Frequentist analysis of the Cox model uses asymptotics to analyze the linear predictor (and by extension the hazard ratio), but disregards the baseline hazard \(h_{0}\).Footnote 16 By contrast, a Bayesian analysis requires priors on all parameters, including the baseline hazard. In this example, we consider a finite dimensional space of possible baseline hazard functions, namely the piecewise step functions; i.e., hazard functions of the form

$$\displaystyle{ h_{0}(t) =\sum _{ m=0}^{M}\lambda _{ m}\mathbf{I}_{(a_{m},a_{m+1}]}(t). }$$
(12.15)

Without loss of generality, we may assume \(0 = a_{0} <a_{1}\ldots <a_{M} <a_{M+1} = T\).

In the formulation above, the baseline hazard is confounded with the specification of treatment effects, i.e., a multiplicative constant can be moved either into the baseline hazard or into the term with covariates. The dose effect at level k is represented by the scalar \(\beta _{k}\), interpreted as the log of the hazard ratio relative to the control group. Note that \(\beta _{0} = 0\). We thus have K − 1 unknown scalars, together with the unknown baseline hazard function \(h_{0}(t)\). The model could be simplified further by assuming

$$\displaystyle{ \beta _{k} =\eta d_{k}+\mu }$$
(12.16)

for k > 0, i.e., by assuming a simple linear trend in dose.

Given that \(a_{m} <t \leq a_{m+1}\), the cumulative hazard for an animal in group k may be written as:

$$\displaystyle{H_{0}(t) =\mathrm{ e}^{\beta _{k}}\int _{0}^{t}h_{0}(u)\,\mathrm{d}u =\mathrm{ e}^{\beta _{k}}\left (\left (\sum _{n=0}^{m-1}\lambda _{n}(a_{n+1} - a_{n})\right ) +\lambda _{m}(t - a_{m})\right )}$$

and the likelihood for subject j can be written

$$\displaystyle\begin{array}{rcl} L_{j}(\boldsymbol{\beta }) \propto \left \{\begin{array}{ll} \mathrm{e}^{-H_{0}(t_{j})} & \mathrm{if\ the}\ j^{\mathrm{th}}\ \mathrm{subject\ is\ censored\ at\ time}\ t_{j} \\ \lambda _{m}\mathrm{e}^{\beta _{\kappa (j)}-H_{0}(t_{j})} & \mathrm{if\ the}\ j^{\mathrm{th}}\ \mathrm{subject\ fails\ at\ time}\ t_{j} \end{array} \right.& &{}\end{array}$$
(12.17)

where m is the interval satisfying \(a_{m} <t_{j} \leq a_{m+1}\).

Note that in the case that we use the model described in Eq. (12.16), the parameters in the likelihood function shown in Eq. (12.17) will be η and μ, rather than the parameter vector \(\boldsymbol{\beta }\).

Because this looks like a sample of exponential inter-arrival times, we would expect the simple fail/not-fail outcomes to correspond to Poisson random variables. For subject j, censored or failing at time \(t_{j}\), define \(\gamma _{jm}\) by

$$\displaystyle{\gamma _{jm} = \left \{\begin{array}{ll} \lambda _{m}\left (a_{m+1} - a_{m}\right )&\mathrm{for}\ t_{j}> a_{m+1} \\ \lambda _{m}\left (t_{j} - a_{m}\right ) &\mathrm{for}\ a_{m} <t_{j} \leq a_{m+1} \\ 0 &\mathrm{otherwise.} \end{array} \right.}$$

Thus

$$\displaystyle{S(t_{j}) =\mathrm{ e}^{-H(t_{j})} =\prod _{ m=0}^{M}\exp \left (-\mathrm{e}^{\beta _{\kappa (j)} }\gamma _{jm}\right ).}$$

Furthermore, \((t_{j} - a_{m})\) is constant with respect to the parameters, and hence can be incorporated into the likelihood for subjects who fail by multiplying \(\lambda _{m}\) by this difference (yielding \(\gamma _{jm}\)). Thus for subject j, the likelihood can also be written as:

$$\displaystyle{L_{j}(\boldsymbol{\beta }) \propto \left \{\begin{array}{cl} \prod _{m=0}^{M}\exp \left (-\mathrm{e}^{\beta _{\kappa (j)} }\gamma _{jm}\right ) &\mathrm{if\ the}\ j^{\mathrm{th}}\ \mathrm{subject\ is\ censored\ at\ time}\ t_{j} \\ \gamma _{jm}\mathrm{e}^{\beta _{\kappa (j)} }\prod _{m=0}^{M}\exp \left (-\mathrm{e}^{\beta _{\kappa (j)} }\gamma _{jm}\right )&\mathrm{if\ the}\ j^{\mathrm{th}}\ \mathrm{subject\ fails\ at\ time}\ t_{j}. \end{array} \right.}$$

Although it looks messy, this is the likelihood of independent Poisson random variables with means \(\mathrm{e}^{\beta _{\kappa (j)}}\gamma _{jm}\), where the response is zero in every subject-by-interval cell except, for a subject that fails, the interval in which the failure occurs. This is only a computational convenience, but it allows easy estimation of the appropriate parameters using standard software (e.g., Lunn et al. 2000; see Sect. 12.6.8). Thus we need to specify an appropriate prior for the baseline hazard. Note that the baseline hazard is essentially the hazard of the control group. A gamma prior would be skewed to the right and would seem to be an appropriate choice. The standard 2 year study could be broken down into twelve 2 month periods. Sacrifice or accidental death could be treated as a reduction in the risk set, but not as a mortality event. In most circumstances we might prefer a specification of an increasing hazard (again easily specified in WinBUGS or OpenBUGS; Lunn et al. 2000).
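
The Poisson representation is easy to verify numerically. The sketch below builds the subject-by-interval exposures \(\gamma _{jm}\) and evaluates the corresponding Poisson log-likelihood in plain Python; it is a minimal illustration of the trick under our own data layout, not the WinBUGS/OpenBUGS implementation referred to above.

```python
# Sketch of the Poisson trick for a piecewise-constant baseline hazard:
# each subject-by-interval exposure becomes a Poisson observation with
# mean exp(beta_k) * gamma_jm, with response 1 only in a failure interval.
import numpy as np

def poisson_trick_loglik(log_lam, beta, cuts, t, group, failed):
    """log_lam: log(lambda_m) per interval; beta: per-group log hazard
    ratios (beta[0] = 0 for control); cuts: endpoints a_1 < ... < T;
    t, group, failed: per-subject exit time, group index, failure flag."""
    a = np.concatenate([[0.0], np.asarray(cuts, float)])
    ll = 0.0
    for tj, k, fail in zip(t, group, failed):
        for m in range(len(a) - 1):
            exposure = max(0.0, min(tj, a[m + 1]) - a[m])  # gamma_jm / lambda_m
            if exposure == 0.0:
                continue
            mu = np.exp(log_lam[m] + beta[k]) * exposure   # Poisson mean
            y = 1.0 if (fail and a[m] < tj <= a[m + 1]) else 0.0
            ll += y * np.log(mu) - mu   # Poisson log-pmf; y! = 1 as y is 0 or 1
    return ll
```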

6.5 Carcinogenicity Example: Finite Dimensional Logistic Model

A logistic model is easy to implement in OpenBUGS or WinBUGS (Lunn et al. 2000). For this analysis we define mixed two-stage/three-stage hierarchical models for tests of trend and pairwise comparisons.

For testing trend, we define \(\theta _{ij}\) to be the probability that tumor i is found in subject j, and we build the following model:

$$\displaystyle{ \mathrm{logit}(\theta _{ij}) =\alpha _{i} +\beta _{i}d_{\kappa (j)} +\gamma _{i}\ln (t_{j}) +\delta _{j} }$$
(12.18)

where \(\delta _{j}\) is the individual random subject effect.Footnote 17

We assign model priors:

$$\displaystyle\begin{array}{rcl} \alpha _{i}& \sim & \mathrm{N}(\mu _{\alpha },\sigma _{\alpha }^{2}) {}\\ \beta _{i}& \sim & \pi _{i}\mathbf{I}_{[0]} + (1 -\pi _{i})\mathrm{N}(\mu _{\beta },\sigma _{\beta }^{2}) {}\\ \pi _{i}& \sim & \mathrm{Beta}(\phi,\psi ) {}\\ \gamma _{i}& \sim & \mathrm{N}(\mu _{\gamma },\sigma _{\gamma }^{2}) {}\\ \end{array}$$

for \(i = 1,\ldots,I\) and a random subject effect

$$\displaystyle{\delta _{j} \sim \mathrm{ N}(\mu _{\delta },\sigma _{\delta }^{2})}$$

for \(j = 1,\ldots,J\). For computational convenience, we typically define \(\mu _{\alpha } =\mu _{\beta } =\mu _{\gamma } =\mu _{\delta } = 0\) and \(\sigma _{\delta }^{2} =\sigma _{\alpha }^{2} =\sigma _{\beta }^{2} =\sigma _{\gamma }^{2} = 100\), or assign \(\sigma _{\alpha }^{2},\sigma _{\beta }^{2},\sigma _{\gamma }^{2} \sim \mathrm{ InverseGamma}(1,3)\).

The model for the pairwise comparison between group k and the control group is similar:

$$\displaystyle{ \mathrm{logit}(\theta _{ij}) =\alpha _{i} +\beta _{ik}\mathbf{I}_{\{\kappa (j)=k\}} +\gamma _{i}\ln (t_{j}) +\delta _{j}. }$$
(12.19)

Our priors have the form:

$$\displaystyle\begin{array}{rcl} \pi _{ik}& \sim & \mathrm{Beta}(\phi,\psi ) {}\\ \beta _{ik}& \sim & \pi _{ik}\mathbf{I}_{[0]} + (1 -\pi _{ik})\mathrm{N}(\mu _{\beta _{k}},\sigma _{\beta _{k}}^{2}) {}\\ \end{array}$$

for \(j = 1,\ldots,J\) and \(i = 1,\ldots,I\). Note that with this parameterization, for \(k\neq 0\), the \(\beta _{ik}\) terms represent the deviation of the treatment effect from the control group. These are reasonably well dispersed priors on the parameters.
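
As a concrete anchor for the trend model of Eq. (12.18), the sketch below evaluates its log-likelihood for a matrix of tumor indicators; the priors above (including the point mass at \(\beta _{i} = 0\)) would be layered on top in an MCMC sampler, and all names and the data layout are ours.

```python
# Sketch of the log-likelihood for the logistic trend model of Eq. (12.18):
# logit(theta_ij) = alpha_i + beta_i d_kappa(j) + gamma_i ln(t_j) + delta_j.
import numpy as np

def loglik_trend(y, alpha, beta, gamma, delta, dose_of_animal, t):
    """y: I x J binary matrix of tumor findings; alpha, beta, gamma:
    length-I arrays; delta: length-J subject effects; dose_of_animal:
    d_kappa(j) per animal; t: time on study per animal."""
    eta = (alpha[:, None] + np.outer(beta, dose_of_animal)
           + np.outer(gamma, np.log(t)) + delta[None, :])
    theta = 1.0 / (1.0 + np.exp(-eta))            # logistic link
    return np.sum(y * np.log(theta) + (1 - y) * np.log(1 - theta))
```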

6.6 Survival Analysis Example: Nonparametric Bayesian Analysis

Some applications involve increasing numbers of parameters, or even infinite dimensional problems. Perhaps the knowledge about the parameter could follow a probability distribution not indexed by a small set of parameters; for example, instead of something like a simple normal distribution indexed by a mean \(\mu \) and variance \(\sigma ^{2}\), the family could be, say, one of the continuous location families, or possibly even all continuous probability distributions. By a slight misnomer, such problems have come to be called “Bayesian nonparametrics.” The challenge is not that there are no parameters, but rather that there are far too many. Since it is quite difficult to specify informative priors on an infinite dimensional space, it seems more appropriate to work with objective priors that cover much of the parameter space.

One of many possible standard models for the survival function is to model the logarithm of the survival time with a normal distribution, i.e., to specify that \(T_{i}\) follows a lognormal distribution. The typical Bayesian nonparametric model takes such a specification as a baseline to be perturbed, “robustifying” the model by placing a so-called Dependent Dirichlet Process (DDP) prior on the space of probability distributions; the resulting model is a mixture of normal distributions weighted by a Dirichlet process on the normal parameters. The baseline distribution of the Dirichlet process models the linear parameters, with a normal prior on the mean parameters and a gamma prior on the variance parameters. The prior for the precision parameter of the Dirichlet process is specified as a gamma distribution, and the priors for the other hyperparameters are conjugate distributions. Following the notation of Jara et al. (2014), we can write:

$$\displaystyle\begin{array}{rcl} \ln (T_{i})& =& t_{i}\vert \mathbf{f}_{X_{i}} \sim \mathbf{f}_{X_{i}} {}\\ \mathbf{f}_{X_{i}}& =& \int \mathrm{N}(X_{i}\boldsymbol{\beta },\sigma ^{2})G(\mathrm{d}\boldsymbol{\beta }\mathrm{d}\sigma ^{2}) {}\\ G\vert \alpha,G_{0}& \sim & DP(\alpha G_{0}) {}\\ \end{array}$$

Typically, the distributions of the hyperparameters above can be specified as follows:

$$\displaystyle\begin{array}{rcl} G_{0}& =& \mathrm{N}(\beta \vert \mu _{b},s_{b})\varGamma \left (\sigma ^{2}\vert \frac{\tau _{1}} {2}, \frac{\tau _{2}} {2}\right ) {}\\ \alpha \vert a_{0},b_{0}& \sim & \mathrm{Gamma}(a_{0},b_{0}) {}\\ \mu _{b}\vert m_{0},s_{0}& \sim & \mathrm{N}(m_{0},S_{0}) {}\\ s_{b}\vert \nu,\varPsi & \sim & \mathrm{InvWishart}(\nu,\varPsi ) {}\\ \tau _{2}\vert \tau _{s_{1}},\tau _{s_{2}}& \sim & \mathrm{Gamma}(\tau _{s_{1}},\tau _{s_{2}}) {}\\ \end{array}$$

See, for instance, De Iorio et al. (2009). The parameterization used to compare doses can be captured by a dummy coding, as in the finite dimensional example (see Sect. 12.6.5).

6.7 Carcinogenicity Example: Nonparametric Logistic Model

A model similar to the one in Example 12.6.5 takes the baseline distribution to be a logistic distribution. The nonparametric Bayesian approach treats an actual probability distribution as one of the parameters. This distribution is sampled from an infinite dimensional space of possible distributions; this is mathematically challenging, and, unlike for most finite dimensional parameters, it is difficult to specify appropriate prior distributions on such a space. Thus one attempts to specify robust priors on the slope and treatment differences that have a small impact on the result. The baseline model follows a simple logit model for tests of trend and pairwise comparisons. For testing trend, we define \(p_{ijk}\) as the probability that tumor type i is found in subject j in treatment group k, for \(i = 1,\ldots,I\) tumor types and \(j = 1,\ldots,J\) animals; subject j receives dose \(d_{k}\), leaves the experiment at time \(t_{j}\), and has subject effect \(\delta _{j}\):

$$\displaystyle{ \mathrm{logit}(p_{ijk}) =\alpha _{i} +\beta _{i}d_{k} +\gamma _{i}t_{j} +\delta _{j} }$$
(12.20)

with assigned model priors:

$$\displaystyle\begin{array}{rcl} \alpha _{i}& \sim & \mathrm{N}(\mu _{\alpha },\sigma _{\alpha }^{2}) {}\\ \beta _{i}& \sim & \mathrm{N}(\mu _{\beta },\sigma _{\beta }^{2}) {}\\ \gamma _{i}& \sim & \mathrm{N}(\mu _{\gamma },\sigma _{\gamma }^{2}) {}\\ \end{array}$$

But now, instead of directly specifying the distribution of the animal random effect \(\delta _{j}\), we place a Dirichlet process (DP) prior on the space of distributions.

$$\displaystyle\begin{array}{rcl} \delta _{j}\vert G& \sim & G {}\\ G\vert \alpha,G_{0}& \sim & \mathrm{DP}(\alpha G_{0}) {}\\ G_{0}& =& \mathrm{N}(\mu,\varSigma ) {}\\ \end{array}$$

Note that care seems to be needed to ensure that parameters are identified. Again, this is a simple application of the function in the DPpackage (Jara et al. 2014) in R (R Core Team 2012).

6.8 Software

Prior to the development of the various Markov chain Monte Carlo (MCMC) methods, software for Bayesian analysis was largely limited to approximate solutions, or to insisting on so-called conjugate priors. While many statisticians could see the philosophical advantages of Bayesian methodology, the lack of good methods for computing the posterior limited the application of these methods. All that has changed with the development of MCMC methods, which in their simplest form are similar to so-called importance sampling.

For most problems in “classical” Bayesian analysis the user is faced with a plethora of choices of packages or programs. WinBUGS and its more recent descendant OpenBUGS (Lunn et al. 2000) are probably the oldest and most used general programs (“BUGS” stands for Bayesian inference Using Gibbs Sampling). The various versions of the manuals carry the warning, in quite noticeable type, “WARNING: MCMC can be dangerous.” The point is that MCMC methods work well when they move freely through the appropriate parameter space, reaching nearly all feasible points; when they get stuck in a region of the parameter space and cannot leave, MCMC methods can fail. WinBUGS includes several diagnostics for this type of behavior.

Several SAS procedures have options for a Bayesian analysis, usually with default but quite reasonable priors. For more general use, PROC MCMC provides a detailed analysis, including extensive, possibly nearly exhaustive, diagnostics of the behavior of the MCMC chains, but, as with all very general procedures, it requires careful coding of priors and likelihoods.

R users (R Core Team 2012) have a number of packages for Bayesian Analysis available to them. LaplacesDemon is a very general package. MCMCpack includes a relatively long list of functions for MCMC analysis. BayesSurv has a number of R functions for survival models. Last but certainly not least, in this short and by no means exhaustive list, DPpackage (Jara et al. 2014) is a very general collection of functions for Nonparametric Bayesian Analysis and is undoubtedly currently the easiest way to implement such models.