The analysis of data obtained from a clinical trial represents the outcome of the planning and implementation already described. Primary and secondary questions addressed by the clinical trial can be tested and new hypotheses generated. Data analysis is sometimes viewed as simple and straightforward, requiring little time, effort, or expense. However, careful analysis usually requires a major investment in all three. It must be done with as much care and concern as any of the design or data-gathering aspects. Furthermore, inappropriate statistical analyses can introduce bias, result in misleading conclusions and impair the credibility of the trial.

In the context of clinical trials, the term bias has several senses. One is what we might call experimenter bias. This sense applies to a difference in the behavior, conscious or unconscious, of investigators depending on what they believe about the intervention. For example, in an unblinded trial, or a trial in which the intervention assignment can be guessed, the investigator may treat a participant or data from a participant differently depending on whether she believes that participant has received the experimental intervention. These differences in behavior can lead to a second sense of the term bias, one with a more technical definition which we may call estimation bias. If the goal of the trial is to estimate how the intervention affects an outcome measured in a specified population, bias is any difference between that estimate and the true effect which is not attributable to random variation.

Even in a randomized, double-blind clinical trial which has been conducted without any experimenter bias, the estimated effect can be biased by excluding randomized participants or observed outcomes, by inappropriate choice of analytic techniques, and by missing or poor quality data caused by mechanisms which were not the same in the intervention and control groups. This chapter will focus on how to avoid introducing estimation bias into the analysis of clinical trial data.

Several introductory textbooks of statistics [1–8] provide excellent descriptions of many basic methods of analysis. Chapter 15 presents essentials for analysis of survival data, since such data are frequently of interest in clinical trials. This chapter will focus on some issues in the analysis of data which seem to cause confusion in the medical research community. Some of the proposed solutions are straightforward; others require judgment. They reflect a point of view toward bias developed by the authors of this text and many colleagues in numerous collaborative efforts over three to four decades. Whereas some [9–12] have taken similar positions, others [13–16] have opposing views on several issues, and some published trials are more consistent with these opposing views (e.g., [17]).

The analytic approaches discussed here primarily apply to late phase (III and IV) trials. Various exploratory analysis approaches may be entirely reasonable in early phase (I and II) studies, where the goal is to obtain information and insight to design better subsequent trials. However, some of the fundamentals presented may still be of value in these early phase trials. We have used early examples that were instrumental in establishing many of the analytic principles and added new examples which reinforce them. However, given the multitude of clinical trials, it is not possible to include even a small proportion of the many excellent examples that exist.

Fundamental Point

Removing randomized participants or observed outcomes from analysis and subgrouping on the basis of outcome or other response variables can lead to biased results. Those biases can be of unknown magnitude and direction.

Which Participants Should Be Analyzed?

In the context of clinical trials the term ‘withdrawal’ is used in different ways. For this chapter, ‘withdrawing a participant’ generally means removing from an analysis the data contributed by a participant who has been randomized and perhaps followed for some length of time. A common and related meaning of the term ‘withdrawal’ refers to a participant who is randomized but from whom follow-up data are deliberately not collected, or only partially collected, as a result of decisions made by the study investigators. Yet another meaning indicates someone who is randomized but refuses to continue participating in the trial. The term ‘excluded’ can also be ambiguous, sometimes referring to a participant who does not meet eligibility criteria during screening, sometimes to a participant’s data that were not used in an analysis. Since removing or not using data collected from a randomized participant can lead to bias, the question of which participants’ data should be analyzed is an important one. This chapter has adopted, in part, the terminology used by Peto and colleagues [12] to classify participants according to the nature and extent of their participation.

Discussion about which participants are to be included in the data analysis often arises in clinical trials. Although a laboratory study may have carefully regulated experimental conditions, even the best designed and managed clinical trial cannot be perfectly implemented. Response variable data may be missing, adherence to protocol may not be complete, and some participants, in retrospect, will not have met the entrance criteria. Some investigators may, after a trial has been completed, be inclined to remove from the analysis participants who did not meet the eligibility criteria or did not follow the protocol perfectly. In contrast, others believe that once a participant is randomized, that participant should always be followed and included in the analysis.

The intention-to-treat principle states that all randomized participants and all events, as defined in the protocol, should be accounted for in the primary analysis [12]. This requirement is stated in the International Conference on Harmonisation and FDA guidelines [18, 19]. “Modified intention-to-treat,” “per protocol,” or “on treatment” analyses are often proposed, in which only participants who received at least some of the intervention are included. However, as we will discuss, any deviation from strict intention-to-treat carries the potential for bias and should be avoided or, at a minimum, presented along with an intention-to-treat analysis. Many published analyses claim to have followed the intention-to-treat principle yet do not include all randomized participants and all events. Although the phrase “per protocol” is widely used, it suggests that the analysis is the one preferred in the trial’s protocol; for such analyses we think that “on treatment” analysis more accurately reflects what is done.

Exclusions refer to people who are screened as potential participants for a randomized trial but who do not meet all of the entry criteria and, therefore, are not randomized. Reasons for exclusion might be related to age, severity of disease, refusal to participate, or any of numerous other determinants evaluated before randomization. Since these potential participants are not randomized, their exclusion does not bias any intervention-control group comparison (sometimes called internal validity). Exclusions do, however, influence the broader interpretation and applicability of the results of the clinical trial (external validity). In some circumstances, follow-up of excluded people, as was done in the Women’s Health Initiative [20, 21], can be helpful in determining to what extent the results can be generalized. If the event rate in the control group is considerably lower than anticipated, an investigator may want to determine whether most high risk people were excluded or whether she was incorrect in her initial assumption.

Withdrawals from analysis refer to participants who have been randomized but are deliberately excluded from the analysis. As the fundamental point states, omitting participants from analyses can bias the results of the study [22]. If participants are withdrawn, the burden rests with the investigator to convince the scientific community that the analysis has not been biased. However, this is essentially impossible, because no one can be sure that participants were not differentially withdrawn from the study groups. Differential withdrawal can occur even if the number of omitted participants is the same in each group, since the reasons for withdrawal in each group may be different and consequently their risk of primary, secondary and adverse events may be different. As a result, the participants remaining in the randomized groups may not be comparable, undermining one of the reasons for randomization.

Many reasons are given for not including certain participants’ data in the analysis, among them ineligibility and nonadherence.

Ineligibility

A previously common reason cited for withdrawal is that some participants did not meet the entry criteria, a protocol violation unknown at the time of enrollment. Admitting unqualified participants may be the result of a simple clerical error, a laboratory error, a misinterpretation, or a misclassification. Clerical mistakes such as listing the wrong sex or age may be obvious. Other errors can arise from differing interpretations of diagnostic studies such as electrocardiograms, x-rays, or biopsies. It is not difficult to find examples in the earlier literature [23–31]. The practice of withdrawal for ineligibility used to be common, but appears to be less frequent now, at least in papers published in major journals.

Withdrawals for ineligibility can involve a relatively large number of participants. In an early trial by the Canadian Cooperative Study Group [30], 64 of the 649 enrolled participants (10%) with stroke were later found to have been ineligible. In this four-armed study, the numbers of ineligible participants in the study groups ranged from 10 to 25. The reasons for the ineligibility of these 64 participants were not reported, nor were their outcome experiences. Before cancer cooperative groups implemented phone-in or electronic eligibility checks, 10–20% of participants entered into a trial may have been ineligible after further review. By taking more care at the time of randomization, the number of ineligible participants was reduced to a very small percent [32]. Currently, web-based systems or interactive voice response systems are used for multicenter and multinational clinical trials [33]. These interactive systems can lead clinic staff through a review of key eligibility criteria before randomization is assigned, cutting down on the ineligibility rate. For example, several trials employed these methods [34–38].

A study design may require enrollment within a defined time period following a qualifying event. Because of this time constraint, data concerning a participant’s eligibility might not be available or confirmable at the time the decision must be made to enroll him. For example, the Beta-Blocker Heart Attack Trial looked at a 2–4 year follow-up mortality in people administered a beta-blocking drug during hospitalization for an acute myocardial infarction [23]. Because of known variability in interpretation, the protocol required that the diagnostic electrocardiograms be read by a central unit. However, this verification took several weeks to accomplish. Local investigators, therefore, interpreted the electrocardiograms and decided whether the patient met the necessary criteria for inclusion. Almost 10% of the enrolled participants did not have their myocardial infarction confirmed according to a central reading, and were “incorrectly” randomized. The question then arose: Should the participants be kept in the trial and included in the analysis of the response variable data? The Beta-Blocker Heart Attack Trial protocol required follow-up and analysis of all randomized participants. In this case, the observed benefits from the intervention were similar in those eligible as well as in those “ineligible.”

A more complicated situation occurs when the data needed for enrollment cannot be obtained until hours or days have passed, yet the study design requires initiation of intervention before then. For instance, in the Multicenter Investigation for the Limitation of Infarct Size [26], propranolol, hyaluronidase, or placebo was administered shortly after participants were admitted to the hospital with possible acute myocardial infarctions. In some, the diagnosis of myocardial infarction was not confirmed until after electrocardiographic and serum enzyme changes had been monitored for several days. Such participants were, therefore, randomized on the basis of a preliminary diagnosis of infarction. Subsequent testing may not have supported the initial diagnosis. Another example of this problem involves a study of pregnant women who were likely to deliver prematurely and, therefore, would have children who were at a higher than usual risk of being born with respiratory distress syndrome [24]. Corticosteroids administered to the mother prior to delivery were hypothesized to protect the premature child from developing this syndrome. Although, at the time of the mother’s randomization to either intervention or control groups, the investigator could not be sure that the delivery would be premature, she needed to make a decision whether to enroll the mother into the study. Other examples include trials where thrombolytic agents are being evaluated in reducing mortality and morbidity during and after an acute myocardial infarction. In these trials, agents must be given rapidly before diagnosis can be confirmed [39].

To complicate matters still further, the intervention given to a participant can affect or change the entry diagnosis. For example, in the above mentioned study to limit infarct size, some participants without a myocardial infarction were randomized because of the need to begin intervention before the diagnosis was confirmed. Moreover, if the interventions succeeded in limiting infarct size, they could have affected the electrocardiogram and serum enzyme levels. Participants in the intervention groups with a small myocardial infarction may have had the infarct size reduced or limited and therefore appeared not to have had a qualifying infarction. Thus, they would not seem to have met the entry criteria. However, this situation could not exist in the placebo control group. If the investigators had withdrawn participants in retrospect who did not meet the study criteria for a myocardial infarction, they would have withdrawn more participants from the intervention groups (those with no documented infarction plus those with small infarction) than from the control group (those with no infarction). This would have produced a bias in later comparisons. On the other hand, it could be assumed that a similar number of truly ineligible participants were randomized to the intervention groups and to the control group. In order to maintain comparability, the investigators might have decided to withdraw the same number of participants from each group. The ineligible participants in the control group could have been readily identified. However, the participants in the intervention groups who were truly ineligible had to be distinguished from those made to appear ineligible by the effects of the interventions. This would have been difficult, if not impossible. In the Multicenter Investigation for the Limitation of Infarct Size (MILIS), for example, all randomized participants were retained in the analysis [26].

An example of possible bias because of withdrawal of ineligible participants is found in the Anturane Reinfarction Trial, which compared sulfinpyrazone with placebo in participants who had recently suffered a myocardial infarction [27–29]. As seen in Table 18.1, of 1,629 randomized participants (813 to sulfinpyrazone, 816 to placebo), 71 were subsequently found to be ineligible. Thirty-eight had been assigned to sulfinpyrazone and 33 to placebo. Despite relatively clear definitions of eligibility and comparable numbers of participants withdrawn, mortality among these ineligible participants was 26.3% in the sulfinpyrazone group (10 of 38) and 12.1% in the placebo group (4 of 33) [27]. The eligible placebo group participants had a mortality of 10.9%, similar to the 12.1% seen among the ineligible participants. In contrast, the eligible participants on sulfinpyrazone had a mortality of 8.3%, less than one-third that of the ineligible participants. Including all 1,629 participants in the analysis gave 9.1% mortality in the sulfinpyrazone group, and 10.9% mortality in the placebo group (p = .20). Withdrawing the 71 ineligible participants (and 14 deaths, 10 vs. 4) gave an almost significant p = .07.

Table 18.1 Mortality by study group and eligibility status in the Anturane Reinfarction Trial
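To see how such withdrawals can shift a treatment comparison, the sketch below (in Python) back-calculates approximate death counts from the percentages quoted above and repeats the two-by-two comparison with and without the 71 ineligible participants. The counts are a reconstruction for illustration only, not the trial's published analysis, so the p-values will only roughly match the reported 0.20 and 0.07.

```python
# Illustrative reconstruction of the Anturane Reinfarction Trial mortality
# comparison; death counts are approximations derived from the reported
# percentages, not the published data.
from scipy.stats import chi2_contingency

def compare(deaths_a, n_a, deaths_b, n_b, label):
    """Chi-square comparison of mortality between two groups."""
    table = [[deaths_a, n_a - deaths_a],
             [deaths_b, n_b - deaths_b]]
    chi2, p, _, _ = chi2_contingency(table)
    print(f"{label}: {deaths_a}/{n_a} ({100 * deaths_a / n_a:.1f}%) vs "
          f"{deaths_b}/{n_b} ({100 * deaths_b / n_b:.1f}%), p = {p:.2f}")

# Intention-to-treat: all 1,629 randomized participants (~9.1% vs ~10.9%)
compare(74, 813, 89, 816, "All randomized")

# After withdrawing the 71 "ineligible" participants (10 vs. 4 deaths removed)
compare(64, 775, 85, 783, "Eligible only")
```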

Stimulated by criticisms of the study, the investigators initiated a reevaluation of the Anturane Reinfarction Trial results. An independent group of reviewers examined all reports of deaths in the trial [29]. Instead of 14 deceased participants who were ineligible, it found 19: 12 in the sulfinpyrazone group and 7 in the placebo group. Thus, supposedly clear criteria for ineligibility can be judged differently. This trial was an early example that affirmed the value of the intention-to-treat principle.

Three trial design policies that relate to withdrawals because of entry criteria violations have been discussed by Peto et al. [12]. The first policy is not to enroll participants until all the diagnostic tests have been confirmed and all the entry criteria have been carefully checked. Once enrollment takes place, no withdrawals from the trial are allowed. For some studies, such as the one on limiting infarct size, this policy cannot be applied because firm diagnoses cannot be ascertained prior to the time when intervention has to be initiated.

The second policy is to enroll marginal or unconfirmed cases and later withdraw from analysis those participants who are proven to have been misdiagnosed. This would be allowed, however, only if the decision to withdraw is based on data collected before enrollment. Any process of deciding upon withdrawal of a participant from a study group should be done blinded with respect to the participant’s outcome and group assignment.

A third policy is to enroll some participants with unconfirmed diagnoses and to allow no withdrawals. This procedure is always valid in that the investigator compares two randomized groups which are comparable at baseline. However, this policy is conservative because each group contains some participants who might not benefit from the intervention. Thus, the overall trial may have less power to detect differences of interest.

A modification to these three policies has been recommended [22]. Every effort should be made to establish the eligibility of participants prior to any randomization. No withdrawals should be allowed and the analyses should include all participants enrolled. Analyses based on only those truly eligible may be performed. If the analyses of data from all enrolled participants and from those eligible agree, the interpretation of the results is clear, at least with respect to participant eligibility. If the results differ, however, the investigator must be very cautious in her interpretation. In general, she should emphasize the analysis with all the enrolled participants because that analysis is always valid.

Any policy on withdrawals should be stated in the study protocol before the start of the study. Though the enrolled cohort is never a random sample, the desired aim, in general, is to make the recruited cohort as similar as possible to the population in which the intervention will be used in clinical practice. Withdrawal of participants from the trial, or of participants’ data from an analysis, after the decision to treat should therefore be extremely rare. The actual decision to withdraw specific participants should be made without knowledge of the study group, ideally by someone not directly involved in the trial. Of special concern is withdrawal based on review of selected cases, particularly if the decision rests on a subjective interpretation. Even in double-blind trials, blinding may not be perfect, and the investigator may supply information for the eligibility review differentially depending upon study group and health status. Therefore, withdrawal should be done early in the course of follow-up, before a response variable has occurred, and with a minimum of data exchange between the investigator and the person making the decision to withdraw the participant. This approach does not preclude a later challenge by readers of the report on the basis of potential bias. It should, however, remove the concern that the withdrawal policy was dependent on the outcome of the trial; the withdrawal rules should not be based on knowledge of study outcomes. Even when these guidelines are followed, if the number of withdrawals is high, if the number of entry criteria violations is substantially different in the study groups, or if the event rates in the withdrawn participants differ between the groups, the question will certainly be raised whether bias played a role in the decision to withdraw participants.

Nonadherence

Nonadherence to the prescribed intervention or control regimen is another reason that participants are withdrawn from analysis [40–59]. One version of this is to define an “on treatment” analysis that eliminates any participant who does not adhere to the intervention by some specified amount, as defined in the protocol. One form of nonadherence may be drop-outs and drop-ins (Chap. 14). Drop-outs are participants in the intervention arm who do not adhere to the regimen. Drop-ins are participants in the control arm who receive the intervention. The decision not to adhere to the protocol intervention may be made by the participant, his primary care physician, or the trial investigator. Nonadherence may be due to adverse events in either the intervention or control group, loss of participant interest or perceived benefit, changes in the underlying condition of a participant, or a variety of other reasons.

Withdrawal from analysis of participants who do not adhere to the intervention regimens specified in the study design is often proposed. The motivation for withdrawal of nonadherent participants is that the trial is not a “fair test” of the ideal intervention with these participants included. For example, there may be a few participants in the intervention group who took little or no therapy. One might argue that if participants do not take their medication, they certainly cannot benefit from it. There could also be participants in the control group who frequently receive the study medication. The intervention and control groups are thus “contaminated.” Proponents of withdrawal of nonadherent participants argue that removal of these participants keeps the trial closer to what was intended; that is, a comparison of optimal intervention versus control. The impact of nonadherence on the trial findings is that any observed benefit of the intervention, as compared to the control, will be reduced, making the trial less powerful than planned. Newcombe [11], for example, discusses the implications of adherence for the analysis as well as the design and sample size. We discuss this in Chap. 8.

A policy of withdrawal from analysis because of participant nonadherence can lead to bias. The overwhelming reason is that participant adherence to a protocol may be related to the outcome. In other words, there may be an effect of adherence on the outcome which is independent of the intervention. Certainly, if nonadherence is greater in one group than another, for example if the intervention produces many adverse events, then withdrawal of nonadherent participants could lead to bias. Even if the frequency of nonadherence is the same for the intervention and control groups, the reasons for nonadherence in each group may differ and may involve different types of participants. The concern would always be whether the same types of participants were withdrawn in the same proportion from each group or whether an imbalance had been created. Of course, an investigator could probably neither confirm nor refute the possibility of bias.

For noninferiority trials (see Chaps. 3 and 5), nonadherence may make the two intervention arms look more alike and thus create a bias toward the claim of noninferiority [13, 60]. Any attempt to use only adherers in a noninferiority trial, though, could be biased in unknown directions, thus rendering the results uninterpretable. Again, the best policy is to design a trial to have minimum nonadherence, power the trial to overcome non-preventable nonadherence, and then accept the results using the principle of intention-to-treat.

The Coronary Drug Project evaluated several lipid-lowering drugs in people several years after a myocardial infarction. In participants on one of the drugs, clofibrate, total 5-year mortality was 18.2%, as compared with 19.4% in control participants [31, 57]. Among the clofibrate participants, those who had at least 80% adherence to therapy had a mortality of 15%, whereas the poor adherers had a mortality of 24.6% (Table 18.2). This seeming benefit from taking clofibrate was, unfortunately, mirrored in the group taking placebo, 15.1% vs. 28.2%. A similar pattern (Table 18.3) was noted in the Aspirin Myocardial Infarction Study [58]. Overall, no difference in mortality was seen between the aspirin-treated group (10.9%) and the placebo-treated group (9.7%). Good adherers to aspirin had a mortality of 6.1%; poor adherers had a mortality of 21.9%. In the placebo group, the rates were 5.1% and 22%.

Table 18.2 Percent mortality by study group and level of adherence in the Coronary Drug Project
Table 18.3 Percent mortality by study group and degree of adherence in the Aspirin Myocardial Infarction Study

A trial of antibiotic prophylaxis in cancer patients also demonstrated a relationship between adherence and benefit in both the intervention and placebo groups [43]. Among the participants assigned to intervention, efficacy in reducing fever or infection was 82% in excellent adherers, 64% in good adherers, and 31% in poor adherers. Among the placebo participants, the corresponding figures were 68%, 56%, and 0%.

Another pattern is noted in a three-arm trial comparing two beta-blocking drugs, propranolol and atenolol, with placebo [59]. Approximately equal numbers of participants in each group stopped taking their medication. In the placebo group, adherers and nonadherers had similar mortality: 11.2% and 12.5%, respectively. Nonadherers to the interventions, however, had death rates several times greater than did the adherers: 15.9% vs. 3.4% in those on propranolol and 17.6% vs. 2.6% in those on atenolol. Thus, even though the numbers of nonadherers in each arm were equal, their risk characteristics as reflected by their mortality rates were obviously different.

Pledger [51] provides an analogous example for a trial of schizophrenia. Participants were randomized to chlorpromazine or placebo and the 1-year relapse rates were measured. The overall comparison was a 27.8% relapse rate on active medication and 52.8% for those on placebo. The participants were categorized into low or high adherence subgroups. Among the active medication participants, the relapse rate was 61.2% for low adherence and 16.8% for high adherence. However, the relapse rate was 74.7% and 28.0% for the corresponding adherence groups on placebo.

Another example of placebo adherence versus nonadherence is reported by Oakes et al. [49]. A trial of 2,466 heart attack participants compared diltiazem with placebo over a period of 4 years with time to first cardiac event as the primary outcome. Cardiac death or all-cause mortality were additional outcome measures. The trial was initially analyzed according to intention-to-treat with no significant effect of treatment. Qualitative interaction effects were found with the presence or absence of pulmonary congestion which favored diltiazem for patients without pulmonary congestion and placebo in patients with pulmonary congestion. Interestingly, for participants without pulmonary congestion, the hazard ratio or relative risk for time to first cardiac event was 0.92 for those off placebo compared to those on placebo. For participants with pulmonary congestion, the hazard ratio was 2.86 for participants off placebo compared to those on placebo. For time to cardiac death and to all-cause mortality, hazard ratios exceeded 1.68 in both pulmonary congestion subgroups. This again suggests that placebo adherence is a powerful prognostic indicator and argues for the intention-to-treat analysis.

The definition of nonadherence can also have a major impact on the analysis. This is demonstrated by a reanalysis of a trial in breast cancer patients by Redmond et al. [52]. This trial compared a complex chemotherapy regimen with placebo as adjuvant therapy following surgery, with disease-free survival as the primary outcome. To illustrate the challenges of trying to adjust analyses for adherence, two measures of adherence were created. Adherence was defined as the fraction of chemotherapy taken while on the study relative to what was defined by the protocol as a full course. One analysis (Method I) divided participants into good adherers (≥85%), moderate adherers (65–84%), and poor adherers (<65%). Using this definition, good placebo adherers had better disease-free survival than moderate adherers, who did better than poor adherers (Fig. 18.1). This pattern of outcome in the placebo group is similar to the CDP clofibrate example. The authors performed a second analysis (Method II), changing the definition of adherence slightly. In this case, adherence was defined as the fraction of chemotherapy taken while on study relative to what should have been taken while still on study, before being taken off treatment for some reason. Note that the previous definition (Method I) compared chemotherapy taken to what would have been taken had the participant survived to the end and adhered perfectly. This subtle difference in definition changed the ordering of outcomes in the placebo group. Here, the poor placebo adherers had the best disease-free survival, and the best adherers had a disease-free survival in between the moderate and poor adherers. Of special importance is that the participants in this example were all on placebo. Thus, adherence is itself an outcome, and trying to adjust one outcome (the primary response variable) for another outcome (adherence) can lead to misleading results.

Fig. 18.1

Percentage of disease-free survival related to adherence levels on placebo; Method I and Method II definitions of compliance in the National Surgical Adjuvant Breast Project (NSABP). The three levels of compliance are: Good (>85%), Moderate (65–84%), Poor (<65%) [52]

Detre and Peduzzi have argued that, although as a general rule nonadherent participants should be analyzed according to the study group to which they were assigned, there can be exceptions. They presented an example from the VA coronary bypass surgery trial [40]. In that trial, a number of participants assigned to medical intervention crossed over to surgery. Contrary to expectation, these participants were at similar risk of having an event, after adjusting for a variety of baseline factors, as those who did not cross over. Therefore, the authors argued that the non-adherers should be kept in their original groups, but can be censored at the time of crossover. This may be true, but, as seen in the Coronary Drug Project [57], adjustment for known variables does not always account for the observed response. The differences in mortality between adherers and nonadherers remained even after adjustment. Thus, other unknown or unmeasured variables were of critical importance.

Some might think that if rules for withdrawing participants are specified in advance, withdrawals for nonadherence are legitimate. However, the potential for bias cannot be avoided simply because the investigator states, ahead of time, the intention to withdraw participants. This is true even if the investigator is blinded to the group assignment of a participant at the time of withdrawal. Participants were not withdrawn from the analyses in the above examples. However, had a rule allowing withdrawal of participants with poor adherence been specified in advance, the results described above would have been obtained. The type of participants withdrawn would have been different in the intervention and control groups and would have resulted in the analysis of non-comparable groups of adherers. Unfortunately, as noted, the patterns of possible bias can vary and depend on the precise definition of adherence. Neither the magnitude nor direction of that bias is easily assessed or compensated for in analysis.

Adherence is also a response to the intervention. If participant adherence to an intervention is poor compared to that of participants in the control group, widespread use of this therapy in clinical practice may not be reasonably expected. An intervention may be effective, but may be of little value if it cannot be tolerated by a large portion of the participants. For example, in the Coronary Drug Project, the niacin arm showed a favorable trend for mortality over 7 years, compared with placebo, but niacin caused “hot flashes” and was not easily tolerated [31]. The development of slow release formulations that reduce pharmacologic peaks has lessened the occurrence of side effects.

It is therefore recommended that no participants be withdrawn from analysis in superiority trials for lack of adherence. The price an investigator must pay for this policy is possibly reduced power because some participants who are included in the analysis may not be on optimal intervention. For limited or moderate nonadherence, one can compensate by increasing the sample size, as discussed in Chap. 8, although doing so is costly.
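One common rough adjustment, discussed in Chap. 8, inflates the unadjusted sample size by dividing by the square of one minus the anticipated drop-out and drop-in proportions. The sketch below illustrates the arithmetic; the unadjusted sample size and nonadherence rates are purely hypothetical.

```python
# A rough sample-size inflation for anticipated nonadherence, commonly used
# with event-based outcomes (see Chap. 8).  The inputs below are illustrative
# assumptions only.
def adjust_for_nonadherence(n_unadjusted, dropout_rate, dropin_rate):
    """Inflate sample size by 1/(1 - R_O - R_I)^2, where R_O is the expected
    drop-out proportion and R_I the expected drop-in proportion."""
    return n_unadjusted / (1.0 - dropout_rate - dropin_rate) ** 2

n = adjust_for_nonadherence(1000, dropout_rate=0.10, dropin_rate=0.05)
print(round(n))  # 1000 / 0.85**2 ~ 1384 participants
```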

Missing or Poor Quality Data

In most trials, participants have data missing for a variety of reasons. Perhaps they were not able to keep their scheduled clinic visits or were unable to perform or undergo the particular procedures or assessments. In some cases, follow-up of the participant was not completed as outlined in the protocol. The challenge is how to deal with missing data or data of such poor quality that they are in essence missing. One approach is to withdraw participants who have poor data completely from the analysis [26, 61, 62]. However, the remaining subset may no longer be representative of the population randomized and there is no guarantee that the validity of the randomization has been maintained in this process.

There is a vast literature on approaches to dealing with missing data [63–73]. Many of these methods assume that the data are missing at random; that is, the probability of a measurement not being observed does not depend on what its value would have been. In some contexts, this may be a reasonable assumption, but for clinical trials, and clinical research in general, it would be difficult to confirm. It is, in fact, probably not a valid assumption, as the reason the data are missing is often associated with the health status of the participant. Thus, during trial design and conduct, every effort must be made to minimize missing data. If the amount of missing data is relatively small, then the available analytic methods will probably be helpful. If the amount of missing data is substantial, there may be no method capable of rescuing the trial. In this section, we discuss some of the issues that must be kept in mind when analyzing a trial with missing data.

Rubin [72] provided a definition of missing data mechanisms. If data are missing for reasons unrelated to the measurement that would have been observed and unrelated to covariates, then the data are “missing completely at random.” If the probability of a value being missing may depend on observed data, including covariates, but not on the value that would have been observed, the data are “missing at random.” Statistical analyses based on likelihood inference are valid when the data are missing at random or missing completely at random. If a measure or index allows a researcher to estimate the probability of having missing data, say in a participant with poor adherence to the protocol, then using methods proposed by Rubin and others might allow some adjustment to reduce bias [66, 71, 72, 74]. However, adherence, as indicated earlier, is often associated with a participant’s outcome and attempts to adjust for adherence can lead to misleading results.
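These mechanisms are often written compactly using an indicator of which values are observed. A simplified statement, with Y_obs and Y_mis the observed and missing parts of the outcome data, X the covariates, and R the missingness indicator, is:

```latex
% Simplified statement of the standard missing-data mechanisms
\begin{align*}
\text{MCAR:} \quad & P(R \mid Y_{\mathrm{obs}}, Y_{\mathrm{mis}}, X) = P(R) \\
\text{MAR:}  \quad & P(R \mid Y_{\mathrm{obs}}, Y_{\mathrm{mis}}, X) = P(R \mid Y_{\mathrm{obs}}, X) \\
\text{MNAR:} \quad & \text{missingness depends on } Y_{\mathrm{mis}} \text{ even given } (Y_{\mathrm{obs}}, X)
\end{align*}
```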

If participants do not adhere to the intervention and also do not return for follow-up visits, the primary outcome measured may not be obtained unless it is survival or some easily ascertained event. In this situation, an intention-to-treat analysis is not feasible and no analysis is fully satisfactory. Because withdrawal of participants from the analysis is known to be problematic, one approach is to “impute” or fill in the missing data such that standard analyses can be conducted. This is appealing if the imputation process can be done without introducing bias. There are many procedures for imputation. Those based on multiple imputations are more robust than single imputation [75].

A commonly used single imputation method is to carry the last observed value forward. This method, also known as an endpoint analysis, requires the very strong and unverifiable assumption that all future observations, if they were available, would remain constant [51]. Although commonly used, the last observation carried forward method is not generally recommended [71, 73]. Using the average value for all participants with available data, or using a regression model to predict the missing value are alternatives, but in either case, the requirement that the data be missing at random is necessary for proper inference.
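For concreteness, the sketch below shows how last observation carried forward and mean imputation might be coded for a small repeated-measures data set; the data layout, values, and variable names are hypothetical.

```python
# Single-imputation sketches for a repeated-measures outcome.
# The data layout, values, and column names here are hypothetical.
import pandas as pd

visits = ["month3", "month6", "month12"]
df = pd.DataFrame({
    "id":      [1, 2, 3, 4],
    "month3":  [140.0, 150.0, 135.0, 160.0],
    "month6":  [138.0, None,  130.0, 158.0],
    "month12": [None,  None,  128.0, 155.0],
})

# Last observation carried forward: fill each missing visit with the
# participant's most recent observed value.
locf = df.copy()
locf[visits] = locf[visits].ffill(axis=1)

# Mean imputation: fill each missing visit with the mean of the
# participants who were observed at that visit.
mean_fill = df.copy()
mean_fill[visits] = mean_fill[visits].fillna(mean_fill[visits].mean())

print(locf)
print(mean_fill)
```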

A more complex approach is to conduct multiple imputations, typically using regression methods, and then perform a standard analysis for each imputation. The final analysis should take into consideration the variability across the imputations. As with single imputation, the inference based on multiple imputation depends on the assumption that the data are missing at random. Other technical approaches are not described here, but in the context of a clinical trial, none is likely to be satisfactory.
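The combination step usually follows what are known as Rubin's rules: average the per-imputation estimates, and add the between-imputation variance, inflated by a factor of (1 + 1/m), to the average within-imputation variance. A minimal sketch follows, with the per-imputation results treated as given; the numbers are placeholders, and in practice each estimate would come from re-running the primary analysis on one completed data set.

```python
# Pooling m imputed-data analyses with Rubin's rules.
# The per-imputation estimates and variances below are placeholders.
import numpy as np

estimates = np.array([1.8, 2.1, 1.9, 2.0, 2.2])       # treatment effect per imputation
variances = np.array([0.30, 0.28, 0.31, 0.29, 0.30])  # squared standard errors

m = len(estimates)
q_bar = estimates.mean()            # pooled estimate
w = variances.mean()                # average within-imputation variance
b = estimates.var(ddof=1)           # between-imputation variance
t = w + (1 + 1 / m) * b             # total variance (Rubin's rules)

print(f"pooled estimate = {q_bar:.2f}, standard error = {np.sqrt(t):.2f}")
```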

Various other methods for imputing missing values have been described [63–73, 75]. Examples of some of these methods are given by Espeland et al. for a trial measuring carotid artery thickness at multiple anatomical sites using ultrasound [61]. In diagnostic procedures of this type, typically not all measurements can be made. Several imputation methods were based on a mixed effects linear model in which regression coefficients and a covariance structure (i.e., variances and correlations) were estimated. Once these were known, this regression equation was the basis for the imputation. Several imputation strategies were used, based on different methods of estimating the parameters and on whether treatment differences were assumed or not. Most of the imputation strategies gave similar results when the trial data were analyzed. The results indicated up to a 20% increase in efficiency compared to using only the available data in cross-sectional averages.

For repeated measures, imputation techniques such as these are useful if the data are missing at random; that is, the probability of missing data does not depend on the measurement that would have been observed, given the preceding observed measurements. Unfortunately, it is unlikely that data are missing at random. The best that can be offered, therefore, is a series of analyses, each exploring different approaches to the imputation problem. If all, or most, are in general agreement qualitatively, then the results are more persuasive. All analyses should be presented, not just the one with the preferred results.

In long-term trials participants may be lost to follow-up or refuse to continue their participation. In this situation, the status of the participant with regard to any response variable cannot be determined. If mortality is the primary response variable and the participant fails to return to the clinic, his survival status may still be obtained. If a death has occurred, the date of death can be ascertained. In the Coronary Drug Project [31], where survival experience over 60 months was the primary response variable, four of 5,011 participants were lost to follow-up (one in a placebo group, three in one treatment group, and none in another treatment group). The Lipid Research Clinics Coronary Primary Prevention Trial [47] followed over 3,800 participants for an average of 7.4 years and was able to assess vital status on all. The Physicians’ Health Study of over 20,000 US male physicians had complete follow-up for survival status [76]. Many other large simple trials, such as GUSTO [39], have similar, nearly complete follow-up experience. Obtaining such low loss to follow-up rates, however, required special effort. In the Women’s Health Initiative (WHI), one portion evaluated the possible benefits of hormone replacement therapy (estrogen plus progestin) compared with placebo in post-menopausal women. Of the 16,025 participants, 3.5% were lost to follow-up and did not provide 18-month data [77].

For some conditions, e.g., trials of treatment for substance abuse, many participants fail to return for follow-up visits, and missing data can be 25–30% or even more. Efforts to account for missing data must be made, recognizing that biases may very well exist.

An investigator may not be able to obtain any information on some kinds of response variables. For example, if a participant is to have blood pressure measured at the last follow-up visit 12 months after randomization and the participant does not show up for that visit, this blood pressure can never be retrieved. Even if the participant is contacted later, the later measurement does not truly represent the 12-month blood pressure. In some situations, substitutions may be permitted, but, in general, this will not be a satisfactory solution. An investigator needs to make every effort to have participants come in for their scheduled visits in order to keep losses to follow-up at a minimum. In the Intermittent Positive Pressure Breathing (IPPB) trial, repeated pulmonary function measurements were required for participants with chronic obstructive pulmonary disease [62]. However, some participants who had deteriorated could not perform the required test. A similar problem existed for the Multicenter Investigation of the Limitation of Infarct Size (MILIS) where infarct size could not be obtained in many of the sickest participants [26].

Individuals with chronic obstructive pulmonary disease typically decline in their pulmonary function and this decline may lead to death, as happened to some participants in the IPPB trial. In this case, the missing data were not missing at random and censoring was said to be informative. One simple method for cases such as the IPPB study is to define a decreased performance level considered to be a clinical event. Then the analysis can be based on time to the clinical event of deterioration or death, incorporating both pieces of information. Survival analysis, though, assumes that loss of follow-up is random and independent of risk of the event. Methods relaxing the missing at random assumption have been proposed [78, 79], but require other strong assumptions, the details of which are beyond the scope of this text.
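As an illustration of the composite "deterioration or death" approach, the sketch below analyzes time to the composite event with the lifelines package. The data, group labels, and the threshold implicit in the event indicator are hypothetical, and the analysis still rests on the assumption that any remaining censoring is noninformative.

```python
# Time to a composite clinical event (deterioration below a threshold or
# death), one simple way to incorporate drop-out driven by declining
# function.  The data below are hypothetical.
import pandas as pd
from lifelines import KaplanMeierFitter
from lifelines.statistics import logrank_test

df = pd.DataFrame({
    "group":  ["active", "active", "active", "control", "control", "control"],
    "months": [6.0, 18.0, 24.0, 4.0, 12.0, 24.0],  # follow-up time
    "event":  [1, 1, 0, 1, 0, 1],  # 1 = deterioration below threshold OR death
})

a = df[df["group"] == "active"]
b = df[df["group"] == "control"]

kmf = KaplanMeierFitter()
kmf.fit(a["months"], event_observed=a["event"], label="active")
print(kmf.survival_function_)

result = logrank_test(a["months"], b["months"],
                      event_observed_A=a["event"], event_observed_B=b["event"])
print(result.p_value)
```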

If the number of participants lost to follow-up differs in the study groups, the analysis of the data could be biased. For example, participants who are taking a new drug that has adverse effects may, as a consequence, miss scheduled clinic visits. Events may occur but be unobserved. These losses to follow-up would probably not be the same in the control group. In this situation, there may be a bias favoring the new drug. Even if the number lost to follow-up is the same in each study group, the possibility of bias still exists because the participants who are lost in one group may have quite different prognoses and outcomes than those in the other group.

An example of differential follow-up was reported by the Comparison of Medical Therapy, Pacing, and Defibrillation in Chronic Heart Failure (COMPANION) trial [80]. COMPANION compared a cardiac pacemaker or a pacemaker plus defibrillator with best pharmacologic treatment in people with chronic heart failure. Over 1,500 participants were randomized. Two primary outcomes were assessed: death, and death plus hospitalization. Individuals randomized to one of the device arms did not know to which device they had been assigned, but those on the pharmacologic treatment arm were aware that no device had been installed. During the course of the trial, the pacemaker plus defibrillator devices, made by two different manufacturers, were approved by a regulatory agency. As a result, participants in the pharmacologic treatment arm began to drop out of the trial and some also withdrew their consent. Many requested one of the newly approved devices. Thus, when the trial was nearing completion, the withdrawal rate was 26% in the pharmacologic treatment arm and 6–7% in the device arms. Additionally, no further follow-up information could be collected on those who withdrew consent. Clearly, censoring at the time of withdrawal was not random, and the possibility that it was related to disease status could not be ruled out, thus creating the possibility of serious bias. This situation could have jeopardized an otherwise well designed and conducted trial in people with a serious medical condition. However, the investigators initiated a complicated process of reconsenting the participants to allow for collection of the primary outcomes. After completing this process, ascertainment of death plus hospitalization and of vital status was 91% and 96%, respectively, in the pharmacologic treatment group. Outcome ascertainment for the two device arms was 99% or better. The final results demonstrated that both the pacemaker and the defibrillator plus pacemaker reduced death plus hospitalization, and further that the defibrillator plus pacemaker reduced mortality. These results were important in the treatment of chronic heart failure. However, not correcting for the initial differential loss to follow-up would have rendered the COMPANION trial data perhaps uninterpretable. In Fig. 18.2, the Kaplan-Meier curves for mortality for the two intervention arms are provided with the most complete data available.

Fig. 18.2

Kaplan-Meier estimates in COMPANION trial. (a) The time to the primary end point of death or hospitalization for any cause. (b) The time to the secondary end point of death from any cause [80]

Often, protocol designs call for follow-up to terminate at some period, for example 7, 14, or 30 days, after a participant has stopped adhering to his or her intervention, even though the intended duration of intervention would not have ended. The concept is that “off intervention” means “off study”; i.e., assessment for nonadherent participants ends when intervention ends. We do not endorse this concept. Although time to event analysis may be censored at the time of last follow-up, going off intervention or control is not likely random and may be related to participant health status. Important events, including serious adverse events, may occur beyond the follow-up period and might be related to the intervention. As noted above, though, survival analysis assumes that censoring is independent of the primary event. The practice of not counting events at the time of, or shortly after, intervention discontinuation is all too common, and can lead to problems in the interpretation of the final results. An instructive example is the Adenomatous Polyp Prevention on Vioxx (APPROVe) trial [81]. This randomized double blind trial compared a cyclo-oxygenase (COX)-II inhibitor with placebo in people with a history of colorectal adenomas. Previous trials of COX-II inhibitors had raised concern regarding long term cardiovascular risk. Thus, while the APPROVe trial was a cancer prevention trial, attention also focused on cardiovascular events, in particular thrombotic events and cardiovascular death, nonfatal myocardial infarction, and nonfatal stroke. However, participants who stopped taking their medication during the trial were not followed beyond 14 days after the time of discontinuation. The Kaplan-Meier cardiovascular risk curve is shown in Fig. 18.3. Note that for the first 18 months the two curves are similar and then begin to diverge. Controversy arose as to whether there was an 18-month lag time in the occurrence of cardiovascular events for this particular COX-II inhibitor [82, 83].

Fig. 18.3

APPROVe—Kaplan-Meier estimates of time to the AntiPlatelet Trialists’ Collaborative (APTC) outcome (death from cardiovascular causes, nonfatal myocardial infarction, or nonfatal stroke), with censoring 14 days after participants stopped therapy [84]. Reproduced with the permission of Elsevier Ltd. for Lancet

Due to the controversy, the investigators and sponsor launched an effort to collect information on all trial participants for at least a year after stopping study treatment. This extended follow-up, referred to here as APPROVe + 1, was able to collect selected cardiovascular events of nonfatal myocardial infarction, nonfatal stroke and cardiovascular death [84], as shown in Fig. 18.4. The time to event curves begin to separate from the beginning and continue throughout the extended follow-up, with a hazard ratio of 1.8 (p = 0.006). There was a corresponding statistically nonsignificant increase in mortality.

Fig. 18.4
figure 4figure 4

APPROVe—Kaplan-Meier estimates of time to the AntiPlatelet Trialists’ Collaborative (APTC) outcome (death from cardiovascular causes, nonfatal myocardial infarction, or nonfatal stroke), counting all events observed for an additional year of follow-up after the trial was initially terminated [84]. Reproduced with the permission of Elsevier Ltd. for Lancet

Censoring follow-up when participants go off their intervention is a common error that leads to problems like those encountered by the APPROVe trial. Going off intervention, and thus censoring follow-up at some number of days afterwards, is not likely to be independent of the disease process or how a participant is doing. At least, it would be difficult to demonstrate such independence. Yet, survival analysis and most other analyses assume that the censoring is independent. The principal lesson here is that “off intervention should not mean off study.”

An outlier is an extreme value significantly different from the remaining values. The concern is whether extreme values in the sample should be included in the analysis. This question may apply to a laboratory result, to the data from one of several areas in a hospital, or to data from a clinic in a multicenter trial. Removing outliers is not recommended unless the data can be clearly shown to be erroneous. Even though a value may be an outlier, it could be correct, indicating that on occasion an extreme result is possible. This fact could be very important and should not be ignored. Long ago, Kruskal [85] suggested carrying out an analysis with and without the “wild observation.” If the conclusions vary depending on whether the outlier values are included or excluded, one should view any conclusions cautiously. Procedures for detecting extreme observations have been discussed [86–89], and the publications cited can be consulted for further detail.
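This suggestion is simple to carry out: run the planned comparison with and without the extreme value and see whether the conclusion changes. A minimal sketch with made-up data follows.

```python
# Analyze with and without the 'wild observation' and compare conclusions.
# The data here are made up for illustration.
import numpy as np
from scipy.stats import ttest_ind

control      = np.array([5.1, 4.8, 5.3, 5.0, 4.9, 5.2])
intervention = np.array([4.2, 4.5, 4.1, 4.4, 4.3, 9.8])  # 9.8 is the outlier

for label, x in [("with outlier", intervention),
                 ("without outlier", intervention[:-1])]:
    t, p = ttest_ind(control, x)
    print(f"{label}: t = {t:.2f}, p = {p:.3f}")
```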

An interesting example given by Canner et al. [86] concerns the Coronary Drug Project. The authors plotted the distributions of four response variables for each of the 53 clinics in that multicenter trial. Using total mortality as the response variable, no clinics were outlying. When nonfatal myocardial infarction was the outcome, only one clinic was an outlier. With congestive heart failure and angina pectoris, response variables which are probably less well defined, there were nine and eight outlying clinics, respectively.

In conclusion, missing data can create problems. Though methods which allow for missing data exist, they require certain assumptions which are not likely to be true. Every attempt should be made to minimize missing data, and investigators should be aware of the potential for bias.

Competing Events

Competing events are those that preclude the assessment of the primary response variable. They can reduce the power of the trial by decreasing the number of participants available for follow-up. If the intervention can affect the competing event, there is also the risk of bias. In some clinical trials, the primary response variable may be cause-specific mortality, such as death due to myocardial infarction or sudden death, rather than total mortality [90–93]. The reason for using cause-specific death as a response variable is that a therapy often has specific mechanisms of action that may be effective against a disease or condition. In this situation, measuring death from all causes, at least some of which are not likely to be affected by the intervention, can “dilute” the results. For example, a study drug may be anti-arrhythmic and thus sudden cardiac death might be the selected response variable. Other causes of death such as cancer and accidents would be competing events.

Even if the response variable is not cause-specific mortality, death may be a factor in the analysis. This is particularly an issue in long term trials in the elderly or high risk populations. If a participant dies, future measurements will be missing. Analysis of nonfatal events in surviving participants has the potential for bias, especially if the mortality rates are different in the two groups.

In a study in which cause-specific mortality is the primary response variable, deaths from other causes are treated statistically as though the participants were lost to follow-up from the time of death (Chap. 15) and these deaths are not counted in the analysis. In this situation, the analysis, however, must go beyond merely examining the primary response variable. An intervention may or may not be effective in treating the condition of interest but could be harmful in other respects. Therefore, total mortality should be considered as well as cause-specific fatal events. Similar considerations need to be made when death occurs in studies using nonfatal primary response variables. This can be done by considering tables that show the number of times the individual events occur, one such event per person. No completely satisfactory solution exists for handling competing events. At the very least, the investigator should report all major outcome categories; for example, total mortality, as well as cause-specific mortality and morbid events.

In many cases, there may be recurring events. Many trials simply evaluate the time to the first event and do not count the additional events in the time to event analysis. Tables may show the total number of events in each intervention arm. Some attempts to further analyze recurrent events have been made, using for example the data from the COMPANION trial [80, 90]. Software exists for these methods [94, 95]; however, the technical details of these methods are complicated and will not be covered in this text.

Composite Outcomes

In recent years, many trials have used combinations of clinical and other outcomes as a composite response variable [90–93]. One major motivation is to increase the event rate and thus reduce the sample size that might otherwise have been required had just one of the components been selected as the primary outcome. Another motivation is to combine events that have a presumed common etiology and thus get an overall estimate of effect. The sample size is usually not based on any single component.

There are challenges in using a composite outcome [96, 97]. The components may not have equal weight or clinical importance, especially as softer outcomes are added. The components may go in opposite directions or at least not be consistent in indicating intervention effect. One component may dominate the composite. Results with any single component are based on a smaller number of events and thus the power for that component is greatly reduced. Rarely do we find significance for a component, nor should we expect it in general. Regardless of the composition of the composite, analyses should be conducted for each component, or in some cascading sequence. For example, if the composite were death, myocardial infarction, stroke or heart failure hospitalization, the analysis sequence might be death, death plus myocardial infarction or stroke, and death plus heart failure hospitalization. The reason for including death is to take into account competing risk of death for the other components, in addition to its obvious clinical importance.
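Operationally, each composite in such a cascade is analyzed as time to the first of its components, censored at the end of follow-up. The sketch below shows one way to construct those times from component event times; the data layout and column names are hypothetical.

```python
# Constructing cascaded composite outcomes from component event times.
# Each component column holds the time of that event (NaN if it did not
# occur); follow-up continues after the first event.  Hypothetical data.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "death":   [np.nan, 30.0, np.nan, 24.0],
    "mi":      [12.0,  np.nan, np.nan, 10.0],
    "stroke":  [np.nan, np.nan, np.nan, np.nan],
    "hf_hosp": [8.0,   20.0,  np.nan, np.nan],
    "last_fu": [36.0,  30.0,  36.0,   24.0],   # end of follow-up
})

def first_event(frame, components):
    """Time to first event among the components, censored at last follow-up."""
    time = frame[components + ["last_fu"]].min(axis=1)
    event = frame[components].notna().any(axis=1).astype(int)
    return time, event

# Cascading sequence: death; death + MI or stroke; death + HF hospitalization
for label, comps in [("death", ["death"]),
                     ("death/MI/stroke", ["death", "mi", "stroke"]),
                     ("death/HF hosp", ["death", "hf_hosp"])]:
    t, e = first_event(df, comps)
    print(label, list(zip(t, e)))
```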

As pointed out in Chap. 3, it is essential that follow-up continue after the first event in the composite outcome occurs. Analysis will include looking at the contribution of each component to the overall composite, but should also include time to event for each component separately. As indicated, if follow-up does not continue, only partial results are available for each component, and analysis of those events separately could be misleading.

There are several examples where a composite such as death, myocardial infarction, and stroke has been used as a primary or leading secondary outcome [34, 36–38]. These outcomes are all clinically relevant. In these trials, the outcomes all trended in the same direction. However, that is not always the case.

In the Pravastatin or Atorvastatin Evaluation and Infection Therapy (PROVE IT) trial, the 80 mg atorvastatin arm was more effective than the 40 mg pravastatin arm in reducing the incidence of death, myocardial infarction, stroke, hospitalization required due to unstable angina, and revascularization [91]. Stroke results, one of the key components, went in the opposite direction. These results complicate the interpretation. Should investigators think that atorvastatin improves the composite or just those components that are in the same direction as the composite? As would be expected, the differences for the components were not, in themselves, statistically significant.

Another interesting example is provided by the Women’s Health Initiative (WHI), a large factorial design trial in post-menopausal women [77]. As discussed earlier and in Chap. 16, one part involved hormone replacement therapy and contained two strata, women with a uterus and those without. Women with a uterus received estrogen plus progestin or matching placebo; those without a uterus received estrogen alone or a matching placebo. Due to the multiple actions of hormone replacement therapy, one response variable was a global outcome comprising mortality, coronary heart disease, bone loss reflected by hip fracture rates, breast cancer, colorectal cancer, pulmonary embolism, and stroke. As seen in Fig. 16.7, for the estrogen plus progestin stratum, there was essentially no effect on mortality and a small but nonsignificant effect on the global index, when compared to placebo. However, as shown in Fig. 16.6, the various components went in different directions. Hip fracture and colorectal cancer had a favorable response to hormone replacement therapy. Pulmonary embolism, stroke and coronary heart disease went in an unfavorable direction. Thus, any interpretation of the global index, which is a composite, requires careful examination of the components. Of course, few trials would have been designed with adequate power for the individual components, so the interpretation must be qualitative, looking for consistency and biological plausibility.

The Look AHEAD trial examined whether a long-term lifestyle intervention for weight loss would decrease cardiovascular morbidity and mortality in overweight or obese patients with type 2 diabetes [98]. The primary outcome was a composite of death from cardiovascular causes, nonfatal myocardial infarction, or nonfatal stroke. During follow-up, the Data and Safety Monitoring Board (DSMB) alerted the investigators that the event rate for the primary outcome was dramatically lower than expected, less than a third of what had been projected [99]. The protocol was changed to include hospitalization for angina as a way of increasing the event rate, and this turned out to be the most frequent component in the revised composite, which had an incidence about 50% higher than the original composite. Unfortunately, hospitalization for angina showed markedly less effect of the intervention [100]. Using the original composite would not have changed the trial’s outcome, which was negative, but this experience underscores the importance of giving full consideration to a candidate component’s likely response to the intervention, as well as to its incidence rate.

Experience suggests that composite outcome variables should be used cautiously and should only include components that have relatively equal clinical importance, frequency, and anticipated response to the presumed mechanism of action of the intervention [96]. As softer and less relevant outcomes are added, the interpretation becomes less clear, particularly if a less important component occurs more frequently than others, driving the overall result. Significance for any individual component cannot be expected, but there should be a plausible consistency across the components.

Covariate Adjustment

The goal in a clinical trial is to have study groups that are comparable except for the intervention being studied. Even if randomization is used, all of the prognostic factors may not be perfectly balanced, especially in smaller studies. Even if no prognostic factors are significantly imbalanced in the statistical sense, an investigator may, nevertheless, observe that one or more factors favor one of the groups. In either case, covariate adjustment can be used in the analysis to minimize the effect of the differences. However, covariate adjustment is not likely to eliminate the effect of these differences. Covariance analysis for clinical trials has been reviewed in numerous articles [101–122].

Adjustment also reduces the variance in the test statistic. If the covariates are highly correlated with outcome, this can produce more sensitive analyses. The specific adjustment procedure depends on the type of covariate being adjusted for and the type of response variable being analyzed. If a covariate is discrete, or if a continuous variable is converted into intervals and made discrete, the analysis is sometimes referred to as “stratified.” A stratified analysis, in general terms, means that the study participants are subdivided into smaller, more homogeneous groups, or strata. A comparison of study groups is made within each stratum and then averaged over all strata to achieve a summary result for the response variable. This result is adjusted for group imbalances in the discrete covariates. If a response variable is discrete, such as the occurrence of an event, the stratified analysis might take the form of a Mantel-Haenszel statistic.
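As a minimal illustration of a stratified analysis of this kind, the following sketch computes the Mantel-Haenszel odds ratio and test statistic directly from hypothetical 2×2 tables, one per stratum of a discrete baseline covariate; all counts are invented for illustration.

```python
# A minimal sketch (hypothetical counts): Mantel-Haenszel stratified analysis of a
# binary outcome, with strata defined by a discrete baseline covariate.
import numpy as np
from scipy.stats import chi2

# One 2x2 table per stratum: rows = intervention, control; columns = event, no event.
strata = [
    np.array([[15, 85], [25, 75]]),   # stratum 1 (e.g., younger participants)
    np.array([[30, 70], [40, 60]]),   # stratum 2 (e.g., older participants)
]

num_or, den_or = 0.0, 0.0            # pieces of the pooled (Mantel-Haenszel) odds ratio
obs, exp, var = 0.0, 0.0, 0.0        # pieces of the Mantel-Haenszel chi-square statistic
for table in strata:
    a, b = table[0]
    c, d = table[1]
    n = a + b + c + d
    num_or += a * d / n
    den_or += b * c / n
    obs += a
    exp += (a + b) * (a + c) / n
    var += (a + b) * (c + d) * (a + c) * (b + d) / (n ** 2 * (n - 1))

or_mh = num_or / den_or
chi2_mh = (obs - exp) ** 2 / var                  # no continuity correction
p_value = chi2.sf(chi2_mh, df=1)
print(f"MH odds ratio {or_mh:.2f}, chi-square {chi2_mh:.2f}, p = {p_value:.3f}")
```

The summary combines the within-stratum comparisons, so the result is adjusted for imbalance in the stratifying covariate, as described above.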

If the response variable is continuous, the stratified analysis is referred to as analysis of covariance. This uses a model which, typically, is linear in the covariates. A simple example for a response Y and covariate X would be \( Y = \alpha_j + \beta(X - \mu) + \text{error} \), where β is a coefficient representing the importance of the covariate X and is assumed to be the same in each group, μ is the mean value of X, and \( \alpha_j \) is a parameter for the contribution of the jth group (e.g., j = 1 or 2) to the overall response. The basic idea is to adjust the response variable Y for any differences in the covariate X between the two groups. Under appropriate assumptions, the advantage of this method is that the continuous covariate X does not have to be divided into categories. Further details can be found in statistics textbooks [1–6, 8, 123]. If time to an event is the primary response variable, then survival analysis methods that allow for adjustments of discrete or continuous covariates may be used [106]. However, whenever models are employed, the investigator must be careful to evaluate the assumptions required and how closely they are met. Analysis of covariance can be attractive, but may be abused if linearity is assumed when the data are nonlinear, if the response curve is not parallel in each group, or if assumptions of normality are not met [122]. If measurement error in the covariates is substantial, precision can be further reduced [112]. For all of these reasons, covariate adjustment models may be useful in the interpretation of data, but should not be viewed as absolutely correct.
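A minimal sketch of such an analysis of covariance, with hypothetical data and variable names, can be fit by ordinary least squares; the Python statsmodels formula interface is assumed here. Centering X at its mean μ, as in the model above, only shifts the intercepts and does not change the estimated group difference.

```python
# A minimal sketch (hypothetical data): analysis of covariance for a continuous
# response y, adjusting the group comparison for a baseline covariate x.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 50
df = pd.DataFrame({
    "group": np.repeat(["intervention", "control"], n),
    "x": rng.normal(120, 15, size=2 * n),          # baseline covariate (e.g., baseline SBP)
})
# Data generated with a common slope for x, a modest group effect, and noise.
df["y"] = -5.0 * (df["group"] == "intervention") + 0.6 * df["x"] + rng.normal(0, 10, size=2 * n)

# y ~ C(group) + x corresponds to Y = alpha_j + beta * X + error with a common slope.
fit = smf.ols("y ~ C(group) + x", data=df).fit()
print(fit.summary())

# The parallel-slopes assumption can be checked by adding a group-by-x interaction
# term and testing whether it improves the fit.
check = smf.ols("y ~ C(group) * x", data=df).fit()
```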

Regardless of the adjustment procedure, covariates should be measured at baseline. Except for certain factors such as age, sex, or race, any variables that are evaluated after initiation of intervention should be considered as response variables. Group comparisons of the primary response variable, adjusted for other response variables, are discouraged. Interpretation of such analyses is difficult because group comparability may be lost.

Surrogates as a Covariate

Adjustment for various surrogate outcomes may be proposed. In a trial of clofibrate [101], the authors reported that those participants who had the largest reduction in serum cholesterol had the greatest clinical improvement. However, reduction in cholesterol is probably highly correlated with adherence to the intervention regimen. Since, as discussed earlier, adherers in one group may be different from adherers in another group, analyses that adjust for a surrogate for adherence can be biased. This issue was addressed in the Coronary Drug Project [56]. Adjusted for baseline factors, the 5-year mortality was 18.8% in the clofibrate group (N = 997) and 20.2% in the placebo group (N = 2,535), an insignificant difference. For participants with baseline serum cholesterol greater than or equal to 250 mg/dl, the mortality was 17.5% and 20.6% in the clofibrate and placebo groups, respectively. No difference in mortality between the groups was noted for participants with baseline cholesterol of less than 250 mg/dl (20.0% vs. 19.9%). Those participants with lower baseline cholesterol in the clofibrate group who had a reduction in cholesterol during the trial had 16.0% mortality, as opposed to 25.5% mortality for those with a rise in cholesterol (Table 18.4). This would fit the hypothesis that lowering cholesterol is beneficial. However, in those participants with high baseline cholesterol, the situation was reversed. An 18.1% mortality was seen in those who had a fall in cholesterol, and a 15.5% mortality was noted in those who had a rise in cholesterol. The best outcome, therefore, appeared to be in participants on clofibrate whose low baseline cholesterol dropped or whose high baseline cholesterol increased. As seen earlier, adherence can affect outcomes in unexpected ways, and the same is true of surrogates for adherence.

Table 18.4 Percent 5-year mortality in the clofibrate group, by baseline cholesterol and change in cholesterol, in the Coronary Drug Project

Modeling the impact of adherence on a risk factor and thus on the response has also received attention [109, 115]. Regression models have been proposed that attempt to adjust the outcome for the amount of risk factor change that could have been attained with optimum adherence. One example was suggested by Efron and Feldman [109] for a lipid research study. However, Albert and DeMets [115] showed that these models are very sensitive to assumptions about the independence of adherence and health status or response. If the assumptions underlying these regression models are violated, misleading results emerge, such as those in the clofibrate and serum cholesterol example described above.

Clinical trials of cancer treatment commonly analyze results by comparing responders to nonresponders [104, 108]. That is, those who go into remission or have a reduction in tumor size are compared to those who do not. One early survey indicated that such analyses were done in at least 20% of published reports [122]. The authors of that survey argued that statistical problems, due to lack of random assignment, and methodological problems, due both to classification of response and inherent differences between responders and nonresponders, can occur. These will often yield misleading results, as shown by Anderson et al. [104]. They pointed out that participants “who eventually become responders must survive long enough to be evaluated as responders.” This factor can invalidate some statistical tests comparing responders to nonresponders. Those authors present two statistical tests that avoid bias. They note, though, that even if the tests indicate a significant difference in survival between responders and nonresponders, it cannot be concluded that increased survival is due to tumor response. Thus, aggressive intervention, which may be associated with better response, cannot be assumed to be better than less intensive intervention, which may be associated with poorer response. Anderson and colleagues state that only a truly randomized comparison can say which intervention method is preferable. What is unsaid, and illustrated by the Coronary Drug Project examples, is that even comparison of good responders in the intervention group with good responders in the control can be misleading, because the reasons for good response may be different.

Morgan [48] provided a related example of comparing duration of response in cancer patients, where duration of response is the time from a favorable response, such as tumor regression (partial or total), until relapse. This is another form of defining a subgroup by a post-treatment outcome, that is, tumor response. In a trial comparing two complex chemotherapy regimens (A vs. B) in small cell lung cancer, the tumor response rates were 64% and 85%, with median durations of response of 245 days and 262 days, respectively. When only responders were analyzed, the slight imbalance in prognostic factors was substantially increased: extensive disease was evident at baseline in 48% of the responders in one treatment group and 21% in the other. Thus, while it may be theoretically possible to adjust for prognostic factors, in practice such adjustment may decrease bias but will not eliminate it. Because not all prognostic factors are known, any model is only an approximation to the true relationship.

The Cox proportional hazards regression model for the analysis of survival data (Chap. 15) allows for covariates in the regression to vary with time [116]. This has been suggested as a way to adjust for factors such as adherence and level of response. It should be pointed out that, like simple regression models, this approach is vulnerable to the same biases described earlier in this chapter. For example, if cholesterol level and cholesterol reduction in the CDP example were used as time dependent covariates in the Cox model, the estimator of treatment effect would be biased due to the effects shown in Table 18.4.

Rosenbaum [121] provides a nice overview of adjustment for concomitant variables that have been affected by treatment in both observational and randomized studies. He states that “adjustments for post-treatment concomitant variables should be avoided when estimating treatment effects except in rather special circumstances, since adjustments themselves can introduce a bias where none existed previously.”

A number of additional methodologic attempts to adjust for adherence have also been made. Newcombe [11], for example, suggested adjusting estimates of intervention effect for the extent of nonadherence. Robins and Tsiatis [110] proposed a causal inference model. Lagakos et al. [46] evaluated censoring survival time, or time to an event, at the point when treatment is terminated. The rationale is that participants are no longer able to benefit fully from the therapy. However, the hazard ratio estimated by this approach is not the hazard ratio that would have been estimated if participants had not terminated treatment. The authors stated that it is not appropriate to evaluate treatment benefit by comparing the hazard rates estimated by censoring for treatment termination [46]. Models for causal inference have also been used to explore the effects of adherence in clinical trials [124–127]. Though promising, these approaches require strong assumptions that are usually either known to be untrue or difficult to validate, and so they are not recommended as part of a primary analysis.

Baseline Variables as Covariates

The issue of stratification was first raised in the discussion of randomization (Chap. 6). For large studies, the recommendation was that stratified randomization is usually unnecessary because overall balance would nearly always be achieved and stratification would still be possible in the analysis. For smaller studies, baseline adaptive methods could be considered, but the analysis should include the covariates used in the randomization. In a strict sense, the analysis should always be stratified if stratification was used in the randomization. In such cases, the adjusted analysis should include not only those covariates found to be different between the groups, but also those used for stratification during randomization. Of course, if no stratification is done at randomization, the final analysis is less complicated, since it would involve only those covariates that are imbalanced at baseline or of special interest because of their association with the outcome.

As stated in Chap. 6, randomization tends to produce comparable groups for both measured and unmeasured baseline covariates. However, not all baseline covariates will be closely matched. Adjusting treatment effect for these baseline disparities continues to be debated. Canner [111] describes two points of view, one which argues that “if done at all, analyses should probably be limited to covariates for which there is a disparity between the treatment groups and that the unadjusted measure is to be preferred.” The other view is “to adjust on only a few factors that were known from previous experience to be predictive.” Canner [111], as well as Beach and Meier [107], indicate that even for moderate disparity in baseline comparability, or even if the covariates are moderately predictive, it is possible for covariate adjustment to have a nontrivial impact on the measure of treatment effect. However, Canner [111] also points out that it is “often possible to select specific covariates out of a large set in order to achieve a desired result.” In addition, he shows that this issue is true for both small and large studies. For this reason, it is critical that the process for selecting covariates be specified in the protocol and adhered to in the primary analyses. Other adjustments may be used in exploratory analyses.

Another issue is testing for covariate interaction in a clinical trial [105, 113, 114, 118, 119]. A treatment-covariate interaction is present when the response to treatment varies according to the value of the covariate [105]. Peto [118] classifies treatment-covariate interactions as quantitative or qualitative. Quantitative interactions indicate that the magnitude of the treatment effect varies with the covariate but still favors the same intervention (Fig. 18.5a). Qualitative interactions involve a favorable intervention effect for some values of the covariate and an unfavorable effect for others (Fig. 18.5b). For example, a quantitative interaction would exist if the benefit of blood pressure treatment on mortality varied in degree with the level of baseline blood pressure but still favored the same intervention (see Fig. 18.5a). A qualitative interaction would exist if lowering blood pressure was beneficial for severe hypertension, but less beneficial or even harmful for mild hypertension. Intervention effects can vary by chance across levels of the covariate, even changing direction, so a great deal of caution must be taken in the interpretation. One can test formally for interaction, but requiring a significant interaction test is much more cautious than reviewing the magnitude of intervention effect within each subgroup. Byar [105] presents a nice illustrative example, shown in Table 18.5. Two treatments, A and B, are being compared by the difference in mean response, \( Y={\overline{X}}_A-{\overline{X}}_B \), and S is the standard error of Y. In the upper panel, the interaction test is not significant, but examination of the subgroups is highly suggestive of interaction. The lower panel is more convincing for interaction, but we still need to examine each subgroup to understand what is going on.
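For a continuous response, a formal interaction test can be carried out by comparing a model containing treatment and covariate main effects with one that adds their product; the sketch below uses hypothetical data (generated with no true interaction) and assumes the Python statsmodels package.

```python
# A minimal sketch (hypothetical data): a formal treatment-covariate interaction test
# for a continuous response, comparing models with and without the interaction term.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n = 200
df = pd.DataFrame({
    "treat": rng.integers(0, 2, size=n),                    # 1 = intervention, 0 = control
    "severity": rng.choice(["mild", "severe"], size=n),     # baseline covariate
})
df["y"] = (2.0 * df["treat"] + 1.0 * (df["severity"] == "severe")
           + rng.normal(0, 3, size=n))                      # generated with no true interaction

no_int = smf.ols("y ~ treat + C(severity)", data=df).fit()
with_int = smf.ols("y ~ treat * C(severity)", data=df).fit()

# A single global test of homogeneity of the treatment effect across severity levels.
f_stat, p_value, _ = with_int.compare_f_test(no_int)
print(f"interaction test: F = {f_stat:.2f}, p = {p_value:.3f}")
```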

Fig. 18.5

Two types of intervention–covariate interactions [118]

Table 18.5 Examples of apparent treatment-covariate interactions [105]

Methods have been proposed for testing for overall interactions [114, 119]. However, Byar’s concluding remarks [105] are noteworthy when he says,

one should look for treatment-covariate interactions, but, because of the play of chance in multiple comparisons, one should look very cautiously in the spirit of exploratory data analysis rather than that of formal hypothesis testing. Although the newer statistical methods may help decide whether the data suffice to support a claim of qualitative interactions and permit a more precise determination of reasonable p values, it seems to me unlikely that these methods will ever be as reliable a guide to sensible interpretation of data as will medical plausibility and replication of the findings in other studies. We are often warned to specify the interactions we want to test in advance in order to minimize the multiple comparisons problem, but this is often impossible in practice and in any case would be of no help in evaluating unexpected new findings. The best advice remains to look for treatment-covariate interactions but to report them skeptically as hypotheses to be investigated in other studies.

As indicated in Chap. 6, the randomization in multicenter trials should be stratified by clinic. The analysis of such a study should, strictly speaking, incorporate the clinic as a stratification variable. Furthermore, the randomization should be blocked in order to achieve balance over time in the number of participants randomized to each group. These “blocks” are also strata and, ideally, should be included in the analysis as a covariate. However, there could be a large number of strata, since there may be many clinics and the block size within any clinic is usually only four to eight participants. Use of these blocking covariates is probably not necessary in the analysis. Some efficiency will be lost for the sake of simplicity, but the sacrifice should be small.

As Fleiss [10] describes, clinics differ in their demography of participants and medical practice as well as adherence to all aspects of the protocol. These factors are likely to lead to variation in treatment response from clinic to clinic. In the Beta-blocker Heart Attack Trial (BHAT) [23], most, but not all, of the 30 clinics showed a trend for mortality benefit from propranolol. A few indicated a negative trend. In the Aspirin Myocardial Infarction Study (AMIS) [102], data from a few clinics suggested a mortality benefit from aspirin, although most clinics indicated little or no benefit. Most reported analyses probably do not stratify by clinic, but simply combine the results of all clinics. However, at least one of the primary analyses should average within-clinic differences, an analysis that is always valid, even in the presence of clinic-treatment interaction [114].

Subgroup Analyses

While covariance or stratified analysis adjusts the overall comparison of main outcomes for baseline variables, another common analytic technique is to subdivide the enrolled participants into subgroups defined by baseline characteristics [128–156]. Here the investigator looks specifically at particular subgroups rather than the overall comparison to assess whether different groups of patients respond differently to the intervention. One of the most frequently asked questions during the design of a trial and when the results are analyzed is, “Are the intervention effects the same across levels of important baseline factors?” It is important that subgroups be examined. Clinical trials require considerable time and effort to conduct, and the resulting data deserve maximum evaluation. Subgroup analyses can support or elaborate a trial’s overall primary result, or provide exploratory results for the primary outcome that may have special interest for a particular subgroup. For example, analysis of data from the V-HeFT I trial suggested that the combination of isosorbide dinitrate and hydralazine might reduce mortality in blacks but not whites [157, 158]. This led to the initiation of a follow-up trial of the combination which enrolled only blacks with advanced heart failure [159]. The A-HeFT trial concluded that this therapy increased survival [160]. However, such success stories are not common, and care must be exercised in the interpretation of subgroup findings.

How to perform subgroup analyses when reporting clinical trial data has long been a controversial topic [140, 156]. Manuscripts reporting the results of clinical trials commonly include statements about and estimates of effects in subgroups, but the results of subgroup analyses are often misleading, having been over-interpreted or presented in a way that makes their interpretation ambiguous [129, 149]. Most published advice since the early 1980s has included a common set of specific recommendations for subgroup analyses: they should be adjusted for multiple comparisons, they should be prespecified, and they should be assessed using interaction tests (rather than by within-group estimates of the treatment effect) [142, 143, 153, 155, 156]. Making public a well-written protocol which specifies the proposed subgroups, together with biologically plausible hypotheses for each, and includes plans for performing and presenting the subgroup analyses is often recommended as well.

As the number of subgroups increases, the potential for chance findings increases due to multiple comparisons [132, 143, 144, 155]. Therefore, if one performs tests of significance on a large number of subgroup analyses, there will be an increased probability of false positive results unless adjustments are made. Adjustment for multiple interaction tests on a set of variables defining subgroups is necessary to control the number of false positive results. This can be done by such familiar methods as the Bonferroni correction or variants of it. An alternative suggested by guidelines for the New England Journal of Medicine is to report the expected number of false positives associated with a set of tests reported with nominal p-values [143, 153]; for example, this approach was taken for the ACCORD BP results [152]. Even with adjustments for multiplicity, however, over-interpretation of the results of treatment effects within subgroups can lead to irreproducible conclusions.
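As a minimal illustration of these two ways of handling multiplicity, suppose m = 20 nominal tests are reported at α = 0.05. The arithmetic below gives the Bonferroni threshold and the expected number of false positives under the global null; the independence assumption in the last line is ours and is used only to show the order of magnitude.

```python
# A minimal sketch: with m nominal tests at level alpha and all null hypotheses true,
# the expected number of false-positive results is m * alpha; the Bonferroni
# threshold keeps the chance of any false positive at or below alpha.
m, alpha = 20, 0.05
print("expected false positives under the global null:", m * alpha)                # 1.0
print("Bonferroni per-test threshold:", alpha / m)                                 # 0.0025
print("chance of at least one false positive (independent tests):",
      round(1 - (1 - alpha) ** m, 2))                                              # ~0.64
```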

Ideally, the subgroups to be analyzed should be pre-specified. Since it is almost always possible to find at least one suggestive subgroup effect by persistent exploration of the data after a trial is over, even when the intervention is completely inert, defining the groups to be analyzed in advance, preferably with argument for their biological plausibility, confers the greatest credibility. There is likely to be, however, low power for detecting differences in subgroups [132, 155], and they are more likely to be affected by imbalances in baseline characteristics [161, 162]. Therefore, investigators should not pay as much attention to statistical significance for subgroup questions as they do for the primary question. Recognizing the low chance of seeing significant differences, descriptions of subgroup effects are often qualitative. On the other hand, as mentioned previously, testing multiple questions can increase the chance of a Type I error. Even when prespecified, there are reasons to be cautious.

The Clopidogrel for High Atherothrombotic Risk and Ischemic Stabilization, Management, and Avoidance (CHARISMA) trial [131] compared long-term dual antiplatelet therapy with clopidogrel plus low-dose aspirin to aspirin alone for the prevention of cardiovascular events. Enrolled patients had either clinically evident cardiovascular disease (symptomatic) or multiple risk factors for atherothrombotic disease (asymptomatic). There was no difference between the two randomized arms, but 20 subgroup analyses were pre-specified in the protocol. For symptomatic vs. asymptomatic patients, the p-value for the interaction test was 0.045 and the p-value for benefit among the symptomatic patients was 0.046. This was reported as a suggestion of benefit for clopidogrel. Two accompanying editorials [143, 148] took issue with this conclusion for several reasons. The authors made no adjustment for multiple comparisons: had any correction been done, none of the subgroup analyses would have been even close to significant. The subsequent interpretation of the p-value for the symptomatic patients overstated its significance, which was marginal in any case. Furthermore, the significance of the interaction test seemed to be driven more by a harmful effect in the asymptomatic patients than by any beneficial effect in the symptomatic patients. Finally, from the clinical point of view, the distinction between symptomatic and asymptomatic was not clear, since some of the patients in the asymptomatic group had a history of major cardiovascular events at baseline. A subsequent re-analysis of subgroups with patients identified as primary prevention and secondary prevention found no within-subgroup benefit for the primary endpoint [153].

Even if not explicitly pre-specified, subgroup analyses may be identified in several ways, with different implications for the reliability of their results. For example, it might be reasonable to infer that subgroup hypotheses related to factors used to stratify the randomization, such as age, sex or stage of disease, were in fact considered in advance. Factors that are integrated into the study design may be implied as subgroups even if they are not explicitly stated in the protocol.

Of course, the same problems in interpretation apply here as with formally prespecified subgroups. The Prospective Randomized Amlodipine Survival Evaluation Study (PRAISE), a large multicenter trial [146], pre-specified several subgroups and, in addition, analyzed as a subgroup a baseline characteristic used to stratify the randomization, ischemic vs. non-ischemic etiology of chronic heart failure. While the primary outcome of death or cardiovascular hospitalization was nonsignificant and the secondary outcome of overall survival was nearly significant (p = 0.07), almost all of the risk reduction was in the non-ischemic subgroup, in which the risk reduction was 31% for the primary outcome (p = 0.04) and 46% for mortality (p < 0.001). However, the more favorable result was expected to be in the ischemic subgroup, not the non-ischemic subgroup. Thus, the investigators recommended that a second trial be conducted in the patient population with non-ischemic chronic heart failure using a nearly identical protocol to confirm this impressive subgroup result [146]. The results of the PRAISE-II trial proved disappointing, with no reduction in either the primary or secondary outcome [147]. Thus, the previous predefined subgroup result could not be confirmed.

On occasion, during the monitoring of a trial, particular subgroup findings may emerge and be of special interest. If additional participants remain to be enrolled into the trial, one approach is to test the new subgroup hypothesis in the later participants. With small numbers of participants, it is unlikely that significant differences will be noted. If, however, the same pattern emerges in the newly created subgroup, the hypothesis is considerably strengthened. Subgroups may also emerge during a trial by being identified in other, similar trials. If one study reports that the observed difference between intervention and control appears to be concentrated in a particular subgroup of participants, it is appropriate to see if the same findings occur in another trial of the same intervention, even though that subgroup was not pre-specified. Problems here include comparability of definition. It is unusual for different trials to have baseline information sufficiently similar to allow for characterization of identical subgroups. In the Raloxifene Use for The Heart (RUTH) trial [133], age groups were among a number of pre-specified subgroups, but the definition of the groups was modified to match what was used for the Women’s Health Initiative [77]. Though the subgroup effects from RUTH and WHI were consistent, their interpretation as real clinical effects was vigorously challenged [155].

The weakest type of subgroup analysis involves post hoc analysis, sometimes referred to as “data-dredging” or “fishing.” With this approach, subgroups are suggested by the data themselves. Because many comparisons are theoretically possible, tests of significance become difficult to interpret and should be challenged. Such analyses should serve primarily to generate hypotheses for evaluation in other trials. An example of subgrouping that was challenged comes from a study of diabetes in Iceland. Male children under the age of 14 and born in October were claimed to be at highest risk of ketosis-prone diabetes. Goudie [138] challenged whether the month of October emerged from post-study analyses biased by knowledge of the results. The ISIS-2 trial [141] illustrated a spurious subgroup finding that suggested the treatment benefit of aspirin after myocardial infarction was not present in individuals born under Gemini or Libra astrological signs. A similar example [135] suggested that twice as many participants with bronchial carcinoma were born in the month of March (p < 0.01), although this observation could not be reproduced [130, 134]. Subgroups unsupported by a biologically plausible hypothesis should be regarded with heightened caution.

Even subgroups supported by a biologically plausible rationale and suggesting beneficial effects can turn out to be irreproducible. Post hoc subgroup analyses were performed for a number of trials of beta-blocking drugs conducted in people who had had a myocardial infarction. One found that the observed benefit was restricted to participants with anterior infarctions [145]. Another claimed improvement only in participants 65 years or younger [128]. In the Beta-Blocker Heart Attack Trial, it was observed that the greatest relative benefit of the intervention was in participants with complications during the infarction [137]. These subgroup findings, however, were not consistently confirmed in other trials [136].

Post-hoc subgroups may be specified by comparing participants from two groups who experience the same event, or have similar outcomes; an early example is the discriminant analysis done for the Multicentre International Study [145]. Investigators frequently want to do this in an attempt to understand the mechanisms of action of an intervention. Sometimes this retrospective look can suggest factors or variables by which the participants could be subgrouped. As discussed earlier in this chapter, categorization of participants by any outcome variable, e.g., adherence, can lead to biased conclusions. If some subgroup is suggested in this way, the investigator should create that subgroup in each randomized arm and make the appropriate comparison. For example, she may find that participants in the intervention arm who died were older than those in the control arm who died. This retrospective observation might suggest that age is a factor in the usefulness of the intervention. The appropriate way to test this hypothesis would be to subgroup all participants by age and compare intervention versus control for each age subgroup.

An interesting post hoc subgroup analysis was reported by the Metoprolol CR/XL Randomized Intervention Trial (MERIT) [154]. This trial, which evaluated the effect of a beta-blocker in participants with chronic heart failure, had two primary outcomes. One was all-cause mortality and the other was death plus hospitalization. MERIT was terminated early by the monitoring committee due to a highly significant reduction in mortality, as shown in Fig. 18.6, and similar reductions in death plus hospitalization. The results are remarkably consistent across all of the predefined subgroups for mortality, mortality plus hospitalization, and mortality plus heart failure hospitalization, as shown in Fig. 18.7. Moreover, the results were very consistent with those from two other beta-blocker trials [37, 38], as shown in Figs. 18.8 and 18.9. However, post hoc analyses during review by regulatory agencies compared results among countries. These results are shown in Fig. 18.10. Of note is that for mortality, the relative risk in the United States slightly favors placebo, in contrast to the mortality results for the trial as a whole. With respect to the outcomes of mortality plus hospitalization, and mortality plus hospitalization for heart failure, the U.S. data are consistent with the overall trial results. As noted by Wedel et al. [154], the analysis for interaction depends on how the regional subgroups are formed. Whether the observed regional difference is real or due to chance has been debated, but Wedel and colleagues argued that it is not consistent with other external data, is not internally consistent within MERIT, and is not biologically plausible, and thus is most likely due to chance. This result does, however, point out the risks of post hoc subgroup analyses.

Fig. 18.6

MERIT Kaplan-Meier estimates of cumulative percentage of total mortality after randomization—p value nominal and adjusted for two interim analyses (MERIT) [37]. Reproduced with permission of Elsevier Ltd. for Lancet

Fig. 18.7

Relative risk and 95% confidence intervals for selected subgroups in the MERIT trial, for total mortality, total mortality and all hospitalization, and total mortality and heart failure hospitalization [154]. Reproduced with the permission of Elsevier Ltd. for the Amer Heart Journal

Fig. 18.8

Kaplan-Meier survival curves for the CIBIS-II trial, comparing bisoprolol and placebo [34]. Reproduced with the permission of Elsevier Ltd. for Lancet

Fig. 18.9

Kaplan-Meier Analysis of Time to Death for COPERNICUS trial, comparing Placebo and Carvedilol Group. The 35% lower risk in the carvedilol group was significant: p = 0.00013 (unadjusted) and p = 0.0014 (adjusted) [38]

Fig. 18.10

Relative risk and 95% confidence intervals for the MERIT trial, for outcomes of total mortality, total mortality and hospitalization for any cause, and total mortality and heart failure hospitalization [154]. Reproduced with the permission of Elsevier Ltd. for the Amer Heart Journal

Regardless of how subgroups are selected, several factors can provide supporting evidence for the validity of the findings. As mentioned, similar results obtained in several studies strengthen interpretation. Internal consistency within a study is also a factor. If similar subgroup results are observed at most of the sites of a multicenter trial, they are more likely to be true. And of course, not all follow-up analyses and replication studies refute the initial subgroup finding. In contrast, however, plausible post hoc biological explanations for the findings, while necessary, are not sufficient. Given almost any outcome, reasonable sounding explanations can be put forward.

The two most common approaches to analysis of subgroup effects are (1) multiple hypothesis tests for effects within subgroups and (2) interaction tests for homogeneity of effects across subgroups defined by each variable of interest. Of these two the consensus in the literature strongly favors the interaction test. The interaction test provides a single, global assessment of whether a categorical variable partitioning the study cohort is associated with different magnitudes of treatment effect. Estimates of those effects, with confidence intervals, provide exploratory indications of the consistency of the treatment effect across the population. Testing for treatment effects within subgroups, in contrast, requires a greater number of hypothesis tests and inflates the probability of a false positive result over the nominal significance level [132]. Statistical power and other considerations make the overall trial result a better guide to the effect in subgroups than the subgroup specific treatment effects [155, 156].

Often, attention is focused on subgroups with the largest intervention-control differences. However, even with only a few subgroups, the likelihood of large but spurious differences in effects of intervention between the most extreme subgroup outcomes can be considerable [136, 140, 150]. Because large, random differences can occur, subgroup findings may easily be over-interpreted. Peto has argued that observed quantitative differences in outcome between various subgroups are to be expected, and they do not imply that the effect of intervention is truly inhomogeneous [118].

It has also been suggested that, unless the main overall comparison for the trial is significant, investigators should be particularly conservative in reporting significant subgroup findings [150, 163]. Lee and colleagues conducted a simulated randomized trial, in which participants were randomly allocated to two groups, although no intervention was initiated [144]. Despite the expected lack of overall difference, a subgroup was found which showed a significant difference. Further simulations [132] have emphasized the potential for spurious results even when the main comparison is significant, and the importance of basing statements about significance on interaction tests rather than subgroup-specific tests.

In summary, subgroup analyses are important. However, they must be defined using baseline data and interpreted cautiously.

Not Counting Some Events

In prevention trials, there is a temptation not to count events that are observed in the immediate post-randomization follow-up period. The rationale for this practice is that events occurring that rapidly must have existed at screening but were not detected. For example, if a cancer prevention trial randomized participants to a vitamin or placebo, any cancer event occurring immediately after randomization could not have been prevented, since the cancer had to have already been present subclinically at entry. Because the intervention could not have prevented these cases, their inclusion only dilutes the results and decreases power. While such an argument has some appeal, it must be viewed with caution. Rarely are mechanisms of action of therapies or interventions fully understood. More importantly, a negative impact of an intervention with a more immediate effect might not be seen as easily or as quickly with this approach. If used at all, and this should be rare, the data must be presented both ways, i.e., with and without the excluded events.

An extreme case of dropping early events might be in a surgical or procedure trial. Participants assigned to the procedure might be put at higher risk of a fatal or irreversible event. These early risks to the participant are part of the overall intervention effect and should not be eliminated from the analysis.

Some trials have defined various counting rules for events once participants have dropped out of the study or reached some level of nonadherence. For example, the Anturane Reinfarction Trial [28] specified that events occurring more than 7 days after going off study medication would not be counted. It is not clear what length of time for eliminating events is appropriate to avoid bias. For example, if a participant with an acute disease continues to decline and is removed from therapy, bias could be introduced if the therapy itself is contributing to the decline through adverse effects and toxicity. In the APPROVe trial [81–84] described earlier in this chapter, the decision not to count events after 14 days and not to follow participants after that period of time led to controversy. In fact, the results and the interpretation were different once almost complete follow-up was obtained [84].

Comparison of Multiple Variables

If enough hypothesis tests are done, some may be significant by chance even if all the hypotheses being tested are false. This issue of multiple comparisons includes repeated looks at the same response variable (Chap. 15) and comparisons of multiple variables. Many clinical trials have more than one response variable, and prespecify several subgroups of interest. Thus, a number of statistical comparisons are likely to be made. For example, when performing 20 independent comparisons, one of them, on the average, will be significantly different by chance alone using 0.05 as the level of significance. The implication of this is that the investigator should be cautious in the interpretation of results if she is making multiple comparisons. The alternative is to require a more conservative significance level. As noted earlier, lowering the significance level will reduce the power of a trial. The issue of multiple comparisons has been discussed by Miller [164], who reviewed many proposed approaches.

One way to counter the problem is to increase the sample size so that a smaller significance level can be used while maintaining the power of the trial. However, in practice, most investigators could probably not afford to enroll the number of participants required to compensate for all the possible comparisons that might be made. As an approximation, if investigators are making k comparisons, each comparison should be made at the significance level α/k, a procedure known as the Bonferroni correction [164]. Thus, for k = 10 and α = 0.05, each test would need to be significant at the 0.005 level. Sample size calculations involving a significance level of 0.005 will dramatically increase the required number of participants. The Bonferroni correction is quite conservative in controlling the overall α level, or false positive error rate, if the test statistics are correlated, which is often the case. Therefore, it may be more reasonable to calculate sample size based on one primary response variable, limit the number of comparisons, and be cautious in claiming significant results for other comparisons.

However, there are other procedures to control the overall α level, and we briefly summarize two of them [165, 166]. Assume that we prespecify m hypotheses to be tested, involving multiple outcomes, multiple subgroups, or a combination. The goal is to control the overall α level. One implementation of the Holm procedure [166] is to order the p values from smallest to largest as p(1), p(2), …, p(m), corresponding to the m hypotheses H(1), H(2), …, H(m). The Holm procedure then rejects H(1) if p(1) ≤ α/m. If and only if H(1) is rejected can we consider the next hypothesis; in that case, H(2) can be rejected if p(2) ≤ α/(m − 1). This process continues until we fail to reject, at which point the testing must stop. The Holm procedure can also be applied if the m hypotheses can be ordered according to their importance. Here, the most important hypothesis H(1) can be rejected only if the corresponding p value is less than α/m. If it is rejected, the next most important hypothesis H(2) can be rejected if its p value is less than α/(m − 1).

Hochberg’s procedure [165] also requires that the m hypotheses be specified in advance and orders the p-values from smallest to largest, as does Holm’s. Hochberg’s procedure, however, steps up from the largest p-value: all m hypotheses are rejected if p(m) ≤ α. If this is not the case, the remaining m − 1 hypotheses can all be rejected if p(m − 1) ≤ α/2, and so on; in general, H(1), …, H(k) are rejected for the largest k such that p(k) ≤ α/(m − k + 1). The two procedures will not always give the same rejection pattern, so it is important to prespecify which one will be used.
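A minimal sketch of the two step procedures, with hypothetical p-values, is given below; the example is constructed so that Holm rejects nothing while Hochberg, whose largest p-value falls below α, rejects all four hypotheses.

```python
# A minimal sketch: Holm (step-down) and Hochberg (step-up) procedures for m
# prespecified hypotheses, using hypothetical p-values.
def holm(pvals, alpha=0.05):
    """Return a list of booleans: True where the hypothesis is rejected."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])  # indices sorted by ascending p-value
    reject = [False] * m
    for rank, i in enumerate(order):                  # rank 0 is the smallest p-value
        if pvals[i] <= alpha / (m - rank):            # thresholds alpha/m, alpha/(m-1), ...
            reject[i] = True
        else:
            break                                     # stop at the first failure to reject
    return reject

def hochberg(pvals, alpha=0.05):
    """Return a list of booleans: True where the hypothesis is rejected."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])  # indices sorted by ascending p-value
    reject = [False] * m
    for rank in range(m - 1, -1, -1):                 # start from the largest p-value
        if pvals[order[rank]] <= alpha / (m - rank):  # thresholds alpha, alpha/2, ..., alpha/m
            for i in order[:rank + 1]:                # reject this and all smaller p-values
                reject[i] = True
            break
    return reject

p = [0.02, 0.03, 0.04, 0.045]
print(holm(p))       # Holm rejects none of these
print(hochberg(p))   # Hochberg rejects all four, since the largest p-value is below alpha
```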

In considering multiple outcomes or subgroups, it is important to evaluate the consistency of the results qualitatively, and not stretch formal statistical analysis too far. Most formal comparisons should be stated in advance. Beyond that, one engages in observational data analysis to generate ideas for subsequent testing.

Use of Cutpoints

Splitting continuous variables into two categories, for example by using an arbitrary “cutpoint,” is often done in data analysis. This can be misleading, especially if the cutpoint is suggested by the data. As an example, consider the constructed data set in Table 18.6. Heart rate, in beats per minute, was measured prior to intervention in two groups of 25 participants each. After therapies A and B were administered, the heart rate was again measured. The average changes in groups A and B are not significantly different from each other (p = 0.75) using a standard t-test. However, if these same data are analyzed by splitting the participants into “responders” and “non-responders,” according to the magnitude of heart rate reduction, the results can be made to vary. Table 18.7 shows three such possibilities, using reductions of 7, 5, and 3 beats per minute as definitions of response. As indicated, the significance levels, using a chi-square test or Fisher’s exact test, change from not significant to significant and back to not significant. This constructed example shows that by manipulating the cutpoint, one can obtain a significance level less than 0.05 even when there does not really seem to be a difference.

Table 18.6 Differences in pre- and post-therapy heart rate, in beats per minute (HR), for Groups A and B, with 25 participants each
Table 18.7 Comparison of change in heart rate in Group A versus B by three choices of cutpoints
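The following sketch, using synthetic data rather than the data of Tables 18.6 and 18.7, shows how such an analysis proceeds: a t-test on the continuous change, followed by Fisher’s exact test after dichotomizing at several candidate cutpoints. With different cutpoints the dichotomized p-values can drift in and out of nominal significance even though the underlying data never change.

```python
# A minimal sketch (synthetic data, not the data of Tables 18.6 and 18.7):
# comparing the continuous analysis with analyses after dichotomizing at
# several "responder" cutpoints.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
# Change in heart rate (beats per minute) for two groups with no true difference.
change_a = rng.normal(-5, 6, size=25)
change_b = rng.normal(-5, 6, size=25)

t_stat, p_cont = stats.ttest_ind(change_a, change_b)
print(f"t-test on the continuous change: p = {p_cont:.2f}")

for cutpoint in (-7, -5, -3):     # "responder" = reduction of at least 7, 5, or 3 bpm
    resp_a = int(np.sum(change_a <= cutpoint))
    resp_b = int(np.sum(change_b <= cutpoint))
    table = [[resp_a, 25 - resp_a], [resp_b, 25 - resp_b]]
    _, p_fisher = stats.fisher_exact(table)
    print(f"reduction of at least {-cutpoint} bpm: Fisher exact p = {p_fisher:.2f}")
# The dichotomized p-values move around with the cutpoint even though the
# underlying data, and the continuous comparison, are unchanged.
```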

Noninferiority Trial Analysis

As discussed in Chap. 5, noninferiority trials are challenging to design, conduct and analyze. We pointed out the special challenges in setting the margin of noninferiority. However, once that margin is established prior to the start of the trial, there remain several issues that must be included in a rigorous analysis and reported, because of the clinical and regulatory implications [13, 167–183]. If we define I to be the new intervention, C to be the control or standard, and P to be the placebo or no treatment, then we obtain from the noninferiority trial an estimate of the relative risk (RR) of I to C, RR(I/C), or of an absolute difference. The metric must be established in the design, since the sample size and the interim monitoring depend on it. The first analytic challenge is to establish whether the new intervention met the criteria for noninferiority, a part of which is demonstrating that the upper limit of the 95% confidence interval of the estimate was less than the noninferiority margin.

As shown in Fig. 18.11, from Pocock and Ware [181], if the upper limit of the 95% confidence interval for the relative risk is less than unity, various degrees of evidence exist for superiority (see case A). For noninferiority trials, if the upper limit of the 95% confidence interval is less than the margin of non-inferiority, δ, then there is evidence for noninferiority (see cases B and C). Failure to be less than this margin does not provide evidence for noninferiority (see case D). The design must have sufficient sample size and power to rule out the margin of noninferiority, as discussed in Chap. 8. Although not expected when the study was designed, a noninferiority trial might also indicate harm (see case E).

Fig. 18.11

Relative risks and 95% confidence intervals for a series of superiority and non-inferiority trials [181]
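Returning to the first analytic challenge, the check of the upper confidence limit against the prespecified margin can be sketched as follows, using hypothetical counts and a simple Wald-type interval for the absolute risk difference; exact or score intervals may be preferred in practice.

```python
# A minimal sketch (hypothetical counts): checking whether the upper limit of the
# 95% confidence interval for the absolute risk difference (I minus C) lies below a
# prespecified noninferiority margin delta.
import math

events_i, n_i = 48, 3000     # new intervention
events_c, n_c = 42, 3000     # active control
delta = 0.02                 # prespecified absolute noninferiority margin

p_i, p_c = events_i / n_i, events_c / n_c
diff = p_i - p_c
se = math.sqrt(p_i * (1 - p_i) / n_i + p_c * (1 - p_c) / n_c)
upper = diff + 1.96 * se     # Wald-type interval; exact or score intervals may be preferred

print(f"risk difference {diff:.4f}, upper 95% limit {upper:.4f}")
print("noninferior at margin delta" if upper < delta else "noninferiority not shown")
```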

The second desired goal of a noninferiority analysis is to demonstrate that the new intervention would have beaten a placebo or no treatment if it had been included; that is, to estimate RR(I/P). Analytically, this can be accomplished by recognizing that RR(I/P) = RR(I/C) × RR(C/P). However, this imputation step requires at least two critical assumptions: (1) the effect of the control is constant over time, and (2) the population in which the control was tested against placebo is relevant to the current setting in which the intervention (I) is being tested. These assumptions are difficult, perhaps impossible, to establish (see Chap. 5). In this chapter, we will focus our attention on the first challenge of establishing whether or not the intervention versus control comparison was less than the noninferiority margin.
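For completeness, a hedged sketch of the imputation (sometimes described as a synthesis approach) is shown below with hypothetical estimates: the log relative risks are added and, under an additional assumption that the two estimates are independent, their standard errors are combined. All numbers here are illustrative, not from any particular trial.

```python
# A minimal sketch (hypothetical estimates): imputing RR(I/P) from the current
# trial's RR(I/C) and a historical estimate of RR(C/P), working on the log scale.
# This rests on the constancy assumption described in the text and treats the two
# estimates as independent.
import math

rr_ic, se_log_ic = 1.05, 0.10   # new intervention vs. active control (current trial)
rr_cp, se_log_cp = 0.70, 0.08   # active control vs. placebo (historical trials)

log_rr_ip = math.log(rr_ic) + math.log(rr_cp)
se_log_ip = math.sqrt(se_log_ic ** 2 + se_log_cp ** 2)

rr_ip = math.exp(log_rr_ip)
lo, hi = (math.exp(log_rr_ip - 1.96 * se_log_ip),
          math.exp(log_rr_ip + 1.96 * se_log_ip))
print(f"imputed RR(I/P) = {rr_ip:.2f}, 95% CI ({lo:.2f}, {hi:.2f})")
```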

Assuming that an appropriate active control was selected, the trial must implement it according to best practice, as well as or better than was done in the initial trial establishing its benefit [172]. Otherwise, the new intervention is being compared to a handicapped control, making it easier for the new intervention to appear similar to or even better than the control. Poor adherence and conduct will favor the new intervention in a noninferiority trial, instead of handicapping it as in a superiority trial [179]. Thus, as discussed in Chap. 14, adequate measures of adherence must be collected during the trial in order to make this critical assessment. Adherence in this case does not only mean whether the participant took all or almost all of the intervention and control drugs. What else participants were taking as concomitant medication is also a consideration. If there is a substantial imbalance, interpretation of the results would be difficult.

Another key factor is whether the outcomes chosen are true measures of the effect of both the new intervention and the control. This is sometimes referred to as assay sensitivity [177]. Whether consciously or not, an investigator might select an outcome that would show no change no matter what intervention was being studied, and thus guarantee that the noninferiority margin would be achieved. Outcomes should be similar to those used in the positive control versus placebo trials.

There is debate about whether the intention-to-treat analysis or the “on treatment” analysis is most appropriate for a trial designed to show noninferiority. If intention-to-treat is used, nonadherence dilutes whatever differences may exist and thus is biased toward noninferiority. An “on treatment” analysis compares only those who are good adherers, or who at least took some predefined portion of the intervention, and thus is closer to testing the true effect. However, as we demonstrated earlier in this chapter, analyzing trials by adherence to an intervention can be substantially biased, and the direction of the bias cannot be predicted. Thus, we do not recommend such an analysis, because of the uncertainty of the bias and its direction, and instead recommend that a trial be designed to minimize nonadherence. The true effect of the new intervention may lie somewhere between the intention-to-treat and “on treatment” estimates, but there is no dependable way to tease that estimate out. If both analytic approaches confirm noninferiority, then the conclusion is more robust, assuming that the noninferiority margin is reasonable [178].

Any trial relies on an adequate sample size to have power to test hypotheses of interest, whether for superiority or noninferiority. For a superiority trial, inadequate sample size works against finding differences but for noninferiority, inadequate sample size favors finding noninferiority. There is a difficult balance between having a noninferiority margin that is too small and thus requiring an unachievable sample size and having a margin that is so large that the sample size is appealing but the results would not be convincing.

There are many examples of noninferiority trials, but we will use one to illustrate the challenges. The Stroke Prevention using an ORal Thrombin Inhibitor in atrial Fibrillation (SPORTIF) V trial compared a new intervention, ximelagatran, against standard warfarin in participants with atrial fibrillation [183], with a primary outcome of stroke incidence. A number of issues were involved. First, there were no very good warfarin versus placebo trials with which to set the noninferiority margin. Second, the trial used the absolute difference as the metric, assuming the event rate would be around 3%, but instead observed an event rate less than half that. Thus, the prespecified noninferiority margin of 2% was too large given the small event rate. If the observed event rate of 1.5% had been assumed, the prespecified margin would have been much smaller, perhaps closer to 1%. The observed stroke rates were 1.2% in the warfarin group and 1.6% in the ximelagatran group, with a 95% CI of −0.13% to 1.03% for the difference, which would meet the initial margin of noninferiority. However, this was not adequate for a margin of 1%. Therefore, even though margins may be set in advance, the results may invalidate the assumptions and thus the margin itself.

As discussed in Chap. 21, presentation of the results of noninferiority trials is more complex than for superiority trials because all of the assumptions must be so carefully and clearly laid out [181].

Analysis Following Trend Adaptive Designs

As discussed in Chaps. 5 and 17, the design of a trial may have an adaptive element. This might be a group sequential design allowing early termination for overwhelming benefit, a strong signal of harm, or perhaps futility. Among the adaptive designs discussed, some involve changing the sample size. Some sample size changes are driven by an overall lower event rate or higher variability in the primary outcome than was assumed in the original sample size estimate; in these instances, the final analysis proceeds as usual. However, another class of sample size changes relies on trend adaptive designs. In these designs, which depend on the emerging trend in the data, the final critical value or significance level will be affected and thus must be kept in mind for the final analysis.

For example, some trials may monitor accumulating interim data and may terminate early for evidence of benefit or harm. If a group sequential design using a 0.05 two-sided significance level O’Brien-Fleming boundary were used five times during the trial, at approximately equally spaced looks, the final critical values would not be +1.96 and −1.96 for the upper and lower bounds but values closer to ±2.04.
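This boundary value can be verified approximately by simulation. The sketch below is a Monte Carlo under the null with five assumed equally spaced looks, not any particular software package's implementation; it searches for the O’Brien-Fleming-type constant whose overall two-sided error is 0.05, and the final critical value comes out near 2.04 rather than 1.96.

```python
# Monte Carlo check of an O'Brien-Fleming-type boundary with K equally spaced looks.
import numpy as np

rng = np.random.default_rng(1)
K, n_sim = 5, 200_000
t = np.arange(1, K + 1) / K                           # information fractions
increments = rng.normal(size=(n_sim, K)) * np.sqrt(1 / K)
z_paths = np.cumsum(increments, axis=1) / np.sqrt(t)  # standardized Z at each look

def crossing_prob(c):
    # probability of crossing the boundary c / sqrt(t_k) at any look, under the null
    return np.mean((np.abs(z_paths) >= c / np.sqrt(t)).any(axis=1))

# crude bisection for the constant c giving overall two-sided error 0.05
lo, hi = 1.5, 3.0
for _ in range(40):
    mid = (lo + hi) / 2
    lo, hi = (mid, hi) if crossing_prob(mid) > 0.05 else (lo, mid)
c = (lo + hi) / 2
print(f"final critical value: about {c:.2f}")   # roughly 2.04
```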

For trend adaptive sample size changes, the final critical value depends on which methodology was used, but all typically require a more conservative value than, for example, the 1.96 corresponding to a two-sided nominal alpha level of 0.05.

In addition to adjusting the final critical value, the analyses for these trend adaptive designs may also require a modified test statistic. For example, if the method of Cui et al. [184] is used to increase the sample size, a weighted test statistic as described in Chap. 17 is required. Future observations are given less weight than the early existing observations. The usual test statistic is not appropriate in this situation. For the other trend adaptive methods described in Chaps. 5 and 17, the final analysis can proceed with the standard statistics in the usual straightforward fashion, adjusting for the final critical value from sequential testing as appropriate.
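A minimal sketch of such a weighted combination statistic follows, with hypothetical interim and second-stage Z-values; the exact form for a given trial should follow Chap. 17 and the original reference. The weights are fixed by the originally planned information fraction, so observations collected after the sample size increase carry less weight.

```python
# Weighted combination statistic of the Cui-Hung-Wang type (illustrative only).
import math

def chw_statistic(z_stage1, z_stage2_incremental, planned_info_fraction):
    # weights fixed in advance by the originally planned information fraction
    w1 = math.sqrt(planned_info_fraction)
    w2 = math.sqrt(1 - planned_info_fraction)
    return w1 * z_stage1 + w2 * z_stage2_incremental

# e.g. interim at the planned halfway point, then an enlarged second stage
print(chw_statistic(z_stage1=1.20, z_stage2_incremental=1.50,
                    planned_info_fraction=0.5))
```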

Meta-analysis of Multiple Studies

Often in an area of clinical research, several independent trials using similar participants and similar intervention strategies are conducted over a period of a few years. Some may be larger multicenter trials, but there may be a substantial number of small trials, none of which was conclusive individually, though they may have served as pilots for a larger subsequent study. Investigators from a variety of medical disciplines often review the cumulative data on similar trials and try to develop a consensus conclusion about the overall results [185–193]. If this overview is performed by a formal process and with statistical methods for combining all the data in a single analysis, the analysis is usually referred to as a meta-analysis or systematic review. Methods suitable for this purpose were described in 1954 by Cochran [194] and later by Mantel and Haenszel [195]. Other authors have summarized the methodologic approaches [196–207]. The Cochrane Collaboration has been a major contributor to systematic reviews of controlled trials [208], often organized around a specific health care area or issue, including systematic reviews of adverse effects and advice on how to conduct such reviews. Guidelines intended to improve the conduct and reporting of meta-analyses have been published [209, 210]. There are numerous examples of meta-analysis in a variety of medical disciplines, and a few are referenced here [211–221]. A great deal has been written and discussed about the usefulness and challenges of meta-analyses [222–233].

Rationale and Issues

Researchers conduct systematic reviews and meta-analyses to address a number of important questions [190]. Probably the most common motivation is to obtain a more precise estimate of an intervention effect and, in particular, to increase the power to detect small but clinically important effects. However, meta-analyses can also evaluate the generalizability of results across trials, populations, and specific interventions. Subgroup analyses based on small numbers of participants may not lead to firm conclusions and may miss qualitative differences in effect. Post hoc subgroup analyses are unreliable due to multiplicity of testing. A prespecified meta-analysis offers the opportunity to examine a limited number of hypotheses identified in individual trials. Meta-analysis of subgroups can guide clinicians in their practice by identifying participants most suitable for the intervention. In addition, meta-analysis can support submissions to the U.S. Food and Drug Administration. If a major clinical trial is being initiated, a sensible approach is to base many aspects of the design on a summary of all existing data. Meta-analysis is a systematic process that can provide critical information on definitions of population and intervention, control group response rates, expected size of the intervention effect, and length of follow-up. Finally, if a new treatment or intervention gains widespread popularity early in its use, a meta-analysis may provide a balanced perspective and may suggest the need for a single, large, properly designed clinical trial to provide a definitive test. Furthermore, if the opportunity to conduct a new large study no longer exists due to a loss of equipoise, even if this loss is not well justified, a meta-analysis may be the only option for reaching a reliable consensus.

As indicated, a meta-analysis is the combination of results from similar participants evaluated by similar protocols and interventions. The standard analysis of a multicenter trial, stratified by clinical center, is in some ways an ideal meta-analysis. Each center plays the role of a small study. Protocols and treatment strategies are identical, and participants are more similar than those in a typical collection of trials.

This contrast between a meta-analysis and a multicenter trial points out some limitations of the former. While the implementation of a clinical protocol can vary across centers, such differences are negligible compared to those in a collection of independently conducted large or small trials. Even when the analysis is done by pooling participant-level data from each trial [212, 217], meta-analysis cannot be expected to produce the same level of evidence as a single, large clinical trial. In a typical meta-analysis, important differences exist in actual treatment, study population, length of follow-up, measures of outcome, level of background medical care in international trials, and quality of data [222, 225–228, 233]. Because of these differences, the potential for meta-analysis should never be a justification for conducting a series of small, loosely connected studies with the expectation that a definitive result can be produced by combining them after the fact. Perhaps the most fundamental problem is the potential to create bias when deciding which studies to include in a meta-analysis. Two examples of such bias are selection bias and investigator bias.

Many support the concept that the most valid overview and meta-analysis requires that all relevant studies conducted be available for inclusion, or at least for consideration [190, 226]. Failing to do so can produce selection bias; that is, a mis-estimation caused by analysis of a non-representative sample. For example, Furberg [228] provides a review of seven meta-analyses of lipid lowering trials. Each article presents different inclusion criteria, such as the number of participants or the degree of cholesterol reduction. The results vary depending on the criteria used. Another example of selection bias in meta-analysis involves the investigation of whether adding manual thrombus aspiration to primary percutaneous coronary intervention (PPCI) reduces total mortality. Between 1996 and 2009, about 20 small clinical trials and one larger trial, the Thrombus Aspiration during Percutaneous Coronary Intervention in Acute Myocardial Infarction Study (TAPAS) [234], were conducted to address whether PPCI with thrombus aspiration might have benefits over PPCI alone. These trials were not powered for total mortality and the smaller trials were not consistently positive; however, the largest suggested a possible 50% mortality benefit for manual thrombus aspiration. A series of meta-analyses sought to clarify the situation [212, 235–240]. Despite having identical aims, nearly identical inclusion criteria, and access to the same small set of trial results, no two meta-analyses included the same set of studies, and results varied. Because the conclusions conflicted, no consensus was produced. The Thrombus Aspiration in ST-Elevation Myocardial Infarction in Scandinavia (TASTE) trial, designed with mortality as its primary outcome, concluded that there was no effect [20, 241], but a subsequent meta-analysis including the TASTE trial, while finding a non-significant effect on mortality, concluded that a modest reduction in clinical outcomes exists [242].

While it is difficult enough to decide which well-known and published trial results to include, a further serious complication is that some relevant trial results may not be readily accessible in the literature due to publication bias [223, 231]. Published trials are more likely to be statistically significant (p < 0.05) or to favor a novel intervention. Trials that yield neutral or indifferent results are less likely to be published. One example described by Furberg and Morgan [227] illustrates this problem. An overview [223] of the use of propranolol in patients following a heart attack reported that 7 of 45 patients died in the hospital, compared with 17 of 46 in a non-randomized placebo control, indicating a clear benefit of propranolol. Controversy over design limitations motivated the investigator to conduct two additional randomized trials. One showed no difference and the other a negative (harmful) trend. Neither was ever published. Identifying yet another obstacle to inclusion of all relevant studies, Chalmers et al. [224] pointed out that a MEDLINE literature search may find only 30–60% of published trials. This is due in part to the way results are presented; searches of typical key words may not uncover relevant papers. Although search engines may be better now, there are undoubtedly still limitations. Work by Gordon and colleagues found that only 57% of 244 NHLBI-supported trials completed between January 2000 and December 2011 published their main results within 30 months after completion [243]. These difficulties in identifying and accessing the entire population of relevant studies may lead to analysis of a non-representative subset of trial results, producing conclusions which do not reflect the totality of evidence because of selection bias.

Another type of bias, referred to as investigator bias, occurs when an investigator ignores or goes beyond any pre-specified plan and makes subjective decisions about which trials and outcome variables will get reported. If protocols were written well and adhered to strictly, investigator bias would not be a problem. However, post hoc repeated testing of multiple subgroups and multiple outcomes may not be easy to detect from the published report [229]. Promising early results may draw major attention, but if later results show smaller intervention effects, they may go unnoticed or be harder to find for the systematic review. Furthermore, authors of systematic reviews are also subject to investigator bias. That is, unless the goals of the meta-analysis are clearly stated a priori in a protocol, a positive result can be found by sifting through numerous analytic attempts. A great deal of time and persistence are required to gain access to all known conducted trials and accurately extract the relevant data. Not all meta-analyses are conducted with the same degree of thoroughness.

The medical literature is filled with meta-analyses of trials covering a wide range of disciplines [211–221]. Several examples from the cardiology literature provide an overview. Chalmers and colleagues [214] reviewed six small studies that used anticoagulants in an effort to reduce mortality in heart attack patients. While only one of the six was individually significant, the combined overall results suggested a statistically significant 4.2% absolute reduction in mortality. The authors suggested no further trials were necessary. However, because of the issues raised, this analysis drew serious criticism [229]. Several years later, Yusuf and colleagues [221] reviewed 33 fibrinolytic trials, focusing largely on the use of streptokinase. This overview included trials with much dissimilarity in dose, route and time of administration, and setting. Although the meta-analysis for intravenous use of fibrinolytic drugs was impressive, and the authors concluded that the results were not due to reporting biases, they nevertheless discussed the need for future large-scale trials before widespread use could be recommended. There were issues, for example, as to how quickly such an intervention needed to be started after the onset of a heart attack; that is, timing needed to be resolved. Canner [213] conducted an overview of six randomized clinical trials testing aspirin use to reduce mortality in participants with a previous heart attack. His overall meta-analysis suggested a 10% reduction that was not significant (p = 0.11). However, there was apparent heterogeneity of results, and the largest trial had a slightly negative mortality result. The Canner overview was repeated by Hennekens et al. [215] after several more trials had been conducted, and the updated analysis demonstrated favorable results. May et al. [218] conducted an early overview of several modes of therapy for secondary prevention of mortality after a heart attack. Their overview covered anti-arrhythmic drugs, lipid-lowering drugs, anticoagulant drugs, beta-blocker drugs, and physical exercise. Although statistical methods were available to combine studies within each treatment class, they chose not to combine results, but simply presented relative risks and confidence intervals graphically for each study; visual inspection of the trends and variation in trial results served as an informal summary analysis. Yusuf et al. [220] later provided a more detailed overview of beta-blocker trials. While using a similar graphical presentation, they calculated a summary odds ratio and its confidence interval. Meta-analyses of cancer trials have also been conducted, including of adjuvant therapy for breast cancer [216]. While the use of multiple chemotherapeutic agents indicated improved relapse-free survival after 3 and 5 years of follow-up, as well as improved survival, the dissimilarity among the trials led the authors to call for more trials and better data.

Thompson [232] pointed out the need to investigate sources of heterogeneity. These differences may be in the populations studied, intervention strategies, outcomes measured, or other logistical aspects. Given such differences, inconsistent results among individual studies might be expected. Statistical tests for heterogeneity often have low statistical power even in the presence of moderate heterogeneity. Thompson [232] argued that we should investigate the influence of apparent clinical differences between studies and not rely on formal statistical tests to give us assurance of no heterogeneity. In the presence of apparent heterogeneity, overall summary results should be interpreted cautiously. Thompson described an example of a meta-analysis of 28 studies evaluating cholesterol lowering and its impact on the risk of coronary heart disease. A great deal of heterogeneity was present, so a simple overall estimate of risk reduction may be misleading. He showed that the age of the cohort, length of treatment, and size of study were contributing factors. Taking these factors into account made the heterogeneity less extreme and the results more interpretable. One analysis showed that the percent reduction in risk decreased with the age of the participant at the time of the event, a point not seen in the overall meta-analysis. However, he also cautioned that such analyses of heterogeneity must be interpreted cautiously, just as for subgroup analyses in any single trial.
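The common quantitative summaries of heterogeneity, Cochran's Q and the I-squared statistic, are simple to compute; a small sketch with made-up study estimates and standard errors:

```python
# Cochran's Q and I-squared for a set of hypothetical study results.
import numpy as np

effects = np.array([0.10, 0.25, -0.05, 0.30, 0.15])   # hypothetical effect estimates
se = np.array([0.08, 0.12, 0.10, 0.15, 0.09])          # hypothetical standard errors

w = 1 / se**2                                          # inverse-variance weights
pooled = np.sum(w * effects) / np.sum(w)               # fixed-effect pooled estimate
Q = np.sum(w * (effects - pooled) ** 2)                # Cochran's Q
df = len(effects) - 1
I2 = max(0.0, (Q - df) / Q) * 100                      # percent of variation beyond chance
print(f"Q = {Q:.2f} on {df} df, I^2 = {I2:.0f}%")
```

As noted above, a non-significant Q should not be read as assurance of homogeneity, since the test has low power when only a few studies are available.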

Meta-analysis, as opposed to the typical literature review, usually puts a p-value on the conclusion. The statistical procedure may allow for calculation of a p-value, but it implies a precision which may be inappropriate. The possibility that not all relevant studies have been included may make the interpretation of the p-value tenuous. Quality of data may vary from study to study. Data from some trials may be incomplete without being recognized as such. Thus, only very simple and unambiguous outcome variables, such as all-cause mortality and major morbid events, ought to be used for meta-analysis.

Statistical Methods

Since meta-analysis became a popular approach to summarizing a collection of studies, numerous statistical publications have addressed its technical aspects [186, 194–196, 198–201, 203, 205, 207]. Most of this work is beyond the technical scope of this text, but a number of texts on the subject of meta-analysis are available [197, 202, 204, 206]. Two common technical approaches were first suggested by Cochran [194] in 1954. If all trials included in the meta-analysis are estimating the same true (but unknown) fixed effect of an intervention, the Mantel-Haenszel method [195] can be used with a slight variation. This is similar to the logrank or Mantel-Haenszel method in the chapter on survival analysis. If the trials are assumed to have dissimilar or heterogeneous true intervention effects, the effects are described by a random effects model, as suggested by DerSimonian and Laird [200]. Another valid but less common approach relies on a Bayesian analysis [204], which was used to assess the literature on adjunctive thrombectomy for acute myocardial infarction [239].

The method of DerSimonian and Laird [200] compares rate differences within each study and obtains a pooled estimate of the rate difference as well as its standard error. The pooled estimate of the rate difference is a weighted average of the individual study rate differences. The weights are the inverse of the sum of the between-study and within-study variance components of the intervention effect. If the studies are relatively similar or homogeneous in intervention effect, this approach and the fixed effects method produce very similar results [196]. Heterogeneity tests generally are not as powerful as the test for main effects. However, if studies vary in intervention effect, these two methods can produce different results, as illustrated by Berlin et al. [196] as well as Pocock and Hughes [203].
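A minimal sketch of the two calculations (hypothetical rate differences and within-study variances) makes the relationship concrete: the random-effects weights simply add an estimated between-study variance to each study's within-study variance before pooling.

```python
# Fixed-effect (inverse-variance) and DerSimonian-Laird random-effects estimates.
import numpy as np

rd = np.array([0.020, 0.035, -0.010, 0.050, 0.015])           # hypothetical rate differences
var_within = np.array([0.0004, 0.0009, 0.0006, 0.0016, 0.0005])

# Fixed-effect (inverse-variance) estimate
w_fixed = 1 / var_within
fixed = np.sum(w_fixed * rd) / np.sum(w_fixed)

# DerSimonian-Laird estimate of the between-study variance tau^2
Q = np.sum(w_fixed * (rd - fixed) ** 2)
df = len(rd) - 1
c = np.sum(w_fixed) - np.sum(w_fixed**2) / np.sum(w_fixed)
tau2 = max(0.0, (Q - df) / c)

# Random-effects weights add the between-study component to each within-study variance
w_re = 1 / (var_within + tau2)
random_effects = np.sum(w_re * rd) / np.sum(w_re)

print(f"fixed-effect estimate:   {fixed:.4f}")
print(f"random-effects estimate: {random_effects:.4f}")
```

With little heterogeneity the two estimates nearly coincide; as heterogeneity grows, the random-effects weights become more nearly equal across studies, which is why small trials can gain influence under that model.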

Typically, when presenting the results of a meta-analysis, the OR estimate and 95% confidence interval for each trial are plotted in a single graph to provide a visual summary. Figure 18.12, from Yusuf et al. [221], summarizes the effects of 24 trials of fibrinolytic treatment on mortality in people with an acute heart attack. The hash mark represents the estimated OR and the line represents the 95% confidence interval. They [221] include a single estimate of the OR, combining all studies. The size of the symbol in these plots, sometimes referred to as “forest plots,” indicates the size of each individual study. In the presence of serious heterogeneity of treatment effect, however, the appropriateness of obtaining a single point estimate must be questioned. If the heterogeneity is qualitative, that is, some estimates of the OR are larger than unity and others less than unity, then a combined single estimate is perhaps not wise. This would be especially true if the estimates indicated a time trend, which could occur if dose and participant selection changed as more experience with the new intervention was obtained.

Fig. 18.12

Apparent effects of fibrinolytic treatment on mortality in the randomized trials of IV treatment of acute myocardial infarction (Reproduced with permission of the Editor, European Heart Journal and Dr. S. Yusuf)
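A bare-bones version of such a display can be produced with a few lines of plotting code; the sketch below uses entirely hypothetical odds ratios, confidence limits, and weights.

```python
# Minimal forest plot sketch: squares sized by weight, horizontal lines for 95% CIs,
# and a reference line at OR = 1 on a log scale.
import numpy as np
import matplotlib.pyplot as plt

studies = ["Trial A", "Trial B", "Trial C", "Overall"]
or_est  = np.array([0.80, 0.95, 0.70, 0.82])
ci_low  = np.array([0.55, 0.70, 0.40, 0.68])
ci_high = np.array([1.15, 1.30, 1.20, 0.98])
weight  = np.array([30, 45, 15, 90])                   # arbitrary relative weights

y = np.arange(len(studies))[::-1]
plt.hlines(y, ci_low, ci_high)
plt.scatter(or_est, y, s=weight * 4, marker="s")
plt.axvline(1.0, linestyle="--")
plt.xscale("log")
plt.yticks(y, studies)
plt.xlabel("Odds ratio (log scale)")
plt.tight_layout()
plt.show()
```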

Which model to use for meta-analysis is a matter of debate, and none is exactly correct. The random effects model has an undesirable aspect in that small trials may dominate the final estimate. With the fixed effect model, larger trials get greater weight. Since the meta-analysis is conducted on available trials, however, the sample of participants included is not likely to be very representative of the general population to which the intervention may be applied. That is, the trials that are available do not contain a random sample of people from the targeted population, but rather participants who volunteered and who in other respects may not be representative. Thus, the estimate of the intervention effect is not as relevant as whether or not the intervention has an effect. We prefer a fixed effects model but suggest that both analyses be conducted to examine what, if any, differences exist.

Chalmers, a strong advocate of clinical trials, argued that participants should be randomized early in the evolution and evaluation of a new intervention [244]. As a result both of that kind of advocacy and of the fact that small trials typically precede large ones in the development of new interventions, an early meta-analysis is likely to consist of many small studies. Sometimes, meta-analyses of just small trials may yield significant results.

Thus, meta-analyses are seen by many as alternatives to the extraordinary effort and cost often required to conduct adequately powered individual trials. Rather than providing a solution, they perhaps ought to be viewed as a way of summarizing existing data, one that has strengths and weaknesses and must be critically evaluated. It would clearly be preferable to combine resources prospectively and collaborate in a single large study. Pooled results from distinct studies cannot replace individual, well-conducted multicenter trials.

Analysis for Harmful Effects

While analyzing the primary and secondary outcome variables for benefit is challenging, the analysis of adverse event data for safety is even more complex and challenging. Of course, if any of the primary or secondary outcome variables trend in the wrong direction, then there is evidence of harm, not benefit. However, harmful effects may manifest themselves in variables other than these primary or secondary outcomes. Some adverse event measures can be prespecified, such as changes in the QT interval on an ECG or an elevated liver function test (LFT). But there are many other possibilities.

The typical way that adverse event data are collected in current Phase III trials is a passive system where patient complaints or physician observations are summarized in text fields which are later coded by various adverse event coding systems (see Chap. 12). Such events are usually not solicited actively, so that if the patient does not complain or the physician does not record the event or problem, it does not get coded. In fact, if a patient describes an adverse event differently from one visit to the next, or the physician records it using different language, the event may be coded differently. It can be challenging even to track an adverse event from one visit to the next within a patient. Another problem with these types of coding systems is that a very large number of categories can be generated for the same essential problem, depending on how the patient complained or how the physician recorded the observations in the patient chart.

Thus, tables of adverse events using these systems can have very many rows with only a few events in each row, even for the same basic adverse problem. Such data are not likely to produce statistically significant comparisons or flag potential problems. The data are so granular that an adverse event signal cannot be seen easily. These coding systems can collapse the detailed categories into higher order terms, but in doing so they combine the adverse events that represent a real signal with what are typically many more events that are not serious or clinically important. That is, the noise drowns out the signal.

Thus, analysis of this type of data requires careful scrutiny of the numerous detailed categories to find those that seem to indicate a meaningful clinical issue, and these items may come from different higher level categories. This process can be very subjective, and it may be hard for another investigative team to reproduce the same categorization.

One alternative to this passive adverse event reporting is to specify in the protocol the special adverse events of interest, and to actively solicit the participants for information on their occurrence or conduct whatever laboratory measures are necessary to assess whether the event did occur. Examples of a deal breaker might be a QT interval increase or an increase in LFT measures. Any substantial, statistically significant or clinically important imbalance in these types of events would be sufficient to perhaps terminate a trial early or end further development of the intervention, whether drug, device, or biologic. There are probably more than 10 such “deal breakers” but fewer than 100, depending on the disease and intervention. Of course, other adverse event data may be collected in a patient chart as text and later retrieved as necessary using more recently developed natural language processing (NLP) algorithms. If imbalances are found in such a review, confirmation should be sought whenever possible using warehouse data from large electronic health record (EHR) systems.