1 Introduction

Human eyes are ill-equipped to discern patterns from statistical output presented in numeric or tabular forms. The spatial qualities of these numbers vanish once they are listed or tabulated on paper.

Table 12.1 Dosing chart: Tablet dosage by body weight

A denotes a dosing of approximately 1 mg/kg, B of approximately 2 mg/kg

Graphical techniques allow numbers to be displayed pictorially and easily convey a sizable amount of information at a glance. Exploratory data analysis relies heavily on statistical graphics to provide insight into one or more aspects of the underlying structure of the data and guidance into the appropriate statistical analysis.

In addition, thoughtfully designed statistical graphics provide a convincing means of communicating the essential messages hidden in the data in a clear, precise, and efficient manner.

The saying that “A picture is worth a thousand words” illustrates the fact that complex ideas can easily be conveyed with graphics. Furthermore, these thousand-word pictures rarely require a thousand words to explain.

Every picture tells a story. Time and again, it tells the story better than pages of tabulated numbers and descriptive text. The multidimensional spatial quality of pictures effortlessly transcends the inherent limitation of sequential numbers and words strung linearly together.

A series of real-world examples of statistical graphics in action, produced at Janssen Pharmaceutical Companies of Johnson & Johnson, is presented below.

2 Does the New Analgesic Work?

A new analgesic was studied in a phase II, acute pain, multiple-dose, bunionectomy study for up to 4 days. The study drug was allowed to be taken once every 4–6 h. Since placebo would not provide much pain relief, it was expected that subjects randomized to placebo would repeat dosing closer to 4 h on average. On the other hand, if the new analgesic was effective, the actively treated subjects would repeat dosing closer to 6 h on average.
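The expectation above can be checked numerically before any plotting. The following is a minimal sketch, assuming each subject's dosing times are available in hours since the first dose; the function name, the data layout, and the values are hypothetical illustrations, not study data.

```python
from statistics import mean


def mean_dosing_interval(dose_times_h):
    """Mean gap, in hours, between consecutive doses for one subject."""
    times = sorted(dose_times_h)
    gaps = [later - earlier for earlier, later in zip(times, times[1:])]
    return mean(gaps) if gaps else None  # None when fewer than two doses


# Hypothetical dosing times in hours since first dose (not study data).
dosing = {
    "placebo": {"P01": [0, 4, 8.5, 12.5], "P02": [0, 4.5, 9, 13]},
    "active": {"A01": [0, 6, 12, 18], "A02": [0, 5.5, 11.5, 17]},
}

# Average the subject-level mean intervals within each treatment group.
group_means = {
    group: mean(mean_dosing_interval(times) for times in subjects.values())
    for group, subjects in dosing.items()
}
```

A group mean close to 4 h would suggest frequent redosing, and a mean close to 6 h would suggest longer-lasting relief, which is the contrast the study design anticipated.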

In Fig. 12.1, subjects’ dosing histories were plotted over time, one line per subject, with the actual dosing times presented as pink open dots, separately for the placebo group and the treated group. To our surprise, the dosing patterns painted by the dots looked very similar between the two groups, without any apparent visual difference. The placebo subjects were not repeating treatment more frequently than the treated group. Most subjects completed the study without early termination.

Fig. 12.1 Subject-level study drug dosing information

What had happened was that, to minimize early termination, subjects in this study were allowed therapeutic doses of the rescue pain medication Tylenol when the pain was not sufficiently controlled by the blinded study treatment. This is a commonly adopted approach to minimize the missing pain relief and pain intensity scores that early study drug termination would inevitably cause. However, such non-missing scores collected with the aid of rescue pain medication no longer reflect the pure therapeutic effect of the study drug, but a combined effect with the rescue medication. In this study, the efficacy effect represented by the pain relief and pain intensity scores also became indistinguishable between the two groups because, with an adequate amount of Tylenol at hand, all subjects self-titrated to an adequate level of pain control, whose distribution was the same in the two groups.

Does the new analgesic work? Yes it does.

In Fig. 12.2, subjects’ rescue pain medication dosing history was superimposed, as blue solid dots, on top of the pink open dots. Clearly, the placebo subjects were taking more rescue pain medication than the actively treated group in order to reach the same level of pain relief. What’s more, for each group, the rescue intake was highest on the first day after surgery and decreased over time, as the acute bunionectomy pain was self-limiting. These two graphs provided straightforward views of dosing patterns that were very hard to convey using summary statistics. They also gave us a comfortable and convincing sense that the active drug worked, without looking at the usual “hard” statistics.

Fig. 12.2 Subject-level study drug (pink open dots) and rescue pain medication (blue solid dots) dosing information

This study was analyzed exactly as specified by the FDA guidelines on the analysis of analgesic studies at that time, and it failed because rescue medication use left no differences between the groups in pain relief and pain intensity scores. However, this graph provided reassuring evidence supporting analgesic efficacy as well as insight into supplemental rescue medication use. It was very appealing to the clinicians.

3 How Did Subjects Take Rescue Pain Medication?

The same analgesic as in the previous section was studied in a phase II, osteoarthritis study for up to 30 days. Subjects were also allowed rescue pain medication on an as-needed basis to reduce early termination of study drug.

During the analysis of this study, it became necessary to investigate whether subjects had indeed taken rescue pain medication as intended, for additional needed pain relief. To this end, each individual subject’s study drug and rescue medication dosing profiles were examined over time in relation to average daily pain relief. Most subjects did take rescue medication when the pain was insufficiently controlled by the study drug. However, a few surprises showed up, and they are illustrated in Fig. 12.3.

Fig. 12.3 Subject-level study drug (black dots) and rescue pain medication (red dots) dosing information for selected subjects

In this study, subjects were instructed to take the study drug two times a day, once in the morning between 7 and 8 am, and once in the evening between 7 and 8 pm. If pain was not sufficiently controlled by this regimen, subjects were allowed to take a designated rescue medication up to 4 times a day.

In Fig. 12.3, each individual subject’s dosing profile was plotted over time. The times of study drug administration were plotted as black dots. They were around 7 to 8 am and 7 to 8 pm for all subjects. The times of rescue pain medication were plotted as red dots. In addition, the average daily pain intensity scores were plotted as blue line segments. The height of the blue line indicates the pain intensity.

The last subject featured in Fig. 12.3, Subject 517, consistently took the rescue pain medication only on days when the pain intensity was highest, indicating a serious pain flare. This was the pattern observed with most of the subjects in the study.

But some unexpected dosing patterns were also observed.

Subject 501 took rescue pain medication 3 times a day regardless of daily pain scores. Subject 505 initially took rescue pain medication only when needed. However, after about 2 weeks of high pain scores despite the help of rescue pain medication, he decided to keep taking rescue medication frequently every day as a form of insurance, even after his pain was well controlled. Lastly, Subject 510 took a combined dose of study drug and rescue medication 2 times a day even though the pain was well controlled.

The insight gained from these graphs, which would normally have escaped our attention, helped the clinical trial team give better instructions on rescue medication intake during a similar phase III osteoarthritis study. Study sites were requested to clearly explain to the subjects the purpose of rescue pain medication.

4 Why Did Subjects Drop Out?

In an 8 h single-dose study, many subjects dropped out of the study during the 8 h observation period, which was a bit of a surprise.

There were various speculations as to why this was the case, some more plausible than others, but no definitive conclusions could be drawn. The mystery was solved when individual subjects’ efficacy scores, as measured by variable Z, were plotted over time stratified by the time of dropout, as presented in Fig. 12.4, the famous spaghetti plot.

Fig. 12.4 Individual profile of variable Z over time stratified by time of dropout

Because lower Z scores indicated better efficacy, the figure shows that for subjects who dropped out by hour 1, no treatment effect was observed. In fact, for the 2 subjects on the graph, their condition continued to worsen. For subjects who dropped out between hours 2 and 7, some efficacy effect was observed and then lost, just before dropout. For subjects that completed the 8 h study, sustained efficacy was obtained except for a few subjects who showed signs of losing it at the end.

In Fig. 12.5, the mean Z scores were plotted over time stratified by the time of dropout, and the same observations as above can be made. The figure clearly showed that the cause of subject dropout was efficacy-related.

Fig. 12.5 Mean profile of variable Z over time stratified by time of dropout
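The stratified mean profiles can be computed with a simple group-and-average step. The sketch below assumes per-subject records holding a dropout hour (or `None` for completers) and Z scores keyed by hour; the record layout, function names, and example values are hypothetical, and the three strata simply mirror the grouping described in the text.

```python
from collections import defaultdict
from statistics import mean


def dropout_stratum(dropout_hour):
    """Map a dropout time to a stratum; None means the subject completed."""
    if dropout_hour is None:
        return "completed"
    return "dropped by hour 1" if dropout_hour <= 1 else "dropped hours 2-7"


def mean_profiles(subjects):
    """Return {stratum: {hour: mean Z}} from per-subject records."""
    pooled = defaultdict(lambda: defaultdict(list))
    for record in subjects:
        stratum = dropout_stratum(record["dropout_hour"])
        for hour, z in record["z_by_hour"].items():
            pooled[stratum][hour].append(z)
    return {
        stratum: {hour: mean(values) for hour, values in sorted(hours.items())}
        for stratum, hours in pooled.items()
    }


# Hypothetical records (not study data); lower Z means better efficacy.
subjects = [
    {"dropout_hour": 1, "z_by_hour": {0: 6, 1: 7}},
    {"dropout_hour": 1, "z_by_hour": {0: 5, 1: 6}},
    {"dropout_hour": 4, "z_by_hour": {0: 6, 1: 4, 2: 3, 3: 5, 4: 6}},
    {"dropout_hour": None, "z_by_hour": {0: 6, 1: 4, 2: 3, 4: 2, 8: 2}},
]
profiles = mean_profiles(subjects)
```

Each inner dictionary is one line on a plot like Fig. 12.5: hours on the horizontal axis, mean Z on the vertical axis, one line per dropout stratum.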

5 What is the Appropriate Dose Range?

Many years ago, an effective oral analgesic in tablet form was studied in a phase II double-blind study in children 7–16 years of age, before the liquid formulation was available. Had the liquid formulation been available at that time, the intended dosage regimen would have been expressed in milligrams per kilogram (mg/kg). Based on safety and efficacy data already collected in adults, it was expected that the effective dose range for children would be 1–2 mg/kg.

In the study, the children were randomly assigned to the “A = approximately 1 mg/kg” group or the “B = approximately 2 mg/kg” group using the dosing table based on body weight (Table 12.1).

This was a morphine sparing study. Children were allowed to use a morphine pump as needed to supplement the pain relief. The primary efficacy variable was the amount of morphine used to achieve a comparable pain relief profile between the 2 randomized groups.

There was a statistically significant difference in the amount of morphine used between the “Approximately 1 mg/kg” group and the “Approximately 2 mg/kg” group.

Due to variation in weight in each group and because the mg/kg was approximated using different amounts of 25 mg tablets, when the actual weight adjusted dose was calculated for each child in each randomized group, we obtained a range instead of a single point of actual mg/kg doses taken for each group, as shown in Fig. 12.6. When the two groups were combined together, a larger mg/kg dose range spanning from 0.4 to 2.5 mg/kg was obtained.
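The weight-adjusted dose calculation described above is a one-line formula: number of tablets times tablet strength, divided by body weight. The sketch below illustrates how a single nominal group turns into a range of actual mg/kg doses. The 25 mg tablet strength comes from the text; the weights and tablet counts are hypothetical and are not taken from Table 12.1.

```python
TABLET_MG = 25  # tablet strength stated in the text


def actual_dose_mg_per_kg(n_tablets, weight_kg):
    """Actual weight-adjusted dose for one child."""
    return n_tablets * TABLET_MG / weight_kg


# Hypothetical children (not study data, not the Table 12.1 dosing chart).
children = [
    {"group": "A", "weight_kg": 30.0, "n_tablets": 1},
    {"group": "A", "weight_kg": 45.0, "n_tablets": 2},
    {"group": "B", "weight_kg": 30.0, "n_tablets": 2},
    {"group": "B", "weight_kg": 45.0, "n_tablets": 4},
]

# Collect the actual doses per randomized group, then summarize as (min, max).
doses_by_group = {}
for child in children:
    dose = actual_dose_mg_per_kg(child["n_tablets"], child["weight_kg"])
    doses_by_group.setdefault(child["group"], []).append(dose)

dose_range = {group: (min(d), max(d)) for group, d in doses_by_group.items()}
```

Because tablets come in whole units while weight varies continuously, each group's actual doses scatter around its nominal target, which is what makes a dose-response plot across the combined range possible.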

Fig. 12.6 Amount of morphine used vs. actual mg/kg dose

In Fig. 12.6, the primary efficacy variable was plotted against the actual mg/kg dose taken. A lowess smoother was added to illustrate the trend. Even though children were not randomized to these mg/kg doses, this graph strongly suggested that the 1–2 mg/kg dose range is efficacious in children, which was later confirmed by other studies.

6 Does Efficacy Depend on INR Control?

In a phase III study for the prevention of stroke and systemic embolism in patients with non-valvular atrial fibrillation, the primary hypothesis that the study drug is non-inferior to warfarin in the prevention of the composite endpoint of stroke and non-CNS systemic embolism was statistically demonstrated. However, the adequacy of the warfarin management in terms of INR (International Normalized Ratio) control was questioned. Similar contemporary studies had study-wise TTRs (percentage of time in therapeutic range) in the mid 60s, whereas this study had a study-wise warfarin TTR of 55%, which was considered much lower numerically.

Adjusted-dose anticoagulation with warfarin has been the most effective intervention to mitigate the risk of thromboembolic events in subjects with atrial fibrillation, a clinical condition that significantly increases the risk of stroke. The intensity of anticoagulation by warfarin is measured by the INR. Maintaining subjects in the narrow therapeutic INR range between 2 and 3 has been considered a critical aspect of warfarin dose management. The benefit of warfarin in preventing thromboembolic events and its potential harm of bleeding pull in opposite directions as INR changes: there is an increased risk of thromboembolic events if INR is less than 2 and an increased risk of bleeding if INR is larger than 3.

In a clinical study using warfarin as a comparator, the percentage of time a warfarin subject is maintained within the therapeutic INR range between 2 and 3 can be calculated as the subject-level TTR (%). The study-wise TTR (%) is calculated as the average of the warfarin subject-level TTRs. This study-wise TTR (%) is used as a measure of how well warfarin treatment is managed in the study. The higher the TTR, the better the warfarin dose management is considered to be, which logically should lead to better efficacy and safety of the warfarin treatment in the study.
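The two TTR definitions can be illustrated with a short sketch. The text does not say how INR is assumed to behave between measurements, so the code below makes a simplifying assumption: each INR value is carried forward until the next measurement. Interpolating between visits is another possible choice. The function names, data layout, and values are hypothetical.

```python
from statistics import mean

INR_LOW, INR_HIGH = 2.0, 3.0  # therapeutic range stated in the text


def subject_ttr(measurements, end_day):
    """Percent of follow-up time with INR within [2, 3] for one subject.

    measurements: list of (day, inr) pairs sorted by day.
    Assumption (not from the source): each INR value is carried forward
    until the next measurement, or until end_day for the last one.
    """
    boundaries = [day for day, _ in measurements[1:]] + [end_day]
    in_range = 0.0
    total = 0.0
    for (day, inr), next_day in zip(measurements, boundaries):
        span = next_day - day
        total += span
        if INR_LOW <= inr <= INR_HIGH:
            in_range += span
    return 100.0 * in_range / total if total > 0 else None


# Hypothetical warfarin subjects (not study data).
inr_records = {
    "S1": ([(0, 1.5), (10, 2.5), (30, 3.4)], 40),
    "S2": ([(0, 2.2), (30, 1.8)], 40),
}
subject_ttrs = {sid: subject_ttr(m, end) for sid, (m, end) in inr_records.items()}

# Study-wise TTR: the average of the warfarin subject-level TTRs.
study_ttr = mean(subject_ttrs.values())
```

The same averaging step applied within a study center, rather than across the whole study, gives a center-level TTR.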

The study-wise TTR of 55% was lower than what was achieved in other similar contemporary studies. In addition, most of the time that subjects spent outside the 2–3 INR range was in the INR < 2 region, where the rate of thromboembolic events in the warfarin arm would be higher. The question naturally arose: would the study drug still be non-inferior to warfarin efficacy-wise had the study-wise warfarin INR control been better?

This study enrolled subjects from many study centers across 4 continents. Just as study-wise TTR is considered as a measure of quality-of-care in the study, the center TTR, calculated as the average of warfarin subject-level TTRs in that center, is considered as a measure of quality-of-care in the center.

There was a large variation in warfarin subject-level TTRs, ranging from 0 to 100%. In Fig. 12.7, individual subject-level TTRs were plotted against the center TTRs of their centers as a scatter plot. In Fig. 12.8, box plots of subject-level TTRs were plotted for center TTR groups categorized in increments of 5%. Both graphs showed that warfarin subjects in centers with higher center TTRs also had higher individual subject-level TTRs.

Fig. 12.7 Subject TTR vs. average center TTR for warfarin subjects

Fig. 12.8 Box plot of subject TTR vs. categorized average center TTR

If we discard the centers with lower center TTRs from the study population, subjects with lower subject-level TTRs will also be deleted. The remaining population can be treated as a smaller double-blind, randomized study, and its “study-wise” TTR will be higher.

Therefore, to answer the question of what the efficacy of this study would have looked like had we had better study-wise warfarin TTR than the observed TTR of 55%, centers with smaller average warfarin TTRs were progressively dropped (using cutpoints ranging from 0 to 100% in 1% increments) out of the study population. This resulted in remaining subpopulations that can be considered as randomized studies on their own, having increasingly higher and higher average TTRs. As shown in Fig. 12.9, the efficacy results for these smaller “studies” remain stable and consistent with the whole-study results until their TTRs reach the mid 60s to 70%, at which point the estimation breaks down and becomes unreliable due to the much smaller sample sizes left.
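The progressive center-dropping procedure can be sketched as a loop over cutpoints. This is a minimal illustration, not the analysis used in the study. The effect estimate here is a crude event-proportion ratio standing in for whatever efficacy model was actually fitted, and the record layout, function names, and data are all hypothetical.

```python
from collections import defaultdict
from statistics import mean


def center_ttrs(warfarin_subjects):
    """Center TTR: average of warfarin subject-level TTRs within each center."""
    by_center = defaultdict(list)
    for s in warfarin_subjects:
        by_center[s["center"]].append(s["ttr"])
    return {center: mean(values) for center, values in by_center.items()}


def crude_risk_ratio(subjects):
    """Stand-in effect estimate: event proportion, study drug vs. warfarin."""
    def risk(arm):
        events = [s["event"] for s in subjects if s["arm"] == arm]
        return sum(events) / len(events) if events else None
    drug, warfarin = risk("drug"), risk("warfarin")
    return drug / warfarin if drug is not None and warfarin else None


def sliding_analysis(subjects, cutpoints=range(0, 101), estimate=crude_risk_ratio):
    """Re-estimate the effect after dropping centers with center TTR <= cutpoint."""
    ttr_by_center = center_ttrs([s for s in subjects if s["arm"] == "warfarin"])
    results = []
    for cut in cutpoints:
        kept_centers = {c for c, ttr in ttr_by_center.items() if ttr > cut}
        kept = [s for s in subjects if s["center"] in kept_centers]
        if not kept:
            continue  # no centers left above this cutpoint
        sub_ttr = mean(s["ttr"] for s in kept if s["arm"] == "warfarin")
        results.append((cut, sub_ttr, estimate(kept)))
    return results


# Hypothetical data: two centers, both arms kept together within each center.
subjects = [
    {"center": "C1", "arm": "warfarin", "ttr": 40.0, "event": 1},
    {"center": "C1", "arm": "warfarin", "ttr": 50.0, "event": 0},
    {"center": "C1", "arm": "drug", "ttr": None, "event": 1},
    {"center": "C1", "arm": "drug", "ttr": None, "event": 0},
    {"center": "C2", "arm": "warfarin", "ttr": 70.0, "event": 1},
    {"center": "C2", "arm": "warfarin", "ttr": 80.0, "event": 0},
    {"center": "C2", "arm": "drug", "ttr": None, "event": 0},
    {"center": "C2", "arm": "drug", "ttr": None, "event": 1},
]
results = sliding_analysis(subjects, cutpoints=[0, 45, 60, 75])
```

The key design point is that whole centers are kept or dropped, so both treatment arms within each retained center stay together and each subpopulation remains a randomized comparison.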

Fig. 12.9 Estimated treatment effect for sliding populations of combined centers with center average warfarin TTR > threshold on a log scale

In Fig. 12.9, the analysis results of these 100 substudies were plotted. The visual impact of the stability of the treatment effect as the substudy-wise TTR steadily increases is beyond expression.

This simple analysis is more appropriate than the usual regression analysis that explores the association between efficacy and TTR, because an association is not the same as causation, whereas the comparison within each subpopulation of centers is supportable by a randomization argument. In general, subjects who cannot be managed to higher TTRs are not comparable to those who can, even at the same center with the same quality-of-care: the former are generally much sicker and have more complications than the latter. Any differences we see in either efficacy or safety at the same TTR on a regression curve can very likely be due to the difference in subject populations rather than to the treatment group difference. The two contributing factors, population difference and treatment difference, are confounded and non-separable.

7 Would the Study Still Be Positive?

In a phase III study of a new antiplatelet therapy in acute coronary syndrome (ACS), the primary efficacy variable was the time to cardiovascular death, myocardial infarction, or stroke. The study was successful in statistically demonstrating the superiority of the study drug vs. placebo. The primary efficacy variable was analyzed using the Cox proportional hazards model under the non-informative censoring assumption. However, more than 10% of the subjects discontinued the study before the trial end date without experiencing any event, and the impact of this amount of missingness on the efficacy conclusion was questioned.

To evaluate the robustness of the primary efficacy results with respect to missing data caused by early withdrawals, a sensitivity analysis was carried out using model-based simulation from an exponential model built from the observed data. The individually fitted hazard rate in actively treated subjects was inflated by 0 to 100%, while the fitted hazard rate in placebo subjects remained as observed.

For all censored early withdrawal subjects, virtual events and durations were imputed through random sampling from the fitted exponential distribution and the study was re-analyzed. This process was repeated 1,000 times for each sensitivity scenario.
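The imputation and re-analysis loop can be sketched as follows. This is a simplified illustration under stated assumptions, not the study's procedure. It uses one exponential rate per arm rather than individually fitted rates. It estimates the hazard ratio as a ratio of exponential rates with a Wald limit on the log scale rather than fitting a Cox model. It also assumes that imputed follow-up stops at a `planned_end` time. All names and data are hypothetical.

```python
import math
import random

ARMS = ("active", "placebo")


def arm_rate(subjects, arm):
    """Exponential hazard estimate for one arm: events per unit of follow-up."""
    events = sum(s["event"] for s in subjects if s["arm"] == arm)
    exposure = sum(s["time"] for s in subjects if s["arm"] == arm)
    return events / exposure


def impute_once(subjects, rates, inflation, rng):
    """Return a copy of the data with outcomes imputed for early withdrawals."""
    imputed = []
    for s in subjects:
        s = dict(s)
        if s["withdrawn"]:
            rate = rates[s["arm"]]
            if s["arm"] == "active":
                rate *= 1.0 + inflation  # placebo rate stays as observed
            t_event = s["time"] + rng.expovariate(rate)
            if t_event <= s["planned_end"]:
                s["time"], s["event"] = t_event, 1  # virtual event
            else:
                s["time"] = s["planned_end"]  # event-free to planned end
        imputed.append(s)
    return imputed


def hr_and_upper_limit(subjects):
    """Rate-ratio estimate of the HR and its 95% upper confidence limit."""
    events = {a: sum(s["event"] for s in subjects if s["arm"] == a) for a in ARMS}
    hr = arm_rate(subjects, "active") / arm_rate(subjects, "placebo")
    se_log_hr = math.sqrt(1 / events["active"] + 1 / events["placebo"])
    return hr, hr * math.exp(1.96 * se_log_hr)


def failure_fraction(subjects, inflation, n_sim=1000, seed=0):
    """Share of simulated trials whose 95% upper limit is at or above 1."""
    rng = random.Random(seed)
    rates = {a: arm_rate(subjects, a) for a in ARMS}
    uppers = [hr_and_upper_limit(impute_once(subjects, rates, inflation, rng))[1]
              for _ in range(n_sim)]
    return sum(u >= 1.0 for u in uppers) / n_sim


def make_arm(arm, n_event, n_completed, n_withdrawn):
    """Hypothetical arm: events at day 50, withdrawals at day 100, end at day 365."""
    def row(time, event, withdrawn):
        return {"arm": arm, "time": time, "event": event,
                "withdrawn": withdrawn, "planned_end": 365.0}
    return ([row(50.0, 1, False) for _ in range(n_event)]
            + [row(365.0, 0, False) for _ in range(n_completed)]
            + [row(100.0, 0, True) for _ in range(n_withdrawn)])


subjects = make_arm("active", 30, 150, 20) + make_arm("placebo", 60, 120, 20)
```

Calling `failure_fraction(subjects, inflation)` for a grid of inflation values between 0 and 1 gives, for each sensitivity scenario, the proportion of simulated trials that would no longer support a superiority claim.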

The distributions of simulated hazard ratios, as brown dots, and 95% upper confidence limits, as blue dots, were plotted in Fig. 12.10.

Fig. 12.10 Distributions of simulated HR (in brown) and 95% upper confidence limit (in blue)

Each vertical bar comprised 1,000 tiny dots representing 1,000 hazard ratios or 95% upper confidence limits. These dots were so numerous that they were stacked up on top of each other leaving almost no space among them except at the two extreme ends where points get sparse, hence the appearance of solid bars.

The dots in the bars were binned, and the percentage of points in each bin was plotted proportionally to the right of the bars to show the shapes of the distributions, which closely resemble normal distributions.

The key message from this graph comes from the percentage of simulated trials with 95% upper confidence limits equal to or larger than 1. When this happens, the simulated trial would have failed to achieve a superiority claim.

This percentage remains at zero up to 30% inflation. It reaches 1% at 50% inflation and becomes noticeable at 70% inflation. When the assumed hazard rates in the actively treated group are twice as bad as those observed, it becomes 35%. Even then, the majority of the simulated trials, 65%, would still have had superior efficacy results in favor of the study drug.

This analysis supports the robustness of our primary efficacy analysis with respect to missing data caused by early withdrawal.

8 Concluding Remarks

Simple graphs such as histograms and scatter plots have been routinely generated and have proven useful. Custom-designed graphs for a special problem at hand require more effort. Some graphs (such as Fig. 12.9) pack so much information into one plot that they were considered mind-boggling at first. However, once properly understood, they can be much more effective than tables of numbers at getting across messages that would otherwise have gone unnoticed or been distorted.

As human consciousness evolves upwards and the logical linear left brain is increasingly integrated with the intuitive spatial right brain, the future tools of communication will be more pictorial than linguistic. Until the day comes when instantaneous thought transfer between human minds is possible, creative graphical statistical representation of data will be an enormously useful aid to our information sharing and decision-making processes.