Background

Apart from the more recent success of new-generation stent retriever endovascular interventions [1–4], attempts to identify neuro- or vasculo-protective strategies that improve upon thrombolysis with intravenous rt-PA (alteplase) have mostly failed, with considerable resources expended on negative trials [5, 6]. Early phase trials of most of the ultimately unsuccessful agents or approaches appeared promising [7, 8]. Our prior work identified the contribution, in smaller early phase trials, of baseline imbalances, unrepresentative control populations, and non-random noise distributions that invalidate the use of statistical correction to adjust for these factors [6]. The result is false-positive identification of treatments destined to fail. We also proposed the opposite scenario, in which imbalances that favored better outcomes in the placebo arm may have generated false-negative results [9], missing an opportunity to identify a potentially beneficial agent.

Baseline stroke severity and age have been associated with a large proportion of the variance related to group outcomes following stroke [9, 10]. We selected these factors to develop predictive outcome models from the pooled placebo arms of randomized clinical trials (RCTs) in stroke [9]. The result is a pseudo-control model of patients treated without tPA (pPREDICTS; pooled Placebo REsponse Dictates TReatment Success) against which treatments can be compared at their own baseline NIHSS and age. Using this method, early phase trials, where imbalances between treatment and control arms are common [6], and clinical case series without a control arm can compare outcomes at the study’s own baseline conditions without statistical manipulation. In addition, we developed the novel feature of generating surfaces around the function that depict the probability that the result of any individual trial is different from the control function. Unlike other statistical methods that stratify or adjust for imbalances [11, 12], no such adjustment is needed here. Our methods have been successful in identifying those early phase trials and case series testing a new therapy compared to placebo that went on to be negative in phase 3 [13–15], including identifying the lack of net benefit of a heterogeneous group of endovascular interventions, a finding later confirmed in RCTs [16, 17].

In this paper, we present an analogous model from the rt-PA arms of RCTs (pPREDICTS-tPA). Successful generation of such a model would provide a means to compare treatments intended to improve upon rt-PA, either as add-on or alternative therapies, as well as non-randomized case series. To determine whether we could confirm the known benefit of intravenous thrombolysis (IVT), the models were tested against models from control arms that received no IVT and from control arms in which varying percentages of subjects were treated with IVT.

Methods

Literature search to identify RCTs where all subjects in an arm received IVT

The Medline database was searched (PM, AKS, SDS) for the words ‘acute’, ‘ischemic’, ‘stroke’, ‘alteplase’, ‘rt-PA’, ‘rtPA’, and ‘t-PA’, and the retrieved articles were assessed against the following selection criteria: (1) Randomized controlled trial. (2) Published in English. (3) Human clinical trial. (4) At least two arms in the trial, one of them requiring intravenous rt-PA. (5) At least ten subjects. (6) Treatment window up to 6 h. (7) Follow-up of 3 months. (8) Baseline NIHSS expressed as median (or subsequent contact with authors provided this information).

Development of the pPREDICTS-tPA and updated pPREDICTS models

Details of the methods used to generate the pPREDICTS model have been previously published [9] and are given in the Electronic supplementary material. Briefly, models were developed from 90-day outcomes of control arms of RCTs in a three-step procedure: (1) stabilizing the variance and linearizing a non-linear function by transforming the proportions with an arc-sine square root function (Supplement S1), (2) fitting a function to the transformed proportions (Supplement S2), and (3) eliminating outliers (Supplement S3). Multi-dimensional prediction interval surfaces were generated [18].
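The first two steps can be illustrated with a minimal sketch. It assumes, purely for illustration, that the fitted function is a plane in baseline NIHSS and age; the actual functional form and the outlier-elimination criterion are described in the supplements and are not reproduced here. The names `fit_plane` and `predict` are hypothetical and do not come from the published method.

```python
import math

def arcsine_sqrt(p):
    """Variance-stabilizing transform of a proportion p in [0, 1]."""
    return math.asin(math.sqrt(p))

def inverse_arcsine_sqrt(z):
    """Back-transform a value z in [0, pi/2] to a proportion."""
    return math.sin(z) ** 2

def solve3(a, b):
    """Solve a 3x3 linear system by Gaussian elimination with partial pivoting."""
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, 3):
            factor = m[r][col] / m[col][col]
            for c in range(col, 4):
                m[r][c] -= factor * m[col][c]
    x = [0.0, 0.0, 0.0]
    for r in range(2, -1, -1):
        s = m[r][3] - sum(m[r][c] * x[c] for c in range(r + 1, 3))
        x[r] = s / m[r][r]
    return x

def fit_plane(arms):
    """Least-squares fit of z = b0 + b1*NIHSS + b2*age (assumed form).

    arms: list of (median_nihss, mean_age, outcome_proportion) tuples,
    one per trial arm. Returns [b0, b1, b2] in the transformed scale.
    """
    rows = [(1.0, nihss, age) for nihss, age, _ in arms]
    z = [arcsine_sqrt(p) for _, _, p in arms]
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    xtz = [sum(r[i] * zi for r, zi in zip(rows, z)) for i in range(3)]
    return solve3(xtx, xtz)

def predict(coef, nihss, age):
    """Predicted outcome proportion at a study's own baseline NIHSS and age."""
    z = coef[0] + coef[1] * nihss + coef[2] * age
    z = min(max(z, 0.0), math.pi / 2)  # keep within the transform's range
    return inverse_arcsine_sqrt(z)
```

With a model fitted this way, `predict(coef, nihss, age)` returns the expected pooled-arm outcome at a given study's baseline, which is the quantity a new trial arm or case series would be compared against.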

pPREDICTS-tPA models for 90-day outcomes of mortality and mRS 0–2 were based on RCT arms where all subjects in the treated arm were randomized to receive IV rt-PA. We focused on mRS 0–2 and mortality since these were the most common outcomes employed and our prior work indicated mRS 0–2 as a reliable outcome for early phase stroke trials [19]. A second set of models was created using RCTs with control arms that permitted but did not require treatment of subjects with rt-PA, simulating best medicine, and a third set of models was created using RCTs with control arms in which no subjects were treated with rt-PA. All three sets of models were based on RCTs with treatment windows up to 6 h.

Comparison of models with IVT

pPREDICTS-tPA models were tested against pPREDICTS models built from arms with no IV rt-PA treatment or with partial IV rt-PA treatment to test whether the models could demonstrate the known benefit of IVT [20]. Models were compared against each other using the F statistic [21].
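One common way to compare two fitted models with an F statistic is an extra-sum-of-squares (Chow-type) test: the data from both sets of arms are fitted once with a single pooled function and once with separate functions, and the reduction in residual sum of squares is tested. The sketch below assumes this form; the text states only that an F statistic was used [21], so the exact procedure may differ. The function name and the numbers in the usage lines are hypothetical.

```python
def extra_sum_of_squares_f(rss_pooled, rss_separate, n_params, n_obs):
    """F statistic for 'one pooled fit' versus 'two separate fits'.

    rss_pooled:   residual sum of squares of a single fit to all arms
    rss_separate: sum of the residual sums of squares of the two separate fits
    n_params:     number of parameters in each fitted function
    n_obs:        total number of arms across both data sets
    Returns (F, numerator df, denominator df).
    """
    df_num = n_params
    df_den = n_obs - 2 * n_params
    if df_den <= 0:
        raise ValueError("not enough arms for two separate fits")
    if rss_separate <= 0:
        raise ValueError("separate-fit residual sum of squares must be positive")
    f_stat = ((rss_pooled - rss_separate) / df_num) / (rss_separate / df_den)
    return f_stat, df_num, df_den

# Hypothetical numbers for illustration only (not taken from the study).
f_stat, df_num, df_den = extra_sum_of_squares_f(
    rss_pooled=2.0, rss_separate=1.0, n_params=3, n_obs=46
)
print(f_stat, df_num, df_den)
```

The p-value is then the upper tail of the F distribution with the returned degrees of freedom, which can be obtained from any standard statistics package.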

Results

Generation of pPREDICTS-tPA models

pPREDICTS-tPA models were generated from 24 RCT rt-PA-alone treated arms with 3195 subjects. Of the 24 RCTs, there were 13 arms from RCTs that treated all subjects within 3 h [20, 22–33], five treated patients up to 4.5 h [2, 4, 34–36], one treated patients in the 3–4.5-h window [37], one in the 3–5-h window [38], one between 3 and 6 h [39], and three between 0 and 6 h [40–42]. Two trials (ATLANTIS-A [40] and CLASS-T [28]) did not provide mRS 0–2 data but did present mortality data. Four of these trials [2, 36, 39, 42] identified subjects using image guidance, but a separate analysis did not indicate any difference in outcomes and these trials were pooled with the others for subsequent analysis.

An overall functional outcome model for mRS 0–2 was developed from 22 RCTs. The model for mRS 0–2 along with ±95% prediction interval surfaces is shown in Fig. 1a (R² = 0.83; p < 0.001). During the generation of the model, one study with 49 subjects (of 3096; 1.6%) was eliminated. A similar model for mRS 0–1 with ±95% prediction interval surfaces was successfully generated (R² = 0.64; p < 0.001; figure not shown).

Fig. 1 a pPREDICTS-tPA mRS 0–2 model (middle surface) developed from 22 control arms (R² = 0.83; p < 0.001). Surfaces on either side of the fit function represent ±95% prediction intervals. b pPREDICTS-tPA mortality model (middle surface) developed from 21 control arms (R² = 0.54; p = 0.001). Surfaces on either side of the fit function represent ±95% prediction intervals

During the generation of the mortality model, four studies with 166 subjects (5.2%) were eliminated in the outlier elimination step. The mortality model along with ±95% prediction intervals is shown in Fig. 1b (R² = 0.54; p = 0.001).

Generation of updated pPREDICTS models

Models for mortality and mRS 0–2 were generated from control arms of 32 RCTs (7820 subjects) where some of the subjects were treated with IV rt-PA [13, 14, 20, 37–41, 43–65]. Seven of the 32 RCTs had treated subjects with rt-PA. The percentage of subjects treated with rt-PA ranged from a low of 9% [51] to a high of 75% [56]. Mortality and mRS 0–2 models were also generated from control arms of 25 RCTs (n = 4056) where none of the subjects were treated with IV rt-PA.

Comparison of pPREDICTS-tPA and pPREDICTS

The comparison of the models of functional outcome (mRS 0–2) is shown in Fig. 2a (pPREDICTS-tPA, red mesh; pPREDICTS-No tPA, blue mesh; inset shows pPREDICTS, magenta mesh). The models differed significantly (pPREDICTS-tPA vs. pPREDICTS-No tPA, p = 0.01; pPREDICTS-tPA vs. pPREDICTS-partial tPA, p = 0.02). Comparison of mortality models between pPREDICTS-tPA (Fig. 2b, red mesh) and pPREDICTS (magenta mesh, with some subjects treated with IV rt-PA; blue mesh, no IV rt-PA) showed no significant differences (p >> 0.05). Although the mortality models were not significantly different, visual inspection indicated some interesting trends. With respect to mRS 0–2 outcome, visual inspection of the surfaces suggests roughly parallel relationships between the non-tPA and the tPA surfaces. The partial tPA surface lies mostly between the others. With respect to mortality, the pPREDICTS-tPA surface appears to be lower throughout the range of NIHSS, but especially at higher NIHSS and lower age.

Fig. 2 a Composite of pPREDICTS-tPA and pPREDICTS mRS 0–2 models. Significance values for the different comparisons: pPREDICTS-tPA vs. pPREDICTS-partial tPA: p = 0.02. pPREDICTS-tPA vs. pPREDICTS-No tPA: p = 0.01. b Composite of pPREDICTS-tPA and pPREDICTS mortality models. Significance values for the different comparisons: pPREDICTS-tPA vs. pPREDICTS-partial tPA: p = 0.51. pPREDICTS-tPA vs. pPREDICTS-No tPA: p = 0.37

Discussion

This report demonstrates the generation of IVT outcome models based on baseline NIHSS and age, both of which were significantly associated with outcome. Based on goodness of fit, these factors were more strongly associated with functional outcome than with mortality, a finding that is consistent with individual clinical trials that do not consistently demonstrate reduced mortality with IVT [20]. We confirmed that arms in which all patients were treated with IVT had improved functional outcome compared to those with no or partial use of IVT. We did not have enough of the latter studies to test whether different percentages of subjects treated with IVT influenced outcomes except as a group. Similarly, the small number of studies in the later time windows prevented a direct test of the influence of time to treatment on outcomes. We are working on alternative methods to address these issues.

We proposed this modeling approach to identify agents in which baseline imbalances may yield false-positive or false-negative results, particularly if the control arms are not representative of a broader population [6]. Statistical corrections are frequently applied to early trials since an imbalanced distribution of baseline factors and errors in assessment can affect outcomes separately from treatment effects [19], and non-random variation is more likely to occur in smaller trials [66]. However, we do not believe that application of statistical correction is justified in smaller trials with complex relationships between factors and outcome such as we have demonstrated in stroke [6]. Our prior work showed that issues such as improper use of statistical adjustments as well as a non-representative control arm (with worse outcomes than expected in a large population at multiple sites) were seen in 89% of a sample of early phase stroke trials that led up to ultimately negative pivotal trials [6].

As imbalances diminish in larger trials, random effects would tend to even out and a more valid result would likely emerge. In applying pPREDICTS to early phase RCTs, we suggest that the most valid early results would come from studies in which the control arm behaves similarly to the pPREDICTS pooled model. We also propose using these models to compare case series against a pooled model, which can provide some early insight into whether improved outcomes might portend a positive RCT [67], or to look at balanced subgroups [68]. In versions without IVT, this model has successfully predicted retrospectively both the positive NINDS rt-PA trial and all negative phase 3 trials such as AbESTT and SAINT II [9]. Other updates have correctly identified, among others, the negative citicoline trial [6] and the trial of DP-b99, a zinc chelator [69], all of which had either non-representative control arms or major imbalances. Given this track record, this approach could perhaps aid decisions about which agents to pursue based on their early results [6].