1 Introduction

Immuno-oncology (IO), which embodies the confluence of tumor immunology and medical oncology, is a contemporary approach to cancer treatment using an old idea [1]. Immunotherapy (IT), the attempt to enlist the immune system to fight cancer, dates back at least 120 years, but until very recently, there had been little impact on clinical practice. IT comprises a variety of treatments that have as their primary mechanism of action the generation of an immune response against cancer [2,3,4]. Such treatments include cytokines (e.g., interferons and interleukins), checkpoint inhibitors (CPIs) and other types of antibodies with immunological targets, genetically engineered T-cell therapies and other cell-based products, small molecules, oncolytic viruses, and different types of vaccines [2,3,4,5]. Some of these agents have led to unprecedented responses in clinical settings marked by resistance to conventional treatments, and improvements in overall survival (OS) have frequently been observed in phase 3 trials of CPIs [2]. However, the dynamics of tumor responses, disease progression, and long-term gains of several IT agents, particularly CPIs, have called into question some of the conventional methods of assessing treatment benefit in oncology [2, 6,7,8,9,10,11]. In this article, we provide an overview of the conventional and novel statistical methods for assessing treatment benefit in IO clinical trials. Our focus is on advanced disease, the setting in which most of the current evidence on IT has been generated. We begin by considering the direct effect of IT on tumors, discuss the translation of these effects into patient benefit, and end by exploring measures of treatment effect that may capture such benefit.

2 Tumor Responses as the First Step Toward Benefit

2.1 Mechanistic Considerations

As a general rule, a measurable antitumor effect is a sine qua non of effective treatments in oncology. Unlike chemotherapy and targeted therapy, IT works indirectly, through the generation of an immune response against tumors. The effect of IT comprises a continuum of biological phenomena that involve both innate and adaptive immune mechanisms, as well as cellular and humoral immune responses [2, 12]. Of importance, tumor infiltration by cytotoxic T lymphocytes and other effector immune cells is one of the prerequisites for the antitumor activity of IT [13,14,15]. This activity is in turn countered by several immune suppression mechanisms acting in the tumor microenvironment [12, 13]. The dynamic interactions between the immune system and the tumor, and the varying nature of such interactions over time, can be described using the concept of immunoediting [15]. According to this view, there are three states of interaction between the immune system and the tumor: elimination, equilibrium, and escape. Interestingly, a parallel has been suggested between these three states and the clinical observations of response, stable disease and disease progression after IT [2]. Thus, the knowledge about the mechanisms involved in response and resistance to IT can be used to some extent to explain the dynamics of tumor shrinkage and growth during treatment.

2.2 Patterns of Response

The experience to date suggests that tumor responses in most patients follow the usual pattern seen with other treatment modalities, with objective response rates to CPIs that vary between close to zero and slightly over 60% according to disease setting [16]. However, unusual response patterns have emerged with the use of IT, and the mechanistic aspects mentioned above may underlie some of these patterns. For example, the use of CPIs is associated with a response profile that is not adequately captured by conventional response assessment criteria, such as the Response Evaluation Criteria in Solid Tumors (RECIST) [2, 7, 17]. The three unusual patterns of response described to date are mixed responses, pseudoprogression, and hyperprogression [2, 7]. The heterogeneity of metastatic cancers, which also characterizes their immunological landscape, probably underlies cases in which a mixed clinical picture emerges after treatment, with some lesions shrinking while others remain stable or grow [2, 7, 18]. The frequency and prognostic significance of such mixed responses are still unclear. On the other hand, in 2% to 9% of patients treated with a CPI, an initial tumor growth is followed by bona fide responses, a phenomenon now termed pseudoprogression [7, 17, 19]. In some of these cases, lymphocytic infiltration of tumors is probably responsible for the initial increase in volume of a lesion destined to shrink, but a delayed action of IT is also postulated as an underlying mechanism in some cases [7, 10, 19]. In advanced melanoma, a greater increase in CD8+ cells in serial tumor samples during therapy correlated with a greater tumor size decrease on imaging [20]. The role of these postulated mechanisms is corroborated by several studies showing that pseudoprogression is associated with favorable outcomes when compared with RECIST-defined progressions, especially in advanced melanoma [19, 21, 22].
Both mixed responses and the suspicion of pseudoprogression represent great challenges to patients and physicians, as a decision needs to be made about treatment continuation. This is not the case with the more recently described phenomenon of hyperprogression, whereby some patients display very early signs of unquestionable disease progression after treatment with IT [7, 23]. Although definitions have varied among studies, hyperprogression has been associated with unfavorable outcomes [7, 23,24,25]. Moreover, it has been postulated that hyperprogression may underlie the early detriment from the use of CPIs in some phase 3 trials [7]. Despite this hypothesis, a putative immunological mechanism for hyperprogression remains to be elucidated, and controversy still exists on whether the phenomenon is particular to IT or reflects the natural history of some tumors.

2.3 Response Criteria

Under RECIST [26] and its predecessor guideline, proposed by the World Health Organization (WHO) in 1979 [27], tumor growth beyond a certain magnitude or the appearance of a new lesion indicated progressive disease (PD), synonymous with treatment failure in the chemotherapy era. Early in the development of CPIs, the unusual response patterns raised concerns about the adequacy of previous guidelines in this setting [10, 19]. Of particular concern was pseudoprogression, given the observation of prolonged periods of stable disease (SD) or even complete response (CR) or partial response (PR) after an initial increase in tumor burden. These concerns led to an international collaboration of experts and the publication, in 2009, of a new set of guidelines for use with immunotherapy, which were developed based on radiographic images from patients with advanced melanoma treated with ipilimumab [19] and were later applied to another CPI used in this setting, pembrolizumab [22]. These immune-related response criteria (irRC), based on the WHO method of bidimensional measurement, introduced the concept of “total tumor burden” and the need to confirm PD. Since 2009, three additional sets of response criteria have been published [21, 28, 29]. The so-called immune-related RECIST combine some features of irRC (total tumor burden and the need to confirm PD) and of RECIST, the latter because only unidimensional measurements are used [28]. The RECIST group developed immune RECIST (iRECIST), which differs from previous guidelines in that (1) PD that is not confirmed leads to “resetting of the bar” for the assessment of progression, and (2) new lesions are not incorporated into the total tumor burden, but rather lead to a new set of lesions assessed in parallel to the original ones [29].
The more recently published immune-modified RECIST (imRECIST) have been developed on the basis of imaging studies from patients with non-small-cell lung cancer and urothelial carcinoma treated with atezolizumab, yet another CPI, and are generally similar to iRECIST [21]. The application of these criteria is costly and time-consuming, especially in view of the fact that they increase the final overall response rate by 1% to 2% in many cases, with an additional 10% of patients overall who would have RECIST 1.1-defined PD being characterized as having SD [19, 21, 30, 31]. On the other hand, some retrospective studies have shown higher percentages of patients moving from RECIST 1.1-defined PD to SD or an objective response when treated beyond progression, especially in advanced melanoma and renal-cell carcinoma [32,33,34]. One important criticism of some of these results is the fact that IT-related response criteria have been developed in the context of clinical trials in which physicians could make a decision to continue IT in patients with an apparent clinical benefit despite evidence of progression. This subjective decision may have introduced bias due to the separation of patients with more aggressive disease from those with more indolent disease [11, 30].

Given the limitations of imaging assessment in IO, an interesting research avenue involves pathological assessment of responses in an attempt to correlate biopsy findings with those from radiographic assessment. In recent studies, pathological findings have shown promise as predictors of objective response, as well as of long-term benefit from IT, both in the neoadjuvant [35] and in the advanced setting [36].

3 Translation of Responses into Prolonged Benefit

3.1 Response Duration and Its Assessment

Although objective responses are a desirable first step toward deriving favorable results from treatment, and in some cases the means to obtain improvements in symptoms, ultimately there is an expectation that responses will be durable and will bring long-term benefit. In fact, there is general empirical evidence for that, since anticancer agents receiving accelerated approval based on tumor responses often have their benefit confirmed later on [37]. On the other hand, prolonged disease stabilization can also be seen as an important benefit from IT [2, 3]. Moreover, long-term survivors may have had SD or even PD as their best responses to IT [38, 39], and in some patients responses have improved over time even without subsequent treatment, especially among those with melanoma [39, 40].

Prolonged responses appear to be more specific to IT than to other treatment types. Early in the development of cancer vaccines and CPIs, it became apparent that these agents were associated with responses lasting several weeks or months in settings for which this was not typically the case with chemotherapy [41]. Likewise, prolonged responses often occur when chimeric antigen receptor T cells are used in hematological malignancies [42]. Some IT agents have received a first approval based on responses in early-phase trials, and regulators have expressed interest in expanding our understanding of response-based metrics and their association with clinical benefit [43]. Talimogene laherparepvec, an oncolytic viral therapy, was approved after a phase 3 trial demonstrated improvements in durable response rate, defined as the percentage of patients with CR or PR maintained continuously for at least 6 months [44].

Given the above considerations, the assessment of response duration is a laudable goal toward better understanding the benefit from IT. Such an assessment is straightforward when made descriptively, but problematic when there is a comparative intent. The comparison of treatments in terms of response duration is likely biased because only responding patients are considered, with the groups under comparison being defined by a post-randomization feature. Interestingly, the treatment producing more responses will usually have responding patients of worse prognosis, and the bias may in fact go against the superior treatment [45]. Although modeling approaches have been proposed to avoid this analysis-by-responder bias [46], simpler procedures proposed recently may lead to increased use of analyses of response duration, at least in an exploratory fashion [47, 48].

The first of these procedures, due to Korn and colleagues, consists in generating more comparable patient subsets by removing responding patients with the least tumor shrinkage in the treatment group with more responders or by adding non-responding patients with the most tumor shrinkage to the group with fewer responders, in both cases maintaining similar proportions of responders in both groups [48]. Huang and colleagues have proposed a method that takes advantage of the additive properties of restricted mean survival times (RMSTs), which are discussed in more detail below [47]. The proposed method consists in ascribing a response duration to each patient in a trial, thus avoiding the exclusion of non-responding patients from analysis. The method entails the construction of Kaplan–Meier curves (for each arm separately) for a composite endpoint defined as the time elapsed between treatment initiation and response, progression, or death, whichever comes first. Kaplan–Meier curves for each arm are also constructed for progression-free survival (PFS) in the usual manner. The RMST for this composite endpoint is then computed for each arm and subtracted from the RMST for the corresponding PFS curve, yielding the restricted mean duration of response for each treatment arm. As a result of this procedure, non-responding patients will have a response duration of zero, because the same event (of progression or death) will be used for these patients to indicate the occurrence of the composite endpoint and of PFS.
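The subtraction underlying the method of Huang and colleagues can be written compactly. In the following sketch, the symbols are our own notation rather than that of the original publication: a indexes the treatment arm, τ denotes the truncation time, S with superscript PFS is the Kaplan–Meier curve for PFS, and S with superscript comp is the Kaplan–Meier curve for the composite endpoint of response, progression, or death.

$$ {\text{RMDoR}}_{a} \left( \tau \right) = {\text{RMST}}_{a}^{\text{PFS}} \left( \tau \right) - {\text{RMST}}_{a}^{\text{comp}} \left( \tau \right) = \int_{0}^{\tau } \left[ {S_{a}^{\text{PFS}} \left( t \right) - S_{a}^{\text{comp}} \left( t \right)} \right]dt $$

Because the composite event can never occur after the PFS event, the integrand is non-negative, and a patient whose composite event and PFS event coincide contributes nothing to the area between the two curves.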

3.2 Quantifying the Association Between Responses and Long-Term Benefit

With chemotherapy and targeted therapy, there is often a strong association between objective responses and PFS and between the treatment effect on these endpoints [49, 50]. On the other hand, the association between responses and OS has been more modest [49,50,51]. Several authors have attempted to quantify the association between response to IT and long-term endpoints [16, 52, 53]. Unfortunately, none of these studies on IT used individual-patient data; nevertheless, a weak association was generally found between objective response rates and both PFS and OS, as well as between the treatment effects on response rates and these long-term endpoints. As a possible exception, a modest association was found between the treatment effects on response rate and on PFS in one study (R² = 0.47; 95% confidence interval 0.03–0.77) [53]. To our knowledge, no similar evidence has yet been generated for the duration of response as a potential surrogate for PFS or OS. The reason for the weak association between responses and PFS with IT is not clear. In addition to the limitations of analyzing aggregated data, these results may reflect the play of chance, real biological phenomena, and issues related to the assessment of responses and PFS. For example, the application of immune-related criteria to the assessment of PFS has only partially been explored [21], and phase 3 trials of CPIs have based such an assessment on RECIST 1.1 methods. Whether different associations could exist using response and PFS definitions of the immune-related criteria is a matter of speculation. The assessment of PFS is addressed explicitly by iRECIST and imRECIST, both of which specify the need to confirm progressions [21, 29]. Thus, further work is needed to assess the relationship between rates and duration of responses and long-term outcomes in IO.

4 Assessing the Ultimate Benefit for Patients

4.1 The Revival of Overall Survival as Primary Endpoint

The prospect of curing metastatic cancer has never been better for patients [2], and unprecedented 5-year survival rates have been reported in some settings [38, 39, 54]. Prolongation of OS is a realistic goal in IO. In fact, OS is currently the preferred endpoint for phase 3 trials in IO, a lesson learned during the development of ipilimumab and confirmed in later trials [55, 56]. This is in contrast to chemotherapy and targeted therapy, settings in which a decade-long debate prevailed between proponents of OS and proponents of PFS as the primary endpoint in phase 3 trials. Given the limitations of OS, PFS eventually became the preferred primary endpoint in several settings in the era of chemotherapy and targeted therapy [57]. With these modalities, the effects of treatment coincide with its administration; however, IT behaves differently in that regard, given its putative delayed effects. Confirming the initial impression about a discordance between PFS and OS in IO [58], several phase 3 trials have shown gains in OS without an accompanying significant gain in PFS [59,60,61,62,63,64], an infrequent observation in the previous era. An initial increase in tumor volume from immune infiltration, delayed antitumor activity, or a sustained antitumor effect beyond progression have been postulated to explain that discordance [64]. Moreover, several meta-analyses based on published data have shown weak associations between PFS and OS in IO [16, 52, 53, 65]. Thus, in the remainder of this article we restrict the discussion to the assessment of OS in IO. The reader should note that we leave aside considerations related to predictive biomarkers, even though they have implications for some of the design and analysis issues discussed. Moreover, the reader should consider that the development of IO will probably lead to an increasing frequency of crossover to IT after disease progression, recapitulating the challenges observed with chemotherapy and targeted therapy.

4.2 Non-proportional Hazards of Survival

An early observation from comparative trials of CPIs has been the unusual behavior of Kaplan–Meier curves, especially with regard to the presence of delayed treatment effects on OS and of an apparent plateau in the tail of the curves. Later on, a third unusual phenomenon became apparent, albeit less frequently: the crossing of survival curves in some trials [60]. The mechanism of action of IT has been invoked as one of the potential explanations for delayed separation of OS curves, a phenomenon that is frequent [55, 58, 61, 64, 66, 67] but not universal [63, 68]. It is conceivable that an early detriment from IT, manifested as crossing of the curves a few months after randomization, also results from delayed effects, although hyperprogression may also play a role [7]. Likewise, it is conceivable that crossing curves reflect the existence of subpopulations with differential effects from treatment, as seen with targeted therapy [69]. Finally, the flattening of OS curves, which can be seen as evidence for a cure fraction [8], may also indicate the natural history of the disease in patients with indolent tumors [39]. All these observations suggest that design and analysis models assuming the proportionality of hazards are less than optimal for trials in IO, especially when comparisons are made with treatments from other classes [10, 11]. Among other problems, the presence of non-proportional hazards may lead to loss of statistical power [8], incorrect conclusions from interim analyses [70, 71], and difficulties in understanding and communicating treatment benefit [72,73,74]. Although several solutions to these problems have been proposed in the literature, their uptake appears to have been low in terms of both design and analysis of phase 3 trials in IO [6, 11, 56, 70, 73, 75,76,77]. Table 1 displays the most frequently proposed solutions to deal with the issues of loss of statistical power and interpretation of treatment benefit.
The different methods have advantages and disadvantages summarized in Table 1; all of them share a lack of regulatory precedent as compared with the decades of use of the logrank test. In the following, we briefly discuss methods that appear to have the greatest potential for assessing treatment benefit in a more meaningful way than with conventional methods based on the hazard ratio (HR) estimated from Cox models.

Table 1 Selected proposed statistical methods to deal with non-proportional hazards

4.3 Weighted Logrank Tests

The proportional hazards assumption is arguably too strong in many practical situations. The violation of this assumption is frequent in oncology, and even more so in phase 3 trials in IO, where up to 50% present evidence of non-proportionality [78]. The omission of prognostic covariates from the proportional hazards model, many of which are often unknown, induces time dependence of the HR for coefficients in the model, making it difficult to distinguish the effect from a true time-dependent coefficient even in randomized trials; moreover, this bias is accentuated by increasing censoring [79].

The estimation and testing of treatment effects in the presence of non-proportional hazards has been a topic of research for a long time, but proportional hazards models have remained the standard approach in oncology because deviations from proportionality were uncommon and/or unknown in advance. With the advent of immunotherapies, the standard approach is increasingly being questioned, and weighted logrank tests have received renewed attention. Harrington and Fleming [80] proposed a two-parameter family of weighted logrank tests that can accommodate a large number of situations, in particular delayed treatment effects. Specifically, for I ordered survival times t1, t2, …, tI, the weighted logrank test statistic is

$$ Z = \frac{{\mathop \sum \nolimits_{i = 1}^{I} w\left( {t_{i} } \right)\left( {O_{i} - E_{i} } \right)}}{{\sqrt {\mathop \sum \nolimits_{i = 1}^{I} w\left( {t_{i} } \right)^{2} {\text{var}}\left( {O_{i} - E_{i} } \right)} }} $$

where Oi and Ei represent, respectively, the observed and expected numbers of deaths at the ith event time, ti, and w(ti) a weight at time ti. The Gρ,γ family defines the weight function as

$$ w\left( {t_{i} } \right) = \hat{S}\left( {t_{i} } \right)^{\rho } (1 - \hat{S}\left( {t_{i} } \right))^{\gamma } $$

where \( \hat{S}\left( {t_{i} } \right) \) is an estimate of the overall survival function at time ti and ρ and γ are shape parameters for the weight function. The unweighted logrank test is obtained for ρ = γ = 0 (G0,0), and the Peto–Prentice test for ρ = 1 and γ = 0 (G1,0). The test gives more weight to later time points (and is thus preferable for delayed treatment effects) when ρ = 0 and γ = 1 (G0,1) [81]. Zucker and Lakatos proposed weights that achieve maximum efficiency when there is a delayed treatment effect [82]. Yang and Prentice extended these ideas by using adaptive weights that ensure good power over a range of possible alternative hypotheses [83]. More recently, Magirr and Burman proposed “modestly weighted” logrank tests [84]. As an example, among many others, of delayed treatment benefit, the phase 3 trial KEYNOTE-040 investigated pembrolizumab versus chemotherapy in patients with recurrent or metastatic head and neck carcinoma [85]. In this trial, the standard logrank test of the difference in OS, the primary endpoint, produced only a marginally significant one-sided p value of 0.016. However, a delayed separation of the survival curves after 5 months of follow-up was observed. As the delayed treatment effect could have been reasonably expected based on the mechanisms of action of the experimental treatment and its competitor, a weighted logrank test might have been chosen in order to improve the power of the comparison [86].
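To make the two formulas above concrete, the following Python sketch computes the Z statistic of a Gρ,γ weighted logrank test for two arms. It is a minimal illustration under stated assumptions rather than a validated implementation: the function name weighted_logrank and its arguments are hypothetical, the hypergeometric variance of Oi − Ei is used, and the weight is evaluated with the pooled Kaplan–Meier estimate just before each event time, a common convention that the text above does not specify.

```python
import math

def weighted_logrank(times, events, groups, rho=0.0, gamma=0.0):
    """Z statistic of the G(rho, gamma) weighted logrank test (illustrative sketch).

    times  : follow-up time of each patient
    events : 1 if death was observed, 0 if the patient was censored
    groups : 1 for the experimental arm, 0 for the control arm
    A negative Z means fewer deaths than expected in the experimental arm.
    """
    event_times = sorted({t for t, e in zip(times, events) if e == 1})
    surv = 1.0  # pooled Kaplan-Meier estimate just before the current event time (assumption)
    numerator = 0.0
    variance = 0.0
    for t in event_times:
        n = sum(1 for x in times if x >= t)                               # at risk, both arms
        n1 = sum(1 for x, g in zip(times, groups) if x >= t and g == 1)   # at risk, experimental
        d = sum(1 for x, e in zip(times, events) if x == t and e == 1)    # deaths, both arms
        d1 = sum(1 for x, e, g in zip(times, events, groups)
                 if x == t and e == 1 and g == 1)                         # deaths, experimental
        w = surv ** rho * (1.0 - surv) ** gamma                           # G(rho, gamma) weight
        numerator += w * (d1 - d * n1 / n)                                # w * (O_i - E_i)
        if n > 1:
            variance += w ** 2 * d * (n1 / n) * (1 - n1 / n) * (n - d) / (n - 1)
        surv *= 1.0 - d / n                                               # update pooled estimate
    return numerator / math.sqrt(variance)
```

Calling the function with rho=0 and gamma=0 gives the standard logrank statistic, whereas rho=0 and gamma=1 down-weights early deaths, which is the choice suggested above for delayed treatment effects.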

A weighted logrank statistic may maximize statistical power, but the interpretation of the corresponding treatment effect is far from straightforward. By contrast, the hazard ratio quantifies a reduction in the instantaneous risk of death at any time, and the ease of interpretation of the treatment parameter in a proportional hazards model is no small reason for its enduring success, regardless of deviations from the underlying assumption [87].

4.4 Accelerated Failure Time Models

Interpretation of treatment effects on the hazard scale is not intuitive, as it is not straightforward to translate the information about the mortality hazard reduction, conveyed by the estimated value of HR, into a difference of the survival time. The latter scale is therefore more natural. Accelerated failure–time (AFT) models assume that the effect of treatment manifests itself in shrinking or extending the time scale. The model leads to a simple and natural interpretation of the treatment effect, which can be quantified in terms of the ratio of the mean survival time for the experimental and control treatment.

The AFT model is, essentially, a linear model on the logarithmic time scale, very similar to the classical linear regression model. Symbolically, the model can be expressed as follows:

$$ { \ln }\left( {t_{i} } \right) = \beta_{0} + \beta_{1} x_{1,i} + \beta_{2} x_{2,i} + \cdots + \beta_{k} x_{k,i} + \varepsilon_{i} , $$

where ti is the observed time to event for the ith patient, x1,i, …, xk,i are the values of k explanatory variables describing the patient, and εi is the residual random error with mean equal to 0. If one assumes that εi is normally distributed, the AFT model becomes the familiar linear regression model for the logarithm of survival time. Equivalently, the model can be expressed as follows:

$$ t_{i} = e^{{\varepsilon_{i} }} e^{{\beta_{0} + \beta_{1} x_{1,i} + \beta_{2} x_{2,i} + \cdots + \beta_{k} x_{k,i} }} . $$

According to this formulation, the effect of a unit change in variable xj amounts to multiplying the time corresponding to the residual error, \( e^{{\varepsilon_{i} }} \), by a factor equal to \( e^{{\beta_{j} }} \). Hence, \( e^{{\beta_{j} }} \) can be naturally interpreted as the ratio of the mean survival time corresponding to xj + 1 and xj.
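As a worked illustration of this interpretation, consider a model with a single binary covariate x1,i, equal to 1 for the experimental treatment and 0 for the control treatment, and suppose, purely hypothetically, that the estimated coefficient is β1 = 0.405. For two patients sharing the same residual error, the ratio of survival times is

$$ \frac{{e^{{\varepsilon_{i} }} e^{{\beta_{0} + \beta_{1} }} }}{{e^{{\varepsilon_{i} }} e^{{\beta_{0} }} }} = e^{{\beta_{1} }} = e^{0.405} \approx 1.5 . $$

Under these assumed numbers, the experimental treatment stretches the time scale by about 50%, so that a survival time of 10 months under control would correspond to about 15 months under the experimental treatment.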

The AFT model does not require the proportional hazards assumption and can be used in the case of proportional and non-proportional hazards. Hence, it is an important alternative to the proportional hazards model. It has been observed that, for instance, the log-normal AFT model, in which residual error εi is assumed to be normally distributed and which is a non-proportional hazard model, is more suitable than the proportional hazard model for analyzing disease-free survival in colon cancer [88] or disease-free interval and disease-specific survival in breast cancer [89]. In glioblastoma, the log–logistic AFT model, in which residual error εi is assumed to have a logistic distribution and which is a non-proportional hazard model, was found to perform best with respect to prediction of patient survival time [90].

The main practical issue, often raised in the context of the use of the AFT model, is that the estimation of the model is usually carried out by assuming a parametric form of the distribution of the survival time, which in most cases is unknown. Parametric models can be used in situations where survival curves are smooth and can be approximated well with models with few parameters [91, 92]. Such an approach is implemented in commercial statistical software such as SAS (PROC LIFEREG) and Stata (streg command) and in open-source software such as R (for instance, in function survreg in the survival package, and in function psm in the rms package). However, it is possible to estimate the model without the specification of the survival–time distribution. The semiparametric AFT model has been around since the end of the 1970s [93, 94]. The main challenge, limiting the use of the model, was the lack of efficient and reliable computing algorithms. However, in the last decade, this has fundamentally changed. While the new computing algorithms have not yet been included in commercial statistical software, they are available in the open-source software R [95]. These developments open the door to more widespread application of the semiparametric AFT model.

In the context of IT trials, it is worth noting that the AFT model, like the proportional hazards model, is not valid in the situation of a delayed treatment effect, when the survival functions for the experimental and control treatments initially overlap or cross. In that case, the use of the restricted mean survival time might be considered, as we discuss next.

4.5 Restricted Mean Survival Times

Ideally, one would prefer to express the treatment effect in terms of a difference in the mean survival time. If the survival curve reaches 0 (i.e., if the single longest observed time is an event), the mean survival can be estimated nonparametrically by computing the area under the survival curve. However, this is almost never the case in practice. It is nevertheless possible to estimate the RMST by restricting (or truncating) the follow-up to a given time t and computing the area under the survival curve only up to that point [96,97,98,99]. The landmark time t can be chosen arbitrarily, but it is usually taken equal to the minimum of the largest times in all treatment groups. Once the restricted means in both groups are computed, they may be contrasted by subtraction. Importantly, the use and interpretation of the difference of RMST does not depend on whether hazards are proportional or not [100]. The RMST can be applied even in the extreme cases of non-proportional hazards, when the survival curves initially overlap or when they cross, as can be observed in IT trials [73]. The difference in RMST measures the mean gain in life expectancy through time t associated with the superior treatment. The interpretation of RMST may not be trivial, as it depends on the duration of follow-up, which dictates the choice of the landmark time t. Moreover, a mean survival time may not be meaningful to a patient, in so far as a month of survival gained in the near future (for a patient of poorer prognosis) may be quite different from a month of survival gained in a distant future (for a patient of better prognosis).
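The computation described above can be sketched in a few lines of Python. The sketch is illustrative only: the function names km_curve and rmst, the argument tau (the landmark time t of the text), and the toy data are our own assumptions, and a validated implementation should be preferred in practice.

```python
def km_curve(times, events):
    """Kaplan-Meier estimate as a list of (event_time, survival_after_event) steps."""
    steps = []
    surv = 1.0
    for t in sorted({t for t, e in zip(times, events) if e == 1}):
        at_risk = sum(1 for x in times if x >= t)
        deaths = sum(1 for x, e in zip(times, events) if x == t and e == 1)
        surv *= 1.0 - deaths / at_risk
        steps.append((t, surv))
    return steps

def rmst(times, events, tau):
    """Area under the Kaplan-Meier step function between 0 and tau.

    Assumes tau does not exceed the follow-up actually observed; beyond the
    last observed time the curve is simply carried forward.
    """
    area, prev_t, prev_s = 0.0, 0.0, 1.0
    for t, s in km_curve(times, events):
        if t >= tau:
            break
        area += prev_s * (t - prev_t)   # rectangle up to the next drop of the curve
        prev_t, prev_s = t, s
    return area + prev_s * (tau - prev_t)  # last rectangle, truncated at tau

# Toy data (hypothetical): survival times in months, all deaths observed.
rmst_experimental = rmst([2, 4, 6, 8], [1, 1, 1, 1], tau=5)
rmst_control = rmst([1, 2, 3, 4], [1, 1, 1, 1], tau=5)
rmst_difference = rmst_experimental - rmst_control
```

With these toy data and no censoring, each RMST equals the mean of the survival times truncated at 5 months, so the difference is about 1.5 months in favor of the experimental arm.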

The difference in RMST can be tested for significance, and it is worth noting that the power of the test depends on the pattern of the difference and the chosen landmark time, among other factors [76, 101]. Hence, even in the situation of non-proportional hazards, the power of the test may not necessarily be larger than the power of the logrank test. Luo et al. discuss the design and monitoring of trials using RMST [102]. Significance tests for RMST are available in the R software packages survRM2 and survRM2adapt. The package SSRMST implements a method to compute the sample size for a clinical trial with RMST used as an endpoint.

4.6 Combination Tests

Some authors have proposed combining several tests in order to maximize power. For instance, if one has no idea whether the effect of treatment will be early or late, a combination test can use Z = max(|Z0,0|, |Z1,0|, |Z0,1|), where Z0,0, Z1,0, and Z0,1 are the statistics obtained from the G0,0, G1,0, and G0,1 weighted logrank tests introduced above [103, 104]. Another combination test uses both weighted logrank tests and weighted Kaplan–Meier tests, which may be more sensitive than rank tests to differences in survival estimates [105, 106]. Yet another combination test uses a logrank test that would perform best under proportional hazards, and a permutation test of the difference in restricted mean survival times that might perform better in other situations [107]. These combination tests require small sample size increases as compared with the logrank test, but they protect the power of the test against departures from proportional hazards [108]. Moreover, combination tests do not require pre-specification of a unique test (such as a weighted logrank test) which might or might not turn out to be appropriate for the situation at hand.

An example of this strategy is given by the analysis of the IMvigor211 trial evaluating atezolizumab versus chemotherapy in patients with advanced or metastatic urothelial cancer. The comparison of OS in the PD-L1-positive population reported in the publication showed a non-significant treatment effect. In this trial, survival curves cross between 4 and 5 months and show a numeric benefit in favor of atezolizumab in the long-term follow-up [109]. Roychoudhury et al. have evaluated the use of the MaxCombo test, a combined test based on multiple Fleming–Harrington weighted logrank tests used adaptively based on underlying data. The MaxCombo test selected the G0,1 test, which had the minimum p value, and the result was highly statistically significant (p = 0.005). This finding strongly suggests that this testing strategy substantially increased the power of the comparison in this scenario of crossing survival curves [110].

4.7 Generalized Pairwise Comparisons (GPC) for Delayed Treatment Effects

Generalized pairwise comparisons (GPC) have recently been proposed to address situations of non-proportional hazards, in particular when the treatment effect is likely to manifest itself after some time. GPC extend the Wilcoxon–Mann–Whitney test for comparing two samples, e.g., two randomized groups in a clinical trial. In that test, the outcome of interest is continuous and is denoted X (taking values x1, x2, …, xn) in the treatment group and Y (taking values y1, y2, …, ym) in the control group. Consider all possible pairs (xi, yj) consisting of one observation from the treatment group and one observation from the control group. The U-statistic for the Wilcoxon–Mann–Whitney test is given by

$$ U = \frac{1}{n \cdot m} \sum_{i = 1}^{n} \sum_{j = 1}^{m} u_{ij} $$

where

$$ u_{ij} = \begin{cases} +1 & \text{if } x_{i} > y_{j} \\ -1 & \text{if } x_{i} < y_{j} \\ 0 & \text{if } x_{i} = y_{j} \end{cases} $$

The Wilcoxon–Mann–Whitney test was extended by Gehan to potentially censored outcomes. GPC generalize the test further to any situation in which every pair can be classified as a “win” (if the individual in the treated group has a better outcome than the individual in the control group), as a “loss” (if the individual in the treated group has a worse outcome than the individual in the control group), or as a “tie” (if there is no difference in outcome between the two individuals) [111, 112]. Hence, the U-statistic is now calculated using generalized pairwise scores:

$$ u_{ij} = \begin{cases} +1 & \text{if pair is a win} \\ -1 & \text{if pair is a loss} \\ 0 & \text{if pair is a tie} \end{cases} $$

This generalized U-statistic, called the net benefit, is the difference between the probability of a win and the probability of a loss. The ratio between the probability of a win and the probability of a loss is called the win ratio [112].
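As a minimal sketch of these definitions, the net benefit and the win ratio can be computed from the matrix of pairwise scores. The example assumes complete (uncensored) observations in which a larger value is a better outcome; the function names and the toy values are hypothetical.

```python
import numpy as np

def pairwise_scores(x, y):
    """Scores u_ij for all pairs: +1 if x_i > y_j, -1 if x_i < y_j, 0 if equal."""
    x = np.asarray(x, dtype=float)[:, None]  # treatment outcomes as a column
    y = np.asarray(y, dtype=float)[None, :]  # control outcomes as a row
    return np.sign(x - y)

def net_benefit_and_win_ratio(u):
    """Net benefit = P(win) - P(loss); win ratio = P(win) / P(loss)."""
    p_win = np.mean(u > 0)
    p_loss = np.mean(u < 0)
    return p_win - p_loss, p_win / p_loss

# Toy outcomes (hypothetical values)
u = pairwise_scores([5, 7, 9], [4, 7, 8])
net_benefit, win_ratio = net_benefit_and_win_ratio(u)
print(net_benefit, win_ratio)
```

Of the nine pairs in this toy example, five are wins, three are losses, and one is a tie. The net benefit therefore equals the mean of the scores, whereas the win ratio ignores ties and is undefined when there are no losses.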

In the analysis of times to event such as survival time, a win (loss) could be declared if the difference in survival exceeded a threshold considered to be clinically meaningful, say m (here m denotes a time threshold, not the size of the control group as above) [111, 113]. For treatments that have a short-lived effect, the net benefit will tend to decrease as a function of m, while for treatments that have a delayed effect, the net benefit will tend to remain stable or to increase as a function of m [114]. In fact, for treatments that achieve long-term cure in a given proportion of patients, the net benefit will tend to the cure rate as m increases.
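The limiting behavior in the presence of cure can be made explicit in a short sketch. The notation is introduced here for illustration only: Δ(m) is the net benefit at threshold m, π is the proportion of treated patients who are cured, and X_u is the survival time of an uncured treated patient. The sketch assumes that no control patient is cured, that uncured survival times are finite, and that censoring is ignored. With these conventions,

$$ \Delta(m) = P(X - Y > m) - P(Y - X > m) $$

A cured patient on treatment wins every pair regardless of the threshold, so that

$$ P(X - Y > m) = \pi + (1 - \pi)\, P(X_u - Y > m) \to \pi \quad \text{and} \quad P(Y - X > m) = (1 - \pi)\, P(Y - X_u > m) \to 0 \quad \text{as } m \to \infty $$

and therefore Δ(m) tends to π. If a proportion of control patients were also cured, the same argument would need to account for pairs in which both patients are cured, which this sketch does not cover.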

The net benefit has been advocated as a patient-relevant measure of treatment benefit, because it is expressed on the time scale and directly answers a question a patient might ask, that is, “What is the net chance, for a patient taken at random, of surviving longer by at least m months on treatment than on control?” In addition, when the treatment benefit is delayed, the power of the GPC test increases with the threshold of clinical relevance [74].

Figure 1 shows average results from a large number of simulated trials in which survival in the control arm was assumed to follow an exponential distribution with parameter 0.1. The treatment arm was simulated for two distinct situations: one in which the hazard ratio remained equal to 0.65 over time (Fig. 1, Panel A), and another in which the hazard ratio was equal to 1 until 4 months, then decreased linearly to 0.4 at 20 months and stayed at 0.4 thereafter, in such a way that the mean hazard ratio was also equal to 0.65 over the follow-up duration (Fig. 1, Panel B). Simulations were performed on complete times to event and also with a censoring mechanism yielding, on average, 20% of censored observations. Censoring times were uniformly distributed, corresponding to administrative censoring.

The shape of the survival curves is not strikingly different between panels A and B, yet the net benefit as a function of the threshold of clinical relevance shows a clear difference between them and emphasizes the more substantial long-term net survival benefit from a treatment that has a delayed effect. The power of a GPC test with varying thresholds of clinical relevance is inferior, as expected, to the power of the logrank test when hazards are proportional (Fig. 1, Panel A). However, the power of a GPC test becomes superior to the power of the logrank test, and similar to the power of a G0,1 weighted logrank test, when the thresholds of clinical relevance are large (Fig. 1, Panel B).

Fig. 1 Survival estimates as functions of time, net survival benefit, and power as functions of threshold of clinical relevance, in situations of proportional hazards (a) and delayed treatment effect (b)

Of note, however, censoring reduces the power of a GPC test for large thresholds of clinical relevance. This observation underscores the need for the follow-up time to be commensurate with the threshold of clinical relevance of interest.
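The two simulation scenarios described above can be sketched as follows, for a single simulated data set with complete (uncensored) times to event. This is an illustration under stated assumptions rather than a reproduction of Fig. 1. The control parameter 0.1 is taken to be a hazard rate per month, delayed-effect times are drawn by numerically inverting the cumulative hazard on a grid, and the function names, sample size, and seed are hypothetical. Power calculations and censoring are not included.

```python
import numpy as np

def hazard_ratio(t):
    """Delayed-effect profile: 1 until month 4, linear to 0.4 at month 20, then 0.4."""
    return np.interp(t, [4.0, 20.0], [1.0, 0.4])

def simulate_delayed(n, rng, base_rate=0.1, t_max=400.0, step=0.01):
    """Draw survival times with hazard base_rate * hazard_ratio(t)."""
    grid = np.arange(0.0, t_max + step, step)
    h = base_rate * hazard_ratio(grid)
    # Cumulative hazard on the grid (trapezoidal rule), starting at 0
    cum_h = np.concatenate([[0.0], np.cumsum(0.5 * (h[1:] + h[:-1]) * step)])
    # If E ~ Exp(1), then T = H^{-1}(E) has cumulative hazard H
    return np.interp(rng.exponential(size=n), cum_h, grid)

def net_benefit(x, y, m=0.0):
    """P(X - Y > m) - P(Y - X > m), estimated over all pairs."""
    diff = np.asarray(x, dtype=float)[:, None] - np.asarray(y, dtype=float)[None, :]
    return np.mean(diff > m) - np.mean(diff < -m)

rng = np.random.default_rng(0)
n = 1000
control = rng.exponential(scale=1 / 0.1, size=n)                  # hazard 0.1
proportional = rng.exponential(scale=1 / (0.1 * 0.65), size=n)    # constant HR 0.65
delayed = simulate_delayed(n, rng)                                # delayed effect

for m in (0, 6, 12, 24):  # thresholds of clinical relevance, in months
    print(m,
          round(net_benefit(proportional, control, m), 3),
          round(net_benefit(delayed, control, m), 3))
```

Printing the net benefit for several thresholds allows the two profiles to be compared on one data set. Results from a single simulated sample are subject to sampling variability and need not match the averages over many trials shown in Fig. 1.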

The CA184-024 trial assessed the combination of ipilimumab plus dacarbazine versus placebo plus dacarbazine in patients with metastatic melanoma [66]. In this trial, PFS curves separated after the median, violating the proportional hazards assumption. When focusing on long-term PFS benefit, corresponding to higher values for the threshold of clinical relevance, the values of the net PFS benefit increased. The elevated and sustained value of the net PFS benefit, even for high threshold values, was a statistically testable measure of the delayed treatment effect [74].

4.8 Generalized Pairwise Comparisons (GPC) for Personalized Medicine

GPC can also prove useful to go beyond the analysis of a single outcome, by defining wins and losses for multiple outcomes. For instance, if time to death and time to disease progression were both of interest, one could use the composite endpoint of progression-free survival, which is the time to progressive disease or death, whichever comes first. A major objection against using such a composite endpoint is that it focuses on the time to the first event, rather than on the time to the most relevant event. In other words, the crucially important time to death after progression is ignored. Using GPC, one can instead consider survival to have priority over time to progressive disease. Hence, if variables {X1, Y1} denote the outcome of first priority (e.g., overall survival) in the treatment and control groups, respectively, and {X2, Y2} denote the outcome of second priority (e.g., time to progressive disease) in the treatment and control groups, respectively, the pairwise scores can be generalized as follows (ignoring censoring for notational simplicity):

$$ u_{ij} = \begin{cases} +1 & \text{if } X_{1,i} > Y_{1,j} \text{ or } (X_{1,i} = Y_{1,j} \text{ and } X_{2,i} > Y_{2,j}) \\ -1 & \text{if } X_{1,i} < Y_{1,j} \text{ or } (X_{1,i} = Y_{1,j} \text{ and } X_{2,i} < Y_{2,j}) \\ 0 & \text{otherwise} \end{cases} $$

The U-statistic captures the overall treatment effect on any number of prioritized outcomes of any type, including safety outcomes, quality of life, or other patient-relevant outcomes [115]. As such, this approach permits an overall benefit/risk assessment of the treatment effects using direct patient comparisons, rather than marginal treatment effects on the various outcomes considered, which ignore the correlation between these outcomes [116]. In cancer, such a benefit/risk assessment is acutely required when treatments induce severe toxicities, some of which have a substantial impact on patient well-being. Finally, because outcomes can be prioritized, GPC can conceivably take into account individual patient preferences, thus paving the way to truly personalized medicine.
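The prioritized pairwise scores for two outcomes can be sketched as follows. The example ignores censoring, as in the formula, and assumes that larger values are better for both outcomes; the function name and the toy values are hypothetical.

```python
import numpy as np

def prioritized_scores(x1, x2, y1, y2):
    """Pairwise scores with outcome 1 taking priority over outcome 2.

    x1, x2: first- and second-priority outcomes in the treatment group.
    y1, y2: first- and second-priority outcomes in the control group.
    """
    x1, x2 = (np.asarray(a, dtype=float)[:, None] for a in (x1, x2))
    y1, y2 = (np.asarray(a, dtype=float)[None, :] for a in (y1, y2))
    first = np.sign(x1 - y1)    # decided on the first-priority outcome
    second = np.sign(x2 - y2)   # used only when the first outcome is tied
    return np.where(first != 0, first, second)

# Toy data (hypothetical): overall survival and time to progression, in months
u = prioritized_scores(x1=[12, 20], x2=[6, 5], y1=[12, 15], y2=[4, 9])
print(u)
print(u.mean())  # net benefit over all pairs
```

In this toy example the first pair is tied on overall survival and is decided by time to progression, whereas the other three pairs are decided by overall survival alone. With continuous outcomes exact ties are rare, so in practice a threshold of clinical relevance on the first outcome would typically determine when the second outcome is consulted.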

5 Conclusion

IT is not only revolutionizing the systemic treatment of patients with cancer, but also paving the way to the adoption of novel methodological approaches to trial design, analysis, and interpretation. In fairness, some of the methods now being adopted are not new, but their revival is largely due to the issues that have emerged in IT trials. IT trials have brought increased attention to the need to follow patients for as long as possible, an issue that was largely neglected in situations of proportional hazards. In fact, there is a strong incentive for the sponsor of a trial to terminate the follow-up as soon as a treatment effect is statistically established. This tendency should be actively resisted, and ensuring long-term follow-up of clinical trials should become the norm rather than the exception. The implementation of the methods discussed here in IT trials brings a fresh look to old problems, such as that of non-proportional hazards, and makes it possible to tackle emerging questions in drug development and medical practice, such as that of prioritizing outcomes according to individual preferences. It is hoped that improved statistical methodology will make the development of more efficacious, and hopefully less toxic, IT modalities for patients with cancer more efficient.