1 Introduction

The popularity of the Cox regression model has contributed to the enormous success of the hazard ratio as a concise summary of the effect of a randomised treatment on a survival endpoint. Notwithstanding this, use of the hazard ratio has been criticised over recent years. Hernán (2010) argued that selection effects caused by unobserved heterogeneity (“frailty”) render a causal interpretation of the hazard ratio difficult when treatment affects outcome. While the treated and untreated people are comparable by design at baseline, the treated people who survive a given time t may then tend to be more “frail” (as a result of lower mortality if treatment is beneficial) than the untreated people who survive the given time t, so that the crucial comparability of both groups is lost at that time. Aalen et al. (2015) re-iterated Hernán’s concern. They viewed the problem more as one of non-collapsibility (Martinussen and Vansteelandt 2013), which is a concern about the interpretation of the hazard ratio, though not about its justification as a causal contrast. In particular, they argued that the magnitude of the hazard ratio typically changes as one evaluates it in smaller subgroups of the population (e.g. frail people), even in the absence of interaction effects on the log hazard scale.

In this paper, we aim to develop more insight into these matters. We pay special attention to what can be learned about the treatment effect from the hazard contrast vanishing after a certain point in time, i.e. the hazard ratio becoming 1 or the hazard difference becoming 0. Throughout, we will assume that data are available from a randomised experiment on the effect of a dichotomous treatment A (coded 1 for treatment and 0 for control) on a survival endpoint T, so that issues of confounding can be ignored. Time-changing hazard ratios are often interpreted as a change in treatment effect. Two recent examples of this are Lederle et al. (2019) and Primrose et al. (2019). We argue that there is no causal support for such reporting, however.

A causal contrast compares the same population under the hypothetical scenarios ‘everybody treated’ versus ‘everybody untreated’ and we furthermore define a causal hazard ratio, which contrasts the intensities at time t between the treatment and control groups of subjects who would survive up to time t regardless of treatment. Its magnitude is indeed causally interpretable in the sense that it makes a contrast for the same principal stratum of subjects with \((T^0\ge t,T^1\ge t)\), where \(T^a\), \(a=0,1\), is the potential lifetime under treatment a. This is related to earlier work on defining causal effects based on principal strata, see Frangakis and Rubin (2002) and Bartolucci and Grilli (2011). We show that this causal hazard ratio, \(\text{ HR }(t)\), is smaller than the Cox hazard ratio at all times \(t>0\) if the joint distribution of \((T^0,T^1)\) is governed by a so-called Archimedean copula as generated by a frailty setting. This corroborates intuition, also given in Hernán (2010), that the Cox hazard ratio underestimates the causal effect (on the hazard scale) of a beneficial treatment as a result of the treatment group then containing relatively more frail subjects compared to the control group at each time \(t>0\). Surprisingly, we find that this intuition breaks down in a more general setting where there is still positive correlation between \(T^0\) and \(T^1\), but the joint distribution of \((T^0,T^1)\) is not governed by an Archimedean copula. We also define a causal hazard difference; however, both this measure and the suggested causal hazard ratio have the drawback that their causal interpretations rely on untestable assumptions.

If the Cox model (or an extended version with a time-changing coefficient) is correctly specified then the parameters of the Cox model are estimated consistently by use of the Cox partial score function and the Breslow estimator. The standard estimators of the variability of these estimators are also consistent. Therefore, inference based on these classical tools, such as the log-rank test, is valid. Our point with this paper, however, is that the obtained hazard ratio cannot generally be given a causal interpretation when interpreted as a hazard ratio, except in the trivial case where there is no causal effect (i.e. a hazard ratio of 1), or if an unverifiable assumption holds, to which we return later.

2 The Cox model and causal reasoning

2.1 Hazard ratios

Analyses of time-to-event endpoints in randomised experiments are commonly based on the Cox model

$$\begin{aligned} \lambda (t;a)=\lambda _0(t)e^{\beta a}, \end{aligned}$$

where \(\lambda (t;a)\) denotes the hazard function of T given \(A=a\), evaluated at time t, and \(\lambda _0(t)\) is the unspecified baseline hazard function. This model implies that for all t

$$\begin{aligned} P(T>t |A=1)=P(T>t |A=0)^{\exp (\beta )}, \end{aligned}$$
(1)

so that \(\exp {(\beta )}\) can be interpreted as

$$\begin{aligned} \exp (\beta ) = \frac{\log P(T>t |A=1)}{\log P(T>t |A=0)}, \end{aligned}$$

a ratio between cumulative hazards. This represents a causal contrast (i.e., it compares a functional of the survival time distribution for the same population under different interventions). Since randomisation ensures that \(T^a\perp \!\!\!\perp A\) for \(a=0,1\), we have by consistency, \(T=T^a\ \text{ if } A=a\), that

$$\begin{aligned} \exp (\beta ) = \frac{\log P(T^1>t)}{\log P(T^0>t)}. \end{aligned}$$

This shows that, under the proportional hazard assumption, \(\exp (\beta )\) forms a valid one-number summary providing a relative and causal contrast of what the log survival probability would be at an arbitrary time t if everyone were treated, versus what it would be at that time if no one were treated.
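As a quick numerical illustration of identity (1), the following Python sketch confirms that the ratio of log survival probabilities equals \(\exp (\beta )\) at every time point. The Weibull baseline cumulative hazard and the value of \(\beta \) are hypothetical choices made for illustration only; they are not taken from any data set discussed here.

```python
import math

def baseline_cum_hazard(t, shape=1.5, scale=2.0):
    """Hypothetical Weibull cumulative baseline hazard Lambda_0(t)."""
    return (t / scale) ** shape

def survival(t, a, beta):
    """P(T > t | A = a) under the Cox model lambda(t; a) = lambda_0(t) exp(beta * a)."""
    return math.exp(-baseline_cum_hazard(t) * math.exp(beta * a))

def log_survival_ratio(t, beta):
    """log P(T > t | A = 1) / log P(T > t | A = 0)."""
    return math.log(survival(t, 1, beta)) / math.log(survival(t, 0, beta))

beta = -0.3  # illustrative log hazard ratio (assumption)
for t in (0.5, 1.0, 2.0, 4.0):
    # The two printed numbers should agree at every t
    print(t, round(log_survival_ratio(t, beta), 6), round(math.exp(beta), 6))
```

The ratio does not depend on the chosen baseline, which is why \(\exp (\beta )\) can serve as a one-number summary under proportional hazards.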

The log-transformation makes the above interpretation of \(\exp (\beta )\), while causal, difficult. It is therefore more common to interpret \(\exp (\beta )\) as a hazard ratio

$$\begin{aligned} \exp (\beta )= & {} \frac{\lim _{h\rightarrow 0}P(t\le T< t+h|T\ge t,A=1)/h}{\lim _{h\rightarrow 0}P(t\le T< t+h|T\ge t,A=0)/h}\\= & {} \frac{\lim _{h\rightarrow 0}P(t\le T^1< t+h|T^1 \ge t)/h}{\lim _{h\rightarrow 0}P(t\le T^0< t+h|T^0\ge t)/h}. \end{aligned}$$

Interpretation appears simpler now, but this is somewhat deceptive for two reasons. First, the right-hand expression shows that \(\exp (\beta )\) contrasts the hazard functions with and without intervention for two separate groups of individuals, those who survive time \(t>0\) with treatment (\(T^1\ge t\)) and those who survive time \(t>0\) without treatment (\(T^0\ge t\)). Those groups will typically fail to be comparable if treatment affects outcome (Hernán 2010). In particular, when treatment has a beneficial effect then, despite randomisation, the subgroup \(T^1\ge t\) in the numerator will generally contain more frail people than the subgroup \(T^0\ge t\) in the denominator, where the frailest people may have died already. When viewed as a hazard ratio, \(\exp (\beta )\) therefore does not represent a causal contrast and it would be unwise to state that ‘treatment works in the same way throughout time’. Second, the interpretation of \(\exp (\beta )\) as a hazard ratio is further complicated by it being non-collapsible, so that its magnitude typically becomes more pronounced as one evaluates smaller subgroups of the study population (Martinussen and Vansteelandt 2013; Aalen et al. 2015).

2.2 MRC RE01 study

As an illustration, we reconsider the kidney cancer data described in White and Royston (2009). These data are from the MRC RE01 study which was a randomised controlled trial comparing interferon-\(\alpha \) (IFN) treatment with the best supportive care and hormone treatment with medroxyprogesterone acetate (control) in patients with metastatic renal carcinoma. We use the same 347 patients as in White and Royston (2009). In this illustrative analysis we consider only the first 30 months of follow up. The median follow-up time was 242 days, and 85% of the patients died within the considered time frame. The two Kaplan-Meier estimates in Fig. 1 contain all the available information about the treatment effect. IFN treatment seems superior to the standard treatment (control), although a supremum test comparing the two survival curves results in a non-significant p value of 0.09. The score process plot of Lin et al. (1993) was calculated using the R-package timereg (Martinussen and Scheike 2006), giving no evidence against the proportional hazards assumption (\({P}=0.43\) according to the supremum test). We therefore fitted the Cox model, giving the estimate \(\hat{\beta }=-0.29\) (SE 0.12, \({P}=0.01\)). The corresponding hazard ratio of 0.75 expresses that IFN treatment reduces the log survival probability at each time by 25% (on a relative scale), as compared to control. This interpretation is causal, but not insightful. Results are therefore best communicated by visualising identity (1) in terms of estimated survival curves with versus without treatment (Hernán 2010). This has the advantage that it provides better insight into the possible public health impact of the intervention, but the drawback that it does not permit a compact way of reporting and that survival curves do not provide an understanding of a possible dynamic treatment effect. 
To enable a more in-depth understanding, it is tempting to interpret \(\exp (\beta )\) as a hazard ratio, but then interpretation becomes subtle. We will demonstrate this in more detail in the next section, where we will investigate to what extent hazard ratios may provide insight into the dynamic nature of the treatment effect.

Fig. 1

MRC RE01 study. Kaplan Meier plot, control group (green curve) and IFN group (blue curve) (Color figure online)

3 Time-varying hazard contrasts are not causally interpretable

3.1 Hazard ratios

Consider now a study where the hazard ratio changes with time in the following sense:

$$\begin{aligned} \frac{\lambda (t;A=1)}{\lambda (t;A=0)}= \left\{ \begin{array}{ll} \exp (\beta _1) &{} \text {if}\ t\le \nu \\ \exp (\beta _2) &{} \text {if}\ t> \nu \\ \end{array} \right. \end{aligned}$$
(2)

with \(\beta _1\ne \beta _2\), where \(\nu >0\) denotes the change point. Suppose in particular that \(\beta _1<0\) and \(\beta _2=0\), commonly interpreted as implying that the treatment is beneficial for \(t\le \nu \) but ineffective for \(t>\nu \).

To better understand whether hazard ratios indeed permit such a dynamic interpretation of the treatment effect, we first study data-generating processes (DGP) that could give rise to (2); in “Appendix A.2”, we formulate a more general DGP. Let Z represent the participants’ unmeasured baseline frailty (higher means more frail), which affects T, but is independent of A by randomisation. Suppose that the hazard function \(\lambda (t;a,z)\) of T given \(A=a\) and \(Z=z\) satisfies

$$\begin{aligned} \lambda (t;a,z)=z\lambda ^*(t;a), \end{aligned}$$
(3)

for some function \(\lambda ^*(t;a)\), and let Z be Gamma distributed with mean 1 and variance \(\theta \).

We will investigate what choices of \(\lambda ^*(t;a)\) give rise to model (2). With \(\phi _Z(u)=E(e^{-Zu})\) the Laplace transform associated with the distribution of Z, the following relationship between the hazard function of interest \(\lambda (t;a)\), and \(\lambda ^*(t;a)\), can be shown to hold:

$$\begin{aligned} \Lambda ^*(t;a)=\phi _Z^{-1}(e^{-\Lambda (t;a)}) =\frac{1-e^{-\theta \Lambda (t;a)}}{\theta e^{-\theta \Lambda (t;a)}}, \end{aligned}$$

where \(\Lambda (t;a)=\int _0^t \lambda (s;a)\, ds\) and similarly with \(\Lambda ^*(t;a)\). Simple calculations then show that model (3) implies model (2) when

$$\begin{aligned} \lambda (t;a,Z)= \left\{ \begin{array}{ll} Z\lambda _0(t)e^{\beta _1a}\exp {\{\theta \Lambda _0(t)e^{\beta _1a}\}} &{} \text {if}\ t\le \nu \\ Z\lambda _0(t)e^{\beta _2a}\exp {\left\{ \theta \Lambda _0(\nu )e^{\beta _1a} +\theta \Lambda _0(\nu ,t)e^{\beta _2a}\right\} } &{} \text {if}\ t> \nu \\ \end{array}\right. \end{aligned}$$
(4)

where \(\Lambda _0(\nu ,t)=\int _{\nu }^t\lambda _0(s)\, ds\). The specific \(\lambda ^*(t;a)\) that results in model (2) can be read off from (4). For subjects with given \(Z=z\), it follows that the conditional hazard ratio is

$$\begin{aligned} \text{ HR}_{Z}(t)&=\frac{\lambda (t;A=1,Z)}{\lambda (t;A=0,Z)}\\&= \left\{ \begin{array}{ll} \exp (\beta _1)\exp {[\theta \Lambda _0(t)\{\exp (\beta _1)-1\}]} &{} \text {if}\ t\le \nu \\ \exp (\beta _2)\exp {[\theta \Lambda _0(\nu )\{\exp (\beta _1)-1\} +\theta \Lambda _0(\nu ,t)\{\exp (\beta _2)-1\}]} &{} \text {if}\ t> \nu \\ \end{array}\right. \end{aligned}$$

For \(\beta _2=0\), this simplifies to

$$\begin{aligned} \frac{\lambda (t;A=1,Z)}{\lambda (t;A=0,Z)}= \left\{ \begin{array}{ll} \exp (\beta _1)\exp {[\theta \Lambda _0(t)\{\exp (\beta _1)-1\}]} &{} \text {if}\ t\le \nu \\ \exp {[\theta \Lambda _0(\nu )\{\exp (\beta _1)-1\}]}&{} \text {if}\ t> \nu \\ \end{array}\right. \end{aligned}$$

which is less than 1 for all \(t>0\) since \(\beta _1<0\). This model suggests that the treatment would be beneficial at all \(t>0\) for individuals with the same value of z, regardless of z. This contradicts the earlier, naïve interpretation that, across all individuals combined, treatment is ineffective from time \(\nu \) onwards.
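The following Python sketch checks this construction numerically. It reads \(\lambda ^*(t;a)\) off from (4), integrates the Gamma frailty out, and confirms that the marginal survival function is the one implied by the change-point model (2), while the conditional hazard ratio stays below 1 after \(\nu \). The constant baseline hazard \(\lambda _0(t)=1\) and the values \(\theta =1\), \(\beta _1=-0.5\), \(\nu =1\) are assumptions chosen purely for illustration.

```python
import math

THETA, BETA1, BETA2, NU = 1.0, -0.5, 0.0, 1.0  # illustrative values (assumptions)

def Lambda0(t):
    """Hypothetical baseline cumulative hazard (constant baseline hazard of 1)."""
    return t

def marginal_cum_hazard(t, a):
    """Lambda(t; a) implied by the change-point model (2)."""
    if t <= NU:
        return Lambda0(t) * math.exp(BETA1 * a)
    return (Lambda0(NU) * math.exp(BETA1 * a)
            + (Lambda0(t) - Lambda0(NU)) * math.exp(BETA2 * a))

def lambda_star(t, a):
    """lambda*(t; a) read off from (4), using lambda_0(t) = 1."""
    rate = math.exp((BETA1 if t <= NU else BETA2) * a)
    return rate * math.exp(THETA * marginal_cum_hazard(t, a))

def Lambda_star(t, a, n=20000):
    """Midpoint-rule integral of lambda*(s; a) over [0, t]."""
    h = t / n
    return h * sum(lambda_star((i + 0.5) * h, a) for i in range(n))

def marginal_survival(t, a):
    """Gamma frailty integrated out: (1 + theta * Lambda*)^(-1/theta)."""
    return (1.0 + THETA * Lambda_star(t, a)) ** (-1.0 / THETA)

def conditional_hr(t):
    """HR_Z(t) = lambda(t; 1, Z) / lambda(t; 0, Z); the frailty Z cancels."""
    return lambda_star(t, 1) / lambda_star(t, 0)

t = 2.0  # a time point after the change point
print(marginal_survival(t, 1), math.exp(-marginal_cum_hazard(t, 1)))  # should agree
print(conditional_hr(t))  # below 1 although the marginal hazard ratio is 1 after nu
```

Under these assumptions the marginal hazard ratio is 1 after \(\nu \), yet every individual, whatever their frailty, still has a lower hazard under treatment.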

The root cause of these contradictory conclusions is the fact that the hazard ratio at a given time does not express a causal effect (Hernán 2010). This has nothing to do with model misspecification, as the models considered either hold by construction or (in the MRC RE01 example) hold reasonably well by a goodness-of-fit examination. To appreciate that the treated and untreated risk sets indeed lose their exchangeability over time, note that

$$\begin{aligned} \frac{E(Z|T>t,A=1)}{E(Z|T>t,A=0)}= \left\{ \begin{array}{l@{\quad }l} \exp {\{\Lambda _0(t)(1-e^{\beta _1})\}} &{} \text {if}\ t\le \nu \\ \frac{\exp {\{\Lambda _0(\nu )\}} +\Lambda _0(\nu ,t)}{\exp {\{\Lambda _0(\nu )e^{\beta _1}\}} +\Lambda _0(\nu ,t)} &{} \text {if}\ t>\nu . \end{array}\right. \end{aligned}$$

when \(\beta _2=0\) and \(\theta =1\). This shows that, while Z is independent of A by randomisation, this independence is not generally maintained within subgroups of survivors, where we are left with more frail subjects in the active treatment group: \(E(Z|T>t,A=1)>E(Z|T>t,A=0)\). That such selection takes place does not rely on Z being Gamma-distributed, as shown in “Appendix A.1”, where we consider a situation with Z being binary.
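The selection of frail subjects into the treated risk set can also be seen by simulation. The Python sketch below is a minimal Monte Carlo check restricted to a time point \(t\le \nu \), where the ratio of mean frailties among survivors should equal \(\exp \{\Lambda _0(t)(1-e^{\beta _1})\}\) for \(\theta =1\). The unit-rate baseline \(\Lambda _0(t)=t\), the value \(\beta _1=-0.5\), the time point and the number of draws are illustrative assumptions.

```python
import math
import random

def frailty_mean_among_survivors(cum_hazard, theta, z_draws):
    """Monte Carlo estimate of E(Z | T > t) when P(T > t | Z) = exp(-Z * Lambda*(t))."""
    # Lambda*(t; a) obtained from the marginal cumulative hazard Lambda(t; a)
    lam_star = (math.exp(theta * cum_hazard) - 1.0) / theta
    weights = [math.exp(-z * lam_star) for z in z_draws]
    return sum(z * w for z, w in zip(z_draws, weights)) / sum(weights)

random.seed(1)
theta, beta1, t = 1.0, -0.5, 0.8  # illustrative values with t <= nu (assumptions)

# Gamma frailty with mean 1 and variance theta: shape 1/theta, scale theta
z_draws = [random.gammavariate(1.0 / theta, theta) for _ in range(200_000)]

treated = frailty_mean_among_survivors(t * math.exp(beta1), theta, z_draws)
control = frailty_mean_among_survivors(t, theta, z_draws)
closed_form = math.exp(t * (1.0 - math.exp(beta1)))

# Simulated and closed-form ratio of mean frailties among survivors
print(round(treated / control, 3), round(closed_form, 3))
```

The same frailty draws are reused for both arms, which mirrors randomisation (identical frailty distribution at baseline) and reduces Monte Carlo noise in the ratio.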

3.2 Gastrointestinal tumour study

 Stablein and Koutrouvelis (1985) presented survival data from a randomised clinical trial on locally unresectable gastric cancer. Half of the total of 90 patients were assigned to chemotherapy, and the other half to combined chemotherapy and radiotherapy.

It was suggested that there was superior survival for patients who received chemotherapy, but only in the first year or so. The same application was considered by Collett (2015), pp. 386–389. For illustrative purposes we consider here the first 720 days of follow up, corresponding to the first two time periods considered by Collett (2015). The Kaplan-Meier curves corresponding to the two groups are shown in Figure 1 in the supplementary material. It is seen that the survival curves come close at the end of the considered time interval. Applying a Cox regression model with time-by-treatment interaction, allowing for separate HRs before and after 1 year of follow-up, gives estimated HRs (combined vs chemotherapy) of 2.40 (95% CI: 1.25, 4.63) in the first year and 0.78 (95% CI: 0.34, 1.76) thereafter. The supremum score process test of Lin et al. (1993) gave no convincing evidence against these two models, with \({P}=0.07\) in the first interval and \({P}=0.26\) in the second interval. It is tempting to conclude that the chemotherapy is beneficial in the first year only, and that there is possibly a reverse effect afterwards; see Collett (2015) for a similar analysis and conclusion. However, such a conclusion is not necessarily supported by the data. We demonstrate this using the two estimated hazard ratios and assuming the DGP (3) with the frailty variable being Gamma distributed with mean and variance equal to \(\theta \), corresponding to a Kendall’s \(\tau \) of 0.3. Figure 2 in the Supplementary Material displays the estimated hazard ratio \(\lambda (t;A=1,Z)/\lambda (t;A=0,Z)\). It is seen to depend on time, being larger than 2.4 and increasing towards 3.7 in the first year and, after the change-point (1 year), starting at around 1.2 and then decreasing but being larger than 1 at all times. 
The reason this effect reversal does not occur under this model is that when chemotherapy is the more effective treatment, then there will also be relatively more and more frail subjects in that group over time.
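The reported numbers can be checked for internal consistency against the expression for \(\text{ HR}_{Z}(t)\) in Sect. 3.1: the factor \(\exp [\theta \Lambda _0(\nu )\{\exp (\beta _1)-1\}]\) accumulated by time \(\nu \) carries over unchanged to the period after the change point. The short Python sketch below uses only values quoted in the text; the value 3.7 is the approximate level reached at 1 year, so the result is approximate as well.

```python
hr_first_year = 2.40     # estimated Cox HR before the change point (from the text)
hr_after = 0.78          # estimated Cox HR after the change point (from the text)
hr_z_at_one_year = 3.7   # approximate conditional HR reached at 1 year (from the text)

# Selection factor exp[theta * Lambda_0(nu) * {exp(beta_1) - 1}] accumulated by time nu
selection_factor = hr_z_at_one_year / hr_first_year

# Conditional HR immediately after the change point: exp(beta_2) times that factor
hr_z_after_change = hr_after * selection_factor
print(round(selection_factor, 2), round(hr_z_after_change, 2))
```

The result is about 1.2, in line with the starting value after the change point described above, and it exceeds 1 even though the estimated Cox HR for that period is 0.78.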

3.3 Hazard differences

For an additive hazards model

$$\begin{aligned} \lambda (t;a)=\lambda _0(t)+ {\beta a}, \end{aligned}$$

following the arguments for the Cox models in Sect. 2, the hazard difference \(\beta \) may be written as

$$\begin{aligned} \beta = -\frac{1}{t}\bigl (\log P(T>t |A=1)-\log P(T>t |A=0)\bigr ) \end{aligned}$$

which equals

$$\begin{aligned} \beta = -\frac{1}{t}\bigl (\log P(T^1>t)-\log P(T^0>t)\bigr ) \end{aligned}$$

when the treatment is randomly assigned. This is a causal contrast, but suffers from the same interpretational difficulties as does the hazard ratio.
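As a small numerical check of this identity, the Python sketch below evaluates the scaled difference in log survival probabilities under an additive hazards model and recovers \(\beta \) at each time point. The quadratic cumulative baseline hazard and the value of \(\beta \) are hypothetical choices for illustration.

```python
import math

def cum_baseline(t):
    """Hypothetical cumulative baseline hazard, Lambda_0(t) = 0.2 t + 0.05 t^2."""
    return 0.2 * t + 0.05 * t ** 2

def survival(t, a, beta):
    """P(T > t | A = a) under lambda(t; a) = lambda_0(t) + beta * a."""
    return math.exp(-(cum_baseline(t) + beta * a * t))

def hazard_difference(t, beta):
    """-(1/t) * {log P(T > t | A = 1) - log P(T > t | A = 0)}."""
    return -(math.log(survival(t, 1, beta)) - math.log(survival(t, 0, beta))) / t

beta = -0.1  # illustrative hazard difference; keeps lambda_0(t) + beta positive
for t in (0.5, 1.0, 3.0):
    print(t, round(hazard_difference(t, beta), 6))
```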

It is nonetheless tempting to think that hazard differences have a more appealing interpretation than hazard ratios, apart from them being collapsible (Martinussen and Vansteelandt 2013). Indeed, suppose that the additive hazards model

$$\begin{aligned} \lambda (t;A,Z)=\psi (t)A+\omega (t,Z), \end{aligned}$$
(5)

holds for general functions \(\psi \), \(\omega \) and (possibly unmeasured) baseline covariates Z. Then the baseline exchangeability of treated and untreated individuals w.r.t. the covariates Z, as guaranteed by randomisation, extends to all risk sets, in the sense that \(A\perp \!\!\!\perp Z| T>t\) (Vansteelandt et al. 2014; Aalen et al. 2015). The previous concerns about selection may therefore appear of less relevance for (time-varying) hazard differences. However, this is generally not the case. In Sect. 4.3, we show that an Aalen additive hazards analysis can easily result in the same misleading conclusion as the one we obtained from the Cox regression analysis with treatment-by-time interaction in Sect. 3.1. In Sect. 4.3, we also demonstrate that (time-varying) hazard differences can express causal contrasts, but only provided that additional unverifiable assumptions are made.

4 Towards causal contrasts for survival endpoints

4.1 Causal hazard ratios

In the preceding sections we have discussed difficulties in connection with interpreting hazard ratios or hazard differences causally and suggested alternative ways of illustrating time-dependent treatment contrasts. This section is devoted to a discussion of ways of defining contrasts based on the hazard function, which have a causal interpretation.

To remedy the selection problem, it seems intuitively of interest to evaluate conditional hazard ratios

$$\begin{aligned}&\frac{\lim _{h\rightarrow 0}P(t\le T< t+h|T\ge t,A=1,Z=z)/h}{\lim _{h\rightarrow 0}P(t\le T< t+h|T\ge t,A=0,Z=z)/h}\\&\quad =\frac{\lim _{h\rightarrow 0}P(t\le T^1< t+h|T^1\ge t,Z=z)/h}{\lim _{h\rightarrow 0}P(t\le T^0< t+h|T^0\ge t,Z=z)/h}, \end{aligned}$$

for a large collection of baseline variables Z, such that those who survive time \(t>0\) with treatment (\(T^1\ge t\)) are comparable to those who survive time \(t>0\) without treatment (\(T^0\ge t\)), but have the same covariate values z. Such comparability would be attained if

$$\begin{aligned} T^1\perp \!\!\!\perp T^0|Z. \end{aligned}$$
(6)

Under this assumption, it follows via Bayes’ rule that the right-hand side of the above identity equals

$$\begin{aligned} \frac{\lim _{h\rightarrow 0}P(t\le T^1< t+h|T^0\ge t,T^1\ge t,Z=z)/h}{\lim _{h\rightarrow 0}P(t\le T^0 < t+h|T^0\ge t,T^1\ge t,Z=z)/h}. \end{aligned}$$
(7)

This estimand expresses the instantaneous risk at time t on treatment versus control for the principal stratum of individuals with covariates z who would have survived up to time t, no matter what treatment. It represents a causal contrast, which is closely related to the so-called survivor average causal effect (Rubin 2000). We will refer to it as the conditional causal hazard ratio.

Unfortunately, assumption (6) is untestable and biologically implausible, as it is essentially impossible to believe that one can get hold of all predictors of the event time such that knowledge of the event time without treatment does not further predict the event time with treatment. Furthermore, even if one could get hold of all such predictors, then because Z would probably carry so much information about the event time, one would logically expect the numerator and denominator of (7) to be so close to 0 or 1 that it would render the conditional causal hazard ratio essentially meaningless. In the next section we will therefore focus on the marginal causal hazard ratio \(\mathrm {HR}(t)\) obtained from (7) with Z empty:

$$\begin{aligned} \mathrm {HR}(t)=\frac{\lim _{h\rightarrow 0}P(t\le T^1< t+h|T^0\ge t, T^1\ge t)/h}{\lim _{h\rightarrow 0}P(t\le T^0< t+h|T^0\ge t,T^1\ge t)/h}. \end{aligned}$$
(8)

4.2 Contrasting causal versus observed hazard ratios

Even if it seems impossible to estimate \(\text{ HR }(t)\) from the observed data without making strong assumptions, it is of interest to gain some understanding of the causal hazard ratio \(\text{ HR }(t)\) and its relationship to the standard Cox HR (which we also refer to as the observed HR). It is particularly of interest to know whether or not \(\text{ HR }(t)\) is always smaller than the Cox HR if, as we expect, \(T^0\) and \(T^1\) are positively correlated. To develop insight, let us assume that the Cox model is correctly specified, in the sense that

$$\begin{aligned} P(T^0>t)=\exp {\{-\Lambda _0(t)\}}\quad \text{ and } \quad P(T^1>t) =\exp {\{-\phi \Lambda _0(t)\}}, \end{aligned}$$

where \(\phi =e^{\beta }\) is the Cox HR. Also, we still assume that treatment is randomised. A model for the joint distribution of \(T^0\) and \(T^1\), which obeys these marginal distributions can now be based on a copula C (Nelsen 2006). Specifically, we let \(u_t=e^{-\Lambda _0(t)}\) and \(v_t=e^{-\phi \Lambda _0(t)}\) and specify the joint distribution of \((T^0,T^1)\) via

$$\begin{aligned} P(T^0\ge t_0,T^1\ge t_1)=P(U\le u, V\le v)=C(u,v) \end{aligned}$$

where \(U=u_{T^0}\), \(u=u_{t_0}\), \(V=v_{T^1}\) and \(v=v_{t_1}\). Sklar’s Theorem (Nelsen 2006) ensures that for any joint distribution of \(T^0\) and \(T^1\) there exists a copula C so that the latter display is fulfilled. It is easy to show that

$$\begin{aligned} \text{ HR }(t) =\phi \Omega (t),\quad \Omega (t) =\frac{v_t\dot{C}_v(u_t,v_t)}{u_t\dot{C}_u(u_t,v_t)}, \end{aligned}$$
(9)

where \(\dot{C}_u(u,v)=\frac{\partial C(u,v)}{\partial u}\) and similarly with \(\dot{C}_v(u,v)\). Display (9) gives the desired relation between \(\text{ HR }(t)\) and the Cox HR, \(\phi \). To obtain further insight, let us for a moment restrict the copula C to belong to the class of Archimedean copulas. Bivariate distributions generated by frailty models are a subclass of Archimedean distributions (Oakes 1989).

For Archimedean copulas, there exists a function \(\psi (v)\) so that

$$\begin{aligned} \psi (v)=\psi ^{*}(t_0,t_1) \end{aligned}$$
(10)

with \(v=S(t_0,t_1)=P(T^0>t_0,T^1>t_1)\in ]0,1]\) (Oakes 1989) and

$$\begin{aligned} \psi ^{*}(t_0,t_1)=\frac{\lambda _{T^0}(t_0|T^1=t_1)}{\lambda _{T^0} (t_0|T^1>t_1)}=\frac{\lambda _{T^1}(t_1|T^0=t_0)}{\lambda _{T^1}(t_1|T^0>t_0)} \end{aligned}$$

is the so-called cross-ratio function (Oakes 1989). In the latter display, \(\lambda _{T^0}(t_0|T^1=t_1)\) denotes the conditional hazard function of \(T^0\) given \(T^1=t_1\), and likewise with the other hazard functions in the latter display. This gives rise to the explicit expression

$$\begin{aligned} \Omega (t)=\exp {\left( -\int _{e^{-\Lambda _0(t)}}^{e^{-\phi \Lambda _0(t)}} \frac{\psi (y)}{y}dy\right) } \exp {\{-\Lambda _0(t)(\phi -1)\}}. \end{aligned}$$

Using the mean value theorem gives

$$\begin{aligned} \Omega (t)<1\iff \psi (e^{-H_t})>1, \end{aligned}$$

for some \(H_t\) on the line segment between \(\phi \Lambda _0(t)\) and \( \Lambda _0(t)\). The right-hand side of the latter display is fulfilled if \(\psi (v)>1\) for all v. If the copula is induced by a frailty model where the frailty distribution is not degenerate (at unity) then \(T^0\) and \(T^1\) are positively correlated and also \(\psi (v)>1\). This confirms the intuition laid out in Hernán (2010), who argues, based on a frailty setting, that the causal hazard ratio is smaller than the Cox hazard ratio at all times \(t>0\).

Example

Clayton-copula (Gamma-frailty). In this case, \(\psi (v)=\theta +1\) where \(0<\theta <\infty \). A direct calculation shows that

$$\begin{aligned} \text{ HR }(t)=\phi \exp {\{-\theta \Lambda _0(t)(1-\phi )\}}<\phi , \end{aligned}$$

for \(t>0\) so that \(\Lambda _0(t)>0\). Clearly,

$$\begin{aligned} 0< \text{ HR }(t)<\phi , \end{aligned}$$

with the value of \(\text{ HR }(t)\) depending on \(\theta \) and \(\Lambda _0(t)\). \(\square \)

Example

Inverse Gaussian copula. In this case, \(\psi (v)=1+\frac{1}{\eta -\log v}\), with \(\eta >0\). Independence of \(T^0\) and \(T^1\) is obtained when \(\eta \) tends to infinity, while \(\eta \) small results in high correlation. Again, a direct calculation shows that

$$\begin{aligned} \text{ HR }(t)=\phi \left\{ \frac{\eta +\phi \Lambda _0(t)}{\eta +\Lambda _0(t)}\right\} <\phi \end{aligned}$$

when \(\Lambda _0(t)>0\). It is seen that \( \lim _{\eta \rightarrow 0}\text{ HR }(t)=\phi ^2, \) again when \(\Lambda _0(t)>0\). So,

$$\begin{aligned} \phi ^2< \text{ HR }(t)<\phi , \end{aligned}$$

for \(t>0\) with \(\Lambda _0(t)>0\). \(\square \)

These two examples show that \(\text{ HR }(t)<\phi \) as the copulas used are indeed both Archimedean. However, they also show that it is not possible to bound \(\text{ HR }(t)\) from below as we can see from the Clayton-copula case that it can go all the way down to zero, and obviously there is no way of checking the joint distribution of \((T^0,T^1)\).
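Both closed forms can be reproduced from the integral expression for \(\Omega (t)\) given above. The following Python sketch evaluates that integral with a simple midpoint rule for the two cross-ratio functions and compares the result with the closed-form expressions. The values of \(\phi \), \(\Lambda _0(t)\), \(\theta \) and \(\eta \) are illustrative assumptions.

```python
import math

def omega(psi, cum0, phi, n=20000):
    """Omega(t) from the integral expression; psi is the cross-ratio as a function of S."""
    lower, upper = math.exp(-cum0), math.exp(-phi * cum0)
    h = (upper - lower) / n
    # Midpoint rule for the integral of psi(y) / y between the two survival levels
    integral = h * sum(psi(lower + (i + 0.5) * h) / (lower + (i + 0.5) * h)
                       for i in range(n))
    return math.exp(-integral) * math.exp(-cum0 * (phi - 1.0))

phi, cum0, theta, eta = 0.75, 1.2, 0.8, 0.5  # illustrative values (assumptions)

hr_clayton = phi * omega(lambda y: theta + 1.0, cum0, phi)
hr_inv_gauss = phi * omega(lambda y: 1.0 + 1.0 / (eta - math.log(y)), cum0, phi)

closed_clayton = phi * math.exp(-theta * cum0 * (1.0 - phi))
closed_inv_gauss = phi * (eta + phi * cum0) / (eta + cum0)

print(hr_clayton, closed_clayton)       # numerical vs closed form, Clayton
print(hr_inv_gauss, closed_inv_gauss)   # numerical vs closed form, inverse Gaussian
```

Any other cross-ratio function of an Archimedean copula can be passed as `psi` in the same way.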

We now return to the general case, where the copula C may not be Archimedean. We see from (9) that

$$\begin{aligned} \text{ HR }(t)<\phi \iff u_t\dot{C}_u(u_t,v_t)> v_t\dot{C}_v(u_t,v_t) \end{aligned}$$

since the first order partial derivatives of C are never negative (Nelsen 2006). Since \( v_t>u_t\) as \(\phi <1\), it follows if the copula satisfies

$$\begin{aligned} u\dot{C}_u(u,v)- v\dot{C}_v(u,v)> 0,\, \text{ for }\ 0\le u<v \le 1, \end{aligned}$$
(11)

that \(\text{ HR }(t)<\phi \). Copulas obeying (11) then induce the positive correlation needed in order to have \(\text{ HR }(t)<\phi \). We are not aware of previous work that has studied copulas where (11) holds. The condition (11) is local in the sense that it needs to hold for all \(0\le u<v\le 1\) whereas well known correlation measures such as Kendall’s \(\tau \) and Spearman’s \(\rho \) are global, e.g. Spearman’s \(\rho \) can be expressed in terms of the copula as

$$\begin{aligned} \rho = 12 \int _0^1\int _0^1 \{C(u,v)-uv\}dudv. \end{aligned}$$

Alternatively, there are various types of local dependence measures (Nelsen 2006), one being the so-called positive quadrant dependence (PQD): two random variables X and Y are said to be PQD if

$$\begin{aligned} P(X\le x,Y\le y)\ge P(X\le x)P(Y\le y),\;\text{ for } \text{ all }\ x\ \text{ and }\ y, \end{aligned}$$

which is equivalent to

$$\begin{aligned} C(u,v)\ge \Pi (u,v)=uv, \, (u,v)\in [0,1]^2, \end{aligned}$$

where C is the corresponding copula and \(\Pi \) is the independence copula. We now construct a copula that gives rise to PQD, but for which (11) does not hold. For any \(\xi \in (1/4,1/2)\) (Nelsen 2006), define

$$\begin{aligned} C^{*}(u,v) = \left\{ \begin{array}{ll} \xi M(\frac{u}{\xi },\frac{v}{\xi }) &{} \text {if}\ (u,v)\in [0,\xi ]^2=J_1\\ \xi +(1-2\xi ) W(\frac{u-\xi }{1-2\xi },\frac{v-\xi }{1-2\xi }) &{} \text {if}\ (u,v)\in [\xi ,1-\xi ]^2=J_2\\ 1-\xi +\xi M(\frac{u+\xi -1}{\xi },\frac{v+\xi -1}{\xi })&{} \text {if}\ (u,v)\in [1-\xi ,1]^2=J_3\\ M(u,v)&{} \text {otherwise} \end{array}\right. \end{aligned}$$

with \(W(u,v)= \text{ max }(u+v-1,0),\, M(u,v)=\text{ min }(u,v) \) being the lower and upper Fréchet-Hoeffding bounds, respectively. This copula \(C^{*}\), which is the ordinal sum of \(\{M,W,M\}\) with respect to the partition \(\{[0,\xi ],[\xi ,1-\xi ],[1-\xi ,1]\}\), is PQD (Nelsen 2006, Exercise 5.30). The suggested partition of \([0,1]^2\) is illustrated in Fig. 2, where \(\xi \) is set to 0.35. The expression for the copula can be simplified to

$$\begin{aligned} C^{*}(u,v) = \left\{ \begin{array}{ll} M(u,v) &{} \text {if}\ (u,v)\notin J_2\\ \xi +(u+v-1)I(u+v>1) &{} \text {if}\ (u,v)\in J_2\\ \end{array} \right. \end{aligned}$$

Hence, for \((u,v)\in J_2\), we have

$$\begin{aligned} u\dot{C^{*}}_u(u,v)- v\dot{C^{*}}_v(u,v)=(u-v)I(u+v>1), \end{aligned}$$

which may be smaller than zero for \(u<v\), and therefore (11) does not hold. Hence, \( \text{ PQD }\) does not suffice to ensure (11), which may not be surprising as PQD, although a local dependence measure, is cumulative in that it depends on survival functions and not hazard functions as does the cross-ratio function.

By a direct calculation, Spearman’s \(\rho \) for this copula is given by

$$\begin{aligned} \rho (\xi )=\xi (16\xi ^2-24\xi +12)-1 \end{aligned}$$

that is an increasing function of \(\xi \) with \(\rho (1/4)=0.75\) and \(\rho (1/2)=1\). Surprisingly, this copula therefore gives rise to a situation where \(\text{ HR }(t)>\phi \) for some t’s even when \(\phi <1\) and Spearman’s \(\rho \ge 0.75\).
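The closed form for Spearman’s \(\rho \) can be checked against the integral definition given earlier. The Python sketch below evaluates \(12\int \int \{C^{*}(u,v)-uv\}dudv\) with a midpoint rule on a regular grid; the grid size is an arbitrary choice, so the agreement is only approximate.

```python
def c_star(u, v, xi):
    """Ordinal sum of {M, W, M}: equals min(u, v) outside J_2 = [xi, 1 - xi]^2."""
    if xi <= u <= 1.0 - xi and xi <= v <= 1.0 - xi:
        return xi + max(u + v - 1.0, 0.0)
    return min(u, v)

def spearman_rho(xi, n=400):
    """rho = 12 * double integral of {C*(u, v) - u v}, midpoint rule on an n x n grid."""
    h = 1.0 / n
    total = 0.0
    for i in range(n):
        u = (i + 0.5) * h
        for j in range(n):
            v = (j + 0.5) * h
            total += c_star(u, v, xi) - u * v
    return 12.0 * total * h * h

def rho_closed_form(xi):
    """rho(xi) = xi (16 xi^2 - 24 xi + 12) - 1."""
    return xi * (16.0 * xi ** 2 - 24.0 * xi + 12.0) - 1.0

for xi in (0.25, 0.35, 0.45):
    print(xi, round(spearman_rho(xi), 4), round(rho_closed_form(xi), 4))
```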

In particular, we have that

$$\begin{aligned} \text{ HR }(t)=\phi \Omega (t),\quad \text{ with }\quad \Omega (t)=\frac{\exp {\{-\phi \Lambda _0(t)\}}}{\exp {\{-\Lambda _0(t)\}}}, \end{aligned}$$

when

$$\begin{aligned}&\exp {\{-\phi \Lambda _0(t)\}}+\exp {\{-\Lambda _0(t)\}}>1\quad \text{ and } \end{aligned}$$
(12)
$$\begin{aligned}&\xi<\exp {\{-\Lambda _0(t)\}}<\exp {\{-\phi \Lambda _0(t)\}}<1-\xi . \end{aligned}$$
(13)

Condition (12) corresponds to the part of \(J_2\) above the anti-diagonal (the \(v=1-u\) line, see Fig. 2) and condition (13) corresponds to the part of \(J_2\) above the diagonal (\(v=u\)). If we follow the path \((\exp {\{-\Lambda _0(t)\}},\exp {\{-\phi \Lambda _0(t)\}})\) (starting at the point (1,1) when \(t=0\)) in \([0,1]^2\) in such a way that we pass through the upper triangle of \(J_2\) (with \(J_2\) consisting of 4 triangles like the back of an envelope, see Fig. 2) then \(\text{ HR }(t)\) will be equal to 0 as long as we have not yet reached \(J_2\) but then it jumps to \(k(u,v)\phi \), where \(k(u,v)>1\). When we cross the line \(v=1-u\) in \(J_2\), \(\text{ HR }(t)\) is undefined as both partial derivatives of the copula are equal to zero, and when we escape \(J_2\) then \(\text{ HR }(t)=0\). In “Appendix A.4”, we gain further insight by generating samples from this latter copula. The empirical result in Fig. 3 confirms the above reasoning that more than just positive correlation between \(T^0\) and \(T^1\) is needed to have that \(\text{ HR }(t)<\phi \).
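The behaviour of \(\text{ HR }(t)\) along this path can be traced with a few lines of Python. The sketch below plugs the partial derivatives of \(C^{*}\) into (9) and evaluates \(\text{ HR }(t)\) at several values of \(\Lambda _0(t)\). It takes \(\xi =0.35\) as in Fig. 2 and \(\phi =2/3\) as in Fig. 3; the grid of \(\Lambda _0(t)\) values is an illustrative assumption, and the derivatives are only evaluated away from the kinks of the copula.

```python
import math

def c_star_partials(u, v, xi):
    """Partial derivatives (dC*/du, dC*/dv), valid away from the kinks of the copula."""
    if xi < u < 1.0 - xi and xi < v < 1.0 - xi:
        inside = 1.0 if u + v > 1.0 else 0.0    # J_2: C* = xi + (u + v - 1) I(u + v > 1)
        return inside, inside
    return (1.0, 0.0) if u < v else (0.0, 1.0)  # elsewhere: C* = min(u, v)

def causal_hr(cum0, phi, xi):
    """HR(t) = phi * v * C_v / (u * C_u) from (9); None where the denominator vanishes."""
    u, v = math.exp(-cum0), math.exp(-phi * cum0)
    c_u, c_v = c_star_partials(u, v, xi)
    if c_u == 0.0:
        return None
    return phi * v * c_v / (u * c_u)

phi, xi = 2.0 / 3.0, 0.35
for cum0 in (0.3, 0.7, 0.8, 1.0, 1.5):
    # Before J_2: 0; upper triangle of J_2: above phi; below v = 1 - u: undefined; after J_2: 0
    print(cum0, causal_hr(cum0, phi, xi))
```

In the upper triangle of \(J_2\) the value equals \(\phi \exp \{(1-\phi )\Lambda _0(t)\}\), which exceeds \(\phi \) whenever \(\phi <1\).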

Fig. 2

Illustration of the regions used in the definition of the copula \(C^{*}\), Sect. 4.2, showing the rectangular regions \(J_1\), \(J_2\) and \(J_3\). The value of \(\xi \) is set to 0.35. The two broken lines are the lines \(v=u\) and \(v=1-u\), respectively

Fig. 3

Estimated (points) and theoretical (full line) values of \(\text{ HR }(t)\). Broken line corresponds to \(\phi =2/3\)

If one allows for negative correlation between \(T^0\) and \(T^1\) then it is easy to construct a situation where \(\Omega (t)>1\). We just need a copula C that satisfies

$$\begin{aligned} u\dot{C}_u(u,v)- v\dot{C}_v(u,v)< 0,\, \text{ for }\ 0\le u<v \le 1. \end{aligned}$$

Take for example the Farlie–Gumbel–Morgenstern copulas, where

$$\begin{aligned} C_{\theta }(u,v)=uv+\theta uv(1-u)(1-v), \end{aligned}$$

with \(\theta \in [-1,1]\). Specifically, let \(\theta =-1\), corresponding to Spearman’s \(\rho =-1/3\), then \(\Omega (t)>1\) and we can even have that \(\text{ HR }(t)=\phi \Omega (t)>1\) although \(\phi <1\). Negative correlation between \(T^0\) and \(T^1\) does not seem very likely, however.
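The claim can be checked numerically. The following is a minimal sketch, assuming the parametrisation \(C_{\theta }(u,v)=uv\{1+\theta (1-u)(1-v)\}\) (under which \(\theta =-1\) gives Spearman’s \(\rho =-1/3\)), the path \(u_t=\exp {\{-\Lambda _0(t)\}}\), \(v_t=\exp {\{-\phi \Lambda _0(t)\}}\), and \(\Omega (t)=v_t\dot{C}_v/(u_t\dot{C}_u)\). The value \(\phi =0.8\) and the grid of cumulative baseline hazards are hypothetical choices made only for illustration.

```python
import math

def fgm_partials(u, v, theta):
    """Partial derivatives (dC/du, dC/dv) of C(u,v) = u*v*(1 + theta*(1-u)*(1-v))."""
    c_u = v * (1.0 + theta * (1.0 - v) * (1.0 - 2.0 * u))
    c_v = u * (1.0 + theta * (1.0 - u) * (1.0 - 2.0 * v))
    return c_u, c_v

def omega(cum_haz0, phi, theta):
    """Omega = v*C_v / (u*C_u) along the path u = exp(-L0), v = exp(-phi*L0)."""
    u = math.exp(-cum_haz0)
    v = math.exp(-phi * cum_haz0)
    c_u, c_v = fgm_partials(u, v, theta)
    return (v * c_v) / (u * c_u)

# Hypothetical marginal hazard ratio phi < 1; theta = -1 as in the text.
phi, theta = 0.8, -1.0
for cum_haz0 in (0.5, 1.0, 2.0, 5.0):
    om = omega(cum_haz0, phi, theta)
    print(f"Lambda0={cum_haz0:>3}: Omega={om:.3f}, HR=phi*Omega={phi * om:.3f}")
```

In this sketch \(\Omega (t)\) exceeds 1 at every grid point, and \(\phi \Omega (t)\) rises above 1 once the cumulative baseline hazard is large, even though \(\phi <1\).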

4.3 Hazard differences

Define

$$\begin{aligned} \text{ HD }(t)= & {} \lim _{h\rightarrow 0}P(t\le T^1< t+h|T^0\ge t, T^1 \ge t)/h\\&-\lim _{h\rightarrow 0}P(t\le T^0< t+h|T^0\ge t,T^1\ge t)/h, \end{aligned}$$

which is a causal hazard function contrast. In the additive hazards model (5), \(\psi (t)\) reduces to the causal contrast, \(\psi (t)=\text{ HD }(t), \) provided that (6) holds (see “Appendix A.3”). However, as discussed in Sect. 4.1, the latter assumption is both untestable and biologically implausible.

Since A is binary we can always write the hazard function \(\lambda (t;A)\) as

$$\begin{aligned} \lambda (t;A)=\lambda _0(t)+\psi (t) A \end{aligned}$$
(14)

Further, one may show the following relationship between \(\text{ HD }(t)\) and \(\psi (t)\),

$$\begin{aligned} \text{ HD }(t) =\Omega (t)\Psi (t)\{\lambda _0(t)+\psi (t)\} -\Psi (t)\lambda _0(t) \end{aligned}$$
(15)

where

$$\begin{aligned} \Omega (t)=\frac{ v_t\dot{C}_v(u_t,v_t)}{u_t\dot{C}_u(u_t,v_t)}, \quad \Psi (t)=\frac{ u_t\dot{C}_u(u_t,v_t)}{C(u_t,v_t)}, \end{aligned}$$

and where we specify the joint distribution of \((T^0,T^1)\) via

$$\begin{aligned} P(T^0\ge t_0,T^1\ge t_1)=P(U\le u, V\le v)=C(u,v) \end{aligned}$$

with \(U=u_{T^0}\), \(u=u_{t_0}\), \(V=v_{T^1}\) and \(v=v_{t_1}\). We now show that an Aalen additive hazards analysis can easily lead to the same misleading conclusion as was the case for the Cox regression analysis with a treatment-by-time interaction that we considered in Sect. 3.1. If the DGP is given by \(\lambda (t;A,Z)\) in (4) then \(\psi (t)\) in (14) has the form

$$\begin{aligned} \psi (t) =\lambda _0(t)(e^{\beta _1}-1)I(t\le \nu ) +\lambda _0(t)(e^{\beta _2}-1)I(t> \nu ). \end{aligned}$$

If \(\beta _2=0\) we see that \(\psi (t) =0\) for \(t>\nu \) and an Aalen additive hazards analysis would therefore also indicate that the treatment effect vanishes after time point \(\nu \). We have illustrated this in the Supplementary Material, where the marginal Aalen additive hazards model fits the data perfectly but also suggests a beneficial effect of the treatment in an initial period (first 4 years), which then disappears; see Figure 5 in the Supplementary Material. In this specific setting, we have, for \(t>\nu \), that

$$\begin{aligned} \text{ HD }(t) =\{\Omega (t)-1\}\Psi (t)\lambda _0(t)<0 \end{aligned}$$

since both \(\Omega (t)\) and \(\Psi (t)\) are within the interval (0, 1). Note also, for this specific example, that \(\lambda (t;A,Z)\) is not of the form (5).

One may also construct scenarios where the DGP is of the form (5) and where \(\psi (t)\ne \text{ HD }(t) \) if (6) does not hold. Hence, whether or not \(\psi (t)\) in (14) has a causal interpretation in terms of a contrast between hazard functions depends on unverifiable assumptions.
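Relationship (15) can be explored numerically. The sketch below is only an illustration under stated assumptions: it reads \(u_t\) and \(v_t\) as the marginal survival probabilities of \(T^0\) and \(T^1\), and it uses a Clayton copula, which is our own illustrative choice and not necessarily the dependence structure implied by (4). The constant baseline hazard, the early hazard ratio of 0.5, \(\nu =4\) and the copula parameter are all hypothetical values.

```python
import math

def clayton_omega_psi(u, v, theta):
    """Omega(t) and Psi(t) of (15) for a Clayton copula with parameter theta > 0."""
    c = (u ** -theta + v ** -theta - 1.0) ** (-1.0 / theta)   # C(u, v)
    u_cu = u ** -theta * c ** (1.0 + theta)                   # u * dC/du
    v_cv = v ** -theta * c ** (1.0 + theta)                   # v * dC/dv
    return v_cv / u_cu, u_cu / c

def hazard_contrasts(t, lam0=0.1, hr_early=0.5, nu=4.0, theta=1.0):
    """Return (psi(t), HD(t)) for a constant baseline hazard lam0 (hypothetical values)."""
    cum0 = lam0 * t
    cum1 = lam0 * (hr_early * min(t, nu) + max(t - nu, 0.0))
    u, v = math.exp(-cum0), math.exp(-cum1)   # marginal survival: control, treated
    psi = lam0 * (hr_early - 1.0) if t <= nu else 0.0
    om, ps = clayton_omega_psi(u, v, theta)
    hd = om * ps * (lam0 + psi) - ps * lam0   # relationship (15)
    return psi, hd

for t in (2.0, 6.0, 10.0):
    psi, hd = hazard_contrasts(t)
    print(f"t={t:>4}: psi={psi:+.4f}, HD={hd:+.4f}")
```

After \(\nu \) the marginal contrast \(\psi (t)\) is exactly zero while \(\text{ HD }(t)\) stays negative in this sketch. Letting the copula parameter tend to zero (independence of \(T^0\) and \(T^1\)) makes \(\text{ HD }(t)\) approach \(\psi (t)\).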

4.4 Alternative causal contrasts

Given the subtle interpretation of hazard ratios, we believe that the effects of treatments on survival endpoints are best communicated via unconditional effect measures. For a correctly specified Cox proportional hazards model, the relative risk function

$$\begin{aligned} \text{ RR }(t)=\frac{P(T\le t|A=1)}{P(T\le t|A=0)} =\frac{P(T^1\le t)}{P(T^0\le t)}, \end{aligned}$$

can be estimated consistently by

$$\begin{aligned} \widehat{\text{ RR }}(t)= \frac{1-\exp {\{-\hat{\Lambda }_0(t) e^{\hat{\beta }}\}}}{1-\exp {\{-\hat{\Lambda }_0(t)\}}}, \end{aligned}$$

where \(\hat{\beta }\) is the Cox partial likelihood estimator and \(\hat{\Lambda }_0(t)\) is the corresponding Breslow estimator. Alternatively, we may use the restricted mean survival time (RMST) at time t (Uno et al. 2014; Zhao et al. 2016), defined as \(\text{ RMST }(t) =E\{\text{ min }(T, t)\}\). This is the area under the survival curve of T up to time t, which can easily be estimated using the corresponding Kaplan–Meier curve up to time t (see Zhao et al. (2016) for details on inference). Because the \(\text{ RMST }(t)\) is an unconditional measure, contrasts of it between (randomised) treatment groups carry a causal interpretation. One may also report the (restricted) mean time lost before time t, which is defined as \(\text{ RMTL }(t)=t-\text{ RMST }(t)\).

However, the restricted mean residual survival time \(E\{\text{ min }(T, t)-s|T>s\}\) for \(s<t\) leads to the same subtleties as seen for the hazard function, which is due to the conditioning on \(T>s\).
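The estimation of \(\text{ RMST }(t)\) as the area under the Kaplan–Meier curve can be sketched as follows. This is a minimal illustration, not the procedure used in the paper: the function names are hypothetical, the toy data are invented, and no variance estimation is attempted (see Zhao et al. (2016) for inference). It also assumes that the curve is well defined up to the chosen horizon, i.e. that follow-up extends at least that far.

```python
def kaplan_meier(times, events):
    """Return [(event_time, survival_just_after)] for right-censored data.

    events[i] is 1 for an observed event and 0 for a censored observation.
    """
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    surv, curve, i = 1.0, [], 0
    while i < len(data):
        t = data[i][0]
        deaths = removed = 0
        while i < len(data) and data[i][0] == t:   # handle ties at time t
            deaths += data[i][1]
            removed += 1
            i += 1
        if deaths:
            surv *= 1.0 - deaths / n_at_risk
            curve.append((t, surv))
        n_at_risk -= removed
    return curve

def rmst(times, events, tau):
    """Area under the Kaplan-Meier step function on [0, tau]."""
    area, last_t, last_s = 0.0, 0.0, 1.0
    for t, s in kaplan_meier(times, events):
        if t >= tau:
            break
        area += last_s * (t - last_t)
        last_t, last_s = t, s
    return area + last_s * (tau - last_t)

def rmtl(times, events, tau):
    """Restricted mean time lost before tau."""
    return tau - rmst(times, events, tau)

# Toy data (invented): four subjects, all with observed events.
toy_times, toy_events = [1, 2, 3, 4], [1, 1, 1, 1]
print(rmst(toy_times, toy_events, 3), rmtl(toy_times, toy_events, 3))
```

Applying `rmst` separately to each randomised arm and taking the difference or ratio gives the unconditional contrasts discussed above.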

4.5 Analysis of the MRC RE01 study

For the MRC RE01 trial, we fitted Aalen’s additive hazard model

$$\begin{aligned} \lambda (t;A,V)=\beta _0(t)+\psi (t)A+\beta _1(t)^TV, \end{aligned}$$
(16)

where V includes days from metastasis to randomization (log-transformed), WHO performance status (0, 1 and 2; with group 0 and 1 collapsed into one group), and haemoglobin (g/dl). As the treatment variable A is independent of the other covariates, the above Aalen model is collapsible, meaning that the interpretation of \(\psi (t)\) is the same in the conditional and marginal model. The Cox model does not have this property. Model (16) appeared to fit the data well, using the tools described in Chapter 5 in Martinussen and Scheike (2006); specifically, no interaction between the treatment indicator and the baseline risk factors was found. We estimated \(\hat{\Psi }(t)=\int _0^t\psi (s)ds\) both from the conditional model and the marginal model, and the two estimators were almost identical, which, as pointed out, should be the case if model (16) is correctly specified. Using model (16), we next tested the null hypothesis \(\psi (t)=\psi \) of a constant effect (\({P}=0.58\)), which was subsequently estimated to be \(\hat{\psi }=-0.02\) (SE, 0.009). If \(T^0\perp \!\!\!\perp T^1|V\) (or if the addition of all variables Z conditional on which \(T^0\) and \(T^1\) become independent does not change the additive structure of the model), then this is also the causal hazard difference. This would mean that over the course of the follow-up, an average of approximately 2 additional deaths will occur for each month of follow-up in each 100 persons under the control treatment alive at the start of the month and who would also be alive under the IFN treatment, compared with each 100 IFN treated persons alive at the start of the month and who would also be alive under the control treatment.

As suggested before, the assumption that \(T^0\perp \!\!\!\perp T^1|V\) is implausible (just like the assumption that all variables Z for which \(T^0\perp \!\!\!\perp T^1|Z\) do not interact with treatment on the additive hazard scale). In view of this, we additionally report the effects of treatment on the survival chances and restricted mean survival time. The relative risk function estimate, along with 95% pointwise and uniform confidence bands, is displayed in Fig. 4. We see that the estimated relative risk function is below 1 at all times, in favour of IFN treatment. For instance, it is seen that the relative risk at one year is estimated to be approximately 0.85. Judging from the 95% confidence bands (dashed curves), this is close to being significant. A uniform test over the considered time span is also close to being significant judging from the 95% uniform confidence bands.

Fig. 4

MRC RE01 study. IFN treatment versus control treatment. Estimate of relative risk \(\text{ RR }(t)\) along with 95% pointwise confidence bands (dashed curves) and 95% uniform confidence bands (shaded area)

Further, the ‘months of life lost up to 30 months’ \(\text{ RMTL }(30)\) is estimated to be 17.3 for IFN treatment and 19.8 for control treatment. The ratio of these two (IFN vs control) is 0.87 (95% CI, 0.70 to 1.04). Thus, the estimated loss of lifetime during the first 30 months of follow-up is 13% lower on IFN treatment than on the control treatment.
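The arithmetic behind these summary figures can be reproduced directly from the two reported \(\text{ RMTL }(30)\) estimates. The snippet below is only a check of the point estimates; the implied \(\text{ RMST }(30)\) values follow from \(\text{ RMST }(t)=t-\text{ RMTL }(t)\) and are our own derived quantities, computed from rounded inputs. The confidence interval cannot be recovered from these numbers alone.

```python
# Reported estimates (months), taken from the text above.
rmtl_ifn, rmtl_control, horizon = 17.3, 19.8, 30.0

ratio = rmtl_ifn / rmtl_control          # RMTL ratio, IFN vs control
reduction_pct = 100.0 * (1.0 - ratio)    # relative reduction in lifetime lost

# Implied restricted mean survival times, RMST(30) = 30 - RMTL(30).
rmst_ifn = horizon - rmtl_ifn
rmst_control = horizon - rmtl_control

print(f"RMTL ratio: {ratio:.2f}, reduction: {reduction_pct:.0f}%")
print(f"Implied RMST(30): IFN {rmst_ifn:.1f}, control {rmst_control:.1f} months")
```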

5 Concluding remarks

We have argued that the treatment effect in a proportional hazards model carries a causal interpretation, but that its interpretation is subtle. The proportional hazards assumption does not express, for instance, that treatment works equally effectively at all times, as the hazard ratio at a given time mixes differences between treatment arms due to treatment effect as well as selection. The danger of overinterpreting hazard ratios becomes most pronounced when the hazard ratio is not constant over time (e.g. when the hazard ratio is below 1 for some time and then becomes 1). We have argued that this cannot be interpreted as implying that treatment effectiveness disappears after some time. In our opinion, this is the source of much confusion, and a real concern. Non-constant hazard ratios are indeed fairly common in real life because the proportional hazards assumption is a rather unstable assumption in the following sense. Even when valid in some population, this assumption is likely to fail in subgroups of that population (e.g. if one studies men and women separately), and vice versa. This makes the assumption, at best, an approximation in practice.
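The instability of the proportional hazards assumption across levels of aggregation can be illustrated with a small calculation. The sketch below assumes two equally sized subgroups with exponential survival times and the same hazard ratio of 0.5 within each subgroup; the rates, weights and hazard ratio are hypothetical values chosen only for illustration.

```python
import math

def marginal_hazard(t, rates, weights):
    """Hazard at time t of a mixture of exponential subgroups."""
    surv = [w * math.exp(-r * t) for r, w in zip(rates, weights)]
    dens = [r * s for r, s in zip(rates, surv)]
    return sum(dens) / sum(surv)

def marginal_hr(t, control_rates, weights, subgroup_hr):
    """Population hazard ratio when every subgroup has hazard ratio subgroup_hr."""
    treated_rates = [subgroup_hr * r for r in control_rates]
    return (marginal_hazard(t, treated_rates, weights)
            / marginal_hazard(t, control_rates, weights))

control_rates = [0.1, 1.0]   # hypothetical subgroup hazards under control
weights = [0.5, 0.5]         # subgroup proportions at baseline (equal by randomisation)
for t in (0.0, 1.0, 5.0, 20.0):
    print(f"t={t:>4}: marginal HR={marginal_hr(t, control_rates, weights, 0.5):.3f}")
```

Although proportional hazards holds exactly within each subgroup, the population hazard ratio in this sketch starts at 0.5 and then drifts towards 1 before returning, so proportionality fails at the population level.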

We have described hazard contrasts that are causally interpretable because they compare intensities at a given time t with and without treatment for the same patient population: those who would survive that time, no matter what treatment. Such hazard contrasts are not estimable without untestable assumptions concerning the joint distribution of \((T^0,T^1)\) and a sensitivity analysis is therefore the only option. One may alternatively attempt to proceed under the monotonicity assumption that no one is harmed by treatment, i.e.

$$\begin{aligned} T^1\ge T^0 \quad \mathrm { with \ probability \ 1}. \end{aligned}$$

Under this strong assumption, we can write

$$\begin{aligned} P(t\le T< t+h|T\ge t,A=0)= & {} P(t\le T^0< t+h|T^0\ge t) \\= & {} P(t\le T^0< t+h|T^0\ge t,T^1\ge t) \end{aligned}$$

and

$$\begin{aligned} P(t\le T< t+h|T\ge t,A=1)= & {} P(t\le T^1< t+h|T^1\ge t)\\= & {} P(t\le T^1< t+h|T^0\ge t,T^1\ge t)\pi (t)\\&+P(t\le T^1< t+h|T^0< t,T^1\ge t)\left\{ 1-\pi (t)\right\} , \end{aligned}$$

with

$$\begin{aligned} \pi (t) \equiv P(T^0\ge t|T^1\ge t)=\frac{P(T^0\ge t)}{P(T^1 \ge t)} =\frac{P(T\ge t|A=0)}{P(T\ge t|A=1)}. \end{aligned}$$

However, if the distribution of \((T^0,T^1)\) is assumed to be absolutely continuous then one may show that \(\lim _{h\rightarrow 0}h^{-1}P(t\le T^1< t+h|T^0\ge t,T^1\ge t)=0\) and thus \(\text{ HR }(t)=0\) under the monotonicity assumption, see “Appendix A.5”. A further disadvantage of the discussed hazard contrasts, which are causally interpretable, is that they describe the effect for an unknown subgroup of the population.
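The vanishing limit stated above can be motivated by the following sketch, which is our own heuristic and not necessarily the argument of “Appendix A.5”. It assumes, beyond absolute continuity, that the joint density \(f\) of \((T^0,T^1)\) is bounded by some constant \(M\) in a neighbourhood of \((t,t)\). Under monotonicity, \(T^1<t+h\) together with \(T^0\ge t\) forces both event times into \([t,t+h)\), so that

$$\begin{aligned} P(t\le T^1< t+h,\, T^0\ge t)&= P(t\le T^0\le T^1< t+h)\\&= \iint _{t\le x\le y< t+h} f(x,y)\,dx\,dy \le \frac{Mh^2}{2}. \end{aligned}$$

Dividing by \(h\,P(T^0\ge t,T^1\ge t)\) and letting \(h\rightarrow 0\) then gives a limit of zero, provided that \(P(T^0\ge t,T^1\ge t)>0\).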

The reason that the causal hazard ratio (8) is not identifiable without invoking strong assumptions is that it attempts to answer an overly ambitious question. Imagine a trial that randomises participants over an implanted medical device (e.g. a stent or a pacemaker) versus no treatment (or placebo). Suppose that the medical device gradually deteriorates and stops being operational after some time \(\nu \). Then we would say that treatment no longer works from time \(\nu \) onwards. This would correspond with \(\mathrm {HR}(t)=1\) for \(t\ge \nu \). However, how could data from a randomised trial be informative about the effect of treatment after time \(\nu \) when no information is collected on the times at which the medical device is operational or not? To learn about the treatment effect at each time t, we would ideally need data \(A_t\) on whether (\(A_t=1\)) or not (\(A_t=0)\) the device is operational at that time. When the operation time is ignorable, then one may learn about the treatment effect at each time t through contrasts of the form

$$\begin{aligned} \frac{\lim _{h\rightarrow 0}P(s\le T< s+h|T\ge s, \overline{A}_{t-}=1,A_t=1)}{\lim _{h\rightarrow 0} P(s\le T< s+h|T\ge s,\overline{A}_{t-}=1,A_t=0)}, \end{aligned}$$

for all \(s\ge t\), where \(\overline{A}_{t-}\) is the information generated by all \(A_u\), \(u<t\), so \(\overline{A}_{t-}=1\) refers to “being on treatment” up to just before time t. It is unsurprising that without detailed data on the operation times of each device, strong assumptions are needed to develop insight into the dynamic nature of the treatment effect. A better strategy in practice, when interest lies in the dynamic aspects of a treatment, is therefore to design the study such that the collected data provide immediate insight into the dynamic aspects of treatment (e.g. by modifying treatment assignments over time).

While we have argued that time-varying hazard ratios are not causally interpretable, note that they continue to be valuable for descriptive purposes when addressing non-causal questions. Imagine for example a setting where the outcome is time to death after onset of a certain disease, and say that there are two subclasses (\(A=1\) and \(A=0\)) of the disease. Suppose we know from a previous study where model (2) was used, and was correctly specified, that \(\exp {(\beta _1)}=2\), \(\beta _2=0\), for \(\nu =2\) years. Assume also that all other risk factors were equally balanced in the two subclasses at time 0 (onset of disease) or that proper adjustment for these factors was performed in the hazard model. Then, based on this analysis, we can say that patients in disease subclass 1 who survive the first two years from onset of the disease will have the same subsequent risk of dying as those in disease subclass 0, a statement that may be useful for counselling patients at time \(\nu =2\) years, or later.

The difficulties we have described concerning the causal interpretation of hazard ratios also pertain to hazard differences. This problem is thus not only an issue for proportional hazards models but also for additive hazards models. The root cause of the problem is the interpretation of the hazard function itself, more than its particular structure.