1 Introduction

The famous model of Cox (1972) for proportional hazards is very popular in practice. That is why tests for checking its model assumption are needed, i.e., \(\alpha _2(t)=\vartheta \alpha _1(t)\) for all \(t \in [0,\infty )\) in the two-sample case, where \(\alpha _j\) denotes the group-specific hazard rate. Since the Cox model is more than 40 years old, it is not surprising that several statisticians have already suggested methods for checking the proportional hazards assumption. It is not possible to comment on the whole recent literature, but we want to give at least an (incomplete) list of contributions to this topic: Cox (1972), Schoenfeld (1982), Lin (1991), Lin et al. (1993), Grambsch and Therneau (1994), Hess (1995), Sengupta et al. (1998), Scheike and Martinussen (2004), Kraus (2009) and Chen et al. (2015). Under the assumption of proportional hazards in the two-sample case, the weighted rank estimators of Andersen (1983) can be used to estimate the unknown proportionality factor \(\vartheta \). The idea of Gill and Schumacher (1987) was to compare two different estimators of this kind. A similar approach is due to Bluhmki et al. (2019), who measured the discrepancy of the ratio process of the group-specific Nelson–Aalen estimators from being constant. In contrast to the main contributions, which are based on Gaussian approximations, the method of Bluhmki et al. (2019) rests on a resampling technique, namely the wild bootstrap of Wu (1986). Resampling techniques are well known to improve the tests’ finite sample performance.

In this paper, we suggest a (further) new test for checking the Cox model in the two-sample case. The novelty is not the test statistic itself, which is similar to the one of Wei (1984), but that the test can be conducted as a permutation as well as a wild bootstrap test. For small sample sizes, resampling tests are known to (often) perform better than their asymptotic counterparts, which is why they should be preferred. Moreover, a favorable benefit of the permutation test is its finite exactness under exchangeability. At first sight it may be surprising that the permutation approach works for the whole null of proportional hazards since, in general, the data are not exchangeable, for two reasons: (1) we allow different censoring distributions for the two groups; (2) the proportionality factor \(\vartheta \) may differ from 1. Neuhaus (1993) already showed that his permutation approach works despite the first point for testing distributional equality in the two-sample setting, confer Janssen and Pauls (2003) and Pauly (2011) for a general setting, and in this paper we prove the extension of this idea to the null of proportional hazards.

The paper is organized as follows. In Sect. 2, we introduce the survival set-up, the counting process notation, the logrank process and Andersen’s estimator for the proportionality factor \(\vartheta \). The asymptotic exactness of the test under the null and the test’s consistency are derived in Sect. 3. The resampling techniques, permutation and wild bootstrapping, are introduced in Sect. 4. Moreover, we transfer the theoretical asymptotic properties of the test to both resampling versions. Simulations for various scenarios comparing our method with those of Gill and Schumacher (1987) and Grambsch and Therneau (1994) are presented in Sect. 5. In Sect. 6, we apply our tests to two real data sets to demonstrate the practical applicability. All proofs are deferred to the “Appendix”.

2 Two-sample survival set-up

Let the usual two-sample survival set-up be given by survival times \(T_{j,i}\sim F_j\) and censoring times \(C_{j,i}\sim G_j\) for all individuals \(i=1,\ldots ,n_j\) within the two groups \(j\in \{1,2\}\), where \(F_j\) and \(G_j\) are continuous distribution functions on the positive line. All random variables \(T_{1,1},C_{1,1},\ldots ,T_{2,n_2}, C_{2,n_2}\) are assumed to be independent. We denote by \(n=n_1+n_2\) the pooled sample size, which is supposed to tend to infinity in our asymptotic considerations below. To shorten the notation, all subsequent limits are meant as \(n\rightarrow \infty \) if not explicitly stated otherwise. The functions of interest are the (group-specific) cumulative hazard functions \(A_1,A_2\) of the survival times defined by \(A_j(t)=-\log (1-F_j(t))=\int _0^t(1-F_j{(s)})^{-1}\,\mathrm { d }F_j{(s)}\)\((t>0)\). To be more specific, we want to test the proportional hazards assumption, i.e.,

$$\begin{aligned} { H_0} : A_2 = \vartheta A_1,\;\vartheta >0, \end{aligned}$$

while only the possibly censored times \(X_{j,i}=\min (T_{j,i},C_{j,i})\) and their censoring status \(\delta _{j,i}={\mathbf {1}}\{X_{j,i}=T_{j,i}\}\)\((j=1,2;\,i=1,\ldots ,n_j)\) can be observed.

In the following, we introduce important statistics and estimators by adopting the counting process notation of Andersen et al. (1993). Define \(N_{j,i}(t)={\mathbf {1}}\{X_{j,i}\le t,\,\delta _{j,i}=1\}\) and \(Y_{j,i}(t)={\mathbf {1}}\{X_{j,i}\ge t\}\)\((t\ge 0)\). Then \(N_j(t)=\sum _{i=1}^{n_j}N_{j,i}(t)\) equals the number of uncensored survival times, so-called events, in group j up to time point t, and \(Y_j(t)=\sum _{i=1}^{n_j}Y_{j,i}(t)\) counts the individuals of group j at risk at time point t. In a similar way, we can interpret the pooled versions \(N=N_1+N_2\) and \(Y=Y_1+Y_2\). The Nelson–Aalen estimator \({{\widehat{A}}}_j\) for \(A_j\) is given by

$$\begin{aligned} {{\widehat{A}}}_j(t)=\int _0^t \frac{{\mathbf {1}}\{Y_j{(s)}>0\}}{Y_j{(s)}}\,\mathrm { d }N_j{(s)}\quad (t\ge 0;\ j=1,2). \end{aligned}$$
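
For illustration, \({{\widehat{A}}}_j(t)\) can be computed directly from the observed pairs of one group. The following is a minimal R sketch, assuming vectors x (observed times) and d (censoring statuses); the function name nelson_aalen is ours:

```r
# Minimal sketch of the Nelson-Aalen estimator for one group:
# sum the increments dN_j(s)/Y_j(s) over all event times s <= t.
nelson_aalen <- function(x, d, t) {
  ev <- sort(unique(x[d == 1 & x <= t]))    # event times in [0, t]
  sum(vapply(ev, function(s) {
    sum(x == s & d == 1) / sum(x >= s)      # dN_j(s) / Y_j(s)
  }, numeric(1)))
}
```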

It is well known that the Nelson–Aalen estimator obeys a central limit theorem, see Andersen et al. (1993). Heuristically, we have \({\widehat{A}}_2 \approx \vartheta {{\widehat{A}}}_1\) under \({ H_0} \). Based on this idea, Andersen (1983) as well as Begun and Reid (1983) suggested estimating the underlying \(\vartheta \) by weighted rank estimators given by

$$\begin{aligned} {{\widehat{\vartheta }}}_{K} = \frac{ \int K{(s)} \,\mathrm { d }{{\widehat{A}}}_2{(s)}}{\int K{(s)} \,\mathrm { d }{{\widehat{A}}}_1{(s)} } \end{aligned}$$
(1)

for an appropriate (predictable) weight function K. Famous weight functions K corresponding to weighted logrank tests have the shape \(K=(w\circ {{\widehat{F}}}) n^{-1}Y_1Y_2/(Y_1+Y_2)\) for \(w\in \mathscr {W}=\{w:[0,1]\rightarrow (0,\infty )\) continuous and of bounded quadratic variation\(\}\), where \({{\widehat{F}}}\) denotes the Kaplan–Meier estimator of the pooled sample, i.e.,

$$\begin{aligned} 1-{{\widehat{F}}}(t) = \prod _{(j,i):X_{j,i}\le t} \Bigl ( 1- \frac{\delta _{j,i} }{Y(X_{j,i})} \Bigr )\quad (t\ge 0). \end{aligned}$$

This class includes, among others, the classical logrank weight \(K_L=n^{-1}Y_1Y_2/(Y_1+Y_2)\) and the weight \(K_{\text {HF}}=K_{\text {L}} {{\widehat{F}}}^\rho \) suggested by Harrington and Fleming (1982). By Gill (1980), the estimator \({{\widehat{\vartheta }}}_K\) obeys a central limit theorem. The test for \({ H_0} \) of Gill and Schumacher (1987) is based on the observation that the quotient of two different rank estimators \({\widehat{\vartheta }}_1\) and \({{\widehat{\vartheta }}}_2\) is approximately equal to 1, independently of the true proportionality factor \(\vartheta \). Closely related to these rank estimators are (extended) rank tests for the null of equal distributions \(H^=_0: F_1=F_2\), or equivalently \(H^=_0: A_1=A_2\). Probably the most famous member is the classical logrank test corresponding to the weight \(K_L\). This specific test, first proposed by Mantel (1966) and Peto and Peto (1972), is asymptotically efficient for proportional hazards alternatives. When enlarging the restricted null \(H^=_0\) to our null \(H_0\) of proportional hazards, we need, among other things, to correct \(K_L\). The corrections lead to the following rescaled logrank statistic:

$$\begin{aligned} S_n(t,\vartheta )= \Bigl ( \frac{n}{n_1n_2} \Bigr )^{1/2}\int _{[0,X_{([nt])}]} \frac{Y_1Y_2}{Y_1+\vartheta Y_2} ( \,\mathrm { d }{{\widehat{A}}}_2-\vartheta \,\mathrm { d }{{\widehat{A}}}_1), \end{aligned}$$
(2)

where [x] denotes the integer part of x, we set \(X_{(0)}=0\), and \(X_{(1)}\le \cdots \le X_{(n)}\) are the order statistics of the pooled sample. \(S_n(1,1)\) is the classical logrank statistic. The additional argument \(\vartheta \) is a correction for the case \(A_2=\vartheta A_1\); details are given in the subsequent section. By Gill (1980), \(S_n(1,1)\) is asymptotically normal. An extension of his estimator for the limiting variance to our situation is

$$\begin{aligned} {{\widehat{\sigma }}}^2(t,\vartheta ) = \frac{n}{n_1n_2} \int {\mathbf {1}}_{[0,X_{([nt])}]} \frac{\vartheta Y_1Y_2}{(Y_1+\vartheta Y_2)^2 }\,\mathrm { d }(N_1+N_2). \end{aligned}$$
(3)

Both \(S_n(t,\vartheta )\) and \({{\widehat{\sigma }}}^2(t,\vartheta )\) can be rewritten as linear rank statistics by noting that all involved processes only jump at the order statistics \(X_{(i)}\). Let \(\delta _{(i)}\) be the censoring status corresponding to \(X_{(i)}\). Moreover, we introduce the group status \(c_{(i)}\) of the individual corresponding to \(X_{(i)}\); to be more specific, \(c_{(i)}=1\) if \(X_{(i)}\) belongs to the second group and \(c_{(i)}=0\) if it belongs to the first group. Then

$$\begin{aligned}&S_n(t,\vartheta )=\Bigl ( \frac{n}{n_1n_2} \Bigr )^{1/2}\sum _{i=1}^{[nt]} \delta _{(i)} \Bigl ( c_{(i)} - \frac{\vartheta \sum _{m=i}^nc_{(m)}}{\sum _{m=i}^n(1-c_{(m)}) + \vartheta \sum _{m=i}^nc_{(m)}} \Bigr ), \end{aligned}$$
(4)
$$\begin{aligned}&{{\widehat{\sigma }}}^2(t,\vartheta ) = \frac{n}{n_1n_2} \sum _{i=1}^{[nt]} \delta _{(i)} \frac{\vartheta \sum _{m=i}^nc_{(m)}\sum _{m=i}^n(1-c_{(m)})}{(\sum _{m=i}^n(1-c_{(m)}) + \vartheta \sum _{m=i}^nc_{(m)})^2}. \end{aligned}$$
(5)
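
The rank forms (4) and (5) are easy to vectorize. The following R sketch assumes vectors delta and cc holding \(\delta _{(i)}\) and \(c_{(i)}\) in the order of the pooled order statistics; the function name Sn_sigma is ours:

```r
# Sketch of (4) and (5): the tail sums of cc and 1 - cc equal the
# numbers at risk Y_2(X_(i)) and Y_1(X_(i)), respectively.
Sn_sigma <- function(delta, cc, theta, t = 1) {
  n <- length(delta); n2 <- sum(cc); n1 <- n - n2
  y2 <- rev(cumsum(rev(cc)))         # sum_{m >= i} c_(m)
  y1 <- rev(cumsum(rev(1 - cc)))     # sum_{m >= i} (1 - c_(m))
  i <- seq_len(floor(n * t + 1e-9))  # [nt], guarded against rounding
  Sn <- sqrt(n / (n1 * n2)) *
    sum(delta[i] * (cc[i] - theta * y2[i] / (y1[i] + theta * y2[i])))
  s2 <- (n / (n1 * n2)) *
    sum(delta[i] * theta * y1[i] * y2[i] / (y1[i] + theta * y2[i])^2)
  c(Sn = Sn, sigma2 = s2)
}
```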

3 Our test statistic and its asymptotic properties

For our asymptotic considerations we need two (common) assumptions. Let \(\tau _j=\sup \{x>0: (1-F_j(x))(1-G_j(x))>0\}\)\((j=1,2)\) be the upper limit of the observation times \(X_{j,i}\) of group j, where \(\tau _j =\sup [0,\infty )=\infty \) is allowed. Due to the weighting integrands in (2) and (3), only observations \(X_{j,i}\le \tau =\min (\tau _1,\tau _2)\) have an impact on our statistical analysis. To ensure that not only censored times are observed, we suppose that \(F_1(\tau )>0\) or \(F_2(\tau )>0\) holds. Moreover, we suppose that no group vanishes asymptotically, i.e.,

$$\begin{aligned} 0< \liminf _{n\rightarrow \infty } \frac{n_1}{n} \le \limsup _{n\rightarrow \infty }\frac{n_1}{n}<1. \end{aligned}$$

Continuous martingale techniques are a favorable tool to obtain distributional convergence of the weighted rank estimators as well as of the weighted rank statistics or, more generally, of the weighted rank processes. Since most of these statistics can easily be written as linear rank statistics, compare (4), discrete martingale techniques seem to be more natural. In our situation, it can be shown that \(i\mapsto S_n(i/n,\vartheta )\) is a discrete martingale with respect to the filtration

$$\begin{aligned} \mathscr {F}_{n,i} = \sigma ( d_{(1)},\ldots ,d_{(i)}, \delta _{(1)},\ldots ,\delta _{(i+1)}), \end{aligned}$$
(6)

where \(d_{(1)},\ldots ,d_{(n)}\in \{(j,i):\,1\le i \le n_j;\,j=1,2\}\) are the so-called anti-ranks, i.e., if \(d_{(k)}=(j,i)\) then \(X_{(k)}\) is the value of individual i in group j. Compared to the usual filtration of the continuous approach, we know a little bit about the future under \(\mathscr {F}_{n,i}\), namely the next censoring status \(\delta _{(i+1)}\). Using this discrete filtration, Janssen and Neuhaus (1997) pointed out that the summands of \(S_n(1,1)\) in (4) can be interpreted as “observed minus expected” under the restricted null \(H_0^=:A_1=A_2\), see also Lemma 5.3 of Janssen and Werft (2004). We show that this interpretation extends to general \(\vartheta \) and our general null \(H_0\). Consequently, we can apply an appropriate discrete martingale central limit theorem, see Hall and Heyde (1980) and Jacod and Shiryaev (2003) as well as the references therein. We obtain under \(A_2=\vartheta A_1\) that \(t\mapsto S_n(t,\vartheta )\) converges in distribution to a rescaled Brownian motion \(B\circ \sigma ^2\) on the Skorohod space D[0, 1] of all right-continuous functions \(x:[0,1]\rightarrow { {\mathbb {R}} }\) with existing left-hand limits. The rescaling function \(t\mapsto \sigma ^2(t)\) can be estimated by \({{\widehat{\sigma }}}^2(t,\vartheta )\). Since the proportionality factor \(\vartheta \) is unknown, the canonical solution is to plug in an appropriate estimator for it. For the readers’ convenience, we restrict ourselves here to the (logrank) estimator \({\widehat{\vartheta }}={{\widehat{\vartheta }}}_{K}\) from (1) with \(K= K_{\text {L}}\). The regularity conditions for more general estimators, for instance for weight functions of the shape \(K=(w\circ {{\widehat{F}}})K_L\) with \(w\in \mathscr {W}\), can be found in the “Appendix”. The plug-in estimator leads to a non-vanishing remainder term \(R_n(t)=S_n( t,\vartheta )-S_n( t,{{\widehat{\vartheta }}})\) converging in distribution to \(Z_K\sigma ^2(t)\) for all \(t>0\), where \(Z_K\) is a normally distributed random variable. To eliminate this remainder term and the dependence on the unknown \(\sigma ^2\), we suggest the following transformation of the statistic:

$$\begin{aligned} T_n = {\widehat{\sigma }}^2(1,{\widehat{\vartheta }})^{-1/2}\sup _{t\in [0,1]}\Bigl \{ \Bigl | S_{n}(t,{{\widehat{\vartheta }}}) - \frac{{\widehat{\sigma }}^2(t,{\widehat{\vartheta }})}{{\widehat{\sigma }}^2(1,{\widehat{\vartheta }})}S_n(1,{{\widehat{\vartheta }}}) {\Bigr |} \Bigr \}. \end{aligned}$$

Theorem 1

Under \({ H_0} \) our \(T_n\) converges in distribution to \(T=\sup \{|B_0(t)|:t\in [0,1]\}\), where \(B_0\) is a Brownian bridge.

Note that the classical Kolmogorov–Smirnov test statistic converges in distribution to the same limit T. Hence, the distribution of T is well known. Tables containing its quantiles can be found in Hall and Wellner (1980) and Schumacher (1984). Let \(\alpha \in (0,1)\) be a fixed level and \(q_\alpha \) be the \(\alpha \)-quantile of T. Then \(\varphi _{n,\alpha }={\mathbf {1}}\{T_n>q_{1-\alpha }\}\) is an asymptotically exact test of size \(\alpha \), i.e., \({ E }(\varphi _{n,\alpha })\rightarrow \alpha \) under \(H_0\). In contrast to the test of Gill and Schumacher (1987), which was designed for monotonic hazard ratio alternatives, our test is an omnibus test, as is the one of Wei (1984), i.e., the test is consistent for any relevant alternative.
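
To make the construction concrete, the following R sketch computes \(T_n\) by reusing Sn_sigma from the sketch in Sect. 2; the estimator \({{\widehat{\vartheta }}}_{K_L}\) from (1) is evaluated in its rank form, and the helper names are ours:

```r
# Logrank weighted rank estimator (1) with K = K_L in rank form: the jumps
# 1/Y_2 (resp. 1/Y_1) of dA_2-hat (resp. dA_1-hat) cancel against K_L.
theta_hat <- function(delta, cc) {
  y2 <- rev(cumsum(rev(cc))); y1 <- rev(cumsum(rev(1 - cc)))
  sum(delta * cc * y1 / (y1 + y2)) /
    sum(delta * (1 - cc) * y2 / (y1 + y2))
}
# Test statistic T_n: the sup is attained at the jump points t = i/n.
Tn <- function(delta, cc) {
  n <- length(delta); th <- theta_hat(delta, cc)
  v <- sapply(seq_len(n) / n, function(t) Sn_sigma(delta, cc, th, t))
  S <- v["Sn", ]; s2 <- v["sigma2", ]
  max(abs(S - s2 / s2[n] * S[n])) / sqrt(s2[n])
}
# asymptotic test: reject H_0 at the 5% level if Tn(delta, cc) > 1.358,
# the 95% quantile of T = sup|B_0|
```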

Theorem 2

Consider a general alternative \(H_1\):\(\{\) for every \(\vartheta >0\) there is some \(x\in (0,\tau )\) such that \(A_2(x)\ne \vartheta A_1(x)\}\). Then our test \(\varphi _{n,\alpha }\) is consistent for \(H_1\), i.e., \({ E }(\varphi _{n,\alpha })\rightarrow 1\) under \(H_1\) for all \(\alpha \in (0,1)\).

4 Resampling tests

4.1 Permutation test

Let \(c^{(n)}=(c_{(1)},\ldots ,c_{(n)})\) and \(\delta ^{(n)}=(\delta _{(1)},\ldots ,\delta _{(n)})\). It is easy to check that our test statistic \(T_n\) only depends on \((c^{(n)}, \delta ^{(n)})\) and, thus, we write \(T_n(c^{(n)}, \delta ^{(n)})\) instead of just \(T_n\) throughout this section. Instead of permuting the pairs \((c_{(i)},\delta _{(i)})\), we follow the approach of Neuhaus (1993) and Janssen and Mayer (2001), who both studied weighted logrank tests for testing \(H_0^=: F_1=F_2\). Simulations of Neuhaus (1993) and Heller and Venkatraman (1996) promise a good finite sample performance of these weighted logrank tests. Their approach is to keep \(\delta ^{(n)}\) fixed and to randomly permute only the group memberships \(c_{(i)}\). In this spirit, let \(c^\pi _n=(c_{n,1}^\pi ,\ldots ,c_{n,n}^\pi )\) be a uniformly distributed permutation of \(c^{(n)}\), independent of the data \(\{(X_{j,i}, \delta _{j,i}): 1\le i \le n_j;\,j=1,2\}\).

Theorem 3

Let T be defined as in Theorem 1. Then we have under \(H_0\) as well as under any fixed alternative \(H_1\) from Theorem 2 that in probability

$$\begin{aligned} \sup _{t\ge 0}\Bigl | P (T_n(c_n^\pi , \delta ^{(n)})\le t \vert \delta ^{(n)} ) - P (T\le t) \Bigr | \rightarrow 0. \end{aligned}$$

Let \(q_{n,\alpha }^\pi ({{\widetilde{\delta }}}_n)\)\((\alpha \in (0,1);\,{{\widetilde{\delta }}}_n\in \{0,1\}^n)\) be the (left-continuous) \(\alpha \)-quantile of the distribution of \(T_n(c_n^\pi ,{\widetilde{\delta }}_n)\). Then \(\varphi _{n,\alpha }^\pi = {\mathbf {1}}\{T_n(c^{(n)},\delta ^{(n)})>q_{n,1-\alpha }^\pi (\delta ^{(n)})\}\)\((\alpha \in (0,1))\) is an asymptotically exact test for \(H_0\), i.e., \({ E }(\varphi _{n,\alpha }^\pi )\rightarrow \alpha \) under \(H_0\). Since the statement of Theorem 3 is also valid under fixed alternatives \(H_1\), we can deduce from Lemma 1 of Janssen and Pauls (2003) that \(\varphi _{n,\alpha }^\pi \) is consistent for the general alternatives \(H_1\) introduced in Theorem 2. To sum up, the permutation test and the asymptotic test have the same asymptotic behavior under the null as well as under fixed alternatives. However, our simulations show that for finite sample sizes the permutation test outperforms the asymptotic test. This can partially be explained by the following observation:

Since the distribution of \(T_n(c_n^\pi ,{\widetilde{\delta }}_n)\) is discrete for all \({{\widetilde{\delta }}}_n\in \{0,1\}^n\) we may consider a randomized version

$$\begin{aligned} {{\widetilde{\varphi }}}_{n,\alpha }^\pi =\varphi _{n,\alpha }^\pi + \gamma ^\pi _{n,\alpha }(\delta ^{(n)}) {\mathbf {1}}\{T_n(c^{(n)},\delta ^{(n)})=q^\pi _{n,1-\alpha }(\delta ^{(n)})\} \end{aligned}$$

with \(\gamma ^\pi _{n,\alpha }({{\widetilde{\delta }}}_n)\in [0,1]\)\((\alpha \in (0,1),{{\widetilde{\delta }}}_n\in \{0,1\}^n)\). The advantage of the permutation approach compared, for instance, to the bootstrap approach and, of course, to the asymptotic test is that the (randomized) permutation test is usually finitely exact at least for a restricted null. In our situation, \(c^{(n)}\) and \(\delta ^{(n)}\) are independent under the restricted null \(H^=_0: \{F_1=F_2,G_1=G_2\}\), see Neuhaus (1993). Hence, the randomized permutation test is even finitely exact, i.e., \({ E }({{\widetilde{\varphi }}}_{n,\alpha }^\pi )=\alpha \) under \(H_0^=\).
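
A Monte Carlo sketch of the (non-randomized) permutation test in R, reusing Tn from the sketch in Sect. 3; perm_test is our own name and B denotes the number of random permutations:

```r
# Permutation test: keep the censoring statuses delta fixed and
# permute only the group labels cc (this is c_n^pi).
perm_test <- function(delta, cc, B = 1000, alpha = 0.05) {
  T_obs  <- Tn(delta, cc)
  T_perm <- replicate(B, Tn(delta, sample(cc)))
  p_value <- mean(T_perm >= T_obs)   # Monte Carlo permutation p value
  p_value <= alpha                   # reject H_0?
}
```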

4.2 Wild bootstrap

In this section, we apply the wild bootstrap technique of Wu (1986). Introduce n independent and identically distributed real-valued random variables \(G_{1,1},\ldots ,G_{1,n_1},G_{2,1},\ldots ,G_{2,n_2}\) with \({ E }(G_{j,i})=0\) and \( \text {Var} (G_{j,i})=1\). Then the wild bootstrap version \({{\widehat{A}}}_j^G\)\((j=1,2)\) of the Nelson–Aalen estimator \({{\widehat{A}}}_j\) is given by

$$\begin{aligned} {{\widehat{A}}}_j^G(t)&= \int _0^t \frac{{\mathbf {1}}\{Y_j{(s)}>0\}}{Y_j{(s)}}\,\mathrm { d }\Bigl ( \sum _{i=1}^{n_j}G_{j,i}N_{j,i}{(s)} \Bigr )=\sum _{i=1}^{n_j}G_{j,i}\int _0^t \frac{{\mathbf {1}}\{Y_j{(s)}>0\}}{Y_j{(s)}}\,\mathrm { d }N_{j,i}{(s)}. \end{aligned}$$

Now, we replace in the definition of \(S_n\) the usual Nelson–Aalen estimators \({{\widehat{A}}}_1\) and \({{\widehat{A}}}_2\) by their wild bootstrap versions \({{\widehat{A}}}_1^G\) and \({{\widehat{A}}}_2^G\), respectively, and denote the resulting statistic by \(S_n^G\). Since the limit of \(S_n^G\) is Gaussian, a choice of normally distributed multipliers \(G_{j,i}\) seems plausible, as used, for example, in a competing risks setting by Lin (1997). But taking the (discrete) structure of counting processes into account, we should also consider discrete distributions for the multipliers \(G_{j,i}\), for example, the Rademacher distribution, i.e., the uniform distribution on \(\{-1,1\}\), see Liu (1988), or a centred Poisson distribution, see Beyersmann et al. (2013) and Mammen (1992). The latter is closely connected to the classical bootstrap, i.e., drawing with replacement, see the previously mentioned references. Janssen and Pauls (2003) offered a unified general approach for bootstrap and permutation statistics. Now, replacing \(S_n\) by \(S_n^G\) in our test statistic \(T_n\), we get our wild bootstrap test statistic denoted by \(T_n^G\). All other statistics, for instance \({{\widehat{\sigma }}}^2\) and \({{\widehat{\vartheta }}}\), remain unchanged. In contrast to the permutation approach, where we kept only the censoring statuses fixed, we here keep the whole data fixed.
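
Plugging the multiplied increments into (2) shows that, in the rank form (4), the wild bootstrap version simply attaches the multiplier \(G_{(i)}\) of the individual observed at \(X_{(i)}\) to the i-th summand. A sketch with Rademacher multipliers (SnG is our own name; \(T_n^G\) is then built from \(S_n^G\) exactly as \(T_n\) from \(S_n\)):

```r
# Wild bootstrap version of the rank form (4): each summand is
# multiplied by G_(i); sigma2-hat and theta-hat are still computed
# from the original data.
SnG <- function(delta, cc, theta, G, t = 1) {
  n <- length(delta); n2 <- sum(cc); n1 <- n - n2
  y2 <- rev(cumsum(rev(cc))); y1 <- rev(cumsum(rev(1 - cc)))
  i <- seq_len(floor(n * t + 1e-9))
  sqrt(n / (n1 * n2)) *
    sum(G[i] * delta[i] * (cc[i] - theta * y2[i] / (y1[i] + theta * y2[i])))
}
# one bootstrap draw of Rademacher multipliers:
# G <- sample(c(-1, 1), length(delta), replace = TRUE)
```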

Theorem 4

Let T be defined as in Theorem 1. Then under \(H_0\) we have in probability

$$\begin{aligned} \sup _{x\ge 0}\Bigl | P (T_n^G\le x \vert (X_{j,i},\delta _{j,i}):1\le i \le n_j;\;j=1,2 ) - P (T\le x) \Bigr | \rightarrow 0. \end{aligned}$$
(7)

Let \(q_{n,\alpha }^G=q_{n,\alpha }^G((X_{j,i},\delta _{j,i})_{1\le i \le n_j,j=1,2})\)\((\alpha \in (0,1))\) be an \(\alpha \)-quantile of \(T_n^G\) given the data \((X_{j,i},\delta _{j,i})_{1\le i \le n_j,j=1,2}\). Then \(\varphi _{n,\alpha }^G= {\mathbf {1}}\{T_n>q_{n,1-\alpha }^G\}\)\((\alpha \in (0,1))\) is an asymptotically exact test for \(H_0\), i.e., \({ E }(\varphi _{n,\alpha }^G)\rightarrow \alpha \) under \(H_0\). In contrast to the permutation statistic, see Theorem 3, the convergence in (7) is only valid under the null. But we can show that the conditional distribution of \(T_n^G\) is tight under the alternatives \(H_1\) introduced in Theorem 2. As a result, we get the bootstrap test’s consistency.

Theorem 5

For all alternatives \(H_1\) discussed in Theorem 2 we have \({ E }(\varphi _{n,\alpha }^G)\rightarrow 1\) under \(H_1\) for all \(\alpha \in (0,1)\).

5 Simulations

5.1 Type I error

To compare the behavior of our asymptotic and resampling tests with the test of Gill and Schumacher (1987) and the one of Grambsch and Therneau (1994), we performed a simulation study for small sample sizes under different scenarios. The simulations were conducted with R (version 3.5.0), see R Core Team (2019). In this section, we consider the behavior of the aforementioned tests under the null \(H_0:A_2=\vartheta A_1, \,\vartheta >0\). Since we are dealing with rank tests, monotone transformations of the data do not affect the outcome of the test statistic. That is why we can assume without loss of generality that \(F_1\) is a standard exponential distribution \(\text {Exp}(1)\) and, thus, under the null \(H_0\) the second group also follows an exponential distribution \(\text {Exp}(\vartheta )\) with general parameter \(\vartheta >0\). We considered nine different proportionality factors \(\vartheta \in \{0.2,0.4,0.6,0.8,1,2,3,4,5\}\) and two different sample sizes \(n\in \{56,224\}\). To discuss balanced and unbalanced sample size cases as well as different censoring settings, we took three different scenarios into account, which are summarized in Table 1. The group sizes are \(n_1=\kappa _1n\) and \(n_2=n-n_1\). For the censoring distribution, we used exponential distributions \(C_{j,i}\sim \text {Exp}(\mu _{j,\vartheta })\) and uniform distributions \(C_{j,i}\sim \textit{Unif}(0,\mu _{j,\vartheta })\) on the interval \((0,\mu _{j,\vartheta })\). The parameters \(\mu _{j,\vartheta }\) are chosen such that they lead to an average censoring rate of \(r_j\) as specified in Table 1. In the supplement, we explain how these parameters can be determined. The empirical sizes were estimated based on 5000 iterations and the resampling tests’ quantiles were estimated by 1000 iterations.

Table 1 Simulation scenarios
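
For illustration, one data set under the null with exponential censoring can be generated as follows; the rates mu1 and mu2 stand for the calibrated parameters \(\mu _{j,\vartheta }\) and are placeholders here:

```r
# One simulated data set under H_0: group 1 ~ Exp(1), group 2 ~ Exp(theta),
# censoring C_{j,i} ~ Exp(mu_j); returns delta and cc in pooled order.
sim_null <- function(n1, n2, theta, mu1, mu2) {
  T1 <- rexp(n1, 1);     C1 <- rexp(n1, mu1)
  T2 <- rexp(n2, theta); C2 <- rexp(n2, mu2)
  x <- c(pmin(T1, C1), pmin(T2, C2))
  d <- as.integer(c(T1 <= C1, T2 <= C2))   # censoring statuses
  g <- rep(0:1, c(n1, n2))                 # group statuses c
  o <- order(x)                            # pooled order statistics
  list(delta = d[o], cc = g[o])
}
```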

For our tests we used the estimator \({{\widehat{\vartheta }}}={{\widehat{\vartheta }}}_K\) with the logrank weight \(K=K_{\text {L}}\) as recommended before. For the test of Gill and Schumacher (1987) we followed their suggestion and used the two weights corresponding to the logrank test and the Peto–Prentice version, see Peto and Peto (1972) and Prentice (1978), of the generalized Wilcoxon test, respectively. The test of Grambsch and Therneau (1994) is already implemented in R, see the function cox.zph in the package survival.
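
For reference, a minimal call of the Grambsch–Therneau test, assuming a data frame dat with (hypothetical) columns time, status and group:

```r
library(survival)
# Grambsch-Therneau test based on scaled Schoenfeld residuals:
fit <- coxph(Surv(time, status) ~ group, data = dat)
cox.zph(fit)
```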

For the wild bootstrap we considered three different distributions for the multipliers, namely the Rademacher, the standard normal and the centred Poisson distribution, see Sect. 4.2 for details on these distributions. To avoid overloading the plots, we compare only the three bootstrap tests in Fig. 1. The curves of the empirical sizes are very close to each other and nearly indistinguishable in most of the cases. In the comparison with the other tests, see Fig. 2, we include only the Rademacher multipliers. In the small sample size setting \(n=56\), it is apparent that the test of Gill and Schumacher (1987) leads to quite liberal decisions with empirical sizes between 5.5 and 8.4% and on average around \(7\%\). For the larger sample size case (\(n=224\)), the empirical sizes are closer to the \(5\%\) benchmark with an overall average of \(5.5\%\). The test of Grambsch and Therneau (1994) is quite conservative, in particular for \(\vartheta \) far away from 1. Our asymptotic test is also very conservative with empirical sizes around 2–3% for \(n=56\) and around 3–5% for \(n=224\). The permutation and Rademacher bootstrap tests’ empirical sizes are always close to the nominal level \(5\%\), even in the small sample size setting \(n=56\). However, the Rademacher wild bootstrap, and likewise the other two wild bootstrap versions, is slightly liberal in some settings, see in particular the close-up in Fig. 1, whereas the permutation test’s empirical sizes are mainly below the \(5\%\) level.

Fig. 1 Empirical sizes of the three bootstrap tests based on the Rademacher (Rade), the normal (Norm) and the centred Poisson (Pois) distribution

Fig. 2 Empirical sizes of our permutation (Per), bootstrap (Bo) and asymptotic (Asy) tests as well as the empirical sizes of the tests of Gill and Schumacher (1987) (GS) and Grambsch and Therneau (1994) (GT)

5.2 Power simulations

In this section, we present simulations on the tests’ power behavior under different alternatives. Since the test of Gill and Schumacher (1987) was quite liberal in our simulations for small sample sizes, we exclude it here. Again, we restricted ourselves to the case that the survival times of the first group are \(\text {Exp}(1)\)-distributed. For the second group, we disturbed the null assumption \(A_2={0.6}A_1\) in different hazard directions:

$$\begin{aligned} A_2(t)=\int _0^t \bigl (0.6 + w(F_1(s))\bigr ) \,\mathrm { d }A_1{(s)} \quad (t\ge 0), \end{aligned}$$
(8)

where the hazard direction \(w:[0,1]\rightarrow { {\mathbb {R}} }\) is continuous and of bounded variation. To be more specific, we considered the following two hazard directions (a simulation sketch follows the list):

1. (central hazards) \(w(x)=50\,x(1-x)\).

2. (late hazards) \(w(x)=70\,x^3(1-x)\).
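
Since \(F_1\) is standard exponential, \(A_1(t)=t\) and \(F_1(s)=1-e^{-s}\), so survival times with cumulative hazard (8) can be generated by numerically inverting \(A_2\) at standard exponential variates. A sketch for the central hazard direction (the function names are ours); censoring is then applied as in Sect. 5.1:

```r
# Simulate from (8) via inversion: T = A_2^{-1}(E) with E ~ Exp(1).
w_central <- function(x) 50 * x * (1 - x)
A2 <- function(t)
  integrate(function(s) 0.6 + w_central(1 - exp(-s)), 0, t)$value
r_alt <- function(nj) {
  e <- rexp(nj)
  vapply(e, function(u)
    uniroot(function(t) A2(t) - u, c(0, 1), extendInt = "upX")$root,
    numeric(1))
}
```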

Moreover, we included an alternative, which was already discussed by Kraus (2009):

$$\begin{aligned} A_2(t)=\int _0^t \alpha _2(s)\,\mathrm { d }s\quad \text {with }\alpha _2(s)= \frac{3}{2}(s-1)^2. \end{aligned}$$
(9)

For the group sizes as well as the censoring, we considered again Scenarios 1–3 introduced in the previous section. Of course, the censoring parameters \(\mu _{j,\vartheta }\) needed to be updated for the second group. Due to the complex nature of the alternatives, it is not as easy as before to determine \(\mu _{j,\vartheta }\) and, thus, we decided to find appropriate parameters by trial and error. The concrete values, which we used, can be found in the supplement. The tests’ empirical power values were estimated based on 5000 iterations and the resampling tests’ quantiles were estimated by 1000 iterations. In Tables 2, 3 and 4 we summarized the results for various sample sizes, where the highest values are marked in boldface.

To summarize the results, we can observe that the GT test leads to the highest empirical power values in Scenario 2 for the late and central hazard alternatives as well as in Scenario 3 for the late hazard alternative when \(n=112\). Nevertheless, our permutation and our Rademacher wild bootstrap test can compete with it in these settings. In all other situations, our resampling tests lead to higher power values than the GT test. In particular, our tests outperform the GT test for the alternative given by (9). It can be seen that among the wild bootstrap approaches the Rademacher multipliers are favorable. Moreover, we can observe that either all resampling tests lead to quite similar power values or the permutation test’s power is the highest.

Table 2 Empirical power values (highest ones boldfaced) of the test of Grambsch and Therneau (1994) (GT), our permutation (Per) and our asymptotic (Asy) test as well as our three different bootstrap tests based on Rademacher (Rade), normal (Norm) and centred Poisson (Pois) multipliers, respectively, for the central hazard alternative
Table 3 Empirical power values (highest ones boldfaced) of the test of Grambsch and Therneau (1994) (GT), our permutation (Per) and our asymptotic (Asy) test as well as our three different bootstrap tests based on Rademacher (Rade), normal (Norm) and centred Poisson (Pois) multipliers, respectively, for the late hazard alternative
Table 4 Empirical power values (highest ones boldfaced) of the test of Grambsch and Therneau (1994) (GT), our permutation (Per) and our asymptotic (Asy) test as well as our three different bootstrap tests based on Rademacher (Rade), normal (Norm) and centred Poisson (Pois) multipliers, respectively, for the alternative given by (9)

6 Real data examples

In this section, we illustrate the applicability of our tests by discussing two examples. The first data set is taken from Fleming et al. (1980) and consists of times from treatment to disease progression for 35 patients suffering from ovarian cancer, where 15 patients (9 censored) are categorized as stage II ovarian cancer and the remaining 20 individuals (4 censored) are stage IIa patients. The group balancing parameter \(\kappa _1\) and the censoring rates of Scenario 2 from the previous section correspond exactly to the situation here. The data set is available in the R package coin, see Hothorn et al. (2006), where it is denoted by ocarcinoma. Gill and Schumacher (1987) already used this data set. In Fig. 3 the group-specific Nelson–Aalen estimators are plotted. In Table 5 we present the p values of the tests already used in the previous section. Here, we restrict ourselves to the wild bootstrap test with Rademacher multipliers due to our findings in Sect. 5.1. It can be seen that all tests reject the null of proportional hazards at the nominal level of \(5\%\), which is in line with the impression of non-proportionality gained from the plot in Fig. 3.
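
For completeness, the data set can be loaded as follows, assuming the coin package is installed:

```r
# Ovarian carcinoma data of Fleming et al. (1980):
data("ocarcinoma", package = "coin")
str(ocarcinoma)
```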

The second data set is taken from Collett (2015) and can be found in Appendix B.1 therein. It consists of the survival times of 44 patients suffering from chronic active hepatitis. 22 of these patients were selected at random and received the drug Prednisolone; the other 22 patients served as a control group and did not receive any treatment. The censoring rates are \(50\%\) in the treatment group and \(27\%\) in the control group. Observe that Scenario 1 from the previous section reflects exactly the group balancing and the censoring rates of this data set. The details of this clinical trial were described by Kirk et al. (1980). In Fig. 3 the group-specific Nelson–Aalen estimators are plotted and in Table 5 the tests’ p values can be found. Again, the plot suggests non-proportionality of the hazards. However, in contrast to the previous example, only one test, namely the Rademacher wild bootstrap test, can reject the null hypothesis of proportionality at the nominal level of \(5\%\). It was already recognized by Kraus (2009) that the test of Gill and Schumacher (1987) cannot detect the presence of non-proportional hazards for this data set. The reason is that the test was designed for alternatives with a monotone ratio of the hazard rates, which is not the case in this example. Our asymptotic and our permutation test are also unable to reject the null, although the permutation test’s p value is quite close to the \(5\%\) benchmark compared to the other tests.

Fig. 3 Group-specific Nelson–Aalen estimators for the ovarian data set (left) and the hepatitis data set (right). The solid line corresponds to the patients with stage II (in the left plot) and to the control group (in the right plot), respectively

7 Summary and discussion

Our simulations reveal that for small sample sizes both resampling techniques are a real improvement over our asymptotic test, and they lead to better results than the existing (asymptotic) methods of Gill and Schumacher (1987) and Grambsch and Therneau (1994). Regarding this observation, a future project may be to improve the finite sample performance of other (existing) tests by using wild bootstrapping or permutation techniques. The simulation results show that the (slightly conservative) permutation test leads to higher power values than the (slightly liberal) wild bootstrap approach in most of the cases. Moreover, we favor, in general, the permutation test due to its finite exactness under the restricted null \(H_0^{=}:F_1=F_2,\,G_1=G_2\). Altogether, we recommend using the permutation approach. But, as explained in the following two paragraphs, the wild bootstrap approach is more flexible regarding extensions.

As pointed out by one of the referees, other types of our statistic may be interesting as well, e.g., an integral-type statistic in the spirit of Cramér and von Mises. Since the wild bootstrap version directly recovers the covariance structure of the process \(S_n\), we can transfer our results concerning wild bootstrapping to other types of statistics by a simple modification of the proofs. However, the situation for the permutation approach is more delicate because the asymptotic covariance structure differs from the one of \(S_n\). The solution to this problem is to use a studentized test statistic, as already done several times in the literature (Neuhaus 1993; Janssen 1997, 2005; Janssen and Pauls 2003; Pauly 2011; Konietschke and Pauly 2012; Omelka and Pauly 2012; Chung and Romano 2013; Pauly et al. 2015). A studentized version of, e.g., the integral-type statistic is rather complicated in comparison to our sup-statistic. One reason for this is the time invariance of the latter. That is why we prefer the sup-statistic in this paper.

Table 5 p values for the ovarian data set and the hepatitis data set of our asymptotic (Asy), permutation (Per) and Rademacher wild bootstrap (Bo) tests as well as the tests of Gill and Schumacher (1987) (GS) and Grambsch and Therneau (1994) (GT)

We want to suggest two different ways to extend our approach to the k-sample case. On the one hand, pairwise testing of \(H_0^{ij}:A_i=\vartheta _{ij}A_j\) for all \(1\le i < j \le k\) can be done in a first step, followed by the classical Bonferroni adjustment of the resulting \(m=k(k-1)/2\) tests. If m is large, this leads to very conservative decisions and we suggest applying the FDR-controlling procedure of Benjamini and Yekutieli (2001) instead; a sketch of this route follows below. On the other hand, with more technical effort the process convergence of \(S_n\) proven in the “Appendix” can be extended to the multivariate case. Then we may use \({{\widetilde{T}}}_n=\max _{i,j} T_n^{ij}\) as our test statistic, where \(T^{ij}_n\) denotes the sup-statistic for the pairwise comparison of groups i and j. But the limiting distribution of \({{\widetilde{T}}}_n\) is not distribution-free in the case \(k>2\) and depends on the unknown distribution functions \(F_1,G_1,\ldots \) This problem may be solved by wild bootstrapping again or, alternatively, by group-wise bootstrapping, where the bootstrap sample for group j is drawn from the observations of group j only and not from the pooled observations as in the classical bootstrap of Efron (1979). In contrast to these bootstrap methods, we do not expect the permutation approach to work because the statistic \({{\widetilde{T}}}_n\) cannot be studentized appropriately as in the two-sample case.
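
A sketch of the first, pairwise route in R, where groups is a list of per-group data and pair_p a (hypothetical) function returning the two-sample p value, e.g., of the permutation test:

```r
# Pairwise tests of H_0^{ij} with adjustment of the m = k(k-1)/2 p values:
k  <- length(groups)
ij <- combn(k, 2)                    # all pairs (i, j) with i < j
p  <- apply(ij, 2, function(v) pair_p(groups[[v[1]]], groups[[v[2]]]))
p.adjust(p, method = "bonferroni")   # or method = "BY" for FDR control
```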