1 Introduction

In many longitudinal studies, subjects can experience recurrent events (Cook and Lawless 2007). This type of data has been frequently observed in medical research, engineering, the economy, and sociology. In medical research, recurrent events could be multiple occurrences of hospitalization for a group of patients, multiple recurrence episodes in cancer studies, recurrent upper respiratory and ear infections, repeated heart attacks, or multiple relapses from remission for leukemia patients (Byar 1980; Pepe and Cai 1993; Wei et al. 1989). The analysis of such data can be focused on time-between-events (gap times) or time-to-event models. In time-to-event models, the events of concern usually represent different states in the disease process (e.g., alive and disease-free, alive with disease, and dead) and they are modeled through their intensity functions (Andersen et al. 1993; Meira-Machado et al. 2009; Meira-Machado and Sestelo 2019). In these models, the estimation of these quantities is essential for long-term survival prognosis. For instance, in cancer studies, one could consider the time to recurrence and the time to death for recurrent patients as the gap times. Under this setting, many other medical contexts can be found in the literature, such as asthma, HIV/AIDS, heart disease, dementia, Alzheimer’s disease, etc. Though the proposed methods can be used in both settings, in this paper, we consider that the events are of the same nature and focus on time-between-events. This line of research has received much attention recently. Among others, they were investigated by Campbell (1981), Burke (1988), Lin et al. (1999), Van Keilegom (2004), Peña et al. (2001), de Uña- Álvarez and Meira-Machado (2008), de Uña- Álvarez and Amorim (2011) and Moreira et al. (2017) whose interest was focused on the estimation of the bivariate distribution of the gap times. In other cases, the interest was more focused on the distribution of the gap times, such as the estimation of the joint gap times, the gap time survival functions, or the conditional survival function of the gap times (Meira-Machado et al. 2016; Meira-Machado and Sestelo 2016). Among others, these issues were investigated by Tsai et al. (1986), Prentice and Cai (1992), Lin and Ying (1993), van der Laan et al. (2002), Wang and Wells (1998), Wang and Chang (1999), Prentice et al. (2004) and Schaubel and Cai (2004). These approaches are focused on a pair of gap times corresponding to two consecutive events, and the extension to cope with a vector of k gap times may not be obvious. Furthermore, the proposed methods do not account for the influence of covariates. In addition, the implementation of several of the aforementioned methods will be difficult in practice due to the lack of user friendly software. The present paper aims to fill this gap. We consider the nonparametric estimation of the multivariate distribution functions of the gap times under univariate random right censoring conditionally (or not) on current or past covariate measures. New estimators for \(K\ge 2\) gap times are introduced and their performances and limitations are discussed. One set of estimators considers a subsampling approach—which we term landmark (de Uña-Álvarez and Meira-Machado 2015)—where a selection is made of the data consisting of subjects occupying a given state at a particular time. Alternative weighted cumulative hazard estimators are also proposed. The idea is to use an adaptation of the nonparametric estimator presented by Wang and Wells (1998) which is constructed using the cumulative hazard of the total time given a first time but where each observation has been weighted using the information of the first duration. The proposed methods can also be used to obtain conditional probabilities such as those provided in the plots shown in section 4 that provide useful interpretation. We also introduce a feasible estimation method for the multivariate distribution function, conditionally on covariate measures. The proposed method follows the ideas of Meira-Machado et al. (2015) in which the authors use kernel weights and the principle of ‘inverse probability of censoring weighting’ (IPCW) (to estimate these quantities conditionally on a continuous covariate. Finally, a tutorial for analyzing such types of data using an R package in which all methods are implemented.

It is worth mentioning that there are several modelling techniques for analyzing the effect of covariates in recurrent time-to-event data. To that end, extensions of the proportional hazards model, such as the Andersen and Gill model (AG) (Andersen and Gill 1982), the Prentice, Williams, and Peterson (PWP) model (Prentice et al. 1981), and Wei, Lin, and Weissfeld (WLW) (Wei et al. 1989), have been proposed for analyzing recurrent event data. An overview of these methods can be seen in the paper by Amorim and Cai (2015) and they are outside the scope of this paper.

This article is organized as follows. The next section presents the notation and introduces the estimators. The finite sample properties of the estimators are studied by simulation in Sect. 3. In Sect. 4 we give a brief overview of the survivalREC R package developed by the authors. We illustrate how these methods can be used for data exploration by applying them to a data set on bladder cancer. Main conclusions and discussion are reported in Sect. 5.

2 Estimators

2.1 Notation

In the context of recurrent event data, each individual may go through a well-defined event several times in his history. Assume that each study subject can potentially experience K consecutive events at times \(T_{1}<T_{2}<\cdots <T_{K}\), which are measured from the start of the follow-up. We are primarily interested in the gap times \(Y_1:=T_1\), \(Y_2:=T_2-T_1, \ldots , Y_k:=T_k-T_{k-1}\), \(k=2,\ldots ,K\). Then, \(T_k=Y_{1}+\cdots +Y_{k}\) is the time to the kth event, \(T_K\) is the total time and \((Y_{1},Y_{2}, \ldots , Y_K)\) is a vector of gap times of successive events, which we assume to be observed subjected to (univariate) random right-censoring. Let C be the right-censoring variable, assumed to be independent of \((T_{1},T_{2}, \ldots , T_K)\). Because of this, the observed data consists of \((\widetilde{Y}_{1i},\ldots ,\widetilde{Y}_{Ki},\Delta _{1i},\ldots ,\Delta _{Ki})\), \(1\le i\le n\), which are n independent replications of \((\widetilde{Y}_{1},\ldots ,\widetilde{Y}_{K},\Delta _{1},\ldots ,\Delta _{K} )\), where \(\widetilde{Y}_{1}=Y_{1}\wedge C\), \(\Delta _{1}=I(Y_{1}\le C)\), \(\widetilde{Y}_{2}=Y_2\wedge C_2\), \(\Delta _{2}=I(Y_{2}\le C_{2})\) with \(C_{2}=(C-Y_{1})I(Y_{1}\le C)\) the censoring variable of the second gap time and \(\widetilde{Y}_{k}=Y_k\wedge C_k\), \(\Delta _{k}=I(Y_{k}\le C_{k})\) with \(C_{k}=(C-Y_{k-1})I(Y_{k-1}\le C)\). Obviously, \(\Delta _k=1\) implies \(\Delta _1=\cdots =\Delta _{k-1}=1\). Define also \(\widetilde{T}_k=T\wedge C\). Here and thereafter, \(a \wedge b= min(a,b)\) and \(I(\cdot )\) is the indicator function.

Let \(F_{k}\) denote the distribution function of the kth event time \(T_{k}\) and \(F^{\star }_{k}\) denote the distribution function of the kth gap time \(Y_{k}\). Due to the independence assumption between C and \((T_1,\ldots ,T_K)\), the marginal distribution of the kth event time can be consistently estimated by the Kaplan–Meier estimator based on the \((\widetilde{T}_K, \Delta _K)\)’s. Note that, since the variables \(T_1<T_2<\cdots <T_K\) are recorded successively and are subject to censoring, we only observe the kth gap time if all previous failure times are uncensored. In practice this will imply that for \(k>1\), \(Y_{k}\) and \(C_{k}\) will be in general dependent, which will make difficult the estimation of the marginal distribution of the kth gap time, as well as the estimation to the joint distribution function \(F_{1\ldots k}(y_1,\ldots ,y_k)=P(Y_{1}\le y_1,\ldots ,Y_{k}\le y_k)\).

2.2 Estimators for the bivariate distribution function

In this section we will present different approaches for estimating the bivariate distribution function of \((Y_1,Y_2)\), \(F_{12}(y_1,y_2)=P(Y_1\le y_1,Y_2\le y_2)\). The generalization to \(K>2\) gap times will be given in a later section.

2.2.1 Inverse probability of censoring weighted estimators

In the absence of censoring, the bivariate distribution function can be empirically estimated by \(\widehat{F}_{12}(y_1,y_2)=\frac{1}{n}\sum _{i=1}^{n}I(\widetilde{Y}_{1i}\le y_1, \widetilde{Y}_{2i}\le y_2)\). To handle right censoring, inverse probability of censoring weighting (IPCW) can be used (see, for example, Moreira et al. (2017) for further details).

The idea of IPCW was used by Lin et al. (1999) to introduce an estimator for the bivariate distribution function. Their estimator is based on the relation \(P(Y_1\le y_1, Y_2\le y_2)=P(Y_1\le y_1) - P(Y_1\le y_1, Y_{2}> y_2)\) where the first quantity in the right-hand side of the equation can be consistently estimated using the Kaplan–Meier estimator Kaplan and Meier (1958) of the distribution function of the first event time \(T_1\) (i.e., based on the pairs \((\widetilde{T}_{1i},\Delta _{1i})\)’s) which we denote by \(\widehat{F}_1\). The idea to estimate the second term follows from the following relation \(E[I(Y_1\le y_1, Y_{2}> y_2)]=E[\frac{I(Y_1\le y_1, Y_{2}>y_2)I(C>T_1+y_2)}{G(T_1+y_2)}]=E[\frac{I(\widetilde{Y}_1\le y_1, \widetilde{Y}_{2}>y_2)}{G(\widetilde{T}_1+y_2)}]\). From this, it follows that

$$\begin{aligned} \widehat{F}_{12}^{\mathtt {LIN}}(y_1,y_2)=\widehat{F}_1(y_1)-\frac{1}{n}\sum _{i=1}^{n}\frac{I(\widetilde{Y}_{1i}\le y_1, \widetilde{Y}_{2i}> y_2)}{\widehat{G}(\widetilde{Y}_{1i}+y_2)} \end{aligned}$$
(1)

where \(\widehat{G}\) stands for the Kaplan–Meier estimator of the censoring distribution which is computed using the \((\widetilde{T}_{2i},1-\Delta _{2i})\)’s.

Later, de Uña- Álvarez and Meira-Machado (2008) proposes an alternative estimator which is defined in terms of multivariate ‘Kaplan–Meier integrals’ with respect to the marginal distribution of \(T_2\). The idea behind their estimators is to weight the bivariate data using the Kaplan–Meier estimator of \(T_2\) as shown below.

$$\begin{aligned} \widehat{F}_{12}^{{\mathtt {KMW}}}(y_1,y_1)=\sum _{i=1}^nW_iI(\widetilde{Y}_{1i}\le y_1, \widetilde{Y}_{2i}\le y_2). \end{aligned}$$
(2)

where \(W_i\) is the Kaplan–Meier weight attached to \(\widetilde{T}_{2i}\) when estimating the marginal distribution of \(T_2\) from \((\widetilde{T}_{2i}, \Delta _{2i})\)’s (equal to minus the jump at \(\widetilde{T}_{2i}\) of the Kaplan–Meier estimator of survival of the total time; see de Uña- Álvarez and Meira-Machado (2008) for more details).

Estimator (2) (labeled as KMW) can also be expressed using IPCW (see de Uña- Álvarez and Meira-Machado 2008) being somehow related (although not equal) to that proposed by Lin et al. (1999). The two estimators labeled by KMW and LIN deal with right censoring using an appropriate reweighting of the chosen summands, and the differences between them are somewhat subtle. The KMW estimator only puts mass on observations that are completely uncensored, whereas Lin’s estimator puts mass on observations that were uncensored till a given time. In practice, this means that LIN estimator will show estimated curves with more jump points. The estimates produced via the KMW estimator produce a valid bivariate distribution since it does guarantee that the bivariate distribution function is monotone. In contrast, the specific reweighting of the data that is used in Lin’s estimator does not ensure this property. Their estimators do not attach positive mass to each pair of recorded gap times, which may lead to problems of interpretation. A proper estimator for the bivariate distribution function could be obtained by keeping the estimator constant until it starts decreasing again. However, this approach provides a downward-biased estimator. These features can be seen in our application section. The two estimators are consistent whenever \(y_1 + y_2\) is smaller than the upper bound of the support of the censoring time.

2.2.2 Estimators based on conditional probabilities

In this section, we propose estimators for the bivariate distribution function that consider the relation \(P(Y_{1}\le y_1, Y_{2}\le y_2) = P(Y_{2}\le y_2\vert Y_{1}\le y_1)P(Y_{1}\le y_1)\), where the second term in the right-hand side of the equation can be estimated using the Kaplan–Meier product-limit estimator of the distribution function of the first event time. Two different estimation methods are proposed below to estimate the first term on the right-hand side of the equation shown above.

A simple estimator for the bivariate distribution function considers that the first term can be estimated using a subsampling approach. This approach, which we also term as landmarking (van Houwelingen 2007), is obtained by considering specific subsamples or portions of the data at hand. In this case, for estimating the conditional probability \(P(Y_{2}\le y_2\vert T_{1}\le y_1)\), the analysis is restricted to those individuals with a first gap time (equivalently, the first event time) less or equal to \(y_1\). To formalize things, let \(n_{1}\) be the cardinal of \(\mathcal {S}=\left\{ i:\widetilde{Y}_{1i}\le y_1\right\} \). Then,

$$\begin{aligned} \widehat{F}^{{\mathtt {LDM}}}(y_2\vert Y_{1}\le y_1)=\sum _{i=1}^{n_1}W_i^{\left( y_1\right) }I(\widetilde{Y}_{2i}\le y_2). \end{aligned}$$
(3)

where \(W_i^{\left( y_1\right) }\) are the Kaplan–Meier weights of the distribution of \(T_2\) computed from the subsample \(\mathcal {S}\).

Any of the estimators proposed above (LIN, KMW and LDM) may reveal some problems in the right tail where uncensored observations are scarce. Below, we propose an estimator that may deal more efficiently with those situations. The proposed estimator is constructed using the cumulative hazard of the total time given a first time but where each observation has been weighted using the information of the first duration. This estimator (WCH—weighted cumulative hazard) follows the ideas by Wang and Wells (1998) in which a product-limit estimator for the second gap time is used:

$$\begin{aligned} \widehat{F}_{12}^{{\mathtt {WCH}}}(y_1,y_2)=\widehat{F}_2^{\star {\mathtt {WCH}}}(y_2\vert Y_{1}\le y_1)\widehat{F}_{1}(y_1) \end{aligned}$$
(4)

where \(\widehat{F}_{1}(y_1)\) is the Kaplan–Meier estimator and \(\widehat{F}_2^{\star {\mathtt {WCH}}}(y_2\vert Y_{1}\le y_1) = 1-\prod _{v\le y_2} (1-\widehat{\Lambda }_{Y_2\mid Y_{1} \le y_1}(dv))\) for which

$$\begin{aligned} \widehat{\Lambda }_{Y_2\mid Y_1\le y_1}(dv)=\frac{\sum _{i=1}^{n}I(\widetilde{Y}_{1i}\le y_1,\widetilde{Y}_{2i}=v,\Delta _{2i}=1)/\widehat{G}(\widehat{Y}_{1i}+v)}{\sum _{i=1}^{n}I(\widetilde{Y}_{1i}\le y_1,\widetilde{Y}_{2i}\ge v,\Delta _{1i}=1)/\widehat{G}(\widehat{Y}_{1i}+v)}. \end{aligned}$$

One interesting topic in the analysis of recurrent event data is the estimation of the marginal distribution of the gap times (\(F_j^{\star }(y)=P(Y_j \le y)\)). The problem of estimating these functions has not been discussed explicitly in the literature, giving the impression that the Kaplan–Meier estimator is still the estimator of choice. This is not true for \(j>1\). Indeed, since \(T_2\) and \(C_2\) are expected to be dependent, the Kaplan–Meier estimator of \(F_2^{\star }\) based on the \((\widetilde{Y}_{2i}, \Delta _{2i})\)’s will be in general inconsistent. An estimator for the marginal distribution of the second gap time can be obtained from proposed estimators of the bivariate distribution function since \(F_{2}^{\star }(y)=P(Y_{2}\le y)=F_{12}(+\infty ,y)\). To estimate this quantity, we suggest using the WCH, where the corresponding estimator is a simple adaptation of \(\widehat{F}_2^{\star {\mathtt {WCH}}}\). An alternative approach would be using the KMW method, \(\widehat{F}_2^{\star {\mathtt {KMW}}}(y)=\sum _{i=1}^n W_iI(\widetilde{Y}_{2i} \le y)\) where \(W_i\) are the Kaplan–Meier weights attached to \(\widetilde{T}_2\) when estimating the marginal distribution of \(T_2\).

2.3 Extension to the general case of K gap times

In this section we extend the results (estimators) proposed in Sect. 2.2 to the case of K gap times.

Let \((Y_1,Y_2,\ldots ,Y_K)\) denote a vector of K ordered gap times and let \(F_{1\ldots K}\) denote the joint distribution function of \((Y_1,Y_2,\ldots ,Y_K)\). The estimator proposed by de Uña- Álvarez and Meira-Machado (2008) can easily be extended to provide a valid estimator for the joint distribution function of \((Y_1,Y_2,\ldots ,Y_K)\). Their estimator is given by

$$\begin{aligned} \widehat{F}_{1\ldots K}^{{\mathtt {KMW}}}(y_1,\ldots ,y_K)=\sum _{i=1}^nW_iI(\widetilde{Y}_{1i}\le y_1,\ldots , \widetilde{Y}_{Ki}\le y_K) \end{aligned}$$
(5)

where \(W_i\) is the Kaplan–Meier weight attached to \(\widetilde{T}_{Ki}\) when estimating the marginal distribution of \(T_K\) from \((\widetilde{T}_{Ki},\Delta _{Ki})\)’s.

Lin’s estimator Lin et al. (1999) can be easily extended to the general case of K gap times. Since \(P(Y_1\le y_1,\ldots , Y_{K}\le y_K)=P(Y_1\le y_1,\ldots , Y_{K-1}\le y_{K-1})- P(Y_1\le y_1,\ldots ,Y_{K-1}\le y_{K-1}, Y_{K}> y_K)\), an obvious estimator for the second term is given by

$$\begin{aligned} \frac{1}{n}\sum _{i=1}^{n}\frac{I(\widetilde{Y}_{1i}\le y_1,\ldots ,\widetilde{Y}_{(K-1)i}\le y_{k-1},\widetilde{Y}_{Ki}> y_K)}{\widehat{G}_K(\widetilde{T}_{(K-1)i}+y_K)} \end{aligned}$$
(6)

where \(\widehat{G}_K\) stands for the Kaplan–Meier estimator of the censoring distribution based on the \((\widetilde{T}_{Ki},1-\Delta _{Ki})\)’s. The first term can be estimated recursively using the same approach.

The extension of the landmark estimator (LDM) to K gap times is a consequence of Bayes’ theorem,

$$\begin{aligned} \widehat{F}_{1\ldots K}^{{\mathtt {LDM}}}(y_1,\ldots ,y_K)&= \widehat{F}_K(y_K\mid Y_1\le y_1,\ldots ,Y_{K-1}\le y_{K-1})\\&\quad \times \widehat{F}_{K-1}(y_{K-1}\mid Y_1\le y_1,\ldots ,Y_{K-2}\le y_{K-2})\\&\quad\times \cdots \times \widehat{F}_2(y_2\mid Y_1\le y_1) \widehat{F}_1(y_1)\\ \end{aligned}$$

where \(\widehat{F}_j(y\vert Y_1\le y_1,\ldots ,Y_{j-1}\le y_{j-1})\) is estimated using the Kaplan–Meier estimator based on \((\widetilde{T}_j, \Delta _j)\)’s restricted to those individuals in \(\mathcal {S}=\left\{ i:\widetilde{Y}_1\le y_1,\ldots ,\widetilde{Y}_{j-1}\le y_{j-1}\right\} \).

The extension of the weighted cumulative hazard estimator (WCH) to K gap times follows from the following relation,

$$\begin{aligned} P(Y_1\le y_1,\ldots , Y_{K}\le y_K)& = \left( 1-P\left( Y_{K}> y_K \vert Y_1\le y_1,\ldots ,Y_{K-1}\le y_{K-1}\right) \right) \\&\quad \times P\left( Y_1\le y_1,\ldots , Y_{K-1}\le y_{K-1}\right) , \end{aligned}$$

where the first term in the right-hand side of the equation is estimated using \(\widehat{P}\left( Y_{K}> y_K \vert Y_1\le y_1,\ldots ,Y_{K-1}\le y_{K-1}\right) = 1-\prod _{v\le y_K} (1-\widehat{\Lambda }_{Y_K\mid Y_{1} \le y_1,\ldots ,Y_{K-1}\le y_{K-1}}(dv))\) and the second term in the right-hand side of the equation is estimated recursively using the ideas explained in Sect. 2.2.

2.4 Estimators conditionally on current or past covariate measures

In this section, we will explain how we may introduce nonparametric estimators for the conditional distribution function, \(F_{12}(y_1,y_2\mid X)\). In particular, we are interested in estimating these functions for any time \(y_1\) and \(y_2\), but conditional on a given continuous covariate X that could either be a baseline covariate or a current covariate that is observed for an individual during the follow-up. Discrete covariates can also be included by splitting the sample for each level of the covariate and repeating the procedures described in the previous sections for each subsample.

To account for covariate effects, one standard method is to consider estimators based on Cox’s model (Cox 1972), with the corresponding baseline hazard function estimated by the Breslow’s method (Breslow 1972). Flexible nonparametric effects of the covariates on the bivariate distribution function, as those shown in our example of application, can be obtained using an alternative approach which introduces local smoothing by means of kernel weights based on local constant (Nadaraya–Watson) regression (Nadaraya 1965). The proposed method is introduced in a regression setup based on the inverse probability of censoring weighting.

Assume that we have two consecutive gap times \((Y_1, Y_2)\) and that X denotes a continuous covariate. Then, the estimation of these functions can be performed via estimating general conditional expectation of type \(E\left[ \varphi \left( Y_1,Y_2\right) \mid X=x \right] \), where \(\varphi \) is a general function defined over \(Y_1\) and \(Y_2\). For instance, in our setting, for the bivariate distribution function, \(\varphi _{u,v}(Y_1,Y_2)=I(Y_1\le u, Y_2\le v)\) while for the bivariate survival function \(\varphi _{u,v}(Y_1,Y_2)=I(Y_1> u, Y_2 > v)\).

In the absence of censoring, to estimate these quantities nonparametrically, we may use kernel smoothing techniques by calculating a local average of the \(\varphi \left( Y_1,Y_2\right) \), that is, as follows:

$$\begin{aligned} \widehat{E}\left[ \varphi \left( Y_1,Y_2\right) \vert X=x\right] =\sum _{i=1}^n W(x,X_i,h)\varphi \left( Y_{1i},Y_{2i}\right) , \end{aligned}$$

where W(xXih) is a weight function which corresponds to the Nadaraya–Watson estimator as follows:

$$\begin{aligned} W(x,X_i,h)=\frac{k\left( {(x-X_{i})/h}\right) }{\sum _{j=1}^n k\left( {(x-X_{j})/h}\right) } \end{aligned}$$

where k is a known probability density function (the kernel function) and h is the bandwidth.

To handle right censoring, inverse probability of censoring weighting can be used. Since,

$$\begin{aligned} E\left[ I(Y_1\le u,Y_2\le v)\right] =E\left[ I(\widetilde{Y}_1\le u, \widetilde{Y}_2\le v)\Delta _2/G_X(T_2)\vert X \right] \end{aligned}$$

where \(G_X\) denotes the conditional survival function of the censoring time C given the covariate X, that is, \(G_{X=x}(t)=P(C>t\vert X=x)\) which may be estimated using Beran’s estimator (Beran 1981),

$$\begin{aligned} \widehat{G}_x(t)= \prod _{\widetilde{T}_{2i}\le t,\Delta _{2i}=0}\left[ {1-\frac{W(x,X_i,h)}{\sum _{j=1}^{n}I(\widetilde{T}_{2j}\ge \widetilde{T}_{2i})W(x,X_j,h)} }\right] \end{aligned}$$
(7)

where \(W(x,X_i,h)\) are the Nadaraya–Watson weights.

Based on this, we propose the following nonparametric estimator of the conditional bivariate distribution function:

$$\begin{aligned} \widehat{F}_{12}(y_1,y_2 \vert X=x) = \sum _{i=1}^{n}W_{i}(x,X_i,h)\frac{I(\widetilde{Y}_{1i}\le y_1,\widetilde{Y}_{2i}\le y_2)\Delta _{2i}}{\widehat{G}_{x}(\widetilde{T}_{2i})} \end{aligned}$$
(8)

where \(G^0_X\) stands for an estimator of the conditional distribution \(C\mid X\), for example Beran’s estimator (of the censoring survival function) based on the \((\widetilde{Y}_{1i}, 1-\Delta _{1i}, X_i)\)’s.

Though these methods can be extended to a vector of covariates using multivariate kernels and a generalization of Beran’s estimator, some problems arise with the generalization to higher dimensions.

3 Simulation studies

In this section, we compare by simulations the estimators introduced in Section 2. We consider two simulated scenarios, the first scenario aims to compare the estimators introduced in Sect. 2.2. More specifically, estimators for the bivariate distribution function (\(F_{12}(y_1,y_2)\)) labeled as LIN, KMW, WCH and LDM. The second scenario aims to compare the extensions of the same estimators for three gap times.

3.1 Scenario 1: two gap times

Simulating data in longitudinal recurrent survival data requires the joint modeling of two or more random variables (Meira-Machado and Faria 2014). Copulas provide a useful method for deriving joint distributions given the marginal distributions, especially when the variables are non-normal as in the case of time-to-event variables (Soutinho and Meira-Machado 2020).

For the first scenario, we consider the Fairlie–Gumbel–Morgenstern (FGM) system of bivariate distribution with a joint cumulative distribution function of the form:

$$\begin{aligned} F_{12}(y_1,y_2)=F_{1}(y_1)F_{2}(y_2)\left[ 1+\delta \left\{ 1-F_{1}(y_1)\right\} \left\{ 1-F_{2}(y_2)\right\} \right] \end{aligned}$$

where \(F_1\) and \(F_2\) are the marginal cumulatives which follow a standard exponential and where \(\vert \delta \vert \le 1\) controls the amount of dependency between the two gap times. It has been shown that in this setting, the correlation of the FGM varies between 0 (independent gap times) for \(\delta =0\) and 0.25 for \(\delta =1\). The FGM was also used in the papers by Lin et al. (1999) and Moreira et al. (2017) Moreira and Meira-Machado (2012).

The follow-up time C was chosen to be uniformly distributed between 0 and an upper limit. In practice, this limit controls the amount of censored observations. An independent uniform censoring time \(C\sim U[0,4]\) resulted in 25% of censoring on the first gap time \(Y_1\), and in 46% of censoring on the second gap time \(Y_2\), for those individuals with \(\delta =1\). A second model with \(C\sim U[0,3]\) increases these censoring levels to 32% and about 60%, respectively. In each simulation, 1000 samples were generated, each with sample sizes of \(n = 100\) and \(n = 250\).

Table 1 reports the true values of \(F_{12}(y_1,y_2)\) where \(y_1\) and \(y_2\) take values 0.2231, 0.5108, 0.9163 and 1.6094 corresponding to marginal survival probabilities of 0.8, 0.6, 0.4 and 0.2. At each time point \((y_1,y_2)\) we computed the mean squared errors for the four estimators. Table 2 reports these values for model \(C\sim U[0,3]\) with a sample size of \(n=250\) and correlated gap times (\(\delta =1\)). Our results show that all estimators perform quite well, with reasonable low values for the mean square error. All estimators obtained low values for the bias (not shown) and a worst performance in the right tail (i.e., higher values of \(y_1\) and \(y_2\)) where the censoring effects are stronger. Though not shown here, the results for different sample sizes and different censoring percentages reveal that an increase in the sample size results in smaller variance and therefore a smaller mean square error. Besides, by increasing the censoring percentage, the standard deviation achieved larger values. The standard deviation (and consequently the mean square error) increased with \(y_1\) and with \(y_2\). All these facts were expected. The performance of the four estimators for the bivariate distribution function is dominated by the variability of the estimators, with a small advantage for the weighted cumulative hazard estimator and the estimator based on Kaplan–Meier weights, labeled as WCH and KMW, respectively.

Table 1 True values of the Fairlie–Gumbel–Morgenstern distribution for two dependent gap times with exponential marginal distribution functions with rate parameter 1
Table 2 Mean square errors (\(\times 100\)) for the estimators of the bivariate distribution function

3.2 Scenario 2: three gap times

In the second scenario, we consider two Archimedean copulas to generate data for a model with three recurrent events (leading to three consecutive gap times): the multivariate Clayton copula and the multivariate Frank copula (Nelsen 2006). In the first setting, the successive gap times \(\left( Y_1,Y_2,Y_3\right) \) are simulated according to the the trivariate Clayton copula. The trivariate Clayton copula is given by \(C(y_1,y_2,y_3)=[\sum _{i=1}^{3} {y_i}^{-\alpha }-2]^{-1/\alpha }\), \(\alpha \in [-1,\infty [\diagdown {0}\). We consider the Clayton copula with exponential margins with rate parameter 1 and \(\alpha =2\). The follow-up time was subjected to right censoring, C, according to uniform models \(U\left[ 0,3\right] \) and \(U\left[ 0,6\right] \). The first model results in 32% of censoring on the first gap time \(Y_1\), 55% of censoring on the second gap time \(Y_2\) and 68% of censoring on the third gap time \(Y_3\). The second model decreases these censoring levels to 17%, 32% and about 46%, respectively. Because of space limitation, we only present the results for the first model.

In a third setting, the successive gap times \(\left( Y_1,Y_2,Y_3\right) \) are simulated according to the the trivariate Frank copula. The trivariate Frank copula is given by \(C(y_1,y_2,y_3)=-\frac{1}{\alpha }ln[1+\frac{\prod _{i=1}^{3}(e^{-\alpha y_i}-1)}{(e^{-\alpha }-1)^{2}}]\) with \(\alpha \in R \diagdown {0}\). We consider the Frank copula with exponential margins with rate parameter 1 and \(\alpha =2\). Again, censoring was generated according to uniform models \(U\left[ 0,3\right] \) and \(U\left[ 0,2\right] \). The first model results in 25% of censoring on the first gap time \(Y_1\), 50% of censoring on the second gap time \(Y_2\) and 68% of censoring on the third gap time \(Y_3\). The second model increases these censoring levels to 43%, 69% and about 80%, respectively.

The true values of \(F_{123}(y_1,y_2,y_3)\) are reported in Tables 3 and 4. Tables 5, 6, 7, 8, 9, 10, 11 and 12 report the mean square error and standard deviations for the four estimators. All four methods have worst performance in the right tail, where the censoring effects are stronger. Results shown in Table 11 suggest that the WCH estimator leads to better results for estimating the trivariate distribution \(F_{123}(y_1,y_2,y_3)\) for higher values of \(y_1\) while neither one seems to be uniformly the best for estimating this quantity for small of mid valued of \(y_1\). In these cases, the KMW estimator seems to be a good alternative. The WCH estimator is among the four alternative methods the one that deals more efficiently at points for which \(y_1\), \(y_2\) and \(y_3\) are higher.

For the second scenario (Frank copula), we show in Fig. 1 the boxplots of the estimates of the trivariate probabilities for eight different points (xyz), corresponding to combinations of the percentiles 20%, 40%, 60% and 80% of the marginal distributions of the gap times. Results are based on 1000 Monte Carlo replicates for the four estimators, with a sample size of \(n=250\). The plots shown in this figure were obtained for the censoring level of \(C\sim U[0,3]\). The boxplots shown in this figure reveal some results which agree with our findings reported in the previous scenario (Clayton copula). From these plots, it can be seen that all methods have small biases and confirm the good performance of the proposed estimators. The KMW and WCH methods are the methods with less bias and variability.

Table 3 True values of the trivariate distribution of the gap times for the Clayton copula
Table 4 True values of the trivariate distribution of the gap times for the Frank copula
Table 5 Mean square error (\(\times \)100) for the KMW estimator
Table 6 Standard deviation for the KMW estimator
Table 7 Mean square error (\(\times \)100) for the Lin’s estimator
Table 8 Standard deviation for the Lin’s estimator. Trivariate Clayton copula with censoring times generated from model U[0,3] and a sample size of \(n=250\)
Table 9 Mean square error (\(\times \)100) for the LDM estimator
Table 10 Standard deviation for the LDM estimator
Table 11 Mean square error (\(\times \)100) for the WCH estimator
Table 12 Standard deviation for the WCH estimator
Fig. 1
figure 1

Boxplots of the \(M = 1000\) estimates of the trivariate distribution function of the four estimators, with a sample sizes of 250. Data from Frank copula with unit correlated exponential distributions. Censoring times were generated from an uniform distribution on [0,3]. Horizontal solid red line corresponds to the true value

4 survivalREC structure and functionality

To provide biomedical researchers with an easy-to-use tool for obtaining estimates and corresponding plots of the multivariate distributions in recurrent event data, we developed an R package called survivalREC. This software enables users to implement all nonparametric estimators discussed in Sect. 2, including the estimators conditionally on current or past covariates. The package, available at the CRAN repository at https://cran.r-project.org/web/packages/survivalREC/ comprises 15 functions, which are summarized in Table 13. Briefly, there are two main types of functionalities: (i) to estimate bivariate distribution functions in recurrent events with the KMWdf, LDMdf, LINdf, WCHdf and IPCWdf functions; and (ii) the corresponding extension for the estimation with three gap times with KMW3df, LDM3df, LIN3df and WCH3df. Plots for each method are also displayed using a “multidf” object, which can be obtained through the multidf function. Finally, the remaining auxiliary functions, Beran, KM, KMW and NWW, are included inside the previous functions.

Table 13 Summary of functions in the survivalREC package

4.1 Application to bladder cancer study data

Bladder cancer is one of the most common genitourinary malignant disease being more common in men than in women. Prognosis of this disease is, in most cases, related to risk factors that include smoking, family history, frequent bladder infections, and exposure to certain chemicals. Another significant prognostic factor for these patients’ overall survival is the presence of a recurrence. In fact, bladder cancer is a disease with a high percentage of patients who have superficial tumors, which tend to recur, but which are generally not fatal.

The bladder cancer study data set (Byar 1980) includes 118 patients that entered the study with superficial bladder tumors. Tumors were removed transurethrally and patients were assigned to one of three treatments (placebo, pyridoxine, or thiotepa) and followed until the end of the study. The time period under consideration is over four years since the entry of the first subject (48 months). Many of these patients had multiple recurrences of tumors during the study, and new tumors were removed at each visit. The time between tumor recurrence and death or censoring was recorded for each patient. The maximum observed number of recurrences is 9.

In this subsection, we will use data on the 85 subjects with nonzero follow-up who were assigned to either thiotepa or placebo, with respective sizes of 47 and 38. Summary statistics for the two treatments are given in Table 14. Among the 85 patients, 47 relapsed at least once, among these, 29 had a second recurrence, 22 had a third recurrence and 14 had four or more recurrences (16.5%). Thus, in our study only the first three recurrence times \(T_1\), \(T_2\) and \(T_3\) (or the corresponding gap times \(Y_1\), \(Y_2\) and \(Y_3\)) are considered. Data sets considering two, three and four recurrences are available in the survivalREC package. To illustrate our methods we will use data with only the first three recurrences for any patient. Bellow, is an excerpt of the data.frame with one row per individual.

figure a

The movement among the recurrent events is given by the variables \(y_i\) and \(d_i\), with \(i=\lbrace 1,\cdots , 4\rbrace \), which represent, respectively, the four gap times and their corresponding censoring indicators (1 for an event and 0 for censoring). The other three variables are the patient id (“id”), the type of treatment (“rx”, 1 = placebo and 2 = thiotepa), and the size (cm) of the largest initial tumour (“size”).

Table 14 Recurrent experiences for placebo and thiotepa groups in Stage I bladder cancer clinical trial

One important goal of these studies is to evaluate the effect on future prognosis of a locoregional recurrence (LR), since it is well known that the increased risk after a recurrence decreases significantly with increasing time since LR. In bladder cancer studies, a LR is called an early recurrence if the cancer comes back 6 to 12 months after treatment, and a late recurrence otherwise. The curves depicted in the first row of Fig. 2 show the results for the four proposed methods for the bivariate distribution function (\(F_{12}(x,y)\)) when \(x=6\) or \(x=12\) are fixed. With the exception of the LIN estimator, all the remaining three methods shown in these plots report roughly the same estimates. In fact, a specific issue with the LIN estimator is visible at the top-right of these two figures, because the displayed curves are not monotonically decreasing in y. This is a consequence of the specific reweighting of the data that is used in this approach, which may lead to problems of interpretation at the right tail of the distribution. The remaining plots in Fig. 2 (second and third rows) are intended to demonstrate the behavior of the four different methods for estimating the trivariate distribution (\(F_{123}(x,y,z)\) when different values of x, y, and z are used). The analysis of these plots revealed that, besides the LIN method, the LDM method also has the drawback of occasionally providing estimated curves that are clearly non-monotone for the trivariate distribution function and, therefore, their practical use could be less recommended. The WCH method and the method based on the Kaplan–Meier weights (KMW) both show plausible curves.

The WCH method provides a nice approach that can be used to estimate the conditional distribution function of the second gap time (time to second recurrence) conditional on the first gap time (time to the first recurrence). The plot shown in Fig. 3 depicts the estimates of \(P(Y_2\le y \vert Y_1\le 12)\) and \(P(Y_2\le y \vert Y_1>12)\) using the WCH method. The estimated curves reveal that patients with a late recurrence (after 12 months) have a reduced risk of developing a second recurrence.

In what follows, we explain how to use the survivalREC package to get estimates and plots for the bivariate distribution and for the distribution functions with three gap times. For illustration purposes, we will consider the WCH method, and the first (top left) and last (bottom right) plots shown in Fig. 2. First, to get the corresponding plots for the bivariate distribution function we need to transform the original data set into a “multidf” format class. This can be done using the multidf function that has as arguments time1, time, event1 and status. These arguments correspond to the soujorn time in the initial state and the global time, as well as their corresponding censoring indicator variables.

figure b

To obtain the nonparamentric estimates for bivariate distribution, we have the functions KMWdf, LDMdf, LINdf and WCHdf.

As an example, suppose we are interested in obtaining the estimates for \(F_{12}(x=6, y=20)\) for the method WCH. The input codes for this case are the following:

figure c

It is also possible to show the estimated curves of the bivariate distribution function given a specific value for the first gap time. This can be done through the plot function for “multidf” objects. The following input codes show how to obtain the graphic for method WCH in which the first gap time (\(t_1\)) takes value 6.

figure d

The procedures to extend the estimation to three gap times are quite similar to the bivariate case. First, we must create a new object using the multidf function, called for this example, b4state. As we can see, the b4state object gives us the cumulative gap times time1, time2 and time, as well as, the corresponding censoring indicators (event1, event2 and status). Finally, to obtain estimates, the WCH3df function can be used by adding a new parameter, z, to the third gap time.

Fig. 2
figure 2

Estimates of the bivariate (first row) and trivariate (second and third rows) distribution using the four proposed methods. Bladder recurrence cancer data

Fig. 3
figure 3

Estimates of the distribution of the second gap time (time to second recurrence) conditional on the time to the first recurrence. Estimated curves based on the WCH approach. Bladder recurrence cancer data

5 Conclusions and final remarks

There have been several contributions to the estimation of the marginal and joint distributions in the context of recurrent events. The methods based on the inverse probability of censoring weights introduced by Lin et al. (1999) and the method based on Kaplan–Meier weights (de Uña- Álvarez and Meira-Machado 2008) are among the best ones for estimating the bivariate distribution function.

In this paper, we introduce new nonparametric methods for the estimation of these quantities. Simulations show that most of the proposed estimators for the bivariate distribution function are virtually unbiased. The extension of these methods to several gap times is also discussed.

To provide biomedical researchers with an easy-to-use tool, we have also developed the survivalREC package, which is available at the CRAN repository, which enables one to implement all proposed methods. Details on the usage of its functions and main functionalities are also introduced in the paper through an illustrative example of the analysis of recurrence events in a bladder cancer study.

Another issue of practical interest that is studied in this paper is the estimation of these quantities conditionally on current or past covariate measures. A feasible nonparametric solution to this problem is proposed. The proposed method is based on local smoothing by the means of kernel weights that are either based on a local constant (i.e., Nadaraya–Watson) or a local linear regression.