1 Introduction

Cancer registries are widely used in cancer studies and play important roles in cancer control. The series of CONCORD studies addressed differences in cancer survival among nations for various cancer types, such as breast, colon, gastric, and prostate cancers (Coleman et al., 2008; Allemani et al., 2015, 2018). Derks et al. (2018) examined differences in survival outcomes attributable to differences in treatment policies among countries using relative excess risks for older breast cancer patients in the Netherlands, Belgium, Ireland, England, and Greater Poland. All of these studies used data from cancer registries.

To address these scientific questions, the cancer-specific survival, rather than the overall survival defined as the time to death from any cause, is often of interest. Thus, statistical analyses that account for the cause of death are desirable. A potential approach is to apply methods for competing risks analysis (Andersen et al., 1993, pp. 512–515; Fine and Gray, 1999). However, in cancer registries, reliable information on the cause of death is hard to collect comprehensively. Therefore, for the analysis of cancer registry data without information on the cause of death, special survival analysis methods have been developed, in which external data such as the life table of the general population are used to adjust for non-cancer deaths. This framework for inference from cancer registry data is often referred to as the relative survival framework (Perme et al., 2016; Kalager et al., 2021), since the relative survival ratio is one of the key measures used in this field. The relative survival ratio is defined as the ratio of the overall survival to the non-cancer survival. Utilizing an external database of life tables for the non-cancer general population, various methods for estimating the relative survival ratio have been proposed (Ederer et al., 1961; Hakulinen, 1982) and widely used in population studies (Coleman et al., 2008; Angle et al., 2014; Allemani et al., 2015). The net survival, defined as the survival probability if the cancer subject would not die from causes other than the cancer, is an alternative measure that has gained popularity since Perme et al. (2012) introduced a novel estimator with sound theoretical justification. An application was reported by Allemani et al. (2018).
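As a toy numerical illustration of the definition above, the relative survival ratio can be computed directly from an overall survival curve and the expected survival of a demographically comparable general population; the figures below are made-up numbers, not registry data.

```python
import numpy as np

# Hypothetical annual overall survival of a patient cohort over 5 years,
# and the expected (non-cancer) survival of a matched general population.
overall = np.array([0.90, 0.78, 0.68, 0.60, 0.54])
expected = np.array([0.99, 0.98, 0.97, 0.96, 0.95])

# Relative survival ratio: overall survival divided by expected survival.
rsr = overall / expected
```

A ratio near 1 indicates mortality close to that of the general population, while smaller values indicate excess mortality attributable to the cancer.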

All of the methods mentioned above target marginal quantities. Since cancer registry data consist of a huge number of cancer patients, stratified analysis by age, gender, and so on with these simple methods is in general preferable, as it avoids strong statistical assumptions. On the other hand, regression analysis is also important. For example, for rare cancer types, stratified analysis can be unstable, and regression models are useful for assessing several covariate effects jointly. Various regression models for cancer registry data have been proposed, including parametric models (Rubio et al., 2018), the additive hazard model (Lambert et al., 2005; Cortese & Scheike, 2008), and spline-based nonproportional hazard models (Bolard et al., 2002; Gorgi et al., 2003). The Cox proportional hazard model, probably the most famous regression model in survival analysis, has also been examined (Hakulinen & Tenkanen, 1987; Estève et al., 1990; Sasieni, 1996; Dickman et al., 2004; Nelson et al., 2007; Perme et al., 2009). Sasieni (1996) introduced martingale estimating equations motivated by the partial likelihood. The unweighted estimating equation, which corresponds to the score function of the partial likelihood in standard survival analysis, is not efficient for the Cox proportional hazard model for the net survival. Sasieni (1996) therefore considered a weighted estimating equation, which gives an efficient estimator of the regression coefficients. However, estimating the optimal weight requires a smoothing technique.

Perme et al. (2009) proposed semiparametric maximum likelihood estimation. They introduced a simple method to obtain the maximum likelihood estimator based on the expectation–maximization (EM) algorithm. The variance of the estimator was obtained with Louis’ method (Louis, 1982). Derks et al. (2018) applied this method to cancer registry data of older breast cancer patients in the Netherlands, Belgium, Ireland, England, and Greater Poland. Although the performance of the method by Perme et al. (2009) was examined in their simulation study, its asymptotic properties were not discussed. In this paper, we establish the asymptotic properties of the maximum likelihood estimator of the Cox proportional hazard model for the net survival by applying general semiparametric efficiency theory. Instead of Louis’ variance estimator, we consider a variance estimator derived from the semiparametric theory.

The rest of the paper is organized as follows. In Sect. 2, we introduce cancer registry data and the EM-based inference procedure for the Cox proportional excess hazard model. In Sect. 3, we present the consistency of the maximum likelihood estimator for the regression coefficients. In Sect. 4, we present the asymptotic normality of the maximum likelihood estimator for the regression coefficients, together with a consistent estimator of its asymptotic variance. In Sect. 5, we report the results of a simulation study, and in Sect. 6 we apply the proposed method to real data from the Surveillance, Epidemiology, and End Results (SEER) Program. Sect. 7 concludes with discussion. All the theoretical details are provided in the Appendices.

2 Maximum likelihood estimation for Cox proportional excess hazard model

2.1 Notations and general settings for the cancer registry data

Analysis of cancer registry data generally requires two datasets: the cancer registry data and the population life tables. Cancer registry data consist of information on characteristics at diagnosis and the survival time of each subject diagnosed with cancer. Table 1 illustrates the data structure of the cancer registry data. Note that no information on the cause of death is included in the cancer registry data. The population life tables are sets of annual mortality rates tabulated by demographic variables for the general population, based on demographic statistics. Table 2 shows an example of the life table for the male population by age and calendar year. The information from the life table is used to correct for the impact of death from causes other than the cancer of interest. The mathematical formulation of the cancer registry data and the relative survival framework is given as follows.

Table 1 Example of cancer registry data: list of the first five observations of cancer registry data
Table 2 Example of the population life table for the male population aged 60–64 years in 1990–1994

Let Z be a bounded vector of baseline covariates in the cancer registry data. Typically, it consists of age at diagnosis, gender, year of diagnosis, and some additional variables. Let \(T_O\) and C be the time-to-death from any cause and the potential censoring time, both measured from the time of diagnosis. \(T_O\) may be censored by C. We suppose that \(T=T_O \wedge C\) and \(\Delta =I(T_O \le C)\) are observed, where \(A \wedge B\) denotes the minimum of A and B and \(I(\cdot )\) is the indicator function, which takes 1 if the event in brackets is true and 0 otherwise.

Let \(T_E\) and \(T_P\) be the potential time-to-death due to the cancer and that due to causes other than the cancer, respectively. Then, \(T_O\) is expressed as \(T_O=T_E \wedge T_P\). Define \(\Delta _E=I(T_E \le T_P)\). We regard \((T, \Delta , \Delta _E, Z)\) as the complete data, although \(\Delta _E\) is unobserved in the cancer registry data. The observed information is the triple \((T,\Delta , Z)\) for each subject in the cancer registry data. Let the corresponding counting process and at-risk process be denoted by \(N(t)=I(T \le t, \Delta =1)\) and \(Y(t)=I(T \ge t)\), respectively. Let \(\tau \) be a constant satisfying \(\Pr (T> \tau |Z) > 0\) for all Z. Suppose n i.i.d. copies of \((T,\Delta , Z)\), denoted by \((T_i,\Delta _i, Z_i)\), are observed. For other random variables, the subscript i is also used to represent the quantity for the ith subject.

Let \(F_Z(z)\) be the distribution function for Z. The conditional survival function for \(T_O\) given Z is denoted by \(S_O(t|Z)=\Pr (T_O > t |Z)\), and the corresponding hazard and cumulative hazard functions are denoted by \(\lambda _O(t|Z)\) and \(\Lambda _O(t|Z)\), respectively. These functions for \(T_E\), \(T_P\), and C are denoted in the same way with the subscripts E, P, and C, respectively. Suppose the assumption

$$\begin{aligned} (\textrm{A1}) \ T_E \perp T_P | Z \end{aligned}$$

holds, where for any random variables A, B, and C, the conditional independence between A and B given C is denoted by \(A\perp B|C\). Then, the hazard function for \(T_O\) is represented as the sum of those for \(T_E\) and \(T_P\),

$$\begin{aligned} \lambda _O(t|Z)=\lambda _E(t|Z) + \lambda _P(t|Z). \end{aligned}$$

The hazard function of \(T_E\), \(\lambda _E(t|Z)\), is called the excess hazard, representing the excess risk of death due to the cancer. The conditional hazard function \(\lambda _P(t|Z)\) and the conditional survival function \(S_P(t|Z)\) are calculated from an external database of population mortality and are regarded as known functions.

2.2 Cox proportional excess hazard model

Suppose \(\lambda _E(t|Z)\) is modeled via a Cox-type regression model

$$\begin{aligned} \lambda _E(t|Z)=\lambda (t)e^{\beta ^TZ}, \end{aligned}$$
(1)

where \(\beta \) is a vector of regression coefficients and \(\lambda (t)\) is an unspecified baseline hazard function. Denote the baseline cumulative hazard function by \(\Lambda (t)=\int _0^t{\lambda (u)\mathrm{{d}}u}\). Let \(\beta _0\), \(\lambda _0(t)\), and \(\Lambda _0(t)\) be the true values of \(\beta \), \(\lambda (t)\), and \(\Lambda (t)\), respectively. Furthermore, we assume

$$\begin{aligned} (\textrm{A2}) \ C \perp \left( T_E, T_P \right) |Z. \end{aligned}$$

Under the assumptions of \(({ \mathrm A1})\) and \(({ \mathrm A2})\), the probability density function of the observed data \((T, \Delta , Z)\) is given by

$$\begin{aligned}&f_{T,\Delta ,Z}(t,\delta ,z; \Lambda ,\beta ) \nonumber \\ {}&= \left\{ \mathrm{{d}}\Lambda (t)e^{\beta ^Tz} + \mathrm{{d}}\Lambda _P(t|z)\right\} ^{\delta }e^{ - \Lambda (t)e^{\beta ^Tz} - \Lambda _P(t|z) }\mathrm{{d}}\Lambda _C(t|z)^{1-\delta } e^{ - \Lambda _C(t|z) }\mathrm{{d}}F_Z(z), \end{aligned}$$
(2)

where \(\mathrm{{d}}\Lambda (t)=\Lambda (t) - \Lambda (t-)\), \(\mathrm{{d}}\Lambda _P(t|Z)=\Lambda _P(t|Z) - \Lambda _P(t-|Z)\), \(\mathrm{{d}}\Lambda _C(t|z)=\Lambda _C(t|z) - \Lambda _C(t-|z)\), and \(\mathrm{{d}}F_Z(z)=F_Z(z)-F_Z(z-)\). The observed likelihood function is

$$\begin{aligned} L(\Lambda ,\beta ) \propto \prod _{i=1}^{n}{ L(\Lambda ,\beta ;T_i,\Delta _i,Z_i) }, \end{aligned}$$
(3)

where \(L(\Lambda ,\beta ;T_i,\Delta _i,Z_i)\) is the contribution of the ith subject to the likelihood given by

$$\begin{aligned} L(\Lambda ,\beta ;T_i,\Delta _i,Z_i) = \left\{ \mathrm{{d}}\Lambda (T_i)e^{\beta ^TZ_i} + \mathrm{{d}}\Lambda _P(T_i|Z_i)\right\} ^{\Delta _i} \exp \left\{ - \Lambda (T_i)e^{\beta ^TZ_i} \right\} . \end{aligned}$$
(4)

Perme et al. (2009) proposed the semiparametric maximum likelihood estimator of the regression coefficients \(\beta \), based on the EM algorithm. In constructing the semiparametric likelihood, \(\Lambda (t)\) is treated nonparametrically as a right-continuous and non-decreasing step function with \(\Lambda (0)=0\) and positive jump sizes \(\lambda (t)>0\) at all uncensored event time points. The likelihood function for the complete data is

$$\begin{aligned} L_C(\Lambda ,\beta ) \propto \prod _{i=1}^{n}{ \left\{ \mathrm{{d}}\Lambda (T_i)e^{\beta ^TZ_i} \right\} ^{\Delta _{Ei}} \exp \left\{ - \Lambda (T_i)e^{\beta ^TZ_i} \right\} }, \end{aligned}$$

and the log-likelihood after profiling out the baseline hazard function is

$$\begin{aligned} \ell _{CP}(\beta ) = \sum _{i=1}^{n}{ \left\{ \beta ^TZ_i - \log { \sum _{j=1}^n{ Y_j(T_i)e^{\beta ^TZ_j} } } \right\} \Delta _{Ei} }. \end{aligned}$$

Set the initial values of \(\lambda \) and \(\beta \) as \(\lambda ^{(0)}\) and \(\beta ^{(0)}\), respectively. Then, the conditional expectation of \(\ell _{CP}(\beta )\) given the observed data is

$$\begin{aligned}&Q(\beta ;\lambda ^{(0)},\beta ^{(0)}) \\ {}&= \sum _{i=1}^{n}\left\{ \beta ^TZ_i - \log { \sum _{j=1}^n{ Y_j(T_i)e^{\beta ^TZ_j} } } \right\} \frac{ \Delta _i \lambda ^{(0)}(T_i)e^{\beta ^{(0)T}Z_i} }{\lambda ^{(0)}(T_i)e^{\beta ^{(0)T}Z_i} + \lambda _P(T_i|Z_i)}. \end{aligned}$$

The value of \(\beta \) is updated by maximizing the Q-function, and the updated value is denoted by \(\beta ^{(1)}\). The value of \(\lambda \) is updated using the Breslow estimator as

$$\begin{aligned} \lambda ^{(1)}(T_i)&= \frac{ \Delta _i \lambda ^{(0)}(T_i)e^{\beta ^{(0)T}Z_i} }{\lambda ^{(0)}(T_i)e^{\beta ^{(0)T}Z_i} + \lambda _P(T_i|Z_i)}\left\{ \sum _{j=1}^n{ Y_j(T_i)e^{\beta ^{(1)T}Z_j} }\right\} ^{-1}. \end{aligned}$$

By alternately updating \(\lambda ^{(k)}\) and \(\beta ^{(k)}\) through repeated computation and maximization of the Q-function until convergence, the estimators \(\hat{\lambda }\) and \(\hat{\beta }\) are obtained. The corresponding estimator of the baseline cumulative hazard function is

$$\begin{aligned} \hat{\Lambda }(t)= \sum _{\left\{ i:T_i \le t\right\} }{ \frac{ \Delta _i \hat{\lambda }(T_i)e^{\hat{\beta }^{T}Z_i} }{\hat{\lambda }(T_i)e^{\hat{\beta }^{T}Z_i} + \lambda _P(T_i|Z_i)} \left\{ \sum _{j=1}^n{ Y_j(T_i)e^{\hat{\beta }^{T}Z_j} }\right\} ^{-1} }. \end{aligned}$$
(5)
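A minimal numerical sketch of the EM iteration above (the E-step weights, the weighted partial likelihood maximization for \(\beta \), and the Breslow-type update of the baseline jumps) might look as follows. For simplicity the population hazard \(\lambda _P(t|Z_i)\) is assumed constant in time for each subject, and all function and variable names are ours, not from Perme et al. (2009).

```python
import numpy as np

def fit_excess_hazard_em(T, delta, Z, lam_P, n_em=100, tol=1e-7):
    """EM sketch for the Cox proportional excess hazard model.

    T: (n,) follow-up times; delta: (n,) 0/1 death indicators;
    Z: (n,p) covariates; lam_P: (n,) known population hazard per subject,
    assumed constant in time (a simplifying assumption).
    Returns estimated coefficients and baseline hazard jumps."""
    n, p = Z.shape
    beta = np.zeros(p)
    lam = np.full(n, 1.0 / n)   # baseline jump at each subject's time
    for _ in range(n_em):
        # E-step: probability that an observed death is an excess (cancer) death
        num = lam * np.exp(Z @ beta)
        w = np.zeros(n)
        ev = delta == 1
        w[ev] = num[ev] / (num[ev] + lam_P[ev])
        # M-step for beta: Newton iterations on the weighted partial likelihood
        for _ in range(20):
            e = np.exp(Z @ beta)
            score = np.zeros(p)
            info = np.zeros((p, p))
            for i in np.flatnonzero(w > 0):
                risk = T >= T[i]
                s0 = e[risk].sum()
                zbar = Z[risk].T @ e[risk] / s0
                score += w[i] * (Z[i] - zbar)
                zc = Z[risk] - zbar
                info += w[i] * zc.T @ (e[risk, None] * zc) / s0
            step = np.linalg.solve(info + 1e-10 * np.eye(p), score)
            beta = beta + step
            if np.abs(step).max() < tol:
                break
        # Breslow-type update of the baseline hazard jumps
        e = np.exp(Z @ beta)
        lam_new = np.array([w[i] / e[T >= T[i]].sum() for i in range(n)])
        if np.abs(lam_new - lam).max() < tol:
            lam = lam_new
            break
        lam = lam_new
    return beta, lam
```

When the population hazard is negligible relative to the excess hazard, the weights are close to \(\Delta _i\) and the procedure essentially reduces to the standard Cox partial likelihood fit.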

3 Consistency

In this section, we prove the consistency of the maximum likelihood estimator. Suppose that \(\beta \) lies in a compact set \(\mathscr {B}\) and the covariance matrix of Z is positive definite. The existence of a pair \((\Lambda , \beta )\) that maximizes the observed likelihood function (3) is proved in Appendix A based on the techniques used in the proof of Theorem 1 of Fang et al. (2005). The identifiability of \((\Lambda , \beta )\), in the sense that \(L(\Lambda ,\beta ;t,\delta ,z) = L(\Lambda _0, \beta _0;t,\delta ,z)\) on \(t \in [0,\tau ]\) implies \((\Lambda , \beta )=(\Lambda _0, \beta _0)\), is also shown in Appendix A.

The semiparametric model (2) has the unknown parameters \(\left( \beta , \eta \right) \), where \(\eta =\left\{ \Lambda , \Lambda _C, F_Z \right\} \) is the nuisance parameter. Consider parametric submodels \(\Lambda _{h_1}(t;\gamma _1) = \int _0^t{ \left\{ 1 + \gamma _1 h_1(u) \right\} \mathrm{{d}}\Lambda _0(u) } = \int _0^t{ \left\{ 1 + \gamma _1 h_1(u) \right\} \lambda _0(u) \mathrm{{d}}u }\), \(\Lambda _{C,h_2}(t|Z;\gamma _2) = \int _0^t{ \left\{ 1 + \gamma _2 h_2(u,Z) \right\} \mathrm{{d}}\Lambda _C(u|Z) } = \int _0^t{ \left\{ 1 + \gamma _2 h_2(u,Z) \right\} \lambda _C(u|Z) \mathrm{{d}}u }\), and \(F_{Z,h_3}(z;\gamma _3)=\int _{-\infty }^{z}{ \left\{ 1 + \gamma _3 h_3(u) \right\} \mathrm{{d}}F_Z(u) }\), where \(h_1(u)\) and \(h_2(u,Z)\) are arbitrary functions and \(h_3(z)\) is a mean-zero measurable function with finite variance. The log-likelihood function based on (2) under the parametric submodel is defined by

$$\begin{aligned} \ell _n(\beta , \gamma ; h)&= \sum _{i=1}^n{ \Delta _i \log { \left[ \left\{ 1 + \gamma _1 h_1(T_i) \right\} \mathrm{{d}}\Lambda _0(T_i)e^{\beta ^TZ_i} + \mathrm{{d}}\Lambda _P(T_i|Z_i) \right] } } \\&\quad - \sum _{i=1}^n{ \int _{0}^{T_i}{ \left\{ 1 + \gamma _1 h_1(t) \right\} \mathrm{{d}}\Lambda _0(t) }e^{\beta ^TZ_i } } \\&\quad + \sum _{i=1}^n{ (1-\Delta _i) \log { \left[ \left\{ 1 + \gamma _2 h_2(T_i,Z_i) \right\} \mathrm{{d}}\Lambda _C(T_i|Z_i) \right] } } \\&\quad - \sum _{i=1}^n{ \int _{0}^{T_i}{ \left\{ 1 + \gamma _2 h_2(t,Z_i) \right\} \mathrm{{d}}\Lambda _C(t|Z_i) } } \\&\quad + \sum _{i=1}^n{ \left\{ 1 + \gamma _3 h_3(Z_i) \right\} \mathrm{{d}}F_Z(Z_i) }, \end{aligned}$$

where \(\gamma =(\gamma _1, \gamma _2, \gamma _3)^T\) and \(h=(h_1, h_2, h_3)^T\). Let

$$\begin{aligned} W(t|Z;\beta , \Lambda )&= \frac{ \mathrm{{d}}\Lambda (t)e^{ \beta ^TZ } }{ \mathrm{{d}}\Lambda (t)e^{ \beta ^TZ } + \mathrm{{d}}\Lambda _P(t|Z) }. \end{aligned}$$

Since the maximum likelihood estimator \(\hat{\beta }\) maximizes the likelihood and then maximizes it under any parametric submodel, it satisfies

$$\begin{aligned} U_n(\hat{\beta };h)&=\left( U_{n,\beta }(\hat{\beta };h)^T, U_{n,\gamma }(\hat{\beta };h)^T\right) ^T = 0 \end{aligned}$$
(6)

for any h, where

$$\begin{aligned} U_{n,\beta }(\beta ;h)&= \left. \frac{\partial }{\partial \beta } \ell _n(\beta ,\gamma ;h)\right| _{\gamma =0} \\&= \sum _{i=1}^n{ \int _{0}^{\tau }{ Z_i W(t|Z_i;\beta , \Lambda _0) \left[ \mathrm{{d}}N_i(t) - Y_i(t)\left\{ d \Lambda _0(t)e^{\beta ^TZ_i} + d \Lambda _P(t|Z_i) \right\} \right] } }, \\ U_{n,\gamma }(\beta ;h)&= \left( U_{n,\gamma _1}(\beta ;h_1), U_{n,\gamma _2}(\beta ;h_2), U_{n,\gamma _3}(\beta ;h_3) \right) ^T, \\ U_{n,\gamma _1}(\beta ;h_1)&= \left. \frac{\partial }{\partial \gamma _1} \ell _n(\beta ,\gamma ;h)\right| _{\gamma =0} \\&= \sum _{i=1}^n{ \int _{0}^{\tau }{ h_1(t) W(t|Z_i;\beta , \Lambda _0) \left[ \mathrm{{d}}N_i(t) - Y_i(t)\left\{ d \Lambda _0(t)e^{\beta ^TZ_i} + d \Lambda _P(t|Z_i) \right\} \right] } }, \\ U_{n,\gamma _2}(\beta ;h_2)&= \left. \frac{\partial }{\partial \gamma _2} \ell _n(\beta ,\gamma ;h)\right| _{\gamma =0} = \sum _{i=1}^n{ \int _{0}^{\tau }{ h_2(t,Z_i) \mathrm{{d}}M_{C,i}(t)} }, \\ U_{n,\gamma _3}(\beta ;h_3)&= \left. \frac{\partial }{\partial \gamma _3} \ell _n(\beta ,\gamma ;h)\right| _{\gamma =0} = \sum _{i=1}^n{ h_3(Z_i) \mathrm{{d}}F_{Z}(Z_i) }, \end{aligned}$$

and \(M_C(t)=I(C \le t, \Delta = 0) - \int _0^t{ Y(u) \mathrm{{d}}\Lambda _C(u|Z) }\) is a square integrable martingale with respect to an appropriate filtration (Fleming & Harrington, 1991). Then, it can be shown that \(E \left[ U_{1}(\beta ,\Lambda ;h) \right] =0\) for all functions h bounded on \(t \in [0,\tau ]\) and \(Z_1\).

Theorem 1

Under the assumptions (A1) and (A2), the maximum likelihood estimators are consistent; as \(n \rightarrow \infty \), \(\hat{\beta }\) converges in probability to \(\beta _0\) and \(\hat{\Lambda }(t)\) converges in probability to \(\Lambda _0(t)\) uniformly in \(t \in [0,\tau ]\).

Proof

The estimator (5) is represented by

$$\begin{aligned} \hat{\Lambda }(t) = \int _0^t{ \frac{ 1 }{ \sum _{j=1}^n{ Y_j(u)e^{\hat{\beta }^TZ_j} } } \sum _{i=1}^n{ W(u|Z_i;\hat{\Lambda },\hat{\beta })\mathrm{{d}}N_i(u) } }. \end{aligned}$$

Letting \(h_1(t)=1\) in the score equation (6) leads to this estimator. Since the vector of covariates Z is bounded and the parameter space \(\mathscr {B}\) is compact, \(e^{\beta ^TZ}\) is bounded, with upper bound denoted by \(K_u\). By the uniform law of large numbers (Pollard, 1990, page 41), \(n^{-1}\sum _{j=1}^n{ Y_j(u)e^{\beta ^TZ_j} }\) converges almost surely to \(E\left[ Y(u)e^{\beta ^TZ}\right] \in (0,K_u]\), uniformly in \(u \in [0,\tau ]\). By this result and \(W(t|Z;\Lambda ,\beta ) \in [0,1]\) for all \(t \in [0,\tau ]\) and Z, \(W(t|Z;\Lambda ,\beta )\) and \(n^{-1}\sum _{j=1}^n{ Y_j(u)e^{\hat{\beta }^TZ_j} }\) are uniformly bounded on \([0,\tau ]\). Then, we can follow the proof of consistency in Murphy et al. (1997). We give a sketch of the proof of consistency of \(\hat{\beta }\) and \(\hat{\Lambda }(t)\).

Define

$$\begin{aligned} \tilde{\Lambda }(t)=\int _0^t{ \frac{ 1 }{ \sum _{j=1}^n{ Y_j(u)e^{\beta _0^TZ_j} } } \sum _{i=1}^n{ W(u|Z_i;\Lambda _0,\beta _0)\mathrm{{d}}N_i(u) } }. \end{aligned}$$

By the Lenglart inequality (Fleming and Harrington, 1991, page 113) and the uniform law of large numbers, we see that \(\tilde{\Lambda }(t)\) converges almost surely to \(\Lambda _0(t)\), uniformly in \(t \in [0,\tau ]\), as \(n \rightarrow \infty \). Since \(\hat{\Lambda }\) and \(\hat{\beta }\) are the maximum likelihood estimators,

$$\begin{aligned} n^{-1}\sum _{i=1}^n{ \left\{ \log {L(\hat{\Lambda },\hat{\beta };T_i,\Delta _i,Z_i)} - \log {L(\tilde{\Lambda },\beta _0;T_i,\Delta _i,Z_i)}\right\} } \ge 0, \nonumber \end{aligned}$$

where \(L(\Lambda ,\beta ;T_i,\Delta _i,Z_i)\) is defined in (4). Since \(\hat{\Lambda }(t)\) and \(\tilde{\Lambda }(t)\) are bounded, the ratios of their jump sizes are bounded and of bounded variation on \(t \in [0,\tau ]\) as \(n \rightarrow \infty \), so we can use the result of equation (A.5) in Murphy et al. (1997), which implies that

$$\begin{aligned} E\left[ \log {L(\hat{\Lambda },\hat{\beta };T_i,\Delta _i,Z_i)} - \log {L(\tilde{\Lambda },\beta _0;T_i,\Delta _i,Z_i) } \right] \ge - o_P(1). \end{aligned}$$
(7)

The function \(\hat{\Lambda }(t)\) is a non-decreasing and bounded function. By Helly’s lemma (van der Vaart, 2000, page 9) and the compactness of \(\mathscr {B}\), any subsequence possesses a further subsequence along which \(\hat{\beta } \rightarrow \beta ^*\) for some \(\beta ^*\) and \(\hat{\Lambda }(t) \rightarrow \Lambda ^*(t)\) for any \(t\in [0,\tau ]\) and some monotone function \(\Lambda ^*(t)\). Therefore, for any \((t,\delta ,z)\),

$$\begin{aligned}&\log {L(\hat{\Lambda },\hat{\beta };t,\delta ,z)}\nonumber \\ {}&\quad - \log {L(\tilde{\Lambda },\beta _0;t,\delta ,z)} \xrightarrow {P} \log {L(\Lambda ^*,\beta ^*;t,\delta ,z)} - \log {L(\Lambda _0,\beta _0;t,\delta ,z)}. \end{aligned}$$
(8)

By the dominated convergence theorem, the expectation of the right-hand side of Eq. (8) under the true parameters \(\Lambda _0\) and \(\beta _0\), which is minus the Kullback–Leibler divergence, is nonpositive; combined with the result of equation (7), it holds that

$$\begin{aligned} E\left[ \log {L(\Lambda ^*,\beta ^*;T,\Delta ,Z)} - \log {L(\Lambda _0,\beta _0;T,\Delta ,Z)} \right] = 0. \end{aligned}$$

By the identifiability of \(\Lambda \) and \(\beta \) and the lemma of van der Vaart (2000, page 62), we conclude \(\Lambda ^* = \Lambda _0\) and \(\beta ^*=\beta _0\). Because any subsequence contains a further subsequence along which \(\hat{\beta }\) and \(\hat{\Lambda }\) converge uniformly to \(\beta _0\) and \(\Lambda _0\), respectively, the uniform convergence also holds for the entire sequence. \(\square \)

4 Asymptotic normality and variance estimation

In this section, we present the asymptotic normality of the maximum likelihood estimator by applying the semiparametric theory, together with a consistent estimator of the asymptotic variance derived from that theory.

Theorem 2

Suppose that \(\beta _0\) is in the interior of \(\mathscr {B}\). Under the assumptions (A1) and (A2), \(\sqrt{n}\left\{ \hat{\beta } - \beta _0 \right\} \) converges to a mean-zero Gaussian distribution with variance \(\Sigma _{\beta }(\beta _0,\Lambda _0;h^*)^{-1}\), where

$$\begin{aligned} \Sigma _{\beta }(\beta ,\Lambda ;h^*)&=E\left[ \left\{ \int _{0}^{\tau }{ \left\{ Z - h^*(t) \right\} W(t|Z;\beta , \Lambda ) \mathrm{{d}}M(t)} \right\} ^{\otimes 2}\right] , \nonumber \\ h^*(t)&= \frac{ E\left[ W(t|Z;\beta _0, \Lambda _0) Y(t)Z e^{ \beta _0^TZ } \right] }{ E\left[ W(t|Z;\beta _0, \Lambda _0) Y(t) e^{ \beta _0^TZ } \right] }, \end{aligned}$$
(9)

\(M(t)=N(t) - \int _0^t{ Y(u) \left\{ \mathrm{{d}}\Lambda _0(u)e^{ \beta _0^TZ } + \mathrm{{d}}\Lambda _P(u|Z) \right\} }\) is a square integrable martingale with respect to an appropriate filtration (Fleming & Harrington, 1991), and \(V^{\otimes 2}=V V^T\) for any column vector V. A consistent estimator of the asymptotic variance (9) is given by

$$\begin{aligned} \hat{\Sigma }_{\beta }(\hat{\beta },\hat{\Lambda })&= \frac{1}{n} \sum _{i=1}^n{ \int _0^{\tau }{ \left\{ Z_i - \frac{ \sum _{k=1}^n{ W(t|Z_k;\hat{\beta },\hat{\Lambda })Y_k(t) Z_k e^{\hat{\beta }^TZ_k} } }{ \sum _{j=1}^n{ W(t|Z_j;\hat{\beta },\hat{\Lambda })Y_j(t) e^{\hat{\beta }^TZ_j} } } \right\} ^{\otimes 2} } } \nonumber \\&\qquad \qquad \qquad \qquad \qquad \qquad \times W(t|Z_i;\hat{\beta },\hat{\Lambda })Y_i(t) e^{\hat{\beta }^TZ_i} \mathrm{{d}}\hat{\Lambda }(t). \end{aligned}$$
(10)

Proof

The nuisance tangent space for the nuisance parameter \(\eta =\left\{ \Lambda , \Lambda _C, F_Z \right\} \) is given by a direct sum of three orthogonal linear spaces,

$$\begin{aligned} \Gamma = \Gamma _1 \oplus \Gamma _2 \oplus \Gamma _3, \end{aligned}$$

where

$$\begin{aligned} \Gamma _1&= \left\{ \int _{0}^{\tau }{ h_1(t) W(t|Z;\beta _0, \Lambda _0) \mathrm{{d}}M(t)}\ \mathrm{for \ all\ function\ } h_1(t)\right\} , \\ \Gamma _2&= \left\{ \int _{0}^{\tau }{ h_2(t,Z) \mathrm{{d}}M_C(t)}\ \mathrm{for \ all\ function\ } h_2(t,Z)\right\} , \\ \Gamma _3&= \left\{ h_3(Z) \ \mathrm{such\ that\ } E\left[ h_3(Z) \right] =0\right\} . \end{aligned}$$

The orthogonal complement of the nuisance tangent space \(\Gamma \) is written as

$$\begin{aligned} \Gamma ^{\perp } = \left\{ \int _{0}^{\tau }{ \left\{ h_1(t,Z) - h_1^*(t) \right\} W(t|Z;\beta _0, \Lambda _0) \mathrm{{d}}M(t)}\ \mathrm{for \ all\ function\ } h_1(t,Z)\right\} , \end{aligned}$$

where

$$\begin{aligned} h_1^*(t)&= \frac{ E\left[ h_1(t,Z) W(t|Z;\beta _0, \Lambda _0) Y(t) e^{ \beta _0^TZ } \right] }{ E\left[ W(t|Z;\beta _0, \Lambda _0) Y(t) e^{ \beta _0^TZ } \right] }. \end{aligned}$$

Details of the derivation of these nuisance tangent spaces and their orthogonal complements are given in Appendix B.

The efficient score function for \(\beta \) is constructed by the orthogonal projection of the score function \(U_{1,\beta }(\beta _0;h)\) onto the orthogonal complement of \(\Gamma \), and it is given by

$$\begin{aligned} U_{1,\beta }^{eff}(\beta _0;h^*)&= \int _{0}^{\tau }{ \left\{ Z_1 - h^*(t) \right\} W(t|Z_1;\beta _0, \Lambda _0) \mathrm{{d}}M_1(t)}, \end{aligned}$$

where

$$\begin{aligned} h^*(t)&= \frac{ E\left[ W(t|Z;\beta _0, \Lambda _0) Y(t)Z e^{ \beta _0^TZ } \right] }{ E\left[ W(t|Z;\beta _0, \Lambda _0) Y(t) e^{ \beta _0^TZ } \right] }. \end{aligned}$$

Since the maximum likelihood estimator \(\hat{\beta }\) satisfies \(U_n(\hat{\beta }; h)=0\), it is the solution to

$$\begin{aligned}&\sum _{i=1}^n{ \int _{0}^{\tau }{ \left\{ Z_i - h(t) \right\} W(t|Z_i;\beta , \Lambda _0) \left[ \mathrm{{d}}N_i(t) - Y_i(t)\left\{ \mathrm{{d}} \Lambda _0(t)e^{\beta ^TZ_i} + d \Lambda _P(t|Z_i) \right\} \right] }}\\&\quad = 0 \end{aligned}$$

with any bounded function h, including \(h^*\). The efficient influence function for the ith subject is defined by

$$\begin{aligned} \psi _i(\beta _0,\Lambda _0;h^*)=\Sigma _{\beta }(\beta _0,\Lambda _0;h^*)^{-1} \int _{0}^{\tau }{ \left\{ Z_i - h^*(t) \right\} W(t|Z_i;\beta _0, \Lambda _0) \mathrm{{d}}M_i(t)}, \end{aligned}$$
(11)

where \(\Sigma _{\beta }(\beta _0,\Lambda _0;h^*)\) is given in (9). Therefore, it holds that \(\sqrt{n}\left( \hat{\beta } - \beta _0\right) =n^{-1/2}\sum _{i=1}^n{ \psi _i(\beta _0,\Lambda _0;h^*) } + o_P(1)\), which converges in law to the mean-zero Gaussian distribution with variance \(\Sigma _{\beta }(\beta _0,\Lambda _0;h^*)^{-1}\).

The asymptotic variance (9) can be consistently estimated by replacing the theoretical quantities with their empirical counterparts, which yields the estimator (10). \(\square \)
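As an illustration, the plug-in estimator (10) can be evaluated from the fitted quantities. The sketch below again assumes, for simplicity, a population hazard that is constant in time for each subject; the function and variable names are ours.

```python
import numpy as np

def variance_estimate(T, delta, Z, lam_P, beta, lam):
    """Plug-in version of the efficient information estimator (10).

    lam holds the fitted jumps of the baseline cumulative hazard at each
    subject's time (zero for censored subjects); lam_P is the population
    hazard per subject, assumed constant in time. Returns the estimated
    covariance matrix of beta_hat, i.e. Sigma_hat^{-1} / n."""
    n, p = Z.shape
    e = np.exp(Z @ beta)
    Sigma = np.zeros((p, p))
    # Sum over the jump points of the estimated baseline cumulative hazard
    for k in np.flatnonzero((delta == 1) & (lam > 0)):
        W = lam[k] * e / (lam[k] * e + lam_P)   # W(T_k | Z_i) for each i
        a = W * (T >= T[k]) * e                 # weight * at-risk * exp(beta'Z)
        zbar = Z.T @ a / a.sum()                # weighted covariate average
        D = Z - zbar
        Sigma += lam[k] * D.T @ (a[:, None] * D)
    Sigma /= n
    return np.linalg.inv(Sigma) / n
```

The square roots of the diagonal entries of the returned matrix serve as standard error estimates for the components of \(\hat{\beta }\).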

5 Simulation study

We conducted a simulation study to examine the behavior of the two variance estimators, (10) and Louis’ method. The simulation settings mimicked real cancer registry data and life tables. We considered four covariates: age at diagnosis, gender, year of diagnosis, and a continuous variable X. Age, gender, year, and X were generated from the normal distribution \(N(60,10^2)\), the Bernoulli distribution B(1/2), the discrete uniform distribution on \(\{2000, \ldots , 2010\}\), and the standard normal distribution N(0, 1), respectively. We generated \(T_E\) and \(T_P\) from exponential distributions with hazard rates \(\lambda _E(t|Z)=0.20 \exp \{ \log {1.3} \times \mathrm{{st(age)}} + \log {1.25} \times \mathrm{{gender}} + \log {0.8} \times \mathrm{{st(year)}} + \beta _X X \}\) and \(\lambda _P(t|Z)=0.02 \exp \{ \log {2.0} \times \mathrm{{st(age)}} + \log {1.25} \times \mathrm{{gender}} + \log {0.9} \times \mathrm{{st(year)}} \}\), respectively, where \(\mathrm{{st(age)}}=(\mathrm{{age}}-60)/10\) and \(\mathrm{{st(year)}}=(\mathrm{{year}}-2000)/10\). We considered four scenarios for the magnitude of the association between \(T_E\) and X: \(\beta _X = \log {1.0},\ \log {1.1},\ \log {1.2}\), or \(\log {1.3}\) in Datasets 1–4, respectively. In all datasets, \(T_E\) and \(T_P\) were conditionally independent given the covariates Z. The potential censoring time C was generated from the uniform distribution on (0, 30). We set the number of subjects to n = 200, 500, or 1000. For each scenario, 1000 datasets were simulated.
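The data-generating mechanism above can be sketched as follows (the function and variable names are ours):

```python
import numpy as np

def simulate_dataset(n, beta_X, rng):
    """Generate one dataset following the simulation design of Sect. 5."""
    age = rng.normal(60, 10, n)
    gender = rng.integers(0, 2, n)
    year = rng.integers(2000, 2011, n)   # discrete uniform on 2000..2010
    X = rng.standard_normal(n)
    st_age = (age - 60) / 10
    st_year = (year - 2000) / 10
    # Exponential hazards for cancer (excess) and other-cause death
    lam_E = 0.20 * np.exp(np.log(1.3) * st_age + np.log(1.25) * gender
                          + np.log(0.8) * st_year + beta_X * X)
    lam_P = 0.02 * np.exp(np.log(2.0) * st_age + np.log(1.25) * gender
                          + np.log(0.9) * st_year)
    T_E = rng.exponential(1 / lam_E)
    T_P = rng.exponential(1 / lam_P)
    C = rng.uniform(0, 30, n)
    T_O = np.minimum(T_E, T_P)
    T = np.minimum(T_O, C)
    delta = (T_O <= C).astype(int)
    Z = np.column_stack([st_age, gender, st_year, X])
    return T, delta, Z, lam_P
```

Note that \(T_E\) and \(T_P\) are drawn independently given the covariates, matching assumption (A1).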

We fitted the Cox model (1) with \(Z=\{\mathrm{{st(age)}}, \mathrm{{gender}}, \mathrm{{st(year)}}, X\}\) in the analyses. The regression coefficients were estimated by the maximum likelihood method with the EM algorithm of Perme et al. (2009), and their variances were estimated by the estimator (10) and by Louis’ method. Because the survival function for death from other causes, \(S_P(t|Z)\), is regarded as a known function in general cancer registry analyses, we used the true \(S_P(t|Z)\) with \(t= 1,2,\ldots \) in the analyses. We matched the three covariates age, gender, and year to extract \(S_P(t|Z)\) for each cancer patient. We evaluated the empirical mean of the variance estimates, the empirical power, and the coverage probability (CP) for each regression coefficient.

Table 3 Simulation results with \(n = 200\): rMSE implies root mean squared error (\(\times 10^2\)) and CP implies empirical coverage probability(%) for a 95% nominal level
Table 4 Simulation results with \(n = 500\): rMSE implies root mean squared error (\(\times 10^2\)) and CP implies empirical coverage probability (%) for a 95% nominal level
Table 5 Simulation results with \(n = 1000\): rMSE implies root mean squared error (\(\times 10^2\)) and CP implies empirical coverage probability (%) for a 95% nominal level

The results for the \(n=200\), 500, and 1000 cases are summarized in Tables 3, 4, and 5, respectively. The coverage probabilities of the proposed method (10) were close to the nominal level of 95% with \(n=500\) and \(n=1000\), whereas slight anti-conservativeness was observed with \(n=200\). The averages and the empirical coverage probabilities of the variance estimates were almost identical between the method (10) and Louis’ method throughout the simulation scenarios, suggesting that the two methods give very similar estimates. To illustrate this, Fig. 1 shows cross-plots of the standard errors from the two methods for \(n=200\). For all the variables, the standard errors lie near the diagonal line, indicating agreement between the two methods.

Fig. 1
figure 1

Comparison of the standard error estimates between the two methods for the 1000 simulated datasets in Dataset 4 with \(n=200\); the horizontal axis is for Louis’ method and the vertical axis is for the semiparametric-based method

6 Illustration

We illustrate the proposed method with cancer registry data from the Surveillance, Epidemiology, and End Results (SEER) Program. We focused on a subgroup of all adults aged 60–69 years who were diagnosed with stomach, lung, or liver cancer from 2005 to 2010 in 17 areas covering approximately 26.5% of the U.S. population. All patients were followed up to 10 years after diagnosis. The data were analyzed by cancer site (stomach, lung, and liver). For each cancer site, the model (1) was applied with six explanatory covariates: age at diagnosis, gender, year at diagnosis, race (White/Black/Others), stage (Localized/Regional/Distant), and income < $55,000. The regression coefficients were estimated by the EM-based method of Perme et al. (2009), and their variances were calculated by Louis’ method and by (10). To calculate \(S_P(t|Z)\), we used the U.S. population life table released by the SEER Program, which has information on annual survival by age, gender, year, and race. For each cancer patient in the cancer registry, \(S_P(t|Z)\) was extracted by matching the four covariates age, gender, year, and race.
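The life-table lookup described above can be sketched as follows; the table layout (a mapping from (age, gender, year, race) to an annual survival probability) is a hypothetical simplification, not the actual SEER file format.

```python
def expected_survival(life_table, age, gender, year, race, t):
    """Cumulative expected survival S_P(t|Z) over t whole years, obtained by
    multiplying annual survival probabilities while advancing the attained
    age and the calendar year (hypothetical table layout)."""
    s = 1.0
    for k in range(t):
        s *= life_table[(age + k, gender, year + k, race)]
    return s

# Toy table: a flat annual survival of 0.98 for every combination looked up.
table = {(a, "male", y, "White"): 0.98
         for a in range(60, 70) for y in range(2005, 2015)}
s3 = expected_survival(table, 62, "male", 2006, "White", 3)
```

Advancing both the attained age and the calendar year at each step reflects how expected survival is usually accumulated from period life tables.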

Table 6 Summary of the SEER data by cancer site (stomach, lung, and liver); age at diagnosis, year at diagnosis, and survival time are summarized by the median with interquartile range (median [IQR]), and the other variables are summarized by frequency and proportion
Table 7 Results of the analyses of the SEER data by cancer site (stomach, lung, and liver); the Cox proportional excess hazard model with the covariates listed below as explanatory variables was applied by cancer site

In Table 6, we summarize the patients' characteristics of the SEER database by cancer site. For the stomach, lung, and liver cancers, 3987, 48,741, and 4608 patients died among the 5313, 56,412, and 5446 registered, respectively. The results of the parameter estimation are summarized in Table 7. The two variance estimators gave very similar 95% CIs and p values.

7 Discussion

As in standard survival analysis, regression models play very important roles in the analysis of cancer registry data. Many regression models have been proposed in the relative survival setting (Rubio et al., 2018; Lambert et al., 2005; Cortese & Scheike, 2008; Bolard et al., 2002; Giorgi et al., 2003). Given the substantial popularity of the original Cox proportional hazards model (Cox, 1972), the Cox excess hazards regression would be one of the most important and appealing regression models in cancer registry data analysis. The introduction of a simple EM-based algorithm for the maximum likelihood estimator (Perme et al., 2009) is of great practical value, and it was successfully applied in a real population study (Allemani et al., 2018). On the other hand, its formal theoretical justification was left unclear. This paper fills the gap by establishing consistency, asymptotic normality, and semiparametric efficiency. Although our theoretical justification covered only the variance estimator (10), the agreement between the two estimators observed in the simulation studies also suggests the validity of Louis's estimator.

A typical use of a regression model for cancer registry data is to evaluate conditional hazards given potential confounders, as done by Derks et al. (2018), Schuil et al. (2018), and Allemani et al. (2018). In recent years, studies combining cancer registry data with data from other databases have been conducted, and the search for factors that affect cancer prognosis has become increasingly important (Woods et al., 2021; Li et al., 2021). Regression models also play very important roles in making inference on marginal hazards. For example, Komukai and Hattori (2017, 2020) proposed doubly robust inference procedures for the marginal net survival and the relative survival ratio in the presence of covariate-dependent censoring, in which regression models for the censoring time and the survival time played crucial roles. Estimation of causal quantities under the relative survival setting was discussed based on regression standardization by Syriopoulou et al. (2021). To incorporate the Cox excess hazards model into these settings, a sound theoretical basis for the model is essential. More specifically, the consistency result and the efficient influence function (11) will be very useful in showing the consistency and deriving the asymptotic variance, respectively, of estimators incorporating the Cox excess hazards model. Our development would be helpful in developing rigorous methods for such incomplete-data analyses of marginal quantities.

Finally, we conclude the paper by discussing the assumption (A1). It is a fundamental assumption in the analysis of cancer registry data, playing a role analogous to the independent censoring assumption (Fleming and Harrington, 1991, page 128) in standard survival analysis. A simple way to make the assumption (A1) plausible is to collect and include many covariates so that (A1) holds. However, this brings a difficulty specific to cancer registry data: even if additional covariates are collected in the cancer registries, the population life tables may not contain them. This missing data problem has been addressed by Touraine et al. (2020) and Rubio et al. (2021). However, their developments are not fully satisfactory, and further research is warranted, possibly with an EM-based method like the one proposed in this paper.