1 Introduction

Several parametric and nonparametric approaches have been developed in survival analysis to model the occurrence or survival time of an event. Most of these methods focus on estimating conditional and unconditional survival probabilities up to a certain point in time and on predicting the conditional survival time using covariates (Kalbfleisch & Prentice, 2011; Fleming & Harrington, 2011; Ibrahim et al., 2001; Lawless, 2002). The outcome variable is the observed time, which is either the survival time or the censoring time, and its actual meaning is indicated by the censoring indicator. Such a characteristic of the outcome variable requires an approach that differs from conventional statistical models. The Cox proportional hazards model (Cox, 1972) is the representative method widely used for time-to-event data; it specifies the hazard function associated with the survival outcome as a multiplicative structure between the baseline hazard function and the exponential function of the linear predictor. Estimators of the regression coefficients are obtained by numerically maximizing the partial likelihood of the observed data; these estimated coefficients can be used not only for inference on the covariates' effects, but also for predicting survival time at a specific level of the covariates.

Although estimating the regression coefficients of the Cox proportional hazards model helps to understand the characteristics of the hazard function, the proportional hazards assumption, namely that the hazard ratio between any levels of the predictor is constant over time, must hold. An accelerated failure time model (Wei, 1992), which relates the log-transformed survival time to the predictors through an error term following a specific probability distribution, can be an alternative to the Cox proportional hazards model in survival analysis. However, it also requires the accelerated failure time assumption that any level of the predictor acts additively on the log survival time. As with these two typical models, most survival models come with model-based assumptions, so predicting the survival time hinges on those assumptions. The prediction of the survival time may be inaccurate if the assumed model is incorrect in reality. Following the work of Wang et al. (2016), a support vector hazards machine (SVHM), which estimates the hyperplane maximizing the margin between censored and uncensored observations at each survival time, can be used to predict the survival time without any model-based assumptions.

When the dimension of the covariates is large, many statistical models also suffer from difficulties such as computational instability, noise accumulation, and variable selection problems (Clarke et al., 2009). The regularized solution, which minimizes an objective function comprising an empirical risk term and a penalty term, has great advantages in that it leads to numerical stability and avoids overfitting (Tibshirani, 1997). However, this approach is insufficient when the dimension of the covariates is extremely large. Various feature screening methods with specific modeling assumptions have been suggested to filter out the massive number of non-informative covariates, complementing regularized methods. Fan and Lv (2010) recommended a two-stage approach that screens out the non-informative covariates in the first stage and applies a regularization method in the second stage. However, these existing variable selection methods may not be optimal when prediction and variable selection for time-to-event data must be achieved simultaneously. Specifically, if the negative partial log-likelihood loss is replaced by another loss function for the purpose of prediction, the optimal decision function estimated under the modified loss may conflict with the important covariates selected under the partial likelihood framework. If, instead, one can estimate the optimal decision function under the modified loss function and extract clues about the important covariates from this decision function, it becomes more desirable and accessible to achieve the two objectives of prediction and variable selection simultaneously.

This study aims to predict the survival time and to develop a variable selection method for time-to-event data. We use the SVHM framework with two different weights to estimate the time-dependent intercept and the time-independent risk score after representing the observed time-to-event data as dichotomized counting processes. We then predict the survival time of the event based on pairs formed by the ranks of the estimated risk scores and the observed survival times. Finally, we measure the contribution of each marginal covariate to the optimal decision function through gradient information (He et al., 2021; Park & Park, 2021; Xia, 2007; Xia et al., 2002; Fukumizu & Leng, 2014). The selection method based on gradient information assumes that the corresponding partial derivative of the optimal decision function must be zero if a particular covariate has a negligible effect on the survival outcome. Following this principle, we compute the partial derivatives, with respect to each marginal covariate, of the optimal decision function producing the time-independent risk score to conduct the variable selection.

We consider the proposal to make several contributions from a statistical point of view. While the existing SVHM cannot conduct variable selection, our proposed method adds the role of variable selection to the SVHM, which yields a unified framework for survival analysis. In particular, the proposed method differs from conventional prediction and variable selection methods in that it does not require any premised belief such as the proportional hazards or the accelerated failure time assumption. Also, the predicted survival times and the selected variables from our proposed method are more systematically consistent because both are obtained within the same loss function framework, with the same kernel choice, and in a single estimation stage, in contrast to existing two-stage models. For instance, survival times predicted by the hinge loss in the SVHM and important variables selected by the negative partial likelihood loss in the Cox model may be somewhat inconsistent with each other. In addition, we propose an inverse probability weight to deal with unbalanced time-to-event data, which opens the possibility of incorporating the suggested approach into various weighted models.

The remainder of this paper is organized as follows. In Sect. 2, we introduce relevant notations, fundamental results of the SVHM, and variable selection approaches. In Sect. 3, we perform quantitative studies using the proposed approach, and we provide an illustrative example based on real data in Sect. 4. We present concluding remarks in Sect. 5.

2 Methodology

2.1 Support vector hazards machine

We start with the mathematical notation and fundamental concepts of the SVHM (Wang et al., 2016) in this subsection. Let a random vector \(X=(x_1,\ldots ,x_p)^{T} \in {\mathbb {R}}^{p}\) be a set of predictors, and let random variables \(T^{*} \in {\mathbb {R}}^{+}\) and \(C \in {\mathbb {R}}^{+}\) be the survival time and censoring time, respectively. The observed time and censoring indicator are defined as \(T = \min (T^{*}, C)\) and \(\Delta =I(T^{*}\le C)\), respectively, where \(I(\cdot )\) is an indicator function and \(\Delta\) is the event indicator. We assume that the survival time is independent of the censoring time given the predictors X. We observed a sample of n subjects given by \(\{(X_i, T_i, \Delta _i ): i=1, \ldots , n\}\), where \(X_i\), \(T_i\) and \(\Delta _i\) denote the p-dimensional vector of covariates, the observed time, and the censoring indicator of the ith individual, respectively. Let \(N_i(t)= \Delta _i I( {T}_i\le t )\) be the counting process and \(Y_i(t)=I({T}_i \ge t )\) be the at-risk process for the ith individual for any \(t \in {\mathcal {T}}= [0,\tau ]\), where \(\tau\) denotes the stopping time. \(dN_i(t)\) is the jump size of the counting process in a small time interval \([t, t+dt]\). Thus, \(dN_i(t)=1\) if \(T_i \in [t, t+dt]\), and \(dN_i(t)=0\) otherwise. Let \(t_{(1)}\le t_{(2)} \le \cdots \le t_{(q)}\) be the order statistics for the q distinct event times in the observed dataset, where it is assumed that there are no ties in the event times.

We defined the dichotomized variable for each subject and event time as \(\delta N_i(t_{(j)}) = 2(N_i(t_{(j)}) - N_i(t_{(j)}-)) -1\), which takes the value of 1 if the survival time of the ith subject is observed, and \(-1\) otherwise.
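As an illustration, the following sketch (in Python with NumPy; the function name risk_and_labels is ours, not part of the original SVHM software) constructs the at-risk process \(Y_i(t_{(j)})\) and the dichotomized labels \(\delta N_i(t_{(j)})\) from the observed pairs \((T_i, \Delta _i)\).

```python
# A minimal sketch of the counting-process quantities used above: the at-risk
# indicator Y_i(t_(j)) and the dichotomized label delta N_i(t_(j)).
import numpy as np

def risk_and_labels(T, Delta):
    """T: observed times (n,); Delta: event indicators (n,).
    Returns the distinct event times t_(1) <= ... <= t_(q), the at-risk
    matrix Y[i, j] = I(T_i >= t_(j)), and the dichotomized labels
    dN[i, j] = +1 if subject i fails at t_(j) and -1 otherwise."""
    T, Delta = np.asarray(T, float), np.asarray(Delta, int)
    event_times = np.sort(np.unique(T[Delta == 1]))            # t_(1), ..., t_(q)
    Y = (T[:, None] >= event_times[None, :]).astype(int)       # at-risk process
    dN = np.where((T[:, None] == event_times[None, :]) & (Delta[:, None] == 1), 1, -1)
    return event_times, Y, dN

# toy usage
t, Y, dN = risk_and_labels(T=[2.0, 5.0, 3.0, 7.0], Delta=[1, 0, 1, 1])
```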

We considered the time-dependent risk score \(f_0(t, X)\), where \(f_0: {\mathcal {T}} \times {\mathbb {R}}^{p} \rightarrow {\mathbb {R}}\) is a nonparametric smooth function of the covariates at a specific time. Suppose that this general risk score comprises an intercept term, \(\mu : {\mathcal {T}} \rightarrow {\mathbb {R}}\), as a function of time, and a nonparametric risk score term, \(f: {\mathbb {R}}^{p} \rightarrow {\mathbb {R}}\), as a function of the covariates, that is, \(f_0(t, X) = \mu (t) + f(X)\). We used the time-dependent risk score to predict whether the corresponding subject experiences a failure event at the next immediate time. Specifically, when the ith subject is still contained in the risk set at time t, we predicted the ith subject to have an event if \(f_0(t, X) \ge 0\), and not to have the event if \(f_0(t, X) < 0\). When a symmetric positive definite kernel \(k(\cdot , \cdot ): {\mathbb {R}}^{p} \times {\mathbb {R}}^{p} \rightarrow {\mathbb {R}}\) is available, we can construct a feature map with the kernel, that is, \({\phi }(X)=k(X, \cdot )\) such that \({\phi }(\cdot ): {\mathbb {R}}^{p} \rightarrow {\mathcal {H}}\). By defining the inner product between feature maps as the value of the kernel, \({k}(X, X')= \langle {\phi }(X), {\phi }(X') \rangle _{{\mathcal {H}}}\), we can establish a unique reproducing kernel Hilbert space (RKHS), denoted by \(({\mathcal {H}}, \langle \cdot , \cdot \rangle )\), such that the reproducing property \(f({X})=\langle f, {k}(\cdot , {X})\rangle\) holds for all \(f\in {\mathcal {H}}\) and \(X \in {\mathbb {R}}^{p}\).

The optimal separating hyperplane between subjects with and without the event at each time is the hyperplane that can create the largest margin between the two classes. Following Wang et al. (2016), we can express such an optimization problem as follows:

$$\begin{aligned} \min _{\mu , f}&~\frac{1}{2}\Vert f\Vert _{{\mathcal {H}} }^2+\lambda ^{-1} \sum _{i=1}^n \sum _{j=1}^q w_i(t_j) Y_i(t_j) \xi _i(t_j), \nonumber \\&\text {subject to}~~ Y_i(t_j) \delta N_i(t_j)\left\{ \mu (t_j) +f({X}_i)\right\} \ge Y_i(t_j)\left\{ 1-\xi _i(t_j)\right\} , \nonumber \\&\qquad \qquad \quad Y_i(t_j) \xi _i(t_j) \ge 0, \nonumber \\&\text {for}~~i=1, \ldots , n, \quad j=1, \ldots , q, \end{aligned}$$
(1)

where the slack variable \(\xi _i(t_j)\) allows the prediction of the ith subject to fall on the wrong side of its margin at the jth event time, the weight \(w_i(t_j)\) adjusts the imbalance of the dichotomized response of the ith subject at the jth event time, the cost parameter \(\lambda ^{-1}\) controls the total sum of the weighted slack variables, and \(\Vert f\Vert _{{\mathcal {H}} }\) denotes the norm in the RKHS. Problem (1) is a convex optimization problem with inequality constraints, and using Lagrange multipliers leads to the primal function

$$\begin{aligned} L_p&=\frac{1}{2}\Vert f\Vert _{{\mathcal {H}}}^2+\lambda ^{-1} \sum _{i=1}^n \sum _{j=1}^q w_i(t_j) Y_i(t_j) \xi _i(t_j)-\sum _{i=1}^n \sum _{j=1}^q \gamma _{i j} Y_i(t_j) \xi _i(t_j) \\&\quad -\sum _{i=1}^n \sum _{j=1}^q \alpha _{i j}\left[ Y_i(t_j) \delta N_i(t_j)\left\{ \mu (t_j)+f({X}_i)\right\} -Y_i(t_j)\left\{ 1-\xi _i(t_j)\right\} \right] , \end{aligned}$$

which must be minimized with respect to \(\mu\), f, and \(\xi _i(t_j)\), where \(\alpha _{i j}\ge 0\) and \(\gamma _{i j} \ge 0\) are the corresponding Lagrange multipliers. By the reproducing property (Aronszajn, 1950), we can represent \(f \in {\mathcal {H}}\) through the evaluation functional, \(f(X)= \langle f, \phi (X) \rangle\). Substituting this representation into the primal function and setting the partial derivatives to zero, we obtain

$$\begin{aligned} f&=\sum _{i=1}^n \sum _{j=1}^q \alpha _{i j} Y_i(t_j) \delta N_i(t_j) \phi ({X}_i), \\ 0&= \sum _{i=1}^n \alpha _{i j} Y_i(t_j) \delta N_i(t_j), \\ \alpha _{i j} Y_i(t_j)&= \lambda ^{-1} w_i(t_j) Y_i(t_j)-\gamma _{i j} Y_i(t_j), \quad \text {for}~~ i=1, \ldots , n,~~ j=1, \ldots , q. \end{aligned}$$

By substituting the above results into the primal function, we obtain the dual function

$$\begin{aligned} L_D=\sum _{i=1}^n \sum _{j=1}^q \alpha _{i j} Y_i(t_j)-\frac{1}{2} \sum _{i=1}^n \sum _{i^{\prime }=1}^n \sum _{j=1}^q \sum _{j^{\prime }=1}^q \alpha _{i j} \alpha _{i^{\prime } j^{\prime }} Y_i(t_j) Y_{i^{\prime }}(t_{j^{\prime }}) \delta N_i(t_j) \delta N_{i^{\prime }}(t_{j^{\prime }}) k({X}_i, {X}_{i^{\prime }}), \end{aligned}$$

which must be maximized with respect to the multipliers \(\alpha _{i j}\) subject to \(0 \le \alpha _{i j} \le \lambda ^{-1} w_i(t_j)\) for \(i=1, \ldots , n\), \(j=1, \ldots , q\), and \(\sum _{i=1}^n \alpha _{i j} Y_i(t_j) \delta N_i(t_j)=0\) for \(j=1, \ldots , q\). The Karush–Kuhn–Tucker (KKT) conditions include the constraints

$$\begin{aligned}&\left[ Y_i(t_j) \delta N_i(t_j)\left\{ \mu (t_j) +f({X}_i)\right\} -Y_i(t_j)\left\{ 1-\xi _i(t_j)\right\} \right] \ge 0 \\&\alpha _{i j}\left[ Y_i(t_j) \delta N_i(t_j)\left\{ \mu (t_j) +f({X}_i)\right\} -Y_i(t_j)\left\{ 1-\xi _i(t_j)\right\} \right] =0, \\&\gamma _{ij} \xi _i(t_j) = 0 \quad \text {for}~~i=1,\ldots ,n~~j=1, \dots , q \end{aligned}$$

characterizing the optimal solution for the above primal and dual objective functions. The solution for f has the form \({\widehat{f}} =\sum _{i=1}^n \sum _{j=1}^q {\widehat{\alpha }}_{i j} Y_i(t_j) \delta N_i(t_j) \phi ({X}_i)\), which yields the estimator for the smoothed risk score, given by

$$\begin{aligned} {\widehat{f}}({\textbf{x}})= \left\langle \sum _{i=1}^n \sum _{j=1}^q {\widehat{\alpha }}_{i j} Y_i(t_j) \delta N_i(t_j) \phi ({X}_i), \phi ({\textbf{x}}) \right\rangle = \sum _{i=1}^n \sum _{j=1}^q {\widehat{\alpha }}_{i j} Y_i(t_j) \delta N_i(t_j) k({X}_i,{\textbf{x}}), \end{aligned}$$
(2)

where \({\textbf{x}} \in {\mathbb {R}}^{p}\) denotes a test point for any covariate. Complementary slackness is summarized as the following equations

$$\begin{aligned}&\alpha _{i j}\left[ Y_i(t_j) \delta N_i(t_j)\left\{ \mu (t_j) +f({X}_i)\right\} -Y_i(t_j)\left\{ 1-\xi _i(t_j)\right\} \right] =0, \\&( \lambda ^{-1} w_i(t_j) - \alpha _{ij} ) \xi _i(t_j) Y_i(t_j) = 0 \quad \text {for}~~i=1,\ldots ,n,~~j=1, \ldots , q . \end{aligned}$$

These conditions imply that the ith subject is well separated at the jth time if \(\alpha _{i j}=0\), \(\xi _i(t_j)=0\), and \(Y_i(t_j)=1\); that the ith subject at the jth time is a support vector lying on the edge of the margin when \(0< \alpha _{i j} < \lambda ^{-1} w_i(t_j)\), \(\xi _i(t_j)=0\), and \(Y_i(t_j)=1\); and that it lies on the wrong side of the margin when \(\alpha _{i j} = \lambda ^{-1}w_i(t_j)\), \(\xi _i(t_j)>0\), and \(Y_i(t_j)=1\). According to the complementary slackness, the estimator for f in (2) is characterized by the support vectors across all subjects and event times. The optimal time-varying intercept is also estimated by using complementary slackness. At the jth time point \(t_j\), every subject lying on the edge of the margin satisfies the condition \(Y_i(t_j) \delta N_i(t_j)\left\{ \mu (t_j)+f({X}_i)\right\} -Y_i(t_j)=0\), which is equivalent to \({{\widehat{\mu }}}(t_j)= 1/\delta N_i(t_j)-f({X}_i)\). In practice, we used the average over all such subjects at that time.
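For concreteness, a minimal numerical sketch of this derivation is given below: it sets up the dual quadratic program and recovers \({\widehat{f}}\) and \({\widehat{\mu }}(t_j)\) from the multipliers. The helper names (gaussian_kernel, fit_svhm) and the use of the cvxopt solver are illustrative assumptions, not part of the original SVHM implementation.

```python
# A sketch of solving the dual of problem (1) with a generic QP solver and
# recovering the risk score (2) and the time-varying intercept.
import numpy as np
from cvxopt import matrix, solvers

solvers.options['show_progress'] = False

def gaussian_kernel(A, B, sigma):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / sigma)

def fit_svhm(X, Y, dN, w, lam, sigma):
    """X: (n, p) covariates; Y, dN, w: (n, q) at-risk, labels, weights."""
    n, q = Y.shape
    K = gaussian_kernel(X, X, sigma)
    idx = [(i, j) for i in range(n) for j in range(q) if Y[i, j] == 1]
    m = len(idx)
    rows = [i for i, _ in idx]
    z = np.array([dN[i, j] for i, j in idx], float)             # labels of at-risk pairs
    P = np.outer(z, z) * K[np.ix_(rows, rows)] + 1e-8 * np.eye(m)
    qvec = -np.ones(m)                                          # maximize sum alpha
    ub = np.array([w[i, j] / lam for i, j in idx])              # 0 <= alpha <= w / lambda
    G = np.vstack([-np.eye(m), np.eye(m)])
    h = np.concatenate([np.zeros(m), ub])
    A = np.zeros((q, m))                                        # sum_i alpha_ij dN_ij = 0, each j
    for c, (i, j) in enumerate(idx):
        A[j, c] = z[c]
    sol = solvers.qp(matrix(P), matrix(qvec), matrix(G), matrix(h),
                     matrix(A), matrix(np.zeros(q)))
    alpha = np.array(sol['x']).ravel()

    def f_hat(x_new):                                           # risk score, equation (2)
        k_new = gaussian_kernel(X[rows], np.atleast_2d(x_new), sigma).ravel()
        return float(np.sum(alpha * z * k_new))

    # time-varying intercept: average over margin support vectors at each t_j
    mu_hat = np.full(q, np.nan)
    for j in range(q):
        on_margin = [c for c, (i, jj) in enumerate(idx)
                     if jj == j and 1e-6 < alpha[c] < ub[c] - 1e-6]
        if on_margin:
            mu_hat[j] = np.mean([1.0 / z[c] - f_hat(X[idx[c][0]]) for c in on_margin])
    return alpha, z, idx, f_hat, mu_hat
```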

2.2 Gradient-based variable selection

The sparsity assumption that only a few covariates are associated with the survival outcome is more realistic than relating all covariates to the outcome. Such a sparsity assumption has been widely employed in various models over the last two decades in the variable selection literature. In this subsection, we develop a variable selection approach for the SVHM under the sparsity assumption. Suppose that the function f is continuously differentiable. If a covariate is strongly related to the risk score f associated with the survival time, it is not unreasonable to expect that a small change in the value of that covariate causes a large change in the value of the risk score. Following the work of Park and Park (2021), the partial derivative of f with respect to the jth predictor \(x_j\), given by

$$\begin{aligned} \frac{ \partial f(X) }{\partial x_j} = \frac{ \partial f(x_1, \ldots , x_p) }{\partial x_j} \equiv \partial _j f \end{aligned}$$

can serve as the aforementioned criterion by capturing the relative importance of the jth predictor at a fixed point. Suppose that the partial differential operator can be exchanged with the inner product in the RKHS and that it is a self-adjoint operator. It then follows from the reproducing property that

$$\begin{aligned} \partial _j f = \frac{ \partial }{\partial x_j} \langle f, {k}(\cdot , {X})\rangle _{{\mathcal {H}}} = \langle f, \frac{ \partial }{\partial x_j} {k}(\cdot , {X})\rangle _{{\mathcal {H}}} = \langle \frac{ \partial }{\partial x_j} f, {k}(\cdot , {X})\rangle _{{\mathcal {H}}} \end{aligned}$$

holds for all \(f\in {\mathcal {H}}\) and \(X \in {\mathbb {R}}^{p}\). From the Cauchy–Schwarz inequality, we have an upper bound:

$$\begin{aligned} \langle f, \frac{ \partial }{\partial x_j} {k}(\cdot , {X})\rangle _{{\mathcal {H}}} \le \Vert f \Vert _{{\mathcal {H}}} \Vert \frac{ \partial }{\partial x_j} {k}(\cdot , {X}) \Vert _{{\mathcal {H}}}. \end{aligned}$$

Thereafter, we observe that the partial derivative of f with respect to any predictor belongs to the RKHS when the partial derivative of the reproducing kernel is bounded, which reveals that the reproducing kernel remains an essential tool for obtaining the partial derivative of f. The above partial derivative depends on the fixed point \(x_j\), which might make it difficult to understand the importance of the covariate \(X_j\) across the entire sample space. If the marginal probability density function for X, denoted by \(P_X(x)\), is available, the \(L_2\) norm with respect to \(P_X(x)\), given by

$$\begin{aligned} \Vert \partial _j f \Vert _{L_2(P_X)}^2 = \int _{{\mathcal {X}}} ( \partial _j f )^2 dP_X(x), \end{aligned}$$
(3)

provides a better understanding of the importance of the covariate \(X_j\) across possible realizations. If the jth predictor has no relationship with the risk score associated with the survival outcome, then the \(L_2\) norm in (3) tends to be close to zero. Let

$$\begin{aligned} { {\mathcal {M}} } =\{ 1\le j \le p: \Vert \partial _j f \Vert _{L_2(P_X)}^2 \ne 0 \} \end{aligned}$$

be the true set of indices containing the important predictors, and let \(s=|{ {\mathcal {M}} }|\) be the number of elements in the true set, where s is assumed to be small relative to the sample size n and the dimension p. We use the estimator \({{\widehat{f}}}\) from Sect. 2.1 for the risk score f to compute the empirical partial derivatives for each predictor, and the empirical probability measure as the counterpart of the marginal probability measure of the predictors, to obtain an index set estimating the true sparse set. Taking the partial derivative of \({{\widehat{f}}}\) in (2) with respect to the jth predictor,

$$\begin{aligned} \partial _j {\widehat{f}} = \sum _{i=1}^n \sum _{m=1}^q {\widehat{\alpha }}_{i m} Y_i(t_m) \delta N_i(t_m) \frac{\partial k({X}_i,{\textbf{x}}) }{\partial x_j} \end{aligned}$$

and applying the empirical probability measure to the \(L_2\) norm in (3), we compute the estimator of the magnitude of the risk score with respect to the jth predictor, expressed as:

$$\begin{aligned} \Vert \partial _j {\widehat{f}} \Vert _n^2&= \frac{1}{n} \sum _{l=1}^n \biggl ( \frac{ \partial {{\widehat{f}}}({\textbf{x}}_l) }{\partial x_j } \biggr )^2 = \frac{1}{n} \sum _{l=1}^n \biggl (\langle \frac{ \partial }{\partial x_j} {\widehat{f}}, {k}(\cdot , {{\textbf{x}}}_l)\rangle _{{\mathcal {H}}} \biggr )^2 \nonumber \\&= \frac{1}{n} \sum _{l=1}^n \biggl ( \biggl \langle \sum _{i=1}^n \sum _{m=1}^q {\widehat{\alpha }}_{i m} Y_i(t_m) \delta N_i(t_m) \frac{\partial k(\cdot , {X}_i) }{\partial x_j}, {k}(\cdot , {{\textbf{x}}}_l) \biggr \rangle _{{\mathcal {H}}} \biggr )^2\nonumber \\&= \frac{1}{n} \sum _{l=1}^n \biggl ( \sum _{i=1}^n \sum _{m=1}^q {\widehat{\alpha }}_{i m} Y_i(t_m) \delta N_i(t_m) \frac{\partial k({X}_i, {{\textbf{x}}}_l) }{\partial x_j} \biggr )^2 . \end{aligned}$$
(4)

We utilized the empirical norm of the partial derivative for the risk score presented in (4) to select informative variables,

$$\begin{aligned} \widehat{ {\mathcal {M}} }({\gamma _n}) =\{ 1\le j \le p : \Vert \partial _j {\widehat{f}} \Vert _n^2 ~\text {is amongst the first}~{\gamma _n}~\text {largest of all values} \}, \end{aligned}$$
(5)

where \(\gamma _n\) is a predefined threshold value. The variable selection literature (Fan & Lv, 2010; Jeong et al., 2023) has commonly used \(\gamma _n = n-1\) or \(\gamma _n = [n/\log (n)]\), and we adopt \([n/\log (n)]\) as our predefined threshold value, where [a] denotes the greatest integer less than or equal to a.
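For illustration, the sketch below computes the empirical gradient norms in (4) for the Gaussian kernel \(k(u,v)=\exp \{-\Vert u-v\Vert ^2/\sigma \}\), whose partial derivative in the jth coordinate of the evaluation point is \(k(u,v)\,2(u_j-v_j)/\sigma\), and then applies the threshold rule (5). The inputs alpha, z, and idx are assumed to come from a fitted SVHM as in the earlier sketch; the function names are ours.

```python
# A sketch of the gradient-based screening in (4)-(5) under the Gaussian kernel.
import numpy as np

def gradient_importance(X, alpha, z, idx, sigma):
    """Returns ||partial_j f_hat||_n^2 for j = 1, ..., p, evaluated at the sample points."""
    n, p = X.shape
    sv = X[[i for i, _ in idx]]                        # support points (m, p)
    coef = alpha * z                                   # alpha_im * deltaN_i(t_m)
    scores = np.zeros(p)
    for l in range(n):                                 # evaluation points x_l
        diff = sv - X[l]                               # (m, p)
        k_vals = np.exp(-(diff ** 2).sum(1) / sigma)   # k(X_i, x_l)
        grad = (coef * k_vals) @ (2.0 * diff / sigma)  # all partial derivatives at once
        scores += grad ** 2
    return scores / n

def select_variables(scores, gamma_n):
    """Indices of the gamma_n largest empirical gradient norms, cf. (5)."""
    return set(np.argsort(scores)[::-1][:gamma_n])
```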

2.3 Properties of gradient based kernel selection

Let \({\mathcal {X}} \subset {\mathbb {R}}^p\) be a bounded, connected set including all possible values of the random vector X, and let \({\mathcal {H}}_k\) be the RKHS induced by the kernel k. Let \({\mathcal {M}}^{*}=\{1\le j \le p~|~\Vert \partial _j f^{*} \Vert _{P_X}^2 >0\}\) and \(\widehat{{\mathcal {M}}}(\rho )=\{1\le j \le p~|~\Vert \partial _j {{\widehat{f}}} \Vert _{n}^2 > \rho \}\) be the true set of indices of the important predictors and the estimated set of indices based on the empirical norm of the partial derivative function for each predictor, respectively, where \(\rho\) is a pre-defined threshold value. Since there is a one-to-one relationship between \(\gamma _n\) in (5) and \(\rho\) in \(\widehat{{\mathcal {M}}}(\rho )\), we mainly focus on \(\gamma _n\) in this subsection. Define the empirical risk

$$\begin{aligned} {\mathcal {R}}_n(f, \mu )= \frac{1}{n} \sum _{i=1}^n \sum _{j=1}^q Y_i(t_j)\left[ 1-(\mu (t_j)+f({X}_i)) \delta N_i(t_j)\right] _{+} \end{aligned}$$
(6)

where \([a]_{+}=\max \{0,a\}\). Then, define the empirical solution as

$$\begin{aligned} ({\widehat{\mu }}, {\widehat{f}}) = \text {arg} \min _{\mu , f} {\mathcal {R}}_n(f, \mu ) + \frac{1}{2} \lambda \Vert f\Vert ^2 \end{aligned}$$

and the true decision function as \((\mu ^{*}, {f}^{*}) = \text {arg} \min _{\mu , f} {\mathcal {R}}(f, \mu )\), respectively, where

$$\begin{aligned} {\mathcal {R}}(f, \mu )&= E\biggl [ \int Y(t)\left[ 1-\mu (t)-f(X)\right] _{+} dN(t) \biggr ] \\&\quad + \int \frac{E\bigl ( Y(t)\left[ 1+\mu (t)+f(X)\right] _{+}\bigr )}{E(Y(t))} E(dN(t)) . \end{aligned}$$

To conveniently develop the theoretical justification, instead of using the aforementioned risk functions \({\mathcal {R}}_n(f, \mu )\) and \({\mathcal {R}}(f, \mu )\), we use the profile risk function that excludes the effect of the time-dependent function \(\mu\). Following Wang et al. (2016), the profile risk can be derived as

$$\begin{aligned} \mathcal{P}\mathcal{R}(f) = {\mathcal {R}}(f, \mu ^{*})= E\biggl [ \Delta \, \frac{{\tilde{P}}\bigl \{ I({\tilde{Y}} \ge Y)[2-f({\tilde{X}})-f(X)]_+ \bigr \} }{{\tilde{P}}\bigl \{ I({\tilde{Y}} \ge Y)\bigr \} } \biggr ], \end{aligned}$$

where \(\mu ^{*}=\text {arg} \min _{\mu }{\mathcal {R}}(f, \mu )\) and \({\tilde{P}}\) is the probability measure with respect to an independent copy \(({\tilde{Y}}, {\tilde{X}}, \tilde{\Delta })\).

The following conditions are required to establish the properties of our method.

(A1) For each \(f \in {\mathcal {H}}_k\), f is continuous and continuously differentiable.

(A2) There exists a constant \(\kappa _1\) such that, for all \(j \in \{1,2,\dots ,p\}\),

$$\begin{aligned} \sup _{X \in {\mathcal {X}}} \Vert k(X,\cdot )\Vert \le \kappa _1 ~~~\text {and}~~ \sup _{X \in {\mathcal {X}}} \Vert \partial _j k(X,\cdot ) \Vert \le \kappa _1 . \end{aligned}$$
(A3) There exists a constant \(c_1\) such that, for \({\widehat{f}}, f^{*} \in {\mathcal {H}}_k\),

$$\begin{aligned} \Vert {\widehat{f}}-f^{*}\Vert \le c_1 |\mathcal{P}\mathcal{R}({\widehat{f}}) - \mathcal{P}\mathcal{R}(f^{*})| \end{aligned}$$

and there exists a constant \(c_2\) such that, for some positive \(\delta\),

$$\begin{aligned} P\Bigl ( |\mathcal{P}\mathcal{R}({\widehat{f}}) - \mathcal{P}\mathcal{R}(f^{*})| \ge c_2 n^{- \frac{q}{q+1}} \Bigr ) \le \frac{\delta }{2} . \end{aligned}$$
(A4) There exists a positive constant \(\eta < \frac{q}{q+1}\) such that, for some positive constant \(\kappa _2\),

$$\begin{aligned} \min _{j \in {\mathcal {M}}^{*}} \Vert \partial _j f^{*} \Vert ^2 > \kappa _2 (\log p)n^{-\eta } . \end{aligned}$$

Condition (A1) enables us to exclude functions that are not continuously differentiable but are strongly associated with the survival time. For instance, there may exist a piecewise constant function of the jth predictor that determines the survival time on some regions and is zero on the others, even though \(\Vert \partial _j f\Vert =0\). Under condition (A1), we can be more confident that the jth predictor is unimportant when we observe \(\Vert \partial _j f\Vert =0\). Condition (A2) provides the boundedness of the reproducing kernel and of its partial derivatives with respect to all marginal predictors. Condition (A2) is always satisfied for typical kernels such as the Gaussian, polynomial, and Sobolev kernels defined on a bounded domain. The first assertion of condition (A3) implies that the discrepancy between the empirical estimator and the true solution is bounded above by the discrepancy between the corresponding profiled risk functions. The second assertion of condition (A3) bounds the tail probability of the difference between the profiled risk functions evaluated at the estimator and at the true function, respectively. The rate of convergence of \(\mathcal{P}\mathcal{R}({\widehat{f}}) - \mathcal{P}\mathcal{R}(f^{*})\) to zero is known to be \(O_p( n^{- \frac{q}{q+1}})\) when taking \(\lambda =n^{-q/(q+1)}\) under some additional conditions, as introduced in Wang et al. (2016) or Remark 1, where \(q=1/(4/\xi +1)\) and \(\xi \in (0,2)\). We adopt this result as the basic condition to develop the theoretical justification for the proposed method. Consequently, condition (A3) allows us to bound the tail probability of \(\Vert {\widehat{f}}-f^{*}\Vert\), although it is a somewhat strong assumption. Condition (A4) assumes that the true gradient function for the important predictors carries enough signal to be discriminated from the uninformative predictors.

Remark 1

For the convergence rate of \(\mathcal{P}\mathcal{R}({\widehat{f}}) - \mathcal{P}\mathcal{R}(f^{*})\), Wang et al. (2016) assume that \(\lambda\) and \(\sigma\) go to zero, that \(n\lambda \sigma ^{p(2/\xi -1/2)}\) goes to infinity, and that \(E[Y(\tau )|X]\) is bounded away from zero, where \(\tau\) is the stopping time and \(\sigma\) is the scale factor in the Gaussian kernel defined as \(k(z_1,z_2)=\exp \{ {-\Vert z_1-z_2\Vert ^2/\sigma } \}\).

We establish two results as the justification of the proposed selection method.

Proposition 1

Suppose that conditions (A1)–(A3) are satisfied. Then, with probability at least \(1-\delta\),

$$\begin{aligned} \max _{1 \le j\le p} \biggl | \Vert \partial _j {\widehat{f}} \Vert _n^2 - \Vert \partial _j {f}^{*} \Vert _{P_X}^2 \biggr |&\le \kappa (\log p)n^{-\frac{q}{q+1}} \end{aligned}$$
(7)

where \(q=1/(4/\xi +1)\), \(\xi \in (0,2)\), and \(\kappa\) is some positive constant.

Proposition 1 gives the rate of convergence of the maximum discrepancy between the empirical norm of the partial derivative of the estimated SVHM and the \(L_2\) norm of the partial derivative of the true function over all predictors. This result is important because it implies that \(\Vert \partial _j {\widehat{f}} \Vert _n^2\) converges to \(\Vert \partial _j {f}^{*} \Vert _{P_X}^2\) and contributes to establishing the asymptotic selection consistency, as follows.

Proposition 2

Suppose that conditions (A1)–(A4) are satisfied and Proposition 1 holds. Let \(\rho =\frac{ \kappa }{2} (\log p)n^{-\eta }\). Then we have

$$\begin{aligned} P( \widehat{{\mathcal {M}}}(\rho ) ={\mathcal {M}}^{*} ) \longrightarrow 1 \end{aligned}$$
(8)

as n goes to infinity.

Proposition 2 demonstrates the asymptotic selection consistency of the proposed gradient-based selection method. Under condition (A4), the specific rate of the selection consistency is determined by the result in Proposition 1. Thereby, the proposed method keeps the informative predictors and filters out the uninformative predictors in the selection procedure with overwhelming probability.

2.4 Prediction, weight and tuning parameters

Following the work of Wang et al. (2016), we predicted the survival time from the risk score evaluated at specific predictor values. Specifically, we first sorted the observed survival times \(t_{(q)}\ge t_{(q-1)} \ge \cdots \ge t_{(1)}\) in decreasing order. Thereafter, we computed the risk score \({\widehat{f}}(\textbf{x}_{i_k})\) for the subjects with an observed event time, \(k=1,2,\ldots , q\), and sorted these risk scores \({\widehat{f}}(\textbf{x}_{(i_1)}) \le {\widehat{f}}(\textbf{x}_{(i_2)}) \le \cdots \le {\widehat{f}}(\textbf{x}_{(i_q)})\) in increasing order. We redefined the reference pairs for the prediction of the survival time, \(\{ ({\widehat{f}}(\textbf{x}_{(i_1)}), t_{(q)}), ({\widehat{f}}(\textbf{x}_{(i_2)}), t_{(q-1)}), \ldots , ({\widehat{f}}(\textbf{x}_{(i_q)}), t_{(1)}) \}\), as \(\{ ({\widehat{f}}(\textbf{x}_{(i_k)}), \widetilde{t}_{(k)}): k=1, \ldots , q \}\). We predicted the survival time for a new observation \(\textbf{x}\) as \({{\widehat{T}}} = \sum _{k \in N(h)} {\widetilde{t}}_{(k)}/|N(h)|\), where N(h) is the set of indices k for which \({\widehat{f}}(\textbf{x}_{(i_k)})\) lies within approximately distance h of \({\widehat{f}}(\textbf{x})\), \(k=1, \ldots , q\), and |N(h)| denotes the number of elements in N(h). For computational simplicity, we selected the three closest observed survival times in place of a fixed value of h.
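A minimal sketch of this rank-based prediction rule is given below (function name ours); it pairs the sorted risk scores with the reversely sorted event times and averages the event times of the closest reference scores.

```python
# A sketch of the rank-based survival time prediction described above.
import numpy as np

def predict_survival_time(f_hat_new, f_hat_events, t_events, n_neighbors=3):
    """f_hat_new: risk score of the new subject; f_hat_events, t_events: risk
    scores and observed times of the q subjects with an observed event."""
    ref_scores = np.sort(np.asarray(f_hat_events))              # increasing risk scores
    ref_times = np.sort(np.asarray(t_events))[::-1]             # decreasing event times
    # reference pairs (f_hat(x_(i_k)), t~_(k)); nearest scores vote for the time
    nearest = np.argsort(np.abs(ref_scores - f_hat_new))[:n_neighbors]
    return ref_times[nearest].mean()
```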

In classification problems, an imbalance of the response variable reduces the performance of the classifier. The nonparametric risk score introduced above classifies whether or not an event occurs at a specific time; because there is usually a single event at any given time while the rest of the observations remain at risk and event free, the situation closely resembles an unbalanced classification problem. Therefore, we need a subject- and time-specific weight to balance the occurrence and nonoccurrence of the event. We considered two types of weights, expressed as

$$\begin{aligned} w_i(t_j) = I(\delta N_i(t_j)=1)\frac{\sum _{l=1}^{n}Y_l(t_j)-1}{\sum _{l=1}^{n}Y_l(t_j)} + I(\delta N_i(t_j)=-1)\frac{1}{\sum _{l=1}^{n}Y_l(t_j)} \end{aligned}$$

proposed by Wang et al. (2016), and

$$\begin{aligned} w_i(t_j) = I(\delta N_i(t_j)=1)\frac{1}{ \exp \{ - G(t_j) \} } + I(\delta N_i(t_j)=-1) \end{aligned}$$

proposed by Yang et al. (2021), where

$$\begin{aligned} {G}(t_j)= \int _{0}^{t_j} \frac{\sum _{i=1}^{n} d {\widetilde{N}}_i(s)}{ \sum _{i=1}^{n} Y_i(s) }, \end{aligned}$$

and \({\widetilde{N}}_i(t)=(1-\Delta _i )I( {T}_i\le t )\) denotes the counting process of censoring for any \(t \in {\mathcal {T}}=[0, \tau ]\). Notably, the first type of weight scales the single event up toward the size of the risk set at a specific time and scales the nonoccurrences down so that they count as one observation at that time. The second weight is an inverse probability-of-censoring weight that adjusts the imbalance in the number of event occurrences by inflating each event to the expected number of trials in which the survival time would be observed at that time.
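The following sketch (function names ours) computes both weights from the at-risk matrix and the dichotomized labels; the censoring cumulative hazard G(t) is estimated with a Nelson–Aalen type estimator, as in the display above.

```python
# A sketch of the two weighting schemes: the risk-set balancing weight and the
# inverse probability-of-censoring weight exp{G(t_j)} for event labels.
import numpy as np

def risk_set_weight(Y, dN):
    """Y, dN: (n, q) at-risk matrix and dichotomized labels."""
    n_risk = Y.sum(axis=0)                                     # sum_l Y_l(t_j)
    w_event = (n_risk - 1) / n_risk                            # weight if dN = +1
    w_none = 1.0 / n_risk                                      # weight if dN = -1
    return np.where(dN == 1, w_event[None, :], w_none[None, :]) * Y

def ipcw_weight(T, Delta, event_times, dN):
    """Inverse probability-of-censoring weight: exp{G(t_j)} for events, 1 otherwise."""
    T, Delta = np.asarray(T, float), np.asarray(Delta, int)
    G = np.zeros(len(event_times))
    for j, t in enumerate(event_times):
        # censoring jumps up to t_j divided by the risk-set size at each jump
        cens_times = np.sort(np.unique(T[(Delta == 0) & (T <= t)]))
        G[j] = sum(((T == s) & (Delta == 0)).sum() / (T >= s).sum() for s in cens_times)
    return np.where(dN == 1, np.exp(G)[None, :], 1.0)
```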

For the choice of the tuning parameters, namely the cost parameter \(\lambda ^{-1}\) and the scale factor \(\sigma\) of the kernel, a grid search can be used to find the optimal estimates \({\widehat{\lambda }}_{cv}^{-1}\) and \({\widehat{\sigma }}_{cv}\) that minimize the k-fold cross-validation error, defined as the empirical root mean squared error in (10), after splitting the dataset into training and test sets. However, because the grid search is computationally expensive, it is practical mainly for datasets with low-dimensional predictors or a high censoring rate. For datasets with high-dimensional predictors and a low censoring rate, a value from a decreasing sequence less than one can be used as \({\widehat{\lambda }}\), because \(\lambda \rightarrow 0\) is assumed in Remark 1, and the results were not sensitive to the particular value. For instance, we observed that \({\widehat{\lambda }}^{-1}=1000\) and \({\widehat{\lambda }}^{-1}=100\) led to identical support vectors in the simulation study.

3 Simulation

We conducted four sets of numerical simulations to examine the finite-sample performance of the prediction and variable selection of our proposed method under different censoring rates, sample sizes, and numbers of covariates. In the first scenario, we generated p-dimensional random predictors \(X_i=(x_{i1},\ldots , x_{ip})^{T}\) from uniform distributions. Specifically, we independently generated the first five predictors \(x_{i1}, x_{i2}, \ldots , x_{i5} \sim \text {Uniform}(0, 5)\), and the other predictors \(x_{i6}, x_{i7}, \ldots , x_{ip} \sim \text {Uniform}(0, 1)\). We set the true proportional hazards regression coefficients to \(\beta = (1, 1, 1, 1, 1, 0, 0, \cdots , 0)^T\), that is, the true set of indices of the predictors associated with the outcome variable was \({ {\mathcal {M}} } =\{ 1,2,3, 4, 5\}\) for the variable selection. After generating the true survival probability \(U_i \sim \text {Uniform}(0, 1)\) for the ith observation, we generated the true survival time by inverting the survival function, that is,

$$\begin{aligned} T^*_i = \frac{1}{\lambda _0^{1/a}} \left[ \frac{-\text {log}(U_i)}{\text {exp}\left\{ \beta ^{T}X_i \right\} }\right] ^{1 / a}, \end{aligned}$$
(9)

where \(\lambda _0\) denotes the baseline hazard parameter, and we used \(\lambda _0 = 0.25\) and \(a=1\).

The censoring time of the ith observation was independently generated from an exponential distribution, \(C_i \sim \text {Exponential}(\lambda _i )\), where \(\lambda _i = \frac{cr}{1-cr} \lambda _0 /\text {exp}\{\beta ^T X_i \}\) for a censoring rate denoted by cr. We generated the observed survival time \(T_i = \min (T^{*}_i, C_i)\) and the censoring indicator \(\Delta _i=I(T^{*}_i\le C_i)\) for the ith observation, and obtained a sample of n observations, \(\{(X_i, T_i, \Delta _i ): i=1, \ldots , n\}\), as the simulation dataset. In the second scenario, we generated the important predictors \(x_{i1}, x_{i2}, \ldots , x_{i5} \sim \text {Uniform}(0, 5)\) and the unimportant predictors \(x_{i6}, x_{i7}, \ldots , x_{ip} \sim \text {Uniform}(0, 1)\), respectively. We defined the logarithm of the hazard ratio as the polynomial function

$$\begin{aligned} g(X_i) = x_{i1}\cdot x_{i2} \cdot x_{i3} + x_{i4}^2 + 5x_{i5} \end{aligned}$$

and generated the true survival time, where the polynomial function \(g(X_i)\) was substituted for the linear term \(\beta ^{T}X_i\) in (9) and the same baseline hazard value was used. We used correlated predictors in the third scenario. Specifically, the predictors were generated from \(N_p(\mu , \Sigma )\), where \(\mu = (3 \cdot 1^T_5, 0^T_{p - 5})^T\) and \(\Sigma = \Sigma _5 \oplus \Sigma _{p - 5}\), with \(\Sigma _5 = 0.5 \cdot I_5 + 0.5 \cdot 1_5 1_5^T\) and \(\Sigma _{p - 5} = 0.5 \cdot I_{p - 5} + 0.5 \cdot 1_{p - 5} 1_{p - 5}^T\), where \(\oplus\) denotes the direct sum, \(1_k\) is a \(k \times 1\) column vector of ones, and \(0_k\) is a \(k \times 1\) column vector of zeros. The corresponding survival times were generated from model (9) with \(a = 2\) and \(\lambda _0 = 0.1\). In the fourth scenario, we generated the first five predictors \(x_{i1}, x_{i2}, \ldots , x_{i5} \sim \text {Uniform}(0, 5)\) and the other predictors \(x_{i6}, x_{i7}, \dots , x_{ip} \sim \text {Uniform}(0, 1)\), and used the nonlinear function

$$\begin{aligned} g(X_i) = x_{i1} \cdot x_{i2} + 5 \cos (x_{i3}) + x_{i4}^2 + 3 x_{i5} \end{aligned}$$

as the logarithm of the hazard ratio.

Following a similar procedure, we generated the censoring times, observed survival times, and censoring indicators. For each simulation scenario, we considered sample sizes of 50 and 100 (denoted by n) and censoring rates of \(20\%\), \(40\%\), and \(60\%\) (denoted by cr), while the number of predictors varied over 50, 200, and 1000 (denoted by p). Additionally, for each scenario, we simulated 100 datasets, each consisting of training data with a sample size of n and test data with a sample size of 500 (denoted by \(n^{\text {test}})\). Notably, the first five covariates were the important covariates for variable selection in all scenarios.
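For concreteness, a sketch of the first scenario's data generation is given below; it follows (9) with \(\lambda _0=0.25\), \(a=1\), and exponential censoring with the rate stated above (the function name and the random seed are illustrative).

```python
# A sketch of the first simulation scenario.
import numpy as np

def simulate_scenario1(n=100, p=50, cr=0.4, lam0=0.25, a=1.0, seed=0):
    rng = np.random.default_rng(seed)
    X = np.hstack([rng.uniform(0, 5, (n, 5)), rng.uniform(0, 1, (n, p - 5))])
    beta = np.concatenate([np.ones(5), np.zeros(p - 5)])
    U = rng.uniform(0, 1, n)
    T_star = (1 / lam0 ** (1 / a)) * (-np.log(U) / np.exp(X @ beta)) ** (1 / a)  # eq. (9)
    lam_c = cr / (1 - cr) * lam0 / np.exp(X @ beta)  # censoring rate lambda_i as specified
    C = rng.exponential(1 / lam_c)                   # Exponential(lambda_i); numpy uses scale = 1/rate
    T = np.minimum(T_star, C)
    Delta = (T_star <= C).astype(int)
    return X, T, Delta
```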

For each new observation \((X_i^{\text {test}}, T_i^{\text {test}}, \Delta _i^{\text {test}})\) in the test dataset, we estimated the smoothed risk score \({{\widehat{f}}}(\cdot )\) from the training dataset, computed the predicted risk score \({\widehat{f}}(X_i^{\text {test}})\), and then predicted the survival time of the event, \({\widehat{T}}_i^{\text {test}}\), for each iteration. For the prediction accuracy of the proposed method, we considered two performance measures: the empirical root mean squared error (RMSE)

$$\begin{aligned} RMSE = \sqrt{\sum _{i=1}^{n^{\text {test}}} \Delta _i^{\text {test}}(T_i^{\text {test}} - {\widehat{T}}_i^{\text {test}})^2/{\sum _{i=1}^{n^{\text {test}}}\Delta _i^{\text {test}}}} \end{aligned}$$
(10)

and the empirical concordance index (CCI)

$$\begin{aligned} CCI = \frac{ \sum _{i=1}^{n^{\text {test}}} \Delta _i^{\text {test}}\left\{ \sum \limits _{T_j^{\text {test}}> T_i^{\text {test}}} I({\widehat{f}}(X_j^{\text {test}})>{\widehat{f}}(X_i^{\text {test}}))\right\} }{\sum _{i=1}^{n^{\text {test}}} \Delta _i^{\text {test}} \left\{ \sum \limits _{T_j^{\text {test}} > T_i^{\text {test}}} 1 \right\} }, \end{aligned}$$

respectively. The CCI represents the proportion of comparable pairs of time-to-event data in which the ranking of the predicted values is correctly arranged. Furthermore, we computed the empirical partial derivative of the smoothed risk score \(\widehat{f}(X)\) with respect to each marginal covariate and thereby obtained the estimated set of indices \(\widehat{{\mathcal {M}}}\) for the true important subset \({\mathcal {M}}\), as mentioned in (5). For the accuracy of the proposed method's variable selection, we considered three performance measures: the empirical true positive rate, defined as \(TPR = | {\mathcal {M}} \cap \widehat{{\mathcal {M}}}| /|{\mathcal {M}}|\), the empirical false positive rate, defined as \(FPR = | {\mathcal {M}}^c \cap \widehat{{\mathcal {M}}}| /|{\mathcal {M}}^c|\), and the number of elements of the smallest index set \(\widehat{{\mathcal {M}}}\) that contains all the true important covariates (denoted by \({\widehat{d}}\)), where \({\mathcal {A}}^c\) denotes the complement of a set \({\mathcal {A}}\) and \(|{{\mathcal {A}}}|\) denotes the number of its elements.
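A sketch of these evaluation measures is given below (function names ours); the concordance function follows the CCI display above, counting, for each uncensored subject i, the comparable subjects j with a longer observed time.

```python
# A sketch of the evaluation measures: the RMSE in (10) over uncensored test
# subjects, the concordance index (CCI), and the selection measures TPR / FPR.
import numpy as np

def rmse(T, Delta, T_pred):
    ev = np.asarray(Delta) == 1
    return np.sqrt(np.mean((np.asarray(T)[ev] - np.asarray(T_pred)[ev]) ** 2))

def concordance(T, Delta, risk):
    """Proportion of comparable pairs (i uncensored, T_j > T_i) satisfying the
    indicator in the CCI display above."""
    T, Delta, risk = map(np.asarray, (T, Delta, risk))
    num = den = 0
    for i in np.where(Delta == 1)[0]:
        comp = T > T[i]
        num += np.sum(risk[comp] > risk[i])
        den += np.sum(comp)
    return num / den

def tpr_fpr(selected, true_set, p):
    selected, true_set = set(selected), set(true_set)
    tpr = len(selected & true_set) / len(true_set)
    fpr = len(selected - true_set) / (p - len(true_set))
    return tpr, fpr
```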

For our proposed methods, we used the SVHM with the weight based on the risk process and with the weight based on the inverse survival probability of the censoring distribution, as described in Sect. 2.4, denoted by Models (1) and (2), respectively. For comparison, we employed the support vector regression (SVR) method introduced in Khan and Zubek (2008) and the ranking-based support vector machine (SVMR) introduced in Van Belle et al. (2011) for time-to-event data, denoted by Models (3) and (4), respectively. Notably, the approach of Khan and Zubek (2008) directly predicts the survival time by applying a variant of the epsilon-insensitive hinge loss function depending on the censoring status, whereas Van Belle et al. (2011) predict the survival time by voting among the nearest observed survival times, computed from the risk score and its ranking as mentioned in Sect. 2.4.

We used the Gaussian radial basis function (RBF) kernel, \(k(\textbf{x},\textbf{x}')=\text {exp}(- \Vert \textbf{x}- \textbf{x}'\Vert ^2/\sigma )\), for all methods. As the gradient of the smoothed risk score is represented by a linear combination of the partial derivatives of the selected kernel, we had to choose a kernel with sufficiently varied partial derivatives, and the Gaussian RBF kernel satisfies this condition. For the scale parameter, we used the fixed value \(\sigma =1/p\) to control the accumulation of noise in the \(L_2\) distance of the exponent caused by the high dimensionality of the predictors. To compute the inverse probability weight, we used the Kaplan–Meier estimates of the survival function of the censoring distribution for each observation.

Table 1 Prediction measures in the first scenario (\(n = 100\))
Table 2 Prediction measures in the second scenario (\(n = 100\))
Table 3 Prediction measures in the third scenario (\(n = 100\))
Table 4 Prediction measures in the fourth scenario (\(n = 100\))

Tables 1, 2, 3 and 4 present the empirical averages over all replications of the prediction performance measures, RMSE and CCI, for a sample size of 100 in all simulation scenarios, where the numbers in parentheses indicate the empirical standard deviations. Clearly, the SVHM with the weight based on the risk process (Model (1)) outperformed the other methods across all cases of \(p=50\), 200, and 1000 in terms of RMSE in all simulation scenarios. The weight based on the inverse probability (Model (2)) showed similar or slightly inferior performance compared with Model (1) across the different dimension settings in the linear, polynomial, and nonlinear scenarios with independent predictors (Tables 1, 2 and 4). Although the RMSE performance of the SVR (Model (3)) was not better than that of Models (1) and (2) in Tables 1, 2 and 3 for the first three scenarios, these results were not as poor as those of Model (3) shown in Table 4 for the nonlinear scenario. Interestingly, Model (3) performed well when the censoring rate and dimension were set to \(60\%\) and \(p=1000\) in Table 3 for the correlated predictors scenario. The ranking-based support vector machine (Model (4)) exhibited unsatisfactory results. We observed that Models (1) and (2) outperformed the other methods across all dimension settings and all four simulation scenarios in terms of CCI. We also see that Model (1) provided more stable results across the different censoring rates and dimension settings. Overall, for all methods, the CCI tends to improve as the sample size increases and tends to worsen slightly as the censoring rate increases. The corresponding results for a sample size of 50 are summarized in the supplementary material.

Table 5 Variable selection measures in the first scenario (\(n = 100\))
Table 6 Variable selection measures in the second scenario (\(n = 100\))
Table 7 Variable selection measures in the third scenario (\(n = 100\))
Table 8 Variable selection measures in the fourth scenario (\(n = 100\))

Tables 5, 6, 7 and 8 report the empirical averages and the corresponding standard deviations of the true positive rate, false positive rate, and number of elements of the smallest index set containing all the true important predictors for each dimension of the predictors in the four simulation scenarios. We found that the selection accuracy of Model (1) outperformed the others across all settings in the four simulation scenarios, and that it performed best especially in Tables 5 and 7 for the two linear scenarios with independent and dependent predictors. While Model (2), using the inverse probability weight, exhibited desirable results that tended to be only slightly lower than those of Model (1) in most cases (Tables 5, 6 and 8), it provided inferior results compared with Model (3) in Table 7, where the predictors were correlated and high-dimensional. Model (4) showed better performance than Model (3) in Tables 5, 6, and 8 in most situations. However, we observed the opposite relation between Models (3) and (4) in the selection performance reported in Table 7. From the four simulation studies, we confirmed that the proposed Models (1) and (2) can work better, in terms of both prediction and variable selection, than the other methods when the time-to-event dataset is generated from a more complex design. The selection results for a sample size of 50 are contained in the supplementary material.

Fig. 1

The blue plus points (+) denote the estimated time-varying intercept, \({\hat{\mu }}(t)\), at each event time, and the black dashed line connects the blue points. The results of the SVHM are plotted, where A is \(n=50, p=100\), censoring rate = 0.4, B is \(n=50, p=100\), censoring rate = 0.6, C is \(n=50, p=200\), censoring rate = 0.4, and D is \(n=50, p=200\), censoring rate = 0.6

Fig. 2

Scatter plots of the rank of observed time versus the risk score. The results of the SVHM are plotted, where A is \(n=50, p=100\), censoring rate = 0.4, B is \(n=50, p=100\), censoring rate = 0.6, C is \(n=50, p=200\), censoring rate = 0.4, and D is \(n=50, p=200\), censoring rate = 0.6

The four panels in Fig. 1 depict the trajectories of the time-varying intercept \({\widehat{\mu }}(t)\) of the SVHM for four different cases, denoted by A (\(n=50, p=100, cr =40\%\)), B (\(n = 50, p=100, cr = 60\%\)), C (\(n=50, p=200, cr =40\%\)), and D (\(n =50, p=200, cr=60\%\)), where the horizontal axis denotes the event time and the vertical axis denotes the value of the time-varying intercept. We confirmed that these intercepts did not decrease as time increased, which can be interpreted as an increasing tendency of the hazard rates in all cases. The four panels in Fig. 2 present scatter plots of the estimated risk score against the rank of the observed survival time for the same simulation cases, where the horizontal axis denotes the event time and the vertical axis denotes the value of the risk score in each panel. We observed that the risk score tended to decrease as the survival time became longer in all simulation settings.

4 Analysis of real data

To illustrate its usefulness, a real data application was performed using the four different prediction models. The data used for the analysis were human gene expression data collected with an oligonucleotide array in the work of Beer et al. (2002). We were biologically interested in predicting the time of occurrence of lung cancer and identifying the important genes mainly associated with the disease. The data comprised 86 subjects (\(n = 86\)), the predictors were composed of 7129 genes (\(p = 7129\)), and the outcome consisted of the lung cancer diagnosis time and the corresponding censoring indicator; 62 of the 86 subjects were observed with the survival time, whereas 24 subjects were censored, which corresponds to a censoring rate of \(28\%\).

After the entire dataset was divided into five folds, each combination of which formed training and test datasets, we fitted the four different prediction models based on margin maximization to the training datasets to estimate the functional structure of the risk score. Thereafter, we predicted the risk score for each subject in the test datasets, computed the CCI values for each test dataset as a measure of the prediction performance, and compared the prediction performance of our proposed methods with those of the others. Finally, we selected the important genes through the gradient-based variable selection method, where the threshold number for the variable selection was set as \([n/\log (n)]=[18.2]=18\), the Gaussian RBF kernel was employed, and the bandwidth size was chosen as mentioned in the work of Wang (2012). When dividing the data into five folds, the training datasets were generated with an almost identical censoring rate to that of the original dataset.
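As a small illustration of the fold construction, the sketch below (assuming scikit-learn is available; the function name is ours) stratifies the five folds on the censoring indicator so that each training fold keeps approximately the original censoring rate.

```python
# A sketch of a 5-fold split that preserves the censoring rate of each fold.
import numpy as np
from sklearn.model_selection import StratifiedKFold

def censoring_balanced_folds(Delta, n_splits=5, seed=0):
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    # stratify on the censoring indicator; the feature matrix is irrelevant here
    return list(skf.split(np.zeros(len(Delta)), Delta))
```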

Fig. 3
figure 3

Box plots for the results for RMSE of each model. Model (1) is SVHM, Model (2) is KM-inverse weight SVHM, Model (3) is SVR, and Model (4) is SVMR. The results before the variable selection are displayed on the left and the results after that are displayed on the right

Table 9 Prediction measures in the real data analysis

For each prediction model, we conducted variable selection with the gradient value estimate for each marginal predictor and refitted the prediction methods for the selected predictors.

Table 9 presents the average CCI and RMSE values for all methods before and after the variable selection. We observed that the CCI values for Models (1), (2), and (3) slightly decreased after applying the variable selection, while the CCI for Model (4) increased. In contrast to the CCI results, we found that the prediction performance of Models (1) and (2) showed a meaningful improvement in that their RMSE values improved after applying our kernel-based variable selection approach. The prediction performance of the rank-based SVM, Model (4), was also improved compared with the result before the variable selection, while Model (3) did not improve even after applying the variable selection method.

Figure 3 contains four box plots, each showing the distribution of the RMSE values for one prediction method; the SVHM with the weight based on the risk set, the SVHM with the weight based on the inverse probability, support vector regression, and the ranking-based support vector machine are denoted by Models (1), (2), (3), and (4), respectively. As the empirical average of the prediction performance, RMSE values of 5.305 and 4.884 were obtained for our proposed Models (1) and (2), respectively, which indicates better prediction performance than the others. Additionally, we observed that the spread of the SVHM with the inverse probability weight based on the KM estimator (Model (2)) tends to be smaller than that of the SVHM with the weight based on the risk set (Model (1)).

Fig. 4

(Left) Scatter plot of the estimated intercept and survival time. The black dashed line connects the blue points. (Right) Scatter plot of the rank of observed time and the risk score

The two panels of Fig. 4 depict the time-varying intercept \({\widehat{\mu }}(t)\) and the time-independent risk score \(\widehat{f}(\cdot )\) estimated for the lung cancer dataset. As the survival time increased, the intercept tended to increase, which appears appropriate because the risk should increase over time. Although the time-independent risk score appeared somewhat scattered, we observed a tendency for it to decrease to some degree.

Table 10 Results of variable selection for each model

Table 10 shows the results of the variable selection through the gradient information for each method. We observed that some genetic predictors were commonly selected by the prediction models employed, and there were also a few differences in the selected genetic predictors between the prediction models. Among the genes selected by the proposed SVHM, the indices of the genetic predictors selected by three or more prediction models were (146, 148, 564, 1107, 2429, 2778, 3001, 5474, 5909, and 6061), whereas the indices of the genetic predictors selected only by the SVHM were (3255 and 4548). To ascertain whether the selected genetic predictors are biologically meaningful, we searched the literature for associations between the selected genes and lung cancer. Peng et al. (2022) found that non-small cell lung cancer can be aggravated through GAPDH (gene index 146). Carleo et al. (2020) found an association between the gene IGKC (gene index 3001) and lung carcinogenesis in idiopathic pulmonary fibrosis patients. Ma et al. (2017) investigated the expression and epigenetic regulation of CSTB (gene index 4548) in lung cancer, a gene selected only by our proposed method. Although not directly related to lung cancer, a study by Gustafsson et al. (2008) showed an association between the gene IGHG (gene index 3255), also detected only by our method, and asthma severity. Through these cases, we confirmed that the selections of our proposed method can be biologically meaningful.

5 Discussion

In this study, we developed a method to improve the prediction of the survival time and to select the important predictors in time-to-event data. Specifically, we considered the counting process for each subject in the time-to-event data as a time-varying dichotomized outcome and then adopted the SVHM and gradient-based variable selection methods to achieve the two purposes of prediction and variable selection. Through simulation studies, we found that not only the existing margin-based methods, such as SVR and SVMR, but also the SVHM with the two different weights could provide desirable prediction performance in terms of RMSE, whereas the SVHM with the weights outperformed the others in terms of CCI. Moreover, we observed that the finite-sample prediction performance of the SVHM approach with the weights tended to be better in the more complicated simulation scenarios. We believe that the proposed framework can be practically used to solve the problem of predicting the time of occurrence of an event and choosing variables in time-to-event data.

For the real data application, we used the gene expression values of microarray data from patients with lung cancer as high-dimensional predictors and the survival time to death as the outcome. We demonstrated that both of our SVHM approaches with the two weights provided better prediction performance, and that this performance did not decrease significantly and was largely maintained even after using the gradient-based variable selection method. By applying the gradient-based variable selection method to each prediction model, it was possible to identify genes selected in common and genes discovered uniquely by each method. We confirmed that the genes identified by our proposed method were biologically meaningful, which suggests that the proposed method is scientifically valid.

We note that the computational cost and the restriction to time-independent covariates are limitations of our proposed approach. We could not use time-dependent covariates because the risk score is decomposed into a time-dependent intercept and a covariate-dependent score defined for time-independent covariates. Additionally, we must optimize the regularized empirical risk at each survival time, which requires heavy computation. We believe that developing a scalable SVHM method is necessary and leave this for future research.