1 Introduction

Machine learning methods have been highly successful in data-intensive survival analysis and its applications [34, 36, 37], but they are often hampered when the data set is small. Furthermore, in medical studies, a portion of patients often have not experienced the event of interest by the time the study ends, and for these observations we have incomplete, or censored, time-to-event data [15, 49]. Over the past few decades, a large number of parametric, semi-parametric and non-parametric survival models have been developed for modeling time-to-event data. Among them, the most popular is the regression-based semi-parametric Cox model [12] and its extensions [45, 52, 60]. When the underlying proportional hazards assumption is not satisfied, nonparametric machine learning-based approaches are useful alternatives [6, 19, 24, 31]. However, whether semi-parametric or non-parametric, all the aforementioned methods require an adequate number of both censored and uncensored observations [43]. When a small sample size and/or highly censored data are present, these methods may face severe difficulties [50].

Among all machine learning based survival methods, the most popular is the random survival forests (RSF) model [28], inherited from the random forest (RF) [7]. In contrast to the entropy- or Gini-index-based splits of RF, log-rank split rules are generally adopted in RSF. RSF has been extended in many directions over the past decades; see [56] for more detailed information. More recently, there have also been quite a few improvements on the traditional RSF method. For example, linear combinations of input variables are used for recursive partitioning and a higher prognostic value is achieved in [30]. In [48], a novel paradigm for building regression trees is proposed for survival analysis. In [53], the standard simple average is replaced by a weighted average for hazard function estimation in RSF. In UST (uni.survival.tree) [17], a stabilized score test is suggested to select significant covariates first in order to reduce time complexity.

One may notice that all the above random forest based approaches are one-layered, i.e., only the raw features are exploited to train the forest model and predictions are made immediately, neglecting the fact that multiple intermediate trained layers may produce better representations of the training data [42]. Moreover, as machine learning approaches, tree-based methods usually rely on a medium or large number of samples to obtain satisfactory predictive performance [44]. This requirement may not be met in real practice, as small-sample survival data are commonplace in clinical studies [29, 61]. Furthermore, a large proportion of censored observations under such circumstances makes the survival modeling process even more complicated, if not impossible.

In light of the above discussion, we propose to address these problems using a layer-by-layer deep random forests framework, where the perceptrons in traditional deep neural networks are replaced by random forests. To alleviate the high censoring problem, semi-supervised learning [21] and data transformation techniques [3] are also adopted in the proposed deep survival forests (DSF) method. The superior empirical performance of the proposed method is illustrated by simulation examples and real data applications.

The major contributions of this paper are summarized as follows:

  • A non-NN (Neural Networks) style deep learning method is proposed for survival prediction.

  • We provide an effective approach to model highly censored survival data with a small sample size.

The rest of the paper is organized as follows. Section 2 introduces the motivation for modeling censored data with small sample sizes, and we then propose a novel deep forest structure in Section 3. Experimental analysis and real data applications are described in Sections 4 and 5. Finally, we discuss and conclude the paper in Sections 6 and 7.

2 Preliminaries

In this section, we first discuss the high censoring problem and then give a short description of semi-supervised learning and the deep forest model. Later, in Section 3, we will develop a novel semi-supervised framework using deep forests to deal with the high censoring problem.

2.1 The high censoring problem

In survival analysis, an instance can be represented by a triplet (xi, δi, yi), i = 1,2,...,n, where xi = (xi1, xi2, ..., xip) is the feature vector. In the case of right-censored data, yi = min{Ci, Ti}, where Ti is the true survival time, Ci is the censoring time, and δi = I(Ti ≤ Ci) is the censoring indicator. In biomedical studies, such survival data are often characterized by a small sample size with a high dimension. Take the GEO (Gene Expression Omnibus) genomics data repository (https://www.ncbi.nlm.nih.gov/geo/) as an example. So far, this database contains 4348 data sets, all of which are high-dimensional with sample sizes ranging from 2 to 202. When a high censoring rate is combined with a small sample size, the uncensored samples may not be sufficient for predictive modeling. In such cases, the parameter estimation of the Cox model may not converge in the optimization procedure, and the RSF model may fail because of the constraint that a leaf node must contain a certain number of unique samples with events [62].

We will illustrate this problem with the popular RSF model. In RSF, the Nelson-Aalen (NA) estimator is used to predict the cumulative hazard function (CHF). The CHF for terminal node h is

$$ \widehat{H}_{h}(t)={\sum}_{t_{l,h}\leq t}\frac{d_{l,h}}{Y_{l,h}}, $$
(1)

where dl,h and Yl,h are the number of deaths and the number of individuals at risk at time tl,h. Obviously, all cases within node h share the same CHF. Suppose that one terminal node contains only one death instance and nine censored instances, with survival times (2+, 3+, 5+, 7+, 8+, 10, 13+, 14+, 18+, 25+). In this case, the NA estimator can only show that the estimated risk is 0% for T < 10 and about 20% for T ≥ 10, which is extremely vague and inaccurate. Consequently, the resulting RSF model may suffer from under-fitting.
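To make the toy calculation concrete, the following minimal Python sketch reproduces the Nelson-Aalen estimate for this node. The survival times and censoring indicators are taken directly from the example above; the function name is ours, not part of any RSF implementation.

```python
import numpy as np

# Toy terminal node: one death at t = 10, nine censored observations
times  = np.array([2, 3, 5, 7, 8, 10, 13, 14, 18, 25], dtype=float)
events = np.array([0, 0, 0, 0, 0,  1,  0,  0,  0,  0])   # 1 = death, 0 = censored

def nelson_aalen(times, events):
    """Cumulative hazard H(t) = sum_{t_l <= t} d_l / Y_l over distinct event times."""
    chf, H = {}, 0.0
    for t in np.unique(times[events == 1]):
        d = np.sum((times == t) & (events == 1))   # deaths at time t
        Y = np.sum(times >= t)                     # individuals still at risk at t
        H += d / Y
        chf[t] = H
    return chf

print(nelson_aalen(times, events))   # {10.0: 0.2}: the CHF jumps to 0.2 only at t = 10
```

As the output shows, a single event among ten observations yields one coarse jump in the estimated CHF, which is exactly the under-fitting issue described above.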

One may notice that, in calculating the CHF, the censored samples enter only through the counts of individuals at risk, and other censoring information, such as the specific values of the censoring times, is ignored. In the case of a small sample size with a high censoring rate, one may consider improving the model's predictive capability by exploiting such information.

2.2 Semi-supervised learning

In classification and regression problems, semi-supervised learning can make use of unlabelled data to gain more information about the underlying marginal data distribution p(x), and thereby obtain more accurate inference about the posterior distribution p(y | x) [26].

However, semi-supervised learning for survival analysis is still underdeveloped. In survival analysis, instances that have experienced the event of interest can be regarded as labeled data. Censored data, however, are not the same as unlabeled data, in that censoring implies that the true survival time lies within an interval determined by the observed censoring time; hence censored data carry more information than unlabeled data.

Here, we consider a toy example in Fig. 1 showing how censored information may help in a classification problem. In this example, we have two classes with 34 instances in total: eight of them (square dots) denote event (uncensored) data and the other 26 (circle dots) are censored. If we only use the eight uncensored instances (E1 to E8 in Fig. 1) in model training, the decision boundary may be the densely dotted line. However, from semi-supervised learning we know that the dotted line violates the smoothness and low-density assumptions, as the decision boundary of a classifier should preferably pass through low-density regions in the input space [54]. Hence, if both censored and uncensored samples are handled properly, the solid optimal decision hyper-plane may be found.

Fig. 1 A basic example to explain semi-supervised learning

2.3 Deep forest

Deep learning based approaches find vast applications in a variety of fields. The mystery behind the success of deep learning may lie in three characteristics, i.e., layer-by-layer processing, in-model feature transformation and sufficient model complexity [63]. However, training deep neural networks requires a large number of samples [1], a requirement that is often difficult to satisfy in medical practice. In 2017, a deep forest framework with a cascade random forest structure was proposed to retain these strengths of deep learning [63].

One may observe that both deep neural networks (DNN) and deep forest (gcForest) have a layer-by-layer structure for representation learning. As stated in [42], for any \(g\in \mathcal {C}_{r}([0,1]^{r},\beta ,H)\), there exists a deep neural network \(f\in \mathcal {F}(L,\{d_{j}\}_{j=0}^{L+1},s,V)\) such that

$$ \|f - g\|_{\infty}\leq(2H + 1)\,6^{r}\cdot(1 + r^{2} + \beta^{2})\cdot N 2^{-m} + H\cdot 3^{\beta}\cdot N^{-\beta/r}, \qquad m\varpropto \text{depth},\quad N\varpropto \text{width}. $$
(2)

According to (2), the approximation error upper bound decreases exponentially as the model depth (through m) increases. In the next section, we develop a different deep forest framework for highly censored survival data using a semi-supervised learning technique.

3 The DSF approach

In this section, we first show how pseudo survival times can be approximated from censoring times through a semi-supervised learning approach called data transduction. We then propose a deep survival forests (DSF) approach that can utilize both censored and uncensored sample information.

3.1 Data transduction

Given m labeled samples (x1, y1), ..., (xm, ym) as well as n − m unlabeled samples xm+1, xm+2, ..., xn, the purpose of semi-supervised learning is to predict the remaining unknown labels ym+1, ym+2, ..., yn [11]. However, most semi-supervised learning approaches are designed for regression or classification problems and cannot be extended directly to survival modeling [32]. Here, we employ a transductive semi-supervised algorithm [54] to deal with censored observations.

Formally, given a supervised loss function ℓ for the labeled data and an unsupervised loss function ℓU for pairs of labeled or unlabeled data, transductive methods attempt to obtain a pseudo-labeling \(\hat {y}\) that minimizes

$$ \lambda\cdot{\sum}_{i=1}^{m}\ell(\hat{y}_{i},y_{i})+{\sum}_{i=1}^{n}{\sum}_{j=1}^{n}W_{ij}\cdot \ell_{U}(\hat{y}_{i},\hat{y}_{j}) $$
(3)

where Wij contains the edge weights for all pairs of nodes and λ governs the relative importance of the supervised term.
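As a concrete illustration (our own, not prescribed by [54]), the objective in (3) can be evaluated directly once the losses are chosen; the sketch below uses squared-error losses for both ℓ and ℓU.

```python
import numpy as np

def transductive_objective(y_hat, y_labeled, W, lam=1.0):
    """Evaluate (3): lam * sum_i loss(y_hat_i, y_i) over the m labeled samples
    plus sum_{i,j} W_ij * loss_U(y_hat_i, y_hat_j) over all pairs.
    Squared-error losses are used here purely for illustration."""
    m = len(y_labeled)                               # first m entries of y_hat are labeled
    supervised = np.sum((y_hat[:m] - y_labeled) ** 2)
    pairwise = (y_hat[:, None] - y_hat[None, :]) ** 2
    return lam * supervised + np.sum(W * pairwise)

# Toy usage: 3 labeled + 2 unlabeled samples, uniform pair weights
y_hat = np.array([1.0, 2.0, 3.0, 2.5, 2.7])
W = np.full((5, 5), 0.1)
print(transductive_objective(y_hat, np.array([1.1, 1.9, 3.2]), W, lam=0.5))
```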

In the most typical right-censored survival case, the actual survival time of a censored instance is unknown but is greater than the observed censoring time. Based on this fact, we can assign an optimal target label (a pseudo survival time) to each censored instance via data transduction. In other words, we attempt to infer a pseudo-label for the i-th right-censored instance by

$$ \widehat{T_{i}}=f(\boldsymbol{x}_{i},\boldsymbol{x}_{j},C_{i},T_{j}), \quad j\neq i \ \text{ and } \ \delta_{j}=1 $$
(4)

where δj is the censoring indicator, δj = I(Tj ≤ Cj).

We assume that the distribution of the survival time possesses the memoryless property, that is, P(T > s + t ∣ T > s) = P(T > t). For a censored instance at time C, we suppose that the longest possible survival time of this instance is C + τ, where τ is the maximum event time in the training set. As a result, an effective way to obtain an optimal target is data transduction via exhaustive search from the censored time up to the maximum pseudo time C + τ. The transduced time can then be formulated as

$$ \widehat{T_{i}}=C_{i}+k\frac{\tau}{\zeta}=C_{i}+ks \ \ (k=1,2,\cdots,\zeta ) $$
(5)

In (5), s is the iteration stride, which determines the number of iterations ζ. The larger ζ is, the more time the algorithm takes and the higher the accuracy obtained.

To avoid noise accumulation from the pseudo-labels of the censored data and to ensure the robustness of the whole data transduction, the proposed method lets the censored samples enter the model one by one and carries out collaborative training with the uncensored data. Once an instance obtains its transduced pseudo-label, it becomes a new “uncensored” instance in the next training process. That is to say, once a pseudo-label for the i-th (m + 1 ≤ i ≤ n) censored sample has been transduced, observations m + 1, ⋯, i are regarded as uncensored samples in the subsequent training process.
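The candidate grid in (5) is easy to write down explicitly. The sketch below (function name ours, for illustration only) enumerates the candidate pseudo-times C_i + k·s, k = 1, …, ζ, with stride s = τ/ζ.

```python
import numpy as np

def candidate_pseudo_times(c_i, tau, zeta):
    """Candidate transduced times per Eq. (5): C_i + k * (tau / zeta), k = 1..zeta,
    where tau is the maximum event time in the training set."""
    s = tau / zeta                                 # iteration stride
    return c_i + s * np.arange(1, zeta + 1)

# e.g. censoring time 2.0, maximum event time 5.0, zeta = 5 -> stride 1.0
print(candidate_pseudo_times(2.0, 5.0, 5))         # [3. 4. 5. 6. 7.]
```

A larger ζ produces a finer grid and hence a more precise pseudo-label, at the cost of fitting more forests.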

3.2 Deep survival forests

In the proposed deep survival forests (DSF) approach, we attempt to apply a similar methodology to survival data in the hope of obtaining a smaller error upper bound. However, the presence of censored data causes trouble for the popular cascade structure. Hence, the cascade structure is redesigned to cope with the challenges posed by highly censored data.

As illustrated in Fig. 2, DSF is distinct from deep forest in that each level of the cascade receives newly transduced censored samples from its preceding level and passes its processing result on to the next level. In contrast to the hierarchical framework of deep forest, which extracts complex non-linear features, the proposed approach applies a sequential framework that expands the training sample size by transducing survival times for the censored samples.

Fig. 2 The overall procedure of DSF

In each layer of the cascade forest, the stride (or, equivalently, the number of forests) should be determined in advance. In practice, these settings depend on the accuracy requirements. For example, if the censoring time for one instance is 300, the maximum survival time in the training set is 600, and the stride is set to 50, then six random forests are needed in this layer to search for the optimal target time for this censored sample. If higher accuracy is required, a smaller stride (such as 10) can be set and more (30) random forests are involved.

We then replace the censored time of each censored sample with the transduced label that yields the best performance improvement on the out-of-bag (OOB) data. In other words, the pseudo-label of this instance is obtained based on the minimum-error criterion on the held-out OOB set. The corresponding status is also changed from censored to uncensored. Rather than using other imputation methods, we design a cascade forest to realize data transduction. This procedure minimizes the corresponding estimation bias and extracts more effective feature information than a one-off procedure.

Finally, once optimal pseudo-values have been transduced for all censored instances, a learning model such as a random forest can be built with these pseudo-labeled censored samples and the uncensored samples. As we can see, the final model is actually a cascade of cascades, where each cascade consists of multiple levels.

In general, the DSF can be formulated as the following optimization problem:

$$ \begin{array}{lllll} \underset{k}{Min} \sum\limits_{b=1}^{B}\sum\limits_{i=1}^{n}I_{i,b}\cdot\delta_{i,b}[y_{i,b}-\hat{f}^{*}(x_{i,b})]^{2}\\ s.t. y_{i,b}^{*}\leq \tau+y_{i,b}\cdot(1-\delta_{i,b}) \end{array} $$
(6)

where \(y_{i,b}^{*}=y_{i,b}+k_{i,b}\cdot (1-\delta _{i,b})\cdot s\) for b = 1,2,...,B, with B the number of bootstrap samples drawn from the original data, and i = 1,2,...,n; Ii,b = 1 if instance i is an out-of-bag (OOB) sample for the b-th bootstrap sample.

In our study, random forests are chosen as the base learners, and each random forest is built with a default of 100 regression trees using the common mean squared error (MSE) loss function. The Gini-index criterion is adopted as the splitting rule in these regression trees. Like deep forest and other deep learning approaches, we have to make a trade-off between computational efficiency and predictive performance. If computational resources permit, we suggest setting a large number of iterations ζ, such as 1000 or 5000, to obtain more accurate transduced labels and more reliable prediction performance.

The pseudo-code of the proposed DSF is presented in Algorithm 1:

Algorithm 1 The DSF training procedure
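Since the algorithm figure is not reproduced here, the following Python sketch outlines the training loop under our reading of (5) and (6): random forest regressors serve as base learners, and the candidate pseudo-time whose forest attains the smallest OOB squared error on the currently uncensored samples is selected. It is a simplified illustration with names of our own choosing, not the authors' implementation.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def dsf_transduce(X, y, delta, zeta=200, n_trees=100, random_state=0):
    """Layer-by-layer transduction sketch: censored samples (delta == 0) are processed
    one by one; each receives the candidate pseudo-time whose forest attains the
    smallest OOB error on the (currently) uncensored samples, then is treated as
    uncensored in all subsequent levels."""
    y, delta = y.astype(float).copy(), delta.copy()
    tau = y[delta == 1].max()                    # maximum observed event time
    s = tau / zeta                               # iteration stride, Eq. (5)
    for i in np.where(delta == 0)[0]:            # one censored instance per cascade level
        best_err, best_t = np.inf, y[i]
        for k in range(1, zeta + 1):
            y_try = y.copy()
            y_try[i] = y[i] + k * s              # candidate pseudo-time
            rf = RandomForestRegressor(n_estimators=n_trees, oob_score=True,
                                       bootstrap=True, random_state=random_state)
            rf.fit(X, y_try)
            mask = delta == 1                    # evaluate only on uncensored samples
            err = np.mean((y_try[mask] - rf.oob_prediction_[mask]) ** 2)
            if err < best_err:
                best_err, best_t = err, y_try[i]
        y[i], delta[i] = best_t, 1               # the instance becomes "uncensored"
    # final model trained on the fully transduced data
    final = RandomForestRegressor(n_estimators=500, random_state=random_state).fit(X, y)
    return final, y, delta
```

With small data and few trees, some samples may never be out-of-bag; a production implementation would need to guard against the resulting missing OOB predictions.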

Since the “while” iterations over ζ can be executed concurrently, DSF can be trained in parallel on a multi-core CPU or a computer cluster to save time in the case of big survival data and large ζ.
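One possible realization of this parallel scheme (a sketch only, not the reference implementation) distributes the ζ candidate evaluations for a given censored instance with joblib:

```python
import numpy as np
from joblib import Parallel, delayed
from sklearn.ensemble import RandomForestRegressor

def oob_error_for_candidate(X, y, delta, i, t_candidate, n_trees=100):
    """Fit one forest with the i-th label replaced by a candidate pseudo-time
    and return its OOB squared error on the uncensored samples."""
    y_try = y.copy()
    y_try[i] = t_candidate
    rf = RandomForestRegressor(n_estimators=n_trees, oob_score=True).fit(X, y_try)
    mask = delta == 1
    return np.mean((y_try[mask] - rf.oob_prediction_[mask]) ** 2)

def best_candidate_parallel(X, y, delta, i, candidates, n_jobs=-1):
    """Evaluate all candidate pseudo-times for instance i concurrently."""
    errors = Parallel(n_jobs=n_jobs)(
        delayed(oob_error_for_candidate)(X, y, delta, i, t) for t in candidates)
    return candidates[int(np.argmin(errors))]
```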

4 Simulation studies

In this section, we use simulation studies to evaluate the effectiveness of the proposed method for survival prediction on a variety of scenarios.

4.1 The comparing models

We will compare our method with several popular semi-parametric and nonparametric models that are widely used in real applications. We do not consider deep survival models such as DeepCox [33] and DeepSurv [31] as competitors, because these methods require many more training samples and are not applicable in small-sample scenarios.

  • Cox proportional hazards model [12, 13] is a popular semi-parametric model and the most commonly-used survival analysis method.

  • GlmBoost [9] is a generalized linear model fitted by a boosting algorithm based on component-wise univariate linear models.

  • RSF (Random Survival Forests) [28] extends random forest [7] to right-censored data and is the most popular nonparametric method in the field of survival analysis.

  • CoxBoost [4] is one of the few methods that allow the implementation of popular boosting techniques in conjunction with the Cox model.

  • ORSF (Oblique Random Survival Forests) [30] is a tree-based ensemble for right-censored survival data that uses linear combinations of input variables to recursively partition a set of training data.

  • OSTE (Optimal Survival Trees Ensemble) [20] is a tree-based ensemble method that starts with the top-ranked survival tree and then tests further trees one by one, adding them to the ensemble in order of rank.

  • UST [17] constructs a survival tree with a novel matrix-based algorithm in order to test a number of nodes simultaneously via stabilized score tests [59].

Comparisons with these models are conducted with the corresponding “survival”, “mboost”, “randomForestSRC”, “CoxBoost”, “obliqueRSF”, “OSTE” and “uni.survival.tree” packages in R. The default package settings are adopted for the ensemble tree methods, with the number of trees set to 500. For the proposed DSF method, we set trees = 100 and iteration times ζ = 200 for each level; these values are relatively small in order to trade off accuracy against efficiency. In the last level of DSF, we set trees = 500.

4.2 Performance comparison metrics

To evaluate the predictive accuracy of survival models, we adopt the concordance index (C-index) measure [22, 23], which is the most popular criterion for survival predictions. The C-index has the attractive feature that it does not depend on a single event time for evaluation and properly accounts for censoring. The C-index value is calculated as follows:

  • Calculate all possible pairs of cases over the data.

  • Omit those pairs whose shorter survival time is censored. Omit pairs i and j if yi = yj. Let π denote the total number of permissible pairs.

  • For each permissible pair where yi ≠ yj, count 1 if the longer survival time has the better predicted outcome; count 0 if the predicted outcomes are in the opposite order. Let ω denote the sum over all permissible pairs.

  • C = ω/π defines the C-index; a minimal code sketch of this computation is given after the list.
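The following is a direct transcription of these steps (our own sketch), for a risk-type prediction where a higher predicted risk should correspond to a shorter survival time; ties in the predictions are simply counted as 0 here.

```python
import numpy as np

def c_index(time, event, risk):
    """Concordance index following the steps above.
    `risk`: predicted risk scores (higher risk should mean shorter survival)."""
    n = len(time)
    permissible, concordant = 0, 0
    for i in range(n):
        for j in range(i + 1, n):
            if time[i] == time[j]:
                continue                          # omit pairs with equal observed times
            a, b = (i, j) if time[i] < time[j] else (j, i)   # a has the shorter time
            if event[a] == 0:
                continue                          # shorter time censored -> omit pair
            permissible += 1
            if risk[a] > risk[b]:                 # longer survivor has the better prediction
                concordant += 1
    return concordant / permissible

# Toy usage
t = np.array([5.0, 8.0, 12.0, 3.0]); e = np.array([1, 0, 1, 1]); r = np.array([0.9, 0.4, 0.1, 0.8])
print(round(c_index(t, e, r), 3))                 # 0.8
```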

In our experiments, 5×2-fold cross-validation [57] is used for all datasets. To be specific, each trial randomly divides the dataset into two halves, using one half (50%) for training and the other (50%) for testing, and then swaps their roles. This process is repeated five times for each dataset and for all compared methods.
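The 5×2 cross-validation loop can be written compactly as below (a sketch; `fit_and_cindex` is a placeholder for training any of the compared models and scoring with the C-index).

```python
import numpy as np
from sklearn.model_selection import train_test_split

def five_by_two_cv(X, y, delta, fit_and_cindex, seed=0):
    """5x2 CV: five random half/half splits; each half serves once for training
    and once for testing, giving ten C-index values in total."""
    scores = []
    for rep in range(5):
        idx_a, idx_b = train_test_split(np.arange(len(y)), test_size=0.5,
                                        random_state=seed + rep)
        for train, test in [(idx_a, idx_b), (idx_b, idx_a)]:
            scores.append(fit_and_cindex(X[train], y[train], delta[train],
                                         X[test], y[test], delta[test]))
    return float(np.mean(scores)), scores
```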

4.3 Simulation scenario settings

The simulation settings reported here are very similar to those in [48, 64]. The five settings considered are described below.

Scenario 1:

In this basic scenario, each simulated dataset is created using 90 independent observations, where the covariate vector (x1, x2, ..., x10) is multivariate normal with μ = 0 and a covariance matrix with elements equal to \(0.9^{|i-j|}\). Survival times are simulated from an exponential distribution with \(\mu _{T}=e^{0.1{\sum }_{i=5}^{8}x_{i}}\) (i.e., a proportional hazards model), and the censoring distribution is exponential with \(\mu _{C}=0.8e^{0.1{\sum }_{i=5}^{8}x_{i}}\) to obtain an approximately 66% censoring rate (CR for short when necessary); a code sketch of this data-generating process is given after the scenario descriptions.

Scenario 2:

In this nonlinear scenario, the proportional hazards assumption is mildly violated. Each simulated dataset is created using 90 independent observations, where the covariate vector (x1, x2, ..., x10) consists of 10 independent and identically distributed uniform random variables on the interval [0,1]. The survival times follow an exponential distribution with \(\mu _{T}=\sin (x_{1}\pi )+2\mid x_{2}-0.5\mid +{x_{3}^{3}}\). Censoring has a uniform distribution over [0,2], which results in an approximately 58% censoring rate.

Scenario 3:

In this nonproportional hazards scenario, the proportional hazards assumption is strongly violated. Each simulated dataset is created using 90 independent observations, where the covariate vector (x1, x2, ..., x10) is multivariate normal with μ = 0 and a covariance matrix with elements equal to \(0.9^{|i-j|}\). Survival times are gamma-distributed with shape parameter \(\mu _{T}=0.5+0.3\mid {\sum }_{i=5}^{8}x_{i}\mid \) and scale parameter 2. Censoring time has a uniform distribution over [0,25], which results in an approximately 71% censoring rate.

Scenario 4:

In this dependent censoring scenario, the underlying censoring distribution is conditionally dependent on the covariates. Each simulated dataset is created using 90 independent observations, where the covariate vector (x1, x2, ..., x10) is multivariate normal with μ = 0 and a covariance matrix with elements equal to \(0.9^{|i-j|}\). Survival times are simulated from a log-normal distribution with \(\mu _{T}=0.1\mid {\sum }_{i=1}^{2}x_{i}\mid +0.1\mid {\sum }_{i=6}^{7}x_{i}\mid \). Censoring times are log-normal with μC = μT − 1.5 and scale parameter 1, which results in an approximately 62% censoring rate.

Scenario 5:

In this more complicated scenario [27], the log-rank test may suffer a significant loss of power because the hazard functions cross each other. Each simulated dataset is created using 90 independent observations, where the covariate vector (x1, x2, ..., x10) is uniformly distributed on the interval [0,1]. The survival time is related only to x1. Censoring time is uniformly distributed on the interval [0,10], which results in an approximately 42% censoring rate. The hazard function is

$$ h(t\mid x_{1})=\begin{cases} 0.27t, & x_{1}\leq 0.5,\ t\leq 2\\ 0.27(t-2)+5.4, & x_{1}\leq 0.5,\ t>2\\ 0.1t, & x_{1}>0.5,\ t\leq 6\\ 5.5(t-6)+0.6, & x_{1}>0.5,\ t>6 \end{cases} $$
(7)
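To make the data-generating process concrete, the sketch below simulates Scenario 1 under our reading of the settings: an AR(1)-type covariance \(0.9^{|i-j|}\) and exponential survival and censoring times whose parameter μ is interpreted as the mean. This convention is an assumption on our part, so the realized censoring rate may deviate from the reported 66%.

```python
import numpy as np

def simulate_scenario1(n=90, p=10, seed=0):
    """Scenario 1 sketch: correlated normal covariates, exponential survival and
    censoring times (mu interpreted as the exponential mean, an assumption here)."""
    rng = np.random.default_rng(seed)
    idx = np.arange(p)
    Sigma = 0.9 ** np.abs(idx[:, None] - idx[None, :])   # covariance 0.9^{|i-j|}
    X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
    lin = 0.1 * X[:, 4:8].sum(axis=1)                    # x5 + ... + x8 (1-based indexing)
    T = rng.exponential(scale=np.exp(lin))               # true survival times
    C = rng.exponential(scale=0.8 * np.exp(lin))         # censoring times
    y = np.minimum(T, C)
    delta = (T <= C).astype(int)                         # 1 = event observed
    return X, y, delta

X, y, delta = simulate_scenario1()
print("observed censoring rate:", round(1 - delta.mean(), 2))
```

The other scenarios follow the same pattern, changing only the covariate, survival and censoring distributions.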

4.4 Simulation results

Table 1 and Fig. 3 present the performance of all methods in terms of C-index on the five simulated datasets.

Table 1 The result of C-index with other advanced competing approaches
Fig. 3 Boxplots of performance in terms of C-index

In S1 (the most basic proportional hazards case), all eight methods perform relatively well and all give C-index values above 0.5, and the proposed DSF outperforms the other seven methods by a noticeable margin. For S2-S5 (non-proportional hazards cases), all competing methods show degraded performance. In these cases, OSTE and UST often fail and make predictions worse than random guessing. However, DSF works strikingly well in all these scenarios and outperforms the other seven methods by large margins. In the highest censoring rate case (S3), only DSF gives adequate predictions, with a mean C-index value of 0.526, while all the other methods fail.

These simulation results indicate that when small samples (n = 90 in our simulations) and highly censored data (58% to 71% censoring in our cases) are encountered and other approaches fail, one can resort to DSF for help.

5 Real applications

In this section, we verify the potential of the proposed DSF based on real data applications. We will demonstrate its effectiveness using both low dimensional and high dimensional benchmark datasets.

5.1 Applications to low dimensional data

Here, eight real survival datasets with censoring rates ranging from 24% to 88% are employed; all of them are publicly available in the corresponding R packages. These datasets are preprocessed by eliminating all constant-valued columns and all observations with missing values. Short descriptions of the benchmark datasets are given below.

  • The breast [25] is a breast cancer dataset with 74% censoring rate, containing information on 100 breast cancer patients, including survival time, survival status, Tumor stage, Nodal status, Grading and Cathepsin-D tumor expression. The data can be obtained from the R package “coxphf”.

  • The DLBCL [2] contains gene expression data from diffuse large B-cell lymphoma (DLBCL) patients. This dataset contains 34 samples and 14 covariates with 47% censoring rate, which is available in R package “ipred”.

  • The leukemia [18] describes the treatment results for leukemia patients and contains 51 samples and nine covariates with 88% censoring rate. The data can be obtained from the R package “Stat2Data”.

  • The WPBC [10] exhibits invasive breast cancer cases and contains 194 samples and 32 covariates with 76% censoring rate. The data is available in R package “TH.data”.

  • The ovarian [16] is a randomized trial comparing two treatments for ovarian cancer with 26 samples and six covariates with 54% censoring rate. The data can be found in R package “survival”.

  • The colon [35] is from one of the first successful trials of adjuvant chemotherapy for colon cancer. This dataset contains 1858 samples and 14 covariates with 50% censoring rate, which can be obtained from the R package “survival”.

  • The kidney [38] is a dataset on kidney patients. It records the recurrence times to infection at the point of insertion of the catheter for kidney patients using portable dialysis equipment. This dataset contains 76 samples and six covariates with a 24% censoring rate and can be obtained from the R package “survival”.

  • The pbc [51] is from the Mayo Clinic trial in primary biliary cirrhosis (pbc) of the liver conducted between 1974 and 1984. A total of 276 pbc patients, referred to the Mayo Clinic during that ten-year interval, met the eligibility criteria for the randomized placebo-controlled trial of the drug D-penicillamine; the dataset contains 17 covariates with a 60% censoring rate.

The prediction performance in terms of C-index on the low dimensional datasets is summarized in Table 2 and Fig. 4. From these results, one may find that DSF works remarkably well on almost all these datasets and outperforms most methods by large margins in most cases. When an extremely high censoring rate is encountered, as in the leukemia dataset, most competitors lose their predictive power with low C-index values (0.335, 0.386, etc.), and CoxBoost sometimes manages to give predictions just above 0.5 but also fails in most runs. This seemingly impossible case, however, is handled well by the proposed DSF method: on the same leukemia dataset, DSF performs strikingly well and achieves an average C-index of 0.820.

Table 2 The result of C-index on low dimensional datasets
Fig. 4 Boxplots of performance in terms of C-index on low dimensional datasets

We also find that, when a low censoring rate is present, as in the kidney dataset (24% censoring rate), DSF does not perform as well as some of the competing methods (GlmBoost or CoxBoost), but its performance is still comparable to the other competing approaches.

5.2 Applications to high dimensional data

Next, we verify the validity of DSF on high-dimensional datasets. Here, for efficiency purposes, a two-stage strategy is adopted for high-dimensional survival analysis: in the first stage, irrelevant features are filtered out using an effective screening procedure, and in the subsequent stage the different competing models come into play. To ensure fairness of comparison, the same model-free screening method (Ball correlation sure independence screening, BCor-SIS) [41] is applied in all first stages. Short descriptions of the benchmark higher-dimensional datasets are given below.

  • The GSE12945 describes an expression module of WIPF1-coexpressed genes that identifies patients. This dataset, with an 82% censoring rate, has 61 patients. For each instance, 14 clinical covariates and 12985 gene features are provided. The data can be obtained from the R package “curatedCRCData” of “Bioconductor”.

  • The NSBCD [46] contains repeated observations of breast tumor subtypes in independent gene expression datasets. This dataset, with 67% censoring rate, has 115 patients. For each observation, 549 “intrinsic” genes are provided and can be downloaded from http://user.it.uu.se/~liuya610/.

  • The vdv [55] is a gene expression profiling dataset for predicting the clinical outcome of breast cancer. This dataset, with 56% censoring rate, contains 4705 expression values on 78 patients and is available in the R package “randomForestSRC”.

  • The Veer [5] represents the Circulating Breast Tumor Cells by Differential Expression of Marker Genes. This dataset, with 56% censoring rate, has 78 patients. For each observation, 4571 gene features are provided, which can be downloaded from https://clincancerres.aacrjournals.org/.

  • The unt [47] contains the gene expression, annotations and clinical data on breast cancer. This dataset, with 83% censoring rate, has 62 patients. For each observation, five clinical covariates and 44928 gene features are provided. The data can be obtained from the R package “breastCancerUNT” of “Bioconductor”.

  • The vdx [40, 58] contains the gene expression, annotations and clinical data. This dataset, with 64% censoring rate, has 197 patients. For each observation, three clinical covariates and 22283 gene features are provided. The data can be obtained from the R package “breastCancerVDX” of “Bioconductor”.

  • The transbig [14] contains the gene expression information for lymph node-negative (N-) breast cancer patients. This dataset, with 68% censoring rate, has 196 patients. For each observation, 22292 gene features are provided. The data can be obtained from the R package “breastCancerTRANSBIG” of “Bioconductor”.

  • The upp [39] contains transcript profiles of 251 p53-sequenced primary breast tumors. This dataset, with 78% censoring rate, has 197 patients. For each observation, 44938 gene features are provided. The data can be obtained from the R package “breastCancerUPP” of “Bioconductor”.

From Table 3 and Fig. 5, one can observe that DSF significantly outperforms all seven competing methods on all these high dimensional datasets. On datasets where most methods achieve relatively good predictive performance, such as GSE12945, NSBCD, vdv, Veer, unt and vdx, the proposed DSF is the best performer. On datasets where most methods struggle, such as transbig and upp, DSF still performs reasonably well. Thus, similar to its performance on the low dimensional datasets, DSF also achieves good predictive capability in terms of C-index on these high dimensional survival datasets.

Table 3 The result of C-index on high dimensional datasets
Fig. 5 Boxplots of performance in terms of C-index on high dimensional datasets

Hence, according to the results on both low and high dimensional real datasets with different censoring rates, the proposed DSF method generally achieves good predictive performance, and its superiority stands out when heavy censoring is present.

6 Discussions

In the previous two sections, we have demonstrated the effectiveness of the proposed DSF method using extensive simulated scenarios and real benchmark datasets. The success of DSF is probably due to the combination of the data transduction technique and the cascade forest structure: the former has been shown to exploit more of the censoring information, while the latter can achieve a better representation of the original features. When adequate uncensored samples (lower censoring rates and/or large sample sizes) are available, the proposed method may perform worse than other competitors, as noise may be introduced when transducing the censored data.

Here, we have conducted additional experiments to verify the above conjectures. First, we test the effectiveness of the deep cascade structure. For this experiment, the mean squared error on the test set (testing_mse) is used for prediction evaluation in each layer. Figure 6 shows the error rates in terms of testing_mse for each layer in Scenarios 1-5. According to Fig. 6, as censored data are transduced layer by layer, the testing_mse generally decreases. Hence, if high precision in prediction is required, we can set a larger ζ value to obtain a deeper cascade.

Fig. 6 Performance improvement in terms of the influence of the cascade structure

Next, we test the effect of sample size on the proposed method. For simplicity, here we only consider the most popular semi-parametric Cox model and the non-parametric RSF model as the competing methods. We vary the sample size from 60 to 1000 in two scenarios, one satisfying the proportional hazards assumption (Scenario 1) and the other violating it (Scenario 4). Moreover, to make the comparisons more challenging, all simulated data are generated with higher censoring rates. Summary information on the different simulated datasets and the corresponding comparison results can be found in Table 4 and Fig. 7.

Table 4 Predictive result in terms of average C-index with different sample sizes
Fig. 7 Performance in terms of C-index for different sample sizes. The deep pink arrow indicates the maximum sample size up to which DSF usually performs well

It can be observed that DSF is somewhat sensitive to sample size. When the sample size is less than 100, DSF performs better than the other two competing approaches, and the censoring rate seems to have little influence on its predictive performance; Cox and RSF, however, usually perform poorly in such scenarios. In contrast, when there is a large sample size with a lower censoring rate, DSF is not as good as RSF, but it still achieves comparable results and outperforms the Cox model by a large margin.

Similar to other deep learning approaches, the computing time of DSF is rather long in the current implementation. This limitation is counterbalanced by its ability to model small-sample, highly censored survival data and by the resulting remarkable gains in predictive capability. Furthermore, the computational issue can be alleviated by parallel computing frameworks and fast C++ routines in future implementations.

7 Conclusions

In this research, we have proposed a non-neural-network style deep learning algorithm, deep survival forests (DSF), for modeling highly censored survival data, which are prevalent in biomedical studies. Extensive numerical studies on both simulated and real data have shown that the proposed algorithm outperforms the popular Cox model, RSF and other state-of-the-art survival ensembles in terms of predictive performance. These results also indicate that the proposed DSF works best on small-sample survival data with heavy censoring, where sufficient samples are not available to train workable Cox and RSF models.

Potential future research includes extending the cascade forest structure to more complex survival data such as interval-censored data or competing risks data. Meanwhile, we also want to study the performance of other transduction techniques, such as Buckley-James [8] and censoring unbiased transduction [48], to make better use of the censoring information.