1 Introduction

A main topic of scientific research in the field of data science is classification [4]. In the medical and health area in particular, classification of continuous survival data, or failure times, is very important. Two major characteristics of continuous survival data are censoring and violation of the normality assumption required for ordinary least squares regression [7]. These two characteristics of time variables explain why straightforward logistic and multiple regression techniques cannot be used.

Different parametric and semiparametric models for survival data, such as proportional hazards and accelerated failure time models, are based on assumptions that may be untenable in some situations. Hazard functions may not be proportional, and a particular baseline distribution for the survival time may be hard to justify. When classification is required, nonparametric models such as decision trees, which rely on fewer presuppositions, have been proposed and produce good results. The use of decision trees is one of the most promising and popular approaches to classification, and it is a very common technique in data mining [2, 3]. According to many researchers, decision trees are popular due to their simplicity and transparency. Decision trees are self-explanatory, and no particular expertise is needed to follow a given tree. Moreover, decision trees are widely used in areas such as text mining, information extraction, machine learning, and pattern recognition [2].

In this paper, data (covariates and survival times) simulated by the Monte-Carlo method from three hazard models with different censoring rates were used to predict the outcome using decision trees and Cox regression models. The predictions of the models were then compared using the area under the Receiver Operating Characteristic curve (AUC) and an appropriate statistical test for the AUC.

2 Methods

2.1 Decision Trees (Non-Parametric Model)

In machine learning, much attention is given to research on classification. The goal of classification is to assign a case to the most appropriate class among the available classes. The main goal of this research is to develop methods that minimize the error rate.

Considerable progress has been achieved toward this goal in recent decades, and several methods for assigning a class are now available. In particular, the decision tree is one of the most favorable and popular classification methods [3]. A decision tree determines the classification and explains the reasons for assigning a certain case to a class. This form of classification is of particular importance in areas such as medicine and health. The ability to present a simple and understandable branch structure is another reason to use a decision tree. This capability makes the decision tree a more acceptable method compared with other classifiers such as neural networks and support vector machines [2, 5].

The simplicity and comprehensibility of a decision tree are reduced as the number of nodes and the tree depth increase, in which case the graphical structure is less useful. A decision tree is a classifier that partitions a given space recursively. It consists of a root node without any input and other nodes, each with exactly one incoming branch. A node with at least one outgoing branch is an internal node; nodes without outgoing branches are leaves or external nodes. In a decision tree, each internal node divides the sample space into two or more subspaces according to a test on a specific input attribute. For continuous attributes, the split refers to pre-specified ranges. Each leaf represents the class most strongly associated with the target value; it may also carry a probability vector representing the probability of belonging to each value of the target feature. Training examples are passed from the root to the leaves according to the corresponding tests. The process of forming a decision tree from a training set S can be briefly stated as follows [5] (a minimal illustrative sketch in R follows the list):

  1. If all the cases in S are labeled with the same class, return a leaf labeled with that class.

  2. Otherwise, select a test (not necessarily a statistical one) T according to some criterion, with two or more mutually exclusive outcomes \(\left\{ {{P_1},{P_2}, \ldots ,{P_r}} \right\} \).

  3. Split S into disjoint subsets \( {{S_1},{S_2}, \ldots ,{S_r}} \) so that \(S_i \) contains the cases obtaining outcome \(P_i\) under the test T, for \( i = 1,2, \ldots ,r \).

  4. Apply this tree-construction process recursively to each of the subsets \( S_1,S_2 ,\ldots ,S_r \), and let the decision trees returned by these recursive calls be \({S_{ij}}, i = 1,2, \ldots ,r , j = 1,2, \ldots ,{r_i}\).

  5. Return a decision tree with a node labeled \( S_0 \) as the root and the trees \({S_{ij}}, i = 1,2, \ldots ,r, j = 1,2, \ldots ,{r_i}\) as subtrees under this node.
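As a minimal illustration of this recursive construction, the sketch below fits a CART-style classification tree with the rpart package in R. The data frame `train`, its factor column `class`, and the tuning values are assumptions made only for illustration; the conditional inference trees actually used in this paper are introduced in Sect. 2.1.3.

```r
## Minimal sketch of recursive tree construction (CART-style), assuming a
## data frame `train` with a factor column `class` and arbitrary covariates.
library(rpart)

fit <- rpart(class ~ .,                      # consider splits on all covariates
             data    = train,
             method  = "class",              # classification tree
             control = rpart.control(minsplit = 20,  # do not split very small nodes
                                     cp = 0.01))     # stop when no split improves the fit enough

print(fit)             # text view of the recursive splits
plot(fit); text(fit)   # branch structure as a graphic
```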

2.1.1 Growing Decision Tree

Ordinary supervised learning involves a given training set, with the aim of forming a description that can be used to predict previously unobserved samples. Various definitions of the training set have been proposed. Often, it is defined as a bag instance of a particular bag schema. A bag instance is a collection of records or instances that may contain duplicates. Each individual tuple is described by the values of its attribute vector. The bag schema describes the attributes and their domains.

It is usually assumed that the training set tuples are generated randomly and independently according to some fixed and unknown joint probability distribution. For example, assume a training set \(S = \left\{ {\left\langle {{X_1},{y_1}} \right\rangle ,\left\langle {{X_2},{y_2}} \right\rangle , \ldots ,\left\langle {{X_n},{y_n}} \right\rangle } \right\} \) with input attribute set \(A = \left\{ {{a_1},{a_2}, \ldots ,{a_p}} \right\} \) corresponding to the covariates X, and a nominal target attribute y, drawn from a fixed distribution F over the labeled instance space. The objective is an optimal classifier attaining the minimum generalization error, where the generalization error is the misclassification rate with respect to the distribution F. For nominal attributes, the generalization error is given by the following expression [2]:

$$\begin{aligned} \epsilon \left( DT(S),F\right) = \sum _{\left\langle x,y \right\rangle } F\left( x,y\right) \cdot L\left\langle {y,DT\left( S \right) \left( x \right) } \right\rangle \end{aligned}$$

where \(L\left\langle {y,DT\left( S \right) \left( x \right) } \right\rangle \) is the zero-one loss function, defined as follows:

$$\begin{aligned} L\left\langle {y,DT\left( S \right) \left( x \right) } \right\rangle = \begin{cases} 0 &{} \text {if } y = DT\left( S \right) \left( x \right) \\ 1 &{} \text {if } y \ne DT\left( S \right) \left( x \right) \end{cases} \end{aligned}$$

Here the sum is taken over the labeled instance space, DT denotes a decision tree inducer, and DT(S) is the classification tree induced by applying DT to the training set S. If the attributes are numeric, the sum operator is replaced by integration.
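Since F is unknown in practice, the generalization error is typically approximated by the empirical zero-one loss on a hold-out set. The following sketch assumes the fitted tree `fit` from the previous sketch and a hypothetical test set `test` with the same columns.

```r
## Empirical estimate of the generalization error (zero-one loss) on a
## hold-out set `test`; `fit` is the tree induced on the training set.
pred <- predict(fit, newdata = test, type = "class")  # predicted class labels
err  <- mean(pred != test$class)                      # proportion of misclassified cases
err
```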

2.1.2 Decision Trees and Estimation of Probability

The product of the inducer, i.e., the classifier, is used to classify unobserved tuples, either by assigning each tuple explicitly to a particular class or by providing a probability vector containing the conditional probabilities of the instance belonging to each of the individual classes (a probabilistic classifier). Inducers capable of constructing probabilistic classifiers are called probabilistic inducers. Notably, decision trees can estimate the conditional probability \({\hat{P}_{DT\left( S \right) }}\left( {y = {c_j}\mathrm{{|}}{a_k} = {x_{k,q}},k = 1,2, \ldots ,p} \right) \) for a given observation \(\left\langle {{X_q},{y_q}} \right\rangle \). In classification trees, this probability is estimated separately in each leaf from the class frequencies of the training instances belonging to that leaf [2]. Using raw frequencies has the drawback that the estimated probability is zero for any class that never occurs in a given leaf.
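In R, such probability vectors can be obtained directly from a fitted tree. The sketch below again assumes the objects from the previous sketches and also shows a simple Laplace correction of the leaf counts, a common remedy for the zero-probability problem just mentioned (shown here only for illustration, not used in this paper).

```r
## Conditional class probabilities estimated from leaf frequencies.
prob <- predict(fit, newdata = test, type = "prob")  # one row per case, one column per class
head(prob)

## A leaf in which a class never occurs yields an estimated probability of zero.
## An assumed remedy: Laplace correction of the leaf counts.
laplace <- function(counts) (counts + 1) / (sum(counts) + length(counts))
laplace(c(yes = 12, no = 0))   # approx. 0.93 and 0.07 instead of 1 and 0
```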

2.1.3 Algorithmic Outline for Decision Trees

Most algorithms of the recursive partitioning type can be regarded as special cases of a simple two-stage algorithm: in the first stage, the observations are partitioned recursively by univariate splits, and in the second stage a constant model is fitted within each cell of the resulting partition. The most prevalent implementations of such algorithms, CART (Breiman, Friedman, Olshen and Stone, 1984) and C4.5 (Quinlan, 1993), perform an exhaustive search over all possible splits, maximizing an information measure of node impurity, and select the covariate showing the best split [3]. This approach, however, suffers from two major problems: overfitting and selection bias toward covariates with many possible splits.

A more recent approach to growing trees is based on conditional inference (Hothorn, Hornik and Zeileis, 2006). The authors propose a unified framework that embeds recursive binary partitioning in the well-established theory of permutation tests developed by Strasser and Weber (1999). The conditional distribution of statistics measuring the association between the response and the covariates is the basis for an unbiased selection among covariates measured on different scales. Moreover, multiple-testing procedures are applied to determine whether no significant association between any of the covariates and the response can be stated, in which case the recursion is halted. Non-negative integer-valued case weights \(W = \left( {{w_1}, \ldots ,{w_n}} \right) \) are used to formulate a generic algorithm for recursive binary partitioning for any given learning sample. Each node of the tree is represented by a vector of case weights whose elements are one when the corresponding observations belong to the node and zero otherwise. The recursive binary partitioning procedure is as follows [1]:

  1. For the case weights W, test the global null hypothesis of independence between the p covariates and the response. Stop if this hypothesis cannot be rejected; otherwise select the covariate \(X_j\) with the strongest association to Y.

  2. Select a set \(\Theta \subset {\chi _j}\) to split \( \chi _j \) into two disjoint sets \(\Theta \) and \({\chi _j}\backslash \Theta \), where the p-dimensional covariate vector \(X = \left( {{X_1}, \ldots ,{X_p}} \right) \) is drawn from a sample space \(\chi = {\chi _1} \times \ldots \times {\chi _p}\). The case weights \( W_l \) and \( W_r \) determine the two subgroups with elements \({w_{l,i}} = {w_i}\, I\left( {{X_{ji}} \in \mathrm{{\Theta }}} \right) \) and \( {w_{r,i}} = {w_i}\, I\left( {{X_{ji}} \notin \mathrm{{\Theta }}} \right) \), where \(i= 1, \ldots , n\) and I(.) denotes the indicator function.

Steps 1 and 2 are then repeated recursively with the modified case weights \(W_l\) and \(W_r\), respectively. Step 1 of the generic algorithm is regarded as an independence problem because we need to decide whether there is any information about the response variable carried by any of the covariates [1, 2].

The separation of variable selection and splitting in steps 1 and 2 of the procedure is essential for constructing an interpretable tree structure while avoiding a systematic tendency toward covariates with many possible splits or missing values. Moreover, a statistically motivated and intuitive stopping criterion can be employed: the recursion stops when the global null hypothesis of independence between the response and the p covariates cannot be rejected at a pre-specified nominal level \(\alpha \). The algorithm induces a partition \( {B_1,B_2,\ldots ,B_r } \) of the covariate space \(\chi \), in which each cell \( B \in \{B_1,B_2,\ldots ,B_r\}\) is associated with a vector of case weights. In a node characterized by case weights W, the global hypothesis of independence is formulated in terms of the p partial hypotheses \(\mathrm{{\mathrm{H}}}_0^j{:}F\left( {Y\mathrm{{|}}{X_j}} \right) = F\left( Y \right) , j= 1, \ldots , p\), with global null hypothesis \({\mathrm{{\mathrm{H}}}_0} = \bigcap \nolimits _{j = 1}^p {\mathrm{{\mathrm{H}}}_0^j} \). When \( H_0 \) cannot be rejected at the pre-determined level \( \alpha \), the recursion stops. If the global hypothesis can be rejected, the association between Y and each covariate \(X_j, \ j= 1, \ldots ,p \), is measured through p-values indicating the deviation from the partial hypotheses \(\mathrm{H}_0^j\) [1]. In this paper, this procedure is implemented in the R software (version 3.1.3) using the “ctree” command from the “party” package.
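A minimal sketch of this call is given below. The data frame `train`, the column names `time`, `status`, `age`, `sex`, `sw`, `pc1` and `pc2`, and the stopping level are assumptions chosen to mirror the simulation in Sect. 3, not the exact code behind the reported results.

```r
## Conditional inference tree (Hothorn et al. 2006) via the party package,
## here with a censored response so that a survival tree is grown.
library(party)
library(survival)

ct <- ctree(Surv(time, status) ~ age + sex + sw + pc1 + pc2,
            data     = train,
            controls = ctree_control(mincriterion = 0.95))  # stop splitting when the global
                                                             # test is not rejected at alpha = 0.05

plot(ct)      # tree structure with Kaplan-Meier curves in the terminal nodes
predict(ct)   # predicted (median) survival times per terminal node
```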

2.2 Cox Regression Model (Semi-Parametric Model)

The survival time T associated with an event can be a continuous, non-negative random variable having survival function:

$$\begin{aligned} S (t)= P(T>t) \end{aligned}$$

The related hazard function h(t) describes the instantaneous rate at which the event occurs at time t, given that it has not occurred prior to time t.

The hazard function is

$$\begin{aligned} h(t) = \lim _{\Delta t \rightarrow 0}\frac{P\left( t \le T < t+\Delta t \,|\, T \ge t\right) }{\Delta t} = \frac{f(t)}{S(t)} \end{aligned}$$

where f(t) is the density function of the survival time T. The Cox regression model, generally known as proportional hazards (PH) regression, is given by:

$$\begin{aligned} h_x (t,\pmb {\beta } )=h_0 (t )\exp ({\pmb {\beta }}'X ) \end{aligned}$$
(1)

where \( \pmb {\beta }\) is the p-dimensional vector of regression coefficients, and \(\ h_0 (t) \) is the baseline hazard, an unspecified function of time (it is this property that makes the Cox model a semiparametric model). Here \(X = \left( {{x_1}, \ldots ,{x_p}} \right) \) is the vector of predictors or covariates [6]. For two cases with covariate vectors \(x_i\) and \(x_j\), the hazard ratio is

$$\begin{aligned} \frac{{{h_{{x_i}}}(t,\pmb {\beta } )}}{{{h_{{x_j}}}(t,\pmb {\beta } )}} &= \frac{{{h_0 }\left( t \right) \exp \left( {\pmb {\beta } '{x_i}} \right) }}{{{h_0 }\left( t \right) \exp \left( {\pmb {\beta } '{x_j}} \right) }}\\ &= \exp \left\{ {\pmb {\beta }'({x_i} - {x_j})} \right\} \end{aligned}$$

which is independent of the survival time. The likelihood function is given by

$$\begin{aligned} L(\pmb {\beta };\pmb {x}) &= \prod \limits ^n_{i=1} h_{x_i} (t_i,\pmb {\beta })^{\delta _i}S_{x_i}(t_i,\pmb {\beta })\\ &= \prod \limits ^n_{i=1}\left( h_0(t_i)\exp ({\pmb {\beta }}' x_i)\right) ^{\delta _i}S_0(t_i)^{\exp ({\pmb {\beta }}'x_i)} \end{aligned}$$

where \({\delta _i}\) is the censoring indicator: \({\delta _i}=1\) if the event is observed and \( {\delta _i}=0 \) if the observation is censored. Since the baseline distribution is usually unknown, this likelihood function cannot be used directly to obtain estimates of \(\pmb {\beta }\) [7]. Cox (1972) suggested maximizing the so-called partial likelihood function instead. To introduce this concept, we denote the ordered observed event times by \( t_{(i)} ,\ \ i=1,2,\ldots ,d\),

$$\begin{aligned} {t_{\left( 1 \right) }}< {t_{\left( 2 \right) }}< \ldots < {t_{\left( d \right) }}, \qquad d=\sum ^{n}_{i=1}{\delta _i}. \end{aligned}$$

The covariate vector of the case observed to fail at time \( t_{(i)}\) is denoted by \( x_{(i)} \). The partial likelihood proposed by Cox is given by [6]

$$\begin{aligned} \mathop \prod \limits _{i = 1}^d \frac{{exp(\pmb {\beta } '{x_{\left( i \right) }})}}{{\mathop \sum \nolimits _{j \in {R_i}\mathrm{{}}} exp(\pmb {\beta } '{x_j})}}. \end{aligned}$$

where \(R_i = \{ j: t_j \ge t_{(i)} \}\) is the risk set at time \( t_{(i)}\). By maximizing this partial likelihood using the Newton-Raphson technique, we can fit the Cox regression model and estimate the parameter \(\pmb {\beta }\).

In this paper, this model is fitted in the R software using the “coxph” command from the “survival” package [9].
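A minimal sketch of this call, under the same hypothetical data frame and column names assumed in the previous sketches, is:

```r
## Cox proportional hazards model via the survival package.
library(survival)

cox.fit <- coxph(Surv(time, status) ~ age + sex + sw + pc1 + pc2, data = train)
summary(cox.fit)   # estimated coefficients (beta) and hazard ratios exp(beta)

## Risk score (linear predictor beta'x) for new cases:
lp <- predict(cox.fit, newdata = test, type = "lp")
```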

Table 1 Results of simulation data

Fig. 1 Tree-structured survival model for the simulated data based on conditional inference

3 Simulation

Consider the training set in Table 1 containing Monte-Carlo simulated data about patients. Each patient is characterized by six attributes: Age, Sex, Social welfare, Censor, and Postoperative Care-1 and Postoperative Care-2, which describe the time of care after surgery. The goal is to induce a classifier with the highest accuracy in predicting health recovery. To examine the precision of classification of the DT and Cox models, a collection of \(\hbox {N}=1000\) samples was simulated by the Monte-Carlo method: age was drawn from a normal distribution, sex and social welfare class from binomial distributions, and the period of care after the operation, in the two stages in-hospital and outside hospital, from uniform distributions.
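In R, the covariates can be simulated roughly as sketched below; the distribution parameters shown are illustrative assumptions, since the exact values are not specified above.

```r
## Monte-Carlo simulation of the covariates (parameter values are assumed for illustration).
set.seed(1)
N   <- 1000
age <- rnorm(N)                          # age, standardized normal
sex <- rbinom(N, size = 1, prob = 0.5)   # sex, binomial
sw  <- rbinom(N, size = 1, prob = 0.5)   # social welfare class, binomial
pc1 <- runif(N)                          # postoperative care, in-hospital
pc2 <- runif(N)                          # postoperative care, outside hospital
covars <- data.frame(age, sex, sw, pc1, pc2)
```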

Table 2 Results of simulation for exponential hazard model with \(\hbox {N} = 1000\)
Table 3 Results of simulation for exponential hazard model with \(\hbox {N} = 5000\)
Table 4 Results of simulation for exponential hazard model with \(\hbox {N} = 10,000\)

Then three different models for the hazard ratio (HR) with different levels of complexity, namely a simple linear model, a model with two-way interactions, and a model including interactions of order two and higher, were assumed as follows:

  • Model1: Exp{\(.25*\hbox {Sex}+.5*\hbox {SW}-.5*\hbox {Age}+.5*\hbox {PC1}+.25*\hbox {PC2}\)}

  • Model2: Exp{\(.25*\hbox {Sex}+.5*\hbox {SW}-.5*\hbox {Age}+.5*\hbox {PC1}+.25*\hbox {PC2}+.5*(\hbox {Sex}*\hbox {PC1}+\hbox {PC2}*\hbox {SW2}-\hbox {PC1}*\hbox {Age}+\hbox {PC2}*\hbox {Age})\)}

  • Model3: Exp{\(.25*\hbox {Sex}\,+\,.5*\hbox {SW}\,-\,.5*\hbox {Age}\,+\,.5*\hbox {PC1}\,+\,.25*\hbox {PC2}\,+\,.5*(\hbox {Sex}*\hbox {PC1}\,+\,\hbox {PC2}*\hbox {SW2}\,-\,\hbox {Sex}*\hbox {PC1}*\hbox {Age} \,+\,\hbox {SW}*\hbox {PC2}*\hbox {Age})\,+\,.25*\hbox {PC1}*\hbox {PC2}* \hbox {Sex}*\hbox {SW}*\hbox {Age}\)}

In the next step, survival times were simulated from exponential distributions with these hazard models [8]. Then, at a censoring level of k percent, the survival times greater than the kth percentile were taken as censored observations. Average censoring rates at eight levels from 0 to 70% were considered. For learning the models, 75% of the samples were used and the remainder was set aside for testing. Survival times were divided into two groups based on the median. After learning, the Cox and DT classifiers were evaluated by the Area Under the Curve (AUC) criterion and the ROC test [10, 11]. Figure 1 shows the structure of the conditional decision tree for the simulated data. The whole procedure above was repeated 200 times, and the average AUC, the confidence interval (CI) for the AUC, and the P-value of the ROC test were recorded. To examine the effect of a larger number of samples, the entire procedure was repeated with \(\hbox {N2} = 5000\) and \(\hbox {N3} = 10,000\). The results are presented in Tables 2, 3 and 4.
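A condensed sketch of a single replication of this procedure for Model 1 is given below. It builds on the covariate sketch above and uses the pROC package for the AUC and the DeLong test of two ROC curves; the censoring level, the split rule and the other specifics are assumptions for illustration rather than the exact code behind Tables 2, 3 and 4.

```r
## One replication of the simulation (Model 1), as an assumed sketch.
library(survival); library(party); library(pROC)

hr   <- with(covars, exp(.25*sex + .5*sw - .5*age + .5*pc1 + .25*pc2))  # Model 1 hazard
time <- rexp(N, rate = hr)                     # exponential survival times

k      <- 0.30                                 # assumed censoring level (about 30%)
cens   <- quantile(time, probs = 1 - k)        # assumed censoring threshold
status <- as.numeric(time <= cens)             # 1 = event observed, 0 = censored
time   <- pmin(time, cens)

dat   <- data.frame(covars, time, status,
                    group = factor(time > median(time)))  # binary outcome: above/below median time
idx   <- sample(N, size = 0.75 * N)            # 75% training, 25% test
train <- dat[idx, ]
test  <- dat[-idx, ]

## Decision tree and Cox model learned on the training part
ct  <- ctree(group ~ age + sex + sw + pc1 + pc2, data = train)
cox <- coxph(Surv(time, status) ~ age + sex + sw + pc1 + pc2, data = train)

## Scores on the test part and comparison of the two ROC curves
p.ct  <- sapply(treeresponse(ct, newdata = test), function(p) p[2])  # P(group = TRUE)
p.cox <- predict(cox, newdata = test, type = "lp")                   # Cox risk score

roc.ct  <- roc(test$group, p.ct)
roc.cox <- roc(test$group, p.cox)
auc(roc.ct); auc(roc.cox)
roc.test(roc.ct, roc.cox)   # DeLong test for the difference between the AUCs
```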

4 Results and Conclusion

According to this simulation study, when data are censored and the hazard is a linear function of the covariates, the Cox regression model works better than the nonparametric DT model, especially at censoring levels above 50%. However, when data are not censored, the DT model works better than the Cox model.

An examination of the AUC criterion for the non-linear hazard models shows a higher precision of the DT model compared with the Cox model at all censoring levels. Changing the censoring level has no remarkable effect on the AUC criterion of the Cox model. In addition, for the non-linear hazard, as the complexity of the hazard model increases, the AUC of the DT model remains almost constant and significantly higher than that of the Cox model. The significance of these differences is confirmed by the ROC test. These results hold for all sample sizes and can be summarized in terms of the HR as follows:

  1. If the HR is linear with respect to the covariates and the data are censored, the Cox regression model performs better than the non-parametric DT model; especially at censoring levels above 50%, the difference in performance between the models is remarkable. If the data are not censored, the DT model outperforms the Cox model.

  2. If the HR is non-linear with respect to the covariates and includes two-way interactions, both models perform efficiently, but the DT model offers better results than the Cox model. If the data are not censored, the DT model again outperforms the Cox model.

  3. If the HR is non-linear with respect to the covariates and includes interactions of order two and higher, the DT model gives significantly better results than the Cox model at all levels of censoring.

It is noted that an increase in the censoring level, especially above 50%, decreases the performance of the DT model, whereas it does not have a remarkable effect on the Cox model. In addition, as the complexity of the HR form increases, the non-parametric DT model is suggested as the preferred approach, while for a simple linear form the semiparametric Cox model is recommended. Increasing the number of samples has no significant effect on either the Cox or the DT classifier in any of the three HR models.