1 Introduction

Low cardiorespiratory fitness (CRF) is associated with a high risk of cardiovascular disease (CVD) [2,3,4] and is also one of the strongest predictors of all-cause and disease-specific mortality [3, 5,6,7]. In contrast, higher CRF is associated with a lower risk of all-cause and CVD mortality [5, 6, 8]. Regular physical activity can boost CRF in most individuals. Studies of regular physical activity have also established the role of CRF in insulin sensitivity, blood lipid profile, body composition, inflammation, and blood pressure [9, 10]. It is therefore important to assess CRF using a standard method. Cardiopulmonary exercise testing (CPX) is widely considered the gold standard for evaluating cardiopulmonary function and fitness [11]. A recent register-based study by Imboden et al., using data from the Ball State Adult Fitness Longitudinal Lifestyle Study, examined the prognostic significance of CRF and other variables. The authors showed that CPX-derived CRF was related to all-cause and disease-specific mortality [7, 12].

The effect of CRF on mortality status may differ across individuals and could be higher or lower than the estimated average effect. It is informative to identify which characteristics, if any, lead to these differential effects [13]. Our goal is to identify which individuals experience a lower effect of CRF and which experience a higher effect of CRF on mortality status. Thus, the objective of this paper is to find a limited number of features that interact with the treatment and to locate the relevant region of the feature space (a subgroup of individuals) in which survival experience may differ across the two levels of CRF (high and low).

A number of studies have addressed the subgroup identification problem. The interaction trees (IT) algorithm [14] follows the classification and regression trees (CART) approach: it recursively partitions the data with splits chosen to optimize an objective function and then prunes the resulting tree using the Akaike information criterion (AIC). The qualitative interaction tree (QUINT) method [15] uses a sequential partitioning algorithm and finds subgroups by optimizing a weighted sum of a measure of effect size and a measure of subgroup size. QUINT is limited to ordinal and uncensored variables. Foster et al. [16] developed two methods, virtual twins regression, VT(R), and virtual twins classification, VT(C), to find subgroups with an enhanced treatment effect, using simulated data. VT can handle only binary and continuous response variables and does not accommodate missing values or censored observations. Loh et al. [17] overcame these limitations by introducing the GUIDE algorithm, which can handle missing values and three types of response variables: binary, continuous, and censored. Later, Loh et al. [18] introduced Gi, Gs, and Gc by extending the GUIDE algorithm. Gi is the preferred method for finding subgroups defined by predictive variables only. A variable is called predictive when it defines subgroups of individuals who are more likely to respond to a treatment. Gs is sensitive to both predictive and prognostic variables. Gc follows the GUIDE classification tree algorithm [19].

In this paper, we consider logistic regression with the LASSO penalty as a non-tree-based parametric statistical approach [20] and VT(C), Gc, Gs, and Gi as tree-based nonparametric subgroup identification approaches. Tree-based approaches identify subgroups of study participants in terms of many closely related predictive and prognostic variables (such as age, gender, and race) [19]. Using data from the Ball State Adult Fitness Longitudinal Lifestyle Study, we examine the effect of CRF on mortality status within subgroups of participants by establishing an interaction between CRF and the relevant predictor variables.

2 Data and Variables

In this study, we use data from the Ball State Adult Fitness Longitudinal Lifestyle Study (BALL ST) cohort, in which all participants performed a maximal cardiopulmonary exercise (CPX) test and completed an initial comprehensive health and physical fitness assessment between 1969 and 2017. Participants were asked to report personal information, family health history, medication use, and lifestyle behaviors in a questionnaire [12].

All participants were directed to abstain from food, drink, and exercise for twelve hours before testing. It was important to determine the presence of obesity, hypertension, dyslipidemia, and CVD risk factors in individuals [21]. The procedures for the clinical measurements have been described in detail by Whaley and co-workers [22, 23]. To assess each participant’s physical condition, a maximal oxygen uptake test was required. The CPX test uses gas exchange analysis to provide an objective and accurate measurement of peak oxygen uptake, carbon dioxide production, and anaerobic threshold (AT) [24]. Fitness testing was performed using standardized treadmill protocols [25], the Ball State University Bruce Ramp [26], the modified Balke-Ware [27], or other non-specified protocols. Test protocols were chosen according to the physical demands required for a participant to achieve maximal effort within 8 to 12 minutes [28].

In the original data set, there were 3694 de-identified participants with 58 covariates. We remove a small number of participants (around 8%) because of incomplete information. Some variables are excluded from the dataset because they are not relevant to our study. Fitness rank is categorized into low CRF (\(\le \) 33rd percentile) and high CRF (> 33rd percentile) using the percentiles of the Fitness Registry and the Importance of Exercise National Database (FRIEND) [29]. We consider CRF with two levels (high and low) as the treatment variable and mortality status as the response variable. Based on the available covariates in our dataset, we consider BMI, age, total cholesterol, gender, obesity, glucose, smoking status, physical activity level, and triglyceride as explanatory variables.
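The dichotomization of fitness rank can be sketched as follows. This is a minimal illustration, assuming each participant already has a FRIEND fitness percentile; the function name `crf_level` and the example percentiles are hypothetical and are not taken from the study data.

```python
def crf_level(fitness_percentile: float) -> str:
    """Return 'low' for percentiles at or below the 33rd, otherwise 'high'."""
    if not 0 <= fitness_percentile <= 100:
        raise ValueError("percentile must lie in [0, 100]")
    return "low" if fitness_percentile <= 33 else "high"


# Hypothetical percentiles for four participants (illustrative only).
example_percentiles = [12.0, 33.0, 33.5, 80.0]
example_levels = [crf_level(p) for p in example_percentiles]

# Binary treatment coding: 1 for high CRF, 0 for low CRF.
treatment = [1 if level == "high" else 0 for level in example_levels]
```

Note that a participant exactly at the 33rd percentile falls in the low-CRF group under this rule.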

3 Methodology

3.1 Logistic Regression with LASSO Penalty to Identify Features Associated with all-Cause Mortality

The assessment of the effect of a treatment in subgroups of patients can be viewed as establishing an interaction term between the treatment and specific variables. Let \(y_i\) denote all-cause mortality with two levels \((1 \rightarrow \text{ alive }, 0 \rightarrow \text{ dead})\) with probabilities \(Pr(y_i=1) = \pi \) and \(Pr(y_i=0) = 1-\pi \).

We model the probability \(\pi _i\) for the ith participant as \(g(\pi _i) = {\varvec{x}}_i^T \varvec{\beta }\), where \({\varvec{x}}_i\) is the ith row of the model matrix \({\varvec{X}}\) of dimension \(n \times p\), \(\varvec{\beta } \) is a vector of p parameters, and \(g\) is the link function, which is the logit link in this case.
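To make the role of interaction terms in the linear predictor concrete, the sketch below builds the treatment-by-covariate product terms for one participant and shows how the effect of CRF on the log-odds then depends on the covariates. The helper names, the column naming scheme, and the coefficient values are all hypothetical and chosen for illustration only.

```python
def with_interactions(covariates, treatment):
    """Return main effects, the CRF indicator, and CRF-by-covariate products."""
    row = dict(covariates)
    row["crf_high"] = treatment
    for name, value in covariates.items():
        row[f"crf_high:{name}"] = treatment * value
    return row


def crf_effect_on_log_odds(coefs, covariates):
    # Difference in the linear predictor between T = 1 and T = 0 for fixed
    # covariates: the main CRF coefficient plus each interaction coefficient
    # multiplied by the corresponding covariate value.
    effect = coefs.get("crf_high", 0.0)
    for name, value in covariates.items():
        effect += coefs.get(f"crf_high:{name}", 0.0) * value
    return effect


# Made-up coefficients: the CRF effect shrinks with age.
hypothetical_coefs = {"age": -0.05, "crf_high": 1.0, "crf_high:age": -0.01}
young = {"age": 40.0}
old = {"age": 70.0}
effect_young = crf_effect_on_log_odds(hypothetical_coefs, young)
effect_old = crf_effect_on_log_odds(hypothetical_coefs, old)
```

With these made-up numbers the log-odds effect of high CRF is larger for the 40-year-old than for the 70-year-old, which is the kind of differential effect a nonzero interaction coefficient encodes.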

The estimation of the vector \(\varvec{\beta }\) using LASSO (\(L_1\)-penalty) is defined as:

$$\begin{aligned} {\hat{\beta }}_{LASSO}=\arg \min _{\beta }\left[ -\sum ^n_{i=1}\left\{ y_{i}\ln (\pi _i)+(1-y_i)\ln (1-\pi _i)\right\} +\lambda \sum ^p_{j=1}\mid \beta _j\mid \right] \end{aligned}$$
(1)

where \(\lambda \) is a tuning parameter \((\lambda \ge 0)\). It controls the strength of shrinkage applied to the coefficients: the larger the value of \(\lambda \), the more weight is given to the penalty term and the more coefficients are shrunk to exactly zero. Because a suitable value of \(\lambda \) depends on the data, it is chosen by cross-validation [30].
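The way the penalty sets coefficients to exactly zero can be sketched with a small solver for the L1-penalized negative log-likelihood. The solver below uses proximal gradient descent with soft-thresholding, which is an assumption made for illustration; the paper does not state which algorithm was used. The sketch also averages the log-likelihood over participants (so its `lam` is on a different scale from \(\lambda \) in Eq. (1)) and leaves the intercept unpenalized. The toy data are hypothetical.

```python
import math


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def soft_threshold(b, t):
    # Proximal operator of t * |b|: shrink toward zero, exactly zero if |b| <= t.
    if b > t:
        return b - t
    if b < -t:
        return b + t
    return 0.0


def lasso_logistic(X, y, lam, step=0.1, n_iter=2000):
    """Minimise mean negative log-likelihood + lam * sum |beta_j|."""
    n, p = len(X), len(X[0])
    b0, beta = 0.0, [0.0] * p
    for _ in range(n_iter):
        # Residuals pi_i - y_i give the gradient of the negative log-likelihood.
        resid = [sigmoid(b0 + sum(bj * xj for bj, xj in zip(beta, row))) - yi
                 for row, yi in zip(X, y)]
        g0 = sum(resid) / n
        grad = [sum(r * row[j] for r, row in zip(resid, X)) / n for j in range(p)]
        b0 -= step * g0  # intercept is not penalised (an assumption)
        beta = [soft_threshold(bj - step * gj, step * lam)
                for bj, gj in zip(beta, grad)]
    return b0, beta


# Hypothetical toy data: the first column is informative, the second is noise.
X_toy = [[-2.0, 0.1], [-1.5, -0.2], [-1.0, 0.3], [-0.5, -0.1],
         [0.5, 0.2], [1.0, -0.3], [1.5, 0.0], [2.0, 0.1]]
y_toy = [0, 0, 0, 1, 0, 1, 1, 1]

b0_small, beta_small = lasso_logistic(X_toy, y_toy, lam=0.01)
b0_large, beta_large = lasso_logistic(X_toy, y_toy, lam=5.0)
```

With a very large penalty every coefficient is held at zero, while a small penalty lets the informative coefficient move away from zero. In practice the penalty would be chosen by cross-validation rather than fixed by hand as it is here.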

3.2 Virtual Twins for Subgroup Identification

The virtual twins [16] method proceeds in three main steps. First, the treatment effect for each individual is estimated. Second, the data subspaces associated with large treatment effects are identified. Third, in subgroups that look promising, the treatment effect is evaluated and subgroup size is estimated.

The notation is as follows. Let n be the total number of individuals and \(X = (X_1, \ldots , X_p)\) be the measurements recorded for each participant. The outcome variable Y is binary. The treatment indicator T denotes whether CRF is high \((T = 1)\) or low \((T = 0)\). The treatment effect \(Z_i\) is calculated as the difference between the predicted probabilities of survival under high CRF and under low CRF for the ith participant. Our goal is to find the region of the predictor space in which this effect is large.

Step I. Applying Random Forest to the Data

VT follows the concepts of counterfactual modeling, in which there are two potential outcomes for each person: one under high CRF and one under low CRF. For the ith participant, the treatment effect \(Z_i\) is estimated using a random forest as follows:

$$\begin{aligned} Z_i={\hat{p}}_1(y_i=1\mid X_i)-{\hat{p}}_0(y_i=1\mid X_i) \end{aligned}$$
(2)

where \({\hat{p}}_1(y_i=1\mid X_i)\) is the estimated probability of survival given that CRF is high and the individual has characteristics \(X_i\). Similarly, \({\hat{p}}_0(y_i=1\mid X_i)\) is the estimated probability of survival given that the individual has low CRF. The probability under the CRF level that was not observed for an individual is predicted from the data.
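The calculation in Eq. (2) can be sketched as follows. A hand-written logistic function stands in for the fitted random forest here, purely so that the example is self-contained; the function `predicted_survival`, its coefficients, and the three participants are hypothetical.

```python
import math


def predicted_survival(x, t):
    """Hypothetical stand-in for a fitted random forest: P(y = 1 | T = t, X = x).

    x is a dict with 'age' (years) and 'bmi' (kg/m^2); coefficients are made up.
    """
    score = (4.0 - 0.05 * x["age"] - 0.04 * x["bmi"]
             + t * (0.2 + 0.01 * (60 - x["age"])))
    return 1.0 / (1.0 + math.exp(-score))


def virtual_twin_effect(x):
    # Z_i: predicted survival under high CRF (T = 1) minus predicted survival
    # under low CRF (T = 0), evaluated at the same covariates. The prediction
    # for the unobserved CRF level plays the role of the "virtual twin".
    return predicted_survival(x, 1) - predicted_survival(x, 0)


participants = [
    {"age": 40, "bmi": 24.0},
    {"age": 55, "bmi": 31.0},
    {"age": 70, "bmi": 27.0},
]
z = [virtual_twin_effect(x) for x in participants]
```

Each \(Z_i\) is a difference of two probabilities, so it lies between -1 and 1; it is a continuous quantity that is dichotomized only in the next step.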

Step II. Estimate a Classification Tree

The purpose of this tree is to find a small number of X’s that are strongly associated with Z and hence define the region. To find the region, we create a new binary variable \(Z^*\), with \(Z_i^*=1\) if the predicted \(Z_i\) for a study participant is greater than some threshold c, in which case the participant is considered to be in \({\hat{A}}\), and \(Z^*_i=0\) if the predicted \(Z_i\) is less than or equal to c. Thus, \({\hat{A}}\) is defined by the paths down the tree that lead to terminal nodes with predicted \(Z_i\)’s greater than c. In other words, \({\hat{A}}\) is the selected subgroup. This method is denoted VT(C).

Step III. Subgroup Size Estimation

The treatment effect is evaluated, and subgroup size is estimated in this step.
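Steps II and III can be sketched together as below. The sketch skips the classification tree and takes subgroup membership directly from \(Z^*\), which is a simplifying assumption; it then reports the subgroup size and the observed survival rate in each CRF arm. The estimated effects, treatment indicators, and outcomes are hypothetical.

```python
def flag_subgroup(z_values, c):
    # Z*_i = 1 when the estimated effect Z_i exceeds the threshold c, else 0.
    return [1 if z > c else 0 for z in z_values]


def subgroup_summary(z_star, treatment, outcome):
    """Size of the flagged subgroup and observed survival rate per CRF arm."""
    members = [i for i, flag in enumerate(z_star) if flag == 1]

    def rate(arm):
        ys = [outcome[i] for i in members if treatment[i] == arm]
        return sum(ys) / len(ys) if ys else float("nan")

    high, low = rate(1), rate(0)
    return {"size": len(members), "rate_high": high,
            "rate_low": low, "difference": high - low}


z_hat = [0.070, 0.001, 0.050, 0.003, 0.090, 0.060]  # estimated Z_i (made up)
t = [1, 0, 0, 1, 1, 0]                              # 1 = high CRF, 0 = low CRF
y = [1, 1, 0, 0, 1, 1]                              # 1 = alive, 0 = dead
z_star = flag_subgroup(z_hat, c=0.004)
summary = subgroup_summary(z_star, t, y)
```

Raising the threshold c shrinks the flagged subgroup, so the choice of c trades subgroup size against the size of the estimated effect.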

3.3 GUIDE (Generalized Unbiased Interaction Detection and Estimation)

The generalized unbiased interaction detection and estimation (GUIDE) approach recursively partitions the data into a tree whose terminal nodes define the subgroups. The GUIDE algorithm proceeds in three steps: i) split selection, ii) finding the split variable, and iii) searching for the best split on the selected variable. GUIDE is popular because it prevents selection bias in the presence of various types of covariates. GUIDE can handle censored and uncensored response variables as well as response variables with missing values [18]. We discuss the GUIDE (Gi and Gs) survival regression tree-based approaches for censored data below.

3.3.1 GUIDE (Gi and Gs): Regression Tree Approach

Regression tree models are nonparametric, naturally define subgroups, and are not limited by the number of predictor variables. Several obstacles arise when applying the Gi and Gs regression tree approaches to data with censored response variables. The obvious approach of replacing least-squares fits with proportional hazards models in the nodes [14] is problematic because Gs employs Chi-squared tests on the signs of the residuals. In addition, when the algorithm fits a separate proportional hazards model in each node, the nodes have different baseline cumulative hazard functions. As a result, the overall model no longer has proportional hazards, and the regression coefficients cannot be compared between nodes [18]. To address this issue, Gs uses Poisson regression to fit proportional hazards models. Thus, a proportional hazards regression tree can be constructed by iteratively fitting a Poisson regression tree [31, 32].
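The link between the proportional hazards and Poisson likelihoods can be written out as follows. This is a standard identity given as a sketch, not a derivation taken from the paper; the symbols \(\delta _i\) (event indicator), \(t_i\) (observed time), \(\lambda _0\) (baseline hazard), and \(\Lambda _0\) (cumulative baseline hazard) are notation introduced only for this illustration.

$$\begin{aligned} \ell _{PH}(\beta )&=\sum ^n_{i=1}\left[ \delta _i\left\{ \ln \lambda _0(t_i)+x_i^T\beta \right\} -\Lambda _0(t_i)\,e^{x_i^T\beta }\right] ,\\ \ell _{Pois}(\beta )&=\sum ^n_{i=1}\left[ \delta _i\ln \mu _i-\mu _i\right] ,\qquad \mu _i=\Lambda _0(t_i)\,e^{x_i^T\beta },\\ \ell _{PH}(\beta )-\ell _{Pois}(\beta )&=\sum ^n_{i=1}\delta _i\ln \frac{\lambda _0(t_i)}{\Lambda _0(t_i)}. \end{aligned}$$

The difference does not involve \(\beta \), so for a fixed \(\Lambda _0\) the coefficients that maximize the proportional hazards likelihood are the same as those from a Poisson regression with response \(\delta _i\) and offset \(\ln \Lambda _0(t_i)\). Alternating between updating \(\Lambda _0\) and refitting the Poisson model is one way to read the iterative fitting mentioned above.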

Gi employs loglinear model goodness-of-fit tests [33] iteratively to the fitted values to obtain the split variables and split points.

3.3.2 GUIDE (Gc): Classification Tree Approach

In this method, a classification tree is used to find subgroups by defining the class variable as \(\,class \,(V)= (Y+Z)\,\, mod\, 2\), where Y and Z are the binary response and treatment variables, respectively. This requires constructing a new response variable that is likely to identify subgroups with differential treatment effects. The new response variable V is constructed as follows:

$$\begin{aligned} V= {\left\{ \begin{array}{ll} 0,&{} \text {if } \{Y=1 \text { and } Z=1\} \ or \ \text {if } \{Y=0 \text { and } Z=0\} \\ 1,&{} \text {if } \{Y=0 \text { and } Z=1\} \ or \ \text {if } \{Y=1 \text { and } Z=0\} \end{array}\right. } \end{aligned}$$

The motivation behind this definition [18] is that the subjects for which \(class\, = 0\) respond differentially to treatment and those for which \(class = 1\) do not. Although any classification tree algorithm may be used, we use GUIDE [34] here because it does not have selection bias.
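The construction of the class variable can be checked with a few lines of code. This is a minimal sketch; the function name `gc_class` is hypothetical.

```python
def gc_class(y: int, z: int) -> int:
    """Class variable V = (Y + Z) mod 2 for binary response Y and treatment Z."""
    if y not in (0, 1) or z not in (0, 1):
        raise ValueError("Y and Z must be binary (0 or 1)")
    return (y + z) % 2


# All four (Y, Z) combinations and the resulting class label.
class_table = {(y, z): gc_class(y, z) for y in (0, 1) for z in (0, 1)}
```

The table reproduces the piecewise definition: V is 0 when Y and Z agree and 1 when they differ.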

GUIDE develops a tree in three steps: 1) a Chi-square test selects the most significant split variable to split a node; 2) the split set is selected to minimize a node impurity measure (the impurity measures available in GUIDE include entropy and the Gini index); 3) steps 1 and 2 are repeated recursively until each node contains too few observations to split further. After building a complete tree, one of three options, cross-validation pruning (the default), test-sample pruning, or no pruning, determines how much of the tree to retain. The pruning criterion is to minimize an unbiased estimate of misclassification cost [34].
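Step 2 can be illustrated with a small search for the cut point on one numeric variable that minimizes the weighted Gini impurity. This is a generic sketch of impurity-based splitting, not the GUIDE implementation; the Chi-square variable selection and the pruning step are omitted, and the BMI values and class labels are hypothetical.

```python
def gini(labels):
    # Gini impurity of a list of 0/1 class labels: 2 * p * (1 - p).
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2.0 * p * (1.0 - p)


def best_split(values, labels):
    """Return (cut, impurity) for the split 'value <= cut' with least impurity."""
    n = len(values)
    distinct = sorted(set(values))
    best = (None, float("inf"))
    # Candidate cuts are midpoints between consecutive distinct values.
    for lo, hi in zip(distinct, distinct[1:]):
        cut = (lo + hi) / 2.0
        left = [lab for val, lab in zip(values, labels) if val <= cut]
        right = [lab for val, lab in zip(values, labels) if val > cut]
        impurity = (len(left) * gini(left) + len(right) * gini(right)) / n
        if impurity < best[1]:
            best = (cut, impurity)
    return best


bmi = [22.0, 24.5, 27.0, 29.0, 31.0, 33.5, 36.0]  # made-up values
v = [0, 0, 0, 0, 1, 1, 1]                         # made-up class labels
cut, impurity = best_split(bmi, v)
```

In this toy example the classes separate perfectly, so the chosen cut leaves both child nodes pure.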

4 Subgroup Identification with High and Low CRF

Data are preprocessed separately to apply the logistic-LASSO and each of the four tree-based algorithms: VT(C), Gc, Gs, and Gi for identifying the demographic and other risk factors and for the subgroup analysis.

4.1 Logistic Regression with LASSO Penalty

The assessment of the effect of CRF in subgroups of patients can be viewed as establishing an interaction term between the treatment and specific variables. A subgroup of patients can be identified when the interaction between treatment and a certain combination of feature variables is statistically significant. A statistically significant and large interaction effect usually indicates potential subgroups that may respond differently to the treatment. Logistic regression with the LASSO penalty performs variable selection by shrinking coefficient values toward zero. The interactions and interaction effects between treatment and a certain combination of feature variables for our data are shown in Table 1.

Table 1 Interactions and interaction effects between CRF and other predictors for Ball State Adult Fitness Longitudinal Lifestyle Study data

From Table 1, we see that the logistic regression model with the LASSO penalty identifies important covariates and interaction terms by shrinking the coefficients of unimportant terms. The method finds that the interactions of CRF with the other predictors (age, total cholesterol, smoking status, BMI, gender, hypertension, and diabetes) have an impact on mortality status for our data. The results from the tree-based methods for identifying subgroups are presented below and compared with the findings from the logistic-LASSO for consistency.

4.2 Virtual Twin Classification (VT(C))

VT(C) generates subgroups of participants by selecting a limited number of covariates that interact with CRF. VT(C) identifies subgroups for a number of preselected thresholds, along with the subgroup size, treatment event rate, and control event rate. We consider thresholds of .002, .004, .006, .009, .05, and .09 to determine the effect of the threshold value on the important predictors and subgroups. Small changes to the threshold do not change the variables selected for splitting. However, when the threshold is less than .002, VT(C) tends to select too many explanatory variables, and when the threshold is greater than .05, VT(C) fails to find a tree. We present the VT(C) tree for threshold .002 in Figure 1.

Fig. 1 Virtual twin (classification) tree for Ball State Adult Fitness Longitudinal Lifestyle Study data

From Figure 1, we see that VT(C) partitions the study population into subgroups defined by BMI and age for threshold .002. The classification tree predicts \(Z^*=0\) for study participants in the class \( (BMI<33 \ {kgm}^{-2}\,\, \& \,\, age<55 \ years)\), meaning that individuals in this subgroup are expected to live longer with high CRF. The figure also indicates that people are more likely to die if they are older, even if their BMI is lower, which is common knowledge. However, for study participants in the class with \(Z^*=1\), the predictor variable space \((BMI\ge 31\ {kgm}^{-2})\) defines a region in which study participants are more likely to die with low CRF.

4.3 GUIDE Classification (Gc) Approach

The Gc tree, constructed with the newly defined response variable, identifies subgroups with differential effects of CRF. The Gc tree uses estimated priors and unit misclassification costs as weights for predicting the two classes of CRF. Prediction of the classes (class = 0 (CRF low) and 1 (CRF high)) depends on these weights (Fig. 2).

Fig. 2 Gc classification tree for Ball State Adult Fitness Longitudinal Lifestyle Study data

To split on BMI, an observation goes to the left branch if and only if the condition \((BMI\le 29.85)\) is satisfied. Estimated class posterior probabilities are shown beside each node of the Gc tree. Predicted classes and sample sizes are shown below the terminal nodes. The Gc tree partitions the study population into subgroups using BMI as the predictor variable, and the result indicates that study participants with BMI greater than 29.85\({\ kgm}^{-2}\) have low CRF and are likely to have lower survival experience.

4.4 GUIDE Sum (Gs) Algorithm

The GUIDE sum (Gs) regression tree approach is a good alternative for selecting prognostic variables and defining subgroups naturally. Prognostic variables provide information about the response variable without considering the treatment effect (CRF in this case). The Gs proportional hazards regression tree for differential treatment effects is shown in Figure 3.

Fig. 3 Gs classification tree for Ball State Adult Fitness Longitudinal Lifestyle Study data

Sample sizes are shown at the bottom of the terminal nodes, and hazard ratios are given beside the terminal nodes of the Gs tree. Gs uses least-squares residuals and looks at the dichotomized residuals separately in the two treatment arms. The Gs proportional hazards regression tree selects the prognostic variables age and sex, which are significantly associated with all-cause mortality. For the subgroup \( (age<52.5\ years\ \& \ sex=male)\), the hazard of death in the high-CRF group is .609 times that in the low-CRF group. The differential effect of CRF on survival experience for this subgroup is statistically significant. The risk of death for younger females \((age<52.5 \ years)\) with high CRF is around 10% lower than for those with low CRF. Moreover, the risk of death for low-CRF individuals in the older subgroup \((age\ge 52.5 \ years)\) is around 17% higher than for high-CRF individuals, irrespective of sex. To compare the survival experience between the two levels of CRF (high and low) for each subgroup of the Gs tree, a bivariate analysis with the Kaplan–Meier estimate is applied.
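The arithmetic behind reading a hazard ratio as a percentage change can be sketched as follows. The helper names are hypothetical; the only value taken from the text is the hazard ratio of .609 for younger males.

```python
def percent_change_in_hazard(hazard_ratio: float) -> float:
    """Percentage change in hazard implied by a hazard ratio (negative = lower)."""
    if hazard_ratio <= 0:
        raise ValueError("a hazard ratio must be positive")
    return (hazard_ratio - 1.0) * 100.0


def invert_hazard_ratio(hazard_ratio: float) -> float:
    # Swapping the reference group (high vs. low becomes low vs. high)
    # inverts the ratio.
    if hazard_ratio <= 0:
        raise ValueError("a hazard ratio must be positive")
    return 1.0 / hazard_ratio


# Hazard ratio for younger males, high CRF relative to low CRF.
young_male_change = percent_change_in_hazard(0.609)
```

A hazard ratio of .609 therefore corresponds to a hazard about 39% lower in the high-CRF group. Because inverting a ratio is not symmetric on the percentage scale, a statement such as "x% lower for high CRF" and the matching statement for low CRF relative to high CRF will generally quote different percentages.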

Fig. 4 Kaplan–Meier curves for the subgroups of Gs tree for Ball State Adult Fitness Longitudinal Lifestyle Study data

The Kaplan–Meier curves in Figure 4 for the subgroups of the Gs tree show that survival experience differs considerably among the three subgroups. The estimated survival curves show an enhanced impact of CRF on the survival experience of younger males \((age<52.5\ years)\) compared with their counterparts (younger females). Younger males with high CRF have higher survival experience than those with low CRF. It is important to note that although the survival experience of younger males decreases over the follow-up years, younger females have stable, higher survival experience over the follow-up years in both the high- and low-CRF groups. On the other hand, in the older age subgroup, study participants (both males and females) have considerably shorter survival times than the younger study participants.
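For readers unfamiliar with how such curves are computed, the Kaplan–Meier estimate can be sketched in a few lines. This is a minimal illustration of the product-limit estimator, not the code used for Figure 4; the follow-up times and event indicators are hypothetical.

```python
def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct event time.

    times  : follow-up time for each participant
    events : 1 if death was observed at that time, 0 if right-censored
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = removed = 0
        # Group all participants sharing the same follow-up time.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            removed += 1
            i += 1
        if deaths > 0:
            # Multiply by the conditional probability of surviving past t.
            survival *= 1.0 - deaths / at_risk
            curve.append((t, survival))
        at_risk -= removed
    return curve


follow_up_years = [5, 8, 8, 12, 15, 20]  # made-up follow-up times
death_observed = [1, 0, 1, 1, 0, 0]      # made-up event indicators
km_curve = kaplan_meier(follow_up_years, death_observed)
```

Censored participants leave the risk set without lowering the curve, which is why heavy censoring late in follow-up makes the tail of a Kaplan–Meier curve less reliable.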

The results of the log-rank test in Table 2 show that for the subgroups \( (age<52.5 \,\, \& \,\, sex=female)\) and \( (age<52.5 \,\, \& \,\, sex=male)\) there is a difference in survival experience between the two levels of CRF. However, the difference in survival experience between the two levels of CRF for the subgroup \((age \ge 52.5)\) is not noticeable.
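The two-group log-rank test can be sketched as below. This is a minimal textbook version, assuming right-censored data and a Chi-square reference distribution with one degree of freedom; it is not the software used to produce Table 2, and the example data are hypothetical.

```python
import math


def log_rank_test(times_a, events_a, times_b, events_b):
    """Two-group log-rank test; returns (chi_square, p_value) with 1 df."""
    pooled = [(t, e, 0) for t, e in zip(times_a, events_a)]
    pooled += [(t, e, 1) for t, e in zip(times_b, events_b)]
    event_times = sorted({t for t, e, _ in pooled if e == 1})
    observed_a = expected_a = variance = 0.0
    for t in event_times:
        n_a = sum(1 for s, _, g in pooled if s >= t and g == 0)  # at risk, A
        n_b = sum(1 for s, _, g in pooled if s >= t and g == 1)  # at risk, B
        d_a = sum(e for s, e, g in pooled if s == t and g == 0)  # deaths, A
        d_b = sum(e for s, e, g in pooled if s == t and g == 1)  # deaths, B
        n, d = n_a + n_b, d_a + d_b
        observed_a += d_a
        expected_a += d * n_a / n
        if n > 1:
            variance += d * (n_a / n) * (n_b / n) * (n - d) / (n - 1)
    if variance == 0:
        return 0.0, 1.0
    chi_square = (observed_a - expected_a) ** 2 / variance
    # Upper tail of a Chi-square variable with 1 df: erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi_square / 2.0))
    return chi_square, p_value


# Made-up groups: every death in the low-CRF group precedes those in the high-CRF group.
low_times, low_events = [2, 3, 4, 5, 6, 7], [1, 1, 1, 1, 1, 1]
high_times, high_events = [10, 11, 12, 13, 14, 15], [1, 1, 1, 1, 1, 1]
chi_sq, p = log_rank_test(low_times, low_events, high_times, high_events)
same_chi_sq, same_p = log_rank_test([1, 2, 3], [1, 1, 1], [1, 2, 3], [1, 1, 1])
```

The test compares the observed number of deaths in one group with the number expected if both groups shared the same survival curve, so identical groups give a statistic of zero.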

Table 2 Results of log rank test between two levels of CRF (high and low CRF) for survival experience of Gs tree

4.5 GUIDE Interaction (Gi)

When only predictive variables are available for subgroup identification, the Gi regression tree approach is the best choice, as Gs can be negatively affected by the presence of prognostic variables. To avoid the effect of prognostic variables, Gi uses a Chi-squared test of CRF-covariate interaction to identify a split variable at each node. The Gi approach finds predictive variables that have a significant effect on survival experience and interact with CRF. The Gi proportional hazards regression tree for differential treatment effects is shown in Figure 5.

Fig. 5 Gi classification tree for Ball State Adult Fitness Longitudinal Lifestyle Study data

The Gi tree in Figure 5 selects total cholesterol as the interacting covariate. Total cholesterol is used to split the data at the root node to identify subgroups. At each intermediate node, an observation goes to the left child node if and only if the displayed condition is satisfied. Subgroup sizes are shown below the terminal nodes. Subgroups are identified by establishing an interaction between CRF and total cholesterol and locating the appropriate region of total cholesterol in which to check the effect of CRF on mortality status. Study participants with total cholesterol less than or equal to 206.60 mg/dl and those with total cholesterol greater than 206.60 mg/dl may have differential survival experiences across the two levels of CRF (high and low). According to the hazard ratio, patients in the subgroup with total cholesterol less than or equal to 206.60 mg/dl who have high CRF are 23% less likely to die at any time point during the study period than patients who have low CRF. On the other hand, for the subgroup with total cholesterol greater than 206.60 mg/dl, the hazard ratio is \(.982\ (\approx 1)\), indicating almost no difference in survival experience between high and low CRF.

Fig. 6 Kaplan–Meier curves for the subgroups of Gi tree for Ball State Adult Fitness Longitudinal Lifestyle Study data

Figure 6 shows Kaplan–Meier plots of mortality status for each subgroup under the two levels of fitness rank (high and low). From the estimated survival curves, study participants with \((total\ cholesterol\le 206.60\ mg/dl)\) have marginally higher survival experience with high CRF until 30 years of follow-up; after 30 years, patients with high CRF have higher survival experience than patients with low CRF. On the other hand, for individuals with \((total\ cholesterol>206.60\ mg/dl)\), we do not see any significant difference in survival experience between high and low CRF. The results of the log-rank test between the two levels of CRF (high and low) for each subgroup of the Gi tree are presented in Table 3.

Table 3 Results of Log rank test between two levels of CRF (high and low CRF) for survival experience of Gi tree

The results of the log-rank test in Table 3 show that for the subgroup \((total\ cholesterol\le 206.60\ mg/dl)\) there is a difference in survival experience between the two levels of CRF. However, the difference in survival experience between the two levels of CRF for the subgroup \((total\ cholesterol>206.60\ mg/dl)\) is not noticeable.

5 Discussion and Conclusion

In this paper, we identify subgroups of individuals, based on a small number of predictors, in which CRF has a higher or lower impact on mortality status than in the entire population. We have applied classification and regression tree-based algorithms, VT(C), Gc, Gi, and Gs, to find subgroups of individuals from the Ball State Adult Fitness Longitudinal Lifestyle Study data. The overall results from these algorithms suggest that CRF is inversely associated with mortality and that some important features interact with CRF and affect mortality status in our data. That is, individuals with low CRF have lower survival experience than individuals with high CRF. Logistic regression with the LASSO penalty finds interactions between fitness rank and other predictors: age, total cholesterol, smoking status, BMI, gender, hypertension, and diabetes.

The VT(C) and Gc trees partition the study population into subgroups using BMI as a predictor variable, which is consistent with the interaction terms identified by logistic regression with the LASSO penalty. Thus, obesity is identified as a major health problem that has an impact on all-cause mortality. According to a WHO report, obesity is the fifth leading global risk factor for mortality [35]. Lee et al. also found that a moderate to high level of CRF eliminates the higher risk of mortality and that CRF is closely associated with BMI [36]. The Canadian Physical Activity Longitudinal Study of 459 adults, with a 20-year follow-up period, showed that higher CRF at baseline was associated with a lower future risk of obesity [37]. It is important to note that our study finds the same predictor, obesity, to be associated with CRF and to have an impact on all-cause mortality.

Gs finds that CRF has an enhanced impact on the survival experience of males \((age< 52.5\ years)\) compared with females \((age < 52.5 \ years)\). Younger males \((age < 52.5\ years)\) have higher survival experience with high CRF than with low CRF. In addition, older individuals (males and females) aged 52.5 years or more have considerably lower survival experience than their younger counterparts. Beyond this age \((\ge 52.5 \,\,years)\), CRF does not have a major impact on mortality status for the subgroup. CRF has a vital impact on the survival experiences of males and females younger than 52.5 years. According to Wang et al. [38], there are known differences in fitness levels between males and females. Weltman et al. [39] found that some of these differences in fitness are physiologic and depend on age. Our results also suggest that survival experience varies depending on sex and age.

Gi selects total cholesterol as a predictive variable. Patients in the subgroup with total cholesterol less than or equal to 206.60 mg/dl who have high CRF are less likely to die than those who have low CRF. This indicates that total cholesterol is an important marker of all-cause mortality. The results of this study are consistent with the existing literature. According to the Centers for Disease Control and Prevention [40], high blood cholesterol is the fifth leading cause of death. Anderson et al. [41] also found that under age 50 years, cholesterol levels are directly related to overall and CVD mortality. On the other hand, survival experience for the subgroup of study participants with total cholesterol greater than 206.60 mg/dl does not vary considerably with CRF status. One possible explanation is the presence of a larger number of right-censored participants in our study.

The results of the subgroup analysis based on the entire data set indicate that tree-based methods identify fewer predictors than the non-tree-based method. However, these predictors interact significantly with CRF, as the results from logistic regression with LASSO demonstrate. The subgroups obtained by all methods indicate that CRF is inversely associated with all-cause mortality and that individuals with low CRF have a higher chance of mortality. We identify subgroups using a small number of demographic and other risk factors that are associated with low CRF and with all-cause mortality. Our study suggests that total cholesterol and BMI interact with the treatment effect and define subgroups of individuals who may have differential survival experience across the two levels of CRF (high and low). It is to be noted that both tree-based and non-tree-based methods find that the highest risk of mortality is observed in those who are obese, are older, and have a high cholesterol level, which is already established in the literature. Thus, health professionals could encourage individuals to increase their physical activity to achieve higher CRF, which has important implications for health and survival.