Subgroup Identification with Classification and Regression Tree-Based Algorithms: an Application to the Ball State Adult Fitness Longitudinal Study

Sumy, Mst Sharmin Akter; Begum, Munni; Harber, Matthew P.; Finch, W Holmes; Parh, Md Yasin Ali; Fleenor, Bradley S.; Whaley, Mitchell; Peterman, James; Kaminsky, Leonard

doi:10.1007/s40840-022-01328-7

Published: 10 June 2022

Volume 45, pages 445–459, (2022)

Bulletin of the Malaysian Mathematical Sciences Society

Mst Sharmin Akter Sumy¹,
Munni Begum ORCID: orcid.org/0000-0002-0028-546X²,
Matthew P. Harber³,
W Holmes Finch⁴,
Md Yasin Ali Parh¹,
Bradley S. Fleenor³,
Mitchell Whaley³,
James Peterman⁵ &
Leonard Kaminsky⁵

Abstract

Cardiorespiratory fitness (CRF) is not only an objective measure of physical activity, but also a useful diagnostic and prognostic health indicator for patients in clinical settings. There is a well-established inverse relationship between CRF and mortality. However, the effect of CRF on mortality status might differ across subgroups of individuals and could be higher or lower than the estimated average effect of CRF. Thus, the objective of the study is to identify subgroups with higher or lower impact of CRF on mortality status. In addition, we evaluate and compare both tree-based and non-tree-based algorithms for identifying predictive features and subgroups. A penalized logistic regression with least absolute shrinkage and selection operator (LASSO) penalty is performed to identify the features that may be associated with low CRF and all-cause mortality. The algorithms considered are virtual twins classification (VT(C)), generalized unbiased interaction detection and estimation (GUIDE) classification (Gc), GUIDE sum (Gs), and GUIDE interaction (Gi), which are used to find subgroups of participants where CRF exerts a positive or negative association with all-cause mortality in the Ball State Adult Fitness Longitudinal Lifestyle Study (BALL ST) data. The overall result suggests that tree-based (VT and GUIDE) methods naturally define subgroups with fewer predictors, whereas the non-tree-based method (logistic-LASSO) fails to find subgroups and only identifies predictors that have an impact on mortality status. In terms of predictive variable selection and subgroup identification, Gi is the best method compared to the other tree-based and non-tree-based algorithms. Our study identifies subgroups that may benefit from higher CRF.

1 Introduction

Low CRF is associated with a high risk of cardiovascular disease (CVD) [2,3,4] and is also one of the strongest predictors of all-cause and disease-specific mortality [3, 5,6,7]. In contrast, higher CRF is associated with a lower risk of all-cause and CVD mortality [5, 6, 8]. Regular physical activity can boost CRF for most individuals. Studies of regular physical activity have established a role for CRF in insulin sensitivity, blood lipid profile, body composition, inflammation, and blood pressure [9, 10]. It is important to assess CRF using a standard method. Cardiopulmonary exercise testing (CPX) is widely considered the gold standard method for the evaluation of cardiopulmonary function and fitness [11]. A recent registry-based study by Imboden et al. using Ball State Adult Fitness Longitudinal Lifestyle Study data reported the prognostic significance of CRF and other variables. In addition, the authors showed that CPX-derived CRF was related to all-cause and disease-specific mortality [7, 12].

The effect of CRF on mortality status might differ between individuals and could be higher or lower than the estimated average effect of CRF. It is informative to identify which characteristics, if any, lead to these differential effects [13]. Our goal is to identify which individuals have a lower effect of CRF and which individuals have a higher effect of CRF on mortality status. Thus, the objective of this paper is to find a limited number of features that interact with the treatment and to locate the relevant region of the feature space (a subgroup of individuals) in which survival experiences may differ across the two levels of CRF (high and low).

A number of studies have addressed the subgroup identification problem. The interaction trees (IT) algorithm [14] follows the classification and regression trees (CART) approach: it recursively partitions the data with splits chosen to optimize an objective function and then prunes the resulting tree using the Akaike information criterion (AIC). The qualitative interaction tree (QUINT) method [15] uses a sequential partitioning algorithm and finds subgroups by optimizing a weighted sum of a measure of effect size and a measure of subgroup size. QUINT is limited to ordinal and uncensored variables. Foster et al. [16] developed two methods, virtual twins regression, VT(R), and virtual twins classification, VT(C), to find subgroups of enhanced treatment effect, using simulated data. VT can handle only binary and continuous response variables, without missing values or censored observations. Loh et al. [17] overcame these problems by introducing the GUIDE algorithm, which can handle missing values and three types of response variables: binary, continuous, and censored. Later, Loh et al. [18] introduced Gi, Gs, and Gc by extending the GUIDE algorithm. Gi is the preferred method to find subgroups defined by predictive variables only. A variable is called predictive when it defines subgroups of individuals who are more likely to respond to the treatment. Gs is sensitive to both predictive and prognostic variables. Gc follows the GUIDE classification tree algorithm [19].

In this paper, we consider the logistic regression with LASSO penalty as a non-tree-based parametric statistical approach [20] and VT(C), Gc, Gs, and Gi as tree-based nonparametric subgroup identification approaches. Tree-based approaches identify subgroups of study participants in terms of many closely related predictive and prognostic variables (such as age, gender, race) [19]. Using data from the Ball State Adult Fitness Longitudinal Lifestyle Study, the effectiveness of CRF on mortality status in subgroups of participants is checked by establishing an interaction between the CRF and relevant vital predictor variables.

2 Data and Variables

In this study, we use data from the Ball State Adult Fitness Longitudinal Lifestyle Study (BALL ST) cohort, in which all participants performed a maximal cardiopulmonary exercise (CPX) test and completed an initial complete health and physical fitness assessment between 1969 and 2017. Participants were asked to disclose personal information, family health history, medication use, and lifestyle behaviors in a questionnaire [12].

All participants were directed to abstain from food, drink, and exercise for twelve hours before testing. It was important to determine the presence of obesity, hypertension, dyslipidemia, and CVD risk factors in individuals [21]. The procedures for testing clinical measurements have been described by Whaley and co-workers in detail [22, 23]. To assess each participant’s physical condition, a maximal oxygen uptake test was required. The CPX test uses gas exchange analysis to provide an objective and accurate measurement of peak oxygen uptake, carbon dioxide production, and anaerobic threshold (AT) [24]. Fitness testing was performed using standardized treadmill protocols [25], the Ball State University Bruce Ramp [26], modified Balke-Ware [27], or other non-specified protocols. Test protocols were chosen depending on the physical demands required for participants to achieve maximal effort within 8 to 12 minutes [28].

In the original data set, there were 3694 de-identified participants with 58 covariates. We remove a few participants (around 8%) from the data because of incomplete information. Some variables are excluded from the dataset because they are not relevant to our study. The fitness rank is categorized into low-CRF ($\le $ 33rd percentile) and high-CRF (> 33rd percentile) using the percentile of fitness rank from the Fitness Registry and the Importance of Exercise National Database (FRIEND) [29]. We consider CRF with two levels (high and low) as the treatment variable and mortality status as the response variable. Based on the available covariates in our dataset, we consider BMI, age, total cholesterol, gender, obesity, glucose, smoking status, physical activity level, and triglyceride as explanatory variables.
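
The dichotomization of fitness rank described above can be sketched as follows. This is a minimal illustration and not the authors' code: the function name `crf_level` and the example percentiles are hypothetical, and the only rule taken from the text is the cut at the 33rd percentile.

```python
def crf_level(fitness_percentile: float) -> str:
    """Map a FRIEND fitness-rank percentile to the two CRF levels used as treatment."""
    if not 0 <= fitness_percentile <= 100:
        raise ValueError("percentile must lie in [0, 100]")
    # Cut-off from the text: low CRF is at or below the 33rd percentile.
    return "low" if fitness_percentile <= 33 else "high"


# Made-up percentiles, for illustration only.
example_percentiles = [10, 33, 34, 80]
levels = [crf_level(p) for p in example_percentiles]
```

A participant exactly at the 33rd percentile falls in the low-CRF group under this rule.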

3 Methodology

3.1 Logistic Regression with LASSO Penalty to Identify Features Associated with all-Cause Mortality

The assessment of the effect of a treatment in subgroups of patients can be viewed as establishing an interaction term between the treatment and specific variables. Let $y_i$ denote all-cause mortality with two levels $(1 \rightarrow \text{ alive }, 0 \rightarrow \text{ dead})$ with probabilities $Pr(y_i=1) = \pi $ and $Pr(y_i=0) = 1-\pi $.

We model the probability $\pi $ as $g(\pi ) = {\varvec{X}}^T \varvec{\beta }$ where ${\varvec{X}}$ is the model matrix of dimension $n \times p$, $\varvec{\beta } $ is a vector of $p$ parameters, and $g$ is the link function, which is the logit link in this case.

The estimation of the vector $\varvec{\beta }$ using LASSO ($L_1$-penalty) is defined as:

$$\begin{aligned} {\hat{\beta }}_{LASSO}=\mathop {\mathrm {arg\,min}}_{\beta }\left[ -\sum ^n_{i=1}\left\{ y_{i}\ln (\pi _i)+(1-y_i)\ln (1-\pi _i)\right\} +\lambda \sum ^p_{j=1}\mid \beta _j\mid \right] \end{aligned}$$

(1)

where $\lambda $ is a tuning parameter $(\lambda \ge 0)$ that controls the strength of shrinkage applied to the coefficients: the larger the value of $\lambda $, the more weight is given to the penalty term. Because a suitable value of $\lambda $ depends on the data, it is chosen by cross-validation [30].
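
To make the role of $\lambda $ concrete, the sketch below evaluates an $L_1$-penalized negative Bernoulli log-likelihood for a given coefficient vector. It is a minimal illustration under stated assumptions, not the authors' implementation: the function name `lasso_logistic_objective` and the toy data are hypothetical, and every entry of `beta` is penalized here, whereas in practice the intercept is usually left unpenalized.

```python
import math


def lasso_logistic_objective(X, y, beta, lam):
    """Negative Bernoulli log-likelihood plus lam * sum(|beta_j|)."""
    nll = 0.0
    for x_i, y_i in zip(X, y):
        eta = sum(b * x for b, x in zip(beta, x_i))  # linear predictor
        pi = 1.0 / (1.0 + math.exp(-eta))            # inverse of the logit link
        nll -= y_i * math.log(pi) + (1 - y_i) * math.log(1 - pi)
    penalty = lam * sum(abs(b) for b in beta)
    return nll + penalty


# Toy data (made up): first column is a constant, second a single covariate.
X = [[1.0, 0.5], [1.0, -1.2], [1.0, 2.0]]
y = [1, 0, 1]

obj_zero = lasso_logistic_objective(X, y, [0.0, 0.0], lam=0.1)
obj_penalized = lasso_logistic_objective(X, y, [0.0, 1.0], lam=0.1)
obj_unpenalized = lasso_logistic_objective(X, y, [0.0, 1.0], lam=0.0)
```

With all coefficients at zero every fitted probability is one half, so the objective reduces to $n\ln 2$; with a nonzero coefficient the penalty adds exactly $\lambda $ times its absolute value.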

3.2 Virtual Twins for Subgroup Identification

The virtual twins [16] method proceeds in three main steps. First, the treatment effect for each individual is estimated. Second, the data subspaces associated with large treatment effects are identified. Third, in subgroups that look promising, the treatment effect is evaluated and subgroup size is estimated.

The notation is as follows. Let $n$ be the total number of individuals and $X = (X_1, \ldots , X_p)$ be the measurements that have been recorded for each participant. The outcome variable $Y$ is binary. The treatment indicator $T$ records whether CRF is high ($T = 1$) or low ($T = 0$). The treatment effect $Z_i$ is calculated for each participant as the difference between the estimated probabilities of survival under high CRF and under low CRF. Our goal is to find the region of the predictor space where the effect of CRF is high or low.

Step I. Applying Random Forest to the Data

VT follows the concepts of counterfactual modeling, in which each person has two potential outcomes: one under high CRF and one under low CRF. For the $i$th participant, the treatment effect $Z_i$ is calculated using the random forest as follows:

$$\begin{aligned} Z_i={\hat{p}}_1(y_i=1\mid X_i)-{\hat{p}}_0(y_i=1\mid X_i) \end{aligned}$$

(2)

where ${\hat{p}}_1(y_i=1\mid X_i)$ is the estimated probability of survival given that the CRF is high and the individual has characteristics $X_i$. Similarly, ${\hat{p}}_0(y_i=1\mid X_i)$ is the estimated probability of survival given that the individual has low CRF. For each individual, the outcome under the CRF level that was not observed is predicted from the data.

Step II. Estimate a Classification Tree

The purpose of this tree is to find a small number of $X$’s that are strongly associated with $Z$, so that the region can be defined. To find the region, we create a new binary variable $Z^*$, with $Z_i^*=1$ if the predicted $Z_i$ of a study participant is greater than some threshold $c$, in which case the participant is considered to be in ${\hat{A}}$, and $Z^*_i=0$ if the predicted $Z_i$ is less than or equal to the threshold $c$. Thus, ${\hat{A}}$ is defined by the paths down the tree which lead to terminal nodes with predicted $Z_i$’s greater than $c$. In other words, ${\hat{A}}$ is the selected subgroup. This method is denoted VT(C).
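
Steps I and II can be illustrated with a short sketch that turns two sets of predicted survival probabilities into $Z_i$ and the thresholded label $Z_i^*$. The sketch assumes the probabilities have already been estimated; in the method they come from a random forest, whereas the numbers below are made up. The function name `virtual_twin_labels` is hypothetical, and the classification tree that VT(C) subsequently fits to $Z^*$ is not shown.

```python
def virtual_twin_labels(p_high, p_low, c):
    """Return (Z, Z_star) with Z_i = p_high_i - p_low_i and Z*_i = 1 if Z_i > c else 0."""
    if len(p_high) != len(p_low):
        raise ValueError("p_high and p_low must have the same length")
    z = [ph - pl for ph, pl in zip(p_high, p_low)]
    z_star = [1 if z_i > c else 0 for z_i in z]
    return z, z_star


# Made-up predicted survival probabilities for four participants.
p_high = [0.95, 0.90, 0.80, 0.70]  # predicted under high CRF (T = 1)
p_low = [0.94, 0.80, 0.80, 0.75]   # predicted under low CRF (T = 0)

z, z_star = virtual_twin_labels(p_high, p_low, c=0.05)
```

Only the second participant has an estimated effect above the threshold, so only that participant receives $Z^*=1$.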

Step III. Subgroup Size Estimation

The treatment effect is evaluated, and subgroup size is estimated in this step.

3.3 GUIDE (Generalized Unbiased Interaction Detection and Estimation)

The generalized unbiased interaction detection and estimation (GUIDE) approach recursively partitions the data into a tree whose terminal nodes define the subgroups. The GUIDE algorithm proceeds in three steps: i) split selection, ii) finding the split variable, and iii) searching for the best split on the selected variable. GUIDE is popular because it prevents selection bias in the presence of various types of covariates. GUIDE can handle censored and uncensored response variables as well as missing values [18]. We discuss the GUIDE (Gi and Gs) survival regression tree-based approaches for censored data below.

3.3.1 GUIDE (Gi and Gs): Regression Tree Approach

Regression tree models are nonparametric, naturally define subgroups, and are not limited by the number of predictor variables. Several obstacles arise when applying the Gi and Gs regression tree approaches to data with censored response variables. The obvious approach of replacing least-squares fits with proportional hazards models in the nodes [14] is problematic because Gs employs Chi-squared tests on residuals and their signs. Moreover, when the algorithm for a censored response variable fits a separate proportional hazards model in each node, the nodes have different baseline cumulative hazard functions. As a result, the overall model no longer has proportional hazards, and the regression coefficients cannot be compared between nodes [18]. To address this issue, Gs uses Poisson regression to fit proportional hazards models. Thus, we can construct a proportional hazards regression tree by iteratively fitting a Poisson regression tree [31, 32].

Gi employs loglinear model goodness-of-fit tests [33] iteratively to the fitted values to obtain the split variables and split points.

3.3.2 GUIDE (Gc): Classification Tree Approach

In this method, a classification tree is used to find subgroups by defining the class variable as $\,class \,(V)= (Y+Z)\,\, mod\, 2$, where $Y$ and $Z$ are the binary response and treatment variables, respectively. The classification tree requires a new response variable that is likely to identify subgroups with differential treatment effects. The new response variable $V$ can be constructed as follows:

$$\begin{aligned} V= {\left\{ \begin{array}{ll} 0,&{} \text {if } \{Y=1 \text { and } Z=1\} \ or \ \text {if } \{Y=0 \text { and } Z=0\} \\ 1,&{} \text {if } \{Y=0 \text { and } Z=1\} \ or \ \text {if } \{Y=1 \text { and } Z=0\} \end{array}\right. } \end{aligned}$$

The motivation behind this definition [18] is that the subjects for which $class\, = 0$ respond differentially to treatment and those for which $class = 1$ do not. Although any classification tree algorithm may be used, we use GUIDE [34] here because it does not have selection bias.
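
The construction of $V$ can be checked with a few lines of code. The sketch below is only an illustration of the definition $V=(Y+Z)\ mod\ 2$; the function name `gc_class` is hypothetical.

```python
def gc_class(y: int, z: int) -> int:
    """Class variable V = (Y + Z) mod 2 for binary response y and binary treatment z."""
    if y not in (0, 1) or z not in (0, 1):
        raise ValueError("y and z must both be 0 or 1")
    return (y + z) % 2


# Enumerate all four (Y, Z) combinations.
table = {(y, z): gc_class(y, z) for y in (0, 1) for z in (0, 1)}
```

The resulting table gives $V=0$ when response and treatment agree and $V=1$ when they differ, matching the piecewise definition.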

GUIDE develops a tree in three steps: 1) a Chi-square test selects the most significant variable on which to split a node; 2) the split set is selected to minimize a node impurity measure (the impurity measures available in GUIDE include entropy and the Gini index); 3) steps 1 and 2 are repeated recursively until each node contains too few observations. After building a complete tree, one of three options, cross-validation pruning (the default), test-sample pruning, or no pruning, is used to decide how much of the tree to retain. The pruning criterion is to minimize an unbiased estimate of misclassification cost [34].
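
The two impurity measures named in step 2 can be written down directly. The sketch below is a generic illustration of the Gini index and entropy and of the weighted impurity of a binary split; it is not GUIDE's implementation, whose split search is more elaborate, and the function names and toy labels are hypothetical.

```python
import math
from collections import Counter


def class_proportions(labels):
    """Proportion of each class among the labels in a node."""
    if not labels:
        raise ValueError("a node must contain at least one observation")
    n = len(labels)
    return [count / n for count in Counter(labels).values()]


def gini(labels):
    """Gini index: 1 - sum of squared class proportions."""
    return 1.0 - sum(p * p for p in class_proportions(labels))


def entropy(labels):
    """Shannon entropy in bits."""
    return sum(-p * math.log2(p) for p in class_proportions(labels))


def split_impurity(left, right, measure=gini):
    """Impurity of a binary split, weighting each child by its share of observations."""
    n = len(left) + len(right)
    return len(left) / n * measure(left) + len(right) / n * measure(right)


parent = [0, 0, 1, 1]                       # toy class labels
perfect = split_impurity([0, 0], [1, 1])    # both children are pure
useless = split_impurity([0, 1], [0, 1])    # children look like the parent
```

A split that separates the classes perfectly drives the weighted impurity to zero, while a split that leaves both children as mixed as the parent yields no reduction.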

4 Subgroup Identification with High and Low CRF

Data are preprocessed separately to apply the logistic-LASSO and each of the four tree-based algorithms: VT(C), Gc, Gs, and Gi for identifying the demographic and other risk factors and for the subgroup analysis.

4.1 Logistic Regression with LASSO Penalty

The assessment of the effect of CRF in subgroups of patients can be viewed as establishing an interaction term between the treatment and specific variables. A subgroup of patients can be identified when the interaction between treatment and a certain combination of feature variables is statistically significant. A statistically significant and large interaction effect usually indicates potential subgroups that may have different responses to the treatment. Logistic regression with the LASSO penalty performs variable selection by shrinking the coefficient values. The interactions and interaction effects between the treatment and the feature variables for our data are shown in Table 1.

Table 1 Interactions and interaction effects between CRF and other predictors for Ball State Adult Fitness Longitudinal Lifestyle Study data

From Table 1, we see that the logistic regression model with the LASSO penalty retains some covariates and interaction terms by shrinking the remaining terms. This method finds that the interactions of CRF with age, total cholesterol, smoking status, BMI, gender, hypertension, and diabetes have an impact on mortality status for our data. The results from the tree-based methods for identifying subgroups are presented below and compared for consistency with the findings from the logistic-LASSO.

4.2 Virtual Twin Classification (VT(C))

VT(C) generates subgroups of participants by selecting a limited number of covariates that interact with CRF. VT(C) identifies subgroups for a number of preselected thresholds along with subgroup size, treatment event rate, and control event rate. We consider the thresholds .002, .004, .006, .009, .05, and .09 to determine the effect of the threshold value on the important predictors and subgroups. Small changes to the threshold do not change the variables selected for splitting. However, when the threshold is less than .002, VT(C) tends to find too many explanatory variables, and when the threshold is greater than .05, VT(C) fails to find a tree. We present a VT(C) tree for threshold .002 in Figure 1.

From Figure 1, we see that VT(C) partitions the whole study population into subgroups defined by BMI and age for threshold .002. The classification tree predicts that study participants belong to the class $ (BMI<33 \ {kgm}^{-2}\,\, \& \,\, age<55 \ years)$ with $Z^*=0$, meaning that the individuals in this subgroup are expected to live longer because of their high CRF. The figure also indicates that people are more likely to die if they are older, even if their BMI is lower, which is common knowledge. However, for study participants that belong to the class with $Z^*=1$, the predictor variable space $(BMI\ge 31\ {kgm}^{-2})$ defines a region where study participants are more likely to die because of their low CRF.

4.3 GUIDE Classification (Gc) Approach

The Gc tree, constructed with the newly defined response variable, identifies subgroups with differential effects of CRF. The Gc tree uses estimated priors and unit misclassification costs as weights for predicting the two classes of CRF. Prediction of the classes (class = 0 (CRF low) and 1 (CRF high)) depends on these weights (Fig. 2).

To split on BMI, an observation goes to the left branch if and only if the condition $(BMI\le 29.85)$ is satisfied. Estimated class posterior probabilities are shown beside each node of the Gc tree. Predicted classes and sample sizes are shown below the terminal nodes. The Gc tree partitions the whole study population into subgroups using BMI as the predictor variable, and the result indicates that study participants with BMI greater than 29.85${\ kgm}^{-2}$ have low CRF and are likely to have lower survival experience.

4.4 GUIDE Sum (Gs) Algorithm

The GUIDE sum (Gs) regression tree approach is a good alternative for selecting prognostic variables and defining subgroups naturally. Prognostic variables provide information about the response variable without considering the treatment effect (CRF in this case). The Gs proportional hazards regression tree for differential treatment effects is shown in Figure 3.

Sample sizes are at the bottom of the terminal nodes, and hazard ratios are given beside the terminal nodes of the Gs tree. Gs uses least-squares residuals and looks at the dichotomized residuals separately in the two treatment arms. The Gs proportional hazards regression tree selects the prognostic variables age and sex, which are significantly associated with all-cause mortality. For the subgroup $ (age<52.5\ years\ \& \ sex=male)$, the hazard of death in the high CRF group is .609 times that in the low CRF group. The differential effect of CRF on survival experience for this subgroup is statistically significant. The risk of death for younger females $(age<52.5 \ years)$ with high CRF is around 10% lower than for those with low CRF. Moreover, the risk of death for low CRF individuals $(age\ge 52.5 \ years)$ is around 17% higher than for high CRF individuals, irrespective of their sex. To compare the survival experience between the two levels of CRF (high and low) for each subgroup of the Gs tree, a bivariate analysis with the Kaplan–Meier estimate is applied.
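
A hazard ratio such as the one reported for younger males can be converted into a percentage change in hazard with a one-line calculation. The sketch below only illustrates that arithmetic; the function name `percent_change_in_hazard` is hypothetical, and .609 is the only value taken from the text.

```python
def percent_change_in_hazard(hazard_ratio: float) -> float:
    """Percentage change in hazard implied by a hazard ratio (negative means lower hazard)."""
    if hazard_ratio <= 0:
        raise ValueError("a hazard ratio must be positive")
    return 100.0 * (hazard_ratio - 1.0)


# Hazard ratio for younger males, high CRF versus low CRF, as reported in the text.
change_younger_males = round(percent_change_in_hazard(0.609), 1)
```

Read this way, a hazard ratio of .609 corresponds to a hazard of death about 39% lower in the high CRF group than in the low CRF group.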

Kaplan–Meier curves in Figure 4 for the subgroups of the Gs tree show that survival experience differs considerably among the three subgroups. From the estimated survival curves, it is seen that CRF has a greater impact on the survival experience of younger males $(age<52.5\ years)$ than on that of their counterparts (younger females). Younger males with high CRF have higher survival experience than males with low CRF. It is important to note that although the survival experience of younger males decreases over the follow-up years, younger females have a stable, higher survival experience over the follow-up years in both the high- and low-CRF groups. On the other hand, in the older age subgroup, study participants (both males and females) have considerably lower survival time than the younger study participants (both males and females).
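
For readers unfamiliar with how such curves are built, the following is a minimal sketch of the Kaplan–Meier product-limit estimator. It is a generic illustration, not the authors' code: the function name `kaplan_meier` and the follow-up times are made up, and events are coded here as 1 for death and 0 for right-censoring.

```python
def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct death time; events: 1 = death, 0 = censored."""
    if len(times) != len(events):
        raise ValueError("times and events must have the same length")
    survival = 1.0
    curve = []
    for t in sorted(set(times)):
        at_risk = sum(1 for u in times if u >= t)  # still under observation at t
        deaths = sum(1 for u, e in zip(times, events) if u == t and e == 1)
        if deaths > 0:
            survival *= 1.0 - deaths / at_risk     # product-limit update
            curve.append((t, survival))
    return curve


# Made-up follow-up times (years) and event indicators for five participants.
times = [2, 3, 3, 5, 8]
events = [1, 1, 0, 1, 0]
curve = kaplan_meier(times, events)
```

In an analysis like the one described, the estimator would be applied separately to the high- and low-CRF groups within each subgroup, and the resulting curves compared.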

The results of the log-rank test in Table 2 show that for the subgroups $ (age<52.5 \,\, \& \,\, sex=female)$ and $ (age<52.5 \,\, \& \,\, sex=male)$ there is a difference in survival experience between the two levels of CRF. However, the difference in survival experience between the two levels of CRF for the subgroup $(age \ge 52.5)$ is not noticeable.

Table 2 Results of log rank test between two levels of CRF (high and low CRF) for survival experience of Gs tree

4.5 GUIDE Interaction (Gi)

When only predictive variables are available for subgroup identification, the Gi regression tree approach is the best choice, as Gs can be negatively affected by the presence of prognostic variables. To avoid the effect of prognostic variables, Gi uses a Chi-squared test of CRF-covariate interaction to identify a split variable at each node. The Gi approach finds predictive variables that have a significant effect on survival experience and interact with CRF. The Gi proportional hazards regression tree for differential treatment effects is shown in Figure 5.

The Gi tree in Figure 5 selects total cholesterol as the interacting covariate. Total cholesterol is used to split the data at the root node to identify subgroups. At each intermediate node, an observation goes to the left child node if and only if the displayed condition is satisfied. Subgroup sizes are below the terminal nodes. Subgroups are identified by establishing an interaction between CRF and total cholesterol and locating the appropriate region of total cholesterol in which to check the effectiveness of CRF on mortality status. Study participants with total cholesterol less than or equal to 206.60 mg/dl and those with total cholesterol greater than 206.60 mg/dl may have differential survival experiences across the two levels of CRF (high and low). According to the hazard ratio, patients in the subgroup with total cholesterol less than or equal to 206.60 mg/dl who have high CRF are, at any time point during the study period, 23% less likely to die than patients who have low CRF. On the other hand, for the subgroup with total cholesterol greater than 206.60 mg/dl, the hazard ratio is $.982\ (\approx 1)$, indicating almost no noticeable change in survival experience whether CRF is high or low.

Kaplan–Meier plots in Figure 6 show, for each subgroup, the survival experience under the two levels of fitness rank (high and low). From the estimated survival curves, it is seen that study participants with $(total\ cholesterol\le 206.60\ mg/dl)$ have marginally higher survival experience with high CRF until 30 years of follow-up, and after 30 years patients with high CRF have higher survival experience than patients with low CRF. On the other hand, for individuals with $(total\ cholesterol>206.60\ mg/dl)$, we do not see any significant difference in survival experience whether their CRF is high or low. Results of the log-rank test between the two levels of CRF (high and low) for each subgroup of the Gi tree are presented in Table 3.

Table 3 Results of Log rank test between two levels of CRF (high and low CRF) for survival experience of Gi tree

The results of the log-rank test in Table 3 show that for the subgroup ($total\,\_\,cholesterol\le 206.60$) there is a difference in survival experience between the two levels of CRF. However, the difference in survival experience between the two levels of CRF for the subgroup $(total\_\,cholesterol\,>\, 206.60)$ is not noticeable.

5 Discussion and Conclusion

In this paper, we identify subgroups of individuals, based on a small number of predictors, in which CRF has a higher or lower impact on mortality status than in the entire population. We have applied classification and regression tree-based algorithms, VT(C), Gc, Gi, and Gs, to find subgroups of individuals from the Ball State Adult Fitness Longitudinal Lifestyle Study data. The overall results from these algorithms suggest that CRF is inversely associated with mortality status and that some important features are associated with CRF and have effects on mortality status for our data. That is, individuals with low CRF have lower survival experience than individuals with high CRF. Logistic regression with the LASSO penalty finds interactions between fitness rank and other predictors: age, total cholesterol, smoking status, BMI, gender, hypertension, and diabetes.

The VT(C) and Gc trees partition the whole study population into subgroups using BMI as a predictor variable, which is consistent with the interaction terms identified by the logistic regression with LASSO penalty. Thus, obesity is identified as the major health problem that has an impact on all-cause mortality. According to a WHO report, obesity is the fifth leading global risk factor for mortality [35]. Lee et al. also found that a moderate to high level of CRF eliminates the higher risk of mortality and that CRF is closely associated with BMI [36]. The Canadian Physical Activity Longitudinal Study of 459 adults, with a 20-year follow-up period, showed that higher CRF at baseline was associated with lower future risk of obesity [37]. It is important to note that in our study we find the same predictor, obesity, which is associated with CRF and has an impact on all-cause mortality.

Gs finds that CRF has a greater impact on the survival experience of males with $(age< 52.5\ years)$ than on that of females with $(age < 52.5 \ years)$. Younger males $(age < 52.5\ years)$ with high CRF have higher survival experience than males with low CRF. In addition, older individuals (males and females) aged 52.5 years or more have considerably lower survival experience compared to their younger counterparts. After a certain age $(\ge 52.5 \,\,years)$, CRF does not have a major impact on mortality status for the subgroup. CRF has a vital impact on the survival experiences of males and females of age less than 52.5 years. According to Wang et al. [38], there are known differences in fitness levels between males and females. Weltman et al. [39] found that some of these differences in fitness are physiologic and depend on age. Our results also suggest that survival experience varies depending on sex and age.

Gi selects total cholesterol as a predictive variable, and it is to be noted that patients in the subgroup with total cholesterol less than or equal to 206.60 mg/dl who have high CRF are less likely to die than patients who have low CRF. This indicates that total cholesterol is an important marker of all-cause mortality. The results of this study are consistent with the existing literature. According to the Centers for Disease Control and Prevention [40], high blood cholesterol is the fifth leading cause of death. Anderson et al. [41] also found that under age 50 years, cholesterol levels are directly related to overall and CVD mortality. On the other hand, survival experience for the subgroup of study participants with total cholesterol greater than 206.60 mg/dl does not vary considerably regardless of their CRF status. One possible explanation is the presence of a larger number of right-censored participants in our study.

The results of the subgroup analysis based on the entire data set indicate that tree-based methods identify fewer predictors than the non-tree-based method. However, these predictors interact significantly with CRF, as the results from the logistic regression with LASSO demonstrate. The subgroups obtained by all methods indicate that CRF is inversely associated with all-cause mortality and that individuals with low CRF have a higher chance of mortality. We identify subgroups with a small number of demographic and other risk factors that cause low CRF and are responsible for all-cause mortality. Our study suggests that total cholesterol and BMI interact with the treatment effect, and it finds subgroups of individuals who may have differential survival experience across the two levels of CRF (high and low). It is to be noted that both tree-based and non-tree-based methods find that the highest risk of mortality is observed in those who are obese, are older, and have a high cholesterol level, which is already established in the literature. Thus, health professionals could encourage individuals to increase physical activity to achieve higher CRF, which has important implications for health and survival.

Data Availability

Sample data available in https://github.com/m0sumy01/data-and-materials

Code Availability

R-codes are available in https://github.com/m0sumy01/r-code

References

Lee, D.C., Artero, E.G., Sui, X., Blair, S.N.: Mortality trends in the general population: the importance of cardiorespiratory fitness. J. Psychopharmacol. (Oxford, England), 24(4 Suppl), 27–35
Myers, J., Kaykha, A., George, S., et al.: Fitness versus physical activity patterns in predicting mortality in men. Am. J. Med. 117(12), 912–8 (2004)
Blair, S.N.: Physical inactivity: the biggest public health problem of the 21st century. Br. J. Sports Med. 43(1), 1–2 (2009)
Kokkinos, P.F., Holland, J.C., Pittaras, A.F., et al.: Cardiorespiratory fitness and coronary heart disease risk factor association in women. J. Am. Coll. Cardiol. 26(2), 358–64 (1995)
Robsahm, T.E., Falk, R.S., Heir, T., et al.: Measured cardiorespiratory fitness and self-reported physical activity: associations with cancer risk and death in a long-term prospective cohort study. Cancer Med. 5(8), 2136–2144 (2016)
Myers, J., Prakash, M., Froelicher, V. et al.: Exercise capacity and mortality among men referred for exercise testing. N Engl J Med
Harber, M.P., Kaminsky, L.A., Arena, R., et al.: Impact of cardiorespiratory fitness on all-cause and disease-specific mortality: advances since 2009. Prog. Cardiovasc. Dis. 60(1), 11–20 (2017)
Lakoski, S.G., Willis, B.L., Barlow, C.E., et al.: Midlife cardiorespiratory fitness, incident cancer, and survival after cancer in men: the cooper center longitudinal study. JAMA Oncol. 1(2), 231–7 (2015)
Reaven, G.: All obese individuals are not created equal: insulin resistance is the major determinant of cardiovascular disease in overweight/obese individuals. Diab. Vasc. Dis. Res. 2, 105–112 (2005)
Parh, M.Y.A., Begum, M., et al.: Subgroup identification for differential cardio-respiratory fitness effect on cardiovascular disease risk factors: a model-based recursive partitioning approach. J. Stat. Res. 54(2), 147–165 (2020). https://doi.org/10.47302/jsr.2020540204
Hirashiki, A., Kondo, T., Okumura, T., et al.: Cardiopulmonary exercise testing as a tool for diagnosing pulmonary hypertension in patients with dilated cardiomyopathy. Ann. Noninvasive Electrocardiol. 21(3), 263–271 (2016). https://doi.org/10.1111/anec.12308
Imboden, M.T., Harber, M.P., Whaley, M.H., et al.: Cardiorespiratory fitness and mortality in healthy men and women. J. Am. Coll. Cardiol. 72(19), 2283–2292 (2018)
Seibold, H., Zeileis, A., Hothorn, T.: Model-based recursive partitioning for subgroup analyses. Int. J. Biostat. 12(1), 45–63 (2016)
Su, X., Zhou, T., Yan, X., Fan, J., Yang, S.: Interaction trees with censored survival data. Int. J. Biostat. 4(1), Article 2 (2008)
Dusseldorp, E., Mechelen, I.V.: Qualitative interaction trees: a tool to identify qualitative treatment-subgroup interactions. Stat. Med., 06 August (2013)
Foster, J.C., Taylor, J.M.G., Ruberg, S.J.: Subgroup identification from randomized clinical trial data. Stat. Med. 30, 2867–2880 (2011). [PubMed: 21815180]
Loh, W., Cao, L., Zhou, P.: Subgroup identification for precision medicine: a comparative review of 13 methods. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery. 09 June (2019)
Loh, W., He, X., Man, M.: A regression tree approach to identifying subgroups with differential treatment effects. Stat. Med. 34(11), 1818–1833 (2015)
Loh, W.Y., Zhou, P.: The GUIDE approach to subgroup identification. In: Ting N., Cappelleri J., Ho S., Chen G. (eds) Design and analysis of subgroups with biopharmaceutical applications. emerging topics in statistics and biostatistics. Springer, Cham
James, G., Witten, D., Tibshirani, R., et al.: An introduction to statistical learning. Springer Texts in Statistics (2013)
American College of Sports Medicine: ACSM’s guidelines for exercise testing and prescription, 10th edn., p. 120. Wolters Kluwer, Philadelphia, PA (2017)
Whaley, M.H., Kaminsky, L.A., et al.: Predictors of over- and underachievement of age-predicted maximal heart rate. Med. Sci. Sports Exerc. 24(10), 1173–9 (1992)
Whaley, M.H., Kaminsky, L.A., et al.: Failure of predicted VO2peak to discriminate physical fitness in epidemiological studies. Med. Sci. Sports Exerc. 27(1), 85–91 (1995)
Singh, V.N.: The role of gas analysis with exercise testing. Office-Based Procedures: Part I. 28(1):159-79, (2001), vii-viii. https://doi.org/10.1016/s0095-4543(05)70012-9
Bruce, R.A., Blackmon, J.R., et al.: Exercising testing in adult normal subjects and cardiac patients. Pediatrics 32(Suppl), 742–56 (1963)
Kaminsky, L.A., Whaley, M.H.: Evaluation of a new standardized ramp protocol: the BSU/Bruce Ramp protocol. PubMed.gov 18(6), 438–44 (1998)
Pollock, M.L., Foster, C., Schmidt, D., et al.: Comparative analysis of physiologic responses to three different maximal graded exercise test protocols in healthy women. PubMed.gov 103(3), 363–73 (1982)
American College of Sports Medicine: ACSM’s guidelines for exercise testing and prescription, p. 120. Wolters Kluwer, Philadelphia, PA (2017)
Kaminsky, L.A., Arena, R., Myers, J.: Reference standards for cardiorespiratory fitness measured with cardiopulmonary exercise testing: data from the fitness registry and the importance of exercise national database. Mayo Clin. Proc. 90(11), 1515–23 (2015). https://doi.org/10.1016/j.mayocp.2015.07.026
James, G., Witten, D., Hastie, T., Tibshirani, R.: An introduction to statistical learning. Springer, New York (2013). Collett, D.: Modelling survival data in medical research, 3rd edn. Texts in Statistical Science (2015)
Chaudhuri, P., Lo, W.D., Loh, W.Y., Yang, C.C.: Generalized regression trees. Stat. Sin. 5, 641–666 (1995)
Loh, W.Y.: Regression tree models for designed experiments. In: Rojo, J. (ed.) Second E. L. Lehmann Symposium. Institute of Mathematical Statistics Lecture Notes-Monograph Series, vol. 49, pp. 210–228. https://doi.org/10.1214/074921706000000464
Agresti, A.: An introduction to categorical data analysis, 2nd edn. Wiley (2007)
Loh, W.Y.: Improving the precision of classification trees. Ann. Appl. Stat. 3(4), 1710–1737 (2009)
WHO: Global health risks: mortality and burden of disease attributable to selected major risks. World Health Organization, Geneva (2009)
Lee, D.C., Artero, E.G., Sui, X., Blair, S.N.: Cardiorespiratory fitness, body composition, and all cause and cardiovascular disease mortality in men. J. Psychopharmacol. 24(4 supplement), 27–35 (2010)
Brien, S.E., Katzmarzyk, P.T., Craig, C.L., Gauvin, L.: Physical activity, cardiorespiratory fitness and body mass index as predictors of substantial weight gain and obesity: the Canadian physical activity longitudinal study. Can. J. Publ. Health 98(2), 121–4 (2007)
Wang, C.Y., Haskell, W.L., Farrell, S.W., et al.: Cardiorespiratory fitness levels among US adults 20–49 years of age: findings from the 1999–2004 national health and nutrition examination survey. Am. J. Epidemiol. 171(4), 426–35 (2010)
Weltman, A., Weltman, J.Y., Winfield, D.D.W., et al.: Relationship between age, percentage body fat, fitness, and 24-hour growth hormone release in healthy young adults: effects of gender. J. Clin. Endocrinol. Metab. 93(12), 4711–4720 (2008)
Centers for disease control and prevention, https://www.cdc.gov/cholesterol/facts.htm
Anderson, K.M., Castelli, W.P., et al.: Cholesterol and mortality: 30 years of follow-up from the Framingham study. JAMA 257(16), 2176–2180 (1987). https://doi.org/10.1001/jama.1987.03390160062027

Author information

Authors and Affiliations

Department of Bioinformatics and Biostatistics, University of Louisville, 485 E. Gray St, Louisville, 40202, KY, USA
Mst Sharmin Akter Sumy & Md Yasin Ali Parh
Department of Mathematical Sciences, Ball State University, Robert Bell Building (RB), Muncie, 47306, IN, USA
Munni Begum
Clinical Exercise Physiology Program, Ball State University, Human Performance Laboratory HP230, Muncie, 47306, IN, USA
Matthew P. Harber, Bradley S. Fleenor & Mitchell Whaley
Department of Educational Psychology, Ball State University, Teachers College (TC), Room 505, Muncie, 47306, IN, USA
W Holmes Finch
Fisher Institute of Health and Well-being, Ball State University, Muncie, 47306, IN, USA
James Peterman & Leonard Kaminsky

Contributions

SAS contributed to the design of the study, analyzed the data, and drafted the manuscript. MB contributed to the design of the study, interpreted the results, and provided a critical review of the manuscript. MYAP contributed to the conceptualization and review of the manuscript. MPH, BF, MW, FW, JP, and LK had full access to all data in our study. MPH and FW took responsibility for interpreting the data and checked the accuracy of the data analysis.

Corresponding author

Correspondence to Munni Begum.

Ethics declarations

Competing interests

The authors declare that they have no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Communicated by Rafiqul I. Chowdhury.

About this article

Cite this article

Sumy, M.S.A., Begum, M., Harber, M.P. et al. Subgroup Identification with Classification and Regression Tree-Based Algorithms: an Application to the Ball State Adult Fitness Longitudinal Study. Bull. Malays. Math. Sci. Soc. 45 (Suppl 1), 445–459 (2022). https://doi.org/10.1007/s40840-022-01328-7

Received: 28 December 2021
Revised: 06 May 2022
Accepted: 18 May 2022
Published: 10 June 2022
Issue Date: September 2022
DOI: https://doi.org/10.1007/s40840-022-01328-7
