Introduction

Regression models have been extensively applied in epidemiological research to examine relationships between covariates (e.g., risk factors, demographics) and an outcome of interest [1, 2]. However, regression models lack the flexibility to uncover complex covariate-outcome relationships unless the analyst pre-specifies the nature of these relationships. For example, consider a randomized controlled trial, the Box Lunch Study (BLS) [3], where one analysis goal was to explore associations between daily food intake measured in kilocalories and relevant covariates. The set of covariates includes responses to the Three-Factor Eating Questionnaire (TFEQ) [4] quantifying hunger, disinhibition, and restrained eating, as well as novel laboratory-based psycho-social measures such as the relative reinforcing value of food (rrvf) [5] and degrees of liking and wanting of food [6, 7]. Denoting the outcome by Y and the covariates by X (for this illustration consisting of six covariates X1, …, X6), one option would be to fit the multiple linear regression model

$$ Y = \beta_0 + \beta_1 X_1 + \dots + \beta_6 X_6 + \epsilon . $$

This model assumes that daily energy intake is a linear function of the covariates, an assumption that is unlikely to hold. One common alternative is to categorize continuous covariates, e.g., by splitting them at the median or into quartiles. However, in many cases, it may not be obvious which values to choose for splitting. Furthermore, this approach makes investigating covariate interactions more challenging due to the potentially large number of dummy variables in the model. Investigating different splitting choices also requires fitting a potentially large number of linear regression models; even choosing a single split point for a single continuous covariate X1 entails fitting separate models with indicator variables 1[X1 ≥ τ] for a range of threshold values τ. Examining many candidate models in this fairly ad hoc manner potentially inflates the type I error rate and increases the risk of overfitting the data, which reduces the generalizability of the results [8,9,10].
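To make the burden of this search concrete, the following minimal R sketch (with simulated data and a hypothetical covariate x1, not the Box Lunch Study data) fits one linear model per candidate threshold τ:

```r
# Simulated illustration: one linear model per candidate threshold tau.
set.seed(1)
d <- data.frame(x1 = rnorm(200))
d$y <- 0.5 * d$x1 + rnorm(200)
taus <- quantile(d$x1, probs = seq(0.1, 0.9, by = 0.05))
fits <- lapply(taus, function(tau) lm(y ~ I(x1 >= tau), data = d))
rss  <- sapply(fits, function(f) sum(resid(f)^2))
taus[which.min(rss)]  # "best" threshold among 17 candidate models
```

Even this toy search fits 17 models for a single covariate; selecting the best-fitting threshold post hoc is precisely the ad hoc multiple-comparison problem described above.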

At the other end of the spectrum, machine learning techniques such as neural networks, support vector machines, and graphical models [11, 12] offer very flexible modeling of covariate-outcome relationships. However, these techniques are usually “black boxes” because they combine covariate information in complex ways. For example, neural networks classify outcomes based on weighted combinations of transformed covariates. The resulting model cannot easily be interpreted in terms of the original covariate values, making it difficult to gain insight into the nature of covariate-outcome associations or to make individual predictions without access to software that can calculate model outputs.

Decision trees are an appealing intermediate between these two extremes: they offer more flexibility than standard regression models [13••], but their output is more easily interpretable than “black-box” machine learning methods. As a result, decision trees are potentially useful for analyzing complex, high-dimensional data from epidemiological studies. In this paper, we introduce the key concepts of decision trees and describe how they have been applied in the recent epidemiological literature, distinguishing between three different ways that decision trees are used: for explanatory modeling and variable selection, outcome prediction, and subgroup identification. We also briefly discuss some variants of and extensions to decision tree models which are also seeing increased use in the epidemiological community.

A Brief Overview of Decision Trees

A decision tree is a statistical model which aims to partition the given data, based on covariate values, into groups that are (relatively) homogeneous with respect to the outcome. Decision trees have several components, which we illustrate by referring to Fig. 1 (note that Fig. 1 is a conditional inference tree, which we describe in greater detail below). Subsets of observations are contained in the nodes of a decision tree; all observations (n = 226 in the Box Lunch Study) are initially contained in the root node of the tree (at the top of Fig. 1). Splitting is the central step in constructing a decision tree: the sample (or, for nodes below the root node, a subsample) is divided into two disjoint subsets according to covariate values. Branches represent the splits below a node; a decision tree is built by successively splitting down each branch until a stopping rule is triggered. Stopping rules may depend on a number of factors, e.g., a minimum number of observations in a node, or a threshold for the decrease in estimated prediction error. A leaf or terminal node is a node at which the stopping rule is satisfied. Collectively, the terminal nodes define a disjoint partition of the original sample; each observation belongs to exactly one terminal node, determined by its covariate values. Predictions for a new observation are obtained by computing a summary statistic from the individuals falling in the same leaf as that observation; for continuous outcomes, for instance, the predicted value is the mean outcome of the observations within that leaf. Decision trees are usually depicted upside down relative to actual trees, with the root node at the top and the branches spreading downward to the leaves. The tree in Fig. 1 has four terminal nodes (or “leaves”), and therefore partitions individuals into four subgroups with distinct means (and indeed, distributions) of the outcome on the basis of six variables: hunger, wanting, liking, disinhibition, restrained eating, and relative reinforcing value of food.
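As a concrete (and entirely simulated) illustration of these components, the following R sketch fits a conditional inference tree with the partykit package and confirms that predictions for a continuous outcome equal the mean outcome within each leaf; the variable names are placeholders, not Box Lunch Study measures:

```r
# Simulated data: the outcome depends on x1 only, through a threshold.
library(partykit)
set.seed(42)
d <- data.frame(x1 = runif(300), x2 = runif(300))
d$y <- 2 * (d$x1 > 0.5) + rnorm(300, sd = 0.5)
fit <- ctree(y ~ x1 + x2, data = d)
node <- predict(fit, type = "node")  # terminal node ID for each observation
tapply(d$y, node, mean)              # per-leaf outcome means...
unique(round(predict(fit), 6))       # ...match the tree's predicted values
plot(fit)                            # diagram in the same style as Fig. 1
```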

Fig. 1

Conditional inference tree displaying the association between food intake (in kcal/day) and six covariates (hunger, disinhibition, relative reinforcing value of food (rrvf), restrained eating, liking, and wanting). The tree represents a series of sequential splits on hunger, liking, and rrvf that distinguish between four subgroups with different distributions of daily caloric intake. Root and inner nodes are labeled with the splitting variable and (multiplicity-adjusted) p value for the association between that variable and the outcome. Branches below root and inner nodes are labeled with the optimal splitting rule determined by the CTree algorithm. Terminal nodes display the number of individuals belonging to each subgroup and boxplots showing the distribution of daily caloric intake within each subgroup. This figure is based on a similar one that previously appeared in reference [14] (http://creativecommons.org/licenses/by/4.0/)

The splits in a decision tree define a set of prediction rules for predicting the outcome on the basis of covariates, with the goal of minimizing a loss function that quantifies the discrepancy between the predicted and true values. Commonly used loss functions include the misclassification rate, Gini index, and entropy for classification trees, and the mean squared error for regression trees. A training set is used to learn the set of decision rules, and a test set is used to assess the performance of the grown decision tree. Like many strongly data-driven methods, decision trees are prone to overfitting, i.e., producing an overly optimistic estimate of prediction accuracy by modeling idiosyncrasies of the training set used to build the tree instead of characteristics of the underlying data-generating process. To prevent overfitting, it is therefore common practice to construct trees using a number of different stopping rules to generate trees of varying depths (a deeper tree has more splits, more nodes, and fewer observations within each leaf; hence, it may yield higher apparent prediction accuracy but is also more likely to overfit). The final tree depth is selected using a process called pruning, which seeks to minimize the prediction error estimated by cross-validation or, preferably, on an independent test set.
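The pruning workflow can be sketched in R with rpart (simulated placeholder data): grow a deliberately deep tree, then cut it back to the depth that minimizes the cross-validated error stored in rpart's complexity-parameter table:

```r
library(rpart)
set.seed(7)
d <- data.frame(x1 = runif(300), x2 = runif(300), x3 = runif(300))
d$y <- 2 * (d$x1 > 0.6) + rnorm(300, sd = 0.5)
# Grow a deep tree by setting a very small complexity parameter (cp).
fit <- rpart(y ~ ., data = d, method = "anova",
             control = rpart.control(cp = 0.001, minsplit = 10))
# Choose the cp minimizing the cross-validated error (xerror)...
best_cp <- fit$cptable[which.min(fit$cptable[, "xerror"]), "CP"]
pruned <- prune(fit, cp = best_cp)  # ...and prune back to that depth
```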

There are several methods for constructing decision trees, with the major differences between them being the algorithms used to partition the sample and the criteria that determine when to stop splitting. The most widely used method is the Classification and Regression Tree (CART) technique [15••]. In CART, the search for each split takes place simultaneously across all covariates and their candidate split points. For each covariate, CART identifies the split point yielding the greatest reduction in error; the split chosen for inclusion in the tree is the most error-reducing split across all covariates. This recursive splitting process continues until the best split results in a relative reduction in error less than a pre-specified threshold. CART and related techniques (e.g., C4.5 [16]) have seen wide application, including in studies of obesity [17, 18], smoking [19•, 20], and diabetes [21, 22].
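The greedy search at the heart of CART can be written down compactly; the function below is a teaching simplification for a continuous outcome (an exhaustive search minimizing the residual sum of squares), not the actual rpart implementation:

```r
# For each covariate and each candidate split point, compute the total
# within-child residual sum of squares; return the most error-reducing split.
best_split <- function(X, y) {
  best <- list(rss = sum((y - mean(y))^2), var = NA, cut = NA)
  for (v in names(X)) {
    for (tau in sort(unique(X[[v]]))[-1]) {  # drop min so both children are non-empty
      left  <- y[X[[v]] <  tau]
      right <- y[X[[v]] >= tau]
      rss <- sum((left - mean(left))^2) + sum((right - mean(right))^2)
      if (rss < best$rss) best <- list(rss = rss, var = v, cut = tau)
    }
  }
  best  # e.g., best_split(d[c("x1", "x2", "x3")], d$y) on the data above
}
```

Recursively applying this search within each resulting child node, until the stopping criterion is met, yields the full tree.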

A more recently proposed alternative to CART is the conditional inference tree (CTree [23]). CTree follows a two-stage splitting process: in the first stage, the covariate to split on is chosen based on a measure of association between each covariate and the outcome of interest; the best split point for that covariate is then calculated. This two-stage process allows CTree to use a more formal statistical inference framework: the hypothesis that none of the covariates is associated with the outcome is assessed via a set of univariate association tests, each summarized by a p value. A node is declared terminal (node IDs 3, 5, 6, and 7 in Fig. 1) when the minimum p value is larger than a multiplicity-adjusted significance threshold. Hence, in conditional inference trees such as the one in Fig. 1, each splitting node is associated with a (multiplicity-adjusted) p value, and the type I error rate is controlled both overall and within each node. In a previous paper [14], we compared the advantages and disadvantages of CART and CTree via simulation and found that CART often yielded trees with slightly lower prediction error than CTree but required more parameter tuning, tended to favor the inclusion of continuous over discrete covariates (due to the larger number of possible splits of the former), and did not control the overall type I error rate. We argued that the simplicity and inferential focus of conditional inference trees make them an appealing option for epidemiologists, but to date they have seen limited use in public health and medical research [24, 25•].
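In partykit's implementation, the significance threshold governing when CTree stops splitting is exposed directly; a brief sketch (simulated placeholder data):

```r
library(partykit)
set.seed(7)
d <- data.frame(x1 = runif(300), x2 = runif(300))
d$y <- 2 * (d$x1 > 0.6) + rnorm(300, sd = 0.5)
# mincriterion = 0.95 corresponds to alpha = 0.05 after the
# Bonferroni multiplicity adjustment across covariates.
fit <- ctree(y ~ ., data = d,
             control = ctree_control(testtype = "Bonferroni",
                                     mincriterion = 0.95))
```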

Decision trees are widely implemented in the open-source R statistical software environment [26] via packages such as rpart [27•] and rpart.plot [28] for CART, partykit [29••] for CTree, and RWeka [30] for C4.5 [16]. Decision trees have also been implemented in SAS Enterprise Miner [31•], in SAS/STAT software via the HPSPLIT procedure, and in the CART and CHAID modules in Stata.

Explanatory Modeling and Variable Selection with Decision Trees

One of the advantages of decision trees relative to “black box” machine learning techniques is that they provide interpretable prediction rules in terms of covariates. Hence, they can be used to identify covariates that are most relevant for predicting the outcome. In fact, trees can play two roles in explanatory modeling. First, they can act as a “variable selector” by identifying which available covariates contribute to predicting the outcome. Often, trees are constrained to have a modest number of splits, and hence if the number of available covariates is large, some fraction of those covariates will never appear in the tree and it can be concluded that they do not meaningfully contribute to explaining variation in the outcome. Several papers we found [32, 33] applied a univariate pre-screening step to identify relevant predictors to include in the tree, but in most situations, we would argue that this is unnecessary since trees already perform the variable selection function described above.
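In R, the “variable selector” role is easy to operationalize for an rpart tree: covariates that never appear as splitting variables in the tree's frame were effectively screened out. A minimal sketch with simulated data, in which only x1 truly affects the outcome:

```r
library(rpart)
set.seed(7)
d <- data.frame(x1 = runif(300), x2 = runif(300), x3 = runif(300))
d$y <- 2 * (d$x1 > 0.6) + rnorm(300, sd = 0.5)
fit <- rpart(y ~ ., data = d)
# Splitting variables used by the tree; "<leaf>" marks terminal nodes.
setdiff(unique(as.character(fit$frame$var)), "<leaf>")  # typically just "x1"
```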

Second, decision trees also play an important role in explaining how covariates influence the outcome. In standard generalized linear models, covariates are assumed to be linearly related to some function of the mean. However, relationships may be non-linear, so that the effect of a covariate is particularly pronounced over a subset of its range. By constructing data-driven prediction rules based on covariate thresholds, decision trees are better able to detect and characterize such non-linearities. This ability is particularly useful for ordered scales, which are common in clinical contexts and are challenging to handle as either continuous or categorical variables in a regression framework. Though not explicitly designed for effect estimation, a fitted decision tree can be used to estimate the effects of (dichotomized) covariates. For instance, in Fig. 1, the effect of having hunger ≤ 10 vs. > 10 could be estimated by calculating the predicted caloric intake twice for every individual, once setting hunger ≤ 10 and once setting hunger > 10 while leaving the other covariate values fixed; the difference between the two mean predicted caloric intake values is the estimated effect of hunger (see the sketch below). Recent papers that have used regression trees for explanatory modeling include Esteban et al. [34], who used CART to identify covariates associated with short-term mortality after an exacerbation of chronic obstructive pulmonary disease (eCOPD). They found that the highest mortality rate was among patients with the highest baseline dyspnea level (among 5 levels) and a Glasgow score < 15 (score range 3–15). Kanellos-Becker et al. [35] explored factors associated with the prognosis of midgut volvulus in young children and concluded from a CART analysis that the most important predictors were blood gas analysis base excess (BGA BE) < −1.7 and birth prior to 36 weeks. These factors were then used to derive a prognostic score with a positive predictive value (PPV) of 84.2% and a negative predictive value (NPV) of 100%.
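A hedged sketch of the effect calculation just described, using simulated placeholder data rather than the actual Box Lunch Study measures; because a tree only “sees” which side of a cutpoint a value falls on, any values well below and well above the learned threshold suffice:

```r
library(partykit)
set.seed(7)
bls <- data.frame(hunger = runif(300, 0, 15), liking = rnorm(300))
bls$kcal <- 1700 + 900 * (bls$hunger > 10) + rnorm(300, sd = 200)
fit <- ctree(kcal ~ hunger + liking, data = bls)
d_lo <- bls; d_lo$hunger <- 5    # well below the learned cutpoint near 10
d_hi <- bls; d_hi$hunger <- 14   # well above it
# Difference in mean predicted intake = estimated effect of hunger.
mean(predict(fit, newdata = d_hi)) - mean(predict(fit, newdata = d_lo))
```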

Prediction with Decision Trees

Since trees are a flexible modeling tool, they have been widely applied by researchers seeking to predict outcomes of interest [36•, 37•, 38, 39]. The task of prediction differs from explanatory modeling and variable selection in that, for prediction, the goal is to minimize prediction error, and understanding how covariates contribute to predicting outcomes is less important. How to measure prediction accuracy depends on the outcome type, the available data, and the clinical context. Common metrics include mean squared error (MSE) for continuous outcomes and area under the ROC curve (AUC) for binary outcomes [40,41,42]. The most accurate assessment of predictive performance comes when the data are split into separate “training” and “test” sets, with the former used to build the model and the latter used only to assess its performance on independent data [12]. When limited sample size precludes splitting into separate sets, cross-validation is recommended for estimating prediction error. There are several cross-validation variants, including leave-one-out cross-validation, the holdout method, and k-fold cross-validation, and software packages in R (e.g., caret [43]) contain built-in cross-validation routines that simplify the evaluation of predictive performance.
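As a sketch of this workflow, caret can tune and evaluate a CART model with 10-fold cross-validation in a few lines (simulated placeholder data):

```r
library(caret)
set.seed(7)
d <- data.frame(x1 = runif(300), x2 = runif(300), x3 = runif(300))
d$y <- 2 * (d$x1 > 0.6) + rnorm(300, sd = 0.5)
cv <- train(y ~ ., data = d, method = "rpart",
            trControl = trainControl(method = "cv", number = 10),
            tuneLength = 10)       # evaluate 10 values of rpart's cp
cv$results[, c("cp", "RMSE")]      # cross-validated error by complexity
```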

One drawback of decision trees in the context of prediction is that they can be highly sensitive to small changes in the data. This is because the initial splits of the tree have a major influence on its final structure, and decisions on how and when to split may hinge on very small differences in the fitting metric of interest. For example, if splitting a sample on sex reduces the within-subgroup mean squared error (MSE) by 4.7% and splitting it according to whether age is <25 or ≥25 reduces the MSE by 4.6%, then the tree will split on sex. However, it is easy to imagine that a small change in the data might cause the age split to reduce the MSE by 4.8%, in which case the tree would split on age instead. This sensitivity is undesirable for prediction, since it means that prediction models based on a tree fitted to one dataset may generalize poorly to new data. To overcome this “sample sensitivity” problem, it is more common to use random forests [44, 45] to derive prediction models. As its name suggests, a random forest is a collection of decision trees, with each tree in the collection fitted to a bootstrap sample of the original data. Final predictions are obtained by averaging the predictions from the trees in the random forest. While random forests often yield more accurate and generalizable prediction models, they lose the interpretability of individual decision trees. Metrics have been developed that measure “variable importance” within random forests [46,47,48], but these provide only high-level summaries of which variables have the biggest impact; they do not provide insight about specific decision rules and thresholds.
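A brief random forest sketch (simulated placeholder data) showing the variable importance output and its limits:

```r
library(randomForest)
set.seed(7)
d <- data.frame(x1 = runif(300), x2 = runif(300), x3 = runif(300))
d$y <- 2 * (d$x1 > 0.6) + rnorm(300, sd = 0.5)
rf <- randomForest(y ~ ., data = d, ntree = 500, importance = TRUE)
# %IncMSE: rise in prediction error when a covariate is randomly permuted.
# Scores rank the covariates but reveal no decision rules or thresholds.
importance(rf)
```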

While decision trees are correctly characterized as being less sensitive to underlying assumptions about the relationship between covariates and outcomes than standard regression models, they can predict quite poorly if the outcome scales continuously with covariate values (e.g., the mean of the outcome is a linear function of a continuous covariate). In that case, any covariate cutpoint will produce groups with different means, cutpoints will be essentially arbitrary, and given sufficient data, the tree will split many times. As we have previously shown [14], when the true relationship between covariates and outcomes is linear, decision trees will have much larger prediction error than the standard regression model. Therefore, when contemplating the use of decision trees or random forests to build a predictive model, we strongly recommend comparing their performance to that of an appropriate generalized linear model to assess whether they offer a meaningful improvement in prediction accuracy.
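The recommended head-to-head comparison takes only a few lines in R; in this simulated example the true relationship is linear, and the single tree loses clearly:

```r
set.seed(99)
n <- 500
train <- data.frame(x = runif(n)); train$y <- 3 * train$x + rnorm(n)
test  <- data.frame(x = runif(n)); test$y  <- 3 * test$x  + rnorm(n)
lm_fit   <- lm(y ~ x, data = train)
tree_fit <- rpart::rpart(y ~ x, data = train)
mse <- function(f) mean((test$y - predict(f, newdata = test))^2)
c(linear = mse(lm_fit), tree = mse(tree_fit))  # tree MSE is notably larger
```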

Subgroup Identification with Decision Trees

Since decision trees are constructed by sequentially splitting the original sample based on covariate values, they classify individuals into distinct population subgroups that are relatively homogeneous with respect to a given outcome. Splits are defined by a set of covariate dichotomizations, so it is straightforward to understand how subgroup membership is determined in a decision tree; this stands in contrast with many classification techniques, in which subgroups are formed via complex combinations and transformations of covariates and therefore lack a simple scientific interpretation. Decision trees have been used to identify prognostic groups [49], stratify patients in clinical trials [24, 50•, 51], and retrospectively explore treatment/exposure effect heterogeneity [52].

The usual visualization of decision trees (see Fig. 1) includes much of the information needed to characterize population subgroups, but it does not necessarily help researchers comprehend this information. This drawback is pronounced for predictor variables that lack an easily interpretable scale, or when their population distribution is unknown. To address this limitation, as part of our own work, we developed an alternative visualization of the composition of subgroups defined by decision trees [14]; R code for this visualization and relevant examples are available at https://github.com/AshwiniKV/visTree. The visualization is presented as a grid of plots, one corresponding to each terminal node (i.e., population subgroup), and summarizes, at a glance, the characteristics of the subgroups identified by the decision tree; Fig. 2 displays one plot for each terminal node of the tree in Fig. 1. In the background of each plot, a histogram shaded in gray summarizes the distribution of the outcome variable (here, 24-h energy intake) for individuals belonging to the corresponding terminal node/subgroup. The top left plot in Fig. 2 displays a right-skewed distribution of 24-h energy intake; the average 24-h energy intake within each histogram bin is labeled above the x-axis. The subgroup mean and size are displayed in the plot title, and a vertical line marks the overall mean of the outcome values in the subgroup. Colored bars overlaid on the background describe the composition of the subgroup; individual bars are placed on the percentile scale to describe the set of predictor values.
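Based on the repository name, we assume the package's main function is visTree() applied to a fitted conditional inference tree; consult the repository for the actual interface:

```r
# Hypothetical usage sketch; the function name and signature are assumed
# from the repository name, not verified here.
# devtools::install_github("AshwiniKV/visTree")
library(visTree)
visTree(fit)  # fit: a partykit conditional inference tree, as in Fig. 1
```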

Fig. 2

Visualization that summarizes the characteristics of subgroups identified by the decision tree displayed in Fig. 1

The subgroup corresponding to the top left plot of Fig. 2 is defined by liking values below -13.38, which corresponds to the 39th population percentile, and hunger values below 10, which corresponds to the 91st percentile. The bottom right plot, by contrast, shows left-skewed values of 24-h energy intake and is defined by hunger values above 10, a cut-off corresponding to the 92nd percentile.

In Fig. 2, the four subgroups are defined by differences in liking, hunger, and relative reinforcing value of food. The first subgroup (n = 86) has a below-average energy intake (1698 kcal) and is characterized by moderate to low liking and all but very high hunger. The second and third subgroups are both characterized by moderate to high liking and all but very high hunger and are differentiated by relative reinforcing value of food: very low rrvf for the second subgroup and all but very low rrvf for the third. The second subgroup (n = 22) has moderate to low energy intake (1800 kcal) and the third subgroup (n = 104) has moderate to high energy intake (2189 kcal). The fourth subgroup (n = 14) has very high energy intake (2959 kcal) and is characterized by very high hunger.

Conclusion

This article only scratches the surface of an extensive literature on decision trees. The basic concept of sequential dichotomization has been extended in many directions: decision tree variants have been developed to handle virtually all outcome types common to epidemiological studies, including continuous [53], ordinal [54], binary [55•, 56], and time-to-event [57,58,59,60] outcomes. Decision trees have been adapted to allow for traditional covariate adjustment and to handle missing covariate data [61, 62•]. They have also been made more flexible by allowing multiway splits (e.g., [63, 64]). As noted above, random forests created by combining predictions from multiple decision trees have yielded very successful and accurate predictors in a wide variety of contexts. Decision trees are also commonly incorporated into other “ensemble” methods which aggregate predictions from a variety of machine learning techniques [65, 66].

Though this article identifies three distinct uses for decision trees, most epidemiologists will choose to apply decision trees because they seek to strike a balance between model complexity and interpretability, and hence will have more than one of these uses in mind. For example, decision trees are an appealing choice when a researcher seeks to build an accurate prediction model that is based on prediction rules that can be implemented in clinical practice. They are also particularly useful for identifying a small number of covariates that can be used to stratify the population into homogeneous subgroups.

With software for fitting decision trees now available in most standard statistical packages, and ongoing work producing visualizations which make the interpretation of decision tree outputs more intuitive, decision trees are starting to be used widely in the scientific literature. We therefore encourage epidemiological researchers to branch out and give decision trees a try.