1 Introduction

Despite their long history, decision trees such as CART (Breiman et al., 1984) and C4.5 (Quinlan, 1993) remain popular machine learning tools. Decision trees can be trained quickly with few parameters to tune, and the final model can be easily interpreted and visualized, which is an appealing advantage in practice. As a result, these methods are widely applied in a variety of disciplines including engineering (Shamshirband et al., 2020), bioinformatics (Shaikhina et al., 2019), agriculture (Tariq et al., 2023), and business analysis (Aydin et al., 2022; Golbayani et al., 2020). In addition to their standalone use, decision trees have seen widespread adoption in ensemble methods, often as the best available “weak learner”. Prime examples are random forests (Breiman, 2001) and gradient boosting methods such as XGBoost (Chen & Guestrin, 2016) and LightGBM (Ke et al., 2017). In this work we assume that the target variable is continuous, so we focus on regression trees rather than on classification.

A limitation of classical regression trees is that their piecewise constant nature makes them ill-suited to capture continuous relationships. They require many splits in order to approximate linear functions, which is undesirable. There are two main approaches to overcome this issue. The first is to use ensembles of decision trees such as random forests or gradient boosted trees. These ensembles smooth the prediction and can therefore model a more continuous relation between predictors and response. A drawback of these ensembles is the loss of interpretability. Combining multiple regression trees no longer allows for a simple visualization of the model, or for an explainable stepwise path from predictors to prediction. The second approach to capture continuous relationships is to use model trees. Model trees have a tree-like structure for the partition of the space, but allow for non-constant fits in the leaf nodes of the tree. Model trees thus retain the intuitive interpretability of classical regression trees while being more flexible. Arguably the most intuitive and common model tree is the linear model tree, which allows for linear models in the leaf nodes.

There are several algorithms for fitting linear model trees. The first linear model tree algorithm to be proposed, which we will abbreviate as FRIED (Friedman, 1979), uses univariate piecewise linear fits to replace the piecewise constant fits in CART. Once a model is fit on a node, its residuals are passed on to its child nodes for further fitting. The final prediction is given by the sum of the linear models along the path. Therefore, the final model can be written as a binary regression tree with linear models in the leaf nodes. The coefficients of these linear models in the leaf nodes are given by the sum of the coefficients of all linear models along the path from the root to the leaf node. Our experiments suggest that this method may suffer from overfitting and extrapolation issues. The FRIED algorithm received much less attention than its more involved successor MARS (Friedman, 1991).

The M5 algorithm (Quinlan, 1992) is by far the most popular linear model tree, and is still commonly used today (Shamshirband et al., 2020; da Silva et al., 2021; Khaledian & Miller, 2020; Pham et al., 2022; Fernández-Delgado et al., 2019). It starts by fitting CART. Once this tree has been built, linear regressions are introduced at the leaf nodes. Pruning and smoothing are then applied to reduce its generalization error. One potential objection to this algorithm is that the tree structure is built while remaining completely oblivious to the fact that linear models will be used in the leaf nodes.

The GUIDE algorithm (Loh, 2002) fits multiple linear models to numerical predictors in each node and then applies a \(\chi ^2\) test comparing positive and negative residuals to decide on which predictor to split. The algorithm can also be applied to detect interactions between predictors. However, no theoretical guarantee is provided to justify the splitting procedure. There are also algorithms that use ideas from clustering in their splitting rule. For example, SECRET (Dobra & Gehrke, 2002) uses the EM algorithm to fit two Gaussian clusters to the data, and locally transforms the regression problem into a classification problem based on the closeness to these clusters. Experimental results do not favor this method over GUIDE, and its computational cost is high.

The SMOTI method (Malerba et al., 2004) uses two types of nodes: regression nodes and splitting nodes. In each leaf, the final model is the multiple regression fit to its ‘active’ variables. This is achieved by ‘regressing out’ the variable of a regression node from both the response and the other variables. The resulting fit is quite attractive, but the algorithm has a complexity of \(\mathcal {O}(n^2p^3)\), where n denotes the number of cases and p the number of predictors. This can be prohibitive for large data sets, and makes it less suitable for an extension to random forests or gradient boosting.

Most of the above algorithms need a pruning procedure to ensure their generalization ability, which is typically time consuming. An exception is LLRT (Vogel et al., 2007) which uses stepwise regression and evaluates the models in each node via k-fold cross validation. To alleviate the computational cost, the algorithm uses quantiles of the predictors as the potential splitting points. It also maintains the data matrices for fitting linear models on both child nodes so that they can be updated. Unfortunately, the time complexity of LLRT is quite high at \(\mathcal {O}(knp^3 + n p^5)\), where k now denotes the depth of the tree.

It is also interesting to compare ensembles of linear model trees with ensembles of classical decision trees. Recently, certain piecewise linear trees were incorporated into gradient boosting (Shi et al., 2019). For each tree, they used additive fitting (Friedman, 1979) or half additive fitting where the past prediction was multiplied by a factor. To control the complexity of the tree, they set a tuning parameter for the maximal number of predictors that can be used along each path. Empirical results suggest that this procedure outperformed classical decision trees in XGBoost (Chen & Guestrin, 2016) and in LightGBM (Ke et al., 2017) on a variety of data sets, and required fewer training iterations. This points to a potential strength of linear model trees for ensembles.

To conclude, the main issue with existing linear model trees is their high computational cost. Methods that apply multiple regression fits to leaf and/or internal nodes introduce a factor \(p^2\) or \(p^3\) in their time complexity. And the methods that use simple linear fits, such as FRIED, still require pruning which is also costly on large data sets. In addition, these methods can have large extrapolation errors (Loh et al., 2007). Finally, to the best of our knowledge there is no theoretical support for linear model trees in the literature.

In response to this challenging combination of issues we propose a novel linear model tree algorithm. Its acronym PILOT stands for PIecewise Linear Organic Tree, where ‘organic’ refers to the fact that no pruning is carried out. The main features of PILOT are:

  • Speed: It has the same low time complexity as CART without its pruning.

  • Regularized: In each node, a model selection procedure is applied to the potential linear models. This does not increase the computational complexity.

  • Explainable: Thanks to the simplicity of linear models in the leaf nodes, the final tree remains highly interpretable. Also a measure of feature importance can be computed.

  • Stable extrapolation: Two truncation procedures are applied to avoid extreme fits on the training data and large extrapolation errors on the test data, which are common issues in linear model trees.

  • Theoretically supported: PILOT has proven consistency in additive models. When the data is generated by a linear model, PILOT attains a polynomial convergence rate, in contrast with CART.

The paper is organized as follows. Section 2 describes the PILOT algorithm and discusses its properties. Section 3 presents the two main theoretical results. First, the consistency for general additive models is discussed and proven. Second, when the true underlying function is indeed linear an improved rate of convergence is demonstrated. We refer to the Appendix for proofs of the theorems, propositions and lemmas. In addition to these, we derive the time and space complexity of PILOT. Empirical evaluations are provided in Sect. 4, where PILOT is compared with several alternatives on a variety of benchmark data sets. It outperformed other tree-based methods on data sets where linear models are known to fit well, and outperformed other linear model trees on data sets where CART typically performs well. Section 5 concludes.

2 Methodology

In this section we describe the workings of the PILOT learning algorithm. We begin by explaining how its tree is built and motivate the choices made, and then derive its computational cost.

We will denote the \(n \times p\) design matrix as \(\varvec{X}=({X}_1, \dots ,{X}_n)^\top\) and the \(n \times 1\) response vector as \(Y=(Y_1,\dots ,Y_n)^\top\). We consider the standard regression model \(Y = f(X) + \epsilon\), where \(f: \mathbb R^p\rightarrow \mathbb R\) is the unknown regression function and \(\epsilon\) has mean zero.

As a side remark, the linear models and trees we consider here are equivariant to adding a constant to the response: the fitted parameters stay the same except for the intercept terms, which shift accordingly. Therefore, in the presentation we will assume without loss of generality that the responses are centered around zero, so that \(Y_{min} = -Y_{max}\).

2.1 Main structure of PILOT

A typical regression tree has four ingredients: a construction rule, the evaluation of the models/splits, a stopping rule, and the prediction. Most regression tree algorithms are built greedily from top to bottom. That is, they split the original space along a predictor and repeat the procedure on the subsets. This approach has some major advantages. The first is its speed. The second is that it starts by investigating the data from a global point of view, which is helpful when detecting linear relationships.

Algorithm 1 presents a high-level description of PILOT. It starts by sorting each predictor and storing its ranks. At each node, PILOT then selects a predictor and a univariate piecewise linear model via a selection procedure that we will describe in Sects. 2.2 and 2.3. Then it fits the model and passes the residuals on to the child nodes for further fitting. This recursion is applied until the algorithm stops. There are three stopping triggers. In particular, the recursion in a node stops when:

  • the depth reaches the preset maximal depth of the tree \(K_{max}\);

  • the number of cases in the node is below a threshold value \(n_{\text {fit}}\);

  • none of the candidate models do substantially better than a constant prediction, as we will describe in Sect. 2.3.

When all nodes have reached a stopping trigger, the algorithm is done. The final prediction is given by aggregating the predictions of the linear models from the root to the leaves, as in the example in Fig. 1.

Algorithm 1 Sketch of the PILOT algorithm

Fig. 1 An example of a PILOT tree

Note that PILOT retains good interpretability. The tree structure visualizes the decision process, while the simple linear model in each step reveals which predictor was used and how the prediction changes with respect to a small change in this predictor. For example, in Fig. 1, the node \(X3>0.4\) can be interpreted as follows: given \(X1<2.3\) and the previous prediction \(0.3*X1+0.2\), the best additional predictor is X3, on which the space is split. If \(X3>0.4\), then \(0.9*X3+0.1\) is added to the prediction, and \(0.5*X3-0.4\) otherwise, meaning that X3 was correlated with the residuals after the first prediction. Since the final prediction in each leaf node is linear, it is easy to interpret. Moreover, we will define a measure of feature importance similar to that of CART, based on the variance reduction in each node. This is possible because PILOT selects only one predictor in each node, so the gain of a node can be attributed entirely to that predictor. This differs from methods such as M5, GUIDE, and LLRT that use multiple predictors in each node, making it harder to fairly distribute the gain over several predictors in order to derive an overall measure of each predictor’s importance. We refer to Sect. 4.5 for a more detailed discussion.
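To make this aggregation concrete, the following minimal Python sketch reproduces the root-to-leaf path of the Fig. 1 example described above. The node rules and coefficients are taken from that example, but the code structure is a simplification of ours rather than the actual PILOT implementation.

```python
# Minimal sketch (not the actual PILOT code) of how a prediction is aggregated
# along a root-to-leaf path, using the rules and coefficients of the Fig. 1 example.
def predict_case(x):
    """x is a dict of predictor values, e.g. {'X1': 1.0, 'X3': 0.7}."""
    pred = 0.0
    if x['X1'] < 2.3:                      # split on X1 at the top of the tree
        pred += 0.3 * x['X1'] + 0.2        # linear model fitted in this branch
        if x['X3'] > 0.4:                  # next node splits on X3
            pred += 0.9 * x['X3'] + 0.1    # model for the branch X3 > 0.4
        else:
            pred += 0.5 * x['X3'] - 0.4    # model for the branch X3 <= 0.4
    else:
        pass                               # right branch not shown in the example
    return pred

print(predict_case({'X1': 1.0, 'X3': 0.7}))   # 0.5 + 0.73 = 1.23
```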

2.2 Models used in the nodes

As stated in the high-level summary in Algorithm 1, at each node PILOT selects a fit from a number of linear and piecewise linear models. In particular, PILOT considers the following regression models on each predictor, shown in Fig. 2:

  • pcon: A Piecewise CONstant fit, as in CART.

  • lin: Simple LINear regression.

  • blin: A Broken LINear fit: a continuous function consisting of two linear pieces.

  • plin: A two-Piece LINear fit that need not be continuous.

  • con: A CONstant fit. We stop the recursion in a node after a con model is fitted.

Fig. 2 The five regression models used in PILOT

These models extend the piecewise constant pcon fits of CART to linear fits. In particular, plin is a direct extension of pcon, while blin can be regarded as a regularized and smoothed version of plin. For variables that are categorical or have only a few unique values, only pcon and con are used.

To guard against unstable fitting, the lin and blin models are only considered when the number of unique values in the predictor is at least 5. Similarly, the plin model is only considered if both potential child nodes have at least 5 unique values of the predictor.

PILOT reduces to CART (without its pruning) if only pcon models are selected, since the least squares constant fit in a child node equals its average response.

Note that a node will not be split when lin is selected. Therefore, when a lin fit is carried out in a node, PILOT does not increase its reported depth. This affects the depth of the final tree, whose maximal value is a tuning parameter of the method.

It is possible that a consecutive series of lin fits is made in the same node, and this did happen in our empirical studies. In a node where this occurs, PILOT is in fact executing \(L^2\) boosting (Bühlmann, 2006), which fits a multiple regression by means of repeated simple linear regressions. It has been shown that \(L^2\) boosting is consistent for high dimensional linear regression and produces results comparable to the Lasso. It was also shown that its convergence rate can be relatively fast under certain assumptions (Freund et al., 2017). PILOT does not increment the depth value of the node for lin models, to avoid interrupting this boosting procedure.
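The connection with \(L^2\) boosting can be illustrated with a small numerical sketch: repeatedly regressing the current residuals on the single best predictor, which is the essence of a run of consecutive lin fits, approaches the coefficients of the full multiple regression. The code below is our own illustration on simulated data, not part of PILOT.

```python
import numpy as np

# Illustration of L2 boosting with componentwise simple linear regressions:
# repeated LIN-type fits on the residuals approach the multiple least squares fit.
rng = np.random.default_rng(0)
n, p = 500, 5
X = rng.normal(size=(n, p))                 # predictors, approximately centered
beta = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
y = X @ beta + rng.normal(scale=0.1, size=n)

coef = np.zeros(p)
resid = y - y.mean()
for _ in range(200):                        # consecutive simple linear fits
    # RSS reduction of a simple fit on predictor j is (x_j' r)^2 / (x_j' x_j)
    gains = [(X[:, j] @ resid) ** 2 / (X[:, j] @ X[:, j]) for j in range(p)]
    j = int(np.argmax(gains))
    slope = (X[:, j] @ resid) / (X[:, j] @ X[:, j])
    coef[j] += slope
    resid -= slope * X[:, j]

print(np.round(coef, 2))                    # close to the true beta
```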

2.3 Model selection rule

In each node, PILOT employs a selection procedure to choose between the five model types con, lin, pcon, blin and plin. It would be inappropriate to select the model based on the largest reduction in the residual sum of squares (\(\textrm{RSS}\)), since this would always choose plin, as that model generalizes all the others. Therefore, we need some kind of regularization to select a simpler model when the extra \(\textrm{RSS}\) gain of going all the way to plin is not substantial enough.

After many experiments, the following regularization scheme was adopted. In each node, PILOT chooses the combination of a predictor and a regression model that leads to the lowest BIC value

$$\begin{aligned} n\log \Big (\frac{\textrm{RSS}}{n}\Big )+v\log (n) \end{aligned}$$
(1)

which is a function of the residual sum of squares \(\textrm{RSS}\), the number of cases n in the node, and the degrees of freedom v of the model. With this selection rule PILOT mitigates overfitting locally, and applying it throughout the tree leads to a global regularization effect.

It remains to be addressed what the value of v in (1) should be for each of the five models. The degrees of freedom are obtained by adding the number of model coefficients (excluding any split points) to the degrees of freedom assigned to a split. The model types con, lin, pcon, blin and plin contain 1, 2, 2, 3, and 4 coefficients apart from any split points. There is empirical and theoretical evidence in support of using 3 degrees of freedom for a discontinuity point (Hall et al., 2017). We follow this recommendation for pcon and plin, which each receive 3 additional degrees of freedom. Also blin contains a split point, but here the fitted function is continuous. To reflect this intermediate complexity we add 2 degrees of freedom to blin. We empirically tested 1 to 3 additional degrees of freedom for blin, and the performance was generally not very sensitive to this choice. Of course, these values could be considered hyperparameters and could thus be tuned by cross-validation. This would come at a substantial increase in computation time though, so we provide default parameters instead. In summary, we end up with v = 1, 2, 5, 5, and 7 for the model types con, lin, pcon, blin and plin.
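As an illustration, the following Python sketch evaluates criterion (1) with these default degrees of freedom and picks the best model type. The RSS values and helper names are hypothetical, chosen only to show that the model with the smallest RSS is not necessarily selected.

```python
import numpy as np

# Sketch of the selection rule (1) with the default degrees of freedom
# v = 1, 2, 5, 5, 7 for con, lin, pcon, blin, plin.
DOF = {'con': 1, 'lin': 2, 'pcon': 5, 'blin': 5, 'plin': 7}

def bic(rss, n, model):
    return n * np.log(rss / n) + DOF[model] * np.log(n)

# Hypothetical RSS values obtained on a node with n = 200 cases.
n = 200
rss = {'con': 150.0, 'lin': 120.0, 'pcon': 110.0, 'blin': 112.0, 'plin': 108.0}
scores = {m: bic(r, n, m) for m, r in rss.items()}
best = min(scores, key=scores.get)
# Here plin has the lowest RSS, but its larger penalty makes pcon the winner.
print(best, {m: round(s, 1) for m, s in scores.items()})
```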

The BIC in (1) is one of several model selection criteria that we could have used. We also tried other measures, such as the Akaike information criterion (AIC) and the adjusted AIC. It turned out that the AIC tended to choose plin too often, which reduced the regularization effect. The adjusted AIC required a pruning procedure to perform comparably to the BIC criterion. As pruning comes at a substantial additional computational cost, we decided in favor of the BIC, which performed well.

An alternative would be to compute the degrees of freedom of the aggregated model in each node, and then compute the adjusted AIC to decide when to stop (Bühlmann, 2006). But in this approach hat matrices have to be maintained for each path, which requires more computation time and memory. Moreover, as the number of cases in the nodes changes, the evaluation of these hat matrices becomes more complicated. For these reasons, we preferred to stay with the selection rule (1) in each node.

Algorithm 2 shows the pseudocode of our model selection procedure in a given node. We iterate through each predictor separately. If the predictor is numerical, we find the univariate model with lowest BIC among the 5 candidate models. For con and lin this evaluation is immediate. For the other three models, we need to consider all possible split points. We consider these in increasing order (recall that the ordering of each predictor was already obtained at the start of the algorithm), which allows the Gram and moment matrices for the linear models to be maintained and updated efficiently in each step. On the other hand, if the predictor is categorical, we follow the approach of CART as in Section 9.2.4 of Hastie et al. (2009). That is, we first compute the mean \(m_c\) of the response of each level c and then sort the cases according to \(m_c\). We then fit the model pcon in the usual way.

Algorithm 2 Model selection and split finding in a node of the training set

Once the best model type has been determined for each predictor separately, Algorithm 2 returns the combination of predictor and model with the lowest BIC criterion.
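To illustrate the incremental updates, here is a simplified Python sketch of the plin scan over one presorted numerical predictor. It is our own illustration of the idea (a rank-one update of the left/right Gram and moment matrices per candidate split), not the NUMBA-accelerated routine used in PILOT, and it omits the minimum-unique-value checks of Sect. 2.2.

```python
import numpy as np

# Sketch of split finding for the PLIN model on one presorted predictor.
# The left/right Gram and moment matrices are updated in O(1) per candidate
# split, so the whole scan over the n-1 candidate splits is O(n).
def best_plin_split(x_sorted, y_sorted):
    n = len(x_sorted)
    D = np.column_stack([np.ones(n), x_sorted])        # design matrix [1, x]
    G_left, M_left = np.zeros((2, 2)), np.zeros(2)
    G_right, M_right = D.T @ D, D.T @ y_sorted
    total_ss = float(y_sorted @ y_sorted)
    best_rss, best_split = np.inf, None
    for i in range(n - 1):
        d = D[i]
        G_left += np.outer(d, d);  M_left += d * y_sorted[i]    # rank-one update
        G_right -= np.outer(d, d); M_right -= d * y_sorted[i]
        if x_sorted[i] == x_sorted[i + 1]:
            continue                     # only split between distinct values
        try:
            b_left = np.linalg.solve(G_left, M_left)
            b_right = np.linalg.solve(G_right, M_right)
        except np.linalg.LinAlgError:
            continue                     # too few points on one side
        # RSS = y'y - b'(X'y), summed over both sides
        rss = total_ss - b_left @ M_left - b_right @ M_right
        if rss < best_rss:
            best_rss, best_split = rss, (x_sorted[i] + x_sorted[i + 1]) / 2
    return best_rss, best_split
```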

2.4 Truncation of predictions

PILOT relies on two truncation procedures to avoid unstable predictions on the training data and the test data.

Fig. 3 Left: An example of the first truncation method in one node of the tree. Right: An example of the second truncation method

The first truncation procedure is motivated as follows. The left panel of Fig. 3 shows the data in a node of PILOT. The selected predictor is on the horizontal axis. The plot illustrates a situation where most of the data points in a node are concentrated in a part of the predictor’s range, causing the linear model to put a lot of weight on these points and little weight on the other(s). This can result in extreme predictions, for instance for the red point on the right. Although such unstable predictions might be corrected by deeper nodes, this could induce unnecessarily complicated models with extra splits. Moreover, if this effect occurs at a leaf node, the prediction cannot be corrected. To avoid this issue we truncate the prediction function. Recall that adding a constant to the response leaves the estimated parameters unchanged except for the intercept terms, which shift accordingly. Denote \(B=\max (Y)=-\min (Y)\), so the original responses lie in the range \([-B,B]\). We then clip the prediction so that it belongs to \([-1.5B,1.5B]\), subtract the truncated prediction from the response Y to obtain the new residuals, and proceed with building the tree at the next depth. The whole process is summarized in Algorithm 3. We found empirically that this works well.

Algorithm 3 Truncation during tree building on the training data
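A minimal sketch of this clipping step, in our own notation rather than the exact bookkeeping of Algorithm 3:

```python
import numpy as np

# First truncation during training: clip the prediction of the model fitted
# in a node to [-1.5*B, 1.5*B], where B = max(Y) = -min(Y) of the centered
# training response, and pass the resulting residuals to the child nodes.
def truncated_residuals(y_node, raw_pred, B):
    pred = np.clip(raw_pred, -1.5 * B, 1.5 * B)
    return y_node - pred, pred
```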

The first truncation method is not only applied when training the model, but also when predicting on new data. The range of the response of the training data is stored, and when a prediction for new data would be outside the proposed range \([-1.5B,1.5B]\) it is truncated in the same way. This procedure works as a basic safeguard for the prediction. However, this is not enough because a new issue can arise: it may happen that values of a predictor in the test data fall outside the range of the same predictor in the training data. This is illustrated in the right panel of Fig. 3. The vertical dashed lines indicate the range of the predictor on the training data. However, the predictor value of the test case (the red point) lies quite far from that range. Predicting its response by the straight line that was fitted during training would be an extrapolation, which could be unrealistic. This is a known problem of linear model trees. For instance, in our experiments in Sect. 4 the out-of-sample predictions of FRIED and M5 became extremely large on some data sets, which turned out to be due to this exact problem.

Therefore we add a second truncation procedure as done in Loh et al. (2007). This second truncation step is only applied to new data, because in the training data all predictor values are by definition inside the desired range. More precisely, during training we record the range \([x_{\hbox {\tiny min}}, x_{\hbox {\tiny max}}]\) of the predictor selected in this node, and store the range of the corresponding predictions \([{\hat{f}}(x_{\hbox {\tiny min}}),{\hat{f}}(x_{\hbox {\tiny max}})]\). When predicting on new data, we truncate the predictions so they stay in this training range, that is, we replace the original prediction \({\hat{f}}(x_{\hbox {\tiny test}})\) on this node by \(\max (\min ({\hat{f}}(x_{\hbox {\tiny test}}),{\hat{f}}(x_{\hbox {\tiny max}})),{\hat{f}}(x_{\hbox {\tiny min}}))\).

The two truncation methods complement each other. For instance, the first approach would not suffice in the right panel of Fig. 3 since the linear prediction at the new case would still lie in \([-1.5B,1.5B]\). Also, the second approach would not work in the left panel of Fig. 3 as the unusual case is included in the range of the training data. Our empirical studies indicate that combining both truncation approaches helps eliminate extreme predictions and therefore improve the stability of our model. Moreover, this did not detract from the overall predictive power. Algorithm 4 describes how both truncation methods are applied when PILOT makes predictions on new data.

Algorithm 4 Truncation during prediction on new data
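The sketch below shows how both truncations could be combined at prediction time; the node record and helper names are hypothetical, and the exact bookkeeping of Algorithm 4 may differ.

```python
import numpy as np

# Hypothetical node record: the selected predictor ('var'), the fitted
# coefficients, and the prediction range [f(x_min), f(x_max)] recorded
# during training.
def node_prediction(node, x_value):
    raw = node['intercept'] + node['slope'] * x_value
    lo, hi = sorted((node['pred_at_xmin'], node['pred_at_xmax']))
    return float(np.clip(raw, lo, hi))          # second truncation

def pilot_prediction(path_nodes, x_test, B):
    # Aggregate the truncated node predictions along the root-to-leaf path,
    # then apply the first truncation to the aggregated prediction.
    total = sum(node_prediction(node, x_test[node['var']]) for node in path_nodes)
    return float(np.clip(total, -1.5 * B, 1.5 * B))   # first truncation
```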

2.5 Stopping rules versus pruning

It is well known that decision trees have a tendency to overfit the data if the tree is allowed to become very deep. A first step toward avoiding this is to require a minimal number of cases in a node before it can be split, and a minimal number of cases in each child node. In addition, PILOT also stops splitting a node if the BIC model selection criterion selects con.

Several decision trees, including CART, take a different approach. They let the tree grow, and afterward prune it. This indeed helps to achieve a lower generalization error. However, pruning can be very time consuming. One often needs a cross-validation procedure to select pruning parameters. It would be possible for PILOT to also incorporate cost-complexity pruning as in CART. It would work exactly the same as in CART, and additionally consider a series of lin fits as a single model to be kept or removed in the pruning. However, in our empirical study we found that adding this pruning to PILOT did not outperform the existing con stopping rule on a variety of datasets. Also, we will see in the next section that the generalization error vanishes asymptotically. Moreover, not pruning is much more efficient computationally. For these reasons, PILOT does not employ pruning. The computational gain of this choice makes it more feasible to use PILOT in ensemble methods.

3 Theoretical results

3.1 Universal consistency

In this section we prove the consistency of PILOT. We follow the settings in Klusowski (2021) and assume the underlying function \(f\in \mathcal {F}\subset L^2([0,1]^p)\) admits an additive form

$$\begin{aligned} f(X):=f_1(X^{(1)})+\dots +f_p(X^{(p)}) \end{aligned}$$
(2)

where \(f_i\) has bounded variation and \(X^{(j)}\) is the j-th predictor. We define the total variation norm \(||f||_{TV}\) of \(f\in \mathcal {F}\) as the infimum of \(\sum _{i=1}^p||f_i||_{TV}\) over all possible representations of f, and assume that the representation in (2) attains this infimum.

For all \(f, g \in L^2([0,1]^p)\) and a dataset \(X_1, \ldots , X_n\) of size n, we define the empirical norm and the empirical inner product as

$$\begin{aligned} ||f||_n^2:=\frac{1}{n}\sum _{i=1}^{n}|f( X_i)|^2 \quad \text {and} \quad \langle f,g \rangle _n:=\frac{1}{n}\sum _{i=1}^{n}f( X_i)g( X_i). \end{aligned}$$
(3)

For the response vector \(Y = (Y_1,\ldots , Y_n)\) we denote

$$\begin{aligned} ||Y-f||_n^2:=\frac{1}{n}\sum _{i=1}^{n}(Y_i - f( X_i))^2 \quad \text {and} \quad \langle Y,f \rangle _n:=\frac{1}{n}\sum _{i=1}^{n}Y_if( X_i). \end{aligned}$$
(4)

The general \(L^2\) norm on vectors and functions will be denoted as \(||\cdot ||\) without subscript.

To indicate the norm and the inner product of a function on a node T with t observations we replace the subscript n by t. We denote by \(\mathcal {T}_k\) the set of tree nodes at depth k, plus the leaf nodes of depth lower than k. In particular, \(\mathcal {T}_K\) contains all the leaf nodes of a K-depth tree.

Now we can state the main theorem:

Theorem 1

Let \(f\in \mathcal {F}\) with \(||f||_{TV}\, \leqslant A\) and denote by \({\hat{f}}(\mathcal {T}_K)\) the prediction of a K-depth PILOT tree. Suppose \(X\sim P\) on \([0,1]^{p_n}\), the responses are a.s. bounded in \([-B,B]\), and the depth \(K_n\) and the number of predictors \(p_n\) satisfy \(K_n\rightarrow \infty\) and \(2^{K_n}p_n\log (np_n)/n\rightarrow 0\). Then PILOT is consistent, that is

$$\begin{aligned} \lim _{n\rightarrow \infty } \mathbb {E}[||f-{\hat{f}}(\mathcal T_{K_n})||^2]=0. \end{aligned}$$
(5)

Note that the conditions on the depth and the dimension can easily be satisfied if we let \(K_n=\log _2(n)/r\) for some \(r>1\) and \(p_n=n^{s}\) such that \(0<s<1-1/r\). The resulting convergence rate is \(\mathcal {O}(1/\log (n))\). This is the same rate as previously obtained for CART under very similar assumptions (Klusowski, 2021).
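Indeed, with these choices one can check the condition directly: since \(2^{K_n}=n^{1/r}\) and \(\log (np_n)=(1+s)\log (n)\),

$$\begin{aligned} \frac{2^{K_n}p_n\log (np_n)}{n} = (1+s)\, n^{1/r+s-1}\log (n) \rightarrow 0 \end{aligned}$$

whenever \(1/r+s-1<0\), that is, \(s<1-1/r\).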

The key to the proof of Theorem 1 is to establish a recursive formula for the training error \(R_k:=||Y-{\hat{f}}(\mathcal {T}_k)||^2_n-||Y-f||^2_n\) at each depth k. Then we can leverage results from empirical process theory (Györfi et al., 2002) to derive an oracle inequality for PILOT, and finally prove consistency.

For the recursive formula we first note that \(R_k=\sum _{T\in \mathcal {T}_k}w(T)R_k(T)\) where \(w(T):=t/n\) is the weight of node T and \(R_k(T):=||Y-{\hat{f}}(\mathcal {T}_k)||^2_t-||Y-f||^2_t\). Then we immediately have \(R_{k+1}=R_k-\sum _{T\in \mathcal {T}_k}w(T)\Delta ^{k+1}(T)\) where

$$\begin{aligned} \Delta ^{k+1}(T):=||Y-\hat{f}(\mathcal {T}_k)||^2_t\;-\;t_l||Y-\hat{f}(\mathcal {T}_{k+1})||^2_{t_l}/t\;-\;t_r||Y-\hat{f}(\mathcal {T}_{k+1})||^2_{t_r}/t \end{aligned}$$

is the impurity gain of the model on T. Here \(t_l\) and \(t_r\) denote the number of cases in the left and right child node. If no split occurs, one of these numbers is set to zero. For PILOT we need to remove con nodes that do not produce an impurity gain from the recursive formula. To do this, we define \(C_k^+=\{T|T=\textsc {con}, T\in \mathcal T_{k-1}, R_k(T)>0\}\), i.e., the set of ‘bad’ nodes on which con is fitted, and similarly \(C_k^-\) for those with \(R_k(T) \leqslant 0\). Then we can define \(R_{C_k^+}:=\sum _{T\in C_k^+}w(T)R_k(T)\), the positive training error corresponding to the cases in these con nodes. We can then consider \(\widetilde{R}_k:=R_k-R_{C_k^+}\) and show that asymptotically both \(R_k\) and \(R_{C_k^+}\) become small.

Now the problem is reduced to relating the impurity gain of the selected model to the training error before this step. Recently, a novel approach was introduced for this kind of estimation in CART (Klusowski, 2021). However, that proof used the fact that in each step the prediction is piecewise constant, which is not the case here. Therefore, we derived a generalized result for pcon in PILOT (Lemma 3 in Appendix 2.1). For controlling the gain of the other models, we can use the following result:

Theorem 2

Let \(\Delta _1\), \(\Delta _2\) and \(v_1\), \(v_2\) be the impurity gains and degrees of freedom of two regression models on some node T with t cases. Let \(R_0\) be the initial residual sum of squares in T. We have that

  • If model 1 does better than con, i.e. \(BIC_{\hbox {\tiny {CON}}}> BIC_1\), we have \(t\Delta _1/R_0> C(v_1,t)>0\) for some positive function C depending on \(v_1\) and t.

  • If \(BIC_{\hbox {\tiny {CON}}} > BIC_1\) and \(BIC_2 > BIC_1\) with \(v_2 \geqslant v_1\) we have \(\frac{\Delta _1}{\Delta _2} \geqslant \frac{v_1-1}{v_2-1}.\)

Moreover, if con is chosen at a node, it will also be chosen in subsequent models.

We thus know that the gain of the selected model is always comparable to that of pcon, up to a constant factor.

This result also justifies the con stopping rule from a model selection point of view: when con is selected for a node, all the subsequent models for this node will still be con. A con model is selected when the gains of the other regression models do not make up for the complexity they introduce. Since con only regresses out the mean, we can stop the algorithm within the node once the first con fit is encountered.

We can show the following recursive formula:

Theorem 3

Under the same assumptions as Theorem 1 we have for any \(K\geqslant 1\) and any \(f\in \mathcal {F}\) that

$$\begin{aligned} ||Y-{\hat{f}}(\mathcal {T}_K)||^2_n\, \leqslant ||Y-f||^2_n \, + \frac{(||f||_{TV}+6B)^2}{6(K+3)} + R_{C_K^+}. \end{aligned}$$
(6)

Moreover, if we let \(K=\log _2(n)/r\) with \(r>1\) we have \(R_{C_K^+}\sim \mathcal {O}(\sqrt{\log (n)/n^{(r-1)/r}})\).

The BIC criterion ensures that the training errors in all con nodes vanish as \(n\rightarrow \infty\). This allows us to prove the main theorem by using the preceding results and Theorems 11.4 and 9.4 and Lemma 13.1 of Györfi et al. (2002). All intermediate lemmas and proofs can be found in the “Appendix 2”.

3.2 Convergence rates on linear models

The convergence rate resulting from Theorem 1 is the same as the previously derived one of CART (Klusowski, 2021). This is because we make no specific assumptions on the properties of the true underlying additive function. Therefore this rate of convergence holds for a wide variety of functions, including poorly behaved ones. In practice however, the convergence could be much faster if the function is somehow well behaved. In particular, given that PILOT incorporates linear models in its nodes, it is likely that it will perform better when the true underlying function is linear, which is a special case of an additive model. We will show that PILOT indeed has an improved rate of convergence in that setting. This result does not hold for CART, and to the best of our knowledge was not proved for any other linear model tree algorithm.

In order to prove the convergence rate of PILOT in linear models, we apply a recent result of Freund et al. (2017). For the convergence rate of the \(L^2\) boosting algorithm, they showed in their Theorem 2.1 that the mean squared error (MSE) decays exponentially with respect to the depth K, for a fixed design matrix. The proof is based on the connection between \(L^2\) boosting and quadratic programming (QP) on the squared loss \(||Y-\varvec{X}\beta ||^2_n\). In fact, the \(L^\infty\) norm of the gradient of the squared loss is equal to the largest impurity gain among all simple linear models, assuming each predictor has been centered and standardized. Therefore, the gain of the k-th step depends linearly on the mean squared error in step \(k-1\), which results in a faster convergence rate. By the nature of quadratic programming, the rate depends on the smallest eigenvalue \(\lambda _{min}\) of \(\varvec{X}^\top \varvec{X}\).

In our framework we can consider a local QP problem in each node to estimate the gain of lin, which gives a lower bound for the improvement of the selected model (except when con is selected). To ensure that we have a constant ratio for the error decay, we have to make some assumptions on the smallest eigenvalue of the local correlation matrix of the predictors. This eigenvalue can be regarded as a measure of multicollinearity between the predictors, which is a key factor that affects the convergence rate of least squares regression.

To be precise, we require that the smallest eigenvalue of the theoretical correlation matrix (the jl-th entry of which is \(\mathbb {E}[(X-{\bar{X}})(X-{\bar{X}})^\top ]_{jl}/(\sigma _j\sigma _l)\) where \(\sigma _j\) is the standard deviation of the j-th predictor) is lower bounded by some constant \(\lambda _0\) in any cubic region that is sufficiently small. Then we apply two concentration inequalities to show that the smallest eigenvalue of a covariance matrix of data in such a cube is larger than \(\lambda _0\) with high probability. Finally, we can show that the expectation of the error decays exponentially, which leads to a fast convergence rate.

Our conditions are:

  • Condition 1: The PILOT algorithm stops splitting a node whenever

    • the number of cases in the node is less than \(n_{\hbox {\tiny min}}=n^\alpha\), \(0<\alpha <1\).

    • the variance of some predictor is less than \(2\sigma _0^2\) where \(0<\sigma _0<1\).

    • the volume of the node is less than a threshold \(\eta\), \(0<\eta <1\).

  • Condition 2: We assume that \(X\sim P\) on \([0,1]^p\) and the error \(\epsilon\) has a finite fourth moment. Moreover, for any cube C with volume smaller than \(\eta\) we assume that \(\lambda _{min}(Cor(X|X\in C)) \geqslant 2\lambda _0>0\), where Cor(X) is the correlation matrix.

The first condition is natural for tree-based methods. For the second one, an immediate example would be a setting when all the predictors are independent. Another example is when they follow a multivariate Gaussian distribution, which becomes a truncated Gaussian when restricted to \([0,1]^p\) or to some other cube C. It can be shown that the Gaussian distribution has relatively small correlation in cubes C of small volume; see, e.g., Muthen (1990) for the two-dimensional case. This is not surprising since on small cubes the density tends to a constant. As a result, the smallest eigenvalue of the correlation matrix on such cubes will be bounded from below.

Under these conditions we obtain the following result, whose proof can be found in “Appendix 3”.

Theorem 4

Assume the data are generated by the linear model \(Y\sim \varvec{X}\beta + \epsilon\) with n cases. Under Conditions 1 and 2, the difference between the training loss \(L^k_n\) of PILOT at depth k and the training loss \(L^*_n\) of least squares linear regression satisfies

$$\begin{aligned} \mathbb {E}[L^k_n-L^*_n] \leqslant \gamma ^k\sqrt{\mathbb {E}[(L^0_n-L^*_n)^2]}+\mathcal {O}(N_{\text {leaves}}\log (n)/n) \end{aligned}$$
(7)

where

$$\begin{aligned} \gamma :=1-\frac{\lambda _0}{4p}. \end{aligned}$$

Combining this result with the techniques in the proof of Theorem 1, we can show polynomial rate convergence of our method on linear functions.

Theorem 5

(Fast convergence on linear models) Assume the conditions of Theorem 4 hold and that |Y| is a.s. bounded by some constant B. Let \(K_n= \log _\gamma (n)\). Then we have for any \(0<\alpha <1\) and corresponding \(n_{\hbox {\tiny min}} = n^{\alpha }\) that

$$\begin{aligned} \mathbb {E}[||{\hat{f}}(\mathcal {T}_K) - \varvec{X}\beta ||^2] \leqslant \mathcal {O}\Big (\frac{\log (n)}{n^{\alpha }}\Big ). \end{aligned}$$
(8)

The choice of the tuning parameter \(\alpha\) thus determines the convergence rate. When we use a low \(\alpha\) we get smaller nodes hence more nodes, yielding looser bounds in the oracle inequality and for the errors in con nodes, leading to a slower convergence. This is natural, as we have fewer cases for the estimation in each node. If we choose \(\alpha\) relatively high we obtain a faster rate, which is intuitive since for \(\alpha \rightarrow 1\) we would have only a single node.

3.3 Time and space complexity

Finally, we demonstrate that the time complexity and space complexity of PILOT are the same as for CART without its pruning. The main point of the proof is that for a single predictor, each of the five models can be evaluated in a single pass through the predictor values. Evaluating the next split point requires only a rank-one update of the Gram and moment matrices, which can be done in constant time. Naturally, the implicit proportionality factors in front of the \(\mathcal {O}(.)\) complexities are higher for PILOT than for CART, but the algorithm is quite fast in practice.

Theorem 6

PILOT has the same time and space complexities as CART without its pruning.

Proof

For both PILOT and CART we assume that the p predictors have been presorted, which only takes \(\mathcal {O}(np\log (n))\) time once.

We first check the complexity of the model selection in Algorithm 2. It is known that the time complexity of CART for split finding is \(\mathcal {O}(np)\). For the pcon model, PILOT uses the same algorithm as CART. For con we only need to compute one average in \(\mathcal {O}(n)\) time. For lin, the evaluation of \((D^{\textsc {lin}})^\top D^{\textsc {lin}}\) and \((D^{\textsc {lin}})^\top Y\) also takes \(\mathcal {O}(n)\). In the evaluation of the plin model, the Gram matrices always satisfy \(G^{\textsc {plin}}_l+G^{\textsc {plin}}_r=G^{\textsc {lin}}=(D^{\textsc {lin}})^\top D^{\textsc {lin}}\) and the moment matrices satisfy \(M^{\textsc {plin}}_l+M^{\textsc {plin}}_r=(D^{\textsc {lin}})^\top Y\). These matrices have at most \(2^2 = 4\) entries, and can be incrementally updated in \(\mathcal {O}(1)\) time as we evaluate the next split point. For blin the reasoning is analogous, with at most \(3^2 = 9\) matrix entries.

In each model, inverting the Gram matrix only takes \(\mathcal {O}(1)\) because its size is fixed. Therefore, we can evaluate all options on one presorted predictor in one pass through the data, which remains \(\mathcal {O}(n)\), so for all predictors this takes \(\mathcal {O}(np)\) time, the same as for CART. For the space complexity, the Gram and moment matrices only require \(\mathcal {O}(1)\) of storage. Since CART also has to store the average response in each child node, which takes \(\mathcal {O}(1)\) storage as well, the space complexity of both methods is the same.

For the tree building in Algorithm 3, the time complexity of computing the indices \(I_{l,r}\) and the residuals is \(\mathcal {O}(n)\) at each depth \(1 \leqslant k \leqslant K_{max}\). CART also computes the indices \(I_{l,r}\) which requires the same complexity \(\mathcal {O}(n)\) for each step. Therefore, the overall time for both methods is \(\mathcal {O}(K_{max}n)\). Since the PILOT tree only has \(\mathcal {O}(1)\) more attributes in each node than CART, the space complexity of PILOT remains the same as that of CART.

For the initialization and prediction parts, the results are straightforward. \(\square\)

We can estimate the proportionality factor of the \(\mathcal {O}(np)\) time of the split step as follows. We have to update 3 additional pairs of Gram and moment matrices: one pair for blin, where the Gram matrix is \(3 \times 3\) and the moment matrix is \(3 \times 1\), and two pairs for plin where Gram matrices are \(2 \times 2\) and the moment matrices are \(2 \times 1\). The number of additional entries is thus \(1\times (3^2+3)+2\times (2^2+2)=24\). Therefore the time needed for PILOT is roughly 30 times that of CART. Note that this is pretty low compared with other linear model tree methods, many of which have higher asymptotic complexities. Moreover, PILOT does not contain the common pruning step of CART, which is computationally quite demanding.

To illustrate the time complexity of PILOT we ran simulations using different numbers of cases (ranging from 10 to 10,000) and predictors (ranging from 2 to 1000). The functions are piecewise linear and randomly generated. Here we timed the split function, which is the main part of the algorithm; this excludes the presorting step with time complexity \(\mathcal {O}(np\log (n))\). The other parts of PILOT are not fundamentally different from the corresponding parts in CART, so the comparison of the split function is of primary interest.

The left panel of Fig. 4 plots the log of the measured runtime versus the log number of cases, in the middle panel versus the log number of predictors and in the right panel versus the log number of predictors times cases. The trends are linear, and the least squares lines have slopes close to 1. This illustrates the \(\mathcal {O}(np)\) time complexity.
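For completeness, the slope of such a log-log fit can be obtained as in the sketch below; this is an illustration of the measurement, not the authors' benchmark script, and split_fn stands for whatever split routine is being timed.

```python
import time
import numpy as np

# Time a split routine for several sample sizes and estimate the scaling
# exponent by regressing log(runtime) on log(n); a slope near 1 indicates
# O(n) behaviour in n (the same can be done for p or for n*p).
def empirical_slope(split_fn, sizes, p=10, seed=0):
    rng = np.random.default_rng(seed)
    runtimes = []
    for n in sizes:
        X = rng.normal(size=(n, p))
        y = rng.normal(size=n)
        t0 = time.perf_counter()
        split_fn(X, y)
        runtimes.append(time.perf_counter() - t0)
    slope, _ = np.polyfit(np.log(sizes), np.log(runtimes), 1)
    return slope
```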

Fig. 4 Logarithm of the computation time of the split step in PILOT, versus the logarithm of (left) the number of cases; (middle) the number of predictors; and (right) the number of cases times the number of predictors

4 Empirical evaluation

In this section we evaluate the proposed PILOT algorithm empirically and compare its results with those of some popular competitors. We start by describing the data and methods under comparison.

4.1 Data sets and methods

We analyzed 25 benchmark data sets, with the number of cases ranging from 71 to 21,263 and the number of predictors ranging from 4 to 4,088. From the UCI repository (Dua & Graff, 2017) we used the data sets Abalone, Airfoil, Auto mpg, Bike, Communities, Concrete, Diabetes, Electricity, Energy, Power plant, Real estate, Residential, Skills, slump test, Superconductor, Temperature, Thermography and Wine. From Kaggle we obtained Bodyfat, Boston Housing, Graduate Admission, and Walmart. The California Housing data came from the StatLib repository (http://lib.stat.cmu.edu/datasets/). The Ozone data is in the R-package hdi (Dezeure et al., 2015), and Riboflavin came from the R-package missMDA (Josse & Husson, 2016). Table 1 gives an alphabetical list of the data sets with their sizes.

Table 1 The data sets used in the empirical study

In the comparative study we ran the following tree-based methods: the proposed PILOT algorithm, which we implemented in Python with NUMBA acceleration of the split function; FRIED (Friedman, 1979), implemented by us; M5 (Quinlan, 1992) from the R package RWeka (Hornik et al., 2009); and CART. We also ran ridge linear regression from the Scikit-Learn package (Pedregosa et al., 2011) and the lasso (Tibshirani, 1996). When running methods that were not developed to deal with categorical variables, we first applied one-hot encoding to such variables, that is, we replaced each categorical variable with m outcomes by m binary variables.

The settings for hyperparameter tuning are the following. For PILOT we considered the depth parameter \(K_{\text {max}}\) in (3, 6, 9, 12, 15, 18), the minimal number of cases to fit a model \(n_{\text {fit}}\) in (2, 10, 20, 40), and the minimal number of cases in a leaf node \(n_{\text {leaf}}\) in (1, 5, 10, 20). For CART and FRIED we tuned \(K_{\text {max}}\), \(n_{\text {fit}}\) and \(n_{\text {leaf}}\) over the same ranges, and additionally tuned the data-driven parameter \(\alpha _{ccp}\) for cost complexity pruning for each combination of \((K_{\text {max}},n_{\text {fit}},n_{\text {leaf}})\). For M5, \(n_{\text {leaf}}\) is the only parameter since the building, smoothing and pruning procedures are parameter-free. For the Lasso and Ridge models, we tuned the \(\lambda\) of the penalty term.

In order to evaluate the methods, we start by randomly splitting each dataset into 5 folds. For each method, and on each training set (each consisting of 4 folds), we perform 5-fold cross validation to select the best combination of hyperparameters. Finally, we compute the cross-validated mean squared error, that is, the MSE of each tuned method averaged over the 5 test folds. The final score of each method is its MSE divided by the lowest MSE on that dataset. A score of 1 thus corresponds with the method that performed best on that dataset, and the scores of the other methods say how much larger their MSE was.
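The protocol can be summarized by the following sketch, written with a scikit-learn stand-in estimator (CART) since PILOT itself is not part of scikit-learn; it is meant to show the nested cross-validation structure, not to reproduce our exact scripts.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.tree import DecisionTreeRegressor

def cv_mse(X, y, param_grid, seed=0):
    """Average test MSE over 5 outer folds, tuning on the 4 training folds."""
    outer = KFold(n_splits=5, shuffle=True, random_state=seed)
    mses = []
    for train_idx, test_idx in outer.split(X):
        search = GridSearchCV(DecisionTreeRegressor(random_state=seed),
                              param_grid, cv=5,
                              scoring='neg_mean_squared_error')
        search.fit(X[train_idx], y[train_idx])      # inner 5-fold tuning
        pred = search.predict(X[test_idx])          # refit on full training part
        mses.append(np.mean((y[test_idx] - pred) ** 2))
    return float(np.mean(mses))

# The score in Table 2 is this MSE divided by the lowest MSE over all methods
# on the same dataset.
```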

4.2 Results

The results on these datasets are summarized in Table 2. The bold entries correspond to scores of at most 1.05, that is, methods whose MSE was at most 5% higher than the MSE of the best performing method on that dataset. The bottom four lines of the table summarize the performance, and we see that PILOT achieved the best average score with the lowest standard deviation. Moreover, its mean rank and standard deviation of its rank were also the lowest.

Table 2 MSE ratios relative to best, for the datasets of Table 1

Table 2 has some unusual entries that require an explanation. In the last column, the M5 results were partially adjusted because its predictions are not invariant to scaling the predictors. Therefore we ran M5 both with and without predictor scaling, and we present the best result here. Those based on scaling are shown in parentheses. Note that these MSEs are at least 10 times lower than the results of M5 on the unscaled dataset, indicating that M5 is sensitive to the preprocessing of the predictors. Still, the performance of M5 remained poor on the California Housing dataset. On the Superconductor data the hyperparameter tuning time for CART exceeded the preset wall time, despite applying optimized functions (from Scikit-Learn) for cross validation and pruning, so we left this table entry blank. We also noticed that FRIED had extremely high test errors (up to 100 times larger than the best MSE) for several choices of hyperparameters on the Residential dataset. A careful examination revealed that this was caused by extrapolation. At one node, the value of the predictor of a test case was outside the range of the training cases, so lin extrapolated and gave a wild prediction.

Overall, PILOT outperformed FRIED 21 times. It also did better than CART on 22 datasets. As for M5, PILOT outperformed it 15 times. In the comparison with M5 we note that PILOT has a much more stable performance, with a performance standard deviation that is 6 times smaller.

A closer look at the results reveals that PILOT struggled on the Airfoil data. A potential explanation is the fact that the Airfoil data has a rather complicated underlying structure, which is often fit using ensemble methods such as gradient boosting and random forests (Patri & Patnaik, 2015). All tree-based methods including PILOT gave relatively poor predictions on the Riboflavin data, which is high-dimensional with only 71 observations but over 4000 predictors, and whose response has a strong linear relationship with the regressors. This explains why Ridge and Lasso performed so well on these data. The tree-based methods suffered from the relatively small number of cases in the training set, which resulted in a higher test error. Still, PILOT did better than CART, FRIED and M5 on these data, suggesting that it is indeed less prone to overfitting and better at capturing linear relationships.

There are several other datasets on which Ridge and/or Lasso did very well, indicating that these data have linear patterns. The results show that PILOT tended to do quite well on those datasets and to outperform CART on them. This confirms that the linear components of PILOT make it well suited for such data, whereas we saw that it usually did at least as well as the other tree-based methods on the other datasets.

To assess the statistical significance of the performance differences involving PILOT, we conducted Wilcoxon signed rank tests between the PILOT performance and each of the other methods, with the results shown in Table 3. (The M5 result is with the scaled datasets when they did better.) All the p-values are below \(5\%\) and often much lower. To formally test for the superiority of the PILOT results in this study, we can apply the Holm-Bonferroni method (Holm, 1979) for multiple testing, at the level \(\alpha =5\%\). To this end we must sort the five p-values in Table 3 from smallest to largest, and check whether \(p_{(1)} <\alpha /5\), \(p_{(2)} < \alpha /4\), \(p_{(3)} < \alpha /3\), \(p_{(4)} < \alpha /2\), and \(p_{(5)} < \alpha /1\). All of these inequalities hold, so we have significance.

Table 3 The p-values of Wilcoxon signed rank tests comparing the performance of PILOT and the other methods on these datasets

4.3 Results after transforming predictors

It has been observed that linear regression methods tend to perform better if the numerical predictors are not too skewed. In order to address potentially skewed predictors, we reran the empirical experiments after preprocessing all numerical predictors by the Yeo-Johnson (YJ) transformation (Yeo & Johnson, 2000). The parameter of the YJ transformation was fit using maximum likelihood on the training set, after which the fitted transformation parameter was applied to the test set. One way in which such a transformation could help is by mitigating the extrapolation issue, as this is expected to occur mainly on the long tailed side of a skewed predictor, which becomes shorter by the transformation.

Table 4 MSE ratios relative to best, after transforming the predictors

Table 4 shows the results after transforming the predictors. The MSE ratios are relative to the lowest MSE of each row in Table 2, so the entries in Table 4 can be compared directly with those in Table 2. We see that the transformation often enhanced the performance of PILOT as well as that of some other methods. On the high dimensional Riboflavin data set the score of PILOT came closer to Ridge and Lasso, whereas the other decision trees kept struggling. Also on the other data sets with linear structure PILOT was comparable with both linear methods. The average and the standard deviation of its score and rank were reduced by the transformation as well. The Wilcoxon tests in Table 5 gave similar results as before, and the Holm-Bonferroni test again yields significance. The overall performance of PILOT remained the best. We conclude that PILOT typically benefits from transforming the data before fitting the tree.

Table 5 The p-values of Wilcoxon signed rank tests comparing the performance of PILOT and the other methods, on the transformed datasets

4.4 Depth comparison between the tree-based methods

Here we compare the depth of the trees fitted by CART, PILOT and FRIED, to gain insight into how the inclusion of linear models in the nodes affects the tree structure and tree depth. Figure 5 shows the frequencies of tree depths of these methods, for all folds on all datasets, when the maximal depth was set to 18. We see that the PILOT trees are much more shallow than those built by CART. This is understandable, because CART needs many splits to model approximately linear relations to a reasonable precision. On the other hand, PILOT can avoid many of those splits by fitting linear models. The depth of FRIED trees is more similar to those of PILOT. Moreover, these conclusions remain the same when the experiment is done on the transformed datasets instead of the raw datasets. We did not include the depth of M5 for two reasons: First, not all the reported results of M5 in Table 2 were for the raw datasets, because of its sensitivity to the predictor scales. Second, the maximal depth of M5 cannot be specified as a tuning parameter (Hornik et al., 2009), making it difficult to obtain a fair comparison.

Fig. 5 Depths of the regression trees estimated through CART, PILOT and FRIED on 5 folds on the raw data (left) and the transformed data (right)

4.5 Feature importance in PILOT

Feature importance is a popular tool for assessing the role of each predictor in the final model (Hastie et al., 2009). Due to the tree structure of PILOT, we can construct a measure of feature importance similar to that of CART. For each predictor, we find the nodes in which it plays a role. We then compute the cumulative variance reduction of the corresponding fitted models. The cumulative variance reductions of all predictors are then normalized so they sum to 1. Note that this construction is exactly like the one used in CART (Hastie et al., 2009).
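A sketch of this computation is given below, assuming each fitted node records the predictor it used, its number of cases t, and the variance reduction achieved by its model; the node format is our simplification, not the actual PILOT data structure.

```python
from collections import defaultdict

# CART-style feature importance: cumulative weighted variance reduction per
# predictor, normalized so the importances sum to 1.
def feature_importance(nodes, n_total):
    imp = defaultdict(float)
    for node in nodes:
        imp[node['var']] += (node['t'] / n_total) * node['var_reduction']
    total = sum(imp.values())
    return {var: value / total for var, value in imp.items()}
```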

To illustrate feature importance in PILOT we analyze two examples, the Wine dataset and the Communities dataset. The former is low-dimensional and the latter is high-dimensional, but both datasets have variables that are easy to interpret. We first tuned the hyperparameters of PILOT and then calculated the feature importance. When computing the feature importance in PILOT, repeated lin fits in the same node are counted separately.

The Wine dataset has 11 continuous predictors and 4894 cases. The response is a measure of wine quality, provided as a score between 1 and 10. The feature importances from CART and PILOT are shown in Fig. 6. We see that the main effects are due to variables X2 (volatile acidity), X6 (free sulfur dioxide), and X11 (alcohol percentage), for both CART and PILOT. Figure 7 plots the response versus each of these predictors. We see that these predictors have some explanatory power, which CART can capture by piecewise constants. PILOT also uses these predictors to split on, but for X11 it fits additional linear models to the different regions, as seen in the last panel of Fig. 7. As a result, the feature importance of X11 is higher in PILOT than in CART, leaving less importance in some of the other variables.

Fig. 6 Feature importance on the Wine dataset for CART (left) and PILOT (right)

Fig. 7 Response against the predictors X2 (left), X6 (middle) and X11 (right) on the Wine dataset. Both PILOT and CART selected X11 as the first predictor to split on. The rightmost panel illustrates the predictions of CART and PILOT on this first node

We now consider the Communities dataset. Its continuous response measures the per capita violent crimes in communities in the US. The predictors are numerical too, and describe features of the communities such as median income, racial distribution, and average family composition. As the data are relatively high-dimensional with 128 predictors, we will only discuss the predictors with feature importance above 0.01. Figure 8 shows the feature importances of CART and PILOT. In this example the difference between the two methods is more pronounced. CART identifies X51 (percentage of kids aged 4 or under in family housing with two parents) as the most important predictor, and assigns some additional importance to a few other predictors. On the other hand, PILOT identifies X45 (percentage of males who have never married) as the dominant predictor.

Fig. 8 Feature importance of CART (left) and PILOT (right) on the Communities dataset. Only the variables with feature importance above 1% are shown

Figure 9 plots the response against these two predictors and superposes the fits on the root nodes of CART and PILOT. For CART this was a piecewise constant model on X51, whereas PILOT selected a blin fit on X45. The former reduces the RSS by 41%, whereas the latter yields a reduction in RSS of 56%. Therefore PILOT captured the underlying relation more accurately in its initial fit. This resulted in a substantially better predictive performance of the final model: from Table 2 we see that the MSE of PILOT was \(1.09/1.31 \approx 83\%\) of the MSE of CART on this dataset.

Fig. 9 Communities dataset: response versus the first predictor selected by CART (left) and the first predictor selected by PILOT (right). The red line in the left panel shows the piecewise constant fit of CART in its root node. The red line in the right panel is the blin fit of PILOT in its root node

4.6 Explainability of PILOT

We illustrate the explainability and interpretability of PILOT on the Graduate Admission dataset, which has 400 cases and 7 predictors. The response variable is the probability of admission, and the predictors include features such as the GPA and GRE scores. Figures 10 and 11 show the CART and PILOT trees of depth 3 on this dataset. The depth of the final fits is higher, but here we focus on depth 3 for illustrative purposes. Note that the first node of the PILOT tree performs two lin fits in a row, which it is allowed to do.

Fig. 10 CART regression tree on the Graduate Admission dataset

Fig. 11 Linear model tree by PILOT on the Graduate Admission dataset

Both methods result in a simple cascade of decision rules and found X6 to be an important predictor, which corresponds to the student’s GPA score. CART performs six splits out of seven on X6, whereas PILOT carries out two splits out of three on X6, and has X6 in all the linear models in the leaf nodes.

Fig. 12 Left: response versus the GPA score X6 on the Graduate Admission dataset. The lines show the fits of PILOT and CART after the first split. Right: response versus the GRE score X1. The red lines indicate the means of the cases left and right of the split point 317.5

The left panel of Fig. 12 shows the probability of admission versus the GPA score X6. The GPA is clearly related with the response. Moreover, the relationship looks nearly linear, making a piecewise linear model a much better fit than a piecewise constant model. This is confirmed by the fact that the initial PILOT fit on X6 explains 77.8% of the variance of the response, while the initial CART split on X6 explains 54.1% of the variance. The explained variance of the final trees is 78.3% for PILOT versus 72.4% for CART, while PILOT uses half the number of leaf nodes and less than half the number of splits.

Regarding the other predictors, both methods carry out a split on the GRE score X1 at depth 2. The relation between the admission probability and X1 is shown in the right panel of Fig. 12. Interestingly, PILOT also chooses a piecewise constant split on this variable, and does so at basically the same split point as CART. Therefore, both models suggest that the GRE score is sometimes used as a rough threshold to distinguish between candidates. Moreover, PILOT suggests that students have little to gain by raising the GRE much above this threshold. Figure 11 indicates that PILOT makes slight linear adjustments based on the variables X2 (TOEFL score), X3 (university rating) and X5 (quality of personal statement).

5 Conclusion

This paper presented a new linear model tree algorithm called PILOT. It is computationally efficient, as it has the same time and space complexity as CART. To avoid overfitting, PILOT is regularized by a model selection rule. This regularization adds no computational complexity, and no pruning is required once the tree has been built. This makes PILOT faster than most existing linear model trees. The prediction can easily be interpreted due to its tree structure and its additive piecewise linear nature. To guide variable selection, a measure of feature importance was defined in the same way as in CART. PILOT applies two truncation procedures to the predictions, in order to avoid the extreme extrapolation errors that were sometimes observed with other linear model trees. An empirical study found that PILOT often outperformed the CART, M5, and FRIED decision trees on a variety of datasets. When applied to roughly linear data, PILOT behaved more similarly to high-dimensional linear methods than other tree-based approaches did, indicating a better ability to discover linear structures. We proved a theoretical guarantee for its consistency on a general class of functions. When applied to data generated by a linear model, the convergence rate of PILOT is polynomial. To the best of our knowledge, this is the first linear model tree with proven theoretical guarantees.

We feel that PILOT is particularly well suited for fields that require both performance and explainability, such as healthcare (Ahmad et al., 2018), business analytics (Bohanec et al., 2017; Delen et al., 2013) and public policy (Brennan & Oliver, 2013), where it could support decision making.

A future research direction is to integrate PILOT trees as base learners in ensemble methods, as it is well known that the accuracy and diversity of base learners benefit the performance of the ensemble. On the one hand, we have seen that PILOT gave accurate predictions on a number of datasets. On the other hand, the wide choice of models available at each node allows for greater diversity of the base learner. For these reasons and the fact that PILOT requires little computational cost, we expect it to be a suitable base learner for random forests and gradient boosting.

6 Supplementary information

This consists of Python code for the proposed method, and an example script.