1 Introduction

As machine learning becomes a critical part of the success of more and more applications—such as credit scoring (Xia et al. 2017), bioactive molecule prediction (Babajide Mustapha and Saeed 2016), solar and wind energy prediction (Torres-Barrán et al. 2017), oil price prediction (Gumus and Kiran 2017), classification of galactic unidentified sources and gravitationally lensed quasars (Mirabal et al. 2016; Khramtsov et al. 2019), sentiment analysis (Valdivia et al. 2018), prediction of dementia using electronic health record data (Nori et al. 2019)—it is essential to find models that can deal efficiently with complex data, and with large amounts of it. With that perspective in mind, ensemble methods have proven to be a very effective tool for improving the performance of existing models (Breiman 2001; Friedman 2001; Yoav Freund 1999; Chen and Guestrin 2016). These methods mainly rely on randomization techniques, which consist of generating many diverse solutions to the problem at hand (Breiman 2001), or on adaptive emphasis procedures, such as boosting (Yoav Freund 1999).

In fact, the above mentioned applications have in common that they all use ensemble methods and, in particular, one of the following very recent ensemble methods: eXtreme Gradient Boosting or XGBoost (Chen and Guestrin 2016), Light Gradient Boosting Machine (LightGBM) (Ke et al. 2017) or Categorical Boosting (CatBoost) (Prokhorenkova et al. 2018), with very competitive results. These methods, all based on gradient boosting, propose variants of the original algorithm to improve training speed and generalization capability. One of these methods, XGBoost, has consistently placed among the top contenders in Kaggle competitions (Chen and Guestrin 2016). The performance of CatBoost and LightGBM is also promising, especially in generalization accuracy for the former (Prokhorenkova et al. 2018) and in training speed for the latter (Ke et al. 2017). However, these methods are not the only ones to achieve remarkable results over a wide range of problems. Random forest is also well known as one of the most accurate and fastest learning methods regardless of the nature of the dataset, as shown by several recent comparative studies (Fernández-Delgado et al. 2014; Caruana and Niculescu-Mizil 2006; Rokach 2016).

This study follows the path of many previous comparative analyses, such as Fernández-Delgado et al. (2014), Caruana and Niculescu-Mizil (2006), Dietterich (2000), with the intent of covering a gap related to gradient boosting and its more recent variants XGBoost, LightGBM and CatBoost. None of the previous comprehensive analyses included any machine learning algorithm of the gradient boosting family despite their appealing properties. We have found only two comparative studies that included gradient boosting in their experiments (Zhang et al. 2017; Brown and Mues 2012). In Brown and Mues (2012), a specific analysis of different algorithms on five imbalanced credit scoring datasets showed that random forest and gradient boosting performed among the best methods over a wide range of class imbalance percentages. In that study, only the number of trees was optimized for gradient boosting. In a recent extensive experimental comparison (Zhang et al. 2017), closer to the present study, several algorithms are compared across multiple datasets. The results of that comparison show gradient boosting and random forest as the best performers of the analysis. Although the analysis includes the XGBoost library, it is treated as a plain gradient boosting algorithm without taking into consideration any of the specific features of XGBoost. The analysis only optimizes the learning rate and the depth of the trees in a limited grid of 40 points. In this study we perform an extensive hyper-parameter analysis of XGBoost in addition to the other variants of gradient boosting (LightGBM and CatBoost), focusing on the specific improvements over gradient boosting proposed in these methods. Specifically, the main improvements of these methods are: the incorporation of a complexity control term in the loss function for XGBoost, the selective subsampling of high-gradient instances in LightGBM, and the prevention of prediction shift in the computation of gradients in CatBoost.

The specific objectives of this study are, in the first place, to compare the performance of the three gradient boosting variants (XGBoost, LightGBM and CatBoost) with respect to the algorithm on which they are based (i.e. gradient boosting). The comparison is extended to random forest, which can be considered a benchmark, since many previous comparisons have demonstrated its remarkable performance (Fernández-Delgado et al. 2014; Caruana and Niculescu-Mizil 2006; Zhang et al. 2017). The comparison is carried out in terms of accuracy, AUC and training speed. We believe this analysis can be very helpful to researchers and practitioners of various fields. Furthermore, a comprehensive analysis of the process of hyper-parameter setting in XGBoost, LightGBM and CatBoost is performed. For this, we propose two novel methodologies to analyze hyper-parameter tuning. In the first proposal, we measure how the average rank of a method across different hyper-parametrizations varies when a given hyper-parameter is changed from one value to another. The second proposal for hyper-parameter tuning analysis is a visual tool that shows how the performance of a method varies over a set of datasets when any given hyper-parameter is fixed to any given value, that is, when the hyper-parameter is not tuned.

The paper is organized as follows: Section 2 describes the methods of this study, emphasizing the different hyper-parameters that need to be tuned; Sect. 3 presents the results of the comparison; finally, the conclusions are summarized in Sect. 4.

2 Methodology

2.1 Random forest

One of the most successful machine learning methods is random forest (Breiman 2001). Random forest is an ensemble of classifiers composed of decision trees that are generated using two different sources of randomization. First, each individual decision tree is trained on a random sample with replacement from the original data with the same size as the given training set. The generated bootstrap samples are expected to contain \(\approx 37\%\) duplicated instances. A second source of randomization applied in random forest is attribute sampling: at each node split, a subset of the input variables is randomly selected to search for the best split. The value proposed by Breiman for this hyper-parameter is \(\lfloor\)log\({}_2\)(#features)\(+1\rfloor\). For classification, the final prediction of the ensemble is given by majority voting.

Based on the Strong Law of Large Numbers, it can be proven that the generalization error of random forests converges to a limit as the number of trees in the forest becomes large (Breiman 2001). The implication of this result is that the size of the ensemble is not a hyper-parameter that really needs to be tuned, as the generalization accuracy of random forest does not deteriorate on average when more classifiers are included in the ensemble. The larger the number of trees in the forest, the more likely it is that the ensemble has converged to its asymptotic generalization error. Actually, one of the main advantages of random forest is that it is almost hyper-parameter-free or, at least, its default hyper-parameter setting performs remarkably well on average (Fernández-Delgado et al. 2014). The two best methods of that comparative study are based on random forest, for which only the number of random attributes selected at each split is tuned. The method that placed fifth (out of 179 methods) in the comparison was random forest using the default setting. This could also be seen as a drawback, as it is difficult to further improve random forest by hyper-parameter tuning.

In any case, other hyper-parameters that may be tuned in random forest are those that control the depth of the decision trees. In general, decision trees in random forest are grown until all leaves are pure. However, this can lead to very large trees. In such cases, the growth of the trees can be limited by setting a maximum depth or by requiring a minimum number of instances per node before or after the split.

Among the set of hyper-parameters that can be tuned for random forest, we evaluate the following ones in this study (a grid-search sketch over these hyper-parameters is shown after the list):

  • The number of features to consider when looking for the best split (max_features).

  • The minimum number of samples (min_samples_split) required to split an internal node. This hyper-parameter limits the size of the trees but, in the worst case, the depth of the trees can be as large as \(N-\texttt {min\_samples\_split}\), with N the size of the training data.

  • The minimum number of samples (min_samples_leaf) required to create a leaf node. The effect of this limit is different from the previous hyper-parameter, as it effectively removes split candidates that are on the limits of the data distribution in the parent node.

  • The maximum depth of the tree (max_depth). This hyper-parameter limits the depth of the tree independently of the number of instances that are in each node.
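
As a concrete illustration of how these four hyper-parameters can be explored, the following minimal scikit-learn sketch runs a cross-validated grid search over them. The candidate values and the dataset are illustrative placeholders, not the grid of Table 2.

```python
# Minimal sketch of tuning the four random forest hyper-parameters discussed above.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

X, y = load_breast_cancer(return_X_y=True)

param_grid = {
    "max_features": ["sqrt", "log2", 0.25],   # features considered per split
    "min_samples_split": [2, 5, 10],          # minimum samples to split a node
    "min_samples_leaf": [1, 2, 5],            # minimum samples in a leaf
    "max_depth": [None, 5, 10],               # None grows trees until leaves are pure
}

search = GridSearchCV(
    RandomForestClassifier(n_estimators=200, random_state=0),
    param_grid,
    cv=StratifiedKFold(n_splits=10, shuffle=True, random_state=0),
    scoring="accuracy",
    n_jobs=-1,
)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```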

2.2 Gradient boosting

Boosting algorithms combine weak learners, i.e. learners slightly better than random, into a strong learner in an iterative way (Yoav Freund 1999). Gradient boosting is a boosting-like algorithm for regression (Friedman 2001). Given a training dataset \(D=\{\mathbf{x}_i,y_i\}_{1}^N\), the goal of gradient boosting is to find an approximation, \(\hat{F}(\mathbf{x})\), of the function \(F^*(\mathbf{x})\), which maps instances \(\mathbf{x}\) to their output values y, by minimizing the expected value of a given loss function, \(L(y, F(\mathbf{x}))\). Gradient boosting builds an additive approximation of \(F^*(\mathbf{x})\) as a weighted sum of functions

$$\begin{aligned} F_m(\mathbf{x}) = F_{m-1}(\mathbf{x}) + \rho _m h_m(\mathbf{x}), \end{aligned}$$
(1)

where \(\rho _m\) is the weight of the \(m^{th}\) function, \(h_m(\mathbf{x})\). These functions are the models of the ensemble (e.g. decision trees). The approximation is constructed iteratively. First, a constant approximation of \(F^*(\mathbf{x})\) is obtained as

$$\begin{aligned} F_0(\mathbf{x}) = \arg \min _{\alpha } \sum _{i=1}^{N} L(y_i,\alpha ). \end{aligned}$$
(2)

Subsequent models are expected to minimize

$$\begin{aligned} (\rho _m,h_m(\mathbf{x})) = \arg \min _{\rho ,h} \sum _{i=1}^{N} L(y_i,F_{m-1}(\mathbf{x}_i) + \rho h(\mathbf{x}_i)). \end{aligned}$$
(3)

However, instead of solving the optimization problem directly, each \(h_m\) can be seen as a greedy step in a gradient descent optimization for \(F^*\). For that, each model, \(h_m\), is trained on a new dataset \(D=\{\mathbf{x}_i,r_{mi}\}_{i=1}^N\), where the pseudo-residuals, \(r_{mi}\), are calculated by

$$\begin{aligned} r_{mi} = -\left[ \frac{\partial L(y_i,F(\mathbf{x}_i))}{\partial F(\mathbf{x}_i)} \right] _{F(\mathbf{x})=F_{m-1}(\mathbf{x})} \end{aligned}$$
(4)

The value of \(\rho _m\) is subsequently computed by solving a line search optimization problem.

This algorithm can suffer from over-fitting if the iterative process is not properly regularized (Friedman 2001). For some loss functions (e.g. the quadratic loss), if the model \(h_m\) fits the pseudo-residuals perfectly, then in the next iteration the pseudo-residuals become zero and the process terminates prematurely. To control the additive process of gradient boosting, several regularization hyper-parameters are considered. The natural way to regularize gradient boosting is to apply shrinkage to reduce each gradient descent step, \(F_m(\mathbf{x}) = F_{m-1}(\mathbf{x}) + \nu \rho _m h_m(\mathbf{x})\) with \(\nu \in (0, 1.0]\). The value of \(\nu\) is usually set to 0.1. In addition, further regularization can be achieved by limiting the complexity of the trained models. For the case of decision trees, the depth of the trees or the minimum number of instances necessary to split a node can be limited. Contrary to random forest, the default values for these hyper-parameters in gradient boosting are set to harshly limit the expressive power of the trees (e.g. the depth is generally limited to \(\approx 3\)–\(5\)). Finally, another family of hyper-parameters also included in the different versions of gradient boosting are those that randomize the base learners, such as random subsampling without replacement, which can further improve the generalization of the ensemble (Friedman 2002).
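
To make the additive update of Eqs. (1)–(4) concrete, the following toy sketch implements gradient boosting for the squared loss with shrinkage \(\nu\). It is only an illustration of the recursion; the scikit-learn implementation used in the experiments additionally performs line search, subsampling and tree-complexity control.

```python
# Toy gradient boosting for squared loss: pseudo-residuals are y - F_{m-1}(x)
# and each step is shrunk by nu.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_boost_fit(X, y, n_estimators=200, nu=0.1, max_depth=3):
    f0 = y.mean()                      # Eq. (2): constant model for the squared loss
    pred = np.full_like(y, f0, dtype=float)
    trees = []
    for _ in range(n_estimators):
        residuals = y - pred           # Eq. (4): negative gradient of the squared loss
        tree = DecisionTreeRegressor(max_depth=max_depth)
        tree.fit(X, residuals)         # h_m fitted to the pseudo-residuals
        pred += nu * tree.predict(X)   # shrunken additive update, Eq. (1)
        trees.append(tree)
    return f0, trees

def gradient_boost_predict(X, f0, trees, nu=0.1):
    return f0 + nu * sum(tree.predict(X) for tree in trees)

# tiny usage example on synthetic regression data
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X[:, 0] ** 2 + rng.normal(scale=0.1, size=200)
f0, trees = gradient_boost_fit(X, y)
print(np.mean((gradient_boost_predict(X, f0, trees) - y) ** 2))
```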

The hyper-parameters finally tested for gradient boosting are:

  • The learning rate (learning_rate) or shrinkage \(\nu\).

  • The maximum depth of the tree (max_depth): the same meaning as in the trees generated in random forest.

  • The subsampling rate (subsample) for the size of the random samples. Contrary to random forest, this is generally carried out without replacement (Friedman 2002).

  • The number of features to consider when looking for the best split (max_features): as in random forest.

  • The minimum number of samples required to split an internal node (min_samples_split): as in random forest.

2.3 XGBoost

XGBoost (Chen and Guestrin 2016) is a decision tree ensemble based on gradient boosting designed to be highly scalable. Similarly to gradient boosting, XGBoost builds an additive expansion of the objective function by minimizing a loss function. Considering that XGBoost is focused only on decision trees as base classifiers, a variation of the loss function is used to control the complexity of the trees

$$\begin{aligned} L_{xgb} = \sum _{i=1}^N L(y_i, F(\mathbf{x}_i)) + \sum _{m=1}^M \varOmega (h_m), \end{aligned}$$
(5)
$$\begin{aligned} \varOmega (h) = \gamma T + \frac{1}{2} \lambda \Vert w\Vert ^2, \end{aligned}$$
(6)

where T is the number of leaves of the tree and w are the output scores of the leaves. This loss function can be integrated into the split criterion of decision trees, leading to a pre-pruning strategy. Higher values of \(\gamma\) result in simpler trees, since \(\gamma\) controls the minimum loss reduction gain needed to split an internal node. An additional regularization hyper-parameter in XGBoost is shrinkage, which reduces the step size in the additive expansion. Finally, the complexity of the trees can also be limited using other strategies, such as a maximum depth. A secondary benefit of tree complexity reduction is that the models are trained faster and require less storage space. Randomization techniques are also implemented in XGBoost, both to reduce overfitting and to increase training speed. The randomization techniques included in XGBoost are: random subsamples to train individual trees and column subsampling at tree and tree-node levels. Furthermore, XGBoost can be extended to any user-defined loss function by defining a function that outputs the gradient and the hessian (second-order gradient) and passing it through the “objective” hyper-parameter.
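
As a hedged sketch of this extension point, the snippet below passes a callable that returns per-instance gradients and hessians to xgb.train through its obj argument (the scikit-learn wrapper accepts a similar callable through objective). The squared-error objective shown merely reproduces built-in behaviour and is intended only as a template; the data is synthetic.

```python
import numpy as np
import xgboost as xgb

def squared_error_obj(preds, dtrain):
    """Custom objective: gradient and hessian of 0.5 * (pred - label)^2."""
    labels = dtrain.get_label()
    grad = preds - labels              # first-order gradient per instance
    hess = np.ones_like(preds)         # second-order gradient (hessian) per instance
    return grad, hess

# synthetic regression data for illustration
rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 5))
y_train = X_train[:, 0] + rng.normal(scale=0.1, size=100)

dtrain = xgb.DMatrix(X_train, label=y_train)
params = {"max_depth": 3, "eta": 0.1, "gamma": 0.2}   # eta is the shrinkage / learning rate
booster = xgb.train(params, dtrain, num_boost_round=200, obj=squared_error_obj)
```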

Moreover, XGBoost proposes a sparsity-aware algorithm for finding the best split. The sparsity of an attribute can be caused by the presence of many zero-valued entries and/or missing values. XGBoost automatically removes these entries from the computation of the gain for split candidates. In addition, XGBoost trees learn the default child node to which instances with missing or null values are branched. Other interesting features of XGBoost include monotonic and feature interaction constraints. These features can be especially useful when domain-specific information is known. Monotonic constraints force the output of XGBoost for regression to be monotonic (increasing or decreasing) with respect to any set of given input attributes. Feature interaction constraints limit the input attributes that can be combined in the paths from the root to any leaf node. Both constraints are implemented by limiting the set of candidate splits considered at each node.

In addition, XGBoost implements several methods to increase the training speed of decision trees that are not directly related to ensemble accuracy. Specifically, XGBoost focuses on reducing the computational complexity of finding the best split, which is the most time-consuming part of decision tree construction algorithms. Split finding algorithms usually enumerate all possible candidate splits and select the one with the highest gain. This requires performing a linear scan over each sorted attribute to find the best split at each node. To avoid sorting the data repeatedly at every node, XGBoost uses a compressed column-based structure in which the data is stored pre-sorted. In this way, each attribute needs to be sorted only once. This column-based storage structure allows the best split for each considered attribute to be found in parallel. Furthermore, instead of scanning all possible candidate splits, XGBoost implements a method based on percentiles of the data in which only a subset of candidate splits is tested and their gain is computed using aggregated statistics. This idea resembles the node-level data subsampling that is already present in CART trees (Breiman et al. 1984).
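
The following sketch illustrates the flavour of this approximate split finding: candidate thresholds are drawn from feature percentiles and each candidate's gain is computed from aggregated gradient and hessian statistics, using the second-order gain formula of XGBoost-style boosters. It is a didactic simplification, not the weighted quantile sketch that XGBoost actually implements.

```python
import numpy as np

def approx_best_split(feature, grad, hess, n_bins=32, lam=1.0):
    """Pick the best threshold among percentile-based candidates for one feature."""
    candidates = np.percentile(feature, np.linspace(0, 100, n_bins + 1)[1:-1])
    G, H = grad.sum(), hess.sum()
    best_gain, best_threshold = -np.inf, None
    for t in candidates:
        left = feature <= t
        G_l, H_l = grad[left].sum(), hess[left].sum()
        G_r, H_r = G - G_l, H - H_l
        # second-order gain used by XGBoost-style boosters (ignoring the gamma penalty)
        gain = G_l**2 / (H_l + lam) + G_r**2 / (H_r + lam) - G**2 / (H + lam)
        if gain > best_gain:
            best_gain, best_threshold = gain, t
    return best_threshold, best_gain

# tiny usage: squared-loss gradients with initial prediction 0, so grad = -y, hess = 1
rng = np.random.default_rng(0)
x = rng.normal(size=500)
y = (x > 0.3).astype(float)
print(approx_best_split(x, grad=-y, hess=np.ones_like(x)))
```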

The following hyper-parameters were tuned for XGBoost in this study:

  • The learning rate (learning_rate) or shrinkage \(\nu\).

  • The minimum loss reduction (gamma): The higher this value, the shallower the trees.

  • The maximum depth of the tree (max_depth)

  • The fraction of features to be evaluated at each split (colsample_bylevel).

  • The subsampling rate (subsample): sampling is done without replacement.

2.4 LightGBM

LightGBM (Ke et al. 2017) is an extensive library that implements gradient boosting and proposes some variants. The implementation of gradient boosting in this library is especially focused on creating a computationally efficient algorithm, which is based on the precomputation of histograms of the features, as in XGBoost. The library also includes dozens of learning hyper-parameters that allow this model to work in a wide variety of scenarios: the implementation works both on GPU and on CPU, it can behave as basic gradient boosting, and it offers many types of randomization (column randomization, bootstrap subsampling, etc.).

The LightGBM implementation also proposes new features (Ke et al. 2017), mainly: Gradient-based One-Side Sampling (GOSS) and Exclusive Feature Bundling (EFB). GOSS is a subsampling technique used to create the training sets for building the base trees of the ensemble. As in AdaBoost, this technique aims at increasing the importance of the instances with higher uncertainty in their classification, which are identified as the instances with the largest gradients. When the GOSS option is set, the training set for each base learner is composed of a fraction a of the instances with the largest gradients plus a fraction b sampled at random from the remaining instances (those with the lowest gradients). To compensate for the change in the original distribution, the instances in the low-gradient group are weighted up by \((1-a)/b\) when computing the information gain. The EFB technique bundles sparse features into a single feature. This can be done without losing any information when the bundled features never take non-zero values simultaneously. Both GOSS and EFB provide further training speed gains. EFB can be seen as a feature preprocessing technique and will not be analyzed in this study, as it can be pre-applied to any dataset. In addition, for this library we will focus on the analysis of LightGBM with GOSS, as the implementation of standard gradient boosting is already covered by the gradient boosting method described above.
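
The sampling scheme can be summarized with the following sketch, which keeps the top a-fraction of instances by absolute gradient, samples a b-fraction of the rest, and up-weights the sampled low-gradient instances by \((1-a)/b\). This is only an illustration of the idea; LightGBM performs this internally when GOSS is enabled.

```python
import numpy as np

def goss_sample(gradients, a=0.2, b=0.1, rng=None):
    """Return the selected instance indices and their weights under GOSS."""
    rng = rng or np.random.default_rng(0)
    n = len(gradients)
    order = np.argsort(-np.abs(gradients))       # instances sorted by decreasing |gradient|
    top_k, rand_k = int(a * n), int(b * n)
    top_idx = order[:top_k]                      # always keep the high-gradient instances
    rand_idx = rng.choice(order[top_k:], size=rand_k, replace=False)
    idx = np.concatenate([top_idx, rand_idx])
    weights = np.ones(len(idx))
    weights[top_k:] = (1.0 - a) / b              # compensate for the distribution change
    return idx, weights

idx, w = goss_sample(np.random.default_rng(1).normal(size=1000))
print(len(idx), w[:3], w[-3:])
```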

The following hyper-parameters were tuned for LightGBM in GOSS mode (i.e. the hyper-parameter boosting_type is set to ‘goss’) in this study (a training sketch using these hyper-parameters is shown after the list):

  • The learning rate (learning_rate).

  • The maximum number of leaves of the tree (num_leaves). In LightGBM, the preferred method for controlling the complexity of the trees is to set a maximum number of leaves, which gives more flexibility to the shape the trees can adopt.

  • The fraction of instances a to be selected from the ones with the highest gradients (top_rate).

  • The fraction of instances b to be sampled from the ones with the lowest gradients (other_rate).

  • The fraction of features to be evaluated at each split (feature_fraction_bynode).
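
A hedged sketch of a GOSS training call with the five hyper-parameters above is shown below; the values are placeholders rather than the grid of Table 2, and the snippet assumes a LightGBM version that still accepts boosting_type='goss' directly.

```python
import lightgbm as lgb
from sklearn.datasets import load_breast_cancer

X, y = load_breast_cancer(return_X_y=True)
train_set = lgb.Dataset(X, label=y)

params = {
    "objective": "binary",
    "boosting_type": "goss",          # selective subsampling of high-gradient instances
    "learning_rate": 0.1,
    "num_leaves": 31,                 # complexity controlled through the number of leaves
    "top_rate": 0.2,                  # fraction a of highest-gradient instances kept
    "other_rate": 0.1,                # fraction b sampled from low-gradient instances
    "feature_fraction_bynode": 0.75,  # features evaluated at each split
    "verbose": -1,
}
booster = lgb.train(params, train_set, num_boost_round=200)
```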

2.5 CatBoost

CatBoost (Prokhorenkova et al. 2018) is a gradient boosting library that aims at reducing the prediction shift that occurs during training. This distribution shift is the departure of \(F(\mathbf{x}_i)|\mathbf{x}_i\), with \(\mathbf{x}_i\) a training instance, from \(F(\mathbf{x})|\mathbf{x}\) for a test instance \(\mathbf{x}\). This shift occurs because, during training, gradient boosting uses the same instances to estimate both the gradients and the models that minimize those gradients. The solution proposed in CatBoost (Prokhorenkova et al. 2018) is to estimate the gradient of each instance using a sequence of base models that do not include that instance in their training set. To do so, CatBoost first introduces a random permutation of the training instances. The idea in CatBoost (not its implementation) is to build \(i=1,\dots ,N\) base models for each of the M boosting iterations. The \(i\)th model of the \(m\)th iteration is trained on the first i instances of the permutation and is used to estimate the gradient of the \((i+1)\)th instance for the \((m+1)\)th boosting iteration. In order to be independent of the initial random permutation, this process is repeated using s different random permutations. Notwithstanding, the implementation of CatBoost is optimized such that a single model is built per iteration that handles all permutations and models. The base models are symmetric trees (or decision tables). These trees are grown by extending all leaf nodes level-wise using the same split condition.

Another important feature of CatBoost is how it handles categorical features. In CatBoost, categorical features are substituted by a numeric feature that measures the expected target value for each category. In order to avoid over-fitting to the training data, this numeric feature would ideally need to be computed using a different dataset; however, this is not generally possible. The procedure proposed in CatBoost to compute this new feature is similar to the one followed for building the models: for a given random permutation of the instances, the information of the instances preceding instance i is used to compute the feature value of instance i. Then, several permutations are carried out and the feature values obtained for each instance are averaged. As with EFB in LightGBM, computing target statistics for categorical features is a preprocessing technique, and hence it will not be taken into consideration in this study. Also, like LightGBM, CatBoost is an extensive library that includes many features, such as GPU training and standard boosting, and dozens of hyper-parameters to adapt to many possible learning situations. This library also includes standard gradient boosting; however, in this analysis we will focus on the implementation that prevents prediction shift (named Ordered CatBoost).

The following hyper-parameters were tuned for CatBoost in Ordered mode (i.e. the hyper-parameter boosting_type is set to ‘Ordered’) in this study (a training sketch using these hyper-parameters is shown after the list):

  • The learning rate (learning_rate).

  • The maximum depth of the tree (depth). Since the symmetric trees preferred by CatBoost are complete trees, the value of this hyper-parameter has a different effect than similar hyper-parameters in other models: high depth values can lead to huge trees that may cause memory problems.

  • Number of gradient steps to compute the value of the leaves (leaf_estimation_iterations).

  • The L2 regularization coefficient (l2_leaf_reg).
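
A hedged sketch of an Ordered CatBoost fit with these four hyper-parameters follows; the values shown are placeholders, not the grid of Table 2.

```python
from catboost import CatBoostClassifier
from sklearn.datasets import load_breast_cancer

X, y = load_breast_cancer(return_X_y=True)

model = CatBoostClassifier(
    boosting_type="Ordered",          # gradient estimates that avoid prediction shift
    iterations=200,
    learning_rate=0.1,
    depth=6,                          # symmetric (oblivious) trees are complete trees
    l2_leaf_reg=1.0,
    leaf_estimation_iterations=10,    # gradient steps to compute the leaf values
    verbose=False,
)
model.fit(X, y)
```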

3 Experimental results

Table 1 Characteristics of the studied datasets

In this section, an extensive comparative analysis of the efficiency of random forest, gradient boosting, XGBoost, GOSS LightGBM and Ordered CatBoost models is carried out. For the experiments, 28 different datasets from the UCI repository (Lichman 2013) were considered. These datasets come from different fields of application and have different numbers of attributes, classes and instances. The characteristics of the analyzed datasets are shown in Table 1, which displays for each dataset its name, number of instances, number of attributes, whether the dataset has missing values, and number of classes. For these experiments, the implementation of the scikit-learn package (version 0.21.2) (Pedregosa et al. 2011) was used for random forest and gradient boosting. For XGBoost, the XGBoost package was used, and the LightGBM and CatBoost libraries were used for those two methods. In addition, for the comparison, the different methods were analyzed both tuning the hyper-parameters using a grid search and using the default hyper-parameters of the corresponding packages. All packages except CatBoost have a fixed set of default hyper-parameters. In CatBoost, the default behaviour includes an adaptive selection of some hyper-parameters, such as the learning rate, on the fly.

The comparison was carried out using stratified 10-fold cross-validation. For each dataset and partition of the data into train and test, the following procedure was carried out for the five tested methods: (1) The optimum hyper-parameters for each method were estimated with stratified 10-fold cross-validation within the training set using a grid search. A wide range of hyper-parameter values is explored in the grid search; for each of the five methods, these values are shown in Table 2. (2) The set of hyper-parameters of the grid search giving the best average prediction accuracy was used to train the corresponding ensemble on the whole training partition. In addition, for binary datasets with a class imbalance higher than \(\approx 65\%\) for the majority class, the hyper-parameters giving the best AUC were also used to build an ensemble on the whole training data. (3) Additionally, for XGBoost, GOSS LightGBM and Ordered CatBoost, ensembles were trained on the whole training set for all possible combinations of hyper-parameters given in Table 2. This allows us, in combination with step (1), to test different alternative grids (more details below). (4) The default sets of hyper-parameters for each method were additionally used to train an ensemble of each type (Table 2 shows the default values for each hyper-parameter and ensemble type). (5) The generalization accuracy of the five ensembles selected in (2) and of the ensembles with the default hyper-parametrization is estimated on the left-out test set. The AUC is measured for the five models with the highest AUC in train. (6) In addition, the accuracy of all the XGBoost, GOSS LightGBM and Ordered CatBoost ensembles trained in step (3) is computed on the test set.
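
A simplified sketch of this protocol for a single method is given below: an outer stratified 10-fold split provides the generalization estimate and an inner stratified 10-fold grid search selects the hyper-parameters on each training partition. The tiny grid and the dataset are placeholders, not the grid of Table 2.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score

X, y = load_breast_cancer(return_X_y=True)

inner_cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
outer_cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=1)

search = GridSearchCV(
    GradientBoostingClassifier(n_estimators=200),
    param_grid={"learning_rate": [0.05, 0.1], "max_depth": [3, 5]},
    scoring="accuracy",
    cv=inner_cv,
)
# Each outer fold refits the grid search on the training partition and
# evaluates the selected hyper-parametrization on the left-out test fold.
outer_scores = cross_val_score(search, X, y, cv=outer_cv, scoring="accuracy")
print(outer_scores.mean())
```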

All ensembles were composed of 200 decision trees. Note that the size of the ensemble is not a hyper-parameter that needs to be tuned in ensembles of classifiers (Friedman 2001; Breiman 2001). For random forest-like ensembles, as more trees are combined into the ensemble, the generalization error tends to an asymptotic value (Breiman 2001). For gradient boosting-like ensembles, the generalization performance can deteriorate with the number of elements in the ensemble, especially for high learning rate values. However, this effect can not only be neutralized with the use of lower learning rates (or shrinkage) but also reverted (Friedman 2001). The conclusion of Friedman (2001) is that the best option to regularize gradient boosting is to fix the number of models to the highest computationally feasible value and to tune the learning rate. Although XGBoost has its own mechanism to handle missing values, we decided not to use it in order to perform a fairer comparison with respect to random forest and gradient boosting, as the implementation of decision trees in scikit-learn does not handle missing values. Instead, we used a class provided by scikit-learn to impute missing values by the mean of the whole attribute. Similarly, EFB features and target statistics for categorical features were not used in LightGBM and CatBoost respectively.
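
For reference, mean imputation in scikit-learn can be done as in the following sketch; SimpleImputer is our assumption of the class used, since the text only states that a scikit-learn imputation class replacing missing values by the attribute mean was employed.

```python
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[1.0, np.nan],
              [2.0, 4.0],
              [np.nan, 6.0]])
imputer = SimpleImputer(strategy="mean")   # replace NaN by the column (attribute) mean
X_imputed = imputer.fit_transform(X)
print(X_imputed)
```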

Table 2 Default values and possible values for every hyper-parameter in the normal grid search for random forest, gradient boosting, XGBoost, LightGBM (goss) and CatBoost (Ordered)

3.1 Results

Table 3 displays the average accuracy and standard deviation (after the ± sign) for: XGBoost with default hyper-parameters (shown with label D. XGB), tuned XGBoost (as T. XGB in the table), random forest with default hyper-parametrization (with D. RF), tuned random forest (T. RF), default gradient boosting (D. GB), tuned gradient boosting (T. GB), tuned GOSS LightGBM (T. LGB goss), default GOSS LightGBM (D. LGB goss), tuned Ordered CatBoost (T. Cat ord) and default ordered CatBoost (D. Cat ord). The best accuracy for each dataset is highlighted using a bold font.

From Table 3, it can be observed that the method that obtains the best performance, in terms of the number of datasets with the highest accuracy, is tuned gradient boosting (in 7 out of 28 datasets). After that, the methods in order are: tuned XGBoost and default CatBoost, which achieve the best results in 6 datasets; tuned CatBoost and default LightGBM in 4; default gradient boosting, tuned and default random forest and tuned LightGBM in 3; and default XGBoost in 1. As can be observed, the default performances of the five tested algorithms are quite different. Default random forest and default Ordered CatBoost are the methods that perform most evenly with respect to their tuned counterparts. Except for a few datasets, the differences in performance are very small between both configurations of random forest. In fact, the difference between the tuned and the default hyper-parametrizations of random forest is below \(0.5\%\) in 18 out of 28 datasets. For CatBoost, the default parametrization works very well with a few exceptions (Teaching and Vowel). This occurs because CatBoost performs an internal tuning of the learning rate in its default mode. Default LightGBM obtains a good number of best results, but its performance is uneven across datasets. Although the difference in average rank is not statistically significant between default and tuned LightGBM, the latter is better than its default version in 75% of the datasets. More importantly, when the default version is better, the differences in accuracy are small (1.7% in the most favorable case, Liver), but when the default is worse, the differences can be huge (as large as 17%, in Teaching). Default XGBoost and gradient boosting generally perform worse than their tuned versions. However, this is not always the case. This is especially evident in three cases for XGBoost (German, Parkinson and Vehicle), where the default hyper-parametrization achieves a better performance than the tuned XGBoost. These cases are a combination of two factors: noisy datasets and good default settings. The hyper-parameter estimation process, even though it is performed with within-train cross-validation, may overfit the training set, especially in noisy datasets. In fact, it has been shown that reusing the training data multiple times can lead to overfitting (Dwork et al. 2015). On the other hand, in these datasets, the default setting is one of the hyper-parametrizations that obtains the best results both in train and test (among the best \(\approx 5\%\)).

In order to summarize the figures shown in Table 3, we applied the methodology proposed in Demšar (2006). This methodology compares the performance of several models across multiple datasets. The comparison is carried out in terms of the average rank of the performance of each method on the tested datasets. The results of this methodology are shown graphically in Fig. 1. In this plot, a higher rank indicates a better performance. Statistical differences among average ranks are determined using a Nemenyi test. There are no statistically significant differences in average rank between methods that are connected with a horizontal solid line. The critical difference over which the differences are considered significant is shown in the plot for reference (CD = 2.56 for 10 methods, 28 datasets and p-value \(<0.05\)).
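
As a quick check of the reported value, the Nemenyi critical difference follows Demšar (2006): \(CD = q_{\alpha }\sqrt{k(k+1)/(6N)}\), with k compared methods and N datasets. The sketch below reproduces the CD quoted above (the critical value \(q_{0.05}\approx 3.164\) for \(k=10\) is taken from Demšar's tables).

```python
import math

k = 10           # number of compared methods
N = 28           # number of datasets
q_alpha = 3.164  # studentized range critical value for k = 10, alpha = 0.05 (Demšar 2006)

cd = q_alpha * math.sqrt(k * (k + 1) / (6 * N))
print(round(cd, 2))   # approximately 2.56, matching the value reported in the text
```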

From Fig. 1, it can be observed that there are not many statistically significant differences among the average ranks of the ten tested methods. The best methods in terms of average rank are first tuned Ordered CatBoost and then tuned XGBoost. These two methods present statistically significant differences with respect to default XGBoost and LightGBM. The next best methods are in order: tuned GB, default Ordered CatBoost, tuned GOSS LightGBM and default random forest.

In order to study the performance of these methods using other metrics, we obtained the average AUC for a selection of binary datasets with unbalanced classes. Datasets with a majority class above \(\approx\)65% were selected. The results are shown in Table 4. We only show the result for the best variant of each model, that is, the tuned models except for random forest, for which we use the default version. In the last row of the table the average rank is shown (higher is better). The results are consistent with those for average accuracy in the sense that the ordering of the methods is similar. The best method is Ordered CatBoost followed by XGBoost and gradient boosting, then random forest and finally LightGBM.

Table 3 Average accuracy and standard deviation for XGBoost, random forest, gradient boosting, LightGBM with GOSS and Ordered CatBoost, all using default and tuned hyper-parameter settings
Table 4 Average AUC and standard deviation for unbalanced datasets

In Table 5, the average training execution time (in seconds) for the analyzed datasets is shown. For the methods using the default settings, the table shows the training time. For the methods that have been tuned using grid search, two numbers are shown separated by a ‘+’ sign. The first number corresponds to the time spent in the within-train 10-fold cross-validated grid search. The second figure corresponds to the training time of each method once the optimum hyper-parameters have been estimated. The last row of the table reports the average ratio of each execution time with respect to the execution time of XGBoost using the default setting. All experiments were run using an eight-core Intel® Core™ i7-4790K CPU@4.00GHz processor. The reported times are sequential times in order to have a real measure of the needed computational resources, even though the grid search was performed in parallel using all available cores. This comparison is fair independently of whether the learning algorithms include internal multi-thread optimizations or not. For instance, random forest and XGBoost include multi-thread optimizations in their code (to train each single tree in random forest and to compute the splits in XGBoost), whereas gradient boosting does not. LightGBM and CatBoost also include multi-threaded implementations for both CPU and GPU. Only single-thread CPU implementations were used. Notwithstanding, given that the grid search procedure is fully parallelizable, the within-training CPU parallelization optimizations of the different methods do not reduce the end-to-end time required to perform a grid search in a real setting.

Table 5 Average execution time (in seconds) for training the tested methods (more details in the text)

As can be observed from Table 5, finding the best hyper-parameters for the classifiers through the grid search is a rather costly process. In fact, the end-to-end training time of the tuned models is clearly dominated by the grid search, which accounts for over 99.9% of the training time. Since the size of the grid is different for different classifiers (i.e. 3840, 256, 1920, 1440 and 120 for XGB, RF, GB, GOSS LGB and Ordered CatBoost respectively), the time dedicated to finding the best hyper-parameters is not directly comparable between classifiers. In any case, when measuring the training time of a model it is important to consider how costly it is to find a reasonable set of hyper-parameters, and not only the time needed to build the final model.

However, when it comes to fitting a single ensemble to the training data, without taking into account the grid search time, GOSS LightGBM shows the fastest performance on average, followed by XGBoost. Training XGBoost for a given set of hyper-parameters is about 1.6 times slower than training a GOSS LightGBM ensemble, 3.5 times faster than training a random forest, 2.4–4.3 times faster than training a gradient boosting model and 52 times faster than training an Ordered CatBoost ensemble. However, the differences are uneven across datasets. Ordered CatBoost is clearly the slowest method, taking the longest time to compute the tested grid in spite of its grid being the smallest. Despite the inner optimizations implemented, the process for reducing the prediction shift is costly. However, relative to the other models, Ordered CatBoost performs better the larger the dataset is. For instance, in Magic04 (19020 instances, 10 attributes) the computation of the XGBoost grid search is about 3 times slower than that of Ordered CatBoost, whereas in Sonar (208 instances, 60 attributes) computing the Ordered CatBoost grid is about 14 times slower than computing the XGBoost grid. The difference between XGBoost and gradient boosting can be observed in the time employed in the grid search by those methods: XGBoost takes less than half of the time to look for the best hyper-parameter setting, despite the fact that its grid is twice the size of the grid of gradient boosting. Finally, for some multiclass problems, such as Segment, Soybean or Vowel, the execution time of XGBoost, gradient boosting, LightGBM and CatBoost deteriorates in relation to random forest, since for multi-class problems gradient boosting based models train one base tree per class and iteration.

Fig. 1 Average ranks (a higher rank is better) for the tested methods across 28 datasets (critical difference CD = 2.56)

3.2 Analysis of hyper-parametrization

In order to further analyze and understand the hyper-parametrization of XGBoost, GOSS LightGBM and Ordered CatBoost, we propose two novel analyses of their hyper-parametrizations. Firstly, we propose an analysis that measures how the performance of the model is affected when one hyper-parameter value is changed to another value. This analysis was carried out for XGBoost only. Secondly, we analyze the effect on a method's performance of fixing a hyper-parameter value instead of including it in the grid search.

These further experiments are carried out with two objectives for XGBoost. The first objective is to try to select a better default hyper-parametrization for XGBoost. The second is to explore alternative grids for XGBoost. For the first objective, we have analyzed, for each hyper-parameter, the relation among its possible value assignments. To do so, we have computed the average rank (in test error) across all datasets for the ensembles trained using all hyper-parameter configurations given in Table 2 for XGBoost. Recall that we have analyzed 3840 possible hyper-parameter configurations. Then, for each given hyper-parameter, say HParamX, and hyper-parameter value, say HParamX_valueA, we compute the percentage of times that the average rank improves when HParamX=HParamX_valueA is changed to another value HParamX_valueB and no other hyper-parameter is modified. These results are shown in Tables 6, 7, 8, 9 and 10. The tables are to be read as follows: each cell of the table indicates the % of times the average rank improves when changing the value in the corresponding top row to the value in the corresponding first column of the table. Values above 50% are highlighted in bold. For instance, as shown by Table 6, 94.9% of the times that the learning rate hyper-parameter goes from 0.025 to 0.05, while keeping the rest of the hyper-parameters fixed, the average rank across all the analyzed datasets improves.
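
This first analysis can be summarized with the following hedged pandas sketch, where each row of the assumed results table holds one hyper-parameter configuration and its average rank across datasets (higher is better, as in Fig. 1); the column names are illustrative, not part of the original experimental code.

```python
import pandas as pd

def improvement_percentage(results: pd.DataFrame, hyper_param: str,
                           value_a, value_b) -> float:
    """Percentage of configuration pairs whose average rank improves when
    hyper_param is changed from value_a to value_b, all else being fixed.
    'results' has one column per hyper-parameter plus an 'avg_rank' column."""
    others = [c for c in results.columns if c not in (hyper_param, "avg_rank")]
    ranks_a = results[results[hyper_param] == value_a].set_index(others)["avg_rank"]
    ranks_b = results[results[hyper_param] == value_b].set_index(others)["avg_rank"]
    # align configurations that share the values of every other hyper-parameter
    ranks_a, ranks_b = ranks_a.align(ranks_b, join="inner")
    return 100.0 * (ranks_b > ranks_a).mean()   # a higher average rank is better here
```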

From these tables, we can see which hyper-parameter values are most favorable in general. In Table 6, it can be observed that the best values for the learning rate are intermediate values. Both 0.05 and 0.1 clearly improve the performance of XGBoost on average with respect to the rest of the learning rate values. The best values for gamma (see Table 7) are also in the mid-range of the analyzed values. In this table, we can observe that the default gamma value, which is 0, is definitely not the best choice in general. A value of gamma \(\in [0.2,0.3]\) seems to be a reasonable choice. An interesting aspect related to the tree depth values, as shown in Table 8, is that the higher the depth, the better the performance on average. This is not necessarily in contradiction with the common use of shallow trees in gradient boosting algorithms, since the depth hyper-parameter value is simply a maximum. Furthermore, the actual depth of the trees is also controlled through the gamma hyper-parameter (which limits the complexity of the trees). Regarding the percentage of selected features when building the tree, it can be observed in Table 9 that the values 0.25, sqrt and log2 perform very similarly on average. As for subsampling, the best value is 0.75 (see Table 10). In summary, we propose to use as the default XGBoost hyper-parameters: 0.05, 0.2, 100 (unlimited), sqrt and 0.75 for the learning rate, gamma, depth, features and subsampling rate respectively.

Table 6 Percentage of times the average rank of XGBoost improves when changing the learning_rate from the value in the top row to the value in the first column
Table 7 Percentage of times the average rank of XGBoost improves when changing the gamma from the value in the top row to the value in the first column
Table 8 Percentage of times the average rank of XGBoost improves when changing the depth from the value in the top row to the value in the first column
Table 9 Percentage of times the average rank of XGBoost improves when changing the colsample_bylevel from the value in the top row to the value in the first column
Table 10 Percentage of times the average rank of XGBoost improves when changing the subsample from the value in the top row to the value in the first column

The second hyper-parametrization analysis consists of analyzing how the performance of the models varies if a given hyper-parameter HParamX is not optimized and its value is fixed to HParamX=HParamX_valueA. We propose a tool to visualize this effect over a given set of tested datasets. To carry out this analysis, we use the fact that we recorded, for each train-test partition and for all possible grid combinations, the within-train cross-validation error and the generalization test error. Hence, we can estimate the generalization error of any sub-grid within the tested hyper-parameter grid. The results of this experiment are shown in Fig. 2. This figure shows a plot for each tested hyper-parameter of XGBoost. For each plot, a boxplot is shown for each possible value assignment, indicating how the test performance ratio of the model on the tested datasets changes with respect to the use of the full grid (that is, with respect to the performance shown in Table 3 for tuned XGBoost) when the given hyper-parameter is not optimized and its value is set to the value shown in the x-axis. In addition to the boxplot, the mean ratio variation and its deviation are shown with a blue line and a blue shaded area respectively. The reference performance of the full grid is shown with a dotted line. Note that the plots are all set to the range [0.94, 1.04] for easier comparison and that a few outlier points (datasets) may fall outside the plots. When this is the case, an annotation arrow is shown with the number of outliers and their minimum value (for points below the limits) or maximum value (for points above the limits) in parentheses. One would expect, at least for important hyper-parameters, that the performance would drop below 1.0 for most datasets. However, this is not always the case, since not tuning one hyper-parameter can be compensated by setting another hyper-parameter to a different value. To analyze this, we show, in the last plot of Fig. 2, how the results change with respect to the full grid for learning_rate given that subsample is already fixed to 0.75 and colsample_bylevel to ‘sqrt’. In this plot the dotted line corresponds to the reference average ratio when subsample = 0.75 and colsample_bylevel = ‘sqrt’.

These plots provide a visual tool for analyzing the effect of the different hyper-parameters. From them, we can draw conclusions similar to those of the previous analysis, although they provide further insights into the importance of the different hyper-parameters, as they also show the variability of the performance across datasets. From these plots we can observe that fixing a hyper-parameter to a certain value does not, in general, produce severe drops in performance, with some exceptions for extreme values. For instance, when setting \(gamma=2\), the performance drops for \(\approx 3/4\) of the datasets. In some cases, as for \(subsample=0.75\), a slight improvement in performance can even be observed. In addition, optimizing column subsampling does not seem to provide big gains, and fixing its value either to ‘sqrt’ or ‘log2’ produces results similar to those of the full grid. Hence, fixing those hyper-parameters instead of optimizing them seems a reasonable idea. In order to further analyze this idea, the last plot of Fig. 2 shows the variation for learning_rate given subsample = 0.75 and colsample_bylevel = ‘sqrt’. From this plot, the importance of optimizing the learning rate becomes clearer: setting the learning rate to any of the studied values would produce a drop in performance in more than 75% of the datasets (except for 0.05, in which case the drop occurs in 19 out of 28 datasets, 68%).

Fig. 2 Hyper-parametrization analysis of XGBoost. Each boxplot represents the performance ratio variation of the tested dataset if the grid search for the given hyper-parameter was to be held fixed at the given value

The plots for the hyper-parametrization analysis of GOSS LightGBM and Ordered CatBoost are shown in Figs. 3 and 4 respectively. From these plots, it can be observed that for both GOSS LightGBM and Ordered CatBoost the most important hyper-parameter is the learning rate, since fixing it to any of the studied values would result in a drop in overall performance. This is especially evident for GOSS LightGBM. In addition, for this method, there is a tendency to prefer higher values for the hyper-parameters top_rate and other_rate. For these hyper-parameters, the default values (top_rate = 0.1 and other_rate = 0.2) are suboptimal in the analyzed datasets. Ordered CatBoost seems to be quite stable independently of the hyper-parametrization, except for the learning_rate. Note that for l2_leaf_reg the best value for the studied datasets seems to be 1 instead of 3, as proposed in the default settings.

Fig. 3 Hyper-parametrization analysis of GOSS LightGBM. Each boxplot represents the performance ratio variation of the tested dataset if the grid search for the given hyper-parameter was to be held fixed at the given value

Fig. 4 Hyper-parametrization analysis of Ordered CatBoost. Each boxplot represents the performance ratio variation of the tested dataset if the grid search for the given hyper-parameter was to be held fixed at the given value

3.3 Selected hyper-parametrizations

In Table 11, we show the results for a selection of sub-grids derived from the previous hyper-parametrization analysis of the different libraries. First, Table 11 shows the average accuracy of the proposed default hyper-parameters for XGBoost in column “D.P. XGB”. For reference, the default hyper-parametrization is also shown in the first column (as D. XGB). In addition, we explore two different hyper-parameter grids for XGBoost. On one hand, we would like to analyze the differences between gradient boosting and XGBoost in more detail. There are few differences between both algorithms except that XGBoost is optimized for speed. The main difference, from a machine learning point of view, is that XGBoost incorporates into the loss function a hyper-parameter to explicitly control the complexity of the decision trees (i.e. gamma). In order to analyze whether this hyper-parameter provides any advantage in the classification performance of XGBoost, a grid with the same hyper-parameter values as the ones given in Table 2 is used, except for gamma, which is always set to 0.0. On the other hand, we have observed that, in the case of random forest, tuning the randomization hyper-parameters is not very productive in general. As shown in Fig. 1, the average rank of default random forest is better than the rank of the forests for which the randomization hyper-parameters are tuned. Hence, the second proposed grid tunes the optimization hyper-parameters of XGBoost (i.e. learning rate, gamma and depth) keeping the randomization hyper-parameters (i.e. random features and subsampling) fixed. The randomization hyper-parameters are fixed to 0.75 for the subsampling rate and to sqrt for the number of features, as suggested by Tables 10 and 9 respectively, and by Fig. 2. The average generalization accuracies of XGBoost when using these two grids are shown in Table 11. Column “T. NG XGB” shows the results for the grid that does not tune gamma (i.e. gamma=0), and column “T. NR XGB” shows the results for the grid that does not tune the randomization hyper-parameters. For LightGBM, as observed from Fig. 3 for the studied datasets, the randomization hyper-parameters top_rate and other_rate seem to perform better for higher values, 0.7 and 0.3 respectively. Hence, a grid fixing those hyper-parameters is proposed (column “T. NR LGB” in the table). For CatBoost, we propose to fix l2_leaf_reg to 1 and leaf_estimation_iterations to 10 and to optimize the other hyper-parameters. The results of this reduced grid are shown in column “T. R Cat”. The results of the full tested grids for XGBoost, GOSS LightGBM and Ordered CatBoost are also shown in Table 11 for reference. The best average test accuracy for each dataset is highlighted in boldface. Additionally, the average ranks for this table are shown in Fig. 5 following the methodology proposed in Demšar (2006). In this figure, differences among methods connected with a horizontal solid line are not statistically significant.

As can be observed from Table 11, the proposed fixed hyper-parametrization for XGBoost (D.P. XGB) improves over the results of the default setting (D. XGB). With the proposed default setting, better results are obtained in 17 out of 28 datasets, with notable differences in datasets such as Echo, Sonar or Tic-tac-toe. In addition, when the default setting is better than the proposed hyper-parameters, the differences are in general small, except for German, Parkinson and Vehicle. In spite of the improvements on average achieved by the proposed hyper-parameter setting, it seems clear that hyper-parameter optimization is necessary to further improve the performance of XGBoost and to adapt the model to the characteristics of each specific dataset.

Table 11 Average accuracy and standard deviation of XGBoost with different configurations: default, proposed, tuned, no gamma tuning and no randomization hyper-parameter tuning

The results shown in Table 11 and Fig. 5 are quite interesting. The numbers of best results are 9 and 8 for Ordered CatBoost with the reduced grid and for XGBoost with the grid without randomization hyper-parameter tuning, respectively. Both grids improve over the respective full tested grids of Ordered CatBoost and XGBoost. For GOSS LightGBM, the variations in performance between the two grids are small. The same occurs for XGBoost between the full grid and the grid in which gamma is not optimized. Similar observations can be made from the average ranks of the different tested grids shown in Fig. 5. In this plot, the reduced grid for Ordered CatBoost obtains the best average rank, closely followed by XGBoost with the grid without randomization hyper-parameter tuning. Then, the full grid versions of these methods, the grid without gamma optimization and the LightGBM grids follow in the plot. One tendency that can be observed from these results is that including a complexity term to control the size of the trees seems to give a small edge over not using it in XGBoost, although the differences are not statistically significant. A clearer conclusion is that it seems unnecessary to tune the number of random features and the subsampling rate, provided that those techniques are applied with reasonable values (in our case, a subsampling rate of 0.75 and feature sampling set to sqrt).

Finally, the time required to perform the grid search and to train the single models is shown in Table 12 in the same manner as in Table 5. As shown in the last row of this table, for the tested settings, performing the grid search without tuning the randomization hyper-parameters for XGBoost and LightGBM is approximately 16 and 10 times faster, respectively, than tuning the full grid on average. These results reinforce the idea that tuning the randomization hyper-parameters is unnecessary. For Ordered CatBoost, there is a reduction of approximately 7 times with respect to the full grid, although the grid search time remains rather high.

Fig. 5 Average ranks (higher rank is better) for different XGBoost configurations (Critical difference CD = 1.15)

Table 12 Average execution time (in seconds) for training different variants of XGBoost, LightGBM GOSS and CatBoost Ordered (more details in the text)

4 Conclusion

In this study we present an empirical analysis of recent variants of gradient boosting (XGBoost, LightGBM and CatBoost), which have proven to be efficient and accurate models. Specifically, the performance of XGBoost, LightGBM and CatBoost in terms of training speed, AUC and accuracy is compared with the performance of gradient boosting and random forest over a wide range of classification tasks. The study focuses on the specific characteristics of these methods with respect to gradient boosting: the introduction of a complexity term in the loss function of XGBoost, the selective subsampling of high-gradient instances in LightGBM, and a computation of the gradients that avoids prediction shift in CatBoost. In addition, the hyper-parameter tuning process of these methods is thoroughly analyzed.

The results of this study show that the most accurate classifier, in terms of average rank, is CatBoost. Nevertheless, the differences with respect to XGBoost, gradient boosting, random forest (using the default hyper-parameters) and LightGBM are not statistically significant in terms of average rank. XGBoost and LightGBM trained using the default hyper-parameters of the packages were the least successful methods (statistically worse than the tuned versions of XGBoost and CatBoost). The difference between the default settings of CatBoost and the tuned version is small. This is because the default version of CatBoost performs an internal adjustment of the learning rate. The performance of the different methods in terms of AUC is also consistent with their performance in terms of accuracy. In consequence, we conclude that a meticulous hyper-parameter search is necessary to create accurate models based on gradient boosting. This is not the case for random forest, whose generalization performance was slightly better on average when the default hyper-parameter values were used (those originally proposed by Breiman). In fact, tuning the randomization hyper-parameters of XGBoost (the subsampling rate and the number of features selected at each split) was found to be unnecessary as long as some randomization is used. In our experiments, we fixed the value of the subsampling rate to 0.75 (without replacement) and the number of features to sqrt, reducing the size of the hyper-parameter grid search 16-fold and improving the average performance of XGBoost. For LightGBM, we found that, for the studied datasets, tuning the top_rate and other_rate hyper-parameters produced results very similar to not tuning them. For CatBoost, we found that tuning the learning rate and the depth of the trees was sufficient to obtain the best results of the analysis.

Finally, in the experiments of this study, which are based on grid search hyper-parameter tuning using within-train 10-fold cross-validation, the tuning phase accounted for over 99.9% of the computational effort necessary to train any of the methods. However, the grid search time can be significantly reduced when the smaller proposed grids are used for XGBoost, LightGBM and CatBoost. The fastest algorithm was LightGBM, and XGBoost also obtained very competitive speed results. Ordered CatBoost, in contrast, is in general much slower than the other methods. In any case, the time for selecting the model hyper-parameters has to be taken into consideration when choosing any given method, and not only its speed for training the final model. In this sense, the default versions of random forest and Ordered CatBoost provide competitive results without the computational burden of hyper-parameter tuning.