1 Introduction

The growth in computer applications has enhanced the collection and analysis of big datasets. Big datasets are often referred to as high-dimensional data in statistical parlance. The difficulties faced while analyzing big datasets have led to the development of many statistical and machine learning procedures in recent times [1]. In most areas of research, especially bioinformatics, it is common to have datasets with relatively small sample sizes collected on a large number of features.

Random forests (RF), a tree-based non-parametric method originally proposed by [2], is one of the most popular methods for handling high-dimensional data, mainly because of its computational speed and high accuracy. Bayesian procedures have emerged as a solution to many statistical applications in recent times; in theory, they attain the lowest error rate [3,4,5,6]. Chipman et al. [7] proposed Bayesian Additive Regression Trees (BART), a probabilistic approach to the sum-of-trees model. However, BART is better described as a Bayesian approach to the sum-of-trees model than as a Bayesian random forest. Specifically, BART does not incorporate bootstrapping of trees as in RF but instead uses a posterior distribution over trees. In addition, BART controls tree depth by imposing restrictive priors on trees with many daughter nodes; that is, it uses the prior specification as a pruning tool to avoid large trees [8]. Taddy et al. [9] proposed the Bayesian Forest (BF) as a nonparametric Bayesian approach to RF. They used a posterior over trees instead of a bootstrap of trees, based on a nonparametric Bayesian model with multinomial draws. BF mimics RF by replacing the bootstrapping procedure of [10] with its Bayesian counterpart (the Bayesian bootstrap, [11]). This implies that BF focuses on the data-generating process of RF but not on its impurity measures.

Based on the aforementioned Bayesian variants of Random Forest (RF), we observe that none of the existing methods fully captures the complete framework of RF, and this affects their eventual results. Thus, the goal of this research is to develop a complete Bayesian approach to Random Forest, one that updates every aspect of RF using Bayesian reasoning. By way of illustration, we consider the case of binary classification with high-dimensional cancer datasets.

2 Decision Trees and Random Forests

Decision trees are a class of methods under the broad framework of Classification and Regression Trees (CART). The response variable of interest determines the type of model: decision trees if the response is categorical and regression trees if the response is continuous. CART does not assume an explicit statistical model but rather follows a set of steps, i.e., an algorithm. CART modelling involves partitioning the feature space into \( M \) regions.

Formally, we are given a training dataset \( \left[ {y_{i} , x_{i1} ,x_{i2} , \ldots , x_{ip} ,\; i = 1, 2, \ldots ,n} \right] \), where \( y_{i} \) is a categorical outcome that takes values \( k = 1, 2, \ldots ,K \) and \( x_{i} \) is the vector of features. The CART algorithm automatically decides on the splitting variables and splitting points. After successfully partitioning the feature space into regions \( R_{1} , R_{2} , \ldots , R_{M} \), the closest form of model that CART assumes is:

$$ y = \sum_{m = 1}^{M} \beta_{m}\, I\left( x \in R_{m} \right) $$
(1)

where \( \beta_{m} \) is a constant in region \( m \). Estimating \( \beta_{m} \) requires the computation of an impurity function. For the classification case, the commonly used impurity functions are the Misclassification Error Rate (MER), the Gini index, and the deviance [12].
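To make these impurity measures concrete, the short Python sketch below computes the Gini index and MER of a single node from its class labels; the function names and toy labels are illustrative assumptions, not part of the original formulation.

```python
import numpy as np

def gini_index(y):
    """Gini impurity of a node: sum over classes of p_k * (1 - p_k)."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return float(np.sum(p * (1.0 - p)))

def misclassification_error(y):
    """Misclassification error rate of a node: 1 - max_k p_k."""
    _, counts = np.unique(y, return_counts=True)
    return 1.0 - counts.max() / counts.sum()

# Example node with 6 observations of class 0 and 2 of class 1
node_labels = np.array([0, 0, 0, 0, 0, 0, 1, 1])
print(gini_index(node_labels))               # 0.375
print(misclassification_error(node_labels))  # 0.25
```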

Random Forest (RF) extends CART trees in two steps: (i) bootstrapping the training dataset \( J \) times to obtain a total of \( J \) trees; (ii) subsampling \( l < p \) features without replacement at each split in each tree \( j \). Thus, given a CART model \( {\Im }\left( {\hat{\beta }_{m} :x \in R_{m} } \right) \), the RF model is:

$$ \hat{y} = \sum_{j = 1}^{J} {\Im }_{j} \left( \hat{\beta }_{m} :x \in R_{m} \right) $$
(2)

RF has two tuning parameters: the number of trees \( J \) and the number of subsampled features \( l \). Breiman [2] suggested using at least \( J = 200 \) trees and \( l = \sqrt p \) for classification tasks.
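As a hedged illustration of the two RF steps and the suggested defaults, the scikit-learn sketch below sets the number of trees to \( J = 200 \) and the number of features tried at each split to \( \sqrt p \); the simulated data, seeds and parameter names are assumptions for the example only.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 500))           # n = 100 samples, p = 500 features
y = (X[:, 0] + X[:, 1] > 0).astype(int)   # toy binary outcome

# Step (i): bootstrap J = 200 trees; step (ii): subsample sqrt(p) features per split
rf = RandomForestClassifier(n_estimators=200, max_features="sqrt",
                            bootstrap=True, random_state=0)
rf.fit(X, y)
print(rf.score(X, y))                     # training accuracy of the forest
```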

3 Bayesian Random Forests

Section 2 highlights the weakness of RF: its tuning parameters are not chosen by any probabilistic rule. The approach amounts to trial and error, which is why RF is often referred to as a black-box method. A quick remedy is to select the tuning parameters by cross-validation, but this comes at the expense of computation time. Therefore, the focus of this research is to modify RF by updating the two steps in (2) via a Bayesian approach. For the bootstrapping step, we propose Bayesian Simple Random Sampling With Replacement (BSRSWR), described by the posterior distribution in (3):

$$ P\left( \pi \mid a,b \right) = \frac{\varGamma \left( n + a + b \right)}{\varGamma \left( a + 1 \right)\varGamma \left( b + n - 1 \right)}\,\pi^{a} \left( 1 - \pi \right)^{b + n} ,\quad 0 \le \pi \le 1 $$
(3)

where \( \pi \) is the probability of selecting any observation \( i \in n \) in each bootstrap step \( j \), \( \varGamma ({\text{d}}) \) is the gamma function evaluated at \( d \), \( a \) is the prior expected number of times any \( i \in n \) could be selected and \( b \) is its complement. The density function in (3) resembles a \( Beta\left( {a + 1, b + n - 1} \right) \) distribution. A weighted CART tree \( {\Im }\left( {\widehat{\beta }_{m} :x \in R_{m} } \right) \) can therefore be obtained using weights \( \omega_{i} \sim Beta\left( {a + 1, b + n - 1} \right) \) for all \( i \in n \). Similarly, for the subsampling of \( l < p \) features, we propose Bayesian Simple Random Sampling Without Replacement (BSRSWOR), with posterior density given in (4):
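The following sketch illustrates one way the BSRSWR posterior could be used in practice: per-observation weights are drawn from \( Beta(a + 1, b + n - 1) \) and passed to a CART tree as sample weights in place of bootstrap resampling. The prior values, simulated data and use of scikit-learn are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(1)
n, p = 100, 50
X = rng.normal(size=(n, p))
y = (X[:, 0] > 0).astype(int)

a, b = 1.0, 1.0                              # illustrative prior counts
omega = rng.beta(a + 1, b + n - 1, size=n)   # one BSRSWR weight per observation

# Weighted CART tree: Bayesian weights replace the bootstrap resample
tree = DecisionTreeClassifier(random_state=0)
tree.fit(X, y, sample_weight=omega)
```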

$$ P\left( V \mid h, l, p, S, T \right) = \frac{\binom{S + V}{S + 1}\binom{T + p - V}{T + l - h}}{\binom{S + T + p + 1}{S + T + l + 1}},\quad h \le V \le p - l + h $$
(4)

where \( V \) is the number of relevant features whose posterior is sought, \( h \) is the observed number of relevant features in the subsample, \( p \) is the total number of features, \( l \) is the number of subsampled features as in RF, \( S \) is the prior number of relevant features and \( T \) is the prior number of irrelevant features. If we denote the posterior density in (4) by \( \delta \), we can use \( \delta \) to obtain a weighted splitting procedure in which the impurity used at every splitting stage is weighted by \( \delta \). For the Gini index impurity, we propose the weighted Gini index \( \vartheta \):
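A minimal sketch of evaluating the posterior mass in (4) is given below, using binomial coefficients from SciPy; the particular values of V, h, l, p, S and T are arbitrary illustrations, and how \( \delta \) is assigned to each candidate variable is not prescribed here.

```python
from scipy.special import comb

def bsrswor_posterior(V, h, l, p, S, T):
    """Posterior mass of V relevant features given h relevant features observed
    in a subsample of l out of p features, with prior counts S (relevant)
    and T (irrelevant), following Eq. (4)."""
    num = comb(S + V, S + 1) * comb(T + p - V, T + l - h)
    den = comb(S + T + p + 1, S + T + l + 1)
    return num / den

# Illustrative delta weight for one candidate splitting variable
delta = bsrswor_posterior(V=10, h=3, l=7, p=50, S=5, T=45)
print(delta)
```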

$$ \vartheta = \sum_{k = 1}^{K} \left( 1 - \delta \right)\widehat{p}_{mk} \left( 1 - \widehat{p}_{mk} \right) $$
(5)

where \( \widehat{p}_{mk} \) is the estimated class probability at node \( m \). A variable with weight \( \delta \to 1 \) will correspond to a variable with a minimal unweighted Gini index and is therefore useful for further splitting. If, on the other hand, \( \delta \to 0 \), the variable is not useful and is expected to yield a maximal unweighted Gini index. In this case, the proposed weighted Gini index returns the unweighted Gini index, so that the variable is dropped at the splitting stage.
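A direct translation of (5) into code is sketched below; the node labels and \( \delta \) values are illustrative only, showing how a large \( \delta \) shrinks the weighted index towards zero while \( \delta = 0 \) recovers the unweighted Gini index.

```python
import numpy as np

def weighted_gini(y, delta):
    """Weighted Gini index of Eq. (5): (1 - delta) * sum_k p_k (1 - p_k)."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return (1.0 - delta) * float(np.sum(p * (1.0 - p)))

node_labels = np.array([0, 0, 0, 1, 1, 1, 1, 1])
print(weighted_gini(node_labels, delta=0.9))  # relevant variable: index shrinks towards 0
print(weighted_gini(node_labels, delta=0.0))  # irrelevant variable: unweighted Gini returned
```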

4 Application to Cancer Datasets

In this section we illustrate the application of Bayesian Random Forest (BRF) on published real data. We use the “bake-off” approach of [7] to compare the predictive performance of BRF with competing methods on 10 real cancer datasets. Table 1 presents the datasets, which are a subset of the 22 datasets in the R package “datamicroarray” [13]. For each of the \( 10 \) datasets, we created \( 10 \) independent train/test splits by randomly selecting \( 9/10 \) of the data as a training set and the remaining \( 1/10 \) as a test set. Thus, \( 10 \times 10 = 100 \) train/test splits were created (Fig. 1).
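The split-generation scheme can be sketched as follows; the simulated array stands in for one microarray dataset, since loading “datamicroarray” itself is outside the scope of this illustration.

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 2000))   # toy stand-in for one high-dimensional cancer dataset
y = rng.integers(0, 2, size=60)

# 10 independent splits: 9/10 of the samples for training, 1/10 for testing
splitter = ShuffleSplit(n_splits=10, test_size=0.1, random_state=0)
splits = list(splitter.split(X))
print(len(splits), len(splits[0][0]), len(splits[0][1]))  # 10 splits, 54 train, 6 test
```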

Table 1 The 10 datasets used in the bake-off and their associated dimensions
Fig. 1 Boxplots of test misclassification error rate (MER) for the five methods over the 100 train/test partitions. BRF has the lowest MER, with 25% of its MER values equal to zero; the absence of outlying points also indicates that BRF is more stable than its competitors

Based on each training set, each method was used to predict the corresponding test set and was evaluated on the basis of its predictive misclassification error rate and accuracy. The competing methods used alongside BRF are Random Forest (RF), Bayesian Forest (BF), Gradient Boosting Machine (GBM) and Bayesian Additive Regression Trees (BART). Of the five methods compared, GBM is, apart from RF itself, the only frequentist method, and it is a major competitor of RF within the same class of ensemble classifiers [12].
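A hedged sketch of the evaluation loop is shown below for a single split. Only RF and GBM are included because they have standard scikit-learn implementations; BRF, BF and BART would be scored in exactly the same way with their own implementations.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
X = rng.normal(size=(60, 500))
y = (X[:, 0] > 0).astype(int)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.1, random_state=0)

models = {"RF": RandomForestClassifier(n_estimators=200, random_state=0),
          "GBM": GradientBoostingClassifier(random_state=0)}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    acc = accuracy_score(y_te, model.predict(X_te))
    print(f"{name}: accuracy = {acc:.3f}, MER = {1 - acc:.3f}")
```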

5 Discussion and Conclusion

In this paper, we have established a weakness of RF and a possible way to improve it by formulating a probabilistic approach to tree sampling and split selection. We demonstrated the applicability of the method using 10 real cancer datasets. The individual and overall results in Table 2 show that, in almost all the datasets used, BRF accuracy is relatively higher than that of its competitors. The results further show that, for every dataset used, BRF accuracy is bounded below by RF accuracy; that is, BRF accuracy will in most cases exceed RF accuracy and will never fall below it. Therefore, it can be concluded that the Bayesian weighting scheme developed here indeed corrects the identified weakness of RF.

Table 2 Accuracy of the methods on each of the 10 datasets and overall