1 Introduction

Machine learning (ML) methods are commonly applied in today’s data-driven society, supporting people’s daily decisions in many areas (medicine, business, loan approvals, and health services, to cite just a few). However, algorithms implementing ML methods are inherently complex and typically produce “black box” models that resist explanation and interpretation by experts and non-experts alike. The terms interpretability and explainability are often used interchangeably by researchers. However, even though the two terms are closely related, some researchers distinguish the concepts, albeit without any concrete mathematical definition. Attempts to clarify the difference between the two terms have been made by Doshi-Velez and Kim (2017), Gilpin et al. (2018) and Lipton (2018). Interpretability is mostly connected with the intuition behind the outputs of a model: the more interpretable a machine learning system is, the easier it is to identify cause-and-effect relationships between the system’s inputs and outputs. Explainability, on the other hand, is associated with the internal logic and mechanics of a machine learning system. The more explainable a model is, the deeper the understanding humans achieve of the internal procedures that take place while the model is trained or makes decisions.

Several studies, mainly from the past few years, have presented different techniques for making ML models more interpretable as well as more explainable (see e.g. Lisboa 2013; Ribeiro et al. 2016; Guidotti et al. 2018; Haddouchi and Berrado 2018; Burkart and Huber 2021). Some of these techniques aim at providing local or global interpretability, while others focus on developing locally or globally explainable models. In general, the goal of local approaches is to understand the rationale of a single prediction without necessarily understanding the entire model structure. Conversely, global approaches allow for the understanding of the whole logic of a model and the entire reasoning that leads to all of its possible outcomes.

Explainability is demanded in all fields where there is a mismatch between what a model can explain and what a decision-maker understands about it. As stated in Štrumbelj et al. (2010), many domains nowadays involve critical decisions that demand explainability. Moreover, in many fields of research it is not enough to have explanations at the local level; explanations at the global level may also be needed (Molnar et al. 2020).

Explainable ML algorithms aim at overcoming such limitations by making ML models and their predictions explainable to users. Indeed, the ubiquitous presence of ML models has highlighted that model explainability is as important as predictive performance (Shmueli 2010; Ribeiro et al. 2016; Lundberg and Lee 2017). Improved explainability of ML is highly useful and can increase the confidence of application and domain experts (often non-ML experts), who are the key decision-makers in adopting ML. A further advantage is the easier evaluation of results when the ML method has to be changed or tuned to new data or decision needs. Hence, model explainability is both an open challenge and an opportunity for researchers.

As highlighted in Dvořák (2019), within supervised ML algorithms a single decision tree or an ensemble of trees is well known to be a good choice thanks to the conceptual simplicity of tree methods. To improve the interpretability of a decision tree, Iorio et al. (2019) proposed growing an informative binary tree in which pruning and decision tree selection can be carried out visually. Decision trees are the building blocks of the Random Forest (RF) model which, among ML algorithms, is a well-known “black box” model.

As proposed by Breiman (2001), RF is a powerful ensemble method for classification thanks to its high accuracy, robustness, and ability to provide insights by ranking its features. The approach essentially consists of a set of decision trees, obtained through the bagging algorithm without pruning, that produce “forests” of classifiers voting for a particular class. The larger the forest, the more precise its predictions. However, this comes at a cost: it becomes increasingly difficult to explain why an RF made a specific choice. For this reason, the RF model is not interpretable through functional parameters and forms, making it unsuitable for application domains in which the decision-maker must be able to understand the causal mechanisms that generate the output and justify the choices.

In the literature, several works have addressed the problem of the interpretability of RF models (see e.g. Zhao et al. 2018; Plumb et al. 2018; Tan et al. 2020). A survey of methods proposed for the interpretability of RF models can be found in Haddouchi and Berrado (2019). Recently, Aria et al. (2021) compared a series of approaches aimed at providing interpretation for an RF model. In any case, RF models are not explainable, so it is difficult to understand how or why they arrived at a certain decision. This lack of explainability means that the model simply has to be trusted and its outcomes accepted as they are. RF is highly complex when compared with a decision tree, where decisions can be made by following the path of the tree.

To the best of our knowledge, there are only two approaches in the literature to obtaining explainability from an RF model. The first one uses extra information, such as variable importance, to analyze the relationships between features and predictions (Friedman et al. 2001; Liaw and Wiener 2002; Grömping 2009; Louppe et al. 2013; Ehrlinger 2015). The second approach consists of building a surrogate model capable of reducing the size of the RF in terms of the number of decision trees (Chipman et al. 1998; Wang and Zhang 2009).

To overcome this lack of explainability, here we focus on the global explanation of an RF, i.e. an explanation that characterizes the whole data set. Our approach merges the advantages of both decision trees and RF models, preserving the easy visualization of the former and the accurate decision rules of the latter. More specifically, we exploit decision tree properties that are useful when the goal is to provide global (per data set) information about the relevance of input variables for predicting a certain output (for instance, disease progression). The main idea is to use the final RF output to create a new model that graphically reveals the entire relationship path within a data set. More specifically, we introduce Explainable Ensemble Trees (E2Tree), aimed at generating a “new tree” that can explain and represent the relational structure between the response variable and the predictors. This leads to a tree structure similar to that obtained for a decision tree, exploiting the advantages of a dendrogram-like output. Although the RF mechanism is versatile enough to deal with both supervised classification and regression tasks, here we focus on classification.

The paper is organized as follows. Section 2 offers a theoretical background. In Sect. 3 we propose the Explainable Ensemble Tree. Experimental results are shown in Sect. 4. Finally, some concluding remarks are offered in Sect. 5.

2 Theoretical background

Classification and regression trees (CART) are supervised learning techniques that use a non-parametric approach (Breiman et al. 1984). Such an approach makes no assumption about the distributional form of the underlying relationships between the predictor variables and the response. For classification problems, the prediction of a node corresponds to the class with the highest frequency, while for regression problems it corresponds to the mean value of the node. The CART approach is useful for interpretation thanks to its easy visualization, but it is not competitive in terms of accuracy with other existing regression and classification techniques in the field of supervised methods. However, by aggregating many decision trees, predictive performance can be substantially improved.
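As a minimal illustration of the CART prediction rule for classification described above, the following sketch fits a single classification tree in R using the rpart package on the built-in iris data (this snippet is illustrative and not part of the original study):

```r
# Minimal CART sketch (illustrative): each terminal node predicts the class
# with the highest frequency among the training observations falling into it.
library(rpart)

set.seed(1)
cart_fit <- rpart(Species ~ ., data = iris, method = "class")

print(cart_fit)                                            # tree structure and node frequencies
predict(cart_fit, newdata = iris[1:5, ], type = "class")   # predicted classes for new observations
```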

Ensemble Learning methods increase predictive performance by combining the outputs of multiple base learners into a single predictive model that has the purpose of decreasing variance, altering bias, and improving predictions.

To overcome the predictive disadvantage of a single decision tree, Breiman (1996) proposed Bootstrap Aggregation, known as Bagging, a non-linear approach with the goal of achieving greater accuracy by averaging multiple decision trees, each of which is grown on a bootstrap sample. Bagging aims to reduce the variance of a statistical model: it simulates the variability of the data through the random extraction of bootstrap samples of observations from a single training set and aggregates the predictions on new instances. The trees are grown deeply and no pruning phase is carried out. This implies low bias but also high variance, which is then reduced by averaging the generated trees. For classification problems, the final prediction is obtained through a majority vote, i.e. the class with the highest frequency among the tree predictions is assigned.
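The mechanism just described, bootstrap resampling, deep unpruned trees, and aggregation by majority vote, can be sketched in a few lines; the toy implementation below (using rpart and the iris data, not the routine used in the paper) is only meant to make the procedure concrete.

```r
# Toy bagging sketch (illustrative only): grow B deep, unpruned trees on
# bootstrap samples and aggregate their predictions by majority vote.
library(rpart)

set.seed(123)
B     <- 100
n     <- nrow(iris)
votes <- matrix(NA_character_, nrow = n, ncol = B)

for (b in seq_len(B)) {
  boot <- sample(n, n, replace = TRUE)                          # bootstrap sample
  fit  <- rpart(Species ~ ., data = iris[boot, ], method = "class",
                control = rpart.control(cp = 0, minsplit = 2))  # deep tree, no pruning
  votes[, b] <- as.character(predict(fit, newdata = iris, type = "class"))
}

# Majority vote: the class with the highest frequency among the B tree predictions
bagged <- apply(votes, 1, function(v) names(which.max(table(v))))
mean(bagged == iris$Species)   # resubstitution accuracy of the bagged ensemble
```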

An evolution of Bagging is the RF model (Breiman 2001). It is known for the intuitiveness of its building process, and it is one of the most accurate machine learning methods for both predictive and analytical purposes.

RF combines multiple models by randomly selecting the set of statistical units on which each tree is estimated and also changing the set of predictors used at each splitting step. By averaging several decision trees, the purpose of RF is to obtain greater accuracy starting from a non-linear approach. Each tree is grown in two steps: first, a bootstrap sample is used to train the tree; second, a random subset of variables is used to generate the splits, i.e. at each internal node \(v\approx \sqrt{p}\) features are selected, where p is the number of predictors in the data set. Hence, unlike bagging, where \(v=p\) features are considered at each split, RF does not consider all available predictors at each split.
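In practice, assuming the randomForest R package, this difference reduces to the mtry argument: setting it equal to p reproduces bagging, while setting it to about \(\sqrt{p}\) gives the RF behaviour described above (a hedged sketch, not the configuration used later in the paper):

```r
# Sketch: bagging vs random forest differ only in the number of candidate
# features evaluated at each split (the mtry argument).
library(randomForest)

set.seed(123)
p <- ncol(iris) - 1                                   # number of predictors

bag_fit <- randomForest(Species ~ ., data = iris,
                        ntree = 500, mtry = p)               # v = p       -> bagging
rf_fit  <- randomForest(Species ~ ., data = iris,
                        ntree = 500, mtry = floor(sqrt(p)))  # v ~ sqrt(p) -> random forest

c(bagging_mtry = bag_fit$mtry, rf_mtry = rf_fit$mtry)
```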

To understand the intuition behind the RF model, suppose that the data set contains one very strong predictor or several sufficiently strong predictors. If all predictors (or most of them) were considered in the splitting phase, as happens in bagging, all the trees would be similar to each other and their predictions highly correlated. In that case, there would be no substantial reduction in variance with respect to a single decision tree, since averaging many highly correlated trees does not reduce variance as much as averaging many uncorrelated trees. Instead, by proceeding with a random set of variables, on average \((p - v)/p\) of the splits will not consider the strong predictor, thus allowing other predictors to be selected. The result is a less variable and more reliable average of trees, i.e. a process of tree decorrelation.
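This decorrelation effect can be checked empirically. The sketch below is an informal experiment (synthetic data with one dominant predictor, assuming the randomForest package) that measures how often pairs of trees agree on their individual predictions under bagging (mtry = p) and under RF (mtry ≈ \(\sqrt{p}\)):

```r
# Informal check of tree decorrelation: average pairwise agreement between
# individual tree predictions tends to be lower when mtry is small.
library(randomForest)

set.seed(42)
n       <- 300
x1      <- rnorm(n)                              # one very strong predictor
x_noise <- matrix(rnorm(n * 9), ncol = 9)        # nine weak predictors
y       <- factor(ifelse(x1 + rnorm(n, sd = 0.5) > 0, "A", "B"))
dat     <- data.frame(y, x1, x_noise)

pairwise_agreement <- function(mtry_value, ntree = 100) {
  fit   <- randomForest(y ~ ., data = dat, ntree = ntree, mtry = mtry_value)
  preds <- predict(fit, newdata = dat, predict.all = TRUE)$individual  # per-tree predictions
  agree <- combn(ntree, 2, function(pair) mean(preds[, pair[1]] == preds[, pair[2]]))
  mean(agree)                                    # average agreement over all pairs of trees
}

pairwise_agreement(10)               # v = p: bagging, trees dominated by x1
pairwise_agreement(floor(sqrt(10)))  # v ~ sqrt(p): decorrelated trees
```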

RF has several advantages: robustness against overfitting and against perturbations of the response variable; few statistical assumptions; insensitivity to the addition of variables uncorrelated with the response; useful estimates of the error, the correlation, and the variable importance; the ability to handle different types of features; and few hyperparameters, among which the most important are the number of trees in the forest and the number of variables chosen randomly at each split.

In a comparison of ten supervised learning algorithms, Caruana and Niculescu-Mizil (2006) and Caruana et al. (2008) show that the RF is the first and second-best method in terms of prediction accuracy for low-dimensional and high-dimensional problems, respectively.

3 Explainable Ensemble Trees: the key idea

The key idea of Explainable Ensemble Trees (E2Tree) consists of defining an algorithm to represent an RF model using a single tree-like structure. The goal is to explain the results of the RF algorithm while preserving its level of accuracy, which typically outperforms that of a single decision tree. We want to emphasize that our contribution is new in the field of model explainability for ensemble tree algorithms. The proposed method is based on identifying a tree-like structure that explains the classification or regression paths summarizing the whole ensemble process. E2Tree has two main advantages: (i) it builds an explainable tree that preserves the predictive performance of an RF model, and (ii) it allows the decision-maker to work with an intuitive structure (such as a tree-like structure).

In this work, we focus on RF but the algorithm can be generalized to every ensemble approach based on decision trees.

Let H be an RF model composed of B weak learners

$$\begin{aligned} H = \Big\{ H_{1},H_{2},\ldots ,H_{b},\ldots ,H_{B} \Big\}. \end{aligned}$$

Let \(\{ (X_{1},Y_{1}),\ldots ,(X_{n},Y_{n}) \}\) be a sample of size n used to train the RF.

Let \(U_{i} = \{Y_{i},X_{i}\}\) be the vector associated with the i-th observation.

The algorithm starts from the definition of a dissimilarity matrix \(D=\{d_{ij}\}\), whose generic element is

$$\begin{aligned} d_{ij} = 1 - O_{ij}, \end{aligned}$$

where \(O_{ij}\) is a weighted co-occurrence measure between \(u_{i}\) and \(u_{j}\) along the sequence of weak learners \(H_{1},\ldots ,H_{B}\). Hence, \(d_{ij}\) measures the dissimilarity (or the “discordance”) between two observations with respect to a given classifier H. In other words, \(O_{ij}\) measures the proportion of times \(u_{i}\) and \(u_{j}\) co-occur in a terminal node along the sequence of B weak learners, and is defined as:

$$\begin{aligned} O_{ij} = \frac{\sum _{b=1}^{B}I\big(u_{i} \wedge u_{j}\big) \cdot W_{t_{ij}|b}}{\max \big(\sum _{b}u_{i}W_{t_{i}|b};\sum _{b}u_{j}W_{t_{j}|b}\big)}, \end{aligned}$$

where \(W_{t_{ij}|b}\) is a local goodness of fit measure of the node \(t_{ij}|b\).

If \(I(u_{i} \wedge u_{j})=1\), then \(t_{ij}|b\) exists and it represents the terminal node t of the weak learner b that contains i and j. Each weak learner b can contain only one terminal node \(t_{ij}|b\) so the summation with respect to b indicates that, along the ensemble sequence for each tree b, only \(t_{ij}|b\) nodes will be considered when they exist.

By definition \(0 \le O_{ij} \le 1\) and \(0 \le W_{t_{ij}|b} \le 1\).

Here, we set \(W_{t_{ij}|b} =R_{t}\), where \(R_{t}\) is the correct classification rate at a generic node t, defined as:

$$\begin{aligned} R_t=\sum _{i=1}^{n_t}\frac{I\big(y_{i} = \hat{y}_t\big)}{n_t}, \end{aligned}$$

\(n_t\) is the size of node t and \(\hat{y}_t\) is the classification returned by the RF algorithm in the same node. The goal is to build a single tree that explains the information contained in D through the p predictors. The innovative contribution of our proposal lies in learning a tree that represents, in a single, explainable, logical, and graphical structure, the aggregated classification obtained along the ensemble procedure. The matrix D summarizes the aggregated classification in a new way. Indeed, at each node, the values of O measure the strength with which a set of observations must be forced to fall into the same node during the construction of the tree, minimizing the values of D (or, equivalently, maximizing the values of O).
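To make the construction of D concrete, the following sketch re-implements the quantities above in R for a fitted RF. It assumes the randomForest package (whose predict method can return the terminal-node indicators) and sets the weights \(W_{t_{ij}|b}\) equal to the node correct-classification rate \(R_t\), as described above; it is an illustrative re-implementation, not the authors' released routine.

```r
# Sketch: weighted co-occurrence matrix O and dissimilarity matrix D = 1 - O
# for a fitted random forest (illustrative, O(n^2 * B) double loop).
library(randomForest)

set.seed(1)
X  <- iris[, 1:4]
y  <- iris$Species
rf <- randomForest(X, y, ntree = 200)

# n x B matrix of terminal-node ids: nodes[i, b] is the leaf of tree b
# into which observation i falls
nodes <- attr(predict(rf, X, nodes = TRUE), "nodes")
n <- nrow(X); B <- ncol(nodes)

# W[i, b] = R_t of the leaf of tree b containing observation i
W <- matrix(0, n, B)
for (b in seq_len(B)) {
  for (t in unique(nodes[, b])) {
    in_t  <- which(nodes[, b] == t)
    y_hat <- names(which.max(table(y[in_t])))   # class predicted in node t
    W[in_t, b] <- mean(y[in_t] == y_hat)        # R_t: correct classification rate
  }
}

# O[i, j]: weighted co-occurrence of i and j in the same leaf along the B trees
O <- matrix(0, n, n)
for (i in seq_len(n)) {
  for (j in seq_len(n)) {
    co      <- nodes[i, ] == nodes[j, ]         # trees where i and j share a leaf
    O[i, j] <- sum(W[i, co]) / max(sum(W[i, ]), sum(W[j, ]))
  }
}
D <- 1 - O   # dissimilarity matrix used to grow E2Tree
```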

To build E2Tree, we define a set of stopping rules that retrace the classical rules of tree construction. In particular, we consider thresholds to the size and impurity of the terminal nodes and the complexity of the structure.

It is well known that an RF tends to have many non-informative splits because of the random selection of the splitting variable. We might therefore expect that, as the number of nodes in a tree increases, the dissimilarity values \(d_{ij}\) between two observations (i, j) with respect to a given classifier tend more and more strongly to one. To overcome this problem, the proposed algorithm includes an additional stopping rule. Once the best split with respect to dissimilarity has been identified, we check that the two child nodes have different response classes and a correct classification rate below a certain threshold (e.g., \(95\%\) or \(99\%\)). In this way, we avoid non-informative branches that would result in overfitting.

More formally, let \(s \in S\) be a candidate split for a variable X that splits t into left and right child nodes \(t_{L}\) and \(t_{R}\) according to whether \(X \le s\) or \(X > s\). These two nodes will be denoted by \(t_{L} = \{U \in t : X \le s\}\) and \(t_{R} = \{U \in t : X > s\}\).

In order to maximize the difference in discordance between the parent and the child nodes, we define the following measures:

  1. \(\bar{d}_{t}\) is the impurity measure at node t: \(\bar{d}_{t} = \sum _{i}\sum _{j}\frac{d_{ij}}{n_{t} \cdot (n_{t}-1)}\)

  2. \(\bar{d}_{t \mid s}\) is the pooled impurity measure of the child nodes generated by s at node t

    $$\begin{aligned} \bar{d}_{t \mid s}&= \Bigg \{\Bigg [\Bigg ( \frac{\sum ^{n_{t_L}}_{i} \sum ^{n_{t_L}}_{j}d_{ij}}{n_{t_{L}}\cdot (n_{t_{L}}-1)} \cdot \frac{n_{t_{L}}}{n_{t}}\Bigg ) + \Bigg ( \frac{\sum ^{n_{t_R}}_{i} \sum ^{n_{t_R}}_{j}d_{ij}}{n_{t_{R}}\cdot (n_{t_{R}}-1)} \cdot \frac{n_{t_{R}}}{n_{t}} \Bigg )\Bigg ] \Bigg | s \Bigg \} \\&= \Bigg \{ \Bigg [\frac{1}{n_{t}} \cdot \Bigg (\frac{\sum ^{n_{t_{L}}}_{i} \sum ^{n_{t_{L}}}_{j}d_{ij}}{(n_{t_{L}}-1)} + \frac{\sum ^{n_{t_{R}}}_{i} \sum ^{n_{t_{R}}}_{j}d_{ij}}{(n_{t_{R}}-1)}\Bigg ) \Bigg ] \Bigg | s \Bigg \}. \end{aligned}$$
  3. \(\varDelta \bar{d}_{t\mid s}\) is the decrease of impurity at node t split by s

    $$\begin{aligned} \varDelta \bar{d}_{t\mid s} = \big(\bar{d}_{t}-\bar{d}_{t \mid s}\big)=\Big\{\bar{d}_{t} - \Big[\Big(\bar{d}_{t_L} \cdot p_{t_{L}} + \bar{d}_{t_{R}} \cdot p_{t_{R}}\Big) \Big|s\Big]\Big\}, \end{aligned}$$

    where \(s \in S\). Note that \([(\bar{d}_{t_{L}} \cdot p_{t_{L}} + \bar{d}_{t_{R}} \cdot p_{t_{R}}) |s] = \bar{d}_{t |s}\). Then, it is straightforward that

    $$\begin{aligned} \max _{s\in S}\varDelta \bar{d}_{t \mid s} \equiv \min _{s \in S}\Big\{ \Big[\Big(\bar{d}_{t_L}\cdot p_{t_L}+\bar{d}_{t_R}\cdot p_{t_R}\Big) \Big|s\Big]\Big\}. \end{aligned}$$

The best split selected at node t, denoted by \(s^*\), maximizes the decrease of impurity \(\varDelta \bar{d}_{t\mid s^*}\).
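A compact sketch of this split search in R (illustrative; it assumes the dissimilarity matrix D computed in the earlier snippet and takes the observed values of a single numeric predictor as candidate splits) could look as follows:

```r
# Sketch of the split search at a node t: compute the node impurity, the
# pooled impurity of the candidate children, and pick the split s* that
# maximizes the decrease of impurity.
node_impurity <- function(D, idx) {               # \bar{d}_t
  n_t <- length(idx)
  if (n_t < 2) return(0)
  sum(D[idx, idx]) / (n_t * (n_t - 1))
}

best_split <- function(D, x, idx) {
  d_t  <- node_impurity(D, idx)
  cuts <- sort(unique(x[idx]))
  cuts <- cuts[-length(cuts)]                     # drop the maximum (empty right child)
  best <- list(s = NA, delta = -Inf)
  for (s in cuts) {
    left  <- idx[x[idx] <= s]
    right <- idx[x[idx] >  s]
    # pooled impurity of the children, weighted by their relative sizes
    d_ts  <- node_impurity(D, left)  * length(left)  / length(idx) +
             node_impurity(D, right) * length(right) / length(idx)
    delta <- d_t - d_ts                           # decrease of impurity
    if (delta > best$delta) best <- list(s = s, delta = delta)
  }
  best
}

# Example call (assuming D computed as above):
# best_split(D, iris$Petal.Length, seq_len(nrow(iris)))
```

In the full algorithm this search is repeated over all predictors, and the stopping rules described above decide whether the winning split is actually applied.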


4 Illustrative examples

To evaluate the performance of our proposal, we applied it to freely available data sets from the UCI Machine Learning Repository (https://archive.ics.uci.edu/ml/index.php).

Data analysis was performed using the R software with our routines, which are available at the GitHub repository https://github.com/massimoaria/eTree.

Following Guyon et al. (1997), each data set was split into a training sample (\(70\%\)) and a testing sample (\(30\%\)). We set the number of trees in the forest to 1000 and the number of randomly selected features/variables evaluated at each tree node to 2. The data sets were selected to exhibit different levels of class imbalance and to contain a mixture of discrete and continuous features.
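For reproducibility, a sketch of this setup in R is given below (the iris data, available in base R, are used as an example; the randomForest package is an assumption, since the benchmark forest could be fitted with any RF implementation):

```r
# Sketch of the experimental setup: 70/30 train-test split, 1000 trees,
# 2 candidate features evaluated at each split.
library(randomForest)

set.seed(2021)
n         <- nrow(iris)
train_idx <- sample(n, size = floor(0.7 * n))     # 70% training sample
train     <- iris[train_idx, ]
test      <- iris[-train_idx, ]

rf_benchmark <- randomForest(Species ~ ., data = train,
                             ntree = 1000,        # trees in the forest
                             mtry  = 2)           # features evaluated at each node

mean(predict(rf_benchmark, test) == test$Species) # test accuracy of the benchmark RF
```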

In all the tables, t denotes the node number, following the approach proposed by the authors of the statistical software SPAD (Cisia Institute, France); \(n_t\) is the number of instances in node t; \(\hat{Y}_t\) (“Class”) is the predicted category, i.e. the category with the highest proportion of observations in the node; \(R_t\) is the correct classification rate of node t; \(\bar{d}_{t}\) is the impurity measure at node t as defined in the previous section; \(s^*\) is the best split selected at node t; and the entire path of the tree is denoted by “Path”.

In all the figures depicting the “Explainable Ensemble Tree”, the terminal nodes are denoted by square boxes, while non-terminal nodes are denoted by circles.

4.1 Iris data

The Iris data (Fisher 1936) are a standard benchmark set and have frequently been used to illustrate new multivariate techniques. The data have a well-known structure and are thus useful to see how well this structure is recovered.

In this data set, there are four measurements on 50 plants from each of three species of Iris: Iris setosa, Iris versicolor, and Iris virginica. The setosa plants are well separated from the others, but there is some overlap between the versicolor and virginica species. Each iris has four measurements: sepal length, sepal width, petal length, and petal width. Tables 1 and 2 show the results of our proposal. As can be noticed by looking at both tables, our proposal highlights the entire path, also offering, for each node, the most important information for predictive purposes. Table 1 indicates the importance of the splits, showing how the nodes with the lower (higher) impurity measure contribute to the higher (lower) correct classification rate. Table 2 reports the path of each prediction made by the RF algorithm. The main advantage of the proposed E2Tree is depicted in Fig. 1, which offers, at first sight, the importance of the splits, showing the relational structure between the response variable and the predictors and their interactions. This represents an important advantage, since no visual information is provided by an RF model.

Fig. 1: Iris data set: path visualization of the Explainable Ensemble Trees (E2Tree)

Table 1 Iris data set: results of the Explainable Ensemble Trees (E2Tree)
Table 2 Iris data set: paths of the Explainable Ensemble Trees (E2Tree)

4.2 Breast cancer wisconsin data

This data set is used to classify a set of breast cancer patients. The characteristics are computed from a digitized image of a needle aspiration of a breast mass, taken from 569 patients with or without cancer, and describe the cell nuclei present in the image. Ten real-valued features are computed for each cell nucleus. The mean, standard error, and “worst” or largest (mean of the three largest values) of these features were computed for each image, resulting in 30 features. The goal is to classify these cell nuclei as benign or malignant. One of the most important features of E2Tree is the possibility to provide a clear interpretation of the “If-Then” pathways that identify how interactions among predictors define the final classification in each terminal node. This is possible both in a graphical and in a tabular representation. Figure 2 shows a graphical representation of the E2Tree structure, while Tables 3 and 4 show the results for this data set.

Looking at Fig. 2, individuals who have a lesion characterized by a “perimeter worst” greater than 110.3 (t = 3), a “standard error area” less than or equal to 43.4 (t = 6), and a “concavity mean” greater than 0.101 (t = 13) are classified as malignant with a correct classification rate of 98%. Following the same pathway, in the last split, when the “concavity mean” is less than or equal to 0.101, the malignant status is diagnosed with a correct classification rate of 60%, while the benign status is diagnosed with 40%. In this case, an expert in the research domain should carefully examine the status of the patient.
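As an illustration, the pathway just described can be written as an explicit “If-Then” rule; the function and argument names below are hypothetical, and the thresholds are those reported above and in Fig. 2.

```r
# Illustrative encoding of one E2Tree pathway for the breast cancer data
# (hypothetical function; thresholds taken from the text).
classify_lesion <- function(perimeter_worst, area_se, concavity_mean) {
  if (perimeter_worst > 110.3 &&      # t = 3
      area_se        <= 43.4  &&      # t = 6
      concavity_mean  > 0.101) {      # t = 13
    return("malignant")               # correct classification rate of about 98%
  }
  NA_character_                       # other pathways of the tree are not encoded here
}

classify_lesion(perimeter_worst = 120, area_se = 30, concavity_mean = 0.15)
```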

Fig. 2: Breast data set: path visualization of the Explainable Ensemble Trees (E2Tree)

Table 3 Breast data set: results of the Explainable Ensemble Trees (E2Tree)
Table 4 Breast data set: path of the Explainable Ensemble Trees (E2Tree)

4.3 Indian diabetes data

The goal with this data set is to predict diagnostically whether a patient has diabetes, based on various diagnostic measurements. Owing to the constraints placed on the selection of the instances, all patients are women of Pima Indian origin, at least 21 years old. The features include the number of pregnancies the patient has had, her body mass index, insulin level, age, and so on. The data set contains 768 subjects described by eight variables.

Tables 5 and 6 and Fig. 3 show the results for this data set. As can be noticed, by leveraging the ability of RF algorithms to handle large data sets efficiently, our proposal offers a greater explanation of their results.

As for the previous data set, we describe some pathways to provide an example of how our proposal helps the decision-maker understand the relationships between the predictor interactions and the final classification in each terminal node. Patients with a BMI greater than 27.8 (t = 3), glucose less than or equal to 113 (t = 6), age less than or equal to 27 (t = 12), and BMI less than or equal to 31.6 (t = 24) are classified as non-diabetic with a correct classification rate of 93%.

Fig. 3: Diabetes data set: path visualization of the Explainable Ensemble Trees (E2Tree)

Table 5 Diabetes data set: results of the Explainable Ensemble Trees (E2Tree)
Table 6 Diabetes data set: path of the Explainable Ensemble Trees (E2Tree)

5 Conclusions

We have proposed Explainable Ensemble Trees (E2Tree), aimed at explaining and graphically representing the relationships and interactions between the variables used by an ensemble method, i.e. the random forest algorithm. The latter is a well-known supervised machine learning algorithm that creates many decision trees and combines the output of all trees. Compared to a decision tree, the main advantage of a random forest is the accuracy of prediction, but the main disadvantage is the lack of explainability of the final result. Nowadays, many domains involve critical decisions demanding explainability (e.g., medicine, finance, and marketing). To preserve the advantages of an ensemble method while also ensuring greater explainability, our proposal starts by defining a dissimilarity matrix measuring the discordance between two observations with respect to a given classifier of a random forest model. We then provide a dendrogram-like structure capable of explaining all the information contained in the output of a random forest algorithm. We want to stress that our proposal has no predictive purposes, since we aim to bridge the lack of interpretability typical of ensemble methods such as the random forest model. We also point out that we refer to accuracy in order to evaluate how well our proposal reflects the accuracy of any benchmark random forest. The proposed method ensures the following advantages: it represents the ensemble process through a single tree-like structure; it preserves the accuracy performance of the ensemble process; and it provides a powerful tool to explain the interactions among predictors determining the final node classifications. The explanatory power and the performance of our proposal were evaluated on real data sets from the UCI Machine Learning Repository. The results showed that E2Tree provides a global explanation of the model while preserving its predictive performance. Our further aim is to extend this proposal to a regression framework and to generalize the approach to other ensemble-tree methodologies.