1 Introduction

Prediction models are an important component of decision support systems. Applications range from credit scoring [11] and fraud detection [5] to financial auditing [4] and efficiency analysis [18]. In such applications, model interpretability is often as important as, if not more important than, prediction accuracy.

Models that are more difficult to interpret require additional post-processing to (a) obtain a better understanding of the model and (b) increase the end-user's level of trust in the model. The latter is especially important in risk-sensitive domains such as finance and medicine, where experts are reluctant to trust a prediction model's predictions without an additional explanation.

Most explanation methods are model-specific. Some models, such as decision and regression trees, rules [6, 8], and nearest neighbors-based methods, are self-explanatory. Complex models, such as artificial neural networks and SVM (support vector machines), have received the most attention, because they are often very successful but difficult to interpret (see [28] and references therein).

Linear regression and other additive models can be additionally explained by plotting the marginal effect of each input variable. For additive models, a prediction is the sum of individual marginal effects, which makes such visualizations a tool for graphical computation of predictions—a nomogram. This fact has been exploited to provide an explanation for the naive Bayes classifier [3, 17, 23], linear SVM [14], logistic regression [21], Cox regression models [15], and additive models in general [26].

This paper focuses on explanation methods that can be applied to prediction models of any type. Such general approaches must treat the model as a black box and are thus restricted to changing the inputs and observing the changes in the output. While not being able to exploit the specifics of the model, such methods have the advantage of being applicable to any type of model. This facilitates comparison of different types of models and, in practical applications, eliminates the need to replace the explanation method when the underlying model is replaced.

The key component of general explanations is the contributions of individual input features. A prediction is explained by assigning to each feature a number that denotes its influence. For each feature, such contributions can be aggregated to plot the feature's average contribution against the feature's value. This provides an overview of the model and is similar to plotting the marginal effect for an additive model.

We begin with a simple illustrative example—a linear regression model:

$$\begin{aligned} f(x) = f(x_1,\ldots ,x_n) \approx y = \beta _0 + \beta _1 x_1 + \cdots + \beta _n x_n. \end{aligned}$$

If the input features are standardized, the coefficient \(\beta _i\) can be interpreted as the \(i\)th feature’s global importance. However, in practice, we are more interested in how a particular value influences the prediction. We turn to the following expression

$$\begin{aligned} \varphi _i(x) = \beta _i x_i - \beta _i E[X_i], \end{aligned}$$
(1)

which is also known as the situational importance of \(X_i = x_i\) [1].

The situational importance is the difference between what a feature contributes when its value is \(x_i\) and what it is expected to contribute. If the situational importance is positive, the feature has a positive contribution (it increases the prediction for this particular instance); if it is negative, the feature has a negative contribution (it decreases the prediction); and if it is 0, it has no contribution.

To illustrate, observe the linear model \(f(x_1,x_2) = 2x_1 - 3x_2 + 4\), with both input features uniformly distributed on \([-1,1]\). How much do the two input features contribute to the prediction \(f(\frac{1}{2},\frac{1}{3})\)? For this instance, the situational importances of the first and second feature are \(1\) and \(-1\), respectively. Therefore, the first feature contributes positively and the second one negatively.
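As a minimal numerical check of this example (a short Python sketch; the variable names are ours, not from the paper), Eq. (1) can be evaluated directly:

```python
# Situational importance (Eq. 1) for f(x1, x2) = 2*x1 - 3*x2 + 4,
# with both features uniform on [-1, 1], so E[X_1] = E[X_2] = 0.
beta = [2.0, -3.0]            # beta_1, beta_2
expected = [0.0, 0.0]         # E[X_1], E[X_2]
instance = [0.5, 1.0 / 3.0]   # the instance <1/2, 1/3>

phi = [b * (x - e) for b, x, e in zip(beta, instance, expected)]
print(phi)  # [1.0, -1.0] (up to floating-point rounding)
```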

We can plot the average situational importance \(\psi \) of every value, to obtain an overview of how each feature contributes across all of its values (see Fig. 1). This plot not only shows how different feature values contribute, but it can also be used to semi-graphically compute the prediction for any instance. As previously mentioned, such a plot can be produced for any additive model, a fact that has been exploited in developing several model-specific explanation methods [14, 15, 21, 23, 26].

Fig. 1

The marginal effects of the two input features for the model \(f(x_1,x_2) = 2x_1-3x_2+4\). Both input features' functions are shown on the same graph. Such plots can be used for semi-graphical computation of the model's prediction for an arbitrary instance \(x = \langle x_1,x_2\rangle \). For each input feature, we use the plotted function to map its value from the \(x\)-axis to that value's contribution on the \(y\)-axis. All that remains is to sum the two contributions and the expectation \(E[f]=4\)

Computing the contributions for our illustrative example was simple, because the model is known and the features do not interact. Therefore, the contribution of some \(x_i\) is the same across all instances, regardless of the values of other features. In our problem setting, however, the model is unknown and no assumptions are made other than that the model maps from some known input feature space to a known codomain. These assumptions are necessary for the method to be general, but they restrict us to changing the inputs and observing the outputs.

Previous general approaches [19, 24] tackled the problem in the following way:

$$\begin{aligned} \varphi _i(x) = f(x_1,\ldots ,x_n) - E[f(x_1,\ldots ,X_i,\ldots ,x_n)] \end{aligned}$$
(2)

Equation (2) is the difference between a prediction for an instance and the expected prediction for the same instance if the \(i\)th feature had not been known. In practice, Eq. (2) is not difficult to approximate (or to compute exactly, if the feature's domain is finite): we perturb the values of the \(i\)th feature, while the values of the other input features remain fixed, and then average the predictions. Additionally, if \(f\) is an additive model, Eq. (2) is equivalent to Eq. (1), so in the case of an additive model, we do not lose any of the previously mentioned advantages associated with explaining an additive model.
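In code, approximating Eq. (2) amounts to perturbing a single feature and averaging the model's predictions. A minimal sketch (the function names and the choice of a sampling routine are ours, for illustration only):

```python
import random

def one_at_a_time_contribution(model, x, i, sample_feature_i, m=1000):
    """Approximate Eq. (2): f(x) minus the expected prediction when the
    i-th feature's value is replaced by a random draw from its distribution."""
    perturbed_sum = 0.0
    for _ in range(m):
        w = list(x)
        w[i] = sample_feature_i()   # perturb only the i-th feature
        perturbed_sum += model(w)
    return model(x) - perturbed_sum / m

# The linear model from above: the estimate is close to phi_1 = 1.
model = lambda z: 2 * z[0] - 3 * z[1] + 4
print(one_at_a_time_contribution(model, [0.5, 1 / 3], 0, lambda: random.uniform(-1, 1)))
```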

However, when the model is not additive, as is often the case in practical applications, the approach gives undesirable results. For example, observe the model \(f(x_1,x_2) = x_1 \vee x_2\), where both features are uniformly distributed on \(\{0,1\}\). When computing the contribution of the first feature for \(f(1,1) = 1\), we see that perturbing its value does not change the prediction, so the first feature's contribution is 0. The same holds for the second feature. Therefore, both features get a 0 contribution, which is undesirable.

We learn from the previous example that perturbing one feature at a time gives undesirable results and that all subsets of features have to be perturbed to avoid such issues. In our previous work, we proposed a general method for computing the situational importance for classification and, separately, for regression models that dealt with the aforementioned shortcomings of existing general explanation methods [27–29].

This paper builds on the method from our previous work. The general method for computing the situational importance of features is modified and extended to the marginal importance of feature values and to feature importance. Appropriate sampling algorithms are provided, and we discuss two extensions that improve the algorithm's efficiency. We also show that in the case of additive models, the proposed method is equivalent to the explanation commonly used for additive models. In the experimental part of the paper, we provide an empirical analysis of running times, illustrative examples, and the results of a controlled experiment on the usefulness of contribution-based instance explanations.

The remainder of the paper is organized as follows. In Sect. 2, we provide the essential background from [28] and [29]. In Sect. 3, we describe and improve the approximation algorithm for computing a feature’s contribution for an instance [28] and the average contribution of a feature’s value. Section 4 is dedicated to experimental results and visual inspection of instance and model visualizations. Section 5 concludes the paper.

2 Computing a feature’s contribution

The following notation will be used throughout the paper. Let \(\mathcal X = [0,1]^n\) be our feature space, \(Y\) the target variable, and \(\{y_i; x_{1,i},x_{2,i},\ldots ,x_{n,i} \}_{i = 1}^M\) a data set of \(M\) instances. The function \(f:\mathcal X \rightarrow \mathfrak R \) represents the model that is used to predict the value of the target variable for an instance \(x\in \mathcal X \). In the classification case, we take the one-vs-all approach—we choose and observe the prediction for a single class value. Any class value of interest can be chosen or explanations can be generated for several values. We will refer to the chosen value as the point-of-view class.

First, observe how a feature's value contributes in the case of a simple linear model. That is, let us assume for a moment that \(f\) takes the form

$$\begin{aligned} f(x)= \beta _0 + \beta _1 x_1 + \cdots + \beta _n x_n. \end{aligned}$$

The contribution of the \(i\)th feature’s value for some instance \(x \in \mathcal X \) is the difference between the model’s prediction and the expected prediction if the \(i\)th feature’s value is not known:

$$\begin{aligned} \varphi _i^{additive}(x)&= \beta _0 + \cdots + \beta _i x_i + \cdots + \beta _n x_{n} - (\beta _0 + \cdots + \beta _i E[X_i] + \cdots + \beta _n x_{n})\nonumber \\&= \beta _i (x_{i} - E[X_i]). \end{aligned}$$
(3)

The contribution in Eq. (3) is sometimes also referred to as the situational importance of \(x_{i}\) [1]. Observe how such a contribution is independent of the values of other features. This is due to the fact that the linear model is additive (that is, the features do not interact). This property makes the linear model and other additive models easy to interpret.

In practice, features often interact. To avoid the shortcomings of existing methods, we have to take into account every subset of features. We generalize Eq. (3) by first defining the model’s prediction conditional to only a subset of features’ values being known:

$$\begin{aligned} f_Q(x)=\mathbb E [f | X_i = x_{i}, \forall i \in Q], \end{aligned}$$
(4)

where \(Q \subseteq S = \{1,2,\ldots ,n\}\) is a subset of features. For an empty set, Eq. (4) reduces to \(f_{\{\}}(x)=\mathbb E [f]\). Equation (4) allows us to define the contribution of a subset of feature values:

$$\begin{aligned} \Delta _Q(x) = f_Q(x) - f_{\{\}}(x) . \end{aligned}$$
(5)

Equation (5) is the change in prediction caused by observing the values of a certain subset of features for some instance \(x \in \mathcal X \). To provide a contribution similar to the one for the linear model, we have to map these \(2^n\) terms into \(n\) contributions, one for each feature’s value. First, we implicitly define interactions by saying that the contribution of a subset of feature values is the sum of all interactions across all subsets of those feature values:

$$\begin{aligned} \Delta _Q(x) = \sum _{W \subseteq Q} \mathcal I _W(x) , \end{aligned}$$
(6)

which uniquely determines the interactions:

$$\begin{aligned} \mathcal I _Q(x) = \Delta _Q(x) - \sum _{W \subset Q} \mathcal I _W(x) . \end{aligned}$$
(7)

Finally, each interaction is divided among the participating feature values, which defines the contribution:

$$\begin{aligned} \varphi _i(x) = \sum _{W \subseteq S \setminus \{i\}} \frac{\mathcal{I }_{W \cup \{i\}}(x)}{|W| + 1}. \end{aligned}$$
(8)

This leads to the following explicit definition (see [28] for proof):

$$\begin{aligned} \varphi _i(x) = \sum _{Q \subseteq S \setminus \{i\}}\frac{|Q|! (|S|-|Q|-1)!}{|S|!}(\Delta _{Q \cup \{i\}}(x) - \Delta _{Q}(x)). \end{aligned}$$
(9)

Equation (9) is equivalent to the Shapley value [25], a concept from coalitional game theory. In a coalitional game, it is usually assumed that \(n\) players form a grand coalition that has a certain worth (in our case, \(\Delta _{S}\)). We also know how much each smaller (subset) coalition would have been worth (\(\Delta _{Q}, Q \subset S\)). The goal is to distribute the worth of the grand coalition among players in a fair way (that is, each player should receive his fair share, taking into account all sub-coalitions). The Shapley value is one such solution, and it is the unique solution that satisfies the following properties [25]:

  • \(\sum _{i=1}^n \varphi _i(x) = \Delta _S(x)\)

  • \(\forall W \subseteq S {\setminus } \{i\}: \Delta _W = \Delta _{W \cup \{i\}}\Rightarrow \varphi _i(x) = 0 \)

  • \(\forall W \subseteq S {\setminus } \{i,j\}: \Delta _{W \cup \{i\}} = \Delta _{W \cup \{j\}}\Rightarrow \varphi _i(x) = \varphi _j(x)\)

  • \(\forall x,y \in \mathcal X \): \(\varphi (x+y) = \varphi (x) + \varphi (y)\), where \(\Delta _Q(x+y) = \Delta _Q(x) + \Delta _Q(y)\) for all \(Q \subseteq S\)

In the context of our explanation with local contributions, the properties have the following interpretation. The contributions are implicitly normalized, which makes them easier to interpret and compare. If a feature's value does not have any impact on the prediction, it is assigned a 0 contribution. If two features' values have a symmetrical impact across all subsets, they are assigned equal contributions. Finally, the local contributions are additive across instances.
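For a small number of features, Eq. (9) can be evaluated exactly by enumerating all subsets. The following brute-force sketch (an illustration, not the paper's implementation) expects a precomputed table of \(\Delta \)-terms and reproduces the disjunction example from the Introduction, where both features should receive an equal, non-zero contribution:

```python
from itertools import combinations
from math import factorial

def shapley_contributions(delta, n):
    """Exact Eq. (9). `delta` maps frozenset Q -> Delta_Q(x); the empty set maps to 0."""
    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(len(others) + 1):
            for Q in combinations(others, size):
                Q = frozenset(Q)
                weight = factorial(len(Q)) * factorial(n - len(Q) - 1) / factorial(n)
                phi[i] += weight * (delta[Q | {i}] - delta[Q])
    return phi

# f(x1, x2) = x1 OR x2 at the instance (1, 1), both features uniform on {0, 1}:
# E[f] = 0.75, so Delta for {first}, {second}, and {both} all equal 0.25.
delta = {frozenset(): 0.0, frozenset({0}): 0.25,
         frozenset({1}): 0.25, frozenset({0, 1}): 0.25}
print(shapley_contributions(delta, 2))  # [0.125, 0.125] -- the credit is shared equally
```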

3 Approximation algorithm

Computing Eq. (9) has an exponential time complexity, which makes the method infeasible for practical use. The following approximation is used to reduce the computational complexity. We start by writing a different but equivalent formulation of Eq. (9):

$$\begin{aligned} \varphi _i(x) = \frac{1}{n!}\displaystyle \sum _{\mathcal{O} \in \pi (n)} \left( \Delta _{Pre^i(\mathcal O ) \cup \{i\}} - \Delta _{Pre^i(\mathcal O )}\right) , \quad i = 1,\ldots ,n, \end{aligned}$$
(10)

where \(\pi (n)\) is the set of all ordered permutations of the feature indices \(\{1,2,\ldots ,n\}\) and \(Pre^i(\mathcal O )\) is the set of all indices that precede \(i\) in permutation \(\mathcal O \in \pi (n)\).

If the cost of computing the \(\Delta \)-terms were zero, Eq. (10) could be approximated using a simple sampling algorithm, where \(\left( \Delta _{Pre^i(\mathcal O ) \cup \{i\}} - \Delta _{Pre^i(\mathcal O )}\right) \) would be one sample (see, for example, [7]). However, the computational complexity of computing the \(\Delta \)-terms is exponential. As shown in [29], it is sufficient to limit ourselves to distributions of instances \(p\) in which the individual features are distributed independently. Now, Eq. (5) can be simplified into the following:

$$\begin{aligned} \Delta _Q(x)&= f_Q(x) - f_{\{\}}(x) \nonumber \\&= \sum \limits _{w \in \mathcal X } p(w) f(w_{[w_i = x_i, i \in Q]}) - \sum \limits _{w \in \mathcal X } p(w) f(w)\nonumber \\&= \sum \limits _{w \in \mathcal X } p(w) \left( f(w_{[w_i = x_i, i \in Q]}) -f(w)\right) , \end{aligned}$$
(11)

where the notation \(w_{[w_i = x_i, i \in Q]}\) denotes instance \(w\) with the value of feature \(i\) replaced with that feature's value in instance \(x\), for each \(i \in Q\). For example, with \(w = \langle 2,4,6 \rangle \) and \(x = \langle 3,5,7 \rangle \), \(w_{[w_i = x_i, i \in \{1,3\}]} = \langle 3,4,7 \rangle \).
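A small helper that implements this replacement notation (our naming; indices are 0-based in the code, while the paper counts features from 1):

```python
def compose(w, x, indices):
    """Return w with w_i replaced by x_i for every i in `indices`,
    i.e. the paper's w_[w_i = x_i, i in Q]."""
    return [x[i] if i in indices else w[i] for i in range(len(w))]

print(compose([2, 4, 6], [3, 5, 7], {0, 2}))  # [3, 4, 7], matching the example above
```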

The \(\Delta \)-terms in Eq. (10) are replaced with Eq. (11) to obtain:

$$\begin{aligned} \varphi _i(x) = \frac{1}{n!} \displaystyle \sum _{\mathcal{O} \in \pi (n)} \displaystyle \sum _{w \in \mathcal X } p(w) \cdot \left( f(w_{[w_j = x_j, j \in Pre^i(\mathcal O ) \cup \{i\}]}) - f(w_{[w_j = x_j, j \in Pre^i(\mathcal O )]})\right) . \end{aligned}$$
(12)

The following sampling procedure is used. Let

$$\begin{aligned} V_{\mathcal{O}, w} =\left( f(w_{[w_j = x_j, j \in Pre^i(\mathcal O ) \cup \{i\}]})-f(w_{[w_j = x_j, j \in Pre^i(\mathcal O )]})\right) , \end{aligned}$$

for all permutation/instance pairs \((\mathcal{O}, w)\), be the sampling population. When sampling at random and with replacement, we select the permutation \(\mathcal O \) uniformly at random and the instance \(w\) with probability \(p(w)\). Draw \(m\) such samples \(V_1, V_2, \ldots ,V_m\) at random with replacement and define

$$\begin{aligned} \hat{\varphi }_i = \frac{1}{m} \sum _{j=1}^m V_j. \end{aligned}$$
(13)

It follows that \(\hat{\varphi }_i\) is approximately normally distributed with mean \(\varphi _i\) and variance \(\frac{\sigma _i^2}{m}\), where \(\sigma _i^2\) is the population variance. That is, \(\hat{\varphi }_i\) is an unbiased and consistent estimator of \(\varphi _i(x)\). The described approximation algorithm is summarized in Algorithm 1.
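A minimal sketch of this sampling procedure in the spirit of Algorithm 1 (the helper names are ours, and the sketch draws instances from a data sample as a stand-in for the distribution \(p\)):

```python
import random

def compose(w, x, indices):
    # w with w_j replaced by x_j for every j in `indices` (0-based)
    return [x[j] if j in indices else w[j] for j in range(len(w))]

def approximate_contribution(model, x, i, data, m=1000):
    """Estimate phi_i(x) as in Eq. (13) by sampling permutations and instances."""
    n = len(x)
    total = 0.0
    for _ in range(m):
        order = list(range(n))
        random.shuffle(order)              # a permutation O drawn uniformly at random
        pre = set(order[:order.index(i)])  # features preceding i in O
        w = random.choice(data)            # an instance drawn from the data sample
        total += model(compose(w, x, pre | {i})) - model(compose(w, x, pre))
    return total / m

# Usage with the linear model from Sect. 1; the estimate is close to 1.
data = [[random.uniform(-1, 1), random.uniform(-1, 1)] for _ in range(1000)]
model = lambda z: 2 * z[0] - 3 * z[1] + 4
print(approximate_contribution(model, [0.5, 1 / 3], 0, data))
```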

3.1 Quasi-random and adaptive sampling

We considered two improvements that increase the efficiency of the approximation algorithm. First, the approximation algorithm is a form of Monte Carlo integration. Therefore, for faster convergence, quasi-random sampling can be used instead of pseudo-random sampling. In our experiments, we used the Sobol low-discrepancy quasi-random sequence [13].
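For illustration, a Sobol sequence can be generated with SciPy's quasi-Monte Carlo module (this is our tooling choice for the sketch; the paper only cites the Sobol sequence [13] and does not prescribe an implementation):

```python
from scipy.stats import qmc

sobol = qmc.Sobol(d=2, scramble=True)  # one dimension per random quantity needed
points = sobol.random(1024)            # 1024 quasi-random points in [0, 1)^2
# Each row can then be mapped to, e.g., a permutation and an instance index,
# replacing the pseudo-random draws in the sampling algorithm.
```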

Second, to compute the explanation for an instance \(x\), we need to compute the contribution of each of the \(n\) features for that instance. In practice, we want to do this in a controlled amount of time while minimizing the overall approximation error. The approximation error of the estimator \(\hat{\varphi }_i\) depends on the population variance, which need not be the same for all features. Given that in practice we are limited to a certain total number of samples, it makes sense to adapt \(m_i\), the number of samples drawn for the \(i\)th feature, to that feature's variance \(\sigma ^2_i\). We discuss two cases: minimizing the squared approximation error \(\sum _{i=1}^n (\hat{\varphi }_i - \varphi _i)^2\) and minimizing the absolute approximation error \(\sum _{i=1}^n |\hat{\varphi }_i - \varphi _i|\).

Algorithm 1

Recall that the estimate \(\hat{\varphi }_i\) is approximately normally distributed \(\hat{\varphi }_i \approx N(\varphi _i, \frac{\sigma _i^2}{m_i})\). It follows that \(\hat{\varphi }_i-\varphi _i \approx N(0, \frac{\sigma _i^2}{m_i})\). The distribution of the absolute error for the \(i\)-th feature \(Z_i = |\hat{\varphi }_i - \varphi _i|\) is half-normal, with \(E[Z_i] = \sqrt{\frac{\sigma _i^2}{m_i}}\sqrt{\frac{2}{\pi }}\). The expectation for the sum of absolute errors is

$$\begin{aligned} E\left[ \sum _{i=1}^n Z_i\right] = \sum _{i=1}^{n}\sqrt{\frac{\sigma _i^2}{m_i}}\sqrt{\frac{2}{\pi }} = \sqrt{\frac{2}{\pi }}\sum _{i=1}^{n}\frac{\sigma _i}{\sqrt{m_i}}. \end{aligned}$$

Similarly, for the sum of squared errors, we take \(Z_i = (\hat{\varphi }_i - \varphi _i)^2\). Since \(\hat{\varphi }_i - \varphi _i \approx N(0, \frac{\sigma _i^2}{m_i})\), the expectation is \(E[Z_i] = Var[\hat{\varphi }_i - \varphi _i] = \frac{\sigma _i^2}{m_i}\). The expectation for the sum of squared errors is

$$\begin{aligned} E\left[ \sum _{i=1}^n Z_i\right] = \sum _{i=1}^{n} \frac{\sigma _i^2}{m_i}. \end{aligned}$$

In practice, we first take samples for each input feature to obtain an initial estimate of \(\sigma _i\). After this minimum number of samples has been taken, the goal is to distribute \(m_{max}\), the total number of samples we can compute, among the individual features in a way that minimizes the expected error. Regardless of which error we use, a greedy approach is optimal. That is, if the current numbers of samples taken for the features are \(m_1,\ldots ,m_n\), then we should take the next sample for the feature that maximizes \(\sqrt{\frac{\sigma _i^2}{{m_i}}} - \sqrt{\frac{\sigma _i^2}{{m_i + 1}}}\) (or \(\frac{\sigma _i^2}{{m_i}} - \frac{\sigma _i^2}{{m_i + 1}}\)). This is a direct consequence of the fact that the functions \(g(z) = \frac{\sigma _i}{\sqrt{z}}\) and \(g(z) = \frac{\sigma _i^2}{{z}}\) are strictly decreasing and convex on \(z \in \mathfrak R ^+\), so the marginal error reduction obtained from each additional sample for a given feature decreases. Therefore, the currently best choice is also better than all possible future choices, regardless of the order in which future samples are taken.
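A compact sketch of the greedy allocation (for the squared-error case and with our own naming), assuming the variance estimates are already available:

```python
import heapq

def allocate_samples(variances, m_min, m_max):
    """Greedily distribute m_max samples among features so as to minimize the
    expected sum of squared errors sum_i sigma_i^2 / m_i (m_min samples each first)."""
    n = len(variances)
    m = [m_min] * n
    gain = lambda i: variances[i] / m[i] - variances[i] / (m[i] + 1)
    heap = [(-gain(i), i) for i in range(n)]   # max-heap on the error reduction
    heapq.heapify(heap)
    for _ in range(m_max - n * m_min):
        _, i = heapq.heappop(heap)
        m[i] += 1                              # take one more sample for feature i
        heapq.heappush(heap, (-gain(i), i))
    return m

# Most samples go to the features with the largest variance.
print(allocate_samples([4.0, 1.0, 0.25], m_min=10, m_max=300))
```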

Algorithm 2

The adaptive sampling version of the algorithm is summarized in Algorithm 2. Note that we used Knuth's incremental algorithm for computing the variance [16, 30].
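For completeness, a standard formulation of the incremental mean/variance update (commonly attributed to Welford and described by Knuth; a generic textbook version, not code from the paper):

```python
class RunningVariance:
    """Incrementally track the mean and population variance of a stream of samples."""

    def __init__(self):
        self.count, self.mean, self.m2 = 0, 0.0, 0.0

    def add(self, value):
        self.count += 1
        delta = value - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (value - self.mean)  # uses the updated mean

    @property
    def variance(self):
        return self.m2 / self.count if self.count else 0.0
```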

3.2 Average contribution of a feature’s value

The plots mentioned in the Introduction are a common and efficient way of presenting an overview of additive models (see Fig. 1). Such plots show, for each feature, the function that maps its values to situational contributions of those values. Recall that, for additive models, where the features do not interact, the situational contribution \(\psi _i(q)\) of the \(i\)th feature’s value \(q \in \mathcal X _i\) is the same for all instances. However, if the model is not additive and the features interact, then the situational contribution of a feature’s value depends on the values of other features.

To produce a similar plot, we average the \(i\)th feature’s value \(q\)’s local contributions across all instances with that value:

$$\begin{aligned} \psi _{i}(q)&= \sum _{x \in \mathcal X \wedge x_i = q} p(x) \varphi _i(x) = \sum _{x\in \mathcal X } p(x) \varphi _i(x_{[x_i = q]}) \nonumber \\&= \sum _{x\in \mathcal X } p(x) \left( \frac{1}{n!} \displaystyle \sum _{\mathcal{O} \in \pi (n)} \displaystyle \sum _{w\in \mathcal X } p(w)\left( f(w_{[w_j = x_j, j \in Pre^i(\mathcal O ); w_i = q]}) - f(w_{[w_j = x_j, j \in Pre^i(\mathcal O )]})\right) \right) \nonumber \\&= \frac{1}{n!} \displaystyle \sum _{\mathcal{O} \in \pi (n)} \left( \sum _{x\in \mathcal X } \displaystyle \sum _{w\in \mathcal X } p(x)p(w) \left( f(w_{[w_j = x_j, j \in Pre^i(\mathcal O ); w_i = q]}) - f(w_{[w_j = x_j, j \in Pre^i(\mathcal O )]})\right) \right) \nonumber \\&= \frac{1}{n!} \displaystyle \sum _{\mathcal{O} \in \pi (n)} \displaystyle \sum _{x \in \mathcal X } p(x) \left( f(x_{[x_i = q]}) - f(x)\right) \nonumber \\&= \displaystyle \sum _{x \in \mathcal X } p(x) \left( f(x_{[x_i = q]}) - f(x)\right) . \end{aligned}$$
(14)

For the transition from line 3 to line 4 in Eq. (14), observe the probability that the composite instance \(w_{[w_j = x_j, j \in Pre^i(\mathcal O )]}\) equals a particular instance \(z\) when \(w\) and \(x\) are drawn from \(\mathcal X \). For a fixed permutation \(\mathcal O \) and assuming independence of features, the probability of composing a particular instance \(z\) from two independent samples from \(\mathcal X \) is \(p(z)\), the probability of drawing instance \(z\) from \(\mathcal X \) at random. Therefore, the term \(\left( f(z_{[z_i = q]}) - f(z)\right) \) appears in the double summation with weight \(p(z)\).

To compute Eq. (14), we use an approximation similar to the one described before. We also compute the standard deviation of the samples \(\left( f(x_{[x_i = q]}) - f(x)\right) \), which can be interpreted as the input feature's overall importance for the model; it is shown in the model visualizations in the experimental results in the form of a light gray line (see Fig. 5).
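The last line of Eq. (14) translates directly into a simple estimator. The sketch below (our naming; instances are drawn from a data sample standing in for \(p\)) returns both the mean contribution and the standard deviation of the samples:

```python
import random
from statistics import mean, pstdev

def average_value_contribution(model, i, q, data, m=1000):
    """Estimate psi_i(q) = sum_x p(x) * (f(x with x_i := q) - f(x)), together with
    the standard deviation of the samples (the gray 'importance' line in Fig. 5)."""
    samples = []
    for _ in range(m):
        x = list(random.choice(data))
        fx = model(x)
        x[i] = q                       # set the i-th feature to the value q
        samples.append(model(x) - fx)
    return mean(samples), pstdev(samples)

# For the linear model from Sect. 1, psi_1(0.5) = 2 * (0.5 - E[X_1]) = 1.
data = [[random.uniform(-1, 1), random.uniform(-1, 1)] for _ in range(1000)]
model = lambda z: 2 * z[0] - 3 * z[1] + 4
print(average_value_contribution(model, 0, 0.5, data)[0])  # approx. 1
```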

Let \(\psi _i, i = 1\ldots n\), be the average contribution functions. If \(f\) is additive, then it holds for each input feature \(i\) and each of its values \(q\) that \(\psi _i(q) = f_i(q) - E[f_i]\), where \(f_i\) are the marginal effects of the individual features:

$$\begin{aligned} \psi _i(q)&= \displaystyle \sum _{x \in \mathcal X } p(x) (f(x_{[x_i = q]}) - f(x))\\&= \displaystyle \sum _{x \in \mathcal X } p(x) (f_i(q) - f_i(x_i))\\&= \displaystyle \sum _{x \in \mathcal X } p(x) f_i(q) - \displaystyle \sum _{x \in \mathcal X } p(x) f_i(x_i)\\&= f_i(q) - E[f_i]. \end{aligned}$$

That is, in the case of an additive model, the average contribution of a feature’s value equals the situational contribution of that value.

4 Experimental evaluation

The experimental evaluation of the proposed method is divided into three parts. First, we perform a detailed analysis of running times across several well-known real-world data sets and artificial data sets, using several different types of machine learning models. The purpose of this experiment is to see how the approximation algorithm scales with an increasing number of features and to quantify the benefits of using the two proposed improvements (adaptive and quasi-random sampling).

Second, we apply the method to several different types of machine learning models trained on several different data sets. The purpose of this part is to provide illustrative examples, describe the visualizations, and point out where existing methods would fail to provide a reasonable explanation.

And third, we describe a controlled experiment with 122 student participants. Our goal was to measure the effect that explanations in the form of feature contributions have on a person’s understanding of a prediction model.

The first two parts include experimentation on a number of different data sets and machine learning models. The artificial data sets used in these experiments are listed in Table 1 and are available as supplementary material. We also included several well-known regression and classification data sets: autoMpg, bodyfat, concrete, elevators, fishcatch, fruitfly, housing, machinecpu, pollution, stock, wine, and wisconsin (regression); anneal, breastCancerLJ, hepatitis, iris, monks1, monks2, monks3, mushroom, nursery, soybean, and zoo (classification). These data sets are available in .arff format from the Weka website (http://www.cs.waikato.ac.nz/ml/weka/). Most can also be found at the UCI Machine Learning Repository [9].

Table 1 Number of instances (#I), total number of input features (#F), and brief description of artificial data sets

We included ten different variations of learning algorithms for classification and seven different variations for regression (see Table 2). All learning algorithms used were from the Weka [10] machine learning software. Unless otherwise noted, default settings were used. All experiments were run on an off-the-shelf computer with a 2.4 GHz CPU and 2 GB of RAM.

Table 2 A list of learning algorithms that were included in the experiments

4.1 Running times analysis

4.2 Sampling algorithm enhancements

We used the following procedure to measure the benefits of using adaptive sampling and quasi-random sampling. All regression data set/regression model and classification data set/classifier pairs were included in the experiment. For every such pair, we trained the model on 500 bootstrap samples and computed the mean-squared approximation error across all instances. We computed the error at different amounts of samples per feature and for all four combinations of the basic approximation algorithm and enhancements (both, just quasi-random, just adaptive, neither).

The results shown in Fig. 2a suggest that both enhancements improve the efficiency of the algorithm. That is, fewer samples are needed to achieve the same approximation error. The improvement achieved with quasi-random sampling is small compared to the improvement achieved by adaptive sampling. The best results are achieved when both enhancements are used.

Fig. 2

The left-hand side plot shows the relative approximation error against the number of samples per feature for four variants of the approximation algorithm. Errors are relative to the error at (5,000 \(\times \) number of input features) samples with both enhancements. The right-hand side plot shows the running times for computing the contributions for an instance for the datgen40 data set. The most time-consuming models for the linear50 data set are also included. The remaining model/data set pairs take less time than the Naive Bayes model and were omitted

4.3 Scalability

We illustrate how the method scales with an increasing number of features on two additional data sets. The linear50 data set is a regression problem with 1,000 instances and 50 standardized numerical input features. The class value is a linear combination of the features. The datgen40 data set is a classification problem with 1,000 instances, 40 features, and 10 classes. This data set was created using Melli's generator of rule-based data sets [22]. Both data sets are available as supplementary material.

For each data set, we incrementally added features and measured the time required to compute all contributions for a single instance. Note that the number of samples taken, \(m_{max}\), was chosen such that the probability of a relative approximation error of more than 1 % was 5 % (relative to the absolute value of the contribution). Adaptive sampling was used, but not quasi-random sampling.
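One way to derive such a sample size from the normal approximation \(\hat{\varphi }_i - \varphi _i \approx N(0, \frac{\sigma _i^2}{m})\) is sketched below (an illustration of the calculation; the paper does not spell out the exact procedure it used):

```python
import math
from statistics import NormalDist

def samples_needed(sigma, phi_abs, rel_err=0.01, alpha=0.05):
    """Smallest m with P(|error| > rel_err * |phi|) <= alpha for error ~ N(0, sigma^2/m)."""
    z = NormalDist().inv_cdf(1 - alpha / 2)   # e.g. approx. 1.96 for alpha = 0.05
    return math.ceil((z * sigma / (rel_err * phi_abs)) ** 2)
```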

The results in Fig. 2b show that contributions can be computed for these data sets in real time for a few dozen features, regardless of the choice of the model. The differences between models are in part a consequence of different variances but mostly due to differences in the time complexity of computing a single prediction, which is the key component of the approximation algorithm's time complexity.

4.4 Illustrative examples

We use two types of visualizations: instance visualizations and model visualizations. The former are, as the name suggests, a visualization of the features’ local contributions \(\varphi \) for a particular instance.

Figure 3 shows a pair of instance visualizations for the same instance from the Monks1 data set but for two different types of models. At the top of an instance visualization are the name of the data set, the model, the point-of-view class (classification only), the model's prediction for this instance, and the actual (correct) class value. The features' names and their values for the instance are listed on the left- and right-hand side, respectively. The boxes contain the features' contributions for this instance. These contributions are also plotted as bars to simplify comparison and identification of the features with the largest contributions. Note that all contributions used in the visualizations were approximated with high precision (\(P\)(error \(<10^{-4})=1-10^{-3}\)).

Fig. 3

The naive Bayes model, due to its assumption of conditional independence of input features, cannot model the importance of the equivalence between attr1 and attr2 (both have a zero contribution). Despite this limitation, it correctly predicts the class value, because for this instance, attr5 = 1 is sufficient (this feature has a substantial positive contribution). The artificial neural network correctly models both concepts

An instance visualization reveals how individual features contribute to the model's prediction for that instance. For example, for the Monks1 data set, the class value is 1 if (attr1 = attr2 \(\vee \) attr5 = 1) and 0 otherwise. The pair of instance visualizations for the same instance from Monks1, but for two different models trained on this data set, provides us with additional information about how the features influence the models' predictions (see Fig. 3). Although the models are different, the general method facilitates comparison and reveals an important difference between the two models, despite their similar predictions for the same instance. Note that one-feature-at-a-time approaches would assign a 0 contribution to all features in the artificial neural network case, because perturbing just one feature does not change the model's prediction.

The second pair of instance visualizations is for two different models trained on the cRedundant data set (see Fig. 4). This data set has 5 numerical input features. The class value equals 1 if \(A_1>0.5\) or \(A_2>0.7\) or \(A_3<0.4\). Otherwise, it is 0. The remaining two features, \(A_4\) and \(A_5\), are copies of \(A_1\) and \(A_2\), respectively, which introduces some redundancy. For this instance, the values of the first three input features are 0.96, 0.72, and 0.67. The first two features satisfy the condition, while the third does not. Given that both models are successful predictors for this data set, they have learned these concepts, and appropriately, the first two features are assigned a positive contribution, while the third contributes against the class value being 1. Note that the artificial neural network takes the redundant features into account as well, while bagging with decision trees takes into account only one feature from each redundant pair. Although both models are equally good predictors, the explanations are different, because the explanations reveal what the models have learned.

Fig. 4

Two instance visualizations for the same instance from the cRedundant data set and two different models

4.4.1 Model visualizations

The second type of visualization is the model visualization (see, for example, Fig. 5a). It is composed of \(n\) marginal effect plots, one for each feature (see Sect. 3.2). For each feature, the mean local contributions are plotted against that feature's value (black line). The importance of each feature (the standard deviation of its contributions) is also included in the form of a gray line.

Fig. 5

Model visualizations for two different models and the cDisjunctN data set. Both models learn the concepts behind the data, and the plotted average contribution functions reveal where the individual features’ contribution changes from negative to positive

A model visualization provides us with an overview of how features contribute to the model's predictions. For example, observe Fig. 5a, the model visualization of the decision tree that was trained on the cDisjunctN data set and is good at predicting the class values. The cDisjunct data set is similar to the cRedundant data set; however, the fourth and fifth features are not copies of the first and second features. Instead, they are unrelated to the class value. First, the model visualization helps us identify the most important features. The first three features in our example have an equally high importance (gray line; see Sect. 3.2), while the remaining two features are (correctly) identified as being of insignificant importance. Second, the plots provide additional information about how features contribute to the model. For example, the first feature (A1) has a negative contribution (speaks against class value 1) if its value is less than 0.5, but contributes positively if its value is greater than 0.5. Also note that, as shown in Sect. 3.2, if the model is additive, then the plot can also be used to graphically compute the prediction for an arbitrary instance from the data set.

The general method simplifies the comparison of different models or types of models. Figure 5b depicts a model visualization for an artificial neural network trained on the cDisjunct data set. While the performance of both models (see Fig. 5) is similar with respect to prediction quality, the models are slightly different. The visualization reveals the smooth fit of the artificial neural network and the characteristic step-function fit of the decision tree. The artificial neural network also slightly overfits the data, as the fourth and fifth features do slightly influence the model's predictions.

4.5 User-based experiment

The goal of this experiment was to measure whether explanations in the form of feature contributions benefit not only (machine learning) experts but also non-expert end-users. Explanations should benefit the user by increasing the user's understanding of the model. The usefulness of explanation methods is usually validated through visual inspection of illustrative examples (as we have done in Sect. 4.4) or through an application to a real-world problem with domain-expert evaluation.

Only a few studies approach the evaluation in a more general and objective way. [12] compared the usefulness of decision tables, binary decision trees, and decision rules in a study that included 51 post-graduate students. The students had to perform understanding and prediction tasks. The authors measured the prediction accuracy, speed of response, and the level of trust, and concluded that decision tables were most useful. Other studies focused only on decision trees [20] and on decision trees and decision rules [2].

Similar to [12], our goal was to measure the effect of the explanation on the user through the users’ performance at prediction tasks, which can be measured objectively. We designed the following experiment:

  • the participant is provided, in sequence, with two sets of instances, each containing labeled instances (with the model's predictions) and unlabeled instances,

  • for the second set, the participant is also given the situational importance of each feature for all labeled instances,

  • for each set separately, the participant is asked to learn what the model does from the labeled instances (and explanations, if given) and to produce predictions for the unlabeled instances.

We constructed two different sets of instances, T1 and T2 (see Table 3), and used two different variants of the experiment. In the first variant (EXP1), the participant is provided with either T1 or T2 (chosen at random). That is, the participant first solves the task without explanations and then the same task with explanations. In the second variant (EXP2), the participant is provided either with T1 first and T2 second or T2 first and T1 second (chosen at random). That is, the participant solves one task without explanations and then a different task with explanations.

Table 3 We constructed two sets of learning and testing instances, T1 and T2

A total of 122 computer science students participated in this experiment. The students could use only pencil and paper to compute their predictions and were limited to 8 minutes per set. All 56 participants in EXP1 were first-year students, who are assumed to have no substantial experience with knowledge discovery. Two groups of students participated in EXP2: 52 first-year students (group A) and 14 fourth-year students with experience in data mining and knowledge discovery (group B).

For each test instance separately, we ranked the mean-squared errors of the participants' predictions. We prefer ranks to actual mean-squared errors to avoid the effect of outliers and to facilitate comparison across all four test instances. We tested the statistical significance of the differences with the Wilcoxon test (paired for EXP1 and unpaired for EXP2). We use the 95 % confidence level when determining the significance of the results.

The results are shown in Table 4. Where explanations were provided, prediction errors rank significantly lower across all groups and both variants of the experiment.

Table 4 Average ranks for the users’ prediction errors for both variants of the experiment and all groups

Significantly better predictions are a consequence of the participants having a better understanding of the model. Given the design of the experiment, it is reasonable to conclude that the better understanding was caused by providing explanations. These results suggest that situational importance is a useful form of explanation for non-expert users.

It could be argued that in EXP1 the better understanding was not a consequence of providing explanations but of participants performing the same task for the second time, when explanations were provided. However, in EXP2, this was controlled for by giving participants a different task when explanations were provided. The only potential threat to the validity of this experiment is the possibility that participants matured. However, given the simplicity of the tasks and the short timespan, we assume that this is unlikely.

5 Conclusion

We proposed a general method for explaining how features contribute to classification and regression models’ predictions. The method builds on previous work on a general method for computing the situational importance of features for prediction models. By design, the method perturbs all subsets of features to deal with the shortcomings of other existing general methods that do not properly take into account interactions between features.

We derived the mean situational importance of a feature's value, and we showed how it can be used as a basis for a model visualization, which provides an overview of how features contribute to the model's predictions. For additive models, this approach generalizes previous additive model-specific methods and general explanation methods.

We also proposed two enhancements to the sampling algorithm (quasi-random and adaptive sampling) that reduce the running time of the algorithm. Empirical results across several types of models and data sets show that the method is an efficient and useful tool for visualizing models, comparing different types of models, and identifying potential errors. Furthermore, an experiment with human participants showed that providing an explanation in the form of feature contributions increased the user’s understanding of the prediction model.