Keywords

1 Introduction

Machine learning is a set of methods that improve automatically through experience, i.e. it is based on data. Popular machine learning methods are, e.g. support vector machines (SVMs [2]), artificial neural networks (ANNs) and random forests (RFs). Machine learning algorithms are increasingly applied in science and business and have achieved impressive performances in diverse tasks, outperforming humans. However, for several machine learning algorithms it is hard to tell what the machine has actually learned from the data. For example, in the case of ANNs, what was learned is hidden in the weights and biases of the neurons involved. If a machine learning model performs well, one might simply trust the model and ignore why it made a certain decision. However, such an attitude goes against human curiosity and thirst for knowledge. This raises the issue of interpretability [24, 25]. The straightforward way to achieve interpretability in statistical learning is to use only interpretable models. Interpretable models are, e.g. linear models, generalized linear models [23], generalized additive models [12], decision trees and rules. On the other hand, model-agnostic interpretation methods are more flexible and can be applied to any machine learning algorithm. Graphical model-agnostic methods are, e.g. Partial Dependence Plots (PDP) and Individual Conditional Expectation (ICE) plots. We suggest to provide model-agnostic methods to evaluate the influence of different regressors and their interactions by applying methods from the statistical field of global sensitivity analysis (GSA). Sensitivity analysis is the study of how the uncertainty in the output of a mathematical model or function can be divided and allocated to different sources of uncertainty in its inputs [14, 29]. Given any real-valued function on several variables –whether analytical or given by a black-box– one wants to know which input variables affect the variability of the function the most. GSA has proven to be a valuable tool in analysing expensive to evaluate computer models with a surrogate model, e.g. a Kriging model, build first. Cheng et al. [3] use support vector regression as surrogate model within GSA, whereas [37] built a new feature selection approach upon GSA. Like in [4] we suggest to achieve an understanding and interpretability of, e.g. ANNs, SVMs and RFs by combining GSA and visualization.

2 Global Sensitivity Analysis

This section reviews sensitivity analysis with a focus on global variance-based sensitivity measures, but we also discuss derivative-based global sensitivity measures briefly.

2.1 Global Sensitivity Indices

Consider a function \(f: \Delta \subseteq \mathbb {R}^d \rightarrow \mathbb {R}\) that is square integrable w.r.t. a d-dimensional product measure μ. The functional analysis of variance (FANOVA) decomposition (also called Hoeffding-Sobol’ decomposition) of f ∈ L 2(μ) is the unique decomposition

$$\displaystyle \begin{aligned} f(X)=f_0+ \sum_i f_i(X_i) + \sum_{i<j} f_{i,j}(X_i,X_j) + \cdots + f_{1,\ldots, d}(X_1, \ldots, X_d) \end{aligned} $$
(1)

such that E(f I(X I)∣X J) = 0 for all J ⊂ I ⊆ [d] := {1, …, d}. In particular, we have E(f I(X I)) = 0 for all ∅≠I ⊆ [d], see e.g. [6]. Furthermore, this implies orthogonality of all summands in the decomposition, i.e. \(E\left (f_I(X_I)f_J(X_J)\right )=0\) for all IJ ⊆ [d]. The FANOVA decomposition can be computed recursively by

$$\displaystyle \begin{aligned} f_0=E(f(X)), \,\, \mathrm{and} \,\, f_I(X_I)=E(f(X)\mid X_I)- \sum_{J\subset I} f_J(X_J). \end{aligned} $$
(2)

Orthogonality allows for an ANOVA like decomposition of the variance of f:

$$\displaystyle \begin{aligned} D=Var(f(X))= \sum_{I\subseteq [d]} Var(f_I(X_I)). \end{aligned} $$
(3)

The variance of each term, D I = Var(f I(X I)), gives a sensitivity index of its effect. The standardized indices

$$\displaystyle \begin{aligned} S_I=D_I/D \end{aligned} $$
(4)

are known as Sobol’ indices [31]. Especially, Sobol’ indices S i = S {i} for individual variables are referred to as first-order indices and S ij = S {i,j} as second-order indices. The same holds for the unstandardized versions D i and D ij. Sobol’ introduced the closed sensitivity index to describe the influence of a group of variables:

$$\displaystyle \begin{aligned} D_I^{cl}= \sum_{J \subseteq I} D_J=Var(E(f(X)|X_I)). \end{aligned} $$
(5)

The total sensitivity index by Homma and Saltelli [13] describes the total contribution of a set of variables including all interactions of any order and is defined by all partial variances containing at least one of the variables, i.e.

$$\displaystyle \begin{aligned} D_I^T=\sum_{I \cap J\ne \emptyset} D_J, \quad S_I^T=\frac{D_I^T}{D}. \end{aligned} $$
(6)

For I = {i}, this total sensitivity index is defined by considering all supersets. An extension [21] of the concept of superset importance is given by

$$\displaystyle \begin{aligned} D_I^{sup}=\sum_{J \supseteq I} D_J. \end{aligned} $$
(7)

In particular, the unnormalized and normalized total interaction indices (TIIs) [9] are given by

$$\displaystyle \begin{aligned} D_{i,j}^{sup}=\sum_{J \supseteq \{i,j\}} D_J \quad \mbox{ and } \quad S_{i,j}^{sup}=\sum_{J \supseteq \{i,j\}} \frac{D_J}{D}. \end{aligned} $$
(8)

So, each of these indices characterizes a different aspect of the sensitivity of the model response to individual input variables or interactions between them.

2.2 Shapley Values

A similar problem to the FANOVA decomposition has been studied in game theory and economics, namely the problem of attributing the value created in a team effort to individual team members. Consider the setting where one can measure the value \(val(I)\in \mathbb {R}\) created by any subset I ⊆ [d] of the d-member team. In that case the so-called Shapley values ϕ i are the unique choice that satisfy the following four natural criteria [30, 36].

  1. 1.

    (Efficiency) \(\sum _i^d \phi _i=val([d])\).

  2. 2.

    (Symmetry) val(I ∪ i) = val(I ∪ j) ∀I ⊆ [d] ∖{i, j} implies ϕ i = ϕ j.

  3. 3.

    (Dummy) val(I ∪ i) = val(I) ∀I ⊆ [d] implies ϕ i = 0.

  4. 4.

    (Additivity) The game with value val (1) + val (2) has Shapley values ϕ (1) + ϕ (2) with ϕ (1) = ϕ(val (1)) and ϕ (2) = ϕ(val (2)).

Then the Shapley value of an individual variable is given by

$$\displaystyle \begin{aligned} \phi_i=\frac{1}{d} \sum_{I\subseteq [d]\setminus\{i\}} {d-1 \choose |I|}^{-1}(val(I \cup i)-val(I)). \end{aligned} $$
(9)

Shapley values are connected to the FANOVA decomposition by Owen [27]. In that context, for any subset I of input variables, their combined value val(I) is the “variance explained” in the FANOVA decomposition. More precisely, the choice in [27] is \(val(I)=D_I^{cl}\). Then, using the properties (1) − (4), it can be shown that the Shapley value is

$$\displaystyle \begin{aligned} \phi_i=\sum_{I\subseteq [d], i\in I} \frac{D_I}{|I|}\end{aligned} $$
(10)

according to Theorem 1 in [27]. The Shapley value does not coincide with any first-order Sobol’ index, but it is bracketed between the closed and total sensitivity index [27]:

$$\displaystyle \begin{aligned} D_i^{cl} \le \phi_i \le D_i^T. \end{aligned} $$
(11)

A normalized Shapley value may be defined as \(\phi ^*_i=\phi _i/D\). Because these indices are comparatively easy to compute, Sobol’ indices provide effectively computable bounds for the Shapley value. An exact computation of the Shapley value is computationally expensive because there are 2d subsets of [d], representing coalitions of variables. Štrumbelj and Kononenko [34] and Song et al. [33] propose effective algorithms to estimate Shapley values using Monte Carlo sampling.

2.3 Derivative-Based Global Sensitivity Measures

Based on the work of [32] derivative-based global sensitivity measures (DGSM) were introduced by Kucherneko et al. [20] as

$$\displaystyle \begin{aligned} \nu_i= \int \left(\frac{\partial f}{\partial x_i}(x)\right)^2 d\mu. \end{aligned} $$
(12)

A normalized DGSM can be defined by \( \nu ^*_i=\nu _i/ \sum _j^d v_j. \) DGSMs are not associated with a functional decomposition, but they are connected to total sensitivity indices by the inequality \(D_i^T\le C(\mu _i)\nu _i\) if for the measure μ the Poincare inequality

$$\displaystyle \begin{aligned} \int g(x)^2\mu \le C(\mu)\int ||\nabla g(x)||{}^2 d\mu \end{aligned} $$
(13)

holds for all centred functions g ∈ L 2(μ) with \(\int g(x)d\mu =0\) and ||∇g||∈ L 2(μ). Friedman and Popescu [7] introduced crossed DSGMs, in particular, for interactions:

$$\displaystyle \begin{aligned} \nu_{i,j}= \int \left(\frac{\partial ^2f}{\partial x_i\partial _j}(x)\right)^2 d\mu. \end{aligned} $$
(14)

Roustant et al. [28] provide an inequality to link crossed DGSMs to superset importance.

2.4 Estimation of Indices

For analytically tractable test functions, the indices above may be calculated by evaluating the integrals involved. In general, the function f is not known analytically and will be treated as black-box function. In Monte Carlo estimation, we take a high number of n samples x (1), …, x (n) from the distribution μ and approximate the integral by

$$\displaystyle \begin{aligned} \frac{1}{n} \sum_{k=1}^{n} f(x^{(k)}) \quad \stackrel{n\rightarrow \infty}{\longrightarrow} \int f(x)d\mu=E(f(X)). \end{aligned} $$
(15)

The approximation is unbiased and convergent with probability one according to the law of large numbers. For the estimation, we require a representation of the sensitivity indices that is suitable for Monte Carlo integration. A popular choice is based on the pick-and-freeze formula \(D_I^{cl}=E(f(X)f(X_I,Z_{-I}))-f_0^2\) which gives the pick-and-freeze Monte Carlo estimator

$$\displaystyle \begin{aligned} \hat{D}_I^{cl}=\frac{1}{n} \sum_{k=1}^{n} f(x^{(k)})f\left(x^{(k)}_I,z^{(k)}_{-I}\right) -f_0^2. \end{aligned} $$
(16)

Here Z is an independent copy of the random variable X, and − I denotes the complement set [d] ∖ I. Since, the pick-and-freeze estimator gets a large variance if f 0 is large, other formulas have been suggested that avoid the subtraction of \(f_0^2\) [18, 31]. In particular, the total sensitivity index can be computed using the Jansen formula \(D_I^T=1/2 E((f(X)-f(Z_I,X_{-I}))^2)\).

Computationally cheaper than Monte Carlo estimation, but also slightly biased, are frequency-based estimation methods. The first frequency-based estimation method was the so-called Fourier amplitude sensitivity test (FAST) by Cukier et al. [5]. TIIs can be easily estimated via the relationship with closed sensitivity indices using pick-and-freeze. Of particular interest is a direct method using the formula of [21]:

$$\displaystyle \begin{aligned} \begin{array}{rcl} D_{i,j}^{sup}& =&\displaystyle \frac{1}{4}E((f(X_i,X_j,X_{-\{i,j\}})-f(X_i,Z_j,X_{-\{i,j\}}) \end{array} \end{aligned} $$
(17)
$$\displaystyle \begin{aligned} \begin{array}{rcl} & &\displaystyle -f(Z_i,X_j,X_{-\{i,j\}}) +f(Z_i,Z_j,X_{-\{i,j\}}))^2). \end{array} \end{aligned} $$
(18)

The corresponding Liu-Owen Monte Carlo estimator is unbiased, and it is non-negative since it is a sum of squares. This implies that if the true TII is zero, then the estimator is zero as well.

3 Visualizing Interaction Structures by FANOVA Graphs

In this section the FANOVA graph, an intuitive tool to visualize the most valuable information of the FANOVA decomposition, is introduced [8, 10, 26]. Estimation and thresholding of FANOVA graphs is discussed, and we apply GSA to a standard non-linear test function.

3.1 General Idea of FANOVA Graphs

Usually, it is infeasible to look at all 2d − 1 terms of the decomposition of a function with d input variables individually, and quite often only main effect Sobol’ indices are considered. The primal intention of FANOVA graphs is to overcome this problem and to visualize the interaction structure contained in the FANOVA decomposition by a mathematical graph [26]. The so-called FANOVA graph is defined as a graph G = (V, E) where each of the d input variables is identified by an element of the vertex set V = {1, …, d}. An edge is included in the edge set (i, j) ∈ E iff there exists a superset J ⊇{i, j} such that f J(X J)≠0. That is, the pair of input variables (X i, X j) has a non-zero two-way interaction or is involved in a higher order non-zero interaction. Equivalently an edge (i, j) is not in G iff all Sobol’ indices S J = 0 for J ⊇{i, j}. This is exactly captured by a non-zero TII, i.e. \(S_{i,j}^{sup}\neq 0\).

A FANOVA graph can be further enhanced by displaying the thicknesses of each edges (i, j) proportional to the strength of the TII of the two involved input variables. In the same way, each vertex i can be displayed by circles with lines proportional in strength to the main effect Sobol’ index S i.

Let us consider the so-called Ishigami function which is frequently used for illustrating sensitivity analysis [16]. The function, given by

$$\displaystyle \begin{aligned} f(X_1,X_2,X_3)=\sin{}(X_1) +7\sin^2(X_2) +0.1X_3^4 \sin{}(X_1), \end{aligned} $$
(19)

depends on three input variables (X 1, X 2, X 3) and obviously contains a non-linear interaction between X 1 and X 3 (see Fig. 1c). For this test function Sobol’ indices can be computed analytically. Assuming a uniform distribution on [−π, π] for each input variable, analytical calculation of the FANOVA decomposition and Sobol’ indices gives us the following values

$$\displaystyle \begin{aligned} D_1= 4.346, \, D_2 = 6.125, \, D_3 = 0, \, D_{12} = 0, \, D_{13} = 3.374, \, D_{23} = 0, \, D_{123} = 0.\end{aligned} $$
(20)

This leads to the following first-order Sobol’ indices and normalized TIIs

$$\displaystyle \begin{aligned}S_1=0.314, \, S_2 =0.442 , \, S_3 = 0, \, S_{12}^{sup} = 0, \, S_{13}^{sup} = 0.244, \, S_{23}^{sup} = 0.\end{aligned} $$
(21)
Fig. 1
figure 1

3d-plots for the Ishigami function. (a) f(X 1, X 2, 0). (b) f(0, X 2, X 3). (c) f(X 1, 0, X 3)

Figure 2 shows a bar plot and the FANOVA graph displaying the Sobol’ indices and TIIs for the Ishigami function. Main effect stands for the normalized first-order Sobol’ indices and interaction is the difference between the scaled total sensitivity index and the Sobol’ index. From the FANOVA graph it becomes immediately obvious that the input variable X 2 has the highest impact on the response, followed by X 1. Also, the interaction between X 1 and X 2 is easily detected.

Fig. 2
figure 2

Bar plot and FANOVA graph displaying Sobol’ indices and TIIs for the Ishigami function. (a) Bar plot. (b) FANOVA graph

In summary, FANOVA graphs visualize both first- and second-order GSA. First-order analysis in the sense of detecting the inputs X i for which S i is very small or even zero. Second-order in the sense of looking at pairs of input variables and detecting influential interactions and their strength, i.e. the pairs {i, j} with \(S_{i,j}^{sup}> 0\).

3.2 Estimation and Thresholding

In practice, S i and \(S_{i,j}^{sup}\) are usually not analytically available and replaced by estimates \(\hat {S}_{i,j}^{sup}\) and \(\hat {S}_i\). Moreover, we often even apply GSA not to the actual black-box model but a meta-model or surrogate model of it. Then, estimates are typically not exactly equal to zero even if the “true” or analytically calculated sensitivity index would be. The resulting graph becomes confusing and uninformative. Therefore, edges (i, j) may be included into the graph only if

$$\displaystyle \begin{aligned} \hat{S}_{i,j}^{sup}>\delta \end{aligned} $$
(22)

for some small threshold δ, e.g. δ = 0.01 [26]. The computation of the FANOVA graph has been implemented in the R package fanovaGraph, providing several estimation methods as well as a thresholding functionality [8, 10].

To exemplify, let us now assume that we cannot analyse the Ishigami function analytically. Based on a random Latin hypercube design with 100 design points, we build a Kriging model of the Ishigami function. Our Kriging model has the usual Matern 5/2 covariance structure, no trend and no nugget effect. Table 1 shows the results for the estimators of the first-order Sobol’ indices \(\hat {S}_i\), normalized Shapley values \(\hat \phi _i^*\) and TIIs \(\hat {S}_{i,j}^{sup}\) of the Kriging model. These estimators have been computed using the R packages fanovaGraph [10] and sensitivity [15]. Remember that the inequalities

$$\displaystyle \begin{aligned} S_i^{cl} \le \phi_i^* \le S_i^T \end{aligned} $$
(23)

hold and that \(S_i^{cl}=S_i\), which is reflected by the order of the values in Table 1. Comparison also shows that the estimates slightly deviate from the true values given above.

Table 1 Sobol’ indices (first order) and TIIs for the Kriging model of the Ishigami function

Figure 3 displays on the left hand side the resulting pure FANOVA graph. This is a complete graph as all estimated TIIs are different from zero, even if only slightly. Therefore, we threshold the values by δ = 0.025 and gain the graph on the right hand side, which is the same as for the analytical evaluation of the Ishigami function. The TII’s and the FANOVA graph help to discover an underlying block-additive structure of the function f, i.e. we can find a decomposition into cliques of input variables such that variables in different cliques do not interact. As outlined in [26], the detected interaction structure by the FANOVA graph can be a valuable aid in constructing block-additive Kriging models. Therefore, the fanovaGraph package also contains methods for block-additive Kriging analysis. The block-additive decomposition provided by the FANOVA graph can also be used in a parallelized global optimization procedure [17].

Fig. 3
figure 3

Fanova graphs without and with thresholding for the Kriging model of the Ishigami function. (a) Fanova graph. (b) Fanova graph with threshold δ = 0.025

4 Fields of Applications

The interpretation of a machine learning model by global sensitivity indices and FANOVA graph is in general applicable to any kind of model with a continuous response variable. We show examples of a Kriging model of a piston simulator and an ANN of resistance of sailing yachts.

4.1 Kriging Model of a Piston Simulator

As an example for the application in the field of the design and analysis of computer experiments we are using the piston simulator from the mistat package in R [1], first presented in [19]. A piston is moving within a cylinder. The piston’s performance is measured by the time it takes to complete one cycle, in seconds. Here, we take the mean of 50 cycles as response, since the cycle time of the piston fluctuates strongly. The following factors can affect the piston’s performance. The ranges, in which these factors are varied uniformly in our sensitivity analysis, are given in brackets.

m

The impact pressure determined by the piston mass (30–60) [kg].

S

The piston surface area (0.005–0.020) [m 2].

V 0

The initial volume of the gas inside the piston (0.002-0.010) [m 3].

k

The spring coefficient (1000–5000) [Nm 3].

p 0

The atmospheric pressure (9 ⋅ 104 − 11 ⋅ 104) [Nm 2].

T

The surrounding ambient temperature (290–296) [K].

T 0

The filling gas temperature (340–360) [K].

Based on a random Latin hypercube design with 70 design points, we build a Kriging model of the piston simulator. The Kriging model has a Matern 5/2 covariance kernel, no trend and no nugget effect. Table 2 shows the results for the Sobol’ indices (first-order and total) and the Shapley values of the piston simulator. The slightly negative value for, e.g. \(\hat \phi ^*_6\) is of course an artefact of the estimation method. We observe that the piston surface X 2 = S and the spring coefficient X 4 = k have the largest effect on the cycle time.

Table 2 Sobol’ indices for the Kriging model of the piston simulator

Figure 4 displays the FANOVA graph for the Kriging model of the piston simulator after thresholding by δ = 0.005.

Fig. 4
figure 4

FANOVA graph with threshold δ = 0.005 for Kriging model of the piston simulator. (a) FANOVA graph. (b) FANOVA graph (only TIIs displayed)

In Fig. 4a both the edges as well as the vertices of the graph are presented by lines proportional to the values of the respective indices. It becomes obvious that X 2 = S and X 4 = k have the highest impact, followed by X 1 = m. However, as the values of the TIIs are noticeably smaller than the first-order Sobol’ indices, it is not possible to detect which interactions are the largest. Therefore, in the FANOVA graph in Fig. 4b we only vary the edges in strength proportional to the values of the TIIs. The largest TII is observed for the interaction X 2 X 4 = Sk with \(\hat {S}^{sup}_{2,4}=0.033\).

4.2 Neural Net Model of Resistance of Sailing Yachts

The residuary resistance of a ship is its total resistance minus the viscous resistance. In this section we are studying the residual resistance of sailing yachts in dependence of their hull geometry and the yacht velocity.

The Delft systematic yacht hull series data set [11] comprises 308 = 22 ⋅ 14 experiments with yacht models of scale 6.25 performed at the Delft Ship Hydromechanics Laboratory. In total, 22 different hull forms were tested with 14 different velocities. Based on the Delft series, semi-empirical models were developed [11] which are widely used in the yacht industry [22]. The Delft data set has 6 regressors and one dependent variable, all of which are dimensionless, i.e. their unit is 1 or % or ‰. Let the weight displacement Δ be the weight of water equivalent to the immersed volume of the hull. Then the dependent variable is the ratio R r∕ Δ of the residuary resistance R r to the weight displacement, given in ‰. The independent variables are as follows.

X 1 :

The longitudinal centre of buoyancy (LCB) is the longitudinal distance, given in % of some characteristic length, from a point of reference (often midships) to the centre of the displaced volume of water.

X 2 :

The prismatic coefficient C p = ∇∕L WL A m with A m the cross-sectional area of the underwater slice at midships. C p displays the ratio of the immersed volume of the hull to a volume of a prism with equal length and cross-sectional area A m.

X 3 :

The length-displacement ratio L WL∕∇1∕3 where the volume displacement ∇ is the volume of water displaced by the hull.

X 4 :

The beam-draught ratio B WLT where the draught T is the maximal distance from the water line to the bottom of the keel.

X 5 :

The length-beam ratio L WLB WL is the ratio of length to maximal width at water line.

X 6 :

The Froude number \(Fr=u/\sqrt {gL_{WL}}\). Here u is the flow velocity relative to the yacht, g the gravitational acceleration, and L WL is the length of the hull at water line.

We train a single hidden layer ANN to learn the relationship between input and output variables in the Delft data set. Such ANNs are implemented in the nnet package in R [35]. For regression, we choose an ANN with a linear activation function to the output neuron. The data set is divided into training, validation and testing subsets, containing 50%, 25% and 25% of the samples, respectively. The hyperparameter to be tuned is the number n of neurons in the hidden layer. We choose the ANN with lowest validation error, i.e. highest R 2-value for the validation data. That is according to Table 3 an ANN with 8 hidden neurons, i.e. a 6-8-1 net with 6 ⋅ 8 + 8 = 56 weights and 8 + 1 = 9 biases.

Table 3 R 2 values for trainings, validation and test data for ANNs with different number of hidden neurons

For the chosen ANN as black-box function we perform a GSA. We compute the Sobol’ indices as well as the scaled TIIs using the Liu-Owen method with n = 100, 000 Monte Carlo samples with the help of the fanovaGraph package in R. Table 4 displays these sensitivity indices with the following coding of the regressors X 1 = LCB, X 2 = C p, X 3 = L WL∕∇1∕3, X 4 = B WLT, X 5 = L WLB WL and X 6 = Fr. The Sobol’ indices and scaled TIIs are graphically displayed in the bar plot in Fig. 5a.

Fig. 5
figure 5

Bar plot and FANOVA graphs without and with thresholding for the ANN model. (a) Bar plot of first-order and total Sobol’ indices. (b) FANOVA graph. (c) FANOVA graph with threshold δ = 0.025 (only TIIs displayed)

Table 4 Sobol’ and Shapley values for the neural net model

Figure 5 displays the FANOVA graph for the ANN model with and without thresholding. The Froude number, a proxy for velocity, has by far the largest impact on the residuary resistance. The largest interactions are X 1 X 4, X 4 X 6 and X 1 X 6.

5 Summary

We have discussed the usefulness of GSA as a tool for interpretable machine learning. Global sensitivity indices based on Sobol’ indices, Shapley values as well as derivative-based global sensitivity measures are revisited. FANOVA graphs allow for a very intuitive visualization of interaction structures and the strength of first-order Sobol’ indices and TIIs. The approach is exemplified with a Kriging meta-model for a piston simulator and an ANN model for the resistance of yachts.