1 Introduction

Deep learning models [98] have achieved remarkable performance in a variety of tasks, from visual recognition, natural language processing, and reinforcement learning to recommendation systems, where deep models have produced results comparable to, and in some cases superior to, those of human experts. Because these models are heavily over-parameterized (often involving millions of parameters stacked in hundreds of layers), it is difficult to understand their prediction results [47]. Explaining their behaviors remains challenging due to their hierarchical non-linearity, which makes them behave as black boxes. This lack of interpretability raises serious trust issues for deep models in high-stakes prediction applications, such as autonomous driving, healthcare, criminal justice, and financial services [29]. Many interpretation tools have been proposed to explain or reveal how deep models make decisions; nevertheless, from both a scientific and a social perspective, explaining the behaviors of deep models is still a work in progress. In this paper, instead of focusing on the social impacts, regulations, and laws related to deep model interpretations, we focus on the research field itself by clarifying the research objectives and reviewing the proposed methods.

Interpretation versus Interpretability In this work, we first clarify two concepts that should be distinguished: interpretations and model interpretability. Interpretations, also called explanations or attributions, are the results computed by interpretation algorithms to explain or reveal how deep models make decisions, such as indications of the discriminative features used for a decision [137], or the importance of each training sample as its contribution to inference [91]. Model interpretability, on the other hand, refers to an intrinsic property of a deep model: the degree to which the inference results of the model are predictable or understandable to human beings [47]. In practice, one can apply a trustworthy interpretation algorithm (a notion introduced below) to evaluate model interpretability by matching the interpretations, i.e., the results of the interpretation algorithm applied to the deep model, against human-labeled results when available, such as [24]. In this way, interpretability can be compared across different models. More evaluation approaches are reviewed and introduced later.

Interpretation algorithms and the taxonomy As there is no formal or well-agreed definition of how to interpret a deep model, interpretation algorithms are usually designed with different principles, such as

  • To highlight the important parts of input features on which the deep model mainly relies, using gradients [155], perturbations [55], proxy explainable models [137] and other methods;

  • To investigate the internals of deep models and understand how they make decisions, by visualizing intermediate features [191, 197] or by feeding counterfactual examples and investigating the resulting changes [64];

  • To analyze the training data by assessing their individual contributions [91], estimating their learning difficulty [21] or detecting mislabeled samples [128].

This paper reviews the recent interpretation algorithms and proposes a novel taxonomy for categorizing them. In brief, the proposed taxonomy has three orthogonal dimensions: (1) representations of interpretations, e.g., input feature importance or the training samples’ influences; (2) the type of target model the algorithm can be applied to, e.g., differentiable models, or models with specific architectures or other properties; and (3) the relation between the interpretation algorithm and the deep model, e.g., a closed-form expression of the model or a composition of the model. Recent interpretation algorithms can all be placed within the proposed three-dimensional taxonomy, which is presented in detail in Sect. 3.

Evaluations on trustworthiness of interpretation algorithms and model interpretability There are two evaluations: one on the trustworthiness of interpretation algorithms, and the other on model interpretability.

From previous reviews and outlooks on interpretations [29, 47, 81, 107, 144], we summarize the most important desideratum for interpretation algorithms, i.e., trustworthiness. “Trustworthiness” here means that the interpretation results are reliable and faithful to arbitrary deep models. That is to say, a trustworthy interpretation algorithm produces explanations that loyally reveal the model’s behaviors, instead of giving results that are irrelevant to the model or merely what humans desire to see. Only with a trustworthy interpretation algorithm do evaluations of model interpretability become meaningful. In Fig. 1, we illustrate the connections between these key concepts, and we further elaborate on them in Sect. 2.

Fig. 1: Scheme about interpretations, interpretation algorithms, trustworthiness, model interpretability and the corresponding evaluations

The trustworthiness of interpretation algorithms can be assessed by purpose-built evaluation approaches to assure the use of interpretations, and the interpretability of deep models can be evaluated and measured to identify the most interpretable models. Both evaluations face remaining challenges, introduced below.

  • Quantifying the trustworthiness of interpretation algorithms is challenging due to the lack of a proper definition of this quantity and of well-defined metrics. Although trustworthiness can be understood subjectively as an algorithm producing interpretations that are loyal to the model, the optimal metric is still under study. Simple metrics such as accuracy, precision, and recall are not applicable here.

  • The difficulty of evaluating model interpretability mainly comes from the lack of ground truth. We cannot casually annotate “true” interpretations the way we annotate image labels, because in most cases interpretation labels do not exist, or such annotation would lack objectivity. Furthermore, obtaining human-labeled ground truth for interpretations is labor- and time-consuming, which does not scale to large datasets.

Even in this complex and difficult situation, several efficient and effective approaches have been proposed to evaluate the trustworthiness of interpretation algorithms and model interpretability. The former is mainly based on perturbation evaluations [70, 127, 143] or proxy models [9, 183], while the latter is based on expert ground truths [24] or cross-model explanations [104]. In Sect. 4, we comprehensively review the evaluation approaches for both the trustworthiness of interpretation algorithms and model interpretability.

Overview We describe the organization of this survey paper: We introduce the key concepts, including interpretation algorithms, interpretations, model interpretability, and their relations, in Sect. 2. We present the proposed taxonomy for interpretation algorithms and introduce the algorithms accordingly in Sect. 3. Evaluations on the trustworthiness of interpretation algorithms and on model interpretability are introduced in Sect. 4. Section 5 discusses the connections between interpretations and other research topics. Finally, we introduce several open-source libraries for interpretations and related topics in Sect. 6.

2 Main concepts: interpretations and interpretability

The fuzziness of the main concepts, interpretation and interpretability, leads to much confusion and hinders academic progress. In this section, we clarify these research targets and introduce the definitions of interpretations, interpretation algorithms, and model interpretability, together with the notion of trustworthiness.

2.1 Interpretation algorithms and trustworthiness

We first introduce interpretation algorithms. A deep model needs interpretations because the inference output of the model does not reveal the reasoning inside. An interpretation algorithm is thus designed to produce interpretations that explain the model’s decisions and give insight into its internal reasoning and rationale. As mentioned previously, there is no formal or well-agreed definition of how to interpret a deep model. We therefore adopt a loose definition: any outcome produced by an interpretation algorithm that helps to understand the model is considered an interpretation.

Instead of directly discussing the interpretations, we introduce categories of interpretation algorithms, as they provide different information to help humans understand deep models. For example, an algorithm estimating the training samples’ learning difficulty helps to inspect the model’s training process; an algorithm computing feature importance helps to identify the most important features the model uses to make decisions; and an algorithm investigating the intermediate results of a neural network helps to understand the model’s decision-making process. We present a novel taxonomy to categorize the existing and potential algorithms and review the corresponding algorithms in Sect. 3.

Interpretations can then lead to a discussion of whether the model is interpretable. Before that discussion, however, we must first ensure that the interpretation algorithm is trustworthy so that its interpretations can be trusted. The notion of trustworthiness is proposed to cover the most important desiderata from previous review works [29, 107, 122], and can be defined as follows:

  • An interpretation algorithm is trustworthy if it properly reveals the underlying rationale of a model making decisions.

In this definition, the underlying rationale covers all categories of information that help to understand the model, e.g., how the model makes decisions, or the reasoning behind its decisions. The word properly addresses the issue that the intrinsic rationale of the model is usually exposed by an extrinsic algorithm, which is not part of the target model being interpreted. That is to say, as an additional module used to diagnose the model, the interpretation algorithm risks giving explanations that are independent of the model. A sanity check [3] inspected several gradient-based interpretation algorithms by randomizing parts of the parameters in the model and observing how the interpretations change; a few algorithms always produced the same interpretations despite significant changes to the parameters. Trustworthiness requires recovering the rationale of the model, whether or not the model makes correct decisions, instead of yielding information that is independent of the model. Although this definition of trustworthiness is not mathematically rigorous, the idea behind it is clear. Several evaluations for assessing trustworthiness exist and will be introduced in Sect. 4.1.

Trustworthiness of different interpretation algorithms Due to differences in the representation of explanation results and in the type of models to be interpreted, the amount of information exposed by interpretation algorithms may differ. Trustworthiness is only required for the information that is actually explained. It would be easy to achieve trustworthiness if an algorithm explained only a tiny amount of information about the deep model, but such an explanation would rarely be useful. Trustworthiness is thus an ad hoc requirement with respect to each interpretation algorithm, defined to guarantee that the information the algorithm provides can be trusted.

Relation to self-interpretable models To complete the discussion of trustworthy interpretation algorithms, we note that many researchers are working on effective self-interpretable models, to name a few, Capsule Models [73, 142], Neural Additive Models [5] and CALM [88]. We consider this a particular case within our discussion: a self-interpretable model contains both the model and an intrinsic interpretation algorithm. Moreover, if the model makes decisions based on its intrinsic interpretations, then this interpretation component is without doubt trustworthy.

Fully-interpretable models We also discuss fully interpretable models here to better understand interpretations and the trustworthiness of interpretation algorithms for black-box deep models. We informally define a model as fully interpretable if it is totally understandable by humans. The following models are considered fully interpretable without much controversy: a limited set of rules, a depth-limited decision tree, and a sparse linear model.

Comparison to fully and self-interpretable models Comparing fully interpretable models, self-interpretable models, and black-box deep models, we can see: (1) fully interpretable models can be totally understood by inspecting themselves; (2) self-interpretable models provide explanations carrying a certain amount of information via an intrinsic interpretation algorithm, and the interpretation algorithms for both fully and self-interpretable models are trustworthy; (3) for black-box deep models, it is hard to provide such interpretations and much harder to guarantee that the interpretation algorithms are trustworthy. Fortunately, the interpretations for black-box models may differ in scope and need not provide fully interpretable explanations; trustworthiness only guarantees that the information an interpretation algorithm does provide is correct.

2.2 Model interpretability

From the perspective of industrial demands, model interpretability is sometimes more important than other metrics such as accuracy, because of safety and social issues in domains such as autonomous driving, healthcare, criminal justice, financial services, and many others. Though no mathematical definition has been proposed, general agreement has been reached on the expression proposed by [47]. We restate their definition of model interpretability as follows.

  • The model interpretability is the ability (of the model) to explain or to present in understandable terms to a human.

According to other review works [29, 115], “the interpretability of a model is higher if it is easier for a person to reason and trace back why a prediction was made by the model. Comparatively, a model is more interpretable than another model if the former’s decisions are easier to understand than the decisions of the latter”.

From the definition of model interpretability, the expression understandable to a human is a subjective notion. It is human-centered [47, 94], which complicates the research problem of quantitatively measuring and comparing the interpretability of various models. Until now, there have been few metrics for quantifying model interpretability; Sect. 4.2 will introduce the existing evaluation approaches on model interpretability.

We give an intuitive example to show that different models may have different interpretability. Take image classification [43, 175] as the task, and a trustworthy algorithm analyzing input-output relations as the interpretation algorithm. Consider two models whose produced interpretations highlight different image pixels. A model is easier to understand if its interpretation aligns with the object parts in the image, and harder to understand if its interpretation focuses on the background or on another co-occurring object when recognizing the target object. Although the trustworthy algorithm reveals the rationale of both models, we prefer the former because its way of making decisions is more directly aligned with human understanding.

2.3 Toward interpretable deep learning

This section has defined the trustworthiness of interpretation algorithms and model interpretability. We now emphasize, with more explicit remarks, several points that often cause confusion in the field.

Interpretation algorithms, interpretations and model interpretability The notions of interpretation algorithms, interpretations, and model interpretability should be distinguished. Among these, only interpretability is a property of the model. Interpretation algorithms are designed to analyze the black-box model, and they must be trustworthy; otherwise, the interpretations do not reveal the model’s internals. Their relations and differences are illustrated in Fig. 1.

Summary of desiderata for interpretations In this section, the proposed desideratum for interpretation algorithms is trustworthiness. Researchers [29, 47, 81, 101, 107, 183] have also proposed many other desiderata for interpretations, interpretation algorithms, or interpretability, such as fairness, privacy, reliability, robustness, causality, trust, fidelity, faithfulness, transferability, informativeness, transparency, plausibility, satisfaction, accountability, etc. However, we note that (1) properties such as informativeness, plausibility, and satisfaction refer to whether the interpretation is understandable to humans, and differ from the trustworthiness in this paper, which refers to algorithms; (2) properties such as reliability, robustness, trust, fidelity, faithfulness, and transparency are similar to trustworthiness or can be subsumed under its general definition; (3) properties such as causality and transparency depend on the underlying rationale in our context; (4) properties such as fairness, transferability, and privacy are standards that constrain the models; and (5) others (e.g., accountability and traceability) are more related to holistic evaluations of systems. There are slight differences and specific requirements in various scenarios, but the proposed trustworthiness applies only to interpretation algorithms.

Deep models for high-dimensional data for scientific discovery Though the initial motivation for interpretations and interpretability is to help humans understand deep models, interpretations sometimes lead to other valuable and promising findings. Deep models may be more efficient than humans at coping with high-dimensional data. From molecules [84, 133] to black holes [86], from chemistry [62] to games [152], deep models can be used to solve many problems. However, without interpretations, the knowledge discovered by deep models remains unknown to humans, or the scores obtained are not semantic and not fully understood by humans. Interpretations in these cases can help to find new intelligent patterns and discover new scientific theories. For example, from the perspective of rationale processes, interpretations can help humans understand how a model infers; a feature analysis algorithm can help to identify the most important features the model uses; and a tool for investigating the data can help find the typical or most influential samples that explain how the model makes decisions. These algorithms are all included in this paper and are discussed in the following section.

3 Interpretation algorithms: taxonomy, algorithm designs, and miscellaneous

This section introduces the interpretation algorithms in recent years, with a proposed taxonomy of three dimensions. For each algorithm, we give a brief introduction and follow the taxonomy for the categorization. A discussion is also provided for future works at the end of this section.

3.1 Taxonomy

We categorize the existing interpretation algorithms according to three orthogonal dimensions: representations of interpretations, types of target models, and the relation between the interpretation algorithm and the model. We list the options in each dimension for a better comparison.

For different applications and interpretation requirements, the representations of interpretations vary:

  • Feature (Importance). These algorithms aim at estimating the feature importance/contribution with respect to the final objective. This includes analyses of raw input data and extracted features, e.g., images, texts, audio, etc.; intermediate features inside models, e.g., the activations of neural networks; and latent features in GANs.

  • Model Response. Algorithms here generally propose to generate or find new examples and see the model’s responses, so as to investigate the model behaviors on certain patterns, prototypes, or discriminative features by which the model makes decisions.

  • Model Rationale Process. Though deep models are complex, they can be approximated by interpretable models to gain insight into the rationale process inside. Algorithms here interpret the deep model by indicating the path along which the model makes decisions.

  • Dataset. Instead of interpreting deep models, algorithms here propose to explain the data samples in the training set by showing how they affect the optimization phase of deep models.

Interpretation algorithms cope with different types of models:

  • Model-agnostic. Algorithms in this category treat models entirely as black boxes and do not investigate their internals.

  • Differentiable model. This subset of algorithms contains only algorithms that address the interpretations of differentiable models, especially neural networks. Note that model-agnostic algorithms also cover this subset.

  • Specific model. This family of algorithms can only be applied to certain types of models, e.g., convolutional neural networks (CNNs), generative adversarial networks (GANs), Graph Neural Networks (GNNs). This is a narrower family than the previous one.

Fig. 2: Illustration of relations between the interpretation algorithm and the model. Four relations are illustrated: closed-form, composition, dependence and proxy

The third dimension for categorizing interpretation algorithms is the relation between the interpretation algorithm and the model:

  • Closed-form. These algorithms derive a closed-form formula from the target model and output interpretable terms.

  • Composition. Algorithms here can be considered as components of (interpretable) models, usually obtained during training.

  • Dependence. These algorithms build new operations upon the target model after training and output interpretable terms.

  • Proxy. Unlike dependence, algorithms here obtain, via learning or derivation, a proxy model for explaining the behavior of models.

For a better illustration, the four relations between interpretation algorithms and deep models are shown in Fig. 2.

We have introduced the proposed taxonomy of three dimensions: Representation, Model Type and the Relation. In the following subsection, we will present most of the recent interpretation algorithms. We also give a categorization of all these algorithms with respect to the proposed taxonomy in Table 1.

3.2 Interpretation algorithms

LIME and model-agnostic algorithms LIME [137] presents a locally faithful explanation by fitting a set of perturbed samples near the target sample using a potentially interpretable model, such as linear models and decision trees. We define a model \(g\in G\), where G is a class of interpretable models. The domain of g is \(\{0,1\}^{d'}\) and its complexity measure is \(\varOmega (g)\). Let \(f:\mathbb {R}^d\rightarrow \mathbb {R}\) be the model being explained and \(\pi _x(z)\) be the proximity measure between a perturbed sample z and x. Finally, let \(L(f,g,\pi _x)\) be a measure of the unfaithfulness of g in approximating f in the locality defined by \(\pi _x\). LIME produces explanations by the following:

$$\begin{aligned} \xi (x)= \mathop {\mathrm { arg\,min}}_{g\in G} L(f,g,\pi _x)+\varOmega (g). \end{aligned}$$
(1)

The obtained explanation \(\xi (x)\) interprets the target sample x, with linear weights when g is a linear model. LIME is model-agnostic, meaning that it can be applied to any model. Similarly, several other model-agnostic algorithms, such as Anchors [138], SHAP [110], RISE [127], and MAPLE [130], target input features and provide feature importance or contributions to the final decision.
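
To make Eq. (1) concrete, the following is a minimal sketch of a LIME-style local surrogate for tabular inputs, assuming a generic black-box `predict_fn`; the sampling scheme, proximity kernel, and ridge penalty are illustrative choices, not the reference LIME implementation.

```python
import numpy as np

def lime_explain(predict_fn, x, num_samples=1000, kernel_width=0.75, ridge=1.0, seed=0):
    rng = np.random.default_rng(seed)
    d = x.shape[0]
    # Interpretable representation z' in {0,1}^d: 1 keeps a feature, 0 zeroes it out.
    Z = rng.integers(0, 2, size=(num_samples, d))
    y = np.array([predict_fn(z * x) for z in Z])          # black-box responses f(z)
    # Proximity kernel pi_x(z): perturbations closer to x get larger weights.
    dist = np.linalg.norm(Z - 1, axis=1) / np.sqrt(d)
    w = np.exp(-(dist ** 2) / (kernel_width ** 2))
    # Weighted ridge regression: arg min_g L(f, g, pi_x) + Omega(g) with a linear g.
    Zb = np.hstack([Z, np.ones((num_samples, 1))])        # add an intercept column
    A = Zb.T @ (w[:, None] * Zb) + ridge * np.eye(d + 1)
    coef = np.linalg.solve(A, Zb.T @ (w * y))
    return coef[:d]                                       # per-feature local importance

# Toy usage: a linear black box should be recovered up to the intercept.
f = lambda v: 3.0 * v[0] - 2.0 * v[1]
print(lime_explain(f, np.array([1.0, 1.0, 1.0])))
```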

Global interpretation algorithms Feature importance analysis is a common tool for explaining model outputs with respect to inputs. The aforementioned approaches can be categorized as feature importance analysis, but their interpretations are for individual examples, giving a unique result for each example. In contrast to these “local” interpretations, “global” interpretations provide feature importance from an overall view of the model. Global interpretations for deep models are usually built on local ones by aggregating local interpretations into global feature importance; the difference between methods resides in the aggregation approach, e.g., LIME-SP [137], NormLIME [6] and GALE [166].

Input gradient-based algorithms The input gradient attributes important features in the input domain. However, for deep nonlinear models with numerous stacked layers, the gradients may vanish or saturate during back-propagation and thus contain noise.

SmoothGrad [155] proposes to reduce this noise by averaging the gradients over a number of noisy copies of the input. Taking visual tasks as an example: given an input image x, the neural network computes a class activation function \(S_c\) for class \(c\in C\). A sensitivity map can be constructed by calculating the gradient of \(S_c\) with respect to the input x: \(M_c(x) = \partial S_c(x)/\partial x\). However, such saliency maps are often noisy because of sharp fluctuations of the derivative. To smooth the gradients, multiple Gaussian noises are added to the input image and the resulting saliency maps are averaged. SmoothGrad is defined as follows:

$$\begin{aligned} \hat{M_c}(x) = \frac{1}{n}\sum _{1}^{n}M_c(x+\mathcal {N}(0,\,\sigma ^{2})). \end{aligned}$$
(2)
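
A minimal PyTorch sketch of Eq. (2), assuming `model` maps a batch of inputs to class logits; the noise level and sample count are illustrative.

```python
import torch

def smoothgrad(model, x, target_class, n=50, sigma=0.15):
    """x: a single input tensor, e.g., an image of shape (C, H, W)."""
    model.eval()
    grads = torch.zeros_like(x)
    for _ in range(n):
        noisy = (x.detach() + sigma * torch.randn_like(x)).requires_grad_(True)
        score = model(noisy.unsqueeze(0))[0, target_class]   # S_c(x + N(0, sigma^2))
        score.backward()
        grads += noisy.grad
    return grads / n                                          # averaged sensitivity map
```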

Integrated Gradients (IG) [160] aggregates the gradients at inputs lying on the straight line between a baseline and the input. Let F be a neural network, x be the input, and \(x'\) be the baseline input, which can be a black image for computer vision models or a vector of zeros for word embeddings in text models. The integrated gradient along the ith dimension is

$$\begin{aligned} \text {IG}_i(x) = (x_i - x_i')\times \int _{\alpha =0}^{1}\frac{\partial F(x'+\alpha \times (x-x'))}{\partial x_i}\mathrm{d}\alpha . \end{aligned}$$
(3)

An axiom called completeness is satisfied, which states that the attributions add up to the difference between the output of F at input x and baseline \(x'\).
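
A minimal PyTorch sketch of Eq. (3) using a Riemann approximation of the integral; `model`, `baseline`, and the number of steps are assumptions for illustration.

```python
import torch

def integrated_gradients(model, x, baseline, target_class, steps=50):
    """Riemann approximation of Eq. (3); baseline has the same shape as x."""
    model.eval()
    total = torch.zeros_like(x)
    for alpha in torch.linspace(0.0, 1.0, steps):
        point = (baseline + alpha * (x - baseline)).detach().requires_grad_(True)
        model(point.unsqueeze(0))[0, target_class].backward()
        total += point.grad
    return (x - baseline) * total / steps
```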

Other input gradient-based algorithms include DeepLIFT [150], VarGrad [3], GradSHAP [110], and FullGrad [156].

Layer-wise relevance propagation Layer-wise relevance propagation (LRP) [16] is also an input feature attribution algorithm. Instead of using proxy models, perturbations, or gradients, LRP recursively computes a Relevance score for each neuron of each layer, so as to understand the contribution of a single pixel of an image x to the prediction function f(x) in an image classification task.

$$\begin{aligned} f(x) = \cdots = \sum _{d=1}^{V^{(l+1)}} R_d^{(l+1)} = \sum _{d=1}^{V^{(l)}} R_d^{(l)} = \cdots = \sum _{d=1}^{V^{(1)}} R_d^{(1)}, \end{aligned}$$
(4)

where \(R_d^{(l)}\) is the Relevance score of the dth neuron at the lth layer, \(V^{(l)}\) indicates the dimension of the lth layer, and \(V^{(1)}\) is the number of pixels in the input image. Iterating Eq. (4) from the last layer, which is the classifier output f(x), to the input layer x consisting of image pixels, then yields the contribution of each pixel to the prediction result. Based on the idea of back-propagating Relevance scores, LRP can be extended to other neural networks, even those with special and complex nonlinear operations [27, 118]. To adapt LRP to specific tasks, many variants have been proposed, such as Contrastive LRP [67], which produces pixel-wise explanations of instance objects; Softmax-Gradient LRP [79], which gives explanations focusing on discriminating possible objects in the images; and Relative Attributing Propagation (RAP) [123], which focuses on both positive and negative features. Furthermore, extended LRPs [34, 169] can help to interpret Transformer models [45, 48, 159].
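
A simplified sketch of an LRP epsilon-style rule for a small fully connected ReLU network given as a list of `(W, b)` pairs; a real implementation distinguishes layer types and output handling, so this only illustrates the recursive redistribution behind Eq. (4).

```python
import numpy as np

def lrp_epsilon(layers, x, eps=1e-6):
    """layers: list of (W, b) for a ReLU network; x: input vector."""
    # Forward pass, storing the input activation of every layer.
    activations = [x]
    for W, b in layers:
        activations.append(np.maximum(0.0, W @ activations[-1] + b))
    # Backward pass: start from the output scores and redistribute relevance.
    R = activations[-1]
    for (W, b), a in zip(reversed(layers), reversed(activations[:-1])):
        z = W @ a + b + eps          # stabilized pre-activations z_j
        s = R / z                    # relevance message per upper-layer neuron
        R = a * (W.T @ s)            # R_i = a_i * sum_j w_ji * R_j / z_j
    return R                         # feature-wise relevance, approximately summing to f(x)
```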

CAM and variants Given a CNN and an image classification task, the class activation map (CAM) [197] can be derived from the operations at the last layers of the CNN and shows the important regions that affect model decisions. Specifically, for a given category c, we expect the unit corresponding to a pattern of that category to be activated in the feature map within its receptive field. The weights in the classifier indicate the importance of each feature map for classifying category c, so a weighted sum of visual patterns highlights the important regions for that category. Let \(f_k(x,y)\) denote the activation of unit k in the last convolutional layer at spatial location (x, y), \(F_k=\sum _{x,y}f_k(x,y)\) be the global average pooling for unit k, and \(w_k^c\) be the weight corresponding to class c for unit k, so that \(\sum _{k}w_k^c F_k\) is the input to the softmax for class c. Then the activation map for class c is:

$$\begin{aligned} M_c(x,y) = \sum _{k}w_k^c f_k(x,y). \end{aligned}$$
(5)

GradCAM [145] further looks at the gradients flowing into the convolutional layer to weight the activation maps. Let \(y^c\) be the score for class c before the softmax and \(A^k\) be the feature map activations of unit k in a convolutional layer; the neuron importance weight \(\alpha _{k}^c\) is the global-average-pooled gradient of \(y^c\) with respect to \(A^k\):

$$\begin{aligned} \alpha _{k}^c = \frac{1}{Z}\sum _{i}\sum _{j}\frac{\partial y^c}{\partial A_{i,j}^k}. \end{aligned}$$
(6)

The localization map is a weighted combination of activation maps:

$$\begin{aligned} L_{Grad-CAM}^c = ReLU(\sum _{k}\alpha _k^c A^k). \end{aligned}$$
(7)
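
A minimal PyTorch sketch of Eqs. (6)-(7) using forward/backward hooks on a chosen convolutional layer; the hook-based access and the layer choice are assumptions, not the reference implementation.

```python
import torch
import torch.nn.functional as F

def grad_cam(model, conv_layer, x, target_class):
    """x: an image tensor (C, H, W); conv_layer: the module whose maps are used."""
    feats, grads = [], []
    h1 = conv_layer.register_forward_hook(lambda m, inp, out: feats.append(out))
    h2 = conv_layer.register_full_backward_hook(lambda m, gin, gout: grads.append(gout[0]))
    try:
        model.eval()
        logits = model(x.unsqueeze(0))
        logits[0, target_class].backward()                    # gradients of y^c w.r.t. A^k
        A, dA = feats[0][0], grads[0][0]                      # both of shape (K, H', W')
        alpha = dA.mean(dim=(1, 2))                           # Eq. (6): pooled gradients
        cam = F.relu((alpha[:, None, None] * A).sum(dim=0))   # Eq. (7): weighted sum + ReLU
        return cam / (cam.max() + 1e-8)
    finally:
        h1.remove()
        h2.remove()
```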

ScoreCAM [174] dispenses with gradient information and instead assigns importance to each activation map via the notion of Increase of Confidence. Given an image model \(Y=f(X)\) that takes an image X and outputs logits Y, the kth channel of convolutional layer l is denoted \(A_l^k\). With baseline image \(X_b\) and category c, the contribution of \(A_l^k\) toward Y is:

$$\begin{aligned} C(A_l^k) = f^c(X\circ H_l^k) - f^c(X_b), \end{aligned}$$
(8)

where \(H_l^k = s(Up(A_l^k))\). \(Up(\cdot )\) is the operation that upsamples \(A_l^k\) into the input size and s normalizes each element into [0, 1]. ScoreCAM is defined as:

$$\begin{aligned} L_{Score-CAM}^c = ReLU(\sum _{k}\alpha _k^c A_l^k), \end{aligned}$$
(9)

where \(\alpha _k^c = C(A_l^k)\).

More CAM variants have been recently proposed, e.g., GradCAM++ [32], CBAM [178], Respond-CAM [196], and Ablation-CAM [44].

Perturbation-based algorithms To investigate important features in the input, a straightforward way is to measure the effect of perturbations applied to the input [54, 55]. The idea is simple: random perturbations of the features lead to different changes in the model’s predictions, with larger changes observed for more important features. Note that perturbation can also be used for evaluating the trustworthiness of interpretation algorithms when no interpretation ground truth is available [143, 172].
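
A minimal occlusion-style sketch of this idea: slide a patch over the image, re-query the model, and use the drop in the target-class probability as an importance score; the patch size, stride, and fill value are illustrative.

```python
import torch

def occlusion_map(model, x, target_class, patch=16, stride=16, fill=0.0):
    """x: an image tensor (C, H, W); returns a coarse importance grid."""
    model.eval()
    with torch.no_grad():
        base = torch.softmax(model(x.unsqueeze(0)), dim=1)[0, target_class]
        _, H, W = x.shape
        heat = torch.zeros(H // stride, W // stride)
        for i in range(0, H - patch + 1, stride):
            for j in range(0, W - patch + 1, stride):
                occluded = x.clone()
                occluded[:, i:i + patch, j:j + patch] = fill
                p = torch.softmax(model(occluded.unsqueeze(0)), dim=1)[0, target_class]
                heat[i // stride, j // stride] = base - p     # importance ~ probability drop
    return heat
```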

Counterfactual examples Using counterfactual examples to explain model behaviors is also an important direction for understanding black boxes. Generally, counterfactual examples have changes in the input that are as small as possible but completely change the decision made by the model; the changes in the input then serve as a clue for explaining the model’s behavior. Most approaches to generating counterfactual examples, such as FIDO [31], DiCE [121], and several others [64, 97], are based on optimization with sparsity constraints or toward the smallest changes in the input. Using counterfactual examples to explain model behaviors also falls under causal inference [126], which is considered a new perspective on model interpretability [120, 179]. Detailed reviews on counterfactual explanations can be found in [12, 167, 173].

Adversarial examples Adversarial examples are closely related to counterfactual ones and use similar optimization methods, but they are used to reveal the vulnerability of deep models and often to attack AI systems. Adversarial examples in vision tasks are usually imperceptible changes in the images that mislead the model’s decision. Note that analyses of adversarial examples [58, 77] show connections to the understanding of the deep learning process and the robustness of the trained deep model.

TCAV Given a set of examples representing a concept of human interest (such as an object, a pattern, or a color), TCAV [87] seeks a vector in the space of activations at some layer to represent this concept. Precisely, a concept activation vector (CAV) is defined as the normal to a hyperplane that separates examples with and without the concept in the activation space. For an example of a particular class, the directional derivative along the CAV contributes a score if it is positive, and the ratio of examples in this class with positive directional derivatives over all examples in the class is defined as the TCAV score. A CAV thus captures a semantic concept learned by the intermediate layers of a deep model that contributes to the predictions, while TCAV quantitatively measures the contribution of this concept.

Prototype To explain classification models, finding a typical exemplar for each category is also effective and direct; humans can better understand a model when it identifies featured prototypes to make decisions. Chen et al. [35] proposed ProtoPNet, which explains the deep model by finding prototypical parts of predicted objects and gathering evidence from the prototypes to make final decisions. Another method, ABELE [69], generates exemplar and counter-exemplar images, labeled with the class identical to, and different from, the class of the image to explain, together with a saliency map highlighting the areas of the image contributing to its classification.

As a technique for generating prototypes, activation maximization generally computes the prototypes through an optimization process:

$$\begin{aligned} \max _{{\varvec{x}}} \ \log p(y_c|{\varvec{x}}) - \lambda \Vert {\varvec{x}}\Vert ^2, \end{aligned}$$
(10)

where \(p(y_c|{\varvec{x}})\) is the probability given by a deep model with \({\varvec{x}}\) as input, and the second term is a constraint for generating the prototype. The constraint can be replaced by many other choices [49, 113, 124, 153]. A tutorial on this direction is available [119]. More works related to prototypes or exemplars for interpretations can be found in [23, 26, 103, 116].
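
A minimal sketch of the optimization in Eq. (10) by gradient ascent on the input with an L2 penalty; the initialization and hyperparameters are illustrative, and practical methods use stronger regularizers, as in [49, 113, 124, 153].

```python
import torch

def activation_maximization(model, target_class, shape=(3, 224, 224),
                            steps=200, lr=0.1, lam=1e-4):
    model.eval()
    x = (0.01 * torch.randn(shape)).requires_grad_(True)
    opt = torch.optim.Adam([x], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        log_prob = torch.log_softmax(model(x.unsqueeze(0)), dim=1)[0, target_class]
        loss = -log_prob + lam * x.norm() ** 2     # maximize log p(y_c|x) - lambda * ||x||^2
        loss.backward()
        opt.step()
    return x.detach()                               # a prototype-like input for class c
```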

Proxy models for rationale process The reasoning process, or underlying rationale, of deep models is complex due to their nonlinearity and enormous computations, and it is difficult for humans to know the exact, semantically meaningful steps of the rationale process inside the black boxes. However, this rationale process can be approximated by proxy graph models [190] or decision trees [192], which provide a decision-making path that is more interpretable to humans. Moreover, deep neural networks can be combined with decision forest models [92] or distilled into a soft decision tree [57]. A model-agnostic approach for interpreting the rationale process, named BETA [96], allows learning (with optimality guarantees) a small number of compact decision sets, each of which explains the behavior of the black-box model in a specific, well-defined region of the feature space.

Forgetting events Forgetting events are defined by [164] for analyzing training examples through training dynamics. Given a dataset \(D=\{(x_i,y_i)\}_i\), after t steps of SGD, example \(x_i\) undergoes a forgetting event if it is misclassified at step \(t+1\) after having been correctly classified at step t. Forgetting events signify samples’ interactions with decision boundaries, and such samples play a part equivalent to support vectors in the support vector machine paradigm. Unforgettable examples are samples learnt at some step \(t^*<\infty \) and never misclassified for all \(k \ge t^*\); they are easily recognizable samples that contain obvious class attributes. In contrast, examples with the most forgetting events are ambiguous, lacking clear characteristics of a certain class, and some are noisy samples.
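
A minimal sketch of counting forgetting events during training; it assumes the training loop reports, for each batch, the example indices and whether they are currently classified correctly.

```python
import numpy as np

class ForgettingTracker:
    def __init__(self, num_examples):
        self.prev_correct = np.zeros(num_examples, dtype=bool)
        self.counts = np.zeros(num_examples, dtype=int)

    def update(self, indices, correct):
        """indices: example ids of the current batch; correct: boolean array for them."""
        correct = np.asarray(correct, dtype=bool)
        forgotten = self.prev_correct[indices] & ~correct     # correct -> misclassified
        self.counts[indices] += forgotten.astype(int)
        self.prev_correct[indices] = correct
```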

Dataset cartography Dataset cartography [161] examines two measures for each sample during the training process: the model’s confidence in the true class and the variability of that confidence across epochs. Training examples can then be categorized as easy-to-learn, hard-to-learn, or ambiguous based on their position in this two-dimensional map. Consider a training dataset \(D=\{(x_i,y_i^*)\}_{i=1}^N\), where \(x_i\) is the ith sample and \(y_i^*\) is its true label. After training for E epochs, the confidence is defined as the mean probability of the true label across epochs:

$$\begin{aligned} \hat{\mu _i} = \frac{1}{E}\sum _{e=1}^{E}p_{\theta ^{(e)}}(y_i^*|x_i), \end{aligned}$$
(11)

where \(p_{\theta ^{(e)}}\) is the probability with parameters \(\theta ^{(e)}\) at the end of the eth epoch. The variability is the standard deviation of \(p_{\theta ^{(e)}}(y_i^*|x_i)\):

$$\begin{aligned} \hat{\sigma _i} = \sqrt{\frac{\sum _{e=1}^{E}(p_{\theta ^{(e)}}(y_i^*|x_i)-\hat{\mu _i})^2}{E}}. \end{aligned}$$
(12)
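
A minimal sketch of Eqs. (11)-(12), assuming the true-class probabilities have been recorded for every example at the end of each epoch.

```python
import numpy as np

def cartography(true_class_probs):
    """true_class_probs: array of shape (num_epochs, num_examples)."""
    confidence = true_class_probs.mean(axis=0)     # Eq. (11): mean over epochs
    variability = true_class_probs.std(axis=0)     # Eq. (12): std over epochs
    # High confidence / low variability: easy-to-learn; high variability: ambiguous.
    return confidence, variability
```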

AUM Another method for analyzing the training dynamics is proposed to compute the area under the margin (AUM) [128]:

$$\begin{aligned} \text {AUM}({\varvec{x}}, y) = \frac{1}{T} \sum _{t=1}^T ( z_y^{(t)}({\varvec{x}}) - \max _{i \ne y} z_i^{(t)}({\varvec{x}}) ), \end{aligned}$$
(13)

where \(z_i^{(t)}({\varvec{x}})\) is the logit of the ith class computed by the model at the tth epoch of training for the example \({\varvec{x}}\).
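
A minimal sketch of Eq. (13), assuming the per-epoch logits have been recorded for all training examples.

```python
import numpy as np

def area_under_margin(logits, labels):
    """logits: (T, N, C) over T epochs; labels: (N,) assigned labels."""
    T, N, _ = logits.shape
    idx = np.arange(N)
    assigned = logits[:, idx, labels]               # z_y^(t) for every epoch and example
    others = logits.copy()
    others[:, idx, labels] = -np.inf
    largest_other = others.max(axis=2)              # max_{i != y} z_i^(t)
    return (assigned - largest_other).mean(axis=0)  # AUM per example; low values suggest mislabeling
```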

Influence functions Influence functions [91] identify the training samples most responsible for a model prediction by upweighting a sample by some small value and analyzing the effect on the parameters and on the loss at the target sample. Given input space X and output space Y, we have training data \(z_1, \dots , z_n\), where \(z_i=(x_i,y_i)\in X\times Y\). Let \(L(z, \theta )\) be the loss, where \(\theta \in \varTheta \) are the parameters. The optimal parameters are given by \(\hat{\theta }=\mathop {\mathrm {arg\,min}}_{\theta \in \varTheta }\frac{1}{n}\sum _{i=1}^{n}L(z_i,\theta )\). The influence of upweighting training point z on the loss at the test point \(z_{test}\) is:

$$\begin{aligned} I_{up,loss}(z,z_{test})=-\nabla _\theta L(z_{test},\hat{\theta })^T {H_{\hat{\theta }}}^{-1} \nabla _\theta L(z,\hat{\theta }), \end{aligned}$$
(14)

where \(H_{\hat{\theta }}=\frac{1}{n}\sum _{i=1}^{n}\nabla _\theta ^2L(z_i,\hat{\theta })\). Based on influence functions, several techniques [38, 90] have been proposed with improvement.
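
A minimal sketch of Eq. (14) for a model whose parameter vector is small enough to form the Hessian explicitly; practical implementations rely on Hessian-vector products and approximations as in [91]. `loss_fn` is an assumed per-example loss.

```python
import torch

def influence_up_loss(loss_fn, train_points, z, z_test, theta):
    """theta: 1-D parameter tensor at the optimum; loss_fn(z, theta) -> scalar tensor."""
    theta = theta.detach().clone().requires_grad_(True)
    n = len(train_points)
    risk = lambda t: sum(loss_fn(zi, t) for zi in train_points) / n
    H = torch.autograd.functional.hessian(risk, theta)               # Hessian of the empirical risk
    g_test = torch.autograd.grad(loss_fn(z_test, theta), theta)[0]
    g_train = torch.autograd.grad(loss_fn(z, theta), theta)[0]
    return -g_test @ torch.linalg.solve(H, g_train)                  # Eq. (14)
```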

Contributions of long-tailed training examples Instead of identifying mislabeled or easy/difficult-to-learn samples from the training set, more theoretical works focus on detecting long-tailed examples and outliers [28, 52, 53]. Most of them investigate the connections between the memorization capacity of deep models [187] and the learning process, in order to understand the contributions of training examples, including long-tailed ones and outliers.

Interpretations on GNNs Graph Neural Networks (GNNs) are a powerful tool for learning tasks on structured graph data. Like other deep learning models, GNNs behave as black boxes and require explanations of their prediction results and rationale processes. Without requiring modification of the underlying GNN architecture, GNNExplainer [184] leverages the recursive neighborhood-aggregation scheme to identify important graph pathways and to highlight relevant node feature information passed along the edges of these pathways. Recently, more research has focused on the interpretation of GNN models, such as GraphLIME [76], CoGE [51], counterfactual explanations on GNNs [18], and others [20, 111, 132].

Interpretations on GANs Generative adversarial networks (GANs) are a popular class of generative models based on two adversarial networks, where one generates synthesized examples and the other tries to distinguish generated examples from natural ones. Interpretations of GANs mainly search for semantically meaningful directions. Using labeled semantics, Bau et al. [25] proposed GAN dissection to find semantic neurons in generative models and to modify the semantics in generated images. Instead of relying on labels, Voynov et al. [171] found semantically meaningful directions in an unsupervised way from the intermediate layers of generative models. Similarly, Shen et al. [149] proposed a closed-form factorization method for identifying semantic neurons. Note that there are other methods for explaining generative models [131, 170, 180].

Information flow In some deep learning models, there are multiplicative scalar weights that control information flow in some parts of a network. The most common examples are attention [17] and gating:

$$\begin{aligned} c^{att} = \sum _i \alpha ^{att}_i h_i, \qquad c^{gate} = \alpha ^{gate} h \end{aligned}$$
(15)

The attention weights \(\alpha ^{att}\) (\(\sum _i \alpha ^{att}_i = 1\)) and the gate values \(\alpha ^{gate}\) (\(\alpha ^{gate} \in [0,1]\)) are usually interpretable because their values represent the strength of the corresponding information pathways. Attention and gating are frequently used in NLP models, and there have been plenty of works aiming to understand models through these weights, such as Rollout [2], Seq2Seq-Vis [157] and others [61, 158], or to investigate the reliability of using them as explanations [82, 148, 177]. These ideas have also been used in Vision Transformers [48] for explaining image classification models [34, 185] or bi-modal transformer models [33].
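
A minimal sketch of reading attention weights as an interpretation signal, using a standard PyTorch multi-head attention module on a toy sequence; aggregating per-layer weights across a full Transformer, e.g., with attention rollout [2], is left out for brevity.

```python
import torch
import torch.nn as nn

attn = nn.MultiheadAttention(embed_dim=64, num_heads=4, batch_first=True)
x = torch.randn(1, 10, 64)                       # (batch, tokens, dim), a toy sequence
out, weights = attn(x, x, x, need_weights=True)  # weights: (batch, tokens, tokens), head-averaged
token_importance = weights[0].mean(dim=0)        # average attention received by each token
print(token_importance)
```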

Self-generated explanations Using text generation techniques, a model can explicitly generate human-readable explanations for its own decisions. A joint output-explanation model is trained to produce a prediction and simultaneously generate an explanation of the reason for that prediction [14, 93, 109]. This requires some form of supervision to train the explanation part of the model.

Inductive biases toward interpretation modules Different from post-hoc explanations after the optimization process, some works focus on designing inductive biases during training to encourage the model to be more interpretable. By simple abstraction, the objective function for this purpose can be written as

$$\begin{aligned} Loss = L(f(x), y) + \alpha R, \end{aligned}$$
(16)

where f(x) represents the deep model output with x as input, y is the ground truth, L is the loss function (e.g., the cross entropy for a standard supervised classification problem), and R is an objective term that biases the model toward interpretability. Various approaches [46, 114, 140, 191] have been proposed to improve interpretability during training. More encouragingly, Sabour et al. [142] designed a self-interpretable deep model where each neuron outputs semantic features.
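
A minimal sketch of Eq. (16), where R is illustrated by an L1 penalty on input gradients; this is just one possible choice of R for illustration, not a specific published method.

```python
import torch
import torch.nn.functional as F

def interpretable_loss(model, x, y, alpha=0.01):
    """Implements Loss = L(f(x), y) + alpha * R with R = mean |d L / d x|."""
    x = x.detach().clone().requires_grad_(True)
    task_loss = F.cross_entropy(model(x), y)
    grads = torch.autograd.grad(task_loss, x, create_graph=True)[0]
    R = grads.abs().mean()                         # encourage sparse/smooth input sensitivity
    return task_loss + alpha * R                   # backpropagate this during training
```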

Table 1 Categorization of interpretation algorithms with respect to the proposed taxonomy
Table 2 List of interpretation algorithm publications

3.3 Categorization and discussion

We have introduced a large number of typical interpretation algorithms and categorized them according to the proposed taxonomy, so as to provide a clear picture of this research field. We hope the taxonomy can shed light on future improvements and extensions of methods for explaining (deep) learning models. We show the categorization of all these algorithms with respect to the proposed taxonomy in Table 1, and gather the interpretation algorithms according to this categorization in Table 2 for a quick glimpse.

Table 1 shows that there are many methods with the Feature representation and only a few with the Rationale one, and many Proxy and Dependence relations but few Closed-Form ones. We argue that both observations are due to the challenge of analyzing complex deep neural networks: the rationale and the closed form of deep models are still hard to understand or even approximate. From the categorization, we also point out the blanks that may indicate unexplored directions for future work. For example, no Model-Agnostic algorithm has the Composition relation with models. While input-output sensitivity analysis methods are actively developed, improving input-output interpretations remains a promising direction. Moreover, we should note that adversarial attacks target not only trained models [30] but also the interpretations [8, 56, 71]. We leave further investigations for future work.

3.4 Interpretations on specific application domains

We do not explicitly categorize the interpretation algorithms according to their application domains because (1) an algorithm used in one specific domain may also be applicable to a broader scope with little modification, especially for model-agnostic algorithms; and (2) for model-specific algorithms, the categorization by model type generally overlaps with the one by application domain. For completeness, we discuss recent works on deep model interpretations in the following domains: reinforcement learning, recommendation systems, and medical applications. These applications differ slightly from image classification or sentiment analysis and may require interpretations in a particular form, but most algorithms introduced previously can be used directly.

3.4.1 Deep reinforcement learning (DRL)-related domains

Reinforcement learning (RL) [85] is an area of machine learning concerned with how intelligent agents ought to take actions in an environment in order to maximize a notion of cumulative reward. Deep learning methods have recently enabled RL to tackle decision-making problems that were previously intractable, such as playing games [117, 151, 168] and training robots [10, 99, 100]. DRL also shows potential for applications in healthcare, finance, and business management [13, 105], where human security and property safety issues must be considered, leading to demands for explainable RL [134].

According to recent surveys [13, 105], DRL methods are generally based on DNNs to approximate value functions or find policies. Most methods learn the objectives directly from raw inputs, especially in visual tasks where images are used as inputs for estimating value functions. For those methods, input-feature-related interpretation algorithms, such as LIME and SmoothGrad, have already been explored for explaining DRL methods [15, 65, 80, 135]. However, as discussed before, interpretation algorithms may expose different amounts of information about deep models, and different real-world situations require different interpretation algorithms. For critical problems concerning human security and property safety, showing the input-output relations of a deep model is sometimes not persuasive to consumers; the rationale inside the deep model may be required, and it has not yet been much investigated in this field.

3.4.2 Recommendation systems

The recommendation system [139] is a subclass of the information retrieval domain that seeks to predict the “rating” or “preference” a user would give to an item. With the growing amount of information available on the Internet, it becomes more and more difficult for users to find items of interest by themselves. For many web applications, recommendation systems are an essential method for providing a better user experience [193]. Based on all kinds of information provided by users, explicitly or implicitly, the recommendation system filters and sorts a list of items of interest in a personalized way.

There are three reasons for explainable recommendation systems. The first is to gain users’ trust in the recommendation system: explanations help improve the transparency, persuasiveness, and user satisfaction of the system. The second is to help engineers debug the recommendation algorithm: explanations provide analyses of how the deep model works, making it easier to locate bugs. The first two arguments are borrowed from previous reviews [193, 195]. The third is to prevent privacy and social issues: recommendations may be computed based on features that raise privacy or ethical concerns, which we would not like a recommendation system to rely on, and explanations can be used to expose and prevent this problem.

Classic recommendation methods, including collaborative filtering [19, 72], are interpretable, while the use of black-box deep models [39, 40, 63] increases the opacity of recommendation systems. Recent works on explainable recommendation systems can be categorized following our proposed taxonomy, and most of them focus on designing interpretable modules [36, 102, 147, 162]. We refer interested readers to the survey on explainable recommendation systems [195].

3.4.3 Deep learning applications to medical applications

Deep learning methods have recently been applied in medical domains, especially medical imaging analyses [108], such as the classification of Alzheimer’s disease [83], lung cancer detection [75], tuberculosis diagnosis [136], retinal disease detection [146], etc. Though researchers have shown the potential of deep learning methods for aiding diagnostics, applications in real-world healthcare settings, clinics, hospitals, and rehabilitation are critical, because a single failure can cause irreparable damage. Explanations for deep learning-based methods are therefore more urgently needed in this field than in others, to gain the trust of physicians, regulators, and patients [154].

Interpretation algorithms proposed in this specific domain have been surveyed [154, 163]. Most of them align with the general ones reviewed in this work, because the network architectures are the same and the tasks are similar; the differences mainly reside in the data distribution and the domain expert knowledge. Interpretation algorithms are technically applicable, and their trustworthiness can be evaluated, in medical domains. Despite these advances, however, deep learning-based methods have not yet achieved significant deployment in clinics, still due to the lack of interpretability [154]. This indicates that new interpretation tools are still required in this domain.

4 Trustworthiness evaluations of interpretation algorithms and model interpretability evaluations

The previous section focused on interpretation algorithms and interpretation results. This section summarizes current works on evaluating the trustworthiness of interpretation algorithms and the interpretability of deep models. To emphasize, model interpretability should be measured using trustworthy interpretation algorithms. Before introducing model interpretability evaluation, we present the evaluation methods that assess the trustworthiness of interpretation algorithms in Sect. 4.1. Then, assuming a trustworthy interpretation algorithm, in Sect. 4.2 we present several evaluation methods for the interpretability of deep models.

4.1 Trustworthiness evaluations of interpretation algorithm

Perturbation-based evaluations The perturbation-based evaluation of interpretation algorithms follows the intuition that flipping the most salient pixels first should lead to rapid performance decay. Perturbation-based examples can therefore be used for the trustworthiness evaluations of interpretation algorithms [41, 70, 143, 172]. The main metric, MoRF (Most Relevant First) or, respectively, LeRF (Least Relevant First), calculates the area under the curve (AUC) of the probabilities predicted by the model after removing the most relevant (or least relevant, respectively) features. If the explanation is trustworthy, the MoRF curve drops quickly at the beginning while the LeRF curve stays at a high value until the end. The two are usually used together and share the same objective of evaluating the trustworthiness of the explanation.
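
A minimal sketch of a MoRF-style deletion curve: remove input positions in decreasing order of attributed importance, re-query the model, and report the area under the probability curve (lower is better for MoRF); the removal granularity and fill value are illustrative.

```python
import torch

def morf_auc(model, x, saliency, target_class, steps=20, fill=0.0):
    """x: image (C, H, W); saliency: per-pixel importance of shape (H, W)."""
    model.eval()
    order = saliency.flatten().argsort(descending=True)    # most relevant positions first
    per_step = order.numel() // steps
    flat = x.clone().view(x.shape[0], -1)
    probs = []
    with torch.no_grad():
        for s in range(steps + 1):
            xs = flat.clone()
            xs[:, order[: s * per_step]] = fill            # remove the top-s fraction
            p = torch.softmax(model(xs.view_as(x).unsqueeze(0)), dim=1)[0, target_class]
            probs.append(p.item())
    return sum(probs) / len(probs)                         # AUC of the MoRF curve
```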

A different view [55, 74] argues that “without re-training, it is unclear whether the degradation in model performance comes from the distribution shift or because the features that were removed are truly informative.” Hooker et al. [74] therefore proposed to remove the most important features extracted by the interpretation algorithms and then retrain the model, measuring the degradation of model performance to evaluate the trustworthiness of interpretation algorithms. We believe that the prohibitive computational cost added by the retraining step is meaningful for explaining the learning process (how the features/pixels are learned by a specific model architecture), but contributes less to explaining a trained model in a post-hoc way.

Evaluations by randomizing parameters In some cases there is no need for retraining, and we can identify untrustworthy interpretation algorithms simply by randomizing parameters. Adebayo et al. [3] found that even with random weights at the top layers of the network, a number of saliency map-based approaches were still able to locate the important regions of the input images, showing that these methods do not actually depend on the models. Adebayo et al. [4] summarized the uses of interpretation algorithms for model debugging, i.e., to detect spurious correlation artifacts (data contamination), diagnose mislabeled training examples (data contamination), differentiate between a (partially) re-initialized model and a trained one (model contamination), and detect out-of-distribution inputs (test-time contamination).

BAM Yang et al. [181] proposed a framework named Benchmarking Attribution Methods (BAM) for benchmarking interpretation algorithms through a manually created dataset, where objects are randomly pasted into images, together with a set of models trained on that dataset. BAM carefully generates a semi-natural dataset in which objects are copied into images of scenes, so each image has both an object label and a scene label. With models trained on this dataset and test examples, a target interpretation algorithm is evaluated by the framework, giving relative importance rankings for input features that can be validated against ground truth from the generated dataset. The intuition behind BAM is that relative importance has a ground-truth ranking, controlled by the crafted dataset, which can be compared with the ranking given by interpretation methods; BAM can thus quantitatively evaluate the trustworthiness of the algorithm.

Trojaning Model trojaning attacks [37, 68] rely on visual dataset contamination, where a subset of images is modified with a specific trigger (e.g., a yellow square attached to the bottom right of the image) toward a desired target. This attack poisons the trained model so that the trigger becomes the decisive feature for classifying the desired target. Benefiting from trojaning attacks, Lin et al. [106] proposed to verify interpretation algorithms on trojaned models: a qualified algorithm should highlight pixels around the trigger in contaminated images instead of object parts. Using the triggers as ground truth, Lin et al. [106] evaluated the trustworthiness of interpretation algorithms.

Infidelity and sensitivity Desired properties related to trustworthiness have been discussed in [9, 183]. We restate the two definitions of (in)fidelity and sensitivity, which objectively and quantitatively measure the trustworthiness of interpretation algorithms. Given a black-box function \({\varvec{f}}\), an interpretation algorithm \(\varPhi \), a random variable \({\varvec{I}}\in {\mathbb {R}}^d\) with probability measure \(\mu _{\varvec{I}}\) representing meaningful perturbations of interest, and a given input neighborhood radius r, the infidelity and sensitivity of \(\varPhi \) are defined as:

$$\begin{aligned}&\text {INFD}(\varPhi , {\varvec{f}}, {\varvec{x}}) = \mathbb {E}_{{\varvec{I}}\sim \mu _{\varvec{I}}} \left[ \left( {\varvec{I}}^T \varPhi ({\varvec{f}}, {\varvec{x}}) - ({\varvec{f}}({\varvec{x}}) - {\varvec{f}}({\varvec{x}}- {\varvec{I}})) \right) ^2 \right] , \end{aligned}$$
(17)
$$\begin{aligned}&\text {SENS}_{\text {MAX}}(\varPhi , {\varvec{f}}, {\varvec{x}}, r) = \max _{\Vert {\varvec{y}}- {\varvec{x}}\Vert \le r} \Vert \varPhi ({\varvec{f}}, {\varvec{y}}) - \varPhi ({\varvec{f}}, {\varvec{x}}) \Vert , \end{aligned}$$
(18)

where \({\varvec{I}}\) represents significant perturbations around \({\varvec{x}}\) and can be specified in various ways.
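In practice both quantities are estimated by sampling. The sketch below is a Monte-Carlo approximation under simple assumptions: `f` maps a NumPy input to a scalar score, `explain(f, x)` returns an attribution of the same shape as `x`, and the perturbations and the radius-r neighborhood are drawn from Gaussian and uniform distributions purely for illustration.

```python
# Monte-Carlo estimates of infidelity (Eq. 17) and max-sensitivity (Eq. 18).
import numpy as np

def infidelity(f, explain, x, n_samples=100, sigma=0.1):
    phi = explain(f, x).ravel()
    errors = []
    for _ in range(n_samples):
        I = np.random.normal(0.0, sigma, size=x.shape)
        errors.append((I.ravel() @ phi - (f(x) - f(x - I))) ** 2)
    return float(np.mean(errors))

def max_sensitivity(f, explain, x, r=0.1, n_samples=20):
    phi = explain(f, x)
    worst = 0.0
    for _ in range(n_samples):
        y = x + np.random.uniform(-r, r, size=x.shape)   # a point near x within radius r
        worst = max(worst, float(np.linalg.norm(explain(f, y) - phi)))
    return worst
```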

ExpO fidelity and stability Plumb et al. [129] proposed two metrics for measuring the desired properties of explanations and used them as regularization terms to improve the explainability of trained models. These two metrics can also serve as trustworthiness metrics for LIME and its variants, as they evaluate the fidelity and stability of the proxy models. In this paragraph we refer to the two metrics as ExpO-Fidelity and ExpO-Stability, where ExpO is short for Explanation-based Optimization, to avoid confusion with the Infidelity and Sensitivity of [183]. The formulas of ExpO-Fidelity and ExpO-Stability are

$$\begin{aligned}&F({\varvec{f}}, {\varvec{g}}, N_x) = \mathbb {E}_{x' \sim N_x} [({\varvec{g}}(x') - {\varvec{f}}(x'))^2], \end{aligned}$$
(19)
$$\begin{aligned}&S({\varvec{f}}, {\varvec{e}}, N_x) = \mathbb {E}_{x' \sim N_x} [ \Vert {\varvec{e}}(x, {\varvec{f}}) - {\varvec{e}}(x', {\varvec{f}})\Vert _2^2 ], \end{aligned}$$
(20)

where \({\varvec{g}}\) is the proxy model obtained by LIME or its variants, \({\varvec{e}}(x, {\varvec{f}})\) represents the post-hoc local explanation result given a local data point x to explain and the model \({\varvec{f}}\), and \(N_x\) denotes a local neighborhood distribution around x.
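The two quantities can likewise be estimated by sampling the neighborhood. Below is a short sketch under illustrative assumptions: `f` and the proxy `g` return scalars, `explain(x)` returns the local explanation vector, and \(N_x\) is approximated by Gaussian noise around x.

```python
# Sampling-based estimates of ExpO-Fidelity (Eq. 19) and ExpO-Stability (Eq. 20).
import numpy as np

def expo_fidelity(f, g, x, n_samples=100, sigma=0.1):
    xs = x + np.random.normal(0.0, sigma, size=(n_samples,) + x.shape)
    return float(np.mean([(g(xp) - f(xp)) ** 2 for xp in xs]))

def expo_stability(explain, x, n_samples=100, sigma=0.1):
    e_x = explain(x)
    xs = x + np.random.normal(0.0, sigma, size=(n_samples,) + x.shape)
    return float(np.mean([np.sum((explain(xp) - e_x) ** 2) for xp in xs]))
```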

Sensitivity to hyperparameters Besides evaluating the trustworthiness with respect to the model, Bansal et al. [22] proposed to measure the sensitivity of interpretation algorithms to their hyperparameters, noting that "it is important to carefully evaluate the pros and cons of interpretability methods with no hyperparameters and those that have". In fact, insensitivity to hyperparameters is itself an important aspect of trustworthiness.

4.2 Model interpretability evaluation

In some situations, different deep models exhibit different abilities to expose understandable terms to humans. Even for the same network architecture, training on different datasets may yield different interpretability scores [24]. Given the same trustworthy interpretation algorithm and any two models, model interpretability evaluation methods measure and compare the interpretability of the two models. In this subsection, we introduce four model interpretability evaluation methods, i.e., Network Dissection [24], Pointing Game [189], Consensus [104], and evaluation through OOD samples [59, 60].

Fig. 3 Visualizations of semantic segmentation ground truth and interpretations from three popular algorithms, i.e., LIME, GradCAM and SmoothGrad, where the interpretation results are shown in different levels of granularity, i.e., superpixel, low-resolution, and pixel, respectively. We use the three algorithms to interpret images from CUB-200-2011 [175], where the semantic segmentations are available

The basic idea of Network Dissection [24], Pointing Game [189] and Consensus [104] for evaluating model interpretability is to measure the overlap between semantic items (e.g., segmentation ground truth labeled by humans, or the cross-model ensemble of explanations) and interpretation results, as shown in Fig. 3.

Network dissection Network Dissection [24], based on CAM [197], relies on a densely labeled dataset where each image is annotated with colors, materials, textures, scenes, objects, and object parts. Given a CNN model, Network Dissection recovers the intermediate-layer feature maps used by the model for classification, and then measures, for each neuron, the mean intersection over union (mIoU) between the activated locations and the labeled visual concepts. A neuron is considered semantic if its mIoU exceeds a threshold, and the number and ratio of semantic neurons are taken as the model interpretability score.
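A simplified sketch of the per-unit IoU test is given below; in the actual Network Dissection pipeline the activation threshold and the IoU are computed over the whole probing dataset rather than a single image, so this is only an illustration of the core measurement.

```python
# Upsample one unit's activation map, threshold its top activations, and
# measure the IoU with a labeled concept mask.
import numpy as np
from scipy.ndimage import zoom

def unit_iou(activation, concept_mask, quantile=0.995):
    """activation: (h, w) map of one unit; concept_mask: (H, W) boolean mask."""
    H, W = concept_mask.shape
    up = zoom(activation, (H / activation.shape[0], W / activation.shape[1]), order=1)
    active = up > np.quantile(up, quantile)              # keep only the top activations
    union = np.logical_or(active, concept_mask).sum()
    inter = np.logical_and(active, concept_mask).sum()
    return inter / union if union > 0 else 0.0
```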

Pointing game The Pointing Game [189] measures model interpretability via localization accuracy, which is essentially a true positive rate: a hit is counted when the most salient point of the computed explanation falls within the annotated object of interest. It is similar to Network Dissection in that pixel-wise or box-wise labels for visual concepts are required and the same kind of intersection between explanations and annotations is measured.
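A sketch of the per-image hit test is shown below; the tolerance radius is an illustrative assumption, and the final pointing-game accuracy is the number of hits over hits plus misses, aggregated per class.

```python
# Pointing game hit test: does the most salient point land on (or near)
# the annotated object region?
import numpy as np

def pointing_game_hit(saliency_map, annotation_mask, tolerance=15):
    """saliency_map: (H, W) attribution; annotation_mask: (H, W) boolean mask."""
    y, x = np.unravel_index(np.argmax(saliency_map), saliency_map.shape)
    ys, xs = np.nonzero(annotation_mask)
    if ys.size == 0:
        return False
    return np.sqrt((ys - y) ** 2 + (xs - x) ** 2).min() <= tolerance
```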

Consensus The Consensus approach [104] assembles an ensemble of deep models as a committee. Consensus first computes interpretations using a trustworthy interpretation algorithm (e.g., LIME [137], SmoothGrad [155]) for every model in the committee, then obtains a consensus interpretation from the entire committee through voting. Further, Consensus evaluates the interpretability of each model by matching its interpretation result (from LIME or SmoothGrad) to the consensus, and ranks the matching scores across the committee, yielding both absolute and relative interpretability evaluations. The original work validates Consensus with LIME and SmoothGrad, but the approach is also compatible with algorithms that interpret other targets, such as the rationale process, as long as the voting scheme suits the interpretation algorithm.
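A minimal sketch of the voting-and-matching step is given below, using averaging as the vote and cosine similarity as the matching score; the actual Consensus implementation may make different aggregation and similarity choices.

```python
# Aggregate the committee's interpretations into a consensus map and score
# each model by its similarity to that consensus.
import numpy as np

def consensus_scores(interpretations):
    """interpretations: dict mapping model name -> (H, W) attribution map."""
    norm = {k: v / (np.abs(v).max() + 1e-12) for k, v in interpretations.items()}
    consensus = np.mean(list(norm.values()), axis=0)       # committee vote
    scores = {}
    for name, m in norm.items():
        cos = (m.ravel() @ consensus.ravel()) / (
            np.linalg.norm(m) * np.linalg.norm(consensus) + 1e-12)
        scores[name] = float(cos)                          # matching score per model
    return scores
```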

Through OOD samples BAM [181] and trojaning attacks [37, 68] create datasets that deviate from natural distributions and train models on them. Such models can verify the trustworthiness of interpretation algorithms because the injected evidence (pasted objects or triggers) that the models rely on is known. In another way, one can use such out-of-distribution (OOD) samples, unseen during training, to directly evaluate deep models. The works in [59, 60] generated different OOD datasets and tested classic deep models and human observers, recording the errors made on these datasets. With sophisticated designs of datasets and experiments, they found that the behavioral gap between humans and deep models is closing. These evaluations characterize the interpretability of deep models in a general way, showing that the visual recognition of models is partially consistent with that of humans, and they could easily be extended to comparisons among models.

4.3 Human-centered/user-study evaluations

User studies involving humans are a commonly used method for evaluating the trustworthiness of interpretation algorithms and model interpretability. We combine these two directions and introduce them here, as the designed user-study experiments may be capable of performing the two evaluations simultaneously.

An approach based on counterfactual examples [11] was validated through a user-study experiment that verifies whether humans can predict the deep model's decision: several (clean and counterfactual) samples with the model's predictions are presented to users, and then a new sample is shown to ask whether the model will make the correct decision. Another approach, based on decision trees and sets, designs descriptive and multiple-choice questions to test the user's understanding of the decision boundaries of the classes in the data, in order to evaluate the interpretability of the proposed Bayesian Decision Lists. [56] designed user-study experiments following the idea that interpretability is the user's ability to predict the model's changes in response to changes in input. More user studies can be found in [66, 78, 94].

4.4 Concluding remarks

Table 3 List of evaluation methods

We summarize the evaluation methods in Table 3. We note that assessing the trustworthiness of interpretation algorithms is challenging. While a small number of algorithms benefit from intrinsic properties of deep models, e.g., closed-form interpretations, the trustworthiness of most algorithms remains to be evaluated. Besides filtering approaches (such as randomizing the weights [3]) for ruling out untrustworthy interpretation algorithms, we also review reasonable and practical approaches for directly assessing trustworthiness. Given a trustworthy algorithm, interpretability can be evaluated and compared between models, measuring the degree to which they are understandable; if the algorithm is not trustworthy, it does not make sense to compare the interpretability of models using its unreliable results. A few model interpretability evaluation methods are introduced here, while more should be explored in the future. We also note that subjective human-centered user studies are an important evaluation tool that can serve both the evaluation of interpretation algorithms and that of model interpretability, thanks to the flexibility of designing experiments for various objectives.

5 Impact beyond interpretations

Deep models exhibit many poorly understood phenomena and properties, e.g., adversarial vulnerability, memorization capacity, generalization ability, etc. The lack of interpretation and low interpretability are among them. Interestingly, besides the original motivations for explaining black-box deep models, interpretation-related terms have been connected to existing findings about deep models. In this section, we present two fields that are widely known to be related to interpretations.

5.1 Interpretability, adversarial attacks, and robustness

Recent studies on adversarial examples have found positive connections between model interpretability and adversarial robustness. Two teams [140, 165] first observed that, compared to standard models, adversarially trained models show more interpretable input gradients. Etmann et al. [50] theoretically proved, for the case of a linear binary classifier, that an increase in adversarial robustness improves the alignment between an input and its input gradient. Zhang et al. [194] further analyzed how adversarially trained models achieve robustness from an interpretation perspective, showing that adversarially robust models rely less on texture features and are more shape-biased, which is regarded as more consistent with human perception. Essentially, the connection between adversarial examples and gradient-based interpretations may come from their common dependence on the input gradient.

For future work, these observations could (1) motivate new understandings of how deep models work and (2) inspire exploration of the connections between interpretation-related terms and other properties of deep models.

5.2 Learning from interpretations

As they contain rich information about the location of discriminative features, interpretation results can also be utilized to guide training strategies such as data augmentation and regularization, especially for vision tasks. For example, Kim et al. [89] proposed to improve Mixup [188] by leveraging the saliency map [153]; specifically, they sought an optimal transport that maximizes the exposed saliency. Zagoruyko et al. [186] imposed a regularizer that encourages the alignment of saliency maps between the teacher and student networks for effective knowledge distillation. Wickramanayake et al. [176] also used interpretations to generate efficient augmented samples for training, improving both interpretability and model performance. Interpretations can sometimes serve as weak labels in specific tasks; for example, Lai et al. [95] introduced a saliency-guided learning approach for weakly supervised object detection, and many weakly supervised object localization and semantic segmentation methods [7, 89, 182] start from an interpretation and obtain promising results.
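As a concrete illustration of the regularization idea (a sketch in the spirit of attention transfer [186], not its exact formulation), one can penalize the distance between normalized spatial saliency maps of teacher and student feature tensors:

```python
# Saliency-alignment regularizer: match normalized spatial attention maps of
# teacher and student features (assumed to share the same spatial size).
import torch
import torch.nn.functional as F

def attention_map(feat: torch.Tensor) -> torch.Tensor:
    """feat: (B, C, H, W) -> L2-normalized spatial map of shape (B, H*W)."""
    return F.normalize(feat.pow(2).mean(dim=1).flatten(1), dim=1)

def saliency_alignment_loss(teacher_feat: torch.Tensor, student_feat: torch.Tensor) -> torch.Tensor:
    return (attention_map(teacher_feat) - attention_map(student_feat)).pow(2).mean()
```

In training, this loss would be added to the task loss with a weighting coefficient, so the student is encouraged to attend to the same spatial evidence as the teacher.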

From these works, we believe that interpretability and model performance are not two contradictory measures and that they can be improved simultaneously. Future work could further focus on this direction.

6 Open-source libraries for deep learning interpretation

To simplify future research and practical usage, we introduce several open-source libraries that implement popular interpretation algorithms on top of mainstream deep learning frameworks, such as TF-ExplainerFootnote 4 based on TensorFlow [1], CaptumFootnote 5 based on PyTorch [125], and InterpretDLFootnote 6 based on PaddlePaddle [112]. Note that TF-Explainer and Captum mainly include algorithms that target features with gradient-based techniques. Some other popular libraries, such as interpretml,Footnote 7 AIX360Footnote 8 etc., focus on machine learning and do not involve deep models, while LITFootnote 9 targets NLP models.
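As an example of how such libraries are used in practice, the snippet below runs Captum's Integrated Gradients on a toy PyTorch classifier; the toy model and the chosen target class are placeholders, while the attribution call follows Captum's standard interface.

```python
# Integrated Gradients with Captum on a toy classifier.
import torch
import torch.nn as nn
from captum.attr import IntegratedGradients

model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10)).eval()   # toy model
ig = IntegratedGradients(model)

x = torch.rand(1, 3, 32, 32)                       # dummy input for illustration
attributions, delta = ig.attribute(
    x, baselines=torch.zeros_like(x), target=3, return_convergence_delta=True)
print(attributions.shape, float(delta))
```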

7 Discussions and conclusions

In this paper, we review the recent research on interpretation algorithms, model interpretability, and the connections to other deep learning factors.

First of all, to address the research efforts in interpretations, we clarify the main concepts of interpretation algorithms and model interpretability, which are often confused, and connect them by introducing the notion of the trustworthiness of interpretation algorithms.

Second, we propose a new taxonomy and elaborate on the design of several recent interpretation algorithms from different perspectives according to the proposed taxonomy. Our work reviews the recent advances in interpretation algorithms and provides a clear categorization, to help future research better compare new algorithms with the most related works or make progress in unexplored directions.

Third, we survey the performance metrics for evaluating the trustworthiness of interpretation algorithms, to guarantee appropriate usage of interpretation results. These metrics can be used to quantitatively compare interpretation algorithms, so that proposals of new algorithms can be supported by reporting these metrics against related works, rather than by tenuous descriptions and qualitative visualizations.

Further, we summarize the current work on evaluating models' interpretability given trustworthy interpretation algorithms. Based on these evaluations, more relations between interpretability and other metrics could be found for deep models, possibly leading to further understanding of deep learning. However, there are not many evaluation methods for measuring interpretability, though the existing ones largely agree on popular network architectures. Designing new methods for evaluating models' interpretability could be one of the important research directions.

Finally, we review and discuss the connections between deep models' interpretations and other factors, such as adversarial robustness and learning from interpretations; through these connections, new understandings of how deep models work could be obtained. Note that many interpretation algorithms and evaluation approaches are open-sourced, and there are useful libraries to simplify practical usage and future research.