
1 Introduction

Nowadays, a wide range of real-life applications such as computer vision [5, 11], speech processing, natural language understanding [6], health [14], and military fields [2, 4] make use of Machine Learning (ML) models for decision making or prediction/classification purposes. However, these models are often implemented as black boxes, which makes their predictions difficult for humans to understand. This opacity limits their adoption and practical applicability in many real-world domains and undermines human trust in them. Making ML models more explainable and transparent is currently a trending topic in data science and artificial intelligence, attracting the interest of many researchers.

Explainable AI (XAI) refers to the tools, methods, and techniques that can be used to make the behavior and predictions of ML models understandable to humans [3]. Thus, the higher the interpretability/explainability of an ML model, the easier it is for someone to comprehend why certain decisions or predictions have been made.

Multiple interpretability approaches are based on additive models, where the prediction is a sum of individual marginal effects, such as feature contributions [16], in which a value (denoting the influence on the output) is assigned to each feature. One of the most recently proposed methods is SHAP (SHapley Additive exPlanations) [9], which builds on Shapley values and combines ideas from cooperative game theory and local explanations [8].

Let us also mention LIME (Local Interpretable Model-agnostic Explanations), one of the best-known local explanation methods [13].

LIME explains individual predictions of any classifier or regressor in a faithful and intelligible way by approximating them locally with an interpretable model (e.g., linear models, decision trees). However, providing a global explanation of the model is more challenging, as it is harder to maintain a good fidelity-interpretability trade-off. To this end, the authors of [13] proposed Submodular Pick, an algorithm that maximizes a coverage function of total feature importance over a set of instances. Since maximizing the coverage function is NP-hard, they use a greedy algorithm that iteratively adds the instance with the highest marginal coverage gain to the solution set, which offers a constant-factor approximation to the optimum. The selected set consists of the most representative, non-redundant individual explanations of the model.

In this paper, our aim is to introduce a new approach for selecting the individual instances (explanations) to be considered for global explanation, ensuring that the picked group reflects the global behavior of the black-box model. Unlike the submodular optimization proposed in [13], we cast the problem of picking representative instances as a co-selection task. The idea is to apply a similarity-preserving co-selection approach to select a set of instances and features from the explanations provided by any explainer.

The paper is structured as follows. Section 2 provides the necessary background on the LIME method. In Sect. 3, we present our approach for the global explanation of black-box ML models. Section 4 reports the preliminary experiments conducted to validate our proposal. In Sect. 5, we conclude the paper and outline some directions for future work.

2 Background on LIME

Interpretability of ML models reflects the ability to provide meaning in terms understandable to a human. It is crucial for trusting the system and getting insights from its decisions. The quality of an explanation can be improved by making it more interpretable, faithful, and model-agnostic [12]. Faithfulness represents how well the explanation describes the actual behavior of the model, and model-agnostic methods can be used with any type of model. LIME, introduced by Ribeiro et al. [13], is one of the best-known examples of such methods. It is a framework that explains a prediction by approximating it locally with an interpretable model (Algorithm 1).

Algorithm 1. LIME individual explanation (pseudocode not reproduced).

The basic idea of LIME is to replace a data instance x with its interpretable representation \(x'\) via a mapping function \(\varPhi (x)\). For example, an image is represented as a group of super-pixels, and a text as a binary vector indicating the presence or absence of each word. Interpretable representations are easier to understand and closer to human intuition. Then, \(x'\) is perturbed to generate a set of new instances. The black-box model is used to make predictions for the instances generated from \(x'\), which are weighted according to their dissimilarity to \(x'\). Finally, for the explanation purpose, an interpretable model, such as a linear model, is trained on the weighted data to explain the prediction locally. A minimal sketch of this procedure is given below.
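To make the procedure concrete, here is an illustrative sketch of such a local surrogate for a single text instance; it is not the reference LIME implementation, and it assumes the black-box classifier can be queried directly on the binary word-presence masks through a predict_proba-style function:

```python
import numpy as np
from sklearn.linear_model import Ridge

def lime_text_sketch(predict_proba, d, n_samples=5000, k=5, kernel_width=0.25):
    """Illustrative LIME-style local surrogate for one text instance whose
    interpretable representation x' is the all-ones vector over its d words."""
    rng = np.random.default_rng(0)
    # 1. Perturb x' by randomly switching interpretable components (words) off
    Z = rng.integers(0, 2, size=(n_samples, d))
    Z[0, :] = 1                                   # keep the original instance
    # 2. Query the black-box model f on the perturbed samples (assumption:
    #    f accepts the binary masks and returns class probabilities)
    y = predict_proba(Z)[:, 1]
    # 3. Weight samples by proximity to x' (RBF kernel on normalized Hamming distance)
    dist = (d - Z.sum(axis=1)) / d
    weights = np.exp(-(dist ** 2) / kernel_width ** 2)
    # 4. Fit a weighted linear surrogate g and keep the k largest coefficients
    g = Ridge(alpha=1.0).fit(Z, y, sample_weight=weights)
    top_k = np.argsort(np.abs(g.coef_))[::-1][:k]
    return [(int(j), float(g.coef_[j])) for j in top_k]
```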

2.1 LIME: Fidelity-Interpretability Trade-off

The authors of [13] define an explanation as a model \( g \in G\), where G is a class of potentially interpretable models (e.g., linear models, decision trees). Let \( \varOmega {(g)} \) be a measure of complexity (as opposed to interpretability) of the explanation g; for example, for linear models, \(\varOmega (g)\) may be the number of non-zero weights. The model being explained is denoted by \(f : \mathbb R^d \longrightarrow \mathbb R\). Let \(\pi _x\) define a locality around x and \(\mathcal {L}(f;g;\pi _x) \) be a measure of how unfaithful g is in approximating f in the locality \(\pi _x\). The explanation produced by LIME is then obtained by the following minimization problem [13]:

$$\begin{aligned} \xi {(x)} = \mathop {\textrm{argmin}}\limits _{g\in G} \mathcal {L}(f;g;\pi _x) +\varOmega {(g )} \end{aligned}$$
(1)

2.2 Explaining Global Behavior

LIME explains a single prediction locally. Then, it picks K explanations that “must be representative” to show to the user. The Submodular Pick procedure, described in Algorithm 2, is used to choose the instances to be inspected for a global understanding. The quality of the selected instances is critical for getting insights from the model in a reasonable time.

Algorithm 2. Submodular Pick (pseudocode not reproduced).

Let \(\mathrm {\textbf{X}}\) (with \(|\mathrm {\textbf{X}}|= n\)) be the set of instances to explain. Algorithm 2 computes an explanation matrix \(\mathrm {\textbf{W}}\in \mathbb R^{n \times d'}\) from the individual explanations given by Algorithm 1. Then, it computes a global feature importance \(I_{j}\) for each column j of \(\mathrm {\textbf{W}}\), such that the highest importance scores are given to the features that explain a large number of different instances. Submodular Pick then aims at finding the set of instances V, with \(|V| \le \mathrm {\textbf{B}}\), that achieves the highest coverage, defined as the total importance of the features that appear in at least one selected instance. Finally, a greedy algorithm builds V by iteratively adding the instance with the highest marginal coverage gain, as sketched below.
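As an illustration, the greedy coverage-maximization step can be sketched as follows, assuming \(\mathrm {\textbf{W}}\) stores the absolute feature contributions of the n individual explanations; this is our reading of the procedure in [13], not the reference implementation:

```python
import numpy as np

def submodular_pick(W, budget):
    """Greedy coverage maximization over the explanation matrix W (n x d')."""
    W = np.abs(W)
    importance = np.sqrt(W.sum(axis=0))          # global importance I_j of feature j
    covered = np.zeros(W.shape[1], dtype=bool)
    selected = []
    for _ in range(min(budget, W.shape[0])):
        candidates = [i for i in range(W.shape[0]) if i not in selected]
        # coverage reached if instance i were added to the current selection
        gains = [importance[covered | (W[i] > 0)].sum() for i in candidates]
        best = candidates[int(np.argmax(gains))]
        selected.append(best)
        covered |= W[best] > 0
    return selected
```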

3 Proposed Approach

Fig. 1. The proposed framework for a global explanation using a co-selection of features and instances.

The approach we propose in this paper consists of two sequential phases (see Fig. 1). The first uses LIME (without loss of generality, any other explainer could be used) to obtain explanations of the predictions made on the test data. The second phase addresses global explainability by co-selecting the most important test instances and features, thus providing a global understanding of the black-box model.

3.1 Explanation Space

Let f be a black-box model, \(\mathrm {\textbf{X}}\) a test dataset of n instances, and \(\varPhi (\mathrm {\textbf{X}})=\mathrm {\mathbf {X'}}\) its interpretable representation in \(\mathbb R^p\). First, to obtain an individual explanation of the prediction made by f for each instance \(x_i\), we apply LIME by fitting a linear model on a dataset generated around \({x'}_{i}\), the interpretable representation of \(x_i\). Thus, for each instance \(x_i\), we obtain an explanation of length k (\(k < p\)). It is worth noting that this length is a parameter set by the user and corresponds to the number of features retained.

Once the individual explanations have been obtained, we construct an explanation space represented by \(\mathrm {\textbf{E}}\in \mathbb R^{n\times m}\), where the dimension m of the explanation space corresponds to the union of the k features of each explanation. We illustrate this step with the following example:

Example 1

Let \(\mathrm {\mathbf {X'}}\) be the interpretable representation of 3 instances in \(\mathbb R^{500}\), and let \(k=3\) be the desired explanation length for these three instances. By running the LIME algorithm on \(\mathrm {\mathbf {X'}}\), we obtain 3 explanations of length 3:

$$\begin{aligned} {\left\{ \begin{array}{ll} e_1=\{(f_1,0.5),(f_{25},0.9),(f_4,0.1)\}\\ e_2=\{(f_{17},0.2),(f_6,0.3),(f_{78},0.4)\}\\ e_3=\{(f_{500},0.8),(f_{25},0.7),(f_{1},0.25)\} \end{array}\right. } \end{aligned}$$
(2)

where \(e_1,e_2\), and \(e_3\) are the explanations of \(x'_1\), \(x'_2\), and \(x'_3\), respectively. Thus, the matrix \(\mathrm {\textbf{E}}\in \mathbb R^{3\times 7}\) can be seen as the concatenation of all the explanations over the union of the features they involve. Note that the dimension m is here equal to 7; the construction is sketched below.
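The construction of \(\mathrm {\textbf{E}}\) from the explanations of Example 1 can be written, for instance, as follows (an illustrative sketch using (feature, weight) pairs):

```python
import numpy as np

# Explanations from Example 1: lists of (feature index, contribution) pairs
explanations = [
    [(1, 0.5), (25, 0.9), (4, 0.1)],
    [(17, 0.2), (6, 0.3), (78, 0.4)],
    [(500, 0.8), (25, 0.7), (1, 0.25)],
]

# Union of the features appearing in at least one explanation -> columns of E
features = sorted({f for e in explanations for f, _ in e})
col = {f: j for j, f in enumerate(features)}

# E[i, j] holds the contribution of feature j in explanation i (0 if absent)
E = np.zeros((len(explanations), len(features)))
for i, e in enumerate(explanations):
    for f, w in e:
        E[i, col[f]] = w

print(E.shape)   # (3, 7)
```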

Fig. 2. Explanation matrix \(\mathrm {\textbf{E}}\) (this matrix is given as input to Algorithm 3, CoSP).

3.2 Global Explainability by Co-selection

Understanding the model’s decisions globally remains a complex task. In fact, approaches like LIME have been extended to face this complexity by picking only a group of individual explanations. In this paper, we advocate a method that achieves global explainability by co-selecting the most important instances and features over the explanations provided by any explainer. The idea is to find a residual matrix \(\mathrm {\textbf{R}}\) and a transformation matrix \(\mathrm {\textbf{W}}\), which maps the high-dimensional explanation data \(\mathrm {\textbf{E}}\) to the low-dimensional data \(\mathrm {\textbf{E}}\mathrm {\textbf{W}}\), so as to maximize the global similarity between \(\mathrm {\textbf{E}}\) and \(\mathrm {\textbf{E}}\mathrm {\textbf{W}}\). Once the optimal \(\mathrm {\textbf{W}}\) and \(\mathrm {\textbf{R}}\) have been obtained, the original features and instances are ranked based on the \(\ell _{2}\)-norms of the rows of \(\mathrm {\textbf{W}}\) and of the columns of \(\mathrm {\textbf{R}}\), and the top features and instances are selected accordingly.

3.3 Notation

First, we present the notation used in this paper. Let \(\mathrm {\textbf{E}}\) be an explanation matrix of n instances and m features. The \(\ell _{2,1}\)-norm of \(\mathrm {\textbf{E}}\) is:

$$\begin{aligned} \parallel \mathrm {\textbf{E}}\parallel _{2,1} = \sum _{i=1}^{n}\parallel \mathrm {\textbf{E}}_{i} \parallel _{2} = \sum _{i=1}^{n} \sqrt{\sum _{j=1}^{m}\mathrm {\textbf{E}}_{ij}^{2}} \end{aligned}$$
(3)

and its Frobenius norm (\(\ell _{2,2}\)) is:

$$\begin{aligned} \parallel \mathrm {\textbf{E}}\parallel _{F} =\left( \sum _{i=1}^{n}\parallel \mathrm {\textbf{E}}_{i} \parallel _{2}^{2}\right) ^{1/2} = \left( \sum _{i=1}^{n} \sum _{j=1}^{m} \mathrm {\textbf{E}}_{ij}^{2}\right) ^{1/2} \end{aligned}$$
(4)
Table 1. Summary of symbols and notations
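For concreteness, both norms can be computed in a couple of lines (a small NumPy check we add for illustration, assuming \(\mathrm {\textbf{E}}\) is a dense array):

```python
import numpy as np

def l21_norm(E):
    # sum over instances of the Euclidean norm of each row of E  (Eq. 3)
    return np.linalg.norm(E, axis=1).sum()

def frobenius_norm(E):
    # square root of the sum of all squared entries of E         (Eq. 4)
    return np.sqrt((E ** 2).sum())
```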

3.4 Co-Selection Pick (CoSP)

To perform a co-selection of instances and features on the explanation matrix, we minimize the following problem, as pointed out in [1]:

$$\begin{aligned} \min _{\mathrm {\textbf{W}},\mathrm {\textbf{R}}}{\parallel \mathrm {\textbf{E}}\mathrm {\textbf{W}}-\mathrm {\textbf{R}}^T-{\textbf {Z}}\parallel ^{2}_{F} + \lambda \parallel \mathrm {\textbf{W}}\parallel _{2,1}+\beta \parallel \mathrm {\textbf{R}}\parallel _{2,1}} \end{aligned}$$
(5)

where:

  • Z is the eigen-decomposition of the pairwise similarity matrix \(\mathrm {\textbf{A}}\) computed over the explanation matrix \(\mathrm {\textbf{E}}\). Note that the similarity matrix \(\mathrm {\textbf{A}}\) can be computed in a supervised fashion (e.g., as an adjacency matrix or a fully binary matrix) if the labels of the test instances are available, or in an unsupervised mode, as follows (a small sketch of this computation is given after this list):

    $$\begin{aligned} \mathrm {\textbf{A}}_{ij} = e^{-\frac{\Vert e_i-e_j \Vert ^2}{2\delta ^2}} \end{aligned}$$
    (6)
  • \(\mathrm {\textbf{R}}={\textbf {W}}^T\mathrm {\textbf{E}}^T-{\textbf {Z}}^T-\varTheta \) is a residual matrix, where \(\varTheta \) is a random matrix, usually assumed to follow a multi-dimensional normal distribution [15]. Note that the matrix \(\mathrm {\textbf{R}}\) is a good indicator of outliers and of less important or irrelevant instances in a dataset, according to [17, 18].

  • \(\lambda \) and \(\beta \) are regularization parameters used to control the sparsity of \(\mathrm {\textbf{W}}\) and \(\mathrm {\textbf{R}}\), respectively; \(\delta \) is the bandwidth of the RBF kernel used to compute the matrix \(\mathrm {\textbf{A}}\) in the unsupervised mode of Eq. (6).
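For illustration, the unsupervised similarity matrix of Eq. (6) and its eigen-decomposition can be computed as follows; truncating to the h leading eigenvectors to form Z is our assumption about how the decomposition is used:

```python
import numpy as np

def similarity_and_embedding(E, delta=1.0, h=10):
    """RBF similarity between explanations (Eq. 6) and a spectral embedding Z
    built from the h leading eigenvectors of A (our truncation choice)."""
    sq_dists = ((E[:, None, :] - E[None, :, :]) ** 2).sum(axis=-1)
    A = np.exp(-sq_dists / (2 * delta ** 2))
    eigvals, eigvecs = np.linalg.eigh(A)              # A is symmetric
    Z = eigvecs[:, np.argsort(eigvals)[::-1][:h]]     # n x h embedding
    return A, Z
```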

The first term of the objective in Eq. (5) exploits the structure of \(\mathrm {\textbf{E}}\) by preserving the pairwise similarity between explanations, while the second and third terms perform feature selection and instance selection, respectively. To minimize Eq. (5), we adopt an alternating optimization over \(\mathrm {\textbf{W}}\) and \(\mathrm {\textbf{R}}\), as in [1], by solving two reduced minimization problems:

Problem 1: Minimizing Eq. (5) with \(\mathrm {\textbf{R}}\) fixed to compute \(\mathrm {\textbf{W}}\) (feature selection). To solve this problem, we consider the Lagrangian function of Eq. (5):

$$\begin{aligned} \mathcal {L}_\mathrm {\textbf{W}}=trace(\mathrm {\textbf{W}}^T\mathrm {\textbf{E}}^T\mathrm {\textbf{E}}\mathrm {\textbf{W}}-2\mathrm {\textbf{W}}^T{\textbf {E}}^T({\textbf {R}}^T+{\textbf {Z}}))+\lambda \parallel \mathrm {\textbf{W}}\parallel _{2,1}. \end{aligned}$$
(7)

Then, we calculate the derivative of \(\mathcal {L}_\mathrm {\textbf{W}}\) w.r.t. \(\mathrm {\textbf{W}}\):

$$\begin{aligned} \frac{\partial \mathcal {L}_\mathrm {\textbf{W}}}{\partial \mathrm {\textbf{W}}}=2\mathrm {\textbf{E}}^T\mathrm {\textbf{E}}\mathrm {\textbf{W}}-2\mathrm {\textbf{E}}^T(\mathrm {\textbf{R}}^T+{\textbf {Z}})+2\lambda \mathcal {D}_\mathrm {\textbf{W}}\mathrm {\textbf{W}}. \end{aligned}$$
(8)

where \(\mathcal {D}_\mathrm {\textbf{W}}\) is an (\(m\times m\)) diagonal matrix whose \(i^{th}\) diagonal element is equal to \(\frac{1}{2\parallel {\textbf {W}}(i,:)\parallel _2 }\). Subsequently, we set the derivative to zero to update \(\mathrm {\textbf{W}}\):

$$\begin{aligned} \mathrm {\textbf{W}}= ({\textbf {E}}^T{\textbf {E}}+\lambda \mathcal {D}_\mathrm {\textbf{W}})^{-1}\mathrm {\textbf{E}}^T(\mathrm {\textbf{R}}^T+{\textbf {Z}}) \end{aligned}$$
(9)

Problem 2: Minimizing Eq. (5) with \(\mathrm {\textbf{W}}\) fixed to compute the solution for \(\mathrm {\textbf{R}}\) (explanation selection). To solve this problem, we consider the Lagrangian function of Eq. (5):

$$\begin{aligned} \mathcal {L}_\mathrm {\textbf{R}}=trace(\mathrm {\textbf{R}}\mathrm {\textbf{R}}^T-2\mathrm {\textbf{R}}(\mathrm {\textbf{E}}\mathrm {\textbf{W}}-{\textbf {Z}}))+\beta \parallel \mathrm {\textbf{R}}\parallel _{2,1}. \end{aligned}$$
(10)

Then, we calculate the derivative of \(\mathcal {L}_\mathrm {\textbf{R}}\) w.r.t. \(\mathrm {\textbf{R}}\):

$$\begin{aligned} \frac{\partial \mathcal {L}_\mathrm {\textbf{R}}}{\partial \mathrm {\textbf{R}}}=2\mathrm {\textbf{R}}^T-2(\mathrm {\textbf{E}}\mathrm {\textbf{W}}-{\textbf {Z}})+2\beta \mathcal {D}_\mathrm {\textbf{R}}\mathrm {\textbf{R}}^T. \end{aligned}$$
(11)

where \(\mathcal {D}_\mathrm {\textbf{R}}\) is an (\(n\times n\)) diagonal matrix whose \(i^{th}\) diagonal element is equal to \(\frac{1}{2\parallel \mathrm {\textbf{R}}^T(i,:)\parallel _2 }\).

Subsequently, we set the derivative to zero to update \(\mathrm {\textbf{R}}\):

$$\begin{aligned} \mathrm {\textbf{R}}= (\mathrm {\textbf{E}}\mathrm {\textbf{W}}-{\textbf {Z}})^T(({\textbf {I}} + \beta \mathcal {D}_\mathrm {\textbf{R}})^{-1})^T \end{aligned}$$
(12)

where \({\textbf {I}}\) is the (\(n\times n\)) identity matrix. All of the above developments are summarized in Algorithm 3.

Algorithm 3. CoSP (pseudocode not reproduced).

3.5 Algorithm Analysis

In Algorithm 3, the final user expects a selection of \(\mathrm {\textbf{B}}\) instances (i.e., explanations) and \(\mathrm {\textbf{L}}\) features that are the most relevant for providing a global explanation of the model. To achieve this, CoSP requires four hyper-parameters, \(\lambda \), \(\beta \), \(\delta \), and h, which are used to build the set of chosen instances and features. First, we build the explanation matrix \(\mathrm {\textbf{E}}\) using any explainer; in our case we use LIME. Second, we compute the similarity matrix \(\mathrm {\textbf{A}}\), either in supervised mode (as an adjacency or binary matrix) or in an unsupervised way, depending on the availability of the labels of the test instances \(\mathrm {\textbf{X}}\). Then, we eigen-decompose \(\mathrm {\textbf{A}}\) to obtain Z. From line 9 to line 13, \(\mathrm {\textbf{W}}\) and \(\mathrm {\textbf{R}}\) are updated until convergence according to Eqs. (9) and (12). Following the alternating optimization, we rank the instances and the features according to \(\mathrm {\textbf{R}}\) and \(\mathrm {\textbf{W}}\), respectively: the higher the norm \(\parallel {\textbf {R}}(:,j)\parallel _2\), the less representative the \(j^{th}\) explanation, while the higher the norm \(\parallel {\textbf {W}}(i,:)\parallel _2\), the more important the \(i^{th}\) feature.
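The following sketch illustrates the alternating updates of Eqs. (9) and (12) and the final ranking; the random initialization of W, the zero initialization of R, the fixed iteration count, and the small eps added to the row norms are our assumptions rather than part of the published algorithm:

```python
import numpy as np

def cosp(E, Z, lam=1.0, beta=1.0, n_iter=50, eps=1e-8):
    """Sketch of the CoSP alternating optimization.
    E: (n x m) explanation matrix, Z: (n x h) spectral embedding of A."""
    n, m = E.shape
    h = Z.shape[1]
    W = np.random.randn(m, h) * 0.01      # feature transformation (assumed init)
    R = np.zeros((h, n))                  # residual matrix (assumed init)
    for _ in range(n_iter):
        # Eq. (9): update W with D_W built from the current row norms of W
        Dw = np.diag(1.0 / (2 * np.linalg.norm(W, axis=1) + eps))
        W = np.linalg.solve(E.T @ E + lam * Dw, E.T @ (R.T + Z))
        # Eq. (12): update R with D_R built from the current column norms of R
        Dr = np.diag(1.0 / (2 * np.linalg.norm(R.T, axis=1) + eps))
        R = (E @ W - Z).T @ np.linalg.inv(np.eye(n) + beta * Dr).T
    # Rank features by the row norms of W, instances by the column norms of R
    feature_rank = np.argsort(-np.linalg.norm(W, axis=1))
    instance_rank = np.argsort(np.linalg.norm(R, axis=0))  # small norm = representative
    return W, R, feature_rank, instance_rank
```

Selecting the \(\mathrm {\textbf{B}}\) most representative explanations and the \(\mathrm {\textbf{L}}\) most important features then amounts to taking instance_rank[:B] and feature_rank[:L].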

4 Experiments

In this section, we conduct experiments to validate our framework on well-known sentiment datasets.

4.1 Datasets

We use a binary sentiment classification dataset. Sentiment analysis is the task of analyzing people's opinions, reviews, and comments expressed as textual data. It gives intuition about different points of view and feedback by detecting the relevant words used to express specific sentiments [10]. Today, companies rely on sentiment analysis to improve their strategies: people's opinions are collected from different sources, such as Facebook posts, tweets, and product reviews, and processed in order to understand customers' needs and improve marketing plans. When the sentiment is divided into positive and negative classes, the task is called binary sentiment analysis, which is the most common setting and the one used in our case, whereas multi-class sentiment analysis classifies text into a larger set of possible labels. We use the Multi-Domain Sentiment Dataset, which contains reviews from multiple domains (e.g., books and DVDs) collected from Amazon.com, with hundreds of thousands of collected reviews per product type. We then follow an experiment introduced in [13], which evaluates whether explanations can help a simulated user recognize the best model among a group of models having the same accuracy on the validation set. To do so, a new dataset is generated by adding 10 artificial features to the training and validation sets built from the original review dataset. In the training examples, each of these features appears in 10% of the instances of one class and in 20% of the other class. In the test examples, an artificial feature appears in 10% of the examples of both classes. This simulates spurious correlations in the data introduced by non-informative features. A sketch of this construction is given below.
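The injection of the artificial features can be reproduced with a sketch of the following form; the token names prefixed with “FAKE” follow Sect. 4.2, while the exact corpus handling and the random seed are our assumptions:

```python
import numpy as np

def add_fake_features(docs, labels, split, n_fake=10, seed=0):
    """Append artificial tokens 'FAKE0' ... 'FAKE9' to raw text documents.
    Train/validation: a fake token appears in 10% of one class and 20% of the other.
    Test: it appears in 10% of the documents of both classes."""
    rng = np.random.default_rng(seed)
    docs = list(docs)
    for k in range(n_fake):
        for i, y in enumerate(labels):
            if split == "test":
                p = 0.10
            else:
                p = 0.10 if y == 0 else 0.20
            if rng.random() < p:
                docs[i] += f" FAKE{k}"
    return docs
```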

4.2 Evaluation and Results

We train pairs of classifiers until their validation accuracies are within 0.1% of each other, while their test accuracies differ by at least 5%, which makes one classifier clearly better than the other. Then, we explain the global behavior of both classifiers using our proposed approach, CoSP. A possible selection procedure is sketched below.
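Such a pair of classifiers can be obtained, for example, with a simple search of the following form; the use of logistic regression with random regularization strengths is our assumption, while the 0.1% and 5% thresholds come from the text:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def find_classifier_pair(X_tr, y_tr, X_val, y_val, X_te, y_te, max_trials=200):
    """Search for two models whose validation accuracies differ by at most 0.1%
    while their test accuracies differ by at least 5%."""
    rng = np.random.default_rng(0)
    trained = []
    for _ in range(max_trials):
        clf = LogisticRegression(C=10 ** rng.uniform(-2, 2), max_iter=1000)
        clf.fit(X_tr, y_tr)
        trained.append((clf, clf.score(X_val, y_val), clf.score(X_te, y_te)))
        c1, v1, t1 = trained[-1]
        for c2, v2, t2 in trained[:-1]:
            if abs(v1 - v2) <= 0.001 and abs(t1 - t2) >= 0.05:
                return (c1, c2) if t1 > t2 else (c2, c1)
    return None
```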

Fig. 3. Top 5 features per class picked by the CoSP global approach for binary review classification on the books dataset.

Fig. 4. Top 5 features per class picked by the CoSP global approach for binary review classification on the kitchen dataset.

To validate our approach, we use the same experimental setting introduced in [7], selecting the top five important features per class as the most relevant ones for the classification task. The global approach is validated if it selects distinguishing features. The results shown in Figs. 3 and 4 were produced by applying CoSP with the hyper-parameters \(\lambda \approx 2.11\), \(\beta \approx 61.79\), \(\delta = 1\), and \(h = 17000\) (the number of features selected by CoSP). First, the resulting selection contains words that are meaningful for judging the type of comment: the picked features are aligned with human intuition, and words with no representative meaning, such as stop words, were not selected. Second, the noisy features labeled with the prefix “FAKE” that were added to the dataset were not deemed important.

5 Conclusion

In this paper, we presented CoSP, a generic framework that selects individual instances in order to provide a global explanation of machine learning models. We used similarity-based co-selection as a foundation for building a global understanding of the internal logic of a black box on top of any local explainer. Furthermore, we conducted experiments showing that CoSP provides representative insights. This study is another step towards understanding machine learning models globally. For future work, we would like to explore this method in the context of time-series data, as finding representative illustrations for this type of data is challenging.