
1 Introduction

The recent strategies under the XAI umbrella are mostly model agnostic, meaning that irrespective of the ML model type and its internal structure, the explanation methods can provide explanations for the model’s decisions. One such technique is the feature importance method [1]. These methods [2,3,4,5,6,7,8] can be plugged into any ML model to reveal the learning behaviour of the model in terms of feature importance. Here, the learning behaviour represents the order of important features on which the model bases its predictions. These model-agnostic methods require only the input and the predicted output of the model to provide the feature importance explanation.

Feature importance can be defined as a quantitative indicator of how much a model’s output changes with respect to the permutation of one or a set of input variables [9]. The computation of these variable importance values is operationalized in different ways. The importance of the variables can be quantified by introducing them one by one, called feature inclusion [8], or by removing them one by one from the whole set of features, called feature removal [2]. The model can be retrained several times [11] for each feature inclusion/removal, or multiple retrainings can be avoided [12] by handling the absence of removed features or the inclusion of new features. For that, a supplementary baseline input [13], conditional expectations [14], the product of marginal expectations [15], approximation with marginal expectations [3] or replacement with default values [2] can be used.

Though all these methods explain the feature importance behind the decisions of the model, the explanation obtained from one method may not be similar to the explanation of another method for the same model [17, 34]. This confuses the analyst as to which explanation should be trusted when different explanations are obtained. Unfortunately, there is no clear, standard principle for choosing the appropriate explanation method.

There may be many different ML models that fit equally well and produce almost equally accurate predictions on the same data. However, the feature that is most important to one such model may not be an important feature for another well performing model [19].

In such a scenario, providing the explanation based on a single ML model using a specific explanation method would be biased (unfair) towards that model/method. To this end, a novel explanation method is proposed to provide a method agnostic explanation across the explanations that various methods give for multiple almost-equally-accurate models. These near-optimal models [29] are termed the Rashomon set [19]. Instead of selecting a single predictive model from a set of well performing models and providing the explanation for it, the proposed method offers an explanation across multiple methods to cover the feature importance of all the well performing models in the model class.

The rest of the work is structured as follows: Sect. 2 reviews the related works, Sect. 3 presents the proposed method, Sect. 4 describes the experiments, results and discussion, and Sect. 5 presents the conclusion.

2 Related Works

A plethora of strategies under the XAI umbrella has been developed for providing explanations for black-box models. Among them, the feature importance methods receive the most attention. These methods [3,4,5,6,7,8, 11] aim to explain a single model’s variable importance values by permuting the variables. The methods can give explanations as local feature importance [2] for a single instance or as global importance [4] for the entire data set.

Rashomon Effects:

The problem of model multiplicity, where multiple models fitted on the data are equally good but different, was initially raised by [10]. There is no clear reason to choose the ‘best’ model among all those almost-equally-accurate models [22]. Moreover, the learning behaviour varies among the models: a feature that is important for one model may not be important for another. Hence, to avoid the biased explanation of a single model, a comprehensive explanation for all the well performing models (Rashomon set models) is given as a range of explanations by [19].

In line with [19], the authors of [22] expanded the Rashomon set concept by defining the variable importance cloud (VIC) of values for the almost-equally-accurate models and visualizing it with the variable importance diagram (VID). The VID shows that the importance of a variable changes when another variable becomes important.

Aggregating over a set of competing equally good models would reduce the non-uniqueness [10]. Based on this concept, the authors of [29] generated a set of 350 near-optimal logistic regression models on the COMPAS dataset, aggregated the models’ feature importance values and presented a less biased importance explanation for the model class than a single model’s biased explanation. Similarly, by ensembling the Rashomon set models using prior domain knowledge, the authors of [30] correct the biased learning of a model. If the Rashomon set is large, the models contained in the set could exhibit various desirable properties [31]. The authors also observe that the model performance does not necessarily vary across different algorithms even when the ratio of Rashomon set models on the dataset is small.

All these works aim at addressing the bias that arises from multiple models (the Rashomon set) rather than the bias that arises for a single model from multiple methods.

Explanation Evaluation and Ensembling:

The common evaluation measures found in the literature for ensembling explainable approaches are stability [32, 34, 37], (in)fidelity [18, 37], consistency [32, 35], informativeness [33] and comparison metrics [36]. Though the explanation methods provide varying explanations for the same model, no principled way could be found in the literature to reach a consensus explanation across various methods. A framework [32] proposes an ensembled explanation of several model agnostic algorithms based on consistency and stability scores, with the aim of providing an ensembled explanation independent of the XAI algorithms. Similarly, a unifying framework for understanding the feature removal-based explanation methods is introduced in [7]. The authors show how the methods are related to one another in providing the explanation; the framework does not combine the explanations of the various methods into one explanation but offers comparable explanations of those methods. At the same time, by comparing the various method explanations for a model, the most representative knowledge of the data set is obtained through the explanations common to the various methods [34]. All these ensembling explanation works focus on the multiple explanations for a single model rather than on model multiplicity.

A unified explanation across multiple methods has not been extensively studied, and the research works related to the Rashomon set focus on explanations that vary across multiple models rather than across multiple methods. Hence, a framework is proposed to address the explanation bias arising across multiple methods for multiple almost-equally-accurate models. The work is motivated by the following research questions:

  1. RQ1: While various explanation methods are applied on multiple well performing models to obtain the feature importance explanations, will a feature projected as (un)important by one explanation method be agreed upon by the other methods?

  2. RQ2: Is it possible to obtain a consensus explanation that is consistent across the various applied methods for multiple almost-equally-accurate models?

3 Proposed Method

This section presents the proposed method for obtaining the method agnostic ensembled explanation of various almost-equally-accurate ML models. The processes involved in obtaining the method agnostic model class reliance range using the MAMCR framework are depicted in Fig. 1.

Fig. 1. The process pipeline of the Method Agnostic Model Class Reliance (MAMCR) framework.

3.1 Model Building

Let \((X,Y)\in {\mathbb{R}}^{p+1}\), where \(p>0\), \(X\in {\mathbb{R}}^{p}\) is the random vector of p input variables and \(Y\in \mathbb{R}\) is the output variable. The process pipeline starts with the modelling of a class of multiple ML models on the pre-processed tabular data. As per the No Free Lunch theorem [21], no single ML model is best for every problem. Consequently, multiple ML models can be fitted on the same data set to compare their performance. This set of prespecified predictive models is referred to as the model class [19].

$$M=\left\{{ɱ}_{1},{ɱ}_{2},\dots ,{ɱ}_{m}\right\},\quad {ɱ}_{i}:X\to Y,\; i=1,\dots ,m$$
(1)

where \(M\) is a model class that consists of m models. Each model takes the input X and maps it to the response Y. Each model’s performance is assessed in terms of its prediction accuracy. The model class can also be built with a set of regression algorithms; in that case, the model performance is assessed in terms of the \(R^2\) value.

3.2 Finding the Rashomon Set Models

From the multiple fitted models of the model class \(M\), the almost-equally-accurate models form the Rashomon set (Ʀ). A Rashomon set is constructed based on a benchmark model ɱ* and a nonnegative factor ε as follows:

$$Ʀ\left(\varepsilon ,{ɱ}^{*},M\right)=\left\{ɱ\in M : acc\left(ɱ\right)\ge \left(1-\varepsilon \right)\, acc\left({ɱ}^{*}\right)\right\}$$
(2)

Selecting ɱ* with the maximum possible accuracy and ε with a small positive value helps to search for the models whose prediction accuracy is not less than the (1−ε) factor of the ɱ* accuracy and to construct the Ʀ models, i.e., Ʀ = {ɱ1, ɱ2, …, ɱr}.
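For concreteness, a minimal sketch of this filtering step is given below; the function and variable names are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def rashomon_set(models, accuracies, epsilon=0.05):
    """Select the almost-equally-accurate models (a sketch of Eq. 2).

    `models` is any sequence of fitted models and `accuracies` their
    validation accuracies; both names are illustrative.
    """
    accuracies = np.asarray(accuracies)
    best = accuracies.max()                     # accuracy of the benchmark model
    threshold = (1.0 - epsilon) * best          # (1 - eps) * benchmark accuracy
    keep = np.flatnonzero(accuracies >= threshold)
    return [models[i] for i in keep]
```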

3.3 Obtaining Model Reliance Values and Ranking Lists

The model reliance [19] (or feature importance) indicates how much a model relies on a variable for making its predictions. The model reliance on the variable k (\({mr}^{k}\)) is measured by the change in the model’s performance with and without the variable k, where k = 1, 2, …, p. The larger the change in the model performance, the higher the importance of that variable to the model’s predictions.

Different state-of-the-art explanation methods are selected to apply to each Rashomon set model to obtain their model reliance on p variables. Any global explanation method that returns the explanation in the form of feature importance can be chosen.
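As one hedged example of such a global method, the sketch below uses scikit-learn’s permutation importance as a stand-in for any of the cited explanation methods; the normalization step (applied in Sect. 4) is an assumption made here so that values are comparable across methods.

```python
import numpy as np
from sklearn.inspection import permutation_importance

def model_reliance(model, X_val, y_val, n_repeats=10, seed=0):
    """Example of one global, permutation-based importance method.

    The framework itself is method agnostic; this stand-in only illustrates
    the expected output: one importance value mr^k per input variable k.
    """
    result = permutation_importance(model, X_val, y_val,
                                    n_repeats=n_repeats, random_state=seed)
    mr = np.clip(result.importances_mean, 0, None)  # drop in score per permuted feature
    total = mr.sum()
    return mr / total if total > 0 else mr          # normalised importance vector
```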

Applying the \(n\) chosen methods to each of the \(r\) Rashomon set models yields the explanation set
$$E=\left\{{e}_{i}\left({ɱ}_{j}\right) : i=1,\dots ,n;\;j=1,\dots ,r\right\}$$
(3)

The obtained model reliance explanations E can be mapped to a model reliance vector as follows:

$${e}_{n}\left({ɱ}_{1}\right)=\left[{mr}_{n}^{1}\left({ɱ}_{1}\right),{mr}_{n}^{2}\left({ɱ}_{1}\right),\dots ,{mr}_{n}^{p}\left({ɱ}_{1}\right)\right]$$
(4)

where \({mr}_{n}^{p}\)(ɱ) represents the model reliance of the model ɱ on variable p that is obtained from the explanation method \(n\). The model reliance vector values are also mapped to model reliance ranking lists as follows:

$${E}_{MRR}\left({ɱ}_{1}\right)=\left\{{e}_{1}\left({ɱ}_{1}\right),{e}_{2}\left({ɱ}_{1}\right),\dots ,{e}_{n}\left({ɱ}_{1}\right)\right\}$$
(5)

The explanation \({E}_{MRR}\)(ɱ1) is the set of model reliance ranking lists obtained for the \({1}^{st}\) model (ɱ1) from the \(n\) explanation methods. The \({e}_{n}\)(ɱ1) entry gives the feature ranking list for the model ɱ1 obtained from the \({n}^{th}\) explanation method. For example, a ranking list can follow the order \({f}_{1}>{f}_{3}>{f}_{4}>{f}_{p}>\dots >{f}_{2}\), where \({f}_{1}\) is the input feature with the highest importance value among all the variables \({f}_{2},{f}_{3},{f}_{4},\dots ,{f}_{p}\) and \({f}_{2}\) has the least importance among the p variables.
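A minimal sketch of this mapping from a reliance vector to a ranking list is shown below; the function name and the use of descending argsort are illustrative assumptions.

```python
import numpy as np

def reliance_ranking(mr_vector, feature_names):
    """Map a model reliance vector (Eq. 4) to a ranking list (Eq. 5),
    ordering features from most to least important."""
    order = np.argsort(-np.asarray(mr_vector))   # indices sorted by descending importance
    return [feature_names[i] for i in order]

# Example: reliance_ranking([0.4, 0.05, 0.3, 0.25], ["f1", "f2", "f3", "f4"])
# returns ["f1", "f3", "f4", "f2"], matching the f1 > f3 > f4 > f2 order above.
```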

3.4 Finding the Reference Explanation and Consistent Explanations

Various methods that operationalize the feature importance computation may not produce the same explanation for a model [34]. The explanations differ not only in the ranking order but also in the computed model reliance values. Despite these variations, no clear reason could be found in the literature for selecting a specific explanation method. As pointed out by [16], if the results of different techniques point to the same conclusion, they very likely reflect real aspects of the underlying data. Therefore, a reference explanation reflecting the feature order commonly found among the different methods’ explanations of a model should be discovered. This reference explanation captures the optimal feature order by aggregating all the explanations’ feature ranking preferences using the modified Borda Count method [23].

$${e}_{1}^{*}=Borda\left({e}_{1}\left({ɱ}_{1}\right),{e}_{2}\left({ɱ}_{1}\right),\dots ,{e}_{n}\left({ɱ}_{1}\right)\right)$$
(6)

The Borda function returns an aggregated model reliance ranking order, i.e., \({e}_{1}^{*}\) captures the optimal ranking order of the features from the \(n\) explanations of the \({1}^{st}\) model. Likewise, for each model, a reference explanation is aggregated from the corresponding model’s explanations from the n methods. This leads to a total of \({r}\) reference explanations for the Rashomon set models Ʀ.
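The sketch below illustrates the aggregation idea with a plain Borda count; the paper uses a modified Borda count [23], whose exact weighting is not reproduced here, so this is only an assumed approximation.

```python
from collections import defaultdict

def borda_reference(ranking_lists):
    """Aggregate the n ranking lists of one model into a reference order e*."""
    scores = defaultdict(float)
    for ranking in ranking_lists:
        size = len(ranking)
        for position, feature in enumerate(ranking):
            scores[feature] += size - position   # earlier rank, higher score
    # reference explanation: features sorted by total Borda score
    return sorted(scores, key=scores.get, reverse=True)
```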

To quantify the consistency of the several methods in producing similar explanations for a model, the methods’ explanations for the model are compared against the reference explanation. To find the consistency score, a ranking similarity method needs to be applied. Existing statistical methods such as Kendall’s τ [24] are considered inadequate for this problem because the ranking lists may not be conjoint. On the other hand, the Rank-Biased Overlap (RBO) [28] can handle ranking lists even when the lists are incomplete. The RBO similarity between two feature ranking order lists R1 and R2 is calculated as per [28]:

$$\begin{aligned}RBO\left({R}_{1},{R}_{2},p\right) & =\left(1-p\right){\sum }_{d=1}^{\infty }{p}^{d-1}\cdot {A}_{d}\\ {A}_{d} & =\frac{\left|{R}_{1}\cap {R}_{2}\right|}{d}\end{aligned}$$
(7)

The RBO similarity value ranges from 0 to 1, where 0 indicates no similarity between the feature ranking order lists and 1 indicates complete similarity. The parameter p (0 < p < 1) defines the weight given to the top features. The parameter Ad defines the agreement of the overlap at depth d: the intersection size of the two feature ranking lists at the specified depth d is the overlap of those two lists (refer to Eqs. 17 in [28]).
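A minimal sketch of a truncated RBO computation is given below; cutting the infinite sum at the shorter list length and the default p = 0.9 are assumptions made here, not values prescribed by the paper.

```python
def rbo(r1, r2, p=0.9, depth=None):
    """Truncated Rank-Biased Overlap between two ranking lists (Eq. 7).

    A_d is the size of the overlap of the two top-d prefixes divided by d.
    """
    depth = depth or min(len(r1), len(r2))
    score = 0.0
    for d in range(1, depth + 1):
        a_d = len(set(r1[:d]) & set(r2[:d])) / d   # agreement at depth d
        score += p ** (d - 1) * a_d
    return (1 - p) * score
```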

A similarity score is computed between the model’s various explanations and the corresponding reference explanation and is referred to as optimal similarity. It is calculated as follows,

$$\begin{aligned} & {OPTIMAL\_SIM}_{i,j}=RBO\left({e}_{i,j},{{e}_{j}}^{*}\right)\\ & where \,\, i\,=\,1\,\, to\,\, n\,\, methods ; \,\,j\,=\,1\,\, to\,\, r\,\, models\end{aligned}$$
(8)

The \({OPTIMAL\_SIM}_{i,j}\) value defines how much the explanation (\({e}_{i,j}\)) from method \(i\) for the model j (ɱj) complies with the reference explanation \({{e}_{j}}^{*}\) in terms of feature order. The \(OPTIMAL\_SIM\) value is computed for all the method explanations of each model. Therefore, \({n}\) × \({r}\) similarity scores are obtained in total, that is, each explanation method gets a consistency score for each model.
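Reusing the rbo sketch above, the n × r consistency matrix of Eq. 8 can be filled as follows; the container shapes are assumptions made for illustration.

```python
import numpy as np

def optimal_similarity(rank_explanations, reference_explanations, p=0.9):
    """Fill the n x r consistency matrix of Eq. 8.

    `rank_explanations[i][j]` is assumed to hold method i's ranking list for
    model j, and `reference_explanations[j]` the reference order e*_j.
    """
    n, r = len(rank_explanations), len(reference_explanations)
    sim = np.zeros((n, r))
    for i in range(n):
        for j in range(r):
            sim[i, j] = rbo(rank_explanations[i][j], reference_explanations[j], p)
    return sim
```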

3.5 Computing the Weighted Grand Mean (θ)

For the various explanations of the Rashomon set models, the optimal similarity scores of the methods are calculated based on how well each method’s explanation complies with the corresponding model’s reference explanation. This score shows the degree of similarity that the method has in explaining the model’s optimal learning behaviour.

Since the different explanation methods produce different feature importance coefficients for each feature, the model appears to have varying levels of reliance on a feature depending on the method. Therefore, a grand mean (θ) across the several methods should be estimated. For that, a weighted mean [38] is implemented, with the optimal similarity score used to weight the feature importance values computed by each method for a model. For each feature, the weighted mean of the feature importance values, with the methods’ optimal similarity scores as weights, is calculated by:

$${\theta }_{j,k}=\frac{{\sum }_{i=1}^{n}{OPTIMAL\_SIM}_{i,j}\cdot {mr}_{i}^{k}\left({ɱ}_{j}\right)}{{\sum }_{i=1}^{n}{OPTIMAL\_SIM}_{i,j}}$$
(9)

The grand mean of feature k of model j (\({\theta }_{j,k}\)) is calculated by adding the products of the optimal similarity scores of the 1 to n methods with their computed feature importance values for feature k (\({mr}_{1}^{k} \,\,to\,\, {mr}_{n}^{k}\)) and dividing the result by the sum of the n methods’ weights (i.e., the optimal similarity scores of the n methods). The grand mean is computed for all the p features of each Rashomon set model. Therefore, \(p\) × \(r\) weighted mean feature importance values are obtained.
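A vectorised sketch of this weighting step is shown below; the (n, r, p) array layout is an assumption made for illustration.

```python
import numpy as np

def grand_mean(importances, optimal_sim):
    """Weighted grand mean theta of Eq. 9 for every model and feature.

    `importances[i, j, k]`: method i's normalised importance of feature k for
    model j; `optimal_sim[i, j]`: the corresponding consistency weight.
    """
    importances = np.asarray(importances)            # shape (n, r, p)
    weights = np.asarray(optimal_sim)[:, :, None]    # shape (n, r, 1)
    return (weights * importances).sum(axis=0) / weights.sum(axis=0)  # shape (r, p)
```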

3.6 Method Agnostic Model Class Reliance (MAMCR) Explanation

The method agnostic model class reliance explanation of the Rashomon explanation set is given as a comprehensive reliance range for each variable based on the reliance of all the well performing models under \(n\) explanation methods.

The model class reliance of all the p variables can be given as a range of lower and upper bounds of weighted feature importance values. The lower and upper bounds of the model class reliance for each variable can be defined as,

$${MCR}_{k-}=\underset{j=1,\dots ,r}{\mathrm{min}}\,{\theta }_{j,k}$$
(10)
$${MCR}_{k+}=\underset{j=1,\dots ,r}{\mathrm{max}}\,{\theta }_{j,k}$$
(11)

The range [MCRk−, MCRk+] of variable k indicates that if the MCRk+ value is low, the variable k is not important for any of the almost-equally-accurate models in the Rashomon set, whereas if the MCRk− value is high, the variable k is important for every well performing model in Ʀ. Thus, the MCR provides a method agnostic variable importance explanation for all the well performing models of the Rashomon set.
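The bounds can be obtained directly from the grand means; the sketch below assumes, as the usage in Sect. 4 suggests, that Eqs. 10–11 take the minimum and maximum weighted grand means across the r models.

```python
import numpy as np

def mamcr_bounds(theta):
    """Feature-wise model class reliance range over the Rashomon set."""
    theta = np.asarray(theta)                     # shape (r, p): grand means per model and feature
    return theta.min(axis=0), theta.max(axis=0)   # (MCR-, MCR+) per feature
```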

4 Experiments and Results

In this section, the proposed method is illustrated with experiments on the 2-year criminal recidivism prediction dataset released by ProPublica to study the COMPAS (Correctional Offender Management Profiling for Alternative Sanctions) model used throughout the US court system. The dataset consists of 7214 defendants (from Broward County, Florida) with 52 features and one outcome variable, 2-year recidivism. Among the 52 features, 12 are date-type fields denoting jail-in and jail-out, offence, and arrest dates, 21 are personal identifiers such as first and last name, age, sex, case numbers and descriptions, and the other features are mostly numeric values such as the number of days in screening, in jail, from COMPAS, etc. The framework is not limited to this data but is flexible enough to support any dataset.

In the analysis of the Race variable’s contribution to predicting 2-year recidivism, the authors of [22] report that there are some well performing models which do not rely on inadmissible features like Race and gender. Additionally, for the same data set, the authors of [29] report that the explanation based on a single model is biased over the inadmissible feature ‘Race’, whereas the grand mean of multiple models’ feature importance values does not highlight the feature as important for the majority of the models. To check whether these claims remain consistent across multiple methods’ explanations, and to answer the research questions, the same dataset used by [22, 29] with a similar setup (the 6 features age, race, prior, gender, juvenile crime, and current charge for all 7214 defendants) is taken for the analysis.

To make the outcome prediction, a logistic regression model class is used in the analysis with 90% (6393) training data and 10% (721) test data, as in [29]. Stratified 5-fold cross-validation is used to train and validate the multiple models. The total number of trained models and the models selected for the Rashomon set are shown in Fig. 2. A reasonable sample of Rashomon set models (350) is obtained from the 2665 trained models by filtering for the models whose prediction accuracy is above the accuracy threshold (1−ε)ɱ* = 0.6569, where the ɱ* accuracy = 0.6914 and ε = 5%. Those models form the Rashomon set Ʀ.
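A hedged sketch of this setup is given below. The paper does not specify how the 2665 models were generated, so the regularisation grid and fold handling here are purely illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, train_test_split

def build_rashomon_set(X, y, epsilon=0.05, seed=0):
    """Sketch: a logistic regression model class trained under stratified
    5-fold CV and filtered by the accuracy threshold (1 - eps) * best."""
    X, y = np.asarray(X), np.asarray(y)
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.10, stratify=y, random_state=seed)
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)
    models, accuracies = [], []
    for c in np.logspace(-3, 3, 50):                 # vary regularisation strength (assumed)
        for train_idx, _ in cv.split(X_tr, y_tr):
            m = LogisticRegression(C=c, max_iter=1000)
            m.fit(X_tr[train_idx], y_tr[train_idx])
            models.append(m)
            accuracies.append(m.score(X_te, y_te))
    threshold = (1 - epsilon) * max(accuracies)      # e.g. 0.6569 when the best is 0.6914
    rashomon = [m for m, a in zip(models, accuracies) if a >= threshold]
    return rashomon, threshold
```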

To obtain the explanations for the models’ decisions, the iAdditive method and 5 other state-of-the-art XAI methods [3, 4, 7, 25, 26] based on the feature importance approach are applied to the Rashomon set models Ʀ. Normalization is applied to each method’s computed importance values for each model. The model reliance rankings for each model are also obtained (\({E}_{MRR}\)). Figure 3 shows the various methods’ model reliance ranking ranges for the Rashomon set models, grouped by each feature of the COMPAS dataset.

Fig. 2. The prediction accuracy frequency of all the trained models. The accuracy threshold (1−ε)ɱ* = 0.6569, where the ɱ* accuracy = 0.6914 and ε = 5%, is used to search for the Rashomon set models (Ʀ). Only models with an accuracy above the threshold are included in Ʀ.

The distribution of feature importance ranks obtained from the different methods illustrates the variation found across the method explanations. Consider the ‘Race’ feature’s rank explanations. The Shap [3], Skater [4] and iAdditive methods’ ranks span from 1–6 across the models, whereas for the other 3 methods the range is from 2–6. This means that, as per the former methods’ explanations, there are some models which consider the ‘Race’ feature as their most important (1st rank) feature, but in the view of the latter methods, ‘Race’ is not the 1st priority feature for any of the models. Now take the ‘Juvenile crime’ feature. As per the Sage [7] method explanations, the ‘crime’ feature is the most important feature for most of the models, whereas for the Shap and iAdditive methods the median ranks lie in the 4th and 5th positions, respectively. The Skater and Lofo [26] methods assign a similar 3rd rank position to the feature, and the Dalex [25] method stands in between the Sage and Skater rank positions by giving the 2nd rank.

From this, it could be observed that for the same models, these methods provide different feature importance explanations (in terms of both computed values and ranks). If any one of the methods is selected to provide the explanation for a well performing model, it could end up in a method-dependent explanation of that model, i.e., the explanation would be biased towards the specific method. Therefore, to get a consensus explanation for the almost-equally-accurate models over all the applied explanation methods, the method agnostic model class reliance (MAMCR) explanation method is implemented.

Firstly, a reference explanation e* is aggregated from the corresponding explanations of the 6 methods for each model to reflect the common feature ranking order. These reference explanations reflect the optimal learning behaviour of all the models in the Rashomon set (see Fig. 4). To quantify the consistency of the various explanations obtained from the multiple methods, the corresponding reference explanation (e*) is compared against each model’s method-wise explanation.

Fig. 3. Model reliance/feature importance rankings obtained from the 6 explanation methods for the COMPAS dataset. Each panel shows a box plot of the ranks allocated by each method to a feature across the 350 Rashomon set models. The differences in the feature rankings illustrate the variation found across the method explanations.

Next, for each model of the Rashomon set, the weighted average is computed for all the features based on each method’s consistency score. The method explanation which complies well with the optimal explanation contributes more to the average model reliance value. For each of the six variables of the 350 models, the grand means (θj,k) are computed using Eq. 9 based on the corresponding method’s consistency/optimal similarity scores.

The method agnostic model class reliance explanation (MAMCR) for the multiple almost-equally-accurate models, based on multiple methods’ explanations, is presented as a range. The lower and upper bounds [MCR−, MCR+] of each variable’s grand mean are selected as the model class reliance for all the models in the Rashomon set. The method agnostic MCR is shown in Table 1. There, the high MCR− value (e.g., 0.08) indicates that the Prior feature is used by all the models and the low MCR+ value (e.g., 0.10) indicates that the Age feature is least used by all the models.

Fig. 4. The feature-wise rank distribution of the optimal reference explanations (e*) for the 350 Rashomon set models.

Table 1. The method agnostic model class reliance explanation of the Rashomon set models for the six features of the COMPAS dataset.

4.1 Discussion

The various methods’ explanations are compared with respect to the ‘Race’ feature’s importance. The distribution of the models’ reliance on this feature is shown in Fig. 5. The number of models that fall within each feature importance bin is displayed on each bar of the histogram. As per the Sage [7] explanation, the ‘Race’ feature is not at all an important feature for most of the models: it could be observed from Fig. 5a that 324 out of 350 models are given a feature importance value of less than 0.1. This indicates that the Race feature is not an important feature for those 324 models, which complies well with the claim of [29]. On the other hand, this does not hold for the other methods’ explanations. From Figs. 5b–5e, it could be observed that there are many models that rely on the ‘Race’ feature in the moderate to high range, whereas Fig. 5f is consistent with Fig. 5a. This alerts us that the explanation obtained from one method is not necessarily the same as the one obtained from another method for the same model.

This addresses the first research question (RQ1): when multiple explanation methods are applied on multiple well-performing models to obtain the feature importance explanations, a feature projected as (un)important by one explanation method is not necessarily agreed upon by another method. Therefore, the identified importance of a feature depends completely on the method applied to obtain the explanation.

Fig. 5. The feature importance values of the ‘Race’ feature for the 350 Rashomon set models, grouped by method (5a. Sage, 5b. Lofo, 5c. Skater, 5d. Shap, 5e. iAdditive and 5f. Dalex). The data label on each bar shows how many models lie within that feature importance bin.

While comparing the method explanations for each feature (see Fig. 3), no two methods could be identified that produce a similar explanation pattern across all the feature explanations. For example, the Skater and Shap method explanations for the Age feature follow the same pattern except for the outliers, and Sage and Dalex show a similar pattern on the same variable; yet the same method pairs do not show similar patterns for other features. For instance, the Skater and Shap methods have contrasting explanation patterns for the Juvenile crime feature, whereas the Skater and Lofo methods exhibit a similar pattern there. One possible reason for the variation is that a feature becomes the most important when another variable becomes the least important [22]. This is illustrated in Fig. 6.

Fig. 6. The feature importance values of the Prior and Juvenile crime features computed by the 6 methods. As the importance values of the Juvenile crime feature increase, the Prior feature importance decreases and vice versa, which is emphasised with a box.

Figure 6 shows the feature importance values computed for the Juvenile crime and Prior features by the 6 methods for the 350 almost-equally-accurate models. Each point in the plot represents a model’s reliance on those variables. When the Prior feature importance (y-axis) of a model reaches its maximum values, such as above 0.6, its crime feature importance (x-axis) is below ≈0.35 (shown within a box). When the crime feature importance of a model rises above 0.8 or around 1, its Prior importance is very low, less than ≈0.15. This indicates that the Prior feature is the most important feature of a model when Juvenile crime is less important than Prior. So, if a method allocates high importance to one feature in its explanation, another feature obviously gets reduced importance, which may make the explanation differ from another method’s explanation.

Despite the variations, the methods and their explanations can be compared based on their computational dependency on the feature permutation function [27]. Identifying the commonalities in the explanations [20] of multiple methods that point to similar feature-wise conclusions is considered to reveal the true importance in the underlying data [16]. Hence, the MAMCR method finds the weighted mean of the feature explanations based on each method’s consistency in producing similar explanations, and through this it provides a comprehensive range for the multiple almost-equally-accurate models. It represents the feature-wise model reliance bounds for all the well-performing models of the pre-specified model class as computed by the pre-specified methods.

To validate that the MAMCR explanation bounds suit all well performing models, a new, almost-equally-accurate test model is created using the same model class (i.e., logistic regression) algorithm with randomly sampled data. This model’s accuracy is verified against the Rashomon set threshold (0.6569). The explanations from the six methods are obtained for the model and the grand mean of each variable is computed. The test model’s feature importance values, plotted along with the MAMCR bounds, are displayed in Fig. 7, which elucidates that the test model’s feature importance for all the variables lies within the MAMCR boundary values. Thus, the second research question (RQ2), finding a consistent explanation across multiple explanation methods for the almost-equally-accurate models, is addressed through the MAMCR framework by obtaining the method agnostic MCR bounds.
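A small hedged check of this validation step, reusing the grand_mean and mamcr_bounds sketches above, could look as follows; the function name is hypothetical.

```python
import numpy as np

def within_mamcr(theta_test, mcr_minus, mcr_plus):
    """True if the test model's weighted grand means (one per feature)
    all fall inside the MAMCR range [MCR-, MCR+]."""
    theta_test = np.asarray(theta_test)
    return bool(np.all((theta_test >= mcr_minus) & (theta_test <= mcr_plus)))
```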

Fig. 7. The feature importance values of a test model’s features along with the MAMCR bounds. The test model’s importance values lie within the MAMCR explanation range.

5 Conclusion

The experiments conducted on the COMPAS data set alert us that a feature highlighted as most important by one method’s explanation may not be projected as such by another method. These inconsistencies in the explanations generated by different explanation methods for the Rashomon set models motivated the proposal of a novel framework for discovering consistent explanations across multiple explanation methods. It provides a method agnostic explanation as a model class reliance for the multiple almost-equally-accurate models. The efficiency of the method agnostic MCR explanation is illustrated by describing the comprehensive variable importance range for all the well performing models of the pre-specified model class across multiple explanation methods.

In this work, only explanation methods that return feature importance values as a global explanation are considered for the explanation ensembling. Future work can extend the framework to instance-wise explanations and to other explanation output formats.