1 Introduction

The proliferation of big data architectures has resulted in many applications having an increasingly large impact on business and society (Junqué de Fortuny et al. 2013). We focus on two sorts of big data. The first is behavioral data, defined as data that capture human behavior through the actions and interactions of people (Shmueli 2017), which can be used for various predictive purposes. For instance, what you “Like” on Facebook is predictive of your openness and many other personality traits (Kosinski et al. 2013, 2017), while the accounts you pay to or webpages you visit are predictive features for product interest (Martens et al. 2016) and creditworthiness (De Cnudde et al. 2019b). The second sort of big data is textual data. Text classification is ubiquitous in business and government (Provost and Fawcett 2013). Example applications are automatic identification of spam emails (Attenberg et al. 2009), objectionable web content detection (Martens and Provost 2014) and legal document classification (Chhatwal et al. 2018), just to name a few.

Mining behavioral and textual data can result in highly accurate classification models (Junqué de Fortuny et al. 2013; Provost et al. 2015), but also in very complex model structures. The complexity arises from the learning technique (e.g., deep learning), the data, or both. Behavioral and textual data are typically high-dimensional and sparse. Let us consider an example that we will refer back to: predicting the gender of users based on the movies they have viewed. Each movie is represented by a binary feature, which results in an enormous feature space. However, each user has watched only a small fraction of the movies, which results in an extremely sparse data matrix (almost all elements are zero). A movie that a user has watched corresponds to a non-zero value for that binary feature, and we refer to such features as “active”. In other words, because of the sparsity, the number of active features \(m'\) of a user (the movies someone watched) is much smaller than the dimensionality m of the full feature space (all possible movies someone could watch). Because of these data characteristics, even intrinsically interpretable linear models are difficult to interpret: there are many thousands of features, each with its own linear coefficient, and the features that are brought to bear for a prediction differ for every individual. Applying nonlinear techniques normally renders the reasons for a particular prediction completely opaque.
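
As an illustration of this representation, the following minimal sketch (with made-up users and movie indices) builds such a sparse user-by-movie matrix and shows the active features of one user:

```python
# A minimal sketch (hypothetical data) of the sparse user-by-movie matrix
# described above: each column is one movie, each row one user, and a 1
# means "this user watched this movie".
import numpy as np
from scipy.sparse import csr_matrix

n_users, n_movies = 3, 10_000          # m = 10,000 possible movies
rows = [0, 0, 1, 2, 2, 2]              # user indices
cols = [5, 42, 42, 7, 99, 5000]        # movie indices that were watched
data = np.ones(len(rows), dtype=np.int8)

X = csr_matrix((data, (rows, cols)), shape=(n_users, n_movies))

# "Active" features of user 0: the m' non-zero columns (movies watched),
# which is tiny compared to the full dimensionality m.
active = X[0].nonzero()[1]
print(active)                          # e.g. [ 5 42]
print(X.nnz / (n_users * n_movies))    # density: fraction of non-zero entries
```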

The importance of understanding individual classification decisions is well argued in the literature (Gregor and Benbasat 1999; Freitas 2014; Martens and Provost 2014; Goodman and Flaxman 2016; Doshi-Velez and Kim 2017; Ras et al. 2018; Lipton 2018). Explanations for model predictions are often necessary for users to trust and improve the model (Gregor and Benbasat 1999). In some domains, like medical diagnosis and credit scoring, it is even a legal requirement (Martens and Provost 2014; Gregor and Benbasat 1999; Martens et al. 2007) (e.g., why was my loan application rejected?). According to Doshi-Velez and Kim (2017), the demand for interpretable models stems from a mismatch between “formal” objectives (e.g., minimal prediction error) and “ethical” objectives (e.g., privacy), which can only be validated when a certain level of interpretability is achieved.

Various approaches have been proposed for explaining model predictions (Craven 1996; Martens and Provost 2014; Ribeiro et al. 2016; Lundberg and Lee 2017; Lipton 2018; Wachter et al. 2018), varying in scope and flexibility. The scope indicates whether the method generates global explanations (for the entire feature/instance space) or instance-level explanations (for a single prediction) (Martens and Provost 2014), whereas the flexibility indicates whether the approach is model-specific or model-agnostic. Much research focuses on model-specific explanation techniques tailored to a particular type of prediction model, such as deep learning models (Samek et al. 2015; Arras et al. 2017) or random forests (Breiman 2001). In contrast, model-agnostic methods can explain the predictions of any model. This increases flexibility, but often at the cost of substantially more computational effort (Martens and Provost 2014; Arras et al. 2017).

For this paper, we focus on the increasingly popular notion of “counterfactual” explanations (Martens and Provost 2014; Provost 2014; Chen et al. 2017; Wachter et al. 2018; Nguyen 2018; Sokol and Flach 2019). A counterfactual explanation of a model-based system’s decision for a particular instance is defined as a set of “evidence” of the instance without which the system would not have made that decision. In our setting of behavioral and textual data, this evidence corresponds to a set of active features of the instance, such that changing these feature values to zero would lead the system to make a different decision. Ideally, this set is minimal, meaning that the predicted class only changes when all features that are part of the counterfactual explanation are removed (feature values set to zero). Note that minimality is not always guaranteed and depends on the algorithm that is used to compute counterfactuals.

Counterfactuals have been argued to be crucial for explaining predictions at the instance level, as they pinpoint the features that led to the decision (Sokol and Flach 2019) and make the decision actionable (Chen et al. 2017; Wachter et al. 2018; Fernandez et al. 2019). In our running movie example, if we want an explanation of why a user, say Sam, was predicted to be “male”, we want to know which movies were critical for the model’s decision. A counterfactual explanation shows a set of movies such that removing them from Sam’s movie list would lead the predicted class to no longer be “male” (see Fig. 1a). In the context of textual and behavioral data, removing a feature from the instance is equivalent to setting the original feature value to zero, or “removing the evidence”. In the remainder of this study, we will consider counterfactuals based on the removal of evidence that is present in the data, for example, words that appear in a document or items that an individual has “Liked” on Facebook. These correspond to “active” features: those that are present in a sparse representation, or those that are non-zero in a traditional feature-vector representation.

In this study, we are interested in finding minimum-sized counterfactual explanations. A possible approach to find the counterfactual is to conduct a complete search through the entire space of feature combinations, starting with one feature and incrementally increasing the number of features until an explanation is found. However, this strategy scales up exponentially with the number of features, making it impracticable for high-dimensional feature spaces (Martens and Provost 2014). Consequently, there is a need for an algorithm that computes counterfactuals with a good trade-off between effectiveness (selecting the most important features for the explanation) and efficiency (computation time).

Martens and Provost (2014) proposed a heuristic best-first search for Evidence Counterfactuals (SEDC),Footnote 1 which is able to counterfactually explain the predictions of any classification model with sparse features. To the best of our knowledge, SEDC is the only existing model-agnostic explanation algorithm for counterfactuals that is able to deal with behavioral and textual data. For this reason, we use this algorithm as a benchmark in this study. The algorithm proposed by Wachter et al. (2018) cannot deal with the many binary variables that are a common representation of behavioral and textual data, which eliminates it from this study.

Fig. 1 Example outputs of two different types of instance-level explanations that show why Sam was classified as “male” based on his movie viewing data: a counterfactual explanation (a) and an additive feature attribution explanation (b)

In the literature, other instance-level explanation types have been proposed for high-dimensional data sources, such as additive feature attribution explanations (Ribeiro et al. 2016; Lundberg and Lee 2017). In our running movie example, an additive feature attribution explanation would show an ordered list of important movies and their corresponding importance weights, specifically, their importance for this particular model decision (see Fig. 1b). Such algorithms generate an importance-ranked list of features, i.e., the coefficients of a linear model, for a single instance. The idea of developing hybrid methods that connect counterfactuals with additive feature attribution explanations stems from the following reasoning: if these importance rankings of features are sufficiently accurate, it may be possible to compute counterfactuals from them by starting from the highest-ranked feature and removing features until the predicted class changes.Footnote 2 One novelty of this study resides in the idea that these importance rankings may be an “intelligent” starting point for searching for counterfactuals. The resulting algorithms for computing counterfactuals may be better than the existing SEDC algorithm. For this reason, we empirically compare the counterfactual explanation algorithms to help researchers and practitioners better understand which method is most suitable when facing behavioral or textual data.

This paper’s main contributions are fourfold: (1) we propose two novel model-agnostic explanation algorithms, created by combining counterfactual explanations with additive feature attribution methods (LIME-C and SHAP-C); (2) we define quantitative evaluation criteria that proxy the effectiveness and efficiency of these algorithms; (3) we perform an in-depth evaluation of the explanation quality of LIME-C and SHAP-C when applied to high-dimensional behavioral and textual data and benchmark their performance against the SEDC algorithm; and (4) we propose changes to the model-agnostic methods for generating counterfactuals and discuss research directions stemming from our findings.

2 Counterfactual explanation algorithms

To our knowledge, counterfactual explanations were first used to explain document classifications (Martens and Provost 2014), and since that time have been applied more broadly (Provost 2014; Chen et al. 2017; Wachter et al. 2018; Nguyen 2018; Sokol and Flach 2019). Martens and Provost (2014) define an explanation in counterfactual terms as a minimal set of active features such that, when removing these features from the instance, the predicted class changes.Footnote 3 For instance, in Fig. 1a, the movies “Die Hard”, “Taxi Driver” and “Mission Impossible” explain why Sam was classified as “male”.

Consider an instance \({\mathbf {x}} =(x_{1} ,\dots , x_{m})\) and the feature indices \(I_{{\mathbf {x}}} =\{1,\dots , m\}\) for \(m \in {\mathbb {N}}\). Let \(I_{A} \subseteq I_{{\mathbf {x}}}\) represent the indices of the active features of \({\mathbf {x}}\), such that \(x_{j} \ne 0\) for all \(j \in I_{A}\) and \(x_{j} = 0\) for all \(j \notin I_{A}\). Let \(I \subseteq I_{A}\) be a subset of the indices of active features; then the perturbed instance \({\mathbf {z}}_{I}\) of instance \({\mathbf {x}}\) (hereafter referred to as \({\mathbf {z}}\)) is defined by \(z_{l} = x_{l}\) for all \(l \in I\) and \(z_{l} = 0\) for all \(l \notin I\). Let C be a trained classifier that maps instances to one of k classes, \(C: {\mathbf {x}} \rightarrow \{1,\dots ,k\}\), and let instance \({\mathbf {x}}\) be classified as class c. In this study, we define a counterfactual explanation as follows:

Definition 1

A counterfactual explanation for instance \({\mathbf {x}}\)’s classification is a set of active features with indices \(E \subseteq I_{A}\) such that removing all features with indices in E from the instance \({\mathbf {x}}\) leads C to produce another classification. The perturbed instance \({\mathbf {z}}_{I}\) with \(I = I_{A} {\setminus } E\) denotes the result of removing the features with indices in E from instance \({\mathbf {x}}\). Further, a counterfactual explanation is minimal in the sense that removing any proper subset of E does not yield a change in class. Specifically:

A set of features with indices E is a counterfactual explanation for \(C({\mathbf {x}}) \Leftrightarrow \)

  1. \(E \subseteq I_{A}\) (the features are active for instance \({\mathbf {x}}\)),

  2. \(C({\mathbf {z}}_{I_{A} {\setminus } E}) \ne c\) (the class changes), and

  3. \(\lnot \exists E' \subset E: C({\mathbf {z}}_{I_{A} {\setminus } E'}) \ne c\) (E is minimal).

Note that, for behavioral and textual data, removing features corresponds to setting the (original) feature values to zero.
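
As a concrete illustration, the following minimal sketch checks Definition 1 for a candidate feature set, assuming a scikit-learn-style classifier with a predict method and a dense feature vector; the helper names (perturb, is_counterfactual) are ours, not part of any of the discussed packages.

```python
# Minimal sketch of the Definition 1 check (helper names are ours).
from itertools import combinations
import numpy as np

def perturb(x, remove_idx):
    """Return the perturbed instance: x with the features in remove_idx set to zero."""
    z = x.copy()
    z[list(remove_idx)] = 0
    return z

def is_counterfactual(clf, x, E):
    """Check whether the index set E satisfies the three conditions of Definition 1."""
    x = np.asarray(x, dtype=float)
    c = clf.predict(x.reshape(1, -1))[0]                   # original predicted class
    if not all(x[j] != 0 for j in E):                      # (1) features are active
        return False
    if clf.predict(perturb(x, E).reshape(1, -1))[0] == c:  # (2) the class changes
        return False
    # (3) minimality: no proper subset already changes the class
    # (brute force over subsets; for illustration only -- exponential in |E|)
    for size in range(1, len(E)):
        for subset in combinations(E, size):
            if clf.predict(perturb(x, subset).reshape(1, -1))[0] != c:
                return False
    return True
```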

We implemented the model-agnostic SEDC heuristic search algorithm presented by Martens and Provost (2014), which conducts a best-first search. For explaining individual predictions of linear classification models, SEDC is optimal in the sense that it always finds a minimum-sized feature set that changes the predicted class (a formal proof can be found in Martens and Provost (2014)). For explaining nonlinear model predictions, optimality is not guaranteed because the algorithm cuts off its search after a limit has been reached (Martens and Provost 2014). Also, because of the search cut-off, the explanations may not be minimal, i.e., a proper subset of the explanation set may itself be a counterfactual. We further limit SEDC’s search by stopping after the first explanation has been found. As the empirical results below show, this makes the method very fast; we leave assessing the full effectiveness vs. time tradeoff to future work.
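
For intuition, the sketch below implements a simplified greedy variant of this idea, under the assumption that the classifier exposes a signed scoring function (positive means the predicted class of interest); the actual SEDC algorithm is a best-first search over partial feature sets with pruning, so this is an approximation of the idea rather than the algorithm itself.

```python
# Simplified greedy sketch of the SEDC idea (not the full best-first search).
import numpy as np

def greedy_counterfactual(score_fn, x, max_size=30):
    """At each step, remove the active feature whose removal decreases the
    score for the predicted class the most, until the score drops to or
    below the classification threshold (assumed here to be 0)."""
    x = np.asarray(x, dtype=float)
    candidates = list(np.flatnonzero(x))   # active features
    removed = []
    while candidates and len(removed) < max_size:
        z = x.copy()
        z[removed] = 0
        if score_fn(z) <= 0:               # predicted class changed
            return removed
        # pick the single best feature to remove next
        drops = []
        for j in candidates:
            z_j = z.copy()
            z_j[j] = 0
            drops.append((score_fn(z_j), j))
        best_score, best_j = min(drops)
        removed.append(best_j)
        candidates.remove(best_j)
    z = x.copy()
    z[removed] = 0
    return removed if score_fn(z) <= 0 else None   # None: no counterfactual found
```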

Additive feature attribution methods use an explanation model g as an interpretable approximation of the trained classification model C (with corresponding scoring function \(f_{c}\): \({\mathbf {x}} \rightarrow {\mathbb {R}}\)) in the neighborhood of an instance \({\mathbf {x}}\). Two recently proposed model-agnostic methods are local interpretable model-agnostic explanations (LIME) (Ribeiro et al. 2016) and Shapley additive explanations (SHAP) (Lundberg and Lee 2017). In the context of text and behavior, the explanation model g is a linear function of binary variables that indicate whether a feature is “active” (original value) or “removed” (zero). Consider again the instance \({\mathbf {x}} =(x_{1} ,\dots , x_{m})\) that has \(m'\) active features (note that the full feature dimension m can be much larger). The additive feature attribution explanation of instance \({\mathbf {x}}\) can be represented as a linear model:

$$\begin{aligned} g(\mathbf {x'}) = \phi _{0} + \sum _{j=1}^{m} \phi _{j}x'_{j} \end{aligned}$$
(1)

where \(x'_j \in \{0,1\}\) is the binary representation of \(x_j\) (\(x'_{j}\) is 1 if \(x_{j}\) is non-zero, and 0 otherwise), m is the number of features of instance \({\mathbf {x}}\), and \(\phi _{0}, \phi _{j} \in {\mathbb {R}}\). For SHAP, the weights retrieved from the model also represent the (approximate) Shapley values, which have theoretically attractive properties (see Lundberg and Lee (2017) for more details). The main differences between LIME and SHAP are (1) how they generate the sample of perturbed instances, (2) the distance function \(\pi \) and (3) how the complexity of the explanation is controlled.

Suppose we want to explain the instance \({\mathbf {x}}\). Both LIME and SHAP first map the instance to a binary representation \(\mathbf {x'} =(x'_{1} ,\dots , x'_{m})\) using a mapping function \(h({\mathbf {x}})=\mathbf {x'}\). Next, perturbed instances are generated from \(\mathbf {x'}\), and each perturbed instance \(\mathbf {z}'\) is assigned a distance weight \(\pi _{\mathbf {x}'}(\mathbf {z}')\). LIME generates perturbed instances by sampling \({\tilde{n}}\) instances, randomly removing active features from \(\mathbf {x'}\). Each perturbed instance \(\mathbf {z}'\) is then mapped onto the original feature space to obtain the predicted score via the scoring function \(f_c(\mathbf {z})\), which is used as the label for training the explanation model g. For textual data, LIME uses the cosine distance to measure the similarity between \(\mathbf {x}'\) and \(\mathbf {z}'\), which seems a suitable choice for behavioral data as well.Footnote 4 SHAP starts by estimating distance weights for different subset sizes, where a subset size is the number of non-zero elements of a perturbed instance \(\mathbf {z}'\); for each subset size s, a distance weight is estimated.Footnote 5 The method then samples \({\tilde{n}}\) perturbed instances from the subset spaces, starting from the smallest (and largest) subsets. LIME trains the explanation model using \(\ell _{2}\)-regularized linear regression and further controls complexity by allowing exactly K features in the explanation. SHAP trains the model using \(\ell _{1}\)-regularized linear regression.
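
To make the LIME-style first step concrete, here is a rough, self-contained sketch under our assumptions (zeroing random subsets of active features, cosine-distance kernel weights, weighted ridge regression on the binary representation); the function name and the kernel width are our own choices, and the official lime package differs in its details.

```python
# Rough sketch of a LIME-style importance ranking for a sparse instance x
# (dense vector here for simplicity). score_fn maps a feature vector to the
# model's score for the predicted class.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics.pairwise import cosine_similarity

def lime_style_ranking(score_fn, x, n_samples=5000, kernel_width=0.25, rng=None):
    rng = np.random.default_rng(rng)
    x = np.asarray(x, dtype=float)
    active = np.flatnonzero(x)                      # the m' active features

    # Binary interpretable representation: 1 = keep the feature, 0 = remove it.
    Z_bin = rng.integers(0, 2, size=(n_samples, len(active)))
    Z_bin[0] = 1                                    # include the original instance

    # Map back to the original space and label with the black-box score.
    Z = np.tile(x, (n_samples, 1))
    Z[:, active] = Z[:, active] * Z_bin
    y = np.array([score_fn(z) for z in Z])

    # Proximity weights from the cosine distance to the original binary vector.
    sim = cosine_similarity(Z_bin, np.ones((1, len(active)))).ravel()
    weights = np.exp(-((1 - sim) ** 2) / kernel_width ** 2)

    surrogate = Ridge(alpha=1.0).fit(Z_bin, y, sample_weight=weights)
    # Return the active feature indices ranked from highest to lowest coefficient.
    return active[np.argsort(-surrogate.coef_)]
```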

Note that neither LIME nor SHAP produces non-trivial counterfactual explanations natively. However, it is straightforward to produce variants of these algorithms that do. Specifically, we can apply the efficient search algorithm for counterfactuals for linear models (Martens and Provost 2014), which we refer to as lin-SEDC,Footnote 6 to the importance-ranked lists generated by LIME and SHAP. We refer to these novel algorithms as LIME-C and SHAP-C, where C stands for “Counterfactual”. The (general) pseudo-code of a hybrid algorithm combining additive feature attribution explanations and counterfactuals is shown in Algorithm 1. In a first step, an additive feature attribution explanation is generated, without regularizing the linear explanation model g. For LIME, this means that the complexity parameter K is set to the number of active features \(m'\), whereas for SHAP, no regularization is used. From this step, a linear model over the binary representation of the features (original value versus zero) is obtained, or equivalently, an importance-ranked list of features.

Algorithm 1 Pseudo-code of the hybrid counterfactual algorithm: generate an additive feature attribution explanation (step 1), then apply lin-SEDC to the resulting importance-ranked feature list (step 2)

In a second step, the linear algorithm for finding counterfactuals (lin-SEDC) (Martens and Provost 2014) is applied to the ranked list to efficiently generate a counterfactual, if possible. In more detail: the (active) features of the linear explanation model are ranked by their estimated coefficients, from highest to lowest. In a first iteration, the top-ranked feature is removed from the instance, or equivalently, its value is set to zero. If this results in a class change, a counterfactual explanation is found. If not, the set of the two top-ranked features is checked for being a counterfactual explanation; if that fails, the set of the three top-ranked features is removed from the instance, and so on, until a counterfactual explanation has been found. Both LIME-C and SHAP-C rely on random sampling to generate counterfactuals and are thus stochastic explanation algorithms, in contrast to SEDC, which follows the same search path every time it is re-run on the same instance. Moreover, note that there is no guarantee that the counterfactuals from the hybrid algorithms are minimal. A sketch of this second step is shown below.
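
A minimal sketch of step 2, assuming we already have an importance-ranked array of active feature indices (e.g., from the hypothetical lime_style_ranking above) and a scikit-learn-style predict function:

```python
# Sketch of step 2: remove the top-1, top-2, ... ranked features until the
# predicted class changes (the lin-SEDC idea applied to a ranked list).
import numpy as np

def counterfactual_from_ranking(predict_fn, x, ranked_features, max_size=30):
    x = np.asarray(x, dtype=float)
    original_class = predict_fn(x.reshape(1, -1))[0]
    for k in range(1, min(max_size, len(ranked_features)) + 1):
        z = x.copy()
        z[list(ranked_features[:k])] = 0          # remove the k top-ranked features
        if predict_fn(z.reshape(1, -1))[0] != original_class:
            return list(ranked_features[:k])      # counterfactual of size k
    return None                                   # none found within max_size
```

Under these assumptions, LIME-C is simply the ranking step followed by this function; SHAP-C replaces the ranking step with Kernel SHAP.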

3 Experimental setup

3.1 Data sets and models

Our experimental data comprise 10 behavioral and 3 textual data sets. All data are public, except the Facebook and Fraud data. The classification tasks are binary and vary from gender prediction to sentiment analysis. Table 1 summarizes the characteristics of the data. All data have high-dimensional feature spaces with up to hundreds of thousands of features. Movielens_1m, Movielens_100k, KDD2015, Airline and Twitter have lower-dimensional feature spaces compared to the other data sets. For all data sets, the “class-of-interest” is the minority class. A large class imbalance is present for the Fraud data. Also, 20news has a large imbalance compared to the other data sets. Relatively balanced data are Facebook, TaFeng and LibimSeTi (imbalance values b larger than 30%). The large sparsity values p for all data indicate that the number of active features is very small compared to the total number of possible active features.

Table 1 also shows the number of test instances per data set and the average number of active features \({\overline{m}}'\), which is very different between the data sets. Ecommerce and Flickr have very small instances (only 2 to 3 active features), in contrast to other data such as Movielens_1m with instances having over 150 active features.

Table 1 Data characteristics of the data sets (references are placed in brackets): data type (B: behavioral, T: textual), target variable, number of instances and features, the class imbalance b of the target, the sparsity p, and the test set for linear and nonlinear models (the percentage of instances predicted as positive is placed in brackets)

For the behavioral data, we build \(\ell _{2}\)-regularized Logistic Regression (\(\ell _{2}\)-LR) models and multi-layer perceptrons (MLP). Logistic Regression has proven to be the best-performing shallow model for big behavioral data (De Cnudde et al. 2019a), while a follow-up study demonstrated a (modest) performance improvement from deep learning models (De Cnudde et al. 2019c). For the textual data, we build bag-of-words support vector machines (SVM) with linear and RBF kernels, because they are well established as successful for text mining applications (Joachims 1998; Martens and Provost 2014). For preprocessing text, we remove stopwords and lemmatize tokens using the NLTK package in Python, and then use TF-IDFFootnote 7 vectorization (Joachims 1998; Martens and Provost 2014). We use 80% of the data for training the models and 20% as the test set. For \(\ell _{2}\)-LR and SVM, we fine-tune the regularization parameter using a holdout set (25% of the training data). For MLP, we use the best parameter configuration found by De Cnudde et al. (2019a). We build models using the Scikit-learn library. To make classifications, we sort the test instances by decreasing predicted score and classify the top-ranked k% of instances as positive, such that the fraction of test instances classified as positive equals the fraction of positives in the training data.
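
The text pipeline and the top-k% thresholding rule can be sketched roughly as follows (load_corpus is a hypothetical data loader, and the hyperparameters shown are illustrative defaults rather than the tuned values used in the paper):

```python
# Rough sketch of the text-classification pipeline and top-k% thresholding.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

docs, labels = load_corpus()              # hypothetical loader: texts + binary labels

X_train_txt, X_test_txt, y_train, y_test = train_test_split(
    docs, labels, test_size=0.2, random_state=0)

vectorizer = TfidfVectorizer(stop_words="english")
X_train = vectorizer.fit_transform(X_train_txt)
X_test = vectorizer.transform(X_test_txt)

model = SVC(kernel="linear", C=1.0)       # linear-kernel SVM on TF-IDF features
model.fit(X_train, y_train)

# Classify the top-k% highest-scored test instances as positive, where k
# matches the fraction of positives in the training data.
scores = model.decision_function(X_test)
k = np.mean(y_train)                      # fraction of positives in training data
threshold = np.quantile(scores, 1 - k)
y_pred = (scores >= threshold).astype(int)
```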

3.2 Explanations

For the experiments, we generate counterfactuals for the positively predicted test instances and set the maximum size of a counterfactual explanation to 30, in line with questions as to the utility of explanation sets that are too large (Sokol and Flach 2019; Martens and Provost 2014). As a second algorithmic choice, we set the maximum time to compute an explanation to 5 minutes. For SEDC, we set the maximum number of iterations to 50 and use our own Python implementation.Footnote 8 For LIME-C,Footnote 9 we use the LimeText explainerFootnote 10 to generate the importance-ranked list. We set the complexity parameter K equal to the number of active features (Ribeiro et al. 2016) and set the number of perturbed samples \({\tilde{n}}\) equal to 5000 (Ribeiro et al. 2016; Nguyen 2018). Next, we compute counterfactuals from the ranked feature list using lin-SEDC (Martens and Provost 2014). For SHAP-C,Footnote 11 we first compute the linear model using the model-agnostic implementation KernelExplainerFootnote 12 with a single reference value (zero), default \(\ell _{1}\)-regularization and the identity link function. As with LIME-C, we set the size of the neighborhood \({\tilde{n}}\) equal to 5000. Here as well, counterfactuals are computed from the importance-ranked list using lin-SEDC.
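
For reference, the explainer calls look roughly like the sketch below; parameter names follow the lime and shap package versions we are familiar with and may differ across releases, and document, predict_proba_texts, num_active, predict_proba_fn, n_features and x_dense are placeholders rather than names from the paper.

```python
# Hedged sketch of the off-the-shelf explainer calls (step 1 of LIME-C and
# SHAP-C). All lowercase identifiers below are placeholders for this sketch.
import numpy as np
import shap
from lime.lime_text import LimeTextExplainer

# --- LIME step of LIME-C: importance-ranked word list for one document ----
lime_explainer = LimeTextExplainer()
exp = lime_explainer.explain_instance(
    document,                 # raw text of the instance to explain
    predict_proba_texts,      # maps a list of texts to class probabilities
    num_features=num_active,  # K = number of active features (no truncation)
    num_samples=5000)         # size of the perturbed sample, as in the paper
ranked_words = [w for w, weight in sorted(exp.as_list(), key=lambda t: -t[1])]

# --- SHAP step of SHAP-C: Kernel SHAP with a single all-zeros reference ---
shap_explainer = shap.KernelExplainer(predict_proba_fn,
                                      np.zeros((1, n_features)),  # reference value
                                      link="identity")
shap_values = shap_explainer.shap_values(x_dense, nsamples=5000)
```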

3.3 Evaluation criteria

We define the following set of performance metrics for evaluating counterfactual explanations generated by the three different algorithms:

  1. Effectiveness

    • Switching point: the number of features that need to be removed before the classification changes. The switching point equals the size of the counterfactual explanation.

    • Percentage explained: the fraction of positively predicted instances for which a counterfactual explanation smaller than 30 features is found.

  2. Efficiency

    • Computation time: the number of seconds it takes to generate an explanation.

To compare the effectiveness of the different algorithms, we need a common basis for assessing (counterfactual) explanations. Feature-ranking explanations were tied to the notion of the counterfactual implicitly by Nguyen (2018), who introduced the notion of the switching point: the number of features that need to be removed (or set to zero), when traversing the ranked list, before the prediction switches to another class. (This is essentially the procedure of lin-SEDC.) The switching point was originally introduced as a proxy for a method’s ability to rank features from high to low relative importance (Arras et al. 2017; Nguyen 2018); it also gives us a straightforward way of turning feature-ranked explanations into counterfactual explanations. (For explanations already represented as counterfactuals, such as those produced by SEDC, the switching point simply equals the number of features in the explanation.) Measuring the switching point is important because, in cases where the prediction is not the default prediction, simply selecting all the features would produce a class change, but would be a trivial “explanation”. All else being equal, a better importance-ranked list requires fewer features to create a counterfactual explanation, resulting in a lower switching point. We do not allow counterfactuals to be larger than 30; therefore, the switching points also will be no larger than 30. In the experiments, we also run a random explainer for estimating the switching point, against which we benchmark the counterfactual algorithms. It randomly selects an active feature and sets it to zero. If the class changes, a switching point is found. If not, it checks whether the predicted score at least decreased; if it did not, the feature is discarded and another random feature is selected. If the score did decrease, the feature is kept, a new random feature is added, and the algorithm evaluates whether leaving out both features together results in a class change. This is repeated until the random algorithm finds a switching point (sketched below).
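
A minimal sketch of this random benchmark, under the assumption that score_fn is a signed score for the predicted class (positive means the original class, so a score at or below zero indicates a class change); the function name and the max_size cut-off are our own:

```python
# Sketch of the random benchmark explainer described above.
import numpy as np

def random_switching_point(score_fn, x, max_size=30, rng=None):
    rng = np.random.default_rng(rng)
    x = np.asarray(x, dtype=float)
    remaining = list(np.flatnonzero(x))      # active features not yet tried
    kept = []                                # features kept in the removal set
    current_score = score_fn(x)
    while remaining and len(kept) < max_size:
        j = remaining.pop(rng.integers(len(remaining)))   # random active feature
        z = x.copy()
        z[kept + [j]] = 0
        new_score = score_fn(z)
        if new_score <= 0:                   # class changed: switching point found
            return len(kept) + 1
        if new_score < current_score:        # keep j only if the score decreased
            kept.append(j)
            current_score = new_score
    return None                              # no switching point within max_size
```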

Information on effectiveness is captured by the percentage explained, which indicates the fraction of instances for which a counterfactual explanation smaller than 30 features is found. More specifically, when the explanation method is not very good at identifying the most relevant features, the method will most likely result in larger switching points. This will result in fewer instances for which a counterfactual smaller than 30 is found.

Lastly, we also compare the efficiency of the available implementations of the explanation algorithms, as finding counterfactual explanations can be a computationally hard problem. How much computation time matters depends on the timeliness needs of the application: for example, one may compute explanations on demand for a small number of instances at human-cognition speeds, or compute and cache explanations for all predictions in a high-throughput application (e.g., “Why was I shown this?”).

4 Results: effectiveness

Table 2 shows the percentage explained by each of the algorithms. For the linear models, there are very small differences between the methods and SEDC is always better than or as good as the other methods. For the LibimSeTi data, LIME-C and SHAP-C find significantly fewer counterfactual explanations than SEDC.

For the nonlinear models, however, SEDC never produces more explanations than LIME-C and SHAP-C. SEDC has a significantly lower percentage explained than LIME-C and/or SHAP-C for 5 out of 13 data sets. Since in theory, without an iteration limit, the best-first search will find (all) explanations for every case, this phenomenon is due to the heuristic cut-off of the search at 50 iterations: the algorithm does not expand more than 50 feature sets (search nodes). In more detail: for some nonlinear models, removing any single feature leaves the predicted score unchanged. Consequently, the algorithm selects a random feature to continue with in the following iteration, and the same may happen in the second iteration. These “bad” feature choices are what make the algorithm need more than 50 iterations to find a counterfactual explanation.

Lastly, SHAP-C seems to have difficulties with the Fraud data. For Fraud/nonlin, only 75% of the test instances are explained. For the non-explained instances, all estimated coefficients (step 1 in Algorithm 1) are zero, so no useful linear explanation model is obtained. We conjecture this is due to the random sampling procedure, which requires a larger number of perturbed instances \({\tilde{n}}\). When setting the sample size \({\tilde{n}}\) to 7000 (instead of 5000), the percentage explained increases to 100%, indicating that this is roughly the number of perturbed samples needed to generate explanations for all instances. We conjecture that this “critical number of perturbed samples” increases for highly imbalanced data like Fraud and that this is related to the sampling procedure of SHAP-C.

Table 2 Percentage explained (counterfactuals smaller than 30 features)
Table 3 Median and interquartile range of switching point

Table 3 shows the median and interquartile range of the switching points.Footnote 13 A first observation is that the data sets with large instances, such as Movielens_1m and Facebook, have a wider range of switching points (a large third-quartile value) than data sets with small instances such as Flickr and Ecommerce, where the first quartile, the median and the third quartile all equal 1. We also observe that, for linear models, there are no differences in the median switching point between the algorithms. For linear models, the low switching points of SEDC are not surprising: SEDC is optimal for linear models, i.e., it will always find the minimum-sized subset of features (Martens and Provost 2014). The novel algorithms LIME-C and SHAP-C, which are approximation methods, usually find the smallest-sized explanations for linear models as well. For the nonlinear models, however, no method dominates. LIME-C and SHAP-C perform worse than SEDC on the YahooMovies and LibimSeTi data sets, while SEDC performs worse in terms of median switching point than LIME-C and SHAP-C on the Facebook data. The Facebook data present an interesting case: the mean switching points of SEDC, LIME-C and SHAP-C for Facebook/nonlin are respectively 8.34, 3.59 and 4.13, indicating that there are more outlier values for SEDC. The reason here is similar to the discussion of the iteration limit above, but there is an additional factor: we stop the search after the first explanation is found. This may penalize SEDC in terms of explanation length, but gives it an advantage in terms of computational efficiency. Finally, when comparing the methods with the random benchmark, we conclude that all approaches are significantly better at pinpointing the most important features, except on the Ecommerce, Flickr and Fraud data, where random performs as well because of the few active features per instance.

Table 4 Median and interquartile range of computation time in seconds

5 Results: computational efficiency

Table 4 summarizes the computation times. We observe that the median computation times of SEDC are very small compared to LIME-C and SHAP-C: for all our data and models, the median computation time for SEDC is less than 1 second. The interquartile ranges and the mean computation times also inform us about the efficiency of SEDC. More specifically, for all data sets there are many outlying computation times. This is because SEDC’s efficiency (mostly) depends on the number of features in the explanation. We observe that, for data with very low switching points (e.g., YahooMovies), SEDC is very efficient over the entire set of test instances: there are not many extreme values. For instances that need more features to be removed before a predicted class change is obtained,Footnote 14 SEDC is slower (Movielens_1m, Facebook, LibimSeTi). This becomes an issue for classification problems where instances are “harder” to explain with counterfactuals, i.e., more features need to be removed to change the predicted class. For data with small instances (e.g., Ecommerce) or classification problems where data instances are “easier” to explain by a counterfactual, SEDC is always the most efficient method. Note that, despite the fact that LibimSeTi has, on average, a smaller switching point than Facebook, it still takes much longer to generate counterfactual explanations for the LibimSeTi data. This is because the number of active features is, on average, very large for LibimSeTi, which also plays an important role in determining the computation time.

Overall, LIME-C and SHAP-C have stable efficiency, and their computation time does not depend on the switching point, in contrast to SEDC, for which the computation time is sensitive to the number of features in the explanation. The efficiency of LIME-C and SHAP-C depends mostly on the number of active features of an instance. The results indicate that SHAP-C’s efficiency is more sensitive to the number of active features of the instance (median and interquartile range values are relatively larger starting from YahooMovies).

Lastly, the algorithms are generally slower for textual data than for behavioral data. We conjecture this is because of the time to evaluate the SVM scoring function f, which may be longer than for the \(\ell _{2}\)-LR and MLP scoring functions. As an illustration, take the Facebook data (behavioral) and the 20news data (textual). Even though the Facebook data has more active features per instance and more features in the model (122,924 compared to 41,356), the median time to compute a counterfactual is higher for the 20news data than for the Facebook data for all three algorithms.

6 Conclusion and future work

From this study applying model-agnostic, instance-level explanation methods to finding counterfactual explanations for high-dimensional behavioral and textual data, we can draw several conclusions. First, the (straightforward) extensions LIME-C and SHAP-C, as expected, find reasonable, if not always optimal, counterfactual explanations. Furthermore, extending these algorithms to find counterfactual explanations addresses an open problem with the application of these methods to high-dimensional data, namely, which features should be reported in the explanation. The answer for LIME-C and SHAP-C is: those that allow the creation of an Evidence Counterfactual. SHAP-C does have problems with highly unbalanced data sets. Despite this, SHAP-C may still be preferable when the user is particularly interested in the theoretical interpretation of the importance weights (Lundberg and Lee 2017). LIME-C showed stable effectiveness for all data and models, and even for very large data instances that require many features to be removed for a predicted class change, LIME-C computes counterfactuals relatively fast. Moreover, the results indicate that the efficiency of LIME-C is less sensitive to the number of active features than that of SHAP-C.

SEDC, which was designed to find counterfactual explanations, is generally fast and effective, but not always. In the main results, SEDC was clearly the fastest. It is provably optimal for linear models, and it also empirically found smaller counterfactuals (on average) for the nonlinear models on two data sets. However, for certain instances on certain data sets, SEDC’s run time was comparatively quite large. Furthermore, the search stopping criteria were met before SEDC found explanations in a non-negligible number of cases. As a best-first search algorithm, it involves an effectiveness vs. efficiency tradeoff that we did not explore comprehensively in this paper.

This work indicates that there is a good deal of room for more research on this topic. For example, other hybrid algorithms could be created instead of LIME-C and SHAP-C: LIME (or SHAP) could be run first to fix a search order for a search algorithm like SEDC. In those cases where LIME-C (SHAP-C) produces a good explanation, this new hybrid would find it fast. But the algorithm could keep searching, biased towards trying the best features found by LIME (SHAP) before the others, which would likely lead to finding even better explanations faster. Furthermore, we see optimized search algorithms performing quite well for computationally hard problems (Schreiber et al. 2018); we conjecture that similar algorithms could be applied in the context of classification from big, sparse data to find optimal explanations fast.