
1 Introduction

Recent years have witnessed a surge of interest in DNNs, particularly in the field of image recognition. Research on time series classification using DNNs has likewise progressed, with FCNs having demonstrated competitive performance, making them promising candidates for real-world applications [10, 26]. However, the inner workings of DNNs are a black box, making it difficult for end users to trust model output. To address this problem, researchers have been actively exploring the field of XAI. Methods such as LIME [21], SHAP [19] and CAM [28] have been proposed to provide transparent explanations of model predictions. One approach within XAI is counterfactual explanation, which shows how a query can be altered to the counterfactual instance in order to change the model’s prediction result. Counterfactual explanation not only presents important components of the query that contributed to the prediction, but also suggests the user’s next action to change the result. From this perspective, it is said to be psychologically effective [4], with many methods having been proposed [8, 13, 18]. Although counterfactual explanation has become an increasingly popular XAI field in recent years [20], most existing methods focus on image and tabular data, and few methods have been developed for time series data [7, 9, 12].

In the field of time series, certain subsequences within the data are considered to have significance [27]. Just as DNNs learn semantic concepts in the image domain [3], they are likely to learn subsequences in time series. To improve end-user understanding and satisfaction, it is effective to present explanations based on these subsequences, in addition to changing the model classification results. Indeed, research in the image domain has shown that presenting the meaningful concepts learned by DNNs as explanations increases user satisfaction [1]. Furthermore, in the case of time series data, multiple subsequences contribute to classification [10]. To fully interpret a model’s predictions, it is therefore necessary to treat all of these subsequences simultaneously.

Our proposed method, MIPCE, obtains subsequences (referred to as patches) learned by the FCN, and divides the corresponding query into patches. Aside from generating counterfactuals, MIPCE also provides the process of each patch’s continuous change to the counterfactual. This is because presenting changes to the counterfactual has been shown to have a positive effect on user understanding and satisfaction [23].

2 Related Works

As mentioned previously, LIME, SHAP, and CAM are widely recognized XAI methods. They provide visual explanations by highlighting the regions that contribute to the classification. In the case of counterfactuals, Wachter et al. proposed a baseline method (W-CF) [25] that generates the counterfactual within a small distance from the query. As extensions of W-CF, many methods in the image domain use generative models, such as generative adversarial networks (GANs), to obtain the counterfactuals [11, 17, 23]. Furthermore, methods that use features appearing in DNNs along with GANs have also been proposed [13]. These methods are designed to satisfy proximity, which measures the similarity between the query and counterfactual, and plausibility, which determines whether the counterfactual follows the data distribution or is out of distribution (OOD). In addition, counterfactuals must be generated in a form that is recognizable to humans from an XAI perspective [20]. Some methods jointly present the process of change to the counterfactual [11, 17]. One such method has demonstrated the effectiveness of its interpretation through expert evaluations [23]. However, it should be noted that these methods do not specifically focus on time series data.

In the field of time series XAI, one popular approach is to focus on time series subsequences known as shapelets [27], and a method has been developed to explain random forest classification models that are trained with shapelets [12]. Our approach also focuses on subsequences, but it differs in terms of the target models and the procedures for obtaining subsequences. Another method is Native-Guide [7], which modifies the results of any classification model by changing part of the query to the nearest-neighbor instance of a different class (denoted as NUN, short for nearest-unlike-neighbor). This method can be applied to DNN classification models such as FCNs, but it has difficulty accurately capturing subsequences. It should also be noted that these methods cannot generate continuous changes to the counterfactual.

3 MultIple Patches Counterfactual-Changing Explanations (MIPCE)

MIPCE divides time series data into subsequences using Gaussian mixture models (GMM), and generates continuous changes from the query to the counterfactual (see Fig. 1). It is necessary for the process of change to follow the principle of proximity in order to provide more interpretable explanations for users. Ideally, continuous changes would gradually approach the counterfactual in the range between the query and counterfactual. In addition, sparsity, defined as the idea of not changing anything except the necessary parts of the query, is also important for interpretability. To generate these ideal explanations, MIPCE uses Gaussian process latent variable models (GPLVM) [14] for each patch.

Fig. 1. MIPCE overview. (Left) Time series patch division. (Right) Change to the counterfactual. Green shows the original query, others show changes from the query. (Color figure online)

3.1 Setup and Notation

Assume a two-class FCN classification model [26] (denoted as M) as a black-box model. We represent the input data as \(\textbf{y} \in \mathbb {R}^{T_{\textbf{y}}}\), the latent variable of \(\textbf{y}\) as \(\textbf{z}\), and the feature extracted by the convolution as \(\textbf{X} \in \mathbb {R}^{T_{\textbf{X}} \times S}\), where S denotes the number of channels in the last convolutional layer. Because ReLU activations are used, \(\lbrace \textbf{x}_{s} \rbrace _{s=1}^{S} \ge \textbf{0}\), and \(T_{\textbf{X}}\) and \(T_{\textbf{y}}\) are equal because the strides of all convolutional layers are set to 1. Let R denote the convolutional receptive field. For the query (denoted as q), the class predicted by the FCN is represented as c, and the classification probability is represented as \(M_{c}\left( \textbf{y}\right) \). For the counterfactual, the predicted class and the probability are similarly represented as \(c^{\prime }\) and \(M_{c^{\prime }}\left( \textbf{y}\right) \). Let \(\textbf{v}_{scaled}\) denote the min-max normalized value of \(\textbf{v}\), and let \(v_{\mathop {\mathrm {arg\,scaled}}\limits _{\textbf{z}}}\) denote v scaled by applying min-max normalization over the set \(\mathbb {v} \; (v \in \mathbb {v})\) obtained by varying \(\textbf{z}\) over its defined range.
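The two normalizations can be illustrated with a minimal sketch. This is not the authors' code: the helper name `min_max_scale`, the toy values, and the choice to map a constant input to zeros are assumptions made only for illustration.

```python
def min_max_scale(values):
    """Min-max normalize a sequence to [0, 1].

    Assumption: a constant sequence is mapped to all zeros.
    """
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

# v_scaled: normalize the entries of one vector.
v_scaled = min_max_scale([2.0, 4.0, 6.0])

# "arg scaled": evaluate a quantity for every candidate z, then normalize
# over that set of values (here, squared distances to a 1-D query latent).
z_q = 1.0
candidates = [0.0, 1.0, 3.0]
sq_dists = [(z - z_q) ** 2 for z in candidates]
sq_dists_scaled = min_max_scale(sq_dists)
```

In the second case the normalization runs over the candidates, so the scaled value of one candidate depends on which other candidates are in the set.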

3.2 Algorithm

Divide the Time Series Data into Patches (Algorithm 1). Using the features of all \(N_{D}\) training data \(\lbrace \textbf{X}^{n} \rbrace _{n=1}^{N_{D}}\), compute a variant of CAM (\(\text {CAM-All} \in \mathbb {R}^{T_\textbf{X}}\)) that retrieves all features contributing to the classification together:

$$\begin{aligned} \text {CAM-All} = (\sum _{c \in \lbrace 1,2 \rbrace }\frac{1}{N_{D}S} \sum _{n=1}^{N_{D}}\sum _{s=1}^{S} |w_{s,c}| \textbf{x}_{s}^{n})_{scaled} \end{aligned}$$
(1)

Here, \(w_{s,c}\) is the weight that connects the output of channel s of the convolutional layer to the class c input of the softmax layer in the FCN. A GMM with a Dirichlet process prior [2] (referred to as DPGMM) is then fit to data points sampled from the CAM-All via rejection sampling [5]. This allows the CAM-All to be divided into clusters in the temporal direction.
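A minimal NumPy sketch of (1) and of the rejection sampling step is given here. The array shapes, the function names, the uniform proposal over the time axis, and the synthetic features are assumptions for illustration; the paper does not specify these details.

```python
import numpy as np

def cam_all(features, weights):
    """Eq. (1). features: (N_D, T_X, S), non-negative; weights: (S, 2)."""
    n_d, _, n_channels = features.shape
    abs_w = np.abs(weights).sum(axis=1)           # sum over classes of |w_{s,c}|
    raw = np.einsum("nts,s->t", features, abs_w) / (n_d * n_channels)
    span = raw.max() - raw.min()
    if span == 0:                                  # assumption: flat map scales to zeros
        return np.zeros_like(raw)
    return (raw - raw.min()) / span

def rejection_sample_time_steps(cam, n_samples, rng):
    """Draw time steps with density proportional to cam (values in [0, 1])."""
    if cam.max() <= 0:
        raise ValueError("CAM-All has no positive mass to sample from")
    samples = []
    while len(samples) < n_samples:
        t = rng.uniform(0.0, len(cam))             # proposal: uniform over the time axis
        idx = min(int(t), len(cam) - 1)
        if rng.uniform(0.0, 1.0) < cam[idx]:       # accept with probability cam[idx]
            samples.append(t)
    return np.array(samples)

# Synthetic features with two active regions around t = 10 and t = 35.
rng = np.random.default_rng(0)
steps = np.arange(50)
bumps = (np.exp(-0.5 * ((steps - 10) / 2.0) ** 2)
         + np.exp(-0.5 * ((steps - 35) / 3.0) ** 2))
features = np.abs(rng.normal(1.0, 0.1, size=(4, 1, 3))) * bumps[None, :, None]
weights = rng.normal(size=(3, 2))

cam = cam_all(features, weights)
samples = rejection_sample_time_steps(cam, 200, rng)
```

The sampled time steps could then be passed to a Dirichlet process mixture model. One option is scikit-learn's `BayesianGaussianMixture` with `weight_concentration_prior_type="dirichlet_process"`; whether the authors used that implementation is not stated.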

Let the minimum, maximum, and mean time steps of each cluster \(\mathbb {k} \in \lbrace \mathbb {1}, \ldots , \mathbb {K} \rbrace \) be denoted as \(t_{\textbf{X}}^{min_{\mathbb {k}}}, t_{\textbf{X}}^{max_{\mathbb {k}}}\), and \(t_{\textbf{X}}^{mean_{\mathbb {k}}}\) respectively. When \(\mathbb {k}\) is in ascending order, \(\mathbb {k}\) and \(\mathbb {k+1}\) are merged into a single cluster if:

$$\begin{aligned} t_{\textbf{X}}^{mean_{\mathbb {k+1}}} - t_{\textbf{X}}^{mean_{\mathbb {k}}}\le R \end{aligned}$$
(2)

Because \(T_{\textbf{X}} = T_{\textbf{y}}\), clusters in the feature space can be considered as clusters in the input space. Thus, under (2), the representative time step of two clusters in the input space becomes one feature following convolution. Therefore, these two clusters should not be treated independently, as they have a correlation. When we redefine the cluster as \(\mathbb {k} \in \lbrace \mathbb {1}, \ldots , \mathbb {K} \rbrace \), and the time steps as \(t_{\textbf{X}}^{min_{\mathbb {k}}}\) and \(t_{\textbf{X}}^{max_{\mathbb {k}}}\) after the merge process, the range of time steps for patch \(\mathbb {k}\) is:

$$\begin{aligned} \begin{gathered} \mathbb {T}_{\textbf{y}}^{\mathbb {k}} = \lbrace t_{\textbf{y}}^{min_{\mathbb {k}}}, \ldots , t_{\textbf{y}}^{max_{\mathbb {k}}} \rbrace ,\;\;\text {where} \\ t_{\textbf{y}}^{min_{\mathbb {k}}} = t_{\textbf{X}}^{min_{\mathbb {k}}}-\frac{1}{2}R, \;\; t_{\textbf{y}}^{max_{\mathbb {k}}} = t_{\textbf{X}}^{max_{\mathbb {k}}}+\frac{1}{2}R \end{gathered} \end{aligned}$$
(3)

Equation (3) calculates the minimum and maximum time steps of the input data that affect \(\lbrace t_{\textbf{X}}^{min_{\mathbb {k}}}, \ldots , t_{\textbf{X}}^{max_{\mathbb {k}}} \rbrace \). Then, the contribution of the patch \(\mathbb {k}\) to the classification is computed via:

$$\begin{aligned} \text {Contrib}_{\mathbb {k}} = \sum _{t \in \mathbb {T}_{\textbf{y}}^{\mathbb {k}}} \text {CAM-All}_{t} \end{aligned}$$
(4)

where \(\text {CAM-All}_{t}\) is the value of the CAM-All at time step t.
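The merge rule (2), the patch range (3), and the contribution (4) can be sketched as follows. Several details are assumptions, since the text leaves them open: consecutive clusters are compared using their original means, R/2 is rounded down to an integer, and patch ranges are clipped to the series length. The function names and toy clusters are hypothetical.

```python
def merge_clusters(clusters, receptive_field):
    """Eq. (2). clusters: list of (t_min, t_max, t_mean) in feature space."""
    merged = []
    prev_mean = None
    for t_min, t_max, t_mean in sorted(clusters, key=lambda c: c[2]):
        if merged and t_mean - prev_mean <= receptive_field:
            last_min, last_max = merged[-1]
            merged[-1] = (min(last_min, t_min), max(last_max, t_max))
        else:
            merged.append((t_min, t_max))
        prev_mean = t_mean                       # assumption: compare original means
    return merged

def patch_range(t_min, t_max, receptive_field, series_length):
    """Eq. (3), with integer half-width and clipping (both assumptions)."""
    half = receptive_field // 2
    lo = max(0, t_min - half)
    hi = min(series_length - 1, t_max + half)
    return range(lo, hi + 1)

def contribution(cam, steps):
    """Eq. (4): sum of CAM-All over the patch's time steps."""
    return sum(cam[t] for t in steps)

clusters = [(5, 12, 8.0), (14, 20, 17.0), (60, 70, 65.0)]
merged = merge_clusters(clusters, receptive_field=13)
patches = [patch_range(lo, hi, 13, series_length=100) for lo, hi in merged]
cam = [1.0] * 100                                # flat toy CAM-All
contribs = [contribution(cam, p) for p in patches]
```

In this toy case the first two clusters have means 9 steps apart, which is within R = 13, so they become one patch, while the third stays separate.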

Algorithm 1. Divide the time series data into patches.

Generate Counterfactual Changing (Algorithm 2). When representing latent variables and observational data as \(\mathcal {D} = \lbrace \left( \textbf{z}_{1}, \textbf{y}_{1} \right) , \left( \textbf{z}_{2}, \textbf{y}_{2}\right) , \ldots \rbrace \), the GPLVM’s expected value of the predictive distribution for the unknown latent variable \(\textbf{z}^{*}\) is defined as:

$$\begin{aligned} \begin{gathered} \mathbb {E} [p(\textbf{y}^{*} \mid \textbf{z}^{*}, \mathcal {D}) ]= \textbf{k}_{*}^{T}\textbf{K}^{-1}\textbf{Y} \\ \textbf{k}_* = \left( k\left( \textbf{z}^*, \textbf{z}_1\right) , k\left( \textbf{z}^*, \textbf{z}_2\right) , \ldots \right) ^{T}, \textbf{Y} = \left( \textbf{y}_{1}, \textbf{y}_{2}, \ldots \right) ^{T} \end{gathered} \end{aligned}$$
(5)

Here, k represents the kernel function and \(\textbf{K}\) represents the covariance matrix. As GPLVM is commonly used for dimensionality reduction, it is possible to divide the latent space into clusters. In the case of three clusters, \(\mathbb {c1}\), \(\mathbb {c2}\) and \(\mathbb {c3}\), (5) can be written as:

$$\begin{aligned} \mathbb {E} [p(\textbf{y}^{*} \mid \textbf{z}^{*}, \mathcal {D}) ]= \left( \textbf{k}_{*, \mathbb {c1}}, \textbf{k}_{*, \mathbb {c2}}, \textbf{k}_{*, \mathbb {c3}} \right) \textbf{K}^{-1} \left( \textbf{Y}_{\mathbb {c1}}, \textbf{Y}_{\mathbb {c2}}, \textbf{Y}_{\mathbb {c3}}\right) ^T \end{aligned}$$
(6)

If we want to obtain a \(\textbf{y}^{*}\) that lies between \(\textbf{Y}_{\mathbb {c1}}\) and \(\textbf{Y}_{\mathbb {c2}}\), this is difficult to achieve due to the influence of \(\textbf{Y}_{\mathbb {c3}}\). The same argument applies when we want to obtain the ideal continuous change to the counterfactual. Therefore, it is necessary to select one cluster of each class in advance.
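The predictive mean in (5) and the per-point weights \(\textbf{k}_{*}^{T}\textbf{K}^{-1}\) behind (6) can be sketched with NumPy. The kernel hyperparameters, the small jitter added for numerical stability, and the toy latent points are assumptions; in the method itself the latent variables are learned rather than fixed.

```python
import numpy as np

def kernel(a, b, lin_var=1.0, rbf_var=1.0, lengthscale=1.0):
    """Linear + RBF kernel between rows of a (n, d) and b (m, d)."""
    sq = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=-1)
    return lin_var * a @ b.T + rbf_var * np.exp(-0.5 * sq / lengthscale ** 2)

def predictive_weights(z_star, z_train, jitter=1e-8):
    """k_*^T K^{-1}: how much each training point contributes to y*."""
    k_mat = kernel(z_train, z_train) + jitter * np.eye(len(z_train))
    return np.linalg.solve(k_mat, kernel(z_train, z_star)).T

def predictive_mean(z_star, z_train, y_train):
    """Eq. (5): E[p(y* | z*, D)] = k_*^T K^{-1} Y."""
    return predictive_weights(z_star, z_train) @ y_train

# Three toy clusters of two latent points each (c1, c2, c3).
z_train = np.array([[-2.0, 0.0], [-1.5, 0.5],
                    [2.0, 0.0], [1.5, -0.5],
                    [0.0, 3.0], [0.5, 3.5]])
y_train = np.vstack([np.full(4, v) for v in (-1.0, -0.8, 1.0, 0.8, 5.0, 5.2)])

z_star = np.array([[0.0, 0.0]])                  # a point between c1 and c2
w = predictive_weights(z_star, z_train)
y_star = predictive_mean(z_star, z_train, y_train)
c3_weight = np.abs(w[0, 4:]).sum()               # contribution of the third cluster
```

Even for a latent point placed between the first two clusters, the weights on the third cluster are generally nonzero, which is the influence of \(\textbf{Y}_{\mathbb {c3}}\) that the cluster selection step is meant to avoid.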

Data Selection. Prepare the query patch \(\textbf{y}_{\mathbb {T}_{\textbf{y}}^{\mathbb {k}}}^{q}\) and, for each of the classes c and \(c^{\prime }\), the \(N_{sim}\) patches most similar to it in terms of Euclidean distance. Then, train a Bayesian GPLVM [24] with these patches, and apply DPGMM to obtain the latent variable \(\textbf{z}_{q}\) of the query and the latent variable clusters \(\mathbb {z}_{c} \in \lbrace \mathbb {1}, \ldots , \mathbb {Z}_{c}\rbrace \) of class c, and likewise for class \(c^{\prime }\). Denoting the mean of \(\mathbb {z}_{c}\) as \(\textbf{z}_{\mathbb {z}_{c}}\) and its number of elements as \(|\mathbb {z}_{c}|\), score the clusters with:

$$\begin{aligned} \text {Score-} \mathbb {z}_{c} = \frac{1}{|\mathbb {z}_{c}|}\sum _{\textbf{z} \in \mathbb {z}_{c}}M_{c} ( \mathbb {E} [G_{B} ( \textbf{z} ) ]) + \alpha _{1} ( 1 - ( \Vert \textbf{z}_{q} - \textbf{z}_{\mathbb {z}_{c}}\Vert _{2}^{2} )_{\mathop {\mathrm {arg\,scaled}}\limits _{\textbf{z}_{\mathbb {z}_{c}}}}) + \alpha _{2} |\mathbb {z}_{c}| \end{aligned}$$
(7)

In (7), the Bayesian GPLVM is represented as \(G_{B}\), and \(G_{B}\left( \textbf{z} \right) \) denotes the predictive distribution for \(\textbf{z}\). The first term represents the average patch classification probability of cluster \(\mathbb {z}_{c}\). The second and third terms are constraints for proximity and plausibility, respectively, as the number of elements in a cluster reflects the data distribution. Finally, the cluster \(\hat{\mathbb {z}}_{c}\) is selected with \(\mathop {\mathrm {arg\,max}}\limits _{\mathbb {z}_{c}} \lbrace \text {Score-} \mathbb {1}_{c}, \ldots , \text {Score-} \mathbb {Z}_{c} \rbrace \). Similarly, we can find \(\hat{\mathbb {z}}_{c^{\prime }}\) for class \(c^{\prime }\).
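The cluster score in (7) can be sketched in a few lines. The inputs are assumed to be precomputed per cluster (mean classification probability, squared distance from the query latent to the cluster mean, and cluster size); the function names and toy numbers are hypothetical.

```python
def min_max_scale(values):
    lo, hi = min(values), max(values)
    return [0.0 if hi == lo else (v - lo) / (hi - lo) for v in values]

def score_clusters(mean_probs, sq_dists, sizes, alpha1=0.1, alpha2=0.1):
    """Eq. (7) for all clusters of one class.

    mean_probs: average M_c(E[G_B(z)]) over each cluster
    sq_dists:   ||z_q - cluster mean||^2 per cluster (scaled across clusters)
    sizes:      number of elements per cluster, used unscaled as written in (7)
    """
    scaled = min_max_scale(sq_dists)
    return [p + alpha1 * (1.0 - d) + alpha2 * n
            for p, d, n in zip(mean_probs, scaled, sizes)]

scores = score_clusters(mean_probs=[0.9, 0.6, 0.8],
                        sq_dists=[4.0, 1.0, 9.0],
                        sizes=[5, 20, 3])
best = max(range(len(scores)), key=scores.__getitem__)
```

With the size term used unscaled, a large cluster can outweigh the other two terms, as in this toy case where the second cluster is selected despite its lower mean probability. Whether the authors normalize the cluster size is not stated, so this should be read as one possible interpretation.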

We assume a small value of \(N_{sim}\). It is therefore necessary to increase the number of data points in the latent space prior to DPGMM clustering. Bayesian GPLVM is an appropriate choice because it allows sampling from the Gaussian distribution of the latent variable, while having equivalent properties to GPLVM.

Counterfactual Changing. Train GPLVM with the query patch, as well as patches of the clusters \(\hat{\mathbb {z}}_{c}\) and \(\hat{\mathbb {z}}_{c^{\prime }}\). This allows us to obtain the latent variables \(\textbf{z}_{q}\) of the query, \(\lbrace \textbf{z}^{n} \rbrace _{n=1}^{N}\) of the class c and \(\lbrace \textbf{z}^{n^{\prime }} \rbrace _{n^{\prime }=1}^{N^{\prime }}\) of the class \(c^{\prime }\), where \(N = |\hat{\mathbb {z}}_{c}|\) and \(N^{\prime } = |\hat{\mathbb {z}}_{c^{\prime }}|\). Then, select a \(\textbf{z}_{cf}\) to generate a counterfactual patch:

$$\begin{aligned} \begin{gathered} \textbf{z}_{cf} = \mathop {\mathrm {arg\,max}}\limits _{\textbf{z}^{n^{\prime }}} \lbrace \text {Score-} \textbf{z}^{1^{\prime }}, \ldots , \text {Score-} \textbf{z}^{N^{\prime }} \rbrace ,\;\;\text {where} \\ \text {Score-} \textbf{z}^{n^{\prime }} = M_{c^{\prime }} ( \mathbb {E}[G ( \textbf{z}^{n^{\prime }} ) ]) + \alpha _{3} ( 1 - (\Vert \textbf{z}_{q} - \textbf{z}^{n^{\prime }}\Vert _{2}^{2})_{\mathop {\mathrm {arg\,scaled}}\limits _{\textbf{z}^{n^{\prime }}}}) \end{gathered} \end{aligned}$$
(8)

In (8), the GPLVM is denoted as G, in the same way that the Bayesian GPLVM is denoted as \(G_{B}\) in (7). Using \(\textbf{z}_{q}\) and \(\textbf{z}_{cf}\), explore the latent space \(\mathcal {Z}\) to find \(\textbf{z}_{sf}\) to generate a semifactual patch:

$$\begin{aligned} \begin{aligned} \textbf{z}_{sf} = \mathop {\mathrm {arg\,max}}\limits _{\textbf{z} \in \mathcal {Z}}&( 1 - |0.5 - M_{c} ( \mathbb {E} [G ( \textbf{z} ) ])|_{\mathop {\mathrm {arg\,scaled}}\limits _{\textbf{z}}} ) \\&+ \alpha _{4} ( 1- ( \Vert \textbf{z} - \textbf{z}_{q}\Vert _{2}^{2} + \Vert \textbf{z} - \textbf{z}_{cf}\Vert _{2}^{2} )_{\mathop {\mathrm {arg\,scaled}}\limits _{\textbf{z}}}) \end{aligned} \end{aligned}$$
(9)

The semifactual is the instance at which the classification result changes. Equation (9) encourages \(\textbf{z}_{sf}\) to lie between \(\textbf{z}_{q}\) and \(\textbf{z}_{cf}\) while the classification probability of the patch is close to 0.5. After acquiring \((\textbf{z}_{q}, \textbf{z}_{sf}, \textbf{z}_{cf})\), we can obtain the sets of intermediate latent variables \(\mathbb {z}_{q \rightarrow sf}\) and \(\mathbb {z}_{sf \rightarrow cf}\) by linearly varying \(\beta \) in (10), where \(0 \le \beta \le 1\):

$$\begin{aligned} \textbf{z}_{q \rightarrow sf} = \beta \textbf{z}_{q} + ( 1-\beta ) \textbf{z}_{sf}, \;\; \textbf{z}_{sf \rightarrow cf} = \beta \textbf{z}_{sf} + ( 1-\beta ) \textbf{z}_{cf} \end{aligned}$$
(10)

Then, generate the continuously changing patches from the query to the semifactual and from the semifactual to the counterfactual with \(G(\mathbb {z}_{q \rightarrow sf})\) and \(G(\mathbb {z}_{sf \rightarrow cf})\). Combining a linear kernel with an RBF kernel in the GPLVM allows us to generate continuous changes that gradually increase the distance from the query.
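Equations (8) to (10) can be sketched in latent space as follows. The classifier is replaced by a toy probability function that stands in for \(M_{c}(\mathbb {E}[G(\textbf{z})])\); that function, the candidate points, and the grid are assumptions for illustration only. Note that in (10) the value \(\beta = 1\) gives the first endpoint, so \(\beta \) is varied from 1 to 0 to move away from the query.

```python
import math

def min_max_scale(values):
    lo, hi = min(values), max(values)
    return [0.0 if hi == lo else (v - lo) / (hi - lo) for v in values]

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def argmax(scores):
    return max(range(len(scores)), key=scores.__getitem__)

def select_counterfactual(z_q, candidates, prob_cf, alpha3=0.1):
    """Eq. (8): high target-class probability, small scaled distance to z_q."""
    dist = min_max_scale([sq_dist(z_q, z) for z in candidates])
    scores = [prob_cf(z) + alpha3 * (1.0 - d) for z, d in zip(candidates, dist)]
    return candidates[argmax(scores)]

def select_semifactual(z_q, z_cf, grid, prob_c, alpha4=0.1):
    """Eq. (9): probability near 0.5, small total distance to z_q and z_cf."""
    margin = min_max_scale([abs(0.5 - prob_c(z)) for z in grid])
    spread = min_max_scale([sq_dist(z, z_q) + sq_dist(z, z_cf) for z in grid])
    scores = [(1.0 - m) + alpha4 * (1.0 - s) for m, s in zip(margin, spread)]
    return grid[argmax(scores)]

def interpolate(z_start, z_end, steps):
    """Eq. (10): beta * z_start + (1 - beta) * z_end, beta going from 1 to 0."""
    path = []
    for i in range(steps):
        beta = 1.0 - i / (steps - 1)
        path.append(tuple(beta * s + (1.0 - beta) * e
                          for s, e in zip(z_start, z_end)))
    return path

# Toy stand-in for the classifier: class c on the left, c' on the right.
def prob_c(z):
    return 1.0 / (1.0 + math.exp(2.0 * z[0]))

def prob_cf(z):
    return 1.0 - prob_c(z)

z_q = (-2.0, 0.0)
candidates = [(0.5, 0.0), (2.0, 0.0), (4.0, 3.0)]   # latent points of class c'
grid = [(x / 2.0, y / 2.0) for x in range(-6, 7) for y in range(-6, 7)]

z_cf = select_counterfactual(z_q, candidates, prob_cf)
z_sf = select_semifactual(z_q, z_cf, grid, prob_c)
path = interpolate(z_q, z_sf, 50) + interpolate(z_sf, z_cf, 50)
```

Each point of `path` would then be decoded with the GPLVM to obtain the patch at that stage of the change; the decoding step is omitted here.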

Algorithm 2. Generate counterfactual changing.

The Whole Algorithm. Based on Algorithm 1, divide the query into patches and then generate counterfactual changes in the order of the patches with the highest contribution to the classification using Algorithm 2. By iterating this process until the classification result changes, the final explanation can be obtained. The end user is presented with the expected value and a 95% confidence interval. During the iterative process, overlapping patches may be used for explanation. In such cases, Algorithm 2 is applied to them as a single patch.
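The outer loop can be sketched as follows, with Algorithm 2 replaced by a hypothetical callback. The names `explain`, `change_patch`, and `predict`, the tuple layout of the patches, and the choice to re-apply the change to all used patches from the original query at every iteration are assumptions, not details given in the text.

```python
def explain(query, patches, change_patch, predict):
    """Change patches in order of contribution until the prediction flips.

    patches:      list of (start, end, contribution), inclusive time steps
    change_patch: stand-in for Algorithm 2, (series, start, end) -> new series
    predict:      series -> class label
    """
    original_class = predict(query)
    used = []                                    # patches changed so far
    for start, end, _ in sorted(patches, key=lambda p: p[2], reverse=True):
        # Treat overlapping patches as a single patch.
        merged = True
        while merged:
            merged = False
            for u in list(used):
                if u[0] <= end and start <= u[1]:
                    used.remove(u)
                    start, end = min(start, u[0]), max(end, u[1])
                    merged = True
        used.append((start, end))
        candidate = list(query)
        for s, e in used:
            candidate = change_patch(candidate, s, e)
        if predict(candidate) != original_class:
            return candidate, sorted(used)
    return None, sorted(used)                    # no flip with the given patches

# Toy setting: the "model" predicts class 1 once at least five steps are raised.
def toy_change(series, s, e):
    return series[:s] + [1.0] * (e - s + 1) + series[e + 1:]

def toy_predict(series):
    return int(sum(series) >= 5)

query = [0.0] * 10
patches = [(0, 2, 3.0), (2, 4, 2.0), (7, 9, 1.0)]
counterfactual, used_patches = explain(query, patches, toy_change, toy_predict)
```

In this toy run the first patch alone does not change the prediction; the second patch overlaps it, so both are changed together as one patch covering steps 0 to 4, which flips the class.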

4 Experiments

We verified the effectiveness of MIPCE with five time series datasets from the UCR Archive [6]: ECG200, Strawberry, GunPoint, ProximalPhalanxOutlineCorrect (Proximal), and Wafer. Although the Wafer dataset has a test size of 6164, we randomly selected 50 samples from each class in the interest of conserving computational time. In Experiment 1, we compared MIPCE with several existing methods. Experiment 2 was conducted to evaluate continuous changes, whereas Experiment 3 investigated whether users could understand the decision processes of DNNs from explanations.

FCN Settings. The model consists of three convolutional layers with ReLU activations, a global average pooling layer, and a softmax layer. Batch normalization was applied before input to the ReLU. The numbers of channels and the kernel sizes of the convolutional layers were set to \(\left( 128, 256, 128\right) \) and \(\left( 7, 5, 3\right) \), respectively, in order from the input layer. This configuration follows [26], in which high accuracy was achieved.
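For reference, the receptive field R used in (2) and (3) follows from these kernel sizes. The sketch below uses the standard receptive-field recurrence and assumes stride 1 (as stated in Section 3.1) and no dilation; the function name is hypothetical.

```python
def receptive_field(kernel_sizes, strides=None, dilations=None):
    """Receptive field of stacked 1-D convolutions."""
    strides = strides or [1] * len(kernel_sizes)
    dilations = dilations or [1] * len(kernel_sizes)
    field, jump = 1, 1
    for k, s, d in zip(kernel_sizes, strides, dilations):
        field += (k - 1) * d * jump              # growth contributed by this layer
        jump *= s                                # spacing between adjacent outputs
    return field

R = receptive_field([7, 5, 3])
```

Under these assumptions each feature in the last convolutional layer sees 13 input time steps.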

MIPCE Settings. For the GPLVM and Bayesian GPLVM, we set the latent variable dimensions to 2, and used the results of PCA as initial values. Models were trained with \({\text {Normal}}\left( 0, 1\right) \), \({\text {Gamma}}(3, 1)\) and \({\text {Gamma}}(1, 1)\) as the prior distributions of the latent variables, the parameters of the linear kernel, and the parameters of the RBF kernel, respectively. Training was conducted over 1000 iterations and optimized with L-BFGS-B [15]. \(\left( \alpha _{1}, \alpha _{2}, \alpha _{3}, \alpha _{4}\right) \) in the algorithm were all set to 0.1 and \(N_{sim}=15\). We explored the \(\mathcal {Z}\) in (9) via grid search, and changed \(\beta \) in (10) so that \(\mathbb {z}_{q \rightarrow sf}\) and \(\mathbb {z}_{sf \rightarrow cf}\) were 50 steps each.

4.1 Experiment 1: Counterfactuals

We compared MIPCE with W-CF and Native-Guide in qualitative and quantitative metrics, specifically in terms of proximity, plausibility, and substitutability.

Proximity evaluates the relative distance between the query (q) and counterfactual (CF) by \(\frac{d(q, \text {CF})}{d(q, \textrm{NUN})}\). We employed the L1 norm, L2 norm, and L\(\infty \) (L-Inf) norm as d.
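The relative proximity can be computed as in the following sketch. The function names and the toy series are assumptions; only the ratio itself comes from the definition above.

```python
def lp_distance(a, b, p):
    """L1 (p=1), L2 (p=2), or L-Inf (p='inf') distance between two series."""
    diffs = [abs(x - y) for x, y in zip(a, b)]
    if p == "inf":
        return max(diffs)
    return sum(d ** p for d in diffs) ** (1.0 / p)

def relative_proximity(query, counterfactual, nun, p):
    """d(q, CF) / d(q, NUN); values below 1 mean CF is closer than the NUN."""
    return lp_distance(query, counterfactual, p) / lp_distance(query, nun, p)

query = [0.0, 0.0, 0.0, 0.0]
counterfactual = [0.0, 1.0, 0.0, 0.0]            # sparse change at one time step
nun = [1.0, 2.0, 1.0, 2.0]                       # nearest unlike neighbor

prox_l1 = relative_proximity(query, counterfactual, nun, 1)
prox_l2 = relative_proximity(query, counterfactual, nun, 2)
prox_linf = relative_proximity(query, counterfactual, nun, "inf")
```

A sparse counterfactual like this one scores well under the L1 and L2 norms, while the L-Inf ratio reflects only the single largest change.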

Plausibility evaluates whether the counterfactual is OOD with OCSVM [22] and Isolation Forest (IForest) [16]. In addition, we used interpretable metrics called IM1 and IM2, which use an autoencoder [13]. OCSVM and IForest detect OOD based on distance, whereas IM1 and IM2 do so based on features.

Substitutability evaluates whether sufficient classification accuracy is achieved when using counterfactuals as training data [13]. Prepare a k-nearest neighbor classifier \(k\mathrm {-NN_{orig}}\) trained on the original data, and \(k\mathrm {-NN_{CF}}\) trained on the counterfactuals. Then calculate the accuracy in classifying the test dataset and obtain the following ratio: \(\mathrm {R\%-Sub} \equiv \frac{\;k\mathrm {-NN_{CF}\; Acc.}}{\;k\mathrm {-NN_{orig}\; Acc.}} \times 100\).
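The ratio can be illustrated with a small pure-Python k-NN. The value k = 1, the Euclidean distance, the labeling of each counterfactual with its target class, and the toy data are assumptions; the paper does not state these choices in this section.

```python
from collections import Counter

def knn_predict(train_x, train_y, x, k=1):
    """Majority vote among the k nearest training points (Euclidean)."""
    order = sorted(range(len(train_x)),
                   key=lambda i: sum((a - b) ** 2 for a, b in zip(train_x[i], x)))
    votes = [train_y[i] for i in order[:k]]
    return Counter(votes).most_common(1)[0][0]

def accuracy(train_x, train_y, test_x, test_y, k=1):
    hits = sum(knn_predict(train_x, train_y, x, k) == y
               for x, y in zip(test_x, test_y))
    return hits / len(test_x)

def r_sub(orig, cf, test, k=1):
    """R%-Sub = accuracy of k-NN_CF / accuracy of k-NN_orig * 100."""
    return accuracy(*cf, *test, k=k) / accuracy(*orig, *test, k=k) * 100.0

orig = ([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]], [0, 0, 1, 1])
cf = ([[3.0, 3.0], [1.0, 1.0]], [1, 0])          # counterfactuals with target labels
test = ([[0.0, 0.5], [5.0, 5.5], [1.0, 0.0], [6.0, 6.0], [2.2, 2.4]],
        [0, 1, 0, 1, 0])

ratio = r_sub(orig, cf, test)
```

Here the classifier trained on the original data labels all five test points correctly, while the one trained on the two counterfactuals misses one, so the ratio is 80.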

Results. Figure 2 shows the counterfactuals generated by each method, along with corresponding queries, which belong to the same class. We observe that in the case of query2, MIPCE generated sparse explanations. In addition, if we examine query1 and query2 together, we can clearly interpret the important subsequences.

Fig. 2. Counterfactuals of the ECG200. MIPCE shows expected values as a solid line and 95% confidence intervals as fill. The color is the same as Fig. 1 right.

Table 1. Evaluation results of counterfactuals.

From a quantitative perspective, W-CF obtained the best results in terms of proximity (see Table 1). However, as seen in Fig. 2, good proximity does not necessarily correlate with high human interpretability. In addition, W-CF exhibited poor results in terms of plausibility, as it generated counterfactuals that do not exist in the real world. Conversely, our method obtained better plausibility and substitutability scores. This suggests that MIPCE captures subsequences that are critical for classification, and generates counterfactuals that follow the data distribution.

4.2 Experiment 2: Change to the Counterfactual

In the process of continuous change, we evaluated the proximity and plausibility of the instances at \(r\% \; (r \in \lbrace 0, 25, 50, 75, 100 \rbrace )\) of the change from the query to the counterfactual. We used the same metrics as in Experiment 1. From a proximity perspective, it is desirable for the distance between the query and instance to increase with the changing rate of the counterfactual. From a plausibility perspective, it is desirable for instances with a change rate of approximately 50% to be OOD. These evaluations were inspired by [13].

Fig. 3. Counterfactual changing of each dataset. The solid line represents the corresponding percentage, and the dashed line shows instances for other percentages. The color is the same as in Fig. 1 right.

Fig. 4. (a) Counterfactual change evaluation of the mean and standard deviation of the five datasets. (b) Results of user test.

Results. The distance from the query was observed to increase with the rate of change to the counterfactual (see Fig. 3 and Fig. 4a). The process of continuous change therefore behaves as desired in terms of proximity. In terms of plausibility, OCSVM and IForest exhibited smaller changes in their evaluation values compared to IM1 and IM2. This indicates that distance-based metrics cannot detect intermediate counterfactual instances that would not follow the data distribution. Conversely, the autoencoder-based metrics judge instances close to the \(50\%\) ratio to be OOD.

4.3 Experiment 3: User Test

We presented users with explanations generated by one of W-CF, Native-Guide, and MIPCE to assess their understanding of the DNN decision process. Effectiveness was evaluated by measuring the ability of users to correctly predict the DNN’s classification result for an unknown query. Our participants, all college students with prior knowledge of machine learning, were divided among 3 groups of approximately 6 students each. Each group was presented with 8 examples of explanations, and subsequently tested with 4 unknown queries. The results indicate which explanation method is the most conducive to the user’s understanding of the DNN. This experiment was inspired by [1].

Results and Discussion. As can be seen from the average accuracy (see Fig. 4b), MIPCE demonstrated superior performance on many datasets, indicating its effectiveness in enhancing users’ understanding of the model. However, W-CF outperformed MIPCE on the GunPoint and Wafer datasets. Both datasets are easily recognizable to humans, and it is likely that users inferred the classification criteria from multiple queries. This suggests that for easily recognizable time series data, the informative explanations provided by MIPCE may hinder user understanding. MIPCE results were also worse on the Proximal dataset, as the generated counterfactuals altered most of the query (see Fig. 3), making it difficult for users to understand the important subsequences. It is expected that this can be resolved by showing the patch division process along with the counterfactuals, or by revising the cluster merging algorithm.

5 Conclusion

For counterfactual explanations in time series classification, we propose MIPCE, which takes subsequences from an FCN and presents the counterfactual changes of the patches that contribute to classification. Quantitative evaluation results indicate that MIPCE generates more plausible counterfactuals consistent with the data distribution compared to conventional methods. In addition, our approach is able to retrieve features that contribute to classification, indicating the potential of using them for data augmentation. Furthermore, user testing has shown the effectiveness of our method.

In the future, we will improve our method to present more effective explanations based on user feedback. One idea for improvement is to show the patch division, as well as the contribution of each patch to classification, along with the current explanation. Another direction is data augmentation. In the continuous changes of MIPCE, it is possible to obtain the classification probability and confidence level of the generated instance, which serve as indicators of how well the instance follows the data distribution. This could be used for data augmentation, and we will explore the possibility of applying our method therein.