1 Introduction

As an essential and rapidly proliferating branch of Machine Learning, Deep Learning can achieve high-performance, multi-purpose machine vision applications by skillfully designing convolutional neural network (CNN) models and training them with adequate image samples. Nowadays, CNN applications are gaining popularity for quality control of products in automated manufacturing environments. In particular, image classification tasks constitute the most prevalent CNN-based error-proofing machine vision applications [1, 2]. However, as a subset of Machine Learning, CNN shares its shortcomings: the decision mechanism is difficult to understand and is not as intuitive as pattern-based machine vision solutions with hand-crafted rules. End-users therefore tend to be more skeptical of CNN-based applications than of pattern-based machine vision solutions [3].

With the superior performance of Machine Learning and its increasing number of applications in many fields, the need to improve its interpretability has become mandatory. Explainable Artificial Intelligence (XAI) is the concept that Machine Learning models should be interpretable, trustworthy, and efficiently manageable [4, 5]. Since its emergence, XAI has received wide attention from governments, academia, and industry. Through the joint efforts of professional organizations, XAI-related ISO standards [6] have gradually been established and published.

To provide end-users with a better understanding and trust of CNN-based image classification tasks, the rational implementation of visual explanations of the inference mechanism is a feasible and reliable solution. When end-users can understand a black-box model’s decision process and assess or verify its output via visualization, they become receptive to model deployment [3, 7,8,9]. Proper visual explanations also localize fine-grained features of the image [10], estimate the influences behind wrong predictions, reveal deficiencies in the input data and training process, and offer guidance for continuous updates and efficient improvement of the models [11]. In a real-world medical application, IGOS++ [12] discovered a classification model overfitting to text on X-ray images instead of the indicative symptoms of pneumonia. For CNN applications in industry, visual explanations help avoid quality control incidents in mass manufacturing caused by the lack of model transparency and interpretability [7].

Another essential property of interpretability for machine vision in industrial quality control applications is robustness assessment. High robustness means that the model can maintain its regular performance under perturbation or worse conditions [6, 13], i.e., minor changes in the input should not cause significant variance in the model output. Beyond reflecting and understanding the interpretability of model inferences, assessing and verifying the robustness of model applications has also become a well-defined workflow required by ISO standards. Typical industrial manufacturing systems are deployed in physically enclosed environments and secure local networks. Although malicious intrusion through adversarial attacks is almost impossible, machine vision applications are often subject to unexpected disturbances from external working conditions. Typical disturbances are occlusions in the ROI (region of interest), greyscale brightening (or darkening), blurring, and other feature contaminations caused by environmental degradation, the workpiece, the manufacturing process, and equipment. Robust machine vision applications should rigorously ensure that the inference results do not lead to false positives when affected by small amounts of typical disturbances, and that the quantity of false negatives remains within an acceptable range.

Early visual explanation research focused on analyzing the model structure and visualizing the features processed in various network layers [14,15,16,17]. These methods required the models to be white-box, and some visualizations were not easily understandable to end-users but merely interpretable for researchers. With the promotion of CNN technology and the support of the XAI program, plenty of visual explanations [18,19,20,21] applicable to black-box models have been proposed and proven effective. Still, they do not incorporate the function of robustness assessment and validation, and some minor insufficiencies remain to be improved for practical applications.

In view of these academic advances and industrial needs, this paper optimizes RISE [20] for better applicability to CNN-based image classification tasks by proposing the concept of the Decisive Saliency Map (DSM) and a corresponding quantitative metric. Our method derives appropriate threshold values and weights based on the characteristics of the saliency value distribution, then performs binarization and weighted sum-up operations on the feature regions with the highest importance to obtain the DSM and its coverage rate. The main contributions of this paper are as follows:

  • We carried out a data analysis of the distribution of saliency values to propose the concept of DSM. The characteristics of the distribution are merged into the visual explanation to highlight the essential salient regions that determine model inference. Several comparisons display the differences in feature graininess focused on by CNN models of different depths when DSM is applied.

  • Our optimization method continues the concise and easy-to-implement ideas of RISE, making the visual explanation more intuitive and less dispersive. Our method displays more fine-grained and decisive salient regions for image classification applications via visualization.

  • The coverage rate of DSM provides a quantitative robustness assessment and an extra reference indicator of the trustworthiness of model predictions, complementing the Softmax confidence score. The rate detects samples whose confidence scores are high but that nonetheless imply a high risk of misclassification. Furthermore, the proposed robustness assessment can be adopted for black-box models in industrial machine vision at a negligible computational cost, responding to the incoming ISO standard requirements for the AI industry.

2 Related work

Due to the differences in research perspectives and purposes, dozens of visual explanations with distinct techniques and visual effects have been developed for CNN models in recent years [22]. Commonly used Deep Learning visual explanations generally belong to the Local Interpretability class of Post-Hoc Explainability strategies. They are further classified as model-specific and model-agnostic. Model-specific methods are designed for specified models of which the designer has at least a certain level of knowledge. Model-agnostic methods are intended for any unknown model or algorithm; in most model-agnostic cases, only the models’ inputs (samples) and outputs (predictions) are visible and accessible.

2.1 Model-specific methods

Model-specific methods account for most of the interpretability techniques contributed by the AI community and generally provide reliable and fundamental explanations. Model-specific visual explanations for CNN are based on backpropagation and class activation mapping (CAM), including Guided Backprop [14], CAM [23], GRAD-CAM [24], Score-CAM [25], etc. Although these methods have excellent and understandable visualizations, they require accessing or modifying parts of the model’s network layers to perform specific operations, including global averaging and weighted summation of gradients, class activation maps, weights of convolutional feature maps, or forward-passing scores on object classes [26, 27].

Therefore, model-specific approaches do not apply to complex CNN models deployed after encapsulation, e.g., mainstream industrial machine vision products.

2.2 Model-agnostic methods

Typical model-agnostic methods are based on local approximate interpretability, or sensitivity analysis of how the output is influenced by perturbed input [11, 28].

Local approximate interpretability constructs simplified models to provide locally linear explanations, such as Local Interpretable Model-agnostic Explanations (LIME) [18] and its variant Anchor [21].

Though both are based on superpixel segmentation, Anchor improves on LIME by anchoring and superimposing if–then rules. However, the prerequisite for establishing an Anchor interpretation of image classification is obtaining a correct superpixel segmentation, which may lead to considerable variances in visualization depending on the particular segmentation algorithm and hyper-parameters. Anchor also requires that the image sample have adequate discriminative feature areas to build a reasonable explanation.

Sensitivity analysis methods generate saliency maps of the input image by analyzing changes in the prediction influenced by input perturbation. Typical saliency maps used for CNN models [27] are heat maps overlaid on the original input images, reflecting the different degrees of influence of the feature regions through heat colours. Representative methods are [19] and [20]. They visualize the image feature regions that significantly influence the results and include quantitative metrics of interpretability accuracy: deletion and insertion. RISE [20] already possesses a concise design, excellent performance, and high applicability. These properties are the preliminary basis for feasible applications in industrial environments. Recently, based on the overall experimental results of several benchmark datasets and CNN models, RISE was ranked among the best methods as evaluated by five recognized visualization metrics [29].

However, there is still room for improvement in reflecting visualization attributes, particularly feature graininess and importance ranking [30, 31]. Meaningful perturbation [19] requires additional meta-parameters but provides less sharp visualizations due to Gaussian blur masking; it is difficult to identify feature importance since it only contains coarse feature patterns. RISE may confuse judgments due to minor salient regions and noise generated by the randomness of the masking process. By balancing the advantages of the deletion and preservation processes, the recent study IGOS++ [12] uses bilateral perturbations to generate fine-grained saliency maps at additional cost; meanwhile, its scattered salient regions may interfere with subjective cognition.

Although the available visual explanations help understand and highlight the feature regions, not many are applicable to black-box models while producing explicit effects. To our knowledge, none of them assess the robustness of the model inference process in conjunction with the visualizations.

3 Decisive saliency map (DSM)

Motivated by RISE, after measuring the differences in predictions using thousands of perturbed input images and obtaining the saliency maps for each label class, we further utilize the saliency value distribution information to improve the visual explanation. The DSM is calculated to highlight the essential feature regions. It suppresses the dispersion [29], noise, and minor feature distractions arising from the random masking process. The coverage rate of DSM serves as a quantitative metric and comparison criterion.

The overall flowchart is shown in Fig. 1.

Fig. 1 Overall flowchart of the proposed method

3.1 Acquisition of decisive saliency maps

In CNN models, for a 3-channel input image \(I\in {R}^{H\times W\times 3}\) with height \(H\) and width \(W\), \(f(I)\) is the confidence score (probability) of the inference on the input, processed by the Softmax function. As defined in (1) [20], \(f(I\odot M)\) is the confidence score of a perturbed image obtained after element-wise multiplication of the original image with the random binary mask \(M\in {\{0,1\}}^{H\times W}\), where \(p\) is the probability of unmasking. It is empirically set to 0.5 from the range [0, 1], indicating that half of the image patches are occluded. \({\mathbb{E}}[M]\) is the expectation value over all possible masking operations for which the image pixel \(\lambda \) is preserved, i.e., \(M\left(\lambda \right)=1\). \(\mathrm{MC}\) denotes a total of \(N\) Monte Carlo sampled masking operations and model inferences. The importance of each pixel is approximately computed as the weighted average of the masks and the corresponding confidence scores, generating the saliency map \({S}_{I,f}\left(\lambda \right)\) of the inference as follows:

$${S}_{I,f}\left(\lambda \right)\stackrel{\mathrm{MC}}{\approx }\frac{1}{{\mathbb{E}}[M]\cdot N}{\sum }_{i=1}^{N} f\left(I\odot {M}_{i}\right)\cdot {M}_{i}\left(\lambda \right)$$
(1)

For a CNN model with \(K\) classes, the stacked 1-channel saliency maps obtained from (1) are \({S}_{I,f}\left(\lambda \right)\in {R}^{H\times W\times 1\times K}\), represented as \({S}^{K}\). For a specific class \(k\), the values of the saliency map \({S}^{k}\in {R}^{H\times W\times 1}\) at each pixel are scalars \({s}_{ij}\in (0,1)\), indexed by \(i\) and \(j\) in height and width, respectively.
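As a rough illustration of the Monte Carlo estimation in (1), the sketch below computes the stacked per-class saliency maps with NumPy, assuming a black-box `model` callable that returns Softmax scores for a batch of images. The helper name, batch handling, and the nearest-neighbour upsampling of the coarse grids are our assumptions (the original RISE uses bilinear upsampling), not the authors’ implementation.

```python
import numpy as np

def rise_saliency(model, image, num_masks=4000, grid=7, p=0.5, batch=32):
    """Approximate Eq. (1): weight random masks by the Softmax scores they produce."""
    H, W, _ = image.shape
    rng = np.random.default_rng(0)
    cell_h, cell_w = int(np.ceil(H / grid)), int(np.ceil(W / grid))
    saliency = None
    for start in range(0, num_masks, batch):
        n = min(batch, num_masks - start)
        # Coarse binary grids kept with probability p, upsampled with a random shift.
        coarse = (rng.random((n, grid + 1, grid + 1)) < p).astype(np.float32)
        masks = np.empty((n, H, W), dtype=np.float32)
        for i in range(n):
            up = np.kron(coarse[i], np.ones((cell_h, cell_w), dtype=np.float32))
            dy, dx = rng.integers(0, cell_h), rng.integers(0, cell_w)
            masks[i] = up[dy:dy + H, dx:dx + W]
        scores = np.asarray(model(image[None] * masks[..., None]))   # (n, K) Softmax scores
        contrib = np.einsum('nk,nhw->hwk', scores, masks)            # weight each mask by its score
        saliency = contrib if saliency is None else saliency + contrib
    return saliency / (p * num_masks)   # divide by E[M] * N; shape (H, W, K), i.e. S^K
```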

Let the maximum, minimum, and mean values of \({s}_{ij}\) in \({S}^{k}\) be \(\mathrm{max}({s}_{ij})\), \(\mathrm{min}({s}_{ij})\), and \(\mathrm{mean}\left({s}_{ij}\right)\), respectively. According to (1), it can be derived intuitively that \({s}_{ij}\) has the following properties:

  1. The value of \({s}_{ij}\) is positively correlated with the classification confidence score \({f}_{k}(I)\) of the original image and with the number of high-importance pixels preserved by the random masking operations \({M}_{i}(\lambda )\). The confidence score \({f}_{k}(I\odot {M}_{i})\) of the masked image is smaller than the original confidence score \({f}_{k}(I)\) in most cases.

  2. Since the masked input and the model producing the confidence score \({f}_{k}\left(I\odot {M}_{i}\right)\) are identical within the saliency map of the same class, the absolute difference between \(\mathrm{max}({s}_{ij})\) and \(\mathrm{min}({s}_{ij})\) mainly comes from the total number of salient pixels preserved by the masks \({M}_{i}(\lambda )\) that can reproduce most of the class feature information.

  3. The larger the values of \(\mathrm{max}({s}_{ij})\), \(\mathrm{min}({s}_{ij})\), and \(\mathrm{mean}\left({s}_{ij}\right)\), and the closer they are to \({f}_{k}\left(I\right)\), the higher the number of pixels in the high-saliency regions. This means the inference can hardly be perturbed into misclassification by the random masks. Accordingly, if \({f}_{k}(I\odot {M}_{i})\) is generally high, the robustness of the inference process is relatively good in industrial applications. Vice versa, a low \({f}_{k}(I\odot {M}_{i})\) means the model is more susceptible to random masks, with poor robustness: masking a small part of the salient region leads to a significant decrease in \(\mathrm{max}({s}_{ij})\) and \(\mathrm{min}({s}_{ij})\).

The \({s}_{ij}\) values of the pixels in the saliency map are converted into a histogram for subsequent analysis. Most input images with correct inference and excellent confidence scores \(f(I)\) have \(\mathrm{max}({s}_{ij})\) values close to \(f(I)\), and the Skewness and Kurtosis of the saliency map distribution histograms are relatively small.

However, the analysis also reveals some abnormal input samples. Their model inferences and the corresponding saliency maps produced by RISE are both correct, as in Fig. 2a–b. Though the confidence scores are not necessarily low, their \(\mathrm{max}({s}_{ij})\) is much smaller than \(f(I)\). Figure 2d–f shows the saliency value distribution fitted by functions. The histogram density distribution, with much larger Skewness and Kurtosis values, is almost impossible to fit with a standard normal distribution; it can only be well fitted using the Johnson unbounded distribution [32]. Such a histogram distribution tends to have a significant long-tail effect, and the mean saliency value is very small or even approximately zero.
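For concreteness, the distribution statistics described above (maximum, minimum, mean, Skewness, Kurtosis, and a Johnson unbounded fit) can be gathered with SciPy as sketched below; `describe_saliency` is a hypothetical helper, and `s_k` is assumed to be one class saliency map from Eq. (1).

```python
import numpy as np
from scipy import stats

def describe_saliency(s_k):
    """Summarize one H x W saliency map: extrema, moments, and a Johnson SU fit."""
    vals = np.asarray(s_k, dtype=np.float64).ravel()
    skew, kurt = stats.skew(vals), stats.kurtosis(vals)   # shape of the histogram
    a, b, loc, scale = stats.johnsonsu.fit(vals)           # Johnson unbounded (SU) fit
    return {"max": vals.max(), "min": vals.min(), "mean": vals.mean(),
            "skewness": skew, "kurtosis": kurt,
            "johnson_su_params": (a, b, loc, scale)}
```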

Fig. 2 Visualizations of saliency maps of a typical example of poor robustness for model inference. The input sample a is from Stanford Car-196 [33] test set, and the CNN model is a fine-tuned EfficientNet-B3 [34] with transfer learning. The saliency map provided by RISE in b does not reflect the saliency value distribution characteristics. However, the saliency value distribution histogram has an apparent long-tail effect in d–f. Once minor disturbances in g perturb the input, the classification result is incorrect, or the score is unexpectedly low. Our DSM method can reflect the risk of insufficient robustness of model inference by visualizing the decisive salient region in c and i

Furthermore, the salient regions in such abnormal samples are too small for the inference to be robust. Once a slight perturbation degrades the input image, for instance brightening or darkening, or physical occlusion or rotation of the object [6], as in Fig. 2g–h, the prediction could be misclassified, or the confidence score could be significantly reduced. Even though the inference processes of such samples are risky in real-world applications, these risks are difficult for end-users, e.g., industrial quality control personnel, to notice, understand, and accept.

According to the analysis above, if the statistical characteristics of the \({s}_{ij}\) data in the saliency map \({S}^{k}\) can be utilized and reflected in the visualization of the saliency map, direct observation is facilitated and extra computational data records during application are reduced. Moreover, it is in principle more feasible to assess the robustness of model inference with input perturbation methods than with backpropagation approaches. Therefore, motivated by RISE, we propose an optimized method for the saliency map in which the data characteristics of the saliency maps are merged into the visual explanation through an algorithmic transformation. We define the new saliency map as the Decisive Saliency Map, indicating that the feature area covered by the transformed salient region has a dominant influence and decisive effect on the image classification prediction. Weighting the decisive salient regions into the heat map correctly correlates with the data characteristics of the saliency distribution histogram, improving the visualization of feature importance and providing a reliable robustness assessment.

The process of computing the decisive saliency maps \({S}_{\mathrm{DSM}}^{K}\) is as follows:

Step 1 Obtain the stacked saliency maps \({S}^{K}\in {R}^{H\times W\times 1\times K}\) of the CNN model for image classification with \(K\) classes using (1).

Step 2 Select the two-dimensional saliency map \({S}^{k}\in {R}^{H\times W\times 1}\) for the \({k}^{th}\) class as needed, search for its \(\mathrm{max}\left({s}_{ij}\right)\) and \(\mathrm{min}({s}_{ij})\), and calculate the mean value \(\mathrm{mean}\left({s}_{ij}\right)\). The decisive saliency differential value \({\delta }_{s}\), which reflects the severity of the long-tail effect of the histogram distribution, is derived from the data characteristics of the saliency map according to (2):

$${\delta }_{s}={C}_{d}(\mathrm{max}\left({s}_{ij}\right)-\mathrm{min}\left({s}_{ij}\right))\frac{\mathrm{mean}\left({s}_{ij}\right)}{\mathrm{min}\left({s}_{ij}\right)}$$
(2)

where \({C}_{d}\) is the coefficient of dominance. It is used to appropriately distinguish the importance of the original image features and effectively suppress noise in the subsequent heat map without completely ignoring the subordinate features.

The range of \({C}_{d}\) values that is practically meaningful for the subsequent binarization operation is \({C}_{d}\ge 0\). The research in [35] suggests a normalized weight threshold to select a highlighted region for occlusion to improve robustness during training. In our design concept, \({C}_{d}\) should not only separate the decisive salient regions from the rest of the image but also properly suppress subordinate features.

A value of \({C}_{d}\) in the range \([0.1,0.5]\) is proposed initially, regarding the simplicity of the visualization design and interpretation function [36]. An improper value of \({C}_{d}\) would make it difficult to distinguish differences in the robustness of the inference, or would excessively suppress subordinate visualized features. When \({C}_{d}\) is set to 0.2, the following equations can equivalently emphasize the decisive salient regions with explicit boundaries while straightforwardly skipping the normalization of saliency values.

Step 3 Calculate the saliency threshold \({s}_{dsm}\) for the subsequent binarization operation on \({s}_{ij}\). Compared with a linear or fixed-coefficient operation on \(\mathrm{max}\left({s}_{ij}\right)\) as the binarization threshold, the exponential form \({e}^{-{\delta }_{s}}\) can accurately distinguish the influence level of pixels in the saliency map and the distribution characteristics of the saliency value histogram:

$${s}_{dsm}=\mathrm{max}({s}_{ij})\cdot {e}^{-{C}_{d}\left(\mathrm{max}\left({s}_{ij}\right)-\mathrm{min}\left({s}_{ij}\right)\right)\cdot \frac{\mathrm{mean}\left({s}_{ij}\right)}{\mathrm{min}\left({s}_{ij}\right)}}$$
(3)

Step 4 Binarize all \({s}_{ij}\) in \({S}^{k}\) with \({s}_{dsm}\) as the threshold to obtain a new saliency map \({\widetilde{S}}^{k}\in {R}^{H\times W\times 1}\) consisting of \({\widetilde{s}}_{ij}\):

$${\widetilde{s}}_{ij}=\left\{\begin{array}{ll}1, & \mathrm{if}\ {s}_{ij}>{s}_{dsm}\\ 0, & \mathrm{if}\ {s}_{ij}\le {s}_{dsm}\end{array}\right.$$
(4)

Step 5 Sum \({S}^{k}\) and the weighted \({\widetilde{S}}^{k}\), where the weight is the \({\delta }_{s}\) corresponding to the \({k}^{th}\) class. The selected regions are emphasized through the contribution of \({C}_{d}\). The converged visualization satisfies the focal point principle related to human attention [36]. This process is equivalent to superimposing a small portion of the saliency value of the essential features on the original heat map. The optimized visualization also satisfies the closure principle, i.e., patterns should be clustered with definite borders when the visual explanation contains complex feature elements. Thus, the Decisive Saliency Map \({S}_{dsm}^{k}\) for a single class and \({S}_{DSM}^{K}\in {R}^{H\times W\times 1\times K}\) for all classes are obtained as follows:

$${S}_{dsm}^{k}={S}^{k}+{\delta }_{s}\cdot {\widetilde{S}}^{k}$$
(5)
$${S}_{DSM}^{K}=\{{S}_{dsm}^{1},{S}_{dsm}^{2},{S}_{dsm}^{3},\dots ,{S}_{dsm}^{K}\}$$
(6)
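The following is a compact sketch of Steps 2–5 (Eqs. (2)–(6)) under the default \({C}_{d}=0.2\). The function name and the epsilon remark are ours, and the code assumes `s_k` is a single-class \(H\times W\) saliency map with strictly positive values.

```python
import numpy as np

def decisive_saliency_map(s_k, c_d=0.2):
    """Return the DSM of one class map s_k (H x W) and its binarized decisive region."""
    s_max, s_min, s_mean = s_k.max(), s_k.min(), s_k.mean()
    # Eq. (2): decisive saliency differential value. (A tiny epsilon could guard
    # against s_min == 0; the paper assumes strictly positive saliency values.)
    delta_s = c_d * (s_max - s_min) * s_mean / s_min
    # Eq. (3): exponential binarization threshold.
    s_thresh = s_max * np.exp(-delta_s)
    # Eq. (4): binarize the saliency map at the threshold.
    s_tilde = (s_k > s_thresh).astype(np.float32)
    # Eq. (5): superimpose the weighted decisive region onto the original map.
    s_dsm = s_k + delta_s * s_tilde
    return s_dsm, s_tilde

# Eq. (6) is simply the stack of the per-class maps, e.g.
# S_DSM = np.stack([decisive_saliency_map(S_K[..., 0, k])[0] for k in range(K)], axis=-1)
```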

Our method merges the saliency maps with the implicit data characteristics of the saliency value histograms and represents fine-grained features by delineating realistic feature boundaries. The optimization remains simple and effective in its design concept. For the original and perturbed input sample 1, the Decisive Saliency Maps are shown in Fig. 2c and Fig. 2i. Since the highlighted area provided by DSM is almost invisible in Fig. 2i after the perturbation in Fig. 2g, it explains the unexpectedly low confidence score due to the lack of features more intuitively than Fig. 2h.

3.2 DSM-based evaluation metric

Causal metrics have been commonly used in previous research to objectively evaluate the performance of visual explanations, e.g., AUC scores (Area Under the probability Curve) of the deletion and insertion processes, the pointing game, etc. These approaches mainly concentrate on validating the accuracy, localization, and faithfulness of the saliency maps; they do not involve the robustness assessment of model inferences. Also, AUC calculations require additional GPU inferences, increasing computational cost and time extensively.

In image classification tasks based on CNN models, even if the subjective observations of the saliency maps of various input samples are similar and the objective AUC calculations are approximate in value, the inference processes may still differ significantly in their dependence on features, which can be reflected by \({S}_{DSM}^{K}\).

To better analyze the differences in visual explanations, a new quantitative evaluation metric \({r}_{dsm}^{k}\) is proposed in this paper, namely the coverage rate of DSM. As a concise and intuitive quantitative metric, \({r}_{dsm}^{k}\) directly reflects the ratio of pixels in the decisive salient region of an image for the \({k}^{th}\) class, which quantifies the robustness of black-box image classification models to potential perturbation. Using \({\widetilde{s}}_{ij}\) from (4), the \({r}_{dsm}^{k}\) of a specified class is derived from \({S}_{dsm}^{k}\) as:

$${r}_{dsm}^{k}=\frac{1}{H\cdot W}\sum_{i=1}^{H} \sum_{j=1}^{W} {\widetilde{s}}_{ij}$$
(7)
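Continuing the sketch after Eq. (6), the coverage rate of Eq. (7) reduces to the mean of the binarized map; `coverage_rate` is an illustrative helper, not part of the original implementation.

```python
def coverage_rate(s_tilde):
    """Eq. (7): fraction of pixels surviving the binarization of Eq. (4)."""
    return float(s_tilde.mean())   # equals (1 / (H*W)) * sum_ij of the binarized pixels
```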

The metric \({r}_{dsm}^{k}\) does not rely on subjective cognition while reflecting the quantitative difference in the histogram density distribution of saliency values of similar visual explanations. It can be used for long-term tracking to compare whether the inference processes remain robust and whether updates are unnecessary. Its computational runtime is much faster and more efficient than causal metrics that rely on GPUs.

3.3 DSM for robustness assessment

The most common metric for judging the trustworthiness of image classification results is the Softmax confidence score. However, it has been proven [37] that the Softmax confidence score tends to lose calibration as the model structure becomes deeper and more complex, making the model overconfident in its predictions. Even high confidence scores neither ensure the reliability and robustness of the inference process nor truly reflect the likelihood of a correct result.

The work in [37] also verified that temperature scaling is the simplest and most effective solution for confidence probability calibration without affecting the model’s accuracy. The Softmax function \({\sigma }_{\mathrm{SM}}\), which converts the network logit vector \({\mathbf{z}}_{i}\) to a confidence score at the end of the model network, is calibrated by adding the temperature parameter \(T\) in (8); the prediction \({\widehat{q}}_{i}\) is then calibrated as in (9). However, this calibration solution requires access to the model design, making it impossible to apply to most CNN models encapsulated and deployed in industrial environments.

$${\sigma }_{\mathrm{SM}}{\left({\mathbf{z}}_{i}/T\right)}^{(k)}=\frac{\mathrm{exp}\left({z}_{i}^{(k)}/T\right)}{\sum_{j=1}^{K} \mathrm{exp}\left({z}_{i}^{\left(j\right)}/T\right)}$$
(8)
$${\widehat{q}}_{i}=\underset{k}{max} {\sigma }_{\mathrm{SM}}{\left({\mathbf{z}}_{i}/T\right)}^{(k)}$$
(9)

We propose that \({r}_{\mathrm{dsm}}^{k}\) serve as an additional reference indicator for the robustness assessment of CNN models. It can conveniently and intuitively discover input samples and classes with poor robustness of model inference while avoiding modifications to the model structure for calibration. The \({S}_{DSM}^{K}\) and \({r}_{dsm}^{k}\) of the Decisive Saliency Maps reveal potential risks hidden beneath the subjective observation of the saliency maps or the AUC calculation, e.g., an improper essential salient region in \({S}_{dsm}^{k}\), a low value of \({r}_{dsm}^{k}\) for the predicted class, or unreliable fine-grained features displayed in the visualization.

Such anomalies indicate that the model failed to avoid overfitting during training and is neither capable of nor driven to explore adequate discriminative features. Overfitting makes the model highly susceptible to unpredictable misclassification and unreasonable confidence score fluctuations caused by image perturbation and image quality degradation in real-world applications, significantly deteriorating the robustness of the model.

A typical and necessary method for studying the robustness of various vision architectures is occlusion of salient regions [15, 19]. In image classification models, the prerequisite for high prediction scores \(f\left(I\odot {M}_{i}\right)\) on randomly masked inputs is robustness against severe occlusion. For a model with excellent robustness, the predictions on masked inputs stay close to that on the original input:

$${f\left(I\odot {M}_{i}\right)}_{argmax}\approx f(I)$$
(10)

In a robust inference process, the larger the Softmax confidence score \({f}_{k}\left(I\right)\) of the inference result, the larger the saliency values \({S}_{I,f}\left(\lambda \right)\) and their related statistical descriptors, i.e., \(\mathrm{max}\left({s}_{ij}\right)\), will consequently be. Given the masking probability in (1), \(\mathrm{mean}\left({s}_{ij}\right)\) positively correlates with the most likely prediction scores of the perturbed inputs. Large saliency values and a normal-like distribution lead to a small binarization threshold after the transformation in (3) and (4). The obtained threshold further allows the Decisive Saliency Map \({S}_{dsm}^{k}\) to include more pixels, and thus the coverage rate \({r}_{dsm}^{k}\) is larger; usually \({r}_{dsm}^{k}\propto {f}_{k}\left(I\right)\). Although from a distinct perspective, our method shares logical similarities with the information loss process in [38].

To improve robustness, the \({S}_{DSM}^{K}\) and \({r}_{dsm}^{k}\) of the Decisive Saliency Map can serve as an alternative to confidence probability calibration, guiding the improvement of the model’s training dataset or procedure. Typical actions are using Random Erasing [39] or CutMix [40] for data augmentation, introducing label smoothing, applying other regularization, etc.

4 Experiments

4.1 Datasets and implementation

The datasets used for validating DSM in this paper are ImageNet [41] and Stanford Car-196 [33]. The three types of CNN models established are listed below:

  1. ResNet50 [42], provided by TensorFlow 2.3, with weights pre-trained on the ImageNet dataset; hereafter referred to as ResNet50.

  2. EfficientNet-B0 [34], provided by TensorFlow 2.3, with weights pre-trained on the ImageNet dataset; hereafter referred to as EfficientNet-B0.

  3. EfficientNet-B3 [34], which simulates a fine-grained visual classification application deployed in industrial environments, obtained by transfer learning in TensorFlow from the weights pre-trained on ImageNet with multiple data augmentations, label smoothing, and stochastic weight averaging. The inference accuracy of the model on Stanford Car-196 is 93.68% without test-time augmentation or model ensembling; hereafter referred to as EfficientNet-B3.

The value \({C}_{d}=0.2\) is experimentally verified on the two common datasets by comparing the deletion process and the visualization of samples.

Since model-agnostic methods assume that CNN models are black-box and the input images from the real world are not limited to ImageNet, the preprocess-input instruction (which ensures the image colour channels are zero-centred) is not applied to adjust the RGB channel distribution of the input images. The prediction results and saliency value distribution characteristics for the samples in this paper are shown in Table 1, where green indicates a confidence score greater than 60% or \({r}_{dsm}^{k}\) greater than 1%; orange indicates that the result does not match the Ground Truth (GT) or that \({r}_{dsm}^{k}\) is less than 0.2%.

Referring to the empirical results for multiple samples in Table 1, we draw several observations as follows (a minimal banding sketch is given after this list):

  1. High robustness: when the value of \({r}_{dsm}^{k}\) is equal to or greater than 1%, it can be validated that the image has enough essential feature regions for the model to recognize. The inference process can robustly overcome perturbation, even one covering most of the decisive salient region. All results for samples with \({r}_{dsm}^{k}\) values above 1% in Table 1 are Ground Truth and Top-1 class, even if the probability scores are numerically low (e.g., sample 7).

  2. Barely acceptable: when \({r}_{dsm}^{k}\) is between 0.2% and 1%, the robustness of model inference is relatively weak. A perturbation large enough to cover the decisive salient region still impacts the results.

  3. Poor robustness: when \({r}_{dsm}^{k}\) is below 0.2% or far worse, the robustness deteriorates rapidly. The model inference is highly susceptible to negligible occlusion of the input image. Even if the occlusion is only 10 to 30 pixels of a 224 \(\times \) 224 image, which is harmless to human perception, there is a high possibility of a false prediction by the CNN models.

4.2 Experiments on ImageNet

When observing visual explanations subjectively using DSM, we can discover the feature regions’ actual influence and avoid the confusion caused by subordinate features that have insufficient influence on inference. Figures 3 and 4 compare the differences between the saliency maps of RISE and DSM using ResNet50.

Fig. 3 Comparisons of the visual explanations on sample 2 by RISE and DSM using ResNet50

Fig. 4 Comparisons of the visual explanations on sample 3 by RISE and DSM using ResNet50

In Fig. 3a–b, the original visual explanation for sample 2 gives the impression that each goldfish is of equal importance for the model inference, i.e., the model focuses equally on the features represented by the multiple goldfish. In contrast, our approach shows in Fig. 3d that the region represented by one and only one goldfish in the school has the maximum saliency value and its proximity, while the salient regions of the other goldfish do not. Our visual explanation also better reflects the purpose for which the deletion process was set up: to discover regions with profound feature information that have a significant impact on the classification score, using as few pixels as possible, through perturbations such as masks [19].

Figure 4c shows that sample 3 has a lower absolute value of Kurtosis than sample 2 and a much higher mean saliency value than in Fig. 3c. Combined with the DSM shown in Fig. 4d, the visual explanation reflects that the focused feature area is adequate. The pixels with the maximum saliency value are concentrated on the right wing instead of being relatively uniformly distributed on both wings as in Fig. 4b. The \({r}_{dsm}^{k}\) of sample 3 is quantified in Table 1 and visualizes the discrepancy in the saliency value distribution: sample 3 has a larger decisive salient region than sample 2, consistent with its high probability score.

To further verify the discriminative effect of DSM on the feature regions and the feasibility of robustness assessment, the salient region of the Top-1 class of the prediction is perturbed with a mask motivated by adversarial erasing [43] or by patch permutations, as in Fig. 5. The visual explanation of DSM guides the size and location of the mask. Patch permutations push the model to learn features at different levels of granularity during training [44]; at inference time, they probe the robustness of the model to disturbances of spatial structural information [38]. Compared with occlusions and patch permutations, other natural and spatial perturbations, e.g., Gaussian blur, were tested and found to have a relatively minute disturbance on the model inference.
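As an illustration of the patch-permutation perturbation, the sketch below shuffles an image into an \(n\times n\) grid; the grid size, random seed, and helper name are illustrative choices rather than the authors’ exact settings, and the image dimensions are assumed divisible by \(n\).

```python
import numpy as np

def patch_permute(image, n=2, seed=0):
    """Shuffle an H x W x C image into n x n patches and reassemble them in random order."""
    H, W, C = image.shape
    ph, pw = H // n, W // n
    patches = [image[i*ph:(i+1)*ph, j*pw:(j+1)*pw] for i in range(n) for j in range(n)]
    order = np.random.default_rng(seed).permutation(len(patches))
    rows = [np.concatenate([patches[order[i*n + j]] for j in range(n)], axis=1)
            for i in range(n)]
    return np.concatenate(rows, axis=0)
```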

Fig. 5 Comparisons of RISE and DSM using ResNet50 and EfficientNet-B0. In (a)-(c), the GT class dropped from first to third after perturbation of the decisive salient regions. In (d), the decisive salient region is perturbed by patch permutation into 2 × 2 grids, and the prediction fails consequently. The inference with ResNet50 under large occlusion or permutation into 4 × 4 grids shows good robustness in (e)-(f): the confidence score is scarcely affected, and the value of \({r}_{dsm}^{323}\) is large. Using EfficientNet-B0 for input sample 2, DSM shows in (g) that the value of \({r}_{dsm}^{1}\) is too low (only 0.016%), displaying poor robustness; therefore, a negligible perturbation in (h) reduces the score of the GT class from first to second

An inference process with poor robustness fails to give the correct prediction for the perturbed input sample. Once the only goldfish representing the decisive saliency is partially occluded or perturbed by the shuffle operation, as in Fig. 5a–d, the confidence score of the goldfish drops below the probabilities of other classes, leading to misclassification. In contrast, input images with a sufficiently large DSM coverage rate maintain correct results and high scores, even after a severer perturbation or being shuffled into smaller grids than the previous sample, as in Fig. 5e–f. The result indicates enough high-importance feature areas for the model to recognize.

The samples above were further tested by comparing the Decisive Saliency Maps of several CNN models with different network depths. In well-designed CNN models, the deeper the structure and the higher the accuracy, the better the models can exploit fine-grained features. Nevertheless, many samples preliminarily verify that when different CNN models attend to similar features, overfitting may manifest as focusing excessively on limited fine-grained regions due to deeper layers, leading to poor robustness. Too much attention to overly small features is not conducive to model generalization in real-world applications.

As shown by DSM, EfficientNet-B0 focuses on the same goldfish as ResNet50 in Fig. 5g, but the region of decisive saliency is much smaller. We perturb the salient regions guided by DSM as in Fig. 5h and find that even a perturbation negligible to the human eye already causes a false prediction by EfficientNet-B0. Therefore, if the robustness of the model for real-world applications is a concern, a deeper and more complex model, while performing better, may not be the most appropriate choice without sufficient degraded samples and data augmentation.

DSM is also applicable to class-discriminative inferences on input samples containing objects of multiple classes. As in Fig. 6a, input sample 4 contains two classes, bull mastiff and tiger cat, which are the Top-2 classification results of ResNet50. A small amount of perturbation in the decisive salient region representing the tiger cat, shown in Fig. 6e, significantly reduces the confidence score of the cat, as in Fig. 6f. Compared with the saliency map by RISE in Fig. 6d, DSM indicates clearly that only the cat’s mouth and nose, not its head as a whole, are focused on by the model. The confidence score of the first class (bull mastiff) increases remarkably after the salient region of the second class is disturbed under the guidance of DSM, allowing the prediction to be very “confident” in Fig. 6f. When the features that ResNet50 and EfficientNet-B0 focus on are similar, the inference process of EfficientNet-B0 is again less robust in comparison, as shown in Fig. 6k–l: the DSM coverage rates of both classes are fairly low. Occlusion with a tiny mask on the decisive salient region of the top class lowers its score to third in Fig. 6m. The input sample is shuffled into different levels of granularity in Fig. 6n–p; in most patch permutation cases, class 243 scores higher than class 282, depending on the integrity of the decisive salient regions after the shuffle operation.

Fig. 6 DSM and robustness assessment of the prediction for input sample 4. The input sample a includes two ImageNet classes, 243 and 282. b–e show the comparison of RISE and DSM. The perturbed sample f for class 282, which refers to DSM e, shows that inference using ResNet50 shifts to focus entirely on class 243 after partial perturbation, essentially removing the attention to class 282. In DSM using EfficientNet-B0, class 282 is first and 243 is fourth; poor robustness is visualized in k and l. m shows that a negligible perturbation changes the class sequence of the result. n–p demonstrate the input sample shuffled into 3 × 3 and 4 × 4 grids; in most permutation cases, class 243 retains higher scores than class 282

Figure 7 shows a comparison of more visual explanations of input samples, generated by RISE, our method DSM, and the most representative model-specific method, GRAD-CAM. Compared with the other visual explanations, the DSM improvement focuses on highlighting the most important features, such as the animals’ eyes in samples 5 and 8. For the bubbles in sample 7, which may represent the dispersion problem [29], DSM visualizes the bubble contours unambiguously while suppressing the noise.

Fig. 7 Comparison of DSM and other visual explanations using ResNet50 and EfficientNet-B0 for more samples

For misclassification cases, DSM can reflect the risk of untrustworthy inference and facilitate the detection of false predictions, whereas the other two visual explanations are incapable of such verification during interpretation. For the puzzling sample 9, the valley, both CNN models without preprocessing misclassify it as cliff, and they are highly susceptible to perturbations that further push the input towards cliff dwelling. Confidence scores greater than 60% cannot expose this risk of misclassification, but DSM reveals that the value of \({r}_{dsm}^{k}\) is too small and that the high-saliency region is unnoticeable in the visualization, thus exposing the misclassification risk.

4.3 Experiments on stanford car-196

For encapsulated Deep Learning models in industrial applications, and even for traditional pattern-based machine vision algorithms, visual explanations can be validated by RISE and DSM, which are perturbation-based approaches suitable for black-box models. The only premise is that the industrial machine vision system can output results with probability scores correctly matching each of the massive numbers of perturbed input samples. With transfer learning and multiple regularizations, EfficientNet-B3, simulating an industrial application, obtains higher accuracy and acceptable confidence scores (> 80%) due to better model performance and a smaller number of class labels in the fine-grained Stanford Car-196 dataset compared to ImageNet.

Referring to the first and second columns of Fig. 8, it is difficult for end-users to detect robustness risks based only on visual explanations and confidence scores. Provided that the predictions are correct, the difference in the size of the essential salient regions between samples with high robustness of model inference and those without is difficult to distinguish precisely. Calculating the AUC using the deletion process is costly in GPU inference; moreover, the difference in AUC values still does not directly indicate the discrepancy in robustness, as seen in the fourth column of Fig. 8.

Fig. 8 DSM and corresponding deletion AUC using EfficientNet-B3 model on samples from Stanford Car-196

Figure 8i shows an input sample with high robustness of model inference. The distribution histogram in Fig. 8y shows that the mean saliency value is high, and the fit error between the Johnson unbounded and log-normal distributions is low. In contrast, for the sample with poor robustness of model inference shown in Fig. 8m, the mean saliency value is low, as in Fig. 8z; the Kurtosis value is much higher, and the distribution has a long-tail effect.

Combined with the physical objects in the real world, the visual explanations of DSM in the third column of Fig. 8 focus on discriminative features such as vehicle front mesh grilles, emblems, headlights, and taillights. Despite the distinct implementation methods, the findings are basically consistent with the description in [45]. DSM reduces the confusion caused by background and irrelevant physical features in subjective cognition.

When the decisive salient regions and coverage rates are considerably larger, the model can still make correct predictions with unaffected confidence scores even if large areas of the images are perturbed, as in Fig. 9a–b. This is the case for most test set samples after the various data augmentation and training optimizations of the fine-tuned model. In contrast, when the decisive salient regions are small but the inferences have relatively high confidence scores, the predictions are susceptible to perturbation in the decisive salient regions, as in Fig. 9c–f. In the corresponding real-world application, this is equivalent to the situation where minor damage to an auxiliary vehicle part causes CNN models to fail to recognize the vehicle type. Though EfficientNet-B3 has better performance, deeper layers, and keener attention to discriminative features, its lack of robustness when inferring certain classes or samples, e.g., images from the rear view of class 196 in the Stanford Cars dataset, is revealed by DSM. DSM also visually explains the variation of the prediction scores under different granularities of patch permutation in Fig. 9g–l; the variation depends on the size of the decisive salient regions and how the grids perturb the regions.

Fig. 9 Decisive Saliency Maps for perturbed input samples with various robustness. The predictions of EfficientNet-B3 change accordingly. The scores do not drop in patch permutations simply due to more granular levels of grids. The variation of scores depends on the size of decisive salient regions and how the grids perturb the regions

4.4 Class sensitivity evaluation

Class Sensitivity is defined and verified for different visual explanation methods in [29]. A responsible visual explanation for an image classification task should provide a different interpretation for each class. Moreover, higher Class Sensitivity should display more discriminative and dissimilar visual explanations between the saliency maps of classes with higher and lower scores. The advantages of the DSM method regarding Class Sensitivity are presented through visual cognition and the computational results of (dis)similarity metrics, i.e., evaluated qualitatively and quantitatively.

Qualitative evaluation Saliency maps of the lower-score classes provided by RISE are occasionally confusing: the visualizations tend to mislead observers at first glance into believing that enough, or even excessive, discriminative features were considered during model inference.

Saliency maps produced by DSM are remarkably explicit and meaningful for lower-score classes, exposing that the models could scarcely recognize correct features or salient regions.

The dissimilarity of the saliency maps generated by RISE and DSM is shown in Fig. 10, comparing the classes with the highest and lowest scores. The optimization by DSM is prominent in subjective cognition.

Fig. 10 The dissimilarity between saliency maps of the classes with the highest and lowest scores. Saliency maps of the lowest scores by DSM are remarkably meaningful

Quantitative evaluation Along with the Pearson Correlation Coefficient (CC), we apply several other commonly used similarity metrics to the saliency maps generated by RISE and DSM. The dissimilarity between the classes with the highest and lowest scores is calculated and compared.

The Similarity (SIM) metric calculates a similarity index from the normalized saliency distributions of the predicted and ground truth saliency maps [46, 47]. The Kullback–Leibler divergence (KL) is a classical measure of the dissimilarity between the probability distributions of two maps, giving more penalty to false negatives. For better comparison, \(\mathrm{NKL}=1-\mathrm{KL}\) is used in the evaluation [48]. The Normalized Scanpath Saliency (NSS) is an effective measurement sensitive to false positives and to dissimilarity between prediction and ground truth [46, 47].

The saliency maps are normalized respectively, and the top classes are set as ground truth in the calculation. For NSS, we binarize the top-class saliency map with its mean saliency value as the threshold.
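Under the conventions stated above, the four metrics can be sketched as follows for two \(H\times W\) saliency maps; these are common saliency-benchmark formulations [46,47,48], and the exact normalization details of the original evaluation may differ.

```python
import numpy as np

def sim(pred, gt):
    """SIM: histogram intersection of the two maps normalized to unit mass."""
    pred, gt = pred / pred.sum(), gt / gt.sum()
    return float(np.minimum(pred, gt).sum())

def nkl(pred, gt, eps=1e-12):
    """NKL = 1 - KL(gt || pred); this direction penalizes false negatives more."""
    pred, gt = pred / pred.sum(), gt / gt.sum()
    kl = float((gt * np.log(eps + gt / (pred + eps))).sum())
    return 1.0 - kl

def nss(pred, gt):
    """NSS: mean of the standardized prediction over the GT map binarized at its mean."""
    z = (pred - pred.mean()) / (pred.std() + 1e-12)
    return float(z[gt > gt.mean()].mean())

def cc(pred, gt):
    """Pearson Correlation Coefficient between the two flattened maps."""
    return float(np.corrcoef(pred.ravel(), gt.ravel())[0, 1])
```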

The pre-trained CNN model is ResNet50. The evaluation is conducted on a subset of more than 300 samples randomly picked from ImageNet, a sample size that approximates typical saliency benchmark datasets.

The results of CC are close to the experimental results in [29]. The symmetric computation of CC does not assume which saliency map is the ground truth; thus, it cannot separate differences due to false positives from those due to false negatives. A positive NSS indicates a consistent correlation between saliency maps, and a negative NSS indicates apparent dissimilarity. Considering that saliency maps generated by numerous perturbations are distributed more pervasively than real human eye fixations, the computed values of the other similarity metrics are correspondingly higher.

As shown in Table 2, smaller values mean larger dissimilarity, representing higher Class Sensitivity. Most metrics demonstrate a certain level of optimization by the DSM method.

Table 1 Saliency value distribution data of input samples. \({r}_{dsm}^{k}\) can serve as a reference indicator of reliability complementing the confidence score, as most wrong predictions relate to low \({r}_{dsm}^{k}\) values but acceptable scores
Table 2 The dissimilarity evaluation results between the classes with the highest and lowest scores for the aforementioned metrics. Smaller values are desired in the dissimilarity evaluation

The overall evaluation results indicate that DSM is an improved method for Class Sensitivity, both qualitatively and quantitatively, illustrating the dissimilarity between the highest- and lowest-scoring classes with increased efficiency.

5 Conclusion and future work

This paper proposes an optimized visual explanation called the Decisive Saliency Map, applicable to black-box models for image classification tasks. DSM can quantitatively calculate the discrepancy in influence and size of different salient regions and embody extra information about the saliency value distribution in the visualization. Its function of robustness assessment of the model inference process is validated on the ImageNet and Stanford Car-196 datasets.

Further research will be conducted to eliminate the influence of randomness on the quantitative metrics. Simultaneously, we will continue to study visual explanations of Deep Learning models to promote their utilization in other CNN vision tasks, including object detection, instance segmentation, etc. Endeavors will be made towards the reliable deployment and promotion of visual explanations in manufacturing environments, and towards analyzing the selection of backbone networks to balance accuracy and robustness requirements.