1 Introduction

DL has achieved remarkable advances in many fields of human life, especially in computer vision tasks such as object detection, image classification [1,2,3], surveillance [4], and medical imaging [5]. Despite this progress, DL models have been found to be vulnerable to adversaries [6, 7]. In image classification, for instance, an adversary can generate AEs by adding small, human-imperceptible perturbations to an input image that cause the DL model to misclassify it. This threat affects security-critical DL-based applications [8] such as self-driving cars.

Adversaries can generate AEs under white-box, black-box, and gray-box attack scenarios [9, 10]. In the white-box scenario, the adversary knows everything about the DL model, including its inputs, outputs, architecture, and weights. Hence, the adversary is guided by the model gradient and generates AEs by solving an optimization problem [7, 11,12,13,14]. In the black-box scenario, the adversary knows nothing about the model but leverages the transferability property [15] of AEs and the content of the input. By sending queries to the model, the adversary can craft small perturbations that are harmonious with the input image [16,17,18,19]. In the gray-box scenario, the adversary knows only the inputs and outputs of the model; hence, the adversary substitutes the original model with an approximated model and then uses its gradient, as in the white-box scenario, to generate AEs.

Researchers have paid attention to this threat, and several methods have been proposed to detect or defend against AEs. More details about defense and detection methods can be found in Section 2.

A DL model’s uncertainty is one of the main signals used to determine whether an input sample belongs to the training manifold. The uncertainty is usually measured by adding randomness to the model using the Dropout technique [20, 21]. It has been found that predictions of clean samples do not change when randomness is added, while predictions of AEs do. Feinman et al. [22] proposed the BU metric, which uses Monte Carlo dropout to estimate the uncertainty and detect AEs that are near the class manifolds, while Smith et al. [23, 24] used a mutual-information method to estimate the uncertainty. The prediction risk of these methods is higher compared to the more recent uncertainty method, SelectiveNet [25], that is used in this work. On the other hand, it was shown in [26] that the predicted class probabilities, i.e. the model’s confidence, of in-distribution samples are higher than those of out-of-distribution samples. The model’s confidence was used in [26,27,28,29] to implement AE detectors. Uncertainty- and confidence-based detectors have shown limited success against black- and gray-box attacks; they are usually threshold-based detectors, as shown in Fig. 1a. To enhance detectors’ performance, one recommendation is to provide ensemble detection methods, as shown in Fig. 1b. Although state-of-the-art detectors achieve promising results, they may suffer from one or more limitations: not performing well with some known attacks [30], being broken by attackers [31, 32], inconsistent performance of baseline detectors [33], increased model parameter overhead [34], being time consuming [35], or introducing latency [36] at inference time.

Fig. 1

a High-level architecture of uncertainty/confidence-based detectors. b High-level architecture of uncertainty/confidence-based ensemble detectors. The input sample is passed to the CNN model for class prediction. The detector, i.e. the uncertainty method, estimates the uncertainty of the input sample using the model's hidden layers. Using a predefined threshold, the input sample is deemed not adversarial if the uncertainty score exceeds the threshold

In this paper, in order to mitigate the aforementioned limitations, we revisit the model’s uncertainty and confidence to propose a novel ensemble AE detector that has no knowledge of AEs, i.e. an unsupervised detector, as shown in Fig. 2. The proposed method has the following attributes: 1) it investigates SelectiveNet’s capability in detecting adversarial examples, since it measures the uncertainty with less risk. To the authors’ knowledge, SelectiveNet [25] has not previously been used in adversarial attack detection models. 2) Unlike other detectors [29, 37, 38], the proposed method uses the model’s last N-layer outputs, i.e. feature maps, to build N CNNs \({\mathscr{M}}\) that include different processing blocks, such as up/down sampling, auto-encoders [39, 40], noise addition [41,42,43], and a bottleneck layer [44], which make the representative data of the last layers more unique to the input data distribution and yield better model confidence. To reduce the effect of white-box attacks, the output of \({\mathscr{M}}\) is transferred/distilled to build the last CNN \(\mathcal {S}\). 3) The proposed model ensembles the proposed detection techniques to provide the final detector. This step greatly reduces the adversary’s ability to craft perturbations that fool the detector, since every detection technique has to be fooled. We name the proposed method Selective and Feature based Adversarial Detection (SFAD). The high-level architecture of SFAD is illustrated in Fig. 1b.

Fig. 2

a SFAD’s architecture. The N-last representative outputs of the DNN are used to build N Selective Adversarial Example classifiers. The confidence outputs, i.e. prediction probabilities, of the N classifiers are concatenated as input to the Selective Knowledge Transfer classifier. b Feature map processing blocks. c SelectiveNet architecture [25]. d Detection process: selective probabilities (\(P_{s}^{m_{j}}\) and \(P_{s}^{\mathcal {S}}\)) are used in the selective detection process. Confidence/prediction probabilities (\(P_{c}^{m_{j}}\) and \(P_{c}^{\mathcal {S}}\)) are used in the confidence detection process. Confidence/prediction probabilities (\(P_{c}^{\mathcal {F}}\) and \(P_{c}^{\mathcal {S}}\)) are used in the mismatch detection process. The total detection is the ensemble of the three detection modules

A prototype of SFAD is tested under white-box, black-box, and gray-box attacks on MNIST [45] and CIFAR10 [46]. Under white-box attacks, the experimental results show that SFAD detects AEs with an accuracy of at least 89.8% (many attacks with 99%) for all tested attacks except the PGD attack [14], for which it achieves at least 65% detection accuracy on average. For black- and gray-box attacks, SFAD shows better performance than the other tested detectors. Finally, SFAD is tested under the HCA [47] and is shown to be fully (100%) robust on MNIST and partially (57.76%) robust on CIFAR10. SFAD sets its thresholds to reject 10% of clean images. Moreover, comparisons with state-of-the-art methods are presented. Hence, our key contributions are:

  • We propose a novel unsupervised ensemble model for AE detection. Ensemble detection makes SFAD more robust against white-box and adaptive attacks.

  • We investigate the capability of SelectiveNet, as an uncertainty model, in detecting AEs.

  • We show that, by processing the feature maps of the last N layers, we can build classifiers with a better confidence distribution. We provide ablation experiments to study the impact of the feature processing blocks.

  • The SFAD prototype proves the concept of the approach and leaves the door open for future work to find the best N layers and the best combination of N (or M) CNNs to build the detector’s classifiers.

  • Unlike the tested state-of-the-art detectors, the SFAD prototype shows better performance under gray- and black-box attacks. The SFAD prototype is fully robust on MNIST and partially robust on CIFAR10 when attacked with HCAs. For instance, the Local Intrinsic Dimensionality (LID) method [48] reported very high detection accuracy on the tested attacks but fails against HCAs [31, 47].

2 Related work

2.1 Detection methods

Defense techniques like adversarial training [7, 14, 49, 50], feature denoising [51,52,53], pre-processing [54, 55], and gradient masking [56,57,58,59] try to make the model robust against the attacks so that it correctly classifies AEs. On the other hand, detection methods provide the adversarial status of the input image. Detection techniques can be classified, according to whether AEs are present in the detector learning process, into supervised and unsupervised techniques [33]. In supervised detection, detectors include AEs in the learning process. Many approaches exist in the literature. In the feature-based approach [38, 60,61,62,63], detectors use clean and adversarial inputs to build their classifier models from scratch, either from raw image data or from the representative layer outputs of a DNN model. For instance, in [38], the detector quantizes the last ReLU activation layer of the model and builds a binary classifier. As reported in [38], this detector is not robust enough and is not tested against strong attacks like the Carlini-Wagner (CW) attacks. The work in [61] added a new adversarial class to the NN model and trained the model from scratch with clean and adversarial inputs; this architecture reduces the model accuracy [61]. In the concurrent recent work [63], Wang et al. used the saliency map features of clean and adversarial examples to train the detector’s classifier. In the statistical-based approach [22, 48], detectors perform statistical measurements to define the separation between clean and adversarial inputs. In [22], KD estimation, BU, and combined models are introduced. The kernel-density feature is extracted from clean samples and AEs in order to identify AEs that are far away from the data manifold, while the Bayesian uncertainty feature identifies AEs that lie in low-confidence regions of the input space. The LID method is introduced in [48] as a distance distribution of the input sample to its neighbors to assess the space-filling capability of the region surrounding that input sample. The works in [31, 47] showed that these methods can be broken. Finally, the network invariant approach [62, 64] learns the differences in neuron activation values between clean input samples and AEs to build a binary NN detector. The main limitation of this approach is that it requires prior knowledge of the attacks and hence might not be robust against new or unknown attacks.

On the other hand, in unsupervised detection, detectors are trained with clean images only to identify AEs. This is also known as the prediction-inconsistency approach, since it relies on the fact that AEs might not fool every NN model. The input feature space is limited, and the adversary takes advantage of this to generate AEs; hence, unsupervised detectors try to reduce the limited input feature space available to adversaries. Many approaches have been presented in the literature. The Feature Squeezing (FS) approach [30] measures the distance between the predictions of the input and of the same input after squeezing. The input is adversarial if the distance exceeds a threshold. The work in [30] squeezes out unnecessary input features by reducing the color bit depth of each pixel and by spatial smoothing of adversarial inputs. As reported in [30], FS does not perform well with some known attacks like FGSM. Instead of squeezing, denoising-based approaches, like MagNet [65], measure the distances between the predictions of input samples and of denoised/filtered input samples. It was found in [32, 53] that MagNet can be broken and does not scale to large images. Recently, a network invariant approach was introduced in [35]. The authors proposed the NIC method, which builds a set of models for individual layers to describe the provenance and the activation value distribution channels. It was observed that AEs affect these channels. The provenance channel describes the instability of the set of activated neurons in the next layer when small changes are present in the input sample, while the activation value distribution channel describes the changes in the activation values of a layer. The reported performance of this method showed its superiority over other state-of-the-art models, but other works reported that the baseline NIC detectors are not consistent [33], increase model parameter overhead [34], are time consuming [35], and increase the latency at inference time [36].

Uncertainty-based detectors

Uncertainty-based detection follows the observation that the prediction of a clean image remains correct under many dropout configurations, while the prediction of an AE changes. Based on this, Feinman et al. [22] proposed the BU metric, which uses Monte Carlo dropout to estimate the uncertainty and to detect AEs that are near the class manifolds, while Smith et al. [23] used a mutual-information method for the same task. In [24], Sheikholeslami et al. proposed an unsupervised detection method that provides a layer-wise minimum-variance solver to estimate the model’s uncertainty for in-distribution training data. Then, a mutual-information-based threshold is identified.

Confidence-based detectors

Aigrain et al. [27] built a simple NN detector that uses the model’s logits on clean samples and AEs to build a binary classifier. Inspired by the hypothesis that, for a given perturbed image, different models yield different confidences, Monteiro et al. [28] proposed a bi-model mismatch detection method. The detector is a binary RBF-SVM classifier that takes as input the outputs of two classifiers on clean samples and AEs. On the other hand, Sotgiu et al. [29] proposed an unsupervised detection method that uses the last N representative layer outputs of the classifier to build three SVM classifiers with RBF kernels. The confidence probabilities of the SVMs are combined to build a final SVM-RBF classifier. Then, a threshold is identified to reject inputs whose maximum confidence probability is too low.

2.2 SelectiveNet as an uncertainty model

Let \(\mathcal {X}\) be an input space, e.g. images, and \(\mathcal {Y}\) a label space. Let \(\mathbb {P}(X,Y)\) be the data distribution over \(\mathcal {X} \times \mathcal {Y}\). A model \(f:X \rightarrow Y\) is called a prediction function, and \( \ell : Y \times Y \rightarrow \mathbb {R}^{+}\) is a given loss function. A labeled set \(S_{k} = \{(x_{i}, y_{i})\}_{i=1}^{k} \subseteq (\mathcal {X} \times \mathcal {Y})^{k}\) is sampled i.i.d. from \(\mathbb {P}(X,Y)\), where k is the number of training samples. The true risk of the prediction function f w.r.t. \(\mathbb {P}\) is \(R(f) \triangleq \mathbb {E}_{\mathbb {P}(X,Y)}[\ell (f(x),y)]\), while the empirical risk of the prediction function f is \(\hat {r}(f\mid S_{k}) \triangleq \frac {1}{k} {\sum }_{i=1}^{k} \ell (f(x_{i}),y_{i})\).

Here, we briefly describe SelectiveNet as presented in [25]. A selective model is a pair (f,g), where f is a prediction function and \(g : \mathcal {X} \rightarrow \{0,1\}\) is a binary selection function for f,

$$ (f,g)(x) \triangleq \begin{cases} f(x), & \text{if } g(x) = 1; \\ \text{don't know}, & \text{if } g(x) = 0. \end{cases} $$
(1)

A soft selection function can also be considered, where \(g : \mathcal {X} \rightarrow [0,1]\), hence, the value of (f,g)(x) is calculated with the help of a threshold τ as expressed in the following equation

$$ (f,g)(x) \triangleq \begin{cases} f(x), & \text{if } g(x) \geq \tau; \\ \text{don't know}, & \text{if } g(x) < \tau. \end{cases} $$
(2)

The performance of a selective model is calculated using coverage and risk. The true coverage is defined to be the probability mass of the non-rejected region in \(\mathcal {X}\) and calculated as

$$ \phi(g) \triangleq E_{P}[g(x)], $$
(3)

while the empirical coverage is calculated as

$$ \hat{\phi}(g\mid S_{k}) \triangleq \frac{1}{k} \sum\limits_{i=1}^{k}g(x_{i}) $$
(4)

The true selective risk of (f,g) is

$$ R(f,g) \triangleq \frac{E_{P}[\ell (f(x),y)g(x)]}{\phi(g)}, $$
(5)

while the empirical selective risk is calculated for any given labeled set Sk as

$$ \hat{r}(f,g\mid S_{k}) \triangleq \frac{\frac{1}{k}{\sum}_{i=1}^{k}\ell (f(x_{i}),y_{i})g(x_{i})}{\hat{\phi}(g\mid S_{k})}. $$
(6)

Finally, for a given coverage rate 0 < c ≤ 1 and Θ, a set of parameters for a given deep network architecture for f and g, the optimization problem of the selective model is expressed as:

$$ \begin{array}{@{}rcl@{}} &\begin{aligned} \theta^{\ast} = \operatorname*{arg min}_{\theta \in {\Theta}} (R(f_{\theta},{g}_{\theta}))\\ \textit{s.t. } \phi({g}_{\theta}) \geq c, \end{aligned} \end{array} $$
(7)

and can be solved using the Interior Point Method (IPM) [66] to enforce the coverage constraint. This yields the unconstrained loss objective function over the samples in Sk,

$$ \begin{array}{@{}rcl@{}} &\begin{aligned} \mathcal{L}_{(f,g)} \triangleq \hat{r}_{\ell}(f,g\mid S_{k}) + \lambda {\Psi}(c-\hat{\phi}(g\mid S_{k}))\\ {\Psi}(a) \triangleq \max(0,a)^{2}, \end{aligned} \end{array} $$
(8)

where c is the target coverage, λ is a hyper-parameter controlling the relative importance of the constraint, and Ψ is a quadratic penalty function. As a result, SelectiveNet is a selective model (f,g) that optimizes both f(x) and g(x) in a single model in a multi-task setting, as depicted in Fig. 2c. For more details about the SelectiveNet model, readers are referred to [25].
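To make the training objective in (8) concrete, the following minimal sketch (our own TensorFlow/Keras illustration, not the reference implementation of [25]) computes the selective loss for one batch, assuming `y_true` holds one-hot labels, `f_out` holds the class probabilities of the prediction head f, and `g_out` holds the soft selection scores g(x) in [0,1]:

```python
import tensorflow as tf

def selective_loss(y_true, f_out, g_out, coverage=0.9, lam=32.0):
    """Empirical selective risk plus quadratic coverage penalty, as in (8)."""
    # Per-sample cross-entropy loss of the prediction head f
    ce = tf.keras.losses.categorical_crossentropy(y_true, f_out)
    # Empirical coverage: mean selection score over the batch, as in (4)
    emp_coverage = tf.reduce_mean(g_out)
    # Empirical selective risk: losses weighted by g, normalized by coverage, as in (6)
    selective_risk = tf.reduce_mean(ce * tf.squeeze(g_out)) / (emp_coverage + 1e-8)
    # Quadratic penalty enforcing the target coverage constraint (Psi in (8))
    penalty = lam * tf.square(tf.maximum(0.0, coverage - emp_coverage))
    return selective_risk + penalty
```

In SFAD, this term is combined with a standard cross-entropy loss from the auxiliary head, as in (9) and (10) below.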

3 Adversarial Detection (SFAD) method

3.1 SFAD’s classifiers design

It is believed that the last N layers in the DNN \(\mathcal {F}\) have potential in detecting and rejecting AEs [29, 37]. In [67] and [37], only the last layer (N = 1) is utilized to detect AEs. At this very high level of representation, AEs are indistinguishable from samples of the target class. This observation was strengthened when DNR [29] used the last three layers to build SVM classifiers with RBF kernels. Unlike these works, in this work, 1) the feature maps of the last layers Zj, where \(j=\{1,2 \dots ,N\}\), are processed. In the aforementioned methods, the representatives of the last layers are not processed, so the detectors basically form another approximation of the baseline classifier, which is a weak point. 2) MTL is used via the SelectiveNet. MTL has the advantage of combining related tasks with one or more loss functions and generalizes better, especially with the help of auxiliary functions. For more details about MTL, please refer to these recent review papers [68, 69].

In this section, the Adversarial Detection (SFAD) method is described. As depicted in Fig. 2a, SFAD consists of two main blocks: the selective AE classifiers block \({\mathscr{M}}\) (in blue), where \({\mathscr{M}}=\{m_{j}\}_{j=1}^{N}\), and the selective knowledge transfer classifier block \(\mathcal {S}\) (in orange). The training phase has two steps: the first trains the \({\mathscr{M}}\) classifiers, and the second trains the \(\mathcal {S}\) classifier. Hence, \({\mathscr{M}}\) and \(\mathcal {S}\) are trained separately. At inference/test time, the outputs of the \(\mathcal {F}\), \({\mathscr{M}}\), and \(\mathcal {S}\) blocks, i.e. the model’s uncertainties and confidences, are used in the detection process, as depicted in Fig. 2d.

3.2 Selective AEs classifiers block: training the \({\mathscr{M}}\) classifiers

As shown in Fig. 2a, the aim of the \({\mathscr{M}}\) block is to build N individual classifiers, \({\mathscr{M}}=\{m_{j}\}_{j=1}^{N}\). It has been shown that perturbation propagation becomes clearer as the DNN goes deeper; hence, using the N last layers has potential in identifying AEs. Unlike the works in [29, 37], we process the representative last N-layer outputs Zj in different ways in order to make the features of clean inputs more unique, as shown in Fig. 2b and discussed in Section 3.2.1. This limits the feature space that the adversary can use to craft AEs [30, 65]. Moreover, each of the last N layer outputs has its own feature space, so each mj classifier is trained on a different feature space. Hence, combining the classifiers and increasing N will enhance the detection process.

For simplicity, and as recommended in [29], we set N = 3 in the implemented prototype; hence, each individual layer output is assigned to one classifier, as shown in Fig. 2a. Let the outputs of the last N layers for a sample xi from Sk be denoted zji, i.e. z1i, z2i, and zNi, where \(j=\{1,2, \dots , N\}\). Each zji is, individually, the input of the mj classifier.
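As an illustration of how the per-layer inputs zji can be obtained from a trained Keras baseline classifier, the following sketch (our own illustration; the layer slice `base_model.layers[-(n + 1):-1]` is a simplifying assumption, and the actual tapped layers depend on the architectures in Tables 1 and 2) builds one feature-extraction sub-model per tapped layer:

```python
import tensorflow as tf

def last_n_feature_extractors(base_model, n=3):
    """Build one sub-model per tapped layer that returns its feature map Z_j."""
    # Take the n layers preceding the softmax output; adapt the slice to the model.
    tapped_layers = base_model.layers[-(n + 1):-1]
    return [tf.keras.Model(inputs=base_model.input, outputs=layer.output)
            for layer in tapped_layers]

# Usage: z_j = extractors[j](x_batch) is the input of classifier m_j
# extractors = last_n_feature_extractors(base_model, n=3)
```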

The outputs of the mj classifier are denoted mji(zji). Let \(\mathcal {Y}^{\prime }=\mathcal {Y}+1\) be the label space of mj, where the extra label denotes the selective status; hence mj represents a function \(m_{j}:Z_{j} \rightarrow Y^{\prime }\) on a distribution \(\mathbb {P}(Z_{j},Y^{\prime })\) over \(\mathcal {Z} \times \mathcal {Y}^{\prime }\). We refer to the selective probability of mj as \(P_{s}^{m_{j}}\) and to the confidence probabilities of mj as \(P_{c}^{m_{j}}\). mj optimizes the overall loss function

$$ \mathcal{L}_{m_{j}} = \alpha \mathcal{L}_{(m_{j},g_{m_{j}})} + (1-\alpha) \mathcal{L}_{h_{m_{j}}} \text{ , where }\alpha=0.5, $$
(9)

where \({\mathscr{L}}_{(m_{j},g_{m_{j}})}\) is the selective loss function of mj, as discussed in Section 2.2, and \({\mathscr{L}}_{h_{m_{j}}}\) is the auxiliary loss function of mj; they are calculated as follows:

$$ \begin{gathered} \mathcal{L}_{(m_{j},g_{m_{j}})} \triangleq \hat{r}_{\ell}(m_{j},g_{m_{j}}\mid S_{k}) + \lambda {\Psi}(c-\hat{\phi}(g_{m_{j}}\mid S_{k})),\\ {\Psi}(a) \triangleq \max(0,a)^{2}, \end{gathered} $$
$$ \mathcal{L}_{h_{m_{j}}} = \hat{r}(h_{m_{j}}\mid S_{k}) = \frac{1}{k}{\sum}_{i=1}^{k}\ell(h_{m_{j}}(z_{ji}),y_{i}). $$

Studying the value of α is out of the scope of this paper, but other task balancing methods may be applied, such as uncertainty weighting [70], GradNorm [71], DWA [40], DTP [72], and MGDA [73].

3.2.1 Feature maps processing

As depicted in Fig. 2b, each selective classifier consists of different processing blocks: an auto-encoder block, an up/down-sampling block, a bottleneck block, and a noise block. These blocks aim at producing distinguishable features for input samples so that the detector can recognize AEs efficiently.

Auto-encoder

Auto-encoders are widely used as a reconstruction tool, and their loss is used as a score for different tasks. For instance, it is used in the detection process of AEs in [65]. It is believed that AEs give a higher reconstruction loss than clean images. This process is also known as an attention mechanism [74, 75] and is used to obtain a better representation of input features, especially in shallow classifiers.

Up/down-sampling

Up-sampling and down-sampling are used in different deep classifiers [39, 40]. The aim of down-sampling, a.k.a. pooling layers in NNs, is to gather the global information of the input signal. Hence, if we consider the clean input signal as a signal that carries global information, expand that information by bilinear up-sampling, and then down-sample it by average pooling, we measure the ability to reconstruct the global information of the input signal. This process can also be seen as a use case of the reconstruction process.

Noise

Adding noise has a potential impact in making NNs more robust against AEs and has been used in many defense methods [41,42,43]. In this work, we add a branch in the classifier that adds small Gaussian noise to the input signal before and after the auto-encoder block. Then, the noised and clean input features are concatenated before the bottleneck block.

Bottleneck

The bottleneck block [44] consists of three convolutional layers: 1×1, 3×3, and 1×1. The bottleneck name comes from the fact that the 3×3 convolutional layer is left as a bottleneck between the 1×1 convolutional layers. It is mainly designed for efficiency purposes, but according to [74, 76] it is very effective in building shallow classifiers, which helps obtain a better representation of the input signal.

3.3 Selective knowledge transfer block: training the \(\mathcal {S}\) classifier

The \(\mathcal {S}\) block aims at building the selective knowledge transfer classifier. It concatenates the confidence values over the Y classes of the \({\mathscr{M}}\) classifiers. The idea behind the \(\mathcal {S}\) block is that each set of its inputs is considered a special feature of the clean input. Hence, we transfer this knowledge, the mj confidence probabilities of clean inputs, to the classifier. Moreover, at inference time, we expect an AE to generate a different distribution of confidence values, and if the AE fools one mj, it may not fool the others.

As Fig. 2a shows, the confidence probabilities of the mj classifiers are concatenated to form the input Q = \(concat(P_{c}^{m_{1}},\) \( P_{c}^{m_{2}}, \dots , P_{c}^{m_{N}})\) of the selective knowledge transfer block \(\mathcal {S}\). The \(\mathcal {S}\) classifier consists of one or more dense layer(s) and yields the selective probability of \(\mathcal {S}\), \(P_{s}^{\mathcal {S}}\), and the confidence probabilities of \(\mathcal {S}\), \(P_{c}^{\mathcal {S}}\). \(\mathcal {S}\) represents a function \(\mathcal {S}:Q \rightarrow Y^{\prime }\) on a distribution \(\mathbb {P}(Q,Y^{\prime })\) over \(\mathcal {Q} \times \mathcal {Y}^{\prime }\). Hence, it optimizes the following loss function

$$ \mathcal{L}_{\mathcal{S}} = \alpha \mathcal{L}_{(\mathcal{S},g_{\mathcal{S}})} + (1-\alpha) \mathcal{L}_{h_{\mathcal{S}}} \text{ , where }\alpha=0.5, $$
(10)

where \({\mathscr{L}}_{(\mathcal {S},g_{\mathcal {S}})}\) is the selective loss function of \(\mathcal {S}\), as discussed in Section 2.2, and \({\mathscr{L}}_{h_{\mathcal {S}}}\) is the auxiliary loss function of \(\mathcal {S}\); they are calculated as follows:

$$ \begin{array}{@{}rcl@{}} &\begin{gathered} \mathcal{L}_{(\mathcal{S},g_{\mathcal{S}})} \triangleq \hat{r}_{\ell}(\mathcal{S},g_{\mathcal{S}}\mid S_{k}) + \lambda {\Psi}(c-\hat{\phi}(g_{\mathcal{S}}\mid S_{k})),\\ {\Psi}(a) \triangleq \max(0,a)^{2}. \end{gathered} \end{array} $$
$$ \mathcal{L}_{h_{\mathcal{S}}} = \hat{r}(h_{\mathcal{S}}\mid S_{k}) = \frac{1}{k}\sum\limits_{i=1}^{k}\ell(h_{\mathcal{S}}(q_{i}),y_{i}). $$

3.4 Detection process in the test time

Once the \({\mathscr{M}}\) and \(\mathcal {S}\) classifiers are trained, they are used with the baseline classifier \(\mathcal {F}\) to detect AEs at inference/test time, as depicted in Fig. 2d. Specifically, the output of the baseline model \(P_{c}^{\mathcal {F}}\), the outputs of the \({\mathscr{M}}\) block, \(P_{s}^{m_{j}}\) and \(P_{c}^{m_{j}}\), and the outputs of the \(\mathcal {S}\) block, \(P_{s}^{\mathcal {S}}\) and \(P_{c}^{\mathcal {S}}\), are used in the ensemble detection process. First, the following thresholds have to be identified:

  • the confidence threshold value

    $$ th_{c}=\max (th_{c}^{\mathcal{S}}, th_{c}^{m_{1}}, th_{c}^{m_{2}}, ..., th_{c}^{m_{N}}) $$

    where \(th_{c}^{m_{j}}\) is the confidence threshold for the selective AEs classifier mj, and \(th_{c}^{\mathcal {S}}\) is the confidence threshold for the \(\mathcal {S}\) classifier.

  • selective threshold \(th_{s}^{m_{j}}\) for each selective AEs classifier mj.

  • selective threshold \(th_{s}^{\mathcal {S}}\) for the \(\mathcal {S}\) classifier.

Following the steps in [29], we select our thresholds using a subset of the clean test samples at a level where at most 10% of clean samples are rejected by the ensemble detection. Once the thresholds are calculated, we run the detection process as follows (a minimal sketch of this procedure is given after the list):

  1. Confidence detection: is set to 1 if \(max(P_{c}^{\mathcal {S}})<th_{c}\) and is set to 0 otherwise, where 1 means an adversarial input.

  2. Selective detection: is set to 1 if \(P_{s}^{\mathcal {S}}<th_{s}^{\mathcal {S}}\) or \(P_{s}^{m_{1}} < th_{s}^{m_{1}}\) or … or \(P_{s}^{m_{N}} < th_{s}^{m_{N}}\) and is set to 0 otherwise.

  3. Mismatch detection: is set to 1 if \(argmax (P_{c}^{\mathcal {S}}) \neq argmax (P_{c}^{\mathcal {F}})\) and is set to 0 otherwise.

  4. Ensemble detection: the input sample is adversarial if it is detected by the confidence, selective, or mismatch detection process.
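The following minimal sketch (our own illustration with hypothetical variable names) summarizes the ensemble decision, assuming the thresholds have already been calibrated on clean test samples at the 10% rejection level:

```python
import numpy as np

def sfad_detect(pc_F, pc_S, ps_S, ps_m, th_c, th_s_S, th_s_m):
    """Return True if the input is flagged as adversarial.

    pc_F, pc_S : class-probability vectors of the baseline model F and of S
    ps_S       : selective score of S
    ps_m       : list of selective scores of the m_j classifiers
    th_*       : thresholds calibrated on clean samples (<= 10% rejection)
    """
    confidence_flag = np.max(pc_S) < th_c
    selective_flag = (ps_S < th_s_S) or any(p < t for p, t in zip(ps_m, th_s_m))
    mismatch_flag = np.argmax(pc_S) != np.argmax(pc_F)
    # Ensemble: adversarial if any of the three modules fires
    return confidence_flag or selective_flag or mismatch_flag
```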

4 Experimental settings

4.1 Datasets

The proposed prototype is evaluated on CNN models trained on two popular datasets: MNIST [45] and CIFAR10 [46].

MNIST is a hand-written digit recognition dataset with 70000 images (60000 for training and 10000 for testing) and ten classes, and CIFAR10 is an object recognition dataset with 60000 images (50000 for training and 10000 for testing) and ten classes.

4.2 Baseline classifiers

For the baseline models, two CNN models are trained: one for MNIST and one for CIFAR10. For MNIST, we trained a 6-layer CNN with 98.73% accuracy, while for CIFAR10 we trained an 8-layer CNN with 89.11% accuracy. The classifier architectures for MNIST and CIFAR10 are shown in Tables 1 and 2, respectively.

Table 1 MNIST baseline classifier architecture
Table 2 CIFAR10 baseline classifier architecture

In order to evaluate the proposed prototypes against gray-box attacks, we consider that the adversaries know the training dataset and the model outputs but do not know the baseline model architectures. Hence, Tables 3 and 4 show the two alternative architectures for the MNIST and CIFAR10 classifiers. For MNIST, the classification accuracies are 98.37% and 98.69% for Model #2 and Model #3, respectively, while for CIFAR10 they are 86.93% and 88.38% for Model #2 and Model #3, respectively.

Table 3 MNIST classifiers architectures for gray-box setting
Table 4 CIFAR10 classifiers architectures for gray-box setting

4.3 SFAD Settings

As described in Section 3 and Fig. 2, we introduce here the implementation details for the detector components.

4.3.1 Selective AEs classifiers block

It consists of autoencoder, up/down-sampling, bottleneck, and noise layers, each with the following architecture:

Autoencoder

As shown in Fig. 3, let the input size be Z × w × h. In the encoding process, the numbers of 3 × 3-kernel filters are set to Z/2, Z/4, and Z/16, respectively. In the decoding process, the number of filters is symmetrically restored to Z. Finally, to maintain the input sample characteristics present before autoencoding, the input is added/summed to the output of the autoencoder.

Fig. 3

Autoencoder architecture
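A minimal Keras sketch of this block (our own illustration of the description above; padding and activation choices are assumptions, and Z is assumed to be large enough to be divided by 16) is:

```python
import tensorflow as tf
from tensorflow.keras import layers

def autoencoder_block(x):
    """Symmetric convolutional autoencoder with a residual connection (Fig. 3)."""
    z = x.shape[-1]                                  # number of input channels Z
    enc = x
    for filters in (z // 2, z // 4, z // 16):        # encoding path
        enc = layers.Conv2D(filters, 3, padding='same', activation='relu')(enc)
    dec = enc
    for filters in (z // 4, z // 2, z):              # symmetric decoding path
        dec = layers.Conv2D(filters, 3, padding='same', activation='relu')(dec)
    # Residual connection: keep the original input characteristics
    return layers.Add()([x, dec])
```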

Up/down-sampling

As shown in Fig. 4, let the input size be Z × w × h. The spatial size is doubled by bilinear up-sampling in the first two consecutive layers and then restored by average pooling in the last two layers. Finally, to maintain the features present before up/down-sampling, the input is added to the output of the up/down-sampling.

Fig. 4

Up/down-sampling architecture
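A corresponding Keras sketch (our own illustration; the residual addition follows the description above) is:

```python
import tensorflow as tf
from tensorflow.keras import layers

def up_down_sampling_block(x):
    """Two bilinear up-sampling steps followed by two average-pooling steps (Fig. 4)."""
    y = layers.UpSampling2D(size=2, interpolation='bilinear')(x)
    y = layers.UpSampling2D(size=2, interpolation='bilinear')(y)
    y = layers.AveragePooling2D(pool_size=2)(y)
    y = layers.AveragePooling2D(pool_size=2)(y)
    # Residual connection: keep the features present before up/down-sampling
    return layers.Add()([x, y])
```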

Bottleneck

It is a three-convolutional-layer module with kernels of sizes 1 × 1, 3 × 3, and 1 × 1. The architecture of the bottleneck layers is shown in Fig. 5. The numbers of filters for the three layers are 1024, 512, and 256, respectively.

Fig. 5

Bottleneck architecture
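A minimal Keras sketch of this block (our own illustration; padding and activation choices are assumptions) is:

```python
import tensorflow as tf
from tensorflow.keras import layers

def bottleneck_block(x):
    """1x1 -> 3x3 -> 1x1 convolutions with 1024, 512, and 256 filters (Fig. 5)."""
    y = layers.Conv2D(1024, 1, padding='same', activation='relu')(x)
    y = layers.Conv2D(512, 3, padding='same', activation='relu')(y)
    y = layers.Conv2D(256, 1, padding='same', activation='relu')(y)
    return y
```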

Noise

For this layer, the GaussianNoise layer from the Keras library is used with a small standard deviation of 0.05.

Dense layers

A dense layer with 512 outputs is used, followed by batch normalization and a ReLU activation function.

SelectiveNet

A dense layer with 512 outputs is used, followed by batch normalization and a ReLU activation function. After that, as the original SelectiveNet implementation suggests, a layer that divides the result of the previous layer by 10 is used as a normalization step. Finally, a dense layer with one output and a sigmoid activation function is used. We set λ = 32, c = 1 for MNIST and c = 0.9 for CIFAR10, and the coverage threshold to 0.995 for MNIST and 0.9 for CIFAR10. More details about the SelectiveNet hyper-parameters can be found in [25].
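A minimal Keras sketch of this selection head (our own illustration of the layer sequence described above) is:

```python
import tensorflow as tf
from tensorflow.keras import layers

def selection_head(features):
    """Selective head g(x): dense -> BN -> ReLU -> divide-by-10 -> sigmoid."""
    g = layers.Dense(512)(features)
    g = layers.BatchNormalization()(g)
    g = layers.Activation('relu')(g)
    g = layers.Lambda(lambda t: t / 10.0)(g)   # normalization step used by SelectiveNet
    return layers.Dense(1, activation='sigmoid', name='selective')(g)
```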

4.3.2 Selective Knowledge Transfer block

It consists of one dense layer with 128 outputs, followed by batch normalization and a ReLU activation function. The selective task of the knowledge transfer block consists of a dense layer with 128 outputs, followed by batch normalization and a ReLU activation function. After that, a normalization layer that divides the result of the previous layer by 10 is used, as recommended by the original implementation of SelectiveNet. Finally, a dense layer with one output and a sigmoid activation function is used. We set λ = 32, c = 1 for MNIST and c = 0.9 for CIFAR10, and the coverage threshold to 0.7 for both MNIST and CIFAR10. More details about the SelectiveNet hyper-parameters can be found in [25].

4.4 Threat model, attacks, and state-of-the-art detectors

4.4.1 Threat model

We follow one of the threat models presented in [47, 77]: the zero-knowledge adversary threat model. It is assumed that the adversary has no knowledge that a detector is deployed and generates the white-box attacks with knowledge of the baseline classifier. For cases in which the adversary has perfect or limited knowledge of the detector, we assume that the adversary's task becomes much harder since SFAD adopts ensemble detection, and hence we leave this as future work. Instead, we test SFAD's robustness with the recommended strong high confidence attack [31, 47], a variant of the CW attack that is rarely tested against other detectors.

4.4.2 Adversarial attacks

We test SFAD against different types of white-box and black-box attacks. For the white-box attacks, we use the FGSM [7], PGD [14], CW [13], and DF attacks, while for the black-box attacks we use the TA [19], PA [18], and ST [17] attacks. For the comparison with state-of-the-art algorithms, more black-box attacks are considered, such as the SA [78] and HopSkipJump [79] attacks. The attack settings are shown in Table 5.

Table 5 Considered adversarial attacks and their parameters

Fast Gradient Sign Attack (FGSM) [7]

It is an \(L_{\infty }\)-norm attack that uses the model gradients to generate the AE. The sign of the gradient with respect to each pixel of the input x is used to build the AE \(x^{\prime }\) as follows:

$$ x^{\prime} = x + \epsilon \text{ sign }(\nabla_{x} \ell(x,y)), \text{ such that } x^{\prime}\in [0,1]^{n} $$
(11)

where 𝜖 is a parameter that controls the perturbation amount such that \(||x^{\prime }-x||_{\infty } < \epsilon \).
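A minimal TensorFlow sketch of this attack (our own illustration; `model` and `loss_fn` are assumed to be a trained classifier and a cross-entropy loss) is:

```python
import tensorflow as tf

def fgsm(model, loss_fn, x, y, eps=0.1):
    """Single-step L-infinity FGSM, as in (11)."""
    x = tf.convert_to_tensor(x)
    with tf.GradientTape() as tape:
        tape.watch(x)
        loss = loss_fn(y, model(x))
    grad = tape.gradient(loss, x)
    x_adv = x + eps * tf.sign(grad)
    return tf.clip_by_value(x_adv, 0.0, 1.0)   # keep x' in [0,1]^n
```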

Projected Gradient Descent (PGD) [14]

It is the iterative version of the FGSM attack. The PGD attack applies FGSM k times and starts from a random perturbation within an Lp-ball around the input sample. It is expressed as:

$$ \begin{gathered} x_{i+1}^{\prime} = x_{i}^{\prime} + \alpha \text{ sign }(\nabla_{x} \ell(x_{i}^{\prime},y)),\\ \text{ such that } x_{1}^{\prime}=x + rand(noise) \text{ , }\\ x_{i+1}^{\prime}\in [0,1]^{n} \text{ , and } i=1 \text{ to } k \end{gathered} $$
(12)

where α is a parameter that controls the step size of the i-th iteration, with 0 < α < 𝜖.
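A corresponding iterative sketch (our own illustration; in addition to (12), it projects the iterate back onto the 𝜖-ball around x, as in the standard PGD formulation) is:

```python
import tensorflow as tf

def pgd(model, loss_fn, x, y, eps=0.1, alpha=0.01, k=40):
    """Iterative L-infinity PGD with random start, following (12)."""
    x = tf.convert_to_tensor(x)
    x_adv = x + tf.random.uniform(tf.shape(x), -eps, eps)   # random start
    x_adv = tf.clip_by_value(x_adv, 0.0, 1.0)
    for _ in range(k):
        with tf.GradientTape() as tape:
            tape.watch(x_adv)
            loss = loss_fn(y, model(x_adv))
        grad = tape.gradient(loss, x_adv)
        x_adv = x_adv + alpha * tf.sign(grad)
        x_adv = tf.clip_by_value(x_adv, x - eps, x + eps)    # project onto eps-ball
        x_adv = tf.clip_by_value(x_adv, 0.0, 1.0)
    return x_adv
```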

Carlini-Wagner (CW) [13]

CW follows the optimization formulation of the L-BFGS attack [6] and replaces the loss function with an objective function:

$$ g(x^{\prime})=\max(\max_{i\neq t}(Z(x^{\prime})_{i}) - Z(x^{\prime})_{t}, -k), $$
(13)

where Z(x′) denotes the model's logits (pre-softmax outputs) and k is the confidence parameter. Hence, CW solves the following optimization problem to build the AE:

$$ \underset{\delta}{\min} ||\delta|| + c g(x^{\prime}), \text{ such that } x^{\prime}\in [0,1]^{n}, $$
(14)

where δ is the amount of perturbation and c is a regularization parameter that is searched over to find the minimum δ.

DF [12]

Given a binary affine classifier \(\mathcal {C} = \{x: f(x)=0\}\), where \(f(x) = w^{T}x + b\), the DF attack defines the orthogonal projection of x0 onto \(\mathcal {C}\) as the minimal perturbation needed to change the classifier's decision, calculated as \(\delta _{*}=-\frac {f(x)}{||w||^{2}}w\). At each iteration, the DF attack solves the following optimization problem

$$ \begin{gathered} \operatorname*{argmin}_{\delta_{i}} ||\delta_{i}||_{2}, \\ \text{ such that } f(x_{i})+\nabla f(x_{i})^{T}\delta_{i} =0 \end{gathered} $$
(15)

and these perturbations are accumulated to get the final perturbation.

PA and TA [19]

PA is an L0-norm black-box attack that uses the Differential Evolution (DE) [80] algorithm to solve the optimization problem:

$$ \underset{\delta}{\text{max }} f(x+\delta) \text{ , such that } ||\delta||_{0} \leq d $$
(16)

where d is a small number, equal to one in the case of the one-pixel attack. TA generalizes (16) to an \(L_{\infty }\)-norm attack to solve the optimization problem.

ST [17]

ST applies translation and rotation changes to the input samples in order to fool the model and solves the optimization problem:

$$ \underset{\delta u,\delta v, \theta}{\text{max }} \ell(f(x^{\prime}), y) \text{ , for } x^{\prime}=T(x;\delta u,\delta v, \theta) $$
(17)

where T, δu, δv, and 𝜃 are the transform function, the x-coordinate translation, the y-coordinate translation, and the rotation angle, respectively.

SA [78]

In order to generate perturbation δ, SA, in each iteration, selects colored 𝜖-bounded localized square shaped updates at random positions using random search strategy. Hence, it solves the optimization problem:

$$ \underset{x^{\prime} \in [0,1]^{n}}{\text{min }} \ell(f(x^{\prime}), y) \text{ , such that } ||\delta||_{p} \leq \epsilon $$
(18)

where \(\ell (f(x^{\prime }), y)=f_{y}(x^{\prime })-\max \limits _{k\neq y}f_{k}(x^{\prime })\). \(f_{y}(x^{\prime })\) and \(f_{k}(x^{\prime })\) are the prediction probability scores of \(x^{\prime }\) for y and k classes, respectively.

HopSkipJump attack [79] (HSJA)

HopSkipJump is a decision-boundary-based black-box attack that relies on estimating the gradient direction. It starts from a large adversarial perturbation δ and moves towards the decision boundary of the clean input's class by minimizing ||δ||2.

4.4.3 Comparison with existing detectors

State-of-the-art supervised and unsupervised detectors are compared with SFAD. Supervised methods like KD+BU [22], LID [48], and RAID [64] are compared with SFAD, while unsupervised methods like FS [30], MagNet [65], NIC [35], and DNR [29] are also considered in the comparisons. A brief summary of each detector is given here:

KD+BU [22]

It depends on building a binary classifier using two main features. The first is the uncertainty feature, estimated using the Monte Carlo dropout technique [21]. The second depends on the kernel density estimation of each class in the training data.

RCE [81]

It depends on measuring the kernel density as in [22]. Instead of using the baseline classifier to measure the density functions, Pang et al. [81] measure the density functions using a more robust classifier trained with the reverse cross-entropy technique.

LID [48]

Instead of measuring the kernel density, Ma et al. in [48] used Local Intrinsic Dimensionality (LID) to calculate the distance distribution of the input sample to its neighbors.

RAID [64]

It depends on measuring the differences in neuron activation values between clean inputs and AEs and then building a binary classifier with these features.

FS [30]

It depends on the feature squeezing approach, which transforms the input samples using squeezers. It uses color bit-depth reduction, local smoothing with a median filter, and non-local smoothing with a non-local means denoiser. To determine the adversarial status of an input, the distance between the confidences of the clean input and of its squeezed version is calculated and compared with a threshold.

MagNet [65]

First, it trains denoisers using clean training data. Then, it either 1) calculates the reconstruction error between the input and its denoised version, or 2) measures the distance between the predictions of an input sample and of its denoised version to determine the adversarial status of the input.

NIC [35]

It observes the behavior of clean training data in the intermediate DL model layers. Specifically, it observes the provenance channel and the activation value distribution channel. The provenance channel describes the instability of the set of activated neurons in the next layer when small changes are present in the input sample, while the activation value distribution channel describes the changes in the activation values of a layer. For each individual layer, one-class classifiers (OCCs) are built to model the in-distribution training data. A final one-class classifier that joins all the one-class classifiers' outputs is used to determine the adversarial status of an input.

DNR [29]

In this detector, Sotgiu et al. [29] use the N-last representative layer outputs of the baseline classifier to build N SVM classifiers with RBF kernels. The outputs of these classifiers are then combined to build a joint SVM-RBF classifier. To determine the adversarial status of an input, the detector checks whether the maximum confidence probability is less than a predefined threshold.

5 SFAD performance evaluation

In this section, we evaluate the performance of the SFAD prototype 1) against different types of attack scenarios and datasets, 2) against the strong high confidence attack, and 3) in comparison with state-of-the-art detectors. As a reminder, we use only the last three representative layers (N = 3) to build the three selective AE classifiers, since the aim is to prove the concept of the approach; if the best combination is used instead, the detector accuracy will improve accordingly.

5.1 Performance under white, black, and gray boxes attacks

5.1.1 Zero-Knowledge (of detectors) adversary white-box attacks

Table 6 shows the performance evaluation of the SFAD prototype on the MNIST and CIFAR10 datasets. It also shows the baseline DNN prediction accuracy on the AEs in the “Baseline DNN” row and on the undetected AEs in the “Prediction” row. The “Total” row is the combined accuracy of ensemble detection and correctly classified/predicted samples.

Table 6 SFAD’s performance accuracy (%) against white-box attacks(𝜖) on MNIST and CIFAR10 datasets at FP= 10%

For the MNIST dataset, the FGSM attacks with small epsilon (𝜖 = 0.05, 0.075, and 0.1) only slightly fool the baseline classifier, and hence their feature space is still inside, or at the border of, that of the training dataset. The detector shows its ability to reject those samples that are very close to the class borders and achieves accuracies of 99.96%, 99.88%, and 99.62% for 𝜖 = (0.05, 0.075, and 0.1), respectively. A similar observation holds for PGD attacks with small 𝜖 values. For larger 𝜖 values, and for the DF and CW attacks, the AEs are highly able to fool the baseline classifier, since adversaries are able to move the MNIST test samples’ feature space outside the corresponding class border; hence, for all tested attacks except PGD, the model catches them with accuracy above 98.65%, while the detector achieves 68.09% and 58.93% for PGD attacks with 𝜖 = (0.2 and 0.4), respectively. The feature space of some PGD examples becomes indistinguishable from that of the training samples, which prevents SFAD from catching all AEs; to enhance SFAD’s performance, the best combination of representative layers has to be used as input for the detector.

For the CIFAR10 dataset, SFAD achieves results comparable with state-of-the-art methods for FGSM (𝜖 = 0.1, 0.2, and 0.4), DF, and CW attacks, while for FGSM (𝜖 = 0.05 and 0.075) and PGD attacks, the AEs have, to some extent, a feature space indistinguishable from that on which the detector is trained. On average, the model achieves an accuracy of 65.2% for PGD attacks.

For both datasets, the effectiveness of the selective, confidence, and mismatch detection modules is obvious, as shown in Table 7. The ability of the modules to detect the AEs increases as the amount of perturbation increases. When the amount of perturbation increases to the point where the adversarial samples’ feature space becomes indistinguishable from that of the training dataset, the ability of these modules to detect the AEs decreases.

Table 7 SFAD’s performance accuracy (%) of different detection processes against white-box attacks(𝜖) on MNIST and CIFAR10 datasets at FP= 10%

5.1.2 Black-box attacks

Table 8 shows the SFAD prototype’s detection accuracy against the TA [19], PA [18], and ST [17] attacks on the MNIST and CIFAR10 datasets. The detector catches the AEs with very high accuracy, higher than 97.56% and 93.97% for MNIST and CIFAR10, respectively. It is clear that the selective, confidence, and mismatch modules complement each other. The black-box attacks significantly change the sample features, which facilitates the confidence module’s detection process, while the ability of the selective module is limited for TA and PA attacks, since these attacks change one or more pixels within a threshold that lies within the variation of the input sample, yielding AEs that are very close to clean samples. Similar to the white-box attacks, the effectiveness of the selective, confidence, and mismatch detection is obvious for both datasets, as shown in Table 9.

Table 8 SFAD’s performance accuracy (%) against black-box attacks on MNIST and CIFAR10 datasets at FP= 10%
Table 9 SFAD’s performance accuracy (%) of different detection processes against black-box attacks on MNIST and CIFAR10 datasets at FP= 10%

5.1.3 Gray-box attacks

The gray-box scenario assumes that the adversary has knowledge only of the model’s training data and of the output of the DNN model. Hence, we trained two substitute models, named Model#2 and Model#3, for MNIST and CIFAR10, as shown in Tables 3 and 4, respectively. Then, white-box AEs are generated using the substitute models, and the SFAD prototype is tested against these AEs. For both datasets, Tables 10 and 11 show that the perturbation properties generated from one model transfer to the tested model, Model#1. For MNIST (see Table 10), SFAD’s prediction rate is much better for PGD attacks, and the prediction rate for the other attacks is comparable with the prediction rate for AEs generated from Model#1. For CIFAR10 (see Table 11), the prediction rate for CW and DF attacks is higher than for those attacks generated using Model#1, while the prediction rate for FGSM is comparable with that for FGSM attacks generated using Model#1. Unlike the other attacks, the PGD attacks’ transferability appears to be much stronger, with a feature space different from that of the AEs generated from Model#1. This reduces the ability of the detector to catch such attacks.

Table 10 SFAD’s performance accuracy (%) against gray-box attacks(𝜖) on MNIST dataset at FP= 10%
Table 11 SFAD’s performance accuracy against gray-box attacks(𝜖) on CIFAR10 dataset at FP= 10%

5.2 Robustness against high confidence attack

In [31], ten defenses and detectors were broken using Backward Pass Differentiable Approximation (BPDA), Expectation Over Transformation (EOT), and the High Confidence Attack (HCA). BPDA and EOT are appropriate for defense techniques, while HCA is used to defeat detectors. HCA is a variant of the CW attack and generates adversarial examples with a high confidence level. In [31], LID was broken using HCA. In this experiment, we generate AEs using HCA with 𝜖 = 0.3125 for MNIST and 𝜖 = 0.031 for CIFAR10. The results show that SFAD is fully robust on MNIST against HCA and partially robust (57.76%) on CIFAR10. Our analysis finds that the confidence and selective detection methods are effective in detecting these AEs. In case the confidence level of the attack is increased, SFAD can be fine-tuned by selecting the proper layer outputs to build the selective AE classifiers.

All the experiments conducted in this work are performed under zero knowledge of the detector. We assume that the adversary’s task of building an adaptive attack to fool SFAD is very hard since it ensembles three detection methods. Despite that, SFAD’s performance will drop if the adversary is able to craft customized perturbations that fool both the baseline classifier and the ensemble detector.

5.3 Comparisons with the state-of-the-art detectors

In this subsection we present a comparison with different types of supervised and unsupervised detectors using the detectors benchmark [82]; the results are shown in Table 12. We compare the average FGSM and PGD results. For fair comparisons, 𝜖 values of 0.125, 0.25, and 0.3125 are set for the MNIST dataset, while for CIFAR10 they are set to 0.03 and 0.06. Moreover, the supervised detectors are trained and tested separately against each adversarial attack algorithm. As discussed in Section 6.4, the rejection/false positive rate of SFAD can be decreased with a small compromise in performance.

Table 12 Detection accuracies for the state-of-the-art detectors against white-box and black-box attacks

KD+BU [22]

The KD+BU detector is a combination of kernel-density and Bayesian-uncertainty based classifiers. For both datasets, the results show that SFAD outperforms the KD+BU detector against all tested attacks except PGD attacks on the CIFAR10 dataset. In fact, KD+BU needs clean (not noisy) images and adversarial images to accurately train the detector to identify the boundaries between clean and adversarial inputs.

RCE [81]

Compared to KD [22], RCE achieves better performance since its classifier yields latent representations that better distinguish AEs from normal examples. For both datasets, MNIST and CIFAR10, reverse cross entropy (RCE) yields better area-under-the-curve scores than KD, while it shows limited performance against the basic iterative method (BIM) [11] and the HCA.

LID [48]

SFAD outperforms LID on both datasets and all tested attacks except PGD attacks on CIFAR10. LID achieves a better false positive rate than SFAD but fails against the High Confidence Attack, as reported in [47]. When LID is trained for HCA attacks, it achieves better results than in [47]. Our approach provides full and partial robustness against HCA for MNIST and CIFAR10, respectively. Similar to KD+BU, LID needs clean (not noisy) images and adversarial images to accurately train the detector to identify the boundaries between clean and adversarial inputs.

RAID [64]

For the MNIST dataset, RAID achieves a higher detection rate for PGD attacks (𝜖 = 0.3), and for CIFAR10 it achieves a higher detection rate against FGSM and PGD attacks, while our approach improves the performance against CW and DF attacks. Besides, RAID has a better false positive rate for MNIST only. RAID trains on clean and adversarial inputs to identify differences in neuron activation between clean and adversarial samples. Hence, it requires extensive knowledge of attacks and their variants to enhance its performance.

FS [30]

As stated in [30], FS requires high-quality squeezers for different baseline networks, and it was shown that FS does not perform well against the tested attacks on the CIFAR10 dataset, while our approach generalizes better than FS at the expense of a higher false positive rate.

MagNet [65]

The results reported in Table 12 are for the detection process of MagNet; the defense process of MagNet is not considered. For MNIST, our approach achieves comparable results, except for the CW and ST attacks, where SFAD achieves better performance. For CIFAR10, our approach outperforms MagNet against the tested attacks. Since MagNet is a denoiser-based detector, it is not guaranteed that the denoisers will remove all the noise and produce highly denoised inputs that respect the target threshold. This applies specifically to L0 and L2 attacks. In contrast, our approach relies on the confidence-value changes that AEs cause, which makes it able to identify AEs. Although MagNet yields a lower false positive rate, it was shown in [32] that MagNet can be broken by different strategies.

NIC [35]

NIC is the state-of-the-art detector that, in general, achieves better performance against white-box attacks than other detectors, while our approach achieves better performance against the tested black-box attacks. Unlike the proposed approach, other works reported that the NIC baseline detectors are not consistent [33], increase the model parameter overhead [34], are time consuming [35], and introduce latency at inference time [36].

DNR [29]

DNR adopts confidence-based detectors and is close to our approach, but we additionally include the feature processing and selective modules. The reported results show that our approach outperforms DNR at the same false positive rates for the MNIST and CIFAR10 datasets.

Other performance comparisons

SFAD has a middle complexity level due to the classifiers’ training times and introduces no inference-time latency, but it incurs an overhead for storing the classifiers’ parameters. Compared to other detectors, SFAD introduces shallow networks; hence, compared to NIC, DNR, and LID, our detector has much less complexity. Besides, it works in parallel with the baseline classifier, so no latency is introduced, in contrast to FS and NIC. Finally, like NIC and DNR, SFAD has to pay a small price in terms of overhead compared to MagNet, FS, and LID.

6 Other experimental results and discussion

In this section, more performance analysis is presented in order to validate the SFAD prototype. First, we evaluate SFAD against successful attacks only. Then, the proposed approach is tested with different N settings. Moreover, in order to emphasize the advantages of SFAD’s feature processing components, we provide an ablation study for each component. Finally, performance results at different rejection rates, i.e. false positive rates, are shown.

6.1 Performance on successful attacks only

Table 13 shows the detection rate against only those AEs that fooled the baseline DNN classifier, under white-box and black-box scenarios. For MNIST, in general, results comparable with the state-of-the-art detectors are achieved for all tested white-box and black-box attacks (> 96.91%), except for the PGD attacks (83.88%).

Table 13 Detection modules’ accuracies (%) against successful white-box and black-box attacks(𝜖) on MNIST and CIFAR10 datasets at FP= 10%

For both datasets, the impact of the selective, confidence, and mismatch detection modules is obvious. The ability of the modules to detect AEs increases as the amount of perturbation increases. When the amount of perturbation increases to the point where the adversarial samples’ feature space becomes indistinguishable from that of the training dataset, the ability of these modules to detect AEs decreases. Mismatch detection has a high impact in the detection process of AEs, except for PGD attacks. Once the amount of crafted perturbation becomes high, the performance of mismatch detection decreases, because the behavior of the detector classifiers and of the baseline DNN classifier becomes inconsistent for highly degraded inputs.

6.2 Results with N last layer(s) output(s)

The results shown in Fig. 6 support the conclusion of [29], which recommends using more than one of the last layers of the baseline DNN classifier in detection techniques. For the MNIST dataset, the benefit of using more than one layer appears in detecting PGD (𝜖 = 0.2 and 0.4), TA, and PA attacks, while for CIFAR10 it appears for all tested attacks. This means that low- and medium-level hidden layers hold features that are triggered when small perturbations are added to input samples.

Fig. 6

Total model performance accuracy (%) for black-box and white-box scenarios on the MNIST and CIFAR10 datasets at FP= 10% with different numbers N of selective AE classifiers

6.3 Ablation study

In this section, we emphasize the advantages of SFAD’s feature processing components, including the noise, autoencoder, up/down-sampling, and bottleneck blocks. Tables 14 and 15 show the performance results for each block, once when it is present alone and once when it is absent, for the MNIST and CIFAR10 datasets. In all settings, SelectiveNet is present in the selective AE classifiers and in the selective knowledge transfer classifier.

Table 14 Ablation performance (%) on white-box scenarios for MNIST dataset
Table 15 Ablation performance (%) on white-box scenarios for CIFAR10 dataset

Only NN

When all processing blocks are absent, the MNIST results show the ability to detect FGSM, PGD with small 𝜖 values, and CW attacks slightly better than the proposed approach, while the proposed approach yields better results for DF and for PGD with high 𝜖 values. Since the CIFAR10 dataset is different from MNIST and has different characteristics, the NN-only configuration does not yield better results against FGSM with high 𝜖 values, PGD, CW, and DF attacks.

Noise

When only the noise block is used, the model achieves results comparable to SFAD except against PGD attacks. When we remove the noise block, the performance of SFAD is reduced, especially against PGD attacks, for the MNIST and CIFAR10 datasets. The noise block helps the detector better distinguish the feature space of clean input images from the features of AEs.

Autoencoder

The autoencoder block has a substantial impact in the proposed approach. As discussed in Section 3.2.1, if the autoencoder cannot reconstruct its input, a different feature space may be generated for the input signal, which lets SFAD detect the AEs. For the MNIST dataset, the autoencoder block enhances the performance compared to the NN-only model against PGD with higher 𝜖 values, while the performance is reduced when the autoencoder block is removed from the proposed approach. On the other hand, for the CIFAR10 dataset, when only the autoencoder is present, the performance is much better against FGSM with high 𝜖 values, PGD, CW, and DF attacks compared to the NN-only model. The performance against PGD attacks is reduced when the autoencoder is removed from the proposed approach.

Up/down-sampling

Unlike the other processing blocks, the up/down-sampling block yields lower performance against FGSM attacks and comparable results against the other attacks relative to the NN-only model. This is because the up/down-sampling restores the global information of the input signal through the average pooling process. On the other hand, removing the sampling block from the proposed approach reduces the performance, especially for the CIFAR10 dataset.

Bottleneck

Like the autoencoder block, the bottleneck block shows its ability to distinguish input signal characteristics, especially in the proposed shallow classifiers (the selective AE classifiers). Compared to the NN-only model, the bottleneck-only model enhances the performance against FGSM with high 𝜖 values, PGD, CW, and DF attacks for the CIFAR10 dataset, and against PGD with high 𝜖 values for MNIST. Besides, the performance of the proposed approach is significantly decreased for the CIFAR10 dataset when the bottleneck block is removed.

6.4 Performance with different rejection rates (False positive (FP))

In this subsection we show the performance of the proposed approach when the thresholds are set to reject less than 10% of clean samples for MNIST, as shown in Fig. 7. The results show that an acceptable performance can be achieved if the thresholds are set below 10%. For instance, when the false positive rate is set to 2%, the results against PGD (𝜖 = 0.2 and 0.4) attacks are significantly decreased because of the selective detection. For all other tested attacks, the difference is at most 4% and 1.76% when FP= 2% and 3%, respectively.

Fig. 7

Performance comparisons between different False Positive (FP) rates and FP= 10% of SFAD for white-box attacks on MNIST dataset

7 Conclusion

In this work, we have proposed a novel unsupervised ensemble mechanism, namely SFAD, to detect adversarial attacks. SFAD processes the N-last layer outputs of the baseline DNN classifier to identify AEs. It builds N selective AE classifiers, each taking one layer output of the baseline classifier as input and processing it with autoencoder, up/down-sampling, bottleneck, and additive-noise blocks. These feature-based classifiers are then optimized in the SelectiveNet model to estimate the model’s uncertainties and confidences. The confidence values of these classifiers are then distilled as input to the selective knowledge transfer classifier to build the last classifier. Selective and confidence thresholds are set to identify adversarial inputs. The selective, confidence, and mismatch modules work jointly to enhance the detection accuracy. We showed that the model is consistent and is able to detect the tested attacks. Moreover, the model is robust in different attack scenarios: white-, black-, and gray-box attacks. This robustness, together with the fact that the model does not require any knowledge of adversarial attacks, leads to better generalization. The main limitation of the model is that the best combination of N layers needs to be identified to enhance the detection accuracy and to reduce the false positive rate.