1 Introduction

DL has achieved remarkable advances in many fields of human life, especially in computer vision tasks such as object detection, image classification [1,2,3], surveillance [4], and medical imaging [5]. Despite this progress, DL models have been found to be vulnerable to adversaries [6, 7]. In image classification, for instance, an adversary can generate AEs by adding small, human-imperceptible perturbations to an input image that cause the DL model to misclassify it. This threat affects security-critical DL-based applications [8] such as self-driving cars.

Adversaries can generate AEs under white-box, black-box, and gray-box attack scenarios [9, 10]. In the white-box scenario, the adversary knows everything about the DL model, including its inputs, outputs, architecture, and weights. Hence, the adversary is guided by the model gradient and generates AEs by solving an optimization problem [7, 11,12,13,14]. In the black-box scenario, the adversary knows nothing about the model but leverages the transferability property [15] of AEs and the content of the input. By sending queries to the model, the adversary can craft small perturbations that are harmonious with the input image [16,17,18,19]. In the gray-box scenario, the adversary knows only the inputs and outputs of the model; hence, the adversary substitutes the original model with an approximated model and then uses its gradient, as in the white-box scenario, to generate AEs.

Researchers have paid attention to this threat, and several methods have been proposed to detect or defend against AEs. More details about defense and detection methods can be found in Section 2.

A DL model’s uncertainty is one of the main signals used to determine whether an input sample belongs to the training manifold. The uncertainty is usually measured by adding randomness to the model using the Dropout technique [20, 21]. It has been found that predictions of clean samples do not change when randomness is added, while predictions of AEs do. Feinman et al. [22] proposed the BU metric, which uses Monte Carlo dropout to estimate the uncertainty and detect AEs that are near the class manifolds, while Smith et al. [23, 24] used a mutual-information method to estimate the uncertainty. The prediction risk of these methods is higher compared to the more recent uncertainty method, SelectiveNet [25], that is used in this work. On the other hand, it was shown in [26] that the predicted class probabilities, i.e. the model’s confidence, of in-distribution samples are higher than those of out-of-distribution samples. The model’s confidence was used in [26,27,28,29] to implement AE detectors. Uncertainty- and confidence-based detectors have shown limited success against black- and gray-box attacks; they are usually threshold-based detectors, as shown in Fig. 1a. To enhance detectors’ performance, one recommendation is to provide ensemble detection methods, as shown in Fig. 1b. Although state-of-the-art detectors achieve promising results, they may suffer from one or more limitations: not performing well with some known attacks [30], being broken by attackers [31, 32], inconsistent performance of baseline detectors [33], increased model parameter overhead [34], being time consuming [35], or introducing latency [36] at inference time.

Fig. 1

a High-level architecture of uncertainty/confidence-based detectors. b High-level architecture of uncertainty/confidence-based ensemble detectors. The input sample is passed to the CNN model for class prediction. The detector, i.e. the uncertainty method, estimates the uncertainty of the input sample using the model's hidden layers. Using a predefined threshold, the input sample is deemed not adversarial if the uncertainty score exceeds the threshold

In this paper, in order to mitigate the aforementioned limitations, we revisit the model’s uncertainty and confidence to propose a novel ensemble AE detector that has no knowledge of AEs, i.e. an unsupervised detector, as shown in Fig. 2. The proposed method has the following attributes: 1) it investigates SelectiveNet’s capability in detecting adversarial examples, since it measures the uncertainty with less risk. To the authors’ knowledge, SelectiveNet [25] has not previously been used in adversarial attack detection models. 2) Unlike other detectors [29, 37, 38], the proposed method uses the model’s last N-layer outputs, i.e. feature maps, to build N CNNs \({\mathscr{M}}\) that include different processing blocks, such as up/down sampling, auto-encoders [39, 40], noise addition [41,42,43], and a bottleneck layer [44], which make the representative data of the last layers more unique to the input data distribution and yield better model confidence. To reduce the effect of white-box attacks, the output of \({\mathscr{M}}\) is transferred/distilled to build the last CNN \(\mathcal {S}\). 3) The proposed model ensembles the proposed detection techniques to provide the final detector. This step greatly reduces the adversary’s ability to craft perturbations that fool the detector, since every detection technique has to be fooled. We name the proposed method Selective and Feature based Adversarial Detection (SFAD). The high-level architecture of SFAD is illustrated in Fig. 1b.

Fig. 2

a SFAD’s architecture. The N-last representative outputs of the DNN are used to build N Selective Adversarial Example classifiers. The confidence outputs, i.e. prediction probabilities, of the N classifiers are concatenated as input to the Selective Knowledge Transfer classifier. b Feature map processing blocks. c SelectiveNet architecture [25]. d Detection process: selective probabilities (\(P_{s}^{m_{j}}\) and \(P_{s}^{\mathcal {S}}\)) are used in the selective detection process. Confidence/prediction probabilities (\(P_{c}^{m_{j}}\) and \(P_{c}^{\mathcal {S}}\)) are used in the confidence detection process. Confidence/prediction probabilities (\(P_{c}^{\mathcal {F}}\) and \(P_{c}^{\mathcal {S}}\)) are used in the mismatch detection process. The total detection is the ensemble of the three detection modules

A prototype of SFAD is tested under white-box, black-box, and gray-box attacks on MNIST [45] and CIFAR10 [46]. Under white-box attacks, the experimental results show that SFAD detects AEs with an accuracy of at least 89.8% (many attacks with 99%) for all tested attacks except the PGD attack [14], for which it achieves at least 65% detection accuracy on average. For black- and gray-box attacks, SFAD shows better performance than the other tested detectors. Finally, SFAD is tested under the HCA [47] and is shown to be fully (100%) robust on MNIST and partially (57.76%) robust on CIFAR10. SFAD sets its thresholds to reject 10% of clean images. Moreover, comparisons with state-of-the-art methods are presented. Hence, our key contributions are:

  • We propose a novel unsupervised ensemble model for AE detection. Ensemble detection makes SFAD more robust against white-box and adaptive attacks.

  • We investigate the capability of SelectiveNet, as an uncertainty model, in detecting AEs.

  • We show that, by processing the feature maps of the last N layers, we can build classifiers with a better confidence distribution. We provide ablation experiments to study the impact of the feature processing blocks.

  • The SFAD prototype proves the concept of the approach and leaves the door open for future work to find the best N layers and the best combination of N (or M) CNNs to build the detector’s classifiers.

  • Unlike the tested state-of-the-art detectors, the SFAD prototype shows better performance under gray- and black-box attacks. The SFAD prototype is fully robust on MNIST and partially robust on CIFAR10 when attacked with HCAs. For instance, the Local Intrinsic Dimensionality (LID) method [48] reported very high detection accuracy on the tested attacks but fails against HCAs [31, 47].

2 Related work

2.1 Detection methods

Defense techniques like adversarial training [7, 14, 49, 50], feature denoising [51,52,53], pre-processing [54, 55], and gradient masking [56,57,58,59] try to make the model robust against the attacks so that it correctly classifies AEs. On the other hand, detection methods provide the adversarial status of the input image. Detection techniques can be classified, according to whether AEs are present in the detector learning process, into supervised and unsupervised techniques [33]. In supervised detection, detectors include AEs in the learning process. Many approaches exist in the literature. In the feature-based approach [38, 60,61,62,63], detectors use clean and adversarial inputs to build their classifier models from scratch, either from raw image data or from the representative layer outputs of a DNN model. For instance, in [38], the detector quantizes the last ReLU activation layer of the model and builds a binary classifier. As reported in [38], this detector is not robust enough and is not tested against strong attacks like the Carlini-Wagner (CW) attacks. The work in [61] added a new adversarial class to the NN model and trained the model from scratch with clean and adversarial inputs; this architecture reduces the model accuracy [61]. In the concurrent recent work [63], Wang et al. used the saliency map features of clean and adversarial examples to train the detector’s classifier. In the statistical-based approach [22, 48], detectors perform statistical measurements to define the separation between clean and adversarial inputs. In [22], KD estimation, BU, and combined models are introduced. The kernel-density feature is extracted from clean samples and AEs in order to identify AEs that are far away from the data manifold, while the Bayesian uncertainty feature identifies AEs that lie in low-confidence regions of the input space. The LID method is introduced in [48] as a distance distribution of the input sample to its neighbors to assess the space-filling capability of the region surrounding that input sample. The works in [31, 47] showed that these methods can be broken. Finally, the network invariant approach [62, 64] learns the differences in neuron activation values between clean input samples and AEs to build a binary NN detector. The main limitation of this approach is that it requires prior knowledge of the attacks and hence might not be robust against new or unknown attacks.

On the other hand, in unsupervised detection, detectors are trained with clean images only to identify AEs. This is also known as the prediction-inconsistency approach, since it relies on the fact that AEs might not fool every NN model. The input feature space is limited, and the adversary takes advantage of this to generate AEs; hence, unsupervised detectors try to reduce the limited input feature space available to adversaries. Many approaches have been presented in the literature. The Feature Squeezing (FS) approach [30] measures the distance between the predictions of the input and of the same input after squeezing. The input is adversarial if the distance exceeds a threshold. The work in [30] squeezes out unnecessary input features by reducing the color bit depth of each pixel and by spatial smoothing of adversarial inputs. As reported in [30], FS does not perform well with some known attacks like FGSM. Instead of squeezing, denoising-based approaches, like MagNet [65], measure the distances between the predictions of input samples and of denoised/filtered input samples. It was found in [32, 53] that MagNet can be broken and does not scale to large images. Recently, a network invariant approach was introduced in [35]. The authors proposed the NIC method, which builds a set of models for individual layers to describe the provenance and the activation value distribution channels. It was observed that AEs affect these channels. The provenance channel describes the instability of the set of activated neurons in the next layer when small changes are present in the input sample, while the activation value distribution channel describes the changes in the activation values of a layer. The reported performance of this method showed its superiority over other state-of-the-art models, but other works reported that the baseline NIC detectors are not consistent [33], increase model parameter overhead [34], are time consuming [35], and increase the latency at inference time [36].

Uncertainty-based detectors

Uncertainty-based detection follows the observation that the prediction of a clean image remains correct under many dropout configurations, while the prediction of an AE changes. Based on this, Feinman et al. [22] proposed the BU metric, which uses Monte Carlo dropout to estimate the uncertainty and to detect AEs that are near the class manifolds, while Smith et al. [23] used a mutual-information method for the same task. In [24], Sheikholeslami et al. proposed an unsupervised detection method that provides a layer-wise minimum-variance solver to estimate the model’s uncertainty for in-distribution training data. Then, a mutual-information-based threshold is identified.

Confidence-based detectors

Aigrain et al. [27] built a simple NN detector that uses the model’s logits on clean samples and AEs to build a binary classifier. Inspired by the hypothesis that, for a given perturbed image, different models yield different confidences, Monteiro et al. [28] proposed a bi-model mismatch detection method. The detector is a binary RBF-SVM classifier that takes as input the outputs of two classifiers on clean samples and AEs. On the other hand, Sotgiu et al. [29] proposed an unsupervised detection method that uses the last N representative layer outputs of the classifier to build three SVM classifiers with RBF kernels. The confidence probabilities of the SVMs are combined to build a final SVM-RBF classifier. Then, a threshold is identified to reject inputs whose maximum confidence probability is too low.

2.2 SelectiveNet as an uncertainty model

Let \(\mathcal {X}\) be an input space, e.g. images, and \(\mathcal {Y}\) a label space. Let \(\mathbb {P}(X,Y)\) be the data distribution over \(\mathcal {X} \times \mathcal {Y}\). A model \(f:X \rightarrow Y\) is called a prediction function, and \( \ell : Y \times Y \rightarrow \mathbb {R}^{+}\) is a given loss function. A labeled set \(S_{k} = \{(x_{i}, y_{i})\}_{i=1}^{k} \subseteq (\mathcal {X} \times \mathcal {Y})^{k}\) is sampled i.i.d. from \(\mathbb {P}(X,Y)\), where k is the number of training samples. The true risk of the prediction function f w.r.t. \(\mathbb {P}\) is \(R(f) \triangleq \mathbb {E}_{\mathbb {P}(X,Y)}[\ell (f(x),y)]\), while the empirical risk of the prediction function f is \(\hat {r}(f\mid S_{k}) \triangleq \frac {1}{k} {\sum }_{i=1}^{k} \ell (f(x_{i}),y_{i})\).

Here, we briefly describe SelectiveNet as presented in [25]. A selective model is a pair (f,g), where f is a prediction function and \(g : \mathcal {X} \rightarrow \{0,1\}\) is a binary selection function for f,

$$ (f,g)(x) \triangleq \begin{cases} f(x), & \text{if } g(x) = 1; \\ \text{don't know}, & \text{if } g(x) = 0. \end{cases} $$
(1)

A soft selection function can also be considered, where \(g : \mathcal {X} \rightarrow [0,1]\), hence, the value of (f,g)(x) is calculated with the help of a threshold τ as expressed in the following equation

$$ (f,g)(x) \triangleq \begin{cases} f(x), & \text{if } g(x) \geq \tau; \\ \text{don't know}, & \text{if } g(x) < \tau. \end{cases} $$
(2)

The performance of a selective model is calculated using coverage and risk. The true coverage is defined to be the probability mass of the non-rejected region in \(\mathcal {X}\) and calculated as

$$ \phi(g) \triangleq E_{P}[g(x)], $$
(3)

while the empirical coverage is calculated as

$$ \hat{\phi}(g\mid S_{k}) \triangleq \frac{1}{k} \sum\limits_{i=1}^{k}g(x_{i}) $$
(4)

The true selective risk of (f,g) is

$$ R(f,g) \triangleq \frac{E_{P}[\ell (f(x),y)g(x)]}{\phi(g)}, $$
(5)

while the empirical selective risk is calculated for any given labeled set Sk as

$$ \hat{r}(f,g\mid S_{k}) \triangleq \frac{\frac{1}{k}{\sum}_{i=1}^{k}\ell (f(x_{i}),y_{i})g(x_{i})}{\hat{\phi}(g\mid S_{k})}. $$
(6)

Finally, for a given coverage rate 0 < c ≤ 1 and Θ, a set of parameters for a given deep network architecture for f and g, the optimization problem of the selective model is expressed as:

$$ \begin{array}{@{}rcl@{}} &\begin{aligned} \theta^{\ast} = \operatorname*{arg min}_{\theta \in {\Theta}} (R(f_{\theta},{g}_{\theta}))\\ \textit{s.t. } \phi({g}_{\theta}) \geq c, \end{aligned} \end{array} $$
(7)

and can be solved using the Interior Point Method (IPM) [66] to enforce the coverage constraint. This yields the unconstrained loss objective function over the samples in Sk,

$$ \begin{array}{@{}rcl@{}} &\begin{aligned} \mathcal{L}_{(f,g)} \triangleq \hat{r}_{\ell}(f,g\mid S_{k}) + \lambda {\Psi}(c-\hat{\phi}(g\mid S_{k}))\\ {\Psi}(a) \triangleq \max(0,a)^{2}, \end{aligned} \end{array} $$
(8)

where c is the target coverage, λ is a hyper-parameter controlling the relative importance of the constraint, and Ψ is a quadratic penalty function. As a result, SelectiveNet is a selective model (f,g) that optimizes both f(x) and g(x) in a single model in a multi-task setting, as depicted in Fig. 2c. For more details about the SelectiveNet model, readers are referred to [25].
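To make the training objective in (8) concrete, the following minimal sketch (our own TensorFlow/Keras illustration, not the reference implementation of [25]) computes the selective loss for one batch, assuming `y_true` holds one-hot labels, `f_out` holds the class probabilities of the prediction head f, and `g_out` holds the soft selection scores g(x) in [0,1]:

```python
import tensorflow as tf

def selective_loss(y_true, f_out, g_out, coverage=0.9, lam=32.0):
    """Empirical selective risk plus quadratic coverage penalty, as in (8)."""
    # Per-sample cross-entropy loss of the prediction head f
    ce = tf.keras.losses.categorical_crossentropy(y_true, f_out)
    # Empirical coverage: mean selection score over the batch, as in (4)
    emp_coverage = tf.reduce_mean(g_out)
    # Empirical selective risk: losses weighted by g, normalized by coverage, as in (6)
    selective_risk = tf.reduce_mean(ce * tf.squeeze(g_out)) / (emp_coverage + 1e-8)
    # Quadratic penalty enforcing the target coverage constraint (Psi in (8))
    penalty = lam * tf.square(tf.maximum(0.0, coverage - emp_coverage))
    return selective_risk + penalty
```

In SFAD, this term is combined with a standard cross-entropy loss from the auxiliary head, as in (9) and (10) below.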

3 Adversarial Detection (SFAD) method

3.1 SFAD’s classifiers design

It is believed that the last N layers in the DNN \(\mathcal {F}\) have potential in detecting and rejecting AEs [29, 37]. In [67] and [37], only the last layer (N = 1) is utilized to detect AEs. At this very high level of representation, AEs are indistinguishable from samples of the target class. This observation was strengthened when DNR [29] used the last three layers to build SVM classifiers with RBF kernels. Unlike these works, in this work, 1) the feature maps of the last layers Zj, where \(j=\{1,2 \dots ,N\}\), are processed. In the aforementioned methods, the representatives of the last layers are not processed, so the detectors basically form another approximation of the baseline classifier, which is a weak point. 2) MTL is used via the SelectiveNet. MTL has the advantage of combining related tasks with one or more loss functions and generalizes better, especially with the help of auxiliary functions. For more details about MTL, please refer to these recent review papers [68, 69].

In this section, the Adversarial Detection (SFAD) method is described. As depicted in Fig. 2a, SFAD consists of two main blocks: the selective AE classifiers block \({\mathscr{M}}\) (in blue), where \({\mathscr{M}}=\{m_{j}\}_{j=1}^{N}\), and the selective knowledge transfer classifier block \(\mathcal {S}\) (in orange). The training phase has two steps: the first trains the \({\mathscr{M}}\) classifiers, and the second trains the \(\mathcal {S}\) classifier. Hence, \({\mathscr{M}}\) and \(\mathcal {S}\) are trained separately. At inference/test time, the outputs of the \(\mathcal {F}\), \({\mathscr{M}}\), and \(\mathcal {S}\) blocks, i.e. the model’s uncertainties and confidences, are used in the detection process, as depicted in Fig. 2d.

3.2 Selective AEs classifiers block: training the \({\mathscr{M}}\) classifiers

As shown in Fig. 2a, the aim of the \({\mathscr{M}}\) block is to build N individual classifiers, \({\mathscr{M}}=\{m_{j}\}_{j=1}^{N}\). It has been shown that perturbation propagation becomes clearer as the DNN goes deeper; hence, using the N last layers has potential in identifying AEs. Unlike the works in [29, 37], we process the representative last N-layer outputs Zj in different ways in order to make the features of clean inputs more unique, as shown in Fig. 2b and discussed in Section 3.2.1. This limits the feature space that the adversary can use to craft AEs [30, 65]. Moreover, each of the last N layer outputs has its own feature space, so each mj classifier is trained on a different feature space. Hence, combining the classifiers and increasing N will enhance the detection process.

For simplicity, and as recommended in [29], we set N = 3 in the implemented prototype; hence, each individual layer output is assigned to one classifier, as shown in Fig. 2a. Let the outputs of the last N layers for a sample xi from Sk be denoted zji, i.e. z1i, z2i, and zNi, where \(j=\{1,2, \dots , N\}\). Each zji is, individually, the input of the mj classifier.
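As an illustration of how the per-layer inputs zji can be obtained from a trained Keras baseline classifier, the following sketch (our own illustration; the layer slice `base_model.layers[-(n + 1):-1]` is a simplifying assumption, and the actual tapped layers depend on the architectures in Tables 1 and 2) builds one feature-extraction sub-model per tapped layer:

```python
import tensorflow as tf

def last_n_feature_extractors(base_model, n=3):
    """Build one sub-model per tapped layer that returns its feature map Z_j."""
    # Take the n layers preceding the softmax output; adapt the slice to the model.
    tapped_layers = base_model.layers[-(n + 1):-1]
    return [tf.keras.Model(inputs=base_model.input, outputs=layer.output)
            for layer in tapped_layers]

# Usage: z_j = extractors[j](x_batch) is the input of classifier m_j
# extractors = last_n_feature_extractors(base_model, n=3)
```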

The outputs of the mj classifier are denoted mji(zji). Let \(\mathcal {Y}^{\prime }=\mathcal {Y}+1\) be the label space of mj, where the extra label denotes the selective status; hence mj represents a function \(m_{j}:Z_{j} \rightarrow Y^{\prime }\) on a distribution \(\mathbb {P}(Z_{j},Y^{\prime })\) over \(\mathcal {Z} \times \mathcal {Y}^{\prime }\). We refer to the selective probability of mj as \(P_{s}^{m_{j}}\) and to the confidence probabilities of mj as \(P_{c}^{m_{j}}\). mj optimizes the overall loss function

$$ \mathcal{L}_{m_{j}} = \alpha \mathcal{L}_{(m_{j},g_{m_{j}})} + (1-\alpha) \mathcal{L}_{h_{m_{j}}} \text{ , where }\alpha=0.5, $$
(9)

where \({\mathscr{L}}_{(m_{j},g_{m_{j}})}\) is the selective loss function of mj, as discussed in Section 2.2, and \({\mathscr{L}}_{h_{m_{j}}}\) is the auxiliary loss function of mj; they are calculated as follows:

$$ \begin{gathered} \mathcal{L}_{(m_{j},g_{m_{j}})} \triangleq \hat{r}_{\ell}(m_{j},g_{m_{j}}\mid S_{k}) + \lambda {\Psi}(c-\hat{\phi}(g_{m_{j}}\mid S_{k})),\\ {\Psi}(a) \triangleq \max(0,a)^{2}, \end{gathered} $$
$$ \mathcal{L}_{h_{m_{j}}} = \hat{r}(h_{m_{j}}\mid S_{k}) = \frac{1}{k}{\sum}_{i=1}^{k}\ell(h_{m_{j}}(z_{ji}),y_{i}). $$

Studying the value of α is out of the scope of this paper, but other task balancing methods may be applied, such as uncertainty weighting [70], GradNorm [71], DWA [40], DTP [72], and MGDA [73].

3.2.1 Feature maps processing

As depicted in Fig. 2b, each selective classifier consists of different processing blocks: an auto-encoder block, an up/down-sampling block, a bottleneck block, and a noise block. These blocks aim at producing distinguishable features for input samples so that the detector can recognize AEs efficiently.

Auto-encoder

Auto-encoders are widely used as a reconstruction tool, and their loss is used as a score for different tasks. For instance, it is used in the detection process of AEs in [65]. It is believed that AEs give a higher reconstruction loss than clean images. This process is also known as an attention mechanism [74, 75] and is used to obtain a better representation of input features, especially in shallow classifiers.

Up/down-sampling

Up-sampling and down-sampling are used in different deep classifiers [39, 40]. The aim of down-sampling, a.k.a. pooling layers in NNs, is to gather the global information of the input signal. Hence, if we consider the clean input signal as a signal that carries global information, expand that information by bilinear up-sampling, and then down-sample it by average pooling, we measure the ability to reconstruct the global information of the input signal. This process can also be seen as a use case of the reconstruction process.

Noise

Adding noise has a potential impact in making NNs more robust against AEs and has been used in many defense methods [41,42,43]. In this work, we add a branch in the classifier that adds small Gaussian noise to the input signal before and after the auto-encoder block. Then, the noised and clean input features are concatenated before the bottleneck block.

Bottleneck

The bottleneck block [44] consists of three convolutional layers: 1×1, 3×3, and 1×1. The bottleneck name comes from the fact that the 3×3 convolutional layer is left as a bottleneck between the 1×1 convolutional layers. It is mainly designed for efficiency purposes, but according to [74, 76] it is very effective in building shallow classifiers, which helps obtain a better representation of the input signal.

3.3 Selective knowledge transfer block: training the \(\mathcal {S}\) classifier

The \(\mathcal {S}\) block aims at building the selective knowledge transfer classifier. It concatenates the confidence values over the Y classes of the \({\mathscr{M}}\) classifiers. The idea behind the \(\mathcal {S}\) block is that each set of its inputs is considered a special feature of the clean input. Hence, we transfer this knowledge, the mj confidence probabilities of clean inputs, to the classifier. Moreover, at inference time, we expect an AE to generate a different distribution of confidence values, and if the AE fools one mj, it may not fool the others.

As Fig. 2a shows, the confidence probabilities of the mj classifiers are concatenated to form the input Q = \(concat(P_{c}^{m_{1}},\) \( P_{c}^{m_{2}}, \dots , P_{c}^{m_{N}})\) of the selective knowledge transfer block \(\mathcal {S}\). The \(\mathcal {S}\) classifier consists of one or more dense layer(s) and yields the selective probability of \(\mathcal {S}\), \(P_{s}^{\mathcal {S}}\), and the confidence probabilities of \(\mathcal {S}\), \(P_{c}^{\mathcal {S}}\). \(\mathcal {S}\) represents a function \(\mathcal {S}:Q \rightarrow Y^{\prime }\) on a distribution \(\mathbb {P}(Q,Y^{\prime })\) over \(\mathcal {Q} \times \mathcal {Y}^{\prime }\). Hence, it optimizes the following loss function

$$ \mathcal{L}_{\mathcal{S}} = \alpha \mathcal{L}_{(\mathcal{S},g_{\mathcal{S}})} + (1-\alpha) \mathcal{L}_{h_{\mathcal{S}}} \text{ , where }\alpha=0.5, $$
(10)

where \({\mathscr{L}}_{(\mathcal {S},g_{\mathcal {S}})}\) is the selective loss function of \(\mathcal {S}\), as discussed in Section 2.2, and \({\mathscr{L}}_{h_{\mathcal {S}}}\) is the auxiliary loss function of \(\mathcal {S}\); they are calculated as follows:

$$ \begin{array}{@{}rcl@{}} &\begin{gathered} \mathcal{L}_{(\mathcal{S},g_{\mathcal{S}})} \triangleq \hat{r}_{\ell}(\mathcal{S},g_{\mathcal{S}}\mid S_{k}) + \lambda {\Psi}(c-\hat{\phi}(g_{\mathcal{S}}\mid S_{k})),\\ {\Psi}(a) \triangleq \max(0,a)^{2}. \end{gathered} \end{array} $$
$$ \mathcal{L}_{h_{\mathcal{S}}} = \hat{r}(h_{\mathcal{S}}\mid S_{k}) = \frac{1}{k}\sum\limits_{i=1}^{k}\ell(h_{\mathcal{S}}(q_{i}),y_{i}). $$

3.4 Detection process in the test time

Once the \({\mathscr{M}}\) and \(\mathcal {S}\) classifiers are trained, they are used with the baseline classifier \(\mathcal {F}\) to detect AEs at inference/test time, as depicted in Fig. 2d. Specifically, the output of the baseline model \(P_{c}^{\mathcal {F}}\), the outputs of the \({\mathscr{M}}\) block, \(P_{s}^{m_{j}}\) and \(P_{c}^{m_{j}}\), and the outputs of the \(\mathcal {S}\) block, \(P_{s}^{\mathcal {S}}\) and \(P_{c}^{\mathcal {S}}\), are used in the ensemble detection process. First, the following thresholds have to be identified:

  • the confidence threshold value

    $$ th_{c}=\max (th_{c}^{\mathcal{S}}, th_{c}^{m_{1}}, th_{c}^{m_{2}}, ..., th_{c}^{m_{N}}) $$

    where \(th_{c}^{m_{j}}\) is the confidence threshold for the selective AEs classifier mj, and \(th_{c}^{\mathcal {S}}\) is the confidence threshold for the \(\mathcal {S}\) classifier.

  • selective threshold \(th_{s}^{m_{j}}\) for each selective AEs classifier mj.

  • selective threshold \(th_{s}^{\mathcal {S}}\) for the \(\mathcal {S}\) classifier.

Following the steps in [29], we select our thresholds using a subset of the clean test samples at a level where at most 10% of clean samples are rejected by the ensemble detection. Once the thresholds are calculated, we run the detection process as follows (a minimal sketch of this procedure is given after the list):

  1. Confidence detection: is set to 1 if \(max(P_{c}^{\mathcal {S}})<th_{c}\) and is set to 0 otherwise, where 1 means an adversarial input.

  2. Selective detection: is set to 1 if \(P_{s}^{\mathcal {S}}<th_{s}^{\mathcal {S}}\) or \(P_{s}^{m_{1}} < th_{s}^{m_{1}}\) or … or \(P_{s}^{m_{N}} < th_{s}^{m_{N}}\) and is set to 0 otherwise.

  3. Mismatch detection: is set to 1 if \(argmax (P_{c}^{\mathcal {S}}) \neq argmax (P_{c}^{\mathcal {F}})\) and is set to 0 otherwise.

  4. Ensemble detection: the input sample is adversarial if it is detected by the confidence, selective, or mismatch detection process.
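The following minimal sketch (our own illustration with hypothetical variable names) summarizes the ensemble decision, assuming the thresholds have already been calibrated on clean test samples at the 10% rejection level:

```python
import numpy as np

def sfad_detect(pc_F, pc_S, ps_S, ps_m, th_c, th_s_S, th_s_m):
    """Return True if the input is flagged as adversarial.

    pc_F, pc_S : class-probability vectors of the baseline model F and of S
    ps_S       : selective score of S
    ps_m       : list of selective scores of the m_j classifiers
    th_*       : thresholds calibrated on clean samples (<= 10% rejection)
    """
    confidence_flag = np.max(pc_S) < th_c
    selective_flag = (ps_S < th_s_S) or any(p < t for p, t in zip(ps_m, th_s_m))
    mismatch_flag = np.argmax(pc_S) != np.argmax(pc_F)
    # Ensemble: adversarial if any of the three modules fires
    return confidence_flag or selective_flag or mismatch_flag
```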

4 Experimental settings

4.1 Datasets

The proposed prototype is evaluated on CNN models trained on two popular datasets: MNIST [45] and CIFAR10 [46].

MNIST is a hand-written digit recognition dataset with 70000 images (60000 for training and 10000 for testing) and ten classes, and CIFAR10 is an object recognition dataset with 60000 images (50000 for training and 10000 for testing) and ten classes.

4.2 Baseline classifiers

For the baseline models, two CNN models are trained: one for MNIST and one for CIFAR10. For MNIST, we trained a 6-layer CNN with 98.73% accuracy, while for CIFAR10 we trained an 8-layer CNN with 89.11% accuracy. The classifier architectures for MNIST and CIFAR10 are shown in Tables 1 and 2, respectively.

Table 1 MNIST baseline classifier architecture
Table 2 CIFAR10 baseline classifier architecture

In order to evaluate the proposed prototypes against gray-box attacks, we consider that the adversaries know the training dataset and the model outputs but do not know the baseline model architectures. Hence, Tables 3 and 4 show the two alternative architectures for the MNIST and CIFAR10 classifiers. For MNIST, the classification accuracies are 98.37% and 98.69% for Model #2 and Model #3, respectively, while for CIFAR10 they are 86.93% and 88.38% for Model #2 and Model #3, respectively.

Table 3 MNIST classifiers architectures for gray-box setting
Table 4 CIFAR10 classifiers architectures for gray-box setting

4.3 SFAD Settings

As described in Section 3 and Fig. 2, we introduce here the implementation details for the detector components.

4.3.1 Selective AEs classifiers block

It consists of autoencoder, up/down-sampling, bottleneck, and noise layers, each with the following architecture:

Autoencoder

As shown in Fig. 3, let the input size be Z × w × h. In the encoding process, the numbers of 3 × 3-kernel filters are set to Z/2, Z/4, and Z/16, respectively. In the decoding process, the number of filters is symmetrically restored to Z. Finally, to maintain the input sample characteristics present before autoencoding, the input is added/summed to the output of the autoencoder.

Fig. 3

Autoencoder architecture
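A minimal Keras sketch of this block (our own illustration of the description above; padding and activation choices are assumptions, and Z is assumed to be large enough to be divided by 16) is:

```python
import tensorflow as tf
from tensorflow.keras import layers

def autoencoder_block(x):
    """Symmetric convolutional autoencoder with a residual connection (Fig. 3)."""
    z = x.shape[-1]                                  # number of input channels Z
    enc = x
    for filters in (z // 2, z // 4, z // 16):        # encoding path
        enc = layers.Conv2D(filters, 3, padding='same', activation='relu')(enc)
    dec = enc
    for filters in (z // 4, z // 2, z):              # symmetric decoding path
        dec = layers.Conv2D(filters, 3, padding='same', activation='relu')(dec)
    # Residual connection: keep the original input characteristics
    return layers.Add()([x, dec])
```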

Up/down-sampling

As shown in Fig. 4, let the input size be Z × w × h. The spatial size is doubled by bilinear up-sampling in the first two consecutive layers and then restored by average pooling in the last two layers. Finally, to maintain the features present before up/down-sampling, the input is added to the output of the up/down-sampling.

Fig. 4

Up/down-sampling architecture
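A corresponding Keras sketch (our own illustration; the residual addition follows the description above) is:

```python
import tensorflow as tf
from tensorflow.keras import layers

def up_down_sampling_block(x):
    """Two bilinear up-sampling steps followed by two average-pooling steps (Fig. 4)."""
    y = layers.UpSampling2D(size=2, interpolation='bilinear')(x)
    y = layers.UpSampling2D(size=2, interpolation='bilinear')(y)
    y = layers.AveragePooling2D(pool_size=2)(y)
    y = layers.AveragePooling2D(pool_size=2)(y)
    # Residual connection: keep the features present before up/down-sampling
    return layers.Add()([x, y])
```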

Bottleneck

It is a three-convolutional-layer module with kernels of sizes 1 × 1, 3 × 3, and 1 × 1. The architecture of the bottleneck layers is shown in Fig. 5. The numbers of filters for the three layers are 1024, 512, and 256, respectively.

Fig. 5

Bottleneck architecture
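A minimal Keras sketch of this block (our own illustration; padding and activation choices are assumptions) is:

```python
import tensorflow as tf
from tensorflow.keras import layers

def bottleneck_block(x):
    """1x1 -> 3x3 -> 1x1 convolutions with 1024, 512, and 256 filters (Fig. 5)."""
    y = layers.Conv2D(1024, 1, padding='same', activation='relu')(x)
    y = layers.Conv2D(512, 3, padding='same', activation='relu')(y)
    y = layers.Conv2D(256, 1, padding='same', activation='relu')(y)
    return y
```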

Noise

For this layer, the GaussianNoise layer from the Keras library is used with a small standard deviation of 0.05.

Dense layers

A dense layer with 512 outputs is used, followed by batch normalization and a ReLU activation function.

SelectiveNet

A dense layer with 512 outputs is used, followed by batch normalization and a ReLU activation function. After that, as the original SelectiveNet implementation suggests, a layer that divides the result of the previous layer by 10 is used as a normalization step. Finally, a dense layer with one output and a sigmoid activation function is used. We set λ = 32, c = 1 for MNIST and c = 0.9 for CIFAR10, and the coverage threshold to 0.995 for MNIST and 0.9 for CIFAR10. More details about the SelectiveNet hyper-parameters can be found in [25].
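A minimal Keras sketch of this selection head (our own illustration of the layer sequence described above) is:

```python
import tensorflow as tf
from tensorflow.keras import layers

def selection_head(features):
    """Selective head g(x): dense -> BN -> ReLU -> divide-by-10 -> sigmoid."""
    g = layers.Dense(512)(features)
    g = layers.BatchNormalization()(g)
    g = layers.Activation('relu')(g)
    g = layers.Lambda(lambda t: t / 10.0)(g)   # normalization step used by SelectiveNet
    return layers.Dense(1, activation='sigmoid', name='selective')(g)
```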

4.3.2 Selective Knowledge Transfer block

It consists of one dense layer with 128 outputs, followed by batch normalization and a ReLU activation function. The selective task of the knowledge transfer block consists of a dense layer with 128 outputs, followed by batch normalization and a ReLU activation function. After that, a normalization layer that divides the result of the previous layer by 10 is used, as recommended by the original implementation of SelectiveNet. Finally, a dense layer with one output and a sigmoid activation function is used. We set λ = 32, c = 1 for MNIST and c = 0.9 for CIFAR10, and the coverage threshold to 0.7 for both MNIST and CIFAR10. More details about the SelectiveNet hyper-parameters can be found in [25].

4.4 Threat model, attacks, and state-of-the-art detectors

4.4.1 Threat model

We follow one of the threat models presented in [47, 77]: the zero-knowledge adversary threat model. It is assumed that the adversary has no knowledge that a detector is deployed and generates the white-box attacks with knowledge of the baseline classifier. For cases in which the adversary has perfect or limited knowledge of the detector, we assume that the adversary's task becomes much harder since SFAD adopts ensemble detection, and hence we leave this as future work. Instead, we test SFAD's robustness with the recommended strong high confidence attack [31, 47], a variant of the CW attack that is rarely tested against other detectors.

4.4.2 Adversarial attacks

We test SFAD against different types of white-box and black-box attacks. For the white-box attacks, we use the FGSM [7], PGD [14], CW [13], and DF attacks, while for the black-box attacks we use the TA [19], PA [18], and ST [17] attacks. For the comparison with state-of-the-art algorithms, more black-box attacks are considered, such as the SA [78] and HopSkipJump [79] attacks. The attack settings are shown in Table 5.

Table 5 Considered adversarial attacks and their parameters

Fast Gradient Sign Attack (FGSM) [7]

It is an \(L_{\infty }\)-norm attack that uses the model gradients to generate the AE. The sign of the gradient with respect to each pixel of the input x is used to build the AE \(x^{\prime }\) as follows:

$$ x^{\prime} = x + \epsilon \text{ sign }(\nabla_{x} \ell(x,y)), \text{ such that } x^{\prime}\in [0,1]^{n} $$
(11)

where 𝜖 is a parameter that controls the perturbation amount such that \(||x^{\prime }-x||_{\infty } < \epsilon \).
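A minimal TensorFlow sketch of this attack (our own illustration; `model` and `loss_fn` are assumed to be a trained classifier and a cross-entropy loss) is:

```python
import tensorflow as tf

def fgsm(model, loss_fn, x, y, eps=0.1):
    """Single-step L-infinity FGSM, as in (11)."""
    x = tf.convert_to_tensor(x)
    with tf.GradientTape() as tape:
        tape.watch(x)
        loss = loss_fn(y, model(x))
    grad = tape.gradient(loss, x)
    x_adv = x + eps * tf.sign(grad)
    return tf.clip_by_value(x_adv, 0.0, 1.0)   # keep x' in [0,1]^n
```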

Projected Gradient Descent (PGD) [14]

It is the iterative version of the FGSM attack. The PGD attack applies FGSM k times and starts from a random perturbation within an Lp-ball around the input sample. It is expressed as:

$$ \begin{gathered} x_{i+1}^{\prime} = x_{i}^{\prime} + \alpha \text{ sign }(\nabla_{x} \ell(x_{i}^{\prime},y)),\\ \text{ such that } x_{1}^{\prime}=x + rand(noise) \text{ , }\\ x_{i+1}^{\prime}\in [0,1]^{n} \text{ , and } i=1 \text{ to } k \end{gathered} $$
(12)

where α is a parameter that controls the step size of the i-th iteration, with 0 < α < 𝜖.
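A corresponding iterative sketch (our own illustration; in addition to (12), it projects the iterate back onto the 𝜖-ball around x, as in the standard PGD formulation) is:

```python
import tensorflow as tf

def pgd(model, loss_fn, x, y, eps=0.1, alpha=0.01, k=40):
    """Iterative L-infinity PGD with random start, following (12)."""
    x = tf.convert_to_tensor(x)
    x_adv = x + tf.random.uniform(tf.shape(x), -eps, eps)   # random start
    x_adv = tf.clip_by_value(x_adv, 0.0, 1.0)
    for _ in range(k):
        with tf.GradientTape() as tape:
            tape.watch(x_adv)
            loss = loss_fn(y, model(x_adv))
        grad = tape.gradient(loss, x_adv)
        x_adv = x_adv + alpha * tf.sign(grad)
        x_adv = tf.clip_by_value(x_adv, x - eps, x + eps)    # project onto eps-ball
        x_adv = tf.clip_by_value(x_adv, 0.0, 1.0)
    return x_adv
```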

Carlini-Wagner (CW) [13]

CW follows the optimization formulation of the L-BFGS attack [6] and replaces the loss function with an objective function:

$$ g(x^{\prime})=\max(\max_{i\neq t}(Z(x^{\prime})_{i}) - Z(x^{\prime})_{t}, -k), $$
(13)

where Z(x′) denotes the model's logits (pre-softmax outputs) and k is the confidence parameter. Hence, CW solves the following optimization problem to build the AE:

$$ \underset{\delta}{\min} ||\delta|| + c g(x^{\prime}), \text{ such that } x^{\prime}\in [0,1]^{n}, $$
(14)

where δ is the amount of perturbation and c is a regularization parameter that is searched over to find the minimum δ.

DF [12]

Given a binary affine classifier \(\mathcal {C} = \{x: f(x)=0\}\), where \(f(x) = w^{T}x + b\), the DF attack defines the orthogonal projection of x0 onto \(\mathcal {C}\) as the minimal perturbation needed to change the classifier's decision, calculated as \(\delta _{*}=-\frac {f(x)}{||w||^{2}}w\). At each iteration, the DF attack solves the following optimization problem

$$ \begin{gathered} \operatorname*{argmin}_{\delta_{i}} ||\delta_{i}||_{2}, \\ \text{ such that } f(x_{i})+\nabla f(x_{i})^{T}\delta_{i} =0 \end{gathered} $$
(15)

and these perturbations are accumulated to get the final perturbation.

PA and TA [19]

PA is an L0-norm black-box attack that uses the Differential Evolution (DE) [80] algorithm to solve the optimization problem:

$$ \underset{\delta}{\text{max }} f(x+\delta) \text{ , such that } ||\delta||_{0} \leq d $$
(16)

where d is a small number, equal to one in the case of the one-pixel attack. TA generalizes (16) to an \(L_{\infty }\)-norm attack to solve the optimization problem.

ST [17]

ST applies translation and rotation changes to the input samples in order to fool the model and solves the optimization problem:

$$ \underset{\delta u,\delta v, \theta}{\text{max }} \ell(f(x^{\prime}), y) \text{ , for } x^{\prime}=T(x;\delta u,\delta v, \theta) $$
(17)

where T, δu, δv, and 𝜃 are the transform function, the x-coordinate translation, the y-coordinate translation, and the rotation angle, respectively.

SA [78]

In order to generate perturbation δ, SA, in each iteration, selects colored 𝜖-bounded localized square shaped updates at random positions using random search strategy. Hence, it solves the optimization problem:

$$ \underset{x^{\prime} \in [0,1]^{n}}{\text{min }} \ell(f(x^{\prime}), y) \text{ , such that } ||\delta||_{p} \leq \epsilon $$
(18)

where \(\ell (f(x^{\prime }), y)=f_{y}(x^{\prime })-\max \limits _{k\neq y}f_{k}(x^{\prime })\). \(f_{y}(x^{\prime })\) and \(f_{k}(x^{\prime })\) are the prediction probability scores of \(x^{\prime }\) for y and k classes, respectively.

HopSkipJump attack [79] (HSJA)

HopSkipJump is a decision-boundary-based black-box attack that relies on estimating the gradient direction. It starts from a large adversarial perturbation δ and moves towards the decision boundary of the clean input's class by minimizing ||δ||2.

4.4.3 Comparison with existing detectors

State-of-the-art supervised and unsupervised detectors are compared with SFAD. Supervised methods like KD+BU [22], LID [48], and RAID [64] are compared with SFAD, while unsupervised methods like FS [30], MagNet [65], NIC [35], and DNR [29] are also considered in the comparisons. A brief summary of each detector is given here:

KD+BU [22]

It depends on building a binary classifier using two main features. The first is the uncertainty feature, estimated using the Monte Carlo dropout technique [21]. The second depends on the kernel density estimation of each class in the training data.

RCE [81]

It depends on measuring the kernel density as in [22]. Instead of using the baseline classifier to measure the density functions, Pang et al. [81] measure the density functions using a more robust classifier trained with the reverse cross-entropy technique.

LID [48]

Instead of measuring the kernel density, Ma et al. in [48] used Local Intrinsic Dimensionality (LID) to calculate the distance distribution of the input sample to its neighbors.

RAID [64]

It depends on measuring the differences in neuron activation values between clean inputs and AEs and then building a binary classifier with these features.

FS [30]

It depends on the feature squeezing approach, which transforms the input samples using squeezers. It uses color bit-depth reduction, local smoothing with a median filter, and non-local smoothing with a non-local means denoiser. To determine the adversarial status of an input, the distance between the confidences of the clean input and of its squeezed version is calculated and compared with a threshold.

MagNet [65]

First, it trains denoisers using clean training data. Then, it either 1) calculates the reconstruction error between the input and its denoised version, or 2) measures the distance between the predictions of an input sample and of its denoised version to determine the adversarial status of the input.

NIC [35]

It observes the behavior of clean training data in the intermediate DL model layers. Specifically, it observes the provenance channel and the activation value distribution channel. The provenance channel describes the instability of the set of activated neurons in the next layer when small changes are present in the input sample, while the activation value distribution channel describes the changes in the activation values of a layer. For each individual layer, one-class classifiers (OCCs) are built to model the in-distribution training data. A final one-class classifier that joins all the one-class classifiers' outputs is used to determine the adversarial status of an input.

DNR [29]

In this detector, Sotgiu et al. [29] use the N-last representative layer outputs of the baseline classifier to build N SVM classifiers with RBF kernels. The outputs of these classifiers are then combined to build a joint SVM-RBF classifier. To determine the adversarial status of an input, the detector checks whether the maximum confidence probability is less than a predefined threshold.

5 SFAD performance evaluation

In this section, we evaluate the performance of the SFAD prototype 1) against different types of attack scenarios and datasets, 2) against the strong high confidence attack, and 3) in comparison with state-of-the-art detectors. As a reminder, we use only the last three representative layers (N = 3) to build the three selective AE classifiers, since the aim is to prove the concept of the approach; if the best combination is used instead, the detector accuracy will improve accordingly.

5.1 Performance under white, black, and gray boxes attacks

5.1.1 Zero-Knowledge (of detectors) adversary white-box attacks

Table 6 shows the performance evaluation of the SFAD prototype on the MNIST and CIFAR10 datasets. It also shows the baseline DNN prediction accuracy on the AEs in the “Baseline DNN” row and on the undetected AEs in the “Prediction” row. The “Total” row is the combined accuracy of ensemble detection and correctly classified/predicted samples.

Table 6 SFAD’s performance accuracy (%) against white-box attacks(𝜖) on MNIST and CIFAR10 datasets at FP= 10%

For the MNIST dataset, the FGSM attacks with small epsilon (𝜖 = 0.05, 0.075, and 0.1) only slightly fool the baseline classifier, and hence their feature space is still inside, or at the border of, that of the training dataset. The detector shows its ability to reject those samples that are very close to the class borders and achieves accuracies of 99.96%, 99.88%, and 99.62% for 𝜖 = (0.05, 0.075, and 0.1), respectively. A similar observation holds for PGD attacks with small 𝜖 values. For larger 𝜖 values, and for the DF and CW attacks, the AEs are highly able to fool the baseline classifier, since adversaries are able to move the MNIST test samples’ feature space outside the corresponding class border; hence, for all tested attacks except PGD, the model catches them with accuracy above 98.65%, while the detector achieves 68.09% and 58.93% for PGD attacks with 𝜖 = (0.2 and 0.4), respectively. The feature space of some PGD examples becomes indistinguishable from that of the training samples, which prevents SFAD from catching all AEs; to enhance SFAD’s performance, the best combination of representative layers has to be used as input for the detector.

For the CIFAR10 dataset, SFAD achieves results comparable with state-of-the-art methods for FGSM (𝜖 = 0.1, 0.2, and 0.4), DF, and CW attacks, while for FGSM (𝜖 = 0.05 and 0.075) and PGD attacks, the AEs have, to some extent, a feature space indistinguishable from that on which the detector is trained. On average, the model achieves an accuracy of 65.2% for PGD attacks.

For both datasets, the effectiveness of the selective, confidence, and mismatch detection modules is obvious, as shown in Table 7. The ability of the modules to detect the AEs increases as the amount of perturbation increases. When the amount of perturbation increases to the point where the adversarial samples’ feature space becomes indistinguishable from that of the training dataset, the ability of these modules to detect the AEs decreases.

Table 7 SFAD’s performance accuracy (%) of different detection processes against white-box attacks(𝜖) on MNIST and CIFAR10 datasets at FP= 10%

5.1.2 Black-box attacks

Table 8 shows the SFAD prototype’s detection accuracy against the TA [19], PA [18], and ST [17] attacks on the MNIST and CIFAR10 datasets. The detector catches the AEs with very high accuracy, higher than 97.56% and 93.97% for MNIST and CIFAR10, respectively. It is clear that the selective, confidence, and mismatch modules complement each other. The black-box attacks significantly change the sample features, which facilitates the confidence module’s detection process, while the ability of the selective module is limited for TA and PA attacks, since these attacks change one or more pixels within a threshold that lies within the variation of the input sample, yielding AEs that are very close to clean samples. Similar to the white-box attacks, the effectiveness of the selective, confidence, and mismatch detection is obvious for both datasets, as shown in Table 9.

Table 8 SFAD’s performance accuracy (%) against black-box attacks on MNIST and CIFAR10 datasets at FP= 10%
Table 9 SFAD’s performance accuracy (%) of different detection processes against black-box attacks on MNIST and CIFAR10 datasets at FP= 10%

5.1.3 Gray-box attacks

The gray-box scenario assumes that the adversary has knowledge only of the model’s training data and of the output of the DNN model. Hence, we trained two substitute models, named Model#2 and Model#3, for MNIST and CIFAR10, as shown in Tables 3 and 4, respectively. Then, white-box AEs are generated using the substitute models, and the SFAD prototype is tested against these AEs. For both datasets, Tables 10 and 11 show that the perturbation properties generated from one model transfer to the tested model, Model#1. For MNIST (see Table 10), SFAD’s prediction rate is much better for PGD attacks, and the prediction rate for the other attacks is comparable with the prediction rate for AEs generated from Model#1. For CIFAR10 (see Table 11), the prediction rate for CW and DF attacks is higher than for those attacks generated using Model#1, while the prediction rate for FGSM is comparable with that for FGSM attacks generated using Model#1. Unlike the other attacks, the PGD attacks’ transferability appears to be much stronger, with a feature space different from that of the AEs generated from Model#1. This reduces the ability of the detector to catch such attacks.

Table 10 SFAD’s performance accuracy (%) against gray-box attacks(𝜖) on MNIST dataset at FP= 10%
Table 11 SFAD’s performance accuracy against gray-box attacks(𝜖) on CIFAR10 dataset at FP= 10%

5.2 Robustness against high confidence attack

In [31], ten defenses and detectors were broken using Backward Pass Differentiable Approximation (BPDA), Expectation Over Transformation (EOT), and the High Confidence Attack (HCA). BPDA and EOT are appropriate for defense techniques, while HCA is used to defeat detectors. HCA is a variant of the CW attack and generates adversarial examples with a high confidence level. In [31], LID was broken using HCA. In this experiment, we generate AEs using HCA with 𝜖 = 0.3125 for MNIST and 𝜖 = 0.031 for CIFAR10. The results show that SFAD is fully robust on MNIST against HCA and partially robust (57.76%) on CIFAR10. Our analysis finds that the confidence and selective detection methods are effective in detecting these AEs. In case the confidence level of the attack is increased, SFAD can be fine-tuned by selecting the proper layer outputs to build the selective AE classifiers.

All the experiments conducted in this work are performed under zero knowledge of the detector. We assume that the adversary’s task of building an adaptive attack to fool SFAD is very hard since it ensembles three detection methods. Despite that, SFAD’s performance will drop if the adversary is able to craft customized perturbations that fool both the baseline classifier and the ensemble detector.

5.3 Comparisons with the state-of-the-art detectors

In this subsection we present a comparison with different types of supervised and unsupervised detectors using the detectors benchmark [82]; the results are shown in Table 12. We compare the average FGSM and PGD results. For fair comparisons, 𝜖 values of 0.125, 0.25, and 0.3125 are set for the MNIST dataset, while for CIFAR10 they are set to 0.03 and 0.06. Moreover, the supervised detectors are trained and tested separately against each adversarial attack algorithm. As discussed in Section 6.4, the rejection/false positive rate of SFAD can be decreased with a small compromise in performance.

Table 12 Detection accuracies for the state-of-the-art detectors against white-box and black-box attacks

KD+BU [22]

The KD+BU detector is a combination of kernel-density and Bayesian-uncertainty based classifiers. For both datasets, the results show that SFAD outperforms the KD+BU detector against all tested attacks except PGD attacks on the CIFAR10 dataset. In fact, KD+BU needs clean (not noisy) images and adversarial images to accurately train the detector to identify the boundaries between clean and adversarial inputs.

RCE [81]

Compared to KD [22], RCE achieves better performance since its classifier yields latent representations that better distinguish AEs from normal examples. For both datasets, MNIST and CIFAR10, reverse cross entropy (RCE) yields better area-under-the-curve scores than KD, while it shows limited performance against the basic iterative method (BIM) [11] and the HCA.

LID [48]

SFAD outperforms LID on both datasets and all tested attacks except PGD attacks on CIFAR10. LID achieves a better false positive rate than SFAD but fails against the High Confidence Attack, as reported in [47]. When LID is trained for HCA attacks, it achieves better results than in [47]. Our approach provides full and partial robustness against HCA for MNIST and CIFAR10, respectively. Similar to KD+BU, LID needs clean (not noisy) images and adversarial images to accurately train the detector to identify the boundaries between clean and adversarial inputs.

RAID [64]

For the MNIST dataset, RAID achieves a higher detection rate for PGD attacks (𝜖 = 0.3), and for CIFAR10 it achieves a higher detection rate against FGSM and PGD attacks, while our approach improves the performance against CW and DF attacks. Besides, RAID has a better false positive rate for MNIST only. RAID trains on clean and adversarial inputs to identify differences in neuron activation between clean and adversarial samples. Hence, it requires extensive knowledge of attacks and their variants to enhance its performance.

FS [30]

As stated in [30], FS requires high-quality squeezers for different baseline networks, and it was shown that FS does not perform well against the tested attacks on the CIFAR10 dataset, while our approach generalizes better than FS at the expense of a higher false positive rate.

MagNet [65]

The results reported in Table 12 are for the detection process of MagNet; the defense process of MagNet is not considered. For MNIST, our approach achieves comparable results, except for the CW and ST attacks, where SFAD achieves better performance. For CIFAR10, our approach outperforms MagNet against the tested attacks. Since MagNet is a denoiser-based detector, it is not guaranteed that the denoisers will remove all the noise and produce highly denoised inputs that respect the target threshold. This applies specifically to L0 and L2 attacks. In contrast, our approach relies on the confidence-value changes that AEs cause, which makes it able to identify AEs. Although MagNet yields a lower false positive rate, it was shown in [32] that MagNet can be broken by different strategies.

NIC [35]

NIC is the state-of-the-art detector that, in general, achieves better performance against white-box attacks than other detectors, while our approach achieves better performance against the tested black-box attacks. Unlike the proposed approach, other works reported that the NIC baseline detectors are not consistent [33], increase the model parameter overhead [34], are time consuming [35], and introduce latency at inference time [36].

DNR [29]

DNR adopts confidence-based detectors and is close to our approach, but we additionally include the feature processing and selective modules. The reported results show that our approach outperforms DNR at the same false positive rates for the MNIST and CIFAR10 datasets.

Other performance comparisons

SFAD has a middle complexity level due to the classifiers’ training times and introduces no inference-time latency, but it incurs an overhead for storing the classifiers’ parameters. Compared to other detectors, SFAD introduces shallow networks; hence, compared to NIC, DNR, and LID, our detector has much less complexity. Besides, it works in parallel with the baseline classifier, so no latency is introduced, in contrast to FS and NIC. Finally, like NIC and DNR, SFAD has to pay a small price in terms of overhead compared to MagNet, FS, and LID.

6 Other experimental results and discussion

In this section, more performance analysis is presented in order to validate the SFAD prototype. First, we evaluate SFAD against successful attacks only. Then, the proposed approach is tested with different N settings. Moreover, in order to emphasize the advantages of SFAD’s feature processing components, we provide an ablation study for each component. Finally, performance results at different rejection rates, i.e. false positive rates, are shown.

6.1 Performance on successful attacks only

Table 13 shows the detection rate against only those AEs that fooled the baseline DNN classifier, under white-box and black-box scenarios. For MNIST, in general, results comparable with the state-of-the-art detectors are achieved for all tested white-box and black-box attacks (> 96.91%), except for the PGD attacks (83.88%).

Table 13 Detection modules’ accuracies (%) against successful white-box and black-box attacks(𝜖) on MNIST and CIFAR10 datasets at FP= 10%

For both datasets, the impact of the selective, confidence, and mismatch detection modules is obvious. The ability of the modules to detect AEs increases as the amount of perturbation increases. When the amount of perturbation increases to the point where the adversarial samples’ feature space becomes indistinguishable from that of the training dataset, the ability of these modules to detect AEs decreases. Mismatch detection has a high impact in the detection process of AEs, except for PGD attacks. Once the amount of crafted perturbation becomes high, the performance of mismatch detection decreases, because the behavior of the detector classifiers and of the baseline DNN classifier becomes inconsistent for highly degraded inputs.

6.2 Results with N last layer(s) output(s)

The results shown in Fig. 6 support the conclusion of [29], which recommends using more than one of the last layers of the baseline DNN classifier in detection techniques. For the MNIST dataset, the benefit of using more than one layer appears in detecting PGD (𝜖 = 0.2 and 0.4), TA, and PA attacks, while for CIFAR10 it appears for all tested attacks. This means that low- and medium-level hidden layers hold features that are triggered when small perturbations are added to input samples.

Fig. 6

Total model performance accuracy (%) for black-box and white-box scenarios on the MNIST and CIFAR10 datasets at FP= 10% with different numbers N of selective AE classifiers

6.3 Ablation study

In this section, we emphasize the advantages of SFAD’s feature processing components, including the noise, autoencoder, up/down-sampling, and bottleneck blocks. Tables 14 and 15 show the performance results for each block, once when it is present alone and once when it is absent, for the MNIST and CIFAR10 datasets. In all settings, SelectiveNet is present in the selective AE classifiers and in the selective knowledge transfer classifier.

Table 14 Ablation performance (%) on white-box scenarios for MNIST dataset
Table 15 Ablation performance (%) on white-box scenarios for CIFAR10 dataset

Only NN

When all processing blocks are absent, the MNIST results show the ability to detect FGSM, PGD with small 𝜖 values, and CW attacks slightly better than the proposed approach, while the proposed approach yields better results for DF and for PGD with high 𝜖 values. Since the CIFAR10 dataset is different from MNIST and has different characteristics, the NN-only configuration does not yield better results against FGSM with high 𝜖 values, PGD, CW, and DF attacks.

Noise

When only the noise block is used, the model achieves results comparable to SFAD except against PGD attacks. When we remove the noise block, the performance of SFAD is reduced, especially against PGD attacks, for the MNIST and CIFAR10 datasets. The noise block helps the detector better distinguish the feature space of clean input images from the features of AEs.

Autoencoder

The autoencoder block has a substantial impact in the proposed approach. As discussed in Section 3.2.1, if the autoencoder cannot reconstruct its input, a different feature space may be generated for the input signal, which lets SFAD detect the AEs. For the MNIST dataset, the autoencoder block enhances the performance compared to the NN-only model against PGD with higher 𝜖 values, while the performance is reduced when the autoencoder block is removed from the proposed approach. On the other hand, for the CIFAR10 dataset, when only the autoencoder is present, the performance is much better against FGSM with high 𝜖 values, PGD, CW, and DF attacks compared to the NN-only model. The performance against PGD attacks is reduced when the autoencoder is removed from the proposed approach.

Up/down-sampling

Unlike the other processing blocks, the up/down-sampling block yields lower performance against FGSM attacks and comparable results against the other attacks relative to the NN-only model. This is because the up/down-sampling restores the global information of the input signal through the average pooling process. On the other hand, removing the sampling block from the proposed approach reduces the performance, especially for the CIFAR10 dataset.

Bottleneck

Like the autoencoder block, the bottleneck block shows its ability to distinguish input signal characteristics, especially in the proposed shallow classifiers (the selective AE classifiers). Compared to the NN-only model, the bottleneck-only model enhances the performance against FGSM with high 𝜖 values, PGD, CW, and DF attacks for the CIFAR10 dataset, and against PGD with high 𝜖 values for MNIST. Besides, the performance of the proposed approach is significantly decreased for the CIFAR10 dataset when the bottleneck block is removed.

6.4 Performance with different rejection rates (False positive (FP))

In this subsection we show the performance of the proposed approach when the thresholds are set to reject less than 10% of clean samples for MNIST, as shown in Fig. 7. The results show that an acceptable performance can be achieved if the thresholds are set below 10%. For instance, when the false positive rate is set to 2%, the results against PGD (𝜖 = 0.2 and 0.4) attacks are significantly decreased because of the selective detection. For all other tested attacks, the difference is at most 4% and 1.76% when FP= 2% and 3%, respectively.

Fig. 7

Performance comparisons between different False Positive (FP) rates and FP= 10% of SFAD for white-box attacks on MNIST dataset

7 Conclusion

In this work, we have proposed a novel unsupervised ensemble mechanism, namely SFAD, to detect adversarial attacks. SFAD processes the N-last layer outputs of the baseline DNN classifier to identify AEs. It builds N selective AE classifiers, each taking one layer output of the baseline classifier as input and processing it with autoencoder, up/down-sampling, bottleneck, and additive-noise blocks. These feature-based classifiers are then optimized in the SelectiveNet model to estimate the model’s uncertainties and confidences. The confidence values of these classifiers are then distilled as input to the selective knowledge transfer classifier to build the last classifier. Selective and confidence thresholds are set to identify adversarial inputs. The selective, confidence, and mismatch modules work jointly to enhance the detection accuracy. We showed that the model is consistent and is able to detect the tested attacks. Moreover, the model is robust in different attack scenarios: white-, black-, and gray-box attacks. This robustness, together with the fact that the model does not require any knowledge of adversarial attacks, leads to better generalization. The main limitation of the model is that the best combination of N layers needs to be identified to enhance the detection accuracy and to reduce the false positive rate.