1 Introduction

Deep learning algorithms have started to outperform humans in the past few years. For example, in the “ImageNet Large Scale Visual Recognition Challenge (ILSVRC)”, a deep learning model called ResNet [1] surpassed human performance in 2015, and the record was later broken by more advanced architectures. Similarly, Goodfellow et al. [2] created a system that outperforms human operators at reading addresses from Google Street View imagery and solving CAPTCHAs. In the field of gaming, AlphaGo, an AI program, defeated the global Go champion in 2016. Many advanced systems are now being developed using deep learning models, which have proven to be extremely successful in a variety of domains, including medical diagnosis, autonomous vehicles, game playing, and machine translation. However, during the rise of deep learning, researchers concentrated mainly on building increasingly accurate models, while the reliability and robustness of those models received almost no attention. DNNs do, in fact, require a more thorough examination because they have inherent vulnerabilities that can be easily exploited by intruders.

Around the end of 2013, researchers discovered that existing DNN models are vulnerable to meticulously crafted attacks. Szegedy et al. [3] were among the first to notice the presence of adversarial instances in the domain of image classification. The authors demonstrated that it is possible to modify an image by a small amount to change the prediction of a deep learning model, showing that a very slight and nearly unnoticeable change in the input is enough to deceive even the most advanced classifiers and cause incorrect classification. Since then, a vast number of research studies have been undertaken in this new field, named “Adversarial Machine Learning”, and these studies have not been restricted to the image classification domain. For example, Sato et al. [4] demonstrated in the NLP domain that altering merely one word of an input sentence can deceive a sentiment analyzer trained on textual data. A further example is in the audio domain [5], where the authors built targeted adversarial audio samples for an automatic speech recognition task by introducing very little disturbance to the original waveform. The results of this study show that the target model can easily be exploited to transcribe the input as any desired phrase.

Adversarial evasion attacks work mainly by modifying input samples in a way that increases the likelihood of incorrect decisions, resulting in inaccurate predictions. These attacks can cause the model’s prediction performance to deteriorate, since the algorithm is unable to correctly predict the real output for the input instances. Attacks that take advantage of DNNs’ weaknesses can substantially compromise the security of these machine learning (ML)-based systems, often with disastrous results. In the context of medical applications, a malicious attack could result in an inaccurate disease diagnosis and thus has the potential to affect the patient’s health as well as the healthcare industry [6]. Similarly, self-driving cars employ ML to navigate traffic without the need for human involvement, and a mistaken decision by an autonomous vehicle caused by an adversarial attack could result in a tragic accident [7, 8]. Hence, defending against malicious attacks and boosting the robustness of ML models without sacrificing clean accuracy is critical. If these ML models are to be utilized in crucial areas, we should pay the utmost attention to both the performance of ML models and the security problems of these architectures.

In this research work, we concentrate on adversarial defense strategies based on moment-based uncertainty estimates of a distilled model, which are obtained from Monte Carlo (MC) Dropout samples. We propose a hybrid approach that significantly improves the effectiveness of the uncertainty-based reversal technique [9] and combines it with the defensive distillation technique to provide more robust models. We name our proposed network architecture TENET, inspired by the famous sci-fi movie TENET (Directed by Christopher Nolan, Warner Bros. Pictures and Syncopy Inc., 2020) due to the resemblance of the main concepts (inversion). We also developed two more effective variants of the reversal process based on scibilic uncertainty. The reversal method involves reverting the input sample back to its original data manifold by decreasing its quantified uncertainty before feeding it to the classifier. This technique would be impossible if we used only a metric whose calculation depends on reference information such as the model loss. However, as the quantification of model uncertainty is independent of any reference information such as the real label of the input, we can successfully restore inputs back to their original data manifold by minimizing the quantified uncertainty. Our code is released on GitHub (Footnote 1) for scientific use.

To summarize, our key contributions in this work are as follows:

  • We enhanced the performance of a recently proposed technique that can successfully restore adversarial samples back to their original class manifold, and we introduced two more effective variants of it.

  • To the best of our knowledge, we are the first in the research community to consider scibilic uncertainty for building robust models.

  • We introduce a hybrid architecture that combines the defensive distillation technique and the uncertainty-based reversal method. We experimentally show that these two approaches handle complementary situations and that together they reduce the success rates of different attacks such as FGSM, BIM, PGD, DeepFool and CW to below 5%.

This study is structured as follows: Sect. 2 goes over some of the most well-known attack types and defense techniques in the literature. In Sect. 3, we introduce the concept of uncertainty as well as its main types and describe how they can be quantified. The details of our approach are presented in Sect. 4. We provide our experimental findings in Sect. 5 and wrap up our research in Sect. 6.

2 Literature survey

Since the uncovering of DNN’s vulnerability to adversarial attacks [3], a lot of work has gone into inventing new adversarial attack algorithms and defending against them by utilizing more robust architectures [10,11,12,13]. We discuss some of the noteworthy attack and defense studies separately.

2.1 Adversarial attacks

DNN models have some vulnerabilities that make them challenging to defend in adversarial settings. For example, they are highly sensitive to slight changes in the input data, which can lead to unexpected results in the model’s predictions. Figure 1 depicts how an adversary could take advantage of such a vulnerability and fool the model using a properly crafted perturbation applied to the input.

Fig. 1 A simple example of an adversarial attack. The adversarial perturbation is applied to the original image; the precisely crafted perturbation manipulates the model in such a way that a “Cat” is wrongly classified as “Sports Car” with a very high degree of confidence

In general, adversarial strategies can be classified based on different criteria. Considering the final aim of the attacker, attacks can be grouped into targeted and untargeted attacks. In the former, the attacker perturbs the input image so that a particular target class is predicted by the model, whereas in the latter, the attacker tampers with the input image so that the model predicts any class other than the genuine one. Attacks can also be grouped based on the level of knowledge the attacker has. If the attacker has full knowledge of the model, such as its architecture, weights and hyper-parameters, we call this a White-Box setting. However, if the attacker has no information about the deployed model and defense strategy, we call this a Black-Box setting [14]. In this study, we mainly focus on untargeted attacks in a White-Box setting.

The majority of attack ideas rely on perturbing the input sample in order to maximize the model’s loss. In recent years, many different adversarial attack techniques have been suggested in the literature. The most widely known and used adversarial attacks are the Fast Gradient Sign Method, the Iterative Gradient Sign Method, Projected Gradient Descent, DeepFool and Carlini & Wagner. These five adversarial attack algorithms are briefly explained in Sects. 2.1.1–2.1.5.

2.1.1 Fast-gradient sign method

This approach, commonly known as FGSM [15], is among the first and most famous adversarial attacks so far. In this attack algorithm, the derivative of the model’s loss function with respect to the input sample is used to determine in which direction each pixel of the input image should be altered in order to increase the model’s loss. All pixels are then altered simultaneously by a single step in that direction (the sign of the gradient) to maximize the loss. We may craft adversarial samples for a model with a classification loss function represented as \(J(\theta ,{\textbf{x}},y)\) by utilizing the formula below, where \(\theta \) denotes the parameters of the model, \({\textbf{x}}\) is the benign input, and \(y_\textrm{true}\) is the real label of the input.

$$\begin{aligned} {\textbf{x}}^\textrm{adv} = {\textbf{x}} + \epsilon \cdot \hbox {sign}\left( \nabla _x J(\theta ,{\textbf{x}},y_\textrm{true}) \right) \end{aligned}$$
(1)

Another important aspect of FGSM is that it is not intended to be optimal, but rather fast; it is not designed to output the minimum required amount of perturbation. Furthermore, when compared to other attack types, the success ratio of FGSM is relatively low when applied with small \(\epsilon \) values.
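
To make the formulation concrete, the following PyTorch-style sketch implements a single FGSM step as in Eq. (1). It is a minimal illustration, not the code used in our experiments: the model object, the input tensors and the assumed [0, 1] pixel range are placeholders.

```python
import torch
import torch.nn.functional as F

def fgsm(model, x, y_true, epsilon=0.03):
    """One FGSM step as in Eq. (1); epsilon and the [0, 1] pixel range are
    illustrative assumptions."""
    x_adv = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x_adv), y_true)      # J(theta, x, y_true)
    loss.backward()
    # step of size epsilon along the sign of the loss gradient w.r.t. the input
    x_adv = x_adv + epsilon * x_adv.grad.sign()
    return x_adv.clamp(0.0, 1.0).detach()             # keep pixels in a valid range
```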

2.1.2 Iterative gradient sign method

Kurakin et al. [16] proposed a minor but significant enhancement to FGSM. Instead of taking one large step \(\epsilon \) in the direction of the gradient sign, this method takes numerous smaller steps of size \(\alpha \) and uses the supplied value \(\epsilon \) to clip the result. It is also known as the Basic Iterative Method (BIM), and it is simply FGSM applied iteratively to an input sample. Equation 2 describes how perturbed images are generated under the \(l_\textrm{inf}\) norm for a BIM attack.

$$\begin{aligned} \begin{aligned} {\textbf{x}}_{0}^*&= {\textbf{x}} \\ {\textbf{x}}_{t+1}^*&= \hbox {clip}_{x, \epsilon } \{ {\textbf{x}}_{t}^* + \alpha \cdot \hbox {sign} \left( \nabla _{\textbf{x}} J(\theta , {\textbf{x}}_t^*, y_\textrm{true}) \right) \} \end{aligned} \end{aligned}$$
(2)

where \({\textbf{x}}\) is the clean sample input to the model, \({\textbf{x}}_t^*\) is the adversarial sample at the t-th iteration, J is the loss function of the model, \(\theta \) denotes the model parameters, \(y_\textrm{true}\) is the true label of the input, \(\epsilon \) is a configurable parameter that limits the maximum perturbation amount under the given \(l_\textrm{inf}\) norm, and \(\alpha \) is the step size.

The BIM attack has a better success rate than the FGSM [17]. The attacker can manage how far an adversarial sample is pushed further away from the decision boundary by configuring the \(\epsilon \) parameter.
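
A minimal iterative sketch of Eq. (2) is given below; as with the FGSM example, the model object, the hyper-parameter values and the [0, 1] pixel range are illustrative assumptions rather than the settings used in our experiments.

```python
import torch
import torch.nn.functional as F

def bim(model, x, y_true, epsilon=0.03, alpha=0.005, steps=10):
    """BIM / iterative FGSM under the l_inf norm, as in Eq. (2);
    hyper-parameter values are illustrative assumptions."""
    x_adv = x.clone().detach()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y_true)
        grad, = torch.autograd.grad(loss, x_adv)
        with torch.no_grad():
            x_adv = x_adv + alpha * grad.sign()
            # clip back into the epsilon-ball around x and the valid pixel range
            x_adv = torch.max(torch.min(x_adv, x + epsilon), x - epsilon)
            x_adv = x_adv.clamp(0.0, 1.0)
    return x_adv
```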

2.1.3 Projected gradient descent

This attack type, commonly known as PGD, was proposed by Madry et al. [18]. It perturbs an input image \({\textbf{x}}\) for a number of iterations in the direction of the gradient of the model’s loss function with a tiny step size. After each perturbation step, it projects the generated adversarial sample back onto the \(\epsilon \)-ball of the input, depending on the chosen distance norm. In addition, rather than starting from the original point (\(\epsilon = 0\) in all dimensions), PGD employs a random start, which can be defined as:

$$\begin{aligned} {\textbf{x}}_0 = {\textbf{x}} + P\left( -\epsilon , +\epsilon \right) \end{aligned}$$
(3)

where \(P\left( -\epsilon , +\epsilon \right) \) is the uniform distribution between (\(-\epsilon , +\epsilon \)).
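
In code, the random start of Eq. (3) could look as follows; the remaining PGD iterations then proceed as in the BIM sketch above, with the same projection onto the \(\epsilon \)-ball. The [0, 1] pixel range is again an assumption of this sketch.

```python
import torch

def pgd_random_start(x, epsilon=0.03):
    """Random start of Eq. (3): uniform noise in [-epsilon, +epsilon],
    clipped to an assumed [0, 1] pixel range."""
    x0 = x + torch.empty_like(x).uniform_(-epsilon, epsilon)
    return x0.clamp(0.0, 1.0)
```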

2.1.4 DeepFool attack

This attack method has been introduced by Moosavi-Dezfooli et al. [19] and it is one of the strongest untargeted attack algorithms in literature. It is made to work with several distance norm metrics, including \(l_\textrm{inf}\) and \(l_{2}\) norms.

The DeepFool attack is formulated on the idea that neural network models act like linear classifiers whose classes are separated by a hyperplane. Starting from the initial input point \({\textbf{x}}_0\), at each iteration the algorithm determines the closest hyperplane and the smallest perturbation amount, which is the orthogonal projection onto that hyperplane. The algorithm then computes \({\textbf{x}}_{t+1}\) by adding the smallest perturbation to \({\textbf{x}}_{t}\) and checks for misclassification. An illustration of this attack algorithm is provided in Fig. 2. This attack can break the defensive distillation method and achieves higher success rates than the previously mentioned iterative attack approaches. However, the downside of this attack algorithm is that the produced adversarial sample generally lies close to the decision boundary of the model.

Fig. 2 Illustration of the DeepFool attack algorithm

2.1.5 Carlini and Wagner attack

The attack proposed by Carlini and Wagner [20] is one of the strongest attack algorithms so far. As a result, it is commonly used as a benchmark by adversarial defense research groups, which try to develop more robust DNN architectures that can withstand adversarial attacks. It has been shown that, for the most well-known datasets, the CW attack has a greater success rate than the other attack types on normally trained models. Like DeepFool, it can also deceive defensively distilled models, for which other attack types struggle to create adversarial examples.

In order to generate more effective and stronger adversarial samples under multiple \(l_{p}\) norms, the authors reformulate the attack as an optimization problem that can be solved using gradient descent. A confidence parameter in the algorithm can be used to adjust the prediction confidence of the crafted adversarial sample. For a normally trained model, applying the CW attack with the default setting (confidence set to 0) generally yields adversarial samples close to the decision boundary, whereas high-confidence adversarial samples are generally located further away from it.

Adversarial machine learning is a burgeoning field of research, and many new adversarial attack algorithms are being proposed. Some of the recent remarkable ones are as follows: (i) the Square Attack [21], a query-efficient black-box attack that is not based on the model’s gradient and can break defenses that utilize gradient masking; (ii) HopSkipJumpAttack [22], a decision-based attack algorithm based on an estimate of the model’s gradient direction and a binary-search procedure for approaching the decision boundary; (iii) Prior Convictions [23], which utilizes two kinds of gradient estimation (time- and data-dependent priors) and proposes a bandit optimization-based framework for adversarial sample generation under a loss-only-access black-box setting; and (iv) the Uncertainty-Based Attack [24], which utilizes both the model’s loss function and quantified epistemic uncertainty to generate more powerful attacks. Figure 3 shows adversarial samples generated by the attack algorithms discussed earlier.

Fig. 3 An example image from the CIFAR10 dataset and some of the adversarial samples crafted by using the previously mentioned attack types

Fig. 4 Defensive distillation

2.2 Adversarial defense

In this section, we review some of the most notable adversarial defense methods proposed over the last few years.

2.2.1 Defensive distillation

Although the idea of knowledge distillation was originally introduced by Hinton et al. [25] to compress a large model into a smaller one, the utilization of this technique for adversarial defense purposes was first suggested by Papernot et al. [26]. The algorithm starts by training a \(teacher \ model\) on the training data using a high temperature (T) value in the softmax function, as in Eq. 4, where \(p_{i}\) is the probability of the \(i\)th class and the \(z_{i}\) are the logits.

$$\begin{aligned} p_{i} = \frac{\exp (\frac{z_{i}}{T})}{\sum _{j} \exp (\frac{z_{j}}{T})} \end{aligned}$$
(4)

Then, using the previously trained teacher model, each sample in the training data is labeled with soft labels computed with temperature (T) at prediction time. The \(distilled \ model\) is then trained with the soft labels acquired from the teacher model, again with a high temperature (T) value in the softmax. When the training of the student model is over, the temperature is set to 1 at prediction time. Figure 4 shows the overall steps of this technique.
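
The following sketch shows how the temperature softmax of Eq. (4) and the soft-label training of the student could be wired together in PyTorch. It is a hedged illustration: the temperature value, the optimizer and the model objects are assumptions, not the exact training code used in our experiments.

```python
import torch
import torch.nn.functional as F

def soft_labels(teacher, x, T):
    """Soft labels from the teacher, computed with the temperature softmax of Eq. (4)."""
    with torch.no_grad():
        return F.softmax(teacher(x) / T, dim=1)

def distillation_step(student, teacher, x, optimizer, T=20.0):
    """One training step of the distilled (student) model on the teacher's soft labels;
    the temperature value and optimizer are illustrative assumptions."""
    targets = soft_labels(teacher, x, T)
    log_probs = F.log_softmax(student(x) / T, dim=1)   # the student also uses temperature T
    loss = -(targets * log_probs).sum(dim=1).mean()    # cross-entropy with soft labels
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# At prediction time, the trained student is used with T = 1 (a plain softmax).
```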

This technique was found to significantly reduce the ability of traditional gradient-based untargeted attacks to build adversarial samples, because defensive distillation diminishes the gradients toward zero, so the standard objective function is no longer effective. To illustrate this fact, we performed a simple experiment using a test sample from the MNIST (Digit) dataset and drew the loss surfaces of the normal and distilled models along two different directions (one the loss gradient direction and one a random direction). As depicted in Fig. 5, the gradient of the distilled model diminishes to zero, and thus loss-based attacks have difficulty crafting adversarial samples for defensively distilled models. However, it was later demonstrated that more successful attack types, such as the CW and DeepFool attacks, can defeat the defensive distillation strategy. The reason we opt to employ this technique in our approach is that one can easily craft high-confidence examples near the decision boundary of a defensively distilled model and, due to the gradient vanishing property, it is effective in defending against loss gradient-based untargeted attack types.

Fig. 5 Loss surfaces of “normally trained” and “distilled” models

2.2.2 Adversarial training

Adversarial training is considered an intuitive defensive strategy in which the robustness of the deep learner is strengthened by training it with adversarial samples. This strategy can be represented mathematically as a Minimax game, as shown in Eq. 5:

$$\begin{aligned} \underset{\theta }{\min }\ \ \underset{\Vert \delta \Vert \le \epsilon }{\max }\ \ J(h_\theta (x+\delta ), y) \end{aligned}$$
(5)

where h denotes the model, J denotes the model’s loss function, \(\theta \) represents the model’s weights and y is the actual label. \(\delta \) is the perturbation added to the input x, and it is constrained by the given \(\epsilon \) value. The inner objective is maximized by employing the most powerful attack possible, which is often approximated by various adversarial attack types. The outer minimization objective then trains the model to reduce the loss resulting from the inner maximization step. This whole process produces a model that is expected to be resistant to the adversarial attacks used during its training. For adversarial training, Goodfellow et al. [15] used adversarial samples crafted by the FGSM attack, and Madry et al. [18] used the PGD attack to build more robust models, but at the expense of consuming more computational resources. Although adversarial training is often regarded as one of the most effective defenses against adversarial attacks, adversarially trained models are nevertheless vulnerable to attacks like CW.
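
As a rough illustration of Eq. (5), the sketch below approximates the inner maximization with a PGD-style attack and performs the outer minimization with an ordinary optimizer step. The hyper-parameter values and model/optimizer objects are assumptions of this sketch, not the exact adversarial training setup of [15] or [18].

```python
import torch
import torch.nn.functional as F

def adversarial_training_step(model, x, y, optimizer,
                              epsilon=0.03, alpha=0.007, steps=7):
    """One min-max step of Eq. (5): inner maximization approximated by a PGD-style
    attack, outer minimization by a standard optimizer step; all hyper-parameter
    values are illustrative assumptions."""
    # Inner maximization: find delta with ||delta||_inf <= epsilon that increases the loss.
    delta = torch.empty_like(x).uniform_(-epsilon, epsilon).requires_grad_(True)
    for _ in range(steps):
        loss = F.cross_entropy(model(x + delta), y)
        grad, = torch.autograd.grad(loss, delta)
        delta = (delta + alpha * grad.sign()).clamp(-epsilon, epsilon)
        delta = delta.detach().requires_grad_(True)
    # Outer minimization: update the model on the adversarially perturbed batch.
    optimizer.zero_grad()
    F.cross_entropy(model((x + delta).detach()), y).backward()
    optimizer.step()
```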

Adversarial ML is a very active field of research, and new adversarial defense approaches are constantly being presented. Among the most notable are the following: (i) the High-Level Representation Guided Denoiser (HGD) [27], which avoids the error amplification effect of a traditional denoiser by using the error in the upper layers of a DNN model as the loss function and thereby trains a more efficient image denoiser; (ii) APE-GAN [28], which uses a Generative Adversarial Network (GAN) trained with adversarial samples to eliminate the adversarial perturbation of an input image; (iii) Certified Defense [29], which proposes a new differentiable upper bound yielding a model certificate ensuring that no attack can cause the error to exceed a specific value; and (iv) [30], which uses several uncertainty metrics for detecting adversarial samples.

3 Preliminaries

Predictive models have traditionally been required to make decisions even in ambiguous cases where the model is unsure about its prediction, and this often leads to low-quality predictions. Assuming that the prediction of the model is always correct, without considering the model’s uncertainty, can have disastrous consequences. This has led researchers to develop different methods for uncertainty quantification in an attempt to improve model reliability.

We begin this part by discussing the main types of uncertainty in ML. Then, we go over how different uncertainty metrics can be quantified.

3.1 Uncertainty in machine learning

In ML, there are two main kinds of uncertainty: aleatoric and epistemic uncertainty [31,32,33]. Recently, apart from these main types, a new uncertainty metric named scibilic uncertainty has been introduced.

3.1.1 Epistemic uncertainty

Uncertainty due to inadequate knowledge and the limited data required for a perfect predictor is referred to as epistemic uncertainty [34]. As shown in Fig. 6, it can be divided into approximation uncertainty and model uncertainty.

Fig. 6 Different types of epistemic uncertainty

Approximation Uncertainty

In a traditional ML task, the learner is provided with data points that are independent and identically distributed. The learner then attempts to induce a hypothesis \({\hat{h}}\) from the hypothesis space \({\mathcal {H}}\) by selecting an appropriate learning method with its associated hyper-parameters and minimizing the expected loss (risk) under a chosen loss function \(\ell \). Nevertheless, what the learner actually does is try to keep the empirical risk \(R_\textrm{emp}\), an estimate of the real risk R(h), as low as possible. The induced \({\hat{h}}\) is an approximation of \(h^{*}\), the real risk minimizer and best possible hypothesis within \({\mathcal {H}}\). This leads to approximation uncertainty: the quality of the induced hypothesis is not ideal, and the trained model will be prone to errors.
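
In symbols (a standard formulation added here for clarity, not taken verbatim from the cited works), the learner minimizes the empirical risk as a proxy for the true risk:

$$\begin{aligned} R(h) = {\mathbb {E}}_{(x,y)}\left[ \ell (h(x), y)\right] , \quad R_\textrm{emp}(h) = \frac{1}{N} \sum _{i=1}^{N} \ell (h(x_i), y_i), \quad {\hat{h}} = \underset{h \in {\mathcal {H}}}{\arg \min }\ R_\textrm{emp}(h) \end{aligned}$$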

Model Uncertainty

Assume that the perfect predictor is not included in the hypothesis space \({\mathcal {H}}\). In that situation, the learner has no possibility of inducing a hypothesis that can effectively map all potential inputs to outputs. This results in a discrepancy between the ground truth \(f^{*}\) and the best possible function \(h^{*}\) within \({\mathcal {H}}\), which is referred to as model uncertainty.

The Universal Approximation Theorem, on the other hand, tells us that any target function f can be approximated by a neural network [35, 36]. For deep neural networks, the hypothesis space \({\mathcal {H}}\) can be extremely large; hence, it is reasonable to presume that \(h^{*} = f^{*}\). Model uncertainty can therefore be neglected in deep neural networks, leaving only the approximation uncertainty to be considered. As a result, the actual source of epistemic uncertainty in deep learning tasks is the approximation uncertainty. Epistemic uncertainty refers to the confidence a model has in its prediction [37]. Its fundamental cause is uncertainty about the model’s parameters, and this form of uncertainty is visible in areas where we have inadequate training data and the model weights are not properly tuned.

3.1.2 Aleatoric uncertainty

Aleatoric uncertainty relates to the variation in an experiment’s outcome caused by inherent random effects [38]. Even with adequate training examples, this form of uncertainty cannot be reduced [39]. The noise observed in a sensor’s measurement data is an excellent example of this phenomenon.

Fig. 7 Illustration of epistemic and aleatoric uncertainty

A simple nonlinear function (\(\hbox {logit}({0.085 \times x})\) in the interval \(x\in [0,12]\)) is presented in Fig. 7. Noisy samples are illustrated in the region at the right, where \(9<x<12\), and those samples lead to high aleatoric uncertainty. These points could, for example, reflect erroneous sensor measurements; one can deduce that the sensor generates errors around \(x=10.5\) for some unknown inherent reason. We can also argue that the figure’s central region represents an area of high epistemic uncertainty, because our model does not have enough training examples there to accurately represent the data.

3.1.3 Scibilic uncertainty

Reinhold et al. [40] proposed a new sort of uncertainty named scibilic uncertainty by combining epistemic and aleatoric uncertainty. This new uncertainty metric was employed in an image segmentation challenge to identify areas in an input image that the model could resolve how to predict if it were given enough data to train with. After quantifying epistemic and aleatoric uncertainty, we can compute scibilic uncertainty by dividing the former by the latter. The intuition behind scibilic uncertainty is as follows: for a suspicious input, a DNN model trained on naturally occurring data may yield high epistemic uncertainty. At the same time, due to some intrinsic property of the data, the model may also yield significant aleatoric uncertainty for that same input, making it difficult to make a reliable prediction. The division allows us to keep the part of the epistemic uncertainty that is not caused by the model’s difficulty with that particular input.

3.2 Quantifying uncertainty in deep neural networks

Numerous research studies have been conducted in recent years to quantify uncertainty in DNN models. The majority of these studies relied on Bayesian NNs, which quantify predictive uncertainty by learning the posterior distribution over the weights. However, Bayesian NNs have extra computing overhead and an inference problem. As a result, a number of approximations to Bayesian approaches have been proposed that employ variational inference [41,42,43,44]. Lakshminarayanan et al. [45], on the other hand, adopted a deep ensemble approach for uncertainty quantification as an alternative to Bayesian Neural Networks; however, this method involves training many models, which may be impractical in practice. Gal et al. [46] proposed a more elegant and efficient technique and demonstrated that an NN model with inference-time dropout corresponds to a Bayesian approximation of the Gaussian process. Their approach keeps the model in training mode at prediction time so that dropout remains enabled, and therefore functions as an ensemble: in each individual ensemble member, the system drops out part of the neurons in each layer of the network based on the dropout ratio. The variance of the MC dropout sampling outputs at prediction time is used to approximate the overall epistemic uncertainty. Later, Kendall and Gal [47] presented a technique in which both epistemic and aleatoric uncertainties are captured in a single model. They employed a CNN model f (a Bayesian NN) with weights \({\hat{\omega }}\) that maps an input x to \({\hat{y}}\) and \({\hat{\sigma }}^2\). In their approach, the model output is divided into two parts, a predictive mean \(({\hat{y}})\) and a predicted variance \(({\hat{\sigma }}^2)\). Consequently, the two types of uncertainty are quantified as follows:

$$\begin{aligned} \underbrace{\frac{1}{T} \sum _{t=1}^T \hbox {diag}({\hat{\sigma }}_t^2)}_\textrm{aleatoric} + \underbrace{\frac{1}{T} \sum _{t=1}^T ({\hat{y}}_t - {\bar{y}})^{\otimes 2}}_\textrm{epistemic} \end{aligned}$$
(6)

where T is the number of MC dropout samples drawn at prediction time while the model is in training mode, \({\bar{y}}=\sum _{t=1}^T {\hat{y}}_{t}/T\), and \(y^{\otimes 2} = yy^T\).

The method described above is elegant and has been demonstrated to be effective in computer vision applications such as image segmentation. Unfortunately, since the output of the model is divided into two parts for predicting the mean and variance terms, it was inconvenient for us to employ in adversarial machine learning experiments: the attack algorithms are developed to work with model architectures that have only a prediction output (no variance term), so we had to look for other options.

The method we employ in this study was proposed by Kwon et al. [48] as an alternative approach for quantifying both epistemic and aleatoric uncertainty in classification models. In the authors’ method, the variance of the prediction is decomposed into two parts that represent aleatoric and epistemic uncertainty. Let \({\hat{\omega }}\) be the trained weights of the neural network, let K denote the number of output classes, and let \(p(y^*|x^*,{\hat{\omega }})\) denote the prediction \(y^*\) of the model for a test sample \(x^*\) given the weights of the model, where \(y^* \in {\mathbb {R}}^{K}\); then their method is formulated as follows:

$$\begin{aligned} \hbox {Var}_{p(y^*|x^*,{\hat{\omega }})}(y^*) = {\mathbb {E}}_{p(y^*|x^*,{\hat{\omega }})}(y^{*^{\otimes 2}}) - {\mathbb {E}}_{p(y^*|x^*,{\hat{\omega }})}(y^*)^{\otimes 2} \end{aligned}$$
(7)
$$\begin{aligned} = \underbrace{\frac{1}{T} \sum _{t=1}^T \left[ \hbox {diag}\{p(y^*|x^*,{\hat{\omega }}_t)\} - p(y^*|x^*,{\hat{\omega }}_t)^{\otimes 2}\right] }_\textrm{aleatoric} \end{aligned}$$
(8)
$$\begin{aligned} \quad + \underbrace{\frac{1}{T} \sum _{t=1}^T \left\{ p(y^*|x^*,{\hat{\omega }}_t) - {\hat{p}}(y^*|x^*)\right\} ^{\otimes 2}}_\textrm{epistemic} \end{aligned}$$
(9)

where \({\hat{p}}(y^*|x^*) = \frac{1}{T} \sum _{t=1}^T p(y^*|x^*,{\hat{\omega }}_t)\)

Both Eqs. (8) and (9) produce a \(K \times K\) matrix whose diagonal elements represent the variance of each output class.

After we calculate epistemic and aleatoric uncertainty, we may simply compute scibilic uncertainty as follows:

$$\begin{aligned} \hbox {Scibilic} = \frac{\hbox {Epistemic}}{\hbox {Aleatoric}} \end{aligned}$$
(10)

Eventually, for a given input \({\textbf{x}}^*\), we have three different column vectors of shape \(K \times 1\), namely \(EP \in {\mathbb {R}}^{K}\), \(AL \in {\mathbb {R}}^{K}\) and \(SC \in {\mathbb {R}}^{K}\), whose elements represent the epistemic, aleatoric and scibilic uncertainty of each class, respectively.
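
As a concrete illustration, the sketch below computes the diagonal (per-class) parts of Eqs. (8) and (9) from T MC-dropout forward passes and then forms the scibilic vector of Eq. (10). It is a simplified sketch for a single input: the model object, the batch shape and the small constant added to the denominator are assumptions, not the exact implementation used in our experiments.

```python
import torch
import torch.nn.functional as F

def mc_dropout_uncertainties(model, x, T=50, eps=1e-12):
    """Per-class (diagonal) aleatoric and epistemic terms of Eqs. (8)-(9) and the
    scibilic vector of Eq. (10), estimated from T MC-dropout forward passes for a
    single input x of shape (1, C, H, W); the model is assumed to contain dropout
    layers and to output logits."""
    model.train()                                    # keep dropout active (MC dropout)
    with torch.no_grad():
        probs = torch.stack([F.softmax(model(x), dim=1)[0] for _ in range(T)])  # (T, K)
    p_bar = probs.mean(dim=0)                        # averaged class probabilities, (K,)
    # diagonal of diag(p_t) - p_t p_t^T, averaged over the T samples
    aleatoric = (probs * (1.0 - probs)).mean(dim=0)  # AL, (K,)
    # diagonal of (p_t - p_bar)(p_t - p_bar)^T, averaged over the T samples
    epistemic = ((probs - p_bar) ** 2).mean(dim=0)   # EP, (K,)
    scibilic = epistemic / (aleatoric + eps)         # SC, element-wise division as in Eq. (10)
    return epistemic, aleatoric, scibilic
```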

4 Approach

In regions with a low number of training samples, model uncertainty is larger, and we cannot obtain a model that perfectly predicts all test data; this can be explained by the absence of ground truth in these areas. Figure 8 displays the prediction outputs of a regression model trained on a small amount of data bounded by some interval. For this toy example, we trained a neural network with a single hidden layer of ten neurons to learn the linear function \(y = 2 \times x + 3\). As can be observed in the figure, the model’s (epistemic) uncertainty values derived from MC dropout estimates are high in places where we do not have training data, indicating that the quality of the prediction is low and the model has difficulty deciding the correct output values. Consistently, high loss values are observed in those regions. As a result, we can argue that regions of high epistemic uncertainty correspond to regions of low prediction accuracy. Therefore, testing the model in severe settings with input that it has never encountered before will lead to prediction failure [24]. Conversely, restoring input samples to the regions the model was trained on (low uncertainty regions) would yield more accurate predictions. In this study, we employ this idea, while paying attention to one key point: while minimizing the quantified uncertainty of an input sample, we make sure that the restoration operation has minimal effect on the model loss.

Fig. 8 Uncertainty values obtained from a regression model

4.1 Uncertainty-based reversal operation

We begin this section by presenting the pseudo-code of the uncertainty-based reversal procedure, as described in Algorithm 1. This reversal method is designed under the \(L_\infty \) norm.

In each iteration of our uncertainty-based reversal procedure, we compute \(\nabla _{\textbf{x}} \ell (h({\textbf{x}}_{t}), y_\textrm{pred})\) and \(\nabla _{\textbf{x}} U({\textbf{x}}_t,h,p,T)\). We then restore the input sample by minimizing its quantified uncertainty, utilizing the sub-directions of the uncertainty’s gradient that are not shared by the gradient of the loss with respect to the predicted class. A better understanding of this idea can be obtained by glancing at Fig. 9.

Fig. 9 Sub-directions used in the reversal procedure

In our proposed method, we use both loss and uncertainty information and exclusively use the sub-directions of the uncertainty’s gradient that are not shared by the gradient of the loss. The intuition behind this approach is as follows: in a conventional production setting where an ML model is employed for a classification problem, the input is supplied to the model, and the final prediction is observed after the input sample is processed and mapped to an output, as illustrated in the upper part of Fig. 10. For any input, the gradient of the loss with respect to the predicted label gives us an idea about the direction in which the loss can be minimized. However, if the prediction of the ML model is wrong, perturbing the image along that loss gradient direction only makes the model more confident in its wrong prediction, and the final prediction becomes even more inaccurate. Therefore, when minimizing the quantified uncertainty of the input sample, we need to get rid of the common sub-directions shared with the loss gradient. After rejecting these sub-directions, the remaining sub-directions of the uncertainty’s gradient can be utilized to safely return the input to its original data manifold; a sketch of one such reversal step is given below.
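
The following PyTorch-style sketch shows one plausible implementation of a single reversal step; it is not the authors’ exact Algorithm 1. In particular, treating coordinates where the uncertainty gradient and the loss gradient share the same sign as the “common sub-directions” to discard, the step size, the number of MC dropout passes and the use of the scibilic uncertainty of the predicted class are all assumptions of this illustration.

```python
import torch
import torch.nn.functional as F

def reversal_step(model, x, alpha=0.01, T=20, eps=1e-12):
    """One uncertainty-based reversal step. A plausible reading of the idea behind
    Algorithm 1, not the authors' exact code: the sign-based definition of "common
    sub-directions", the step size, T and the choice of the scibilic uncertainty of
    the predicted class are assumptions of this sketch."""
    x = x.clone().detach().requires_grad_(True)
    model.train()                                          # MC dropout stays active
    probs = torch.stack([F.softmax(model(x), dim=1)[0] for _ in range(T)])  # (T, K)
    p_bar = probs.mean(dim=0)
    y_pred = p_bar.argmax()
    epistemic = ((probs - p_bar) ** 2).mean(dim=0)
    aleatoric = (probs * (1.0 - probs)).mean(dim=0)
    u = (epistemic / (aleatoric + eps))[y_pred]            # scibilic unc. of the predicted class
    grad_u, = torch.autograd.grad(u, x, retain_graph=True)
    loss = F.nll_loss(p_bar.log().unsqueeze(0), y_pred.unsqueeze(0))  # loss w.r.t. predicted label
    grad_l, = torch.autograd.grad(loss, x)
    keep = (grad_u.sign() != grad_l.sign()).float()        # drop sub-directions shared with the loss gradient
    with torch.no_grad():
        x_restored = x - alpha * (grad_u * keep).sign()    # descend the uncertainty along kept directions
    return x_restored.clamp(0.0, 1.0).detach()
```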

Fig. 10 Options for ML model deployment

In this study, we use both the standard version, which is based on epistemic uncertainty, and two additional variants of the above procedure that we developed, which differ in the type of uncertainty metric employed and the way the output uncertainty vector is used. We started our experiments with the epistemic uncertainty obtained from Eq. 9 and used the expected value (mean) of the epistemic uncertainty vector (EP) for uncertainty quantification, as in [9]. We then tried scibilic uncertainty (SC) via Eq. 10 and used the mean of SC. Lastly, instead of using the average scibilic uncertainty over all classes, we used the uncertainty value of the predicted class only. In this way, we used the following three equations for uncertainty quantification; a small code mapping of these options is given after the equations.

$$\begin{aligned} U({\textbf{x}}_t,h,p,T) = \frac{1}{K} \sum _{k=1}^K EP[k] \end{aligned}$$
(11)
$$\begin{aligned} U({\textbf{x}}_t,h,p,T) = \frac{1}{K} \sum _{k=1}^K SC[k] \end{aligned}$$
(12)
$$\begin{aligned} U({\textbf{x}}_t,h,p,T) = SC[pred] \end{aligned}$$
(13)
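
Mapped to code, the three scalarizations of Eqs. (11)–(13) could look as follows; the argument names EP, SC and pred follow the vectors defined in Sect. 3.2, and the variant names are assumptions of this sketch.

```python
def uncertainty_scalar(EP, SC, pred, variant="scibilic_pred"):
    """The three scalarizations U(x_t, h, p, T) of Eqs. (11)-(13); EP and SC are the
    per-class epistemic and scibilic vectors of Sect. 3.2, `pred` the predicted class."""
    if variant == "epistemic_mean":    # Eq. (11)
        return EP.mean()
    if variant == "scibilic_mean":     # Eq. (12)
        return SC.mean()
    return SC[pred]                    # Eq. (13)
```
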
Fig. 11 Restoring a perturbed image back to its original class data manifold

Fig. 12 The effect of the uncertainty-based reversal procedure on the predictions of normally trained and distilled models

4.2 Analysis of the uncertainty-based reversal method

This technique can be applied as a reverse-perturbation operation before feeding any input into a classification model. As seen in the bottom part of Fig. 10, an input X that is intended to be presented to the ML model is first processed by the uncertainty-based reversal procedure. The goal of this reverse-perturbation operation is to judiciously perturb the input image in a way that reduces its quantified uncertainty. This “slightly reversed” image \({\hat{X}}\) is then fed into the ML model. Figure 11 illustrates the uncertainty-based reversal process. The crucial point is that the location of the input sample should not be too far away (on the incorrect side) from the decision boundary of the model for the reversal operation to succeed, and this is the major drawback of the standard uncertainty-based reversal operation. For the DeepFool attack and the CW attack with the confidence parameter set to 0, the perturbed samples generally reside close to the decision boundary. However, if one applies the CW attack with the confidence parameter set to a high value, the attack algorithm will generally craft high-confidence adversarial samples far away from the decision boundary. That is why, for a “normally trained” model, the success rate of the standard reversal operation is lower.

However, the defensive distillation technique can help us overcome this problem, because, during the training of a distilled network, what we actually do is force the model to learn to make high-confidence predictions. Therefore, at prediction time, distilled models mostly make high-confidence predictions in favor of the predicted class no matter where the input resides in its own data manifold [20], even when the test sample lies in the vicinity of the model’s decision boundary (whether on the correct or the wrong side). In this way, even if the attacker sets a high value for the confidence parameter of the CW attack, the algorithm can easily craft adversarial samples near decision boundaries, and the reversal procedure can then successfully restore the input back to its original data manifold. To demonstrate this phenomenon, we conducted an experiment using two different models, a normal model and a defensively distilled (student) model. For each of these models, we used the same random sample from the CIFAR10 dataset and applied the DeepFool attack to it to generate adversarial samples. Then, we applied the uncertainty-based reversal procedure to these perturbed samples and obtained the restored images. For each input, perturbed, and restored sample, we also show the softmax output scores of the normal and student models, as illustrated in Fig. 12. The attack algorithm and our reversal procedure variants are successful on both the normal and the distilled model. When we check the softmax output scores of the normal model for the perturbed sample in the first scenario, we see that there is not much difference between the prediction scores of the correct and wrong classes. However, we observe that the distilled model makes its prediction in favor of the predicted class with very high confidence. Since the DeepFool attack results in adversarial samples close to decision boundaries, this experiment verifies our intuition of using a defensively distilled model together with the uncertainty-based reversal procedure to force most of the successful adversarial samples to reside near the decision boundary.

Of course, for any kind of procedure that is planned to be applied to the input samples of a deployed model, a significant issue to consider is that this process should not have a highly negative impact on the model’s performance on clean data. Any modification to the model’s functioning that reduces prediction accuracy below an acceptable level cannot be permitted, regardless of how much robustness it delivers. We conducted comprehensive tests to determine the impact of the uncertainty-based reversal procedure on the model’s clean data performance and confirmed that the accuracy rate did not decline by more than a tolerable amount, as shown in the experiments section. The results show that this technique can be used to strengthen the robustness of deployed ML models against malicious attacks, especially in risky environments where security is an important concern.

4.3 Adversarial assumptions

In this work, we assume that the main objective of the adversary is to obtain the desired behavior from an ML model and that the attacker’s success criterion is tied directly to “any” labeling mistake. This type of attack strategy is classified in the literature as an untargeted attack, in which the attacker is considered successful if, for instance, a rifle image is predicted to be anything other than a rifle. Our assumption is that the attacker is fully aware of the architecture and parameters of the target model, as in a white-box setting. Another crucial assumption concerns the constraints of the attacker. Clearly, the attacker should be limited to applying a perturbation with an \(l_p\) norm up to a certain \(\epsilon \) value for the attack to be unrecognizable to the human eye. To ensure that this modification is imperceptible, the attacker must find an approximate solution to a difficult constrained optimization problem and identify which areas of the input should be modified. The adversary tries to decrease the classification performance of the target network as much as possible by employing any of the known attack algorithms, such as [15, 16, 18, 49]. For this study, we used the \(l_\infty \) and \(l_2\) norm metrics to restrict the maximum perturbation amount that an adversary can apply to an input sample. Finally, the error rate of our proposed defense technique is assessed as the percentage of successful attack samples, as proposed by Goodfellow et al. [15] and recommended by Carlini et al. [50].

5 Results

5.1 Experimental setup

For our experiments, we used two sets of models, normal and distilled (student), with the same architectures, and trained our CNN models on the MNIST (Digit) [51], MNIST (Fashion) [52] and CIFAR-10 [53] datasets. In the first group, our normally trained models attained accuracy rates of 99.11%, 92.61%, and 79.38%, whereas, in the second group, our distilled models attained accuracy rates of 99.41%, 92.62%, and 80.47%. The architectures of our CNN models and the hyper-parameters used in model training are listed in Tables 1 and 2. Lastly, for quantifying the uncertainty metrics, we set \(T = 50\) as the number of MC dropout samples.

Table 1 CNN architectures for normal and distilled models
Table 2 CNN model parameters
Table 3 Parameters that are used in our uncertainty-based reversal process: \(\alpha \) denotes the step size and i denotes # of reversal steps for a perturbation budget \(\epsilon \)
Table 4 Attack success rates of normally trained model on MNIST (Digit) dataset with and without uncertainty-based reversal procedure
Table 5 Attack success rates of normally trained model on CIFAR10 dataset with and without uncertainty-based reversal procedure

5.2 Experimental results

Throughout our experiments, we applied attacks to test samples only if they had previously been classified correctly by our models, because an attacker would obviously have no motivation to perturb samples that are already misclassified. We used an open source Python library called Foolbox [54] to implement the attacks used in this study (Footnote 2).
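
For reference, a hedged sketch of how such an attack might be run with Foolbox 3.x is shown below; the chosen attack class, the epsilon value and the library version are assumptions and may differ from the exact configuration used in our experiments. The images passed in are assumed to have already been filtered to correctly classified samples, in line with the procedure described above.

```python
import foolbox as fb

def attack_success_rate(model, images, labels, epsilon=0.03):
    """Sketch of crafting adversarial samples with Foolbox 3.x; attack class,
    epsilon and library version are assumptions of this illustration."""
    fmodel = fb.PyTorchModel(model.eval(), bounds=(0, 1))
    attack = fb.attacks.LinfPGD()               # e.g. fb.attacks.FGSM() for FGSM
    # returns raw adversarials, adversarials clipped to the epsilon-ball, and a success mask
    raw, clipped, success = attack(fmodel, images, labels, epsilons=epsilon)
    return success.float().mean().item()        # fraction of inputs that were successfully fooled
```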

Table 6 Effect of reversal procedure on clean performance of normally trained model—MNIST (Digit) Dataset
Table 7 Attack success rates of distilled model on MNIST (Digit) dataset with and without uncertainty-based reversal procedure
Table 8 Effect of reversal procedure on clean performance of distilled model—MNIST (Digit) dataset
Table 9 Attack success rates on MNIST (Fashion) dataset with and without uncertainty-based reversal procedure

We started our experiments by first evaluating the contributions of different uncertainty metrics to the performance of the uncertainty-based reversal procedure. To do this, we used normal CNN models trained on the MNIST (Digit) and CIFAR10 datasets and applied several different attack types to each sample to craft its adversarial counterpart. We then tried to restore those adversarial samples back to their original class manifolds by using each of the three variants of the reversal procedure. Table 3 summarizes the values of the parameters used in the reversal procedure.

Table 10 Effect of reversal procedure on clean performance—MNIST (Fashion) dataset
Table 11 Attack success rates on CIFAR-10 dataset with and without uncertainty-based reversal procedure

The results of the defense method variants are provided in Tables 4 and 5. As can be seen from the final attack success rates, the best robustness performance is achieved when we use the scibilic uncertainty value of the predicted class only. Looking at Table 4, we also observe a considerable difference between the reversal performances of scibilic and epistemic uncertainty (the standard uncertainty metric used in the initial proposal of the reversal procedure). For instance, in the case of the BIM attack, the attack success rate drops from 31.47% to 20.48% if we switch from epistemic uncertainty (Eq. 11) to scibilic uncertainty (Eq. 12). If, instead of the mean of the scibilic uncertainty vector, we use the uncertainty value of the predicted class only (Eq. 13), we can lower the final attack success rate even further, to 18.16%. We also observe that the difference between the reversal performances of the uncertainty metrics becomes less clear as the complexity and dimensionality of the dataset increase.

We then checked the effect of the reversal procedure on clean data performance. For this purpose, we applied the reversal strategy (the standard one with epistemic uncertainty and our two variants, separately) directly to each of the test samples of the MNIST (Digit) dataset and compared the resulting model accuracy values with those obtained without any reversal operation. As shown in Table 6, the reversal strategy has only a minimal and tolerable impact on model classification performance. Considering the level of robustness it provides, we can thus infer that the use of the uncertainty-based reversal strategy has no detrimental impact overall. We also do not observe a noticeable difference between the impacts of our variants on clean data classification performance. Therefore, we chose our second variant (scibilic uncertainty of the predicted class) as the base metric of our reversal strategy, and the rest of the experiments were conducted using it.

Although the reversal procedure performs very well against certain attack types such as DeepFool or CW (when the confidence parameter is set to a low value), for other loss-based attacks such as FGSM, BIM or PGD we still face some problems. The same is true if the CW attack is applied with the confidence parameter set to a high value. The reason is that, in those cases, the resulting adversarial samples generally lie far from the decision boundary of the model. To mitigate this problem, we employed another method, defensive distillation. The distillation technique diminishes the gradients of the model down to almost zero and also forces the model to make its predictions much more confidently. The former effect prevents loss-based untargeted attacks from using gradients efficiently and results in considerably lower attack success rates, while the latter effect results in high-confidence adversarial samples located close to the decision boundary. Therefore, when we combined the reversal procedure with defensive distillation, we achieved much better results. The results in Table 7 show that our proposed architecture (TENET) provides a very high degree of robustness against all of the attacks used and reduces the attack success rates down to 1% regardless of the attack algorithm, with only a negligible effect on clean data classification performance (Table 8).

To evaluate and validate the effectiveness of our proposed architecture, we conducted additional experiments on different datasets. Table 9 shows the performance of the reversal procedure on the MNIST (Fashion) dataset for both normal and distilled models.

Table 12 Effect of reversal procedure on clean performance—CIFAR10 dataset
Table 13 BPDA attack success rates
Table 14 Attack success rates on a normally trained Fashion MNIST model—Algorithm comparison
Table 15 Comparison of attack success rates with TENET and Adversarial Training
Table 16 Attack success rates on normally trained VGG19 model and VGG19 with TENET architecture

Table 10 shows the effect of our proposed architecture on clean data classification performance.

Finally, we performed the same set of experiments on the CIFAR-10 dataset. The results are available in Tables 11 and 12. The results of our detailed experiments on all the datasets reveal that our reversal procedure and the defensive distillation technique handle complementary situations and together provide a very high degree of robustness against various kinds of untargeted attacks.

In the last part of our experiments, we wanted to test our defense method against an adaptive attack idea. For this purpose, we compared the robustness of a normal model and of our proposed architecture, both equipped with non-differentiable components that obscure the gradients from the attacker. In this scenario, to attack the target models, we used the Backward Pass Differentiable Approximation (BPDA) approach via the Advertorch toolbox [55] and replaced the non-differentiable components (bit squeezing, median filter) with the identity function in the backward pass, as suggested by Athalye et al. [56]. The results, available in Table 13, show that adaptive attack ideas like BPDA can be successful against a defense approach that merely obscures the gradients from the attacker. However, when the same gradient masking-based defense approach is applied to our TENET architecture, the BPDA attack is no longer successful. The main reason behind the robustness of our proposed architecture against BPDA is that our method involves a defensive distillation step, which forces the gradients of the model toward zero for any gradient-based untargeted attack [57]. Hence, even if the attacker tries to circumvent the defense by using an approximate function in the backward pass, the computed gradients will still be useless for crafting a successful adversarial perturbation, as the target is a defensively distilled model.

5.3 Discussions and further results

We begin this part by showing the positive effect of removing, from the uncertainty’s gradient, the directions it shares with the loss gradient. The results in Table 14 show the attack success rates of the DeepFool and CW attacks against normal prediction and against our proposed defense method (with and without eliminating the common directions). For this experiment, we call the version of Algorithm 1 that does not discard the common directions “primitive” (omitting line 10 of Algorithm 1). As can be seen from the results, we can substantially increase the defensive performance once we discard the common directions. For perturbed input samples that have been pushed away from their decision boundaries, and which are therefore already classified wrongly, the gradient of the loss with respect to the predicted label points toward the wrong class data manifold; these sub-directions thus have a negative impact on reverting the input sample back to its own data manifold.

We then compared the performance of our proposed defense method with one of the most effective defense approaches in the literature, adversarial training. The results in Table 15 show that our proposed TENET architecture outperforms adversarial training in terms of robustness in all the experiments we conducted with different datasets.

Once we had evaluated the effectiveness of our proposed defense method on our comparably small models, we tested its performance on a considerably larger model. To do this, we first trained VGG-19 [58] models (with custom dropout layers) on the CIFAR-10 dataset and achieved accuracy rates of 90.46% and 89.47% for the normally trained and distilled models, respectively. We then applied different attack algorithms and compared the attack success rates of the TENET architecture with those of a standalone normal model. The results in Table 16 once again reveal the efficacy of our proposed defense approach. To use our technique effectively in transfer learning settings, the transferred model should already have been trained with dropout layers (layers located before the frozen part of the model).

As a last experiment, we measured the time spent by our defense method on an input batch of size 64 from the MNIST dataset. On our local machine, it took 1.51 s to make a prediction with our proposed defense method, compared to 4.12 ms for making a prediction directly without any preceding operation. As expected, the execution time of our defense method is longer than that of a single prediction due to the additional uncertainty quantification steps and the backward derivative operations.

6 Conclusion

In this study, we proposed a new defense architecture by significantly enhancing uncertainty-based reversal method and combining it with the utilization of defensive distillation technique. We evaluated and validated the effectiveness of our approach on three different datasets that are widely utilized in adversarial research field. The results of our extensive experiments suggest that our proposed architecture generalizes effectively across datasets and offers a very high degree of adversarial robustness without jeopardizing clean data classification performance.

In this research, we focused solely on the image domain and used only CNN models. However, we wonder whether the uncertainty-based reversal procedure is adaptable to other domains, such as audio or text, where different network models are used. Therefore, we plan to apply and evaluate the efficacy of our approach on other DNN architectures used in various domains. Finally, as future work, we would like to test our method in transfer learning scenarios and work on potential improvements for tackling the additional time and computational complexity introduced by our proposed defense method.