
1 Introduction

Deep neural networks (DNNs) have been used in a wide range of applications across different fields, such as face recognition [6], voice recognition [1] and sentiment analysis [30]. DNNs can also achieve state-of-the-art performance in tasks such as security verification in unconstrained environments, where very low false positive rates are required [6]. However, deep learning models have been shown to be vulnerable to adversarial samples: attackers can manipulate the model outcome by deliberately adding perturbations to the original samples [28].

In general, current approaches to attacking models can be categorised into two types: white-box attacks [12] and black-box attacks [22]. In white-box attacks, the attacker knows the relevant parameters of the target model and can formulate the most suitable attack method. In black-box attacks, on the other hand, the attacker does not have access to the model parameters. Given these characteristics, black-box attacks better reflect the adversarial performance of attacking samples in practice, which is useful for improving the robustness of deep learning models in real-world scenarios. Specifically, black-box attack methods fall into three types: query-based methods [14], transfer-based methods [8] and hybrid methods [9].

The objective of the query-based method is to interrogate the model to extract pertinent input or output information, and then utilize this limited information to iteratively generate optimal adversarial samples. However, such methods are subject to restrictions imposed by access permissions and often require multiple queries to obtain strong adversarial samples. The transfer-based method trains and generates adversarial samples on a local surrogate model whose information is fully known, and then transfers them to the target black-box model to test the attack success rate. Compared to query-based methods, transfer-based methods do not require additional access to the model and can bypass certain adversarial defense mechanisms aimed at queries. The hybrid method combines the principles of the query and transfer approaches. Although it can achieve a sufficiently high attack success rate, it is also susceptible to adversarial defense mechanisms targeting both queries and transfers. Therefore, in this paper, we focus on transfer-based methods.

As a common approach to transfer-based attacks, feature-level attacks attempt to maximise an internal feature loss by attacking the features of intermediate layers to improve the transferability of the attack [32]. The aim is to increase the weight of negative features in the middle layer of the model while decreasing the weight of positive features, so that more negative features are retained to help divert the model's predictions. However, it remains challenging to differentiate the middle-layer features harmoniously with feature-level attack methods, which are also prone to local optima [32]. Moreover, it is well known that the effectiveness of transfer-based black-box attacks is limited by overfitting to surrogate models and by specific adversarial defenses. To address these challenges, we propose to utilise neuron importance estimation for the middle layer to identify the adversarial features more accurately. In addition, we also evaluate the transferability of our proposed method on adversarially trained models, which is discussed in detail in Sect. 4. The results demonstrate that our method achieves favorable attack success rates even on target models protected by adversarial defenses.

To obtain adversarial samples with higher transferability, this paper presents the double adversarial neuron attribution attack (DANAA). DANAA attributes the model output to the middle-layer neurons, thereby measuring individual neuron weights and retaining the features that matter more for transferability. We use adversarial non-linear path selection to enrich the attacking points, which improves the attribution results. Extensive experiments on the benchmark datasets used in the literature have been conducted. The results show that DANAA achieves the best performance for the adversarial attacks. We anticipate this work will contribute to attribution-based neuron importance estimation and provide a novel approach for transfer-based black-box attacks. Our contributions are summarised as follows:

  • We propose DANAA, an innovative method that uses non-linear gradient update paths to achieve more accurate neuron importance estimation, providing a more in-depth study of the attribution route.

  • We present both theoretical and empirical investigation details for the attribution algorithm in DANAA, which is a core part of the method, in Sect. 3.

  • A comprehensive statistical analysis is performed based on our benchmarking experiments on different datasets and adversarial attacks. The results in Sect. 4 demonstrate the state-of-the-art performance of the DANAA method.

2 Related Work

In this section, we review the literature on white-box attacks, query-based black-box attacks, transfer-based black-box attacks, and hybrid black-box attacks.

2.1 Common White-Box Attacks

Previous work has demonstrated that neural networks are highly susceptible to misclassification when perturbations are added to test samples in advance. Such processed samples are called adversarial samples. The emergence of adversarial samples has led to the development of a range of adversarial defences to ensure model performance [16, 28, 29].

Currently, adversarial attacks can be divided into white-box attacks and black-box attacks depending on the level of information available about the model being attacked. There are various approaches for white-box attacks, such as gradient-based and GAN-based ones. Gradient-based white-box attacks include FGSM [12], I-FGSM [16], PGD [20] and \( C \& W\) [3]. Some recent GAN-based white-box attack methods are AdvGAN [33], GMI [37], KED-MI [4] and \( Plug \& Play\) [24]. While white-box attacks are effective in measuring the robustness of a model under attack, in real-world scenarios the parameters of the model are often not accessible, leading to the development of black-box attacks.

2.2 Query-Based Black-Box Attacks

Query-based attacks are a branch of black-box attacks that aim to train an effective adversarial sample by performing small-scale attacks on the target model to query model information, such as labels and confidence scores. These outputs can be used as part of the dataset to assist in training the attack algorithm and to verify its transfer to the black-box model. Ilyas et al. [14] were the first to propose a query-based black-box attack approach. Subsequently, they proposed combining priors from historical queries and data structures with gradient estimation based on bandit optimization, which greatly reduces the number of queries [15]. Li et al. [17] proposed a query-efficient boundary-based black-box attack method (QEBA) and proved that, for boundary-based attacks, gradient estimation over the entire gradient space is inefficient in terms of the number of queries. Andriushchenko et al. [2] proposed the square search attack method, which selects local square blocks at random locations in the image to search for and update the direction of the attack.

2.3 Transfer-Based Black-Box Attacks

The transferability of adversarial attacks refers to the applicability of adversarial samples generated on a local model to attacks on the target model. The attacker first uses the parameters obtained from attacking the local model to train adversarial samples, and then uses these samples to perform a black-box attack on the target model to verify the success rate.

There are three main categories of transfer-based black-box attacks, namely gradient calculation methods, input transformation methods and feature-level attack methods. Gradient calculation methods such as MIM [7], VMI-FGSM [31] and SVRE [35] improve transferability by designing new gradient updates. Input transformation methods such as DIM [34], PIM [11] and SSA [18] boost the transferability by using input transformations to simulate the ensemble process of the model, while feature-level attacks focus on the middle-layer features.

Some state-of-the-art feature-level attack methods include NRDM, FDA, FIA and NAA. NRDM [21] attempts to maximise the degree of distortion between neurons, but it does not take into account the roles of positive and negative features in the attack. FDA [10] averages neuron activation values to estimate the importance of a neuron; however, it does not distinguish the degree of each neuron's importance, and the discrimination between positive and negative features is still too low. FIA [32] multiplies neuron activation values by back-propagated gradients for estimation, but the estimate on the original input suffers from overfitting and is not accurate. NAA [36] effectively improves transferability and reduces computational complexity by attributing the model's output to an intermediate layer to obtain a more accurate importance estimation. However, its attribution method considers only a linear path in the gradient iteration process, and there is still room for improvement under non-linear path conditions.

2.4 Hybrid Black-Box Attacks

The hybrid method is a combination of the query-based and transfer-based methods. It not only exploits the prior knowledge provided by the transfer but also utilizes the gradient information obtained from queries, which addresses the high access cost of query attacks and the low accuracy of transfer attacks.

Dong et al. [5] proposed a hybrid method named P-RGF, which uses the gradient of a surrogate model as prior knowledge to guide the query direction of RGF and obtains the same success rate as RGF with fewer queries. Fu et al. [9] train a Meta Adversarial Perturbation (MAP) on a surrogate model and perform black-box attacks by estimating the gradient of the model, which has good transferability and generalizability. Ma et al. [19] introduced a Meta Simulator to black-box attacks based on the idea of meta-learning. By combining query-based and transfer-based attacks, they not only significantly reduce the number of queries but also reduce the query complexity by transferring adversarial samples trained on the surrogate model to the target model.

While there are different types of black-box attack methods, transfer-based attacks are considered the most convenient, as they do not require additional information queries to the model. However, they pose the challenge of achieving good transferability for the adversarial samples. Therefore, in this work, we target transfer-based attack methods. In particular, we introduce an attribution method for middle-layer feature estimation, which shows promising performance in our experiments.

3 Method

3.1 Preliminaries

When an adversarial attack on the target model can be successfully launched with adversarial samples trained on a local DNN model, we consider there to be a strong transferability relationship between the two models. Formally, consider a deep learning network \(N:R^{n}\rightarrow R^{c}\) and an original image sample \(x^{0}\in R^{n}\) whose true label is t. If an imperceptible perturbation \(\sum _{k=0}^{t-1}\bigtriangleup x^{k}\) is applied to the original sample \(x^{0}\), we may mislead the network N with the manipulated input \(x^{t}=x^{0}+\sum _{k=0}^{t-1}\bigtriangleup x^{k}\), also denoted \(x^{adv}\), towards a different label m. Denoting the output of sample x as N(x), the optimization goal is:

$$\begin{aligned} \left\| x^{t}-x^{0} \right\| _{n}<\epsilon \quad subject \ to\quad N(x^{t})\ne N(x^{0}) \end{aligned}$$
(1)

where \(\left\| \cdot \right\| _{n} \) represents the n-norm distance. Considering the activation values in the middle layers of network N, we denote the activation values of layer y as y and the activation value of its j-th neuron as \(y_{j}\).
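For concreteness, the condition in Eq. 1 can be checked as in the minimal sketch below; `model` is assumed to return class logits as a NumPy array and `eps` is the perturbation budget (both names are illustrative, not part of our released code).

```python
import numpy as np

def is_successful_attack(model, x0, x_adv, eps, ord=np.inf):
    """Check Eq. (1): the perturbation stays within the n-norm budget
    while the network's prediction changes."""
    within_budget = np.linalg.norm((x_adv - x0).ravel(), ord=ord) < eps
    prediction_flipped = model(x_adv).argmax() != model(x0).argmax()
    return within_budget and prediction_flipped
```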

3.2 Non-linear Path-Based Attribution

Inspired by [25] and [36], we define the attribution result of an input image \(x^{t}\) (with \(n \times n\) pixels) as

$$\begin{aligned} A:=\sum _{i=1}^{n^{2}} \int \bigtriangleup x_{i}^{t} \frac{\partial N(x^{t} )}{\partial x^{t} }\textrm{d}t \end{aligned}$$
(2)

As shown in Fig. 1, different from the NAA algorithm [36], we propose a new attribution idea that uses a non-linear gradient update path instead of the original linear path, which allows the model to find the optimal adversarial path itself. In Eq. 2, the gradient of N is iterated along the non-linear path \(x^{t}=x^{0}+\sum _{k=0}^{t-1}\bigtriangleup x^{k}\), where \(\frac{\partial N(x^{t})}{\partial x_{i}^{t}}\) is the partial derivative of N with respect to the i-th pixel. For each iteration, \(\bigtriangleup x^{t}=sign(\frac{\partial N(x^{t})}{\partial x^{t}})+N(0,\sigma )\), and we further scale this Gaussian-noise-perturbed update by a learning rate when updating the perturbation.
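As an illustration, the sketch below implements one such path step in PyTorch. Using the true-label logit as the scalar output N(x), as well as the function and variable names, are assumptions of this example rather than part of our reference implementation.

```python
import torch

def nonlinear_path_step(model, x_t, label, lr=0.0025, sigma=0.25):
    """One update along the adversarial non-linear path:
    delta_x = sign(dN/dx) + Gaussian noise, scaled by the learning rate."""
    x_t = x_t.clone().detach().requires_grad_(True)
    # N(x^t): here taken as the logit of the true label (one plausible choice).
    score = model(x_t).gather(1, label.view(-1, 1)).sum()
    grad = torch.autograd.grad(score, x_t)[0]
    delta = grad.sign() + sigma * torch.randn_like(x_t)
    return (x_t + lr * delta).detach()
```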

Fig. 1. Non-linear gradient update path diagram

Afterwards, we can approximate A by N(x) using basic calculus and extend the attribution result to each layer. The attribution formula can then be expressed as:

$$\begin{aligned} A_{y_{j}}:=\sum _{i=1}^{n^{2}} \int \bigtriangleup x_{i}^{t} \frac{\partial N(x^{t})}{\partial y_{j}({x^{t}})}\frac{\partial y_{j}(x^{t})}{\partial x^{t}}\textrm{d}t \end{aligned}$$
(3)

where \(A_{y_{j}}\) represents the attribution of the j-th neuron in layer y, with \(\sum A_{y_{j}}=A\). We provide the relevant proof of our non-linear path-based attribution in the following section.

3.3 Proof of Non-linear Path-Based Attribution

Given \(A_{y_{j}}\) as in Eq. 3, and assuming that the neurons in the middle layer of the deep neural network are independent of each other, \(A_{y_{j}}\) can be expressed as

$$\begin{aligned} A_{y_{j}}:=\int \frac{\partial N(x^{t})}{\partial y_{j}(x^{t})}\sum _{i=1}^{n^{2}} \bigtriangleup x_{i}^{t} \frac{\partial y_{j}(x^{t})}{\partial x^{t}}\textrm{d}t \end{aligned}$$
(4)

where \(\frac{\partial N(x^{t})}{\partial y_{j}(x^{t})}\) is the gradient of \(N(x^{t})\) with respect to the j-th neuron, and \(\sum _{i=1}^{n^{2}} \bigtriangleup x_{i}^{t} \frac{\partial y_{j}(x^{t})}{\partial x^{t}}\) is the perturbation-weighted sum of the gradients of \(y_{j}\) with respect to each pixel of \(x^{t}\) (\(x^{t}\in R^{n}\)). Since the two gradient sequences have zero covariance, we can convert Eq. 4 into:

$$\begin{aligned} A_{y_{j}}:=\int \frac{\partial N(x^{t})}{\partial y_{j}(x^{t})}\textrm{d}t\cdot \int \sum _{i=1}^{n^{2}} \bigtriangleup x_{i}^{t} \frac{\partial y_{j}(x^{t})}{\partial x^{t}} \textrm{d}t \end{aligned}$$
(5)

Applying the fundamental theorem of calculus along the path, we can show that

$$\begin{aligned} \int \sum _{i=1}^{n^{2}} \bigtriangleup x_{i}^{t} \frac{\partial y_{j}(x^{t})}{\partial x^{t}} \textrm{d}t=y_{j}^{t}-y_{j}^{0} \end{aligned}$$
(6)

We then denote \(y_{j}^{t}-y_{j}^{0}\) as \(\bigtriangleup y_{j}^{t}\), and Eq. 5 can be converted into

$$\begin{aligned} A_{y_{j}}:=\bigtriangleup y_{j}^{t}\int \frac{\partial N(x^{t})}{\partial y_{j}(x^{t})}\textrm{d}t \end{aligned}$$
(7)

We denote \(\int \frac{\partial N(x^{t})}{\partial y_{j}(x^{t})}\textrm{d}t\) as \(\gamma (y_{j})\), i.e., the accumulated gradient of network N along our non-linear path with respect to the j-th neuron. We then obtain \(A_{y_{j}}=\bigtriangleup y_{j}^{t}\cdot \gamma (y_{j})\). Since the neuron \(y_{j}\) lies in the middle layer y, the attribution result of layer y can finally be expressed as

$$\begin{aligned} A_{y}=\sum _{y_{j}\in y}A_{y_{j}}=\sum _{y_{j}\in y}\bigtriangleup y_{j}^{t}\cdot \gamma (y_{j})=\bigtriangleup y^{t}\cdot \gamma (y) \end{aligned}$$
(8)
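In practice, the path integral defining \(\gamma (y_{j})\) is evaluated over the T discrete points \(x^{0},x^{1},\dots ,x^{T-1}\) generated by the non-linear path, so the attribution is approximated by a finite sum. The \(\frac{1}{T}\) normalisation below is a convention of this discretisation; being a constant factor shared by all neurons, it does not change their relative importance:

$$\begin{aligned} \gamma (y_{j})\approx \frac{1}{T} \sum _{k=0}^{T-1}\frac{\partial N(x^{k})}{\partial y_{j}(x^{k})},\qquad A_{y}\approx \bigtriangleup y^{t}\cdot \gamma (y) \end{aligned}$$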
Algorithm 1. Double Adversarial Neuron Attribution Attack

Algorithm 1 shows the specific pseudocode structure of our DANAA algorithm with Non-linear Path-based attribution.
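For illustration, a minimal PyTorch sketch of the two stages of Algorithm 1 is given below: attribution along the adversarial non-linear path, followed by a momentum-based attack on the attributed features of the target layer. The helper names, the use of the true-label logit as N(x), and the simple aggregate-attribution loss are simplifying assumptions of this sketch; the full implementation is available in our open-source repository.

```python
import torch

def danaa_attack(model, feature_layer, x0, label, eps=16/255, steps=15,
                 alpha=1.07/255, mu=1.0, path_len=30, lr=0.0025, sigma=0.25):
    """Simplified DANAA sketch: (1) estimate gamma(y) along the adversarial
    non-linear path, (2) run a momentum iterative attack that minimises the
    aggregate attribution A_y = (y - y^0) * gamma(y) of the target layer.
    eps and alpha are on a [0, 1] pixel scale (16 and 1.07 on 0-255 in Sect. 4.5)."""
    feats = {}
    hook = feature_layer.register_forward_hook(
        lambda m, i, o: feats.__setitem__("y", o))

    # ---- Stage 1: attribution along the non-linear path ----
    x_t, gamma, y0 = x0.clone(), 0.0, None
    for _ in range(path_len):
        x_t = x_t.detach().requires_grad_(True)
        score = model(x_t).gather(1, label.view(-1, 1)).sum()   # N(x^t)
        grad_y, grad_x = torch.autograd.grad(score, [feats["y"], x_t])
        if y0 is None:
            y0 = feats["y"].detach()                 # y(x^0)
        gamma = gamma + grad_y / path_len            # discrete approximation of the integral
        # non-linear path step: sign of the gradient plus Gaussian noise
        x_t = x_t + lr * (grad_x.sign() + sigma * torch.randn_like(x_t))

    # ---- Stage 2: momentum attack on the attributed features ----
    x_adv, g = x0.clone(), torch.zeros_like(x0)
    for _ in range(steps):
        x_adv = x_adv.detach().requires_grad_(True)
        model(x_adv)                                  # forward pass refreshes feats["y"]
        loss = ((feats["y"] - y0) * gamma).sum()      # aggregate attribution of layer y
        grad = torch.autograd.grad(loss, x_adv)[0]
        g = mu * g + grad / grad.abs().mean()         # MIM-style momentum accumulation
        x_adv = x_adv - alpha * g.sign()              # descend to suppress positive attribution
        x_adv = x0 + (x_adv - x0).clamp(-eps, eps)    # project back into the L_inf ball
        x_adv = x_adv.clamp(0, 1)

    hook.remove()
    return x_adv.detach()
```

In this sketch, `feature_layer` corresponds to one of the target layers listed in Sect. 4.5 (e.g., Mixed_5b for Inc-v3), and the returned `x_adv` is subsequently evaluated on the black-box target models.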

4 Experiments

Extensive experiments have been conducted to demonstrate the efficiency of our method. The following sections cover the datasets, benchmark models and metrics used, as well as the experimental settings. We perform five rounds of benchmarking experiments to compare our algorithm with the other methods, demonstrating the superiority of our approach over the baselines in terms of transferability of adversarial attacks. Moreover, we conduct an ablation study of our approach, focusing on the impact of different learning rates and noise deviations on attack transferability.

4.1 Dataset

Following other literature methods, we consider the widely-used dataset from the NAA work [36]. The dataset consists of 1000 images of different categories randomly selected from the ILSVRC 2012 validation set [23], which we call the multiple random sampling (MRS) dataset.

4.2 Model

We include four widely-used models for image classification tasks, namely Inception-v3 (Inc-v3) [27], Inception-v4 (Inc-v4) [26], Inception-ResNet-v2 (IncRes-v2) [26], and ResNet-v2-152 (Res152-v2) [13], as source models for assessing the attacking performance of our algorithm. We start with these four pretrained models without adversarial training. We then construct more robust target models for an in-depth comparison by applying adversarial training to the pretrained models. This yields two adversarially trained models, Inception-v3 (Inc-v3-adv) and Inception-ResNet-v2 (IncRes-v2-adv) [16]. The remaining three models are ensembles: the ensemble of three adversarially trained Inception-v3 models (Inc-v3-adv-3), the ensemble of four adversarially trained Inception-v3 models (Inc-v3-adv-4), and the ensemble of three adversarially trained Inception-ResNet-v2 models (IncRes-v2-adv-3), following the work of [29]. In [29], the sub-models of each ensemble are trained independently and their results are finally weighted and combined, increasing the accuracy and robustness of the model.

4.3 Evaluation Metrics

The attack success rate is selected as the evaluation metric. It measures the proportion of samples for which the attacked model produces incorrect label predictions; hence, a higher success rate indicates a more effective attack method.
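As a concrete reference, the metric can be computed as in the minimal sketch below, where `model` is assumed to return logits for a batch of adversarial samples (the function name is illustrative).

```python
import numpy as np

def attack_success_rate(model, x_adv, y_true):
    """Proportion of adversarial samples that the target model misclassifies."""
    predictions = model(x_adv).argmax(axis=-1)
    return float(np.mean(predictions != y_true))
```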

4.4 Baseline Methods

For comparison, we select five state-of-the-art attack methods as baselines: MIM [7], NRDM [21], FDA [10], FIA [32], and NAA [36]. Furthermore, to test the effect of combining each method with input transformation methods and to verify the superiority of our algorithm, we apply both DIM and PIM to the attack methods. The implementation details can be found in the open-source repository. Consequently, we extend the comparison set with MIM-PD, NRDM-PD, FDA-PD, FIA-PD, NAA-PD and DANAA-PD, respectively.

4.5 Parameter Setting

In the experiments, we set the parameters as follows: the learning rate (lr) is 0.0025; the noise deviation is 0.25; and the maximum perturbation is 16, which is derived from the number of iterations (15) and the step size (1.07). The batch size is 10, and the momentum of the optimization process is 1. Since we introduce the DIM and PIM algorithms to verify the superiority of our model when combined with input transformation methods, we set the transformation probability of DIM to 0.7, and the amplification factor and kernel size of PIM to 2.5 and 3, respectively. For the target layer of the attack, we choose the same layer as in NAA. Specifically, we attack the InceptionV3/InceptionV3/Mixed_5b/concat layer for Inc-v3, the InceptionV4/InceptionV4/Mixed_5e/concat layer for Inc-v4, the InceptionResnetV2/InceptionResnetV2/Conv2d_4a_3x3/Relu layer for IncRes-v2, and the ResNet-v2-152/block2/unit_8/bottleneck_v2/add layer for Res152-v2 [36].
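For reference, the settings above can be collected into a single configuration; the dictionary and key names below are illustrative and not taken from our released code.

```python
# Hyperparameters reported in Sect. 4.5 (names are illustrative).
DANAA_CONFIG = {
    "learning_rate": 0.0025,      # lr for the non-linear path updates
    "noise_deviation": 0.25,      # std of the Gaussian noise on the path
    "max_perturbation": 16,       # L_inf budget on the 0-255 pixel scale
    "num_iterations": 15,
    "step_size": 1.07,            # 15 * 1.07 ~= 16
    "batch_size": 10,
    "momentum": 1.0,
    "dim_transform_prob": 0.7,    # DIM transformation probability
    "pim_amplification": 2.5,     # PIM amplification factor
    "pim_kernel_size": 3,         # PIM kernel size
}
```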

4.6 Result

All experiments are carried out on an RTX 2080Ti GPU. A detailed replication package can be found in the open-source repository at https://github.com/Davidjinzb/DANAA. We compile the results of all attack methods, without and with the input transformation methods (suffixed with -PD), in Table 1.

Table 1 shows that DANAA retains strong and robust performance across all models in comparison with the other attack methods. In particular, DANAA demonstrates notable improvements on the five adversarially trained models. The largest improvement is observed over the NAA method [36], which is generally the second-best method in the comparison, with a gain of 9.0%. Across all local models, our approach achieves an overall average improvement of 7.1% over NAA on the adversarially trained models. With the PD variants, our method achieves a maximum improvement of 9.8% over NAA-PD and an overall average improvement of 7.3% on the adversarially trained models.

Table 1. Attack success rate of multiple methods on different models

4.7 Ablation Study

In this section, we investigate the impact of the learning rate and Gaussian noise deviations on the performance of the proposed method.

4.7.1 The Impact of Learning Rates.

Experiments are conducted using learning rates of different scales: 0.25, 0.025, 0.0025 and 0.00025. In Fig. 2, the DANAA method exhibits the highest attack success rate for nearly all models when the learning rate is 0.0025. In Fig. 3, the same setting also achieves the highest attack success rates on most models for the DANAA-PD method.

Notably, when using Inception-ResNet-v2 as the source model, although DANAA-PD with a learning rate of 0.0025 ranks only second in attack success rate on the models without adversarial training, its effectiveness on the adversarially trained models is still much higher than at other learning rates.

4.7.2 The Impact of Gaussian Noise Deviation (Scale).

To verify the effect of adding Gaussian noise to the gradient update on transferability, we test different noise deviations in this subsection. As shown in Fig. 4 and Fig. 5, five different scales of the Gaussian noise deviation, ranging from 0.2 to 0.4, are used in this experiment. In general, a higher scale tends to give better results on the normally trained models while sacrificing performance on the more robust ones. Conversely, a lower scale yields smaller improvements on the normally trained models but better performance on the adversarially trained models. Accordingly, a scale of 0.25 is selected for the optimal overall performance in this paper.

Fig. 2. DANAA attack success rate performance at different learning rates

Fig. 3. DANAA-PD attack success rate performance at different learning rates

Fig. 4. DANAA attack success rate performance at different noise deviations

Fig. 5. DANAA-PD attack success rate performance at different noise deviations

5 Conclusion

In this paper, we propose the double adversarial neuron attribution attack (DANAA) to achieve enhanced transfer-based adversarial attack results. Compared with other literature methods, our method obtains better transferability for the adversarial samples. To derive more accurate importance estimates for the middle-layer neurons, we first introduce a non-linear path into the perturbation update process. By calculating the gradient along this non-linear path, the DANAA algorithm improves the attack success rate on the adversarially trained models by up to 9.0% over the second-best method, with an average overall improvement of 7.1%. When combined with the input transformation methods DIM and PIM, our DANAA-PD algorithm also achieves a maximum improvement of 9.8% and an average overall improvement of 7.3% compared to the NAA-PD algorithm. Extensive experiments have demonstrated that the attribution-based attack proposed in this paper achieves state-of-the-art performance, with greater transferability and generalisation capabilities.