1 Introduction

Breakthroughs in the theory and technology of deep learning have promoted the application of deep neural networks (DNNs) across diverse visual tasks, including autonomous driving [6, 7], medical diagnosis [4, 5], face recognition [2, 3], and image classification [1]. However, recent research has revealed the vulnerability of DNNs to adversarial attacks, which add a carefully designed perturbation to an image to mislead DNNs. Researchers have proposed a range of defense methods for building safe and reliable deep-learning systems. Research on adversarial attacks is also vital for uncovering defects in DNNs and enhancing their robustness. In particular, researchers explore how different attack methods (e.g., adversarial example attacks) can expose the vulnerabilities of models and how defense techniques (e.g., adversarial training) can enhance the robustness and resilience of models.

Based on the information about the target model that is accessible to the attacker, black-box and white-box attacks have developed into two main branches [19, 20]. White-box attacks allow attackers full access to the target model and manipulate input images to optimize adversarial objectives. However, attackers are typically unable to access the complete information of a target model, which inspires studies on black-box attacks. Black-box attacks are grouped into query attacks and transferable attacks based on whether a surrogate model is required. Query attacks [12, 13] repeatedly query a target model and use its feedback to generate adversarial samples. Transferable attacks [10, 11, 16, 19] fool an unknown target model using an adversarial sample generated by a source model. Query methods estimate gradients to update the adversarial perturbations/examples but consume excessive numbers of queries, which limits their practicality. Instead, transferable attacks, which rely on the adversarial transferability of examples across models and use surrogate models in place of target models, have received more attention.

Existing attacks [10, 14] suffer from limited transferability because adversarial examples tend to overfit to the source model. During attacks, adversarial samples may fall into a local optimum of the source model. Such local optimal solutions cannot be efficiently transferred across different models, thus limiting the practicality of attacks. Strategies for improving transferability include data augmentation, gradient estimation, appropriate optimization objectives, ensemble models, and analysis of specific models. Some studies have addressed overfitting to surrogate networks through data augmentation, e.g., translation [11] and random transformation [16]. In terms of gradient estimation, several studies [10, 19, 29] solve the optimization problem using momentum [10], Nesterov acceleration [26], variance tuning [19], etc. In addition, many studies [17, 18, 20, 23, 27] design intermediate-layer based loss functions for higher transferability, which drives the development of feature-level attacks. Based on the findings of [28], some researchers have attempted to achieve more transferable adversarial attacks by utilizing ensemble models [9, 28,29,30]. In a recent study [31], skip connections are used to generate transferable adversarial examples. Nevertheless, existing feature-level attacks disturb the features/attentions in a deterministic gradient-based manner. The generated perturbations minimize or maximize the given loss function along a relatively deterministic path, which lacks diversity. Therefore, these attacks fail to explore a rich set of local optima and thus suffer from limited transferability. In [45], DNNs are used to parameterize the adversary's generators for producing perturbations. These DNNs learn to produce adversarial perturbations from latent codes and randomly disrupt various prominent features for transferability. However, attentive diversity attacks [45] rely on a generative adversarial network (GAN) that requires extensive training, and the GAN itself is another intricate and incomprehensible neural network.

Ensemble is one of the most powerful techniques for enhancing DNN performance. Allen-Zhu et al. [47] investigate the working mechanism of ensembles in deep learning by considering the learning of multi-view data. Their analysis demonstrates that an ensemble of individual networks with random initializations can extract more comprehensive features. Building on the success of the ensemble method, Xu et al. [48] introduce self-ensemble. Model fine-tuning is improved via self-ensemble and self-distillation, in which knowledge extracted from the ground truth and intermediate models improves fine-tuning efficiency. Self-distillation allows models to benefit from each other, while self-ensemble improves model performance by aggregating intermediate pre-trained models from different time points in the past as base models. In addition, some studies propose ensemble adversarial attacks for transferability. By treating iterative ensemble attacks as a gradient descent process, researchers [16] decrease the variance of gradients during the ensemble process to boost transferability. Xiong et al. [30] adopt two ensemble strategies and demonstrate that greater diversity in surrogate ensembles facilitates stronger transferability.

This study aims to enhance the transferability of adversarial perturbations. We propose a self-ensemble based feature-level adversarial attack (SEFA), which introduces diverse initializations. In particular, we design a feature-level optimization objective with respect to the perturbations and sample orthogonal initial perturbations as inputs to the optimization. The diverse orthogonal initializations guide the optimization process to search the perturbation space as thoroughly as possible, preventing adversarial perturbations from being trapped in model-specific local optima and enhancing their transferability. To reduce the impact of model-specific information, the refined aggregate gradients, used as feature importance, eliminate the noise caused by aggregation and direct the optimization objective to concentrate on the salient features corresponding to gradients with higher intensity, which guides the perturbation toward a more transferable direction.

Our contributions are summarized as follows.

  • We introduce self-ensemble for increasing the randomness of the initial perturbations, i.e., sampling orthogonal low-dimensional vectors and fitting them to the perturbation space as the initial perturbations.

  • We propose a self-ensemble based adversarial attack in the feature space that disrupts the salient and critical features identified by the refined aggregate gradient in a stochastic manner.

  • We develop the combination of SEFA and other transferability enhancement methods, which generates more transferable adversarial perturbations and demonstrates the flexibility of the proposed method.

The remainder of this paper is organized as follows. In Section II, we provide a concise review of methods pertaining to adversarial attacks and defenses. Section III introduces the preliminaries. Section IV details the proposed SEFA. Section V presents empirical evaluations of SEFA and its comparisons with some baseline attacks. Finally, Section VI presents the conclusions of this research.

2 Related work

In this section, we introduce the literature related to adversarial attacks and adversarial defenses.

2.1 Adversarial attack

Szegedy et al. uncover the adversarial vulnerability of DNNs [15], stating that imperceptible perturbations can manipulate the decisions of neural networks. They reveal two properties of adversarial examples: misleading and imperceptible. Initially, studies focused on adversarial attacks against image classification. Subsequently, many attempts prove that the vulnerability exists in diverse mainstream visual tasks, e.g., face recognition [2, 3], smart healthcare [4, 5], and autonomous driving [6, 7]. The generation of adversarial examples has gradually developed into a dedicated technique, namely adversarial attacks. The transferability of adversarial perturbations seriously affects the development of deep learning, especially DNN-based safety- and security-sensitive applications. Moreover, utilizing adversarial perturbations to boost the robustness of DNNs and investigate their defects has gradually become a key issue. Therefore, adversarial attacks have received considerable attention among researchers.

Mainstream explanations for the existence of adversarial examples focus on the linear nature of neural networks and the non-robust features of datasets. One popular hypothesis attributes adversarial examples to the high-dimensional linear property of models [8], which also explains their generalization across datasets and models. [21] demonstrates that the adversarial nature of examples results from non-robust features, which are effective sources for achieving higher accuracy in neural networks and provide an explanation for transferability. The authors perform training and testing on both robust and non-robust features, highlighting that humans have a limited perception of non-robust features, yet these features significantly influence the decisions of the model.

Black-box attacks and white-box attacks are the two main branches of adversarial attacks. In white-box attacks, the attackers know the detailed information about the target model and can accurately generate adversarial perturbations by computing gradients of the optimization objective with respect to the images. However, in practice, attackers usually lack access to target models. In comparison, black-box attacks can work without model details, which is practical but challenging. Query attacks and transferable attacks have evolved into the two main categories of black-box attacks, where query attacks do not require surrogate models. Query attacks craft adversarial examples relying on approximated gradients obtained through queries. Specifically, a model assigns scores (i.e., soft labels) to admissible labels for a given input and selects the label with the highest score as the final decision (i.e., hard label). With soft labels, score-based attacks generate adversarial samples based on the responses of the target model to fool that model. Instead, decision-based attacks require only the hard labels, estimating gradients and updating adversarial examples to adjust them along the decision boundary. However, query-based attacks are impractical in real-world scenarios because of the large number of queries required. More practical and flexible transfer-based attacks [9,10,11, 19,20,21,22], which rely on the transferability of adversarial perturbations, have received widespread attention. Adversarial samples are generated by surrogate models for attacking unknown target neural networks.

Many studies attempt to enhance the transferability of adversarial perturbations. [16] introduces random transformation (i.e., input diversity), and Dong et al. [11] exploit the translation invariance of neural networks, translating the gradients during iterations to enhance robustness. Exploring the relationship between [16] and [11], [24] proposes resized-diverse-inputs (RDIM) and diversity-ensemble (DEM) to further boost the transferability of perturbations. They aggregate multi-scale gradients generated by RDIM with region fitting during iterations to generate transferable adversarial perturbations. [25] aggregates the gradients of adversarial samples and their neighboring points during iterations to stabilize the oscillation of update directions, which boosts the transferability of adversarial samples to diverse target networks to a certain extent through data augmentation.

Moreover, improving the optimization method for adversarial attacks, i.e., the gradient ascent procedure, is also a feasible direction. Momentum [10], the Nesterov accelerated gradient [26], and variance tuning [19] are adopted during the iterations to find better local optima by avoiding oscillations in the updates. Intuitively, a series of gradient descent methods can be used to address the optimization problem in adversarial attacks.

Aggregating surrogate networks improves the probability of finding transferable adversarial perturbations. In order to transfer the adversarial examples to unknown target networks, Liu et al. [28] aggregate the gradients of the surrogate models. They demonstrate that it is challenging for the target adversarial samples to transfer together with their target labels to other models, since the label distributions around the source labels differ among models with different architectures, even if the architectures are similar. Li et al. [29] fuse diverse feature-based models via vertical ensemble, devising ghost networks of the source model for transferability. Xiong et al. [30] demonstrate that there are variances between different model gradients in the aggregation process. They reduce the variance by stochastic variance reduction for stabilizing gradient update directions, making updated gradients more general to the other models. By contrast, Wu et al. [31] introduce decay parameters to reduce the gradients of residual modules during the computation of model gradients. With the skip connections similar to ResNet, they generate adversarial samples with higher transferability.

Based on the conclusion that different models share similar features, many attacks disturb the intermediate layers rather than the output layer, maximizing internal feature distortion to achieve higher transferability. Naseer et al. [17] maximize the distance between adversarial intermediate features and legitimate intermediate features, which pushes adversarial images away from the original images. Ganeshan et al. [18] utilize a discriminative criterion, namely the mean of channels, to guide the optimization. This method enhances the features that do not support the ground truth but suppresses those supporting the true class to deceive the source network and target networks. However, it may fall into a local optimum of a particular network. By contrast, the feature importance-aware attack (FIA) [20] designs an appropriate optimization objective and aggregates gradients to eliminate model-specific information and generate adversarial perturbations. Lu et al. [27] delve into transferability across vision tasks and achieve powerful cross-task adversarial attacks through dispersion reduction. However, these attacks perturb examples along the gradients during the iterations and lack stochasticity; therefore, they often become trapped in poor model-specific optima and exhibit limited transferability.

2.2 Adversarial defense

Adversarial examples can be used to investigate the internal shortcomings of DNNs and to improve their robustness. Researchers have proposed many adversarial defense approaches [32,33,34,35,36,37,38,39,40,41] to boost the robustness of neural networks, which are categorized into detection-only and complete defense. The goal of complete defense is to make the outputs of the target model consistent with expectations; for example, a classification model should correctly classify adversarial samples. In contrast, detection-only methods aim to detect potential adversarial examples and reject them.

Complete defense consists of two mainstream directions: modified training/input and modified networks. Many works improve model robustness through adversarial training. Ding et al. [32] propose consensus-based enhancement samples for adversarial defense. The intensities of the red, green, and blue components of the image are exchanged to generate the enhancement samples, and the original and consensus samples are used to train the model. In the testing phase, the prediction results of the test and consensus samples are counted; the category corresponding to the maximum value above a threshold is the final classification result, and if the maximum value is below the threshold, the test sample is judged to be adversarial. Lau et al. [33] propose a joint spatial attack to generate adversarial perturbations against images and intermediate features, and then use the mixup method to provide interpolated images for the attack to enhance adversarial training. Yin et al. [34] demonstrate that the difference in feature distribution between original and adversarial samples leads to a trade-off between accuracy and robustness. [34] utilizes a class-conditional discriminator to learn class-discriminative and attack-invariant features, i.e., to learn similar distributions for the original samples and various attack samples; the neural networks endeavor to learn domain-invariant features to deceive the class-conditional discriminator. Liu et al. [35] show the vulnerability of adversarial training against transferable adversarial samples; they introduce linear robustness, approximate it with the Jacobian norm, and additionally employ perturbation-based saliency map regularization to enhance interpretability. Li et al. [36] analyze the problem that standard gradient regularization leads to inconsistency between model robustness and gradient saliency, and propose a significance-based gradient regularization that reduces the performance gap by introducing gradient significance into the regularization training.

In terms of modifying the input, researchers have proposed a variety of defense strategies. [37] proposes a new defense method that reconstructs legitimate samples using collaborative GANs to filter the perturbation noise in adversarial samples; the robustness of the model is improved by training an attacker model to generate adversarial samples and a defender model to reconstruct the original samples. Zhang et al. [38] propose meta-invariant defense as an attack-independent defense method to achieve generalizable robustness against unknown adversarial attacks. Jia et al. [39] address the overfitting problem in fast adversarial training and propose a prior-guided adversarial initialization. Zhao et al. [40] emphasize the limitations of point-by-point adversarial sampling and introduce variational adversarial defense for more robust decision boundaries. Niu et al. [41] experimentally demonstrate the correlation between perturbations and image pixels and the effectiveness of simultaneously eliminating perturbations in multiple frequency bands. [41] proposes compressing multiple frequency bands simultaneously to reduce the perturbation and the space to which perturbations can attach, thereby purifying adversarial samples: the lower frequency bands are downsampled to disperse the perturbations, and the channel size in the higher frequency bands is compressed.

Several studies focus on detecting adversarial samples. Nowroozi et al. [42] extract random features from the flatten layer of the source network as input to the target network; multiple target networks are then trained to detect adversarial samples. [43] and [44] propose a multi-classifier architecture for image tampering detection and adversarial attack detection. Considering the security of the closed decision of one-class classification and the good performance of two-class classification, [43] and [44] combine one-class and two-class classification to improve detection performance.

3 Preliminaries

Consider a neural network for classification \({F_\psi }:x \mapsto c_s\), where x represents the legitimate image, \(c_s\) is the ground truth, and \(\psi \) denotes the information of the neural network. The objective of non-targeted adversarial attacks is to create an adversarial perturbation \(\delta \) that is carefully produced and results in misclassification by the target model (i.e., \({F_\psi }\left( {{x^{adv}}} \right) \ne c_s\), \({x^{adv}} = x + \delta \)). In general, the \({\ell _p}\)-norm is used to restrict perturbations. We formulate the generation of adversarial perturbations as follows.

$$\begin{aligned} \arg \mathop {\max }\limits _{{x^{adv}}} {L_\psi }\left( {{x^{adv}},c_s} \right) , {\mathop \mathrm{s.t.}\nolimits } {\left\| {x - {x^{adv}}} \right\| _p} \le \epsilon . \end{aligned}$$
(1)

Function \({L_\psi }\left( { \cdot , \cdot } \right) \) calculates the distance between the predicted and true labels, \(\epsilon \) constrains the intensity of the perturbations, and \(p \in \left\{ 0,2,\infty \right\} \).

Many attempts have been made to address the above adversarial optimization with full information of \(F_\psi \). However, this assumption is unrealistic in practice. A viable approach is to optimize the adversarial examples on an accessible surrogate model \({F_\theta }\). The surrogate model \({F_\theta }\) and target model \(F_\psi \) have different architectures and parameters but aligned outputs; thus, attackers produce transferable adversarial examples with \({F_\theta }\) for the attack. Feature-level attacks maximize distortions in intermediate features, where \({F_l}\left( \cdot \right) \) denotes the feature of an input at the l-th layer.
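For concreteness, the sketch below shows one common way to expose \({F_l}\left( \cdot \right) \) from a surrogate model using a forward hook. The model choice, layer index, and variable names are illustrative assumptions rather than the implementation used in this paper.

```python
import torch
import torchvision.models as models

# A minimal sketch (not the authors' code): capture the intermediate feature
# map F_l(x + delta) from a surrogate model via a forward hook.
model = models.vgg16(weights="IMAGENET1K_V1").eval()   # requires a recent torchvision

feature_store = {}

def hook(_module, _inputs, output):
    # Keep the graph attached so gradients w.r.t. the input survive.
    feature_store["F_l"] = output

target_layer = model.features[14]        # assumed to correspond to Conv3_3 in VGG16
handle = target_layer.register_forward_hook(hook)

x = torch.rand(1, 3, 224, 224)           # placeholder legitimate image
delta = torch.zeros_like(x, requires_grad=True)
logits = model(x + delta)                # forward pass populates feature_store
F_l = feature_store["F_l"]               # intermediate features used by feature-level attacks
handle.remove()
```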

4 The proposed method

Following the observation that DNNs tend to extract similar features, feature-level attacks craft adversarial examples by perturbing the intermediate features. Therefore, adversarial examples generated in the feature space have higher transferability, allowing them to fool multiple neural networks. However, these attacks generate perturbations in a specific gradient-based manner: during the iterations, they maximize a given loss function to update the perturbation. Owing to the lack of stochasticity in this process, they are often trapped in model-specific local optima, which reduces their transferability. Therefore, avoiding such local optima is crucial for enhancing transferability. The production of adversarial perturbations requires fine-grained and model-agnostic features as guidance, i.e., feature importance, and updating adversarial examples requires diversity. To address these issues, we design a self-ensemble based feature-level adversarial attack. The devised approach significantly boosts the transferability of adversarial perturbations by refining the feature importance and introducing stochasticity into the optimization process, as illustrated in Fig. 1.

Fig. 1
figure 1

Overview of self-ensemble based feature-level adversarial attack. Given a clean image with an added random initial perturbation from a Gaussian distribution, intermediate feature maps are extracted from a surrogate network, and gradients are calculated as the feature importance backpropagating from the final probabilities to the feature maps. Then, the optimization of the weighted feature maps enhances negative features and suppresses positive features. The mutually orthogonal initial perturbations are selected successively from the Gaussian distribution to disrupt the features in a diverse manner, thus achieving higher transferability

Fig. 2
figure 2

Illustration of adversarial examples produced by traditional attacks and the proposed SEFA among the class decision boundaries of the surrogate model and target models. Our attack explores the feasible space of adversarial perturbations extensively and thus avoids poor local optima, instead of greedily crafting deterministic adversarial examples as traditional attacks do, which easily fall into poor local optima and therefore exhibit limited transferability among target models

4.1 Self-ensemble for transferable adversarial attacks

In existing studies, adversarial attacks have been modeled as optimization problems and the adversarial example has been updated in a deterministic manner. Therefore, they often fall into local optima and suffer from limited transferability. In this study, we attempt to add diverse perturbations to clean images. Thus, obtaining an optimal \(\delta \) can be formulated as the following constrained optimization problem,

$$\begin{aligned} \arg \mathop {\min }\limits _\delta {\left\| \delta \right\| _p},{\mathop \mathrm{s.t.}\nolimits }{F_\theta }\left( {x + \delta } \right) \ne c_s. \end{aligned}$$
(2)

However, solving problem (2) is not trivial because it is impractical to determine a search space that satisfies the constraint. We obtain the perturbations by maximizing the loss function, as in the majority of previous studies (Fig. 2).

$$\begin{aligned} \arg \mathop {\max }\limits _\delta {L_\theta }\left( {x + \delta ,c_s} \right) ,{\mathop \mathrm{s.t.}\nolimits }{\left\| \delta \right\| _p} \le \varepsilon , \end{aligned}$$
(3)

where \({L_\theta }\left( { \cdot , \cdot } \right) \) is a loss function w.r.t. the perturbation \(\delta \). Although problem (3) is not fully equivalent to (2) and may thus not guarantee that the obtained perturbations will always mislead the classifier, it quickly finds a possible perturbation within the constrained range.

The key to the problem (3) is an appropriate optimization objective. We design a new optimization objective for the problem (3) to perturb the object-aware features,

$$\begin{aligned} L\left( \delta \right) = \sum {\left( {W \odot {F_l}\left( {x + \delta } \right) } \right) }, \end{aligned}$$
(4)

where W is the aggregate gradient (i.e., feature importance) w.r.t. \({F_l}\left( x \right) \).

$$\begin{aligned} \begin{aligned}&W = {\frac{{\sum \nolimits _{n = 1}^{N} {W_l^{x \odot B_{{p_d}}^n}} }}{{{{\left\| {\sum \nolimits _{n = 1}^{N} {W_l^{x \odot B_{{p_d}}^n}} } \right\| }_2}}}}, \\ &B_{{p_d}}^n \sim {\mathop \textrm{Bernoulli}\nolimits }\left( {1 - {p_d}} \right) , \end{aligned} \end{aligned}$$
(5)

where N is the aggregation number, i.e., the number of random masks, and \(p_d\) denotes the probability of random pixel dropping. \(B_{{p_d}}^n\), sampled from the Bernoulli distribution \({\mathop \textrm{Bernoulli}\nolimits }\left( \cdot \right) \), randomly discards pixels of x through the element-wise product with x. \(W_l^x\) is expressed as follows,

$$\begin{aligned} W_l^x = \frac{{\partial \ell \left( {x,c_s} \right) }}{{\partial {F_l}\left( x \right) }}, \end{aligned}$$
(6)

where \(\ell \left( {.,.} \right) \) denotes an unnormalized probability w.r.t. the ground truth \(c_s\). The sign of W indicates the basic stance of the feature with respect to the true class, and its intensity denotes the importance of the feature. The features corresponding to positive and negative gradients are considered positive and negative features of the samples, respectively. Minimizing \({W \odot {F_l}\left( {x + \delta } \right) }\) suppresses positive features and encourages negative ones, manipulating the learnable features of the samples. In contrast to previous attacks, i.e., the feature disruptive attack (FDA) and the neural representation distortion method (NRDM), \({W \odot {F_l}\left( {x + \delta } \right) }\) employs W as a discriminator, thus focusing on salient object features and avoiding model-related information. Therefore, the weighted feature representation \({W \odot {F_l}\left( {x + \delta } \right) }\) is a powerful and transferable optimization objective that directs perturbations towards more transferable directions.
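As an illustration of (5) and (6), the sketch below aggregates gradients over randomly masked copies of x. The helper `model_forward` (assumed to return the hooked feature map \(F_l\) and the logits) and all names are our own assumptions, not the authors' code.

```python
import torch

def aggregate_feature_importance(model_forward, x, c_s, N=30, p_d=0.3):
    # model_forward(x) -> (F_l, logits): a hypothetical wrapper around the
    # hooked surrogate model from the previous sketch.
    grad_sum = None
    for _ in range(N):
        mask = torch.bernoulli(torch.full_like(x, 1.0 - p_d))   # B^n_{p_d}, Eq. (5)
        F_l, logits = model_forward(x * mask)
        # l(x, c_s): unnormalized score of the ground-truth class, Eq. (6)
        g = torch.autograd.grad(logits[:, c_s].sum(), F_l)[0]
        grad_sum = g if grad_sum is None else grad_sum + g
    return grad_sum / grad_sum.norm(p=2)                        # W in Eq. (5)
```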

Similar to the training of neural networks, an optimization-based adversarial attack fixes the parameters of a pre-trained model and then optimizes the inputs to generate adversarial examples. The pre-trained model contains prior knowledge of the dataset distribution. The layers closer to the input layer capture fine-grained feature information common to multiple models with a high resolution through small receptive fields. However, the receptive fields and overlapping areas between them in the layers closer to the output layer gradually increase, focusing on model-specific global information with rich semantic information. By observing the gradient of (4) w.r.t. \(\delta \), we obtain

$$\begin{aligned} \frac{{\partial L\left( \delta \right) }}{{\partial \delta }} = \frac{{\partial \sum {\left( {W \odot {F_l}\left( {x + \delta } \right) } \right) } }}{{\partial {F_l}\left( {x + \delta } \right) }}\frac{{\partial {F_l}\left( {x + \delta } \right) }}{{\partial \delta }}. \end{aligned}$$
(7)

Updating the perturbation involves layers closer to the input layer, which indicates that the relatively general prior knowledge of the dataset, rather than the model-specific knowledge, is used to fine-tune and infect the source example.

In this study, we model adversarial attacks as the optimization problem (4). The optimization process significantly affects the transferability of adversarial perturbations. In addition, increasing diversity and variability can further improve the transferability. Therefore, we introduce diversity from an optimization perspective, sampling orthogonal initial perturbations in the outer layer of the generation and exploring as diverse directions as possible with a clean image as the origin. Diverse orthogonal initial perturbations guide the optimization to attempt different initial directions to increase the diversity and transferability of adversarial examples. In particular, given a trained DNN and an image, the constructed optimization objective requires an initial perturbation.

$$\begin{aligned} {\delta _0} \sim \mathcal{N}\left( {0.01,{\sigma ^2}} \right) ,{\delta _0} \in {\mathbb {R}^{Dim \times Dim}}. \end{aligned}$$
(8)

Because the dimension of the image Dim is quite high, we sample the directions in the subspace and fit them to the original image space to improve the search efficiency, as shown in lines 4–7 of Algorithm 1.

$$\begin{aligned} \begin{aligned} \delta _0 \sim \mathcal{N}\left( {0.01,{\sigma ^2}} \right)&,\delta _0 \in {\mathbb {R}^{\frac{{Dim}}{r} \times \frac{{Dim}}{r}}}, \\ \delta _0 =&{\mathop \textrm{Interp}\nolimits }\left( {\delta _0} \right) , \end{aligned} \end{aligned}$$
(9)

where r is the dimension reduction factor, and \({\mathop \textrm{Interp}\nolimits }\left( .\right) \) denotes bilinear interpolation. In the n-th outer loop, the adversarial example x is represented by \(x^n_T\). Eventually, we obtain a set of adversarial examples \(\left\{ {x_T^1, \cdots ,x_T^{Num}} \right\} \).
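One plausible realization of (8), (9), and lines 4–7 of Algorithm 1 is sketched below: Gram-Schmidt orthogonalization in the low-dimensional space followed by bilinear upsampling. The concrete choices (e.g., \(\sigma\), resampling of degenerate directions) are assumptions, not the reference implementation.

```python
import torch
import torch.nn.functional as F

def orthogonal_initial_perturbations(num, dim, r=2, sigma=0.05):
    """Sample `num` mutually orthogonal low-dimensional directions and upsample
    them to dim x dim by bilinear interpolation (sketch of Eqs. (8)-(9));
    sigma and the Gram-Schmidt scheme are our assumptions."""
    low = dim // r
    basis, deltas = [], []
    while len(deltas) < num:
        d = torch.normal(0.01, sigma, size=(low * low,))
        for b in basis:                      # remove components along accepted directions
            d = d - (d @ b) * b
        if d.norm() < 1e-8:                  # degenerate sample, draw again
            continue
        basis.append(d / d.norm())
        d_low = d.view(1, 1, low, low)
        d_up = F.interpolate(d_low, size=(dim, dim), mode="bilinear",
                             align_corners=False)
        deltas.append(d_up.squeeze())        # one delta_0^n of size dim x dim
    return deltas

# Usage: 50 orthogonal initial perturbations for 224 x 224 inputs.
init_perturbations = orthogonal_initial_perturbations(num=50, dim=224, r=2)
```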

The use of SEFA can be understood as follows. Inspired by ensemble learning, diverse adversarial examples are explored as much as possible to avoid falling into model-specific local optima, instead of moving the adversarial example along a deterministic gradient. The proposed SEFA embodies the idea of an ensemble without aggregation operations (e.g., averaging). We perform the attacks using the set \(\left\{ {x_T^1, \cdots ,x_T^{Num}} \right\} \), preserving the diversity of the adversarial samples and improving the transferability of the attacks.

In contrast to [30, 46], in which ensemble adversarial examples are generated by combining DNNs of different architectures, we explore adversarial samples with higher transferability generated by models of the same architecture but with randomness in initialization, which can be understood as a self-ensemble without external knowledge. Self-ensemble is utilized not only to enhance robustness [48] but also to generate adversarial examples. Xu et al. [48] aggregate intermediate pre-trained models from past time steps, whereas the proposed method constructs diverse adversarial examples as candidates by introducing randomness in initialization. Because we improve transferability via an ensemble operation from an optimization perspective, our method can be combined with previous transferability enhancement methods [9, 11, 16] to perform more powerful adversarial attacks.

Algorithm 1
figure g

SEFA.

Fig. 3
figure 3

Histogram of the gradients. The gradients indicate the importance and stance of the feature w.r.t. the ground truth. The aggregation based on the raw gradient yields more values near zero. \(p_f\) denotes the probability for filtering

4.2 Feature importance by refined gradient

The feature importance discussed in the previous section aggregates gradients from randomly transformed copies of x to highlight robust/transferable features/gradients while neutralizing non-robust ones. However, the aggregation of gradients introduces a small amount of noise, thus causing the optimization objective to focus on a few robust features from a limited set of transformed copies of x, as shown in Fig. 3. The noise caused by aggregating gradients is inherently random and may not be shared by DNNs. Figure 3 illustrates the statistical information of the gradients of an image. The relatively flat distribution of the aggregated gradients indicates that the information of all the transformed images is observed, including the random noise generated by the transformation.

Fig. 4
figure 4

Visualization of feature maps and corresponding gradients at the layer Conv3\(\_\)3 of VGG16. The raw operation provides the feature map and gradient from the clean image, the aggregate feature and gradient are calculated from multiple transforms of a legitimate image, and refining the aggregate feature and gradient generates the refined ones

To suppress the random noise, we propose refining the gradient, i.e., refining the aggregate gradient obtained from random transformations of the input image. The refinement operation eliminates redundancy while preserving the general texture and spatial structure. Object-aware and semantically salient features result in larger gradient magnitudes, which are further emphasized after aggregation, whereas the gradients corresponding to random features have lower magnitudes. Refining retains the gradients corresponding to important features while filtering out the others. In this study, we adopt quantile filtering with probability \(p_f\), which can be expressed as follows.

$$\begin{aligned} \begin{aligned} W_f =&{\mathop \textrm{Filter}\nolimits }_{p_f}\left( {\frac{{\sum \nolimits _{n = 1}^N {W_l^{x \odot B_{{p_d}}^n}} }}{{{{\left\| {\sum \nolimits _{n = 1}^N {W_l^{x \odot B_{{p_d}}^n}} } \right\| }_2}}}} \right) , \\ &B_{{p_d}}^n \sim {\mathop \textrm{Bernoulli}\nolimits }\left( {1 - {p_d}} \right) . \end{aligned} \end{aligned}$$
(10)

\({\mathop \textrm{Filter}\nolimits }_{p_f}(.)\) is expressed as follows.

$$\begin{aligned} {\mathop \textrm{Filter}\nolimits }_{p_f}(w)=\left\{ \begin{array}{ll} w_{i,j} & \left| {{w_{i,j}}} \right| > {Q_{{p_f}}} \\ 0 & \left| {{w_{i,j}}} \right| \le {Q_{{p_f}}} \end{array}\right. , \end{aligned}$$
(11)

where \(Q_{{p_f}}\) is the \(100 \times p_f\)-th percentile of \(\left| {w} \right| \), \(\left| {.} \right| \) denotes the absolute value of the input, and \(w_{i,j}\) is a pixel at cell \((i,j)\) of w.

The refined gradient \(W_f\) preserves the highlighted critical, robust, and semantically meaningful features that provide more accurate information for generating transferable adversarial perturbations. Figure 4 presents visualizations of the refined gradients. Compared to the aggregate gradient, the refined gradient is cleaner and focused on objects, thus providing better feature importance for transferable attacks. After the refinement operation, features with small absolute values are set to 0, and the contrast of the features is enhanced. In addition, the refined gradient focuses on the lower-left corner of the red box, i.e., the region of the maximum value of the features. The quantile filtering results are presented in Fig. 3. The change in the gradient distribution is more distinct when \(p_f=0.3\), whereas subsequent experiments show that \(p_f=0.1\) is more appropriate from the perspective of transferability.
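A minimal sketch of the quantile filtering in (10) and (11) is given below; the function name and tensor conventions are ours, not the paper's.

```python
import torch

def refine_gradient(W, p_f=0.1):
    """Zero out elements of the aggregate gradient whose magnitude is at or
    below the 100*p_f-th percentile (Eq. (11)); the rest are kept unchanged."""
    q = torch.quantile(W.abs().flatten(), p_f)             # Q_{p_f}
    return torch.where(W.abs() > q, W, torch.zeros_like(W))

# Usage with the aggregate feature importance from the earlier sketch:
# W_f = refine_gradient(W, p_f=0.1)
```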

4.3 Transferable adversarial attacks

By employing the refined gradient \(W_f\) (i.e., the feature importance) and substituting (4) into (3), we obtain the following optimization objective for diverse perturbations in the feature space,

$$\begin{aligned} \arg \mathop {\max }\limits _\delta \sum {\left( {W_f \odot {F_l}\left( {x + \delta } \right) } \right) } ,{\mathop \mathrm{s.t.}\nolimits }{\left\| \delta \right\| _p} \le \varepsilon . \end{aligned}$$
(12)

The loss function in (12) enhances the salient features with a negative \(W_f\) but suppresses those corresponding to a positive \(W_f\). Thus, transferable adversarial attacks can be achieved.

Algorithm 2
figure h

SEFA-DITI.

The key to transferability is introducing stochasticity into the generation of adversarial perturbations. Therefore, we propose a self-ensemble based transferable attack in the feature space. The adversarial attack framework comprises the two subparts described in the previous subsections, and the entire process is described in Algorithm 1. Given a clean image x, we generate \(\delta _T^n\) through optimization with the initial perturbation \(\delta _0^n\), a random variable from a Gaussian distribution. Subsequently, to make the perturbations as diverse as possible, we select \(\delta _0^{n+1}\) successively such that it is orthogonal to the elements of the set \(\left\{ {\delta _0^0, \cdots ,\delta _0^n} \right\} \). Finally, we obtain diverse perturbations in the feature space, which boosts transferability.

To further improve the transferability, we combine the proposed method with transferability enhancement approaches such as diverse inputs [16] and translation operations [11]. We describe the combination of methods in detail to explain it clearly. The combination of SEFA, the diverse inputs iterative method (DIM) [16], and the translation-invariant iterative method (TIM) [11] is referred to as SEFA-DITI, as shown in Algorithm 2. \({\mathop \textrm{Trans}\nolimits }\left( {.,.} \right) \) is a random transform operation that creates diverse input images for transferability.

$$\begin{aligned} {\mathop \textrm{Trans}\nolimits }(x , p_T)=\left\{ \begin{array}{ll} {\mathop \textrm{Trans}\nolimits }(x) & \text{ with } \text{ probability } p_T \\ x & \text{ with } \text{ probability } 1-p_T \end{array}\right. . \end{aligned}$$
(13)

\({\mathop \textrm{Gkern}\nolimits }\left( Tkern\_size\right) \) yields a two-dimensional Gaussian kernel Tkern of size \(Tkern\_size\), which is used to convolve the gradient \(g_{t + 1}\).
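For clarity, the sketch below gives one possible implementation of \({\mathop \textrm{Trans}\nolimits }\left( {.,.} \right) \) and \({\mathop \textrm{Gkern}\nolimits }\left( . \right) \) in the spirit of DIM and TIM; the resize range and the \(\sigma\) of the Gaussian kernel are assumptions, not values taken from the paper.

```python
import numpy as np
import torch
import torch.nn.functional as F

def trans(x, p_T=0.7):
    """Eq. (13): with probability p_T, randomly resize the input and pad it
    back to its original spatial size (a DIM-style transform); the resize
    range below is our own assumption."""
    if torch.rand(1).item() >= p_T:
        return x
    size = x.shape[-1]
    rnd = int(torch.randint(int(0.9 * size), size, (1,)).item())
    resized = F.interpolate(x, size=(rnd, rnd), mode="nearest")
    pad = size - rnd
    left = int(torch.randint(0, pad + 1, (1,)).item())
    top = int(torch.randint(0, pad + 1, (1,)).item())
    return F.pad(resized, (left, pad - left, top, pad - top))

def gkern(size=15, sigma=3.0):
    """Gkern(Tkern_size): a normalized 2-D Gaussian kernel used to convolve
    the gradient g_{t+1}, as in TIM; sigma is assumed."""
    ax = np.arange(size) - (size - 1) / 2.0
    xx, yy = np.meshgrid(ax, ax)
    kernel = np.exp(-(xx ** 2 + yy ** 2) / (2.0 * sigma ** 2))
    return torch.from_numpy(kernel / kernel.sum()).float()
```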

Many previous gradient-based attacks have attempted to solve the optimization objective (12), such as the momentum iterative method (MIM) [10] and TIM [11]. Given the advantage and superiority of momentum descent, this approach is used to address (12), as in [10], and Algorithm 1 describes the details of the attack.
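A compact sketch of Algorithm 1, assembled from the hypothetical helpers introduced in the earlier sketches (the hooked `model_forward`, `aggregate_feature_importance`, `refine_gradient`, and `orthogonal_initial_perturbations`), is given below. It is an illustration under our own assumptions (pixel values in [0, 1], MIM-style momentum normalization), not the reference implementation.

```python
import torch

def sefa_attack(model_forward, x, c_s, num=50, T=10,
                eps=16 / 255, eta=1.6 / 255, mu=1.0, p_f=0.1):
    """Sketch of SEFA: compute the refined feature importance once, then run a
    momentum inner loop from each orthogonal initial perturbation."""
    W = aggregate_feature_importance(model_forward, x, c_s)
    W_f = refine_gradient(W, p_f)                                   # Eq. (10)
    adv_set = []
    for delta0 in orthogonal_initial_perturbations(num, x.shape[-1]):
        # Broadcast the single-channel initial direction over RGB (assumption).
        delta = delta0.unsqueeze(0).unsqueeze(0).expand_as(x).clone()
        delta = delta.clamp(-eps, eps)
        g = torch.zeros_like(x)
        for _ in range(T):                                          # inner loop
            delta.requires_grad_(True)
            F_l, _ = model_forward(x + delta)
            loss = (W_f * F_l).sum()                                # objective (12)
            grad = torch.autograd.grad(loss, delta)[0]
            g = mu * g + grad / grad.abs().mean()                   # momentum [10]
            delta = (delta.detach() + eta * g.sign()).clamp(-eps, eps)
        adv_set.append((x + delta).clamp(0, 1))                     # x_T^n
    return adv_set
```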

4.4 Theoretical analysis

Based on the finding that different neural networks extract similar features, NRDM [17] maximizes the distance between the features of the adversarial example \({F_l}\left( {{x^{adv}}}\right) \) and legitimate example \({F_l}\left( x \right) \). FDA [18] and FIA [20] perturb salient features carefully selected according to the activation values or gradients. For a better illustration, the objective functions are expressed as follows.

$$\begin{aligned} {L_{NRDM}} = {\left\| {{F_l}\left( {{x^{adv}}} \right) - {F_l}\left( x \right) } \right\| _2}, \end{aligned}$$
(14)
$$\begin{aligned} \begin{aligned} {L_{FDA}}&= \log \left( {{{\left\| {{F_l}\left( {{x^{adv}}} \right) |{F_l}\left( x \right) < {C_l}\left( {i,j} \right) } \right\| }_2}} \right) \\&- \log \left( {{{\left\| {{F_l}\left( {{x^{adv}}} \right) |{F_l}\left( x \right) > {C_l}\left( {i,j} \right) } \right\| }_2}} \right) , \end{aligned} \end{aligned}$$
(15)
$$\begin{aligned} {L_{FIA}} = \sum {\left( {W \odot {F_l}\left( {{x^{adv}}} \right) } \right) }, \end{aligned}$$
(16)

where \({C_l}\left( {i,j} \right) \) denotes the mean activation values across channels.

With the optimization objectives above, FDA utilizes feature activations to characterize the importance of features, thus suppressing the features that support the ground truth while enhancing the others. However, its distinguishing criterion, i.e., the mean activation value across channels, fails to effectively identify object-aware salient features and suppress model-specific information. NRDM merely maximizes the distance between \({F_l}\left( {{x^{adv}}} \right) \) and \({F_l}\left( {x} \right) \). In contrast, FIA achieves higher transferability by minimizing (16).

The gradients of the optimization objectives of NRDM (14), FDA (15), and FIA (16) are written as follows,

$$\begin{aligned} \frac{{\partial {L_{NRDM}}}}{{\partial {x^{adv}}}} = \frac{{\partial {{\left\| {{F_l}\left( {{x^{adv}}} \right) - {F_l}\left( x \right) } \right\| }_2}}}{{\partial {F_l}\left( {{x^{adv}}} \right) }}\frac{{\partial {F_l}\left( {{x^{adv}}} \right) }}{{\partial {x^{adv}}}}, \end{aligned}$$
(17)
$$\begin{aligned} \begin{aligned} \frac{{\partial {L_{FDA}}}}{{\partial {x^{adv}}}}&= \frac{{\partial \log \left( {{{\left\| {{F_l}\left( {{x^{adv}}} \right) |{F_l}\left( x \right) < {C_l}\left( {i,j} \right) } \right\| }_2}} \right) }}{{\partial {F_l}\left( {{x^{adv}}} \right) }}\frac{{\partial {F_l}\left( {{x^{adv}}} \right) }}{{\partial {x^{adv}}}} \\&- \frac{{\partial \log \left( {{{\left\| {{F_l}\left( {{x^{adv}}} \right) |{F_l}\left( x \right) > {C_l}\left( {i,j} \right) } \right\| }_2}} \right) }}{{\partial {F_l}\left( {{x^{adv}}} \right) }}\frac{{\partial {F_l}\left( {{x^{adv}}} \right) }}{{\partial {x^{adv}}}}, \end{aligned} \end{aligned}$$
(18)
$$\begin{aligned} \frac{{\partial {L_{FIA}}}}{{\partial {x^{adv}}}} = \frac{{\partial \sum {\left( {W \odot {F_l}\left( {{x^{adv}}} \right) } \right) } }}{{\partial {F_l}\left( {{x^{adv}}} \right) }}\frac{{\partial {F_l}\left( {{x^{adv}}} \right) }}{{\partial {x^{adv}}}}. \end{aligned}$$
(19)

The comparison of (17), (18), (19), and (7), as well as the experiments reported in the corresponding references, clearly indicates that the gradient of the feature map w.r.t. the input \(\frac{{\partial {F_l}\left( {{x^{adv}}} \right) }}{{\partial {x^{adv}}}}\) is the core, and that terms such as \(\frac{{\partial {{\left\| {{F_l}\left( {{x^{adv}}} \right) - {F_l}\left( x \right) } \right\| }_2}}}{{\partial {F_l}\left( {{x^{adv}}} \right) }}\) introduce model-specific information, limiting transferability, whereas the term \(\frac{{\partial \sum {\left( {W \odot {F_l}\left( {{x^{adv}}} \right) } \right) } }}{{\partial {F_l}\left( {{x^{adv}}} \right) }}\) contains the feature importance, guiding the adversarial example towards a more transferable direction. The conclusions and experiments in the respective papers adequately illustrate the advantages of FIA. However, it is difficult for FIA to introduce randomness from the perspective of the perturbation; FIA only moves the adversarial sample along the deterministic gradient. Therefore, we introduce a self-ensemble into the optimization process of (4) to expand the search space for transferable perturbations. We consider the orthogonal initial perturbations \(\delta _0^n\) \(\left( {n = 1,2, \cdots ,Num} \right) \) and explore various directions at the beginning of the optimization to boost transferability. The gradient of the optimization objective (4) during the iterations can be expressed as follows,

$$\begin{aligned} \begin{aligned} \frac{{\partial L\left( \delta \right) }}{{\partial \delta }} =&\frac{{\partial \sum {\left( {W_f \odot {F_l}\left( {x + \delta } \right) } \right) } }}{{\partial {F_l}\left( {x + \delta } \right) }}\frac{{\partial {F_l}\left( {x + \delta } \right) }}{{\partial \delta }}\\ =&\frac{{\partial \sum {\left( {W_f \odot {F_l}\left( {x + \delta } \right) } \right) } }}{{\partial {F_l}\left( {x + \delta } \right) }}\frac{{\partial {F_l}}}{{\partial {Z_l}}}\frac{{\partial {Z_l}}}{{\partial {F_{l - 1}}}} \cdots \frac{{\partial {F_1}}}{{\partial {Z_1}}}\frac{{\partial {Z_1}}}{{\partial \delta }}\\ =&\frac{{\partial \sum {\left( {W_f \odot {F_l}\left( {x + \delta } \right) } \right) } }}{{\partial {F_l}\left( {x + \delta } \right) }}\frac{{\partial {F_l}}}{{\partial {Z_l}}}{W_l} \cdots \frac{{\partial {F_1}}}{{\partial {Z_1}}}{W_1}, \end{aligned} \end{aligned}$$
(20)

where \({Z_l} = {W_l}{F_{l - 1}} + {B_l}\), \({F_l} = \sigma \left( {{Z_l}} \right) \), and \(\sigma \left( \cdot \right) \) is the activation function. Taking the sigmoid activation as an example, the gradient is

$$\begin{aligned} \begin{aligned} \frac{{\partial L\left( \delta \right) }}{{\partial \delta }} =&\frac{{\partial \sum {\left( {W_f \odot {F_l}\left( {x + \delta } \right) } \right) } }}{{\partial {F_l}\left( {x + \delta } \right) }}\left( {\sigma \left( {{Z_l}} \right) \odot \left( {1 - \sigma \left( {{Z_l}} \right) } \right) } \right) \\ &{W_l}\left( {\sigma \left( {{Z_{l - 1}}} \right) \odot \left( {1 - \sigma \left( {{Z_{l - 1}}} \right) } \right) } \right) {W_{l - 1}} \times \cdots \times \\ &\left( {\sigma \left( {{Z_1}} \right) \odot \left( {1 - \sigma \left( {{Z_1}} \right) } \right) } \right) {W_1}. \end{aligned} \end{aligned}$$
(21)

FIA updates the adversarial examples with deterministic gradients (19) because \({W_l}, \cdots ,{W_1}\) are fixed and \(\sigma \left( {{Z_l}} \right) , \cdots ,\sigma \left( {{Z_1}} \right) \) are stable. As shown in Fig. 2, the proposed SEFA introduces a self-ensemble by taking orthogonal initial perturbations, which introduces diversity from the perspective of the gradient. Diverse initializations result in different \(\sigma \left( {{Z_l}} \right) , \cdots ,\sigma \left( {{Z_1}} \right) \), contributing to the crafting of diverse transferable adversarial examples. The self-ensemble is the key to the proposed SEFA. As demonstrated in the following experiments, self-ensemble can significantly improve transferability.

Fig. 5
figure 5

Legitimate/adversarial images (top row) produced by the proposed SEFA and their attentions (bottom row). SEFA generates adversarial examples that disrupt the attentions and final decisions in diverse manners

Table 1 Attack success rates of various attacks against normally trained models

5 Experimental results

In this section, we describe the extensive experiments conducted to evaluate the effectiveness of SEFA. First, the setup of the experiments is described. The attack results of SEFA and the baseline methods against undefended models and advanced defended models are then presented. Furthermore, we perform ablation studies on the probability \(p_f\) and the hyperparameter Num in the proposed framework.

5.1 Experiment setup

Based on a baseline attack [20], we establish experimental settings to compare the transferability of adversarial attacks fairly. ImageNet [20] is a source dataset widely used for evaluating adversarial attacks. Figure 5 presents a legitimate image and some adversarial images. The perturbations generated by the proposed SEFA disrupt the attention and predictions of the model in various ways, thereby diversifying the enhancement of negative features and the suppression of positive features in the input images. The experiment setup is described as follows.

Dataset. The ImageNet-compatible dataset [20] is used to examine the transferability of the adversarial attacks; it contains 1000 instances randomly sampled from different categories of the ILSVRC 2012 validation set. The CIFAR-10 dataset is a color image dataset containing 10 categories, each with 6000 images of size 32 × 32, used for training and evaluating image classification models.

Models. We test our method using four source models: ResNet-v1-152 (Res-152), Inception-ResNet-v2 (InceRes-v2), Inception-v3 (Ince-v3), and VGG16 (Vgg-16). Considering both normal and adversarial training, the proposed SEFA is used to attack several target models: seven normally trained models and five defended models. For normal training, seven normally trained models are selected, including Inception-ResNet-v2, ResNet-v1-50 (Res-50), ResNet-v1-152, VGG19 (Vgg-19), VGG16, Inception-v4 (Ince-v4), and Inception-v3. For adversarial training [9], five adversarially trained models are selected, namely, InceRes-v2-Ens, Ince-v3-Ens4, Ince-v3-Ens3, InceRes-v2-Adv, and Ince-v3-Adv. The source and target models above are pretrained on ImageNet, approaching an almost 100\(\%\) classification success rate on the dataset. To evaluate the performance of the attack methods on the CIFAR10 dataset, we select four source models and seven target models for validation. The four source models, Res-20, Vgg-16, Shuffle-v2 [49], and RepVgg [50], are utilized to generate the adversarial samples. The seven target models are Res-20, Res-32, Vgg-16, Vgg-19, Mobile-v2 [51], Shuffle-v2, and RepVgg.

Baseline Methods. Several gradient iterative attacks are selected as baselines. In addition, three feature-level adversarial attacks, FIA [20], FDA [18], and NRDM [17], are selected as comparison baselines. Our method is compared with these methods to validate the effectiveness and superiority of the proposed SEFA.

Evaluation. The probability that adversarial images generated by a source network mislead a target network is called the attack success rate. When the source network is the same as the target network, this metric indicates the success rate of the attack in the white-box setting; otherwise, it is the black-box attack success rate.
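For reference, this metric can be computed as in the following sketch (untargeted setting; names are ours):

```python
import torch

def attack_success_rate(target_model, adv_images, true_labels):
    """Fraction of adversarial images that the target model misclassifies."""
    with torch.no_grad():
        preds = target_model(adv_images).argmax(dim=1)
    return (preds != true_labels).float().mean().item()
```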

Parameters. For a fair comparison, the parameters are set as in [20]: step size \(\eta = 1.6\), number of iterations \(T=10\), and perturbation limit \(\epsilon =16\). Momentum descent is the generic optimizer for all baselines, with the decay factor \(\lambda \) set to 1.0. For the patch-wise attack method (PIM), the projection kernel size \(k_w\) is 3, the projection factor \(\gamma \) is 0.5, and the amplification factor \(\beta \) is 2.5. The filter probability \(p_f\) and Num in SEFA are 0.1 and 50, respectively, and r is 2. In SEFA-DITI, the kernel size \(Tkern\_size\) is 15, and the transform probability \(p_T\) is 0.7. As the target layer of the surrogate model in the feature-level attacks, we select the middle layer of each network: Mixed\(\_\)5b in Ince-v3, the last layer of block2 in Res-152, Conv\(\_\)4a in InceRes-v2, and Conv3\(\_\)3 in Vgg-16. Under these settings, we can realize a fair comparison between the devised SEFA and the baseline attacks.

5.2 Comparison of transferability

Table 2 Attack success rates of various attacks against normally trained models on the CIFAR10 dataset

This section presents the performance of the baseline attacks and the proposed SEFA against normally trained models and adversarially trained models, respectively. We choose four source models with different architectures (Ince-v3, InceRes-v2, Res-152, and Vgg-16) and attack both defense models and normally trained models.

Attacking Normally Trained Models. The proposed SEFA significantly outperforms the baseline attacks, as shown in Table 1. The success rates of the adversarial examples produced with the source models Ince-v3 and InceRes-v2 increase by about 10%. In particular, the adversarial examples generated by SEFA with the surrogate network Vgg-16 successfully transfer among different models, achieving attack success rates of over 95\(\%\). The success rates of the combination of SEFA and DITI, i.e., SEFA-DITI, increase by 1% \(\sim \) 3%. Comparing DITI, SEFA, and SEFA-DITI, it can be observed that SEFA contributes more to SEFA-DITI than DITI. With the source models Inception-ResNet-v2 and Inception-v3, SEFA achieves higher success rates for both black-box and white-box attacks than existing feature-based attacks such as FIA, FDA, and NRDM. Compared to previous studies, SEFA improves the success rate against normally trained models by about 7.7%.

The adversarial examples generated with the surrogate network Vgg-16 mislead the target models with success rates of nearly 96%, while the transferability of the adversarial perturbations generated with the surrogate networks Ince-v3 and InceRes-v2 is limited. The results in Table 1 imply that less complicated models (e.g., Vgg-16) tend to craft more transferable adversarial examples because they are less prone to overfitting the examples to the source model than complex/large ones (e.g., Ince-v3 and InceRes-v2). It would be interesting to explore more appropriate models for generating transferable adversarial perturbations.

For the CIFAR10 dataset, the results of the attacks against the seven target models are exhibited in Table 2. The layer3_2 of Res-20, features_40 of Vgg-16, stage4_2 of Shuffle-v2, and stage4_0 of RepVgg are selected as the target intermediate layers for FIA and SEFA. Table 2 shows that the proposed SEFA obtains better results. For example, SEFA achieves an average improvement of 2.7% in the attack success rate compared to FIA. Thus, SEFA is effective for both the large-scale ImageNet dataset and the simple CIFAR10 dataset, demonstrating wide applicability.

Table 3 Attack success rates of various attacks in mis-match index testing against FR on the CIFAR10 dataset
Table 4 Attack success rates of various attacks against defense models

We conduct experiments against feature randomization (FR) [42] on the CIFAR10 dataset to verify the effectiveness of the attack methods. Following [42], we set the feature sizes \(FS = \left\{ {30,50,200,400,NS} \right\} \), where NS denotes the full size of the flatten layer of the source model Res-20. There are 50,000 original samples and 50,000 adversarial samples for training, 5,000 original samples and 5,000 adversarial samples for validation, and 5,000 original samples and 5,000 adversarial samples for testing. First, we input the samples into the source model to extract features. Subsequently, we randomly sample feature vectors from these features 50 times. The fifty sets of feature vectors of size \(fs \in FS\) are used to train 50 support vector machines (SVMs). In the evaluation phase, we use the 50 SVMs to identify the 50 feature sets of adversarial samples for the different attacks. The success rates of multiple attacks against FR in mismatch-index testing are shown in Table 3. The experimental results show a slight improvement in the attack performance of the proposed SEFA against FR; for example, the attack success rate of SEFA improves by an average of 0.5% over FIA. FR is an effective adversarial detection method, and the robustness of attack methods against such defenses needs to be improved, which in turn can facilitate the development of stronger defenses.
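The sketch below outlines the FR-style detection pipeline as we understand it from [42]: each detector is an SVM trained on a random subset of fs indices of the flattened source-model features. Parameter names and the SVM kernel are our assumptions.

```python
import numpy as np
from sklearn.svm import SVC

def train_fr_detectors(features, labels, fs, n_detectors=50, seed=0):
    """features: (n_samples, NS) flattened source-model features;
    labels: 0 for original samples, 1 for adversarial samples."""
    rng = np.random.default_rng(seed)
    detectors = []
    for _ in range(n_detectors):
        idx = rng.choice(features.shape[1], size=fs, replace=False)  # random feature subset
        clf = SVC(kernel="rbf").fit(features[:, idx], labels)
        detectors.append((idx, clf))
    return detectors

# Evaluation: each (idx, clf) pair scores a sample on features[:, idx].
```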

Fig. 6
figure 6

Effects of the number of initial perturbations on the attack success rate. Two source models, Ince-v3 and Res-152, generate adversarial examples with different numbers of initial perturbations, which change from 0 to 70. The success rates are the results of attacking four normally trained models (Ince-v3, Vgg-16, InceRes-v2, and Res-152) and five defense models (InceRes-v2-Adv, InceRes-v2-Ens, Ince-v3-Ens4, Ince-v3-Ens3, and Ince-v3-Adv)

Fig. 7
figure 7

Effects of the filter probability on the attack success rate. Two source models, Ince-v3 and Res-152, generate adversarial examples with different filter probabilities. The filter probability changes from 0 to 0.4. The adversarial examples are used to attack four normally trained models Ince-v3, Vgg-16, InceRes-v2, and Res-152 and five defense models InceRes-v2-Adv, InceRes-v2-Ens, Ince-v3-Ens4, Ince-v3-Ens3, and Ince-v3-Adv

Attacking Defense Models. Adversarial training of neural networks has regularization-like effects, achieving strong robustness to adversarial examples. In most cases, the proposed SEFA and SEFA-DITI rank among the top two, as shown in Table 4. This is because SEFA-DITI combines SEFA with the enhancement methods DIM and TIM to introduce randomization; data augmentation helps to further improve the generalization of the adversarial samples and hence the transferability. Compared to the baseline attack, our approach improves the success rate against the defense models by about \(13.4\%\). Compared with normally trained models, the proposed SEFA demonstrates a more significant improvement when attacking the adversarially trained networks. This is because the success rates of attacks against the normally trained models are already quite high; only a small number of difficult samples remain for attackers, making it difficult to further improve the success rates. Table 4 demonstrates the threats posed by the proposed SEFA to the defense models. Tables 1 and 4 present the attack success rates against the normally trained models and the adversarially trained models, respectively. The values in the tables indicate the success rates of the attacks (corresponding to rows) against the target models (corresponding to columns).

5.3 Ablation study

There are two parameters in the proposed method: the filter probability \(p_f\) and number of initial perturbations Num. With the parameter settings \(p_f=0.1\) and \(Num=50\), we fix one parameter and modify the other to analyze the effect of the parameters on the framework.

Fig. 8
figure 8

Effect of stochasticity and refined gradient. \(L_1\) acts as the baseline. \(L_2\) and \(L_3\) comprise the stochasticity and refined gradient, respectively. \(L_4\) adopts the above two terms simultaneously

\(p_f\) increases from 0 to 0.4 in steps of 0.1, and Num increases from 0 to 70. Figures 6 and 7 illustrate the effects of Num and \(p_f\) on the attacks. The effects of the filter probability and the number of initial perturbations on the success rates against the source and target networks are approximately the same. The trends in the attack success rates for different target networks as Num increases are also approximately consistent. The attack time increases with Num, while the attack success rate gradually becomes saturated. Therefore, the optimal Num for attacking is 50, achieving a better trade-off between effectiveness and efficiency, as shown in Fig. 6. In terms of the filter probability, a larger \(p_f\) (e.g., 0.4) removes a large amount of redundant feature importance information, whereas the success rates of attacks with \(p_f=0.1\) increase significantly, as shown in Fig. 7. Finally, the appropriate number of initial perturbations Num and filter probability \(p_f\) are selected for the attack.

Moreover, the keys to the proposed SEFA are the stochasticity and the refined gradient. To investigate the contributions of these two factors, we design four optimization objectives and experimentally validate them using two source models: Ince-v3 and Res-152. The four objective functions are constructed as follows: \(L_1\) boosts the positive features and discourages the negative features based on the aggregate gradient, \(L_2\) uses the refined gradient, \(L_3\) combines the aggregate gradient with the stochastic perturbation, and \(L_4\) is the proposed loss function. Here, \(L_2\) and \(L_3\) isolate the effects of these two components, respectively. Figure 8 presents the success rates with the four loss functions.

$$\begin{aligned} {L_1} = \sum {\left( {W \odot {F_l}\left( {{x^{adv}}} \right) } \right) }, \end{aligned}$$
(22)
$$\begin{aligned} {L_2} = \sum {\left( {W_f \odot {F_l}\left( {{x^{adv}}} \right) } \right) }, \end{aligned}$$
(23)
$$\begin{aligned} {L_3} = \sum {\left( {W \odot {F_l}\left( {x + \delta } \right) } \right) }, \end{aligned}$$
(24)
$$\begin{aligned} {L_4} = \sum {\left( {W_f \odot {F_l}\left( {x + \delta } \right) } \right) }. \end{aligned}$$
(25)

\(L_2\) and \(L_3\) outperform \(L_1\), demonstrating the effectiveness of the two components, i.e., the refined gradient and the introduced stochasticity. \(L_3\) surpasses \(L_2\), indicating that stochasticity improves transferability to a greater extent. In most cases, the proposed loss \(L_4\) significantly outperforms the others, demonstrating the advantage of the proposed SEFA.

6 Conclusions

We propose a general framework for adversarial attacks by introducing self-ensemble. Our method disrupts the salient features in a stochastic manner through diverse initial perturbations and refined feature importance, thus significantly improving the diversity and randomness of adversarial perturbations. Consequently, the generated adversarial examples effectively avoid being trapped in model-specific local optima and become more transferable among target models. Moreover, the devised attack can further enhance transferability in combination with other methods. Theoretical analysis and extensive experiments demonstrate the superior performance of SEFA over the baseline attacks.

In the future, we will consider reducing the computational complexity of the proposed method. Examining more effective methods based on self-ensemble, such as transferable targeted adversarial attacks, is a possible future research direction. Moreover, we intend to explore the generalization of adversarial attacks to various visual applications such as object detection and semantic segmentation.