1 Introduction

Transformers have demonstrated excellent performance not only in natural language processing [1, 2] but also in computer vision, especially for object detection [3] and image classification [4]. Vision transformers (ViTs) achieve state-of-the-art performance by aggregating global features with a wider receptive field than convolutional neural networks (CNNs) [5]. Recent works have shown that ViTs exhibit stronger robustness than CNNs in different attack settings [6,7,8]. However, Naseer et al. discovered that, despite their significant architectural differences, both ViTs and CNNs are vulnerable to adversarial examples, i.e., inputs with tiny, imperceptible perturbations that cause incorrect predictions [9]. This suggests that highly transferable adversarial examples generated on one model can remain adversarial for other, unknown models. This vulnerability has attracted significant attention in many security-sensitive applications, such as face verification, medical diagnosis and autonomous driving [10,11,12,13,14].

Adversarial examples play an important part in evaluating robustness and exploring the internal drawbacks of a model before deployment [15,16,17]. Previous studies have proposed generating adversarial examples on CNNs, and one of the most fundamental tasks is to attack classification models [18, 19]. According to the available model information, adversarial attack methods can be categorized into white-box attacks [12] and black-box attacks [20]. White-box attacks allow the attacker full access to knowledge about the target model, such as parameters and model architectures. In contrast, black-box attacks generate adversarial examples without knowing any information about the target model; this setting is far more representative of practical deployments but still yields low success rates.

There are two typical categories of methods to mitigate the above problem, i.e., query-based attacks [21,22,23] and transfer-based attacks [24,25,26]. Query-based attacks attempt to estimate gradients from queried information, but they suffer from high query complexity and low efficiency, which becomes an obstacle to deploying them in real-world scenarios. By contrast, transfer-based methods generate adversarial examples directly on substitute models, which makes it more flexible and practical to attack unknown models with similar decision boundaries. Nevertheless, few works focus on generating adversarial examples based on ViTs. Most existing methods rely on the properties of CNNs and have been shown to transfer poorly to ViTs [8], which hinders the practical development of ViTs. Leveraging the cross-model transferability of adversarial examples, we propose a transfer-based attack method that probes the inner properties of ViTs.

Fig. 1

The MASR obtained with different numbers of encoder blocks, reflecting their effect on the transferability of adversarial examples. K is the number of blocks in Deit-T that are utilized to generate adversarial examples. ViTs are made up of T2T-7 [27], T2T-24 [27], TnT [28], ViT-S and ViT-B [5]. CNNs are made up of VGG19 [29], DN201 [30], SE154 [31], RN50 and RN152 [32]. MASR denotes the mean attack success rate of adversarial examples against the target models

A vision transformer is composed of multiple encoder blocks, each consisting of layer normalization [33], multi-head self-attention (MHSA) and a feed-forward network (FFN). Each encoder block produces self-attention feature maps that express global features across image patches. This design is inherently different from CNNs and limits the effectiveness of existing attack methods. To address this problem, we conduct a toy experiment to explore the impact of encoder blocks on the transferability of adversarial examples. The experimental results are shown in Fig. 1. We observe that partial rather than all encoder blocks can significantly improve the transferability of adversarial examples. More specifically, adversarial examples are generated by the fast gradient sign method (FGSM) [34] and projected gradient descent (PGD) [35] with Deit-T [36] on the ImageNet validation dataset [37], respectively. Then we test the mean attack success rate (MASR) of these adversarial examples on five CNN and five ViT models. It turns out that using all encoder blocks to generate adversarial examples is redundant. For example, the MASR does not rise consistently as the number of encoder blocks increases; in some cases, it even decreases. This indicates that fusing additional information from encoder blocks causes adversarial examples to overfit the source model, hindering adversarial transferability.

Motivated by the above observation, we propose a novel black-box attack method based on the encoder blocks of ViTs, named partial blocks search attack (PBSA). PBSA generates highly transferable adversarial examples on ViTs that can attack both CNN and ViT target models. To the best of our knowledge, this is the first work to explore the relationship between the number of encoder blocks and the transferability of adversarial examples. In particular, we introduce a block weight score difference, which effectively divides encoder blocks into two categories based on their sensitivity to adversarial perturbations. According to the difference in noise sensitivity between the two categories of encoder blocks, we design two strategies to generate adversarial examples. We discover that using partial encoder blocks instead of all encoder blocks effectively alleviates overfitting to a specific source model. In addition, we employ an adaptive weight to adjust the magnitude and stabilize the update direction of perturbations, which further enhances adversarial transferability.

To summarize, we highlight the main contributions of this work as follows:

  • We present a partial blocks search attack (PBSA) method to enhance the transferability of adversarial examples, which divides encoder blocks into two categories by the weight score difference to prevent overfitting.

  • We propose to regularize the self-attention feature maps in part of the encoder blocks of ViTs, and create an ensemble of the remaining blocks for final prediction. This promotes more stable adversarial perturbations because it reduces the number of blocks involved in each strategy.

  • Extensive experiments are conducted to demonstrate that PBSA achieves superior performance over state-of-the-art transferable attack methods. Furthermore, empirical results show that transferability can be significantly enhanced by combining our method with existing methods.

The rest of this paper is organized as follows. Section 2 presents some related works. Section 3 introduces the proposed PBSA method involving the partial blocks search process and adversarial attack process. Section 4 describes experiment settings and extensive experimental results on various target models. Finally, Section 5 concludes the work of this paper.

2 Related works

When no information about the target model is available, the transfer-based attack is a straightforward strategy to generate adversarial examples that is both flexible and practical. Therefore, transfer-based attacks have attracted increasing attention recently. Numerous pioneering works generate adversarial examples by the transfer-based attack, including one-step attacks [34] and iterative attacks [38]. In this section, we briefly introduce relevant works on transfer-based attack methods on CNNs and ViTs.

2.1 Transfer-based attack methods on CNNs

FGSM is a fundamental one-step attack method that updates once along the gradient direction of the loss function within the perturbation budget \(\epsilon\). It usually runs at high speed but has a poor attack success rate [34]. PGD extends the one-step attack by optimizing along the gradient direction over several iterations. After each update, it projects adversarial examples onto the \(\epsilon\)-sphere, which yields strong attack performance under the white-box setting [35]. Unfortunately, it is prone to overfit the source model, leading to low transferability [26, 39].

There are many recent studies focusing on improving the transferability of adversarial examples. Dong et al. propose the momentum iterative attack (MI), which adopts a momentum term in an iterative manner to stabilize optimization and escape poor local maxima [40]. They also investigate how ensemble-based approaches attack multiple models simultaneously. Wu et al. exploit a property of skip connections and design the skip gradient method (SGM) to produce highly transferable adversarial examples [26]. They discovered that more low-level information can be preserved through a decay factor on gradients, which facilitates the improvement of transferability. Wu et al. propose an attention-guided transfer attack (ATA) that introduces model attention to regularize the search for adversarial perturbations [39]. It allocates suitable attention over extracted features to mitigate overfitting to specific blind spots of the source model. In contrast to the attention-guided transfer attack, Wang et al. propose a feature importance-guided transfer attack [41]. They disrupt critical object-aware features by considering feature importance, which effectively suppresses model-specific features and promotes the transferability of adversarial examples. In addition, the dual attention suppression (DAS) attack corrupts similar attention between multiple models and reduces human attention to generate adversarial examples in the physical world [42].

Different from regularization methods, Kantipudi et al. identify the color robustness problem and propose to attack the color channels of images [43]. They design a simple yet effective method called the color channel perturbation attack, which changes the original color channels of images with stochastic weights. To explore the impact of color distortions on the performance of neural networks, De et al. provide a dedicated dataset of color distortions and color modifications [44].

Although the above methods exhibit outstanding transferability on CNNs, the significant structural difference between the networks limits their effectiveness on ViTs. Therefore, it is challenging to extend existing CNN-based methods directly to ViTs.

2.2 Transfer-based attack methods on ViTs

Recently, ViTs have achieved impressive performance in image classification, attracting increasing attention. They process the input as a sequence of flattened patches and model global information across patches based on the self-attention mechanism. A great deal of research on vision transformers has emerged. The vision transformer (ViT) is the first work to establish a pure transformer architecture for computer vision [5]. It achieves high accuracy and strong robustness on multiple image recognition benchmarks when applied directly to sequences of image patches. To improve data efficiency, the data-efficient image transformer (DeiT) is proposed, which focuses on distillation through attention [36]. Without any large-scale external dataset, it attains excellent performance compared to state-of-the-art CNNs while reducing model parameters. Tokens-to-token (T2T) is proposed to overcome the limitations of simple tokenization [27]. It models the local structure of an image with a T2T module and combines neighboring tokens to learn low-level structures, which achieves competitive performance when trained from scratch on ImageNet. Meanwhile, Han et al. point out that it is essential to consider attention inside local patches and propose the transformer in transformer (TNT) to enhance the feature representation ability [28]. They introduce an outer transformer block and an inner transformer block to explore relationships among patches, which can extract more detailed features.

Some current works state that ViTs achieve higher robustness than CNNs under adversarial attacks. However, Naseer et al. indicate that this is because conventional attack methods do not leverage the true representational potential of ViTs, which leads to sub-optimal attack procedures [9]. Few works have investigated how to generate adversarial examples with high transferability on ViTs.

The first such work proposes self-ensemble (SE) and token refinement (TR), both based on all encoder blocks, to generate highly transferable adversarial examples on ViTs [9]. On the one hand, SE extracts class tokens from all encoder blocks to exploit more class-specific information with a shared classifier head. It exploits an ensemble of multiple discriminative pathways to optimize the direction of adversarial perturbations. On the other hand, TR strives to refine information and align the class tokens extracted from encoder blocks. This remarkably enhances the transferability of adversarial examples by exploiting structural information. Both of the above methods achieve high attack performance across different ViTs, but TR is time-consuming because it requires fine-tuning on an external dataset. To overcome this issue, Wei et al. propose a dual attack framework tailored to the architecture of ViTs that helps alleviate the over-fitting concern [45]. They emphasize that propagating gradients through the attention weights of each head in an encoder block and perturbing all patches decrease transferability.

We propose a transfer-based attack method to prevent adversarial examples from overfitting to a specific source model. Different from previous research, our proposed method focuses on investigating how the encoder blocks involved affect adversarial transferability. It achieves high performance and can easily be combined with existing methods to further boost the transferability of adversarial examples.

3 Methodology

In this section, we provide a detailed description of the proposed PBSA. It consists of two procedures: the partial blocks search process and the adversarial attack process. As shown in Fig. 2, we first generate a negative image set for comparison with the original image. To search for suitable encoder blocks, we then back-propagate the gradients of the loss and of the self-attention maps, which are extracted from the multi-head self-attention. Finally, we adopt two strategies based on different blocks to balance the knowledge of the images and the source model, which optimizes the direction of perturbations.

Fig. 2

Illustration of Partial Blocks Search Process. Given an original image, negative images are generated by adding different noise. The red dashed line represents backpropagation and WSD is the weight score difference of the encoder blocks

Fig. 3

Structure of the self-attention block

3.1 Partial blocks search process

In this section, we present an approach to identify the importance of blocks and search for suitable encoder blocks for each strategy. The appropriate encoder blocks are searched for via gradient backpropagation, as depicted in Fig. 2. Consider an image \(x\in {\mathbb {R}}^{H \times W \times C}\) with the corresponding ground-truth label y, where H, W and C represent the height, width and number of channels of the image, respectively. A transformer model f has B encoder blocks, and f(x) represents the predicted label of x. The image is reshaped into a sequence of non-overlapping 2D patches \(x_{p} \in {\mathbb {R}}^{N \times D}\) before being fed into the transformer, where \(N=H W / P^{2}\) is the number of patches, P is the patch size and \(D=P^{2} \cdot C\) is the patch dimension. As shown in Fig. 3, Q, K, and V in multi-head self-attention are the core components of ViTs for global feature aggregation, which capture the interactions among patches. Let \(Z_{i}\) be the input of the i-th block; then we calculate \(Q_{i}=Z_{i} W_{q}\), \(K_{i}=Z_{i} W_{k}\) and \(V_{i}=Z_{i} W_{v}\), where \(W_{q} \in {\mathbb {R}}^{D \times D_{q}}\), \(W_{k} \in {\mathbb {R}}^{D \times D_{k}}\) and \(W_{v} \in {\mathbb {R}}^{D \times D_{v}}\) are learnable weight matrices. Finally, the output of multi-head self-attention in the i-th encoder block is given by

$$\begin{aligned} S A M_{i}(x)={\text {softmax}}\left( Q_{i} \cdot K_{i}^{T} / \sqrt{D_{q}}\right) \cdot V_{i}, {\quad i \in [0,B-1]}, \end{aligned}$$
(1)

where i is the index of the encoder block, and the self-attention feature map of the i-th block is denoted \(S A M_{i}(x)\).
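To make the notation concrete, the following is a minimal, hedged sketch of Eq. (1) for a single attention head in PyTorch. The tensor shapes, the dummy dimensions (196 patches, embedding dimension 192 as in Deit-T) and the function name `self_attention_map` are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def self_attention_map(Z_i, W_q, W_k, W_v):
    """Eq. (1) for one head: Z_i is the (N, D) input of the i-th block."""
    Q = Z_i @ W_q                                             # (N, D_q)
    K = Z_i @ W_k                                             # (N, D_k), D_k == D_q here
    V = Z_i @ W_v                                             # (N, D_v)
    attn = F.softmax(Q @ K.T / Q.shape[-1] ** 0.5, dim=-1)    # (N, N) attention weights
    return attn @ V                                           # SAM_i(x): (N, D_v)

# Example with dummy dimensions.
N, D = 196, 192
Z = torch.randn(N, D)
W_q, W_k, W_v = (torch.randn(D, D) for _ in range(3))
sam = self_attention_map(Z, W_q, W_k, W_v)                    # torch.Size([196, 192])
```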

To eliminate the detrimental influence of redundant model information on adversarial examples, a weight score is proposed to divide the encoder blocks into two categories effectively. \(W(x)=\left[ w^{0}(x), w^{1}(x), \ldots , w^{B-1}(x)\right]\) denotes the model's weight scores, and \(J(\cdot , \cdot )\) denotes the loss function of the transformer model f. \(\nabla _{x} J(x, y)\) represents the effect of the image pixels on the model loss, while \(\nabla _{x} S A M_{i}(x)\) represents the effect of the image pixels on the self-attention maps of the i-th block. To ensure that the weight score has the same resolution as the image, so that it can easily be applied to generate highly transferable adversarial examples, the weight score of the i-th encoder block is defined as

$$\begin{aligned} w^{i}(x)=\frac{\nabla _{x} J(x, y)}{\nabla _{x} S A M_{i}(x)}. \end{aligned}$$
(2)

Inspired by how contrastive learning extracts more discriminative features through positive and negative images [46], a negative image set is constructed by introducing random noise into an original image, as described in Fig. 2, which is used to evaluate how well the encoder blocks extract features. Thus the average weight score of the i-th block computed over the negative image set is defined as

$$\begin{aligned} w_{M}^{i}(x)=\frac{1}{M} \sum _{m=1}^{M} w^{i}\left( x \odot \text{ Mask } _{p}^{m}\right) , \end{aligned}$$
(3)

where \(\text { Mask }_{p}^{m}\) represents the m-th random mask, p is the probability that a pixel value is set to zero, M is the total number of negative images corresponding to an original image, and \(\odot\) denotes the element-wise product.
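A hedged sketch of Eqs. (2) and (3) is given below. The paper does not spell out how the gradient of the (non-scalar) self-attention map is taken, so the `.sum()` reduction, the small `eps` added to the denominator, and the hypothetical hook `model.sam(x, i)` that returns \(S A M_{i}(x)\) are all assumptions made for illustration.

```python
import torch
import torch.nn.functional as F

def weight_score(model, x, y, block_idx, eps=1e-12):
    """Eq. (2): element-wise ratio of the loss gradient to the SAM gradient."""
    x_in = x.detach().clone().requires_grad_(True)
    loss = F.cross_entropy(model(x_in), y)
    g_loss = torch.autograd.grad(loss, x_in)[0]

    x_in2 = x.detach().clone().requires_grad_(True)
    sam = model.sam(x_in2, block_idx)            # hypothetical hook returning SAM_i(x)
    g_sam = torch.autograd.grad(sam.sum(), x_in2)[0]
    return g_loss / (g_sam + eps)                # same resolution as the image

def avg_negative_weight_score(model, x, y, block_idx, M=5, p=0.3):
    """Eq. (3): average weight score over M randomly masked negative images."""
    scores = []
    for _ in range(M):
        mask = (torch.rand_like(x) > p).float()  # each pixel zeroed with probability p
        scores.append(weight_score(model, x * mask, y, block_idx))
    return torch.stack(scores).mean(dim=0)
```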

When the weight scores of the negative images are closer to those of the original image, the features extracted by the corresponding block are more discriminative, which also indicates that the block has a stronger anti-interference ability. Therefore, the weight score difference (WSD) is introduced to express the anti-interference ability of an encoder block, reflecting the similarity between the variation trends of the loss and the self-attention maps for different inputs. It is obtained by computing the difference between the weight scores of the original image and of the corresponding negative image set. The weight score difference \(d^{i}\) of the i-th block is defined as

$$\begin{aligned} d^{i}=|w^{i}(x)-w_{M}^{i}(x)|, \end{aligned}$$
(4)

where \(|\cdot |\) denotes the absolute value. The WSDs are then sorted in ascending order and the indexes of the corresponding blocks are recorded as \(S=\left[ s_{0}, s_{1}, \ldots , s_{B-1}\right]\). For a robust encoder block, the variation trends of the loss and the self-attention maps for different inputs are similar. Therefore, a smaller WSD means a stronger anti-interference ability of the block. Specifically, the \(s_{0}\)-th block and the \(s_{B-1}\)-th block have the strongest and the weakest anti-interference ability, respectively. For clarity, the first K blocks in S are called strong blocks, while the remaining blocks are called weak blocks.
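Building on the sketch above, the block split based on Eq. (4) could look as follows. Reducing the per-pixel WSD map to a scalar per block by averaging is an assumption made purely for ranking; the helper functions are the hypothetical ones sketched earlier.

```python
import torch

def split_blocks(model, x, y, num_blocks, K=3, M=5, p=0.3):
    """Eq. (4) plus the split: the first K indices (ascending WSD) are strong blocks."""
    wsd = []
    for i in range(num_blocks):
        w_orig = weight_score(model, x, y, i)                        # Eq. (2)
        w_neg = avg_negative_weight_score(model, x, y, i, M=M, p=p)  # Eq. (3)
        wsd.append((w_orig - w_neg).abs().mean())                    # scalar per block
    order = torch.argsort(torch.stack(wsd))                          # ascending WSD
    return order[:K].tolist(), order[K:].tolist()                    # strong, weak
```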

3.2 Adversarial attack process

The adversarial example \(x^{*}\) is visually similar to x but satisfies \(f(x)\ne f(x^{*})\). For non-targeted attacks, the loss function should be maximized while constraining the perturbation \(\delta\) within \(\epsilon\). Therefore, the adversarial attack process can be formulated as the optimization problem

$$\begin{aligned} \underset{\delta }{\arg \max } J\left( x^{*}, y\right) , \text{ s.t. } \Vert \delta \Vert _{\infty } \le \epsilon , \end{aligned}$$
(5)

where \(\Vert \cdot \Vert _{\infty }\) represents the \(L_{\infty }\)-norm, which is commonly utilized to measure the maximum magnitude of the perturbation.

Fig. 4

Illustration of Adversarial Attack Process

Since more discriminative information can be obtained from blocks with strong anti-interference ability, an ensemble of strong blocks is created for the final prediction. As shown in Fig. 4, we select the outputs of the strong blocks and accumulate their losses for generating adversarial examples, defined as

$$\begin{aligned} {L_{P B}(x)=\sum _{k=0}^{K-1} J_{s_{k}}(x, y),} \end{aligned}$$
(6)

where \(J_{s_{k}}(\cdot , \cdot )\) represents the loss of the \(s_{k}\)-th block.
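A minimal sketch of Eq. (6) is shown below. The hook `model.block_logits(x, i)`, which would return the prediction obtained from the i-th block through a shared classifier head, is hypothetical, and cross-entropy is assumed as the per-block loss.

```python
import torch.nn.functional as F

def partial_block_loss(model, x, y, strong_blocks):
    """Eq. (6): accumulate the classification loss over the K strong blocks."""
    return sum(F.cross_entropy(model.block_logits(x, i), y) for i in strong_blocks)
```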

Although the weak blocks are less resistant to interference, they are more sensitive to noise, which helps measure the difference between two images. Therefore, these blocks can be utilized to measure the \(l_{2}\)-distance between the self-attention feature maps of adversarial examples and the original images. As depicted in Fig. 4, another ensemble of weak blocks is created to constrain perturbations. The attention-based regularization loss \(L_{AR}\) is defined as

$$\begin{aligned} {L_{AR}=\sum _{k=K}^{B-1}\left\| S A M_{s_{k}}\left( x^{*}\right) -S A M_{s_{k}}(x)\right\| _{2},} \end{aligned}$$
(7)

where \(\left\| \cdot \right\| _{2}\) represents the Euclidean distance, which is utilized to measure the similarity between the original images and the adversarial examples.
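A sketch of Eq. (7) under the same assumptions (the hypothetical `model.sam(x, i)` hook) might look like this:

```python
import torch

def attention_regularization(model, x_adv, x, weak_blocks):
    """Eq. (7): l2 distance between the SAMs of adversarial and original images."""
    return sum(torch.norm(model.sam(x_adv, i) - model.sam(x, i), p=2)
               for i in weak_blocks)
```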

To further boost the transferability of adversarial examples, we adopt the \(l_{2}\)-norm to constrain the distance between adversarial examples and original images, generating more stable perturbations. The pixel-based regularization loss \(L_{PR}\) is defined as

$$\begin{aligned} {L_{PR}=\left\| x^{*}-x\right\| _{2}.} \end{aligned}$$
(8)

By introducing the attention-based regularization loss \(L_{AR}\) and pixel-based regularization loss \(L_{PR}\), we get the proposed objective for the adversarial attack, which is formulated as

$$\begin{aligned} {L\left( x^{*}\right) =L_{P B}(x^{*})+\mu L_{AR}+\lambda L_{PR},} \end{aligned}$$
(9)

where \(\mu\) and \(\lambda\) are scalars used to balance the regularization terms.
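Putting the pieces together, Eqs. (8) and (9) can be sketched with the helpers defined above. Evaluating \(L_{PB}\) on the adversarial example is the natural reading for an iterative attack; the default values \(\mu =0.01\) and \(\lambda =0.5\) are taken from Section 4.1.3.

```python
import torch

def pbsa_objective(model, x_adv, x, y, strong_blocks, weak_blocks, mu=0.01, lam=0.5):
    """Eq. (9): the full PBSA objective to be maximized."""
    l_pb = partial_block_loss(model, x_adv, y, strong_blocks)       # Eq. (6)
    l_ar = attention_regularization(model, x_adv, x, weak_blocks)   # Eq. (7)
    l_pr = torch.norm(x_adv - x, p=2)                               # Eq. (8)
    return l_pb + mu * l_ar + lam * l_pr
```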

To further improve the transferability of adversarial examples, we perturb the most influential pixels using an adaptive weight based on the best strong block, which can be expressed as

$$\begin{aligned} A W(x)=\text {ReLU}\left( \frac{w_{M}^{s_{0}}(x)}{\left\| w_{M}^{s_{0}}(x)\right\| _{2}}\right) +1. \end{aligned}$$
(10)

The \(\text {ReLU}(\cdot )\) function attenuates the influence of irrelevant pixels and enlarges the perturbations of essential pixels.
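A sketch of Eq. (10), assuming the weight score in the numerator is the masked-average score of the strongest block \(s_{0}\) (this reading of the subscript is an assumption):

```python
import torch

def adaptive_weight(model, x, y, best_strong_block, M=5, p=0.3):
    """Eq. (10): per-pixel adaptive weight built from the best strong block."""
    w = avg_negative_weight_score(model, x, y, best_strong_block, M=M, p=p)
    return torch.relu(w / torch.norm(w, p=2)) + 1.0
```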

The attack process of PBSA is based on PGD, one of the most popular attack methods [35]. It perturbs original images by optimizing along the gradient direction over several iterations and projects adversarial examples onto the \(\epsilon\)-sphere. Therefore, the attack process of the proposed PBSA can be expressed as

$$\begin{aligned} {x_{t+1}^{*}=C l i p_{x,\epsilon }\left( x_{t}^{*}+\alpha \cdot A W(x) \odot {\text {sign}}\left( \nabla _{x} L\left( x_{t}^{*}\right) \right) \right) ,} \end{aligned}$$
(11)

where \(C l i p_{x,\epsilon }\left( \cdot \right)\) denotes clipping the values into \(\left[ x-\epsilon , x+\epsilon \right]\), \(\alpha\) is the step size and t is the iteration index. Finally, our algorithm, denoted the partial blocks search attack, is summarized in Algorithm 1.

Algorithm 1 Partial blocks search attack (PBSA)
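For completeness, a hedged end-to-end sketch of the update in Eq. (11) is given below, built on the helpers sketched in the previous subsections. It assumes images on a \([0, 255]\) scale so that \(\epsilon =16\) and \(\alpha =2\) from Section 4.1.3 apply directly; the model hooks remain hypothetical and should be adapted to your own pipeline.

```python
import torch

def pbsa_attack(model, x, y, strong_blocks, weak_blocks,
                eps=16.0, alpha=2.0, T=10, mu=0.01, lam=0.5):
    """Iterative PBSA attack following Eq. (11)."""
    x = x.detach()
    aw = adaptive_weight(model, x, y, strong_blocks[0])              # Eq. (10)
    x_adv = x.clone()
    for _ in range(T):
        x_adv = x_adv.detach().requires_grad_(True)
        loss = pbsa_objective(model, x_adv, x, y,
                              strong_blocks, weak_blocks, mu, lam)   # Eq. (9)
        grad = torch.autograd.grad(loss, x_adv)[0]
        x_adv = x_adv + alpha * aw * grad.sign()                     # signed ascent step
        x_adv = torch.min(torch.max(x_adv, x - eps), x + eps)        # Clip_{x, eps}
        x_adv = x_adv.clamp(0.0, 255.0)                              # stay a valid image
    return x_adv.detach()
```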

4 Experiments

In this section, we first provide detailed experiment settings. To evaluate the performance of our method, we then conduct comprehensive experiments comparing it with existing methods, and discuss and analyze the experimental results. Finally, we demonstrate the effectiveness of PBSA with ablation experiments.

4.1 Experiment settings

4.1.1 Datasets

We select the same clean images as previous studies [40] and resize all images to a uniform size for a fair comparison. The dataset consists of 1000 images of different categories from the ILSVRC 2012 validation dataset [37] that can be almost perfectly classified by all the target models [40]. The Deit models adopted as source models in our experiments were originally pre-trained with an image size of \(224 \times 224\times 3\) [36]. To load the model parameters correctly, all images are resized to this size before being fed into the source models.

4.1.2 Models

We adopt Deit-T and Deit-S [36], which offer high data efficiency, as source models. Both Deit-T and Deit-S have 12 encoder blocks. They reshape an original image into a sequence of 196 flattened patches as input, where the patch size is 16.

After fixing a source model, the adversarial examples are fed into multiple target models to validate the effectiveness of PBSA. Extensive experiments on two completely different network architectures are conducted for assessment, i.e., vision transformer models and convolutional neural networks. T2T-7, T2T-24 [27], TnT [28], ViT-S and ViT-B [5] are the five ViT models with diverse backbones that we consider. Moreover, we also take into account five distinct CNN models, including VGG19 [29], DenseNet201 (DN201) [30], SENet154 (SE154) [31], ResNet50 (RN50) and ResNet152 (RN152) [32]. These models are trained on the ImageNet dataset and all achieve high accuracy on the classification task.

4.1.3 Implementation details

We compare our method with classic and state-of-the-art attack methods, including FGSM [34], PGD [35], SGM [26], PGD-SE and PGD-RE [9]. We adopt the official settings provided in the corresponding papers. As SGM was originally proposed to attack CNNs, we follow the attack setting in [45], because ViTs have components similar to CNNs, such as skip connections. To enable fair comparisons, we set the maximum perturbation size \(\epsilon =16\) for all attack methods. Except for FGSM, the step size \(\alpha\) is set to 2 and the number of iterations N is 10. For SGM, the decay factor \(\gamma\) is set to 0.2. In our method, the number of strong blocks K is 3, the probability p that a pixel value is zeroed is 0.3, the mask number M is 5, and \(\mu =0.01\) and \(\lambda =0.5\) are used to balance the regularization terms.
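For convenience, the default settings above can be collected in a single configuration; this dictionary is merely a reproduction aid, not part of any official release.

```python
# Default attack settings from Section 4.1.3.
PBSA_CONFIG = {
    "epsilon": 16,     # maximum L_inf perturbation
    "alpha": 2,        # step size (not used by the one-step FGSM)
    "iterations": 10,  # N
    "sgm_gamma": 0.2,  # decay factor for the SGM baseline
    "K": 3,            # number of strong blocks
    "p": 0.3,          # probability that a pixel is zeroed in a negative image
    "M": 5,            # number of negative images per original image
    "mu": 0.01,        # weight of the attention-based regularization
    "lambda": 0.5,     # weight of the pixel-based regularization
}
```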

4.1.4 Evaluation metrics

To evaluate the performance of our method, following [9, 40], we report the attack success rate (ASR), calculated as the percentage of adversarial examples that the target model classifies into a different category than the corresponding clean images. A higher attack success rate represents better transferability [9, 40]. The ASR is formalized as

$$\begin{aligned} A S R=\frac{{\text {Num}}\left( f\left( x^{*}\right) \ne f(x)\right) }{{\text {Num}}(x)}, \end{aligned}$$
(12)

where \({\text {Num}}(\cdot )\) is a counting function.

In addition, we exploit the mean absolute difference (MAD) between original images and adversarial examples to measure the adversarial perturbations. A higher mean absolute difference indicates larger adversarial perturbations. It can be expressed as

$$\begin{aligned} M A D=\frac{1}{H * W} \sum _{i=1}^{H * W}|x_{i}-x_{i}^{*}|, \end{aligned}$$
(13)

where \(x_{i}\) represents the i-th pixel of image x.
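Both metrics are straightforward to compute; a minimal sketch for batched tensors (assuming a \([0, 255]\) pixel scale and an arbitrary target model returning logits) is:

```python
import torch

def attack_success_rate(model, x, x_adv):
    """Eq. (12): fraction of examples whose predicted label changes."""
    with torch.no_grad():
        pred_clean = model(x).argmax(dim=1)
        pred_adv = model(x_adv).argmax(dim=1)
    return (pred_clean != pred_adv).float().mean().item()

def mean_absolute_difference(x, x_adv):
    """Eq. (13): mean absolute per-pixel perturbation."""
    return (x - x_adv).abs().mean().item()
```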

4.2 Comparison of transferability

We first perform adversarial attacks on various ViTs to demonstrate the effectiveness of the proposed PBSA. The ASRs of different methods on Deit-T and Deit-S are reported in Table 1, with the source models on the rows and the five target models on the columns. It is apparent that all of these methods suppress the classification performance of the target models, but the proposed PBSA achieves substantially higher transferability than the other methods on both Deit-T and Deit-S. In general, PBSA consistently outperforms the baseline attacks by \(12\%\) to \(26\%\) on average. For instance, when we craft adversarial examples on Deit-T, the proposed PBSA yields a \(97.38\%\) success rate against ViT-S and a \(50.82\%\) success rate against T2T-24, while the PGD attack only obtains corresponding success rates of \(84.46\%\) and \(22.12\%\), respectively. These experimental results convincingly demonstrate the effectiveness of the proposed PBSA.

Table 1 The attack success rates (%) of adversarial examples crafted by different methods against various ViTs
Table 2 The attack success rates (%) of adversarial examples crafted by different methods against various CNNs

Given the remarkable improvement on ViTs, we also analyze the ASR of different methods against multiple CNNs to further evaluate the performance of PBSA. This helps to prove that PBSA is feasible for attacking a model without knowing its architecture or other properties. The results are summarized in Table 2. The proposed PBSA also improves transferability to CNNs by a considerable margin, which indicates that model-specific information is suppressed effectively. To be more precise, the average ASRs of the adversarial examples generated by PBSA on Deit-T and Deit-S against various CNNs are about \(27.44\%\) and \(28.75\%\) higher than those of the PGD attack, respectively. This reflects that the adversarial examples generated by PBSA achieve high performance and generalization on both ViTs and CNNs.

In addition, we further discuss and analyze the experimental results of these attack methods in Tables 1 and 2. To begin with, FGSM performs better on models with higher capacity than the other baseline attack methods. It is a one-step attack, which improves transferability by avoiding overfitting to the source model; however, it exhibits poor performance on small models. Moreover, SGM obtains relatively higher ASR because of the similar components between ViTs and CNNs. Finally, PGD-SE and PGD-RE achieve relatively lower transferability compared to our method. A possible explanation is that they suffer from model-specific local optima because they rely on complete information from the class tokens. Our method does not rely on the source model but focuses on the image itself, which allows it to achieve better transferability across different models. Moreover, regardless of whether the targets are ViTs or CNNs, adversarial examples crafted on Deit-S have significantly better transferability than those crafted on Deit-T.

Furthermore, the MASR and MAD of adversarial perturbations crafted by the proposed PBSA and the other baseline attacks on Deit-T are reported in Table 3. We find that PGD introduces the smallest perturbations but displays the worst transferability. Meanwhile, FGSM produces the largest perturbations, yet its performance is not the best, implying that the adversarial perturbations crafted by these methods are unstable and ineffective. PBSA delivers the best transferability with an acceptable difference under the same perturbation budget compared to the other baseline attacks. For instance, the MASR of the proposed PBSA outperforms PGD by more than \(25\%\) on ViTs and \(27\%\) on CNNs, whereas the MAD of PBSA is about one unit lower than that of FGSM, which confirms that our attack is stealthy. The main reason is that PBSA focuses on optimizing the perturbation direction and promoting the generation of more reliable perturbations. Although the MAD of PBSA is greater than that of some baseline attacks, the perturbations remain imperceptible to the human eye.

Table 3 The mean absolute difference and mean attack success rates (%) by existing methods against various target models

4.3 Visualization of adversarial examples

Randomly selected adversarial examples crafted by various attacks and their corresponding original images are displayed in Fig. 5. The predicted class labels on T2T are provided below the images. We intuitively observe that these attack methods generate images visually similar to the original images, while the proposed PBSA achieves much higher transferability.

Fig. 5

Visualization of adversarial examples crafted by various attacks on Deit-T. Blue indicates correctly classified examples, whereas red indicates incorrectly classified examples

4.4 Combination with existing methods

To investigate the generalization of our method, we combine PBSA with existing methods. Since SGM and PGD-RE improve transferability the most among the baseline attacks, we further enhance the performance of the proposed PBSA by integrating it with these two attacks. To ensure fairness, the experiments are conducted on Deit-T with the same settings. The results of the combined methods are summarized in Tables 4 and 5. We can observe that the proposed PBSA yields a noticeable enhancement of transferability over the baseline attacks. In general, the transferability of SGM+PBSA and PGD-RE+PBSA consistently outperforms the corresponding baseline attacks by \(16\%\) and \(12\%\) on average, respectively, indicating that the improvement achievable by the existing methods alone is rather limited. In particular, the combination of these three attacks achieves the best performance among the existing state-of-the-art methods. The results corroborate that PBSA is compatible with other attacks and can remarkably enhance transferability by combining with existing methods in complementary ways.

Table 4 The attack success rates (%) of combined existing methods against ViTs
Table 5 The attack success rates (%) of combined existing methods against CNNs

4.5 Ablation studies

To highlight the contribution of the proposed PBSA, we explore the influence of its key hyper-parameters and components on transferability. Considering the similar trends in the results when utilizing different source models, we only discuss the transferability of adversarial examples crafted on Deit-T to simplify the analysis. Unless otherwise stated, the default experimental settings mentioned above are used.

Fig. 6

The effect of hyper-parameter \(\mu\) on attack success rates (\(\%\))

Fig. 7

The effect of hyper-parameter \(\lambda\) on attack success rates (\(\%\))

4.5.1 Effect of hyper-parameters

We first explore the impact of \(\mu\) and \(\lambda\) on transferability, which are used to balance the contributions of the different regularization terms. Specifically, we vary one hyper-parameter while keeping the other fixed: first, \(\mu\) is varied while \(\lambda\) is fixed to find an appropriate value of \(\mu\); then \(\lambda\) is varied under that value of \(\mu\). The results for values of \(\mu\) from 0.001 to 10 are presented in Fig. 6. We can intuitively see that the ASR curves on ViTs and CNNs are unimodal, and the target models share the same optimal value of \(\mu\). Specifically, as \(\mu\) increases, the results grow and then fall rapidly, and transferability peaks for every model when \(\mu\) is set to 0.01. The results for values of \(\lambda\) from 0.1 to 0.9 are depicted in Fig. 7. In general, transferability gradually improves as \(\lambda\) increases from 0.1 to 0.5, while \(\lambda\) has little impact when it is greater than 0.5. Consequently, we adopt \(\mu\) = 0.01 and \(\lambda\) = 0.5, which tend to yield more transferable adversarial examples.

Fig. 8

The effect of hyper-parameter K on attack success rates (\(\%\))

The number of strong encoder blocks K is the dominant hyper-parameter in our method, as it adjusts the encoder blocks involved in each strategy. The effect of K on transferability for various target models is presented in Fig. 8. We can observe that performance improves slightly for small K on some target models. Nevertheless, there is a drastic drop in transferability in almost every case when K increases further. This reveals that fusing more knowledge from encoder blocks degrades transferability due to the overfitting problem. This observation emphasizes the importance of searching for suitable encoder blocks for adversarial example generation and gives insight into the relationship between the number of strong blocks and transferability. Therefore, we set \(K=3\) as a compromise for better performance.

4.5.2 Effect of components

Table 6 The attack success rates (%) crafted by different components against ViTs
Table 7 The attack success rates (%) crafted by different components against CNNs

Based on the optimal parameters, we give a quantitative analysis of each component to confirm the effectiveness of the proposed PBSA in detail. We perform a series of ablation studies under various combinations of the components, including partial block search (BS), attention-based regularization (AR), and pixel-based regularization (PR). This experiment employs PGD as the baseline attack and Deit-T as the source model. As illustrated in Tables 6 and 7, each component achieves higher performance than the baseline, indicating that each component in PBSA helps to improve transferability. Among these combinations, the attack that combines all components improves the transferability against the various target models by \(26\%\) on average. These results demonstrate the effectiveness of the proposed method.

4.5.3 Effect of perturbation budget

The results for varied maximum perturbation on ViTs and CNNs are shown in Fig. 9. It is obvious that transferability increases steadily as the maximum perturbation grows. Based on the above experimental results, we also find that adversarial transferability tends to be higher on smaller target models. For example, whether the adversarial examples are generated on Deit-T or Deit-S, they achieve higher ASR on ViT-S and ResNet50 than on ViT-B and ResNet152, respectively. On the one hand, we conjecture that a larger model, which provides a sufficient search space, is more likely to fall into a local optimum than a smaller one. On the other hand, larger networks can extract more high-level information, which is less transferable than low-level information. This suggests that larger models are better at defending against adversarial perturbations.

Fig. 9

Effect of maximum perturbation on attack success rates (%)

5 Conclusion

In this paper, we explored the relationship between the number of encoder blocks and the transferability of adversarial examples. To generate highly transferable adversarial examples on vision transformers, we proposed a novel and flexible method based on distinct encoder blocks, named the partial blocks search attack (PBSA). Considering the noise sensitivity of each encoder block, we combined two strategies to generate adversarial examples, making full use of the encoder blocks. Since we only integrated partial blocks into each strategy, the risk of overfitting to the source model was mitigated, which enables adversarial examples to transfer successfully among different target models. We conducted a series of experiments with two source models against ten target models, including five ViTs and five CNNs. Experimental results demonstrated that adversarial examples generated by PBSA have significantly better transferability than those generated by existing methods. Therefore, our method can effectively attack a variety of models and serve as a benchmark for evaluating the robustness of neural network models. In future research, we plan to extend PBSA to attack videos and to use adversarial examples to build more robust neural network models.