1 Introduction

Deep neural networks (DNNs) perform remarkably well on various visual tasks, including image classification, object detection, and semantic segmentation. However, they are also susceptible to adversarial perturbations: perturbations combined with input data in a specific way that causes intentional misclassification.

Attacking deep neural networks has drawn increasing attention, and researchers have made great progress in understanding the space of adversarial perturbations, beginning in the digital domain (e.g. by modifying images corresponding to a scene) [1,2,3,4], and more recently in the physical domain [5,6,7,8]. Compared to attacks in the digital domain, adversarial perturbations face more challenges in the physical world: (1) Perturbations in the digital world can be so small in magnitude that a camera cannot perceive them due to sensor imperfections. (2) The perturbations generated by many current algorithms are image-dependent. If the image in front of the surveillance camera changes, the attacker must immediately generate a corresponding new perturbation, which is difficult to achieve in physical attacks. (3) The perturbed image is difficult to hold at a precise distance and angle in the view of the surveillance camera, so perturbations must be robust to various transformations (e.g. rotations or scaling) and locations.

To tackle these challenges and achieve attacks in physical scenarios, Brown et al. [9] proposed visible adversarial perturbations, called Adversarial Patch (GoogleAP). These adversarial patches can be printed, added to any scene, photographed, and presented to deep neural networks. Even when the patches are small, the deep neural network fails to identify the real objects in the scene and reports a false class. The adversarial patch is image-independent and robust to rotation and scaling, can be placed anywhere within the field of view of the deep neural network, and causes the deep neural network to output a targeted class (Fig. 1).

Adversarial patches can be applied to traffic signs to mislead automated vehicles [10], placed near products to fool online shopping platforms [11], or even affixed to clothing to hide attackers from surveillance cameras [12, 13]. Adversarial patches pose a serious challenge to the safety of deep neural networks. At the same time, research on adversarial patches helps to improve the defense capability of deep neural networks against malicious attacks.

However, GoogleAP requires the training data on which the target model is trained. This data dependency is a barrier to applying GoogleAP in the real world, as the training data of the target system is generally unavailable. For example, autonomous driving companies never tell the public what data they use to train detectors, online shopping websites protect the data used to train their classifiers from being stolen, and face verification devices carefully store their face data in their systems.

Fig. 1 Examples of real-world adversarial patch attacks against VGG-16. Each object is misclassified when the adversarial sticker is placed beside it

To address these shortcomings, we present a novel data independent approach to craft adversarial patches (DiAP). The objective of DiAP is to generate an adversarial patch that can fool the target model on most images without any knowledge of the data distribution. Inspired by GD-UAP [14], DiAP performs non-targeted attacks by fooling the features learned by the deep neural network. In other words, we formulate the non-targeted adversarial patch as the solution to an optimization problem that corrupts the features learned at each layer of the deep neural network, eventually causing the input to be misclassified. After generating the non-targeted adversarial patch, DiAP takes it as an important background item that helps the attacker extract features of the target class for crafting the targeted adversarial patch. Experimental results show that the adversarial patches generated by DiAP exhibit strong attack capabilities. In particular, by extracting vague information about the training data from non-targeted patches, DiAP outperforms state-of-the-art attack methods in blackbox attack scenarios.

2 Related work

Most prior work has focused on generating visually imperceptible adversarial perturbations that cover the entire input image [15, 16]. This type of attack is effective in digital scenes, but is difficult to deploy in actual physical scenes: the camera cannot faithfully capture the small perturbation on the input, so the adversarial attack fails in the real world.

Some researchers have tried to add visible perturbations at specific locations in the input image to attack deep neural networks [7, 17, 18]. Compared to visually imperceptible adversarial perturbations, visible perturbations at specific locations require changing fewer pixels and allow significant changes to the specified pixels. This kind of attack performs quite well in the digital domain and shows some attack capability in physical scenes. However, since the perturbations can only be placed at specific locations in the input image, their application in physical attacks is limited.

Recent work by [9] and [19] has studied adversarial perturbations under a new attack model, the adversarial patch, which is restricted to a small region and can be placed anywhere on the input image. This attack is based on the assumption that machine learning models operate without human validation of each input, so malicious attackers may not care about the imperceptibility of their attacks. Moreover, even if humans notice an adversarial patch, they may view it as an art form rather than a way of attacking the deep neural network.

The objectives presented by [9] and [19] to craft adversarial patches require real, clean examples to participate in training. However, in physical attacks it is generally difficult for malicious attackers to obtain the training data of the target system. Therefore, this paper presents a data independent attack method, DiAP.

From the perspective of whether there is an attack target, adversarial attacks against deep neural networks can be divided into non-targeted and targeted. Non-targeted attacks do not specify the prediction class of the deep neural network, and the adversarial examples may be identified as any category other than the correct one. In contrast, the purpose of a targeted attack is to trick the deep neural network into recognizing adversarial examples as a specified category. Our DiAP can carry out both non-targeted and targeted attacks.

3 DiAP for non-targeted attack

In the first setup, we explore generating adversarial patches that can be used for non-targeted attack without training data.

3.1 Setting and method

Let \(\mu \) denote a distribution of images in \(\mathfrak {R}^{n}\), and f denote a pre-trained deep neural network that outputs for each image \(x\in \mathfrak {R}^{n}\) an estimated label f(x). We want the adversarial patch p to attack the target model by replacing a part of the image x, regardless of the scale, rotation or location of the patch. The adversarial example \(A(p, x, l, t)\) is obtained by a patch application operator which first applies transformations t (e.g. rotations or scaling) to the patch p, and then applies the transformed patch p to the image x at location l. That is, the goal of DiAP in a non-targeted attack is to seek a non-targeted adversarial patch \(p_{nt}\) such that

$$\begin{aligned} f(A(p_{nt}, x, l, t)) \ne f(x), \quad \text {for}\,\, x \sim \mu \end{aligned}$$
(1)
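For concreteness, the sketch below illustrates one possible patch application operator \(A(p, x, l, t)\) in PyTorch. It is not the authors' implementation: the function name apply_patch, the convention that the scale parameter gives the fraction of image area covered, and the use of black pixels as a transparency mask are our own assumptions.

```python
import torch
import torchvision.transforms.functional as TF

def apply_patch(patch, image, location, scale, angle):
    """Sketch of A(p, x, l, t): resize and rotate the patch p, then paste it
    onto image x with its top-left corner at `location` = (row, col).
    patch, image: float tensors of shape (3, H, W) with values in [0, 1]."""
    _, H, W = image.shape
    side = int(round((scale * H * W) ** 0.5))        # patch covers `scale` of the image area
    p = TF.resize(patch, [side, side])
    p = TF.rotate(p, angle, fill=0.0)                # rotate in place; corners are zero-filled
    mask = (p.sum(dim=0, keepdim=True) > 0).float()  # treat black pixels as transparent
    y, x = location                                  # assumed to keep the patch inside the image
    out = image.clone()
    region = out[:, y:y + side, x:x + side]
    out[:, y:y + side, x:x + side] = mask * p + (1 - mask) * region
    return out
```

At both training and test time, the location l and the transformation t (scale and rotation) are drawn at random for every patched image, which is what encourages robustness to these factors.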

In the absence of training data, we fool the features learned at individual layers of the deep neural network to craft the non-targeted adversarial patch \(p_{nt}\). To achieve this goal, we introduce a variant of the spurious activation loss proposed in [20]. In particular, the non-targeted patch \(p_{nt}\) is trained to optimize the spurious activation objective

$$\begin{aligned} \mathop {\arg \max }_{p_{nt}} \; {\mathbb {E}}_{l \sim L,\, t \sim T} \left[ \log \left( \prod \limits _{i=1}^{K} \left\Vert {\mathcal {L}}_i \left( A \left( p_{nt}, I_{z}, l, t \right) \right) \right\Vert _2 \right) \right] \end{aligned}$$
(2)

where L is a distribution over locations in the image, and T is a distribution over transformations of the patch. \(I_{z}\) is the background image with \(RGB=[0,0,0]\) for all pixels. \({\mathcal {L}}_i(A(p_{nt}, I_{z}, l, t))\) is the output tensor at layer i when the image \(A(p_{nt}, I_{z}, l, t)\) is fed to the network f, and K is the number of layers of f at which we maximize the output caused by \(A(p_{nt}, I_{z}, l, t)\). The proposed objective computes the product of the output magnitudes at these individual layers. Note that the RGB values of the background image \(I_{z}\) are all zero, so the background itself does not mislead the features extracted by the various layers of the deep neural network. Therefore, the erroneous features extracted by the network are actually caused by the non-targeted adversarial patch \(p_{nt}\).

Ideally, we want the patch \(p_{nt}\) to provoke as strong a disturbance as possible at all layers in order to fool the features distilled by multiple layers and attack the network; that is, the larger \({\mathbb {E}} \left[ \cdot \right] \) in Eq. (2), the greater the contamination caused by \(p_{nt}\). During training, the patch is transformed and then digitally inserted at a random location on the background image \(I_{z}\), and we optimize Eq. (2) without any training data. The entire process is detailed in Algorithm 1.

Algorithm 1 Generating the non-targeted adversarial patch \(p_{nt}\)
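As a rough illustration of Algorithm 1 (a minimal sketch, not the authors' code), the loop below optimizes \(p_{nt}\) by maximizing the sum of log activation norms, i.e. the log of the product in Eq. (2), on the all-black background \(I_z\). It reuses the hypothetical apply_patch helper from the sketch above; the choice of VGG-16, of the convolutional layers as \({\mathcal {L}}_i\), and of the transformation ranges are assumptions.

```python
import random
import torch
import torchvision

# Frozen pre-trained network; VGG-16 is used here purely for illustration.
model = torchvision.models.vgg16(pretrained=True).eval()
for w in model.parameters():
    w.requires_grad_(False)

# Record the activations L_i of every convolutional layer with forward hooks.
acts = []
for m in model.features:
    if isinstance(m, torch.nn.Conv2d):
        m.register_forward_hook(lambda _m, _inp, out: acts.append(out))

p_nt = torch.rand(3, 64, 64, requires_grad=True)   # non-targeted patch being learned
opt = torch.optim.SGD([p_nt], lr=8.0)              # eta = 8 (Sect. 3.2)
I_z = torch.zeros(3, 224, 224)                     # all-black background image

for _ in range(800):                               # Ite_max = 800 (Sect. 3.2)
    acts.clear()
    scale = random.uniform(0.05, 0.5)
    angle = random.uniform(-45.0, 45.0)
    side = int(round((scale * 224 * 224) ** 0.5))
    loc = (random.randint(0, 224 - side), random.randint(0, 224 - side))
    adv = apply_patch(p_nt.clamp(0, 1), I_z, loc, scale, angle)
    model(adv.unsqueeze(0))                        # the hooks populate `acts`
    # log(prod_i ||L_i||_2) = sum_i log ||L_i||_2; maximize it by minimizing the negative
    loss = -sum(torch.log(a.norm() + 1e-8) for a in acts)
    opt.zero_grad()
    loss.backward()
    opt.step()
    p_nt.data.clamp_(0, 1)                         # keep the patch a valid image
```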

3.2 Experiments and results

We utilized 5 pre-trained ImageNet models, viz. Inception-V3 [21], ResNet-50 [22], Xception [23], VGG-16 [24] and VGG-19 [24]. For all experiments, the weights of these models are kept frozen throughout the optimization process. We test our attack in three settings: (1) The Whitebox-Single Model Attack trains and evaluates a single patch on a single model, and the process is repeated on each of the models mentioned above. (2) The Whitebox-Ensemble Attack jointly trains a single patch across all five models, and then evaluates the patch by averaging the attack success rate across these models. (3) The Blackbox Attack follows a leave-one-out protocol, jointly training a single patch across four of the ImageNet models and then evaluating the blackbox attack on the fifth model. The blackbox attack is closest to a real-world attack, because the attacker knows neither the structure and parameters nor the training dataset of the target model. During training, we use the gradient descent optimizer and set the hyper-parameters \(Ite_{max}=800 \) and \( \eta =8 \) for Algorithm 1.
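The ensemble setting can be read as optimizing one patch against the average objective of several frozen models; the sketch below is only a schematic (the names models and patch_objective are placeholders we introduce, not part of the paper).

```python
# Schematic of the whitebox-ensemble setting: one patch, several frozen models.
# `patch_objective(model, patch)` stands for the single-model DiAP loss for one
# randomly transformed insertion (e.g. the negative of Eq. (2) or Eq. (4)).
def ensemble_objective(models, patch, patch_objective):
    return sum(patch_objective(m, patch) for m in models) / len(models)

# Blackbox evaluation: train the patch on four of the five models with this
# averaged objective, then test it on the held-out fifth model.
```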

In the testing phase, we need images to evaluate the attack performance of the patch, although our approach is data independent when generating the patch. Note that there are images in the ILSVRC 2012 validation data that are misclassified by the above 5 pre-trained models. For example, the top-1 accuracy of Inception-V3 is 82.8%. It is less meaningful to study the attack success rates if the pre-trained models cannot correctly classify the original images. To verify the effect of our attack method, we randomly choose 1000 images from the ILSVRC 2012 validation data for testing, and these images are correctly classified by the above 5 pre-trained models.

The patches are rescaled and rotated, and then digitally inserted at a random location on the test images. We selected 13 scaling parameters: 5 points at equal intervals from 1 to 10% and 8 points at equal intervals from 10 to 50% of the image area. The rotation angle is limited to \([-45^{\circ }, 45^{\circ }]\). Figure 2 shows the process of constructing adversarial examples for testing when the scaling parameter is 10%. The patch generated by Inception-V3 for the whitebox single model attack is rescaled to cover 10% of the input images, and then randomly rotated and placed at random positions on different test images.
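To make the evaluation protocol concrete, a success-rate computation along these lines might look as follows (a sketch under our assumptions: apply_patch is the hypothetical helper from Sect. 3.1, and test_images/true_labels hold the 1000 correctly classified validation images and their labels).

```python
import random
import torch

@torch.no_grad()
def attack_success_rate(model, patch, test_images, true_labels, scale):
    """Non-targeted success rate: fraction of (correctly classified) test images
    whose predicted label changes once the transformed patch is inserted."""
    fooled = 0
    for x, y in zip(test_images, true_labels):
        _, H, W = x.shape
        side = int(round((scale * H * W) ** 0.5))
        loc = (random.randint(0, H - side), random.randint(0, W - side))
        angle = random.uniform(-45.0, 45.0)
        adv = apply_patch(patch, x, loc, scale, angle)
        pred = model(adv.unsqueeze(0)).argmax(dim=1).item()
        fooled += int(pred != y)
    return fooled / len(test_images)
```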

Fig. 2 The process of constructing adversarial examples when the scaling parameter is 10%. Left images: original natural images. Central image: an adversarial patch. Right images: adversarial images. The patch is rescaled to cover 10% of the original images, and then randomly rotated and placed at random locations on different test images

Fig. 3 Attack success rates of non-targeted adversarial patches generated by DiAP. Each point in the plot is computed by averaging the results of 1000 adversarial examples, which are crafted by applying the patch to test images at random locations in these images. This is done for various scales or rotations of the patch, with each transformation independently tested on 1000 images. Note that these success rates are the mean over five ImageNet models

Fig. 4 Adversarial patches for non-targeted attack. a is generated for the whitebox single model attack, b is computed for the whitebox ensemble attack, and c is crafted for the blackbox attack. The target model is the pre-trained VGG-16

Figure 3 shows the attack success rates; each point is calculated over 5000 tests (1000 adversarial images \(\times \) 5 pre-trained models). When the non-targeted adversarial patch covers 10% of the image area, all three attack settings achieve success rates above 70%, with the whitebox single model attack exceeding 90%, even though DiAP knows nothing about the training data.

Figure 4 shows some patches generated for non-targeted attack. These patches all contain many small circular patterns and exhibit some symmetry. We examine the estimated labels of adversarial examples crafted with the adversarial patches shown in Fig. 4. When the patch covers 10% of the testing images, 97.1% of the adversarial examples superimposed with patch (a) are classified as bubble, 60.9% of the images perturbed by patch (b) are recognized as salt shaker or ladybug, and 55.8% of the images disturbed by patch (c) are identified as bubble or pinwheel. This indicates that when performing non-targeted attacks by optimizing Eq. (2), DiAP tends to construct patches with circular patterns in order to enhance robustness to various transformations, so the corresponding adversarial examples are more likely to be classified as circular objects.

When DiAP performs non-targeted attacks, the above experiments clearly show the existence of several dominant labels, and these labels are typically circular objects. However, when performing physical attacks, malicious attackers may want adversarial patches that cause the network to output any chosen target class. In the next section, we generate adversarial patches that trick the network into evaluating different adversarial examples as the target class.

4 DiAP for targeted attack

We now explore crafting adversarial patches that can be applied to perform targeted attack without training data.

Algorithm 2 Generating the targeted adversarial patch \(p_{t}\)

4.1 Setting and method

Let \({\widehat{y}}\) denote the target label. The goal of DiAP in a targeted attack is to seek a targeted adversarial patch \(p_{t}\) such that

$$\begin{aligned} f(A(p_{t}, x, l, t)) = {\widehat{y}}, \quad \text {for}\,\, x \sim \mu \end{aligned}$$
(3)

Ideally, the goal of Eq. (3) is achieved if the network estimates any image perturbed by the patch \(p_{t}\) as label \({\widehat{y}}\). The lack of data prevents us from extracting features of the target class \({\widehat{y}}\) from training examples x as existing methods do [9, 19]. Note that the non-targeted patch \(p_{nt}\) generated by Eq. (2) can effectively attack the network and lead it to recognize many adversarial examples as dominant labels, which implies that the patch may contain some information about the training data. We believe that this prior knowledge helps generate the targeted patch \(p_{t}\) without training data. In order to craft such a \(p_{t}\), we optimize the following objective

$$\begin{aligned} \mathop {\arg \max }_{p_{t}} {\mathbb {E}}_{l \sim L, t \sim T} \left[ \log \Pr \left( {\widehat{y}} \,|\, A(p_{t}, I_{nt}, l, t) \right) \right] \end{aligned}$$
(4)

where the background image \(I_{nt}\) is defined as

$$\begin{aligned} I_{nt}=A(p_{nt}, I_{z}, l, t) \end{aligned}$$
(5)

In other words, we take \(A(p_{nt}, I_{z}, l, t)\) as the background image. During training, the non-targeted patch \(p_{nt}\) is randomly scaled and rotated, and then digitally inserted at a random location on the image \(I_{z}\) to form a background image \(I_{nt}\). Next, the targeted patch \(p_{t}\) is also randomly transformed and digitally inserted at a random location on the background image \(I_{nt}\). Figure 5 shows this process.
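The following sketch shows one optimization step of Eq. (4) under these conventions. It is illustrative rather than the authors' implementation: apply_patch is the hypothetical helper from Sect. 3.1, the transformation ranges are assumptions, and target_idx is the ImageNet index of the chosen label \({\widehat{y}}\).

```python
import random
import torch
import torch.nn.functional as F

def targeted_step(model, p_t, p_nt, I_z, target_idx, opt):
    """One optimization step of Eq. (4) on a freshly built background I_nt.
    Assumes a square image and a frozen model; p_t is the tensor being optimized,
    and p_nt is the finished non-targeted patch (no gradient needed)."""
    H = I_z.shape[1]
    # Eq. (5): I_nt = A(p_nt, I_z, l, t) with a random location and transformation.
    s1 = random.uniform(0.05, 0.5)
    side1 = int(round((s1 * H * H) ** 0.5))
    I_nt = apply_patch(p_nt, I_z,
                       (random.randint(0, H - side1), random.randint(0, H - side1)),
                       s1, random.uniform(-45.0, 45.0))
    # Insert the targeted patch p_t onto I_nt with another random transformation.
    s2 = random.uniform(0.05, 0.5)
    side2 = int(round((s2 * H * H) ** 0.5))
    adv = apply_patch(p_t.clamp(0, 1), I_nt,
                      (random.randint(0, H - side2), random.randint(0, H - side2)),
                      s2, random.uniform(-45.0, 45.0))
    logits = model(adv.unsqueeze(0))
    loss = -F.log_softmax(logits, dim=1)[0, target_idx]  # maximize log Pr(y_hat | A(p_t, I_nt, l, t))
    opt.zero_grad()
    loss.backward()
    opt.step()
    p_t.data.clamp_(0, 1)                                # keep the patch a valid image
```

Repeating this step for \(Ite_{max}\) iterations with the gradient descent optimizer yields the targeted patch \(p_{t}\); note that no training image enters the loop.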

Fig. 5 The process of constructing the training input \(A(p_{t}, I_{nt}, l, t)\) for targeted attack. After a random transformation, \(p_{nt}\) is digitally inserted at a random position on the image \(I_{z}\) to craft \(I_{nt}\). We take \(I_{nt}\) as the background image in Eq. (4), and exploit its implicit information to help generate the targeted patch \(p_{t}\)

Fig. 6 Attack success rates of targeted attacks. The target class is toaster. DiAP adopts \(I_{nt}\) as the background image to generate adversarial patches for targeted attack. DiAP_Iz replaces \(I_{nt}\) with \(I_z\), and DiAP_n uses random noise as the background image. A real toaster image is also tested for comparison. All attack success rates are averaged over the five models

Using \(A(p_{nt}, I_{z}, l, t)\) as the background image may seem strange, as it is introduced in Eq. (2) to train the non-targeted attack patch \(p_{nt}\). However, this approach is reasonable once we note that \(p_{nt}\) probably contains information about the training data, even though \(p_{nt}\) does not belong to the training data. We seek an effective way to use the information implied in the non-targeted patch \(p_{nt}\) (i.e. optimize Eq. (4)), leading the image \(A(p_{t}, I_{nt}, l, t)\) to be evaluated as the target label \({\widehat{y}}\). Finally, covered by the targeted patch \(p_{t}\), the adversarial example \(A(p_{t}, x, l, t)\) is misclassified as \({\widehat{y}}\) by the network. It is worth noting that during the entire optimization process, we do not use information about the training data at all. The entire process is detailed in Algorithm 2.

4.2 Experiments and results

We experiment with the same settings as in Sect. 3.2. To evaluate the performance of DiAP for targeted attack, we introduce GoogleAP for comparison. We also inspect two variants of DiAP to verify the effect of the background image \(I_{nt}\): DiAP_Iz adopts the image \(I_z\) instead of \(I_{nt}\) as the background image in Eq. (4), and DiAP_n replaces \(I_{nt}\) with random noise. During training, we use the gradient descent optimizer to craft adversarial patches, and set the hyper-parameters \(Ite_{max}=800 \) and \( \eta =8 \) for Algorithm 2.

Figure 6 shows the attack success rates of targeted adversarial patches with toaster as the target class. The test images are the same as those used in Sect. 3.2. In the whitebox single model attack, GoogleAP’s attack success rates are higher than those of DiAP. Nevertheless, when the patch takes 10% of the image size, DiAP still achieves an attack success rate of about 80%. In the whitebox ensemble attack, the performance of GoogleAP and DiAP is very close. Further, in the blackbox scenario, DiAP performs even better than GoogleAP. Given that DiAP is completely unaware of the training data while GoogleAP uses training examples for optimization, this result seems counterintuitive. We presume that, without knowing the structure of the attacked model (blackbox scenario), the vague information about the training data provided by \(I_{nt}\) makes DiAP generalize better, resulting in higher attack success rates than GoogleAP. This indicates that when the attacker is not acquainted with the attacked model, fuzzy information about the training data may be more beneficial for generating attack patches than accurate example knowledge. In addition, DiAP is far more effective than using \(I_{z}\) or random noise as the background image, as shown in Fig. 6 by the relatively poor performance of DiAP_Iz and DiAP_n.

Fig. 7 Adversarial patches for targeted attack. a, b and c are generated by DiAP for the whitebox single model attack, the whitebox ensemble attack, and the blackbox attack, respectively. d is crafted by DiAP_Iz for the blackbox attack. e is computed by DiAP_n for the blackbox attack. The target model is VGG-16 and the target label is toaster. f is a photo of a real toaster, which is used in the experiments shown in Fig. 6

Fig. 8 Attack success rates of targeted adversarial patches generated by DiAP. Eight target labels are randomly selected

Furthermore, we examine whether real photographs of the target class could fool the deep neural network in the same way as DiAP. A real toaster image is digitally inserted into the testing images in the same way as the adversarial patch. As shown in Fig. 6, to achieve a 90% attack success rate, the real toaster image has to cover about 50% of the testing image. However, in that case, the adversarial image essentially becomes a picture of a toaster.

Figure 7 shows some patches that DiAP generates for targeted attacks. These patches are very similar to the target class (a real toaster), although DiAP does not know what a toaster looks like, since it is data independent and isolated from real examples during training. We presume that DiAP learns high-level features of the target category, making the adversarial patch more toaster-like than the real toaster picture from the perspective of the deep neural networks.

We randomly select 8 other categories as target labels (e.g. banana, jellyfish), and the experimental results are shown in Fig. 8. The adversarial patches are generated by DiAP. When the patch occupies 20% of the input image area, for all target classes, the attack success rates of both whitebox settings exceed 80%, and that of the blackbox scenario exceeds 60%. Moreover, the attack success rate varies with the target class. Taking the whitebox single model attack with a scaling parameter of 10% as an example, the lowest success rate is 47.35% and the highest is 78.56% across the target classes. However, there is no strict correspondence between the attack success rate and the target class across the three attack settings. For example, in the blackbox scenario, the attack patches generated with the target category broom achieve higher attack success rates than other target categories, while in the whitebox ensemble attack, the attack patches generated with the target category banana perform better.

Fig. 9 Eight adversarial patches crafted by DiAP for targeted attack. All these patches are generated for the whitebox single model attack; the target model is VGG-16 and the target labels are listed below them. These patches look similar to their target classes

Fig. 10 Attack success rates in the blackbox attack scenario. The X-axis represents the 8 target categories. The Y-axis indicates the attack success rate against the target class when the adversarial patch (or real image) covers 10% of the test image area

Figure 9 shows patches generated by DiAP. To some extent, these patches share many similarities with their target classes. For example, patch (b) appears to be composed of many small bananas, patch (g) contains the eyes and nose of a Pembroke, and patches (c), (d), and (f) even seem to be distorted images of their target categories. This further confirms our inference that the adversarial patch actually extracts the high-level features of the target category.

The performance of DiAP in the blackbox scenario is particularly noteworthy: it means that malicious attackers can attack deep neural networks without knowing the model structure, pre-trained parameters, or training dataset. Figure 10 shows the attack success rates in the blackbox attack scenario when the patch occupies 10% of the test image area. In most cases, DiAP outperforms GoogleAP, and DiAP achieves an average attack success rate of 44.95% compared to GoogleAP’s 41.15%. This suggests that patches generated by DiAP have better transferability and are more suitable for blackbox attacks. For comparison, 8 real images are also randomly selected, one for each target class. The attack success rates of the real pictures are all less than 10%, with an average of only 5.25%. These results confirm that our adversarial patches, rather than real images of the target label, can attack deep neural networks.

Fig. 11 A real-world attack example. The targeted patch is generated by the whitebox ensemble attack and the test model is the pre-trained Inception-V3 mentioned in Sect. 3.2. The photographs on the left are correctly classified. However, after the targeted patch is added, the photographs on the right are all classified as toaster, in spite of the varying scale, location, and angle of the patch

5 Physical world attacks

In this section, a physical world attack experiment is conducted to validate the practical effectiveness of DiAP. We use a targeted patch (shown in Fig. 7b) generated by DiAP with the whitebox ensemble method. The test classifier is the pre-trained Inception-V3 model mentioned in Sect. 3.2. The attack is considered successful if the pre-trained Inception-V3 model correctly identifies the original image and misclassifies the adversarial example as the specified label. After printing this patch on a Canon IP8780 printer, we place it in a variety of real world scenes and take photographs at various combinations of distance and angle using a MI 8 phone. The images and results shown in Fig. 11 demonstrate that the attack successfully fools the network, regardless of the scale, location, and angle of the patch.

6 Conclusion

In this paper, we presented a data independent approach to generating adversarial patches. By contaminating the extracted features at each layer of the attacked network, DiAP generates non-targeted adversarial patches. Then, combined with the information implied by the non-targeted patch, DiAP extracts the features of the target class to craft targeted adversarial patches. In the process of generating patches, DiAP does not use any training data at all, which is conducive to performing attacks in real physical scenarios. Extensive experimental results, under whitebox and blackbox settings in both the digital and physical world, demonstrate that DiAP has strong attack capabilities and achieves state-of-the-art performance. Further, DiAP outperforms the state-of-the-art method in blackbox attacks, which implies that when the network structure is unknown, fuzzy example information may be more helpful for an attack than real examples. To encourage reproducible research, the code of DiAP is made available at https://github.com/zhouxy2020/DiAP.