
1 Introduction

1.1 Background

Due to recent breakthroughs in deep learning, Deep Neural Networks (DNNs) have achieved state-of-the-art classification performance on various tasks. However, it has also been shown that classification models can be easily affected by adversarial examples [4, 5, 7, 8, 15, 28, 30, 32], i.e., malicious inputs in which small perturbations are added to benign inputs in order to fool the classifiers. Adversarial attacks can cause serious security problems because DNNs are deployed in a wide range of real-world applications. For example, Deng et al. [12] analyze adversarial attacks on driving models and show that these regression models are also very vulnerable to adversarial attacks. Sharif et al. [35] show that it is possible to impersonate another individual by adding adversarially crafted glasses to a face image, in a manner similar to Adversarial Patch [6]. Therefore, in order to design robust models, it is necessary to investigate the potential risks and identify the vulnerabilities of deep learning models; adversarial attacks are thus an important research topic.

If an adversarial example of an image x exists, attacking a classifier becomes a search problem within a small volume around the benign image x. Recently, several algorithms have been proposed to generate adversarial examples; these methods can be categorized along several axes.

  • Threat Model: One of the key differences among adversarial attacks is the knowledge available to the attacker, and there are two primary settings: white-box and black-box. In the white-box setting [7, 15, 28, 32], the attacker is assumed to have full knowledge of the target model. The main idea for generating adversarial examples in this setting is to apply a perturbation in the direction of the gradient of the loss w.r.t. the input x. In reality, however, an attacker is likely to have access to only a limited amount of information. In the black-box setting [4, 5, 8, 23, 30], the attacker is only allowed query access to the target model. This corresponds to an attack on a web service built on a pre-trained classifier (e.g., Google Cloud Vision API [2], IBM Watson Visual Recognition [3], Amazon Rekognition [1]). In this setting, the attacker must compute a perturbation only from the output obtained by querying the model, which makes it a more difficult setting. The main strategies for generating adversarial examples in the black-box setting are described in Sect. 2.

  • Adversarial Goal: Another important distinction is whether the attacker aims to misclassify the input x to any class other than the true class y (untargeted), or to force the classification result to a specific target class \(t(\ne y)\) (targeted). Targeted attacks, especially on classifiers with a large number of classes, are considerably more difficult.

  • Distance Metric: Adversarial examples are inputs with slight perturbations that are carefully crafted to cause the classifier to misclassify them. The \(\ell _p\)-distance between adversarial and benign examples, with \(p\in \{0,2,\infty \}\), is commonly used as the metric; a small sketch of these norms and the \(\ell _\infty \) constraint is given after this list.
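As a concrete illustration (a minimal NumPy sketch of standard definitions, not code from this paper), the following computes the three commonly used \(\ell _p\)-norms of a perturbation and clips a candidate adversarial image to an \(\ell _\infty \)-ball of radius \(\epsilon \):

```python
import numpy as np

def lp_norms(delta):
    """Return the l0, l2, and l_inf norms of a perturbation tensor."""
    flat = delta.ravel()
    l0 = np.count_nonzero(flat)          # number of changed coordinates
    l2 = np.linalg.norm(flat, ord=2)     # Euclidean length of the perturbation
    linf = np.max(np.abs(flat))          # largest per-pixel change
    return l0, l2, linf

def project_linf(x, x_adv, eps):
    """Clip x_adv so that ||x_adv - x||_inf <= eps and pixels stay in [0, 1]."""
    x_adv = np.clip(x_adv, x - eps, x + eps)
    return np.clip(x_adv, 0.0, 1.0)

x = np.random.rand(32, 32, 3)                                   # a benign image in [0, 1]
x_adv = project_linf(x, x + 0.1 * np.random.randn(32, 32, 3), eps=8 / 255)
print(lp_norms(x_adv - x))
```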

We focus on score-based black-box adversarial attacks. Existing query-based black-box attack methods have already achieved high attack success rates, and the main effort now focuses on reducing the number of queries. Attacks with low query counts, i.e., methods with better query efficiency, can save attackers a great deal of cost in both time and money. For example, the Google Cloud Vision API [2] limits the number of requests per minute to 1,800. Query-efficient attack methods are also effective at deceiving systems [10] that flag the submission of many similar queries in a short time as fraudulent, which is one of our motivations.

Fig. 1. Sample images of each random pattern noise (RPN) in our sampling space (from left to right: vertical-wise, horizontal-wise, uniform, and diagonal-wise local uniform noise).

1.2 Our Contribution

In this paper, we propose a simple but effective hyperparameter-free score-based black-box \(\ell _\infty \)-adversarial attack in computer vision. The core technique of our approach is to exploit the susceptibility of Convolutional Neural Networks (CNNs) to noise with regional homogeneity [24, 46]; specifically, we construct adversarial perturbations by combining patterned noises such as vertical-wise and horizontal-wise ones (see Fig. 1). This idea is incorporated into an iterative random search method that sequentially updates the perturbation. In a pre-specified non-orthogonal search direction, we modify the perturbation with randomly selected local uniform noises, check whether it moves towards or away from the decision boundary using a confidence score, and repeat the perturbation update. With each update, the image moves further away from the original image and towards the decision boundary.

In Sect. 4, we conduct comparative experiments with several existing \(\ell _\infty \)-attacks using naturally and adversarially trained models and input-transformation-based defense methods.

In the experiments on naturally trained models in Sect. 4.1, we use the CIFAR-10 and ImageNet datasets to perform comparative experiments with Parsimonious, SignHunter, and Square Attack. We show that our method achieves high attack success rates in both untargeted and targeted attack settings, especially under low query budgets. Specifically, in the untargeted attack on CIFAR-10, our method achieves 1.8 times higher average query efficiency than Square Attack while also achieving a higher attack success rate. In the untargeted attack on ImageNet, our method also achieves 1.4 times higher average query efficiency than Square Attack.

In Sect. 4.2, we evaluate our method against several defensive models based on adversarial training that classify the MNIST and CIFAR-10 datasets. On the benchmark models of Madry et al. and TRADES on MNIST, our method achieves higher attack success rates than the other black-box methods. However, on the Clean Logit Pairing (CLP) and Logit Squeezing (LSQ) models, the results of our method are inferior to those of the other black-box attacks, especially in terms of attack success rate. From these results we clarify the effect of local uniform noise on each defensive model.

In Sect. 4.3, we show attacks against several input-transformation-based defense methods that adopt the naturally trained models classifying CIFAR-10 and ImageNet as backbones. Our method achieves an attack success rate of over \(90\%\) on CIFAR-10 and over \(70\%\) on ImageNet, despite relatively small query budgets. Therefore, our method maintains a high attack success rate with or without the protection of defense methods.

Overall, our method achieves high attack performance on a wide range of target models in a hyperparameter-free manner, making it a realistic method for attackers. We also observe that our method suffers from gradient masking, while our definition of local uniform noise converges well on defensive models that do not rely on gradient masking. Finally, in Sect. 4.4, we experimentally verify the effectiveness of our definition of local uniform noises and show that all of them contribute to the attack performance.

2 Related Work

Adversarial attacks in the black-box setting come in a few different flavors. This section describes the differences between these settings and the main strategies, and then clarifies our contribution by comparison with them.

2.1 Transfer-Based Black-Box Attacks

Most existing adversarial attacks assume the white-box setting, where the attacker has full access to the model architecture and the ability to perform backpropagation to obtain gradient information. On the other hand, white-box attacks can be turned into pseudo-black-box attacks by exploiting transferability [38]; these are called transfer-based black-box attacks. Transferability is the property that adversarial examples generated for one classifier can also fool other classifiers of the same type. Papernot et al. [31] proposed a method to learn a surrogate model by querying the target model. By using a surrogate model whose decision boundaries are similar to those of the target model, they can simulate a white-box adversarial attack [15, 32]. However, transfer-based attacks have some problems. First, although transfer-based attacks are theoretically possible in a decision-based setting, they often require carefully designed surrogate models, or even many queries to extract the target model. Second, the generated adversarial examples do not always transfer well [36]. Recent studies have also proposed input transformation methods [13, 25, 42] to improve the transferability of adversarial examples and have demonstrated black-box attack performance. Although such a method [25] achieves particularly high transferability, these studies ignore the task of extracting models and only report attack success rates between specific network architectures.

2.2 Score-Based Black-Box Attacks

In score-based black-box attacks, the attacker can obtain the predicted probabilities for each class by querying the target model with inputs. The attacker solves an optimization problem to compute the adversarial perturbation while directly observing the output of the target model.

Gradient Estimation Based Methods. The ZOO method, proposed by Chen et al. [9], generates adversarial examples by estimating the gradient of the classifier using a coordinate-wise finite difference method. AutoZOOM, a modified version of ZOO proposed by Tu et al. [39], uses random gradient estimation and dimensionality reduction techniques to significantly improve query efficiency while maintaining attack performance. However, it still requires an enormous number of queries to the target model (13,525 queries on average for the targeted attack on ImageNet). Hence, gradient estimation-based methods are considerably less efficient, especially for models with high-dimensional inputs.
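To make the idea concrete, the following is a minimal sketch of coordinate-wise central finite-difference gradient estimation (our own illustration, not the ZOO or AutoZOOM implementation); `loss_fn` is a hypothetical query oracle returning the loss of a perturbed image. Each estimated coordinate costs two queries, which is why this family of methods becomes expensive for high-dimensional inputs:

```python
import numpy as np

def estimate_gradient(loss_fn, x, h=1e-4, num_coords=None):
    """Estimate dL/dx by central finite differences, one coordinate at a time."""
    flat = x.ravel().astype(np.float64)
    grad = np.zeros_like(flat)
    coords = (range(flat.size) if num_coords is None
              else np.random.choice(flat.size, num_coords, replace=False))
    for i in coords:
        e = np.zeros_like(flat)
        e[i] = h
        grad[i] = (loss_fn((flat + e).reshape(x.shape)) -
                   loss_fn((flat - e).reshape(x.shape))) / (2 * h)   # two queries
    return grad.reshape(x.shape)

# Toy usage with a quadratic "loss" standing in for the black-box model.
x = np.random.rand(4, 4)
g = estimate_gradient(lambda z: float(np.sum(z ** 2)), x, num_coords=8)
```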

Gradient-Free Methods. The Parsimonious Attack proposed by Moon et al. [30] solves a discrete optimization problem with local search and a greedy algorithm. On a perturbation divided into a set of \(n^2\) square tiles, Parsimonious finds the sign of each tile by local search and then uses the greedy algorithm to find a better solution. The SignHunter Attack proposed by Al-Dujaili et al. [4] sequentially estimates the sign of the gradient in \(1/2^n\) regions of the perturbation in a deterministic order. Several attack methods, including these [4, 29, 30], reduce the dimensionality of the search space by modifying neighboring pixels of the perturbation at once, making the computation more efficient. Andriushchenko et al. proposed the Square Attack [5], which achieved state-of-the-art attack success rates and query efficiency. Square Attack solves the optimization problem by random search, directly updating the perturbation with randomly generated square-shaped noise, as opposed to methods that flip the sign of the perturbation, such as Parsimonious and SignHunter. DeepSearch, proposed by Zhang et al. [47], generates adversarial examples close to the original images by reducing the \(\ell _\infty \) distance of the perturbation while using a hierarchical grouping strategy similar to Parsimonious. However, we do not compare our method with DeepSearch because its attack success rate and query efficiency are not high (similar to those of Parsimonious), although the \(\ell _\infty \) distance of the perturbation generated by DeepSearch is small.

On the other hand, several studies have improved query-based attacks in which the attacker generates adversarial examples in a combined transfer-based and query-based manner, using a surrogate white-box model that is either pre-trained or trained by the attackers themselves. The Subspace Attack by Guo et al. [18] uses the gradient of the surrogate model as a heuristic search direction for finite-difference gradient estimation. Huang et al. proposed TREMBA [19], which learns an embedding space that can generate adversarial perturbations for a surrogate model and significantly reduces queries compared to NES and AutoZOOM. Feng et al. [14] improved the transfer performance from the surrogate model to the target model. Their proposed \(\mathcal{C}\mathcal{G}\)-Attack is robust to biases between the surrogate model and the target model by transferring partial parameters of the adversarial distribution of the surrogate model while learning the untransferred parameters based on queries to the target model. SWITCH, proposed by Ma et al. [27], keeps selecting loss-maximizing perturbations whenever the images perturbed by gradients from a surrogate model do not satisfy the optimization objective. Yatsura et al. proposed a meta-learning method [45] to be used in combination with random-search-based attacks. Their learned controller improves attack performance by adjusting the parameters of the proposal distribution online at each iterate during the attack. However, as explained in Sect. 2.1, we do not compare our method to these methods since the attacker needs to construct a surrogate model in advance and the computational cost is high.

2.3 Defense Methods

As adversarial attacks become more prevalent, many recent studies have also focused on building defense models against them. There are several lines of research in the literature, and the defense methods roughly fall into two groups: input-transformation-based defense methods and adversarial training.

Input-transformation-based defense methods include denoising, input randomization, and input transformation. These methods attempt to mitigate the effects of adversarial perturbations by applying image-processing-like changes to the input image. Specifically, the denoising methods include low-pass filtering [34] and autoencoders [16], which attempt to remove adversarial perturbations from adversarial examples. The input randomization methods, including resizing and padding [41], and the input transformation methods, including JPEG compression [17, 26], attempt to mitigate the effect of adversarial perturbations.
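As a simple illustration of this class of defenses (a minimal sketch in the spirit of bit-depth reduction, not taken from any of the cited papers), an input transformation can be as small as quantizing pixel values before classification, in the hope that small adversarial perturbations fall below the quantization step; `model_fn` is a hypothetical classifier:

```python
import numpy as np

def bit_depth_reduction(x, bits=3):
    """Quantize an image in [0, 1] to 2**bits levels before feeding the model."""
    levels = 2 ** bits - 1
    return np.round(x * levels) / levels

def defended_predict(model_fn, x, bits=3):
    # model_fn returns class scores; the transformation is applied to every input.
    return model_fn(bit_depth_reduction(x, bits))
```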

On the other hand, adversarial training [21, 28, 48] aims to obtain robustness by training the model with adversarial examples, which is a more costly but more effective approach than image-processing defenses. In general, adversarially trained defenses are known to be more robust than other defenses on MNIST and CIFAR-10. Furthermore, Madry et al. [28] show that PGD [28] is a universal first-order adversarial attack, which means that adversarial training with PGD-generated adversarial examples is resistant to many other first-order attacks. PGD-generated adversarial examples are the basis of many adversarially trained models, including [21, 28, 48]. The model of Madry et al. [28] provides robust adversarial training via min-max optimization. TRADES [48] focuses on the trade-off between robust error and natural error and trains to improve both. Adversarial Logit Pairing [21] learns by matching the logits of a benign image with the corresponding logits of its adversarial examples, thereby encouraging them to be similar to each other.
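For reference, the standard min-max objective of adversarial training [28] can be written as follows, where \(f_{\theta }\) is the classifier with parameters \(\theta \), L the training loss, \(\mathcal {D}\) the data distribution, and \(\epsilon \) the perturbation bound (this notation is introduced here only for illustration):

$$\begin{aligned} \min _{\theta }\ \mathbb {E}_{(x,y)\sim \mathcal {D}}\Big [\max _{\Vert \delta \Vert _\infty \le \epsilon } L(f_{\theta }(x+\delta ),y)\Big ]. \end{aligned}$$

The inner maximization is what PGD approximates when generating the training-time adversarial examples.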

2.4 Differences Among Other Black-Box Methods and Our Method

We now discuss the existing methods presented in Sect. 2.2 in more detail and clarify the differences between them and our method. First, regarding the optimization of the perturbation, Parsimonious [30] issues many wasted queries, partly because it uses local search. SignHunter [4] performs a deterministic search and can guarantee its attack success rate for a given number of queries, but it is not very efficient. Since the convergence of the iterative random search used in Square Attack [5] is much faster than that of Parsimonious and SignHunter, an iterative random search is also used in our method.

As for the components of the perturbation, the perturbations of Parsimonious and SignHunter consist of uniform noise over specific segmented regions (square or rectangular shapes), while the perturbation of Square Attack consists of a vertical-wise initialization plus uniform noise over squares of a certain size. In contrast, our method places not only square-shaped but also vertical-wise, horizontal-wise, and diagonal-wise uniform noise on the segmented square areas of the image. Furthermore, while Parsimonious and Square Attack have hyperparameters that need to be tuned depending on the attack setting and the target model, our method does not need any hyperparameters. This is a great advantage in black-box attacks because our method can be easily applied in any setting.

Fig. 2. An example of a sequence of adversarial perturbations on ImageNet generated by our method at each iterate. The left column shows the adversarial perturbation and the adversarial example for the first query (the attack has not yet succeeded at this point), and the right column shows those for the 165th query, where the attack was successful (the class changed from altar to vault). In addition, the transition of adversarial perturbations after the first query is shown between them. The red boxes indicate the block range b determined by the \(\textsf{SplitBlock}\) function, i.e., the region where the noise is modified at each iterate. In the second query, we change the perturbation with a randomly picked RPN for a \(1 \times 1\) region, i.e., the entire image region, and if the loss is lowered, we update the perturbation accordingly. In the third to sixth queries, the search is performed in \(2 \times 2\) regions. After that, the perturbation update process is repeated while gradually increasing the number of segmented regions. (Color figure online)

3 Our Methods

In this section, we first recall the definition of the threat model in adversarial attacks and describe an optimization framework for finding adversarial perturbations against classification models. Then, we describe our black-box \(\ell _\infty \)-adversarial attack using random pattern noises and random search.

3.1 Optimization Framework

Formally, we define a classifier \(f:X\rightarrow \mathbb {R}^K\), where \(x \in X\) is the input image, \(Y = \{1,2,\ldots ,K\}\) is the output space, and f(x) denotes the predicted score of each class in Y. In the untargeted setting, the goal of the attacker is to find a perturbation \(\delta \) such that the adversarial example \((x+\delta )\) is misclassified to a class other than the true class y, i.e., \(\mathop {\mathrm {arg~max}}\limits _{k \in Y}f_k(x+\delta ) \ne y\). Additionally, the attacker also seeks to keep the \(\ell _p\) distance small, i.e.,

$$\begin{aligned} \mathop {\mathrm {arg~max}}\limits _{k\in Y}\ f_k(x+\delta )\ne y \quad \text { s.t.} \quad \Vert \delta \Vert _p \le \epsilon \ \text {and}\ (x+\delta )\in X, \end{aligned}$$
(1)

where \(\Vert \cdot \Vert _p\) is the \(\ell _p\)-norm and \(\epsilon \) is the radius of the \(\ell _p\)-ball. The task of finding a perturbation \(\delta \) can be handled as a constrained optimization problem. Therefore, \(\ell _p\)-bounded untargeted attacks aim at optimizing the following objective:

$$\begin{aligned} \min _{\delta :\Vert \delta \Vert _p \le \epsilon }\ L(f(x+\delta ),y) \end{aligned}$$
(2)

where L is a loss function (typically the cross-entropy loss) and y is the true label of x. Equation 2 essentially works to minimize the score of label y. We also study the adversary in the targeted setting. In the targeted setting, the attacker aims for \(\mathop {\mathrm {arg~max}}\limits _{k \in Y}f_k(x+\delta ) = t\) for a target label \(t(\ne y)\) chosen from Y, and optimizes the perturbation by minimizing the loss \(L(f(x+\delta ),t)\). A black-box targeted attack on a network with many output classes (large K) is a rather difficult task.
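Written in the same form as Eq. 2, the targeted objective becomes

$$\begin{aligned} \min _{\delta :\Vert \delta \Vert _p \le \epsilon }\ L(f(x+\delta ),t), \end{aligned}$$

so that decreasing the loss raises the score of the target class t rather than lowering the score of the true class y.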

3.2 Algorithm

In this section, we present our black-box \(\ell _\infty \)-attack. We assume that the attacker has an image \(x \in X\) and query access to a black-box classifier f. The output f(x) gives the predicted probabilities over the K classes w.r.t. the input image x. In the untargeted setting, our goal is to find a perturbation \(\delta \in \{-\epsilon ,\epsilon \}^d\) such that \(\mathop {\mathrm {arg~max}}\limits \ f(x+\delta ) \ne y\) under the \(\ell _\infty \)-perturbation constraint, where \(\epsilon \in \mathbb {R}^+\) is the radius of the \(\ell _\infty \)-ball. Our method is based on random search [33], a well-known iterative technique in optimization. Applied to adversarial attacks, it acts as a sequence of updates of the perturbation: if the loss value \(L(f(x+\delta ^*),y)\) of the perturbed image \((x+\delta ^*)\) with the updated perturbation \(\delta ^*\) is lower than the previous loss value \(L(f(x+\delta ),y)\), the update is adopted as the current perturbation; otherwise it is discarded.

The core technique of our approach is that the perturbation is composed of noises with regional homogeneity. Several studies [24, 46] show the vulnerability of CNNs to locally uniform noise. In particular, Li et al. [24] investigate how effective locally homogeneous noise is against models that defend against adversarial attacks. They find that adversarial perturbations crafted for defensive models exhibit more homogeneous patterns than those crafted for naturally trained models. We therefore investigate whether locally homogeneous noises can be applied to generate adversarial examples (note that [24] aims to generate universal adversarial perturbations, i.e., perturbations that deceive arbitrary images, which is a different objective from ours and thus not comparable). Specifically, our method constructs perturbations with four patterned noises: vertical-wise, horizontal-wise, uniform, and diagonal-wise (henceforth collectively referred to as random pattern noise, RPN). This is a major difference from Square Attack [5], which updates perturbations only with uniform noise in the form of squares.


Algorithmic Scheme with Random Search. Our proposed schemes are presented in Algorithms 1, 2 and 3. First, we set the initial perturbation to the vertical-wise one; vertical-wise initialization is a technique used in [5]. Then, we obtain the current loss by querying the perturbed image \((x+\delta )\). Since we are interested in query efficiency, the algorithm stops as soon as an adversarial perturbation is found; hence the process terminates if the attack already succeeds at the first query (step 3 in Algorithm 1). After that, we determine the set of block areas to be modified using the \(\textsf{SplitBlock}\) algorithm in Algorithm 3. In the random search loop, the algorithm first picks a block area b and obtains a new perturbation \(\delta ^*\), updated for that area through \(\textsf{RPNSampling}\) in Algorithm 2. Then, an adversarial example \(x_{adv}\) is generated by adding the perturbation to the benign image. Note that all perturbed images are clipped to the domain \([0,1]^d\). If the loss corresponding to the perturbed image \((x+\delta ^*)\) with the updated perturbation is lower than the current loss, the change is applied. The process is performed at most N times (the maximum number of iterations), and the attack fails if we cannot find an adversarial perturbation within N iterations. Figure 2 shows a sequence of candidate adversarial examples generated by our method at each iterate. A candidate is generated at each iterate, and the perturbation is updated if its loss is lower than the previous one.
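The following is a minimal sketch of this scheme (our own illustration based on the description above, not the authors' implementation). Here `loss_fn` is a hypothetical query oracle returning the loss of an image, while `blocks` and `propose` stand in for the \(\textsf{SplitBlock}\) and \(\textsf{RPNSampling}\) routines sketched in the next paragraphs:

```python
import numpy as np

def random_search_attack(loss_fn, x, eps, blocks, propose, max_iters=10_000):
    """Minimal random-search loop in the spirit of Algorithm 1 (illustrative only)."""
    h, w, c = x.shape
    # Vertical-wise initialization: every column gets a random sign times eps.
    delta = np.repeat(np.random.choice([-eps, eps], size=(1, w, c)), h, axis=0)
    best_loss = loss_fn(np.clip(x + delta, 0.0, 1.0))
    for _, b in zip(range(max_iters), blocks):
        cand = propose(delta, b, eps)              # candidate perturbation for block b
        loss = loss_fn(np.clip(x + cand, 0.0, 1.0))
        if loss < best_loss:                       # keep the change only if the loss drops
            delta, best_loss = cand, loss
        if best_loss < 0:                          # example success test (e.g., margin loss < 0)
            break
    return np.clip(x + delta, 0.0, 1.0)
```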

RPN Sampling. Our \(\textsf{RPNSampling}\) algorithm, presented in Algorithm 2, returns a new perturbation \(\delta ^*\) updated within a given block area b. As the variations of RPN, we focus on vertical-wise, horizontal-wise, uniform, and diagonal-wise perturbations; samples of each RPN are shown in Fig. 1. In this algorithm, one of the four RPNs \(\delta _{vert}, \delta _{horiz}, \delta _{uni}, \delta _{diag} \in \{-\epsilon , \epsilon \}^d\), each randomly generated anew every time, is sampled as \(\gamma \). The algorithm then changes the new perturbation \(\delta ^*\) to \(\gamma \) only within the region of block area b. As can be observed in Fig. 2, one of the randomly generated RPNs is picked at each iterate, and the perturbation is changed to that RPN only in certain regions. The effectiveness of each RPN is experimentally verified in Sect. 4.4.
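One possible way to generate the four patterns and splice one of them into a block is sketched below; this is our own interpretation of the patterns in Fig. 1, and the exact construction in Algorithm 2 may differ:

```python
import numpy as np

def random_pattern_noise(h, w, c, eps):
    """Generate one of the four RPNs on an h x w x c canvas with values in {-eps, eps}."""
    kind = np.random.choice(["vertical", "horizontal", "uniform", "diagonal"])
    if kind == "vertical":       # same sign down each column
        signs = np.random.choice([-1, 1], size=(1, w, c)).repeat(h, axis=0)
    elif kind == "horizontal":   # same sign along each row
        signs = np.random.choice([-1, 1], size=(h, 1, c)).repeat(w, axis=1)
    elif kind == "uniform":      # a single sign for the whole region (per channel)
        signs = np.random.choice([-1, 1], size=(1, 1, c)).repeat(h, 0).repeat(w, 1)
    else:                        # diagonal: same sign along each anti-diagonal
        d = np.random.choice([-1, 1], size=h + w - 1)
        idx = np.add.outer(np.arange(h), np.arange(w))   # diagonal index of each pixel
        signs = d[idx][..., None].repeat(c, axis=-1)
    return eps * signs

def rpn_sampling(delta, block, eps):
    """Return a copy of delta with a freshly sampled RPN pasted into block (r0, r1, c0, c1)."""
    r0, r1, c0, c1 = block
    new = delta.copy()
    new[r0:r1, c0:c1, :] = random_pattern_noise(r1 - r0, c1 - c0, delta.shape[2], eps)
    return new
```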

Table 1. Results of both untargeted and targeted attacks on Madry et al.’s naturally trained model [28] classifying CIFAR-10. We set the norm bound \(\epsilon _\infty =0.031\) and a limit of queries to 10 k.

Split Block. The \(\textsf{SplitBlock}\) algorithm shown in Algorithm 3 returns the set of block areas to be modified in the perturbation. The purpose of this function is to define a low-dimensional search space for the perturbation. In general, the input space of a deep learning classifier is very high-dimensional, so optimization in the high-dimensional domain requires a very large number of queries and is inefficient. The optimization can be done efficiently by narrowing down the search space, changing some regions at a time as a group. Such dimensionality reduction techniques are used in many existing methods [4, 29, 30]; the main difference lies in the number of region partitions.

Given the image size w, \(\textsf{SplitBlock}\) equally divides the perturbation into \(n^2\ (n\in \{1,\ldots ,w\})\) square regions. Each divided area, together with its coordinates, is then stored in the set in order of decreasing region size.

While Square Attack [5] updates the perturbation by randomly selecting a square region of size \(s \times s\ (s<w)\) from the \(h \times w\) image, our method updates the perturbation regularly over each of the \(n^2\) equally divided square regions; after updating all \(n^2\) regions, we move on to search over \((n+1)^2\) regions. This can be observed in Fig. 2. In the experiments, we show that the non-orthogonal search direction and the \(n^2\) partitioning change a wider area under low query budgets, which is a factor behind the high query efficiency.
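Under the same assumptions as the earlier sketches, the block enumeration could be implemented as a generator of coordinate tuples (again an illustration, not Algorithm 3 verbatim):

```python
import numpy as np

def split_block(w):
    """Yield block areas (r0, r1, c0, c1) for n = 1, 2, ..., w partitions per side.

    For each n the image is divided into n x n roughly equal square regions,
    so early iterations modify large areas and later ones refine small ones.
    """
    for n in range(1, w + 1):
        edges = np.linspace(0, w, n + 1, dtype=int)   # n+1 cut points along each axis
        for i in range(n):
            for j in range(n):
                yield edges[i], edges[i + 1], edges[j], edges[j + 1]
```

In this sketch, `split_block(w)` can be passed as `blocks` and `rpn_sampling` as `propose` to the random-search loop shown earlier; the first block covers the whole image, followed by the four quadrants, and so on.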

4 Experiments

In this section, we evaluate our method by comparing it with other \(\ell _\infty \)-attack methods: Parsimonious [30], SignHunter [4], and Square Attack [5]. We consider the \(\ell _{\infty }\)-threat model, execute attacks in both untargeted and targeted settings, and quantify the performance in terms of attack success rate, average queries, and median queries. The attack success rate is the proportion of adversarial images that successfully fool the model. The average and median queries are the mean and median number of queries over the successfully attacked images.
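For clarity, a minimal sketch of how these three metrics are computed from per-image results (our illustration; the data layout is hypothetical):

```python
import numpy as np

def summarize(results):
    """results: list of (success: bool, queries: int) tuples, one per attacked image."""
    queries = np.array([q for ok, q in results if ok])
    success_rate = len(queries) / len(results)        # fraction of fooled images
    return success_rate, queries.mean(), np.median(queries)
```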

In Sect. 4.1, we show results on naturally trained models, i.e., models that are not hardened against adversarial attacks. In Sects. 4.2 and 4.3, we show results on adversarially trained robust models and on models with input-transformation-based defenses. In Sect. 4.4, we further evaluate our method with an ablation study; specifically, we experimentally investigate how much each of our defined RPNs contributes to the attack performance.

Fig. 3. Cumulative distribution of the number of queries required for untargeted attacks on CIFAR-10.

Fig. 4. Cumulative distribution of the number of queries required for targeted attacks on CIFAR-10.

4.1 Experiments on Naturally Trained Models

Datasets and Target Models. We evaluate our method on the CIFAR-10 [22] and ImageNet [11] datasets. CIFAR-10 consists of \(32\times 32\times 3\)-dimensional images in 10 classes. For CIFAR-10, we randomly choose 1,000 images from the test set for evaluation, all of which are initially correctly recognized by the target model. ImageNet has 1,000 classes. Since the image size of the ImageNet dataset is not fixed, we re-scale the images to \(299\times 299\times 3\) (the default input size of the Inception-v3 model explained below). For ImageNet, we randomly choose 1,000 images covering the 1,000 categories from the ILSVRC 2012 validation set, all of which are initially correctly recognized by the target model. All images are normalized to the [0, 1] scale, and for all experiments we clip the perturbed image into the input domain \([0, 1]^d\) for all algorithms by default.

For the experiments on CIFAR-10, we use Madry et al.'s naturally trained model [28]; the model architecture and weights are available here.Footnote 1 For the experiments on ImageNet, we use the pre-trained model provided as an application in Keras.Footnote 2 We select the Inception-v3 [37] pre-trained model in our experiments because, as shown in [5], it is more robust against adversarial attacks than some other models for ImageNet.

Table 2. Results of both untargeted and targeted attacks on Inception-v3 classifying ImageNet. We set the norm bound \(\epsilon _\infty =0.031\) and a limit of queries to 10 k.

Method Setting. Following the standard in the literature, we give a budget of 10 k queries per image to find an adversarial perturbation. We set the maximum \(\ell _{\infty }\)-perturbation of the adversarial image to \(\epsilon =0.031\ (\approx 8/255)\) on both CIFAR-10 and ImageNet. Query budgets and the maximum distortion \(\epsilon \) are parameters of the threat model of adversarial attacks, so they are generally not considered hyperparameters. In targeted attacks, we set the target class to \(y_{target} = (y_{true}+1) \bmod K\), where \(y_{true}\) is the true class and K is the number of classes.

Results on CIFAR-10. We show the results in Table 1. Our method achieves the highest attack success rates in both the untargeted and targeted settings. At the same time, it reduces the number of queries required to fool the classifiers compared to the other three methods. Compared to the state-of-the-art method, Square Attack, our method achieves a higher attack success rate, 1.5 to 1.8 times higher average query efficiency, and 1.9 to 2.4 times higher median query efficiency. We also plot the cumulative success rates against the required budget in Figs. 3 and 4. Especially under low query budgets, our method remarkably outperforms the other methods. Additionally, the success rates of Square Attack and our method at the first query indicate the strength of the vertical-wise initialization. As hyperparameters for the comparison methods, we set \(block\ size=4\) and \(batch\ size=64\) for Parsimonious and \(p=0.05\) for Square Attack by default.

Fig. 5. Cumulative distribution of the number of queries required for untargeted attacks on ImageNet.

Fig. 6. Cumulative distribution of the number of queries required for targeted attacks on ImageNet.

Results on ImageNet. The results are presented in Table 2 and Figs. 5 and 6. Although our method does not achieve the highest attack success rate in the targeted setting, it achieves a higher attack success rate and better query efficiency in the untargeted setting. As Figs. 5 and 6 show, our method achieves the highest attack success rate up to \(5^5\) queries in both the untargeted and targeted settings. Additionally, Table 2 shows that more than half of the images are successfully attacked in the untargeted setting within 49 queries, which is about half the median query count of Square Attack. These results indicate the high query efficiency of our method under low query budgets. As hyperparameters for the comparison methods, we set \(block\ size=32\) and \(batch\ size=64\) for Parsimonious and \(p=0.05\) for Square Attack.

Table 3. Results on adversarially trained models of Madry et al. [28], TRADES [48], CLP and LSQ [21] on MNIST, and CLP and LSQ [21] on CIFAR-10. We set the norm bound \(\epsilon _\infty \) and a limit of queries to 0.3 and 10 k respectively for MNIST and \(0.062\ (\approx 16/255)\) and 10 k respectively for CIFAR-10. The percentages in the model column indicate the natural accuracy in the test data for each model.

4.2 Experiments on Adversarially Trained Models

Here we evaluate our method against robust models based on adversarial training.

Datasets and Target Models. We use several robust models classifying the MNIST [44] and CIFAR-10 datasets, the same as in the experiments of Square Attack [5]. MNIST is a \(28\times 28\times 1\)-dimensional grayscale handwritten digit dataset. In the experiments on MNIST, we randomly sample 1,000 images from the test set, all of which are initially correctly recognized by Madry et al.'s naturally trained model [28]. In the experiments on CIFAR-10, we use the same test data as in Sect. 4.1.

In line with the experiments in [5], we use the \(\ell _{\infty }\)-adversarially trained models of Madry et al. [28], TRADES [48], Clean Logit Pairing (CLP) [21] and Logit Squeezing (LSQ) [21] for MNIST, and the \(\ell _{\infty }\)-adversarially trained models of CLP and LSQ for CIFAR-10.

Method Setting. We give a budget of 10 k queries per image to find an adversarial perturbation. We set the maximum \(\ell _{\infty }\)-perturbation of the adversarial image to \(\epsilon =0.3\) on MNIST, and \(\epsilon =0.062\ (\approx 16/255)\) on CIFAR-10. All experiments in this section are done in the untargeted setting.

Results on MNIST. Table 3 shows the results. On Madry et al.'s and the TRADES models, SignHunter achieves better query efficiency but a lower attack success rate on average than the other methods. Comparing the methods with similar attack success rates, our method achieves higher attack success rates and better query efficiency. Although our method does not outperform the other methods on the CLP and LSQ models, it performs best on Madry et al.'s and the TRADES models, whose original robust accuracy is higher; this indicates the potential attack power of our method. As hyperparameters for Parsimonious, we set \(block\ size=4\) and \(batch\ size=64\). As the hyperparameter for Square Attack, we set \(p=0.8\) for Madry et al.'s and TRADES models and \(p=0.3\) for the CLP and LSQ models.

Results on CIFAR-10. The results are shown at the bottom of Table 3. All methods achieve high attack success rates overall, and the difference in attack performance due to the shape of the uniform noise is not as large as for MNIST. On both models, our method achieves the best median query efficiency, although not the best average query count. This suggests that the query efficiency of our method under low query budgets is high. As hyperparameters for the comparison methods, we set \(block\ size=4\) and \(batch\ size=64\) for Parsimonious and \(p=0.3\) for Square Attack.

Table 4. Results on input-transformation-based defenses: Bit-Red [43], JPEG [17], FD [26], and ComDefend [20]. Each defense method adopts the backbones of Madry et al.’s naturally trained model classifying CIFAR-10 and Inception-v3 pre-trained model classifying ImageNet, respectively. We use 50 randomly selected images and set a limit of queries to 200, the norm bound \(\epsilon _\infty \) to 0.031 for CIFAR-10 and 0.062 for ImageNet.

On the Difference in Attack Success Rates on the CLP and LSQ Models. The difference in the attack performance of the methods on the CLP and LSQ models classifying MNIST can be attributed to the form of local uniform noise generated by each method. SignHunter treats the image as a one-dimensional vector and flips the sign over a particular segment, so a rectangular noise pattern appears in the image. On the other hand, Parsimonious and Square Attack build most of their noise from square-shaped uniform noise. The results in Table 3 show a large margin in attack success rate between these two patterns, which suggests that the CLP and LSQ models are particularly vulnerable to square-shaped uniform noise and explains the large differences.

4.3 Experiments on Input-Transformation-Based Defenses

In this section, we attack input-transformation-based defense methods other than adversarial training.

Datasets and Target Models. Since basic input-transformation-based defense methods do not depend on the target model, they can be applied to various models to easily improve the defense performance against adversarial attacks. We consider four defense methods: Bit-Depth Reduction (Bit-Red) [43], JPEG Compression (JPEG) [17], Feature Distillation (FD) [26], and ComDefend [20]. All of these are input-transformation-based methods that apply a transformation to the input image to mitigate the effects of adversarial perturbations. We conduct attack experiments on models applying each defense method to Madry et al.'s naturally trained model classifying CIFAR-10 and the Inception-v3 pre-trained model classifying ImageNet, respectively. Since ComDefend requires a separate pre-trained model for defense and is not available for CIFAR-10, we only consider ImageNet for this method. For the test data on both CIFAR-10 and ImageNet, we randomly sample 50 images from those used in Sect. 4.1 and generate adversarial examples for these images.

Method Setting. Considering a more realistic setting, we give a budget of 200 queries, which is much less than the number of queries in the experiment in Sect. 4.1. We set the maximum \(\ell _\infty \)-perturbation of the adversarial image to \(\epsilon =0.031\ (\approx 8/255)\) on CIFAR-10, and \(\epsilon =0.062\ (\approx 16/255)\) on ImageNet. The amount of perturbation distortion on ImageNet is based on VMI-CT-FGSM [40]. All experiments in this section are done in the untargeted setting.

Results on CIFAR-10. The results are shown in the upper part of Table 4. Our method outperforms the other black-box attacks against all three input-transformation-based defenses. It achieves an attack success rate of more than \(90\%\) for every defense method, and compared to the results without defense methods in Sect. 4.1, our method is almost unaffected by these defenses. Overall, our method performs better in situations where the attacker is given only a small query budget. As hyperparameters for Parsimonious, we set \(block\ size=4\) and \(batch\ size=64\).

Results on ImageNet. The results are shown in the lower part of Table 4. Our method achieves better attack performance except for the attack success rate on Bit-Red. In particular, the median query count of our method is about half that of Square Attack in most settings, which indicates the relatively high query efficiency of our method. Defense methods based on input transformation are very easy to apply to high-dimensional ImageNet models and are considered more realistic than adversarial training; however, such simple defenses are not sufficient to prevent adversarial attacks. These defense methods may be more robust when the allowed perturbation distortion of the adversarial image is smaller. As hyperparameters for Parsimonious, we set \(block\ size=32\) and \(batch\ size=64\).

Table 5. Ablation study of our method which shows how the individual RPNs (in Sect. 3.2) improve the performance. Our final method is highlighted in blue, and the results are shown below when each RPN was removed from the “All” sampling space.

4.4 Ablation Study

In this subsection, we evaluate our method further. We perform a simple ablation study to show how the individual RPNs (Sect. 3.2) contribute to the performance of our attack. The comparison is done for an \(\ell _\infty \)-threat model of radius \(\epsilon =0.031\). We use 1,000 test images and carry out targeted attacks against the Inception-v3 model pre-trained on ImageNet with a 10 k query budget. Results are shown in Table 5. "All" in the sampling space column means that the RPN is sampled from the full sampling space (vertical-wise, horizontal-wise, uniform, and diagonal-wise), which is the final method used in our experiments. The results when each RPN is removed from the full sampling space are shown below that. In terms of attack success rate and query efficiency, all RPNs contribute to the attack performance. In particular, when the diagonal-wise pattern is removed, the attack success rate and query efficiency are greatly degraded. Based on these results, further analysis of noise patterns is a future challenge, on the assumption that adding new RPNs will further improve the attack performance.

5 Conclusion

We proposed a query-efficient black-box attack using iterative random search and random pattern noises. In our experiments, we showed that our method achieves higher success rates than existing methods in both untargeted and targeted attacks, especially under low query budgets. In the experiments on defensive models, we showed that our method achieves high attack performance in most settings. Since our method is hyperparameter-free, it is practical and easy for attackers to apply.