
1 Introduction

Deep neural networks (DNNs) have been shown to be susceptible to adversarial examples: small, human-imperceptible perturbations of the inputs designed to fool the network's prediction [14, 33]. Adversarial attacks fall into two main settings: white-box attacks and black-box attacks. In the white-box setting, attackers have access to all information about the target model and can thus use the model's gradient to effectively guide the search for adversarial examples [7, 23, 33]. In the black-box setting, on the other hand, attackers can only query the model for its classification outputs [9, 18, 25]. This access requirement is considered more realistic in practice.

Traditionally, black-box methods require a massive number of queries to find a successful adversarial perturbation. Since each query to the target model costs time and money, query efficiency is a prerequisite for any practical black-box attack. Recent years have seen the development of several black-box approaches with significantly improved query efficiency [1, 3, 15, 19, 24, 34]. However, current black-box attacks access the target model only at perturbed samples and rely entirely on those queries to update the perturbation at each iteration. To reduce the number of queries, it would be beneficial to extract more from each query, inferring loss values and identifying candidate perturbations at points where no model query was made. This is a challenging goal: since the landscapes of adversarial losses are often complicated and not well understood, the accuracy of loss approximations built from available model queries is not guaranteed.

In this paper, we develop a new \(\ell _2\) black-box adversarial attack in the frequency domain, which uses an interpolation scheme to approximate the loss value around the current state and guide the perturbation updates. We refer to our method as Black-box Attack Based on IntErpolation Scheme (BABIES). The algorithm is inspired by our observation that, for many standard and robust image classifiers, the adversarial loss behaves like a parabola with respect to perturbations of an image in the Fourier domain, and thus can be captured with quadratic interpolation. We treat the adversarial attack problem as a constrained optimization on an \(\ell _2\) sphere and sample along geodesic curves on the sphere. If the queries show improvement, we accept the perturbation; otherwise, we infer a small perturbation from those samples without additional queries. Our method achieves significantly improved query efficiency because the perturbation updates are informed not only directly by model queries (as in existing approaches), but also by an accurate quadratic approximation of the adversarial loss around the current state. The main contributions of this work can be summarized as follows:

  • Theoretical and empirical justification that the adversarial loss behaves like a parabola in the Fourier domain, but NOT like a parabola in the pixel domain.

  • Development of BABIES, a random-search-based black-box attack that exploits the parabolic loss landscape to improve the query efficiency.

  • Evaluations of BABIES with targeted and untargeted attacks on MNIST, CIFAR-10 and ImageNet datasets with both standard and defended models.

1.1 Related Works

To guide the search for adversarial examples, existing black-box attacks often aim at approximating the gradient, either from the gradient of a surrogate model [26, 27], or from model queries via finite difference approximation, zeroth-order optimization, natural evolution strategies, etc. [1, 4, 8, 9, 18, 19, 34]. Many approaches for reducing the dimension of the search space have been proposed, based on principal component analysis [4], autoencoders [34], and compressed sensing [20]. Our method generates random perturbations in the low-frequency domain, similar to SimBA [16] and PPBA [20]. This subspace has been shown to admit a high density of adversarial perturbations [15]. Other strategies for designing random perturbations to guide random-search-based attacks include Square Attack [3], which crafts square-shaped perturbations, PRGF [12], which utilizes a transfer-based prior, and GenAttack [2], which uses genetic algorithms. Adversarial examples can also be generated by learning their probability distributions [11, 22] or via combinatorial optimization techniques [24].

Our black-box method is concerned with the score-based scenario, where the attacker has access to the output scores of the target classifier. More limited variants of the black-box setting have also been studied, where only access to the top-1 predicted labels is assumed [5, 6, 10, 18]. Recent work [21] considers no-box settings, where the attacker makes no query to the target model but just gathers a small labeled dataset. These forms of attacks are more challenging.

2 Background

Image classification aims to predict what a human sees in an image. The objective of an adversarial attack on an image classification model is to introduce a small distortion, imperceptible to humans, into the original image to fool the target model. In this work, we consider the score-based black-box attack. We first give the formal definition of the adversarial attack problem under consideration. Let \(f:[0,1]^d \rightarrow \mathbb {R}^K\) be a classifier with input dimension d and K classes, where \(f_k(\textbf{x})\) is the predicted probability that the image \(\textbf{x}\) belongs to class k. The predicted label of the image \(\textbf{x}\) is denoted by

$$ h(\textbf{x}) := \mathop {\textrm{argmax}}\limits _{k=1,\ldots , K} f_k(\textbf{x}). $$

An adversary aims to generate a perturbed image, denoted by \(\hat{\textbf{x}}\), with a small perturbation that solves the following constrained optimization problem

$$\begin{aligned} \min _{\hat{\textbf{x}}}\, \delta (\textbf{x}, \hat{\textbf{x}}) \; \text { s.t. } \left\{ \begin{aligned}&h(\hat{\textbf{x}}) \ne h(\textbf{x})&\text {(untargeted)}, \\&h(\hat{\textbf{x}}) = \hat{y}&\text {(targeted)}, \end{aligned} \right. \end{aligned}$$
(1)

where \(\delta (\cdot , \cdot )\) measures the perceptual difference between the original image \(\textbf{x}\) and the adversarial image \(\hat{\textbf{x}}\), and \(\hat{y}\) is the target label for targeted attacks. The most commonly used definition for \(\delta \) is the \(\ell _2\) norm or the \(\ell _{\infty }\) norm of the distortion \(\textbf{x} - \hat{\textbf{x}}\). In this work, we will use the \(\ell _2\) norm, i.e., \(\delta (\textbf{x}, \hat{\textbf{x}}) := \Vert \textbf{x}-\hat{\textbf{x}}\Vert _2\), as the distortion metric.

Loss Minimization. For score-based adversarial attacks, we can exploit access to the score function \(f(\textbf{x})\) to define an adversarial loss \(L(\hat{\textbf{x}}, y)\) that guides the search towards adversarial examples. For an untargeted attack, the probability of the class \(h({\textbf{x}})\) to which the original image \(\textbf{x}\) belongs is often used as the adversarial loss, i.e., \(L(\hat{\textbf{x}}, h(\textbf{x})) := f_{h(\textbf{x})}(\hat{\textbf{x}})\). For a targeted attack towards a label \(\hat{y}\), we want to maximize \(f_{\hat{y}}(\hat{\textbf{x}})\), so we choose \(L(\hat{\textbf{x}}, \hat{y}) := - f_{\hat{y}}(\hat{\textbf{x}})\). Since the gradient of the target classifier is unavailable and each query to the model costs time and money, the total number of black-box queries for constructing an adversarial example must not exceed a prescribed budget. Thus, the optimization problem in Eq. (1) is modified to

$$\begin{aligned} \min _{\hat{\textbf{x}}} L(\hat{\textbf{x}}, y)\; \text { s.t. }\; \Vert \hat{\textbf{x}}-\textbf{x}\Vert _2 \le \rho , \, \text { queries } \le B, \end{aligned}$$
(2)

where B is the maximum allowable number of queries and \(\rho \) is the constraint on the maximum image distortion. For notational simplicity, we suppress the dependence of L on y and write L as \(L(\hat{\textbf{x}})\) in the rest of the paper.
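The following minimal sketch illustrates these loss definitions and the success test of Eq. (1). The query function f is an assumption for illustration: it is taken to return the K class probabilities for an image, and each call counts against the budget B.

```python
import numpy as np

# Minimal sketch of the score-based adversarial losses described above.
# `f` is assumed to be a black-box query returning the K class probabilities
# for an image x_hat (each call counts against the query budget B).

def untargeted_loss(f, x_hat, orig_label):
    """L(x_hat, h(x)) = f_{h(x)}(x_hat): probability of the original class."""
    return f(x_hat)[orig_label]

def targeted_loss(f, x_hat, target_label):
    """L(x_hat, y_hat) = -f_{y_hat}(x_hat): maximize the target-class score."""
    return -f(x_hat)[target_label]

def is_adversarial(f, x_hat, orig_label, target_label=None):
    """Success test corresponding to the constraints in Eq. (1)."""
    pred = int(np.argmax(f(x_hat)))
    if target_label is None:        # untargeted attack
        return pred != orig_label
    return pred == target_label     # targeted attack
```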

To solve (2), we employ an iterative random-search approach: at each iteration, we query along a randomly sampled search direction and update the current point based on those queries. Taking a Taylor expansion of the loss with respect to the magnitude \(\delta \) of a perturbation along any randomly selected direction, i.e.,

$$ L(\hat{\textbf{x}}) = L(\textbf{x}^*) + \frac{dL}{d\delta }(\textbf{x}^*) \delta + \frac{1}{2}\frac{d^2L}{d\delta ^2}(\textbf{x}^*) \delta ^2 + \mathcal {O}(\delta ^3), $$

with \(\textbf{x}^*\) being the current state, it is intuitive to conjecture that the loss behaves like a parabola in the neighborhood of \(\textbf{x}^*\). However, this is not the case for all perturbation strategies. In the following sections, we show that the adversarial loss behaves like a parabola in the Fourier domain determined by the discrete cosine transform (DCT) [15, 30], but NOT like a parabola in the pixel domain. We then develop the BABIES algorithm, which exploits the parabolic loss landscape in the frequency domain to improve query efficiency.

3 Theoretical and Empirical Study on the Landscape of the Adversarial Loss

In this section, we investigate the shape of the loss landscape with respect to two different perturbations, namely pixel perturbations and DCT perturbations [15, 30].

Our main observation is that the loss landscape is closer to a parabola with respect to a DCT perturbation, as shown in Figs. 2 and 3. To theoretically verify this observation, we consider a simplified convolutional neural network (CNN) classifier for 1D signals. The length of each signal sample is N. We assume the first two layers of the CNN are a \(3 \times 1\) convolutional layer followed by a \(2\times 1\) max-pooling layer, which is a common setup for CNN-based classifiers. Let \(\textbf{x} = (x_{i+1}, x_{i+2}, x_{i+3}, x_{i+4})\), for \(i \in \{0, \ldots , N-4\}\), be a \(4\times 1\) interior segment of the signal and \( {\textbf {w}} = (w_1, w_2, w_3)\) be the convolution filter. The output of the convolutional layer, centered at \(x_{i+2}\) and \(x_{i+3}\), consists of two entries \(y_2\) and \(y_3\) given by

$$ \begin{aligned} y_2 = w_1 x_{i+1} + w_2 x_{i+2} + w_3 x_{i+3},\\ y_3 = w_1 x_{i+2} + w_2 x_{i+3} + w_3 x_{i+4}, \end{aligned} $$

and the output after the ReLU activation is

$$\begin{aligned} z_2 = {\max }(y_2,0),\quad z_3 = {\max }(y_3,0), \end{aligned}$$
(3)

and the output of the max-pooling layer is

$$\begin{aligned} u = \max (z_2, z_3). \end{aligned}$$
(4)

The simplified CNN model is visualized in Fig. 1. Note that we choose the 1D case to avoid tedious derivation, but the theoretical intuition is applicable to 2D and 3D cases.

Fig. 1. Illustration of the simplified CNN classifier used to verify our theoretical intuition. We only explicitly write out the first convolutional and max-pooling layers, which are sufficient for the argument; the remaining layers are grouped into “other operations”.
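As a concrete illustration, the front end of Eqs. (3)-(4) can be written as a few lines of code. This is a minimal sketch: the segment values below are assumptions for illustration only, and the filter \((-1, 5, -1)\) is the sharpening kernel mentioned later in this section.

```python
import numpy as np

# Sketch of the simplified CNN front end of Fig. 1 (Eqs. (3)-(4)):
# a 3x1 convolution on a 4-pixel segment, ReLU, then 2x1 max-pooling.

def front_end(x_seg, w=(-1.0, 5.0, -1.0)):
    """x_seg = (x_{i+1}, ..., x_{i+4}); returns u = max(z_2, z_3)."""
    w1, w2, w3 = w
    y2 = w1 * x_seg[0] + w2 * x_seg[1] + w3 * x_seg[2]
    y3 = w1 * x_seg[1] + w2 * x_seg[2] + w3 * x_seg[3]
    z2, z3 = max(y2, 0.0), max(y3, 0.0)   # ReLU, Eq. (3)
    return max(z2, z3)                    # max-pooling, Eq. (4)

# Perturbing only x_{i+2} (a pixel perturbation) and sweeping delta exposes
# the piecewise-linear kinks of u(delta) discussed in the next paragraphs.
x = np.array([0.2, 0.5, 0.4, 0.3])        # illustrative segment values
deltas = np.linspace(-1.0, 1.0, 201)
u_vals = [front_end(x + d * np.array([0.0, 1.0, 0.0, 0.0])) for d in deltas]
```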

Let us define a perturbed signal as \(\textbf{x} + \delta \textbf{q}\), where \(\textbf{q} = (q_{i+1}, q_{i+2}, q_{i+3}, q_{i+4})\) is the perturbation direction. The derivative of the adversarial loss \(L(\delta )\) (as a function of the perturbation’s magnitude \(\delta \)) is represented by

$$\begin{aligned} \frac{dL}{d\delta }(\delta ) = \frac{dL}{du} \frac{du}{d\delta }(\delta ), \end{aligned}$$
(5)

and we focus on analyzing the behavior of \({du}/{d\delta }\) for both pixel and DCT perturbations.

The Property of \({du}/{d\delta }\) Due to Pixel Perturbation. In this case, we perturb the pixel \(x_{i+2}\), i.e., setting \(q_{i+2} = 1\), \(q_{i+1} =q_{i+3}=q_{i+4} = 0\), to study how \(d u/d \delta \) behaves as a function of \(\delta \). Specifically, \({du}/{d\delta }\) under the perturbation of \(x_{i+2}\) can be written as

$$\begin{aligned} \frac{du}{d\delta }(\delta ) = \frac{\partial u}{\partial z_2} \frac{\partial z_2}{\partial y_2} w_2 + \frac{\partial u}{\partial z_3} \frac{\partial z_3}{\partial y_3} w_1, \end{aligned}$$
(6)

which involves the derivatives of the ReLU function and the max-pooling function, e.g.,

$$ \frac{\partial u}{\partial z_2} = \left\{ \begin{aligned}&1, \;\text { if } z_2 \ge z_3\\&0, \; \text {otherwise} \end{aligned}\right. , \;\;\text { and } \;\; \frac{\partial z_2}{\partial y_2} = \left\{ \begin{aligned}&1, \;\text { if } y_2 > 0\\&0, \; \text {otherwise} \end{aligned}\right. , $$

and \(\partial u/\partial z_3\), \(\partial z_3/\partial y_3\) are defined similarly. Therefore, \({du}/{d\delta }\) can only take values in the set

$$\begin{aligned} \frac{du}{d\delta }(\delta ) \in \{0, w_1, w_2\}, \end{aligned}$$
(7)

when perturbing the pixel \(x_{i+2}\) by \(\delta \). Since \(y_2\), \(y_3\), \(z_2\), \(z_3\) are functions of \(\delta \), the value of \({du}/{d\delta }\) may “jump” from one value in \(\{0, w_1, w_2\}\) to another, and \(w_1\) and \(w_2\) may be dramatically different; e.g., \({\textbf {w}} = (-1, 5, -1)\) defines a sharpening filter kernel. The jump size is bounded by

$$\begin{aligned} \Big |\frac{du}{d\delta }(\alpha ) - \frac{du}{d\delta }(\beta )\Big | \le |w_1| + |w_2|, \end{aligned}$$
(8)

where \(\alpha \ne \beta \) but \(|\alpha - \beta |\) is arbitrarily small. This eventually leads to rapid changes of the derivative of the total loss \(dL/d\delta \) defined in Eq. (5). Figure 2 (right) illustrates a typical loss landscape with respect to a pixel perturbation.

The Property of \({du}/{d\delta }\) Due to DCT Perturbation. In this case, all the pixels are perturbed simultaneously. Specifically, the perturbation direction \(\textbf{q}\) is defined by

$$\begin{aligned} q_{i+1} = \sqrt{\frac{2}{N}}\cos \left( \frac{(2i+1)n\pi }{2N}\right) , \ q_{i+2} = \sqrt{\frac{2}{N}}\cos \left( \frac{(2i+3)n\pi }{2N}\right) , \\ q_{i+3} = \sqrt{\frac{2}{N}}\cos \left( \frac{(2i+5)n\pi }{2N}\right) , \ q_{i+4} = \sqrt{\frac{2}{N}}\cos \left( \frac{(2i+7)n\pi }{2N}\right) , \end{aligned}$$

where \(n\in \{0,\ldots ,N-1\}\) is the selected frequency and N is the total signal length. Then, the derivative \({du}/{d\delta }\) is represented by

$$ \begin{aligned} \frac{du}{d\delta }(\delta )&= \frac{\partial u}{\partial z_2} \frac{\partial z_2}{\partial y_2}( w_1 q_{i+1} + w_2 q_{i+2} + w_3 q_{i+3}) \\&\quad + \frac{\partial u}{\partial z_3} \frac{\partial z_3}{\partial y_3} ( w_1 q_{i+2} + w_2 q_{i+3} + w_3 q_{i+4}). \end{aligned} $$

Therefore, \({du}/{d\delta }\) can only take values in the set

$$\begin{aligned} \begin{aligned}&\frac{du}{d\delta }(\delta ) \in \big \{w_1 q_{i+1} + w_2 q_{i+2} + w_3 q_{i+3},\, w_1 q_{i+2} + w_2 q_{i+3} + w_3 q_{i+4},\, 0 \big \}. \end{aligned} \end{aligned}$$
(9)

As opposed to the pixel perturbation case in Eq. (7), the potential “jumps” of \(du/d\delta \) in the DCT domain are much smaller. In fact, the maximum jump size satisfies

$$\begin{aligned} \begin{aligned} \Big |\frac{du}{d\delta }(\alpha ) - \frac{du}{d\delta }(\beta )\Big |&\le \frac{2\sqrt{2}}{\sqrt{N}} \sin \left( \frac{n\pi }{2N}\right) \Bigg [w_1\sin \left( \frac{(i+1)n\pi }{N}\right) \\&\quad + w_2\sin \left( \frac{(i+2)n\pi }{N}\right) + w_3\sin \left( \frac{(i+3)n\pi }{N}\right) \Bigg ], \end{aligned} \end{aligned}$$
(10)

where \(\alpha \ne \beta \). When perturbing low-frequency modes, i.e., when n is small, as suggested in [15, 30], the right-hand side of Eq. (10) is bounded by \(\frac{2\sqrt{2}}{\sqrt{N}} |\sin \left( \frac{n\pi }{2N}\right) |(|w_1| + |w_2| + |w_3|)\). This bound is much smaller than the one in Eq. (8) due to the factor \(1/\sqrt{N}\) (with N the signal length) and the small sine factor.
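For completeness, the step behind Eq. (10) is the sum-to-product identity \(\cos A - \cos B = -2\sin \frac{A+B}{2}\sin \frac{A-B}{2}\) applied to differences of neighboring DCT entries:

$$\begin{aligned} q_{i+1} - q_{i+2} = \sqrt{\frac{2}{N}}\left[ \cos \left( \frac{(2i+1)n\pi }{2N}\right) - \cos \left( \frac{(2i+3)n\pi }{2N}\right) \right] = \frac{2\sqrt{2}}{\sqrt{N}} \sin \left( \frac{(i+1)n\pi }{N}\right) \sin \left( \frac{n\pi }{2N}\right) , \end{aligned}$$

and similarly \(q_{i+2} - q_{i+3}\) and \(q_{i+3} - q_{i+4}\), with the factors \(\sin \left( \frac{(i+2)n\pi }{N}\right) \) and \(\sin \left( \frac{(i+3)n\pi }{N}\right) \) respectively. The jump between the two nonzero values in Eq. (9) is \(w_1(q_{i+1}-q_{i+2}) + w_2(q_{i+2}-q_{i+3}) + w_3(q_{i+3}-q_{i+4})\), which gives Eq. (10); bounding each sine factor with index i by one in magnitude yields the bound stated above.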

Experimental Illustration. To verify the above intuition, we investigate the landscape of the adversarial loss for untargeted attacks on four different classifiers: (a) standard Inception_v3 on ImageNet [32], (b) \(\ell _2\)-robust ResNet18 on ImageNet [29], (c) standard VGG on CIFAR-10 [31], (d) \(\ell _2\)-robust ResNet50 on CIFAR-10 [13]. For each model, we randomly select 50 images from the corresponding test set and define the loss functions as in the Background section. We sample 100 1D segments in a neighborhood of each original image, along randomly selected DCT and pixel directions, and compute the loss function restricted to them. We then fit these loss values with parabolas using quadratic regression. The true and approximated landscapes typically found in the DCT and pixel settings are compared in Fig. 2. We observe that the adversarial loss with respect to DCT perturbations is smooth and close to a parabola. On the other hand, the loss with respect to pixel perturbations shows sharp turns due to the rapid changes (jumps) of \(dL/d\delta \), and therefore cannot be captured by a quadratic approximation. This empirical observation is consistent with the theoretical study above.
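A minimal sketch of this landscape check is given below; the black-box loss, the images, the search direction, and the sampling radius are placeholders for the actual experimental setup.

```python
import numpy as np

# Sketch of the landscape check described above: sample the loss along one
# randomly chosen direction q (DCT or pixel) and fit a parabola by quadratic
# regression, then report how well the fit matches the true values.

def parabola_fit_quality(loss, x, q, radius=1.0, n_samples=100):
    deltas = np.linspace(-radius, radius, n_samples)
    values = np.array([loss(x + d * q) for d in deltas])   # L restricted to the segment
    coeffs = np.polyfit(deltas, values, deg=2)             # quadratic regression
    fitted = np.polyval(coeffs, deltas)
    # Normalize to [0, 1] per segment (as in Fig. 3) and report the correlation.
    lo, hi = values.min(), values.max()
    if hi > lo:
        values = (values - lo) / (hi - lo)
        fitted = (fitted - lo) / (hi - lo)
    return np.corrcoef(values, fitted)[0, 1]
```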

Fig. 2. The landscape of the adversarial loss along DCT directions is often well-behaved and can be fitted with a parabola (left). The landscape along pixel directions features sharp turns due to the rapid change (jumps) of \(dL/d\delta \) shown in Eq. (8), thus cannot be adequately captured by quadratic approximation (right).

To show that the phenomenon in Fig. 2 is statistically meaningful, we plot in Fig. 3 the correlation between true loss values and the parabolic approximations at a large number of sample points. To generate each plot, 5000 points are randomly sampled on 100 segments in the neighborhood of each of 50 images (therefore 5000 segments in total). Since the losses on different segments and images differ significantly in value, we normalize them on each segment so that their values lie in [0, 1]. We observe a strong correlation in the DCT setting, confirming that the adversarial losses are generally well approximated by parabolas in the frequency directions, but much less so in the pixel directions.

Fig. 3. Scatter plots displaying the correlation between true and approximated loss values at a large number of points, sampled from 5000 segments along DCT directions (top) and pixel directions (bottom). We observe a strong correlation in the DCT setting, and much less so in the pixel setting. This plot verifies the generality of the example in Fig. 2: the shape of the adversarial loss along DCT directions is close to a parabola and can be adequately approximated by one.

4 The BABIES Algorithm

In this section, we present how to exploit the parabolic loss landscape in the DCT domain to develop our BABIES algorithm for black-box attacks. Our method consists of two components. Before describing our quadratic interpolation scheme for updating the perturbation, we discuss the sampling rule with a large step size on the hypersphere.

The Sampling Rule on the Hypersphere. Let us define \(\mathcal {B}_\rho : = \{\hat{\textbf{x}} \in [0,1]^d: \Vert \hat{\textbf{x}}-\textbf{x}\Vert _2 \le \rho \}\) and let \(\mathcal {S}_{\rho }\) be the boundary of \(\mathcal {B}_{\rho }\). Let Q be the set of low frequencies extracted by the DCT. Starting from \(\textbf{x}\), we generate a sequence of iterates \(\textbf{x}^{(k)}\) in \(\mathcal {B}_{\rho }\) which progresses toward an adversarial example. Let \(\varepsilon \) be the step-size parameter and let \(\textbf{q}^{(k)}\) be the direction sampled from Q at iteration k; we determine two queries based on \(\textbf{q}^{(k)}\) and \(\varepsilon \). When all of \(\textbf{x}^{(k)}\), \(\textbf{x}_{\varepsilon }\) and \(\textbf{x}_{-\varepsilon }\) are in the interior of \(\mathcal {B}_\rho \), i.e., at the beginning of the search, we simply query at

$$\begin{aligned} \textbf{x}_{-\varepsilon } = \textbf{x}^{(k)} - \varepsilon \textbf{q}^{(k)} \ \ \ \text {and} \ \ \ \textbf{x}_{\varepsilon } = \textbf{x}^{(k)} + \varepsilon \textbf{q}^{(k)}, \end{aligned}$$
(11)

and update \(\textbf{x}^{(k)}\) using these queries (see the update rule in the second part). When one or more of \(\textbf{x}^{(k)}\), \(\textbf{x}_{-\varepsilon }\), \(\textbf{x}_{\varepsilon }\) reach the hypersphere \(\mathcal {S}_{\rho }\), we switch to the sampling rule on the hypersphere, where the queries along the straight line in Eq. (11) are replaced by queries along the geodesic curve passing through \(\textbf{x}^{(k)}\) in the plane spanned by \(\textbf{q}^{(k)}\) and \(\textbf{x}^{(k)}-\textbf{x}\). We choose \(\textbf{x}_{-\varepsilon }\) and \(\textbf{x}_{\varepsilon }\in \mathcal {S}_{\rho }\) so that the arc length (instead of the Euclidean distance) between them and \(\textbf{x}^{(k)}\) is \(\varepsilon \). Denoting \(\delta ^{(k)} = \textbf{x}^{(k)}-\textbf{x}\), the angle between \(\delta ^{(k)}\) and \(\textbf{x}_{\pm \varepsilon }-\textbf{x}\) is then \(\varepsilon /\rho \). Extracting the tangent component of \(\textbf{q}^{(k)}\) as \( \widetilde{\textbf{q}}^{(k)}:= \textbf{q}^{(k)} - \langle \textbf{q}^{(k)},\delta ^{(k)}\rangle \delta ^{(k)}/\Vert \delta ^{(k)}\Vert _2^2, \) we arrive at the formula for the queries on \(\mathcal {S}_{\rho }\):

$$\begin{aligned} \textbf{x}_{\pm \varepsilon } = \rho \frac{\left( \textbf{x}^{(k)} \pm \frac{\widetilde{\textbf{q}}^{(k)}}{\Vert \widetilde{\textbf{q}}^{(k)}\Vert }\rho \tan \left( \frac{\varepsilon }{\rho }\right) \right) }{\left\| \textbf{x}^{(k)} \pm \frac{\widetilde{\textbf{q}}^{(k)}}{\Vert \widetilde{\textbf{q}}^{(k)}\Vert }\rho \tan \left( \frac{\varepsilon }{\rho }\right) \right\| _2}. \end{aligned}$$
(12)

The key hyperparameter of our algorithm is the query step size \(\varepsilon \). We select a relatively large \(\varepsilon \) for better long-range exploration of adversarial examples. Since the generated samples always lie in \(\mathcal {B}_{\rho }\), we can use large query steps without worrying about the image distortion. Note that the iterates quickly approach \(\mathcal {S}_{\rho }\), so all of the search effort, except for the first few iterations, is spent on the hypersphere \(\mathcal {S}_{\rho }\). We do not fine-tune \(\varepsilon \) in our experiments. As seen in the next section, the value of the step size \(\varepsilon \) is fixed for each type of target model (standard or robust) and dataset (ImageNet, or CIFAR-10 and MNIST), even though \(\rho \) varies significantly.
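A minimal sketch of the query construction in Eqs. (11)-(12) is shown below, assuming flattened image arrays. The geodesic step is applied to the perturbation \(\delta ^{(k)} = \textbf{x}^{(k)}-\textbf{x}\) so that both queries lie on the sphere centered at the clean image; clipping to \([0,1]^d\) and the exact rule for switching between the interior and the sphere are omitted for brevity.

```python
import numpy as np

# Sketch of the two query points of Eqs. (11)-(12). `x0` is the clean image,
# `x_k` the current iterate, `q` a (DCT) search direction, all flattened arrays.

def queries(x0, x_k, q, eps, rho):
    delta_k = x_k - x0
    if np.linalg.norm(delta_k) < rho - 1e-12:
        # Interior of B_rho: straight-line queries, Eq. (11).
        # (The full rule also checks whether these queries would leave B_rho.)
        return x_k - eps * q, x_k + eps * q
    # On the sphere S_rho: queries along the geodesic through x_k, Eq. (12).
    q_tan = q - (q @ delta_k) / (delta_k @ delta_k) * delta_k   # tangent component
    q_tan = q_tan / np.linalg.norm(q_tan)
    step = rho * np.tan(eps / rho) * q_tan                      # arc length eps
    x_minus = x0 + rho * (delta_k - step) / np.linalg.norm(delta_k - step)
    x_plus = x0 + rho * (delta_k + step) / np.linalg.norm(delta_k + step)
    return x_minus, x_plus
```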


The Update Rule with Quadratic Interpolation. We now discuss how to update the iterate from the loss values at the three points \(\textbf{x}^{(k)}, \textbf{x}_{-\varepsilon }\) and \(\textbf{x}_{\varepsilon }\) obtained from either Eq. (11) or Eq. (12). If one of the queries decreases the loss value, i.e., \(\min (L(\textbf{x}_{-\varepsilon }), L(\textbf{x}_{\varepsilon })) < L(\textbf{x}^{(k)})\), we accept it as the new state

$$\begin{aligned} \textbf{x}^{(k+1)} = \mathop {\textrm{argmin}}\limits _{\hat{\textbf{x}} \in \{\textbf{x}_{-\varepsilon }, \textbf{x}_{\varepsilon }\}} L(\hat{\textbf{x}}), \end{aligned}$$
(13)

and thus make a big step to explore a new region on the hypersphere. When \(\min (L(\textbf{x}_{-\varepsilon }), L(\textbf{x}_{\varepsilon })) \ge L(\textbf{x}^{(k)})\), the loss function L restricted to the geodesic curve (or straight line, if searching within \(\mathcal {B}_\rho \)) connecting \(\textbf{x}^{(k)}, \textbf{x}_{-\varepsilon }\) and \(\textbf{x}_{\varepsilon }\) has a local minimizer, and it is desirable to identify and use this minimizer for the next iterate. Based on the intuition in the previous section, the idea is to fit a parabola to the three data points to estimate the loss, and to assign the minimizer of the parabola to \(\textbf{x}^{(k+1)}\). The geodesic (or Euclidean) distance between this minimizer and \(\textbf{x}^{(k)}\) can be computed as

$$\begin{aligned} \gamma = \frac{\varepsilon }{2}\dfrac{L(\textbf{x}_{\varepsilon })-L(\textbf{x}_{-\varepsilon })}{L(\textbf{x}_{\varepsilon }) - 2\,L(\textbf{x}^{(k)}) + L(\textbf{x}_{-\varepsilon })} . \end{aligned}$$
(14)

As a result, \(\textbf{x}^{(k+1)} = \textbf{x}_{\gamma }\), where \(\textbf{x}_{\gamma }\) is defined similarly to \(\textbf{x}_{\varepsilon }\) in Eq. (12) when searching on \(\mathcal {S}_\rho \), or Eq. (11) when searching within \(\mathcal {B}_\rho \), with \(\gamma \) replacing \(\varepsilon \). It is easy to see that the update is nonzero whenever \(L(\textbf{x}_{\varepsilon }) \ne L(\textbf{x}_{-\varepsilon })\). Moreover, we use the interpolated loss value

$$\begin{aligned} L_\textrm{interp} = L(\textbf{x}^{(k)}) - \frac{1}{8} \frac{(L(\textbf{x}_{\varepsilon })-L(\textbf{x}_{-\varepsilon }))^2}{L(\textbf{x}_{\varepsilon }) - 2\,L(\textbf{x}^{(k)}) + L(\textbf{x}_{-\varepsilon })}, \end{aligned}$$
(15)

to approximate the current best loss value, instead of querying the loss function at \(\textbf{x}^{(k+1)}\). This strategy continuously updates the iterate with small moves when the random queries cannot find a better candidate. Its effectiveness relies on the interpolation accuracy: when generating adversarial examples in the frequency domain, the interpolation error is small enough to lead the random search towards a descent direction of the loss (justified by the parabolic landscape of the adversarial loss).
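Below is a minimal sketch of this update rule: the acceptance test of Eq. (13) and the standard three-point quadratic interpolation underlying Eqs. (14)-(15). The sign convention for the returned offset (which side of \(\textbf{x}^{(k)}\) the vertex lies on) is the textbook one and may be folded into the search direction differently in an actual implementation.

```python
# Sketch of the update rule of Eqs. (13)-(15). Given the losses at x^(k),
# x_{-eps}, x_{+eps}, return the signed offset along the sampled direction
# and the (interpolated) loss value at the new iterate.

def quadratic_step(L0, L_minus, L_plus, eps):
    denom = L_plus - 2.0 * L0 + L_minus
    if abs(denom) < 1e-12:                 # nearly flat fit: no move
        return 0.0, L0
    gamma = 0.5 * eps * (L_minus - L_plus) / denom            # vertex location
    L_interp = L0 - (L_plus - L_minus) ** 2 / (8.0 * denom)   # vertex value
    return gamma, L_interp

def update(L0, L_minus, L_plus, eps):
    # Accept the better query if it improves the loss (Eq. (13)); otherwise
    # take the small interpolated move, with no extra query (Eqs. (14)-(15)).
    if min(L_minus, L_plus) < L0:
        return (-eps, L_minus) if L_minus < L_plus else (eps, L_plus)
    return quadratic_step(L0, L_minus, L_plus, eps)
```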

Table 1. The target classifiers and experiment parameters. The first numbers are for untargeted attacks; numbers in parentheses are for targeted attacks. Additional results on the sensitivity of BABIES's performance to hyperparameters are included in the Appendix.

5 Experimental Evaluation

5.1 Results of BABIES on MNIST, CIFAR-10 and ImageNet

We evaluate BABIES-DCT and compare it with established algorithms from the literature: the Bandits-TD \(\ell _2\) attack [19], SimBA-DCT [16] and the \(\ell _2\)-Square Attack [3], on the MNIST, CIFAR-10 and ImageNet datasets. We use the default hyperparameters suggested by the authors of the baseline methods. We use the following standard metrics to evaluate attack performance: the mean number of queries of successful attacks (Avg. QY), the median number of queries of successful attacks (Med. QY) and the success rate (SR). Additional evaluations with two other baselines, PPBA [20] and PRGF [12], are provided in the Appendix.

Setup. For the MNIST and CIFAR-10 tests, we attack 1,000 correctly labeled images randomly selected from the corresponding test sets. Evaluation on ImageNet is performed on a set of 1,000 images from ImageNetV2 [28]. In targeted attacks, the target labels are sampled uniformly at random, and the same target labels are used for all methods. The search subspace of BABIES-DCT on ImageNet is set to the first 1/8-th of all frequencies, and an additional 1/32-th of the next frequencies is included whenever all available frequencies are used up without success. Due to the low dimensionality of CIFAR-10 images, we initialize the random search on the first 5/8-th of all frequencies, and add an additional 1/8-th of the frequencies at a time, if necessary. We use our method and the other baselines to attack eight pre-trained classifiers (four standard and four \(\ell _2\)-robust). For each attack, we limit the number of queries (B) the attacker can make and the allowable \(\ell _2\) distortion (\(\rho \)). We use different values of \(\rho \) since our experiments span various types of datasets and classifiers; in particular, a larger \(\rho \) is used when the attacks are more challenging (i.e., on the ImageNet dataset, robust models and/or targeted attacks). We make minimal tuning of the step size \(\varepsilon \), simply setting it to a fraction of \(\rho \). Details of the target classifiers and test parameters are shown in Table 1. Additional results showing that the performance of BABIES is not sensitive to \(\varepsilon \) are presented in the Appendix.
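One way to maintain and expand such a low-frequency search subspace is sketched below. The ordering of 2-D DCT frequencies and the expansion schedule are illustrative assumptions only; the fractions mirror the setup described above.

```python
# Sketch of a growing low-frequency DCT index set. `side` is the spatial size
# of the image (e.g. 299 for Inception_v3 inputs); the ordering by u + v is an
# assumed low-to-high proxy, not necessarily the one used in the experiments.

def frequency_schedule(side, init_frac, grow_frac):
    """Yield growing sets of (u, v) DCT frequency indices, lowest first."""
    all_freqs = sorted(((u, v) for u in range(side) for v in range(side)),
                       key=lambda uv: uv[0] + uv[1])
    n_total = len(all_freqs)
    n_active = max(1, int(init_frac * n_total))
    while True:
        yield all_freqs[:n_active]
        if n_active >= n_total:
            break
        n_active = min(n_total, n_active + max(1, int(grow_frac * n_total)))

# e.g. ImageNet: frequency_schedule(299, 1/8, 1/32);
#      CIFAR-10: frequency_schedule(32, 5/8, 1/8)
```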

Results on the Standard Models (Tables 2 and 3). The main comparison results for the attacks on CIFAR-10 images are reported in Table 2. The median queries of BABIES-DCT are significantly lower than those of the other baselines in all of the tests, highlighting that for many images our method can find an adversarial perturbation in the DCT domain with very few queries. We also achieve the best success rates in three out of four cases. On the average query count, BABIES-DCT and Square Attack each lead in two tests. Both being random-search-based methods, BABIES-DCT samples from a pre-defined set of DCT directions, while Square Attack crafts random perturbations from a more flexible space. For low-dimensional images like CIFAR-10, the set of DCT directions is limited, so eventually all directions are chosen and recycled. The ability to generate more flexible random directions is an advantage for Square Attack in this case.

Table 3 shows the comparison results on ImageNet. Our method performs very strongly on the targeted attacks, where it outperforms the others in all three metrics and requires far fewer queries (\(39\%\) and \(13\%\) fewer than the next baseline for Inception_v3 and ResNet50, respectively). On untargeted attacks, the results are more comparable: BABIES-DCT is slightly better than Square Attack in success rate, and slightly worse in the number of queries of successful attacks. SimBA does not look particularly strong here, but it should be noted that SimBA can find adversarial examples with very small distortions in \(\mathcal {B}_\rho \), while the other methods focus on searching on the hypersphere \(\mathcal {S}_\rho \). As such, with the same maximum allowable distortion, SimBA always achieves the lowest average distortion on successful attacks.

Table 2. Comparison of attacks against the standard models for CIFAR-10. BABIES-DCT leads in success rate in 3/4 tests and achieves significantly lower median queries than the other baselines in 4/4 tests.
Table 3. Comparison of attacks against the standard models for ImageNet. For targeted attacks, BABIES outperforms the other baselines in all three metrics. For untargeted attacks, the performance of the tested methods is more comparable.
Table 4. Comparison of attacks against the \(\ell _2\)-robust models for CIFAR-10 and MNIST. None of the attacks achieve success rates close to 100%, so we evaluate methods on success rate before other metrics. BABIES-DCT leads in success rate in 4/4 tests (and by over \(14\%\) over the next baseline in 3/4 tests).
Table 5. Comparison of attacks against the \(\ell _2\)-robust models for ImageNet. Success rate is the most important metric here. BABIES-DCT leads in success rate by over \(18\%\) over the next baseline in 3/4 tests. For targeted ResNet50, our method is close to Bandits in success rate, but requires 50% fewer queries.

Results on Defended Models (Tables 4 and 5). Here, none of the evaluated attacks achieve success rates close to \(100\%\), so we evaluate them on success rate before the other metrics, because a method with a low success rate can trivially achieve a low query count. For the MNIST and CIFAR-10 attacks (Table 4), BABIES-DCT has a significant lead in success rate in three out of four tests (between \(14\%\) and \(27\%\) over the next baseline). The gap between our method and Bandits is small for the untargeted SmallCNN attack, but there our method posts much lower average and median query counts. For the ImageNet attacks (Table 5), BABIES-DCT leads by a large margin in the untargeted ResNet18 (\(18\%\)), targeted ResNet18 (\(29\%\)) and untargeted ResNet50 (\(19\%\)) attacks. Our method is close to Bandits on the targeted ResNet50, but again requires far fewer queries. We observe that BABIES shows more significant advantages in attacking defended models, which is consistent with the empirical result (in Fig. 3) that the loss landscapes of defended models are closer to parabolas than those of the standard models.

Qualitative Results (Fig. 4). Since the distortion metric is only an approximation of imperceptibility, we would also like to compare how imperceptible the adversarial images are to the human eye. For that purpose, we selected four images from the targeted attack (on Inception_v3) experiment to illustrate our observations. The clean images and the distorted images are shown in Fig. 4. All adversarial images have approximately the same \(\ell _2\) distortion norm \(\Vert \delta \Vert _2 \simeq 12\). It is easy to see that the different methods lead to different types of distortion. Even though Bandits is less efficient in our experiments, it generates the most imperceptible adversarial images at comparable \(\ell _2\) norms. The adversarial images from BABIES-DCT and SimBA-DCT (not shown here) exhibit noticeable wave-like distortions for some images, especially when the background color is light. Finally, Square Attack adds more noticeable, sharp distortions; the distortion mass is distinctively concentrated in a set of small squares, a direct consequence of this method's design of the search space.

Fig. 4. Qualitative comparison of the imperceptibility of the distortion. The distorted images are selected from the targeted attack (on Inception_v3) experiment and have approximately the same distortion norm \(\Vert \delta \Vert _2 \simeq 12\). Bandits generates perturbations with a grainy look that can blend with the background. The wave-like distortions from BABIES-DCT are noticeable for some images. Square Attack generates, in general, more noticeable distortions than the other methods.

5.2 Results on Attacking Google Cloud Vision

We perform attacks on the Google Cloud Vision API, a powerful image analysis tool capable of identifying the dominant entities/objects in images. For an input image, the API provides a list of the top concepts appearing in the image, together with their probabilities. We consider an untargeted attack that aims to remove the top 3 concepts of the original images. We define the adversarial loss as the maximum of the probabilities of the original top 3 concepts, similar to [16, 20], and minimize this loss with BABIES until an adversarial example is found. We allow a maximum of 2000 queries for each image and a maximum distortion of \(\rho = 12.0\) in \(\ell _2\) norm. Our attack on 50 images randomly selected from ImageNetV2 shows that BABIES achieves an 82% success rate with an average of 205 queries on successful attacks. In Fig. 5, we show some example images before and after the attack. We observe that the leading concepts in the original images, related to food, laundry, camel and insect, are completely removed in the adversarial images and replaced by less important or incorrect labels. This test demonstrates that our method is highly effective against real-world systems.
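A minimal sketch of this loss is given below; query_labels is a hypothetical placeholder for the actual API call and is assumed to return a concept-to-probability dictionary for an image.

```python
# Sketch of the untargeted loss used against the Cloud Vision API: the maximum
# of the returned probabilities of the original top-3 concepts.
# `query_labels(image)` is a hypothetical wrapper around the real API call,
# assumed to return {concept_name: probability} for the given image.

def top3_removal_loss(query_labels, image, top3_concepts):
    scores = query_labels(image)
    # Concepts that no longer appear in the response contribute zero.
    return max(scores.get(c, 0.0) for c in top3_concepts)

def is_success(query_labels, image, top3_concepts):
    # The attack succeeds once none of the original top-3 concepts is returned.
    return all(c not in query_labels(image) for c in top3_concepts)
```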

Fig. 5. Example images in our BABIES attack on the Google Cloud Vision API to remove the top 3 labels. Labels related to the main objects of the original images (clockwise from top left: food, laundry, camel, insect) are completely removed and replaced by less important or incorrect labels.

6 Discussion and Conclusion

We propose to exploit the parabolic landscape of the adversarial loss with respect to DCT perturbations to improve the query efficiency of random-search methods for black-box adversarial attacks. Using a simple quadratic interpolation strategy, we demonstrate that this loss smoothness can greatly improve query efficiency without additional queries per iteration. Our algorithm solves a constrained optimization problem on the \(\ell _2\) sphere; we therefore propose to use a large query step for better exploration of the search space. Our evaluation shows a remarkable advantage of this strategy.

Our theoretical and empirical study of the landscape of the adversarial loss provides a new angle for investigating the vulnerability of an image classifier. From this perspective, the theoretical insight on the loss landscape may be even more valuable to the community than the BABIES algorithm itself. For example, an intriguing observation from the theoretical study and our experiments is that the relative performance of BABIES-DCT (in comparison to the other baselines) is strongest when attacking \(\ell _2\)-robust models. One possible reason is that the loss landscapes of the defended models are closer to parabolas, which provides a favorable setting for our method. While defended classifiers have been studied extensively in recent years, \(\ell _\infty \) models have received more attention, and less is known about \(\ell _2\) models. Understanding the properties and possible weaknesses of \(\ell _2\) models is an interesting problem we plan to study next. Despite its strong performance, our method has several limitations. First, it is designed for \(\ell _2\) attacks and may not outperform the state-of-the-art methods in the \(\ell _\infty \) setting. Second, since the perturbation is made in the Fourier domain, it is a combination of cosine functions, which can be easier to spot with the naked eye than pixel perturbations, even though the \(\ell _2\) norm satisfies the constraint.

There are several possible directions for future research. One is to investigate the loss smoothness in other spaces, e.g., replacing the DCT with a wavelet transform. In fact, the idea of Square Attack makes the Haar wavelet transform a good candidate to study. An advantage of using a wavelet transform is that wavelets are locally supported, so perturbing a wavelet mode results in a smaller distortion than perturbing a globally supported cosine basis function. Another direction for improvement is to perturb multiple DCT modes within each iteration for more efficient exploration. We leave these directions for future work.