
1 Motivation

It is well known that deep learning image classification tools can be vulnerable to adversarial attacks. In particular, a carefully chosen perturbation to an image that is imperceptible to the human eye may cause an unwanted change in the predicted class [7, 15]. The fact that automated classification tools may be fooled in this way raises concerns around their deployment in high stakes application areas, including medical imaging, transport, defence and finance [11]. Over the past decade, there has been growing interest in the development of algorithms that construct attacks, and strategies that defend against them [1, 6, 10, 12, 13]. Against the background of this war of attrition, there has also been “bigger picture” theoretical research into the existence, computability and inevitability of adversarial perturbations [2, 5, 14, 16, 17].

In this work, we contribute to the algorithm development side of the adversarial attack literature. We focus on the manner in which perturbation size is measured. Figure 1 illustrates the benefits of our new algorithm. On the left, we show the image of a handwritten digit from the MNIST data set [9]. A trained neural network (accuracy 97%) correctly classified this image as a digit 8. In the middle of Fig. 1 we show a perturbed image produced by the widely used DeepFool algorithm [12]. This perturbed image is classified as a 2 by the network. On the right in Fig. 1 we show another perturbed image, produced by our new algorithm. This new image is also classified as a 2. The DeepFool algorithm looks for a perturbation of minimal Euclidean norm, treating all pixels equally. In this case, we can see that although the perturbed image is close to the original, there are tell-tale smudges on the white background. Our new algorithm seeks a perturbation that causes a minimal componentwise relative change, and in this setting it makes no change to zero-valued pixels. We argue that the resulting perturbation is less noticeable to the human eye, being consistent with a streaky pen, rough paper, or irregular handwriting pressure.

Fig. 1.

Showcasing the capabilities of our new algorithm, which seeks a perturbation that causes minimal componentwise relative change. Left: image from the MNIST data set [9], correctly classified as an 8 by a neural network. Middle: perturbed image produced by DeepFool [12], classified as a 2. Right: perturbed image produced by the new componentwise algorithm, also classified as a 2. The componentwise algorithm does not change the background, where pixel values are zero. In the notation of Sect. 2, the relative Euclidean norm perturbation size, \(\Vert \Delta x \Vert _2/\Vert x\Vert _2\), is 0.09 for DeepFool and 0.23 for the componentwise algorithm. This reflects the fact that DeepFool looks for the smallest Euclidean norm perturbation whereas the componentwise algorithm has a different objective.

2 Overview of Algorithm

We will focus on image classification, assuming that there are c possible classes. Regarding an image as a normalized vector \(x \in \mathbb {R}^{n}\), a classifier takes the form of a map \( F:[0,1]^n \rightarrow \mathbb {R}^c\), where we assume that the output class is determined by the largest component of F(x).

Suppose \(F(x) = y\) and we wish to perturb the image to \(x + \Delta x\) with \(F(x+\Delta x) = \widehat{y}\), where the desired output \(\widehat{y}\) produces a different classification, so \(\widehat{y}\) has a maximum component in a different position to the maximum component of y. In the untargeted case, \(\widehat{y}\) may be any such vector. In the targeted case, we wish to specify which component of \(\widehat{y}\) is maximum.
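As a minimal illustration of this setup (a sketch only, not part of the algorithm in [3]; the callable F and the image vector x are assumed to follow the conventions above), the classification rule and the two attack goals can be written as follows.

```python
import numpy as np

def predicted_class(F, x):
    """Class assigned to image x: the index of the largest component of F(x)."""
    return int(np.argmax(F(x)))

def untargeted_success(F, x, dx):
    """True if the perturbation dx changes the predicted class."""
    return predicted_class(F, x + dx) != predicted_class(F, x)

def targeted_success(F, x, dx, target):
    """True if the perturbation dx drives the prediction to a chosen class."""
    return predicted_class(F, x + dx) == target
```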

Because we seek a small perturbation, we will use the linearization \(F(x+\Delta x) - F(x) \approx \mathcal {A} \Delta x\), where \(\mathcal {A} \in \mathbb {R}^{c \times n}\) is the Jacobian of F at x, and F is assumed to be differentiable in a neighbourhood of x. Then, motivated by the connection to (norm-based) backward error developed in [4] and also by the concept of componentwise backward error introduced in [8], we consider the optimization problem

$$\begin{aligned} \min \{\epsilon : \mathcal {A}\Delta x = \widehat{y} - y, \quad |\Delta x|_i \le \epsilon f_i \quad \textrm{for} \quad 1 \le i \le n \}. \end{aligned}$$
(1)

Here \(f \in \mathbb {R}^n\), with \(f\ge 0\), is a given tolerance vector, and we note that choosing \(f_i = |x_i|\) forces zero pixels to remain unperturbed. Following the approach in [8], it is then useful to write \(\Delta x = Dv\), where \(D={\text {diag}}(f)\) and \(v \in \mathbb {R}^n\), so that our optimization becomes

$$\begin{aligned} \min \{\Vert v\Vert _\infty : \mathcal {A}Dv = \widehat{y} - y\}. \end{aligned}$$
(2)
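To spell out the step from (1) to (2): when every \(f_i\) is strictly positive, the substitution \(\Delta x = Dv\) turns the componentwise bound in (1) into a uniform bound on the components of v,

$$\begin{aligned} |\Delta x|_i = f_i |v_i| \le \epsilon f_i \quad \Longleftrightarrow \quad |v_i| \le \epsilon \quad \textrm{for} \quad 1 \le i \le n, \end{aligned}$$

so the smallest feasible \(\epsilon \) is \(\Vert v\Vert _\infty \) and (1) reduces to (2). Zero entries of f force the corresponding components of \(\Delta x = Dv\) to vanish, whatever the value of \(v_i\).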

In practice, we found that problem (2) encourages all components of v to attain the maximum value \(\Vert v \Vert _\infty \), leading to adversarial perturbations that were quite noticeable. We had more success after replacing (2) by

$$\begin{aligned} \min \{\Vert Dv\Vert _2 : \mathcal {A}Dv = \widehat{y} - y\}. \end{aligned}$$
(3)

Because \(\Delta x=Dv\), in this formulation we retain the masking effect where zero values in the tolerance vector f force the corresponding pixels to remain unperturbed. We found that minimizing \(\Vert Dv\Vert _2\) rather than \(\Vert v \Vert _\infty \) produced perturbations that appeared less obvious, and this was the approach used for Fig. 1.
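For illustration, here is one way the minimum-norm step implied by (3) could be computed, assuming the Jacobian \(\mathcal {A}\) is available as a NumPy array A; this sketches the linear-algebra core only, not the full algorithm of [3].

```python
import numpy as np

def componentwise_step(A, r, f):
    """Minimum 2-norm solution of A @ dx = r with dx_i = 0 wherever f_i = 0,
    i.e. problem (3) written in terms of dx = D v.

    A : (c, n) Jacobian of the classifier at x.
    r : (c,)   desired change in the outputs, here yhat - y.
    f : (n,)   componentwise tolerance vector, e.g. f = |x|.
    """
    dx = np.zeros_like(f, dtype=float)
    mask = f != 0                  # pixels that are allowed to move
    # lstsq returns the minimum 2-norm least-squares solution of the
    # (typically underdetermined) system restricted to the unmasked columns.
    sol, *_ = np.linalg.lstsq(A[:, mask], r, rcond=None)
    dx[mask] = sol
    return dx
```

Discarding the columns of A with zero tolerance means the returned perturbation automatically leaves those pixels untouched.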

It can be shown that the underlying optimization task arising from this approach may be formulated as a linearly-constrained linear least-squares problem. To derive an effective algorithm, various additional practical steps were introduced; notably, (a) projecting to ensure that perturbations do not send pixels out of range, and (b) regarding each optimization problem as a means to generate a direction in which to take a small step within a more general iterative method.
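The following schematic loop, which reuses the componentwise_step helper sketched above, indicates how (a) and (b) might fit together for an untargeted attack; the names jacobian, yhat, step and max_iter are illustrative assumptions rather than quantities taken from [3], and simple clipping stands in for the projection in (a).

```python
import numpy as np

def componentwise_attack(F, jacobian, x, yhat, step=0.1, max_iter=50):
    """Schematic iteration: linearize, use problem (3) to obtain a direction,
    take a damped step, and project back into the valid pixel range."""
    label = int(np.argmax(F(x)))
    f = np.abs(x)                     # tolerance vector: zero pixels never move
    x_adv = x.copy()
    for _ in range(max_iter):
        y = F(x_adv)
        if int(np.argmax(y)) != label:                # class has changed: stop
            break
        A = jacobian(F, x_adv)                        # (c, n) Jacobian at x_adv
        dx = componentwise_step(A, yhat - y, f)       # direction from (3)
        x_adv = np.clip(x_adv + step * dx, 0.0, 1.0)  # (a): keep pixels in [0, 1]
    return x_adv
```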

In our presentation, we will show computational results on a range of data sets that illustrate the performance of the algorithm and compare results with state-of-the-art norm-based attack algorithms. We will also explain how a relevant componentwise condition number for the classification map gives a useful warning about vulnerability to this type of attack.

For full details we refer to [3].