
1 Introduction

Deep learning has shown outstanding success in nearly all machine learning fields. However, it has been shown that deep neural networks are vulnerable to adversarial examples, i.e., small perturbations of the input signal, usually imperceptible to the human eye, are enough to induce large changes in the model output [17]. This phenomenon has raised concerns about the safety of deep learning in adversarial environments, where malicious attackers may significantly degrade the performance of deep-learning-based applications. To mitigate the harm caused by adversarial examples, numerous defenses have been proposed, including pre-processing-based [6], modified-network-based [3], and detection-based [13] methods. Among these, adversarial training [12] remains one of the most powerful defenses against adversarial attacks, since [2] broke a set of purportedly robust defenses with adaptive attacks.

Fig. 1.

(CIFAR-10) Visualization of the FGSM and PGD robustness of models trained with FAST-FGSM AT (dashed) and FGSMPR AT (solid). All statistics are evaluated against FGSM attacks and 50-step PGD attacks with 10 random restarts on the test set. FAST-FGSM AT undergoes catastrophic overfitting around 180 epochs, characterized by a sudden drop in PGD robustness and a rapid increase in FGSM robustness. FGSMPR AT (ours) does not suffer from catastrophic overfitting and maintains stable robustness throughout training.

Adversarial Training (AT) augments each mini-batch of training data with adversarial examples in order to learn a robust model. It is generally considered more expensive than standard training because adversarial examples must be constructed with first-order methods such as projected gradient descent (PGD). To combat the increased computational overhead of PGD AT, a recent line of work has focused on improving the efficiency of AT. [20] proposed to speed up multi-step PGD adversarial attacks by chopping off redundant computations during backpropagation when computing adversarial examples. [15] proposed a variant of K-step PGD AT, called "FREE AT", with an overhead close to that of single-step Fast Gradient Sign Method (FGSM) AT: it updates the model weights and the input perturbations simultaneously with a single backpropagation, which is much cheaper than PGD AT. Inspired by [15], [19] found that the previously non-robust FGSM AT, with a random initialization, can reach robustness similar to PGD AT; this variant is called "FAST-FGSM AT". However, FGSM-based AT suffers from catastrophic overfitting, where the robustness against PGD attacks increases in the early stage of training but suddenly drops to 0 within a single epoch, as shown in Fig. 1. Several methods have been proposed to prevent this overfitting [1, 11, 18, 19]. However, these methods are either computationally inefficient or reduce robust accuracy.

In this paper, we first analyze why FGSM AT suffers from catastrophic overfitting during training. We observe that FGSM AT is prone to learning spurious functions that excessively fit the FGSM adversarial data distribution but have undefined behavior off the FGSM adversarial data manifold. We then examine the difference between the logit outputs for FGSM and PGD adversarial examples of models trained with FGSM AT and PGD AT, and show that the logits become significantly different when the FGSM AT model suffers from overfitting, while they remain stable for the robust model trained with PGD AT. We additionally provide an experimental analysis that helps explain why the FGSM AT model produces vastly different logit outputs for single-step and multi-step adversarial examples once catastrophic overfitting occurs. Finally, we propose a novel Fast Gradient Sign Method with PGD regularization (FGSMPR), in which a PGD regularization term encourages the model to learn logits that are a function of the truly robust features in the image and to ignore spurious features, thus preventing catastrophic overfitting, as shown in Fig. 1.

The contributions of this paper are summarized as follows:

  • We analyze why FGSM AT suffers from catastrophic overfitting and demonstrate that the logit distributions of the FGSM AT trained model for FGSM and PGD adversarial examples differ significantly when catastrophic overfitting occurs.

  • We propose a Fast Gradient Sign Method with PGD regularization (FGSMPR), which effectively prevents FGSM AT from catastrophic overfitting by explicitly minimizing the difference between the model's logits for FGSM and PGD adversarial examples.

  • Extensive experiments show that FGSMPR can learn a robust model comparable to PGD AT with low computational overhead while avoiding catastrophic overfitting. Specifically, FGSMPR takes only 30 min to train a CIFAR-10 model with 46% robustness against 50-step PGD attacks.

Fig. 2.

(CIFAR-10) Visualization of the FGSM and PGD accuracy/loss of models trained with FGSM AT, FAST-FGSM AT, and PGD-7 AT, evaluated against FGSM attacks and 50-step PGD attacks with 10 random restarts during training. All results are averaged over three independent runs. FGSM AT and FAST-FGSM AT undergo catastrophic overfitting around 30 and 180 epochs, respectively, characterized by a sudden drop in PGD accuracy and FGSM loss and a rapid increase in PGD loss and FGSM accuracy.

2 Related Work

2.1 Adversarial Training

Previous work [12] formalized the training of adversarial robust model into the following non-convex non-concave min-max robust optimization problem:

$$\begin{aligned} \min _{\theta } \mathbb {E}_{(x, y) \sim \mathcal {D}}[\max _{\delta \in \mathcal {S}} \mathcal {L}(\theta , x+\delta , y)]. \end{aligned}$$
(1)

The network parameters \(\theta \) are learned by solving Eq. 1 over examples \((x, y) \sim \mathcal {D}\), where \(\mathcal {D}\) is the data-generating distribution. \(\mathcal {S}\) denotes the region within the \(\epsilon \) perturbation range under the \(\ell _{\infty }\) threat model for each example, i.e., \(\mathcal {S}=\{\delta :\Vert \delta \Vert _{\infty } \le \epsilon \}\), which is usually chosen so that it contains only visually imperceptible perturbations. The procedure for AT is to use adversarial attacks to approximate the inner maximization over \(\mathcal {S}\).

FGSM AT. One of the earliest versions of AT used the FGSM attack to find adversarial examples \(x'\) to approximate the internal maximization, formalized as follows [5]:

$$\begin{aligned} x' = x + \epsilon \cdot {\text {sign}}(\nabla _{x} \mathcal {L}(\theta , x, y)). \end{aligned}$$
(2)

FGSM AT is cheap since it requires only a single gradient computation. However, models trained with FGSM AT are easily defeated by multi-step adversarial attacks.
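As a concrete illustration, the following PyTorch-style sketch implements the FGSM attack of Eq. 2; the function name fgsm_attack, the use of cross-entropy as \(\mathcal {L}\), and the [0, 1] pixel range are our assumptions rather than details from the original.

```python
import torch
import torch.nn.functional as F

def fgsm_attack(model, x, y, epsilon):
    """Single-step FGSM attack (Eq. 2): x' = x + eps * sign(grad_x L(theta, x, y))."""
    x_adv = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x_adv), y)
    grad = torch.autograd.grad(loss, x_adv)[0]
    # Step in the direction of the sign of the loss gradient, then clamp to the valid pixel range.
    x_adv = x_adv.detach() + epsilon * grad.sign()
    return torch.clamp(x_adv, 0.0, 1.0)
```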

PGD AT. PGD attacks [12] use multi-step projected gradient descent to approximate the inner maximization, which is more accurate than FGSM but computationally expensive, formalized as follows:

$$\begin{aligned} x^{t+1}&= \varPi _{x+\mathcal {S}}\left( x^{t}+\alpha {\text {sign}}\left( \nabla _{x} \mathcal {L}(\theta , x^{t}, y)\right) \right) , \end{aligned}$$
(3)

where \(x^{0}\) is initialized as the clean input x and \(\varPi \) is the projection operator, which projects the adversarial example back onto the ball of radius \(\epsilon \) around the clean data point. The number of iterations K in a PGD attack (PGD-K) determines both the strength of the attack and its computational cost. Further, N random restarts are usually employed to verify robustness under stronger attacks (PGD-K-N).
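A minimal PyTorch-style sketch of the PGD-K attack in Eq. 3 is given below; the random start inside the \(\epsilon \)-ball (a single restart), the helper name pgd_attack, and the [0, 1] input range are assumptions on our part.

```python
import torch
import torch.nn.functional as F

def pgd_attack(model, x, y, epsilon, alpha, num_steps, random_start=True):
    """PGD-K attack (Eq. 3) under the l_inf threat model."""
    x_adv = x.clone().detach()
    if random_start:
        # Start from a random point inside the epsilon-ball (a single "restart").
        x_adv = torch.clamp(x_adv + torch.empty_like(x_adv).uniform_(-epsilon, epsilon), 0.0, 1.0)
    for _ in range(num_steps):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        x_adv = x_adv.detach() + alpha * grad.sign()
        # Projection Pi: keep the perturbation inside the epsilon-ball around x,
        # and keep the example inside the valid pixel range.
        x_adv = torch.clamp(torch.min(torch.max(x_adv, x - epsilon), x + epsilon), 0.0, 1.0)
    return x_adv
```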

2.2 Single-Step Adversarial Training

FREE AT [15], a single-step training method that generates adversarial examples while updating the network weights, is quite similar to FGSM AT. By carefully analyzing the differences between FREE AT and FGSM AT, [19] found that an important property of FREE AT is that the perturbation from the previous gradient sign is used as the initial perturbation of the next iteration. Based on this observation, [19] proposed FAST-FGSM AT, which achieves almost the same robustness as PGD AT with training time close to that of standard training, by adding a non-zero random initialization to the FGSM perturbation and further combining standard techniques [14, 16] to accelerate training. However, FAST-FGSM AT suffers from catastrophic overfitting, where the robustness against PGD adversarial examples suddenly drops to 0% within a single epoch.

A recent line of work focuses on addressing the catastrophic overfitting problem in single-step AT. [19] used early stopping to halt training when the model robustness decreases beyond a threshold. [18] introduced dropout layers after each non-linear layer of the model and decayed the dropout probability as training progresses. [11] monitored the FGSM AT process and performed PGD AT on a few batches to help the FGSM model recover its robustness when the robustness decreases beyond a threshold. [1] proposed the Gradient Alignment (GradAlign) regularization term, which maximizes gradient alignment based on the connection between FAST-FGSM AT overfitting and local linearization of the model, to prevent catastrophic overfitting. Although these methods provide a better understanding of how to prevent catastrophic overfitting, they still cannot fundamentally explain it. Moreover, while they improve the robustness of single-step AT models to some extent, they incur substantial computational overhead and lose the efficiency advantage of single-step AT, in some cases approaching the training time of multi-step AT.

3 Proposed Approach

3.1 Observation

To investigate catastrophic overfitting, we begin by recording the robust accuracy of FGSM AT on CIFAR-10 [9]. We evaluate the robust accuracy of the model against 50-step PGD attacks with 10 random restarts (PGD-50-10) for step size \(\alpha = 2/255\) and maximum perturbation \(\epsilon =8/255\). Figure 2 visualizes the accuracy and loss of models trained with FGSM AT, FAST-FGSM AT, and PGD-7 AT, evaluated against FGSM and PGD-50-10 attacks during training. As we can see, when FGSM AT and FAST-FGSM AT undergo catastrophic overfitting around 30 and 180 epochs, respectively, their robustness against the PGD-50-10 attack suddenly drops, whereas their accuracy against FGSM increases rapidly. In contrast, for the robust PGD-7 AT, the accuracy and loss of the model stabilize after a certain number of epochs.

We maintain that models trained with FGSM AT suffer from catastrophic overfitting because they are prone to learning spurious functions that fit the FGSM data distribution but have undefined behavior off the FGSM data manifold. FGSM AT is therefore highly susceptible to overfitting to the single-step adversarial perturbation, resulting in a sudden drop in the model's PGD robustness while its FGSM accuracy increases almost instantaneously. To study the differences in behavior of models trained with FGSM AT and PGD-7 AT when evaluated on FGSM and PGD adversarial examples, we utilize a distance function \(\mathcal {L}\) to measure the difference between the model outputs under single-step and multi-step adversarial attacks. For a model that takes input x and outputs logits f(x), we have:

$$\begin{aligned} \mathcal {L} (f({x^{fgsm}}), f({x^{pgd}})), \end{aligned}$$
(4)

where \(x^{fgsm}\) and \(x^{pgd}\) are adversarial examples crafted by FGSM and PGD-7, respectively. Here, we choose the \(L_{2}\) distance for \(\mathcal {L}\). For a well-generalized and robust model, we expect the logits \(f(x^{fgsm})\) and \(f(x^{pgd})\) obtained on FGSM and PGD adversarial examples to be as similar as possible, i.e., \(||f(x^{fgsm})-f(x^{pgd})||_{2}\) should be very small.
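For illustration, the diagnostic of Eq. 4 could be computed as in the following sketch, which reuses the hypothetical fgsm_attack and pgd_attack helpers sketched in Sect. 2.1 and averages the per-example \(L_2\) distance over a batch.

```python
import torch

def logit_gap(model, x, y, epsilon, alpha, pgd_steps=7):
    """Average L2 distance between the model's logits on FGSM and PGD-7
    adversarial examples (Eq. 4), used purely as a diagnostic."""
    x_fgsm = fgsm_attack(model, x, y, epsilon)                      # single-step examples
    x_pgd = pgd_attack(model, x, y, epsilon, alpha, pgd_steps)      # multi-step examples
    with torch.no_grad():
        gap = torch.norm(model(x_fgsm) - model(x_pgd), p=2, dim=1)  # per-example L2 distance
    return gap.mean().item()
```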

To verify our intuition, we first train several CIFAR-10 models using FGSM AT and PGD-7 AT for 200 epochs. For each model, we compute the difference between the model outputs on FGSM and PGD-7 adversarial examples using Eq. 4 and apply a logarithmic transformation to visualize the differences more clearly, as shown in Fig. 3. In Fig. 3(b), it can be observed that there is no significant difference between the logits on FGSM and PGD adversarial examples during the early phase of training, which matches our intuition. Once catastrophic overfitting occurs, the gap between the logits under single-step and multi-step adversarial attacks increases rapidly, around 30 and 180 epochs respectively, which is consistent with the PGD loss. In contrast, PGD-7 AT does not suffer from catastrophic overfitting and its logit difference remains stable. This phenomenon also appears on the simpler MNIST dataset [10], though less clearly than on CIFAR-10, as shown in Fig. 3(a).

Fig. 3.

Visualization of the \(L_{2}\) distance between the logits on FGSM and PGD adversarial examples for models trained with FGSM AT, FAST-FGSM AT, and PGD-7 AT. (a) On MNIST, when the model is not robust, the \(L_{2}\) distance starts to fluctuate, while that of PGD AT remains relatively smooth. (b) On CIFAR-10, FGSM AT and FAST-FGSM AT undergo catastrophic overfitting around 30 and 180 epochs, respectively, characterized by a rapid increase of the \(L_2\) distance.

3.2 PGD Regularization

Based on the analysis in Sect. 3.1, the FGSM adversarial loss alone is not enough for the model to learn features that are robust to both single-step and multi-step adversarial examples. To solve this problem, inspired by [8], we use logit pairing to encourage the model to learn robust internal representations of FGSM and PGD adversarial examples, so that the logit outputs \(f(x^{fgsm})\) and \(f(x^{pgd})\) of the model for FGSM and PGD adversarial examples are as similar as possible:

$$\begin{aligned} {} \lambda \frac{1}{m} \sum _{i = 1}^{m} \mathcal {L}(f(x_{i}^{fgsm} ; \theta ), f({x}_{i}^{pgd};\theta )), \end{aligned}$$
(5)

where \(\mathcal {L}\) is the \(L_{2}\) norm; \(x_{i}^{fgsm}\) and \({x}_{i}^{pgd}\) are adversarial examples crafted by FGSM and PGD attacks, respectively; and \(\lambda \) is a hyperparameter that balances the FGSM loss and the PGD regularization term. Combined with the proposed regularization, FGSM AT can learn a robust model comparable to PGD-7 AT, as validated in Sect. 4.
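A minimal sketch of the regularization term in Eq. 5 is given below; the exact batch-averaging convention is our assumption, as the paper specifies only the \(L_{2}\) norm and the factor \(\lambda /m\).

```python
import torch

def pgd_regularization(model, x_fgsm, x_pgd, lam):
    """PGD regularization term (Eq. 5):
    lambda * (1/m) * sum_i ||f(x_i^fgsm) - f(x_i^pgd)||_2."""
    # Per-example L2 distance between the two logit vectors, averaged over the m examples.
    gap = torch.norm(model(x_fgsm) - model(x_pgd), p=2, dim=1)
    return lam * gap.mean()
```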

We hold that PGD regularization works well because it provides an additional prior that regularizes the model toward a more accurate understanding of adversarial examples. If we train the model with only the single-step FGSM adversarial loss, it is prone to learn spurious functions that excessively fit the FGSM adversarial data distribution but have undefined behavior off the FGSM data manifold (e.g., multi-step adversarial examples). PGD regularization forces the explanations of the FGSM adversarial example and multi-step adversarial example to be similar. This is essentially a prior encouraging the model to learn logits that are a function of the truly significant features in the image and ignore the spurious features.

3.3 Training Route

The overall training procedure of FGSMPR AT is summarized in Algorithm 1. We first perform an FGSM adversarial attack to generate FGSM adversarial examples \(x^{fgsm}_{i}\) and compute the FGSM AT loss \(fgsm\_loss\) using cross-entropy. Then, we perform a PGD adversarial attack on m examples from the batch of natural examples to generate m PGD adversarial examples. After generating the FGSM and PGD adversarial examples, the regularization loss \(reg\_loss\) over the m FGSM and PGD adversarial examples is calculated using Eq. 5 and added to the total loss \(total\_loss\). Finally, the parameters \(\theta \) of the model are updated using a suitable optimizer (e.g., stochastic gradient descent). The hyperparameter \(\lambda \) must be chosen to balance the FGSM loss \(fgsm\_loss\) and the PGD regularization term \(reg\_loss\). In practice, we take \(\alpha =\epsilon /K\), \(K = 3\) and \(m = 1\). In other words, we pick only a single example from each batch to generate a PGD-3 adversarial example, which is then used in the regularization to encourage the model to learn similar logit outputs. The experiments show that a single PGD adversarial example for regularization is enough to learn a robust model.

Algorithm 1. The overall training procedure of FGSMPR AT.
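Since Algorithm 1 is presented as a figure, the following PyTorch-style sketch of one FGSMPR training step reflects our reading of the description above: the FGSM loss is computed on the whole batch, the PGD-K attack is run on only m examples, and the regularization of Eq. 5 pairs those m FGSM and PGD adversarial examples. The helper names, the choice of the first m examples of the batch, and the reuse of fgsm_attack, pgd_attack, and pgd_regularization from the earlier sketches are assumptions.

```python
import torch
import torch.nn.functional as F

def fgsmpr_train_step(model, optimizer, x, y, epsilon, lam, K=3, m=1):
    """One FGSMPR training step: FGSM adversarial loss on the full batch plus the
    PGD logit-pairing regularization of Eq. 5 on m examples (m = 1 in the paper)."""
    # FGSM adversarial examples and loss for the whole batch.
    x_fgsm = fgsm_attack(model, x, y, epsilon)
    fgsm_loss = F.cross_entropy(model(x_fgsm), y)

    # PGD-K adversarial examples for only the first m examples of the batch
    # (which m examples are chosen is our assumption), with alpha = epsilon / K.
    x_pgd = pgd_attack(model, x[:m], y[:m], epsilon, alpha=epsilon / K, num_steps=K)

    # Logit-pairing regularization between the corresponding FGSM and PGD examples.
    reg_loss = pgd_regularization(model, x_fgsm[:m], x_pgd, lam)

    total_loss = fgsm_loss + reg_loss
    optimizer.zero_grad()
    total_loss.backward()
    optimizer.step()
    return total_loss.item()
```

Under the paper's CIFAR-10 settings, one would presumably call this inside the usual training loop with epsilon=8/255 and lam=0.5.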

4 Experiments

In this section, we demonstrate that the proposed FGSMPR is robust against strong PGD attacks. All experiments are run on a single RTX 2070. We use the half-precision computation recommended in [19] to speed up the training of CIFAR-10 models, implemented with the Apex amp package at the O1 optimization level for all CIFAR-10 experiments.

Attacks: We attack all models using PGD attacks with K iterations and 10 random restarts on both the cross-entropy loss (PGD-K-10) and the Carlini-Wagner loss (CW-K-10) [4]. All PGD attacks used at evaluation for MNIST [10] are run with 10 random restarts for 20/40 iterations. All PGD attacks used at evaluation for CIFAR-10 [9] are run with 10 random restarts for 20/50 steps.

Perturbation: For MNIST, we set the maximum perturbation \(\epsilon \) to 0.3 and the PGD step size \(\alpha \) to 0.1. For CIFAR-10, we set the maximum perturbation \(\epsilon \) to 8/255 and the PGD step size \(\alpha \) to 2/255.

Comparisons: We compare the performance of our proposed method (FGSMPR) with FGSM: standard FGSM AT [5]; FAST-FGSM AT: FGSM AT with a random initialization [19]; FREE AT: a recently proposed single-step AT method [15]; GradAlign AT: a recently proposed method for preventing catastrophic overfitting [1]; PGD-K AT: AT with a K-iteration PGD attack [12].

Evaluation: We evaluate the performance of models against PGD-K-10/CW-K-10 adversarial attacks under the white-box setting. For all experiments, the mean and standard deviation over three independent runs are reported.

Table 1. Validation accuracy (%) and robustness of MNIST models trained with FGSM AT, FAST-FGSM AT, GradAlign AT, FREE AT, PGD-40 AT, FGSMPR AT without early stopping and the corresponding training time. All statistics are evaluated against PGD/CW attacks with 20/40 iterations and 10 random restarts for \(\alpha =0.1\), \(\epsilon =0.3\) over three independent runs. The bold indicates the best performance except for PGD-40 AT.
Table 2. Validation accuracy (%) and robustness of CIFAR-10 models trained with FGSM AT, FAST-FGSM AT, GradAlign AT, FREE AT, PGD-7 AT, FGSMPR AT without early stopping and the corresponding training time. All statistics are evaluated against PGD/CW attacks with 20/50 iterations and 10 random restarts for \(\alpha =2/255\), \(\epsilon =8/255\) over three independent runs. The bold indicates the best performance except for PGD-7 AT.

4.1 Results on MNIST

First, we conduct a study to demonstrate that our proposed approach is highly effective on the MNIST benchmark dataset [10]. We train MNIST models with the same architecture used by [19], using FGSM AT, FAST-FGSM AT, FREE AT, PGD-40 AT, and FGSMPR AT. Except for FREE AT, which replays each batch \(m=8\) times for a total of 7 epochs, all models are trained for 50 epochs. For the proposed method, we set the hyperparameters \(\lambda \), K and m to (0.1, 3, 1). The experimental results are provided in Table 1. It can be observed that our proposed FGSMPR AT is more robust against both PGD and CW attacks on the MNIST dataset than GradAlign AT and FREE AT, and is second only to the PGD AT model by a small margin. While testing the robustness of FAST-FGSM AT on MNIST, we found an interesting phenomenon: increasing the number of MNIST training epochs to 50 also results in catastrophic overfitting, although this had previously been observed only on CIFAR-10. In addition, GradAlign AT [1] can keep the model from catastrophic overfitting to some extent, but it is far inferior to the other comparison methods in defending against higher-iteration adversarial attacks.

Fig. 4.

Visualization of the accuracy of CIFAR-10 models trained with FGSM AT, FAST-FGSM AT, GradAlign AT, FREE AT, PGD-7 AT, and FGSMPR AT. All statistics are evaluated against 50-step PGD attacks with 10 random restarts for \(\alpha =2/255\), \(\epsilon =8/255\). Catastrophic overfitting for FGSM AT and FAST-FGSM AT occurs around 30 and 180 epochs, respectively, characterized by a sudden drop in PGD accuracy.

4.2 Results on CIFAR-10

To verify whether each AT scheme suffers from catastrophic overfitting, we train all CIFAR-10 models for 200 epochs using the PreAct ResNet-18 [7] architecture without early stopping, except for FREE AT, which replays each batch \(m=8\) times for a total of 25 epochs as recommended in [15]. For FGSMPR, we set the hyperparameters \(\lambda \), K and m to (0.5, 3, 1). The experimental results are provided in Table 2. It can be observed that FGSMPR AT performs quite similarly to PGD-7 AT while requiring half the training time. To demonstrate that the proposed FGSMPR does not suffer from catastrophic overfitting, we train a CIFAR-10 model for the full 200 epochs, which takes 211 min, longer than FREE AT. However, our method reaches 46% robustness after only 30 epochs in just 30 min, less than half the time of FREE AT. Furthermore, we visualize the robustness during training for the different AT methods, tested against a PGD-50 attack with 10 random restarts, as shown in Fig. 4. It can be observed that the robustness of FGSMPR against PGD increases steadily, lags PGD-7 AT by only 0.8%, and does not suffer from catastrophic overfitting even when trained to 200 epochs. In contrast, FAST-FGSM AT initially follows a trend similar to PGD AT, but its robustness drops sharply around 180 epochs when catastrophic overfitting occurs. GradAlign AT was proposed to prevent FGSM AT from catastrophic overfitting, but its accuracy still drops by more than 10% and it takes more than twice as long as our FGSMPR AT. Besides, we also test the models' robustness under different \(l_{\infty }\) perturbations, where all models are trained with early stopping. For larger \(l_{\infty }\) perturbations, FGSMPR AT is essentially indistinguishable from PGD-7 AT, and even slightly better, as shown in Fig. 5.

Fig. 5.

Accuracy of models trained with FGSM AT, FAST-FGSM AT, GradAlign AT, FREE AT, PGD-7 AT, and FGSMPR AT with early stopping. All statistics are evaluated against 50-step PGD attacks with 10 random restarts for different \(l_{\infty }\)-perturbations \(\epsilon \).

5 Conclusion

In this paper, we analyze and address catastrophic overfitting in FGSM AT. We empirically show that FGSM AT is prone to learning spurious functions that fit the FGSM adversarial data distribution but have undefined behavior off the FGSM data manifold. We then examine the difference between the logits for FGSM and PGD adversarial examples of models trained with FGSM AT and PGD AT, and show that the logits become significantly different when FGSM AT suffers from overfitting, while those of PGD AT remain stable. Based on these observations, we propose a novel FGSMPR AT, in which a PGD regularization term encourages the model to learn similar embeddings for FGSM and PGD adversarial examples. Extensive experiments show that FGSMPR can effectively keep FGSM AT from catastrophic overfitting at a low computational cost.