1 Introduction

Deep neural networks (DNNs) are well known to be vulnerable to imperceptibly small yet highly malicious artificial perturbations of their input, i.e., adversarial examples (Szegedy et al., 2014). This lack of robustness causes a crisis of security and trustworthiness for applications built on DNNs and thus hinders their deployment in the real world, especially in critical domains like healthcare (Qiu et al., 2023). Thus far, adversarial training (AT) has been the most effective defense against adversarial attacks (Athalye et al., 2018). AT is typically formulated as a min-max optimization problem:

$$\begin{aligned} \arg \min _{\varvec{\theta }} \mathbb {E}\left[ \max _{\varvec{\delta }} \mathcal {L}(\varvec{x}+\varvec{\delta }; \varvec{\theta })\right] \end{aligned}$$
(1)

where the inner maximization searches for the perturbation \(\varvec{\delta }\) to maximize the loss, while the outer minimization searches for the model parameters \(\varvec{\theta }\) to minimize the loss on the perturbed examples.
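To make the inner maximization concrete, the following is a minimal PyTorch sketch of the \(\ell _{\infty }\) PGD attack used for AT later in this paper (Sect. 4: 10 steps, step size 2/255, budget 8/255). The function name and interface are illustrative, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def pgd_attack(model, x, y, eps=8/255, alpha=2/255, steps=10):
    """l_inf PGD: approximate the inner maximization of Eq. (1)."""
    # Random start inside the eps-ball, clipped to the valid image range.
    delta = torch.empty_like(x).uniform_(-eps, eps)
    delta = (x + delta).clamp(0, 1) - x
    for _ in range(steps):
        delta.requires_grad_(True)
        loss = F.cross_entropy(model(x + delta), y)
        grad, = torch.autograd.grad(loss, delta)
        with torch.no_grad():
            # Ascend the loss along the sign of the gradient, then project back.
            delta = (delta + alpha * grad.sign()).clamp(-eps, eps)
            delta = (x + delta).clamp(0, 1) - x
    return (x + delta).detach()
```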

One major issue of AT is that it is prone to overfitting (Rice et al., 2020; Wong et al., 2020). Unlike in standard training (ST), overfitting in AT, a.k.a. robust overfitting (Rice et al., 2020), significantly impairs adversarial robustness. Many efforts (Li & Spratling, 2023b; Wu et al., 2020; Dong et al., 2022; Liu et al., 2023; Liu & Satoh, 2023) have been made to understand robust overfitting and mitigate its effect. One promising solution is data augmentation (DA), which is a common technique to prevent ST from overfitting. However, many studies (Rice et al., 2020; Wu et al., 2020; Gowal et al., 2021; Rebuffi et al., 2021) have revealed that advanced DA methods, originally proposed for ST, often fail to improve adversarial robustness. Therefore, DA is usually combined with other regularization techniques such as Stochastic Weight Averaging (SWA) (Rebuffi et al., 2021), Consistency regularization (Tack et al., 2022) and Separate Batch Normalization (Addepalli, Jain, and Radhakrishnan, 2022) to improve its effectiveness. However, recent work (Li & Spratling, 2023c) demonstrated that DA alone can significantly improve AT if it has strong diversity and well-balanced hardness. This suggests that ST and AT may require different DA strategies, especially in terms of hardness. It is thus necessary to design DA schemes dedicated to AT.

IDBH (Li & Spratling, 2023c) is the latest DA scheme specifically designed for AT. Despite its impressive robust performance, IDBH relies on a heuristic search to manually optimize the DA policy. This search requires a complete run of AT for every sampled policy, which incurs prohibitive computational cost and scales poorly to large datasets and models. Hence, when the computational budget is limited, the hyperparameters of IDBH may have to be found using a reduced search space and a smaller model, leading to compromised performance.

Fig. 1 An overview of the proposed method (legend in the right column). The top part shows the pipeline for training the policy model, \(f_{plc}\), while the bottom illustrates the pipeline for training the target model, \(f_{tgt}\). \(f_{aft}\) is a model pre-trained on clean data without any augmentation, which is used to measure the distribution shift caused by data augmentation. Please refer to Sect. 3 for a detailed explanation

Fig. 2 An example of the proposed augmentation sampling procedure. The policy model takes an image as input and outputs logit values defining multiple multinomial probability distributions corresponding to different sub-policies. A sub-policy code is created by sampling from each of these distributions and decoded into a sub-policy, i.e., a transformation and its magnitude. These transformations are applied, in sequence, to augment the image

Another issue is that IDBH, in common with other conventional DA methods such as AutoAugment (Cubuk et al., 2019) and TrivialAugment (Müller & Hutter, 2021), applies the same strategy to all samples in the dataset throughout training. The distinctions between different training samples, and between model checkpoints at different stages of training, are neglected. We hypothesize that different data samples at the same stage of training, as well as the same sample at different stages of training, demand different DAs. Hence, we conjecture that an improvement in robustness could be realized by customizing DA for individual data samples and training stages.

To address the above issues, this work proposes a bi-level optimization framework (see Fig. 1) to automatically learn Adversarial Robustness by Online Instance-wise Data-augmentation (AROID). To the best of our knowledge, AROID is the first automated DA method specific to adversarial robustness. AROID employs a multi-head DNN-based policy model to map a data sample to a DA policy (see Fig. 2). This DA policy is defined as a sequence of pre-defined transformations applied with strengths determined by the output of the policy model. The policy model is optimized, alongside the training of the target model, towards three novel objectives to achieve a target level of hardness and diversity. DA policies, therefore, are customized for each data instance and evolve with the target network as training progresses. In practice, this yields DA policies closer to the global optimum and thus benefits robustness. Importantly, the proposed policy learning objectives, in contrast to conventional ones like validation accuracy (Cubuk et al., 2019), do not reserve a subset of the training data for validation and do not rely on prohibitively expensive inner loops that train the target model to evaluate the rewards of sampled policies. The former ensures the entire training set is available for training, avoiding potential data scarcity. The latter makes policy optimization much more efficient and scalable, and hence more practical for AT. Compared to IDBH in particular, this allows our approach to explore a larger space of DAs. As an example, optimizing the DA for PRN18 on CIFAR10 took AROID 9 h on an A100 GPU, whereas IDBH took 412 h on an A100 GPU and AutoAugment took 5000 h on a P100 GPU (Hataya et al., 2020).

Extensive experiments show that AROID outperforms all competitive DA methods across various datasets and model architectures while being more efficient than the previous best method (IDBH). AROID achieves state-of-the-art robustness for DA methods on the standard benchmarks. Moreover, AROID outperforms state-of-the-art AT methods in terms of both accuracy and robustness. It also complements such robust training methods and can be combined with them to improve robustness further.

2 Related Work

Robust training. To mitigate overfitting in AT, many methods other than DA have been proposed. One line of work, including IGR (Ross & Doshi-Velez, 2018), CURE (Moosavi-Dezfooli et al., 2019) and AdvLC (Li & Spratling, 2023b), identified a connection between adversarial vulnerability and the smoothness of the input loss landscape, and promoted robustness by smoothing the input loss landscape. Meanwhile, Wu et al. (2020) and Chen et al. (2021) found that robust generalization can be improved by a flat weight loss landscape and proposed AWP and SWA, respectively, to smooth the weight loss landscape during AT. RWP (Yu et al., 2022) and SEAT (Wang & Wang, 2022) were later proposed to refine AWP and SWA, respectively, to further increase robustness. SCARL (Kuang et al., 2023) incorporated semantic information into adversarial training. IBD (Kuang et al., 2023) distilled prior knowledge from a robust pre-trained model to enhance adversarial robustness. Many works, including MART (Wang et al., 2020), LAS-AT (Jia et al., 2022) and ISEAT (Li & Spratling, 2023a), considered the difference between individual training instances and improved AT through regularizing in an instance-wise manner. Our proposed approach is also instance-wise but, in contrast to existing methods, tackles robust overfitting via DA instead of robust regularization. As shown in Sect. 4.5, it works well alone and, more importantly, complements the above techniques.

Data augmentation for ST. Although DA is common practice in many fields, we review only vision-based DA in this section as it is most relevant to our work. In computer vision, DA methods can be broadly categorized as basic, composite or mixup. Basic augmentations are image transformations that can be applied independently. They mainly include crop-based (Random Crop (He et al., 2016a), Cropshift (Li & Spratling, 2023c), etc.), color-based (Brightness, Contrast, etc.), geometric (Rotation, Shear, etc.) and dropout-based (Cutout (DeVries & Taylor, 2017), Random Erasing (Zhong et al., 2020), etc.) transformations. Composite augmentations denote compositions of basic augmentations. Augmentations are composed into a single policy/schedule usually in one of two ways: interpolation (Hendrycks et al., 2020; Wang et al., 2021) or sequencing (Cubuk et al., 2019, 2020; Müller & Hutter, 2021). MixUp (Zhang et al., 2017), and analogous works like CutMix (Yun et al., 2019), can be considered a special case of interpolation-based composition that combines a pair of different images (rather than augmentations), together with their labels, to create a new image and its label.

Composite augmentations by design have many hyperparameters to optimize. Most previous works, including the pioneering AutoAugment (Cubuk et al., 2019), tackled this issue using automated machine learning (AutoML). DA policies were optimized towards maximizing validation accuracy (Cubuk et al., 2019; Lin et al., 2019; Li et al., 2020; Liu et al., 2021), maximizing training loss (Zhang et al., 2020) or matching the distribution density between the original and augmented data (Lim et al., 2019; Hataya et al., 2020). Optimization here is particularly challenging since DA operations are usually non-differentiable. The main solutions estimate the gradient of the DA learning objective w.r.t. the policy generator or the DA operations using, e.g., policy gradient methods (Cubuk et al., 2019; Zhang et al., 2020; Lin et al., 2019) or the reparameterization trick (Li et al., 2020; Hataya et al., 2020). Alternative optimization techniques include Bayesian optimization (Lim et al., 2019) and population-based training (Ho et al., 2019). Notably, several works such as RandAugment (Cubuk et al., 2020) and TrivialAugment (Müller & Hutter, 2021) found that, if the augmentation space and schedule were appropriately designed, competitive results could be achieved using a simple hyperparameter grid search or fixed hyperparameters. This implies that in ST these advanced yet complicated methods may not be necessary. However, it remains an open question whether simple search can match these advanced optimization methods in AT. Instance-wise DA strategies have also been explored for ST (Cheung & Yeung, 2022; Miao et al., 2023). Our method is the first automated DA approach specific to AT. We follow the line of policy gradient methods to learn DA policies. A key distinction is that our policy learning objective is designed to guide the learning of DA policies towards improved robustness in AT, whereas the objective of the above methods is to increase accuracy in ST.

3 Method

We propose a method to automatically learn DA alongside AT to improve robust generalization. An instance-wise DA policy is produced by a policy model and learned by optimizing the policy model towards three novel objectives. Updating of the policy model and the target model (the one being adversarially trained for the target task) alternates throughout training (the policy model is updated every K updates of the target model), yielding an online DA strategy. This online, instance-adaptive, strategy produces different augmentations for different data instances at different stages of training.

The following notation is used. \(\varvec{x} \in \mathbb {R}^d\) is a d-dimensional sample whose ground truth label is y. \(\varvec{x}_i\) refers to i-th sample in a dataset. The model is parameterized by \(\varvec{\theta }\). \(\mathcal {L}(\varvec{x}, y; \varvec{\theta })\) or \(\mathcal {L}(\varvec{x}; \varvec{\theta })\) for short denotes the predictive loss evaluated with \(\varvec{x}\) w.r.t. the model \(\varvec{\theta }\) (Cross-Entropy loss was used in all experiments). \(\rho (\varvec{x}; \varvec{\theta })\) computes the adversarial example of \(\varvec{x}\) w.r.t. the model \(\varvec{\theta }\). \(p_i(\varvec{x}; \varvec{\theta })\) or \(p_i\) for short refers to the output of the Softmax function applied to the final layer of the model, i.e., the probability at i-th logit given the input \(\varvec{x}\).

3.1 Modeling the DA Policy

Following the design of IDBH (Li & Spratling, 2023c) and TrivialAugment (Müller & Hutter, 2021), DA is implemented using four types of transformations: flip, crop, color/shape and dropout, applied in that order. We implement flip using HorizontalFlip, crop using Cropshift (Li & Spratling, 2023c), dropout using Erasing (Zhong et al., 2020), and color/shape using a set of operations including Color, Sharpness, Brightness, Contrast, Autocontrast, Equalize, Shear (X and Y), Rotate, Translate (X and Y), Solarize and Posterize. A dummy operation, Identity, is included in each augmentation group to allow data to pass through unchanged. More details, including the complete augmentation space, are given in Section A.

To customize the DA applied to each data instance individually, a policy model, parameterized by \(\varvec{\theta }_{plc}\), is used to produce a DA policy conditioned on the input data (see Fig. 2). The policy model employs a DNN backbone to extract features from the data, and multiple, parallel, linear prediction heads on top of the extracted features to predict the policy. The policy model used in this work has four heads corresponding to the four types of DA described above. The output of a head is converted into a multinomial distribution where each logit represents a pre-defined sub-policy, i.e., an augmentation operation associated with a strength/magnitude (e.g. ShearX, 0.1). Different magnitudes of the same operation are represented by different logits, so that each has its own chance of being sampled. A particular sequence of sub-policies to apply to the input image is selected based on the probabilities encoded in the four heads of the policy network.
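The following is a minimal PyTorch sketch of such a multi-head policy model and the sampling step of Fig. 2. The backbone, feature dimension and the number of sub-policies per head (head_sizes) are illustrative assumptions; the actual augmentation space is specified in Section A.

```python
import torch
import torch.nn as nn

class PolicyModel(nn.Module):
    """DNN backbone plus four parallel linear heads, one per augmentation group."""
    def __init__(self, backbone, feat_dim, head_sizes=(2, 10, 60, 12)):  # head_sizes is hypothetical
        super().__init__()
        self.backbone = backbone                      # e.g. a small PRN18 feature extractor
        self.heads = nn.ModuleList([nn.Linear(feat_dim, n) for n in head_sizes])

    def forward(self, x):
        feat = self.backbone(x)
        return [head(feat) for head in self.heads]    # one logit vector per augmentation group

def sample_policy(head_logits):
    """Sample one sub-policy code per head from the multinomial distributions."""
    codes, log_probs = [], []
    for logits in head_logits:
        dist = torch.distributions.Categorical(logits=logits)
        code = dist.sample()                          # index of an (operation, magnitude) sub-policy
        codes.append(code)
        log_probs.append(dist.log_prob(code))
    return codes, torch.stack(log_probs, dim=1)       # log-probs: (batch, number of heads)
```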

3.2 Objectives for Learning the Data Augmentation Policy

The policy model is trained using three novel objectives: (adversarial) Vulnerability, Affinity and Diversity. These objectives are designed to learn data augmentations with strong diversity and appropriate hardness: requirements that have been shown to be effective for adversarial training (Li & Spratling, 2023c).

3.2.1 Motivation

Intuitively, enhancing the diversity and hardness of data augmentation should help mitigate robust overfitting by increasing the complexity of the training data. Specifically, enhanced diversity increases the number of distinct data augmentations applied during training and expands the effective training set size (Gontijo-Lopes et al., 2021). Increasing hardness raises the difficulty level of the augmented data for the model to learn (adversarially), thereby reducing (robust) overfitting. However, if the hardness exceeds the level that the training model can fit, accuracy and even robustness will decline, despite the reduction in robust overfitting. Therefore, to maximize performance, hardness should be carefully adjusted to balance between reducing robust overfitting and improving overall performance. The optimal level of hardness should therefore be tailored to different models and training settings.

Understanding what kind of data augmentation is effective for adversarial training is not the focus of the current work so we refer the reader to (Li & Spratling, 2023c) for a formal quantitative definition of diversity and hardness, along with extensive experimental evidence supporting the above reasoning.

3.2.2 Objectives

Vulnerability measures the loss variation caused by adversarial perturbation on the augmented data w.r.t. the target model:

$$\begin{aligned} \mathcal {L}_{vul}(\varvec{x}; \varvec{\theta }_{plc})&= \mathcal {L}(\rho (\varvec{\hat{x}}; \varvec{\theta }_{tgt}); \varvec{\theta }_{tgt}) - \mathcal {L}(\varvec{\hat{x}}; \varvec{\theta }_{tgt}) \nonumber \\ \text {where}\ \varvec{\hat{x}}&= \Phi (\varvec{x}; S(\varvec{\theta }_{plc}(\varvec{x}))) \end{aligned}$$
(2)

\(\Phi (\varvec{x}; S(\varvec{\theta }_{plc}(\varvec{x})))\) augments \(\varvec{x}\) by \(S(\varvec{\theta }_{plc}(\varvec{x}))\), the augmentations sampled from the output distribution of the policy model conditioned on \(\varvec{x}\), so \(\varvec{\hat{x}}\) is the augmented data. A larger Vulnerability indicates that \(\varvec{x}\) becomes more vulnerable to adversarial attack after DA. A common belief about the relationship between training data and robustness is that AT benefits from adversarially hard samples (Madry et al., 2018; Li & Spratling, 2023c). From a geometric perspective, maximizing Vulnerability encourages the policy model to project data into regions of the input space that have so far been less robustified.

Nevertheless, the maximization of Vulnerability, if not constrained, would likely favor those augmentations producing samples far away from the original distribution. Training with such augmentations was observed to degrade accuracy and even robustness when accuracy is overly reduced (Li & Spratling, 2023c). Therefore, Vulnerability should be maximized while the distribution shift caused by augmentation is constrained:

$$\begin{aligned} \arg \max _{\varvec{\theta }_{plc}}\ \mathcal {L}_{vul}(\varvec{x}; \varvec{\theta }_{plc})\ \ \text {s.t.}\ ds(\varvec{x}, \varvec{\hat{x}}) \le D \end{aligned}$$
(3)

where \(ds(\cdot )\) measures the distribution shift between two samples and D is a constant. Directly solving Eq. (3) is intractable, so we convert it into an unconstrained optimization problem by adding a penalty on the distribution shift as:

$$\begin{aligned} \arg \max _{\varvec{\theta }_{plc}}\ \mathcal {L}_{vul}(\varvec{x}; \varvec{\theta }_{plc}) - \lambda \cdot ds(\varvec{x}, \varvec{\hat{x}}) \end{aligned}$$
(4)

where \(\lambda \) is a hyperparameter and a larger \(\lambda \) corresponds to a tighter constraint on distribution shift, i.e., smaller D. Distribution shift is measured using a variant of the Affinity metric (Gontijo-Lopes et al., 2021):

$$\begin{aligned} ds(\varvec{x}, \varvec{\hat{x}}) = \mathcal {L}_{aft}(\varvec{x}; \varvec{\theta }_{plc}) = \mathcal {L}(\varvec{\hat{x}}; \varvec{\theta }_{aft}) - \mathcal {L}(\varvec{x}; \varvec{\theta }_{aft}) \end{aligned}$$
(5)

Affinity captures the loss variation caused by DA w.r.t. a model \(\varvec{\theta }_{aft}\) (called the affinity model): a model pre-trained on the original data (i.e., without any data augmentation). Affinity increases as the augmentation proposed by the policy network makes data harder for the affinity model to correctly classify. By substituting Eq. (5) into Eq. (4), we obtain an adjustable Hardness objective:

$$\begin{aligned} \mathcal {L}_{hrd}(\varvec{x}; \varvec{\theta }_{plc}) = \mathcal {L}_{vul}(\varvec{x}; \varvec{\theta }_{plc}) - \lambda \cdot \mathcal {L}_{aft}(\varvec{x}; \varvec{\theta }_{plc}) \end{aligned}$$
(6)

This encourages the DA produced by the policy model to be at a level of hardness defined by \(\lambda \) (larger values of \(\lambda \) corresponding to lower hardness). Ideally, \(\lambda \) should be tuned to ensure the distribution shift caused by DA is sufficient to benefit robustness while not being so severe as to harm accuracy.
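For clarity, the following is a minimal PyTorch sketch of Eqs. (2), (5) and (6), computed per sample. It assumes attack is a callable implementing \(\rho \) (e.g., the PGD sketch in Sect. 1) and that both models output logits; the function name and interface are illustrative.

```python
import torch.nn.functional as F

def hardness_loss(x, x_aug, y, target_model, affinity_model, attack, lam):
    """L_hrd = L_vul - lam * L_aft (Eqs. 2, 5 and 6), returned per sample."""
    x_adv = attack(target_model, x_aug, y)                                   # rho(x_hat; theta_tgt)
    l_vul = (F.cross_entropy(target_model(x_adv), y, reduction='none')
             - F.cross_entropy(target_model(x_aug), y, reduction='none'))   # Vulnerability, Eq. (2)
    l_aft = (F.cross_entropy(affinity_model(x_aug), y, reduction='none')
             - F.cross_entropy(affinity_model(x), y, reduction='none'))     # Affinity, Eq. (5)
    return l_vul - lam * l_aft                                               # Hardness, Eq. (6)
```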

Last, we introduce a Diversity objective to promote diverse DA. Diversity enforces a relaxed uniform distribution prior over the logits of the policy model, i.e., the output augmentation distribution:

$$\begin{aligned} \mathcal {L}_{div}^h (\varvec{x}) = \frac{1}{C} \left[ - \sum _i^{p_i^h < l} \log (p_i^h) + \sum _j^{p_j^h > u} \log (p_j^h)\right] \end{aligned}$$
(7)

C is the total number of logits violating either the lower (l) or upper (u) limit, and h is the index of the prediction head. Intuitively, the Diversity loss penalizes overly small and overly large probabilities, helping to constrain the distribution to lie in a pre-defined range (l, u). As l and u approach the mean probability, the enforced prior becomes closer to a uniform distribution, which corresponds to a highly diverse DA policy. Diversity encourages the policy model to avoid over-exploitation of certain augmentations and to explore other candidate augmentations. Note that Diversity is applied to the color/shape head in a hierarchical way: type-wise and strength-wise inside each type of augmentation.
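A minimal PyTorch sketch of Eq. (7) for a single head is given below. Since the configured limits (e.g., l = 0.9 and u = 4.0 in Sect. 4) exceed what raw probabilities allow, the sketch assumes l and u are expressed relative to the uniform probability 1/N; the hierarchical application to the color/shape head is omitted for brevity.

```python
import torch

def diversity_loss(logits, l=0.9, u=4.0):
    """Eq. (7) for one head: penalize probabilities outside a band around uniform."""
    p = torch.softmax(logits, dim=-1)                 # (batch, N) sub-policy probabilities
    n = p.shape[-1]
    low, high = l / n, u / n                          # assumption: limits relative to uniform 1/N
    too_low, too_high = p < low, p > high
    c = (too_low | too_high).sum().clamp(min=1)       # C: number of violating logits
    penalty = -p[too_low].log().sum() + p[too_high].log().sum()
    return penalty / c
```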

Combining the above three objectives together, the policy model is trained to optimize:

$$\begin{aligned} \arg \min _{\varvec{\theta }_{plc}} \ -\mathbb {E}_{i \in B} \mathcal {L}_{hrd}(\varvec{x}_i) + \beta \cdot \mathbb {E}_{h \in H} \mathcal {L}_{div}^h (\varvec{x}; \varvec{\theta }_{plc}) \end{aligned}$$
(8)

where B is the batch size, H is the set of prediction heads, and \(\beta \) trades off hardness against diversity. \(\mathcal {L}_{div}^h\) is calculated across the instances in a batch, so, unlike \(\mathcal {L}_{hrd}\), it does not need to be averaged over B.

3.2.3 Mechanism

The Vulnerability objective is computed using feedback from the target model on adversarial vulnerability, measured by the variation in loss caused by adversarial perturbations. The policy model learns from this feedback which types and magnitudes of data augmentation (DA) elevate the adversarial vulnerability of the augmented data. This raises the likelihood of applying such augmentations to the training data, thereby increasing hardness. Meanwhile, the Affinity objective limits the hardness of the DA to a level that does not compromise performance. Additionally, the Diversity objective prevents over-reliance on specific DA operations, promoting exploration across a diverse spectrum of augmentation techniques. Together, these three objectives dictate the appropriate DA for each training sample.

3.3 Optimization

The entire training is a bi-level optimization process (Algorithm 1): the target and policy models are updated alternately. This online training strategy adapts the policy model to the varying demands for DA from the target model at different stages of training. The target model is optimized using AT with the augmentation sampled from the policy model:

$$\begin{aligned} \arg \min _{\varvec{\theta }_{tgt}} \mathcal {L}(\rho (\Phi (\varvec{x}; S(\varvec{\theta }_{plc}(\varvec{x})));\varvec{\theta }_{tgt}); \varvec{\theta }_{tgt}) \end{aligned}$$
(9)

After every K updates of the target model, the policy model is updated using the gradients of the policy learning loss as follows:

$$\begin{aligned} \frac{\partial \, (8)}{\partial \varvec{\theta }_{plc}} = - \frac{\partial \, \mathbb {E}_{i \in B} \mathcal {L}_{hrd}(\varvec{x}_i)}{\partial \varvec{\theta }_{plc}} + \beta \frac{\partial \, \mathbb {E}_{h \in H} \mathcal {L}_{div}^h (\varvec{x})}{\partial \varvec{\theta }_{plc}} \end{aligned}$$
(10)

The latter can be derived directly, while the former, \(\frac{\partial \mathcal {L}_{hrd}}{\partial \varvec{\theta }_{plc}}\), cannot because the augmentation operations involved are non-differentiable. To estimate these gradients, we apply the REINFORCE algorithm (Williams, 1992) with the baseline trick to reduce the variance of the gradient estimate. It first samples T augmentations, called trajectories, in parallel from the policy model and then computes the actual Hardness value, \(\mathcal {L}_{hrd}^{(t)}\), using Eq. (6) independently for each trajectory t. The gradients are estimated (see Section B for the derivation) as follows:

$$\begin{aligned} \frac{1}{B\cdot T}\sum _{i=1}^B\sum _{t=1}^T \sum _{h=1}^H \frac{\partial \log (p_{(t)}^h(\varvec{x}_i))}{\partial \varvec{\theta }_{plc}} [\mathcal {L}_{hrd}^{(t)}(\varvec{x}_i) - \tilde{\mathcal {L}_{hrd}}] \end{aligned}$$
(11)

\(p_{(t)}^h\) is the probability of the sampled sub-policy at the h-th head and \(\tilde{\mathcal {L}_{hrd}}=\frac{1}{T}\sum _{t=1}^T \mathcal {L}_{hrd}^{(t)}(\varvec{x}_i)\) is the mean \(\mathcal {L}_{hrd}\) (the baseline used in the baseline trick) averaged over the trajectories. Algorithm 2 illustrates one iteration of updating the policy model. Note that, when one model is being updated, backpropagation is blocked through the other. The affinity model, used in calculating the Affinity metric, is fixed throughout training.

Algorithm 1 High-level training procedure of the proposed method. \(\alpha \) is the learning rate. M is the number of iterations.
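A minimal PyTorch sketch of this alternating schedule is given below. The helpers sample_policy (Sect. 3.1 sketch), apply_augmentations (decoding sub-policy codes into image transforms), attack (e.g., the PGD sketch in Sect. 1) and update_policy (sketched after Algorithm 2) are illustrative assumptions rather than the paper's implementation.

```python
import torch
import torch.nn.functional as F

def train_aroid(target_model, policy_model, affinity_model, loader, policy_loader,
                opt_tgt, opt_plc, attack, K=5, T=8, lam=0.3, beta=0.8):
    """Alternating (bi-level) schedule: K AT updates of the target model,
    then one update of the policy model (Algorithm 2)."""
    policy_batches = iter(policy_loader)
    for step, (x, y) in enumerate(loader):
        # Sample instance-wise DA; no gradients flow into the policy model here.
        with torch.no_grad():
            codes, _ = sample_policy(policy_model(x))
        x_aug = apply_augmentations(x, codes)
        # Eq. (9): adversarial training on the augmented data.
        x_adv = attack(target_model, x_aug, y)
        loss = F.cross_entropy(target_model(x_adv), y)
        opt_tgt.zero_grad(); loss.backward(); opt_tgt.step()
        # Every K target updates, update the policy model on a freshly sampled batch.
        if (step + 1) % K == 0:
            update_policy(policy_model, target_model, affinity_model,
                          next(policy_batches), opt_plc, attack, T=T, lam=lam, beta=beta)
```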

Algorithm 2 Pseudo code of training the policy model for one iteration. \(\varvec{x}\) is randomly sampled from the entire dataset.
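Below is a minimal PyTorch sketch of one such policy update, combining the REINFORCE estimate of Eq. (11) with the Diversity term of Eq. (8). It reuses the sample_policy, hardness_loss and diversity_loss sketches above and the assumed apply_augmentations helper; it is illustrative only.

```python
import torch

def update_policy(policy_model, target_model, affinity_model, batch,
                  opt_plc, attack, T=8, lam=0.3, beta=0.8):
    """One policy-model update (Algorithm 2): REINFORCE with a mean baseline, Eq. (11)."""
    x, y = batch
    head_logits = policy_model(x)                       # gradients reach theta_plc only via the log-probs
    rewards, logps = [], []
    for _ in range(T):                                  # T trajectories sampled per instance
        codes, log_prob = sample_policy(head_logits)    # log_prob: (batch, H)
        x_aug = apply_augmentations(x, codes)
        with torch.no_grad():                           # reward = L_hrd; no backprop through target/affinity models
            r = hardness_loss(x, x_aug, y, target_model, affinity_model, attack, lam)
        rewards.append(r)
        logps.append(log_prob.sum(dim=1))               # sum of log-probs over heads, as in Eq. (11)
    rewards = torch.stack(rewards, dim=1)               # (batch, T)
    logps = torch.stack(logps, dim=1)
    baseline = rewards.mean(dim=1, keepdim=True)        # mean over trajectories (baseline trick)
    pg_loss = -((rewards - baseline) * logps).mean()    # ascend the expected Hardness
    div_loss = sum(diversity_loss(h) for h in head_logits) / len(head_logits)
    loss = pg_loss + beta * div_loss                    # Eq. (8)
    opt_plc.zero_grad(); loss.backward(); opt_plc.step()
```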

3.4 Modes of Application

AROID can be used in two modes: online and offline. In the online mode, the policy and target models are trained jointly, so the policy model has to be retrained every time a new target model is trained. This adapts the DA policy to the target model on-the-fly, which improves effectiveness but adds the extra cost of policy learning to that of adversarial training. In the offline mode, the training of the policy and target models are separate phases. A policy model is trained in advance (using online AROID), a step analogous to the hyperparameter optimization of other DA methods. This pre-trained policy model is then used to train a new target model. Specifically, at each epoch of training the target network, a policy model checkpoint, saved at the corresponding epoch of the online AROID run, is used to sample DA policies for training the target model. When AROID is deployed in this offline mode, we refer to it as AROID-T, as it involves the transfer of the policy model. The standard mode of application is online, which we refer to simply as AROID.
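A minimal sketch of the offline mode is given below, assuming per-epoch policy checkpoints were saved during a previous online run; load_policy, apply_augmentations and the checkpoint layout are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def train_aroid_t(target_model, loader, opt_tgt, attack, ckpt_paths):
    """Offline mode (AROID-T): reuse the policy checkpoint saved at the
    corresponding epoch of a previous online AROID run."""
    for epoch, path in enumerate(ckpt_paths):            # one checkpoint per epoch
        policy_model = load_policy(path)                 # frozen; no further policy learning
        for x, y in loader:
            with torch.no_grad():
                codes, _ = sample_policy(policy_model(x))
            x_aug = apply_augmentations(x, codes)
            x_adv = attack(target_model, x_aug, y)
            loss = F.cross_entropy(target_model(x_adv), y)
            opt_tgt.zero_grad(); loss.backward(); opt_tgt.step()
```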

3.5 Efficiency

The efficiency of AROID depends on the mode. The cost of AROID is composed of two parts: policy learning and DA sampling. Policy learning is a one-time expense if AROID is used in offline mode. DA sampling requires only one forward pass of the policy model, whose cost is negligible because the policy model can be much smaller than the target model without hurting performance. Therefore, AROID in offline mode is roughly as efficient as other regular DA methods.

In online mode, in the worst case, AROID adds about 43.6% extra computation to baseline AT (see the calculation in Section C) when \(T=8\) and \(K=5\). This is less than the 52.5% overhead of the state-of-the-art AT method LAS-AT (Jia et al., 2022) and substantially less than the search cost of IDBH and AutoAugment (compared in Sect. 4.4). Furthermore, we observed that AROID can still achieve higher robustness than other competitors with a much smaller policy model (Sect. 4.13.3), a reduced T and an increased K (Sect. 4.4), for improved efficiency. For example, setting \(T=4\) and \(K=20\), the overhead is only about 10% compared to baseline AT.

Another efficiency concern, as for all other deep learning methods, is hyperparameter optimization. We discuss below how this can be done efficiently so that AROID can be easily adapted to a new setting. First, as shown in Sect. 4.13.1, most of our hyperparameters transfer well among different training settings, so only light tuning is needed to achieve reasonably good performance in a new setting. In most cases, only \(\lambda \) needs to be tuned. Second, hyperparameter optimization can be accelerated by first searching with a cheap setting, such as \(K=20\) and \(T=4\), and then transferring the found values to the final setting, i.e., \(K=5\) and \(T=8\). Note that our hyperparameter tuning process is no different from that of other methods.

Table 1 The performance of various DA methods

4 Experiments

The experiments in this section were based on the following setup unless otherwise specified.

General set-ups. We used the model architectures Vision Transformer (ViT-B/16 and ViT-B/4) (Dosovitskiy et al., 2020), WideResNet34-10 (WRN34-10) (Zagoruyko & Komodakis, 2016) and PreAct ResNet-18 (PRN18) (He et al., 2016b). We evaluated on the datasets CIFAR10/100 (Krizhevsky, 2009), Imagenette and ImageNet (Deng et al., 2009).

For CIFAR10/100, models were trained by stochastic gradient descent (SGD) for 200 epochs with an initial learning rate of 0.1, divided by 10 at 50% and 75% of the epochs. The momentum was 0.9, the weight decay was 5e-4 and the batch size was 128. The experiments on Imagenette and ImageNet followed a similar protocol to those on CIFAR10 except for the following changes. For Imagenette, the weight decay was 1e-4, the total number of epochs was 40, and the learning rate was decayed at the 36th and 38th epochs. The ViT-B/16 was pre-trained on ImageNet-1K. Gradient clipping was applied throughout training. Note that CIFAR10 with ViT-B/4 was trained using the same setting as Imagenette with ViT-B/16. For ImageNet, models were trained for 50 epochs with an initial learning rate of 0.01, divided by 10 at the 20th and 40th epochs. Models were pre-trained on ImageNet-1K. The weight decay was 0. Experiments were run on Nvidia Tesla V100 and A100 GPUs. All results reported by us were averaged over 3 runs, except for ImageNet due to limited computational resources.

Adversarial set-ups. By default, we used \(\ell _{\infty }\) PGD AT (Madry et al., 2018) with a perturbation budget, \(\epsilon \), of 8/255. The number of steps was 10 and the step size was 2/255. For ImageNet, the perturbation budget, \(\epsilon \), was 4/255, the number of steps was 2 and the step size was \(2\epsilon /3\). Following Rice et al. (2020), we tracked PGD10 robustness on the test set at the end of each epoch during training and selected the checkpoint with the highest PGD10 robustness, i.e., the “best” checkpoint to report robustness. Robustness was evaluated by AutoAttack (Croce & Hein, 2020).

Table 2 The performance of AROID-T, our method in offline mode

Configuration of AROID. Hyperparameters were optimized using grid search. By default, \(T=8\), \(K=5\) and \(\beta =0.8\) were used. The diversity limits l and u were 0.9 (0.8) and 4.0, respectively, for CNNs (ViTs). \(\lambda \) was 0.4-0.2-0.1 (decayed with the learning rate for better performance), 0.4 and 0.3 for WRN34-10, ViT-B/4 and PRN18 on CIFAR10, 0.3-0.1-0.01 and 0.2 for WRN34-10 and PRN18 on CIFAR100, and 0.3 for ViT-B/16 on Imagenette. The default backbone of the policy model was PRN18, except that ViT-B/16 (pre-trained on ImageNet-1K) was used for Imagenette.

Section D describes more implementation details of AROID and the competitive methods to be compared below.

Table 3 Evaluation of robust overfitting for models trained with various data augmentation methods on CIFAR10/100 with WRN34-10

4.1 Benchmarking DA on Adversarial Robustness

Table 1 compares our proposed method against existing DA methods. AROID outperforms all existing methods in terms of robustness across all five tested settings. The improvement over the previous best method is particularly significant for ViT-B on CIFAR10 (+1.62%) and Imagenette (+1.12%). Note that in most cases IDBH is the only method whose robustness is close to ours; however, our method is much more efficient than IDBH in terms of policy search (shown in Sect. 4.4). If our method is compared only to those methods with a computational cost the same as or less than AROID's, i.e., excluding IDBH and AutoAugment, the improvement over the second best method is +2.05%/2.58%/0.78%/1.12%/1.02% for the five experiments. Furthermore, we highlight our method's substantial improvement over the baseline, +3.65%/4.53%/1.77%/4.85%/1.73%, in these five settings.

In addition, AROID also achieves the highest accuracy in four of the five tested settings, and in the setting of Imagenette the accuracy gap between the best method and ours is marginal (0.37%). Overall, our method significantly improves both accuracy and robustness, achieving a much better trade-off between accuracy and robustness. The consistent superior performance of our method, across various datasets (low and high resolution, simple and complex) and model architectures (CNNs and ViTs, small and large capacity), suggests that it has a good generalization ability.

4.2 Offline Versus Online AROID

This section evaluates the transferability of the learned policy models. It uses AROID in the offline mode (i.e. AROID-T as described in Sect. 3.4), across three scenarios: (1) with the same dataset and model architecture; (2) across different datasets; (3) across different model architectures. In scenario 1, a policy model is pre-trained on CIFAR10 for a WRN34-10 model and is applied to train a WRN34-10 model on CIFAR10. In scenario 2, a policy model is pre-trained on CIFAR10 for a WRN34-10 model and is applied to train a WRN34-10 model on CIFAR100. In scenario 3, a policy model is pre-trained on CIFAR10 for a PRN18 model and is applied to train a ViT-B/4 model on CIFAR10.

As shown in Table 2, AROID-T achieved accuracy and robustness comparable to its online counterpart, AROID. Importantly, AROID-T still outperforms previous data augmentation methods (Table 1) in terms of both accuracy and robustness. Notably, the cost of applying AROID-T is roughly the same as that of other data augmentation methods. Overall, these results demonstrate that AROID-T transfers well across various settings.

4.3 Mitigating Robust Overfitting

This section evaluates the effectiveness of our proposed method in mitigating robust overfitting. Robust overfitting is measured, following the standard convention, as the difference between the best and end robustness. The results in Table 3 demonstrate that, compared to the baseline, AROID substantially reduces the degree of robust overfitting, from 5.64 to 0.91% on CIFAR10 and from 3.69 to 0.83% on CIFAR100. AROID achieves the smallest robustness gap among all competitive methods on CIFAR100. On CIFAR10, AROID achieves a robustness gap of 0.91%, close to the minimum of 0.52% achieved by AutoAugment, while exhibiting significantly higher best and end robustness (+1.31% and +0.92%, respectively). Overall, these results suggest that our method effectively mitigates robust overfitting.

4.4 Comparison of Policy Search Costs

We compare here the cost of policy search of AROID against other automated DA methods, i.e., AutoAugment and IDBH. Before comparison, it is important to be aware that the search cost for IDBH increases linearly with the size of search space, while the cost of AROID stays approximately constant. IDBH thus uses a reduced search space that is much smaller than the search space of AROID. However, reducing the search space depends on prior knowledge about the training datasets, which may not generalize to other datasets. Moreover, scaling IDBH to our larger search space is intractable, and it would be even more intractable if IDBH was applied to find DAs for each data instance at each stage of training, as is done by AROID.

Even in the most expensive configuration (\(K=5\) and \(T=8\)), AROID is substantially cheaper than IDBH and AutoAugment in terms of the cost of policy search, as shown in Table 4. The computational efficiency of AROID can be further increased by reducing the policy update frequency (increasing K) and/or decreasing the number of trajectories T, while still matching the robustness of IDBH. Given this huge gap, we suspect that if IDBH and AutoAugment were restricted to the same, much lower, budget for searching for a DA policy, they would find nothing useful.

Table 4 The cost of policy search for automated DA methods using PRN18 on CIFAR10

4.5 Comparison with State-of-the-Art Robust Training Methods

Table 5 compares our method against state-of-the-art robust training methods. It can be seen that AROID substantially improves vanilla AT in terms of accuracy (by 3.16%) and robustness (by 3.65%). This improvement is sufficient to boost the performance of vanilla AT beyond state-of-the-art robust training methods such as SEAT and LAS-AWP in terms of both accuracy and robustness. This suggests that our method achieves a better trade-off between accuracy and robustness while also boosting robustness.

More importantly, our method, as it is based on DA, can be easily integrated into the pipeline of existing robust training methods and, as our results show, is complementary to them. Our method was combined with other AT methods in the same way as any other data augmentation method: simply by using the sampled data augmentation policy to augment the data before generating adversarial examples. The update of the policy model is independent of the training method used. By combining with SWA and/or AWP, our method substantially improves robustness even further while still maintaining an accuracy higher than that achieved by other methods. It is worth noting that CutMix combined with SWA is widely recognized as a strong baseline for data augmentation; our approach surpasses this baseline when combined with SWA as well.

Table 5 The performance of various robust training (RT) methods with baseline and our augmentations for WRN34-10 on CIFAR10

4.6 Generalization to Alternative AT Methods

To further test the generalizability of AROID to alternative AT methods, we integrate AROID with two further advanced AT methods: TRADES (Zhang et al., 2019) and SCORE (Pang et al., 2022). Results are shown in Table 6. AROID achieves the highest accuracy and robustness among all the tested DA methods with both advanced AT methods. Overall, these results, together with those in Sect. 4.5, show that AROID generalizes well to various AT methods (PGD, TRADES, SCORE, AWP, SWA).

Table 6 Comparison of various DA methods when trained by alternative AT methods like TRADES and SCORE for PRN18 on CIFAR10

4.7 Combining with Extra Data

Table 7 The performance of our methods when trained with extra data for WRN34-10 on CIFAR10

The leading methods on the robustness benchmark RobustBench (Croce et al., 2021) make heavy use of extra data to augment adversarial training. We incorporate extra real data into AROID following Carmon et al. (2019) and compare it against PORT (Sehwag et al., 2022) and HAT (Rade & Moosavi-Dezfooli, 2022), which are ranked, to date, first and second respectively in RobustBench for the model architecture WRN34-10. As shown in Table 7, our method significantly improves both accuracy and robustness over the baseline methods. Our method also surpasses PORT in both accuracy and robustness. Compared to HAT, our method achieves comparable robustness and clearly higher accuracy, exhibiting a better trade-off between accuracy and robustness. Note that the HAT result is obtained using a more effective AT method (HAT itself) and a different activation function (SiLU), both of which are known to boost performance.

Next, we test whether AROID can be applied to enhance the state-of-the-art method BDM (Wang et al., 2023), which utilizes 50 M synthetic data samples. As shown in Table 7, AROID achieves a marginal improvement over this baseline in terms of accuracy and robustness, indicating that AROID remains effective even in data-rich settings. However, it is observed that the performance improvement provided by AROID diminishes when compared to results without the additional 50 M data. This reduction occurs because the robust overfitting in the baseline is largely mitigated by the additional data, and since AROID enhances adversarial training by alleviating robust overfitting, the scope for further improvement by AROID is consequently reduced.

Although the benefit of data augmentation diminishes when a large amount of synthetic data is incorporated for training on CIFAR10, this approach may not be as effective on more complex datasets such as ImageNet. As observed in Azizi et al. (2023), increasing synthetic ImageNet data beyond a certain limit (around 1.2M synthetic images) degrades model performance in high-resolution settings (\(256\times 256\) and \(1024\times 1024\) pixels), while it consistently provides benefits in the low-resolution setting (\(64\times 64\) pixels). This degradation at high resolutions may be due to greater bias in the model and/or lower quality of the generated images at higher resolutions.

Table 8 The result of AROID on ImageNet with ConvNeXt-T

4.8 Generalization to ImageNet

To further test the generalizability and scalability of our method to a large-scale dataset, we train AROID on ImageNet (Deng et al., 2009) with ConvNeXt-T (Liu et al., 2022). Some DA methods are missing in this comparison due to limited computational resources (explained in Section D.2). As shown in Table 8, AROID significantly improves robustness over the baseline by 4.18% and AutoAugment by 2.6%. It also achieves the highest accuracy among the tested methods. Overall, AROID is able to scale and generalize to ImageNet.

The AROID hyperparameters were set to \(\lambda =0.7\), \(\beta =2\), \((l, u) = (0.8, 4.0)\), \(T=20\) and \(K=4\). As we did not have sufficient computational resources to fully optimize these hyperparameters on ImageNet, performance is likely to be suboptimal and falls short of the state-of-the-art result (Singh et al., 2023). It has been observed in Singh et al. (2023) that adversarial training on ImageNet prefers heavy data augmentation composed of RandAugment (Cubuk et al., 2020), CutMix, MixUp and Random Erasing. DA operations like CutMix and MixUp are not included in our DA search space. Incorporating these operations into our search space is thus expected to boost the performance of our method on ImageNet. We leave the exploration of this enhancement to future work.

Table 9 The performance of various DA methods on the common corruption dataset CIFAR10-C for WRN34-10
Table 10 Robustness evaluation against more adversarial attacks

4.9 Performance on Common Corruption Datasets

This section assesses the generalization capability of the proposed method under input data distribution shifts, known as Out-Of-Distribution (OOD) testing. Following Kireev et al. (2022), we trained models on the CIFAR10 training set and evaluated them on CIFAR10-C (Hendrycks & Dietterich, 2019). CIFAR10-C is created by applying 15 types of common visual corruptions to the CIFAR10 test set, representing visual corruption shifts encountered in the wild.

In Kireev et al. (2022), only clean accuracy was evaluated on CIFAR10-C, focusing on the efficacy of adversarial training in improving robustness against common corruptions. However, this study emphasizes adversarial robustness. A recent study suggested that adversarial robustness is highly vulnerable to input distribution shifts (Li et al., 2024). Therefore, we also evaluated adversarial robustness on CIFAR10-C by conducting AutoAttack on the CIFAR10-C data.

As shown in Table 9, our proposed method achieves the highest accuracy and robustness among all competitive data augmentation methods, indicating excellent OOD generalization ability for both clean and robust performance under common corruption distribution shifts.

4.10 Robustness Evaluation with More Attacks

To further ensure our robustness evaluation is reliable, we additionally evaluate AROID and other related works using three further adversarial attacks: PGD (Madry et al., 2018), CW (Carlini & Wagner, 2017) and JITTER (Schwinn et al., 2023). From the results shown in Table 10, it can be seen that AROID is consistently superior under various adversarial attacks.

4.11 Data Scaling Versus Model Scaling

This section compares the effectiveness of scaling up data (our method) versus scaling up the model in enhancing adversarial training. To test this, we trained AROID using the WRN34-10 model architecture (depth of 34 and widening factor of 10) and compared it to WRN34-12 and WRN46-10 architectures trained with RandomCrop DA. WRN34-12 and WRN46-10 were chosen because they have approximately 44% and 42% more parameters, respectively, than WRN34-10, which is comparable to the worst-case extra computational overhead, 43.6%, caused by AROID.

As shown in Table 11, AROID with WRN34-10 achieved the highest accuracy and robustness, greatly outperforming RandomCrop even when larger models were used. This suggests that optimizing data augmentation, when implemented correctly, can be more effective than merely scaling up the model to boost performance. The issue with RandomCrop and larger models is that, as indicated by the large gap between best and end robustness, scaling up models cannot effectively mitigate robust overfitting, resulting in poor generalization of robustness.

Table 11 The performance of baseline RandomCrop with larger models on CIFAR10

4.12 Enlarging Policy Search Space

Table 12 The performance of AROID with the original and the enlarged (with CutMix added) data augmentation space with and without SWA for WRN34-10 on CIFAR10

This section assesses whether enlarging the policy search space can enhance AROID. We conducted tests by adding CutMix to our policy search space as an additional transformation to be sampled and applied after the dropout transformation (please refer to Sect. 3.1 for the specification of the data augmentation policy structure). CutMix was chosen due to its effectiveness in adversarial training when combined with SWA (Rebuffi et al., 2021).

As shown in Table 12, the inclusion of CutMix, compared to the original data augmentation space, results in reduced robust overfitting and improved best and end robustness, regardless of whether it is combined with SWA. Additionally, incorporating CutMix even boosts the best accuracy when combined with SWA. One possible explanation for this improvement is that the addition of CutMix increases the diversity of data augmentation in the learned policy, thereby mitigating robust overfitting and enhancing robust generalization (the reasons why diverse data augmentation mitigates robust overfitting are explained in Sect. 3.2.1).

However, it is important to note that not all data augmentation methods yield such benefits. The impact of incorporating additional data augmentation methods into the policy search space depends on the nature of the augmentation techniques themselves. Harmful ("toxic") data augmentations, as observed in Cubuk et al. (2020), may not enhance, and in some cases may even impair, the performance of AROID if added to the search space. Overall, AROID can indeed benefit from an enlarged search space if implemented appropriately.

4.13 Ablation Study

This section examines the sensitivity of our method to its hyperparameters and several design choices. The experiments were conducted on CIFAR10 with PRN18 and on Imagenette with ViT-B/16. The default values of the hyperparameters are the ones marked in green in Fig. 3.

4.13.1 Hyperparameters

Fig. 3 Ablation study of hyper-parameters \(\lambda \), \(\beta \), l, u, T and K for CIFAR10 with PRN18 (even rows) and Imagenette with ViT-B/16 (odd rows). The selected value for each hyper-parameter is marked in green

Policy update frequency K. Figures 3j and l show that the highest accuracy and robustness were achieved when \(K=5\), i.e., the highest update frequency tested. This implies that AT benefits from a more "up-to-date" DA. Furthermore, it seems possible to trade accuracy for efficiency by choosing a larger value of K (up to 20) while maintaining similarly high robustness. In general, the accuracy and robustness of our method decline as the policy is updated less frequently.

Number of trajectories T. Figure 3i and k show that high accuracy and robustness are achieved around \(T=8\). This suggests that (1) there is a minimum number of trajectories required for our policy gradient estimator to be accurate and (2) our method may not benefit from increasing T beyond 8.

Strength of Affinity \(\lambda \). As shown in Fig. 3a and c, robustness first increases and then decreases within the tested range of values. This is consistent with the prior finding that AT benefits from appropriate hardness but degrades if data augmentations are overly hard (Li & Spratling, 2023c).

Strength of Diversity \(\beta \). The performance is similar across the tested range of values in Fig. 3b and d, suggesting that the performance of AROID is not sensitive to the value of \(\beta \). Nevertheless, this does not imply that Diversity is unnecessary in our policy learning. On the contrary, it plays an important role, as shown in Sect. 4.13.2.

Summary. We observe that, within the tested value range, hyper-parameters like \(\lambda \), \(\beta \), T and K show quite similar trends in both settings, while the lower limit l (Fig. 3e, g) and upper limit u (Fig. 3f, h) in the Diversity objective show slightly different trends between the two settings. Despite the slightly different behaviors of a few hyper-parameters, the optimal hyper-parameter values are observed to transfer across these two settings, i.e., reasonably good performance is achieved with a similar set of values: \(T=8\), \(K=5\), \(l=0.8/0.9\), \(u=4\), \(\lambda =0.3\), \(\beta =0.8\). We also find that this setting transfers well across the different AT methods PGD, SCORE and TRADES: tuning only the value of \(\lambda \), while keeping the rest unchanged, is sufficient to achieve reasonably good performance and outperform the other compared data augmentations.

4.13.2 Policy Learning Objectives

This section conducts an ablation study to evaluate the effect of each proposed policy learning objective on the performance of AROID. As shown in Table 13, removing any single policy learning objective leads to a considerable drop in both accuracy and robustness, indicating that each objective is crucial for learning an effective data augmentation policy. In particular, we observed that when Diversity is removed by setting \(\beta =0\), accuracy drops from 84.68 to 73.88%, and robustness drops from 50.57 to 22.24%. Without the Diversity constraint, training of the policy network failed because the output policy distribution became concentrated on a few sub-policies, assigning zero probabilities to the remaining ones. The REINFORCE method could not recover from this situation because it no longer explored other options. This underscores the importance of maintaining a certain level of Diversity constraint in our policy learning. However, no clear benefit is observed as this constraint is further strengthened by raising \(\beta \), as shown in Fig. 3b and d.

Table 13 The impact of removing each policy learning objective on the performance of AROID for PRN18 on CIFAR10

4.13.3 Policy Model Architecture

Interestingly, we observed in Table 14 that, for CIFAR10, a relatively small model, WideResNet10-1 (a WideResNet with depth 10 and widening factor 1) with 0.08M parameters, is sufficient for learning the DA policy for a relatively large target model, PRN18, with 11.17M parameters, and that further increasing the policy model's capacity beyond this scale, even by 100x, does not benefit either accuracy or robustness. Therefore, the policy model can be much smaller than the target model.

Table 14 Comparison of the various policy model backbone architectures on CIFAR10 with a target model of PRN18
Table 15 Comparison of uniform sampling from AROID DA space on CIFAR10 with PRN18
Fig. 4 The progression of the three proposed policy learning objectives throughout the AROID training process on CIFAR10 for WRN34-10. Lines are smoothed with a moving average over 5 epochs for improved clarity

4.13.4 Uniform Sampling

We performed AT using data augmentations uniformly sampled from AROID's data augmentation space. The results are labeled Uniform in Table 15. As shown in the table, AROID significantly improves accuracy and robustness over its uniformly sampled counterpart, suggesting the necessity of optimizing the data augmentation policy.

4.14 Analysis of Learned DA Policies

This section first analyzes the dynamics of the proposed policy learning objectives during training (Sect. 4.14.1). It then visualizes the learned data augmentation policies sampled over a course of training (Sect. 4.14.2). Last, it visualizes some image samples transformed by the learned data augmentation policies (Sect. 4.14.3).

4.14.1 Progression of Policy Learning Objectives

To understand the dynamics of the learned data augmentation policy, Fig. 4 visualizes the progression of the three proposed policy learning objectives throughout the AROID training process. Generally, Vulnerability represents the adversarial vulnerability of the augmented data, Affinity reflects the distribution shift caused by data augmentation, and Diversity is negatively correlated with the diversity of data augmentation (lower Diversity implies greater diversity). It is observed that during training, Vulnerability and Affinity increase while Diversity decreases. These trends suggest that the data augmentation sampled from the learned policies becomes progressively harder, in terms of both adversarial vulnerability and distribution shift, and more diverse throughout the training process. This aligns with the goal of our policy learning as described in Eq. (8) to encourage an increase in Vulnerability while regularizing Affinity and Diversity to decrease. It is important to note that an increase, rather than a decrease, is observed in the Affinity loss because Affinity was regularized with a decaying strength (in this case 0.4, 0.2, 0.1).

Fig. 5 Visualization of the learned DA policies, applied to ten images randomly sampled from the CIFAR10 training set, for the Flip, Crop, Color/Shape and Dropout types of augmentations. The policy model is resumed from a checkpoint saved at the end of the 110th epoch when training a WRN34-10 model on CIFAR10 (following the training setting specified in Section D). The ten sampled images are visualized at the bottom in the order of the x-axis in the above bar charts. The chance of applying no transformation (Identity) is the gap between the colored bar and the top (i.e., a score of 1.0). In the Color/Shape group, the probabilities of different magnitudes are not shown separately, but are summed to get the overall probability of a transformation

Fig. 6 Visualization of how the learned DA policies evolve as training progresses. The same randomly sampled image (visualized at the bottom) was used across epochs (5, 25, 50, 75, 100, 125, 150, 175, 200) to produce the policies. The first bar in each sub-figure corresponds to epoch 5 and describes the initial state of the policy model (training of the policy model starts from epoch 5). For each bar in the figures, the policy model was resumed from the checkpoint saved at the corresponding epoch (x-axis) in the same course of training. The chance of applying no transformation (Identity) is the gap between the colored bar and the top (i.e., a score of 1.0). In the Color/Shape group, the probabilities of different magnitudes are not shown separately, but are summed to get the overall probability of a transformation

4.14.2 Visualization of Learned DA Policies

Figure 5 visualizes the learned distribution of DAs for different, randomly sampled, data instances. Instance-wise variation of the learned DA policy is visible for the Color/Shape augmentations (Fig. 5c) and evident for the Dropout augmentations (Fig. 5d), but subtle in the rest (Fig. 5a, b). Note that even for the different data instances from the same class (e.g., instances 4, 7, 10 from the class “frog”), the learned DA distributions can still differ considerably (Fig. 5d). This confirms that (1) AROID is able to capture and meet the varied demand of augmentations from different data instances, and (2) such demand exists for some, but not all, augmentations. These observations may explain why many instance-agnostic DA methods such as IDBH, despite being inferior to ours, still work reasonably well (see Table 1).

It was also observed in Fig. 6 that the learned DA policy for the same data instance evolved as training progressed. In the Color/Shape group (Fig. 6c), augmentations like Sharpness became noticeably more likely to be selected while others such as ShearY became less probable as training continued. Dropout (i.e. Erasing; Fig. 6d), particularly with large magnitudes, was rarely applied prior to the 100th epoch, i.e., the first decay of the learning rate. The probability of applying Crop (i.e. Cropshift; Fig. 6b) and Flip (i.e. HorizontalFlip; Fig. 6a) first dropped until the first decay of the learning rate and then stayed nearly constant afterwards.

Consistent with previous findings on ST (Cubuk et al., 2019) and harmful augmentations (Rebuffi et al., 2021), we observed that AT on CIFAR10 favored mostly color-based augmentations like Equalize and Sharpness and disfavored geometric augmentations like Rotate and harmful augmentations like Solarize and Posterize (see Figs. 5c and 6c). This verifies the effectiveness of our DA policy learning algorithm.

4.14.3 Visualization of Augmented Data Samples

Figure 7 depicts 20 pairs of original and augmented data samples from CIFAR10. The visualization demonstrates that our method effectively enhances the diversity of augmented data samples. While the original and augmented data samples are paired here in a one-to-one manner, the learned policy enables the generation of a much larger variety of distinct augmented data.

Fig. 7 Visualization of 20 randomly-sampled pairs of original (odd rows) and augmented (even rows) samples from CIFAR10. The policy model is the same as that used for Fig. 5

5 Conclusions

This work introduces an approach, dubbed AROID, to efficiently learn online, instance-wise DA policies for improved robust generalization in AT. AROID is the first automated DA method designed specifically for AT. Extensive experiments show its superiority over both alternative DA methods and contemporary AT methods in terms of accuracy and robustness. AROID also significantly reduces the cost of policy search, making automated data augmentation practical for adversarial training, even on large datasets. AROID can also be used in an offline mode to further save computation. The learned DA policies are visualized to verify the effectiveness of AROID and to understand the preferences of AT regarding DA.

However, AROID also has some limitations. First, despite being more efficient than IDBH, it still adds an extra computational burden to training, unless AROID-T is used. This could harm its scalability to larger datasets and model architectures. Second, the Diversity objective enforces a minimal chance (set by the lower limit) of applying harmful transformations and/or harmful magnitudes if they are included in the search space. This constrains the ability of AROID to explore a wider (less filtered) search space. Future work could investigate more efficient AutoML algorithms for learning DA policies for AT, and design new policy learning objectives that reduce the number of hyperparameters and alleviate the side-effect of Diversity.