
1 Introduction

Deep learning models are often vulnerable to adversarial examples: malicious inputs modified to yield erroneous model outputs while appearing unmodified to human observers at inference time [1,2,3,4]. Potential attacks include confusing vehicle behavior in automated driving or having malicious content such as malware identified as legitimate. Yet most adversarial example attacks require explicit knowledge of the model internals or its training data (the white-box setting). When searching for adversarial examples against a real-world system, such knowledge may not be available; the target model is then a black box to the attacker, and it is difficult to extract information about its decision boundary, which is usually a prerequisite for designing input perturbations that result in erroneous predictions. However, previous works have shown that transferability exists between different models, i.e., adversarial examples can transfer from one model to another [1, 5,6,7,8]. This property can be leveraged to perform black-box attacks: the attacker queries the target system and trains a substitute model based on the query results [9], then generates adversarial examples for the substitute, and these adversarial examples may transfer and mislead the target system. For example, an adversary who seeks to penetrate a computer network rarely has access to the specifications of the deployed intrusion detection system, but they can observe its outputs for any chosen inputs [10]. These observed input-output pairs are used to produce synthetic datasets and to train a substitute model approximating the target system. Adversarial examples generated from such substitutes are therefore more likely to transfer and confuse the target system.

However, conventional attack strategies typically train only a single substitute to craft adversarial examples, which have weak transferability in the black-box attack scenario and are easily defended against by existing defense mechanisms [11,12,13]. In particular, the ensemble adversarial training technique [14], an extension of adversarial training [1, 15], was proposed to increase the robustness of DL models against black-box attacks. Thus, new attack strategies should be designed to explore the vulnerability of DL models trained with ensemble adversarial training.

In this paper, we propose two ensemble-based black-box attack strategies, the iterative cascade ensemble strategy and the stack parallel ensemble strategy, to implement more powerful black-box attacks against DL models and to demonstrate that ensemble adversarial training does not significantly increase the robustness and security of DL models. In addition, the potential factors that contribute to effective attacks against DL models are examined from three perspectives: the transferability of substitutes, the diversity of substitutes, and the number of substitutes. The ensemble adversarial black-box attack strategies and their analysis are introduced in Sect. 2. Comparative experimental results on real-world datasets and a feasibility exploration are reported in Sect. 3, and the paper concludes in Sect. 4.

2 Ensemble-Based Black-Box Attack Strategy

Before presenting the attack strategies, we briefly describe the architecture of the substitutes and the transferable adversarial example generation algorithms used in this paper. For an input $x \in \mathbb{R}^{D}$, the composition of functions modeled by the substitute can be formalized as [16]:

$$ F(x) = \mathrm{softmax}\left(f_{n}\left(\theta_{n}, f_{n-1}\left(\theta_{n-1}, \ldots, f_{2}\left(\theta_{2}, f_{1}\left(\theta_{1}, x\right)\right)\right)\right)\right) $$
(1)

where each function $f_{i}$ for $i \in \{1, \ldots, n\}$ is modeled by a layer of neurons, and each layer is parameterized by a weight vector $\theta_{i}$ that determines each neuron's activation. The output of the last layer is computed using the softmax function, which ensures that the output vector F(x) satisfies $0 \le F(x)_{i} \le 1$ and $F(x)_{1} + \cdots + F(x)_{c} = 1$, where c is the number of classes.
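As a concrete illustration of Eq. (1), the following minimal Keras sketch builds a substitute of this layered form; the layer sizes and the helper name build_substitute are illustrative assumptions, not the architectures actually used in this work (see Table 2).

import tensorflow as tf

def build_substitute(input_dim: int, num_classes: int) -> tf.keras.Model:
    """A substitute F(x) = softmax(f_n(...f_1(x))), as in Eq. (1).

    Layer sizes are placeholders; the paper trains several different
    architectures (see Table 2)."""
    return tf.keras.Sequential([
        tf.keras.layers.Input(shape=(input_dim,)),
        tf.keras.layers.Dense(200, activation="relu"),   # f_1(theta_1, x)
        tf.keras.layers.Dense(200, activation="relu"),   # f_2(theta_2, .)
        tf.keras.layers.Dense(num_classes),              # f_n(theta_n, .)
        tf.keras.layers.Softmax(),                       # softmax output F(x)
    ])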

Transferable adversarial examples are generated by a substitute by carefully introducing human-indistinguishable perturbations into the original examples; the generated adversarial examples $x^{*} \in \mathbb{R}^{D}$ can then transfer to confuse the target model O, i.e., $O(x^{*}) \ne O(x)$. Currently proposed adversarial example generation algorithms mainly include gradient-based methods (e.g., FGSM [1], I-FGSM [2], R+FGSM [14]) and optimization-based methods (e.g., the Carlini $L_{\infty}$ attack [17]); the specific details are described below:

The Fast Gradient Sign Method (FGSM) is a single-substitute attack method. It finds the adversarial perturbation that yields the highest increase of the loss function under an $L_{\infty}$-norm constraint. The update equation is

$$ x^{*} = x + \alpha \cdot \mathrm{sign}\left(\nabla_{x}\, loss\left(1_{y}, F(x)\right)\right) $$
(2)

where $\alpha$ controls the magnitude of the adversarial perturbation and $1_{y}$ is the one-hot encoding of the ground-truth label y. I-FGSM is a straightforward extension of FGSM that applies the update with an iterative optimization strategy, and R+FGSM significantly increases the power of FGSM by adding Gaussian noise to the inputs before computing the gradient.
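A minimal TensorFlow sketch of the FGSM update in Eq. (2) is given below; the helper name fgsm, the use of categorical cross-entropy as the loss, and the [0, 1] clipping range are our assumptions (in practice the CleverHans implementation can be used instead).

import tensorflow as tf

def fgsm(model, x, y_onehot, alpha=0.3):
    """One-step FGSM: x* = x + alpha * sign(grad_x loss(1_y, F(x))), Eq. (2)."""
    x = tf.convert_to_tensor(x)
    with tf.GradientTape() as tape:
        tape.watch(x)
        loss = tf.keras.losses.categorical_crossentropy(y_onehot, model(x))
    grad = tape.gradient(loss, x)
    x_adv = x + alpha * tf.sign(grad)
    return tf.clip_by_value(x_adv, 0.0, 1.0)  # keep inputs in a valid range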

The Carlini $L_{\infty}$ attack is a stronger single-substitute attack method proposed recently. It finds the adversarial perturbation $r$ by using an auxiliary variable $\omega$:

$$ r = \frac{1}{2}\left(\tanh(\omega) + 1\right) - x $$
(3)

The attack then optimizes the auxiliary variable $\omega$ by minimizing

$$ \min_{\omega}\; \left\| \frac{1}{2}\left(\tanh(\omega) + 1\right) - x \right\| + c \cdot f\left(\frac{1}{2}\left(\tanh(\omega) + 1\right)\right) $$
(4)

The function $f(\cdot)$ is defined as

$$ f(x) = \max\left(Z(x)_{y} - \max\left\{Z(x)_{i} : i \ne y\right\}, -\kappa\right) $$
(5)

where $Z(x)_{i}$ is the logit for class i, and $\kappa$ controls the confidence gap between the adversarial class and the true class y.
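For reference, the margin term of Eq. (5) in its untargeted form can be sketched on a batch of logits as follows; the function name cw_margin_loss and the masking constant are illustrative assumptions.

import tensorflow as tf

def cw_margin_loss(logits, y_onehot, kappa=0.0):
    """f(x) = max(Z(x)_y - max_{i != y} Z(x)_i, -kappa), Eq. (5).

    logits has shape (batch, classes); y_onehot is the one-hot true label."""
    true_logit = tf.reduce_sum(logits * y_onehot, axis=1)
    other_logit = tf.reduce_max(logits - 1e9 * y_onehot, axis=1)  # mask the true class
    return tf.maximum(true_logit - other_logit, -kappa)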

Yet these single-substitute attack algorithms achieve unsatisfactory attack performance in the black-box attack scenario. We therefore ensemble multiple pre-trained substitutes to produce adversarial examples with stronger transferability, in the form of an iterative cascade ensemble and a stack parallel ensemble, as illustrated in Fig. 1.

Fig. 1. Illustration of the Iterative Cascade Ensemble Strategy (a) and the Stack Parallel Ensemble Strategy (b).

2.1 Iterative Cascade Ensemble Strategy

The iterative cascade ensemble strategy employs a cascade structure, as shown in Fig. 1(a), where each substitute in the cascade receives the adversarial examples $x_{j}^{*}$ ($j \in [0, k]$) generated by its preceding substitute and passes its own outputs to the next substitute. During each iteration, the output of the k-th substitute, $x_{k}^{*}$, is fed back as the input to the first substitute. The output of the k-th substitute after $\rho$ iterations gives the final adversarial examples. Before implementing the iterative cascade ensemble strategy, the adversary first needs to train k heterogeneous substitute models on various synthetic datasets, which are constructed from observed input-output pairs and their augmentation with the Jacobian-based technique [9]. To obtain more effective adversarial examples, each substitute is trained with a different deep neural network architecture. Afterwards, FGSM or the Carlini $L_{\infty}$ attack is adopted as the base attack algorithm for each substitute to craft adversarial examples. Finally, the adversary cascades the pre-trained substitutes and iteratively maximizes the loss of each substitute to obtain the final adversarial examples. The iterative cascade attack procedure is outlined in Algorithm 1.

Algorithm 1. Iterative cascade ensemble attack.

The algorithm first initializes all input variables $\varepsilon$, $\alpha$, k, $\rho$ (where $\varepsilon = \alpha/2$ and $k = \rho$) and adds Gaussian noise to the original normal examples. For each substitute, the standard cross-entropy loss function [18] is constructed, and its gradient is computed to maximize the loss under the $L_{\infty}$ distance metric. The gradient of the loss function determines the direction in which each feature should be changed. During each iteration, the generated adversarial example $x_{k}^{*}$ is assigned to $x_{0}^{*}$ as the input of the first substitute. When the loop ends, the final transferable adversarial examples are obtained from the output of the k-th substitute.
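A simplified sketch of this cascade loop is shown below, reusing the hypothetical fgsm helper from the earlier sketch as the per-substitute attack; the Gaussian initialization scale and the clipping range are our assumptions.

import tensorflow as tf

def cascade_attack(substitutes, x, y_onehot, alpha=0.3, rho=None):
    """Iterative cascade ensemble attack (Algorithm 1, sketched).

    substitutes: list of k pre-trained substitute models.
    rho: number of outer iterations; the paper sets rho = k."""
    k = len(substitutes)
    rho = rho if rho is not None else k
    eps = alpha / 2.0
    # Start from the original examples perturbed with Gaussian noise.
    x_adv = tf.clip_by_value(x + eps * tf.random.normal(tf.shape(x)), 0.0, 1.0)
    for _ in range(rho):
        for model in substitutes:      # pass x* through the whole cascade
            x_adv = fgsm(model, x_adv, y_onehot, alpha)
        # the output of the k-th substitute feeds the first one next round
    return x_adv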

2.2 Stack Parallel Ensemble Strategy

The stack parallel ensemble strategy employs a parallel structure, as shown in Fig. 1(b), where each substitute in the parallel stage receives the original legitimate example x, and the outputs $x_{j}^{*}$ ($j \in [1, k]$) are combined linearly to form the new input of the (k+1)-th substitute. The output of the (k+1)-th substitute gives the final adversarial examples. Before implementing the parallel ensemble strategy, the adversary first needs to train k+1 heterogeneous substitute models on various synthetic datasets, which are constructed from observed input-output pairs and their augmentation with the Jacobian-based technique [9]. To obtain more effective adversarial examples, each substitute is again trained with a different deep neural network architecture. Afterwards, FGSM or the Carlini $L_{\infty}$ attack is adopted as the base attack algorithm for each substitute to craft adversarial examples. Finally, the adversary runs the pre-trained substitutes in parallel and maximizes the loss of each substitute to obtain adversarial examples. The stack parallel attack procedure is outlined in Algorithm 2.

Algorithm 2. Stack parallel ensemble attack.

The algorithm again initializes all input variables $\varepsilon$, $\alpha$, k and $x_{mid}^{*}$ (where $\varepsilon = \alpha/2$ and $x_{mid}^{*} = 0^{D}$), adds Gaussian noise to the original legitimate examples, and computes the gradient of the constructed loss function. The gradient of the loss function determines the direction in which each feature should be changed. For the top-k substitutes, the generated adversarial examples $x_{j}^{*}$ ($j \in [1, k]$) are combined linearly and saved to $x_{mid}^{*}$ as the new input of the (k+1)-th substitute. The final transferable adversarial examples are obtained from the output of the (k+1)-th substitute.
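The parallel counterpart can be sketched as follows; here the linear combination of the top-k outputs is assumed to be a simple average, which is one plausible reading of the description, and the function names are ours.

import tensorflow as tf

def parallel_attack(substitutes, final_substitute, x, y_onehot, alpha=0.3):
    """Stack parallel ensemble attack (Algorithm 2, sketched).

    substitutes: the top-k substitute models run in parallel.
    final_substitute: the (k+1)-th substitute that receives the combination."""
    eps = alpha / 2.0
    x_noisy = tf.clip_by_value(x + eps * tf.random.normal(tf.shape(x)), 0.0, 1.0)
    # Each of the k substitutes attacks the same noisy input.
    outputs = [fgsm(model, x_noisy, y_onehot, alpha) for model in substitutes]
    # Linear combination (assumed here to be the mean) forms x*_mid.
    x_mid = tf.add_n(outputs) / float(len(outputs))
    # The (k+1)-th substitute produces the final adversarial examples.
    return fgsm(final_substitute, x_mid, y_onehot, alpha)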

2.3 Strategy Analysis

Empirical evidence has shown that adversarial examples appear in wide regions, spanning a contiguous subspace of high dimensionality, and a large portion of this space is shared between different models, which enables transferability [1, 7, 19]. The Gradient Aligned Adversarial Subspace (GAAS) method [7] finds multiple independent orthogonal adversarial directions to directly evaluate the dimensionality of the adversarial subspace. This dimensionality is relevant to the transferability problem: the higher the dimensionality, the more likely the adversarial subspaces of the substitute and the target model intersect significantly. As argued in [7], the decision boundaries learned by the substitute and the target model must be extremely close to each other in the adversarial direction. The adversarial direction is defined by $x$ and $x^{*}$ as $d_{adv} = (x^{*} - x)/\|x^{*} - x\|_{2}$, where the adversarial example $x^{*}$ (blue dot) is generated from the test example $x$ (brown dot) so as to be misclassified by the substitute F(x): $\mathop{\mathrm{argmin}}_{\varepsilon > 0} F(x^{*} := x + \varepsilon \cdot d_{adv}) \ne F(x)$, as shown in Fig. 2(a). That is, the cross-boundary distance (the red double-ended arrow) in the adversarial direction between the decision boundaries of the substitute and the target model must be very short. In other words, the shorter the distance, the stronger the transferability.

Fig. 2. Illustration of a binary misclassification procedure in the adversarial direction over a 2D input domain. (Color figure online)

In practice, it is difficult to guarantee that the trained substitute accurately approximates the target black-box model, and the adversarial direction is also not unique, which leads to the weak transferability of the crafted adversarial examples. However, if adversarial examples remain adversarial for multiple substitutes, they are more likely to transfer and mislead the target model, as shown in Fig. 2(b). From Fig. 2(b), we can observe that an adversarial example (blue dot) generated by our proposed ensemble-based black-box attack strategies, which crosses the decision boundaries of k (e.g., k = 3) substitutes, has a greater probability of crossing the decision boundary of the target model. This illustrates that the ensemble-based black-box attack strategies effectively shorten the cross-boundary distance and improve the transferability of the generated adversarial examples.
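To make the cross-boundary distance concrete, the following sketch estimates the smallest step $\varepsilon$ along $d_{adv}$ at which a model's prediction changes; the line-search granularity and the function name cross_boundary_distance are our assumptions.

import numpy as np

def cross_boundary_distance(predict_fn, x, x_adv,
                            eps_grid=np.linspace(0.0, 1.0, 101)):
    """Smallest eps > 0 with predict(x + eps * d_adv) != predict(x) (Sect. 2.3).

    predict_fn maps a batch of inputs to class labels; returns None if the
    boundary is not crossed within the searched range."""
    d_adv = (x_adv - x) / (np.linalg.norm(x_adv - x) + 1e-12)
    base = predict_fn(x[np.newaxis])[0]
    for eps in eps_grid[1:]:
        if predict_fn((x + eps * d_adv)[np.newaxis])[0] != base:
            return eps
    return None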

3 Experiments

All experiments use the TensorFlow framework and the CleverHans library. To demonstrate the effectiveness and feasibility of the proposed ensemble-based black-box attack strategies, we empirically compare them with the conventional single-substitute attack algorithms described previously, i.e., FGSM, I-FGSM, R+FGSM and the Carlini $L_{\infty}$ attack, and expose the potential factors that contribute to high-efficiency attacks.

3.1 Setup

Four benchmark datasets for two tasks, i.e., digit recognition and traffic sign recognition, are used in the experiments. Details about the datasets are listed in Table 1. The target classifiers, treated as black-box models in this work, are trained on the training data of each dataset. For each dataset, a few held-out test examples are used as query inputs to the target classifier, and the observed input-output pairs are augmented to produce synthetic datasets. Then, diverse convolutional neural network architectures, as shown in Table 2, are selected to train substitutes on the various synthetic datasets, and these substitutes are ensembled to implement the black-box attack tasks.

Table 1. Summary of 4 benchmark datasets
Table 2. Neural network architectures used in this work for substitute and target model training. Conv: convolutional layer, FC: fully connected layer, ReLU: activation function

Two measurements, the success rate and the transfer rate, are defined to evaluate the vulnerability of DL models according to Eqs. (6) and (7).

$$ \frac{1}{nk}\sum\limits_{j=1}^{k} \sum\limits_{i=1}^{n} \mathbb{I}\left(F_{j}(x_{i}^{*}) \ne F_{j}(x_{i})\right) $$
(6)
$$ \frac{1}{n}\sum\limits_{i=1}^{n} \mathbb{I}\left(O(x_{i}^{*}) \ne O(x_{i})\right) $$
(7)

where $\mathbb{I}(\cdot) = 1$ if the generated adversarial example is misclassified and 0 otherwise. These two metrics measure the error rate of the substitutes and of the target model, respectively.
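The two metrics of Eqs. (6) and (7) can be computed directly from predicted labels; a NumPy sketch is given below, where the function and argument names are ours.

import numpy as np

def success_rate(substitute_preds, substitute_adv_preds):
    """Eq. (6): average error rate over k substitutes and n examples.

    Both arguments have shape (k, n) and hold predicted class labels."""
    return float(np.mean(substitute_adv_preds != substitute_preds))

def transfer_rate(target_preds, target_adv_preds):
    """Eq. (7): fraction of adversarial examples that change the target's output."""
    return float(np.mean(target_adv_preds != target_preds))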

3.2 Results

This section first quantitatively analyzes the vulnerability of DL models under the success rate and transfer rate measurements. Afterwards, we empirically compare against the conventional single-substitute attack algorithms based on FGSM and the Carlini $L_{\infty}$ attack on different datasets. Finally, possible factors that contribute to a higher transfer rate are explored from two aspects: the diversity of substitutes and the number of substitutes k.

Figure 3 demonstrates that deep learning models are extremely susceptible to adversarial examples generated by the proposed ensemble-based black-box attack strategies under different perturbation amplitudes $\alpha$.

Fig. 3. Success rate and transfer rate of adversarial examples generated by ensemble-based black-box attack strategies under different perturbation amplitudes on MNIST and GTSRB.

The transferability of adversarial examples generated by each single substitute and by cascading or paralleling any k substitutes (e.g., k = 3, 5) is illustrated in Table 3 and Fig. 4. The experiments demonstrate that adversarial examples crafted by the iterative cascade ensemble strategy achieve a dramatically higher transfer rate than those of the stack parallel ensemble strategy, and both obtain superior attack performance compared to the single-substitute attack algorithms. We also observe that the optimization-based algorithm (e.g., the Carlini $L_{\infty}$ attack) applied to each substitute in the iterative cascade ensemble obtains greater transferability than the gradient-based algorithm (e.g., FGSM). Figure 5 demonstrates that our proposed ensemble-based black-box attack strategies remain effective against a target classifier trained with the ensemble adversarial training defense mechanism.

Table 3. Transfer rate of adversarial examples generated by single-substitute attacks, the iterative cascade ensemble strategy and the stack parallel ensemble strategy based on FGSM and the Carlini $L_{\infty}$ attack for different datasets.
Fig. 4. Transfer rate of adversarial examples crafted by different attack strategies on the two major classification tasks. Ensemble strategies compared with single-substitute attack algorithms based on FGSM under different perturbation amplitudes $\alpha$ are shown in (a)-(d). Ensemble strategies compared with single-substitute attack algorithms based on the Carlini $L_{\infty}$ attack under different confidence values $\kappa$ are shown in (e) and (f).

Fig. 5. Weak defense performance of the target classifier trained with the ensemble adversarial training defense mechanism.

Moreover, possible factors that contribute to a higher transfer rate are explored from two perspectives: the diversity of substitutes and the number of substitutes k.

(1) The number of substitutes k. The experimental results for our proposed ensemble-based black-box attack strategies are shown in Fig. 6, which indicates that the larger the value of k, the higher the transfer rate of the generated adversarial examples.

Fig. 6. Transfer rate of adversarial examples crafted by the iterative cascade ensemble strategy and the stack parallel ensemble strategy with different numbers of substitutes k.

(2) The diversity of substitutes. Two averaged pairwise measures [20] (the Q statistic and the correlation coefficient $\rho$) and two non-pairwise measures [20] (the entropy measure E and the Kohavi-Wolpert variance KW) are selected to analyze the relationship between the diversity of the substitutes and the transferability of the generated adversarial examples. Experimental results are listed in Table 4, where I, II, III and IV represent four strategies for generating the substitutes: the substitutes are identical, trained with different training sets, trained with different architectures, and trained with both different training sets and different architectures, respectively. The comparative results demonstrate that the greater the diversity of the substitutes, the stronger the transferability of the adversarial examples. Thus, the identical substitutes used in I-FGSM obtain the lowest transfer rate, as shown in Fig. 4.

Table 4. The relationship between the diversity of substitute cascade/parallel ensembles and the transferability of the generated adversarial examples. (↑) indicates that the diversity measure value increases, (↓) indicates that it decreases.

4 Conclusion

In this paper, we propose two ensemble-based black-box attack strategies, the iterative cascade ensemble strategy and the stack parallel ensemble strategy, to explore the vulnerability of deep learning systems. Experimental results show that our proposed ensemble adversarial attack strategies can successfully attack deep learning systems trained with the ensemble adversarial training defense mechanism. Adversarial examples generated by the iterative cascade ensemble strategy achieve dramatically better transferability than those of the stack parallel ensemble strategy, and both obtain superior attack performance compared to single-substitute attack algorithms. We also observe that the diversity of the substitute ensemble is an important factor influencing the transferability of the generated adversarial examples.