
1 Introduction

Deep neural networks (DNNs) have been widely used in many areas, such as computer vision [1] and natural language processing [2]. In addition, a variety of learning paradigms have been proposed, such as reinforcement learning [3, 4] and federated learning [5]. This brings into view not only the security of data [6] but also the security of models. Well-trained deep models are valuable assets to their owners; however, they may be possessed or tampered with illegally. For example, customers who buy a DNN model might distribute it beyond the license agreement, or attackers may inject backdoors into the model.

Malicious attacks such as adversarial examples [7], data poisoning [8], and backdoor attacks [8] are common in deep learning, and various measures have been taken to address these security issues. Among them, neural network watermarking is a promising research area. Digital watermarking is a traditional technique used for copyright protection or integrity authentication of digital products, and neural network watermarking extends this concept to neural networks. Neural network watermarking techniques can be classified into two types: robust and fragile watermarking.

So far, most published research concerns robust watermarking techniques for protecting the copyright of DNN models. The word robust means these methods are insensitive to changes that aim to remove the embedded watermark. Robust watermarking techniques can roughly be divided into two categories: weight-parameter-based methods [9,10,11,12,13] and trigger-set-based methods [14,15,16,17,18]. The former are white-box schemes, in which the details of the network parameters are needed, while the latter are black-box methods requiring no inner parameters of the models. In these methods, a trigger image set is built in advance, and these images may be assigned false labels that are irrelevant to their content. In the verification process, the watermarked model’s classification results on the trigger set can be used directly for authentication.

Fragile watermarking [19] was originally designed for multimedia authentication; it is sensitive to content modification and is usually perceptually transparent. Here we migrate the concept of traditional fragile watermarking to neural networks. For DNN fragile watermarking, the following properties should be considered. First, it should require low training cost and be easy to embed into and extract from the model. Second, the embedded watermark should be imperceptible and have little impact on the model’s original performance. Third, there should be quantifiable metrics for authenticating malicious tampering. Fourth, it should be extensible and widely applicable to other networks and datasets.

Formally, an image classifier \(\mathcal {C_\theta }\) is obtained by a supervised learning task that finds a classification function \(\mathcal {F}\) to classify the images in the training set Tra under the classification loss \(\mathcal {L}_{cla}\), i.e., \(\mathcal {C_\theta }={\mathcal {L}}_{cla}(\mathcal {F}(Tra))\). Usually, trigger-set-based watermarking methods need a trigger image set Tri apart from the training set, where images in Tri are stamped with preset labels according to some rules. The watermarked classifier \(\mathcal {C_\theta }_w\) tries to classify the images in both the training set and the trigger set, i.e., \(\mathcal {C_\theta }_w=\mathcal {L}_{cla}(\mathcal {F}(Tra\ \cup \ Tri))\). Watermarked models gain the ability to recognize both normal images and trigger images by training the network from scratch or by fine-tuning trained models. For an unmodified watermarked model, \(\mathcal {C_\theta }_w\) is supposed to output the predefined labels for input trigger images.

DNN models are vulnerable to malicious attacks such as data poisoning. Many DNN models now require multi-party training, and a participant might inject a backdoor into the network while updating parameters. Typical data poisoning behaviors can be classified into the following three kinds:

  1) Simple data poisoning [8]: attackers attempt to reduce the model performance by introducing many mislabeled samples into the training set, which can be expressed as \(\mathcal {C_\theta }_p=\mathcal {L}_{cla}(\mathcal {F}(Tra\ \cup \ Mislabeled\ Samples))\).

  2) Backdoor data poisoning [8]: a more imperceptible way of model tampering that adds some poisoned samples with a fixed backdoor pattern to the training set, that is, \(\mathcal {C_\theta }_b=\mathcal {L}_{cla}(\mathcal {F}(Tra\ \cup \ Backdoor\ Samples))\). After training, the backdoor pattern can be added to normal samples to obtain the expected output label \(y_b\), i.e., \(\mathcal {C_\theta }_b(x \oplus Backdoor\ pattern)=y_b\). This data poisoning method is difficult to detect, since only a small number of backdoor samples are needed and the injected backdoor has little impact on the model’s performance.

  3) Label-consistent data poisoning [20]: label consistency means there is no tampering with image labels. This kind of attack often uses adversarial examples or GANs to recreate the samples in the training set, making it more difficult for the DNN to learn the features of the image content, so the backdoor attack succeeds because the network focuses mostly on the backdoor pattern. Since traditional backdoor samples with wrong labels are easily detected by inspecting the training set, label-consistent data poisoning methods have drawn much attention. (The first two strategies are sketched in code after this list.)
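As an illustration of the first two strategies, the following minimal sketch shows how mislabeled samples and backdoored samples with a target label \(y_b\) could be constructed. It assumes NumPy arrays of HWC images and integer labels; the function names and the corner placement of the pattern are our own choices, not taken from the paper.

```python
import numpy as np

def make_mislabeled_samples(images, labels, n_poison, num_classes, rng):
    """Simple data poisoning: copies of training images given random wrong labels."""
    idx = rng.choice(len(images), size=n_poison, replace=False)
    offset = rng.integers(1, num_classes, size=n_poison)      # guarantees a wrong label
    return images[idx].copy(), (labels[idx] + offset) % num_classes

def make_backdoor_samples(images, labels, n_poison, target_label, pattern, rng):
    """Backdoor data poisoning: stamp a fixed pattern and relabel to the target class y_b."""
    idx = rng.choice(len(images), size=n_poison, replace=False)
    poisoned = images[idx].copy()
    h, w, _ = pattern.shape
    poisoned[:, -h:, -w:, :] = pattern                         # pattern in a fixed corner
    return poisoned, np.full(n_poison, target_label)
```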

Some approaches have been proposed to detect malicious tampering of DNN models. In [21], a detection and mitigation system for DNN backdoor attacks is proposed, in which backdoors can be identified and mitigation techniques are shown to remove them. In [22], a black-box backdoor detection scheme is presented that requires minimal prior knowledge of the model. Both are solutions for detecting the existence of backdoors after the fact, and no precautions are taken to prevent backdoor insertion. In [13], a reversible watermarking algorithm for integrity authentication is proposed, in which the parameters of the model can be fully recovered after extracting the watermark and the integrity of the model can be verified by the reversible watermark. However, it is a white-box method requiring the details of the network, which is inconvenient for watermark embedding and extraction. Therefore, we propose a black-box integrity authentication method based on fragile watermarking, in which a trigger set is used and no inner parameters of the model are revealed. We call it fragile neural network watermarking with a trigger image set.

The contributions of this paper are summarized as follows:

  • We are the first to propose a black-box fragile watermarking method for authenticating the integrity of DNN classifiers.

  • We elaborately design a novel loss function \(\mathcal {L}_{var}\) and an alternate two-stage training strategy, with which the fragile watermark can be embedded easily into the neural network. Meanwhile, two easily accessible metrics are designed for model authentication, which can be obtained quickly by only checking the classification outputs of the trigger images.

  • Our proposed watermarking method has good compatibility and extensibility, and experiments on three benchmark datasets show that the embedded watermark has little impact on the original task of the network.

The rest of this paper is organized as follows: Sect. 2 describes the proposed fragile watermarking and the authentication metrics. Then, the properties of our scheme are demonstrated in Sect. 3. Finally, the conclusion is given in Sect. 4.

2 Fragile Watermarking

2.1 Application Scenario

Before introducing our proposed watermarking method, let us describe the application scenario. Consider three parties: the model trainer, the consumer, and the attacker. The training of a complex network requires multiple stages of adjustment, and an attacker among the trainers may exploit the convenience of accessing the training data for poisoning, resulting in a backdoored network or one performing worse than expected. Attackers could even poison models delivered to consumers. To this end, we introduce a fragile watermarking method suitable for neural networks to verify the integrity of watermarked models.

2.2 Watermarking Methodology

Figure 1 shows an overview of the proposed watermarking method. The whole process is divided into three steps: the first step is to construct a trigger image set using a secret key; the second step is to embed a fragile watermark through a two-stage training procedure; the last step is the authentication process, which determines whether a model has been tampered with through two proposed metrics.

Fig. 1. An overview of the proposed fragile neural network watermarking method. The process of our scheme is divided into three steps. The fragile watermark is embedded by fine-tuning the well-trained models in a two-stage alternate training manner until the fine-tuned models satisfy the following conditions: \(Acc_{trig}=1\) and \(Acc_{val}\ge expected\ value\). At last, the integrity of a fragile watermarked model can be authenticated by evaluating two metrics: \(Acc_{trig}<1\) or \(Var_{dif}>0\) indicates tampering.

The first step of our scheme is to generate L pseudo-random trigger images with the secret key key specified by the model trainer. Each trigger image is assigned a fixed label, which can be expressed as follows:

$$\begin{aligned} \{Image_{i}, Label_{i}\} \leftarrow \{(I_i,\ i\%C)\ |\ i=0, 1,\cdots ,L-1\}, \end{aligned}$$
(1)

where \( Image_{i}\) or \(I_i\) represents a trigger image and \(Label_{i}\) is its preset label \(i\%C\). Here, \(\%\) is the modulo operation and C is the total number of classes in the training dataset. Figure 2 shows three trigger samples used in our experiments.
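To make the trigger construction concrete, here is a minimal sketch under our own assumptions: a NumPy generator seeded by the key, \(32\times 32\times 3\) images matching cifar, and L = 36 triggers as in our experiments. The paper does not prescribe a specific pseudo-random generator.

```python
import numpy as np

def build_trigger_set(key, L, C, shape=(32, 32, 3)):
    """Generate L pseudo-random trigger images from a secret key; label i is i % C (Eq. (1))."""
    rng = np.random.default_rng(key)                 # the secret key seeds the generator
    images = rng.integers(0, 256, size=(L,) + shape, dtype=np.uint8)
    labels = np.arange(L) % C
    return images, labels

trigger_images, trigger_labels = build_trigger_set(key=42, L=36, C=10)
```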

Fig. 2. Three examples of trigger images.

The second step of our method is the watermark embedding process. The proposed fragile watermark is usually embedded into a well-trained model by a two-stage fine-tuning procedure, with the classification loss \(\mathcal {L}_{cla}\) and the fragile watermarking loss \(\mathcal {L}_{var}\), which are defined as follows:

$$\begin{aligned} \mathcal {L}_{cla}=- \sum _{j=0}^{C-1}y_{j}\mathrm {log}(p_j), \end{aligned}$$
(2)
$$\begin{aligned} \mathcal {L}_{var}=\mathcal {L}_{cla}+\alpha \cdot \mathrm {Var}(\mathcal {P}), \end{aligned}$$
(3)

where \( \mathcal {L}_{cla} \) is the cross-entropy loss for multi-class classification. For fragile watermarking, the watermarked model should be sensitive to modification, so a regularization term \(\mathrm {Var}(\mathcal {P})\) is added in \(\mathcal {L}_{var}\). Here \(\alpha \) is a weight coefficient and \(\mathcal {P}\) is the vector of classification results for a trigger image after the Softmax operation, i.e., \(\mathcal {P}=\{p_j\ |\ j=0,1,\cdots ,C-1\}\) (\(\sum p_j=1\)), where \( p_{j}\) is the predicted probability of each class, which falls into (0, 1), and \(\mathrm {Var}(\mathcal {P})\) is the variance of \(\mathcal {P}\).
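A possible PyTorch realization of Eq. (3) is sketched below; using the population variance and averaging it over the batch are our own assumptions.

```python
import torch
import torch.nn.functional as F

def fragile_watermark_loss(logits, targets, alpha):
    """L_var = cross-entropy (Eq. (2)) + alpha * variance of the softmax probability vector."""
    ce = F.cross_entropy(logits, targets)            # L_cla
    probs = F.softmax(logits, dim=1)                 # P = {p_j}, sums to 1 per sample
    var = probs.var(dim=1, unbiased=False).mean()    # Var(P), averaged over the batch
    return ce + alpha * var
```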

The watermark embedding process is divided into two stages. In the first stage, the training set (Tra) along with the whole trigger set (Tri) is used for model fine-tuning with loss \(\mathcal {L}_{cla}\), where Tra is made up of part of the images from the raw training dataset. The model is trained to recognize the images in both the training set and the trigger set. In the second stage, loss \(\mathcal {L}_{var}\) is used to fine-tune the model only on the trigger set, with the purpose of reducing the variance of the predicted probability vector while still classifying all the trigger images correctly. The embedding process does not stop until two conditions are satisfied: 1) the classification accuracy on trigger samples is equal to 1, i.e., \(Acc_{trig}=1\); 2) the classification accuracy on normal samples in the validation set is equal to or greater than the expected value, that is, \(Acc_{val} \ge expected \). With this alternate training method, the proposed fragile watermark is embedded easily into DNN models, and the classification accuracy on normal images is usually not lower than that of the original un-watermarked model.
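The alternate procedure could look like the following sketch (PyTorch, reusing fragile_watermark_loss from the previous sketch); the loader names and the max_rounds cap are illustrative rather than the exact training script.

```python
import itertools
import torch
import torch.nn.functional as F

def embed_fragile_watermark(model, tra_loader, tri_loader, val_loader,
                            optimizer, alpha, expected_val_acc, max_rounds=100):
    """Alternate the two fine-tuning stages until Acc_trig == 1 and Acc_val >= expected."""
    for _ in range(max_rounds):
        # Stage 1: fine-tune on Tra together with Tri using the plain classification loss L_cla.
        model.train()
        for x, y in itertools.chain(tra_loader, tri_loader):
            optimizer.zero_grad()
            F.cross_entropy(model(x), y).backward()
            optimizer.step()
        # Stage 2: fine-tune on the trigger set only, using L_var.
        model.train()
        for x, y in tri_loader:
            optimizer.zero_grad()
            fragile_watermark_loss(model(x), y, alpha).backward()
            optimizer.step()
        # Stop once both embedding conditions hold.
        if accuracy(model, tri_loader) == 1.0 and accuracy(model, val_loader) >= expected_val_acc:
            break
    return model

@torch.no_grad()
def accuracy(model, loader):
    model.eval()
    correct = total = 0
    for x, y in loader:
        correct += (model(x).argmax(dim=1) == y).sum().item()
        total += y.numel()
    return correct / total
```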

Here, we also define \(\mathcal {C_\theta }_{robust}\) as a classifier with a robust watermark. It is trained on both the training set and the trigger set with loss \(\mathcal {L}_{cla}\), which means only the first stage of watermark embedding is used. It can be expressed as \(\mathcal {C_\theta }_{robust}=\mathcal {L}_{cla}(\mathcal {F}(Tra\ \cup \ Tri))\). This is also the way that many previous methods [14,15,16,17,18] embed robust watermarks. In this case, the output of the watermarked model on the trigger set is not easily changed by fine-tuning the model; in other words, the watermark is not fragile.

The performance of the fragile watermarked model is closely related to the size of Tra, which reflects the trade-off between watermark embedding efficiency and model performance. When the size of Tra declines, watermark embedding takes less time, yet the classification accuracy on normal images also decreases. Hence, we randomly take 10% of the images in the raw training database as the training set Tra. The sensitivity of the watermark is enhanced gradually as \(\alpha \) increases, but an excessively large value can also lead to performance degradation. We therefore limit \(\alpha \) to the range from 0 to several hundred in our experiments.
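A simple way to draw such a 10% subset is sketched below with torch.utils.data.Subset; the helper name and the fixed seed are our own choices.

```python
import torch
from torch.utils.data import Subset

def sample_tra_subset(full_train_set, fraction=0.1, seed=0):
    """Randomly keep a fraction (10% by default) of the raw training data as Tra."""
    g = torch.Generator().manual_seed(seed)
    idx = torch.randperm(len(full_train_set), generator=g)[: int(fraction * len(full_train_set))]
    return Subset(full_train_set, idx.tolist())
```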

The last step of our scheme is model authentication. After the DNN model is acquired, its predicted labels on the trigger images are quantified into two authentication metrics to decide whether the model has been modified.

2.3 Authentication Metrics

Two novel authentication metrics are proposed to verify whether a watermarked model has been modified. The first one is \(Acc_{trig}\), the classification accuracy on the input trigger images. If \(Acc_{trig} < 1\), the model has been modified, and the stronger the malicious attack is, the more \(Acc_{trig}\) drops. If a model is modified, the trigger images may be classified into different classes as follows:

$$\begin{aligned} \mathcal {N}=\{n_0, n_1, \dots , n_i, \dots , n_{C-1}\}, \end{aligned}$$
(4)

where \(n_i\) is the number of trigger samples classified into the i-th class, and the sum of all \(n_i\) equals the number of images in the trigger set. Similarly, \(\mathcal {N}_0\) is the corresponding statistic of the unmodified model. Based on these two values, the second metric of our scheme is given:

$$\begin{aligned} Var_{dif}=\mathrm {Var}(\mathcal {N})-\mathrm {Var}(\mathcal {N}_0), \end{aligned}$$
(5)

where \(Var_{dif}\) measures the difference of the classification results before and after the watermarked model is tampered with. The value of \(Var_{dif}\) reflects the degree to which the model has been modified; it is always greater than or equal to zero, and a large value means the watermarked model has been modified severely.
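Both metrics can be computed directly from the predicted labels of the trigger images. The sketch below (NumPy; helper names are ours) follows Eqs. (4) and (5) and the decision rule applied in Sect. 3.3.

```python
import numpy as np

def authentication_metrics(pred_labels, trigger_labels, N0):
    """Compute Acc_trig and Var_dif (Eqs. (4)-(5)) from trigger-set predictions."""
    pred_labels = np.asarray(pred_labels)
    acc_trig = float(np.mean(pred_labels == np.asarray(trigger_labels)))
    N = np.bincount(pred_labels, minlength=len(N0))   # n_i: triggers predicted as class i
    var_dif = float(np.var(N) - np.var(N0))           # Var(N) - Var(N_0)
    return acc_trig, var_dif

def is_tampered(acc_trig, var_dif):
    """Decision rule of Sect. 3.3: any drop of Acc_trig below 1 or a positive Var_dif."""
    return acc_trig < 1.0 or var_dif > 0.0
```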

3 Experiments

In our paper, watermarking experiments are conducted with resnet18 [23] and resnet50 [23], which were trained on cifar-10 & cifar-100 and on Caltech101, respectively. The information of these datasets is shown in Table 1. We trained the classifiers from scratch with the following configurations: the optimizer of resnet18 is SGD with momentum with a learning rate of 0.1, which is divided by ten every twenty epochs; the optimizer of resnet50 is Adam with a learning rate of 1e−4, which is reduced to 1e−5 in the last 20 epochs. The fragile watermark is then embedded by fine-tuning the trained classifiers. During watermarking or attacking, the optimizer remains unchanged, and the learning rate is set to the value used in the final stage of training.
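For reference, the corresponding optimizer setups might be configured as follows; the momentum value 0.9 is an assumption (the paper only states SGD with momentum), and the Adam learning-rate drop to 1e−5 is applied manually in the last 20 epochs.

```python
import torch
from torchvision.models import resnet18, resnet50

# resnet18 for cifar-10/100: SGD with momentum (0.9 assumed), lr 0.1, divided by 10 every 20 epochs.
model18 = resnet18(num_classes=10)
opt18 = torch.optim.SGD(model18.parameters(), lr=0.1, momentum=0.9)
sched18 = torch.optim.lr_scheduler.StepLR(opt18, step_size=20, gamma=0.1)

# resnet50 for Caltech101: Adam, lr 1e-4, reduced to 1e-5 for the last 20 epochs.
model50 = resnet50(num_classes=101)
opt50 = torch.optim.Adam(model50.parameters(), lr=1e-4)
```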

Table 1. Dataset information
Fig. 3. The accuracy curves with different \(\alpha \) values used during watermark embedding. Val Acc and Trigger Acc denote the classification accuracy on the validation set and trigger set, respectively.

3.1 Visualization of Watermarking

In this part, we compare the outputs of fragile watermarked models with some visualization methods. Figure 3 records the classification accuracy during the watermark embedding fine-tuning process, where Val Acc and Trigger Acc denote the classification accuracy on the validation set and trigger set, respectively. When \(\alpha \) is 0, a robust watermark instead of a fragile watermark is embedded; as \(\alpha \) increases, the sensitivity of the watermark increases and fragile watermark embedding takes more epochs. However, Fig. 3 shows that increasing \(\alpha \) has little effect on the performance of the classifiers, because the embedding fine-tuning process does not stop until the two conditions are satisfied, i.e., \(Acc_{trig}=1\) and \( Acc_{val} \ge expected \). Thus, \(\alpha \) can be picked freely from zero to several hundred for watermark embedding.

Fig. 4. Comparison of feature projections extracted by different models. The figures on the left and right are the results of the robust and fragile watermarked models trained on cifar-10, respectively.

The output tensor of the model’s last convolution layer was projected onto a two-dimensional plane with the visualization tool UMAP, as demonstrated in Fig. 4. The left and right figures illustrate the projected features extracted by classifiers with a robust and a fragile watermark trained on cifar-10, respectively, where the 10 larger colored circles represent the projected feature points of 5,000 normal samples, and the purple dots inside the circles are the projected feature points of 36 trigger images, all belonging to the 10 image classes. As shown in Fig. 4, once the model is embedded with the fragile watermark, the projected feature points of the trigger samples deviate from the cluster centers toward the classification boundaries.
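Such a projection can be reproduced with a forward hook and the umap-learn package, roughly as sketched below; the helper name and the flattening of the feature maps are our own choices.

```python
import numpy as np
import torch
import umap

@torch.no_grad()
def project_last_conv_features(model, last_conv, loader):
    """Collect the last convolution layer's outputs and project them to 2-D with UMAP."""
    feats = []
    hook = last_conv.register_forward_hook(
        lambda m, inp, out: feats.append(out.flatten(1).cpu().numpy()))
    model.eval()
    for x, _ in loader:
        model(x)                      # the hook stores the per-batch features
    hook.remove()
    features = np.concatenate(feats, axis=0)
    return umap.UMAP(n_components=2).fit_transform(features)
```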

3.2 Perceptual Transparency

The existence of the fragile watermark has little influence on the original classification ability of the model. Table 2 lists the testing accuracy of un-watermarked models and fragile watermarked models when varying the value of \(\alpha \). The results show that the reduction of accuracy is less than 0.3% after the watermark is embedded, and in some cases the accuracy of the watermarked model even increases slightly compared with that of the clean model.

Table 2. The accuracy of the un-watermarked classifier and the fragile watermarked classifier on normal images in the testing set when varying the value of \(\alpha \)

3.3 Watermarking Property

In this part, we test the properties of the proposed fragile watermarking under malicious attacks. To prove its effectiveness, the following 5 datasets are used to attack the fragile watermarked models with data poisoning:

  1. The original training set (T).

  2. An extra training set (ET), which has the same distribution as T. In our experiment, the validation set of the original database is used as ET.

  3. A poisoned dataset (DP), which contains dozens of mislabeled samples aiming to reduce the model performance.

  4. A simple backdoor poisoned dataset (SB), which contains the images in the training set (T) as well as some mislabeled samples with backdoor patterns.

  5. A label-consistent poisoned dataset (LCB) constructed according to [20].

Fig. 5. The classification accuracy on the trigger set (\(Acc_{trig}\)) when the robust watermarked models (first row) and fragile watermarked models (second row) are fine-tuned with 5 datasets (T, ET, DP, SB, and LCB) on cifar-10 or cifar-100, respectively. A curve label prefixed with Last means only the last layer of the model is fine-tuned; otherwise all layers are fine-tuned.

The five datasets above are created on the basis of an image database such as cifar-10 or cifar-100. As [20] demonstrates, the minimum number of poisoned samples for a backdoor attack should not be lower than 0.15% of the training set size. Thus, for cifar-10, we add only 75 (0.15% of the size) poisoned samples to the datasets DP, SB, and LCB. In our experiments, the backdoor pattern is a \(3\times 3\times 3\) white-and-black square block superimposed on the lower right corner of normal samples.
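A minimal sketch of this setup is given below; the exact white-and-black layout of the \(3\times 3\times 3\) block is not specified in the paper, so the pattern here is only an assumed example.

```python
import numpy as np

# Poison budget: 0.15% of the 50,000 cifar-10 training images.
n_poison = int(0.0015 * 50_000)                 # = 75

# An assumed 3x3x3 white-and-black layout (checker-like) on a black block.
pattern = np.zeros((3, 3, 3), dtype=np.uint8)
pattern[::2, ::2, :] = 255

def stamp_backdoor(images):
    """Superimpose the pattern on the lower-right corner of HWC-format images."""
    stamped = np.array(images, copy=True)
    stamped[..., -3:, -3:, :] = pattern
    return stamped
```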

Figure 5 illustrates the mean accuracy on the trigger set when the watermarked models are fine-tuned with the 5 datasets above. The malicious attacking experiments are conducted either on all layers or only on the last layer of the watermarked models. In Fig. 5, the first and second rows are the attacking results of models with robust and fragile watermarks, respectively. It can easily be seen that the fragile watermarked models are vulnerable to all kinds of modification compared with the robust watermarked models. The accuracy on the trigger set \(Acc_{trig}\) declines rapidly when the fragile watermarked models are fine-tuned with poisoned datasets, and the more the poisoned data differs from the original data, the faster \(Acc_{trig}\) goes down. Thus, the accuracy curves of DP, SB, and LCB decline more quickly than the curves of T and ET.

Table 3. Values of \(Var_{dif}\) when the fragile watermarked model has been fine-tuned with 5 different datasets for ten epochs

Table 3 lists the \(Var_{dif}\) values when the fragile watermarked model is fine-tuned with the 5 datasets mentioned above. As can be seen, the \(Var_{dif}\) values of T and ET are much smaller than those of the other three poisoned datasets, because the images in T and ET are the same as or similar to the images in the raw training set. When fragile watermarked models are maliciously fine-tuned with the poisoned datasets DP, SB, and LCB, \(Var_{dif}\) is much greater than zero. The \(Var_{dif}\) values on cifar-10 are much larger than those on cifar-100, mainly because cifar-10 has fewer image classes and is more sensitive to modification.

To view the variation of \(Var_{dif}\) more intuitively, we depict heat maps of the numbers of classified trigger images according to \(\mathcal {N}\) in Eq. (4). As shown in Fig. 6, the well-trained model is fine-tuned with the four datasets T, DP, SB, and LCB. The horizontal and vertical axes are the class index and the epoch number of fine-tuning, respectively. For each block, the intensity of the color is proportional to the number of trigger samples predicted to fall into the corresponding image category. In the heat maps of the poisoned datasets DP, SB, and LCB, the distribution of colored blocks is more uneven compared with the blocks of T, because the embedded fragile watermark is damaged greatly when the model is fine-tuned with these three poisoned datasets.
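Such heat maps can be produced from the per-epoch class counts \(\mathcal {N}\), for example with matplotlib as sketched below; the function name and color map are our own choices.

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_trigger_heatmap(counts_per_epoch, title="trigger predictions per class"):
    """counts_per_epoch: array of shape (epochs, C); row e holds N of Eq. (4) at epoch e."""
    counts = np.asarray(counts_per_epoch)
    plt.imshow(counts, cmap="Blues", aspect="auto")   # darker block = more trigger images
    plt.xlabel("predicted class")
    plt.ylabel("fine-tuning epoch")
    plt.title(title)
    plt.colorbar(label="number of trigger images")
    plt.show()
```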

Fig. 6. The number of trigger samples classified into each class during the malicious fine-tuning process carried out on 4 datasets. The horizontal axis is the image class and the vertical axis is the epoch. The color depth of a block indicates the number of trigger images predicted as the corresponding class.

After a model is embedded with the proposed fragile watermark, it may be uploaded to the cloud or sent to users directly. To verify the integrity of an acquired model, we can apply the two metrics \(Acc_{trig}\) and \(Var_{dif}\) with the trigger image set offered by the model owner. As long as \(Acc_{trig}<1\) or \(Var_{dif}>0\), we can conclude that the received model has been tampered with.

3.4 Extensibility

Here, we explore the influence of the trigger image set size on model performance. First, we construct several trigger sets with sizes ranging from 0 to 200, and then each trigger set is used to embed a fragile watermark individually. Finally, the classification accuracy on the testing set is reported in Table 4. It can be seen that the size of the trigger image set has no influence on the prediction results of the fragile watermarked models.

Table 4. The classification accuracy of fragile watermarked models on normal images in the testing set with different trigger set sizes used in fragile watermarking

The fragile watermarking method is also evaluated on the Caltech101 dataset. In our experiments, the classifier resnet50 is first embedded with the fragile watermark and then fine-tuned with the 5 datasets (T, ET, DP, SB, and LCB). Caltech101 is a 101-class dataset with an image size of about \(300\times 200\times 3\) pixels, and the backdoor pattern for Caltech101 is a \(9\times 9\times 3\) block. As a result, the testing set accuracy of the un-watermarked resnet50 is 94.46%, while the accuracy of the fragile watermarked classifiers is between 94.23% and 94.49%. The classification ability of the DNN model is thus almost unaffected by the watermarking.

Fig. 7. The classification accuracy on the trigger set (\( Acc_{trig}\)) when 5 datasets (T, ET, DP, SB, and LCB) are used to fine-tune the robust watermarked and fragile watermarked resnet50, respectively, on Caltech101.

The performance of the watermarked resnet50 is also tested by fine-tuning the model with the 5 datasets built on Caltech101, and the results are presented in Fig. 7. The left two and right two figures are the fine-tuning results of the robust and fragile watermarked resnet50, respectively. As illustrated in Fig. 7, the fragile watermarked resnet50 is sensitive to all kinds of fine-tuning, especially with the poisoned datasets DP, SB, and LCB. When all layers of the watermarked resnet50 are fine-tuned, the curves of the poisoned datasets DP, SB, and LCB go down more quickly than the other curves. When only the last layer of the model is fine-tuned, the accuracy curves fluctuate wildly. In general, our proposed fragile watermarking method is sensitive to malicious fine-tuning and can be used to detect model tampering carried out with various datasets.

4 Conclusion

In this paper, we proposed a black-box fragile neural network watermarking method with a trigger image set for authenticating the integrity of DNN models. In our approach, models are trained to fit both the training set and the trigger set in a two-stage alternate training process, which aims to embed the fragile watermark while keeping the original classification performance. The embedded fragile watermark is sensitive to model tampering and thus can be used to verify the integrity of models. Two meaningful metrics are provided to determine whether the fragile watermarked model has been modified and to assess the distribution difference between the training set and the data used for the malicious attack. Experiments on three benchmark datasets have shown that our proposed fragile watermarking method is widely applicable to various classifiers and datasets. We leave research on more sensitive semi-fragile neural network watermarking to future work.