1 Introduction

Deep neural networks have made tremendous progress in the area of multimedia representation [5, 40, 49, 50]. Deep learning attempts to model high-level abstractions in data by employing deep architectures composed of multiple nonlinear transformations [7]. In addition, deep neural networks can be applied to various types of data such as sound [49], video [30], text [46], time series [53], and images [33]. In particular, deep convolutional neural networks (DCNNs) such as LeNet [36], AlexNet [33], VGGNet [42], GoogLeNet [44], and ResNet [22] have demonstrated remarkable performance on a wide range of computer vision problems and other applications.

Additionally, many deep learning frameworks have been released. They help engineers and researchers develop systems based on deep learning or conduct research with less effort. Examples of these deep learning frameworks are Caffe [27], Theano [8], Torch [12], Chainer [45], TensorFlow [1], and Keras [10].

Although these frameworks have made it easy to utilize deep neural networks in real applications, training is still a difficult task because it requires a large amount of data and time; for instance, several weeks are needed to train a very deep ResNet on the ImageNet dataset even with the latest GPUs [22].

Therefore, trained models are sometimes provided on Web sites in order to make it easy to try out a certain model or reproduce the results in research articles without training. For example, Model Zoo provides trained Caffe models for various tasks together with useful utility tools.

It has been empirically observed that using trained models to initialize the weights of a deep neural network offers the following potential benefits. Fine-tuning [42] is a strategy to directly adapt an already trained model to another application with minimal re-training time. It has also been reported that pre-trained neural networks often achieve lower training error than networks that are not pre-trained [15, 24].

Thus, sharing trained models is very important for the rapid progress of research and development of deep neural network systems. In the future, more systematic model-sharing platforms may appear, by analogy with video-sharing sites. Digital distribution platforms for the purchase and sale of trained models, or even of artificial intelligence skills (e.g., Alexa Skills), may also appear, similar to Google Play or the App Store.

In that sense, trained models could be important assets for the owner(s) who trained them. Dataset quality and quantity directly affect the accuracy of tasks performed with large networks. The success of deep neural networks has been achieved not only through algorithms but also through massive amounts of data and computational power. Even if the same architecture is employed for different applications, the resulting model weights and performance are not guaranteed to be equal. For instance, if two applications employ the same architecture, such as AlexNet [33], and are trained in the same manner but with different datasets, their performance will depend on the quality and quantity of those datasets. Furthermore, a large cost is incurred in creating a dataset of sufficient size for specific and realistic tasks. From the viewpoint of applications, it could be argued that model weights, rather than architectures, constitute the competitive advantage.

We argue that trained models could be treated as intellectual property, and we believe that providing copyright protection for trained models is a worthwhile challenge. Discussion on whether or not the copyright law can protect computationally trained models is outside the scope of this paper. We focus on how to technically protect the copyrights of trained models.

To this end, we employ a digital watermarking idea, which is used to identify ownership of the copyright of digital content such as images, audio, and videos. In this paper, we propose a digital watermarking technology for neural networks. In particular, we propose a general framework to embed a watermark in deep neural network models to protect intellectual property and detect intellectual property infringement of trained models. This paper is an extended version of [48] with further analysis of attacks on the watermark.

2 Problem formulation

Given a network with or without trained parameters, we define the task of watermark embedding as embedding a T-bit vector \({\varvec{b}} \in \{0, 1\}^{T}\) into the parameters of one or more layers of the neural network. We refer to a neural network in which a watermark is embedded as a host network and refer to the task that the host network is originally trying to perform as the original task.

In the following, we formulate (1) requirements for an embedded watermark or an embedding method, (2) embedding situations, and (3) expected types of attacks against which embedded watermarks should be robust.

Table 1 Requirements for an effective watermarking algorithm in the image and neural network domains

2.1 Requirements

Table 1 summarizes the requirements for an effective watermarking algorithm in the image domain [13, 21] and in the neural network domain. While the two domains share almost the same requirements, fidelity and robustness differ between them. For fidelity in the image domain, it is essential to maintain the perceptual quality of the host image while embedding a watermark. In the neural network domain, however, the parameter values themselves are not important; what matters is the performance on the original task. Therefore, it is essential to maintain the performance of the trained host network and not to hamper the training of the host network.

Regarding robustness, since images are subject to various signal processing operations, an embedded watermark should remain in the host image even after such operations. In the neural network domain, the most significant modification applied to a trained network is fine-tuning or transfer learning [42]. An embedded watermark in a neural network should be detectable after fine-tuning or other possible modifications.

2.2 Embedding situations

We classify the embedding situations into three types: train-to-embed, fine-tune-to-embed, and distill-to-embed, as summarized in Table 2.

Train-to-embed is the case in which the host network is trained from scratch while embedding a watermark where labels for training data are available.

Fine-tune-to-embed is the case in which a watermark is embedded while fine-tuning. In this case, model parameters are initialized with a pre-trained network. The network configuration near the output layer may be changed before fine-tuning in order to adapt the final layer’s output to another task.

Distill-to-embed is the case in which a watermark is embedded into a trained network without labels, using the distillation approach [23]. Embedding is performed during fine-tuning, where the predictions of the trained model are used as labels. In the standard distillation framework, a large network (or multiple networks) is first trained and then a smaller network is trained using the predicted labels of the large network in order to compress the large network. In this paper, we use the distillation framework as a simple way to train a network without labels.

The first two situations assume that the copyright holder of the host network embeds a watermark into the host network during training or fine-tuning. Fine-tune-to-embed is also useful when a model owner wants to embed individual watermarks to identify those to whom the model has been distributed; by doing so, individual instances can be tracked. The last situation assumes that a non-copyright holder (e.g., a platform provider) is entrusted to embed a watermark on behalf of the copyright holder.

Table 2 Three embedding situations

2.3 Expected attack types

Related to the requirement for robustness in Sect. 2.1, we assume three types of attacks against which embedded watermarks should be robust: fine-tuning, model compression, and watermark overwriting.

2.3.1 Fine-tuning

Fine-tuning [42] seems to be the most feasible type of attack, whether intentional or unintentional, because it empirically offers the following potential benefits. Using a trained model as the initial weights for training another network often achieves lower training error than training from scratch [15, 24]. Fine-tuning also helps both to reduce the computational cost and to improve performance, and many models have been constructed on top of existing state-of-the-art models. Fine-tuning alters the model parameters, and thus embedded watermarks should be robust against this alteration.

2.3.2 Model compression

Model compression is very important in deploying deep neural networks in embedded systems or mobile devices as it can significantly reduce memory requirements and/or computational cost. Model compression can be easily imagined by analogy with lossy image compression in the image domain. Lossy compression distorts model parameters, so we should explore how it affects the detection rate.

2.3.3 Watermark overwriting

Watermark overwriting would be a severe attack. An attacker may try to destroy an existing watermark by embedding a different watermark in the same manner. Ideally, embedded watermarks should be robust against this type of attack.

3 Proposed framework

In this section, we propose a framework for embedding a watermark into a host network. Although we focus on a DCNN [36] as the host, our framework is essentially applicable to other networks such as standard multilayer perceptrons (MLPs), recurrent neural networks (RNNs), and long short-term memory (LSTM) networks [25].

3.1 Embedding targets

In this paper, a watermark is assumed to be embedded into one of the convolutional layers of a host DCNN. Let (S, S), D, and L, respectively, denote the size of the convolution filter, the depth of the input to the convolutional layer, and the number of filters in the convolutional layer. The parameters of this convolutional layer are characterized by the tensor \({\varvec{W}} \in \mathbb {R}^{S \times S \times D \times L}\). The bias term is ignored here. Let us think of embedding a T-bit vector \({\varvec{b}} \in \{0, 1\}^{T}\) into \({\varvec{W}}\). The tensor \({\varvec{W}}\) is a set of L convolutional filters, and the order of the filters does not affect the output of the network if the parameters of the subsequent layers are appropriately re-ordered. In order to remove this arbitrariness in the order of the filters, we calculate the mean of \({\varvec{W}}\) over the L filters as \(\overline{W}_{ijk} = \frac{1}{L} \sum _l W_{ijkl}\). Letting \({\varvec{w}} \in \mathbb {R}^M\) (\(M = S \times S \times D\)) denote a flattened version of \(\overline{{\varvec{W}}}\), our objective is now to embed the T-bit vector \({\varvec{b}}\) into \({\varvec{w}}\).
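For concreteness, the following NumPy sketch illustrates this averaging and flattening step (the array layout and variable names are illustrative and not taken from our public implementation):

import numpy as np

def flatten_mean_filter(W: np.ndarray) -> np.ndarray:
    # W has shape (S, S, D, L); averaging over the last axis (the L filters)
    # removes the arbitrariness in the filter order.
    W_mean = W.mean(axis=3)        # shape (S, S, D)
    return W_mean.reshape(-1)      # shape (M,) with M = S * S * D

# Example: 3x3 kernels, input depth 64, 64 filters -> M = 576.
W = np.random.randn(3, 3, 64, 64)
w = flatten_mean_filter(W)         # w.shape == (576,)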

3.2 Embedding regularizer

It is possible to embed a watermark into a host network by directly modifying \({\varvec{w}}\) of a trained network, as is usually done in the image domain. However, this approach degrades the performance of the host network in the original task as shown later in Sect. 4.3.1. Instead, we propose embedding a watermark while training a host network for the original task so that the existence of the watermark does not impair the performance of the host network in its original task. To this end, we utilize a parameter regularizer, which is an additional term in the original cost function for the original task. The cost function \(E({\varvec{w}})\) with a regularizer is defined as:

$$\begin{aligned} E({\varvec{w}}) = E_0 ({\varvec{w}}) + \lambda E_R ({\varvec{w}}), \end{aligned}$$
(1)

where \(E_0 ({\varvec{w}})\) is the original cost function, \(E_R ({\varvec{w}})\) is a regularization term that imposes a certain restriction on parameters \({\varvec{w}}\), and \(\lambda \) is an adjustable parameter. A regularizer is usually used to prevent over-fitting in neural networks. \(L_2\) regularization (or weight decay [34]), \(L_1\) regularization, and their combination are often used to reduce over-fitting of parameters for complex neural networks. For instance, \(E_R ({\varvec{w}}) = ||{\varvec{w}}||^2_2\) in the \(L_2\) regularization.

In contrast to these standard regularizers, our regularizer imposes a certain statistical bias on parameter \({\varvec{w}}\), as a watermark in a training process. We refer to this regularizer as an embedding regularizer. Before defining the embedding regularizer, we explain how to extract a watermark from \({\varvec{w}}\). Given a (mean) parameter vector \({\varvec{w}} \in \mathbb {R}^M\) and an embedding parameter \({\varvec{X}} \in \mathbb {R}^{T \times M}\), the watermark extraction is simply done by projecting \({\varvec{w}}\) using \({\varvec{X}}\), followed by thresholding at 0. More precisely, the j-th bit is extracted as:

$$\begin{aligned} b_j = s\left( \sum _{i} X_{ji} w_i\right) , \end{aligned}$$
(2)

where s(x) is a step function:

$$\begin{aligned} s(x) = {\left\{ \begin{array}{ll} \, 1 &{} x \ge 0 \\ \, 0 &{} \mathrm {else}. \end{array}\right. } \end{aligned}$$
(3)
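In other words, extraction amounts to a single matrix–vector product followed by thresholding at zero. A minimal NumPy sketch, with a bit-error-rate helper added for the robustness experiments in Sect. 4.4 (the helper names are ours), is:

import numpy as np

def extract_watermark(X: np.ndarray, w: np.ndarray) -> np.ndarray:
    # X: (T, M) secret key, w: (M,) mean parameter vector; returns T bits in {0, 1}.
    return (X @ w >= 0).astype(np.uint8)

def bit_error_rate(b_true: np.ndarray, b_detected: np.ndarray) -> float:
    return float(np.mean(b_true != b_detected))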

This process can be considered to be a binary classification problem with a single-layer perceptron (without bias). Therefore, it is straightforward to define the loss function \(E_R ({\varvec{w}})\) for the embedding regularizer by using (binary) cross entropy:

$$\begin{aligned} E_R ({\varvec{w}}) = - \sum _{j=1}^{T} \left( b_j \log (y_j) + (1 - b_j) \log (1 - y_j) \right) , \end{aligned}$$
(4)

where \(y_j = \sigma (\sum _{i} X_{ji} w_i)\) and \(\sigma (\cdot )\) is the sigmoid function:

$$\begin{aligned} \sigma (x)=\frac{1}{1+\exp (-x)}. \end{aligned}$$
(5)

We call this loss function an embedding loss function.

Note that an embedding loss function is used to update \({\varvec{w}}\), not \({\varvec{X}}\), in our framework. This may be confusing because, in a standard perceptron, \({\varvec{w}}\) would be the input and \({\varvec{X}}\) the parameter to be learned; in our case, \({\varvec{w}}\) is the embedding target and \({\varvec{X}}\) is a fixed parameter. \({\varvec{X}}\) works as a secret key [21] for detecting an embedded watermark. The design of \({\varvec{X}}\) is discussed in Sect. 3.3.

This approach does not impair the performance of the host network in the original task as confirmed in experiments, because deep neural networks are typically over-parameterized. It is well known that deep neural networks have many local minima and that all local minima are likely to have an error very close to that of the global minimum [11, 14]. Therefore, the embedding regularizer only needs to guide model parameters to one of a number of good local minima so that the final model parameters have an arbitrary watermark.
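The following PyTorch sketch shows how the embedding regularizer can be added to the task loss during training, following Eqs. (1) and (4); it is an illustrative re-implementation under our own naming, not an excerpt from our published code. Note that the convolution kernel is stored as (L, D, S, S) in PyTorch, so the mean is taken over the first axis.

import torch
import torch.nn.functional as F

def embedding_loss(conv_weight: torch.Tensor, X: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    # conv_weight: (L, D, S, S); average over the L filters and flatten to w in R^M.
    w = conv_weight.mean(dim=0).reshape(-1)
    logits = X @ w                                    # shape (T,)
    # Eq. (4): binary cross entropy between sigmoid(logits) and the watermark bits b.
    return F.binary_cross_entropy_with_logits(logits, b.float(), reduction="sum")

# Inside a training step (X and b are fixed; only the network weights are updated):
#   loss = task_loss + lambda_wm * embedding_loss(host_conv.weight, X, b)
#   loss.backward(); optimizer.step()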

3.3 Regularizer parameters

In this section, we discuss the design of the embedding parameter \({\varvec{X}}\), which can be considered as a secret key [21] in detecting and embedding watermarks. While \({\varvec{X}} \in \mathbb {R}^{T \times M}\) can be an arbitrary matrix, it will affect the performance of an embedded watermark because it is used in both embedding and extraction of watermarks. In this paper, we consider three types of \({\varvec{X}}\): \({\varvec{X}}^{\textsf {direct}}\), \({\varvec{X}}^{\textsf {diff}}\), and \({\varvec{X}}^{\textsf {random}}\).

\({\varvec{X}}^{\textsf {direct}}\) is constructed so that one element in each row of \({\varvec{X}}^{\textsf {direct}}\) is '1' and the others are '0'. In this case, the j-th bit \(b_j\) is directly embedded in a certain parameter \(w_{\hat{i}}\) s.t. \({\varvec{X}}^{\textsf {direct}}_{j\hat{i}} = 1\).

\({\varvec{X}}^{\textsf {diff}}\) is created so that each row has one '1' element and one '-1' element, and the others are '0'. Using \({\varvec{X}}^{\textsf {diff}}\), the j-th bit \(b_j\) is embedded into the difference between \(w_{i_+}\) and \(w_{i_-}\), where \({\varvec{X}}^{\textsf {diff}}_{ji_+}=1\) and \({\varvec{X}}^{\textsf {diff}}_{ji_-}=-1\).

Each element of \({\varvec{X}}^{\textsf {random}}\) is independently drawn from the standard normal distribution \(\mathcal {N}(0, 1)\). Using \({\varvec{X}}^{\textsf {random}}\), each bit is embedded into all the elements of the parameter \({\varvec{w}}\) with random weights. These three types of embedding parameters are compared in the experiments; a sketch of their construction is given below.
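One possible construction of the three matrices is sketched here; whether the nonzero columns are shared across rows is an implementation choice (below they are sampled without replacement, which requires \(T \le M\)), and the function name is ours.

import numpy as np

def make_X(kind: str, T: int, M: int, seed: int = 0) -> np.ndarray:
    rng = np.random.default_rng(seed)
    X = np.zeros((T, M))
    if kind == "direct":
        # one '1' per row: bit j is embedded directly into one parameter
        cols = rng.choice(M, size=T, replace=False)
        X[np.arange(T), cols] = 1.0
    elif kind == "diff":
        # one '1' and one '-1' per row: bit j is embedded into a difference of two parameters
        for j in range(T):
            i_plus, i_minus = rng.choice(M, size=2, replace=False)
            X[j, i_plus], X[j, i_minus] = 1.0, -1.0
    elif kind == "random":
        # i.i.d. standard normal entries: each bit is spread over all parameters
        X = rng.standard_normal((T, M))
    return X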

4 Experiments

In this section, we demonstrate that our embedding regularizer can embed a watermark without impairing the performance of the host network, and the embedded watermark is robust against various types of attacks. Our implementation of the embedding regularizer is publicly available.

4.1 Evaluation settings

4.1.1 Dataset

For experiments, we used the well-known CIFAR-10 and Caltech-101 datasets. The CIFAR-10 dataset [32] consists of 60,000 \(32 \times 32\) color images in 10 classes, with 6000 images per class. These images are separated into 50,000 training images and 10,000 test images. The Caltech-101 dataset [16] includes pictures of objects belonging to 101 categories and contains about 40–800 images per category. The size of each image is roughly \(300 \times 200\) pixels, but we resized them to \(32 \times 32\) for fine-tuning. We used 30 images per category for training and at most 40 of the remaining images per category for testing.

4.1.2 Host network and training settings

We used the wide residual network [52] as the host network. The wide residual network is an efficient variant of the residual network [22]. Table 3 shows the structure of the wide residual network. The depth parameter N is the number of blocks in each group, and the width parameter k is a widening factor that scales the width of the residual blocks in the groups.

In all our experiments, we set \(N = 1\) and \(k = 4\) and used SGD with Nesterov momentum [2, 39, 43] and a cross entropy loss in training. The initial learning rate was set to 0.1, the weight decay to \(5.0 \times 10^{-4}\), the momentum to 0.9, and the minibatch size to 64. The learning rate was dropped by a factor of 0.2 at 60, 120, and 160 epochs, and we trained for a total of 200 epochs, following the settings used in [52].

We embedded a watermark into one of the following convolutional layers: the second convolutional layer in the conv 2, conv 3, or conv 4 group. Hereinafter, we refer to the location of the host layer by simply describing the conv 2, conv 3, or conv 4 group. In Table 3, the dimensionality M of the parameter vector \({\varvec{w}}\) is also shown for these layers. The parameter \(\lambda \) in Eq. (1) is set to 0.01. As a watermark, we embedded \({\varvec{b}} = \mathbf {1} \in \{0, 1\}^{T}\) in the following experiments.
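A hedged PyTorch sketch of this training configuration is given below; the model, the data loader, the key matrix X, and the layer chosen as the host (host_conv) are assumed to be defined elsewhere, and embedding_loss is the helper sketched in Sect. 3.2.

import torch
import torch.nn.functional as F

optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9,
                            weight_decay=5e-4, nesterov=True)
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer,
                                                 milestones=[60, 120, 160], gamma=0.2)
lambda_wm = 0.01                     # lambda in Eq. (1)
b = torch.ones(256)                  # the embedded watermark b = 1 (T = 256)

for epoch in range(200):
    for images, labels in train_loader:      # minibatch size 64
        optimizer.zero_grad()
        task_loss = F.cross_entropy(model(images), labels)
        loss = task_loss + lambda_wm * embedding_loss(host_conv.weight, X, b)
        loss.backward()
        optimizer.step()
    scheduler.step()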

Table 3 Structure of the host network
Fig. 1 Histogram of the embedded watermark \(\sigma (\sum _{i} X_{ji} w_i)\) (before thresholding) with and without watermarks. All watermarks will be successfully detected by binarizing \(\sigma (\sum _{i} X_{ji} w_i)\) at a threshold of 0.5. In the case of random, it can easily be determined from the histogram whether or not a watermark is embedded. a direct, b diff and c random

Fig. 2 Distribution of model parameters \({\varvec{W}}\) with and without watermarks. a Not embedded, b direct, c diff and d random

4.2 Embedding results

We trained the host network from scratch (train-to-embed) on the CIFAR-10 dataset with and without embedding a watermark. In the embedding case, a 256-bit watermark (\(T=256\)) was embedded into the conv 2 group.

4.2.1 Detecting watermarks

Figure 1 shows the histogram of the embedded watermark \(\sigma (\sum _{i} X_{ji} w_i)\) (before thresholding) with and without watermarks, where (a) direct, (b) diff, and (c) random parameters are used in embedding and detection. If we binarize \(\sigma (\sum _{i} X_{ji} w_i)\) at a threshold of 0.5, all watermarks are correctly detected: since \(\sigma (x) \ge 0.5\) if and only if \(x \ge 0\), we have \(\sigma (\sum _{i} X_{ji} w_i) \ge 0.5\) for all j in all embedded cases. Please note that we embedded \({\varvec{b}} = \mathbf {1} \in \{0, 1\}^{T}\) as the watermark, as previously mentioned. Although random bits would also be extracted from a non-embedded network, it can easily be determined that no watermark is embedded because the distribution of \(\sigma (\sum _{i} X_{ji} w_i)\) is quite different from those of the embedded cases.

4.2.2 Distribution of model parameters

We explore how the trained model parameters are affected by the embedded watermarks. Figure 2 shows the distribution of model parameters \({\varvec{W}}\) (not \({\varvec{w}}\)) with and without watermarks. These parameters are taken only from the layer in which a watermark was embedded. Note that \({\varvec{W}}\) is the parameter before taking the mean over filters, and thus the number of parameters is \(3 \times 3 \times 64 \times 64\). We can see that direct and diff significantly alter the distribution of parameters while random does not. In direct, many parameters became large and a peak appears near 2, so that their mean over filters becomes a large positive value that reduces the embedding loss. In diff, most parameters were pushed in both positive and negative directions so that the differences between these parameters became large. In random, a watermark is diffused over all parameters with random weights and thus does not significantly alter the distribution. This relates to the security requirement, one of the desirable properties of watermarking: one may become aware of the existence of the embedded watermarks in the direct and diff cases.

The results so far, together with the fidelity results in Sect. 4.3, indicated that the random approach is the best choice among the three, achieving a low embedding loss, a low test error on the original task, and no alteration of the parameter distribution. Therefore, in the following experiments, we used the random approach for embedding watermarks unless otherwise stated.

4.3 Fidelity

4.3.1 Embedding without training

As mentioned in Sect. 3.2, it is possible to embed a watermark in a host network by directly modifying the trained parameter \({\varvec{w}}_0\), as is usually done in the image domain. Here we try to do this by minimizing the following loss function instead of Eq. (1):

$$\begin{aligned} E({\varvec{w}}) = \frac{1}{2} ||{\varvec{w}} - {\varvec{w}}_0||^2_2 + \lambda E_R ({\varvec{w}}), \end{aligned}$$
(6)

where the embedding loss \(E_R ({\varvec{w}})\) is minimized while minimizing the difference between the modified parameter \({\varvec{w}}\) and the original parameter \({\varvec{w}}_0\). Table 4 summarizes the embedding results after minimizing Eq. (6) for the host network trained on the CIFAR-10 dataset. We can see that embedding fails for \(\lambda \le 1\), as the bit error rate (BER) is larger than zero, while the test error of the original task becomes too large for \(\lambda > 1\). Thus, it is not effective to directly embed a watermark without considering the original task.
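The following PyTorch sketch makes this baseline explicit: it minimizes Eq. (6) over the mean parameter vector alone, starting from the trained value \({\varvec{w}}_0\) (how the modified vector is written back into the full tensor \({\varvec{W}}\) is omitted here, and the names are ours).

import torch
import torch.nn.functional as F

def embed_directly(w0: torch.Tensor, X: torch.Tensor, b: torch.Tensor,
                   lam: float, steps: int = 1000, lr: float = 0.01) -> torch.Tensor:
    # w0 is the trained (constant) parameter vector; only the copy w is optimized.
    w = w0.detach().clone().requires_grad_(True)
    opt = torch.optim.SGD([w], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        e_r = F.binary_cross_entropy_with_logits(X @ w, b.float(), reduction="sum")
        loss = 0.5 * torch.sum((w - w0) ** 2) + lam * e_r    # Eq. (6)
        loss.backward()
        opt.step()
    return w.detach()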

Table 4 Losses, test error (\(\%\)), and bit error rate (BER) after embedding a watermark with different \(\lambda \)

4.3.2 Test error and training loss

Figure 3 shows the training curves for the host network on CIFAR-10 as a function of epochs. Not embedded is the case where the host network is trained without the embedding regularizer. Embedded (direct), Embedded (diff), and Embedded (random), respectively, represent the training curves with embedding regularizers whose parameters are \({\varvec{X}}^{\textsf {direct}}\), \({\varvec{X}}^{\textsf {diff}}\), and \({\varvec{X}}^{\textsf {random}}\). We can see that the training loss \(E({\varvec{w}})\) with a watermark becomes larger than in the not-embedded case if the parameters \({\varvec{X}}^{\textsf {direct}}\) or \({\varvec{X}}^{\textsf {diff}}\) are used. This large training loss is dominated by the embedding loss \(E_R ({\varvec{w}})\), which indicates that it is difficult to embed a watermark directly into a single parameter or even into the difference of two parameters. On the other hand, the training loss of Embedded (random) is very close to that of Not embedded.

Table 5 shows the best test errors and embedding losses \(E_R ({\varvec{w}})\) of the host networks with and without embedding. We can see that the test errors of Not embedded and random are almost the same, while those of direct and diff are slightly larger. The embedding loss \(E_R ({\varvec{w}})\) of random is extremely low compared with those of direct and diff. These results indicate that the random approach can effectively embed a watermark without impairing the performance in the original task.

Fig. 3 Training curves for the host network on CIFAR-10 as a function of epochs. Solid lines denote test error (y-axis on the left) and dashed lines denote training loss \(E({\varvec{w}})\) (y-axis on the right)

Table 5 Test error (\(\%\)) and embedding loss \(E_R ({\varvec{w}})\) with and without embedding

4.3.3 Fine-tune-to-embed and distill-to-embed

In the above experiments, a watermark was embedded by training the host network from scratch (train-to-embed). Here, we evaluated the other two situations introduced in Sect. 2.2: fine-tune-to-embed and distill-to-embed.

For fine-tune-to-embed, two experiments were performed. In the first experiment, the host network was trained on the CIFAR-10 dataset without embedding and then fine-tuned on the same CIFAR-10 dataset with and without embedding (for comparison). In the second experiment, the host network was trained on the Caltech-101 dataset and then fine-tuned on the CIFAR-10 dataset with and without embedding.

Table 6a shows the result of the first experiment. Not embedded 1st corresponds to the first training without embedding. Not embedded 2nd corresponds to the second training without embedding, and Embedded corresponds to the second training with embedding. Figure 4 shows the training curves of these fine-tunings. We can see that Embedded achieved almost the same test error as Not embedded 2nd and a very low \(E_R ({\varvec{w}})\).

Table 6 Test error (\(\%\)) and embedding loss \(E_R ({\varvec{w}})\) with and without embedding in fine-tuning and distilling

Table 6b shows the results of the second experiment. Not embedded 2nd corresponds to the second training without embedding, and Embedded corresponds to the second training with embedding. Figure 5 shows the training curves of these fine-tunings. The test error and training loss of the first training are not shown because they are not comparable across the two different training datasets. From these results, it was also confirmed that Embedded achieved almost the same test error as Not embedded 2nd and a very low \(E_R ({\varvec{w}})\). Thus, we can say that the proposed method is effective even in the fine-tune-to-embed situation (in both the same and different domains).

Finally, embedding a watermark in the distill-to-embed situation was evaluated. The host network was first trained on the CIFAR-10 dataset without embedding. Then, the trained network was further fine-tuned on the same CIFAR-10 dataset with and without embedding. In this second training, the training labels of the CIFAR-10 dataset were not used; instead, the predicted values of the trained network were used as soft targets [23]. In other words, no label was used in the second training. Table 6c shows the results for the distill-to-embed situation. Not embedded 1st corresponds to the first training, and Embedded (Not embedded 2nd) corresponds to the second, distilling training with embedding (without embedding). It was found that the proposed method also achieved a low test error and a low \(E_R ({\varvec{w}})\) in the distill-to-embed situation. Table 6d shows the result for the distill-to-embed situation in a different domain; the difference from Table 6c is that the predicted values for the Caltech-101 dataset are used as soft targets instead of those for CIFAR-10. The test error is calculated on CIFAR-10.
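A hedged sketch of one training step in this distill-to-embed setting is shown below. We use a KL-divergence loss against the soft targets (the text above only states that the predicted values are used as labels); teacher denotes the trained host network whose predictions serve as soft targets, student is the network being fine-tuned (the same trained network in our setting), host_weight is the convolution kernel of the host layer, and embedding_loss is the helper sketched in Sect. 3.2. All names are illustrative.

import torch
import torch.nn.functional as F

def distill_to_embed_step(teacher, student, host_weight, images, X, b,
                          optimizer, lambda_wm=0.01):
    with torch.no_grad():
        soft_targets = F.softmax(teacher(images), dim=1)     # predicted values as labels
    optimizer.zero_grad()
    log_probs = F.log_softmax(student(images), dim=1)
    task_loss = F.kl_div(log_probs, soft_targets, reduction="batchmean")
    loss = task_loss + lambda_wm * embedding_loss(host_weight, X, b)
    loss.backward()
    optimizer.step()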

Fig. 4 Training curves for fine-tuning the host network. The first and second halves of the epochs correspond to the first and second trainings. Solid lines denote test error (y-axis on the left) and dashed lines denote training loss (y-axis on the right)

Fig. 5 Training curves for the host network on CIFAR-10 as a function of epochs. Solid lines denote test error (y-axis on the left) and dashed lines denote training loss (y-axis on the right)

4.4 Robustness of embedded watermarks

In this section, the robustness of the proposed watermark is evaluated for the three types of attacks explained in Sect. 2.3: fine-tuning, model compression, and watermark overwriting.

4.4.1 Robustness against fine-tuning

Fine-tuning or transfer learning [42] seems to be the most likely type of (unintentional) attack because it is frequently performed on trained models, either to apply them to other, similar tasks with less effort than training a network from scratch or to avoid over-fitting when sufficient training data are not available.

In this experiment, two trainings were performed: in the first training, a 256-bit watermark was embedded into the conv 2 group in the train-to-embed manner, and then the host network was further fine-tuned in the second training without embedding, to determine whether or not the watermark embedded in the first training remained in the host network even after the second training (fine-tuning).

Table 7 shows the embedding loss before fine-tuning (\(E_R ({\varvec{w}})\)) and after fine-tuning (\(E'_R ({\varvec{w}})\)), and the best test error after fine-tuning. We evaluated fine-tuning in the same domain (CIFAR-10 \(\rightarrow \) CIFAR-10) and in different domains (Caltech-101 \(\rightarrow \) CIFAR-10); in the same-domain case, the host network was trained on the CIFAR-10 dataset while embedding a watermark and then further fine-tuned on CIFAR-10 without embedding a watermark. We can see that, in both cases, the embedding loss was increased slightly by fine-tuning but was still low. In addition, the bit error rate of the detected watermark was zero in both cases. The reason why the embedding loss after fine-tuning in the different-domain case is higher than that in the same-domain case is that the Caltech-101 dataset is significantly more difficult than the CIFAR-10 dataset in our settings; all images in the Caltech-101 dataset were resized to \(32 \times 32\) for compatibility with the CIFAR-10 dataset.

Table 7 Embedding loss before fine-tuning (\(E_R ({\varvec{w}})\)) and after fine-tuning (\(E'_R ({\varvec{w}})\)), and the best test error (\(\%\)) and bit error rate (BER) after fine-tuning

4.4.2 Robustness against model compression

It is sometimes difficult to deploy deep neural networks in embedded systems or mobile devices because they are both computationally intensive and memory intensive. In order to solve this problem, the model parameters are often compressed [18,19,20]. The compression of model parameters can intentionally or unintentionally act as an attack against watermarks. In this section, we evaluate the robustness of our watermarks against model compression, in particular, against parameter pruning [20] and distillation [23].

Robustness against parameter pruning In parameter pruning, parameters whose absolute values are very small are set to zero. In [19], quantization of the weights and Huffman coding of the quantized values are further applied. Because quantization has less impact than parameter pruning and Huffman coding is lossless, we focus on parameter pruning here.

In order to evaluate robustness against parameter pruning, we embedded a 256-bit watermark into the conv 2 group while training the host network on the CIFAR-10 dataset. We removed \(\alpha \)% of the \(3 \times 3 \times 64 \times 64\) parameters of the embedded layer and calculated the embedding loss and bit error rate. Figure 6a shows the embedding loss \(E_R ({\varvec{w}})\) as a function of the pruning rate \(\alpha \). Ascending (Descending) represents the embedding loss when the top \(\alpha \)% of parameters are cut off according to their absolute values in ascending (descending) order. Random represents the embedding loss when \(\alpha \)% of the parameters are removed at random. Ascending corresponds to parameter pruning, and the others were evaluated for comparison. We can see that the embedding loss of Ascending increases more slowly than those of Descending and Random as \(\alpha \) increases. This is reasonable because model parameters with small absolute values have little impact on the detected watermark, which is extracted from the dot product of the model parameter \({\varvec{w}}\) and the constant embedding parameter (weight) \({\varvec{X}}\).

Figure 6b shows the bit error rate as a function of the pruning rate \(\alpha \). Surprisingly, the bit error rate was still zero after removing 65% of the parameters, and it was only 2/256 even after 80% of the parameters were pruned (Ascending). We can say that the embedded watermark is sufficiently robust against parameter pruning because, in [19], the resulting pruning rates of the convolutional layers ranged from 16 to 65% for AlexNet [33] and from 42 to 78% for VGGNet [42]. Furthermore, this degree of bit error can easily be corrected by an error-correcting code (e.g., the BCH code). Figure 7 shows the histogram of the detected watermark \(\sigma (\sum _{i} X_{ji} w_i)\) after pruning for \(\alpha = 0.8\) and 0.95. For \(\alpha = 0.95\), the histogram of the detected watermark is also shown for a host network into which no watermark was embedded. We can see that many of the \(\sigma (\sum _{i} X_{ji} w_i)\) values are still close to one in the embedded case, which might be used as a confidence score in determining the existence of a watermark (zero-bit watermarking).
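The pruning attack itself is straightforward to reproduce; a NumPy sketch of the Ascending variant, reusing the helpers sketched in Sect. 3 (all names are ours), is:

import numpy as np

def prune_smallest(W: np.ndarray, alpha: float) -> np.ndarray:
    # Zero out (approximately) the alpha fraction of weights with the smallest absolute values.
    threshold = np.quantile(np.abs(W).reshape(-1), alpha)
    W_pruned = W.copy()
    W_pruned[np.abs(W) < threshold] = 0.0
    return W_pruned

# Example: prune 80% of a 3x3x64x64 layer and re-extract the 256-bit watermark.
# W_pruned = prune_smallest(W, 0.80)
# b_detected = extract_watermark(X, flatten_mean_filter(W_pruned))
# print(bit_error_rate(np.ones(256, dtype=np.uint8), b_detected))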

Fig. 6 Embedding loss and bit error rate after pruning as a function of the pruning rate. a Embedding loss and b bit error rate

Fig. 7 Histogram of the detected watermark \(\sigma (\sum _{i} X_{ji} w_i)\) after pruning

Robustness against distillation Distillation is a training procedure originally designed to train a deep neural network model using knowledge transferred from a different model. The intuition was suggested in [4], while distillation itself was formally introduced in [23]. Distillation is employed to reduce computational complexity or to compress the knowledge in an ensemble of models into a single small model. In the standard distillation framework, a large network (or multiple networks) is first trained, and then a smaller network is trained using the predicted labels of the large network in order to compress the large network. Like fine-tuning, distillation could be an unintentional attack, and it is specific to deep neural networks.

In this experiment, we performed two trainings. First, a 256-bit watermark was embedded into the conv 2 group in the train-to-embed manner on CIFAR-10. Then, in the second training, another model was distilled using the CIFAR-10 dataset and the predicted values of the first trained network instead of the actual labels. The second training did not embed a watermark, and the initial weights were set at random. We employed the simplest form of distillation in this experiment: although a different network architecture and a different dataset could be used in the transfer step, we trained a new model of the same architecture on the same CIFAR-10 dataset for simplicity.

Table 8 shows the test error and bit error rate after the first and second trainings. The watermark could not be detected from the distilled model as expected because the model weights had been initialized with random weights.

Table 8 Test error (\(\%\)) and bit error rate (BER) of the embedded host network and after distilling without embedding the watermark

4.4.3 Robustness against watermark overwriting

Overwriting is a common attack in digital content watermarking [28]. A third party may embed a different watermark in order to overwrite the original watermark. Basically, it is necessary to know where the original watermark is embedded in order to overwrite it. Please note that, in addition to the regularizer parameter \({\varvec{X}}\), which works as a secret key, the location where a digital watermark is embedded should also be kept secret. However, it is conceivable that watermarks could be embedded into all or multiple layers in order to destroy the original watermark or change the ownership without exact knowledge of where the original watermark is actually embedded.

In order to evaluate robustness against overwriting, we embedded a 256-bit watermark into the conv 2, conv 3, and conv 4 groups with a regularizer parameter \({\varvec{X}}_0\), while training the host network on the CIFAR-10 dataset. Then, we additionally embedded a 256-bit, 512-bit, 1024-bit, or 2048-bit watermark into the host network with a regularizer parameter \({\varvec{X}}_1\) different from \({\varvec{X}}_0\). The number of parameters \({\varvec{w}}\) in the conv 2, conv 3, and conv 4 groups was 576, 1152, and 2304, respectively. All bit error rates of the original host networks were zero. The additional watermarks were embedded while training on the CIFAR-10 dataset.

Table 9 shows test error, embedding loss \(E_R ({\varvec{w}})\), and bit error rate with the first regularizer parameter \({\varvec{X}}_0\) after overwriting the first watermark. When the bit error rate is close to 0.5, it indicates that the original watermark has been erased completely. We can see that the original watermark was erased in some cases where the number of embedded bits was large compared to the number of parameters \({\varvec{w}}\).

Table 9 Test error (\(\%\)), embedding loss \(E_R ({\varvec{w}})\), and bit error rate with the original regularizer parameter after overwriting a watermark

4.5 Capacity of watermark

In this section, the capacity of the embedded watermark is explored by embedding watermarks of different sizes into different groups in the train-to-embed manner. Please note that the number of parameters \({\varvec{w}}\) in the conv 2, conv 3, and conv 4 groups was 576, 1152, and 2304, respectively. Table 10 shows the test error (\(\%\)), embedding loss \(E_R ({\varvec{w}})\), and bit error rate for combinations of different embedded groups and different numbers of embedded bits. We can see that the embedding loss or the test error becomes high if the number of embedded bits becomes larger than the number of parameters \({\varvec{w}}\) (e.g., 2048 bits in conv 3) because the embedding problem becomes overdetermined in such cases. Thus, the number of embedded bits should be smaller than the number of parameters \({\varvec{w}}\), which is a limitation of the embedding method using a single-layer perceptron. This limitation could be resolved by using a multilayer perceptron in the embedding regularizer.

Table 10 Test error (\(\%\)), embedding loss \(E_R ({\varvec{w}})\), and bit error rate for the combinations of embedded groups and sizes of embedded bits

5 Discussion

5.1 Insights

Fidelity As mentioned in Sect. 3.2, poor local minima are rarely a problem with large networks in practice. Regardless of the initial conditions, the system nearly always reaches solutions of very similar quality. Recent theoretical and empirical results strongly suggest that local minima are not a serious issue in general [35]. Therefore, the proposed approach was able to maintain the performance of the original task and carry out successful watermarking as shown in the experimental results of Sects. 4.3.2 and 4.3.3.

Robustness For watermarking techniques in the neural network domain, fine-tuning seems to be the most feasible and significant attack. The experimental results in Sect. 4.4.1 show that the proposed method retained the watermark completely after fine-tuning in both cases: the same domain and a different domain. In the same-domain case, the updates of the weight values were expected to be small because the host model had been trained well in the first training. In the different-domain case, on the other hand, the weight values were expected to change dramatically. However, our experimental results show that the watermark remained even after fine-tuning to a different domain. We consider that fine-tuning causes less alteration of the weights near the input layer than of those near the output layer. Therefore, the digital watermark can successfully resist a fine-tuning attack if it is embedded near the input layer of a sufficiently deep network. An additional advantage is that the network configuration near the input layer is unlikely to be changed when the network is adapted to another task.

Capacity The results presented in Sect. 4.5 indicate that the capacity is strongly related to the number of host weights relative to the length of the watermark. The capacity may be increased by using a multilayer perceptron in the embedding regularizer.

5.2 Limitations

Although we have obtained some initial insights into the new problem of embedding a watermark in deep neural networks, the proposed approach still has the following limitations.

Distillation Distillation is theoretically a serious attack against the watermarking of neural networks. However, distillation does not seem to be an important attack in practice, since it requires data that are very similar to the inputs used in the original training phase in order to maintain fidelity.

Overwriting As shown in Sect. 4.4.3, overwriting can destroy the original watermark. This experiment assumed that the attacker knows exactly where the original watermark was embedded. It is conceivable that watermarks could be embedded into all or multiple layers to destroy the original watermark, although this would incur a much greater computational cost due to the large number of targeted parameters. Overwriting is still a severe attack, and we should explore an effective way of combating it.

Black-box type situation In the proposed digital watermarking approach for deep neural network models, we assume that the weight values are accessible. Thus, it is impossible to detect abuse in a black-box type situation such as a client–server system, where a watermarked model is used on a server by unauthorized parties. To effectively deal with such a situation, the copyright protection of neural network models requires another approach. Inspired by our work [48], Merrer et al. proposed a method that allows the extraction of a watermark from a neural network remotely through a service API [38]. Their method embeds zero-bit watermarks into models with a stitching algorithm based on adversarial examples.

5.3 Further expected developments

Further developments are expected by using the analogy of digital content protection and domain-specific issues for deep neural networks.

Embedding as sequential learning In Sect. 4.3.1, we showed that it is not effective to directly embed a watermark without considering the original task. We can consider this embedding process as sequential learning: the training of the original task is the first task, and the subsequent watermark embedding is the second task. Thus, the increase in error rate after embedding can be interpreted as catastrophic forgetting [26]. From this point of view, we could adopt recently developed methods [26, 37] to overcome catastrophic forgetting when embedding a watermark.

Compression as embedding Compressing deep neural networks is a very important and active research topic. While we confirmed in this paper that our watermark is very robust against parameter pruning, a watermark might also be embedded in conjunction with compressing a model. For example, in [19], after parameter pruning, the network is re-trained to learn the final weights for the remaining sparse parameters. Our embedding regularizer could be used in this re-training to embed a watermark.

Network morphism In [9, 51], a systematic study has been conducted on how to morph a well-trained neural network into a new one so that its network function can be completely preserved for further training. This network morphism can constitute a severe attack against our watermark because it may be impossible to detect the embedded watermark if the topology of the host network undergoes major modification. We have left the investigation into how the embedded watermark is affected by this network morphism as a topic for future work.

Steganalysis Steganalysis [31, 41] is a method for detecting the presence of secretly hidden data (e.g., steganography or watermarks) in digital media files such as images, video, audio, and, in our case, deep neural networks. Ideally, watermarks should be robust against steganalysis. While we confirmed in this paper that embedding watermarks does not significantly change the distribution of model parameters, more exploration is needed to evaluate robustness against steganalysis. Conversely, developing effective steganalysis against watermarks for deep neural networks could be an interesting research topic.

Fingerprinting Digital fingerprinting is an alternative to the watermarking approach for persistent identification of images [6], video [29, 47], and audio clips [3, 17]. In this paper, we focused on one of these two important approaches. Robust fingerprinting of deep neural networks is another and complementary direction to protect deep neural network models.

6 Conclusions

In this paper, we have proposed a general framework for embedding a watermark in deep neural network models to protect the rights to trained models. First, we formulated a new problem: embedding watermarks into deep neural networks. We also defined requirements, embedding situations, and the types of attacks against which an embedded watermark should be robust. Second, we proposed a general framework for embedding a watermark in model parameters using a parameter regularizer. Our approach does not impair the performance of the network into which a watermark is embedded. Finally, we performed comprehensive experiments to reveal the potential of watermarking deep neural networks as the basis of this new problem. We showed that our framework can embed a watermark without impairing the performance of a deep neural network. The embedded watermark did not disappear even after fine-tuning or parameter pruning; the entire watermark remained even after 65% of the parameters were pruned.