1 Introduction

H.264/AVC is a state-of-the-art video compression standard and has become the most widely deployed video codec in almost all applications: video distribution in the Internet as well as on Blu-ray discs, even mobile devices employ H.264/AVC to compress captured video data. The ubiquity of digital content and its ease of duplication and modification call for technical solutions for copyright protection and authentication; digital watermarking is an integral part of such solutions. As digital video content is mostly coded with H.264/AVC, watermarking solutions that are tailored to the specifics of this standard can greatly improve the performance of digital watermarking systems, e.g., with respect to runtime performance.

In H.264/AVC, the watermark could be embedded at various stages in the compression pipeline, i.e., pre-compression or spatial domain, transform coding stage, quantized transform coefficients (QTCs) stage and entropy coding stage. So, basically, the work on video watermarking can be divided into two main approaches: (1) closed-loop watermarking, i.e., watermark embedding requires the coder reconstruction loop (any modification due to watermark embedding might likely affect further encoding of information) and (2) open-loop watermarking, i.e., the watermark embedding modifies only a well-defined set of information without affecting the encoding process of any other information (see Fig. 1). The closed-loop watermarking almost needs complete re-compression, while open-loop watermarking embeds a watermark in compressed domain (syntax elements before entropy encoding) or bitstream domain (bit substitutions of the final bitstream). Thus, open-loop watermarking either requires only the entropy encoding stage or no stages from the compression. Therefore, open-loop watermarking is extremely runtime-efficient compared to closed-loop watermarking. In Sect. 2.1, we present the recent research on closed-loop watermarking, while open-loop watermarking approaches are discussed in Sect. 2.2. Embedding within the reconstruction loop avoids drift, shows a small impact on rate distortion (RD) performance and offers a high watermark payload. It also has the significant disadvantage that it requires re-computation of all prediction decisions for the embedding of each single watermarking message [27]. Thus, it requires the computationally most complex parts of video compression and therefore cannot be used for applications with runtime constraints or in applications which require numerous watermark messages to be embedded. Active fingerprinting is an example which usually requires numerous watermarks, as each buyer’s fingerprinting code is embedded in the visual content to detect the traitor in case of illegal distribution of the content.

On the other hand, if the watermark embedding occurs after the reconstruction loop (open-loop), the most complex parts of the compression pipeline are omitted. Watermark embedding modifies only prediction residuals on the syntax level. In open-loop watermarking, prediction is performed from watermarked content on the decoder side and from the original content on the encoder side. Thus, there is a mismatch between encoder and decoder side predictors, which accumulates over time. This accumulating mismatch is referred as drift and makes open-loop watermarking very challenging, as visual quality constraints are difficult to meet. As video data are usually stored and distributed in a compressed format (very often H.264), it is often impractical to first decode the video sequence, embed the watermark and then re-compress it. A low-complexity video watermarking solution requires that the embedding be conducted in an open-loop fashion. Furthermore, common design goals of watermarking algorithms should also be met, such as robustness and imperceptibility. In this paper, we present an algorithm for open-loop watermark embedding in H.264/AVC bitstreams, which is robust against re-compression. At the same time, the embedding is imperceptible and avoids intra-drift, by keeping the predictor pixels unchanged.

In previous work, we presented the basic idea of intra-drift-free H.264/AVC bitstream watermarking in [3], and could only report very preliminary results, which employed ad-hoc solutions of the system of linear equations and used a constant embedding strength without any human visual system model. In this paper, we present a significantly improved version of our basic idea; we now select the most suitable patterns from the complete set of solutions, and use the DCTune perceptual model [28] to modulate the watermark embedding strength. The use of DCTune improves the basic approach in two ways: first, visible artifacts are avoided in the watermarked video. Second, as the watermark embedding strength is increased to the DCTune imperceptibility limit, the improved technique offers higher robustness. Furthermore, a comprehensive analysis and a comprehensive evaluation of our proposed technique are performed in this paper.

The rest of the paper is organized as follows: recent work on H.264/AVC watermarking is presented in Sect. 2. The proposed technique including the derivation of the linear system, watermarking embedding and detection are presented in Sect. 3. Experimental results are presented in Sect. 4. In Sect. 5, we discuss the performance and possible improvements of the presented algorithm. Final conclusions are drawn in Sect. 6.

2 Recent work

For video content, watermarking is often strongly tied to the compression. As previously explained, two distinct approaches can be used for video watermarking: closed-loop watermarking, where the information is embedded either before or during compression, or open-loop watermarking, where the watermark is embedded during the entropy coding or directly within the bitstream. Figure 2 summarizes the different embedding stages that we can encounter in the literature, these stages are numbered within the circles. In the following subsections, we will describe state-of-the-art methods for these two approaches.

Fig. 1
figure 1

Two main categories of watermarking in video codec: (1) during the encoding process or spatial domain and (2) in the bitstream domain

Fig. 2
figure 2

Classification of watermarking schemes on the basis of working domain: (1) pre-compression, (2) transform domain, (3) quantized transform domain, and (4) open-loop

2.1 Closed-loop watermarking

Embedding before or during compression can be subdivided into three main classes, namely, embedding before compression, embedding in transform coefficients, and embedding in QTCs.

In the pre-compression stage, denoted stage 1 in Fig. 2, watermarking can be performed in the pixel domain [4, 5] or more commonly in a transform domain-like DFT [6, 11] or DWT [29]. Pre-compression watermarking approaches can also be performed on multiple frames [4, 6]. In [24], Pröfrock et al. have presented a watermarking technique in the spatial domain, robust to H.264/AVC compression attacks with more than 40:1 compression ratio. The watermark is only contained in intra-frames similar to the approach presented in [30]. The watermark can also be embedded in the transformed coefficients before quantization as proposed by Golikeri et al. [7] (see stage 2 in Fig. 2). Visual models developed by Watson [28] were employed.

Some algorithms [8, 21, 27, 30] embed the watermark in QTCs of H.264/AVC (see stage 3 in Fig. 2). Noorkami and Merserau [21] have presented a technique to embed a watermark message in both intra- and inter-frames in all non-zero QTCs. They claimed that visual quality of inter-frames is not compromised even if the message is embedded only in non-zero QTCs. In [8], Gong and Lu embedded watermarks in H.264/AVC video by modifying the quantized DC coefficients in luma residual blocks. To increase the robustness while maintaining the perceptual quality of the video, a texture-masking-based perceptual model was used to adaptively choose the watermark strength for each block. To eliminate the effects of drift, a drift compensation algorithm was proposed which adds a drift compensation signal before embedding the watermark.

In [27], Shahid et al. embedded watermarks in H.264/AVC video by modifying the quantized AC coefficients in luma and chroma residual blocks. This method embeds a watermark message inside the reconstruction loop to avoid drift, and hence needs re-compression for each watermark message. This algorithm has a negligible compromise on RD performance and is useful for high payload metadata hiding. While only modifying the AC QTCs above a certain threshold, the scheme offers 195 kbps payload with 4.6 % increase in bit-rate and 1.38 dB decrease in PSNR at QP value of 18.

2.2 Open-loop watermarking

To avoid decoding followed by the extremely computationally demanding re-compression combined with watermarking, some methods have suggested embedding watermark messages in an open-loop fashion, e.g., [12, 13, 18, 19, 31] to achieve low-complexity video watermarking (see stage 4 in Fig. 2). In [1], perceptual models are adopted to mitigate the visual artifacts. In [14], a compressed domain watermarking technique called differential energy watermark (DEW) is proposed. In [15], the authors proposed a watermarking algorithm named different number watermarking (DNW) algorithm, embedding the mark in the texture and edges. For all of the above techniques, the main problems are visible artifacts and bit-rate increase.

In previous work, the addition of a drift compensation signal to control/eliminate the drift, i.e., visible artifacts, has been proposed [9, 10]. In [10], Huo et al. have proposed to process only a subset of the transformed coefficients as opposed to process all the coefficients as suggested in the other compensation methods. Kapotas et al. [12] have presented a data hiding method in H.264/AVC streams for fragile watermarking. It takes advantage of the different block sizes used by the H.264 encoder during the inter-prediction stage to hide the desirable data. The message can be extracted directly from the encoded stream without the need of the original host video. This approach can be mainly used for content-based authentication. In [20], the authors proposed a self-collusion resistant watermarking embedding algorithm using a key-dependent embedding strategy. Their algorithm is reported to cause only 1 % increment of the bit-rate; however, it is not robust to re-compression. In [19], the same authors presented a new algorithm that is more robust, but the increment of bit-rate was up to 4–5 %. In [26], Qiu et al. proposed a hybrid watermarking scheme that embeds a robust watermark in the DCT domain and a fragile watermark in the motion vectors in the compressed domain. This approach was not robust against common watermarking attacks.

Watermark embedding may be conducted directly on the bitstream as well, i.e., by substitution of certain bits. In [18], authentication of H.264/AVC was performed by direct watermarking of CAVLC codes. In [13], Kim et al. presented a new algorithm embedding the watermark in the sign bit of the trailing ones in CAVLC of H.264/AVC with no change in bit-rate with PSNR higher than 43 dB. But, this technique was not robust against attacks such as re-compression with different encoding parameters and common signal processing attacks. Zou and Bloom [31] proposed to perform direct replacement of CAVLC for watermarking purpose. For CABAC [32], they have proposed a robust watermarking method using Spread Spectrum modulation while still preserving the bit-rate. This technique is robust against slight shift, moderate downscaling and re-compression. To keep the watermark artifacts invisible and to ensure the robustness, fidelity and robustness filters were used. In [16], the authors presented an algorithm for watermarking intra-frames without any drift. They proposed to exploit several paired-coefficients of a 4 × 4 DCT block to accumulate the embedding-induced distortion. The directions of intra-frame prediction are used to avert the distortion drift. The proposed algorithm has high embedding capacity and low visual distortions. Since this technique is based on intra-prediction direction, it is not robust against re-encoding attack, which is the most common non-intentional attack.

Open-loop video watermarking algorithms face major challenges/limitations: first, the payload of such algorithms is significantly reduced, i.e., up to a few bytes per second as explained in [9]. Second, if drift compensation is not applied, there is a continuous drift, which significantly distorts the visual quality. Third, if drift compensation is applied, the algorithms face considerable bit-rate increases. Fourth, the robustness is often very limited.

In [9] and [25], the authors emphasize the importance of bit-rate preserving watermarking algorithms.

The goal of this paper was to present a robust open-loop watermarking algorithm for H.264/AVC with a minimal increment of the bit-rate and robustness against re-compression. To achieve these goals, drift is avoided instead of compensated which allows to almost preserve the original bit-rate as no compensation signals need to be coded.

3 Drift-free open-loop H.264 watermarking

The main challenge in the design of an H.264/AVC open-loop watermarking algorithm is drift, which is a result of intra- and inter-prediction [19, 20]. If the watermark embedding algorithm modifies the pixels which are used in these processes, the result is a drift error which accumulates, i.e., multiple drift errors add up to induce clearly visible distortions. In the proposed watermarking technique, a watermark message is embedded in the transformed domain, while making sure that only those pixels are modified which are not used for the intra-prediction of future blocks. Capital letters refer to matrices, e.g., CYW and lower case letters with indices refer to the elements of a matrix, e.g., x ij represents jth element in ith row of matrix X.

The embedding of the watermark modifies the original pixel value x ij by adding the watermark signal w ij , which results in the watermarked signal x w ij .

$$ x_{ij}^{w}=x_{ij}+w_{ij} $$
(1)

The values y ij are the pixels in the next block, which is predicted from x ij . A pixel value y ij can be decomposed into its predictor p ij and residual r ij :

$$ y_{ij}=p_{ij}+r_{ij} $$
(2)

At the encoder side, p ij is derived from x ij , but after the watermark embedding, x ij changes into x w ij , so p ij changes to p w ij y ij will thus become y ij ;

$$ y_{ij}^{'}=p_{ij}^{w}+r_{ij} $$
(3)

Thus, the drift error e is:

$$ e=y_{ij}^{'}-y_{ij} $$
(4)

In this work, we completely avoid this drift error, as the reference pixel values for the intra-prediction remain unchanged.

$$ p_{ij}^{w} = p_{ij},\quad y_{ij}^{'} = y_{ij},\quad \Rightarrow \quad e=0. $$
(5)

In Sect. 3.1, we present the solutions of a system of linear equations, which allow embedding of a watermark message in the DCT coefficients without modifying boundary pixels which are used for prediction of future blocks. The watermark embedding step is explained in Sect. 3.2, while watermark detection is presented in Sect. 3.3.

3.1 Intra-drift-free DCT-watermarking: solutions of a system of linear equations

In the spatial prediction of H.264, the pixels of a macroblock (MB) are predicted using only information of already transmitted blocks in the same video frame. In H.264/AVC, two types of spatial predictions are available: I 4×4 and I 16×16. The I 4×4 mode is based on predicting each 4 × 4 luma block separately. Pixels which are used for prediction in this mode are shown in Fig. 3a. Nine prediction modes are available and it is well suited for coding the detailed areas of the frame. The I 16×16 mode, on the other hand, performs prediction of the whole 16 × 16 luma block. Figure 3b shows the pixels that are used for prediction in this mode. In this mode, DC coefficients are further transformed using Hadamard transform and are sent before AC coefficients. It is more suited for coding the smooth areas of the frame and has four prediction modes. In this paper, we are presenting the algorithm for a watermark embedding in the 4 × 4 luma blocks of intra-frames. Figure 4 shows the nine different predictions modes for a 4 × 4 block size.

Fig. 3
figure 3

H.264/AVC intra-prediction in spatial domain: a Prediction is performed from pixels at left and top of every 4 × 4 block for I 4×4 mode. Moreover, 4 × 4 blocks are transmitted in a dyadic manner inside a MB. b Prediction is performed at MB level for I 16×16 mode. The order of transmission of 4 × 4 blocks inside a MB is in a raster-scan fashion. In this mode, DC coefficients are further transformed using Hadamard transform and are sent before AC coefficients

Fig. 4
figure 4

Nine intra-prediction modes (a i) and labeling of the pixels (j)

Figure 4j shows the labeling of prediction sample. The prediction is based on pixels labeled A to M for the current block. Thus, the dhlpmno pixels will be used for future prediction. Since it is only the residual r ij which is coded in the bitstream and is watermarked in bitstream domain watermarking techniques, if the residual r ij for pixels at positions dhlpmno in Fig. 4j are unchanged, then the original pixels will remain unchanged. Since these pixels are used in the prediction, the error brought by the watermark embedding process will not propagate in the intra-frame. Moreover, if the prediction mode is unchanged, the bit-rate increment will only come from the entropy coding of a few modified coefficients and thus remain low. Since we aim to embed the watermark in the residual in the compressed domain, we need to explore the relationship between the spatial domain pixel values and the compressed domain coefficients.

The 4 × 4 residual matrix X is derived by the 4 × 4 de-quantized transform coefficient matrix Y as follows:

$$ X = C^T Y C $$
(6)

The matrix C is defined in the H.264 standard.

$$ C=\left(\begin{array}{cccc} 1 & 1 & 1 & 1\\ 1 & 0.5 & -0.5 & -1\\ 1 & -1 & -1 & 1\\ 0.5 & -1 & 1 & -0.5 \end{array}\right) $$
(7)

Transform matrix Y will be modified to Y′, such that the border pixels of X′ = C T YC are the same as those of X, i.e.:

$$ C^T Y' C - C^T Y C = X' - X = D = \left(\begin{array}{cccc} d_{00}&d_{01}&d_{02}&0 \\ d_{10}&d_{11}&d_{12}&0 \\ d_{20}&d_{21}&d_{22}&0 \\ 0 &0 &0 &0 \\ \end{array}\right) $$
(8)

Y′ can be written as Y + N, with \(N = \left(\begin{array}{llll} n_{00}&n_{01}&n_{02}&n_{03} \\ n_{10}&n_{11}&n_{12}&n_{13} \\ n_{20}&n_{21}&n_{22}&n_{23} \\ n_{30}&n_{31}&n_{32}&n_{33} \\ \end{array}\right), \) and thus,

$$ X'-X = C^T Y' C - C^T Y C = C^T (Y + N) C - C^T Y C = C^T N C = D $$
(9)

Each d i,j  = 0 in Eq. (8) corresponds to a linear equation, we thus get seven linear equations in \(\overrightarrow{n} = (n_{00}, \ldots, n_{33})^T.\)

$$ P \overrightarrow{n} = \overrightarrow{0} $$
(10)

The solutions of this system of linear equations are spanned on a 9-element basis. Any linear combination of these basis vectors can be added to the de-quantized coefficient matrix without changing the border pixels.

The nine possible bases that would preserve the last row and column for the blocks are given below in Eq. (11). As explained above, any summation of these basis matrices could be computed and used as solution patterns that will preserve the “prediction pixels”.

$$ \begin{array}{c} \left\{ \left(\begin{array}{cccc} \frac{1}{4} & 0 & 0 & \frac{1}{2}\\ 0 & 0 & 0 & 0\\ 0 & 0 & 0 & 0\\ \frac{1}{2} & 0 & 0 & 1\\ \end{array}\right), \left(\begin{array}{cccc} -\frac{1}{2} & 0 & \frac{1}{2} & 0\\ 0 & 0 & 0 & 0\\ 0 & 0 & 0 & 0\\ -1 & 0 & 1 & 0 \end{array}\right), \left(\begin{array}{cccc} \frac{1}{2} & \frac{1}{2} & 0 & 0\\ 0 & 0 & 0 & 0\\ 0 & 0 & 0 & 0\\ 1 & 1 & 0 & 0 \end{array}\right), \left(\begin{array}{cccc} -\frac{1}{2} & 0 & 0 & -1\\ 0 & 0 & 0 & 0\\ \frac{1}{2} & 0 & 0 & 1\\ 0 & 0 & 0 & 0 \end{array}\right), \left(\begin{array}{cccc} 1 & 0 & -1 & \hbox {0}\\ 0 & 0 & 0 & 0\\ -1 & 0 & 1 & 0\\ 0 & 0 & 0 & 0\\ \end{array}\right),\right.\\ \left.\left(\begin{array}{cccc} -1 & -1 & 0 & 0\\ 0 & 0 & 0 & 0\\ 1 & 1 & 0 & 0\\ 0 & 0 & 0 & 0\\ \end{array}\right), \left(\begin{array}{cccc} \frac{1}{2} & 0 & 0 & 1\\ \frac{1}{2} & 0 & 0 & 1\\ 0 & 0 & 0 & 0\\ 0 & 0 & 0 & 0 \\ \end{array}\right), \left(\begin{array}{cccc} -1 & 0 & 1 & 0\\ -1 & 0 & 1 & 0\\ 0 & 0 & 0 & 0\\ 0 & 0 & 0 & 0 \\ \end{array}\right), \left(\begin{array}{cccc} 1 & 1 & 0 & 0\\ 1 & 1 & 0 & 0\\ 0 & 0 & 0 & 0\\ 0 & 0 & 0 & 0 \\ \end{array}\right)\right\} \end{array} $$
(11)

Our watermark embedding requires two solution patterns of the linear system to embed a single symbol of the watermarking sequence. The solutions should modify important coefficients (close to the DC coefficient, but not the DC coefficient itself), but modify as few coefficients as possible, which are also most likely to be non-zero. Hence, we have selected patterns with the minimum amount of non-zero entries that contain non-zero entries close to the DC coefficient. The following two solution patterns have been employed, each containing 4 non-zero coefficients:

$$ {\rm sol}_1 = \left[\begin{array}{cccc} 0 &1 & 1 & 0\\ 0 & 1 & 1 & 0\\ 0 & 0 & 0 & 0\\ 0 & 0 & 0 & 0\\ \end{array}\right] $$
(12)
$$ {\rm sol}_2 = \left[\begin{array}{cccc} 0 &0 & 0 & 0\\ 1 & 1 & 0 & 0\\ 1 & 1 & 0 & 0\\ 0 & 0 & 0 & 0\\ \end{array}\right] $$
(13)

We can notice that sol1 actually corresponds to the matrix summation of the 8th and 9th elements of the basis matrices, whereas sol2 corresponds to the matrix summation of the 6th and 9th elements of the basis matrices (see the 9-basis matrices in Eq. 11).

We can thus use either solution sol1 or sol2 to embed a 1 or a 0 within the block:

$$ \left\{ \begin{array}{c} y_{1(i,j)}=y_{i,j}+\alpha \cdot {\rm sol}_{1(i,j)} \rightarrow \hbox {Embed watermark bit `0'}\\ y_{2(i,j)}=y_{i,j}+\alpha \cdot {\rm sol}_{2(i,j)} \rightarrow \hbox {Embed watermark bit `1'} \end{array}\right. $$
(14)

where α is a weighting parameter; for any α value, the coefficient used for the prediction will be unchanged in y 1(i,j) and y 2(i,j). In the following, we will explain how to use a JND model to optimize the embedding strength.

3.2 Watermark embedding

For a better imperceptibility, we have adapted DCTune [28] to H.264/AVC. DCTune performs in the DCT transform domain. Since DCT transform has been replaced by integer transform (IT) in H.264/AVC, we have computed the DCTune visibility threshold using the IT. The adaptation of DCTune in our method is performed the same way as in [20]. Interested readers should refer to Sect. 2 in [20] for in-depth details on the DCTune adaptation. In a video frame, every area has different characteristics in terms of luminance, contrast and texture. Hence, the sensitivity to human vision also varies from one area to another inside a video frame. DCTune computes a visibility threshold for every transformed coefficient in the DCT domain. DCTune considers both luminance masking and contrast masking. Luminance masking occurs in the bright portions of an image, where basically, our visual system is less sensitive to luminance variation. Effectively, the same luminance variation leads to a lower contrast in brighter areas (than in darker areas). Hence, information loss is less noticeable in bright portions of the image. Concerning contrast masking, the human visual system exhibits a reduced sensitivity when two components having similar frequencies and orientations are mixed together.

Watermark can be embedded in two ways as shown in Fig. 5. If we have raw video input, the watermark embedding and video compression steps can be simultaneously performed as shown in Fig. 5a. A 4 x 4 block is selected for embedding based on solution of linear equations and DCTune threshold. Since the watermark is embedded in encoded video and not in the original video, DCTune is employed on reconstructed video content as shown in Fig. 5a.

Fig. 5
figure 5

Block diagram of drift-free robust watermarking system for H.264/AVC, a raw video input, and b H.264 bitstream input

If the input video is in compressed form, we need to perform entropy decoding for watermark embedding as shown in Fig. 5b. In this case, video will be completely decoded for HVS mask creation by DCTune. It is pertinent to mention that complete video decoding is required only once. For example, if we want to embed 1,000 different watermarks in a video bitstream, it would require entropy re-encoding 1,000 times but a complete decoding only once.

For both approaches, the solutions of the system of linear equations are employed for watermarking the QTCs.

One can note that the HVS masking operates on the decoded/reconstructed pixels, which already takes into account the quantization effect.

As previously explained, for each 4 × 4 block, a system of seven linear equations is derived. Evidently, several solutions may exist to this system of equations. In this work, for every “watermarkable” block, we need at least two solutions to exist. Basically, using either of these two solutions, allows us to embed the watermark information into the host media. In this work, once two solutions to the system are selected, they are used throughout all the blocks of every intra-frame.

Thus, two solution patterns of the system of linear equations are used to embed a single symbol of a bipolar watermark sequence W ∈ {−1, +1} (with zero mean and unit variance). Consequently, using the two solution patterns, we either encode −1 or +1 within the considered block. During the detection, the cross-correlation of the two solutions with the possibly watermarked block is determined. Thus, the goal was to achieve a maximum difference between these two correlations (cross-correlation gap θ), while keeping the embedding strength within the DCTune visibility threshold. The DCTune JND mask provides a maximum allowable strength for every coefficient within the 4 × 4 block; thus, the transformed coefficients within the selected embedding patterns (see Eqs. 12, 13) are successively increased until their value reaches the DCTune threshold. For every considered 4 × 4 block, DCTune provides a 4 × 4 JND matrix. During the perceptual optimization, we use the minimum JND coefficient being non-zero in either sol1 or sol2 (see Eqs. 12, 13). This minimum perceptual weighting coefficient is then used as α in Eq. (14). It is important to note that only the coefficients located within the patterns sol1 or sol2 are modified, all other coefficients are kept unchanged. It is important to note that the DCTune mask coefficients are not needed during the detection stage.

To determine the required processing power for the proposed algorithm, we performed a simulation on 50 frames of the tempete sequence in CIF format with a QP value of 24. For this simulation, three scenarios are considered: (1) encoding only was performed on the sequence, (2) both encoding and watermark embedding was performed, and (3) the sequence was encoded and the watermark was weighted using DCTune before the embedding. For this experiment, a 2.1-GHz Intel Core 2 Duo T8100 machine with 3,072 MB RAM was used. We have done this simulation both on intra-only sequence and on intra- and inter-sequences. For intra-only sequences, where a watermark is embedded within every frame, the encoding itself took 47.9 s for the whole sequence of 50 frames. Encoding the sequence and embedding the watermark took 48.9 s, whereas encoding and watermarking performed using DCTune to weight the watermark strength took 49.3 s.

Since intra-only sequence is used very rarely and intra- followed by inter-frames (IPP sequence) are used most of the time, the simulation was also conducted on intra- and inter-sequence with intra-period of 10. On IP sequences, the encoding itself took 4,078 s, while encoding the sequence and embedding the watermark took 4,083 s. Finally, encoding and watermarking performed using DCTune to weight the watermark strength took 4,084 s.

Table 1 shows the processing time for all three scenarios. The processing times are given both in terms of average time per frame and for the whole processing time needed for the 50 frames. We can observe that when I frames only are considered, the processing time is increased by less than 3 % compared to the “encoding-only” scenario. When P frames are included in the sequence (intra-period 10), the use of DCTune when embedding the watermark only increases the processing time by 0.15 %.

Table 1 Processing time when (1) encoding only is applied, (2) encoding and watermarking are applied, and (3) encoding and DCTune adapted watermarking are performed

The proposed embedding process is thus slightly faster than the works in [16] where the authors mentioned, “The data hiding procedure for an I frame of the sequence Coastguard and Bridge-far cost 188 ms and 125 ms on average, respectively”. It should be noted that in [16], QCIF sequences were used, whereas in Table 1, we used CIF format sequences.

Moreover, to ensure a proper quality on Intra- plus Inter-coded sequences, the PSNR was computed on an sequence containing both intra- and inter-coded frames. Figure 6 shows for every frame of the tempete sequence (50 frames) the PSNR when (1) only coding is applied (QP = 24) and (2) both coding and watermarking are applied. In this experiment, the intra-period is 10, only the I frames are watermarked (frames 1, 11, 21, 31, 41). On average (across the 50 video frames), when encoding only was performed, the PSNR reaches 39.35 dB, and when both coding and watermarking are applied, the PSNR equals 38.89 dB. We can thus witness only a 0.46-dB decrease when the watermarking method is applied.

Fig. 6
figure 6

PSNR computed on every frame of I + P coded sequence for either coding only or watermarked sequence

The following constraints are employed during the watermark embedding:

  • Bit-rate constraint we do not modify zero transform coefficients, i.e., these blocks are not used for embedding.

  • Imperceptibility constraint DCTune visibility threshold is calculated for each transformed coefficient. Embedding is performed with the maximum embedding strength allowed by the DCTune visibility threshold.

  • Robustness constraint To guarantee the robustness of this scheme, the cross-correlation gap between the two added solution patterns should be greater than a pre-determined threshold θ.

Embedding can be interpreted as a constrained optimization problem, in which we want to maximize the embedding strength under the constraints of bit-rate, imperceptibility and robustness.

When modifying the residuals, several constraints need to be respected:

  • A block is considered as watermarkable if at least two solutions exist for the system of equations.

  • For every watermarkable 4 × 4 block, the border pixels (x 03x 13x 23x 30x 31x 32x 33) must be kept unchanged (see Eq. 8).

  • Modify the 4 × 4 blocks having non-zero coefficients at the positions determined in the selected solution patterns (see Eqs. 12, 13).

  • The watermark strength must remain below the visibility threshold as defined by DCTune.

  • The cross-correlation gap between the two solution patterns should be above a threshold θ.

3.3 Watermark detection

The detection of the watermark is performed in the compressed domain as well. Its procedure is shown in Fig. 7 (within the dotted rectangle). During the watermark detection process, the following data are required during the retrieval procedure: (1) the chosen solution patterns of the system of linear equations used for watermarking (sol1, sol2) and (2) the watermarked block positions. The proposed detection algorithm is thus semi-blind as some part of information needs to be transmitted to the detector. More precisely, 32 bits are needed to encode the solution patterns, and 21 bits are needed for every embedded watermark symbol to specify its particular location within the frame, and within the embedding block.

Fig. 7
figure 7

Watermark detection procedure

The detection actually proceeds in two separate steps: in a first step, a watermark is extracted from the potentially watermarked video sequence, and then, during the second step, a global detection is performed where a correlation is computed between the original watermark and its extracted counterpart. These two steps are detailed thereafter.

Watermarked transform coefficients are extracted from the watermarked block. Let Z be the extracted 4 × 4 block, which can be extracted directly from the bitstream by partially decoding it as shown in Fig. 7. We will extract one bit of watermark from the block Z using a normalized cross-correlation with each of (sol1, sol2) to find which solution pattern has been embedded as explained in Algorithm 1.

Once the extraction of all the watermark symbols is performed, the so-obtained watermark sequence WMext is compared to the original watermark sequence using a normalized cross-correlation. A detection threshold has to be determined to ensure a proper detection [2, 17, 23]. False alarm and missed detection rates were computed to select the optimal detection threshold as explained in detail in Sect. 4. The second cross-correlation differs from the first one in the sense that it is employed to detect the watermark sequence, while the first correlation is employed to extract a symbol of the watermarking sequence, i.e., it is employed to determine which solution pattern has been embedded in the block.

4 Simulation results

This section presents evaluations of the proposed watermarking algorithm.Footnote 1 We have used a custom implementation based on the H.264/AVC reference software JM 12.2.Footnote 2 Several benchmark video sequences containing different combinations of motion, texture and objects have been used. In the following, unless otherwise specified, the benchmark sequences were in CIF format (352 × 288 pixels). In Sect. 4.1, the choice of a proper detection threshold is presented. The payload and invisibility of our algorithm are assessed as well. Finally, the robustness of the technique is tested against re-compression.

4.1 Detection threshold

The choice of the optimal detection threshold was obtained by computing the false alarm and missed detection rate on 100 samples, each sample is made of three consecutive intra-frames. The false positive and false negative rates were computed and led to an optimum detection threshold at 0.3 as depicted in Fig. 8. These thresholds allow minimization of false positives, while granting a very weak false negative rate. We deliberately chose here to avoid false positives to ensure that no watermark is detected in an unmarked video. Figure 8 shows the detection performances for five benchmark video sequences. Gray curves indicate the detector result when it is looking for a wrong watermark, while the black curves indicate the results for an original watermark.

Fig. 8
figure 8

Experimental analysis for threshold selection (false positives and true negatives)

In Fig. 8a, the watermarked sequences were re-encoded using a QP value of 34, whereas in Fig. 8b, the re-encoding used a QP value of 40, and thus the true positives decreases.

Concerning the selection of a detection threshold, two detection scenarios need to be considered. The hypothesis \(\mathcal{H}_0\) when no watermark is actually embedded into the host media (or the detector seeks for a different watermark) and the hypothesis \(\mathcal{H}_1\) when the correct watermark is indeed embedded into the input sequence. The detector is run on a large number of sequences under the two assumptions, \(\mathcal{H}_0\) and \(\mathcal{H}_1, \) and the distribution of the detector is plotted for both cases. This way, two distinct distributions (commonly Gaussian) can be observed while varying the detection threshold. Ideally, the two distributions \(\mathcal{H}_0\) and \(\mathcal{H}_1\) should not overlap, which then allows us to set a detection threshold in between the two distributions. Distributions for the two hypotheses \(\mathcal{H}_0\) and \(\mathcal{H}_1\) for QP values of 34 and 40 are depicted in Fig. 9a, b, respectively. Moreover, Fig. 9c shows all the 500 detections (whether detected or undetected) for all the sequences. In this plot, five input sequences were considered (“bus”, “mobile”, “paris”, “tempete” and “waterfall”), each one was either watermarked with the correct embedding sequence that is used by the detector (\(\mathcal{H}_1, \) represented by gray circles) or watermarked with a different watermark sequence (\(\mathcal{H}_0, \) represented by white squares). For each of the five input sequences, 100 watermarked versions were generated, and thus, the \(\mathcal{H}_0\) and \(\mathcal{H}_1\) distributions are each composed of 500 data points.

Fig. 9
figure 9

a \({{\mathcal{H}}_0}\) vs. \({{\mathcal{H}}_1}\) analysis and detection for all five input sequences

4.2 Analysis of payload, bit-rate and visual quality

The watermark embedding payload is dependent on two parameters: video content and QP value. The more the video content exhibits textures and motion, the higher is the payload. Table 2 shows the payload, bit-rate increment and PSNR decrease for the benchmark sequences at QP value of 24. The negative value of bit-rate modification actually represents a reduction in bit-rate. Bit-rate change is so small that it is shown in milli (10−3) of percent. One can observe that the proposed technique has a negligible effect on PSNR as well, with only 0.064 dB increase, while having an acceptable payload, and a similar bit-rate. It is important to notice that there is a negligible change in the bit-rate despite the fact that we have embedded the watermark message with the maximum embedding strength allowed by DCTune, to make our scheme more robust.

Table 2 Analysis of the proposed algorithm in terms of payload (per frame), bit-rate and PSNR at QP values of 24 for benchmark sequences

Video content can be encoded at different quality levels using different QP values. Hence, it is important to observe the payload of the proposed scheme on different QP values. Table 3 shows the analysis of our proposed scheme at different QP values for benchmark video sequence tempete. It is evident that the proposed scheme conserves bit-rate and PSNR for the whole range of QP values. Nevertheless, payload decreases with increase in QP value. It is because we embed our watermark symbols only in blocks where solely non-zero coefficients are affected. The number of such blocks is increased with the increase in the QP value.

Table 3 Analysis of the proposed algorithm in terms of payload (per frame), bit-rate and PSNR for different QP values for the tempete sequence

4.3 Robustness against re-compression

Re-compression is the most traditional non-intentional attack against watermarked videos. Evidently, the higher is the quantization step, the more devastating is the attack. Table 4 summarizes the cross-correlation peaks for five benchmark video sequences for re-compression with ten different QP values (from 24 to 42 with a step of 2). These results show that the watermark is successfully detected up to a QP value of 40. In Table 4, the data in italics represent the missed detections (the peak correlation is below the detection threshold).

Table 4 Watermark detection rate for benchmark video sequences under re-compression at different QP values (originally encoded at QP value of 24)

4.4 Comparison with previous works

For the sake of comparison with the recent works on H.264/AVC watermarking, we compared the performances of our watermarking system with [19] and [20] in terms of both PSNR and watermark payload. The PSNR was computed for three watermarked videos, either for the watermarking technique operating within the encoder or in the bitstream. In the following experiments, QCIF sequences were used (144 × 176 pixels), as this was the format used by Noorkami and Mersereau [19] and [20]. Figure 10 shows the PSNR for both the proposed technique and that used in [20]. It is important to note that in [20], the spreading of the prediction error when the watermark is embedded in the bitstream quickly leads to a severe quality loss. On the other hand, PSNR closely overlaps in the proposed scheme. For example, for the bus video sequence, the PSNRs are 40.05 and 39.93 dB in the encoder and in the bitstream, respectively. The difference in the PSNR in the proposed scheme is due to the quantization step.

Fig. 10
figure 10

Comparison of PSNR values with the method presented in [20]. Seven QP values for both the Encoder and Bitstream scenarios

Moreover, in [20], the authors stated, “On average, watermark embedding using our algorithm increases the bit-rate of the video by about 5.6 % (...) readers might find it useful to know that the PSNR of the watermarked video decreases 0.58 dB compared to the compressed (but unwatermarked) video (...)”. As we will see in the following, in the proposed method, the bit-rate increase is marginal, and the PSNR decrease remains very low.

We summarize in Table 5 the performances of our algorithm in terms of bit-rate increase, PSNR decrease, and payload. We highlight the differences in terms of payload between the proposed watermarking system and the works in [19]. From this table, We can notice that the two methods present similar watermark payload. Moreover, we also present in Table 5 the payload obtained in [20]. We can notice that the payload from our proposed embedding technique cannot compete with the payload presented in [20], but on the other hand, the bit-rate modification is marginal with our method. Noorkami and Mersereau in [20] did not provide the bit-rate increase per input sequence, but, as explained above, the average bit-rate increase for their method was about 5.6 %.

Table 5 Comparison of payload per frame with the algorithm presented in [19] and [20]

It is important to note that the watermarking methods in [19, 20] are blind, whereas our algorithm is semi-blind, as previously explained.

Finally, we show in Table 6 a performance comparison with [16] in terms of both bit-rate increase and payload.

Table 6 Comparison with [16] in terms of bit-rate modification and payload

Again, we can witness that the method from [16] offers a higher embedding capacity, but also leads to a higher bit-rate increase. Such an increase could become an important issue when dealing with high definition video sequences.

5 Discussion

In these experiments, we have only embedded in very specific 4 × 4 blocks: (1) all of the coefficients involved in watermark embedding are non-zero, (2) the cross-correlation gap between the two watermarking patterns must be greater than θ, and (3) the embedding strength must be within DCTune threshold for that block. If all the three constraints are satisfied, we embed the watermark message bit using the maximum allowable strength by DCTune. In our experiments, we have selected only two solutions to embed a watermark and if any of the above-mentioned constraints are not met for those two solutions, we do not embed any watermark bit in that block.

Our main goal in this work was to provide a robust embedding technique while granting a minimum bit-rate modification and high-quality marked sequences. We did not intend to provide a high payload embedding. However, it is important to note that the watermark payload, and consequently, the robustness could significantly be improved if more than two solutions were selected. The above-mentioned objectives (robustness, invisibility, bit-rate) were reached using a bipolar watermark. Nevertheless, future works will be devoted to study the possible improvements of using a multipolar watermark.

In [20], the authors used a human visual system model adapted for 4 × 4 DCT blocks, while we have used DCTune for this purpose. Since the embedding position inside the 4 × 4 block depends on the solutions of the linear system and such solution patterns could be randomly selected to increase security.

We have used the experimental threshold for embedding and cross-correlation to detect the watermark. In [20], the authors establish a theoretical framework for watermark detection; however, this framework only applies for blind detection. Our experiment prove that our algorithm can resist re-compression even when associated with a high-quality degradation.

6 Conclusion

In this paper, we proposed an intra-drift-free robust watermarking algorithm for H.264/AVC in the compressed domain. A system of linear equations is solved to determine additive watermarking patterns for the H.264 4 × 4 transform coefficients that do not cause intra-drift in the pixel domain. The watermarking strength is adjusted with an adapted DCTune JND mask. A cross-correlation-based detection system is employed to guarantee the robustness. The proposed scheme preserves the bit-rate on average, there is often even a decrease in bit-rate. This is a significant advantage compared to previously proposed approaches, which lead to an increase in bit-rate of the watermarked video. Another advantage of our approach is its runtime-efficiency. Hence, the proposed scheme is very suitable for real-time applications, such as video streaming of actively fingerprinted content. At the same time, we see a negligible effect on visual quality (results are given in terms of PSNR). The watermark remains detectable in re-compressed videos, even in very low quality and highly distorted video sequences (encoded with a high QP).