1 Introduction

Because ever-evolving social networks, blogs, and video-sharing websites make it easy to access, edit, and redistribute digital multimedia content such as images, video, and audio, multimedia content authentication has become a constant requirement for protecting media. In the multimedia context, video authentication aims to establish the authenticity of a video in time, sequence, and content. A video authentication scheme ensures the integrity of digital video and verifies that the video put into use has not been tampered with. Digital watermarking provides a promising method of protecting digital data from illicit copying and manipulation by embedding a secret code directly into the data [19–21, 33].

Recently, video transcoding [2], a core technology for providing universal multimedia access to Internet users with different access links and devices, has become a concern for some video stream owners and producers. Consequently, there is a crucial need to protect video streams from being forcibly transcoded and illegally redistributed. One class of authentication watermarks is hard authentication [38], which rejects any intentional or unintentional modification of the video bitstream and can be considered a form of lossless authentication. The inserted watermark is so fragile that any manipulation of the video content disturbs its integrity. Digital signatures are one way of achieving hard (lossless) authentication.

The well-established H.264/AVC video coding standard [1] was jointly developed by the ITU-T VCEG and the ISO/IEC MPEG standards committees. It achieves clearly higher compression efficiency than the MPEG-2 video standard, often quoted as up to a factor of two [11]. A number of error-resilience tools for tolerating errors in H.264/AVC have been developed [13]. However, no specific tool or mechanism addresses the data integrity and authenticity of the transferred bitstreams. Therefore, data integrity and authenticity remain an open issue in H.264/AVC.

The early video watermarking approaches originated from still image watermarking techniques [5], which were extended to video by hiding the watermark in every frame separately. They cannot be applied directly to the H.264 standard, since they require full decoding and re-encoding for watermark embedding or detection. In this paper, the watermark embedding is applied in the compressed domain [18], in which the original video is provided to the embedder as a stream of bits. The embedder partially decodes the stream and parses the syntactical elements of the compressed video, such as transform coefficients, motion vectors, and intra/inter prediction modes. The elements of the partially decoded video are modified to insert the watermark, and then reassembled to form the compressed watermarked video stream. In doing so, the embedder has sufficient information, such as prediction, motion, and quantization parameters, to make informed decisions that improve fragility, imperceptibility, and bit-rate control.

In digital video watermarking, unrestrained watermark embedding may give rise to two major perceptual artifacts: spatial noise and temporal flicker. Embedding the watermark in smooth areas causes perceptual distortion in most cases, while inserting different watermarks into video frames independently, without taking the temporal dimension into account, usually yields a flicker effect, which is induced by the differences between the intensities of pixels at the same position in two successive video frames [17].

Many researchers have tackled the copyright protection and content-authentication issues in the well-established H.264/AVC coding standard which adopts many new features [29]. In the open literature, most of the proposed watermarking schemes operate during the encoding phase. In [9], a blind robust DCT-based watermarking method was proposed. This method is robust against compression. In [24], a non-blind robust watermarking method based on Human Visual Model (HVM) was proposed. This method solves the error pooling effect discussed in [33], yet it suffers from two problems: 1) The payload capacity was not convincing, and 2) the original video must be available for watermark extraction. Guo et al. [27] proposed a hybrid scheme including a robust watermark embedded in the DCT domain and a fragile watermark in motion vectors during the H.264 encoding. Their scheme suffers from security problems due to exploiting just the diagonal coefficients for embedding. In [25], the robust algorithm proposed in [24] is extended for embedding a watermark in P-frames by considering the HVM and temporal domain analysis to preserve the visual quality.

In [26], an authentication scheme for H.264 video was proposed, in which the watermark is embedded by reactivating some of the skipped macroblocks (skip-MBs). This scheme has several pitfalls: 1) Skipped MBs are 16 × 16 macroblocks, and embedding in them may induce prominent artifacts. 2) Skipped macroblocks are sent to the decoder with no coded coefficients, no header, and no prediction information, so reactivating them causes additional overhead, which increases the bit-rate. 3) Reactivating skipped MBs may open a security breach, since an attacker can distinguish reactivated skipped MBs by their single nonzero ac residual.

Kapotas et al. [14] proposed a fragile method utilizing the intra IPCM-block type for watermark embedding. The embedding is conducted over the Least Significant Bits (LSB) of the luma and chroma components in the spatial domain. This method cannot detect malicious content modification outside the IPCM-blocks, and it also increases the bit-rate. Kim [15] devised an entropy-coding-based data hiding method that embeds a watermark bit in the sign bit of the trailing ones in Context-Adaptive Variable Length Coding (CAVLC) of the H.264 bitstream. This method suffers from flickering artifacts in the temporal direction, since the errors incurred by the bit modification are accumulated throughout the intra frame in raster scan order by the intra predictions. Liu et al. [22] embedded watermark bits by changing the best block type determined by the Rate-Distortion Optimization (RDO). By constraining the prediction modes on different block sizes, the bits "0" and "1" can be hidden, which inevitably has a negative effect on the final PSNR and bit-rate.

In [12, 37], the authors exploited the intra prediction modes of qualified intra 4 × 4 luminance blocks to hide information based on mapping rules and matrix coding. The methods of Hu [12] and Yang [37] hide the secret data by modifying the best intra prediction mode of some intra 4 × 4 luminance blocks according to the mapping rules. However, the rules were derived from a statistical analysis of certain test sequences and are therefore not optimal for every video sequence. In addition, the mapping rules must be sent to the decoder to perform the watermark extraction. Wang et al. [32] proposed a fragile watermarking scheme in which the watermark is embedded into the last nonzero quantized coefficient of each DCT block during the encoding process. Unfortunately, the results show high distortion induced by the watermark insertion, due to the uninformed watermark embedding.

In [30], watermark features are extracted as the authentication data from the DCT domain to generate a unique digital signature using an MD5 hash function. The authentication information, treated as a fragile watermark, is embedded in a set of motion vectors belonging to higher motion activities with the best partition mode in a tree-structured motion compensation approach. This method suffers from highly visible distortion, as perceived in its subjective evaluation results. Kim et al. [16] proposed a fragile watermarking scheme that inserts a watermark bit in the motion vectors' LSB for inter-coded MBs or in the mode number for intra-coded MBs. For skip-MB type MBs, a watermark bit is inserted at the first nonskip-MB type MB with the same coordinate in the following frames. This method suffers from a security problem, since an attacker can easily localize the watermark embedding positions and thus guess the embedded watermarks. Xu et al. [36] proposed a semi-fragile authentication technique that extracts a self-authentication code and then embeds it into the diagonal DCT coefficients in the encoding phase of H.264/AVC. This method suffers from a security problem due to exploiting only the diagonal coefficients for watermark embedding.

In the compressed domain, Xu and Wang [35] proposed a fast fragile watermarking algorithm for H.264/AVC using Exponential-Golomb (Exp-Golomb) code word mapping. Watermark embedding is performed by modulating the Exp-Golomb coded reference frame index in the bitstream. The algorithm is claimed to be fast and to preserve the video coding efficiency and a high payload capacity. However, since the optimal reference frames are modified by the watermark embedding, perceptual transparency cannot be maintained. This is evident from their reported objective observations. In [23], a robust low complexity DCT-based scheme in the H.264 compressed domain is proposed. The watermark embedding is performed based on a spatiotemporal analysis utilizing the information available in the syntactic elements of the H.264 stream. This scheme performs well in preserving the coding efficiency, but the reported results show a relatively low payload capacity.

To circumvent the previous drawbacks, an improved low complexity content-based hard authentication scheme for the H.264/AVC compressed domain is proposed, which can detect content-preserving and/or content-changing attacks. The concept of the proposed scheme is to extract fragile features, such as the intra prediction modes of intra 4 × 4 luminance sub-blocks (I4-block) and 16 × 16 blocks (I16-block) of I-frames, and then generate a content-based Message Authentication Code (MAC). The MAC is encrypted using a content-based key and then embedded into the last nonzero quantized residuals of selected luma intra predicted I4-blocks of I-frames, based on an efficient spatiotemporal analysis in a GOP-based fashion. The content-based key is derived from some features and a secret symmetric key known only to the H.264 stream owner. Accordingly, the embedded watermark protects all kinds of MBs and frames. The authentication information can be detected and verified blindly from the encoded bitstream without the need for the original host video. Figure 1 illustrates the schematic block diagram of the proposed scheme.

Fig. 1 Schematic block diagram of the proposed scheme

The rest of the paper is organized as follows. In Section 2.1, we discuss the fragility problem in detail and perform a spatial analysis to enhance the fragility of the watermarking algorithm. Section 2.2 explains the proposed low complexity temporal method. The fragile watermark generation is presented in Section 2.3. Sections 2.4–2.5 present the watermark embedding and the watermark extraction and verification. Section 3 then presents the experimental results and discussions. Finally, conclusions and future work are given in Section 4.

2 The proposed watermarking scheme

Fragile watermarking is a popular technique for content-based authentication of digital multimedia. The essential requirement of a fragile watermark is to detect any content-preserving or content-changing manipulation. This is easily achieved by utilizing a hashed digest of the original signal to determine the authenticity of the content. Message digest generation, in which content-based bits are extracted from the structural information of the video content, is used to authenticate video streams. In this paper, a fragile watermarking scheme is proposed for H.264/AVC bitstreams to verify whether video data are authentic or not.

In the proposed watermarking scheme, an encrypted content-based Message Authentication Code (MAC) is embedded/extracted in a GOP-based fashion using the syntactic elements of the compressed bitstream. The content-based MAC consists of a vector of hashed and encrypted intra/inter prediction modes of the luma components of the intra/inter MBs. This vector acts as the authentication information to be embedded into last nonzero quantized residuals of selected I4-blocks in I-frames based on spatiotemporal analysis. The majority of previous works in the field of H.264/AVC watermarking embed the watermark information into I-frames because any tampering with these frames would lead to immediate effect on the subsequent P- and B-frames in terms of perceptual quality. Conversely, P- and B-frames are highly compressed by motion compensation, and thus normally they have less capacity to embed additional information.

Embedding the watermark into the last nonzero quantized ac residuals of selected I4-blocks has three benefits:

1) Perceptual quality: The Discrete Cosine Transform (DCT) employed in H.264/AVC coding has a strong “energy compaction” property, i.e. most of the signal information tends to be concentrated in the low-frequency part of the DCT spectrum. Consequently, modifying the last nonzero residuals reduces the perceptual distortion.

2) Fragility or sensitivity to tampering: The predefined zigzag scan order reorders the quantized ac residuals from low to high frequency. Thus, modifying the high-frequency residuals, i.e. the last nonzero residuals, offers higher sensitivity to re-encoding and signal processing manipulations.

3) Bit-rate control: Because the modification does not change the order of the residuals, the run lengths after the zigzag scan are not affected. In other words, keeping the nonzero residuals at the beginning and the long runs of zeros at the end of the data stream makes the run-length coding very efficient.
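To make the notion of the "last nonzero quantized ac residual" concrete, the following minimal sketch scans a 4 × 4 block in the standard H.264 zigzag (frame) order and locates that residual. The function names and the sample block are illustrative assumptions, not part of the proposed implementation, and the alternative scans of [34] are not modelled.

```python
# Standard 4x4 zigzag (frame) scan: scan position -> raster index (row * 4 + col).
ZIGZAG_4X4 = [0, 1, 4, 8, 5, 2, 3, 6, 9, 12, 13, 10, 7, 11, 14, 15]


def zigzag_scan(block):
    """Flatten a 4x4 block (list of 4 rows) into zigzag order, low to high frequency."""
    flat = [value for row in block for value in row]
    return [flat[index] for index in ZIGZAG_4X4]


def nnz_ac(block):
    """Number of nonzero quantized ac residuals (scan position 0, the dc term, is skipped)."""
    return sum(1 for value in zigzag_scan(block)[1:] if value != 0)


def last_nonzero_ac(block):
    """Return (scan_position, value) of the last nonzero ac residual, or None if there is none."""
    scanned = zigzag_scan(block)
    for position in range(15, 0, -1):
        if scanned[position] != 0:
            return position, scanned[position]
    return None


# Hypothetical quantized I4-block used only for illustration.
block = [
    [12, 5, 0, 0],
    [-3, 2, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]
print(nnz_ac(block), last_nonzero_ac(block))
```

In this sketch, everything after the returned scan position is a run of zeros, which is why changing the magnitude of that residual by one leaves the earlier run lengths untouched.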

In this study, the coefficient scanning order scheme for directional spatial prediction-based techniques proposed in [34] is adopted to further improve the bit-rate of the final watermarked video streams. The authors show that the probability distribution of the ac coefficient values is related to the selected intra prediction mode. Based on the statistics and mathematical analysis, they proposed two new coefficient scanning schemes based on the selected intra prediction mode. In this paper, these two scanning schemes (vertical and horizontal) are implemented to reduce the bit-rates of the watermarked video streams.

To combat the intra-collusion attack [8], which becomes possible when a single key is used to embed the same watermark in all frames, the structural information of the H.264 stream is utilized. The content-based public key (K) is derived from the 16 intra prediction modes of the 16 I4-blocks of a specific MB in an I-frame. The public key is then scrambled using a private symmetric key to generate the resultant key, which is used for two purposes: 1) to generate the fragile watermark for each GOP, and 2) to select the I4-blocks for watermark embedding. The I4-blocks are selected based on this key and a spatiotemporal analysis to further enhance fragility, imperceptibility, and security. As shown later in this section, the intra prediction modes are prone to change when re-encoding is applied; this makes the generated content-based key highly sensitive to re-encoding, so that the decoder loses synchronization. Moreover, changing the intra prediction modes of some blocks leads to different residuals, and hence makes the embedded watermark unrecoverable. Embedding the watermark in I4-blocks suits the Human Visual System (HVS), since human eyes are less sensitive to noise in edge and detail regions than in smooth areas. The security of the algorithm is provided by the random I4-block selection based on the content-based key generated for each MB.
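The key derivation described above can be sketched as follows. The text does not specify how the 16 modes are combined, how the scrambling works, or how blocks are drawn from the key, so the 4-bit packing, the XOR scrambling, and the seeded pseudo-random selection below are all assumptions chosen only to illustrate the data flow; every function name is hypothetical.

```python
import random

MASK_64 = 0xFFFFFFFFFFFFFFFF


def pack_modes(modes):
    """Assumed packing: 16 intra 4x4 prediction modes (0..8), 4 bits each, into a 64-bit key."""
    if len(modes) != 16 or any(not 0 <= mode <= 8 for mode in modes):
        raise ValueError("expected 16 prediction modes in the range 0..8")
    key = 0
    for mode in modes:
        key = (key << 4) | mode
    return key


def scramble(public_key, secret_key):
    """Assumed scrambling: XOR of the content-based key with the owner's secret key."""
    return (public_key ^ secret_key) & MASK_64


def select_candidates(resultant_key, num_blocks, count):
    """Assumed selection: seed a PRNG with the resultant key and draw candidate block indices."""
    rng = random.Random(resultant_key)
    return sorted(rng.sample(range(num_blocks), count))


# Hypothetical modes of one MB and a hypothetical secret key.
modes = [2, 0, 1, 2, 5, 8, 3, 2, 0, 0, 1, 4, 7, 2, 6, 1]
secret = 0x0123456789ABCDEF
key = scramble(pack_modes(modes), secret)

# A single mode change after re-encoding yields a different key.
changed = list(modes)
changed[5] = 4
print(key != scramble(pack_modes(changed), secret))
print(select_candidates(key, num_blocks=16, count=4))
```

Under these assumptions, any change in a prediction mode alters the resultant key, which illustrates why the verifier would lose synchronization with the embedder after re-encoding.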

Unlike some previous video coding standards (namely Motion JPEG and MPEG-2), H.264/AVC video coding employs intra prediction. Intra prediction is performed within each I-frame with blocks of two different sizes: 16 × 16, denoted as I16-block, and 4 × 4, denoted as I4-block. The I4-block is best suited for coding picture regions with significant details (textures), while the I16-block is better suited for coding smooth regions. In an I4-block, the prediction is based on the surrounding, previously coded and reconstructed blocks. The I4-block has nine prediction modes: eight are directional (mode = 0, 1, 3, …, 8) and one is directionless (DC, mode = 2). Similarly, the I16-block has four prediction modes: three are directional (mode = 0, 1, 3) and one is directionless (DC, mode = 2) [29]. For inter MBs, on the other hand, variable block-size motion compensation is used to obtain the residual information. The supported sizes include 16 × 16, 16 × 8, 8 × 16, and 8 × 8, where the 8 × 8 partition can be further divided into 8 × 4, 4 × 8, or 4 × 4 blocks.

H.264/AVC is a lossy compression standard, and thus re-encoding a video sequence produces another video sequence which is similar, but not identical, to the original one. Specifically, the intra prediction modes are liable to change when re-encoding is applied. We take advantage of this fact to estimate the I4-blocks in which an intra prediction mode change is most likely to happen. Exploiting these I4-blocks for watermark embedding can thus yield higher sensitivity to deliberate attacks. To achieve high sensitivity while preserving the essential authentication demands such as fragility, imperceptibility, and bit-rate control, we analyze the syntactic elements of the H.264 bitstreams as detailed below.

2.1 Spatial analysis

By utilizing the advantages of compressed domain watermarking, we analyze the syntactic elements of the Network Abstraction Layer (NAL) units. Watermark embedding must be applied to blocks with high detail (texture), because human eyes are less sensitive to noise in edge and detail regions than in smooth areas. In [30], the authors utilized the number of nonzero quantized ac coefficients in H.264 to estimate the spatial activity of a given block. Specifically, a larger number of nonzero quantized coefficients indicates a higher likelihood of spatial detail in the corresponding block.

As mentioned above, the luma intra prediction modes are vulnerable to change when re-encoding is applied. To demonstrate this effect, we applied re-encoding to several standard non-watermarked video sequences, and then estimated the rate of changes of intra prediction modes of I4-blocks and I16-blocks with different numbers of nonzero quantized residuals. Figure 2 shows the results for rate of changes drawn from 100 frames using six different sequences: (Mobile, Silent, Tempete, Table, Container, and Salesman) which were compressed using QP = 18 and QP = 28.

Fig. 2 The rate of changes of I4-block and I16-block intra prediction modes of different sequences with QP = 18 and 28

As can be seen, blocks containing a smaller Number of NonZero (NNZ) quantized residuals exhibit more changes in prediction mode. Hence, embedding the watermark in MBs with a lower NNZ value normally increases the probability that the watermark decoding is de-synchronized. Yet, NNZ must be constrained by a lower bound to maintain imperceptibility, as discussed later in the paper.

Because NNZ varies across video sequences according to their spatial characteristics, a threshold ω is used to select the most suitable set of blocks for watermark embedding. A constant threshold is inappropriate, since the distribution of NNZ varies from one sequence to another. Figure 3 shows the distributions of NNZ (within an I-frame) for different sequences with QP = 28. Clearly, for sequences with more detail and texture, such as Mobile and Tempete, where the MBs contain greater values of NNZ, the distributions are skewed to the right, while for a smooth sequence, such as Silent, most I4-blocks lie in the region with a lower number of nonzero quantized residuals. Therefore, the threshold should be selected according to the spatial activity of each sequence.

Fig. 3 The distribution of I4-blocks for different nonzero values in first I-frame of different sequences with QP = 28

Considering the authentication requirements of fragility, perceptual quality, and capacity, a percentage σ of the I4-blocks containing a required value of NNZ is selected for watermark embedding. A higher value of σ increases the embedding capacity but decreases the perceptual quality and fragility. To obtain a target value of ω, the Cumulative Distribution Function (CDF) of the NNZ distribution, F_NNZ, is utilized as follows:

$$ {F}_{\mathrm{NNZ}}\left(\omega \right)\le \sigma $$
(1)

The CDF of the NNZ distribution, F_NNZ, is defined as below:

$$ {F}_{\mathrm{NNZ}}\left(\omega \right)=P\left(\mathrm{NNZ}\le \omega \right)={\displaystyle \sum_{\mathrm{NNZ}\le \omega }p\left(\mathrm{NNZ}\right)} $$
(2)

Figure 4 depicts F_NNZ for the five different sequences with QP = 28. As can be seen, for the same σ, different values of ω are obtained depending on the spatial activity of the sequences. For instance, when σ = 0.5, the values of ω for Mobile and Table are 6 and 2, respectively. Thus, for each sequence, the value of ω selected based on σ provides the number of I4-blocks required for watermark embedding with the highest sensitivity while maintaining better perceptual quality.

Fig. 4 The F_NNZ for different nonzero values of different sequences with QP = 28
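The threshold selection of Eqs. (1) and (2) can be illustrated with a short sketch. It assumes that ω is taken as the largest value satisfying F_NNZ(ω) ≤ σ, which the text implies but does not state explicitly; the function names and the two sample NNZ lists are hypothetical and are not measurements from the test sequences.

```python
from collections import Counter


def nnz_cdf(nnz_values, max_nnz=15):
    """Empirical CDF of Eq. (2): entry n is P(NNZ <= n) over the given I4-blocks."""
    if not nnz_values:
        raise ValueError("at least one NNZ value is required")
    counts = Counter(nnz_values)
    total = len(nnz_values)
    cdf, running = [], 0
    for n in range(max_nnz + 1):
        running += counts.get(n, 0)
        cdf.append(running / total)
    return cdf


def select_omega(nnz_values, sigma, max_nnz=15):
    """Assumed rule: the largest omega with F_NNZ(omega) <= sigma (Eq. 1), or None if none exists."""
    omega = None
    for n, probability in enumerate(nnz_cdf(nnz_values, max_nnz)):
        if probability <= sigma:
            omega = n
    return omega


# Hypothetical per-block NNZ counts for a smooth and a textured I-frame.
smooth = [0, 0, 0, 1, 1, 2, 2, 3, 5, 8]
textured = [2, 4, 5, 6, 6, 7, 8, 9, 10, 12]

print(select_omega(smooth, 0.5))    # 1
print(select_omega(textured, 0.5))  # 6
```

For the same σ, the textured sample yields a larger ω than the smooth one, which mirrors the behaviour reported for Mobile and Table.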

In the temporal domain, another perceptual quality enhancement can be explored, since embedding watermarks in flat moving objects may induce some artifacts. To preserve a better perceptual quality, low complexity motion estimation is adopted for MBs according to the corresponding Motion Vectors (MV) as detailed in the following section.

2.2 Temporal analysis

We estimate the motion activity of a video sequence by exploiting the motion vectors of inter MBs for two reasons: 1) In the compressed domain watermarking, the motion vectors information is readily available and easy to decode from the bitstream and there is no need for further decoding and re-encoding. 2) The motion vectors represent the essential motion information for each I4-block. As a result, accessing the motion vectors provides representative information about the motion activity for the smallest block (I4-block) of interest. Hence, a low complexity and accurate temporal analysis can be achieved.

To extract the motion activity of a video sequence, the temporal activity of each I4-block is estimated by averaging the motion vectors of the corresponding blocks in the P-frames of the previous GOP. The result is stored in an array called I4-block Motion (I4M), which is calculated by:

$$ \mathrm{I}4\mathrm{M}=\frac{{\displaystyle \sum_{k\in \mathrm{PreviousGOP}} Mm{v}_i}}{ Num\left(m{v}_i\right)} $$
(3)

where

$$ Mm{v}_i=\sqrt{ mv{h}_i^2+ mv{v}_i^2} $$

where mvh_i and mvv_i denote the horizontal and vertical components of the motion vector of the i-th I4-block, respectively, and Num(mv_i) denotes the number of motion vectors considered. For each P-frame of size 176 × 144, another matrix of size 44 × 36, called Frame Motion Activity (FMA), is obtained, in which each entry holds the value of I4M for the corresponding I4-block in that frame.

To estimate the I4-block Normalized Motion Activity (I4NMA), the I4M and FMA information of the previous GOP frames is used to construct the GOP Motion Activity (GMA), which is then used to compute the I4NMA, as defined below:

$$ \mathrm{I}4\mathrm{NMA}=\frac{\mathrm{I}4\mathrm{M}}{\mathrm{GMA}} $$
(4)

where

$$ \mathrm{GMA}={\displaystyle \underset{k\in \mathrm{PreviousGOP}}{\cup}\mathrm{mean}\left({\mathrm{FMA}}_k\right)} $$

In this equation, k denotes the index of a P-frame in the previous GOP, and mean(FMA_k) is a scalar value representing the average of the FMA matrix entries, computed for each such frame. A given I4-block is considered a Fast Moving Block (FMB) if the corresponding I4NMA obtained from Eq. (4) is greater than one. Such FMBs are considered for watermark embedding, since they are good candidates for yielding fewer visual artifacts when I4NMA is employed as a qualifying factor. Blocks whose I4NMA values are less than one are skipped, since embedding the watermark into non-active areas leads to noticeable temporal artifacts.
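The temporal analysis of Eqs. (3) and (4) can be sketched as below. The combination of the per-frame means into a single GMA value is not fully specified by the equation, so this sketch assumes GMA is the average of mean(FMA_k) over the P-frames of the previous GOP; the function names and the tiny 2 × 2 motion fields are hypothetical.

```python
import math


def frame_motion_activity(mv_field):
    """FMA of one P-frame: motion vector magnitude sqrt(mvh^2 + mvv^2) per I4-block."""
    return [[math.hypot(mvh, mvv) for mvh, mvv in row] for row in mv_field]


def matrix_mean(matrix):
    """Scalar mean of all entries of a 2-D list."""
    values = [value for row in matrix for value in row]
    return sum(values) / len(values)


def fast_moving_blocks(mv_fields):
    """Return (I4NMA matrix, boolean FMB mask) for the P-frames of the previous GOP."""
    fmas = [frame_motion_activity(field) for field in mv_fields]
    rows, cols = len(fmas[0]), len(fmas[0][0])
    # Eq. (3): average magnitude per block position over the P-frames.
    i4m = [[sum(fma[r][c] for fma in fmas) / len(fmas) for c in range(cols)]
           for r in range(rows)]
    # Assumption: GMA is the average of the per-frame FMA means.
    gma = sum(matrix_mean(fma) for fma in fmas) / len(fmas)
    if gma == 0:
        zero = [[0.0] * cols for _ in range(rows)]
        return zero, [[False] * cols for _ in range(rows)]
    # Eq. (4): normalise and qualify blocks with I4NMA > 1.
    i4nma = [[value / gma for value in row] for row in i4m]
    mask = [[value > 1.0 for value in row] for row in i4nma]
    return i4nma, mask


# Two hypothetical P-frames, each a 2x2 grid of (mvh, mvv) per I4-block.
previous_gop = [
    [[(3, 4), (0, 0)], [(0, 0), (0, 0)]],
    [[(6, 8), (0, 0)], [(0, 0), (0, 3)]],
]
i4nma, mask = fast_moving_blocks(previous_gop)
print(mask)
```

For a QCIF frame the same routine would operate on 44 × 36 grids; only the top-left block qualifies here because its motion is well above the GOP average.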

2.3 Fragile watermark generation

To authenticate every GOP in the H.264/AVC stream, each GOP is processed individually. This approach enables us to detect the attacked scenes more precisely. To that end, the encrypted MAC value is embedded in each GOP. In the proposed GOP-based authentication scheme, two order sequences are introduced. The GOP Order Sequence (GOS) represents the GOP order sequence in the watermarked sequence, and the Frame Order Sequence (FOS) represents the current frame order in the watermarked sequence. The GOS and FOS are responsible for detecting GOP-based and Frame-based attacks, respectively.

The watermark generation process decodes the NAL syntactic elements to extract the intra/inter MBs and collects the intra I4-block and I16-block prediction modes (IV1), the inter prediction modes (IV2), FOS, and GOS into separate buffers. The collected buffers are then XOR-ed, and the result is processed by a one-way cryptographic hash function, PJW Hash [3], denoted as H(•). To enhance security, the generated digest is then encrypted using the aforementioned content-based key (K) in the form of

$$ {W}_G=\mathrm{E}\left(\mathrm{H}\left( IV1\oplus IV2\oplus FOS\oplus GOS\right),K\right) $$
(5)

where E(•) denotes a low complexity encryption function which scrambles its inputs based on the content-based key (K), ⨁ denotes the XOR logical operator. Finally, this digest acts as the fragile watermark to be embedded into a set of selected intra luminance I4-blocks. The details of watermark embedding are given in the following section.
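A minimal sketch of Eq. (5) follows. It uses the commonly published 32-bit form of the PJW hash; the zero-padding of buffers of different lengths, the 4-byte encoding of FOS and GOS, and the XOR-based stand-in for the scrambling function E are assumptions, since the text does not define them. All function names are hypothetical.

```python
def pjw_hash(data):
    """32-bit PJW hash of a bytes object (commonly published variant)."""
    h = 0
    for byte in data:
        h = ((h << 4) + byte) & 0xFFFFFFFF
        high = h & 0xF0000000
        if high:
            # Fold the top four bits back in, then clear them.
            h = (h ^ (high >> 24)) & 0x0FFFFFFF
    return h


def xor_buffers(*buffers):
    """XOR byte buffers together; shorter buffers are zero-padded (assumption)."""
    out = bytearray(max(len(buffer) for buffer in buffers))
    for buffer in buffers:
        for index, byte in enumerate(buffer):
            out[index] ^= byte
    return bytes(out)


def generate_watermark(iv1, iv2, fos, gos, key):
    """W_G = E(H(IV1 xor IV2 xor FOS xor GOS), K) as a list of 32 bits.

    E is modelled here as XOR with the low 32 bits of K, a stand-in only.
    """
    mixed = xor_buffers(bytes(iv1), bytes(iv2),
                        fos.to_bytes(4, "big"), gos.to_bytes(4, "big"))
    encrypted = pjw_hash(mixed) ^ (key & 0xFFFFFFFF)
    return [(encrypted >> (31 - i)) & 1 for i in range(32)]


# Hypothetical intra modes, inter modes, frame order, GOP order, and key.
bits = generate_watermark([2, 0, 1, 8, 5], [1, 3, 0], fos=0, gos=5, key=0xDEADBEEF)
print(bits)
```

Because FOS and GOS enter the digest, reordering frames or GOPs changes the input to the hash, which is the mechanism the scheme relies on to detect frame-based and GOP-based attacks.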

2.4 Watermark embedding

Several embedding techniques can be found in the literature, for example Spread Spectrum (SS) [7], Least Significant Bits (LSB) [10], and Quantization Index Modulation (QIM) [6]. Spread Spectrum and Quantization Index Modulation techniques usually support robust watermarking schemes, while LSB modulation techniques are better suited for authentication schemes. LSB is cheaper in terms of computational cost and is also more sensitive to deliberate attacks, and thus it is adopted for watermark embedding in this study.

The watermark payload is the secret Message Authentication Code (MAC) created for each GOP as described in the previous section, in the form of a binary sequence denoted as W_G = {w_i | i = 0, 1, …, M − 1, w_i ∈ {0,1}}, where M is the watermark length. The cover payload is the nonzero quantized ac residuals of the I4-blocks of the I-frames in each GOP. A Host Block (HB) denotes the I4-block in which the watermark bit w_i is embedded. To serve as a host, an I4-block (X) must be a Candidate Block (CB), which means that it belongs to a set S of blocks selected pseudo-randomly based on the generated content-based key (K).

To meet the demands of the watermark embedding process, two constraints, a fragility threshold Tr_f and a quality threshold Tr_q, are established. To prevent quality degradation, the embedding is applied to the I4-blocks which meet the following condition:

$$ \begin{array}{l}\mathrm{if}\kern0.5em T{r}_q\le \mathrm{NNZ}(X)\le T{r}_f\hfill \\ {}\mathrm{where}\kern0.5em T{r}_f=\omega +\phi \hfill \end{array} $$
(6)

where NNZ(•) indicates the number of nonzero quantized ac coefficients in a selected I4-block, and the value of the threshold Tr_q is application-dependent. The threshold Tr_f is introduced to enhance the watermark fragility. The parameter ω is derived from the spatial activity of the sequence according to (1) and (2), and the parameter ϕ has to be selected in such a way that the fragility threshold Tr_f does not exceed the maximum number of nonzero residuals. Consequently, the embedding is restricted to a set of I4-blocks which are more sensitive to re-encoding and other signal processing manipulations while maintaining high perceptual quality. The watermark embedding algorithm is organized as below:

Step 1: Parse the current H.264/AVC stream to construct the MB structure for the current GOP (Gi).

Step 2: Apply the spatiotemporal analysis to Gi.

Step 3: Call the fragile watermark generation algorithm to construct the watermark information W_G for Gi.

Step 4: If the current frame is an I-frame, then for each MB, if the current I4-block (X) is a CB which meets the condition in Eq. 6, apply Eq. 7 according to the watermark bit w_i to modulate the last nonzero quantized ac residual.

$$ a{c}_i^{\prime }=\left\{\begin{array}{ll}a{c}_i\hfill & \mathrm{if}\kern0.5em {w}_i=1\kern0.5em \mathrm{and}\kern0.5em \left|a{c}_i\right| \bmod 2=1\hfill \\ {}a{c}_i-1\hfill & \mathrm{if}\kern0.5em {w}_i=1\kern0.5em \mathrm{and}\kern0.5em \left|a{c}_i\right| \bmod 2=0\hfill \\ {}a{c}_i+1\hfill & \mathrm{if}\kern0.5em {w}_i=0\kern0.5em \mathrm{and}\kern0.5em \left|a{c}_i\right| \bmod 2=1\hfill \\ {}a{c}_i\hfill & \mathrm{if}\kern0.5em {w}_i=0\kern0.5em \mathrm{and}\kern0.5em \left|a{c}_i\right| \bmod 2=0\hfill \end{array}\right. $$
(7)

where | • | denotes the absolute value function; ac_i and ac′_i denote the original and watermarked nonzero quantized ac residuals, respectively.

Step 5: Entropy re-encode the modified MBs and write them back to the slice unit.

Step 6: Repeat Steps 1–5 until all GOPs are watermarked.

From Eq. 7, the maximum change made to the selected quantized residuals for watermark embedding is equal to one. If the watermark information W and the quantized residuals are uniformly distributed, then 50 % of the embedded bits will not affect the quantized residuals. Thus, the amount of distortion incurred is minuscule and does not degrade the visual quality of the frame.
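The host-block condition of Eq. 6 and the parity modulation of Eq. 7 can be written compactly as follows. This is a minimal sketch with hypothetical function names and parameter values; it applies the equations literally and does not model candidate-block selection or entropy re-encoding.

```python
def is_host_block(nnz, tr_q, omega, phi):
    """Eq. 6: the block qualifies if Tr_q <= NNZ(X) <= Tr_f, with Tr_f = omega + phi."""
    return tr_q <= nnz <= omega + phi


def embed_bit(ac, w):
    """Eq. 7: force the parity of |ac| to match the watermark bit w (1 = odd, 0 = even)."""
    if ac == 0:
        raise ValueError("the last nonzero residual cannot be zero")
    if w not in (0, 1):
        raise ValueError("watermark bit must be 0 or 1")
    is_odd = abs(ac) % 2 == 1
    if w == 1:
        return ac if is_odd else ac - 1
    return ac + 1 if is_odd else ac


# Hypothetical thresholds and residuals.
print(is_host_block(nnz=3, tr_q=2, omega=4, phi=1))  # True
print(embed_bit(4, 1), embed_bit(3, 0), embed_bit(-2, 1))  # 3 4 -3
```

The change never exceeds one in magnitude, in line with the observation above. Applied literally, Eq. 7 maps ac = −1 with w = 0 to zero; the text does not say how this case is treated, so the sketch leaves it as the equation gives it.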

2.5 Watermark extraction and verification

If the receiver of an H.264/AVC stream suspects that the video stream has been tampered with or intentionally modified for any reason, the watermark extraction and verification algorithm can be applied to confirm its authenticity and integrity. Watermark extraction is performed after entropy decoding. The extraction process is the inverse of the embedding process. The main steps of the watermark extraction and verification are as follows:

  1. Step 1:

    Partially decode the watermarked H.264/AVC video stream to construct the GOP structures including intra/inter prediction modes, motion vectors, and quantized residuals of all of the I4-blocks.

  2. Step 2:

    Apply the spatiotemporal analysis for Gi.

  3. Step 3:

    Call the fragile watermark generation to construct the encrypted and hashed embedded watermark information W G for the current GOP.

  4. Step 4:

    If the current frame is I-frame and current I4-block is a CB which meets the condition defined in Eq. 6, then extract the embedded watermark from the last nonzero quantized ac residual to construct the extracted watermark information W′ G . The watermark bit w i is determined as

    $$ w_i^{\prime}=\left\{\begin{array}{ll}1, & \mathrm{if}\ \left|ac_i^{\prime}\right| \bmod 2=1\\ 0, & \mathrm{otherwise}\end{array}\right. $$
    (8)
  5. Step 5:

    Compare the generated watermark W_G with the extracted watermark W′_G. If they are identical, then G_i is verified and authenticated.

  6. Step 6:

    Repeat Steps 2–5 for all GOPs. If W_G and W′_G are identical for every GOP, then the H.264/AVC stream is verified and authenticated.

The extraction process is simple and fast, because the hidden authentication information can be detected solely from the last nonzero ac residuals, the easily accessible intra/inter prediction modes, and the motion vectors. Consequently, partial decoding suffices for the proposed scheme, which is an advantage in fast video authentication scenarios, particularly in applications such as video surveillance systems, where the generated video data are normally bulky.
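Steps 4–6 can be sketched in Python as follows. This is only an illustration under stated assumptions: the function names are hypothetical, each GOP is represented as a pair of (regenerated watermark bits, last nonzero ac residuals of the selected blocks), and the partial decoding, spatiotemporal analysis, and watermark generation of Steps 1–3 are assumed to have been carried out elsewhere.

```python
from typing import Iterable, List, Sequence, Tuple


def extract_bits(last_nonzero_acs: Iterable[int]) -> List[int]:
    """Eq. 8: the extracted bit is the parity of the residual magnitude."""
    return [abs(ac) % 2 for ac in last_nonzero_acs]


def verify_gop(generated_bits: Sequence[int], last_nonzero_acs: Sequence[int]) -> bool:
    """Step 5: a GOP is authentic when regenerated and extracted bits match."""
    return extract_bits(last_nonzero_acs) == list(generated_bits)


def verify_stream(gops: Iterable[Tuple[Sequence[int], Sequence[int]]]) -> bool:
    """Step 6: the stream is authentic only if every GOP verifies."""
    return all(verify_gop(bits, acs) for bits, acs in gops)


if __name__ == "__main__":
    # Hypothetical data: (regenerated watermark bits, watermarked residuals).
    intact = [([1, 0, 1], [5, 6, -7]), ([0, 0, 1], [4, -8, 9])]
    tampered = [([1, 0, 1], [5, 7, -7]), ([0, 0, 1], [4, -8, 9])]
    print("intact stream verified:", verify_stream(intact))
    print("tampered stream verified:", verify_stream(tampered))
```

A single residual whose parity no longer matches the regenerated bit is enough to make the whole stream fail verification, which is the behavior expected from hard authentication.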

3 Experimental results and discussions

The proposed watermarking scheme was implemented using the H.264/AVC reference software JM10.2 [28]. To confirm the effectiveness of the proposed watermarking scheme, different standard video sequences (Foreman, Grandma, Carphone, Container, Claire, Mother, Bus, Salesman, Table, Soccer, Tempete, Akiyo, Stefan, Silent, News, and Mobile) in QCIF format (176 × 144) at 30 frames/s were tested. The selected video sequences cover low to high spatial detail and low to high amounts of motion. The QCIF format (YUV 4:2:0) was selected because it is a common resolution in mobile and low bit-rate applications. The GOP structure consists of an I-frame followed by 4 P-frames, coded in the Main profile with CABAC entropy coding.

In this study, the variations in bit-rate (VAR_RATE) and PSNR (VAR_YPSNR), as defined in Eqs. 9 and 10, are employed for objective comparisons.

$$ \mathrm{VAR}_{RATE}=\frac{R^{\prime}-R}{R}\times 100 $$
(9)

where R and R′ denote the bit-rate of the original and watermarked bitstreams, respectively.

$$ \mathrm{VAR}_{YPSNR}=\mathrm{YPSNR}-\mathrm{YPSNR}^{\prime} $$
(10)

where YPSNR and YPSNR′ denote the average PSNR of the luma (Y) samples in all frames of the original and watermarked bitstreams, respectively.

Since PSNR does not consider the temporal activity of the encoded bitstreams, the Video Quality Metric (VQM) [4] is also employed in this study. The value of the VQM lies between zero and one, where zero indicates the best quality and one indicates maximum impairment.

Since the bit-rate depends on the embedded capacity, in order to perform fair comparisons, the watermark cost (δ) defined in [25] is employed to denote the increase in the number of bits used to encode the watermarked video per watermark bit:

$$ \delta =\frac{R^{\prime}-R}{Cp} $$
(11)

where Cp denotes the payload capacity.
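For concreteness, Eqs. 9–11 can be written as three small Python functions. The function names are hypothetical, and the bit counts and payload in the example are made-up values used only to exercise the formulas; only the two YPSNR values are taken from the P-frame pair reported in the caption of Fig. 6.

```python
def var_rate(r: float, r_marked: float) -> float:
    """Eq. 9: bit-rate variation in percent."""
    return (r_marked - r) / r * 100.0


def var_ypsnr(ypsnr: float, ypsnr_marked: float) -> float:
    """Eq. 10: luma PSNR loss in dB (positive means the watermark lowered PSNR)."""
    return ypsnr - ypsnr_marked


def watermark_cost(r: float, r_marked: float, cp: float) -> float:
    """Eq. 11: extra bits spent per embedded watermark bit."""
    if cp <= 0:
        raise ValueError("payload capacity must be positive")
    return (r_marked - r) / cp


if __name__ == "__main__":
    # Hypothetical bit counts and payload, for illustration only.
    r, r_marked, cp = 200_000.0, 201_000.0, 400.0
    print("VAR_RATE (%):", var_rate(r, r_marked))
    print("VAR_YPSNR (dB):", round(var_ypsnr(33.82, 33.78), 2))
    print("delta (bits per watermark bit):", watermark_cost(r, r_marked, cp))
```

Dividing the bit-rate increase by the payload, as in Eq. 11, is what makes schemes with different embedding capacities comparable.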

Fragility, imperceptibility, bit-rate, and payload capacity are interrelated: a higher payload capacity normally implies greater visual quality degradation and a larger bit-rate increase. A tradeoff between these conflicting factors was therefore adopted in this study, and extensive experiments were conducted to determine suitable values of the threshold parameters σ, Tr_q, and Tr_f. On the one hand, increasing σ increases the embedded payload at the cost of more distortion and a higher bit-rate. On the other hand, increasing Tr_q improves the visual quality while decreasing the embedded payload. To estimate the control parameters σ, Tr_f, and Tr_q, several experiments were conducted over 100 frames of five standard sequences (Mobile, Container, Table, Tempete, and Silent) using the proposed method with QP = 28. Four values of the quality threshold Tr_q, ranging from 1 to 4, were tested. This range was selected because, in H.264/AVC coding, statistics show that on average 60 % of the transform coefficients of the prediction residue in one MB are quantized to zero. In addition, three values of σ were tested: 0.7, 0.6, and 0.5. For simplicity, the fragility threshold Tr_f was fixed at ω + 1 by setting ϕ = 1. The averaged results of the resulting 60 experiments (five sequences, four values of Tr_q, and three values of σ) are reported in Fig. 5. As shown in Fig. 5a, a greater value of σ yields a higher capacity and, correspondingly, a higher distortion (Fig. 5c); similarly, higher fragility, i.e. a lower Normalized Correlation Coefficient (NCC), is obtained, as depicted in Fig. 5d. Meanwhile, Fig. 5b shows an exponential trend, where the lowest watermark cost (δ) is achieved when Tr_q = 4 for all values of σ. Hence, setting σ to 0.6 and Tr_q to 4 provides a tradeoff between these conflicting factors that keeps the effect on video coding efficiency low and the fragility high.

Fig. 5 The results of average variation on (a) capacity (Cp), (b) watermark cost (δ), (c) Video Quality Metric (VQM), and (d) fragility (NCC) of different sequences with multiple values of σ (0.7, 0.6, and 0.5) and Tr_q (1, 2, 3, and 4)

3.1 Imperceptibility test

To evaluate the imperceptibility of the proposed scheme, a series of experiments was performed. Fig. 6a, e and b, f illustrate the original and watermarked second intra-coded frame of the Tempete and Foreman sequences, respectively. Similarly, Fig. 6c, g and d, h show the original and watermarked first inter-coded frame (in the 2nd GOP) of the Tempete and Foreman sequences, respectively. No significant difference in subjective visual quality is found between the original and watermarked frames. Moreover, no visible artifacts were observed in any of the test video sequences. This is also evident from the P-frames in Fig. 6c, g and d, h, in which no propagated flickering is noticed.

Fig. 6 Visual quality evaluation of the proposed scheme for Tempete and Foreman with QP = 28: (a) original I-frame (YPSNR = 35.25 dB), (b) watermarked I-frame (YPSNR′ = 35.28 dB), (c) original P-frame (YPSNR = 33.82 dB), (d) watermarked P-frame (YPSNR′ = 33.78 dB), (e) original I-frame (YPSNR = 36.78 dB), (f) watermarked I-frame (YPSNR′ = 36.77 dB), (g) original P-frame (YPSNR = 36.17 dB), and (h) watermarked P-frame (YPSNR′ = 36.15 dB)

For objective evaluation, Fig. 7 shows the frame-by-frame YPSNR of the luma samples of the original and watermarked Tempete and Foreman sequences. The VAR_YPSNR does not exceed 0.05 dB on average over the intra frames, which indicates that the proposed scheme maintains the visual quality of the watermarked bitstreams.

Fig. 7 Frame-by-frame YPSNR and YPSNR′ of the original and watermarked (a) Tempete and (b) Foreman with QP = 28

Figure 8 shows the YPSNR obtained with the proposed watermarking scheme at several values of QP, for Tr_q ranging from 1 to 4. There is almost no difference between the original and watermarked sequences; thus, the proposed scheme is suitable for applications with different fidelities. Notably, the lowest difference is obtained when Tr_q = 4.

Fig. 8 YPSNR curves of (a) Tempete and (b) Foreman at constant QP with Tr_q = (1, 2, 3, and 4)

Figure 9 shows the rate-distortion curves of the proposed scheme for Tr_q ranging from 1 to 4. The proposed scheme does not noticeably reduce the perceptual quality, which remains almost the same as that of the original codec. Thus, the proposed scheme maintains the perceptual quality under various bit-rates.

Fig. 9 Rate-distortion curves for (a) Tempete and (b) Foreman with Tr_q = (1, 2, 3, and 4)

Table 1 presents a comparison of the visual quality variation VAR_YPSNR (dB) between the proposed scheme and the former method [32], using different video sequences of length 120 frames with QP = 28. The results show that the proposed scheme outperforms the former scheme for all video sequences. The negligible perceptual distortion of the proposed method can be explained by the watermark being embedded in the last nonzero ac residuals, i.e. the high-frequency region of the DCT spectrum.

Table 1 Performance comparison in terms of visual quality variation VAR_YPSNR (dB) with former method [32]

Next, the Structural SIMilarity index (SSIM) [31] is employed to compare the proposed method with the former semi-fragile method [36] in terms of temporal perceptual quality. The value of the SSIM lies between zero and one, where zero indicates maximum impairment and one indicates the best quality. In this test, four QCIF videos of length 150 frames were used. The test was performed using the GOP structure “IPPPPPPPPPI” with QP = 28. Figure 10 shows the corresponding SSIM results. The proposed method outperforms the other method in most cases, and the average SSIM of the proposed method is above 0.98 for all the tested sequences. The improvement is attributed to the proposed spatiotemporal analysis.

Fig. 10 SSIM for the proposed method and the method in [36]

Finally, the VQM is employed to compare the proposed method with the former methods [24, 33] and [25] in terms of temporal perceptual quality. In this test, eight QCIF videos were used. The test was performed using the GOP structure “IBPBPBI” with QP = 28. Table 2 shows the corresponding VQM of these methods when only I-frames are watermarked. The proposed method outperforms the other methods, and its average VQM is about five times lower than that of the other methods. The improvement is attributed to the proposed spatiotemporal analysis.

Table 2 VQM for the proposed watermarking method and the methods in [24, 33] and [25] when watermark is embedded in I-frames

3.2 Bit-rate and payload capacity test

To show the limited bit-rate increment incurred using the proposed watermarking scheme, several experiments have been performed using various video sequences with QP = 28.

Table 3 shows the minuscule effect of the proposed watermarking scheme on bit-rate. The bit-rate increases are acceptable and are obtained with reasonable payload capacities. The small bit-rate increase (0.50 % on average) is achieved because the embedding is limited to the last nonzero quantized ac residuals.

Table 3 Payload capacity (Cp) and bit-rate variation (VAR_RATE)

To demonstrate that the proposed method preserves the coding efficiency, a comparison in terms of VAR_RATE with the method in [36] was performed. Figure 11 shows the results. The bit-rate increase of the proposed scheme is 25 times lower on average than that of the method in [36]. This can be explained by the proposed watermark embedding, in which only the last nonzero quantized ac residuals are modified.

Fig. 11 The VAR_RATE for the proposed watermarking method and the method in [36]

A further comparison with the DCT-based methods [24, 33] and [25] has been performed based on the watermark cost δ. This test was performed using the GOP structure “IBPBPBI” and QP = 28. Table 4 compares the watermark cost δ of the proposed method with that of the former methods [24, 33] and [25] for various video sequences. The proposed method is superior to these schemes on the test video sequences.

Table 4 Comparison of the proposed watermarking method with methods in [24, 33] and [25] in terms of the δ when watermark is embedded in I-frames

3.3 Fragility to tampering test

As mentioned above, the objective of this study is to propose a hard authentication scheme that is able to detect any content-preserving and/or content-changing attack. In general, an attacker of a fragile watermarking scheme would prefer to attack the watermarked video stream in such a way that the watermark is completely destroyed, yet no perceivable degradation in quality is found in the attacked video. To evaluate the fragility of the proposed scheme against such attacks, the watermarked streams Foreman, Grandma, Carphone, Container, Tempete, and Mobile were subjected to three groups of simulated attacks: 1) Group 1 comprises the content-preserving attacks, including re-encoding, rate control (50 Kbps), and transcoding (QP = 32). 2) Group 2 comprises the common signal processing attacks, including median filtering (3 × 3), Gaussian blurring (2.5 × 2.5), cropping (170 × 140) from the bottom-right, and rotation (1). 3) Group 3 includes conventional GOP-based and frame-based attacks. The simulated attacks in Groups 1, 2, and 3 are weak, medium, and strong in strength, respectively.

The content-preserving attacks in Group 1 are considered the most common attacks in the H.264/AVC bitstream domain. The attacker aims to produce another version that is similar but not identical to the encoded sequence, removing the embedded watermark without damaging the video stream. The median filtering, Gaussian blurring, and geometric attacks are designed to test the capability of detecting unintentional attacks, such as noisy channels causing high bit-error rates. The GOP-based and frame-based attacks represent a set of strong attacks that simulate intentional destructive attacks.

In the experiments, the Normalized Correlation Coefficient (NCC) is employed to measure the similarity between the embedded and detected watermark bits. NCC ranges from −1 to 1, where 1 indicates a perfect watermark match and −1 indicates a total watermark mismatch. Table 5 shows the performance of the fragile watermarking scheme under these types of attacks in terms of the detected NCC and YPSNR′, using 100 attacked frames of the Tempete sequence. From the detected NCC and YPSNR′ values, we can observe that the proposed scheme is very sensitive to attacks ranging from small pixel value changes to strong destructive attacks: the visual quality of the attacked video was seriously degraded by the attacks, and the NCC values remained very low. Thus, the scheme is able to authenticate the watermarked H.264 streams effectively.
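The NCC formula itself is not given in this section, so the following Python sketch relies on an assumption: the watermark bits are mapped from {0, 1} to {−1, +1} before a standard normalized correlation is computed, which reproduces the stated range of −1 to 1. The function name `ncc` and the bit strings are hypothetical.

```python
import math
from typing import Sequence


def ncc(embedded: Sequence[int], detected: Sequence[int]) -> float:
    """Normalized correlation of two bit sequences (assumed bipolar mapping)."""
    if len(embedded) != len(detected) or len(embedded) == 0:
        raise ValueError("sequences must be non-empty and of equal length")
    # Assumption: map bit 0 -> -1 and bit 1 -> +1 so that mismatches count negatively.
    a = [2 * b - 1 for b in embedded]
    d = [2 * b - 1 for b in detected]
    numerator = sum(x * y for x, y in zip(a, d))
    denominator = math.sqrt(sum(x * x for x in a) * sum(y * y for y in d))
    return numerator / denominator


if __name__ == "__main__":
    reference = [1, 0, 1, 1, 0, 0, 1, 0]
    print("identical:", ncc(reference, reference))
    print("inverted:", ncc(reference, [1 - b for b in reference]))
    print("half of the bits flipped:", ncc(reference, [1, 0, 1, 1, 1, 1, 0, 1]))
```

Under this mapping the NCC equals the number of matching bits minus the number of mismatching bits, divided by the watermark length, so a value near zero means the detected bits are no better than a random guess.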

Table 5 Fragility performance of the proposed watermarking scheme for Tempete sequence

Figure 12 shows the NCC of six video sequences under the aforementioned attacks. The detected NCC values are very low. In Fig. 12a, the rate control attack shows lower NCC values, since the rate control mechanism dynamically adjusts the QP of the video sequence being encoded depending on the constraint set by the encoder; intra mode changes are therefore more likely, which changes the generated watermark and the ac residuals. Moreover, the results of the Group 2 and Group 3 attacks show low NCC values (0.15 to −0.20), which is caused by the use of FOS and GOS: these attacks affect them directly, leading to watermark de-synchronization, i.e. authentication failure.

Authentication fails for the attacks mentioned in this section because, when the frames are tampered with, the hash produced at the decoder differs from that produced by the encoder: the PJW hash is a one-way hash, and the probability of obtaining the same hash from two different sets of inputs is close to zero. The effectiveness of the proposed method is enhanced by exploiting the sensitivity of the intra/inter prediction modes and the content-based key (K), which causes the embedded watermark to collapse when any manipulation occurs. From the results in Table 5 and Fig. 12, we conclude that the proposed scheme is able to detect the tested spatial and/or temporal manipulations, meaning that the method can verify the authenticity of a watermarked H.264 stream.
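To illustrate why a change in the hashed features alters the regenerated watermark, the following Python sketch implements the classic 32-bit formulation of the PJW hash. The exact hash width and the features fed to the hash are not restated in this section, so the byte strings below are hypothetical stand-ins for the GOP features.

```python
def pjw_hash(data: bytes) -> int:
    """Classic 32-bit PJW hash; the result always fits in 28 bits."""
    h = 0
    for byte in data:
        h = ((h << 4) + byte) & 0xFFFFFFFF
        high = h & 0xF0000000
        if high:
            # Fold the top four bits back into the lower bits, then clear them.
            h ^= high >> 24
            h &= ~high & 0xFFFFFFFF
    return h


if __name__ == "__main__":
    # Hypothetical serialized features of one GOP before and after tampering.
    original = b"I4 modes:0,2,1;MV:(1,-3),(0,2)"
    tampered = b"I4 modes:0,2,1;MV:(1,-3),(0,3)"
    print(hex(pjw_hash(original)))
    print(hex(pjw_hash(tampered)))
```

In this example a single altered feature byte produces a different hash value, so the watermark regenerated at the receiver would no longer match the one embedded by the sender.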

Fig. 12 Fragility performance evaluation under three groups of attacks for different sequences: (a) Group 1, content-preserving attacks; (b) Group 2, content-changing attacks; (c) Group 3, conventional GOP-based and frame-based attacks

3.4 Complexity test

The proposed method has a low computational complexity, since the major processes (fragile watermark generation, watermark embedding, and watermark extraction and verification) are all composed of simple arithmetic operations. The overhead cost of watermark embedding is measured by comparing partial decoding alone against partial decoding plus watermark embedding; likewise, the overhead cost of watermark extraction is measured by comparing partial decoding alone against partial decoding plus watermark extraction.

Figure 13 depicts the average overheads incurred with the modified JM reference software for both the encoder and the decoder, using various QCIF sequences of length 80 frames with QP = 28. The overhead cost is low: the average delay is no more than 3.18 s for embedding and no more than 1.32 s for extraction. Consequently, we conclude that the proposed method can be practically applied to H.264 streams with good efficiency.

Fig. 13 Results of average time overhead for (a) watermark embedding and (b) watermark extraction and verification

4 Conclusions and future works

In this study, a low complexity, blind GOP-based fragile watermarking scheme for authenticating the integrity of H.264/AVC videos was proposed. The use of the digital watermarking to authenticate digital video and the necessity of hard authentication were discussed. The scheme utilizes three major features of the H.264/AVC standard, including intra/inter mode prediction, motion vectors and I4-blocks quantized ac residuals. A low cost spatiotemporal analysis is proposed to lessen the effect of the imposed degradation and bit-rate increase while maintaining a high fragility to tampering. Moreover, the watermark embedding is performed in the compressed domain, and thus no extra overhead is induced.

A content-based key is generated to control fragile watermark generation, watermark embedding, and watermark extraction and verification processes. The secret self-authentication code is embedded into the last nonzero quantized ac residuals of the luma I4-blocks of I-frames in each GOP of the H.264/AVC video. Hidden information can be extracted via partially decoding the watermarked H.264 stream without the need of the original video stream.

Experimental results over several representative video sequences show that the scheme has a comparatively high payload capacity with negligible effect on both video quality and coding efficiency. In addition, fragility tests demonstrated high sensitivity to content-preserving and/or content-changing attacks. Finally, the technique exhibits a very low computational complexity, as the watermark embedding involves only simple mathematical operations. This makes it well suited to content authentication and tamper proofing in low-power handheld devices and real-time mobile applications. Future research will focus on developing a new semi-fragile scheme that exploits the new features of the next-generation HEVC standard.