1 Introduction

Steganography is the art of hiding messages in cover objects such as images, texts, videos, and network protocol packets. To keep the stego object imperceptible, the sender applies a mutually independent embedding operation to selected elements of the cover. Steganalysis is the art of detecting the existence of hidden messages in various cover objects, and possibly also determining their location and volume or extracting their content. Once the existence of hidden messages is detected, the security of the steganographic system is considered broken. In this paper, the main concern is the detection of hidden messages embedded in videos.

Most embedding methods for videos are developed from those for images. A popular method under this paradigm is LSB (Least Significant Bit) matching [24, 27], which randomly increases or decreases pixel values by one to match the LSBs with the candidate message bits. Other embedding methods, such as QIM (Quantization Index Modulation) [13, 19] and SS (Spread Spectrum) steganography [6, 8, 16], have also been introduced into spatial video steganography. Another kind of embedding method is represented by MSU StegoVideo [17], a spatial video steganographic tool available on the Internet. Since videos are usually coded before transmission, and videos obtained from the Internet are typically compressed to save bandwidth, the robustness of spatial-domain steganographic methods is crucial for recovering the hidden messages. Most of these methods maintain robustness by embedding several copies of the original hidden messages into the cover video.
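For illustration, the LSB-matching rule described above can be sketched as follows. This is a minimal NumPy sketch, not taken from any cited implementation; the one-bit-per-pixel, row-major embedding layout is an assumption made purely for illustration.

```python
import numpy as np

def lsb_match(cover, bits, rng=None):
    """Embed one message bit per pixel (first len(bits) pixels, row-major):
    when a pixel's LSB disagrees with the message bit, randomly add or
    subtract 1, staying inside the valid 8-bit range."""
    rng = np.random.default_rng(rng)
    stego = cover.astype(np.int16).ravel().copy()
    for i, b in enumerate(bits):
        if stego[i] % 2 != b:
            step = rng.choice((-1, 1))
            if stego[i] == 0:       # cannot decrease below 0
                step = 1
            elif stego[i] == 255:   # cannot increase above 255
                step = -1
            stego[i] += step
    return stego.reshape(cover.shape).astype(np.uint8)
```

After embedding, the LSBs of the first len(bits) pixels equal the message bits, while each modified pixel changes by exactly ±1.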

Various methods have been designed to detect steganography in the spatial domain of videos. Kundur and Budhia [2] proposed a detection method based on collusion, also called TFA (Temporal Frame Averaging), which is commonly used in research on watermark attacks. A coarse estimate of the cover was obtained through collusion, and residuals between the estimated frame and the stego frame were calculated. The kurtosis, entropy, and 25th percentile of the residuals were used as features for steganalysis. In [3], Kundur and Budhia further explained the basic theory behind collusion and the conditions under which it is effective. In [9], Jainsky et al. proposed MoViSteg (Motion-based Video Steganalysis). Motion interpolation was used to obtain a coarse estimate of the cover object; residuals between the estimated copy and the stego one were then analyzed by ARE (Asymptotic Relative Efficiency), and an adaptive threshold was adopted for the detection decision. In [18], the local variance of the prediction error frame was calculated over 3 × 3 neighborhoods. A Gamma distribution was fitted to the distribution of the local variances, and its two parameters were extracted as the steganalytic features.
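The TFA estimate and residual underlying these detectors can be sketched as follows (a minimal NumPy illustration of the averaging step, not the implementation of [2, 3]):

```python
import numpy as np

def tfa_residual(frames, k):
    """Estimate cover frame k by temporal frame averaging over frames
    k-1, k, k+1, then return the residual between frame k and the
    estimate; statistics such as kurtosis, entropy, and the 25th
    percentile are computed from this residual."""
    est = (frames[k - 1].astype(np.float64)
           + frames[k].astype(np.float64)
           + frames[k + 1].astype(np.float64)) / 3.0
    return frames[k].astype(np.float64) - est
```

For static content the residual of a cover video is close to zero, so an additive watermark stands out in it.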

Besides the temporal redundancy exploited by the steganalytic methods above, there is also spatial redundancy in the content of videos. Spatial averaging has been used for video steganalysis. In [23], 3 × 3 spatial average filtering was adopted to estimate the cover, and the features used in [3] were then extracted for the steganalyzer. Kashyap [12] proposed SABS (Spatial Averaging Based Steganalysis) to detect stego videos: differences of two averaging filters were calculated, and the same features as in [3] were extracted for classification.

Despite their potentially high time complexity, some image steganalytic methods are employed almost directly for video steganalysis in order to obtain better detection performance on each frame. A framework treating video steganalysis as an extension of image steganalysis was proposed in [14], consisting of collusion, several video codec algorithms (e.g., motion estimation), and image steganalytic methods. In [28], the embedding operation was modeled as the convolution of the cover and the secret message on the histogram of adjacent frames' differences; the aliasing degree was then defined as the feature for detecting the presence of hidden messages. Liu et al. [15] extracted Markov features from the differences of neighboring coefficients in the transform domain and achieved satisfying results in detecting MSU StegoVideo. Kancherla [10, 11] introduced the JPEG steganalytic features of [22] into spatial video steganalysis of MSU StegoVideo: motion estimation was used to obtain an estimated copy of the cover object, and the L1 norm between the features of the given object and those of the estimated object formed the complete feature set.

In general, one kind of existing steganalytic method, based on TFA, PEF (Prediction Error Frame), or spatial average filtering, extracts features from global statistical characteristics (such as kurtosis, skewness, and the 25th percentile) while ignoring the correlation between temporally or spatially neighboring pixels. The other kind, derived from image steganalytic methods, generally has high computational complexity and makes insufficient use of temporal redundancy. The works most related to this paper are Vinod's [18], which belongs to the former kind, and Pevný's [20], which belongs to the latter. Vinod observed that PEFs exhibit lower dependencies within the video content than the original video frames, and extracted features based on the distribution of the local variances of PEF samples. In Pevný's method, spatial redundancy was exploited: differences of spatially neighboring pixels were modeled by a Markov chain, and empirical probability transition matrices were calculated to form the SPAM (Subtractive Pixel Adjacency Model) features. Being designed for image steganalysis, however, Pevný's method utilizes no temporal redundancy.

This paper originates from the idea of jointly exploiting temporal and spatial redundancy. We focus on PEFs and model the differences of adjacent PEF samples by a Markov chain. Motion estimation exploits the temporal redundancy between adjacent frames, while differential filtering exploits the spatial redundancy between adjacent PEF samples to further suppress the video content and amplify the stego noise. Together, these two kinds of processing lead to a higher WSNR (Watermark Signal to Noise Ratio), which is favorable for steganalysis.

The paper is organized as follows. Section II presents the rationale of the proposed features: first, the model of spatial video steganography is stated; second, the relation between PEF and collusion is discussed theoretically, and PEF and PVD (Pixel Value Difference) are compared in terms of the WSNR. At the end of this section, the proposed SPEAM (Subtractive Prediction Error Adjacency Model) features are presented, followed by a discussion of the parameters in the feature extraction. Section III presents the experiments, consisting of 1) a comparison of several versions of the SPEAM features differing in the block-matching range, 2) a comparison with SPAM and prior video steganalysis methods on uncompressed video sequences, and 3) the same comparison on video sequences compressed with MPEG2 and H.264. Conclusions are drawn in Section IV.

2 Proposed SPEAM features

Existing methods are mostly derived from collusion and from image steganalysis. Aiming to amplify the WSNR of the frame under analysis, collusion effectively suppresses the temporal redundancy between adjacent frames, while image steganalytic methods, which commonly derive features from PVDs, satisfactorily suppress the spatial redundancy between adjacent pixels. In this section, the model of spatial video steganography is stated, which forms the basis for the subsequent analysis. To illustrate why we derive our features from PEFs rather than from collusion or PVDs, the relation between PEF and collusion is discussed, and PEF and PVD are compared in terms of the WSNR. Finally, the proposed SPEAM features are presented.

2.1 Model of spatial video steganography

In a video steganographic system, the cover video is denoted by \( U_k(m,n) \), where k = 1,2,…,K is the frame index, and m = 1,2,…,M and n = 1,2,…,N are the row and column indices of the pixels in each frame, respectively. Before embedding, the secret message is modulated into a signal using a pseudo-random sequence, resulting in \( W_k(m,n) \). As in [3], we call \( W_k(m,n) \) the watermark. We assume that the embedding operation is carried out in the spatial domain, and the related steganalysis is designed against spatial video steganography. Even if the embedding is carried out in a non-spatial domain such as the DCT (Discrete Cosine Transform) or DFT (Discrete Fourier Transform) domain, similar results can be formulated. The embedding operation is modeled as in [3] by

$$ {X_k}\left( {m,n} \right)={U_k}\left( {m,n} \right)+{\alpha_k}\left( {m,n} \right){W_k}\left( {m,n} \right),\;k=1,2,\ldots,K $$
(1)

where \( \alpha_k(m,n) \) is a scaling factor used to trade off imperceptibility and robustness. For simplicity of analysis, α is assumed constant over all pixels and frames, giving

$$ {X_k}\left( {m,n} \right)={U_k}\left( {m,n} \right)+\alpha {W_k}\left( {m,n} \right),\;k=1,2,\ldots,K $$
(2)

Only the Y component of video frames is considered unless stated otherwise. For LSB steganography, \( W_k(m,n)=\pm 1 \) and α = 1. For SSIS [16], \( W_k(m,n) \) is treated as a two-dimensional i.i.d. Gaussian random process obeying \( N\left( {0,\sigma_w^2} \right) \). The scaled watermark \( \alpha W_k(m,n) \) is a function of the hidden message, the scaling factor, and the secret key.
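The additive SS embedding of Eq. (2) can be sketched as follows. This is a hedged NumPy illustration; the rounding and clipping to the 8-bit range are assumptions not part of the model, and the seed plays the role of the secret key.

```python
import numpy as np

def ss_embed(cover, alpha=1.0, sigma_w=1.0, seed=0):
    """Additive spread-spectrum embedding X_k = U_k + alpha * W_k (Eq. 2),
    with W_k an i.i.d. Gaussian N(0, sigma_w^2) field generated from the
    secret key (here: the PRNG seed)."""
    rng = np.random.default_rng(seed)
    w = rng.normal(0.0, sigma_w, size=cover.shape)
    stego = cover.astype(np.float64) + alpha * w
    # quantize back to 8-bit pixel values (an assumption of this sketch)
    return np.clip(np.rint(stego), 0, 255).astype(np.uint8), w
```

With α = 0 the stego frame equals the cover, and increasing α raises the stego-noise power \( \alpha^2\sigma_w^2 \).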

2.2 Correlation between collusion and PEF

Spatial redundancy has been utilized for image steganalysis by some filtering methods such as the differential filter and the wavelet filter [21]. Spatial redundancy has also been used for video steganalysis by various methods such as spatial averaging [12, 23]. Besides spatial redundancy, there is still temporal redundancy in video sequences.

Collusion [3] is a classical method of exploiting temporal redundancy for steganalysis. When the watermark is embedded into a slow-moving video, collusion is a good way to pre-process the stego video. If the cover video is fast-moving, however, motion estimation is needed as preprocessing before collusion.

Collusion computes the average of pixels at the same position in adjacent frames, i.e., the average of \( U_k(m,n) \), \( U_{k-1}(m,n) \), and \( U_{k+1}(m,n) \). Motion-compensated collusion computes \( {{{\left( {{U_k}\left( {m,n} \right)+{{\overline{U}}_{k-1 }}\left( {m,n} \right)+{{\overline{U}}_{k+1 }}\left( {m,n} \right)} \right)}} \left/ {3} \right.} \), where \( {{\overline{U}}_{k-1 }}\left( {m,n} \right) \) comes from the block of \( U_{k-1} \) corresponding to the block of \( U_k(m,n) \). These averages are then subtracted from \( U_k(m,n) \) to obtain the residual signal, from which steganalytic features are extracted. Motion estimation is a high-complexity operation: it segments each frame into blocks and searches for the corresponding block of the current frame \( U_k \) within a specified region of its reference frame (e.g., \( U_{k-1} \)).

PEFs are obtained by replacing the averaging operation in collusion with the differential filtering \( {U_k}\left( {m,n} \right)-{{\overline{U}}_{k-1 }}\left( {m,n} \right) \). For compressed videos, PEFs can be obtained directly from the compressed bit stream, whereas motion-compensated collusion needs two copies of PEFs, which the compressed bit streams of some codecs (e.g., MPEG1) may not contain. This is why we focus on PEFs rather than on collusion.
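Motion-compensated prediction and the resulting PEF can be sketched as follows. This is an illustrative NumPy sketch under stated assumptions: full-search block matching with the SAD criterion and integer-pixel precision, which the paper does not prescribe.

```python
import numpy as np

def motion_compensated_prediction(ref, cur, block=8, search=3):
    """Full-search block matching: for each block of the current frame,
    find the best-matching block (minimum sum of absolute differences)
    in the reference frame within +/-search pixels, and assemble the
    motion-compensated prediction (the \\bar{U}_{k-1} of the text)."""
    H, W = cur.shape
    pred = np.zeros_like(cur, dtype=np.float64)
    cur_f, ref_f = cur.astype(np.float64), ref.astype(np.float64)
    for y in range(0, H, block):
        for x in range(0, W, block):
            blk = cur_f[y:y + block, x:x + block]
            best, best_sad = None, np.inf
            for dy in range(-search, search + 1):
                for dx in range(-search, search + 1):
                    ry, rx = y + dy, x + dx
                    if ry < 0 or rx < 0 or ry + block > H or rx + block > W:
                        continue  # candidate block falls outside the frame
                    cand = ref_f[ry:ry + block, rx:rx + block]
                    sad = np.abs(cand - blk).sum()
                    if sad < best_sad:
                        best_sad, best = sad, cand
            pred[y:y + block, x:x + block] = best
    return pred

def pef(ref, cur, block=8, search=3):
    """Prediction error frame P_k = U_k - \\bar{U}_{k-1} (Eq. 4)."""
    return cur.astype(np.float64) - motion_compensated_prediction(
        ref, cur, block, search)
```

When the motion between two frames is a pure translation within the search range, the interior of the PEF is exactly zero, matching the ideal-motion-estimation case analyzed below.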

Figure 1 shows the joint probabilities \( \Pr \left( {{U_k}\left( {m,n} \right),{U_{k-1 }}\left( {m,n} \right)} \right) \) and \( \Pr \left( {{U_k}\left( {m,n} \right),{{\overline{U}}_{k-1 }}\left( {m,n} \right)} \right) \) estimated from 3710 frames of 14 standard CIF video sequences (available at http://trace.eas.asu.edu/yuv/index.html; each sequence has no more than 300 frames). Due to the high correlation between adjacent frames, the values of pixels at the same (or corresponding) positions of adjacent frames are close to each other. For fast-moving videos, the deviation between \( U_k(m,n) \) and \( {{\overline{U}}_{k-1 }}\left( {m,n} \right) \) is much smaller than the deviation between \( U_k(m,n) \) and \( U_{k-1}(m,n) \), which suggests that \( {U_k}\left( {m,n} \right)-{{\overline{U}}_{k-1 }}\left( {m,n} \right) \), used below, may be less correlated with the video content than \( {U_k}\left( {m,n} \right)-{U_{k-1 }}\left( {m,n} \right) \). Figure 1 also suggests that the profile of the ridge along the main diagonal does not change much with the pixel value. This observation allows us to model the pixels in video frames by working with the differences \( {U_k}\left( {m,n} \right)-{{\overline{U}}_{k-1 }}\left( {m,n} \right) \) instead of the co-occurrences \( \left( {{U_k}\left( {m,n} \right),{{\overline{U}}_{k-1 }}\left( {m,n} \right)} \right) \), which greatly reduces the model dimensionality. Further simplification can be achieved by focusing only on differences falling within a certain range; if well chosen, this range trades off the performance and complexity of the detector.

Fig. 1

Distribution of two pixels \( \left( {{U_k}\left( {m,n} \right),{U_{k-1 }}\left( {m,n} \right)} \right) \) at the same position of adjacent frames (left), and distribution of two pixels \( \left( {{U_k}\left( {m,n} \right),{{\overline{U}}_{k-1 }}\left( {m,n} \right)} \right) \) at corresponding positions of adjacent frames (right), estimated from 3710 frames of 14 standard video sequences. The gray level at (x,y) is the probability \( \Pr \left( {{U_k}\left( {m,n} \right)=x\wedge {U_{k-1 }}\left( {m,n} \right)=y} \right) \)

To observe the correlation between two adjacent frames more clearly, two assumptions are made as below:

1) There is always high correlation between \( U_k(m,n) \) and \( {{\overline{U}}_{k-1 }}\left( {m,n} \right) \) in the host video.

2) The watermark frames \( W_k(m,n) \) are independent of \( U_k \) and of each other. In addition, \( W_k \) obeys a Gaussian distribution \( N\left( {0,{\alpha^2}\sigma_w^2} \right) \), where \( \sigma_w^2 \) denotes the variance of the stego noise and α denotes the embedding intensity.

The first assumption allows us to restrict attention to differences \( {U_k}\left( {m,n} \right)-{{\overline{U}}_{k-1 }}\left( {m,n} \right) \) falling in a small range. It is not satisfied when motion estimation is not precise enough (e.g., when the motion search range is set too small, or when videos move quickly along irregular trajectories or are captured at a low frame rate). The second assumption simplifies the analysis of SS steganography.

2.3 Comparison of PVD and PEF

The original PVDs, before embedding, are calculated by

$$ \mathrm{PV}{{\mathrm{D}}_k}\left( {m,n} \right)={{\mathrm{U}}_k}\left( {m,n} \right)-{{\mathrm{U}}_k}\left( {m,n+1} \right) $$
(3)

and the original PEFs are obtained by

$$ {{\mathrm{P}}_k}\left( {m,n} \right)={{\mathrm{U}}_k}\left( {m,n} \right)-{{\overline{\mathrm{U}}}_{k-1 }}\left( {m,n} \right) $$
(4)

After SS embedding (with the scaling factor α absorbed into \( \mathrm{W}_k \) as in the second assumption), the PVDs become

$$ \begin{array}{*{20}c} {\mathrm{PVD}_k^{\prime}\left( {m,n} \right)} & {={{\mathrm{X}}_k}\left( {m,n} \right)-{{\mathrm{X}}_k}\left( {m,n+1} \right)} \\ {} & {={{\mathrm{U}}_k}\left( {m,n} \right)-{{\mathrm{U}}_k}\left( {m,n+1} \right)} \\ {} & {+{{\mathrm{W}}_k}\left( {m,n} \right)-{{\mathrm{W}}_k}\left( {m,n+1} \right)} \\ \end{array} $$
(5)

and the PEFs become

$$ \begin{array}{*{20}c} {\mathrm{P}_k^{\prime}\left( {m,n} \right)} \hfill & {={{\mathrm{X}}_k}\left( {m,n} \right)-{{{\overline{\mathrm{X}}}}_{k-1 }}\left( {m,n} \right)} \hfill \\ {} \hfill & {={{\mathrm{U}}_k}\left( {m,n} \right)-{{{\overline{\mathrm{U}}}}_{k-1 }}\left( {m,n} \right)} \hfill \\ {} \hfill & {+{{\mathrm{W}}_k}\left( {m,n} \right)-{{{\overline{\mathrm{W}}}}_{k-1 }}\left( {m,n} \right)} \hfill \\ \end{array} $$
(6)

According to the second assumption, the distribution of \( {{\mathrm{W}}_k}\left( {m,n} \right)-{{\overline{\mathrm{W}}}_{k-1 }}\left( {m,n} \right) \) is the same as that of \( {{\mathrm{W}}_k}\left( {m,n} \right)-{{\mathrm{W}}_k}\left( {m,n+1} \right) \), which allows us to focus on the remaining components of the PVDs (i.e., \( {{\mathrm{U}}_k}\left( {m,n} \right)-{{\mathrm{U}}_k}\left( {m,n+1} \right) \)) and PEFs (i.e., \( {{\mathrm{U}}_k}\left( {m,n} \right)-{{\overline{\mathrm{U}}}_{k-1 }}\left( {m,n} \right) \)). Since it is difficult to compare these two components with existing mathematical models, we instead compare the first and second moments of the PVDs and PEFs obtained from cover videos in the experiments and analysis below.

Each frame of the original videos is segmented into blocks of size 8 × 8 for block matching. The local variances of PVDs and PEFs over 3 × 3 neighborhoods are denoted by \( Va{r_{{\mathrm{PV}{{\mathrm{D}}_{k,i }}}}} \) and \( Va{r_{{{{\mathrm{P}}_{k,i }}}}} \), where i indexes the blocks in a frame. The probabilities of the three cases of \( \left( {Va{r_{{\mathrm{P}\mathrm{V}{{\mathrm{D}}_{k,i }}}}},Va{r_{{{{\mathrm{P}}_{k,i }}}}}} \right) \) over all original video frames are shown in Table 1.
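The 3 × 3 local-variance statistic used in this comparison can be computed as follows (a direct sketch; `frac_pef_smaller` is a hypothetical helper of ours that estimates the fraction of positions where the PEF variance is smaller than the PVD variance):

```python
import numpy as np

def local_variance(img, w=3):
    """Local variance over all w x w windows (valid positions only)."""
    img = img.astype(np.float64)
    H, W = img.shape
    out = np.empty((H - w + 1, W - w + 1))
    for y in range(out.shape[0]):
        for x in range(out.shape[1]):
            out[y, x] = img[y:y + w, x:x + w].var()
    return out

def frac_pef_smaller(pvd, pef, w=3):
    """Empirical probability that Var_P < Var_PVD over co-located windows."""
    return float(np.mean(local_variance(pef, w) < local_variance(pvd, w)))
```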

Table 1 Probabilities of the three cases of \( \left( {Va{r_{{\mathrm{P}\mathrm{V}{{\mathrm{D}}_{k,i }}}}},Va{r_{{{{\mathrm{P}}_{k,i }}}}}} \right) \), where P{>} denotes the probability \( P\left\{ {Va{r_{{{{\mathrm{P}}_{k,i }}}}} > Va{r_{{\mathrm{P}\mathrm{V}{{\mathrm{D}}_{k,i }}}}}} \right\} \)

Interestingly, in most blocks of the original videos, \( Va{r_{\mathrm{P}}} < Va{r_{\mathrm{P}\mathrm{VD}}} \) holds whether the video content is fast-moving or not. Moreover, when motion estimation is perfect, \( E\left[ {{{\mathrm{P}}_k}\left( {m,n} \right)} \right]\approx 0 \) for all cover video frames, whereas \( E\left[ {\mathrm{PV}{{\mathrm{D}}_k}\left( {m,n} \right)} \right]\geq 0 \) is inherent to PVDs. A simple comparison of the WSNR of PVD and PEF for the case of ideally accurate motion estimation is given by

$$ \begin{array}{*{20}c} {WSN{R_{{\mathrm{P}\mathrm{V}{{\mathrm{D}}_k}}}}} \hfill & {=\frac{{E\left[ {{{{\left( {{{\mathrm{W}}_k}\left( {m,n} \right)-{{\mathrm{W}}_k}\left( {m,n+1} \right)} \right)}}^2}} \right]}}{{E\left[ {{{{\left( {{{\mathrm{U}}_k}\left( {m,n} \right)-{{\mathrm{U}}_k}\left( {m,n+1} \right)} \right)}}^2}} \right]}}} \hfill \\ {} \hfill & {=\frac{{2{\alpha^2}\sigma_w^2}}{{Va{r_{{PV{D_k}}}}+{E^2}\left[ {\mathrm{P}\mathrm{V}{{\mathrm{D}}_k}\left( {m,n} \right)} \right]}}} \hfill \\ {} \hfill & {\leq \frac{{2{\alpha^2}\sigma_w^2}}{{Va{r_{{\mathrm{P}\mathrm{V}{{\mathrm{D}}_k}}}}}}\leq \frac{{2{\alpha^2}\sigma_w^2}}{{Va{r_{{{{\mathrm{P}}_k}}}}}}} \hfill \\ {} \hfill & {\approx \frac{{E\left[ {{{{\left( {{{\mathrm{W}}_k}\left( {m,n} \right)-{{{\overline{\mathrm{W}}}}_{k-1 }}\left( {m,n} \right)} \right)}}^2}} \right]}}{{E\left[ {{{{\left( {{{\mathrm{U}}_k}\left( {m,n} \right)-{{{\overline{\mathrm{U}}}}_{k-1 }}\left( {m,n} \right)} \right)}}^2}} \right]}}=WSN{R_{{\mathrm{P}\mathrm{E}{{\mathrm{F}}_k}}}}} \hfill \\ \end{array} $$
(7)

The larger WSNR of PEFs under ideal motion estimation than that of PVDs suggests that features based on PEFs may be more efficient than those based on PVDs.

2.4 The SPEAM features

As discussed above, the PEF \( {{\mathrm{P}}_k}\left( {m,n} \right)={{\mathrm{U}}_k}\left( {m,n} \right)-{{\overline{\mathrm{U}}}_{k-1 }}\left( {m,n} \right) \) appears to be a favorable variable for steganalysis. Figure 2 shows \( \Pr \left( {{{\mathrm{P}}_k}\left( {m,n} \right),{{\mathrm{P}}_k}\left( {m,n+1} \right)} \right) \) for the original video sequence “akiyo”, for “akiyo” corrupted with α = 1, and for the original video sequence “waterfall”. The first two cases are easy to distinguish. However, there is no obvious deviation between the corrupted “akiyo” and the uncorrupted “waterfall”, which suggests that \( \Pr \left( {{{\mathrm{P}}_k}\left( {m,n} \right),\;{{\mathrm{P}}_k}\left( {m,n+1} \right)} \right) \) may still be correlated with the video content.

Fig. 2

\( \Pr \left( {{{\mathrm{P}}_k}\left( {m,n} \right),{{\mathrm{P}}_k}\left( {m,n+1} \right)} \right) \) of original video sequence “akiyo”, “akiyo” corrupted with α = 1, and original video sequence “waterfall”, a original “akiyo”, b “akiyo” corrupted with α = 1, c original “waterfall”

In fact, the motion estimation used to obtain PEFs suppresses temporal redundancy in the video content, while spatial redundancy inherited from the frame samples still exists among PEF samples. We therefore employ an additional differential filter to further suppress the spatial redundancy within \( \left( {{{\mathbf{P}}_k}\left( {m,n+1} \right),{{\mathbf{P}}_k}\left( {m,n} \right)} \right) \). The differential filter is denoted by

$$ \mathrm{D}_k^{\to}\left( {m,n} \right)={{\mathrm{P}}_k}\left( {m,n} \right)-{{\mathrm{P}}_k}\left( {m,n+1} \right) $$
(8)

where \( \mathrm{D}_k^{\to}\left( {m,n} \right) \) denotes the left-to-right difference of PEF samples. In addition, instead of the joint probability \( \Pr \left( {{{\mathrm{D}}_k}\left( {m,n+1} \right),\;{{\mathrm{D}}_k}\left( {m,n} \right)} \right) \), the more commonly used conditional probability \( \Pr \left( {{{\mathrm{D}}_k}\left( {m,n+1} \right)\,|\,{{\mathrm{D}}_k}\left( {m,n} \right)} \right) \) is calculated to model the correlations between adjacent PEF samples.

Figure 3 summarizes the extraction process of the SPEAM features, in which the differences of adjacent PE samples are modeled by a Markov chain. First, difference matrices of adjacent PEF samples are computed. Second, transition probabilities of the difference matrices along the same direction are calculated. Finally, subsets of these transition probability matrices are averaged into two Markov matrices, which form the SPEAM features.

Fig. 3

Scheme of extraction of SPEAM features

The Markov chain is chosen here mainly for two reasons. First, Markov features have performed well in image steganalysis, which suggests that the Markov chain is useful for modeling spatially adjacent pixels and favorable for steganalysis. Second, adjacent PEF samples within a PEF behave much like adjacent pixels within an image. Instead of rigorously analyzing the complex dependencies between adjacent PEF samples, we introduce the Markov chain to model them. Since these dependencies are disturbed when a secret message is embedded into the cover, we extract Markov features to detect the existence of the secret.

The steps of feature extraction are as follows.

Step 1: Calculate difference matrices.

Difference matrices are denoted by \( \mathrm{D}_k^{\bullet}\left( {m,n} \right) \), where k ∈ {1,…,K} is the frame index and \( \bullet \in \left\{ {\leftarrow, \to, \uparrow, \downarrow, \nwarrow, \searrow, \nearrow, \swarrow } \right\} \) gives the direction of the difference. For \( \bullet ='\uparrow ' \)

$$ \mathrm{D}_k^{\uparrow}\left( {m,n} \right)={{\mathrm{P}}_k}\left( {m,n} \right)-{{\mathrm{P}}_k}\left( {m-1,n} \right) $$
(9)

where m, n are the row and column indices. For \( \bullet =\prime \nearrow \prime \)

$$ \mathrm{D}_k^{\nearrow}\left( {m,n} \right)={{\mathrm{P}}_k}\left( {m,n} \right)-{{\mathrm{P}}_k}\left( {m-1,n+1} \right) $$
(10)

Other difference matrices are obtained in similar manners.
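The eight directional difference matrices of Eqs. (8)–(10) can be computed over the valid overlap as follows (an illustrative NumPy sketch; the ASCII direction labels are ad hoc stand-ins for the arrow symbols):

```python
import numpy as np

# Offsets (dm, dn) of the neighbor subtracted from P_k(m, n) for each of
# the eight directions; e.g. 'right' subtracts the right neighbor (Eq. 8),
# 'up' subtracts the upper neighbor (Eq. 9), 'ur' the upper-right (Eq. 10).
OFFSETS = {
    'right': (0, 1), 'left': (0, -1), 'up': (-1, 0), 'down': (1, 0),
    'ul': (-1, -1), 'dr': (1, 1), 'ur': (-1, 1), 'dl': (1, -1),
}

def difference_matrix(P, direction):
    """D_k(m, n) = P_k(m, n) - P_k(m + dm, n + dn) on the valid region."""
    dm, dn = OFFSETS[direction]
    H, W = P.shape
    # slice both operands so they cover the same valid positions
    a = P[max(0, -dm):H - max(0, dm), max(0, -dn):W - max(0, dn)]
    b = P[max(0, dm):H + min(0, dm), max(0, dn):W + min(0, dn)]
    return a.astype(np.int64) - b.astype(np.int64)
```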

Step 2: Compute transition probabilities of the difference matrices along the same direction.

The SPEAM features model difference matrices \( \mathrm{D}_k^{\bullet } \) by a first-order Markov process and compute the empirical matrices. For \( \bullet =\prime \uparrow \prime \), the empirical matrix is given by

$$ \mathrm{M}_{k,u,v}^{\uparrow }=\frac{1}{MN}\sum\limits_n {\sum\limits_m {\Pr \left( {\mathrm{D}_k^{\uparrow}\left( {m-1,n} \right)=v|\mathrm{D}_k^{\uparrow}\left( {m,n} \right)=u} \right)} } $$
(11)

where u, v ∈ {−T,…,T}, and T is the threshold on the difference values taken into consideration. If \( \Pr \left( {\mathrm{D}_{k,m,n}^{\uparrow }=v} \right)=0 \), then \( \mathrm{M}_{k,u,v}^{\uparrow }=0 \). For \( \bullet =\prime \nearrow \prime \)

$$ \mathbf{M}_{k,u,v}^{\nearrow }=\frac{1}{MN}\sum\limits_n {\sum\limits_m {\Pr \left( {\mathbf{D}_k^{\nearrow}\left( {m-1,n+1} \right)=v|\mathbf{D}_k^{\nearrow}\left( {m,n} \right)=u} \right)} } $$
(12)

It should be noted that the differential directions of the two matrices \( \mathrm{D}_k^{\bullet } \) in Eq. (11) (or (12)) are the same. Other empirical matrices are obtained in similar manners.
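An empirical transition matrix such as that of Eq. (11) can be estimated by counting, as sketched below. This is our reading of the definition: `step` is the (row, column) offset pointing from the current sample to the next one along the differential direction, and pairs with either value outside [−T, T] are discarded.

```python
import numpy as np

def transition_matrix(D, step, T=3):
    """Empirical Markov matrix M[u, v] ~ Pr(next = v | current = u) for a
    difference matrix D; rows with no samples are left at zero, matching
    the convention M = 0 when the conditioning event has probability 0."""
    dm, dn = step
    H, W = D.shape
    cur = D[max(0, -dm):H - max(0, dm), max(0, -dn):W - max(0, dn)].ravel()
    nxt = D[max(0, dm):H + min(0, dm), max(0, dn):W + min(0, dn)].ravel()
    keep = (np.abs(cur) <= T) & (np.abs(nxt) <= T)
    cur, nxt = cur[keep] + T, nxt[keep] + T      # shift into index range
    M = np.zeros((2 * T + 1, 2 * T + 1))
    np.add.at(M, (cur, nxt), 1.0)                # count (u, v) pairs
    row = M.sum(axis=1, keepdims=True)
    np.divide(M, row, out=M, where=row > 0)      # normalize non-empty rows
    return M
```

For the '↑' direction of Eq. (11), `step = (-1, 0)` pairs each \( \mathrm{D}_k^{\uparrow}(m,n) \) with \( \mathrm{D}_k^{\uparrow}(m-1,n) \).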

Step 3: Average the Markov matrices into two final matrices.

To decrease the feature dimensionality, we simply average the matrices \( \mathrm{M}_{k,u,v}^{\bullet } \) that share the same distance between the two difference samples used in computing \( \mathrm{M}_{k,u,v}^{\bullet } \) (e.g., \( \mathrm{D}_k^{\uparrow}\left( {m,n} \right)=u \) and \( \mathrm{D}_k^{\uparrow}\left( {m-1,n} \right)=v \) for \( \mathrm{M}_{k,u,v}^{\uparrow } \) in (11)). The distance is the same for all \( \bullet \in \left\{ {\leftarrow, \to, \uparrow, \downarrow } \right\} \), and takes a second common value for \( \bullet \in \left\{ {\nwarrow, \searrow, \nearrow, \swarrow } \right\} \). According to these two distinct distances, the matrices \( \mathrm{M}_{k,u,v}^{\bullet } \) of all eight directions are separated into two subsets, which are then averaged respectively. With a slight abuse of notation, the SPEAM features of a PEF can be written as

$$ \mathrm{F}_{{1,\ldots,m}}^k=\frac{1}{4}\left( {\mathrm{M}_k^{\leftarrow }+\mathrm{M}_k^{\to }+\mathrm{M}_k^{\uparrow }+\mathrm{M}_k^{\downarrow }} \right) $$
(13)
$$ \mathrm{F}_{{m+1,\ldots,2m}}^k=\frac{1}{4}\left( {\mathrm{M}_k^{\nwarrow }+\mathrm{M}_k^{\searrow }+\mathrm{M}_k^{\nearrow }+\mathrm{M}_k^{\swarrow }} \right) $$
(14)

where the dimensionalities of \( \mathrm{M}_k^{\bullet } \), \( \mathrm{F}_{{1,\ldots,m}}^k \) and \( \mathrm{F}_{{m+1,\ldots,2m}}^k \) are the same, i.e., \( m={{\left( {2T+1} \right)}^2} \).
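Given the eight empirical matrices, the averaging of Eqs. (13)–(14) is straightforward, as the sketch below shows (the direction names are ad hoc labels for the arrow symbols):

```python
import numpy as np

STRAIGHT = ['left', 'right', 'up', 'down']   # first common distance
DIAGONAL = ['ul', 'dr', 'ur', 'dl']          # second common distance

def speam_features(mats, T=3):
    """Average the eight (2T+1)x(2T+1) transition matrices into the two
    halves of Eqs. (13)-(14) and concatenate them into one vector of
    dimensionality 2*(2T+1)^2, i.e. 98 for T = 3."""
    f1 = sum(mats[d] for d in STRAIGHT) / 4.0
    f2 = sum(mats[d] for d in DIAGONAL) / 4.0
    return np.concatenate([f1.ravel(), f2.ravel()])
```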

There are two main parameters in the extraction of the SPEAM features. The first is the search range of the motion estimation scheme: a larger search range may yield higher motion estimation precision but takes more time. Experiments on the search range are presented in Section III.A. The second is the upper bound T of \( \left| {\mathrm{D}_k^{\bullet }} \right| \). A larger T takes more cases of adjacent PE samples' differences into account; when T is too large, however, most of these cases are unrelated to steganography and useless for steganalysis.

In [20], Pevný employed T = 4 as the upper bound of the difference values of adjacent pixels in image steganalysis, and experiments proved its usefulness. Here we calculate the matrix \( \mathrm{F}_{{1,\ldots,m}}^k \) for all 3710 frames of the 14 standard video sequences, and determine T from the average of \( \mathrm{F}_{{1,\ldots,m}}^k \). Figure 4 shows the average \( \mathrm{F}_{{1,\ldots,m}}^k \) over all 3710 original video frames, over stego frames embedded with SS at α = 1, and over stego frames embedded with SS at α = 3. The three figures at the top are magnified to obtain the bottom figures. Note that the blank samples lying near the anti-diagonal are mainly caused by the i.i.d. random numbers.

Fig. 4

\( \mathrm{F}_{{1,\ldots,m}}^k \) of all the original video frames (left), stego frames embedded with SS of α = 1 (middle), and stego frames embedded with SS of α = 3 (right). The upper figures are magnified to form the lower ones

Figure 4 implies that, based on the range [−3,3] × [−3,3] of \( \mathbf{F}_{{1,\ldots,m}}^k \), we can manually distinguish the stego videos from the cover ones. Our tests in the next section also show that setting T = 3 is effective, leading to a feature dimensionality of \( 2\cdot {{\left( {2T+1} \right)}^2}=98 \).

3 Steganalysis of spread spectrum steganography using SPEAM features

To evaluate the performance of the SPEAM features, we test them against SS steganography, a widely used embedding method in spatial video steganography. SS methods used in watermarking can be roughly categorized into two kinds [18]: the first embeds the same watermark pattern in all video frames, while the second never embeds the same watermark pattern in two distinct frames. We mainly consider the latter, which is closer to actual steganography.

Standard video sequences are commonly used in research on video steganalysis, video coding, object tracking, etc. The video sequences found at http://trace.eas.asu.edu/yuv/index.html are used here. The frame size is CIF (352 × 288), and the frame rate is 30 fps. For simplicity, only the first 90 frames of each video are used in the experiments. Their contents are listed in Table 1. SVM [5] is used as the classifier, with the radial basis function kernel transforming the feature vectors. Grid search is used to find the optimal parameter pair (C,γ), where C is the penalty parameter and γ is the kernel parameter. All grid points of (C ∈ {1e2,1e3,1e4}, log2 γ = −log2(98) + {−3,−2,…,4}) are tested.

Each video sequence contains a single scene, leading to high dependencies even between non-neighboring frames. This makes it unreasonable to divide each sequence into several sub-sequences and run experiments on those sub-sequences. Sequence-level cross validation, stated in Algorithm 1, is therefore designed to evaluate the proposed features. The accuracy \( Acc\left( {{C_o},{\gamma_o}} \right) \) corresponding to the optimal parameters \( \left( {{C_o},{\gamma_o}} \right) \) forms the final testing result. The accuracy of each loop in Algorithm 1 is calculated by

$$ \mathrm{Acc}\_\mathrm{iter}(i)=\frac{TP+TN }{N} $$
(15)

where TP is the number of true positives, TN the number of true negatives, and N the total number of test samples. When 5 videos are chosen for testing, the total number of loops over i for each (C,γ) is as large as \( C_{14}^5=2002 \). For simplicity, we cap the maximum of i at 100.
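The sequence-level splitting of Algorithm 1 can be sketched as follows. This is a hedged illustration: the random selection of 100 of the 2002 possible splits (and the seed) is our assumption, since the selection rule is not specified here.

```python
import numpy as np
from itertools import combinations
from math import comb

def sequence_level_cv(n_seq=14, n_test=5, max_iter=100, seed=0):
    """Enumerate test/train splits at the sequence level: whole sequences
    go to one side or the other, so frames of a sequence never appear in
    both training and testing. Yields at most max_iter splits."""
    splits = list(combinations(range(n_seq), n_test))
    assert len(splits) == comb(n_seq, n_test)   # C(14, 5) = 2002
    rng = np.random.default_rng(seed)
    rng.shuffle(splits)
    for test_idx in splits[:max_iter]:
        train_idx = [i for i in range(n_seq) if i not in test_idx]
        yield train_idx, list(test_idx)

def accuracy(tp, tn, n):
    """Acc_iter(i) = (TP + TN) / N (Eq. 15)."""
    return (tp + tn) / n
```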

Algorithm 1 Sequence-level cross validation

3.1 Motion searching range

The SPEAM features with motion search ranges of 0, 3, and 7 are tested; the corresponding results are given in Table 2. The SPEAM features calculated with a motion search range of • are denoted by SPEAM(•).

Table 2 Detection accuracy of the SPEAM features with motion searching range of 0, 3, and 7 on uncompressed videos

The corrupted pixel ratio (cpr), analogous to bits per pixel (bpp) commonly used in image steganography, is defined here as the ratio of the number of corrupted pixels to the total number of pixels.

Generally speaking, a larger motion search range implies more precise matching of blocks in two adjacent frames, which makes it easier for the steganalyzer to distinguish cover from stego objects. Table 2 shows that in most cases, the larger the search range, the better the results. However, the deviations between SPEAM(3) and SPEAM(7) are not obvious. The reason may be that the precision of motion estimation depends not only on the search range of block matching, but also on the video content, the block size, the precision of the motion unit (pixel, sub-pixel, or quarter-pixel), the effects of the embedding operation, etc.

As a larger search range increases the feature extraction time for uncompressed videos, we choose 3 as a tradeoff between motion estimation precision and complexity in the following experiments on uncompressed video sequences.

3.2 Spatial steganalysis of uncompressed video sequences

To compare the SPEAM features with the SPAM features and Budhia’s features in [3] (the block-based collusion features are tested here with a motion searching range of 3), experiments using sequence-level cross validation are conducted on uncompressed video sequences. Figure 5 gives the detection accuracy. Because of the poor performance of Budhia’s features, we test them only for cpr = 1.

Fig. 5

Detection accuracy of the SPAM features, the SPEAM(3) features and Budhia’s features on uncompressed video sequences

The results in Fig. 5 imply that in all the cases tested, the SPEAM and SPAM features perform similarly and much better than Budhia’s features. This may be because more characteristics of the video sequences (i.e., the Markov features) are exploited for steganalysis by the SPAM and SPEAM features. The time consumed by SPEAM, however, exceeds that of Budhia’s features, owing to the calculation of the probability transition matrices and to block matching.

Figure 5 also implies that SPAM and SPEAM perform similarly in general for the spatial steganalysis of uncompressed videos. This may be because, for uncompressed videos, the spatial dependencies between adjacent pixels already carry enough information to distinguish stego videos from cover videos, so that jointly exploiting temporal and spatial redundancy does not bring additional information that obviously improves the detection accuracy.

With a more precise block matching scheme, which would bring the matching result of the stego object closer to that of the cover, the SPEAM features are expected to be even more favorable.

3.3 Spatial steganalysis of compressed video sequences (MPEG2)

To further evaluate the performance of the proposed features in practical steganalytic systems, the experiment of the last subsection is replicated here on compressed video sequences. SS steganography is applied to cover videos in YUV format, which are then converted to MPEG2 format by VcDemo [26] with a GOP structure of IBBPBBPBBPBB and a motion searching range of 15. Finally, features are extracted from the cover and stego MPEG2 videos, and sequence-level cross validation is employed to obtain the testing results.

Since the PEFs can be obtained after partial decompression of the compressed videos, no additional block matching is needed to extract the SPEAM features: the Markov features of the PEFs are calculated directly. For a B-type frame, the features of its two PEFs are averaged, while for a P-type frame, the features of its single PEF are used directly.
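The per-frame combination of PEF feature vectors described above can be sketched as follows (the helper name is ours):

```python
import numpy as np

def frame_speam_features(pef_feature_sets, frame_type):
    """Combine per-PEF Markov feature vectors for one coded frame:
    a P frame has a single prediction-error frame, a B frame has two
    (forward and backward prediction), whose feature vectors are averaged."""
    if frame_type == 'P':
        (f,) = pef_feature_sets                   # exactly one PEF expected
        return np.asarray(f, dtype=float)
    if frame_type == 'B':
        fwd, bwd = pef_feature_sets               # exactly two PEFs expected
        return (np.asarray(fwd, dtype=float) + np.asarray(bwd, dtype=float)) / 2.0
    raise ValueError("only P and B frames carry prediction-error frames")
```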

For simplicity, we use SS for message embedding regardless of whether the message can be completely extracted from the MPEG2 videos. As the compression scheme may erase part of the embedded watermark, only cpr = 1 is tested here.

Figure 6 shows the detection accuracy of the SPEAM features, the SPAM features, Budhia’s features, and Vinod’s features [18]. When the bit rate is as low as 2 Mb/s, the distortion of the watermark and the video content degrades the performance of all the tested features. Conversely, when the bit rate is 5 Mb/s, the results are similar to those shown in Table 2. This is because the quality of videos coded at 5 Mb/s is close to that of the uncompressed videos (the uncompressed sequences have a pixel rate of 352 × 288 × 30 ≈ 3.04 Mpixel/s).

Fig. 6

Detection accuracy of Budhia’s features, Vinod’s features, the SPAM features, and the SPEAM features on compressed video sequences (MPEG2). Bitrates of 2 Mb/s and 5 Mb/s, and α of 1, 2, 3 are tested

In all the cases tested, the SPEAM features and the SPAM features perform much better than the other two feature sets, which suggests that modeling with a Markov chain captures more information sensitive to steganography than the i.i.d. model used in Budhia’s and Vinod’s features. Moreover, the SPEAM features appear superior to the SPAM features in most cases. Detailed ROC curves of the SPAM and SPEAM features are provided in Section 3.5.

3.4 Spatial steganalysis of compressed video sequences (H.264)

The experiments in this subsection are carried out on video sequences compressed in H.264 by the libx264 encoder of FFMPEG [7]. First, SS steganography is applied to the cover videos. Second, both the cover and the stego videos are encoded into H.264 format by FFMPEG. Last, the SPEAM features are extracted and tested. The profile of the H.264 encoder is set to “baseline”, and two bit rates (i.e., 2 Mb/s and 5 Mb/s) are tested.

Figure 7 shows the testing results of the SPEAM features and the SPAM features. Generally, the results shown in the figure are similar to the testing results on videos compressed in MPEG2 format: when the bit rate is 5 Mb/s, the two feature sets perform similarly, while when the bit rate is 2 Mb/s, the SPEAM features are superior to the SPAM features. Detailed ROC curves are provided in the next subsection.

Fig. 7

Detection accuracy of the SPAM features and the SPEAM features on compressed video sequences (H.264). Bitrates of 2 Mb/s and 5 Mb/s, and α of 1, 2, 3 are tested

3.5 Experimental results of SPEAM and SPAM

To compare the SPEAM features with the SPAM features on compressed videos, Fig. 8 gives the ROC curves when both feature sets are tested on MPEG2-format videos, and Fig. 9 gives the ROC curves on H.264-format videos. A bit rate of 2 Mb/s and α of 1, 2, 3 are tested. The two figures suggest that when α = 1, since most of the stego noise has been erased by the compression scheme, both feature sets perform poorly. When α = 2 or 3, distortion exists in the video content, and most of the stego noise survives the compression scheme. In this case, the SPEAM features, which consider both spatial and temporal redundancy, perform better than the SPAM features, which consider only the spatial redundancy between adjacent pixels. This is why we believe the SPEAM features are favorable for spatial video steganalysis.

Fig. 8

ROC curve of the SPAM features and the SPEAM features on compressed video sequences (MPEG2). The bitrate of 2 Mb/s, and α of 1, 2, 3 are tested. a ROC curve when α = 1, b ROC curve when α = 2, c ROC curve when α = 3

Fig. 9

ROC curve of the SPAM features and the SPEAM features on compressed video sequences (H.264). The bitrate of 2 Mb/s, and α of 1, 2, 3 are tested. a ROC curve when α = 1, b ROC curve when α = 2, c ROC curve when α = 3

4 Conclusions

The work presented in this paper exploits the fact that correlation between adjacent PEF samples exists in typical digital media, while these dependencies degrade under random stego noise. The dependencies between differences of neighboring PEF samples are modeled by a Markov chain, and subsets of the empirical probability transition matrices are taken as the feature vector for steganalysis, called the SPEAM features.
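The feature construction summarized here can be illustrated with a simplified, single-direction sketch under our own assumptions: first-order horizontal differences of a PEF are truncated to [-T, T], and the empirical transition matrix of the resulting Markov chain is flattened into a feature vector. The actual SPEAM features aggregate further directions and difference orders, as SPAM does.

```python
import numpy as np

def speam_like_features(pef, T=3):
    """Simplified sketch of SPEAM-style features: horizontal first-order
    differences of a prediction-error frame are truncated to [-T, T] and
    modelled as a Markov chain; the empirical transition matrix
    P(d2 | d1) is flattened into the feature vector."""
    d = np.diff(pef.astype(int), axis=1)          # differences of neighbors
    d = np.clip(d, -T, T)                         # truncate to [-T, T]
    d1 = d[:, :-1].ravel() + T                    # shift to index range [0, 2T]
    d2 = d[:, 1:].ravel() + T
    m = np.zeros((2 * T + 1, 2 * T + 1))
    np.add.at(m, (d1, d2), 1)                     # joint counts of (d1, d2) pairs
    row = m.sum(axis=1, keepdims=True)
    m = np.divide(m, row, out=np.zeros_like(m), where=row > 0)  # row-normalize
    return m.ravel()                              # (2T+1)^2 features
```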

The main advantage of SPEAM is that for compressed video sequences, which form the majority of Internet videos, SPEAM performs better than the other methods tested. Furthermore, the calculation of the features is of low complexity and suitable for real-time applications. For uncompressed video sequences, SPEAM performs similarly to SPAM, one of the most effective image steganalytic methods, and is superior to the previous work by Budhia.

In the future, we would like to investigate more advanced ways to merge the exploitation of temporal and spatial redundancy, aiming at better performance, especially for compressed videos whose content moves fast along irregular trajectories or has high texture complexity. In addition, the effectiveness of the steganalytic features on videos of various codecs should be further tested. Besides, steganography exploiting information obtained in the compression scheme (such as motion vectors [1]) has been studied, and several steganalytic methods have been proposed [4, 25]. We also plan to study the dependencies between intra-frame MVs and the correlation within inter-frame MVs, and to derive favorable features for their steganalysis.