1 Introduction

Steganography is a technique for covert communication that embeds secret information into a cover medium such as a digital image. A cover image becomes a stego-image once secret information has been embedded into it. As the counterpart of steganography, steganalysis [11, 14] aims to detect the existence of stego-images. There are two major kinds of steganalysis: specific steganalysis, which detects a specific steganographic method, and blind (universal) steganalysis, which detects the existence of hidden messages without knowing the details of the steganography algorithm. Finding a suitable way to analyze the features of stego-images is critical for an efficient blind steganalysis method. The Probability Density Function (PDF) moment and the Characteristic Function (CF) moment are two typical statistical features frequently used in blind steganalysis algorithms.

Steganalysis can be categorized as either active or passive. Estimating the length of the embedded message in a stego-image is important for active steganalysis. Active steganalysis methods such as regular singular (RS) and sample pairs analysis (SPA) are powerful in length estimation, but they become invalid in the frequency domain. Passive steganalysis methods, such as Lyu and Farid’s [6] scheme, can discriminate stego-images from suspicious images in both the spatial and frequency domains, but they cannot estimate the length of the hidden messages.

Lou et al. [5] proposed an active steganalysis algorithm which analyzes the characteristics of histogram changes during the data embedding procedure to discriminate cover images from stego-images. Their research [5] found that the peak of the original histogram disappears and becomes concave after data embedding; this phenomenon is called the “pair effect”. Ding and Ping [9] proposed a steganalysis method based on analyzing the pulse positions of the histograms of cover and stego speech. They found that although the influence of steganographic embedding differs for different cover signals, the pulse positions of the histogram consistently become smoother after embedding. Steganographic methods randomize the pulse position distribution, so the pulse positions of the histogram of a stego signal are smoother than those of the cover signal. Xuan et al. [15] proposed a novel steganalysis scheme which uses the information of the co-occurrence matrix to capture the changes before and after data embedding. The energy differences between the gray-level co-occurrence matrices of the original cover image and the stego-image are expected to capture the changes caused by data embedding.

Dimension reduction is a hotspot in machine learning and data mining. Traditional statistical approaches have difficulty directly modeling data in high dimensional spaces, and dimension reduction techniques play an important role in alleviating this difficulty. Dimension reduction techniques can be categorized as linear or nonlinear. Linear methods are limited to discovering the structure of data lying on or near a linear subspace of the high dimensional input space. The most widely used linear methods are the classic Principal Component Analysis (PCA) [8] and Linear Discriminant Analysis (LDA) [1, 8], which have been applied to a wide range of signal processing problems such as feature transformation and signal analysis. A low dimensional submanifold, however, may have a highly nonlinear structure that linear methods fail to handle. Recently, a number of manifold learning (also referred to as nonlinear dimensionality reduction) algorithms, such as ISOMAP, LLE and Laplacian Eigenmaps, have been proposed to overcome the limitations of linear methods. These methods have been successfully applied to a number of benchmark manifold problems and have also proved useful in several pattern recognition applications. In the past few years, the kernel trick has also been widely applied to extend linear dimension reduction algorithms to nonlinear ones through a kernel mapping function.

LDA seeks to reduce dimensionality while preserving as much of the class discriminatory information as possible. Its limitation is that if the distributions are significantly non-Gaussian, the LDA projections may not preserve the complex structure in the data needed for classification. Recently, the Gaussian scale mixture (GSM) [7] has been proposed to model natural images in the wavelet domain. Laplacian Eigenmaps find a low-dimensional data representation by preserving local properties of the manifold, where the local properties are based on the pairwise distances among near neighbors. Laplacian Eigenmaps compute a low dimensional representation of the data in which the distances between a data point and its \(\mathnormal{k}\) nearest neighbors are minimized; they largely preserve the distances among nearest neighbors while maximizing the distances among points that are not nearest neighbors. Despite the success of LDA in many real world applications, it still has drawbacks: in most cases it cannot keep the intrinsic geometric properties of the data, and it has limited efficiency in classifying sample data. In order to exploit the favorable attributes of both LDA and Laplacian Eigenmaps while avoiding their unfavorable ones, we solve the steganalysis problem in the graph embedding framework. The graph embedding framework [17] can be used to develop new dimension reduction algorithms that overcome the limitations of LDA. Another important aspect of dimension reduction is that, if the high dimension is reduced properly, it shows great value in pattern recognition and data visualization.

The contributions of this paper are as follows: (1) We propose a passive steganalysis scheme that analyzes the characteristics of the distribution changes in the dimension-reduced space to discriminate cover images from stego-images. (2) We develop a new dimension reduction algorithm, the Improved Kernel Linear Discriminant Analysis (IKLDA), to extract the hidden information of stego-images, and find that stego-images are clustered in a plane while cover images are scattered more evenly and form no other clusters. The rest of the paper is organized as follows: Section 2 introduces the IKLDA algorithm; Section 3 presents our experimental results; and Section 4 concludes the paper.

2 Algorithm

2.1 LDA algorithm and its improvement

LDA is a supervised method. It searches for the projection axes on which the data points of different classes are as far from each other as possible while the data points of the same class are as close to each other as possible. Suppose we have a set of samples \(x_1,x_2,\ldots,x_l \in R^d\) belonging to C classes, where \(\mathnormal{d}\) is the original dimension. For this supervised learning problem, let z∈(1,2,...,C) be a class label, where C is the number of classes as mentioned above. Let X be the matrix representation of the whole sample set, i.e., each sample is a column of X. Let \(Y \in R^r\) (1≤r≤\(\mathnormal{d}\)) be the projection of X, where \(\mathnormal{r}\) is the dimension of the lower dimensional space. We first consider linear dimension reduction methods. We define a \(\mathnormal{d} \times \mathnormal{r}\) transformation matrix α such that the low dimensional data representation Y is given by \(Y={\alpha}^TX\).

The goal of LDA is to find a transformation matrix α that characterizes the intra-class compactness and the inter-class separability, i.e., to find \({\alpha}_{\rm lda}\) such that

$$ \displaystyle{\alpha}_{\rm lda}=\mathop{\arg\max}_{\alpha}\frac{{\alpha}^T{S_b}{\alpha}}{{\alpha}^T{S_w}{\alpha}} $$
(1)
$$ \displaystyle S_b=\sum\limits_{k=1}^c{l_k}(u^k-u){(u^k-u)}^T $$
(2)
$$ \displaystyle S_w=\sum\limits_{k=1}^c\left(\sum\limits_{i=1}^{l_k}((x_i^k-u^k){(x_i^k-u^k)}^T{\vphantom{\sum\limits_{i=1}^{l_k}}})\right) $$
(3)
$$ \displaystyle S_t=\sum\limits_{i=1}^l(x_i-u){(x_i-u)}^T={S_b}+{S_w} $$
(4)

where u is the total sample mean vector, \(l_k\) is the number of samples in the \(\mathnormal{k}\)-th class, \(u^k\) is the mean vector of the \(\mathnormal{k}\)-th class and \(x_i^k\) is the \(\mathnormal{i}\)-th sample in the \(\mathnormal{k}\)-th class. \(S_w\) is called the intra-class scatter, a metric measuring intra-class distances, and \(S_b\) is called the inter-class scatter, a metric measuring inter-class distances. The purpose of LDA is thus to find a transformation matrix \({\alpha}_{\rm lda}\) that maximizes the linear separability of data points belonging to different classes while minimizing the linear compactness of data points within the same class.

The maximization problem of obtaining \({\alpha}_{\rm lda}\) in (1) can be converted into the following generalized eigenproblem:

$$ S_b\phi=\lambda{S_w}\phi $$

Let \(\{{\phi}_k\}_{k=1}^d\) be the eigenvectors of this generalized eigenproblem corresponding to the eigenvalues \(\lambda_k\), where \(\lambda_1 \geq \lambda_2 \geq \ldots \geq \lambda_d\). Then the eigenvectors \([\phi_1,\phi_2,\ldots,\phi_d]\) form the columns of the linear transformation matrix \({\alpha}_{\rm lda}\), and Y can be computed by mapping X onto the linear basis matrix \({\alpha}_{\rm lda}\). That is,

$$ Y={\alpha}_{\rm lda}^\mathrm{T}X $$
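To make the procedure concrete, the following is a minimal numpy/scipy sketch of classical LDA as described above: it builds \(S_b\) and \(S_w\) from (2)–(3), solves the generalized eigenproblem \(S_b\phi=\lambda S_w\phi\), and projects with \(Y={\alpha}_{\rm lda}^TX\). The function name, the small ridge added to \(S_w\) for numerical stability, and the parameter choices are ours, not part of the original method.

```python
import numpy as np
from scipy.linalg import eigh

def lda_fit(X, labels, r):
    """X: d x l data matrix (samples as columns); labels: length-l class labels.
    Returns the d x r transformation matrix alpha_lda of Eq. (1)."""
    d, l = X.shape
    u = X.mean(axis=1, keepdims=True)                  # total mean vector u
    Sb = np.zeros((d, d))
    Sw = np.zeros((d, d))
    for c in np.unique(labels):
        Xc = X[:, labels == c]                         # samples of the k-th class
        lc = Xc.shape[1]                               # l_k
        uc = Xc.mean(axis=1, keepdims=True)            # class mean u^k
        Sb += lc * (uc - u) @ (uc - u).T               # Eq. (2)
        Sw += (Xc - uc) @ (Xc - uc).T                  # Eq. (3)
    # generalized eigenproblem S_b * phi = lambda * S_w * phi (eigh returns ascending order);
    # a tiny ridge keeps S_w positive definite (our numerical safeguard)
    evals, evecs = eigh(Sb, Sw + 1e-8 * np.eye(d))
    alpha_lda = evecs[:, np.argsort(evals)[::-1][:r]]  # leading r eigenvectors
    return alpha_lda

# Y = alpha_lda.T @ X then gives the r-dimensional representation of the samples.
```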

In order to improve LDA within the graph embedding framework, the intra-class scatter \(S_w\) can be transformed into a pairwise expression. As \(u^k\) is the mean vector of the \(\mathnormal{k}\)-th class, i.e., \(u^k=\frac{1}{l_k}\sum_{j=1}^{l_k}x_j^k\) and \((u^k)^T=\frac{1}{l_k}\sum_{i=1}^{l_k}(x_i^k)^T\), from (3) we obtain:

$$ \begin{array}{rll} S_w & = & \sum\limits_{k=1}^c\left(\sum\limits_{i=1}^{l_k}\left((x_i^k-u^k){(x_i^k-u^k)}^T\right)\right) = \sum\limits_{k=1}^c\sum\limits_{i=1}^{l_k}x_i^k{(x_i^k)}^T-\sum\limits_{k=1}^c\frac{1}{l_k}\sum\limits_{i,j=1}^{l_k}x_i^k{(x_j^k)}^T \\ & = & \sum\limits_{k=1}^c\frac{1}{l_k}\sum\limits_{i,j=1}^{l_k}x_i^k{(x_i^k)}^T-\sum\limits_{k=1}^c\frac{1}{l_k}\sum\limits_{i,j=1}^{l_k}x_i^k{(x_j^k)}^T \\ & = & \frac{1}{2\times l_k}\sum\limits_{i,j=1}^{l}\left(x_i(x_i)^T+x_j(x_j)^T-x_i(x_j)^T-x_j(x_i)^T\right) \\ & = & \frac{1}{2\times l_k}\sum\limits_{i,j=1}^{l}(x_i-x_j)(x_i-x_j)^T \end{array} $$
(5)

Similarly, the inter-class scatter \(S_b\) has a pairwise expression. As u is the total sample mean vector, i.e., \(u=\frac{1}{l}\sum_{j=1}^lx_j\) and \(u^T=\frac{1}{l}\sum_{i=1}^l(x_i)^T\), from (2) we have:

$$ \begin{array}{rll} S_b & = & S_t-S_w \\ & = & \sum\limits_{i=1}^l(x_i-u)(x_i-u)^T-S_w \\ & = & \sum\limits_{i=1}^lx_i{x_i}^T-\frac{1}{l}\sum\limits_{i,j=1}^lx_i{x_j}^T-S_w \\ & = & \frac{1}{l}\sum\limits_{i,j=1}^lx_i{x_i}^T-\frac{1}{l}\sum\limits_{i,j=1}^lx_i{x_j}^T-S_w \\ & = & \frac{1}{2\times l}\sum\limits_{i,j=1}^{l}(x_i-x_j)(x_i-x_j)^T-\frac{1}{2\times l_k}\sum\limits_{i,j=1}^{l}(x_i-x_j)(x_i-x_j)^T \end{array} $$
(6)
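As a quick sanity check of the pairwise form in (5), the following sketch verifies numerically, for a single class with synthetic data, that the within-class scatter of (3) equals \(\frac{1}{2 l_k}\sum_{i,j}(x_i-x_j)(x_i-x_j)^T\); the data and variable names are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)
Xk = rng.normal(size=(5, 12))            # one class: d = 5, l_k = 12 samples as columns
lk = Xk.shape[1]
uk = Xk.mean(axis=1, keepdims=True)      # class mean u^k

Sw_direct = (Xk - uk) @ (Xk - uk).T      # definition of Eq. (3) for a single class

# pairwise form of Eq. (5): (1 / (2 l_k)) * sum_{i,j} (x_i - x_j)(x_i - x_j)^T
diff = Xk[:, :, None] - Xk[:, None, :]   # d x l_k x l_k tensor of pairwise differences
Sw_pairwise = np.einsum('aij,bij->ab', diff, diff) / (2 * lk)

print(np.allclose(Sw_direct, Sw_pairwise))   # True
```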

In the derivation of (5) and (6), an implicit condition that \(x_i\) and \(x_j\) belong to the same class has been added, because neighboring points most likely share the same class label. In the graph embedding framework, we combine the advantage of the Laplacian method, that nearby points remain nearby and far-apart points remain far apart after dimension reduction, with the advantage of LDA, that points of the same class are concentrated and points of different classes are repelled. To this end, we employ the matrix representation of the graph G which takes the samples \(x_i\) (i = 1,...,l) as its vertices. We denote by W=(\(W_{i,j}\)) the adjacency matrix of G, i.e., \(W_{i,j}\)=1 if \(x_i\) and \(x_j\) are neighbors; otherwise, \(W_{i,j}\)=0. That is,

$$ W_{i,j}= \begin{cases} 1 & if x_i\in N_k(x_j) \ or \ x_j\in N_k(x_i) \\ 0 & else \end{cases} $$
(7)

where \(N_k(x_i)\) denotes the set of \(\mathnormal{k}\) nearest neighbors of \(x_i\). Equation (7) can be upgraded to Gaussian-weighted Euclidean distances, with \(W_{i,j}\) defined by:

$$ W_{i,j}= \begin{cases} \exp \left(-\displaystyle\frac{{\lVert x_i-x_j \rVert}^2}{\delta_i\delta_j}\right) & if x_i\in N_k(x_j) \ or \ x_j\in N_k(x_i) \\ 0 & else \end{cases} $$
(8)

where \(\delta_i=\lVert x_i-x_i(k) \rVert\), in which \(x_i(k)\) is the \(\mathnormal{k}\)-th nearest neighbor of \(x_i\). The diagonal matrix D and the Laplacian matrix L of the graph G are defined as:

$$ L=D-W;\quad D_{ii}=\sum_{j \neq i} W_{i,j} $$
(9)
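A small sketch of constructing the neighborhood graph of (7)–(9), i.e., the Gaussian-weighted adjacency matrix W of (8) with the local scales \(\delta_i\), the degree matrix D and the Laplacian L; the function and variable names are ours, and the neighborhood size k is a free parameter.

```python
import numpy as np

def graph_laplacian(X, k=7):
    """X: d x l data matrix (samples as columns). Returns W, D, L of Eqs. (8)-(9)."""
    l = X.shape[1]
    dist = np.linalg.norm(X[:, :, None] - X[:, None, :], axis=0)  # l x l pairwise distances
    order = np.argsort(dist, axis=1)
    knn = order[:, 1:k + 1]                       # indices of the k nearest neighbors (skip self)
    delta = dist[np.arange(l), knn[:, -1]]        # delta_i = ||x_i - x_i(k)||
    W = np.zeros((l, l))
    for i in range(l):
        for j in knn[i]:
            w = np.exp(-dist[i, j] ** 2 / (delta[i] * delta[j]))  # Gaussian weight of Eq. (8)
            W[i, j] = W[j, i] = w                 # symmetric: x_i in N_k(x_j) or x_j in N_k(x_i)
    D = np.diag(W.sum(axis=1))                    # D_ii = sum_{j != i} W_ij
    L = D - W                                     # Eq. (9)
    return W, D, L
```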

The idea of the graph embedding framework is to find an appropriate low dimensional representation that keeps the neighborhood properties of the vertices of the graph G. Let Y=[y\(_1\),y\(_2\),...,y\(_l\)]\(^T\) be the low dimensional representation to be found, where \(y_i\) is the low dimensional representation of vertex \(x_i\). The data points in the embedding space should have the same geometric properties as the original data. For example, the \(\mathnormal{k}\) nearest points of \(y_i\) in the embedding space should correspond to the \(\mathnormal{k}\) nearest points of \(x_i\). As a consequence, nearby points in the high dimensional space are also projected to nearby points in the low dimensional representation. In fact, a rough low dimensional representation can be obtained by:

$$ Y=\mathop{\arg\min}_{Y^TBY=d}\sum\limits_{i \neq j}{\lVert y_i-y_j \rVert}^2W_{i,j}=\mathop{\arg\min}_{Y^TBY=d}Y^TLY $$
(10)

where \(\mathnormal{d}\) is a constant and B is a constraint matrix that avoids multiple solutions. To obtain a better result, we need to find a more suitable projection matrix, denoted by w, which satisfies Y = x\(^T\)w. w can be obtained by optimizing the following equation:

$$ \begin{array}{rll} w &=& \mathop{\arg\min}_{w^TxBx^Tw=d}\sum\limits_{i \neq j}{\lVert w^Tx_i-w^Tx_j \rVert}^2W_{i,j}\\ &=& \mathop{\arg\min}_{w^TxBx^Tw=d}w^TxLx^Tw \end{array} $$
(11)

where x is the matrix whose columns are the samples \(x_i\) (i = 1,...,l), as mentioned above.
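A sketch of solving (11) as the generalized eigenproblem \(xLx^Tw=\lambda\,xBx^Tw\) and keeping the eigenvectors with the smallest eigenvalues follows. Here we take B = D as the constraint matrix, which is one common choice but an assumption on our part, since the paper only describes B as a constraint matrix avoiding multiple solutions.

```python
import numpy as np
from scipy.linalg import eigh

def graph_projection(X, L, D, r, eps=1e-6):
    """Minimize Eq. (11): generalized eigenvectors of (x L x^T) w = lambda (x B x^T) w.
    We assume B = D; eps adds a small ridge so the right-hand matrix is positive definite."""
    A = X @ L @ X.T
    B = X @ D @ X.T + eps * np.eye(X.shape[0])
    evals, evecs = eigh(A, B)          # eigenvalues in ascending order
    w = evecs[:, :r]                   # the r smallest eigenvalues minimize (11)
    Y = X.T @ w                        # low dimensional representation Y = x^T w
    return w, Y
```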

2.2 Kernel LDA algorithm

The algorithms mentioned above are linear methods, which are usually not suitable for classification problems with nonlinearly distributed data. Therefore, it is necessary to introduce the kernel trick to handle data with nonlinear distributions. In machine learning, kernel functions have been introduced to find a close-to-optimal projection for different sample distributions. The kernel matrix implicitly maps the data into a nonlinear feature space, and the choice of the kernel is crucial for incorporating a priori knowledge in applications. Each kernel can be expressed as \(\mathnormal{k}\)(x,y) = < ϕ(x),ϕ(y) >, where < ϕ(x),ϕ(y) > is the scalar product and ϕ(x) is a dimension-raising mapping. We call the resulting higher dimensional space the Reproducing Kernel Hilbert Space (RKHS). Examples of kernels are as follows:

$$ \begin{cases} Gaussian:k(x,y)=exp \left(-\frac{{\lVert x-y \rVert}^2}{2 \delta^2}\right) \\ Polynomial\ with\ degree\ d:k(x,y)=(c+\langle x,y \rangle)^d \\ Sigmoid:k(x,y)=\tanh (\langle x,y \rangle+\alpha) \end{cases} $$
(12)

In this paper, we choose the Gaussian kernel \(\mathnormal{k}(x,x^{'})=\exp \left(-\frac{{\lVert x-x^{'} \rVert}^2}{2 \delta^2}\right)\), δ > 0, as the kernel function and denote \(k_{i,j} = k(x_i,x_j) = \langle \phi(x_i), \phi(x_j) \rangle\).
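A minimal sketch of the Gaussian kernel Gram matrix K with entries \(k_{i,j}=k(x_i,x_j)\) is given below; the median-distance heuristic used to pick δ when none is supplied is our assumption, not part of the paper.

```python
import numpy as np

def gaussian_gram(X, delta=None):
    """X: d x l matrix of samples (columns). Returns the l x l Gram matrix of the Gaussian kernel."""
    sq = np.sum((X[:, :, None] - X[:, None, :]) ** 2, axis=0)  # squared pairwise distances
    if delta is None:
        delta = np.sqrt(np.median(sq[sq > 0]))   # median heuristic for the bandwidth (assumption)
    return np.exp(-sq / (2 * delta ** 2))
```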

To obtain an even better projection matrix, we consider the following optimization problem:

$$ \begin{array}{rll} \beta &=& \mathop{\arg\min}_{\beta^TKBK\beta=d}\sum\limits_{i \neq j}{\lVert \beta^Tk_i-\beta^Tk_j \rVert}^2W_{i,j}\\ &=& \mathop{\arg\min}_{\beta^TKBK\beta=d}\beta^TKLK\beta \end{array} $$
(13)

where \(k_i\) denotes the \(\mathnormal{i}\)-th column vector of the kernel Gram matrix K and \(W_{i,j}\) is defined in (8). The solution of (13) can be converted into the following generalized eigenvalue problem:

$$\label{eq14} KBK \beta=\lambda KLK \beta $$
(14)

where L is defined in (9). The weight coefficient \(S^{\rm weight}_{w}\) of the intra-class scatter matrix follows from the pairwise expression of \(S_w\) in (5):

$$ S^{\rm weight}_{w}=\frac{1}{l_k} $$
(15)

Similarly, from (6) we obtain the weight coefficient of \(S_b\):

$$ S^{\rm weight}_{b}=\frac{1}{l}-\frac{1}{l_k} $$
(16)

We cannot directly solve the generalized eigenproblem of (14), since it is ill-posed, as mentioned in [12]. Instead, we solve the following regularized problem:

$$ KBK \beta=\lambda (KLK+\varepsilon I)\beta $$
(17)

where ε is a small positive constant. As mentioned above, the core idea of our IKLDA is to combine the advantages of LDA and the Laplacian method: the advantage that nearby points remain nearby and far-apart points remain far apart after dimension reduction, and the advantage that points of the same class are concentrated while points of different classes are repelled. The former means that the intrinsic geometric properties of the data are well preserved during dimension reduction. In addition, the weights \(W_{i,j}\) defined in (8) imply that the influence of the data on \(S^{\rm weight}_{w}\) and \(S^{\rm weight}_{b}\) decreases as the distances among the data points increase.

Many improvements of LDA are based on redefining the weighting factors of \(S_w\) and \(S_b\). For example, Loog et al. [4] proposed a criterion called approximate pairwise accuracy (aPAC); Sugiyama [12] proposed the LFDA algorithm by setting \(S^{'}_w=\frac{1}{l_k}S_w\), \(S^{'}_b=\frac{1}{l}S_b+(1-\frac{1}{l_k})e_le_l^T+\frac{1}{l}e_{lk}e_{lk}^T\); and Yan et al. [16] proposed a clustering algorithm, named ICBKM, by setting \(S^{'}_w=I-\sum_{k=1}^c\frac{1}{l_k}e_{lk}e_{lk}^T\), \(S^{'}_b=\sum_{k=1}^c\frac{1}{l_k}e_{lk}e_{lk}^T-\frac{1}{l}e_le_l^T\).

According to (15)–(17), we redefine the weight factors \(S_w\) and \(S_b\) as follows:

$$ \begin{cases} S^{'}_w=\alpha\sqrt{\frac{1}{l_k}}\,KLK & where\ 0<\alpha<1,\ D_{ii}=\sum_{j \neq i} W_{i,j},\ L=D-W \\ S^{'}_b=(1-\alpha)\left(\sqrt{\frac{1}{l}-\frac{1}{l_k}}\right)KLK & \end{cases} $$
(18)

Equation (17) can be expressed as follows:

$$ S^{'}_b \beta=\lambda (S^{'}_w+\varepsilon I)\beta $$
(19)

From (8), (9), (12), (18) and (19) above, we obtain the following optimization equation of our IKLDA method:

$$ (1-\alpha)\left(\sqrt{\frac{1}{l}-\frac{1}{l_k}}\right)KLK \beta=\lambda \left(\alpha\sqrt{\frac{1}{l_k}}KLK+\varepsilon I\right)\beta $$
(20)

where K is the Gaussian kernel Gram matrix with entries \(K_{i,j}=\exp \left(-\frac{{\lVert x_i-x_j \rVert}^2}{2 \delta^2}\right)\) and L is defined in (9).

Let \(\{{\beta}_i\}_{i=1}^r\) be the eigenvectors of (20) corresponding to the leading eigenvalues \(\lambda_1 \geq \lambda_2 \geq \ldots \geq \lambda_r\). The reduced-dimension representation Y of the data points in X is then given by Y = β\(^T\)K.
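Putting the pieces together, the following sketch solves the regularized eigenproblem (20) with a symmetric generalized eigensolver and projects the data with Y = β\(^T\)K. It reuses the graph_laplacian and gaussian_gram helpers sketched above; the default parameter values are illustrative, the class size \(l_k\) is assumed to come from two balanced classes, and we take the absolute value under the square root since \(\frac{1}{l}-\frac{1}{l_k}\) is negative whenever \(l_k<l\); that reading is ours, not the paper's.

```python
import numpy as np
from scipy.linalg import eigh

def iklda_fit(X, r=3, k=7, alpha=0.5, eps=1e-4):
    """Solve Eq. (20) and return beta and the r-dimensional embedding Y = beta^T K."""
    l = X.shape[1]
    _, _, L = graph_laplacian(X, k)               # W, D, L from Eqs. (8)-(9)
    K = gaussian_gram(X)                          # Gaussian Gram matrix of Eq. (12)
    l_k = l / 2.0                                 # class size; assumes two balanced classes
    KLK = K @ L @ K
    A = (1 - alpha) * np.sqrt(abs(1.0 / l - 1.0 / l_k)) * KLK  # S'_b side of (20); abs() is our reading
    B = alpha * np.sqrt(1.0 / l_k) * KLK + eps * np.eye(l)     # S'_w side plus the eps*I regularizer
    evals, evecs = eigh(A, B)                     # generalized eigenproblem, ascending eigenvalues
    beta = evecs[:, np.argsort(evals)[::-1][:r]]  # leading r eigenvectors
    Y = beta.T @ K                                # reduced representation
    return beta, Y
```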

For the sake of completeness, we use the correlation dimension to describe the assumed inner structure. The intrinsic dimensionality is the minimum number of parameters necessary to account for all the information in the data. It is hard to say how many dimensions are appropriate to describe a data set in real world applications. With too few dimensions, the features may fall short of describing the variety of images and therefore lead to misclassification. However, the more dimensions the feature vector has, the higher the probability of redundancy and dependency among its components, and such redundancy or dependency also tends to induce misclassification. Techniques for intrinsic dimensionality estimation can be divided into two groups: those based on local properties and those based on global properties of the data. Local intrinsic dimensionality estimators are based on the following principle: the number of data points covered by a hypersphere of radius \(\mathnormal{r}\) around a data point grows proportionally to \(r^d\), where \(\mathnormal{d}\) is the intrinsic dimension. Therefore \(\mathnormal{d}\) can be estimated by measuring the number of data points covered by a hypersphere with a growing radius \(\mathnormal{r}\).

The correlation dimension estimator [2] is, roughly, a local estimator; the relative number of point pairs within a hypersphere of radius \(\mathnormal{r}\) is:

$$ C(r)=\frac{2}{n(n-1)}\sum\limits_{i=1}^n\sum\limits_{j=i+1}^n c,\ where\ c=\begin{cases} 1 & if \lVert x_i-x_j \rVert \leqslant r\\ 0 & if \lVert x_i-x_j \rVert > r \end{cases} $$
(21)

We use C(\(\mathnormal{r}\)) to estimate the dimension \(\mathnormal{d}\):

$$ d=\textstyle \lim_{r\to0} \frac{\log C(r)}{\log r} $$
(22)

The limit in (22) is hard to evaluate in practice, so we use the following estimate instead:

$$ \bar d= \frac{\log C(r_2)-\log C(r_1)}{\log r_2-\log r_1} $$
(23)

Using (23), we estimate the intrinsic dimension of the steganalysis data set to be 2.91741.
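A sketch of the two-radius correlation-dimension estimate of (21)–(23) follows; the radii r1 < r2 must be chosen by the user, and the values in the usage comment are placeholders.

```python
import numpy as np

def correlation_sum(X, r):
    """C(r) of Eq. (21): relative number of sample pairs closer than r (X: d x n, samples as columns)."""
    n = X.shape[1]
    dist = np.linalg.norm(X[:, :, None] - X[:, None, :], axis=0)
    iu = np.triu_indices(n, k=1)                  # pairs with i < j
    return 2.0 * np.count_nonzero(dist[iu] <= r) / (n * (n - 1))

def correlation_dimension(X, r1, r2):
    """Two-radius estimate of Eq. (23)."""
    c1, c2 = correlation_sum(X, r1), correlation_sum(X, r2)
    return (np.log(c2) - np.log(c1)) / (np.log(r2) - np.log(r1))

# example usage (placeholder radii): d_hat = correlation_dimension(features, r1=0.5, r2=1.0)
```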

3 Experimental results

The purpose of this section is to verify the efficiency of our IKLDA method in steganalysis by observing the difference between the distributions of stego-images and cover images in the dimension-reduced data. The experimental results also show an increased detection rate.

To this end, we implement two groups of experiments. The first group involves five data hiding methods and the second involves one data hiding method. We take 1096 sample images from different picture sets (Nature, Ocean, Food, Animals, Architecture, Places, Leisure, Misc) of the CorelDRAW Version 10.0 software CD #3 to complete the first group of experiments. The following five typical data hiding methods [11] are used in the first group of experiments: Cox et al.'s non-blind SS, Piva et al.'s blind SS, Huang and Shi's 8 × 8 block SS, a generic QIM (0.1 bpp (bit per pixel)), and a generic LSB (0.3 bpp, where both the pixel positions used for embedding and the to-be-embedded bits are randomly selected). For each sample cover image, five stego-images are generated with these five data hiding methods, respectively. For all the data hiding methods, different random signals are embedded into different images, which makes the evaluation of the proposed steganalysis system more general.

Shi et al. [11] use the statistical moments of the characteristic functions of wavelet subbands as features for steganalysis. The characteristic function (CF) and the PDF (here, the histogram) form a Fourier transform pair. We denote the histogram by h(\(x_j\)) and the characteristic function by H(\(f_k\)). The \(\mathnormal{n}\)-th statistical moment \(M_n\) of a characteristic function is defined as follows:

$$ M_n=\frac{\sum_{k=1}^{N/2}f_k^n|H(f_k)|}{\sum_{k=1}^{N/2}|H(f_k)|} $$

where \(|H(f_k)|\) is the magnitude of the CF at frequency \(f_k\).
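A sketch of computing the \(\mathnormal{n}\)-th CF moment \(M_n\) of a subband: a 256-bin histogram is taken as the PDF and its DFT as the discrete characteristic function, with \(f_k=k/N\). This discretization is our reading of the definition in [11], not a verbatim reimplementation.

```python
import numpy as np

def cf_moment(subband, n, bins=256):
    """n-th statistical moment of the characteristic function of a subband histogram."""
    h, _ = np.histogram(np.asarray(subband).ravel(), bins=bins)
    H = np.abs(np.fft.fft(h))                     # |H(f_k)|: magnitude of the discrete CF
    k = np.arange(1, bins // 2 + 1)               # frequency indices k = 1 ... N/2
    f = k / bins                                  # f_k = k / N (our convention)
    mag = H[1:bins // 2 + 1]
    return np.sum((f ** n) * mag) / np.sum(mag)   # the moment M_n defined above
```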

In order to handle the noise introduced by data hiding, Shi et al. [11] proposed to predict each pixel grayscale value of the original cover image from its neighboring pixels' grayscale values, and to obtain the prediction-error image by subtracting the predicted image from the test image. If the hidden data are unrelated to the cover media, the prediction-error image can remove information other than that caused by data hiding, which makes the steganalysis more efficient. Fortunately, the hidden data are usually unrelated to the cover media.

In the first group of experiments, a three-level Haar wavelet decomposition is applied to each test image. Each level has four subbands, giving 12 subbands in total; if the original image is regarded as the level-0 LL subband, we have 13 subbands in all. For each subband, the first three moments of the characteristic function are derived, resulting in a set of 39 features. Similarly, another set of 39 features is generated from the prediction-error image. Thus, a 78-dimensional feature vector is produced for each test image. In fact, the experiments show that using more than a three-level wavelet decomposition or employing more than the first three moments does not further improve the steganalysis performance and only leads to higher computational complexity. Hence 78-dimensional feature vectors are used in the proposed steganalysis system.
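The following sketch assembles the 39 CF-moment features of a test image (the level-0 LL subband plus the approximation and three detail subbands of three Haar decomposition levels, first three moments each) using PyWavelets and the cf_moment helper above; applying it to the prediction-error image as well yields the full 78-dimensional vector. The simple four-neighbor average predictor is only a placeholder, not the predictor of [11].

```python
import numpy as np
import pywt

def cf_features_39(image):
    """First three CF moments of the 13 subbands of a 3-level Haar decomposition."""
    ll = np.asarray(image, dtype=float)
    subbands = [ll]                               # level-0 LL subband is the image itself
    for _ in range(3):
        ll, details = pywt.dwt2(ll, 'haar')       # approximation plus three detail subbands
        subbands += [ll] + list(details)
    return np.array([cf_moment(sb, n) for sb in subbands for n in (1, 2, 3)])

def prediction_error(image):
    """Placeholder predictor: each pixel predicted by the mean of its 4 neighbors."""
    img = np.asarray(image, dtype=float)
    pad = np.pad(img, 1, mode='edge')
    pred = (pad[:-2, 1:-1] + pad[2:, 1:-1] + pad[1:-1, :-2] + pad[1:-1, 2:]) / 4.0
    return img - pred

# full 78-dimensional feature vector for a test image `img`:
# features = np.concatenate([cf_features_39(img), cf_features_39(prediction_error(img))])
```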

We use the Improved Kernel Linear Discriminant Analysis algorithm to project the sample image data onto \(R^3\), which is sufficient for image steganalysis. Figures 1, 2, 3, 4 and 5 show the spatial distributions obtained from IKLDA, capturing the statistical differences between cover images and stego-images. All the results show that the projected stego-images are clustered in a plane, while the projections of the cover images are scattered more evenly and form no other clusters. Based on this distribution analysis, we obtain a 100 % detection rate after further projecting the data in \(R^3\) into a one-dimensional space.

Fig. 1 Cox et al.'s spread spectrum

Fig. 2 Piva et al.'s spread spectrum

Fig. 3 Huang and Shi's spread spectrum

Fig. 4 Generic QIM (0.1 bpp)

Fig. 5 Generic LSB (0.3 bpp)

Applying one level of wavelet transformation decomposes an image into four subimages, which are called the horizontal, vertical, diagonal and lowpass subbands and denoted by H-subband, V-subband, D-subband and L-subband, respectively.

In the second group of experiments, we also test IKLDA on another data set. We choose 1813 gray-scale JPEG images of 256 × 256 pixels, into which data are embedded using steghide [10] (a steganography tool that is able to hide data in various kinds of images). We first fuse the feature spaces of two different methods to produce a total of 150 features for each image. The first 72 features are from the method of Lyu and Farid [6] and the remaining 78 features are from the method of Shi et al. [11]. The first 72 features include the mean, variance, skewness and kurtosis of the subbands H1, V1, D1, H2, V2, D2, H3, V3, D3, EH1, EV1, ED1, EH2, EV2, ED2, EH3, EV3, ED3, where H1, V1, D1, H2, V2, D2, H3, V3, D3 are the H-, V- and D-subbands obtained by applying the wavelet transformation three times and EH1, EV1, ED1, EH2, EV2, ED2, EH3, EV3, ED3 are the corresponding subbands of the prediction-error image. We arrange these 72 features as follows: 1:meanV (the mean of V1), 2:meanH, 3:meanD, 4:varV, 5:varH, 6:varD, 7:kurV, 8:kurH, 9:kurD, 10:skeV, 11:skeH, 12:skeD, 13:meanEV, 14:meanEH, 15:meanED, 16:varEV, 17:varEH, 18:varED, 19:kurEV, 20:kurEH, 21:kurED, 22:skeEV, 23:skeEH, 24:skeED. Similarly, the corresponding values of the second and third levels occupy components 25 to 48 and 49 to 72, respectively. The remaining 78 features are as follows: the 1st–3rd features are the mean, variance and kurtosis of the original image; the 4th–6th features are the mean, variance and kurtosis of the prediction-error image; the 7th–42nd features are the means, variances and kurtoses of the subbands LLi, HLi, LHi, HHi, i = 1, 2, 3, obtained by applying the Haar wavelet transformation three times, namely meanLL1, varLL1, kurLL1, meanHL1, varHL1, kurHL1, meanLH1, varLH1, kurLH1, meanHH1, varHH1, kurHH1, meanLL2, varLL2, kurLL2, meanHL2, varHL2, kurHL2, meanLH2, varLH2, kurLH2, meanHH2, varHH2, kurHH2, meanLL3, varLL3, kurLL3, meanHL3, varHL3, kurHL3, meanLH3, varLH3, kurLH3, meanHH3, varHH3, kurHH3; finally, the 43rd–78th features are the corresponding values of the prediction-error image.

We label each original image with +1 and each stego-image with −1. The following experiments show the efficiency of IKLDA for steganalysis after projecting the above 150 features onto \(R^3\), by comparing the difference between the cover images and the stego-images, obtained with steghide, in the reduced dimensional space.

In order to show how IKLDA works, we consider two of its important parameters, ε and α in (20), and their impact on the distributions of the features of cover images and stego-images in the reduced dimension. ε is a small constant. By fixing α and varying ε, we obtain feature clouds for steganalysis: from Fig. 6a–f we can observe that, the smaller the value of ε, the farther apart the samples of different classes and the closer the samples within each class. However, as ε keeps decreasing, a downturn appears; after that, a better result shows up again.

Fig. 6 The distributions of IKLDA with the changing of parameter ε

Similarly, Fig. 7a–d are obtained by fixing ε and varying α in (20). The feature clouds of Fig. 7a–d show that, as α increases, the samples within one class get closer while the samples within the other class get farther apart; the samples of different classes also get farther apart.

Fig. 7 The distributions of IKLDA with the changing of parameter α

In the second group of experiments, we also compare IKLDA with other dimension reduction methods [13] to show its efficiency. The goal is to find out which method maps the input data into a feature space in which samples from different classes can be clearly separated.

The possible distributions can be classified into three cases: (1) the distances among samples of different classes (inter-class) are maximized while the distances within classes (intra-class) are also maximized; (2) the inter-class distances are maximized while the intra-class distances are minimized; (3) both the inter-class and intra-class distances are minimized. The following experiments show the corresponding distributions. Figure 8a shows the reduced-dimension space obtained from IKLDA (ε = 0.0001), which maximizes the inter-class distances while minimizing the intra-class distances. Figure 8b shows the space obtained from the Laplacian eigenmap, which yields a nicely symmetric geometric structure but fails to maximize the inter-class distances. Figure 8h, d and e show the spaces obtained from Isomap, Diffusion Maps and LTSA, respectively; these methods minimize the intra-class distances but fail to maximize the inter-class distances. Figure 8c, f, g, i, j, k and l show the spaces obtained from Hessian LLE, LDA, PCA, LFDA, MDS, LPP and LPP's kernel extension KLPP, respectively; none of them yields favorable results either. These visualization experiments show that our IKLDA algorithm performs better in our steganalysis scheme than the other traditional dimension reduction methods.

Fig. 8 The distributions of different dimension reduction methods

Most importantly, we should investigate how the reduced dimensional space produced by our Improved Kernel Linear Discriminant Analysis algorithm works for image steganalysis. We compare this phenomenon with the work of Koçal et al. [3], who used chaotic-type features, namely Lyapunov exponents (LE) and the fraction of false neighbors (FNFS), for speech steganalysis, considering that data hiding within a speech signal distorts the chaotic properties of the original speech signal. Although there is no direct evidence of chaotic phenomena in image signals beyond what linear modeling can cover, typical natural images should exhibit self-similar distributions. We call this inner structure a possible fractal structure. Data hiding within image signals distorts this inner structure and the characteristic distribution of these signals, resulting in a change of the distribution characteristics in the reduced dimension. Exploiting the change made by data hiding can lead to the design of a learning-based steganalyzer.

Furthermore, data hiding affects the neighborhood distances between cover and stego-images in the proposed reduced dimension, as shown in the above-mentioned experiments. In speech steganalysis a similar pattern appears: there is a more significant distinction between cover and stego speech signals in the phase-space representation than in the time-series representation. Likewise, for image steganalysis the reduced dimensional features can uncover more useful information if a proper reduction algorithm is used.

To go a step further and explore our steganalysis scheme from another angle, we take speech steganalysis in phase space as an example. The Lyapunov exponent, a quantitative measure of the divergence of nearby trajectories, is calculated for all of the nearest neighbor pairs on different trajectories. A positive exponent means that trajectories which are initially close to each other move apart over time (divergence), while a negative exponent means that the trajectories move closer to each other (convergence); the magnitude of the exponent determines how rapidly the trajectories move. In image steganalysis, our IKLDA maximizes the distances among samples of different classes and minimizes the distances within classes in the proposed reduced dimension space, and the key parameters in (20) act as the ratio controlling how well this process goes, just as the Lyapunov exponent does in speech steganalysis.

As explained above, we propose a new steganalysis scheme which focuses on the alterations and differences in the statistical self-similarity features of stego-images introduced by the embedding algorithms, and which is related to the distribution characteristics of stego-images. We believe that the proposed reduced dimensional space can serve as a characterization and model for image steganalysis and has the same effect as the phase space used to distinguish stego signals from cover signals in speech steganalysis.

4 Conclusions

In this paper, we propose an Improved Kernel Linear Discriminant Analysis algorithm to analyze the difference between the distributions of cover images and stego-images in the reduced dimensional space. We observe that after dimension reduction the stego-images, carrying steganographically embedded hidden information, are clustered in a plane, while the cover images are scattered more evenly in this space and form no other clusters. Based on this observation, we develop a passive steganalysis scheme to discriminate stego-images from innocent images. The experimental results verify the effectiveness of the proposed approach.