1 Introduction

Image steganography is the technique of embedding secret information into a clean (cover) image using a stego key. The image carrying the hidden data is called the stego image. Because the human eye is largely insensitive to slight changes in image parameters, image steganography has gained importance and has also been misused for covert communication, for example by terrorist groups. Common steganographic algorithms include JSteg, JP Hide and Seek, and OutGuess. Apart from these widely available tools, many application-specific steganographic algorithms have been developed by researchers.

While steganography concerns secret communication, steganalysis involves analysing a suspicious image to detect any hidden message. There are two approaches to image steganalysis: embedding-specific and blind steganalysis. Embedding-specific methods know which steganographic algorithm was used to embed, whereas blind steganalysis has no knowledge of the embedding logic. Blind steganalysis is therefore also called a universal method, in which the suspected image is labelled by a two-class (clean or stego) classifier. This classification can be based on statistical methods or on computational intelligence methods [25]. While statistical blind steganalysis uses Markov, wavelet, or DCT features to classify an image, computational intelligence methods are based on neural networks and genetic or evolutionary algorithms [9]. The efficiency of these methods depends on selecting optimal features from the image for classification. Researchers note that acquiring the best feature set has always been a challenge because different steganographic algorithms alter different image parameters [14, 20, 23, 29]. Recent research has presented a method for acquiring a rich set of image features that yields the best classification of images, but it suffers from high computational complexity [12]. This demands optimizing the data sets for the best possible features. The most prominent statistical feature reduction methods converge to local minima and can choose inappropriate features.

To overcome this challenge, this research implements nature-inspired optimization techniques for improved image steganalysis. These are meta-heuristic algorithms that use stochastic operators to obtain globally optimal values [2]. Randomness being the main characteristic of these stochastic algorithms, they search for the global optimum by creating a set of candidate solutions and then iteratively fine-tuning them until a satisfactory termination condition is reached [28]. Recent nature-inspired algorithms include the Firefly Algorithm (FA) [31], the Cuckoo Search (CS) algorithm [32], the Cuckoo Optimization Algorithm (COA) [24], and the Ray Optimization (RO) algorithm [15,16,17]. This work implements a modified Ant Lion Optimization for the steganalysis of JPEG images. The next section discusses the proposed steganalyser, followed by the extracted image features and the ALO optimization technique. Finally, the classification scheme used for steganalysis is discussed.

2 Implementation of proposed steganalyser

The proposed image steganalyser in this research consists of four main parts. In the first part, stego images are created from clean images using the non-shrinkage F5 (nsF5) algorithm. The second part involves the extraction of a rich feature set (23,230 features) from the stego and clean images. Reducing this feature set to 232 features with ALO is the third part, followed by classification in the fourth stage. All implementations are done in MATLAB.

2.1 Image database and steganographic embedding

The images used in this research are taken from the standard BOSS (Break Our Steganographic System) database [1]. This database contains full-resolution images acquired with different cameras (Panasonic, Leica, Pentax, Canon EOS and Nikon) in .cr2 or .dng format and later converted to .pgm format. The original images are resized and cropped to 512 × 512. These images are publicly available to researchers in the field of steganography and steganalysis. From the images in this database (http://dde.binghamton.edu/download/), 100 images are chosen and converted to .jpg format with a quality factor of 75. These 100 images are the cover (clean) images from which another 100 stego images are created by non-shrinkage F5 (nsF5) embedding. The subsequent feature extraction, optimization and classification in this research use these 200 images.
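The preprocessing in this work was done in MATLAB; purely as an illustration, the Python sketch below (using the Pillow library, with placeholder directory names that are not part of the original work) shows an equivalent way to convert the 512 × 512 BOSS covers to JPEG with quality factor 75.

```python
from pathlib import Path
from PIL import Image

# Convert 512 x 512 BOSS .pgm covers to JPEG with quality factor 75.
# Directory names are placeholders; the original preprocessing was done in MATLAB.
src_dir = Path("boss_pgm")       # directory holding the .pgm covers
dst_dir = Path("covers_jpg")
dst_dir.mkdir(exist_ok=True)

for pgm in sorted(src_dir.glob("*.pgm"))[:100]:   # first 100 covers
    img = Image.open(pgm).convert("L")            # 8-bit grayscale
    img.save(dst_dir / (pgm.stem + ".jpg"), "JPEG", quality=75)
```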

The stego images for this work are created with the non-shrinkage F5 (nsF5) algorithm. Permutative straddling of the DCT coefficients in F5 spreads the embedding changes uniformly over the image with time complexity of order O(n). The sequence of operations in F5 is: compute the quantized DCT coefficients, permute them with a key, embed the secret data, and then apply Huffman coding. For s secret bits and code words of length n = 2^s − 1, the embedding rate R(s), change (embedding) density E(s) and embedding efficiency μ are

$$ R(s)=\frac{s}{n}=\frac{s}{2^s-1} $$
(1)
$$ E(s)=\frac{1}{n+1}=\frac{1}{2^s} $$
(2)
$$ \mu =\frac{R(s)}{E(s)}=\frac{s\,{2}^s}{2^s-1} $$
(3)
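As a quick numerical check of Eqs. (1)-(3) (not part of the original MATLAB implementation), the short Python snippet below evaluates the embedding rate, change density and embedding efficiency for a few values of s.

```python
# Evaluate R(s), E(s) and the embedding efficiency mu for the matrix code used
# by F5, where n = 2**s - 1. Illustrative only.
for s in range(1, 6):
    n = 2**s - 1
    R = s / n              # embedding rate, Eq. (1)
    E = 1 / (n + 1)        # change density, Eq. (2)
    mu = R / E             # embedding efficiency, Eq. (3): s * 2**s / (2**s - 1)
    print(f"s={s}: n={n}, R={R:.3f}, E={E:.3f}, mu={mu:.3f}")
```

For example, s = 3 gives n = 7, R ≈ 0.43, E = 0.125 and μ ≈ 3.43 bits per change.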

In F5, a coefficient that becomes zero due to embedding is skipped by the decoder at the receiver, so the corresponding bit has to be re-embedded; this is the shrinkage problem. The modified F5 (non-shrinkage F5) algorithm therefore applies syndrome coding to the DCT coefficients before the F5 logic, which eliminates shrinkage without requiring side information about the cover. The literature reports nsF5 as one of the most secure JPEG steganographic algorithms of its kind [18].

2.2 Extracted image features

Features extracted in this research form a rich set capturing all possible changes in the DCT coefficients, including their spatial- and frequency-domain correlations. For a JPEG image of dimension I × J, the quantized DCT coefficients are arranged in a matrix C of size I × J. Each DCT coefficient C^{p,q}_{m,n} denotes the (p,q)-th coefficient within the (m,n)-th 8 × 8 block, where (p,q) ∈ {0,1,…,7}², m ∈ {1,2,…,I/8} and n ∈ {1,2,…,J/8}. For simplicity the individual elements are written as C_{m,n}. The matrices denoting the absolute values, intra-block differences and inter-block differences are

$$ {DC}_{m, n}=\left|{C}_{m, n}\right|, m=1,2,\dots I\; and\; n=1,2,\dots J $$
(4)
$$ {DC^{ha}}_{m, n}=\left|{C}_{m, n}\right|-\left|{C}_{m, n+1}\right|, m=1,2,\dots I\; and\; n=1,2,\dots J-1 $$
(5)
$$ {DC^{va}}_{m, n}=\left|{C}_{m, n}\right|-\left|{C}_{m+1, n}\right|, m=1,2,\dots I-1\; and\; n=1,2,\dots J $$
(6)
$$ {DC^{da}}_{m, n}=\left|{C}_{m, n}\right|-\left|{C}_{m+1, n+1}\right|, m=1,2,\dots I-1\; and\; n=1,2,\dots J-1 $$
(7)
$$ {DC^{he}}_{m, n}=\left|{C}_{m, n}\right|-\left|{C}_{m, n+8}\right|, m=1,2,\dots I\; and\; n=1,2,\dots J-8 $$
(8)
$$ {DC^{ve}}_{m, n}=\left|{C}_{m, n}\right|-\left|{C}_{m+8, n}\right|, m=1,2,\dots I-8\; and\; n=1,2,\dots J $$
(9)

where DC_{m,n} is the absolute value of the coefficient, DC^{ha}_{m,n} is the difference between horizontally adjacent intra-block coefficients, DC^{va}_{m,n} between vertically adjacent intra-block coefficients, DC^{da}_{m,n} between diagonally adjacent intra-block coefficients, DC^{he}_{m,n} between horizontally adjacent inter-block coefficients, and DC^{ve}_{m,n} between vertically adjacent inter-block coefficients. The proposed feature model is composed of sub-models framed from the 2D co-occurrence matrices calculated from these difference coefficients. There are 10 sub-models in each of the co-occurrence matrices, and the final rich feature set has 23,230 features from all possible co-occurrence combinations of DCT coefficients. The details of the features from each of the sub-bands are enumerated in Table 1.

Table 1 The Extracted feature Model

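To make Eqs. (4)-(9) concrete, the NumPy sketch below computes the absolute-value matrix and the intra- and inter-block difference matrices from a matrix of quantized DCT coefficients. It is illustrative rather than the original MATLAB extractor, and the co-occurrence accumulation that produces the final 23,230-dimensional feature vector is omitted.

```python
import numpy as np

def difference_matrices(C):
    """C: I x J matrix of quantized DCT coefficients (8 x 8 block layout)."""
    A = np.abs(C)
    return {
        "DC": A,                          # Eq. (4): absolute values
        "ha": A[:, :-1] - A[:, 1:],       # Eq. (5): intra-block horizontal differences
        "va": A[:-1, :] - A[1:, :],       # Eq. (6): intra-block vertical differences
        "da": A[:-1, :-1] - A[1:, 1:],    # Eq. (7): intra-block diagonal differences
        "he": A[:, :-8] - A[:, 8:],       # Eq. (8): inter-block horizontal differences
        "ve": A[:-8, :] - A[8:, :],       # Eq. (9): inter-block vertical differences
    }

# Example with a random 512 x 512 coefficient matrix as a stand-in for a real JPEG
C = np.random.default_rng(0).integers(-8, 9, size=(512, 512))
D = difference_matrices(C)
print({k: v.shape for k, v in D.items()})
```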

2.3 ALO optimization

With such a large and diverse feature set, classification is a time- and space-complex problem. Hence the feature set is reduced with Ant Lion Optimization, a nature-inspired meta-heuristic algorithm. The positions of the ants represent the features in the search space in a given iteration. The characteristics of the ALO algorithm are:

  • The movements of the ants and antlions are random walks in search space.

  • The fitness function depends on the size of the pits built by the antlions (the probability of an ant being trapped is higher for larger pits).

  • In each iteration an elite antlion catches a prey.

  • Fitter ants are caught by the antlion (fitter features are selected).

  • Antlions start building new pits by repositioning themselves to the latest prey.

Mathematical representation of the random walk of ants is

$$ R(t)=\left[0,\; CU\big(2s(t_1)-1\big),\; CU\big(2s(t_2)-1\big),\; CU\big(2s(t_3)-1\big),\; \dots,\; CU\big(2s(t_n)-1\big)\right] $$
(10)

Here CU denotes the cumulative sum of the random function s over n iterations, t is the step of the random walk, and s(t) is a random function defined as 1 if rand > 0.5 and 0 otherwise, where rand is a random number drawn uniformly from [0, 1]. The nature of the random walk is shown in Fig. 1 for three sets of features extracted with payload 0.5. The randomness is visible in the fluctuation of the curves around the origin, as expected for the behaviour of ants in the search space.

Fig. 1 Three sets of feature values depicting the random walk of three ants in the search space
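The random walk of Eq. (10) can be reproduced in a few lines; the NumPy sketch below is illustrative (the original implementation is in MATLAB) and is reused in the later sketches in this section.

```python
import numpy as np

def random_walk(n_steps, rng=np.random.default_rng()):
    """Random walk of Eq. (10): cumulative sum of +/-1 steps, prepended with 0."""
    s = (rng.random(n_steps) > 0.5).astype(int)   # s(t) = 1 if rand > 0.5 else 0
    steps = 2 * s - 1                             # map {0, 1} -> {-1, +1}
    return np.concatenate(([0], np.cumsum(steps)))

# Three independent walks, as plotted in Fig. 1
walks = [random_walk(400) for _ in range(3)]
```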

The position of each ant is stored in matrix format for use in subsequent iterations during optimization.

$$ \mathrm{AntPos}=\begin{bmatrix} A_{1,1} & A_{1,2} & \cdots & A_{1,p}\\ A_{2,1} & A_{2,2} & \cdots & A_{2,p}\\ \vdots & \vdots & \ddots & \vdots\\ A_{n,1} & A_{n,2} & \cdots & A_{n,p} \end{bmatrix} $$
(11)

Here, A_{n,p} is the p-th position of the n-th ant in a given iteration. There are n ants, and the p positions of each ant correspond to p features. Each ant (feature vector) is evaluated against a fitness function and the fitness values are stored in a matrix,

$$ \mathrm{AntOpt}=\begin{bmatrix} Aopt_{1}\\ Aopt_{2}\\ \vdots\\ Aopt_{n} \end{bmatrix}=\begin{bmatrix} f\!\left(A_{1,1},A_{1,2},\dots,A_{1,p}\right)\\ f\!\left(A_{2,1},A_{2,2},\dots,A_{2,p}\right)\\ \vdots\\ f\!\left(A_{n,1},A_{n,2},\dots,A_{n,p}\right) \end{bmatrix} $$
(12)

Here, f(A_{n,1}, A_{n,2}, …, A_{n,p}) is the fitness function evaluated for the n-th ant. In the ALO algorithm, besides the ants, antlions also hide in the search space to catch the ants. Their positions are tracked in another matrix,

$$ \mathrm{AntLionPos}=\begin{bmatrix} L_{1,1} & L_{1,2} & \cdots & L_{1,d}\\ L_{2,1} & L_{2,2} & \cdots & L_{2,d}\\ \vdots & \vdots & \ddots & \vdots\\ L_{n,1} & L_{n,2} & \cdots & L_{n,d} \end{bmatrix} $$
(13)

where L_{n,d} is the d-th position of the n-th antlion; the n antlions have d positions corresponding to d features. The fitness values of the antlions are stored in a matrix,

$$ \mathrm{AntLionOpt}=\begin{bmatrix} Lopt_{1}\\ Lopt_{2}\\ \vdots\\ Lopt_{n} \end{bmatrix}=\begin{bmatrix} f\!\left(L_{1,1},L_{1,2},\dots,L_{1,d}\right)\\ f\!\left(L_{2,1},L_{2,2},\dots,L_{2,d}\right)\\ \vdots\\ f\!\left(L_{n,1},L_{n,2},\dots,L_{n,d}\right) \end{bmatrix} $$
(14)

Here, f(L_{n,1}, L_{n,2}, …, L_{n,d}) is the fitness function evaluated for the n-th antlion. Because the random walks can wander widely, the variables are kept inside the search space by min-max normalization:

$$ A=\frac{\left(A-Mn\right)\times \left(Bt-Ct\right)}{\left(B-Mn\right)}+Ct $$
(15)

Here Mn and B are the minimum and maximum values of the random walk of the n-th feature vector, Ct is the minimum of the feature vector in the t-th iteration, and Bt is its maximum in the t-th iteration. Normalization ensures that all feature values remain within the search space. During the random walk, ants get trapped in the antlions' pits. Figure 2 shows the pits built by one or more antlions.
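In code, the normalization of Eq. (15) amounts to rescaling each random walk into the current bounds of the corresponding variable. The helper below is a minimal sketch under that reading of Eq. (15) and is reused in the optimization sketch later in this section.

```python
def normalize_walk(walk, c_t, b_t):
    """Min-max normalization of Eq. (15).

    walk : NumPy array of random-walk values of one variable
    c_t  : minimum allowed value of the variable in the current iteration (Ct)
    b_t  : maximum allowed value of the variable in the current iteration (Bt)
    """
    mn, b = walk.min(), walk.max()        # Mn and B in Eq. (15)
    span = (b - mn) if b > mn else 1.0    # guard against a degenerate (flat) walk
    return (walk - mn) * (b_t - c_t) / span + c_t
```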

Fig. 2 Pits built by one or more antlions

A pit is modelled as a hypersphere around each selected antlion, and the movement of the ants around the antlion within the hypersphere is modelled by the following equations,

$$ {C_i}^t={L_n}^t+{C}^t $$
(16)
$$ {D_i}^t={L_n}^t+{D}^t $$
(17)

C_i^t and D_i^t are the minimum and maximum feature values for the i-th ant, C^t and D^t are the minimum and maximum among all feature values in the t-th iteration, and L_n^t is the position of the n-th antlion in the t-th iteration. When an ant falls into the trap it tries to escape, but the antlion throws sand outwards to make it slide down. This sliding behaviour is modelled by shrinking the radius of the hypersphere defined by the vectors C and D:

$$ {C}^t={C}^t/ S $$
(18)
$$ {D}^t = {D}^t/ S $$
(19)

The parameter S defines the level of accuracy [22] and is given by

$$ S={10}^{ut/ T} $$
(20)

where t is the current iteration, T is the total number of iterations, and u = 2 when t/T > 0.1, u = 3 when t/T > 0.5, u = 4 when t/T > 0.75, u = 5 when t/T > 0.9, and u = 6 when t/T > 0.95. The values of C and D decrease with every iteration, mimicking the sliding of the ants toward the bottom of the pit (the global optimum). To increase the probability of catching the next prey (ant), an antlion updates its position to that of the hunted ant whenever the fitness of the ant exceeds the fitness of the antlion.
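The shrinking schedule of Eqs. (18)-(20) can be written compactly as a helper that follows the thresholds listed above; the value of u before t/T exceeds 0.1 is not stated in the text and is assumed to be 1 here.

```python
def shrink_ratio(t, T):
    """Accuracy level S = 10**(u * t / T) of Eq. (20), with u chosen per the listed thresholds."""
    r = t / T
    if   r > 0.95: u = 6
    elif r > 0.90: u = 5
    elif r > 0.75: u = 4
    elif r > 0.50: u = 3
    elif r > 0.10: u = 2
    else:          u = 1   # assumed default; not specified in the text
    return 10 ** (u * r)

# Eqs. (18)-(19): the bounds then shrink each iteration as C_t /= S and D_t /= S
```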

In this research work the fitness function is the sum of squares of the feature values, and the selection of the antlions around which the ants perform their random walks is implemented by roulette-wheel selection. The number of search agents (antlions) is chosen as 100, and the rich image feature set of size 23,230 reduces to 232 after optimization. For 100 cover images and 100 stego images, the optimized feature set is therefore of size 200 × 232. This reduced feature set is given to the fusion classifier system. A compact sketch of the resulting optimization loop is given below.
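The sketch below outlines one plausible form of the ALO iteration loop as described: the fitness is the sum of squares of the feature values, antlions are picked by roulette-wheel selection, and each ant's position is the average of random walks around the selected antlion and the elite. It reuses the random_walk, normalize_walk and shrink_ratio helpers sketched above and is an illustrative reconstruction rather than the authors' MATLAB code; in particular, the mapping from the final antlion positions to the 232 retained feature indices is not reproduced.

```python
import numpy as np

rng = np.random.default_rng(0)

def fitness(x):
    # Fitness used in this work: sum of squares of the feature values
    return np.sum(x ** 2)

def roulette_select(fit):
    # Roulette-wheel selection with probability proportional to (shifted) fitness
    w = fit - fit.min() + 1e-12
    return rng.choice(len(fit), p=w / w.sum())

def alo_reduce(features, n_agents=100, T=400):
    """features: matrix of feature values (rows = images); returns the elite antlion position."""
    lo_g, hi_g = features.min(), features.max()
    dim = features.shape[1]
    antlions = rng.uniform(lo_g, hi_g, size=(n_agents, dim))
    lion_fit = np.apply_along_axis(fitness, 1, antlions)
    elite = antlions[lion_fit.argmax()].copy()

    for t in range(1, T + 1):
        S = shrink_ratio(t, T)                      # Eq. (20)
        c_t, d_t = lo_g / S, hi_g / S               # Eqs. (18)-(19): shrinking bounds
        for i in range(n_agents):
            j = roulette_select(lion_fit)
            # Random walks around the selected antlion and the elite, bounds per Eqs. (16)-(17)
            w_lion  = normalize_walk(random_walk(dim)[1:], antlions[j] + c_t, antlions[j] + d_t)
            w_elite = normalize_walk(random_walk(dim)[1:], elite + c_t, elite + d_t)
            ant = (w_lion + w_elite) / 2
            if fitness(ant) > lion_fit[j]:          # antlion repositions to a fitter prey
                antlions[j], lion_fit[j] = ant, fitness(ant)
        elite = antlions[lion_fit.argmax()].copy()  # keep the fittest antlion as the elite
    return elite

# Toy run on a random stand-in for the feature matrix (small sizes for speed)
demo = np.random.default_rng(1).normal(size=(200, 50))
best = alo_reduce(demo, n_agents=10, T=20)
```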

2.4 Fusion classifier system

Identifying an image as clean or stego is a two-class problem (a clean image is labelled 1 and a stego image 2). The classifiers used are individual classifiers (SVM and MLP) and their fusion under three schemes (Bayes, Decision Template and Dempster-Shafer). Fusion classifiers are superior because they exploit the strengths of the individual classifiers while avoiding their weaknesses [19]. The feature set is divided into 3 folds: two folds are used for training and one fold for validation.

Support Vector Machines map the data into a high-dimensional feature space and separate the classes by a hyperplane. The SVM in this research uses an RBF kernel with penalty factor 100 and gamma of 10. The Multi-Layer Perceptron (MLP) [30] is a feed-forward network for non-linearly separable data. The MLP implemented in this research has 10 hidden nodes with a sigmoidal activation function.
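The classifiers were built in MATLAB; for illustration, an approximately equivalent configuration in Python with scikit-learn is sketched below. The hyperparameters mirror those stated above, while the dummy data merely stands in for the 200 × 232 optimized feature matrix.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier

# Dummy stand-in for the 200 x 232 optimized feature matrix and its labels (1 = clean, 2 = stego)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 232))
y = np.repeat([1, 2], 100)

# SVM with RBF kernel, penalty factor C = 100 and gamma = 10, as stated in the text
svm = SVC(kernel="rbf", C=100, gamma=10, probability=True).fit(X, y)

# MLP with 10 hidden nodes and a sigmoidal (logistic) activation
mlp = MLPClassifier(hidden_layer_sizes=(10,), activation="logistic",
                    max_iter=1000, random_state=0).fit(X, y)
```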

The classifier fusion methods implemented are Bayes, Decision Template and Dempster-Shafer [19]. These are decision-level data fusion methods that combine the outputs of several sources. Bayesian inference uses conditional probability according to Bayes' rule, expressing the posterior probability P(Y|X) in terms of the prior P(Y) and the likelihood P(X|Y). Dempster-Shafer fusion handles uncertainty in terms of changing beliefs, evidence and incomplete knowledge; the rule for combining two hypotheses (classifiers) is expressed in terms of mass functions or probabilities [19], and the selected class maximizes the belief function. Decision Template is a fusion scheme that combines classifiers by comparing their outputs with a reference (decision) template measured beforehand; the comparison is based on similarity and consistency measures. This method differs from the others in that it considers the outputs of all classifiers when computing the final support for a class, while the other methods consider only the output for that class [19]. This enables a decision based on the average of the decision profiles over all elements of the training set. Classification accuracy is used to compare the performance of the individual and fusion classifiers: Accuracy = (Tp + Tn) / (Tp + Tn + Fp + Fn), where Tp is a true positive (hit), Tn a true negative (correct rejection), Fp a false positive (false alarm) and Fn a false negative (miss).
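The accuracy measure defined above is straightforward to compute from the confusion counts; a minimal sketch follows.

```python
def accuracy(tp, tn, fp, fn):
    """Classification accuracy: (Tp + Tn) / (Tp + Tn + Fp + Fn)."""
    return (tp + tn) / (tp + tn + fp + fn)

# Example: 70 hits, 60 correct rejections, 40 false alarms, 30 misses -> 0.65
print(accuracy(70, 60, 40, 30))
```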

3 Results

Stego images are obtained for specific payloads by embedding in randomly selected DCT coefficients of the cover images using a PRNG seed, as listed in Table 2.

Table 2 Embedding changes in stego images according to payload

The Appendix shows a few clean (cover) images and their embedded counterparts for a payload of 0.5 bpdct, the embedding output, the extracted features stored in an .xls file, and the convergence of the ALO optimizer over 400 iterations. The convergence of the entire ALO for the chosen feature set is shown in Fig. 3.

Fig. 3 Convergence of the ALO optimization

After optimization, the reduced features are classified. The classification accuracies of the single and fusion classifiers for payloads of 0.5, 0.8, 0.01 and 0.1 are tabulated in Tables 3, 4, 5, and 6 respectively; the highest accuracy in each category is shown in bold.

Table 3 Classification accuracy of single and fusion classifiers for Payload = 0.5
Table 4 Classification accuracy of single and fusion classifiers for Payload = 0.8
Table 5 Classification accuracy of single and fusion classifiers for Payload = 0.01
Table 6 Classification accuracy of single and fusion classifiers for Payload = 0.1

For a payload of 0.5, the highest classification accuracy is observed for the Decision Template (72.22%) and Dempster-Shafer (72.22%) fusion classifiers, followed by Bayes (69.44%). In terms of average accuracy, the Bayes fusion classifier has the highest overall value (64.44%). When the payload is 0.8, Bayes again has the maximum average accuracy (61.66%) among all classifiers, and for payloads of 0.01 and 0.1 the Bayes classifier likewise has the highest average classification accuracy. Table 7 shows the performance of all classifiers for different payloads with a fixed PRNG seed (seed = 80) during embedding; here too the Bayes classifier has the highest average classification accuracy.

Table 7 Classification accuracy of single and fusion classifiers for different Payloads and fixed PRNG seed

Comparing the average classification accuracies of all classifiers in Table 7, Fig. 4 shows that the Bayes fusion classifier is the best. Considering the average classification accuracies (from Tables 3, 4, 5, and 6) of all classifiers across payloads, Fig. 5 shows that the Bayes fusion classifier outperforms all other classifiers. Thus the Bayes fusion classifier can be considered a universal classifier for JPEG steganalysis.

Fig. 4 Performance of all classifiers for different payloads with a fixed PRNG seed during embedding

Fig. 5 Average classification accuracies of all classifiers for different payloads, averaged over different PRNG seed values

Table 8 shows the timing measurements for different images at a payload of 0.5 with PRNG seed 80 during embedding. The total time is the sum of the embedding time and the extraction and optimization times for both the cover and the stego images. The average time taken to process one image is 18.4315 s. The Appendix shows the time calculation for a stego image (image number 74). Running the algorithm for steganographic embedding, feature extraction and optimization over all 200 images took 691.34 s (11.52 min), as shown in the Appendix. The classification of the optimized feature set took a further 55.09 s, so the total processing time is 746.43 s, or 12.44 min, for 200 images.

Table 8 Time Complexity Calculation when payload is 0.5 and PRNG seed = 80

4 Comparison with prior research

In the recent past, steganalysis with feature extraction has used limited statistical features and neural-network-based classifiers [13, 23]. While some parallel-processing video coding algorithms [3,4,5] exist for reducing time and space complexity, applying them to image processing would be expensive compared with the proposed ALO-based optimization. Although a few research works have concentrated on rich models for universal steganalysis, they use statistical feature reduction techniques such as Fisher Linear Discriminant (FLD) and ensemble classifiers [12]. These ensemble methods take different portions of the feature set for classification and minimize the total error (in terms of false alarms and missed detections). Table 9 compares this research with earlier work in this field.

Table 9 Comparison with earlier research by other researchers

From Table 9 it is evident that this research achieves better classification accuracy (the highest average classification accuracy is shown in bold) than the other works. Compared with recent research in this field, ALO gives better classification accuracy for the more sophisticated nsF5 embedding, while others report steganalysis of HUGO and YASS [26]. Research by Chhikara [7] reports a feature reduction rate of 82% for the CCPEV-548 features, and another study by Chhikara [6] reports reduction rates of 67% for DCT features and 38% for DWT features, whereas the feature reduction rate in this research is 99%. This high reduction rate enables the use of rich image feature sets (23,230 features) with improved classification accuracy.

While most image steganalysis research reports results only in terms of classification accuracy, Kodovsky [12] reports the average running time of the ensemble method for the HUGO, EA and ± embedding algorithms: 1 h 20 min for the simplest (± embedding), 4 h 35 min for HUGO and 3 h 09 min for EA [12]. The time measurements in our research show 12.44 minutes for extracting, optimizing and classifying all 200 images for the sophisticated nsF5 embedding algorithm. Table 10 compares this research with our earlier work.

Table 10 Comparison with our earlier research

The work presented in this paper gives a higher classification accuracy of 64.44% (shown in bold in Table 10) than the earlier methods listed in Tables 9 and 10. Future work could fine-tune the parameters of the ALO algorithm or replace the random selection of antlion (feature) positions with another method.

5 Significance of this research

The ALO algorithm ensures exploration of the entire feature space because it models the features as random walks of ants, and it has only a few adjustable parameters. These characteristics justify using ALO rather than other nature-inspired optimization algorithms. ALO-based JPEG steganalysis is significant compared with related work in this field for the following reasons:

  • The feature reduction rate obtained by ALO is far higher than that of other methods [6, 7, 26].

  • The method of direct feature optimization and classification used in this research is simpler than the state-of-the-art ensemble-based feature selection method proposed by Kodovsky and Fridrich [12].

  • Compared with other approaches to feature extraction and steganalysis [14, 20, 23, 29], this approach is not complicated: it considers all possible combinations of feature changes and then optimizes them to give better classification accuracy.

Thus the Bayes classifier, when used with ALO-based optimization, gives a significant improvement in classification accuracy and reduced time complexity for JPEG steganalysis.

6 Conclusion

This research has implemented a nature-inspired meta-heuristic technique for optimizing high-dimensional image features for improved image steganalysis. The steganographic embedding uses the nsF5 algorithm, and the extracted features are of dimension 200 × 23,230. The extracted feature model is based on the correlations among DCT coefficients in the frequency and spatial domains. To tackle the computational complexity, the feature set is reduced by the Ant Lion Optimization (ALO) technique, in which the movements of ants and antlions are modelled as random walks in the search space and the positions of the ants represent the features in each iteration; the fitness function is based on the pits built by the antlions. The reduced feature set of dimension 200 × 232 is classified with individual (SVM and MLP) and fusion (Bayes, Decision Template, Dempster-Shafer) classifiers. With classification accuracy and time complexity as the performance measures, results have been analysed for different payloads and different choices of DCT coefficients (different PRNG seed values). The highest average classification accuracy is obtained by the Bayes fusion classifier (64.44%) at a payload of 0.5, and the fusion classifiers generally perform better than the individual classifiers. Hence, for JPEG steganalysis, the Bayes classifier with ALO-based optimization gives better classification accuracy than existing approaches.