1 Introduction

Linguistic steganography is the technology to hide secret messages into an innocuous-looking cover text by using natural language processing techniques to achieve the goal of covert communication. With the prevalence of digital texts, including novels, news, office documents, blogs, etc., linguistic steganography has attracted increasing interest during the past few years. Accordingly, the opposite technology—linguistic steganalysis also has attracted the attentions of researchers. Linguistic steganalysis aims at discovering the presence of secret messages in digital texts. It can be used to find and even prevent covert communication established by terrorists or illegal groups.

Linguistic steganography mainly falls into two categories. The first category is directly to generate a new natural-looking text [4, 11] by certain rules based on mimicking technique. The generated texts are usually not only obscurely readable and understood, but also easy to be distinguished from natural texts by statistical analysis [5]. The other embeds messages by approximately meaning-preserving linguistic modifications, such as synonym substitution (SS) [2, 6, 10, 14, 15, 17, 19, 22], syntactic transformation [1, 13], etc., to the natural texts. Among the existing linguistic steganography methods, SS steganography is an attractive one because of its simplicity, high embedding capacity and good imperceptibility. And some improved variants of SS steganography have been proposed to prevent possible wrong syntax and semantics caused by SS [2, 10, 14, 15, 17]. For example, Bolshakov [2] previously tested relative synonyms for semantic compatibility with collocations to determine whether synonym substitutions were correct, and replaced absolute synonyms directly. Topkara et al. [17] just chose one alternative for every synonym to be replaced to hide one bit message. Liu et al. [10] used the disambiguation function to determine which synonym was right to substitute the original word. Muhammad et al. [14] only adopted two absolute synonyms of each synsetFootnote 1 for embedding messages so that the replaceable synonyms are limited in a narrow scope to obtain good imperceptibility. Shirali-Shahreza et al. [15] used the English words which have the same meanings but different spellings in British and American English to camouflage data. In addition, Yang et al. [22] and Chiang et al. [6] cooperated the SS with some auxiliary techniques to design text watermarking schemes.

In recent years, some works on linguistic steganalysis have been done for breaking the SS steganography [12, 16, 23]. The first work on linguistic steganalysis against SS steganography was published by Taskiran et al. [16]. In this work, a feature vector extracted from a 3-gram language model is used to distinguish the steganographically modified sentences from the unmodified ones by SVM. However, it neither accurately detected stego sentences nor determined whether a text was a cover or stego one. In addition, Yu et al. [23] constructed a detector based on characteristics of the evaluated suitability of synonyms for their context in a text. This detector provided reliable results when the embedding rates of stego texts were very high. However, it has to access Google frequently, which leads to a very low running speed because Google does not allow automated frequent queries [7]. Luo et al. [12] extracted a statistical feature from the synonym sequences, each of which was composed of synonymous words appearing in the text, to detect the stego texts generated by SS steganography. In spite of running fast, the detection accuracy was not desirable especially when detecting stego texts with lower embedding capacity. Additionally, above steganalysis methods all do not consider the two critical factors affecting detection accuracy for SS steganography—the embedding rate and the synonym database.

In order to improve the performance of previous steganalysis for SS steganography, a new linguistic steganalysis method is developed in this paper. As a result of SS steganography, the number of high frequency synonyms may be reduced while that of low frequency synonyms would be increased. These changes can produce convincing evidences to reveal the existence of the hidden message. Thus, attribute pairs defined based on synonyms’ frequencies and synsets’ sizes are used to capture these changes, and some statistical features are extracted from the distribution of attribute pairs to form a vector for SVM to discriminate stego and cover texts. One advantage of employing the attribute pair in this paper is that relationships between statistical characteristics in cover and corresponding stego texts can be formulized to correlate with the embedding rate. The experimental results verify the efficacy of the proposed steganalysis method for different embedding rates, SS steganographic tools, and synonym databases, and demonstrate that the proposed method has more significant advantages than the existing methods.

This paper is organized as follows. Section 2 briefly reviews the general process of the SS steganography, followed by introducing the main synonym coding strategies. In Section 3, the characteristics of the attribute pairs in cover texts and stego texts are analyzed, and the linguistic steganalysis against SS steganography based on attribute pairs is described. The impact on the extracted feature vector caused by coding strategy is investigated in Section 4. Experimental results are presented in Section 5. Section 6 concludes the paper.

2 Overview of the SS steganography

Early SS steganography just simply replaces words with their synonyms to embed a message in a known text, so that the meaning of the cover text, in theory, should not be significantly changed. The stego text looks like innocuous one in appearance. In general, the synsets in a prepared synonym database for SS steganography are pre-encoded by a coding strategy. Each used synonym must be encoded into unique values in order to represent different information.

In the embedding process, firstly, SS steganography recognizes the words with synonyms, and locates the corresponding synsets; then, according the embedded message, it chooses the synonym with appointed code to substitute the original word. Figure 1 illustrates how to embed messages into two given sentences by SS steganography. In the first sentence, “rebuke” is recognized as a synonym, and it is located in the synset {rebuke, reproof}, which is encodes as {0: rebuke, 1: reproof}. If the current embedded message is the bit “1”, then “rebuke” will be replaced by “reproof”, otherwise, no substitution will be made. In the second sentence, “onetime” is located in a synset which contains three synonyms. Different coding strategies may lead to completely different coding results, for example, “quondam” can be encoded as “01” or nothing. Three typical coding strategies in previous SS steganography will be introduced.

Fig. 1
figure 1

Embedding messages into two example sentences by SS steganography

  1. 1)

    The basic coding strategy. This is the simplest coding strategy, and can be described as follows: assuming a synset has 2q + m (integers q > 0, m ≥ 0) synonyms, then arbitrarily 2q synonyms from it can be selected to be encoded as q-bit strings.

  2. 2)

    The multi-base coding strategy [2, 19]. It sets the synsets containing the orderly appearing synonyms in a text to \( s{n_0},s{n_1},\ldots,s{n_n} \) with sizes \( {k_0},{k_1},\ldots,{k_n} \), respectively, then codes the secret message by the following steps:

    1. Step1.

      Convert the secret message into an integer M;

    2. Step2.

      Encode synonyms in synset sn i as integer from 0 to k i −1;

    3. Step3.

      For i = 0 to n, repeatedly calculate t i = M mod k i , M′ = M/k i , M = M′, and select the synonym with codeword t i from sn i to replace the original synonym in the cover text.

  3. 3)

    The binary tree-based coding strategy [6]. It constructs a binary tree for each synset with the synonyms as the leaves. Different bits (“0” or “1”) are assigned to different branches, and then each synonym in the same synset will obtain a unique codeword. A complete binary tree, Huffman tree, or a normal binary tree may be constructed. The construction principle of the binary tree is determined by concrete coding strategy.

In fact, for most synsets, their synonyms have only part of the same senses, and then substitutions may lead to meaning distortions, sentence syntactic structure errors. The SS steganography proceeds with the works on how to select a perfect word to replace the original synonym for messages embedding [2, 10, 14, 15, 17]. These works used collocations, context information, etc., to measure the suitability of synonym substitutions.

3 Linguistic steganalysis against SS steganography by attribute pairs analysis

3.1 Synonym attributes representation

Synonym substitutions would keep semantic attributes of synonyms in a text almost unchanged. However, some statistical attributes will undoubtedly be modified. Thus, it is necessary to present an outstanding representation method to describe these statistical attributes for steganalysis.

In the existing natural language processing area, synonymous words always have different frequencies in a huge corpus. With these in mind, synsets in a synonym database could be preprocessed into synonym vectors according to the frequencies of the inside synonyms. The definition of synonym vector is given as the definition 1.

Definition 1

A synonym vector is defined as an ordered synset sorted in descending order of the frequencies of the inside synonyms.

A synonym vector \( \left( {{w_0},\ldots,{w_{k-1 }}} \right) \) satisfies the conditions: \( F\left( {{w_{i-1 }}} \right)\geqslant F\left( {{w_i}} \right),\ M\left( {{w_{i-1 }}} \right)\approx M\left( {{w_i}} \right) \), where \( i=1,\ldots,k-1 \), F(w) denotes the frequency of synonym w derived from a huge corpus, M(w) represents a lexical concept of w. Each synonym will have a unique position in the synonym vector, whose dimension is definite. Thus, some inherent attributes of a synonym can be obtained, and a new term named attribute pair is introduced to represent them.

Definition 2

Attribute pair of a synonym is defined as its position in a synonym vector and the dimension of the synonym vector, denoted as an ordered pair <pos, dim>, where \( pos\in \left\{ {0,1,\ldots,dim-1} \right\} \).

Suppose that synonym w is located in a synonym vector \( \left( {{w_0},\ldots,{w_{k-1 }}} \right) \), if w = w j , \( j\in \left\{ {0,1,\ldots,k-1} \right\} \), then its attribute pair is <j, k>.

3.2 Statistical characteristics of attribute pairs

Give an attribute pair <j, k>, its relative frequency p(j, k)in a text is given by

$$ p\left( {j,k} \right)=\frac{{f\left( {j,k} \right)}}{{\sum\limits_{i=0}^{k-1 } {f\left( {i,k} \right)} }} $$
(1)

where f(j, k) is the number of total occurrences of <j, k> in the text, \( \sum\limits_{i=0}^{k-1 } {f\left( {i,k} \right)} \) represents the number of total occurrences of all attribute pairs whose second components equal to k in the text.

According to definition 1, for any synonym vector \( \left( {{w_0},\ldots,{w_{k-1 }}} \right) \), when j < h, \( h\in \left\{ {1,\ldots,k-1} \right\} \), the frequency of w j is larger than or equal to that of w h . Thus, usually the probability of w j occurring in a cover text is not less than that of w h . In fact, in most synonym vectors, synonyms with different attribute pairs have unequal frequencies. The cover text would contain more synonyms with attribute pair <j,k> than the ones with attribute pair <h, k>. Consequently,

$$ {f_c}\left( {j,k} \right)>{f_c}\left( {h,k} \right),\quad j<h $$
(2)
$$ {p_c}\left( {j,k} \right)-{p_c}\left( {h,k} \right)>0,\quad j<h $$
(3)

where f c (j, k) and p c (j, k) represent the number of total occurrences and relative frequency of <j, k> in cover text, respectively.

Therefore,

$$ Max\left( {{f_c}\left( {j,k} \right)} \right)={f_c}\left( {0,k} \right) $$
(4)
$$ Min\left( {{f_c}\left( {j,k} \right)} \right)={f_c}\left( {k-1,k} \right) $$
(5)

In existing steganographic algorithms, synonyms in a synset are always stored or encoded in alphabetical order, just like the tool Tyrannosaurus lex [20] (Tlex for short). However, after processing synsets used by SS steganography into synonym vectors, the order of synonyms in a synonym vector is not always synchronous with the encoding order. Thus, for arbitrary synonyms being encoded as the same specified codeword, the values of their attribute pairs are random. In other words, in a stego text, if a synonym w contains a secret message and its attribute pair is <pos, k>, pos may be a random value varying from 0 to k−1. For all synonyms with secret messages, whose attribute pairs are <pos, k>, the proportion of synonyms with attribute pair <j, k> is 1/k. Since both a synonym and its substituted one come from the same synset, SS steganography just causes the transitions between attribute pairs having the same second components, leading to \( \sum\limits_{i=0}^{k-1 } {{f_s}\left( {i,k} \right)} =\sum\limits_{i=0}^{k-1 } {{f_c}\left( {i,k} \right)} \), where f s (j, k) represents the number of total occurrences of attribute pair <j, k> in stego text. Finally, the following equations can be deduced:

$$ {f_s}\left( {j,k} \right)=\left( {1-r} \right){f_c}\left( {j,k} \right)+\frac{1}{k}r\sum\limits_{i=0}^{k-1 } {{f_c}\left( {i,k} \right)} $$
(6)
$$ {p_s}\left( {j,k} \right)-{p_s}\left( {h,k} \right)=\frac{{{f_s}\left( {j,k} \right)-{f_s}\left( {h,k} \right)}}{{\sum\limits_{i=0}^{k-1 } {{f_s}\left( {i,k} \right)} }}=\left( {1-r} \right)\left( {{p_c}\left( {j,k} \right)-{p_c}\left( {h,k} \right)} \right) $$
(7)

where p s (j, k) represents the relative frequency of attribute pair <j, k> in stego text, r is the embedding rate. As a matter of convenience, r is measured by the ratio of total number of synonyms containing secret messages to total number of synonyms appearing in a text.

It is noteworthy that Eq. (7) is obtained under the assumption that the above analysis used the same synonym database with the one in SS steganography.

As 0 < r ≤ 1, thus

$$ {p_s}\left( {j,k} \right)-{p_s}\left( {h,k} \right)<{p_c}\left( {j,k} \right)-{p_c}\left( {h,k} \right),\quad j<h $$
(8)

The difference between p s (j, k)−p s (h, k) and p c (j, k)−p c (h, k) depends on the value of r and p c (j, k)−p c (h, k). The larger p c (j, k)−p c (h, k) is, the larger the difference is for the same r. r becomes larger, p s (j, k)−p s (h, k) becomes closer to 0, the difference becomes larger. When r = 1, the difference achieves the largest value.

Figure 2 shows the statistical distributions of p(0, 2)−p(1, 2), p(0, 3)−p(1, 3), p(0, 3)−p(2, 3) of cover and stego texts. 5,622 cover texts were used and denoted as the ones with embedding rate 0% in Fig. 2. For each specified embedding rate, 5,622 stego texts were generated by the SS steganographic tool Tlex, whose source code was slightly modified so that message can be embedded with multi-base coding strategy for any embedding rate.

Fig. 2
figure 2

The statistical distributions of p(j,k)−p(h,k) for different attribute pairs

In Fig. 2, it can be observed that cover texts have higher values of p(j, k)−p(h, k), in comparison to stego texts. Especially, it is easy to differentiate stego texts with high embedding rates from cover texts by the values of p(j, k)−p(h, k), which are concentrated in different ranges. Hence, the statistics p(j, k)−p(h, k) extracted from the relative frequencies of attribute pairs can be taken as a clue to detect secret message existing in synonyms of texts.

3.3 The proposed linguistic steganalysis

The details of the proposed steganalysis using the statistical features derived from distributions of attribute pairs are as follows.

  1. Step1.

    Preprocess the collected synonym database to convert synsets into synonym vectors.

  2. Step2.

    Traverse every synonym in the text to obtain the value of its attribute pair.

  3. Step3.

    Calculate the values of p(j, k)−p(h, k), j < h, j, h ∊ {0, 1, …, k−1} to form a feature vector.

  4. Step4.

    Train an SVM with the above feature vector as an input.

  5. Step5.

    Use the trained SVM to classify stego texts from cover texts.

In general, only synonymous words with the closest senses could be replaced with each other, and each synonym can only exist in a synset for SS steganography. Thus, not all the synonyms of a word listed in every sense are used by steganography. The synset size is always small. The mean synset size of the synonym database used in Tlex is 2.32 words after deleting repeated synsets. And for the synonym database extracted from Wordnet 2.1 [21] and used in our experiments, the mean synset size is 2.53 words. Synonyms in a text recognized by these synonym databases are always coming from small synsets rather than large synsets. In other words, the larger k is, the smaller f(j, k) is. What is more, for a short text, the value of f(j, k) may be very small and even zero when k is large. Under this condition, only small k should be selected to extract informative feature vector. As mentioned above, a larger p c (j, k)−p c (h, k) is beneficial for differentiating p s (j, k)−p s (h, k) from p c (j, k)−p c (h, k). In order to maximize p c (j, k)−p c (h, k), j should be set to zero, since \( Max\left( {{f_c}\left( {j,k} \right)} \right)={f_c}\left( {0,k} \right) \). In conclusion, only the statistical characteristics p(j, k)−p(h, k) whose k = 2, 3, 4, j = 0, h = 1, …, k−1, are extracted to form a feature vector in this work. The final feature vector with six elements is (p(0, 2)−p(1, 2), p(0, 3)−p(1, 3), p(0, 3)−p(2, 3), p(0, 4)−p(1, 4), p(0, 4)−p(2, 4), p(0, 4)−p(3, 4)).

4 Analyzing the impact caused by synonym coding strategies

The analysis in Section 3 is based on the fact that synonyms are encoded independently of their frequencies in existing SS steganographic methods. But, if one considers the synonym frequencies when encoding the synonyms, then the stego texts may be less probable to be detected by the proposed steganalysis methods.

If the synonyms are encoded in frequency order, a determined attribute pair will map into a fixed codeword. For different synsets with the same size, their synonyms having the same attribute pair are encoded as the same codeword. Different coding strategies may encode a synonym into codewords of different lengths. Different codeword lengths would make the probabilities of the corresponding attribute pairs occurring in a text different. Therefore, different coding strategies would result in different statistical characteristics of the attribute pairs in stego texts. In the following, the impact caused by different synonym coding strategies on the extracted feature vector will be analyzed. And the analysis uses the same synonym database with the one in SS steganography.

4.1 Basic coding strategy

For a synset of size k, set k = 2q + m, (integer q>0, m≥0), then only 2q number of synonyms in it are encoded to embed message, and their codewords have the same length. Let the first component of attribute pairs of these 2q selected synonyms form a set denoted as S stego . The remainders form a set S cover . Given a synonym w with attribute pair <pos, k> in a stego text, if w has been embedded message, then pos is one of the integers in S stego with the same probability 1/2q (i.e. \( \frac{1}{k-m } \)), as the codeword length of w is q bits. If posS cover , then w must not be embedded message. Therefore,

$$ {f_s}\left( {j,k} \right)=\left\{ {\begin{array}{*{20}c} {\left( {1-r} \right){f_c}\left( {j,k} \right),} \hfill & {j\in {S_{cover }}} \hfill \\ {\left( {1-r} \right){f_c}\left( {j,k} \right)+\frac{1}{k-m }r\sum\limits_{i=0}^{k-1 } {{f_c}\left( {i,k} \right),} } \hfill & {j\in {S_{stego }}} \hfill \\ \end{array}} \right. $$
(9)
$$ {p_s}\left( {j,k} \right)-{p_s}\left( {h,k} \right)=\left\{ {\begin{array}{*{20}c} {\left( {1-r} \right)\left( {{p_c}\left( {j,k} \right)-{p_c}\left( {h,k} \right)} \right),} \hfill & {j<h,\;j\in {S_{stego }},h\in {S_{stego }}} \hfill \\ {\left( {1-r} \right)\left( {{p_c}\left( {j,k} \right)-{p_c}\left( {h,k} \right)} \right)+\frac{1}{k-m }r,} \hfill & {j<h,\;j\in {S_{stego }},h\in {S_{{\operatorname{cov}er}}}} \hfill \\ {\left( {1-r} \right)\left( {{p_c}\left( {j,k} \right)-{p_c}\left( {h,k} \right)} \right)-\frac{1}{k-m }r,} \hfill & {j<h,\;j\in {S_{{\operatorname{cov}er}}},h\in {S_{stego }}} \hfill \\ {\left( {1-r} \right)\left( {{p_c}\left( {j,k} \right)-{p_c}\left( {h,k} \right)} \right),} \hfill & {j<h,\;j\in {S_{{\operatorname{cov}er}}},h\in {S_{{\operatorname{cov}er}}}} \hfill \\ \end{array}} \right. $$
(10)

If m = 0, then S cover = ∅. For any \( j,h\in \left\{ {0,1,\ldots,k-1} \right\} \), jS stego , hS stego , and \( {p_s}\left( {j,k} \right)-{p_s}\left( {h,k} \right)=\left( {1-r} \right)\left( {{p_c}\left( {j,k} \right)-{p_c}\left( {h,k} \right)} \right) \), which is in accordance with the Eq. (7).

If m ≠ 0, S cover  ≠ ∅, when jS stego , hS stego or jS cover , hS cover , then \( {p_s}\left( {j,k} \right)-{p_s}\left( {h,k} \right)=\left( {1-r} \right)\left( {{p_c}\left( {j,k} \right)-{p_c}\left( {h,k} \right)} \right) \), which is consistent with Eq. (7); when jS stego , hS cover , then the difference between p s (j, k)−p s (h, k) and (p c (j, k)−p c (h, k)) is reduced compared to Eq. (7). Conversely, when jS cover , hS stego , then the difference is increased. Thus, just taking p(j, k)−p(h, k) with jS stego , hS cover as features may increase the difficulty of detecting message embedded by steganography encoding synonyms in frequency order. In the proposed steganalysis, only the components p(0, 3)−p(1, 3), p(0, 3)−p(2, 3) of the feature vector satisfy the condition m ≠ 0. Two integers of 0, 1, and 2 must belong to S stego , so in the worst case, only one of p(0, 3)−p(1, 3,), p(0, 3)−p(2, 3) would satisfy jS stego , hS cover .

Based on the above analysis, just one component of the proposed feature vector fails. Therefore, although the detection accuracy of the proposed steganalysis would be slightly declined, our steganalysis would still be effective in this case.

4.2 Multi-base coding strategy

Since the big integer M converted from the secret message can be regarded as a random integer, then the value of t i is random between 0 and k i −1, when t i =M mod k i . When SS steganography utilizes t i to select a synonym, the selected synonym is random. Thus if a synonym with attribute pair <pos, k> contains secret message, the probability of pos being arbitrary integer between 0 and k−1 is 1/k. In this case,

$$ {f_s}\left( {j,k} \right)=\left( {1-r} \right){f_c}\left( {j,k} \right)+\frac{1}{k}r\sum\limits_{i=0}^{k-1 } {{f_c}\left( {i,k} \right)} $$
(11)
$$ {p_s}\left( {j,k} \right)-{p_s}\left( {h,k} \right)=\left( {1-r} \right)\left( {{p_c}\left( {j,k} \right)-{p_c}\left( {h,k} \right)} \right) $$
(12)

Equation (12) is in accordance with the Eq. (7). The detection performance of the proposed method is not affected by the order of encoding synonyms using multi-base coding strategy in SS steganography.

4.3 Binary tree-based coding strategy

For each synset, a binary tree is constructed. The synonyms in different levels of a binary tree have different codeword lengths. Denote the codeword length of the synonym with attribute pair <j, k> as l jk , \( {l_{jk }}\in \left\{ {1,\ldots,k-1} \right\} \). If a synonym w with attribute pair <pos, k> contains secret message, then the probability of pos being arbitrary integer j between 0 and k−1 is known to be \( {{{1 \left/ {2} \right.}}^{{{l_{jk }}}}} \), and \( \sum\limits_{j=0}^{k-1 } {{1 \left/ {{{2^{{{l_{jk }}}}}}} \right.}} =1 \). Therefore,

$$ {f_s}\left( {j,k} \right)=\left( {1-r} \right){f_c}\left( {j,k} \right)+\frac{1}{{{2^{{{l_{jk }}}}}}}r\sum\limits_{i=0}^{k-1 } {{f_c}\left( {i,k} \right)} $$
(13)
$$ {p_s}\left( {j,k} \right)-{p_s}\left( {h,k} \right)=\left( {1-r} \right)\left( {{p_c}\left( {j,k} \right)-{p_c}\left( {h,k} \right)} \right)+\left( {\frac{1}{{{2^{{{l_{jk }}}}}}}-\frac{1}{{{2^{{{l_{hk }}}}}}}} \right)r $$
(14)

If l jk > l hk , \( \left( {\frac{1}{{{{2}^{{{{l}_{{jk}}}}}}}} - \frac{1}{{{{2}^{{{{l}_{{hk}}}}}}}}} \right)r < 0 \), \( {p_s}\left( {j,k} \right)-{p_s}\left( {h,k} \right)<\left( {1-r} \right)\left( {{p_c}\left( {j,k} \right)-{p_c}\left( {h,k} \right)} \right) \), difference between p s (j, k)−p s (h, k) < (1−r)(p c (j, k)−p c (h, k)) is enlarged. This is beneficial for the proposed method to detect this type of stego texts. Thus, no experiments on this case are conducted.

If l jk < l hk , \( \left( {\frac{1}{{{2^{{{l_{jk }}}}}}}-\frac{1}{{{2^{{{l_{hk }}}}}}}} \right)r>0 \), the difference between p s (j, k)−p s (h, k) and (p c (j, k)−p c (h, k)) is smaller than that in the case of Eq. (7). This will make the detection performance drop. But the impact may be small, since \( \max \left( {\frac{1}{{{{2}^{{{{l}_{{jk}}}}}}}} - \frac{1}{{{{2}^{{{{l}_{{hk}}}}}}}}} \right) = \frac{1}{2} - \frac{1}{{{{2}^{{k - 1}}}}} < \frac{1}{2} \). When k = 2, 3, 4, \( \max \left( {\frac{1}{{{2^{{{l_{jk }}}}}}}-\frac{1}{{{2^{{{l_{hk }}}}}}}} \right)=0,0.25,0.375 \), respectively. It is worthwhile to note that the coding strategy can only encode to make l 02 = l 12 = 1 for k = 2. At this time, Eq. (14) is in accordance with Eq. (7). To make l jk < l hk , synonyms with lower frequencies may be encoded as longer codewords while the ones with higher frequencies are encoded as shorter codewords. Based on this consideration, when k > 2, the worst case for the proposed steganalysis is to construct the special binary tree so that l ik  = i + 1, \( {l_{{\left( {k-1} \right)k}}}=k-1 \), i = 0, …, k−2. The constructed tree is not a complete binary tree any more in this case. In the experiments, a SS steganographic tool called Hsyn, which adopted this kind of special binary tree to encode synonyms in frequency order, has been implemented.

5 Experimental results and discussion

5.1 Experimental setup

A book named “Word Frequencies in Written and Spoken English: based on the British National Corpus” [8] provides several kinds of frequency lists derived from a 100,000,000 word electronic databank. A complete alphabetical frequency list without frequency cut-offs was downloaded from a companion website [9] for this book. With the help of this frequency list, we preprocess the synonym database to obtain an attribute pair sequence for each text.

In experiments, 5,622 texts were used as cover texts. These texts come from one chapter, several chapters, or a whole book, randomly downloaded from the Internet. The embedding capacities of these texts are significantly in a wide range, which helps to objectively evaluate the performance of steganalysis. The cover texts are embedded arbitrary messages by tools Tlex, Bsyn, Ctsyn with four embedding rates, 25%, 50%, 75%, and 100% to generate 3*4 groups of stego texts. The number of stego texts is 5,622*3*4 in total. Tools Bsyn, Tlex and Ctsyn were implemented with basic, multi-base and binary complete tree-based coding strategies respectively to encode synonyms in alphabetical order. In order to train a classifier, 2,000 texts were selected from the cover texts, and 1,000 texts were selected from each groups of stego texts to form a training set. The remainders formed a testing set.

According to the discussions in Section 4, three tools Bsyn-fre, Tlex-fre, Hsyn were implemented, which adopted basic, multi-base, binary tree-based coding strategies respectively to encode synonyms in frequency order. 5,622*3*4 stego texts were generated by these three tools with four embedding rates, 25%, 50%, 75%, and 100%.

For all the above SS steganographic tools, they used the same synonym database as well as in the tool Tlex, denoted as DB#1.

5.2 Detection performance analysis

In the experiments, the SVM with RBF kernel is utilized as the classifier. The software used is LIBSVM2.9 [3]. Receiver operation characteristic (ROC) curve is used to display the detection probability (the fraction of the stego texts that are correctly classified) in terms of the false positive probability (the fraction of the cover texts that are misclassified as stego texts) in this study. And to evaluate the overall goodness of the ROC curve, we use the area under the ROC curve (AUC) [18]. \( AUC=\int_0^1 {{P_D}\left( {{P_{FP }}} \right)d{P_{FP }}} \), where P D is the detection probability, P FP is the false positive probability. AUC is a common summary statistic for the goodness of a predictor in a binary classification task.

The AUCs of our steganalysis with DB#1 are listed in Table 1. Experimental results in Table 1 demonstrate that the proposed steganalysis can effectively detect different SS steganographic tools with various embedding rates. Especially, when the embedding rate is greater than 50%, the detection performance is very good. However, when the embedding rate is 25%, the detection performance is not very good. It is easy to find that the detection performances for some steganographic tools with the same embedding rate are very approximate. In fact, the corresponding ROC curves may be overlapped. Since the difference between detection performances for Bsyn-fre and Bsyn are slightly great, just the ROC curves of our steganalysis for Bsyn-fre and Bsyn are depicted in Fig. 3. It can be seen that the AUCs for Bsyn-fre and Hsyn are lower than those for Bsyn and Ctsyn, respectively. This indicates that the detection performance of our steganalysis is slightly affected by the frequency order of synonyms in a synset when encoding by basic and binary tree-based coding strategies. But the AUCs for Tlex and Tlex-fre are approximate, namely, the detection performance is nearly influenced for synonyms in any order for multi-base coding strategy.

Table 1 The AUCs of our steganalysis with DB#1
Fig. 3
figure 3

ROC curves of our steganalysis with DB#1 against part of SS steganographic tools

5.3 Analyzing the impact on detection performance caused by synonym database

Some current SS steganographic algorithms may measure the suitability of a substitution. These behaviors can be regarded as reducing the embedding rate or the number of synonyms in a database. On the other hand, different researchers may extract different synonym databases from different dictionaries or select different synonym sets to form a database. If the synonym databases used in the SS steganography and steganalysis include different synonyms, then for steganalysis, some synonyms with secret message would not be recognized, or some unrelated words are recognized as synonyms, or the same synonym may be located in a synset different from the one used in the steganography. It is intuitive that using different synonym databases will impact the detection performance of linguistic steganalysis.

Considering that the absolute synonyms have the higher probabilities of being used by all SS steganographic methods, we built an absolute synonym database extracted from the Wordnet 2.1 [21] for windows, called DB #2. Here, the absolute synonyms mean that words are synonymous in any of their senses. The size of DB #2 is smaller than that of DB #1. 24.24% words in DB #2 do not belong to the DB #1, while 53.92% words in DB #1 do not belong to DB #2.

Table 2 gives the AUCs of our steganalysis with DB#2 for the six SS steganographic tools. And the ROC curves of our steganalysis with DB#1 and DB#2 for Bsyn-fre and Bsyn are shown in the Fig. 4. Despite that the detection performance declines trivially compared with the case of using DB #1, the AUCs still maintain relatively high values. When the embedding rate is 50%, the detection performances with DB#2 are still good for Bsyn, Tlex, Tlex-fre, Ctsyn. But when the embedding rate is 25%, the detection performance is worse than that in the case of using DB#1. These results demonstrate that our steganalysis using an absolute synonym database can deliver good performance while lacking the synonym database used in SS steganography.

Table 2 The AUCs of our steganalysis with DB#2
Fig. 4
figure 4

ROC curves of our steganalysis with DB #1 and DB #2 against part of SS steganographic tools

5.4 Detection performance comparison with other steganalysis methods

One of the existing linguistic steganalysis against SS steganography was presented in [12] (denoted as Luo’s method for convenience). This method recognized all synonyms appearing in a text, and made the synonyms from the same synset form a sequence. The sequences containing one word were neglected. In the cover text, a synonym always continuously appeared in a sequence. But in the stego text, synonyms in a sequence would frequently alter to represent different secret message bits. Under this consideration, a statistic characteristic was calculated. This method just extracted statistics from part of the synonyms in a text. If the total occurrences of a word and its synonyms are less than two times, then this word will be ignored. Few sequences containing more than one word will lead to low detection accuracy. This method is not good at detecting the stego text with low embedding capacity.

The AUCs of Luo’s method with DB#1 and DB#2 are given in Tables 3 and 4, respectively. In order to compare the detection performance, in Fig. 5, we depict the AUCs of our and Luo’s method with both DB#1 and DB#2 in bar charts in terms of the embedding rates used by SS steganographic tools.

Table 3 The AUCs of Luo’s method with DB#1
Table 4 The AUCs of Luo’s method with DB#2
Fig. 5
figure 5

The bar chart of AUCs for our and Luo’s methods both with DB#1 and DB#2

From the experimental results, we can see that our steganalysis greatly outperforms Luo’s with both DB #1 and DB #2. The detection performances of our and Luo’s methods with DB #1, which is identical to the synonym database in SS steganographic tools, are better than those with DB #2, which is an absolute database and different from DB #1. The performance of Luo’s method with DB #2 is poor, while our method is still effective. In addition to that, in most situations, the performance of our method with DB #2 is better than that of Luo’s even if the DB #1 is used. However, with very few exceptions, our method may perform worse than Luo’s. Because the performance of our method is affected by some synonym coding strategies and the used synonym database, the AUCs of our method are relatively low while using DB#2 to detect the stego texts generated by Bsyn-fre and Hsyn with embedding rate 25%. In contrast, Luo’s method is not affected by the order of encoding synonyms in a synset, and its performance is significantly influenced by the length of the embedded secret message rather than the embedding rate. Thus Luo’s method has similar performance for Bsyn and Bsyn-fre, as well as for Ctsyn and Hsyn. When the DB #2 is used for detecting the stego texts with embedding rate 25%, the AUCs of Luo’s method are not very low. Consequently, in the case of using DB#2 for detecting Bsyn-fre and Hsyn with embedding rate 25%, it is possible that the AUCs of our method are slightly lower than those of Luo’s, as shown by the results in Fig. 5.

Another steganalysis method for SS steganography is Yu’s method [23]. It defined a suitability function based on the collection frequency of a word and its context to evaluate the suitability of a synonym for its context. A weighted value for each synonym in the text was estimated by the suitability function. As the results of SS steganography, most synonyms with high weighted values would be replaced by the ones with low weighted values. Thus, the expectation and variance of the weighted value sequence in the text composing a feature vector for SVM can be used for distinguishing stego texts from cover texts. However, this feature vector was not discriminative enough. The weighted values of synonyms from not only the same synsets but also the different synsets are so different that the feature vectors of cover texts may have significant difference. On the other hand, compared with the original cover text, only parts of synonyms with hidden message in a stego text were substituted. Although most high weighted values were altered to low ones, a few low weighted values were also changed to high ones. The distance between the feature vectors of texts before and after embedding message was small. The embedding rate was lower; less weighted values were modified by SS steganography. Thus, the stego texts and cover texts were not extremely discriminable by using the feature vector in [20], especially for the stego texts with low embedding rate.

Moreover, according to the detecting algorithm of this method, if the word has n synonyms (including itself), it needs to query the results in Google at least n times, which makes the detection process very time-consuming. What is worse, the program would be forced to terminate whenever Google detects the automated traffic, since automated queries are against their Terms of Service [7]. This method is not suitable for real-time detection of secret message in a text.

Because of the difficulty mentioned above, only parts of the texts in our sample set were selected to test the detection performance of Yu’s method. 1,000 cover texts with relatively small file sizes, and the corresponding stego texts generated by Tlex with four embedding rates were used. The train set was composed of 500 cover texts and 350 stego texts for each embedding rate. Table 5 lists the AUCs for our and Yu’s methods with DB #1 and DB #2. And the corresponding ROC curves are shown in Fig. 6. Experimental results show that our method outperforms Yu’s. The cover texts and stego texts in these experiments having small file sizes result in small embedding capacities, and when the embedding rate is low, the number of modifications caused by embedding operations is minuscule. Thus the values of the AUCs of our method in Table 5 are slightly less than those of Tables 1 and 2.

Table 5 The AUCs of our and Yu’s methods with both DB#1 and DB#2
Fig. 6
figure 6

ROC curves of our and Yu’s methods with DB #1 and DB #2 against Tlex

6 Conclusion

In this paper, the word frequency is taken into consideration to represent the statistical attribute of a synonym. After successfully converting the synonyms within a text into their attribute pairs, differences between the statistical characteristics of attribute pairs in cover and stego texts are analyzed. An effective feature vector is obtained in order to detect the SS steganography. Not only the existing SS steganographic algorithms but also the future possible algorithms using different synonym coding strategies to encode synonyms in frequency order have been investigated. The detection performances of our steganalysis are discussed, and the experimental results show that the proposed method achieves good performance in all the above cases.

When the proposed steganalysis without the knowledge of the synonym database used in SS steganography, it may lead the detection probability of the steganalysis drop. We built an absolute synonym database from WordNet for experiments. Although the results demonstrate that the detection performance of our steganalysis is still good, it is worse than that in the case of using the same synonym database with the one in SS steganography. In the future, more endeavors should be made to improve the detection performance of the steganalysis with an arbitrary synonym database.

Finally, the experimental results have shown that the proposed method has significant advantages over other methods. It is noteworthy that the proposed method is only supported by a preprocessed synonym database. When extracting the feature vector from the text, neither huge corpus nor Google is needed. The speed of the proposed steganalysis detecting a text is fast.