1 Introduction

With the rapid development of information technology, an open environment for information transmission and sharing has quickly taken shape. While this makes daily life more convenient, it also brings a series of security risks. For example, information on the network is susceptible to malicious attacks, illegal access, falsification, plagiarism, etc. [1]. How to ensure the security of multimedia data has become an important and urgent problem in the field of information security.

Steganography is the art and science of concealing confidential information in multimedia carriers. Modern steganography exploits the redundancy of human perception, the statistical redundancy of multimedia data, and other characteristics to embed secret information in a public data carrier through some form of coding or encryption. The cover used to hide a secret message is generally a multimedia file transmitted over the network, such as video, audio, image, or text. Among these, text has become an important cover because it is transmitted quickly and accessed conveniently. In particular, with the rapid development of natural language processing (NLP), text steganography has been greatly developed and refined. Currently, text steganography is widely used in confidential communication, copyright protection, content identification, etc.

In contrast to text steganography, text steganalysis determines whether a given text contains a hidden message and, when possible, extracts the embedded secret information. In recent years, text steganalysis has become a vital research topic in information security, as it is one of the effective ways to prevent criminals from maliciously using text steganography for illegal activities. In addition, it helps ensure safe and covert communication and has important applications in military, intelligence, and government security departments, such as detecting and jamming enemy communication signals, blocking information sources, and conducting information reconnaissance and disruption against an adversary. Almost all information embedding algorithms inevitably change the statistical characteristics of the carrier. The core idea of steganalysis is to use statistical machine learning to model and detect the subtle differences caused by information embedding, so as to identify suspicious stego texts.

The basic model of text steganography and text steganalysis is shown in Fig. 1.

Fig. 1 Basic model of text steganography and text steganalysis [2]

Owing to the significance of text steganalysis for Internet security, it is essential to review and summarize the mainstream text steganalysis methods of recent years. This paper outlines the state of research in text steganalysis starting from 2016. Furthermore, we summarize, compare, and analyse some of these algorithms.

The rest of this article is organized as follows. In Sect. 2, different types of text steganalysis are reviewed, and the techniques and concepts involved in each type are described in detail. In Sect. 3, a comparative analysis of the techniques and approaches is made. Finally, the conclusion is drawn in Sect. 4.

2 Classification of Text Steganalysis Algorithms

In this section, the two categories of text steganalysis, targeted steganalysis and blind steganalysis, are introduced in detail.

2.1 Targeted Text Steganalysis

Targeted text steganalysis is designed to detect one specific text steganography algorithm. That is, the detection algorithm knows which text steganography method was used to embed the secret information. Thus, targeted steganalysis excels at detecting that specific algorithm. However, it may fail badly when facing other steganography algorithms.

The common statistical features used for targeted steganalysis include word-initial distribution, alphabetic cases, contextual information, evolutionary features, and synonym frequency.

Distribution of First Letters of Words [3]. In stego text generated by context-free steganography, words occur randomly, and the probability that a word appears in a given segment of the text depends only on local probabilities. In contrast, in a natural text, words do not occur randomly, and the process of word generation can be viewed as an nth-order Markov process [2]. As a result, the probability distribution of word initials in natural texts is very different from that in context-free texts, as shown in Fig. 2.

Fig. 2 Distribution of probabilities of natural text and context-free text
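The comparison of word-initial distributions described above can be illustrated with a minimal sketch. The function names, the choice of a smoothed Kullback-Leibler divergence as the distance, and the toy sentences are assumptions made for illustration; the statistic actually used in [3] is not specified here.

```python
from collections import Counter
import math
import string

def initial_distribution(text):
    """Relative frequency of each first letter (a-z) over the words of `text`."""
    initials = [w[0].lower() for w in text.split()
                if w[0].lower() in string.ascii_lowercase]
    counts = Counter(initials)
    total = sum(counts.values())
    return {c: (counts[c] / total if total else 0.0)
            for c in string.ascii_lowercase}

def kl_divergence(p, q, eps=1e-6):
    """Smoothed divergence D(p || q) between two letter distributions.

    `eps` is a hypothetical smoothing constant that avoids log(0).
    """
    return sum((p[c] + eps) * math.log((p[c] + eps) / (q[c] + eps))
               for c in string.ascii_lowercase)

# Illustrative usage: a reference distribution from natural text is compared
# with the distribution of a suspicious text; a large divergence is a hint
# that the suspicious text was not generated like natural language.
reference = initial_distribution("the cat sat on the mat and then the cat slept")
suspicious = initial_distribution("zebra quartz xylophone jukebox vortex")
score = kl_divergence(suspicious, reference)
```

In practice the reference distribution would be estimated from a large natural corpus, and the decision threshold on `score` would be tuned on labelled cover and stego texts.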

Stego [4]. Stego is a text steganography tool that uses dictionaries to transform a secret message into grammar-free text with a structure similar to normal text for steganographic communication. By studying the mechanism of Stego, paper [4] proposes a steganalysis method targeting Stego. When the dictionary words used for steganography all start with lowercase letters, the stego text can be detected by a steganalysis method based on sign features. Otherwise, the stego text is detected by a steganalysis method based on statistical features.

Context Information. Article [5] introduces the concept of context clustering to estimate the contextual fitness of a text and shows how to distinguish ordinary text from stego text by evaluating the contextual fitness values of the text. When embedding a message, a Substitution-based Linguistic Steganography (SLS) system replaces an original element in the cover text with another element from the same substitution set. This substitution may leave the new element poorly fitted to the original context. Based on this feature, the paper proposes a steganalysis scheme for SLS, whose specific process is shown in Fig. 3. A text steganalysis method for synonym replacement is then built on this scheme. The average accuracy of this approach is 98.86%.

Fig. 3 Steganalysis scheme directed at substitution-based text steganography [5]. SI: Substitution Information; CI: Context Information; λ: Context Maximum Rate; θ: Context Maximum Deviation

Article [6] proposes a word embedding-based approach to detect secret information in a text. The method uses a continuous Skip-gram model to represent synonyms and their context words as word embeddings, encoding word semantics as low-dimensional dense vectors. The embeddings of the synonyms are used to estimate their contextual fitness, with the estimates weighted by the TF-IDF scores of the context words. By analysing the differences in contextual fitness scores among the synonyms in a synonym set, and the differences in contextual fitness values of synonyms between cover text and stego text, three features are extracted and then fed to a support vector machine (SVM) classifier for steganalysis. The reported improvement of the proposed technique is more than 4.8%.
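A TF-IDF-weighted contextual fitness score of the kind described above might be sketched as follows. The two-dimensional toy embeddings, the TF-IDF values, and the function names are hypothetical, and the weighted mean of cosine similarities is only one plausible reading of the scoring; the exact formula in [6] may differ.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def context_fitness(word, context, embeddings, tfidf):
    """TF-IDF-weighted mean cosine similarity between `word` and its context words."""
    if word not in embeddings:
        return 0.0
    weighted_sum = weight_total = 0.0
    for ctx in context:
        if ctx in embeddings:
            weight = tfidf.get(ctx, 0.0)
            weighted_sum += weight * cosine(embeddings[word], embeddings[ctx])
            weight_total += weight
    return weighted_sum / weight_total if weight_total else 0.0

# Toy data (hypothetical): in a real system the vectors would come from a
# trained Skip-gram model and the weights from corpus statistics.
embeddings = {"big": [1.0, 0.0], "huge": [0.0, 1.0], "house": [1.0, 0.1]}
tfidf = {"house": 2.0}
fit_big = context_fitness("big", ["house"], embeddings, tfidf)
fit_huge = context_fitness("huge", ["house"], embeddings, tfidf)
```

A synonym whose fitness is much lower than that of the other members of its synonym set is a candidate for having been substituted during embedding.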

Evolution Algorithm [7]. Article [7] proposes an evolutionary detection steganalysis system (EDSS) based on the evolutionary algorithm of the Java Genetic Algorithm Package (JGAP). The results of the EDSS can be classified into good adaptation and bad adaptation according to the adaptation value.

Synonym Frequency [8]. Article [8] proposes a steganalysis method for text steganography based on synonym substitution (SS). First, attribute pairs of synonyms are introduced to represent the position of a synonym in its ordered synonym set and the size of that synonym set. Because of synonym substitution, the number of high-frequency attribute pairs decreases while the number of low-frequency attribute pairs increases. Based on this, the changes in the statistical features of attribute pairs under SS steganography are analysed theoretically, and secret information is detected using feature vectors built on the relative frequency differences of different attribute pairs. The paper also analyses the impact of the synonym encoding strategy on feature vector extraction.
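The attribute-pair statistics described above can be sketched in a few lines. This is an illustrative reading only: the pair is taken to be (position in the ordered synonym set, size of the set), the sets are assumed to be ordered by descending word frequency, and the names `attribute_pairs` and `relative_frequencies` are hypothetical.

```python
from collections import Counter

def attribute_pairs(tokens, synsets):
    """Map each synonym token to (position in its ordered synset, synset size)."""
    lookup = {}
    for synset in synsets:  # assumed ordered by descending word frequency
        for position, word in enumerate(synset):
            lookup[word] = (position, len(synset))
    return [lookup[t] for t in tokens if t in lookup]

def relative_frequencies(pairs):
    """Relative frequency of each attribute pair in a text."""
    counts = Counter(pairs)
    total = len(pairs)
    return {pair: n / total for pair, n in counts.items()} if total else {}

# Toy usage: compare the relative frequency of the top-ranked pair (0, 3)
# in a cover-like and a stego-like token sequence.
synsets = [["big", "large", "huge"]]
cover = relative_frequencies(attribute_pairs("a big big large dog".split(), synsets))
stego = relative_frequencies(attribute_pairs("a huge large huge dog".split(), synsets))
```

In the toy example the cover-like text concentrates on the most frequent synonym, while the stego-like text shifts mass to lower-ranked pairs, which is the effect the feature vector is meant to capture.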

2.2 Blind Text Steganalysis

Blind steganalysis does not depend on a specific steganographic algorithm. As a result, it suits a wider range of applications and requirements. Embedding secret information in normal text changes the content of the text to some degree and thereby introduces statistical differences in its textual features. Therefore, the key step in blind steganalysis is to model these subtle differences [9]. As Fig. 4 shows, blind text steganalysis consists of two stages, feature extraction and text classification. Next, we introduce the current mainstream blind steganalysis algorithms according to their model types.

Fig. 4 Standard blind text steganalysis phases
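The two phases, feature extraction followed by classification, can be made concrete with a deliberately small sketch. The two hand-picked features (average word length and type-token ratio) and the nearest-centroid classifier are stand-ins chosen for brevity; they are assumptions for illustration and not the features or classifiers used by the surveyed methods.

```python
def extract_features(text):
    """Phase 1: map a text to a small vector of statistical features."""
    words = text.lower().split()
    if not words:
        return [0.0, 0.0]
    average_length = sum(len(w) for w in words) / len(words)
    type_token_ratio = len(set(words)) / len(words)
    return [average_length, type_token_ratio]

def centroid(vectors):
    """Component-wise mean of a list of feature vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]

def train(cover_texts, stego_texts):
    """Phase 2 (training): store one centroid per class."""
    return {
        "cover": centroid([extract_features(t) for t in cover_texts]),
        "stego": centroid([extract_features(t) for t in stego_texts]),
    }

def classify(model, text):
    """Phase 2 (prediction): return the label of the nearest centroid."""
    x = extract_features(text)

    def squared_distance(c):
        return sum((a - b) ** 2 for a, b in zip(x, c))

    return min(model, key=lambda label: squared_distance(model[label]))

# Toy training data (hypothetical).
model = train(
    ["the cat sat on the mat the cat sat", "the dog and the cat and the dog"],
    ["quixotic zephyr jubilant xylophone",
     "labyrinth obfuscate perpendicular kaleidoscope"],
)
```

The methods reviewed below differ mainly in how they fill these two slots: hand-crafted statistics with AdaBoost or SVM on one end, and learned representations with neural classifiers on the other.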

Text Steganalysis Based on AdaBoost [10]. Article [10] points out that embedding secret information introduces statistical changes into the text. Based on this, a general detection algorithm built on AdaBoost is put forward, which extracts statistical features of the text and distinguishes natural texts from stego texts.

The AdaBoost-based detector correctly recognizes all texts at embedding rates of 2% and 4%, and the recognition rate is also 100% under the other tested conditions. The experiment shows that the detector is almost unaffected by the embedding rate, reflecting the strong classification performance of AdaBoost.
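For readers unfamiliar with AdaBoost, the following is a minimal sketch of the standard algorithm with decision stumps as weak learners. It shows only the boosting mechanism; the statistical text features used in [10] are not reproduced, and the one-dimensional toy data are hypothetical.

```python
import math

def train_adaboost(X, y, rounds=5):
    """Train AdaBoost with decision stumps.

    X: list of feature vectors, y: labels in {-1, +1}.
    Returns a list of (alpha, feature, threshold, polarity) tuples.
    """
    n = len(X)
    weights = [1.0 / n] * n
    ensemble = []
    for _ in range(rounds):
        best = None
        # Exhaustively pick the stump with the lowest weighted error.
        for f in range(len(X[0])):
            for thr in sorted({x[f] for x in X}):
                for pol in (1, -1):
                    preds = [pol if x[f] >= thr else -pol for x in X]
                    err = sum(w for w, p, t in zip(weights, preds, y) if p != t)
                    if best is None or err < best[0]:
                        best = (err, f, thr, pol, preds)
        err, f, thr, pol, preds = best
        err = min(max(err, 1e-10), 1 - 1e-10)  # guard against log(0)
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, f, thr, pol))
        # Increase the weight of misclassified samples, then renormalize.
        weights = [w * math.exp(-alpha * t * p)
                   for w, t, p in zip(weights, y, preds)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble

def predict(ensemble, x):
    """Sign of the alpha-weighted vote of all stumps."""
    score = sum(alpha * (pol if x[f] >= thr else -pol)
                for alpha, f, thr, pol in ensemble)
    return 1 if score >= 0 else -1

# Toy data: one feature, label +1 (stego) for larger values.
X = [[1.0], [2.0], [3.0], [4.0]]
y = [-1, -1, 1, 1]
model = train_adaboost(X, y, rounds=3)
```

Each feature vector would, in a real detector, hold the statistical features extracted from one text, with +1 denoting stego text and -1 natural text.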

Text Steganalysis Based on Statistical Language Model [11]. Article [11] proposes a text steganalysis algorithm based on a statistical language model, which uses the complexity of a given text segment to classify it as natural text or stego text. The algorithm achieves 96.3% recognition accuracy on stego and natural text segments when the segment size is 5 K, and an accuracy above 93.9% when the segment size is 2 K. The experiments also cover the NICETEXT system, the TEXTO system, and text generated by Markov chains, and achieve strong results on all of them.
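One common way to score how well a statistical language model explains a text segment is perplexity. The sketch below uses an add-one-smoothed bigram model; treating perplexity as the "complexity" measure and using a bigram model are assumptions made for illustration, and the model in [11] may be different.

```python
import math
from collections import Counter

def train_bigram(tokens):
    """Collect unigram and bigram counts from a tokenized training corpus."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    return unigrams, bigrams, len(unigrams)

def perplexity(model, tokens):
    """Perplexity of `tokens` under an add-one-smoothed bigram model.

    One extra vocabulary slot is reserved for unseen words.
    """
    unigrams, bigrams, vocab_size = model
    pairs = list(zip(tokens, tokens[1:]))
    if not pairs:
        return float("inf")
    log_prob = 0.0
    for prev, cur in pairs:
        p = (bigrams[(prev, cur)] + 1) / (unigrams[prev] + vocab_size + 1)
        log_prob += math.log(p)
    return math.exp(-log_prob / len(pairs))

# Toy corpus (hypothetical); a real model would be trained on a large corpus.
corpus = "the cat sat on the mat the cat sat on the mat".split()
model = train_bigram(corpus)
natural_score = perplexity(model, "the cat sat on the mat".split())
scrambled_score = perplexity(model, "mat the on sat cat the".split())
```

A segment whose perplexity is far above that of typical natural text would be flagged as likely stego text; the threshold would have to be calibrated per segment size.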

Text Steganalysis Based on SVM [12]. Article [12] proposes an SVM-based algorithm for detecting hidden information. The SVM classifier is trained on normal texts and a small sample of texts carrying hidden information, and its generalization ability is then used to classify unknown texts. The model generalizes well, and the SVM classifier achieves an excellent classification effect.

Natural Frequency Zoned Word Distribution Analysis (NFZ-WDA) [13]. Translation-based steganography (TBS) is a secure text steganography technique that encodes secret information using the noise generated by translating natural language text. The NFZ-WDA method proposed in article [13] aims to detect TBS without using any TBS-related information. The only resource the method relies on is a natural frequency lexicon, a word frequency dictionary obtained from a large corpus. NFZ-WDA uses natural frequency zones (NFZs) to refine word distribution features. Since the refined word distribution features retain more structural information, the improved method can analyse the stego text generated by TBS more effectively. To verify the validity of the NFZ-WDA method, the paper carries out experiments with two-class and multi-class SVM classifiers. The results show that the accuracy of both kinds of detection is comparatively high and increases with text size. Thus, this text steganalysis method has good application prospects.
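The idea of refining word distribution features by frequency zones can be sketched as follows. The zone boundaries, the rank-based lexicon format, and the function name `zone_features` are all hypothetical; the sketch only illustrates partitioning words by their natural frequency and is not the feature definition of [13].

```python
def zone_features(tokens, lexicon, boundaries=(10, 100, 1000)):
    """Fraction of tokens falling into each natural frequency zone.

    lexicon: maps a word to its frequency rank (1 = most frequent word).
    boundaries: hypothetical rank cut-offs separating the zones.
    The last entry of the result counts out-of-lexicon words.
    """
    counts = [0] * (len(boundaries) + 2)
    for token in tokens:
        rank = lexicon.get(token)
        if rank is None:
            counts[-1] += 1
            continue
        zone = sum(rank > b for b in boundaries)  # number of cut-offs exceeded
        counts[zone] += 1
    total = len(tokens)
    return [c / total for c in counts] if total else [0.0] * len(counts)

# Toy lexicon (hypothetical ranks).
lexicon = {"the": 1, "of": 2, "cat": 50, "zephyr": 5000}
features = zone_features(["the", "of", "cat", "zephyr", "blorp"], lexicon)
```

The resulting vector could then be passed to a two-class or multi-class SVM, in line with the experimental setup described above.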

Text Steganalysis Based on Convolutional Neural Network (CNN). Article [14] proposes a CNN-based model for text steganalysis that captures complex dependencies and automatically learns feature representations of the text. A decision strategy for detecting long texts is also proposed to further boost performance. First, the word embedding layer extracts the semantic and syntactic features of words. Second, rectangular convolution kernels of different sizes are used to learn sentence features. The method is not only valid for detecting different types of text steganography algorithms but also achieves excellent results on texts of different sizes.
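The use of rectangular kernels of different heights over a sentence matrix can be illustrated without any deep learning framework. The sketch below applies each kernel over every window of consecutive word vectors, applies ReLU, and keeps the maximum response (max-over-time pooling). The fixed toy kernels and the pooling choice are assumptions for illustration; in [14] the kernels are learned during training.

```python
def conv_max_pool(embedded, kernel):
    """Slide an (h x d) kernel over an (n x d) sentence matrix.

    Returns the maximum ReLU response over all windows, or 0.0 if the
    sentence is shorter than the kernel height.
    """
    height = len(kernel)
    best = 0.0  # starting at 0.0 makes the result equal to max(ReLU(responses))
    for start in range(len(embedded) - height + 1):
        response = sum(
            kernel[r][c] * embedded[start + r][c]
            for r in range(height)
            for c in range(len(kernel[r]))
        )
        best = max(best, response)
    return best

def sentence_features(embedded, kernels):
    """One pooled feature per kernel; kernels may have different heights."""
    return [conv_max_pool(embedded, k) for k in kernels]

# Toy sentence of three words with 2-dimensional embeddings (hypothetical).
sentence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
kernels = [
    [[-1.0, -1.0]],                          # height 1
    [[1.0, 0.0], [0.0, 1.0]],                # height 2
    [[1.0, 1.0], [1.0, 1.0], [1.0, 1.0]],    # height 3
]
features = sentence_features(sentence, kernels)
```

Because pooling collapses the time axis, the feature vector has a fixed length regardless of sentence length, which is what allows a fully connected classifier to follow.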

Article [15] proposes a two-stage CNN-based method for text steganalysis. The first stage is a sentence-level CNN, consisting of a convolutional layer with multiple convolutional kernels of different window sizes, a pooling layer, a fully connected layer with Dropout, and a Softmax output. In this way, the first stage not only handles variable-length sentences but also obtains two steganographic features per sentence. The second stage is a text-level CNN that uses the output of the first stage to determine whether the detected text is steganographic or not. The average accuracy of this approach is 82.245%.

Text Steganalysis Based on Recurrent Neural Networks (RNN) [16]. In automatically generated stego text, embedding hidden information distorts the conditional probability distribution of words. Based on this, paper [16] proposes a text steganalysis algorithm that uses an RNN to extract these differences in feature distribution and then classifies the text as cover text or stego text. The experimental results show that the model not only has high detection accuracy but can also use the subtle differences in text feature distributions to estimate the amount of information embedded in the generated stego text.

Text Steganalysis Based on Word2vec [17]. A Word2vec-based approach to text steganalysis is proposed in [17]. First, a multi-dimensional word vector containing rich semantic information is trained for each word using the distributed word representation tool Word2vec. Then, to calculate the suitability of a synonym in a particular context, the correlation between the synonym and each context word is measured by the cosine distance between their word vectors, which yields the detection features. Finally, the extracted detection features are fed into a Bayesian estimation model for training and testing. The average detection accuracy of the approach reaches 97.71% for stego texts with different embedding rates, which indicates very good detection performance.

Text Steganalysis Based on Convolutional Sliding Windows (TS-CSW) [18]. Word association features in a text are distorted after a confidential message is embedded. TS-CSW is built on this observation and uses convolutional sliding windows (CSW) of multiple sizes to obtain the relevant features of the text. Samples collected from the T-Steg dataset are used in the paper to train and test the proposed steganalysis approach. The model not only performs well in steganalysis but can also estimate the amount of secret information embedded in the stego text.

Text Steganalysis Based on Long Short-Term Memory Networks (LSTM) [19]. To strengthen the low-level features in the feature vector and make better use of them when detecting hidden information in generated text, paper [19] introduces two components, dense connectivity and a feature pyramid, and proposes a text steganalysis approach based on densely connected long short-term memory networks with a feature pyramid. First, the words in the text are mapped to a semantic space with hidden representations to make better use of semantic features; then, semantic features at different levels are extracted using stacked bidirectional long short-term memory networks (Bi-LSTM); finally, the semantic features of all levels are fused, and a Sigmoid layer decides whether the text is steganographic or not. This approach achieves satisfactory results.

Text Steganalysis Based on LSTM-CNN. In article [20], a hybrid text steganalysis method (R-BILSTM-C) is proposed that combines the advantages of Bi-LSTM and CNN. The method captures long-term semantic information of the text using Bi-LSTM and extracts local relationships between words using asymmetric convolutional kernels of different sizes. The detection accuracy is substantially improved. Furthermore, the paper visualizes the high-dimensional semantic feature space. The approach can be applied effectively to different text steganography algorithms.

Article [21] proposes an LSTM-CNN model for text steganalysis. First, the words are mapped to a semantic space to make better use of the semantic features of the text; then, LSTM and CNN are combined to obtain both local and long-range contextual information in a stego text. In addition, the model employs an attention mechanism to identify important cues in suspicious sentences. The model achieves outstanding results in steganalysis tasks.

Text Steganalysis Based on Bi-LSTM-GNN [22]. A highly robust two-stage text steganalysis model is proposed in [22]. In the first stage, Bi-LSTM is used to obtain the feature information of all words in a sentence while preserving the strong correlation between them. In the second stage, multi-sentence vectors are fed into a graph neural network (GNN), from which anomalous features between sentences are extracted. Moreover, article [22] adds adversarial examples to the training set to increase the robustness and generalization of the steganalysis model. The experiments reveal that the model not only has excellent robustness but is also quite effective at identifying steganographic text.

Text Steganalysis Based on Capsule Network [23]. Capsule networks identify the subtle differences between stego texts and normal texts by extracting and preserving the semantic features of the texts. Article [23] uses capsule networks to detect whether a text contains secret information: the text is vectorized using word2vec, and steganographic text generated by RNNs and variable-length encoding is used as the experimental dataset to enhance the generalization of the method. Experiments reveal that the method can reach a 92% correct detection rate for stego text at low embedding rates (1–3 bits/word), which is about 7% better than that of other neural networks; at high embedding rates (4–5 bits/word), the detection accuracy can reach more than 94%.

3 Evaluation

From the above overview of text steganalysis in the past decade, it can be seen that text steganalysis has been steadily changing and improving, from the early targeted steganalysis to the more versatile and effective blind steganalysis.

The advantages and disadvantages of five chosen targeted text steganalysis methods are listed in Table 1. From Table 1, it is clear that the algorithms based on the probability distribution of initial letters, contextual information, and synonym frequency are simple and efficient; among them, the contextual information approach is simpler and easier to implement than the other two. The Stego-based steganalysis algorithm, however, relies on detecting the letter case of word initials, which makes it more restrictive.

Table 1 Comparative analysis of target text steganalysis [9]

As for blind text steganalysis, it has been continuously developed and improved since the CNN-based text steganalysis algorithm in [14]. As can be seen from Sect. 2.2, blind steganalysis has kept improving, from the early use of machine learning algorithms such as SVM, to deep learning algorithms such as CNN, RNN, and LSTM, and to the combinations of LSTM, CNN, and GNN that have emerged in the last two years. The average detection accuracies of blind text steganalysis for stego texts are listed in Table 2. Although deep learning improves the performance of text steganalysis, the computational complexity and time cost of the algorithms are also rising, which has become one of the issues to be solved in the future.

Table 2 Average detection accuracies of blind text steganalysis

4 Conclusion

This paper reviews different types of text steganalysis algorithms since 2006, including targeted steganalysis and blind steganalysis, and compares and analyses the two categories in turn. The study indicates that each steganalysis method has its own advantages and disadvantages. We believe this paper can provide motivation and assistance for future steganalysis research.

As far as current research trends are concerned, the development of NLP has a significant impact on text steganography and text steganalysis, since most of the latest algorithms are inspired by advanced NLP techniques. The most important issue in text steganalysis is to enhance the effectiveness and robustness of detection while reducing model complexity. Therefore, in the near future, building on a clear picture of how text steganalysis has developed, we will address its main problems, closely follow the latest research results in NLP, design new text steganalysis methods, and strive to break through the bottleneck of computational complexity and time cost identified in Sect. 3.