LZW-CIE: a high-capacity linguistic steganography based on LZW char index encoding

Varol Arısoy, Merve

doi:10.1007/s00521-022-07499-5

LZW-CIE: a high-capacity linguistic steganography based on LZW char index encoding

Original Article
Published: 01 July 2022

Volume 34, pages 19117–19145, (2022)
Cite this article

Download PDF

Access provided by Autonomous University of Puebla

Neural Computing and Applications Aims and scope Submit manuscript

LZW-CIE: a high-capacity linguistic steganography based on LZW char index encoding

Download PDF

Merve Varol Arısoy ORCID: orcid.org/0000-0003-2085-1964¹

453 Accesses
4 Citations
Explore all metrics

Abstract

With the effect of digitalization, the transfer of all text documents over the Internet rather than human transmission has increased, and this situation has revealed the idea that text documents can be used as a carrier that can safely store information. Realizing that methods such as word-line shifting, usage of spaces, replacement of the word with its synonym are fragile against steganalysis, led to new searches and it was determined that deep learning models were more resistant to detecting the presence of hidden words. In this study, the text generation based on the information that is wanted to be hidden without a carrier text, both at word and character level, was performed. Arithmetic coding, perfect tree and Huffman coding methods were used as secret information embedding methods in text generation based on word level. In this part of the study, bidirectional LSTM architecture with attention mechanism was created as language model. In text generation based on character level, a new secret information embedding algorithm is created by combining the LZW compression algorithm with the Char Index (LZW-Char Index Encoding) method. The character-level model is created as a result of using the encoder–decoder architecture together with bidirectional LSTM and Bahdanau attention. The proposed method was evaluated from the perspectives of information embedding efficiency, information imperceptibility and hidden information capacity. As a result of the experiments, it was determined that the method exceeded the state-of-the-art performance and was more resistant to steganalysis.

A Comprehensive Review on Deep Learning-Based Generative Linguistic Steganography

TS-CSW: text steganalysis and hidden capacity estimation based on convolutional sliding windows

Article 02 March 2020

High-Performance Linguistic Steganalysis, Capacity Estimation and Steganographic Positioning

Discover the latest articles, news and stories from top researchers in related subjects.

Artificial Intelligence

Use our pre-submission checklist

Avoid common mistakes on your manuscript.

1 Introduction

Advances in the field of digitalization have allowed the scope of transactions carried out on the Internet to expand and the rapid transfer of different types of content. While the mutual exchange of data takes place within seconds, this transfer continues over an open network. Therefore, it is possible for unwanted persons to monitor the content of the data sent and to capture the information. Steganography is a method that emerged to close this gap and to prevent the disclosure of confidential information. In steganography, the new stenographic data obtained by embedding the information, which is wanted to be hidden, in media such as pictures, text, video, and sound are transferred to the other party in a way that will not arouse suspicion. The aim here is to send steganographic data containing confidential information by trying to simulate it as much as possible to its version without confidential information. When this situation is considered in terms of a text document, the stego text containing confidential information and the version of this text as unembedded with information should be very close to each other. In steganography, there is the use of redundant areas on the carrier medium into which information will be embedded. This feature is relatively high in image steganography [1]. It is also easy to modify the image for steganography [2]. Therefore, image steganography is often preferred [3,4,5,6]. Although image steganography is frequently used, text steganography has advantages that make it more preferable. It has less memory consumption compared to other cover media (picture, video, audio), it is faster, and text format is more preferred in interpersonal communication [7]. Due to these advantages, it has recently attracted the attention of many researchers. In addition to its positive features, text steganography has some difficulties. There is less amount of redundant embedding area compared to other cover carriers. It also requires the use of complex natural language processing technologies. For this reason, steganography text is characterized as a challenging task [8].

Text steganography is basically divided into three, as format based, content based and random-statistical generation based, according to the method of embedding confidential information (Fig. 1). In format-based methods, text is usually treated as a specially encoded image. In order to hide the information, the formal features of the document such as paragraph format, font format, punctuation mark, white space, and non-printable character are used [9, 10]. Format-based steganography depends on the characteristics of the written language of the text in which information hiding is to be performed. While this method may produce good results for some languages, satisfactory results cannot be obtained for others. In this method, the length of the data to be hidden is taken into account [7]. It is also often difficult to detect visually [1]. In the literature, there are format-based steganography studies using techniques such as character spacing [11], word wrapping [12], and character encoding [13,14,15,16,17,18,19,20]. In the line and word wrapping method, which is one of the methods used in the literature, the word or line is shifted up and down in order to create unique spaces for information hiding. In this method, hidden information is protected during printing, but there is a possibility of damaging confidential information by changing the format setting of the document [21].

In the study in [22], punctuation was used to hide bits. In [23], the spelling of the words was used to hide the information. Format-based steganography method has a high capacity for hiding information, but its disadvantage is that it is vulnerable to text rewriting attacks [24].

In the Random and Statistical Generation subcategory of text steganography, statistical properties of the language on which data hiding will be performed and are extracted and these properties are used to create a cover text. However, the stego text created according to this method appears as a repetitive word/character sequence. This increases the probability of attracting the attention of the reader. In addition, the computation time increases [25, 26]. Another example of statistical text stenography appears in the study in [27]. Here, a method based on the Markov model and the half frequency is discussed, using the statistical properties of natural language. In the half frequency crossover rule based on the Markov model, the texts for the training of the model are integrated with various language rules. They benefited from two different databases, as it was their priority to increase embedding efficiency and capacity.

In the study in [28], they applied a stenography study on Arabic language by integrating the Markov chain encoder/decoder with the Huffman coding algorithm in order to increase the capacity of the hidden information in the statistical steganography method.

The work in [29] is an example of statistical text steganography based on the RSA algorithm and aimed at improving security and accuracy. Accordingly, the data are encrypted by creating defects on the appearance of the user data. The secret message to be transmitted was subjected to multilevel encryption using both private and public keys. Thus, they made the text more resistant to cyber threats and security breaches. Since it depends on the size of the secret message to be transmitted, the message encoded with this method is limited in size. The study in [30] mentions a statistical text steganography technique that uses a combination of cryptography and steganography to prioritize security. They used the data encryption standard (DES) in their methods. The frequency of the letters in the cover text has been a factor in determining the position of the bits to be hidden.

In the statistical text stenography proposed in [31], the bits of the secret message are reduced by generating metadata and adding header information to the first few bytes of the cover content, as well as mapping between the ASCII values of the character strings and their corresponding binary value. In this way, it has been evaluated that the capacity of embedded confidential information can be increased. In the next step, the secret message is stored in bits of the cover text.

In the study in [32], they mentioned a statistical text steganography model, which is based on multi-rule language techniques and does not contain carrier cover text. In their models, different language principles were combined alternately and more language features were tried to be extracted from the training text.

In the study in [33], Omega network structure was combined with part of speech (POS) considering the principle of replacing verbs and nouns in the cover text with verbs and nouns in the content of the secret message. In the study in [34], which is in the "Random and Statistical Generation" sub-branch, letter frequency and word length were the two statistical items they used. Here, they created a stego word by utilizing a codebook of mappings between hidden bit sequences and lexical items.

As a result of the advances in NLP, the studies in the field of text steganography have started to shift towards automatic steganographic text generation rather than the formal arrangement of the carrier text [2]. In this branch of text steganography called linguistic steganography, secret information is embedded in the content of the text [8]. In the category known as Text Modification of this branch, which includes studies in 3 separate subcategories, while preserving the semantic value of the original cover text, linguistic transformations equivalent to the words in the cover text are used to hide the message. These conversions include techniques such as syntactic conversion [33, 34], synonym substitution [35,36,37,38], misspelled word substitution [39], and phrase paraphrasing [40]. This type of linguistic steganography allows for high imperceptibility but limited capacity of information embedding. In addition, when the original cover text and the stego with confidential information are compared, there are deviations in the statistical and linguistic features of the stego text. This makes it easier to discover the existence of confidential information when linguistic steganalysis technology is used [8]. An example of text modification-based steganography can be seen in the study in [41]. The approach adopted here is based on replacing the characters of the secret message that is intended to be placed in the document with the characters of the carrier text. The robustness of the steganography was achieved by applying the multilayered coding concept, which includes block coding, partially homomorphic encryption and alphabetic transformation. In [42], antonym word substitution was used in text steganography instead of synonym word substitution, a method previously used frequently.

Modification-based steganography using the characteristics of the chosen language is also seen in the study of [7]. Here they developed an application for hiding information in Arabic texts. They tried to embed information in these fields by using the extension character of the Arabic language called Kashida and the small space character.

In the study in [43], they investigated information hiding within word documents. To this end, they redesigned the change tracking technique on the Microsoft Word document to increase embedding capacity and reduce any intermediary's suspicion of the existence of the message. They have hidden the secret message in bit format using the synonym substitution method. With their proposed method, they aimed to protect the word document from any intermediary suspicion by making the word document look quite normal.

Coverless steganography has emerged as a result of attempts to be protected from steganalysis attacks. In this method, no changes are made in the text of the carrier cover [8]. Information can be hidden by producing new texts in accordance with the statistical properties of natural language [44]. There are studies in the literature using this technique. The study of [45] is an example of the application of coverless text steganography. Here, an application based on the principle of dividing the confidential information into keywords and removing the location tag of each keyword and combining these tags and keywords was carried out. In addition, in the study in [46], coverless steganography application was performed by using PCCSN (Strike Number Parity of Chinese Characters). In the study in [47], they developed an application that bases coverless text steganography on the Markov chain model. They used the maximum variable bit embedding method instead of the usual fixed bit embedding according to the property and value of the transition probability of the model they created. In addition, in another study of [48], they created binary-transition sequence diagrams based on the transition probability concept of the Markov chain model and used these diagrams to generate new texts containing confidential information. In the study in [49], they proposed a coverless steganography method based on the properties of word association. Accordingly, a word node tree is created by using word association features between the parties to be communicated mutually. Then, the transmission path of the word node tree is coded to embed the secret information, and the mapping relationship between this path and the texts is established. Coverless steganography method is resistant to steganalysis attacks, but its information embedding capacity is quite low [8].

Steganography based on text generation has been developed in order to overcome the low-capacity information insertion constraint in text modification and coverless types of linguistic steganography. In this method, a carrier cover text is not needed. There is text generation based on hidden knowledge by using language models. The generation-based steganography method consists of a model that performs text generation and an embedding algorithm that allows hidden information to be embedded into the text during generation [50]. The advantage of the method is that there are more positions to place the hidden information in the generated stego text and there is no upper limit for the length of the generated text [8]. Therefore, the steganography method based on text generation has a greater data hiding capacity. In the literature, there are text generation-based steganography proposals developed in recent years [51,52,53,54]. When the recent studies are examined, a LSTM-based steganography model based on text generation was proposed in the study conducted in [51] and the language model that could not be developed sufficiently with the Markov chain method [55] was improved. Since the developed study also adopts the steganography style based on text generation, the details of the studies made using this method will be mentioned in the literature section.

In this study, text generation-based data hiding has been carried out. For this, 2 different language models were created, but these models were trained by using the same datasets. The first model generates word-level based text. Three different data embedding algorithms, namely arithmetic coding, perfect binary tree, and Huffman coding, were applied on this model. In addition, 3 different sampling types were used to predict the next word/character, namely top-k sampling, temperature sampling and greedy search in both models, and a comparison was made between the stego texts generated for each sampling type in terms of imperceptibility metric.

The model created in the study generates text based on confidential information. In other words, the generation of text and the placement of confidential information were carried out simultaneously. Therefore, a carrier cover text was not used. The second created model generates character-level text. On this model, a new form of embedding was created by using the LZW compression algorithm together with the Char Index method, and stego text was generated based on this. The greatest contribution of the study manifests itself in this respect. In the study, the purpose of creating two different types of models (word-level attention-based bidirectional LSTM and Char-Level encoder–decoder Bahdanau attention mechanism) is to make a comparative evaluation between character and word level-based models in terms of running speeds, information embedding rates, imperceptibility of the generated stego text and steg-analysis resistance metrics. At the same time, it is aimed to improve the working speed and to increase the rate of embedded confidential information with the use of the character-based generation model. The motivation behind the realization of this study is that the LZW compression algorithm was not used in the generation-based and confidential information-based text generation before in the literature until the study was conducted. Therefore, it is thought that the study will bring an innovation to the literature.

The contributions of the study to the literature are listed below.

The fact that the study makes comparisons among the stego texts generated by applying Huffman coding, arithmetic coding and perfect binary tree embedding algorithms separately to the bidirectional LSTM attention mechanism network that produces word-level text, on the basis of the metrics listed above
At the same time, the fact that 3 different sampling types, namely top-k sampling, temperature sampling, and greedy search, be used on the created language models (word level–character level), thus providing the opportunity to make comparisons on the quality of stego texts on the basis of sampling.
The generation of stego text by applying a new embedding algorithm created as a result of the integration of the LZW compression algorithm with the Char Index method, on the model with the Bahdanau attention mechanism that generates character-level text.

The remainder of the article is organized as follows. First, current relevant studies are introduced in chapter 2. Then the framework and main modules of the proposed method are explained in the chapter 3. The details of data hiding, and data extraction algorithms are discussed in chapter 4. The results obtained as a result of the trials are evaluated on the basis of the metrics used in chapter 5. Finally, the conclusion of the article is given in chapter 6.

2 Related work

2.1 Steganography based on automatic text generation

Since text steganography contains less unnecessary information in text files compared to image or sound files, and therefore having less embedding area, it is considered to be a more challenging task than other types of steganography [7]. In order to overcome this obstacle, many methods based on text modification and not containing carrier text have been proposed. However, as a result of the application of these methods, sufficient efficiency could not be obtained due to the inability to resist steganalysis attacks or the low information embedding capacity of the generated stego texts, steganography based on text generation, which is considered as a more promising method, has been developed. In the steganography method based on text generation, generation is made on a pre-created language model based on the information desired to be hidden without the need for a carrier cover text.

Previous research was on performing steganography based on automatic text generation using the Markov chain method. Language models of this method have been used in many studies [9, 55,56,57,58,59]. In the Markov-based language model proposed in the study in [57], it is aimed to ensure that each sentence generated contains an equal number of confidential information. However, ignoring the transition probability difference has led to unsatisfactory results. Similarly, [56] used the Markov chain model to construct Ci-poetry, a classical Chinese poetry style.

Yang et al. [9] combined Markov model and Huffman coding to overcome the quality degradation of stego text caused by the fixed-length coding method in their study. During the text generation, they dynamically updated the Huffman tree structure and performed the hidden information placement process.

Although a certain level of quality has been achieved in the stego texts generated in text steganography using the Markov model, the method still has some shortcomings. The fact that the texts generated by this method are not natural enough makes it weak in resisting steganalysis attacks [8]. In order to overcome this deficiency, new searches have been made and artificial neural network-based language models have been started to be developed. Accordingly, in [51]'s study, they first generated stego text by blending the language model trained with the LSTM network with the fixed-blocked data embedding algorithm. Here they split the secret information into blocks of bits and shared a key that assigns each of these bit blocks to tokens. Accordingly, they denoted each bit block with one of the corresponding tokens.

Subsequently, [1] generated stego text based on confidential information based on RNN (recurrent neural networks) in their studies. They used perfect binary-based fixed-length coding and Huffman-based variable-length coding algorithm as embedding algorithms. [50] investigated stego text generation on a sampling-based basis and used arithmetic coding, fixed-length and variable length coding methods as embedding algorithms in their studies. They tried to control the embedding rate by determining the constant “K” value in the sampling method used to select the next most probable word in fixed-length and variable length coding algorithms through a word filtering strategy based on Kullback–Leibler divergence (KLD). In arithmetic coding, they adopted the temperature-based sampling type.

Another study using arithmetic coding as a form of embedding belongs to [60]. Here, they proposed an algorithm that selects the minimum value at which K in top-K sampling, which is the sampling type they use, achieves the required imperceptibility.

In the study, [54] have developed a steganography system that aims to deceive both statistical and human eavesdroppers by combining a pretrained large language model with an arithmetic coding-based steganography algorithm.

Lingyun et al. [8] developed an LSTM-based language model and generated character-level text in their studies. They generated a large number of stego texts simply by changing the beginning of the feed information given to the model and using the same hidden information. Afterwards, they designed a selection strategy in order to find the best quality among the candidate stego texts generated. Similarly, in the study of [2], stego text based on confidential information was generated by developing an LSTM-based language model with attention mechanism, and fixed-length coding (FLC) and variable-length coding (VLC) algorithms were used as embedding method in order to place confidential information in the text.

Steganography was not only used to create plain texts, but also poems were used for the purpose of carrying confidential information [61]. created a network model based on template constrained generation with encoder–decoder architecture to produce classical Chinese poetry containing confidential information in their study. Accordingly, they used the LSTM encoder–decoder model to create the first line of the stanza with a keyword, and then they created the remaining lines one by one.

Zhou et al. [62] proposed the linguistic steganography method based on adaptive probability distribution and GAN (generative adversarial network) network in their study. The GAN-based model they generated was developed to be protected from high-dimensional steganalyzers. They performed data hiding during training by combining information embedding algorithms with feature function.

Yang et al. [63] tried to bring a different approach in their research by examining the sampling type used in steganography text generation. In their method called categorical sampling, while predicting the next word to come, they sampled a whole dictionary k times to create a candidate word pool (candidate pool, CP) consisting of k words. In other words, they took the word with the highest probability of k each time and threw it into the candidate word pool, and by sampling again, they determined the other word with the highest probability this time. This process was repeated k times. They embedded the secret information by applying arithmetic coding on the candidate word pool they created. As their study, which they claim to provide statistical imperceptibility, will impose a heavy computational load in cases where large size corpus structure is used and large size k value is selected, it is obvious that it will be difficult to implement in practice.

Yang et al. [64] talked about a structure they called VAE-Stega in order to solve the problem of perceptual and statistical imperceptibility in stego texts in their another study. In the encoder stage of the structure they created here, they learned the statistical distribution of normal texts, and in the decoder stage, they generated steganography sentences that fit both the statistical language model and the general statistical distribution of normal sentences.

2.2 Steganography based on compression of secret data

Compression algorithms have been used in the literature to increase the amount of confidential information placed inside the carrier cover text.

In [65], they conducted a steganography study on e-mail texts using the combinatorial compression method. They used a combination of burrows wheeler transform (BWT), move to forward (MTF) and LZW coding algorithm to increase the capacity of the information hidden in their proposed method. They also made use of the character count of the email ID to indicate the hidden bits. Their method adds some random characters before the “@” symbol of email IDs to increase randomness. In another study of [66], they have done research on the e-mail steganography, this time by using Huffman compression algorithm.

Tutuncu and Hassan [67] used lossless compression techniques and Vigenere cipher in their e-mail-based text steganography study. They used their e-mail addresses to insert and remove the secret message. After choosing the carrier text with the highest repetition pattern for the secret message, they created the distance matrix. They used run length encoding (RLE), burrows wheeler transform (BWT), move to front (MTF), run length encoding (RLE) and arithmetic encoding as compression algorithms. They used Latin square and Vigenere ciphers to generate stego keys. Using these generated keys, they placed the confidential information in their e-mail addresses.

In the study in [68], they examined the capacity and security problems in stego text generation using the embedding algorithm they obtained by combining the LZW compression algorithm and color coding. They applied their methods on the e-mail environment and tried to embed information in the e-mail address and message content. They placed the hidden data bits in the message by coloring them with the help of a color-coding table. They used the LZW compression algorithm to increase the rate of confidential information embedding.

Adaptation of the use of compression algorithm to text steganography is also seen in the study of [36]. Here, they reduced the length of the information to be hidden through the word indexing compression algorithm (WIC), and they selected the best stego text with high imperceptibility with the stego text selection strategy. They provided the process of placing the confidential information in the carrier text by using the synonym word substitution method.

In the study in [69], they studied e-mail steganography, which is a sub-branch of text steganography. In their study, they concealed the secret message within the e-mail addresses created through the e-mail body. In their proposed scheme, they first converted the secret message into a bitstream using the LZW algorithm and then embedded the resulting bitstream into the corresponding recipient addresses using steganography keys.

3 Preliminaries

Under this title, a brief information is given about the technologies used during the creation of the text generating model.

3.1 Generation based text steganography

In steganography based on text generation, generation is provided based on the information desired to be hidden. Therefore, the need for a carrier text has disappeared. In this type of steganography, there is a need for a trained model and embedding algorithms to place the secret information.

The texts consist of a combination of a number of words. While each sentence in the text is expressed with $= \left\{ {W_{i} } \right.\}_{i = 1}^{n}$, the Wi here denotes the words in the sentence. The probability of a word being included in a sentence in text generation is related to the conditional probability distributions of all previous words. This situation is expressed as in Eq. (1) [2].

$$\begin{aligned} {\text{Prob}}\left( S \right) & = {\text{Prob}}\left( {w_{1} ,w_{2} , \ldots w_{n} } \right) \\ & = \mathop \prod \limits_{i = 1}^{n} {\text{Prob}}(w_{i} |w_{i - 1} ,w_{i - 2} , \ldots w_{n} ) \\ \end{aligned}$$

(1)

The Probe(S) value in the above equation should be kept as high as possible. Ideally, this value is expected to be 1. In order to generate the "S" sentence, first a word is given to the system as feed data, and then the word with the highest probability that can come after this word is tried to be guessed. During the calculation of this new word to be generated, a new word is selected by creating a candidate pool from all possible words. This process continues in the same way throughout the generation of the entire text [2].

Generation-based steganography was adopted in the study. A model was created by using the LSTM network with the Bahdanau attention mechanism. Turkish newspaper articles, poems and various articles in the Kaggle database were used as corpus. Perfect binary tree, Huffman, arithmetic coding and LZW-CIE algorithms proposed in the study were used to embed hidden information during text generation.

3.2 LSTM network

Recurrent neural network (RNN-recurrent neural network) is incapable of dealing with long-range dependencies due to the vanishing gradient problem [8]. For this reason, the LSTM network, which is more efficient in finding and using the long-range context and widely used in sequence problems, is preferred. An LSTM network is shown as in Eq. (2).

$$\left. {\begin{array}{*{20}l} {I_{t} = \sigma \left( {W_{i} \cdot x_{t} + U_{i} .h_{t - 1} + b_{i} } \right)} \hfill \\ {F_{t} = \sigma \left( {W_{f} \cdot x_{t} + U_{f} \cdot h_{t - 1} + b_{f} } \right)} \hfill \\ {C_{t} = F_{t} \cdot C_{t - 1} + I_{t} \cdot \tanh \left( {W_{C} \cdot x_{t} + U_{c} \cdot h_{t - 1} + b_{c} } \right)} \hfill \\ {O_{t} = \sigma \left( {W_{o} \cdot x_{t} + U_{0} \cdot h_{t - 1} + b_{o} } \right)} \hfill \\ {h_{t} = O_{t} \cdot \tanh \left( {C_{t} } \right)} \hfill \\ \end{array} } \right\}$$

(2)

The LSTM network takes $x_{t}$, $h_{t - 1} ,C_{t - 1}$ as input at t. time and then calculates $h_{t} ,C_{t}$ values. $I_{t } , F_{t} , O_{t}$ shows input vector, hidden vector, and output vector, respectively, at t. instance. $C_{t}$ is a cell activation vector. These four vectors are the same dimension with hidden vector $h_{t}$. When t = 1, $h_{0}$ and $C_{0}$ are initialized to the zero vector. $\sigma$ is the logistic sigmoid function. While W_i, W_f, W_c, W₀, U_i, U_f, U_c, U₀ represent the weight matrices to be learned, b_i, b_f, b_c, b_o are the bias vectors to be learned [8].

In RNN, the short-term memory will continue to multiply h and then the gradient will disappear. In LSTM, accumulation is used instead of multiplication, so the gradient disappearance problem is solved. For this reason, LSTM is preferred more than RNN in the field of language modeling [8].

Bidirectional LSTM, on the other hand, is an extension of the traditional LSTM network and is preferred to increase the performance of the developed language model. Here, the input array is trained in two directions (forward and backward). Thus, the model can also learn information about the previous and next words [70].

3.3 Bahdanau (additive) attention

A model with the attention mechanism detects words that need special attention or are determined to have more importance in the sentence among the words given as input and tries to predict the next word based on these words. Thus, the texts generated have a more semantic and logical structure.

This mechanism attempts to capture useful context information between words, regardless of the distance between the input and target words. Context information from the encoder is efficiently discovered to provide the expected context information for the decoder. It is important to use the attention mechanism to obtain a more advanced language model and to make a more effective content selection [71]. In language models without this mechanism, the next word is predicted in order.

Bahdanau (additive) attention mechanism was firstly applied in the field of neural machine translation by [72] via creating encoder–decoder architecture. In traditional language models, in order to generate the hidden vector (h₁,h₂, …,h_n) of each word in the input and a content vector (Ci) during the prediction of the target word, it is converted linearly and then transferred to the decoder. When the attention mechanism is used, the hidden vectors of all the words in the input are taken into account to create the content vector C_i. In other words, C_i is determined by the weighted average of all hidden vectors (h₁,h₂,…..,h_n) and their attention weights $\left\{ {a_{i,j} {|} 1 \le i \le M, 1 \le j \le N{\text{\} }}} \right.$. $a_{i,j}$ is calculated with the current hidden vector of the target word ($z_{i}$) and other hidden vectors (h₁,h₂,h_n) [2]. Attention mechanism is applied as in Eqs. (3) and (4).

$$c_{i} = \mathop \sum \limits_{j = 1}^{N} a_{i,j} h_{j}$$

(3)

$$a_{i,j} = {\text{align}}\left( {z_{i - 1} ,h_{j} } \right) = \frac{{\exp \left( {{\text{score}}\left( {z_{i - 1} ,h_{j} } \right)} \right)}}{{\sum_{k} {\text{score}}\left( {z_{i - 1} ,h_{k} } \right))}}$$

(4)

The score (${\text{score}}\left( {z_{i - 1} ,h_{j} } \right)$) function in the equation in (4) scores how well the inputs around the j position and the output at the i position match [72, 73]. The scoring function is expected to give probabilities of how important each latent state is for the current time step. Finally, each encoder's latent state is multiplied by its corresponding score, and they are all added together to obtain the context (content) vector [74]. In Fig. 2, a visual of the application of the attention mechanism is given. This mechanism is used in the character-level language model architecture in the proposed study.

4 Proposed method

In this study, stego text generation which also guarantees high-capacity message hiding and which has a natural appearance without revealing the presence of hidden text during transmission of stego text over a supervised channel was performed. The proposed method generates stego text based on the secret message. On the language model created by combining the bidirectional LSTM network with the attention mechanism, 4 different hidden information embedding algorithms, namely perfect tree, Huffman, arithmetic coding and LZW, are applied. In addition, it is aimed to make a comparative evaluation in terms of imperceptibility metrics of stego texts generated by applying top-k, temperature, greedy search sampling types during text generation.

In the study, two different models that produce word-based and character-based text were used. Perfect tree, Huffman and arithmetic coding were used as the secret information embedding algorithm in the word-based generation model. In the character-based model, a new secret information embedding algorithm is created by combining the LZW compression algorithm with the Char Index method. The contribution of the study to the literature is the creation of a new embedding algorithm. Figure 3 presents a framework for the confidential information embedding structure of the study.

Under this title, the details of the proposed framework structure, which consists of 3 main modules, namely the automatic text generation module based on confidential information, the information embedding module and the confidential information extraction module, are given. In the module that generates text based on hidden knowledge, bidirectional LSTM model with attention mechanism trained using Turkish corpus generates text by using top-k, temperature, greedy search sampling types, while also taking into account the conditional probability distribution. In the confidential information embedding module, 4 different embedding algorithms were applied, namely perfect tree encoding (fixed-length encoding, FLC), Huffman encoding (variable-length encoding, VLC), arithmetic encoding and LZW-CIE. Confidential information extraction module covers transactions performed on the receiver side. The embedding algorithm used by the sender to place the confidential information is applied in the same way on the receiver side. In other words, if AC encoding is used during the insertion of confidential information, the same algorithm is also used in the extraction phase.

4.1 Automatic text generation module based on confidential information

Text generation is provided through the trained model by using Turkish corpus. The ability of the LSTM network to process sequential signals was utilized in the creation of the model. Text generation was performed on two different models, both word level and character level.

4.1.1 Word-level text generation

In order for the model, which produces text at the word level, to produce texts with high semantic value, the bidirectional LSTM network used in the architecture of the model is supported by the attention mechanism. Then, the model trained using Turkish corpus generated words based on bitstreams of the secret message. In the study, the word-level language model was used in an integrated manner with the perfect binary tree (FLC), Huffman (VLC) and arithmetic coding (AC) methods used in the embedding phase.

In the attention mechanism implemented on the word-level language model, the Attention class, which is derived from the default “Layer” class of the Keras API, is used. In other words, a custom attention class is added to the model as a layer. The architecture of the word-based model is given in Fig. 4.

4.1.2 Character-level text generation

The character-level language model is used in an integrated manner with LZW-CIE (LZW-Char Index Encoding) embedding method to predict the next possibly coming letter. A character-level language model was created in order to both improve the working time during the training of the model and to provide more capacity to place confidential information. The characters of the feed information given to this model are representative characters corresponding to the secret message. Therefore, stego text is generated based on the secret message. Bidirectional LSTM network with encoder–decoder architecture used here is supported by Bahdanau attention mechanism. Figure 5 gives the architecture of the character-based model.

The calculation of the Context vector in the figure above is shown in (5) [72].

$${\text{context}} = \mathop \sum \limits_{{t^{\prime} = 1}}^{{T_{x} }} \alpha \left( {t^{\prime}} \right)h\left( {t^{\prime}} \right)$$

(5)

The $\alpha \left( {t^{\prime}} \right)$ in the above equation represents the attention weights. In (6), the calculation of attention weights is given. $h_{{t^{\prime}}}$ value indicates encoder status outputs [72].

$$\alpha \left( {t^{\prime}} \right) = {\text{NeuralNet}}\left( {\left[ {s_{t - 1} ,h_{{t^{\prime}}} } \right]} \right), t^{\prime} = 1 \ldots T_{x}$$

(6)

The input of the attention layer in Fig. 2 is the output of each LSTM unit (encoder unit) and the output of the decoder from the previous time. At each time t step of the decoder, the level of attention given to the hidden encoder unit $h_{j}$ is indicated by $\propto_{tj}$ and the latent state of the decoder at the previous time (${s_{\left({t - 1} \right)}}$) is calculated as a function of $h_{j}$. The Softmax function is used in the last step to normalize attention values [72].

$$e_{tj} = \propto \left( {h_{j} ,s_{t - 1} } \right),\forall j \in \left[ {1,T} \right]$$

(7)

$$\propto_{tj} = \frac{{\exp \left( {e_{tj} } \right)}}{{\mathop \sum \nolimits_{k = 1}^{T} \exp \left( {e_{tk} } \right)}}$$

(8)

4.2 Information embedding module

The process of placing the secret message in the text was carried out in 4 different ways. Under the subheadings of this section, the details of each embedding method applied in the study are mentioned.

4.2.1 Embedding by perfect binary tree encoding (fixed length encoding, FLC)

In the perfect binary tree structure, all internal nodes (node) have 2 child nodes, and all leaf nodes are at the same level. The fact that all leaf nodes have the same depth level ensures that each word in these nodes is expressed in equal bit length. For example, the word “school-okul” in Fig. 6 is represented by the binary code “000” from root to leaf, while the word “book-kitap” is represented by the code “101.” In other words, all leaf nodes are represented by a fixed-length binary code. Since the depth (d) of the example tree in Figs. 6 is 3, the leaf nodes are also 3 bits long. If the depth of the tree is different, for example 2, the leaf nodes will be represented by 2 bits. In this study, each leaf node is expressed in fixed-length coding at different embedding rates of 1,2, 3,4 and 5 bits. An example perfect binary tree coding is given in Fig. 6.

In the embedding module of the secret message using the perfect binary tree, a candidate word pool is needed to find the word corresponding to the message. In the creation of this word pool, the conditional probabilities of the words were taken into account. Accordingly, each word has a conditional probability distribution in form of $p(w_{n} |w_{1} , w_{2} , \ldots w_{n - 1} )$. At the stage of predicting the next word that may come to the trained language model, all probability values are ordered from largest to smallest, and by taking the highest “m” of these probability values, a candidate word pool was created from the words corresponding to the probability values. Then, using perfect binary tree coding, it was ensured that each word in the candidate word pool has a binary code value. The entire perfect tree structure is scanned from root to leaf node in order to find out whether there is a binary value equal to the k- length bitstream of the secret message. In case of equality, the word corresponding to the current bitstream value became the word in the leaf node where binary code equality is achieved. In other words, the bitstream value and the word in the leaf node have been replaced. The word corresponding to the current bitstream was added to the seed words given to the model, and this time the model was made to predict again with new feed words. In each new prediction of the model, all probability values were ordered from highest to lowest, and this time a new candidate word pool was created by taking the first m of them. In other words, the words in the word pool are renewed after each embedding process to ensure that they have a dynamic structure. The value of m here is found by taking the power of 2 to the power. It is in the form of $m = 2^{k}$. The k value represents the binary code length of each word. This entire process in the information embedding process was continued until the bitstreams of the secret message were finished. If the feed sentence given to the model was not completed yet after all the hidden information was embedded, the next highest probability word was determined by the model and the text generation continued. Text generation based on confidential information can be repeated using any number of loops. This not only ensures the generation of the desired length of text, but also increases the capacity of the embedded message by placing confidential information at many different points of the generated text.

The first word of the feed sentence, which was given to the model to predict the next word, was obtained by randomly choosing among the 100 words with the highest frequency in the corpus used. The remaining words of the feed sentence were randomly selected among the words used in the training of the model. The method followed during the creation of the feed sentence is valid for cases where perfect binary tree, Huffman and arithmetic coding methods are used to embed information. The details of how the feed information is generated when the LZW compression algorithm is used in the secret message embedding process are discussed in Sect. 4.2.4. The detailed representation of the hidden information embedding in the fixed-length coding method is given in Fig. 7. Accordingly, the equivalent of “110” bitstream was the word “ledger-defter.”

4.2.2 Embedding by Huffman encoding (variable length encoding, VLC)

The Huffman tree structure is another type of binary tree structure, where the code length used to encode words differs from each other unlike the perfect binary tree structure. Huffman is actually a compression method. Accordingly, the words that are frequently used in the text are represented by shorter codes, while the words that are rarely used are represented by longer codes. Therefore, when the candidate word pool was adapted to the Huffman tree structure, Variable Length coding was performed because each word had a binary code value with a different length.

In the Huffman tree structure, as in the perfect binary structure, a candidate word pool was created by taking m piece words with the highest probability that can come after the feed information. During the determination of m piece words, the conditional probability values of the words were ordered from the largest to the smallest, and m pieces of them were taken.

When the VLC embedding method is used, the bitstreams of the secret message are not taken in “k” bit length this time but read in order. In other words, if the first bit of the secret message is “0,” it is directed to the left side of the Huffman tree, and if it is “1,” it is directed to the right side. This process was continued until all bits of the message were read one by one. If the leaf node is reached before the hidden bitstreams are finished, the word in the leaf node corresponds to the bitstream values read up to that stage. The word corresponding to the current bitstream was added to the seed words given to the model, and this time the model was made to predict again with new feed words. In each new prediction of the model, all probability values were ordered from highest to lowest, and this time a new candidate word pool was created by taking the first m of them. In other words, the words in the word pool are renewed after each embedding process to ensure that they have a dynamic structure. If the embedding process of the bitstreams of the secret message was not completed, the remaining bitstream values were continued to be read and the embedding process performed above was repeated. If the feed sentence given to the model was not completed yet after all the hidden information was embedded, the next highest probability word was determined by the model and the text generation continued. As in fixed-length embedding, in variable-length embedding method, it is essential to find the word in the leaf node to replace the hidden bitstream by traversing the tree structure from the root to the leaf node. The detailed representation of the hidden information embedding in the variable-length encoding method is given in Fig. 8.

In Algorithm 1, confidential information embedding steps in cases where FLC and VLC coding methods are used are given in detail. Based on this algorithm, natural-looking stego texts based on confidential information were generated and then sent to the recipient over an open channel with high confidentiality. The steps implemented in Algorithm 1 are based on the study of [1].

4.2.3 Embedding by arithmetic encoding

Arithmetic coding performs lossless compression function based on the probability distribution of words. The fact that it does not require blocking makes arithmetic coding more effective in practice compared to Huffman [54]. Traditionally, arithmetic encoding encodes a set of items into a bit string [60]. In this study, in the secret message embedding function based on arithmetic coding, firstly, each word in the secret message is converted to its bitstream equivalent, and then the decimal values of the bitstreams are calculated. As in the same way as used in other embeddings, the model is made to predict the next m words with the highest probability. Then, the decimal value was calculated for each word in the candidate word pool. Based on these decimal values, the lowest and highest ranges were determined, and it was tried to determine in which range the decimal value of the words of the secret message was included. Accordingly, the word corresponding to that interval was substituted for the word secret message and text was generated. The details of the arithmetic embedding method are discussed later in the article and in Figs. 9 and 10.

Encoding In the coding process, m pieces words with the highest probability were first converted into binary. A decimal value is calculated by applying the formula $B\left( n \right) = \mathop \sum \nolimits_{i = 1}^{L} m_{i} \times 2^{ - i}$ on the bit string in the format $n = [n_{1} ,n_{2} , \ldots .n_{n}$] For example, decimal value of a string of bits in the form of $n = \left[ {0.1,1} \right]$ is $B\left( n \right) = 0 \times 2^{ - 1} + 1 \times 2^{ - 2} + 1 \times 2^{ - 3} = 0.375$ [60].

During the arithmetic coding, the communication with the language model continues. Accordingly, a candidate word pool is created by determining m piece words with the highest probability that can come after the feed word given to the model. Conditional probability distributions of m pieces words obtained from the model are first converted into binary and then decimal. The purpose of converting to decimal here is to keep the values within a range of [0,1). By using the calculated decimal values of words, an attempt is made to determine a range with the formula $= {\text{upper}}\;{\text{limit}} - {\text{lower}}\;{\text{limit}}$. When arithmetic coding is started for the first time, while the specified range is between [0,1), this range narrows in the later parts of the coding. For example, the decimal value of the word “time” to be hidden is 0.75. Let's assume the probabilities of the words in the candidate pool (“school-okul: 0.2”,”book-kitap:0.5,” “notebook-defter:0.3”). So, the representation of the candidate words in the range should be as shown in Fig. 9.

It is seen that the equivalent of the word “time” with a decimal value of 0.75 according to arithmetic coding is “notebook” when looking at Fig. 9. In this way, the first embedding process is provided. When it comes to the next word to be hidden, the interval is between [0.7,1) this time. In addition, after each embedding, a new candidate word pool is created by making the model predict m pieces words again, and this time the decimal value is calculated for each word in the new pool. Let's assume that the second word to hide is "passing-geçiyor" and its decimal value is 0.71. Accordingly, the calculation step should be as in (9).

$$\left. {\begin{array}{*{20}l} {d = {\text{upper}}\;{\text{limit}}-{\text{lower}}\;{\text{limit}} } \hfill \\ {{\text{Range}}\;{\text{of}}\;{\text{word}} = {\text{lower}}\;{\text{limit}}:{\text{lower}}\;{\text{limit}} + d \times \left( {{\text{probabiliy}}\;{\text{of}}\;{\text{word}}} \right)} \hfill \\ {{\text{Range}}\;{\text{of}}\;{\text{word}} = 0.7: 0.7 + \left( {1 - 0.7} \right) \times \left( {0.05} \right)\;{\text{interval}}\;{\text{calculation}}\;{\text{for}}\;{\text{the}}\;{\text{word}}\;{\text{"date-tarih"}}} \hfill \\ \end{array}} \right\}$$

(9)

For example, let's assume that the decimal values (probability) in the newly created sample word pool are (“date: 0.05”, “season-mevsim:0.01,” “year-yıl: 0.06”). Here, first of all, AC algorithm calculates the interval value to which each word in the candidate word pool is connected, by taking into account the equality in (10). Thus, by creating a range ruler as in Fig. 10, the range of the word to be hidden is determined. The word to replace the word “passing” in our example is the word “date” shown in Fig. 10.

All of the above operations are continued in the same order until the embedding of the entire message, which is requested to be hidden, is completed. In Algorithm 2, a detailed representation of the embedding using arithmetic coding is given. The steps implemented in Algorithm 2 are based on the study of [60].

4.2.4 Embedding by LZW-CIE encoding

LZW is a lossless data compression algorithm proposed by Terry A. Welch [76]. The basis of the LZW (Lempel–Ziv–Welch) compression algorithm is to replace the characters in the text with a symbol. The same character table is required for both compression and decoding [77].

In this compression algorithm, each symbol is represented by a code and this code length is fixed. The LZW character dictionary is also dynamically created without the need to pass the dictionary between the decoder and encoder parties. While the LZW encoder compresses the data, the source text is examined sequentially, and strings not in the dictionary are inserted into the next unused part of the dictionary. The previous scanned symbol is encoded with its corresponding code as output. The higher the number of symbols in the dictionary, the higher the LZW compression ratio. At the stage of decoding the data, the dictionary used during compression is created in the same way and the conversion from code to symbol is performed [78]. In this study, the LZW algorithm is used to compress the secret message and extract the message from the generated text document.

Encoding At the stage of embedding the information, the message to be hidden was compressed with the LZW compression algorithm, and a code list corresponding to the symbols in the message was obtained. The purpose of compression here is to increase the capacity of the hidden information. Since the alphabet in Turkish language consists of 29 letters, the values of "x" and "y" are obtained by dividing each code value by 29 and taking the mode according to 29. Then, a number value starting from 1 was assigned to each letter in the alphabet with the method we named “Char Index.” Afterwards, the equivalent of each "x" and "y" value in the alphabet was found and these values were represented by the corresponding letter. For example, the numeric value in "x" is represented by a letter and the numeric value in "y" is represented by another letter to create a list of 2 letters each. Each string of letters in this list constitutes a representation of the message to be hidden. A random feed information was taken from the corpus used in the training of the model and the created 2-letter string was added to this feed information. As the last step, after the representation of the secret message matched the letter characters, the 2 letters with the highest probability that could come after the 2-letter sequence were predicted by the model one after the other, and these letters replaced the letter characters representing the secret message. Afterwards, in order to complete the word, a random number of characters were generated and a word was created by adding the letter string consisting of this variable number of characters to the 2-letter sequence that was generated previously. As a result, the variable number of letters in each word leaves no doubt about the existence of confidential information. This process was continued until all the 2-letter representations corresponding to the secret message were finished, and then normal text was generated. The application steps of the LZW-CIE encoding method are shown in Fig. 11.

4.3 Information extraction module

Extracting the secret message from the stego text was carried out as a result of applying the algorithms (perfect binary tree, Huffman, arithmetic coding and LZW-CIE) used during embedding in the same way, this time on the receiving side. While calculating the conditional probability of the next word over the word level model in the hidden information extraction process with perfect binary tree, Huffman, arithmetic coding, in the extraction process performed with the developed LZW-CIE algorithm, the character-level model was used in the conditional probability calculation.

In the step of extracting the hidden information, the stego text was first divided into lines consisting of a fixed number of words, and then the words in each line were given to the model as a feed input, and it was made to estimate the probabilities of the next words to come. The probabilities were ordered from the largest to the smallest, the words are taken as much as the number of candidate word pool, and the same embedding algorithm applied in embedding confidential information on the words in this pool was applied this time in the extraction step. In other words, if information was placed using FLC in the embedding part, perfect binary tree was created with candidate words in the extraction step; if information was placed using VLC, a Huffman tree was created with candidate words in the extraction step. Then, by descending from the root to the leaf in both perfect binary tree and Huffman tree coding, it was examined whether the word in the stego text given as the feed input to the model is found in the leaf nodes. If found, the code value obtained by descending from the root to the relevant leaf node created the bitstreams of the secret message. The code values here are obtained as a result of taking the values "0" when moving to the left of the tree and "1" when moving to the right.

If arithmetic coding was used in the embedding phase, arithmetic coding was also used in the extraction phase in the same way. In FLC and VLC coding, the step of determining the words with the highest probability after the word in the stego text is given to the model as a feed input is also valid in the secret information extraction process with arithmetic coding. Accordingly, the probability values of the words in the candidate word pool with the highest probability predicted by the model, ordered from largest to smallest, were first converted to binary and then binary values were converted to decimal. The goal here is to keep decimal values in the range of 0 to 1. Afterwards, the feed word in the stego text was subjected to arithmetic coding and a decimal value was obtained for this word. The decoding process was carried out by looking in the fact that the decimal value of the feed word is included in which of the value ranges corresponding to the words in the candidate word pool. The process of extracting confidential information is continued until it is determined whether it contains confidential information or not by examining each word in the stego text received by the recipient, one by one as described above. The decode stage in arithmetic coding is the opposite of the encode stage, and the extraction process was carried out by gradually narrowing the value ranges.

In the step of extracting the confidential information made with LZW-CIE coding method, the first 2 characters of each word in the stego text reaching the receiver were taken and the character index value of these letters (positions in the alphabet) was checked. Then, the division process, which provides the "x" value in the encoding stage, was applied in reverse for the index value of the first character this time, and the index value of the second character was added to this value. This process was applied for all words in the text, and a list of decimal values was obtained. Each value in this list was decompressed by subtracting 1 more than the number of characters in the alphabet. The process of removing 1 more than the number of characters in the alphabet was done in order to obtain the same list as when compressing. In the proposed algorithm, the secret information can be placed in different positions of the word such as the beginning, the end or the middle of the word. For example, if confidential information is placed at the end of the word, it is necessary to pay attention to only the last 2 letters of each word during the extraction of confidential information.

5 Experiments and analysis

Under this title, the performance of the model created in the study has been evaluated from the perspectives of confidential information embedding efficiency, confidential information imperceptibility and confidential information capacity, and various trials have been carried out for this purpose. In order to measure the efficiency of embedding confidential information, a test of how long the model generates stego text on average was carried out. For the criterion of imperceptibility of confidential information, stego texts obtained by embedding confidential information at different rates were compared with the training text, and at the same time, the ability to resist steganographic perception was examined. For the secret information capacity metric, the rate of confidential information which can be placed in the generated stego texts was analyzed and the values obtained were compared with other text steganography algorithms. In Title 5.1, firstly the dataset used is introduced and then the structure of the word- and character-level models, parameter settings and the details of the model training are given.

5.1 Data preparation and model training

In this study, general Turkish documents with the extension ".txt" in the Kaggle [79] database were used to train the model. The documents used consist of newspaper articles [80], Turkish lecture notes [81], online pdf documents [82] and opinion columns [83]. A large-scale corpus was obtained by combining separately downloaded text documents into a single file. Before moving on to model training, an attempt was made to obtain a noise-free corpus by applying some preprocessing steps such as removing punctuation marks, removing numeric expressions, converting to lowercase, deleting special symbols, filtering out low-frequency words, eliminating stop words. The details of the training dataset obtained after the implementation of all these preprocessing steps mentioned are shown in Table 1.

Table 1 The details of the training datasets

Full size table

The hyperparameters used during the editing of the word-based model are as follows: First, each word is paired with a 100-dimensional embedding vector. Then, a bidirectional LSTM (bidirectional LSTM) layer consisting of 512 units was applied. The "return_sequence" parameter here is set to "True" to take into account the previous and next words in the sequence. An attention mechanism has also been added to the model so that it can concentrate on words considered to be more relevant. A "dropout" layer has been added to prevent overfitting. The construction of the model was completed by adding an LSTM layer consisting of another 100 units, a “Dense” layer with relu activation and an organizer to prevent overfitting again.

In the character-based model, each character is mapped to a 32-dimensional embedding vector. Then, a bidirectional LSTM (bidirectional LSTM) layer consisting of 768 units was applied. Likewise, the "return_sequence" parameter here is set to "True" so that the previous and next words are taken into account. The encoder phase was completed by adding the Bahdanau attention mechanism and then the dropout layer. In the decoder stage, the construction of the whole model was completed by adding the "Dense" layer on top of the Bidirectional LSTM layer. Softmax was used as the activation method in both models. The learning rate was set as 0.01, the batch_size as 128 and the drop_out rate as 0.2.

5.2 Evaluation results and discussion

Under this title, the performance of the proposed model has been examined on the basis of 4 criteria: information hiding efficiency, information imperceptibility, steganalysis resistance and hiding capacity, and its results have been evaluated.

(1)
Information hiding efficiency

In this criterion, the time spent by the model in placing the confidential information is determined. The reconstruction of the candidate word pool used in FLC, VLC and AC embedding methods in each iteration and the encoding of the candidate words (FLC, VLC, AC) directly affect the embedding time. In addition, the number of words in the candidate word pool (candidate pool size, CPS) is also a factor in determining the efficiency. In the model evaluation tests, information embedding efficiency was tested at different embedding rates. In order to compare with the referenced studies in [1, 64], 1000 texts limited to 50 words each were generated here for each CPS. The same generation method is adopted in the LZW-CIE coding method. Table 2 and Fig. 12 give information embedding times, which vary according to CPS.

Table 2 The average time for each model to generate a text containing 50 words at different embedding rates

Full size table

When the values in Table 2 and Fig. 12 are interpreted, it is observed that when the embedding rate increases, the time spent for the generation of stego text containing confidential information also increases. In the suggested VLC encoding in this table, since it is more time-consuming to create words with code values of different lengths in each iteration and place them in the tree structure, in general, it has been a more time-consuming coding method than FLC and AC coding. However, with the proposed model, information placement is provided in less time than the RNN-Stega FLC and RNN-Stega VLC models. When Table 2 is examined, it is seen that when the proposed model uses the FLC encoding method, it can produce 50-word stego text in the range of 3.828–4.36 s on average (for all CPS). While this value is in the range of 16.539–19.072 s when using VLC encoding, it varies between 14,803 and 16.095 on average in LZW-CIE encoding method. It has been observed that stego text can be generated in shorter times compared to the values obtained with the RNN-Stega FLC and RNN-Stega VLC model; therefore, the proposed model achieves higher information embedding efficiency in 4 different codings (FLC, VLC, AC and LZW-CIE) methods.

(2)
Imperceptibility analysis

The first purpose in text steganography is to convey the existence of confidential information transmitted over an open medium (public channel) to the recipient as unnoticed as possible. Therefore, the criterion of imperceptibility is the most crucial factor in the evaluation of the success of steganography studies. In order not to detect the presence of confidential information in the generated text, the difference between the statistical distributions of the carrier text (cover text without confidential information) and the stego text should be minimal or even nonexistent. In the model proposed in this study, a stego text generation based on confidential information without a carrier cover text was performed. The "perplexity" criterion was used for the task of preserving the statistical distribution by producing text based on confidential information, which is a more challenging process than ensuring the consistency of the statistical distribution by placing confidential information in the carrier text. “Perplexity (pp)” is a standard metric used to determine the quality of the created language model in the field of natural language processing [84]. This metric is defined as the average log probability per word in test texts [1].

$$\begin{aligned} {\text{Perplexity}} & = 2^{{\frac{ - 1}{N}\mathop \sum \limits_{i = 1}^{N} \log p\left( {s_{i} } \right)}} \\ & = 2^{{\frac{ - 1}{N}\mathop \sum \limits_{i = 1}^{N} \log p_{i} \left( {w_{1,} w_{2,} \ldots w_{n} } \right)}} \\ & = 2^{{\frac{ - 1}{N}\mathop \sum \limits_{i = 1}^{N} \log p_{i} \left( {w_{1} } \right)p\left( {w_{2} |w_{1} } \right) \ldots p\left( {w_{n} |w_{1} ,w_{2} , \ldots w_{n - 1} } \right)}} \\ \end{aligned}$$

(10)

In Eq. (10), $s_{i} = \left\{ {w1, w2, \ldots ,wn} \right\}$ refers the sentence generated and $p\left( {s_{i} } \right)$ refers to the probability of each word in this sentence obtained from the moldel. Since the Perplexity value is a metric for finding the statistical distribution of texts, this criterion was calculated both for the generated stego texts, and for the texts used in education, and the difference was evaluated [64]. The small value of this value means that the difference between the generated text and the educational text is so small; in other words, these two texts are quite similar to each other. The average perplexity value calculated for the training texts is 135.45. Then, the average perplexity value was calculated for the generated stego texts at different embedding rates. Average perplexity is expressed with $\Delta {\text{Mp}}$ [64].

$$\Delta {\text{Mp}} = {\text{mean}}\left( {{\text{ppStegoText}}} \right) - {\text{mean}}\left( {{\text{ppTrainingText}}} \right)$$

(11)

Based on the study in [64], in order to state the fact that the generated stego texts carry confidential information cannot be detected statistically and in other words, in order to measure how similar, the stego texts generated with the training texts are not only semantically but also statistically, the Kullback–Leibler divergence (KLD) metric was used. Since the KLD metric is not symmetrical, the difference in distribution between both the generated stego texts and the training sentences was examined by using the Jensen–Shannon divergence (JSD) [85, 86] metric [64]. The KLD and JSD distance metrics used are given in Eqs. (12) and (13), respectively. The “C” refers to the overall statistical distribution of training text here, and “S” refers to the overall statistical distribution of the stego text [64, 87].

$$D_{KL} (C||S) = \sum C_{i} \left( x \right)\log \left( {\frac{{C_{i} \left( x \right)}}{{S_{i} \left( x \right)}}} \right)$$

(12)

$$D_{JS} (C||S) = \frac{1}{2}D_{KL} \left( {P_{C} ||\frac{{P_{C} + P_{S} }}{2}} \right) + \frac{1}{2}D_{KL} \left( {P_{S} ||\frac{{P_{C} + P_{S} }}{2}} \right)$$

(13)

In order to evaluate the performance of the model in terms of perplexity criterion, a comparison was made with 2 separate studies close to this study in terms of the coding method used. These are the studies in [1, 64]. Due to the similarity of the starting point of the study in here and these 2 reference studies, the evaluation path followed by these studies in the performance evaluation stage of the proposed model was taken as a guide. In these two reference studies, text generation based on confidential information was carried out with an LSTM-based model. The most obvious difference between these studies and the proposed study is the architecture of the proposed model and the developed LZW-CIE coding method. In addition, as far as is known, there are not enough studies on Turkish text steganography, and the proposed model was tested on Turkish texts. This feature constitutes another important component of the study.

Both generated models (word and character based) were trained using the same dataset and with the same embedding rates. At the end of the training, stego texts consisting of 1000 sentences were created for the perplexity test. In order to make a more realistic evaluation, the sentences in the stego texts generated by both proposed models and with different embedding rates are represented in the feature space using the sentence matching model specified in [88]. Then, the general statistical distribution (KL and JSD values) of the 1000-sentence stego texts allocated for the test and the training texts consisting of 1000 sentences and randomly selected from the training corpus were calculated and analyzed to what extent they matched up with each other. Mean and standard deviation measurement results of perplexity value are given in Tables 3 and 4. Since the number of bits embedded per word (bpw) is variable in VLC coding, the bpw value in cases where this coding is used, and the corresponding perplexity results are reflected in Tables 3 and 4.

Table 3 The mean and standard deviation of the perplexity results of proposed models at different embedding rates on Turkish dataset as well as the results of the models of related work

Full size table

Table 4 The measurement results of evaluation metrics of the steganographic sentences generated by the proposed bidirectional LSTM attention based model and encoder–decoder model as well as the models of related work

Full size table

When the results in Tables 3 and 4 are examined, it is seen that the perplexity value increases as the embedding ratio (bpw) increases. The most crucial factor in determining the embedding rate is the size of the CP (candidate word pool). As CPS increases, the embedding rate increases significantly. The CPS dimension selected in the study was determined as 2, 4, 8, 16, 32. The embedding rate (bpw) was then calculated by dividing the number of bits embedded and the length of the generated text. When the metrics such as $\Delta {\text{MP}}$, KL and JSD in Table 4 are examined in order to evaluate the similarity ratio between the generated stego texts and the training texts, it is observed that with the increase in the embedding rate, these values decrease; in other words, with the increase in the embedding rate, stego texts that are more similar to the training texts are generated. The reason for this is that, with the increase in CPS, a larger number of all the words used in the training of the model are taken for encoding purposes (to create the CPS), and this is considered as the words in the stego texts to be generated are approaching the training texts with an increasing rate. With the convergence of stego and training texts, the reduction in the difference of $\Delta {\text{MP}}$, KL and JSD makes it difficult to distinguish between these two texts statistically. The disadvantage of the CPS increase and the corresponding bpw increase is that words with lower conditional probability values are used in the encoding phase, thus reduction in the quality of the generated stego texts. But with the proposed model (both word and character-based model), a result on the $\Delta {\text{MP}}$ metric lower than the compared VAE-Stega model was obtained; in other words, the quality difference between the training text and the stego text was less. Therefore, higher quality stego texts were generated with the proposed model. In addition, when the developed models are examined on the basis of the coding method used, it is seen that AC coding gives better results than VLC and FLC coding in terms of statistical difference between $\Delta {\text{MP}}$, stego text and training text and the developed LZW-CIE coding method is ahead of AC coding in the same metric. Therefore, in the dataset used, due to the LZW-CIE encoding has the ability to hide information more efficiently as it is based on character level, at the same time, the stego generated with the training texts was able to maintain the statistical closeness between the texts and the imperceptibility of confidential information.

(3)
Anti-steganalysis ability

The ultimate aim of the study carried out here is to deliver a text containing confidential information to the recipient without being noticed by third parties. For this purpose, the anti-steganalysis ability was evaluated by subjecting both proposed models to the criteria used by the steganalysis studies in [89,90,91]. Evaluation criteria known as accuracy (Acc) and recall (R) were used to evaluate the steganalysis performances of the proposed models.

The accuracy metric measures the proportion of correct results (both true positives (TP) and true negatives (TN)) within the total number of events examined [64].

$${\text{Acc}} = \frac{{{\text{TP}} + {\text{TN}}}}{{{\text{TP}} + {\text{FN}} + {\text{FP}} + {\text{TN}}}}$$

(14)

The recall metric measures the proportion of correctly identified positives [64].

$$R = \frac{{{\text{TP}}}}{{{\text{TP}} + {\text{FN}}}}$$

(15)

TP refers the positive samples predicted to be positive by the model; FP refers the negative samples predicted to be positive by the model; FN refers the positive samples predicted to be negative by the model; TN refers the negative samples predicted to be negative [64]. The steganalysis results obtained are given in Tables 5, 6 and 7.

Table 5 Steganalysis ability of steganography models (Huffman coding)

Full size table

Table 6 Steganalysis ability of steganography models (arithmetic coding)

Full size table

Table 7 Steganalysis ability of steganography models (LZW-CIE coding)

Full size table

When Tables 5, 6 and 7 are examined, it is possible to make the following evaluations. It is seen that the bidirectional LSTM architecture word-based model with Custom-Attention mechanism and the bidirectional LSTM encoder-decoder architecture model with character-level Bahdanau Attention mechanism, proposed in the study, give better results in accuracy and recall metrics than other models. With the increase in the amount of bpw, the fact that the generated stego texts became closer to the training texts has led to a decrease in the accuracy of the steganalysis measurements used, in other words, to a lower accuracy of estimating whether there is confidential information or not. This is a desired situation as a result of the study. In all coding methods used, with the increase in bpw, the accuracy and recall metrics have gradually decreased in accuracy. This trend of decreasing accuracy of steganalysis metrics in Table 7 continued throughout the references in [89,90,91].

When the effect of the proposed coding methods (FLC, VLC, AC, LZW-CIE) on the steganalysis measurement results, information embedding efficiency and information imperceptibility metrics were evaluated, it was seen that the most successful values were obtained when LZW-CIE coding was used. The reason for this is the observation that the character-based language model gives more successful results in modeling languages with a morphological structure such as Turkish [92], as well as the encoder-decoder architecture with the Attention mechanism used in the character-level model and the use of the bidirectional LSTM network in both the encoder and decoder parts. After the LZW-CIE coding method, the second successful coding method was AC. The measurement results between Tables 2, 3, 4, 5, 6 and 7 confirm this result.

(4)
Evaluation results between sampling types

In the study carried out, stego text was generated by using 3 different sampling types (greedy search, top-k, temperature based) on the proposed language models, and it was also examined how the parameters used in each sampling type would produce a result when they took different values. Tables 8 and 9 below provide perplexity results for each coding method. These tables do not include the perplexity results of text generation when using greedy search. Because when compared to the other 2 sampling types, it was seen that certain words were repeated a lot and the quality of the generated texts was low. It is seen that the perplexity value increases as the embedding rate increases in parallel with the increase in CPS in both top-k and temperature-based sampling types. In addition, the increase in the k value and the temperature value in top-k provided word diversity in the generated texts and allowed the formation of less predictable non-ordinary texts. More fluent texts could be generated in the top-k and temperature sampling types compared to the greedy sampling type. Considering that the perplexity value of the training texts is 135.45, in FLC, VLC, AC and LZW-CIE coding methods, the most ideal perplexity value was reached in the temperature-based sampling type when t = 0.6. The results in Table 4 obtained using T = 0.6 and temperature sampling confirm this.

(5)
Hidden capacity (embedding rate, ER) analysis

Table 8 Measurement results of different sampling types (FLC and VLC coding)

Full size table

Table 9 Measurement results of different sampling types (AC and LZW-CIE coding)

Full size table

These metric measures how much confidential information can be embedded in the text. ER calculation is made by dividing the number of hidden bits embedded by the number of bits of the whole text [1]. In the equation in (16), the calculation of ER is expressed mathematically.

$$\begin{aligned} {\text{ER}} & = \frac{1}{N} \mathop \sum \limits_{i = 1}^{N} \frac{{\left( {L_{i} - 1} \right) \cdot k}}{{B\left( {s_{i} } \right)}} \\ & = \frac{1}{N} \mathop \sum \limits_{i = 1}^{N} \frac{{\left( {L_{i} - 1} \right) \cdot k}}{{8 \times \mathop \sum \nolimits_{j = 1}^{{L_{j} }} m_{i,j} }} = \frac{{\left( {\overline{L} - 1} \right) \times k}}{{8 \times \overline{L} \times \overline{m}}}, \\ \end{aligned}$$

(16)

$N$ refers the number of sentences generated; $L_{i}$, i. refers the length of the sentence; $k$ refers the number of bits embedded in each word; $B\left( {s_{i} } \right);$ and i. represents the number of bits occupied by the sentence. Since the Turkish language consists of letters in the Latin alphabet and these letters are 1 byte, that is, 8 bits, for $8 \times \sum\nolimits_{j = 1}^{{L_{j} }} {m_{i,j} }$ expression i. represents the number of bits of the sentence and j. represents the number of bits of the word. $\overline{L}$ and $\overline{m}$ expressions, respectively, represent the average length of each sentence in the generated text and the average number of letters each word contains [1]. The measurements made in this study $\overline{L}$ value were found to be 15.34 and $\overline{m}$ value was found to be 3.67. The graph in Fig. 13 shows the variation of the ER ratio depending on its interaction with the bpw and sentence length parameters. Accordingly, as the sentence length increased, the ER ratio also increased and the highest result of 22.46% was obtained. In addition, it is seen that there is an improvement in the ER ratio with the increase in the bpw ratio. In Table 10, the percentages of placing confidential information in different bpw ratios of LZW-CIE model and RNN-Stega models are given comparatively. Accordingly, the LZW-CIE model achieved a 22.46% confidential information embedding success with a bpw rate of 8.698.

Table 10 The comparison of the embedding rates between models

Full size table

6 Conclusion

It has become a challenging yet promising field to perform automatic text generation based on completely confidential information without the need for a supporting cover text. Therefore, it is thought that changes in model architectures and advances in coding methods for embedding confidential information will accelerate the stego text generation process. For this purpose, in the proposed study two different linguistic steganographic models were designed in order to generate text based on confidential information. Bidirectional LSTM architecture with custom-attention mechanism was used to create the word-level model, while Bahdanau attention mechanism architecture, which includes bidirectional LSTM network in both encoder and decoder parts, was used to create the character-level model. FLC (perfect binary tree), VLC (Huffman tree), arithmetic coding and LZW-CIE coding method which was developed as a new embedding method were used as methods of embedding confidential information. The performance of the model was examined in terms of confidential information embedding efficiency, imperceptibility, anti-steg resistance and confidential information capacity metrics and it was determined that the proposed char-level model with LZW-CIE Encoding outperformed all previous related methods and also reached the greatest performance. We hope that the article published here will be a reference for researchers working in this field and guide future text steganography studies.

References

Yang Z, Guo X, Chen Z, Huang Y, Zhang Y (2019) RNN-Stega: linguistic steganography based on recurrent neural networks. In IEEE Trans Inf For Secur 14:1280–1295. https://doi.org/10.1109/TIFS.2018.2871746
Article Google Scholar
Kang H, Wu H, Zhang X (2020) Generative text steganography based on LSTM network and attention mechanism with keywords. Electron Imaging Media Watermark Secur For. https://doi.org/10.2352/ISSN.2470-1173.2020.4.MWSF-291
Article Google Scholar
Zhou Z, Sun H, Harit R, Chen X, Sun X (2016) Coverless image steganography without embedding. In International conference on cloud computing and security. Springer. https://doi.org/10.1007/978-3-319-27051-7_11
Fridrich J (2009) Steganography in digital media: principles, algorithms, and applications. Cambridge University Press, New York, Binghamton
Book Google Scholar
Li B, Tan S, Wang M, Huang J (2014) Investigation on cost assignment in spatial image steganography. IEEE Trans Inf For Secur 9:1264–1278. https://doi.org/10.1109/TIFS.2014.2326954
Article Google Scholar
Liao X, Yin J, Chen M, Qin Z (2020) Adaptive payload distribution in multiple images steganography based on image texture features. IEEE Trans Depend Secure Comput (Early Access). https://doi.org/10.1109/TDSC.2020.3004708
Article Google Scholar
Taha A, Hammad AS, Selim MM (2020) A high capacity algorithm for information hiding in Arabic text. J King Saud Univ Comput Inf Sci 32:658–665. https://doi.org/10.1016/j.jksuci.2018.07.007
Article Google Scholar
Lingyun X, Yang S, Liu Y, Li Q, Zhu C (2020) Novel linguistic steganography based on character-level text generation. Mathematics 8:1–18. https://doi.org/10.3390/math8091558
Article Google Scholar
Yang Z, Jin S, Huang Y, Zhang Y, Li L (2018) Automatically generate steganographic text based on markov model and huffman coding. https://arxiv.org/abs/1811.04720.
Lockwood R, Curran K (2017) Text based steganography. Int J Inf Privacy Secur Integr. https://doi.org/10.1504/IJIPSI.2017.10009581
Article Google Scholar
Chotikakamthorn N (1998) Electronic document data hiding technique using inter-character space. 1998 IEEE Asia-Pacific conference on circuits and systems. Microelectronics and integrating systems. Proceedings (Cat. No.98EX242). https://doi.org/10.1109/APCCAS.1998.743799
Shirali-Shahreza MH, Shirali-Shahreza M (2006) A new approach to Persian/Arabic text steganography. 5th IEEE/ACIS international conference on computer and information science and 1st IEEE/ACIS international workshop on component-based software engineering, software architecture and reuse (ICIS-COMSAR'06) (2006). https://doi.org/10.1109/ICIS-COMSAR.2006.10
Low SH, Maxemchuk NF, Lapone AM (1998) Document identification for copyright protection using centroid detection. IEEE Trans Commun 46:372–383. https://doi.org/10.1109/26.662643
Article Google Scholar
Altigani A, Barry B (2013) A hybrid approach to secure transmitted message using advanced encryption standard (AES) and word shift coding protocol. In: 2013 international conference on computing, electrical and electronic engineering (Icceee) (2013). https://doi.org/10.1109/ICCEEE.2013.6633920
Wang Z, Chang C, Lin C, Li M (2009) A reversible information hiding scheme using left-right and up- down Chinese character representation. J Syst Softw 82:1362–1369. https://doi.org/10.1016/j.jss.2009.04.045
Article Google Scholar
Por LY, Delina B (2008) Information in text hiding: A new approach steganography. In 7th WSEAS international conference on applied computers &applied computational science (ACACOS’08). https://doi.org/10.18201/ijisae.05687
Wang ZH (2009) Emoticon-based text steganography in chat. In: Second Asia Pacific conference on computational intelligence and industrial application. https://doi.org/10.1109/PACIIA.2009.5406559
Khairullah MD (2009) A novel text steganography system using font color of the invisible characters in microsoft word. In: Second international conference on computer and electrical engineering. https://doi.org/10.1109/ICCEE.2009.127
Bhaya W (2013) Text steganography based on font type in MS-word documents. J Comput Sci 99:898–904. https://doi.org/10.3844/jcssp.2013.898.904
Article Google Scholar
Bhattacharyya S, Indu P, Dutta S, Biswas A, Sanyal G (2011) Hiding data in text through changing in alphabet letter patterns (CALP). J Glob Res Comput Sci 2:33–39
Google Scholar
Roy S, Manasmita M (2011) A novel approach to format based text steganography. In: ICCCS’11:Proceedings of the 2011 international conference on communication, computing & security. https://doi.org/10.1145/1947940.1948046
Agarwal M (2013) Text steganographic approaches: a comparison. Int J Netw Secur Appl 5:91–106. https://doi.org/10.5121/ijnsa.2013.5107
Article Google Scholar
Shirali-Shahreza M (2008) Text steganography by changing words spelling. In: 10th international conference on advanced communication technology. https://doi.org/10.1109/ICACT.2008.4494159
Singh P, Chaudhary R, Agarwal A (2012) A novel approach of text steganography based on null spaces. IOSR J Comput Eng 3:11–17. https://doi.org/10.9790/0661-0341117
Article Google Scholar
Thabit R, Udzir NI, Yasin SM, Asmawi A, Roslan NA, Din R (2021) A comparative analysis of arabic text steganography. Appl Sci 11(15):6851. https://doi.org/10.3390/app11156851
Article Google Scholar
Mohammed AM, Rossilawati S, Zarina S, Mohammad KH (2021) A review on text steganography techniques. Mathematics 9(21):1–28
Google Scholar
Wu N, Ma W, Liu Z, Shang P, Yang Z, Fan J (2019) Coverless Text Steganography Based on Half Frequency Crossover Rule. In: Proceedings of the 2019 4th international conference on mechanical, control and computer engineering (ICMCCE). pp 726–7263. https://doi.org/10.1109/ICMCCE48743.2019.00168
Alghamdi N, Berriche L (2019) Capacity investigation of Markov chain-based statistical text steganography. Arabic language case. In: Proceedings of the 2019 Asia Pacific information technology conference, pp 37–43. https://doi.org/10.1145/3314527.3314532
Alanazi N, Khan E, Gutub A (2020) Efficient security and capacity techniques for Arabic text steganography via engaging Unicode standard encoding. Multimed Tools Appl 80:1403–1431. https://doi.org/10.1007/s11042-020-09667-y
Article Google Scholar
Bhat D, Krithi V, Manjunath KN, Prabhu S, Renuka A (2017) Information hiding through dynamic text steganography and cryptography. Comput Inform. https://doi.org/10.1109/ICACCI.2017.8126110
Article Google Scholar
Jayapandiyan JR, Kavitha C, Sakthivel K (2020) Enhanced least significant bit replacement algorithm in spatial domain of steganography using character sequence optimization. IEEE Access 8:136537–136545. https://doi.org/10.1109/ACCESS.2020.3009234
Article Google Scholar
Wu N, Liu Z, Ma W, Shang P, Yang, Z, Fan J (2019) Research on coverless text steganography based on multi-rule language models alternation. In: Proceedings of the 2019 4th international conference on mechanical, control and computer engineering (ICMCCE), pp 803–8033. https://doi.org/10.1109/ICMCCE48743.2019.00184
Murphy B, Vogel C (2007) The syntax of concealment: reliable methods for plain text information hiding. Proc SPIE Int Soc Opt Eng. https://doi.org/10.1117/12.713357
Article Google Scholar
Meral HM, Sankur B, Ozsoy AS, Gungor T, Sevinc E (2009) Natural language watermarking via morphosyntactic alterations. Comput Speech Lang 23:107–125. https://doi.org/10.1016/j.csl.2008.04.001
Article Google Scholar
Muhammad HZ, Rahman SMSAA, Shakil A (2009) Synonym based Malay linguistic text steganography. In: Proceedings of the innovative technologies in intelligent systems and industrial applications, CITISIA (2009). https://doi.org/10.1109/CITISIA.2009.5224169
Xiang L, Wu W, Li X, Yang C (2018) A linguistic steganography based on word indexing compression and candidate selection. Multimed Tools Appl 77:28969–28989. https://doi.org/10.1007/s11042-018-6072-8
Article Google Scholar
Xiang L, Wang X, Yang C, Liu P (2017) A novel linguistic steganography based on synonym run-length encoding. IEICE Trans Inf Syst 100:313–322. https://doi.org/10.1587/transinf.2016EDP7358
Article Google Scholar
Li M, Mu K, Zhong P, Wen J, Xue Y (2019) Generating steganographic image description by dynamic synonym substitution. Signal Process 164:193–201. https://doi.org/10.1016/j.sigpro.2019.06.014
Article Google Scholar
Topkara M, Topkara U, Atallah MJ (2007) Information hiding through errors: a confusing approach. Proc SPIE Int Soc Opt Eng. https://doi.org/10.1117/12.706980
Article Google Scholar
Chang CY, Clark S (2010) Linguistic steganography using automatically generated paraphrases. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics. Association for Computational Linguistics. https://aclanthology.org/N10-1084
Naqvi N, Abbasi AT, Hussain R, Khan MA, Ahmad B (2018) Multilayer partially homomorphic encryption text steganography (MLPHE-TS): a zero steganography approach. Wirel Pers Commun 103:1563–1585. https://doi.org/10.1007/s11277-018-5868-1
Article Google Scholar
Mansor FZ, Mustapha A, Din R, Abas A, Utama S (2018) An antonym substitution-based model on linguistic steganography method. Indonesian. J Electr Eng Comput Sci 12: 225–232. https://doi.org/10.1159/ijeecs.v12.i1.pp225-232
Mahato S, Khan DA, Yadav DK (2020) A modified approach to data hiding in Microsoft Word documents by change-tracking technique. J King Saud Univ Comput Inf Sci 32:216–224. https://doi.org/10.1016/j.jksuci.2017.08.004
Article Google Scholar
Wu N, Shang P, Fan J, Yang Z, Ma W, Liu Z (2019) Research on coverless text steganography based on single bit rules. J Phys: Conf Ser 1237:1–6. https://doi.org/10.1088/1742-6596/1237/2/022077
Article Google Scholar
Chen X, Sun H, Tobe Y, Zhou Z (2015) Sun X (2015) Coverless information hiding method based on the chinese mathematical expression. Int Conf Cloud Comput Secur. https://doi.org/10.1007/978-3-319-27051-7_12
Article Google Scholar
Wang K, Gao Q (2019) A coverless plain text steganography based on character features. In IEEE Access 7:95665–95676. https://doi.org/10.1109/ACCESS.2019.2929123
Article Google Scholar
Wu N, Shang P, Fan J, Yang Z, Ma W, Liu Z (2019) Coverless text steganography based on maximum variable bit embedding rules. J Phys: Conf Ser 1237:1–6. https://doi.org/10.1088/1742-6596/1237/2/022078
Article Google Scholar
Wu N, Yang Z, Yang Y, Li L, Shang P, Ma W, Liu Z (2020) STBS-Stega: Coverless text steganography based on state transition-binary sequence. Int J Distrib Sens Netw 16:1–12. https://doi.org/10.1177/1550147720914257
Article Google Scholar
Zhang W, Wang, X, Zhang C, Zhang J (2020) Coverless text steganography method based on characteristics of word association. In: 2020 IEEE 20th international conference on communication technology (ICCT). https://doi.org/10.1109/ICCT50939.2020.9295910
Yang R, Ling Z (2019) Linguistic Steganography by Sampling-based Language Generation. In: 2019 Asia-Pacific signal and information processing association annual summit and conference (APSIPA ASC). https://doi.org/10.1109/APSIPAASC47483.2019.9023313
Fang T, Jaggi M, Argyraki K (2017) Generating steganographic text with LSTMs. https://arxiv.org/abs/1705.10742: 100–106. https://aclanthology.org/P17-3017
Tong Y, Liu Y, Wang J, Xin G (2019) Text steganography on RNN-generated lyrics. Math Biosci Eng 16:5451–5463. https://doi.org/10.3934/mbe.2019271
Article MathSciNet MATH Google Scholar
Dai F, Cai Z (2019) Towards near-imperceptible steganographic text. In: Proceedings of the 57th annual meeting of the association for computational linguistics. https://doi.org/10.18653/v1/P19-1422
Ziegler Z, Deng Y, Rush A (2019) Neural linguistic steganography. In: Proceedings of the 2019 conference on empirical methods in natural language processing and the 9th international joint conference on natural language processing (EMNLP-IJCNLP). https://doi.org/10.18653/v1/D19-1115
Shniperov AN, Nikitina KA (2016) A text steganography method based on Markov chains. Autom Control Comput Sci 50:802–808. https://doi.org/10.3103/S0146411616080174
Article Google Scholar
Luo Y, Huang Y, Li F, Chang C (2016) Text steganography based on ci-poetry generation using markov chain model. Ksii Trans Internet Inf Syst 10:4568–4584. https://doi.org/10.3837/tiis.2016.09.029
Article Google Scholar
Moraldo HH (2014) An approach for text steganography based on markov chains. Aut Control Comp Sci 50:802–808. https://doi.org/10.3103/S0146411616080174
Article Google Scholar
Dai W, Yu Y, Deng B (2009) BinText steganography based on Markova state transferring probability. In: Proceedings of the 2nd international conference on interaction sciences: information technology, culture and human, ICIS’09 (2009). https://doi.org/10.1145/1655925.1656165
Dai W, Yu Y, Dai Y, Deng B (2010) Text steganography system using markov chain source model and des algorithm. J Softw 5:785–792. https://doi.org/10.4304/jsw.5.7.785-792
Article Google Scholar
Shen J, Heng J, Han J (2020) Near-imperceptible Neural Linguistic Steganography via Self-Adjusting Arithmetic Coding. EMNLP 2020.
Luo Y, Huang Y (2017) Text steganography with high embedding rate: using recurrent neural networks to generate chinese classic poetry. In: IH&MMSec '17: Proceedings of the 5th ACM workshop on information hiding and multimedia security. https://doi.org/10.1145/3082031.3083240
Zhou X, Peng W, Yang B, Wen J, Xue Y, Zhong P (2021) Linguistic steganography based on adaptive probability distribution. IEEE Trans Dependable Secure Comput (Early Access). https://doi.org/10.1109/TDSC.2021.3079957
Article Google Scholar
Yang Z, Xiang L, Zhang S, Sun X, Huang Y (2021) Linguistic generative steganography with enhanced cognitive-imperceptibility. IEEE Signal Process Lett 28:409–413. https://doi.org/10.1109/LSP.2021.3058889
Article Google Scholar
Yang ZL, Zhang SY, Hu YT, Hu ZW, Huang YF (2021) VAE-Stega: linguistic steganography based on variational auto-encoder. In IEEE Trans Inf For Secur 16:880–895. https://doi.org/10.1109/TIFS.2020.3023279
Article Google Scholar
Kumar R, Chand S, Singh S (2014) An Email based high capacity text steganography scheme using combinatorial compression. In: 2014 5th international conference - confluence the next generation information technology summit (confluence). https://doi.org/10.1109/CONFLUENCE.2014.6949231
Kumar R, Malik A, Singh S, Chand S (2016) A high capacity email based text steganography scheme using Huffman compression. In: 2016 3rd international conference on signal processing and integrated networks (SPIN). https://doi.org/10.1109/SPIN.2016.7566661
Tutuncu K, Hassan AA (2015) New approach in e-mail based text steganography. Int J Intell Syst Appl Eng 3: 54–56. https://doi.org/10.18201/ijisae.05687
Malik A, Sikka G, Verma HK (2017) A high capacity text steganography scheme compression and color coding. Eng LZW Sci Technol Int J 20:72–79. https://doi.org/10.1016/j.jestch.2016.06.005
Article Google Scholar
Fateh M, Rezvani M (2018) An email-based high capacity text steganography using repeating characters. Int J Comput Appl 43:226–232. https://doi.org/10.1080/1206212X.2018.1517713
Article Google Scholar
Berglund M, Raiko T, Honkala M, Kärkkäinen L, Vetek A, Karhunen J (2015) Bidirectional recurrent neural networks as generative models. In: NIPS'15: Proceedings of the 28th international conference on neural information processing systems 1: 856–864. https://doi.org/10.1021/acs.jcim.9b00943
Wang H, Zhang W, Zhu Y, Bai Z (2019) Data-to-text generation with attention recurrent unit. In: 2019 international joint conference on neural networks (IJCNN), (2019). https://doi.org/10.1109/IJCNN.2019.8852343
Bahdanau D, Cho K, Bengio Y (2015) Neural machine translation by jointly learning to align and translate. Preprint https://arxiv.org/abs/1409.0473.
Luong M, Pham H, Manning C (2015) Effective approaches to attention-based neural machine translation. Preprint https://arxiv.org/abs/1508.04025. https://aclanthology.org/D15-1166.pdf
Oinar C (2021) Introduction to Attention Mechanism: Bahdanau and Luong Attention. Artificial Intelligence. https://ai.plainenglish.io/introduction-to-attention-mechanism-bahdanau-and-luong-attention-e2efd6ce22da. Accessed 11 August 2021
Khandelwal R (2020) Attention: Sequence 2 Sequence model with Attention Mechanism. Towards Data Science. https://towardsdatascience.com/sequence-2-sequence-model-with-attention-mechanism-9e9ca2a613a. Accessed 15 August 2021
Welch TA (1984) A technique for high performance data compression. Computer 17:8–19. https://doi.org/10.1109/MC.1984.1659158
Article Google Scholar
Varian C, Munır R (2019) Modified email header steganography using LZW compression algorithm. In: Proceedings of the Sriwijaya international conference on information technology and its applications (SICONIAN 2019). https://doi.org/10.2991/aisr.k.200424.016
Chen C, Chang C (2010) High-capacity reversible data-hiding for LZW codes. In: 2010 second international conference on computer modeling and simulation. https://doi.org/10.1109/ICCMS.2010.346
Kaggle. https://www.kaggle.com/. Accessed 17 August 2021
Siyah B (2018) newspaper article Turkish (for simple exercises). Kaggle. https://www.kaggle.com/bulentsiyah/hurriyet (2018). Accessed 18 August 2021
Boğan H (2021) Turkish Corpus. Kaggle. https://www.kaggle.com/redrussianarmy/turkish-corpus. Accessed 18 August 2021
Erdem H (2021) Turkish Sentence by Kuzgunlar. Kaggle. https://www.kaggle.com/rootofarch/kuzgunlar-acikhack-tr-sentence. Accessed 18 August 2021
Ozturk O (2021) 910 Turkish Articles by 69 Columnists. Kaggle. https://www.kaggle.com/oktayozturk010/910-turkish-articles-by-69-columnists. Accessed 18 August 2021
Mikolov T, Karafiát M, Burget L, Cernocký J, Khudanpur S (2010) Recurrent neural network based language model. In Proc. Interspeech. https://www.fit.vutbr.cz/research/groups/speech/publi/2010/mikolov_interspeech2010_IS100722.pdf
Zhang Z, Liu J, Ke Y, Li J, Zhang M, Yang X (2019) Generative steganography by sampling. IEEE Access 7:118586–118597. https://doi.org/10.1109/ACCESS.2019.2920313
Article Google Scholar
Zhang R, Dong S, Liu J (2019) Invisible steganography via generative adversarial networks. Multimed Tools Appl 78:8559–8575. https://doi.org/10.1007/s11042-018-6951-z
Article Google Scholar
Rosa dos Reis T (2020) Measuring the statistical similarity between two samples using Jensen-Shannon and Kullback-Leibler divergences. Medium. https://medium.com/datalab-log/measuring-the-statistical-similarity-between-two-samples-using-jensen-shannon-and-kullback-leibler-. Accessed 20 August 2021
Le Q, Mikolov T (2014) Distributed representations of sentences and documents. In Proc Int Conf Mach Learn 32:1188–1196
Google Scholar
Yang Z, Huang Y, Zhang YJ (2019) A fast and efficient text steganalysis method. IEEE Signal Process Lett 26:627–631. https://doi.org/10.1109/LSP.2019.2902095
Article Google Scholar
Din R, Yusof SAM, Amphawan A, Hussain HS, Yaacob H, Jamaludin N, Samsudin A (2015) Performance analysis on text steganalysis method using a computational intelligence approach. In: International conference on electrical engineering, computer science and informatics (EECSI 2015). https://doi.org/10.11591/eecsi.v2.772
Wen J, Zhou X, Zhong P, Xue Y (2019) Convolutional neural network based text steganalysis. IEEE Signal Process Lett 26:460–464. https://doi.org/10.1109/LSP.2019.2895286
Article Google Scholar
Vania C, Grivas A, Lopez A (2018) What do character-level models learn about morphology? The case of dependency parsing. In: Proceedings of the 2018 conference on empirical methods in natural language processing. https://doi.org/10.18653/v1/D18-1278

Download references

Acknowledgements

I would like to thank the "Sky Translation Office" for the language editing of the article.

Funding

No funding was received to assist with the preparation of this manuscript.

Author information

Authors and Affiliations

Faculty of Technology Information Systems Engineering, Burdur Mehmet Akif Ersoy Üniversitesi, 15300, Burdur, Turkey
Merve Varol Arısoy

Authors

Merve Varol Arısoy
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Merve Varol Arısoy.

Ethics declarations

Conflict of interest

Author Merve Varol Arısoy has received assistance from Professor Dr. Ecir Uğur Küçüksille only in terms of sharing information about solving the problems encountered in the project and providing the necessary guidance during the realization of the study. Except this, the author has no competing interests to declare that are relevant to the content of this article.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Reprints and permissions

About this article

Cite this article

Varol Arısoy, M. LZW-CIE: a high-capacity linguistic steganography based on LZW char index encoding. Neural Comput & Applic 34, 19117–19145 (2022). https://doi.org/10.1007/s00521-022-07499-5

Download citation

Received: 19 October 2021
Accepted: 01 June 2022
Published: 01 July 2022
Issue Date: November 2022
DOI: https://doi.org/10.1007/s00521-022-07499-5

Keywords

Use our pre-submission checklist

Avoid common mistakes on your manuscript.

LZW-CIE: a high-capacity linguistic steganography based on LZW char index encoding

Abstract

Similar content being viewed by others

A Comprehensive Review on Deep Learning-Based Generative Linguistic Steganography

TS-CSW: text steganalysis and hidden capacity estimation based on convolutional sliding windows

High-Performance Linguistic Steganalysis, Capacity Estimation and Steganographic Positioning

Explore related subjects

1 Introduction

2 Related work

2.1 Steganography based on automatic text generation

2.2 Steganography based on compression of secret data

3 Preliminaries

3.1 Generation based text steganography

3.2 LSTM network

3.3 Bahdanau (additive) attention

4 Proposed method

4.1 Automatic text generation module based on confidential information

4.1.1 Word-level text generation

4.1.2 Character-level text generation

4.2 Information embedding module

4.2.1 Embedding by perfect binary tree encoding (fixed length encoding, FLC)

4.2.2 Embedding by Huffman encoding (variable length encoding, VLC)

4.2.3 Embedding by arithmetic encoding

4.2.4 Embedding by LZW-CIE encoding

4.3 Information extraction module

5 Experiments and analysis

5.1 Data preparation and model training

5.2 Evaluation results and discussion

6 Conclusion

References

Acknowledgements

Funding

Author information

Authors and Affiliations

Corresponding author

Ethics declarations

Conflict of interest

Additional information

Publisher's Note

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Search

Navigation