1 Introduction

Advances in digitalization have expanded the scope of transactions carried out on the Internet and enabled the rapid transfer of many types of content. Although such data exchanges take place within seconds, they travel over an open network; it is therefore possible for unwanted parties to monitor the transmitted data and capture the information it contains. Steganography emerged to close this gap and to prevent the disclosure of confidential information. In steganography, the information to be hidden is embedded in media such as images, text, video, and audio, and the resulting steganographic data are transferred to the other party in a way that does not arouse suspicion. The aim is to make the steganographic data containing the secret information resemble, as closely as possible, their version without it. For a text document, this means that the stego text carrying the secret information should be very close to the same text without any embedded information. Steganography exploits redundant areas of the carrier medium into which information is embedded. Such redundancy is relatively high in images [1], and images are also easy to modify for steganographic purposes [2]; image steganography is therefore often preferred [3,4,5,6]. Although image steganography is widely used, text steganography has advantages that can make it preferable: it consumes less memory than other cover media (image, video, audio), it is faster, and text is the format most commonly used in interpersonal communication [7]. Because of these advantages, it has recently attracted the attention of many researchers. Alongside these positive features, text steganography also has difficulties: there is less redundant embedding space than in other cover carriers, and it requires complex natural language processing techniques. For these reasons, text steganography is regarded as a challenging task [8].

Text steganography is divided into three categories according to how the confidential information is embedded: format based, content based, and random/statistical generation based (Fig. 1). In format-based methods, the text is usually treated as a specially encoded image, and formal features of the document such as paragraph format, font format, punctuation marks, white space, and non-printable characters are used to hide the information [9, 10]. Format-based steganography depends on the characteristics of the written language of the text in which information hiding is performed; while it may produce good results for some languages, satisfactory results cannot be obtained for others. In this method, the length of the data to be hidden must be taken into account [7], and the changes are often difficult to detect visually [1]. The literature contains format-based steganography studies using techniques such as character spacing [11], word wrapping [12], and character encoding [13,14,15,16,17,18,19,20]. In the line and word wrapping method, the word or line is shifted up and down in order to create unique spaces for information hiding. With this method, the hidden information is preserved during printing, but it can be damaged if the format settings of the document are changed [21].

Fig. 1

Classification of text steganography

In the study in [22], punctuation marks were used to hide bits, and in [23] the spelling of words was used to hide information. The format-based steganography method has a high information hiding capacity, but its disadvantage is that it is vulnerable to text rewriting attacks [24].

In the random and statistical generation subcategory of text steganography, statistical properties of the language in which data hiding will be performed are extracted and used to create a cover text. However, the stego text created with this method tends to appear as a repetitive word/character sequence, which increases the probability of attracting the reader's attention; the computation time also increases [25, 26]. Another example of statistical text steganography appears in the study in [27]. There, a method based on the Markov model and the half-frequency crossover rule is discussed, using the statistical properties of natural language: the texts used to train the model are integrated with various language rules. Two different databases were used, since the priority was to increase embedding efficiency and capacity.

In the study in [28], the authors applied steganography to the Arabic language by integrating a Markov chain encoder/decoder with the Huffman coding algorithm in order to increase the capacity of the hidden information in the statistical steganography method.

The work in [29] is an example of statistical text steganography based on the RSA algorithm and aimed at improving security and accuracy. Accordingly, the data are encrypted by introducing distortions into the appearance of the user data, and the secret message to be transmitted is subjected to multilevel encryption using both private and public keys, making the text more resistant to cyber threats and security breaches. Since the method depends on the size of the secret message to be transmitted, the message encoded with it is limited in size. The study in [30] describes a statistical text steganography technique that combines cryptography and steganography to prioritize security; it uses the data encryption standard (DES), and the frequency of the letters in the cover text determines the positions of the bits to be hidden.

In the statistical text steganography proposed in [31], the number of bits of the secret message is reduced by generating metadata, adding header information to the first few bytes of the cover content, and mapping between the ASCII values of character strings and their corresponding binary values. In this way, the capacity of embedded confidential information can be increased. In the next step, the secret message is stored in bits of the cover text.

In the study in [32], the authors describe a statistical text steganography model that is based on multi-rule language techniques and does not require a carrier cover text. In their model, different language principles were combined alternately so that more language features could be extracted from the training text.

In the study in [33], an Omega network structure was combined with part-of-speech (POS) tagging, following the principle of replacing verbs and nouns in the cover text with verbs and nouns from the secret message. In the study in [34], which also falls under the "Random and Statistical Generation" sub-branch, letter frequency and word length were the two statistical items used; a stego word was created by means of a codebook of mappings between hidden bit sequences and lexical items.

As a result of advances in NLP, studies in text steganography have started to shift toward automatic steganographic text generation rather than the formal arrangement of a carrier text [2]. In this branch of text steganography, called linguistic steganography, the secret information is embedded in the content of the text [8]. In the category known as text modification, which comprises three subcategories, linguistic transformations equivalent to the words in the cover text are used to hide the message while preserving the semantic value of the original cover text. These transformations include techniques such as syntactic transformation [33, 34], synonym substitution [35,36,37,38], misspelled-word substitution [39], and phrase paraphrasing [40]. This type of linguistic steganography allows high imperceptibility but a limited information embedding capacity. In addition, when the original cover text and the stego text containing the confidential information are compared, there are deviations in the statistical and linguistic features of the stego text, which makes the existence of the confidential information easier to discover with linguistic steganalysis technology [8]. An example of text-modification-based steganography can be seen in the study in [41]. The approach adopted there replaces the characters of the secret message with characters of the carrier text, and the robustness of the steganography is achieved through a multilayered coding concept that includes block coding, partially homomorphic encryption, and alphabetic transformation. In [42], antonym substitution was used in text steganography instead of the frequently used synonym substitution.

Modification-based steganography that exploits the characteristics of the chosen language is also seen in the study of [7]. The authors developed an application for hiding information in Arabic texts, embedding information by using the Arabic extension character called Kashida and the small space character.

In the study in [43], information hiding within Word documents was investigated. To this end, the change-tracking feature of Microsoft Word documents was redesigned to increase embedding capacity and reduce any intermediary's suspicion about the existence of a message. The secret message was hidden in bit format using the synonym substitution method, and the proposed approach aims to keep the Word document looking entirely normal so that it does not arouse suspicion.

Coverless steganography emerged as a way to resist steganalysis attacks. In this method, no changes are made to the carrier cover text [8]; information can be hidden by producing new texts in accordance with the statistical properties of natural language [44]. There are several studies in the literature using this technique. The study of [45] is an example of coverless text steganography: the confidential information is divided into keywords, the location tag of each keyword is extracted, and these tags and keywords are then combined. In the study in [46], coverless steganography was performed using the parity of Chinese character stroke numbers (PCCSN). In the study in [47], the authors based coverless text steganography on the Markov chain model, using a maximum variable-bit embedding method instead of the usual fixed-bit embedding according to the properties and values of the transition probabilities of their model. In another study [48], binary transition sequence diagrams were created based on the transition probability concept of the Markov chain model and used to generate new texts containing confidential information. In the study in [49], a coverless steganography method based on word association properties was proposed: a word node tree is created using word association features shared between the communicating parties, the transmission path of the word node tree is coded to embed the secret information, and a mapping relationship between this path and the texts is established. The coverless steganography method is resistant to steganalysis attacks, but its information embedding capacity is quite low [8].

Steganography based on text generation was developed to overcome the low embedding capacity of the text modification and coverless types of linguistic steganography. In this method, a carrier cover text is not needed; instead, text is generated from the hidden information using language models. Generation-based steganography consists of a model that performs text generation and an embedding algorithm that places the hidden information into the text during generation [50]. The advantages of the method are that there are more positions in which to place the hidden information in the generated stego text and that there is no upper limit on the length of the generated text [8]; generation-based steganography therefore has a greater data hiding capacity. Text generation-based steganography proposals have been developed in recent years [51,52,53,54]. Among recent studies, an LSTM-based generation steganography model was proposed in [51], improving on the language models that could not be developed sufficiently with the Markov chain method [55]. Since the present study also adopts generation-based steganography, the details of studies using this method are given in the related work section.

In this study, text generation-based data hiding is carried out. For this purpose, two different language models were created and trained on the same datasets. The first model generates text at the word level; three different data embedding algorithms, namely arithmetic coding, perfect binary tree coding, and Huffman coding, were applied to this model. In addition, three different sampling types were used to predict the next word/character in both models, namely top-k sampling, temperature sampling, and greedy search, and the stego texts generated for each sampling type were compared in terms of the imperceptibility metric.

The model created in this study generates text based on the confidential information; in other words, text generation and the placement of the confidential information are carried out simultaneously, so no carrier cover text is used. The second model generates text at the character level. On this model, a new embedding scheme was created by combining the LZW compression algorithm with the Char Index method, and stego text was generated on this basis; this is the main contribution of the study. The purpose of creating two different types of model (a word-level attention-based bidirectional LSTM and a character-level encoder–decoder with the Bahdanau attention mechanism) is to compare character- and word-level models in terms of running speed, information embedding rate, imperceptibility of the generated stego text, and resistance to steganalysis. At the same time, the character-based generation model is intended to improve running speed and increase the rate of embedded confidential information. The motivation for this study is that, to the best of our knowledge, the LZW compression algorithm had not previously been used in generation-based, secret-information-driven text generation; the study is therefore expected to bring an innovation to the literature.

The contributions of the study to the literature are listed below.

  • A comparison, on the basis of the metrics listed above, among the stego texts generated by separately applying the Huffman coding, arithmetic coding, and perfect binary tree embedding algorithms to the bidirectional LSTM attention network that produces word-level text.

  • The use of three different sampling types, namely top-k sampling, temperature sampling, and greedy search, on the created language models (word level and character level), providing the opportunity to compare the quality of the stego texts on the basis of sampling.

  • The generation of stego text by applying a new embedding algorithm, created by integrating the LZW compression algorithm with the Char Index method, to the model with the Bahdanau attention mechanism that generates character-level text.

The remainder of the article is organized as follows. Current relevant studies are introduced in Sect. 2, and the technologies underlying the proposed method are summarized in Sect. 3. The framework and main modules of the proposed method, together with the data hiding and data extraction algorithms, are explained in Sect. 4. The experimental results are evaluated on the basis of the metrics used in Sect. 5. Finally, the conclusion of the article is given in Sect. 6.

2 Related work

2.1 Steganography based on automatic text generation

Text steganography is considered more challenging than other types of steganography because text files contain less redundant information than image or audio files and therefore offer less embedding space [7]. To overcome this obstacle, many methods based on text modification or requiring no carrier text have been proposed. However, these methods did not achieve sufficient efficiency, either because they could not resist steganalysis attacks or because the information embedding capacity of the generated stego texts was low; steganography based on text generation, which is considered a more promising approach, was therefore developed. In generation-based steganography, text is produced by a pre-trained language model driven by the information to be hidden, without the need for a carrier cover text.

Early research performed steganography based on automatic text generation using the Markov chain method, and language models of this kind have been used in many studies [9, 55,56,57,58,59]. The Markov-based language model proposed in [57] aims to ensure that each generated sentence contains an equal amount of confidential information; however, ignoring the differences among transition probabilities led to unsatisfactory results. Similarly, [56] used the Markov chain model to construct Ci-poetry, a classical Chinese poetry style.

Yang et al. [9] combined the Markov model with Huffman coding to overcome the quality degradation of stego text caused by fixed-length coding. During text generation, they dynamically updated the Huffman tree structure while placing the hidden information.

Although a certain level of quality has been achieved in stego texts generated with the Markov model, the method still has shortcomings: the texts it generates are not natural enough, which makes them weak against steganalysis attacks [8]. To overcome this deficiency, artificial neural network-based language models began to be developed. Accordingly, in [51] the authors first generated stego text by combining a language model trained with an LSTM network with a fixed-block data embedding algorithm. They split the secret information into blocks of bits and shared a key that assigns each bit block to a set of tokens, so that each bit block is denoted by one of its corresponding tokens.

Subsequently, [1] generated stego text from confidential information using recurrent neural networks (RNNs), with perfect-binary-tree-based fixed-length coding and Huffman-based variable-length coding as embedding algorithms. [50] investigated sampling-based stego text generation and used arithmetic coding as well as fixed-length and variable-length coding methods as embedding algorithms. They controlled the embedding rate by determining the constant "K" value of the sampling method used to select the next most probable word in the fixed-length and variable-length coding algorithms through a word filtering strategy based on Kullback–Leibler divergence (KLD); for arithmetic coding, they adopted temperature-based sampling.

Another study using arithmetic coding as the embedding scheme is [60]. There, the authors proposed an algorithm that selects the minimum value of K in top-K sampling, the sampling type they use, that achieves the required imperceptibility.

In [54], the authors developed a steganography system that aims to deceive both statistical and human eavesdroppers by combining a pretrained large language model with an arithmetic-coding-based steganography algorithm.

Lingyun et al. [8] developed an LSTM-based language model and generated character-level text. They generated a large number of stego texts simply by changing the beginning of the feed information given to the model while using the same hidden information, and then designed a selection strategy to find the best-quality text among the generated candidate stego texts. Similarly, in the study of [2], stego text based on confidential information was generated by developing an LSTM-based language model with an attention mechanism, and fixed-length coding (FLC) and variable-length coding (VLC) algorithms were used as the embedding methods for placing the confidential information in the text.

Steganography has not only been used to create plain texts; poems have also been used to carry confidential information. The authors of [61] created a network model based on template-constrained generation with an encoder–decoder architecture to produce classical Chinese poetry containing confidential information. They used an LSTM encoder–decoder model to create the first line of the stanza from a keyword and then generated the remaining lines one by one.

Zhou et al. [62] proposed a linguistic steganography method based on an adaptive probability distribution and a generative adversarial network (GAN). The GAN-based model they developed was designed to withstand high-dimensional steganalyzers, and data hiding was performed during training by combining the information embedding algorithm with a feature function.

Yang et al. [63] took a different approach by examining the sampling type used in steganographic text generation. In their method, called categorical sampling, while predicting the next word they sampled the whole dictionary k times to create a candidate pool (CP) of k words: each time they took the word with the highest probability, added it to the candidate pool, and sampled again to determine the next most probable word, repeating this process k times. They embedded the secret information by applying arithmetic coding to the candidate pool. Although the authors claim that the method provides statistical imperceptibility, it imposes a heavy computational load when a large corpus and a large value of k are used, so it is clearly difficult to apply in practice.

In another study, Yang et al. [64] described a structure they call VAE-Stega to address the problem of perceptual and statistical imperceptibility in stego texts. In the encoder stage of this structure, the statistical distribution of normal texts is learned, and in the decoder stage, steganographic sentences are generated that fit both the statistical language model and the overall statistical distribution of normal sentences.

2.2 Steganography based on compression of secret data

Compression algorithms have been used in the literature to increase the amount of confidential information placed inside the carrier cover text.

In [65], the authors conducted a steganography study on e-mail texts using a combinatorial compression method. They used a combination of the Burrows–Wheeler transform (BWT), move-to-front (MTF) coding, and the LZW algorithm to increase the capacity of the hidden information. They also made use of the character count of the e-mail ID to indicate the hidden bits, adding random characters before the "@" symbol of e-mail IDs to increase randomness. In another study [66], they investigated e-mail steganography, this time using the Huffman compression algorithm.

Tutuncu and Hassan [67] used lossless compression techniques and the Vigenère cipher in their e-mail-based text steganography study, using e-mail addresses to insert and extract the secret message. After choosing the carrier text with the highest repetition pattern for the secret message, they created a distance matrix. They used run-length encoding (RLE), the Burrows–Wheeler transform (BWT), move-to-front (MTF) coding, and arithmetic coding as compression algorithms, and Latin squares and the Vigenère cipher to generate stego keys. Using these keys, they placed the confidential information in the e-mail addresses.

In the study in [68], the capacity and security problems of stego text generation were examined using an embedding algorithm obtained by combining the LZW compression algorithm with color coding. The method was applied in the e-mail environment, embedding information in both the e-mail address and the message content: the hidden data bits were placed in the message by coloring characters with the help of a color-coding table, and the LZW compression algorithm was used to increase the embedding rate of the confidential information.

The adaptation of compression algorithms to text steganography is also seen in the study of [36]. There, the length of the information to be hidden was reduced with the word indexing compression (WIC) algorithm, and the stego text with the highest imperceptibility was chosen with a stego text selection strategy; the confidential information was placed in the carrier text using synonym substitution.

The study in [69] addresses e-mail steganography, a sub-branch of text steganography, in which the secret message is concealed within e-mail addresses created from the e-mail body. In the proposed scheme, the secret message is first converted into a bitstream using the LZW algorithm, and the resulting bitstream is then embedded into the corresponding recipient addresses using steganography keys.

3 Preliminaries

This section gives brief information about the technologies used in creating the text generation model.

3.1 Generation based text steganography

In generation-based text steganography, text is generated from the information to be hidden, so a carrier text is no longer needed. This type of steganography requires a trained language model and embedding algorithms to place the secret information.

A text consists of a sequence of words. Each sentence in the text is expressed as \(S = \left\{ {W_{i} } \right\}_{i = 1}^{n}\), where \(W_{i}\) denotes the words in the sentence. In text generation, the probability of a word appearing in a sentence depends on the conditional probability distribution given all previous words. This is expressed as in Eq. (1) [2].

$$\begin{aligned} {\text{Prob}}\left( S \right) & = {\text{Prob}}\left( {w_{1} ,w_{2} , \ldots ,w_{n} } \right) \\ & = \mathop \prod \limits_{i = 1}^{n} {\text{Prob}}\left( {w_{i} |w_{i - 1} ,w_{i - 2} , \ldots ,w_{1} } \right) \\ \end{aligned}$$
(1)

The Prob(S) value in the above equation should be kept as high as possible; ideally, it is expected to be 1. To generate the sentence "S", a word is first given to the system as seed data, and then the word most likely to follow it is predicted. During the selection of each new word, a candidate pool is created from all possible words and the new word is chosen from this pool. This process continues in the same way throughout the generation of the entire text [2].
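As a minimal illustration of this step, the sketch below (not the authors' code; the vocabulary and toy distribution are assumptions) shows how a candidate pool of the m most probable next words can be built from a model's conditional distribution.

```python
# Minimal sketch: build a candidate pool of the m most probable next words
# from a language model's conditional distribution over the vocabulary.
import numpy as np

def candidate_pool(probabilities, vocabulary, m):
    """Return the m most probable words, ordered from highest to lowest probability."""
    top_indices = np.argsort(probabilities)[::-1][:m]   # indices of the largest probabilities
    return [(vocabulary[i], float(probabilities[i])) for i in top_indices]

# Toy example with a 5-word vocabulary (illustrative values only)
vocab = ["okul", "kitap", "defter", "kalem", "tarih"]
probs = np.array([0.40, 0.25, 0.15, 0.12, 0.08])
print(candidate_pool(probs, vocab, m=4))
```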

Generation-based steganography is adopted in this study. A model was created using an LSTM network with the Bahdanau attention mechanism, and Turkish newspaper articles, poems, and various articles from the Kaggle database were used as the corpus. The perfect binary tree, Huffman, and arithmetic coding algorithms, together with the LZW-CIE algorithm proposed in this study, were used to embed hidden information during text generation.

3.2 LSTM network

A recurrent neural network (RNN) is incapable of dealing with long-range dependencies because of the vanishing gradient problem [8]. For this reason, the LSTM network, which is more effective at finding and using long-range context and is widely used in sequence problems, is preferred. An LSTM network is defined as in Eq. (2).

$$\left. {\begin{array}{*{20}l} {I_{t} = \sigma \left( {W_{i} \cdot x_{t} + U_{i} \cdot h_{t - 1} + b_{i} } \right)} \hfill \\ {F_{t} = \sigma \left( {W_{f} \cdot x_{t} + U_{f} \cdot h_{t - 1} + b_{f} } \right)} \hfill \\ {C_{t} = F_{t} \cdot C_{t - 1} + I_{t} \cdot \tanh \left( {W_{c} \cdot x_{t} + U_{c} \cdot h_{t - 1} + b_{c} } \right)} \hfill \\ {O_{t} = \sigma \left( {W_{o} \cdot x_{t} + U_{o} \cdot h_{t - 1} + b_{o} } \right)} \hfill \\ {h_{t} = O_{t} \cdot \tanh \left( {C_{t} } \right)} \hfill \\ \end{array} } \right\}$$
(2)

The LSTM network takes \(x_{t}\), \(h_{t - 1}\), and \(C_{t - 1}\) as input at time t and then calculates the values \(h_{t}\) and \(C_{t}\). \(I_{t}\), \(F_{t}\), and \(O_{t}\) denote the input gate, forget gate, and output gate vectors, respectively, at time t, and \(C_{t}\) is the cell activation vector; these four vectors have the same dimension as the hidden vector \(h_{t}\). When t = 1, \(h_{0}\) and \(C_{0}\) are initialized to the zero vector. \(\sigma\) is the logistic sigmoid function. \(W_{i}\), \(W_{f}\), \(W_{c}\), \(W_{o}\), \(U_{i}\), \(U_{f}\), \(U_{c}\), \(U_{o}\) are the weight matrices to be learned, and \(b_{i}\), \(b_{f}\), \(b_{c}\), \(b_{o}\) are the bias vectors to be learned [8].
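For concreteness, the following numpy sketch implements a single LSTM step exactly as written in Eq. (2); the parameter shapes and random initialization are illustrative assumptions, not the trained model's weights.

```python
# Minimal numpy sketch of one LSTM step following Eq. (2).
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """W, U, b are dicts keyed by gate: 'i' (input), 'f' (forget), 'c' (cell), 'o' (output)."""
    i_t = sigmoid(W['i'] @ x_t + U['i'] @ h_prev + b['i'])                     # input gate
    f_t = sigmoid(W['f'] @ x_t + U['f'] @ h_prev + b['f'])                     # forget gate
    c_t = f_t * c_prev + i_t * np.tanh(W['c'] @ x_t + U['c'] @ h_prev + b['c'])  # cell state
    o_t = sigmoid(W['o'] @ x_t + U['o'] @ h_prev + b['o'])                     # output gate
    h_t = o_t * np.tanh(c_t)                                                   # hidden state
    return h_t, c_t

# Usage with arbitrary sizes and random parameters (for illustration only)
rng = np.random.default_rng(0)
d, k = 8, 4                                   # input and hidden dimensions
W = {g: rng.normal(size=(k, d)) for g in 'ifco'}
U = {g: rng.normal(size=(k, k)) for g in 'ifco'}
b = {g: np.zeros(k) for g in 'ifco'}
h, c = lstm_step(rng.normal(size=d), np.zeros(k), np.zeros(k), W, U, b)
```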

In an RNN, the short-term memory is repeatedly multiplied through h, so the gradient vanishes. In an LSTM, accumulation is used instead of multiplication, which solves the vanishing gradient problem; for this reason, LSTM is preferred over the plain RNN in language modeling [8].

Bidirectional LSTM is an extension of the traditional LSTM network and is used to increase the performance of the developed language model. The input sequence is processed in two directions (forward and backward), so the model can learn information about both the previous and the following words [70].

3.3 Bahdanau (additive) attention

A model with an attention mechanism detects the words that need special attention, or that are determined to carry more importance in the sentence, among the words given as input, and tries to predict the next word based on these words. The generated texts thus have a more semantic and logical structure.

This mechanism attempts to capture useful context information between words regardless of the distance between the input and target words: the context information from the encoder is efficiently exploited to provide the expected context for the decoder. Using the attention mechanism is important for obtaining a more advanced language model and for making a more effective content selection [71]. In language models without this mechanism, the next word is simply predicted in sequence.

The Bahdanau (additive) attention mechanism was first applied in the field of neural machine translation by [72] within an encoder–decoder architecture. In traditional language models, the hidden vectors (h1, h2, …, hn) of the words in the input are transformed linearly into a single content vector (Ci) and then passed to the decoder when predicting the target word. When the attention mechanism is used, the hidden vectors of all the words in the input are taken into account when creating the content vector Ci; in other words, Ci is determined by the weighted average of all hidden vectors (h1, h2, …, hn) with attention weights \(\left\{ {a_{i,j} \,|\, 1 \le i \le M,\; 1 \le j \le N} \right\}\). Each \(a_{i,j}\) is calculated from the decoder hidden state \(z_{i - 1}\) and the encoder hidden vectors (h1, h2, …, hn) [2]. The attention mechanism is applied as in Eqs. (3) and (4).

$$c_{i} = \mathop \sum \limits_{j = 1}^{N} a_{i,j} h_{j}$$
(3)
$$a_{i,j} = {\text{align}}\left( {z_{i - 1} ,h_{j} } \right) = \frac{{\exp \left( {{\text{score}}\left( {z_{i - 1} ,h_{j} } \right)} \right)}}{{\sum_{k} \exp \left( {{\text{score}}\left( {z_{i - 1} ,h_{k} } \right)} \right)}}$$
(4)

The score function (\({\text{score}}\left( {z_{i - 1} ,h_{j} } \right)\)) in Eq. (4) scores how well the inputs around position j and the output at position i match [72, 73]. The scoring function is expected to indicate how important each hidden state is for the current time step. Finally, each encoder hidden state is multiplied by its corresponding weight, and the results are summed to obtain the context (content) vector [74]. Figure 2 illustrates the application of the attention mechanism, which is used in the character-level language model architecture of the proposed study.

Fig. 2

Bahdanau attention mechanism [75]
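The additive attention computation of Eqs. (3) and (4) can be sketched as follows; the scoring parameters (W_a, U_a, v_a) and the dimensions are illustrative assumptions for a generic additive score function, not the exact parameterization used in the study.

```python
# Minimal sketch of the additive (Bahdanau) attention step in Eqs. (3)-(4):
# scores are softmax-normalized and used to weight the encoder hidden states.
import numpy as np

def bahdanau_context(z_prev, H, W_a, U_a, v_a):
    """z_prev: previous decoder state (k,); H: encoder hidden states (N, k)."""
    scores = np.array([v_a @ np.tanh(W_a @ z_prev + U_a @ h_j) for h_j in H])  # score(z_{i-1}, h_j)
    weights = np.exp(scores - scores.max())
    weights = weights / weights.sum()          # Eq. (4): softmax over the scores
    context = weights @ H                      # Eq. (3): weighted sum of encoder states
    return context, weights

# Usage with random illustrative values
rng = np.random.default_rng(1)
k, N = 4, 6
context, alpha = bahdanau_context(rng.normal(size=k), rng.normal(size=(N, k)),
                                  rng.normal(size=(k, k)), rng.normal(size=(k, k)),
                                  rng.normal(size=k))
print(alpha.round(3), context.shape)
```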

4 Proposed method

In this study, stego text generation was performed that guarantees high-capacity message hiding while producing natural-looking text that does not reveal the presence of hidden information when the stego text is transmitted over a monitored channel. The proposed method generates stego text based on the secret message. On the language model created by combining a bidirectional LSTM network with an attention mechanism, four different hidden information embedding algorithms are applied, namely perfect binary tree, Huffman, arithmetic coding, and LZW-CIE. In addition, the stego texts generated with the top-k, temperature, and greedy search sampling types during text generation are evaluated comparatively in terms of imperceptibility metrics.

Two different models, producing word-level and character-level text, were used in the study. Perfect binary tree, Huffman, and arithmetic coding were used as the secret information embedding algorithms in the word-level generation model. In the character-level model, a new secret information embedding algorithm was created by combining the LZW compression algorithm with the Char Index method; the creation of this new embedding algorithm is the study's contribution to the literature. Figure 3 presents the framework of the confidential information embedding structure of the study.

Fig. 3

General Framework

This section gives the details of the proposed framework, which consists of three main modules: the automatic text generation module based on confidential information, the information embedding module, and the confidential information extraction module. In the module that generates text based on the hidden information, a bidirectional LSTM model with an attention mechanism, trained on a Turkish corpus, generates text using the top-k, temperature, and greedy search sampling types while taking the conditional probability distribution into account. In the confidential information embedding module, four different embedding algorithms are applied, namely perfect binary tree coding (fixed-length coding, FLC), Huffman coding (variable-length coding, VLC), arithmetic coding (AC), and LZW-CIE. The confidential information extraction module covers the operations performed on the receiver side: the embedding algorithm used by the sender to place the confidential information is applied in the same way on the receiver side. In other words, if AC coding is used when inserting the confidential information, the same algorithm is also used in the extraction phase.

4.1 Automatic text generation module based on confidential information

Text generation is performed with a model trained on a Turkish corpus, exploiting the ability of the LSTM network to process sequential signals. Text generation was performed on two different models, one at the word level and one at the character level.

4.1.1 Word-level text generation

So that the word-level model produces texts with high semantic value, the bidirectional LSTM network in its architecture is supported by an attention mechanism. The model trained on the Turkish corpus then generates words based on the bitstream of the secret message. In this study, the word-level language model is used in an integrated manner with the perfect binary tree (FLC), Huffman (VLC), and arithmetic coding (AC) methods applied in the embedding phase.

In the attention mechanism implemented on the word-level language model, an Attention class derived from the default "Layer" class of the Keras API is used; in other words, a custom attention class is added to the model as a layer. The architecture of the word-level model is given in Fig. 4.

Fig. 4

Bidirectional LSTM with custom attention architecture
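A possible realization of this architecture is sketched below, assuming TensorFlow/Keras; the custom Attention layer, vocabulary size, and layer widths are illustrative assumptions rather than the exact configuration used in the study.

```python
# Sketch of a word-level bidirectional LSTM with a custom attention layer
# derived from keras.layers.Layer (hyperparameters are assumed, not the authors').
import tensorflow as tf

class Attention(tf.keras.layers.Layer):
    def build(self, input_shape):
        d = input_shape[-1]
        self.W = self.add_weight(name="att_w", shape=(d, 1), initializer="glorot_uniform")
        self.b = self.add_weight(name="att_b", shape=(1,), initializer="zeros")

    def call(self, h):                                  # h: (batch, time, d)
        e = tf.tanh(tf.matmul(h, self.W) + self.b)      # unnormalized scores (batch, time, 1)
        a = tf.nn.softmax(e, axis=1)                    # attention weights over time steps
        return tf.reduce_sum(a * h, axis=1)             # context vector (batch, d)

vocab_size = 20000                                      # assumed vocabulary size
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, 128),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(256, return_sequences=True)),
    Attention(),
    tf.keras.layers.Dense(vocab_size, activation="softmax"),   # next-word distribution
])
model.compile(loss="sparse_categorical_crossentropy", optimizer="adam")
```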

4.1.2 Character-level text generation

The character-level language model is used in an integrated manner with the LZW-CIE (LZW-Char Index Encoding) embedding method to predict the next character. The character-level model was created both to shorten the training time and to provide more capacity for placing confidential information. The characters of the feed information given to this model are representative characters corresponding to the secret message; the stego text is therefore generated based on the secret message. The bidirectional LSTM network with encoder–decoder architecture used here is supported by the Bahdanau attention mechanism. Figure 5 gives the architecture of the character-level model.

Fig. 5

Encoder–decoder model architecture with Bahdanau attention mechanism

The calculation of the context vector in the figure above is shown in Eq. (5) [72].

$${\text{context}} = \mathop \sum \limits_{{t^{\prime} = 1}}^{{T_{x} }} \alpha \left( {t^{\prime}} \right)h\left( {t^{\prime}} \right)$$
(5)

\(\alpha \left( {t^{\prime}} \right)\) in the above equation represents the attention weights, whose calculation is given in Eq. (6); \(h_{{t^{\prime}}}\) denotes the encoder state outputs [72].

$$\alpha \left( {t^{\prime}} \right) = {\text{NeuralNet}}\left( {\left[ {s_{t - 1} ,h_{{t^{\prime}}} } \right]} \right), t^{\prime} = 1 \ldots T_{x}$$
(6)

The inputs of the attention layer in Fig. 2 are the outputs of the LSTM (encoder) units and the decoder output from the previous time step. At each time step t of the decoder, the level of attention given to the encoder hidden state \(h_{j}\) is indicated by \(\alpha_{tj}\), and the score \(e_{tj}\) is calculated as a function of \(h_{j}\) and the decoder's hidden state at the previous time step (\(s_{t - 1}\)), as in Eq. (7). The softmax function in Eq. (8) is used in the last step to normalize the attention values [72].

$$e_{tj} = a\left( {h_{j} ,s_{t - 1} } \right),\quad \forall j \in \left[ {1,T} \right]$$
(7)
$$\alpha_{tj} = \frac{{\exp \left( {e_{tj} } \right)}}{{\mathop \sum \nolimits_{k = 1}^{T} \exp \left( {e_{tk} } \right)}}$$
(8)

4.2 Information embedding module

The process of placing the secret message in the text was carried out in four different ways. The subsections below give the details of each embedding method applied in the study.

4.2.1 Embedding by perfect binary tree encoding (fixed length encoding, FLC)

In a perfect binary tree, every internal node has two child nodes and all leaf nodes are at the same level. Because all leaf nodes have the same depth, each word stored in a leaf is expressed with an equal bit length. For example, the word "school-okul" in Fig. 6 is represented by the binary code "000" from root to leaf, while the word "book-kitap" is represented by the code "101"; in other words, every leaf node is represented by a fixed-length binary code. Since the depth (d) of the example tree in Fig. 6 is 3, the leaf codes are also 3 bits long; if the depth were 2, the leaf nodes would be represented by 2 bits. In this study, each leaf node is expressed with fixed-length coding at embedding rates of 1, 2, 3, 4, and 5 bits. An example of perfect binary tree coding is given in Fig. 6.

Fig. 6

Perfect binary tree (fixed-length) encoding

In the module that embeds the secret message using the perfect binary tree, a candidate word pool is needed to find the word corresponding to the message bits. The conditional probabilities of the words are taken into account when creating this pool: each word has a conditional probability of the form \(p(w_{n} |w_{1} , w_{2} , \ldots ,w_{n - 1} )\). When the trained language model predicts the next word, all probability values are sorted from largest to smallest and the "m" words with the highest probabilities form the candidate word pool. Perfect binary tree coding then assigns each word in the candidate pool a binary code value. The tree is traversed from the root to the leaves to find the leaf whose binary code equals the next k-length bitstream of the secret message; the word in that leaf becomes the word representing the current bitstream, i.e., the bitstream value is replaced by the word in the leaf node. The selected word is appended to the seed words given to the model, and the model predicts again with the new feed words. At each new prediction, all probability values are again sorted from highest to lowest and the first m of them form a new candidate word pool; in other words, the words in the pool are renewed after each embedding step so that the pool has a dynamic structure. The value of m is a power of two, \(m = 2^{k}\), where k is the binary code length of each word. This process continues until the bitstream of the secret message is exhausted. If the feed sentence given to the model has not yet been completed when all the hidden information has been embedded, the model selects the next most probable word and text generation continues. Text generation based on confidential information can be repeated over any number of loops, which not only allows text of the desired length to be generated but also increases the capacity of the embedded message by placing confidential information at many different points of the generated text.
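The following sketch summarizes this fixed-length embedding loop under stated assumptions: the hypothetical helper predict_topm(context, m) stands in for the trained language model and is assumed to return the m most probable next words in descending order of probability.

```python
# Minimal sketch of fixed-length (perfect binary tree) embedding: with m = 2**k
# candidate words, the next k secret bits directly index a leaf of the tree.
def flc_embed(secret_bits, seed_words, predict_topm, k=3):
    pool_size = 2 ** k                                    # m = 2^k leaves
    stego, context = [], list(seed_words)
    for start in range(0, len(secret_bits), k):
        block = secret_bits[start:start + k].ljust(k, "0")  # pad the final block with zeros
        candidates = predict_topm(context, pool_size)       # words ordered by probability
        word = candidates[int(block, 2)]                     # leaf selected by the k bits
        stego.append(word)
        context.append(word)                                 # feed the model with the new word
    return stego
```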

The first word of the feed sentence given to the model was chosen randomly from among the 100 most frequent words in the corpus, and the remaining words of the feed sentence were selected randomly from the words used in training the model. This procedure for creating the feed sentence applies when the perfect binary tree, Huffman, or arithmetic coding methods are used to embed information; the details of how the feed information is generated when the LZW compression algorithm is used are discussed in Sect. 4.2.4. A detailed illustration of hidden information embedding with the fixed-length coding method is given in Fig. 7, where the bitstream "110" corresponds to the word "ledger-defter."

Fig. 7

Secret bitstream embedding by fixed length coding (perfect binary tree)

4.2.2 Embedding by Huffman encoding (variable length encoding, VLC)

The Huffman tree is another binary tree structure in which, unlike the perfect binary tree, the code lengths used to encode words differ from one another. Huffman coding is in fact a compression method: frequently used words are represented by shorter codes, while rarely used words are represented by longer codes. Therefore, when the candidate word pool is organized as a Huffman tree, variable-length coding is obtained, because each word has a binary code of a different length.

In the Huffman tree structure, as in the perfect binary tree, a candidate word pool is created by taking the m words with the highest probability of following the feed information; the conditional probability values of the words are sorted from largest to smallest, and the first m of them are taken.

When the VLC embedding method is used, the bitstream of the secret message is not taken in blocks of k bits but is read bit by bit: if the current bit is "0," the walk moves to the left child of the Huffman tree, and if it is "1," to the right child. This continues until all bits of the message have been read. If a leaf node is reached before the hidden bitstream is exhausted, the word in that leaf corresponds to the bits read up to that point. The selected word is appended to the seed words given to the model, and the model predicts again with the new feed words. At each new prediction, all probability values are sorted from highest to lowest and the first m of them form a new candidate word pool; in other words, the words in the pool are renewed after each embedding step so that the pool has a dynamic structure. If the embedding of the secret message's bitstream is not yet complete, the remaining bits continue to be read and the embedding procedure above is repeated. If the feed sentence given to the model has not yet been completed when all the hidden information has been embedded, the model selects the next most probable word and text generation continues. As in fixed-length embedding, the variable-length method finds the word that replaces the hidden bits by traversing the tree from the root to a leaf node. A detailed illustration of hidden information embedding with the variable-length coding method is given in Fig. 8.

Fig. 8

Secret bitstream embedding by variable length coding (Huffman tree)
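A compact sketch of this variable-length embedding is given below, assuming a hypothetical predict_topm(context, m) that here returns (word, probability) pairs; the Huffman tree is rebuilt from the candidate pool at every step, and the secret bits steer the root-to-leaf walk. The leftmost-leaf fallback for bits that run out mid-tree is a simplification, not the authors' specified behavior.

```python
# Minimal sketch of variable-length (Huffman) embedding: 0 = left subtree, 1 = right subtree.
import heapq, itertools

def build_huffman(candidates):                 # candidates: list of (word, probability)
    counter = itertools.count()                # tiebreaker so payloads are never compared
    heap = [(p, next(counter), w) for w, p in candidates]
    heapq.heapify(heap)
    while len(heap) > 1:
        p1, _, left = heapq.heappop(heap)
        p2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (p1 + p2, next(counter), (left, right)))
    return heap[0][2]                          # root: nested (left, right) tuples, leaves are words

def vlc_embed(secret_bits, seed_words, predict_topm, m=8):
    stego, context, i = [], list(seed_words), 0
    while i < len(secret_bits):
        node = build_huffman(predict_topm(context, m))
        while isinstance(node, tuple) and i < len(secret_bits):
            node = node[int(secret_bits[i])]   # walk the tree bit by bit
            i += 1
        while isinstance(node, tuple):         # bits exhausted mid-tree: take leftmost leaf
            node = node[0]
        stego.append(node)
        context.append(node)                   # feed the selected word back to the model
    return stego
```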

Algorithm 1 details the confidential information embedding steps when the FLC and VLC coding methods are used. Based on this algorithm, natural-looking stego texts driven by the confidential information are generated and then sent to the recipient over an open channel with high confidentiality. The steps implemented in Algorithm 1 are based on the study of [1].

Algorithm 1

4.2.3 Embedding by arithmetic encoding

Arithmetic coding performs lossless compression based on the probability distribution of the words. Because it does not require blocking, arithmetic coding is more effective in practice than Huffman coding [54]. Traditionally, arithmetic coding encodes a set of items into a bit string [60]. In the secret message embedding function based on arithmetic coding used in this study, each word of the secret message is first converted into its bitstream equivalent, and the decimal values of the bitstreams are then calculated. As in the other embedding methods, the model predicts the m words with the highest probability, and a decimal value is calculated for each word in the candidate pool. Based on these decimal values, the lowest and highest bounds of the intervals are determined, and the interval containing the decimal value of the current secret-message word is identified; the candidate word corresponding to that interval is substituted for the secret-message word, and text generation continues. The details of the arithmetic embedding method are discussed below and in Figs. 9 and 10.

Fig. 9

Candidate word pool interval

Fig. 10

Arithmetic embedding

Encoding In the coding process, the m words with the highest probability are first converted into binary. For a bit string in the format \(n = [n_{1} ,n_{2} , \ldots ,n_{L} ]\), a decimal value is calculated with the formula \(B\left( n \right) = \mathop \sum \nolimits_{i = 1}^{L} n_{i} \times 2^{ - i}\). For example, the decimal value of the bit string \(n = \left[ {0,1,1} \right]\) is \(B\left( n \right) = 0 \times 2^{ - 1} + 1 \times 2^{ - 2} + 1 \times 2^{ - 3} = 0.375\) [60].

During arithmetic coding, communication with the language model continues: a candidate word pool is created by taking the m words with the highest probability of following the feed word given to the model. The conditional probability distributions of these m words are first converted into binary and then into decimal; the purpose of the decimal conversion is to keep the values within the range [0, 1). Using the calculated decimal values, an interval is determined with the formula \(d = {\text{upper}}\;{\text{limit}} - {\text{lower}}\;{\text{limit}}\). When arithmetic coding starts, the interval is [0, 1), and it narrows as coding proceeds. For example, suppose the decimal value of the word "time" to be hidden is 0.75 and the probabilities of the words in the candidate pool are ("school-okul": 0.2, "book-kitap": 0.5, "notebook-defter": 0.3). The representation of the candidate words over the interval is then as shown in Fig. 9.

Looking at Fig. 9, the word with decimal value 0.75, "time," corresponds to "notebook" under arithmetic coding; this completes the first embedding step. For the next word to be hidden, the interval becomes [0.7, 1). After each embedding, the model again predicts m words to form a new candidate pool, and a decimal value is calculated for each word in the new pool. Suppose the second word to hide is "passing-geçiyor" and its decimal value is 0.71. The calculation step is then as in Eq. (9).

$$\left. {\begin{array}{*{20}l} {d = {\text{upper}}\;{\text{limit}} - {\text{lower}}\;{\text{limit}}} \hfill \\ {{\text{Range}}\;{\text{of}}\;{\text{word}} = {\text{lower}}\;{\text{limit}}:{\text{lower}}\;{\text{limit}} + d \times \left( {{\text{probability}}\;{\text{of}}\;{\text{word}}} \right)} \hfill \\ {{\text{Range}}\;{\text{of}}\;{\text{word}} = 0.7:0.7 + \left( {1 - 0.7} \right) \times \left( {0.05} \right)\quad {\text{interval}}\;{\text{calculation}}\;{\text{for}}\;{\text{the}}\;{\text{word}}\;{\text{``date-tarih''}}} \hfill \\ \end{array} } \right\}$$
(9)

For example, suppose that the decimal values (probabilities) in the newly created sample word pool are ("date-tarih": 0.05, "season-mevsim": 0.01, "year-yıl": 0.06). The AC algorithm first calculates the interval to which each word in the candidate pool is assigned, taking the equality in Eq. (9) into account. A range ruler is thus created as in Fig. 10, and the interval of the word to be hidden is determined; the word replacing "passing" in our example is the word "date" shown in Fig. 10.
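The interval-selection step of this arithmetic-coding embedding can be sketched as follows; the helper names are illustrative assumptions, and the example reproduces the worked case above, in which the secret value 0.75 falls into the "notebook-defter" sub-interval.

```python
# Minimal sketch of the interval step of the arithmetic-coding embedding:
# the current [low, high) range is split in proportion to the candidate words'
# probabilities, and the word whose sub-interval contains the secret value is emitted.
def bits_to_fraction(bits):
    """B(n) from the text: interpret a bit string as a value in [0, 1)."""
    return sum(int(b) * 2 ** -(i + 1) for i, b in enumerate(bits))

def ac_select(secret_value, candidates, low, high):
    """candidates: list of (word, probability) whose probabilities sum to 1 within the pool."""
    d = high - low
    for word, prob in candidates:
        upper = low + d * prob
        if low <= secret_value < upper:
            return word, low, upper            # narrowed interval for the next step
        low = upper
    return candidates[-1][0], low, high        # numerical fallback

value = bits_to_fraction("11")                 # 0.75, as in the worked example
word, low, high = ac_select(value, [("okul", 0.2), ("kitap", 0.5), ("defter", 0.3)], 0.0, 1.0)
print(word, low, high)                         # "defter" on the [0.7, 1.0) sub-interval
```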

All of the above operations continue in the same order until the embedding of the entire message to be hidden is complete. Algorithm 2 gives a detailed description of embedding with arithmetic coding; its steps are based on the study of [60].

Algorithm 2

4.2.4 Embedding by LZW-CIE encoding

LZW is a lossless data compression algorithm proposed by Terry A. Welch [76]. The basis of the LZW (Lempel–Ziv–Welch) compression algorithm is to replace character sequences in the text with symbols (codes). The same initial character table is required for both compression and decoding [77].

In this compression algorithm, each symbol is represented by a code of fixed length. The LZW dictionary is created dynamically, without the dictionary having to be exchanged between the encoder and decoder. While the LZW encoder compresses the data, the source text is scanned sequentially, and strings not yet in the dictionary are inserted into the next unused entry of the dictionary; the previously scanned string is output as its corresponding code. The more symbols there are in the dictionary, the higher the LZW compression ratio. When decoding, the dictionary used during compression is re-created in the same way and codes are converted back into symbols [78]. In this study, the LZW algorithm is used to compress the secret message and to extract the message from the generated text document.
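As an illustration of the compression step, the sketch below implements textbook LZW over a 29-letter lowercase Turkish alphabet; the initial dictionary layout and the toy message are assumptions, not the exact implementation used in the study.

```python
# Minimal sketch of LZW compression with a 1-based initial dictionary over the alphabet.
def lzw_compress(message, alphabet):
    dictionary = {ch: i + 1 for i, ch in enumerate(alphabet)}   # single letters get codes 1..29
    next_code = len(alphabet) + 1
    current, codes = "", []
    for ch in message:
        if current + ch in dictionary:
            current += ch                        # extend the longest known match
        else:
            codes.append(dictionary[current])    # emit the code of the longest known string
            dictionary[current + ch] = next_code # add the new string to the dictionary
            next_code += 1
            current = ch
    if current:
        codes.append(dictionary[current])
    return codes

turkish = "abcçdefgğhıijklmnoöprsştuüvyz"        # 29 letters
print(lzw_compress("ababab", turkish))           # [1, 2, 30, 30]
```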

Encoding At the embedding stage, the message to be hidden is compressed with the LZW compression algorithm, producing a list of codes corresponding to the symbols in the message; the purpose of compression is to increase the capacity of the hidden information. Since the Turkish alphabet consists of 29 letters, an "x" and a "y" value are obtained for each code by dividing it by 29 and taking its remainder modulo 29. A number starting from 1 is then assigned to each letter of the alphabet with the method we call "Char Index," and each "x" and "y" value is replaced by the letter at that position in the alphabet: the numeric value of "x" is represented by one letter and the numeric value of "y" by another, producing a list of two-letter strings. Each string of letters in this list is a representation of the message to be hidden. Random feed information is taken from the corpus used to train the model, and the created two-letter string is appended to this feed information. As the last step, after the representation of the secret message has been matched to letter characters, the two letters with the highest probability of following the two-letter sequence are predicted by the model one after the other, and these letters replace the letter characters representing the secret message. Then, to complete the word, a random number of characters is generated and appended to the previously generated two-letter sequence. As a result, the variable number of letters in each word leaves no clue about the existence of confidential information. This process continues until all the two-letter representations corresponding to the secret message are exhausted, after which normal text is generated. The application steps of the LZW-CIE encoding method are shown in Fig. 11.

Fig. 11

LZW-CIE application steps (embedding and extraction phase)
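The Char Index step described above might be sketched as follows; the quotient/remainder split and the index offsets are assumptions made for illustration, since the exact offset conventions are implementation details not spelled out here.

```python
# Minimal sketch of the Char Index mapping: each LZW code is split into a
# quotient/remainder pair over the 29-letter alphabet and both values are mapped
# to letters (offsets below are illustrative assumptions).
turkish = "abcçdefgğhıijklmnoöprsştuüvyz"        # 29 letters, char index 1..29

def code_to_pair(code, alphabet=turkish):
    n = len(alphabet)
    x, y = divmod(code, n)                       # "divide by 29" and "mod 29"
    return alphabet[x % n] + alphabet[(y - 1) % n]  # two representative letters

codes = [1, 2, 30, 30]                           # e.g. output of lzw_compress("ababab", turkish)
pairs = [code_to_pair(c) for c in codes]
print(pairs)                                     # two-letter representations of the secret message
```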

4.3 Information extraction module

Extracting the secret message from the stego text is carried out by applying the same algorithms used during embedding (perfect binary tree, Huffman, arithmetic coding, and LZW-CIE), this time on the receiving side. When hidden information is extracted with the perfect binary tree, Huffman, or arithmetic coding methods, the conditional probability of the next word is calculated with the word-level model; when extraction is performed with the proposed LZW-CIE algorithm, the character-level model is used for the conditional probability calculation.

In the hidden information extraction step, the stego text is first divided into lines consisting of a fixed number of words; the words in each line are then given to the model as feed input, and the model estimates the probabilities of the words that may follow. The probabilities are sorted from largest to smallest, as many words as the candidate pool size are taken, and the same embedding algorithm applied when hiding the confidential information is now applied to the words in this pool in the extraction step. In other words, if information was placed using FLC, a perfect binary tree is built from the candidate words in the extraction step; if information was placed using VLC, a Huffman tree is built from the candidate words. Then, by descending from the root to the leaves of the perfect binary tree or Huffman tree, it is checked whether the word in the stego text given to the model as feed input appears in a leaf node. If it does, the code value obtained by descending from the root to that leaf node yields the bits of the secret message; the code values are obtained by taking "0" when moving to the left child of the tree and "1" when moving to the right child.

If arithmetic coding was used in the embedding phase, arithmetic coding was likewise used in the extraction phase. As in FLC and VLC coding, the first step in extracting the secret information with arithmetic coding is to give the stego-text word to the model as feed input and determine the words with the highest probability. Accordingly, the probability values of the words in the candidate word pool, ordered from largest to smallest, were first converted to binary and the binary values were then converted to decimal; the goal here is to keep the decimal values in the range of 0 to 1. Afterwards, the feed word in the stego text was subjected to arithmetic coding and a decimal value was obtained for this word. Decoding was carried out by determining which of the value ranges corresponding to the words in the candidate word pool contains the decimal value of the feed word. The extraction process continues by examining each word in the stego text received by the recipient, one by one as described above, until it has been determined whether or not it contains confidential information. The decode stage in arithmetic coding is the reverse of the encode stage, and extraction was carried out by gradually narrowing the value ranges.
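
A minimal sketch of the interval look-up described above: the candidate words' probabilities partition [0, 1) into sub-intervals, and the decimal value derived from the feed word selects the interval and hence the hidden symbol. The pool, probabilities and example value are illustrative placeholders.

```python
def build_intervals(candidates, probs):
    """Cumulative [low, high) interval for each candidate word."""
    intervals, low = {}, 0.0
    for word, p in zip(candidates, probs):
        intervals[word] = (low, low + p)
        low += p
    return intervals

def decode_word(value, intervals):
    """Return the candidate whose interval contains the decimal value."""
    for word, (low, high) in intervals.items():
        if low <= value < high:
            return word
    return None

pool = ["ev", "okul", "araba", "kitap"]               # ordered by model probability (illustrative)
probs = [0.4, 0.3, 0.2, 0.1]
intervals = build_intervals(pool, probs)
print(decode_word(0.75, intervals))                   # falls in araba's interval [0.7, 0.9)
```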

In the step of extracting the confidential information embedded with the LZW-CIE coding method, the first 2 characters of each word in the stego text reaching the receiver were taken, and the character index values of these letters (their positions in the alphabet) were determined. Then, the division that produced the "x" value in the encoding stage was applied in reverse to the index value of the first character, and the index value of the second character was added to the result. This was applied to all words in the text, and a list of decimal values was obtained. One more than the number of characters in the alphabet was subtracted from each value in this list before decompression; this subtraction was performed so that the same code list obtained during compression is recovered. In the proposed algorithm, the secret information can be placed at different positions in the word, such as the beginning, the end or the middle. For example, if the confidential information is placed at the end of the word, only the last 2 letters of each word need to be examined when extracting the confidential information.
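
A minimal sketch of the first part of this step, assuming the hidden letter pair sits at the beginning of each word and 0-based letter indices; the example stego words are illustrative, and the recovered code list would then be passed to a standard LZW decompressor.

```python
TURKISH_ALPHABET = "abcçdefgğhıijklmnoöprsştuüvyz"    # 29 letters

def recover_codes(stego_words):
    codes = []
    for word in stego_words:
        x = TURKISH_ALPHABET.index(word[0])           # reverses the division step of encoding
        y = TURKISH_ALPHABET.index(word[1])           # remainder produced during encoding
        codes.append(x * 29 + y)                      # original LZW code value
    return codes

print(recover_codes(["abide", "açelya"]))             # illustrative stego words
```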

5 Experiments and analysis

In this section, the performance of the model created in the study is evaluated from the perspectives of confidential information embedding efficiency, imperceptibility of the confidential information, and confidential information capacity, and various experiments were carried out for this purpose. To measure the embedding efficiency, the average time the model needs to generate stego text was measured. For the imperceptibility criterion, stego texts obtained by embedding confidential information at different rates were compared with the training text, and the ability to resist steganalysis was also examined. For the secret information capacity metric, the rate of confidential information that can be placed in the generated stego texts was analyzed, and the values obtained were compared with other text steganography algorithms. In Sect. 5.1, the dataset used is first introduced, and then the structure of the word- and character-level models, the parameter settings and the details of model training are given.

5.1 Data preparation and model training

In this study, general Turkish documents with the ".txt" extension in the Kaggle [79] database were used to train the model. The documents consist of newspaper articles [80], Turkish lecture notes [81], online PDF documents [82] and opinion columns [83]. A large-scale corpus was obtained by combining the separately downloaded text documents into a single file. Before model training, a noise-free corpus was obtained by applying preprocessing steps such as removing punctuation marks, removing numeric expressions, converting to lowercase, deleting special symbols, filtering out low-frequency words and eliminating stop words. The details of the training dataset obtained after these preprocessing steps are shown in Table 1.
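
A hedged sketch of the preprocessing steps listed above; the stop-word subset, frequency threshold and regular expressions are illustrative choices, not the authors' exact pipeline.

```python
import re
from collections import Counter

STOP_WORDS = {"ve", "bir", "bu", "da", "de", "ile"}   # small illustrative subset
MIN_FREQ = 3                                          # illustrative low-frequency threshold

def preprocess(text):
    text = text.lower()                               # note: Turkish 'I/ı' casing may need locale-aware handling
    text = re.sub(r"[0-9]+", " ", text)               # remove numeric expressions
    text = re.sub(r"[^\w\s]", " ", text)              # remove punctuation and special symbols
    tokens = [t for t in text.split() if t not in STOP_WORDS]
    freq = Counter(tokens)
    return [t for t in tokens if freq[t] >= MIN_FREQ] # filter out low-frequency words

corpus_tokens = preprocess(open("corpus.txt", encoding="utf-8").read())
```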

Table 1 The details of the training datasets

The hyperparameters used when building the word-based model are as follows: first, each word was paired with a 100-dimensional embedding vector. Then, a bidirectional LSTM layer consisting of 512 units was applied. The "return_sequences" parameter was set to "True" so that the previous and following words in the sequence are taken into account. An attention mechanism was also added so that the model can concentrate on the words considered more relevant, and a dropout layer was added to prevent overfitting. The construction of the model was completed by adding another LSTM layer consisting of 100 units, a "Dense" layer with ReLU activation and a regularizer, again to prevent overfitting.
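
A hedged Keras sketch of the word-level architecture described above; the vocabulary size, sequence length, Dense width and regularization strength are illustrative assumptions, and the built-in Attention layer stands in for the custom attention mechanism used in the study.

```python
import tensorflow as tf
from tensorflow.keras import layers, regularizers, Model

vocab_size, seq_len = 50_000, 20                                       # illustrative values

inputs = layers.Input(shape=(seq_len,))
x = layers.Embedding(vocab_size, 100)(inputs)                          # 100-dimensional word embeddings
x = layers.Bidirectional(layers.LSTM(512, return_sequences=True))(x)   # context from both directions
x = layers.Attention()([x, x])                                         # stand-in for the custom attention
x = layers.Dropout(0.2)(x)                                             # dropout rate from Sect. 5.1
x = layers.LSTM(100)(x)
x = layers.Dense(128, activation="relu",
                 kernel_regularizer=regularizers.l2(1e-4))(x)          # regularizer against overfitting
outputs = layers.Dense(vocab_size, activation="softmax")(x)            # next-word distribution

model = Model(inputs, outputs)
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.01),
              loss="sparse_categorical_crossentropy")
```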

In the character-based model, each character was mapped to a 32-dimensional embedding vector. Then, a bidirectional LSTM layer consisting of 768 units was applied. Likewise, the "return_sequences" parameter was set to "True" so that the previous and following characters are taken into account. The encoder stage was completed by adding the Bahdanau attention mechanism and then a dropout layer. In the decoder stage, the construction of the whole model was completed by adding a "Dense" layer on top of the bidirectional LSTM layer. Softmax was used as the output activation in both models. The learning rate was set to 0.01, the batch size to 128 and the dropout rate to 0.2.
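
A hedged Keras sketch of the character-level encoder-decoder is given below, with the built-in AdditiveAttention layer as the Bahdanau mechanism; the alphabet size, sequence length and the way the decoder output is combined with the attention context are illustrative assumptions, not the authors' exact implementation.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

alphabet_size, seq_len = 30, 100                                        # illustrative values

enc_in = layers.Input(shape=(seq_len,))
enc_emb = layers.Embedding(alphabet_size, 32)(enc_in)                   # 32-dimensional char embeddings
enc_out = layers.Bidirectional(layers.LSTM(768, return_sequences=True))(enc_emb)
enc_out = layers.Dropout(0.2)(enc_out)                                  # encoder-side dropout

dec_in = layers.Input(shape=(seq_len,))
dec_emb = layers.Embedding(alphabet_size, 32)(dec_in)
dec_out = layers.Bidirectional(layers.LSTM(768, return_sequences=True))(dec_emb)

context = layers.AdditiveAttention()([dec_out, enc_out])                # Bahdanau-style attention
merged = layers.Concatenate()([dec_out, context])
outputs = layers.Dense(alphabet_size, activation="softmax")(merged)     # next-character distribution

model = Model([enc_in, dec_in], outputs)
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.01),
              loss="sparse_categorical_crossentropy")
```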

5.2 Evaluation results and discussion

In this section, the performance of the proposed model is examined on the basis of four criteria: information hiding efficiency, information imperceptibility, steganalysis resistance and hiding capacity, and the results are evaluated.

(1) Information hiding efficiency

This criterion measures the time the model spends embedding the confidential information. Rebuilding the candidate word pool used in the FLC, VLC and AC embedding methods in each iteration and encoding the candidate words directly affect the embedding time. The number of words in the candidate word pool (candidate pool size, CPS) is also a factor in the efficiency. In the model evaluation tests, information embedding efficiency was tested at different embedding rates. In order to compare with the referenced studies [1, 64], 1000 texts limited to 50 words each were generated for each CPS. The same generation procedure was adopted for the LZW-CIE coding method. Table 2 and Fig. 12 give the information embedding times, which vary according to CPS.

Table 2 The average time for each model to generate a text containing 50 words at different embedding rates
Fig. 12
figure 12

Average time for stego text generation

When the values in Table 2 and Fig. 12 are interpreted, it is observed that as the embedding rate increases, the time spent generating stego text containing confidential information also increases. With the proposed VLC encoding, since it is more time-consuming to create code values of different lengths in each iteration and place the words in the tree structure, it was in general a more time-consuming coding method than FLC and AC coding. Nevertheless, the proposed model embeds information in less time than the RNN-Stega FLC and RNN-Stega VLC models. When Table 2 is examined, it is seen that when the proposed model uses the FLC encoding method, it can produce a 50-word stego text in 3.828–4.36 s on average (over all CPS values). This value is in the range of 16.539–19.072 s when VLC encoding is used and varies between 14.803 and 16.095 s on average for the LZW-CIE encoding method. Since stego text could be generated in shorter times than with the RNN-Stega FLC and RNN-Stega VLC models, the proposed model achieves higher information embedding efficiency with all four coding methods (FLC, VLC, AC and LZW-CIE).

(2) Imperceptibility analysis

The primary aim in text steganography is to deliver confidential information to the recipient over an open medium (public channel) without its existence being noticed. Therefore, the imperceptibility criterion is the most crucial factor in evaluating the success of steganography studies. For the presence of confidential information in the generated text not to be detected, the difference between the statistical distributions of the carrier text (cover text without confidential information) and the stego text should be minimal or even nonexistent. In the model proposed in this study, stego text is generated directly from the confidential information, without a carrier cover text. The "perplexity" criterion was used to assess how well the statistical distribution is preserved when generating text from confidential information, a more challenging task than maintaining the statistical distribution when placing confidential information into an existing carrier text. Perplexity (pp) is a standard metric used to measure the quality of a language model in natural language processing [84]. It is defined as the average log probability per word over the test texts [1].

$$\begin{aligned} {\text{Perplexity}} & = 2^{ - \frac{1}{N}\sum_{i = 1}^{N} \log p\left( s_{i} \right)} \\ & = 2^{ - \frac{1}{N}\sum_{i = 1}^{N} \log p_{i}\left( w_{1}, w_{2}, \ldots, w_{n} \right)} \\ & = 2^{ - \frac{1}{N}\sum_{i = 1}^{N} \log \left[ p_{i}\left( w_{1} \right) p\left( w_{2} \mid w_{1} \right) \cdots p\left( w_{n} \mid w_{1}, w_{2}, \ldots, w_{n - 1} \right) \right]} \end{aligned}$$
(10)

In Eq. (10), \(s_{i} = \left\{ {w_{1}, w_{2}, \ldots, w_{n} } \right\}\) refers to the generated sentence and \(p\left( {s_{i} } \right)\) refers to the probability of each word in this sentence obtained from the model. Since the perplexity value reflects the statistical distribution of texts, this criterion was calculated both for the generated stego texts and for the training texts, and the difference was evaluated [64]. A small difference means that the generated text and the training text differ little; in other words, the two texts are quite similar to each other. The average perplexity value calculated for the training texts is 135.45. The average perplexity value was then calculated for stego texts generated at different embedding rates. The difference between the average perplexity values is expressed as \(\Delta {\text{Mp}}\) [64].

$$\Delta {\text{Mp}} = {\text{mean}}\left( {{\text{ppStegoText}}} \right) - {\text{mean}}\left( {{\text{ppTrainingText}}} \right)$$
(11)
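
A minimal sketch of the perplexity and \(\Delta {\text{Mp}}\) computation in Eqs. (10) and (11); the per-sentence probabilities below are illustrative placeholders for values produced by the language model.

```python
import math
from statistics import mean

def perplexity(sentence_probs):
    """Eq. (10): 2 ** (-(1/N) * sum(log2 p(s_i))) over the N sentences of a text."""
    return 2 ** (-sum(math.log2(p) for p in sentence_probs) / len(sentence_probs))

# Illustrative per-sentence probabilities for two stego texts and two training texts.
stego_texts = [[1e-30, 5e-28, 2e-29], [4e-31, 1e-28, 9e-30]]
training_texts = [[2e-30, 3e-28, 1e-29], [6e-31, 2e-28, 5e-30]]

delta_mp = (mean(perplexity(t) for t in stego_texts)
            - mean(perplexity(t) for t in training_texts))            # Eq. (11)
```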

Following the study in [64], the Kullback–Leibler divergence (KLD) metric was used to show that the presence of confidential information in the generated stego texts cannot be detected statistically, in other words, to measure how similar the generated stego texts are to the training texts not only semantically but also statistically. Since the KLD metric is not symmetric, the difference in distribution between the generated stego texts and the training sentences was additionally examined using the Jensen–Shannon divergence (JSD) [85, 86] metric [64]. The KLD and JSD distance metrics are given in Eqs. (12) and (13), respectively, where "C" refers to the overall statistical distribution of the training text and "S" refers to the overall statistical distribution of the stego text [64, 87].

$$D_{KL} (C||S) = \sum C_{i} \left( x \right)\log \left( {\frac{{C_{i} \left( x \right)}}{{S_{i} \left( x \right)}}} \right)$$
(12)
$$D_{JS} (C||S) = \frac{1}{2}D_{KL} \left( {P_{C} ||\frac{{P_{C} + P_{S} }}{2}} \right) + \frac{1}{2}D_{KL} \left( {P_{S} ||\frac{{P_{C} + P_{S} }}{2}} \right)$$
(13)

In order to evaluate the performance of the model in terms of the perplexity criterion, a comparison was made with two studies close to this one in terms of the coding method used, namely the studies in [1, 64]. Since the starting point of this study is similar to that of these two reference studies, the evaluation procedure they followed was taken as a guide in the performance evaluation of the proposed model. In both reference studies, text generation based on confidential information was carried out with an LSTM-based model. The most obvious differences from those studies are the architecture of the proposed model and the developed LZW-CIE coding method. In addition, as far as is known, there are not enough studies on Turkish text steganography, and the proposed model was tested on Turkish texts; this constitutes another important component of the study.

Both of the generated models (word- and character-based) were trained using the same dataset and with the same embedding rates. At the end of the training, stego texts consisting of 1000 sentences were created for the perplexity test. In order to make a more realistic evaluation, the sentences in the stego texts generated by both proposed models at different embedding rates were represented in the feature space using the sentence matching model specified in [88]. Then, the overall statistical distributions (KL and JSD values) of the 1000-sentence stego texts allocated for the test and of 1000 sentences randomly selected from the training corpus were calculated, and the extent to which they matched each other was analyzed. The mean and standard deviation of the perplexity values are given in Tables 3 and 4. Since the number of bits embedded per word (bpw) is variable in VLC coding, the bpw values for this coding and the corresponding perplexity results are also reported in Tables 3 and 4.

Table 3 The mean and standard deviation of the perplexity results of proposed models at different embedding rates on Turkish dataset as well as the results of the models of related work
Table 4 The measurement results of evaluation metrics of the steganographic sentences generated by the proposed bidirectional LSTM attention based model and encoder–decoder model as well as the models of related work

When the results in Tables 3 and 4 are examined, it is seen that the perplexity value increases as the embedding rate (bpw) increases. The most important factor determining the embedding rate is the size of the candidate word pool (CPS); as CPS increases, the embedding rate increases significantly. The CPS values selected in the study were 2, 4, 8, 16 and 32. The embedding rate (bpw) was then calculated by dividing the number of embedded bits by the length of the generated text. When the \(\Delta {\text{MP}}\), KL and JSD metrics in Table 4, which measure the similarity between the generated stego texts and the training texts, are examined, it is observed that these values decrease as the embedding rate increases; in other words, with a higher embedding rate, stego texts that are more similar to the training texts are generated. The reason is that, as CPS grows, a larger share of the words used in training the model is taken into the candidate pool for encoding, so the words in the generated stego texts increasingly approach those of the training texts. As the stego and training texts converge, the decreases in \(\Delta {\text{MP}}\), KL and JSD make it harder to distinguish the two texts statistically. The disadvantage of increasing CPS, and hence bpw, is that words with lower conditional probability values are used in the encoding phase, which reduces the quality of the generated stego texts. Nevertheless, with the proposed model (both the word- and the character-based model), a lower \(\Delta {\text{MP}}\) value was obtained than with the compared VAE-Stega model; in other words, the quality difference between the training text and the stego text was smaller, so higher quality stego texts were generated with the proposed model. In addition, when the developed models are examined on the basis of the coding method used, AC coding gives better results than VLC and FLC coding in terms of \(\Delta {\text{MP}}\), i.e., the statistical difference between the stego text and the training text, and the developed LZW-CIE coding method is ahead of AC coding on the same metric. Therefore, on the dataset used, since LZW-CIE encoding hides information more efficiently by operating at the character level, the generated stego texts maintained both their statistical closeness to the training texts and the imperceptibility of the confidential information.

(3) Anti-steganalysis ability

The ultimate aim of the study carried out here is to deliver a text containing confidential information to the recipient without being noticed by third parties. For this purpose, the anti-steganalysis ability was evaluated by subjecting both proposed models to the criteria used by the steganalysis studies in [89,90,91]. Evaluation criteria known as accuracy (Acc) and recall (R) were used to evaluate the steganalysis performances of the proposed models.

The accuracy metric measures the proportion of correct results (both true positives (TP) and true negatives (TN)) within the total number of events examined [64].

$${\text{Acc}} = \frac{{{\text{TP}} + {\text{TN}}}}{{{\text{TP}} + {\text{FN}} + {\text{FP}} + {\text{TN}}}}$$
(14)

The recall metric measures the proportion of correctly identified positives [64].

$$R = \frac{{{\text{TP}}}}{{{\text{TP}} + {\text{FN}}}}$$
(15)

TP refers to the positive samples predicted to be positive by the model; FP refers to the negative samples predicted to be positive; FN refers to the positive samples predicted to be negative; and TN refers to the negative samples predicted to be negative [64]. The steganalysis results obtained are given in Tables 5, 6 and 7.

Table 5 Steganalysis ability of steganography models (Huffman coding)
Table 6 Steganalysis ability of steganography models (arithmetic coding)
Table 7 Steganalysis ability of steganography models (LZW-CIE coding)

When Tables 5, 6 and 7 are examined, the following evaluations can be made. The word-based bidirectional LSTM model with the custom attention mechanism and the character-level bidirectional LSTM encoder-decoder model with the Bahdanau attention mechanism, both proposed in this study, give better results on the accuracy and recall metrics than the other models. As bpw increases, the generated stego texts become closer to the training texts, which lowers the accuracy of the steganalysis measures used; in other words, whether or not confidential information is present becomes harder to estimate. This is a desired outcome of the study. For all coding methods used, the accuracy and recall metrics decreased gradually as bpw increased. This downward trend of the steganalysis metrics in Table 7 was observed for all of the references in [89,90,91].

When the effect of the coding methods used (FLC, VLC, AC, LZW-CIE) on the steganalysis results, the information embedding efficiency and the information imperceptibility metrics is evaluated, the most successful values were obtained when LZW-CIE coding was used. The reasons for this are that character-based language models have been observed to give more successful results when modeling morphologically rich languages such as Turkish [92], and that the character-level model uses an encoder-decoder architecture with an attention mechanism and a bidirectional LSTM network in both the encoder and decoder parts. After the LZW-CIE coding method, the second most successful coding method was AC. The measurement results in Tables 2, 3, 4, 5, 6 and 7 confirm this.

(4) Evaluation results between sampling types

In this study, stego text was generated using three different sampling strategies (greedy search, top-k and temperature-based) on the proposed language models, and the effect of different parameter values for each sampling strategy was also examined. Tables 8 and 9 give the perplexity results for each coding method. These tables do not include the perplexity results for text generated with greedy search because, compared to the other two sampling strategies, certain words were repeated very often and the quality of the generated texts was low. In both the top-k and temperature-based sampling strategies, the perplexity value increases as the embedding rate increases in parallel with the increase in CPS. In addition, increasing the k value in top-k sampling and increasing the temperature value provided word diversity in the generated texts and allowed less predictable, less ordinary texts to be formed. More fluent texts could be generated with the top-k and temperature sampling strategies than with greedy sampling. Considering that the perplexity value of the training texts is 135.45, the most suitable perplexity values for the FLC, VLC, AC and LZW-CIE coding methods were reached with temperature-based sampling at t = 0.6. The results in Table 4, obtained using temperature sampling with t = 0.6, confirm this.
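
A hedged sketch of the top-k and temperature-based sampling strategies compared above, applied to a model's next-token probability vector; the probability values are illustrative.

```python
import numpy as np

def temperature_sample(probs, t=0.6, rng=np.random.default_rng()):
    """Sharpen (t < 1) or flatten (t > 1) the distribution, then sample."""
    logits = np.log(probs) / t
    p = np.exp(logits - logits.max())
    p /= p.sum()
    return rng.choice(len(probs), p=p)

def top_k_sample(probs, k=8, rng=np.random.default_rng()):
    """Keep only the k most probable tokens, renormalize, then sample."""
    top = np.argsort(probs)[-k:]
    p = probs[top] / probs[top].sum()
    return top[rng.choice(len(top), p=p)]

probs = np.array([0.5, 0.2, 0.1, 0.08, 0.06, 0.03, 0.02, 0.01])   # illustrative next-token distribution
print(temperature_sample(probs, t=0.6), top_k_sample(probs, k=4))
```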

(5) Hidden capacity (embedding rate, ER) analysis

Table 8 Measurement results of different sampling types (FLC and VLC coding)
Table 9 Measurement results of different sampling types (AC and LZW-CIE coding)

This metric measures how much confidential information can be embedded in the text. ER is calculated by dividing the number of hidden bits embedded by the number of bits of the whole text [1]. The calculation of ER is expressed mathematically in Eq. (16).

$$\begin{aligned} {\text{ER}} & = \frac{1}{N}\sum_{i = 1}^{N} \frac{\left( L_{i} - 1 \right) \cdot k}{B\left( s_{i} \right)} \\ & = \frac{1}{N}\sum_{i = 1}^{N} \frac{\left( L_{i} - 1 \right) \cdot k}{8 \times \sum_{j = 1}^{L_{i}} m_{i,j}} = \frac{\left( \overline{L} - 1 \right) \times k}{8 \times \overline{L} \times \overline{m}}, \end{aligned}$$
(16)

Here, \(N\) refers to the number of generated sentences, \(L_{i}\) to the length of the i-th sentence, \(k\) to the number of bits embedded in each word, and \(B\left( {s_{i} } \right)\) to the number of bits occupied by the i-th sentence. Since Turkish is written in the Latin alphabet and each letter occupies 1 byte, that is 8 bits, in the expression \(8 \times \sum\nolimits_{j = 1}^{{L_{i} }} {m_{i,j} }\), \(m_{i,j}\) represents the number of letters in the j-th word of the i-th sentence. \(\overline{L}\) and \(\overline{m}\) represent, respectively, the average length of the sentences in the generated text and the average number of letters per word [1]. In the measurements made in this study, \(\overline{L}\) was found to be 15.34 and \(\overline{m}\) to be 3.67. The graph in Fig. 13 shows how the ER ratio varies with the bpw and sentence length parameters. Accordingly, as the sentence length increased, the ER ratio also increased, and the highest result of 22.46% was obtained. There is also an improvement in the ER ratio as the bpw ratio increases. Table 10 compares the percentages of confidential information embedded at different bpw ratios by the LZW-CIE model and the RNN-Stega models. Accordingly, the LZW-CIE model achieved a confidential information embedding rate of 22.46% at a bpw of 8.698.
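
A minimal sketch of the simplified form of Eq. (16), using the average sentence length and word length reported above; the k values are illustrative parameters, and the printed figures are not claimed to reproduce the table entries.

```python
def embedding_rate(k, avg_len=15.34, avg_word_letters=3.67):
    """ER = ((L̄ - 1) * k) / (8 * L̄ * m̄), with L̄ and m̄ as reported in the text."""
    return ((avg_len - 1) * k) / (8 * avg_len * avg_word_letters)

for k in (1, 2, 4, 8):                      # illustrative bpw values
    print(k, f"{embedding_rate(k):.2%}")
```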

Fig. 13
figure 13

Embedding rate variation

Table 10 The comparison of the embedding rates between models

6 Conclusion

Automatic text generation driven entirely by the confidential information, without the need for a supporting cover text, has become a challenging yet promising field. It is therefore thought that advances in model architectures and in the coding methods used to embed confidential information will accelerate the stego text generation process. For this purpose, two different linguistic steganographic models were designed in this study to generate text based on confidential information. A bidirectional LSTM architecture with a custom attention mechanism was used for the word-level model, while an encoder-decoder architecture with the Bahdanau attention mechanism and a bidirectional LSTM network in both the encoder and decoder parts was used for the character-level model. FLC (perfect binary tree), VLC (Huffman tree), arithmetic coding and the newly developed LZW-CIE coding method were used to embed the confidential information. The performance of the models was examined in terms of embedding efficiency, imperceptibility, anti-steganalysis resistance and confidential information capacity, and it was determined that the proposed character-level model with LZW-CIE encoding outperformed the compared related methods and achieved the best performance. We hope that this article will serve as a reference for researchers working in this field and guide future text steganography studies.