
1 Introduction

With the rapid development and wide application of the Internet and social networks, digital information is easy to obtain, transmit and manipulate. It is therefore essential to protect sensitive information transmitted over public channels from malicious interference. Shannon [13] summarized three basic information security systems: the encryption system, the privacy system and the concealment system. The encryption system protects the security of the confidential message itself, while the privacy system controls access to the confidential message. The concealment system hides the confidential message in normal carriers and transmits them through open channels, focusing on concealing the very existence of the confidential message.

Steganography is a key technology of the concealment system; it studies how to embed secret information into carriers efficiently and safely. According to the carrier type, steganography can be divided into image steganography [5], text steganography [7], audio steganography [10] and video steganography [8]. As the primary medium of human communication from ancient times to the present, text has a wide range of application scenarios. Moreover, the transmission of text over public channels is robust, because typical channels neither compress it nor corrupt it with noise. For these reasons, text may be more suitable than images, videos or other carriers for data transmission in social networks.

Generative text steganography uses a language model (LM) to automatically generate stego text. It encodes the semantic units of the text during generation and selects which unit to output according to the secret message to be embedded, thereby embedding the secret message. The steganographer thus has greater freedom when embedding the message, so a high embedding rate can be expected. Yang et al. [17] proposed fixed-length coding (FLC) based on a perfect binary tree and variable-length coding (VLC) based on a Huffman tree; both encode the Top-K words in the candidate pool predicted by the language model at each time step according to their conditional probabilities. Xiang et al. [16] modeled natural sentences as letter sequences and used the Char-RNN model to obtain letter-level conditional probability distributions. Zhou et al. [19] adopted a generative adversarial network model for steganographic text generation and replaced the Top-K candidate pool construction with a dynamic one. However, the above schemes only consider transmitting the secret message through a single channel, and cannot effectively control semantic characteristics such as the topic of the stego text.

The complex and open characteristics of social networks provide a good camouflage environment for the transmission of stego texts, but also bring challenges. Social networks are public channels, and each social platform is supervised by staff; if they find an account behaving abnormally, they are likely to delete or ban it. Transmitting stego text through a single channel therefore risks losing the secret message entirely. The (k, n) threshold secret sharing (SS) technique satisfies the characteristics of both the encryption system and the privacy system: it encrypts a secret message into n shares and distributes them, such that any k shares can restore the original secret message, while fewer than k shares reveal nothing. The loss-tolerant property of SS creates the conditions for multi-channel transmission of the secret message. Each social account has its own interests, professional background and other characteristics, and therefore its own linguistic features. If the semantics of the generated stego text can be effectively controlled to match the characteristics of the social accounts, the concealment and security of covert communication over social networks can be further improved. Controllable text generation (CTG) controls characteristics of the text, such as mood and style, while preserving its content [2, 4, 18]. CTG models \( p(x|\alpha ) \), where \( \alpha \) is some desired controllable attribute and x is the generated sample. By controlling the topic of each stego text during generation according to the characteristics of the corresponding social account, the steganography scheme becomes better suited to social network scenarios.

This paper proposes a multi-channel generative text steganography scheme with loss tolerance, robustness and imperceptibility for social network scenarios. It uses secret sharing to encrypt the secret message into multiple shares; during generation with a controlled language model, the candidate words are encoded and the output words are selected according to the shares, producing multiple topic-controlled stego texts. We summarize the motivations and contributions of this paper as follows:

  • Facing the challenge that existing text steganography schemes only consider covert communication through a single channel, which can easily lead to the destruction or loss of the stego text, this paper proposes to use secret sharing to hide the secret message in multiple stego texts, such that the original secret message can be recovered from only a subset of them.

  • In view of the fact that social network users post content shaped by their different backgrounds, this paper proposes to control the topics of the generated stego texts through bags of words (BoW), so that the stego texts have stronger concealment.

  • This paper proposes two goal programming models, which optimize the topic relevance and the text quality of the stego texts, respectively.

2 Preliminaries and Related Work

2.1 Generative Text Steganography

In the field of natural language processing, a text is usually regarded as a word sequence composed of specific words arranged according to semantic associations and syntactic rules, and the chain rule is used to express the joint probability of the word sequence under a language model [1, 9]:

$$\begin{aligned} \begin{aligned} P(X)&= P({x_1},{x_2}, \ldots ,{x_N})\\&= P({x_1})P({x_2}|{x_1}) \cdots P({x_N}|{x_1}{x_2} \cdots {x_{N - 1}})\\&= \prod \limits _{i = 1}^N {P(} {x_i}|{x_1}{x_2} \cdots {x_{i - 1}}) \end{aligned} \end{aligned}$$
(1)

where P(X) represents the generation probability of the word sequence \( x_1,x_2\), \(\cdots \),\(x_N \), and \( P(x_N|x_1x_2\cdots x_{N-1}) \) denotes the conditional probability of generating word \( x_N \) given the preceding context \( x_1x_2\cdots x_{N-1} \). Due to the diversity of language expression, for a given prefix \(x_1x_2 \cdots x_{N-1}\) there is usually more than one candidate \(x_N\) that keeps the generated text semantically and syntactically well-formed. This redundancy is what makes generative information hiding possible.

Yang et al. [17] proposed fixed-length coding (FLC) based on a perfect binary tree of height h to encode the words in the candidate pool, achieving a mapping from secret bits to the word space. In the FLC scheme, the prefix text is fed into the LM to obtain the candidate words and their probability distribution for the next time step. The candidate pool is then truncated to the \( 2^h \) most probable words, and the candidates are encoded with a perfect binary tree, so that the word corresponding to the next h secret bits can be selected for output.
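As an illustration (not the authors' code), the sketch below performs one FLC embedding step as described above: the candidate pool is truncated to the \( 2^h \) most probable words, each pool position acts as a leaf of the perfect binary tree and hence as an h-bit codeword, and the next h secret bits select the output word. The vocabulary and probabilities in the usage example are mock values.

```python
import heapq

def flc_embed_step(probs, vocab, secret_bits, h=3):
    """One fixed-length-coding embedding step (a sketch of the FLC idea).

    probs       : conditional probabilities over the whole vocabulary
    vocab       : words aligned with probs
    secret_bits : '0'/'1' string; the first h bits select the output word
    Returns the chosen word and the not-yet-embedded remainder of the bits."""
    # Truncate the candidate pool to the 2^h most probable words.
    pool = [w for _, w in heapq.nlargest(2 ** h, zip(probs, vocab))]
    # In a perfect binary tree of height h, each of the 2^h leaves carries
    # an h-bit codeword, so the pool position directly encodes h bits.
    index = int(secret_bits[:h], 2)
    return pool[index], secret_bits[h:]

# Toy usage: embed the bits "101" over a mock next-word distribution.
vocab = ["the", "a", "sun", "rain", "dog", "cat", "runs", "sleeps", "fast"]
probs = [0.30, 0.20, 0.12, 0.10, 0.08, 0.07, 0.06, 0.04, 0.03]
print(flc_embed_step(probs, vocab, "101", h=3))  # -> ('cat', '')
```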

Perplexity (ppl) is usually used as the quality evaluation metric for generated text [6], as shown in Eq. 2.

$$\begin{aligned} \begin{aligned} ppl &= P({x_1},{x_2}, \cdots ,{x_N})^{ - \frac{1}{N}}\\ &= \root N \of {\prod \limits _{i = 1}^N {\frac{1}{P({x_i}|{x_1},{x_2}, \cdots ,{x_{i - 1}})}}} \end{aligned} \end{aligned}$$
(2)

Equation 2 shows that the higher the conditional probabilities of the words in the sequence, the lower the perplexity and the higher the text quality.
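For reference, a small routine computing Eq. 2 from the per-word conditional probabilities of a sequence (assumed to have been obtained from a language model) might look as follows.

```python
import math

def perplexity(cond_probs):
    """Perplexity of a word sequence given its per-word conditional
    probabilities P(x_i | x_1 ... x_{i-1}), following Eq. 2.
    Computed in log space to avoid underflow for long sequences."""
    n = len(cond_probs)
    log_sum = sum(math.log(p) for p in cond_probs)
    return math.exp(-log_sum / n)

# A sequence whose words are all highly probable has low perplexity.
print(perplexity([0.5, 0.4, 0.6]))    # ~2.03
print(perplexity([0.05, 0.1, 0.02]))  # ~21.5
```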

2.2 Shamir’s Polynomial-Based SS

Shamir’s polynomial-based SS [12] with a (k, n) threshold splits secret data m into n shares based on a \( (k-1) \)-degree polynomial as in Eq. 3, in which \( a_{0}=m \), and \( a_{1}, a_{2},\cdots , a_{k-1} \) are chosen randomly in \( [0,p-1] \), where p is a prime number greater than \( a_{0} \). All modulo operations are performed in the Galois field GF(p).

$$\begin{aligned} f(x) = (a_0 + a_{1 } x + \cdots + a_{k - 1} x^{k - 1} )\bmod p \end{aligned}$$
(3)

In the sharing phase, given n distinct values \( x_1, \cdots , x_n \), we compute the n shared values \( s_{1} = f(x_{1}), s_{2} = f(x_{2}),\cdots , s_{n} = f(x_{n}) \) and take each \( (x_{i},s_{i}) \) as a secret pair. These n pairs are distributed to n participants. Without loss of generality, the \( x_i \) are often taken as \( 1, 2, \cdots , n \).

In the recovery phase, given any k pairs \((x_j, s_j)\), we can reconstruct f(x) by Lagrange interpolation as shown in Eq. 4, and then recover \( m = f(0) \).

$$\begin{aligned} f(x) = \sum \limits _{j = 1}^k {s_j} \prod \limits _{l = 1\atop l \ne j}^k {\frac{{(x - {x_l})}}{{({x_j} - {x_l})}}} \end{aligned}$$
(4)

In this paper, we place l secret values in the coefficients \( {a_i}|_{i = 0}^{l - 1} \), and the remaining coefficients \( {a_i}|_{i = l}^{k - 1} \) are selected from \( [0,p-1] \), which effectively improves the efficiency of information hiding.
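To make the sharing and recovery concrete, the following is a minimal sketch of Shamir's (k, n) scheme over GF(p) with the l secret values packed into the first l coefficients, as used in this paper. The function names are ours; recovery expands the Lagrange basis polynomials of Eq. 4 so that all coefficients, not only f(0), are obtained.

```python
import random

def share(secrets, k, n, p):
    """(k, n) sharing over GF(p): the l secret values occupy the first l
    coefficients of f, and the remaining k - l coefficients are random."""
    l = len(secrets)
    assert l <= k and all(0 <= m < p for m in secrets)
    coeffs = list(secrets) + [random.randrange(p) for _ in range(k - l)]
    def f(x):
        return sum(c * pow(x, i, p) for i, c in enumerate(coeffs)) % p
    return [(x, f(x)) for x in range(1, n + 1)]  # n secret pairs (x_i, s_i)

def poly_mul_linear(poly, root, p):
    """Multiply a polynomial (ascending coefficients) by (x - root) mod p."""
    out = [0] * (len(poly) + 1)
    for i, c in enumerate(poly):
        out[i] = (out[i] - root * c) % p
        out[i + 1] = (out[i + 1] + c) % p
    return out

def recover_coeffs(pairs, p):
    """Recover all coefficients of f from k pairs by expanding the Lagrange
    basis polynomials of Eq. 4; the first l coefficients are the secrets."""
    k = len(pairs)
    coeffs = [0] * k
    for j, (xj, sj) in enumerate(pairs):
        num, denom = [1], 1
        for m, (xm, _) in enumerate(pairs):
            if m != j:
                num = poly_mul_linear(num, xm, p)
                denom = denom * (xj - xm) % p
        scale = sj * pow(denom, p - 2, p) % p  # modular inverse via Fermat
        for i, c in enumerate(num):
            coeffs[i] = (coeffs[i] + scale * c) % p
    return coeffs

# Toy usage: hide the secret values [5, 7] with a (3, 4) threshold and p = 11.
shares = share([5, 7], k=3, n=4, p=11)
print(recover_coeffs(shares[:3], p=11)[:2])  # any 3 shares recover [5, 7]
```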

2.3 Transformer-Based Controllable Text Generation

Controllable text generation builds on traditional text generation by adding control over attributes, styles or key information of the generated text, so that the output meets our expectations.

Dathathri et al. [3] proposed PPLM, which samples from the posterior \( P(x|\alpha ) \propto P(\alpha |x)P(x) \) and uses a transformer [14] to model the distribution of natural language, thus effectively creating a conditional generative model. The following describes the principles of the transformer and PPLM. The recurrent interpretation of a transformer [15] can be summarized as Eq. 5.

$$\begin{aligned} {o_{t + 1}},{H_{t + 1}} = \mathrm{{LM}}({x_t},{H_t}) \end{aligned}$$
(5)

where \( H_t \) is the history matrix consisting of the key-value pairs from past time steps 0 to t. The next token is then sampled as \( {x_{t + 1}} \sim {P_{t + 1}} = \mathrm{{Softmax}}(T{o_{t + 1}}) \), where T is a linear transformation mapping the logit vector \( o_{t+1} \) to a vector of vocabulary size.

The probability distribution over the candidate pool at the next time step can be changed by adjusting \( H_t \) so that words more relevant to the topic receive higher probability. Let \( \varDelta {H_t} \) be the update to \( H_t \); generating with \( ({H_t} + \varDelta {H_t}) \) shifts the distribution of the generated text so that it is more likely to possess the desired attribute. \( \varDelta {H_t} \) is initialized to zero, and PPLM rewrites the attribute model \( P(\alpha |x) \) as \( P(\alpha |H_t + \varDelta {H_t}) \) and then makes gradient-based updates to \( \varDelta {H_t} \) as follows:

$$\begin{aligned} \varDelta {H_t} \leftarrow \varDelta {H_t} + \beta \frac{{{\nabla _{\varDelta {H_t}}}\log P(\alpha |{H_t} + \varDelta {H_t})}}{{{{\left\| {{\nabla _{\varDelta {H_t}}}\log P(\alpha |{H_t} + \varDelta {H_t})} \right\| }^\gamma }}} \end{aligned}$$
(6)

where \( \beta \) is the step size and \( \gamma \) is the scaling coefficient of the normalization term. This update step can be repeated m times; in practice m = 3 to 10. Subsequently, a forward pass through the LM is performed to obtain the updated logits \( {\tilde{o}_{t + 1}} \) as \( {\tilde{o}_{t + 1}},{H_{t + 1}} = \mathrm{{LM}}({x_t},{\tilde{H}_t}) \), where \( {\tilde{H}_t} = {H_t} + \varDelta {H_t} \). The modified \( {\tilde{o}_{t + 1}} \) is then used to generate the new probability distribution \( {\tilde{P}_{t + 1}} \) at time step \( t+1 \).
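To make the update of Eq. 6 concrete, the following is a minimal PyTorch sketch of the perturbation loop; it is not the authors' implementation. The history is reduced to a single tensor, and log_p_attr is a stand-in callable for \( \log P(\alpha |H_t + \varDelta H_t) \), which in PPLM would come from the attribute model (e.g. the BoW likelihood of Eq. 7).

```python
import torch

def pplm_update(H, log_p_attr, beta=0.02, gamma=1.5, steps=3):
    """Minimal sketch of the Eq. 6 perturbation: repeated normalized
    gradient ascent on log P(alpha | H + dH) with respect to dH."""
    dH = torch.zeros_like(H, requires_grad=True)
    for _ in range(steps):
        objective = log_p_attr(H + dH)
        grad, = torch.autograd.grad(objective, dH)
        with torch.no_grad():
            dH += beta * grad / (grad.norm() ** gamma + 1e-10)
    return (H + dH).detach()

# Toy attribute model whose log-likelihood peaks at H = 1 everywhere,
# standing in for the real BoW-based attribute model.
H = torch.zeros(4)
H_tilde = pplm_update(H, lambda h: -((h - 1.0) ** 2).sum(), steps=5)
print(H_tilde)  # shifted toward the attribute-preferred region
```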

3 The Proposed Scheme

3.1 Information Hiding Algorithm

The schematic diagram of the hiding phase is shown in Fig. 1, where we take \( h=2 \), \( l=1 \) as an example; h is the height of the perfect binary tree and l is the number of secret values hidden at a time. We choose the smallest prime number greater than \( 2^h \) as p. First we slice the secret bitstream into units of h bits and convert each unit into a secret value in decimal form. Then we construct a \( (k-1) \)-degree polynomial as in Eq. 3, place l secret values in \( a_0,a_1,\cdots ,a_{l-1} \), and let the remaining \( k-l \) coefficients take values in \( [0,p-1] \). The secret sharing module then substitutes \( {x_i}|_{i = 1}^n \) into the polynomial to obtain n shared values \( {s_i}|_{i = 1}^n \). The mapping module uses the language model to generate text continuously and modifies the probability distribution at each time step through the BoW corresponding to a specific topic, so that words in the candidate pool that better fit the topic receive higher probability. The candidate words are then encoded with a perfect binary tree, and the words corresponding to the shared values are selected and appended to the stego texts. All of the above processes are guided by the goal programming model (GPM).

Fig. 1. The schematic diagram of the hiding phase.

The attribute model used in this scheme is the BoW corresponding to each topic. A BoW is a set of keywords \( \{word_1, \cdots , word_z\} \) that specifies a topic. \( \log P(\alpha |x) \) can then be expressed as Eq. 7.

$$\begin{aligned} \log P(\alpha |x) = \log \left( \sum \limits _{i = 1}^z {{P_{t + 1}}[wor{d_i}]} \right) \end{aligned}$$
(7)

where \( {P_{t + 1}} \) is the conditional probability distribution output by the language model at time step \( t+1 \), and \( {P_{t + 1}}[word_i] \) is the probability it assigns to \( word_i \). We compute \( \varDelta {H_t} \) by Eq. 6 to modify \( H_t \) and finally obtain the conditional probability distribution \( {\tilde{P}_{t + 1}} \) that favors the particular topic.
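As a concrete illustration of Eq. 7, the sketch below evaluates the BoW attribute log-likelihood on a toy next-word distribution; p_next and bow_ids are hypothetical stand-ins for the LM softmax output and the vocabulary indices of the topic keywords. This scalar is what Eq. 6 differentiates with respect to \( \varDelta H_t \).

```python
import torch

def bow_log_likelihood(p_next, bow_ids):
    """log P(alpha | x) for a bag-of-words attribute model (Eq. 7): the log
    of the probability mass the next-word distribution assigns to the BoW."""
    return torch.log(p_next[bow_ids].sum())

# Toy usage over a 6-word vocabulary where indices 2 and 4 are topic words.
p_next = torch.tensor([0.30, 0.25, 0.20, 0.15, 0.05, 0.05])
print(bow_log_likelihood(p_next, [2, 4]))  # log(0.25)
```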

We propose two goal programming models (GPM-topic and GPM-ppl) to optimize the topic relevance and text quality of the generated stego texts for different applications, respectively. GPM-topic is expressed as Eq. 8.

$$\begin{aligned} \max \prod \limits _{i = 1}^n {{\tilde{P}}({w_i}\mathrm{{ }}|prefi{x_i})} \end{aligned}$$
$$\begin{aligned} s.t. \left\{ \begin{array}{l} {s_i} = ({a_0} + {a_1}{x_i} + \cdots + {a_{k - 1}}{x_i}^{k - 1})\bmod p\\ {a_i} = {m_i}|_{i = 0}^{l - 1}\\ 0 \le {a_l},{a_{l + 1}}, \cdots ,{a_{k - 1}} \le p - 1\\ 0 \le s_i \le 2^h - 1\\ w_i=M(s_i) \end{array} \right. \end{aligned}$$
(8)

where \( {\tilde{P}}({w_i}|prefi{x_i}) \) represents the conditional probability of generating the next word \( w_i \) given the prior words \( prefix_i \) of the i-th stego text, \( {\tilde{P}} \) is modified by \( BoW_i \) so that words relevant to \( topic_i \) are more probable, and \( {m_i}|_{i = 0}^{l - 1} \) are the l consecutive secret values. \(M(\cdot )\) represents the mapping module, which maps the shared value \(s_i\) to the LM-generated word space through perfect binary tree encoding. Since we place the secret values in the first l coefficients of Eq. 3, the remaining \(k-l \) coefficients can be chosen freely from \([0,p-1] \), so the shared values are not unique for a given set of secret values. We can therefore obtain different output word combinations by adjusting the last \(k - l \) coefficients of the polynomial. The goal of GPM-topic is to exploit this freedom to find the word combination with the largest product of conditional probabilities, i.e., the combination most relevant to the respective topics, in order to generate more appropriate stego texts. Since the size of the candidate pool is smaller than p and the operations of SS are all performed over GF(p), the value of \( s_i \) ranges over \( [0, p-1] \) if no control is applied, so the selected index could fall outside the candidate pool. We therefore bound \( s_i \) in the constraints of the GPM, which can likewise be satisfied by adjusting the \( k-l \) free coefficients of the polynomial.

In the mapping module, we modify the original probability distribution \( {P_{t + 1}} \) using the BoW to obtain \( {\tilde{P}_{t + 1}} \), which assigns higher probability to on-topic words. However, the language model is trained on a large amount of natural text to fit the natural language distribution, and modifying its output distribution degrades the quality of the generated text; this is the cost of strengthening topic relevance. Inspired by Eq. 2, we propose GPM-ppl to improve the quality of the stego text. The form of GPM-ppl is identical to Eq. 8, except that the modified probability \( \tilde{P} \) in the objective is replaced by the original distribution P produced by the LM. We can thus find the word combination with the largest product of original probabilities while still satisfying the constraints, keeping each word closer to the original distribution, which reduces perplexity and improves text quality. At the same time, this reduces the likelihood of selecting on-topic words, which inevitably lowers the topic relevance of the stego text. The choice of GPM should therefore be determined by the requirements of the actual application scenario.
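The paper does not spell out a search procedure for the GPM, so the following is only a brute-force sketch under the stated constraints: it enumerates the k - l free coefficients, discards assignments whose shares fall outside the candidate pool, and scores each admissible assignment by the product of per-text word probabilities. Passing BoW-modified probabilities as cond_prob corresponds to GPM-topic, while passing the original LM probabilities corresponds to GPM-ppl; cond_prob and the table in the usage example are hypothetical.

```python
import itertools

def gpm_select(secrets, k, n, p, h, cond_prob):
    """Brute-force sketch of one GPM step: enumerate the free coefficients,
    keep only assignments whose shares stay inside the 2^h candidate pool,
    and return the shares maximizing the product of word probabilities."""
    best_score, best_shares = -1.0, None
    for free in itertools.product(range(p), repeat=k - len(secrets)):
        coeffs = list(secrets) + list(free)
        shares = [sum(c * pow(x, j, p) for j, c in enumerate(coeffs)) % p
                  for x in range(1, n + 1)]
        if any(s >= 2 ** h for s in shares):   # constraint 0 <= s_i <= 2^h - 1
            continue
        score = 1.0
        for i, s in enumerate(shares):
            score *= cond_prob(i, s)           # tilde-P for GPM-topic, P for GPM-ppl
        if score > best_score:
            best_score, best_shares = score, shares
    return best_shares

# Toy usage: k=2, n=3, l=1, h=3, p=11 with a mock per-text probability table.
table = [[0.2, 0.1, 0.05, 0.3, 0.05, 0.1, 0.1, 0.1]] * 3
print(gpm_select([5], k=2, n=3, p=11, h=3,
                 cond_prob=lambda i, s: table[i][s]))
```

This exhaustive search is exponential in k - l; it is meant only to make the objective and constraints of Eq. 8 explicit, not to suggest an efficient solver.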

Algorithm details of the proposed hiding method are shown in Algorithm 1.


3.2 Information Extraction Algorithm

When k or more stego texts are obtained, the secret message can be extracted. The inverse mapping module generates the conditional probability distribution of the next word using the same text generation procedure as the hiding phase and encodes the candidate pool with a perfect binary tree. Because the stego texts are deterministic, there is no need to sample candidate words as in the hiding phase; instead, the codeword corresponding to each observed word is looked up to obtain the shared value. The reconstruction module then recovers the polynomial from the shared values using Eq. 4, and its first l coefficients are the secret values. Algorithm 2 shows the detailed extraction process. For convenience of presentation and without loss of generality, we assume that the k stego texts obtained are the first k of the n stego texts.
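A minimal sketch of the inverse mapping is given below. candidate_pool is a hypothetical helper that is assumed to rebuild the same ordered \( 2^h \) candidate pool (after the BoW modification) that the hiding phase produced for the same prefix, so the position of the observed stego word in the pool is the embedded share value. The share sequences recovered from k texts would then be passed, position by position, to a Lagrange recovery such as the recover_coeffs sketch in Sect. 2.2, and the first l coefficients of each recovered polynomial are concatenated back into the secret bitstream.

```python
def extract_shares(stego_words, prefix, candidate_pool, h=3):
    """Recover the share values embedded in one stego text.

    candidate_pool(prefix) is assumed to rebuild the same ordered 2^h
    candidate pool used during hiding, so the position of the observed
    word in the pool is its h-bit share value."""
    shares = []
    for word in stego_words:
        pool = candidate_pool(prefix)
        shares.append(pool.index(word))  # leaf index = shared value s_i
        prefix = prefix + [word]         # extend the context exactly as when hiding
    return shares

# Toy usage with a mock candidate pool that ignores the prefix.
pool8 = ["economy", "market", "growth", "trade", "policy", "bank", "tax", "debt"]
print(extract_shares(["trade", "bank"], [], lambda prefix: pool8))  # -> [3, 5]
```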

4 Experiments and Ablation Study

4.1 Experimental Setup

We evaluate the performance of the proposed scheme on the public corpus “A Million News Headlines”, which contains 1,226,259 news headlines published by the Australian news source ABC (Australian Broadcasting Corporation) over an eighteen-year period. We randomly select 100 sentences from the dataset for our experiments. We use the 345M-parameter GPT-2 model [11], based on the transformer architecture, as the text generation model.

To evaluate the quality of the stego text we use the perplexity of Eq. 2. For topic relevance there is no established evaluation metric. Since topic control is achieved by letting the BoW adjust the conditional probability distribution, we use the percentage of words in the stego text that belong to \(BoW_i\) to evaluate the topic relevance (TR) with respect to \(topic_i\), as shown in Eq. 9.

$$\begin{aligned} T{R_i} = \frac{{{N_{BO{W_i}}}}}{N} \times 100\% \end{aligned}$$
(9)

where \( TR_i \) represents the topic relevance of \( ST_i \) to \( topic_i \), N is the number of words in \( ST_i \), and \( {N_{BoW_i}} \) is the number of words in \( ST_i \) that appear in \( BoW_i \).
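A direct implementation of Eq. 9 is straightforward; the sketch below uses simple whitespace tokenization and a hypothetical BoW, which may differ from the tokenization used in the experiments.

```python
def topic_relevance(stego_text, bow):
    """Topic relevance of Eq. 9: percentage of words in the stego text
    that appear in the topic's bag of words."""
    words = stego_text.lower().split()
    hits = sum(1 for w in words if w in bow)
    return 100.0 * hits / len(words)

# Toy usage with a hypothetical politics BoW.
bow_politics = {"election", "senate", "vote", "policy"}
print(topic_relevance("Senate delays vote on new policy deal", bow_politics))
# 3 of the 7 words are BoW keywords -> about 42.9
```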

4.2 Effectiveness Demonstration

The hyperparameters of the proposed scheme include the (k, n) threshold, the prime number p, the number of secret values hidden at a time l, the topic of each stego text \( topic_i \), the height of the perfect binary tree h, and the initial words of each stego text \( prefix_i \). Below we show the actual output of the proposed scheme under different parameter settings, as shown in Tables 1 and 2. We choose “Secret message” as the secret text. In the tables, the target topic of each stego text is given in brackets, words that appear in the BoW are highlighted brightly, words related to the topic but not in the BoW are highlighted more softly, and the prefix of each sentence is underlined (e.g., More importantly).

Table 1. Stego texts of “Secret message” when \( k=2 \), \( n=3 \), \( l=1 \), \( h=3 \), \( p=11 \).
Table 2. Stego texts of “Secret message” when \( k=3 \), \( n=4 \), \( l=2 \), \( h=3 \), \( p=11 \).

4.3 Ablation Study

We conduct an ablation study with five variants: B: the baseline, no topic control, no GPM (that is, the conditional probability distribution is not modified using BoW, and \( {a_i}|_{i = l}^{k - 1} \) are chosen randomly); BP: no topic control, GPM-ppl; BT: topic control, no GPM; BTP: topic control, GPM-ppl; BTT: topic control, GPM-topic.

We use the 100 sentences selected in Sect. 4.1 as the secret texts, hide them with each of the five methods above, and report the average perplexity and topic relevance of the resulting stego texts. The experimental results are shown in Tables 3 and 4.

Table 3. Average ppl and TR of stego texts when \( k=2 \), \( n=3 \), \( l=1 \), \( h=3 \), \( p=11 \)
Table 4. Average ppl and TR of stego texts when \( k=3 \), \( n=4 \), \( l=2 \), \( h=3 \), \( p=11 \)

From the above experimental results we can draw the following conclusions.

  • In this scheme, the topic control method effectively increases the probability that topic-matching words are selected during stego text generation, so that the stego text conforms to the specified topic.

  • Text quality suffers because the topic control method modifies the probability distribution during text generation, making the modified distribution inconsistent with the training data. Consequently, the BT method, which lacks GPM optimization, has the worst text quality.

  • The BP method optimized by GPM-ppl generates the highest-quality stego text, and the perplexity of the GPM-ppl-optimized BTP method is lower than that of BT and BTT, so GPM-ppl effectively improves the quality of stego text.

  • The topic relevance of the BTT method optimized by GPM-topic is the highest, so GPM-topic can effectively improve the topic relevance of stego text.

5 Conclusions

In this paper, we propose a text steganography scheme with loss tolerance, robustness and imperceptibility, which hides the secret message in n fluent, topic-controlled stego texts such that any k or more of them can recover the secret message. We first use secret sharing to encrypt the secret message into shared values. We then use the bag-of-words model to modify the conditional probability distribution so that words fitting the topic receive higher probability. Finally, a perfect binary tree maps the shared values to the word space to generate the stego texts. We also propose two goal programming models to optimize the topic relevance and the text quality of the stego texts, respectively. In the experimental section, we show practical examples and perform ablation experiments to demonstrate the effectiveness of each module.